<a href="https://colab.research.google.com/github/akash865/spark_101/blob/master/Spark101_CreateDataframe.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Spark 101 - Getting started with spark
<hr size="5"/>

> Notebook containing basic spark functions to get started with data analysis
- Author: Akash Chandra
- Comments: true
- Categories: [Python, PySpark, Pandas]
- Spark version: 3.0.3

## **1. Running Spark in Colab**

Running Spark code in Colab requires installing a few dependencies first. Follow the cells below to get started.

### 1.1 Initialize Spark

The cells below initialize Spark. They install Apache Spark 3.0.3, Java 8, and Findspark, a library that makes it easy for Python to find Spark. Make sure the version in the download URL and in `SPARK_HOME` matches the release you want; this notebook uses 3.0.3. If the download location changes, the install cell will fail, so check the link below for the currently available versions. [Link to library](https://downloads.apache.org/spark/)

> Version: spark-3.0.3
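
Because the archive name and the unpacked directory both embed the version number, one way to avoid editing several strings by hand is to derive them from two variables. The following is a minimal sketch: the names `SPARK_VERSION`, `HADOOP_VERSION`, `BASE_URL` and `spark_paths` are illustrative and not part of the original notebook, and the base URL is simply the one from the link above, which may not host every release.

```python
SPARK_VERSION = "3.0.3"   # version used throughout this notebook
HADOOP_VERSION = "2.7"    # Hadoop build bundled with the archive
BASE_URL = "https://downloads.apache.org/spark"  # assumption: taken from the link above

def spark_paths(spark_version, hadoop_version, base_url=BASE_URL, root="/content"):
    """Return (download_url, spark_home) for a given Spark/Hadoop build."""
    name = f"spark-{spark_version}-bin-hadoop{hadoop_version}"
    url = f"{base_url}/spark-{spark_version}/{name}.tgz"
    home = f"{root}/{name}"   # where `tar xf` unpacks the archive in Colab
    return url, home

url, home = spark_paths(SPARK_VERSION, HADOOP_VERSION)
print(url)
print(home)
```

The second value is the path that `SPARK_HOME` should point to after the archive is unpacked.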


In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-3.0.3/spark-3.0.3-bin-hadoop2.7.tgz
!tar xf spark-3.0.3-bin-hadoop2.7.tgz
!pip install -q findspark

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.0.3-bin-hadoop2.7"
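
If the download or unpack step failed silently, the error usually surfaces later, when Spark is initialized. A small check like the one below can catch it earlier. This is only a sketch, and `missing_dirs` is a hypothetical helper, not part of Findspark or Spark.

```python
import os

def missing_dirs(var_names):
    """Return the variables in var_names that are unset or point to a missing directory."""
    missing = []
    for name in var_names:
        path = os.environ.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing

problems = missing_dirs(["JAVA_HOME", "SPARK_HOME"])
if problems:
    print("Check these variables before calling findspark.init():", problems)
else:
    print("JAVA_HOME and SPARK_HOME both point to existing directories.")
```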

In [None]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
# sqlContext = SparkSession.builder.master("local[*]").getOrCreate()

In [None]:
# Check if spark is initialized
df = spark.sql('''select 'spark' as hello''')
df.show()

+-----+
|hello|
+-----+
|spark|
+-----+



## **2. Ways to create dataframe**

If the cells above ran without errors, you are good to proceed. Let's look at several ways to create a dataframe in Spark, along with some basic data manipulation. We start by importing useful libraries, then move on to some useful dataframe methods.
<br>
<br>
Note that `pyspark.sql.functions` provides many functions that are helpful for data analysis and descriptive analytics.

In [None]:
# Import spark libraries
from pyspark.sql import Row, DataFrame
from pyspark.sql.types import StringType, StructType, StructField, IntegerType
from pyspark.sql.functions import col, expr, lit, substring, concat, concat_ws, when, coalesce
from pyspark.sql import functions as F # for more sql functions
from functools import reduce

### 2.1 View Spark dataframe

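Use the `show` method to print a dataframe as a table, or `collect` to return its rows to the driver as a list of `Row` objects.


```
df.show()     # print the rows as a formatted table
# OR
df.collect()  # return all rows as a list of Row objects
```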



### 2.2 Create Dataframe from list

In [None]:
df = spark.createDataFrame(
    [
      ["2015-06-23", 5],
      ["2016-07-20", 7]
    ], # Data rows
    ["data_date", "months_to_add"] # Column names
)

df.show()

+----------+-------------+
| data_date|months_to_add|
+----------+-------------+
|2015-06-23|            5|
|2016-07-20|            7|
+----------+-------------+



### 2.3 Create Dataframe from RDD

In [None]:
# create RDD to load into spark dataframe
l =  [["2015-06-23", 5]
      ,["2016-07-20", 7]] #List with data elements
rdd1 = spark.sparkContext.parallelize(l)


row_rdd = rdd1.map(lambda x: Row(x[0], x[1]))
df = spark.createDataFrame(row_rdd, ['data_date', 'months_to_add'])

df.show()

+----------+-------------+
| data_date|months_to_add|
+----------+-------------+
|2015-06-23|            5|
|2016-07-20|            7|
+----------+-------------+



### 2.4 Create Dataframe from RDD and schema

The third argument of `StructField`, `nullable`, controls whether the column may contain null values. If it is set to `False`, creating the dataframe raises an error when that column contains a missing value (`None`).

In [None]:
# create RDD to load into spark dataframe
l =  [["2015-06-23", 5]
      ,["2016-07-20", 7]] #List with data elements
rdd1 = spark.sparkContext.parallelize(l)


schema = StructType([
    StructField("data_date", StringType(), True),
    StructField("months_to_add", IntegerType(), True)
    ]) # Col, Type, Nullable


df = spark.createDataFrame(rdd1, schema)
df.show()


+----------+-------------+
| data_date|months_to_add|
+----------+-------------+
|2015-06-23|            5|
|2016-07-20|            7|
+----------+-------------+



### 2.5 Create Dataframe from lists

In [None]:
# Building a simple dataframe:
schema = StructType([
    StructField("data_date", StringType(), True),
    StructField("months_to_add", IntegerType(), True)
    ]) # Col, Type, Nullable


column1 = ["2015-06-23", "2016-07-20"]
column2 = [5, 7]

# Dataframe:
df = spark.createDataFrame(list(zip(column1, column2)), schema=schema)
df.show()

+----------+-------------+
| data_date|months_to_add|
+----------+-------------+
|2015-06-23|            5|
|2016-07-20|            7|
+----------+-------------+



### 2.6 Create Dataframe from Pandas dataframe

In [None]:
import pandas as pd
# df = spark.createDataFrame(pandas_df) # pandas_df: an existing pandas dataframe

l =  [["2015-06-23", 5]
      ,["2016-07-20", 7]] #List with data elements
    
df = spark.createDataFrame(pd.DataFrame(l),['data_date','months_to_add'])
df.show()

+----------+-------------+
| data_date|months_to_add|
+----------+-------------+
|2015-06-23|            5|
|2016-07-20|            7|
+----------+-------------+



### 2.7 Create Dataframe from hive table

This approach is useful when the data is stored in Hive tables, which provide SQL-like access to data in HDFS.

```
input_table = "<db_name>.<table_name>"
df = spark.sql('''select data_date, months_to_add from {0}'''.format(input_table))
```

### 2.8 Create Dataframe from CSV or other text file

I am using the FDIC failed bank list, which is freely available from data.gov. You may download it from the link below or use any text file you have. Data link: https://catalog.data.gov/dataset/fdic-failed-bank-list.
<br>
<br>
Note the options passed to the reader. `header` set to true treats the first line of the file as column names. `inferschema` asks Spark to guess each column's data type from the data, which is convenient but requires an extra pass over the file. `delimiter` can be changed to a tab (`\t`) or a space, depending on the input file.
<br>
<br>
I have also uploaded the same file to GitHub. Feel free to use the link directly.

In [None]:
# Importing offline files in colab
from google.colab import files
files.upload()

Saving banklist.csv to banklist.csv


{'banklist.csv': b'Bank Name,City,ST,CERT,Acquiring Institution,Closing Date\r\nThe First State Bank,Barboursville,WV,14361,"MVB Bank, Inc.",3-Apr-20\r\nEricson State Bank,Ericson,NE,18265,Farmers and Merchants Bank,14-Feb-20\r\nCity National Bank of New Jersey,Newark,NJ,21111,Industrial Bank,1-Nov-19\r\nResolute Bank,Maumee,OH,58317,Buckeye State Bank,25-Oct-19\r\nLouisa Community Bank,Louisa,KY,58112,Kentucky Farmers Bank Corporation,25-Oct-19\r\nThe Enloe State Bank,Cooper,TX,10716,"Legend Bank, N. A.",31-May-19\r\nWashington Federal Bank for Savings,Chicago,IL,30570,Royal Savings Bank,15-Dec-17\r\nThe Farmers and Merchants State Bank of Argonia,Argonia,KS,17719,Conway Bank,13-Oct-17\r\nFayette County Bank,Saint Elmo,IL,1802,"United Fidelity Bank, fsb",26-May-17\r\n"Guaranty Bank, (d/b/a BestBank in Georgia & Michigan)",Milwaukee,WI,30003,First-Citizens Bank & Trust Company,5-May-17\r\nFirst NBC Bank,New Orleans,LA,58302,Whitney Bank,28-Apr-17\r\nProficio Bank,Cottonwood Heights,UT,

In [None]:
# list all files in the current working directory
! ls

banklist.csv  spark-3.0.3-bin-hadoop2.7
sample_data   spark-3.0.3-bin-hadoop2.7.tgz


In [None]:
# inferschema asks Spark to pick the closest data type for each column from the data
# header reads the first line as column names; without it, columns are named _c0, _c1, ...

df = spark.read.options(header="true", inferschema = "true", delimiter=",").csv('banklist.csv')

print('df.count  :', df.count())
print('df.col ct :', len(df.columns))
print('df.columns:', df.columns)

df.count  : 561
df.col ct : 6
df.columns: ['Bank Name', 'City', 'ST', 'CERT', 'Acquiring Institution', 'Closing Date']


In [None]:
# Using file uploaded to github
from pyspark import SparkFiles

url = "https://raw.githubusercontent.com/akash865/spark_101/master/banklist.csv"
spark.sparkContext.addFile(url)

df = spark.read.options(header="true", inferschema = "true", delimiter=",").csv("file://"+SparkFiles.get("banklist.csv"))

print('df.count  :', df.count())
print('df.col ct :', len(df.columns))
print('df.columns:', df.columns)

df.count  : 561
df.col ct : 6
df.columns: ['Bank Name', 'City', 'ST', 'CERT', 'Acquiring Institution', 'Closing Date']
