# SparkSession

- The `SparkSession` class, defined in the `pyspark.sql` package, is the entry point to programming Spark with the Dataset and DataFrame APIs.
- To do anything useful with a Spark cluster, you first need to create an instance of this class, which also gives you access to an instance of `SparkContext`.

# SparkContext

- The `SparkContext` class, defined in the `pyspark` package, is the main entry point for Spark functionality.
- A `SparkContext` holds the connection to the Spark cluster and can be used to create RDDs and broadcast variables on that cluster.
- When you create an instance of `SparkSession`, its `SparkContext` is available as the `sparkContext` attribute of the session (for example, `spark.sparkContext`).

### On to the code

# Creating a Spark DataFrame 

A Spark DataFrame can be created in several ways, all of them offered by `SparkSession`:
- `read`
- `sql()`
- `table()`
- `range()`
- `createDataFrame()`

### `range` Method

In [1]:
# Create a single-column DataFrame
df1 = spark.range(10)
df1.printSchema() 
df1.show()

root
 |-- id: long (nullable = false)




+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
+---+

### `createDataFrame` Method 

In [2]:
from datetime import date

my_list = [(1, "Suman", date(1985, 1, 1), "Bangalore", 100.2),
           (2, "Kumar", date(1986, 2, 2), "Singapore", 123.4),
           (3, "Mike", date(1990, 10, 1), "Sydney", 110.3)]

In [3]:
my_df = spark.createDataFrame(my_list)
my_df.show()

+---+-----+----------+---------+-----+
| _1|   _2|        _3|       _4|   _5|
+---+-----+----------+---------+-----+
|  1|Suman|1985-01-01|Bangalore|100.2|
|  2|Kumar|1986-02-02|Singapore|123.4|
|  3| Mike|1990-10-01|   Sydney|110.3|
+---+-----+----------+---------+-----+





In [4]:
my_df.printSchema()

root
 |-- _1: long (nullable = true)
 |-- _2: string (nullable = true)
 |-- _3: date (nullable = true)
 |-- _4: string (nullable = true)
 |-- _5: double (nullable = true)



#### How can we change the column names?

In [5]:
# toDF() is a transformation: it returns a new DataFrame with the given column names
my_df = spark.createDataFrame(my_list).toDF('id', 'name', 'DoB', 'city', 'blood_sugar')
my_df.show()

+---+-----+----------+---------+-----------+
| id| name|       DoB|     city|blood_sugar|
+---+-----+----------+---------+-----------+
|  1|Suman|1985-01-01|Bangalore|      100.2|
|  2|Kumar|1986-02-02|Singapore|      123.4|
|  3| Mike|1990-10-01|   Sydney|      110.3|
+---+-----+----------+---------+-----------+



In [6]:
my_df.printSchema()

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- DoB: date (nullable = true)
 |-- city: string (nullable = true)
 |-- blood_sugar: double (nullable = true)



#### How else can we define the schema?

In [7]:
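# A list of column names: the types are still inferred from the data
schema_1 = ['id', 'name', 'DoB', 'city', 'blood_sugar']
# A DDL-formatted string: column names and types are both given explicitly
schema_2 = 'id int, name string, DoB date, city string, blood_sugar double'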


In [8]:
spark.createDataFrame(my_list, schema_1).printSchema()

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- DoB: date (nullable = true)
 |-- city: string (nullable = true)
 |-- blood_sugar: double (nullable = true)



In [9]:
spark.createDataFrame(my_list, schema_2).printSchema()

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- DoB: date (nullable = true)
 |-- city: string (nullable = true)
 |-- blood_sugar: double (nullable = true)



In [10]:
import pandas as pd 

pd_df = pd.DataFrame({'id': [1, 2, 3],
                      'name': ['Suman', 'Kumar', 'Mike'],
                      'DoB': [date(1985, 1, 1), date(1986, 2, 2), date(1990, 10, 1)],
                      'city': ['Bangalore', 'Singapore', 'Sydney'],
                      'blood_sugar': [140.1, 123.4, 110.3]})

In [11]:
df = spark.createDataFrame(pd_df, schema_2)
df.show()

+---+-----+----------+---------+-----------+
| id| name|       DoB|     city|blood_sugar|
+---+-----+----------+---------+-----------+
|  1|Suman|1985-01-01|Bangalore|      140.1|
|  2|Kumar|1986-02-02|Singapore|      123.4|
|  3| Mike|1990-10-01|   Sydney|      110.3|
+---+-----+----------+---------+-----------+



In [12]:
df.printSchema()

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- DoB: date (nullable = true)
 |-- city: string (nullable = true)
 |-- blood_sugar: double (nullable = true)

