The first step in using Spark is connecting to a `cluster`. In practice, the cluster is hosted on a remote machine that's connected to all the other nodes. To create a connection, we create an instance of the `SparkContext` class. Its constructor takes a few optional arguments that specify the attributes of the cluster. An object holding these attributes can be created with the `SparkConf` class.

### Examining the Spark Context

In some environments a `SparkContext` object `sc` is already loaded into the workspace; here we create it explicitly.

In [1]:
from pyspark import SparkConf, SparkContext

configure = SparkConf().setAppName("example-app").setMaster("local")
sc = SparkContext(conf=configure)

In [2]:
print(sc)
print(sc.version)

<SparkContext master=local appName=example-app>
2.4.0


In [3]:
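# textFile() reads the file's lines into an RDD; parallelize() on a string would only distribute its characters
flights = sc.textFile('flights_small.csv')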

Spark's core data structure is the `Resilient Distributed Dataset` (RDD). This is a low-level object that lets Spark work its magic by splitting data across multiple nodes in the cluster. The Spark DataFrame is a higher-level abstraction built on top of RDDs, and it behaves a lot like a SQL table. To start working with Spark DataFrames, we have to create a `SparkSession` object from the `SparkContext` object.

`SparkContext` is like the connection to the cluster, and `SparkSession` is like an interface to that connection.

### Creating a SparkSession

In [4]:
from pyspark.sql import SparkSession

my_spark = SparkSession.builder.getOrCreate()
print(my_spark)

<pyspark.sql.session.SparkSession object at 0x0000019A0FB315C0>


In [5]:
my_spark.sparkContext.getConf().getAll()

[('spark.master', 'local'),
 ('spark.driver.port', '56235'),
 ('spark.rdd.compress', 'True'),
 ('spark.serializer.objectStreamReset', '100'),
 ('spark.executor.id', 'driver'),
 ('spark.submit.deployMode', 'client'),
 ('spark.driver.host', 'DESKTOP-K2G4QJD'),
 ('spark.app.id', 'local-1554275275590'),
 ('spark.ui.showConsoleProgress', 'true'),
 ('spark.app.name', 'example-app')]

### Loading Data

In [7]:
flights = my_spark.read.csv("flights_small.csv", header=True)

In [8]:
flights.show()

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|2014|   12|  8|     658|       -7|     935|       -5|     VX| N846VA|  1780|   SEA| LAX|     132|     954|   6|    58|
|2014|    1| 22|    1040|        5|    1505|        5|     AS| N559AS|   851|   SEA| HNL|     360|    2677|  10|    40|
|2014|    3|  9|    1443|       -2|    1652|        2|     VX| N847VA|   755|   SEA| SFO|     111|     679|  14|    43|
|2014|    4|  9|    1705|       45|    1839|       34|     WN| N360SW|   344|   PDX| SJC|      83|     569|  17|     5|
|2014|    3|  9|     754|       -1|    1015|        1|     AS| N612AS|   522|   SEA| BUR|     127|     937|   7|    54|
|2014|    1| 15|    1037|        7|    1

### Converting Spark DataFrame to Pandas DataFrame

In [9]:
flights_pd = flights.toPandas()

In [10]:
flights_pd.head()

   year  month  day  dep_time  dep_delay  arr_time  arr_delay  carrier  tailnum  flight  origin  dest  air_time  distance  hour  minute
0  2014     12    8       658         -7       935         -5       VX   N846VA    1780     SEA   LAX       132       954     6      58
1  2014      1   22      1040          5      1505          5       AS   N559AS     851     SEA   HNL       360      2677    10      40
2  2014      3    9      1443         -2      1652          2       VX   N847VA     755     SEA   SFO       111       679    14      43
3  2014      4    9      1705         45      1839         34       WN   N360SW     344     PDX   SJC        83       569    17       5
4  2014      3    9       754         -1      1015          1       AS   N612AS     522     SEA   BUR       127       937     7      54
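Because the CSV was read without type inference, the columns arrive in pandas as strings rather than numbers. Here is a small sketch of converting selected columns on the pandas side. The two-row frame is a hand-built stand-in (an assumption) for `flights_pd`.

```python
import pandas as pd

# Stand-in for flights_pd (assumption): string-valued columns, as toPandas()
# returns for a Spark DataFrame whose columns are all strings.
flights_pd = pd.DataFrame(
    {"carrier": ["VX", "AS"], "dep_delay": ["-7", "5"], "distance": ["954", "2677"]}
)

numeric_cols = ["dep_delay", "distance"]
# pd.to_numeric parses each string column into a numeric column.
flights_pd[numeric_cols] = flights_pd[numeric_cols].apply(pd.to_numeric)

print(flights_pd.dtypes)
print(flights_pd["distance"].mean())
```

After the conversion, numeric operations such as `mean()` work on these columns.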


### Viewing tables

In [None]:
my_spark.catalog.listTables()