# Introduction to PySpark

In practice, the cluster is hosted on a remote machine that is connected to all the other nodes. One computer, called the master, manages splitting up the data and the computations. The master is connected to the rest of the computers in the cluster, which are called workers. The master sends the workers data and calculations to run, and they send their results back to the master.
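The master/worker pattern described above can be illustrated without Spark at all. The following is only a minimal sketch in plain Python, assuming a thread pool as a stand-in for the worker machines; the names split_into_chunks, worker_sum and master_sum are hypothetical and are not part of PySpark.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, n_chunks):
    """Split a list into n_chunks contiguous pieces of nearly equal size."""
    size, extra = divmod(len(data), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def worker_sum(chunk):
    """The calculation each (hypothetical) worker runs on its share of the data."""
    return sum(chunk)


def master_sum(data, n_workers=4):
    """The 'master': split the data, hand chunks to workers, combine the results."""
    chunks = split_into_chunks(data, n_workers)
    # Threads stand in for worker machines here; this is an assumption
    # made for illustration, not how Spark distributes work.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(worker_sum, chunks))
    return sum(partial_results)


total = master_sum(list(range(1, 101)))
print(total)  # 5050
```

A real cluster adds scheduling, data transfer between machines and fault tolerance, none of which this sketch attempts.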

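Creating the connection is as simple as creating an instance of the SparkContext class. The class constructor takes a few optional arguments that allow you to specify the attributes of the cluster you're connecting to. The cells in this notebook use a SparkSession instead, which is the higher-level entry point for working with DataFrames and wraps a SparkContext.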


### Creating a SparkSession

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print(spark)
print(spark.version)

<pyspark.sql.session.SparkSession object at 0x7f77be0155d0>
3.3.0


### Creating a table

In [3]:
df = spark.read.csv("flights_small.csv", header=True, inferSchema=True)
df.write.saveAsTable("flights")

### Viewing tables

In [4]:
print(spark.catalog.listTables())

[Table(name='flights', database='default', description=None, tableType='MANAGED', isTemporary=False)]


### Running a query

In [5]:
query = "SELECT * FROM flights LIMIT 10"
flights10 = spark.sql(query)
flights10.show()

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+
|2014|   12|  8|     658|       -7|     935|       -5|     VX| N846VA|  1780|   SEA| LAX|     132|     954|   6|    58|
|2014|    1| 22|    1040|        5|    1505|        5|     AS| N559AS|   851|   SEA| HNL|     360|    2677|  10|    40|
|2014|    3|  9|    1443|       -2|    1652|        2|     VX| N847VA|   755|   SEA| SFO|     111|     679|  14|    43|
|2014|    4|  9|    1705|       45|    1839|       34|     WN| N360SW|   344|   PDX| SJC|      83|     569|  17|     5|
|2014|    3|  9|     754|       -1|    1015|        1|     AS| N612AS|   522|   SEA| BUR|     127|     937|   7|    54|
|2014|    1| 15|    1037|        7|    1

### Pandafy a Spark DataFrame

In [8]:
query = "SELECT origin, dest, COUNT(*) as N FROM flights GROUP BY origin, dest"


flight_counts = spark.sql(query)
pd_counts = flight_counts.toPandas()

print(pd_counts.head())

  origin dest    N
0    SEA  RNO    8
1    SEA  DTW   98
2    SEA  CLE    2
3    SEA  LAX  450
4    PDX  SEA  144


In [7]:
pd_counts.to_csv('flight_counts.csv')

### Creating columns

In [10]:
flights = spark.table('flights')
flights = flights.withColumn('duration_hrs', flights.air_time / 60)
flights.show()

+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+------------------+
|year|month|day|dep_time|dep_delay|arr_time|arr_delay|carrier|tailnum|flight|origin|dest|air_time|distance|hour|minute|      duration_hrs|
+----+-----+---+--------+---------+--------+---------+-------+-------+------+------+----+--------+--------+----+------+------------------+
|2014|   12|  8|     658|       -7|     935|       -5|     VX| N846VA|  1780|   SEA| LAX|     132|     954|   6|    58|               2.2|
|2014|    1| 22|    1040|        5|    1505|        5|     AS| N559AS|   851|   SEA| HNL|     360|    2677|  10|    40|               6.0|
|2014|    3|  9|    1443|       -2|    1652|        2|     VX| N847VA|   755|   SEA| SFO|     111|     679|  14|    43|              1.85|
|2014|    4|  9|    1705|       45|    1839|       34|     WN| N360SW|   344|   PDX| SJC|      83|     569|  17|     5|1.3833333333333333|
|2014|    3|  9|     754|  

In [11]:
# A column can also be added with a SQL query
query = "SELECT origin, dest, air_time/60 AS duration_hrs FROM flights"

flight_at = spark.sql(query)
flight_at.show()

+------+----+------------------+
|origin|dest|      duration_hrs|
+------+----+------------------+
|   SEA| LAX|               2.2|
|   SEA| HNL|               6.0|
|   SEA| SFO|              1.85|
|   PDX| SJC|1.3833333333333333|
|   SEA| BUR|2.1166666666666667|
|   PDX| DEN|2.0166666666666666|
|   PDX| OAK|               1.5|
|   SEA| SFO|1.6333333333333333|
|   SEA| SAN|              2.25|
|   SEA| ORD|               3.3|
|   SEA| LAX|2.1666666666666665|
|   SEA| PHX| 2.566666666666667|
|   SEA| LAS|2.1166666666666667|
|   SEA| ANC|              3.05|
|   SEA| SFO|              2.15|
|   PDX| SFO|               1.5|
|   SEA| SMF|1.2666666666666666|
|   SEA| MDW|               3.6|
|   SEA| BOS| 4.833333333333333|
|   PDX| BUR|              1.85|
+------+----+------------------+
only showing top 20 rows
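The duration_hrs values shown in the output are simply air_time (in minutes) divided by 60. As a quick sanity check, the same arithmetic can be reproduced in plain Python. This is a sketch only: the helper minutes_to_hours is a hypothetical name, and the air_time values are copied from the first four rows of the flights table shown earlier.

```python
def minutes_to_hours(minutes):
    """Convert a duration in minutes to hours, mirroring air_time / 60."""
    return minutes / 60


# air_time values from the first four rows of the flights table
air_times = [132, 360, 111, 83]
durations = [minutes_to_hours(m) for m in air_times]

for minutes, hours in zip(air_times, durations):
    print(minutes, hours)
```

The printed hours should line up with the duration_hrs column for the corresponding rows.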

