In [1]:
import pyspark

In [3]:
from pyspark.sql import SparkSession

To use PySpark, we first need to start a `SparkSession`:

In [4]:
spark = SparkSession.builder \
    .master("local[*]") \
    .appName('test') \
    .getOrCreate()

We will load the Taxi Zone Lookup file as an example. Commands such as `df.show()`, `df.head(n)`, and `df.tail(n)` preview a few rows of the data.

In [56]:
df = spark.read \
    .option("header", "true") \
    .csv('../module_1_docker_terraform/taxi_zone_lookup.csv')

In [57]:
df.show()

+----------+-------------+--------------------+------------+
|LocationID|      Borough|                Zone|service_zone|
+----------+-------------+--------------------+------------+
|         1|          EWR|      Newark Airport|         EWR|
|         2|       Queens|         Jamaica Bay|   Boro Zone|
|         3|        Bronx|Allerton/Pelham G...|   Boro Zone|
|         4|    Manhattan|       Alphabet City| Yellow Zone|
|         5|Staten Island|       Arden Heights|   Boro Zone|
|         6|Staten Island|Arrochar/Fort Wad...|   Boro Zone|
|         7|       Queens|             Astoria|   Boro Zone|
|         8|       Queens|        Astoria Park|   Boro Zone|
|         9|       Queens|          Auburndale|   Boro Zone|
|        10|       Queens|        Baisley Park|   Boro Zone|
|        11|     Brooklyn|          Bath Beach|   Boro Zone|
|        12|    Manhattan|        Battery Park| Yellow Zone|
|        13|    Manhattan|   Battery Park City| Yellow Zone|
|        14|     Brookly

In [58]:
df.tail(5)

[Row(LocationID='261', Borough='Manhattan', Zone='World Trade Center', service_zone='Yellow Zone'),
 Row(LocationID='262', Borough='Manhattan', Zone='Yorkville East', service_zone='Yellow Zone'),
 Row(LocationID='263', Borough='Manhattan', Zone='Yorkville West', service_zone='Yellow Zone'),
 Row(LocationID='264', Borough='Unknown', Zone='N/A', service_zone='N/A'),
 Row(LocationID='265', Borough='N/A', Zone='Outside of NYC', service_zone='N/A')]

In [63]:
df.schema

StructType([StructField('LocationID', StringType(), True), StructField('Borough', StringType(), True), StructField('Zone', StringType(), True), StructField('service_zone', StringType(), True)])

Checking the schema, we can see that every column has been read in as a string by default. To prevent this (e.g. for `LocationID`), we can specify a schema explicitly:

In [64]:
from pyspark.sql import types

schema = types.StructType([
    types.StructField('LocationID', types.IntegerType(), True),
    types.StructField('Borough', types.StringType(), True),
    types.StructField('Zone', types.StringType(), True),
    types.StructField('service_zone', types.StringType(), True),
])

df = spark.read \
    .option("header", "true") \
    .schema(schema) \
    .csv('./taxi_zone_lookup.csv')

In [65]:
df.schema

StructType([StructField('LocationID', IntegerType(), True), StructField('Borough', StringType(), True), StructField('Zone', StringType(), True), StructField('service_zone', StringType(), True)])

Next, we look at a simple query to demonstrate lazy execution in Spark.

Spark executes **transformations** lazily: it records the sequence of transformations but does not run them immediately. They run only when an **action** is performed, which is why actions are called **eager**.

For example, `select` and `filter` are lazy, so the cell below only returns a DataFrame description:

In [52]:
df.select('LocationId', 'Borough', 'Zone', 'service_zone') \
    .filter(df.Borough == 'Manhattan')

DataFrame[LocationId: int, Borough: string, Zone: string, service_zone: string]

whereas `show` is an action, so it runs the query and prints the result:

In [53]:
df.select('LocationId', 'Borough', 'Zone', 'service_zone') \
    .filter(df.Borough == 'Manhattan') \
    .show()

+----------+---------+--------------------+------------+
|LocationId|  Borough|                Zone|service_zone|
+----------+---------+--------------------+------------+
|         4|Manhattan|       Alphabet City| Yellow Zone|
|        12|Manhattan|        Battery Park| Yellow Zone|
|        13|Manhattan|   Battery Park City| Yellow Zone|
|        24|Manhattan|        Bloomingdale| Yellow Zone|
|        41|Manhattan|      Central Harlem|   Boro Zone|
|        42|Manhattan|Central Harlem North|   Boro Zone|
|        43|Manhattan|        Central Park| Yellow Zone|
|        45|Manhattan|           Chinatown| Yellow Zone|
|        48|Manhattan|        Clinton East| Yellow Zone|
|        50|Manhattan|        Clinton West| Yellow Zone|
|        68|Manhattan|        East Chelsea| Yellow Zone|
|        74|Manhattan|   East Harlem North|   Boro Zone|
|        75|Manhattan|   East Harlem South|   Boro Zone|
|        79|Manhattan|        East Village| Yellow Zone|
|        87|Manhattan|Financial

In [7]:
spark.stop()