# PySpark Demo Notebook
## Demo
1. Setup Spark
2. Load Kaggle Data
3. Transform Data with Spark SQL

_Prepared by: [Gary A. Stafford](https://twitter.com/GaryStafford)   
Associated article: https://wp.me/p1RD28-61V_

### Setup Spark
Import the required classes and create the SparkSession

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

Starting Spark application


| ID | YARN Application ID | Kind | State | Spark UI | Driver log | Current session? |
|---|---|---|---|---|---|---|
| 2 | application_1569711246004_0003 | pyspark | idle | Link | Link | ✔ |


SparkSession available as 'spark'.


In [2]:
spark = SparkSession \
    .builder \
    .getOrCreate()

### Load Kaggle Data
Load the Kaggle bakery dataset, a CSV file of roughly 21K records, into a DataFrame using an explicit schema

In [3]:
bakery_schema = StructType([
    StructField('date', StringType(), True),
    StructField('time', StringType(), True),
    StructField('transaction', IntegerType(), True),
    StructField('item', StringType(), True)
])

df_bakery1 = spark.read \
    .format("csv") \
    .option("header", "true") \
    .load("s3://gstafford-aws-emr-notebooks/files/BreadBasket_DMS.csv", schema=bakery_schema)

df_bakery1.show(10)
df_bakery1.count()

+----------+--------+-----------+-------------+
|      date|    time|transaction|         item|
+----------+--------+-----------+-------------+
|2016-10-30|09:58:11|          1|        Bread|
|2016-10-30|10:05:34|          2| Scandinavian|
|2016-10-30|10:05:34|          2| Scandinavian|
|2016-10-30|10:07:57|          3|Hot chocolate|
|2016-10-30|10:07:57|          3|          Jam|
|2016-10-30|10:07:57|          3|      Cookies|
|2016-10-30|10:08:41|          4|       Muffin|
|2016-10-30|10:13:03|          5|       Coffee|
|2016-10-30|10:13:03|          5|       Pastry|
|2016-10-30|10:13:03|          5|        Bread|
+----------+--------+-----------+-------------+
only showing top 10 rows

21293

### Transform Data with Spark SQL
Transform the bakery data with Spark SQL: first drop the rows whose item is 'NONE', then count the items sold per day from 2017-01-01 onward

In [4]:
df_bakery1.createOrReplaceTempView("bakery_table_tmp1")

df_bakery2 = spark.sql("SELECT date, transaction, item " +
                       "FROM bakery_table_tmp1 " +
                       "WHERE item NOT LIKE 'NONE' " +
                       "ORDER BY transaction")
df_bakery2.show(5)
df_bakery2.count()

+----------+-----------+-------------+
|      date|transaction|         item|
+----------+-----------+-------------+
|2016-10-30|          1|        Bread|
|2016-10-30|          2| Scandinavian|
|2016-10-30|          2| Scandinavian|
|2016-10-30|          3|Hot chocolate|
|2016-10-30|          3|          Jam|
+----------+-----------+-------------+
only showing top 5 rows

20507
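
A query assembled from concatenated string fragments depends on every fragment ending in a space; if one is missing, two clauses run together. As a minimal sketch (not part of the original notebook), the same statement can be written as a triple-quoted string, where the line breaks act as whitespace between clauses. The names `fragile` and `query` are illustrative only.

```python
# Fragile: fragments are joined with no separator, so a missing
# trailing space fuses the WHERE and ORDER BY clauses together.
fragile = ("SELECT date, transaction, item " +
           "FROM bakery_table_tmp1 " +
           "WHERE item NOT LIKE 'NONE'" +
           "ORDER BY transaction")

# Sturdier: one clause per line in a triple-quoted string; the
# newlines separate the clauses, so no trailing spaces are needed.
query = """
    SELECT date, transaction, item
    FROM bakery_table_tmp1
    WHERE item NOT LIKE 'NONE'
    ORDER BY transaction
"""

print("'NONE'ORDER" in fragile)  # True: the two clauses ran together
print("'NONE'ORDER" in query)    # False
```

Assuming the notebook's `spark` session and the `bakery_table_tmp1` view exist, the string could then be passed as `spark.sql(query)`.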

In [5]:
df_bakery2.createOrReplaceTempView("bakery_table_tmp2")

df_bakery3 = spark.sql("SELECT date, count(*) as count " +
                       "FROM bakery_table_tmp2 " +
                       "WHERE date >= '2017-01-01' " +
                       "GROUP BY date " +
                       "ORDER BY date")
df_bakery3.show(5)
df_bakery3.count()

+----------+-----+
|      date|count|
+----------+-----+
|2017-01-01|    1|
|2017-01-03|   87|
|2017-01-04|   76|
|2017-01-05|   95|
|2017-01-06|   84|
+----------+-----+
only showing top 5 rows

98
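
Because `date` was loaded as a `StringType` column, the filter `date >= '2017-01-01'` is a string comparison. It behaves like a date comparison only because the values are zero-padded `YYYY-MM-DD` strings, which sort lexicographically in chronological order. The plain-Python sketch below mirrors the filter, group, and sort steps of the query on a handful of rows; `rows` and `daily_item_counts` are hypothetical names, and the sample rows are made up for illustration rather than taken from the dataset.

```python
from collections import Counter

# Hypothetical sample rows (date, transaction, item); NOT from the dataset.
rows = [
    ("2016-12-31", 1, "Bread"),
    ("2017-01-01", 2, "Coffee"),
    ("2017-01-03", 3, "Tea"),
    ("2017-01-03", 3, "Cake"),
    ("2017-01-03", 4, "Bread"),
]

def daily_item_counts(rows, start="2017-01-01"):
    """Mimic: WHERE date >= start GROUP BY date ORDER BY date."""
    # String comparison is enough here because ISO-style dates sort
    # lexicographically in the same order as they do chronologically.
    counts = Counter(date for date, _, _ in rows if date >= start)
    return sorted(counts.items())

print(daily_item_counts(rows))
# [('2017-01-01', 1), ('2017-01-03', 3)]
```

Each count is a number of item rows, not a number of distinct transactions: the two rows sharing transaction 3 are counted separately, just as `count(*)` does in the query.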