# PySpark Demo Notebook
## Demo
1. Setup Spark
2. Load Kaggle Data
3. Transform Data with Spark SQL

_Prepared by: [Gary A. Stafford](https://twitter.com/GaryStafford)   
Associated article: https://wp.me/p1RD28-61V_

### Setup Spark
Set up the Spark SparkSession

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

In [2]:
spark = SparkSession \
    .builder \
    .getOrCreate()

### Load Kaggle Data
Load the Kaggle dataset, a CSV file containing ~21K records, into a DataFrame

In [4]:
bakery_schema = StructType([
    StructField("date", StringType(), True),
    StructField("time", StringType(), True),
    StructField("transaction", IntegerType(), True),
    StructField("item", StringType(), True)
])
df_bakery1 = spark.read \
    .format("csv") \
    .option("header", "true") \
    .load("s3://gstafford-aws-emr-notebooks/data/bakery-csv/BreadBasket_DMS.csv", schema=bakery_schema)

df_bakery1.show(10)
df_bakery1.count()

+----------+--------+-----------+-------------+
|      date|    time|transaction|         item|
+----------+--------+-----------+-------------+
|2016-10-30|09:58:11|          1|        Bread|
|2016-10-30|10:05:34|          2| Scandinavian|
|2016-10-30|10:05:34|          2| Scandinavian|
|2016-10-30|10:07:57|          3|Hot chocolate|
|2016-10-30|10:07:57|          3|          Jam|
|2016-10-30|10:07:57|          3|      Cookies|
|2016-10-30|10:08:41|          4|       Muffin|
|2016-10-30|10:13:03|          5|       Coffee|
|2016-10-30|10:13:03|          5|       Pastry|
|2016-10-30|10:13:03|          5|        Bread|
+----------+--------+-----------+-------------+
only showing top 10 rows



21293

### Transform Data with Spark SQL
Transform the DataFrame's bakery data using Spark SQL

In [5]:
df_bakery1.createOrReplaceTempView("tmp_bakery")
# Each fragment ends with a space so the clauses stay separated after concatenation
df_bakery2 = spark.sql("SELECT date, transaction, item " +
                       "FROM tmp_bakery " +
                       "WHERE item NOT LIKE 'NONE' " +
                       "ORDER BY transaction")
print("DataFrame rows: %d" % df_bakery2.count())
df_bakery2.show(5, False)

DataFrame rows: 20507
+----------+-----------+-------------+
|date      |transaction|item         |
+----------+-----------+-------------+
|2016-10-30|1          |Bread        |
|2016-10-30|2          |Scandinavian |
|2016-10-30|2          |Scandinavian |
|2016-10-30|3          |Hot chocolate|
|2016-10-30|3          |Jam          |
+----------+-----------+-------------+
only showing top 5 rows
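
When a query is assembled from concatenated string fragments, every fragment has to carry its own separating space, and it is easy to drop one between clauses. The following is a minimal sketch of one alternative, a triple-quoted string. It assumes a hypothetical helper named `build_filter_query`, which is not part of the original notebook. The `joined` variable only illustrates what happens when a space is missing.

```python
import textwrap


def build_filter_query(view_name: str = "tmp_bakery") -> str:
    """Hypothetical helper: return the filter query as one multi-line string.

    Newlines separate the clauses, so no trailing spaces are needed.
    """
    query = f"""
        SELECT date, transaction, item
        FROM {view_name}
        WHERE item NOT LIKE 'NONE'
        ORDER BY transaction
    """
    # Remove the common indentation and the surrounding blank lines.
    return textwrap.dedent(query).strip()


# For comparison: two adjacent fragments without a separating space
# run the clauses together.
joined = "WHERE item NOT LIKE 'NONE'" + "ORDER BY transaction"
```

In the notebook, the result could be passed straight to `spark.sql(build_filter_query())`, assuming the `tmp_bakery` view has already been registered.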



In [6]:
df_bakery2.createOrReplaceTempView("tmp_bakery")

df_bakery3 = spark.sql("SELECT date, count(*) as count " +
                       "FROM tmp_bakery " +
                       "WHERE date >= '2017-01-01' " +
                       "GROUP BY date " +
                       "ORDER BY date")
print("DataFrame rows: %d" % df_bakery3.count())
df_bakery3.show(5, False)

DataFrame rows: 98
+----------+-----+
|date      |count|
+----------+-----+
|2017-01-01|1    |
|2017-01-03|87   |
|2017-01-04|76   |
|2017-01-05|95   |
|2017-01-06|84   |
+----------+-----+
only showing top 5 rows

