# PySpark Demo Notebook
## Set-up
1. Download 'BreadBasket_DMS.csv'
2. Run 'basic_scripts.sql'

## Demo
1. Load PostgreSQL Data
2. Create New Record
3. Write New Record to PostgreSQL Table
4. Load CSV File Data
5. Write Data to PostgreSQL
6. Analyze Data with Spark SQL

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import to_timestamp

In [2]:
spark = SparkSession\
    .builder\
    .appName('pyspark_demo_app')\
    .config('spark.driver.extraClassPath', 
            '/home/garystafford/work/postgresql-42.2.5.jar')\
    .getOrCreate()

sc = spark.sparkContext

### Load PostgreSQL Data
Load the PostgreSQL 'bakery_basket' table's data (initially 3 rows) into a DataFrame. The output below was captured after earlier appends, so it shows more rows.

In [3]:
properties = {
    'driver': 'org.postgresql.Driver'
}
url = 'jdbc:postgresql://postgres:5432/demo'

df1 = spark.read \
    .format('jdbc') \
    .option('url', url) \
    .option('user', 'postgres') \
    .option('password', 'postgres1234') \
    .option('driver', properties['driver']) \
    .option('dbtable', 'bakery_basket') \
    .load()
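
The connection settings above are repeated in every read and write cell, with the password in plain text. As a minimal sketch, the options could be gathered in one helper and the credentials taken from environment variables. The helper name `jdbc_options` and the variable names `DEMO_DB_USER` and `DEMO_DB_PASSWORD` are hypothetical and not part of the original notebook; the defaults mirror the values used in this demo.

```python
import os


def jdbc_options(table, env=None):
    """Build the JDBC option dict for one table (hypothetical helper)."""
    env = os.environ if env is None else env
    return {
        'url': 'jdbc:postgresql://postgres:5432/demo',
        'driver': 'org.postgresql.Driver',
        # Assumed environment variable names; fall back to the demo values
        'user': env.get('DEMO_DB_USER', 'postgres'),
        'password': env.get('DEMO_DB_PASSWORD', 'postgres1234'),
        'dbtable': table,
    }


opts = jdbc_options('bakery_basket')
# With an active SparkSession the dict can be unpacked into the reader:
# df1 = spark.read.format('jdbc').options(**opts).load()
```

The same dict could be passed to `df.write.format('jdbc').options(**opts)` in the write cells, so the URL and credentials live in one place.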

In [4]:
df1.show()
df1.count()

+---+----------+--------+-----------+-------------+
| id|      date|    time|transaction|         item|
+---+----------+--------+-----------+-------------+
|  1|2016-10-30|09:58:11|          1|        Bread|
|  2|2016-10-30|10:05:34|          2| Scandinavian|
|  3|2016-10-30|10:07:57|          3|Hot chocolate|
|  1|2016-10-30|10:13:27|          2|       Pastry|
|  2|2016-10-30|09:58:11|          1|        Bread|
|  3|2016-10-30|10:05:34|          2| Scandinavian|
|  4|2016-10-30|10:05:34|          2| Scandinavian|
|  5|2016-10-30|10:07:57|          3|Hot chocolate|
|  6|2016-10-30|10:07:57|          3|          Jam|
|  7|2016-10-30|10:07:57|          3|      Cookies|
|  8|2016-10-30|10:08:41|          4|       Muffin|
|  9|2016-10-30|10:13:03|          5|       Coffee|
| 10|2016-10-30|10:13:03|          5|       Pastry|
| 11|2016-10-30|10:13:03|          5|        Bread|
| 12|2016-10-30|10:16:55|          6|    Medialuna|
| 13|2016-10-30|10:16:55|          6|       Pastry|
| 14|2016-10

63885

### Create New Record
Create a new bakery record and load it into a DataFrame

In [5]:
data = [('2016-10-30', '10:13:27', 2, 'Pastry')]

bakery_schema = StructType([
    StructField('date', StringType(), True),
    StructField('time', StringType(), True),
    StructField('transaction', IntegerType(), True),
    StructField('item', StringType(), True)
])

df2 = spark.createDataFrame(data, bakery_schema)

In [6]:
df2.show()
df2.count()

+----------+--------+-----------+------+
|      date|    time|transaction|  item|
+----------+--------+-----------+------+
|2016-10-30|10:13:27|          2|Pastry|
+----------+--------+-----------+------+



1

### Write New Record to PostgreSQL Table
Append the contents of the DataFrame to the PostgreSQL 'bakery_basket' table

In [7]:
df2.write \
    .format('jdbc') \
    .option('url', url) \
    .option('user', 'postgres') \
    .option('password', 'postgres1234') \
    .option('driver', properties['driver']) \
    .option('dbtable', 'bakery_basket') \
    .mode('append') \
    .save()

In [8]:
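# df1 is evaluated lazily and is not cached, so show() and count()
# query PostgreSQL again and include the newly appended row
df1.show()
df1.count()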

+---+----------+--------+-----------+-------------+
| id|      date|    time|transaction|         item|
+---+----------+--------+-----------+-------------+
|  1|2016-10-30|09:58:11|          1|        Bread|
|  2|2016-10-30|10:05:34|          2| Scandinavian|
|  3|2016-10-30|10:07:57|          3|Hot chocolate|
|  1|2016-10-30|10:13:27|          2|       Pastry|
|  2|2016-10-30|09:58:11|          1|        Bread|
|  3|2016-10-30|10:05:34|          2| Scandinavian|
|  4|2016-10-30|10:05:34|          2| Scandinavian|
|  5|2016-10-30|10:07:57|          3|Hot chocolate|
|  6|2016-10-30|10:07:57|          3|          Jam|
|  7|2016-10-30|10:07:57|          3|      Cookies|
|  8|2016-10-30|10:08:41|          4|       Muffin|
|  9|2016-10-30|10:13:03|          5|       Coffee|
| 10|2016-10-30|10:13:03|          5|       Pastry|
| 11|2016-10-30|10:13:03|          5|        Bread|
| 12|2016-10-30|10:16:55|          6|    Medialuna|
| 13|2016-10-30|10:16:55|          6|       Pastry|
| 14|2016-10

63886

### Load CSV File Data
Load the Kaggle dataset from the CSV file, containing ~21K records, into a DataFrame

In [9]:
df3 = spark.read \
    .format("csv") \
    .option("header", "true") \
    .load("BreadBasket_DMS.csv", schema=bakery_schema)
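
Because `bakery_schema` is passed explicitly, Spark takes the column names from the schema and treats the file's first line only as a header to skip, so a file whose columns are in a different order would load without complaint. A quick check in plain Python (not Spark) can catch that before the load. This is a sketch under the assumption that the header names match the schema names case-insensitively; `header_matches` and the sample text are illustrative, not part of the notebook.

```python
import csv
import io

EXPECTED_COLUMNS = ['date', 'time', 'transaction', 'item']


def header_matches(csv_file, expected=EXPECTED_COLUMNS):
    """Return True if the first CSV row names the expected columns in order."""
    header = next(csv.reader(csv_file), [])
    return [name.strip().lower() for name in header] == expected


# Illustrative two-line sample, not the real file
sample = io.StringIO('Date,Time,Transaction,Item\n2016-10-30,09:58:11,1,Bread\n')
print(header_matches(sample))  # True
```

For the real file, the same function could be called as `header_matches(open('BreadBasket_DMS.csv', newline=''))` before running the Spark read.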

In [10]:
df3.show(10)
df3.count()

+----------+--------+-----------+-------------+
|      date|    time|transaction|         item|
+----------+--------+-----------+-------------+
|2016-10-30|09:58:11|          1|        Bread|
|2016-10-30|10:05:34|          2| Scandinavian|
|2016-10-30|10:05:34|          2| Scandinavian|
|2016-10-30|10:07:57|          3|Hot chocolate|
|2016-10-30|10:07:57|          3|          Jam|
|2016-10-30|10:07:57|          3|      Cookies|
|2016-10-30|10:08:41|          4|       Muffin|
|2016-10-30|10:13:03|          5|       Coffee|
|2016-10-30|10:13:03|          5|       Pastry|
|2016-10-30|10:13:03|          5|        Bread|
+----------+--------+-----------+-------------+
only showing top 10 rows



21293

### Write Data to PostgreSQL
Append the contents of the CSV DataFrame to the PostgreSQL 'bakery_basket' table

In [11]:
df3.write \
    .format('jdbc') \
    .option('url', url) \
    .option('user', 'postgres') \
    .option('password', 'postgres1234') \
    .option('driver', properties['driver']) \
    .option('dbtable', 'bakery_basket') \
    .mode('append') \
    .save()

In [12]:
df1.show(10)
df1.count()

+---+----------+--------+-----------+-------------+
| id|      date|    time|transaction|         item|
+---+----------+--------+-----------+-------------+
|  1|2016-10-30|09:58:11|          1|        Bread|
|  2|2016-10-30|10:05:34|          2| Scandinavian|
|  3|2016-10-30|10:07:57|          3|Hot chocolate|
|  1|2016-10-30|10:13:27|          2|       Pastry|
|  2|2016-10-30|09:58:11|          1|        Bread|
|  3|2016-10-30|10:05:34|          2| Scandinavian|
|  4|2016-10-30|10:05:34|          2| Scandinavian|
|  5|2016-10-30|10:07:57|          3|Hot chocolate|
|  6|2016-10-30|10:07:57|          3|          Jam|
|  7|2016-10-30|10:07:57|          3|      Cookies|
+---+----------+--------+-----------+-------------+
only showing top 10 rows



85179

### Analyze Data with Spark SQL
Analyze the DataFrame's bakery data using Spark SQL

In [14]:
df1.createOrReplaceTempView("bakery_table")
df4 = spark.sql("SELECT * FROM bakery_table " +
                "ORDER BY transaction, date, time")
df4.show(15)
df4.count()

+-----+----------+--------+-----------+------------+
|   id|      date|    time|transaction|        item|
+-----+----------+--------+-----------+------------+
|    2|2016-10-30|09:58:11|          1|       Bread|
|63884|2016-10-30|09:58:11|          1|       Bread|
|21296|2016-10-30|09:58:11|          1|       Bread|
|    1|2016-10-30|09:58:11|          1|       Bread|
|42590|2016-10-30|09:58:11|          1|       Bread|
|    2|2016-10-30|10:05:34|          2|Scandinavian|
|    4|2016-10-30|10:05:34|          2|Scandinavian|
|21298|2016-10-30|10:05:34|          2|Scandinavian|
|42592|2016-10-30|10:05:34|          2|Scandinavian|
|    3|2016-10-30|10:05:34|          2|Scandinavian|
|21297|2016-10-30|10:05:34|          2|Scandinavian|
|42591|2016-10-30|10:05:34|          2|Scandinavian|
|63885|2016-10-30|10:05:34|          2|Scandinavian|
|63886|2016-10-30|10:05:34|          2|Scandinavian|
|21295|2016-10-30|10:13:27|          2|      Pastry|
+-----+----------+--------+-----------+------------+

85179

In [16]:
df5 = spark.sql("SELECT COUNT(DISTINCT item) AS item_count FROM bakery_table")
df5.show()

df5 = spark.sql("SELECT item, count(*) as count " +
                "FROM bakery_table " +
                "WHERE item NOT LIKE 'NONE' " +
                "GROUP BY item ORDER BY count DESC")
df5.show(10)
df5.count()

+----------+
|item_count|
+----------+
|        95|
+----------+



+-------------+-----+
|         item|count|
+-------------+-----+
|       Coffee|21884|
|        Bread|13301|
|          Tea| 5740|
|         Cake| 4100|
|       Pastry| 3428|
|     Sandwich| 3084|
|    Medialuna| 2464|
|Hot chocolate| 2361|
|      Cookies| 2160|
|      Brownie| 1516|
+-------------+-----+
only showing top 10 rows



94
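
The first query reports 95 distinct items while the grouped result has 94 rows. That gap is consistent with the 'NONE' placeholder being counted as a distinct value in the first query and filtered out in the second. The sketch below reproduces the same logic in plain Python (not Spark) on a made-up list of items, purely to illustrate the two counts.

```python
from collections import Counter

# Made-up sample of item values, including the 'NONE' placeholder
items = ['Coffee', 'Bread', 'NONE', 'Coffee', 'Tea', 'Bread', 'Coffee']

# COUNT(DISTINCT item): 'NONE' still counts as one distinct value
distinct_with_none = len(set(items))

# WHERE item NOT LIKE 'NONE' ... GROUP BY item ORDER BY count DESC
counts = Counter(item for item in items if item != 'NONE')
ranked = counts.most_common()

print(distinct_with_none)  # 4
print(ranked)              # [('Coffee', 3), ('Bread', 2), ('Tea', 1)]
print(len(ranked))         # 3
```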

In [17]:
df6 = spark.sql("SELECT CONCAT(date,' ',time) as timestamp, transaction, item " +
                "FROM bakery_table " +
                "WHERE item NOT LIKE 'NONE' " +
                "ORDER BY transaction"
               )
df6.show(10)
df6.count()

+-------------------+-----------+------------+
|          timestamp|transaction|        item|
+-------------------+-----------+------------+
|2016-10-30 09:58:11|          1|       Bread|
|2016-10-30 09:58:11|          1|       Bread|
|2016-10-30 09:58:11|          1|       Bread|
|2016-10-30 09:58:11|          1|       Bread|
|2016-10-30 09:58:11|          1|       Bread|
|2016-10-30 10:05:34|          2|Scandinavian|
|2016-10-30 10:05:34|          2|Scandinavian|
|2016-10-30 10:05:34|          2|Scandinavian|
|2016-10-30 10:05:34|          2|Scandinavian|
|2016-10-30 10:13:27|          2|      Pastry|
+-------------------+-----------+------------+
only showing top 10 rows



82035

In [18]:
df7 = df6.withColumn('timestamp', to_timestamp(df6.timestamp, 'yyyy-MM-dd HH:mm:ss'))
df7.printSchema()
df7.show(10)
df7.count()

root
 |-- timestamp: timestamp (nullable = true)
 |-- transaction: integer (nullable = true)
 |-- item: string (nullable = true)



+-------------------+-----------+------------+
|          timestamp|transaction|        item|
+-------------------+-----------+------------+
|2016-10-30 09:58:11|          1|       Bread|
|2016-10-30 09:58:11|          1|       Bread|
|2016-10-30 09:58:11|          1|       Bread|
|2016-10-30 09:58:11|          1|       Bread|
|2016-10-30 09:58:11|          1|       Bread|
|2016-10-30 10:05:34|          2|Scandinavian|
|2016-10-30 10:05:34|          2|Scandinavian|
|2016-10-30 10:05:34|          2|Scandinavian|
|2016-10-30 10:05:34|          2|Scandinavian|
|2016-10-30 10:13:27|          2|      Pastry|
+-------------------+-----------+------------+
only showing top 10 rows



82035
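
The pattern `'yyyy-MM-dd HH:mm:ss'` tells `to_timestamp` how to read the concatenated date and time string. For readers more familiar with Python's own format codes, the following sketch parses one value from the output above with `datetime.strptime`; it is a plain-Python analogue for illustration, not what Spark runs internally.

```python
from datetime import datetime

# Same concatenation as CONCAT(date, ' ', time) in the query above
date_str, time_str = '2016-10-30', '09:58:11'
combined = f'{date_str} {time_str}'

# '%Y-%m-%d %H:%M:%S' plays the role of Spark's 'yyyy-MM-dd HH:mm:ss'
parsed = datetime.strptime(combined, '%Y-%m-%d %H:%M:%S')
print(parsed)  # 2016-10-30 09:58:11
```

Note that `strptime` raises `ValueError` when the string does not match the pattern.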

In [19]:
df7.createOrReplaceTempView("bakery_table")
df8 = spark.sql("SELECT DISTINCT * " +
                "FROM bakery_table " +
                "WHERE item NOT LIKE 'NONE' " +
                "ORDER BY transaction DESC"
                )
df8.show(10)
df8.count()


+-------------------+-----------+--------------+
|          timestamp|transaction|          item|
+-------------------+-----------+--------------+
|2017-04-09 15:04:24|       9684|     Smoothies|
|2017-04-09 14:57:06|       9683|        Pastry|
|2017-04-09 14:57:06|       9683|        Coffee|
|2017-04-09 14:32:58|       9682|           Tea|
|2017-04-09 14:32:58|       9682|  Tacos/Fajita|
|2017-04-09 14:32:58|       9682|        Muffin|
|2017-04-09 14:32:58|       9682|        Coffee|
|2017-04-09 14:30:09|       9681|           Tea|
|2017-04-09 14:30:09|       9681|Spanish Brunch|
|2017-04-09 14:30:09|       9681|      Truffles|
+-------------------+-----------+--------------+
only showing top 10 rows



18888

In [21]:
df8.write.parquet('bakery_parquet', mode='overwrite')

In [22]:
df9 = spark.read.parquet('bakery_parquet')
df9.show(10)
df9.count()

+-------------------+-----------+---------+
|          timestamp|transaction|     item|
+-------------------+-----------+---------+
|2017-02-15 14:54:25|       6620|     Cake|
|2017-02-15 14:41:27|       6619|    Bread|
|2017-02-15 14:40:41|       6618|   Coffee|
|2017-02-15 14:40:41|       6618|    Bread|
|2017-02-15 14:23:16|       6617| Baguette|
|2017-02-15 14:23:16|       6617|   Coffee|
|2017-02-15 14:23:16|       6617|    Salad|
|2017-02-15 14:23:16|       6617| Art Tray|
|2017-02-15 14:23:16|       6617|Alfajores|
|2017-02-15 14:16:26|       6616|    Bread|
+-------------------+-----------+---------+
only showing top 10 rows



18888
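
The rows read back from Parquet start at transaction 6620 rather than 9684, although the count still matches. A likely explanation is that the data was written as several part files and a plain read does not restore the global sort order. If the order matters, it can be re-applied after reading. The helper name `read_sorted` below is hypothetical; it only uses `spark.read.parquet` and `DataFrame.orderBy`.

```python
def read_sorted(spark, path='bakery_parquet'):
    """Read the Parquet data back and restore the descending order (sketch)."""
    df = spark.read.parquet(path)
    # Same ordering as the ORDER BY transaction DESC query that produced df8
    return df.orderBy('transaction', ascending=False)


# With the notebook's SparkSession:
# df9 = read_sorted(spark)
# df9.show(10)
```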