# PySpark Demo Notebook
## Demo
1. Run PostgreSQL Script
2. Load PostgreSQL Data
3. Create New Record
4. Write New Record to PostgreSQL Table
5. Load CSV File Data
6. Write Data to PostgreSQL
7. Analyze Data with Spark SQL
8. Graph Data with BokehJS
9. Read and Write Data to Parquet Format

### Run PostgreSQL Script
Run the PostgreSQL SQL script. The loader, `03_load_sql.py`, imports `psycopg2`, so that package must be installed first.

In [22]:
! pip install psycopg2 psycopg2-binary

Collecting psycopg2
  Downloading https://files.pythonhosted.org/packages/bc/2a/61a8f9719bd6df5b421abd91740cb0595fc3c17b28eaf89fe4f144472ca6/psycopg2-2.7.6.1-cp36-cp36m-manylinux1_x86_64.whl (2.7MB)
Collecting psycopg2-binary
  Downloading https://files.pythonhosted.org/packages/cd/eb/4e872a11edd82079b4163035389051668c58cd2acc30777b6bee73f5f8a3/psycopg2_binary-2.7.6.1-cp36-cp36m-manylinux1_x86_64.whl (2.7MB)
Installing collected packages: psycopg2, psycopg2-binary
Successfully installed psycopg2-2.7.6.1 psycopg2-binary-2.7.6.1


In [1]:
%run -i '03_load_sql.py'

ModuleNotFoundError: No module named 'psycopg2'
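
The `ModuleNotFoundError` above means the kernel could not import `psycopg2` at the time the script ran. The cell counters (`In [22]` for the install, `In [1]` for the run) suggest the install cell may have been executed after this error was recorded, though that is an inference rather than something the notebook states. A minimal sketch for checking which modules the current interpreter can import before running the script; the helper name `missing_modules` is hypothetical:

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the names that the current interpreter cannot import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# psycopg2 is the module named in the error message above.
missing = missing_modules(["psycopg2"])
if missing:
    print(f"Not importable from {sys.executable}: {missing}")
    # Installing through the kernel's own interpreter avoids installing
    # into a different Python environment than the one the notebook uses.
    print(f"Try: {sys.executable} -m pip install psycopg2-binary")
else:
    print("All required modules are importable.")
```

If the module is reported missing, installing it and then re-running the `%run` cell is usually enough.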

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import to_timestamp

In [3]:
working_directory = '/home/garystafford/work/'

spark = SparkSession \
    .builder \
    .appName('pyspark_demo_app') \
    .config('spark.driver.extraClassPath',
            working_directory + 'postgresql-42.2.5.jar') \
    .master("local[*]") \
    .getOrCreate()
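
The session above adds the PostgreSQL JDBC driver jar to the driver classpath by concatenating `working_directory` and the jar file name. A wrong path would probably not fail here but only later, when a JDBC read or write cannot find the driver class. The sketch below checks that the jar exists up front; the helper name `resolve_jdbc_jar` is hypothetical and not part of the original notebook:

```python
from pathlib import Path


def resolve_jdbc_jar(working_directory, jar_name="postgresql-42.2.5.jar"):
    """Return the jar path as a string, or raise if the file is missing."""
    jar_path = Path(working_directory) / jar_name
    if not jar_path.is_file():
        raise FileNotFoundError(f"JDBC driver jar not found: {jar_path}")
    return str(jar_path)
```

The result could then be passed as the value of `spark.driver.extraClassPath` in place of the string concatenation, for example `.config('spark.driver.extraClassPath', resolve_jdbc_jar(working_directory))`.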

### Load CSV File Data
Load the Kaggle dataset (about 21K records) from the CSV file `BreadBasket_DMS.csv` into a DataFrame, using an explicit schema.

In [4]:
bakery_schema = StructType([
    StructField('date', StringType(), True),
    StructField('time', StringType(), True),
    StructField('transaction', IntegerType(), True),
    StructField('item', StringType(), True)
])

df_bakery = spark.read \
    .format("csv") \
    .option("header", "true") \
    .load("BreadBasket_DMS.csv", schema=bakery_schema)

In [5]:
df_bakery.show(10)
df_bakery.count()

+----------+--------+-----------+-------------+
|      date|    time|transaction|         item|
+----------+--------+-----------+-------------+
|2016-10-30|09:58:11|          1|        Bread|
|2016-10-30|10:05:34|          2| Scandinavian|
|2016-10-30|10:05:34|          2| Scandinavian|
|2016-10-30|10:07:57|          3|Hot chocolate|
|2016-10-30|10:07:57|          3|          Jam|
|2016-10-30|10:07:57|          3|      Cookies|
|2016-10-30|10:08:41|          4|       Muffin|
|2016-10-30|10:13:03|          5|       Coffee|
|2016-10-30|10:13:03|          5|       Pastry|
|2016-10-30|10:13:03|          5|        Bread|
+----------+--------+-----------+-------------+
only showing top 10 rows



21293

### Analyze Data with Spark SQL
Analyze the DataFrame's bakery data using Spark SQL

In [6]:
df_bakery.createOrReplaceTempView("bakery_table")

In [7]:
# Combine date and time into one column and drop the 'NONE' placeholder items
df_bakery2 = spark.sql("SELECT CONCAT(date,' ',time) as timestamp, transaction, item " +
                "FROM bakery_table " +
                "WHERE item NOT LIKE 'NONE' " +
                "ORDER BY transaction"
               )
df_bakery2.show(10)
df_bakery2.count()

+-------------------+-----------+-------------+
|          timestamp|transaction|         item|
+-------------------+-----------+-------------+
|2016-10-30 09:58:11|          1|        Bread|
|2016-10-30 10:05:34|          2| Scandinavian|
|2016-10-30 10:05:34|          2| Scandinavian|
|2016-10-30 10:07:57|          3|Hot chocolate|
|2016-10-30 10:07:57|          3|      Cookies|
|2016-10-30 10:07:57|          3|          Jam|
|2016-10-30 10:08:41|          4|       Muffin|
|2016-10-30 10:13:03|          5|       Coffee|
|2016-10-30 10:13:03|          5|       Pastry|
|2016-10-30 10:13:03|          5|        Bread|
+-------------------+-----------+-------------+
only showing top 10 rows



20507
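
The query above concatenates `date` and `time`, drops rows whose item is `NONE`, and sorts by `transaction`. The counts show that the filter removed 21293 - 20507 = 786 rows. Because the sort key is `transaction` alone, rows within the same transaction may come back in a different order, which is why `Cookies` and `Jam` swap places between the two outputs. The plain-Python sketch below mirrors the query logic on a few sample rows. It is an illustration, not how Spark executes the query: the `NONE` row is a hypothetical example, and NULL items (which SQL would also filter out) are ignored.

```python
rows = [
    ("2016-10-30", "10:07:57", 3, "Hot chocolate"),
    ("2016-10-30", "09:58:11", 1, "Bread"),
    ("2016-10-30", "10:05:34", 2, "NONE"),  # hypothetical placeholder row
    ("2016-10-30", "10:05:34", 2, "Scandinavian"),
]


def clean_rows(rows):
    """Join date and time, drop 'NONE' items, and sort by transaction."""
    kept = [
        (f"{date} {time}", transaction, item)
        for date, time, transaction, item in rows
        # LIKE without wildcards matches the whole string, so != is equivalent here
        if item != "NONE"
    ]
    return sorted(kept, key=lambda row: row[1])


for row in clean_rows(rows):
    print(row)

# Row counts reported by the notebook before and after the filter
removed = 21293 - 20507
print(removed)
```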

In [8]:
df_bakery3 = df_bakery2.withColumn('timestamp', to_timestamp(df_bakery2.timestamp, 'yyyy-MM-dd HH:mm:ss'))
df_bakery3.printSchema()
df_bakery3.show(10)
df_bakery3.count()

root
 |-- timestamp: timestamp (nullable = true)
 |-- transaction: integer (nullable = true)
 |-- item: string (nullable = true)

+-------------------+-----------+-------------+
|          timestamp|transaction|         item|
+-------------------+-----------+-------------+
|2016-10-30 09:58:11|          1|        Bread|
|2016-10-30 10:05:34|          2| Scandinavian|
|2016-10-30 10:05:34|          2| Scandinavian|
|2016-10-30 10:07:57|          3|Hot chocolate|
|2016-10-30 10:07:57|          3|      Cookies|
|2016-10-30 10:07:57|          3|          Jam|
|2016-10-30 10:08:41|          4|       Muffin|
|2016-10-30 10:13:03|          5|       Coffee|
|2016-10-30 10:13:03|          5|       Pastry|
|2016-10-30 10:13:03|          5|        Bread|
+-------------------+-----------+-------------+
only showing top 10 rows



20507
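
The `to_timestamp` call above uses the Java-style pattern `yyyy-MM-dd HH:mm:ss`, where `MM` is the month and `mm` the minutes. For readers more used to Python's `datetime`, the sketch below shows the equivalent `strptime` format. It assumes Spark's default (non-ANSI) behaviour of returning null for strings that do not match the pattern, and imitates that by returning `None`. The helper name `parse_timestamp` is hypothetical.

```python
from datetime import datetime

SPARK_PATTERN = "yyyy-MM-dd HH:mm:ss"   # pattern used in the cell above
PYTHON_PATTERN = "%Y-%m-%d %H:%M:%S"    # equivalent strptime format


def parse_timestamp(text):
    """Parse a 'date time' string; return None if it does not match."""
    try:
        return datetime.strptime(text, PYTHON_PATTERN)
    except ValueError:
        return None


print(parse_timestamp("2016-10-30 09:58:11"))
print(parse_timestamp("30/10/2016 09:58"))
```

Checking for nulls in the new column after the conversion is a quick way to spot rows whose text did not match the pattern.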

In [9]:
df_bakery3.createOrReplaceTempView("bakery_table")
# Remove exact duplicate rows and list the most recent transactions first
df_bakery4 = spark.sql("SELECT DISTINCT * " +
                "FROM bakery_table " +
                "WHERE item NOT LIKE 'NONE' " +
                "ORDER BY transaction DESC"
                )
df_bakery4.show(10)
df_bakery4.count()

+-------------------+-----------+--------------+
|          timestamp|transaction|          item|
+-------------------+-----------+--------------+
|2017-04-09 15:04:24|       9684|     Smoothies|
|2017-04-09 14:57:06|       9683|        Pastry|
|2017-04-09 14:57:06|       9683|        Coffee|
|2017-04-09 14:32:58|       9682|        Coffee|
|2017-04-09 14:32:58|       9682|  Tacos/Fajita|
|2017-04-09 14:32:58|       9682|        Muffin|
|2017-04-09 14:32:58|       9682|           Tea|
|2017-04-09 14:30:09|       9681|      Truffles|
|2017-04-09 14:30:09|       9681|           Tea|
|2017-04-09 14:30:09|       9681|Spanish Brunch|
+-------------------+-----------+--------------+
only showing top 10 rows



18887
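
`SELECT DISTINCT *` compares whole rows, so it removed 20507 - 18887 = 1620 rows that matched another row in every column. One consequence worth keeping in mind: if a customer bought the same item twice in one transaction (as with the two `Scandinavian` rows in transaction 2), only one row survives, so later counts reflect distinct items per transaction rather than quantities sold. The plain-Python sketch below illustrates the same whole-row comparison on rows taken from the earlier output; it is not how Spark implements `DISTINCT`.

```python
rows = [
    ("2016-10-30 09:58:11", 1, "Bread"),
    ("2016-10-30 10:05:34", 2, "Scandinavian"),
    ("2016-10-30 10:05:34", 2, "Scandinavian"),  # exact duplicate of the row above
    ("2016-10-30 10:13:03", 5, "Bread"),         # same item, different transaction
]

# dict.fromkeys keeps the first occurrence of each row and preserves order
distinct_rows = list(dict.fromkeys(rows))

for row in distinct_rows:
    print(row)

# Row counts reported by the notebook before and after DISTINCT
duplicates_removed = 20507 - 18887
print(duplicates_removed)
```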

In [17]:
df_bakery5 = spark.sql("SELECT year(timestamp) as year, month(timestamp) as month, day(timestamp) as month, count(*) as count " +
                "FROM bakery_table " +
                "WHERE item NOT LIKE 'NONE' " +
                "GROUP BY year(timestamp), month(timestamp) " +
                "ORDER BY year(timestamp) ASC, month(timestamp) ASC, day(timestamp) ASC")
df_bakery5.show(10)
df_bakery5.count()

AnalysisException: "expression 'bakery_table.`timestamp`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
Project [year#179, month#180, month#181, count#182L]
+- Sort [year#179 ASC NULLS FIRST, month#180 ASC NULLS FIRST], true
   +- Aggregate [year(cast(timestamp#64 as date)), month(cast(timestamp#64 as date))], [year(cast(timestamp#64 as date)) AS year#179, month(cast(timestamp#64 as date)) AS month#180, dayofmonth(cast(timestamp#64 as date)) AS month#181, count(1) AS count#182L]
      +- Filter NOT item#3 LIKE NONE
         +- SubqueryAlias bakery_table
            +- Project [to_timestamp(timestamp#35, Some(yyyy-MM-dd HH:mm:ss)) AS timestamp#64, transaction#2, item#3]
               +- Sort [transaction#2 ASC NULLS FIRST], true
                  +- Project [concat(date#0,  , time#1) AS timestamp#35, transaction#2, item#3]
                     +- Filter NOT item#3 LIKE NONE
                        +- SubqueryAlias bakery_table
                           +- Relation[date#0,time#1,transaction#2,item#3] csv
"
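
The `AnalysisException` above is raised because the query selects `day(timestamp)` but groups only by year and month, so Spark has no single day value to return per group. The query also aliases both the month and the day as `month`. One possible correction, assuming the intent was a count per calendar day, is sketched below: the day expression is added to `GROUP BY`, the third column is aliased `day`, and the sort uses the column aliases. The names `DAILY_COUNTS_QUERY` and `daily_item_counts` are hypothetical, and no output is shown because the corrected query was not run in the original notebook.

```python
DAILY_COUNTS_QUERY = (
    "SELECT year(timestamp) AS year, month(timestamp) AS month, "
    "day(timestamp) AS day, count(*) AS count "
    "FROM bakery_table "
    "WHERE item NOT LIKE 'NONE' "
    # Every non-aggregated expression in SELECT also appears in GROUP BY
    "GROUP BY year(timestamp), month(timestamp), day(timestamp) "
    "ORDER BY year, month, day"
)


def daily_item_counts(spark):
    """Run the corrected query on an existing SparkSession.

    Assumes the 'bakery_table' temp view has already been registered.
    """
    return spark.sql(DAILY_COUNTS_QUERY)
```

In the notebook this would be used as `df_bakery5 = daily_item_counts(spark)`, followed by `df_bakery5.show(10)`.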