# PySpark Demo Notebook
## Demo
1. Run PostgreSQL Script
2. Load PostgreSQL Data
3. Create a New Record
4. Append New Record to Database Table
5. Read CSV-Format File
6. Overwrite Data to Database Table
7. Analyze Data with Spark SQL
8. Graph Data with BokehJS
9. Read and Write Data to Parquet Format

_Prepared by: [Gary A. Stafford](https://twitter.com/GaryStafford)   
Associated article: https://wp.me/p1RD28-61V_

## Run PostgreSQL Script
Run the SQL load script to create the database schema and import data from a CSV file.

In [1]:
%run -i '03_load_sql.py'

DROP TABLE IF EXISTS "transactions"

DROP SEQUENCE IF EXISTS transactions_id_seq

CREATE SEQUENCE transactions_id_seq INCREMENT 1 MINVALUE 1 MAXVALUE 2147483647 START 1 CACHE 1


CREATE TABLE "public"."transactions"
(
    "id"          integer DEFAULT nextval('transactions_id_seq') NOT NULL,
    "date"        character varying(10)                           NOT NULL,
    "time"        character varying(8)                            NOT NULL,
    "transaction" integer                                         NOT NULL,
    "item"        character varying(50)                           NOT NULL
) WITH (oids = false)

Row count: 21293


## Load PostgreSQL Data
Load the PostgreSQL 'transactions' table's contents into a Spark DataFrame.

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

In [3]:
spark = SparkSession \
    .builder \
    .appName('04_notebook') \
    .config('spark.driver.extraClassPath',
            'postgresql-42.2.8.jar') \
    .getOrCreate()

In [4]:
properties = {
    'driver': 'org.postgresql.Driver',
    'url': 'jdbc:postgresql://postgres:5432/bakery',
    'user': 'postgres',
    'password': 'postgres1234',
    'dbtable': 'transactions',
}
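
The cell above hard-codes the connection settings, including the password. As a minimal sketch of one alternative, the same dictionary could be built from environment variables, with the notebook's values as fallbacks. The function name `jdbc_properties` and the `BAKERY_DB_*` variable names are hypothetical and not part of the original notebook.

```python
import os


def jdbc_properties(env=None):
    """Build the JDBC option dict, preferring environment variables.

    The BAKERY_DB_* names are illustrative assumptions; the defaults
    mirror the values hard-coded in the notebook.
    """
    env = os.environ if env is None else env
    host = env.get('BAKERY_DB_HOST', 'postgres')
    port = env.get('BAKERY_DB_PORT', '5432')
    name = env.get('BAKERY_DB_NAME', 'bakery')
    return {
        'driver': 'org.postgresql.Driver',
        'url': f'jdbc:postgresql://{host}:{port}/{name}',
        'user': env.get('BAKERY_DB_USER', 'postgres'),
        'password': env.get('BAKERY_DB_PASSWORD', 'postgres1234'),
        'dbtable': 'transactions',
    }


properties = jdbc_properties()
```

Because the keys match the JDBC option names, the dictionary could also be passed in a single call, for example `spark.read.format('jdbc').options(**properties).load()`.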

In [5]:
df1 = spark.read \
    .format('jdbc') \
    .option('driver', properties['driver']) \
    .option('url', properties['url']) \
    .option('user', properties['user']) \
    .option('password', properties['password']) \
    .option('dbtable', properties['dbtable']) \
    .load()

In [6]:
%%time
print('DataFrame rows: %d' % df1.count())
print('DataFrame schema: %s' % df1)
df1.show(10, False)

DataFrame rows: 21293
DataFrame schema: DataFrame[id: int, date: string, time: string, transaction: int, item: string]
+---+----------+--------+-----------+-------------+
|id |date      |time    |transaction|item         |
+---+----------+--------+-----------+-------------+
|1  |2016-10-30|09:58:11|1          |Bread        |
|2  |2016-10-30|10:05:34|2          |Scandinavian |
|3  |2016-10-30|10:05:34|2          |Scandinavian |
|4  |2016-10-30|10:07:57|3          |Hot chocolate|
|5  |2016-10-30|10:07:57|3          |Jam          |
|6  |2016-10-30|10:07:57|3          |Cookies      |
|7  |2016-10-30|10:08:41|4          |Muffin       |
|8  |2016-10-30|10:13:03|5          |Coffee       |
|9  |2016-10-30|10:13:03|5          |Pastry       |
|10 |2016-10-30|10:13:03|5          |Bread        |
+---+----------+--------+-----------+-------------+
only showing top 10 rows

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 3.06 s


## Create a New Record
Create a new bakery record and load it into a Spark DataFrame.

In [7]:
bakery_schema = StructType([
    StructField('date', StringType(), True),
    StructField('time', StringType(), True),
    StructField('transaction', IntegerType(), True),
    StructField('item', StringType(), True)
])

In [8]:
data = [('2016-10-30', '10:13:27', 2, 'Pastry')]
df2 = spark.createDataFrame(data, bakery_schema)

In [9]:
print('DataFrame rows: %d' % df2.count())
print('DataFrame schema: %s' % df2)
df2.show(10, False)

DataFrame rows: 1
DataFrame schema: DataFrame[date: string, time: string, transaction: int, item: string]
+----------+--------+-----------+------+
|date      |time    |transaction|item  |
+----------+--------+-----------+------+
|2016-10-30|10:13:27|2          |Pastry|
+----------+--------+-----------+------+
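
The record above is typed in by hand. Since the table stores the date and time as 10- and 8-character strings, a small helper could derive both from a single `datetime` value. This is only a sketch; `make_record` is a hypothetical name that does not appear in the notebook.

```python
from datetime import datetime


def make_record(moment, transaction, item):
    """Return a (date, time, transaction, item) tuple in the table's string formats."""
    return (
        moment.strftime('%Y-%m-%d'),  # 10 characters, fits varchar(10)
        moment.strftime('%H:%M:%S'),  # 8 characters, fits varchar(8)
        int(transaction),
        str(item),
    )


record = make_record(datetime(2016, 10, 30, 10, 13, 27), 2, 'Pastry')
print(record)  # ('2016-10-30', '10:13:27', 2, 'Pastry')
```

A list of such tuples has the same shape as the `data` list passed to `spark.createDataFrame` together with `bakery_schema`.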



## Append New Record to Database Table
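Append the contents of the DataFrame to the 'transactions' table in the bakery PostgreSQL database. Because df1 is evaluated lazily and was never cached, counting it again afterward re-queries the table and reflects the new row.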

In [10]:
df2.write \
    .format('jdbc') \
    .option('driver', properties['driver']) \
    .option('url', properties['url']) \
    .option('user', properties['user']) \
    .option('password', properties['password']) \
    .option('dbtable', properties['dbtable']) \
    .mode('append') \
    .save()

In [11]:
print('DataFrame rows: %d' % df1.count())

DataFrame rows: 21294


## Read CSV-Format File
Read the CSV-format data file into a Spark DataFrame, applying the bakery schema defined earlier.

In [12]:
df3 = spark.read \
    .format('csv') \
    .option('header', 'true') \
    .load('BreadBasket_DMS.csv', schema=bakery_schema)

In [13]:
print('DataFrame rows: %d' % df3.count())
print('DataFrame schema: %s' % df3)
df3.show(10, False)

DataFrame rows: 21293
DataFrame schema: DataFrame[date: string, time: string, transaction: int, item: string]
+----------+--------+-----------+-------------+
|date      |time    |transaction|item         |
+----------+--------+-----------+-------------+
|2016-10-30|09:58:11|1          |Bread        |
|2016-10-30|10:05:34|2          |Scandinavian |
|2016-10-30|10:05:34|2          |Scandinavian |
|2016-10-30|10:07:57|3          |Hot chocolate|
|2016-10-30|10:07:57|3          |Jam          |
|2016-10-30|10:07:57|3          |Cookies      |
|2016-10-30|10:08:41|4          |Muffin       |
|2016-10-30|10:13:03|5          |Coffee       |
|2016-10-30|10:13:03|5          |Pastry       |
|2016-10-30|10:13:03|5          |Bread        |
+----------+--------+-----------+-------------+
only showing top 10 rows
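
To illustrate what applying the schema to the CSV file amounts to, here is a rough plain-Python analogue using the standard `csv` module rather than Spark. The sample text, including its header line, is an assumption about the file's layout, and `read_rows` is a hypothetical helper.

```python
import csv
import io

# Field names and Python types mirroring bakery_schema.
FIELDS = [('date', str), ('time', str), ('transaction', int), ('item', str)]

# Assumed sample; the header text is a guess at the CSV file's first line.
SAMPLE = """Date,Time,Transaction,Item
2016-10-30,09:58:11,1,Bread
2016-10-30,10:05:34,2,Scandinavian
2016-10-30,10:05:34,2,Scandinavian
"""


def read_rows(text):
    """Parse CSV text, skip the header, and cast each column by position."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header line, as option('header', 'true') does
    return [
        {name: cast(value) for (name, cast), value in zip(FIELDS, row)}
        for row in reader
    ]


rows = read_rows(SAMPLE)
```

As in the Spark read, the column names come from the schema rather than from the header line, and the transaction column is converted to an integer.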



## Overwrite Data to Database Table
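Overwrite the 'transactions' table with the contents of the DataFrame. With the 'truncate' option set, Spark empties the existing table instead of dropping and recreating it, so the table definition and its id sequence are kept; this is why the new ids start at 21295.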

In [14]:
df3.write \
    .format('jdbc') \
    .option('driver', properties['driver']) \
    .option('url', properties['url']) \
    .option('user', properties['user']) \
    .option('password', properties['password']) \
    .option('dbtable', properties['dbtable']) \
    .option('truncate', 'true') \
    .mode('overwrite') \
    .save()

In [15]:
print('DataFrame rows: %d' % df1.count())
print('DataFrame schema: %s' % df1)
df1.show(10, False)

DataFrame rows: 21293
DataFrame schema: DataFrame[id: int, date: string, time: string, transaction: int, item: string]
+-----+----------+--------+-----------+-------------+
|id   |date      |time    |transaction|item         |
+-----+----------+--------+-----------+-------------+
|21295|2016-10-30|09:58:11|1          |Bread        |
|21296|2016-10-30|10:05:34|2          |Scandinavian |
|21297|2016-10-30|10:05:34|2          |Scandinavian |
|21298|2016-10-30|10:07:57|3          |Hot chocolate|
|21299|2016-10-30|10:07:57|3          |Jam          |
|21300|2016-10-30|10:07:57|3          |Cookies      |
|21301|2016-10-30|10:08:41|4          |Muffin       |
|21302|2016-10-30|10:13:03|5          |Coffee       |
|21303|2016-10-30|10:13:03|5          |Pastry       |
|21304|2016-10-30|10:13:03|5          |Bread        |
+-----+----------+--------+-----------+-------------+
only showing top 10 rows



## Analyze Data with Spark SQL
Perform some basic analysis of the DataFrame's bakery data using Spark SQL.

In [16]:
df4 = df1.select('*') \
    .sort(df1.transaction, df1.date, df1.time)

print('DataFrame rows: %d' % df4.count())
df4.show(10, False)

DataFrame rows: 21293
+-----+----------+--------+-----------+-------------+
|id   |date      |time    |transaction|item         |
+-----+----------+--------+-----------+-------------+
|21295|2016-10-30|09:58:11|1          |Bread        |
|21297|2016-10-30|10:05:34|2          |Scandinavian |
|21296|2016-10-30|10:05:34|2          |Scandinavian |
|21298|2016-10-30|10:07:57|3          |Hot chocolate|
|21300|2016-10-30|10:07:57|3          |Cookies      |
|21299|2016-10-30|10:07:57|3          |Jam          |
|21301|2016-10-30|10:08:41|4          |Muffin       |
|21302|2016-10-30|10:13:03|5          |Coffee       |
|21303|2016-10-30|10:13:03|5          |Pastry       |
|21304|2016-10-30|10:13:03|5          |Bread        |
+-----+----------+--------+-----------+-------------+
only showing top 10 rows



In [17]:
df4.createOrReplaceTempView('tmp_bakery')
sql_query = "SELECT item, count(*) as count " + \
            "FROM tmp_bakery " + \
            "WHERE item NOT LIKE 'NONE' " + \
            "GROUP BY item ORDER BY count DESC " + \
            "LIMIT 10"

df5 = spark.sql(sql_query)
df5.show(10, False)

+-------------+-----+
|item         |count|
+-------------+-----+
|Coffee       |5471 |
|Bread        |3325 |
|Tea          |1435 |
|Cake         |1025 |
|Pastry       |856  |
|Sandwich     |771  |
|Medialuna    |616  |
|Hot chocolate|590  |
|Cookies      |540  |
|Brownie      |379  |
+-------------+-----+
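
The grouping logic of the query can be tried without a Spark session. The sketch below runs a near-identical statement with Python's built-in `sqlite3` module on the ten sample rows shown earlier. SQLite is a stand-in here, not Spark SQL; the alias `item_count` and the secondary sort key are additions for this example.

```python
import sqlite3

# The ten sample rows shown in the DataFrame output earlier in the notebook.
SAMPLE_ITEMS = [
    'Bread', 'Scandinavian', 'Scandinavian', 'Hot chocolate', 'Jam',
    'Cookies', 'Muffin', 'Coffee', 'Pastry', 'Bread',
]

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE tmp_bakery (item TEXT NOT NULL)')
conn.executemany('INSERT INTO tmp_bakery (item) VALUES (?)',
                 [(item,) for item in SAMPLE_ITEMS])

top_items = conn.execute(
    "SELECT item, COUNT(*) AS item_count "
    "FROM tmp_bakery "
    "WHERE item NOT LIKE 'NONE' "
    "GROUP BY item "
    # The secondary sort key is an addition that makes ties deterministic.
    "ORDER BY item_count DESC, item ASC "
    "LIMIT 10"
).fetchall()
conn.close()

for item, item_count in top_items:
    print(item, item_count)
```

Without the secondary sort key, items with equal counts may come back in any order, which also applies to the Spark query.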



## Graph Data with BokehJS
Create a vertical bar chart and a pie chart with [BokehJS](https://docs.bokeh.org/en/latest/index.html), displaying the Spark DataFrame contents.

In [18]:
from math import pi
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource
from bokeh.transform import factor_cmap, cumsum
from bokeh.palettes import Paired12, Category20c

output_notebook()

### Vertical Bar Chart

In [19]:
source = ColumnDataSource(data=df5.toPandas())
tooltips = [('item', '@item'), ('count', '@{count}{,}')]
items = source.data['item'].tolist()
color_map = factor_cmap(
    field_name='item', 
    palette=Paired12, 
    factors=items
)
plot = figure(
    x_range=items, 
    plot_width=750, 
    plot_height=375, 
    min_border=0, 
    tooltips=tooltips
)
plot.vbar(
    x='item', 
    bottom=0, 
    top='count', 
    source=source, 
    width=0.9, 
    fill_color=color_map
)
plot.title.text = 'Top 10 Bakery Items'
plot.xaxis.axis_label = 'Bakery Items'
plot.yaxis.axis_label = 'Total Items Sold'

show(plot)

### Pie Chart

In [20]:
data = df5.toPandas()
data['angle'] = data['count'] / data['count'].sum() * 2 * pi
plot = figure(
    plot_height=375, 
    title='Top 10 Bakery Items',
    tooltips=tooltips, 
    x_range=(-0.5, 1.0)
)
plot.wedge(
    x=0, 
    y=1, 
    radius=0.4,
    start_angle=cumsum('angle', 
                       include_zero=True), 
    end_angle=cumsum('angle'),
    line_color='white', 
    fill_color=color_map, 
    legend_field='item', 
    source=data
)
plot.axis.axis_label = None
plot.axis.visible = False
plot.grid.grid_line_color = None

show(plot)
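
The wedge angles in the pie chart follow from simple arithmetic: each item's share of the total count is scaled to a full circle of 2π radians, and running totals give the start and end of each wedge. The sketch below reproduces that calculation in plain Python, using the counts from the top-10 query output; it approximates what the `cumsum` transform computes and does not call Bokeh.

```python
from itertools import accumulate
from math import isclose, pi

# Counts copied from the top-10 query output above.
counts = {
    'Coffee': 5471, 'Bread': 3325, 'Tea': 1435, 'Cake': 1025, 'Pastry': 856,
    'Sandwich': 771, 'Medialuna': 616, 'Hot chocolate': 590,
    'Cookies': 540, 'Brownie': 379,
}

total = sum(counts.values())
angles = [count / total * 2 * pi for count in counts.values()]

# cumsum('angle') gives each wedge's end angle; include_zero=True shifts
# the running total by one position to give each wedge's start angle.
end_angles = list(accumulate(angles))
start_angles = [0.0] + end_angles[:-1]

for item, start, end in zip(counts, start_angles, end_angles):
    print(f'{item:<14}{start:6.3f} -> {end:6.3f} rad')
```

Each wedge starts where the previous one ends, and the last end angle closes the circle at 2π.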

## Analyze Data with Spark SQL
Perform further basic analysis of the DataFrame's bakery data using Spark SQL.

In [21]:
sql_query = "SELECT transaction, CAST(CONCAT(date,' ',time) as timestamp) as timestamp, item " + \
            "FROM tmp_bakery " + \
            "WHERE item NOT LIKE 'NONE' " + \
            "ORDER BY transaction ASC, item ASC"

df6 = spark.sql(sql_query)
print('DataFrame rows: %d' % df6.count())
print('DataFrame schema: %s' % df6)
df6.show(10, False)

DataFrame rows: 20507
DataFrame schema: DataFrame[transaction: int, timestamp: timestamp, item: string]
+-----------+-------------------+-------------+
|transaction|timestamp          |item         |
+-----------+-------------------+-------------+
|1          |2016-10-30 09:58:11|Bread        |
|2          |2016-10-30 10:05:34|Scandinavian |
|2          |2016-10-30 10:05:34|Scandinavian |
|3          |2016-10-30 10:07:57|Cookies      |
|3          |2016-10-30 10:07:57|Hot chocolate|
|3          |2016-10-30 10:07:57|Jam          |
|4          |2016-10-30 10:08:41|Muffin       |
|5          |2016-10-30 10:13:03|Bread        |
|5          |2016-10-30 10:13:03|Coffee       |
|5          |2016-10-30 10:13:03|Pastry       |
+-----------+-------------------+-------------+
only showing top 10 rows
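
The `CAST(CONCAT(date,' ',time) as timestamp)` expression joins the two string columns with a space and parses the result as a timestamp. For a single row, a plain-Python equivalent might look like the sketch below; `to_timestamp` is a hypothetical helper name used here for illustration.

```python
from datetime import datetime


def to_timestamp(date_text, time_text):
    """Join date and time strings with a space, then parse them as one timestamp."""
    return datetime.strptime(f'{date_text} {time_text}', '%Y-%m-%d %H:%M:%S')


stamp = to_timestamp('2016-10-30', '09:58:11')
print(stamp)  # 2016-10-30 09:58:11
```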



## Read and Write Data to Parquet Format
Write the resulting DataFrame to Parquet files, then read them back into a new DataFrame.

In [22]:
df6.write.parquet('output/bakery_parquet', mode='overwrite')

In [23]:
! ls -lh output/bakery_parquet 2>&1 | head -10
! echo 'Parquet Files:' $(ls output/bakery_parquet/*.parquet | wc -l)

total 800K
-rw-r--r-- 1 garystaf users 1.9K Dec  3 22:50 part-00000-fb8e241d-d822-4ac2-b57f-090c71ce9193-c000.snappy.parquet
-rw-r--r-- 1 garystaf users 2.0K Dec  3 22:50 part-00001-fb8e241d-d822-4ac2-b57f-090c71ce9193-c000.snappy.parquet
-rw-r--r-- 1 garystaf users 1.8K Dec  3 22:50 part-00002-fb8e241d-d822-4ac2-b57f-090c71ce9193-c000.snappy.parquet
-rw-r--r-- 1 garystaf users 2.0K Dec  3 22:50 part-00003-fb8e241d-d822-4ac2-b57f-090c71ce9193-c000.snappy.parquet
-rw-r--r-- 1 garystaf users 1.9K Dec  3 22:50 part-00004-fb8e241d-d822-4ac2-b57f-090c71ce9193-c000.snappy.parquet
-rw-r--r-- 1 garystaf users 1.9K Dec  3 22:50 part-00005-fb8e241d-d822-4ac2-b57f-090c71ce9193-c000.snappy.parquet
-rw-r--r-- 1 garystaf users 2.0K Dec  3 22:50 part-00006-fb8e241d-d822-4ac2-b57f-090c71ce9193-c000.snappy.parquet
-rw-r--r-- 1 garystaf users 1.9K Dec  3 22:50 part-00007-fb8e241d-d822-4ac2-b57f-090c71ce9193-c000.snappy.parquet
-rw-r--r-- 1 garystaf users 2.1K Dec  3 22:50 part-00008-fb8e241d-d822-4ac2-b
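
The part files can also be counted from Python instead of the shell. The following is a minimal sketch using `pathlib`; `parquet_part_files` is a hypothetical helper, and it assumes the part files sit directly inside the output directory, as in the listing above.

```python
from pathlib import Path


def parquet_part_files(directory):
    """Return the sorted *.parquet files directly inside the given directory."""
    return sorted(Path(directory).glob('*.parquet'))


# Example call, assuming the write step above has already run:
# print('Parquet Files:', len(parquet_part_files('output/bakery_parquet')))
```

Matching on the `.parquet` suffix leaves out other files in the directory, such as a `_SUCCESS` marker.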

In [24]:
df7 = spark.read.parquet('output/bakery_parquet')
print('DataFrame rows: %d' % df7.count())
print('DataFrame schema: %s' % df7)
df7.select(df7.transaction, df7.timestamp, df7.item) \
    .sort(df7.transaction, df7.item) \
    .show(10, False)

DataFrame rows: 20507
DataFrame schema: DataFrame[transaction: int, timestamp: timestamp, item: string]
+-----------+-------------------+-------------+
|transaction|timestamp          |item         |
+-----------+-------------------+-------------+
|1          |2016-10-30 09:58:11|Bread        |
|2          |2016-10-30 10:05:34|Scandinavian |
|2          |2016-10-30 10:05:34|Scandinavian |
|3          |2016-10-30 10:07:57|Cookies      |
|3          |2016-10-30 10:07:57|Hot chocolate|
|3          |2016-10-30 10:07:57|Jam          |
|4          |2016-10-30 10:08:41|Muffin       |
|5          |2016-10-30 10:13:03|Bread        |
|5          |2016-10-30 10:13:03|Coffee       |
|5          |2016-10-30 10:13:03|Pastry       |
+-----------+-------------------+-------------+
only showing top 10 rows

