# PySpark Demo Notebook
## Demo
1. Run PostgreSQL Script
2. Load PostgreSQL Data
3. Create a New Record
4. Append New Record to Database Table
5. Read CSV-Format File
6. Overwrite Data to Database Table
7. Analyze Data with Spark SQL
8. Graph Data with BokehJS
9. Read and Write Data to Parquet Format

_Prepared by: [Gary A. Stafford](https://twitter.com/GaryStafford)   
Associated article: https://wp.me/p1RD28-61V_

### Run PostgreSQL Script
Run the SQL script to create the database schema and import data from the CSV file

In [4]:
%run -i '03_load_sql.py'

OperationalError: could not translate host name "postgres" to address: nodename nor servname provided, or not known
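The `OperationalError` above reports that the host name `postgres` could not be resolved from the machine running the notebook. One plausible cause, which the output alone does not confirm, is that the name is only defined inside a container network. The following is a minimal sketch for checking name resolution before running the script; the helper name `can_resolve` is hypothetical and not part of the original notebook.

```python
import socket


def can_resolve(host: str, port: int = 5432) -> bool:
    """Return True if `host` resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(host, port)) > 0
    except socket.gaierror:
        # Raised when the name cannot be translated to an address,
        # the same condition reported in the error above.
        return False
```

Calling `can_resolve('postgres')` from the notebook host should return `False` in the situation shown above, and `True` once the name is reachable.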


### Load PostgreSQL Data
Load the PostgreSQL 'transactions' table's contents into a DataFrame

In [5]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

In [6]:
spark = SparkSession \
    .builder \
    .appName('pyspark_demo_app') \
    .config('spark.driver.extraClassPath',
            'postgresql-42.2.8.jar') \
    .getOrCreate()

Exception: Java gateway process exited before sending its port number
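This exception is raised when PySpark cannot start its JVM. A missing Java runtime or a misconfigured `JAVA_HOME` is a common cause, although the output above does not say which applies here. As a rough diagnostic sketch (the function name `java_diagnostics` is hypothetical), the local Java setup can be inspected before creating the session:

```python
import os
import shutil


def java_diagnostics() -> dict:
    """Collect basic facts about the local Java installation."""
    java_home = os.environ.get("JAVA_HOME")
    return {
        # Path of the `java` executable, or None when it is not on PATH
        "java_on_path": shutil.which("java"),
        # Value of JAVA_HOME, or None when the variable is unset
        "java_home": java_home,
        # True only when JAVA_HOME is set and points at an existing directory
        "java_home_exists": bool(java_home) and os.path.isdir(java_home),
    }


print(java_diagnostics())
```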

In [None]:
properties = {
    'driver': 'org.postgresql.Driver',
    'url': 'jdbc:postgresql://postgres:5432/bakery',
    'user': 'postgres',
    'password': 'postgres1234',
    'dbtable': 'transactions',
}
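The dictionary above hardcodes the connection details, including the password. As an alternative sketch, the same settings could be assembled from environment variables. The variable names (`PG_HOST`, `PG_PORT`, `PG_DATABASE`, `PG_USER`, `PG_PASSWORD`) and the helper `jdbc_properties` are assumptions made for illustration; the defaults mirror the values used in the notebook.

```python
import os


def jdbc_properties(env=os.environ) -> dict:
    """Build the JDBC settings, letting environment variables override defaults."""
    host = env.get("PG_HOST", "postgres")        # hypothetical variable names
    port = env.get("PG_PORT", "5432")
    database = env.get("PG_DATABASE", "bakery")
    return {
        "driver": "org.postgresql.Driver",
        "url": f"jdbc:postgresql://{host}:{port}/{database}",
        "user": env.get("PG_USER", "postgres"),
        "password": env.get("PG_PASSWORD", ""),  # no password stored in the notebook
        "dbtable": "transactions",
    }


example = jdbc_properties({"PG_PASSWORD": "example-only"})
print(example["url"])
```

With no overrides for host, port, or database, the URL matches the one in the cell above.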

In [None]:
df1 = spark.read \
    .format('jdbc') \
    .option('driver', properties['driver']) \
    .option('url', properties['url']) \
    .option('user', properties['user']) \
    .option('password', properties['password']) \
    .option('dbtable', properties['dbtable']) \
    .load()

In [None]:
%%time
print('DataFrame rows: %d' % df1.count())
print('DataFrame schema: %s' % df1)
df1.show(10, False)

### Create a New Record
Create a new bakery record and load it into a DataFrame

In [None]:
bakery_schema = StructType([
    StructField('date', StringType(), True),
    StructField('time', StringType(), True),
    StructField('transaction', IntegerType(), True),
    StructField('item', StringType(), True)
])

In [None]:
data = [('2016-10-30', '10:13:27', 2, 'Pastry')]
df2 = spark.createDataFrame(data, bakery_schema)

In [None]:
print('DataFrame rows: %d' % df2.count())
print('DataFrame schema: %s' % df2)
df2.show(10, False)
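Spark matches the values of each tuple to the schema fields by position, so the order and types in `data` have to line up with `bakery_schema`. The sketch below is a plain-Python stand-in for that check, not Spark's own validation; `FIELDS` and `check_row` are hypothetical names introduced for illustration.

```python
# Plain-Python stand-in for the schema: (field name, expected Python type).
FIELDS = [("date", str), ("time", str), ("transaction", int), ("item", str)]


def check_row(row: tuple) -> list:
    """Return a list of problems; an empty list means the row fits the fields."""
    if len(row) != len(FIELDS):
        return [f"expected {len(FIELDS)} values, got {len(row)}"]
    problems = []
    for (name, expected), value in zip(FIELDS, row):
        # None is accepted because every field is declared nullable (True).
        if value is not None and not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


print(check_row(("2016-10-30", "10:13:27", 2, "Pastry")))
print(check_row(("2016-10-30", "10:13:27", "2", "Pastry")))
```

The first call uses the record from the cell above; the second passes the transaction number as a string to show a mismatch being reported.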

### Append New Record to Database Table
Append the contents of the DataFrame to the PostgreSQL 'transactions' table

In [None]:
df2.write \
    .format('jdbc') \
    .option('driver', properties['driver']) \
    .option('url', properties['url']) \
    .option('user', properties['user']) \
    .option('password', properties['password']) \
    .option('dbtable', properties['dbtable']) \
    .mode('append') \
    .save()

In [None]:
# df1 is evaluated lazily, so count() re-reads the table and includes the appended row
print('DataFrame rows: %d' % df1.count())

### Read CSV-Format File
Read the CSV-format data file into a DataFrame, applying the bakery schema

In [None]:
df3 = spark.read \
    .format('csv') \
    .option('header', 'true') \
    .load('BreadBasket_DMS.csv', schema=bakery_schema)

In [None]:
print('DataFrame rows: %d' % df3.count())
print('DataFrame schema: %s' % df3)
df3.show(10, False)

### Overwrite Data to Database Table
Overwrite the contents of the PostgreSQL 'transactions' table with the contents of the DataFrame

In [None]:
df3.write \
    .format('jdbc') \
    .option('driver', properties['driver']) \
    .option('url', properties['url']) \
    .option('user', properties['user']) \
    .option('password', properties['password']) \
    .option('dbtable', properties['dbtable']) \
    .option('truncate', 'true') \
    .mode('overwrite') \
    .save()

In [None]:
print('DataFrame rows: %d' % df1.count())
print('DataFrame schema: %s' % df1)
df1.show(10, False)
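The two write modes used so far differ in what happens to existing rows: `append` keeps them, while `overwrite` replaces them, and the `truncate` option asks Spark to empty the existing table rather than drop and recreate it. The sketch below imitates that behavior with an in-memory SQLite database from the standard library. It is an analogy only, not how Spark's JDBC writer is implemented, and the helper names and rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE transactions (date TEXT, time TEXT, "transaction" INTEGER, item TEXT)'
)


def append(rows):
    """Counterpart of mode('append'): keep existing rows and add new ones."""
    conn.executemany("INSERT INTO transactions VALUES (?, ?, ?, ?)", rows)


def overwrite_with_truncate(rows):
    """Counterpart of mode('overwrite') with truncate: empty the table, keep its definition."""
    conn.execute("DELETE FROM transactions")  # SQLite has no TRUNCATE statement
    append(rows)


def row_count():
    return conn.execute("SELECT COUNT(*) FROM transactions").fetchone()[0]


# Illustrative rows only, not taken from the real data set.
append([("2016-10-30", "09:00:00", 1, "Bread"), ("2016-10-30", "09:05:00", 2, "Tea")])
append([("2016-10-30", "10:13:27", 2, "Pastry")])
count_after_append = row_count()

overwrite_with_truncate([("2016-10-30", "09:00:00", 1, "Bread")])
count_after_overwrite = row_count()

print(count_after_append, count_after_overwrite)
```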

### Analyze Data with Spark SQL
Analyze the DataFrame's bakery data using Spark SQL

In [None]:
df1.createOrReplaceTempView('tmp_bakery')
sql_query = 'SELECT * FROM tmp_bakery ' + \
            'ORDER BY transaction, date, time'
df4 = spark.sql(sql_query)
print('DataFrame rows: %d' % df4.count())
df4.show(10, False)

In [None]:
sql_query = 'SELECT COUNT(DISTINCT item) AS item_count FROM tmp_bakery'
df5 = spark.sql(sql_query)
df5.show(10, False)
sql_query = "SELECT item, count(*) as count " + \
            "FROM tmp_bakery " + \
            "WHERE item NOT LIKE 'NONE' " + \
            "GROUP BY item ORDER BY count DESC " + \
            "LIMIT 10"

df5 = spark.sql(sql_query)
df5.show(10, False)
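To see what the top-10 query returns without a running Spark session, the same SQL text can be tried against a small in-memory SQLite table. This is a rough sketch: the rows are invented for illustration, and SQLite's `LIKE` is case-insensitive for ASCII by default, which does not matter for this sample.

```python
import sqlite3

# Invented sample rows; 'NONE' marks a transaction line without an item.
rows = [("Coffee",), ("Coffee",), ("Coffee",), ("Bread",), ("Bread",), ("NONE",), ("Tea",)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tmp_bakery (item TEXT)")
conn.executemany("INSERT INTO tmp_bakery VALUES (?)", rows)

# Same statement as in the cell above: count rows per item, skip 'NONE',
# and keep the ten most frequent items.
top_items = conn.execute(
    "SELECT item, count(*) AS count "
    "FROM tmp_bakery "
    "WHERE item NOT LIKE 'NONE' "
    "GROUP BY item ORDER BY count DESC "
    "LIMIT 10"
).fetchall()

print(top_items)
```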

### Graph Data with BokehJS
Create a vertical bar chart displaying DataFrame data

In [None]:
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from bokeh.models import ColumnDataSource
from bokeh.transform import factor_cmap
from bokeh.palettes import Paired12

output_notebook()

source = ColumnDataSource(data=df5.toPandas())

tooltips = [('item', '@item'), ('count', '@{count}{,}')]

items = source.data['item'].tolist()
color_map = factor_cmap(field_name='item', palette=Paired12, factors=items)
plot = figure(x_range=items, plot_width=750, plot_height=375, min_border=0, tooltips=tooltips)
plot.vbar(x='item', bottom=0, top='count', source=source, width=0.9, fill_color=color_map)
plot.title.text = 'Top 10 Bakery Items'
plot.xaxis.axis_label = 'Bakery Items'
plot.yaxis.axis_label = 'Total Items Sold'

show(plot)

### Analyze Data with Spark SQL
Combine the date and time columns into a single timestamp and remove duplicate rows using Spark SQL

In [None]:
sql_query = "SELECT CONCAT(date,' ',time) as timestamp, transaction, item " + \
            "FROM tmp_bakery " + \
            "WHERE item NOT LIKE 'NONE' " + \
            "ORDER BY transaction"
df6 = spark.sql(sql_query)
print('DataFrame rows: %d' % df6.count())
print('DataFrame schema: %s' % df6)
df6.show(10, False)

In [None]:
# Convert the concatenated string column to a proper timestamp type
df7 = df6.withColumn('timestamp', to_timestamp(df6.timestamp, 'yyyy-MM-dd HH:mm:ss'))
print('DataFrame rows: %d' % df7.count())
print('DataFrame schema: %s' % df7)
df7.show(10, False)
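The pattern `yyyy-MM-dd HH:mm:ss` describes a four-digit year, two-digit month and day, and a 24-hour time. As a small plain-Python sketch of the same conversion, using `datetime.strptime` with the equivalent format `%Y-%m-%d %H:%M:%S` (the helper name `combine` is hypothetical):

```python
from datetime import datetime


def combine(date: str, time: str) -> datetime:
    """Join the date and time strings and parse them as one timestamp."""
    return datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")


# Values taken from the record created earlier in the notebook.
ts = combine("2016-10-30", "10:13:27")
print(ts.isoformat())
```

Note that `strptime` raises a `ValueError` when the input does not match the format.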

In [None]:
df7.createOrReplaceTempView('tmp_bakery')
sql_query = "SELECT DISTINCT * " + \
            "FROM tmp_bakery " + \
            "WHERE item NOT LIKE 'NONE' " + \
            "ORDER BY transaction DESC"
df8 = spark.sql(sql_query)
print('DataFrame rows: %d' % df8.count())
print('DataFrame schema: %s' % df8)
df8.show(10, False)

### Read and Write Data to Parquet Format
Read and write DataFrame data to Parquet format files

In [None]:
df8.write.parquet('output/bakery_parquet', mode='overwrite')

In [None]:
! ls -lh output/bakery_parquet 2>&1 | head -10
! echo 'Parquet Files:' $(ls output/bakery_parquet/*.parquet | wc -l)
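Spark writes Parquet output as a directory that holds one or more part files alongside a `_SUCCESS` marker. Counting the part files can also be done in Python instead of the shell. The sketch below assumes the part files follow the usual `part-*.parquet` naming and uses a temporary directory with empty stand-in files so it runs without Spark; `count_part_files` is a hypothetical helper.

```python
import tempfile
from pathlib import Path


def count_part_files(directory) -> int:
    """Count files named like Spark's Parquet part files in `directory`."""
    return sum(1 for p in Path(directory).glob("part-*.parquet") if p.is_file())


# Stand-in directory with empty files named like typical Spark output (illustrative only).
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "_SUCCESS",
        "part-00000-example.snappy.parquet",
        "part-00001-example.snappy.parquet",
    ):
        (Path(tmp) / name).touch()
    demo_count = count_part_files(tmp)

print("Parquet Files:", demo_count)
```

In the notebook, `count_part_files('output/bakery_parquet')` would be the corresponding call.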

In [None]:
df9 = spark.read.parquet('output/bakery_parquet')
print('DataFrame rows: %d' % df9.count())
print('DataFrame schema: %s' % df9)
df9.show(10, False)