# Atwater Customer Recommendations

#### A demo using DataStax Enterprise Analytics, Apache Cassandra, Apache Spark, Python, and Jupyter Notebooks that uses large volumes of customer data to recommend items to customers with a high degree of accuracy

### Things to Set Up
* Install DSE https://docs.datastax.com/en/install/doc/install60/installTOC.html
* Start DSE Analytics Cluster
* Using Python 2.7
* Using DSE Analytics 6
* Using latest version of Jupyter
* Find full path to <>/lib/pyspark.zip
* Find full path to <>/lib/py4j-0.10.4-src.zip
* Start Jupyter with DSE to get all environment variables: dse exec jupyter notebook
* Make sure that the CSV files are in the same location as this notebook
* !pip install cassandra-driver
* !pip install pattern
* !pip install pandas
* Counterintuitively, do not install pyspark! The notebook uses the version that ships with DSE.

#### Add some environment variables to find the DSE version of pyspark. Edit these variables with your path.

In [29]:
pysparkzip = "/usr/share/dse/spark/python/lib/pyspark.zip"
py4jzip = "/usr/share/dse/spark/python/lib/py4j-0.10.4-src.zip"

In [30]:
# Needed to be able to find the pyspark libraries
import sys
sys.path.append(pysparkzip)
sys.path.append(py4jzip)
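
The py4j file name embeds a version number, so a hard-coded path can break between DSE releases. The following is a minimal sketch of one way to locate both archives, assuming they sit together in a single `lib` directory. The helper name `find_spark_zips` is hypothetical and not part of DSE or pyspark.

```python
import glob
import os


def find_spark_zips(lib_dir):
    """Return (pyspark_zip, py4j_zip) paths found under lib_dir.

    Hypothetical helper: raises IOError if either archive is missing.
    """
    pyspark_zip = os.path.join(lib_dir, "pyspark.zip")
    # The py4j file name embeds a version number, so match it with a wildcard.
    py4j_matches = sorted(glob.glob(os.path.join(lib_dir, "py4j-*-src.zip")))
    if not os.path.isfile(pyspark_zip) or not py4j_matches:
        raise IOError("pyspark.zip or py4j-*-src.zip not found in " + lib_dir)
    return pyspark_zip, py4j_matches[-1]
```

With the paths used in this notebook, the call would be `pysparkzip, py4jzip = find_spark_zips("/usr/share/dse/spark/python/lib")`.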

#### Import python packages -- all are required

In [33]:
import pandas
import cassandra
import pyspark
import re
import os
from IPython.display import display, Markdown
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, RegexTokenizer, StopWordsRemover
from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType
from pattern.en import sentiment, positive

#### Helper function to have nicer formatting of Spark DataFrames

In [34]:
#Helper for pretty formatting for Spark DataFrames
def showDF(df, limitRows =  5, truncate = True):
    if(truncate):
        pandas.set_option('display.max_colwidth', 50)
    else:
        pandas.set_option('display.max_colwidth', -1)
    pandas.set_option('display.max_rows', limitRows)
    display(df.limit(limitRows).toPandas())
    pandas.reset_option('display.max_rows')

### Creating Tables and Loading Data

#### Connect to DSE Analytics Cluster

In [35]:
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1']) #If you have a locally installed DSE cluster
session = cluster.connect()

#### Create Demo Keyspace. The replication factor is 1 because this demo cluster has only one node. In production, a replication factor of 3 is recommended, with consistency levels chosen so that Write Consistency + Read Consistency > Replication Factor.

In [190]:
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo 
    WITH REPLICATION = 
    { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }"""
)

<cassandra.cluster.ResultSet at 0x7bbdc10>
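
The rule Write Consistency + Read Consistency > Replication Factor can be checked with simple arithmetic. This is an illustrative sketch only; the function names are made up for this example and are not part of the Cassandra driver.

```python
def quorum(replication_factor):
    """Number of replicas in a majority (QUORUM) for a replication factor."""
    return replication_factor // 2 + 1


def reads_see_latest_write(write_replicas, read_replicas, replication_factor):
    """True when every read set must overlap every write set (W + R > RF)."""
    return write_replicas + read_replicas > replication_factor


rf = 3
w = r = quorum(rf)  # 2 replicas each when RF = 3
print(reads_see_latest_write(w, r, rf))  # True: 2 + 2 > 3
print(reads_see_latest_write(1, 1, rf))  # False: 1 + 1 is not > 3
```

With the single-node demo keyspace (RF = 1), writing and reading at one replica already satisfies the rule, since 1 + 1 > 1.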

#### Set keyspace 

In [233]:
session.set_keyspace('demo')

### Create the customer transaction table in DSE (this is for completed transactions).  This table will be updated with about 1000 transactions a minute (Atwater has around 200,000 transactions a day on their website)
#### Our partition key will be state (limiting our analysis to just the US), and our clustering columns will be gender, age, and the transaction id. Consider your data model carefully when choosing your primary key. This choice gives us a good distribution of the data and a unique row for each transaction. We will also be able to build models around age, gender, and state to give the best possible recommendations.

In [234]:
query = "CREATE TABLE IF NOT EXISTS customer_transactions (id int, \
                                                            customer_name text, \
                                                            gender text, age int, \
                                                            state text, home_store int, \
                                                            items list<text>, year int, \
                                                            month int, rewards_member text, \
                                                            PRIMARY KEY ((state), gender, age, id))"
session.execute(query)


<cassandra.cluster.ResultSet at 0x7c07b90>
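
To make the effect of `PRIMARY KEY ((state), gender, age, id)` concrete, here is a small pure-Python sketch that mimics how rows are grouped by partition key and ordered by clustering columns. It is a simplified model for illustration, not how Cassandra stores data internally, and the `layout` function is hypothetical. The sample rows come from `customer.csv`.

```python
from collections import defaultdict

# (state, gender, age, id, customer_name): sample rows from customer.csv
rows = [
    ("CA", "M", 14, 1, "Toby Moran"),
    ("CA", "M", 7, 2, "Rocky Bucaojit"),
]


def layout(rows):
    """Group rows by partition key (state), sorted by clustering columns."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[0]].append(row)
    for state in partitions:
        # Clustering order: gender, then age, then id.
        partitions[state].sort(key=lambda r: (r[1], r[2], r[3]))
    return dict(partitions)


for state, part in layout(rows).items():
    print("%s: %s" % (state, [r[4] for r in part]))
```

Both sample rows land in the `CA` partition, and the younger customer sorts first because age precedes id in the clustering order.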

### Create the live customer table in DSE - this represents live customers that are currently logged in on the site and what they have in their shopping cart. 
#### We will use that information to get a prediction of what we should recommend for them.  The data model is the same as above. 

In [235]:
query = "CREATE TABLE IF NOT EXISTS customer_live (id int, \
                                                            customer_name text, \
                                                            gender text, age int, \
                                                            state text, home_store int, \
                                                            items list<text>, year int, \
                                                            month int, rewards_member text, \
                                                            PRIMARY KEY ((state), gender, age, id))"
session.execute(query)

<cassandra.cluster.ResultSet at 0x7bf2090>

### Create the Customer Recommendation Table in DSE
#### This table will be used with the inventory table so the website can show the correct, in-stock items. In reality this information probably would not be written back to a Cassandra table, as it doesn't need to be stored long-term; it is done here for the sake of the demo, to show fast reads and fast writes.

In [236]:
query = "CREATE TABLE IF NOT EXISTS customer_recommend (id int, \
                                                            customer_name text, \
                                                            gender text, age int, \
                                                            state text, home_store int, \
                                                            items list<text>, year int, \
                                                            month int, rewards_member text,\
                                                            prediction list<text>,\
                                                            PRIMARY KEY ((id, state), gender, age))"
session.execute(query)

<cassandra.cluster.ResultSet at 0x7c1f810>

### Create the Inventory Table in DSE 
#### Our primary key is built from the item type (pants, shirts, blender), the location of the items, the SKU, and whether the item is currently available. While customers may want to look at items that are on back-order, we do not want to recommend them, as this would only cause frustration. This table would hold around 6 million entries at any one time, with inserts and deletions daily.

In [237]:
query = "CREATE TABLE IF NOT EXISTS inventory (sku int, \
                                               item_name text, item_type text, \
                                               stock_loc text, num_items int, \
                                               backorder text, \
                                               PRIMARY KEY (item_type, stock_loc, sku, backorder))"
session.execute(query)

<cassandra.cluster.ResultSet at 0x7c6f150>

#### Load CSV files into DSE for Customer Transactions, Customer Live/Shopping Cart and Inventory Tables
##### Note: you could also use a bulk loader or a loop with INSERT statements

In [238]:
!head -n 2 customer.csv
!cat loadCustomer.cql
!cqlsh -f loadCustomer.cql

1|Toby Moran|M|14|CA|20|['Collar','Sweater','Bed']|2014|12|N
2|Rocky Bucaojit|M|7|CA|21|['Collar', 'Bed', 'Bowl']|2014|11|N
COPY demo.customer_transactions( id, customer_name, gender, age, state, home_store, items, year, month, rewards_member) FROM 'customer.csv' WITH DELIMITER = '|';
Using 3 child processes

Starting copy of demo.customer_transactions with columns [id, customer_name, gender, age, state, home_store, items, year, month, rewards_member].
loadCustomer.cql:2:Failed to import 1 rows: ParseError - Invalid row length 0 should be 10,  given up without retries
loadCustomer.cql:2:Failed to process 1 rows; failed rows written to import_demo_customer_transactions.err
Processed: 3 rows; Rate:       5 rows/s; Avg. rate:       7 rows/s
3 rows imported from 1 files in 0.406 seconds (0 skipped).


In [240]:
!head -n 2 customerTest.csv
!cat loadCustomerTest.cql
!cqlsh -f loadCustomerTest.cql

3|Max Moran|M|7|CA|24|['Bed']|2015|02|Y
COPY demo.customer_live( id, customer_name, gender, age, state, home_store, items, year, month, rewards_member) FROM 'customerTest.csv' WITH DELIMITER = '|';
Using 3 child processes

Starting copy of demo.customer_live with columns [id, customer_name, gender, age, state, home_store, items, year, month, rewards_member].
Processed: 1 rows; Rate:       3 rows/s; Avg. rate:       3 rows/s
Processed: 1 rows; Rate:       2 rows/s; Avg. rate:       3 rows/s
1 rows imported from 1 files in 0.388 seconds (0 skipped).


In [241]:
!head -n 2 inventory.csv
!cat loadInventory.cql
!cqlsh -f loadInventory.cql

1|Fancy Collar 1|Collar|CA|5|N
2|Fancy Collar 2|Collar|CA|0|Y
COPY demo.inventory( sku, item_name, item_type, stock_loc, num_items, backorder) FROM 'inventory.csv' WITH DELIMITER = '|';
Using 3 child processes

Starting copy of demo.inventory with columns [sku, item_name, item_type, stock_loc, num_items, backorder].
Processed: 3 rows; Rate:      10 rows/s; Avg. rate:      10 rows/s
Processed: 3 rows; Rate:       5 rows/s; Avg. rate:       8 rows/s
3 rows imported from 1 files in 0.398 seconds (0 skipped).
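
As an alternative to `COPY`, the loop-with-inserts approach mentioned above might look like the sketch below. It assumes the pipe-delimited layout shown in `customer.csv` and a connected driver `session`; the helper names `parse_customer_line` and `load_customers` are hypothetical. Blank lines are skipped, which may be what the "Invalid row length 0" message in the first `COPY` output refers to.

```python
import ast

INSERT_CQL = (
    "INSERT INTO customer_transactions "
    "(id, customer_name, gender, age, state, home_store, "
    "items, year, month, rewards_member) "
    "VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)"
)


def parse_customer_line(line):
    """Turn one pipe-delimited line of customer.csv into INSERT parameters."""
    f = line.rstrip("\n").split("|")
    if len(f) != 10:
        raise ValueError("expected 10 fields, got %d" % len(f))
    items = ast.literal_eval(f[6])  # "['Collar','Bed']" -> ['Collar', 'Bed']
    return (int(f[0]), f[1], f[2], int(f[3]), f[4], int(f[5]),
            items, int(f[7]), int(f[8]), f[9])


def load_customers(session, path):
    """Insert every non-empty line of the file using a prepared statement."""
    prepared = session.prepare(INSERT_CQL)
    with open(path) as handle:
        for line in handle:
            if line.strip():  # skip blank lines such as a trailing newline
                session.execute(prepared, parse_customer_line(line))
```

A call such as `load_customers(session, 'customer.csv')` would run after `session.set_keyspace('demo')`. For large files a bulk loader is likely to be faster than row-by-row inserts.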


#### Run a SELECT * on the customer transaction table to verify that the values have been inserted into the DSE table. Because state is the partition key, we can filter on it in the WHERE clause.

In [246]:
query = "SELECT * FROM customer_transactions WHERE state='CA' limit 10"
print (query)  # echo the query so the output shows what was run
rows = session.execute(query)
for user_row in rows:
    print (user_row.id, user_row.items)

SELECT * FROM customer_transactions WHERE state='CA' limit 10
(2, [u'Collar', u'Bed', u'Bowl'])
(1, [u'Collar', u'Sweater', u'Bed'])


### Finally, Time for Some Analytics!

#### Create a Spark session that is connected to Cassandra. From there, load each table into a Spark DataFrame and count the number of rows in each.

In [247]:
spark = SparkSession.builder.appName('demo').master("local").getOrCreate()

tableDF = spark.read.format("org.apache.spark.sql.cassandra").options(table="customer_transactions", keyspace="demo").load()

testDF = spark.read.format("org.apache.spark.sql.cassandra").options(table="customer_live", keyspace="demo").load()


print "Table Train Count: "
print tableDF.count()
showDF(tableDF)

print "Table Test Count: "
print testDF.count()
showDF(testDF)

Table Train Count: 
2


| | state | gender | age | id | customer_name | home_store | items | month | rewards_member | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CA | M | 7 | 2 | Rocky Bucaojit | 21 | [Collar, Bed, Bowl] | 11 | N | 2014 |
| 1 | CA | M | 14 | 1 | Toby Moran | 20 | [Collar, Sweater, Bed] | 12 | N | 2014 |


Table Test Count: 
1


| | state | gender | age | id | customer_name | home_store | items | month | rewards_member | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CA | M | 7 | 3 | Max Moran | 24 | [Bed] | 2 | Y | 2015 |


#### Use FPGrowth to Find Recommendations

In [248]:
from pyspark.ml.fpm import FPGrowth

fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(tableDF)
recommendDF=model.transform(testDF)
recommendDF.show()

+-----+------+---+---+-------------+----------+-----+-----+--------------+----+----------+
|state|gender|age| id|customer_name|home_store|items|month|rewards_member|year|prediction|
+-----+------+---+---+-------------+----------+-----+-----+--------------+----+----------+
|   CA|     M|  7|  3|    Max Moran|        24|[Bed]|    2|             Y|2015|  [Collar]|
+-----+------+---+---+-------------+----------+-----+-----+--------------+----+----------+
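
To see why only `Collar` is predicted for a cart containing `Bed`, the support and confidence of each candidate rule can be computed by hand on the two training baskets. The sketch below is a simplified illustration limited to single-item rules; it is not Spark's FP-tree implementation, and the function names are made up for this example.

```python
baskets = [
    {"Collar", "Sweater", "Bed"},  # Toby Moran
    {"Collar", "Bed", "Bowl"},     # Rocky Bucaojit
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    hits = sum(1 for b in baskets if itemset <= b)
    return hits / float(len(baskets))


def confidence(antecedent, consequent, baskets):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = support(antecedent | consequent, baskets)
    return both / support(antecedent, baskets)


cart = {"Bed"}
for item in ["Collar", "Sweater", "Bowl"]:
    print("%s: support=%.1f confidence=%.1f" % (
        item, support(cart | {item}, baskets),
        confidence(cart, {item}, baskets)))
```

`Bed -> Collar` has confidence 1.0, while `Bed -> Sweater` and `Bed -> Bowl` each have confidence 0.5, which falls below `minConfidence=0.6`. That is consistent with the single `[Collar]` prediction shown above.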



In [249]:
recommendDF.write.format("org.apache.spark.sql.cassandra").options(table="customer_recommend", keyspace="demo").save(mode="append")

In [250]:
query = 'SELECT * FROM customer_recommend limit 10'
rows = session.execute(query)
for user_row in rows:
    for item in user_row.prediction:
        query = "SELECT * FROM inventory WHERE item_type=%s AND stock_loc=%s"
        items = session.execute(query, (item, user_row.state))
        for item_row in items:
            print "Customer: " + user_row.customer_name + " **Shopping Cart: " + str(user_row.items) + "** --> Current Recommendations: " + item_row.item_name

Customer: Max Moran **Shopping Cart: [u'Bed']** --> Current Recommendations: Fancy Collar 1
Customer: Max Moran **Shopping Cart: [u'Bed']** --> Current Recommendations: Fancy Collar 2
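
The output above includes Fancy Collar 2, which `inventory.csv` marks as back-ordered. One possible way to honor the goal of not recommending back-ordered items is a client-side filter on the rows returned from the inventory query. The sketch below assumes that `backorder == "N"` means the item is available; the function name `in_stock_names` is hypothetical.

```python
# (sku, item_name, item_type, stock_loc, num_items, backorder)
# Sample rows taken from inventory.csv.
inventory = [
    (1, "Fancy Collar 1", "Collar", "CA", 5, "N"),
    (2, "Fancy Collar 2", "Collar", "CA", 0, "Y"),
]


def in_stock_names(inventory_rows, item_type, stock_loc):
    """Names of matching items that are not on back-order."""
    return [name
            for (_sku, name, itype, loc, _count, backorder) in inventory_rows
            if itype == item_type and loc == stock_loc and backorder == "N"]


print(in_stock_names(inventory, "Collar", "CA"))  # ['Fancy Collar 1']
```

Filtering in the client keeps the CQL query unchanged. Filtering on `backorder` in CQL would need a different clustering order or a separate table, since `backorder` comes after `sku` in the primary key.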
