# Demo 5: Collaborative Filtering and Comedy! 
------
<img src="images/seinfeld.jpg" width="400" height="400">

#### Real Dataset: http://eigentaste.berkeley.edu/dataset/ Dataset 2 
#### Rate Jokes: http://eigentaste.berkeley.edu

## What are we trying to learn from this dataset?

# QUESTION:  Can Collaborative Filtering be used to find which jokes to recommend to our users?


In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

In [None]:
import pandas
import cassandra
import pyspark
import re
import os
from IPython.display import IFrame
from IPython.display import display, Markdown
from pyspark.sql import SparkSession
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.sql import Row

#### Helper function to have nicer formatting of Spark DataFrames

In [None]:
# Helper for pretty formatting of Spark DataFrames
def showDF(df, limitRows=10, truncate=False):
    # Cap column width at 100 characters when truncating; otherwise show full values
    pandas.set_option('display.max_colwidth', 100 if truncate else None)
    pandas.set_option('display.max_rows', limitRows)
    display(df.limit(limitRows).toPandas())
    # Restore the pandas defaults so later cells are unaffected
    pandas.reset_option('display.max_rows')
    pandas.reset_option('display.max_colwidth')

## Astra Credentials & Keyspace

Enter the Astra username, password, Secure Connect bundle file name, and keyspace you configured when you signed up at http://astra.datastax.com in the variables below. It's that easy!

In [None]:
astraUsername      = 'username'
astraPassword      = 'password'
astraSecureConnect = 'file name with .zip extension'
astraKeyspace      = 'your keyspace name'

%store astraUsername astraPassword astraSecureConnect astraKeyspace

## Creating and Loading Tables

### Connect to Cassandra

In [None]:
from cassandra.cluster import Cluster
from cassandra.auth import PlainTextAuthProvider

cloud_config = {
    'secure_connect_bundle': '/home/jovyan/secureconnect/'+astraSecureConnect
}
auth_provider = PlainTextAuthProvider(username=astraUsername, password=astraPassword)
cluster = Cluster(cloud=cloud_config, auth_provider=auth_provider)
session = cluster.connect()

### Set keyspace 

In [None]:
session.set_keyspace(astraKeyspace)

### Create a table called `jokes`. Its PRIMARY KEY is the composite key (userid, jokeid): `userid` is the partition key, which spreads the data across the cluster by user, and `jokeid` is the clustering column, which makes each row unique within a user's partition. Remember that the WHERE clause of our CQL queries will need to use this PRIMARY KEY, starting with the partition key `userid`.

In [None]:
query = "CREATE TABLE IF NOT EXISTS jokes \
                                    (userid int, jokeid int, rating float, \
                                     PRIMARY KEY (userid, jokeid))"
session.execute(query)
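To make the role of the composite key concrete, here is a minimal in-memory sketch in plain Python. It is a toy model, not the Cassandra driver or Cassandra's storage engine, and `ToyJokesTable`, `insert`, and `select_partition` are hypothetical names invented for this illustration. It shows two consequences of `PRIMARY KEY (userid, jokeid)`: writing the same key twice overwrites the earlier row, and a lookup by `userid` returns that user's rows ordered by `jokeid`.

```python
from collections import defaultdict


class ToyJokesTable:
    """Toy model of PRIMARY KEY (userid, jokeid); not how Cassandra stores data."""

    def __init__(self):
        # partition key (userid) -> {clustering column (jokeid) -> rating}
        self.partitions = defaultdict(dict)

    def insert(self, userid, jokeid, rating):
        # The same (userid, jokeid) replaces the earlier rating, like a CQL upsert.
        self.partitions[userid][jokeid] = rating

    def select_partition(self, userid):
        # Mirrors SELECT * FROM jokes WHERE userid = ?; rows are ordered by jokeid.
        rows = self.partitions.get(userid, {})
        return [(userid, jokeid, rows[jokeid]) for jokeid in sorted(rows)]


table = ToyJokesTable()
table.insert(65, 7, 3.5)
table.insert(65, 2, -1.25)
table.insert(65, 7, 9.0)   # same key: replaces the 3.5 rating
table.insert(12, 7, 0.5)   # different user: separate partition

print(table.select_partition(65))  # [(65, 2, -1.25), (65, 7, 9.0)]
```

The ratings above are made up; the point is only that a user can hold one rating per joke.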

### What do these 3 columns represent?

* **Column 1**: User id
* **Column 2**: Joke id
* **Column 3**: Rating of joke (-10.00 to 10.00)

### Load Jokes dataset from CSV file (jester_ratings3.csv)
* This is a file I created from the `*.dat` file. It contains only 10,000 rows; the full dataset has over 1 million.
<img src="images/laughing.gif" width="300" height="300">

#### Insert all the Joke Rating Data into the table `jokes`

In [None]:
fileName = 'data/jester_ratings3.csv'

# Build the INSERT statement once; the driver fills in the %s placeholders
query = "INSERT INTO jokes (userid, jokeid, rating) VALUES (%s, %s, %s)"

# The with-block closes the file even if an insert fails
with open(fileName, 'r') as input_file:
    for line in input_file:
        jokeRow = line.split(',')
        session.execute(query, (int(jokeRow[0]), int(jokeRow[1]), float(jokeRow[2])))
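The loading loop above trusts every line to hold three well-formed fields. As a hedged sketch of how the rows could be checked before they are inserted, the example below parses CSV text with the standard `csv` module and rejects rows that are malformed or outside the documented -10.00 to 10.00 range. The helper `parse_rating_row` is a hypothetical name, and the sample rows are invented for illustration, not taken from the real dataset.

```python
import csv
import io


def parse_rating_row(fields):
    """Convert one CSV row into (userid, jokeid, rating), or raise ValueError."""
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}")
    userid, jokeid, rating = int(fields[0]), int(fields[1]), float(fields[2])
    if not -10.0 <= rating <= 10.0:
        raise ValueError(f"rating {rating} outside [-10, 10]")
    return userid, jokeid, rating


# Made-up sample rows: two valid, one out of range, one with a missing field.
sample = io.StringIO("1,5,0.219\n1,7,-9.281\n2,5,99.0\n3,8\n")

good, bad = [], []
for fields in csv.reader(sample):
    try:
        good.append(parse_rating_row(fields))
    except ValueError as err:
        bad.append((fields, str(err)))

print(good)                       # [(1, 5, 0.219), (1, 7, -9.281)]
print(len(bad), "rows rejected")  # 2 rows rejected
```

In the notebook, each tuple in `good` could be passed to `session.execute` in place of the hand-split fields.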

#### Run a SELECT * on `jokes` WHERE userid = x to verify that the data was loaded into the table

In [None]:
query = 'SELECT * FROM jokes WHERE userid = 65'
rows = session.execute(query)
for row in rows:
    print (row.userid, row.jokeid, row.rating)

<img src="images/sparklogo.png" width="150" height="200">

### Finally time for Apache Spark! 

#### Create a Spark session that is connected to Cassandra. From there, load the `jokes` table into a Spark DataFrame and count its rows.

In [None]:
spark = SparkSession \
    .builder \
    .appName('demo') \
    .master("local") \
    .config( \
        "spark.cassandra.connection.config.cloud.path", \
        "file:/home/jovyan/secureconnect/"+astraSecureConnect) \
    .config("spark.cassandra.auth.username", astraUsername) \
    .config("spark.cassandra.auth.password", astraPassword) \
    .getOrCreate()

jokeTable = spark.read.format("org.apache.spark.sql.cassandra").options(table="jokes", keyspace=astraKeyspace).load()

print ("Table Row Count: ")
print (jokeTable.count())

#### Split the dataset into training and testing sets

In [None]:
(training, test) = jokeTable.randomSplit([0.8, 0.2])

training_df = training.withColumn("rating", training.rating.cast('int'))
testing_df = test.withColumn("rating", test.rating.cast('int'))

showDF(training_df)
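`randomSplit([0.8, 0.2])` assigns rows to the two sets at random, so the sizes are only approximately 80% and 20%, and the split changes between runs unless a `seed` is passed as the second argument. The plain-Python sketch below illustrates the idea with an independent uniform draw per row. It is a simplified model rather than Spark's implementation, and `random_split` is a hypothetical helper.

```python
import random


def random_split(rows, weights, seed=None):
    """Assign each row to one split by an independent uniform draw."""
    total = sum(weights)
    # Cumulative upper bounds, e.g. [0.8, 1.0] for weights [0.8, 0.2].
    bounds, running = [], 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, bound in enumerate(bounds):
            if draw < bound:
                splits[index].append(row)
                break
        else:
            # Guard against floating-point rounding in the last bound.
            splits[-1].append(row)
    return splits


rows = list(range(10_000))
training, test = random_split(rows, [0.8, 0.2], seed=42)
print(len(training), len(test))  # roughly 8000 and 2000, not exact
```

Fixing the seed makes the split repeatable, which helps when comparing models across runs.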

### Set up collaborative filtering with ALS

https://spark.apache.org/docs/latest/ml-collaborative-filtering.html

In [None]:
als = ALS(maxIter=5, regParam=0.01, userCol="userid", itemCol="jokeid", ratingCol="rating",
          coldStartStrategy="drop")

model = als.fit(training_df)
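ALS (alternating least squares) approximates the ratings matrix as a product of user factors and item factors. It alternates between fixing the item factors and solving a regularized least-squares problem for each user, then fixing the user factors and solving for each item. The sketch below is a minimal rank-1 version in plain Python, where each solve reduces to a single division. It only illustrates the alternation: Spark's `ALS` is a distributed, higher-rank implementation, and `als_rank1` and the ratings here are invented for the example.

```python
def als_rank1(ratings, n_iters=10, reg=0.01):
    """ratings: dict {(user, item): rating}. Returns (user_factor, item_factor) dicts."""
    users = sorted({u for u, _ in ratings})
    items = sorted({i for _, i in ratings})
    user_f = {u: 1.0 for u in users}
    item_f = {i: 1.0 for i in items}
    for _ in range(n_iters):
        # Fix the item factors and solve for each user factor in closed form.
        for u in users:
            num = sum(r * item_f[i] for (uu, i), r in ratings.items() if uu == u)
            den = reg + sum(item_f[i] ** 2 for (uu, i) in ratings if uu == u)
            user_f[u] = num / den
        # Fix the user factors and solve for each item factor.
        for i in items:
            num = sum(r * user_f[u] for (u, ii), r in ratings.items() if ii == i)
            den = reg + sum(user_f[u] ** 2 for (u, ii) in ratings if ii == i)
            item_f[i] = num / den
    return user_f, item_f


# Made-up ratings generated from taste * funniness; the pair (3, 30) is held out.
taste = {1: 1.0, 2: 2.0, 3: -1.0}
funniness = {10: 2.0, 20: -1.0, 30: 3.0}
ratings = {
    (u, i): taste[u] * funniness[i]
    for u in taste for i in funniness
    if (u, i) != (3, 30)
}

user_f, item_f = als_rank1(ratings)
pred = user_f[3] * item_f[30]
print(round(pred, 2))  # close to the held-out value of -3.0
```

Because the toy data is exactly rank 1, the held-out rating is recovered almost perfectly; real ratings are noisier, which is why the notebook holds out a test set.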

In [None]:
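# Evaluate the model by computing the RMSE on the test data
predictions = model.transform(testing_df)
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("Root-mean-square error = " + str(rmse))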

# Generate top 10 joke recommendations for each user
userRecs = model.recommendForAllUsers(10)

showDF(userRecs)

# Generate top 10 user recommendations for each joke
jokeRecs = model.recommendForAllItems(10)
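The RMSE mentioned in the cell above is the square root of the mean squared difference between actual and predicted ratings, so it is expressed in the same units as the ratings. As a small worked example in plain Python (the `rmse` helper and the rating pairs are made up for illustration):

```python
import math


def rmse(pairs):
    """pairs: iterable of (actual_rating, predicted_rating)."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (actual, predicted) pair")
    squared_errors = [(actual - predicted) ** 2 for actual, predicted in pairs]
    return math.sqrt(sum(squared_errors) / len(pairs))


# Made-up (actual, predicted) pairs; the squared errors are 1, 4, 0, and 4.
example = [(3.0, 2.0), (-1.0, 1.0), (5.0, 5.0), (0.0, -2.0)]
print(rmse(example))  # 1.5
```

On a -10 to 10 rating scale, an RMSE of 1.5 would mean predictions are typically off by about one and a half rating points.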

In [None]:
showDF(userRecs.filter(userRecs.userid == 65))
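Conceptually, a top-N recommendation ranks the items for each user by predicted rating, and in a factorization model a predicted rating is the dot product of the user's factor vector and the item's factor vector. The sketch below shows that ranking step in plain Python. It is not Spark's implementation; `recommend_top_n` is a hypothetical helper, and the rank-2 factors are invented for illustration.

```python
import heapq


def recommend_top_n(user_factors, item_factors, n):
    """Score every item for every user by dot product and keep the n best."""
    recs = {}
    for user, u_vec in user_factors.items():
        scored = (
            (sum(a * b for a, b in zip(u_vec, i_vec)), item)
            for item, i_vec in item_factors.items()
        )
        best = heapq.nlargest(n, scored)
        recs[user] = [(item, score) for score, item in best]
    return recs


# Hypothetical rank-2 factors, invented for illustration.
user_factors = {65: [1.0, 0.5], 66: [-1.0, 2.0]}
item_factors = {10: [2.0, 0.0], 20: [0.0, 2.0], 30: [1.0, 1.0], 40: [-1.0, -1.0]}

top2 = recommend_top_n(user_factors, item_factors, 2)
print(top2[65])  # [(10, 2.0), (30, 1.5)]
print(top2[66])  # [(20, 4.0), (30, 1.0)]
```

The two users receive different top jokes because their factor vectors point in different directions, which is the personalization collaborative filtering is meant to provide.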

In [None]:
IFrame(src='images/init94.html', width=700, height=200)

In [None]:
IFrame(src='images/init43.html', width=700, height=200)

In [None]:
spark.stop()