# HW 5
## Recommendation Systems

## Part 1: Pyspark (40 Points)

### Setup

Let's setup Spark on your Colab environment.  Run the cell below!

In [1]:
!pip install pyspark
!pip install -U -q PyDrive
!apt install openjdk-8-jdk-headless -qq
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"

openjdk-8-jdk-headless is already the newest version (8u312-b07-0ubuntu1~18.04).
The following packages were automatically installed and are no longer required:
  libnvidia-common-460 nsight-compute-2020.2.0
Use 'apt autoremove' to remove them.
0 upgraded, 0 newly installed, 0 to remove and 42 not upgraded.


Now we authenticate a Google Drive client to download the filea we will be processing in our Spark job.

**Make sure to follow the interactive instructions.**

In [2]:
from pydrive.auth import GoogleAuth
from pydrive.drive import GoogleDrive
from google.colab import auth
from oauth2client.client import GoogleCredentials

# Authenticate and create the PyDrive client
auth.authenticate_user()
gauth = GoogleAuth()
gauth.credentials = GoogleCredentials.get_application_default()
drive = GoogleDrive(gauth)

In [3]:
id='1QtPy_HuIMSzhtYllT3-WeM3Sqg55wK_D'
downloaded = drive.CreateFile({'id': id})
downloaded.GetContentFile('MovieLens.training')

id='1ePqnsQTJRRvQcBoF2EhoPU8CU1i5byHK'
downloaded = drive.CreateFile({'id': id})
downloaded.GetContentFile('MovieLens.test')

id='1ncUBWdI5AIt3FDUJokbMqpHD2knd5ebp'
downloaded = drive.CreateFile({'id': id})
downloaded.GetContentFile('MovieLens.item')

If you executed the cells above, you should be able to see the dataset we will use for this Colab under the "Files" tab on the left panel.

Next, we import some of the common libraries needed for our task.

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import pyspark
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark import SparkContext, SparkConf

Let's initialize the Spark context.

In [5]:
# create the session
conf = SparkConf().set("spark.ui.port", "4050")

# create the context
sc = pyspark.SparkContext(conf=conf)
spark = SparkSession.builder.getOrCreate()

You can easily check the current version and get the link of the web interface. In the Spark UI, you can monitor the progress of your job and debug the performance bottlenecks (if your Colab is running with a **local runtime**).

In [6]:
spark

If you are running this Colab on the Google hosted runtime, the cell below will create a *ngrok* tunnel which will allow you to still check the Spark UI.

### Data Loading

In this Colab, we will be using the [MovieLens dataset](https://grouplens.org/datasets/movielens/), specifically the 100K dataset (which contains in total 100,000 ratings from 1000 users on ~1700 movies).

We load the ratings data in a 80%-20% ```training```/```test``` split, while the ```items``` dataframe contains the movie titles associated to the item identifiers.

In [82]:
schema_ratings = StructType([
    StructField("user_id", IntegerType(), False),
    StructField("item_id", IntegerType(), False),
    StructField("rating", IntegerType(), False),
    StructField("timestamp", IntegerType(), False)])

schema_items = StructType([
    StructField("item_id", IntegerType(), False),
    StructField("movie", StringType(), False)])

training = spark.read.option("sep", "\t").csv("MovieLens.training", header=False, schema=schema_ratings)
test = spark.read.option("sep", "\t").csv("MovieLens.test", header=False, schema=schema_ratings)
items = spark.read.option("sep", "|").csv("MovieLens.item", header=False, schema=schema_items)

In [8]:
training.printSchema()

root
 |-- user_id: integer (nullable = true)
 |-- item_id: integer (nullable = true)
 |-- rating: integer (nullable = true)
 |-- timestamp: integer (nullable = true)



In [9]:
test.printSchema()

root
 |-- user_id: integer (nullable = true)
 |-- item_id: integer (nullable = true)
 |-- rating: integer (nullable = true)
 |-- timestamp: integer (nullable = true)



In [10]:
items.printSchema()

root
 |-- item_id: integer (nullable = true)
 |-- movie: string (nullable = true)



### Your task

Let's compute some stats!  What is the number of ratings in the training and test dataset? How many movies are in our dataset?

In [11]:
df1 = training.toPandas()

In [12]:
training.take(5)

[Row(user_id=1, item_id=1, rating=5, timestamp=874965758),
 Row(user_id=1, item_id=2, rating=3, timestamp=876893171),
 Row(user_id=1, item_id=3, rating=4, timestamp=878542960),
 Row(user_id=1, item_id=4, rating=3, timestamp=876893119),
 Row(user_id=1, item_id=5, rating=3, timestamp=889751712)]

In [13]:
items.take(5)

[Row(item_id=1, movie='Toy Story (1995)'),
 Row(item_id=2, movie='GoldenEye (1995)'),
 Row(item_id=3, movie='Four Rooms (1995)'),
 Row(item_id=4, movie='Get Shorty (1995)'),
 Row(item_id=5, movie='Copycat (1995)')]

In [14]:
train = training.join(items, training.item_id == items.item_id).drop(items.item_id)
train.show()

+-------+-------+------+---------+--------------------+
|user_id|item_id|rating|timestamp|               movie|
+-------+-------+------+---------+--------------------+
|      1|      1|     5|874965758|    Toy Story (1995)|
|      1|      2|     3|876893171|    GoldenEye (1995)|
|      1|      3|     4|878542960|   Four Rooms (1995)|
|      1|      4|     3|876893119|   Get Shorty (1995)|
|      1|      5|     3|889751712|      Copycat (1995)|
|      1|      7|     4|875071561|Twelve Monkeys (1...|
|      1|      8|     1|875072484|         Babe (1995)|
|      1|      9|     5|878543541|Dead Man Walking ...|
|      1|     11|     2|875072262|Seven (Se7en) (1995)|
|      1|     13|     5|875071805|Mighty Aphrodite ...|
|      1|     15|     5|875071608|Mr. Holland's Opu...|
|      1|     16|     5|878543541|French Twist (Gaz...|
|      1|     18|     4|887432020|White Balloon, Th...|
|      1|     19|     5|875071515|Antonia's Line (1...|
|      1|     21|     1|878542772|Muppet Treasur

In [15]:
#test = test.join(items, test.item_id ==items.item_id).drop(items.item_id)
#test.show()

+-------+-------+------+---------+--------------------+
|user_id|item_id|rating|timestamp|               movie|
+-------+-------+------+---------+--------------------+
|      1|      6|     5|887431973|Shanghai Triad (Y...|
|      1|     10|     3|875693118|  Richard III (1995)|
|      1|     12|     5|878542960|Usual Suspects, T...|
|      1|     14|     5|874965706|  Postino, Il (1994)|
|      1|     17|     3|875073198|From Dusk Till Da...|
|      1|     20|     4|887431883|Angels and Insect...|
|      1|     23|     4|875072895|  Taxi Driver (1976)|
|      1|     24|     3|875071713|Rumble in the Bro...|
|      1|     27|     2|876892946|     Bad Boys (1995)|
|      1|     31|     3|875072144| Crimson Tide (1995)|
|      1|     33|     4|878542699|    Desperado (1995)|
|      1|     36|     2|875073180|     Mad Love (1995)|
|      1|     39|     4|875072173| Strange Days (1995)|
|      1|     44|     5|878543541|Dolores Claiborne...|
|      1|     47|     4|875072125|      Ed Wood 

In [47]:
df_train = train.toPandas()
df_train.head()

Unnamed: 0,user_id,item_id,rating,timestamp,movie
0,1,1,5,874965758,Toy Story (1995)
1,1,2,3,876893171,GoldenEye (1995)
2,1,3,4,878542960,Four Rooms (1995)
3,1,4,3,876893119,Get Shorty (1995)
4,1,5,3,889751712,Copycat (1995)


In [17]:
df_test = test.toPandas()
df_test.head()

Unnamed: 0,user_id,item_id,rating,timestamp,movie
0,1,6,5,887431973,Shanghai Triad (Yao a yao yao dao waipo qiao) ...
1,1,10,3,875693118,Richard III (1995)
2,1,12,5,878542960,"Usual Suspects, The (1995)"
3,1,14,5,874965706,"Postino, Il (1994)"
4,1,17,3,875073198,From Dusk Till Dawn (1996)


In [18]:
training.count() # gives the number of rows, ie, number of ratings

80000

In [19]:
test.count()

20000

In [20]:
items.count()

1682

# ANSWERS:
## 80000 ratings in training set
## 20000 ratings in test set
## 1682 movies in the dataset

Using the training set, train a model with the Alternating Least Squares method available in the Spark MLlib: [https://spark.apache.org/docs/latest/ml-collaborative-filtering.html](https://spark.apache.org/docs/latest/ml-collaborative-filtering.html)

maxIter = 5, regParam=0.01

In [88]:
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

In [89]:
#als = ALS(userCol='user_id', itemCol='item_id', ratingCol='rating', maxIter=5, regParam=0.01, coldStartStrategy='drop')
als = ALS(userCol="user_id", itemCol="item_id", ratingCol="rating", maxIter=5, regParam=0.01,nonnegative = True, implicitPrefs = False, coldStartStrategy="drop")

In [90]:
model = als.fit(training)

Now compute the RMSE on the test dataset.


In [91]:
predictions = model.transform(test)

In [84]:
df = predictions.toPandas()
df.head()

Unnamed: 0,user_id,item_id,rating,timestamp,prediction
0,251,148,2,886272547,3.327871
1,255,833,4,883216902,1.324811
2,321,496,4,879438607,3.945987
3,108,471,2,879880076,4.139856
4,101,471,3,877136535,3.286622


In [92]:
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
rmse = evaluator.evaluate(predictions)
print("RMSE on test set: ", rmse)

RMSE on test set:  1.0465243868722673


At this point, you can use the trained model to produce the top-K recommendations for each user.  Recommend the top three movies for each user. 

In [93]:
top3 = model.recommendForAllUsers(3)
top3.show(truncate=False)




+-------+---------------------------------------------------------+
|user_id|recommendations                                          |
+-------+---------------------------------------------------------+
|1      |[{793, 6.3364606}, {960, 5.9672217}, {947, 5.9344974}]   |
|3      |[{1006, 7.234457}, {308, 7.156104}, {502, 7.016623}]     |
|5      |[{1368, 11.807686}, {1077, 10.737466}, {698, 9.860186}]  |
|6      |[{115, 5.6290255}, {361, 5.4949675}, {1524, 5.343671}]   |
|9      |[{1395, 11.222579}, {1131, 10.867725}, {1216, 10.820805}]|
|12     |[{1664, 7.465287}, {915, 7.380612}, {1006, 7.343769}]    |
|13     |[{1368, 7.1528616}, {6, 5.408635}, {580, 5.38941}]       |
|15     |[{1193, 6.3574047}, {1194, 6.0866623}, {1120, 5.931073}] |
|16     |[{1368, 7.6255207}, {1427, 7.252457}, {1005, 7.114436}]  |
|17     |[{1512, 7.437726}, {1077, 6.7629375}, {1368, 6.532758}]  |
|19     |[{1512, 10.260715}, {1172, 9.616025}, {962, 9.18335}]    |
|20     |[{1006, 13.493648}, {1664, 12.704903}, 

Print the name of the movies recommended for user 444  

In [94]:
r444 = top3.filter(top3.user_id == '444')
r444.show(truncate=False)

+-------+------------------------------------------------------+
|user_id|recommendations                                       |
+-------+------------------------------------------------------+
|444    |[{1466, 10.026432}, {1099, 9.459652}, {1021, 9.32674}]|
+-------+------------------------------------------------------+



In [95]:
r444 = r444.withColumn("rec_exp", explode("recommendations")).select('user_id', col("rec_exp.item_id"), col("rec_exp.rating"))
r444.show()

+-------+-------+---------+
|user_id|item_id|   rating|
+-------+-------+---------+
|    444|   1466|10.026432|
|    444|   1099| 9.459652|
|    444|   1021|  9.32674|
+-------+-------+---------+



In [96]:
r444.join(items, on='item_id').show(truncate=False)

+-------+-------+---------+-----------------------------------------+
|item_id|user_id|rating   |movie                                    |
+-------+-------+---------+-----------------------------------------+
|1466   |444    |10.026432|Margaret's Museum (1995)                 |
|1099   |444    |9.459652 |Red Firecracker, Green Firecracker (1994)|
|1021   |444    |9.32674  |8 1/2 (1963)                             |
+-------+-------+---------+-----------------------------------------+



## Part 2: Collaborative Filtering (60 Points)

scikit-surprise(http://surpriselib.com/) is a good library for multiple recommender systems algorithms. <br>You may start playing around with wrapper functions from the link provided.

Hint: Transform pyspark dataframe into pandas dataframe

In [34]:
# install surprise to build recommender in python
!pip install scikit-surprise

Collecting scikit-surprise
  Downloading scikit-surprise-1.1.1.tar.gz (11.8 MB)
[K     |████████████████████████████████| 11.8 MB 15.1 MB/s 
Building wheels for collected packages: scikit-surprise
  Building wheel for scikit-surprise (setup.py) ... [?25l[?25hdone
  Created wheel for scikit-surprise: filename=scikit_surprise-1.1.1-cp37-cp37m-linux_x86_64.whl size=1630128 sha256=e48b3f5b5e8de2c01ef4587605e4279003806aa869ce6a63b302f770198f906c
  Stored in directory: /root/.cache/pip/wheels/76/44/74/b498c42be47b2406bd27994e16c5188e337c657025ab400c1c
Successfully built scikit-surprise
Installing collected packages: scikit-surprise
Successfully installed scikit-surprise-1.1.1


### Task. Memory-based Filtering 

Your task is to train a predictor using the `training` set provided above, and make predictions on the `test` set.

A. User-based recommendation

To make a prediction on user $u$'s rating on item $i$ ($R_{u, i}$), User-based recommendation finds the top-N user neighbors who have already completed rating on $i$, taking their average (unweighted or weighted by their similarity with $u$) as the prediction $\hat{R}_{u,i}$.


(1). Use default parameters, report *RMSE* on training & test set, respectively.

In [35]:
from surprise import KNNWithMeans
from surprise import Reader
from surprise import Dataset
from surprise.model_selection import cross_validate

In [97]:
algo = KNNWithMeans()

In [98]:
reader = Reader(rating_scale=(1, 5))
data1 = Dataset.load_from_df(df_train[['user_id', 'item_id', 'rating']], reader)

In [99]:
train_df = data1.build_full_trainset()
algo.fit(train_df)
test_df = train_df.build_testset()
predictions = algo.test(test_df)

Computing the msd similarity matrix...
Done computing similarity matrix.


In [100]:
cross_validate(algo, data1, measures=['RMSE'], cv=2)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.


{'fit_time': (0.14841032028198242, 0.15533161163330078),
 'test_rmse': array([0.97783689, 0.97009482]),
 'test_time': (4.491857051849365, 4.269320487976074)}

Calculate RMSE of the actual ratings $R$ and the predicted ratings $\hat{R}$ in the training set.

In [101]:
from surprise import accuracy
accuracy.rmse(predictions)

RMSE: 0.7569


0.7569412498224135

Now let's make predictions on the test set

In [102]:
data2 = Dataset.load_from_df(df_test[['user_id', 'item_id', 'rating']], reader)

In [103]:
train_df2 = data2.build_full_trainset()
test_df2 = train_df2.build_testset()
prediction_2 = algo.test(test_df2)

In [104]:
cross_validate(algo, data2, measures=['RMSE'], cv=2)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.


{'fit_time': (0.034761905670166016, 0.020264148712158203),
 'test_rmse': array([1.05928159, 1.05840889]),
 'test_time': (0.2831432819366455, 0.33246707916259766)}

Calculate RMSE of the actual ratings $R$ and the predicted ratings $\hat{R}$ from the trained user-based recommendation.

In [105]:
accuracy.rmse(prediction_2)

RMSE: 0.9980


0.9980359334665843

In [106]:
knn_algo = KNNWithMeans(k=10)

training_data = Dataset.load_from_df(df_train[['user_id', 'item_id', 'rating']], reader)
testing_data = Dataset.load_from_df(df_test[['user_id', 'item_id', 'rating']],reader)

In [107]:
trained = training_data.build_full_trainset()
knn_algo.fit(trained)
tested = trained.build_testset()
predictions = knn_algo.test(tested)
accuracy.rmse(predictions)

Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.6106


0.6106378518062169

In [108]:
train_2 = testing_data.build_full_trainset()
test_2 = train_2.build_testset()
predictions2 = knn_algo.test(test_2)
accuracy.rmse(predictions2)

RMSE: 0.9892


0.9891847088270564

(2). Now display the top 10 movies for user 10, ranked by the predicted rating scores in the test set.

In [58]:
df_item = items.toPandas()


In [109]:
predict = pd.DataFrame(predictions2)

allmovies = predict[predict.uid == 10].sort_values(by='est',ascending=False)[:10]["iid"]

print("The top 10 movies for user 10: ")
[df_item[df_item.item_id == x]['movie'].values[0] for x in allmovies]


The top 10 movies for user 10: 


['Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)',
 'Bridge on the River Kwai, The (1957)',
 'Deer Hunter, The (1978)',
 'Casablanca (1942)',
 'Secrets & Lies (1996)',
 'Aliens (1986)',
 'Reservoir Dogs (1992)',
 'Laura (1944)',
 'Bringing Up Baby (1938)',
 '2001: A Space Odyssey (1968)']

(3). From what we learned in class, the number of nearest neighbors ($k$) considered for rating estimation $\hat{R}$ is an important hyperparameter affecting the prediction results. Repeat the training procedure above with different nearest neighbor selections (2-10), find the optimal $k$ in your experiment and report the corresponding *RMSE* in the test set.

In [63]:
trained = training_data.build_full_trainset()
tested = testing_data.build_full_trainset().build_testset()
param_grid = {'k':list(range(2,11))}

Note that we can write a for-loop to iterate throuogh different choices, but scikit-surprise provides us with a simplified cross-validation interface (`surprise.model_selection.GridSearchCV`) to fine-tune such hyperparameter.


Report the optimal k value. 

Report the RMSE given the optimal k value

In [65]:
from surprise.model_selection import GridSearchCV
from surprise import KNNWithMeans

In [66]:
kopt = GridSearchCV(KNNWithMeans, param_grid, measures=['rmse'])
kopt.fit(training_data)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computi

In [111]:
print("RMSE score of optimal k value: ", kopt.best_score)

RMSE score of optimal k value:  {'rmse': 1.03832050710602}


In [68]:
print("optimal k value: ", kopt.best_params)

optimal k value:  {'rmse': {'k': 10}}


In [69]:
kopt.fit(testing_data)

Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computing similarity matrix.
Computing the msd similarity matrix...
Done computi

In [112]:
print("RMSE score of optimal k value: ", kopt.best_score)
print("optimal k value: ", kopt.best_params)

RMSE score of optimal k value:  {'rmse': 1.03832050710602}
optimal k value:  {'rmse': {'k': 10}}


B. **item-based recommendation**

To make a prediction on user $u$'s rating on item $i$ ($R_{u, i}$), Item-based recommendation finds the top-N item neighbors (the user has rated) to $i$, taking their average (unweighted or weighted by their similarity with $i$) as the prediction $\hat{R}_{u,i}$.



(1). Similar to the previous question, implement the item-based recommender systems trained on the  `training` set, report the *RMSE* on both the `training` and `test` set. (Note: apply the optimal $k$ obtained in last question.)

In [114]:
sim_options = {'user_based': False}

# optimal value of K = 10 

i_knn = KNNWithMeans(k=10,sim_options=sim_options)
trained = training_data.build_full_trainset()
tested = trained.build_testset()
i_knn.fit(trained)
predictions = i_knn.test(tested)
rmse1 = accuracy.rmse(predictions)
print("Accuracy for the train set is: ", rmse1)


Computing the msd similarity matrix...
Done computing similarity matrix.
RMSE: 0.5690
Accuracy for the train set is:  0.5690496491977741


In [116]:
trained2 = testing_data.build_full_trainset()
tested2 = trained2.build_testset()
prediction_2 = i_knn.test(tested2)
rmse2 = accuracy.rmse(prediction_2)
print("The accuracy for the test set is: ",rmse2)

RMSE: 0.9783
The accuracy for the test set is:  0.9783018562689109


(2). Similar to previous question, display the top 10 movies for user 10, ranked by the predicted rating scores in the test set.

In [117]:
predict = pd.DataFrame(predictions2)

In [81]:
predict.head()

Unnamed: 0,uid,iid,r_ui,est,details
0,1,6,5.0,3.139162,"{'actual_k': 10, 'was_impossible': False}"
1,1,10,3.0,3.776408,"{'actual_k': 10, 'was_impossible': False}"
2,1,12,5.0,4.439765,"{'actual_k': 10, 'was_impossible': False}"
3,1,14,5.0,3.927944,"{'actual_k': 10, 'was_impossible': False}"
4,1,17,3.0,3.259577,"{'actual_k': 10, 'was_impossible': False}"


In [118]:
allmovies = predict[predict.uid == 10].sort_values(by='est',ascending=False)[:10]["iid"]

print("The top 10 movies for user 10: ")
[df_item[df_item.item_id == x]['movie'].values[0] for x in allmovies]


The top 10 movies for user 10: 


['Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)',
 'Bridge on the River Kwai, The (1957)',
 'Deer Hunter, The (1978)',
 'Casablanca (1942)',
 'Secrets & Lies (1996)',
 'Aliens (1986)',
 'Reservoir Dogs (1992)',
 'Laura (1944)',
 'Bringing Up Baby (1938)',
 '2001: A Space Odyssey (1968)']

(3). Given the same number of nearest neighbor ($k$), compare and discuss the user-based and item-based recommendation, which performs better on the test set?

User-based recommendation RMSE values:

**Training set - 0.610**
**Test set - 0.989**

Item-based recommendation RMSE values:

**Train set - 0.569**
**Test set - 0.978**

From above RMSE values, it is evident that user based recommendations perform better on train sets for the given parameters.

The model with optimal value of K = 10 performs better on the datasets. When K value was taken as default (k = 40), the accuracy of the model was lesser comparatively.



Item-based recommenders perform better than user-based recommenders on the test set and in general. This is better because user based recommender is higly based on the user and may give inconsistent results.