# IS622 Recommendation System Mini-Project
### Brian Chu | Nov 15, 2015

In [1]:
import os
import sys

# Path for Spark source folder
os.environ['SPARK_HOME']="/home/brian/workspace/cuny_msda_is622/spark-1.5.1-bin-hadoop2.6"

# Append pyspark to Python Path
sys.path.append("/home/brian/workspace/cuny_msda_is622/spark-1.5.1-bin-hadoop2.6/python/")

# Append py4j to Python Path
sys.path.append("/home/brian/workspace/cuny_msda_is622/spark-1.5.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip")

In [2]:
# Launch Spark
execfile("/home/brian/workspace/cuny_msda_is622/spark-1.5.1-bin-hadoop2.6/python/pyspark/shell.py")

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Python version 2.7.10 (default, Oct 14 2015 16:09:02)
SparkContext available as sc, HiveContext available as sqlContext.


In [3]:
# Load required packages
from pyspark.sql import SQLContext
import numpy as np
import pandas as pd

### Dataset: Jester Online Joke Recommender System
**http://eigentaste.berkeley.edu/dataset/**  
Eigentaste: A Constant Time Collaborative Filtering Algorithm. Ken Goldberg, Theresa Roeder, Dhruv Gupta, and Chris Perkins. Information Retrieval, 4(2), 133-151. July 2001.  
  
This dataset contains ratings, on a continuous scale from -10 to 10, of up to 100 jokes by 24,983 users.

### Load and clean data

In [4]:
jester = sc.textFile("jester2.csv")

# Sample row
jester.take(1)

[u'1,-7.82,8.79,-9.66,-8.16,-7.52,-8.5,-9.85,4.17,-8.98,-4.76,-8.5,-6.75,-7.18,8.45,-7.18,-7.52,-7.43,-9.81,-9.85,-9.85,-9.37,1.5,-4.37,-9.81,-8.5,1.12,7.82,2.86,9.13,-7.43,2.14,-4.08,-9.08,7.82,5.05,4.95,-9.17,-8.4,-8.4,-8.4,-8.11,-9.13,-9.03,-9.08,-7.14,-6.26,3.79,-0.1,3.93,4.13,-8.69,-7.14,3.2,8.3,-4.56,0.92,-9.13,-9.42,2.82,-8.64,8.59,3.59,-6.84,-9.03,2.82,-1.36,-9.08,8.3,5.68,-4.81,99,99,99,99,99,99,99,-9.42,99,99,99,-7.72,99,99,99,99,99,99,99,99,2.82,99,99,99,99,99,-5.63,99,99,99']

In [5]:
# Create RDD, DataFrame, and Pandas objects
jrdd = jester.map(lambda line: line.split(","))
jdf = jrdd.toDF()
jpd = jdf.toPandas()

Recode missing data (99) to NaN. I found this easier to do in pandas and then convert back to a PySpark DataFrame, perhaps because Spark DataFrames are immutable.

In [6]:
# Convert 99 to NaN (in pandas, remake DF)
sqlc = SQLContext(sc)
jpd.iloc[:,1:] = jpd.iloc[:,1:].astype(float)
jpd[jpd==99] = np.nan
jdf = sqlc.createDataFrame(jpd)

## PySpark MLlib  
I am going to use PySpark's MLlib package, which implements collaborative filtering with the Alternating Least Squares (ALS) algorithm to predict missing entries. ALS approximates the user-item rating matrix as the product UV of a user-factor matrix U and an item-factor matrix V, and minimizes the squared error on the observed ratings. It alternates between holding V fixed while solving for U and holding U fixed while solving for V, until a steady state has been reached. This is similar to the collaborative filtering theory discussed in the MMS text.  
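
To make the alternating updates described above concrete, here is a minimal single-machine NumPy sketch. It is not MLlib's implementation, which is distributed and differs in its details; the function names `als_sketch` and `observed_rmse`, the `reg` ridge penalty, and the toy matrix are all assumptions made for illustration.

```python
import numpy as np

def als_sketch(R, rank=2, iterations=10, reg=0.1, seed=0):
    """Factor R (users x items, NaN = unrated) into U (users x rank), V (items x rank)."""
    rng = np.random.RandomState(seed)
    n_users, n_items = R.shape
    observed = ~np.isnan(R)
    # Small random starting factors
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    ridge = reg * np.eye(rank)
    for _ in range(iterations):
        # Hold V fixed: each user's factors solve a small ridge regression
        for u in range(n_users):
            cols = observed[u]
            if cols.any():
                Vu = V[cols]
                U[u] = np.linalg.solve(Vu.T.dot(Vu) + ridge, Vu.T.dot(R[u, cols]))
        # Hold U fixed: each item's factors solve the mirror-image problem
        for i in range(n_items):
            rows = observed[:, i]
            if rows.any():
                Ui = U[rows]
                V[i] = np.linalg.solve(Ui.T.dot(Ui) + ridge, Ui.T.dot(R[rows, i]))
    return U, V

def observed_rmse(R, U, V):
    """RMSE between R and U V^T, computed over the rated entries only."""
    mask = ~np.isnan(R)
    err = (U.dot(V.T) - R)[mask]
    return float(np.sqrt(np.mean(err ** 2)))

# Toy 4 users x 4 jokes matrix with one unrated joke per user (hypothetical data)
nan = np.nan
R = np.array([[ 8.0,  6.0,  nan, -4.0],
              [ 4.0,  nan,  1.0, -2.0],
              [ nan, -6.0, -2.0,  4.0],
              [-4.0, -3.0, -1.0,  nan]])

U0, V0 = als_sketch(R, iterations=0)   # random start, no updates
U, V = als_sketch(R, iterations=10)
print("RMSE on observed entries: %.3f -> %.3f"
      % (observed_rmse(R, U0, V0), observed_rmse(R, U, V)))
```

Once trained, `U.dot(V.T)` fills in every cell of the matrix, including the unrated ones, which is what makes the factors usable for prediction.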
  
### Modified source code:  
http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html  
https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html

**MLlib ALS requires an RDD of `Rating` objects, each of which is a tuple in the format (userID, itemID, itemRating)**

In [7]:
# Create the (userID, jokeID, rating) tuples required by the MLlib ALS train function
# Exclude NaN (unrated) entries here; those user-joke pairs are what the model will predict
ratings = jdf.flatMap(lambda line: [(line[0], i, line[i]) for i in range(1,101) if not np.isnan(line[i])])

In [8]:
# View the first few ratings (take avoids collecting the full RDD to the driver)
ratings.take(5)

[(u'1', 1, -7.82),
 (u'1', 2, 8.79),
 (u'1', 3, -9.66),
 (u'1', 4, -8.16),
 (u'1', 5, -7.52)]
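
The same wide-to-long reshaping can be sketched in plain Python, which may help when checking the logic without a Spark session. This is an illustration only: `row_to_triples` and `MISSING` are hypothetical names, and the sketch assumes the first CSV field is the user ID and that 99 marks an unrated joke, as in the sample row above. Unlike the notebook, it casts the user ID to an integer up front.

```python
import math

MISSING = 99.0  # sentinel assumed to mean "not rated"

def row_to_triples(line, missing=MISSING):
    """Turn one CSV row 'userID,r1,r2,...' into (userID, jokeID, rating) triples."""
    fields = line.split(",")
    user = int(fields[0])
    triples = []
    # Joke IDs are 1-based column positions after the user ID
    for joke_id, raw in enumerate(fields[1:], start=1):
        rating = float(raw)
        if rating == missing or math.isnan(rating):
            continue  # skip unrated jokes
        triples.append((user, joke_id, rating))
    return triples

sample = "1,-7.82,8.79,99,-8.16"
print(row_to_triples(sample))
```

Passing a function like this to `flatMap` on the raw text RDD would be one way to skip the pandas round trip, although that variant is untested here.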

**Check the number of users, jokes, and ratings in the ratings RDD**

In [9]:
numRatings = ratings.count()
numUsers = ratings.map(lambda r: r[0]).distinct().count()
numJokes = ratings.map(lambda r: r[1]).distinct().count()

print "Got %d ratings from %d users on %d jokes." % (numRatings, numUsers, numJokes)

Got 1810455 ratings from 24983 users on 100 jokes.


**Randomly divide ratings into a 75% training and 25% testing set**

In [10]:
ratingsTrain, ratingsTest = ratings.randomSplit([0.75, 0.25], seed = 85)

In [11]:
# Check number of ratings in each set
print ratingsTrain.count()
print ratingsTest.count()

1357773
452682
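
The training set above is close to, but not exactly, 75% of the 1,810,455 ratings. One way to picture why is to treat a weighted random split as an independent random draw per record, which is my understanding of how `randomSplit` behaves (treat that as an assumption). The plain-Python sketch below illustrates the idea; `random_split` is a hypothetical helper, not Spark's code.

```python
import random

def random_split(records, weights, seed=None):
    """Assign each record independently to a split, with probability proportional to its weight."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. [0.75, 1.0] for weights [0.75, 0.25]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for rec in records:
        draw = rng.random()
        for k, upper in enumerate(bounds):
            # The last split also catches any floating-point shortfall in the bounds
            if draw < upper or k == len(bounds) - 1:
                splits[k].append(rec)
                break
    return splits

train, test = random_split(range(10000), [0.75, 0.25], seed=85)
print("%d train, %d test" % (len(train), len(test)))
```

Because each record is assigned independently, the split sizes vary slightly around the requested proportions, while a fixed seed keeps the assignment reproducible.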


### Train the recommendation model using ALS

In [12]:
from pyspark.mllib.recommendation import ALS, Rating

# Starting parameters; these can also be tuned using the function defined in the MLlib example
rank = 5
numIterations = 10

# Train model with training data
model = ALS.train(ratingsTrain, rank, numIterations)

**Check model accuracy and error**

In [13]:
rt = sc.parallelize(ratingsTest.collect())
test = model.predictAll(rt.map(lambda p: (p[0], p[1]))).map(lambda r: ((r[0], r[1]), r[2]))

results = ratingsTest.map(lambda r: ((int(r[0]), r[1]), r[2])).join(test)
rpar = sc.parallelize(results.collect())

In [33]:
import math

MSE = rpar.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print "Mean Squared Error = %.3f" %(MSE)

RMSE = math.sqrt(MSE)
print "Root Mean Squared Error = %.3f \n" %(RMSE)

MAE = rpar.map(lambda r: abs((r[1][0] - r[1][1]))).mean()
print "Mean Absolute Error = %.3f" %(MAE)

NMAE = MAE / (10 - (-10))
print "Normalized Mean Absolute Error = %.3f" %(NMAE)

Mean Squared Error = 17.767
Root Mean Squared Error = 4.215 

Mean Absolute Error = 3.295
Normalized Mean Absolute Error = 0.165


I tinkered with some of the tuning parameters (rank, iterations) and found the RMSE to be fairly consistent. The rating scale runs from -10 to 10 (a range of 20), so an RMSE of 4.2 and an MAE of 3.3 may seem somewhat large. However, in the paper cited above, Goldberg et al. looked at a few other algorithms and found the NMAE to range from 0.187 to 0.237. Against that benchmark, the NMAE of 0.165 from the ALS model here appears very good. Note that the authors preferred NMAE because it normalizes errors as a percentage of the full scale.
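
For reference, the four error measures reported above can be written as one small plain-Python function. This is a sketch for checking the arithmetic rather than the notebook's Spark code; `error_metrics` and the toy pairs are hypothetical, and the default scale bounds assume the Jester range of -10 to 10.

```python
import math

def error_metrics(pairs, r_min=-10.0, r_max=10.0):
    """pairs: list of (actual, predicted) ratings. Returns MSE, RMSE, MAE and NMAE."""
    errors = [actual - predicted for actual, predicted in pairs]
    n = float(len(errors))
    mse = sum(e ** 2 for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        # NMAE divides MAE by the width of the rating scale
        "NMAE": mae / (r_max - r_min),
        "MAE": mae,
    }

# Toy (actual, predicted) pairs with errors 2, -3, -1 and 0
toy = [(8.0, 6.0), (-4.0, -1.0), (2.0, 3.0), (0.0, 0.0)]
metrics = error_metrics(toy)
for name in ("MSE", "RMSE", "MAE", "NMAE"):
    print("%s = %.3f" % (name, metrics[name]))
```

The same relationships hold for the figures in this notebook: the square root of 17.767 is about 4.215, and 3.295 divided by the scale width of 20 is about 0.165.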

### Make recommendations using model and set of jokes not rated
*Example: User 13*

In [16]:
# Specify user
user = '13'

# List of all 100 Joke IDs
jokeID = list(range(1,101))

# User's rated jokes (filter in Spark rather than collecting every rating to the driver)
uRatings = ratings.filter(lambda r: r[0] == user).map(lambda r: (r[1], r[2])).collect()
uRatingsID = [j for (j, r) in uRatings]

# Check number of jokes rated
len(uRatingsID)

47

In [17]:
# Jokes not rated by User
uNR = [j for j in jokeID if j not in uRatingsID]
len(uNR)

53

#### Predicted scores

In [18]:
notRated = sc.parallelize(set(uNR))
predictions = model.predictAll(notRated.map(lambda x: (user, x))).collect()

# Sort by highest rated
predictions = sorted(predictions, key=lambda k: k[2], reverse=True)

**Only recommend jokes whose predicted score is non-negative and at or above the user's average rating**  

In [19]:
avgRating = np.mean([x[1] for x in uRatings])
highRecommend = [(x[1], x[2]) for x in predictions if x[2] >= avgRating and x[2] >= 0]

print "Recommended jokes (score out of 10): \n"
for i in highRecommend:
    print "JokeID: %d    Score: %.2f" %(i[0], i[1])

Recommended jokes (score out of 10): 

JokeID: 89    Score: 7.04
JokeID: 72    Score: 6.94
JokeID: 12    Score: 6.88
JokeID: 26    Score: 6.81
JokeID: 6    Score: 6.41
JokeID: 81    Score: 6.19
JokeID: 83    Score: 6.11
JokeID: 76    Score: 6.10
JokeID: 100    Score: 5.96
JokeID: 34    Score: 5.88
JokeID: 40    Score: 5.71
JokeID: 22    Score: 5.66
JokeID: 87    Score: 5.44
JokeID: 88    Score: 5.35
JokeID: 80    Score: 5.31
JokeID: 78    Score: 5.28
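
The filtering rule used above reduces to a single threshold: a prediction is kept when it is at least the larger of the user's average rating and zero. The plain-Python sketch below shows this on made-up numbers; `recommend` and its `floor` parameter are hypothetical names introduced for illustration.

```python
def recommend(predictions, user_ratings, floor=0.0):
    """predictions: (jokeID, predicted score) pairs; user_ratings: the user's own ratings."""
    avg = sum(user_ratings) / float(len(user_ratings))
    # "score >= avg and score >= floor" is the same test as "score >= max(avg, floor)"
    threshold = max(avg, floor)
    keep = [(joke, score) for joke, score in predictions if score >= threshold]
    # Highest predicted score first
    return sorted(keep, key=lambda js: js[1], reverse=True)

preds = [(6, 6.41), (15, -2.3), (89, 7.04), (41, 1.2)]
print(recommend(preds, [5.0, -1.0, 2.0]))   # user average is 2.0
print(recommend(preds, [-5.0, -3.0]))       # user average is below zero, so the floor applies
```

The floor matters for users whose average rating is negative: without it, jokes the model expects them to dislike mildly would still be recommended.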


**Print full joke of the top 5 recommendations**

In [24]:
from urllib import urlopen
from bs4 import BeautifulSoup

def print_joke(jnum):
    # Joke text is stored locally as ./jokes/init<jnum>.html
    jpath = os.path.join(os.getcwd(), "jokes", "init" + str(jnum) + ".html")
    html = urlopen(jpath).read()
    # Strip the HTML markup, keeping one line of text per element
    text = BeautifulSoup(html).get_text("\n", strip=True)
    return text

for n in highRecommend[:5]:
    print print_joke(n[0]) + "\n\n"

A Joke
A radio conversation of a US naval 
ship with Canadian authorities ...
Americans: Please divert your course 15 degrees to the North to avoid a
collision.
Canadians: Recommend you divert YOUR course 15 degrees to the South to 
avoid a collision.
Americans: This is the Captain of a US Navy ship.  I say again, divert 
YOUR course.
Canadians: No.  I say again, you divert YOUR course.
Americans: This is the aircraft carrier USS LINCOLN, the second largest ship in the United States' Atlantic Fleet. We are accompanied by three destroyers, three cruisers and numerous support vessels. I demand that you change your course 15 degrees north, that's ONE FIVE DEGREES NORTH, or counter-measures will be undertaken to ensure the safety of this ship.
Canadians:
This is a lighthouse.  Your call
.


A Joke
On the first day of college, the Dean addressed the students,
pointing out some of the rules:

"The female dormitory will be out-of-bounds for all male students
and the male dormitory to the fema

### Summary

Collaborative filtering is interesting because, although it does not explicitly account for content features of the joke itself (e.g. humour type, general topic), it implicitly learns such latent factors and is effective at recommending similar items. In this example, User 13 appears to like lowbrow humour and relationship topics. One downside of collaborative filtering is that it depends on users rating consistently with their tastes, which is harder to formalize when the 'item' is something as abstract as humour. Additionally, if a user likes many genres or content features equally and rates several items very similarly, ALS and collaborative filtering in general may be less successful in their recommendations. This would be true of almost any algorithm, though.  

In the Goldberg paper, the authors assessed a few recommendation methods, including k-nearest neighbors, global mean, and their proposed Eigentaste algorithm. The latter involves normalizing the ratings and applying principal component analysis (PCA), among other steps. They concluded that Eigentaste was more appropriate for sparse matrices and for 'capturing tastes with finer granularity'. As Eigentaste is effectively a form of collaborative filtering, it should not be too surprising that the ALS method used in this mini-project appears fairly consistent with it in model error and results.