Lab : Classification with MLlib
===================================

### Introduction

This lab explores a well known dataset from the Czech dating website libimseti.cz.  We'll just call it the "dating" dataset. :)

Normally we talk of users and items as different entities, but in dating websites we relate users to one another.

In our example, we're going to ignore the gender and orientation of each user in doing the recommendations.   The dating dataset does include a file which identifies the gender of each participant, but for simplicity we're not handling it here. This isn't as bad as it sounds, as most users likely will rate only one gender of dating site participants, and will no doubt receive recommendations from the same gender. Naturally there are always exceptions.

The checked in version is a tiny subset of the actual, as only the first 9999 users are included.  Furthermore, the ratings outside the subset are ignored, so a good portion of users have no data.

In [395]:
# initialize Spark Session
import os
import sys
top_dir = os.path.abspath(os.path.join(os.getcwd(), "../../"))
if top_dir not in sys.path:
    sys.path.append(top_dir)

from init_spark import init_spark
spark = init_spark()
sc = spark.sparkContext

Initializing Spark...
Spark found in :  /home/ubuntu/spark
Spark config:
	 executor.memory=2g
	some_property=some_value
	spark.app.name=TestApp
	spark.master=local[*]
	spark.sql.warehouse.dir=/tmp/tmpyam46jj0
	spark.submit.deployMode=client
	spark.ui.showConsoleProgress=true
Spark UI running on port 4042


In [396]:
from pyspark.mllib.recommendation import *

### Step 1 : Inspect Data
* Sample Data : [/data/dating/sample.txt](/data/dating/sample.txt)
* Rating data file : [/data/dating/medium/ratings.dat](/data/dating/medium/ratings.dat)

(browsers may not display the data properly, open the data in text editor)

### Step 2 : Create Rating Object for the Data

In [397]:
data = sc.textFile("../data/dating/medium/ratings.dat") 

For the dating website 
* Users = Users
* Products = Other users
* Rating = Rating given by one user to anothr user

In [398]:
splitted_data = data.map(lambda x : x.split(","))
#  Rating represents a (user, product, rating) tuple.
ratings = splitted_data.map(lambda x : Rating(x[0],x[1],x[2]))
# ratings.collect()

In [399]:
model = ALS.train(ratings, rank = 10, iterations = 5, lambda_= 0.01)

### Step 3: Transform the Rating object to a tuple of User, Product

In [400]:
# Get rid of rating to test model's effectiveness
# TODO: TRANSFORM Rating -> Tuple of (user, product)
# (i.e., get rid of the rating)
userItems = ratings.map(lambda x: (int(x[0]),int(x[1])))

### Step 4: Use the predictAll method to map the output to User, Product

In [401]:
# Do a test prediction
# TODO call model.predictAll() on userItems, and then map the output of that 
# to (user, product), rating
predict = model.predictAll(userItems)
recs = predict.map(lambda x: ((int(x[0]), int(x[1])), int(x[2])))

In [402]:
ratingsAndRecs = ratings.map(lambda x: ((int(x[0]), int(x[1])), int(x[2]))).join(recs)

In [403]:
mse = ratingsAndRecs.map(lambda x: (x[1][0] - x[1][1]) * (x[1][0] - x[1][1])).mean()

In [404]:
print (mse)

2.2576640412262527


### Step 5 : Find recommendations for Users based on ratings

In [None]:
# recommendProductsForUsers will give recommendations for all users in an arrray
# Number of recommendations needed should be provided as arguments
recsForEachUser = model.recommendProductsForUsers(3)
recsForEachUser.collect()

In [406]:
# recommendProducts will give recommendations for the particular user
# parameters : (User, NumberOfRecommemdationsNeeded
# recsForEachUser = model.recommendProducts(892, 4)
recsForEachUser = model.recommendProducts(4, 10)
print (recsForEachUser)

# Beware: some numbers aren't represented (e.g. 3)

[Rating(user=4, product=5702, rating=53.855695075320156), Rating(user=4, product=2928, rating=45.60359291212251), Rating(user=4, product=9967, rating=42.06318720908149), Rating(user=4, product=6807, rating=40.755852549595176), Rating(user=4, product=484, rating=40.75157465417058), Rating(user=4, product=1642, rating=39.33000008350386), Rating(user=4, product=4901, rating=38.75331923891722), Rating(user=4, product=9018, rating=37.487033262569184), Rating(user=4, product=8855, rating=37.32644276225213), Rating(user=4, product=2853, rating=37.2282182227778)]


### Step 6: Running on some of your own data

Create a file called personalratings.txt.  Include some test data as preferences.
We have included a file /data/dating/sample.txt for you.
you can refer to it.
    

In [407]:
personaldata = sc.textFile("personalratings.txt")

# And create the solution like you did using ratings.dat file
splitted_data = data.map(lambda x : x.split(","))
ratings = splitted_data.map(lambda x : Rating(x[0],x[1],x[2]))
model = ALS.train(ratings, rank = 10, iterations = 5, lambda_= 0.01)
userItems = ratings.map(lambda x: (int(x[0]),int(x[1])))
predict = model.predictAll(userItems)
recs = predict.map(lambda x: ((int(x[0]), int(x[1])), int(x[2])))
ratingsAndRecs = ratings.map(lambda x: ((int(x[0]), int(x[1])), int(x[2]))).join(recs)
mse = ratingsAndRecs.map(lambda x: (x[1][0] - x[1][1]) * (x[1][0] - x[1][1])).mean()
print (mse)
recsForEachUser = model.recommendProductsForUsers(4)
recsForEachUser = model.recommendProducts(4, 3)
print (recsForEachUser)

1.6453752232600356
[Rating(user=4, product=4901, rating=41.33717392078803), Rating(user=4, product=2571, rating=39.86126854301914), Rating(user=4, product=4354, rating=38.32360994639145)]


[Rating(user=4, product=4901, rating=41.33717392078803),
 Rating(user=4, product=2571, rating=39.86126854301914)]