# Assignment

Each user is potentially interested in watching one or more of the movies specified in `requests.json`. Our job is to **decide which movie or movies to recommend to these users**.

Use a combination of a matrix factorization model (ALS) and a cold start model (built from user and movie metadata) to fill in the NaN values and predict ratings for these movies.

Your predictions will be scored as follows:

1. Each user may watch movies from your list, starting with the highest predicted rating.
2. Your model will be scored based on how well the users liked the movies they watched.
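
One possible reading of the scoring rule above, as a rough sketch rather than the official scorer: for each user, take the movies with the highest predicted ratings and average the ratings the user actually gave them. The function name `score_recommendations`, the `top_n` cutoff, and the tuple layout are assumptions made for illustration.

```python
from collections import defaultdict

def score_recommendations(rows, top_n=2):
    """Mean actual rating over each user's top_n movies by predicted rating.

    rows: iterable of (user_id, movie_id, predicted, actual) tuples.
    Hypothetical scorer, not the one used to grade the assignment.
    """
    by_user = defaultdict(list)
    for user_id, movie_id, predicted, actual in rows:
        by_user[user_id].append((predicted, actual))

    watched = []
    for preds in by_user.values():
        # Highest predicted rating first, as in step 1 of the scoring rule
        preds.sort(key=lambda p: p[0], reverse=True)
        watched.extend(actual for _, actual in preds[:top_n])
    return sum(watched) / len(watched)

# Made-up rows: (user_id, movie_id, predicted rating, actual rating)
rows = [
    (1, 10, 4.8, 5), (1, 11, 4.1, 3), (1, 12, 2.0, 1),
    (2, 10, 3.9, 4), (2, 13, 4.5, 4),
]
print(score_recommendations(rows))
```

Under this reading, a model is rewarded for ranking well within each user's list, not only for low overall error.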

# Import Libraries

In [1]:
import pyspark
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS, ALSModel

import numpy as np

# K Means
from sklearn.cluster import KMeans
from sklearn import metrics
# Named calinski_harabaz_score before scikit-learn 0.20; the old name was removed in 0.23
from sklearn.metrics import calinski_harabasz_score

# Import Data

In [70]:
!ls data

LICENSE             movies.dat          ratings.json        users.dat
README              movies_metadata.csv requests.json
movies.csv          ratings.csv         users.csv


In [9]:
spark = (pyspark.sql.SparkSession.builder.master("local[*]").getOrCreate())

### Import Movies CSV

In [13]:
file = 'data/movies.csv'
movies = spark.read.csv(file, inferSchema=True, header=True)

Row(_c0=0, movie_id=2, title='Jumanji (1995)', genres="Adventure|Children's|Fantasy")

In [16]:
movies.head(3)

[Row(_c0=0, movie_id=2, title='Jumanji (1995)', genres="Adventure|Children's|Fantasy"),
 Row(_c0=1, movie_id=3, title='Grumpier Old Men (1995)', genres='Comedy|Romance'),
 Row(_c0=2, movie_id=4, title='Waiting to Exhale (1995)', genres='Comedy|Drama')]

### Import Movies_metadata CSV

In [17]:
file = 'data/movies_metadata.csv'
movies_metadata = spark.read.csv(file, inferSchema=True, header=True)

In [20]:
movies_metadata.head(1)

[Row(adult='False', belongs_to_collection="{'id': 10194, 'name': 'Toy Story Collection', 'poster_path': '/7G9915LfUQ2lVfwMEEhDsn3kT4B.jpg', 'backdrop_path': '/9FBwqcd9IRruEDUrTdcaafOMKUq.jpg'}", budget='30000000', genres="[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}, {'id': 10751, 'name': 'Family'}]", homepage='http://toystory.disney.com/toy-story', id='862', imdb_id='tt0114709', original_language='en', original_title='Toy Story', overview="Led by Woody, Andy's toys live happily in his room until Andy's birthday brings Buzz Lightyear onto the scene. Afraid of losing his place in Andy's heart, Woody plots against Buzz. But when circumstances separate Buzz and Woody from their owner, the duo eventually learns to put aside their differences.", popularity='21.946943', poster_path='/rhIRbceoE9lR4veEXuwCC2wARtG.jpg', production_companies="[{'name': 'Pixar Animation Studios', 'id': 3}]", production_countries="[{'iso_3166_1': 'US', 'name': 'United States of America'}]", relea

### Import Ratings CSV

In [27]:
file = 'data/ratings.csv'
ratings = spark.read.csv(file, inferSchema=True, header=True)

In [30]:
ratings.head(3)

[Row(_c0=0, movie_id=858, rating=4, timestamp=datetime.datetime(2000, 4, 25, 16, 5, 32), user_id=6040),
 Row(_c0=1, movie_id=2384, rating=4, timestamp=datetime.datetime(2000, 4, 25, 16, 5, 54), user_id=6040),
 Row(_c0=2, movie_id=593, rating=5, timestamp=datetime.datetime(2000, 4, 25, 16, 5, 54), user_id=6040)]

### Import Requests JSON

In [22]:
# The requests file ships as JSON (see the data listing above), not CSV.
# spark.read.json expects one JSON object per line by default;
# pass multiLine=True if the file is a single JSON document.
file = 'data/requests.json'
requests = spark.read.json(file)

In [23]:
requests

### Import Users CSV

In [24]:
file = 'data/users.csv'
users = spark.read.csv(file, inferSchema=True, header=True)

In [26]:
users.head(3)

[Row(_c0=0, id=2, gender='M', age=56, occupation='self-employed', zipcode='70072'),
 Row(_c0=1, id=3, gender='M', age=25, occupation='scientist', zipcode='55117'),
 Row(_c0=2, id=4, gender='M', age=45, occupation='executive/managerial', zipcode='02460')]

## Read datasets into Pyspark DataFrames

In [31]:
# Print Schema
ratings.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- movie_id: integer (nullable = true)
 |-- rating: integer (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- user_id: integer (nullable = true)



In [32]:
ratings.persist()

DataFrame[_c0: int, movie_id: int, rating: int, timestamp: timestamp, user_id: int]

In [33]:
ratings.show(5)

+---+--------+------+-------------------+-------+
|_c0|movie_id|rating|          timestamp|user_id|
+---+--------+------+-------------------+-------+
|  0|     858|     4|2000-04-25 16:05:32|   6040|
|  1|    2384|     4|2000-04-25 16:05:54|   6040|
|  2|     593|     5|2000-04-25 16:05:54|   6040|
|  3|    1961|     4|2000-04-25 16:06:17|   6040|
|  4|    1419|     3|2000-04-25 16:07:36|   6040|
+---+--------+------+-------------------+-------+
only showing top 5 rows



## Drop timestamp

In [59]:
ratings = ratings.drop(ratings.timestamp)

# Fitting ALS Model

## Train:Test Split

In [60]:
(trainingdata, testdata) = ratings.randomSplit([0.7, 0.3], seed = 100)

print("Training Dataset Count: " + str(trainingdata.count()))
print("Test Dataset Count: " + str(testdata.count()))

Training Dataset Count: 503570
Test Dataset Count: 216379


## Configure the ALS Model

In [61]:
als = ALS(
    rank=10,  # number of latent factors
    maxIter=10,  # number of ALS iterations
    userCol='user_id',
    itemCol='movie_id',
    ratingCol='rating',
)

## Fit the model

In [62]:
als_model = als.fit(trainingdata)
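
For intuition about what `als.fit` is doing, here is a small NumPy sketch of alternating least squares on a toy matrix. It is not Spark's distributed implementation; `als_fit`, the `reg` value, and the toy data (built from hidden rank-2 factors) are assumptions for illustration.

```python
import numpy as np

def als_fit(R, mask, rank=2, reg=0.1, n_iters=50, seed=0):
    """Factor R ~= U @ V.T using only the entries where mask is True."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))
    V = rng.normal(scale=0.1, size=(n_items, rank))
    ridge = reg * np.eye(rank)
    for _ in range(n_iters):
        # Hold V fixed: each user's factors solve a small ridge regression
        for u in range(n_users):
            Vo = V[mask[u]]
            U[u] = np.linalg.solve(Vo.T @ Vo + ridge, Vo.T @ R[u, mask[u]])
        # Hold U fixed: the same update for each movie
        for i in range(n_items):
            Uo = U[mask[:, i]]
            V[i] = np.linalg.solve(Uo.T @ Uo + ridge, Uo.T @ R[mask[:, i], i])
    return U, V

# Toy data generated from hidden rank-2 factors (illustrative, not real ratings)
U_true = np.array([[2.0, 1.0], [1.0, 2.0], [2.0, 0.5], [0.5, 2.0]])
V_true = np.array([[2.0, 0.5], [0.5, 2.0], [1.5, 1.0], [1.0, 1.5]])
R = U_true @ V_true.T

mask = np.ones_like(R, dtype=bool)
mask[0, 3] = mask[2, 1] = False  # pretend two ratings are unobserved

U, V = als_fit(R, mask)
pred = U @ V.T
train_rmse = np.sqrt(np.mean((pred[mask] - R[mask]) ** 2))
print("train RMSE:", round(float(train_rmse), 3))
print("held-out predictions:", pred[~mask].round(2), "truth:", R[~mask])
```

Each update is an ordinary regularized least-squares solve, which is why the method alternates: with one factor matrix held fixed, the problem for the other is convex.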

## Test the Model on Training Data

In [71]:
predictions = als_model.transform(trainingdata)
predictions.persist()

DataFrame[_c0: int, movie_id: int, rating: int, user_id: int, prediction: float]

In [72]:
user_factors = als_model.userFactors
item_factors = als_model.itemFactors

In [73]:
evaluator = RegressionEvaluator(labelCol='rating')
evaluator.evaluate(predictions.na.drop())

0.7966203970798736
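
`RegressionEvaluator` defaults to `metricName='rmse'`, so the value above is a root-mean-square error measured in rating points. A minimal pure-Python sketch of the same metric, using made-up numbers rather than the ratings data:

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Illustrative values only, not taken from the ratings data
actual = [4, 5, 3, 2]
predicted = [3.5, 4.5, 3.5, 2.5]
print(rmse(actual, predicted))
```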

### GridSearch for rank on the ALS Model

In [76]:
for rank in range(2, 20):
    als = ALS(
        rank=rank,  # number of latent factors to try
        maxIter=10,  # number of ALS iterations
        userCol='user_id',
        itemCol='movie_id',
        ratingCol='rating',
    )

    als_model = als.fit(trainingdata)

    # Note: each model is scored on the same data it was trained on
    predictions = als_model.transform(trainingdata)
    predictions.persist()

    evaluator = RegressionEvaluator(labelCol='rating')
    train_evaluation = evaluator.evaluate(predictions.na.drop())
    print('rank of {} gives evaluation of: {}'.format(rank, train_evaluation))

rank of 2 gives evaluation of: 0.8692836935677496
rank of 3 gives evaluation of: 0.8498860941523478
rank of 4 gives evaluation of: 0.8432708835916568
rank of 5 gives evaluation of: 0.8300134886200013
rank of 6 gives evaluation of: 0.8168416618464474
rank of 7 gives evaluation of: 0.8142698576875608
rank of 8 gives evaluation of: 0.8073526060419344
rank of 9 gives evaluation of: 0.8014436953738617
rank of 10 gives evaluation of: 0.7966203970798736
rank of 11 gives evaluation of: 0.7920560493865898
rank of 12 gives evaluation of: 0.7868691369468591
rank of 13 gives evaluation of: 0.7838049648715684
rank of 14 gives evaluation of: 0.7805271499584969
rank of 15 gives evaluation of: 0.7770812532800946
rank of 16 gives evaluation of: 0.7759755304138988
rank of 17 gives evaluation of: 0.7723151487228088
rank of 18 gives evaluation of: 0.7700858450399435
rank of 19 gives evaluation of: 0.7674571313808766


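* Rank 19 gives the lowest RMSE of the ranks tried, but these scores are computed on the training data, where error keeps falling as rank increases. They measure fit, not generalization: the rank-19 model's test RMSE in the next section (0.879) is well above its training RMSE (0.767), so the rank should be chosen on held-out data.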
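
A hedged sketch of comparing ranks on a held-out validation split instead of the training set. `validation_rmse_by_rank` and `best_rank` are hypothetical helpers, the validation DataFrame is assumed to come from an extra `randomSplit`, and `coldStartStrategy='drop'` is used so that users or movies unseen in training do not produce NaN predictions.

```python
def validation_rmse_by_rank(train, validation, ranks, max_iter=10):
    """Fit one ALS model per rank and return {rank: validation RMSE}.

    `train` and `validation` are Spark DataFrames with user_id, movie_id
    and rating columns. PySpark is imported inside the function so that
    `best_rank` below stays usable without a Spark installation.
    """
    from pyspark.ml.evaluation import RegressionEvaluator
    from pyspark.ml.recommendation import ALS

    evaluator = RegressionEvaluator(metricName='rmse', labelCol='rating',
                                    predictionCol='prediction')
    scores = {}
    for rank in ranks:
        als = ALS(rank=rank, maxIter=max_iter, userCol='user_id',
                  itemCol='movie_id', ratingCol='rating',
                  coldStartStrategy='drop')  # drop rows with NaN predictions
        model = als.fit(train)
        scores[rank] = evaluator.evaluate(model.transform(validation))
    return scores

def best_rank(scores):
    """Return the rank with the lowest validation RMSE."""
    return min(scores, key=scores.get)

# Made-up scores to show the selection step; real values would come from
# validation_rmse_by_rank(train_df, validation_df, range(2, 20))
example_scores = {2: 0.91, 6: 0.87, 10: 0.88, 19: 0.90}
print(best_rank(example_scores))
```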

# Evaluate ALS Model

In [77]:
predictions = als_model.transform(testdata)
predictions.persist()

DataFrame[_c0: int, movie_id: int, rating: int, user_id: int, prediction: float]

In [78]:
ratings.show(1)

+---+--------+------+-------+
|_c0|movie_id|rating|user_id|
+---+--------+------+-------+
|  0|     858|     4|   6040|
+---+--------+------+-------+
only showing top 1 row



In [79]:
predictions.show(1)

+------+--------+------+-------+----------+
|   _c0|movie_id|rating|user_id|prediction|
+------+--------+------+-------+----------+
|388719|     148|     4|   3184| 2.9916844|
+------+--------+------+-------+----------+
only showing top 1 row



In [80]:
user_factors = als_model.userFactors
user_factors

DataFrame[id: int, features: array<float>]

In [81]:
item_factors = als_model.itemFactors
item_factors

DataFrame[id: int, features: array<float>]

In [83]:
evaluator = RegressionEvaluator(labelCol='rating')
test_evaluation = evaluator.evaluate(predictions.na.drop())
test_evaluation

0.8788509480050949

# Will User Like a Certain Movie?

In [69]:
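# User: latent factor vector for user 10
user_row = user_factors.filter(user_factors['id'] == 10).first()
# first() returns None when the id has no learned factors,
# e.g. when the user has no ratings in the training split
if user_row is None:
    raise ValueError('no ALS factors for user 10')
user_vec = np.array(user_row['features'])
user_vec

In [None]:
# Movie: latent factor vector for movie 296
movie_row = item_factors.filter(item_factors['id'] == 296).first()
if movie_row is None:
    raise ValueError('no ALS factors for movie 296')
movie_vec = np.array(movie_row['features'])
movie_vec

## Dot Product

In [None]:
# The predicted rating is the dot product of the two factor vectors
user_vec @ movie_vec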

## User Prediction

In [None]:
# Column names follow the ratings schema: user_id and movie_id
user_preds = predictions[predictions['user_id'] == 10]
user_preds.sort('movie_id').show()
!grep 296 < data/movies.csv

# What Movies will a User Like?

In [None]:
recs = als_model.recommendForAllUsers(numItems=10)
recs[recs['user_id'] == 10].first()['recommendations']
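
Conceptually, a top-N list for one user comes from scoring every movie with the dot product of the factor vectors and keeping the highest scores. The NumPy sketch below illustrates that idea with made-up factors; `top_n_for_user` is a hypothetical helper, and skipping already-rated movies is an illustrative choice, not a statement about what `recommendForAllUsers` does.

```python
import numpy as np

def top_n_for_user(user_vec, item_factors, item_ids, n=2, seen=()):
    """Return the n highest-scoring (item_id, score) pairs for one user."""
    scores = item_factors @ user_vec
    ranked = np.argsort(-scores)  # indices from highest to lowest score
    recs = [(int(item_ids[i]), float(scores[i])) for i in ranked
            if int(item_ids[i]) not in seen]
    return recs[:n]

# Made-up factors for four movie ids and one user
item_ids = np.array([296, 593, 858, 1961])
item_factors = np.array([
    [1.0, 0.5],
    [0.2, 1.5],
    [1.2, 1.0],
    [0.1, 0.3],
])
user_vec = np.array([2.0, 1.0])

print(top_n_for_user(user_vec, item_factors, item_ids, n=2, seen={296}))
```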

In [None]:
top_movie = None  # replace with a movie_id
# {top_movie} interpolates the Python variable into the shell command
!grep {top_movie} < data/movies.csv

# Cold Start Model

**Machine Learning in Recommendation Systems**

The best recommendation systems rely on machine learning. The model keeps learning from and adapting to the platform's users and the products it sells, which lets the platform optimize and personalize content for each user.

**Cold Start Problem**

Success depends strongly on how quickly the platform can adapt to a new person or a new search and still provide a good, personalized service.

**Product vs Visitor Cold Start**

Both types can occur: a new movie (product) or a new visitor on the platform.

Use content-based filtering to address this challenge:
* For new products, build recommendations from their metadata first
* A visitor's actions are not used until some time has passed, i.e. until we know enough about them

**Best Strategy for Visitor Cold Start**

Use popularity-based recommendations, segmented by:
* Regional trends, e.g. global, local
* Time-based trends, e.g. time of day, time of year
* Geolocation, e.g. zipcode, region, country
* Platform, e.g. mobile, desktop

Make clusters within these categories:
* KMeans
* Prefer items with high scores and tight confidence intervals
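
One hedged way to combine high scores with tight confidence intervals in a popularity fallback is to rank movies by a lower confidence bound on their mean rating, so that a movie with a few scattered ratings does not outrank a consistently well-rated one. The helper name `popularity_ranking`, the `z` value, the `min_count` threshold, and the sample ratings are all assumptions.

```python
import math
from collections import defaultdict
from statistics import mean, stdev

def popularity_ranking(ratings, z=1.96, min_count=2):
    """Rank movie ids by mean rating minus z standard errors.

    ratings: iterable of (movie_id, rating) pairs.
    """
    by_movie = defaultdict(list)
    for movie_id, rating in ratings:
        by_movie[movie_id].append(rating)

    bounds = {}
    for movie_id, values in by_movie.items():
        if len(values) < min_count:
            continue  # too few ratings to estimate a spread
        std_err = stdev(values) / math.sqrt(len(values))
        bounds[movie_id] = mean(values) - z * std_err
    return sorted(bounds, key=bounds.get, reverse=True)

# Made-up (movie_id, rating) pairs
ratings = [
    (858, 5), (858, 4), (858, 5), (858, 4), (858, 5), (858, 4),
    (593, 5), (593, 1),
    (148, 3), (148, 3), (148, 3),
    (1419, 5),
]
print(popularity_ranking(ratings))
```

The same ranking could be computed per cluster or per region by filtering `ratings` to that segment first.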

**Penalizing Some User Types**

We can also distinguish users by how much they jump between different movies. Recommendations drawn from users who jump around a lot can be given a lower weight.

**Limitations**
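* ALS cannot predict for users or movies that are missing from the training data: it returns NaN for them, and we have to drop those rows. Dropping them works poorly in practice, because those users and movies get no recommendation at all.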
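
A minimal sketch of filling the gaps instead of dropping them: keep the ALS prediction where one exists, otherwise fall back to the mean rating of the user's demographic group, and finally to the global mean. The function `fill_missing`, the grouping key, and all numbers are hypothetical.

```python
import math

def fill_missing(prediction, group_key, group_means, global_mean):
    """Return the prediction, or a fallback when it is None or NaN."""
    if prediction is not None and not math.isnan(prediction):
        return prediction
    # Cold start: use the user's group mean when available, else the global mean
    return group_means.get(group_key, global_mean)

# Illustrative group means keyed by (gender, occupation)
group_means = {('M', 'scientist'): 3.8, ('F', 'executive/managerial'): 3.6}
global_mean = 3.5

print(fill_missing(4.2, ('M', 'scientist'), group_means, global_mean))
print(fill_missing(float('nan'), ('M', 'scientist'), group_means, global_mean))
print(fill_missing(float('nan'), ('M', 'self-employed'), group_means, global_mean))
```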

## KMeans

In [None]:
# scikit-learn works on in-memory numeric arrays, not Spark DataFrames
X = trainingdata.select('user_id', 'movie_id', 'rating').toPandas().values

k_means = KMeans(n_clusters=3)  # number of clusters must be set at initialization
k_means.fit(X)  # run the clustering algorithm
cluster_assignments = k_means.predict(X)  # cluster index for each row

print(metrics.calinski_harabasz_score(X, cluster_assignments))

# Silhouette score on a random sample; the full pairwise computation
# is too expensive for roughly 500k rows
labels = k_means.labels_
metrics.silhouette_score(X, labels, metric='euclidean', sample_size=10000, random_state=100)
