# INM432 Big Data Coursework 2016/2017 Part 2: Spark Pipelines and Evaluation of Scaling of Algorithms

### Team Members: Ryan Nazareth and Aimore Dutra 

---

# Introduction
With the internet connecting everything and everyone, it has become easy to access a large amount of information. However, this ease of access has also brought problems: consumers are faced with an overwhelming number of items and lose time trying to find what they are looking for.

Hence, large companies with vast product catalogues are keen to advertise their products intelligently, helping their clients find what they want.

Recommendation systems are being developed to address this problem.

---


## Task

Our task is to create a Recommender System that can suggest new movies to users based on their preferences (ratings).

There are several possible approaches for the recommendation task [1]:

##### 1) Recommend the most popular items

##### 2) Use a classifier to make recommendation

##### 3) Collaborative Filtering

#### We chose the Collaborative Filtering technique because this method gives more personalization and makes a more efficient use of data.

The Collaborative Filtering approach has two main types:
* a) User to User
* b) Item to Item

The item-to-item variant tends to be more accurate and computationally cheaper in most cases.
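
To make the item-to-item idea concrete, here is a minimal sketch that scores how similar two items are from the ratings users gave them. The toy ratings and the helper names (`item_vector`, `cosine`) are invented for illustration, and cosine similarity is only one of several possible similarity measures; this is not the method used later in this notebook.

```python
from math import sqrt

# Toy ratings, invented for illustration: {user: {item: rating}}.
ratings = {
    "u1": {"A": 5.0, "B": 4.0},
    "u2": {"A": 4.0, "B": 5.0, "C": 1.0},
    "u3": {"A": 1.0, "C": 5.0},
}

def item_vector(item):
    """Return {user: rating} for every user who rated `item`."""
    return {user: r[item] for user, r in ratings.items() if item in r}

def cosine(item_a, item_b):
    """Cosine similarity between two items; a missing rating counts as 0."""
    va, vb = item_vector(item_a), item_vector(item_b)
    shared = set(va) & set(vb)
    if not shared:
        return 0.0
    dot = sum(va[u] * vb[u] for u in shared)
    norm_a = sqrt(sum(v * v for v in va.values()))
    norm_b = sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b)

sim_ab = cosine("A", "B")  # A and B are rated alike by the same users
sim_ac = cosine("A", "C")  # A and C are rated in opposite ways
print(sim_ab, sim_ac)
```

In this toy data, items A and B are liked by the same users, so their similarity is higher than that of A and C. A recommender built on this would suggest B to a user who rated A highly.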




> #### References
[1] https://www.analyticsvidhya.com/blog/2016/06/quick-guide-build-recommendation-engine-python/


## Dataset

*Movies and, more recently, series have surged in popularity thanks to their quality and the sheer quantity at hand. Advances in technology have made them cheaper and quicker to produce, so there are now millions of movies and series available.
Not only is more content being created, but existing content is also being stored. As a result, viewers find it difficult to discover new video entertainment that they like.*



The dataset selected for the coursework is "ml-20m" from MovieLens, a movie recommendation service [1,2]. It contains:

- 27,278 movies (with 19 different Genres)
- 138,493 users
- 465,564 tag applications 
- and 20,000,263 ratings (from 0.5 to 5 stars, in half-star increments)

These data were created by users between January 09, 1995 and March 31, 2015.

The data are split across six comma-separated files with the following columns:
- genome-scores.csv: movieId, tagId, relevance
- genome-tags.csv: tagId, tag
- links.csv: movieId, imdbId, tmdbId
- movies.csv: movieId, title, genres
- ratings.csv: userId, movieId, rating, timestamp
- tags.csv: userId, movieId, tag, timestamp
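
As a small illustration of the ratings file layout, the sketch below parses a header line and one data row with the standard library. It assumes a comma-separated file with a header row, which is how the notebook reads `ratings.csv` further down; the helper name `parse_ratings` is hypothetical, and the data row reuses values from a sample row shown in this notebook.

```python
import csv
import io

# Header plus one data row in the assumed ratings.csv layout.
sample = "userId,movieId,rating,timestamp\n1,31,2.5,1260759144\n"

def parse_ratings(text):
    """Parse ratings CSV text into a list of typed dictionaries."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {
            "userId": int(row["userId"]),
            "movieId": int(row["movieId"]),
            "rating": float(row["rating"]),        # half-star values such as 2.5
            "timestamp": int(row["timestamp"]),    # seconds since the Unix epoch
        }
        for row in reader
    ]

rows = parse_ratings(sample)
print(rows[0])
```

The same type conversions (integers for the identifiers and timestamp, a floating-point number for the rating) are what the column casts in the loading cell perform on the Spark DataFrame.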


> #### References
[1] F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872

>[2] http://files.grouplens.org/datasets/movielens/ml-20m-README.html

# Technique Used
Collaborative filtering is commonly used for recommender systems. These techniques aim to fill in the missing entries of a user-item association matrix. spark.ml currently supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict missing entries. spark.ml uses the alternating least squares (ALS) algorithm to learn these latent factors. The implementation in spark.ml has the following parameters:

- numBlocks is the number of blocks the users and items will be partitioned into in order to parallelize computation (defaults to 10).
- rank is the number of latent factors in the model (defaults to 10).
- maxIter is the maximum number of iterations to run (defaults to 10).
- regParam specifies the regularization parameter in ALS (defaults to 0.1).
- implicitPrefs specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data (defaults to false which means using explicit feedback).
- alpha is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations (defaults to 1.0).
- nonnegative specifies whether or not to use nonnegative constraints for least squares (defaults to false).
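
The alternating idea behind ALS can be sketched in a few lines of plain Python. The example below is a deliberately simplified rank-1 version on invented ratings: it holds the item factors fixed and solves for the user factors in closed form, then swaps roles, and repeats. The names `observed`, `solve_side`, `reg` and `max_iter` are assumptions made for this sketch (with `reg` and `max_iter` playing the roles of regParam and maxIter); it is not Spark's implementation, which is distributed and supports higher ranks.

```python
# Observed ratings as (user, item, rating); values are invented for illustration.
observed = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0),
            (1, 2, 2.0), (2, 1, 2.0), (2, 2, 1.0)]
n_users, n_items = 3, 3
reg = 0.1       # plays the role of regParam
max_iter = 10   # plays the role of maxIter

def solve_side(fixed, n_free, free_idx, fixed_idx):
    """Regularized least-squares update for one side while the other is fixed."""
    num = [0.0] * n_free
    den = [reg] * n_free
    for entry in observed:
        f, g, r = entry[free_idx], entry[fixed_idx], entry[2]
        num[f] += r * fixed[g]
        den[f] += fixed[g] ** 2
    return [n / d for n, d in zip(num, den)]

def rmse(u, v):
    """Root-mean-square error over the observed entries only."""
    return (sum((r - u[i] * v[j]) ** 2 for i, j, r in observed) / len(observed)) ** 0.5

user_f = [1.0] * n_users   # rank 1: one latent factor per user
item_f = [1.0] * n_items   # and one latent factor per item
initial_rmse = rmse(user_f, item_f)

for _ in range(max_iter):
    user_f = solve_side(item_f, n_users, 0, 1)   # fix items, solve for users
    item_f = solve_side(user_f, n_items, 1, 0)   # fix users, solve for items

final_rmse = rmse(user_f, item_f)

# Fill in an entry that was never observed: user 0, item 2.
missing_prediction = user_f[0] * item_f[2]
print(initial_rmse, final_rmse, missing_prediction)
```

Each half-step is an ordinary regularized least-squares problem, which is why the method is called alternating least squares. The product of a user factor and an item factor then gives a prediction for any user-item pair, including pairs with no rating.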

> #### References
[1] https://spark.apache.org/docs/latest/ml-collaborative-filtering.html

>[2] https://spark.apache.org/docs/latest/ml-tuning.html

---
## Loading data and applying transformations

In [20]:
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.types import DoubleType
from pyspark.sql.types import IntegerType
import math 

spark = SparkSession.builder.getOrCreate() # create a SparkSession 

ratings = spark.read.format("csv").option("header", "true").load("hdfs://saltdean/data/movielens/ml-latest-small/ratings.csv")

ratings.printSchema() # print the column names and types to check which features are present

ratings = ratings.withColumn("movieId", ratings["movieId"].cast(IntegerType()))
ratings = ratings.withColumn("rating", ratings["rating"].cast(DoubleType()))
ratings = ratings.withColumn("timestamp", ratings["timestamp"].cast(IntegerType()))
ratings = ratings.withColumn("userId", ratings["userId"].cast(IntegerType()))
# ratings.take(3)
            

## Splitting into train and test set, setting parameter grid and estimator 

In [21]:
(training, test) = ratings.randomSplit([0.8, 0.2])


training.take(3)
# Rows are returned as Row(userId=1, movieId=31, rating=2.5, timestamp=1260759144).
# The column order does not matter: ALS selects columns by name (userCol, itemCol, ratingCol).

als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating")

# We use a ParamGridBuilder to construct a grid of parameters to search over.
paramGrid = ParamGridBuilder().addGrid(als.regParam, [0.03,0.1,0.3]).addGrid(als.rank, [5,10,50]).build()

# In this case the estimator is the ALS model defined above.
# A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
regEval = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
tvs = TrainValidationSplit(estimator=als,
                            estimatorParamMaps=paramGrid,
                            evaluator=regEval,
                            # 80% of the data will be used for training, 20% for validation.
                            trainRatio=0.8)
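
The parameter grid in the cell above is the Cartesian product of the two value lists, so it contains nine candidate models. The plain-Python sketch below enumerates the same combinations and shows the selection rule of picking the lowest validation RMSE. `pick_best` and its `validation_rmse` argument are hypothetical stand-ins for the fit-and-evaluate step that TrainValidationSplit performs; no real scores are computed here.

```python
from itertools import product

reg_params = [0.03, 0.1, 0.3]   # values passed to addGrid(als.regParam, ...)
ranks = [5, 10, 50]             # values passed to addGrid(als.rank, ...)

# Every (regParam, rank) pair is one candidate model.
grid = [{"regParam": reg, "rank": rank} for reg, rank in product(reg_params, ranks)]

def pick_best(candidates, validation_rmse):
    """Return the candidate with the lowest validation RMSE.

    `validation_rmse` is a hypothetical callable that would fit a model with
    the given parameters and return its RMSE on the validation split.
    """
    return min(candidates, key=validation_rmse)

print(len(grid))
for candidate in grid:
    print(candidate)
```

With `trainRatio=0.8`, each candidate is fitted once on 80% of the training data and scored once on the remaining 20%, so the search costs nine model fits plus a final refit of the best candidate.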

## Fitting the model, generating predictions and evaluating 

In [22]:
# Run TrainValidationSplit and keep the best set of parameters.
model = tvs.fit(training)

print('The best model was trained with rank %d' % model.bestModel.rank)
print('')

# Compute predictions on the test set.
predictions = model.transform(test)

# Drop rows whose prediction is NaN (cold-start problem: the user or movie
# was not seen during training).
predictions = predictions.dropna()

# Print the first five observations with their predictions.
for row in predictions.take(5):
    print('')
    print(row)

# Evaluate the model by computing the RMSE on the test data.
rmse = regEval.evaluate(predictions)
print('')
print("Root-mean-square error = " + str(rmse))

The best model was trained with rank 5


Row(userId=232, movieId=463, rating=4.0, timestamp=955089443, prediction=3.7277157306671143)

Row(userId=350, movieId=471, rating=3.0, timestamp=1011714986, prediction=3.852583646774292)

Row(userId=306, movieId=471, rating=3.0, timestamp=939718996, prediction=3.836204767227173)

Row(userId=92, movieId=471, rating=4.0, timestamp=848526594, prediction=3.782599449157715)

Row(userId=299, movieId=471, rating=4.5, timestamp=1344186741, prediction=3.888803243637085)

Root-mean-square error = 0.9801629526313057
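
To show what the RMSE metric measures, the sketch below recomputes it by hand on the five rows printed above. The helper name `rmse_of` is an assumption for this example. Because it uses only five rows, the result will generally differ from the value reported for the full test set.

```python
from math import sqrt

# (rating, prediction) pairs copied from the five rows printed above.
pairs = [
    (4.0, 3.7277157306671143),
    (3.0, 3.852583646774292),
    (3.0, 3.836204767227173),
    (4.0, 3.782599449157715),
    (4.5, 3.888803243637085),
]

def rmse_of(rating_prediction_pairs):
    """Square root of the mean squared difference between rating and prediction."""
    squared_errors = [(rating - pred) ** 2 for rating, pred in rating_prediction_pairs]
    return sqrt(sum(squared_errors) / len(squared_errors))

sample_rmse = rmse_of(pairs)
print(sample_rmse)
```

An RMSE close to 1 means that predictions are off by roughly one star on average, with large errors weighted more heavily than small ones.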




## Reducing training set size and applying all the steps above

In [23]:
(training, test) = ratings.randomSplit([0.7, 0.3]) #Using smaller training set 
training.take(3)
# Rows are returned as Row(userId=1, movieId=31, rating=2.5, timestamp=1260759144).
# The column order does not matter: ALS selects columns by name (userCol, itemCol, ratingCol).

als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating")

# We use a ParamGridBuilder to construct a grid of parameters to search over.
paramGrid = ParamGridBuilder().addGrid(als.regParam, [0.03,0.1,0.3]).addGrid(als.rank, [5,10,50]).build()

# In this case the estimator is the ALS model defined above.
# A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
regEval = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")
tvs = TrainValidationSplit(estimator=als,
                            estimatorParamMaps=paramGrid,
                            evaluator=regEval,
                            # 80% of the data will be used for training, 20% for validation.
                            trainRatio=0.8)

# Run TrainValidationSplit and keep the best set of parameters.
model = tvs.fit(training)

print('The best model was trained with rank %d' % model.bestModel.rank)
print('')

# Compute predictions on the test set.
predictions = model.transform(test)

# Drop rows whose prediction is NaN (cold-start problem: the user or movie
# was not seen during training).
predictions = predictions.dropna()

# Print the first five observations with their predictions.
for row in predictions.take(5):
    print('')
    print(row)

# Evaluate the model by computing the RMSE on the test data.
rmse = regEval.evaluate(predictions)
print('')
print("Root-mean-square error = " + str(rmse))

The best model was trained with rank 5


Row(userId=452, movieId=463, rating=2.0, timestamp=976424451, prediction=3.046884059906006)

Row(userId=534, movieId=463, rating=4.0, timestamp=973377486, prediction=3.7673137187957764)

Row(userId=85, movieId=471, rating=3.0, timestamp=837512312, prediction=2.572993040084839)

Row(userId=350, movieId=471, rating=3.0, timestamp=1011714986, prediction=4.318910598754883)

Row(userId=602, movieId=471, rating=3.0, timestamp=842357922, prediction=3.957103729248047)

Root-mean-square error = 1.0040720904641864


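Reducing the training fraction from 80% to 70% raised the test RMSE from 0.980 to 1.004. Because randomSplit was called without a seed and each configuration was run only once, part of this difference may be due to the random split rather than to the smaller training set alone.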