# INM432 Big Data Coursework 2016/2017 Part 2: Spark Pipelines and Evaluation of Scaling of Algorithms

### Team Members: Ryan Nazareth and Aimore Dutra 

---

# 1) Introduction
With the internet connecting everything and everyone, it has become easy to access a large amount of information. However, this ease of access has also brought problems: consumers face an overwhelming number of items and lose time trying to find what they are looking for.

Hence, large companies with vast product catalogues are keen to advertise their products intelligently, helping their clients find what they want.

Recommendation systems are being developed to address this problem.

---


## 1.1) Task

Our task is to create a Recommender System that can suggest new movies to users based on their preferences (ratings).

There are several possible approaches for the recommendation task [1]:

##### 1) Recommend the most popular items
##### 2) Use a classifier to make recommendation
##### 3) Collaborative Filtering

#### We chose the Collaborative Filtering technique because it offers more personalization and makes more efficient use of the data.

The Collaborative Filtering approach has two main types:
* a) User to User
* b) Item to Item

Item-to-item filtering tends, in most cases, to be more accurate and computationally cheaper.
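
As a rough illustration of the item-to-item idea, the sketch below computes the cosine similarity between item rating vectors. The ratings dictionary and the helper names (`toy_ratings`, `item_vectors`, `item_similarity`) are hypothetical and are not part of the coursework code, which uses ALS instead.

```python
import math

# Hypothetical toy ratings: user -> {item: rating}. Not taken from MovieLens.
toy_ratings = {
    "u1": {"A": 5.0, "B": 4.0, "C": 1.0},
    "u2": {"A": 4.0, "B": 5.0},
    "u3": {"A": 1.0, "C": 5.0},
}

def item_vectors(ratings):
    """Turn user -> {item: rating} into item -> {user: rating}."""
    vectors = {}
    for user, items in ratings.items():
        for item, rating in items.items():
            vectors.setdefault(item, {})[user] = rating
    return vectors

def item_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse item vectors (missing ratings count as 0)."""
    common = set(vec_a) & set(vec_b)
    dot = sum(vec_a[u] * vec_b[u] for u in common)
    norm_a = math.sqrt(sum(r * r for r in vec_a.values()))
    norm_b = math.sqrt(sum(r * r for r in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

vectors = item_vectors(toy_ratings)
sim_ab = item_similarity(vectors["A"], vectors["B"])
sim_ac = item_similarity(vectors["A"], vectors["C"])

# Items A and B are rated alike by the same users, so they come out as more similar than A and C.
print("sim(A, B) > sim(A, C):", sim_ab > sim_ac)
```

A user who liked item A would then be recommended the unseen items most similar to A.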




> #### References
[1] https://www.analyticsvidhya.com/blog/2016/06/quick-guide-build-recommendation-engine-python/


## 1.2) Dataset

*Movies and, more recently, series have become increasingly popular thanks to their quality and the quantity available. Advances in technology have made them cheaper and quicker to produce, so there are now millions of movies and series available.
Not only is more content being created, but existing content is also being stored. As a result, viewers have difficulty finding new video entertainment that they like.*



The dataset selected for the coursework was "ml-20m" from MovieLens, a movie recommendation service [1,2]. We chose it because it is large and, most importantly, contains user ratings that allow us to use the Collaborative Filtering technique. The details of the dataset are below:

- 27,278 movies (with 19 different Genres)
- 138,493 users
- 465,564 tag applications 
- and 20,000,263 ratings (from 0.5 to 5 stars, in half-star increments)

These data were created by users between January 09, 1995 and March 31, 2015.

The data are divided into six files, with the following fields:
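
To give a sense of how sparse the user-movie rating matrix is, the counts listed above can be combined into a density figure. This is a rough sketch: it assumes that every listed user and movie forms one row or column of the matrix, which the dataset description does not state explicitly.

```python
# Counts taken from the dataset summary above
n_movies = 27278
n_users = 138493
n_ratings = 20000263

# Number of cells in a full user x movie rating matrix (assumed layout)
n_cells = n_users * n_movies

# Fraction of cells that actually hold a rating
density = n_ratings / n_cells
print("Matrix density: {:.4%}".format(density))
```

Under this assumption only about half a percent of the possible ratings are observed, and the recommender has to estimate the rest.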
- genome-scores.csv: MovieID::TagId::relevance
- genome-tags.csv:   TagId::Tag
- links.csv:         MovieID::imdbID::tmdbID
- movies.csv:        MovieID::Title::Genres
- ratings.csv:       UserID::MovieID::Rating::Timestamp
- tags.csv:          UserID::MovieID::Tag::Timestamp


> #### References
[1] F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872

>[2] http://files.grouplens.org/datasets/movielens/ml-20m-README.html

## 1.3) Learning Algorithm - ALS

Collaborative filtering is commonly used for recommender systems [1]. These techniques aim to fill in the missing entries of a user-item association matrix. spark.ml currently supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict missing entries. spark.ml uses the alternating least squares (ALS) algorithm to learn these latent factors. The implementation in spark.ml has the following parameters:

- **numBlocks** is the number of blocks the users and items will be partitioned into in order to parallelize computation (defaults to 10).
- **rank** is the number of latent factors in the model (defaults to 10).
- **maxIter** is the maximum number of iterations to run (defaults to 10).
- **regParam** specifies the regularization parameter in ALS (the documentation [1] states a default of 1.0, while the ALS constructor itself defaults to 0.1).
- **implicitPrefs** specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data (defaults to false which means using explicit feedback).
- **alpha** is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations (defaults to 1.0).
- **nonnegative** specifies whether or not to use nonnegative constraints for least squares (defaults to false).
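
To make the roles of rank, maxIter and regParam more concrete, here is a minimal dense NumPy sketch of alternating least squares for explicit ratings. It is not the spark.ml implementation: it is not distributed, it uses a plain L2 penalty, and the function name `als_explicit`, its arguments and the toy matrix are assumptions made for illustration only.

```python
import numpy as np

def als_explicit(R, mask, rank=2, max_iter=10, reg_param=0.1, seed=0):
    """Toy ALS: R is (n_users, n_items); mask is True where a rating is observed."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))  # user latent factors
    V = rng.normal(scale=0.1, size=(n_items, rank))  # item latent factors
    penalty = reg_param * np.eye(rank)
    for _ in range(max_iter):
        # Fix the item factors and solve a small ridge regression per user.
        for u in range(n_users):
            obs = mask[u]
            if obs.any():
                Vo = V[obs]
                U[u] = np.linalg.solve(Vo.T @ Vo + penalty, Vo.T @ R[u, obs])
        # Fix the user factors and solve a small ridge regression per item.
        for i in range(n_items):
            obs = mask[:, i]
            if obs.any():
                Uo = U[obs]
                V[i] = np.linalg.solve(Uo.T @ Uo + penalty, Uo.T @ R[obs, i])
    return U, V

# Hypothetical 3 x 3 rating matrix; 0 marks a missing rating.
R = np.array([[5.0, 4.0, 0.0],
              [4.0, 0.0, 1.0],
              [0.0, 2.0, 5.0]])
mask = R > 0

U, V = als_explicit(R, mask, rank=2, max_iter=20, reg_param=0.1)
pred = U @ V.T  # predictions for every user-item pair, including the missing ones
rmse_observed = np.sqrt(np.mean((pred[mask] - R[mask]) ** 2))
print("RMSE on observed entries:", round(float(rmse_observed), 4))
```

Here rank sets the width of `U` and `V`, max_iter the number of alternations, and reg_param the strength of the penalty added to each least-squares solve.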


### Train-Validation Split
In addition to CrossValidator, Spark also offers TrainValidationSplit for hyper-parameter tuning. TrainValidationSplit evaluates each combination of parameters only once, as opposed to k times in the case of CrossValidator. It is therefore less expensive, but will not produce results that are as reliable when the training dataset is not sufficiently large. [2]
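
The cost difference can be sketched by counting model fits for the 5 x 5 grid over rank and maxIter used in section 2.2. The helper names and the choice of 3 folds are assumptions for illustration, and the final refit of the best model on the full training data is left out of both counts.

```python
from itertools import product

# Same grid values as the rank / maxIter search in section 2.2
ranks = [1, 3, 10, 30, 100]
max_iters = [1, 3, 10, 30, 100]
grid = list(product(ranks, max_iters))

def fits_train_validation_split(n_param_maps):
    # One fit per parameter combination (a single train/validation split)
    return n_param_maps

def fits_cross_validator(n_param_maps, num_folds):
    # One fit per parameter combination per fold
    return n_param_maps * num_folds

tvs_fits = fits_train_validation_split(len(grid))
cv_fits = fits_cross_validator(len(grid), num_folds=3)  # 3 folds is an assumed value of k
print("TrainValidationSplit fits:", tvs_fits)
print("CrossValidator fits:", cv_fits)
```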

> #### References
[1] https://spark.apache.org/docs/latest/ml-collaborative-filtering.html

>[2] https://spark.apache.org/docs/latest/ml-tuning.html

---
# 2) Code


## 2.1) Loading data and applying transformations

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.types import DoubleType
from pyspark.sql.types import IntegerType
import numpy as np
import math 
import time

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Load data from the path to a dataframe called "ratings"
## Small Dataset
ratings = spark.read.format("csv").option("header", "true").load("hdfs://saltdean/data/movielens/ml-latest-small/ratings.csv")
## Large Dataset
# ratings = spark.read.format("csv").option("header", "true").load("hdfs://saltdean/data/movielens/ml-20m/ratings.csv")
# Check which features are present 
print("ratings")
ratings.show()

# Load data from path to dataframe called "movies"
movies = spark.read.format("csv").option("header", "true").load("hdfs://saltdean/data/movielens/ml-latest-small/movies.csv")
# Check which features are present 
print("movies")
movies.show()

# Join them into "movielens"
print("movielens")
movielens = ratings.join(movies, "movieId")
# Check which features are present 
movielens.show()

# Cast data types from String to Integer and Double
# (columns are referenced on the joined dataframe itself)
movielens = movielens.withColumn("movieId", movielens["movieId"].cast(IntegerType()))
movielens = movielens.withColumn("rating", movielens["rating"].cast(DoubleType()))
movielens = movielens.withColumn("timestamp", movielens["timestamp"].cast(IntegerType()))
movielens = movielens.withColumn("userId", movielens["userId"].cast(IntegerType()))
# ratings.take(3)

# Print the types used in each column
ratings = movielens
ratings.printSchema()


ratings
+------+-------+------+----------+
|userId|movieId|rating| timestamp|
+------+-------+------+----------+
|     1|     31|   2.5|1260759144|
|     1|   1029|   3.0|1260759179|
|     1|   1061|   3.0|1260759182|
|     1|   1129|   2.0|1260759185|
|     1|   1172|   4.0|1260759205|
|     1|   1263|   2.0|1260759151|
|     1|   1287|   2.0|1260759187|
|     1|   1293|   2.0|1260759148|
|     1|   1339|   3.5|1260759125|
|     1|   1343|   2.0|1260759131|
|     1|   1371|   2.5|1260759135|
|     1|   1405|   1.0|1260759203|
|     1|   1953|   4.0|1260759191|
|     1|   2105|   4.0|1260759139|
|     1|   2150|   3.0|1260759194|
|     1|   2193|   2.0|1260759198|
|     1|   2294|   2.0|1260759108|
|     1|   2455|   2.5|1260759113|
|     1|   2968|   1.0|1260759200|
|     1|   3671|   3.0|1260759117|
+------+-------+------+----------+
only showing top 20 rows

movies
+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+------

## 2.2) Training and Testing the System
- Split the data into training (80%) and testing (20%) sets
- Change the training data size
- Do a Grid Search to select the best model
- Predict test data using the best model
- Evaluate the best model's performance and time taken for training and testing

In [None]:
# Split the data into training (80%) and hold-out testing data (20%)
(traindata, test) = ratings.randomSplit([0.8, 0.2], seed=123)

## Change the training data size
# (training, garbage) = traindata.randomSplit([0.999, 0.001])   # ~80,000 rows
# (training, garbage) = traindata.randomSplit([0.1, 0.9])     # ~8,000 rows
(training, garbage) = traindata.randomSplit([0.01, 0.99])   # ~800 rows
# (training, garbage) = traindata.randomSplit([0.001, 0.999]) # ~80 rows

# Print training data size
print('Training data size: ',training.count(),' rows') 

# Create an Alternating Least Squares (ALS) learning algorithm (Estimator)
als = ALS(rank=10,
          maxIter=5,
          userCol="userId",
          itemCol="movieId",
          ratingCol="rating")
### numBlocks is the number of blocks the users and items will be partitioned into in order to parallelize computation (defaults to 10).
### rank is the number of latent factors in the model (defaults to 10).
### maxIter is the maximum number of iterations to run (defaults to 10).
### regParam specifies the regularization parameter in ALS (the ALS constructor defaults to 0.1).

## Create a ParamGridBuilder to construct a grid of parameters to search over. (ParameterMaps)
paramGrid = ParamGridBuilder().addGrid(als.rank, [1,3,10,30,100])\
                                .addGrid(als.maxIter, [1,3,10,30,100])\
                                .build()
            
## No Grid Search
# paramGrid = ParamGridBuilder().build() # * UnComment this line and comment the block above to run quickly

# A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
# Here the Estimator is ALS and the Evaluator is a RegressionEvaluator using RMSE.
regEval = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")

####-------- TRAINING --------####
# Get start time
s=time.time()

# Train and Validate models
tvs = TrainValidationSplit(estimator=als,
                            estimatorParamMaps=paramGrid,
                            evaluator=regEval,
                            # 80% of the data will be used for training, 20% for validation.
                            trainRatio=0.8, seed=5)

# Run TrainValidationSplit to train and choose the best set of parameters.
model = tvs.fit(training)

# Get the best model parameters
# evaluatorMetrics = model.validationMetrics.copy()
# minParams = np.argmin(evaluatorMetrics)
# paramGrid[minParams]
# print('paramGrid[minParams]:', paramGrid[minParams])
best_model = model.bestModel
maxIter = (best_model
    ._java_obj     # Get Java object
    .parent()      # Get parent (ALS estimator)
    .getMaxIter()) # Get maxIter

rank = best_model.rank
print("rank:", rank)
print("maxIter:", maxIter)
print("regParam:", als.getRegParam())  # regParam is not in the grid, so every candidate uses the estimator's value
print('\n-----------')

# Get end time
e=time.time()

# Evaluate the model's predictions on the training data
predictions = model.transform(training)

# Replace NaN predictions (due to the cold start problem) with 0; the alternative is to drop those rows
# predictions = predictions.dropna()
predictions = predictions.fillna(0)

# Evaluate the overall performance of the model by computing the Root-mean-square error (RMSE) on the Training data
rmse = regEval.evaluate(predictions)
print("Training Error (RMS) = ", round(rmse, 4))

## Print the size of training data
# print('Training data size: ',training.count(),' rows')  
     
# Print the time spent to train
print('Training time: ',round(e-s, 4),' seconds')

####-------- TESTING --------####
# Get start time
s=time.time()

# Test the model's prediction in the hold-out Test data           
predictions = model.transform(test)

# Get end time
e=time.time()

# Replace NaN predictions (due to the cold start problem) with 0; the alternative is to drop those rows.
# Note that a 0 fill counts as a 0-star prediction and therefore raises the RMSE.
# predictions = predictions.dropna()
predictions = predictions.fillna(0)
               
# Evaluate the overall performance of the model by computing the Root-mean-square error (RMSE) on the Test data
rmse = regEval.evaluate(predictions)
print('')
print("Test Error (RMS) = ", round(rmse, 4))

## Print the size of test data
# print('Test data size: ',test.count(),' rows')  
     
# Print the time spent to test
print('Test time: ',round(e-s, 4),' seconds')

####------ Results for training and testing without Grid Search -------####
## Training data size: ~80,000 rows
# Training Error (RMS) =  0.6382  || Test Error (RMS) =  1.1182
# Training time:  10.497  seconds || Test time:  0.0601  seconds

## Training data size: ~8,000 rows
# Training Error (RMS) =  0.2326  || Test Error (RMS) =  2.8208
# Training time:  7.6864  seconds || Test time:  0.0603  seconds
    
## Training data size: ~800 rows
# Training Error (RMS) =   0.0547 || Test Error (RMS) =  3.7398
# Training time:  6.492  seconds  || Test time:  0.0514  seconds
 
## Training data size: ~80 rows
# Training Error (RMS) =  0.0501  || Test Error (RMS) =  3.7025
# Training time:  5.419  seconds  || Test time:  0.0504  seconds

# As we can see, the larger the dataset, the more computationally expensive training is.
# For small datasets the training error is low and the test error is high (meaning overfitting).
# When the dataset is large, the model generalizes better, reducing overfitting.
# One important point is that only when the dataset is really big (around 100,000 rows)
# does the model start to make good predictions.

# The Parameter Grid Search was applied to the largest dataset size and the results are documented below:
# Training Error (RMS) =  2.4116    || Test Error (RMS) =  3.2335
# Training time:  251.6069  seconds || Test time:  0.0446  seconds

Training data size:  850  rows


## 2.3) Adding a new user and recommending movies to them

In [None]:
new_user_ID = 0

# The format of each line is (userID, movieID, rating)
new_user_ratings = [
     (0,32,3),   # Twelve Monkeys
     (0,589,5),  # Terminator 2
     (0,50,4),   # Usual Suspects
     (0,1080,4), # Monty Python 
     (0,1278,1), # Young Frankenstein
     (0,1266,1), # Unforgiven 
     (0,1249,1), # Femme Nikita 
     (0,1090,1), # Platoon 
     (0,919,1) , # Wizard of Oz
     (0,47,5)    # Seven 
    ]

# "spark" is the SparkSession created in section 2.1; its SparkContext replaces the undefined "sc"
new_user_ratings_RDD = spark.sparkContext.parallelize(new_user_ratings)
print('New user ratings: %s' % new_user_ratings_RDD.take(10))

# Build a dataframe of the new user's ratings with the same column names as "ratings"
df1 = spark.createDataFrame(new_user_ratings, ["userId", "movieId", "rating"])
df1 = df1.withColumn("rating", df1["rating"].cast(DoubleType()))

df1.collect()

df1.first()
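
One possible way to turn a new user's ratings into recommendations, without retraining the whole model, is to hold the learned item factors fixed and solve a single ridge regression for the new user's latent vector. The NumPy sketch below illustrates that idea on hypothetical item factors. It is not the method used in this coursework, and the names `fold_in_user` and `recommend` are assumptions.

```python
import numpy as np

def fold_in_user(item_factors, rated_idx, ratings, reg_param=0.1):
    """Solve a ridge regression for one new user's latent vector, with item factors held fixed."""
    V = item_factors[rated_idx]
    rank = item_factors.shape[1]
    return np.linalg.solve(V.T @ V + reg_param * np.eye(rank), V.T @ np.asarray(ratings))

def recommend(item_factors, user_vector, rated_idx, top_n=2):
    """Return the indices of the top_n highest-scoring items the user has not rated yet."""
    scores = item_factors @ user_vector
    scores[list(rated_idx)] = -np.inf  # never recommend already-rated items
    return [int(i) for i in np.argsort(-scores)[:top_n]]

# Hypothetical item factors for 5 items with rank 2 (not learned from MovieLens)
item_factors = np.array([[1.0, 0.0],
                         [0.9, 0.1],
                         [0.0, 1.0],
                         [0.1, 0.9],
                         [0.5, 0.5]])

rated_idx = [0, 2]        # the new user rated items 0 and 2
new_ratings = [5.0, 1.0]  # liked item 0, disliked item 2

user_vec = fold_in_user(item_factors, rated_idx, new_ratings)
top = recommend(item_factors, user_vec, rated_idx, top_n=2)
print("Recommended item indices:", top)
```

In this toy case the user is steered towards the items whose factors resemble item 0, which they rated highly.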

# 3) Conclusions and Discussions

We implemented a collaborative filter that uses explicit user-item ratings, and used the ALS algorithm, which not only performs the matrix factorization but also minimizes the prediction error with respect to the latent factors. As a result, we obtained a system that can recommend movies based on the user's preferences (ratings). Although this is an efficient way to estimate unavailable ratings and therefore recommend products, there are still some issues that became apparent during the work. First, the cold start problem, which collaborative filtering by itself cannot solve. Second, the model only achieves good accuracy with very large training data.

Implicit and explicit

Reducing the training set size increases the test root-mean-square error.
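
The way cold-start predictions are handled also affects the reported error. The sketch below uses hypothetical (actual, predicted) pairs to compare replacing NaN predictions with 0, as done in section 2.2, against dropping them. Whether this accounts for part of the high test errors on the small training sets is an assumption that was not tested here.

```python
import math

def rmse(pairs):
    """Root-mean-square error over (actual, predicted) pairs."""
    return math.sqrt(sum((a - p) ** 2 for a, p in pairs) / len(pairs))

# Hypothetical (actual, predicted) ratings; NaN marks a cold-start prediction
pairs = [(4.0, 3.5), (3.0, 3.5), (5.0, float("nan")), (4.0, float("nan"))]

# Option 1: replace NaN predictions with 0, so each counts as a 0-star prediction
filled = [(a, 0.0 if math.isnan(p) else p) for a, p in pairs]

# Option 2: drop rows without a prediction and score only the remaining rows
dropped = [(a, p) for a, p in pairs if not math.isnan(p)]

rmse_filled = rmse(filled)
rmse_dropped = rmse(dropped)
print("RMSE with NaN replaced by 0:", round(rmse_filled, 4))
print("RMSE with NaN rows dropped:", round(rmse_dropped, 4))
```

With fewer training rows, more test users and movies are unseen, so more predictions fall into the NaN case.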


#### References
[1] Zhou, Y., Wilkinson, D., Schreiber, R. and Pan, R., 2008, June. Large-scale parallel collaborative filtering for the netflix prize. In International Conference on Algorithmic Applications in Management (pp. 337-348). Springer Berlin Heidelberg.

