# INM432 Big Data Coursework 2016/2017 Part 2: Spark Pipelines and Evaluation of Scaling of Algorithms

### Team Members: Ryan Nazareth and Aimore Dutra 

---

# 1) Introduction
With the internet connecting everything and everyone, it has become easy to access a large amount of information. However, access to so much data also brings problems: consumers face an overwhelming number of items and lose time trying to find what they are looking for.

Hence, large companies with vast product catalogues are keen to advertise their products intelligently, helping their customers find what they want.

Recommendation systems are being developed to address this problem.

---


## 1.1) Task

Our task is to create a Recommender System that can suggest new movies to users based on their preferences (ratings).

There are several possible approaches for the recommendation task [1]:

##### 1) Recommend the most popular items
##### 2) Use a classifier to make recommendation
##### 3) Collaborative Filtering

#### We chose the Collaborative Filtering technique because it gives more personalized recommendations and makes more efficient use of the data.

The Collaborative Filtering approach has two main types:
* a) User to User
* b) Item to Item

In most cases, item-to-item filtering tends to be more accurate and computationally cheaper than user-to-user filtering.
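
As a minimal sketch of the item-to-item idea, the snippet below computes the cosine similarity between movies from the ratings they received. The toy ratings, the user and movie names, and the helper functions are all hypothetical; this is only an illustration of neighbourhood-based filtering and is not the model-based ALS method used later in this notebook.

```python
import math

# Hypothetical toy ratings: user -> {movie: rating}. Not taken from MovieLens.
ratings = {
    "u1": {"A": 5.0, "B": 4.0},
    "u2": {"A": 4.0, "B": 5.0, "C": 1.0},
    "u3": {"A": 1.0, "C": 5.0},
}

def item_vector(item):
    """Ratings given to `item`, keyed by user."""
    return {user: r[item] for user, r in ratings.items() if item in r}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    common = set(a) & set(b)
    dot = sum(a[k] * b[k] for k in common)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Movies rated similarly by the same users end up with a high similarity.
sim_ab = cosine(item_vector("A"), item_vector("B"))
sim_ac = cosine(item_vector("A"), item_vector("C"))
print(round(sim_ab, 3), round(sim_ac, 3))
```

In this toy data, movie B is a much closer neighbour of movie A than movie C is, so a user who liked A would be recommended B first.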




> #### References
[1] https://www.analyticsvidhya.com/blog/2016/06/quick-guide-build-recommendation-engine-python/


## 1.2) Dataset

*Movies, and more recently series, have become hugely popular thanks to their quality and the sheer quantity available. Advances in technology have made them cheaper and quicker to produce, so there are now millions of movies and series available.
Not only is more content being created, but existing content is also being stored. As a result, viewers have difficulty finding new video entertainment that they like.*



The dataset selected for the coursework was "ml-20m" from MovieLens, a movie recommendation service [1,2]. We chose it because it is large and, most importantly, it contains user ratings, which allow us to use the Collaborative Filtering technique. The details of the dataset are given below:

- 27,278 movies (with 19 different Genres)
- 138,493 users
- 465,564 tag applications 
- and 20,000,263 ratings (on a scale of 0.5 to 5 stars, in half-star increments)

These data were created by users between January 09, 1995 and March 31, 2015.

The data are divided into six files, with the following fields:
- genome-scores.csv: MovieID::TagId::relevance
- genome-tags.csv:   TagId::Tag
- links.csv:         MovieID::imdbID::tmdbID
- movies.csv:        MovieID::Title::Genres
- ratings.csv:       UserID::MovieID::Rating::Timestamp
- tags.csv:          UserID::MovieID::Tag::Timestamp


> #### References
[1] F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19 (December 2015), 19 pages. DOI=http://dx.doi.org/10.1145/2827872

>[2] http://files.grouplens.org/datasets/movielens/ml-20m-README.html

## 1.3) Technique Used
Collaborative filtering is commonly used for recommender systems. These techniques aim to fill in the missing entries of a user-item association matrix. spark.ml currently supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict missing entries. spark.ml uses the alternating least squares (ALS) algorithm to learn these latent factors. The implementation in spark.ml has the following parameters:

- numBlocks is the number of blocks the users and items will be partitioned into in order to parallelize computation (defaults to 10).
- rank is the number of latent factors in the model (defaults to 10).
- maxIter is the maximum number of iterations to run (defaults to 10).
- regParam specifies the regularization parameter in ALS (defaults to 0.1 in the `pyspark.ml` `ALS` constructor).
- implicitPrefs specifies whether to use the explicit feedback ALS variant or one adapted for implicit feedback data (defaults to false which means using explicit feedback).
- alpha is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations (defaults to 1.0).
- nonnegative specifies whether or not to use nonnegative constraints for least squares (defaults to false).
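
To make the alternating idea concrete, here is a minimal pure-Python sketch of ALS with a single latent factor (rank 1), where each least-squares step has a closed-form scalar solution. The toy ratings and all names are hypothetical, and the plain ridge penalty is a simplification; it is not a description of how spark.ml implements ALS internally.

```python
# Hypothetical toy data: (user, item, rating) triples.
observed = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0),
            (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]
n_users, n_items = 3, 3
reg = 0.1       # plays the role of regParam
max_iter = 10   # plays the role of maxIter

user_f = [1.0] * n_users  # one latent factor per user (rank = 1)
item_f = [1.0] * n_items  # one latent factor per item

def solve(fixed, pairs):
    """Closed-form ridge solution for one scalar factor, given the fixed side."""
    num = sum(r * fixed[j] for j, r in pairs)
    den = reg + sum(fixed[j] ** 2 for j, _ in pairs)
    return num / den

def sse():
    """Sum of squared errors over the observed ratings."""
    return sum((r - user_f[u] * item_f[i]) ** 2 for u, i, r in observed)

initial_error = sse()
for _ in range(max_iter):
    # Fix the item factors and solve for each user factor.
    for u in range(n_users):
        user_f[u] = solve(item_f, [(i, r) for uu, i, r in observed if uu == u])
    # Fix the user factors and solve for each item factor.
    for i in range(n_items):
        item_f[i] = solve(user_f, [(uu, r) for uu, ii, r in observed if ii == i])
final_error = sse()
print(initial_error, final_error)
```

A missing entry is then predicted as the product of the corresponding user and item factors; with a higher rank, the scalar division becomes a small linear system per user or item.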

> #### References
[1] https://spark.apache.org/docs/latest/ml-collaborative-filtering.html

>[2] https://spark.apache.org/docs/latest/ml-tuning.html

---
# 2) Code


### Loading data and applying transformations

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.sql.types import DoubleType
from pyspark.sql.types import IntegerType
import math 
import time

# Create a SparkSession
spark = SparkSession.builder.getOrCreate()

# Load data from the path to a dataframe called "ratings"
## Small Dataset
ratings = spark.read.format("csv").option("header", "true").load("hdfs://saltdean/data/movielens/ml-latest-small/ratings.csv")
## Large Dataset
# ratings = spark.read.format("csv").option("header", "true").load("hdfs://saltdean/data/movielens/ml-20m/ratings.csv")
# Check which columns are present
ratings.printSchema()

# Cast data type from String to Integer and Double
ratings = ratings.withColumn("movieId", ratings["movieId"].cast(IntegerType()))
ratings = ratings.withColumn("rating", ratings["rating"].cast(DoubleType()))
ratings = ratings.withColumn("timestamp", ratings["timestamp"].cast(IntegerType()))
ratings = ratings.withColumn("userId", ratings["userId"].cast(IntegerType()))
# ratings.take(3)
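
The casts in the cell above are needed because a CSV file read with only the header option yields string columns. The sketch below shows the same effect with Python's standard `csv` module on a hypothetical two-row sample (the values are made up for illustration and the `casts` mapping is an assumption mirroring the column types chosen above).

```python
import csv
import io

# Hypothetical two-row sample in the ratings.csv column order used above.
sample = io.StringIO(
    "userId,movieId,rating,timestamp\n"
    "1,31,2.5,1260759144\n"
    "1,1029,3.0,1260759179\n"
)

rows = list(csv.DictReader(sample))
# Without a schema, every field is read as a string.
raw_types = {name: type(value).__name__ for name, value in rows[0].items()}

# Convert each column to the type the model expects.
casts = {"userId": int, "movieId": int, "rating": float, "timestamp": int}
typed_rows = [
    {name: casts[name](value) for name, value in row.items()} for row in rows
]
print(raw_types)
print(typed_rows[0])
```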


## 2.1) Approach 1:
- Split the data into training (80%) and testing (20%) sets
- Do a Grid Search to select the best model
- Predict test data using the best model
- Evaluate the best model's performance and time taken for training and testing

In [2]:
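# Take a random 0.1% sample of the ratings so that this run finishes quickly (the rest is discarded)
(data, nothing) = ratings.randomSplit([0.001, 0.999])
print('1')
# Split the sample into training (80%) and hold-out testing data (20%)
(training, test) = data.randomSplit([0.8, 0.2])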
print('Data size: ',data.count(),' rows') 
# print('Test data size: ',test.count(),' rows') 


# To run on the full data instead, split the ratings directly into training (80%) and testing (20%)
# (training, test) = ratings.randomSplit([0.8, 0.2])
print('2')
# Create the Alternating Least Squares (ALS) estimator
als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating")

# Create a ParamGridBuilder to construct a grid of parameters to search over.
# paramGrid = ParamGridBuilder().addGrid(als.regParam, [0.03,0.1,0.3]).addGrid(als.rank, [5,10,50]).addGrid(als.maxIter, [1,5,10]).build()
# An empty grid is used for this quick run, so only the ALS settings above are trained.
paramGrid = ParamGridBuilder().build()
print('3')
# The estimator here is ALS.
# A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
regEval = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")

#### Train ####
# Get start time
s=time.time()
print('4')
# Train and Validate models
tvs = TrainValidationSplit(estimator=als,
                            estimatorParamMaps=paramGrid,
                            evaluator=regEval,
                            # 80% of the data will be used for training, 20% for validation.
                            trainRatio=0.8)

# Run TrainValidationSplit to choose the best set of parameters.
model = tvs.fit(training)
print ('The best model was trained with rank %d'%model.bestModel.rank)
print('')

# Get end time
e=time.time()

# Apply the model to the training data to measure the training error
predictions = model.transform(training)

# Drop any rows with nan values from prediction (due to cold start problem)
predictions = predictions.dropna()

# Evaluate the overall performance of the model by computing the Root-mean-square error (RMSE) on the Training data
rmse = regEval.evaluate(predictions)
print('')
print("Training Error (RMS) = " + str(rmse))

# Print the size of training data
print('Training data size: ',training.count(),' rows')  
     
# Print the time spent to train
print('Training time: ',e-s,' seconds')

#### Test ####
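# Get start time
s=time.time()

# Test the model's prediction in the hold-out Test data           
predictions = model.transform(test)

# Get end time
# Note: transform() is lazy, so this interval covers only building the query plan;
# the predictions themselves are computed later, when take() or evaluate() runs
e=time.time()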

# Drop any rows with nan values from prediction (due to cold start problem)
predictions = predictions.dropna()
                
# Print the first 5 observations and predictions
for row in predictions.take(5):
    print('')
    print(row)
               
# Evaluate the overall performance of the model by computing the Root-mean-square error (RMSE) on the Test data
rmse = regEval.evaluate(predictions)
print('')
print("Test Error (RMS) = " + str(rmse))

# Print the size of test data
print('Test data size: ',test.count(),' rows')  
     
# Print the time spent to test
print('Test time: ',e-s,' seconds')

1
Data size:  107  rows
2
3
4
The best model was trained with rank 10


Training Error (RMS) = 0.0032372987037692395
Training data size:  92  rows
Training time:  12.917011260986328  seconds

Row(userId=615, movieId=7438, rating=4.0, timestamp=1408778718, prediction=-0.30026188492774963)

Test Error (RMS) = 4.30026188492775
Test data size:  15  rows
Test time:  0.07261943817138672  seconds
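
As a rough sketch of what the RMSE evaluation does after the NaN rows are dropped, the helper below computes the root-mean-square error over (rating, prediction) pairs in plain Python. The function name `rmse` is hypothetical, and the NaN row is a made-up cold-start example; the first pair is the single prediction printed in the output above, and the result is consistent with the reported test error.

```python
import math

def rmse(pairs):
    """Root-mean-square error over (rating, prediction) pairs, skipping NaN predictions."""
    kept = [(r, p) for r, p in pairs if not math.isnan(p)]
    if not kept:
        return float("nan")
    return math.sqrt(sum((r - p) ** 2 for r, p in kept) / len(kept))

# The prediction printed in the output above, plus one made-up
# cold-start row (NaN prediction) to show that it is ignored.
pairs = [(4.0, -0.30026188492774963), (3.0, float("nan"))]
test_rmse = rmse(pairs)
print(test_rmse)
```

With a single surviving row, the RMSE reduces to the absolute error of that one prediction, which is why so small a sample gives an unreliable test score.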


## 2.2) Approach 2:

### Reducing training set size and applying all the steps above

- Split the data into training (60%) and testing (40%) sets
- Do a Grid Search to select the best model
- Predict test data using the best model
- Evaluate the best model's performance and time taken for training and testing

In [None]:
# Split the data into training (60%) and hold-out testing data (40%)
(training, test) = ratings.randomSplit([0.6, 0.4])


als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating")

# Create a ParamGridBuilder to construct a grid of parameters to search over.
paramGrid = ParamGridBuilder().addGrid(als.regParam, [0.03,0.1,0.3]).addGrid(als.rank, [5,10,50]).build()
    
# The estimator here is ALS.
# A TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.
regEval = RegressionEvaluator(metricName="rmse", labelCol="rating", predictionCol="prediction")

## Train ##
# Get start time
s=time.time()

# Train and Validate models
tvs = TrainValidationSplit(estimator=als,
                            estimatorParamMaps=paramGrid,
                            evaluator=regEval,
                            # 80% of the data will be used for training, 20% for validation.
                            trainRatio=0.8)

# Run TrainValidationSplit to choose the best set of parameters.
model = tvs.fit(training)
print ('The best model was trained with rank %d'%model.bestModel.rank)
print('')

# Print the time spent to train
print('Training time:',time.time()-s,' seconds')

## Test ##
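# Get start time
s=time.time()

# Test the model's prediction in the hold-out Test data           
predictions = model.transform(test)

# Print the time spent to test
# Note: transform() is lazy, so this measures only the time to build the query plan;
# the predictions are computed later, when take() or evaluate() runs
print('Test time:',time.time()-s,' seconds')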

# Drop any rows with nan values from prediction (due to cold start problem)
predictions = predictions.dropna()
                
# Print the first 5 observations and predictions
for row in predictions.take(5):
    print('')
    print(row)
               
## Evaluate the overall performance of the model by computing the RMSE on the test data
rmse = regEval.evaluate(predictions)
print('')
print("Root-mean-square error = " + str(rmse))   
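
The grid in the cell above combines three `regParam` values with three `rank` values, so nine candidate models are trained and scored on the validation split. The sketch below enumerates those combinations in plain Python and selects the one with the lowest score. The names `pick_best` and `placeholder_rmse` are hypothetical, and the scoring function is a stand-in with no relation to real results; actual scores would come from fitting ALS and evaluating it.

```python
from itertools import product

reg_params = [0.03, 0.1, 0.3]  # values passed to addGrid(als.regParam, ...)
ranks = [5, 10, 50]            # values passed to addGrid(als.rank, ...)

# Every combination becomes one candidate model.
grid = [{"regParam": reg, "rank": rank} for reg, rank in product(reg_params, ranks)]

def pick_best(candidates, validation_rmse):
    """Return the candidate with the lowest validation RMSE."""
    return min(candidates, key=validation_rmse)

# Placeholder scoring function, purely for illustration: real scores would come
# from fitting ALS on the training split and evaluating on the validation split.
def placeholder_rmse(params):
    return abs(params["regParam"] - 0.1) + abs(params["rank"] - 10) / 100

best = pick_best(grid, placeholder_rmse)
print(len(grid), best)
```

Because the number of candidates is the product of the list lengths, adding a third parameter with three values (as in the commented-out grid of Approach 1) would triple the training cost to 27 models.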

# 3) Conclusions and Discussions
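Reducing the training set size increases the root-mean-square error. The Approach 1 run shown above illustrates the extreme case: trained on a 0.1% sample (92 training rows), the model reached a training RMSE of 0.0032 but a test RMSE of 4.30, and only one test prediction remained after the NaN (cold-start) rows were dropped.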

