# **Spark Movie Recommendation**
In this notebook, we use the Alternating Least Squares (ALS) algorithm with Spark APIs to predict ratings for the movies in the [MovieLens small dataset](https://grouplens.org/datasets/movielens/latest/).

# Data ETL and Data Exploration

In [None]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz
!tar xf spark-2.4.6-bin-hadoop2.7.tgz
!pip install -q findspark

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.6-bin-hadoop2.7"

In [None]:
import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()

In [None]:
!ls

sample_data  spark-2.4.6-bin-hadoop2.7	spark-2.4.6-bin-hadoop2.7.tgz


In [None]:
spark.version

'2.4.6'

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import math
%matplotlib inline



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
movies_df = spark.read.load("drive/My Drive/ml-latest-small/movies.csv", format='csv', header = True)
ratings_df = spark.read.load("drive/My Drive/ml-latest-small/ratings.csv", format='csv', header = True)
links_df = spark.read.load("drive/My Drive/ml-latest-small/links.csv", format='csv', header = True)
tags_df = spark.read.load("drive/My Drive/ml-latest-small/tags.csv", format='csv', header = True)

In [None]:
movies_df.show(5)

+-------+--------------------+--------------------+
|movieId|               title|              genres|
+-------+--------------------+--------------------+
|      1|    Toy Story (1995)|Adventure|Animati...|
|      2|      Jumanji (1995)|Adventure|Childre...|
|      3|Grumpier Old Men ...|      Comedy|Romance|
|      4|Waiting to Exhale...|Comedy|Drama|Romance|
|      5|Father of the Bri...|              Comedy|
+-------+--------------------+--------------------+
only showing top 5 rows



In [None]:
ratings_df.show(5)

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|   4.0|964982703|
|     1|      3|   4.0|964981247|
|     1|      6|   4.0|964982224|
|     1|     47|   5.0|964983815|
|     1|     50|   5.0|964982931|
+------+-------+------+---------+
only showing top 5 rows



In [None]:
links_df.show(5)

+-------+-------+------+
|movieId| imdbId|tmdbId|
+-------+-------+------+
|      1|0114709|   862|
|      2|0113497|  8844|
|      3|0113228| 15602|
|      4|0114885| 31357|
|      5|0113041| 11862|
+-------+-------+------+
only showing top 5 rows



In [None]:
tags_df.show(5)

+------+-------+---------------+----------+
|userId|movieId|            tag| timestamp|
+------+-------+---------------+----------+
|     2|  60756|          funny|1445714994|
|     2|  60756|Highly quotable|1445714996|
|     2|  60756|   will ferrell|1445714992|
|     2|  89774|   Boxing story|1445715207|
|     2|  89774|            MMA|1445715200|
+------+-------+---------------+----------+
only showing top 5 rows



In [None]:
tmp1 = ratings_df.groupBy("userID").count().toPandas()['count'].min()
tmp2 = ratings_df.groupBy("movieId").count().toPandas()['count'].min()
print('For the users that rated movies and the movies that were rated:')
print('Minimum number of ratings per user is {}'.format(tmp1))
print('Minimum number of ratings per movie is {}'.format(tmp2))

For the users that rated movies and the movies that were rated:
Minimum number of ratings per user is 20
Minimum number of ratings per movie is 1


In [None]:
tmp1 = sum(ratings_df.groupBy("movieId").count().toPandas()['count'] == 1)
tmp2 = ratings_df.select('movieId').distinct().count()
print('{} out of {} movies are rated by only one user'.format(tmp1, tmp2))

3446 out of 9724 movies are rated by only one user


# Part 1: Spark SQL and OLAP

In [None]:
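# createOrReplaceTempView replaces registerTempTable, which is deprecated since Spark 2.0
movies_df.createOrReplaceTempView("movies")
ratings_df.createOrReplaceTempView("ratings")
links_df.createOrReplaceTempView("links")
tags_df.createOrReplaceTempView("tags")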

### Q1: The number of Users

In [None]:
q1_result=spark.sql("Select Count(Distinct userId) as Number_of_Users from ratings")
q1_result.show()

+---------------+
|Number_of_Users|
+---------------+
|            610|
+---------------+



### Q2: The number of Movies

In [None]:
q2_result=spark.sql("Select Count(movieId) as Number_of_Movies from movies")
q2_result.show()

+----------------+
|Number_of_Movies|
+----------------+
|            9742|
+----------------+



### Q3: How many movies have been rated by users? List the movies that have never been rated

In [None]:
q3_result_1=spark.sql("Select Count(movieId) as Number_of_Rated_Movies From movies Where movieID in (Select movieId From ratings)")
q3_result_1.show()

+----------------------+
|Number_of_Rated_Movies|
+----------------------+
|                  9724|
+----------------------+



In [None]:
q3_result_2=spark.sql("Select movieId, title From movies Where movieID not in (Select movieId From ratings)")
q3_result_2.show()

+-------+--------------------+
|movieId|               title|
+-------+--------------------+
|   1076|Innocents, The (1...|
|   2939|      Niagara (1953)|
|   3338|For All Mankind (...|
|   3456|Color of Paradise...|
|   4194|I Know Where I'm ...|
|   5721|  Chosen, The (1981)|
|   6668|Road Home, The (W...|
|   6849|      Scrooge (1970)|
|   7020|        Proof (1991)|
|   7792|Parallax View, Th...|
|   8765|This Gun for Hire...|
|  25855|Roaring Twenties,...|
|  26085|Mutiny on the Bou...|
|  30892|In the Realms of ...|
|  32160|Twentieth Century...|
|  32371|Call Northside 77...|
|  34482|Browning Version,...|
|  85565|  Chalet Girl (2011)|
+-------+--------------------+



### Q4: List Movie Genres

In [None]:
q4_result=spark.sql("Select Distinct explode(split(genres,'[|]')) as genres From movies Order by 1")
q4_result.show()

+------------------+
|            genres|
+------------------+
|(no genres listed)|
|            Action|
|         Adventure|
|         Animation|
|          Children|
|            Comedy|
|             Crime|
|       Documentary|
|             Drama|
|           Fantasy|
|         Film-Noir|
|            Horror|
|              IMAX|
|           Musical|
|           Mystery|
|           Romance|
|            Sci-Fi|
|          Thriller|
|               War|
|           Western|
+------------------+



### Q5: Movies in Each Genre

In [None]:
q5_result_1=spark.sql("Select genres,Count(movieId) as Number_of_Movies From(Select explode(split(genres,'[|]')) as genres, movieId From movies) Group By 1 Order by 2 DESC")
q5_result_1.show()

+------------------+----------------+
|            genres|Number_of_Movies|
+------------------+----------------+
|             Drama|            4361|
|            Comedy|            3756|
|          Thriller|            1894|
|            Action|            1828|
|           Romance|            1596|
|         Adventure|            1263|
|             Crime|            1199|
|            Sci-Fi|             980|
|            Horror|             978|
|           Fantasy|             779|
|          Children|             664|
|         Animation|             611|
|           Mystery|             573|
|       Documentary|             440|
|               War|             382|
|           Musical|             334|
|           Western|             167|
|              IMAX|             158|
|         Film-Noir|              87|
|(no genres listed)|              34|
+------------------+----------------+



In [None]:
q5_result_2=spark.sql("Select genres, concat_ws(',',collect_set(title)) as list_of_movies From(Select explode(split(genres,'[|]')) as genres, title From movies) Group By 1")
q5_result_2.show()

+------------------+--------------------+
|            genres|      list_of_movies|
+------------------+--------------------+
|             Crime|Stealing Rembrand...|
|           Romance|Vampire in Brookl...|
|          Thriller|Element of Crime,...|
|         Adventure|Ice Age: Collisio...|
|             Drama|Airport '77 (1977...|
|               War|General, The (192...|
|       Documentary|The Barkley Marat...|
|           Fantasy|Masters of the Un...|
|           Mystery|Before and After ...|
|           Musical|U2: Rattle and Hu...|
|         Animation|Ice Age: Collisio...|
|         Film-Noir|Rififi (Du rififi...|
|(no genres listed)|T2 3-D: Battle Ac...|
|              IMAX|Harry Potter and ...|
|            Horror|Underworld: Rise ...|
|           Western|Man Who Shot Libe...|
|            Comedy|Hysteria (2011),H...|
|          Children|Ice Age: Collisio...|
|            Action|Stealing Rembrand...|
|            Sci-Fi|Push (2009),SORI:...|
+------------------+--------------------+

# Part 2: Spark ALS-based approach for model training
We will use Spark ML to predict the ratings. The model needs (user, item, rating) rows with numeric types, so we reuse `ratings_df`, drop the `timestamp` column, and cast the remaining columns.

In [None]:
ratings_df.show()

+------+-------+------+---------+
|userId|movieId|rating|timestamp|
+------+-------+------+---------+
|     1|      1|   4.0|964982703|
|     1|      3|   4.0|964981247|
|     1|      6|   4.0|964982224|
|     1|     47|   5.0|964983815|
|     1|     50|   5.0|964982931|
|     1|     70|   3.0|964982400|
|     1|    101|   5.0|964980868|
|     1|    110|   4.0|964982176|
|     1|    151|   5.0|964984041|
|     1|    157|   5.0|964984100|
|     1|    163|   5.0|964983650|
|     1|    216|   5.0|964981208|
|     1|    223|   3.0|964980985|
|     1|    231|   5.0|964981179|
|     1|    235|   4.0|964980908|
|     1|    260|   5.0|964981680|
|     1|    296|   3.0|964982967|
|     1|    316|   3.0|964982310|
|     1|    333|   5.0|964981179|
|     1|    349|   4.0|964982563|
+------+-------+------+---------+
only showing top 20 rows



In [None]:
movie_ratings=ratings_df.drop('timestamp')

In [None]:
# Data type convert
from pyspark.sql.types import IntegerType, FloatType
movie_ratings = movie_ratings.withColumn("userId", movie_ratings["userId"].cast(IntegerType()))
movie_ratings = movie_ratings.withColumn("movieId", movie_ratings["movieId"].cast(IntegerType()))
movie_ratings = movie_ratings.withColumn("rating", movie_ratings["rating"].cast(FloatType()))

In [None]:
movie_ratings.show()

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|      1|   4.0|
|     1|      3|   4.0|
|     1|      6|   4.0|
|     1|     47|   5.0|
|     1|     50|   5.0|
|     1|     70|   3.0|
|     1|    101|   5.0|
|     1|    110|   4.0|
|     1|    151|   5.0|
|     1|    157|   5.0|
|     1|    163|   5.0|
|     1|    216|   5.0|
|     1|    223|   3.0|
|     1|    231|   5.0|
|     1|    235|   4.0|
|     1|    260|   5.0|
|     1|    296|   3.0|
|     1|    316|   3.0|
|     1|    333|   5.0|
|     1|    349|   4.0|
+------+-------+------+
only showing top 20 rows



### ALS Model Selection and Evaluation

With the ALS model, we can use a grid search to find the optimal hyperparameters.
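
To make the alternating updates concrete, here is a minimal sketch of ALS with a single latent factor, written in plain Python on made-up toy ratings. It is an illustration under stated assumptions, not Spark's implementation: the function name `als_rank1`, the toy data, and the plain ridge penalty are all hypothetical, and Spark's `ALS` differs in details such as higher rank, distributed solves, and how the regularization is scaled.

```python
def als_rank1(ratings, n_users, n_items, reg=0.01, iters=20):
    """Fit one latent factor per user and per item by alternating ridge solves.

    ratings: list of (user, item, rating) triples with 0-based ids.
    """
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    for _ in range(iters):
        # Hold item factors fixed; each user's factor has a closed-form solution.
        for u in range(n_users):
            obs = [(i, r) for (uu, i, r) in ratings if uu == u]
            num = sum(r * item_f[i] for i, r in obs)
            den = reg + sum(item_f[i] ** 2 for i, _ in obs)
            user_f[u] = num / den
        # Hold user factors fixed; solve for each item's factor the same way.
        for i in range(n_items):
            obs = [(u, r) for (u, ii, r) in ratings if ii == i]
            num = sum(r * user_f[u] for u, r in obs)
            den = reg + sum(user_f[u] ** 2 for u, _ in obs)
            item_f[i] = num / den
    return user_f, item_f


# Toy data (made up): user 0 rates both items, user 1 rates only item 0.
toy_ratings = [(0, 0, 2.0), (0, 1, 4.0), (1, 0, 1.0)]
user_f, item_f = als_rank1(toy_ratings, n_users=2, n_items=2)

# Predicted rating for the unobserved (user 1, item 1) pair.
predicted = user_f[1] * item_f[1]
print(round(predicted, 2))
```

User 0 rates item 1 twice as high as item 0, so the model predicts that user 1 will also rate item 1 roughly twice as high as their observed rating of item 0. The regularization shrinks the prediction slightly below that.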

In [None]:
# import package
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.ml.tuning import CrossValidator,ParamGridBuilder

In [None]:
#Create test and train set
(training,test)=movie_ratings.randomSplit([0.8,0.2])

In [None]:
#Create ALS model
als = ALS(maxIter=5, rank=10, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")

In [None]:
#Tune model using ParamGridBuilder
paramGrid = (ParamGridBuilder()
             .addGrid(als.regParam, [0.05, 0.1, 0.3, 0.5])
             .addGrid(als.rank, [5, 10, 15])
             .addGrid(als.maxIter, [1, 5, 10])
             .build())

In [None]:
# Define evaluator as RMSE
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")

In [None]:
# Build Cross validation 
cv = CrossValidator(estimator=als, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5)
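
Before fitting, it helps to estimate how much work the grid search implies. The sketch below counts the candidate combinations and the cross-validation fits in plain Python, using the same values as the grid above. The variable names are illustrative only.

```python
from itertools import product

# Same candidate values as the ParamGridBuilder grid above.
grid = {
    "regParam": [0.05, 0.1, 0.3, 0.5],
    "rank": [5, 10, 15],
    "maxIter": [1, 5, 10],
}
num_folds = 5

# Every combination of one value per hyperparameter.
combinations = list(product(*grid.values()))
# Each combination is fit once per fold.
cv_fits = len(combinations) * num_folds

print(len(combinations), cv_fits)  # 36 180
```

So the cross-validation alone trains 180 ALS models, which is why the fit cell can take a long time on a single Colab machine.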

In [None]:
#Fit ALS model to training data
cvModel = cv.fit(training)

In [None]:
#Extract best model from the tuning exercise using ParamGridBuilder
bestModel=cvModel.bestModel

### Model testing
And finally, make a prediction and check the testing error.
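
For reference, the RMSE reported by the evaluator is the square root of the mean squared difference between actual and predicted ratings. A minimal plain-Python sketch, using made-up (rating, prediction) pairs rather than the model's output:

```python
import math

def rmse(pairs):
    """Root mean squared error over (actual, predicted) pairs."""
    squared_errors = [(actual - predicted) ** 2 for actual, predicted in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up (rating, prediction) pairs for illustration only.
sample = [(4.0, 3.5), (3.0, 3.5), (5.0, 4.0), (2.0, 3.0)]
print(rmse(sample))
```

Because the errors are squared before averaging, a few large misses raise the RMSE more than many small ones.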

In [None]:
#Generate predictions and evaluate using RMSE
predictions=bestModel.transform(test)
rmse = evaluator.evaluate(predictions)

In [None]:
#Print evaluation metrics and model parameters
print("RMSE = "+str(rmse))
print("**Best Model**")
# The tuned hyperparameters are read from the underlying Java estimator
print(" Rank: ", str(bestModel._java_obj.parent().getRank()))
print(" MaxIter: ", str(bestModel._java_obj.parent().getMaxIter()))
print(" RegParam: ", str(bestModel._java_obj.parent().getRegParam()))

RMSE = 0.8841516298164385
**Best Model**
 Rank:  5
 MaxIter:  10
 RegParam:  0.1


In [None]:
predictions.show()

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|    91|    471|   1.0| 3.4323022|
|   218|    471|   4.0| 3.0170097|
|   387|    471|   3.0| 3.0230432|
|   312|    471|   4.0| 3.5635166|
|   469|    471|   5.0| 3.4064238|
|   426|    471|   5.0| 3.1981883|
|   260|    471|   4.5| 3.1806548|
|   104|    471|   4.5| 3.2562525|
|   463|   1088|   3.5|  3.427435|
|   159|   1088|   4.0| 3.0020626|
|   606|   1088|   3.0|  3.067695|
|   111|   1088|   3.0| 3.6050856|
|    47|   1088|   4.0| 2.3147006|
|   177|   1088|   3.5| 3.4895747|
|   479|   1088|   4.0|  3.139378|
|   554|   1088|   5.0| 3.5118544|
|   594|   1088|   4.5|  4.278222|
|    10|   1088|   3.0| 3.2420814|
|   116|   1088|   4.5| 3.3564475|
|    19|   1238|   3.0|  3.167931|
+------+-------+------+----------+
only showing top 20 rows



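### Apply the model to the full dataset

The full dataset includes the rows the model was trained on, so the RMSE here is expected to be lower than on the held-out test set.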

In [None]:
alldata=bestModel.transform(movie_ratings)
rmse = evaluator.evaluate(alldata)
print ("RMSE = "+str(rmse))

RMSE = 0.690710880006859


In [None]:
alldata.createOrReplaceTempView("alldata")

In [None]:
spark.sql("Select * From alldata").show()

+------+-------+------+----------+
|userId|movieId|rating|prediction|
+------+-------+------+----------+
|   191|    148|   5.0| 4.9218035|
|   133|    471|   4.0| 3.2228916|
|   597|    471|   2.0| 3.7315211|
|   385|    471|   4.0| 3.4770637|
|   436|    471|   3.0| 3.4019136|
|   602|    471|   4.0| 3.4198265|
|    91|    471|   1.0| 3.4323022|
|   409|    471|   3.0|  4.106057|
|   372|    471|   3.0| 2.7508132|
|   599|    471|   2.5| 2.7531638|
|   603|    471|   4.0| 3.4116812|
|   182|    471|   4.5|  3.184734|
|   218|    471|   4.0| 3.0170097|
|   474|    471|   3.0| 3.5461674|
|   500|    471|   1.0|  2.102999|
|    57|    471|   3.0| 3.7029545|
|   462|    471|   2.5| 3.3789701|
|   387|    471|   3.0| 3.0230432|
|   610|    471|   4.0| 3.8253715|
|   217|    471|   2.0| 2.6545913|
+------+-------+------+----------+
only showing top 20 rows



In [None]:
spark.sql("select * from movies join alldata on movies.movieId=alldata.movieId").show()

+-------+--------------------+------+------+-------+------+----------+
|movieId|               title|genres|userId|movieId|rating|prediction|
+-------+--------------------+------+------+-------+------+----------+
|    148|Awfully Big Adven...| Drama|   191|    148|   5.0| 4.9218035|
|    471|Hudsucker Proxy, ...|Comedy|   133|    471|   4.0| 3.2228916|
|    471|Hudsucker Proxy, ...|Comedy|   597|    471|   2.0| 3.7315211|
|    471|Hudsucker Proxy, ...|Comedy|   385|    471|   4.0| 3.4770637|
|    471|Hudsucker Proxy, ...|Comedy|   436|    471|   3.0| 3.4019136|
|    471|Hudsucker Proxy, ...|Comedy|   602|    471|   4.0| 3.4198265|
|    471|Hudsucker Proxy, ...|Comedy|    91|    471|   1.0| 3.4323022|
|    471|Hudsucker Proxy, ...|Comedy|   409|    471|   3.0|  4.106057|
|    471|Hudsucker Proxy, ...|Comedy|   372|    471|   3.0| 2.7508132|
|    471|Hudsucker Proxy, ...|Comedy|   599|    471|   2.5| 2.7531638|
|    471|Hudsucker Proxy, ...|Comedy|   603|    471|   4.0| 3.4116812|
|    4

# Recommend movies to users with IDs 575 and 232
You can choose any users to generate recommendations for.
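
A factor model scores a (user, movie) pair by the dot product of the user's and the movie's latent vectors, and recommending means keeping the highest-scoring movies. The sketch below shows that idea in plain Python. The helper `top_n_items` and the two-dimensional vectors are hypothetical and are not taken from the fitted model.

```python
import heapq

def top_n_items(user_vec, item_vecs, n=2):
    """Return the n item ids with the highest dot-product score for one user."""
    def score(item_id):
        return sum(u * v for u, v in zip(user_vec, item_vecs[item_id]))
    return heapq.nlargest(n, item_vecs, key=score)

# Hypothetical 2-dimensional factors, not taken from the fitted model.
user_vec = [1.0, 0.5]
item_vecs = {
    "A": [2.0, 0.0],   # score 2.0
    "B": [0.0, 2.0],   # score 1.0
    "C": [1.0, 3.0],   # score 2.5
}
print(top_n_items(user_vec, item_vecs))  # ['C', 'A']
```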

In [None]:
!pip install koalas
import databricks.koalas as ks

Collecting koalas
  Downloading https://files.pythonhosted.org/packages/b6/c6/6fc08536962059596b2cbf6dae626f33e60ea171fbd5172297813de5dcf8/koalas-1.1.0-py3-none-any.whl (1.1MB)
Installing collected packages: koalas
Successfully installed koalas-1.1.0


In [None]:
userRecs = bestModel.recommendForAllUsers(10)

In [None]:
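# Index by userId so that .loc looks up a user rather than a row position in the default index
userRecs_ks=userRecs.to_koalas(index_col='userId')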
movies_ks=movies_df.to_koalas()

In [None]:
def movieRecommendation(inputId):
  recs_list=[]
  for recs in userRecs_ks.loc[str(inputId), 'recommendations']:
    recs_list.append(str(recs[0]))
  return (movies_ks[movies_ks['movieId'].isin(recs_list)])

In [None]:
print("Recommended movies for user with id '575' are as follows.")
movieRecommendation(575)

Recommended movies for user with id '575' are as follows.


Unnamed: 0,movieId,title,genres
2368,3142,U2: Rattle and Hum (1988),Documentary|Musical
2926,3925,Stranger Than Paradise (1984),Comedy|Drama
5458,26133,"Charlie Brown Christmas, A (1965)",Animation|Children|Comedy
5467,26171,Play Time (a.k.a. Playtime) (1967),Comedy
5867,32892,Ivan's Childhood (a.k.a. My Name is Ivan) (Iva...,Drama|War
6728,59018,"Visitor, The (2007)",Drama|Romance
6813,60943,Frozen River (2008),Drama
6986,66943,"Cottage, The (2008)",Comedy|Crime|Horror|Thriller
7001,67695,Observe and Report (2009),Action|Comedy
7817,92643,Monsieur Lazhar (2011),Children|Comedy|Drama


In [None]:
print("Recommended movies for user with id '232' are as follows.")
movieRecommendation(232)

Recommended movies for user with id '232' are as follows.


Unnamed: 0,movieId,title,genres
924,1223,"Grand Day Out with Wallace and Gromit, A (1989)",Adventure|Animation|Children|Comedy|Sci-Fi
2283,3030,Yojimbo (1961),Action|Adventure
2427,3235,Where the Buffalo Roam (1980),Comedy
2523,3379,On the Beach (1959),Drama
4121,5915,Victory (a.k.a. Escape to Victory) (1981),Action|Drama|War
4590,6818,Come and See (Idi i smotri) (1985),Drama|War
5202,8477,"Jetée, La (1962)",Romance|Sci-Fi
5230,8542,"Day at the Races, A (1937)",Comedy|Musical
7037,68945,Neon Genesis Evangelion: Death & Rebirth (Shin...,Action|Animation|Mystery|Sci-Fi
8839,132333,Seve (2014),Documentary|Drama


# Find similar movies for the movies with IDs 463 and 471

You can find similar movies by comparing the item factors learned by ALS.
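
Two common ways to compare factor vectors are cosine similarity, which looks only at direction, and Euclidean distance, which also depends on length. The plain-Python sketch below uses hypothetical two-dimensional vectors to show that the two can rank neighbors differently.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between vectors a and b (0.0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical factor vectors: `scaled` points the same way as `base` but is twice as long.
base = [1.0, 2.0]
scaled = [2.0, 4.0]
other = [2.0, 1.0]

print(cosine_similarity(base, scaled), euclidean_distance(base, scaled))
print(cosine_similarity(base, other), euclidean_distance(base, other))
```

Here cosine similarity treats `scaled` as the best match for `base` because it points in the same direction, while Euclidean distance considers `other` closer. This is one reason the two metrics can return different lists of similar movies.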

In [None]:
itemFactors=bestModel.itemFactors.to_koalas()

In [None]:
def similarMovies(inputId, matrix='cosine_similarity'):
  # `matrix` names the metric to use: 'cosine_similarity' or 'euclidean_distance'
  if matrix not in ('cosine_similarity', 'euclidean_distance'):
    return 'Unknown metric ' + str(matrix)
  try:
    movieFeature=np.array(itemFactors.loc[itemFactors.id==str(inputId),'features'].to_numpy()[0])
  except IndexError:
    # The model has no item factors for this id
    return 'There is no movie with id ' + str(inputId)

  rows=[]
  for id,feature in itemFactors.to_numpy():
    feature=np.array(feature)
    if matrix=='cosine_similarity':
      score=np.dot(movieFeature,feature)/(np.linalg.norm(movieFeature) * np.linalg.norm(feature))
    else:
      score=np.linalg.norm(movieFeature-feature)
    rows.append({'movieId':str(id), matrix:score})
  # Best matches first: highest cosine similarity or smallest distance.
  # Row 0 is the input movie itself, so keep rows 1 to 10.
  similarMovie=pd.DataFrame(rows).sort_values(by=[matrix],ascending=(matrix=='euclidean_distance'))[1:11]
  joint=similarMovie.merge(movies_ks.to_pandas(), on='movieId', how='inner')
  return joint[['movieId','title','genres']]

In [None]:
similarMovies(463)

'There is no movie with id 463'

In [None]:
print('Similar movies based on cosine similarity are as follows.')
similarMovies(471, 'cosine_similarity')

In [None]:
print('Similar movies based on Euclidean distance are as follows.')
similarMovies(471, 'euclidean_distance')

Similar movies based on Euclidean distance are as follows.


Unnamed: 0,movieId,title,genres
0,3396,"Muppet Movie, The (1979)",Adventure|Children|Comedy|Musical
1,531,"Secret Garden, The (1993)",Children|Drama
2,1220,"Blues Brothers, The (1980)",Action|Comedy|Musical
3,3671,Blazing Saddles (1974),Comedy|Western
4,2922,Hang 'Em High (1968),Crime|Drama|Western
5,5876,"Quiet American, The (2002)",Drama|Thriller|War
6,952,Around the World in 80 Days (1956),Adventure|Comedy
7,7728,"Postman Always Rings Twice, The (1946)",Crime|Drama|Film-Noir|Thriller
8,6283,Cowboy Bebop: The Movie (Cowboy Bebop: Tengoku...,Action|Animation|Sci-Fi|Thriller
9,2109,"Jerk, The (1979)",Comedy


### Overall Summary

#### Motivation: 
For an online video app, an accurate recommendation system is one of the most critical components. To provide personalized recommendations for every user, I built a simple recommendation system in this project using the MovieLens data from GroupLens (https://grouplens.org/datasets/movielens/latest/). Working through this process should also help prepare me for a data scientist position at a video service company or another tech company.

#### Steps:
1. Exploratory Data Analysis (EDA): built a data ETL pipeline to analyze the movie rating dataset and conducted online analytical processing (OLAP) with Spark SQL.
2. Implemented the Alternating Least Squares (ALS) model to provide personalized movie recommendations.
3. Tuned the model hyperparameters through grid search with 5-fold cross-validation, then evaluated the tuned model on the test data.
4. Applied the model to the whole rating dataset to check its performance, used it to make recommendations for given userIds, and used one of its outputs, the item factors, to find similar movies for given movieIds.

#### Output and conclusion:
 - The best ALS model has the parameters maxIter=10, regParam=0.1, and rank=5. The root mean squared error (RMSE) is 0.88 on the test data and 0.69 on the whole dataset, which includes the training rows.
 - As mentioned in the steps, the ALS model not only provides recommendations but also learns latent factors through matrix factorization. These factors give deeper insight into the data. In this project, the item factors were used to measure how close any two movies are, which lets us find similar movies.