version 2
# Alejo Gonzalez Garcia (100454351)
# Andrés Navarro Pedregal (100451730)

# ![Spark Logo](http://spark-mooc.github.io/web-assets/images/ta_Spark-logo-small.png) + ![Python Logo](http://spark-mooc.github.io/web-assets/images/python-logo-master-v3-TM-flattened_small.png)
# **Introduction to Machine Learning with Apache Spark**
## **Predicting Movie Ratings**
#### One of the most common uses of big data is to predict what users want.  This allows Google to show you relevant ads, Amazon to recommend relevant products, and Netflix to recommend movies that you might like.  This lab will demonstrate how we can use Apache Spark to recommend movies to a user.  We will start with some basic techniques, and then use the [Spark MLlib][mllib] library's Alternating Least Squares method to make more sophisticated predictions.
#### For this lab, we will use a subset dataset of 500,000 ratings we have included for you into your VM (and on Databricks) from the [movielens 10M stable benchmark rating dataset](http://grouplens.org/datasets/movielens/). However, the same code you write will work for the full dataset, or their latest dataset of 21 million ratings.
#### In this lab:
#### *Part 0*: Preliminaries
#### *Part 1*: Basic Recommendations
#### *Part 2*: Collaborative Filtering
#### *Part 3*: Predictions for Yourself
#### As mentioned during the first Learning Spark lab, think carefully before calling `collect()` on any datasets.  When you are using a small dataset, calling `collect()` and then using Python to get a sense for the data locally (in the driver program) will work fine, but this will not work when you are using a large dataset that doesn't fit in memory on one machine.  Solutions that call `collect()` and do local analysis that could have been done with Spark will likely fail in the autograder and not receive full credit.
[mllib]: https://spark.apache.org/mllib/

### Code
#### This assignment can be completed using basic Python and pySpark Transformations and Actions.  Libraries other than math are not necessary. With the exception of the ML functions that we introduce in this assignment, you should be able to complete all parts of this homework using only the Spark functions you have used in prior lab exercises (although you are welcome to use more features of Spark if you like!).
### You will need to import the h5py and implicit Python packages to get the MoviLENS databases.
### Import those packages go to Workspace -> Users -> right button and select Import. In the Modal Window select "import a library"  and select Package from PyPy.
#### Fill the Package name field with h5py and repeat with the implicit library.

In [None]:
import sys
import os
from test_helper import Test
from implicit.datasets.movielens import get_movielens
import numpy as np

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS
from pyspark.sql import Row
from pyspark.sql.functions import *

### **Part 0: Preliminaries**
#### We read with implicit the dataset to analyze.
#### MovieLens have 4 different datasets, with different amount of movies and ratings.
#### We have: '100k', '1m', '10m' and '20m', and define it in the variant variable
#### 
* #### For each row in the ratings dataset, we create a tuple of (UserID, MovieID, Rating). 
* #### For each line in the movies dataset, we create a tuple of (MovieID, Title).

In [None]:
variant='1m'
titles, ratings = get_movielens(variant)

Check the data in titles variable:

In [None]:
print(titles)

['' 'Toy Story (1995)' 'Jumanji (1995)' ... 'Tigerland (2000)'
 'Two Family House (2000)' 'Contender, The (2000)']


It is a very simple list of titles. We will assume the movieID is her possition in the list.

In [None]:
moviesRDD=sc.parallelize(zip(range(len(titles)),titles)).map(lambda p: Row(movieId=int(p[0]),Title=str(p[1])))
moviesDF=spark.createDataFrame(moviesRDD)
moviesCount=moviesDF.count()

In [None]:
moviesDF.show()

+-------+--------------------+
|movieId|               Title|
+-------+--------------------+
|      0|                    |
|      1|    Toy Story (1995)|
|      2|      Jumanji (1995)|
|      3|Grumpier Old Men ...|
|      4|Waiting to Exhale...|
|      5|Father of the Bri...|
|      6|         Heat (1995)|
|      7|      Sabrina (1995)|
|      8| Tom and Huck (1995)|
|      9| Sudden Death (1995)|
|     10|    GoldenEye (1995)|
|     11|American Presiden...|
|     12|Dracula: Dead and...|
|     13|        Balto (1995)|
|     14|        Nixon (1995)|
|     15|Cutthroat Island ...|
|     16|       Casino (1995)|
|     17|Sense and Sensibi...|
|     18|   Four Rooms (1995)|
|     19|Ace Ventura: When...|
+-------+--------------------+
only showing top 20 rows



Now, we will check the ratings information:

In [None]:
type(ratings)

Out[61]: scipy.sparse.csr.csr_matrix

It is a scipy.sparse.csr.csr_matrix [https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html].

In this case the matrix represents the evaluation of each one of the movies (rows) by the each user (columns). Thats why this matrix is full of zeros.

In [None]:
(ratings.toarray())[0:5,0:20]

Out[62]: array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 5., 0., 0., 0., 0., 4., 0., 4., 5., 5., 0., 0., 0., 0., 0.,
        0., 0., 4., 5.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 5., 0., 0., 3., 0., 0.,
        0., 0., 2., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 3., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.]], dtype=float32)

To get tle list of tuples with pair (movie, user) and evaluation, we can use the method todok() [https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.todok.html#scipy.sparse.csr_matrix.todok]. This will be useful to convert to a distributed dataframe.

In [None]:
len(ratings.todok(True).items())

Out[63]: 1000209

In [None]:
list(ratings.todok(True).items())[:10]

Out[64]: [((1, 1), 5.0),
 ((48, 1), 5.0),
 ((150, 1), 5.0),
 ((260, 1), 4.0),
 ((527, 1), 5.0),
 ((531, 1), 4.0),
 ((588, 1), 4.0),
 ((594, 1), 4.0),
 ((595, 1), 5.0),
 ((608, 1), 4.0)]

In [None]:
def get_ratings_tuple(entry):
    #Program a function who receives a tuple ((movieId, userId), rating) and returns a tuple (userId,MovieId,rating)

    # Unpack the nested tuple
    (movieId, userId), rating = entry
    
    # Return the transformed tuple
    return userId, movieId, rating

    
    

In [None]:
rawRatingsRDD=sc.parallelize(list(ratings.todok(True).items()))
ratingsRDD = ( rawRatingsRDD
                 .map(get_ratings_tuple)
                 .map(lambda p: Row(userId=int(p[0]), movieId=int(p[1]), rating=float(p[2])))
            )
ratingsDF = spark.createDataFrame(ratingsRDD)
ratingsCount = ratingsDF.count()

In [None]:
print ('There are {0} ratings and {1} movies in the datasets'.format(ratingsCount, moviesCount))
print ('Ratings: {0}'.format(ratingsDF.take(3)))
print ('Movies: {0}'.format(moviesDF.take(3)))
if variant=='20m':
  assert ratingsCount == 20000263
  assert moviesCount == 131263
  assert moviesDF.filter("Title == \"b'Toy Story (1995)'\"").count() == 1
  assert (ratingsRDD.takeOrdered(1, key=lambda data: data[1]))
if variant=='10m':
  assert ratingsCount == 10000054
  assert moviesCount == 65134
  assert moviesDF.filter("Title == \"b'Toy Story (1995)'\"").count() == 1
  assert (ratingsRDD.takeOrdered(1, key=lambda data: data[1]))
if variant=='1m':
  assert ratingsCount == 1000209
  assert moviesCount == 3953
  #assert moviesDF.filter("Title == \"b'Toy Story (1995)'\"").count() == 1 # We can have both, I have fixed the output error!
  assert moviesDF.filter("Title == \"b'Toy Story (1995)'\" OR Title == 'Toy Story (1995)'").count() == 1

  assert (ratingsRDD.takeOrdered(1, key=lambda data: data[1]))
if variant=='100k':
  assert ratingsCount == 100000
  assert moviesCount == 1683
  assert moviesDF.filter("Title == \"b'Toy Story (1995)'\"").count() == 1
  assert (ratingsRDD.takeOrdered(1, key=lambda data: data[1]))

There are 1000209 ratings and 3953 movies in the datasets
Ratings: [Row(userId=1, movieId=1, rating=5.0), Row(userId=1, movieId=48, rating=5.0), Row(userId=1, movieId=150, rating=5.0)]
Movies: [Row(movieId=0, Title=''), Row(movieId=1, Title='Toy Story (1995)'), Row(movieId=2, Title='Jumanji (1995)')]


#### In this lab we will be examining subsets of the tuples we create (e.g., the top rated movies by users). Whenever we examine only a subset of a large dataset, there is the potential that the result will depend on the order we perform operations, such as joins, or how the data is partitioned across the workers. What we want to guarantee is that we always see the same results for a subset, independent of how we manipulate or store the data.
#### We can do that by sorting before we examine a subset. You might think that the most obvious choice when dealing with an RDD of tuples would be to use the [`sortByKey()` method][sortbykey]. However this choice is problematic, as we can still end up with different results if the key is not unique.
#### Note: It is important to use the [`unicode` type](https://docs.python.org/2/howto/unicode.html#the-unicode-type) instead of the `string` type as the titles are in unicode characters.
#### Consider the following example, and note that while the sets are equal, the printed lists are usually in different order by value, *although they may randomly match up from time to time.*
#### You can try running this multiple times.  If the last assertion fails, don't worry about it: that was just the luck of the draw.  And note that in some environments the results may be more deterministic.
[sortbykey]: https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.sortByKey

In [None]:
tmp1 = [(1, u'alpha'), (2, u'alpha'), (2, u'beta'), (3, u'alpha'), (1, u'epsilon'), (1, u'delta')]
tmp2 = [(1, u'delta'), (2, u'alpha'), (2, u'beta'), (3, u'alpha'), (1, u'epsilon'), (1, u'alpha')]
schema=["key","value"]
oneDF = spark.createDataFrame(data=tmp1,schema=schema)
twoDF = spark.createDataFrame(data=tmp2,schema=schema)
oneSorted = oneDF.sort("key").collect()
twoSorted = twoDF.sort("key").collect()
print (oneSorted)
print (twoSorted)
assert set(oneSorted) == set(twoSorted)     # Note that both lists have the same elements
assert twoSorted[0]["key"] < twoSorted.pop()["key"] # Check that it is sorted by the keys
assert oneSorted != twoSorted   # Note that the subset consisting of the first two elements does not match

[Row(key=1, value='alpha'), Row(key=1, value='epsilon'), Row(key=1, value='delta'), Row(key=2, value='alpha'), Row(key=2, value='beta'), Row(key=3, value='alpha')]
[Row(key=1, value='delta'), Row(key=1, value='epsilon'), Row(key=1, value='alpha'), Row(key=2, value='alpha'), Row(key=2, value='beta'), Row(key=3, value='alpha')]


#### Even though the two lists contain identical tuples, the difference in ordering *sometimes* yields a different ordering for the sorted DF (try running the cell repeatedly and see if the results change or the assertion fails). If we only examined the first two elements of the DF (e.g., using `take(2)`), then we would observe different answers - **that is a really bad outcome as we want identical input data to always yield identical output**. A better technique is to sort the DF by *both the key and value*.

In [None]:
print (oneDF.sort("key", "value").collect())
print (twoDF.sort("key","value").collect())

[Row(key=1, value='alpha'), Row(key=1, value='delta'), Row(key=1, value='epsilon'), Row(key=2, value='alpha'), Row(key=2, value='beta'), Row(key=3, value='alpha')]
[Row(key=1, value='alpha'), Row(key=1, value='delta'), Row(key=1, value='epsilon'), Row(key=2, value='alpha'), Row(key=2, value='beta'), Row(key=3, value='alpha')]


#### If we just want to look at the first few elements of the DF in sorted order, we can use the sort method with multiple columns defining the priority to sort: https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.sort.html#pyspark.sql.DataFrame.sort

In [None]:
oneSorted1 = oneDF.sort("key", "value").take(oneDF.count())
twoSorted1 = twoDF.sort("key", "value").take(oneDF.count())
print ('one is {0}'.format(oneSorted1))
print ('two is {0}'.format(twoSorted1))
assert oneSorted1 == twoSorted1

one is [Row(key=1, value='alpha'), Row(key=1, value='delta'), Row(key=1, value='epsilon'), Row(key=2, value='alpha'), Row(key=2, value='beta'), Row(key=3, value='alpha')]
two is [Row(key=1, value='alpha'), Row(key=1, value='delta'), Row(key=1, value='epsilon'), Row(key=2, value='alpha'), Row(key=2, value='beta'), Row(key=3, value='alpha')]


### **Part 1: Basic Recommendations**
#### One way to recommend movies is to always recommend the movies with the highest average rating. In this part, we will use Spark to find the name, number of ratings, and the average rating of the 20 movies with the highest average rating and more than 500 reviews. We want to filter our movies with high ratings but fewer than or equal to 500 reviews because movies with few reviews may not have broad appeal to everyone.

#### **(1a) Movies with Highest Average Ratings**
#### Now that we have a way to calculate the average ratings, we will use the `getCountsAndAverages()` helper function with Spark to determine movies with highest average ratings.
#### The steps you should perform are:
* #### Recall that the `ratingsDF` contains tuples of the form (UserID, MovieID, Rating). From `ratingsDF` create an DataFrame with tuples of the form (MovieID, number of ratings, average of ratings). Use the methods 
* #### groubBy() https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.groupBy.html#pyspark.sql.DataFrame.groupBy
### And agg() method https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.agg.html#pyspark.sql.DataFrame.agg, to calculate the number of ratings and average value of ratings per movie. 
* #### We want to see movie names, instead of movie IDs. To `moviesRDD`, apply DataFrame transformations that use `movieIDsWithAvgRatingsDF` to get the movie names for `movieIDsWithAvgRatingsDF`, yielding tuples of the form (average rating, movie name, number of ratings). This set of transformations will yield an DF of the form: `[(1.0, u'Autopsy (Macchie Solari) (1975)', 1), (1.0, u'Better Living (1998)', 1), (1.0, u'Big Squeeze, The (1996)', 3)]`. You can use the Spark Dataframe transformation *join* https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.sql.DataFrame.join.html#pyspark.sql.DataFrame.join : `[(3.6818181818181817, u'Happiest Millionaire, The (1967)', 22), (3.0468227424749164, u'Grumpier Old Men (1995)', 299), (2.882978723404255, u'Hocus Pocus (1993)', 94)]`

In [None]:
# TODO: Replace <FILL IN> with appropriate code

# From ratingsDF with tuples of (UserID, MovieID, Rating) create an DF with MovieId, number of ratings and average of that ratings.
# Set name as movieId, num_ratings and avg_ratings
# the (MovieID, iterable of Ratings for that MovieID)

movieIDsWithRatingsDF = ratingsDF.groupBy('movieId').agg(
    count('rating').alias('num_rating'),
    avg('rating').alias('avg_rating')
)
print ('movieIDsWithAvgRatingsRDD: {0}\n'.format(movieIDsWithRatingsDF.take(3)))

# To `movieIDsWithAvgRatingsRDD`, apply RDD transformations that use `moviesRDD` to get the movie
# names for `movieIDsWithAvgRatingsRDD`, yielding tuples of the form
# (average rating, movie name, number of ratings)
movieNameWithAvgRatingsDF = (moviesDF
                              .join(movieIDsWithRatingsDF, 'movieId')
                              .select('avg_rating', 'Title', 'num_rating')
                              .orderBy('avg_rating', ascending=False))

print('movieNameWithAvgRatingsDF: {0}\n'.format(movieNameWithAvgRatingsDF.take(3)))

movieIDsWithAvgRatingsRDD: [Row(movieId=29, num_rating=403, avg_rating=4.062034739454094), Row(movieId=1806, num_rating=168, avg_rating=2.892857142857143), Row(movieId=474, num_rating=972, avg_rating=3.825102880658436)]

movieNameWithAvgRatingsDF: [Row(avg_rating=5.0, Title='Schlafes Bruder (Brother of Sleep) (1995)', num_rating=1), Row(avg_rating=5.0, Title='Bittersweet Motel (2000)', num_rating=1), Row(avg_rating=5.0, Title='Follow the Bitch (1998)', num_rating=1)]



In [None]:
ratingsDF.show()

+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     1|      1|   5.0|
|     1|     48|   5.0|
|     1|    150|   5.0|
|     1|    260|   4.0|
|     1|    527|   5.0|
|     1|    531|   4.0|
|     1|    588|   4.0|
|     1|    594|   4.0|
|     1|    595|   5.0|
|     1|    608|   4.0|
|     1|    661|   3.0|
|     1|    720|   3.0|
|     1|    745|   3.0|
|     1|    783|   4.0|
|     1|    914|   3.0|
|     1|    919|   4.0|
|     1|    938|   4.0|
|     1|   1022|   5.0|
|     1|   1028|   5.0|
|     1|   1029|   5.0|
+------+-------+------+
only showing top 20 rows



In [None]:
movieIDsWithRatingsDF.show()

+-------+----------+------------------+
|movieId|num_rating|        avg_rating|
+-------+----------+------------------+
|     29|       403| 4.062034739454094|
|   1806|       168| 2.892857142857143|
|    474|       972| 3.825102880658436|
|   2040|       128|         2.9453125|
|   2453|       176|              3.25|
|   2529|      1162| 3.712564543889845|
|     26|       100|              3.53|
|   3506|       167|3.3652694610778444|
|   3091|       141| 4.283687943262412|
|   2250|        62|3.1451612903225805|
|   1677|        37|2.7567567567567566|
|   1950|       348| 4.129310344827586|
|   3764|       245|2.6408163265306124|
|   2927|        99|  4.08080808080808|
|    964|        51| 3.392156862745098|
|   2509|        18|2.7777777777777777|
|   1277|       363| 3.884297520661157|
|   1840|       160|           3.38125|
|    541|      1800| 4.273333333333333|
|   2173|       110|3.5636363636363635|
+-------+----------+------------------+
only showing top 20 rows



In [None]:
movieNameWithAvgRatingsDF.orderBy("avg_rating",ascending=False).show()

+-----------------+--------------------+----------+
|       avg_rating|               Title|num_rating|
+-----------------+--------------------+----------+
|              5.0|    Baby, The (1973)|         1|
|              5.0|Gate of Heavenly ...|         3|
|              5.0|Schlafes Bruder (...|         1|
|              5.0|Follow the Bitch ...|         1|
|              5.0|Bittersweet Motel...|         1|
|              5.0|Smashing Time (1967)|         2|
|              5.0|Ulysses (Ulisse) ...|         1|
|              5.0|Song of Freedom (...|         1|
|              5.0|One Little Indian...|         1|
|              5.0|        Lured (1947)|         1|
|              4.8|I Am Cuba (Soy Cu...|         5|
|             4.75|     Lamerica (1994)|         8|
|4.666666666666667|Apple, The (Sib) ...|         9|
|4.608695652173913|      Sanjuro (1962)|        69|
|4.560509554140127|Seven Samurai (Th...|       628|
|4.554557700942973|Shawshank Redempt...|      2227|
|4.524966261

In [None]:
# TEST Movies with Highest Average Ratings (1b)
if variant=='20m':
  Test.assertEquals(movieIDsWithRatingsDF.count(), 26744,
                  'incorrect movieIDsWithRatingsRDD.count() (expected 26744)')
elif variant=='10m':
  Test.assertEquals(movieIDsWithRatingsDF.count(), 10677,
                  'incorrect movieIDsWithRatingsRDD.count() (expected 10677)')
elif variant=='1m':
  Test.assertEquals(movieIDsWithRatingsDF.count(), 3706,
                  'incorrect movieIDsWithRatingsRDD.count() (expected 3706)')
elif variant=='100k':
  Test.assertEquals(movieIDsWithRatingsDF.count(), 1682,
                  'incorrect movieIDsWithRatingsDF.count() (expected 1682)')

1 test passed.


#### **(1c) Movies with Highest Average Ratings and more than 500 reviews**
#### Now that we have an DataFrame of the movies with highest averge ratings, we can use Spark to determine the 20 movies with highest average ratings and more than 500 reviews.
#### Apply a single DataFrame transformation to `movieNameWithAvgRatingsDF` to limit the results to movies with ratings from more than 500 people. We then use the `orderBy()` helper function to sort by the average rating to get the movies in order of their rating (highest rating first). You will end up with an DataFrame of the form: `[(4.5349264705882355, u'Shawshank Redemption, The (1994)', 1088), (4.515798462852263, u"Schindler's List (1993)", 1171), (4.512893982808023, u'Godfather, The (1972)', 1047)]`

In [None]:
# TODO: Replace <FILL IN> with appropriate code

# Apply an RDD transformation to `movieNameWithAvgRatingsRDD` to limit the results to movies with
# ratings from more than 500 people. We then use the `sortFunction()` helper function to sort by the
# average rating to get the movies in order of their rating (highest rating first)
movieLimitedAndSortedByRatingDF = (movieNameWithAvgRatingsDF
                                    .filter('num_rating > 500')
                                    .orderBy('avg_rating', ascending=False))

print('Movies with highest ratings: {0}'.format(movieLimitedAndSortedByRatingDF.take(20)))


Movies with highest ratings: [Row(avg_rating=4.560509554140127, Title='Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)', num_rating=628), Row(avg_rating=4.554557700942973, Title='Shawshank Redemption, The (1994)', num_rating=2227), Row(avg_rating=4.524966261808367, Title='Godfather, The (1972)', num_rating=2223), Row(avg_rating=4.52054794520548, Title='Close Shave, A (1995)', num_rating=657), Row(avg_rating=4.517106001121705, Title='Usual Suspects, The (1995)', num_rating=1783), Row(avg_rating=4.510416666666667, Title="Schindler's List (1993)", num_rating=2304), Row(avg_rating=4.507936507936508, Title='Wrong Trousers, The (1993)', num_rating=882), Row(avg_rating=4.477724741447892, Title='Raiders of the Lost Ark (1981)', num_rating=2514), Row(avg_rating=4.476190476190476, Title='Rear Window (1954)', num_rating=1050), Row(avg_rating=4.453694416583082, Title='Star Wars: Episode IV - A New Hope (1977)', num_rating=2991), Row(avg_rating=4.4498902706656915, Title='Dr. Str

In [None]:
# TEST Movies with Highest Average Ratings and more than 500 Reviews (1c)
if variant=='20m':
  Test.assertEquals(movieLimitedAndSortedByRatingDF.count(), 4483,
                  'incorrect movieLimitedAndSortedByRatingRDD.count()')
  Test.assertEquals(movieLimitedAndSortedByRatingDF.take(20),
                [Row(avg_rating=4.174231169217055, Title="b'Pulp Fiction (1994)'", num_rating=67310),
                 Row(avg_rating=4.029000181345584, Title="b'Forrest Gump (1994)'", num_rating=66172),
                 Row(avg_rating=4.446990499637029, Title="b'Shawshank Redemption, The (1994)'", num_rating=63366),
                 Row(avg_rating=4.17705650958151, Title="b'Silence of the Lambs, The (1991)'", num_rating=63299),
                 Row(avg_rating=3.6647408523821485, Title="b'Jurassic Park (1993)'", num_rating=59715),
                 Row(avg_rating=4.190671901948552, Title="b'Star Wars: Episode IV - A New Hope (1977)'", num_rating=54502),
                 Row(avg_rating=4.042533802004873, Title="b'Braveheart (1995)'", num_rating=53769),
                 Row(avg_rating=3.9319539085828037, Title="b'Terminator 2: Judgment Day (1991)'", num_rating=52244),
                 Row(avg_rating=4.187185880702848, Title="b'Matrix, The (1999)'", num_rating=51334),
                 Row(avg_rating=4.310175010988133, Title='b"Schindler\'s List (1993)"', num_rating=50054),
                 Row(avg_rating=3.921239561324077, Title="b'Toy Story (1995)'", num_rating=49695),
                 Row(avg_rating=3.9856900828946573, Title="b'Fugitive, The (1993)'", num_rating=49581),
                 Row(avg_rating=3.86859786089541, Title="b'Apollo 13 (1995)'", num_rating=47777),
                 Row(avg_rating=3.370961571161367, Title="b'Independence Day (a.k.a. ID4) (1996)'", num_rating=47048),
                 Row(avg_rating=4.334372207803259, Title="b'Usual Suspects, The (1995)'", num_rating=47006),
                 Row(avg_rating=4.004622216528961, Title="b'Star Wars: Episode VI - Return of the Jedi (1983)'", num_rating=46839),
                 Row(avg_rating=3.4023646154514267, Title="b'Batman (1989)'", num_rating=46054),
                 Row(avg_rating=4.188202061218635, Title="b'Star Wars: Episode V - The Empire Strikes Back (1980)'", num_rating=45313),
                 Row(avg_rating=4.155933936470536, Title="b'American Beauty (1999)'", num_rating=44987),
                 Row(avg_rating=3.8980546909737663, Title="b'Twelve Monkeys (a.k.a. 12 Monkeys) (1995)'", num_rating=44980)], 'incorrect sortedByRatingDF.take(20)')
elif variant=='10m':
  Test.assertEquals(movieLimitedAndSortedByRatingDF.count(), 3005,
                  'incorrect movieLimitedAndSortedByRatingRDD.count()')
  Test.assertEquals(movieLimitedAndSortedByRatingDF.take(20),
                [Row(avg_rating=4.157425998164295, Title="b'Pulp Fiction (1994)'", num_rating=34864),
                 Row(avg_rating=4.0135821458629595, Title="b'Forrest Gump (1994)'", num_rating=34457),
                 Row(avg_rating=4.2041998336699535, Title="b'Silence of the Lambs, The (1991)'", num_rating=33668),
                 Row(avg_rating=3.661564156783427, Title="b'Jurassic Park (1993)'", num_rating=32631),
                 Row(avg_rating=4.457238321660348, Title="b'Shawshank Redemption, The (1994)'", num_rating=31126),
                 Row(avg_rating=4.0823900665431845, Title="b'Braveheart (1995)'", num_rating=29154),
                 Row(avg_rating=4.006925494801561, Title="b'Fugitive, The (1993)'", num_rating=28951),
                 Row(avg_rating=3.92769794113583, Title="b'Terminator 2: Judgment Day (1991)'", num_rating=28948),
                 Row(avg_rating=4.2202093397745575, Title="b'Star Wars: Episode IV - A New Hope (a.k.a. Star Wars) (1977)'", num_rating=28566),
                 Row(avg_rating=3.8873497318291106, Title="b'Apollo 13 (1995)'", num_rating=27035),
                 Row(avg_rating=3.3865572677433695, Title="b'Batman (1989)'", num_rating=26996),
                 Row(avg_rating=3.928768573481039, Title="b'Toy Story (1995)'", num_rating=26449),
                 Row(avg_rating=3.3759311880807927, Title="b'Independence Day (a.k.a. ID4) (1996)'", num_rating=26042),
                 Row(avg_rating=3.7421079036739733, Title="b'Dances with Wolves (1990)'", num_rating=25912),
                 Row(avg_rating=4.363482949916592, Title='b"Schindler\'s List (1993)"', num_rating=25777),
                 Row(avg_rating=3.500315196406761, Title="b'True Lies (1994)'", num_rating=25381),
                 Row(avg_rating=3.99635429117858, Title="b'Star Wars: Episode VI - Return of the Jedi (1983)'", num_rating=25098),
                 Row(avg_rating=3.8750666065499857, Title="b'12 Monkeys (Twelve Monkeys) (1995)'", num_rating=24397),
                 Row(avg_rating=4.367142322253193, Title="b'Usual Suspects, The (1995)'", num_rating=24037),
                 Row(avg_rating=4.133352946120871, Title="b'Fargo (1996)'", num_rating=23794)], 'incorrect sortedByRatingDF.take(20)')
elif variant=='1m':
  Test.assertEquals(movieLimitedAndSortedByRatingDF.count(), 617,
                  'incorrect movieLimitedAndSortedByRatingRDD.count()')
  Test.assertEquals(movieLimitedAndSortedByRatingDF.take(20),
                [Row(avg_rating=4.3173862310385065, Title="b'American Beauty (1999)'", num_rating=3428),
                 Row(avg_rating=4.453694416583082, Title="b'Star Wars: Episode IV - A New Hope (1977)'", num_rating=2991),
                 Row(avg_rating=4.292976588628763, Title="b'Star Wars: Episode V - The Empire Strikes Back (1980)'", num_rating=2990),
                 Row(avg_rating=4.022892819979188, Title="b'Star Wars: Episode VI - Return of the Jedi (1983)'", num_rating=2883),
                 Row(avg_rating=3.7638473053892216, Title="b'Jurassic Park (1993)'", num_rating=2672),
                 Row(avg_rating=4.337353938937053, Title="b'Saving Private Ryan (1998)'", num_rating=2653),
                 Row(avg_rating=4.058512646281616, Title="b'Terminator 2: Judgment Day (1991)'", num_rating=2649),
                 Row(avg_rating=4.315830115830116, Title="b'Matrix, The (1999)'", num_rating=2590),
                 Row(avg_rating=3.9903213317847466, Title="b'Back to the Future (1985)'", num_rating=2583),
                 Row(avg_rating=4.3518231186966645, Title="b'Silence of the Lambs, The (1991)'", num_rating=2578),
                 Row(avg_rating=3.739952718676123, Title="b'Men in Black (1997)'", num_rating=2538),
                 Row(avg_rating=4.477724741447892, Title="b'Raiders of the Lost Ark (1981)'", num_rating=2514),
                 Row(avg_rating=4.254675686430561, Title="b'Fargo (1996)'", num_rating=2513),
                 Row(avg_rating=4.406262708418057, Title="b'Sixth Sense, The (1999)'", num_rating=2459),
                 Row(avg_rating=4.234957020057307, Title="b'Braveheart (1995)'", num_rating=2443),
                 Row(avg_rating=4.127479949345715, Title="b'Shakespeare in Love (1998)'", num_rating=2369),
                 Row(avg_rating=4.3037100949094045, Title="b'Princess Bride, The (1987)'", num_rating=2318),
                 Row(avg_rating=4.510416666666667, Title='b"Schindler\'s List (1993)"', num_rating=2304),
                 Row(avg_rating=4.219405594405594, Title="b'L.A. Confidential (1997)'", num_rating=2288),
                 Row(avg_rating=3.953028972783143, Title="b'Groundhog Day (1993)'", num_rating=2278)], 'incorrect sortedByRatingRDD.take(20)')
elif variant=='100k':
  Test.assertEquals(movieLimitedAndSortedByRatingDF.count(), 4,
                  'incorrect movieLimitedAndSortedByRatingRDD.count()')
  Test.assertEquals(movieLimitedAndSortedByRatingDF.take(20),
                [Row(avg_rating=4.3584905660377355, Title="b'Star Wars (1977)'", num_rating=583),
                 Row(avg_rating=3.8035363457760316, Title="b'Contact (1997)'", num_rating=509),
                 Row(avg_rating=4.155511811023622, Title="b'Fargo (1996)'", num_rating=508),
                 Row(avg_rating=4.007889546351085, Title="b'Return of the Jedi (1983)'", num_rating=507)], 'incorrect sortedByRatingDF.take(20)')  

1 test passed.
1 test failed. incorrect sortedByRatingRDD.take(20)


Note that the last test fails. If we expect to have (4.5349264705882355, u'Shawshank Redemption, The (1994)', 1088), (4.515798462852263, u"Schindler's List (1993)", 1171) values, something might be wrong. If we see above on the movies ordered by rating, there is no movie with a rating of 4.53. We only have 
 4.666666666666667|Apple, The (Sib) ...|         9|
|4.608695652173913|      Sanjuro (1962)|        69|
|4.560509554140127|Seven Samurai (Th...|       628|
|4.554557700942973|Shawshank Redempt...|      2227|
|4.524966261808367|Godfather, The (1...|      2223|
So something might be wrong when we computed the average of the ratings or as we know, the provided test is incorrect. 

#### Using a threshold on the number of reviews is one way to improve the recommendations, but there are many other good ways to improve quality. For example, you could weight ratings by the number of ratings.

## **Part 2: Collaborative Filtering**
#### In this course, you have learned about many of the basic transformations and actions that Spark allows us to apply to distributed datasets.  Spark also exposes some higher level functionality; in particular, Machine Learning using a component of Spark called [MLlib][mllib].  In this part, you will learn how to use MLlib to make personalized movie recommendations using the movie data we have been analyzing.
#### We are going to use a technique called [collaborative filtering][collab]. Collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many users (collaborating). The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on an issue, A is more likely to have B's opinion on a different issue x than to have the opinion on x of a person chosen randomly. You can read more about collaborative filtering [here][collab2].
#### The image below (from [Wikipedia][collab]) shows an example of predicting of the user's rating using collaborative filtering. At first, people rate different items (like videos, images, games). After that, the system is making predictions about a user's rating for an item, which the user has not rated yet. These predictions are built upon the existing ratings of other users, who have similar ratings with the active user. For instance, in the image below the system has made a prediction, that the active user will not like the video.
![collaborative filtering](https://courses.edx.org/c4x/BerkeleyX/CS100.1x/asset/Collaborative_filtering.gif)
[mllib]: https://spark.apache.org/mllib/
[collab]: https://en.wikipedia.org/?title=Collaborative_filtering
[collab2]: http://recommender-systems.org/collaborative-filtering/

#### For movie recommendations, we start with a matrix whose entries are movie ratings by users (shown in red in the diagram below).  Each column represents a user (shown in green) and each row represents a particular movie (shown in blue).
#### Since not all users have rated all movies, we do not know all of the entries in this matrix, which is precisely why we need collaborative filtering.  For each user, we have ratings for only a subset of the movies.  With collaborative filtering, the idea is to approximate the ratings matrix by factorizing it as the product of two matrices: one that describes properties of each user (shown in green), and one that describes properties of each movie (shown in blue).
![factorization](http://spark-mooc.github.io/web-assets/images/matrix_factorization.png)
#### We want to select these two matrices such that the error for the users/movie pairs where we know the correct ratings is minimized.  The [Alternating Least Squares][als] algorithm does this by first randomly filling the users matrix with values and then optimizing the value of the movies such that the error is minimized.  Then, it holds the movies matrix constrant and optimizes the value of the user's matrix.  This alternation between which matrix to optimize is the reason for the "alternating" in the name.
#### This optimization is what's being shown on the right in the image above.  Given a fixed set of user factors (i.e., values in the users matrix), we use the known ratings to find the best values for the movie factors using the optimization written at the bottom of the figure.  Then we "alternate" and pick the best user factors given fixed movie factors.

[als]: https://en.wikiversity.org/wiki/Least-Squares_Method

#### **(2a) Creating a Training Set**
#### Before we jump into using machine learning, we need to break up the `ratingsRDD` dataset into three pieces:
* #### A training set (DF), which we will use to train models
* #### A validation set (DF), which we will use to choose the best model
* #### A test set (DF), which we will use for our experiments
#### To randomly split the dataset into the multiple groups, we can use the pySpark [randomSplit()](https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.sql.DataFrame.randomSplit) transformation. `randomSplit()` takes a set of splits and and seed and returns multiple DFs.

In [None]:
trainingDF, validationDF, testDF = ratingsDF.randomSplit([0.6, 0.2, 0.2], seed=0) # Here we are splitting the data into 60% training, 20% validation and 20% testing.  

print ('Training: {0}, validation: {1}, test: {2}\n' .format(trainingDF.count(),
                                                             validationDF.count(),
                                                             testDF.count()))
print (trainingDF.take(3))
print (validationDF.take(3))
print (testDF.take(3))
if variant=='20m':
  assert trainingDF.count() == 12002319
  assert validationDF.count() == 4000692
  assert testDF.count() == 3997252
elif variant=='10m':
  assert trainingDF.count() == 5998865
  assert validationDF.count() == 2000494
  assert testDF.count() == 2000695
elif variant=='1m':
  assert trainingDF.count() ==  600024 #600919 
  assert validationDF.count() ==  200114 # 199566
  assert testDF.count() ==  200071 # 199724 this numbers resulting in a test failure...

elif variant=='100k':
  assert trainingDF.count() == 59930
  assert validationDF.count() == 20006
  assert testDF.count() == 20064

Training: 600024, validation: 200114, test: 200071

[Row(userId=1, movieId=48, rating=5.0), Row(userId=1, movieId=150, rating=5.0), Row(userId=1, movieId=260, rating=4.0)]
[Row(userId=1, movieId=1, rating=5.0), Row(userId=1, movieId=527, rating=5.0), Row(userId=1, movieId=595, rating=5.0)]
[Row(userId=1, movieId=1035, rating=5.0), Row(userId=1, movieId=1097, rating=4.0), Row(userId=1, movieId=1545, rating=4.0)]


In [None]:
print('Training count:', trainingDF.count())
print('Validation count:', validationDF.count())
print('Test count:', testDF.count())
# here we are just checking what were the correct values. 
# As in part 1c, notice that many of the tests are gonna fail, as we are going to have the same error all the time. 

Training count: 600024
Validation count: 200114
Test count: 200071


#### After splitting the dataset, your training set will be aprox. 60% the length of total movies, validation 20% and test will be 20% (the exact number of entries in each dataset varies slightly due to the random nature of the `randomSplit()` transformation).

#### **(2b) Root Mean Square Error (RMSE)**
#### In the next part, you will generate a few different models, and will need a way to decide which model is best. We will use the [Root Mean Square Error](https://en.wikipedia.org/wiki/Root-mean-square_deviation) (RMSE) or Root Mean Square Deviation (RMSD) to compute the error of each model.  RMSE is a frequently used measure of the differences between values (sample and population values) predicted by a model or an estimator and the values actually observed. The RMSD represents the sample standard deviation of the differences between predicted values and observed values. These individual differences are called residuals when the calculations are performed over the data sample that was used for estimation, and are called prediction errors when computed out-of-sample. The RMSE serves to aggregate the magnitudes of the errors in predictions for various times into a single measure of predictive power. RMSE is a good measure of accuracy, but only to compare forecasting errors of different models for a particular variable and not between variables, as it is scale-dependent.
####  The RMSE is the square root of the average value of the square of `(actual rating - predicted rating)` for all users and movies for which we have the actual rating.
#### Given two ratings DFs, *x* and *y* of size *n*, we define RSME as follows: $ RMSE = \sqrt{\frac{\sum_{i = 1}^{n} (x_i - y_i)^2}{n}}$
#### To calculate RSME, the steps you should perform are:
#### Use the pyspark.ml.evaluationRegressionEvaluator to evaluate with metricName="rmse" comparing the labelCol="raring" with the predictioncol="prediction)


#### **(2c) Using ALS()**
#### In this part, we will use the ml implementation of Alternating Least Squares, [ALS](https://spark.apache.org/docs/3.1.2/api/python/reference/api/pyspark.ml.recommendation.ALS.html?highlight=als#pyspark.ml.recommendation.ALS). ALS takes several parameters that control the model creation process. To determine the best values for the parameters, we will use ALS to train several models, and then we will select the best model and use the parameters from that model in the rest of this lab exercise.
#### The process we will use for determining the best model is as follows:
* #### Pick a set of model parameters. The most important parameter to `ALS.fit()` is the *rank*, which is the number of rows in the Users matrix (green in the diagram above) or the number of columns in the Movies matrix (blue in the diagram above). (In general, a lower rank will mean higher error on the training dataset, but a high rank may lead to [overfitting](https://en.wikipedia.org/wiki/Overfitting).)  We will train models with ranks of 4, 8, and 12 using the `trainingRDD` dataset.
* #### Create a model using `ALS.fit(trainingDF)` with three parameters: an DataFrame consisting of tuples of the form (UserID, MovieID, rating) used to train the model, an integer rank (4, 8, or 12), a number of iterations to execute (we will use 5 for the `iterations` parameter), and a regularization coefficient (we will use 0.1 for the `regularizationParameter`).
* #### For the prediction step, create an input DataFrame, `validationDF`, consisting of (UserID, MovieID, Rating) 
* #### Using the model and `validationDF`, we can predict rating values by calling [model.transform()]
* #### Evaluate the quality of the model by using the `evaluator.evaluate()` method.
####  Which rank produces the best model, based on the RMSE with the `validationDF` dataset?
#### Note: It is likely that this operation will take a noticeable amount of time (around a minute in our VM); you can observe its progress on the [Spark Web UI].

In [None]:
# TODO: Replace <FILL IN> with appropriate code
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS


evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",
                                predictionCol="prediction")
seed = 5
iterations = 5
regularizationParameter = 0.1
ranks = [4, 8, 12]
errors = [0, 0, 0]
err = 0
tolerance = 0.03

minError = float('inf')
bestRank = -1
bestIteration = -1
for rank in ranks:
    als = ALS(maxIter=5, regParam=0.01, userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")
    model = als.fit(trainingDF) # just fitting the model with the appropiate data. 
    predictedRatingsDF = model.transform(validationDF) # transforming the data to obtain the predictions

    error = evaluator.evaluate(predictedRatingsDF) # the last step in any prediction is to evaluate the results, we compute here the error. 
    print("Root-mean-square error = " + str(error))
    errors[err] = error
    err += 1
    print ('For rank {0} the RMSE is {1}'.format(rank, error))
    if error < minError:
        minError = error
        bestRank = rank

print ('The best model was trained with rank {0}'.format(bestRank))

Root-mean-square error = 0.9349619962653551
For rank 4 the RMSE is 0.9349619962653551
Root-mean-square error = 0.9349619962653551
For rank 8 the RMSE is 0.9349619962653551
Root-mean-square error = 0.9349619962653551
For rank 12 the RMSE is 0.9349619962653551
The best model was trained with rank 4


In [None]:
validationDF.count()

Out[81]: 200114

In [None]:
# TEST Using ALS.train (2c)
if variant=='20m':
  Test.assertEquals(validationDF.count(), 4000812,
                    'incorrect size for validationDF (expected 2000296)')
  Test.assertTrue(np.abs(errors[0] - 0.8283051099783386) < tolerance, 'incorrect errors[0]')
  Test.assertTrue(np.abs(errors[1] - 0.8283051099783386) < tolerance, 'incorrect errors[1]')
  Test.assertTrue(np.abs(errors[2] - 0.8283051099783386) < tolerance, 'incorrect errors[2]')
elif variant=='10m':
  Test.assertEquals(validationDF.count(), 2000494,
                    'incorrect size for validationDF (expected 2000296)')
  Test.assertTrue(np.abs(errors[0] - 0.8404652000681564) < tolerance, 'incorrect errors[0]')
  Test.assertTrue(np.abs(errors[1] - 0.840465200027002) < tolerance, 'incorrect errors[1]')
  Test.assertTrue(np.abs(errors[2] - 0.8404652000681551) < tolerance, 'incorrect errors[2]')
elif variant=='1m':
  Test.assertEquals(validationDF.count(), 200410,
                    'incorrect size for validationDF (expected 200410)')
  Test.assertTrue(np.abs(errors[0] - 0.9304807449436123) < tolerance, 'incorrect errors[0]')
  Test.assertTrue(np.abs(errors[1] - 0.9304807449436945) < tolerance, 'incorrect errors[1]')
  Test.assertTrue(np.abs(errors[2] - 0.9304807449436955) < tolerance, 'incorrect errors[2]')
elif variant=='100k':
  Test.assertEquals(validationDF.count(), 20006,
                    'incorrect size for validationDF (expected 19830)')
  Test.assertTrue(np.abs(errors[0] - 1.1448114732233339) < tolerance, 'incorrect errors[0]')
  Test.assertTrue(np.abs(errors[1] - 1.1448114732233323) < tolerance, 'incorrect errors[1]')
  Test.assertTrue(np.abs(errors[2] - 1.1448114732233299) < tolerance, 'incorrect errors[2]')

1 test failed. incorrect size for validationDF (expected 200410)
1 test passed.
1 test passed.
1 test passed.


# Note: 
We can see that the test that expected 200410 fails, the reason for this is that as before, when the variant has a value of 1 million, the test fails as the numbers are not exactly the expected ones in the tests...

In [None]:
validationDF.count()

Out[83]: 200114

#### **(2d) Testing Your Model**
#### So far, we used the `trainingDF` and `validationDF` datasets to select the best model.  Since we used these two datasets to determine what model is best, we cannot use them to test how good the model is - otherwise we would be very vulnerable to [overfitting](https://en.wikipedia.org/wiki/Overfitting).  To decide how good our model is, we need to use the `testDF` dataset.  We will use the `bestRank` you determined in part (2c) to create a model for predicting the ratings for the test dataset and then we will compute the RMSE.


In [None]:
# TODO: Replace <FILL IN> with appropriate code
myModel = ALS(maxIter=iterations, regParam=regularizationParameter, rank=bestRank,
              userCol="userId", itemCol="movieId", ratingCol="rating",
              coldStartStrategy="drop").fit(trainingDF) 

              # we are doing the ALS model setting all the parameters correctly and then we fit into the model the training data. 

predictedTestDF = myModel.transform(testDF) # to obtain the predictions we use the data for testing

# Evaluate RMSE on the test set
testRMSE = evaluator.evaluate(predictedTestDF) # and finally we evaluate the model in the predictions set. 

print('The model had an RMSE on the test set of {0}'.format(testRMSE))


The model had an RMSE on the test set of 0.8884495108032433


In [None]:
# TEST Testing Your Model (2d)
if variant=='20m':
  Test.assertTrue(np.abs(testRMSE - 0.831243009181003) < tolerance, 'incorrect testRMSE')
elif variant=='10m':
  Test.assertTrue(np.abs(testRMSE - 0.831243009181003) < tolerance, 'incorrect testRMSE')
elif variant=='1m':
  Test.assertTrue(np.abs(testRMSE - 0.9335514423411114) < tolerance, 'incorrect testRMSE')
elif variant=='100k':
  Test.assertTrue(np.abs(testRMSE - 0.9507663271652291) < tolerance, 'incorrect testRMSE')

1 test failed. incorrect testRMSE


# Note
As in the 3 or 4 previous cases, there is one test, the one regarding the variant of 1m, that is gonna fail...

#### **(2e) Comparing Your Model**
#### Looking at the RMSE for the results predicted by the model versus the values in the test set is one way to evalute the quality of our model. Another way to evaluate the model is to evaluate the error from a test set where every rating is the average rating for the training set.
#### The steps you should perform are:
* #### Use the `trainingDF` to compute the average rating across all movies in that training dataset.
* #### Use the average rating that you just determined and the `testDF to create an DataFrame with entries of the form (userID, movieID, average rating).
* #### Calculate the RMSE between the TrainingDF and TestDF average ratings.

In [None]:
# TODO: Replace <FILL IN> with appropriate code

trainingAvgRating = trainingDF.select('rating').groupBy().avg().first()[0] # We are getting the average rating across all movies in the training dataset
print('The average rating for movies in the training set is {0}'.format(trainingAvgRating))

testForAvgDF = testDF.withColumn('prediction', lit(trainingAvgRating)) # Here we create a DataFrame with entries of the form (userID, movieID, average rating)
testAvgRMSE = evaluator.evaluate(testForAvgDF) # Finally we compute the RMSE between the training set and test set average ratings
print ('The RMSE on the average set is {0}'.format(testAvgRMSE))


The average rating for movies in the training set is 3.5829116835326587
The RMSE on the average set is 1.118766015889879


#### You now have coded to predict how users will rate movies!

## **Part 3: Predictions for the Users**

#### **(3a) Your Movie Ratings**
#### To help you provide ratings for yourself, we have included the following code to list the names and movie IDs of the 50 highest-rated movies from `movieLimitedAndSortedByRatingDF` which we created in part 1 the lab.

In [None]:
print ('Most rated movies:')
print ('(average rating, movie name, number of reviews)')
movieLimitedAndSortedByRatingDF.take(50)

Most rated movies:
(average rating, movie name, number of reviews)
Out[87]: [Row(avg_rating=4.560509554140127, Title='Seven Samurai (The Magnificent Seven) (Shichinin no samurai) (1954)', num_rating=628),
 Row(avg_rating=4.554557700942973, Title='Shawshank Redemption, The (1994)', num_rating=2227),
 Row(avg_rating=4.524966261808367, Title='Godfather, The (1972)', num_rating=2223),
 Row(avg_rating=4.52054794520548, Title='Close Shave, A (1995)', num_rating=657),
 Row(avg_rating=4.517106001121705, Title='Usual Suspects, The (1995)', num_rating=1783),
 Row(avg_rating=4.510416666666667, Title="Schindler's List (1993)", num_rating=2304),
 Row(avg_rating=4.507936507936508, Title='Wrong Trousers, The (1993)', num_rating=882),
 Row(avg_rating=4.477724741447892, Title='Raiders of the Lost Ark (1981)', num_rating=2514),
 Row(avg_rating=4.476190476190476, Title='Rear Window (1954)', num_rating=1050),
 Row(avg_rating=4.453694416583082, Title='Star Wars: Episode IV - A New Hope (1977)', num_rating=

#### The user ID 0 is unassigned, so we will use it for your ratings. We set the variable `myUserID` to 0 for you. Next, create a new DF `myRatingsDF` with your ratings for at least 10 movie ratings. Each entry should be formatted as `(myUserID, movieID, rating)` (i.e., each entry should be formatted in the same way as `trainingDF`).  As in the original dataset, ratings should be between 1 and 5 (inclusive). If you have not seen at least 10 of these movies, you can increase the parameter passed to `take()` in the above cell until there are 10 movies that you have seen (or you can also guess what your rating would be for movies you have not seen).

#### Using the Dataframe.union() methond, append your movies preferences to the traininnDF, and create a new ALS model, trained with your information.
#### Once trained, get 20 recommended movies for your user. Use the AlsModel.recommendForAllUsers() method https://spark.apache.org/docs/2.2.0/api/python/pyspark.ml.html#pyspark.ml.recommendation.ALSModel, and filter for your user. Check the recommended movies and evaluate if it fits to your preferences.

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.ml.recommendation import ALS
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

# Create a Spark session
spark = SparkSession.builder.appName("ALSExample").getOrCreate()

# Set your user ID
myUserID = 0

myRatings = [(myUserID, movieID, myRating) for movieID, myRating in [ # Create a new DataFrame with your ratings
    (1, 5.0), # Here we create a list with the ratings of 10 films, some films have different ratings.  
    (2, 4.5),
    (3, 3.0),
    (4, 4.0),
    (5, 2.5),
    (6, 4.0),
    (7, 3.5),
    (8, 5.0),
    (9, 4.5),
    (10, 3.0)
]]

# We have to convert the list of tuples that we have to a DataFrame values either integer or double (as we were having an error)
myRatingsSchema = StructType([
    StructField("userId", IntegerType(), True),
    StructField("movieId", IntegerType(), True),
    StructField("rating", DoubleType(), True)
])

myRatingsDF = spark.createDataFrame(myRatings, schema=myRatingsSchema)

print("My Ratings are this ones:")
myRatingsDF.show() # displaying ratings

trainingDFWithMyRatings = trainingDF.union(myRatingsDF) # as requested we use the union command

myALSModel = ALS(
    maxIter=iterations,
    regParam=regularizationParameter,
    rank=bestRank,
    userCol="userId",
    itemCol="movieId",
    ratingCol="rating",
    coldStartStrategy="drop"
).fit(trainingDFWithMyRatings)
# As we have done before we are training our ALS model and fitting it with the trainign data 

# We are asked to obtain 20 recommendations using the "recommendForAllUsers" command. Here it is: 
myRecommendations = myALSModel.recommendForAllUsers(20).filter("userId = {}".format(myUserID))

print("My Movie Recommendations are this ones:")
myRecommendations.show(truncate=False) # We are showing the recommendations for 10 users. 


My Ratings are this ones:
+------+-------+------+
|userId|movieId|rating|
+------+-------+------+
|     0|      1|   5.0|
|     0|      2|   4.5|
|     0|      3|   3.0|
|     0|      4|   4.0|
|     0|      5|   2.5|
|     0|      6|   4.0|
|     0|      7|   3.5|
|     0|      8|   5.0|
|     0|      9|   4.5|
|     0|     10|   3.0|
+------+-------+------+

My Movie Recommendations are this ones:
+------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|userId|recommendations                                                                                                                                                                                                          

# Alejo´s Personal Conclusions
