## Correct format and distinct users
Take a look at the R dataframe. Notice that it is in conventional or "wide" format, with a different movie in each column. Also notice that the user and movie names are strings rather than integers. Follow the steps below to prepare this data for ALS.
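To make the wide-to-long reshaping concrete, here is a minimal plain-Python sketch of the idea. It is only an illustration: `wide_to_long` is a hypothetical helper written for this example and is not the `to_long()` function used in the cell below, whose implementation is not shown in this notebook. The data mirrors the R dataframe, with `None` standing in for null.

```python
# Wide-format rows mirroring the R dataframe; None marks a missing rating.
wide_rows = [
    {"User": "James Alking", "Coco": 4, "Shrek": 3, "Sneakers": 3, "Swing Kids": 4},
    {"User": "Elvira Marroquin", "Coco": 5, "Shrek": 4, "Sneakers": 2, "Swing Kids": None},
    {"User": "Jack Bauer", "Coco": 2, "Shrek": None, "Sneakers": 5, "Swing Kids": 2},
    {"User": "Julia James", "Coco": None, "Shrek": 5, "Sneakers": 2, "Swing Kids": 2},
]


def wide_to_long(rows, id_col="User"):
    """Hypothetical helper: emit one (User, Movie, Rating) row per non-null rating."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            # Skip the identifier column and any movie the user has not rated.
            if column == id_col or value is None:
                continue
            long_rows.append({id_col: row[id_col], "Movie": column, "Rating": value})
    return long_rows


long_rows = wide_to_long(wide_rows)
for row in long_rows[:4]:
    print(row)
print(len(long_rows), "ratings")
```

Each rating becomes its own row and the null cells disappear, which is the shape ALS expects: one row per user, item, and rating.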

In [2]:
# Import monotonically_increasing_id and show R
from pyspark.sql.functions import monotonically_increasing_id
#from pyspark.sql.functions import explode
R = spark.createDataFrame(R)
R.show()

# Use the to_long() function to convert the dataframe to the "long" format.
ratings = to_long(R)
ratings.show()

# Get unique users and coalesce them into a single partition
users = ratings.select("User").distinct().coalesce(1)

# Create a new column of unique integers called "userId" in the users dataframe.
users = users.withColumn("userId", monotonically_increasing_id()).persist()
users.show()

+----+-----+--------+----------+----------------+
|Coco|Shrek|Sneakers|Swing Kids|            User|
+----+-----+--------+----------+----------------+
|   4|    3|       3|         4|    James Alking|
|   5|    4|       2|      null|Elvira Marroquin|
|   2| null|       5|         2|      Jack Bauer|
|null|    5|       2|         2|     Julia James|
+----+-----+--------+----------+----------------+

+----------------+----------+------+
|            User|     Movie|Rating|
+----------------+----------+------+
|    James Alking|      Coco|     4|
|    James Alking|     Shrek|     3|
|    James Alking|  Sneakers|     3|
|    James Alking|Swing Kids|     4|
|Elvira Marroquin|      Coco|     5|
|Elvira Marroquin|     Shrek|     4|
|Elvira Marroquin|  Sneakers|     2|
|      Jack Bauer|      Coco|     2|
|      Jack Bauer|  Sneakers|     5|
|      Jack Bauer|Swing Kids|     2|
|     Julia James|     Shrek|     5|
|     Julia James|  Sneakers|     2|
|     Julia James|Swing Kids|     2|
+----------------+----------+------+

## Assigning integer IDs to movies
Let's do the same thing for the movies. Then let's join the new user IDs and movie IDs to the ratings in one dataframe.
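It may help to see why the data is coalesced into one partition before the IDs are created. Spark's documentation describes `monotonically_increasing_id()` as placing the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits. The sketch below imitates that layout in plain Python; `simulated_monotonic_ids` is a hypothetical function written for this illustration, not part of PySpark.

```python
def simulated_monotonic_ids(rows_per_partition):
    """Imitate the documented layout: (partition index << 33) + row index in that partition."""
    ids = []
    for partition_index, n_rows in enumerate(rows_per_partition):
        for row_index in range(n_rows):
            ids.append((partition_index << 33) + row_index)
    return ids


# Four distinct values held in a single partition, as after coalesce(1)
single_partition_ids = simulated_monotonic_ids([4])
print(single_partition_ids)  # [0, 1, 2, 3]

# The same four values spread over two partitions
two_partition_ids = simulated_monotonic_ids([2, 2])
print(two_partition_ids)  # [0, 1, 8589934592, 8589934593]

# IDs from the second partition no longer fit in a 32-bit signed integer
print(max(two_partition_ids) > 2**31 - 1)  # True
```

With a single partition the IDs are consecutive and start at 0. With several partitions they are still unique but can become very large, which is a problem because Spark's ALS guide states that user and item IDs must fall within the integer value range.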

In [3]:
# Extract the distinct movie names
movies = ratings.select("Movie").distinct()

# Coalesce the data into a single partition
movies = movies.coalesce(1)

# Create a new column of unique integers called "movieId"
movies = movies.withColumn("movieId", monotonically_increasing_id()).persist()

# Join the ratings, users and movies dataframes
movie_ratings = ratings.join(users, "User", "left").join(movies, "Movie", "left")
movie_ratings.show()

+----------+----------------+------+------+-------+
|     Movie|            User|Rating|userId|movieId|
+----------+----------------+------+------+-------+
|     Shrek|     Julia James|     5|     0|      0|
|     Shrek|    James Alking|     3|     1|      0|
|     Shrek|Elvira Marroquin|     4|     3|      0|
|      Coco|    James Alking|     4|     1|      1|
|      Coco|Elvira Marroquin|     5|     3|      1|
|      Coco|      Jack Bauer|     2|     2|      1|
|  Sneakers|Elvira Marroquin|     2|     3|      2|
|  Sneakers|      Jack Bauer|     5|     2|      2|
|  Sneakers|    James Alking|     3|     1|      2|
|  Sneakers|     Julia James|     2|     0|      2|
|Swing Kids|      Jack Bauer|     2|     2|      3|
|Swing Kids|    James Alking|     4|     1|      3|
|Swing Kids|     Julia James|     2|     0|      3|
+----------+----------------+------+------+-------+



## Build Out An ALS Model
Let's specify your first ALS model by completing the code below.

Recall that you can use the .columns attribute of the movie_ratings dataframe to see the names of the columns that contain the user, movie, and rating data. Spark needs to know the names of these columns in order to perform ALS correctly.
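For intuition about what ALS does with those three columns, the following is a minimal rank-1 sketch in plain Python. It alternates between solving for user factors with item factors held fixed and the reverse, using a closed-form regularized least-squares update. This is an illustration under simplifying assumptions (rank 1, plain L2 regularization, no nonnegativity constraint), not Spark's implementation; the function names are made up for this example. The triples are the userId, movieId, and Rating values from the movie_ratings table above.

```python
# (userId, movieId, Rating) triples taken from the movie_ratings table
ratings = [
    (0, 0, 5.0), (0, 2, 2.0), (0, 3, 2.0),
    (1, 0, 3.0), (1, 1, 4.0), (1, 2, 3.0), (1, 3, 4.0),
    (2, 1, 2.0), (2, 2, 5.0), (2, 3, 2.0),
    (3, 0, 4.0), (3, 1, 5.0), (3, 2, 2.0),
]
n_users, n_items, reg = 4, 4, 0.1


def regularized_loss(u, v):
    """Squared error on observed ratings plus an L2 penalty on both factor vectors."""
    error = sum((r - u[i] * v[j]) ** 2 for i, j, r in ratings)
    penalty = reg * (sum(x * x for x in u) + sum(x * x for x in v))
    return error + penalty


def solve_side(fixed, target_pos, fixed_pos, size):
    """Closed-form update of one side's factors while the other side is held fixed."""
    numer = [0.0] * size
    denom = [reg] * size  # the regularization term keeps the denominator positive
    for triple in ratings:
        t, f, r = triple[target_pos], triple[fixed_pos], triple[2]
        numer[t] += r * fixed[f]
        denom[t] += fixed[f] ** 2
    return [n / d for n, d in zip(numer, denom)]


u = [1.0] * n_users  # one latent factor per user
v = [1.0] * n_items  # one latent factor per movie
losses = [regularized_loss(u, v)]
for _ in range(15):
    u = solve_side(v, 0, 1, n_users)  # update users, movies fixed
    v = solve_side(u, 1, 0, n_items)  # update movies, users fixed
    losses.append(regularized_loss(u, v))

print(round(losses[0], 3), round(losses[-1], 3))
print(round(u[0] * v[1], 3))  # predicted rating for userId 0 on movieId 1 (unrated)
```

Because each half-step exactly minimizes the loss over one set of factors, the loss never increases from one iteration to the next. The rank, maxIter, and regParam arguments in the cell below play the roles of the factor count, the loop length, and `reg` in this sketch.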

In [4]:
# Convert the Rating column to an integer type
movie_ratings = movie_ratings.withColumn("Rating", movie_ratings.Rating.cast("integer"))

# Split the ratings dataframe into training and test data
(training_data, test_data) = movie_ratings.randomSplit([0.8, 0.2], seed=42)

# Set the ALS hyperparameters
from pyspark.ml.recommendation import ALS
als = ALS(userCol="userId", itemCol="movieId", ratingCol="Rating", rank=10, maxIter=15, regParam=0.1,
          coldStartStrategy="drop", nonnegative=True, implicitPrefs=False)

# Fit the model to the training_data
model = als.fit(training_data)

# Generate predictions on the test_data
test_predictions = model.transform(test_data)
test_predictions.show()

+----------+----------------+------+------+-------+----------+
|     Movie|            User|Rating|userId|movieId|prediction|
+----------+----------------+------+------+-------+----------+
|  Sneakers|Elvira Marroquin|     2|     3|      2| 2.8755982|
|     Shrek|Elvira Marroquin|     4|     3|      0| 3.4584947|
|Swing Kids|      Jack Bauer|     2|     2|      3| 2.8047183|
+----------+----------------+------+------+-------+----------+



## Build RMSE Evaluator
Now that you know how to fit a model to training data and generate test predictions, you need a way to evaluate how well your model performs. For this we'll build an evaluator. Evaluators in Spark can be built in various ways. For our purposes, we want a RegressionEvaluator that calculates the root mean squared error (RMSE). With the evaluator in place, we can score the predictions generated by a fitted model.
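For reference, the RMSE over $n$ test rows can be written as follows, where $r_i$ is the true value in the Rating column and $\hat{r}_i$ is the value in the prediction column. The symbols are notation introduced here for illustration.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( r_i - \hat{r}_i \right)^2}
```

Lower values indicate predictions that sit closer to the true ratings, and the result is expressed in the same units as the ratings themselves.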

In [5]:
# Import RegressionEvaluator
from pyspark.ml.evaluation import RegressionEvaluator

# Complete the evaluator code
evaluator = RegressionEvaluator(metricName="rmse", labelCol="Rating", predictionCol="prediction")

# Extract the 3 parameters
print(evaluator.getMetricName())
print(evaluator.getLabelCol())
print(evaluator.getPredictionCol())

rmse
Rating
prediction


## Get RMSE
Now that you know how to build a model and generate predictions, and have an evaluator to tell you how well it predicts ratings, you can calculate the RMSE to see how well the ALS model performed. We'll use the evaluator built in the previous exercise to calculate and print the RMSE.

In [6]:
# Evaluate the test_predictions dataframe
RMSE = evaluator.evaluate(test_predictions)

# Print the RMSE
print(RMSE)

0.7544251002111148


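Great work. An RMSE of about 0.75 means the model's test predictions are typically off by roughly three quarters of a rating point. Because RMSE squares the errors before averaging, large misses count more heavily than they would in a simple average of absolute errors.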
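As a sanity check, the same figure can be reproduced by hand from the three rows in the test_predictions output. The sketch below uses plain Python and also computes the mean absolute error (MAE) for comparison. The variable names are chosen for this illustration.

```python
import math

# (Rating, prediction) pairs copied from the test_predictions output
pairs = [(2, 2.8755982), (4, 3.4584947), (2, 2.8047183)]

errors = [prediction - rating for rating, prediction in pairs]

# RMSE: square the errors, average them, then take the square root
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))

# MAE: average the absolute errors
mae = sum(abs(e) for e in errors) / len(errors)

print(round(rmse, 4))  # 0.7544
print(round(mae, 4))   # 0.7406
```

The hand-computed RMSE matches the evaluator's result, and it is slightly larger than the MAE because squaring gives the two larger errors more weight.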