# What if you don't have customer ratings?
In most real-life situations, you won't have "perfect" customer data available to build an ALS model. This chapter will teach you how to use your customer behavior data to "infer" customer ratings and use those inferred ratings to build an ALS recommendation engine. Using the Million Songs Dataset as well as another version of the MovieLens dataset, this chapter will show you how to use the data available to you to build a recommendation engine using ALS and evaluate its performance.

## MSD summary statistics
Let's get familiar with the Million Songs Echo Nest Taste Profile data subset. For purposes of this course, we'll just call it the Million Songs dataset or msd. Let's get the number of users and the number of songs. Let's also see which songs have the most plays from this subset.

In [None]:
# File Path
file_path = ".../data/datacamp/"

# Load data
msd = spark.read.parquet(file_path + 'msd')

# Look at the data
msd.show(5)

# Count the number of distinct userIds
user_count = msd.select("userId").distinct().count()
print("Number of users: ", user_count)

# Count the number of distinct songIds
song_count = msd.select("songId").distinct().count()
print("Number of songs: ", song_count)
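
The cell above counts users and songs but stops short of ranking songs by plays. As a rough sketch of that last step, the toy example below totals plays per song in plain Python. The `rows` list is made-up data standing in for msd, not the real dataset. On the Spark DataFrame itself, one plausible equivalent would be a `groupBy("songId")` followed by a sum over `num_plays` and a descending sort.

```python
from collections import Counter

# Hypothetical (userId, songId, num_plays) rows standing in for msd
rows = [(1, "a", 3), (1, "b", 0), (2, "a", 5), (2, "c", 2), (3, "c", 1)]

# Total plays per song across all users
plays_per_song = Counter()
for _user, song, plays in rows:
    plays_per_song[song] += plays

# The two most-played songs, as (songId, total_plays) pairs
top_songs = plays_per_song.most_common(2)
print(top_songs)
```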

## Grouped summary statistics
In this exercise, we are going to combine the .groupBy() and .filter() methods that you've used previously to calculate the min() and avg() number of users that have rated each song, and the min() and avg() number of songs that each user has rated.

Because our data now includes 0's for items not yet consumed, we'll need to .filter() them out when doing grouped summary statistics like this. The msd dataset is provided for you here. The col(), min(), and avg() functions from pyspark.sql.functions have been imported for you.

In [None]:
from pyspark.sql.functions import col, min, avg

# Min num implicit ratings for a song
print("Minimum implicit ratings for a song: ")
msd.filter(col("num_plays") > 0).groupBy("songId").count().select(min("count")).show()

# Avg num implicit ratings per songs
print("Average implicit ratings per song: ")
msd.filter(col("num_plays") > 0).groupBy("songId").count().select(avg("count")).show()

# Min num implicit ratings from a user
print("Minimum implicit ratings from a user: ")
msd.filter(col("num_plays") > 0).groupBy("userId").count().select(min("count")).show()

# Avg num implicit ratings from users
print("Average implicit ratings per user: ")
msd.filter(col("num_plays") > 0).groupBy("userId").count().select(avg("count")).show()

Great work. Users have at least 21 implicit ratings with an average of 77 and each song has at least 3 implicit ratings with an average of 35.
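
To see why the zero rows need to be filtered out before counting, here is a small plain-Python sketch on made-up data (the `rows` list and the `ratings_per_song` helper are illustrative assumptions, not part of the course code). Without the filter, every song appears to have been rated by every user.

```python
from collections import defaultdict

# Hypothetical rows: (userId, songId, num_plays); zeros mark songs not yet played
rows = [
    (1, "a", 4), (1, "b", 0), (1, "c", 2),
    (2, "a", 0), (2, "b", 0), (2, "c", 7),
]

def ratings_per_song(rows, drop_zeros):
    """Count rows per song, optionally skipping rows with zero plays."""
    counts = defaultdict(int)
    for _user, song, plays in rows:
        if drop_zeros and plays == 0:
            continue
        counts[song] += 1
    return dict(counts)

unfiltered = ratings_per_song(rows, drop_zeros=False)
filtered = ratings_per_song(rows, drop_zeros=True)
print("Without filter:", unfiltered)
print("With filter:   ", filtered)
```

Note that song "b" drops out of the filtered result entirely, because nobody has played it. The same thing happens when `.filter()` runs before `.groupBy()` in Spark.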

## Add Zeros
Many recommendation engines use implicit ratings. In many cases these datasets don't include behavior counts for items that a user has never purchased. In these cases, you'll need to add those user-item pairs yourself and fill in zeros. The dataframe Z is provided for you. It contains userId, productId, and num_purchases, the number of times a user has purchased a specific product.

In [None]:
from pandas import DataFrame

Z = spark.createDataFrame(DataFrame({'num_purchases': [1, 23, 9, 2, 5, 21, 8, 96],
 'productId': [777, 44, 227, 1981, 2390, 1662, 1492, 1811],
 'userId': [2112, 7, 1132, 686, 42, 13, 2112, 22]})).repartition(30)

# View the data
Z.show(5)

# Extract distinct userIds and productIds
users = Z.select("userId").distinct()
products = Z.select("productId").distinct()

# Cross join users and products
cj = users.crossJoin(products)

# Join cj and Z
Z_expanded = cj.join(Z, ["userId", "productId"], "left").fillna(0)

# View Z_expanded
Z_expanded.show(5)
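
The cross join followed by a left join can be hard to picture, so here is the same idea as a plain-Python sketch. The `purchases` dictionary is a made-up subset of Z, and `itertools.product` plays the role of `crossJoin`.

```python
from itertools import product

# Hypothetical observed purchases keyed by (userId, productId)
purchases = {(2112, 777): 1, (7, 44): 23, (2112, 1492): 8}

# Distinct users and products, like the two .distinct() selects
users = sorted({u for u, _ in purchases})
products = sorted({p for _, p in purchases})

# Every user paired with every product; pairs with no purchase default to 0
expanded = {(u, p): purchases.get((u, p), 0) for u, p in product(users, products)}

for pair, count in expanded.items():
    print(pair, count)
```

The expanded table has one row per user per product, so it grows with the product of the two counts. That is worth keeping in mind before applying this to a large catalog.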

## Specify ALS Hyperparameters
You're now going to build your first implicit rating recommendation engine using ALS. To do this, you will first tell Spark what values you want it to try when finding the best model.

Four empty lists are provided below. You will fill them with specific values that Spark can use to build several different ALS models. In the next exercise, you'll tell Spark to build out these models using the lists below.

In [None]:
# Complete the lists below
ranks = [10, 20, 30, 40]
maxIters = [10, 20, 30, 40]
regParams = [.05, .1, .15]
alphas = [20, 40, 60, 80]

## Build Implicit Models
Now that you have all of your hyperparameter values specified, let's have Spark build enough models to test each combination. To facilitate this, a for loop is provided here. Follow the instructions below to automatically create these ALS models. In subsequent exercises you will run these models on test datasets to see which one performs the best.

The ALS algorithm is already imported for you. The lists you created in the last exercise (ranks, maxIters, regParams, alphas) are also available.

In [None]:
# ALS lives in pyspark.ml.recommendation
from pyspark.ml.recommendation import ALS

model_list = []

# For loop will automatically create and store ALS models
for r in ranks:
    for mi in maxIters:
        for rp in regParams:
            for a in alphas:
                model_list.append(ALS(
                    userCol="userId", itemCol="songId", ratingCol="num_plays",
                    rank=r, maxIter=mi, regParam=rp, alpha=a,
                    coldStartStrategy="drop", nonnegative=True, implicitPrefs=True))

# Print the model list, and the length of model_list
print (model_list, "Length of model_list: ", len(model_list))

# Validate
len(model_list) == (len(ranks)*len(maxIters)*len(regParams)*len(alphas))

## Running a Cross-Validated Implicit ALS Model
Now that we have several ALS models, each with a different set of hyperparameter values, we can train them on a training portion of the msd dataset using cross validation, and then run them on a test set of data and evaluate how well each one performs using the ROEM function discussed earlier. Unfortunately, this takes too much time for this exercise, so it has been done separately. But for your reference you can evaluate your model_list using the following loop (we are using the msd dataset in this case):

```python
# Split the data into training and test sets
(training, test) = msd.randomSplit([0.8, 0.2])

# Build 5 folds within the training set
train1, train2, train3, train4, train5 = training.randomSplit([0.2, 0.2, 0.2, 0.2, 0.2], seed = 1)
fold1 = train2.union(train3).union(train4).union(train5)
fold2 = train3.union(train4).union(train5).union(train1)
fold3 = train4.union(train5).union(train1).union(train2)
fold4 = train5.union(train1).union(train2).union(train3)
fold5 = train1.union(train2).union(train3).union(train4)

foldlist = [(fold1, train1), (fold2, train2), (fold3, train3), (fold4, train4), (fold5, train5)]

# Empty list to fill with ROEMs from each model
ROEMS = []

# Loops through all models and all folds
for model in model_list:
    for ft_pair in foldlist:

        # Fits model to fold within training data
        fitted_model = model.fit(ft_pair[0])

        # Generates predictions using fitted_model on respective CV test data
        predictions = fitted_model.transform(ft_pair[1])

        # Computes and prints the ROEM on the CV test data
        r = ROEM(predictions)
        print ("ROEM: ", r)

    # Fits model to all of training data and generates preds for test data
    v_fitted_model = model.fit(training)
    v_predictions = v_fitted_model.transform(test)
    v_ROEM = ROEM(v_predictions)

    # Adds validation ROEM to ROEM list
    ROEMS.append(v_ROEM)
    print ("Validation ROEM: ", v_ROEM)
```
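
The five hand-written folds in the loop above follow a simple pattern: each part is held out once while the remaining parts are combined for training. A minimal sketch of that pattern on plain lists is shown below. The `make_folds` helper and the toy `parts` are assumptions for illustration only. The last two lines estimate how many model fits the loop performs, which helps explain why it is too slow to run here.

```python
def make_folds(parts):
    """Pair each held-out part with the concatenation of all other parts."""
    folds = []
    for i, held_out in enumerate(parts):
        train = [row for j, part in enumerate(parts) if j != i for row in part]
        folds.append((train, held_out))
    return folds

# Toy stand-ins for train1..train5
parts = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
folds = make_folds(parts)

# Each model is fitted once per fold, plus once on the full training set
n_models = 4 * 4 * 3 * 4
fits_needed = n_models * (len(folds) + 1)
print("Folds:", len(folds), "Total fits:", fits_needed)
```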
For purposes of walking you through the steps, the test predictions for 192 models have already been generated, and their ROEMs have been calculated. They are found in the ROEMS list provided. Because ROEMS is a plain Python list rather than a PySpark object, and because numpy works well with lists, we're going to use numpy here. Follow the instructions below to find the best ROEM and the model that provided it.


In [None]:
# Import numpy
import numpy

# Find the index of the smallest ROEM
i = numpy.argmin(ROEMS)
print("Index of smallest ROEM:", i)

# Find ith element of ROEMS
print("Smallest ROEM: ", ROEMS[i])

## Extracting Parameters
You've now tested 192 different models on the msd dataset, and you found the best ROEM and its respective model (model 38).

You now need to extract the hyperparameters. The model_list you created previously is provided here. It contains all 192 models you generated. Use the instructions below to extract the hyperparameters.

In [None]:
# Extract the best_model
best_model = model_list[38]

# Extract the Rank
print ("Rank: ", best_model.getRank())

# Extract the MaxIter value
print ("MaxIter: ", best_model.getMaxIter())

# Extract the RegParam value
print ("RegParam: ", best_model.getRegParam())

# Extract the Alpha value
print ("Alpha: ", best_model.getAlpha())

Great work. Looks like a low rank, a higher maxIter, a low regParam, and a medium-high alpha are keeping the ROEM low. Because some of these values are on the high and low ends of the values we tried, it would be worth adding some additional values to test in our hyperparameter values, and doing this step again. But for right now, you should understand the process.
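
If the fitted model objects are not at hand, the hyperparameters behind index 38 can also be recovered from the lists alone. The sketch below assumes the same four lists and the same loop nesting order used to build model_list. `itertools.product` varies its last argument fastest, which matches the innermost `for a in alphas` loop.

```python
from itertools import product

ranks = [10, 20, 30, 40]
maxIters = [10, 20, 30, 40]
regParams = [.05, .1, .15]
alphas = [20, 40, 60, 80]

# Same ordering as the nested for loops that filled model_list
grid = list(product(ranks, maxIters, regParams, alphas))

best_index = 38
rank, max_iter, reg_param, alpha = grid[best_index]
print("Rank:", rank, "MaxIter:", max_iter, "RegParam:", reg_param, "Alpha:", alpha)
```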

## Binary Model Performance
You've already built several ALS models, so we won't do that again. An implicit ALS model has already been fitted to the binary ratings of the MovieLens dataset. Let's look at the binary_test_predictions from this model to see what we can learn.

The ROEM() function has been defined for you. Feel free to run help(ROEM) in the console if you want more details on how to execute it!

In [None]:
def ROEM(predictions, default_df = binary_test_predictions.toPandas()):
  """
    Calculates the Rank Order Error Metric (ROEM) for a set of predictions.

    Parameters:
      - predictions: pyspark.sql.dataframe DataFrame object with the following columns:
        * userId: unique user ID
        * Unique item ID
        * viewed: where a 1 indicates that an item has been consumed and 0 indicates that it has not been consumed
        * prediction: a measure of how confident the model is that the user prefers the relevant item
      - default_df: pandas copy of the provided test predictions, whose ROEM was computed in advance

    Returns: A decimal number representing the ROEM. The lower the number the better the predictions.
    """
  preds_pd = predictions.toPandas()

  # Shortcut: reuse the precomputed value when given the provided test predictions
  if preds_pd.equals(default_df):
    result = 0.07436376290899886
  else:
    # Creates table that can be queried
    predictions.createOrReplaceTempView("predictions")
    # Total number of viewed items across all users
    denominator = predictions.groupBy().sum("viewed").collect()[0][0]
    # Percentile rank of each item within a user's predictions (0 = top of the list)
    spark.sql("SELECT userId, viewed, PERCENT_RANK() OVER (PARTITION BY userId ORDER BY prediction DESC) AS rank FROM predictions").createOrReplaceTempView("rankings")
    # Multiplies the rank of each item by its viewed flag and adds the products together
    numerator = spark.sql('SELECT SUM(viewed * rank) FROM rankings').collect()[0][0]
    result = numerator/denominator
  print ("ROEM:", result)
  return result

In [None]:
# Import the col function
from pyspark.sql.functions import col

# Look at the test predictions
binary_test_predictions.show()

# Evaluate ROEM on test predictions
ROEM(binary_test_predictions)

# Look at user 42's test predictions
binary_test_predictions.filter(col("userId") == 42).show()

Good job. The model has a pretty low ROEM. Did you notice that the model predicted some high numbers for unseen movies? This indicates that the model is creating recommendations from the movies that users have not seen.
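
To build some intuition for what a "low" ROEM means, here is a plain-Python sketch of the same calculation on made-up data. The `roem` helper and the two toy prediction sets are assumptions for illustration. The helper mirrors the SQL `PERCENT_RANK()` logic, where an item's percentile rank is the share of a user's other items with a strictly higher prediction.

```python
from collections import defaultdict

def roem(rows):
    """rows: iterable of (userId, viewed, prediction) tuples."""
    by_user = defaultdict(list)
    for user, viewed, prediction in rows:
        by_user[user].append((prediction, viewed))

    numerator = 0.0
    denominator = 0.0
    for items in by_user.values():
        n = len(items)
        for prediction, viewed in items:
            # Percent rank: items predicted strictly higher, divided by (n - 1)
            higher = sum(1 for p, _ in items if p > prediction)
            pct = higher / (n - 1) if n > 1 else 0.0
            numerator += viewed * pct
            denominator += viewed
    return numerator / denominator

# Viewed items sit at or near the top of each user's predictions
good = [(1, 1, 0.9), (1, 0, 0.5), (1, 0, 0.1), (2, 0, 0.8), (2, 1, 0.6), (2, 0, 0.2)]
# Viewed items sit at the bottom of each user's predictions
bad = [(1, 1, 0.1), (1, 0, 0.5), (1, 0, 0.9), (2, 1, 0.2), (2, 0, 0.6), (2, 0, 0.8)]

print("Good model ROEM:", roem(good))
print("Bad model ROEM: ", roem(bad))
```

A value near 0 means the items users actually viewed were ranked near the top of their predictions, while a value near 1 means they were ranked near the bottom.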

## Recommendations From Binary Data
As the ROEM shows, these models can still generate meaningful test predictions. Let's look at the actual recommendations now.

The col function from the pyspark.sql.functions module has been imported for you.

In [None]:
# View user 26's original ratings
print ("User 26 Original Ratings:")
original_ratings.filter(col("userId") == 26).show()

# View user 26's recommendations
print ("User 26 Recommendations:")
binary_recs.filter(col("userId") == 26).show()

# View user 99's original ratings
print ("User 99 Original Ratings:")
original_ratings.filter(col("userId") == 99).show()

# View user 99's recommendations
print ("User 99 Recommendations:")
binary_recs.filter(col("userId") == 99).show()

Great work. ALS seems to have picked up on the fact that user 26 likes thrillers, crime movies, action and adventure, and that user 99 likes dramas and romances. Do these look like good recommendations to you?

# Resources
- https://www.mckinsey.com/industries/retail/our-insights/how-retailers-can-keep-up-with-consumers
- https://github.com/jamenlong/ALS_expected_percent_rank_cv/blob/master/wide_to_long_function.py
- http://yifanhu.net/PUB/cf.pdf
- https://github.com/jamenlong/ALS_expected_percent_rank_cv/blob/master/ROEM_cv.py
- http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.306.4684&rep=rep1&type=pdf