# Lab 3 - Data Science using Spark

AdventureWorks would like to add a snazzy product recommendations feature to their website and email marketing campaigns that, for every user in their system, can recommend the top 10 products they might be interested in purchasing.

AdventureWorks has provided you with the tables for users, products and weblogs that contain all of the data you need. You will train a recommendation model using Spark's built-in collaborative filtering algorithm - [Alternating Least Squares (ALS)](http://spark.apache.org/docs/2.1.0/mllib-collaborative-filtering.html). Then you will use the model to pre-compute the user-to-product recommendations for every user and save them in a table. Finally, you will query this table to quickly get the 10 product recommendations for a given user.

## Reviewing the Weblogs Data

First, let's import the modules and functions we will use.

In [None]:
from pyspark import SparkContext
from pyspark.sql import *
from pyspark.sql.types import *
import os
from pyspark.ml.recommendation import ALS
from pyspark.sql.functions import UserDefinedFunction

The weblogs table we have tells us what actions a user took on any given product when using the website. A user can browse a product, add it to the cart or purchase it. 

In [None]:
%%sql
SELECT UserId,  ProductId, Action FROM weblogs limit 5

## Prepare the training and test data sets

Let's start by selecting a significant subset of our data to use in training the model.

In [None]:
train = spark.sql("select * from weblogs where cleanedtransactiondate between '2016-03-01' and '2016-05-31'")

## Train a model

We begin by defining how we want to weigh the implicit rating described by the action field in the weblogs table. The rating is implicit because a user is not explicitly providing a rating (e.g., they never say "I rate this product 4 out of 5 stars"). Instead, we will infer their rating from their action.

A product that is browsed gets 30 points, a product that is added to the cart gets 70 points and a product that is purchased gets 100 points.

In [None]:
ActionPoints = {"Browsed":30, "Add To Cart":70, "Purchased":100}

Next, we will create a new DataFrame that contains only the data we are interested in plus the value of the action taken. So our ratings will include the UserId, the ProductId and the Points.

In [None]:
mapActionToPoints = UserDefinedFunction(lambda a: ActionPoints[a], IntegerType())
ratings_df = train.select(train.UserId, train.ProductId, mapActionToPoints(train.Action).alias('ActionPoints'))  
ratings_df.take(5)
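
To see what the mapping produces without starting Spark, here is a minimal plain-Python sketch of the same idea. The helper name `action_to_points` and the sample rows are hypothetical. Note that a direct `ActionPoints[a]` lookup raises `KeyError` for an unexpected action string; this sketch returns `None` instead, which is an assumption about how you might handle such rows, not what the lab's UDF does.

```python
# Plain-Python sketch of the action-to-points mapping (no Spark required).
ActionPoints = {"Browsed": 30, "Add To Cart": 70, "Purchased": 100}

def action_to_points(action):
    """Return the implicit rating for an action, or None if the action is unknown."""
    return ActionPoints.get(action)

# Hypothetical sample rows: (UserId, ProductId, Action)
sample_rows = [
    (1, 101, "Browsed"),
    (1, 102, "Purchased"),
    (2, 101, "Add To Cart"),
]

# Build (UserId, ProductId, Points) tuples, mirroring the columns of ratings_df.
ratings = [(user, product, action_to_points(action)) for user, product, action in sample_rows]
print(ratings)
```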

When training a model using ALS, we should cache the DataFrame because the algorithm will revisit the dataset many times during training.

In [None]:
ratings_df.cache()

When building a model using ALS, there are some settings, referred to as hyperparameters, that guide how the model gets trained. Typically you find the best values by iterating through a range of possible values and evaluating the predictive result. For simplicity, we will begin with the settings provided.

We will train our model using the training data set and our hyperparameters. **Note that this will take about 1-2 minutes to complete.**
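
The "iterate and evaluate" idea can be sketched in plain Python as a small grid search. Everything here is illustrative: `grid_search`, `rmse` and `fake_train_and_score` are hypothetical helpers, and the scoring function is a stand-in formula rather than a real ALS training run. In practice the callable would train a model with the given settings and score its predictions on held-out data.

```python
import itertools
import math

def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def grid_search(param_grid, train_and_score):
    """Try every combination in param_grid and return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = train_and_score(**params)  # lower is better
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical stand-in for "train with these settings, then compute RMSE on held-out data".
def fake_train_and_score(maxIter, regParam):
    held_out = [30, 70, 100]
    predicted = [r + regParam * 10 + 5.0 / maxIter for r in held_out]
    return rmse(held_out, predicted)

grid = {"maxIter": [5, 10], "regParam": [0.01, 0.1]}
best_params, best_score = grid_search(grid, fake_train_and_score)
print(best_params, round(best_score, 3))
```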

In [None]:
#TODO: Invoke the fit function of the ALS object providing the training DataFrame
als = ALS(maxIter=10, regParam=0.01, userCol="UserId", itemCol="ProductId", ratingCol="ActionPoints")
model = als.#TODO( )

As you noticed, training a model takes some time. Fortunately, we don't have to pay the training cost every time we want to use the model. To allow us to re-use a trained model, we can save it to disk (in Azure Storage).

To do so, we invoke the save method on the model and indicate the path where it should be stored. You can store the models anywhere you want, but we chose /models/cfmodel.

In [None]:
#TODO: Invoke the save method on the model object, providing the path in which to serialize the model.
model.write().overwrite().#TODO

Before we move on, let's confirm our model was saved successfully.

In [None]:
%%sh
hdfs dfs -ls /models/cfmodel

### Create a temporary view for the products data

Now we need to load and parse the product data from Azure Storage and present it using a temporary view called Products_View. 

In [None]:
products_schema = StructType([
        StructField('ProductId',IntegerType(),False), 
        StructField('ProductName', StringType()), 
        StructField('Price', FloatType()), 
        StructField('CategoryId', StringType()), 
        StructField('Ignore1', StringType()), 
        StructField('Ignore2', StringType()), 
        StructField('Ignore3', StringType()), 
        StructField('Category', StringType()), 
        StructField('Department', StringType())
    ])

products_DF = spark.read.csv("/retaildata/rawdata/ProductFile/part{*}", 
                    schema=products_schema,
                    header=False)

products_DF_with_price = products_DF.select("ProductId", "ProductName", "Price", "CategoryId", "Category", "Department")
products_DF_with_price.printSchema()
products_DF_with_price.show(5)

products_DF_with_price.createOrReplaceTempView("Products_View")

## Test the model

Now, let's put our model to use. We'll begin by creating a DataFrame that contains the set of UserId and ProductId pairs for which we want to predict a rating.

In [None]:
test_users_df = spark.sql("select distinct w.UserId, p.ProductId from weblogs w cross join Products_View p where w.UserId > 3686 and w.UserId < 3706")
test_users_df.show(10)
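
The cross join in the query above pairs every selected user with every product, so the number of rows to score is the product of the two counts. A minimal plain-Python sketch follows; the ID ranges and product IDs are hypothetical stand-ins for the contents of weblogs and Products_View.

```python
import itertools

# Hypothetical stand-ins for the distinct user IDs in weblogs
# and the product IDs in Products_View.
user_ids = range(3680, 3710)
product_ids = [101, 102, 103]

# Same filter as the SQL: UserId > 3686 and UserId < 3706
selected_users = [u for u in user_ids if 3686 < u < 3706]

# A cross join pairs every selected user with every product.
candidate_pairs = list(itertools.product(selected_users, product_ids))

print(len(selected_users), len(candidate_pairs))
```

Because both bounds are exclusive, the filter selects at most 19 user IDs (3687 through 3705), which keeps the test set small enough to score quickly.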

Now we use the transform method on the model to create a new DataFrame that includes all of the columns from our test_users_df and adds a new prediction column that indicates the "confidence" of the prediction.

In [None]:
#TODO: Invoke the transform method of the model on the test data for which we want to acquire predictions.
predictions = model.#TODO
predictions.show(5)

In the prediction column, we may have NaN (not a number) values, which simply mean no prediction. Let's clean up our prediction DataFrame by omitting rows with NaN values for the prediction, cache the results and take a peek.

In [None]:
#TODO: filter out prediction values of "NaN" with the isnan() SQL function
user_product_ratings = predictions.select("UserId", "ProductId", "prediction").where("not #TODO").orderBy("UserId", predictions.prediction.desc())
user_product_ratings.cache()
user_product_ratings.show(5)
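
The filter-then-sort step can be sketched in plain Python to show how NaN behaves. The sample rows are hypothetical. A NaN prediction typically appears when the model has no learned factors for a user or product, for example because that ID was not in the training data.

```python
import math

# Hypothetical predictions: (UserId, ProductId, prediction)
predictions = [
    (3696, 101, 42.5),
    (3696, 102, float("nan")),
    (3690, 101, 12.0),
    (3696, 103, 88.1),
]

# NaN is not equal to anything, including itself, so test it with
# math.isnan rather than with ==.
kept = [row for row in predictions if not math.isnan(row[2])]

# Order by UserId ascending, then by prediction descending.
kept.sort(key=lambda row: (row[0], -row[2]))
print(kept)
```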

We'll register this DataFrame as a temporary view.

In [None]:
user_product_ratings.createOrReplaceTempView("UserProductRatings_View")

### Get the top 10 recommended products for given user

With our product data available as a view, we now have all of the data sources we need to start making recommendations: Products_View, the UserProductRatings_View and the Users table. We want to define a function that will get us the top 10 recommended products for a given user (by user ID) by querying our data sources. We'll start by building the core lines of the function individually and then put them together into a function we can easily invoke.

Let's query the UserProductRatings_View for a sample user and examine what products our model suggests we recommend.

In [None]:
user_product_mapping = spark.sql("SELECT * FROM UserProductRatings_View WHERE UserId =" + str(3696))
user_product_mapping.show(10) 

Next, let's enrich the recommendation by joining the output with the products_DF_with_price DataFrame so that we can see the human-friendly product names, category names and department names.

In [None]:
recommended_products = user_product_mapping.join(
        products_DF_with_price, user_product_mapping.ProductId == products_DF_with_price.ProductId
).orderBy( user_product_mapping.prediction.desc() )

recommended_products.where("UserId = 3696").show(10)
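
The lookup pattern used above (filter by user, join to product details, order by prediction, take the top N) can be sketched in plain Python. The function name `top_n_for_user` and the sample data are hypothetical; the Spark version in this lab operates on DataFrames instead of lists and dictionaries.

```python
def top_n_for_user(ratings, products, user_id, n=10):
    """Return up to n (ProductId, ProductName, prediction) tuples for one user.

    ratings:  iterable of (UserId, ProductId, prediction)
    products: dict mapping ProductId -> ProductName
    """
    # Filter to the requested user and keep only products we can name (inner join).
    joined = [
        (product_id, products[product_id], prediction)
        for uid, product_id, prediction in ratings
        if uid == user_id and product_id in products
    ]
    # Highest prediction first.
    joined.sort(key=lambda row: row[2], reverse=True)
    return joined[:n]

# Hypothetical sample data.
ratings = [
    (3696, 1, 55.0),
    (3696, 2, 91.0),
    (3696, 3, 73.5),
    (3690, 2, 99.0),
    (3696, 9, 80.0),  # product 9 has no entry in products, so the join drops it
]
products = {1: "Toaster", 2: "Rain Jacket", 3: "Floor Mats"}

top = top_n_for_user(ratings, products, user_id=3696, n=2)
print(top)
```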

Let's put those lines together into a simple function we can call with a User ID.

In [None]:
#TODO 1: complete the missing line
#TODO 2: complete the missing line
def GetRecommendedProductsForUser(UserId):
    user_product_mapping = #TODO 1
    recommended_products = #TODO 2
    print("Users Information:")
    users_data = spark.sql("SELECT FirstName, LastName, Gender, Age from users WHERE id =" + str(UserId))
    users_data.show(1)
    print("Recommended Products:")
    recommended_products.show(10)

We'll invoke the function next for our example user.

In [None]:
GetRecommendedProductsForUser(UserId = 3696)

Let's interpret these results. If you look at the prediction column, the values for this user span a range, whereby the higher the rating, the more confident we are of the recommendation. 

So now we need to ask ourselves, do these recommendations make sense for our example customer? Let's look at how we might answer this question in the next section.

## Evaluate the model

Let's begin by getting a sense for what items the customer buys or strongly considers buying (by adding them to his or her cart). We can run the following query. When the output is displayed, select Pie to create a pie chart and set X to Department and Y to count(1).

In [None]:
%%sql
select w.Action, p.Department, p.Category, count(*)
from weblogs w 
join products_view p on w.ProductId = p.ProductId
where w.UserId = 3696 and (w.Action = "Purchased" or w.Action = "Add To Cart")
group by w.Action, p.Department, p.Category
order by w.Action desc, count(*) desc

Notice in the pie chart that we can readily see that Houseware, Clothing, Appliance, Toy, Outdoor & Garden and Auto make up this customer's top 6 favorite departments. As a first pass at evaluating the model, this lends support to the notion that our recommender is choosing products from the right departments for our customer when it suggests products from Clothing, Auto and Appliance.
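
One detail matters when filtering a single user's activity by two alternative actions: in SQL, `AND` binds more tightly than `OR`, so the `OR` needs parentheses to stay scoped to that user. The sketch below demonstrates the difference using Python's built-in `sqlite3` module and a tiny hypothetical weblogs table. It is an illustration of operator precedence, not a replacement for the Spark SQL query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weblogs (UserId INTEGER, ProductId INTEGER, Action TEXT)")
conn.executemany(
    "INSERT INTO weblogs VALUES (?, ?, ?)",
    [
        (3696, 1, "Purchased"),
        (3696, 2, "Add To Cart"),
        (3696, 3, "Browsed"),
        (1234, 4, "Add To Cart"),
        (1234, 5, "Purchased"),
    ],
)

# Without parentheses this reads as:
# (UserId = 3696 AND Action = 'Purchased') OR Action = 'Add To Cart'
unparenthesized = conn.execute(
    "SELECT COUNT(*) FROM weblogs "
    "WHERE UserId = 3696 AND Action = 'Purchased' OR Action = 'Add To Cart'"
).fetchone()[0]

# With parentheses the filter applies to user 3696 only.
parenthesized = conn.execute(
    "SELECT COUNT(*) FROM weblogs "
    "WHERE UserId = 3696 AND (Action = 'Purchased' OR Action = 'Add To Cart')"
).fetchone()[0]

print(unparenthesized, parenthesized)
conn.close()
```

The unparenthesized form also counts "Add To Cart" rows from other users, which would skew the department breakdown for the customer being evaluated.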

## Conclusion

In this lab, you learned how to perform collaborative filtering on a fairly large dataset and, in the process, helped AdventureWorks recommend products to its users based on their activity on the website.