# Lab 3 - Data Science using Spark

AdventureWorks would like to add a snazzy product recommendations feature to their website and email marketing campaigns that, for every user in their system, can recommend the top 10 products they might be interested in purchasing.

AdventureWorks has provided you with tables for users, products and weblogs that contain all of the data you need. You will train a recommendation model using Spark's built-in collaborative filtering algorithm, [Alternating Least Squares (ALS)](http://spark.apache.org/docs/2.1.0/mllib-collaborative-filtering.html). Then you will use the model to pre-compute the user-to-product recommendations for every user and save them in a table. Finally, you will query this table to quickly get the 10 product recommendations for a given user.

## Reviewing the Weblogs Data

First, let's import the modules and functions we will use.

In [None]:
from pyspark import SparkContext
from pyspark.sql import *
from pyspark.sql.types import *
import os
from pyspark.mllib.recommendation import *

The weblogs table we have tells us what actions a user took on any given product when using the website. A user can browse a product, add it to the cart or purchase it. 

In [None]:
%%sql
SELECT UserId, ProductId, Action FROM weblogs LIMIT 5

## Prepare the training and test data sets

Let's start by selecting a significant subset of our data to use in training the model.

In [None]:
train = spark.sql("select * from weblogs where cleanedtransactiondate between '2016-03-01' and '2016-05-31'")

## Train a model

We begin by defining how we want to weight the implicit rating described by the Action field in the weblogs table. The rating is implicit because a user never explicitly provides one (e.g., they never say "I rate this product 4 out of 5 stars"). Instead, we infer their rating from the action they took.

A product that is browsed gets 30 points, a product that is added to the cart gets 70 points and a product that is purchased gets 100 points.

In [None]:
ActionPoints = {"Browsed":30, "Add To Cart":70, "Purchased":100}

Next, we will create a new RDD in which each element holds only the data we are interested in plus the value of the action taken. Each rating will include the UserId, the ProductId and the Points.

In [None]:
ratingsRdd = train.rdd.map(lambda s: (s.UserId, s.ProductId, ActionPoints[s.Action]))
ratingsRdd.take(5)
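
To see the mapping without a cluster, here is a minimal plain-Python sketch of the same transformation. The `WeblogRow` namedtuple, the `to_rating` helper and the sample rows are hypothetical stand-ins for illustration; only the `ActionPoints` weights come from this lab.

```python
from collections import namedtuple

# Hypothetical stand-in for one row of the weblogs table (illustration only).
WeblogRow = namedtuple("WeblogRow", ["UserId", "ProductId", "Action"])

# Same weights as the lab's ActionPoints dictionary.
ActionPoints = {"Browsed": 30, "Add To Cart": 70, "Purchased": 100}


def to_rating(row):
    """Turn one weblog row into a (UserId, ProductId, Points) tuple."""
    return (row.UserId, row.ProductId, ActionPoints[row.Action])


# Made-up sample rows, not real lab data.
sample_rows = [
    WeblogRow(1, 501, "Browsed"),
    WeblogRow(1, 502, "Purchased"),
    WeblogRow(2, 501, "Add To Cart"),
]

ratings = [to_rating(r) for r in sample_rows]
print(ratings)
```

As in the Spark version, an action that is missing from `ActionPoints` raises a `KeyError`, so unexpected values in the Action column surface immediately rather than being silently scored.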

When training a model using ALS, we should cache the RDD because the algorithm will revisit the dataset many times during training.

In [None]:
ratingsRdd.cache()

When building a model using ALS, there are some settings, referred to as hyperparameters, that guide how the model gets trained. Typically you find the best values by iterating through a range of possible values and evaluating the predictive result of each. For simplicity, we will begin with the following settings.

In [None]:
rank = 10
numIterations = 10
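
The search over hyperparameter values mentioned above could look roughly like the following plain-Python sketch. It is not part of the lab: `rmse`, `grid_search` and `toy_score` are hypothetical helpers, and `toy_score` merely fakes the "train a model, predict held-out ratings, measure the error" step so that the loop can run without Spark.

```python
import itertools
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def grid_search(ranks, iteration_counts, score_fn):
    """Try every combination and return (score, rank, num_iterations) with the lowest score."""
    best = None
    for rank, num_iterations in itertools.product(ranks, iteration_counts):
        score = score_fn(rank, num_iterations)
        if best is None or score < best[0]:
            best = (score, rank, num_iterations)
    return best


def toy_score(rank, num_iterations):
    """Stand-in for real training and evaluation (an assumption, not ALS)."""
    actual = [30.0, 70.0, 100.0]
    # Fake predictions whose error shrinks as the settings approach (10, 10).
    offset = abs(rank - 10) + abs(num_iterations - 10)
    predicted = [a + offset for a in actual]
    return rmse(predicted, actual)


best_score, best_rank, best_iterations = grid_search([5, 10, 20], [5, 10], toy_score)
print(best_rank, best_iterations, best_score)
```

In a real run, the scoring function would train a model on the training set with the given settings and compute the error on data held out from training.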

Now we will train our model using the training data set and our hyperparameters. **Note that this will take about 12 minutes to complete.**

In [None]:
#TODO: Invoke the train function of the ALS object providing ratingsRdd, rank and numIterations
model = #TODO( , , )

As you noticed, training a model takes some time. Fortunately, we don't have to pay the training cost every time we want to use the model. To allow us to reuse a trained model, we can save it to disk (in Azure Storage).

First, let's make sure we have a clean models directory (in practice you can store the models anywhere you want, but we chose /models).

In [None]:
%%sh
hdfs dfs -rm -r -f /models

Next, we want to create a subfolder for our collaborative filtering model. We'll name the subfolder cfmodel. 

In [None]:
%%sh
hdfs dfs -mkdir /models
hdfs dfs -mkdir /models/cfmodel

With the folder structure in place, we need to invoke the save method on the model and indicate the path to the folder we created for it.

In [None]:
#TODO: Invoke the save method on the model object, providing the SparkContext (sc) and the path in which to serialize the model.
model.#TODO( , "/models/cfmodel")

Before we move on, let's confirm our model was saved successfully.

In [None]:
%%sh
hdfs dfs -ls /models/cfmodel

## Test the model

Now, let's put our model to use. We'll begin by using the recommendProductsForUsers function provided by our model to recommend 100 products for each user. 

In [None]:
print(model.recommendProductsForUsers.__doc__)

In [None]:
#TODO: Invoke the recommendProductsForUsers function of the model, requesting the top 100 products for each user.
products_for_users = model.#TODO( )

Next, let's use those recommendations to create an RDD where each row contains the UserId, ProductId and Rating of the match.

In [None]:
users_product_ratings = products_for_users.flatMap(lambda xs: [Row(UserId=x[0], ProductId=x[1], Rating=x[2]) for x in xs[1]])
users_product_ratings.first()
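
The flattening step can be easier to follow on a small in-memory example. The sketch below assumes the recommendations arrive as one `(user, recommendations)` pair per user, where each recommendation has `user`, `product` and `rating` fields. The `Rating` namedtuple here is a local stand-in with those field names, the `flatten` helper is hypothetical, and the sample values are made up.

```python
from collections import namedtuple

# Local stand-in with the same field order the flatMap relies on (user, product, rating).
Rating = namedtuple("Rating", ["user", "product", "rating"])

# Made-up sample: one (user, recommendations) pair per user.
products_for_users = [
    (1, (Rating(1, 501, 55.9), Rating(1, 502, 41.2))),
    (2, (Rating(2, 503, 37.5),)),
]


def flatten(pairs):
    """Emit one flat record per recommendation, mirroring the flatMap lambda."""
    return [
        {"UserId": r[0], "ProductId": r[1], "Rating": r[2]}
        for _user, recs in pairs
        for r in recs
    ]


rows = flatten(products_for_users)
for row in rows:
    print(row)
```

Each nested group of recommendations becomes several flat rows, which is what makes the result straightforward to turn into a table with UserId, ProductId and Rating columns.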

Now let's convert this RDD into a DataFrame.

In [None]:
user_product_ratings_df = users_product_ratings.toDF()
user_product_ratings_df.show(6)

We'll register this DataFrame as a temporary view.

In [None]:
user_product_ratings_df.createOrReplaceTempView("UserProductRatings_View")

### Create a temporary view for the products data

Now we need to load and parse the data from Azure Storage and present it using a temporary view called Products_View. 

In [None]:
def DefineProductsFields(inline):
    l = inline.split(",")
    return Row(ProductId=int(l[0]),ProductName=l[1],BasePrice=float(l[2]),CategoryId=l[3],Category=l[7],Department=l[8])

products = sc.textFile("/retaildata/rawdata/ProductFile/part{*}")

products_RDD = products.map(lambda p:DefineProductsFields(p))
products_RDD.take(5)
products_DF = products_RDD.toDF()
products_DF_with_price = products_DF.select(
    products_DF.ProductId,
    products_DF.ProductName,
    products_DF.BasePrice.cast("decimal(10,2)").alias("Price"),  # plain "decimal" defaults to scale 0 and would drop the cents
    products_DF.CategoryId,
    products_DF.Category,
    products_DF.Department)
products_DF_with_price.createOrReplaceTempView("Products_View")
products_DF_with_price.show(6)

### Get the top 10 recommended products for a given user

With our product data available as a view, we now have all of the data sources we need to start making recommendations: Products_View, the UserProductRatings_View and the Users table. We define a function that will get us the top 10 recommended products for a given user (by user ID) by querying our data sources.

In [None]:
def GetRecommendedProductsForUser(UserId):
    user_product_mapping = spark.sql("SELECT * FROM UserProductRatings_View WHERE UserId =" + str(UserId))
    recommended_products = user_product_mapping.join(
        products_DF_with_price, user_product_mapping.ProductId == products_DF_with_price.ProductId
    ).select(
        products_DF_with_price.ProductName,
        products_DF_with_price.Price,
        products_DF_with_price.Category,
        products_DF_with_price.Department,
        user_product_mapping.Rating
    )
    print("Users Information:")
    users_data = spark.sql("SELECT FirstName, LastName, Gender, Age from users WHERE id =" + str(UserId))
    users_data.show(1)
    print("Recommended Products:")
    recommended_products.orderBy('Rating',ascending=False).show(10)

Let's invoke the above function for a sample user and examine what products our model suggests we recommend.

In [None]:
GetRecommendedProductsForUser(UserId = 1336)

Let's interpret these results. In the Rating column, the values for this user range from 30.29 to 55.86; the higher the rating, the more confident we are in the recommendation.

So now we need to ask ourselves, do these recommendations make sense for our example customer Frederik Nielsen? Let's look at how we might answer this question in the next section.

## Evaluate the model

Let's begin by getting a sense for what items Frederik buys or strongly considers buying (by adding them to his cart). We can run the following query.

In [None]:
%%sql
select w.Action, p.Department, p.Category, count(*)
from weblogs w 
join products p on w.ProductId = p.ProductId
where w.UserId = 1336 and w.Action in ("Purchased", "Add To Cart")
group by w.Action, p.Department, p.Category
order by w.Action desc, count(*) desc

Notice in the output that the top two department\category pairs for his purchases, by number of actions, are Clothing\Men and Clothing\Sport Shoes. Our recommender certainly recommended those above. But what about the Appliance department items it recommended?

Again looking at the summary of Frederik's activities, we see that he frequently adds items to his cart that are in the Appliance\Kitchen Appliance category. In fact, this is the #2 most common category of goods he adds to his shopping cart. So while the recommender missed out on suggesting Houseware\Bedding (his #1 category of items added to the cart), these results are certainly reasonable for him.
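
One detail matters when a per-user query like this filters on two alternative actions: in SQL, AND binds more tightly than OR, so the user filter only covers both actions if the alternatives are grouped with parentheses or `IN`. The sketch below illustrates the difference using Python's built-in `sqlite3` module as a stand-in for Spark SQL, with a made-up four-row weblogs table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE weblogs (UserId INTEGER, ProductId INTEGER, Action TEXT)")
conn.executemany(
    "INSERT INTO weblogs VALUES (?, ?, ?)",
    [
        (1336, 1, "Purchased"),
        (1336, 2, "Add To Cart"),
        (1336, 3, "Browsed"),
        (42, 4, "Add To Cart"),  # a different user
    ],
)

# Ungrouped: parsed as (UserId = 1336 AND Purchased) OR (Add To Cart for any user).
loose = conn.execute(
    "SELECT COUNT(*) FROM weblogs "
    "WHERE UserId = 1336 AND Action = 'Purchased' OR Action = 'Add To Cart'"
).fetchone()[0]

# Grouped with IN: the user filter applies to both actions.
strict = conn.execute(
    "SELECT COUNT(*) FROM weblogs "
    "WHERE UserId = 1336 AND Action IN ('Purchased', 'Add To Cart')"
).fetchone()[0]

print(loose, strict)
conn.close()
```

The ungrouped filter also counts the other user's cart addition, while the grouped filter counts only the two rows that belong to user 1336.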

## Conclusion

In this lab, you learned how to perform collaborative filtering on a fairly large dataset and in the process, helped AdventureWorks recommend products to its users based on their activity on the website.