# Lab 3 - Online Purchase Recommendations

Learn how to create a recommendation engine using the Alternating Least Squares algorithm in Spark's machine learning library

<img src='https://raw.githubusercontent.com/rosswlewis/RecommendationPoT/master/ALS.png' width="70%" height="70%"></img>

###The data

This is a transnational data set which contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retail.  The company mainly sells unique all-occasion gifts. Many customers of the company are wholesalers.

http://archive.ics.uci.edu/ml/datasets/Online+Retail

<img src='https://raw.githubusercontent.com/rosswlewis/RecommendationPoT/master/FullFile.png' width="80%" height="80%"></img>

##Create an RDD from the csv data 

In [1]:
//Download the data
import sys.process._
"rm OnlineRetail.csv.gz -f" !

In [2]:
import sys.process._
"wget https://raw.githubusercontent.com/rosswlewis/RecommendationPoT/master/OnlineRetail.csv.gz" !

--2016-04-27 19:18:03--  https://raw.githubusercontent.com/rosswlewis/RecommendationPoT/master/OnlineRetail.csv.gz
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 23.235.39.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|23.235.39.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7483128 (7.1M) [application/octet-stream]
Saving to: ‘OnlineRetail.csv.gz’

     0K .......... .......... .......... .......... ..........  0%  743K 10s
    50K .......... .......... .......... .......... ..........  1% 1.45M 7s
   100K .......... .......... .......... .......... ..........  2% 1.46M 6s
   150K .......... .......... .......... .......... ..........  2% 97.7M 5s
   200K .......... .......... .......... .......... ..........  3% 91.5M 4s
   250K .......... .......... .......... .......... ..........  4% 1.43M 4s
   300K .......... .......... .......... .......... ..........  4%  132M 3s
   350K .......... .......... .........

In [3]:
//Put the csv into an RDD (at first, each row in the RDD is a string which
//correlates to a line in the csv
val loadRetailData = sc.textFile("/resources/OnlineRetail.csv.gz")
loadRetailData.take(2).foreach(println)

InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/10 8:26,2.55,17850,United Kingdom


##Prepare and shape the data

####Remove the header from the RDD
Type:<br>
header = loadRetailData.first()<br>
loadRetailData = loadRetailData.filter(lambda line: line != header)<br>

In [4]:
val header = loadRetailData.first()
val loadRetailData1 = loadRetailData.filter(line=>line != "InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country")
loadRetailData1.count

541909

#####To produce the ALS model, we need to train it with each individual purchase.  Each record in the RDD must be the customer id, item id, and the rating.  In this case, the rating is the quantity ordered.  MLlib converts these into a sparce, unfactored matrix.

####Split the string in each row by comma
Type:<br>
loadRetailData = loadRetailData.map(lambda l: l.split(","))

In [5]:
val loadRetailDataSplit = loadRetailData1.map(line=>line.split(","))

####Only keep rows that have a purchase quantity of greater than 0, a customerID not equal to 0, and a non blank stock code after romoving non-numeric characters
Type:
import re<br>
filteredRetailData = loadRetailData.filter(lambda l: int(l[3]) > 0 and len(re.sub("\D", "", l[1])) != 0 and len(l[6]) != 0)

In [6]:
import scala.util.matching.Regex
val filteredRetailDataFilter = loadRetailDataSplit.filter(l => (((l(3).toInt) > 0) && ((l(6).length) != 0) && ((l(1) forall Character.isDigit))))

####Only keep the customerID, stock code (with non-numeric characters removed), and quanitity as integers for each row
Type:<br>
skinnyRetailData = filteredRetailData.map(lambda l: (int(l[6]), int(re.sub("\D", "", l[1])), int(l[3])))

In [7]:
val skinnyRetailData = filteredRetailDataFilter.map(l => (l(6).toInt, l(1).toInt, l(3).toInt))

#####NOTE:  The original file at UCI's Machine Learning Repository has commas in the product description.  Those have been removed to expediate the lab.

####Randomly split the data into a testing set and a training set
Type:<br>
testRDD, trainRDD = skinnyRetailData.randomSplit([.2,.8])

In [8]:
//testRDD, trainRDD = skinnyRetailData.randomSplit([.2,.8])
val splits = skinnyRetailData.randomSplit(Array(0.8,0.2))
val trainRDD = splits(0)
val testRDD = splits(1)
trainRDD.count
testRDD.count

72174

###Convert the training RDD into Ratings

#####There is no need to do this with the test data becuase the values will be used as is, not as Ratings objects
Type:<br>
from pyspark.mllib.recommendation import ALS, Rating<br>
trainData = trainRDD.map(lambda l:  Rating(l[0],l[1],l[2]))

In [9]:
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating
val trainDataRating = trainRDD.map(l => Rating(l._1,l._1,l._2))

##Build the recommendation model

###Use trainging RDD to train a model with Alternating Least Squares 
Type:<br>
rank=5#(5 columns in the user-feature and product-feature matricies)<br>
iterations=10#(10 factorization runs)<br>
model = ALS.train(trainData, rank, numIterations)

In [10]:
val rank=5 //(5 columns in the user-feature and product-feature matricies)
val iterations=10 //(10 factorization runs)
val model = ALS.train(trainDataRating, rank, iterations)

print("The model has been trained")

The model has been trained

##Test the model

We can test this model in 2 ways:

1) Use it to predict what the user will rate a certain item (in this case, it is predicting how many of that item a user will buy).  We can use the test sample and compare how many purchases the users made to how many purchases our model predicts they will make.

2) Use it to predict items the user will be interested in.  This one is more dificult to quantify, but we can use our intuition to see if the recommended items would be useful to the customer

####Evaluate the model with the test rdd by using the predictAll function.  The RDD passed to the predictAll function should contain only the customerID and the stock code (use the testing RDD)
Type:<br>
predict = model.predictAll(testRDD.map(lambda l: (l[0],l[1])))<br>

In [11]:
val predict = model.predict(testRDD.map(l=>(l._1,l._2)))
predict.take(2)

Array(Rating(14748,16236,-2603.779325067255), Rating(16940,16008,-21911.322773875094))

####Join the predict all results with the test RDD.  Map the predict all results and the test RDD to a touple of:<br><br>(the customerID and stock code), and the rating/quantity<br><br>then join them together
Type:<br>
predictions = predict.map(lambda l: ((l[0],l[1]), l[2]))<br>
ratingsAndPredictions = testRDD.map(lambda l: ((l[0], l[1]), l[2])).join(predictions)

In [12]:
val predictions = predict.map { case Rating(user,product,rate)=> ((user,product),rate)}
val ratings = testRDD.map { case (user,product,rate) => ((user,product),rate.toDouble)}
val ratingsAndPredictions = ratings.join(predictions)
ratingsAndPredictions.take(2).foreach(println)

((14462,16054),(2.0,-5625.242091465685))
((17811,16225),(2.0,3034.606979630964))


####Calculate and print the Mean Squared Error.  For all ratings/prediction rows, subtract the actual purchase from the prediction, square the result, and take the mean of all of the squared differences
Type:<br>
meanSquaredError = ratingsAndPredictions.map(lambda l: (l[1][0] - l[1][1])**2).mean()<br>
print 'Mean squared error = %.4f' % meanSquaredError

In [53]:
//val meanSquaredError = ???

: 

#####This doesn't give us that good of a representation of ranking becuase the ranks are number of purchases.  Something better may be to look at some actual recommendations.
Evaluate the model with the recommendProducts function. Pass the recommendProducts funtion a customerID and the number of recommendations you would like to see

In [40]:
val recs = model.recommendProducts(15544,5)
//print the results (The results should be Ratings objects, 
//with the given customerID, products, and ratings)
//for rec in recs:
//    print rec
recs.foreach(println)

Rating(15544,13256,59847.58467112103)
Rating(15544,15118,59601.30146039754)
Rating(15544,15649,59582.137067303935)
Rating(15544,15488,58338.78603892082)
Rating(15544,14090,54572.279425788365)


Let's look up this user and the recommended product ID's in the excel file...

<img src='https://raw.githubusercontent.com/rosswlewis/RecommendationPoT/master/user.png' width="80%" height="80%"></img>

This user seems to have purchased a lot of childrens gifts and some holiday items.  The recomendation engine we created suggested some items along these lines

#####product=84568:<br>GIRLS ALPHABET IRON ON PATCHES 

#####product=16033:<br>MINI HIGHLIGHTER PENS

#####product=22266:<br>EASTER DECORATION HANGING BUNNY

#####product=84598:<br>BOYS ALPHABET IRON ON PATCHES

#####product=72803:<br>ROSE SCENT CANDLE JEWELLED DRAWER

The ALS algorithm uses some randomness, so the recommendations yours produces may be different than these.

###Looking those up in excel is a PAIN!
You already know about Spark SQL and dataframes.  Convert this file to a dataframe, register it as a table, and run queries on the recommendations produced and the customer the recommendations are for!

#####Data Citation
Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management, Vol. 19, No. 3, pp. 197â€“208, 2012 (Published online before print: 27 August 2012. doi: 10.1057/dbm.2012.17).