# Data Science
[Data Science](https://en.wikipedia.org/wiki/Data_science) is an **interdisciplinary field** about processes and systems to **extract knowledge or insights from data** in various forms, either structured or unstructured.

The term should read **data craftsmanship** because it is more about expertise and experience in applying knowledge and algorithms from various fields and sciences.

Let's not call it an art :-)

## Recommendations
We want to build a **recommendation engine** based on the sales data we collected. A recommendation engine lets you recommend new products to users that they are most likely to be interested in. The recommender can work in two ways:

1. **Product perspective**: You are buying or liking or looking at this product, but we can also recommend these other products. 
1. **User perspective**: Other users like you prefer these products too.

But do mind '[how you target](http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html)' ;-)
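
To make the two perspectives concrete, here is a minimal pure-Python sketch on a toy data set. The customers, products, and the helper names `also_bought`, `jaccard`, and `liked_by_similar_users` are all made up for illustration; this is simple co-occurrence counting and set similarity, not the algorithm used later in this chapter.

```python
from collections import Counter

# Hypothetical toy data: customer -> set of purchased products
purchases = {
    "ann": {"tea", "mug", "kettle"},
    "bob": {"tea", "mug"},
    "cat": {"tea", "kettle", "honey"},
    "dan": {"coffee", "mug"},
}

def also_bought(product, purchases, n=3):
    """Product perspective: products most often bought together with `product`."""
    counts = Counter()
    for basket in purchases.values():
        if product in basket:
            counts.update(basket - {product})
    # Sort by descending count, then by name so ties are deterministic
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [p for p, _ in ranked[:n]]

def jaccard(a, b):
    """Set overlap between two baskets, between 0.0 and 1.0."""
    return float(len(a & b)) / len(a | b)

def liked_by_similar_users(user, purchases, n=3):
    """User perspective: products that similar customers own and `user` does not."""
    own = purchases[user]
    scores = Counter()
    for other, basket in purchases.items():
        if other == user:
            continue
        similarity = jaccard(own, basket)
        for p in basket - own:
            scores[p] += similarity
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [p for p, s in ranked[:n] if s > 0]

product_view = also_bought("tea", purchases)
user_view = liked_by_similar_users("bob", purchases)
```

`product_view` answers "what goes with tea?", while `user_view` answers "what would bob like, given customers who resemble him?".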

## Spark MLlib
Spark's [Machine Learning Library (MLlib)](http://spark.apache.org/docs/latest/mllib-guide.html#sparkml-high-level-apis-for-ml-pipelines) has the necessary tools to build recommenders.

We will use the *number of units sold per customer* as an **expression of preference for a given product**.

If we skim through MLlib's available algorithms, [collaborative filtering](http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html) using [alternating least squares (ALS)](http://dl.acm.org/citation.cfm?id=1608614) seems very appropriate, because you can also train the algorithm using only **implicit feedback** (*number of units sold*).
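
To give a feel for what "alternating least squares" means, the following is a deliberately tiny sketch: a rank-1 factorization in pure Python, where every user and every product gets a single number and a rating is approximated by their product. The function names `als_rank1` and `mse` and the toy ratings are assumptions made for this illustration; MLlib's implementation is distributed, works with higher ranks, and differs in many details.

```python
def als_rank1(ratings, iterations=20, lam=0.1):
    """Rank-1 ALS sketch. ratings: dict {(user, item): value}.

    Returns (x, y): one factor per user and one factor per item.
    """
    users = sorted({u for u, _ in ratings})
    items = sorted({i for _, i in ratings})
    x = {u: 1.0 for u in users}
    y = {i: 1.0 for i in items}
    for _ in range(iterations):
        # Hold the item factors fixed: each user factor then has a
        # closed-form regularized least-squares solution.
        for u in users:
            num = sum(r * y[i] for (uu, i), r in ratings.items() if uu == u)
            den = sum(y[i] ** 2 for (uu, i), _ in ratings.items() if uu == u) + lam
            x[u] = num / den
        # Hold the user factors fixed and solve for the item factors.
        for i in items:
            num = sum(r * x[u] for (u, ii), r in ratings.items() if ii == i)
            den = sum(x[u] ** 2 for (u, ii), _ in ratings.items() if ii == i) + lam
            y[i] = num / den
    return x, y

def mse(ratings, x, y):
    """Mean squared error of the rank-1 reconstruction on the given ratings."""
    return sum((r - x[u] * y[i]) ** 2 for (u, i), r in ratings.items()) / len(ratings)

# Hypothetical toy ratings following rating = user_scale * item_scale,
# with the pair (3, "b") (true value 6) left out on purpose.
ratings_toy = {(1, "a"): 1.0, (1, "b"): 2.0, (2, "a"): 2.0, (2, "b"): 4.0, (3, "a"): 3.0}

baseline_error = mse(ratings_toy, {1: 1.0, 2: 1.0, 3: 1.0}, {"a": 1.0, "b": 1.0})
x, y = als_rank1(ratings_toy, iterations=30, lam=0.01)
fitted_error = mse(ratings_toy, x, y)
held_out_prediction = x[3] * y["b"]
```

The point of the sketch is the alternation: each half-step is an ordinary least-squares problem, and the learned factors can then score a pair that was never observed, here `(3, "b")`.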

So let's pick up where we left off in the previous chapter: read the **sales dataframe**.

In [3]:
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

sqlContext = SQLContext(sc)

# Line continuations are implicit inside brackets, so no backslashes are needed
sales_schema = StructType([
    StructField("key", StringType(), True),
    StructField("customer_age", IntegerType(), True),
    StructField("customer_gender", StringType(), True),
    StructField("customer_key", IntegerType(), True),
    StructField("customer_marital_status", StringType(), True),
    StructField("customer_name", StringType(), True),
    StructField("customer_state", StringType(), True),
    StructField("date", StringType(), True),
    StructField("employee_gender", StringType(), True),
    StructField("employee_job_title", StringType(), True),
    StructField("employee_key", IntegerType(), True),
    StructField("employee_name", StringType(), True),
    StructField("employee_state", StringType(), True),
    StructField("price", FloatType(), True),
    StructField("product_category", StringType(), True),
    StructField("product_department", StringType(), True),
    StructField("product_description", StringType(), True),
    StructField("product_key", IntegerType(), True),
    StructField("product_price", FloatType(), True),
    StructField("product_version", IntegerType(), True),
    StructField("quantity", FloatType(), True),
    StructField("store_key", IntegerType(), True),
    StructField("store_name", StringType(), True),
    StructField("store_state", StringType(), True),
    StructField("tender_type", StringType(), True),
    StructField("transaction", StringType(), True),
    StructField("transaction_type", StringType(), True),
    StructField("rainfall", FloatType(), True),
    StructField("temp_avg", FloatType(), True),
    StructField("temp_max", FloatType(), True),
    StructField("temp_min", FloatType(), True)
])

df = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .load('/data/views/daily_weather_sales_per_state', schema = sales_schema)
df.take(1)

[Row(key=u'20060416-LA', customer_age=34, customer_gender=u'Male', customer_key=8255, customer_marital_status=u'Married', customer_name=u'Kevin W. Perkins', customer_state=u'FL', date=u'2006-04-16T00:00:00.000000Z', employee_gender=u'Female', employee_job_title=u'Greeter', employee_key=7768, employee_name=u'Carla King', employee_state=u'TX', price=361.0, product_category=u'Misc', product_department=u'Gifts', product_description=u'Brand #2402 greeting cards', product_key=790, product_price=137.0, product_version=1, quantity=10.0, store_key=115, store_name=u'Store115', store_state=u'LA', tender_type=u'Check', transaction=u'3281511', transaction_type=u'purchase', rainfall=1.0978261232376099, temp_avg=185.42857360839844, temp_max=219.0625, temp_min=95.76841735839844)]

### Turn our sales data into customer preferences
Now build the ratings-like customer **preference dataframe** by *aggregating the number of units sold per customer and product*.

In [10]:
preferences = df \
    .groupBy('customer_key', 'product_key') \
    .agg(sum('quantity').alias('rating'))

preferences.take(1)

[Row(customer_key=33843, product_key=8786, rating=5.0)]

In [11]:
preferences.where('customer_key = 33843 and product_key = 6156').take(1)

[Row(customer_key=33843, product_key=6156, rating=1.0)]
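
For readers less familiar with `groupBy(...).agg(sum(...))`, the aggregation can be pictured in plain Python as below. The sales lines and the helper name `to_preferences` are hypothetical and only mirror the three columns used here; Spark performs the same grouping in a distributed fashion.

```python
from collections import defaultdict

# Hypothetical sales lines: (customer_key, product_key, quantity)
sales = [
    (1, 100, 2.0),
    (1, 100, 3.0),
    (1, 200, 1.0),
    (2, 100, 4.0),
]

def to_preferences(sales):
    """Sum the quantity per (customer_key, product_key) pair."""
    totals = defaultdict(float)
    for customer_key, product_key, quantity in sales:
        totals[(customer_key, product_key)] += quantity
    return dict(totals)

preferences_toy = to_preferences(sales)
```

Customer 1 bought product 100 on two occasions, so the two lines collapse into a single "rating" of 5.0.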

In [24]:
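# ALS expects an RDD of Rating(user, product, rating) objects.
# Going through .rdd also works on Spark 2+, where DataFrame.map no longer exists.
ratings = preferences.rdd.map(lambda row: Rating(int(row[0]), int(row[1]), float(row[2])))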
ratings.take(1)

[Rating(user=33843, product=6156, rating=1.0)]

### Train the algorithm
Next we train a **model using the ALS algorithm**, first treating the summed quantities as explicit ratings.

In [23]:
# Split the ratings dataset into a training (75%) and test (25%) dataset 
splits = ratings.randomSplit([0.75, 0.25], 11)

training = splits[0].cache()
test = splits[1]
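
A note on what a weighted random split does: each row is assigned to a split by an independent random draw, so the 75/25 proportions are only approximate, and the seed makes the assignment repeatable. The sketch below illustrates that idea in plain Python; `random_split` is a hypothetical helper, not Spark's implementation, and it will not reproduce the same assignment as `randomSplit` for a given seed.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to one split by an independent draw against cumulative weights."""
    total = float(sum(weights))
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)  # cumulative upper bound of each split
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for k, upper in enumerate(bounds):
            if draw < upper:
                splits[k].append(row)
                break
        else:
            # Guard against floating-point rounding in the last bound
            splits[-1].append(row)
    return splits

rows = list(range(1000))
train_toy, test_toy = random_split(rows, [0.75, 0.25], seed=11)
train_again, test_again = random_split(rows, [0.75, 0.25], seed=11)
```

Every row lands in exactly one split, and re-running with the same seed gives the same assignment.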

In [25]:
# Build the recommendation model using ALS
model = ALS.train(training, rank=10, iterations=10)

In [28]:
# Evaluate the model on the held-out test data
testdata = test.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = test.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print("Mean Squared Error = " + str(MSE))

Mean Squared Error = 128.438936634
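
The evaluation cell above joins actual and predicted ratings on the `(user, product)` key and averages the squared differences. The same computation in plain Python looks roughly like this; the function name `mean_squared_error` and the sample values are made up for illustration. Pairs without a prediction are skipped, just as the join drops them.

```python
import math

def mean_squared_error(actual, predicted):
    """MSE over the (user, product) keys present in both dicts, like an inner join."""
    common = set(actual) & set(predicted)
    if not common:
        raise ValueError("no overlapping (user, product) pairs")
    return sum((actual[k] - predicted[k]) ** 2 for k in common) / float(len(common))

# Hypothetical values: {(user, product): rating}
actual = {(1, 100): 5.0, (1, 200): 1.0, (2, 100): 4.0}
predicted = {(1, 100): 4.0, (1, 200): 3.0}  # no prediction for (2, 100)

mse_toy = mean_squared_error(actual, predicted)
rmse_toy = math.sqrt(mse_toy)
```

Taking the square root brings the error back to the unit of the ratings. For the run shown above, an MSE of 128.4 corresponds to an RMSE of roughly 11.3 units.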


In [29]:
# Save and load model
model.save(sc, "/data/models/als_1")
sameModel = MatrixFactorizationModel.load(sc, "/data/models/als_1")

### Train the algorithm for implicit feedback
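Next we will train a model using ALS but for *implicit feedback*. In this mode, ALS treats each rating as a measure of confidence in an observed preference rather than as an explicit score to reproduce.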

In [32]:
# Build the recommendation model using ALS for implicit feedback
model = ALS.trainImplicit(training, rank=10, iterations=10, lambda_=0.01, alpha=0.01)
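
As a rough illustration of the role of `alpha`: in the implicit-feedback formulation that the MLlib documentation refers to (Hu, Koren and Volinsky), each observed value is split into a binary preference and a confidence weight that grows with the observed amount. The helper `implicit_targets` below is a hypothetical sketch of that idea, assuming the commonly cited form `confidence = 1 + alpha * rating`; it is not MLlib code.

```python
def implicit_targets(rating, alpha):
    """Return (preference, confidence) for one observed rating.

    preference: 1.0 if anything was observed, else 0.0
    confidence: how strongly that preference should be trusted
    """
    preference = 1.0 if rating > 0 else 0.0
    confidence = 1.0 + alpha * rating
    return preference, confidence

# A customer who bought 10 units, with alpha=0.01 as in the cell above
pref_bought, conf_bought = implicit_targets(10.0, alpha=0.01)
# A product the customer never bought
pref_missing, conf_missing = implicit_targets(0.0, alpha=0.01)
```

Under this assumption the model is fitted towards preferences of 0 or 1 rather than towards quantities, so an error computed against raw quantities is not directly comparable with the explicit model's error.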

In [33]:
# Evaluate the model on the held-out test data
testdata = test.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = test.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print("Mean Squared Error = " + str(MSE))

Mean Squared Error = 38.4803284453
