# Data Science
[Data Science](https://en.wikipedia.org/wiki/Data_science) is an **interdisciplinary field** about processes and systems to **extract knowledge or insights from data** in various forms, either structured or unstructured.

The term should read **data craftsmanship** because it is more about expertise and experience in applying knowledge and algorithms from various fields and sciences.

Let's not call it an art :-)

## Recommendations
We want to build a **recommendation engine** based on the sales data we collected. A recommendation engine lets you recommend new products to users that they are most likely to be interested in. The recommender can work in two ways:

1. **Product perspective**: You are buying or liking or looking at this product, but we can also recommend these other products. 
1. **User perspective**: Other users like you prefer these products too.

But do mind '[how you target](http://www.nytimes.com/2012/02/19/magazine/shopping-habits.html)' ;-)
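To make the two perspectives concrete, here is a minimal plain-Python sketch on a tiny, made-up purchase history. The data, the function names and the simple co-occurrence counting are all assumptions for illustration; this is not what MLlib does internally.

```python
from collections import defaultdict

# Hypothetical toy purchase history: user -> set of products bought.
purchases = {
    "ann": {"corn", "margarine", "bread"},
    "bob": {"corn", "margarine"},
    "cat": {"bread", "golf clubs"},
}


def also_bought(product, purchases):
    """Product perspective: products that share a basket with `product`."""
    counts = defaultdict(int)
    for basket in purchases.values():
        if product in basket:
            for other in basket - {product}:
                counts[other] += 1
    # Most frequently co-purchased first; ties broken alphabetically.
    return sorted(counts, key=lambda p: (-counts[p], p))


def similar_users_like(user, purchases):
    """User perspective: products bought by users who overlap with `user`."""
    own = purchases[user]
    counts = defaultdict(int)
    for other_user, basket in purchases.items():
        # A user counts as "similar" when at least one purchase is shared.
        if other_user != user and own & basket:
            for product in basket - own:
                counts[product] += 1
    return sorted(counts, key=lambda p: (-counts[p], p))


print(also_bought("corn", purchases))
print(similar_users_like("bob", purchases))
```

The first call starts from a product and looks at what else ends up in the same baskets; the second starts from a user and looks at what overlapping users bought that this user has not.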

## Spark MLlib
Spark's [Machine Learning Library (MLlib)](http://spark.apache.org/docs/latest/mllib-guide.html#sparkml-high-level-apis-for-ml-pipelines) has the necessary tools to build recommenders. 

We will use the *number of products sold per customer* as an **expression of preference for a given product**.

If we skim through MLlib's available algorithms, [collaborative filtering](http://spark.apache.org/docs/latest/mllib-collaborative-filtering.html) using [alternating least squares (ALS)](http://dl.acm.org/citation.cfm?id=1608614) seems very appropriate, because the algorithm can also be trained on **implicit feedback** (*number of products sold*).
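As a rough illustration of what "implicit feedback" means here: in the formulation the MLlib documentation refers to for implicit data, an observed quantity is not treated as a rating but is split into a binary preference and a confidence that grows with the quantity. The sketch below assumes the common form `confidence = 1 + alpha * quantity`; the function name is made up and the real implementation may differ in detail.

```python
def implicit_preference(quantity, alpha=0.1):
    """Return (preference, confidence) for one observed quantity.

    Assumed formulation: preference is 1 when anything was bought, else 0;
    confidence grows linearly with the quantity, scaled by alpha.
    """
    preference = 1.0 if quantity > 0 else 0.0
    confidence = 1.0 + alpha * quantity
    return preference, confidence


for units in (0.0, 1.0, 8.0):
    print(units, implicit_preference(units))
```

So a customer who bought 8 units and one who bought 1 unit both "prefer" the product, but the model is more confident about the first.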

So let's pick up where we left off in the previous chapter: read the **sales dataframe**.

In [1]:
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating

sqlContext = SQLContext(sc)

sales_schema = StructType([
    StructField("key", StringType(), True),
    StructField("customer_age", IntegerType(), True),
    StructField("customer_gender", StringType(), True),
    StructField("customer_key", IntegerType(), True),
    StructField("customer_marital_status", StringType(), True),
    StructField("customer_name", StringType(), True),
    StructField("customer_state", StringType(), True),
    StructField("date", StringType(), True),
    StructField("employee_gender", StringType(), True),
    StructField("employee_job_title", StringType(), True),
    StructField("employee_key", IntegerType(), True),
    StructField("employee_name", StringType(), True),
    StructField("employee_state", StringType(), True),
    StructField("price", FloatType(), True),
    StructField("product_category", StringType(), True),
    StructField("product_department", StringType(), True),
    StructField("product_description", StringType(), True),
    StructField("product_key", IntegerType(), True),
    StructField("product_price", FloatType(), True),
    StructField("product_version", IntegerType(), True),
    StructField("quantity", FloatType(), True),
    StructField("store_key", IntegerType(), True),
    StructField("store_name", StringType(), True),
    StructField("store_state", StringType(), True),
    StructField("tender_type", StringType(), True),
    StructField("transaction", StringType(), True),
    StructField("transaction_type", StringType(), True),
    StructField("rainfall", FloatType(), True),
    StructField("temp_avg", FloatType(), True),
    StructField("temp_max", FloatType(), True),
    StructField("temp_min", FloatType(), True)
])

df = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .load('/data/views/daily_weather_sales_per_state', schema = sales_schema)
df.take(1)

[Row(key=u'20061101-IN', customer_age=21, customer_gender=u'Male', customer_key=22102, customer_marital_status=u'Unknown', customer_name=u'Duncan A. Lewis', customer_state=u'TX', date=u'2006-11-01T00:00:00.000000Z', employee_gender=u'Male', employee_job_title=u'Head of PR', employee_key=1056, employee_name=u'Raja Weaver', employee_state=u'TX', price=62.0, product_category=u'Food', product_department=u'Canned Goods', product_description=u'Brand #28472 canned corn', product_key=9483, product_price=174.0, product_version=1, quantity=2.0, store_key=239, store_name=u'Store239', store_state=u'IN', tender_type=u'Credit', transaction=u'3821456', transaction_type=u'purchase', rainfall=19.02670669555664, temp_avg=160.3333282470703, temp_max=216.57142639160156, temp_min=74.26262664794922)]

In [8]:
pkey = udf(lambda key, version: key * 10 + version, IntegerType())
df = df.withColumn('product_pkey', pkey(df.product_key, df.product_version))
    
df.take(1)

[Row(key=u'20061208-NV', customer_age=26, customer_gender=u'Female', customer_key=24548, customer_marital_status=u'Divorced', customer_name=u'Betty X. Lewis', customer_state=u'IN', date=u'2006-12-08T00:00:00.000000Z', employee_gender=u'Male', employee_job_title=u'Greeter', employee_key=7780, employee_name=u'Theodore Fortin', employee_state=u'WI', price=489.0, product_category=u'Food', product_department=u'Dairy', product_description=u'Brand #59900 margarine', product_key=19950, product_price=242.0, product_version=3, quantity=8.0, store_key=214, store_name=u'Store214', store_state=u'NV', tender_type=u'Cash', transaction=u'3924716', transaction_type=u'purchase', rainfall=9.71875, temp_avg=85.9473648071289, temp_max=184.9942169189453, temp_min=32.410404205322266, product_pkey=199503)]
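The `pkey` UDF above packs the product key and version into a single integer, because ALS needs one integer id per product. A plain-Python sketch of the same arithmetic shows how it works and where it breaks; the helper names are made up for illustration.

```python
def product_pkey(key, version):
    """Same arithmetic as the UDF: shift the key one decimal digit, add the version."""
    return key * 10 + version


def split_pkey(pkey):
    """Inverse mapping, valid only while the version is a single digit (0-9)."""
    return divmod(pkey, 10)


print(product_pkey(19950, 3))   # matches the product_pkey in the row above
print(split_pkey(199503))

# Caveat: with a version of 10 or more, different products can collide.
print(product_pkey(1, 12), product_pkey(2, 2))
```

The encoding is only unique as long as `product_version` stays below 10, so it is worth checking that assumption against the data before relying on it.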

### Turn our sales data into customer preferences
Now, build the ratings-like customer **preference dataframe** by *aggregating the number of units sold per customer and product*.

In [9]:
preferences = df \
    .groupBy('customer_key', 'product_pkey') \
    .agg(sum('quantity').alias('rating'))

preferences.take(1)

[Row(customer_key=8508, product_pkey=131751, rating=1.0)]

In [10]:
preferences.where('customer_key = 8508 and product_pkey = 131751').take(1)

[Row(customer_key=8508, product_pkey=131751, rating=1.0)]
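If the `groupBy(...).agg(sum(...))` step feels abstract, the following plain-Python sketch does the same thing on a few made-up sales rows: it sums the quantity for every (customer, product) pair. The rows and the function name are hypothetical.

```python
from collections import defaultdict


def aggregate_preferences(sales):
    """Sum the quantity per (customer_key, product_pkey) pair."""
    totals = defaultdict(float)
    for customer_key, product_pkey, quantity in sales:
        totals[(customer_key, product_pkey)] += quantity
    return dict(totals)


# Hypothetical rows: (customer_key, product_pkey, quantity)
sales = [
    (1, 101, 2.0),
    (1, 101, 3.0),  # same customer and product: quantities add up
    (1, 102, 1.0),
    (2, 101, 4.0),
]

print(aggregate_preferences(sales))
```

Each resulting pair becomes one row of the preference dataframe, with the summed quantity playing the role of the rating.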

In [12]:
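# ALS requires an RDD of Rating(user, product, rating) objects.
# Going through .rdd also works on Spark versions where DataFrame has no map().
ratings = preferences.rdd.map(lambda l: Rating(int(l[0]), int(l[1]), float(l[2])))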
ratings.take(1)

[Rating(user=8508, product=131751, rating=1.0)]

### Train the algorithm
Next, we train a **model using the ALS algorithm**.

In [13]:
# Split the ratings dataset into a training (75%) and test (25%) dataset 
splits = ratings.randomSplit([0.75, 0.25], 11)

training = splits[0].cache()
test = splits[1]
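The idea behind a seeded random split can be sketched in plain Python: every element is assigned independently to one of the splits with a probability proportional to its weight, so the split sizes are approximate rather than exact. This is an illustration under that assumption, not Spark's implementation, and `random_split` is a made-up helper.

```python
import random


def random_split(items, weights, seed):
    """Assign each item independently to one split, proportional to the weights."""
    rng = random.Random(seed)
    total = float(sum(weights))
    # Cumulative upper bounds, e.g. [0.75, 1.0] for weights [0.75, 0.25].
    bounds = []
    running = 0.0
    for weight in weights:
        running += weight / total
        bounds.append(running)
    splits = [[] for _ in weights]
    for item in items:
        draw = rng.random()
        for index, bound in enumerate(bounds):
            if draw < bound:
                splits[index].append(item)
                break
        else:
            # Guard against floating-point rounding in the last bound.
            splits[-1].append(item)
    return splits


training_part, test_part = random_split(range(1000), [0.75, 0.25], seed=11)
print(len(training_part), len(test_part))
```

Using a fixed seed makes the split reproducible, which matters when you want to compare models trained with different parameters on the same data.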

In [14]:
# Build the recommendation model using ALS
model = ALS.trainImplicit(training, rank=10, iterations=20, alpha=0.1)

In [21]:
# Evaluate the model on the held-out test data
testdata = test.map(lambda p: (p[0], p[1]))
predictions = model.predictAll(testdata).map(lambda r: ((r[0], r[1]), r[2]))
ratesAndPreds = test.map(lambda r: ((r[0], r[1]), r[2])).join(predictions)
MSE = ratesAndPreds.map(lambda r: (r[1][0] - r[1][1])**2).mean()
print("Mean Squared Error = " + str(MSE))

Mean Squared Error = 38.5277430927
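The join-then-average logic of the evaluation cell can be written out in plain Python. The sketch below uses made-up numbers and a hypothetical helper; it keys both the actual and the predicted values by (user, product), keeps only the pairs present in both, and averages the squared differences.

```python
def mean_squared_error(actual, predicted):
    """MSE over the (user, product) keys present in both dictionaries."""
    common = actual.keys() & predicted.keys()
    if not common:
        raise ValueError("no overlapping (user, product) pairs to compare")
    return sum((actual[k] - predicted[k]) ** 2 for k in common) / len(common)


# Hypothetical values: raw quantities versus scores between 0 and 1.
actual = {(1, 10): 2.0, (1, 11): 4.0, (2, 10): 1.0}
predicted = {(1, 10): 0.0, (1, 11): 1.0}

print(mean_squared_error(actual, predicted))
```

Keep in mind that a model trained with `ALS.trainImplicit` predicts a preference score rather than a quantity, so comparing its output against raw unit counts will tend to give a large error. That is likely part of why the MSE above is so high, and a ranking-based metric may be a better fit for implicit feedback.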


In [18]:
predictions.sortBy(lambda r: -r[1]).take(5)

[((29640, 69551), 0.30281999165789997),
 ((36101, 97851), 0.16908753265968166),
 ((45522, 122711), 0.165085118658801),
 ((19017, 33971), 0.15391220559712881),
 ((941, 54171), 0.14609363603144015)]

In [19]:
# Save and load model
model.save(sc, "/data/models/als_1")
sameModel = MatrixFactorizationModel.load(sc, "/data/models/als_1")

### Now make some recommendations
Recommend 5 products for user `<xyz>`

In [35]:
products_schema = StructType([
    StructField("product_key", IntegerType(), True),
    StructField("product_version", IntegerType(), True),
    StructField("product_description", StringType(), True),
    StructField("product_category", StringType(), True),
    StructField("product_department", StringType(), True),
    StructField("product_price", IntegerType(), True),
])

products = sqlContext.read \
    .format('com.databricks.spark.csv') \
    .load('/data/raw/retail/products', schema = products_schema)
    
products = products.withColumn('product_pkey', pkey(products.product_key, products.product_version))
products = products.select('product_pkey', 'product_description')
products.take(1)

[Row(product_pkey=49971, product_description=u'Brand #14942 golf clubs')]

In [26]:
user_key = 29640
user_data = products.rdd.map(lambda p: (user_key, p.product_pkey))
user_data.take(1)

[(29640, 11)]

In [29]:
recommendations = sameModel.predictAll(user_data).sortBy(lambda r: -r[2])
recommendations.take(5)

[Rating(user=29640, product=38641, rating=0.5487241048313785),
 Rating(user=29640, product=112681, rating=0.43379049582455226),
 Rating(user=29640, product=195121, rating=0.3921101470760089),
 Rating(user=29640, product=56871, rating=0.3863316168798289),
 Rating(user=29640, product=185791, rating=0.36577378307444935)]

In [34]:
recommended_products = recommendations.map(lambda r: r.product)
recommended_products.take(5)

[38641, 112681, 195121, 56871, 185791]
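A list of product keys is not very readable, so a natural last step is to look up the descriptions again. The plain-Python sketch below shows the idea with toy data: pick the top N scores and attach a description to each key. All keys, scores, descriptions and function names here are hypothetical; on the real data you would join against the `products` dataframe instead.

```python
import heapq


def top_n(scores, n):
    """Return the n (product, score) pairs with the highest score, best first."""
    return heapq.nlargest(n, scores.items(), key=lambda item: item[1])


def with_descriptions(ranked, descriptions):
    """Attach a description to every (product, score) pair."""
    return [
        (product, descriptions.get(product, "unknown product"), score)
        for product, score in ranked
    ]


# Hypothetical predicted scores and product descriptions.
scores = {101: 0.55, 102: 0.43, 103: 0.12, 104: 0.39}
descriptions = {101: "toy product A", 103: "toy product C", 104: "toy product D"}

for row in with_descriptions(top_n(scores, 3), descriptions):
    print(row)
```

Selecting the top N directly avoids sorting every candidate product when only a handful of recommendations is needed.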