The goal of this exercise is to suggest to Instacart customers items that they might want to purchase based on past orders. The point is not to create an award-winning algorithm, but to place yourself in the shoes of a data science consultant. 

1. Data Exploration    
    1. What initial insights can you get from a first exploration of the dataset?  
    2. Do some variables seem to have more importance than others? What transformations might be needed?  
2. Prediction    
    1. Describe your overall approach: how did you formulate the problem to make recommendations?  
    2. What model did you choose? Why?  
    3. How did you assess performance of your model? Which metrics seemed particularly important to you?  
    4. Please present your results and model performance  
    5. How would you suggest delivering the recommendations to your client?  
3. Insights and Next Steps    
    1. What insights can you extract from your analysis?  
    2. What are some of the things you would suggest as next steps for your client?  

# Recommendation Model

I read a few materials on building recommendation models.

My main priorities for the recommendation model were:
1. Use a technology such as Spark to make stream processing faster.
2. Generate recommendations quickly and in a distributed framework.
3. Use minimal code for the recommendation model.

After browsing the collaborative filtering section of the MLlib documentation, I came across this post from Databricks. It implements the FP-Growth algorithm for building recommendations and explains why both the algorithm and the infrastructure matter when generating association rules on a distributed platform from ever-growing e-commerce data.

https://databricks.com/blog/2018/09/18/simplify-market-basket-analysis-using-fp-growth-on-databricks.html

http://www.borgelt.net/doc/fpgrowth/fpgrowth.html

# Loading libraries

In [1]:
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_set, col, count, collect_list
from pyspark.ml.fpm import FPGrowth
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

# create (or reuse) the Spark session once all imports are done
spark = SparkSession.builder.appName('MarketBasketAnalysis').getOrCreate()

%matplotlib inline

# Import Data

In [2]:
# loading all the datasets using Spark
aisles = spark.read.csv("aisles.csv", header=True, inferSchema=True)
departments = spark.read.csv("departments.csv", header=True, inferSchema=True)
order_products_prior = spark.read.csv("order_products__prior.csv", header=True, inferSchema=True)
order_products_train = spark.read.csv("order_products__train_cap.csv", header=True, inferSchema=True)
order_products_test = spark.read.csv("order_products__test_cap.csv", header=True, inferSchema=True)
orders = spark.read.csv("orders.csv", header=True, inferSchema=True)
products = spark.read.csv("products.csv", header=True, inferSchema=True)

# creating temporary tables
aisles.createOrReplaceTempView("aisles")
departments.createOrReplaceTempView("departments")
order_products_prior.createOrReplaceTempView("order_products_prior")
order_products_train.createOrReplaceTempView("order_products_train")
order_products_test.createOrReplaceTempView("order_products_test")
orders.createOrReplaceTempView("orders")
products.createOrReplaceTempView("products")

# Association Mining

Association rule learning is a rule-based machine learning method for discovering interesting relations between variables in large databases. It identifies strong rules using measures of interestingness such as support, confidence, and lift. To prepare the data for Spark, we organize it by shopping basket: each row of the DataFrame represents one order_id, and its basket column holds the array of items in that order.

In [3]:
# organize by shopping basket
data = spark.sql("select p.product_name, o.order_id from products p inner join order_products_train o on o.product_id = p.product_id")
shopping_baskets = data.groupBy('order_id').agg(collect_set('product_name').alias('basket')).orderBy("order_id", ascending=True)
shopping_baskets.createOrReplaceTempView('shopping_baskets')
shopping_baskets.show()

+--------+--------------------+
|order_id|              basket|
+--------+--------------------+
|      36|[Cage Free Extra ...|
|      38|[Bunched Cilantro...|
|     170|[Classic Movie Th...|
|     226|["Magic Tape Refi...|
|     349|[Organic Beef Hot...|
|     456|[Petite Peas, Lar...|
|     719|[Heavy Duty Scrub...|
|     762|[Organic Cucumber...|
|     844|[Organic Red Radi...|
|     878|[Organic Cilantro...|
|     915|[Heavy Duty Paper...|
|     988|[Whipped Light Cr...|
|    1001|[Organic Roma Tom...|
|    1032|[Organic Living B...|
|    1042|[Pure Irish Butte...|
|    1077|[Sparkling Water,...|
|    1086|[Organic Mint, Li...|
|    1119|[Shallot, Large L...|
|    1120|[Kettle Cooked Po...|
|    1139|[Cinnamon Rolls w...|
+--------+--------------------+
only showing top 20 rows
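
The basket layout in the output above can be mimicked in plain Python, which may help when checking the Spark result on a small sample. This is only a sketch: `group_into_baskets` and `toy_rows` are hypothetical names and made-up rows, not part of the notebook. Like `collect_set`, it keeps each product once per order.

```python
from collections import defaultdict

def group_into_baskets(rows):
    """Group (order_id, product_name) pairs into one set of products per order."""
    baskets = defaultdict(set)
    for order_id, product_name in rows:
        baskets[order_id].add(product_name)  # a set drops duplicates, like collect_set
    # sort by order_id to mirror the orderBy("order_id") step
    return dict(sorted(baskets.items()))

# made-up rows for illustration only
toy_rows = [(38, "Limes"), (36, "Banana"), (36, "Limes"), (36, "Banana")]
baskets = group_into_baskets(toy_rows)
print(baskets)
```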



# FP-Growth

FP-Growth is an algorithm for frequent itemset mining, a data mining method originally developed for market basket analysis. Frequent itemset mining looks for regularities in the shopping behavior of customers of supermarkets, mail-order companies, and online shops. FP-Growth improves on Apriori by eliminating some of its heaviest bottlenecks.

In particular, it identifies sets of products that are frequently bought together. Once identified, such sets can be used to optimize how products are arranged on supermarket shelves or on the pages of a mail-order catalog or web shop, to hint at which products could be bundled, or to suggest other products to customers.
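
To make "frequently bought together" concrete, here is a small sketch of what a frequent itemset miner returns. It is deliberately not FP-Growth: it counts every subset of every basket by brute force, which only works on tiny inputs, but the output has the same meaning (itemsets that appear in at least a `min_support` fraction of baskets). The function name and the toy baskets are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_itemsets(baskets, min_support):
    """Brute-force itemset counting (NOT FP-Growth): enumerate all subsets of each basket."""
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # unique items, in a stable order
        for size in range(1, len(items) + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    min_count = min_support * len(baskets)
    # keep only itemsets that occur in enough baskets
    return {itemset: n for itemset, n in counts.items() if n >= min_count}

# made-up baskets for illustration only
toy_baskets = [
    ["Banana", "Limes", "Large Lemon"],
    ["Banana", "Limes"],
    ["Banana", "Organic Avocado"],
    ["Limes", "Large Lemon"],
]
result = frequent_itemsets(toy_baskets, min_support=0.5)
for itemset, n in sorted(result.items(), key=lambda kv: -kv[1]):
    print(itemset, n)
```

With `min_support=0.5` and four baskets, an itemset must appear in at least two baskets to be kept, so a product bought only once is dropped.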

In [4]:
# setting the thresholds for minSupport and minConfidence
fpGrowth = FPGrowth(itemsCol="basket", minSupport=0.001, minConfidence=0)

model = fpGrowth.fit(shopping_baskets)

# calculating frequent itemsets
frequent_items = model.freqItemsets
frequent_items.createOrReplaceTempView("frequentItems")

# display frequent itemsets with more than two items
spark.sql('''select items, freq from frequentItems 
          where size(items) > 2 order by freq desc limit 20''').show(truncate=False)

+--------------------------------------------------------------------+----+
|items                                                               |freq|
+--------------------------------------------------------------------+----+
|[Organic Hass Avocado, Organic Strawberries, Bag of Organic Bananas]|509 |
|[Organic Raspberries, Organic Strawberries, Bag of Organic Bananas] |478 |
|[Organic Baby Spinach, Organic Strawberries, Bag of Organic Bananas]|437 |
|[Organic Raspberries, Organic Hass Avocado, Bag of Organic Bananas] |394 |
|[Organic Hass Avocado, Organic Baby Spinach, Bag of Organic Bananas]|379 |
|[Organic Avocado, Organic Baby Spinach, Banana]                     |367 |
|[Organic Avocado, Large Lemon, Banana]                              |355 |
|[Limes, Large Lemon, Banana]                                        |345 |
|[Organic Cucumber, Organic Strawberries, Bag of Organic Bananas]    |317 |
|[Organic Baby Spinach, Organic Strawberries, Banana]                |304 |
|[Limes, Org

# Association Rules sorted by Confidence

In [5]:
model.associationRules.orderBy("confidence", ascending=False).show(25, truncate=False)

+------------------------------------------------------------------+------------------------+-------------------+------------------+
|antecedent                                                        |consequent              |confidence         |lift              |
+------------------------------------------------------------------+------------------------+-------------------+------------------+
|[Organic Raspberries, Organic Hass Avocado, Organic Strawberries] |[Bag of Organic Bananas]|0.5765124555160143 |4.900007315383391 |
|[Organic Navel Orange, Organic Raspberries]                       |[Bag of Organic Bananas]|0.5707070707070707 |4.850665054413543 |
|[Yellow Onions, Strawberries]                                     |[Banana]                |0.5408163265306123 |3.803571428571429 |
|[Organic D'Anjou Pears, Organic Hass Avocado]                     |[Bag of Organic Bananas]|0.54               |4.589673518742443 |
|[Organic Kiwi, Organic Hass Avocado]                              |[
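
One way to read the two numeric columns above: confidence is the share of baskets containing the antecedent that also contain the consequent, and lift divides that confidence by the overall support of the consequent. The sketch below inverts that relation for the three complete rows whose consequent is Bag of Organic Bananas. The helper name is hypothetical; the numbers are copied from the table.

```python
def implied_consequent_support(confidence, lift):
    """lift = confidence / support(consequent), so support(consequent) = confidence / lift."""
    return confidence / lift

# (confidence, lift) pairs copied from the rows with consequent [Bag of Organic Bananas]
rows = [
    (0.5765124555160143, 4.900007315383391),
    (0.5707070707070707, 4.850665054413543),
    (0.54, 4.589673518742443),
]
supports = [implied_consequent_support(c, l) for c, l in rows]
print(supports)
```

All three rows imply the same consequent support, roughly 0.118, which is consistent with Bag of Organic Bananas appearing in about 12% of the training baskets. A lift near 4.9 then means the product is almost five times more likely to be in a basket that already contains the antecedent.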

# Recommendation Predictions

In [6]:
data = spark.sql("select p.product_name, o.order_id from products p inner join order_products_test o on o.product_id = p.product_id")
shopping_baskets_test = data.groupBy('order_id').agg(collect_list('product_name').alias('basket')).orderBy("order_id", ascending=True)
shopping_baskets_test.createOrReplaceTempView('shopping_baskets_test')
shopping_baskets_test.show()

+--------+--------------------+
|order_id|              basket|
+--------+--------------------+
|       1|[Bulgarian Yogurt...|
|      96|[Roasted Turkey, ...|
|      98|[Natural Spring W...|
|     112|[Fresh Cauliflowe...|
|     218|[Natural Artisan ...|
|     393|[Shredded Mexican...|
|     473|[Organic Whole Mi...|
|     631|[Organic Strawber...|
|     774|[Ice Cream Variet...|
|     904|[Cup Noodles Chic...|
|    1280|[Lactose Free Hal...|
|    1325|[Dha Omega 3 Redu...|
|    1350|[Ground Cinnamon,...|
|    1721|[Organic Reduced ...|
|    1764|[Half & Half, Ita...|
|    2011|[Limeade, Origina...|
|    2021|[Granny Smith App...|
|    2115|[Organic Mixed Ve...|
|    2318|[Honey Vanilla Gr...|
|    2530|[Total 2% Lowfat ...|
+--------+--------------------+
only showing top 20 rows



In [10]:
# transform checks the items in each basket against all the association rules
# and returns the matching consequents as the prediction

predictions = model.transform(shopping_baskets_test).toPandas()
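
The matching step described in the comment above can be sketched in plain Python. This is an illustration of the idea, not Spark's implementation: `recommend` and `toy_rules` are hypothetical names, the two rules are copied from the association rules table, and the baskets are made up.

```python
def recommend(basket, rules):
    """Collect consequents of every rule whose antecedent is fully contained in the basket."""
    items = set(basket)
    recommendations = []
    for antecedent, consequent in rules:
        if set(antecedent) <= items:  # the rule applies to this basket
            for product in consequent:
                # skip products already in the basket or already recommended
                if product not in items and product not in recommendations:
                    recommendations.append(product)
    return recommendations

toy_rules = [
    (["Yellow Onions", "Strawberries"], ["Banana"]),
    (["Organic D'Anjou Pears", "Organic Hass Avocado"], ["Bag of Organic Bananas"]),
]

print(recommend(["Yellow Onions", "Strawberries", "Soda"], toy_rules))
print(recommend(["Soda"], toy_rules))
```

A basket that matches no antecedent gets an empty list, which is why some orders can end up with no prediction at all.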

In [12]:
predictions.head(100)

|   | order_id | basket | prediction |
|---|----------|--------|------------|
| 0 | 1 | [Bulgarian Yogurt, Organic 4% Milk Fat Whole M... | [Organic Tomato Basil Pasta Sauce, Marinara Sa... |
| 1 | 96 | [Roasted Turkey, Organic Cucumber, Organic Gra... | [Limes, Organic Strawberries, Organic Avocado,... |
| 2 | 98 | [Natural Spring Water, Organic Orange Juice Wi... | [Organic Tomato Basil Pasta Sauce, Marinara Sa... |
| 3 | 112 | [Fresh Cauliflower, I Heart Baby Kale, Sea Sal... | [Organic Cilantro, Asparagus, Limes, Organic W... |
| 4 | 218 | [Natural Artisan Water, Okra, Organic Yellow P... | [Bag of Organic Bananas, Banana, Organic Straw... |
| 5 | 393 | [Shredded Mexican Blend Cheese, Clementines, F... | [Raspberries, Bag of Organic Bananas, Organic ... |
| 6 | 473 | [Organic Whole Milk with DHA Omega-3, Banana, ... | [Organic Tomato Basil Pasta Sauce, Marinara Sa... |
| 7 | 631 | [Organic Strawberries, Uncured Genoa Salami, O... | [Banana, Asparagus, Limes, Organic Whole Milk,... |
| 8 | 774 | [Ice Cream Variety Pack, Nacho Cheese Sauce, D... | [] |
| 9 | 904 | [Cup Noodles Chicken Flavor, Zero Calorie Cola] | [Soda] |

In [11]:
predictions.to_csv('predictions.csv')