# Machine Learning Engineer Nanodegree
## Capstone Proposal
## Instacart Market Basket Analysis
James Kao<br>
June 28th, 2017

## Proposal

### Background

Instacart is a grocery ordering and delivery platform where personal shoppers do the in-store shopping and deliver to you in as little as one hour. Like many other online retail services, Instacart faces the challenge of making relevant recommendations that streamline product discovery and ordering, and it has a large history of grocery orders to draw on. Improvements in recommendation quality can have a large impact on the bottom line for these companies, yet recommender systems have received relatively little attention in academia compared with their impact in industry. I'm personally frustrated with the volume of irrelevant ads and recommendations that make search and product discovery more difficult than they should be. I also want to explore recommender systems on my own, since they are included in the content of the ML Nanodegree but are outside the scope of scikit-learn.

Instacart just publicly released a grocery shopping dataset on May 3rd, containing a sample of over 3 million grocery orders from more than 200,000 Instacart users. In the [blog post](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2) announcing the release, they noted that they currently use a mix of XGBoost, word2vec and Annoy to sort the items a user may want to "buy again" and to recommend items to users while they shop.

### Problem

The goal of this project is to predict which products will be in a user's next order. The dataset includes a test set of orders, containing some previously ordered items, against which the model's final performance is measured. Performance is scored by comparing the predicted set of items with the actual items for each test order, while keeping the model flexible enough to generalize to future orders.

### Data

[The Instacart Online Grocery Shopping Dataset 2017 (~200MB)](https://www.instacart.com/datasets/grocery-shopping-2017)
contains an anonymized, relational sample of over 3 million grocery orders over time from more than 200,000 users. Each order records the customer, the timing, and the products purchased (through `order_products__*.csv`). Each product has an associated aisle and department, which give context on which products are similar.

Each entity (customer, product, order, aisle, etc.) has an associated unique id, and most variable names are self-explanatory. The key is to use the previous order data from all customers in `order_products__*.csv`, paying special attention to 'reordered' products. Orders predicted to have no reordered items are given an explicit 'None' value. In addition, `orders.csv` tells us which set (prior, train, test) an order belongs to.
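As a rough illustration of how the files relate, the sketch below groups orders by their set and collects each user's previously ordered products. It uses tiny in-memory stand-ins rather than the real files, and the column names (`order_id`, `user_id`, `eval_set`, `product_id`, `reordered`) are assumptions made for this example.

```python
import csv
import io
from collections import defaultdict

# Tiny in-memory stand-ins for orders.csv and order_products__prior.csv.
# The column names and values are assumptions for illustration only.
ORDERS_CSV = """order_id,user_id,eval_set
1,10,prior
2,10,prior
3,10,train
4,20,prior
5,20,test
"""
ORDER_PRODUCTS_CSV = """order_id,product_id,reordered
1,100,0
1,200,0
2,100,1
4,300,0
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


orders = read_rows(ORDERS_CSV)
order_products = read_rows(ORDER_PRODUCTS_CSV)

# Group order ids by the set (prior, train, test) they belong to.
orders_by_set = defaultdict(list)
for row in orders:
    orders_by_set[row["eval_set"]].append(row["order_id"])

# Map each order to its customer, then collect each user's prior products.
user_of_order = {row["order_id"]: row["user_id"] for row in orders}
prior_ids = set(orders_by_set["prior"])
products_by_user = defaultdict(set)
for row in order_products:
    if row["order_id"] in prior_ids:
        products_by_user[user_of_order[row["order_id"]]].add(row["product_id"])
```

With the real dataset, the same joins would more likely be done with a dataframe library; the plain `csv` version is only meant to make the relationships explicit.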

### Solution Statement

Intuitively, I'm looking at recommender systems, which use patterns of activity across different users and products to produce recommendations. Collaborative filtering is one approach I'll likely use; it assumes that users who made similar decisions in the past will continue to behave similarly in the future. When a user expresses a preference by purchasing a grocery item, the purchase can be viewed as a rough representation of the user's interest in the corresponding product category.

We can match a user's spending behavior with that of other users with "similar" preferences by comparing the distances between the users' vectorized representations obtained via matrix factorization, and then predict what they might purchase next. Specifically, I'll be using the well-known SVD algorithm, as popularized by Simon Funk during the Netflix Prize, which is equivalent to probabilistic matrix factorization when baseline terms are not used.
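To make the matrix factorization idea concrete, here is a minimal sketch of Funk-style SVD trained with stochastic gradient descent. It is not the Surprise implementation: bias terms are omitted for brevity, and the function names, hyperparameters, and toy data are all hypothetical.

```python
import math
import random


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def train_funk_svd(ratings, n_users, n_items, n_factors=2,
                   lr=0.05, reg=0.02, n_epochs=200, seed=0):
    """Factorize observed (user, item, value) triples with SGD.

    Returns (P, Q), lists of latent vectors such that dot(P[u], Q[i])
    approximates the observed value for user u and item i.
    """
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - dot(P[u], Q[i])
            for f in range(n_factors):
                pu, qi = P[u][f], Q[i][f]  # keep old values for both updates
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


def rmse(P, Q, ratings):
    """Root mean squared error of the factorization on the given triples."""
    total = sum((r - dot(P[u], Q[i])) ** 2 for u, i, r in ratings)
    return math.sqrt(total / float(len(ratings)))


# Hypothetical toy data: (user index, product index, purchased flag).
ratings = [(0, 0, 1.0), (0, 1, 1.0), (0, 2, 0.0),
           (1, 0, 1.0), (1, 1, 1.0),
           (2, 2, 1.0), (2, 0, 0.0)]
untrained = train_funk_svd(ratings, 3, 3, n_epochs=0)
trained = train_funk_svd(ratings, 3, 3)
```

Users whose rows in `P` end up close together have similar purchase patterns, and `dot(P[u], Q[i])` gives a score for how likely user `u` is to want product `i`.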

### Benchmark Model

I'll start with a naive model that will assume users will repurchase a given previously purchased item with a probability of:

$$\frac{number\ of\ previous\ orders\ with\ the\ item}{total\ number\ of\ previous\ orders\ since\ the\ first\ purchase\ of\ the\ item}$$

This way, it takes into account how frequently an item is purchased relative to when the item first appeared in the user's order history. Then I'll supplement it with a basic k-nearest neighbors collaborative filtering model.
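The naive probability above could be computed along these lines. This is a sketch under two assumptions: a user's history is represented as a list of product sets ordered oldest first, and the order containing the first purchase counts toward the denominator. The function name and toy history are hypothetical.

```python
def reorder_probability(order_history, product):
    """Estimate the chance that `product` appears in the next order.

    order_history: list of sets of product ids, oldest order first.
    Returns (orders containing the product) / (orders since its first
    purchase, inclusive), or 0.0 if the product was never purchased.
    """
    for first, basket in enumerate(order_history):
        if product in basket:
            break
    else:
        return 0.0  # never purchased
    window = order_history[first:]
    hits = sum(1 for basket in window if product in basket)
    return hits / float(len(window))


# Hypothetical order history for one user.
history = [{"milk"}, {"milk", "eggs"}, {"bread"}, {"eggs"}]
milk_p = reorder_probability(history, "milk")    # 2 of 4 orders
eggs_p = reorder_probability(history, "eggs")    # 2 of 3 orders
bread_p = reorder_probability(history, "bread")  # 1 of 2 orders
```

Items whose probability exceeds a chosen threshold would then be predicted for the next order; the threshold itself is a tuning choice not fixed here.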

### Evaluation Metrics

Commonly used metrics for evaluating recommender systems are the mean squared error, root mean squared error, and the mean F1 score. Since the mean F1 score is used in the competition, I'll be using it to evaluate the model's performance.

The F1 score for a single order is given by:

$$ F_1 = 2 * \frac{ precision * recall }{ precision + recall } $$

where precision and recall are defined as:

$$ precision = \frac{ true\ positives }{ true\ positives + false\ positives } $$

$$ recall = \frac{ true\ positives }{ true\ positives + false\ negatives } $$

The mean F1 score is the average of the per-order F1 scores across all orders.
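A minimal sketch of the metric, assuming predictions and ground truth are stored as dictionaries mapping an order id to a collection of product ids, with the explicit 'None' label treated like any other item. The function names and toy data are hypothetical.

```python
def f1_score(predicted, actual):
    """F1 score between a predicted and an actual set of items."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0  # also avoids dividing by zero
    precision = true_positives / float(len(predicted))
    recall = true_positives / float(len(actual))
    return 2 * precision * recall / (precision + recall)


def mean_f1(predictions, truths):
    """Average the per-order F1 over every order in `truths`.

    Orders missing from `predictions` are scored as empty predictions.
    """
    scores = [f1_score(predictions.get(order_id, []), truth)
              for order_id, truth in truths.items()]
    return sum(scores) / float(len(scores))


# Hypothetical toy data keyed by order id.
predictions = {1: [100, 200], 2: ["None"], 3: [300]}
truths = {1: [100, 400], 2: ["None"], 3: [500]}
score = mean_f1(predictions, truths)  # per-order F1: 0.5, 1.0, 0.0
```

Because the average is taken per order, an order with many items carries the same weight as an order with a single item.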

### Project Design

- **Programming Language**: Python 2.7.13
- **Libraries**: [scikit-learn 0.18.2](http://scikit-learn.org/), [surprise 1.0.3](http://surpriselib.com/)
- **Workflow**:
  - Establishing the baselines with the naive probabilistic model and the k-NN based model for comparison.
  - Training an SVD collaborative filtering model and evaluating it against the baselines with the mean F1 score.
  
  
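The k-NN baseline in the workflow above could look roughly like the following user-based sketch, which scores items by the similarity-weighted purchase counts of the most similar users. It assumes each user is represented as a dictionary of purchase counts; the function names, parameters, and toy data are hypothetical, and the Surprise k-NN algorithms may differ in their details.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(value * b.get(key, 0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_recommend(purchases, user, k=2, n=3):
    """Return the top-n items for `user` based on the k most similar users.

    purchases: dict mapping user id -> {product id: purchase count}.
    """
    neighbors = sorted(
        ((cosine(purchases[user], vector), other)
         for other, vector in purchases.items() if other != user),
        reverse=True)[:k]
    scores = Counter()
    for similarity, other in neighbors:
        for item, count in purchases[other].items():
            scores[item] += similarity * count
    return [item for item, _ in scores.most_common(n)]


# Hypothetical purchase counts for three users.
purchases = {
    "a": {"milk": 3, "eggs": 1},
    "b": {"milk": 3, "eggs": 1, "bread": 2},
    "c": {"tofu": 4},
}
top_items = knn_recommend(purchases, "a", k=1, n=2)
```

Items the user has already bought are deliberately kept in the ranking here, since the task is to predict the next order, which can contain reorders.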


- Software interface: Library calls, REST APIs, data collection endpoints, database queries, etc.
- User interface: Capturing user inputs & application events, displaying results & visualization, etc.
- Scalability: Map-reduce, distributed processing, etc.
- Deployment: Cloud hosting, containers & instances, microservices, etc.

# Inspo
https://github.com/frieds/instacart_user_classification
- flask app with an API to predict a category for an Instacart user for their shopping frequency.
- data.tsv and html file for a d3 data visualization

https://github.com/carsontang/instacart_grocery_2017_analysis
- sqlacademy implementation

deploy/scale? (data isn't so big)