## Case Study

**Scenario**: Instacart has provided user purchase data and wants to create user behavior profiles to target its marketing.

**Example**: Identify groups such as "the power user" or "organic Andy", and be able to describe each group's behavior and how it differs from the other groups.

**Deliverable**: 
- A paragraph with a nice looking table describing the different user groups in the dataset.
- The table will be produced through the clustering algorithms you learned today.
- You will cluster on `user_summaries`

**Data**: Download the zip file from [this link](https://drive.google.com/file/d/1_nl_wVnYYNlYe1cWlVU3QAWAWyPxS4S8/view?usp=sharing), unzip the folder, and save all the contents in a folder named "data".

**Tasks**:
1. Create the following aggregate variables to add to your user summaries:

- average basket size
- percent of orders that contain organic items
- average percent of basket that is organic

2. Create at least 3 other new aggregate variables (different from those chosen by other people in the group) from `full_order_data` to add to `user_summaries`

3. Run both hierarchical and k-means clustering to find the optimal number of groups

4. Write a paragraph to describe those cluster attributes

5. Make sure the table looks "client ready"
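
One way to approach task 3 is to fit both algorithms over a range of candidate group counts and compare a cluster-quality score. The sketch below is only an illustration under stated assumptions: it uses scikit-learn (not named in this notebook), synthetic data in place of `user_summaries`, and the silhouette score as the selection criterion. The helper `best_k` is a hypothetical name, and other criteria (an elbow plot, a dendrogram) may suit the real data better.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for user_summaries: three well-separated groups (illustrative only)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(50, 2)) for c in centers])

# Scale the features so that no single variable dominates the distance computation
X_scaled = StandardScaler().fit_transform(X)


def best_k(make_model, X, k_values):
    """Return the k with the highest silhouette score, plus all the scores."""
    scores = {}
    for k in k_values:
        labels = make_model(k).fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores


k_values = range(2, 7)

# k-means: fixed random_state so repeated runs give the same labels
kmeans_k, kmeans_scores = best_k(
    lambda k: KMeans(n_clusters=k, n_init=10, random_state=0), X_scaled, k_values
)

# Hierarchical (agglomerative) clustering with Ward linkage
hier_k, hier_scores = best_k(
    lambda k: AgglomerativeClustering(n_clusters=k, linkage="ward"), X_scaled, k_values
)

print("k-means best k:", kmeans_k)
print("hierarchical best k:", hier_k)
```

If the two methods disagree on the real data, that disagreement is worth mentioning in the write-up rather than hiding.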

In [None]:
!ls ./data

In [None]:
# Import necessary data manipulation packages

import pandas as pd
pd.options.display.max_columns = 999

import numpy as np
import zipfile

import matplotlib.pyplot as plt
%matplotlib inline

### Orders

In [None]:
zf = zipfile.ZipFile('./data/orders.csv.zip')
orders = pd.read_csv(zf.open(zf.namelist()[0]))

In [None]:
# Create aggregates of order count and days between orders

max_orders = orders.groupby('user_id')['order_number'].max()
avg_days = orders.groupby('user_id')['days_since_prior_order'].mean()
max_days = orders.groupby('user_id')['days_since_prior_order'].max()
min_days = orders.groupby('user_id')['days_since_prior_order'].min()

In [None]:
# create new user_summary dataset
user_summaries = pd.DataFrame([avg_days, max_days, min_days, max_orders]).T

names = ['av_days_btw_orders', 'max_days_btw_orders', 'min_days_btw_orders', 'total_orders']
user_summaries.columns = names
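
As an alternative sketch, the same four aggregates could be computed in a single pass with pandas named aggregation. The toy table `orders_demo` and its values are made up for illustration; only the column names follow the cells above. One possible advantage is that each output column keeps its own dtype, so the order count stays an integer.

```python
import pandas as pd

# Toy stand-in for the orders table (values are illustrative, not from the real data)
orders_demo = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "order_number": [1, 2, 3, 1, 2],
    # A user's first order has no prior order, so the gap is missing
    "days_since_prior_order": [float("nan"), 7.0, 14.0, float("nan"), 30.0],
})

# One groupby, four named output columns; NaN gaps are skipped by mean/max/min
summary_demo = orders_demo.groupby("user_id").agg(
    av_days_btw_orders=("days_since_prior_order", "mean"),
    max_days_btw_orders=("days_since_prior_order", "max"),
    min_days_btw_orders=("days_since_prior_order", "min"),
    total_orders=("order_number", "max"),
)

print(summary_demo)
```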

### Products

In [None]:
zf = zipfile.ZipFile('./data/products.csv.zip')
products = pd.read_csv(zf.open(zf.namelist()[0]))

In [None]:
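# create organic tag for each product (na=False treats missing names as non-organic)
products['organic'] = np.where(products.product_name.str.contains("Organic", na=False),
                               1, 0)
# remove "Organic" from product names and tidy the leftover whitespace
products.product_name = (products.product_name
                         .str.replace("Organic", "", regex=False)
                         .str.replace(r"\s+", " ", regex=True)
                         .str.strip())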

### Department

In [None]:
zf = zipfile.ZipFile('./data/departments.csv.zip')
department = pd.read_csv(zf.open(zf.namelist()[0]))

### Order Products


In [None]:
zf = zipfile.ZipFile('./data/order_products__train.csv.zip')
order_prods = pd.read_csv(zf.open(zf.namelist()[0]))

## Create unified table

In [None]:
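# order_id -> user_id lookup (assumption: derived from orders, as it is not defined above)
order_user_key = orders[['order_id', 'user_id']]

# merge user_id on to order table
full_table = order_prods.merge(order_user_key, on='order_id', how='left')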

# merge product info on to table
full_table_copy = full_table.merge(products, on="product_id", how='left')

# merge department info on to table
full_order_data = full_table_copy.merge(department, on="department_id", how='left')
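
With `full_order_data` in place, the three aggregates from task 1 could be built in two steps: summarise each order first, then average those order-level values per user. This is a minimal sketch on a toy table, not the required solution. `demo`, `per_order`, `per_user` and the output column names are hypothetical, and it assumes one row per product per order with `user_id`, `order_id` and the 0/1 `organic` flag.

```python
import pandas as pd

# Toy stand-in for full_order_data: one row per product in an order (values illustrative)
demo = pd.DataFrame({
    "user_id":  [1, 1, 1, 1, 1, 2, 2],
    "order_id": [10, 10, 11, 11, 11, 20, 20],
    "organic":  [1, 0, 0, 0, 0, 1, 1],
})

# Step 1: summarise each order (number of items, share of items that are organic)
per_order = demo.groupby(["user_id", "order_id"]).agg(
    basket_size=("organic", "size"),
    pct_basket_organic=("organic", "mean"),
)
# Flag orders that contain at least one organic item
per_order["has_organic"] = (per_order["pct_basket_organic"] > 0).astype(int)

# Step 2: average the order-level values for each user
per_user = per_order.groupby("user_id").agg(
    avg_basket_size=("basket_size", "mean"),
    pct_orders_with_organic=("has_organic", "mean"),
    avg_pct_basket_organic=("pct_basket_organic", "mean"),
)

print(per_user)
```

On the real data, the result could then be attached with something like `user_summaries.join(per_user)`, since both tables are indexed by `user_id`.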