Prompt
===================

Earlier this year, Instacart publicly released an anonymized dataset of over 3 million Instacart orders, with information on what was purchased, by whom (user IDs), and at what time of day and day of the week. Many retailers, including grocery stores, sell data to data brokers so that consumers can be profiled more precisely. Often these purchases are linked to credit and debit cards that reveal identity. For more information, here are links [on data brokers](https://www.propublica.org/getinvolved/item/discussion-how-do-data-brokers-impact-you) and [on data purchasing firms](http://www.npr.org/sections/alltechconsidered/2016/07/11/485571291/firms-are-buying-sharing-your-online-info-what-can-you-do-about-it).

**Using Instacart’s data, try to profile consumers and specific groups within the dataset. For a couple of options, you could try to find a particular demographic, such as millennials, or a group a health insurance company might deem as high-risk.**

You're free to use whatever tools you're comfortable with; however, `pandas` is recommended, as the library is designed for fast analysis and manipulation of large volumes of data.

The data dictionary can be found [here](https://gist.github.com/jeremystan/c3b39d947d9b88b3ccff3147dbcf6c6b).


### More Details
You'll notice that there is a zipped file, `instacart_data.zip`, and a regular folder with a similar name, `instacart_data_small`. The two CSVs within the zipped file are large, and code to open and read them into pandas DataFrames is included below. The CSVs within the other folder can be opened as usual with the `pd.read_csv` function.

Here is a brief description of some elements of the more confusing CSVs:

`orders_subset` - Each row represents one order or trip to the grocery store. `order_number` is the position of the order in that user's history (for example, 1 for the user's first order, 15 for the fifteenth). `order_dow` refers to the day of the week the order was placed, with Saturday as 0 and Sunday as 1. This CSV contains a subset of the original data.

`order_products__prior_subset` - Each row represents one product in an order. `add_to_cart_order` refers to the order in which the product was added to the cart. 
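As a quick illustration of working with `order_dow`, here is a minimal sketch that counts orders per day of the week. The toy `orders` table is made up for illustration, and the labels for days 2 through 6 are an assumption: the description above only states that 0 is Saturday and 1 is Sunday, so continuing in calendar order is a guess you should verify against the data.

```python
import pandas as pd

# ASSUMPTION: only 0 = Saturday and 1 = Sunday come from the description above;
# the remaining labels simply continue in calendar order.
DOW_LABELS = {
    0: "Saturday", 1: "Sunday", 2: "Monday", 3: "Tuesday",
    4: "Wednesday", 5: "Thursday", 6: "Friday",
}

# Toy stand-in for orders_subset (values are made up for illustration).
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "user_id": [10, 10, 20, 30],
    "order_number": [1, 2, 1, 1],
    "order_dow": [0, 1, 0, 5],
})

# Translate the numeric day codes into names, then count orders per day.
orders_per_day = orders["order_dow"].map(DOW_LABELS).value_counts()
print(orders_per_day)
```

The same pattern should work on the real `orders` DataFrame once it is loaded.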


### Caveat
When analyzing the data, especially with a large dataset, be sure to *filter* the individual dataframes as much as possible first, and then join as needed. In addition, when checking what each CSV looks like, it's recommended to use the `head` method (e.g. `df.head(5)`) rather than displaying the entire dataframe, which might crash your notebook. Note that to run terminal commands such as *head* in a Jupyter notebook, you prefix the command with an exclamation point (e.g. `!head file.csv`).
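The filter-then-join advice can be sketched as follows. This is a minimal example on made-up toy tables, not the real files; the filter condition (orders placed on day 0) is arbitrary and only there to show the order of operations.

```python
import pandas as pd

# Toy stand-ins for the real tables (values are made up for illustration).
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "user_id": [10, 10, 20, 30],
    "order_dow": [0, 3, 1, 0],
})
orders_products = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 4],
    "product_id": [100, 101, 100, 102, 101],
})

# Step 1: filter the orders table and keep only the columns needed later.
day0_orders = orders.loc[orders["order_dow"] == 0, ["order_id", "user_id"]]

# Step 2: shrink the larger table to the matching orders before joining.
day0_items = orders_products[
    orders_products["order_id"].isin(day0_orders["order_id"])
]

# Step 3: join only the two filtered pieces.
day0_purchases = day0_items.merge(day0_orders, on="order_id", how="inner")
print(day0_purchases.head(5))
```

Joining the two full tables first and filtering afterwards would give the same rows, but it builds a much larger intermediate DataFrame along the way.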

### To help get you going...

In [None]:
import pandas as pd 
import numpy as np
import zipfile

# Import other libraries you would like to use

In [None]:
!head instacart_data_small/products.csv

In [None]:
# Run this cell to open and read the large data files within the zip file

zf = zipfile.ZipFile('instacart_data.zip') 

# Both orders and orders_products are pandas DataFrames
orders = pd.read_csv(zf.open('instacart_data/orders_subset.csv'))
orders_products = pd.read_csv(zf.open('instacart_data/order_products__prior_subset.csv'))

In [None]:
# Example: 
# Filtering for customers who've ordered more than 70 times 
frequent_customers = orders.groupby(['user_id'])[['order_number']].max()
frequent_customers = frequent_customers[frequent_customers['order_number'] > 70]
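One possible next step is to look at what a group such as the frequent customers actually buys. The sketch below uses made-up toy tables and a much lower threshold (more than 1 order) so that it runs on its own; `top_products_for_users` is a hypothetical helper written for this example, not part of the provided materials.

```python
import pandas as pd

def top_products_for_users(orders, orders_products, user_ids, n=3):
    """Return counts of the n most-purchased product_ids among the given users."""
    # Keep only the order_ids that belong to the selected users.
    user_orders = orders.loc[orders["user_id"].isin(user_ids), ["order_id"]]
    # Join the (already filtered) orders to their line items.
    items = orders_products.merge(user_orders, on="order_id", how="inner")
    return items["product_id"].value_counts().head(n)

# Toy stand-ins for the real tables (values are made up for illustration).
orders = pd.DataFrame({
    "order_id": [1, 2, 3],
    "user_id": [10, 10, 20],
    "order_number": [1, 2, 1],
})
orders_products = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "product_id": [100, 101, 100, 102],
})

# Same pattern as the example above, with a toy threshold of 1 instead of 70.
frequent = orders.groupby("user_id")[["order_number"]].max()
frequent = frequent[frequent["order_number"] > 1]

top = top_products_for_users(orders, orders_products, frequent.index)
print(top)
```

On the real data, the resulting `product_id` values could then be looked up in `instacart_data_small/products.csv` to get readable product names.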