
_Lambda School Data Science_

# Join and Reshape datasets

Objectives
- concatenate data with pandas
- merge data with pandas
- understand tidy data formatting
- melt and pivot data with pandas

Links
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)
- [Tidy Data](https://en.wikipedia.org/wiki/Tidy_data)
  - Combine Data Sets: Standard Joins
  - Tidy Data
  - Reshaping Data
- Python Data Science Handbook
  - [Chapter 3.6](https://jakevdp.github.io/PythonDataScienceHandbook/03.06-concat-and-append.html), Combining Datasets: Concat and Append
  - [Chapter 3.7](https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html), Combining Datasets: Merge and Join
  - [Chapter 3.8](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html), Aggregation and Grouping
  - [Chapter 3.9](https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html), Pivot Tables
  
Reference
- Pandas Documentation: [Reshaping and Pivot Tables](https://pandas.pydata.org/pandas-docs/stable/reshaping.html)
- Modern Pandas, Part 5: [Tidy Data](https://tomaugspurger.github.io/modern-5-tidy.html)

In [1]:
!wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

--2019-09-21 01:47:49--  https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.10.133
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.10.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 205548478 (196M) [application/x-gzip]
Saving to: ‘instacart_online_grocery_shopping_2017_05_01.tar.gz’


2019-09-21 01:47:55 (34.5 MB/s) - ‘instacart_online_grocery_shopping_2017_05_01.tar.gz’ saved [205548478/205548478]



In [2]:
!tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

instacart_2017_05_01/
instacart_2017_05_01/._aisles.csv
instacart_2017_05_01/aisles.csv
instacart_2017_05_01/._departments.csv
instacart_2017_05_01/departments.csv
instacart_2017_05_01/._order_products__prior.csv
instacart_2017_05_01/order_products__prior.csv
instacart_2017_05_01/._order_products__train.csv
instacart_2017_05_01/order_products__train.csv
instacart_2017_05_01/._orders.csv
instacart_2017_05_01/orders.csv
instacart_2017_05_01/._products.csv
instacart_2017_05_01/products.csv


In [3]:
%cd instacart_2017_05_01

/content/instacart_2017_05_01


In [4]:
!ls -lh *.csv

-rw-r--r-- 1 502 staff 2.6K May  2  2017 aisles.csv
-rw-r--r-- 1 502 staff  270 May  2  2017 departments.csv
-rw-r--r-- 1 502 staff 551M May  2  2017 order_products__prior.csv
-rw-r--r-- 1 502 staff  24M May  2  2017 order_products__train.csv
-rw-r--r-- 1 502 staff 104M May  2  2017 orders.csv
-rw-r--r-- 1 502 staff 2.1M May  2  2017 products.csv


# Assignment

## Join Data Practice

These are the top 10 most frequently ordered products. How many times was each ordered? 

1. Banana
2. Bag of Organic Bananas
3. Organic Strawberries
4. Organic Baby Spinach 
5. Organic Hass Avocado
6. Organic Avocado
7. Large Lemon 
8. Strawberries
9. Limes 
10. Organic Whole Milk

First, write down which columns you need and which dataframes have them.

Next, merge these into a single dataframe.

Then, use pandas functions from the previous lesson to get the counts of the top 10 most frequently ordered products.

**First, I will open all CSV files and inspect the columns**

In [5]:
import pandas as pd
import numpy as np
aisles = pd.read_csv("aisles.csv")
aisles.head()

Unnamed: 0,aisle_id,aisle
0,1,prepared soups salads
1,2,specialty cheeses
2,3,energy granola bars
3,4,instant foods
4,5,marinades meat preparation


In [6]:

departments = pd.read_csv("departments.csv")
departments.head()

Unnamed: 0,department_id,department
0,1,frozen
1,2,other
2,3,bakery
3,4,produce
4,5,alcohol


In [7]:
order_products_prior = pd.read_csv("order_products__prior.csv")

order_products_prior.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


In [8]:
order_products_prior.shape

(32434489, 4)

In [9]:
order_products_train = pd.read_csv("order_products__train.csv")
order_products_train.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,1,49302,1,1
1,1,11109,2,1
2,1,10246,3,0
3,1,49683,4,0
4,1,43633,5,1


In [10]:
order_products_train.shape

(1384617, 4)

In [11]:
orders = pd.read_csv("orders.csv")
orders.head()

Unnamed: 0,order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,2539329,1,prior,1,2,8,
1,2398795,1,prior,2,3,7,15.0
2,473747,1,prior,3,3,12,21.0
3,2254736,1,prior,4,4,7,29.0
4,431534,1,prior,5,4,15,28.0


In [12]:
products = pd.read_csv("products.csv")
products.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13


**Now that all six CSV files are loaded, I will inspect their columns to see which ones hold product names, product IDs, aisle IDs, and the order records.**

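**The order_products_prior dataframe has one row per product in each order, so the number of times a product was ordered is the number of rows carrying its product ID. (The add_to_cart_order column is only the item's position in the cart, not a count.) Once we know the product IDs of these grocery items, we can count their rows.**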
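
As a minimal sketch of counting by product ID, the toy frame below stands in for the order-products table. Only the Banana ID (24852) comes from this notebook; the frame name `toy_order_products` and all row values are invented for illustration.

```python
import pandas as pd

# Toy stand-in for order_products__prior (rows invented for illustration)
toy_order_products = pd.DataFrame({
    'order_id':          [1, 1, 2, 2, 3],
    'product_id':        [24852, 1, 24852, 2, 24852],
    'add_to_cart_order': [1, 2, 1, 2, 1],
})

# Times ordered = number of rows per product_id.
# add_to_cart_order is the position in the cart, so it is not used here.
times_ordered = toy_order_products['product_id'].value_counts()
print(times_ordered)
```

Here product 24852 appears in three rows, so it was ordered three times, regardless of its `add_to_cart_order` values.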

In [13]:
products[['product_id', 'product_name']]

Unnamed: 0,product_id,product_name
0,1,Chocolate Sandwich Cookies
1,2,All-Seasons Salt
2,3,Robust Golden Unsweetened Oolong Tea
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...
4,5,Green Chile Anytime Sauce
5,6,Dry Nose Oil
6,7,Pure Coconut Water With Orange
7,8,Cut Russet Potatoes Steam N' Mash
8,9,Light Strawberry Blueberry Yogurt
9,10,Sparkling Orange Juice & Prickly Pear Beverage


**FINDING THE PRODUCT IDs OF THE ABOVE GROCERY ITEMS IN THE PRODUCTS DATAFRAME USING THE .loc INDEXER**

In [14]:
# finding the information for Banana

products.loc[products['product_name'] == 'Banana']

Unnamed: 0,product_id,product_name,aisle_id,department_id
24851,24852,Banana,24,4
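
Looking up one name at a time works, but several names can probably be matched in one step with `isin`. The sketch below assumes a small stand-in table, `toy_products`, whose three rows are copied from the outputs above; `top_names` is a shortened, hypothetical version of the top-10 list.

```python
import pandas as pd

# Toy stand-in for the products table; rows copied from the outputs above
toy_products = pd.DataFrame({
    'product_id':   [1, 2, 24852],
    'product_name': ['Chocolate Sandwich Cookies', 'All-Seasons Salt', 'Banana'],
})

# Shortened list of names to look up (the full list has ten entries)
top_names = ['Banana', 'Bag of Organic Bananas', 'Organic Strawberries']

# isin() keeps every row whose product_name appears in the list
matches = toy_products.loc[toy_products['product_name'].isin(top_names),
                           ['product_id', 'product_name']]
print(matches)
```

Names that are absent from the table simply produce no row, so it is worth checking that the result has as many rows as the list has names.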


**Concatenate order_products__prior and order_products__train**

In [25]:
# stack the prior and train tables into one dataframe, then check its shape
order_products = pd.concat([order_products_prior, order_products_train])
order_products.shape

(33819106, 4)

**The concatenation was successful: the combined dataframe has 33,819,106 rows, which is the sum of the rows in the two individual dataframes (32,434,489 + 1,384,617).**
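
The row-count check can also be written as an assertion. The sketch below uses two toy frames (`toy_prior` and `toy_train`, hypothetical names, with a few values copied from the outputs above) and adds `ignore_index=True`, which is an optional choice not used in the cell above.

```python
import pandas as pd

# Two toy frames with identical columns (a few rows copied from the outputs above)
toy_prior = pd.DataFrame({'order_id': [2, 2, 2], 'product_id': [33120, 28985, 9327]})
toy_train = pd.DataFrame({'order_id': [1, 1], 'product_id': [49302, 11109]})

# ignore_index=True renumbers rows 0..n-1 instead of repeating each frame's own index
toy_combined = pd.concat([toy_prior, toy_train], ignore_index=True)

# The combined row count should equal the sum of the parts
assert len(toy_combined) == len(toy_prior) + len(toy_train)
print(toy_combined.shape)
```

Without `ignore_index=True` the index labels 0 and 1 would each appear twice, which does not affect counting but can surprise later label-based lookups.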

In [26]:
order_products.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


**MERGING THE DESIRED COLUMNS FROM DIFFERENT DATAFRAMES INTO A SINGLE DATAFRAME CONTAINING THE COLUMNS OF INTEREST**

In [27]:
# merge products with order_products on their shared key column, product_id
merged = pd.merge(products, order_products, on='product_id')
merged.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id,order_id,add_to_cart_order,reordered
0,1,Chocolate Sandwich Cookies,61,19,1107,7,0
1,1,Chocolate Sandwich Cookies,61,19,5319,3,1
2,1,Chocolate Sandwich Cookies,61,19,7540,4,1
3,1,Chocolate Sandwich Cookies,61,19,9228,2,0
4,1,Chocolate Sandwich Cookies,61,19,9273,30,0
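
For reference, here is a hedged sketch of the same kind of merge with the join key and join type spelled out. The toy frames and their rows are invented; `validate='one_to_many'` is an optional safety check that is not part of the notebook's own cell.

```python
import pandas as pd

# Toy stand-ins: one row per product, and one row per product in an order
toy_products = pd.DataFrame({
    'product_id':   [1, 24852],
    'product_name': ['Chocolate Sandwich Cookies', 'Banana'],
})
toy_order_products = pd.DataFrame({
    'order_id':   [10, 10, 11],
    'product_id': [24852, 1, 24852],
})

# how='inner' keeps only product_ids present in both frames;
# validate='one_to_many' raises an error if product_id repeats in toy_products
toy_merged = pd.merge(toy_products, toy_order_products,
                      on='product_id', how='inner', validate='one_to_many')
print(toy_merged)
```

Each order-product row picks up the matching product name, so the result has as many rows as `toy_order_products` when every product ID is found.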


In [28]:
#summary statistics for the merged dataset

merged.describe()

Unnamed: 0,product_id,aisle_id,department_id,order_id,add_to_cart_order,reordered
count,33819110.0,33819110.0,33819110.0,33819110.0,33819110.0,33819110.0
mean,25575.51,71.21799,9.918544,1710566.0,8.367738,0.5900617
std,14097.7,38.19898,6.281655,987400.8,7.13954,0.491822
min,1.0,1.0,1.0,1.0,1.0,0.0
25%,13519.0,31.0,4.0,855413.0,3.0,0.0
50%,25256.0,83.0,9.0,1710660.0,6.0,1.0
75%,37935.0,107.0,16.0,2565587.0,11.0,1.0
max,49688.0,134.0,21.0,3421083.0,145.0,1.0


In [29]:
# counting the non-null values in each column of the merged dataset

merged.count()

# Thus, the merged dataset has 33819106 rows, one for each product in each order

product_id           33819106
product_name         33819106
aisle_id             33819106
department_id        33819106
order_id             33819106
add_to_cart_order    33819106
reordered            33819106
dtype: int64

### Using value_counts to find the order counts for different items by product name

In [40]:

merged.product_name.value_counts()

Banana                                                       491291
Bag of Organic Bananas                                       394930
Organic Strawberries                                         275577
Organic Baby Spinach                                         251705
Organic Hass Avocado                                         220877
Organic Avocado                                              184224
Large Lemon                                                  160792
Strawberries                                                 149445
Limes                                                        146660
Organic Whole Milk                                           142813
Organic Raspberries                                          142603
Organic Yellow Onion                                         117716
Organic Garlic                                               113936
Organic Zucchini                                             109412
Organic Blueberries                             

#### The number of orders for each item is given below:

*   Banana = 491291
*   Bag of Organic Bananas = 394930
*   Organic Strawberries = 275577
*   Organic Baby Spinach = 251705
*   Organic Hass Avocado = 220877
*   Organic Avocado = 184224
*   Large Lemon = 160792
*   Strawberries = 149445
*   Limes = 146660
*   Organic Whole Milk = 142813



In [31]:
# Another way to do it is to filter the dataframe with .loc on the item's product ID. The example below is for Banana.

merged.loc[merged['product_id'] == 24852].count()

product_id           491291
product_name         491291
aisle_id             491291
department_id        491291
order_id             491291
add_to_cart_order    491291
reordered            491291
dtype: int64

Thus, Banana accounts for 491,291 of the 33,819,106 order-product rows, meaning it was ordered 491,291 times.
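
When only a single number is needed, summing a boolean mask is a compact alternative to calling `count()` on every column. The sketch below uses an invented toy frame; `banana_rows` and `share` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for the merged frame; 24852 is Banana's product_id
toy_merged = pd.DataFrame({'product_id': [24852, 1, 24852, 2, 24852]})

# The mask is True on matching rows; summing it counts those rows
banana_rows = (toy_merged['product_id'] == 24852).sum()

# Fraction of all order-product rows that are Banana
share = banana_rows / len(toy_merged)
print(banana_rows, share)
```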

## Reshape Data Section

- Replicate the lesson code
- Complete the code cells we skipped near the beginning of the notebook
- Table 2 --> Tidy
- Tidy --> Table 2
- Load seaborn's `flights` dataset by running the cell below. Then create a pivot table showing the number of passengers by month and year. Use year for the index and month for the columns. You've done it right if you get 112 passengers for January 1949 and 432 passengers for December 1960.

In [0]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns

table1 = pd.DataFrame(
    [[np.nan, 2],
     [16,    11], 
     [3,      1]],
    index=['John Smith', 'Jane Doe', 'Mary Johnson'], 
    columns=['treatmenta', 'treatmentb'])

table2 = table1.T

In [35]:
table1

Unnamed: 0,treatmenta,treatmentb
John Smith,,2
Jane Doe,16.0,11
Mary Johnson,3.0,1


In [0]:
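# move the names out of the index into a regular column so melt can use it as id_vars
# (note: this overwrites the transposed table2 defined earlier)
table2 = table1.reset_index()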

In [43]:
table2

Unnamed: 0,index,treatmenta,treatmentb
0,John Smith,,2
1,Jane Doe,16.0,11
2,Mary Johnson,3.0,1


In [44]:
tidy = table2.melt(id_vars='index')
tidy


Unnamed: 0,index,variable,value
0,John Smith,treatmenta,
1,Jane Doe,treatmenta,16.0
2,Mary Johnson,treatmenta,3.0
3,John Smith,treatmentb,2.0
4,Jane Doe,treatmentb,11.0
5,Mary Johnson,treatmentb,1.0
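
The cells above melt `table1` with its index reset. For the "Table 2 --> Tidy" and "Tidy --> Table 2" items, one possible sketch starts from the transposed table instead. The column names `trt`, `name`, and `result`, and the variables `tidy2` and `table2_again`, are assumptions chosen for this example.

```python
import numpy as np
import pandas as pd

table1 = pd.DataFrame(
    [[np.nan, 2],
     [16,    11],
     [3,      1]],
    index=['John Smith', 'Jane Doe', 'Mary Johnson'],
    columns=['treatmenta', 'treatmentb'])
table2 = table1.T  # treatments as rows, people as columns

# Table 2 --> Tidy: move the treatment index into a column, then melt the person columns
tidy2 = (table2
         .reset_index()
         .melt(id_vars='index', var_name='name', value_name='result')
         .rename(columns={'index': 'trt'}))

# Tidy --> Table 2: pivot back to one row per treatment and one column per person
table2_again = tidy2.pivot(index='trt', columns='name', values='result')
print(tidy2)
print(table2_again)
```

The round trip returns the same values as `table2`, although `pivot` sorts the person columns alphabetically.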


In [45]:
flights = sns.load_dataset('flights')
flights.head()

Unnamed: 0,year,month,passengers
0,1949,January,112
1,1949,February,118
2,1949,March,132
3,1949,April,129
4,1949,May,121


### Creating a pivot table

In [49]:


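# passengers is the only column left over, so it supplies the values;
# each year/month pair has one row, so the default mean aggregation returns that value unchanged
flights_pivot = pd.pivot_table(flights, index=['year'], columns=['month'])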
flights_pivot.head()

Unnamed: 0_level_0,passengers,passengers,passengers,passengers,passengers,passengers,passengers,passengers,passengers,passengers,passengers,passengers
month,January,February,March,April,May,June,July,August,September,October,November,December
year,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2,Unnamed: 12_level_2
1949,112,118,132,129,121,135,148,148,136,119,104,118
1950,115,126,141,135,125,149,170,170,158,133,114,140
1951,145,150,178,163,172,178,199,199,184,162,146,166
1952,171,180,193,181,183,218,230,242,209,191,172,194
1953,196,196,236,235,229,243,264,272,237,211,180,201
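
Passing `values='passengers'` explicitly should give a single-level column header (months only) instead of the two-level header shown above. The sketch below avoids downloading the dataset by using a three-row, hand-built frame, `toy_flights`, containing only passenger counts quoted in this notebook.

```python
import pandas as pd

# Hand-built slice in the same long format as flights
# (112, 118 and 432 are the values quoted in this notebook)
toy_flights = pd.DataFrame({
    'year':       [1949, 1949, 1960],
    'month':      ['January', 'February', 'December'],
    'passengers': [112, 118, 432],
})

# Naming values= keeps the column header a single level (month only)
toy_pivot = toy_flights.pivot_table(index='year', columns='month', values='passengers')
print(toy_pivot)
```

With plain string months the columns come out alphabetically; in the output above they appear in calendar order, presumably because seaborn's `month` column is an ordered categorical.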


## Join Data Stretch Challenge

The [Instacart blog post](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2) has a visualization of "**Popular products** purchased earliest in the day (green) and latest in the day (red)." 

The post says,

> "We can also see the time of day that users purchase specific products.

> Healthier snacks and staples tend to be purchased earlier in the day, whereas ice cream (especially Half Baked and The Tonight Dough) are far more popular when customers are ordering in the evening.

> **In fact, of the top 25 latest ordered products, the first 24 are ice cream! The last one, of course, is a frozen pizza.**"

Your challenge is to reproduce the list of the top 25 latest ordered popular products.

We'll define "popular products" as products with more than 2,900 orders.
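
One possible approach, sketched on invented toy data: attach `order_hour_of_day` to each ordered item, keep the popular products, and rank them by mean order hour. Treating "latest" as the highest mean hour is an assumption, since the blog post may use a different measure. `MIN_ORDERS` and `TOP_N` are hypothetical names, set low here so the toy data passes the filter.

```python
import pandas as pd

# Toy stand-ins (values invented); column names match orders and the merged frame
toy_orders = pd.DataFrame({
    'order_id':          [1, 2, 3, 4],
    'order_hour_of_day': [8, 9, 21, 22],
})
toy_lines = pd.DataFrame({
    'order_id':     [1, 2, 3, 3, 4, 4, 1],
    'product_name': ['Banana', 'Banana', 'Banana', 'Ice Cream',
                     'Ice Cream', 'Frozen Pizza', 'Ice Cream'],
})

MIN_ORDERS = 2   # the assignment uses 2,900; lowered for the toy data
TOP_N = 2        # the assignment asks for 25

# Attach the hour of day to every ordered item
lines = toy_lines.merge(toy_orders, on='order_id')

# Per product: how many times it was ordered and its mean order hour
stats = lines.groupby('product_name')['order_hour_of_day'].agg(['count', 'mean'])

# Keep popular products, then rank by latest mean hour
popular = stats[stats['count'] > MIN_ORDERS]
latest = popular.sort_values('mean', ascending=False).head(TOP_N)
print(latest)
```

In this toy example Frozen Pizza has the latest mean hour but is dropped because it was ordered only once, which is the role the popularity threshold plays in the challenge.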



In [0]:
##### YOUR CODE HERE #####

## Reshape Data Stretch Challenge

_Try whatever sounds most interesting to you!_

- Replicate more of Instacart's visualization showing "Hour of Day Ordered" vs "Percent of Orders by Product"
- Replicate parts of the other visualization from [Instacart's blog post](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2), showing "Number of Purchases" vs "Percent Reorder Purchases"
- Get the most recent order for each user in Instacart's dataset. This is a useful baseline when [predicting a user's next order](https://www.kaggle.com/c/instacart-market-basket-analysis)
- Replicate parts of the blog post linked at the top of this notebook: [Modern Pandas, Part 5: Tidy Data](https://tomaugspurger.github.io/modern-5-tidy.html)
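
For the "most recent order for each user" idea, a minimal sketch is shown below. It assumes that `order_number` increases with each later order by the same user. The first three rows are copied from the `orders` output above; the rows for user 2 are invented.

```python
import pandas as pd

# Toy stand-in for orders (user 1 rows from the output above, user 2 rows invented)
toy_orders = pd.DataFrame({
    'order_id':     [2539329, 2398795, 473747, 100, 101],
    'user_id':      [1, 1, 1, 2, 2],
    'order_number': [1, 2, 3, 1, 2],
})

# idxmax() returns, for each user, the row label of the highest order_number
last_idx = toy_orders.groupby('user_id')['order_number'].idxmax()

# Select those rows to get one most-recent order per user
most_recent = toy_orders.loc[last_idx]
print(most_recent)
```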

In [0]:
##### YOUR CODE HERE #####