_Lambda School Data Science_

# Join and Reshape datasets

Objectives
- concatenate data with pandas
- merge data with pandas
- understand tidy data formatting
- melt and pivot data with pandas

Links
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)
- [Tidy Data](https://en.wikipedia.org/wiki/Tidy_data)
  - Combine Data Sets: Standard Joins
  - Tidy Data
  - Reshaping Data
- Python Data Science Handbook
  - [Chapter 3.6](https://jakevdp.github.io/PythonDataScienceHandbook/03.06-concat-and-append.html), Combining Datasets: Concat and Append
  - [Chapter 3.7](https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html), Combining Datasets: Merge and Join
  - [Chapter 3.8](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html), Aggregation and Grouping
  - [Chapter 3.9](https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html), Pivot Tables
  
Reference
- Pandas Documentation: [Reshaping and Pivot Tables](https://pandas.pydata.org/pandas-docs/stable/reshaping.html)
- Modern Pandas, Part 5: [Tidy Data](https://tomaugspurger.github.io/modern-5-tidy.html)

In [1]:
#!wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz 

# Make sure we're in the top-level /content directory
#
# See below for notes on the cd command and why it's %cd instead of !cd
%cd /content

# Remove everything in the current working directory
#
# rm is the remove command
# -rf specifies the "recursive" and "force" options to remove all files in 
# subdirectories without prompting
#
# THIS IS A POWERFUL COMMAND! DO NOT run this command on your local computer - ever!!
#
# In this particular case, removing all of the files makes things easier if you
# need to re-run these examples, because you start with a clean directory
# every time.
!rm -rf *

# wget retrieves files from a remote location
!wget https://www.dropbox.com/s/pofcl26lvoj6073/instacart-market-basket-analysis.zip

/content
--2020-09-30 19:08:06--  https://www.dropbox.com/s/pofcl26lvoj6073/instacart-market-basket-analysis.zip
Resolving www.dropbox.com (www.dropbox.com)... 162.125.5.1, 2620:100:601f:1::a27d:901
Connecting to www.dropbox.com (www.dropbox.com)|162.125.5.1|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: /s/raw/pofcl26lvoj6073/instacart-market-basket-analysis.zip [following]
--2020-09-30 19:08:07--  https://www.dropbox.com/s/raw/pofcl26lvoj6073/instacart-market-basket-analysis.zip
Reusing existing connection to www.dropbox.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://uc8c804d00db2b155c5b6cb02ba5.dl.dropboxusercontent.com/cd/0/inline/BAZ2PHWjHlQw-bkBz_1A9r-1fKj19S-4miXRo2rEdRaT8OHPmuNSeX7ETCliR0tKeFrjAFO7wN7bnCBBblpxMb1qtZqgqEmHOmJT8fHoEi5v68eAtAi5_2ry4S7B-U9gRKI/file# [following]
--2020-09-30 19:08:07--  https://uc8c804d00db2b155c5b6cb02ba5.dl.dropboxusercontent.com/cd/0/inline/BAZ2PHWjHlQw-bkBz_1A9r-1fKj19S-4m

In [2]:
# Unzip the archive
#
# Creates a new directory called instacart-market-basket-analysis

!unzip instacart-market-basket-analysis.zip

Archive:  instacart-market-basket-analysis.zip
   creating: instacart-market-basket-analysis/
  inflating: __MACOSX/._instacart-market-basket-analysis  
  inflating: instacart-market-basket-analysis/order_products__prior.csv.zip  
  inflating: __MACOSX/instacart-market-basket-analysis/._order_products__prior.csv.zip  
  inflating: instacart-market-basket-analysis/.DS_Store  
  inflating: __MACOSX/instacart-market-basket-analysis/._.DS_Store  
  inflating: instacart-market-basket-analysis/order_products__train.csv.zip  
  inflating: __MACOSX/instacart-market-basket-analysis/._order_products__train.csv.zip  
  inflating: instacart-market-basket-analysis/aisles.csv.zip  
  inflating: __MACOSX/instacart-market-basket-analysis/._aisles.csv.zip  
  inflating: instacart-market-basket-analysis/orders.csv.zip  
  inflating: __MACOSX/instacart-market-basket-analysis/._orders.csv.zip  
  inflating: instacart-market-basket-analysis/departments.csv.zip  
  inflating: __MACOSX/instacart-market-baske

In [3]:
# Change into the newly-unzipped directory
#
# % sign is required to change to a new directory -- you can't use !cd like
# other commands
#
# Optional technical details:
#
# % makes the command apply to the **entire notebook environment**, which is
# what you need to do to change the working directory
#
# The ! sign **opens a new shell process** behind the scenes to execute the
# command -- this works fine for regular commands like unzip and ls
#
# Therefore, !cd would apply only to that new shell and wouldn't change the
# global notebook environment
#
# If this makes your head hurt, don't worry too much about it. We'll talk
# more about the shell and operating systems stuff later in the program.

%cd instacart-market-basket-analysis

/content/instacart-market-basket-analysis
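
The difference between `%cd` and `!cd` described in the comments above can be reproduced in plain Python. This is only a rough analogy, not how the notebook implements its magics: `subprocess.run` stands in for `!` (a throwaway child shell) and `os.chdir` stands in for `%cd` (a change to the notebook's own process). The temporary directory is a hypothetical stand-in for the unzipped folder.

```python
import os
import subprocess
import tempfile

start = os.getcwd()

with tempfile.TemporaryDirectory() as tmp:
    tmp_real = os.path.realpath(tmp)

    # Like !cd: a child shell changes directory and then exits,
    # so the current process is unaffected.
    subprocess.run(f'cd "{tmp}"', shell=True, check=True)
    after_shell_cd = os.getcwd()

    # Like %cd: the current process itself changes directory.
    os.chdir(tmp)
    after_chdir = os.path.realpath(os.getcwd())

    # Move back before the temporary directory is cleaned up.
    os.chdir(start)

print(after_shell_cd == start)   # the child shell's cd did not stick
print(after_chdir == tmp_real)   # os.chdir did
```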


In [4]:
# Unzip all .csv.zip files in the directory
!unzip "*.zip"

Archive:  products.csv.zip
  inflating: products.csv            
   creating: __MACOSX/
  inflating: __MACOSX/._products.csv  

Archive:  departments.csv.zip
  inflating: departments.csv         
  inflating: __MACOSX/._departments.csv  

Archive:  orders.csv.zip
  inflating: orders.csv              
  inflating: __MACOSX/._orders.csv   

Archive:  order_products__prior.csv.zip
  inflating: order_products__prior.csv  
  inflating: __MACOSX/._order_products__prior.csv  

Archive:  order_products__train.csv.zip
  inflating: order_products__train.csv  
  inflating: __MACOSX/._order_products__train.csv  

Archive:  aisles.csv.zip
  inflating: aisles.csv              
  inflating: __MACOSX/._aisles.csv   

6 archives were successfully processed.


In [5]:
# List all csv files in the current directory
# -l specifies the "long" listing format, which includes additional info on each file
# -h specifies "human readable" file size units
!ls -l -h *.csv

-rw-r--r-- 1 root root 2.6K May  2  2017 aisles.csv
-rw-r--r-- 1 root root  270 May  2  2017 departments.csv
-rw-r--r-- 1 root root 551M May  2  2017 order_products__prior.csv
-rw-r--r-- 1 root root  24M May  2  2017 order_products__train.csv
-rw-r--r-- 1 root root 104M May  2  2017 orders.csv
-rw-r--r-- 1 root root 2.1M May  2  2017 products.csv


# Assignment

## Practice joining data

These are the top 10 most frequently ordered products. How many times was each ordered? 

1. Banana
2. Bag of Organic Bananas
3. Organic Strawberries
4. Organic Baby Spinach 
5. Organic Hass Avocado
6. Organic Avocado
7. Large Lemon 
8. Strawberries
9. Limes 
10. Organic Whole Milk

**Here is what you need to do:**

* First, write down which columns you need and which dataframes have them.
* Next, merge these into a single dataframe.
* Then, use pandas functions from the previous lesson to get the **counts of the top 10 most frequently ordered products**.
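
One possible way to tackle the first step (finding which columns live in which file) without loading the large CSVs is to read only the header row with `nrows=0`. The sketch below uses small in-memory strings as hypothetical stand-ins for the real files; with the real data you would pass the file names instead.

```python
import io
import pandas as pd

# Hypothetical stand-ins for order_products__prior.csv and products.csv
toy_files = {
    "order_products": io.StringIO(
        "order_id,product_id,add_to_cart_order,reordered\n2,33120,1,1\n"
    ),
    "products": io.StringIO(
        "product_id,product_name,aisle_id,department_id\n"
        "1,Chocolate Sandwich Cookies,61,19\n"
    ),
}

# nrows=0 parses the header only, so no data rows are loaded
columns = {
    name: list(pd.read_csv(buf, nrows=0).columns)
    for name, buf in toy_files.items()
}

# Columns present in both tables are candidate merge keys
shared = set(columns["order_products"]) & set(columns["products"])
print(columns)
print(shared)
```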

### 1) Read in and concatenate the order_products__prior.csv and order_products__train.csv CSVs. Name the resulting dataset order_products.

In [7]:
##### YOUR CODE HERE #####
import pandas as pd

prior = pd.read_csv('order_products__prior.csv')
train = pd.read_csv('order_products__train.csv')

In [15]:
print(prior.head())
print(train.head())

   order_id  product_id  add_to_cart_order  reordered
0         2       33120                  1          1
1         2       28985                  2          1
2         2        9327                  3          0
3         2       45918                  4          1
4         2       30035                  5          0
   order_id  product_id  add_to_cart_order  reordered
0         1       49302                  1          1
1         1       11109                  2          1
2         1       10246                  3          0
3         1       49683                  4          0
4         1       43633                  5          1


In [10]:
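# ignore_index=True renumbers the rows 0..n-1 instead of repeating each file's own index
order_products = pd.concat([prior, train], ignore_index=True)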
order_products.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0
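
A small sketch of what `pd.concat` does to the row index, using two tiny hypothetical dataframes in place of `prior` and `train`. By default each piece keeps its own index, so labels repeat; `ignore_index=True` renumbers the result.

```python
import pandas as pd

# Toy stand-ins for prior and train (values are illustrative)
prior_toy = pd.DataFrame({"order_id": [2, 2], "product_id": [33120, 28985]})
train_toy = pd.DataFrame({"order_id": [1, 1], "product_id": [49302, 11109]})

# Default: rows are stacked and the original index labels are kept
stacked = pd.concat([prior_toy, train_toy])

# ignore_index=True: rows are stacked and relabeled 0..n-1
renumbered = pd.concat([prior_toy, train_toy], ignore_index=True)

print(stacked.index.tolist())     # [0, 1, 0, 1]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Repeated labels do not affect a later `pd.merge` on a column, but they can be surprising when selecting rows with `.loc`.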


### 2) Create a list called ten_products that contains the names of the top 10 products ordered. Be very careful about spelling and capitalization.

In [11]:
##### YOUR CODE HERE #####
ten_products = ['Banana', 'Bag of Organic Bananas', 'Organic Strawberries', 
                'Organic Baby Spinach', 'Organic Hass Avocado', 'Organic Avocado',
                'Large Lemon', 'Strawberries', 'Limes', 'Organic Whole Milk']

### 3) Read in products.csv and name the dataset "products".

In [17]:
##### YOUR CODE HERE #####
products = pd.read_csv('products.csv')
print(products.shape)
products.head()

(49688, 4)


Unnamed: 0,product_id,product_name,aisle_id,department_id
0,1,Chocolate Sandwich Cookies,61,19
1,2,All-Seasons Salt,104,13
2,3,Robust Golden Unsweetened Oolong Tea,94,7
3,4,Smart Ones Classic Favorites Mini Rigatoni Wit...,38,1
4,5,Green Chile Anytime Sauce,5,13


### 4) Select only the rows of the products dataset that contain one of the top 10 most ordered products. Rename that dataset "products".

In [23]:
##### YOUR CODE HERE #####
condition = products['product_name'].isin(ten_products)
products = products[condition]
print(products.shape)
products.head()

(10, 4)


Unnamed: 0,product_id,product_name,aisle_id,department_id
13175,13176,Bag of Organic Bananas,24,4
16796,16797,Strawberries,24,4
21136,21137,Organic Strawberries,24,4
21902,21903,Organic Baby Spinach,123,4
24851,24852,Banana,24,4
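
Because `isin` matches strings exactly, a misspelled or differently capitalized name is silently dropped rather than raising an error. The following sketch, on a hypothetical three-row table, shows one way to check which requested names found no match; checking that the filtered shape is `(10, 4)` serves the same purpose on the real data.

```python
import pandas as pd

# Toy product table; note the lowercase "banana" in the second row
products_toy = pd.DataFrame({
    "product_id": [1, 2, 3],
    "product_name": ["Banana", "banana", "Limes"],
})

wanted = ["Banana", "Limes", "Organic Whole Milk"]

# Boolean mask: True where product_name is exactly one of the wanted names
mask = products_toy["product_name"].isin(wanted)
selected = products_toy[mask]

# Names we asked for that matched no row at all
missing = sorted(set(wanted) - set(selected["product_name"]))

print(selected["product_name"].tolist())  # ['Banana', 'Limes']
print(missing)                            # ['Organic Whole Milk']
```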


### 5) Create a dataset called product_orders that merges order_products and products.

In [27]:
##### YOUR CODE HERE #####
product_orders = pd.merge(order_products, products, on='product_id')

print(product_orders.shape)
product_orders.head()

(2418314, 7)


Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id
0,3,21903,4,1,Organic Baby Spinach,123,4
1,26,21903,6,0,Organic Baby Spinach,123,4
2,31,21903,3,1,Organic Baby Spinach,123,4
3,39,21903,4,0,Organic Baby Spinach,123,4
4,56,21903,8,1,Organic Baby Spinach,123,4
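
The merged result has far fewer rows than `order_products` because `pd.merge` performs an inner join by default: rows whose `product_id` is not in the filtered `products` table are dropped. A minimal sketch with hypothetical toy tables, comparing the default with `how="left"`:

```python
import pandas as pd

# Toy stand-ins: four order lines, but only two products in the lookup table
order_products_toy = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "product_id": [10, 20, 10, 30],
})
products_toy = pd.DataFrame({
    "product_id": [10, 20],
    "product_name": ["Banana", "Limes"],
})

# Default how="inner": keep only rows with a matching product_id on both sides
inner = pd.merge(order_products_toy, products_toy, on="product_id")

# how="left": keep every order line; unmatched names become NaN
left = pd.merge(order_products_toy, products_toy, on="product_id", how="left")

print(len(inner))                               # 3
print(len(left))                                # 4
print(int(left["product_name"].isna().sum()))   # 1
```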


### 6) Create a dataset called merged that merges product_orders and orders.

In [32]:
##### YOUR CODE HERE #####
orders = pd.read_csv('orders.csv')

merged = pd.merge(product_orders, orders, on='order_id')


### 7) Print the top 5 rows of the merged dataset.

In [33]:
##### YOUR CODE HERE #####
merged.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
0,3,21903,4,1,Organic Baby Spinach,123,4,205970,prior,16,5,17,12.0
1,26,21903,6,0,Organic Baby Spinach,123,4,153404,prior,2,0,16,7.0
2,26,24852,2,1,Banana,24,4,153404,prior,2,0,16,7.0
3,26,47766,8,1,Organic Avocado,24,4,153404,prior,2,0,16,7.0
4,31,21903,3,1,Organic Baby Spinach,123,4,201744,prior,7,6,15,14.0
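
In this merge, many product rows share one `order_id`, while `orders` is expected to have one row per order. If you want pandas to check that assumption, `pd.merge` accepts a `validate` argument. The sketch below uses hypothetical toy tables and assumes the many-to-one relationship holds in the first call, then deliberately breaks it in the second.

```python
import pandas as pd
from pandas.errors import MergeError

# Toy stand-ins for product_orders and orders
product_orders_toy = pd.DataFrame({
    "order_id": [3, 26, 26],
    "product_name": ["Organic Baby Spinach", "Organic Baby Spinach", "Banana"],
})
orders_toy = pd.DataFrame({
    "order_id": [3, 26],
    "user_id": [205970, 153404],
})

# Passes: each order_id appears at most once on the right-hand side
merged_toy = pd.merge(
    product_orders_toy, orders_toy, on="order_id", validate="many_to_one"
)

# Add a duplicate order row to violate the assumption
bad_orders = pd.concat([orders_toy, orders_toy.iloc[[0]]], ignore_index=True)

try:
    pd.merge(product_orders_toy, bad_orders, on="order_id", validate="many_to_one")
    raised = False
except MergeError:
    raised = True

print(len(merged_toy))  # 3
print(raised)           # True
```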


### 9) Print the number of times each food appears in the merged dataset. (Hint: use .value_counts() )

In [34]:
##### YOUR CODE HERE #####
merged.value_counts('product_name')

product_name
Banana                    491291
Bag of Organic Bananas    394930
Organic Strawberries      275577
Organic Baby Spinach      251705
Organic Hass Avocado      220877
Organic Avocado           184224
Large Lemon               160792
Strawberries              149445
Limes                     146660
Organic Whole Milk        142813
dtype: int64

### 10) Run the following code to test for duplicate products in a single order.

In [35]:
##### Run this code #####

order_products.duplicated(subset=['order_id', 'product_id']).value_counts()

False    33819106
dtype: int64

### Conclusion? - No (order_id, product_id) pair appears more than once, so this dataset records which products were in each order, not how many units of each were bought. The reordered column only flags whether the shopper had ordered that product before. So our counts of how many times the top 10 products were ordered are really the number of orders that included each product.
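
To make this conclusion concrete, here is a small sketch on a hypothetical toy table: when no (order, product) pair is duplicated, counting rows per product gives the same numbers as counting distinct orders per product.

```python
import pandas as pd

# Toy order lines: Banana appears in three different orders, Limes in one
toy = pd.DataFrame({
    "order_id": [1, 1, 2, 3],
    "product_name": ["Banana", "Limes", "Banana", "Banana"],
})

# True if any (order_id, product_name) pair occurs more than once
has_duplicates = toy.duplicated(subset=["order_id", "product_name"]).any()

# Rows per product versus distinct orders per product
row_counts = toy["product_name"].value_counts()
order_counts = toy.groupby("product_name")["order_id"].nunique()

print(has_duplicates)            # False
print(row_counts.to_dict())      # {'Banana': 3, 'Limes': 1}
print(order_counts.to_dict())    # {'Banana': 3, 'Limes': 1}
```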

### To count how often a given product was ordered, we need to merge order_products with products so that each product row in an order carries its product name.

### 11) Print the top 5 rows of the products dataset.

In [36]:
##### YOUR CODE HERE #####
products.head()

Unnamed: 0,product_id,product_name,aisle_id,department_id
13175,13176,Bag of Organic Bananas,24,4
16796,16797,Strawberries,24,4
21136,21137,Organic Strawberries,24,4
21902,21903,Organic Baby Spinach,123,4
24851,24852,Banana,24,4


### 12) Merge together the order_products and products datasets and call the dataset product_orders.

In [41]:
print(order_products.head())
print(products.head())

   order_id  product_id  add_to_cart_order  reordered
0         2       33120                  1          1
1         2       28985                  2          1
2         2        9327                  3          0
3         2       45918                  4          1
4         2       30035                  5          0
       product_id            product_name  aisle_id  department_id
13175       13176  Bag of Organic Bananas        24              4
16796       16797            Strawberries        24              4
21136       21137    Organic Strawberries        24              4
21902       21903    Organic Baby Spinach       123              4
24851       24852                  Banana        24              4


In [38]:
##### YOUR CODE HERE #####
product_orders = pd.merge(order_products, products, on='product_id')
product_orders

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered,product_name,aisle_id,department_id
0,3,21903,4,1,Organic Baby Spinach,123,4
1,26,21903,6,0,Organic Baby Spinach,123,4
2,31,21903,3,1,Organic Baby Spinach,123,4
3,39,21903,4,0,Organic Baby Spinach,123,4
4,56,21903,8,1,Organic Baby Spinach,123,4
...,...,...,...,...,...,...,...
2418309,3418861,26209,3,1,Limes,24,4
2418310,3418905,26209,2,1,Limes,24,4
2418311,3419642,26209,6,0,Limes,24,4
2418312,3420257,26209,22,1,Limes,24,4


### 13) Count how many times each value of 'product_name' appears in product_orders. This will tell you how many times each of the top 10 products was ordered.

In [39]:
##### YOUR CODE HERE #####
product_orders.value_counts('product_name')

product_name
Banana                    491291
Bag of Organic Bananas    394930
Organic Strawberries      275577
Organic Baby Spinach      251705
Organic Hass Avocado      220877
Organic Avocado           184224
Large Lemon               160792
Strawberries              149445
Limes                     146660
Organic Whole Milk        142813
dtype: int64

# Portfolio Project Milestone

Watch the Portfolio Project (formerly known as Build Week) kickoff video to get a sense of what you will accomplish over the next few weeks:
https://youtu.be/WYi9EXH-9lU