<a href="https://colab.research.google.com/github/Trabit/DS-Unit-1-Sprint-1-Data-Wrangling-and-Storytelling/blob/master/LS_DS_113_Join_and_Reshape_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 1, Sprint 1, Module 3*

---



# Join and Reshape Data 

- Student can concatenate data with pandas
- Student can merge data with pandas
- Student can understand and describe tidy data formatting
- Student can use the `.melt()` and `.pivot()` functions to translate between wide and tidy data format.

Helpful Links:
- [Pandas Cheat Sheet](https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf)
- [Tidy Data](https://en.wikipedia.org/wiki/Tidy_data)
  - Combine Data Sets: Standard Joins
  - Tidy Data
  - Reshaping Data
- Python Data Science Handbook
  - [Chapter 3.6](https://jakevdp.github.io/PythonDataScienceHandbook/03.06-concat-and-append.html), Combining Datasets: Concat and Append
  - [Chapter 3.7](https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html), Combining Datasets: Merge and Join
  - [Chapter 3.8](https://jakevdp.github.io/PythonDataScienceHandbook/03.08-aggregation-and-grouping.html), Aggregation and Grouping
  - [Chapter 3.9](https://jakevdp.github.io/PythonDataScienceHandbook/03.09-pivot-tables.html), Pivot Tables

# [Objective](#concat) Concatenate dataframes with pandas



## Overview

"Concatenate" is a fancy word for joining two things together. For example, we can concatenate two strings together using the `+` operator.

In [0]:
'We can join/concatenate two strings together ' + 'using the "+" operator.'

'We can join/concatenate two strings together using the "+" operator.'

When we "concatenate" two dataframes we will "stick them together" either by rows or columns. Let's look at some simple examples:

In [0]:
import pandas as pd

In [0]:
df1 = pd.DataFrame({'a': [1,2,3,4], 'b': [4,5,6,7], 'c': [7,8,9,10]})

df2 = pd.DataFrame({'a': [6,4,8,7], 'b': [9,4,3,2], 'c': [1,6,2,9]})

In [0]:
df1.head()

Unnamed: 0,a,b,c
0,1,4,7
1,2,5,8
2,3,6,9
3,4,7,10


In [0]:
df2.head()

Unnamed: 0,a,b,c
0,6,9,1
1,4,4,6
2,8,3,2
3,7,2,9


### Concatenate by Rows 

Concatenating by rows is the default behavior of `pd.concat()`. This is the most common form of concatenation.

In [0]:
# Pass in the dataframes that we want to concatenate as a list.
concatenated_by_rows = pd.concat([df1, df2])

# Reset the index so that we don't have repeated row identifiers
concatenated_by_rows = concatenated_by_rows.reset_index()

concatenated_by_rows.head(8)

Unnamed: 0,index,a,b,c
0,0,1,4,7
1,1,2,5,8
2,2,3,6,9
3,3,4,7,10
4,0,6,9,1
5,1,4,4,6
6,2,8,3,2
7,3,7,2,9


### Concatenate by Columns

In [0]:
concatenated_by_columns = pd.concat([df1, df2], axis=1)

concatenated_by_columns.head()

Unnamed: 0,a,b,c,a.1,b.1,c.1
0,1,4,7,6,9,1
1,2,5,8,4,4,6
2,3,6,9,8,3,2
3,4,7,10,7,2,9

`pd.concat()` uses the column headers and the row index values to line the data up: when concatenating by rows, values are matched by column header, and when concatenating by columns, they are matched by row index. Wherever a match can't be found, `NaN` values are filled in.

In [0]:
df3 = pd.DataFrame({'a': [4,3,2,1], 'b': [4,5,6,7], 'c': [7,8,9,10]})

df4 = pd.DataFrame({'a': [6,4,8,7,8], 'b': [9,4,3,2,1], 'd': [1,6,2,9,5]})

In [0]:
df3.head()

Unnamed: 0,a,b,c
0,4,4,7
1,3,5,8
2,2,6,9
3,1,7,10


In [0]:
df4.head()

Unnamed: 0,a,b,d
0,6,9,1
1,4,4,6
2,8,3,2
3,7,2,9
4,8,1,5


### Concatenate by rows when not all column headers match

In [0]:
concatenated_by_rows = pd.concat([df3, df4], sort=False)

concatenated_by_rows.head(8)

Unnamed: 0,a,b,c,d
0,4,4,7.0,
1,3,5,8.0,
2,2,6,9.0,
3,1,7,10.0,
0,6,9,,1.0
1,4,4,,6.0
2,8,3,,2.0
3,7,2,,9.0


### Concatenate by columns when not all row indexes match

In [0]:
concatenated_by_columns = pd.concat([df3, df4], axis=1)

concatenated_by_columns.head()

Unnamed: 0,a,b,c,a.1,b.1,d
0,4.0,4.0,7.0,6,9,1
1,3.0,5.0,8.0,4,4,6
2,2.0,6.0,9.0,8,3,2
3,1.0,7.0,10.0,7,2,9
4,,,,8,1,5


Whenever we are combining dataframes, if appropriate values cannot be found based on the rules of the method we are using, then missing values will be filled with `NaNs`.
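If the `NaN` padding is not what you want, `pd.concat()` also accepts a `join` argument. The sketch below uses two made-up dataframes (`left` and `right` are illustrative names, not part of the lesson) to compare the default `join='outer'` with `join='inner'`, which keeps only the columns both dataframes share.

```python
import pandas as pd

# Toy frames (hypothetical): they share columns 'a' and 'b' only.
left = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})
right = pd.DataFrame({'a': [7, 8], 'b': [9, 10], 'd': [11, 12]})

# Default (join='outer'): keep every column and fill the gaps with NaN.
outer = pd.concat([left, right], sort=False)

# join='inner': keep only the columns that both frames share.
inner = pd.concat([left, right], join='inner')

print(outer.columns.tolist())        # all four columns survive
print(int(outer.isna().sum().sum())) # how many cells were padded with NaN
print(inner.columns.tolist())        # only the shared columns survive
```

Which one is appropriate depends on whether the unmatched columns carry information you still need.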

## Follow Along



We’ll work with a dataset of [3 Million Instacart Orders, Open Sourced](https://tech.instacart.com/3-million-instacart-orders-open-sourced-d40d29ead6f2)!

The files that we will be working with are in a folder of CSVs. We need to load that folder, explore the CSVs to make sure that we understand what we're working with and where the important data lies, and then combine the dataframes as necessary.



Our goal is to reproduce this table which holds the first two orders for user id 1.


In [0]:
from IPython.display import display, Image
url = 'https://cdn-images-1.medium.com/max/1600/1*vYGFQCafJtGBBX5mbl0xyw.png'
example = Image(url=url, width=600)

display(example)

In [0]:
!wget https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz

--2020-01-19 04:21:16--  https://s3.amazonaws.com/instacart-datasets/instacart_online_grocery_shopping_2017_05_01.tar.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.17.51
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.17.51|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 205548478 (196M) [application/x-gzip]
Saving to: ‘instacart_online_grocery_shopping_2017_05_01.tar.gz’


2020-01-19 04:21:22 (34.6 MB/s) - ‘instacart_online_grocery_shopping_2017_05_01.tar.gz’ saved [205548478/205548478]



In [0]:
!tar --gunzip --extract --verbose --file=instacart_online_grocery_shopping_2017_05_01.tar.gz

instacart_2017_05_01/
instacart_2017_05_01/._aisles.csv
instacart_2017_05_01/aisles.csv
instacart_2017_05_01/._departments.csv
instacart_2017_05_01/departments.csv
instacart_2017_05_01/._order_products__prior.csv
instacart_2017_05_01/order_products__prior.csv
instacart_2017_05_01/._order_products__train.csv
instacart_2017_05_01/order_products__train.csv
instacart_2017_05_01/._orders.csv
instacart_2017_05_01/orders.csv
instacart_2017_05_01/._products.csv
instacart_2017_05_01/products.csv


In [0]:
%cd instacart_2017_05_01



In [0]:
!ls -lh *.csv



### aisles

We don't need anything from aisles.csv.

In [0]:
aisles = pd.read_csv('aisles.csv')

print(aisles.shape)
aisles.head()

### departments

We don't need anything from departments.csv.

In [0]:
departments = pd.read_csv('departments.csv')
print(departments.shape)
departments.head()

### order_products__prior

We need:
- order id
- product id
- add to cart order

That is everything except for `reordered`.

In [0]:
order_products__prior = pd.read_csv('order_products__prior.csv')
print(order_products__prior.shape)
order_products__prior.head()

### order_products__train

We need:
- order id
- product id
- add to cart order

That is everything except for `reordered`.

Do you see anything similar between order_products__train and order_products__prior?



In [0]:
order_products__train = pd.read_csv('order_products__train.csv')
print(order_products__train.shape)
order_products__train.head()

### orders

We need:
- order id
- user id
- order number
- order dow
- order hour of day

In [0]:
orders = pd.read_csv('orders.csv')
print(orders.shape)
orders.head()

### products

We need:
- product id
- product name

In [0]:
products = pd.read_csv('products.csv')
print (products.shape)
products.head()

NameError: ignored

## Concatenate order_products__prior and order_products__train




In [0]:
order_products__prior.shape

In [0]:
order_products__train.head()

In [0]:
order_products = pd.concat([order_products__prior, order_products__train])

print(order_products.shape)
order_products.head()

## Challenge

Concatenating dataframes means to stick two dataframes together either by rows or by columns. The default behavior of `pd.concat()` is to take the rows of one dataframe and add them to the rows of another dataframe. If we pass the argument `axis=1` then we will be adding the columns of one dataframe to the columns of another dataframe.

Concatenating dataframes is most useful when the columns are the same between two dataframes or when we have matching row indices between two dataframes. 

Be ready to use this method to combine dataframes together during your assignment.
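One detail worth practicing: calling `.reset_index()` after a row-wise concat keeps the old row labels as an extra `index` column. A minimal sketch with made-up dataframes (`top` and `bottom` are illustrative names) shows two ways to get a clean 0..n-1 index instead.

```python
import pandas as pd

# Toy frames (hypothetical) with identical column headers.
top = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
bottom = pd.DataFrame({'a': [5, 6], 'b': [7, 8]})

# ignore_index=True discards the original row labels and numbers the
# result 0..n-1, so no extra 'index' column is left behind.
stacked = pd.concat([top, bottom], ignore_index=True)

# The same idea after the fact: drop=True throws the old index away
# instead of turning it into a column.
stacked_alt = pd.concat([top, bottom]).reset_index(drop=True)

print(stacked.index.tolist())
print(stacked.equals(stacked_alt))
```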

# [Objective](#merge) Merge dataframes with pandas



## Overview

In [0]:
display(example)

Before we can continue we need to understand where the data in the above table is coming from and why specific pieces of data are held in specific dataframes.

Each of these CSVs has a specific unit of observation (row). The columns that we see included in each CSV were selected purposefully. For example, each row of the `orders` dataframe is a specific and unique order, telling us who made the order and when they made it. Every row in the `products` dataframe tells us about a specific and unique product that the store offers. And every row in the `order_products` dataframe tells us how products are associated with specific orders, including when the product was added to the shopping cart.

### The Orders Dataframe

Holds information about specific orders: things like who placed the order and when it was placed.

- user_id
- order_id
- order_number
- order_dow
- order_hour_of_day

### The Products Dataframe

Holds information about individual products.

- product_id
- product_name

### The Order_Products Dataframe

Tells us how products are associated with specific orders since an order is a group of products.

- order_id
- product_id
- add_to_cart_order

As we look at the table that we're trying to recreate, we notice that we're not looking at specific orders or products, but at a specific **USER**. We're looking at the first two orders for a specific user and the products associated with those orders, so we'll need to combine dataframes to get all of this data together into a single table.

**The key to combining all of this information is that we need values that exist in both datasets that we can use to match up rows and combine dataframes.**
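One quick way to look for such shared values is to compare column names. The sketch below uses tiny made-up stand-ins for the three tables, and `shared_columns` is a hypothetical helper written for this illustration, not a pandas function.

```python
import pandas as pd

# Tiny stand-ins for the real tables (values are made up).
orders_toy = pd.DataFrame({'order_id': [10, 11], 'user_id': [1, 1]})
order_products_toy = pd.DataFrame({'order_id': [10, 10, 11],
                                   'product_id': [196, 14084, 196]})
products_toy = pd.DataFrame({'product_id': [196, 14084],
                             'product_name': ['Soda', 'Milk']})

def shared_columns(left, right):
    """Return the column names that appear in both dataframes."""
    return sorted(set(left.columns) & set(right.columns))

print(shared_columns(orders_toy, order_products_toy))
print(shared_columns(order_products_toy, products_toy))
print(shared_columns(orders_toy, products_toy))
```

A shared column name is only a hint: the values in that column also have to refer to the same thing in both tables before it makes sense to merge on it.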

## Follow Along

We have three dataframes, so we're going to need to merge our data twice. As we approach merging datasets together we will take the following approach.

1) Identify which two dataframes we would like to combine.

2) Find columns that are common between both dataframes that we can use to match up information.

3) Slim down both of our dataframes so that they contain only the relevant data before we merge.

4) Merge the dataframes.
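The four steps can be sketched end to end on a small, made-up example. The `customers` and `purchases` dataframes and all of their values are assumptions for illustration only; the Instacart tables follow the same pattern at a much larger scale.

```python
import pandas as pd

# Step 1: pick the two dataframes to combine (toy data, values are made up).
customers = pd.DataFrame({'customer_id': [1, 2, 3],
                          'city': ['Austin', 'Boise', 'Cary'],
                          'signup_year': [2017, 2018, 2019]})
purchases = pd.DataFrame({'customer_id': [1, 1, 2, 4],
                          'amount': [20.0, 5.5, 12.0, 7.0],
                          'channel': ['web', 'app', 'web', 'app']})

# Step 2: both frames contain 'customer_id', so that is the matching column.
key = 'customer_id'

# Step 3: keep only the rows and columns we care about (customer 1 here).
customers_small = customers.loc[customers[key] == 1, [key, 'city']]
purchases_small = purchases.loc[purchases[key] == 1, [key, 'amount']]

# Step 4: merge the slimmed-down frames on the shared key.
result = pd.merge(customers_small, purchases_small, on=key, how='inner')
print(result)
```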

In [0]:
display(example)


### First Merge

1) Combine `orders` and `order_products`

2) We will use the `order_id` column to match information between the two datasets

3) Let's slim down our dataframes to only the information that we need. We do this because merging is expensive: why would we merge millions of rows together if we know that we're only going to need 11 rows when we're done?

What specific conditions could we use to slim down the `orders` dataframe?

`user_id == 1` and `order_number <= 2`

or

`order_id == 2539329` or `order_id == 2398795`

In [0]:
df1.head()

In [0]:
# What if I only wanted the rows where column c is > 8?
condition = (df1['c'] > 8)

df1[condition]

In [0]:
orders['user_id'] == 1


In [0]:
# An example of dataframe filtering

# Create a condition
condition = (orders['user_id'] == 1) & (orders['order_number'] <= 2)

# Pass that condition into the square brackets
# that we use to access portions of a dataframe.
# Only the rows where that condition evaluates to *True*
# will be retained in the dataframe
orders_subset = orders[condition]

# Look at the subsetted dataframe
print(orders_subset.shape)
orders_subset.head()

Remember there are multiple ways that we could have filtered this dataframe. We also could have done it by specific `order_id`s.

In [0]:
condition = (orders['order_id'] == 2539329) | (orders['order_id'] == 2398795)

orders_subset = orders[condition]

print(orders_subset.shape)
orders_subset.head()

In [0]:
# We don't necessarily have to save our condition to the variable "condition";
# we can pass the condition into the square brackets directly.
# I just wanted to be clear about what was happening inside of the square brackets.

orders_subset = orders[(orders['order_id'] == 2539329) | (orders['order_id'] == 2398795)]

print(orders_subset.shape)
orders_subset.head()

Now we'll filter down the order_products dataframe

What conditions could we use for subsetting that table?

We can use order_id again.

In [0]:
condition = (order_products['order_id'] == 2539329) | (order_products['order_id'] == 2398795)
order_products_subset = order_products[condition]

print(order_products_subset.shape)
order_products_subset.head(11)

4) Now we're ready to merge these two tables together.

In [0]:
# on = the column header of the unique identifier column that I'm using to
# match up the two dataframes' information

# how = which non-matching rows are retained in the result
# (see the options listed below the merge)
orders_and_products = pd.merge(orders_subset,
                               order_products_subset,
                               on='order_id',
                               how='inner')

# inner = keep only the rows whose key appears in both dataframes
# outer = keep all rows from both dataframes, filling the gaps with NaNs
# left  = keep every row from the left-hand dataframe, filling the gaps with NaNs
# right = keep every row from the right-hand dataframe, filling the gaps with NaNs

orders_and_products.head(11)

In [0]:
display(example)

In [0]:
# Remove columns that we don't need

orders_and_products = orders_and_products.drop(['eval_set',
                                                'reordered',
                                                'days_since_prior_order'],
                                               axis=1)

orders_and_products.head(11)

Okay, we're looking pretty good. We're missing one more column, `product_name`, so we're going to need to merge one more time.

1) merge `orders_and_products` with `products`

2) Use `product_id` as our identifier in both tables

3) We need to slim down the `products` dataframe

In [0]:
orders_and_products['product_id']

In [0]:
orders_and_products['product_id'].isin([196, 26088])

In [0]:
condition = products['product_id'].isin(orders_and_products['product_id'])

products_subset = products[condition]

products_subset

In [0]:
final = pd.merge(orders_and_products, products_subset, on='product_id', how='inner')

final

In [0]:
display(example)


### Some nitpicky cleanup:

In [0]:
# reorder columns

final = final[['user_id', 'order_id',
               'order_number', 'order_dow',
               'order_hour_of_day',
               'add_to_cart_order',
               'product_id', 'product_name']]

final

In [0]:
# sort rows
final = final.sort_values(by=['order_number', 'add_to_cart_order'])
final

In [0]:
# remove underscores from column headers

final.columns = [column.replace('_', ' ') for column in final]

In [0]:
display(example)

## Challenge

Review this Chris Albon documentation about [concatenating dataframes by row and by column](https://chrisalbon.com/python/data_wrangling/pandas_join_merge_dataframe/) and then be ready to master `pd.merge()` and practice using different `how` parameters on your assignment.
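To get a feel for the `how` options before the assignment, here is a minimal sketch on two made-up dataframes (`left`, `right`, and their values are assumptions for illustration). Passing `indicator=True` adds a `_merge` column that records whether each row came from the left dataframe, the right dataframe, or both.

```python
import pandas as pd

# Toy frames (hypothetical): keys 2 and 3 appear in both, 1 and 4 in only one.
left = pd.DataFrame({'key': [1, 2, 3], 'left_val': ['a', 'b', 'c']})
right = pd.DataFrame({'key': [2, 3, 4], 'right_val': ['x', 'y', 'z']})

# Run the same merge once per `how` option and keep every result.
results = {how: pd.merge(left, right, on='key', how=how, indicator=True)
           for how in ['inner', 'outer', 'left', 'right']}

for how, merged in results.items():
    print(how, len(merged), 'rows, keys:', sorted(merged['key']))
```

Inspecting `results['outer']` is a convenient way to see where the `NaN` values come from, since the `_merge` column labels each row as `left_only`, `right_only`, or `both`.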

# [Objective](#tidy) Learn Tidy Data Format

## Overview

### Why reshape data?

#### Some libraries prefer data in different formats

For example, the Seaborn data visualization library prefers data in "Tidy" format often (but not always).

> "[Seaborn will be most powerful when your datasets have a particular organization.](https://seaborn.pydata.org/introduction.html#organizing-datasets) This format is alternately called “long-form” or “tidy” data and is described in detail by Hadley Wickham. The rules can be simply stated:

> - Each variable is a column
> - Each observation is a row

> A helpful mindset for determining whether your data are tidy is to think backwards from the plot you want to draw. From this perspective, a “variable” is something that will be assigned a role in the plot."

#### Data science is often about putting square pegs in round holes

Here's an inspiring [video clip from _Apollo 13_](https://www.youtube.com/watch?v=ry55--J4_VQ): “Invent a way to put a square peg in a round hole.” It's a good metaphor for data wrangling!

### Hadley Wickham's Examples

From his paper, [Tidy Data](http://vita.had.co.nz/papers/tidy-data.html)

In [0]:
%matplotlib inline
import pandas as pd
import numpy as np
import seaborn as sns

table1 = pd.DataFrame(
    [[np.nan, 2],
     [16,    11], 
     [3,      1]],
    index=['John Smith', 'Jane Doe', 'Mary Johnson'], 
    columns=['treatmenta', 'treatmentb'])

"Table 1 provides some data about an imaginary experiment in a format commonly seen in the wild. 

The table has two columns and three rows, and both rows and columns are labelled."

In [0]:
table1

Unnamed: 0,treatmenta,treatmentb
John Smith,,2
Jane Doe,16.0,11
Mary Johnson,3.0,1


"There are many ways to structure the same underlying data. 

Table 2 shows the same data as Table 1, but the rows and columns have been transposed. The data is the same, but the layout is different."

In [0]:
table2 = table1.T
table2

Unnamed: 0,John Smith,Jane Doe,Mary Johnson
treatmenta,,16.0,3.0
treatmentb,2.0,11.0,1.0


"Table 3 reorganises Table 1 to make the values, variables and observations more clear.

Table 3 is the tidy version of Table 1. Each row represents an observation, the result of one treatment on one person, and each column is a variable."

| name         | trt | result |
|--------------|-----|--------|
| John Smith   | a   | -      |
| Jane Doe     | a   | 16     |
| Mary Johnson | a   | 3      |
| John Smith   | b   | 2      |
| Jane Doe     | b   | 11     |
| Mary Johnson | b   | 1      |

## Follow Along

### Table 1 --> Tidy

We can use the pandas `melt` function to reshape Table 1 into Tidy format.

In [0]:
table1


In [0]:
# Take the row index, and add it as a new column
table = table1.reset_index()
table.head()



In [0]:
# What is the unique identifier for each row? (id_vars)
# Where is the data that I want to be in my single "tidy" column? (value_vars)
# melt: go from wide format to tidy (long) format

tidy1 = table.melt(id_vars='index', value_vars=['treatmenta', 'treatmentb'])

tidy1

In [0]:
# rename columns

tidy1 = tidy1.rename(columns={
    'index': 'name',
    'variable': 'trt',
    'value': 'result'
})
tidy1

In [0]:
tidy1.trt = tidy1.trt.str.replace('treatment','')

tidy1


### Tidy --> Table 1

The `pivot_table` function is the inverse of `melt`.

In [0]:
# index: unique identifier
# columns: what do you want to differentiate the columns in wide format?
# values: where are the numbers? They go in the middle of the wide dataframe

wide = tidy1.pivot_table(index='name', columns='trt', values='result')

wide

## Challenge

On your assignment, be prepared to take table2 (the transpose of table1) and reshape it to be in tidy data format using `.melt()` and then put it back in "wide format" using `.pivot_table()`.
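The objectives also mention `.pivot()`, which is closely related to `.pivot_table()`. The sketch below uses a made-up tidy dataframe (the `city`, `year`, and `sales` columns are assumptions for illustration) to show the difference: `.pivot()` only rearranges values, while `.pivot_table()` aggregates when the same index/column pair appears more than once.

```python
import pandas as pd

# Hypothetical tidy data: one row per (city, year) observation.
tidy = pd.DataFrame({'city': ['Austin', 'Austin', 'Boise', 'Boise'],
                     'year': [2019, 2020, 2019, 2020],
                     'sales': [10, 12, 7, 9]})

# .pivot() only rearranges; every (index, columns) pair must be unique.
wide = tidy.pivot(index='city', columns='year', values='sales')

# Add a second Boise/2020 row so that one pair is duplicated.
extra = pd.DataFrame({'city': ['Boise'], 'year': [2020], 'sales': [11]})
with_duplicate = pd.concat([tidy, extra], ignore_index=True)

# .pivot_table() aggregates duplicates (the default aggfunc is the mean).
wide_mean = with_duplicate.pivot_table(index='city', columns='year',
                                       values='sales')

# Back to tidy: reset_index() turns the row labels into a column for melt.
tidy_again = wide.reset_index().melt(id_vars='city', var_name='year',
                                     value_name='sales')
print(wide)
print(wide_mean)
print(tidy_again)
```

Calling `.pivot()` on `with_duplicate` would raise a `ValueError`, which is why `.pivot_table()` is the safer default when you are not sure the rows are unique.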

# [Objective](#melt-pivot) Transition between tidy and wide data formats with `.melt()` and `.pivot()`.

## Overview

Tidy data format can be particularly useful with certain plotting libraries, Seaborn for example. Let's practice reshaping our data and show how this can be extremely useful in preparing our data for plotting.

Remember that tidy data format means:

- Each variable is a column
- Each observation is a row

A helpful mindset for determining whether your data are tidy is to think backwards from the plot you want to draw. From this perspective, a “variable” is something that will be assigned a role in the plot. When plotting, this typically means that the values that we're most interested in and that represent the same thing will all be in a single column. You'll see that in the different examples that we show. The important data will be in a single column.



In [0]:
# Look at some of the awesome out-of-the-box seaborn functionality:

import seaborn as sns

sns.catplot(x='trt', y='result', col='name',
            kind='bar', data=tidy1, height=2);


## Follow Along

Now with Instacart Data. We're going to try and reproduce a small part of this visualization: 

In [0]:
from IPython.display import display, Image
url = 'https://cdn-images-1.medium.com/max/1600/1*wKfV6OV-_1Ipwrl7AjjSuw.png'
example = Image(url=url, width=600)

display(example)

Instead of a plot with 50 products, we'll just do two — the first products from each list
- Half And Half Ultra Pasteurized
- Half Baked Frozen Yogurt

So, given a `product_name` we need to calculate its `order_hour_of_day` pattern.

In [0]:
import pandas as pd
products = pd.read_csv('products.csv')

order_products = pd.concat([pd.read_csv('order_products__prior.csv'),
                            pd.read_csv('order_products__train.csv')])

orders = pd.read_csv('orders.csv')

### Subset and Merge

One challenge of performing a merge on this data is that the `products` and `orders` datasets do not have any common columns that we can merge on. Due to this we will have to use the `order_products` dataset to provide the columns that we will use to perform the merge.
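The bridging idea can be sketched on three tiny made-up tables (the `_toy` names and all values are assumptions for illustration): the middle table shares `product_id` with one side and `order_id` with the other, so two chained merges connect product names to order hours.

```python
import pandas as pd

# Toy versions of the three tables (values are made up).
products_toy = pd.DataFrame({'product_id': [1, 2],
                             'product_name': ['Froyo', 'Cream']})
order_products_toy = pd.DataFrame({'order_id': [100, 100, 101],
                                   'product_id': [1, 2, 2]})
orders_toy = pd.DataFrame({'order_id': [100, 101],
                           'order_hour_of_day': [20, 8]})

# products -> order_products matches on 'product_id';
# that result -> orders matches on 'order_id'.
bridged = (products_toy
           .merge(order_products_toy, on='product_id')
           .merge(orders_toy, on='order_id'))

print(bridged[['product_name', 'order_hour_of_day']])
```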

Here are the two products that we want to work with.

In [0]:
product_names = ['Half Baked Frozen Yogurt', 'Half And Half Ultra Pasteurized']

Let's remind ourselves of what columns we have to work with:

In [0]:
products.columns.to_list()


In [0]:
orders.columns.to_list()


In [0]:
order_products.columns.to_list()

This might blow your mind, but we're going to subset the dataframes to select specific columns **and** merge them all in one go. Ready?

In [0]:
merged = (products[['product_id', 'product_name']]
          .merge(order_products[['order_id', 'product_id']])
          .merge(orders[['order_id', 'order_hour_of_day']]))

Ok, so we were a little bit lazy and probably should have subsetted the rows of our dataframes before we merged them. We are going to filter after the fact. This is something that you can try out for practice. Can you figure out how to filter these dataframes **before** merging rather than after?

Again, there are multiple effective ways to write conditions. 

In [0]:
condition = (merged['product_name'] == 'Half Baked Frozen Yogurt') | (merged['product_name'] == 'Half And Half Ultra Pasteurized')

merged = merged[condition]

print(merged.shape)
merged.head()

In [0]:
product_names = ['Half Baked Frozen Yogurt', 'Half And Half Ultra Pasteurized']

condition = merged['product_name'].isin(product_names)

subset = merged[condition]

print(subset.shape)
subset.head()

### 4 ways to reshape and plot



In [0]:
display(example)

1) The `.value_counts()` approach.

Remember that we're trying to get the key variables (values) listed as a single column.

In [0]:
froyo = subset[subset['product_name']=='Half Baked Frozen Yogurt']
cream = subset[subset['product_name']=='Half And Half Ultra Pasteurized']

In [0]:
cream['order_hour_of_day'].value_counts(normalize=True).sort_index()

In [0]:
froyo['order_hour_of_day'].value_counts(normalize=True).sort_index()

In [0]:
(cream['order_hour_of_day']
 .value_counts(normalize=True)
 .sort_index()
 .plot())

# plt.show()

(froyo['order_hour_of_day']
 .value_counts(normalize=True)
 .sort_index()
 .plot());

2) Crosstab

In [0]:
pd.crosstab(subset['order_hour_of_day'],
            subset['product_name'],
            normalize='columns').plot()

3) Pivot Table

In [0]:
subset.head()


In [0]:
subset.pivot_table(index='order_hour_of_day',
                   columns='product_name',
                   values='order_id',
                   aggfunc=len).plot();

4) Melt 

We've got to get it into wide format first. We'll use a crosstab which is a specific type of pivot_table.
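A minimal sketch of that wide-then-melt sequence, using a small made-up dataframe in place of `subset` (the `toy` name, its values, and the `share` column name are assumptions for illustration):

```python
import pandas as pd

# Hypothetical stand-in for `subset`: one row per ordered product.
toy = pd.DataFrame({
    'product_name': ['Froyo', 'Froyo', 'Froyo', 'Cream', 'Cream'],
    'order_hour_of_day': [20, 20, 21, 8, 9],
})

# Wide format: one row per hour, one column per product,
# with each column normalized so that it sums to 1.
wide = pd.crosstab(toy['order_hour_of_day'], toy['product_name'],
                   normalize='columns')

# Tidy format: one row per (hour, product) pair.
tidy = wide.reset_index().melt(id_vars='order_hour_of_day',
                               var_name='product_name',
                               value_name='share')
print(wide)
print(tidy)
```

A tidy dataframe shaped like this can then be handed to a Seaborn function such as `sns.lineplot`, with `x='order_hour_of_day'`, `y='share'`, and `hue='product_name'`.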

Now, with Seaborn: