# Week 13
# Data Wrangling: Join, Combine, and Reshape

In many applications, data may be spread across a number of files or be arranged in a form that is not easy to analyze. This chapter focuses on tools to help combine, join, and rearrange data.

*Reference*: Textbook, Chapter 8

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## I. Merging Datasets

### 1. Default merge operation for data frames

In [None]:
# Generate two data frames
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df1

In [None]:
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})
df2

`df1.merge(df2)` merges df1 with df2:

In [None]:
df1.merge(df2)

In [None]:
# df2.merge(df1) pairs up the same rows; only the row and column order differ
df2.merge(df1)

In [None]:
pd.merge(df2, df1)

Q: Can you identify the rule followed by merge?

- **How does pandas know which row from df2 should be combined with a row from df1?**
A row from df2 is merged with a row from df1 if and only if they have the same value in the shared column.

- **Which column is used to "glue" df1 and df2?**
By default, the glue columns are all the columns whose names appear in both data frames.

- **Can a row from df1 disappear in the merged data frame?**
A row can indeed disappear if it cannot find a match from df2.

- **Can a row from df2 disappear in the merged data frame?**
A row can indeed disappear if it cannot find a match from df1.

- **Can a row from df1/df2 appear multiple times in the merged data frame?**
Yes. A row may appear multiple times if there are multiple matches from the other data frame.
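
The rules above can be checked directly. The following is a minimal sketch, not part of the original notebook: it lists the shared column names, then uses an outer merge with `indicator=True` to flag the rows that the default merge drops. The variable names `shared`, `audit`, and `unmatched` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})

# Columns whose names appear in both data frames (the default "glue")
shared = df1.columns.intersection(df2.columns).tolist()
print(shared)

# Default merge: only keys present in both data frames survive
merged = df1.merge(df2)
print(sorted(merged['key'].unique()))

# An outer merge with indicator=True adds a `_merge` column saying
# where each row came from: 'both', 'left_only', or 'right_only'
audit = df1.merge(df2, how='outer', indicator=True)
unmatched = audit[audit['_merge'] != 'both']
print(unmatched[['key', '_merge']])
```

The rows listed in `unmatched` are exactly the ones that disappear from `df1.merge(df2)`.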

In [None]:
df3 = pd.DataFrame({'key': ['a', 'b', 'b'],
                    'data2': range(3)})
df3

In [None]:
# Can you predict the resulting data frame?
df1.merge(df3)
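
One way to predict the size of this result: each key contributes (number of rows with that key in df1) × (number of rows with that key in df3) rows. The sketch below, which is an illustration rather than part of the original notebook, computes that count with `value_counts()` and compares it with the actual merge. The names `left_counts`, `right_counts`, and `expected_rows` are made up for this example.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df3 = pd.DataFrame({'key': ['a', 'b', 'b'],
                    'data2': range(3)})

# How often each key occurs on each side
left_counts = df1['key'].value_counts()
right_counts = df3['key'].value_counts()

# Multiplying aligns the two Series by key; keys missing on one side
# become NaN and are dropped, just as the inner merge drops them
expected_rows = int((left_counts * right_counts).dropna().sum())

merged = df1.merge(df3, on='key')
print(expected_rows, len(merged))
```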

**It is good practice to specify explicitly which column(s) to join on.**

In [None]:
pd.merge(df1, df2, on='key')

In [None]:
df1.merge(df3, on='key')

### 2. What if the column to join has different names in the two data frames?

In [None]:
homework = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Clare'],
    'Hw1': [100, 90, 80],
    'Hw2': [60, 70, 80]
})
homework

In [None]:
exam = pd.DataFrame({
    "Full Name": ['Alice', 'Bob', 'Clare'],
    "Midterm": [70, 80, 90],
    "Final": [85, 65, 75]
})
exam

In [None]:
# pd.merge(homework, exam)
# This raises a MergeError: the two data frames share no column names,
# so pandas cannot infer which columns to join on.

In [None]:
pd.merge(homework, exam, left_on="Name", right_on="Full Name")
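
When the key columns have different names and `left_on`/`right_on` are used, the result keeps both key columns. The sketch below shows two possible ways to end up with a single key column: dropping the redundant one after the merge, or renaming it before the merge. Neither is prescribed by the original notebook; the names `tidy` and `renamed` are illustrative.

```python
import pandas as pd

homework = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Clare'],
    'Hw1': [100, 90, 80],
    'Hw2': [60, 70, 80]
})
exam = pd.DataFrame({
    'Full Name': ['Alice', 'Bob', 'Clare'],
    'Midterm': [70, 80, 90],
    'Final': [85, 65, 75]
})

# Both key columns survive the merge
merged = pd.merge(homework, exam, left_on='Name', right_on='Full Name')
print(merged.columns.tolist())

# Option 1: drop the redundant key column afterwards
tidy = merged.drop(columns='Full Name')

# Option 2: rename the key first, then merge on the common name
renamed = pd.merge(homework, exam.rename(columns={'Full Name': 'Name'}),
                   on='Name')
print(tidy.columns.tolist())
print(renamed.columns.tolist())
```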

### 3. What if the column to join has different values?

In [None]:
homework = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Clare', 'David'],
    'Hw1': [100, 90, 80, 70],
    'Hw2': [60, 70, 80, 90]
})
homework

In [None]:
exam = pd.DataFrame({
    "Full Name": ['Alice', 'Bob', 'Clare', 'Eli'],
    "Midterm": [70, 80, 90, 100],
    "Final": [85, 65, 75, 55]
})
exam

In [None]:
# By default (how='inner'), merge drops rows whose key has no match in the other data frame
pd.merge(homework, exam,
         left_on="Name",
         right_on="Full Name")

Different join types with the `how` argument:
- inner: Use only the key combinations observed in both tables (the default)
- outer: Use all key combinations observed in either table
- left: Use all key combinations found in the first (left) data frame
- right: Use all key combinations found in the second (right) data frame
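
As a quick comparison of the four join types, the sketch below counts the rows each one produces for the `homework` and `exam` data frames defined above. It is an illustration only; the dictionary name `row_counts` is made up for this example.

```python
import pandas as pd

homework = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Clare', 'David'],
    'Hw1': [100, 90, 80, 70],
    'Hw2': [60, 70, 80, 90]
})
exam = pd.DataFrame({
    'Full Name': ['Alice', 'Bob', 'Clare', 'Eli'],
    'Midterm': [70, 80, 90, 100],
    'Final': [85, 65, 75, 55]
})

# Run the same merge with each join type and record the row count
row_counts = {}
for how in ['inner', 'outer', 'left', 'right']:
    result = pd.merge(homework, exam,
                      left_on='Name', right_on='Full Name', how=how)
    row_counts[how] = len(result)

print(row_counts)
```

David appears only in `homework` and Eli only in `exam`, so the counts differ by exactly those two rows.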

In [None]:
pd.merge(homework, exam, left_on="Name", right_on="Full Name",
         how='outer')

In [None]:
pd.merge(homework, exam,
         left_on="Name",
         right_on="Full Name",
         how="left")

In [None]:
pd.merge(homework, exam,
         left_on="Name",
         right_on="Full Name",
         how="right")

### 4. What if we want to join on multiple columns?

In [None]:
homework = pd.DataFrame({
    'Semester': ['Fall 2018', 'Fall 2018', 'Fall 2019', 'Fall 2019'],
    'Name': ['Alice', 'Bob', 'Clare', 'Alice'],
    'Hw1': [50, 90, 80, 70],
    'Hw2': [60, 70, 80, 90]
})
homework

In [None]:
exam = pd.DataFrame({
    'When': ['Fall 2018', 'Fall 2018', 'Fall 2019', 'Fall 2019'],
    "Name": ['Alice', 'Bob', 'Clare', 'Alice'],
    "Midterm": [60, 80, 90, 100],
    "Final": [45, 65, 75, 55]
})
exam

In [None]:
# Joining on Name alone also pairs Alice's Fall 2018 row with her Fall 2019 row
pd.merge(homework, exam, on='Name')

In [None]:
pd.merge(homework, exam, left_on=['Semester', 'Name'],
         right_on=['When', 'Name'])  # order matters: keys are paired by position
#          right_on=["Name", "When"])  # would compare Semester with Name

In [None]:
exam2 = exam.copy()
# exam2.columns = ['Final', 'Midterm', 'Name', 'Semester']
exam2.columns = ["Semester", "Name", "Midterm", "Final"]
exam2

In [None]:
pd.merge(homework, exam2, on=['Semester', 'Name'])
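
Overwriting `columns` wholesale depends on knowing the exact column order. As a possible alternative, the sketch below uses `rename`, which only touches the column that is named. It also compares the row counts of the two-key merge and the name-only merge. This is an illustration, not the notebook's own approach; `exam_renamed`, `by_both`, and `by_name` are made-up names.

```python
import pandas as pd

homework = pd.DataFrame({
    'Semester': ['Fall 2018', 'Fall 2018', 'Fall 2019', 'Fall 2019'],
    'Name': ['Alice', 'Bob', 'Clare', 'Alice'],
    'Hw1': [50, 90, 80, 70],
    'Hw2': [60, 70, 80, 90]
})
exam = pd.DataFrame({
    'When': ['Fall 2018', 'Fall 2018', 'Fall 2019', 'Fall 2019'],
    'Name': ['Alice', 'Bob', 'Clare', 'Alice'],
    'Midterm': [60, 80, 90, 100],
    'Final': [45, 65, 75, 55]
})

# Rename only the mismatched key column
exam_renamed = exam.rename(columns={'When': 'Semester'})

# Two keys: each (Semester, Name) pair matches exactly one exam row
by_both = pd.merge(homework, exam_renamed, on=['Semester', 'Name'])

# One key: Alice's two semesters are cross-matched with each other
by_name = pd.merge(homework, exam, on='Name')

print(len(by_both), len(by_name))
```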

### 5. What if there are overlapping columns?

In [None]:
homework = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Clare', 'David'],
    'Hw1': [100, 90, 80, 70],
    'Hw2': [60, 70, 80, 90],
    'Average': [80, 80, 80, 80]
})
homework

In [None]:
exam = pd.DataFrame({
    "Name": ['Alice', 'Bob', 'Clare', 'Eva'],
    "Midterm": [60, 80, 90, 100],
    "Final": [45, 65, 75, 55],
    "Average": [52.5, 72.5, 82.5, 77.5]
})
exam

In [None]:
pd.merge(homework, exam)  # Wrong approach: this joins on both shared columns, Name and Average

In [None]:
pd.merge(homework, exam, on='Name', how='outer')

In [None]:
pd.merge(homework, exam, on='Name', suffixes=('_hw', '_ex'), how='outer')
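
To see what `suffixes` changes, the sketch below prints the column names with and without it. It is a small illustration using the same data as above; the names `default` and `labeled` are made up.

```python
import pandas as pd

homework = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Clare', 'David'],
    'Hw1': [100, 90, 80, 70],
    'Hw2': [60, 70, 80, 90],
    'Average': [80, 80, 80, 80]
})
exam = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Clare', 'Eva'],
    'Midterm': [60, 80, 90, 100],
    'Final': [45, 65, 75, 55],
    'Average': [52.5, 72.5, 82.5, 77.5]
})

# Without suffixes, pandas disambiguates the overlapping column
# with the default suffixes '_x' (left) and '_y' (right)
default = pd.merge(homework, exam, on='Name', how='outer')
print(default.columns.tolist())

# With suffixes, the column names say where each average came from
labeled = pd.merge(homework, exam, on='Name', how='outer',
                   suffixes=('_hw', '_ex'))
print(labeled.columns.tolist())
```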

### 6. What if we want to merge on index?

In [None]:
homework = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Clare', 'David'],
    'Hw1': [100, 90, 80, 70],
    'Hw2': [60, 70, 80, 90],
    'Average': [80, 80, 80, 80]
}, index=[111, 222, 333, 444])
homework

In [None]:
exam = pd.DataFrame({
    "Name": ['Alice', 'Bob', 'Clare', 'Eva'],
    "Midterm": [60, 80, 90, 100],
    "Final": [45, 65, 75, 55],
    "Average": [52.5, 72.5, 82.5, 77.5]
})
exam = exam.set_index('Name')
exam

In [None]:
pd.merge(homework, exam, left_on='Name', right_index=True)
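
For comparison, the same rows can be matched by first turning the index of `exam` back into an ordinary column with `reset_index()` and then merging on `Name`. The sketch below is an illustration under that assumption; the two results contain the same matched rows, although their row labels may differ. The names `on_index` and `on_column` are made up.

```python
import pandas as pd

homework = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Clare', 'David'],
    'Hw1': [100, 90, 80, 70],
    'Hw2': [60, 70, 80, 90],
    'Average': [80, 80, 80, 80]
}, index=[111, 222, 333, 444])
exam = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Clare', 'Eva'],
    'Midterm': [60, 80, 90, 100],
    'Final': [45, 65, 75, 55],
    'Average': [52.5, 72.5, 82.5, 77.5]
}).set_index('Name')

# Match the 'Name' column of homework against the index of exam
on_index = pd.merge(homework, exam, left_on='Name', right_index=True)

# Equivalent matching after moving the index back into a column
on_column = pd.merge(homework, exam.reset_index(), on='Name')

print(on_index['Name'].tolist())
print(on_column['Name'].tolist())
```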

## II. Concatenations

### 1. Concatenating NumPy Arrays
My personal favorites are `np.hstack()` for horizontal concatenation and `np.vstack()` for vertical concatenation.

In [None]:
arr1 = np.arange(12).reshape([3, 4])
print(arr1)

In [None]:
arr2 = np.arange(10, 90, 10).reshape([2, 4])
print(arr2)

In [None]:
print(np.vstack([arr1, arr2]))

In [None]:
arr3 = np.arange(100, 10, -10).reshape([3, 3])
print(arr3)

In [None]:
print(np.hstack([arr1, arr3]))
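
The two functions have different shape requirements: `np.vstack()` needs the arrays to have the same number of columns, and `np.hstack()` needs the same number of rows. The sketch below checks the resulting shapes, compares both calls with `np.concatenate()` along the matching axis, and shows what happens when the shapes are incompatible. The variable names are illustrative.

```python
import numpy as np

arr1 = np.arange(12).reshape([3, 4])              # 3 rows, 4 columns
arr2 = np.arange(10, 90, 10).reshape([2, 4])      # 2 rows, 4 columns
arr3 = np.arange(100, 10, -10).reshape([3, 3])    # 3 rows, 3 columns

stacked_rows = np.vstack([arr1, arr2])   # column counts match (4 and 4)
stacked_cols = np.hstack([arr1, arr3])   # row counts match (3 and 3)
print(stacked_rows.shape)
print(stacked_cols.shape)

# For 2-D arrays, vstack/hstack agree with concatenate along axis 0/1
same_rows = np.array_equal(stacked_rows,
                           np.concatenate([arr1, arr2], axis=0))
same_cols = np.array_equal(stacked_cols,
                           np.concatenate([arr1, arr3], axis=1))
print(same_rows, same_cols)

# Mismatched shapes raise a ValueError
try:
    np.vstack([arr1, arr3])   # 4 columns vs. 3 columns
except ValueError as err:
    print('vstack failed:', err)
```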

### 2. Concatenating Data Frames

In [None]:
spring_records = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Clare', 'David'],
    'Homework': [60, 70, 80, 90],
    'Exam': [65, 75, 85, 95]
})
spring_records

In [None]:
fall_records = pd.DataFrame({
    'Name': ['Alice', 'Eva', 'Fred', 'Gabriel'],
    'Homework': [66, 77, 88, 99],
    'Exam': [69, 79, 89, 99]
})
fall_records

In [None]:
pd.concat([spring_records, fall_records])
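
Note that the stacked result keeps the original row labels, so the labels 0 to 3 each appear twice. The sketch below shows two optional arguments of `pd.concat` that deal with this: `ignore_index=True` renumbers the rows, and `keys` records which data frame each row came from. The labels `'Spring'` and `'Fall'` are chosen here for illustration.

```python
import pandas as pd

spring_records = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Clare', 'David'],
    'Homework': [60, 70, 80, 90],
    'Exam': [65, 75, 85, 95]
})
fall_records = pd.DataFrame({
    'Name': ['Alice', 'Eva', 'Fred', 'Gabriel'],
    'Homework': [66, 77, 88, 99],
    'Exam': [69, 79, 89, 99]
})

# Default: row labels are carried over, so they repeat
stacked = pd.concat([spring_records, fall_records])
print(stacked.index.tolist())

# ignore_index=True discards the old labels and renumbers from 0
renumbered = pd.concat([spring_records, fall_records], ignore_index=True)
print(renumbered.index.tolist())

# keys adds an outer index level naming the source of each row
labeled = pd.concat([spring_records, fall_records], keys=['Spring', 'Fall'])
print(labeled.loc['Fall'])
```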

In [None]:
pd.concat([spring_records, fall_records], axis=1)
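
With `axis=1`, `pd.concat` places the data frames side by side and lines rows up by index label, not by `Name`, so Bob's spring row ends up next to Eva's fall row. The sketch below shows this, and then one possible way to align by student instead: make `Name` the index before concatenating. This is an illustration; the names `side` and `by_name` and the `keys` labels are made up.

```python
import pandas as pd

spring_records = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Clare', 'David'],
    'Homework': [60, 70, 80, 90],
    'Exam': [65, 75, 85, 95]
})
fall_records = pd.DataFrame({
    'Name': ['Alice', 'Eva', 'Fred', 'Gabriel'],
    'Homework': [66, 77, 88, 99],
    'Exam': [69, 79, 89, 99]
})

# Rows are paired by index label 0..3; column names are duplicated
side = pd.concat([spring_records, fall_records], axis=1)
print(side.columns.tolist())

# Aligning on Name instead: students missing from a term get NaN
by_name = pd.concat([spring_records.set_index('Name'),
                     fall_records.set_index('Name')],
                    axis=1, keys=['Spring', 'Fall'])
print(by_name)
```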