# CHAPTER 8
# Data Wrangling: Join, Combine, and Reshape
- In many applications, data may be spread across a number of files or databases or be arranged in a form that is not easy to analyze. 
- This chapter focuses on tools to help combine, join, and rearrange data.

## Hierarchical Indexing
- **Hierarchical indexing** is an important feature of pandas that enables you to have multiple (two or more) index levels on an axis. 
- Somewhat abstractly, it provides a way for you to work with higher-dimensional data in a lower-dimensional form.

In [2]:
# Import libraries
import pandas as pd
import numpy as np

In [3]:
# Create a Series with a list of lists (or arrays) as the index
data = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])
data

# What you’re seeing is a prettified view of a Series with a MultiIndex as its index
# The “gaps” in the index display mean “use the label directly above”

a  1    0.737561
   2   -0.829516
   3    0.457601
b  1   -1.501524
   3   -0.419045
c  1   -0.177705
   2    0.298909
d  2    0.751331
   3    0.008016
dtype: float64

In [4]:
# Check the series index
data.index

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )

In [5]:
# With a hierarchically indexed object, so-called partial indexing is possible, enabling
# you to concisely select subsets of the data:
data['b']

1   -1.501524
3   -0.419045
dtype: float64

In [6]:
# Selection is even possible from an “inner” level
data.loc[:, 2]

a   -0.829516
c    0.298909
d    0.751331
dtype: float64
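
Partial indexing also works with label slices and lists of labels on the outer level, and `xs` offers an explicit way to take a cross-section on an inner level. The following is a minimal sketch rather than part of the original notebook; it assumes fixed integer values (hypothetical, chosen only so the results are reproducible) in place of the random data used above.

```python
import pandas as pd

# Hypothetical fixed values; the notebook's Series uses random numbers instead
data = pd.Series(
    [10, 20, 30, 40, 50, 60, 70, 80, 90],
    index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
           [1, 2, 3, 1, 3, 1, 2, 2, 3]],
)

# Slice a range of outer-level labels (both endpoints are included)
b_to_c = data.loc['b':'c']

# Select several outer-level labels at once
b_and_d = data.loc[['b', 'd']]

# Cross-section on the inner level; selects the same rows as data.loc[:, 2]
inner_2 = data.xs(2, level=1)

print(b_to_c)
print(b_and_d)
print(inner_2)
```

Label slicing on a MultiIndex generally expects the index to be sorted, which is the case here because the outer labels are already in order.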

In [7]:
# You could rearrange the data into a DataFrame using its unstack method
data.unstack()

          1         2         3
a  0.737561 -0.829516  0.457601
b -1.501524       NaN -0.419045
c -0.177705  0.298909       NaN
d       NaN  0.751331  0.008016


In [8]:
# The inverse operation of unstack is stack
data.unstack().stack()

a  1    0.737561
   2   -0.829516
   3    0.457601
b  1   -1.501524
   3   -0.419045
c  1   -0.177705
   2    0.298909
d  2    0.751331
   3    0.008016
dtype: float64

In [9]:
# With a DataFrame, either axis can have a hierarchical index
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],['Green', 'Red', 'Green']])
frame

     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11


In [10]:
# The hierarchical levels can have names
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11


In [11]:
# With partial column indexing you can similarly select groups of columns
frame['Ohio']

color      Green  Red
key1 key2
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10
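
A MultiIndex can also be built on its own, with level names attached up front, and then reused. The sketch below is an illustration added to these notes; it rebuilds the same `frame` as above using `pd.MultiIndex.from_arrays` instead of assigning names afterwards.

```python
import numpy as np
import pandas as pd

# Build the column and row indexes separately, naming the levels at creation
columns = pd.MultiIndex.from_arrays(
    [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
    names=['state', 'color'],
)
index = pd.MultiIndex.from_arrays(
    [['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    names=['key1', 'key2'],
)

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=index, columns=columns)

# Partial column indexing works the same way as before
ohio = frame['Ohio']
print(ohio)
```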


### Reordering and Sorting Levels
- The **swaplevel** method takes two level numbers or names and returns a new object with the levels interchanged (but the data is otherwise unaltered).

In [12]:
# Use swaplevel on our DataFrame
frame.swaplevel('key1', 'key2')

state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11


In [13]:
# sort_index sorts the data using only the values in a single level
frame.sort_index(level=1)

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11


In [14]:
# It is common when swapping levels to also use sort_index so that the result is
# lexicographically sorted by the indicated level
frame.swaplevel(0, 1).sort_index(level=0)

state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11


### Summary Statistics by Level
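- In older versions of pandas, many descriptive and summary statistics on DataFrame and Series have a **level** option in which you can specify the level you want to aggregate by on a particular axis. This option was deprecated in pandas 1.3 and removed in pandas 2.0.
- Under the hood, this utilizes pandas’s **groupby** machinery, which can also be called directly with **groupby(level=...)**.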

In [15]:
# Aggregate by key2
frame.sum(level='key2')

state  Ohio     Colorado
color Green Red    Green
key2
1         6   8       10
2        12  14       16


In [16]:
# Aggregate by color
frame.sum(level='color', axis=1)

color      Green  Red
key1 key2
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10
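
The `level` argument used in the two cells above is no longer accepted by reductions such as `sum` in pandas 2.0 and later. A minimal sketch of the same aggregations written with `groupby` follows; it rebuilds the hierarchically indexed `frame` from earlier so the block is self-contained, and transposing for the column-wise case is just one possible approach.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays(
        [['a', 'a', 'b', 'b'], [1, 2, 1, 2]], names=['key1', 'key2']),
    columns=pd.MultiIndex.from_arrays(
        [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
        names=['state', 'color']),
)

# Aggregate rows by the key2 index level
by_key2 = frame.groupby(level='key2').sum()

# Aggregate columns by the color level: transpose, group the rows, transpose back
by_color = frame.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```

Both results contain the same values as the outputs shown for the `level=` calls above.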


### Indexing with a DataFrame’s Columns
- It’s not unusual to want to use one or more columns from a DataFrame as the row index.
- Alternatively, you may wish to move the row index into the DataFrame’s columns.

In [17]:
# Create a DataFrame as an example
frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two',
                            'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})
frame

   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3


In [18]:
# Set a hierarchical index using set_index function
frame2 = frame.set_index(['c', 'd'])
frame2

       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1


In [19]:
# By default the columns are removed from the DataFrame, though you can leave them in
frame.set_index(['c', 'd'], drop=False)

       a  b    c  d
c   d
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3


In [20]:
# With reset_index the hierarchical index levels are moved into the columns
frame2.reset_index()

     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1


## Combining and Merging Datasets
- **pandas.merge** connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.
- **pandas.concat** concatenates or “stacks” together objects along an axis.
- The **combine_first** instance method enables splicing together overlapping data to fill in missing values in one object with values from another.

### Database-Style DataFrame Joins
- Merge or join operations combine datasets by linking rows using one or more keys.
- These operations are central to relational databases (e.g., SQL-based). 
- The merge function in pandas is the main entry point for using these algorithms on your data.

In [21]:
# Create 2 simple DataFrames
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})

df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})

In [22]:
# Check df1
df1

  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   a      5
6   b      6


In [23]:
# Check df2
df2

  key  data2
0   a      0
1   b      1
2   d      2


In [25]:
# This is an example of a many-to-one join; the data in df1 has multiple rows labeled a and b, 
# whereas df2 has only one row for each value in the key column
pd.merge(df1, df2, on = 'key')

# If the column to join on is not specified, merge uses the overlapping column names as the keys

  key  data1  data2
0   b      0      1
1   b      1      1
2   b      6      1
3   a      2      0
4   a      4      0
5   a      5      0


In [26]:
# If the column names are different in each object, you can specify them separately
df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})

df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
                    'data2': range(3)})

pd.merge(df3, df4, left_on='lkey', right_on='rkey')

  lkey  data1 rkey  data2
0    b      0    b      1
1    b      1    b      1
2    b      6    b      1
3    a      2    a      0
4    a      4    a      0
5    a      5    a      0


- By default **merge** does an **'inner'** join; the keys in the result are the intersection, or the common set found in both tables. 
- Other possible options are **'left'**, **'right'**, and **'outer'**. 
- The **outer** join takes the union of the keys, combining the effect of applying both left and right joins.

In [27]:
# Use the 'outer' join
pd.merge(df1, df2, how='outer')

  key  data1  data2
0   b    0.0    1.0
1   b    1.0    1.0
2   b    6.0    1.0
3   a    2.0    0.0
4   a    4.0    0.0
5   a    5.0    0.0
6   c    3.0    NaN
7   d    NaN    2.0
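
The `'left'` and `'right'` options mentioned above keep every key from one side only. The sketch below is an added illustration using the same `df1` and `df2` as the outer join; it compares which keys survive each join type, sorting the keys for display because row order can vary between pandas versions.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})

# Left join: keep every row of df1; unmatched rows get NaN in data2
left_join = pd.merge(df1, df2, on='key', how='left')

# Right join: keep every key of df2; unmatched rows get NaN in data1
right_join = pd.merge(df1, df2, on='key', how='right')

print(sorted(left_join['key'].unique()))   # keys present in df1
print(sorted(right_join['key'].unique()))  # keys present in df2
```

Key `'c'` appears only in the left join and key `'d'` only in the right join; the outer join above contains both.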


In [30]:
# Examples for many-to-many merges
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                    'data1': range(6)})

df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
                    'data2': range(5)})

In [31]:
# Check df1
df1

  key  data1
0   b      0
1   b      1
2   a      2
3   c      3
4   a      4
5   b      5


In [32]:
# Check df2
df2

  key  data2
0   a      0
1   b      1
2   a      2
3   b      3
4   d      4


- **Many-to-many** joins form the Cartesian product of the rows. 
- Since there were three 'b' rows in the left DataFrame and two in the right one, there are six 'b' rows in the result. 
- The join method only affects the distinct key values appearing in the result.

In [33]:
# Example of a many-to-many join
pd.merge(df1, df2, how='inner')

  key  data1  data2
0   b      0      1
1   b      0      3
2   b      1      1
3   b      1      3
4   b      5      1
5   b      5      3
6   a      2      0
7   a      2      2
8   a      4      0
9   a      4      2
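
The Cartesian-product behavior can be checked by counting rows per key, and an unintended many-to-many merge can be caught with the `validate` argument of `pd.merge`. This is a hedged sketch added to the notes, reusing the many-to-many `df1` and `df2` from above; the variable names are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                    'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
                    'data2': range(5)})

merged = pd.merge(df1, df2, on='key', how='inner')

# Rows per key in the result = (count on the left) * (count on the right);
# keys missing from either side become NaN in the product and are dropped
left_counts = df1['key'].value_counts()
right_counts = df2['key'].value_counts()
expected = (left_counts * right_counts).dropna().astype(int)

print(expected.to_dict())
print(merged['key'].value_counts().to_dict())

# validate='many_to_one' requires unique keys on the right side,
# so this merge raises MergeError because df2 repeats 'a' and 'b'
try:
    pd.merge(df1, df2, on='key', validate='many_to_one')
    duplicate_keys_detected = False
except pd.errors.MergeError:
    duplicate_keys_detected = True

print(duplicate_keys_detected)
```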


In [34]:
# To merge with multiple keys, pass a list of column names
left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})

right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})

pd.merge(left, right, on=['key1', 'key2'], how='outer')

  key1 key2  lval  rval
0  foo  one   1.0   4.0
1  foo  one   1.0   5.0
2  foo  two   2.0   NaN
3  bar  one   3.0   6.0
4  bar  two   NaN   7.0


In [35]:
# merge has a suffixes option for specifying strings to append to overlapping names in the left and right 
# DataFrame objects

pd.merge(left, right, on='key1', suffixes=('_left', '_right'))

  key1 key2_left  lval key2_right  rval
0  foo       one     1        one     4
1  foo       one     1        one     5
2  foo       two     2        one     4
3  foo       two     2        one     5
4  bar       one     3        one     6
5  bar       one     3        two     7


**TABLE**: merge function arguments

| Argument                  | Description |
| :---                  |    :----    |
|left| DataFrame to be merged on the left side.
|right| DataFrame to be merged on the right side.
|how| One of 'inner', 'outer', 'left', or 'right'; defaults to 'inner'.
|on| Column names to join on. Must be found in both DataFrame objects. If not specified and no other join keys given, will use the intersection of the column names in left and right as the join keys.
|left_on| Columns in left DataFrame to use as join keys.
|right_on| Analogous to left_on for right DataFrame.
|left_index| Use row index in left as its join key (or keys, if a MultiIndex).
|right_index| Analogous to left_index.
|sort| Sort merged data lexicographically by join keys; False by default in pandas.merge (leaving it disabled can give better performance on large datasets).
|suffixes| Tuple of string values to append to column names in case of overlap; defaults to ('_x', '_y') (e.g., if 'data' in both DataFrame objects, would appear as 'data_x' and 'data_y' in result). 
|copy| If False, avoid copying data into resulting data structure in some exceptional cases; by default always copies.
|indicator| Adds a special column _merge that indicates the source of each row; values will be 'left_only', 'right_only', or 'both' based on the origin of the joined data in each row.
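
As an illustration of the `indicator` argument from the table, the sketch below (an addition to these notes, using the first `df1` and `df2` from this section) runs an outer join and reads the origin of each key from the `_merge` column.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})

# indicator=True adds a categorical _merge column to the result
merged = pd.merge(df1, df2, on='key', how='outer', indicator=True)

# Map each key to the side(s) it came from
origin = dict(zip(merged['key'], merged['_merge'].astype(str)))

print(merged)
print(origin)
```

Filtering on `_merge == 'left_only'` is a common way to find rows in one table that have no match in the other.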

### Merging on Index