# CHAPTER 8
# Data Wrangling: Join, Combine, and Reshape
- In many applications, data may be spread across a number of files or databases or be arranged in a form that is not easy to analyze. 
- This chapter focuses on tools to help combine, join, and rearrange data.

## Hierarchical Indexing
- **Hierarchical indexing** is an important feature of pandas that enables you to have multiple (two or more) index levels on an axis. 
- Somewhat abstractly, it provides a way for you to work with higher dimensional data in a lower dimensional form.

In [1]:
# Import libraries
import pandas as pd
import numpy as np

In [2]:
# Create a Series with a list of lists (or arrays) as the index
data = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])
data

# What you’re seeing is a prettified view of a Series with a MultiIndex as its index
# The “gaps” in the index display mean “use the label directly above”

a  1   -0.437235
   2   -0.201246
   3    1.384400
b  1    1.468133
   3   -0.393745
c  1    1.154594
   2    0.246944
d  2   -0.724331
   3    0.553578
dtype: float64

In [3]:
# Check the series index
data.index

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )

In [4]:
# With a hierarchically indexed object, so-called partial indexing is possible, enabling
# you to concisely select subsets of the data:
data['b']

1    1.468133
3   -0.393745
dtype: float64

In [5]:
# Selection is even possible from an “inner” level
data.loc[:, 2]

a   -0.201246
c    0.246944
d   -0.724331
dtype: float64

In [6]:
# You could rearrange the data into a DataFrame using its unstack method
data.unstack()

Unnamed: 0,1,2,3
a,-0.437235,-0.201246,1.3844
b,1.468133,,-0.393745
c,1.154594,0.246944,
d,,-0.724331,0.553578


In [7]:
# The inverse operation of unstack is stack
data.unstack().stack()

a  1   -0.437235
   2   -0.201246
   3    1.384400
b  1    1.468133
   3   -0.393745
c  1    1.154594
   2    0.246944
d  2   -0.724331
   3    0.553578
dtype: float64

In [8]:
# With a DataFrame, either axis can have a hierarchical index
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],['Green', 'Red', 'Green']])
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [9]:
# The hierarchical levels can have names
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [10]:
# With partial column indexing you can similarly select groups of columns
frame['Ohio']

Unnamed: 0_level_0,color,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,0,1
a,2,3,4
b,1,6,7
b,2,9,10


### Reordering and Sorting Levels
- The **swaplevel** method takes two level numbers or names and returns a new object with the levels interchanged (the data is otherwise unaltered).

In [11]:
# Use swaplevel on our DataFrame
frame.swaplevel('key1', 'key2')

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
2,a,3,4,5
1,b,6,7,8
2,b,9,10,11


In [12]:
# sort_index sorts the data using only the values in a single level
frame.sort_index(level=1)

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
b,1,6,7,8
a,2,3,4,5
b,2,9,10,11


In [13]:
# It is common when swapping levels to also use sort_index so that the result is
# lexicographically sorted by the indicated level
frame.swaplevel(0, 1).sort_index(level=0)

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
1,b,6,7,8
2,a,3,4,5
2,b,9,10,11


### Summary Statistics by Level
- Many descriptive and summary statistics on DataFrame and Series have a **level** option in which you can specify the level you want to aggregate by on a particular axis.
- Under the hood, this utilizes pandas’s **groupby** machinery.

In [14]:
# Aggregate by key2
frame.sum(level='key2')

state,Ohio,Ohio,Colorado
color,Green,Red,Green
key2,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,6,8,10
2,12,14,16


In [15]:
# Aggregate by color
frame.sum(level='color', axis=1)

Unnamed: 0_level_0,color,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,2,1
a,2,8,4
b,1,14,7
b,2,20,10
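Since the level option is built on groupby, the same aggregations can be written with groupby directly. The sketch below rebuilds the hierarchical frame from earlier in this section and assumes a pandas version where `groupby(level=...)` is available; this form also matters because the `level` argument of `sum` was deprecated in pandas 1.3 and removed in 2.0. The names `by_key2` and `by_color` are illustrative only.

```python
import numpy as np
import pandas as pd

# Rebuild the hierarchically indexed frame used earlier in this section
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Row-wise aggregation by the 'key2' index level
by_key2 = frame.groupby(level='key2').sum()

# Column-wise aggregation by the 'color' level:
# transpose, group on the (now row) level, then transpose back
by_color = frame.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```

The transpose round trip is one way to group along the columns without relying on the `axis=1` option of groupby, which newer pandas releases discourage.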


### Indexing with a DataFrame’s columns
- It’s not unusual to want to use one or more columns from a DataFrame as the row index.
- Alternatively, you may wish to move the row index into the DataFrame’s columns.

In [16]:
# Create a DataFrame as an example
frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two',
                            'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})
frame

Unnamed: 0,a,b,c,d
0,0,7,one,0
1,1,6,one,1
2,2,5,one,2
3,3,4,two,0
4,4,3,two,1
5,5,2,two,2
6,6,1,two,3


In [17]:
# Set a hierarchical index using set_index function
frame2 = frame.set_index(['c', 'd'])
frame2

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1
one,0,0,7
one,1,1,6
one,2,2,5
two,0,3,4
two,1,4,3
two,2,5,2
two,3,6,1


In [18]:
# By default the columns are removed from the DataFrame, though you can leave them in
frame.set_index(['c', 'd'], drop=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b,c,d
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
one,0,0,7,one,0
one,1,1,6,one,1
one,2,2,5,one,2
two,0,3,4,two,0
two,1,4,3,two,1
two,2,5,2,two,2
two,3,6,1,two,3


In [19]:
# With reset_index the hierarchical index levels are moved into the columns
frame2.reset_index()

Unnamed: 0,c,d,a,b
0,one,0,0,7
1,one,1,1,6
2,one,2,2,5
3,two,0,3,4
4,two,1,4,3
5,two,2,5,2
6,two,3,6,1


## Combining and Merging Datasets
- **pandas.merge** connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.
- **pandas.concat** concatenates or “stacks” together objects along an axis.
- The **combine_first** instance method enables splicing together overlapping data to fill in missing values in one object with values from another.

### Database-Style DataFrame Joins
- Merge or join operations combine datasets by linking rows using one or more keys.
- These operations are central to relational databases (e.g., SQL-based). 
- The merge function in pandas is the main entry point for using these algorithms on your data.

In [20]:
# Create 2 simple DataFrames
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})

df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})

In [21]:
# Check df1
df1

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,a,5
6,b,6


In [22]:
# Check df2
df2

Unnamed: 0,key,data2
0,a,0
1,b,1
2,d,2


In [23]:
# This is an example of a many-to-one join; the data in df1 has multiple rows labeled a and b, 
# whereas df2 has only one row for each value in the key column
pd.merge(df1, df2, on = 'key')

# If the column to join on is not specified, merge uses the overlapping column names as the keys

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,1,1
2,b,6,1
3,a,2,0
4,a,4,0
5,a,5,0


In [24]:
# If the column names are different in each object, you can specify them separately
df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})

df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
                    'data2': range(3)})

pd.merge(df3, df4, left_on='lkey', right_on='rkey')

Unnamed: 0,lkey,data1,rkey,data2
0,b,0,b,1
1,b,1,b,1
2,b,6,b,1
3,a,2,a,0
4,a,4,a,0
5,a,5,a,0


- By default **merge** does an **'inner'** join; the keys in the result are the intersection, or the common set found in both tables. 
- Other possible options are **'left'**, **'right'**, and **'outer'**. 
- The **outer** join takes the union of the keys, combining the effect of applying both left and right joins.

In [25]:
# Use the 'outer' join
pd.merge(df1, df2, how='outer')

Unnamed: 0,key,data1,data2
0,b,0.0,1.0
1,b,1.0,1.0
2,b,6.0,1.0
3,a,2.0,0.0
4,a,4.0,0.0
5,a,5.0,0.0
6,c,3.0,
7,d,,2.0
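For comparison with the outer join, here is a minimal sketch of the 'left' and 'right' options on the same many-to-one data. It recreates the two frames so it can run on its own; the variable names `left_joined` and `right_joined` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})

# Left join: keep every row of df1; 'c' has no match in df2, so data2 is NaN there
left_joined = pd.merge(df1, df2, on='key', how='left')

# Right join: keep every key of df2; 'd' has no match in df1, so data1 is NaN there
right_joined = pd.merge(df1, df2, on='key', how='right')

print(left_joined)
print(right_joined)
```

Because unmatched rows are filled with NaN, the affected integer columns are converted to floating point, as in the outer join output.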


In [26]:
# Examples for many-to-many merges
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                    'data1': range(6)})

df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
                    'data2': range(5)})

In [27]:
# Check df1
df1

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [28]:
# Check df2
df2

Unnamed: 0,key,data2
0,a,0
1,b,1
2,a,2
3,b,3
4,d,4


- **Many-to-many** joins form the Cartesian product of the rows. 
- Since there were three 'b' rows in the left DataFrame and two in the right one, there are six 'b' rows in the result. 
- The join method only affects the distinct key values appearing in the result.

In [29]:
# Example of a many-to-many join
pd.merge(df1, df2, how='inner')

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,0,3
2,b,1,1
3,b,1,3
4,b,5,1
5,b,5,3
6,a,2,0
7,a,2,2
8,a,4,0
9,a,4,2
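The row counts of a many-to-many join can be checked by multiplying the per-key counts from each side. The sketch below is a rough check under that assumption and also shows that switching to how='left' only adds the unmatched key 'c'. The names `expected_rows`, `inner`, and `left` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                    'data1': range(6)})
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
                    'data2': range(5)})

left_counts = df1['key'].value_counts()
right_counts = df2['key'].value_counts()

# Multiply per-key counts; keys missing from either side become NaN and drop out
expected_rows = (left_counts * right_counts).dropna().astype(int)
print(expected_rows)

inner = pd.merge(df1, df2, on='key', how='inner')
left = pd.merge(df1, df2, on='key', how='left')

# The inner join has one row per matching pair; the left join adds the
# single unmatched 'c' row from df1
print(len(inner), len(left))
```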


In [30]:
# To merge with multiple keys, pass a list of column names
left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})

right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})

pd.merge(left, right, on=['key1', 'key2'], how='outer')

Unnamed: 0,key1,key2,lval,rval
0,foo,one,1.0,4.0
1,foo,one,1.0,5.0
2,foo,two,2.0,
3,bar,one,3.0,6.0
4,bar,two,,7.0


In [31]:
# merge has a suffixes option for specifying strings to append to overlapping names in the left and right 
# DataFrame objects

pd.merge(left, right, on='key1', suffixes=('_left', '_right'))

Unnamed: 0,key1,key2_left,lval,key2_right,rval
0,foo,one,1,one,4
1,foo,one,1,one,5
2,foo,two,2,one,4
3,foo,two,2,one,5
4,bar,one,3,one,6
5,bar,one,3,two,7


**TABLE**: merge function arguments

| Argument                  | Description |
| :---                  |    :----    |
|left| DataFrame to be merged on the left side.
|right| DataFrame to be merged on the right side.
|how| One of 'inner', 'outer', 'left', or 'right'; defaults to 'inner'.
|on| Column names to join on. Must be found in both DataFrame objects. If not specified and no other join keys given, will use the intersection of the column names in left and right as the join keys.
|left_on| Columns in left DataFrame to use as join keys.
|right_on| Analogous to left_on for right DataFrame.
|left_index| Use row index in left as its join key (or keys, if a MultiIndex).
|right_index| Analogous to left_index.
|sort| Sort merged data lexicographically by join keys; False by default in current pandas releases (leaving it disabled can give better performance on large datasets).
|suffixes| Tuple of string values to append to column names in case of overlap; defaults to ('_x', '_y') (e.g., if 'data' in both DataFrame objects, would appear as 'data_x' and 'data_y' in result). 
|copy| If False, avoid copying data into resulting data structure in some exceptional cases; by default always copies.
|indicator| Adds a special column _merge that indicates the source of each row; values will be 'left_only', 'right_only', or 'both' based on the origin of the joined data in each row.
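Two of the arguments in the table, `suffixes` (left at its default) and `indicator`, are easier to see in action. This is a small sketch with made-up frames `left` and `right` that share a non-key column named 'data'; the frame contents are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frames that overlap on both 'key' and 'data'
left = pd.DataFrame({'key': ['b', 'a', 'c'], 'data': [0, 1, 2]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'data': [10, 11, 12]})

# Joining on 'key' only: the overlapping 'data' column gets the default
# suffixes, and indicator=True adds a _merge column naming each row's source
merged = pd.merge(left, right, on='key', how='outer', indicator=True)
print(merged)

# Map each key to where it came from
sources = dict(zip(merged['key'], merged['_merge'].astype(str)))
print(sources)
```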

### Merging on Index
- In some cases, the merge key(s) in a DataFrame will be found in its index. 
- In this case, you can pass **left_index=True** or **right_index=True** (or both) to indicate that the index should be used as the merge key.

In [32]:
# Create 2 DataFrames as an example
left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'], 'value': range(6)})

right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

In [33]:
# Check first DataFrame
left1

Unnamed: 0,key,value
0,a,0
1,b,1
2,a,2
3,a,3
4,b,4
5,c,5


In [34]:
# Check second DataFrame
right1

Unnamed: 0,group_val
a,3.5
b,7.0


In [35]:
# Merge the 2 DataFrames
pd.merge(left1, right1, left_on='key', right_index=True)

Unnamed: 0,key,value,group_val
0,a,0,3.5
2,a,2,3.5
3,a,3,3.5
1,b,1,7.0
4,b,4,7.0


In [37]:
# Hierarchically indexed data
lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio',
                               'Nevada', 'Nevada'],
                      'key2': [2000, 2001, 2002, 2001, 2002],
                      'data': np.arange(5.)})

righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
                      index=[['Nevada', 'Nevada', 'Ohio', 'Ohio',
                              'Ohio', 'Ohio'],
                             [2001, 2000, 2000, 2000, 2001, 2002]],
                      columns=['event1', 'event2'])

In [38]:
lefth

Unnamed: 0,key1,key2,data
0,Ohio,2000,0.0
1,Ohio,2001,1.0
2,Ohio,2002,2.0
3,Nevada,2001,3.0
4,Nevada,2002,4.0


In [39]:
righth

Unnamed: 0,Unnamed: 1,event1,event2
Nevada,2001,0,1
Nevada,2000,2,3
Ohio,2000,4,5
Ohio,2000,6,7
Ohio,2001,8,9
Ohio,2002,10,11


In [40]:
# You have to indicate multiple columns to merge on as a list
pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)

Unnamed: 0,key1,key2,data,event1,event2
0,Ohio,2000,0.0,4,5
0,Ohio,2000,0.0,6,7
1,Ohio,2001,1.0,8,9
2,Ohio,2002,2.0,10,11
3,Nevada,2001,3.0,0,1


In [41]:
# Handling of duplicate index values with how='outer'
pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True, how='outer')

Unnamed: 0,key1,key2,data,event1,event2
0,Ohio,2000,0.0,4.0,5.0
0,Ohio,2000,0.0,6.0,7.0
1,Ohio,2001,1.0,8.0,9.0
2,Ohio,2002,2.0,10.0,11.0
3,Nevada,2001,3.0,0.0,1.0
4,Nevada,2002,4.0,,
4,Nevada,2000,,2.0,3.0


In [42]:
# Create new DataFrame examples
left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                     index=['a', 'c', 'e'],
                     columns=['Ohio', 'Nevada'])

right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13, 14]],
                      index=['b', 'c', 'd', 'e'],
                      columns=['Missouri', 'Alabama'])

In [43]:
left2

Unnamed: 0,Ohio,Nevada
a,1.0,2.0
c,3.0,4.0
e,5.0,6.0


In [44]:
right2

Unnamed: 0,Missouri,Alabama
b,7.0,8.0
c,9.0,10.0
d,11.0,12.0
e,13.0,14.0


In [45]:
# DataFrame has a convenient join instance method for merging by index
# It can be used to combine together many DataFrame objects having the 
# same or similar indexes but non-overlapping columns

left2.join(right2, how='outer')

Unnamed: 0,Ohio,Nevada,Missouri,Alabama
a,1.0,2.0,,
b,,,7.0,8.0
c,3.0,4.0,9.0,10.0
d,,,11.0,12.0
e,5.0,6.0,13.0,14.0


- In part for legacy reasons (i.e., much earlier versions of pandas), DataFrame’s **join method** performs a **left join** on the join keys, exactly preserving the left frame’s row index. 
- It also supports joining the index of the passed DataFrame on one of the columns of the calling DataFrame.

In [46]:
# Join the index of right1 on the 'key' column of left1
left1.join(right1, on='key')

Unnamed: 0,key,value,group_val
0,a,0,3.5
1,b,1,7.0
2,a,2,3.5
3,a,3,3.5
4,b,4,7.0
5,c,5,


In [47]:
# Create another DataFrame
another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [16., 17.]],
                       index=['a', 'c', 'e', 'f'],
                       columns=['New York', 'Oregon'])

In [48]:
another

Unnamed: 0,New York,Oregon
a,7.0,8.0
c,9.0,10.0
e,11.0,12.0
f,16.0,17.0


In [49]:
# For simple index-on-index merges, you can pass a list of DataFrames to join
left2.join([right2, another], how='outer')

Unnamed: 0,Ohio,Nevada,Missouri,Alabama,New York,Oregon
a,1.0,2.0,,,7.0,8.0
c,3.0,4.0,9.0,10.0,9.0,10.0
e,5.0,6.0,13.0,14.0,11.0,12.0
b,,,7.0,8.0,,
d,,,11.0,12.0,,
f,,,,,16.0,17.0


### Concatenating Along an Axis
- Another kind of data combination operation is referred to interchangeably as concatenation, binding, or stacking. 
- NumPy’s **concatenate** function can do this with NumPy arrays.
- In the context of **pandas objects** such as Series and DataFrame, having labeled axes enables you to further generalize array **concatenation**.

In [50]:
# Create a NumPy array
arr = np.arange(12).reshape((3, 4))
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [51]:
# Concatenate 2 arrays
np.concatenate([arr, arr], axis=1)

array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])

In [52]:
# Consider three Series with no index overlap
s1 = pd.Series([0, 1], index=['a', 'b'])

s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])

s3 = pd.Series([5, 6], index=['f', 'g'])

In [53]:
# Calling concat with these objects in a list glues together the values and indexes
pd.concat([s1, s2, s3])

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

In [54]:
# By default concat works along axis=0, producing another Series
# If you pass axis=1, the result will instead be a DataFrame (axis=1 is the columns)

pd.concat([s1, s2, s3], axis=1)

Unnamed: 0,0,1,2
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


In [56]:
# Create a 4th Series by concatenating s1 & s3
s4 = pd.concat([s1, s3])
s4

a    0
b    1
f    5
g    6
dtype: int64

In [58]:
# You can instead intersect the Series by passing join='inner'
pd.concat([s1, s4], axis=1, join = 'inner')

Unnamed: 0,0,1
a,0,0
b,1,1


In [60]:
# Create a hierarchical index on the concatenation axis
result = pd.concat([s1, s1, s3], keys=['one', 'two', 'three'])
result

one    a    0
       b    1
two    a    0
       b    1
three  f    5
       g    6
dtype: int64

In [61]:
# In the case of combining Series along axis=1
# the keys become the DataFrame column headers
pd.concat([s1, s2, s3], axis=1, keys=['one', 'two', 'three'])

Unnamed: 0,one,two,three
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


In [62]:
# The same logic extends to DataFrame objects
df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                   columns=['one', 'two'])

df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['three', 'four'])

In [63]:
df1

Unnamed: 0,one,two
a,0,1
b,2,3
c,4,5


In [64]:
df2

Unnamed: 0,three,four
a,5,6
c,7,8


In [65]:
# Concatenate & create hierarchical indexes
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])

Unnamed: 0_level_0,level1,level1,level2,level2
Unnamed: 0_level_1,one,two,three,four
a,0,1,5.0,6.0
b,2,3,,
c,4,5,7.0,8.0


In [66]:
# Example when the row index does not contain any relevant data
df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])

df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])

In [67]:
df1

Unnamed: 0,a,b,c,d
0,-0.294621,0.854409,1.281779,1.231783
1,0.013662,0.2574,0.629366,1.036198
2,0.058857,1.151239,-0.290635,0.565149


In [68]:
df2

Unnamed: 0,b,d,a
0,0.402679,-0.032163,-0.414444
1,-1.135253,0.415326,-0.097901


In [71]:
# You can pass ignore_index=True:
pd.concat([df1, df2], ignore_index=True)

Unnamed: 0,a,b,c,d
0,-0.294621,0.854409,1.281779,1.231783
1,0.013662,0.2574,0.629366,1.036198
2,0.058857,1.151239,-0.290635,0.565149
3,-0.414444,0.402679,,-0.032163
4,-0.097901,-1.135253,,0.415326


**TABLE**: concat function arguments

| Argument                  | Description |
| :---                  |    :----    |
|objs| List or dict of pandas objects to be concatenated; this is the only required argument
|axis| Axis to concatenate along; defaults to 0 (along rows) 
|join| Either 'inner' or 'outer' ('outer' by default); whether to intersect (inner) or union (outer) the indexes along the other axes
|join_axes| Specific indexes to use for the other n–1 axes instead of performing union/intersection logic (removed in pandas 1.0; reindex the result instead)
|keys| Values to associate with objects being concatenated, forming a hierarchical index along the concatenation axis; can either be a list or array of arbitrary values, an array of tuples, or a list of arrays (if multiple-level arrays passed in levels)
|levels| Specific indexes to use as hierarchical index level or levels if keys passed
|names| Names for created hierarchical levels if keys and/or levels passed
|verify_integrity| Check new axis in concatenated object for duplicates and raise exception if so; by default (False) allows duplicates
|ignore_index| Do not preserve indexes along concatenation axis, instead producing a new range(total_length) index
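A few of the arguments in the table are not exercised in the examples above. The following sketch, using the same small frames as the hierarchical-column example, shows passing a dict as `objs`, naming the created levels with `names`, and rejecting duplicate labels with `verify_integrity`. The level names 'upper' and 'lower' and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                   columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['three', 'four'])

# Passing a dict: its keys play the role of the keys argument
from_dict = pd.concat({'level1': df1, 'level2': df2}, axis=1)

# names labels the levels of the resulting hierarchical column index
named = pd.concat([df1, df2], axis=1, keys=['level1', 'level2'],
                  names=['upper', 'lower'])
print(named)

# verify_integrity=True raises ValueError when the new axis has duplicate labels
try:
    pd.concat([df1, df1], verify_integrity=True)
    duplicates_rejected = False
except ValueError:
    duplicates_rejected = True
print(duplicates_rejected)
```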

### Combining Data with Overlap
- There is another data combination situation that can’t be expressed as either a merge or concatenation operation. 
- You may have two datasets whose indexes overlap in full or part. 
- As a motivating example, consider NumPy’s **where** function, which performs the array-oriented equivalent of an if-else expression.
- Series has a **combine_first method**, which performs the equivalent of this operation along with pandas’s usual data alignment logic.
- With DataFrames, **combine_first** does the same thing column by column, so you can think of it as “patching” missing data in the calling object with data from the object you pass.

In [72]:
# Create 2 Series
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])

b = pd.Series(np.arange(len(a), dtype=np.float64),
              index=['f', 'e', 'd', 'c', 'b', 'a'])

In [73]:
a

f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64

In [74]:
b

f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    5.0
dtype: float64

In [75]:
# If values in a are NA use values from b
np.where(pd.isnull(a), b, a)

array([0. , 2.5, 2. , 3.5, 4.5, 5. ])
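The Series method combine_first does the same patching but aligns on index labels first, so the two objects do not need matching positions or lengths. A minimal sketch with the same two Series follows; `patched` and `partial` are illustrative names.

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series(np.arange(len(a), dtype=np.float64),
              index=['f', 'e', 'd', 'c', 'b', 'a'])

# Where a is missing, take the value from b (matched by label)
patched = a.combine_first(b)
print(patched)

# The two objects may cover different labels: values of b[:-2] win where
# present, and gaps are filled from a[2:] where a has data
partial = b[:-2].combine_first(a[2:])
print(partial)
```

Unlike np.where, which returns a plain array, the result here stays a Series, and labels present in only one object are kept.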

In [76]:
# Create 2 DataFrames example
df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan], 
                    'b': [np.nan, 2., np.nan, 6.],
                    'c': range(2, 18, 4)})

df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
                    'b': [np.nan, 3., 4., 6., 8.]})

In [77]:
df1

Unnamed: 0,a,b,c
0,1.0,,2
1,,2.0,6
2,5.0,,10
3,,6.0,14


In [78]:
df2

Unnamed: 0,a,b
0,5.0,
1,4.0,3.0
2,,4.0
3,3.0,6.0
4,7.0,8.0


In [79]:
# Combine df1 and df2
df1.combine_first(df2)

Unnamed: 0,a,b,c
0,1.0,,2.0
1,4.0,2.0,6.0
2,5.0,4.0,10.0
3,3.0,6.0,14.0
4,7.0,8.0,


## Reshaping and Pivoting
### Reshaping with Hierarchical Indexing
- Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:
    - **stack**: This “rotates” or pivots from the columns in the data to the rows.
    - **unstack**: This pivots from the rows into the columns.

In [80]:
# Create an example DataFrame
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'],
                                     name='number'))
data

number,one,two,three
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,0,1,2
Colorado,3,4,5


In [81]:
# Using the stack method on this data pivots the columns into the rows
# and produces a Series

result = data.stack()
result

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32

In [82]:
# From a hierarchically indexed Series, you can rearrange the data back into a 
# DataFrame with unstack
result.unstack()

number,one,two,three
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,0,1,2
Colorado,3,4,5


In [83]:
# You can unstack a different level by passing a level number or name
# Default is the innermost level
result.unstack(0)

state,Ohio,Colorado
number,Unnamed: 1_level_1,Unnamed: 2_level_1
one,0,3
two,1,4
three,2,5


In [84]:
# Create a new Series example
s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])

s2 = pd.Series([4, 5, 6], index=['c', 'd', 'e'])

data2 = pd.concat([s1, s2], keys=['one', 'two'])
data2

one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64

In [85]:
# Unstacking might introduce missing data if all of the values in the level 
# aren’t found in each of the subgroups
data2.unstack()

Unnamed: 0,a,b,c,d,e
one,0.0,1.0,2.0,3.0,
two,,,4.0,5.0,6.0


In [86]:
# Stacking filters out missing data by default, so the operation is more easily 
# invertible
data2.unstack().stack()

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64

### Pivoting “Long” to “Wide” Format
- A common way to store multiple time series in databases and CSV files is the so-called *long* or *stacked* format. 
- Let’s load some example data and do a small amount of time series wrangling and other data cleaning.

In [88]:
# Load the example data
data = pd.read_csv('examples/macrodata.csv')
data.head()

Unnamed: 0,year,quarter,realgdp,realcons,realinv,realgovt,realdpi,cpi,m1,tbilrate,unemp,pop,infl,realint
0,1959.0,1.0,2710.349,1707.4,286.898,470.045,1886.9,28.98,139.7,2.82,5.8,177.146,0.0,0.0
1,1959.0,2.0,2778.801,1733.7,310.859,481.301,1919.7,29.15,141.7,3.08,5.1,177.83,2.34,0.74
2,1959.0,3.0,2775.488,1751.8,289.226,491.26,1916.4,29.35,140.5,3.82,5.3,178.657,2.74,1.09
3,1959.0,4.0,2785.204,1753.7,299.356,484.052,1931.3,29.37,140.0,4.33,5.6,179.386,0.27,4.06
4,1960.0,1.0,2847.699,1770.5,331.722,462.199,1955.5,29.54,139.6,3.5,5.2,180.007,2.31,1.19


In [89]:
# Combine the 'year' & 'quarter' columns
periods = pd.PeriodIndex(year=data.year, quarter=data.quarter,name='date')
periods

PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3', '1960Q4', '1961Q1', '1961Q2',
             ...
             '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
             '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
            dtype='period[Q-DEC]', name='date', length=203, freq='Q-DEC')

In [90]:
# Select a few columns from our data to create an index object
columns = pd.Index(['realgdp', 'infl', 'unemp'], name='item')
columns

Index(['realgdp', 'infl', 'unemp'], dtype='object', name='item')

In [92]:
# Reindex our data using the index object created above
data = data.reindex(columns=columns)
data.head()

item,realgdp,infl,unemp
0,2710.349,0.0,5.8
1,2778.801,2.34,5.1
2,2775.488,2.74,5.3
3,2785.204,0.27,5.6
4,2847.699,2.31,5.2


In [96]:
# Set the index to end-of-quarter timestamps derived from the PeriodIndex created earlier
data.index = periods.to_timestamp('D', 'end')

In [94]:
# Create a new DataFrame with our new index
ldata = data.stack().reset_index().rename(columns={0: 'value'})
ldata.head()

Unnamed: 0,date,item,value
0,1959-03-31 23:59:59.999999999,realgdp,2710.349
1,1959-03-31 23:59:59.999999999,infl,0.0
2,1959-03-31 23:59:59.999999999,unemp,5.8
3,1959-06-30 23:59:59.999999999,realgdp,2778.801
4,1959-06-30 23:59:59.999999999,infl,2.34


- This is the so-called *long* format for multiple time series, or other observational data with two or more keys (here, our keys are date and item). Each row in the table represents a single observation.

In [99]:
# You might prefer to have a DataFrame containing one column per distinct 
# item value indexed by timestamps in the date column. 
# DataFrame’s pivot method performs exactly this transformation.
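# Keyword arguments are used because pandas 2.0 no longer accepts them positionally
pivoted = ldata.pivot(index='date', columns='item', values='value')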
pivoted.head()

item,infl,realgdp,unemp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-03-31 23:59:59.999999999,0.0,2710.349,5.8
1959-06-30 23:59:59.999999999,2.34,2778.801,5.1
1959-09-30 23:59:59.999999999,2.74,2775.488,5.3
1959-12-31 23:59:59.999999999,0.27,2785.204,5.6
1960-03-31 23:59:59.999999999,2.31,2847.699,5.2


In [100]:
# Suppose you had two value columns that you wanted to reshape simultaneously
ldata['value2'] = np.random.randn(len(ldata))
ldata.head()

Unnamed: 0,date,item,value,value2
0,1959-03-31 23:59:59.999999999,realgdp,2710.349,0.347358
1,1959-03-31 23:59:59.999999999,infl,0.0,-0.998778
2,1959-03-31 23:59:59.999999999,unemp,5.8,-2.128471
3,1959-06-30 23:59:59.999999999,realgdp,2778.801,1.292548
4,1959-06-30 23:59:59.999999999,infl,2.34,-0.369615


In [103]:
# You can reshape ldata so that you obtain a DataFrame with hierarchical columns
pivoted = ldata.pivot(index='date', columns='item')
pivoted.head()

Unnamed: 0_level_0,value,value,value,value2,value2,value2
item,infl,realgdp,unemp,infl,realgdp,unemp
date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
1959-03-31 23:59:59.999999999,0.0,2710.349,5.8,-0.998778,0.347358,-2.128471
1959-06-30 23:59:59.999999999,2.34,2778.801,5.1,-0.369615,1.292548,0.207686
1959-09-30 23:59:59.999999999,2.74,2775.488,5.3,-1.607259,-1.413891,0.543917
1959-12-31 23:59:59.999999999,0.27,2785.204,5.6,-1.047428,-1.453251,2.246303
1960-03-31 23:59:59.999999999,2.31,2847.699,5.2,0.958066,0.91449,0.519729


In [104]:
# Select the 'value' group of columns and show its first five rows
pivoted['value'][:5]

item,infl,realgdp,unemp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-03-31 23:59:59.999999999,0.0,2710.349,5.8
1959-06-30 23:59:59.999999999,2.34,2778.801,5.1
1959-09-30 23:59:59.999999999,2.74,2775.488,5.3
1959-12-31 23:59:59.999999999,0.27,2785.204,5.6
1960-03-31 23:59:59.999999999,2.31,2847.699,5.2
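It can help to see pivot as shorthand for building a hierarchical index with set_index and then reshaping with unstack. The sketch below uses a tiny made-up long-format frame (the dates and values are assumptions, not taken from macrodata.csv) so that it runs without the example file.

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, item) observation
ldata = pd.DataFrame({
    'date': pd.to_datetime(['2000-03-31', '2000-03-31',
                            '2000-06-30', '2000-06-30']),
    'item': ['infl', 'unemp', 'infl', 'unemp'],
    'value': [1.0, 4.0, 2.0, 5.0],
})

# pivot: one column per distinct item, indexed by date
pivoted = ldata.pivot(index='date', columns='item', values='value')

# The same reshape spelled out: index by both keys, then move 'item' to the columns
unstacked = ldata.set_index(['date', 'item'])['value'].unstack('item')

print(pivoted)
print(pivoted.equals(unstacked))
```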


### Pivoting “Wide” to “Long” Format
- An inverse operation to **pivot** for DataFrames is **pandas.melt**. 
- Rather than transforming one column into many in a new DataFrame, it merges multiple columns into one, producing a DataFrame that is longer than the input.

In [105]:
# Create an example DataFrame
df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                   'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})
df

Unnamed: 0,key,A,B,C
0,foo,1,4,7
1,bar,2,5,8
2,baz,3,6,9


In [106]:
# When using pandas.melt, we must indicate which columns (if any) are group 
# indicators. Let’s use 'key' as the only group indicator here.
melted = pd.melt(df, ['key'])
melted

Unnamed: 0,key,variable,value
0,foo,A,1
1,bar,A,2
2,baz,A,3
3,foo,B,4
4,bar,B,5
5,baz,B,6
6,foo,C,7
7,bar,C,8
8,baz,C,9
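To show that melt and pivot really are inverses, the melted frame can be pivoted back and the index returned to a column with reset_index. The sketch also shows `value_vars`, which limits which columns are melted. The names `reshaped`, `restored`, and `subset` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                   'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})

melted = pd.melt(df, id_vars=['key'])

# Pivot back to wide format; 'key' becomes the row index
reshaped = melted.pivot(index='key', columns='variable', values='value')

# Move 'key' back into an ordinary column
restored = reshaped.reset_index()
print(restored)

# Melt only a subset of the value columns
subset = pd.melt(df, id_vars=['key'], value_vars=['A', 'B'])
print(subset)
```

Note that the restored rows are ordered by key rather than in the original row order, since pivot sorts the index labels.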
