# pandas 06 - Working with multiple DataFrames


by Nova@Douban

A video recording of this session is available here: https://zoom.us/recording/share/jk1xzcRFDRE3SXnaPwb1UFZSx2ZEogg5Rl2ha7lA6xuwIumekTziMw

---


## 6.1 Joining and merging data

### 6.1.1 Joining data

In a relational database, tables can be joined on a key in four common ways: left, right, inner and outer. Which one to use depends on which keys should survive in the result.

`pd.DataFrame.join()` works the same way on the DataFrame index: the `how` parameter selects which index labels are kept.

| Merge method | SQL Join Name | Description |
| - | - | - |
| left | LEFT OUTER JOIN | Use keys from left frame only |
| right | RIGHT OUTER JOIN | Use keys from right frame only |
| outer | FULL OUTER JOIN | Use union of keys from both frames |
| inner | INNER JOIN | Use intersection of keys from both frames |

<img src="../image/join_venn.png">
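
The effect of `how` on the resulting index can be checked on a tiny example. The sketch below uses two hypothetical toy frames (the labels `d1` to `d5` are placeholders standing in for dates, not the S&P 500 data loaded below) and records which index labels survive each kind of join.

```python
import pandas as pd

# Hypothetical toy frames; the labels d1..d5 stand in for dates.
left = pd.DataFrame({"Close": [10.0, 11.0, 12.0]}, index=["d3", "d4", "d5"])
right = pd.DataFrame({"Close": [8.0, 9.0, 10.0]}, index=["d1", "d2", "d3"])

join_keys = {}
for how in ["left", "right", "inner", "outer"]:
    # Both frames have a "Close" column, so suffixes are needed to tell them apart.
    joined = left.join(right, how=how, lsuffix="_l", rsuffix="_r")
    join_keys[how] = sorted(joined.index)
    print(f"{how:>5}: {join_keys[how]}")
```

Only `d3` is present in both frames, so it is the single label kept by the inner join, while the outer join keeps all five labels.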

In [1]:
import pandas as pd
%load_ext memory_profiler

sp0 = pd.read_csv('../data/gspc.csv', index_col='Date')
sp1 = pd.read_csv('../data/gspc_.csv', index_col='Date')

display(sp0.head(), sp1.head())

Date,Open,High,Low,Close,Adj Close,Volume
2018-12-28,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000
2018-12-31,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000
2019-01-02,2476.959961,2519.48999,2467.469971,2510.030029,2510.030029,3733160000
2019-01-03,2491.919922,2493.139893,2443.959961,2447.889893,2447.889893,3822860000
2019-01-04,2474.330078,2538.070068,2474.330078,2531.939941,2531.939941,4213410000


Date,Open,High,Low,Close,Adj Close,Volume
2018-12-24,2400.560059,2410.340088,2351.100098,2351.100098,2351.100098,2613930000
2018-12-26,2363.120117,2467.76001,2346.580078,2467.699951,2467.699951,4233990000
2018-12-27,2442.5,2489.100098,2397.939941,2488.830078,2488.830078,4096610000
2018-12-28,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000
2018-12-31,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000


In [2]:
join_left = pd.DataFrame.join(sp0, sp1, how='left', lsuffix='_l', rsuffix='_r')
join_right = pd.DataFrame.join(sp0, sp1, how='right', lsuffix='_l', rsuffix='_r')
join_inner = pd.DataFrame.join(sp0, sp1, how='inner', lsuffix='_l', rsuffix='_r')
join_outer = pd.DataFrame.join(sp0, sp1, how='outer', lsuffix='_l', rsuffix='_r')

display(join_left, join_right, join_inner, join_outer)

Date,Open_l,High_l,Low_l,Close_l,Adj Close_l,Volume_l,Open_r,High_r,Low_r,Close_r,Adj Close_r,Volume_r
2018-12-28,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000.0
2018-12-31,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000.0
2019-01-02,2476.959961,2519.48999,2467.469971,2510.030029,2510.030029,3733160000,,,,,,
2019-01-03,2491.919922,2493.139893,2443.959961,2447.889893,2447.889893,3822860000,,,,,,
2019-01-04,2474.330078,2538.070068,2474.330078,2531.939941,2531.939941,4213410000,,,,,,


Date,Open_l,High_l,Low_l,Close_l,Adj Close_l,Volume_l,Open_r,High_r,Low_r,Close_r,Adj Close_r,Volume_r
2018-12-24,,,,,,,2400.560059,2410.340088,2351.100098,2351.100098,2351.100098,2613930000
2018-12-26,,,,,,,2363.120117,2467.76001,2346.580078,2467.699951,2467.699951,4233990000
2018-12-27,,,,,,,2442.5,2489.100098,2397.939941,2488.830078,2488.830078,4096610000
2018-12-28,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000.0,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000
2018-12-31,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000.0,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000


Date,Open_l,High_l,Low_l,Close_l,Adj Close_l,Volume_l,Open_r,High_r,Low_r,Close_r,Adj Close_r,Volume_r
2018-12-28,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000
2018-12-31,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000


Date,Open_l,High_l,Low_l,Close_l,Adj Close_l,Volume_l,Open_r,High_r,Low_r,Close_r,Adj Close_r,Volume_r
2018-12-24,,,,,,,2400.560059,2410.340088,2351.100098,2351.100098,2351.100098,2613930000.0
2018-12-26,,,,,,,2363.120117,2467.76001,2346.580078,2467.699951,2467.699951,4233990000.0
2018-12-27,,,,,,,2442.5,2489.100098,2397.939941,2488.830078,2488.830078,4096610000.0
2018-12-28,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000.0,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000.0
2018-12-31,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000.0,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000.0
2019-01-02,2476.959961,2519.48999,2467.469971,2510.030029,2510.030029,3733160000.0,,,,,,
2019-01-03,2491.919922,2493.139893,2443.959961,2447.889893,2447.889893,3822860000.0,,,,,,
2019-01-04,2474.330078,2538.070068,2474.330078,2531.939941,2531.939941,4213410000.0,,,,,,


### 6.1.2 Merging data

`pd.DataFrame.merge()` is often more convenient: by default it joins on the columns that the two DataFrames have in common, so no key needs to be specified.

In [3]:
merge_left = pd.DataFrame.merge(sp0, sp1, how='left')
merge_right = pd.DataFrame.merge(sp0, sp1, how='right')
merge_inner = pd.DataFrame.merge(sp0, sp1, how='inner')
merge_outer = pd.DataFrame.merge(sp0, sp1, how='outer')

display(merge_left, merge_right, merge_inner, merge_outer)

,Open,High,Low,Close,Adj Close,Volume
0,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000
1,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000
2,2476.959961,2519.48999,2467.469971,2510.030029,2510.030029,3733160000
3,2491.919922,2493.139893,2443.959961,2447.889893,2447.889893,3822860000
4,2474.330078,2538.070068,2474.330078,2531.939941,2531.939941,4213410000


,Open,High,Low,Close,Adj Close,Volume
0,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000
1,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000
2,2400.560059,2410.340088,2351.100098,2351.100098,2351.100098,2613930000
3,2363.120117,2467.76001,2346.580078,2467.699951,2467.699951,4233990000
4,2442.5,2489.100098,2397.939941,2488.830078,2488.830078,4096610000


,Open,High,Low,Close,Adj Close,Volume
0,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000
1,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000


,Open,High,Low,Close,Adj Close,Volume
0,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000
1,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000
2,2476.959961,2519.48999,2467.469971,2510.030029,2510.030029,3733160000
3,2491.919922,2493.139893,2443.959961,2447.889893,2447.889893,3822860000
4,2474.330078,2538.070068,2474.330078,2531.939941,2531.939941,4213410000
5,2400.560059,2410.340088,2351.100098,2351.100098,2351.100098,2613930000
6,2363.120117,2467.76001,2346.580078,2467.699951,2467.699951,4233990000
7,2442.5,2489.100098,2397.939941,2488.830078,2488.830078,4096610000


None of the examples above contains the original index column `Date`, because `merge` joins on columns and ignores the index of both DataFrames, so the result gets a fresh integer index. To keep `Date`, we need to:

1. reset the index first, so that `Date` becomes an ordinary column;
2. merge the DataFrames;
3. set `Date` as the index again.

In [4]:
merge_left = pd.DataFrame.merge(sp0.reset_index(), sp1, how='left').set_index('Date')
merge_right = pd.DataFrame.merge(sp0, sp1.reset_index(), how='right').set_index('Date')
merge_inner = pd.DataFrame.merge(sp0.reset_index(), sp1, how='inner').set_index('Date')
merge_outer = pd.DataFrame.merge(sp0.reset_index(), sp1.reset_index(), how='outer').set_index('Date')

display(merge_left, merge_right, merge_inner, merge_outer)

Date,Open,High,Low,Close,Adj Close,Volume
2018-12-28,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000
2018-12-31,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000
2019-01-02,2476.959961,2519.48999,2467.469971,2510.030029,2510.030029,3733160000
2019-01-03,2491.919922,2493.139893,2443.959961,2447.889893,2447.889893,3822860000
2019-01-04,2474.330078,2538.070068,2474.330078,2531.939941,2531.939941,4213410000


Date,Open,High,Low,Close,Adj Close,Volume
2018-12-28,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000
2018-12-31,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000
2018-12-24,2400.560059,2410.340088,2351.100098,2351.100098,2351.100098,2613930000
2018-12-26,2363.120117,2467.76001,2346.580078,2467.699951,2467.699951,4233990000
2018-12-27,2442.5,2489.100098,2397.939941,2488.830078,2488.830078,4096610000


Date,Open,High,Low,Close,Adj Close,Volume
2018-12-28,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000
2018-12-31,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000


Date,Open,High,Low,Close,Adj Close,Volume
2018-12-28,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000
2018-12-31,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000
2019-01-02,2476.959961,2519.48999,2467.469971,2510.030029,2510.030029,3733160000
2019-01-03,2491.919922,2493.139893,2443.959961,2447.889893,2447.889893,3822860000
2019-01-04,2474.330078,2538.070068,2474.330078,2531.939941,2531.939941,4213410000
2018-12-24,2400.560059,2410.340088,2351.100098,2351.100098,2351.100098,2613930000
2018-12-26,2363.120117,2467.76001,2346.580078,2467.699951,2467.699951,4233990000
2018-12-27,2442.5,2489.100098,2397.939941,2488.830078,2488.830078,4096610000
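
As a possible alternative to the reset/merge/set-index round trip, `pd.merge` can be told to match on the index itself through its `left_index` and `right_index` parameters. The sketch below assumes two hypothetical toy frames `a` and `b` with made-up prices. Unlike the examples above, which match rows on shared column values, this matches on `Date` only, so overlapping column names receive suffixes, much like `join`.

```python
import pandas as pd

# Hypothetical toy frames indexed by Date (values are made up).
idx_a = pd.Index(["2019-01-02", "2019-01-03"], name="Date")
idx_b = pd.Index(["2019-01-03", "2019-01-04"], name="Date")
a = pd.DataFrame({"Close": [1.0, 2.0]}, index=idx_a)
b = pd.DataFrame({"Close": [2.0, 3.0]}, index=idx_b)

# Match rows on the index of both frames instead of on shared columns.
merged_on_index = pd.merge(
    a, b,
    how="outer",
    left_index=True,
    right_index=True,
    suffixes=("_l", "_r"),
)
print(merged_on_index)
```

Dates that exist in only one frame get NaN in the columns coming from the other frame.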


### 6.1.3 Difference between merge and join

By default, `pd.merge` joins on the columns the two DataFrames share:

    > If joining columns on columns, the DataFrame indexes *will be ignored*.

while `DataFrame.join` joins on the index:

    > If you are joining on index only, you may wish to use DataFrame.join to save yourself some typing.

In the timings below, `join` is also faster than `merge`, so choose according to your needs.

In [5]:
%timeit join_inner = pd.DataFrame.join(sp0, sp1, how='inner', lsuffix='_l', rsuffix='_r')
%timeit merge_inner = pd.DataFrame.merge(sp0, sp1, how='inner')
%timeit merge_inner = pd.DataFrame.merge(sp0.reset_index(), sp1, how='inner').set_index('Date')

1.29 ms ± 234 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
2.58 ms ± 65 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.3 ms ± 61.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


---

## 6.2 Concatenating data

Concatenation in pandas is the process of either adding rows to the end of an existing Series or DataFrame object or adding additional columns to a DataFrame.

1. Concatenation automatically aligns columns with the same name;

2. If a column in the result does not exist in one of the objects being concatenated, NaN values will be filled in at those locations.

3. Duplicate row index labels can occur.

4. A concatenation of two or more DataFrame objects actually performs an __outer__ join operation along the index labels on the axis opposite to the one specified.

In [6]:
cct1 = pd.concat([sp0, sp1])
cct2 = pd.concat([sp0, sp1], axis=1, sort=False)

display(cct1, cct2)

Date,Open,High,Low,Close,Adj Close,Volume
2018-12-28,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000
2018-12-31,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000
2019-01-02,2476.959961,2519.48999,2467.469971,2510.030029,2510.030029,3733160000
2019-01-03,2491.919922,2493.139893,2443.959961,2447.889893,2447.889893,3822860000
2019-01-04,2474.330078,2538.070068,2474.330078,2531.939941,2531.939941,4213410000
2018-12-24,2400.560059,2410.340088,2351.100098,2351.100098,2351.100098,2613930000
2018-12-26,2363.120117,2467.76001,2346.580078,2467.699951,2467.699951,4233990000
2018-12-27,2442.5,2489.100098,2397.939941,2488.830078,2488.830078,4096610000
2018-12-28,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000
2018-12-31,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000


,Open,High,Low,Close,Adj Close,Volume,Open,High,Low,Close,Adj Close,Volume
2018-12-28,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000.0,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000.0
2018-12-31,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000.0,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000.0
2019-01-02,2476.959961,2519.48999,2467.469971,2510.030029,2510.030029,3733160000.0,,,,,,
2019-01-03,2491.919922,2493.139893,2443.959961,2447.889893,2447.889893,3822860000.0,,,,,,
2019-01-04,2474.330078,2538.070068,2474.330078,2531.939941,2531.939941,4213410000.0,,,,,,
2018-12-24,,,,,,,2400.560059,2410.340088,2351.100098,2351.100098,2351.100098,2613930000.0
2018-12-26,,,,,,,2363.120117,2467.76001,2346.580078,2467.699951,2467.699951,4233990000.0
2018-12-27,,,,,,,2442.5,2489.100098,2397.939941,2488.830078,2488.830078,4096610000.0
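
Points 2 to 4 of the list above are easier to see when the frames do not share all of their columns. The sketch below uses hypothetical toy frames `a` and `b` (made-up values) and contrasts the default outer behaviour with `join="inner"`, which keeps only the shared columns.

```python
import pandas as pd

# Hypothetical toy frames: only the "Open" column is shared.
a = pd.DataFrame({"Open": [1.0, 2.0], "Close": [1.5, 2.5]}, index=["d1", "d2"])
b = pd.DataFrame({"Open": [3.0], "Volume": [100.0]}, index=["d2"])

# Default join="outer": union of the columns, missing cells become NaN.
rows_outer = pd.concat([a, b], sort=False)

# join="inner": only the columns present in every frame are kept.
rows_inner = pd.concat([a, b], join="inner")

print(rows_outer)
print(rows_inner)
```

In both results the label `d2` appears twice, because concatenating along the rows does not deduplicate the index.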


### 6.2.2 Appending data

Use `DataFrame.append()` if you would like to:

1. concatenate two DataFrame objects along the row index labels;

2. get a result index without duplicates that still preserves all rows, by passing `ignore_index=True`.

Note that `DataFrame.append()` was deprecated in pandas 1.4 and removed in pandas 2.0; on recent versions, `pd.concat` is the replacement.

In [7]:
apd0 = sp0.append(sp1)
apd1 = sp0.append(sp1, ignore_index=True)
apd2 = sp0.append(sp1).drop_duplicates().sort_index()

display(apd0, apd1, apd2)

Date,Open,High,Low,Close,Adj Close,Volume
2018-12-28,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000
2018-12-31,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000
2019-01-02,2476.959961,2519.48999,2467.469971,2510.030029,2510.030029,3733160000
2019-01-03,2491.919922,2493.139893,2443.959961,2447.889893,2447.889893,3822860000
2019-01-04,2474.330078,2538.070068,2474.330078,2531.939941,2531.939941,4213410000
2018-12-24,2400.560059,2410.340088,2351.100098,2351.100098,2351.100098,2613930000
2018-12-26,2363.120117,2467.76001,2346.580078,2467.699951,2467.699951,4233990000
2018-12-27,2442.5,2489.100098,2397.939941,2488.830078,2488.830078,4096610000
2018-12-28,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000
2018-12-31,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000


,Open,High,Low,Close,Adj Close,Volume
0,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000
1,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000
2,2476.959961,2519.48999,2467.469971,2510.030029,2510.030029,3733160000
3,2491.919922,2493.139893,2443.959961,2447.889893,2447.889893,3822860000
4,2474.330078,2538.070068,2474.330078,2531.939941,2531.939941,4213410000
5,2400.560059,2410.340088,2351.100098,2351.100098,2351.100098,2613930000
6,2363.120117,2467.76001,2346.580078,2467.699951,2467.699951,4233990000
7,2442.5,2489.100098,2397.939941,2488.830078,2488.830078,4096610000
8,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000
9,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000


Date,Open,High,Low,Close,Adj Close,Volume
2018-12-24,2400.560059,2410.340088,2351.100098,2351.100098,2351.100098,2613930000
2018-12-26,2363.120117,2467.76001,2346.580078,2467.699951,2467.699951,4233990000
2018-12-27,2442.5,2489.100098,2397.939941,2488.830078,2488.830078,4096610000
2018-12-28,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000
2018-12-31,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000
2019-01-02,2476.959961,2519.48999,2467.469971,2510.030029,2510.030029,3733160000
2019-01-03,2491.919922,2493.139893,2443.959961,2447.889893,2447.889893,3822860000
2019-01-04,2474.330078,2538.070068,2474.330078,2531.939941,2531.939941,4213410000
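
On pandas versions that no longer provide `DataFrame.append`, the same three results can be sketched with `pd.concat`. The frames `a` and `b` below are hypothetical toy data. Note one deliberate difference: `drop_duplicates()` compares the values in each row, whereas this sketch removes repeated index labels with `Index.duplicated`, which is an assumption about what "duplicate" should mean here.

```python
import pandas as pd

# Hypothetical toy frames sharing the label "d2".
a = pd.DataFrame({"Close": [2.0, 3.0]}, index=pd.Index(["d2", "d3"], name="Date"))
b = pd.DataFrame({"Close": [1.0, 2.0]}, index=pd.Index(["d1", "d2"], name="Date"))

# Counterpart of a.append(b): rows of b are added below the rows of a.
stacked_rows = pd.concat([a, b])

# Counterpart of a.append(b, ignore_index=True): fresh 0..n-1 index.
renumbered = pd.concat([a, b], ignore_index=True)

# Keep the first row for each index label, then sort by label.
deduped = stacked_rows[~stacked_rows.index.duplicated(keep="first")].sort_index()

print(stacked_rows, renumbered, deduped, sep="\n\n")
```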


### 6.2.3 Performance comparison of concat, merge and append

In the following examples, `pd.concat` and `DataFrame.append` have similar performance.

In [8]:
%timeit cct1 = pd.concat([sp0, sp1])
%timeit apd2 = sp0.append(sp1)

444 µs ± 5.33 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
448 µs ± 9.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [9]:
%timeit merge_outer = pd.DataFrame.merge(sp0.reset_index(), sp1.reset_index(), how='outer').set_index('Date')
%timeit cct1 = pd.concat([sp0, sp1]).reset_index().drop_duplicates().set_index('Date').sort_index()
%timeit apd2 = sp0.append(sp1).drop_duplicates().sort_index()

display(merge_outer, cct1, apd2)

8.27 ms ± 21.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
6.68 ms ± 3.68 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.3 ms ± 85.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Date,Open,High,Low,Close,Adj Close,Volume
2018-12-28,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000
2018-12-31,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000
2019-01-02,2476.959961,2519.48999,2467.469971,2510.030029,2510.030029,3733160000
2019-01-03,2491.919922,2493.139893,2443.959961,2447.889893,2447.889893,3822860000
2019-01-04,2474.330078,2538.070068,2474.330078,2531.939941,2531.939941,4213410000
2018-12-24,2400.560059,2410.340088,2351.100098,2351.100098,2351.100098,2613930000
2018-12-26,2363.120117,2467.76001,2346.580078,2467.699951,2467.699951,4233990000
2018-12-27,2442.5,2489.100098,2397.939941,2488.830078,2488.830078,4096610000


Date,Open,High,Low,Close,Adj Close,Volume
2018-12-28,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000
2018-12-31,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000
2019-01-02,2476.959961,2519.48999,2467.469971,2510.030029,2510.030029,3733160000
2019-01-03,2491.919922,2493.139893,2443.959961,2447.889893,2447.889893,3822860000
2019-01-04,2474.330078,2538.070068,2474.330078,2531.939941,2531.939941,4213410000
2018-12-24,2400.560059,2410.340088,2351.100098,2351.100098,2351.100098,2613930000
2018-12-26,2363.120117,2467.76001,2346.580078,2467.699951,2467.699951,4233990000
2018-12-27,2442.5,2489.100098,2397.939941,2488.830078,2488.830078,4096610000
2018-12-28,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000
2018-12-31,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000


Date,Open,High,Low,Close,Adj Close,Volume
2018-12-24,2400.560059,2410.340088,2351.100098,2351.100098,2351.100098,2613930000
2018-12-26,2363.120117,2467.76001,2346.580078,2467.699951,2467.699951,4233990000
2018-12-27,2442.5,2489.100098,2397.939941,2488.830078,2488.830078,4096610000
2018-12-28,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000
2018-12-31,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000
2019-01-02,2476.959961,2519.48999,2467.469971,2510.030029,2510.030029,3733160000
2019-01-03,2491.919922,2493.139893,2443.959961,2447.889893,2447.889893,3822860000
2019-01-04,2474.330078,2538.070068,2474.330078,2531.939941,2531.939941,4213410000


---

## 6.3 Stacking and unstacking data

1. Stacking pivots a level of column labels to the row index. 

2. Unstacking performs the opposite, pivoting a level of the row index into the column index.

3. Stacking and unstacking do not lose any information, but change the means by which it is organized and accessed.

In [10]:
spsk = sp0.stack()
display(sp0, spsk, spsk.index)

Date,Open,High,Low,Close,Adj Close,Volume
2018-12-28,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000
2018-12-31,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000
2019-01-02,2476.959961,2519.48999,2467.469971,2510.030029,2510.030029,3733160000
2019-01-03,2491.919922,2493.139893,2443.959961,2447.889893,2447.889893,3822860000
2019-01-04,2474.330078,2538.070068,2474.330078,2531.939941,2531.939941,4213410000


Date                 
2018-12-28  Open         2.498770e+03
            High         2.520270e+03
            Low          2.472890e+03
            Close        2.485740e+03
            Adj Close    2.485740e+03
            Volume       3.702620e+09
2018-12-31  Open         2.498940e+03
            High         2.509240e+03
            Low          2.482820e+03
            Close        2.506850e+03
            Adj Close    2.506850e+03
            Volume       3.442870e+09
2019-01-02  Open         2.476960e+03
            High         2.519490e+03
            Low          2.467470e+03
            Close        2.510030e+03
            Adj Close    2.510030e+03
            Volume       3.733160e+09
2019-01-03  Open         2.491920e+03
            High         2.493140e+03
            Low          2.443960e+03
            Close        2.447890e+03
            Adj Close    2.447890e+03
            Volume       3.822860e+09
2019-01-04  Open         2.474330e+03
            High         2.5

MultiIndex(levels=[['2018-12-28', '2018-12-31', '2019-01-02', '2019-01-03', '2019-01-04'], ['Open', 'High', 'Low', 'Close', 'Adj Close', 'Volume']],
           labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4], [0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5, 0, 1, 2, 3, 4, 5]],
           names=['Date', None])

In [11]:
ussk0 = spsk.unstack(level=0)
ussk1 = spsk.unstack(level=1)
ussk2 = spsk.unstack(level='Date')

display(ussk0, ussk1, ussk2)

Date,2018-12-28,2018-12-31,2019-01-02,2019-01-03,2019-01-04
Open,2498.77,2498.94,2476.96,2491.92,2474.33
High,2520.27,2509.24,2519.49,2493.14,2538.07
Low,2472.89,2482.82,2467.47,2443.96,2474.33
Close,2485.74,2506.85,2510.03,2447.89,2531.94
Adj Close,2485.74,2506.85,2510.03,2447.89,2531.94
Volume,3702620000.0,3442870000.0,3733160000.0,3822860000.0,4213410000.0


Date,Open,High,Low,Close,Adj Close,Volume
2018-12-28,2498.77002,2520.27002,2472.889893,2485.73999,2485.73999,3702620000.0
2018-12-31,2498.939941,2509.23999,2482.820068,2506.850098,2506.850098,3442870000.0
2019-01-02,2476.959961,2519.48999,2467.469971,2510.030029,2510.030029,3733160000.0
2019-01-03,2491.919922,2493.139893,2443.959961,2447.889893,2447.889893,3822860000.0
2019-01-04,2474.330078,2538.070068,2474.330078,2531.939941,2531.939941,4213410000.0


Date,2018-12-28,2018-12-31,2019-01-02,2019-01-03,2019-01-04
Open,2498.77,2498.94,2476.96,2491.92,2474.33
High,2520.27,2509.24,2519.49,2493.14,2538.07
Low,2472.89,2482.82,2467.47,2443.96,2474.33
Close,2485.74,2506.85,2510.03,2447.89,2531.94
Adj Close,2485.74,2506.85,2510.03,2447.89,2531.94
Volume,3702620000.0,3442870000.0,3733160000.0,3822860000.0,4213410000.0
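
The claim that stacking and unstacking lose no information can be checked with a round trip. The sketch below assumes a small hypothetical frame `df` with made-up values and no missing data: stacking produces a Series with a two-level index, and unstacking the innermost level restores the original frame.

```python
import pandas as pd

# Hypothetical toy frame without missing values.
df = pd.DataFrame(
    {"High": [3.0, 4.0], "Low": [1.0, 2.0]},
    index=pd.Index(["d1", "d2"], name="Date"),
)

stacked = df.stack()           # Series indexed by (Date, column label)
roundtrip = stacked.unstack()  # innermost level goes back to the columns

print(stacked)
print(roundtrip.equals(df))
```

With missing values the round trip may not be exact, since stacking can drop NaN entries depending on the pandas version and options.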


### 6.3.2 Performance gain from `DataFrame.stack`

With stacked data, looking up a single value by its (row, column) label pair can be significantly faster:

In [12]:
%timeit spsk['2018-12-28', 'Volume']
%timeit spsk.loc['2018-12-28']['Volume']  # don't use this.
%timeit sp0.loc['2018-12-28']['Volume']

23 µs ± 970 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
370 µs ± 18.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
153 µs ± 2.61 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
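
The slow variants above use chained indexing, that is, two lookups in a row. A minimal sketch of single-step alternatives on a hypothetical toy frame follows; it only shows that the forms return the same value and makes no claim about their relative speed, which should be measured with `%timeit` on real data.

```python
import pandas as pd

# Hypothetical toy frame with made-up values.
df = pd.DataFrame(
    {"Close": [10.0, 11.0], "Volume": [100.0, 200.0]},
    index=pd.Index(["2018-12-28", "2018-12-31"], name="Date"),
)
stacked = df.stack()

# One lookup with a (row, column) tuple on the stacked Series.
from_stacked = stacked[("2018-12-28", "Volume")]

# Single-step lookups on the unstacked frame, no chained indexing.
from_loc = df.loc["2018-12-28", "Volume"]
from_at = df.at["2018-12-28", "Volume"]  # scalar accessor

print(from_stacked, from_loc, from_at)
```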


---

## 6.4 Melt and Pivot

Melting:

1. reshapes a DataFrame into a long format in which two or more columns are replaced by a pair of columns, referred to as variable and value;
2. puts the unpivoted column labels in the variable column;
3. moves the data from these columns into the appropriate location in the value column.

In [13]:
price_df = sp0[['High', 'Low', 'Open', 'Close']].reset_index()

mlsp = price_df.melt(id_vars=['Date'], var_name='Category', value_name='Price')

display(price_df, mlsp)

,Date,High,Low,Open,Close
0,2018-12-28,2520.27002,2472.889893,2498.77002,2485.73999
1,2018-12-31,2509.23999,2482.820068,2498.939941,2506.850098
2,2019-01-02,2519.48999,2467.469971,2476.959961,2510.030029
3,2019-01-03,2493.139893,2443.959961,2491.919922,2447.889893
4,2019-01-04,2538.070068,2474.330078,2474.330078,2531.939941


,Date,Category,Price
0,2018-12-28,High,2520.27002
1,2018-12-31,High,2509.23999
2,2019-01-02,High,2519.48999
3,2019-01-03,High,2493.139893
4,2019-01-04,High,2538.070068
5,2018-12-28,Low,2472.889893
6,2018-12-31,Low,2482.820068
7,2019-01-02,Low,2467.469971
8,2019-01-03,Low,2443.959961
9,2019-01-04,Low,2474.330078
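
The names "variable" and "value" mentioned above are the defaults that `melt` uses when `var_name` and `value_name` are not given. The sketch below, on a hypothetical toy frame `wide`, shows both the default and the renamed form; each melted column contributes one row per original row.

```python
import pandas as pd

# Hypothetical wide frame: one row per date, one column per price type.
wide = pd.DataFrame({"Date": ["d1", "d2"], "High": [3.0, 4.0], "Low": [1.0, 2.0]})

# Default column names: "variable" holds the old column labels, "value" the data.
long_default = wide.melt(id_vars=["Date"])

# Same reshape with explicit names, as in the cell above.
long_named = wide.melt(id_vars=["Date"], var_name="Category", value_name="Price")

print(long_default)
print(long_named)
```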


In [14]:
pvsp = mlsp.pivot(index='Date', columns='Category')

display(pvsp)

,Price,Price,Price,Price
Category,Close,High,Low,Open
Date,,,,
2018-12-28,2485.73999,2520.27002,2472.889893,2498.77002
2018-12-31,2506.850098,2509.23999,2482.820068,2498.939941
2019-01-02,2510.030029,2519.48999,2467.469971,2476.959961
2019-01-03,2447.889893,2493.139893,2443.959961,2491.919922
2019-01-04,2531.939941,2538.070068,2474.330078,2474.330078


---

## 6.5 Exercise

1. Read the official pandas [documentation](http://pandas.pydata.org/pandas-docs/stable/merging.html#database-style-dataframe-joining-merging)
2. Why is the result of `unstack(level=0)` equal to that of `unstack(level='Date')`?
3. How can you unmelt a melted table back to the original with `pd.pivot`?