# Data Wrangling

In [319]:
import numpy as np
import pandas as pd

In many applications, data may be spread across a number of files or databases, or be arranged in a form that is not convenient to analyze.

*Hierarchical indexing* is an important feature of pandas that enables you to have multiple index levels on an axis. Another way of thinking about it is that it provides a way for you to work with higher dimensional data in a lower dimensional form.

In [320]:
data = pd.Series(
    np.random.uniform(size=9),
    index=[["a", "a", "a", "b", "b", "c", "c", "d", "d"], [1, 2, 3, 1, 3, 1, 2, 2, 3]],
)
data

a  1    0.865680
   2    0.886500
   3    0.154269
b  1    0.129948
   3    0.533290
c  1    0.039811
   2    0.831738
d  2    0.848714
   3    0.019786
dtype: float64

In [321]:
data.index

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )

With a hierarchically indexed object, it is possible to concisely select subsets of the data by supplying only some of the index levels (partial indexing).

In [322]:
data["b"]

1    0.129948
3    0.533290
dtype: float64

In [323]:
data["b":"c"]

b  1    0.129948
   3    0.533290
c  1    0.039811
   2    0.831738
dtype: float64

In [324]:
data.loc[:, 2]

a    0.886500
c    0.831738
d    0.848714
dtype: float64
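A few more selection patterns can be sketched on a deterministic stand-in for the random Series above. The name `s` and its integer values are assumptions made only so the results are reproducible.

```python
import pandas as pd

# Deterministic stand-in for the random Series used in the text
s = pd.Series(
    range(9),
    index=[["a", "a", "a", "b", "b", "c", "c", "d", "d"], [1, 2, 3, 1, 3, 1, 2, 2, 3]],
)

outer_subset = s.loc[["b", "d"]]  # several outer labels at once
single_value = s.loc[("a", 2)]    # one fully specified (outer, inner) pair
inner_two = s.xs(2, level=1)      # inner label 2 across all outer labels

print(outer_subset)
print(single_value)
print(inner_two)
```

`xs` with `level=1` selects the same rows as `s.loc[:, 2]` and drops the selected level from the result.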

Hierarchical indexing plays an important role in reshaping data and in group-based operations like forming a pivot table. For example, you can rearrange this data into a DataFrame using its `unstack` method.

In [325]:
data.unstack()

Unnamed: 0,1,2,3
a,0.86568,0.8865,0.154269
b,0.129948,,0.53329
c,0.039811,0.831738,
d,,0.848714,0.019786


In [326]:
# the inverse of unstack is stack
data.unstack().stack()

a  1    0.865680
   2    0.886500
   3    0.154269
b  1    0.129948
   3    0.533290
c  1    0.039811
   2    0.831738
d  2    0.848714
   3    0.019786
dtype: float64

With a DataFrame, either axis can have a hierarchical index.

In [327]:
frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
    columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
)
frame

     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11


In [328]:
frame.index.names = ["key1", "key2"]
frame.columns.names = ["state", "color"]
frame

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11


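Keep in mind that these names are metadata about the levels: "key1" and "key2" name the row levels and "state" and "color" name the column levels, but none of them is part of the row labels (the `frame.index` values).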

In [329]:
# see how many levels an index has
frame.index.nlevels

2

In [330]:
frame["Ohio"]

color      Green  Red
key1 key2
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10


A `MultiIndex` can be created by itself and then reused; the columns in the preceding DataFrame with level names could also be created like this:

In [331]:
pd.MultiIndex.from_arrays(
    [["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]], names=["state", "color"]
)

MultiIndex([(    'Ohio', 'Green'),
            (    'Ohio',   'Red'),
            ('Colorado', 'Green')],
           names=['state', 'color'])
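As a minimal sketch of the "reused" part, the same `MultiIndex` object can label the columns of more than one DataFrame. The names `cols`, `first`, and `second` are illustrative only.

```python
import numpy as np
import pandas as pd

cols = pd.MultiIndex.from_arrays(
    [["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
    names=["state", "color"],
)

# The same MultiIndex labels the columns of two different frames
first = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=cols)
second = pd.DataFrame(np.ones((2, 3)), columns=cols)

print(list(first.columns.names))
print(first.columns.equals(second.columns))
print(first["Ohio"])
```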

At times you may need to rearrange the order of the levels on an axis or sort the data by the values in one specific level. The `swaplevel` method takes two level numbers or names and returns a new object with the levels interchanged (the data itself is unaltered).

In [332]:
frame.swaplevel("key1", "key2")

state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11


`sort_index` by default sorts the data lexicographically using all the index levels, but you can choose to use only a single level or a subset of levels to sort by passing the `level` argument.

In [333]:
frame.sort_index(level=1)

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11


In [334]:
frame.swaplevel(0, 1).sort_index(level=0)

state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11


Data selection performance is much better on hierarchically indexed objects if the index is lexicographically sorted starting with the outermost level, that is, the result of calling `sort_index(level=0)` or `sort_index()`.
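One way to check whether an index is already in that sorted state is its `is_monotonic_increasing` attribute. The sketch below uses a deliberately shuffled index; the frame and its column names `x`, `y`, `z` are made up for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[["b", "a", "b", "a"], [2, 1, 1, 2]],
    columns=["x", "y", "z"],
)

# The labels are out of lexicographic order
print(frame.index.is_monotonic_increasing)

# sort_index() sorts by all levels, starting with the outermost
sorted_frame = frame.sort_index()
print(sorted_frame.index.is_monotonic_increasing)
print(sorted_frame)
```

Sorting once up front and reusing the sorted object is usually simpler than sorting before every selection.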

Many descriptive and summary statistics on DataFrame and Series can be computed per index level: pass `level` to `groupby` to specify the level you want to aggregate by on a particular axis.

In [335]:
frame.groupby(level="key2").sum()

state  Ohio     Colorado
color Green Red    Green
key2
1         6   8       10
2        12  14       16


In [336]:
frame.groupby(level="color", axis="columns").sum()

color      Green  Red
key1 key2
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10
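Recent pandas releases (2.1 and later, as far as I am aware) deprecate the `axis` argument of `groupby`. A possible alternative, sketched here on a rebuilt copy of the same frame, is to transpose, group by the column level, and transpose back.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays(
        [["a", "a", "b", "b"], [1, 2, 1, 2]], names=["key1", "key2"]
    ),
    columns=pd.MultiIndex.from_arrays(
        [["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
        names=["state", "color"],
    ),
)

# After transposing, "color" is a row level, so a plain groupby applies
by_color = frame.T.groupby(level="color").sum().T
print(by_color)
```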


It's not unusual to want to use one or more columns from a DataFrame as the row index; alternatively, you may wish to move the row index into the DataFrame's columns.

In [337]:
frame = pd.DataFrame(
    {
        "a": range(7),
        "b": range(7, 0, -1),
        "c": ["one", "one", "one", "two", "two", "two", "two"],
        "d": [0, 1, 2, 0, 1, 2, 3],
    }
)
frame

Unnamed: 0,a,b,c,d
0,0,7,one,0
1,1,6,one,1
2,2,5,one,2
3,3,4,two,0
4,4,3,two,1
5,5,2,two,2
6,6,1,two,3


DataFrame's `set_index` method will create a new DataFrame using one or more of its columns as the index.

In [338]:
frame2 = frame.set_index(["c", "d"])
frame2

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1
one,0,0,7
one,1,1,6
one,2,2,5
two,0,3,4
two,1,4,3
two,2,5,2
two,3,6,1


By default the columns are removed from the DataFrame, though you can leave them in by passing `drop=False` to `set_index`.

In [339]:
frame.set_index(["c", "d"], drop=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b,c,d
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
one,0,0,7,one,0
one,1,1,6,one,1
one,2,2,5,one,2
two,0,3,4,two,0
two,1,4,3,two,1
two,2,5,2,two,2
two,3,6,1,two,3


In [340]:
# reset_index does the opposite to set_index
frame2.reset_index()

Unnamed: 0,c,d,a,b
0,one,0,0,7
1,one,1,1,6
2,one,2,2,5
3,two,0,3,4
4,two,1,4,3
5,two,2,5,2
6,two,3,6,1


Data contained in pandas objects can be combined in a number of ways:
- `pandas.merge`: connect rows in DataFrames based on one or more keys.
- `pandas.concat`: concatenate or "stack" objects together along an axis.
- `combine_first`: splice together overlapping data to fill in missing values in one object with values from another.

In [341]:
df1 = pd.DataFrame(
    {
        "key": ["b", "b", "a", "c", "a", "a", "b"],
        "data1": pd.Series(range(7), dtype="Int64"),
    }
)
df2 = pd.DataFrame(
    {"key": ["a", "b", "d"], "data2": pd.Series(range(3), dtype="Int64")}
)

In [342]:
df1

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,a,5
6,b,6


In [343]:
df2

Unnamed: 0,key,data2
0,a,0
1,b,1
2,d,2


In [344]:
# many to one join
pd.merge(df1, df2)

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,1,1
2,b,6,1
3,a,2,0
4,a,4,0
5,a,5,0


Since I didn't specify which column to join on, `pandas.merge` uses the overlapping column names as the keys.

In [345]:
pd.merge(df1, df2, on="key")

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,1,1
2,b,6,1
3,a,2,0
4,a,4,0
5,a,5,0


If the column names are different in each object, you can specify them separately.

In [346]:
df3 = pd.DataFrame(
    {
        "lkey": ["b", "b", "a", "c", "a", "a", "b"],
        "data1": pd.Series(range(7), dtype="Int64"),
    }
)
df4 = pd.DataFrame(
    {"rkey": ["a", "b", "d"], "data2": pd.Series(range(3), dtype="Int64")}
)
pd.merge(df3, df4, left_on="lkey", right_on="rkey")

Unnamed: 0,lkey,data1,rkey,data2
0,b,0,b,1
1,b,1,b,1
2,b,6,b,1
3,a,2,a,0
4,a,4,a,0
5,a,5,a,0


You can observe that the "c" and "d" values and associated data are missing from the result. By default, `pandas.merge` does an "inner" join; the keys in the result are the intersection, or the common set found in both tables.  
Other possible options are "left", "right", and "outer". The outer join takes the union of the keys, combining the effect of applying both left and right joins.

In [347]:
pd.merge(df1, df2, how="outer")

Unnamed: 0,key,data1,data2
0,b,0.0,1.0
1,b,1.0,1.0
2,b,6.0,1.0
3,a,2.0,0.0
4,a,4.0,0.0
5,a,5.0,0.0
6,c,3.0,
7,d,,2.0


In [348]:
pd.merge(df3, df4, left_on="lkey", right_on="rkey", how="outer")

Unnamed: 0,lkey,data1,rkey,data2
0,b,0.0,b,1.0
1,b,1.0,b,1.0
2,b,6.0,b,1.0
3,a,2.0,a,0.0
4,a,4.0,a,0.0
5,a,5.0,a,0.0
6,c,3.0,,
7,,,d,2.0


| Option | Behavior |
| --- | --- |
| `how="inner"` | Use only the key combinations observed in both tables |
| `how="left"` | Use all key combinations found in the left table |
| `how="right"` | Use all key combinations found in the right table |
| `how="outer"` | Use all key combinations observed in both tables together |
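The table can be checked on a small pair of frames. The frames below are made up for illustration (key "c" exists only on the left and "d" only on the right), and `indicator=True` is used to record where each row came from.

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "lval": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "rval": [4, 5, 6]})

# Collect the keys that survive under each join type
keys_by_how = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(left, right, on="key", how=how)
    keys_by_how[how] = sorted(merged["key"])
    print(how, keys_by_how[how])

# indicator=True adds a "_merge" column: "both", "left_only" or "right_only"
flagged = pd.merge(left, right, on="key", how="outer", indicator=True)
print(flagged[["key", "_merge"]])
```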

In [349]:
df1 = pd.DataFrame(
    {"key": ["b", "b", "a", "c", "a", "b"], "data1": pd.Series(range(6), dtype="Int64")}
)
df2 = pd.DataFrame(
    {"key": ["a", "b", "a", "b", "d"], "data2": pd.Series(range(5), dtype="Int64")}
)

In [350]:
df1

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [351]:
df2

Unnamed: 0,key,data2
0,a,0
1,b,1
2,a,2
3,b,3
4,d,4


In [352]:
pd.merge(df1, df2, on="key", how="left")

Unnamed: 0,key,data1,data2
0,b,0,1.0
1,b,0,3.0
2,b,1,1.0
3,b,1,3.0
4,a,2,0.0
5,a,2,2.0
6,c,3,
7,a,4,0.0
8,a,4,2.0
9,b,5,1.0


In [353]:
pd.merge(df1, df2, how="inner")

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,0,3
2,b,1,1
3,b,1,3
4,b,5,1
5,b,5,3
6,a,2,0
7,a,2,2
8,a,4,0
9,a,4,2
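The two merges above are many-to-many: a key that appears several times on both sides produces every pairing of the matching rows. The sketch below counts those pairings and shows how the `validate` argument of `pandas.merge` can flag an unexpected many-to-many join. Plain integer columns are used instead of `Int64` for brevity.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"], "data1": range(6)})
df2 = pd.DataFrame({"key": ["a", "b", "a", "b", "d"], "data2": range(5)})

inner = pd.merge(df1, df2, on="key", how="inner")
# "b": 3 left rows x 2 right rows = 6 rows; "a": 2 x 2 = 4 rows
counts = inner["key"].value_counts().to_dict()
print(counts)

# validate="many_to_one" requires the keys in the right frame to be unique
validation_error = None
try:
    pd.merge(df1, df2, on="key", validate="many_to_one")
except pd.errors.MergeError as exc:
    validation_error = exc
print("validation failed:", validation_error)
```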


When merging on multiple keys (passing a list of column names to `on`), you can determine which key combinations will appear in the result for a given merge method by thinking of the multiple keys as forming an array of tuples to be used as a single join key.
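A minimal sketch of a merge on two keys follows. The frames are small illustrative inputs, and `pairs` is a helper name introduced here to make the "tuples as a single join key" view explicit.

```python
import pandas as pd

left = pd.DataFrame(
    {
        "key1": ["foo", "foo", "bar"],
        "key2": ["one", "two", "one"],
        "lval": [1, 2, 3],
    }
)
right = pd.DataFrame(
    {
        "key1": ["foo", "foo", "bar", "bar"],
        "key2": ["one", "one", "one", "two"],
        "rval": [4, 5, 6, 7],
    }
)

# A list passed to `on` joins on the combination of both columns
outer = pd.merge(left, right, on=["key1", "key2"], how="outer")
print(outer)

# Each (key1, key2) pair behaves like one tuple-valued join key
pairs = set(zip(outer["key1"], outer["key2"]))
print(pairs)
```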

A last issue to consider in merge operations is the treatment of overlapping column names.

In [354]:
left = pd.DataFrame(
    {
        "key1": ["foo", "foo", "bar"],
        "key2": ["one", "two", "one"],
        "lval": pd.Series([1, 2, 3], dtype="Int64"),
    }
)
right = pd.DataFrame(
    {
        "key1": ["foo", "foo", "bar", "bar"],
        "key2": ["one", "one", "one", "two"],
        "rval": pd.Series([4, 5, 6, 7], dtype="Int64"),
    }
)

In [355]:
left

Unnamed: 0,key1,key2,lval
0,foo,one,1
1,foo,two,2
2,bar,one,3


In [356]:
right

Unnamed: 0,key1,key2,rval
0,foo,one,4
1,foo,one,5
2,bar,one,6
3,bar,two,7


In [357]:
pd.merge(left, right, on="key1")

Unnamed: 0,key1,key2_x,lval,key2_y,rval
0,foo,one,1,one,4
1,foo,one,1,one,5
2,foo,two,2,one,4
3,foo,two,2,one,5
4,bar,one,3,one,6
5,bar,one,3,two,7


While you can address the overlap manually by renaming axis labels, `pandas.merge` has a `suffixes` option for specifying strings to append to overlapping names in the left and right DataFrame objects.

In [358]:
pd.merge(left, right, on="key1", suffixes=("_left", "_right"))

Unnamed: 0,key1,key2_left,lval,key2_right,rval
0,foo,one,1,one,4
1,foo,one,1,one,5
2,foo,two,2,one,4
3,foo,two,2,one,5
4,bar,one,3,one,6
5,bar,one,3,two,7


In some cases, the merge key(s) in a DataFrame will be found in its index (row labels). In this case, you can pass `left_index=True` or `right_index=True` (or both) to indicate that the index should be used as the merge key.
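As a sketch of this, the example below merges a column of one frame against the index of another. The frames `left1` and `right1` and their column names are made up for illustration.

```python
import pandas as pd

left1 = pd.DataFrame({"key": ["a", "b", "a", "a", "b", "c"], "value": range(6)})
right1 = pd.DataFrame({"group_val": [3.5, 7.0]}, index=["a", "b"])

# Match left1["key"] against the row labels of right1 (inner join by default)
merged = pd.merge(left1, right1, left_on="key", right_index=True)
print(merged)
```

Because the default is an inner join, the row with key "c" has no match in `right1.index` and is dropped; passing `how="outer"` or `how="left"` would keep it.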

DataFrame has a `join` instance method to simplify merging by index. It can also be used to combine many DataFrame objects having the same or similar indexes but non-overlapping columns.

In [359]:
left = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    index=["a", "c", "e"],
    columns=["Ohio", "Nevada"],
).astype("Int64")
right = pd.DataFrame(
    [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0], [13, 14]],
    index=["b", "c", "d", "e"],
    columns=["Missouri", "Alabama"],
).astype("Int64")

In [360]:
left

Unnamed: 0,Ohio,Nevada
a,1,2
c,3,4
e,5,6


In [361]:
right

Unnamed: 0,Missouri,Alabama
b,7,8
c,9,10
d,11,12
e,13,14


In [362]:
left.join(right, how="outer")

Unnamed: 0,Ohio,Nevada,Missouri,Alabama
a,1.0,2.0,,
b,,,7.0,8.0
c,3.0,4.0,9.0,10.0
d,,,11.0,12.0
e,5.0,6.0,13.0,14.0


Compared with `pandas.merge`, DataFrame's `join` method performs a left join on the join keys by default. It also supports joining the index of the passed DataFrame on one of the columns of the calling DataFrame.  
For simple index-on-index merges, you can pass a list of DataFrames to `join` as an alternative to using the more general `pandas.concat` function.

In [363]:
another = pd.DataFrame(
    [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0], [16.0, 17]],
    index=["a", "c", "e", "f"],
    columns=["New York", "Oregon"],
)
another

Unnamed: 0,New York,Oregon
a,7.0,8.0
c,9.0,10.0
e,11.0,12.0
f,16.0,17.0


In [364]:
left.join([right, another])

Unnamed: 0,Ohio,Nevada,Missouri,Alabama,New York,Oregon
a,1,2,,,7.0,8.0
c,3,4,9.0,10.0,9.0,10.0
e,5,6,13.0,14.0,11.0,12.0


In [365]:
left.join([right, another], how="outer")

Unnamed: 0,Ohio,Nevada,Missouri,Alabama,New York,Oregon
a,1.0,2.0,,,7.0,8.0
c,3.0,4.0,9.0,10.0,9.0,10.0
e,5.0,6.0,13.0,14.0,11.0,12.0
b,,,7.0,8.0,,
d,,,11.0,12.0,,
f,,,,,16.0,17.0
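The other capability of `join` mentioned earlier, joining the index of the passed DataFrame on a column of the calling DataFrame, can be sketched with the `on` argument. The frames `left1` and `right1` are illustrative.

```python
import pandas as pd

left1 = pd.DataFrame({"key": ["a", "b", "a", "a", "b", "c"], "value": range(6)})
right1 = pd.DataFrame({"group_val": [3.5, 7.0]}, index=["a", "b"])

# Look up each left1["key"] in right1's index; join defaults to a left join
joined = left1.join(right1, on="key")
print(joined)
```

Since this is a left join, every row of `left1` is kept and the unmatched key "c" receives a missing value in `group_val`.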


Another kind of data combination operation is referred to interchangeably as concatenation or stacking. NumPy's `concatenate` function can do this with NumPy arrays.

In [366]:
arr = np.arange(12).reshape((3, 4))

In [367]:
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [368]:
np.concatenate([arr, arr], axis=1)

array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])

In the context of pandas objects such as Series and DataFrame, having labeled axes enables you to further generalize array concatenation. In particular, you have a number of additional concerns:
- If the objects are indexed differently on the other axes, should we combine the distinct elements in these axes or use only the values in common?
- Do the concatenated chunks of data need to be identifiable as such in the resulting object?
- Does the "concatenation axis" contain data that needs to be preserved?  

The `concat` function in pandas provides a consistent way to address each of these questions.

In [369]:
s1 = pd.Series([0, 1], index=["a", "b"], dtype="Int64")
s2 = pd.Series([2, 3, 4], index=["c", "d", "e"], dtype="Int64")
s3 = pd.Series([5, 6], index=["f", "g"], dtype="Int64")

In [370]:
s1

a    0
b    1
dtype: Int64

In [371]:
s2

c    2
d    3
e    4
dtype: Int64

In [372]:
s3

f    5
g    6
dtype: Int64

In [373]:
pd.concat([s1, s2, s3])

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: Int64

In [374]:
pd.concat([s1, s2, s3], axis="columns")

Unnamed: 0,0,1,2
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


In [375]:
pd.concat([s1, s2, s3], keys=["one", "two", "three"])

one    a    0
       b    1
two    c    2
       d    3
       e    4
three  f    5
       g    6
dtype: Int64
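The three concerns listed earlier map onto the `join`, `keys`, and `ignore_index` arguments of `pandas.concat`. A minimal sketch, using two small Series whose labels only partly overlap (`s4` is a made-up input):

```python
import pandas as pd

s1 = pd.Series([0, 1], index=["a", "b"])
s4 = pd.Series([10, 11, 12], index=["a", "b", "z"])

# join="inner" keeps only the labels present in every input
inner = pd.concat([s1, s4], axis="columns", join="inner")

# keys identify each chunk; with axis="columns" they become the column headers
labeled = pd.concat([s1, s4], axis="columns", keys=["one", "two"])

# ignore_index=True discards the labels along the concatenation axis
renumbered = pd.concat([s1, s4], ignore_index=True)

print(inner)
print(labeled)
print(renumbered)
```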

There is another data combination situation that can't be expressed as either a merge or concatenation operation. You may have two datasets with indexes that overlap in full or in part.

In [376]:
a = pd.Series(
    [np.nan, 2.5, 0.0, 3.5, 4.5, np.nan], index=["f", "e", "d", "c", "b", "a"]
)
b = pd.Series(
    [0.0, np.nan, 2.0, np.nan, np.nan, 5.0], index=["a", "b", "c", "d", "e", "f"]
)

In [377]:
a

f    NaN
e    2.5
d    0.0
c    3.5
b    4.5
a    NaN
dtype: float64

In [378]:
b

a    0.0
b    NaN
c    2.0
d    NaN
e    NaN
f    5.0
dtype: float64

In [379]:
np.where(pd.isna(a), b, a)

array([0. , 2.5, 0. , 3.5, 4.5, 5. ])

Here, whenever values in `a` are null, values from `b` are selected; otherwise the non-null values from `a` are selected. `numpy.where` works by position and does not check whether the index labels are aligned (here `a` and `b` list their labels in opposite order), so if you want to line up values by index, use the Series `combine_first` method.

In [380]:
a.combine_first(b)

a    0.0
b    4.5
c    3.5
d    0.0
e    2.5
f    5.0
dtype: float64
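The difference between positional and label-aligned selection can be made explicit. In this sketch, `positional` and `aligned` are names introduced for illustration; the second one reorders `b` with `reindex` before calling `numpy.where`.

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, 0.0, 3.5, 4.5, np.nan], index=list("fedcba"))
b = pd.Series([0.0, np.nan, 2.0, np.nan, np.nan, 5.0], index=list("abcdef"))

# Positional: pairs a["f"] with b["a"], a["e"] with b["b"], and so on
positional = pd.Series(np.where(pd.isna(a), b, a), index=a.index)

# Label-aligned: reorder b to follow a's labels before choosing values
aligned = pd.Series(np.where(pd.isna(a), b.reindex(a.index), a), index=a.index)

print(positional["f"], aligned["f"])
print(aligned.sort_index())
print(a.combine_first(b))
```

The positional version fills the missing `a["f"]` with `b["a"]`, while the aligned version fills it with `b["f"]`, which is what `combine_first` does.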

With DataFrames, `combine_first` does the same thing column by column, so you can think of it as "patching" missing data in the calling object with data from the object you pass.

In [381]:
df1 = pd.DataFrame(
    {
        "a": [1.0, np.nan, 5.0, np.nan],
        "b": [np.nan, 2.0, np.nan, 6.0],
        "c": range(2, 18, 4),
    }
)
df2 = pd.DataFrame(
    {"a": [5.0, 4.0, np.nan, 3.0, 7.0], "b": [np.nan, 3.0, 4.0, 6.0, 8.0]}
)

In [382]:
df1

Unnamed: 0,a,b,c
0,1.0,,2
1,,2.0,6
2,5.0,,10
3,,6.0,14


In [383]:
df2

Unnamed: 0,a,b
0,5.0,
1,4.0,3.0
2,,4.0
3,3.0,6.0
4,7.0,8.0


In [384]:
df1.combine_first(df2)

Unnamed: 0,a,b,c
0,1.0,,2.0
1,4.0,2.0,6.0
2,5.0,4.0,10.0
3,3.0,6.0,14.0
4,7.0,8.0,


Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:
- `stack`: "rotates" or pivots from the columns in the data to the rows.
- `unstack`: pivots from the rows into the columns.

In [385]:
data = pd.DataFrame(
    np.arange(6).reshape((2, 3)),
    index=pd.Index(["Ohio", "Colorado"], name="state"),
    columns=pd.Index(["one", "two", "three"], name="number"),
)
data

number,one,two,three
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,0,1,2
Colorado,3,4,5


In [386]:
# use stack method to pivot the columns into the rows
data.stack()

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int64

In [387]:
data.stack().unstack()

number,one,two,three
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,0,1,2
Colorado,3,4,5


By default, the innermost level is unstacked (same with `stack`). You can unstack a different level by passing a level number or name.

In [388]:
data.stack().unstack(level=0)

state,Ohio,Colorado
number,Unnamed: 1_level_1,Unnamed: 2_level_1
one,0,3
two,1,4
three,2,5


In [389]:
data.stack().unstack(level="state")

state,Ohio,Colorado
number,Unnamed: 1_level_1,Unnamed: 2_level_1
one,0,3
two,1,4
three,2,5


A common way to store multiple time series in databases and CSV files is what is sometimes called *long* or *stacked* format. In this format, individual values are represented by a single row in a table rather than multiple values per row.

In [390]:
data = pd.read_csv("examples/macrodata.csv")
data = data.loc[:, ["year", "quarter", "realgdp", "infl", "unemp"]]
data

Unnamed: 0,year,quarter,realgdp,infl,unemp
0,1959,1,2710.349,0.00,5.8
1,1959,2,2778.801,2.34,5.1
2,1959,3,2775.488,2.74,5.3
3,1959,4,2785.204,0.27,5.6
4,1960,1,2847.699,2.31,5.2
...,...,...,...,...,...
198,2008,3,13324.600,-3.16,6.0
199,2008,4,13141.920,-8.79,6.9
200,2009,1,12925.410,0.94,8.1
201,2009,2,12901.504,3.37,9.2


First we use `pandas.PeriodIndex`, which represents time intervals rather than points in time, to combine the `year` and `quarter` columns; we then set the index to consist of `datetime` values at the start of each quarter.

In [391]:
periods = pd.PeriodIndex(
    year=data.pop("year"), quarter=data.pop("quarter"), name="date"
)
periods

PeriodIndex(['1959Q1', '1959Q2', '1959Q3', '1959Q4', '1960Q1', '1960Q2',
             '1960Q3', '1960Q4', '1961Q1', '1961Q2',
             ...
             '2007Q2', '2007Q3', '2007Q4', '2008Q1', '2008Q2', '2008Q3',
             '2008Q4', '2009Q1', '2009Q2', '2009Q3'],
            dtype='period[Q-DEC]', name='date', length=203)

In [392]:
data.index = periods.to_timestamp("D")
data.head()

Unnamed: 0_level_0,realgdp,infl,unemp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-01-01,2710.349,0.0,5.8
1959-04-01,2778.801,2.34,5.1
1959-07-01,2775.488,2.74,5.3
1959-10-01,2785.204,0.27,5.6
1960-01-01,2847.699,2.31,5.2
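Whether the timestamps fall at the start or the end of each period is controlled by the `how` argument of `to_timestamp`, which defaults to `"start"`. A small sketch with two quarters (the names `starts` and `ends` are illustrative):

```python
import pandas as pd

periods = pd.PeriodIndex(["1959Q1", "1959Q2"], freq="Q-DEC", name="date")

# Default behavior: the first day of each quarter
starts = periods.to_timestamp("D", how="start")

# how="end" moves to the end of each quarter; normalize() drops the time of day
ends = periods.to_timestamp("D", how="end").normalize()

print(starts)
print(ends)
```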


Here I used the `pop` method on the DataFrame, which returns a column while deleting it from the DataFrame at the same time.  
Then, I select a subset of columns and give the `columns` index the name `"item"`.

In [393]:
data = data.reindex(columns=["realgdp", "infl", "unemp"])
data.columns.name = "item"
data.head()

item,realgdp,infl,unemp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-01-01,2710.349,0.0,5.8
1959-04-01,2778.801,2.34,5.1
1959-07-01,2775.488,2.74,5.3
1959-10-01,2785.204,0.27,5.6
1960-01-01,2847.699,2.31,5.2


Lastly we reshape with `stack`, turn the new index levels into columns with `reset_index`, and finally give the column containing the data values the name `"value"`.

In [394]:
long_data = data.stack().reset_index().rename(columns={0: "value"})
long_data[:10]

Unnamed: 0,date,item,value
0,1959-01-01,realgdp,2710.349
1,1959-01-01,infl,0.0
2,1959-01-01,unemp,5.8
3,1959-04-01,realgdp,2778.801
4,1959-04-01,infl,2.34
5,1959-04-01,unemp,5.1
6,1959-07-01,realgdp,2775.488
7,1959-07-01,infl,2.74
8,1959-07-01,unemp,5.3
9,1959-10-01,realgdp,2785.204


In this so-called long format for multiple time series, each row in the table represents a single observation.  
Data is frequently stored this way in relational SQL databases, as a fixed schema (column names and data types) allows the number of distinct values in the `item` column to change as data is added to the table. In the previous example, `date` and `item` would usually be the primary keys (in relational database parlance), offering both relational integrity and easier joins. In some cases, the data may be more difficult to work with in this format; you might prefer to have a DataFrame containing one column per distinct `item` value indexed by timestamps in the `date` column. DataFrame's `pivot` method performs exactly this transformation.

In [395]:
pivoted = long_data.pivot(index="date", columns="item", values="value")
pivoted.head()

item,infl,realgdp,unemp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-01-01,0.0,2710.349,5.8
1959-04-01,2.34,2778.801,5.1
1959-07-01,2.74,2775.488,5.3
1959-10-01,0.27,2785.204,5.6
1960-01-01,2.31,2847.699,5.2


The first two values passed are the columns to be used, respectively, as the row and column index, then finally an optional value column to fill the DataFrame. Suppose you had two value columns that you wanted to reshape simultaneously.

In [396]:
long_data["value2"] = np.random.standard_normal(len(long_data))
long_data[:10]

Unnamed: 0,date,item,value,value2
0,1959-01-01,realgdp,2710.349,-0.625246
1,1959-01-01,infl,0.0,0.378607
2,1959-01-01,unemp,5.8,0.522582
3,1959-04-01,realgdp,2778.801,-1.406102
4,1959-04-01,infl,2.34,-1.157131
5,1959-04-01,unemp,5.1,-1.025771
6,1959-07-01,realgdp,2775.488,-1.997672
7,1959-07-01,infl,2.74,0.931429
8,1959-07-01,unemp,5.3,0.79059
9,1959-10-01,realgdp,2785.204,-1.179223


In [397]:
# by omitting the last argument, we obtain a DataFrame with hierarchical columns
pivoted = long_data.pivot(index="date", columns="item")
pivoted.head()

Unnamed: 0_level_0,value,value,value,value2,value2,value2
item,infl,realgdp,unemp,infl,realgdp,unemp
date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
1959-01-01,0.0,2710.349,5.8,0.378607,-0.625246,0.522582
1959-04-01,2.34,2778.801,5.1,-1.157131,-1.406102,-1.025771
1959-07-01,2.74,2775.488,5.3,0.931429,-1.997672,0.79059
1959-10-01,0.27,2785.204,5.6,-1.280083,-1.179223,0.884692
1960-01-01,2.31,2847.699,5.2,-1.208828,1.393547,0.551202


In [398]:
pivoted["value"].head()

item,infl,realgdp,unemp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-01-01,0.0,2710.349,5.8
1959-04-01,2.34,2778.801,5.1
1959-07-01,2.74,2775.488,5.3
1959-10-01,0.27,2785.204,5.6
1960-01-01,2.31,2847.699,5.2


Note that `pivot` is equivalent to creating a hierarchical index using `set_index` followed by a call to `unstack`.
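That equivalence can be sketched on a four-row excerpt of the long-format data (two dates, two items). The variable `unstacked` is a name introduced here for the `set_index` plus `unstack` route.

```python
import pandas as pd

long_data = pd.DataFrame(
    {
        "date": pd.to_datetime(["1959-01-01"] * 2 + ["1959-04-01"] * 2),
        "item": ["infl", "unemp", "infl", "unemp"],
        "value": [0.0, 5.8, 2.34, 5.1],
    }
)

pivoted = long_data.pivot(index="date", columns="item", values="value")

# Same reshape in two explicit steps: build a two-level index, then unstack
unstacked = long_data.set_index(["date", "item"])["value"].unstack(level="item")

print(pivoted)
print(unstacked)
```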

An inverse operation to `pivot` for DataFrame is `pandas.melt`. Rather than transforming one column into many in a new DataFrame, it merges multiple columns into one, producing a DataFrame that is longer than the input.

In [399]:
df = pd.DataFrame(
    {"key": ["foo", "bar", "baz"], "A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]}
)
df

Unnamed: 0,key,A,B,C
0,foo,1,4,7
1,bar,2,5,8
2,baz,3,6,9


The "key" column may be a group indicator, and the other columns are data values. When using `pandas.melt`, we must indicate which columns (if any) are group indicators.

In [400]:
melted = pd.melt(df, id_vars="key")
melted

Unnamed: 0,key,variable,value
0,foo,A,1
1,bar,A,2
2,baz,A,3
3,foo,B,4
4,bar,B,5
5,baz,B,6
6,foo,C,7
7,bar,C,8
8,baz,C,9


By using `pivot` we can reshape back to the original layout.

In [401]:
reshaped = melted.pivot(index="key", columns="variable", values="value")
reshaped

variable,A,B,C
key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,2,5,8
baz,3,6,9
foo,1,4,7


Since the result of `pivot` creates an index from the column used as the row labels, we may want to use `reset_index` to move the data back into a column.

In [402]:
reshaped.reset_index()

variable,key,A,B,C
0,bar,2,5,8
1,baz,3,6,9
2,foo,1,4,7


`pandas.melt` can be used without any group identifiers too.

In [403]:
pd.melt(df, value_vars=["A", "B", "C"])

Unnamed: 0,variable,value
0,A,1
1,A,2
2,A,3
3,B,4
4,B,5
5,B,6
6,C,7
7,C,8
8,C,9
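`id_vars` and `value_vars` can also be combined to melt only a subset of the value columns, and `var_name` and `value_name` rename the two generated columns. In this sketch the labels `"column"` and `"reading"` are arbitrary choices for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"key": ["foo", "bar", "baz"], "A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]}
)

# Keep "key" as the identifier and melt only columns A and B; C is left out
subset = pd.melt(
    df,
    id_vars="key",
    value_vars=["A", "B"],
    var_name="column",
    value_name="reading",
)
print(subset)
```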
