<h1>Data Wrangling</h1>

<h3>Hierarchical Indexing</h3>

<i>Hierarchical indexing</i> is an important feature of pandas that enables us to have multiple index levels on an axis. In other words, it provides a way to work with higher-dimensional data in a lower-dimensional form.

In [122]:
import pandas as pd
import numpy as np

In [123]:
data = pd.Series(np.random.randn(9),
                index = [['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1,2,3,1,3,1,2,2,3]]
                )

In [124]:
data

a  1    0.987698
   2    0.596506
   3   -1.607795
b  1   -0.679791
   3    0.476943
c  1    0.840347
   2    0.091914
d  2    1.721766
   3    0.446921
dtype: float64

In [125]:
data['a']

1    0.987698
2    0.596506
3   -1.607795
dtype: float64

In [126]:
data['a'][2]

0.5965059528211047

What we're seeing is a prettified view of a Series with a <b>MultiIndex</b> as its index. The "gaps" in the index display mean "use the label directly above".

In [127]:
data.index

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )

With a hierarchically indexed object, so-called partial indexing is possible, enabling us to concisely select subsets of the data:

In [128]:
data

a  1    0.987698
   2    0.596506
   3   -1.607795
b  1   -0.679791
   3    0.476943
c  1    0.840347
   2    0.091914
d  2    1.721766
   3    0.446921
dtype: float64

In [129]:
data['b']

1   -0.679791
3    0.476943
dtype: float64

In [130]:
data['b':'c']

b  1   -0.679791
   3    0.476943
c  1    0.840347
   2    0.091914
dtype: float64

In [131]:
data[['b', 'd']]

b  1   -0.679791
   3    0.476943
d  2    1.721766
   3    0.446921
dtype: float64

Selection is even possible from an "inner" level:

In [132]:
data

a  1    0.987698
   2    0.596506
   3   -1.607795
b  1   -0.679791
   3    0.476943
c  1    0.840347
   2    0.091914
d  2    1.721766
   3    0.446921
dtype: float64

In [133]:
data.loc[:,2]

a    0.596506
c    0.091914
d    1.721766
dtype: float64
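The inner-level selection above can also be written with the <b>xs</b> (cross-section) method, which names the level explicitly. The following is a minimal sketch rather than part of the original session: the Series is rebuilt with arbitrary values (0.0 to 8.0) in place of the random ones, and the names via_loc and via_xs are illustrative only.

```python
import numpy as np
import pandas as pd

# Rebuild a Series with the same two-level index as above.
# The values are arbitrary stand-ins for the random numbers in the notebook.
data = pd.Series(np.arange(9.0),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])

# Select every entry whose inner-level label is 2.
via_loc = data.loc[:, 2]

# xs makes the level explicit and drops that level from the result.
via_xs = data.xs(2, level=1)

print(via_xs)
```

Both forms return a Series indexed by the remaining outer labels; xs tends to read more clearly when the index has more than two levels.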

Hierarchical indexing plays an important role in reshaping data and in group-based operations like forming a pivot table. For example, we can rearrange the data into a DataFrame using its <b>unstack</b> method:

In [134]:
data

a  1    0.987698
   2    0.596506
   3   -1.607795
b  1   -0.679791
   3    0.476943
c  1    0.840347
   2    0.091914
d  2    1.721766
   3    0.446921
dtype: float64

In [135]:
data.unstack()

Unnamed: 0,1,2,3
a,0.987698,0.596506,-1.607795
b,-0.679791,,0.476943
c,0.840347,0.091914,
d,,1.721766,0.446921


The inverse operation of unstack is stack:

In [136]:
data.unstack().stack()

a  1    0.987698
   2    0.596506
   3   -1.607795
b  1   -0.679791
   3    0.476943
c  1    0.840347
   2    0.091914
d  2    1.721766
   3    0.446921
dtype: float64

With a DataFrame, either axis can have a hierarchical index:

In [137]:
frame = pd.DataFrame(np.arange(12).reshape((4,3)),
                    index = [['a','a', 'b', 'b'],[1,2,1,2]],
                    columns = [['Ohio', 'Ohio', 'Colorado'], 
                              ['Green', 'Red', 'Green']])

In [138]:
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


The hierarchical levels can have names. If so, these will show up in the console output:

In [139]:
frame.index.names = ['key1', 'key2']

In [140]:
frame.columns.names = ['state', 'color']

In [141]:
frame

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


With partial column indexing we can similarly select groups of columns:

In [142]:
frame['Ohio']

Unnamed: 0_level_0,color,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,0,1
a,2,3,4
b,1,6,7
b,2,9,10


A MultiIndex can be created by itself and then reused; the columns in the preceding DataFrame with level names could be created like this:

In [143]:
from pandas import MultiIndex

In [144]:
MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                      ['Green', 'Red', 'Green']],
                      names=['state', 'color'])

MultiIndex([(    'Ohio', 'Green'),
            (    'Ohio',   'Red'),
            ('Colorado', 'Green')],
           names=['state', 'color'])
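To illustrate the "created by itself and then reused" part, here is a small sketch that builds the row and column indexes first and then passes them to the DataFrame constructor. It assumes the same labels as the frame above; the variable names index, columns and same_columns are hypothetical, and from_tuples is shown only as an alternative constructor.

```python
import numpy as np
import pandas as pd

# Build the two hierarchical indexes up front, with level names attached.
columns = pd.MultiIndex.from_arrays(
    [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
    names=['state', 'color'])
index = pd.MultiIndex.from_arrays(
    [['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    names=['key1', 'key2'])

# Reuse them when constructing the DataFrame.
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=index, columns=columns)

# from_tuples builds an equal index from (state, color) pairs.
same_columns = pd.MultiIndex.from_tuples(
    [('Ohio', 'Green'), ('Ohio', 'Red'), ('Colorado', 'Green')],
    names=['state', 'color'])

print(frame)
```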

<h3>Reordering and Sorting Levels</h3>

At times we will need to rearrange the order of the levels on an axis or sort the data by the values in one specific level. The <b>swaplevel</b> method takes two level numbers or names and returns a new object with the levels interchanged (the data is otherwise unaltered):

In [145]:
frame

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [146]:
frame.swaplevel('key1', 'key2')

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
2,a,3,4,5
1,b,6,7,8
2,b,9,10,11


In [147]:
frame

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


<b>sort_index</b>, on the other hand, sorts the data using only the values in a single level.

In [148]:
frame.index

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           names=['key1', 'key2'])

In [149]:
frame.sort_index(level=1)

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
b,1,6,7,8
a,2,3,4,5
b,2,9,10,11


In [150]:
frame.swaplevel(0,1).sort_index(level=0)

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
1,b,6,7,8
2,a,3,4,5
2,b,9,10,11


<b>Note: </b> Data selection performance is much better on hierarchically indexed objects if the index is lexicographically sorted starting with the outermost level, that is, the result of calling sort_index(level=0) or sort_index().
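Whether an index is already lexicographically sorted can be checked before relying on fast selection. The sketch below uses a deliberately shuffled index that is not part of the original notebook; the column names x and y are arbitrary.

```python
import numpy as np
import pandas as pd

# An index whose outer level is deliberately out of order.
frame = pd.DataFrame(np.arange(8).reshape((4, 2)),
                     index=[['b', 'a', 'b', 'a'], [2, 1, 1, 2]],
                     columns=['x', 'y'])
frame.index.names = ['key1', 'key2']

# False: the index is not lexicographically sorted.
print(frame.index.is_monotonic_increasing)

# sort_index() with no arguments sorts by key1, then key2.
sorted_frame = frame.sort_index()

# True: the sorted index is monotonic.
print(sorted_frame.index.is_monotonic_increasing)

# Slicing a range of outer labels works on the sorted frame.
subset = sorted_frame.loc['a':'a']
print(subset)
```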

<h3>Summary Statistics by Level</h3>

Many descriptive and summary statistics on DataFrame and Series have a <b>level</b> option with which we can specify the level we want to aggregate by on a particular axis:

In [151]:
frame

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [152]:
frame.sum(level='key2')

state,Ohio,Ohio,Colorado
color,Green,Red,Green
key2,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,6,8,10
2,12,14,16


In [153]:
frame.sum(level='color', axis = 1)

Unnamed: 0_level_0,color,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,2,1
a,2,8,4
b,1,14,7
b,2,20,10
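The level option on sum was deprecated in pandas 1.3 and is no longer accepted from pandas 2.0 onward. Assuming such a version, the same two results can be obtained with groupby, as sketched below. The frame is rebuilt here so the block runs on its own; the transpose in the second call is one possible way to group along the columns, not the only one.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays(
        [['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
        names=['key1', 'key2']),
    columns=pd.MultiIndex.from_arrays(
        [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
        names=['state', 'color']))

# Row-wise aggregation by an index level.
by_key2 = frame.groupby(level='key2').sum()

# Column-wise aggregation by a column level:
# transpose, group on what is now a row level, transpose back.
by_color = frame.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```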


<h3>Indexing with a DataFrame's columns</h3>

It's not unusual to want to use one or more columns from a DataFrame as the row index; alternatively, we may wish to move the row index into the DataFrame's columns:

In [154]:
frame = pd.DataFrame({'a': range(7), 'b':range(7,0,-1),
                     'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                     'd': [0,1,2,9,1,2,3]})

In [155]:
frame

Unnamed: 0,a,b,c,d
0,0,7,one,0
1,1,6,one,1
2,2,5,one,2
3,3,4,two,9
4,4,3,two,1
5,5,2,two,2
6,6,1,two,3


DataFrame's <b>set_index</b> method creates a new DataFrame using one or more of its columns as the index:

In [156]:
frame2 = frame.set_index(['c', 'd'])

In [157]:
frame2

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1
one,0,0,7
one,1,1,6
one,2,2,5
two,9,3,4
two,1,4,3
two,2,5,2
two,3,6,1


By default the columns are removed from the DataFrame, though we can leave them in by passing <b>drop=False</b>:

In [158]:
frame.set_index(['c', 'd'], drop=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b,c,d
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
one,0,0,7,one,0
one,1,1,6,one,1
one,2,2,5,one,2
two,9,3,4,two,9
two,1,4,3,two,1
two,2,5,2,two,2
two,3,6,1,two,3


<b>reset_index</b>, on the other hand, does the opposite of set_index; the hierarchical index levels are moved into the columns:

In [159]:
frame2

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1
one,0,0,7
one,1,1,6
one,2,2,5
two,9,3,4
two,1,4,3
two,2,5,2
two,3,6,1


In [160]:
frame2.reset_index()

Unnamed: 0,c,d,a,b
0,one,0,0,7
1,one,1,1,6
2,one,2,2,5
3,two,9,3,4
4,two,1,4,3
5,two,2,5,2
6,two,3,6,1


<h3>Combining and Merging Datasets</h3>

Data contained in pandas objects can be combined together in a number of ways:<br>
<ul>
<li><b>pandas.merge</b> connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements the database <b>join</b> operations.<br></li>
    <li><b>pandas.concat</b> concatenates or "stacks" together objects along an axis.<br></li>
    <li>
    The <b>combine_first</b> instance method enables splicing together overlapping data to fill in missing values in one object with values from another.</li>
</ul>


<h3>Database-Style DataFrame joins</h3>

<b>Merge</b> or <b>join</b> operations combine datasets by linking rows using one or more keys. These operations are central to relational databases. The <b>merge</b> function in pandas is the main entry point for using these algorithms on our data.

In [161]:
df1 = pd.DataFrame({'key':['b', 'b', 'a', 'c','a', 'a', 'b'],
                   'data1': range(7)})

In [162]:
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                   'data2': range(3)})

In [163]:
df1

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,a,5
6,b,6


In [164]:
df2

Unnamed: 0,key,data2
0,a,0
1,b,1
2,d,2


This is an example of a <b>many-to-one</b> join; the data in df1 has multiple rows labeled 'a' and 'b', whereas df2 has only one row for each value in the key column. Calling merge with these objects we obtain:

In [165]:
pd.merge(df1, df2)

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,1,1
2,b,6,1
3,a,2,0
4,a,4,0
5,a,5,0


Note that we didn't specify which column to join on. If that information is not specified, merge uses the <b>overlapping</b> column names as the keys. It is good practice to specify them explicitly, though:

In [166]:
pd.merge(df1, df2, on='key')

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,1,1
2,b,6,1
3,a,2,0
4,a,4,0
5,a,5,0
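Alongside an explicit key, merge can also check the many-to-one assumption through its validate argument. The following is a minimal sketch using the same df1 and df2; the flag check_failed is a hypothetical name used only to record the outcome.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})

# Passes: every key appears at most once in df2 (the "one" side).
merged = pd.merge(df1, df2, on='key', validate='many_to_one')

# Fails: keys repeat in df1, so it cannot serve as the "one" side.
try:
    pd.merge(df1, df2, on='key', validate='one_to_many')
    check_failed = False
except pd.errors.MergeError:
    check_failed = True

print(len(merged), check_failed)
```

Validation costs an extra uniqueness check on the keys, but it turns a silent row explosion into an immediate error.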


If the column names are different in each object we can specify them separately:

In [167]:
df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                  'data1': range(7)})

In [168]:
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'],
                   'data2': range(3)})

In [169]:
df3

Unnamed: 0,lkey,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,a,5
6,b,6


In [170]:
df4

Unnamed: 0,rkey,data2
0,a,0
1,b,1
2,d,2


In [171]:
pd.merge(df3, df4, left_on = 'lkey', right_on ='rkey')

Unnamed: 0,lkey,data1,rkey,data2
0,b,0,b,1
1,b,1,b,1
2,b,6,b,1
3,a,2,a,0
4,a,4,a,0
5,a,5,a,0


<b>Note:</b> Here we can see that the 'c' and 'd' values and their associated data are missing from the result. By default <b>merge</b> does an <b>'inner'</b> join; the keys in the result are the intersection, or the common set found in both tables. Other possible options are 'left', 'right', and 'outer'. The outer join takes the union of the keys, combining the effect of applying both left and right joins:

In [172]:
pd.merge(df1, df2, how='outer')

Unnamed: 0,key,data1,data2
0,b,0.0,1.0
1,b,1.0,1.0
2,b,6.0,1.0
3,a,2.0,0.0
4,a,4.0,0.0
5,a,5.0,0.0
6,c,3.0,
7,d,,2.0


![alt Text](Images/DataWrangling/dw_join.png)
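To see which side each row of an outer join came from, merge accepts indicator=True, which adds a _merge column. This is a small sketch on the many-to-one df1 and df2 defined earlier; the names outer and n_both are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'],
                    'data2': range(3)})

# _merge is 'both', 'left_only' or 'right_only' for each result row.
outer = pd.merge(df1, df2, on='key', how='outer', indicator=True)

n_both = (outer['_merge'] == 'both').sum()
left_only_keys = outer.loc[outer['_merge'] == 'left_only', 'key'].tolist()
right_only_keys = outer.loc[outer['_merge'] == 'right_only', 'key'].tolist()

print(outer)
```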

Many-to-many merges have well-defined, though not necessarily intuitive, behavior:

In [173]:
df1  = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                    'data1': range(6)})

In [174]:
df2 = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
                   'data2': range(5)})

In [175]:
df1

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [176]:
df2

Unnamed: 0,key,data2
0,a,0
1,b,1
2,a,2
3,b,3
4,d,4


In [177]:
pd.merge(df1, df2, on='key', how='left')

Unnamed: 0,key,data1,data2
0,b,0,1.0
1,b,0,3.0
2,b,1,1.0
3,b,1,3.0
4,a,2,0.0
5,a,2,2.0
6,c,3,
7,a,4,0.0
8,a,4,2.0
9,b,5,1.0
10,b,5,3.0


Many-to-many joins form the Cartesian product of the rows. Since there were three 'b' rows in the left DataFrame and two in the right one, there are six 'b' rows in the result. The join method only affects the distinct key values appearing in the result:

In [178]:
pd.merge(df1, df2, how='inner')

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,0,3
2,b,1,1
3,b,1,3
4,b,5,1
5,b,5,3
6,a,2,0
7,a,2,2
8,a,4,0
9,a,4,2


To merge with multiple keys, pass a list of column names:

In [179]:
left = pd.DataFrame({
    'key1': ['foo', 'foo', 'bar'],
    'key2': ['one', 'two', 'one'],
    'lval': [1,2,3]
})

In [180]:
right = pd.DataFrame({
    'key1':['foo', 'foo', 'bar', 'bar'],
    'key2': ['one', 'one', 'one', 'two'],
    'rval': [4,5,6,7]
})

In [181]:
left

Unnamed: 0,key1,key2,lval
0,foo,one,1
1,foo,two,2
2,bar,one,3


In [182]:
right

Unnamed: 0,key1,key2,rval
0,foo,one,4
1,foo,one,5
2,bar,one,6
3,bar,two,7


In [183]:
pd.merge(left, right, on=['key1', 'key2'], how='outer')

Unnamed: 0,key1,key2,lval,rval
0,foo,one,1.0,4.0
1,foo,one,1.0,5.0
2,foo,two,2.0,
3,bar,one,3.0,6.0
4,bar,two,,7.0


<b>Note</b>: To determine which key combinations will appear in the result depending on the choice of merge method, think of the multiple keys as forming an array of tuples to be used as a single join key.
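The tuple view of a multiple-key merge can be made concrete by collecting the (key1, key2) pairs on each side and comparing them with the pairs in the result. This is only a sketch of the idea, with hypothetical helper names; merge itself does not build Python tuples in this way.

```python
import pandas as pd

left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],
                     'key2': ['one', 'two', 'one'],
                     'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],
                      'key2': ['one', 'one', 'one', 'two'],
                      'rval': [4, 5, 6, 7]})

# Treat each (key1, key2) pair as one composite key.
left_keys = set(zip(left['key1'], left['key2']))
right_keys = set(zip(right['key1'], right['key2']))

inner = pd.merge(left, right, on=['key1', 'key2'], how='inner')
outer = pd.merge(left, right, on=['key1', 'key2'], how='outer')

inner_keys = set(zip(inner['key1'], inner['key2']))
outer_keys = set(zip(outer['key1'], outer['key2']))

# Inner keeps the intersection of the pairs, outer keeps the union.
print(inner_keys == (left_keys & right_keys))
print(outer_keys == (left_keys | right_keys))
```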

A last issue to consider in merge operations is the treatment of overlapping column names. While we can address the overlap manually, merge has a <b>suffixes</b> option for specifying strings to append to overlapping names in the left and right DataFrame objects:

In [184]:
pd.merge(left, right, on='key1')

Unnamed: 0,key1,key2_x,lval,key2_y,rval
0,foo,one,1,one,4
1,foo,one,1,one,5
2,foo,two,2,one,4
3,foo,two,2,one,5
4,bar,one,3,one,6
5,bar,one,3,two,7


In [185]:
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))

Unnamed: 0,key1,key2_left,lval,key2_right,rval
0,foo,one,1,one,4
1,foo,one,1,one,5
2,foo,two,2,one,4
3,foo,two,2,one,5
4,bar,one,3,one,6
5,bar,one,3,two,7


![alt Text](Images/DataWrangling/dw_merge_func.png)

<h3>Merging On Index</h3>

In some cases, the merge key(s) in a DataFrame will be found in its index. In this case, we can pass left_index = True or right_index = True (or both) to indicate that the index should be used as the merge key:

In [186]:
left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                     'value' : range(6)})

In [187]:
right1 = pd.DataFrame({'group_val': [3.5,7]},
                     index = ['a', 'b'])

In [188]:
left1

Unnamed: 0,key,value
0,a,0
1,b,1
2,a,2
3,a,3
4,b,4
5,c,5


In [189]:
right1

Unnamed: 0,group_val
a,3.5
b,7.0


In [190]:
pd.merge(left1, right1, left_on= 'key', right_index=True)

Unnamed: 0,key,value,group_val
0,a,0,3.5
2,a,2,3.5
3,a,3,3.5
1,b,1,7.0
4,b,4,7.0


Since the default merge method is to intersect the join keys, we can instead form the union of them with an outer join:

In [191]:
pd.merge(left1, right1, left_on = 'key', right_index=True, how='outer')

Unnamed: 0,key,value,group_val
0,a,0,3.5
2,a,2,3.5
3,a,3,3.5
1,b,1,7.0
4,b,4,7.0
5,c,5,


With hierarchically indexed data, things are more complicated, as joining on index is implicitly a multiple-key merge:

In [192]:
lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio',
                              'Nevada', 'Neada'],
                     'key2': [2000,2001,2002,2001,2002],
                     'data': np.arange(5.)})

In [193]:
righth =pd.DataFrame(np.arange(12).reshape((6,2)),
                    index = [['Nevada', 'Nevada', 'Ohio', 'Ohio',
                             'Ohio', 'Ohio'],
                            [2001,2000,2000,2000,2001,2002]],
                    columns = ['event1', 'event2'])

In [194]:
lefth

Unnamed: 0,key1,key2,data
0,Ohio,2000,0.0
1,Ohio,2001,1.0
2,Ohio,2002,2.0
3,Nevada,2001,3.0
4,Neada,2002,4.0


In [195]:
righth

Unnamed: 0,Unnamed: 1,event1,event2
Nevada,2001,0,1
Nevada,2000,2,3
Ohio,2000,4,5
Ohio,2000,6,7
Ohio,2001,8,9
Ohio,2002,10,11


In [196]:
pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)

Unnamed: 0,key1,key2,data,event1,event2
0,Ohio,2000,0.0,4,5
0,Ohio,2000,0.0,6,7
1,Ohio,2001,1.0,8,9
2,Ohio,2002,2.0,10,11
3,Nevada,2001,3.0,0,1


In [197]:
pd.merge(lefth, righth, left_on = ['key1', 'key2'],
        right_index = True, how='outer')

Unnamed: 0,key1,key2,data,event1,event2
0,Ohio,2000,0.0,4.0,5.0
0,Ohio,2000,0.0,6.0,7.0
1,Ohio,2001,1.0,8.0,9.0
2,Ohio,2002,2.0,10.0,11.0
3,Nevada,2001,3.0,0.0,1.0
4,Neada,2002,4.0,,
4,Nevada,2000,,2.0,3.0


Using the indexes of both sides of the merge is also possible:

In [198]:
left2 = pd.DataFrame([[1., 2.], [3.,4.],[5.,6.]],
                    index = ['a', 'c', 'e'],
                    columns = ['Ohio', 'Nevada'])

In [199]:
right2 = pd.DataFrame([[7., 8.], [9.,10.], [11., 12.], [13,14]],
                     index = ['b','c', 'd', 'e'],
                     columns = ['MIssouri', 'Alabama'])

In [200]:
left2

Unnamed: 0,Ohio,Nevada
a,1.0,2.0
c,3.0,4.0
e,5.0,6.0


In [201]:
right2

Unnamed: 0,MIssouri,Alabama
b,7.0,8.0
c,9.0,10.0
d,11.0,12.0
e,13.0,14.0


In [202]:
pd.merge(left2, right2, how='outer', left_index = True, right_index=True)

Unnamed: 0,Ohio,Nevada,MIssouri,Alabama
a,1.0,2.0,,
b,,,7.0,8.0
c,3.0,4.0,9.0,10.0
d,,,11.0,12.0
e,5.0,6.0,13.0,14.0


DataFrame has a convenient join instance method for merging by index. It can also be used to combine together many DataFrame objects having the same or similar indexes but non-overlapping columns.

In [203]:
left2

Unnamed: 0,Ohio,Nevada
a,1.0,2.0
c,3.0,4.0
e,5.0,6.0


In [204]:
right2

Unnamed: 0,MIssouri,Alabama
b,7.0,8.0
c,9.0,10.0
d,11.0,12.0
e,13.0,14.0


In [205]:
left2.join(right2, how='outer')

Unnamed: 0,Ohio,Nevada,MIssouri,Alabama
a,1.0,2.0,,
b,,,7.0,8.0
c,3.0,4.0,9.0,10.0
d,,,11.0,12.0
e,5.0,6.0,13.0,14.0


In part for legacy reasons, DataFrame's join method performs a left join on the join keys by default, exactly preserving the left frame's row index. It also supports joining the index of the passed DataFrame on one of the columns of the calling DataFrame:

In [206]:
left1

Unnamed: 0,key,value
0,a,0
1,b,1
2,a,2
3,a,3
4,b,4
5,c,5


In [207]:
right1

Unnamed: 0,group_val
a,3.5
b,7.0


In [208]:
left1.join(right1, on='key')

Unnamed: 0,key,value,group_val
0,a,0,3.5
1,b,1,7.0
2,a,2,3.5
3,a,3,3.5
4,b,4,7.0
5,c,5,
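The call above is equivalent to a left merge of the key column against the other frame's index. The sketch below compares the two on the same left1 and right1; the names joined and merged are illustrative.

```python
import pandas as pd

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                      'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]},
                      index=['a', 'b'])

# join defaults to a left join and keeps left1's row index.
joined = left1.join(right1, on='key')

# The same operation spelled out with merge.
merged = pd.merge(left1, right1, left_on='key',
                  right_index=True, how='left')

print(joined.equals(merged))
```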


Lastly, for simple index-on-index merges, we can pass a list of DataFrames to join as an alternative to using the more general <b>concat</b> function described in the next section:

In [209]:
another = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.],[16.,17.]],
                      index=['a', 'c', 'e', 'f'],
                      columns = ['New York', 'Oregon'])

In [210]:
another

Unnamed: 0,New York,Oregon
a,7.0,8.0
c,9.0,10.0
e,11.0,12.0
f,16.0,17.0


In [211]:
left2.join([right2, another])

Unnamed: 0,Ohio,Nevada,MIssouri,Alabama,New York,Oregon
a,1.0,2.0,,,7.0,8.0
c,3.0,4.0,9.0,10.0,9.0,10.0
e,5.0,6.0,13.0,14.0,11.0,12.0


In [212]:
left2.join([right2, another], how='outer')

Unnamed: 0,Ohio,Nevada,MIssouri,Alabama,New York,Oregon
a,1.0,2.0,,,7.0,8.0
c,3.0,4.0,9.0,10.0,9.0,10.0
e,5.0,6.0,13.0,14.0,11.0,12.0
b,,,7.0,8.0,,
d,,,11.0,12.0,,
f,,,,,16.0,17.0


<h3>Concatenating Along an Axis</h3>

Another kind of data combination operation is referred to interchangeably as concatenation, binding, or stacking. NumPy's concatenate function can do this with NumPy arrays:


In [213]:
arr = np.arange(12).reshape((3,4))

In [214]:
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [215]:
np.concatenate([arr, arr], axis = 1)

array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])

In [216]:
np.concatenate([arr, arr, arr], axis = 0)

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In the context of pandas objects such as Series and DataFrame, having labeled axes enables us to further generalize array concatenation. In particular, we have a number of additional things to think about:
<ul>
    <li>If the objects are indexed differently on the other axes, should we combine the distinct elements in these axes or use only the shared values (the intersection)?</li>
    <li>Do the concatenated chunks of data need to be identifiable in the resulting object?</li>
    <li>Does the "concatenation axis" contain data that needs to be preserved? In many cases, the default integer labels in a DataFrame are best discarded during concatenation.</li>
</ul>

The <b>concat</b> function in pandas provides a consistent way to address each of these concerns.

In [217]:
s1 = pd.Series([0,1], index = ['a', 'b'])

In [218]:
s2 = pd.Series([2,3,4], index = ['c', 'd', 'e'])

In [219]:
s3 = pd.Series([5,6], index=['f', 'g'])

In [220]:
pd.concat([s1, s2, s3])

a    0
b    1
c    2
d    3
e    4
f    5
g    6
dtype: int64

By default concat works along axis = 0, producing another Series. If we pass axis = 1, the result will instead be a DataFrame:

In [221]:
pd.concat([s1,s2,s3], axis = 1)

Unnamed: 0,0,1,2
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


In this case there is no overlap on the other axis, which as we can see is the sorted union (the 'outer' join) of the indexes. We can instead intersect them by passing join = 'inner':

In [222]:
s4 = pd.concat([s1,s3])

In [223]:
s4

a    0
b    1
f    5
g    6
dtype: int64

In [224]:
pd.concat([s1,s4], axis = 1)

Unnamed: 0,0,1
a,0.0,0
b,1.0,1
f,,5
g,,6


In [225]:
pd.concat([s1,s4], axis = 1, join = 'inner')

Unnamed: 0,0,1
a,0,0
b,1,1


In this example, the 'f' and 'g' labels disappeared because of the join = 'inner' option.
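Older pandas releases also let us specify the labels to be used on the other axes with a join_axes argument; it was removed in pandas 1.0, and reindexing the concatenated result is the usual replacement.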
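One way to choose the labels on the other axis without join_axes (which pandas 1.0 and later no longer accept) is to reindex the concatenated result. A minimal sketch, assuming the s1 and s4 Series from above; the label list wanted is an arbitrary example and includes labels found in neither input.

```python
import pandas as pd

s1 = pd.Series([0, 1], index=['a', 'b'])
s3 = pd.Series([5, 6], index=['f', 'g'])
s4 = pd.concat([s1, s3])

# Concatenate first, then choose the row labels explicitly.
# 'c' and 'e' appear in neither input, so their rows are all NaN.
wanted = ['a', 'c', 'b', 'e']
result = pd.concat([s1, s4], axis=1).reindex(wanted)

print(result)
```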

A potential issue is that the concatenated pieces are not identifiable in the result. Suppose instead we wanted to create a hierarchical index on the concatenation axis. To do this, use the <b>keys</b> argument:

In [226]:
result = pd.concat([s1,s2,s3], keys = ['one', 'two','three'])

In [227]:
result

one    a    0
       b    1
two    c    2
       d    3
       e    4
three  f    5
       g    6
dtype: int64

In [228]:
result.unstack()

Unnamed: 0,a,b,c,d,e,f,g
one,0.0,1.0,,,,,
two,,,2.0,3.0,4.0,,
three,,,,,,5.0,6.0


In the case of combining Series along axis = 1, the keys become the DataFrame column headers:

In [229]:
pd.concat([s1,s2,s3], axis = 1, keys=['one', 'two', 'three'])

Unnamed: 0,one,two,three
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


The same logic extends to DataFrame objects:

In [230]:
df1 = pd.DataFrame(np.arange(6).reshape(3,2), index = ['a', 'b', 'c'],
                  columns = ['one', 'two'])

In [231]:
df2 = pd.DataFrame(5+np.arange(4).reshape(2,2), index = ['a', 'c'],
                  columns = ['three', 'four'])

In [232]:
df1

Unnamed: 0,one,two
a,0,1
b,2,3
c,4,5


In [233]:
df2

Unnamed: 0,three,four
a,5,6
c,7,8


In [234]:
pd.concat([df1, df2], axis = 1, keys = ['level1', 'level2'])

Unnamed: 0_level_0,level1,level1,level2,level2
Unnamed: 0_level_1,one,two,three,four
a,0,1,5.0,6.0
b,2,3,,
c,4,5,7.0,8.0


If we pass a dict of objects instead of a list, the dict's keys will be used for the keys option:

In [235]:
pd.concat({'level1':df1, 'level2': df2}, axis = 1)

Unnamed: 0_level_0,level1,level1,level2,level2
Unnamed: 0_level_1,one,two,three,four
a,0,1,5.0,6.0
b,2,3,,
c,4,5,7.0,8.0
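The levels created by keys are unnamed by default. concat also takes a names argument for labelling them, as in the sketch below; the level names upper and lower are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                   columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(4).reshape(2, 2), index=['a', 'c'],
                   columns=['three', 'four'])

# keys labels each input; names labels the resulting column levels.
result = pd.concat([df1, df2], axis=1,
                   keys=['level1', 'level2'],
                   names=['upper', 'lower'])

print(result)
```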


A last consideration concerns DataFrames in which the row index does not contain any relevant data:

In [236]:
df1 = pd.DataFrame(np.random.randn(3,4), columns= ['a', 'b', 'c', 'd'])

In [237]:
df2 = pd.DataFrame(np.random.randn(2,3), columns=['b', 'd', 'a'])

In [238]:
df1

Unnamed: 0,a,b,c,d
0,0.491575,0.367485,2.308345,0.388806
1,-1.093813,0.098881,-0.66372,-0.083214
2,1.829689,-0.716236,0.123839,-0.800059


In [239]:
df2

Unnamed: 0,b,d,a
0,0.173985,0.705374,-0.36237
1,1.117323,1.800972,1.102946


In this case, we can pass <b>ignore_index=True</b>:

In [240]:
pd.concat([df1,df2], ignore_index=True)

Unnamed: 0,a,b,c,d
0,0.491575,0.367485,2.308345,0.388806
1,-1.093813,0.098881,-0.66372,-0.083214
2,1.829689,-0.716236,0.123839,-0.800059
3,-0.36237,0.173985,,0.705374
4,1.102946,1.117323,,1.800972


![alt Text](Images/DataWrangling/dw_concat1.png)

![alt Text](Images/DataWrangling/dw_concat2.png)

<h3>Combining Data with Overlap</h3>

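There is another data combination situation that can't be expressed as either a merge or concatenation operation. We may have two datasets whose indexes overlap in full or part. As a motivating example, consider NumPy's where function, which performs the array-oriented equivalent of an if-else expression: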

In [241]:
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
             index = ['f', 'e', 'd', 'c', 'b', 'a'])

In [242]:
b = pd.Series(np.arange(len(a), dtype=np.float64),
             index = ['f', 'e', 'd', 'c', 'b', 'a'])

In [243]:
b.iloc[-1] = np.nan

In [244]:
a

f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64

In [245]:
b

f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    NaN
dtype: float64

In [246]:
np.where(pd.isnull(a), b, a)

array([0. , 2.5, 2. , 3.5, 4.5, nan])

Series has a <b>combine_first</b> method, which performs the equivalent of this operation along with pandas's usual data alignment logic:

In [308]:
b[:-2]

f    0.0
e    1.0
d    2.0
c    3.0
dtype: float64

In [247]:
b[:-2].combine_first(a[2:])

a    NaN
b    4.5
c    3.0
d    2.0
e    1.0
f    0.0
dtype: float64
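When the two Series share the same index, combine_first gives the same values as filling a's missing entries from b; the alignment logic matters once the indexes differ. The sketch below shows both cases with the a and b values from above (b is written out literally instead of being built with arange); combined, filled and partial are illustrative names.

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0, np.nan],
              index=['f', 'e', 'd', 'c', 'b', 'a'])

# Values from a where present, otherwise from b.
combined = a.combine_first(b)

# With identical indexes, fillna with a Series gives the same values.
filled = a.fillna(b)

# With different indexes, the result covers the union of the labels.
partial = b[:-2].combine_first(a[2:])

print(combined)
print(partial)
```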

With DataFrames, <b>combine_first</b> does the same thing column by column, so we can think of it as "patching" missing data in the calling object with data from the object we pass:

In [248]:
df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],
                   'b': [np.nan, 2., np.nan, 6.],
                   'c': range(2,18,4)})

In [249]:
df2 = pd.DataFrame({'a': [5., 5., np.nan, 3., 7.],
                   'b': [np.nan, 3., 4., 6., 8.]})

In [250]:
df1

Unnamed: 0,a,b,c
0,1.0,,2
1,,2.0,6
2,5.0,,10
3,,6.0,14


In [251]:
df2

Unnamed: 0,a,b
0,5.0,
1,5.0,3.0
2,,4.0
3,3.0,6.0
4,7.0,8.0


In [252]:
df1.combine_first(df2)

Unnamed: 0,a,b,c
0,1.0,,2.0
1,5.0,2.0,6.0
2,5.0,4.0,10.0
3,3.0,6.0,14.0
4,7.0,8.0,


<h3>Reshaping and Pivoting</h3>

There are a number of basic operations for rearranging tabular data. These are alternately referred to as <i>reshape</i> or <i>pivot</i> operations.

<h3>Reshaping with Hierarchical Indexing</h3>

Hierarchical indexing provides a consistent way to rearrange data in a DataFrame. There are two primary actions:
<ul>
    <li>stack - This "rotates" or pivots from the columns in the data to the rows.</li>
    <li>unstack - This pivots from the rows into the columns.</li>
</ul>

In [253]:
data = pd.DataFrame(np.arange(6).reshape((2,3)),
                   index = pd.Index(['Ohio', 'Colorado'], name = 'state'),
                   columns = pd.Index(['one', 'two', 'three'],
                                     name = 'number'))

In [254]:
data

number,one,two,three
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,0,1,2
Colorado,3,4,5


Using the stack method on this data pivots the columns into the rows, producing a Series:

In [255]:
result = data.stack()

In [256]:
result

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int32

From a hierarchically indexed Series, we can rearrange the data back into a DataFrame with unstack:

In [257]:
result.unstack()

number,one,two,three
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,0,1,2
Colorado,3,4,5
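The round trip can be checked directly. The sketch below rebuilds the small frame, stacks it, unstacks it, and compares the values label by label; it assumes no missing data, where stack and unstack are exact inverses. The names stacked and restored are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                    index=pd.Index(['Ohio', 'Colorado'], name='state'),
                    columns=pd.Index(['one', 'two', 'three'], name='number'))

# Columns move into an inner index level, giving a Series.
stacked = data.stack()

# The inner index level moves back out into the columns.
restored = stacked.unstack()

# Compare by label so the check does not depend on row or column order.
same = (restored.loc[data.index, data.columns].values == data.values).all()
print(stacked.index.names, same)
```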


By default the innermost level is unstacked (same with stack). You can unstack a different level by passing a level number or name:

In [258]:
result.unstack(0)

state,Ohio,Colorado
number,Unnamed: 1_level_1,Unnamed: 2_level_1
one,0,3
two,1,4
three,2,5


In [259]:
result.unstack('state')

state,Ohio,Colorado
number,Unnamed: 1_level_1,Unnamed: 2_level_1
one,0,3
two,1,4
three,2,5


Unstacking might introduce missing data if all of the values in the level aren't found in each of the subgroups:

In [260]:
s1 = pd.Series([0,1,2,3], index = ['a', 'b', 'c', 'd'])

In [261]:
s2 = pd.Series([4,5,6], index=['c', 'd', 'e'])

In [262]:
data2 = pd.concat([s1, s2], keys = ['one', 'two'])

In [263]:
data2

one  a    0
     b    1
     c    2
     d    3
two  c    4
     d    5
     e    6
dtype: int64

In [264]:
data2.unstack()

Unnamed: 0,a,b,c,d,e
one,0.0,1.0,2.0,3.0,
two,,,4.0,5.0,6.0


Stacking filters out missing data by default, so the operation is more easily invertible:

In [265]:
data2.unstack()

Unnamed: 0,a,b,c,d,e
one,0.0,1.0,2.0,3.0,
two,,,4.0,5.0,6.0


In [266]:
data2.unstack().stack()

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    4.0
     d    5.0
     e    6.0
dtype: float64

In [267]:
data2.unstack().stack(dropna = False)

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    4.0
     d    5.0
     e    6.0
dtype: float64

When we unstack in a DataFrame, the level unstacked becomes the lowest level in the result:

In [268]:
df = pd.DataFrame({'left': result, 'right': result + 5}, 
                 columns = pd.Index(['left', 'right'], name = 'side'))

In [269]:
df

Unnamed: 0_level_0,side,left,right
state,number,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,one,0,5
Ohio,two,1,6
Ohio,three,2,7
Colorado,one,3,8
Colorado,two,4,9
Colorado,three,5,10


In [270]:
df.unstack('state')

side,left,left,right,right
state,Ohio,Colorado,Ohio,Colorado
number,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
one,0,3,5,8
two,1,4,6,9
three,2,5,7,10


When calling stack, we can indicate the name of the axis to stack:

In [279]:
df.unstack('state').stack('side')

Unnamed: 0_level_0,state,Colorado,Ohio
number,side,Unnamed: 2_level_1,Unnamed: 3_level_1
one,left,3,0
one,right,8,5
two,left,4,1
two,right,9,6
three,left,5,2
three,right,10,7


<h3>Pivoting "Long" to "Wide" Format</h3>

A common way to store multiple time series in databases and CSV files is in so-called long or stacked format. Let's load some example data and do a small amount of time series wrangling and other data cleaning:

In [280]:
data = pd.read_csv('pydata-book-2nd-edition/examples/macrodata.csv')

In [281]:
data.head()

Unnamed: 0,year,quarter,realgdp,realcons,realinv,realgovt,realdpi,cpi,m1,tbilrate,unemp,pop,infl,realint
0,1959.0,1.0,2710.349,1707.4,286.898,470.045,1886.9,28.98,139.7,2.82,5.8,177.146,0.0,0.0
1,1959.0,2.0,2778.801,1733.7,310.859,481.301,1919.7,29.15,141.7,3.08,5.1,177.83,2.34,0.74
2,1959.0,3.0,2775.488,1751.8,289.226,491.26,1916.4,29.35,140.5,3.82,5.3,178.657,2.74,1.09
3,1959.0,4.0,2785.204,1753.7,299.356,484.052,1931.3,29.37,140.0,4.33,5.6,179.386,0.27,4.06
4,1960.0,1.0,2847.699,1770.5,331.722,462.199,1955.5,29.54,139.6,3.5,5.2,180.007,2.31,1.19


In [282]:
periods = pd.PeriodIndex(year = data.year, quarter = data.quarter, name='date')

In [283]:
columns = pd.Index(['realgdp', 'infl', 'unemp'], name = 'item')

In [284]:
data = data.reindex(columns=columns)

In [285]:
data.index = periods.to_timestamp('D', 'end')

In [286]:
ldata = data.stack().reset_index().rename(columns = {0: 'value'})

In [287]:
ldata[:10]

Unnamed: 0,date,item,value
0,1959-03-31 23:59:59.999999999,realgdp,2710.349
1,1959-03-31 23:59:59.999999999,infl,0.0
2,1959-03-31 23:59:59.999999999,unemp,5.8
3,1959-06-30 23:59:59.999999999,realgdp,2778.801
4,1959-06-30 23:59:59.999999999,infl,2.34
5,1959-06-30 23:59:59.999999999,unemp,5.1
6,1959-09-30 23:59:59.999999999,realgdp,2775.488
7,1959-09-30 23:59:59.999999999,infl,2.74
8,1959-09-30 23:59:59.999999999,unemp,5.3
9,1959-12-31 23:59:59.999999999,realgdp,2785.204


This is the so-called long format for multiple time series, or other observational data with two or more keys. Each row in the table represents a single observation.
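
To see the shape change without the CSV file, here is a minimal sketch on a two-quarter frame. The name `wide` and the use of `pd.period_range` are assumptions made for this illustration; the values are copied from the first two rows shown earlier. On some pandas versions `stack` may emit a FutureWarning.

```python
import pandas as pd

# Two quarters of data in wide form: one column per item
wide = pd.DataFrame(
    {'realgdp': [2710.349, 2778.801],
     'infl': [0.0, 2.34],
     'unemp': [5.8, 5.1]},
    index=pd.period_range('1959Q1', periods=2, freq='Q', name='date'))
wide.columns.name = 'item'

# stack moves the column labels into the rows; reset_index turns both
# index levels into ordinary columns, and the unnamed values column (0)
# is renamed to 'value'
long = wide.stack().reset_index().rename(columns={0: 'value'})
print(long)
```

Each (date, item) pair now occupies its own row, so two quarters times three items yields six observations.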

Data is frequently stored this way in relational databases like MySQL, as a fixed schema (column names and data types) allows the number of distinct values in the item column to change as data is added to the table. In the previous example, date and item would usually be the primary keys, offering both relational integrity and easier joins. In some cases, the data may be more difficult to work with in this format; we might prefer to have a DataFrame containing one column per distinct item value indexed by timestamps in the date column. DataFrame's pivot method performs exactly this transformation.

In [288]:
pivoted = ldata.pivot(index='date', columns='item', values='value')

In [289]:
pivoted

item,infl,realgdp,unemp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-03-31 23:59:59.999999999,0.00,2710.349,5.8
1959-06-30 23:59:59.999999999,2.34,2778.801,5.1
1959-09-30 23:59:59.999999999,2.74,2775.488,5.3
1959-12-31 23:59:59.999999999,0.27,2785.204,5.6
1960-03-31 23:59:59.999999999,2.31,2847.699,5.2
...,...,...,...
2008-09-30 23:59:59.999999999,-3.16,13324.600,6.0
2008-12-31 23:59:59.999999999,-8.79,13141.920,6.9
2009-03-31 23:59:59.999999999,0.94,12925.410,8.1
2009-06-30 23:59:59.999999999,3.37,12901.504,9.2


The first two arguments, index and columns, are the columns to be used as the row and column index, respectively; the last, values, is an optional value column to fill the DataFrame. Suppose we had two value columns that we wanted to reshape simultaneously:

In [290]:
ldata['value2'] = np.random.randn(len(ldata))

In [291]:
ldata[:10]

Unnamed: 0,date,item,value,value2
0,1959-03-31 23:59:59.999999999,realgdp,2710.349,1.33037
1,1959-03-31 23:59:59.999999999,infl,0.0,-1.078272
2,1959-03-31 23:59:59.999999999,unemp,5.8,-1.275206
3,1959-06-30 23:59:59.999999999,realgdp,2778.801,2.18427
4,1959-06-30 23:59:59.999999999,infl,2.34,-1.535845
5,1959-06-30 23:59:59.999999999,unemp,5.1,-1.043699
6,1959-09-30 23:59:59.999999999,realgdp,2775.488,0.03412
7,1959-09-30 23:59:59.999999999,infl,2.74,-0.393209
8,1959-09-30 23:59:59.999999999,unemp,5.3,0.213516
9,1959-12-31 23:59:59.999999999,realgdp,2785.204,1.395165


By omitting the last argument (values), we obtain a DataFrame with hierarchical columns:

In [292]:
pivoted = ldata.pivot(index='date', columns='item')

In [293]:
pivoted[:5]

Unnamed: 0_level_0,value,value,value,value2,value2,value2
item,infl,realgdp,unemp,infl,realgdp,unemp
date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
1959-03-31 23:59:59.999999999,0.0,2710.349,5.8,-1.078272,1.33037,-1.275206
1959-06-30 23:59:59.999999999,2.34,2778.801,5.1,-1.535845,2.18427,-1.043699
1959-09-30 23:59:59.999999999,2.74,2775.488,5.3,-0.393209,0.03412,0.213516
1959-12-31 23:59:59.999999999,0.27,2785.204,5.6,0.943527,1.395165,1.050299
1960-03-31 23:59:59.999999999,2.31,2847.699,5.2,-0.25173,-1.876728,-0.938952


In [294]:
pivoted['value'][:5]

item,infl,realgdp,unemp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-03-31 23:59:59.999999999,0.0,2710.349,5.8
1959-06-30 23:59:59.999999999,2.34,2778.801,5.1
1959-09-30 23:59:59.999999999,2.74,2775.488,5.3
1959-12-31 23:59:59.999999999,0.27,2785.204,5.6
1960-03-31 23:59:59.999999999,2.31,2847.699,5.2
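
One caveat worth noting: pivot expects each (index, columns) pair to appear at most once. The sketch below uses a small made-up table (the dates and values are hypothetical) in which one pair is duplicated, shows that pivot raises a ValueError, and falls back to pivot_table, which aggregates duplicates. Using the mean as the aggregation is an assumption for this example only.

```python
import pandas as pd

# Hypothetical long-format data; ('2020-01', 'infl') appears twice
long = pd.DataFrame({
    'date': ['2020-01', '2020-01', '2020-01', '2020-02'],
    'item': ['infl', 'infl', 'unemp', 'infl'],
    'value': [1.0, 3.0, 5.0, 2.0],
})

try:
    long.pivot(index='date', columns='item', values='value')
    pivot_failed = False
except ValueError as exc:
    # pivot cannot decide which of the duplicated values to keep
    pivot_failed = True
    print('pivot failed:', exc)

# pivot_table resolves duplicates by aggregating them
wide = long.pivot_table(index='date', columns='item',
                        values='value', aggfunc='mean')
print(wide)
```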


Note that pivot is equivalent to creating a hierarchical index using set_index followed by a call to unstack:

In [295]:
unstacked = ldata.set_index(['date', 'item']).unstack('item')

In [296]:
unstacked[:7]

Unnamed: 0_level_0,value,value,value,value2,value2,value2
item,infl,realgdp,unemp,infl,realgdp,unemp
date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2
1959-03-31 23:59:59.999999999,0.0,2710.349,5.8,-1.078272,1.33037,-1.275206
1959-06-30 23:59:59.999999999,2.34,2778.801,5.1,-1.535845,2.18427,-1.043699
1959-09-30 23:59:59.999999999,2.74,2775.488,5.3,-0.393209,0.03412,0.213516
1959-12-31 23:59:59.999999999,0.27,2785.204,5.6,0.943527,1.395165,1.050299
1960-03-31 23:59:59.999999999,2.31,2847.699,5.2,-0.25173,-1.876728,-0.938952
1960-06-30 23:59:59.999999999,0.14,2834.39,5.2,-0.678275,0.427837,1.241551
1960-09-30 23:59:59.999999999,2.7,2839.022,5.6,-1.18451,0.656564,0.299093
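
The equivalence can be checked directly on a small frame. This is a sketch with a hand-built table (plain strings stand in for the timestamps, and the values are taken from the first two quarters above); it compares pivot with a single value column against set_index followed by unstack.

```python
import pandas as pd

long = pd.DataFrame({
    'date': ['1959-03-31'] * 3 + ['1959-06-30'] * 3,
    'item': ['realgdp', 'infl', 'unemp'] * 2,
    'value': [2710.349, 0.0, 5.8, 2778.801, 2.34, 5.1],
})

# Route 1: pivot with explicit index, columns and values
via_pivot = long.pivot(index='date', columns='item', values='value')

# Route 2: build a two-level index, select the value column, unstack 'item'
via_unstack = long.set_index(['date', 'item'])['value'].unstack('item')

same = via_pivot.equals(via_unstack)
print(same)
```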


<h3>Pivoting "Wide" to "Long" Format</h3>

An inverse operation to pivot for DataFrames is <b>pandas.melt</b>. Rather than transforming one column into many in a new DataFrame, it merges multiple columns into one, producing a DataFrame that is longer than the input.

In [297]:
df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                  'A': [1,2,3],
                  'B': [4,5,6],
                  'C': [7,8,9]})

In [298]:
df

Unnamed: 0,key,A,B,C
0,foo,1,4,7
1,bar,2,5,8
2,baz,3,6,9


The 'key' column may be a group indicator, and the other columns are data values. When using pandas.melt, we must indicate which columns (if any) are group indicators. Let's use 'key' as the only group indicator here:

In [299]:
melted = pd.melt(df, ['key'])

In [300]:
melted

Unnamed: 0,key,variable,value
0,foo,A,1
1,bar,A,2
2,baz,A,3
3,foo,B,4
4,bar,B,5
5,baz,B,6
6,foo,C,7
7,bar,C,8
8,baz,C,9


Using pivot, we can reshape back to the original layout:

In [301]:
reshaped = melted.pivot(index='key', columns='variable', values='value')

In [302]:
reshaped

variable,A,B,C
key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,2,5,8
baz,3,6,9
foo,1,4,7


Since the result of pivot creates an index from the column used as the row labels, we may want to use reset_index to move the data back into a column:

In [303]:
reshaped.reset_index()

variable,key,A,B,C
0,bar,2,5,8
1,baz,3,6,9
2,foo,1,4,7
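
After reset_index, two small differences from the original frame remain: the columns still carry the name 'variable', and the rows are sorted by key. A possible way to undo both is sketched below; the name `restored` is introduced here for illustration, and the approach assumes the keys are unique.

```python
import pandas as pd

df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                   'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})

melted = pd.melt(df, id_vars=['key'])
restored = melted.pivot(index='key', columns='variable',
                        values='value').reset_index()

# Drop the leftover 'variable' label on the columns
restored.columns.name = None

# Put the rows back in the original key order (requires unique keys)
restored = restored.set_index('key').reindex(df['key']).reset_index()
print(restored)
```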


We can also specify a subset of columns to use as value columns:

In [304]:
pd.melt(df, id_vars=['key'], value_vars=['A', 'B'])

Unnamed: 0,key,variable,value
0,foo,A,1
1,bar,A,2
2,baz,A,3
3,foo,B,4
4,bar,B,5
5,baz,B,6
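
The default output column names 'variable' and 'value' can be replaced with the `var_name` and `value_name` arguments of pandas.melt. The names 'item' and 'reading' below are arbitrary choices for this sketch.

```python
import pandas as pd

df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                   'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})

# Melt only A and B, and give the two generated columns custom names
named = pd.melt(df, id_vars=['key'], value_vars=['A', 'B'],
                var_name='item', value_name='reading')
print(named)
```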


<b>pandas.melt</b> can be used without any group identifiers, too:

In [305]:
pd.melt(df, value_vars=['A', 'B', 'C'])

Unnamed: 0,variable,value
0,A,1
1,A,2
2,A,3
3,B,4
4,B,5
5,B,6
6,C,7
7,C,8
8,C,9


In [306]:
pd.melt(df, value_vars=['key' ,'A', 'B'])

Unnamed: 0,variable,value
0,key,foo
1,key,bar
2,key,baz
3,A,1
4,A,2
5,A,3
6,B,4
7,B,5
8,B,6
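
Mixing a string column such as 'key' into value_vars has a side effect on the result's dtype, which the following sketch makes visible. It only inspects dtypes, and the variable names `numeric` and `mixed` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key': ['foo', 'bar', 'baz'],
                   'A': [1, 2, 3],
                   'B': [4, 5, 6],
                   'C': [7, 8, 9]})

# Only numeric columns: the melted 'value' column stays an integer column
numeric = pd.melt(df, value_vars=['A', 'B', 'C'])

# Strings and integers share one column, so it falls back to object dtype
mixed = pd.melt(df, value_vars=['key', 'A', 'B'])

print(numeric['value'].dtype)
print(mixed['value'].dtype)
```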


In [307]:
pd.merge?