## Data Wrangling: Clean, Transform, Merge, Reshape

In [2]:
import pandas as pd


## Combining and merging data sets

### Database-style DataFrame merges

In [15]:
df1 = pd.DataFrame({'data1': range(5,12), 'key': list('bbacaab')})
df2 = pd.DataFrame({'data2': range(56,59), 'key': list('abd')})
df2

Unnamed: 0,data2,key
0,56,a
1,57,b
2,58,d


In [16]:
df1.merge(df2)  # inner join by default: keeps only the keys present in both DataFrames

Unnamed: 0,data1,key,data2
0,5,b,57
1,6,b,57
2,11,b,57
3,7,a,56
4,9,a,56
5,10,a,56


In [17]:
df1.merge(df2, how='outer')  # outer join: keeps every key from either DataFrame and fills the gaps with NaN

Unnamed: 0,data1,key,data2
0,5.0,b,57.0
1,6.0,b,57.0
2,11.0,b,57.0
3,7.0,a,56.0
4,9.0,a,56.0
5,10.0,a,56.0
6,8.0,c,
7,,d,58.0


Whatever the join type, rows are matched on the key. When a key is duplicated in both DataFrames, the result contains the Cartesian product of the matching rows, that is, every combination of a left row and a right row that share that key:

In [19]:
df3 = pd.DataFrame({'data2': range(56,61), 'key': list('abdbd')})  # 'b' and 'd' appear twice in df3
df3.merge(df1)

Unnamed: 0,data2,key,data1
0,56,a,7
1,56,a,9
2,56,a,10
3,57,b,5
4,57,b,6
5,57,b,11
6,59,b,5
7,59,b,6
8,59,b,11
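
The row count of the merge above can be predicted from the key frequencies. The following is a minimal sketch, assuming the same `df1` and `df3` as defined earlier; the names `left_counts`, `right_counts` and `expected_rows` are introduced here only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'data1': range(5, 12), 'key': list('bbacaab')})
df3 = pd.DataFrame({'data2': range(56, 61), 'key': list('abdbd')})

# How many times each key occurs on each side
left_counts = df1['key'].value_counts()
right_counts = df3['key'].value_counts()

# For a key present on both sides, an inner merge yields
# (occurrences on the left) * (occurrences on the right) rows.
# Keys present on one side only give NaN here and are dropped.
expected_rows = int((left_counts * right_counts).dropna().sum())

merged = df3.merge(df1)
print(expected_rows, len(merged))
```

Both numbers should agree: 'a' contributes 3 × 1 rows and 'b' contributes 3 × 2 rows, while 'c' and 'd' have no partner and contribute none.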


If the columns to join on don't have the same name, or we want to join on the index of one of the DataFrames, we need to say so explicitly with `left_on`/`right_on` or `left_index`/`right_index`.

In [21]:
df4 = pd.DataFrame({'data2': range(56,61), 'rkey': list('abdbd')})
df1.merge(df4 , left_on='key', right_on='rkey')

Unnamed: 0,data1,key,data2,rkey
0,5,b,57,b
1,5,b,59,b
2,6,b,57,b
3,6,b,59,b
4,11,b,57,b
5,11,b,59,b
6,7,a,56,a
7,9,a,56,a
8,10,a,56,a


If both DataFrames have a column with the same name that is not part of the join key, both columns are kept in the result and a suffix is appended to each name (`_x` and `_y` by default). The `suffixes` argument customizes them.
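
As a small illustration of the suffix behaviour, the sketch below uses two hypothetical DataFrames, `left` and `right`, that are not part of this notebook. Both carry a `value` column that is not used as the join key.

```python
import pandas as pd

# Hypothetical frames: 'value' exists on both sides but is not the join key
left = pd.DataFrame({'key': list('abc'), 'value': [1, 2, 3]})
right = pd.DataFrame({'key': list('abd'), 'value': [10, 20, 40]})

# Default suffixes: '_x' for the left frame, '_y' for the right frame
default = left.merge(right, on='key')
print(default.columns.tolist())

# Custom suffixes are passed as a (left, right) pair
custom = left.merge(right, on='key', suffixes=('_left', '_right'))
print(custom.columns.tolist())
```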

### Merging on index

In [23]:
df4.index = range(5,10)
df4

Unnamed: 0,data2,rkey
5,56,a
6,57,b
7,58,d
8,59,b
9,60,d


In [24]:
df1.merge(df4, left_on='data1', right_index=True)

Unnamed: 0,data1,key,data2,rkey
0,5,b,56,a
1,6,b,57,b
2,7,a,58,d
3,8,c,59,b
4,9,a,60,d


### Concatenating along an axis

In [28]:
pd.concat([df1,df2])
pd.concat([df1['data1'],df2['data2']], axis=1)


Unnamed: 0,data1,data2
0,5,56.0
1,6,57.0
2,7,58.0
3,8,
4,9,
5,10,
6,11,


#### Digression

Attention! Be careful not to assign to built-in names or to module attributes such as functions: the assignment silently replaces the original object.

In [30]:
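pd.concat = [df1,df2]  # mistake: rebinds pd.concat to a list, so the function is no longer reachable

from importlib import reload
del pd.concat  # removes the attribute, but does not bring the function back
reload(pd)  # re-executes the module, which defines pd.concat again
pd.concat([df3,df4])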


Unnamed: 0,data2,key,rkey
0,56,a,
1,57,b,
2,58,d,
3,59,b,
4,60,d,
5,56,,a
6,57,,b
7,58,,d
8,59,,b
9,60,,d


Deleting the overwritten name does not restore the original value. If the name was an attribute of a module, such as a function, you need to reload() the module: a plain import does not run an already imported module again, because Python caches it in sys.modules. reload() is also useful while you are developing your own module and want to load the latest definition of a function into memory.
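
The same sequence can be reproduced with a small standard-library module. This is only a sketch: `json` is used here as a stand-in for pandas because it is quick to reload, and the flag names are introduced for illustration.

```python
import importlib
import json

json.dumps = "not a function anymore"   # accidental overwrite of a module attribute
del json.dumps                           # the attribute is now missing, not restored
missing_after_del = not hasattr(json, "dumps")

import json                              # no effect: the module is already cached in sys.modules
still_missing = not hasattr(json, "dumps")

importlib.reload(json)                   # re-executes the module's code, which defines dumps again
restored = callable(json.dumps)

print(missing_after_del, still_missing, restored)
```

Note that objects created before the reload keep pointing at the old definitions, so reloading is a convenience for interactive work rather than a general repair tool.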

## Data transformation

### Removing duplicates

In [35]:
df1['key'].drop_duplicates()

0    b
2    a
3    c
Name: key, dtype: object

In [36]:
df1.drop_duplicates(subset='key', keep='last')  # keep the last occurrence of each key instead of the first

Unnamed: 0,data1,key
3,8,c
5,10,a
6,11,b


### Renaming axis indexes

In [38]:
df1.index = list('abcdefg')
df1

Unnamed: 0,data1,key
a,5,b
b,6,b
c,7,a
d,8,c
e,9,a
f,10,a
g,11,b


### Discretization and binning

In [40]:
import numpy as np
np.random.seed(42)
ages = pd.Series(np.random.randint(9,99,50))

In [42]:
limits = [14, 18, 35, 50, 65]  # bin edges: intervals are right-closed, e.g. (14, 18]; ages outside (14, 65] become NaN
categories = pd.cut(ages, limits)

In [43]:
categories.value_counts()

(18, 35]    9
(50, 65]    7
(35, 50]    5
(14, 18]    1
dtype: int64
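
To make the edge behaviour of `pd.cut` easier to see, here is a minimal sketch on a handful of hand-picked ages instead of the random sample. The values in `sample_ages` and the label names are assumptions chosen for illustration only.

```python
import pandas as pd

sample_ages = pd.Series([10, 14, 15, 18, 35, 36, 65, 70])   # hypothetical values
limits = [14, 18, 35, 50, 65]

# Default: right-closed bins, so 14 itself is excluded and 65 is included
binned = pd.cut(sample_ages, limits)
n_unbinned = int(binned.isna().sum())        # 10, 14 and 70 fall outside every bin

# include_lowest=True also accepts the lowest edge (14)
n_unbinned_lowest = int(pd.cut(sample_ages, limits, include_lowest=True).isna().sum())

# labels replaces the interval notation with one name per bin
labeled = pd.cut(sample_ages, limits, labels=['14-18', '19-35', '36-50', '51-65'])

print(n_unbinned, n_unbinned_lowest)
print(labeled.tolist())
```

This also explains why the counts above do not add up to 50: ages below or equal to 14 and above 65 are not assigned to any bin.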

## String manipulation

### String object methods

In [45]:
bichos = pd.Series(np.random.choice(['Mantis Shrimp', 'Naked Mole Rat', 'Star Nosed Mole'], 15))
bichos

0      Naked Mole Rat
1      Naked Mole Rat
2      Naked Mole Rat
3      Naked Mole Rat
4      Naked Mole Rat
5      Naked Mole Rat
6       Mantis Shrimp
7     Star Nosed Mole
8      Naked Mole Rat
9      Naked Mole Rat
10     Naked Mole Rat
11     Naked Mole Rat
12     Naked Mole Rat
13     Naked Mole Rat
14    Star Nosed Mole
dtype: object

In [46]:
bichos.str.upper()

0      NAKED MOLE RAT
1      NAKED MOLE RAT
2      NAKED MOLE RAT
3      NAKED MOLE RAT
4      NAKED MOLE RAT
5      NAKED MOLE RAT
6       MANTIS SHRIMP
7     STAR NOSED MOLE
8      NAKED MOLE RAT
9      NAKED MOLE RAT
10     NAKED MOLE RAT
11     NAKED MOLE RAT
12     NAKED MOLE RAT
13     NAKED MOLE RAT
14    STAR NOSED MOLE
dtype: object

In [48]:
bichos.str.lower()

0      naked mole rat
1      naked mole rat
2      naked mole rat
3      naked mole rat
4      naked mole rat
5      naked mole rat
6       mantis shrimp
7     star nosed mole
8      naked mole rat
9      naked mole rat
10     naked mole rat
11     naked mole rat
12     naked mole rat
13     naked mole rat
14    star nosed mole
dtype: object

In [49]:
bichos.str.split()

0      [Naked, Mole, Rat]
1      [Naked, Mole, Rat]
2      [Naked, Mole, Rat]
3      [Naked, Mole, Rat]
4      [Naked, Mole, Rat]
5      [Naked, Mole, Rat]
6        [Mantis, Shrimp]
7     [Star, Nosed, Mole]
8      [Naked, Mole, Rat]
9      [Naked, Mole, Rat]
10     [Naked, Mole, Rat]
11     [Naked, Mole, Rat]
12     [Naked, Mole, Rat]
13     [Naked, Mole, Rat]
14    [Star, Nosed, Mole]
dtype: object

In [50]:
bichos.str[:6]

0     Naked 
1     Naked 
2     Naked 
3     Naked 
4     Naked 
5     Naked 
6     Mantis
7     Star N
8     Naked 
9     Naked 
10    Naked 
11    Naked 
12    Naked 
13    Naked 
14    Star N
dtype: object

In [53]:
bichos.str.split().str[1]

0       Mole
1       Mole
2       Mole
3       Mole
4       Mole
5       Mole
6     Shrimp
7      Nosed
8       Mole
9       Mole
10      Mole
11      Mole
12      Mole
13      Mole
14     Nosed
dtype: object