## Data Wrangling: Clean, Transform, Merge, Reshape

In [1]:
import pandas as pd

In [2]:
df1 = pd.DataFrame({'data1': range(7), 'key': list('bbacaab')})
df2 = pd.DataFrame({'data2': range(20, 23), 'key': list('abd')})

In [3]:
df1

Unnamed: 0,data1,key
0,0,b
1,1,b
2,2,a
3,3,c
4,4,a
5,5,a
6,6,b


In [4]:
df2

Unnamed: 0,data2,key
0,20,a
1,21,b
2,22,d


## Combining and merging data sets

### Database-style DataFrame merges

In [5]:
df1.merge(df2, on='key')

Unnamed: 0,data1,key,data2
0,0,b,21
1,1,b,21
2,6,b,21
3,2,a,20
4,4,a,20
5,5,a,20


By default, .merge() performs an [inner join](https://www.w3schools.com/sql/sql_join.asp) between the DataFrames, using the common columns as keys.

That means that, for each key present in both DataFrames, it returns the Cartesian product of the matching rows: if a key is duplicated, every possible combination of left and right rows appears in the result:

In [6]:
df2_wdups = pd.DataFrame({'data2': range(20, 25), 'key': list('abdaa')})
df1.merge(df2_wdups, on='key')

Unnamed: 0,data1,key,data2
0,0,b,21
1,1,b,21
2,6,b,21
3,2,a,20
4,2,a,23
5,2,a,24
6,4,a,20
7,4,a,23
8,4,a,24
9,5,a,20
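
As a rough check of the Cartesian-product behaviour described above, the sketch below rebuilds the same two DataFrames and compares the length of the merged result with the sum, over shared keys, of (rows on the left) x (rows on the right). The names left_counts, right_counts and expected_rows are introduced here for illustration only.

```python
import pandas as pd

df1 = pd.DataFrame({'data1': range(7), 'key': list('bbacaab')})
df2_wdups = pd.DataFrame({'data2': range(20, 25), 'key': list('abdaa')})

merged = df1.merge(df2_wdups, on='key')

# Number of rows per key on each side.
left_counts = df1['key'].value_counts()
right_counts = df2_wdups['key'].value_counts()

# Multiplying aligns on the key; keys missing on one side become NaN
# and are dropped, which mirrors what an inner join does.
expected_rows = int((left_counts * right_counts).dropna().sum())

print(len(merged), expected_rows)
```

Both numbers should agree: key 'a' contributes 3 x 3 rows and key 'b' contributes 3 x 1.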


If the columns to join on don't have the same name, or we want to join on the index of the DataFrames, we'll need to specify that.

In [7]:
df3 = pd.DataFrame({'data1': range(7), 'lkey': list('bbacaab')})
df4 = pd.DataFrame({'data2': range(20, 23), 'rkey': list('abd')})

In [8]:
# This will fail because the DataFrames have no columns in common
# df3.merge(df4)

In [9]:
df3.merge(df4, left_on='lkey', right_on='rkey')

Unnamed: 0,data1,lkey,data2,rkey
0,0,b,21,b
1,1,b,21,b
2,6,b,21,b
3,2,a,20,a
4,4,a,20,a
5,5,a,20,a


In [10]:
df3.merge(df4, left_on='lkey', right_on='rkey', how='outer')

Unnamed: 0,data1,lkey,data2,rkey
0,0.0,b,21.0,b
1,1.0,b,21.0,b
2,6.0,b,21.0,b
3,2.0,a,20.0,a
4,4.0,a,20.0,a
5,5.0,a,20.0,a
6,3.0,c,,
7,,,22.0,d


In [11]:
df3.merge(df4, left_on='lkey', right_on='rkey', how='left')

Unnamed: 0,data1,lkey,data2,rkey
0,0,b,21.0,b
1,1,b,21.0,b
2,2,a,20.0,a
3,3,c,,
4,4,a,20.0,a
5,5,a,20.0,a
6,6,b,21.0,b


In [12]:
df3.merge(df4, left_on='lkey', right_on='rkey', how='right')

Unnamed: 0,data1,lkey,data2,rkey
0,0.0,b,21,b
1,1.0,b,21,b
2,6.0,b,21,b
3,2.0,a,20,a
4,4.0,a,20,a
5,5.0,a,20,a
6,,,22,d
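
To see which side each row of a join came from, one option is the indicator parameter of merge, which adds a _merge column. This is a minimal sketch on the same df3 and df4; the variable names outer and origin are only illustrative.

```python
import pandas as pd

df3 = pd.DataFrame({'data1': range(7), 'lkey': list('bbacaab')})
df4 = pd.DataFrame({'data2': range(20, 23), 'rkey': list('abd')})

# indicator=True adds a '_merge' column with the values
# 'left_only', 'right_only' or 'both'.
outer = df3.merge(df4, left_on='lkey', right_on='rkey',
                  how='outer', indicator=True)

origin = outer['_merge'].value_counts()
print(origin)
```

Rows marked both are the ones an inner join would keep; how='left' keeps both plus left_only, and how='right' keeps both plus right_only.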


If the two DataFrames share a column name that we do not join on, both columns are carried into the resulting DataFrame with a suffix appended (_x and _y by default). We can customize these suffixes.

In [13]:
df1['something'] = 54
df2['something'] = 64

In [14]:
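# Without on=, merge joins on ALL common columns ('key' and 'something').
# 'something' is 54 on one side and 64 on the other, so no rows match.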
df1.merge(df2)

Unnamed: 0,data1,key,something,data2


In [15]:
df1.merge(df2, on='key', suffixes=('_2017', '_2018'))

Unnamed: 0,data1,key,something_2017,data2,something_2018
0,0,b,54,21,64
1,1,b,54,21,64
2,6,b,54,21,64
3,2,a,54,20,64
4,4,a,54,20,64
5,5,a,54,20,64
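
For comparison, here is a small sketch of what happens when suffixes is left out: pandas falls back to its default suffixes, _x for the left frame and _y for the right one. The DataFrames are rebuilt so the block runs on its own.

```python
import pandas as pd

df1 = pd.DataFrame({'data1': range(7), 'key': list('bbacaab'), 'something': 54})
df2 = pd.DataFrame({'data2': range(20, 23), 'key': list('abd'), 'something': 64})

# Same merge as above, without custom suffixes.
default = df1.merge(df2, on='key')
print(list(default.columns))

# Custom suffixes, as in the cell above.
custom = df1.merge(df2, on='key', suffixes=('_2017', '_2018'))
print(list(custom.columns))
```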


### Merging on index

In [16]:
df5 = pd.DataFrame({'Salary': [40000, 50000, 20000]}, index=list('abc'))

In [17]:
df5

Unnamed: 0,Salary
a,40000
b,50000
c,20000


In [18]:
df1.merge(df5, left_on='key', right_index=True)

Unnamed: 0,data1,key,something,Salary
0,0,b,54,50000
1,1,b,54,50000
6,6,b,54,50000
2,2,a,54,40000
4,4,a,54,40000
5,5,a,54,40000
3,3,c,54,20000
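
A possible alternative for this column-to-index case is DataFrame.join, which joins a column of the caller against the index of the other DataFrame. The sketch below assumes the same df1 and df5 (without the something column, for brevity). Note that join defaults to a left join, whereas merge defaults to an inner join; here every key in df1 exists in df5, so both return the same rows.

```python
import pandas as pd

df1 = pd.DataFrame({'data1': range(7), 'key': list('bbacaab')})
df5 = pd.DataFrame({'Salary': [40000, 50000, 20000]}, index=list('abc'))

# merge: column 'key' on the left against the index on the right.
merged = df1.merge(df5, left_on='key', right_index=True)

# join: same idea; keeps the row order and index of df1 (left join).
joined = df1.join(df5, on='key')

print(joined)
```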


### Concatenating along an axis

In [19]:
pd.concat([df1, df5])

Unnamed: 0,Salary,data1,key,something
0,,0.0,b,54.0
1,,1.0,b,54.0
2,,2.0,a,54.0
3,,3.0,c,54.0
4,,4.0,a,54.0
5,,5.0,a,54.0
6,,6.0,b,54.0
a,40000.0,,,
b,50000.0,,,
c,20000.0,,,


In [20]:
pd.concat([df1, df2])

Unnamed: 0,data1,data2,key,something
0,0.0,,b,54
1,1.0,,b,54
2,2.0,,a,54
3,3.0,,c,54
4,4.0,,a,54
5,5.0,,a,54
6,6.0,,b,54
0,,20.0,a,64
1,,21.0,b,64
2,,22.0,d,64
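
By default concat keeps the union of the columns and fills the gaps with NaN, as the output above shows. If only the shared columns are wanted, the join parameter can be set to 'inner'. A minimal sketch, rebuilding df1 and df2 with the something column:

```python
import pandas as pd

df1 = pd.DataFrame({'data1': range(7), 'key': list('bbacaab'), 'something': 54})
df2 = pd.DataFrame({'data2': range(20, 23), 'key': list('abd'), 'something': 64})

# join='inner' keeps only the columns present in every input.
stacked = pd.concat([df1, df2], join='inner')

print(stacked.columns.tolist())
print(stacked.isna().sum().sum())  # no NaN padding is needed
```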


In [21]:
df2.columns = [colname + '_df2' for colname in df2.columns]

In [22]:
pd.concat([df1, df2], axis=1)

Unnamed: 0,data1,key,something,data2_df2,key_df2,something_df2
0,0,b,54,20.0,a,64.0
1,1,b,54,21.0,b,64.0
2,2,a,54,22.0,d,64.0
3,3,c,54,,,
4,4,a,54,,,
5,5,a,54,,,
6,6,b,54,,,


If you accidentally overwrite a name, you can delete the overwritten variable, but that won't bring back the original value. If it was an object or function from a module, you'll need to reload the module with importlib.reload(), since Python does not load an already imported module again when you import it a second time. reload() is also useful when you are actively developing your own module and want to load the latest definition of a function into memory.

In [23]:
import numpy as np

np.concatenate

<function numpy.core.multiarray.concatenate>

In [24]:
a1 = np.random.randint(50, size=20).reshape(5, 4)
a2 = np.random.randn(12).reshape(3, 4)

In [25]:
np.concatenate((a1, a2))

array([[ 1.40000000e+01,  4.40000000e+01,  1.60000000e+01,
         1.90000000e+01],
       [ 1.30000000e+01,  4.70000000e+01,  1.90000000e+01,
         1.20000000e+01],
       [ 3.30000000e+01,  2.70000000e+01,  2.70000000e+01,
         3.60000000e+01],
       [ 4.00000000e+00,  4.40000000e+01,  3.20000000e+01,
         1.80000000e+01],
       [ 3.00000000e+00,  2.50000000e+01,  1.20000000e+01,
         3.70000000e+01],
       [-8.64112929e-01, -7.41903774e-01,  2.12361117e-01,
         1.22603925e+00],
       [-4.11656038e-02,  4.54776002e-01,  6.47884293e-01,
         8.40495696e-01],
       [ 2.49649920e-01, -2.17937632e-01, -2.06544093e+00,
        -1.13641627e+00]])

In [26]:
# This will fail because the dimensions do not match:
# along axis=1 both arrays need the same number of rows (5 vs 3)
# np.concatenate((a1, a2), axis=1)
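
The shape rule behind that failure can be sketched as follows: every dimension except the one being concatenated must match. The arrays here use np.arange instead of random values so the result is reproducible; a3 is a hypothetical extra array added for the axis=1 case.

```python
import numpy as np

a1 = np.arange(20).reshape(5, 4)
a2 = np.arange(12).reshape(3, 4)

# axis=0 stacks rows: both arrays have 4 columns, so this works.
rows = np.concatenate((a1, a2), axis=0)
print(rows.shape)

# axis=1 places arrays side by side: row counts differ (5 vs 3).
try:
    np.concatenate((a1, a2), axis=1)
except ValueError as exc:
    print('axis=1 failed:', exc)

# With a matching number of rows, axis=1 works.
a3 = np.arange(10).reshape(5, 2)
cols = np.concatenate((a1, a3), axis=1)
print(cols.shape)
```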

In [27]:
series_1 = pd.Series(range(1, 20, 2), index=list('abcdefghij'))
series_2 = pd.Series(range(5, 26, 5), index=list('jklmn'))
series_3 = pd.Series(range(120, 160, 10), index=list('abpq'))

In [28]:
pd.concat((series_1, series_2, series_3))

a      1
b      3
c      5
d      7
e      9
f     11
g     13
h     15
i     17
j     19
j      5
k     10
l     15
m     20
n     25
a    120
b    130
p    140
q    150
dtype: int64

In [29]:
# With axis=1, concat aligns the Series on their index labels, much like an outer join
pd.concat((series_1, series_2, series_3), axis=1)

Unnamed: 0,0,1,2
a,1.0,,120.0
b,3.0,,130.0
c,5.0,,
d,7.0,,
e,9.0,,
f,11.0,,
g,13.0,,
h,15.0,,
i,17.0,,
j,19.0,5.0,


In [30]:
# Renaming the columns
pd.concat((series_1, series_2, series_3), axis=1, keys=['s1', 's2', 's3'])

Unnamed: 0,s1,s2,s3
a,1.0,,120.0
b,3.0,,130.0
c,5.0,,
d,7.0,,
e,9.0,,
f,11.0,,
g,13.0,,
h,15.0,,
i,17.0,,
j,19.0,5.0,


In [31]:
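# ignore_index=True discards the labels along the concatenation axis;
# with axis=1 that means the column names are replaced by 0..5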
pd.concat((df1, df2), ignore_index=True, axis=1)

Unnamed: 0,0,1,2,3,4,5
0,0,b,54,20.0,a,64.0
1,1,b,54,21.0,b,64.0
2,2,a,54,22.0,d,64.0
3,3,c,54,,,
4,4,a,54,,,
5,5,a,54,,,
6,6,b,54,,,
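
ignore_index is probably more common with the default axis=0, where it replaces the repeated row labels with a fresh 0..n-1 index. A short sketch with simplified versions of df1 and df2:

```python
import pandas as pd

df1 = pd.DataFrame({'data1': range(7), 'key': list('bbacaab')})
df2 = pd.DataFrame({'data2': range(20, 23), 'key': list('abd')})

# Plain concat keeps the original row labels, so 0, 1 and 2 appear twice.
kept = pd.concat([df1, df2])
print(kept.index.is_unique)

# ignore_index=True renumbers the rows from 0 to 9.
renumbered = pd.concat([df1, df2], ignore_index=True)
print(list(renumbered.index))
```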


#### Digression

Attention! Be careful not to assign to the names of built-ins or library functions: the assignment overwrites the original object.

In [32]:
# The importlib library can undo an accidental assignment over a module's function.
# Import reload from importlib and call reload(module) on the affected module.
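
A minimal sketch of that recovery, using the standard-library json module as a stand-in for whichever module was affected (the choice of json and the overwrite of json.dumps are assumptions made for illustration):

```python
import importlib
import json

# Simulate the mistake: a function of the module gets overwritten.
json.dumps = None

# Re-executing the module's code rebinds its original definitions.
importlib.reload(json)

restored = callable(json.dumps)
print(restored, json.dumps({'a': 1}))
```

reload() rebinds the names inside the module object. Names imported earlier with from json import dumps keep pointing at whatever they were bound to at import time.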

## Data transformation

### Removing duplicates

In [33]:
import random

random.seed(42)
artists = pd.DataFrame({
    'name': ['Camilo'] * 3 + ['Nino'] * 2,
    'surname': [random.choice(['Sexto', 'Bravo']) for _ in range(5)]
})

artists

Unnamed: 0,name,surname
0,Camilo,Sexto
1,Camilo,Sexto
2,Camilo,Bravo
3,Nino,Sexto
4,Nino,Sexto


In [34]:
artists.duplicated()

0    False
1     True
2    False
3    False
4     True
dtype: bool

In [35]:
artists[artists.duplicated()]

Unnamed: 0,name,surname
1,Camilo,Sexto
4,Nino,Sexto


In [36]:
artists.drop_duplicates()

Unnamed: 0,name,surname
0,Camilo,Sexto
2,Camilo,Bravo
3,Nino,Sexto


In [37]:
artists.drop_duplicates(keep='last')

Unnamed: 0,name,surname
1,Camilo,Sexto
2,Camilo,Bravo
4,Nino,Sexto


In [38]:
artists['cache'] = [2e5, 5e5, 1e5, 6e8, 8e8]
artists

Unnamed: 0,name,surname,cache
0,Camilo,Sexto,200000.0
1,Camilo,Sexto,500000.0
2,Camilo,Bravo,100000.0
3,Nino,Sexto,600000000.0
4,Nino,Sexto,800000000.0


In [39]:
artists.drop_duplicates(keep='last', subset=['name', 'surname'])

Unnamed: 0,name,surname,cache
1,Camilo,Sexto,500000.0
2,Camilo,Bravo,100000.0
4,Nino,Sexto,800000000.0


In [40]:
artists.drop_duplicates(keep='first', subset=['name', 'surname'])

Unnamed: 0,name,surname,cache
0,Camilo,Sexto,200000.0
2,Camilo,Bravo,100000.0
3,Nino,Sexto,600000000.0
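
Two variations that may be useful here, sketched under the assumption that the artists table looks like the one printed above: keep=False flags every member of a duplicated group, and sorting before drop_duplicates lets us choose which row survives (here, the one with the highest cache). The names all_dups and top are illustrative.

```python
import pandas as pd

artists = pd.DataFrame({
    'name': ['Camilo'] * 3 + ['Nino'] * 2,
    'surname': ['Sexto', 'Sexto', 'Bravo', 'Sexto', 'Sexto'],
    'cache': [2e5, 5e5, 1e5, 6e8, 8e8],
})
cols = ['name', 'surname']

# keep=False marks all rows that belong to a duplicated group.
all_dups = artists[artists.duplicated(subset=cols, keep=False)]
print(list(all_dups.index))

# Sort so the preferred row comes first, then keep the first of each group.
top = (artists.sort_values('cache', ascending=False)
              .drop_duplicates(subset=cols)
              .sort_index())
print(top)
```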


### Renaming axis indexes

In [41]:
artists.index = list('abcde')
artists

Unnamed: 0,name,surname,cache
a,Camilo,Sexto,200000.0
b,Camilo,Sexto,500000.0
c,Camilo,Bravo,100000.0
d,Nino,Sexto,600000000.0
e,Nino,Sexto,800000000.0


### Discretization and binning

In [42]:
# Random ages that we will sort into bins (categories)
ages = pd.Series([random.randint(0, 90) for _ in range(40)])

In [43]:
categorized = pd.cut(ages, 4)
categorized.value_counts()

(-0.089, 22.25]    13
(66.75, 89.0]      10
(22.25, 44.5]      10
(44.5, 66.75]       7
dtype: int64

In [44]:
categorized = pd.cut(ages, [0, 12, 19, 30, 65, 100])
categorized.value_counts().sort_index()

(0, 12]       7
(12, 19]      4
(19, 30]      6
(30, 65]     12
(65, 100]    10
dtype: int64

In [45]:
pd.cut(ages, [0, 12, 19, 30, 65, 100])[0]

Interval(12, 19, closed='right')
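
The intervals produced by pd.cut are open on the left and closed on the right by default, so a value equal to the lowest edge falls outside every bin and becomes NaN. The counts in Out[44] add up to 39 rather than 40, which would be consistent with one age of 0 being dropped this way, although the random data is not shown. The sketch below uses a small fixed Series and hypothetical label names to show the labels and include_lowest parameters.

```python
import pandas as pd

ages = pd.Series([0, 5, 12, 15, 25, 40, 70])
bins = [0, 12, 19, 30, 65, 100]
labels = ['child', 'teen', 'young adult', 'adult', 'senior']  # illustrative names

# Default: intervals are (0, 12], (12, 19], ... so 0 gets no bin.
default = pd.cut(ages, bins, labels=labels)
print(default.isna().tolist())

# include_lowest=True closes the first interval on the left as well.
inclusive = pd.cut(ages, bins, labels=labels, include_lowest=True)
print(inclusive.tolist())
```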

## String manipulation

### String object methods

### Vectorized string functions in pandas

[Vectorized string functions in pandas](https://pandas.pydata.org/pandas-docs/stable/text.html) are grouped within the .str attribute of Series and Indexes. Most of them have the same names as the regular Python string methods, but they operate element-wise on a whole Series of strings.

In [46]:
zoo = pd.DataFrame({
    'animal': 'mantisshrimp molerat platypus anteater sloth'.split(),
    'number': [1234, 40, 3, 1, 2]
})

In [47]:
zoo['animal'].str.capitalize()

0    Mantisshrimp
1         Molerat
2        Platypus
3        Anteater
4           Sloth
Name: animal, dtype: object

In [48]:
zoo['animal'].str.upper()

0    MANTISSHRIMP
1         MOLERAT
2        PLATYPUS
3        ANTEATER
4           SLOTH
Name: animal, dtype: object
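
Beyond changing case, the .str accessor also covers tests and measurements, and it skips over missing values instead of raising. A small sketch with a hypothetical Series that contains one missing entry:

```python
import pandas as pd

animals = pd.Series(['mantisshrimp', 'molerat', None, 'anteater', 'sloth'])

# Length of each string; the missing entry stays missing.
lengths = animals.str.len()
print(lengths)

# Substring test; na=False treats the missing entry as "no match".
has_ant = animals.str.contains('ant', na=False)
print(has_ant.tolist())

# Boolean results can filter the original Series.
print(animals[has_ant].str.capitalize().tolist())
```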