<center> 
# R406: Using Python for data analysis and modelling

<br> <br> 

## <center> Pandas — data cleaning, merging, transformation and reshaping

<br>

<center> **Andrey Vassilev**

<br> 


 

# Outline

1. Merging data
2. Cleaning and transforming data
3. Reshaping data

In [None]:
import numpy as np
import pandas as pd
from IPython.display import display

# Merging datasets

- In Pandas, datasets are merged in much the same way as in database merge operations ("joins").
- There are different kinds of joins depending on which dataset is the "leading" one in the merge operation.
- One can also specify which common column(s), or the index, should serve as the merge key(s).

## Implicit merges

In this case Pandas will automatically try to find common columns to join on.

In [None]:
df1 = pd.DataFrame({"id":[112,113,114,116,115],"x1":[1,3,2,4,5]})
df2 = pd.DataFrame({"id":[112,115,114,116,113],"x2":[23,13,24,45,44]})
display(df1,df2)

In [None]:
pd.merge(df1,df2)

## Explicit merges on key

In [None]:
df1 = pd.DataFrame({"id":[112,113,114,116,115],"id1":[16,14,12,15,13],"x1":[1,3,2,4,5]})
df2 = pd.DataFrame({"id":[112,115,114,116,113],"id1":[16,12,14,15,13], "x2":[23,13,24,45,44]})
display(df1,df2)

In [None]:
pd.merge(df1,df2,on="id")

In [None]:
pd.merge(df1,df2,on="id1")

The keys we are merging on need not have the same names.

In [None]:
df1 = pd.DataFrame({"id1":[112,113,114,116,115],"x1":[1,3,2,4,5]})
df2 = pd.DataFrame({"id2":[112,115,114,116,113],"x2":[23,13,24,45,44]})
display(df1,df2)

In [None]:
pd.merge(df1,df2,left_on="id1", right_on="id2")

What happens when the keys match partially?

In [None]:
df1 = pd.DataFrame({"id":[0,113,114,116,115],"x1":[1,3,2,4,5]})
df2 = pd.DataFrame({"id":[112,115,114,116,999],"x2":[23,13,24,45,44]})
display(df1,df2)

In [None]:
pd.merge(df1,df2)

The match is performed only on the common keys. This is called an *inner join*. It is an intersection operation on the keys. Pandas does this by default but we can control it using the `how` parameter.

In [None]:
pd.merge(df1,df2,how="inner") # same as above!

The merging operation can be made inclusive by making sure that no key from either `DataFrame` has been left out. This is called an *outer join* and is a union operation on the keys. Missing elements are filled with `NaN`.

In [None]:
pd.merge(df1,df2,how="outer")
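
To see which rows of an outer join came from which dataframe, `merge()` accepts an `indicator` argument. The sketch below rebuilds the two partially matching dataframes under the illustrative names `left` and `right` (these names are not part of the original notebook) and lists the keys that were found on one side only.

```python
import pandas as pd

# Same data as the partially matching example; `left`/`right` are illustrative names
left = pd.DataFrame({"id": [0, 113, 114, 116, 115], "x1": [1, 3, 2, 4, 5]})
right = pd.DataFrame({"id": [112, 115, 114, 116, 999], "x2": [23, 13, 24, 45, 44]})

# indicator=True adds a "_merge" column with values
# "left_only", "right_only" or "both"
merged = pd.merge(left, right, how="outer", indicator=True)
print(merged)

# Keys present in only one of the two dataframes
unmatched = merged.loc[merged["_merge"] != "both", "id"]
print(sorted(unmatched.tolist()))
```

The rows marked `both` are exactly the ones an inner join would return.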

It is also possible to have one of the `DataFrame`s as the "leading" one and the second one will be merged only where possible.

In [None]:
pd.merge(df1,df2,how="left")

In [None]:
pd.merge(df1,df2,how="right")
# pd.merge(df2,df1,how="left") # gives the same rows, with the columns in a different order

We can also merge on more than one key. Consider these two dataframes.

In [None]:
df1 = pd.DataFrame({"id":[1,1,2,2,3],"x1":[1,3,2,4,5]})
df2 = pd.DataFrame({"id":[1,1,2,2,3],"x2":[23,13,24,45,44]})
display(df1,df2)

Here is what happens when you merge:

In [None]:
pd.merge(df1,df2)

Now suppose they have an additional key that can serve to uniquely identify rows:

In [None]:
df1 = pd.DataFrame({"id":[1,1,2,2,3],"id1":[1,2,1,2,1],"x1":[1,3,2,4,5]})
df2 = pd.DataFrame({"id":[1,1,2,2,3],"id1":[1,2,1,2,1],"x2":[23,13,24,45,44]})
display(df1,df2)

In [None]:
pd.merge(df1,df2)
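
The difference between the last two merges can be made explicit by naming the keys. This is a minimal sketch with the same data under the illustrative names `left` and `right`; the `validate` argument is an optional safeguard that raises an error if the keys do not identify rows uniquely.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 1, 2, 2, 3], "id1": [1, 2, 1, 2, 1],
                     "x1": [1, 3, 2, 4, 5]})
right = pd.DataFrame({"id": [1, 1, 2, 2, 3], "id1": [1, 2, 1, 2, 1],
                      "x2": [23, 13, 24, 45, 44]})

# Merging on "id" alone pairs every left row with every right row
# that shares the same id (a many-to-many join)
on_one_key = pd.merge(left, right, on="id")
print(len(on_one_key))

# Merging on both keys matches rows one to one;
# validate="one_to_one" checks that the key pairs are unique on both sides
on_both_keys = pd.merge(left, right, on=["id", "id1"], validate="one_to_one")
print(on_both_keys)
```

Calling `pd.merge(left, right)` without `on` uses all common columns, so here it is equivalent to the second call without the validation step.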

## Merging on index

You can also use dataframe indexes as the merge keys.

In [None]:
ind1 = pd.date_range(start="2005",periods=5,freq="A")
df1 = pd.DataFrame({"x1":[1,3,2,4,5]},index=ind1)
df2 = pd.DataFrame({"x2":[23,13,24,45,44]},index=ind1)
display(df1,df2)

In [None]:
pd.merge(df1,df2,left_index=True,right_index=True)

More complex merges also work as above:

In [None]:
ind1 = pd.date_range(start="2005",periods=5,freq="A")
ind2 = pd.date_range(start="2004",periods=5,freq="A")
df1 = pd.DataFrame({"x1":[1,3,2,4,5]},index=ind1)
df2 = pd.DataFrame({"x2":[23,13,24,45,44]},index=ind2)
display(df1,df2)

In [None]:
pd.merge(df1,df2,left_index=True,right_index=True)

In [None]:
pd.merge(df1,df2,left_index=True,right_index=True,how="outer")

Note that there is also a `join()` method that merges on indexes. Its syntax is a bit more compact than that of `merge()` but we won't deal with it.
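
For reference only, here is a brief sketch of how `join()` relates to the index-based `merge()` calls above. The small dataframes and their integer year index are made up for illustration.

```python
import pandas as pd

# Illustrative data with partially overlapping integer indexes
left = pd.DataFrame({"x1": [1, 3, 2]}, index=[2005, 2006, 2007])
right = pd.DataFrame({"x2": [23, 13, 24]}, index=[2004, 2005, 2006])

# Index-on-index outer merge, written out in full
via_merge = pd.merge(left, right, left_index=True, right_index=True, how="outer")

# The same operation with join(), which uses the indexes by default
via_join = left.join(right, how="outer")

print(via_join)
print(via_merge.equals(via_join))
```

Note that `join()` defaults to a left join, whereas `merge()` defaults to an inner join, so `how` is passed explicitly in both calls.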

## Concatenation

Another way of combining datasets is to concatenate them (think stacking them one on top of another).

In [None]:
ind1 = pd.date_range(start="2000",periods=5,freq="A")
ind2 = pd.date_range(start="2004",periods=5,freq="A")
df1 = pd.DataFrame({"x1":[1,3,2,4,5]},index=ind1)
df2 = pd.DataFrame({"x1":[23,13,24,45,44]},index=ind2)
display(df1,df2)

In [None]:
pd.concat([df1,df2])

Compare with the result of a merge operation:

In [None]:
pd.merge(df1,df2,left_index=True,right_index=True,how="outer")
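
A few options of `concat()` help with the contrast between concatenating and merging. The sketch below uses small made-up dataframes `a` and `b` whose indexes overlap in one label.

```python
import pandas as pd

# Illustrative data: the label 2004 appears in both dataframes
a = pd.DataFrame({"x1": [1, 3]}, index=[2003, 2004])
b = pd.DataFrame({"x1": [23, 13]}, index=[2004, 2005])

# Plain concatenation appends rows, so 2004 ends up in the index twice
stacked = pd.concat([a, b])
print(stacked.index.is_unique)

# keys=... adds an outer index level recording where each row came from
labelled = pd.concat([a, b], keys=["a", "b"])
print(labelled)

# axis=1 places the dataframes side by side, aligning on the index
side_by_side = pd.concat([a, b], axis=1)
print(side_by_side)
```

With `axis=1` the result resembles the outer merge on the index, except that duplicate column names are kept as they are instead of receiving suffixes.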

# Transformations and data cleaning

There are numerous operations that can be classified as "cleaning" or "transforming" the data. Cleaning is generally any type of operation that removes unnecessary information or handles the case of missing information. Transformations can be even more diverse and obviously can be part of a cleaning operation.

## Finding and removing duplicates

In [None]:
df = pd.DataFrame({"x1":[1,3,5,7,3],"x2":[2,4,6,8,4]})
display(df)
df.duplicated()

In [None]:
df = pd.DataFrame({"x1":[1,3,5,1,7,3,],"x2":[2,4,6,2,8,4]})
display(df)
df.duplicated()

In [None]:
df.drop_duplicates()

In [None]:
df.drop_duplicates(inplace=True)
df
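
Both `duplicated()` and `drop_duplicates()` can be restricted to some of the columns and told which occurrence to keep. A minimal sketch with made-up data in a dataframe called `d`:

```python
import pandas as pd

# Illustrative data: the last row repeats x1 == 3 but has a different x2
d = pd.DataFrame({"x1": [1, 3, 5, 1, 7, 3], "x2": [2, 4, 6, 2, 8, 9]})

# Whole-row duplicates: only the second (1, 2) row is flagged
full_dupes = d.duplicated()
print(full_dupes.tolist())

# Duplicates judged on x1 only
x1_dupes = d.duplicated(subset=["x1"])
print(x1_dupes.tolist())

# keep="last" retains the last occurrence instead of the first
kept_last = d.drop_duplicates(subset=["x1"], keep="last")
print(kept_last)
```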

## Transforming data with a function or a map

Let's look at the simplest case first:

In [None]:
display(df)

In [None]:
df['x3'] = 5*df['x1'] - df['x2']**2
df

We are obviously not constrained to simple operations:

In [None]:
def Transf(x):
    tmp = x.copy() # What happens if you don't use copy()?
    tmp[tmp<0] *= 2
    tmp[tmp>0] += 33
    return tmp
df['x4'] = Transf(df['x3'])
df

Or we can use the `map()` method to do the transformation. This allows us to use a function which is not vectorized.

In [None]:
df['x5'] = df['x4'].map(lambda x:"Negative" if x<0 else "Positive" if x>0 else "Zero")
df
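
Besides a function, `map()` also accepts a dictionary (the "map" in the section title). The sketch below uses a made-up series of labels and a hypothetical `mapping` dictionary to turn them into numeric codes.

```python
import pandas as pd

# Illustrative labels, similar to the ones produced by the lambda above
labels = pd.Series(["Negative", "Positive", "Zero", "Positive"])

# Hypothetical lookup table from label to numeric code
mapping = {"Negative": -1, "Zero": 0, "Positive": 1}
codes = labels.map(mapping)
print(codes.tolist())

# Values that are missing from the dictionary are mapped to NaN
unknown = pd.Series(["Unknown"]).map(mapping)
print(unknown.isna().all())
```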

## Detecting null values

In [None]:
df["x5"]=np.nan
df.iloc[1,1]=np.nan
df.loc[2,"x3"]=None
df

In [None]:
df.isnull()

## Dropping NAs

In [None]:
del df['x5']
df

In [None]:
df.dropna() # Drops rows by default

In [None]:
df.dropna(axis=1) # Drops columns

We can consider only a certain column (or columns) when dropping:

In [None]:
df.dropna(subset=["x2"])

The `dropna()` method also allows us to:
- modify the dataframe in place with `inplace=True` (as seen previously with `drop_duplicates()`);
- use the `how='all'` argument to drop a label only if all of its entries are missing;
- use the `thresh=n` argument to keep only the labels that have at least `n` non-missing values.
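
The following sketch illustrates the `how` and `thresh` arguments on a small made-up dataframe `d`. Note that the argument is spelled `thresh` and counts the non-missing values a row needs in order to be kept.

```python
import numpy as np
import pandas as pd

# Illustrative data: row 2 is entirely missing, rows 1 and 3 partially
d = pd.DataFrame({"a": [1.0, np.nan, np.nan, 4.0],
                  "b": [1.0, 2.0, np.nan, np.nan],
                  "c": [1.0, 2.0, np.nan, 4.0]})

# how="all": drop a row only if every entry in it is missing
all_missing_dropped = d.dropna(how="all")
print(all_missing_dropped.index.tolist())

# thresh=3: keep only rows with at least 3 non-missing values
at_least_three = d.dropna(thresh=3)
print(at_least_three.index.tolist())

# thresh=2: rows with two or more non-missing values survive
at_least_two = d.dropna(thresh=2)
print(at_least_two.index.tolist())
```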

## Filling in missing values

In [None]:
display(df)
df.fillna(-999)

In [None]:
display(df)
df.fillna({'x1':1.11,'x2':2.22,'x3':3.33,'x4':4.44})

In [None]:
display(df)
df.bfill() # backward fill; replaces the deprecated fillna(method='backfill')

In [None]:
display(df)
df.ffill() # forward fill; replaces the deprecated fillna(method='pad')

## Replacing values

We can replace values in general using the `replace()` method.

In [None]:
df1 = df.fillna({'x1':1.11,'x2':2.22,'x3':3.33,'x4':4.44})
display(df1)
df1.replace(to_replace = [1.0,2.22,3.33],value=[100,222,333])

We can also use a dictionary to pass the substitution values:

In [None]:
display(df1)
df1.replace({2.22:np.nan,3.33:np.nan})

## Computing dummy variables

Sometimes it is useful for modelling purposes to generate a set of dummy variables from a categorical variable:

In [None]:
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                    'data': range(6)})
display(df1)
pd.get_dummies(df1['key'])

We can get rid of the `key` variable in our example and replace it with the corresponding dummies:

In [None]:
pd.merge(df1[['data']],pd.get_dummies(df1['key']),left_index=True,right_index=True)
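
The same replacement can be sketched in a single call by passing the whole dataframe to `get_dummies()` together with the `columns` argument. The dataframe name `d` is illustrative; the data are the same as above.

```python
import pandas as pd

d = pd.DataFrame({"key": ["b", "b", "a", "c", "a", "b"],
                  "data": range(6)})

# columns=["key"] encodes only that column and drops the original;
# the new columns are prefixed with the original column name
dummies = pd.get_dummies(d, columns=["key"])
print(dummies)
```

Unlike the merge-based version, the dummy columns here are named `key_a`, `key_b` and `key_c` rather than `a`, `b` and `c`.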

## Discretization and binning

We may have to distribute measurements into pre-specified groups, similarly to how one places observations in the different bins of a histogram. This is done with the `cut()` function.

As an example, suppose you are given weight measurements on 10 persons and you want to classify them in groups as follows:
- up to 50 kg.
- between 50 and 60 kg.
- between 60 and 70 kg.
- ...
- above 90 kg.

In [None]:
weights = [49,91,61,88,75,56,45,54,77,71]
bins = [0,50,60,70,80,90,np.inf]
wbin = pd.cut(weights,bins)
wbin

In [None]:
# These are the labels
wbin.categories

In [None]:
# And these are the groups the observations belong to
wbin.codes

We can get a tally of the number of people in each group:

In [None]:
pd.Series(wbin).value_counts() # the top-level pd.value_counts() is deprecated in recent Pandas versions
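
Two details of `cut()` are worth a quick sketch: which end of each bin is closed, and how to attach readable labels. The three boundary weights and the label strings below are made up for illustration; the bins are the ones used above.

```python
import numpy as np
import pandas as pd

bins = [0, 50, 60, 70, 80, 90, np.inf]
boundary_weights = [50, 60, 90]  # values that sit exactly on a bin edge

# Default (right=True): bins are closed on the right, e.g. (0, 50]
right_closed = pd.cut(boundary_weights, bins)
print(list(right_closed.codes))

# right=False: bins are closed on the left, e.g. [50, 60)
left_closed = pd.cut(boundary_weights, bins, right=False)
print(list(left_closed.codes))

# Hypothetical labels, one per bin, replacing the interval notation
labels = ["up to 50", "50-60", "60-70", "70-80", "80-90", "above 90"]
named = pd.cut(boundary_weights, bins, labels=labels)
print(list(named))
```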

# Reshaping data

This part deals with various ways of representing our dataset by rearranging it from rows to columns and vice versa, making the data "wide" or "long" etc.

## Stacking and unstacking data

- The `stack()` method pivots from columns to rows.
- The `unstack()` method pivots from rows to columns.

Stacking makes data "long".

In [None]:
display(df)
stacked = df.stack()
# returns a Series with a hierarchical index
display(stacked) 

In [None]:
stacked[0]

In [None]:
stacked[4]['x2':'x4']

In [None]:
display(df)
stacked = df.stack(dropna=False) # keeps the NaNs
display(stacked) 

Unstacking works from rows to columns, i.e. makes your data "wide".

In [None]:
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
index = pd.MultiIndex.from_tuples(index)
pop = pd.Series(populations, index=index)
pop

In [None]:
pop.unstack()
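
By default `unstack()` moves the innermost index level to the columns, but the level can be chosen, and `stack()` reverses the operation. A minimal sketch on a subset of the population data, with level names `state` and `year` added here for illustration:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("California", 2000), ("California", 2010),
     ("Texas", 2000), ("Texas", 2010)],
    names=["state", "year"])  # level names are an addition for this sketch
pop = pd.Series([33871648, 37253956, 20851820, 25145561], index=index)

# Innermost level (year) goes to the columns
wide = pop.unstack()
print(wide)

# Choose the level explicitly: states become the columns instead
by_state = pop.unstack(level="state")
print(by_state)

# Stacking the wide table brings back the long representation
back = wide.stack()
print(back)
```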

## Pivoting

- The stacking and unstacking operations can be generalized a bit for more convenient use. 
- This is done through the `pivot()` method, which lets us choose what goes on the rows and what goes on the columns.
- It is especially useful for long data in the format usually retrieved from a database.

Consider the following dataset, which contains artificial balance of payments data:

In [None]:
df1 = pd.DataFrame({'date':[2010,2010,2011,2011,2012,2012],
                   'BOPcat':['X','M']*3,
                   'valLC':np.array([3000,3000,2900,3100,3050,2950]),
                   'valFC':np.array([3000,3000,2900,3100,3050,2950])*2})
display(df1)

Stacking does not produce very usable results:

In [None]:
df1.stack()

And neither does unstacking.

In [None]:
df1.unstack()

Let's use the `pivot()` method and instruct it to put the `date` variable on the rows and the `BOPcat` variable on the columns, tabulating the `valLC` variable.

In [None]:
df1.pivot(index='date', columns='BOPcat', values='valLC') # keyword arguments are required in Pandas 2.0+

We can do the same with the `valFC` variable:

In [None]:
df1.pivot(index='date', columns='BOPcat', values='valFC')

Or swap rows for columns:

In [None]:
df1.pivot(index='BOPcat', columns='date', values='valFC')
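
One caveat: `pivot()` needs each row/column combination to occur at most once. When long data contain repeated combinations, the related `pivot_table()` method can aggregate them. The sketch below uses a made-up dataframe `long_df` in which the pair (2010, 'X') appears twice.

```python
import pandas as pd

# Illustrative data with a repeated (date, BOPcat) combination
long_df = pd.DataFrame({"date": [2010, 2010, 2010, 2011, 2011],
                        "BOPcat": ["X", "X", "M", "X", "M"],
                        "valLC": [1000, 2000, 3000, 2900, 3100]})

# pivot() cannot decide which of the two values to place in the cell
pivot_failed = False
try:
    long_df.pivot(index="date", columns="BOPcat", values="valLC")
except ValueError as err:
    pivot_failed = True
    print("pivot() failed:", err)

# pivot_table() combines the repeated entries, here by summing them
table = long_df.pivot_table(index="date", columns="BOPcat",
                            values="valLC", aggfunc="sum")
print(table)
```

The choice of `aggfunc="sum"` is an assumption made for this example; the default aggregation is the mean.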

## Melting data

### The general idea

Sometimes your dataset will be organized in such a way that column names contain information that is actually data. Consider the following dataset:

In [None]:
dt = pd.DataFrame({'first' : ['John', 'Mary'],
                   'last' : ['Doe', 'Bo'],
                   'height' : [170, 180],
                   'weight' : [60, 80]})
dt

- Here the column names `height` and `weight` themselves contain information on the type of measurement (variable). 
- This information can be put in a more generic form if we place it in a separate column and the corresponding values in another column, like this:

| Variable | Value |
| -------- | ----- |
| height   | 170   |
| height   | 180   |
| weight   | 60    |
| weight   | 80    |

- The above is a basic example of *melting*.

- This proposal may not look too different from the original format.
- However, imagine that we had observations on more variables like waistline, body fat percentage etc. 
- These would grow the dataframe horizontally in the original representation, whereas under the proposed transformation additional variables only add rows to a fixed number of columns.
- Obviously this process can apply only to some variables (called *measured variables* or *value variables*), as we need to keep certain variables (called *identifier variables*) in order to be able to identify observations uniquely.

### The Pandas implementation of melting

The `melt()` function collects the information from the columns (in this case, whether the measurement refers to a person's height or weight) and places it in a new variable:

In [None]:
pd.melt(dt, id_vars=['first', 'last'])

The `id_vars` list declares certain variables as identifiers and excludes them from the `melt` operation.

It is possible to change the name of the variable to something more expressive:

In [None]:
pd.melt(dt, id_vars=['first', 'last'], var_name='quantity')

To put things in perspective, the `id_vars` are needed in order to avoid losing information. In this case, we use the combination of first and last name to identify which person an observation refers to. Here is the (useless) molten dataframe without this declaration:

In [None]:
pd.melt(dt)

### More on the rationale behind melting

- At this stage one might wonder whether melting is such a good idea: it seems to make a choice in favour of "long" rather than "wide" data, with the side effect that the readability of the dataset may be worsened in the process of transformation.
- However, the primary advantage of melting is that it puts the data in a generic format that is suitable for transformation into different alternative representations, as needed.

- Think of it as having the dataset in a database-like format which is convenient for extracting different tables for different purposes.
- Actually, the term "melt" is used in reference to having molten metal that can be cast into different forms, as desired. Indeed, the statistical computing and graphics environment R uses precisely the term "cast" for this reverse operation (recall that in Pandas this is done via the `pivot()` method shown previously).
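
The melt-then-cast idea can be sketched as a round trip on the height and weight example: melt the data, cast it back to the wide form with `pivot()`, or cast it into a different summary altogether. The names `molten`, `wide` and `means` are illustrative, and passing a list to `index` assumes a reasonably recent Pandas version.

```python
import pandas as pd

dt = pd.DataFrame({"first": ["John", "Mary"],
                   "last": ["Doe", "Bo"],
                   "height": [170, 180],
                   "weight": [60, 80]})

# Melt: one row per (person, quantity) pair
molten = pd.melt(dt, id_vars=["first", "last"], var_name="quantity")
print(molten)

# Cast back to the wide form: identifiers on the rows, quantities on the columns
wide = molten.pivot(index=["first", "last"], columns="quantity",
                    values="value").reset_index()
wide.columns.name = None  # drop the leftover "quantity" label on the columns
print(wide)

# Cast into a different shape: the average of each quantity
means = molten.groupby("quantity")["value"].mean()
print(means)
```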