# More Data Preprocessing (with Pandas)

pandas is a widely used Python package for data analysis and data preprocessing.

Let's start with some random data.

In [1]:
import pandas as pd
import numpy as np

In [2]:
df = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])
df

Unnamed: 0,one,two,three
a,-0.32091,0.083295,0.234517
c,0.872061,-1.559059,-0.579188
e,0.254526,-0.525624,1.096691
f,0.339962,-2.114585,-0.99026
h,0.244691,0.087985,0.144782


A pandas DataFrame can hold columns of different data types, so let us add two more columns: 'four' (strings) and 'five' (booleans):

In [3]:
df['four'] = 'bar'
df['five'] = df['one'] > 0
df

Unnamed: 0,one,two,three,four,five
a,-0.32091,0.083295,0.234517,bar,False
c,0.872061,-1.559059,-0.579188,bar,True
e,0.254526,-0.525624,1.096691,bar,True
f,0.339962,-2.114585,-0.99026,bar,True
h,0.244691,0.087985,0.144782,bar,True


Adding new rows is also simple. Below, reindex() adds three extra rows ('b', 'd' and 'g') whose values are all missing (NaN):

In [4]:
df2 = df.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
df2

Unnamed: 0,one,two,three,four,five
a,-0.32091,0.083295,0.234517,bar,False
b,,,,,
c,0.872061,-1.559059,-0.579188,bar,True
d,,,,,
e,0.254526,-0.525624,1.096691,bar,True
f,0.339962,-2.114585,-0.99026,bar,True
g,,,,,
h,0.244691,0.087985,0.144782,bar,True
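
As a side note, reindex() also accepts a fill_value argument when NaN is not the placeholder you want. The sketch below uses a small hypothetical frame (`small` is not part of the notebook data) to compare the two behaviours:

```python
import pandas as pd

# Hypothetical two-row frame standing in for df above
small = pd.DataFrame({'one': [1.0, 2.0], 'two': [3.0, 4.0]}, index=['a', 'c'])

# Default behaviour: the new label 'b' becomes a row of NaN values
with_nan = small.reindex(['a', 'b', 'c'])

# With fill_value, the new row gets the given placeholder instead of NaN
with_zero = small.reindex(['a', 'b', 'c'], fill_value=0.0)

print(with_nan)
print(with_zero)
```

Whether a placeholder such as 0 is appropriate depends on the data; NaN has the advantage of staying visibly "missing".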


# Your Turn Here

Do you still remember how to index a row/column?

In [10]:
#### index row 'c' below
df2.loc['c']

one      0.872061
two      -1.55906
three   -0.579188
four          bar
five         True
Name: c, dtype: object

In [60]:
#### index column 'two' below
df2.loc[:,'two']

a    0.083295
b         NaN
c   -1.559059
d         NaN
e   -0.525624
f   -2.114585
g         NaN
h    0.087985
Name: two, dtype: float64

pandas has two functions, isnull() and notnull(), that return a boolean object of the same shape as the input, marking which values are missing (or not missing).

In [12]:
pd.isnull(df2['one'])

a    False
b     True
c    False
d     True
e    False
f    False
g     True
h    False
Name: one, dtype: bool

In [13]:
pd.notnull(df2['one'])

a     True
b    False
c     True
d    False
e     True
f     True
g    False
h     True
Name: one, dtype: bool
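
Because the result is boolean, it can be summed or used as a row filter. The following is a minimal sketch on a hypothetical frame (`frame` and its values are made up for illustration); isna() and notna() are aliases of isnull() and notnull():

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a few missing values
frame = pd.DataFrame({'one': [1.0, np.nan, 3.0],
                      'two': [np.nan, np.nan, 6.0]})

# True counts as 1, so summing the mask counts missing values per column
missing_per_column = frame.isna().sum()

# Rows that contain at least one missing value
rows_with_missing = frame[frame.isna().any(axis=1)]

# Rows where every value is present
complete_rows = frame[frame.notna().all(axis=1)]

print(missing_per_column)
print(rows_with_missing)
print(complete_rows)
```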

Missing values propagate naturally through arithmetic operations between pandas objects.

In [14]:
# take an explicit copy so that df itself is not modified and no SettingWithCopyWarning is raised
a = df[['one', 'two']].copy()
# set column 'one' to NaN for rows 'a' through 'e' (label slices include the end label)
a.loc['a':'e', 'one'] = np.nan
a


Unnamed: 0,one,two
a,,0.083295
c,,-1.559059
e,,-0.525624
f,0.339962,-2.114585
h,0.244691,0.087985


In [15]:
b = df[['one','two','three']]
b

Unnamed: 0,one,two,three
a,-0.32091,0.083295,0.234517
c,0.872061,-1.559059,-0.579188
e,0.254526,-0.525624,1.096691
f,0.339962,-2.114585,-0.99026
h,0.244691,0.087985,0.144782


In [16]:
a + b

Unnamed: 0,one,three,two
a,,,0.16659
c,,,-3.118118
e,,,-1.051248
f,0.679924,,-4.22917
h,0.489381,,0.17597


In [17]:
#Interesting observation:
#adding dfs 'a' and 'b' returns NaN wherever either operand is NaN (or missing)
#column 'three' is all NaN because it exists only in 'b', so 'a' has nothing to add to it
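
If NaN propagation is not what you want, the add() method takes a fill_value argument that substitutes a value for entries missing on one side only. This is a small sketch with hypothetical frames `left` and `right` (not the notebook's `a` and `b`):

```python
import numpy as np
import pandas as pd

# Hypothetical frames: 'left' has one NaN and lacks column 'three'
left = pd.DataFrame({'one': [np.nan, 1.0], 'two': [2.0, 3.0]},
                    index=['a', 'b'])
right = pd.DataFrame({'one': [10.0, 20.0], 'two': [30.0, 40.0],
                      'three': [50.0, 60.0]}, index=['a', 'b'])

# The + operator propagates NaN
plain = left + right

# add() with fill_value treats a value missing on one side as 0;
# an entry missing on BOTH sides would still be NaN
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```

Treating missing values as 0 is itself an imputation choice, so it should be made deliberately rather than by default.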

# How to deal with Missing Values

## Deleting Missing Values

The simplest method is to drop all missing values, but this is generally **discouraged** because every dropped row or column also throws away its non-missing values.

In [18]:
a['one'].dropna()

f    0.339962
h    0.244691
Name: one, dtype: float64

By default, dropna() will drop any row containing **NaN** values, but you can change that by using the *axis=* and *thresh=* arguments.

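**NOTE**: Dropping rows and dropping columns serve different purposes: dropping a row discards an observation, while dropping a column discards a variable.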

In [19]:
#### This statement drops any column with NaN values - i.e., 'axis=1' for column
a.dropna(axis=1)

Unnamed: 0,two
a,0.083295
c,-1.559059
e,-0.525624
f,-2.114585
h,0.087985


In [25]:
#### thresh determines how many non-NaN values a column/row should have without being dropped
#So 'thresh=2' means keep any - in this case (axis=1) - column with 2 or more non-NaN values
c = a + b
c.dropna(axis=1, thresh=2)

Unnamed: 0,one,two
a,,0.16659
c,,-3.118118
e,,-1.051248
f,0.679924,-4.22917
h,0.489381,0.17597
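
The same thresh= logic applies to rows when axis is left at its default. Here is a minimal sketch on a hypothetical frame (the column names x, y, z are made up) showing both directions side by side:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 2 is entirely NaN, column 'x' has one valid value
frame = pd.DataFrame({'x': [1.0, np.nan, np.nan],
                      'y': [2.0, 5.0, np.nan],
                      'z': [3.0, 6.0, np.nan]})

# Keep rows that have at least 2 non-NaN values (axis=0 is the default)
rows_kept = frame.dropna(thresh=2)

# Keep columns that have at least 2 non-NaN values
cols_kept = frame.dropna(axis=1, thresh=2)

print(rows_kept)
print(cols_kept)
```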


You can also use the *how=* argument to control when a row or column is removed.

In [26]:
#### By default, dropna() drops column/row with any NaN values
#### how = 'all' changes that to dropping column/row that has all NaN values
c.dropna(axis=1, how='all')

Unnamed: 0,one,two
a,,0.16659
c,,-3.118118
e,,-1.051248
f,0.679924,-4.22917
h,0.489381,0.17597


You can refer to the [DataFrame.dropna() docs](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html) for more information.
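
One more dropna() argument that is often handy is subset=, which restricts the NaN check to selected columns. The sketch below assumes a hypothetical frame in which only the 'score' column must be present:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'note' is optional, 'score' is required
frame = pd.DataFrame({'id': [1, 2, 3],
                      'score': [0.5, np.nan, 0.7],
                      'note': [None, 'ok', None]})

# Drop a row only if 'score' is missing; NaN in 'note' is ignored
cleaned = frame.dropna(subset=['score'])

print(cleaned)
```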

## Imputing Missing Values

Imputing means filling in missing values with substitutes. It works best when the missing values can reasonably be inferred from the non-missing ones.

In [27]:
my_series = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
my_series

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

In [28]:
#### you can fill missing values with a specific value (0)
my_series.fillna(0)

a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64

Alternatively, we can specify a forward-fill to propagate the previous value forward:

In [30]:
# forward-fill: each missing value takes the last valid value before it
# (ffill() replaces fillna(method='ffill'), which is deprecated in recent pandas versions)
my_series.ffill()

a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

Or we can specify a back-fill to propagate the next values backward:

In [31]:
# backward-fill: each missing value takes the next valid value after it
# (bfill() replaces fillna(method='bfill'), which is deprecated in recent pandas versions)
my_series.bfill()

a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64
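
Two details of forward-filling are worth knowing: long gaps are filled completely unless you cap them with limit=, and a missing value at the very start has nothing before it to copy. A small sketch with hypothetical series `gappy` and `leading`:

```python
import numpy as np
import pandas as pd

# Hypothetical series with a gap of three consecutive missing values
gappy = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

# Without a limit, the whole gap is filled with the last valid value
filled_fully = gappy.ffill()

# limit=1 fills at most one consecutive NaN per gap
filled_once = gappy.ffill(limit=1)

# A leading NaN has no previous value, so forward-fill leaves it as NaN
leading = pd.Series([np.nan, 2.0, np.nan])
filled_leading = leading.ffill()

print(filled_fully)
print(filled_once)
print(filled_leading)
```

The same applies in mirror image to bfill(): a trailing NaN has no next value and stays missing.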

## Your Turn Here
The methods above can also be applied to DataFrames.

In [32]:
#### Let us generate a random dataframe
rand_df = pd.DataFrame(np.random.randn(5, 3), 
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['one', 'two', 'three'])
rand_df = rand_df.mask(np.random.random(rand_df.shape) < .3)
rand_df

Unnamed: 0,one,two,three
a,1.136506,-0.142969,-0.639014
b,,0.165893,0.555224
c,0.621133,0.237715,-0.738707
d,,0.732094,0.768153
e,,0.691943,-1.505192


Your tasks are as follows:

- fill missing values in column **one** with value 1;
- fill missing values in column **two** with forward-filling;
- fill missing values in column **three** with backward-filling.

In [54]:
#### insert your code here - fill missing values in column one with 1
#### (fillna/ffill/bfill return new objects, so assign the results back to the column)
rand_df['one'] = rand_df['one'].fillna(1)
#### insert your code here - fill missing values in column two with forward-filling
rand_df['two'] = rand_df['two'].ffill()
#### insert your code here - fill missing values in column three with backward-filling
rand_df['three'] = rand_df['three'].bfill()
#### now the code to print the df
rand_df


Unnamed: 0,one,two,three
a,1.136506,-0.142969,-0.639014
b,1.0,0.165893,0.555224
c,0.621133,0.237715,-0.738707
d,1.0,0.732094,0.768153
e,1.0,0.691943,-1.505192
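
A point that is easy to miss in this exercise is that fillna() returns a new object and leaves the original untouched unless the result is assigned back. The sketch below demonstrates this on a hypothetical frame, and also shows that DataFrame.fillna() accepts a dict mapping column names to fill values:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value per column
frame = pd.DataFrame({'one': [1.0, np.nan], 'two': [np.nan, 4.0]})

# The result is discarded here, so frame is unchanged
frame['one'].fillna(1)
still_missing = frame['one'].isna().sum()

# Assigning the result back updates the column
frame['one'] = frame['one'].fillna(1)

# A dict fills each listed column with its own value and returns a new frame
filled_all = frame.fillna({'two': 0.0})

print(still_missing)
print(frame)
print(filled_all)
```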


A useful approach for imputing missing data is to replace it with the column mean (for numeric data) or mode (for categorical data). The reasoning is that if we have to *guess* a missing value, the mean/mode is the most likely single guess when the data follow a **normal** distribution, where the mean and the mode coincide.

In [55]:
#### let us generate another DF
my_df = pd.DataFrame(np.random.randn(5, 3), 
                  index=['a', 'b', 'c', 'd', 'e'],
                  columns=['A', 'B', 'C'])
my_df = my_df.mask(np.random.random(my_df.shape) < .3)
my_df

Unnamed: 0,A,B,C
a,,0.085804,1.758938
b,0.028185,0.332538,-0.647632
c,,-1.390425,-0.672348
d,-0.52499,0.495886,-1.490784
e,-0.995681,-0.611037,-0.982574


In [56]:
#### Let us check if there is any missing value in the df
my_df.isnull().values.any()

True

In [57]:
#### Then we compute the mean of each column of the DF
#### Note that mean() skips NaN values by default (skipna=True)
my_df.mean()

A   -0.497495
B   -0.217447
C   -0.406880
dtype: float64

In [58]:
df_filled = my_df.fillna(my_df.mean())
df_filled

Unnamed: 0,A,B,C
a,-0.497495,0.085804,1.758938
b,0.028185,0.332538,-0.647632
c,-0.497495,-1.390425,-0.672348
d,-0.52499,0.495886,-1.490784
e,-0.995681,-0.611037,-0.982574


In [59]:
#### Now let us check again whether any missing values remain
df_filled.isnull().values.any()

False
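
The mean is sensitive to outliers and is not defined for text columns. As a hedged alternative, the sketch below fills a numeric column with its median and a categorical column with its mode; the frame and its column names ('height', 'colour') are hypothetical and not part of the notebook data:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'height' has an outlier (100.0), 'colour' is categorical
frame = pd.DataFrame({'height': [1.0, 2.0, np.nan, 100.0],
                      'colour': ['red', 'blue', 'red', None]})

# The median of the valid heights (1, 2, 100) is less affected by the outlier than the mean
height_median = frame['height'].median()
frame['height'] = frame['height'].fillna(height_median)

# mode() returns a Series because ties are possible; take the first entry
colour_mode = frame['colour'].mode().iloc[0]
frame['colour'] = frame['colour'].fillna(colour_mode)

print(frame)
```

Which statistic is appropriate depends on the distribution of each column, so it is worth inspecting the data before choosing one.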

More info regarding how to handle missing data can be found [here](https://machinelearningmastery.com/handle-missing-data-python/).

# Other Tasks in Data Preprocessing

- Handling categorical data (coding)
- Handling imbalanced data
- Feature engineering

These topics will be covered in a later part of this class.