# Preprocessing

This notebook contains notes on preprocessing raw data. Three types of data are considered:
* Numerical
* Categorical 
* Missing

In [2]:
import numpy as np
import pandas as pd

### Overview of data 

The most common type of data in machine learning is tabular data. 
Tabular data is organized into columns and rows. Each column holds values of a single type and is referred to as a feature or attribute. Each row holds one observation and is referred to as a sample or instance. 

In [3]:
df = pd.read_csv('datasets/airfoil_self_noise.csv')
labels = df.columns 
df.head() # preview of the dataset 

Unnamed: 0,frequency,angle,chord_len,velocity,thickness,sound_pressure
0,800.0,0.0,0.3048,71.3,0.002663,126.201
1,1000.0,0.0,0.3048,71.3,0.002663,125.201
2,1250.0,0.0,0.3048,71.3,0.002663,125.951
3,1600.0,0.0,0.3048,71.3,0.002663,127.591
4,2000.0,0.0,0.3048,71.3,0.002663,127.461


In [4]:
df.describe() # summary of the data 

Unnamed: 0,frequency,angle,chord_len,velocity,thickness,sound_pressure
count,1503.0,1503.0,1503.0,1503.0,1503.0,1503.0
mean,2886.380572,6.782302,0.136548,50.860745,0.01114,124.835943
std,3152.573137,5.918128,0.093541,15.572784,0.01315,6.898657
min,200.0,0.0,0.0254,31.7,0.000401,103.38
25%,800.0,2.0,0.0508,39.6,0.002535,120.191
50%,1600.0,5.4,0.1016,39.6,0.004957,125.721
75%,4000.0,9.9,0.2286,71.3,0.015576,129.9955
max,20000.0,22.2,0.3048,71.3,0.058411,140.987


## Numerical data

### Standardization and normalization for supervised ML: 

* Standardization rescales each feature (column) to have zero mean and unit variance (std=1).
* Min-max scaling rescales each feature (column) to a fixed range, typically [0, 1].
* Normalization rescales each sample (row) to have unit norm.


In [6]:
from sklearn.preprocessing import StandardScaler 
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import Normalizer
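
As a minimal sketch of how the three imported classes behave, the example below applies each one to a small made-up matrix rather than to the airfoil dataset. The array `X` and its values are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Toy feature matrix (hypothetical values): 3 samples (rows) x 2 features (columns).
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

# StandardScaler works column-wise: each feature gets mean 0 and std 1.
X_std = StandardScaler().fit_transform(X)

# MinMaxScaler works column-wise: each feature is mapped linearly onto [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# Normalizer works row-wise: each sample is rescaled to unit Euclidean (L2) norm.
X_norm = Normalizer(norm="l2").fit_transform(X)

print(X_std.mean(axis=0), X_std.std(axis=0))       # per-column mean and std
print(X_minmax.min(axis=0), X_minmax.max(axis=0))  # per-column min and max
print(np.linalg.norm(X_norm, axis=1))              # per-row L2 norm
```

In a supervised setting, a scaler is usually fitted on the training split only (`fit` or `fit_transform`) and then reused on the test split with `transform`, so that no test-set statistics leak into training.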

## Categorical data

### Ordinal encoding and one-hot encoding

* Ordinal encoding assigns each unique category an integer value. This imposes an order on the categories, so it is appropriate when the variable is ordered in nature. For example, age ranges are naturally ordered and can be mapped to integers: 30-39 => 0, 40-49 => 1, 50-59 => 2, etc.
* One-hot encoding transforms each category of the original categorical variable into a new binary variable, so the total number of features increases after preprocessing. One-hot encoding is used when there is no natural ordinal relationship among the categories. Likewise, when the response variable has no ordinal relationship, encoding its labels as ordered integers can result in poor performance. For example, if the response's labels are encoded as 0, 1, 2, an algorithm that treats them as numbers can return a prediction of 1.5, which corresponds to no category.

In [None]:
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder

In [7]:
df = pd.read_csv('datasets/breast_cancer.csv')
df.head()

Unnamed: 0,target,age,tumor-size,deg_malig,side,quad,irradiat
0,no-recurrence-events,30-39,30-34,3,left,left_low,no
1,no-recurrence-events,40-49,20-24,2,right,right_up,no
2,no-recurrence-events,40-49,20-24,2,left,left_low,no
3,no-recurrence-events,60-69,15-19,2,right,left_up,no
4,no-recurrence-events,40-49,0-4,2,right,right_low,no
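
The sketch below illustrates both encoders on a small frame that copies the `age` and `side` values from the five preview rows, so it does not depend on the CSV file. The frame `toy` and the list `age_order` are assumptions for illustration; the full dataset may contain age ranges that are not listed here.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Small frame mirroring two columns of the preview rows.
toy = pd.DataFrame({
    "age": ["30-39", "40-49", "40-49", "60-69", "40-49"],
    "side": ["left", "right", "left", "right", "right"],
})

# Ordinal encoding: state the category order explicitly instead of
# relying on the encoder's default sorted order.
# (Hypothetical order covering only the ranges used in this toy example.)
age_order = ["30-39", "40-49", "50-59", "60-69"]
ordinal = OrdinalEncoder(categories=[age_order])
age_encoded = ordinal.fit_transform(toy[["age"]])

# One-hot encoding: one binary column per category of `side`.
onehot = OneHotEncoder()
side_encoded = onehot.fit_transform(toy[["side"]]).toarray()

print(age_encoded.ravel())    # integer code of each age range
print(onehot.categories_[0])  # categories, in output column order
print(side_encoded)           # one row per sample, one column per category
```

Passing `categories` explicitly is a design choice worth considering for ordinal data: without it, the encoder orders the categories by sorting the values, which may not match the intended ranking.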


### Label encoding 

* `LabelEncoder` is a separate class intended for encoding the target variable (the labels) rather than the input features.

In [None]:
from sklearn.preprocessing import LabelEncoder
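
A minimal sketch of `LabelEncoder` on a short list of target labels. The list `y` is made up for illustration; in particular, the second class name `recurrence-events` is an assumption, since the preview rows only show `no-recurrence-events`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target labels; only the first class appears in the preview.
y = ["no-recurrence-events", "recurrence-events", "no-recurrence-events"]

le = LabelEncoder()
y_encoded = le.fit_transform(y)  # 1-D array of integer class codes

print(le.classes_)                      # classes in sorted order
print(y_encoded)                        # integer code for each label
print(le.inverse_transform(y_encoded))  # map the codes back to the labels
```

Unlike `OrdinalEncoder`, which expects a 2-D array of features, `LabelEncoder` takes a 1-D sequence of labels.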

## Missing data 

In [9]:
df = pd.read_csv('datasets/iris_numeric.csv')
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,,1.4,0.2,0


In [12]:
# check if there is any missing data 
print(df.isna().values.any())
# check which columns have missing data 
print(df.isna().any())
df.info()

True
sepal_length    False
sepal_width      True
petal_length     True
petal_width     False
target          False
dtype: bool
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   sepal_length  150 non-null    float64
 1   sepal_width   149 non-null    float64
 2   petal_length  149 non-null    float64
 3   petal_width   150 non-null    float64
 4   target        150 non-null    int64  
dtypes: float64(4), int64(1)
memory usage: 6.0 KB


In [13]:
# specific rows of missing data 
df[df.isna().any(axis=1)]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,target
4,5.0,,1.4,0.2,0
10,5.4,3.7,,0.2,0


### Imputation of missing data 

To impute missing values, we can replace them with

* a constant value
* a statistic of the column, such as the mean, median, or mode. 

In [14]:
from sklearn.impute import SimpleImputer
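
As a sketch of the two options, the example below runs `SimpleImputer` on a small made-up frame with one missing value per column. The frame `toy` and the fill value `0.0` are assumptions for illustration, not values taken from the iris file.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (hypothetical values) with one missing entry in each column.
toy = pd.DataFrame({
    "sepal_width": [3.5, 3.0, 3.2, 3.1, np.nan],
    "petal_length": [1.4, 1.4, np.nan, 1.5, 1.4],
})

# Statistic-based imputation: fill each column with its own mean.
# Other built-in strategies are "median" and "most_frequent" (the mode).
mean_imputer = SimpleImputer(strategy="mean")
filled_mean = mean_imputer.fit_transform(toy)

# Constant imputation: fill every missing entry with the same value.
const_imputer = SimpleImputer(strategy="constant", fill_value=0.0)
filled_const = const_imputer.fit_transform(toy)

print(mean_imputer.statistics_)  # per-column means used as fill values
print(filled_mean)
print(filled_const)
```

`fit_transform` returns a NumPy array, so the column names need to be reattached if a DataFrame is wanted, for example with `pd.DataFrame(filled_mean, columns=toy.columns)`.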