# Data Preprocessing

Preprocessing is the process of transforming raw data into something that a machine learning algorithm can understand. Within this context, most algorithms demand that you specify what “type” of data a feature is. We quickly review the different types of features and how to encode them in pandas/sklearn below. Note that **any statistics a transformation needs (means, minima, category lists, ...) should always be computed on the training data only and then applied unchanged to both the training and test data**, since calculating statistics on the test data is considered a form of “cheating” (data leakage). While we give alternatives for both pandas and sklearn, it is usually more convenient to work with sklearn for this reason: its transformers separate fit() from transform().
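
As a minimal sketch of what this means in code, the snippet below standardizes a small made-up array. The numbers and the variable names (X_train_demo, X_test_demo) are illustrative assumptions and not part of the dataset used later. The scaler's statistics are learned from the training split and then reused on the test split.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration
X_train_demo = np.array([[1.0], [2.0], [3.0]])
X_test_demo = np.array([[4.0]])

# Learn the mean and standard deviation from the training split only
scaler = StandardScaler().fit(X_train_demo)

# Reuse those training statistics on both splits
X_train_scaled = scaler.transform(X_train_demo)
X_test_scaled = scaler.transform(X_test_demo)

print(scaler.mean_)   # mean of the training column
print(X_test_scaled)  # test value expressed in training units
```

Calling fit() on the test data as well would let information about the test set influence the transformation, which is exactly the leakage described above.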


## Handling missing data

Missing data is one of the most annoying, but also most common problems in any machine learning project. Your algorithms will assume your data matrices to be completely filled with numerical values so we need to somehow ensure this requirement. There are two basic strategies here: (a) dropping features or observations with many missing values and (b) imputing the values. There are various imputation strategies, but they all replace the missing value with some statistic that is calculated on the other rows for this particular feature. We may for instance replace a missing value with the mean of that feature or the most frequent value (the mode).
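
The two strategies can be sketched in plain pandas as follows. The toy frame, the column names and the 50% threshold are assumptions chosen for illustration only; a sensible threshold depends on the problem at hand.

```python
import numpy as np
import pandas as pd

# Toy frames, made up for illustration; NaN marks a missing value
train_df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [5.0, 6.0, 7.0, 8.0],
})
test_df = pd.DataFrame({"a": [np.nan], "b": [2.0], "c": [9.0]})

# (a) Drop columns in which more than half of the training values are missing
missing_share = train_df.isna().mean()
keep = missing_share[missing_share <= 0.5].index
train_df, test_df = train_df[keep], test_df[keep]

# (b) Impute what is left with the means computed on the training data
train_means = train_df.mean()
train_filled = train_df.fillna(train_means)
test_filled = test_df.fillna(train_means)

print(train_filled)
print(test_filled)
```

Note that the test set is filled with the training means, so no statistic is ever computed on the test data.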

In [3]:
#Preparing Dataset
import pandas as pd
import sklearn.datasets
import numpy as np
from sklearn.model_selection import train_test_split

iris = sklearn.datasets.load_iris()
X = iris.data   
y = iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, shuffle=False)

#Create some missing values
X_train[:5, 2] = np.nan
X_test[:5, 2] = np.nan
display(pd.DataFrame(X_train).head())
display(pd.DataFrame(X_test).head())

Unnamed: 0,0,1,2,3
0,5.1,3.5,,0.2
1,4.9,3.0,,0.2
2,4.7,3.2,,0.2
3,4.6,3.1,,0.2
4,5.0,3.6,,0.2


Unnamed: 0,0,1,2,3
0,7.7,3.0,,2.3
1,6.3,3.4,,2.4
2,6.4,3.1,,1.8
3,6.0,3.0,,1.8
4,6.9,3.1,,2.1


In [4]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="most_frequent")

# Available strategies: "mean", "median", "most_frequent", "constant"

# Fit on the training data only, then apply to both sets
imputer.fit(X_train)
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

display(pd.DataFrame(X_train).head())
display(pd.DataFrame(X_test).head())


Unnamed: 0,0,1,2,3
0,5.1,3.5,1.5,0.2
1,4.9,3.0,1.5,0.2
2,4.7,3.2,1.5,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.5,0.2


Unnamed: 0,0,1,2,3
0,7.7,3.0,1.5,2.3
1,6.3,3.4,1.5,2.4
2,6.4,3.1,1.5,1.8
3,6.0,3.0,1.5,1.8
4,6.9,3.1,1.5,2.1


### KNN Imputation
KNN imputation fills in each missing value with the weighted or unweighted mean of that feature over the desired number of nearest neighbors.

In [None]:
from sklearn.impute import KNNImputer

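# Note: the NaNs were already filled by SimpleImputer in the previous cell,
# so the data shown before and after KNN imputation is identical here
display(pd.DataFrame(X_train).head())
display(pd.DataFrame(X_test).head())

knnImputer = KNNImputer(n_neighbors=5, weights="uniform")
# Fit on the training data only, then apply to both sets
knnImputer.fit(X_train)
X_train = knnImputer.transform(X_train)
X_test = knnImputer.transform(X_test)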

display(pd.DataFrame(X_train).head())
display(pd.DataFrame(X_test).head())


Unnamed: 0,0,1,2,3
0,5.1,3.5,1.5,0.2
1,4.9,3.0,1.5,0.2
2,4.7,3.2,1.5,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.5,0.2


Unnamed: 0,0,1,2,3
0,7.7,3.0,1.5,2.3
1,6.3,3.4,1.5,2.4
2,6.4,3.1,1.5,1.8
3,6.0,3.0,1.5,1.8
4,6.9,3.1,1.5,2.1


Unnamed: 0,0,1,2,3
0,5.1,3.5,1.5,0.2
1,4.9,3.0,1.5,0.2
2,4.7,3.2,1.5,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.5,0.2


Unnamed: 0,0,1,2,3
0,7.7,3.0,1.5,2.3
1,6.3,3.4,1.5,2.4
2,6.4,3.1,1.5,1.8
3,6.0,3.0,1.5,1.8
4,6.9,3.1,1.5,2.1


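Another possible way to impute missing values is to train a regression model which estimates the missing value based on the other variables. You could do that with sklearn.impute.IterativeImputer, which is still experimental and has to be enabled first by importing enable_iterative_imputer from sklearn.experimental.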
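
A minimal sketch of regression-based imputation is shown below. The toy arrays are made up for illustration (the second column is twice the first), and max_iter and random_state are arbitrary choices rather than recommended settings.

```python
import numpy as np
# This import is required to unlock the experimental IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy data, made up for illustration: column 1 is twice column 0
X_train_demo = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0],
                         [4.0, np.nan], [5.0, 10.0]])
X_test_demo = np.array([[6.0, np.nan]])

# Each feature with missing values is regressed on the other features
iter_imputer = IterativeImputer(max_iter=10, random_state=0)
iter_imputer.fit(X_train_demo)

X_train_imputed = iter_imputer.transform(X_train_demo)
X_test_imputed = iter_imputer.transform(X_test_demo)
print(X_train_imputed)
print(X_test_imputed)
```

Because the imputer exploits the relationship between the columns, the filled-in values should follow the pattern in the data rather than collapse to a single column statistic.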

Note: for categorical values (see later), it is often not a good idea to impute values. Instead, you will want to treat “missing” as a special category/level.
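
Treating “missing” as its own level can be sketched in pandas as follows. The column and the label "missing" are illustrative assumptions; any label that does not clash with a real category would do.

```python
import numpy as np
import pandas as pd

# Toy column, made up for illustration
colors = pd.Series(["red", np.nan, "blue", "red", np.nan], name="color")

# Replace NaN with an explicit level before casting to a categorical
colors_filled = colors.fillna("missing").astype("category")

print(colors_filled.cat.categories.tolist())
print(colors_filled.value_counts().to_dict())
```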

## Feature creation

At this point, it may be useful to engineer your own features. You may want to take the log of some variable or perhaps add polynomial features as we saw in class:

In [None]:
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(3).fit(X_train)
X_train_poly = poly.transform(X_train)
X_test_poly = poly.transform(X_test)
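
The log transform mentioned above can be sketched in the same fit/transform style. This is one possible approach, assuming a non-negative feature; the toy values are made up, and log1p is used instead of a plain log so that zeros do not produce -inf.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Toy, right-skewed non-negative feature, made up for illustration
X_demo = np.array([[0.0], [9.0], [99.0], [999.0]])

# log1p computes log(1 + x) element-wise
log_transformer = FunctionTransformer(np.log1p)
X_log = log_transformer.fit_transform(X_demo)
print(X_log)
```

Wrapping the function in a FunctionTransformer makes it usable wherever sklearn expects a transformer, for instance as a step in a Pipeline.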

Beyond these generic things, you may also want to start thinking about adding external data that is relevant to your learning problem, for instance:
* predicting insurance claims: add weather forecasts
* ice cream sales: add the temperature of the day as a variable
* beer company: add world cup activity data
* stock prediction: add whether or not a big news event happened on that day 
* ...

## Vectorizing categorical data

Categorical variables usually describe traits and are often encoded as either text strings or as integers. Examples include size (“Small”, “Medium”, “Large”), country (“China”, “USA”, “Belgium”, ...), etc. We call the number of values a categorical variable can have the **cardinality** of the variable. A special type of categorical variables are **ordinal** variables, which are ordered in addition to being categorical (e.g., size is an ordinal variable, country is not). Sometimes categorical variables are created for historical or privacy reasons (e.g., bucketing of salaries into brackets), though this is often not beneficial for learning performance. In pandas we can register a variable as being categorical by casting it as such:

In [None]:
# Load the iris data; the species column holds the class names as strings
iris_df = pd.read_csv("https://raw.githubusercontent.com/yx1215/Machine_Learning_Dataset/main/iris.csv")

display(iris_df.head())

iris_df["species"] = iris_df["species"].astype("category")
display(iris_df["species"])

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


0         setosa
1         setosa
2         setosa
3         setosa
4         setosa
         ...    
145    virginica
146    virginica
147    virginica
148    virginica
149    virginica
Name: species, Length: 150, dtype: category
Categories (3, object): ['setosa', 'versicolor', 'virginica']

### Label Encoder

One way to convert categorical data into numerical data that a model can process is to assign each category an integer between $0$ and $C-1$, where $C$ is the number of categories (the cardinality).

In [None]:
from sklearn.preprocessing import LabelEncoder

lbls = LabelEncoder()
iris_df = pd.read_csv("https://raw.githubusercontent.com/yx1215/Machine_Learning_Dataset/main/iris.csv")

print("Before encoding: ")
display(iris_df.head())
display(iris_df.tail())
iris_df["species"] = lbls.fit_transform(iris_df["species"])

print("After encoding: ")
display(iris_df.head())
display(iris_df.tail())
iris_df["species"].dtype

Before encoding: 


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
149,5.9,3.0,5.1,1.8,virginica


After encoding: 


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
145,6.7,3.0,5.2,2.3,2
146,6.3,2.5,5.0,1.9,2
147,6.5,3.0,5.2,2.0,2
148,6.2,3.4,5.4,2.3,2
149,5.9,3.0,5.1,1.8,2


dtype('int64')

It is worth noting that label encoding introduces a new problem. The numbers assigned to the categories suggest to the model that the data is ordered, since $0 < 1 < 2 < ... < C-1$ for $C$ categories, which is not the case for non-ordinal data. Therefore, label encoding is only recommended if the categories really have an order that affects the relationship, or when there are too many categories and not enough data (see later).
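
For a feature that really is ordinal, the order can be stated explicitly. The sketch below uses sklearn's OrdinalEncoder, which is designed for input features (LabelEncoder is intended for the target column). The size values are a made-up example.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy ordinal feature, made up for illustration (one column, object dtype)
sizes = np.array([["Medium"], ["Small"], ["Large"], ["Small"]], dtype=object)

# Spell out the order explicitly; by default categories are sorted alphabetically,
# which would put "Large" before "Medium" and "Small"
size_encoder = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
sizes_encoded = size_encoder.fit_transform(sizes)
print(sizes_encoded.ravel())
```

With the categories listed this way, the integers respect Small < Medium < Large instead of an arbitrary alphabetical order.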

### One-hot Encoder
Some algorithms (deep learning in particular) require a different encoding called **one-hot-encoding** (a.k.a. **dummy encoding**, **indicator variables**). In one-hot-encoding we convert each category value into a new column of a vector and assign to that column a 1/0 value. E.g.:

<img src="./One_hot.png" width="50%">

In this example, each row is converted into a vector of length 3 in which the first position indicates “Small”, the second “Medium” and the last “Large”. It is generally considered good practice to remove one variable to avoid issues of collinearity, even though this does not really matter as much in machine learning. Both pandas and sklearn have functions to do this for you:

In [None]:
from sklearn.preprocessing import OneHotEncoder

lbls = OneHotEncoder()
iris_df = pd.read_csv("https://raw.githubusercontent.com/yx1215/Machine_Learning_Dataset/main/iris.csv")

display(iris_df.head())
display(iris_df.tail())

print(iris_df["species"].size)
out = lbls.fit_transform(iris_df[["species"]]).toarray()
print(out.shape)
print(out[:5, :])

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
145,6.7,3.0,5.2,2.3,virginica
146,6.3,2.5,5.0,1.9,virginica
147,6.5,3.0,5.2,2.0,virginica
148,6.2,3.4,5.4,2.3,virginica
149,5.9,3.0,5.1,1.8,virginica


150
(150, 3)
[[1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]
 [1. 0. 0.]]


Alternatively, you could use pd.get_dummies():

In [None]:
import pandas as pd
pd.get_dummies(iris_df["species"])

Unnamed: 0,setosa,versicolor,virginica
0,1,0,0
1,1,0,0
2,1,0,0
3,1,0,0
4,1,0,0
...,...,...,...
145,0,0,1
146,0,0,1
147,0,0,1
148,0,0,1
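
The earlier remark about removing one indicator column to avoid collinearity can be sketched with either library. The size column below is a made-up example; both calls drop the first category in sorted order.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy column, made up for illustration
sizes_df = pd.DataFrame({"size": ["Small", "Medium", "Large", "Small"]})

# sklearn: drop="first" removes the indicator of the first category
encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(sizes_df[["size"]]).toarray()
print(encoder.categories_[0])  # categories in sorted order
print(encoded.shape)           # one column fewer than the number of categories

# pandas: drop_first=True does the same
dummies = pd.get_dummies(sizes_df["size"], drop_first=True)
print(dummies.columns.tolist())
```

A row of all zeros then stands for the dropped category.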


However, one-hot encoding is often not suitable when there are too many categories: every category level creates a new variable, which increases potential variance problems. Label encoding reduces this problem because it always produces a single column.

## Normalizing continuous data

Continuous data take on a continuous range of values (e.g., temperature, wind-speed or sales per day). Many machine learning algorithms (in particular those trained with some kind of gradient descent) work better when the data is scaled. Scaling methods bring all features onto a comparable range. E.g., the min-max scaling procedure maps each continuous column of a dataset onto the interval $[0, 1]$ with the following formula:

\begin{equation*}
x_j \leftarrow{} \frac{x_j - \min X_j}{\max X_j - \min X_j}
\end{equation*}

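If the data is normally distributed, you may want to use standardization instead, which subtracts the column mean $\mu_{X_j}$ and divides by the column standard deviation $\sigma_{X_j}$:

\begin{equation*}
x_j \leftarrow \frac{x_j - \mu_{X_j}}{\sigma_{X_j}}
\end{equation*}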

Sometimes you will see that people rescale the data so that the l1 or l2 norm is 1; note that sklearn's Normalizer does this per sample (row), not per feature (column). In practice the difference between all of these methods is often minor, but it is important that you do use some kind of normalization. In Python:

In [None]:
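from sklearn.preprocessing import MinMaxScaler, Normalizer, MaxAbsScaler

# Min-max scaling: learn each column's min and max on the training data only
scaler = MinMaxScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# Normalizer rescales each row (sample) to unit l2 norm; it is stateless,
# so fit() learns nothing, but the call keeps the usual fit/transform pattern
scaler = Normalizer().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

# ... other scalers such as MaxAbsScaler follow the same pattern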
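
To connect the two formulas above to the library calls, the following sketch applies them by hand with numpy and compares the result with MinMaxScaler and StandardScaler. The small matrices are made up for illustration; the statistics come from the training matrix only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy training and test matrices, made up for illustration
X_tr = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 40.0]])
X_te = np.array([[3.0, 50.0]])

# Min-max scaling by hand, using the training minimum and maximum
col_min, col_max = X_tr.min(axis=0), X_tr.max(axis=0)
X_te_minmax = (X_te - col_min) / (col_max - col_min)

# Standardization by hand, using the training mean and (population) standard deviation
col_mean, col_std = X_tr.mean(axis=0), X_tr.std(axis=0)
X_te_standard = (X_te - col_mean) / col_std

# Both agree with the sklearn scalers fitted on the training data
same_minmax = np.allclose(X_te_minmax, MinMaxScaler().fit(X_tr).transform(X_te))
same_standard = np.allclose(X_te_standard, StandardScaler().fit(X_tr).transform(X_te))
print(same_minmax, same_standard)
print(X_te_minmax)
```

The second test value lies above the training maximum, so its min-max scaled value exceeds 1. This is expected for unseen data and is not an error.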

## Time series data
Time series are best converted to some standard format such as pandas datetimes (via pd.to_datetime). It is often beneficial to generate extra cyclical variables based on the date such as:
* day of the week
* week number
* month
* quarter
* year
* ...

These can be treated as categorical and signal to the machine learning algorithm that it may have to handle instances coming from different periods completely differently. It is as if you are giving the machine learning model all sorts of interesting extra time-series data to play with. These can be added using standard pandas code:

In [None]:
#Create some fake date data
s = pd.Series(['3/11/2000', '3/12/2000', '3/13/2000', "12/2/2020", "7/4/2021"])
s.head()

0    3/11/2000
1    3/12/2000
2    3/13/2000
3    12/2/2020
4     7/4/2021
dtype: object

In [None]:
s = pd.to_datetime(s, format="%m/%d/%Y")  # the dates are written as month/day/year
# Note: infer_datetime_format=True is deprecated since pandas 2.0, so the format is given explicitly
s.head()

0   2000-03-11
1   2000-03-12
2   2000-03-13
3   2020-12-02
4   2021-07-04
dtype: datetime64[ns]

In [None]:
# Year
display(s.dt.year)

# Quarter
display(s.dt.quarter)

# Day
display(s.dt.day)

# Weekday, 0 stands for Monday, 6 stands for Sunday
display(s.dt.weekday)

0    2000
1    2000
2    2000
3    2020
4    2021
dtype: int64

0    1
1    1
2    1
3    4
4    3
dtype: int64

0    11
1    12
2    13
3     2
4     4
dtype: int64

0    5
1    6
2    0
3    2
4    6
dtype: int64
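
The remaining variables from the list (month and week number) follow the same pattern. The sketch below collects several of them in one DataFrame and marks them as categorical; the column names are arbitrary choices made for illustration.

```python
import pandas as pd

# Same fake dates as above, parsed with an explicit month/day/year format
dates = pd.to_datetime(
    pd.Series(["3/11/2000", "3/12/2000", "3/13/2000", "12/2/2020", "7/4/2021"]),
    format="%m/%d/%Y",
)

date_features = pd.DataFrame({
    "year": dates.dt.year,
    "quarter": dates.dt.quarter,
    "month": dates.dt.month,
    "week": dates.dt.isocalendar().week,  # ISO week number
    "weekday": dates.dt.weekday,          # 0 = Monday, 6 = Sunday
})

# Mark the derived columns as categorical
date_features = date_features.astype("category")
print(date_features)
```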

In [None]:
print(pd.Timestamp.min)
print(pd.Timestamp.max)

1677-09-21 00:12:43.145225
2262-04-11 23:47:16.854775807


## Writing custom preprocessing steps
It’s also possible (and often needed) to write your own preprocessing steps. In sklearn any object that transforms the data is called a transformer. By design, sklearn relies on duck typing rather than on inheritance, which means that all you need to do is implement a class with the three required methods: fit(), transform() and fit_transform() (you can get the last one for free by extending TransformerMixin). These take as input a numpy array or matrix, and fit() must return self so that calls can be chained. If you want to parametrize your preprocessing step, you can do so by extending the BaseEstimator class and storing each parameter in __init__ under its own name. This will give you two extra methods for free: get_params() and set_params(). For instance, you could recreate a standardization preprocessing step by implementing it as shown below:

In [None]:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class MyStandardScaler(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Learn the per-column mean and standard deviation
        self.mean_ = np.mean(X, axis=0)
        self.std_ = np.std(X, axis=0)
        return self

    def transform(self, X, y=None, copy=None):
        # Apply the statistics learned in fit()
        X = (X - self.mean_) / self.std_
        return X
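
As a quick sanity check, a transformer of this kind can be compared with sklearn's own StandardScaler and dropped into a Pipeline. The sketch below redefines the scaler under the hypothetical name ColumnStandardizer so that the block runs on its own; the classifier and the split settings are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


# Hypothetical stand-in for the scaler above (same logic, different name);
# the mixin is listed first, as the scikit-learn developer guide recommends
class ColumnStandardizer(TransformerMixin, BaseEstimator):
    def fit(self, X, y=None):
        self.mean_ = np.mean(X, axis=0)
        self.std_ = np.std(X, axis=0)
        return self

    def transform(self, X, y=None):
        return (X - self.mean_) / self.std_


X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# fit_transform() comes for free from TransformerMixin
matches = np.allclose(ColumnStandardizer().fit_transform(X_tr),
                      StandardScaler().fit_transform(X_tr))
print(matches)

# Inside a Pipeline the transformer is fitted on the training data only
model = Pipeline([
    ("scale", ColumnStandardizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_tr, y_tr)
accuracy = model.score(X_te, y_te)
print(accuracy)
```

Using a Pipeline means the scaler's statistics are never computed on the test data, which keeps the leakage rule from the introduction intact without any extra bookkeeping.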


As a more advanced example, consider the following preprocessor, which takes an array with a date column as input and appends the day of the week as an extra column:

In [None]:
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class DayOfWeek(BaseEstimator, TransformerMixin):
    def __init__(self, date_iloc=0):
        # Index of the column that holds the dates
        self.date_iloc = date_iloc

    def fit(self, X, y=None):
        # Nothing to learn from the data
        return self

    def transform(self, X, y=None):
        # Parse the date column and append the day of the week (0 = Monday) as a new column
        datetimes = pd.to_datetime(X[:, self.date_iloc])
        dayofweek = np.array(datetimes.dayofweek)
        return np.c_[X, dayofweek]

Note that while we use pandas in the above code, this is actually not needed. Generally, you will want to input and output numpy matrices/arrays. In this case we use pandas because it helps us to easily extract the day of week variable that we are looking for.

In [None]:
s = pd.DataFrame([['3/11/2000'], ['3/12/2000'], ['3/13/2000'], ["12/2/2020"], ["7/4/2021"]]).values
print(s)
SS = DayOfWeek()

s = SS.fit_transform(s)
print(s)

[['3/11/2000']
 ['3/12/2000']
 ['3/13/2000']
 ['12/2/2020']
 ['7/4/2021']]
[['3/11/2000' 5]
 ['3/12/2000' 6]
 ['3/13/2000' 0]
 ['12/2/2020' 2]
 ['7/4/2021' 6]]


In [None]:
print(SS.get_params())
params = {"date_iloc": 1}
SS.set_params(**params)
SS.get_params()

{'date_iloc': 0}


{'date_iloc': 1}

## Exercise
Write a transformer such that each column has l2 norm $k$ after transformation. Here $k$ should be a parameter of your transformer.