# Data Preprocessing


```
Author:
Zach Wolpe
zachcolinwolpe@gmail.com
www.zachwolpe.com
```


A simple but efficient data preprocessing template.

# Import

In [60]:
# Data Preprocessing Template

# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

# Importing the dataset
dataset = pd.read_csv('Data.csv')
print(dataset)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values

   Country   Age   Salary Purchased
0   France  44.0  72000.0        No
1    Spain  27.0  48000.0       Yes
2  Germany  30.0  54000.0        No
3    Spain  38.0  61000.0        No
4  Germany  40.0      NaN       Yes
5   France  35.0  58000.0       Yes
6    Spain   NaN  52000.0        No
7   France  48.0  79000.0       Yes
8  Germany  50.0  83000.0        No
9   France  37.0  67000.0       Yes


# Missing Data

It is careless to simply remove missing data, since every dropped row also discards the values that were observed in its other columns.
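
To make the cost concrete, here is a small sketch that counts how many rows would be lost by dropping incomplete observations. The frame is retyped from the table printed above, and `dataset_demo` is a name introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

# Rows retyped from the Data.csv printout above; `dataset_demo` is an illustrative name.
dataset_demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

# Number of missing entries in each column
missing_per_column = dataset_demo.isna().sum()

# Drop every row that has at least one missing value
complete_rows = dataset_demo.dropna()
rows_lost = len(dataset_demo) - len(complete_rows)

print(missing_per_column)
print(f"rows lost: {rows_lost} of {len(dataset_demo)}")
```

Only two cells are missing, yet dropping incomplete rows would discard 2 of the 10 observations.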

## Continuous variables
One popular approach for handling missing data is to simply replace each missing value with the mean of its column.

Below, the two missing values (one in Age, one in Salary) have been replaced with the column means.

In [61]:
# Taking care of missing data
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean')
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
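
As a sanity check, the imputed values can be reproduced by hand. This is a minimal sketch with the Age and Salary columns retyped from the printed dataset; the variable names are illustrative and not part of the template.

```python
import numpy as np

# Age and Salary retyped from the printed dataset (NaN marks the gaps)
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
salary = np.array([72000, 48000, 54000, 61000, np.nan,
                   58000, 52000, 79000, 83000, 67000], dtype=float)

# np.nanmean averages only the observed entries
age_mean = np.nanmean(age)
salary_mean = np.nanmean(salary)

# Fill each gap with its column mean
age_filled = np.where(np.isnan(age), age_mean, age)
salary_filled = np.where(np.isnan(salary), salary_mean, salary)

print(age_mean, salary_mean)
```

The two means, 349/9 for Age and 574000/9 for Salary, match the values 38.78 and 63777.78 that appear in the imputed array.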

# Dummy Variables

One-hot encoding (creating dummy variables) can be readily done with sklearn. 

Remember that encoding one fewer variable than the number of categories is sufficient to capture all the information. The code below converts the categorical variables to numeric values and creates one dummy variable (0 or 1) for each category, so one of the three country columns is redundant.

In [62]:
# Encoding categorical data
# Encoding the Independent Variable
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

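# Create Dummy Variables
# OneHotEncoder(categorical_features=...) was removed in scikit-learn 0.22;
# ColumnTransformer applies the encoder to column 0 and passes the rest through.
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer([('country', OneHotEncoder(), [0])],
                       remainder = 'passthrough', sparse_threshold = 0)
X = np.array(ct.fit_transform(X), dtype = float)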


# Encoding the Dependent Variable
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

X, y

(array([[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.40000000e+01,
         7.20000000e+04],
        [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 2.70000000e+01,
         4.80000000e+04],
        [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01,
         5.40000000e+04],
        [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,
         6.10000000e+04],
        [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
         6.37777778e+04],
        [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.50000000e+01,
         5.80000000e+04],
        [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.87777778e+01,
         5.20000000e+04],
        [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
         7.90000000e+04],
        [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 5.00000000e+01,
         8.30000000e+04],
        [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
         6.70000000e+04]]), array([0
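
The "one fewer variable" remark can be sketched with the `drop` option of `OneHotEncoder`. This assumes scikit-learn 0.21 or newer, where `drop` is available; the small `countries` array is illustrative.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A few country labels, shaped as a single column (illustrative data)
countries = np.array(["France", "Spain", "Germany", "Spain", "Germany"]).reshape(-1, 1)

# drop="first" omits the first category in sorted order, here France
encoder = OneHotEncoder(drop="first")
dummies = encoder.fit_transform(countries).toarray()

print(encoder.categories_[0])  # all categories seen during fit
print(dummies)                 # two columns: Germany, Spain
```

France is then represented by a row of zeros, so the two remaining columns still identify all three countries.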

# Train-Test-Split

Perform splits AFTER preprocessing.

In [63]:
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
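
The split above is random, so it changes on every run. A possible variant, sketched here with placeholder arrays `X_demo` and `y_demo` rather than the notebook's data, fixes the seed with `random_state` and keeps the class balance with `stratify`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 rows, 2 features, balanced binary labels
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

# random_state makes the split repeatable; stratify preserves the 50/50 label ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
```

With 10 rows and `test_size=0.2`, the test set holds 2 rows and the training set 8.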

# Feature Scaling

It is often advisable to scale features to prevent the model from overweighting variables of larger magnitude.

This overweighting usually arises from distance calculations. Many machine learning models rely on Euclidean distance, so features measured on very different scales can hinder model performance: the squared differences of the large-scale feature dominate the sum.
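
A rough illustration of this effect, using the imputed Age and Salary values retyped from the array printed earlier (the two imputed entries are rounded here, and `salary_share` is a helper written only for this sketch):

```python
import numpy as np

# Age and Salary after mean imputation, retyped from the array printed earlier
# (the two imputed values are rounded to two decimals here).
features = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, 63777.78], [35.0, 58000.0], [38.78, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

def salary_share(a, b):
    """Fraction of the squared Euclidean distance contributed by Salary."""
    sq_diff = (a - b) ** 2
    return sq_diff[1] / sq_diff.sum()

# Raw units: compare the first two rows
raw_share = salary_share(features[0], features[1])

# Standardized units: subtract the column mean, divide by the column std
scaled = (features - features.mean(axis=0)) / features.std(axis=0)
scaled_share = salary_share(scaled[0], scaled[1])

print(f"salary share of squared distance, raw:    {raw_share:.6f}")
print(f"salary share of squared distance, scaled: {scaled_share:.6f}")
```

In raw units the Age difference of 17 years is almost invisible next to the Salary difference of 24,000; after standardization both features contribute on a comparable scale.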

Feature scaling also tends to help gradient-based algorithms converge faster.

It may not be necessary to scale dummy variables, as they only take the values 0 and 1 and are therefore already bounded in $[0, 1]$. This is often a matter of preference.
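
If the dummy columns are to be left untouched, one option is to scale only the continuous columns. This is a sketch under assumptions, not part of the template: `X_demo` is an illustrative array laid out like the encoded `X` (three dummy columns, then Age and Salary), and `ColumnTransformer` is used to pick the columns.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Illustrative rows: three country dummies followed by Age and Salary
X_demo = np.array([
    [1, 0, 0, 44, 72000],
    [0, 0, 1, 27, 48000],
    [0, 1, 0, 30, 54000],
    [0, 0, 1, 38, 61000],
], dtype=float)

# Standardize columns 3 and 4 only; pass the dummy columns through unchanged
scale_continuous = ColumnTransformer(
    [("scale", StandardScaler(), [3, 4])],
    remainder="passthrough",
)
X_scaled = scale_continuous.fit_transform(X_demo)

print(X_scaled)
```

Note that `ColumnTransformer` places the transformed columns first, so the scaled Age and Salary come before the dummy columns in `X_scaled`.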

Two popular approaches to feature scaling:

#### Standardization 
$$X_{std} = \frac{X - \bar{X}}{\mathrm{std}(X)}$$

#### Normalization 
$$X_{norm} = \frac{X - \min(X)}{\max(X) - \min(X)}$$
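
Both formulas can be checked against scikit-learn's scalers. The sketch below uses a handful of Age values from the dataset as an illustrative column; note that `StandardScaler` divides by the population standard deviation, which is also NumPy's default (`ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# A single illustrative feature column (a few Age values from the dataset)
ages = np.array([[44.0], [27.0], [30.0], [38.0], [40.0]])

# Standardization: subtract the mean, divide by the (population) standard deviation
ages_std = (ages - ages.mean(axis=0)) / ages.std(axis=0)

# Normalization: rescale so the minimum maps to 0 and the maximum to 1
ages_norm = (ages - ages.min(axis=0)) / (ages.max(axis=0) - ages.min(axis=0))

# The hand-written formulas agree with the library implementations
print(np.allclose(ages_std, StandardScaler().fit_transform(ages)))
print(np.allclose(ages_norm, MinMaxScaler().fit_transform(ages)))
```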


###### Note: 
The Y variable is not scaled in the template because it is binary (discrete). Scaling the dependent variable is only worth considering when it is continuous, as in regression.
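
For the continuous case, a minimal sketch might look like the following. The target `y_cont` is hypothetical (a few salary values reused as a stand-in); `StandardScaler` expects a 2-D array, hence the `reshape`, and `inverse_transform` maps values back to the original units.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical continuous target (salary values used as a stand-in)
y_cont = np.array([72000.0, 48000.0, 54000.0, 61000.0])

# StandardScaler works on 2-D input, so reshape to one column and flatten afterwards
sc_y_demo = StandardScaler()
y_scaled = sc_y_demo.fit_transform(y_cont.reshape(-1, 1)).ravel()

# Map scaled values (for example, model predictions) back to the original units
y_restored = sc_y_demo.inverse_transform(y_scaled.reshape(-1, 1)).ravel()

print(y_scaled)
print(y_restored)
```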

In [65]:
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

#sc_y = StandardScaler()
#y_train = sc_y.fit_transform(y_train.reshape(-1, 1))  # StandardScaler expects a 2-D array

X_train

array([[-0.77459667, -0.77459667,  1.73205081, -0.25987589, -0.35492486],
       [-0.77459667,  1.29099445, -0.57735027,  0.06553392, -0.09005556],
       [ 1.29099445, -0.77459667, -0.57735027, -0.74799061, -0.6409837 ],
       [ 1.29099445, -0.77459667, -0.57735027,  1.36717316,  1.3614282 ],
       [ 1.29099445, -0.77459667, -0.57735027, -0.4225808 ,  0.21719283],
       [-0.77459667,  1.29099445, -0.57735027,  1.69258297,  1.74283999],
       [-0.77459667,  1.29099445, -0.57735027, -1.56151513, -1.0223955 ],
       [-0.77459667, -0.77459667,  1.73205081, -0.13332763, -1.21310139]])