# Data Preprocessing Tools

## Importing the libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [None]:
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values   # Matrix of features: every column except the last one.
y = dataset.iloc[:, -1].values   # Dependent variable vector: only the last column.

In [None]:
dataset.iloc[:, :-1]

   Country   Age   Salary
0   France  44.0  72000.0
1    Spain  27.0  48000.0
2  Germany  30.0  54000.0
3    Spain  38.0  61000.0
4  Germany  40.0      NaN
5   France  35.0  58000.0
6    Spain   NaN  52000.0
7   France  48.0  79000.0
8  Germany  50.0  83000.0
9   France  37.0  67000.0


In [None]:
dataset.iloc[:, -1]

0     No
1    Yes
2     No
3     No
4    Yes
5    Yes
6     No
7    Yes
8     No
9    Yes
Name: Purchased, dtype: object

In [None]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [None]:
y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

In [None]:
print(X)

# There are some missing values, which are printed as nan.

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [None]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
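
The slicing used above can be tried without `Data.csv`. The following is a minimal sketch on a small in-memory DataFrame; `demo_df`, `X_demo` and `y_demo` are illustrative names that do not appear in the notebook, and the three rows are only a stand-in for the real file.

```python
import numpy as np
import pandas as pd

# Hypothetical in-memory stand-in for the first rows of Data.csv.
demo_df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, np.nan],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

X_demo = demo_df.iloc[:, :-1].values  # all rows, every column except the last
y_demo = demo_df.iloc[:, -1].values   # all rows, only the last column

print(X_demo.shape)  # two-dimensional: rows x feature columns
print(y_demo.shape)  # one-dimensional: one label per row
```

`iloc` selects by position, so `:-1` and `-1` keep working if columns are renamed, but not if the dependent variable stops being the last column.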


## Taking care of missing data

In [None]:
# A 'class' is a blueprint, i.e., a set of instructions; e.g., the plan for building a house is a class.
# An 'object' is an instance built from that blueprint: the house, in this case. A method is a function that belongs to the object and does some work.

# Connecting the object to the matrix of features X: object_name.method_name(X)

from sklearn.impute import SimpleImputer   # Import the SimpleImputer class from the sklearn.impute module. sklearn is the import name of scikit-learn, an amazing library for data science.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')   # This returns an object.
imputer.fit(X[:, 1:3])   # Apply the fit() method of the imputer object to the numerical columns (Age and Salary). fit() computes the mean of each column it is applied to, ignoring the missing values.
X[:, 1:3] = imputer.transform(X[:, 1:3])   # imputer.transform(X[:, 1:3]) returns the two columns with the missing values replaced by the column means. Assigning the result back to X[:, 1:3] overwrites columns 1 and 2 of the original X.


In [None]:
imputer.fit(X[:, 1:3])

In [None]:
# Just checking the individual part of the code.

imputer.transform(X[:, 1:3])

array([[4.40000000e+01, 7.20000000e+04],
       [2.70000000e+01, 4.80000000e+04],
       [3.00000000e+01, 5.40000000e+04],
       [3.80000000e+01, 6.10000000e+04],
       [4.00000000e+01, 6.37777778e+04],
       [3.50000000e+01, 5.80000000e+04],
       [3.87777778e+01, 5.20000000e+04],
       [4.80000000e+01, 7.90000000e+04],
       [5.00000000e+01, 8.30000000e+04],
       [3.70000000e+01, 6.70000000e+04]])

In [None]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [None]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
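
To see where the values 38.777... and 63777.777... come from, the sketch below refits an imputer on the Age and Salary values printed above and compares what it learned with a plain NumPy mean that skips `nan`. The names `numeric`, `demo_imputer` and `filled` are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns copied from the dataset printed above.
numeric = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

demo_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
demo_imputer.fit(numeric)

# statistics_ holds one learned value per column (here, the column means).
print(demo_imputer.statistics_)
# np.nanmean ignores nan entries, so it should agree with what fit() learned.
print(np.nanmean(numeric, axis=0))

filled = demo_imputer.transform(numeric)
print(filled[4], filled[6])  # the two rows that had a missing value
```

Each mean is taken over the nine observed values of its column, which is why the filled entries are not round numbers.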


## Encoding categorical data

### Encoding the Independent Variable

In [None]:
# We now want to encode the categorical values (i.e., values that are strings) as numbers. If we just encode
# France = 0, Spain = 1 and Germany = 2, the ML model could infer that there is some numerical ordering among these
# countries and that this order matters, which is not true. We want to prevent the model from making
# such an interpretation, as it could produce spurious correlations between the features and the outcome.

# The encoding can be done with one-hot encoding. This method converts the categorical
# column it is applied to into n columns, where n = number of unique values in the categorical column. Here
# n = 3 as there are only 3 unique countries: France, Germany and Spain. OneHotEncoder creates binary vectors, i.e.,
# vectors having only 0 and 1 as elements, one for each unique value of the categorical column. The dimension
# of each vector is n. OneHotEncoder sorts the categories, so France becomes (1, 0, 0),
# Germany becomes (0, 1, 0) and Spain becomes (0, 0, 1). There is now no numerical order among the
# 3 countries. The vectors are also orthonormal. This is a very useful method when
# we are pre-processing categorical data.

# Also, the dependent variable here can take only 2 values: yes and no. They are strings (categories) as well. We
# replace them by 1 and 0. For a binary outcome, i.e., an outcome with two levels, this is perfectly fine.

# To do one-hot encoding on the feature columns, we use two classes: (i) ColumnTransformer from the
# sklearn.compose module and (ii) OneHotEncoder from the sklearn.preprocessing module.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder   # One-hot means a vector of all 0's and exactly one 1.
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')   # ct is an object of the ColumnTransformer class. The transformers argument expects a list of (name, transformer, column indices) tuples: here the name 'encoder', the OneHotEncoder transformation, and column 0 (Country). remainder='passthrough' keeps the columns that are not transformed (Age and Salary here) unchanged.

# Now connect the object to the matrix of features X. This time we can apply fit and transform at once with the
# fit_transform method: object_name.method_name(X).

print(ct.fit_transform(X))

X = np.array(ct.fit_transform(X))   # Replace the old X with the transformed X. We also convert the output to a NumPy array, since the later steps work with X as a NumPy array.

# We see that there is no country column anymore; instead there are 3 binary columns. France is encoded
# as (1, 0, 0), Spain as (0, 0, 1) and Germany as (0, 1, 0).

# So, OneHotEncoder encodes the categorical data into numerical values, and there is no ordering among these numbers.


[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [None]:
print(X)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
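
Which binary column belongs to which country can be read from the fitted encoder instead of being inferred from the output. The sketch below is a standalone illustration on a few country values; `countries`, `demo_encoder` and `encoded` are illustrative names, not part of the notebook.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# A single categorical column, shaped (n_samples, 1) as the encoder expects.
countries = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]], dtype=object)

demo_encoder = OneHotEncoder()
# The encoder returns a sparse matrix by default; toarray() makes it dense.
encoded = demo_encoder.fit_transform(countries).toarray()

print(demo_encoder.categories_)  # learned categories, in output-column order
print(encoded)
```

The categories are stored in sorted order, which explains why Germany takes the second column and Spain the third in the output above.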


### Encoding the Dependent Variable

In [None]:
# Similar to the encoding of the matrix of features, we will now encode the dependent variable vector, as it is also
# categorical. We will use the LabelEncoder class, which encodes No and Yes as 0 and 1, respectively.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()   # Object of the LabelEncoder class. It works on a single vector, so we do not pass any argument.
y = le.fit_transform(y)   # The fit_transform method converts No and Yes into 0 and 1, respectively.


In [None]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
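
The mapping between labels and integer codes is stored on the fitted encoder, so it can be inspected and reversed. This is a small standalone sketch; `labels`, `demo_le` and `codes` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["No", "Yes", "No", "No", "Yes"])

demo_le = LabelEncoder()
codes = demo_le.fit_transform(labels)

print(demo_le.classes_)                  # position in this array = integer code
print(codes)
print(demo_le.inverse_transform(codes))  # back to the original strings
```

`inverse_transform` is handy later on, when model predictions (0/1) need to be reported as the original No/Yes labels.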


### Exercise

In [None]:
# Importing the necessary libraries
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder

# Load the dataset
dataset = pd.read_csv('titanic.csv')
# print(dataset)
X = dataset.iloc[:, 2:].values
# print(X)
# print(X[:, 2])
y = dataset.iloc[:, 1].values
# print(y)

# Identify the categorical data

# Implement an instance of the ColumnTransformer class
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [2])], remainder='passthrough')

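# Apply the fit_transform method on the instance of ColumnTransformer and convert the output into a NumPy array.
# Column 2 of X is Sex, so this step puts two binary columns at the front of X.
X = np.array(ct.fit_transform(X))

# Note: this second call re-fits the same transformer on the already-transformed array. The column now at index 2
# is Pclass, so it is one-hot encoded as well, which is why each row of the output below starts with five binary values.
X = np.array(ct.fit_transform(X))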

# Use LabelEncoder to encode binary categorical data
le = LabelEncoder()
y = le.fit_transform(y)

# Print the updated matrix of features and the dependent variable vector
print(X[0:10, :])
print(y)


[[1.0 0.0 0.0 1.0 0.0
  'Cumings, Mrs. John Bradley (Florence Briggs Thayer)' 38.0 1 0
  'PC 17599' 71.2833 'C85' 'C']
 [1.0 0.0 0.0 1.0 0.0 'Futrelle, Mrs. Jacques Heath (Lily May Peel)' 35.0
  1 0 '113803' 53.1 'C123' 'S']
 [1.0 0.0 0.0 0.0 1.0 'McCarthy, Mr. Timothy J' 54.0 0 0 '17463' 51.8625
  'E46' 'S']
 [0.0 0.0 1.0 1.0 0.0 'Sandstrom, Miss. Marguerite Rut' 4.0 1 1 'PP 9549'
  16.7 'G6' 'S']
 [1.0 0.0 0.0 1.0 0.0 'Bonnell, Miss. Elizabeth' 58.0 0 0 '113783' 26.55
  'C103' 'S']
 [0.0 1.0 0.0 0.0 1.0 'Beesley, Mr. Lawrence' 34.0 0 0 '248698' 13.0
  'D56' 'S']
 [1.0 0.0 0.0 0.0 1.0 'Sloper, Mr. William Thompson' 28.0 0 0 '113788'
  35.5 'A6' 'S']
 [1.0 0.0 0.0 0.0 1.0 'Fortune, Mr. Charles Alexander' 19.0 3 2 '19950'
  263.0 'C23 C25 C27' 'S']
 [1.0 0.0 0.0 1.0 0.0 'Harper, Mrs. Henry Sleeper (Myna Haxtun)' 49.0 1 0
  'PC 17572' 76.7292 'D33' 'C']
 [1.0 0.0 0.0 0.0 1.0 'Ostby, Mr. Engelhart Cornelius' 65.0 0 1 '113509'
  61.9792 'B30' 'C']]
[1 1 0 1 1 1 1 0 1 0 0 1 0 1 0 0 1 0 0 0 
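
The "identify the categorical data" step has no code in the cell above. One possible way to do it is sketched below on a hypothetical DataFrame with Titanic-style column names; the rows are made up for illustration and are not taken from `titanic.csv`, and `titanic_demo`, `categorical_cols` and `categorical_idx` are assumed names.

```python
import pandas as pd

# Hypothetical rows with Titanic-style columns; not the real titanic.csv.
titanic_demo = pd.DataFrame({
    "Pclass": [1, 3, 2],
    "Sex": ["female", "male", "male"],
    "Age": [38.0, 22.0, 34.0],
    "Embarked": ["C", "S", "S"],
})

# Non-numeric columns are the usual candidates for encoding.
categorical_cols = titanic_demo.select_dtypes(exclude="number").columns.tolist()
# ColumnTransformer needs positional indices when X is a NumPy array.
categorical_idx = [titanic_demo.columns.get_loc(c) for c in categorical_cols]

print(categorical_cols)
print(categorical_idx)
```

A numeric column such as Pclass can still be categorical in meaning, so this check is a starting point rather than a complete answer.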

## Splitting the dataset into the Training set and Test set

**One important question:** Should we apply feature scaling before or after splitting the dataset into the training and test sets?

The simple answer is: apply feature scaling **after** splitting the dataset into training and test sets.

We train the ML model on the training set and evaluate its performance on the test set. The test set is completely new to the ML model, as it was not part of the training.

We do feature scaling to ensure that the values of all the features are on the same scale. This prevents one feature from dominating the others, which would otherwise be neglected by the ML model.

In feature scaling, we transform each original column using its mean and standard deviation (in the case of standardization). If you apply feature scaling before the train-test split, the test set will influence the mean and standard deviation of each column. This is **not** supposed to happen, as the test set must be completely new to the ML model. Applying feature scaling before the split therefore causes **information leakage** from the test set to the ML model.

Feature scaling is not necessary for all ML models, even when the features take very different values.
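
The leakage argument can be made concrete with a small experiment. The sketch below uses synthetic data (not one of the datasets in this notebook) and compares the mean a scaler learns from all rows with the mean it learns from the training rows only; `X_all`, `leaky_scaler` and `clean_scaler` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_all = rng.normal(loc=50.0, scale=10.0, size=(100, 1))  # one synthetic feature

X_tr, X_te = train_test_split(X_all, test_size=0.2, random_state=1)

leaky_scaler = StandardScaler().fit(X_all)  # sees the test rows: leakage
clean_scaler = StandardScaler().fit(X_tr)   # sees only the training rows

print("mean learned from all rows:     ", leaky_scaler.mean_[0])
print("mean learned from training rows:", clean_scaler.mean_[0])
```

Whenever the two numbers differ, the difference is information about the test rows that a scaler fitted before the split would pass on to the model.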

In [None]:
# Splitting the data into train and test sets

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)   # Training set = 80% and test set = 20% of the total data. random_state is set to an integer so that the same split is reproduced on every run.


In [None]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [None]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [None]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [None]:
print(y_test)

[0 1]
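
The role of `random_state` can be checked directly: two calls with the same integer should return identical splits. A minimal sketch on a toy array follows; `X_small`, `y_small`, `split_a` and `split_b` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_small = np.arange(20).reshape(10, 2)  # 10 rows, 2 features
y_small = np.arange(10)

# Each call returns [X_train, X_test, y_train, y_test].
split_a = train_test_split(X_small, y_small, test_size=0.2, random_state=1)
split_b = train_test_split(X_small, y_small, test_size=0.2, random_state=1)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
print(split_a[0].shape, split_a[1].shape)  # training rows, test rows
```

Leaving `random_state` unset gives a different shuffle on each run, which makes results harder to compare between runs.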


### Exercise

In [None]:
import numpy as np
import pandas as pd

# Load the dataset
dataset = pd.read_csv('iris.csv')

print(dataset)

     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                  5.1               3.5                1.4               0.2   
1                  4.9               3.0                1.4               0.2   
2                  4.7               3.2                1.3               0.2   
3                  4.6               3.1                1.5               0.2   
4                  5.0               3.6                1.4               0.2   
..                 ...               ...                ...               ...   
145                6.7               3.0                5.2               2.3   
146                6.3               2.5                5.0               1.9   
147                6.5               3.0                5.2               2.0   
148                6.2               3.4                5.4               2.3   
149                5.9               3.0                5.1               1.8   

     target  
0         0  

In [None]:
dataset.describe()

       sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)    target
count              150.0             150.0              150.0             150.0     150.0
mean            5.843333          3.057333              3.758          1.199333       1.0
std             0.828066          0.435866           1.765298          0.762238  0.819232
min                  4.3               2.0                1.0               0.1       0.0
25%                  5.1               2.8                1.6               0.3       0.0
50%                  5.8               3.0               4.35               1.3       1.0
75%                  6.4               3.3                5.1               1.8       2.0
max                  7.9               4.4                6.9               2.5       2.0


In [None]:
# Checking the missing values in each column

dataset.isnull().sum()

sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
target               0
dtype: int64

In [None]:
# Matrix of features
X = dataset.iloc[:, :-1].values
print(X)

# Dependent variable vector
y = dataset.iloc[:, -1].values
print(y)

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.6 1.4 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.4 3.2 4.5 1.5]
 [6.9 3.1 4.

In [None]:
# No missing values and no categorical data in 'iris.csv' dataset.

# Train-Test splitting
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)


In [None]:
print(X_train)

[[4.6 3.6 1.  0.2]
 [5.7 4.4 1.5 0.4]
 [6.7 3.1 4.4 1.4]
 [4.8 3.4 1.6 0.2]
 [4.4 3.2 1.3 0.2]
 [6.3 2.5 5.  1.9]
 [6.4 3.2 4.5 1.5]
 [5.2 3.5 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.2 4.1 1.5 0.1]
 [5.8 2.7 5.1 1.9]
 [6.  3.4 4.5 1.6]
 [6.7 3.1 4.7 1.5]
 [5.4 3.9 1.3 0.4]
 [5.4 3.7 1.5 0.2]
 [5.5 2.4 3.7 1. ]
 [6.3 2.8 5.1 1.5]
 [6.4 3.1 5.5 1.8]
 [6.6 3.  4.4 1.4]
 [7.2 3.6 6.1 2.5]
 [5.7 2.9 4.2 1.3]
 [7.6 3.  6.6 2.1]
 [5.6 3.  4.5 1.5]
 [5.1 3.5 1.4 0.2]
 [7.7 2.8 6.7 2. ]
 [5.8 2.7 4.1 1. ]
 [5.2 3.4 1.4 0.2]
 [5.  3.5 1.3 0.3]
 [5.1 3.8 1.9 0.4]
 [5.  2.  3.5 1. ]
 [6.3 2.7 4.9 1.8]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.1 3.3 1.7 0.5]
 [5.6 2.7 4.2 1.3]
 [5.1 3.4 1.5 0.2]
 [5.7 3.  4.2 1.2]
 [7.7 3.8 6.7 2.2]
 [4.6 3.2 1.4 0.2]
 [6.2 2.9 4.3 1.3]
 [5.7 2.5 5.  2. ]
 [5.5 4.2 1.4 0.2]
 [6.  3.  4.8 1.8]
 [5.8 2.7 5.1 1.9]
 [6.  2.2 4.  1. ]
 [5.4 3.  4.5 1.5]
 [6.2 3.4 5.4 2.3]
 [5.5 2.3 4.  1.3]
 [5.4 3.9 1.7 0.4]
 [5.  2.3 3.3 1. ]
 [6.4 2.7 5.3 1.9]
 [5.  3.3 1.4 0.2]
 [5.  3.2 1.

In [None]:
print(X_test)

[[6.1 2.8 4.7 1.2]
 [5.7 3.8 1.7 0.3]
 [7.7 2.6 6.9 2.3]
 [6.  2.9 4.5 1.5]
 [6.8 2.8 4.8 1.4]
 [5.4 3.4 1.5 0.4]
 [5.6 2.9 3.6 1.3]
 [6.9 3.1 5.1 2.3]
 [6.2 2.2 4.5 1.5]
 [5.8 2.7 3.9 1.2]
 [6.5 3.2 5.1 2. ]
 [4.8 3.  1.4 0.1]
 [5.5 3.5 1.3 0.2]
 [4.9 3.1 1.5 0.1]
 [5.1 3.8 1.5 0.3]
 [6.3 3.3 4.7 1.6]
 [6.5 3.  5.8 2.2]
 [5.6 2.5 3.9 1.1]
 [5.7 2.8 4.5 1.3]
 [6.4 2.8 5.6 2.2]
 [4.7 3.2 1.6 0.2]
 [6.1 3.  4.9 1.8]
 [5.  3.4 1.6 0.4]
 [6.4 2.8 5.6 2.1]
 [7.9 3.8 6.4 2. ]
 [6.7 3.  5.2 2.3]
 [6.7 2.5 5.8 1.8]
 [6.8 3.2 5.9 2.3]
 [4.8 3.  1.4 0.3]
 [4.8 3.1 1.6 0.2]]


In [None]:
print(y_train)

[0 0 1 0 0 2 1 0 0 0 2 1 1 0 0 1 2 2 1 2 1 2 1 0 2 1 0 0 0 1 2 0 0 0 1 0 1
 2 0 1 2 0 2 2 1 1 2 1 0 1 2 0 0 1 1 0 2 0 0 1 1 2 1 2 2 1 0 0 2 2 0 0 0 1
 2 0 2 2 0 1 1 2 1 2 0 2 1 2 1 1 1 0 1 1 0 1 2 2 0 1 2 2 0 2 0 1 2 2 1 2 1
 1 2 2 0 1 2 0 1 2]


In [None]:
print(y_test)

[1 0 2 1 1 0 1 2 1 1 2 0 0 0 0 1 2 1 1 2 0 2 0 2 2 2 2 2 0 0]


In [None]:
# Feature scaling. It is applied only to the matrix of features X, not to the dependent variable vector y.
# Here we standardize the features to have mean = 0 and variance = 1 (hence standard deviation = 1).

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)


In [None]:
# The scaled training set is printed to verify the scaling

print(X_train)

[[-1.47393679  1.20365799 -1.56253475 -1.31260282]
 [-0.13307079  2.99237573 -1.27600637 -1.04563275]
 [ 1.08589829  0.08570939  0.38585821  0.28921757]
 [-1.23014297  0.75647855 -1.2187007  -1.31260282]
 [-1.7177306   0.30929911 -1.39061772 -1.31260282]
 [ 0.59831066 -1.25582892  0.72969227  0.95664273]
 [ 0.72020757  0.30929911  0.44316389  0.4227026 ]
 [-0.74255534  0.98006827 -1.27600637 -1.31260282]
 [-0.98634915  1.20365799 -1.33331205 -1.31260282]
 [-0.74255534  2.32160658 -1.27600637 -1.44608785]
 [-0.01117388 -0.80864948  0.78699794  0.95664273]
 [ 0.23261993  0.75647855  0.44316389  0.55618763]
 [ 1.08589829  0.08570939  0.55777524  0.4227026 ]
 [-0.49876152  1.87442714 -1.39061772 -1.04563275]
 [-0.49876152  1.4272477  -1.27600637 -1.31260282]
 [-0.37686461 -1.47941864 -0.01528151 -0.24472256]
 [ 0.59831066 -0.58505976  0.78699794  0.4227026 ]
 [ 0.72020757  0.08570939  1.01622064  0.8231577 ]
 [ 0.96400139 -0.13788033  0.38585821  0.28921757]
 [ 1.69538284  1.20365799  1.36

In [None]:
# The scaled test set is printed to verify the scaling

print(X_test)

[[ 0.35451684 -0.58505976  0.55777524  0.02224751]
 [-0.13307079  1.65083742 -1.16139502 -1.17911778]
 [ 2.30486738 -1.0322392   1.8185001   1.49058286]
 [ 0.23261993 -0.36147005  0.44316389  0.4227026 ]
 [ 1.2077952  -0.58505976  0.61508092  0.28921757]
 [-0.49876152  0.75647855 -1.27600637 -1.04563275]
 [-0.2549677  -0.36147005 -0.07258719  0.15573254]
 [ 1.32969211  0.08570939  0.78699794  1.49058286]
 [ 0.47641375 -1.92659808  0.44316389  0.4227026 ]
 [-0.01117388 -0.80864948  0.09932984  0.02224751]
 [ 0.84210448  0.30929911  0.78699794  1.09012776]
 [-1.23014297 -0.13788033 -1.33331205 -1.44608785]
 [-0.37686461  0.98006827 -1.39061772 -1.31260282]
 [-1.10824606  0.08570939 -1.27600637 -1.44608785]
 [-0.86445224  1.65083742 -1.27600637 -1.17911778]
 [ 0.59831066  0.53288883  0.55777524  0.55618763]
 [ 0.84210448 -0.13788033  1.18813767  1.35709783]
 [-0.2549677  -1.25582892  0.09932984 -0.11123753]
 [-0.13307079 -0.58505976  0.44316389  0.15573254]
 [ 0.72020757 -0.58505976  1.07

## Feature Scaling

We do feature scaling to ensure that the values of all the features are on the same scale. This prevents some features from dominating the others to such an extent that the dominated features are effectively ignored by the ML model.

Feature scaling is not necessary for all ML models, even if the features take very different values. For example, in the multiple linear regression model, $y = b_0 + b_1 X_1 + b_2 X_2 + ... + b_N X_N$, if some variables ($X_i$) take much larger values than others, the learned coefficients simply compensate by taking small values for the variables that take large values.
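
The compensation effect can be checked numerically. The sketch below fits an ordinary least squares model twice on synthetic data, once with a feature in its raw range and once with that feature divided by 100,000. The data, the scale factor and all variable names are assumptions made for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
small = rng.uniform(0, 1, size=200)        # feature on a small scale
large = rng.uniform(0, 100_000, size=200)  # feature on a large scale
X_raw = np.column_stack([small, large])
y_lin = 3.0 * small + 0.0002 * large + rng.normal(0, 0.1, size=200)

raw_model = LinearRegression().fit(X_raw, y_lin)

# Rescale the large feature by hand and fit again.
X_rescaled = X_raw / np.array([1.0, 100_000.0])
rescaled_model = LinearRegression().fit(X_rescaled, y_lin)

print(raw_model.coef_)       # small coefficient for the large-valued feature
print(rescaled_model.coef_)  # that coefficient grows by the scale factor
print(np.allclose(raw_model.predict(X_raw), rescaled_model.predict(X_rescaled)))
```

Only the coefficient changes; the fitted predictions stay the same, which is the sense in which plain linear regression does not need feature scaling.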

The two main feature scaling methods are (i) standardization and (ii) normalization, both of which put all the features on the same scale.

In standardization, the mean of each column is subtracted from every value of that column, and the result is divided by the standard deviation of that column: $X_{stand} = \frac{X - X_{mean}}{\sigma}$. This puts most of the values roughly in the range [-3, 3].
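
The standardization formula can be applied by hand and compared with `StandardScaler`. This is a small sketch on five age values taken from the dataset above; `ages`, `manual_std` and `scaled_std` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

ages = np.array([[27.0], [30.0], [38.0], [44.0], [50.0]])

# (X - X_mean) / sigma, column by column. np.std defaults to ddof=0,
# the population standard deviation, which is also what StandardScaler uses.
manual_std = (ages - ages.mean(axis=0)) / ages.std(axis=0)
scaled_std = StandardScaler().fit_transform(ages)

print(np.allclose(manual_std, scaled_std))
print(scaled_std.mean(axis=0), scaled_std.std(axis=0))
```

After the transformation the column has mean 0 and standard deviation 1 up to floating-point error; the values themselves are not strictly bounded.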

In normalization, the minimum of each column is subtracted from every value of that column, and the result is divided by the difference between the maximum and minimum values of that column: $X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$. The numerator is non-negative, the denominator is positive (provided the column is not constant), and the denominator is always $\ge$ the numerator. Therefore all the transformed values lie in the range [0, 1].
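
The normalization formula can be checked the same way against scikit-learn's `MinMaxScaler`, which implements min-max scaling with a default output range of [0, 1]. The notebook itself does not use this class; `salaries`, `manual_norm` and `scaled_norm` are illustrative names.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

salaries = np.array([[48000.0], [54000.0], [61000.0], [72000.0], [83000.0]])

# (X - X_min) / (X_max - X_min), column by column.
col_min = salaries.min(axis=0)
col_max = salaries.max(axis=0)
manual_norm = (salaries - col_min) / (col_max - col_min)

scaled_norm = MinMaxScaler().fit_transform(salaries)

print(np.allclose(manual_norm, scaled_norm))
print(scaled_norm.min(axis=0), scaled_norm.max(axis=0))
```

The smallest value maps to 0 and the largest to 1, so a single extreme value compresses all the others into a narrow part of the range.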

Now the popular question in the data science community is: should we go for normalization or standardization? A common rule of thumb is that normalization is recommended when most of your features follow a normal distribution, whereas standardization works well in most situations. Therefore standardization is usually the safer choice.

Now we have 4 things: two matrices of features ($X_{train}$ and $X_{test}$) and two dependent variable vectors ($y_{train}$ and $y_{test}$). Feature scaling is applied after the train-test split and only to $X_{train}$ and $X_{test}$, separately. As $X_{test}$ is supposed to be completely new data for the ML model, we standardize $X_{test}$ using the $X_{mean}$ and $\sigma$ of $X_{train}$.



In [None]:
# The StandardScaler class performs the standardization on both X_train and X_test.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()   # Create an object of the StandardScaler class.
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])   # fit computes the mean and S.D. of each selected column; transform then standardizes the values of those columns. Only the numerical columns (index 3 onward) are scaled, and the result overwrites them in X_train.
X_test[:, 3:] = sc.transform(X_test[:, 3:])   # We apply only the transform method here, not fit. As X_test must be completely new to the ML model, we do not compute the mean and S.D. of X_test, i.e., we prevent information leakage from the test set to the ML model. We reuse the same scaler (i.e., the mean and S.D. of the training set) to transform the test set.

Another important question in the data science community is whether we have to apply feature scaling to the dummy variables (the numerical columns obtained by encoding the categorical data) in the matrix of features. The answer is no. The goal of standardization is to bring the values of all the features onto the same scale, and the dummy variables already take values (0 and 1) inside the range [-3, 3]. In fact, standardization would make the dummy variables worse, as we would lose the information about which country corresponds to which encoded value: we would get hard-to-read numerical values and be unable to say which tuple of 3 values corresponds to which country. We would lose the interpretability, and this would not improve the training performance either. So, apply feature scaling only to the numerical columns of the matrix of features.
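
An alternative to slicing with `[:, 3:]` is to let a `ColumnTransformer` scale only the numerical columns and pass the dummy columns through. This is not the approach used in the cell above, only a sketch of the same idea; the rows are copied from the X_train and X_test printed earlier, and `numeric_scaler` and the `_demo` names are illustrative.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Three dummy columns followed by Age and Salary.
X_tr_demo = np.array([
    [0.0, 0.0, 1.0, 27.0, 48000.0],
    [1.0, 0.0, 0.0, 44.0, 72000.0],
    [0.0, 1.0, 0.0, 50.0, 83000.0],
    [1.0, 0.0, 0.0, 35.0, 58000.0],
])
X_te_demo = np.array([[0.0, 1.0, 0.0, 30.0, 54000.0]])

numeric_scaler = ColumnTransformer(
    transformers=[("scale", StandardScaler(), [3, 4])],
    remainder="passthrough",
)
X_tr_scaled = numeric_scaler.fit_transform(X_tr_demo)  # fit on training rows only
X_te_scaled = numeric_scaler.transform(X_te_demo)      # reuse the training statistics

# Output order: the two scaled columns first, then the untouched dummy columns.
print(X_tr_scaled)
print(X_te_scaled)
```

Note the change in column order: the transformed columns come first and the passthrough columns follow, so the dummy variables end up at indices 2 to 4.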

In [None]:
print(X_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [None]:
print(X_test)

# Now all the features are on the same scale.

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]


### Exercise

In [None]:
import numpy as np
import pandas as pd

# Import the dataset.
wine_quality_data_df = pd.read_csv('winequality-red.csv', sep = ';')
print(wine_quality_data_df)

      fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0               7.4             0.700         0.00             1.9      0.076   
1               7.8             0.880         0.00             2.6      0.098   
2               7.8             0.760         0.04             2.3      0.092   
3              11.2             0.280         0.56             1.9      0.075   
4               7.4             0.700         0.00             1.9      0.076   
...             ...               ...          ...             ...        ...   
1594            6.2             0.600         0.08             2.0      0.090   
1595            5.9             0.550         0.10             2.2      0.062   
1596            6.3             0.510         0.13             2.3      0.076   
1597            5.9             0.645         0.12             2.0      0.075   
1598            6.0             0.310         0.47             3.6      0.067   

      free sulfur dioxide  

In [None]:
# Checking the missing values.

print(wine_quality_data_df.isnull().sum())

# No missing values.

fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64


In [None]:
print(wine_quality_data_df.describe())

       fixed acidity  volatile acidity  citric acid  residual sugar  \
count    1599.000000       1599.000000  1599.000000     1599.000000   
mean        8.319637          0.527821     0.270976        2.538806   
std         1.741096          0.179060     0.194801        1.409928   
min         4.600000          0.120000     0.000000        0.900000   
25%         7.100000          0.390000     0.090000        1.900000   
50%         7.900000          0.520000     0.260000        2.200000   
75%         9.200000          0.640000     0.420000        2.600000   
max        15.900000          1.580000     1.000000       15.500000   

         chlorides  free sulfur dioxide  total sulfur dioxide      density  \
count  1599.000000          1599.000000           1599.000000  1599.000000   
mean      0.087467            15.874922             46.467792     0.996747   
std       0.047065            10.460157             32.895324     0.001887   
min       0.012000             1.000000         

In [None]:
print(wine_quality_data_df['quality'].unique())

# There are only 6 unique values in the 'quality' column. We treat 'quality' as the dependent variable vector and the preceding columns as the matrix of features.

[5 6 7 4 8 3]


In [None]:
# Matrix of features.

X = wine_quality_data_df.iloc[:, :-1].values
print(X)

[[ 7.4    0.7    0.    ...  3.51   0.56   9.4  ]
 [ 7.8    0.88   0.    ...  3.2    0.68   9.8  ]
 [ 7.8    0.76   0.04  ...  3.26   0.65   9.8  ]
 ...
 [ 6.3    0.51   0.13  ...  3.42   0.75  11.   ]
 [ 5.9    0.645  0.12  ...  3.57   0.71  10.2  ]
 [ 6.     0.31   0.47  ...  3.39   0.66  11.   ]]


In [None]:
# Dependent variable vector.

y = wine_quality_data_df.iloc[:, -1].values
print(y)

[5 5 5 ... 6 5 6]


There are no categorical data and no missing values in the dataset.

In [None]:
# Splitting the data into train and test sets.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)


In [None]:
print(X_train)
print(X_train.shape)

[[ 8.7   0.69  0.31 ...  3.48  0.74 11.6 ]
 [ 6.1   0.21  0.4  ...  3.25  0.59 11.9 ]
 [10.9   0.39  0.47 ...  3.3   0.75  9.8 ]
 ...
 [ 7.2   0.62  0.06 ...  3.51  0.54  9.5 ]
 [ 7.9   0.2   0.35 ...  3.32  0.8  11.9 ]
 [ 5.8   0.29  0.26 ...  3.39  0.54 13.5 ]]
(1279, 11)


In [None]:
print(X_test)
print(X_test.shape)

[[ 7.7    0.56   0.08  ...  3.24   0.66   9.6  ]
 [ 7.8    0.5    0.17  ...  3.39   0.48   9.5  ]
 [10.7    0.67   0.22  ...  3.28   0.98   9.9  ]
 ...
 [ 8.3    0.6    0.25  ...  3.15   0.53   9.8  ]
 [ 8.8    0.27   0.39  ...  3.15   0.69  11.2  ]
 [ 9.1    0.765  0.04  ...  3.29   0.54   9.7  ]]
(320, 11)


In [None]:
print(y_train)
print(y_train.shape)

[6 6 6 ... 5 7 6]
(1279,)


In [None]:
print(y_test)
print(y_test.shape)

[6 5 6 5 6 5 5 5 5 6 7 3 5 5 6 7 5 7 8 5 5 6 5 6 6 6 7 6 5 6 5 5 6 5 6 5 7
 5 4 6 5 5 7 5 5 6 7 6 5 6 5 5 5 7 6 6 6 5 5 5 5 7 5 6 6 5 6 5 6 5 6 4 6 6
 6 5 8 5 6 6 5 6 5 6 6 7 5 6 7 4 7 6 5 5 5 6 5 6 5 6 5 5 5 7 6 7 6 5 6 5 8
 5 6 5 6 7 6 6 5 6 6 6 6 6 6 6 7 6 5 5 6 5 5 5 6 5 5 5 5 6 7 6 8 5 5 5 6 6
 6 5 6 7 6 5 6 5 5 6 6 6 7 5 7 5 5 5 6 6 5 5 6 5 7 6 7 6 6 5 5 6 4 6 5 7 5
 5 4 5 7 6 5 6 6 7 6 5 5 6 5 7 5 6 6 5 7 5 5 5 6 7 7 5 5 6 6 7 6 5 6 6 6 6
 6 7 4 5 5 7 5 5 5 5 6 6 5 7 5 6 6 6 5 4 6 7 6 7 5 6 6 5 5 6 5 6 4 5 6 6 5
 6 6 5 5 6 7 7 6 5 6 6 5 6 5 6 5 5 5 6 6 6 7 5 5 6 5 7 5 6 4 6 6 8 6 5 5 6
 5 7 6 6 5 5 7 6 6 5 6 6 5 7 6 6 6 6 5 6 5 5 6 4]
(320,)


In [None]:
# Feature scaling.

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()   # Object of the StandardScaler class.

X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)


In [None]:
print(X_train)
print(X_train.shape)

[[ 0.21833164  0.88971201  0.19209222 ...  1.09349989  0.45822284
   1.12317723]
 [-1.29016623 -1.78878251  0.65275338 ... -0.40043872 -0.40119696
   1.40827174]
 [ 1.49475291 -0.78434707  1.01104539 ... -0.07566946  0.51551749
  -0.58738978]
 ...
 [-0.65195559  0.49909822 -1.08752211 ...  1.28836145 -0.68767023
  -0.87248428]
 [-0.24582155 -1.84458448  0.39683051 ...  0.05423824  0.80199076
   1.40827174]
 [-1.46422367 -1.34236676 -0.06383064 ...  0.50891521 -0.68767023
   2.92877575]]
(1279, 11)


In [None]:
print(X_test)
print(X_test.shape)

[[-3.61859850e-01  1.64286407e-01 -9.85152962e-01 ... -4.65392578e-01
  -1.34389336e-04 -7.77452782e-01]
 [-3.03840702e-01 -1.70525408e-01 -5.24491803e-01 ...  5.08915214e-01
  -1.03143815e+00 -8.72484283e-01]
 [ 1.37871461e+00  7.78108067e-01 -2.68568937e-01 ... -2.05577167e-01
   1.83329452e+00 -4.92358280e-01]
 ...
 [-1.37449586e-02  3.87494284e-01 -1.15015218e-01 ... -1.04997725e+00
  -7.44964886e-01 -5.87389780e-01]
 [ 2.76350785e-01 -1.45397070e+00  6.01568807e-01 ... -1.04997725e+00
   1.71749571e-01  7.43051230e-01]
 [ 4.50408230e-01  1.30822677e+00 -1.18989125e+00 ... -1.40623314e-01
  -6.87670232e-01 -6.82421281e-01]]
(320, 11)


## Path of data preprocessing

So the path we will follow is:

1. Import the necessary libraries such as NumPy, pandas, Matplotlib, etc. Other libraries (scikit-learn, etc.) are imported later, when required.

2. Import the dataset using read_csv or a similar pandas function. Print the resulting dataset_df and get an overview of it with **dataset_df.describe()**.

3. Construct the matrix of features X and the dependent variable vector y using **dataset_df.iloc[row_i:row_f, col_i:col_f].values** (an attribute, so no parentheses), or by indexing dataset_df for specific columns.

4. Check whether there are any missing values using **dataset_df.isnull().sum()**.

5. If there are missing values, replace them with the mean, median or mode using the **SimpleImputer** class of the **scikit-learn** library.

6. If there is any categorical data in the matrix of features X, encode it using the **OneHotEncoder** class.

7. Also, if the output (y) has two levels, encode them as 0 and 1 using the **LabelEncoder** class.

8. Split the data into training and test sets using the **train_test_split** function of the **sklearn.model_selection** module.

9. Apply feature scaling to bring the values of all the features onto the same scale. It is not always necessary. Standardization works well in most cases, while normalization is mainly recommended when the values follow a normal distribution. Scale $X_{train}$ with the $\mu$ and $\sigma$ of the training set. $X_{test}$ is scaled using the same $\mu$ and $\sigma$ of the training set, not those of the test set.
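
The steps above can be strung together in one short script. The sketch below rebuilds the `Data.csv` rows shown at the start of this notebook as an in-memory DataFrame (a stand-in for `pd.read_csv`) and then follows steps 3 to 9. The explicit `astype(float)` casts are an addition made here so that the numerical slices of the object array are passed to scikit-learn as floats.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

# Step 2: in-memory copy of the Data.csv rows shown earlier.
dataset_df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan, 58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"],
})

# Step 3: matrix of features and dependent variable vector.
X = dataset_df.iloc[:, :-1].values
y = dataset_df.iloc[:, -1].values

# Steps 4-5: report the missing values, then fill them with the column means.
print(dataset_df.isnull().sum())
X[:, 1:3] = SimpleImputer(strategy="mean").fit_transform(X[:, 1:3].astype(float))

# Step 6: one-hot encode the Country column (index 0).
ct = ColumnTransformer([("encoder", OneHotEncoder(), [0])], remainder="passthrough")
X = np.array(ct.fit_transform(X))

# Step 7: encode the two-level outcome as 0/1.
y = LabelEncoder().fit_transform(y)

# Step 8: split before scaling.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Step 9: standardize only the numerical columns, using training statistics.
sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:].astype(float))
X_test[:, 3:] = sc.transform(X_test[:, 3:].astype(float))

print(X_train)
print(X_test)
```

The order of the steps is the important part: imputation and encoding come before the split in this notebook, while the scaler is fitted only after it.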