## Data Preprocessing

### Importing the Libraries

In [26]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Importing the Dataset

In [27]:
dataset = pd.read_csv('Data.csv')
# Features: every column except the last
X = dataset.iloc[:, :-1].values
# Target: the last column
y = dataset.iloc[:, -1].values

In [28]:
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

### Missing Values

In [29]:
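from sklearn.impute import SimpleImputer
# Replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Columns 1 and 2 are the numeric columns; column 0 holds the country names
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])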
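
As a quick sanity check, the following sketch runs the same imputer on a standalone copy of the two numeric columns. The values are copied from the `X` output above; the names `numeric` and `filled` are introduced here for illustration only. It assumes that `SimpleImputer` stores the learned column means in its `statistics_` attribute, which is how scikit-learn exposes them.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Numeric columns (second and third) copied from the X array printed above
numeric = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = imputer.fit_transform(numeric)

# Column means computed while ignoring the NaN entries
print(imputer.statistics_)
# The two cells that were NaN now hold those means
print(filled[6, 0], filled[4, 1])
```

The two printed cells should match the non-integer values that appear in `X` after the encoding step.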

### Encoding Categorical Variables

In [18]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
# remainder='passthrough' keeps the columns that are not one-hot encoded instead of dropping them
X = np.array(ct.fit_transform(X))

In [19]:
X

array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)
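
To see what `remainder='passthrough'` changes, here is a minimal sketch on a hypothetical three-row array (`toy`, `keep`, `drop`, `kept` and `dropped` are names made up for this illustration). It compares the default `remainder='drop'` with `'passthrough'`. `sparse_threshold=0` is added only so that both results come back as dense arrays.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical column, one numeric column
toy = np.array([['France', 44.0], ['Spain', 27.0], ['Germany', 30.0]], dtype=object)

keep = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])],
                         remainder='passthrough', sparse_threshold=0)
drop = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])],
                         remainder='drop', sparse_threshold=0)

kept = keep.fit_transform(toy)      # dummy columns followed by the untouched numeric column
dropped = drop.fit_transform(toy)   # dummy columns only

print(kept.shape, dropped.shape)
print(kept)
```

With `'drop'` the numeric column disappears from the result, which is why the cell above asks for `'passthrough'`.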

In [13]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

In [14]:
y

array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])

### Splitting the Dataset into Train and Test Sets

In [20]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

### Feature Scaling
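Feature scaling puts the numeric features on a comparable scale, so that a feature with a large range (such as salary) does not dominate a feature with a small range (such as age), for example in models that rely on the Euclidean distance between samples.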

We call fit_transform() on the training data so that the scaling parameters are learned from the training data and applied to it at the same time. On the test data we call only transform(), because the test data must be scaled with the parameters learned from the training data.
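
Written out, and assuming the default behaviour of `StandardScaler` (subtract the mean, divide by the population standard deviation), both sets are scaled with statistics computed from the $n$ training samples only. The symbols $\mu_{\text{train}}$, $\sigma_{\text{train}}$ and $z$ are introduced here for illustration:

$$
\mu_{\text{train}} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\sigma_{\text{train}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - \mu_{\text{train}}\right)^2}
$$

$$
z = \frac{x - \mu_{\text{train}}}{\sigma_{\text{train}}}
$$

Here $x$ can be a training value or a test value; in both cases the same $\mu_{\text{train}}$ and $\sigma_{\text{train}}$ are used.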

Let me give a hands-on example of why this is important!
Let’s imagine we have a simple training set consisting of 3 samples with 1 feature column (let’s call the feature column “length in cm”):

sample1: 10 cm -> class 2
sample2: 20 cm -> class 2
sample3: 30 cm -> class 1

Given the data above, we compute the following parameters:

mean: 20
standard deviation: 8.2

If we use these parameters to standardize the same dataset, we get the following values:

sample1: -1.22 -> class 2
sample2: 0 -> class 2
sample3: 1.22 -> class 1

Now, let’s say our model has learned the following hypothesis: it classifies samples with a standardized length value < 0.6 as class 2 (class 1 otherwise). So far so good. Now, let’s imagine we have 3 new unlabeled data points that we want to classify.

sample4: 5 cm -> class ?
sample5: 6 cm -> class ?
sample6: 7 cm -> class ?

If we look at the unstandardized “length in cm” values in our training dataset, it is intuitive to say that all of these samples likely belong to class 2. However, if we standardize them by re-computing the mean and standard deviation from the new data, we get the same standardized values as in the training set, and our classifier would assign the “class 2” label only to samples 4 and 5 and (probably incorrectly) label sample 6 as class 1.

sample4: -1.22 -> class 2
sample5: 0 -> class 2
sample6: 1.22 -> class 1

However, if we use the parameters from our training set standardization, we get the following standardized values:

sample4: -1.84
sample5: -1.71
sample6: -1.59

Note that these values are more negative than the value of sample1 in the original training set, which makes much more sense now! All three fall below the 0.6 threshold, so all three samples are classified as class 2.
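
The walkthrough above can be reproduced with a short sketch. The arrays hold the six lengths from the example, and the 0.6 threshold stands in for the learned model; `train`, `new`, `refit`, `reused` and `classify` are names made up for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[10.0], [20.0], [30.0]])  # samples 1-3, length in cm
new = np.array([[5.0], [6.0], [7.0]])       # samples 4-6, length in cm

sc = StandardScaler()
train_std = sc.fit_transform(train)         # learns the mean and std of the training set

# Mistake: fit a fresh scaler on the new data
refit = StandardScaler().fit_transform(new)
# Intended use: reuse the parameters learned on the training set
reused = sc.transform(new)

def classify(z, threshold=0.6):
    """Stand-in for the model: class 2 below the threshold, class 1 otherwise."""
    return np.where(z < threshold, 2, 1).ravel()

print("refit: ", refit.ravel().round(2), classify(refit))
print("reused:", reused.ravel().round(2), classify(reused))
```

Refitting on the new data hides the fact that all three samples are shorter than anything in the training set, whereas reusing the training parameters keeps that information.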

In [30]:
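from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Columns 0-2 are the one-hot dummy columns and stay unscaled; scale only the numeric columns from index 3 on
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
# Reuse the mean and standard deviation learned on the training set
X_test[:, 3:] = sc.transform(X_test[:, 3:])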