In [9]:
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [10]:
# Importing the dataset
# use pandas to import data as data frame
dataset = pd.read_csv('../Data Preprocessing Template/Data.csv')

In [11]:
# from the data frame, split data into x (independent) and y (dependent) arrays
# the preprocessing steps below operate on plain numpy arrays, so we take .values
x = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Handling missing values
In machine learning analyses, missing values are usually handled by imputing a common value such as the mean, median or mode of the column. The alternative is to remove every record that has a missing value, but this can significantly reduce the sample size, especially when many variables are used in the analysis.
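
To illustrate the trade-off, here is a minimal sketch on a small made-up array (the values are hypothetical and not taken from Data.csv). It compares dropping incomplete records with filling each gap with the column mean, using plain numpy rather than scikit-learn.

```python
import numpy as np

# hypothetical toy data: 5 records, 2 numeric variables, np.nan marks a missing value
data = np.array([
    [44.0, 72000.0],
    [27.0, np.nan],
    [np.nan, 54000.0],
    [38.0, 61000.0],
    [35.0, 58000.0],
])

# option 1: remove every record that has at least one missing value
complete_rows = data[~np.isnan(data).any(axis=1)]

# option 2: replace each missing value with the mean of its column
col_means = np.nanmean(data, axis=0)
imputed = np.where(np.isnan(data), col_means, data)

print(complete_rows.shape[0], "records left after dropping")
print(imputed.shape[0], "records left after mean imputation")
```

With only two missing values, dropping records already removes two of the five rows, while imputation keeps all of them.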

In [12]:
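# Replace missing values with mean value of the column
# note: Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
# SimpleImputer replaces it and always works column by column
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# fit on columns 1 and 2, then write the imputed values back
imputer.fit(x[:, 1:3])
x[:, 1:3] = imputer.transform(x[:, 1:3])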

# Encoding categorical data
Categorical data expressed as text values needs to be prepared for the analysis. The preparation is twofold:

- Encoding strings to numbers: the first step is to convert string values into numbers, as machine learning algorithms work only with numbers.
- Removing numerical meaning from categorical values: when we have a 'pure' categorical (nominal) scale, where the position/number has no meaning, we need to recode the categorical values into boolean type (true/false). In this example, countries are first recoded to numbers, and because the country values are not on an ordinal or ratio scale, we convert them into an array with a 0/1 (true/false) column for each country.
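
The two steps can be sketched by hand with numpy. This is only an illustration on a hypothetical country column, not what the scikit-learn encoders do internally.

```python
import numpy as np

# hypothetical country column
countries = np.array(['France', 'Spain', 'Germany', 'Spain', 'France'])

# step 1: map each distinct string to an integer code (categories sorted alphabetically)
categories, codes = np.unique(countries, return_inverse=True)

# step 2: turn each code into a row of 0/1 flags, one column per category
one_hot = np.eye(len(categories))[codes]

print(categories)
print(codes)
print(one_hot)
```

After step 1 the codes suggest an order between countries that does not exist. After step 2 every record has exactly one flag set, so no country is 'larger' than another.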


In [13]:
# import encoders from scikit
from sklearn.preprocessing import LabelEncoder, OneHotEncoder


## Encode country values into boolean array

In [14]:
# init label encoder
country_encoder = LabelEncoder()
# convert labels of first column (country) to numerical values
x[:,0] = country_encoder.fit_transform(x[:,0])
# init boolean encoder and convert the country codes into an array of 0/1 columns
# note: the categorical_features param was removed in scikit-learn 0.22;
# ColumnTransformer now selects the col to transform, here col 0 (country)
from sklearn.compose import ColumnTransformer
bool_encoder = ColumnTransformer([('country', OneHotEncoder(), [0])],
                                 remainder='passthrough', sparse_threshold=0)
x = bool_encoder.fit_transform(x).astype(float)

## Encode dependent (y) purchase y/n into numerical values

In [15]:
# no = 0, yes = 1
purchase_encoder = LabelEncoder()
y = purchase_encoder.fit_transform(y)

# Splitting data set into training and test sets
The total data set is usually split into two parts: a training set and a test set. In this example we use the train_test_split function from scikit-learn's model_selection module. We pass the arrays of independent and dependent variables, plus test_size as a value between 0 and 1 (the fraction of records assigned to the test set). A common split is 80/20, where 80% is used for training. The split is performed at random, so for replication one should provide a random_state value.

`NOTE! cross_validation module is replaced with model_selection since v0.18`
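
As a rough sketch of what a random 80/20 split involves, the example below shuffles row indices with a fixed seed and slices them. The data and variable names are hypothetical, and this is not scikit-learn's implementation, so the rows selected will not necessarily match train_test_split for the same seed.

```python
import numpy as np

# hypothetical data: 10 records, 2 features, binary target
features = np.arange(20).reshape(10, 2)
target = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

test_size = 0.2
rng = np.random.RandomState(0)              # fixed seed so the split can be replicated
shuffled = rng.permutation(len(features))   # random order of row indices

# the first n_test shuffled indices form the test set, the rest the training set
n_test = int(np.ceil(test_size * len(features)))
test_idx, train_idx = shuffled[:n_test], shuffled[n_test:]

features_train, features_test = features[train_idx], features[test_idx]
target_train, target_test = target[train_idx], target[test_idx]

print(len(features_train), "training records,", len(features_test), "test records")
```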

In [16]:
# import train_test_split from scikit-learn's model_selection module
# (this module replaced cross_validation in v0.18)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=0)

# Feature scaling / 'normalizing' variable values

Many machine learning algorithms are based on Euclidean distance, i.e. the distance between values has meaning. When variables use different scales, the variables with the larger ranges dominate the distance calculation. Therefore variables are 'normalized' during data preparation. Generally there are two ways of 'removing' the influence of different variable scales:

- standardization: values are standardized by subtracting the mean value from the actual value and dividing the result by the standard deviation, $x' = \frac{x - \mathrm{mean}(x)}{\mathrm{std}(x)}$
- normalization: values are brought between 0 and 1 by taking the difference between the actual value and the minimal value and dividing it by the value range, $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$
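
Both rescalings are easy to check by hand. A minimal numpy sketch on hypothetical values of a single variable:

```python
import numpy as np

# hypothetical values of one variable
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# standardization: subtract the mean, then divide by the standard deviation
standardized = (values - values.mean()) / values.std()

# min-max normalization: subtract the minimum, then divide by the value range
normalized = (values - values.min()) / (values.max() - values.min())

print(standardized)
print(normalized)
```

The standardized values have mean 0 and standard deviation 1, while the min-max scaled values run from 0 to 1.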

For scaling in this example we use the scikit-learn class StandardScaler, which performs standardization. The same 'scaler' is used on train and test data, where fit & transform is only used on the first set (x_train in this example); the test set is only transformed. The dependent variable is not 'normalized' in this case because it is a simple classification problem with 2 categories. In a regression situation, when the dependent variable has a wide range of values, scaling may also need to be applied to the dependent variable.
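
The reason for fitting on one set only can be sketched with plain numpy. The arrays below are hypothetical: the mean and standard deviation are learned from the training data alone and then reused unchanged on the held-out data.

```python
import numpy as np

# hypothetical feature matrices
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
held_out = np.array([[2.0, 400.0]])

# "fit": learn the scaling parameters from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)

# "transform": apply the same parameters to both sets
train_scaled = (train - mean) / std
held_out_scaled = (held_out - mean) / std

print(train_scaled)
print(held_out_scaled)
```

Because the held-out record is scaled with the training parameters, its second value lands outside the range seen in training, which is the information a model needs. Refitting on the held-out set would hide that difference.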

In [18]:
from sklearn.preprocessing import StandardScaler
scale_x = StandardScaler()
# for training set we need to fit and transform
x_train = scale_x.fit_transform(x_train)
# for test set we only transform because 
# the scale values were already fitted in first run
# notice that we use same scale_x object 
x_test = scale_x.transform(x_test)

In [20]:
# create dataframe from x_train
# just to be able to print it here as a table
# for visual inspection
test = pd.DataFrame(x_train)
test

|   | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | -1.0 | 2.645751 | -0.774597 | 0.263068 | 0.123815 |
| 1 | 1.0 | -0.377964 | -0.774597 | -0.253501 | 0.461756 |
| 2 | -1.0 | -0.377964 | 1.290994 | -1.975398 | -1.530933 |
| 3 | -1.0 | -0.377964 | 1.290994 | 0.052614 | -1.11142 |
| 4 | 1.0 | -0.377964 | -0.774597 | 1.640585 | 1.720297 |
| 5 | -1.0 | -0.377964 | 1.290994 | -0.081312 | -0.167514 |
| 6 | 1.0 | -0.377964 | -0.774597 | 0.951826 | 0.986148 |
| 7 | 1.0 | -0.377964 | -0.774597 | -0.597881 | -0.482149 |
