<a href="https://colab.research.google.com/github/ed-roberts-github/Previous-work/blob/main/data_preprocessing_tools.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preprocessing Tools

## Importing the libraries

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Importing the dataset

In [None]:
dataset = pd.read_csv('Data.csv')
x = dataset.iloc[:, :-1].values  # all rows, every column except the last one (the features)
y = dataset.iloc[:, -1].values  # all rows, last column only (the dependent variable)

In [None]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [None]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
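
The `iloc` slicing used above can be tried on a tiny in-memory table. This is only a sketch: the column names (`Country`, `Age`, `Salary`, `Purchased`) are assumed for illustration, since the header row of `Data.csv` is not shown here, and only the first three rows are reproduced.

```python
import pandas as pd

# Hypothetical column names; the three rows are copied from the printed data above.
df = pd.DataFrame({
    'Country': ['France', 'Spain', 'Germany'],
    'Age': [44.0, 27.0, 30.0],
    'Salary': [72000.0, 48000.0, 54000.0],
    'Purchased': ['No', 'Yes', 'No'],
})

features = df.iloc[:, :-1].values  # every column except the last
target = df.iloc[:, -1].values     # the last column only

print(features.shape)
print(list(target))
```

The first slice in `iloc[rows, columns]` picks the rows and the second picks the columns, so `:` on its own means "all rows".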


## Taking care of missing data

In [None]:
from sklearn.impute import SimpleImputer  # SimpleImputer is a class in the impute module
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')  # replace each NaN with the mean of its column
imputer.fit(x[:, 1:3])  # fit computes the mean of each numeric column (indices 1 and 2)
x[:, 1:3] = imputer.transform(x[:, 1:3])  # transform fills the missing values with those means

In [None]:
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
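
The two filled-in values can be checked by hand. The sketch below uses plain NumPy rather than `SimpleImputer`, with the age and salary columns typed in from the output printed before imputation; it illustrates the mean strategy and is not a copy of how scikit-learn implements it.

```python
import numpy as np

# Age and salary columns as printed before imputation (nan marks the missing entries).
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
salary = np.array([72000, 48000, 54000, 61000, np.nan,
                   58000, 52000, 79000, 83000, 67000], dtype=float)

# nanmean ignores the missing entries when averaging.
age_mean = np.nanmean(age)
salary_mean = np.nanmean(salary)

# Replace each nan with the mean of its own column.
age_filled = np.where(np.isnan(age), age_mean, age)
salary_filled = np.where(np.isnan(salary), salary_mean, salary)

print(age_filled[6], salary_filled[4])
```

The missing age becomes 349 / 9 ≈ 38.78 and the missing salary becomes 574000 / 9 ≈ 63777.78, matching the printed array.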


## Encoding categorical data

### Encoding the Independent Variable

In [None]:
# The country strings have to be converted to numbers. Encoding them as 0, 1, 2 would
# suggest an ordering between the countries that the algorithm could misinterpret later,
# so each country becomes a vector instead: (1,0,0), (0,1,0), (0,0,1).
# This is called one-hot encoding.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Create the transformer: apply OneHotEncoder to column 0 (the country).
# remainder='passthrough' keeps the other columns of x; otherwise we'd end up with only the country columns.
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')

# fit_transform replaces the country column with its one-hot columns;
# np.array() makes sure x is a NumPy array.
x = np.array(ct.fit_transform(x))

In [None]:
print(x)

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
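
To see which of the three new columns belongs to which country, the encoding can be rebuilt by hand. This is a NumPy-only sketch, not scikit-learn's implementation; it relies on the categories being sorted alphabetically, which is also the column order in the output above.

```python
import numpy as np

# Country column as printed in the original data.
countries = np.array(['France', 'Spain', 'Germany', 'Spain', 'Germany',
                      'France', 'Spain', 'France', 'Germany', 'France'])

# np.unique returns the sorted categories and, for each row, the index of its category.
categories, codes = np.unique(countries, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i.
one_hot = np.eye(len(categories))[codes]

print(categories)
print(one_hot[:3])
```

The first column stands for France, the second for Germany and the third for Spain, so the first three rows (France, Spain, Germany) become (1,0,0), (0,0,1) and (0,1,0).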


### Encoding the Dependent Variable

In [None]:
# One-hot encoding isn't needed here because there are only 2 options
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()  # nothing in () because y is a single vector, so it's obvious what needs encoding
y = le.fit_transform(y)  # y is the dependent variable and fit_transform already returns a NumPy array, so no np.array() call

In [None]:
print(y)

[0 1 0 0 1 1 0 1 0 1]
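
The mapping behind these numbers is 'No' → 0 and 'Yes' → 1, because the classes are sorted before being numbered. The sketch below reproduces that with plain NumPy as an illustration; it does not use `LabelEncoder` itself.

```python
import numpy as np

labels = np.array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'])

# Sorted unique classes, plus the integer code of each label.
classes, codes = np.unique(labels, return_inverse=True)

# Indexing the classes with the codes maps the integers back to the original strings.
decoded = classes[codes]

print(classes)
print(codes)
```

Decoding is handy later on, when a model's 0/1 predictions need to be turned back into 'No'/'Yes'.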


## Splitting the dataset into the Training set and Test set

We split the dataset before doing feature scaling because the test set is meant to stand in for completely new data. We aren't supposed to work with the test set while training! If the scaler were fitted on the whole dataset, the mean and standard deviation it learned would include the test rows, and we would end up with 'data leakage' from the test set.

Feature scaling puts all features on the same scale so that no single feature dominates the training algorithm.


In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
# test_size is the fraction held out for testing; the usual split is 80% training, 20% test
# random_state just fixes which random split I get so that it matches
# the course's split (so it isn't usually needed)

In [None]:
print(x_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [None]:
print(x_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [None]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [None]:
print(y_test)

[0 1]
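
Conceptually, the split shuffles the row indices and cuts them into two groups. The sketch below shows that idea with NumPy only. It is an assumption-laden illustration, not scikit-learn's algorithm, so the indices it picks will not match the split printed above.

```python
import numpy as np

n_rows = 10          # rows in the dataset
test_fraction = 0.2  # same fraction as test_size above

rng = np.random.default_rng(1)      # seeded so the shuffle is repeatable
shuffled = rng.permutation(n_rows)  # row indices 0..9 in random order

n_test = int(round(n_rows * test_fraction))
test_idx = shuffled[:n_test]    # first 2 shuffled indices form the test set
train_idx = shuffled[n_test:]   # remaining 8 form the training set

print(train_idx, test_idx)
```

Using the same index arrays for both `x` and `y` keeps every row of features paired with its own label.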


## Feature Scaling

Feature scaling puts all features on the same scale, because some ML models can be dominated by features with large values (this isn't needed for all ML models).
There are two main methods of feature scaling: standardisation (values mostly between about -3 and +3) and normalisation (values between 0 and 1).

Normalisation is recommended when your features follow a normal distribution. Standardisation works pretty much all the time, so it's the recommended default and the method used below: x_stand = (x - mean(x)) / standard_deviation(x)

You don't have to apply standardisation to the dummy variables (the columns that were strings before we turned them into vectors). Firstly, their values are 0 or 1, so they are already inside the -3 to +3 range. Secondly, if we did apply feature scaling, we could no longer tell which dummy variable corresponds to which original value (i.e. the country in this case).
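
Both methods can be compared on the training-set ages. The standardisation line follows the formula above; the normalisation line uses the usual min-max form, (x - min(x)) / (max(x) - min(x)), which is assumed here because it is not written out above. This is a hand-rolled NumPy sketch, not the scikit-learn scalers.

```python
import numpy as np

# Ages in the training set (349 / 9 is the imputed mean, about 38.78).
age = np.array([349 / 9, 40, 44, 38, 27, 48, 50, 35], dtype=float)

# Standardisation: subtract the mean, divide by the standard deviation.
age_stand = (age - age.mean()) / age.std()

# Normalisation (min-max): rescale so the smallest value is 0 and the largest is 1.
age_norm = (age - age.min()) / (age.max() - age.min())

print(age_stand)
print(age_norm)
```

After standardisation the column has mean 0 and standard deviation 1; after normalisation it runs from exactly 0 to exactly 1.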

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train[:, 3:] = sc.fit_transform(x_train[:, 3:])
# '3:' because the one-hot encoding created the first 3 columns, so we scale only the columns after them.
# fit computes the mean and standard deviation of each column; transform applies the formula and gets x_stand,
# so here we use fit_transform to do both.

x_test[:, 3:] = sc.transform(x_test[:, 3:])
# Only transform is applied here because the test features need the same scaler as x_train, i.e. the
# mean and standard deviation learned from the training set (so the scaler is not fitted again).

In [None]:
print(x_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [None]:
print(x_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
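
The scaled test ages can be reproduced by hand, which confirms that the training-set mean and standard deviation were the ones applied to the test set. A NumPy-only sketch, with the ages typed in from the unscaled `x_train` and `x_test` printed earlier:

```python
import numpy as np

# Age column of the unscaled training and test sets (349 / 9 is the imputed mean).
train_age = np.array([349 / 9, 40, 44, 38, 27, 48, 50, 35], dtype=float)
test_age = np.array([30, 37], dtype=float)

# Statistics come from the training set only.
mean = train_age.mean()
std = train_age.std()  # population standard deviation (ddof=0)

# Apply the training statistics to the test set.
test_scaled = (test_age - mean) / std

print(test_scaled)
```

The results, roughly -1.466 and -0.450, agree with the fourth column of the scaled `x_test` above.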
