# Machine Learning Notes

## Data Preparation

In [111]:
import pandas as pd
import numpy as np

In [132]:
data = pd.read_csv("cars.csv")

In [133]:
data.head()

Unnamed: 0,Make,Colour,Top Speed,MPG,Price
0,Aston Martin,Grey,250,25.0,180000
1,Ford,Blue,120,50.0,20000
2,Mercedes,Black,160,30.0,60000
3,Honda,Silver,115,,30000
4,Mini,Grey,140,40.0,35000


In this example assume we want to predict the price of a car based upon the other variables in our data.

Therefore we want to split our data into a matrix of variables (features) X and a vector y or our dependant variables (price in this example):

In [134]:
X = data.iloc[:, :-1].values
y = data.iloc[:, 4].values

## Missing Data

In the example dataset above we can see missing data in the MPG column. We want to imput this value based upon the other data in that column.

In [135]:
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = 'NaN', strategy = 'mean', axis = 0)
imputer = imputer.fit(X[:, 2:4])
X[:, 2:4] = imputer.transform(X[:, 2:4])

In [136]:
X

array([['Aston Martin', 'Grey', 250.0, 25.0],
       ['Ford', 'Blue', 120.0, 50.0],
       ['Mercedes', 'Black', 160.0, 30.0],
       ['Honda', 'Silver', 115.0, 36.25],
       ['Mini', 'Grey', 140.0, 40.0]], dtype=object)

As we can see above the Honda has been given the mean of the rest of the available data in the MPG column. The type of imputation you want to perform will depend on your dataset. In some cases you may want to run a linear regression to get a more accurate estimate for your missing value.

## Encode Data

When we have categorical data we want to encode these into numeric values so that our machine learning algorithms can deal with them.

In the above dataset colour would be an example that we'd need to convert.

In [137]:
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
labelencoder_X = LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])
#X[:, 1] = labelencoder_X.fit_transform(X[:, 1])

In [138]:
X

array([[0, 'Grey', 250.0, 25.0],
       [1, 'Blue', 120.0, 50.0],
       [3, 'Black', 160.0, 30.0],
       [2, 'Silver', 115.0, 36.25],
       [4, 'Grey', 140.0, 40.0]], dtype=object)

In [139]:
onehotencoder = OneHotEncoder()
X = onehotencoder.fit_transform(X).toarray()

ValueError: could not convert string to float: 'Grey'

In [140]:
X

array([[0, 'Grey', 250.0, 25.0],
       [1, 'Blue', 120.0, 50.0],
       [3, 'Black', 160.0, 30.0],
       [2, 'Silver', 115.0, 36.25],
       [4, 'Grey', 140.0, 40.0]], dtype=object)