# Data Preprocessing

<hr>

## Importing Libraries

A few main libraries used for machine learning are:

* __NumPy__ (Numerical Python): Used to perform mathematical and scientific operations on arrays, as well as data manipulation.
* __Matplotlib__: A plotting library that generates various plots to help visualize a dataset.
* __Pandas__: A data science library used to load, manipulate, and analyze tabular data.

__To import the libraries:__

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

<hr>

## Importing the Dataset

The pandas library provides __read_csv__, a function that reads a CSV file and converts it into a pandas DataFrame object. The values in the object can then be split into _X_ (the features) and _y_ (the target) with `iloc`. The indexing syntax is: `[rows, columns]`.

In [2]:
dataset = pd.read_csv('Data.csv')

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 3].values

X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
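
To make the `[rows, columns]` indexing concrete, here is a minimal sketch that uses a small in-memory DataFrame instead of `Data.csv`. The column names (including `Purchased`) and the three rows are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical three-row stand-in for Data.csv; the column names are assumed.
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

# All rows, every column except the last: the feature matrix.
features = df.iloc[:, :-1].values
# All rows, only the last column: the target vector.
target = df.iloc[:, -1].values

print(features.shape)  # (3, 3)
print(target)
```

Using `-1` for the target column instead of a fixed index such as `3` keeps the selection valid if feature columns are added later.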

<hr>

## Missing Values

The __scikit-learn__ library is a separate package built on top of NumPy. Its _SimpleImputer_ class helps prevent errors caused by missing data by replacing missing entries with appropriate values. In the example below, the imputer is initialized to fill each missing entry with the mean of its column. The mean is calculated by the _fit_ method and is then applied to the _X_ dataset by the _transform_ method. Because every filled-in value equals the column mean, the mean of the column remains the same.

In [3]:
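from sklearn.impute import SimpleImputer

# Replace each NaN with the mean of its column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Learn the column means of the numeric columns (indices 1 and 2)
imputer = imputer.fit(X[:, 1:3])
# Fill in the missing entries of those columns
X[:, 1:3] = imputer.transform(X[:, 1:3])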

X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
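
The claim that mean imputation leaves the column mean unchanged can be checked with a short sketch. The four rows below are hypothetical values, not taken from `Data.csv`; `statistics_` is the attribute in which a fitted `SimpleImputer` stores the per-column fill values.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical age/salary values with one missing entry per column.
values = np.array([
    [44.0, 72000.0],
    [27.0, np.nan],
    [np.nan, 54000.0],
    [38.0, 61000.0],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = imputer.fit_transform(values)

# Means learned by fit, computed from the non-missing entries only.
print(imputer.statistics_)
# Means of the filled columns; these should match the values above.
print(filled.mean(axis=0))
```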

<hr>

## Training and Testing

A machine learning model needs to be tested to see whether its training has worked. To make this possible, part of the dataset is held out for testing, a common amount being __20%__. To achieve this:

In [7]:
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

X_test

array([[1.0, 0.0, 30.0, 54000.0],
       [1.0, 0.0, 50.0, 83000.0]], dtype=object)

The __random_state__ parameter seeds the pseudo-random selection of the test set, so the data will be split the exact same way whenever the same __random_state__ value is provided.
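
The reproducibility described above can be demonstrated with a small sketch. The arrays `X_demo` and `y_demo` are hypothetical placeholders, not the tutorial's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical samples with two features each, plus matching labels.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two independent calls that share the same random_state.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
print(X_train_a.shape, X_test_a.shape)  # (8, 2) (2, 2)

# True when both calls produced identical train and test arrays.
same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same)
```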

<hr>

## Feature Scaling

Also known as *standardization* or *normalization*, __Feature Scaling__ brings the values of each feature into a comparable range. This is very important when dealing with algorithms that are dominated by features with large magnitudes, such as those based on __Euclidean distance__. Scaling the features into a much smaller, shared range improves the effectiveness of these magnitude-dependent algorithms. Additionally, __Feature Scaling__ helps an algorithm converge towards a solution much faster. This is especially true for algorithms trained with gradient descent.

The scaling must be performed after splitting the dataset into a *training* and *test* set, because the scaler calculates the mean and variance of the data it is fitted on. If it were fitted on the entire dataset before the split, the *training* set values would be affected by the *test* set values and vice versa.
<br><br>
There are two main forms of feature scaling:

### Standardization

$$\large x_{stand} = \frac{x - mean(x)}{standard\;deviation(x)}$$

### Normalization

$$\large x_{norm} = \frac{x - min(x)}{max(x) - min(x)}$$

Normalization rescales each feature into the range from 0 to 1 and is sensitive to outliers, so it is best suited to features with known bounds. Standardization makes no assumption about the range of the data and works well in most cases.
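
Both formulas above can be applied by hand with NumPy. The sketch below uses a hypothetical `ages` array; note that `ndarray.std()` computes the population standard deviation by default, which is also what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical feature column.
ages = np.array([27.0, 30.0, 38.0, 44.0, 50.0])

# Standardization: subtract the mean, divide by the standard deviation.
standardized = (ages - ages.mean()) / ages.std()

# Normalization: subtract the minimum, divide by the range.
normalized = (ages - ages.min()) / (ages.max() - ages.min())

print(standardized)  # mean close to 0, standard deviation close to 1
print(normalized)    # smallest value 0.0, largest value 1.0
```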

In [8]:
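from sklearn.preprocessing import StandardScaler
scX = StandardScaler()
# Fit the scaler on the training set only, then scale the training set
X_train = scX.fit_transform(X_train)
# Reuse the training-set mean and variance to scale the test set
X_test = scX.transform(X_test)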

X_train

array([[ 2.64575131, -0.77459667,  0.26306757,  0.12381479],
       [-0.37796447, -0.77459667, -0.25350148,  0.46175632],
       [-0.37796447,  1.29099445, -1.97539832, -1.53093341],
       [-0.37796447,  1.29099445,  0.05261351, -1.11141978],
       [-0.37796447, -0.77459667,  1.64058505,  1.7202972 ],
       [-0.37796447,  1.29099445, -0.0813118 , -0.16751412],
       [-0.37796447, -0.77459667,  0.95182631,  0.98614835],
       [-0.37796447, -0.77459667, -0.59788085, -0.48214934]])
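
As a minimal sketch of why the scaler is fitted on the training set only, the example below uses hypothetical `train` and `test` arrays and confirms that the test set is scaled with the training set's mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical age/salary rows standing in for an already split dataset.
train = np.array([[27.0, 48000.0], [38.0, 61000.0],
                  [44.0, 72000.0], [35.0, 58000.0]])
test = np.array([[30.0, 54000.0], [50.0, 83000.0]])

scaler = StandardScaler()
# fit_transform learns the mean and variance from the training data.
train_scaled = scaler.fit_transform(train)
# transform reuses those statistics; nothing is learned from the test data.
test_scaled = scaler.transform(test)

# The same result computed by hand from the training statistics.
expected = (test - train.mean(axis=0)) / train.std(axis=0)
print(np.allclose(test_scaled, expected))
```

The scaled test columns will generally not have a mean of exactly 0, which is expected: they are measured against the training distribution.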