# Data Preprocessing

# Importing Libraries

**NumPy** for mathematical functions.<br>
**Pandas** for importing and managing the dataset.

In [1]:
import numpy as np
import pandas as pd

# Importing dataset 

**read_csv** reads a local CSV file into a DataFrame.<br>
We then split the DataFrame into a matrix of independent variables (X) and a vector of the dependent variable (Y).

In [2]:
dataset = pd.read_csv('dataset/Data.csv')
X = dataset.iloc[ : , :-1].values
Y = dataset.iloc[ : , 3].values
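
To make the slicing concrete, here is a minimal sketch on a tiny in-memory DataFrame. The column names and values are hypothetical stand-ins, since the contents of Data.csv are not shown here; the sketch only assumes the target is the last column.

```python
import pandas as pd

# Hypothetical stand-in for Data.csv: three feature columns and one target column.
toy = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

X_toy = toy.iloc[:, :-1].values  # every column except the last -> 2-D feature matrix
Y_toy = toy.iloc[:, -1].values   # last column only -> 1-D target vector

print(X_toy.shape)  # (3, 3)
print(Y_toy.shape)  # (3,)
```

Using `-1` for the target column instead of a hard-coded index keeps the slice valid if feature columns are added later.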

# Handling the missing data
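The `Imputer` class used below was deprecated in scikit-learn 0.20 and removed in 0.22, so this code no longer runs on current versions.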
```python
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values = "NaN", strategy = "mean", axis = 0)
```
**Use *SimpleImputer* from sklearn.impute** instead. With `strategy='mean'`, each missing value is replaced by the mean of its column.

In [3]:
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean = imp_mean.fit(X[ : , 1:3])
X[ : , 1:3] = imp_mean.transform(X[ : , 1:3])
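
The effect of mean imputation is easier to see on a small array. The sketch below uses made-up numbers (think of them as Age and Salary) and hypothetical variable names; it is an illustration, not the dataset from this notebook.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns with one missing entry in each.
values = np.array([
    [40.0, 60000.0],
    [np.nan, 50000.0],
    [30.0, np.nan],
    [50.0, 70000.0],
])

toy_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
filled = toy_imputer.fit_transform(values)

# statistics_ holds the per-column means learned from the non-missing entries.
print(toy_imputer.statistics_)
print(filled)
```

Each `nan` is replaced by the mean of the observed values in the same column, so the other entries are left unchanged.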

In [4]:
X[:,0]

array(['France', 'Spain', 'Germany', 'Spain', 'Germany', 'France',
       'Spain', 'France', 'Germany', 'France'], dtype=object)

# Encoding categorical data 

Categorical data cannot be used directly in calculations, so it must be encoded as numbers.<br>
We use **LabelEncoder** from *sklearn.preprocessing*, which maps each category to an integer.

In [11]:
from sklearn.preprocessing import LabelEncoder
labelencoder_X = LabelEncoder()
X[ : , 0] = labelencoder_X.fit_transform(X[ : , 0])
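
As a small illustration of what LabelEncoder produces, the sketch below encodes a short, made-up list of country names (the variable names are hypothetical).

```python
from sklearn.preprocessing import LabelEncoder

countries = ["France", "Spain", "Germany", "Spain"]

toy_encoder = LabelEncoder()
codes = toy_encoder.fit_transform(countries)

# Classes are sorted, and each label is replaced by its index in that order.
print(list(toy_encoder.classes_))  # ['France', 'Germany', 'Spain']
print(codes.tolist())              # [0, 2, 1, 2]

# The mapping can be reversed to recover the original labels.
print(list(toy_encoder.inverse_transform(codes)))
```

The integers carry an ordering (0 < 1 < 2) that the categories do not actually have, which is why dummy variables are created in the next step.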

### Creating a dummy variable

The `categorical_features` parameter of OneHotEncoder has been removed from scikit-learn, so the following code no longer works.
```python
onehotencoder = OneHotEncoder(categorical_features = [0])
X = onehotencoder.fit_transform(X).toarray()
labelencoder_Y = LabelEncoder()
Y =  labelencoder_Y.fit_transform(Y)
```

Use *ColumnTransformer* instead, as below.

In [12]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

transformer = ColumnTransformer(
    transformers=[
        ("OneHot",        # Just a name
         OneHotEncoder(), # The transformer instance
         [0]              # The column(s) to be applied on.
         )
    ],
    remainder="passthrough"  # Keep the other columns; the default ("drop") discards them
)
X = transformer.fit_transform(X)
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)

If you previously used a LabelEncoder to convert the categories to integers before applying OneHotEncoder, that step is no longer necessary: OneHotEncoder now accepts string categories directly.
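
A minimal sketch of one-hot encoding string categories without a LabelEncoder step is shown below. The data and variable names are made up for illustration, and `remainder="passthrough"` and `sparse_threshold=0` are choices made for this sketch so that the numeric columns are kept and the result is a dense array.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows: a string category followed by two numeric columns.
raw = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
], dtype=object)

toy_transformer = ColumnTransformer(
    transformers=[("OneHot", OneHotEncoder(), [0])],
    remainder="passthrough",  # keep the numeric columns instead of dropping them
    sparse_threshold=0,       # always return a dense array
)

# Convert the combined output to a float array for downstream steps.
encoded = np.asarray(toy_transformer.fit_transform(raw), dtype=float)
print(encoded.shape)  # (3, 5): three dummy columns plus the two numeric ones
print(encoded)
```

The dummy columns come first, in sorted category order (France, Germany, Spain), followed by the passed-through columns.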


# Splitting the dataset into training and testing dataset

We use **train_test_split()** from the *sklearn.model_selection* module. A common choice is an 80/20 split between training and testing data.<br>
Note that the old *cross_validation* module has been replaced by *model_selection*.

In [9]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 0)
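
To see what an 80/20 split returns, here is a minimal sketch on ten made-up samples (the array contents and names are hypothetical).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical samples: row i of features is [2*i, 2*i + 1], and its label is i.
features = np.arange(20).reshape(10, 2)
labels = np.arange(10)

f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=0
)

print(f_train.shape, f_test.shape)  # (8, 2) (2, 2)

# Rows and labels are shuffled together, so each feature row still matches its label.
print((f_train[:, 0] == 2 * l_train).all())  # True
```

Fixing `random_state` makes the shuffle reproducible between runs.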

# Feature Scaling 

In [10]:
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
sc_X = sc_X.fit(X_train)
X_train = sc_X.transform(X_train)
X_test = sc_X.transform(X_test)
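
StandardScaler subtracts each column's mean and divides by its standard deviation. The sketch below, on made-up numbers with hypothetical names, shows the pattern used above: the scaler is fitted on the training data only, and the same learned mean and standard deviation are then applied to the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test rows (think Age and Salary).
train = np.array([[20.0, 40000.0], [30.0, 50000.0], [40.0, 60000.0]])
test = np.array([[30.0, 70000.0]])

toy_scaler = StandardScaler().fit(train)  # learn mean and std from training data only
train_scaled = toy_scaler.transform(train)
test_scaled = toy_scaler.transform(test)  # reuse the training statistics

print(train_scaled.mean(axis=0))  # approximately 0 for every column
print(train_scaled.std(axis=0))   # approximately 1 for every column
print(test_scaled)
```

Fitting on the training set alone keeps information about the test set from leaking into the preprocessing, so the scaled test values need not have zero mean or unit variance.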