# Data Preprocessing
Machine learning algorithms depend entirely on data, because data is what makes model training possible. However, if we cannot make sense of that data before feeding it to ML algorithms, the resulting model will be of little use. In simple words, we always need to feed the right data, i.e. data in the correct scale and format and containing meaningful features, for the problem we want the machine to solve. This makes data preparation one of the most important steps in the ML process. Data preparation may be defined as the procedure that __makes our dataset more appropriate for the ML process__.

Data preprocessing converts the selected data into a form that we can work with and feed to ML algorithms. We need to preprocess our data so that it matches what the machine learning algorithm expects.

## Handling Missing Data

If our dataset contains missing data, it may cause serious problems for our machine learning model. Hence it is necessary to handle the missing values present in the dataset.

There are mainly two ways to handle missing data:

- Eliminating samples or features: This is the most common way to deal with null values. We simply delete the rows or columns that contain null values. However, removing data may discard useful information, which can make the model's output less accurate.

- Imputing missing values with interpolation, such as mean imputation: Here we calculate the mean of the column (or row) that contains a missing value and put it in place of the missing value. This strategy is useful for features with numeric data such as age, salary, year, etc.
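
The two strategies above can be sketched as follows. This is a minimal illustration, assuming a small made-up table (`df_missing` and its `age` and `salary` columns are hypothetical and not part of the diabetes data used later); it uses pandas `dropna` for elimination and scikit-learn's `SimpleImputer` for mean imputation.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with missing entries (NaN)
df_missing = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 40.0],
    "salary": [50000.0, 60000.0, np.nan, 80000.0],
})

# Strategy 1: eliminate the rows (or columns) that contain missing values
rows_dropped = df_missing.dropna(axis=0)  # keeps only complete rows
cols_dropped = df_missing.dropna(axis=1)  # keeps only complete columns

# Strategy 2: replace each missing value with the mean of its column
imputer = SimpleImputer(strategy="mean")
imputed = imputer.fit_transform(df_missing)

print("Rows left after dropping:", rows_dropped.shape[0])
print("Columns left after dropping:", cols_dropped.shape[1])
print("Imputed data:\n", imputed)
```

In this toy table every column has a gap, so dropping columns removes everything, which shows how quickly elimination can discard information.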

## Scaling
Our dataset will most likely comprise attributes with varying scales, which many ML algorithms handle poorly, hence the data requires rescaling. Data rescaling makes sure that the attributes are on the same scale. Generally, attributes are rescaled into the range of 0 to 1. Models trained with gradient descent and distance-based algorithms such as k-Nearest Neighbors benefit from scaled data.

### MinMaxScaling or Normalization

We can rescale the data with the help of the `MinMaxScaler` class of the `scikit-learn` Python library.

In [1]:
from pandas import read_csv
import numpy as np
from sklearn import preprocessing

path = 'diabetes.csv'
df = read_csv(path)
array= df.values

Now, we can use the `MinMaxScaler` class to rescale the data into the range of 0 to 1.

In [2]:
data_scaler = preprocessing.MinMaxScaler(feature_range=(0, 1))
data_rescaled = data_scaler.fit_transform(array)

In [5]:
np.set_printoptions(precision=2)
print("Scaled data:\n", data_rescaled[0:10])

Scaled data:
 [[0.35 0.74 0.59 0.35 0.   0.5  0.23 0.48 1.  ]
 [0.06 0.43 0.54 0.29 0.   0.4  0.12 0.17 0.  ]
 [0.47 0.92 0.52 0.   0.   0.35 0.25 0.18 1.  ]
 [0.06 0.45 0.54 0.23 0.11 0.42 0.04 0.   0.  ]
 [0.   0.69 0.33 0.35 0.2  0.64 0.94 0.2  1.  ]
 [0.29 0.58 0.61 0.   0.   0.38 0.05 0.15 0.  ]
 [0.18 0.39 0.41 0.32 0.1  0.46 0.07 0.08 1.  ]
 [0.59 0.58 0.   0.   0.   0.53 0.02 0.13 0.  ]
 [0.12 0.99 0.57 0.45 0.64 0.45 0.03 0.53 1.  ]
 [0.47 0.63 0.79 0.   0.   0.   0.07 0.55 1.  ]]


In [4]:
print("Original data: \n", array[0:10])

Original data: 
 [[6.000e+00 1.480e+02 7.200e+01 3.500e+01 0.000e+00 3.360e+01 6.270e-01
  5.000e+01 1.000e+00]
 [1.000e+00 8.500e+01 6.600e+01 2.900e+01 0.000e+00 2.660e+01 3.510e-01
  3.100e+01 0.000e+00]
 [8.000e+00 1.830e+02 6.400e+01 0.000e+00 0.000e+00 2.330e+01 6.720e-01
  3.200e+01 1.000e+00]
 [1.000e+00 8.900e+01 6.600e+01 2.300e+01 9.400e+01 2.810e+01 1.670e-01
  2.100e+01 0.000e+00]
 [0.000e+00 1.370e+02 4.000e+01 3.500e+01 1.680e+02 4.310e+01 2.288e+00
  3.300e+01 1.000e+00]
 [5.000e+00 1.160e+02 7.400e+01 0.000e+00 0.000e+00 2.560e+01 2.010e-01
  3.000e+01 0.000e+00]
 [3.000e+00 7.800e+01 5.000e+01 3.200e+01 8.800e+01 3.100e+01 2.480e-01
  2.600e+01 1.000e+00]
 [1.000e+01 1.150e+02 0.000e+00 0.000e+00 0.000e+00 3.530e+01 1.340e-01
  2.900e+01 0.000e+00]
 [2.000e+00 1.970e+02 7.000e+01 4.500e+01 5.430e+02 3.050e+01 1.580e-01
  5.300e+01 1.000e+00]
 [8.000e+00 1.250e+02 9.600e+01 0.000e+00 0.000e+00 0.000e+00 2.320e-01
  5.400e+01 1.000e+00]]
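
To see what the rescaling does to each value, the sketch below applies the min-max formula `(x - min) / (max - min)` by hand, column by column, and compares it with `MinMaxScaler`. The array `toy` is a made-up example, not the diabetes data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: two attributes on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [5.0, 500.0]])

# Manual min-max rescaling, computed per column
col_min = toy.min(axis=0)
col_max = toy.max(axis=0)
manual = (toy - col_min) / (col_max - col_min)

# The same rescaling with scikit-learn
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(toy)

print("Manual rescaling:\n", manual)
print("Matches MinMaxScaler:", np.allclose(manual, scaled))
```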


### Standardization

Standardization is another useful data preprocessing technique, basically used to transform attributes that have a Gaussian distribution. It shifts the mean and SD (Standard Deviation) of each attribute to those of a __standard Gaussian distribution__, with a mean of 0 and an SD of 1. This technique is useful for ML algorithms such as linear regression and logistic regression, which are often described as assuming Gaussian-distributed inputs and which tend to produce better results with rescaled data.

We can standardize the data (mean = 0 and SD = 1) with the help of the `StandardScaler` class of the scikit-learn Python library.

In [9]:
from sklearn.preprocessing import StandardScaler

data_scaler = StandardScaler().fit(array)
data_rescaled = data_scaler.transform(array)
print("Standardized data:\n", data_rescaled[0:3])

Standardized data:
 [[ 0.64  0.85  0.15  0.91 -0.69  0.2   0.47  1.43  1.37]
 [-0.84 -1.12 -0.16  0.53 -0.69 -0.68 -0.37 -0.19 -0.73]
 [ 1.23  1.94 -0.26 -1.29 -0.69 -1.1   0.6  -0.11  1.37]]
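
As a rough check of what standardization computes, the sketch below applies `z = (x - mean) / SD` per column on a made-up array (`toy` is hypothetical) and compares the result with `StandardScaler`. It assumes the population SD (NumPy's default, `ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two attributes on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0]])

# Manual standardization per column, using the population SD (ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# The same transformation with scikit-learn
standardized = StandardScaler().fit_transform(toy)

print("Column means after scaling:", standardized.mean(axis=0))
print("Column SDs after scaling:", standardized.std(axis=0))
print("Matches manual formula:", np.allclose(manual, standardized))
```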


## Binarization
As the name suggests, this technique makes our data binary. We choose a threshold: values above the threshold are converted to 1, and values equal to or below it are converted to 0.

For example, if we choose a threshold of 0.5, every dataset value above it becomes 1 and every value at or below it becomes 0. That is why this is called binarizing or thresholding the data. This technique is useful when we have probabilities in our dataset and want to convert them into crisp values.

We can binarize the data with the help of the `Binarizer` class of the scikit-learn Python library.

In [8]:
from sklearn.preprocessing import Binarizer

binarizer = Binarizer(threshold=0.5).fit(array)
data_binarized = binarizer.transform(array)
print("Binarized data:\n", data_binarized[0:3])

Binarized data:
 [[1. 1. 1. 1. 0. 1. 1. 1. 1.]
 [1. 1. 1. 1. 0. 1. 0. 1. 0.]
 [1. 1. 1. 0. 0. 1. 1. 1. 1.]]
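
The probability use case mentioned above can be sketched on a small made-up array (`probs` is hypothetical). The example also shows the boundary behaviour: a value exactly equal to the threshold becomes 0, because `Binarizer` only maps values strictly greater than the threshold to 1.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical predicted probabilities, including values exactly at the threshold
probs = np.array([[0.1, 0.5, 0.51],
                  [0.9, 0.49, 0.5]])

binarized = Binarizer(threshold=0.5).fit_transform(probs)

# Equivalent manual rule: 1 where the value is strictly above the threshold
manual = (probs > 0.5).astype(float)

print("Binarized probabilities:\n", binarized)
print("Matches manual rule:", np.array_equal(binarized, manual))
```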


## Encoding Categorical Data

There are different types of data in datasets:
- Numerical data: such as house price, temperature, etc.
- Categorical data: such as Yes/No, Red/Blue/Green, etc.
    - Nominal features: such as t-shirt color red, green, and blue.
    - Ordinal features: such as t-shirt size small, medium and large.

A machine learning model works entirely on mathematics and numbers, so a categorical variable in our dataset may cause trouble while building the model. It is therefore necessary to encode categorical variables as numbers.

In [47]:
from sklearn.preprocessing import OrdinalEncoder
import numpy as np

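# Toy data: a t-shirt size (categorical) and a number per row.
# Mixing strings and numbers makes NumPy store every entry as a string.
X = np.asarray([['x-small', 5], ['small', 7], ['medium', 9], ['large', 11], ['x-large', 13]])
print(X)
print(X[:,0])
# OrdinalEncoder expects 2-D input, so reshape the size column to (5, 1)
X_col_1 = np.expand_dims(X[:,0], axis=1)
print(X_col_1)
# No categories are given, so the codes follow the sorted (alphabetical) order
encoder = OrdinalEncoder()
encoder.fit(X_col_1)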

[['x-small' '5']
 ['small' '7']
 ['medium' '9']
 ['large' '11']
 ['x-large' '13']]
['x-small' 'small' 'medium' 'large' 'x-large']
[['x-small']
 ['small']
 ['medium']
 ['large']
 ['x-large']]


In [48]:
encoder.categories_

[array(['large', 'medium', 'small', 'x-large', 'x-small'], dtype='<U21')]

In [49]:
encoded_values = encoder.transform(X_col_1)
print(encoded_values)

[[4.]
 [2.]
 [1.]
 [0.]
 [3.]]
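
Note that the codes above follow alphabetical order (`large` = 0, ..., `x-small` = 4), not the natural order of the sizes. If the encoding should reflect the ordinal meaning, one option is to pass the order explicitly through the `categories` parameter of `OrdinalEncoder`. The sketch below assumes the intended order is x-small < small < medium < large < x-large; the names `sizes`, `size_order`, and `ordered_encoder` are introduced only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Size column as a 2-D array of shape (5, 1)
sizes = np.array([['x-small'], ['small'], ['medium'], ['large'], ['x-large']])

# Assumed order of the sizes, from smallest to largest (one list per feature)
size_order = [['x-small', 'small', 'medium', 'large', 'x-large']]

ordered_encoder = OrdinalEncoder(categories=size_order)
ordered_values = ordered_encoder.fit_transform(sizes)

# Each size is now encoded by its position in size_order
print(ordered_values.ravel())
```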


### Label Encoding
Most scikit-learn functions expect labels to be numbers rather than words. Hence, we need to convert word labels into number labels. This process is called `label encoding`. We can perform label encoding with the help of the `LabelEncoder` class of the scikit-learn Python library.

In [11]:
from sklearn.preprocessing import LabelEncoder

input_labels = ['x-small', 'small', 'medium', 'large', 'x-large']
encoder = LabelEncoder()
encoder.fit(input_labels)

In [12]:
test_labels = ['small', 'medium', 'large']
encoded_values = encoder.transform(test_labels)
print("\nLabels =", test_labels)
print("\nEncoded values =", list(encoded_values))
encoded_values = [3,0,4,1]
decoded_list = encoder.inverse_transform(encoded_values)
print("\nEncoded values =", encoded_values)
print("\nDecoded labels =", list(decoded_list))


Labels = ['small', 'medium', 'large']

Encoded values = [2, 1, 0]

Encoded values = [3, 0, 4, 1]

Decoded labels = ['x-large', 'large', 'x-small', 'medium']
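
The mapping behind these results can be inspected through the fitted encoder's `classes_` attribute, which holds the labels in sorted order; each label's code is its position in that array. The sketch below also shows, as an illustration, that a label not seen during `fit` (the made-up `'xx-large'`) cannot be encoded. The names `label_encoder`, `mapping`, and `unseen_label_accepted` are introduced only for this example.

```python
from sklearn.preprocessing import LabelEncoder

labels = ['x-small', 'small', 'medium', 'large', 'x-large']
label_encoder = LabelEncoder().fit(labels)

# classes_ is sorted, so a label's code is its position in alphabetical order
mapping = {label: code for code, label in enumerate(label_encoder.classes_)}
print("Label to code mapping:", mapping)

# A label that was not seen during fit raises a ValueError
try:
    label_encoder.transform(['xx-large'])
    unseen_label_accepted = True
except ValueError:
    unseen_label_accepted = False
print("Unseen label accepted:", unseen_label_accepted)
```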


### One-Hot Encoding

One-hot encoding represents each category as its own binary column: for every sample, the column of its category is 1 and all other columns are 0. We can perform one-hot encoding with the help of the `OneHotEncoder` class of the scikit-learn Python library.

In [11]:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

data = [['x-small'], ['small'], ['medium'], ['large'], ['x-large'], ['x-small'], ['medium']]
df = pd.DataFrame(data, columns=["Size"], index=["shirt0", "shirt1", "shirt2", "shirt3", "shirt4", "shirt5", "shirt6"])
print("Data before encoding:")
print(df)

# define the one-hot encoder with an explicit category order
categories = [['x-small', 'small', 'medium', 'large', 'x-large']]
onehot_encoder = OneHotEncoder(categories=categories, sparse_output=False)

# transform the data
encoded_data = onehot_encoder.fit_transform(df)
df_encoded = pd.DataFrame(encoded_data, columns=onehot_encoder.categories_, index=df.index)
print("\nData after encoding:")
print(df_encoded)

Data before encoding:
           Size
shirt0  x-small
shirt1    small
shirt2   medium
shirt3    large
shirt4  x-large
shirt5  x-small
shirt6   medium

Data after encoding:
       x-small small medium large x-large
shirt0     1.0   0.0    0.0   0.0     0.0
shirt1     0.0   1.0    0.0   0.0     0.0
shirt2     0.0   0.0    1.0   0.0     0.0
shirt3     0.0   0.0    0.0   1.0     0.0
shirt4     0.0   0.0    0.0   0.0     1.0
shirt5     1.0   0.0    0.0   0.0     0.0
shirt6     0.0   0.0    1.0   0.0     0.0


## Splitting the Dataset into Separate Training and Test Sets

In machine learning data preprocessing, we divide our dataset into a training set and a test set. This is one of the crucial steps of data preprocessing: the model is trained on the training set and evaluated on the held-out test set, which shows how well it performs on data it has not seen.

In [None]:
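# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split

# Assumption: features and target come from the diabetes `array` loaded earlier,
# with the target in the last column
X, y = array[:, :-1], array[:, -1]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)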

- In the `train_test_split()` function, we have passed four arguments: the first two are the arrays of data, and `test_size` specifies the size of the test set. `test_size` may be .5, .3, or .2, which gives the fraction of the data that goes into the test set.
- The last parameter, `random_state`, sets a seed for the random generator so that you always get the same split.
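
The effect of `test_size` and `random_state` can be checked on a small made-up dataset. In the sketch below, `X_demo` and `y_demo` are hypothetical arrays with 10 samples, so `test_size=0.2` puts 2 samples in the test set, and repeating the call with the same seed reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, and one target per sample
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)
print("Training shape:", X_train.shape, "Test shape:", X_test.shape)

# Repeating the split with the same seed gives the same test set
_, X_test_again, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)
print("Same split with same seed:", np.array_equal(X_test, X_test_again))
```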