#### Data Preprocessing

Data preprocessing involves the following operations:
1. Handling missing values
2. Normalization
3. Standardization
4. Formatting
5. Binning

In [23]:
import pandas as pd
df = pd.read_csv('../Datasets/cupcake.csv')

In [24]:
df.head()

Unnamed: 0,Mese,Cupcake
0,2004-01,5.0
1,2004-02,
2,2004-03,4.0
3,2004-04,6.0
4,2004-05,5.0


The scikit-learn library provides three mechanisms to deal with missing values:
1. Univariate feature imputation
2. Multivariate feature imputation
3. Nearest neighbors imputation

#### Univariate feature imputation

- It replaces missing values with a constant or with a statistic (such as the mean, median, or most frequent value) computed from the same feature.
- The SimpleImputer class can be used to perform univariate feature imputation.
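
As a minimal sketch of the options described above, the snippet below fills the same gap with each `strategy` that SimpleImputer accepts. The toy column `X_toy` and the `filled` dictionary are hypothetical names used only for illustration; they are not part of the cupcake dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy column with one missing entry (row 1).
X_toy = np.array([[1.0], [np.nan], [2.0], [2.0], [7.0]])

filled = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    # fit_transform learns the statistic and fills the gap in one call.
    filled[strategy] = imputer.fit_transform(X_toy)[1, 0]

# strategy="constant" replaces missing entries with fill_value instead.
constant_imputer = SimpleImputer(strategy="constant", fill_value=0.0)
filled["constant"] = constant_imputer.fit_transform(X_toy)[1, 0]

for name, value in filled.items():
    print(name, value)
```

The mean is pulled up by the outlier 7, while the median and the most frequent value are not, which is one reason to prefer them for skewed features.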

In [25]:
import numpy as np
from numpy import nan
from sklearn.impute import SimpleImputer

preprocessor = SimpleImputer(missing_values=np.nan, strategy='mean')

In [26]:
X = np.array(df['Cupcake']).reshape(-1,1)
preprocessor.fit(X)

SimpleImputer()

In [27]:
SimpleImputer(add_indicator=False, copy=True, fill_value=None, missing_values=nan, strategy='mean')

SimpleImputer()

In [28]:
X_prep = preprocessor.transform(X)

In [29]:
df['Cupcake_univariate'] = X_prep.ravel()
df.head()

Unnamed: 0,Mese,Cupcake,Cupcake_univariate
0,2004-01,5.0,5.0
1,2004-02,,50.079208
2,2004-03,4.0,4.0
3,2004-04,6.0,6.0
4,2004-05,5.0,5.0


#### Multivariate Feature Imputation

In multivariate feature imputation, each feature with missing values is modeled as a function of the other features.

The imputation is iterative, so the maximum number of iterations can be set through the max_iter parameter.

We can use the IterativeImputer class.
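
To make the idea of "a function of the other features" concrete, here is a simplified single-pass sketch in plain NumPy. It is not what IterativeImputer does internally (that class repeats the process over several rounds with its own default regressor); it only shows one regression step on hypothetical toy data named `X_toy`.

```python
import numpy as np

# Hypothetical toy data: column 0 has a gap, column 1 is complete.
X_toy = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [np.nan, 8.0],
    [5.0, 10.0],
])

missing = np.isnan(X_toy[:, 0])

# Fit column 0 as a linear function of column 1, using only complete rows.
slope, intercept = np.polyfit(X_toy[~missing, 1], X_toy[~missing, 0], deg=1)

# Predict the missing entries of column 0 from the other feature.
X_filled = X_toy.copy()
X_filled[missing, 0] = slope * X_toy[missing, 1] + intercept

print(X_filled[3, 0])  # approximately 4.0 for this perfectly linear toy data
```

Unlike the mean strategy, the filled value depends on where the gap sits relative to the other feature.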

In [30]:
# IterativeImputer is experimental: this import must come first to enable it.
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

preprocessor = IterativeImputer(max_iter=10, random_state=0)

In [31]:
X1 = np.array(df['Cupcake']).reshape(-1,1)
X2 = np.array(df.index).reshape(-1,1)
X = np.hstack((X1, X2))

In [32]:
preprocessor.fit(X)

IterativeImputer(random_state=0)

In [33]:
X_prep = preprocessor.transform(X)
df['Cupcake_multicariate'] = X_prep[:, 0]

In [34]:
df.iloc[26]

Mese                      2006-03
Cupcake                       NaN
Cupcake_univariate      50.079208
Cupcake_multicariate    28.631082
Name: 26, dtype: object

In [35]:
df.iloc[1]

Mese                      2004-02
Cupcake                       NaN
Cupcake_univariate      50.079208
Cupcake_multicariate    21.610078
Name: 1, dtype: object

#### Nearest neighbors imputation

- This method fills missing values using the k-nearest neighbors approach.
- Each missing value is computed from the values of the n_neighbors nearest neighbors that have a value for that feature.
- We can use the KNNImputer class of the scikit-learn library.
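
The following sketch shows the behaviour on a small, hypothetical array (`X_toy`) where the result can be checked by hand. The second column plays the role of the row position, as in the cupcake example, so the two nearest neighbors of row 1 are rows 0 and 2.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy data: column 0 holds the values, column 1 the row position.
X_toy = np.array([
    [5.0, 0.0],
    [np.nan, 1.0],
    [4.0, 2.0],
    [6.0, 3.0],
    [5.0, 4.0],
])

# Two neighbors, weighted by the inverse of their distance.
knn = KNNImputer(n_neighbors=2, weights="distance")
X_knn = knn.fit_transform(X_toy)

# Rows 0 and 2 are equally close to row 1, so the result is their average.
print(X_knn[1, 0])
```

Because both neighbors are at the same distance, the distance weighting has no effect here and the imputed value is the plain average of 5 and 4.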

In [36]:
from sklearn.impute import KNNImputer

preprocessor = KNNImputer(n_neighbors=5, weights="distance")
preprocessor.fit(X)
X_prep = preprocessor.transform(X)
df['Cupcake_knn'] = X_prep[:, 0]

In [37]:
df.iloc[26]

Mese                      2006-03
Cupcake                       NaN
Cupcake_univariate      50.079208
Cupcake_multicariate    28.631082
Cupcake_knn                  10.7
Name: 26, dtype: object

In [38]:
df.iloc[1]

Mese                      2004-02
Cupcake                       NaN
Cupcake_univariate      50.079208
Cupcake_multicariate    21.610078
Cupcake_knn              4.918919
Name: 1, dtype: object

Of the three methods, the KNNImputer produces the values closest to those in the original dataset (10 for position 26 and 5 for position 1).

#### What is normalization?

Normalization is the process of rescaling data to a common scale.

Usually, normalizing changes the scale of the data so that it falls between 0 and 1.
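
Rescaling to the range 0 to 1 is commonly done with the min-max formula, x' = (x - min) / (max - min). The snippet below is a minimal sketch of that formula in plain NumPy; the array `values` is hypothetical, and the formula assumes the data is not constant (otherwise the denominator is zero).

```python
import numpy as np

# Hypothetical sample values on an arbitrary scale.
values = np.array([4.0, 15.0, 28.0, 99.0])

# Min-max formula: the smallest value maps to 0 and the largest to 1.
scaled = (values - values.min()) / (values.max() - values.min())

print(scaled)
```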

#### Why is normalization needed?

It makes the importance of each feature easier to read from the model weights.

It also makes the training process less sensitive to the scale of the features.

The process of making features more suitable for training by rescaling is called feature scaling.


#### Normalization in Python

1. normalize()

In [39]:
from sklearn import preprocessing

In [40]:
import numpy as np
arr = np.random.randint(100, size=(15))
print(arr)

[49 48 60 15 28 99 98  4 17 92 62 67 47 72 87]


In [42]:
arr_norm = preprocessing.normalize([arr])
print(arr_norm)

[[0.19874903 0.19469293 0.24336616 0.06084154 0.11357087 0.40155416
  0.39749806 0.01622441 0.06895374 0.37316144 0.25147836 0.27175888
  0.19063682 0.29203939 0.35288093]]
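
Note that the values above are not min-max scaled. With its default settings, `normalize()` divides each sample by its L2 (Euclidean) norm, so the resulting vector has length 1. The sketch below reproduces the output by hand with plain NumPy, reusing the array printed above; `l2_norm` and `arr_unit` are illustrative names.

```python
import numpy as np

# Same values as the array printed in the cell above.
arr = np.array([49, 48, 60, 15, 28, 99, 98, 4, 17, 92, 62, 67, 47, 72, 87])

# L2 norm: square root of the sum of squared entries.
l2_norm = np.sqrt(np.sum(arr.astype(float) ** 2))

# Dividing by the norm yields a vector of unit length.
arr_unit = arr / l2_norm

print(arr_unit[:3])
print(np.linalg.norm(arr_unit))
```

The first entries should agree with the `normalize()` output shown above up to rounding.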


2. Normalizing a dataset

In [43]:
insurance_data = pd.read_csv('../Datasets/insurance2.csv')

In [46]:
bmi = np.array(insurance_data['bmi'])
normalized_bmi = preprocessing.normalize([bmi])
print(normalized_bmi)

[[0.02439715 0.02953017 0.02885684 ... 0.03222348 0.02256081 0.02542026]]


In [45]:
# With the default axis=1, each row (sample) is scaled to unit norm.
norm_data = preprocessing.normalize(insurance_data)

In [48]:
insurance_data = pd.read_csv('../Datasets/insurance2.csv')
norm_data = preprocessing.normalize(insurance_data, axis=0)
norm_df = pd.DataFrame(norm_data, columns=insurance_data.columns)
print('Original Data\n', insurance_data.head(10))
print('Normalized Data\n', norm_df.head(10))

Original Data
    age  sex     bmi  children  smoker  region      charges  insuranceclaim
0   19    0  27.900         0       1       3  16884.92400               1
1   18    1  33.770         1       0       2   1725.55230               1
2   28    1  33.000         3       0       2   4449.46200               0
3   33    1  22.705         0       0       1  21984.47061               0
4   32    1  28.880         0       0       1   3866.85520               1
5   31    0  25.740         0       0       2   3756.62160               0
6   46    0  33.440         1       0       2   8240.58960               1
7   37    0  27.740         3       0       1   7281.50560               0
8   37    1  29.830         2       0       0   6406.41070               0
9   60    0  25.840         0       0       1  28923.13692               0
Normalized Data
         age       sex       bmi  children    smoker    region   charges  \
0  0.012472  0.000000  0.024397  0.000000  0.060412  0.043732  0.025
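
The difference between the two calls above is the `axis` argument: the default `axis=1` scales each row, while `axis=0` scales each column. The following sketch mirrors both cases with plain NumPy on a hypothetical 2x2 array (`X_toy`), assuming the default L2 norm.

```python
import numpy as np

# Hypothetical toy matrix: rows are samples, columns are features.
X_toy = np.array([[3.0, 1.0],
                  [4.0, 1.0]])

# axis=1: each row is divided by its own L2 norm.
row_scaled = X_toy / np.linalg.norm(X_toy, axis=1, keepdims=True)

# axis=0: each column is divided by its own L2 norm.
col_scaled = X_toy / np.linalg.norm(X_toy, axis=0, keepdims=True)

print(row_scaled)
print(col_scaled)
```

For feature scaling, the column-wise variant is usually the relevant one, since it keeps features comparable across samples.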

3. MinMaxScaler()

In [50]:
insurance_data = pd.read_csv('../Datasets/insurance2.csv')
scaler = preprocessing.MinMaxScaler(feature_range=(0, 2))
norm = scaler.fit_transform(insurance_data)
norm_df = pd.DataFrame(norm, columns=insurance_data.columns)
print('Original Data\n', insurance_data.head(10))
print('MinMaxScaler Data\n', norm_df.head(10))

Original Data
    age  sex     bmi  children  smoker  region      charges  insuranceclaim
0   19    0  27.900         0       1       3  16884.92400               1
1   18    1  33.770         1       0       2   1725.55230               1
2   28    1  33.000         3       0       2   4449.46200               0
3   33    1  22.705         0       0       1  21984.47061               0
4   32    1  28.880         0       0       1   3866.85520               1
5   31    0  25.740         0       0       2   3756.62160               0
6   46    0  33.440         1       0       2   8240.58960               1
7   37    0  27.740         3       0       1   7281.50560               0
8   37    1  29.830         2       0       0   6406.41070               0
9   60    0  25.840         0       0       1  28923.13692               0
MinMaxScaler Data
         age  sex       bmi children smoker    region   charges insuranceclaim
0  0.043478  0.0  0.642454      0.0    2.0  2.000000  0.503222 
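
With `feature_range=(0, 2)`, each column is first mapped to 0..1 with the min-max formula and then stretched to the requested range. The helper below is a minimal sketch of that computation in plain NumPy; `min_max_scale` and the `ages` array are hypothetical names and data, and the helper assumes the column is not constant.

```python
import numpy as np

def min_max_scale(column, feature_range=(0.0, 1.0)):
    """Rescale a 1-D array linearly so that it spans feature_range."""
    low, high = feature_range
    # Map to 0..1 first (assumes column.max() != column.min()).
    unit = (column - column.min()) / (column.max() - column.min())
    # Then stretch and shift to the requested range.
    return unit * (high - low) + low

# Hypothetical column of values.
ages = np.array([10.0, 20.0, 30.0, 50.0])
scaled_ages = min_max_scale(ages, feature_range=(0, 2))

print(scaled_ages)
```

The smallest value lands on the lower bound and the largest on the upper bound, which is why binary columns such as `smoker` show only 0.0 and 2.0 in the output above.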