## Data Imputation

##### Real-world data often contains missing values. If a dataset is missing too many values, we may simply not use it. However, if only a few values are missing, we can perform data imputation, which substitutes each missing value with some other value.
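Before choosing between discarding and imputing, it can help to measure how much data is actually missing. The snippet below is a minimal sketch of that check: the array `data` and the cutoff `MAX_MISSING_FRACTION` are hypothetical, and what counts as "too many" missing values depends on the problem.

```python
import numpy as np

# Hypothetical example data; np.nan marks a missing entry.
data = np.array([[1.0, np.nan, 3.0],
                 [4.0, 5.0, np.nan],
                 [7.0, 8.0, 9.0],
                 [np.nan, 11.0, 12.0]])

# Fraction of missing entries in each column, and in the whole array.
missing_per_column = np.isnan(data).mean(axis=0)
missing_overall = np.isnan(data).mean()

# Hypothetical cutoff, chosen only for illustration.
MAX_MISSING_FRACTION = 0.5
columns_ok = missing_per_column <= MAX_MISSING_FRACTION

print(missing_per_column)
print(missing_overall)
print(columns_ok)
```

Here every column is missing a quarter of its entries, so under the assumed cutoff all columns would be kept and imputed rather than dropped.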

There are many different methods for data imputation. In scikit-learn, the SimpleImputer transformer supports four of them.

### The four methods are:
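1. Using the mean value of each column (strategy="mean", the default)
2. Using the median value of each column (strategy="median")
3. Using the most frequent value of each column (strategy="most_frequent")
4. Filling in missing values with a constant (strategy="constant", with the constant given by fill_value)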

The code below shows how to perform data imputation using the mean value of each column:

In [2]:
import numpy as np

In [3]:
from sklearn.impute import SimpleImputer

In [4]:
# With no arguments, SimpleImputer uses strategy="mean"
imp_mean=SimpleImputer()

In [9]:
arr=np.array([[ 1.,  2., np.nan,  2.],
       [ 5., np.nan,  1.,  2.],
       [ 4., np.nan,  3., np.nan],
       [ 5.,  6.,  8.,  1.],
       [np.nan,  7., np.nan,  0.]])

In [11]:
transform=imp_mean.fit_transform(arr)

In [12]:
transform

array([[1.  , 2.  , 4.  , 2.  ],
       [5.  , 5.  , 1.  , 2.  ],
       [4.  , 5.  , 3.  , 1.25],
       [5.  , 6.  , 8.  , 1.  ],
       [3.75, 7.  , 4.  , 0.  ]])
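To see where the filled-in values come from, the sketch below recomputes the column means by hand with `np.nanmean` and compares them with the imputer's fitted `statistics_` attribute. It reuses the same array as above. The variable names `imputed`, `column_means`, and `expected` are chosen only for this illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

arr = np.array([[1., 2., np.nan, 2.],
                [5., np.nan, 1., 2.],
                [4., np.nan, 3., np.nan],
                [5., 6., 8., 1.],
                [np.nan, 7., np.nan, 0.]])

imp_mean = SimpleImputer()  # default strategy is "mean"
imputed = imp_mean.fit_transform(arr)

# Column means computed while ignoring the missing entries.
column_means = np.nanmean(arr, axis=0)

# After fitting, statistics_ holds the fill value used for each column.
print(imp_mean.statistics_)
assert np.allclose(imp_mean.statistics_, column_means)

# Observed entries are unchanged; only the missing ones take the column mean.
expected = np.where(np.isnan(arr), column_means, arr)
assert np.allclose(imputed, expected)
```

For example, the first column has observed values 1, 5, 4, and 5, so its missing entry is filled with 15 / 4 = 3.75.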


#### By default, SimpleImputer fills missing values with the **column means**. By passing the **strategy** keyword argument when initializing a **SimpleImputer object**, we can specify a different imputation method, such as the column medians:

In [13]:
arr=np.array([[ 1.,  2., np.nan,  2.],
       [ 5., np.nan,  1.,  2.],
       [ 4., np.nan,  3., np.nan],
       [ 5.,  6.,  8.,  1.],
       [np.nan,  7., np.nan,  0.]])

In [15]:
imp_median=SimpleImputer(strategy="median")

In [16]:
imp_median.fit_transform(arr)

array([[1. , 2. , 3. , 2. ],
       [5. , 6. , 1. , 2. ],
       [4. , 6. , 3. , 1.5],
       [5. , 6. , 8. , 1. ],
       [4.5, 7. , 3. , 0. ]])

With the constant strategy, every missing value is replaced by the value passed as the fill_value keyword argument:

In [18]:
imp_constant=SimpleImputer(strategy='constant',fill_value=-1)

In [19]:
imp_constant.fit_transform(arr)

array([[ 1.,  2., -1.,  2.],
       [ 5., -1.,  1.,  2.],
       [ 4., -1.,  3., -1.],
       [ 5.,  6.,  8.,  1.],
       [-1.,  7., -1.,  0.]])
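The remaining method from the list, the most frequent value, follows the same pattern. The sketch below applies it to the same array. The name `imp_frequent` is chosen for this illustration and does not come from the original notebook. The second and third columns contain no repeated values, and the scikit-learn documentation states that when several values are tied, the smallest one is used.

```python
import numpy as np
from sklearn.impute import SimpleImputer

arr = np.array([[1., 2., np.nan, 2.],
                [5., np.nan, 1., 2.],
                [4., np.nan, 3., np.nan],
                [5., 6., 8., 1.],
                [np.nan, 7., np.nan, 0.]])

# Fill each missing entry with the most frequent value in its column.
imp_frequent = SimpleImputer(strategy="most_frequent")
imputed = imp_frequent.fit_transform(arr)

# statistics_ holds the value chosen for each column.
print(imp_frequent.statistics_)
print(imputed)
```

In the first column the value 5 appears twice, and in the last column the value 2 appears twice, so those are the fill values for those columns.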