# Machine Learning Tutorial!



### Handling missing data

#### Importing the libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#### Importing the dataset

In [2]:
dataset = pd.read_csv('Data.csv')
dataset.head()

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
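
Before imputing anything, it helps to see where the gaps are. The sketch below rebuilds the ten rows shown in this notebook by hand (an assumption, so that it runs without `Data.csv`; the name `demo_df` is hypothetical) and counts the missing entries per column with `isna()`.

```python
import numpy as np
import pandas as pd

# Rows typed in from the notebook output so the example
# does not depend on Data.csv being present.
demo_df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

# Number of missing entries in each column.
missing_per_column = demo_df.isna().sum()
print(missing_per_column)

# Rows that contain at least one missing value.
rows_with_gaps = demo_df[demo_df.isna().any(axis=1)]
print(rows_with_gaps)
```

Only `Age` and `Salary` have gaps here, so those are the columns that need imputation.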


In [3]:
X = dataset.iloc[:, :-1].values
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, nan],
       ['France', 35.0, 58000.0],
       ['Spain', nan, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)

In [4]:
y = dataset.iloc[:, 3].values
y

array(['No', 'Yes', 'No', 'No', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes'],
      dtype=object)

### Taking care of missing data

In [12]:
from sklearn.impute import SimpleImputer

## Parameters of `SimpleImputer`
> #### missing_values : _number, string, np.nan (default) or None_
- The placeholder for the missing values. All occurrences of *`missing_values`* will be imputed.
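
As a rough illustration of `missing_values`, the sketch below treats `-1` as the placeholder instead of `np.nan`. The toy `ages` array is made up for this example and is not part of the dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical column in which -1 stands for "missing".
ages = np.array([[44.0], [27.0], [-1.0], [38.0]])

# Every occurrence of -1 is treated as missing and replaced by the column mean.
sentinel_imputer = SimpleImputer(missing_values=-1, strategy="mean")
filled_ages = sentinel_imputer.fit_transform(ages)
print(filled_ages)
```

The mean is computed from the remaining values only (44, 27 and 38), so the placeholder itself does not distort it.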

> #### strategy : _string, optional (default="mean")_
> > The imputation strategy.
- If `"mean"`, replace missing values using the mean along each column. Can only be used with numeric data.
- If `"median"`, replace missing values using the median along each column. Can only be used with numeric data.
- If `"most_frequent"`, replace missing values using the most frequent value along each column. Can be used with strings or numeric data.
- If `"constant"`, replace missing values with `fill_value`. Can be used with strings or numeric data.
    New in version 0.20: `strategy="constant"` for fixed-value imputation.
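
To make the difference between the statistical strategies concrete, here is a minimal sketch on a made-up salary column (the numbers are assumptions, chosen so that mean, median and mode differ).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical salary column with one missing entry in the last row.
salaries = np.array([[48000.0], [48000.0], [54000.0],
                     [61000.0], [83000.0], [np.nan]])

fills = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    filled = imputer.fit_transform(salaries)
    # Record the value that replaced the missing entry.
    fills[strategy] = filled[-1, 0]

print(fills)
```

On this column the mean is pulled upward by the single large salary, while the median and the most frequent value are not, which is one reason to look at the distribution before choosing a strategy.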

> #### fill_value : _string or numerical value, optional (default=None)_
- When `strategy == "constant"`, `fill_value` is used to replace all occurrences of `missing_values`.
- If left to the default, `fill_value` will be 0 when imputing numerical data and `"missing_value"` for strings or object data types.
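
The following sketch shows `strategy="constant"` with and without an explicit `fill_value`. The small arrays and the label `"Unknown"` are assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical string column with one missing entry.
countries = np.array([["France"], [np.nan], ["Spain"]], dtype=object)

# Explicit fill_value: the gap becomes "Unknown".
named = SimpleImputer(strategy="constant", fill_value="Unknown")
filled_named = named.fit_transform(countries)

# Default fill_value on object data: the gap becomes "missing_value".
default_str = SimpleImputer(strategy="constant")
filled_default = default_str.fit_transform(countries)

# Default fill_value on numeric data: the gap becomes 0.
numbers = np.array([[1.0], [np.nan]])
filled_numbers = SimpleImputer(strategy="constant").fit_transform(numbers)

print(filled_named.ravel(), filled_default.ravel(), filled_numbers.ravel())
```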

> #### verbose : _integer, optional (default=0)_
- Controls the verbosity of the imputer.

> #### copy : _boolean, optional (default=True)_
> > If `True`, a copy of `X` will be created.
> > If `False`, imputation will be done in-place whenever possible. Note that, in the following cases, a new copy will always be made, even if `copy=False`:
- If `X` is not an array of floating values;
- If `X` is encoded as a CSR matrix;
- If `add_indicator=True`.

> #### add_indicator : _boolean, optional (default=False)_
- If `True`, a `MissingIndicator` transform will stack onto the output of the imputer’s transform. This allows a predictive estimator to account for missingness despite imputation.
- If a feature has no missing values at fit/train time, the feature won’t appear on the missing indicator even if there are missing values at transform/test time.
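
A small sketch of `add_indicator=True`, assuming a scikit-learn version that supports this parameter. The train and test arrays are invented for the example: at fit time only the second column has a gap, so only that column gets an indicator.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical (age, salary) rows; only salary is missing at fit time.
X_train = np.array([[44.0, 72000.0],
                    [27.0, 48000.0],
                    [40.0, np.nan]])
# At transform time the age is missing instead.
X_test = np.array([[np.nan, 61000.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean",
                        add_indicator=True)
train_out = imputer.fit_transform(X_train)
test_out = imputer.transform(X_test)

# Two imputed columns plus one indicator column (for salary only).
print(train_out)
print(test_out)
```

The missing age in `X_test` is still filled with the training mean, but no indicator column appears for it, because that feature had no missing values when the imputer was fitted.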

In [15]:
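# pandas loads empty cells as np.nan, so that is the placeholder to match;
# the string 'NaN' would not match the missing entries in X.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer

SimpleImputer(copy=True, fill_value=None, missing_values=nan,
       strategy='mean', verbose=0)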
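
A natural next step is to fit the imputer on the numeric columns and write the result back. The sketch below is one way to do that, assuming the missing entries are `np.nan`; the array `X_demo` re-enters the rows of `X` by hand so the example runs on its own, and the `_demo` names are hypothetical.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same rows as X in this notebook, typed in so the example is self-contained.
X_demo = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
    ["Spain", 38.0, 61000.0],
    ["Germany", 40.0, np.nan],
    ["France", 35.0, 58000.0],
    ["Spain", np.nan, 52000.0],
    ["France", 48.0, 79000.0],
    ["Germany", 50.0, 83000.0],
    ["France", 37.0, 67000.0],
], dtype=object)

imputer_demo = SimpleImputer(missing_values=np.nan, strategy="mean")

# Fit only on the numeric columns (Age, Salary); "mean" cannot handle strings.
imputer_demo.fit(X_demo[:, 1:3])

# Replace the numeric columns with their imputed versions.
X_demo[:, 1:3] = imputer_demo.transform(X_demo[:, 1:3])

# Column means learned during fit, in the order (Age, Salary).
print(imputer_demo.statistics_)
print(X_demo)
```

The `Country` column is left out of the slice on purpose: the `"mean"` strategy only works with numeric data.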