## Importing Libraries and Preprocessing the Dataset

### Importing Libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

Usage:
1. `pandas`: high-performance, easy-to-use data structures and data analysis tools.
2. `numpy`: used to work with arrays.
3. `SimpleImputer`: a class from the `sklearn.impute` module that provides basic strategies for imputing missing values.

### About Dataset

#### Dataframe

A DataFrame is a two-dimensional data structure that arranges data in rows and columns, similar to an SQL table or an Excel sheet.

In [2]:
import pandas as pd

# Create dataframe from dictionary
data = {
    'Name': ['John', 'Anna', 'Peter'],
    'Age': [23, 45, 29]
}

pd.DataFrame(data)

|   | Name | Age |
|---|------|-----|
| 0 | John | 23 |
| 1 | Anna | 45 |
| 2 | Peter | 29 |


#### Import Dataset

The dataset is loaded into a pandas DataFrame with the `read_csv` function, which reads a CSV file into a DataFrame.

In [None]:
df = pd.read_csv("pima-indians-diabetes.csv")
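As a minimal sketch of what `read_csv` does, the example below reads a small in-memory CSV instead of a file on disk. The column names and values are made up for illustration and are not taken from `pima-indians-diabetes.csv`.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a file on disk (made-up values).
csv_text = "pregnancies,glucose,outcome\n6,148,1\n1,85,0\n"

# read_csv accepts a file path or any file-like object.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (number of rows, number of columns)
print(sample_df.head())  # first rows, useful as a quick sanity check
```

Checking `shape` and `head()` right after loading is a quick way to confirm the file was parsed as expected.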

#### Access Column

`iloc` is a pandas indexer that selects rows and columns by integer position.

To select positions, we can use:
* `:` for "all positions"
* `-1` for the last position along an axis
* `:-1` for "everything before the last position"; in the column slot this means "all columns except the last column"

`values` returns the selected data as a NumPy array.

In [None]:
x = df.iloc[:, :-1].values  # all rows, every column except the last one (feature matrix)
y = df.iloc[:, -1].values  # all rows, last column only (dependent variable vector)
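To make the slicing concrete, here is a small sketch on a toy DataFrame. The frame `demo` and its column names are hypothetical; only the `iloc` and `values` calls mirror the cell above.

```python
import pandas as pd

# Toy data with two feature columns and one target column (made-up values).
demo = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [4, 5, 6],
    "target": [0, 1, 0],
})

features = demo.iloc[:, :-1].values  # all rows, every column except the last
labels = demo.iloc[:, -1].values     # all rows, last column only

print(features.shape)  # (3, 2): a 2-D array
print(labels.shape)    # (3,): a 1-D array
```

Note that selecting a single column with `-1` yields a one-dimensional array, while the `:-1` slice keeps two dimensions.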

#### Feature Matrix and Dependent Variable Vector

##### Feature Matrix


The feature matrix is a matrix that contains all the features (or independent variables) of the dataset. Each row represents one data sample and each column represents one feature.

For example, a house dataset might have these features:
* Size of the house
* Number of rooms
* Age of the house

The feature matrix then contains the values of these features for each house in the dataset.

##### Dependent Variable Vector

The dependent variable vector is a vector containing the values of the target variable (or dependent variable) that we want to predict or classify.

For example, when predicting house prices from the features above, the dependent variable vector contains the price of each house in the dataset.
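The house example can be sketched as follows. The column names and numbers are invented purely for illustration, and the columns are selected by name here rather than by position.

```python
import pandas as pd

# Made-up values; the column names are assumptions for this sketch.
houses = pd.DataFrame({
    "size_m2": [70, 120, 95],
    "rooms": [3, 5, 4],
    "age_years": [10, 2, 25],
    "price": [150_000, 310_000, 200_000],
})

# Feature matrix: one row per house, one column per feature.
feature_matrix = houses[["size_m2", "rooms", "age_years"]].to_numpy()

# Dependent variable vector: one target value (price) per house.
dependent_vector = houses["price"].to_numpy()

print(feature_matrix.shape)    # (3, 3)
print(dependent_vector.shape)  # (3,)
```

Row `i` of the feature matrix and element `i` of the vector describe the same house, so the two must always have the same number of rows.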

#### Identify Missing Data

To identify missing data in a DataFrame, we can use the `isnull()` method, which returns a DataFrame of boolean values (`True` where a value is missing). Calling `sum()` on the result then counts the missing values in each column.

In [None]:
missing_data = df.isnull().sum()
print("Missing data: \n", missing_data)
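The sketch below shows the two steps separately on a toy DataFrame with a few deliberately missing entries. The frame `demo` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data with two missing entries (made-up values).
demo = pd.DataFrame({
    "glucose": [148.0, np.nan, 183.0],
    "bmi": [33.6, 26.6, np.nan],
    "age": [50, 31, 32],
})

mask = demo.isnull()     # boolean DataFrame, True where a value is missing
per_column = mask.sum()  # True counts as 1, so this counts missing values per column
total_missing = int(per_column.sum())

print(per_column)
print("Total missing values:", total_missing)
```

Summing the per-column counts once more gives the total number of missing values in the whole DataFrame.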

#### Fill Missing Values

`SimpleImputer` is a class from the `sklearn.impute` module. It provides basic strategies for imputing (filling in) missing values, using either a constant or a statistic of each column such as the mean, median, or most frequent value. `imputer` is an object created from the `SimpleImputer` class with `SimpleImputer(missing_values=np.nan, strategy="mean")`. When we create it, we pass the keyword arguments `missing_values=np.nan` and `strategy="mean"` to the `__init__` method of `SimpleImputer`. `missing_values=np.nan` tells the imputer to treat `np.nan` (how missing data is represented in a pandas DataFrame) as the value to fill, and `strategy="mean"` tells it to fill each missing value with the mean of its column.

We then use the `imputer` object to fill the missing values in our data using the mean strategy.

In [None]:
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

Once the `SimpleImputer` instance is stored in `imputer`, we fill the missing data with the `fit` and `transform` methods. `fit` computes the imputation values (here, the column means) from the provided data, and `transform` returns the data with the missing values filled in, as a NumPy array.

In [None]:
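imputer.fit(df)  # learn the mean of each column
dataset_imputed = imputer.transform(df)  # NumPy array with missing values filled in
print(dataset_imputed)  # print the imputed result; df itself is not modified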
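To see what `fit` and `transform` each contribute, here is a minimal sketch on a tiny array with made-up values. The array `data` is hypothetical; `statistics_` is the attribute where a fitted `SimpleImputer` stores the per-column fill values.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Tiny array with one missing value in each column (made-up values).
data = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

imputer.fit(data)                 # compute the mean of each column, ignoring NaN
print(imputer.statistics_)        # column means: 2.0 and 15.0

filled = imputer.transform(data)  # new array with NaN replaced by the column means
print(filled)

# fit_transform performs both steps in a single call.
filled_again = SimpleImputer(strategy="mean").fit_transform(data)
```

The input array is left unchanged; the filled values live only in the array returned by `transform`.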