### PART 1: Dataset
I renamed the dataset file to "condition.csv".<br>
The meaning of each column is listed below:<br>
<br>
- 0. Number of times pregnant.
- 1. Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
- 2. Diastolic blood pressure (mm Hg).
- 3. Triceps skinfold thickness (mm).
- 4. 2-Hour serum insulin (mu U/ml).
- 5. Body mass index (weight in kg/(height in m)^2).
- 6. Diabetes pedigree function.
- 7. Age (years).
- 8. Class variable (0 or 1).

In [47]:
import pandas as pd
# The original CSV file has no header row,
# so we set header=None
df = pd.read_csv("condition.csv", header=None)
df.head(3)

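|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |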


### PART 2: Detect Missing Values
There are a couple of ways to detect missing values: <br>
- Browse the dataset by eye if it is not large
- Use df.describe() to get summary statistics for each column


In [3]:
df.describe()

|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


From the summary statistics, we know that the dataset has 768 records, and that columns 0-5 and column 8 have 0 as their minimum value. A zero is plausible for column 0 (number of pregnancies) and column 8 (class label), but given the meaning of each attribute, 0 is not a valid measurement for columns 1-5. So the zeros there stand for missing values, and the columns with missing values are:<br>
- 1: Plasma glucose concentration
- 2: Diastolic blood pressure
- 3: Triceps skinfold thickness
- 4: 2-Hour serum insulin
- 5: Body mass index

We count the number of missing values in the five attributes above:

In [48]:
missing_number = (df[[1,2,3,4,5]] == 0).sum()
missing_number

1      5
2     35
3    227
4    374
5     11
dtype: int64

For columns 1, 2 and 5 the missing values are few (under 5% of the 768 records), but column 3 is missing about 30% of its values and column 4 almost half. <br>
In Python, we replace the missing values (encoded as 0 in these numeric attributes) with NaN:

In [49]:
from numpy import nan
df[[1,2,3,4,5]] = df[[1,2,3,4,5]].replace(0, nan)
df[[1,2,3,4,5]].head(3)

|   | 1 | 2 | 3 | 4 | 5 |
|---|---|---|---|---|---|
| 0 | 148.0 | 72.0 | 35.0 | NaN | 33.6 |
| 1 | 85.0 | 66.0 | 29.0 | NaN | 26.6 |
| 2 | 183.0 | 64.0 | NaN | NaN | 23.3 |


Then we can detect the missing values more directly:

In [50]:
df.isnull().sum()

0      0
1      5
2     35
3    227
4    374
5     11
6      0
7      0
8      0
dtype: int64
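
To put counts like these in proportion, the share of missing values per column can be read off with `isnull().mean()`. The sketch below uses a small made-up frame (`toy` and its values are illustrative assumptions, not rows from condition.csv):

```python
import numpy as np
import pandas as pd

# Made-up values, standing in for the real dataset
toy = pd.DataFrame({
    1: [148.0, 85.0, np.nan, 89.0],
    3: [35.0, np.nan, np.nan, 23.0],
    4: [np.nan, np.nan, np.nan, 94.0],
})

# isnull() gives booleans; their column mean is the fraction of missing entries
missing_share = toy.isnull().mean()
print((missing_share * 100).round(1))  # percentage missing per column
```

On the real `df`, the same call would be `df.isnull().mean()`.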

### PART 3: Remove Rows with missing value
The easiest way to deal with missing values is to remove the corresponding rows. This works well when the dataset is large:

In [None]:
# Shown for illustration only; this cell is not run
df.dropna(inplace=True)

However, in the current case there are only 768 records and at least 374 of them contain missing values, so removing all of those rows is not reasonable. Unlike the tutorial's instructions, my idea is to remove only the rows that have more than 3 missing values and use other methods to fill in the remaining ones. With 9 columns, that is equivalent to keeping the rows with at least 6 non-missing values:

In [51]:
df.dropna(inplace=True, thresh=6)
len(df)

761

Only 7 rows have been removed.
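
The `thresh` argument is easy to misread: it is the minimum number of non-missing values a row needs in order to be kept, not the number of missing values allowed. A minimal sketch on a made-up four-column frame (`toy` is an illustrative assumption) shows this, together with an equivalent explicit filter:

```python
import numpy as np
import pandas as pd

# Made-up frame with 4 columns
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [np.nan, 5.0, np.nan],
    "c": [np.nan, 6.0, 9.0],
    "d": [np.nan, np.nan, 1.0],
})

# Row 0 has 1 non-missing value; rows 1 and 2 have 3 each.
# thresh=3 keeps only rows with at least 3 non-missing values.
kept = toy.dropna(thresh=3)

# The same filter written out explicitly
same = toy[toy.notna().sum(axis=1) >= 3]
print(kept)
```

With 9 columns, `thresh=6` therefore keeps every row that has at most 3 missing values.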

### PART 4: Fill with value
Many options can be used to fill in the missing values: <br>
- Random number
- Constant number
- Mean / Median / Mode
- Value estimated by another predictive model

For example, for column 1 (plasma glucose concentration) we fill the missing values with the mean, <br>
and for column 2 (diastolic blood pressure) we fill them with the median:<br>

In [52]:
# Assign back; fillna(inplace=True) on df[1] may act on a copy in recent pandas
df[1] = df[1].fillna(df[1].mean())

In [56]:
df[2] = df[2].fillna(df[2].median())
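
As an alternative sketch, both columns can be filled in one call by passing `fillna` a dictionary that maps column labels to fill values. The frame below is made up for illustration (`toy` and `fill_values` are assumed names, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Made-up values in two columns labelled like the notebook's columns 1 and 2
toy = pd.DataFrame({
    1: [100.0, np.nan, 140.0],
    2: [60.0, 70.0, np.nan],
})

# One fill value per column: mean for column 1, median for column 2
fill_values = {1: toy[1].mean(), 2: toy[2].median()}

# fillna returns a new frame; the original is left unchanged
filled = toy.fillna(fill_values)
print(filled)
```

Here the gap in column 1 becomes 120.0 (the mean of 100 and 140) and the gap in column 2 becomes 65.0 (the median of 60 and 70).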

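### PART 5: Fill Missing Values Using a Predictive Model
There are three types of missing data:
- MCAR (Missing Completely at Random)<br>
The chance that a value is missing does not depend on any data, observed or missing. Example: some survey forms were lost by accident.
- MAR (Missing at Random)<br>
The chance that a value is missing depends only on other, observed variables. Example: males are less likely to respond to a depression survey, regardless of how depressed they are.
- MNAR (Missing Not at Random)<br>
The chance that a value is missing depends on the missing value itself. Example: people with more severe depression are less likely to respond to a depression survey.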
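
The difference between the three types can be sketched with simulated data, assuming the usual definitions: under MCAR the chance of being missing is constant, under MAR it depends only on an observed variable, and under MNAR it depends on the value that goes missing. All names and probabilities below (`group_a`, `score`, 20%, 40%, and so on) are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
group_a = rng.random(n) < 0.5      # an observed variable (hypothetical group flag)
score = rng.normal(50, 10, n)      # the variable whose values may go missing

# MCAR: every value has the same 20% chance of being missing
mcar_mask = rng.random(n) < 0.2

# MAR: the chance depends only on the observed group, not on the score
mar_mask = rng.random(n) < np.where(group_a, 0.4, 0.1)

# MNAR: the chance depends on the score itself (high scores go missing more often)
mnar_mask = rng.random(n) < np.where(score > 60, 0.5, 0.1)

print("true mean:          ", score.mean())
print("observed mean, MCAR:", score[~mcar_mask].mean())
print("observed mean, MNAR:", score[~mnar_mask].mean())
```

In this sketch the observed mean under MCAR stays close to the true mean, while under MNAR it is pulled downward because high scores are removed more often.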

KNN is often used as a model to predict missing values when simple values (mean, median, etc.) are not suitable. In this case, we use KNN to fill the rest of the missing values, that is, columns 3, 4 and 5.

In [59]:
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors = 3)
df_filled = imputer.fit_transform(df[[3,4,5]])

In [62]:
df[[3,4,5]] = df_filled

Check that all missing values have been filled:

In [63]:
df.isnull().sum() # all good

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
dtype: int64
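
To make the KNN step less of a black box, here is a minimal sketch on a made-up 4x2 array (the values in `X` are illustrative assumptions). With its default settings, `KNNImputer` measures distance using only the features both rows have, then fills a gap with the plain average of that feature over the nearest neighbors that have it:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Made-up data: the second feature is missing in the third row
X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, np.nan],
    [8.0, 80.0],
])

# Use the 2 nearest rows; distance for the third row is based on feature 0 only
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

# The nearest rows to 3.0 are 2.0 and 1.0, so the gap becomes (20 + 10) / 2
print(X_filled[2, 1])
```

Because the method is distance-based, features with large ranges can dominate the neighbor search, so scaling the columns before imputing may be worth considering.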