<a href="https://www.kaggle.com/alperenkaran/water-potability-caution-with-missing-values?scriptVersionId=88886158" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Section 1: Main purpose of this notebook

*I have seen many notebooks that impute missing values using information from the class labels of the classification task. This should be avoided, because it results in data leakage between the train and test sets.*


**Section 2:** I simply look at the dataset, and count the missing values.

**Section 3 Part 1:** I create a binary classification task on an artificial dataset with two random features. The accuracy should be around 50%, since random features carry no information about the class labels. However, when some values of a feature are set to `NaN` and then imputed using the labels, the accuracy increases. This should not happen, and it demonstrates why that imputation method is wrong.

**Section 3 Part 2:** Here, I do the same for the Water Potability task, and get around 71% accuracy.

**Section 4:** I show how we can avoid data leakage between train and test sets. I do the model selection without using class labels of the test set.

# Section 2: Read the Water Potability data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
data = pd.read_csv('../input/water-potability/water_potability.csv')
data.sample(3)

In [None]:
data.isnull().sum()

The data has some missing values, and we need to impute them before training the classifiers.
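Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch on a hypothetical toy frame (the column names are borrowed from this notebook, the values are made up), showing how `isnull().mean()` gives the fraction of missing entries per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: column names mirror the notebook, values are invented.
toy = pd.DataFrame({
    'ph': [7.0, np.nan, 6.5, np.nan],
    'Sulfate': [330.0, 310.0, np.nan, 350.0],
    'Potability': [0, 1, 0, 1],
})

# isnull() yields booleans, so the column mean is the fraction of missing entries.
missing_share = toy.isnull().mean()
print(missing_share)
```

The same call on `data` would report the missing share for each real column.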

# Section 3: What you shouldn't do when imputing missing values

## Part 1: Artificial dataset example

Let me first create an artificial dataset. It has two features consisting of random integers between 0 and 4, and binary labels drawn independently of the features.

We expect a machine learning model to have an accuracy of about 50%.

In [None]:
random_column1 = np.random.randint(5, size=10000) #just 10000 random integers between 0 and 4
random_column2 = np.random.randint(5, size=10000) #just 10000 random integers between 0 and 4
labels = np.random.randint(2, size=10000) #0 or 1 <- binary label

artificial_data = pd.DataFrame()
artificial_data['random_column1'], artificial_data['random_column2'] = random_column1, random_column2

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
cv_result = cross_val_score(clf, artificial_data, labels)
print('Cross validation accuracy:', np.mean(cv_result))

Now let's make some values of `random_column1` missing.

In [None]:
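# Cast to float first, because an integer column cannot hold NaN.
artificial_data['random_column1'] = artificial_data['random_column1'].astype(float)

# Each row is selected with probability 25%, so about one quarter of column1 becomes NaN.
missing_mask = np.random.randint(4, size=len(artificial_data)) == 0
artificial_data.loc[missing_mask, 'random_column1'] = np.nan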

In [None]:
# impute the missing values with the per-label means of the feature (SHOULD NOT BE DONE)
artificial_data['random_column1'] = artificial_data['random_column1'].fillna(artificial_data.groupby(labels)['random_column1'].transform('mean'))

clf = DecisionTreeClassifier(random_state=0)
cv_result = cross_val_score(clf, artificial_data, labels)
print('Cross validation accuracy:', np.mean(cv_result))

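**The accuracy increased to about 62%!**

This is because we did something we shouldn't have done: we imputed missing values using the labels. The two classes have slightly different feature means, so every imputed value reveals the label of its row. About a quarter of the rows can then be classified perfectly while the rest stay at chance, which gives roughly 0.25 × 1 + 0.75 × 0.5 ≈ 62%.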
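To check that the jump comes from the labels and not from imputation as such, the experiment can be repeated with both ways of filling the gaps. This is a minimal sketch rather than one of the notebook's own cells: the fixed seed, the `labels_demo`, `leaky` and `label_free` names, and the `cv_accuracy` helper are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)  # fixed seed, an assumption for reproducibility
n_rows = 10000

# Random binary labels and two random features that carry no label information.
labels_demo = rng.integers(2, size=n_rows)
base = pd.DataFrame({
    'random_column1': rng.integers(5, size=n_rows).astype(float),
    'random_column2': rng.integers(5, size=n_rows),
})
base.loc[rng.random(n_rows) < 0.25, 'random_column1'] = np.nan  # about 25% missing

# Leaky: fill each gap with the feature mean of the row's own class.
leaky = base.copy()
leaky['random_column1'] = leaky['random_column1'].fillna(
    leaky.groupby(labels_demo)['random_column1'].transform('mean'))

# Label-free: fill every gap with the overall feature mean.
label_free = base.copy()
label_free['random_column1'] = label_free['random_column1'].fillna(
    label_free['random_column1'].mean())

def cv_accuracy(features):
    """Mean cross-validation accuracy of a decision tree on the given features."""
    clf = DecisionTreeClassifier(random_state=0)
    return np.mean(cross_val_score(clf, features, labels_demo))

leaky_acc = cv_accuracy(leaky)
label_free_acc = cv_accuracy(label_free)
print('Leaky imputation accuracy:', leaky_acc)
print('Label-free imputation accuracy:', label_free_acc)
```

With label-free imputation the accuracy should stay near chance level, while the per-class means should push it well above 50%. The overall mean is still computed on all rows here; Section 4 goes one step further and fits the imputer inside a pipeline.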

## Part 2: Water potability example

Let's see how much accuracy we get if we impute the missing values using the labels.

In [None]:
data = pd.read_csv('../input/water-potability/water_potability.csv')

In [None]:
# We impute using the labels - SHOULD NOT BE DONE

data['ph'] = data['ph'].fillna(data.groupby(['Potability'])['ph'].transform('mean'))
data['Sulfate'] = data['Sulfate'].fillna(data.groupby(['Potability'])['Sulfate'].transform('mean'))
data['Trihalomethanes'] = data['Trihalomethanes'].fillna(data.groupby(['Potability'])['Trihalomethanes'].transform('mean'))

In [None]:
x = data.drop('Potability', axis=1)
y = data['Potability']

In [None]:
clf = DecisionTreeClassifier(random_state=0)
cv_result = cross_val_score(clf, x, y)
print(np.mean(cv_result))

We got around 71% accuracy. This is an optimistic estimate for the decision tree classifier, because the imputed values already encode the labels.

**Note:** I am not saying that 71% accuracy is not achievable. The point is that imputing missing values with information from the labels (y values) is wrong, and should be avoided.

# Section 4: The correct way of imputation and analysis

We will

- impute the missing values without using the y values (labels)
- try different classifiers

In [None]:
data = pd.read_csv('../input/water-potability/water_potability.csv')

x = data.drop('Potability', axis=1)
y = data['Potability']

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.2, random_state=0)

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

#classifiers
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

In [None]:
classifiers = [DecisionTreeClassifier(random_state=0), RandomForestClassifier(random_state=0), 
               AdaBoostClassifier(random_state=0), SVC(random_state=0), 
               KNeighborsClassifier(), LogisticRegression(random_state=0)]

clf_names = ['DT', 'RF', 'AB', 'SVC', 'KNN', 'LR']

for clfname, clf in zip(clf_names, classifiers):
    pipeline = Pipeline([('scaler',StandardScaler()), ('imputer',SimpleImputer(strategy='mean')), (clfname,clf)])
    cv_result = cross_val_score(pipeline, x_train, y_train)
    print(clfname + ':', np.mean(cv_result))

It looks like SVC is the best of these classifiers. Let's tune its hyperparameters.

In [None]:
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([('scaler',StandardScaler()), ('imputer',SimpleImputer()), ('clf',SVC())])
parameter_grid = {'imputer__strategy':['mean','median'],'clf__C':[.001,.01,.1,.2,.5,1.0,1.5,2.0,5.0], 'clf__kernel':['rbf','linear']}
search = GridSearchCV(pipeline, parameter_grid)
search.fit(x_train, y_train)

print('The best parameters are:', search.best_params_)
print('The best cross validation score is:', search.best_score_)

In [None]:
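# GridSearchCV refits the best pipeline on all of x_train by default (refit=True),
# so best_estimator_ is ready to use without another call to fit.
clf = search.best_estimator_
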
print('The test accuracy is', clf.score(x_test, y_test))