# Pump it Up: Data Mining the Water Table 

In this competition, we predict which water pumps in Tanzania are functional. The competition is hosted by DrivenData[Add link], and the dataset is provided by Taarifa[Add link] and the Tanzanian Ministry of Water[Add link].

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np

## Reading the data

In [None]:
train = pd.read_csv("input/train.csv",index_col=0)
test = pd.read_csv("input/test.csv",index_col=0)
train_labels = pd.read_csv("input/train_labels.csv",index_col=0)

In [None]:
#train = pd.merge(train, train_labels, left_on="id", right_on="id")
all_df = pd.concat((train, test), axis=0)

In [None]:
train.shape, test.shape, all_df.shape

In [None]:
# Null checking: let's take care of these values
#train.isnull().sum().sort_values(ascending=False).head(8)
#test.isnull().sum().sort_values(ascending=False).head(8)

## Replacing NaN values

In [None]:
all_df.replace(['Not Known'], ['unknown'], inplace=True)
all_df.loc[all_df.scheme_name.isnull(), 'scheme_name'] = 'unknown'
all_df.loc[all_df.scheme_management.isnull(), 'scheme_management'] = 'unknown'
all_df.loc[all_df.funder.isnull(), 'funder'] = 'unknown'
all_df.loc[all_df.installer.isnull(), 'installer'] = 'unknown'
all_df.loc[all_df.subvillage.isnull(), 'subvillage'] = 'unknown'
all_df.loc[all_df.public_meeting.isnull(), 'public_meeting'] = False
all_df.loc[all_df.permit.isnull(), 'permit'] = False


In [None]:
all_df.isnull().sum().sort_values(ascending=False).head(5)

## Deleting unused features

For now, we are not going to use GPS height, longitude and latitude, because they demand more complex processing. We also drop recorded_by.

In [None]:
exclude_features = ["gps_height", "longitude", "latitude", "recorded_by"]
all_df = all_df.drop(exclude_features, axis=1)
all_df.shape

## Fixing features

### Processing dates

For now, we are going to keep only the year from date_recorded, i.e. the year in which each record was entered.

In [None]:
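def get_year(date):
    # The first four characters of the date string hold the year.
    if pd.notnull(date):
        return int(date[:4])
    return 0  # fallback for missing dates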

all_df["date_recorded"] = all_df["date_recorded"].apply(get_year)
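
As an alternative to slicing the string, the year can also be extracted with pandas' datetime parsing. This is only a sketch on made-up values: it assumes the dates are formatted as YYYY-MM-DD, and the `dates` series is hypothetical, not taken from the competition data.

```python
import pandas as pd

# Hypothetical sample values; the YYYY-MM-DD format is an assumption.
dates = pd.Series(["2011-03-14", "2013-02-04", None])

# Parse to datetimes (missing values become NaT), take the year,
# and use 0 as a placeholder for missing dates.
years = pd.to_datetime(dates, format="%Y-%m-%d").dt.year.fillna(0).astype(int)

print(years.tolist())
```

Parsing to real datetimes also makes it easy to derive other features later on, such as the month.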

### Changing string to float

Some float values are read as strings, so let's fix them.

In [None]:
all_df["amount_tsh"] = pd.to_numeric(all_df.amount_tsh)

## Changing unique categorical features

A lot of features have many unique values (e.g. there are many funders that funded only one water pump). We are going to merge all values with 10 or fewer occurrences into a single group.

In [None]:
categorical_features = all_df.select_dtypes(include=['object']).columns

for col in categorical_features:
    val_counts = all_df[col].value_counts()
    vals_to_remove = val_counts[val_counts <= 10].index.values
    # Assign through a single .loc call to avoid chained assignment.
    all_df.loc[all_df[col].isin(vals_to_remove), col] = "Many_Unique"
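
To make the effect of this grouping concrete, here is a small sketch on made-up data. The `toy` frame and the threshold `min_count` are hypothetical and chosen only for illustration (the notebook itself uses 10).

```python
import pandas as pd

# Hypothetical toy data, not the competition dataset.
toy = pd.DataFrame({"funder": ["gov"] * 5 + ["ngo"] * 3 + ["alice", "bob"]})

min_count = 2  # illustrative threshold; the notebook uses 10
counts = toy["funder"].value_counts()
rare_values = counts[counts <= min_count].index

# Replace every rare value with one shared label.
toy.loc[toy["funder"].isin(rare_values), "funder"] = "Many_Unique"

grouped_counts = toy["funder"].value_counts().to_dict()
print(grouped_counts)
```

The two funders that appear only once end up in the shared "Many_Unique" group, while the frequent ones are left untouched.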

## Label encoding (Option 1)

Here we replace each categorical value with an integer code using pd.factorize. This is label encoding rather than one-hot encoding; it may be interesting to try dummy variables later on.

In [None]:
categorical_features = all_df.select_dtypes(include=['object']).columns

for col in categorical_features:
    all_df[col] = pd.factorize(all_df[col])[0]
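
For readers unfamiliar with pd.factorize, the sketch below shows what it returns on a made-up series (`colors` is hypothetical). Each distinct value gets an integer code in order of first appearance.

```python
import pandas as pd

# Hypothetical example values.
colors = pd.Series(["red", "blue", "red", "green", "blue"])

# codes: one integer per row; uniques: the distinct values, indexed by code.
codes, uniques = pd.factorize(colors)

print(list(codes))
print(list(uniques))
```

The codes carry no meaningful order, which tree-based models usually tolerate but which can mislead models that treat the numbers as magnitudes.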

## Generating dummy features (Option 2)

This option generates more than 5000 columns, so we are going to skip it.

In [None]:
#print(all_df.shape)
#all_dummy_df = pd.get_dummies(all_df)
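
The number of dummy columns can be estimated before building them: every non-numeric column contributes one column per distinct value. The sketch below checks this on a made-up frame; `toy` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with two categorical columns and one numeric column.
toy = pd.DataFrame({
    "basin": ["a", "b", "a", "c"],
    "funder": ["x", "y", "z", "x"],
    "amount_tsh": [0.0, 25.0, 0.0, 10.0],
})

# Numeric columns pass through; each categorical column expands
# into one column per distinct value.
categorical_cols = toy.select_dtypes(exclude=["number"]).columns
n_numeric = toy.shape[1] - len(categorical_cols)
expected_width = n_numeric + int(toy[categorical_cols].nunique().sum())

dummies = pd.get_dummies(toy)
print(expected_width, dummies.shape[1])
```

Running the same count on the real frame gives a quick way to judge whether the dummy matrix is affordable without allocating it.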


## Separating train and test again

In [None]:
train_df = all_df.loc[train.index]
test_df = all_df.loc[test.index]

train_df.shape, test_df.shape, train_labels.shape

## Building the model

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

In [None]:
#X_train, X_test, y_train, y_test = train_test_split(train_df, train_labels["status_group"], test_size=0.3)
#train_df.shape, X_train.shape, X_test.shape

In [None]:
alg = RandomForestClassifier(random_state=1, n_estimators=10, n_jobs=3)
alg.fit(train_df, train_labels["status_group"])

scores = cross_val_score(alg, train_df, train_labels["status_group"], cv=3)

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

In [None]:
alg = GradientBoostingClassifier(random_state=1)
alg.fit(train_df, train_labels["status_group"])

scores = cross_val_score(alg, train_df, train_labels["status_group"], cv=3)

print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))