## BloomTech Data Science

---


# Cross-Validation

- Do **k-fold cross-validation** with an independent test set
- Use scikit-learn for **hyperparameter optimization**

In [None]:
#%%capture
#!pip install category_encoders==2.*

In [None]:
from category_encoders import OrdinalEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, validation_curve # k-fold CV
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV # Hyperparameter tuning
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Downloading the Tanzania Waterpump Dataset

Make sure you only use the dataset that is available through the **DS** **Kaggle Competition**. DO NOT USE any other Tanzania waterpump datasets that you might find online.

There are two ways to get the dataset. Make sure you have joined the competition first!

1. Download the dataset directly by accessing the challenge and its files through the Kaggle Competition URL on Canvas.

2. Use the Kaggle API with the code in the following cells. This article provides helpful information on how to fetch your Kaggle dataset into Google Colab using the Kaggle API.

> https://medium.com/analytics-vidhya/how-to-fetch-kaggle-datasets-into-google-colab-ea682569851a

# Using Kaggle API to download dataset

In [None]:
# # mounting your google drive on colab
# from google.colab import drive
# drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
# Change your working directory if you want to save (or have already saved) your Kaggle dataset on Google Drive.
# Update the path to the folder on Drive that contains the dataset and/or the Kaggle API token JSON file.
# %cd /content/gdrive/My Drive/Kaggle

/content/gdrive/My Drive/Kaggle


In [None]:
# Download your Kaggle Dataset, if you haven't already done so.
# import os
# os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/My Drive/Kaggle"
# !kaggle competitions download -c bloomtech-water-pump-challenge

In [None]:
# Unzip your Kaggle dataset, if you haven't already done so.
# !unzip \*.zip  && rm *.zip

In [None]:
# List all files in your Kaggle folder on your google drive.
# !ls

# I. Wrangle Data

In [2]:
import pandas as pd
import numpy as np
def wrangle(fm_path, tv_path=None):
  if tv_path:
    df = pd.merge(pd.read_csv(fm_path,
                              na_values=[0, -2.000000e-08],
                              parse_dates=['date_recorded']),
                  pd.read_csv(tv_path)).set_index('id')
  else:
    df = pd.read_csv(fm_path,
                     na_values=[0, -2.000000e-08],
                     parse_dates=['date_recorded'],
                     index_col='id')

  # Drop constant columns
  df.drop(columns=['recorded_by'], inplace=True)

  # Create age feature
  df['pump_age'] = df['date_recorded'].dt.year - df['construction_year']
  df.drop(columns='date_recorded', inplace=True)

  # Drop high-cardinality categorical columns (HCCCs)
  cutoff = 100
  drop_cols = [col for col in df.select_dtypes('object').columns
              if df[col].nunique() > cutoff]
  df.drop(columns=drop_cols, inplace=True)

  # Drop duplicate columns (compared on the first 100 rows only)
  dupe_mask = df.head(100).T.duplicated()
  dupe_cols = dupe_mask[dupe_mask].index
  df.drop(columns=dupe_cols, inplace=True)

  return df

df = wrangle(fm_path='train_features.csv',
             tv_path='train_labels.csv')

X_test = wrangle(fm_path='test_features.csv')

FileNotFoundError: [Errno 2] No such file or directory: 'train_features.csv'

In [None]:
df.shape, X_test.shape
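
The duplicate-column step in `wrangle` relies on transposing the DataFrame and calling `duplicated()`, which flags every row that repeats an earlier one. Here is a minimal sketch of that idea on a made-up three-column frame (the column names and values are illustrative, not taken from the competition files):

```python
import pandas as pd

# Toy frame: `quantity_group` is an exact copy of `quantity` (made-up data).
df_toy = pd.DataFrame({
    'quantity': ['dry', 'enough', 'dry'],
    'quantity_group': ['dry', 'enough', 'dry'],
    'source': ['river', 'well', 'spring'],
})

# After transposing, each original column is a row, so duplicated()
# marks columns whose values repeat an earlier column.
dupe_mask = df_toy.T.duplicated()
dupe_cols = dupe_mask[dupe_mask].index.tolist()

df_clean = df_toy.drop(columns=dupe_cols)
print(dupe_cols)
```

Note that `wrangle` compares only the first 100 rows, so two columns that agree on those rows but differ later would still be treated as duplicates.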

# II. Split Data

## Split target vector (TV) from feature matrix (FM)

In [None]:
target = 'status_group'
y = df[target]
X = df.drop(columns = target)

# Training-Validation Split

# III. Establish Baseline

This is a **classification** problem, so our baseline is the **accuracy** we would get by always predicting the majority class.

In [None]:
y.value_counts(normalize = True).max()
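
To make the baseline concrete, here is a small sketch on made-up labels (the class names mirror the competition's `status_group` values, but the counts are invented for illustration):

```python
import pandas as pd

# Toy labels standing in for the real `status_group` column (made-up counts).
y_toy = pd.Series(['functional'] * 6
                  + ['non functional'] * 3
                  + ['functional needs repair'])

# Relative frequency of each class; the largest share is the accuracy of a
# model that always predicts the majority class.
class_shares = y_toy.value_counts(normalize=True)
majority_class = class_shares.idxmax()
baseline_acc = class_shares.max()

print(majority_class, baseline_acc)
```

Any model worth keeping should beat this number on data it has not seen.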

# IV. Build Models

- `DecisionTreeClassifier`
- `RandomForestClassifier`

In [None]:
model_dt = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(),
    DecisionTreeClassifier(random_state=42))

In [None]:
model_rf = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(),
    RandomForestClassifier(n_estimators=25, random_state=42)
);

**Check cross-validation scores**

![Cross Validation](https://upload.wikimedia.org/wikipedia/commons/4/4b/KfoldCV.gif)

In [None]:
print('CV score DecisionTreeClassifier')
print('Mean CV accuracy score:', )
print('STD CV accuracy score:',)

In [None]:
print('CV score RandomForestClassifier')
print('Mean CV accuracy score:',)
print('STD CV accuracy score:',)
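
One way to produce the mean and standard deviation asked for above is `cross_val_score`. The sketch below is only an illustration: it uses synthetic numeric data from `make_classification` instead of the waterpump data, so the pipeline skips `OrdinalEncoder`, and the names `X_toy`, `y_toy`, and `model_toy` are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (assumption, not the real dataset).
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)

model_toy = make_pipeline(
    SimpleImputer(),
    DecisionTreeClassifier(random_state=42))

# cv=5 splits the data into 5 folds; each fold serves once as the
# validation set while the model is trained on the other 4.
cv_scores = cross_val_score(model_toy, X_toy, y_toy, cv=5)

print('Mean CV accuracy score:', cv_scores.mean())
print('STD CV accuracy score:', cv_scores.std())
```

With the real data, the same call would presumably take `model_dt` or `model_rf` together with `X` and `y`.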

# V. Tune Model

Different hyperparameter values I want to try out:

*   `SimpleImputer` `strategy` - `'mean'`, `'median'` - 2 values
*   `max_depth` - `range(5,40,5)` - 7 values
*   `n_estimators` - `range(25,125,25)` - 4 values

> Total combinations of these hyperparameters = 2 * 7 * 4 = 56

Testing the above hyperparameter combinations with 5-fold cross-validation requires:

> Total number of models to be fit = 2 * 7 * 4 * 5 = 280
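
The arithmetic above can be checked with scikit-learn's `ParameterGrid`, which enumerates every combination in a grid. The parameter keys below assume the step names that `make_pipeline` generates (the lowercased class names), as in `model_rf`:

```python
from sklearn.model_selection import ParameterGrid

# Keys follow the `<step name>__<parameter>` convention of pipelines.
param_grid = {
    'simpleimputer__strategy': ['mean', 'median'],
    'randomforestclassifier__max_depth': list(range(5, 40, 5)),
    'randomforestclassifier__n_estimators': list(range(25, 125, 25)),
}

n_combinations = len(ParameterGrid(param_grid))
n_folds = 5
n_fits = n_combinations * n_folds

print(n_combinations, n_fits)
```

With the default `refit=True`, a search object also fits the best combination once more on all of the training data, so the actual count is one higher.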


**`GridSearchCV`:**
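
A minimal sketch of an exhaustive grid search, under some loud assumptions: the data is synthetic (`make_classification`), the pipeline omits `OrdinalEncoder` because the toy features are numeric, and the grid is deliberately much smaller than the one listed above so that it runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the training data (assumption, not the real dataset).
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=42)

pipe = make_pipeline(SimpleImputer(), RandomForestClassifier(random_state=42))

# Reduced grid for illustration: 2 * 2 * 2 = 8 combinations.
small_grid = {
    'simpleimputer__strategy': ['mean', 'median'],
    'randomforestclassifier__max_depth': [5, 10],
    'randomforestclassifier__n_estimators': [10, 20],
}

# Every combination is evaluated with 3-fold cross-validation.
grid_search = GridSearchCV(pipe, param_grid=small_grid, cv=3)
grid_search.fit(X_toy, y_toy)

print('Best params:', grid_search.best_params_)
print('Best CV accuracy:', grid_search.best_score_)
```

Because every combination is tried, the cost grows multiplicatively with each hyperparameter added to the grid.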

**`RandomizedSearchCV`:**
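
A minimal sketch of a randomized search over the value ranges listed earlier in this section. It samples only `n_iter` combinations instead of all 56. The data is synthetic, `OrdinalEncoder` is omitted because the toy features are numeric, and `n_iter=5` and `cv=3` are arbitrary choices made to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the training data (assumption, not the real dataset).
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=42)

pipe = make_pipeline(SimpleImputer(), RandomForestClassifier(random_state=42))

param_distributions = {
    'simpleimputer__strategy': ['mean', 'median'],
    'randomforestclassifier__max_depth': list(range(5, 40, 5)),
    'randomforestclassifier__n_estimators': list(range(25, 125, 25)),
}

# Only n_iter randomly chosen combinations are cross-validated.
random_search = RandomizedSearchCV(
    pipe,
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=42)
random_search.fit(X_toy, y_toy)

print('Best params:', random_search.best_params_)
print('Best CV accuracy:', random_search.best_score_)
```

The fitted search object exposes `best_estimator_` and can call `predict` directly. The name `model_rfrs` used later in this notebook is assumed to refer to a fitted search of this kind.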

# VI. Communicate Results

**Showing Feature Importance**

Plot the feature importance for our `RandomForest` model.

In [None]:
bestestimator = model_rfrs.best_estimator_  # assumes model_rfrs is the fitted search object used in section VII
importances = bestestimator.named_steps['randomforestclassifier'].feature_importances_
features = X.columns
feat_imp = pd.Series(importances, index=features).sort_values()
feat_imp.tail(10).plot(kind='barh')
plt.xlabel('Reduction in Gini Impurity');
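
For reference, here is a self-contained sketch of the same extraction on synthetic data, without the plot. The feature names (`feature_0`, ...) and the pipeline are placeholders, and no search is run here; the forest is fitted directly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the training data (assumption, not the real dataset).
X_arr, y_toy = make_classification(n_samples=200, n_features=6,
                                   n_informative=3, random_state=42)
X_toy = pd.DataFrame(X_arr, columns=[f'feature_{i}' for i in range(6)])

pipe = make_pipeline(
    SimpleImputer(),
    RandomForestClassifier(n_estimators=25, random_state=42))
pipe.fit(X_toy, y_toy)

# make_pipeline names each step after its lowercased class name.
importances = pipe.named_steps['randomforestclassifier'].feature_importances_

# Pair each importance with its column name and sort ascending, so that
# tail() returns the most important features.
feat_imp = pd.Series(importances, index=X_toy.columns).sort_values()
print(feat_imp.tail(3))
```

The importances are normalized to sum to 1, so each value can be read as a feature's share of the total impurity reduction.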

# VII. Make Submission

In [None]:
y_pred = model_rfrs.predict(X_test)

In [None]:
submission = pd.DataFrame({'status_group':y_pred}, index=X_test.index)

In [None]:
submission

In [None]:
pd.Timestamp.now().strftime('%Y-%m-%d_%H%M_')

In [None]:
datestamp = pd.Timestamp.now().strftime('%Y-%m-%d_%H%M_')  # format the current time as a string
submission.to_csv(f'{datestamp}submission.csv')  # prefix the file name with the timestamp

# VIII. Saving a trained model to reuse it in the future

In [None]:
# Once you have found the best model, you might as well save it and then reload it when you want to test it later

# save model
import pickle

filename =

# Save your model (it is stored in your current working directory; download it to your computer if Google Drive is not mounted)
with open(filename, 'wb') as f:
    pickle.dump(model_rf, f)

# Load model
with open(filename, 'rb') as f:
    model_rf_loaded = pickle.load(f)
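
As a quick check that saving and loading preserves a model, here is a round-trip sketch on a toy classifier. The file name `toy_model.pkl` and the temporary directory are illustrative choices, not the notebook's actual file name.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (assumption, not the real dataset).
X_toy, y_toy = make_classification(n_samples=100, n_features=5, random_state=42)
model_toy = DecisionTreeClassifier(random_state=42).fit(X_toy, y_toy)

# Write to a temporary directory so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, 'toy_model.pkl')  # hypothetical file name
    with open(toy_path, 'wb') as f:
        pickle.dump(model_toy, f)
    with open(toy_path, 'rb') as f:
        model_loaded = pickle.load(f)

# The reloaded model should reproduce the original predictions exactly.
same_predictions = bool(
    (model_loaded.predict(X_toy) == model_toy.predict(X_toy)).all())
print(same_predictions)
```

Only unpickle files you trust, and reload with the same scikit-learn version that saved the model where possible, since pickles are not guaranteed to be compatible across versions.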