### Tobias Reaper

LSDS Unit 2: Predictive Modeling

# Kaggle Challenge, Module 4

---
---

## Assignment
- [ ] If you haven't yet, [review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2), then submit your dataset.
- [ ] Plot a confusion matrix for your Tanzania Waterpumps model.
- [ ] Continue to participate in our Kaggle challenge: make at least 5 submissions that each score at least 60% accuracy.
- [ ] Submit your final predictions to our Kaggle competition. Optionally, go to **My Submissions**, and _"you may select up to 1 submission to be used to count towards your final leaderboard score."_
- [ ] Commit your notebook to your fork of the GitHub repo.
- [ ] Read [Maximizing Scarce Maintenance Resources with Data: Applying predictive modeling, precision at k, and clustering to optimize impact](https://towardsdatascience.com/maximizing-scarce-maintenance-resources-with-data-8f3491133050), by Lambda DS3 student Michael Brady. His blog post extends the Tanzania Waterpumps scenario far beyond what's in the lecture notebook.

## Stretch Goals

### Reading
- [Attacking discrimination with smarter machine learning](https://research.google.com/bigpicture/attacking-discrimination-in-ml/), by Google Research, with interactive visualizations. _"A threshold classifier essentially makes a yes/no decision, putting things in one category or another. We look at how these classifiers work, ways they can potentially be unfair, and how you might turn an unfair classifier into a fairer one. As an illustrative example, we focus on loan granting scenarios where a bank may grant or deny a loan based on a single, automatically computed number such as a credit score."_
- [Notebook about how to calculate expected value from a confusion matrix by treating it as a cost-benefit matrix](https://github.com/podopie/DAT18NYC/blob/master/classes/13-expected_value_cost_benefit_analysis.ipynb)
- [Simple guide to confusion matrix terminology](https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/) by Kevin Markham, with video
- [Visualizing Machine Learning Thresholds to Make Better Business Decisions](https://blog.insightdatascience.com/visualizing-machine-learning-thresholds-to-make-better-business-decisions-4ab07f823415)


### Doing
- [ ] Share visualizations in our Slack channel!
- [ ] RandomizedSearchCV / GridSearchCV, for model selection. (See module 3 assignment notebook)
- [ ] More Categorical Encoding. (See module 2 assignment notebook)
- [ ] Stacking Ensemble. (See below)

### Stacking Ensemble

Here's some code you can use to "stack" multiple submissions by taking a majority vote across them, which is another form of ensembling:

```python
import pandas as pd

# Filenames of your submissions you want to ensemble
files = ['submission-01.csv', 'submission-02.csv', 'submission-03.csv']

target = 'status_group'
submissions = (pd.read_csv(file)[[target]] for file in files)
ensemble = pd.concat(submissions, axis='columns')
majority_vote = ensemble.mode(axis='columns')[0]

sample_submission = pd.read_csv('sample_submission.csv')
submission = sample_submission.copy()
submission[target] = majority_vote
submission.to_csv('my-ultimate-ensemble-submission.csv', index=False)
```
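To see what the row-wise `mode` does, here is a small sketch that runs the same steps on made-up predictions instead of real submission files. The three toy DataFrames and their values are invented for illustration. The last row is a three-way tie: `mode` returns every tied label, and `[0]` keeps the first of them.

```python
import pandas as pd

target = 'status_group'

# Three hypothetical submissions, each predicting four pumps (made-up values)
toy_submissions = [
    pd.DataFrame({target: ['functional', 'non functional', 'functional', 'functional']}),
    pd.DataFrame({target: ['functional', 'functional', 'non functional', 'functional needs repair']}),
    pd.DataFrame({target: ['non functional', 'non functional', 'non functional', 'non functional']}),
]

# Same steps as the snippet above: line predictions up side by side,
# then take the most common label in each row
ensemble = pd.concat([s[[target]] for s in toy_submissions], axis='columns')
majority_vote = ensemble.mode(axis='columns')[0]

print(majority_vote.tolist())
```

With an odd number of submissions, ties can only happen when three or more classes each get a vote, so they should be rare but not impossible on this three-class target.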

In [4]:
# The classixx
import pandas as pd
import numpy as np

In [5]:
# The extras
import pandas_profiling
import janitor

import plotly.express as px
import plotly.figure_factory as ff

In [6]:
# The SkoolKidz
from scipy.stats import randint, uniform
import category_encoders as ce
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.model_selection import train_test_split, cross_val_score, RandomizedSearchCV

from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier

In [7]:
# The metrikidz
from sklearn.metrics import confusion_matrix, accuracy_score, classification_report
from sklearn.utils.multiclass import unique_labels  # Check out what other methods this library has

In [8]:
# Define data source
# DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'  # Remote
DATA_PATH = '/Users/Tobias/workshop/dasci/sprints/06-Kaggle_Challenge/DS-Unit-2-Kaggle-Challenge/data/'  # Local

# Load and merge train_features + train_labels
train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))

# Read test_features + sample_submission
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')
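
The cells further down use `X_train`, `y_train`, and `y_val`, but the split that creates them isn't shown in this notebook. Below is a minimal sketch of one way to produce them, run on a tiny made-up frame so it executes on its own. The toy values, the 80/20 split, and the stratification are assumptions, not necessarily what was used here.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

target = 'status_group'

# Tiny made-up frame standing in for the merged `train` DataFrame
toy_train = pd.DataFrame({
    'amount_tsh': [0, 50, 0, 200, 0, 25, 0, 500, 10, 0],
    'gps_height': [1390, 1399, 686, 263, 0, 0, 1062, 0, 62, 1368],
    target: ['functional', 'non functional'] * 5,
})

# Hold out 20% for validation, stratified so both parts keep the class balance
train_part, val_part = train_test_split(
    toy_train, test_size=0.2, stratify=toy_train[target], random_state=42
)

# Features drop the target; labels stay a one-column DataFrame,
# matching how y_train["status_group"] is indexed later on
X_train = train_part.drop(columns=[target])
y_train = train_part[[target]]
X_val = val_part.drop(columns=[target])
y_val = val_part[[target]]

print(X_train.shape, X_val.shape)
```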

---

In [None]:
pipeline = make_pipeline(  # Define the pipe
    ce.OrdinalEncoder(), 
    IterativeImputer(), 
    RandomForestClassifier(random_state=42),
)

param_distributions = {  # The hyper-parameters to search
    'iterativeimputer__initial_strategy': ["mean", "median"], 
    'iterativeimputer__max_iter': randint(5, 20),
    'iterativeimputer__n_nearest_features': randint(1, 6),
    'iterativeimputer__imputation_order': ["ascending", "arabic"],
    'randomforestclassifier__n_estimators': randint(100, 300), 
    'randomforestclassifier__max_depth': [16, 20, 24, 32], 
    'randomforestclassifier__max_features': uniform(0, 1), 
}

search = RandomizedSearchCV(  # Sample n_iter candidates, score each with cv-fold cross-validation
    pipeline, 
    param_distributions=param_distributions, 
    n_iter=3,
    cv=3, 
    scoring='accuracy',
    verbose=10, 
    return_train_score=True, 
    n_jobs=-1,
    random_state=42,
)

In [None]:
# Fit and find the best hyperparameters according to above
search.fit(X_train, y_train["status_group"].values.flatten());

In [None]:
# Print the best combo of hyperparameters and the corresponding accuracy
print('Best hyperparameters', search.best_params_)
print('Accuracy', search.best_score_)

In [None]:
# Make predictions on test using best model from random search
bestest = search.best_estimator_  # Best pipeline from the search, already refit on all of X_train

# Predict on the test data
y_pred = bestest.predict(X_test)

In [None]:
# Check out the value counts of the predictions
y_pred_series = pd.Series(y_pred)
y_pred_series.value_counts()

---

## Plot the Matrix

> The main diagonal holds the correct predictions; every off-diagonal cell is a misclassification.
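
As a quick check of that idea, the sketch below builds a confusion matrix from six made-up labels and shows that the diagonal sum divided by the total matches `accuracy_score`. The labels are invented for illustration, and `y_hat` is just a hypothetical name for the toy predictions.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up actual and predicted labels for six pumps
y_true = ['functional', 'functional', 'non functional',
          'non functional', 'functional needs repair', 'functional']
y_hat = ['functional', 'non functional', 'non functional',
         'non functional', 'functional', 'functional']

cm = confusion_matrix(y_true, y_hat)  # rows = actual, columns = predicted

correct = np.trace(cm)  # sum of the main diagonal = correct predictions
total = cm.sum()        # every prediction made

print(cm)
print(correct / total, accuracy_score(y_true, y_hat))
```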


In [None]:
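# y_pred above holds predictions for X_test, which has no labels to score against,
# so predict on the validation features instead (assumes X_val comes from the same split as y_val)
y_pred_val = bestest.predict(X_val)
print(classification_report(y_val, y_pred_val))

In [None]:
# 3.2 Plot a heatmap with plotly
def plot_confusion_matrix(y_true, y_pred):
    labels = unique_labels(y_true)
    columns = [f'Predicted {label}' for label in labels]
    index = [f'Actual {label}' for label in labels]
    # Pass labels explicitly so the matrix rows/columns line up with the tick labels
    z = confusion_matrix(y_true, y_pred, labels=labels)
    fig = ff.create_annotated_heatmap(z, x=columns, y=index, colorscale="Viridis")
    fig.update_yaxes(autorange="reversed")
    fig.show()

plot_confusion_matrix(y_val, y_pred_val)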