Lambda School Data Science

*Unit 2, Sprint 2, Module 4*

---

# Classification Metrics

## Assignment
- [ ] If you haven't yet, [review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2), then submit your dataset.
- [ ] Plot a confusion matrix for your Tanzania Waterpumps model.
- [ ] Continue to participate in our Kaggle challenge. Every student should have made at least one submission that scores at least 70% accuracy (well above the majority class baseline).
- [ ] Submit your final predictions to our Kaggle competition. Optionally, go to **My Submissions**, and _"you may select up to 1 submission to be used to count towards your final leaderboard score."_
- [ ] Commit your notebook to your fork of the GitHub repo.
- [ ] Read [Maximizing Scarce Maintenance Resources with Data: Applying predictive modeling, precision at k, and clustering to optimize impact](https://towardsdatascience.com/maximizing-scarce-maintenance-resources-with-data-8f3491133050), by Lambda DS3 student Michael Brady. His blog post extends the Tanzania Waterpumps scenario far beyond what's in the lecture notebook.
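
For the confusion matrix task above, here is a minimal, dependency-free sketch of what the matrix counts. The labels and predictions are invented for illustration and are not real model output; the three class names are an assumption about the `status_group` values. In the notebook you would more likely call scikit-learn's `sklearn.metrics.confusion_matrix` on a validation split, since the test set has no labels.

```python
from collections import Counter

def confusion_matrix_counts(y_true, y_pred, labels):
    """Rows are actual classes, columns are predicted classes."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(actual, predicted)] for predicted in labels]
            for actual in labels]

# Toy data, invented for illustration (assumed class names)
labels = ['functional', 'functional needs repair', 'non functional']
y_true = ['functional', 'functional', 'non functional', 'non functional',
          'functional needs repair', 'functional']
y_pred = ['functional', 'non functional', 'non functional', 'functional',
          'functional', 'functional']

matrix = confusion_matrix_counts(y_true, y_pred, labels)

# Correct predictions sit on the diagonal, so accuracy is diagonal / total
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / len(y_true)

for label, row in zip(labels, matrix):
    print(f'{label:>25}', row)
print('accuracy', accuracy)
```

Reading across a row shows where a given actual class ends up; reading down a column shows what a given prediction is made of.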


## Stretch Goals

### Reading
- [Attacking discrimination with smarter machine learning](https://research.google.com/bigpicture/attacking-discrimination-in-ml/), by Google Research, with interactive visualizations. _"A threshold classifier essentially makes a yes/no decision, putting things in one category or another. We look at how these classifiers work, ways they can potentially be unfair, and how you might turn an unfair classifier into a fairer one. As an illustrative example, we focus on loan granting scenarios where a bank may grant or deny a loan based on a single, automatically computed number such as a credit score."_
- [Notebook about how to calculate expected value from a confusion matrix by treating it as a cost-benefit matrix](https://github.com/podopie/DAT18NYC/blob/master/classes/13-expected_value_cost_benefit_analysis.ipynb)
- [Simple guide to confusion matrix terminology](https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/) by Kevin Markham, with video
- [Visualizing Machine Learning Thresholds to Make Better Business Decisions](https://blog.insightdatascience.com/visualizing-machine-learning-thresholds-to-make-better-business-decisions-4ab07f823415)
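
The cost-benefit idea in the readings above can be sketched in a few lines: weight each confusion matrix cell by how often it happens and by what it is worth. Every number below is hypothetical, chosen only to make the arithmetic easy to follow.

```python
# Hypothetical confusion matrix counts for a yes/no classifier
counts = {'TP': 30, 'FP': 10, 'FN': 20, 'TN': 140}

# Hypothetical value of each outcome (benefits positive, costs negative)
values = {'TP': 100.0, 'FP': -20.0, 'FN': -50.0, 'TN': 0.0}

total = sum(counts.values())

# Expected value per case = sum over cells of (cell rate * cell value)
expected_value = sum(counts[cell] / total * values[cell] for cell in counts)

print(round(expected_value, 2))
```

Changing the decision threshold moves cases between cells, so recomputing this number per threshold is one way to pick a threshold by business value rather than by accuracy.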


### Doing
- [ ] Share visualizations in our Slack channel!
- [ ] RandomizedSearchCV / GridSearchCV, for model selection. (See module 3 assignment notebook)
- [ ] More Categorical Encoding. (See module 2 assignment notebook)
- [ ] Stacking Ensemble. (See below)

### Stacking Ensemble

Here's some code you can use to "stack" multiple submissions by majority vote, which is another form of ensembling:

```python
import pandas as pd

# Filenames of your submissions you want to ensemble
files = ['submission-01.csv', 'submission-02.csv', 'submission-03.csv']

target = 'status_group'
submissions = (pd.read_csv(file)[[target]] for file in files)
ensemble = pd.concat(submissions, axis='columns')
majority_vote = ensemble.mode(axis='columns')[0]

sample_submission = pd.read_csv('sample_submission.csv')
submission = sample_submission.copy()
submission[target] = majority_vote
submission.to_csv('my-ultimate-ensemble-submission.csv', index=False)
```
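
To make the voting step concrete, here is a small dependency-free sketch of a per-row majority vote. The rows are invented, the class names are assumed, and the tie-breaking rule (alphabetically first label) is a choice made for this sketch; check how ties are resolved in your own ensemble before relying on it.

```python
from collections import Counter

def majority_vote(predictions):
    """Most common label; ties go to the alphabetically first label
    (a choice made for this sketch)."""
    counts = Counter(predictions)
    top_count = max(counts.values())
    return min(label for label, count in counts.items() if count == top_count)

# One tuple per test row: the label each hypothetical submission predicted
rows = [
    ('functional', 'functional', 'non functional'),               # clear majority
    ('non functional', 'functional needs repair', 'functional'),  # three-way tie
]

votes = [majority_vote(row) for row in rows]
print(votes)
```

With three classes, three submissions can still disagree completely, so it is worth counting how many rows end in a tie.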

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [2]:
import pandas as pd

# Merge train_features.csv & train_labels.csv
train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))

# Read test_features.csv & sample_submission.csv
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

In [16]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import category_encoders as ce
from sklearn.feature_selection import f_regression, SelectKBest
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from pandas_profiling import ProfileReport
from scipy.stats import randint, uniform

In [6]:
def wrangle(df):
    df = df.copy()
    df['latitude'] = df['latitude'].replace(-2e-08, 0)
    
    cols_with_zeros = ['longitude', 'latitude', 'construction_year',
                      'gps_height', 'population']
    
    for col in cols_with_zeros:
        df[col] = df[col].replace(0, np.nan)
        df[col+'_MISSING'] = df[col].isnull()
        
    duplicates = ['quantity_group', 'payment_type']
    df = df.drop(columns = duplicates)
    
    unusable_variance = ['recorded_by']
    #removed id to keep it in test set
    df = df.drop(columns = unusable_variance)
    
    #deal with amount_tsh distribution
    df["amount_tsh_zero"] = df["amount_tsh"] == 0
    df["log_amount_tsh"] = np.log(df["amount_tsh"] + 1)
    
    df["num_private_is_zero"] = df["num_private"]==0
    
    # for some high-cardinality columns, keep the top 50 values and label everything else "other"
    top50f = df["funder"].value_counts()[:50].index
    df.loc[~df['funder'].isin(top50f), "funder"] = "other"
    top50i = df["installer"].value_counts()[:50].index
    df.loc[~df["installer"].isin(top50i), "installer"] = "other"
    top50w = df["wpt_name"].value_counts()[:50].index
    df.loc[~df["wpt_name"].isin(top50w), "wpt_name"] = "other"
    top50s = df["subvillage"].value_counts()[:50].index
    df.loc[~df["subvillage"].isin(top50s), "subvillage"] = "other"
    
    #set categoricals as categoricals
    df["region_code"] = df["region_code"].astype("category")
    df["district_code"] = df["district_code"].astype("category")
    
    df['date_recorded'] = pd.to_datetime(df['date_recorded'])
    
    df['year_recorded'] = df['date_recorded'].dt.year
    df['month_recorded'] = df['date_recorded'].dt.month
    df['day_recorded'] = df['date_recorded'].dt.day
    df = df.drop(columns = 'date_recorded')
    
    #df['years'] = df['year_recorded'] - df['construction_year']
    # too correlated with construction year
    df['years_MISSING'] = df['construction_year'].isnull()
    
    return df
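
One thing to watch in `wrangle`: each top-50 list is computed from whichever frame is passed in, so `train` and `test` can end up keeping different values. A possible alternative, sketched here without pandas and with made-up values, is to learn the kept categories from the training data once and apply them to both sets. The helper names are hypothetical.

```python
from collections import Counter

def top_k_categories(values, k):
    """Return the k most frequent values as a set."""
    return {value for value, _ in Counter(values).most_common(k)}

def collapse_rare(values, keep, other='other'):
    """Replace every value outside `keep` with the `other` label."""
    return [value if value in keep else other for value in values]

# Made-up funder columns for illustration
train_funders = ['A', 'A', 'B', 'B', 'C']
test_funders = ['C', 'C', 'A', 'D']

# Learn the kept categories from train only, then reuse them for test
keep = top_k_categories(train_funders, k=2)
train_collapsed = collapse_rare(train_funders, keep)
test_collapsed = collapse_rare(test_funders, keep)

print(train_collapsed)
print(test_collapsed)
```

Here `C` is frequent in the test column but still becomes `other`, because the encoder fitted on train never saw it as its own category.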

In [7]:
train = wrangle(train)
test = wrangle(test)

In [8]:
target = 'status_group'
features = train.drop([target], axis = 1).columns.tolist()

In [9]:
X_train = train[features]
y_train = train[target]
X_test = test[features]

In [17]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy = "median", add_indicator = True),
    RandomForestClassifier(random_state = 100)
)

param_distributions = { 
    'randomforestclassifier__n_estimators': randint(50, 1300),
    'randomforestclassifier__max_depth': [5, 15, 25, 35, 45, None]
}

search = RandomizedSearchCV(
    pipeline, 
    param_distributions=param_distributions, 
    n_iter=10, 
    cv=3, 
    scoring='accuracy', 
    verbose=10, 
    return_train_score=True, 
    n_jobs=-1
)

search.fit(X_train, y_train)

Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:  3.2min
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:  6.8min
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed: 12.5min
[Parallel(n_jobs=-1)]: Done  27 out of  30 | elapsed: 19.4min remaining:  2.2min
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed: 21.9min finished


RandomizedSearchCV(cv=3, error_score=nan,
                   estimator=Pipeline(memory=None,
                                      steps=[('ordinalencoder',
                                              OrdinalEncoder(cols=None,
                                                             drop_invariant=False,
                                                             handle_missing='value',
                                                             handle_unknown='value',
                                                             mapping=None,
                                                             return_df=True,
                                                             verbose=0)),
                                             ('simpleimputer',
                                              SimpleImputer(add_indicator=True,
                                                            copy=True,
                                                            fill_value=None,
 

from skopt import BayesSearchCV

# parameter ranges are specified by one of below
from skopt.space import Real, Categorical, Integer

from sklearn.datasets import load_iris
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.75, random_state=0)

# log-uniform: understand as search over p = exp(x) by varying x
opt = BayesSearchCV(
    SVC(),
    {
        'C': Real(1e-6, 1e+6, prior='log-uniform'),
        'gamma': Real(1e-6, 1e+1, prior='log-uniform'),
        'degree': Integer(1, 8),
        'kernel': Categorical(['linear', 'poly', 'rbf']),
    },
    n_iter=32
)

# executes bayesian optimization
opt.fit(X_train, y_train)
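
The "search over p = exp(x) by varying x" remark can be illustrated with the standard library alone. This is a sketch of the sampling idea only, not of how skopt implements its prior; the function name and the seed are arbitrary.

```python
import math
import random

def sample_log_uniform(low, high, rng):
    """Draw x uniformly between log(low) and log(high), then return exp(x)."""
    x = rng.uniform(math.log(low), math.log(high))
    return math.exp(x)

rng = random.Random(0)  # arbitrary seed for repeatability
samples = [sample_log_uniform(1e-6, 1e6, rng) for _ in range(10_000)]

# On a log scale, 1.0 is the midpoint of [1e-6, 1e6],
# so roughly half of the draws should fall below it
share_below_one = sum(sample < 1.0 for sample in samples) / len(samples)
print(share_below_one)
```

A plain uniform draw over the same range would almost never produce a value below 1, which is why parameters such as `C` and `gamma` that span many orders of magnitude are usually searched on a log scale.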

In [19]:
print('Best hyperparameters', search.best_params_)
print('Accuracy', search.best_score_)

pd.DataFrame(search.cv_results_).sort_values(by='rank_test_score').T

Best hyperparameters {'randomforestclassifier__max_depth': 25, 'randomforestclassifier__n_estimators': 1275}
Accuracy 0.8091414141414143


| | 3 | 9 | 1 | 8 | 5 | 4 | 7 | 0 | 6 | 2 |
|---|---|---|---|---|---|---|---|---|---|---|
| mean_fit_time | 183.348 | 164.357 | 187.162 | 165.58 | 221.666 | 111.47 | 140.806 | 60.6318 | 100.342 | 43.3282 |
| std_fit_time | 2.99278 | 13.5918 | 4.5249 | 0.876993 | 14.2678 | 18.1971 | 20.2325 | 0.773856 | 13.6103 | 4.72183 |
| mean_score_time | 14.8177 | 9.98362 | 12.4719 | 12.6989 | 12.8194 | 9.04015 | 16.766 | 3.87634 | 5.81635 | 2.7417 |
| std_score_time | 1.47847 | 1.33906 | 0.531447 | 2.58532 | 2.2178 | 2.08736 | 8.91325 | 0.0927213 | 0.191493 | 0.489628 |
| param_randomforestclassifier__max_depth | 25 | 25 | 25 | 35 | 45 | 35 | 45 | 45 | 15 | 5 |
| param_randomforestclassifier__n_estimators | 1275 | 986 | 1024 | 935 | 1250 | 682 | 1034 | 335 | 999 | 726 |
| params | {'randomforestclassifier__max_depth': 25, 'ran... | {'randomforestclassifier__max_depth': 25, 'ran... | {'randomforestclassifier__max_depth': 25, 'ran... | {'randomforestclassifier__max_depth': 35, 'ran... | {'randomforestclassifier__max_depth': 45, 'ran... | {'randomforestclassifier__max_depth': 35, 'ran... | {'randomforestclassifier__max_depth': 45, 'ran... | {'randomforestclassifier__max_depth': 45, 'ran... | {'randomforestclassifier__max_depth': 15, 'ran... | {'randomforestclassifier__max_depth': 5, 'rand... |
| split0_test_score | 0.810606 | 0.810354 | 0.810152 | 0.808485 | 0.808434 | 0.80899 | 0.80798 | 0.808232 | 0.79803 | 0.714596 |
| split1_test_score | 0.810202 | 0.809394 | 0.809697 | 0.807576 | 0.808838 | 0.805859 | 0.807929 | 0.806919 | 0.798889 | 0.71803 |
| split2_test_score | 0.806616 | 0.806515 | 0.806414 | 0.805 | 0.803434 | 0.805253 | 0.803535 | 0.80298 | 0.799242 | 0.716263 |
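
As a quick check on how `best_score_` relates to the per-fold rows, the sketch below averages the three split scores of the top two candidates (columns 3 and 9), copied from the output above. The displayed scores are rounded, so the match with the printed accuracy is approximate.

```python
from statistics import mean

# Per-fold test scores copied from the cv_results_ output above
best_candidate = [0.810606, 0.810202, 0.806616]    # column 3
second_candidate = [0.810354, 0.809394, 0.806515]  # column 9

reported_best_score = 0.8091414141414143  # printed search.best_score_

best_mean = mean(best_candidate)
second_mean = mean(second_candidate)

# Gap between the two best candidates vs. fold-to-fold spread of the winner
gap = best_mean - second_mean
fold_spread = max(best_candidate) - min(best_candidate)

print(round(best_mean, 6), round(second_mean, 6))
print(round(gap, 6), round(fold_spread, 6))
```

The gap between the top two candidates is smaller than the variation across folds for the winner alone, so the ranking among the top few settings should be read with some caution.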


In [None]:
pipeline = search.best_estimator_
y_pred = pipeline.predict(X_test)