Lambda School Data Science

*Unit 2, Sprint 2, Module 4*

---

In [None]:
# Classification Metrics

## Assignment
- [X] If you haven't yet, [review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2), then submit your dataset.
- [ ] Plot a confusion matrix for your Tanzania Waterpumps model.
- [ ] Continue to participate in our Kaggle challenge. Every student should have made at least one submission that scores at least 70% accuracy (well above the majority class baseline).
- [ ] Submit your final predictions to our Kaggle competition. Optionally, go to **My Submissions**, and _"you may select up to 1 submission to be used to count towards your final leaderboard score."_
- [ ] Commit your notebook to your fork of the GitHub repo.
- [ ] Read [Maximizing Scarce Maintenance Resources with Data: Applying predictive modeling, precision at k, and clustering to optimize impact](https://towardsdatascience.com/maximizing-scarce-maintenance-resources-with-data-8f3491133050), by Lambda DS3 student Michael Brady. His blog post extends the Tanzania Waterpumps scenario, far beyond what's in the lecture notebook.


## Stretch Goals

### Reading
- [Attacking discrimination with smarter machine learning](https://research.google.com/bigpicture/attacking-discrimination-in-ml/), by Google Research, with interactive visualizations. _"A threshold classifier essentially makes a yes/no decision, putting things in one category or another. We look at how these classifiers work, ways they can potentially be unfair, and how you might turn an unfair classifier into a fairer one. As an illustrative example, we focus on loan granting scenarios where a bank may grant or deny a loan based on a single, automatically computed number such as a credit score."_
- [Notebook about how to calculate expected value from a confusion matrix by treating it as a cost-benefit matrix](https://github.com/podopie/DAT18NYC/blob/master/classes/13-expected_value_cost_benefit_analysis.ipynb)
- [Simple guide to confusion matrix terminology](https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/) by Kevin Markham, with video
- [Visualizing Machine Learning Thresholds to Make Better Business Decisions](https://blog.insightdatascience.com/visualizing-machine-learning-thresholds-to-make-better-business-decisions-4ab07f823415)


### Doing
- [ ] Share visualizations in our Slack channel!
- [ ] RandomizedSearchCV / GridSearchCV, for model selection. (See module 3 assignment notebook)
- [ ] More Categorical Encoding. (See module 2 assignment notebook)
- [ ] Stacking Ensemble. (See below)

### Stacking Ensemble

Here's some code you can use to "stack" multiple submissions by taking a majority vote across them, which is another form of ensembling:

```python
import pandas as pd

# Filenames of your submissions you want to ensemble
files = ['submission-01.csv', 'submission-02.csv', 'submission-03.csv']

target = 'status_group'
submissions = (pd.read_csv(file)[[target]] for file in files)
ensemble = pd.concat(submissions, axis='columns')
majority_vote = ensemble.mode(axis='columns')[0]

sample_submission = pd.read_csv('sample_submission.csv')
submission = sample_submission.copy()
submission[target] = majority_vote
submission.to_csv('my-ultimate-ensemble-submission.csv', index=False)
```
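
To see what the majority vote does row by row, here is a minimal sketch on three small in-memory "submissions" instead of CSV files. The frames `sub_a`, `sub_b`, `sub_c` and their contents are made up for illustration. Note the last row: when all three submissions disagree, `mode` returns several values and `[0]` simply keeps the first one, which is typically the alphabetically first label, so ties are not broken in any informed way.

```python
import pandas as pd

target = 'status_group'

# Hypothetical submissions (made-up predictions for three pumps)
sub_a = pd.DataFrame({target: ['functional', 'non functional', 'functional']})
sub_b = pd.DataFrame({target: ['functional', 'non functional', 'non functional']})
sub_c = pd.DataFrame({target: ['non functional', 'non functional', 'functional needs repair']})

# One column per submission, one row per pump
votes = pd.concat([sub_a[target], sub_b[target], sub_c[target]],
                  axis='columns', keys=['a', 'b', 'c'])

# Most frequent label in each row; column 0 holds the first mode
majority_vote = votes.mode(axis='columns')[0]
print(majority_vote.tolist())
```

With an odd number of submissions and only two disagreeing labels in a row there is always a clear winner; the three-way tie can only happen because the target has three classes.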

In [2]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [3]:
import pandas as pd

# Merge train_features.csv & train_labels.csv
train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))

# Read test_features.csv & sample_submission.csv
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

In [10]:
import category_encoders as ce
import numpy as np
from sklearn.feature_selection import f_classif, SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import balanced_accuracy_score

In [None]:
#1. Begin with baselines for classification. 
#What is your baseline accuracy, if you guessed the majority class for every prediction?
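
The majority-class baseline asked about above is just the share of the most common label. A minimal sketch, using a made-up label series `y_demo` as a stand-in for `y_train` (the real class proportions are not shown in this notebook, so the numbers here are illustrative only):

```python
import pandas as pd

# Hypothetical stand-in for y_train; the real labels are train['status_group']
y_demo = pd.Series(
    ['functional'] * 5 + ['non functional'] * 4 + ['functional needs repair'] * 1
)

# Share of each class, sorted from most to least frequent
class_shares = y_demo.value_counts(normalize=True)

majority_class = class_shares.index[0]
baseline_accuracy = class_shares.iloc[0]  # accuracy of always guessing majority_class

print(majority_class, baseline_accuracy)
```

Running the same two lines on `y_train` would give the actual baseline to compare the cross-validation scores against.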

In [None]:
# Since I'm using cross-validation instead of a separate train/val/test split,
# I'm going to use a linear classifier (logistic regression) as my baseline model.

In [19]:
target = 'status_group'
features = train.columns.drop([target] + ['id', 'recorded_by'])
X_train = train[features]
y_train = train[target]

pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='mean'), 
    StandardScaler(), 
    LogisticRegression(random_state=42)
)

k = 3
scores = cross_val_score(pipeline, X_train, y_train, cv=k, 
                         scoring='accuracy')
print(f'Accuracy score for {k} folds:', scores)

#pipeline.fit(X_train, y_train)
#print('Validation Score:', pipeline.score(X_val, y_val))



Accuracy score for 3 folds: [0.68116162 0.64085859 0.64383838]


In [None]:
#first model, no randomized search

In [20]:
from sklearn.ensemble import RandomForestClassifier

pipeline1 = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='median'), 
    StandardScaler(),
    RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
)

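k = 5
# Score pipeline1 (the random forest). The output recorded below came from a run
# that passed `pipeline` (the logistic regression) here by mistake.
scores1 = cross_val_score(pipeline1, X_train, y_train, cv=k, 
                          scoring='accuracy')
print(f'Accuracy Score for {k} folds:', scores1)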



Accuracy Score for 5 folds: [0.62881912 0.63765676 0.63989899 0.64469697 0.6426166 ]


In [None]:
#second model, with randomized search, selectkbest, random forest 

In [22]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(),
    StandardScaler(),
    SelectKBest(f_classif),
    RandomForestClassifier(random_state=42)
)

param_distributions = {
    'simpleimputer__strategy': ['mean', 'median'],
    'selectkbest__k': randint(20, 38),
    'randomforestclassifier__n_estimators': randint(50, 500),
    'randomforestclassifier__max_depth': [5, 10, 15, 20, None],
    'randomforestclassifier__max_features': uniform(0, 1)
}

search3 = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=20,
    cv=5,
    scoring='accuracy',
    verbose=10,
    return_train_score=True,
    n_jobs=-1
)

search3.fit(X_train, y_train);

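k = 5
# Note: this cross-validates `pipeline` with its default hyperparameters, not the
# tuned model. The tuned model is search3.best_estimator_, and its mean
# cross-validated accuracy is search3.best_score_.
scores3 = cross_val_score(pipeline, X_train, y_train, cv=k, 
                          scoring='accuracy')
print(f'Accuracy Score for {k} folds:', scores3)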

Fitting 5 folds for each of 20 candidates, totalling 100 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:  2.7min
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:  7.7min
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed: 18.3min
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed: 21.8min
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed: 31.5min
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 34.0min
[Parallel(n_jobs=-1)]: Done  53 tasks      | elapsed: 35.5min
[Parallel(n_jobs=-1)]: Done  64 tasks      | elapsed: 40.3min
[Parallel(n_jobs=-1)]: Done  77 tasks      | elapsed: 51.6min
[Parallel(n_jobs=-1)]: Done  90 tasks      | elapsed: 57.6min
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 60.0min finished


Accuracy Score for 5 folds: [0.75439778 0.73655416 0.73686869 0.73964646 0.74187574]


In [25]:
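# Note: `pipeline` is the template that was passed to the search, so this prints
# its default hyperparameters. The tuned values are in search3.best_params_
# and on search3.best_estimator_.
print('Model Hyperparameters:')
print(pipeline.named_steps['randomforestclassifier'])
print(pipeline.named_steps['simpleimputer'])
print(pipeline.named_steps['selectkbest'])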

Model Hyperparameters:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators='warn',
                       n_jobs=None, oob_score=False, random_state=42, verbose=0,
                       warm_start=False)
SimpleImputer(add_indicator=False, copy=True, fill_value=None,
              missing_values=nan, strategy='mean', verbose=0)
SelectKBest(k=10, score_func=<function f_classif at 0x00000283DEF5DDC8>)
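
The printout above shows defaults (`k=10`, no tuned `n_estimators`) because `RandomizedSearchCV` works on clones and leaves the pipeline it was given untouched. A small sketch of where the tuned values end up, run on synthetic data from `make_classification` rather than the waterpumps data. The names `demo_pipeline` and `demo_search` and the tiny search ranges are assumptions chosen only to keep the example fast:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

# Synthetic 3-class data standing in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=12,
                                     n_informative=5, n_classes=3,
                                     random_state=42)

demo_pipeline = make_pipeline(
    SelectKBest(f_classif),
    RandomForestClassifier(random_state=42),
)

demo_search = RandomizedSearchCV(
    demo_pipeline,
    param_distributions={
        'selectkbest__k': randint(4, 12),            # integers 4..11
        'randomforestclassifier__n_estimators': randint(10, 30),
    },
    n_iter=3,
    cv=3,
    scoring='accuracy',
    random_state=42,
)
demo_search.fit(X_demo, y_demo)

# The template pipeline keeps its default k; the refit best estimator has the tuned k
template_k = demo_pipeline.named_steps['selectkbest'].k
tuned_k = demo_search.best_estimator_.named_steps['selectkbest'].k

print('Template k:', template_k)
print('Tuned k:', tuned_k)
print('Best params:', demo_search.best_params_)
print('Best CV accuracy:', demo_search.best_score_)
```

Applied to this notebook, the equivalent lookups would be `search3.best_params_`, `search3.best_score_`, and `search3.best_estimator_.named_steps[...]`.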


In [29]:
pd.DataFrame(search3.cv_results_).sort_values(by='rank_test_score').T

candidate,3,17,11,1,13,6,2,19,18,12,0,5,15,9,14,8,4,7,16,10
mean_fit_time,258.982,138.648,213.7,176.713,355.76,223.318,393.076,32.8407,127.257,21.3625,73.3058,273.662,112.432,45.0959,58.4201,16.733,65.1004,46.525,20.7069,8.75831
std_fit_time,11.6995,1.86205,1.69436,4.89846,8.37056,14.0615,1.12998,1.84665,2.29736,0.756782,1.0473,14.5191,0.932773,0.837746,2.63628,0.828387,8.23109,9.35016,0.433426,0.0500218
mean_score_time,4.78576,3.14737,3.03036,2.43974,2.26975,4.42439,3.73478,0.846895,1.75543,1.07618,1.41971,1.73193,1.77789,0.943977,1.05619,0.963142,0.892993,1.01821,0.55931,0.831711
std_score_time,0.182329,0.526265,0.062521,0.171543,0.355533,1.28306,0.260284,0.10137,0.338103,0.150182,0.27149,0.318168,0.247859,0.141365,0.120942,0.303761,0.080325,0.308095,0.125169,0.119715
param_randomforestclassifier__max_depth,20,20,20,,20,15,15,20,20,20,10,10,10,5,5,5,5,5,5,5
param_randomforestclassifier__max_features,0.312156,0.324083,0.56095,0.390953,0.809581,0.367746,0.686272,0.445792,0.680705,0.232645,0.426876,0.742908,0.580056,0.447058,0.459558,0.202948,0.852269,0.791177,0.563964,0.0360089
param_randomforestclassifier__n_estimators,444,400,381,204,317,482,433,90,253,113,131,355,393,253,297,147,139,191,133,174
param_selectkbest__k,34,30,28,36,35,30,30,23,24,20,34,33,23,28,32,30,28,20,20,25
param_simpleimputer__strategy,mean,mean,mean,mean,mean,median,median,mean,median,median,median,mean,mean,mean,mean,mean,mean,mean,median,median
params,"{'randomforestclassifier__max_depth': 20, 'ran...","{'randomforestclassifier__max_depth': 20, 'ran...","{'randomforestclassifier__max_depth': 20, 'ran...","{'randomforestclassifier__max_depth': None, 'r...","{'randomforestclassifier__max_depth': 20, 'ran...","{'randomforestclassifier__max_depth': 15, 'ran...","{'randomforestclassifier__max_depth': 15, 'ran...","{'randomforestclassifier__max_depth': 20, 'ran...","{'randomforestclassifier__max_depth': 20, 'ran...","{'randomforestclassifier__max_depth': 20, 'ran...","{'randomforestclassifier__max_depth': 10, 'ran...","{'randomforestclassifier__max_depth': 10, 'ran...","{'randomforestclassifier__max_depth': 10, 'ran...","{'randomforestclassifier__max_depth': 5, 'rand...","{'randomforestclassifier__max_depth': 5, 'rand...","{'randomforestclassifier__max_depth': 5, 'rand...","{'randomforestclassifier__max_depth': 5, 'rand...","{'randomforestclassifier__max_depth': 5, 'rand...","{'randomforestclassifier__max_depth': 5, 'rand...","{'randomforestclassifier__max_depth': 5, 'rand..."


In [None]:
#model 4

In [31]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy = 'mean'),
    StandardScaler(),
    SelectKBest(f_classif),
    RandomForestClassifier(random_state=42, max_depth=None)
)

param_distributions = {
    'selectkbest__k': randint(20, 38),
    'randomforestclassifier__n_estimators': randint(50, 500),
    'randomforestclassifier__max_features': uniform(0, 1)
}

search4 = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions,
    n_iter=10,
    cv=3,
    scoring='accuracy',
    verbose=10,
    return_train_score=True,
    n_jobs=-1
)

search4.fit(X_train, y_train);

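k = 5
# Note: as in the model 3 cell, this cross-validates the untuned `pipeline`, not
# search4.best_estimator_. Its default settings match the model 3 pipeline,
# which is why the scores below are identical to the earlier ones.
scores4 = cross_val_score(pipeline, X_train, y_train, cv=k, 
                          scoring='accuracy')
print(f'Accuracy Score for {k} folds:', scores4)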

Fitting 3 folds for each of 10 candidates, totalling 30 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:  3.2min
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:  4.5min
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  6.2min
[Parallel(n_jobs=-1)]: Done  27 out of  30 | elapsed:  9.9min remaining:  1.1min
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed: 11.3min finished


Accuracy Score for 5 folds: [0.75439778 0.73655416 0.73686869 0.73964646 0.74187574]
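
The assignment also asks for a confusion matrix. Since this notebook uses cross-validation rather than a held-out validation set, one option is to build the matrix from out-of-fold predictions with `cross_val_predict`. This is a sketch under assumptions: it uses synthetic three-class data and a plain random forest (`demo_model`) in place of the waterpumps pipeline, and it prints the matrix as a labeled table rather than drawing it.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict

# Synthetic 3-class data standing in for X_train / y_train
X_demo, y_demo = make_classification(n_samples=300, n_features=12,
                                     n_informative=5, n_classes=3,
                                     random_state=42)

demo_model = RandomForestClassifier(n_estimators=50, random_state=42)

# Each row is predicted by a model that never saw it during fitting
y_pred = cross_val_predict(demo_model, X_demo, y_demo, cv=5)

labels = [0, 1, 2]
cm = confusion_matrix(y_demo, y_pred, labels=labels)  # rows: actual, columns: predicted

cm_table = pd.DataFrame(
    cm,
    index=[f'actual {label}' for label in labels],
    columns=[f'predicted {label}' for label in labels],
)
print(cm_table)

# The diagonal holds the correct predictions, so it reproduces accuracy
print('Accuracy:', cm.diagonal().sum() / cm.sum())
```

For the real model, passing the tuned pipeline and `X_train`, `y_train` to `cross_val_predict` should give the corresponding matrix; the table can then be plotted as a heatmap with whichever plotting library is at hand.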


