<a href="https://colab.research.google.com/github/austiezr/DS-Unit-2-Kaggle-Challenge/blob/master/module4-classification-metrics/LS_DS_224_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 2, Module 4*

---

# Classification Metrics

## Assignment
- [x] If you haven't yet, [review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2), then submit your dataset.
- [ ] Plot a confusion matrix for your Tanzania Waterpumps model.
- [x] Continue to participate in our Kaggle challenge. Every student should have made at least one submission that scores at least 70% accuracy (well above the majority class baseline).
- [ ] Submit your final predictions to our Kaggle competition. Optionally, go to **My Submissions**, and _"you may select up to 1 submission to be used to count towards your final leaderboard score."_
- [x] Commit your notebook to your fork of the GitHub repo.
- [x] Read [Maximizing Scarce Maintenance Resources with Data: Applying predictive modeling, precision at k, and clustering to optimize impact](https://towardsdatascience.com/maximizing-scarce-maintenance-resources-with-data-8f3491133050), by Lambda DS3 student Michael Brady. His blog post extends the Tanzania Waterpumps scenario, far beyond what's in the lecture notebook.
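
The blog post's title mentions precision at k. As a rough sketch of the idea: rank items by their predicted probability of the positive class, keep the top k, and report what fraction of those are truly positive. The function name `precision_at_k` and the toy labels and scores below are illustrative assumptions, not taken from the post or from the waterpumps data.

```python
def precision_at_k(y_true, y_score, k):
    """Fraction of the k highest-scored items whose true label is positive (1)."""
    if not 0 < k <= len(y_true):
        raise ValueError("k must be between 1 and the number of items")
    # Sort item indices by score, highest first.
    ranked = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    top_k = ranked[:k]
    return sum(y_true[i] for i in top_k) / k


# Hypothetical data: 1 = pump needs attention, 0 = pump is fine.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]

p_at_3 = precision_at_k(y_true, y_score, k=3)
print(f"Precision at 3: {p_at_3:.3f}")
```

With a fitted classifier, the scores would typically come from `predict_proba` for the class of interest, and k would be the number of pumps a crew can actually visit.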


## Stretch Goals

### Reading

- [Attacking discrimination with smarter machine learning](https://research.google.com/bigpicture/attacking-discrimination-in-ml/), by Google Research, with interactive visualizations. _"A threshold classifier essentially makes a yes/no decision, putting things in one category or another. We look at how these classifiers work, ways they can potentially be unfair, and how you might turn an unfair classifier into a fairer one. As an illustrative example, we focus on loan granting scenarios where a bank may grant or deny a loan based on a single, automatically computed number such as a credit score."_
- [Notebook about how to calculate expected value from a confusion matrix by treating it as a cost-benefit matrix](https://github.com/podopie/DAT18NYC/blob/master/classes/13-expected_value_cost_benefit_analysis.ipynb)
- [Visualizing Machine Learning Thresholds to Make Better Business Decisions](https://blog.insightdatascience.com/visualizing-machine-learning-thresholds-to-make-better-business-decisions-4ab07f823415)
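
The cost-benefit idea in the readings above can be sketched roughly as follows: assign a value to each confusion-matrix outcome, weight each value by how often that outcome occurs, and sum. The counts and values below are made up for illustration, and `expected_value` is a hypothetical helper, not code from the linked notebook.

```python
def expected_value(confusion, values):
    """Average value per prediction.

    `confusion` maps outcome names ('tp', 'fp', 'fn', 'tn') to counts;
    `values` maps the same names to the benefit (or cost, if negative)
    of a single occurrence of that outcome.
    """
    total = sum(confusion.values())
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return sum(confusion[name] * values[name] for name in confusion) / total


# Hypothetical counts from a binary classifier on 200 cases.
confusion = {"tp": 40, "fp": 10, "fn": 20, "tn": 130}
# Hypothetical value of each outcome, in arbitrary currency units.
values = {"tp": 100.0, "fp": -20.0, "fn": -50.0, "tn": 0.0}

ev = expected_value(confusion, values)
print(f"Expected value per prediction: {ev:.2f}")
```

Changing the decision threshold changes the four counts, so the same function can be used to compare thresholds by expected value rather than by accuracy.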


### Doing
- [ ] Share visualizations in our Slack channel!
- [x] RandomizedSearchCV / GridSearchCV, for model selection. (See module 3 assignment notebook)
- [ ] Stacking Ensemble. (See module 3 assignment notebook)
- [ ] More Categorical Encoding. (See module 2 assignment notebook)

In [6]:
!pip3 install category_encoders
!pip3 install xgboost
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import f_classif, mutual_info_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.cluster import KMeans, MiniBatchKMeans, SpectralClustering
from sklearn.mixture import GaussianMixture
from category_encoders.target_encoder import TargetEncoder
from category_encoders.woe import WOEEncoder
from category_encoders import OneHotEncoder, OrdinalEncoder
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from scipy.stats import randint, uniform
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import RidgeClassifierCV, RidgeClassifier
from xgboost import XGBClassifier



In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'

In [30]:
train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

train, val = train_test_split(train, test_size=len(test), 
                              stratify=train['status_group'], random_state=33)

train.shape, val.shape, test.shape

((45042, 41), (14358, 41), (14358, 40))

In [0]:
def wrangle(X):
    X = X.copy()

    X['latitude'] = X['latitude'].replace(-2e-08, 0)
    
    cols_with_zeros = ['longitude', 'latitude', 'amount_tsh', 'construction_year', 'gps_height', 'population']
    # cols_with_zeros = ['longitude', 'latitude', 'amount_tsh', 'construction_year', 'gps_height']
    for col in cols_with_zeros:
        X[col] = X[col].replace(0, np.nan)
            
    cols_high_c = ['ward', 'subvillage', 'funder', 'installer', 'lga', 'wpt_name']

    for col in cols_high_c:
        # Keep the 15 most frequent categories; lump everything else into 'OTHER'.
        top15 = X[col].value_counts()[:15].index
        X.loc[~X[col].isin(top15), col] = 'OTHER'

    # infer_datetime_format is deprecated in pandas 2.x; format inference is now the default.
    X['date_recorded'] = pd.to_datetime(X['date_recorded'])
    
    X['year_recorded'] = X['date_recorded'].dt.year
    X['month_recorded'] = X['date_recorded'].dt.month
    X['day_recorded'] = X['date_recorded'].dt.day
    X = X.drop(columns='date_recorded')
    
    X['years'] = X['year_recorded'] - X['construction_year']
    X['years_MISSING'] = X['years'].isnull()

    X = X.drop(columns=['quantity_group', 'payment', 'recorded_by', 
                        'extraction_type_group', 'extraction_type_class'])
    
    return X


train = wrangle(train)
val = wrangle(val)
test = wrangle(test)

In [0]:
target = 'status_group'

X_train = train.drop(columns=[target, 'id'])
y_train = train[target]
X_val = val.drop(columns=[target, 'id'])
y_val = val[target]
X_test = test.drop(columns=['id'])
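
The assignment compares model accuracy against the majority class baseline. A minimal sketch of that baseline: always predict the most frequent training label and measure how often that guess is right on held-out labels. The label lists below are placeholders, not the real waterpumps distribution, and `majority_baseline` is a hypothetical helper.

```python
from collections import Counter


def majority_baseline(y_train, y_eval):
    """Return the most frequent training label and its accuracy on y_eval."""
    majority_class, _ = Counter(y_train).most_common(1)[0]
    accuracy = sum(label == majority_class for label in y_eval) / len(y_eval)
    return majority_class, accuracy


# Toy labels for illustration only.
toy_train = ["functional"] * 6 + ["non functional"] * 3 + ["functional needs repair"] * 1
toy_val = ["functional"] * 5 + ["non functional"] * 4 + ["functional needs repair"] * 1

majority_class, baseline_accuracy = majority_baseline(toy_train, toy_val)
print(f"Majority class: {majority_class}")
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

On the notebook's own data, `y_train.value_counts(normalize=True)` gives the class shares directly, and passing `y_train` and `y_val` to the helper should give the number that the pipeline's validation accuracy needs to beat.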

In [0]:
pipe = make_pipeline(
    OrdinalEncoder(),
    IterativeImputer(random_state=33),
    XGBClassifier()
    )


pipe.fit(X_train, y_train)
accuracy = pipe.score(X_val, y_val)
print(f'Train Accuracy: {pipe.score(X_train, y_train)}\n')
print(f'Validation Accuracy: {accuracy}\n')
# output.eval_js('new Audio("http://noproblo.dayjo.org/ZeldaSounds/LOZ/LOZ_Secret.wav").play()')

In [41]:
pipe = make_pipeline(
    OrdinalEncoder(),
    IterativeImputer(random_state=33),
    RandomForestClassifier(n_jobs=-1,
                           max_features='sqrt',  # 'auto' meant 'sqrt' for classifiers; removed in scikit-learn 1.3
                           random_state=33)
    )

param_distributions = {
    'iterativeimputer__max_iter': randint(10,500),
    'iterativeimputer__initial_strategy': ['mean', 'median', 'most_frequent'],
    'randomforestclassifier__n_estimators': randint(2, 600),
    'randomforestclassifier__min_samples_leaf': randint(2, 100)
}

search = RandomizedSearchCV(
    pipe, 
    param_distributions=param_distributions, 
    n_iter=100, 
    cv=5, 
    scoring='accuracy', 
    verbose=10, 
    return_train_score=True, 
    n_jobs=-1
)

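search.fit(X_train, y_train);
# Play a sound when the search finishes.
# Requires the cell that imports Audio and defines sound_file to have been run first.
Audio(sound_file, autoplay=True)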

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  4.0min
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  5.3min
[Parallel(n_jobs=-1)]: Done  45 tasks      | elapsed:  6.7min
[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:  8.0min
[Parallel(n_jobs=-1)]: Done  69 tasks      | elapsed:  9.4min
[Parallel(n_jobs=-1)]: Done  82 tasks      | elapsed: 11.2min
[Parallel(n_jobs=-1)]: Done  97 tasks      | elapsed: 13.2min
[Parallel(n_jobs=-1)]: Done 112 tasks      | elapsed: 15.1min
[Parallel(n_jobs=-1)]: Done 129 tasks      | elapsed: 16.9min
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed: 19.1min
[Parallel(n_jobs=-1)]: Done 165 tasks      | elapsed: 21.5min
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed: 23

In [42]:
pipe = make_pipeline(
    OrdinalEncoder(),
    IterativeImputer(random_state=33),
    RandomForestClassifier(n_jobs=-1,
                           max_features='sqrt',  # 'auto' meant 'sqrt' for classifiers; removed in scikit-learn 1.3
                           random_state=33)
    )

param_distributions = {
    'iterativeimputer__max_iter': randint(10,500),
    'iterativeimputer__initial_strategy': ['mean', 'median', 'most_frequent'],
    'randomforestclassifier__n_estimators': randint(2, 600),
    'randomforestclassifier__min_samples_leaf': randint(2, 100)
}

search = RandomizedSearchCV(
    pipe, 
    param_distributions=param_distributions, 
    n_iter=300, 
    cv=5, 
    scoring='accuracy', 
    verbose=10, 
    return_train_score=True, 
    n_jobs=-1
)

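search.fit(X_train, y_train);
# Play a sound when the search finishes.
# Requires the cell that imports Audio and defines sound_file to have been run first.
Audio(sound_file, autoplay=True)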

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:   48.1s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done  16 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  3.2min
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  4.0min
[Parallel(n_jobs=-1)]: Done  45 tasks      | elapsed:  4.8min
[Parallel(n_jobs=-1)]: Done  56 tasks      | elapsed:  6.0min
[Parallel(n_jobs=-1)]: Done  69 tasks      | elapsed:  7.4min
[Parallel(n_jobs=-1)]: Done  82 tasks      | elapsed:  8.7min
[Parallel(n_jobs=-1)]: Done  97 tasks      | elapsed: 10.6min
[Parallel(n_jobs=-1)]: Done 112 tasks      | elapsed: 14.6min
[Parallel(n_jobs=-1)]: Done 129 tasks      | elapsed: 20.9min
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed: 28.5min
[Parallel(n_jobs=-1)]: Done 165 tasks      | elapsed: 35.1min
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed: 41

In [0]:
from IPython.display import Audio
sound_file = './Desktop/Clarke.mp3'

In [46]:
search2.fit(X_train, y_train);

Fitting 5 folds for each of 300 candidates, totalling 1500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.


KeyboardInterrupt: ignored

In [0]:
pipe = search.best_estimator_

In [39]:
print('Best hyperparameters', search.best_params_)

Best hyperparameters {'iterativeimputer__initial_strategy': 'median', 'iterativeimputer__max_iter': 38, 'xgbclassifier__colsample_bytree': 0.35321164911458447, 'xgbclassifier__eval_metric': 'merror', 'xgbclassifier__gamma': 0, 'xgbclassifier__learning_rate': 0.05225778827907627, 'xgbclassifier__max_depth': 11, 'xgbclassifier__min_child_weight': 27, 'xgbclassifier__reg_alpha': 2, 'xgbclassifier__reg_lambda': 7, 'xgbclassifier__scale_pos_weight': 9, 'xgbclassifier__subsample': 0.8643134091331315}


In [44]:
accuracy = pipe.score(X_val, y_val)
print(f'Train Accuracy: {pipe.score(X_train, y_train)}\n')
print(f'Validation Accuracy: {accuracy}\n')

Train Accuracy: 0.922117135118334

Validation Accuracy: 0.8074244323721966
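
Validation accuracy is a single number; a confusion matrix shows which classes get confused with which. The sketch below counts (actual, predicted) pairs by hand on made-up labels so the layout is explicit: rows are actual classes, columns are predicted classes. The helper names and the toy labels are assumptions for illustration and do not reflect this model's real predictions.

```python
from collections import Counter


def confusion_matrix_counts(y_true, y_pred, labels):
    """Nested list of counts: rows are actual labels, columns are predicted labels."""
    pair_counts = Counter(zip(y_true, y_pred))
    return [[pair_counts[(actual, predicted)] for predicted in labels]
            for actual in labels]


def recall_per_class(matrix, labels):
    """Recall for each class: correct predictions divided by the row total."""
    recalls = {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])
        recalls[label] = matrix[i][i] / row_total if row_total else 0.0
    return recalls


labels = ["functional", "functional needs repair", "non functional"]

# Toy labels for illustration only.
toy_true = ["functional"] * 4 + ["functional needs repair"] * 2 + ["non functional"] * 4
toy_pred = ["functional", "functional", "functional", "non functional",
            "functional", "functional needs repair",
            "non functional", "non functional", "functional", "non functional"]

matrix = confusion_matrix_counts(toy_true, toy_pred, labels)
recalls = recall_per_class(matrix, labels)

for label, row in zip(labels, matrix):
    print(f"{label:>25}: {row}")
print(recalls)
```

In the notebook itself, `sklearn.metrics.confusion_matrix(y_val, pipe.predict(X_val))` should produce the same kind of table for the fitted pipeline, and `ConfusionMatrixDisplay.from_estimator(pipe, X_val, y_val)` (scikit-learn 1.0 or later) draws it as a plot.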



In [0]:
RandomForestClassifier(n_jobs=-1,
                       n_estimators=522,
                       max_features='sqrt',  # 'auto' meant 'sqrt' for classifiers; removed in scikit-learn 1.3
                       min_samples_leaf=2,
                       random_state=33)

In [0]:
# Build the submission with just the id and the predicted status_group.
submission = test[['id']].copy()
submission['status_group'] = pipe.predict(X_test)
submission.to_csv('submission.csv', index=False)
submission.head()