<a href="https://colab.research.google.com/github/joeyMckinney/DS-Unit-2-Kaggle-Challenge/blob/master/module4-classification-metrics/josiah_mckinney_LS_DS_224_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 2, Module 4*

---

# Classification Metrics

## Assignment
- [ ] If you haven't yet, [review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2), then submit your dataset.
- [ ] Plot a confusion matrix for your Tanzania Waterpumps model.
- [ ] Continue to participate in our Kaggle challenge. Every student should have made at least one submission that scores at least 70% accuracy (well above the majority class baseline).
- [ ] Submit your final predictions to our Kaggle competition. Optionally, go to **My Submissions**, and _"you may select up to 1 submission to be used to count towards your final leaderboard score."_
- [ ] Commit your notebook to your fork of the GitHub repo.
- [ ] Read [Maximizing Scarce Maintenance Resources with Data: Applying predictive modeling, precision at k, and clustering to optimize impact](http://archive.is/DelgE), by Lambda DS3 student Michael Brady. His blog post extends the Tanzania Waterpumps scenario far beyond what's in the lecture notebook.


## Stretch Goals

### Reading

- [Attacking discrimination with smarter machine learning](https://research.google.com/bigpicture/attacking-discrimination-in-ml/), by Google Research, with interactive visualizations. _"A threshold classifier essentially makes a yes/no decision, putting things in one category or another. We look at how these classifiers work, ways they can potentially be unfair, and how you might turn an unfair classifier into a fairer one. As an illustrative example, we focus on loan granting scenarios where a bank may grant or deny a loan based on a single, automatically computed number such as a credit score."_
- [Notebook about how to calculate expected value from a confusion matrix by treating it as a cost-benefit matrix](https://github.com/podopie/DAT18NYC/blob/master/classes/13-expected_value_cost_benefit_analysis.ipynb)
- [Visualizing Machine Learning Thresholds to Make Better Business Decisions](https://blog.insightdatascience.com/visualizing-machine-learning-thresholds-to-make-better-business-decisions-4ab07f823415)
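
The threshold idea that runs through these readings can be sketched in a few lines. The example below is only an illustration: the labels and probabilities are made up, and `confusion_counts` is a hypothetical helper, not something taken from the linked articles. It shows how moving the decision threshold trades precision against recall.

```python
import numpy as np

# Made-up true labels (1 = positive) and predicted probabilities, for illustration only
y_true = np.array([1, 1, 1, 0, 1, 0, 0, 0])
y_prob = np.array([0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.2, 0.1])

def confusion_counts(y_true, y_prob, threshold):
    """Return (tp, fp, fn, tn) when predicting positive at or above the threshold."""
    y_pred = y_prob >= threshold
    actual = y_true == 1
    tp = int(np.sum(y_pred & actual))
    fp = int(np.sum(y_pred & ~actual))
    fn = int(np.sum(~y_pred & actual))
    tn = int(np.sum(~y_pred & ~actual))
    return tp, fp, fn, tn

# A higher threshold predicts fewer positives: precision tends to rise, recall to fall
for threshold in (0.35, 0.5, 0.7):
    tp, fp, fn, tn = confusion_counts(y_true, y_prob, threshold)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(f'threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}')
```

With a real classifier, the probabilities would typically come from `predict_proba`, and the threshold would be chosen to fit the costs of each kind of error.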


### Doing
- [ ] Share visualizations in our Slack channel!
- [ ] RandomizedSearchCV / GridSearchCV, for model selection. (See module 3 assignment notebook)
- [ ] Stacking Ensemble. (See module 3 assignment notebook)
- [ ] More Categorical Encoding. (See module 2 assignment notebook)

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [2]:
import pandas as pd

# Merge train_features.csv & train_labels.csv
train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))

# Read test_features.csv & sample_submission.csv
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

In [3]:
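import numpy as np
from sklearn.model_selection import train_test_split

#This function cleans the data: it treats zero latitude/longitude values as
#missing (NaN), removes coordinate outliers from the training rows, and drops
#high-cardinality features. It then splits the training data into train and
#validation sets.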
def wrangle_split(train, test):

  train_c = train.copy()
  test_c = test.copy()

  #replace the placeholder latitude value -2e-08 with 0
  train_c['latitude'] = train_c['latitude'].replace(-2e-08, 0)
  test_c['latitude'] = test_c['latitude'].replace(-2e-08, 0)
  
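  #remove coordinate outliers from the training data only (rows outside the
  #0.05th to 99.95th percentiles of latitude or longitude)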
  train_c = train_c[(train_c['latitude'] >= np.percentile(train_c['latitude'], 0.05)) & 
        (train_c['latitude'] < np.percentile(train_c['latitude'], 99.95)) &
        (train_c['longitude'] >= np.percentile(train_c['longitude'], 0.05)) & 
        (train_c['longitude'] <= np.percentile(train_c['longitude'], 99.95))]
  #test_c = test_c[(test_c['latitude'] >= np.percentile(test_c['latitude'], 0.05)) & 
        #(test_c['latitude'] < np.percentile(test_c['latitude'], 99.95)) &
        #(test_c['longitude'] >= np.percentile(test_c['longitude'], 0.05)) & 
        #(test_c['longitude'] <= np.percentile(test_c['longitude'], 99.95))]

  #replace latitude and longitude values of 0 with NaN so they are treated as missing
  cols_with_zeros = ['longitude', 'latitude']
  for col in cols_with_zeros:
    train_c[col] = train_c[col].replace(0, np.nan)
    test_c[col] = test_c[col].replace(0, np.nan)

  #find high-cardinality categorical features (more than 100 unique values);
  #the list is built from train and applied to both sets so their columns match
  hc_cols = [col for col in train_c.select_dtypes(include='object').columns
             if train_c[col].nunique() > 100]

  #drop 'quantity_group', a repeated feature, along with the high-cardinality columns
  train_c = train_c.drop(['quantity_group'] + hc_cols, axis=1)
  test_c = test_c.drop(['quantity_group'] + hc_cols, axis=1)

  
  y = train_c['status_group']
  X = train_c.drop('status_group', axis=1)

  X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=77, stratify=y)

  return X_train, X_val, y_train, y_val, test_c

In [4]:
from sklearn.model_selection import train_test_split

In [5]:
#apply the wrangle_split function to the data
X_train, X_val, y_train, y_val, test_c = wrangle_split(train, test)

In [6]:
#baseline accuracy: share of the majority class in y_train
baseline = y_train.value_counts(normalize=True).max()
baseline

0.5450871322411021
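
The baseline above is the share of the most frequent class in `y_train`: a model that always predicts that class would reach this accuracy on the training labels. A minimal sketch of the same calculation, using made-up labels (the `labels` Series is illustrative, not the waterpumps data):

```python
import pandas as pd

# Hypothetical labels for illustration only (not the waterpumps data)
labels = pd.Series(['functional'] * 6
                   + ['non functional'] * 3
                   + ['functional needs repair'])

# Share of each class, largest first
class_shares = labels.value_counts(normalize=True)

# Majority-class baseline: accuracy from always predicting the most common class
majority_class = class_shares.idxmax()
baseline_accuracy = class_shares.max()

# The same number computed directly: fraction of rows equal to the majority class
always_majority_accuracy = (labels == majority_class).mean()

print(majority_class, baseline_accuracy, always_majority_accuracy)
```

Any model worth submitting should beat this number on validation data.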

In [7]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from category_encoders import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV


In [None]:
model = make_pipeline(
    OrdinalEncoder(),
    SimpleImputer(),
    RandomForestClassifier(random_state=77)
)

params = {'randomforestclassifier__n_estimators': range(1, 200, 10),
          'randomforestclassifier__max_depth': range(1,50)}

rs = RandomizedSearchCV(model, 
                        param_distributions=params, 
                        n_iter=10, 
                        scoring='accuracy',
                        n_jobs=-1,
                        verbose=1,
                        cv=5,
                        random_state=77)  # fixed seed so the sampled candidates are reproducible

rs.fit(X_train, y_train)
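
Once a search like the one above has been fit, the assignment's confusion matrix can be built from validation predictions. The sketch below is an assumption-laden stand-in rather than the notebook's actual result: it uses synthetic three-class data from `make_classification` instead of the waterpumps features, a smaller search so it runs quickly, and the hypothetical names `search`, `X_va`, and `y_va`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic 3-class data standing in for the waterpumps features (assumption)
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=77)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.2,
                                          random_state=77, stratify=y)

pipe = make_pipeline(SimpleImputer(), RandomForestClassifier(random_state=77))
search = RandomizedSearchCV(
    pipe,
    param_distributions={'randomforestclassifier__n_estimators': range(10, 60, 10),
                         'randomforestclassifier__max_depth': range(2, 10)},
    n_iter=3, scoring='accuracy', cv=3, random_state=77)
search.fit(X_tr, y_tr)

# With the default refit=True, the search object predicts with the best pipeline
y_pred = search.predict(X_va)
val_accuracy = accuracy_score(y_va, y_pred)

# Rows are true classes, columns are predicted classes
labels = sorted(set(y_va))
cm = pd.DataFrame(confusion_matrix(y_va, y_pred, labels=labels),
                  index=[f'true {c}' for c in labels],
                  columns=[f'pred {c}' for c in labels])

print(search.best_params_)
print(val_accuracy)
print(cm)
```

In this notebook the same steps would use `rs`, `X_val`, and `y_val`. In scikit-learn 1.0 or later, `ConfusionMatrixDisplay.from_predictions(y_val, y_pred)` draws the matrix as a plot.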