# Assignment
- Read [Maximizing Scarce Maintenance Resources with Data: Applying predictive modeling, precision at k, and clustering to optimize impact](https://towardsdatascience.com/maximizing-scarce-maintenance-resources-with-data-8f3491133050), by Lambda DS3 student Michael Brady. His blog post extends the Tanzania Waterpumps scenario far beyond what's in this lecture notebook.
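
Precision at k, one of the techniques named in that post's title, can be sketched in a few lines. This is a minimal illustration and not code from the blog post: the function name `precision_at_k` and the toy labels and scores are assumptions made here.

```python
import numpy as np


def precision_at_k(y_true, y_score, k):
    """Fraction of true positives among the k highest-scored items."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    # Indices of the k largest scores (stable sort keeps ties in input order)
    top_k = np.argsort(-y_score, kind='stable')[:k]
    return y_true[top_k].mean()


# Toy data (made-up numbers): 1 = pump needs repair, 0 = pump is fine
y_true = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

# If a crew can only visit 3 pumps, how many of those visits are useful?
print(precision_at_k(y_true, y_score, k=3))
```

With a fixed maintenance budget of k visits, this metric may be more informative than overall accuracy, because it only scores the pumps you would actually visit.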

If your Kaggle Public Leaderboard score is:
- **Nonexistent**: You need to work on your model and submit predictions
- **< 70%**: You should work on your model and submit predictions
- **70% to 80%**: You may want to work on visualizations and write a blog post
- **> 80%**: You should work on visualizations and write a blog post


## Stretch goals — Highly Recommended Links
- Read Google Research's blog post, [Attacking discrimination with smarter machine learning](https://research.google.com/bigpicture/attacking-discrimination-in-ml/), and explore the interactive visualizations. _"A threshold classifier essentially makes a yes/no decision, putting things in one category or another. We look at how these classifiers work, ways they can potentially be unfair, and how you might turn an unfair classifier into a fairer one. As an illustrative example, we focus on loan granting scenarios where a bank may grant or deny a loan based on a single, automatically computed number such as a credit score."_
- Read the blog post, [Visualizing Machine Learning Thresholds to Make Better Business Decisions](https://blog.insightdatascience.com/visualizing-machine-learning-thresholds-to-make-better-business-decisions-4ab07f823415). You can replicate the code as-is, ["the hard way"](https://docs.google.com/document/d/1ubOw9B3Hfip27hF2ZFnW3a3z9xAgrUDRReOEo-FHCVs/edit), or you can apply it to the Tanzania Waterpumps data.
- Read this [notebook about how to calculate expected value from a confusion matrix by treating it as a cost-benefit matrix](https://github.com/podopie/DAT18NYC/blob/master/classes/13-expected_value_cost_benefit_analysis.ipynb).
- (Re)read the [Simple guide to confusion matrix terminology](https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/) and watch the 35 minute video.
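
The idea of treating a confusion matrix as a cost-benefit matrix can be sketched as follows. This is a rough illustration and not code from the linked notebook: the function name `expected_value`, the counts, and the payoffs are all made up here.

```python
import numpy as np


def expected_value(confusion, cost_benefit):
    """Average payoff per prediction: sum over cells of (rate * value)."""
    confusion = np.asarray(confusion, dtype=float)
    cost_benefit = np.asarray(cost_benefit, dtype=float)
    rates = confusion / confusion.sum()  # share of all predictions per cell
    return float((rates * cost_benefit).sum())


# Rows = actual class, columns = predicted class (hypothetical counts)
confusion = [[50, 10],   # actual negative: 50 TN, 10 FP
             [5, 35]]    # actual positive:  5 FN, 35 TP

# Hypothetical payoff for each cell, in the same layout
cost_benefit = [[0, -20],     # TN costs nothing, FP wastes a visit
                [-100, 80]]   # FN is an expensive miss, TP is a gain

print(expected_value(confusion, cost_benefit))
```

Changing the classification threshold changes the counts in the confusion matrix, so the same function could be used to compare thresholds by expected payoff rather than by accuracy.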

In [2]:
!pip install category_encoders

Collecting category_encoders
  Downloading https://files.pythonhosted.org/packages/6e/a1/f7a22f144f33be78afeb06bfa78478e8284a64263a3c09b1ef54e673841e/category_encoders-2.0.0-py2.py3-none-any.whl (87kB)
Installing collected packages: category-encoders
Successfully installed category-encoders-2.0.0


In [0]:
!pip install category_encoders matplotlib==3.1.0

In [0]:
%matplotlib inline
import category_encoders as ce
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier

In [0]:
LOCAL = '../data/tanzania/'
WEB = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Tree-Ensembles/master/data/tanzania/'
source = WEB

# Merge train_features.csv & train_labels.csv
train = pd.merge(pd.read_csv(source + 'train_features.csv'), 
                 pd.read_csv(source + 'train_labels.csv'))

# Read test_features.csv & sample_submission.csv
test = pd.read_csv(source + 'test_features.csv')
sample_submission = pd.read_csv(source + 'sample_submission.csv')

# Split train into train & val. Make val the same size as test.
train, val = train_test_split(train, test_size=len(test),  
                              stratify=train['status_group'], random_state=42)
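
To see what `stratify` does in the split above, here is a small sketch on toy data. The DataFrame `toy` and its 80/20 class mix are invented for illustration; only `train_test_split` and its `test_size`, `stratify`, and `random_state` parameters come from the cell above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 80 'functional' rows and 20 'non functional' rows (made up)
toy = pd.DataFrame({
    'x': range(100),
    'status_group': ['functional'] * 80 + ['non functional'] * 20,
})

# An integer test_size is an absolute number of rows, as in len(test) above
toy_train, toy_val = train_test_split(
    toy, test_size=20, stratify=toy['status_group'], random_state=42)

# Both splits should keep roughly the original 80/20 class mix
print(toy_train['status_group'].value_counts(normalize=True))
print(toy_val['status_group'].value_counts(normalize=True))
```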

In [0]:
def wrangle(X):
    """Wrangles train, validate, and test sets in the same way"""
    X = X.copy()

    # Convert date_recorded to datetime
    X['date_recorded'] = pd.to_datetime(X['date_recorded'], infer_datetime_format=True)
    
    # Extract components from date_recorded, then drop the original column
    X['year_recorded'] = X['date_recorded'].dt.year
    X['month_recorded'] = X['date_recorded'].dt.month
    X['day_recorded'] = X['date_recorded'].dt.day
    X = X.drop(columns='date_recorded')
    
    # Engineer feature: how many years from construction_year to date_recorded
    X['years'] = X['year_recorded'] - X['construction_year']    
    
    # Drop recorded_by (never varies) and id (always varies, random)
    unusable_variance = ['recorded_by', 'id']
    X = X.drop(columns=unusable_variance)
    
    # Drop duplicate columns
    duplicate_columns = ['quantity_group']
    X = X.drop(columns=duplicate_columns)
    
    # About 3% of the time, latitude has small values near zero,
    # outside Tanzania, so we'll treat these like null values
    X['latitude'] = X['latitude'].replace(-2e-08, np.nan)
    
    # When columns have zeros and shouldn't, they are like null values
    cols_with_zeros = ['construction_year', 'longitude', 'latitude', 'gps_height', 'population']
    for col in cols_with_zeros:
        X[col] = X[col].replace(0, np.nan)
    
    # Engineer boolean flags for common combinations of related columns
    X['dwe'] = (X['installer'] == 'DWE') & (X['funder'] == 'Dwe')
    X['gov_install_fund'] = (X['installer'] == 'Government') & (X['funder'] == 'Government Of Tanzania')
    # Same as gov_install_fund, but also requires a government scheme name
    X['gov'] = X['gov_install_fund'] & (X['scheme_name'] == 'Government')
    X['vwc_management'] = (X['management'] == 'vwc') & (X['scheme_management'] == 'VWC')
    # True only when all three extraction columns agree on 'gravity'
    X['gravity_extraction'] = (X['extraction_type'] == 'gravity') & (X['extraction_type_group'] == 'gravity') & (X['extraction_type_class'] == 'gravity')
    
    return X

# Wrangle train, validate, and test sets in the same way
train = wrangle(train)
val = wrangle(val)
test = wrangle(test)
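
Why replace zeros with `np.nan` before imputing? A small sketch on made-up numbers (the three population values are invented) shows how a placeholder zero drags the mean down, and how mean imputation behaves once the zero is treated as missing.

```python
import numpy as np
import pandas as pd

# Toy column where 0 is a placeholder for "unknown", not a real count
raw = pd.DataFrame({'population': [120, 0, 45]})

# Leaving the zero in place pulls the mean toward zero
naive_mean = raw['population'].mean()

# Treat the zero as missing; mean() skips NaN by default
cleaned = raw['population'].replace(0, np.nan)
clean_mean = cleaned.mean()

# Mean imputation then fills the gap with the mean of the known values
imputed = cleaned.fillna(clean_mean)

print(naive_mean, clean_mean)
print(imputed.tolist())
```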

In [39]:
# Arrange data into X features matrix and y target vector
target = 'status_group'
X_train = train.drop(columns=target)
y_train = train[target]
X_val = val.drop(columns=target)
y_val = val[target]
X_test = test

# Make pipeline!
pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='mean'), 
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
)

# Fit on train, score on val
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_val)
val_score = accuracy_score(y_val, y_pred)
print('Validation Accuracy', val_score*100)

Validation Accuracy 81.40409527789386


In [0]:
# Get feature importances
rf = pipeline.named_steps['randomforestclassifier']
importances = pd.Series(rf.feature_importances_, X_train.columns)
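
The `importances` Series above is in column order. A quick way to read it is to sort it and look at the cumulative share. The sketch below uses four made-up importance values rather than the real ones from the fitted forest.

```python
import pandas as pd

# Hypothetical importances for four columns (values invented, sum to 1)
toy_importances = pd.Series(
    [0.10, 0.45, 0.05, 0.40],
    index=['funder', 'quantity', 'permit', 'longitude'])

# Rank from most to least important
ranked = toy_importances.sort_values(ascending=False)
print(ranked)

# Cumulative share: how much of the total the top features account for
print(ranked.cumsum())
```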

In [44]:
importances.index

Index(['amount_tsh', 'funder', 'gps_height', 'installer', 'longitude',
       'latitude', 'wpt_name', 'num_private', 'basin', 'subvillage', 'region',
       'region_code', 'district_code', 'lga', 'ward', 'population',
       'public_meeting', 'scheme_management', 'scheme_name', 'permit',
       'construction_year', 'extraction_type', 'extraction_type_group',
       'extraction_type_class', 'management', 'management_group', 'payment',
       'payment_type', 'water_quality', 'quality_group', 'quantity', 'source',
       'source_type', 'source_class', 'waterpoint_type',
       'waterpoint_type_group', 'year_recorded', 'month_recorded',
       'day_recorded', 'years'],
      dtype='object')

In [45]:
X_train.columns

Index(['amount_tsh', 'funder', 'gps_height', 'installer', 'longitude',
       'latitude', 'wpt_name', 'num_private', 'basin', 'subvillage', 'region',
       'region_code', 'district_code', 'lga', 'ward', 'population',
       'public_meeting', 'scheme_management', 'scheme_name', 'permit',
       'construction_year', 'extraction_type', 'extraction_type_group',
       'extraction_type_class', 'management', 'management_group', 'payment',
       'payment_type', 'water_quality', 'quality_group', 'quantity', 'source',
       'source_type', 'source_class', 'waterpoint_type',
       'waterpoint_type_group', 'year_recorded', 'month_recorded',
       'day_recorded', 'years'],
      dtype='object')

In [0]:
# Index.drop takes labels (not columns=) and returns a new Index,
# so the result has to be assigned
f_drop = X_train.columns.drop('quantity')
f_drop
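
A pandas `Index` is immutable, so `drop` never changes it in place. A tiny sketch with three column names taken from the listing above:

```python
import pandas as pd

cols = pd.Index(['amount_tsh', 'quantity', 'source'])

# drop returns a new Index; the original is left untouched
kept = cols.drop('quantity')

print(list(cols))
print(list(kept))
```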

In [53]:
new_f_score = []
target = 'status_group'
# Refit without each feature in turn; keep any drop that beats the best score so far
for feature in X_train.columns:  
  pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='mean'), 
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
  )
  
  X_train_subset = train.drop(columns=[target, feature])
  X_val_subset = val.drop(columns=[target, feature])

  pipeline.fit(X_train_subset, y_train)
  y_pred = pipeline.predict(X_val_subset)
  score = accuracy_score(y_val, y_pred)
  print(feature, score*100)
  if score > val_score:
    val_score = score
    new_f_score = [feature, val_score]
new_f_score

amount_tsh 81.03496308678089
funder 81.24390583646748
gps_height 81.00013929516646
installer 81.06978687839532
longitude 80.81209082044853
latitude 80.9792450201978
wpt_name 81.04889260342667
num_private 81.20908204485305
basin 80.93745647026049
subvillage 81.22301156149881
region 81.02103357013512
region_code 81.1881877698844
district_code 81.04192784510377
lga 81.09764591168687
ward 81.15336397826995
population 80.63100710405348
public_meeting 81.04889260342667
scheme_management 80.98620977852069
scheme_name 81.06978687839532
permit 80.93049171193759
construction_year 81.0767516367182
extraction_type 81.06282212007243
extraction_type_group 81.11854018665552
extraction_type_class 80.98620977852069
management 80.93745647026049
management_group 81.17425825323862
payment 81.1881877698844
payment_type 81.1881877698844
water_quality 81.2856943864048
quality_group 81.25087059479036
quantity 77.86599804986767
source 81.14639921994706
source_type 80.89566792032316
source_class 81.083716395041

[]
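
The loop above is a form of drop-column importance: refit without one column and see how much the validation score changes. Here is a self-contained sketch on synthetic data. The helper name `drop_column_scores`, the two synthetic columns, and the smaller forest are assumptions made for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score


def drop_column_scores(X_train, y_train, X_val, y_val):
    """Validation accuracy after refitting without each column in turn."""
    scores = {}
    for col in X_train.columns:
        model = RandomForestClassifier(n_estimators=20, random_state=42)
        model.fit(X_train.drop(columns=col), y_train)
        y_pred = model.predict(X_val.drop(columns=col))
        scores[col] = accuracy_score(y_val, y_pred)
    # Lowest score first: dropping that column hurt the most
    return pd.Series(scores).sort_values()


# Synthetic data: the label depends only on 'signal'; 'noise' is irrelevant
rng = np.random.default_rng(0)
X = pd.DataFrame({'signal': rng.normal(size=400),
                  'noise': rng.normal(size=400)})
y = (X['signal'] > 0).astype(int)

scores = drop_column_scores(X.iloc[:300], y.iloc[:300],
                            X.iloc[300:], y.iloc[300:])
print(scores)
```

In the output above, dropping `quantity` costs several points of accuracy while most other columns move the score by only a fraction of a point, which is the same pattern this toy example is built to show.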

In [52]:
new_f_score = []
target = 'status_group'
y_train = train[target]
y_val = val[target]
# Note: importances is in column order, not sorted by value, so the slice
# below keeps the last x columns by position. Sort first to rank by importance:
# new_f = importances.sort_values().index.tolist()
for x in range(30, len(X_train.columns)+1):  
  pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='mean'), 
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
  )
  
  features = importances.iloc[-x:].index.tolist()
  
  X_train_subset = train[features]
  X_val_subset = val[features]
  
  pipeline.fit(X_train_subset, y_train)
  y_pred = pipeline.predict(X_val_subset)
  score = accuracy_score(y_val, y_pred)
  print(x, score*100)
  if score > val_score:
    val_score = score
    new_f_score = [x, val_score]
new_f_score

30 79.95542554673352
31 79.87184844685889
32 80.10168547151414
33 80.08079119654548
34 80.08775595486837
35 80.63797186237638
36 80.6240423457306
37 80.721548962251
38 80.83994985374008
39 81.03496308678089
40 81.40409527789386


[]
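
Slicing `importances[-x:]` selects by position, so on an unsorted Series it keeps the last x columns, not the x most important ones. The sketch below contrasts the two on made-up values. `nlargest` is one possible way to select by importance and is offered as a suggestion, not as what the cell above does.

```python
import pandas as pd

# Hypothetical importances in column order (values invented)
toy_importances = pd.Series(
    [0.45, 0.10, 0.40, 0.05],
    index=['quantity', 'funder', 'longitude', 'permit'])

# Positional slice: the last two columns, whatever their importance
last_two = toy_importances.iloc[-2:].index.tolist()

# Selection by value: the two most important columns
top_two = toy_importances.nlargest(2).index.tolist()

print(last_two)
print(top_two)
```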

In [49]:
# Arrange data into X features matrix and y target vector
target = 'status_group'
features = importances[-40:].index.tolist()
X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]
X_test = test[features]  # Same columns the pipeline is fit on

# Make pipeline!
pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='mean'), 
    RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)
)

# Fit on train, score on val
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_val)
val_score = accuracy_score(y_val, y_pred)
print('Validation Accuracy', val_score*100)

Validation Accuracy 81.40409527789386
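
Since the assignment asks for submitted predictions, here is a minimal sketch of building a submission table. It uses small stand-ins for `sample_submission` and the test-set predictions. The column names `id` and `status_group` and the variable `y_pred_test` are assumptions, so check them against the real `sample_submission.csv`.

```python
import pandas as pd

# Stand-in for sample_submission (assumed columns: id, status_group)
toy_sample_submission = pd.DataFrame({
    'id': [101, 102, 103],
    'status_group': ['functional'] * 3,
})

# Stand-in for pipeline.predict(X_test); must be in the same row order
y_pred_test = ['functional', 'non functional', 'functional needs repair']

# Copy the template so the original stays unchanged, then overwrite the target
submission = toy_sample_submission.copy()
submission['status_group'] = y_pred_test

# With no path, to_csv returns the CSV text; pass a filename to write a file
csv_text = submission.to_csv(index=False)
print(csv_text)
```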
