<a href="https://colab.research.google.com/github/elliotgunn/DS-Unit-2-Kaggle-Challenge/blob/master/assignment_kaggle_challenge_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Kaggle Challenge, Module 2

## Assignment
- [ ] Read [“Adopting a Hypothesis-Driven Workflow”](https://outline.com/5S5tsB), a blog post by a Lambda DS student about the Tanzania Waterpumps challenge.
- [ ] Continue to participate in our Kaggle challenge.
- [ ] Try Ordinal Encoding.
- [ ] Try a Random Forest Classifier.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.

## Stretch Goals

### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Do more exploratory data analysis, data cleaning, feature engineering, and feature selection.
- [ ] Try other [categorical encodings](https://contrib.scikit-learn.org/categorical-encoding/).
- [ ] Get and plot your feature importances.
- [ ] Make visualizations and share on Slack.

### Reading

Top recommendations in _**bold italic:**_

#### Decision Trees
- A Visual Introduction to Machine Learning, [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/),  and _**[Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)**_
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU)

#### Random Forests
- [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/), Chapter 8: Tree-Based Methods
- [Coloring with Random Forests](http://structuringtheunstructured.blogspot.com/2017/11/coloring-with-random-forests.html)
- _**[Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)**_

#### Categorical encoding for trees
- [Are categorical variables getting lost in your random forests?](https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/)
- [Beyond One-Hot: An Exploration of Categorical Variables](http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/)
- _**[Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)**_
- _**[Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)**_
- [Mean (likelihood) encodings: a comprehensive study](https://www.kaggle.com/vprokopev/mean-likelihood-encodings-a-comprehensive-study)
- [The Mechanics of Machine Learning, Chapter 6: Categorically Speaking](https://mlbook.explained.ai/catvars.html)

#### Imposter Syndrome
- [Effort Shock and Reward Shock (How The Karate Kid Ruined The Modern World)](http://www.tempobook.com/2014/07/09/effort-shock-and-reward-shock/)
- [How to manage impostor syndrome in data science](https://towardsdatascience.com/how-to-manage-impostor-syndrome-in-data-science-ad814809f068)
- ["I am not a real data scientist"](https://brohrer.github.io/imposter_syndrome.html)
- _**[Imposter Syndrome in Data Science](https://caitlinhudon.com/2018/01/19/imposter-syndrome-in-data-science/)**_






In [0]:
# If you're in Colab...
import os, sys
in_colab = 'google.colab' in sys.modules

if in_colab:
    # Install required python packages:
    # category_encoders, version >= 2.0
    # pandas-profiling, version >= 2.0
    # plotly, version >= 4.0
    !pip install --upgrade category_encoders pandas-profiling plotly
    
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge.git
    !git pull origin master
    
    # Change into directory for module
    os.chdir('module2')

In [0]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Merge train_features.csv & train_labels.csv
train = pd.merge(pd.read_csv('../data/tanzania/train_features.csv'), 
                 pd.read_csv('../data/tanzania/train_labels.csv'))

# Read test_features.csv & sample_submission.csv
test = pd.read_csv('../data/tanzania/test_features.csv')
sample_submission = pd.read_csv('../data/tanzania/sample_submission.csv')


# Split train into train & val
train, val = train_test_split(train, train_size=0.80, test_size=0.20, 
                              stratify=train['status_group'], random_state=42)

def clean(X):
  
  # make a copy before modifying
  X = X.copy()
  
  # duplicates, near duplicates, missing values
  X = X.drop(columns=['payment', 'quantity_group', 'source_type', 'waterpoint_type', 
         'extraction_type', 'extraction_type_class', 'management_group',
         'water_quality', 'num_private'])
  
  # About 3% of the time, latitude has small values near zero,
  # outside Tanzania, so we'll treat these values like zero.
  X['latitude'] = X['latitude'].replace(-2e-08, 0)
  
  # some columns have zeros and shouldn't, they are like null values
  # replace those zeros with nulls, impute missing values later
  cols_with_zeros = ['longitude', 'latitude', 'population', 'construction_year',
                    'gps_height']
  for col in cols_with_zeros:
      X[col] = X[col].replace(0, np.nan)
  
  # extract year, month, day from date_recorded
  X['date_recorded'] = pd.to_datetime(X['date_recorded'])
  X['year_recorded'] = X['date_recorded'].dt.year
  X['month_recorded'] = X['date_recorded'].dt.month
  X['day_recorded'] = X['date_recorded'].dt.day
  # delete date_recorded
  X = X.drop(columns='date_recorded')

  # age of pump at time of inspection
  X['pump_age'] = X['year_recorded'] - X['construction_year']
  # a few rows have a negative age (recorded before construction), so set those to np.nan
  X['pump_age'] = X['pump_age'].where(X['pump_age'] >= 0)
  # remember to deal with missing years
  X['years_missing'] = X['pump_age'].isnull()
  
  # drop recorded_by (never varies) and id (always varies, random)
  X = X.drop(columns=['recorded_by', 'id'])
  
  # return the clean df
  return X

train = clean(train)
val = clean(val)
test = clean(test)

In [0]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Merge train_features.csv & train_labels.csv
train = pd.merge(pd.read_csv('../data/tanzania/train_features.csv'), 
                 pd.read_csv('../data/tanzania/train_labels.csv'))

# Read test_features.csv & sample_submission.csv
test = pd.read_csv('../data/tanzania/test_features.csv')
sample_submission = pd.read_csv('../data/tanzania/sample_submission.csv')


# Split train into train & val
train, val = train_test_split(train, train_size=0.80, test_size=0.20, 
                              stratify=train['status_group'], random_state=42)

def clean(X):
  
  # make a copy before modifying
  X = X.copy()
  
  # duplicates, near duplicates, missing values
  X = X.drop(columns=['payment', 'quantity_group'])
  
  # About 3% of the time, latitude has small values near zero,
  # outside Tanzania, so we'll treat these values like zero.
  X['latitude'] = X['latitude'].replace(-2e-08, 0)
  
  # some columns have zeros and shouldn't, they are like null values
  # replace those zeros with nulls, impute missing values later
  cols_with_zeros = ['longitude', 'latitude', 'population', 'construction_year',
                    'gps_height']
  for col in cols_with_zeros:
      X[col] = X[col].replace(0, np.nan)
      # create a missing-values indicator column
      X[col+'_missing'] = X[col].isnull()
  
  # extract year, month, day from date_recorded
  X['date_recorded'] = pd.to_datetime(X['date_recorded'])
  X['year_recorded'] = X['date_recorded'].dt.year
  X['month_recorded'] = X['date_recorded'].dt.month
  X['day_recorded'] = X['date_recorded'].dt.day
  # delete date_recorded
  X = X.drop(columns='date_recorded')

  # age of pump at time of inspection
  X['pump_age'] = X['year_recorded'] - X['construction_year']
  # a few rows have a negative age (recorded before construction), so set those to np.nan
  X['pump_age'] = X['pump_age'].where(X['pump_age'] >= 0)
  # remember to deal with missing years
  X['years_missing'] = X['pump_age'].isnull()
  
  # drop recorded_by (never varies) and id (always varies, random)
  X = X.drop(columns=['recorded_by', 'id'])
  
  # return the clean df
  return X

train = clean(train)
val = clean(val)
test = clean(test)

In [0]:
# The status_group column is the target
target = 'status_group'

# Get a dataframe with all train columns except the target
train_features = train.drop(columns=[target])

# Get a list of the numeric features
numeric_features = train_features.select_dtypes(include='number').columns.tolist()

# Get a series with the cardinality of the nonnumeric features
cardinality = train_features.select_dtypes(exclude='number').nunique()

# Get a list of all categorical features with cardinality <= 50
categorical_features = cardinality[cardinality <= 50].index.tolist()

# Combine the lists 
features = numeric_features + categorical_features

In [0]:
# Arrange data into X features matrix and y target vector 
X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]
X_test = test[features]

## Random Forest Classifier with Ordinal Encoder

In [89]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# already arranged X features matrix, y target vector

# pipeline

pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='mean'),
    RandomForestClassifier(n_estimators=500, random_state=42, n_jobs=-1,
                          oob_score=True, min_samples_leaf = 1)
)

# Fit on train, score on val
pipeline.fit(X_train, y_train)
print('Validation Accuracy', pipeline.score(X_val, y_val))

Validation Accuracy 0.8118686868686869


In [87]:
# to see how many features were added through ordinal encoding

encoder = pipeline.named_steps['ordinalencoder']
encoded = encoder.transform(X_train)

print(X_train.shape)
print(encoded.shape)

(47520, 38)
(47520, 38)
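
The column count stays at 38 because ordinal encoding replaces each categorical column with a single integer column. The sketch below illustrates the idea in plain Python and contrasts it with one-hot encoding, which creates one column per category. The category values are made up for illustration, and the integer codes `category_encoders` assigns may differ from this toy mapping.

```python
# Toy categorical column; the values are illustrative, not taken from the dataset.
column = ["spring", "river", "spring", "dam", "river"]


def ordinal_encode(values):
    """Map each category to an integer in order of first appearance (one column out)."""
    mapping = {}
    for value in values:
        mapping.setdefault(value, len(mapping) + 1)
    return [mapping[value] for value in values], mapping


def one_hot_encode(values):
    """Create one 0/1 column per distinct category (k columns out)."""
    categories = list(dict.fromkeys(values))  # distinct values, first-seen order
    rows = [[int(value == category) for category in categories] for value in values]
    return rows, categories


ordinal_column, mapping = ordinal_encode(column)
one_hot_rows, categories = one_hot_encode(column)

print(ordinal_column)        # still a single column of integers
print(len(one_hot_rows[0]))  # one column per distinct category
```

With many categorical features, the one-hot version grows much wider, while the ordinal version keeps the original width.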


### Parameter tuning

n_estimators: the number of trees in the forest. More trees generally make predictions more stable, with diminishing returns, but training and prediction take longer. Choose as many as your compute budget allows.

min_samples_leaf: the minimum number of training samples required at a leaf, the end node of a decision tree. Smaller leaves make the model more prone to capturing noise in the training data. Try several leaf sizes to find the best one.

n_jobs: how many processor cores the model may use in parallel. -1 means use all available cores.

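oob_score: whether to compute an out-of-bag score, in which each training row is scored using only the trees that did not see it in their bootstrap sample. It works like a built-in validation estimate and is much faster than leave-one-out cross-validation.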
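
To see why out-of-bag rows exist at all, here is a minimal sketch in plain Python, assuming each tree is trained on a bootstrap sample (rows drawn with replacement, which is scikit-learn's default for random forests). The row and tree counts are arbitrary illustration values. In the notebook's pipeline, the fitted forest stores the resulting estimate in its `oob_score_` attribute when `oob_score=True`.

```python
import random


def average_oob_fraction(n_rows, n_trees, seed=42):
    """Average share of rows each tree never sees in its bootstrap sample."""
    rng = random.Random(seed)
    fractions = []
    for _ in range(n_trees):
        # Bootstrap sample: draw n_rows row indices with replacement.
        in_bag = {rng.randrange(n_rows) for _ in range(n_rows)}
        fractions.append(1 - len(in_bag) / n_rows)
    return sum(fractions) / n_trees


average_oob = average_oob_fraction(n_rows=1000, n_trees=50)
print(f"Average out-of-bag fraction: {average_oob:.3f}")
```

Roughly a third of the rows are left out of each tree, so every row ends up with a set of trees that can score it without having trained on it.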

max_features: the number of features to consider at each split. For example, if the data has 50 features and max_features is 10, then at each split the tree randomly selects 10 features and chooses the best split among them. A float is interpreted as a fraction of the total number of features.
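
The feature subsampling described above can be sketched in plain Python. This is a simplified illustration, not scikit-learn's implementation: the helper names are hypothetical, and the float rule follows what the scikit-learn documentation states for `max_features` (`max(1, int(max_features * n_features))`). The figure of 38 features assumes the forest sees the 38 ordinal-encoded columns.

```python
import random


def features_per_split(max_features, n_features):
    """Resolve an int or float max_features setting to a feature count."""
    if isinstance(max_features, float):
        # A float is a fraction of all features, with at least one feature kept.
        return max(1, int(max_features * n_features))
    return max_features


def candidate_features(n_features, max_features, rng):
    """Pick the random subset of feature indices considered at one split."""
    k = features_per_split(max_features, n_features)
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(42)

# The example from the text: 50 features, 10 candidates per split.
print(candidate_features(50, 10, rng))
print(candidate_features(50, 10, rng))  # a different subset at the next split

# Fractions like the ones tried in this notebook, assuming 38 features.
for fraction in [0.1, 0.2, 0.3]:
    print(fraction, features_per_split(fraction, 38))
```

Smaller subsets make the trees less correlated with each other, at the cost of each individual split being chosen from fewer options.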




In [79]:
sample_leaf_options = [1,5,10]

for leaf_size in sample_leaf_options:
  # pipeline

  pipeline = make_pipeline(
      ce.OrdinalEncoder(),
      SimpleImputer(strategy='mean'),
      RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1,
                            min_samples_leaf = leaf_size)
  )

  # Fit on train, score on val
  pipeline.fit(X_train, y_train)
  
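  print(f"Validation accuracy, min_samples_leaf={leaf_size}: ", pipeline.score(X_val, y_val))

Validation accuracy, min_samples_leaf=1:  0.8097643097643098
Validation accuracy, min_samples_leaf=5:  0.8087542087542088
Validation accuracy, min_samples_leaf=10:  0.7984006734006734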


In [82]:
import numpy as np
max_features_options = [0.1, 0.2, 0.3]

for num_features in max_features_options:
  # pipeline
  pipeline = make_pipeline(
      ce.OrdinalEncoder(),
      SimpleImputer(strategy='mean'),
      RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1,
                            max_features = num_features, min_samples_leaf = 1)
  )

  # Fit on train, score on val
  pipeline.fit(X_train, y_train)
  
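  print(f"Validation accuracy, max_features={num_features}: ", pipeline.score(X_val, y_val))


Validation accuracy, max_features=0.1:  0.80993265993266
Validation accuracy, max_features=0.2:  0.8111111111111111
Validation accuracy, max_features=0.3:  0.8101010101010101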
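
The three scores above come from `pipeline.score`, which for a classifier is accuracy, and they differ only in the third decimal place. A rough way to judge whether such differences matter is to convert them to counts of correct predictions and compare the spread to a binomial standard error. The sketch below does this with the recorded scores. The validation size of 11,880 is an assumption derived from the 47,520 training rows and the 80/20 split, and the standard error is only a crude guide because all three models are scored on the same rows.

```python
import math

# Assumed validation size: 47,520 training rows are 80%, so 20% is 11,880 rows.
n_val = 11880

# Validation scores recorded above, keyed by max_features.
accuracy_by_max_features = {
    0.1: 0.80993265993266,
    0.2: 0.8111111111111111,
    0.3: 0.8101010101010101,
}

# Convert each accuracy to a count of correctly classified validation rows.
correct = {k: round(acc * n_val) for k, acc in accuracy_by_max_features.items()}

best = max(accuracy_by_max_features, key=accuracy_by_max_features.get)
best_accuracy = accuracy_by_max_features[best]

# Binomial standard error of an accuracy estimated on n_val rows.
standard_error = math.sqrt(best_accuracy * (1 - best_accuracy) / n_val)
spread = max(accuracy_by_max_features.values()) - min(accuracy_by_max_features.values())

print(correct)
print(f"best max_features: {best}")
print(f"spread: {spread:.4f}, standard error: {standard_error:.4f}")
```

The best and worst settings differ by only 14 correct predictions, and the spread is smaller than one standard error, so this sweep alone gives weak evidence for preferring one `max_features` value.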


## Random Forest Classifier with One-Hot Encoder

In [88]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    RandomForestClassifier(n_estimators=500, random_state=42, n_jobs=-1,
                          oob_score=True, min_samples_leaf = 1)
)

# n_estimators = number of trees in the forest
# n_jobs=-1 uses all available processor cores

# fit on train, score on val
pipeline.fit(X_train, y_train)
print('Validation Accuracy: ', pipeline.score(X_val, y_val))

Validation Accuracy:  0.8115319865319865


In [70]:
# to see how many features were added through one hot encoding
# (one-hot encoding adds a column per category, so encoded should be wider than X_train)

encoder = pipeline.named_steps['onehotencoder']
encoded = encoder.transform(X_train)

print(X_train.shape)
print(encoded.shape)


In [0]:
# use the one hot encoder model

y_pred = pipeline.predict(X_test)

## Submit

In [0]:
# Write submission csv file
submission = sample_submission.copy()
submission['status_group'] = y_pred
submission.to_csv('submission-04.csv', index=False)

In [0]:
#from google.colab import drive
#drive.mount('/content/drive')
#%env KAGGLE_CONFIG_DIR=/content/drive/My Drive/

In [92]:
!kaggle competitions submit -c ds6-predictive-modeling-challenge -f submission-04.csv -m "fourth"

100% 266k/266k [00:03<00:00, 90.2kB/s]
Successfully submitted to DS6 Predictive Modeling Challenge