<a href="https://colab.research.google.com/github/ezorigo/DS-Unit-2-Kaggle-Challenge/blob/master/module2/assignment_kaggle_challenge_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science, Unit 2: Predictive Modeling

# Kaggle Challenge, Module 2

## Assignment
- [ ] Read [“Adopting a Hypothesis-Driven Workflow”](https://outline.com/5S5tsB), a blog post by a Lambda DS student about the Tanzania Waterpumps challenge.
- [ ] Continue to participate in our Kaggle challenge.
- [ ] Try Ordinal Encoding.
- [ ] Try a Random Forest Classifier.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.

## Stretch Goals

### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Do more exploratory data analysis, data cleaning, feature engineering, and feature selection.
- [ ] Try other [categorical encodings](https://contrib.scikit-learn.org/categorical-encoding/).
- [ ] Get and plot your feature importances.
- [ ] Make visualizations and share on Slack.
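
For the feature importances goal, a minimal sketch follows. It assumes a pipeline built with `make_pipeline`, as in the cells further down, and it uses synthetic data from `make_classification` so it runs on its own. The names `demo_features` and `demo_pipeline` are hypothetical, not part of the assignment.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for the challenge data (assumption: 6 numeric features, 3 classes)
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)
demo_features = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(6)])

demo_pipeline = make_pipeline(
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
)
demo_pipeline.fit(demo_features, y)

# make_pipeline names each step after its lowercased class name
forest = demo_pipeline.named_steps['randomforestclassifier']

# Pair each importance with its column name and sort ascending
importances = pd.Series(forest.feature_importances_,
                        index=demo_features.columns).sort_values()
print(importances)
```

If matplotlib is installed, `importances.plot.barh()` draws the sorted values as a horizontal bar chart. With the real pipeline, the index would presumably be `x_train.columns`, provided no step adds or drops columns. That is an assumption worth checking by comparing lengths first.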

### Reading

Top recommendations in _**bold italic:**_

#### Decision Trees
- A Visual Introduction to Machine Learning, [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/), and _**[Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)**_
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU)

#### Random Forests
- [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/), Chapter 8: Tree-Based Methods
- [Coloring with Random Forests](http://structuringtheunstructured.blogspot.com/2017/11/coloring-with-random-forests.html)
- _**[Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)**_

#### Categorical encoding for trees
- [Are categorical variables getting lost in your random forests?](https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/)
- [Beyond One-Hot: An Exploration of Categorical Variables](http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/)
- _**[Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)**_
- _**[Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)**_
- [Mean (likelihood) encodings: a comprehensive study](https://www.kaggle.com/vprokopev/mean-likelihood-encodings-a-comprehensive-study)
- [The Mechanics of Machine Learning, Chapter 6: Categorically Speaking](https://mlbook.explained.ai/catvars.html)

#### Imposter Syndrome
- [Effort Shock and Reward Shock (How The Karate Kid Ruined The Modern World)](http://www.tempobook.com/2014/07/09/effort-shock-and-reward-shock/)
- [How to manage impostor syndrome in data science](https://towardsdatascience.com/how-to-manage-impostor-syndrome-in-data-science-ad814809f068)
- ["I am not a real data scientist"](https://brohrer.github.io/imposter_syndrome.html)
- _**[Imposter Syndrome in Data Science](https://caitlinhudon.com/2018/01/19/imposter-syndrome-in-data-science/)**_






### Setup

You can work locally (follow the [local setup instructions](https://lambdaschool.github.io/ds/unit2/local/)) or on Colab (run the code cell below).

In [0]:
import os, sys
in_colab = 'google.colab' in sys.modules

# If you're in Colab...
if in_colab:
    # Pull files from Github repo
    os.chdir('/content')
    !git init .
    !git remote add origin https://github.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge.git
    !git pull origin master
    
    # Install required python packages
    !pip install -r requirements.txt
    
    # Change into directory for module
    os.chdir('module2')

In [0]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Merge train_features.csv & train_labels.csv
train = pd.merge(pd.read_csv('../data/tanzania/train_features.csv'), 
                 pd.read_csv('../data/tanzania/train_labels.csv'))

# Read test_features.csv & sample_submission.csv
test = pd.read_csv('../data/tanzania/test_features.csv')
sample_submission = pd.read_csv('../data/tanzania/sample_submission.csv')

# Train/validation split
train, val = train_test_split(train,
                              test_size=0.25,
                              stratify=train['status_group'], 
                              random_state=0)

# Data wrangling

def wrangle(x):
  """Wrangle train, val, and test sets in the same way."""

  # Work on a copy so the caller's DataFrame is not modified
  x = x.copy()

  # Treat the near-zero latitude outlier (-2e-08) as 0,
  # so the loop below marks it as missing
  x['latitude'] = x['latitude'].replace(-2e-08, 0)

  # Replace 0s with np.nan and add a boolean "_missing" indicator per column
  cols_with_zeros = ['longitude', 'latitude', 'amount_tsh', 'construction_year', 'gps_height', 'population']
  for col in cols_with_zeros:
    x[col] = x[col].replace(0, np.nan)
    x[col+'_missing'] = x[col].isna()

  # Drop columns that duplicate other columns
  x = x.drop(columns=['quantity_group', 'payment_type'])
  
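  # Parse date_recorded as a datetime
  # (infer_datetime_format is deprecated in pandas 2.0, so it is omitted)
  x['date_recorded'] = pd.to_datetime(x['date_recorded'])

  # Extract year, month, and day components
  x['year_recorded'] = x['date_recorded'].dt.year
  x['month_recorded'] = x['date_recorded'].dt.month
  x['day_recorded'] = x['date_recorded'].dt.day

  # Drop the original column now that its components are extracted
  x = x.drop(columns='date_recorded')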

  # Engineer feature: years elapsed from construction_year to year_recorded
  x['years'] = x['year_recorded'] - x['construction_year']
  x['years_MISSING'] = x['years'].isna()

  # Drop recorded_by and id
  x = x.drop(columns=['recorded_by', 'id'])
  
  return x

# apply wrangle() to all sets
train = wrangle(train)
val = wrangle(val)
test = wrangle(test)

# features matrix and target vector
target = 'status_group'

x_train = train.drop(columns=target)
y_train = train[target]

x_val = val.drop(columns=target)
y_val = val[target]

x_test = test

In [7]:
%%time

import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='median'), 
    RandomForestClassifier(n_estimators=100,
                           random_state=0,
                           min_samples_split=8,
                           # 'auto' meant 'sqrt' for classifiers; it was removed in scikit-learn 1.3
                           max_features='sqrt',
                           max_depth=None,
                           n_jobs=-1
    )
)

# Fit on train, score on val
pipeline.fit(x_train, y_train)

# Predict on the test set; y_pred is used for the submission below
y_pred = pipeline.predict(x_test)

# print scores
print('Train Accuracy', pipeline.score(x_train, y_train))
print('Validation Accuracy', pipeline.score(x_val, y_val))


Train Accuracy 0.9359820426487093
Validation Accuracy 0.8172390572390572
CPU times: user 25.1 s, sys: 373 ms, total: 25.5 s
Wall time: 14.7 s


In [0]:
# Validation accuracy by imputer strategy and min_samples_split
# most_frequent, 8 = .816835
# median, 8 = .817239
# mean, 9, 36 = .817979
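
Scores like the ones noted above can also be collected in a loop rather than by hand. The sketch below tries each imputer strategy with a couple of `min_samples_split` values and prints the validation accuracy for each pair. It uses synthetic data and hypothetical names (`X_tr`, `X_va`, `results`) so it runs on its own. Because the synthetic data has no missing values, the imputer strategy makes no difference here, and the loop only illustrates the bookkeeping.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (assumption: 8 numeric features, 3 classes)
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25,
                                          stratify=y, random_state=0)

results = []
for strategy, split in product(['most_frequent', 'median', 'mean'], [4, 8]):
    model = make_pipeline(
        SimpleImputer(strategy=strategy),
        RandomForestClassifier(n_estimators=50, min_samples_split=split,
                               random_state=0, n_jobs=-1)
    )
    model.fit(X_tr, y_tr)
    results.append((strategy, split, model.score(X_va, y_va)))

# Print the combinations from best to worst validation accuracy
for strategy, split, acc in sorted(results, key=lambda r: r[2], reverse=True):
    print(f'{strategy}, {split} = {acc:.6f}')
```

On the real data, the same loop could wrap the notebook's pipeline with `ce.OrdinalEncoder()` as the first step and score against `x_val`, `y_val`.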

In [54]:
%%time

pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='median'), 
    RandomForestClassifier(n_estimators=100,
                           random_state=0,
                           min_samples_split=4,
                           min_samples_leaf=1,
                           max_features=1,
                           bootstrap=True,
                           class_weight=None,
                           max_depth=None,
                           n_jobs=-1
    )
)

# Fit on train, score on val
pipeline.fit(x_train, y_train)

# print scores
print('Train Accuracy', pipeline.score(x_train, y_train))
print('Validation Accuracy', pipeline.score(x_val, y_val))

Train Accuracy 0.9739169472502806
Validation Accuracy 0.8024242424242424
CPU times: user 11 s, sys: 110 ms, total: 11.1 s
Wall time: 6.63 s


In [0]:
from sklearn.linear_model import LogisticRegressionCV

pipeline = make_pipeline(
    ce.HashingEncoder(), 
    SimpleImputer(strategy='median'), 
    LogisticRegressionCV(multi_class='multinomial',
                         solver='newton-cg',
                         n_jobs=-1,
                         random_state=0
    )
)

# Fit on train, score on val
pipeline.fit(x_train, y_train)

# print scores
print('Train Accuracy', pipeline.score(x_train, y_train))
print('Validation Accuracy', pipeline.score(x_val, y_val))



In [61]:
from sklearn.neural_network import MLPClassifier

pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(strategy='median'), 
    MLPClassifier(activation='relu',
                  solver='sgd',
                  verbose=False,
                  learning_rate='invscaling',
                  power_t=0.5,
                  shuffle=True,
                  momentum=0.5,
                  nesterovs_momentum=False,
                  tol=1e-4,
                  alpha=1e-4,
                  random_state=0
    )
)

# Fit on train, score on val
pipeline.fit(x_train, y_train)

# print scores
print('Train Accuracy', pipeline.score(x_train, y_train))
print('Validation Accuracy', pipeline.score(x_val, y_val))

Train Accuracy 0.542716049382716
Validation Accuracy 0.5384511784511784
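
The MLP scores much lower than the random forests above. One thing to try, offered as an assumption rather than a diagnosis, is scaling the features before the network, since gradient-based training is generally sensitive to features on very different scales. The sketch below adds `StandardScaler` between the imputer and the classifier, using synthetic data and a hypothetical pipeline name (`scaled_mlp`). It does not claim any particular score on the challenge data.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: 8 numeric features, 3 classes)
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)

# Exaggerate one feature's scale to mimic mixing small and large-valued columns
X[:, 0] *= 1000

scaled_mlp = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler(),                      # zero mean, unit variance per feature
    MLPClassifier(random_state=0, max_iter=500)
)
scaled_mlp.fit(X, y)

train_accuracy = scaled_mlp.score(X, y)
print('Train Accuracy', train_accuracy)
```

In the notebook's own pipeline, the scaler would go after `SimpleImputer` and before `MLPClassifier`, with `ce.OrdinalEncoder()` still first.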


### Submission

In [0]:
# y_pred holds the test-set predictions from the first random forest pipeline above
submission = sample_submission.copy()
submission['status_group'] = y_pred
submission.to_csv('submission-14.csv', index=False)

In [0]:
if in_colab:
    from google.colab import files
    files.download('submission-14.csv')