Lambda School Data Science

*Unit 2, Sprint 2, Module 2*

---

# Random Forests

## Assignment
- [ ] Read [“Adopting a Hypothesis-Driven Workflow”](https://outline.com/5S5tsB), a blog post by a Lambda DS student about the Tanzania Waterpumps challenge.
- [ ] Continue to participate in our Kaggle challenge.
- [ ] Define a function to wrangle train, validate, and test sets in the same way. Clean outliers and engineer features.
- [ ] Try Ordinal Encoding.
- [ ] Try a Random Forest Classifier.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.

## Stretch Goals

### Doing
- [ ] Add your own stretch goal(s) !
- [ ] Do more exploratory data analysis, data cleaning, feature engineering, and feature selection.
- [ ] Try other [categorical encodings](https://contrib.scikit-learn.org/categorical-encoding/).
- [ ] Get and plot your feature importances.
- [ ] Make visualizations and share on Slack.

### Reading

Top recommendations in _**bold italic:**_

#### Decision Trees
- A Visual Introduction to Machine Learning, [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/),  and _**[Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)**_
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU)

#### Random Forests
- [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/), Chapter 8: Tree-Based Methods
- [Coloring with Random Forests](http://structuringtheunstructured.blogspot.com/2017/11/coloring-with-random-forests.html)
- _**[Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)**_

#### Categorical encoding for trees
- [Are categorical variables getting lost in your random forests?](https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/)
- [Beyond One-Hot: An Exploration of Categorical Variables](http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/)
- _**[Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)**_
- _**[Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)**_
- [Mean (likelihood) encodings: a comprehensive study](https://www.kaggle.com/vprokopev/mean-likelihood-encodings-a-comprehensive-study)
- [The Mechanics of Machine Learning, Chapter 6: Categorically Speaking](https://mlbook.explained.ai/catvars.html)

#### Imposter Syndrome
- [Effort Shock and Reward Shock (How The Karate Kid Ruined The Modern World)](http://www.tempobook.com/2014/07/09/effort-shock-and-reward-shock/)
- [How to manage impostor syndrome in data science](https://towardsdatascience.com/how-to-manage-impostor-syndrome-in-data-science-ad814809f068)
- ["I am not a real data scientist"](https://brohrer.github.io/imposter_syndrome.html)
- _**[Imposter Syndrome in Data Science](https://caitlinhudon.com/2018/01/19/imposter-syndrome-in-data-science/)**_


### More Categorical Encodings

**1.** The article **[Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)** mentions 4 encodings:

- **"Categorical Encoding":** This means using the raw categorical values as-is, not encoded. Scikit-learn doesn't support this, but some tree algorithm implementations do. For example, [Catboost](https://catboost.ai/), or R's [rpart](https://cran.r-project.org/web/packages/rpart/index.html) package.
- **Numeric Encoding:** Synonymous with Label Encoding, or "Ordinal" Encoding with random order. We can use [category_encoders.OrdinalEncoder](https://contrib.scikit-learn.org/categorical-encoding/ordinal.html).
- **One-Hot Encoding:** We can use [category_encoders.OneHotEncoder](http://contrib.scikit-learn.org/categorical-encoding/onehot.html).
- **Binary Encoding:** We can use [category_encoders.BinaryEncoder](http://contrib.scikit-learn.org/categorical-encoding/binary.html).


**2.** The short video 
**[Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)** introduces an interesting idea: use both X _and_ y to encode categoricals.

Category Encoders has multiple implementations of this general concept:

- [CatBoost Encoder](http://contrib.scikit-learn.org/categorical-encoding/catboost.html)
- [James-Stein Encoder](http://contrib.scikit-learn.org/categorical-encoding/jamesstein.html)
- [Leave One Out](http://contrib.scikit-learn.org/categorical-encoding/leaveoneout.html)
- [M-estimate](http://contrib.scikit-learn.org/categorical-encoding/mestimate.html)
- [Target Encoder](http://contrib.scikit-learn.org/categorical-encoding/targetencoder.html)
- [Weight of Evidence](http://contrib.scikit-learn.org/categorical-encoding/woe.html)

Category Encoder's mean encoding implementations work for regression problems or binary classification problems. 

For multi-class classification problems, you will need to temporarily reformulate it as binary classification. For example:

```python
encoder = ce.TargetEncoder(min_samples_leaf=..., smoothing=...) # Both parameters > 1 to avoid overfitting
X_train_encoded = encoder.fit_transform(X_train, y_train=='functional')
X_val_encoded = encoder.transform(X_train, y_val=='functional')
```

For this reason, mean encoding won't work well within pipelines for multi-class classification problems.

**3.** The **[dirty_cat](https://dirty-cat.github.io/stable/)** library has a Target Encoder implementation that works with multi-class classification.

```python
 dirty_cat.TargetEncoder(clf_type='multiclass-clf')
```
It also implements an interesting idea called ["Similarity Encoder" for dirty categories](https://www.slideshare.net/GaelVaroquaux/machine-learning-on-non-curated-data-154905090).

However, it seems like dirty_cat doesn't handle missing values or unknown categories as well as category_encoders does. And you may need to use it with one column at a time, instead of with your whole dataframe.

**4. [Embeddings](https://www.kaggle.com/learn/embeddings)** can work well with sparse / high cardinality categoricals.

_**I hope it’s not too frustrating or confusing that there’s not one “canonical” way to encode categoricals. It’s an active area of research and experimentation! Maybe you can make your own contributions!**_

### Setup

You can work locally (follow the [local setup instructions](https://lambdaschool.github.io/ds/unit2/local/)) or on Colab (run the code cell below).

In [0]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

train.shape, test.shape

((59400, 41), (14358, 40))

In [3]:
#okay lets get that validation set
train, val = train_test_split(train, train_size = 0.80, test_size = .20,
                              stratify = train['status_group'], 
                              random_state = 42)

train.shape, val.shape, test.shape

((47520, 41), (11880, 41), (14358, 40))

In [4]:
train.head()

Unnamed: 0,id,amount_tsh,date_recorded,funder,gps_height,installer,longitude,latitude,wpt_name,num_private,basin,subvillage,region,region_code,district_code,lga,ward,population,public_meeting,recorded_by,scheme_management,scheme_name,permit,construction_year,extraction_type,extraction_type_group,extraction_type_class,management,management_group,payment,payment_type,water_quality,quality_group,quantity,quantity_group,source,source_type,source_class,waterpoint_type,waterpoint_type_group,status_group
43360,72938,0.0,2011-07-27,,0,,33.542898,-9.174777,Kwa Mzee Noa,0,Lake Nyasa,Mpandapanda,Mbeya,12,4,Rungwe,Kiwira,0,True,GeoData Consultants Ltd,VWC,K,,0,gravity,gravity,gravity,vwc,user-group,never pay,never pay,soft,good,insufficient,insufficient,spring,spring,groundwater,communal standpipe,communal standpipe,functional
7263,65358,500.0,2011-03-23,Rc Church,2049,ACRA,34.66576,-9.308548,Kwa Yasinta Ng'Ande,0,Rufiji,Kitichi,Iringa,11,4,Njombe,Imalinyi,175,True,GeoData Consultants Ltd,WUA,Tove Mtwango gravity Scheme,True,2008,gravity,gravity,gravity,wua,user-group,pay monthly,monthly,soft,good,enough,enough,spring,spring,groundwater,communal standpipe,communal standpipe,functional
2486,469,25.0,2011-03-07,Donor,290,Do,38.238568,-6.179919,Kwasungwini,0,Wami / Ruvu,Kwedigongo,Pwani,6,1,Bagamoyo,Mbwewe,2300,True,GeoData Consultants Ltd,VWC,,False,2010,india mark ii,india mark ii,handpump,vwc,user-group,pay per bucket,per bucket,salty,salty,insufficient,insufficient,shallow well,shallow well,groundwater,hand pump,hand pump,functional
313,1298,0.0,2011-07-31,Government Of Tanzania,0,DWE,30.716727,-1.289055,Kwajovin 2,0,Lake Victoria,Kihanga,Kagera,18,1,Karagwe,Isingiro,0,True,GeoData Consultants Ltd,,,True,0,other,other,other,vwc,user-group,never pay,never pay,soft,good,enough,enough,shallow well,shallow well,groundwater,other,other,non functional
52726,27001,0.0,2011-03-10,Water,0,Gove,35.389331,-6.399942,Chama,0,Internal,Mtakuj,Dodoma,1,6,Bahi,Nondwa,0,True,GeoData Consultants Ltd,VWC,Zeje,True,0,mono,mono,motorpump,vwc,user-group,pay per bucket,per bucket,soft,good,enough,enough,machine dbh,borehole,groundwater,communal standpipe,communal standpipe,functional


In [5]:
train.columns

Index(['id', 'amount_tsh', 'date_recorded', 'funder', 'gps_height',
       'installer', 'longitude', 'latitude', 'wpt_name', 'num_private',
       'basin', 'subvillage', 'region', 'region_code', 'district_code', 'lga',
       'ward', 'population', 'public_meeting', 'recorded_by',
       'scheme_management', 'scheme_name', 'permit', 'construction_year',
       'extraction_type', 'extraction_type_group', 'extraction_type_class',
       'management', 'management_group', 'payment', 'payment_type',
       'water_quality', 'quality_group', 'quantity', 'quantity_group',
       'source', 'source_type', 'source_class', 'waterpoint_type',
       'waterpoint_type_group', 'status_group'],
      dtype='object')

In [0]:
#side note, I'm really glad to see some of the stuff wrangled and caugh last
#assignment showed up in lecture notebook.  Glad to know I'm on the right track
import numpy as np


def wrangle(df):

  #deep copy since we are changing values and the shape of our data. Don't want
  #any warnings.
  df = df.copy()

  #Changing the almost zeros to zero on latitude.  Zero latitude is definitely
  #a mistake, as it is on another part of the world

  df['latitude'] = df['latitude'].replace(-2e08, 0)

  #last assignment I explored and found what columns had too many zeros, so we
  #are just going to build off that.
  columns = ['gps_height', 'longitude', 'latitude', 'population', 
             'construction_year']

  for column in columns:
    df[column].replace(0, np.nan, inplace = True)

    #I'm taking this from the lecture note book.  I suspect what we are doing
    #is creating a boolean column, to identify which of our data is collected
    #vs imputed(happening later in the pipeline).  Then our model may make
    #better decisions, because it can weigh the importance of imputed vs
    #collected data.  Just a theory.

    #quote from Xander: missing values may be a predictive signal
    df[column+'_MISSING'] = df[column].isnull()

  #drop the dupes
  df.drop(columns = ['quantity_group', 'payment_type'], inplace = True)

  #drop never varying, and always varying columns
  df.drop(columns = ['recorded_by', 'id'], inplace = True)

  #Convert date_recorded to dattime
  df['date_recorded'] = pd.to_datetime(df['date_recorded'], 
                                       infer_datetime_format = True)
  
  #making date more usable by extracting month, day and year
  df['year_recorded'] = df['date_recorded'].dt.year
  df['month_recorded'] = df['date_recorded'].dt.month
  df['day_recorded'] = df['date_recorded'].dt.day

  #dropping the datetime column
  df.drop(columns = 'date_recorded', inplace = True)

  #Cool feature I'm borrowing from class: How many years from construction to
  #date recorded
  df['years'] = df['year_recorded'] - df['construction_year']
  df['years_MISSING'] = df['years'].isnull()

  return df

In [0]:
train = wrangle(train)
val = wrangle(val)
test = wrangle(test)

#Baseline
Okay, I'm going to use an mvp based off our lecture notes as my baseline.  I will then form some hypothesis, test them, and incorporate them into my final model.  Hopefully through this iterative process I can improve my accuracy.

In [0]:
#since we are doing ordinal encoding, all I really need is to identify my target
target = 'status_group'

X_train = train.drop(columns = target)
y_train = train[target]
X_val = val.drop(columns = target)
y_val = val[target]
#no need to drop from test features, it is already dropped.

In [28]:
X_train.shape, X_val.shape, test.shape

((47520, 45), (11880, 45), (14358, 45))

In [9]:
%%time

import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy = 'mean'),
    RandomForestClassifier(n_jobs = -1, random_state = 42)
)

pipeline.fit(X_train, y_train)
print('Training Accuracy: ', pipeline.score(X_train, y_train))
print('Validation Accuracy: ', pipeline.score(X_val, y_val))

Training Accuracy:  0.9999789562289563
Validation Accuracy:  0.8104377104377104
CPU times: user 20.6 s, sys: 306 ms, total: 20.9 s
Wall time: 11.7 s


#Hypothesis #1
Target Encoding will improved accuracy over ordinal encoding.

In [31]:
%%time

for i in range(2, 41, 5):
  for j in range(2, 41, 5):

    encoder = ce.TargetEncoder(min_samples_leaf = i, smoothing = j)

    X_train_encoded = encoder.fit_transform(X_train, y_train == 'functional')
    X_val_encoded = encoder.transform(X_val)

    pipeline = make_pipeline(
        SimpleImputer(strategy = 'mean'),
        RandomForestClassifier(n_jobs = -1, random_state = 42)
    )

    pipeline.fit(X_train_encoded, y_train == 'functional')
    print('min_samples_leaf: ', i)
    print('smoothing: ', j)
    print('Training Accuracy: ', pipeline.score(X_train_encoded, y_train))
    print('Validation Accuracy: ', pipeline.score(X_val_encoded, y_val))

min_samples_leaf:  2
smoothing:  2
Training Accuracy:  0.0
Validation Accuracy:  0.0
min_samples_leaf:  2
smoothing:  7
Training Accuracy:  0.0
Validation Accuracy:  0.0
min_samples_leaf:  2
smoothing:  12
Training Accuracy:  0.0
Validation Accuracy:  0.0
min_samples_leaf:  2
smoothing:  17
Training Accuracy:  0.0
Validation Accuracy:  0.0


KeyboardInterrupt: ignored

Unconfigured it didn't increase my accuracy, but after some reading I realized that I need to tweak the hyper parameters, and that it isn't really an encoder you can just "set and forget".  So I jumped in and started tweaking.
I think I will use this as my first submission model, and iterate a few more times and call it good.

In [23]:
#Smoothing seems to stay the most accurate around 2.  Where as min_samples_leaf
#is consistently increasing.  I ran it once, and I begin to lose accuracy
#after about 35.  So I'm going to go ahead and find where it is highest around
#there
%%time

for i in range(30, 37):
  pipeline = make_pipeline(
      ce.TargetEncoder(min_samples_leaf = i, smoothing = 2),
      SimpleImputer(strategy = 'mean'),
      RandomForestClassifier(n_jobs = -1, random_state = 42)
  )

  pipeline.fit(X_train, y_train == 'functional')
  print('min_samples_leaf: ', i)
  print('Training Accuracy: ', pipeline.score(X_train, y_train == 'functional'))
  print('Validation Accuracy: ', pipeline.score(X_val, y_val == 'functional'))

min_samples_leaf:  30
Training Accuracy:  0.9998526936026936
Validation Accuracy:  0.8088383838383838
min_samples_leaf:  31
Training Accuracy:  0.999810606060606
Validation Accuracy:  0.823063973063973
min_samples_leaf:  32
Training Accuracy:  0.9997053872053872
Validation Accuracy:  0.8232323232323232
min_samples_leaf:  33
Training Accuracy:  0.999452861952862
Validation Accuracy:  0.8245791245791246
min_samples_leaf:  34
Training Accuracy:  0.9994739057239057
Validation Accuracy:  0.8246632996632997
min_samples_leaf:  35
Training Accuracy:  0.9995159932659933
Validation Accuracy:  0.8272727272727273
min_samples_leaf:  36
Training Accuracy:  0.9994949494949495
Validation Accuracy:  0.8271043771043771
CPU times: user 2min 26s, sys: 1.65 s, total: 2min 28s
Wall time: 1min 25s


In [29]:
pipeline = make_pipeline(
    ce.TargetEncoder(min_samples_leaf = 35, smoothing = 2),
    SimpleImputer(strategy = 'mean'),
    RandomForestClassifier(n_jobs = -1, random_state = 42)
)

pipeline.fit(X_train, y_train == 'functional')
print('min_samples_leaf: ', 35)
print('Training Accuracy: ', pipeline.score(X_train, y_train))
print('Validation Accuracy: ', pipeline.score(X_val, y_val))

min_samples_leaf:  35
Training Accuracy:  0.0
Validation Accuracy:  0.0


In [0]:
y_pred = pipeline.predict(test)

submission = sample_submission.copy()

submission['status_group'] = y_pred
submission.to_csv('forest_submission_1.csv', index = False)