Lambda School Data Science

*Unit 2, Sprint 2, Module 2*

---

# Random Forests

## Assignment
- [x] Read [“Adopting a Hypothesis-Driven Workflow”](http://archive.is/Nu3EI), a blog post by a Lambda DS student about the Tanzania Waterpumps challenge.
- [x] Continue to participate in our Kaggle challenge.
- [x] Define a function to wrangle train, validate, and test sets in the same way. Clean outliers and engineer features.
- [x] Try Ordinal Encoding.
- [X] Try a Random Forest Classifier.
- [x] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [x] Commit your notebook to your fork of the GitHub repo.

## Stretch Goals

### Doing
- [ ] Add your own stretch goal(s) !
- [ ] Do more exploratory data analysis, data cleaning, feature engineering, and feature selection.
- [ ] Try other [categorical encodings](https://contrib.scikit-learn.org/category_encoders/).
- [ ] Get and plot your feature importances.
- [ ] Make visualizations and share on Slack.

### Reading

Top recommendations in _**bold italic:**_

#### Decision Trees
- A Visual Introduction to Machine Learning, [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/),  and _**[Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)**_
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU)

#### Random Forests
- [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/), Chapter 8: Tree-Based Methods
- [Coloring with Random Forests](http://structuringtheunstructured.blogspot.com/2017/11/coloring-with-random-forests.html)
- _**[Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)**_

#### Categorical encoding for trees
- [Are categorical variables getting lost in your random forests?](https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/)
- [Beyond One-Hot: An Exploration of Categorical Variables](http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/)
- _**[Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)**_
- _**[Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)**_
- [Mean (likelihood) encodings: a comprehensive study](https://www.kaggle.com/vprokopev/mean-likelihood-encodings-a-comprehensive-study)
- [The Mechanics of Machine Learning, Chapter 6: Categorically Speaking](https://mlbook.explained.ai/catvars.html)

#### Imposter Syndrome
- [Effort Shock and Reward Shock (How The Karate Kid Ruined The Modern World)](http://www.tempobook.com/2014/07/09/effort-shock-and-reward-shock/)
- [How to manage impostor syndrome in data science](https://towardsdatascience.com/how-to-manage-impostor-syndrome-in-data-science-ad814809f068)
- ["I am not a real data scientist"](https://brohrer.github.io/imposter_syndrome.html)
- _**[Imposter Syndrome in Data Science](https://caitlinhudon.com/2018/01/19/imposter-syndrome-in-data-science/)**_


### More Categorical Encodings

**1.** The article **[Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)** mentions 4 encodings:

- **"Categorical Encoding":** This means using the raw categorical values as-is, not encoded. Scikit-learn doesn't support this, but some tree algorithm implementations do. For example, [Catboost](https://catboost.ai/), or R's [rpart](https://cran.r-project.org/web/packages/rpart/index.html) package.
- **Numeric Encoding:** Synonymous with Label Encoding, or "Ordinal" Encoding with random order. We can use [category_encoders.OrdinalEncoder](https://contrib.scikit-learn.org/category_encoders/ordinal.html).
- **One-Hot Encoding:** We can use [category_encoders.OneHotEncoder](https://contrib.scikit-learn.org/category_encoders/onehot.html).
- **Binary Encoding:** We can use [category_encoders.BinaryEncoder](https://contrib.scikit-learn.org/category_encoders/binary.html).


**2.** The short video 
**[Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)** introduces an interesting idea: use both X _and_ y to encode categoricals.

Category Encoders has multiple implementations of this general concept:

- [CatBoost Encoder](https://contrib.scikit-learn.org/category_encoders/catboost.html)
- [Generalized Linear Mixed Model Encoder](https://contrib.scikit-learn.org/category_encoders/glmm.html)
- [James-Stein Encoder](https://contrib.scikit-learn.org/category_encoders/jamesstein.html)
- [Leave One Out](https://contrib.scikit-learn.org/category_encoders/leaveoneout.html)
- [M-estimate](https://contrib.scikit-learn.org/category_encoders/mestimate.html)
- [Target Encoder](https://contrib.scikit-learn.org/category_encoders/targetencoder.html)
- [Weight of Evidence](https://contrib.scikit-learn.org/category_encoders/woe.html)

Category Encoder's mean encoding implementations work for regression problems or binary classification problems. 

For multi-class classification problems, you will need to temporarily reformulate it as binary classification. For example:

```python
encoder = ce.TargetEncoder(min_samples_leaf=..., smoothing=...) # Both parameters > 1 to avoid overfitting
X_train_encoded = encoder.fit_transform(X_train, y_train=='functional')
X_val_encoded = encoder.transform(X_train, y_val=='functional')
```

For this reason, mean encoding won't work well within pipelines for multi-class classification problems.

**3.** The **[dirty_cat](https://dirty-cat.github.io/stable/)** library has a Target Encoder implementation that works with multi-class classification.

```python
 dirty_cat.TargetEncoder(clf_type='multiclass-clf')
```
It also implements an interesting idea called ["Similarity Encoder" for dirty categories](https://www.slideshare.net/GaelVaroquaux/machine-learning-on-non-curated-data-154905090).

However, it seems like dirty_cat doesn't handle missing values or unknown categories as well as category_encoders does. And you may need to use it with one column at a time, instead of with your whole dataframe.

**4. [Embeddings](https://www.kaggle.com/colinmorris/embedding-layers)** can work well with sparse / high cardinality categoricals.

_**I hope it’s not too frustrating or confusing that there’s not one “canonical” way to encode categoricals. It’s an active area of research and experimentation — maybe you can make your own contributions!**_

### Setup

You can work locally (follow the [local setup instructions](https://lambdaschool.github.io/ds/unit2/local/)) or on Colab (run the code cell below).

In [3]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [4]:
import pandas as pd
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np

In [13]:

def setup_data(target):   #load kaggle data and return 3 way split for target feature
    """take in the target feature, load kaggle and return 3-way split for processing,
    train, val, test,features, target"""

    kaggle_train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                     pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))
    test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')

    train, val = train_test_split(kaggle_train, train_size=0.80, test_size=0.20, 
                                  stratify=kaggle_train[target], random_state=8)
    features = train.columns.drop(target) #intial set all features
    return train, val, test, features, target


In [14]:

def wrangle(X): 
    """wrangle single dataframe, return processed copy"""

    X = X.copy()

    #all strings to lower   #needs work
  #  catcols = X.describe(exclude='number').columns
   # for col in catcols:
        #    X[col] = X[col].str.lower()


    X.scheme_management = X.scheme_management.fillna('unknown')
    X.management = X.management.str.lower()
    X.scheme_management = X.scheme_management.str.lower()
    if 'manage' not in X.index:
        X['manage'] = X.management + X.scheme_management
      
       # About 3% of the time, latitude has small values near zero,
    # outside Tanzania, so we'll treat these values like zero.
    if ('latitude' in X.index):
        X['latitude'] = X['latitude'].replace(-2e-08, 0)
        # When columns have zeros and shouldn't, they are like null values.
        # So we will replace the zeros with nulls, and impute missing values later.
        cols_with_zeros = ['longitude', 'latitude']
        for col in cols_with_zeros:
            X[col] = X[col].replace(0, np.nan)
    if ('permit' in X.index):
        X['permit'] = X['permit'].astype('str')
        X['permit'] = X['permit'].replace({'True': 'yes','False': 'no'})
   
    #X['age'] = pd.DatetimeIndex(X['date_recorded']).year - X.construction_year #not good due to zeros
      # Convert date_recorded to datetime
    X['date_recorded'] = pd.to_datetime(X['date_recorded'], infer_datetime_format=True)
    
    # Extract components from date_recorded, then drop the original column
    X['year_recorded'] = X['date_recorded'].dt.year
    X['month_recorded'] = X['date_recorded'].dt.month
    X['day_recorded'] = X['date_recorded'].dt.day
    
    X['years'] = X['year_recorded'] - X['construction_year']   #age 
    X['years_MISSING'] = X['years'].isnull()     

    dropcols = ['wpt_name',
                'date_recorded',
                'ward','scheme_name',
                'payment_type',     #duplicate of payment
                'quantity_group',   #same as quantity
                'recorded_by',      #1 unique value
                'id']

    for i in dropcols:
        if i in X.columns:
            X.drop(labels=i, axis=1, inplace=True)

    #deal generically with  nans here last resort 
    nancols = X.isnull().sum()[X.isnull().sum() > 0 ].index
    for col in nancols:
        X[col].fillna(value='missing',inplace=True)
    return X


In [15]:

def arrange(df, features):
    """ take in single dataframe, either train or val containing features
    and target, process all columns, and return x and y vectors for features, target lists::
    X_train, y_train = arrange(traindata, featurelist, target) 
    """
    data = wrangle(df) #get new processsed dataframe and column list

    X = data[features]
    y = df[target]
    return X,y

In [16]:

from sklearn.preprocessing import FunctionTransformer
import category_encoders as ce
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline


dt_pipe = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    DecisionTreeClassifier(min_samples_leaf=20, random_state=88)
)

rf_pipe = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='mean'),
    RandomForestClassifier( random_state=88)
)
rf_he = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    RandomForestClassifier( random_state=88)
)


In [17]:
train,val,test, allfeatures, target = setup_data('status_group')
catcols = train.select_dtypes(exclude='number')
numcols = train.select_dtypes(include='number').columns.drop('id').tolist()


X= wrangle(train)
y = train[target]

allfeatures = X.columns.drop(target)

card = X.select_dtypes(exclude='number').nunique()
catf = card[card <= 50].index.tolist()
feat_dt = numcols + catf
print(allfeatures)

Index(['amount_tsh', 'funder', 'gps_height', 'installer', 'longitude',
       'latitude', 'num_private', 'basin', 'subvillage', 'region',
       'region_code', 'district_code', 'lga', 'population', 'public_meeting',
       'scheme_management', 'permit', 'construction_year', 'extraction_type',
       'extraction_type_group', 'extraction_type_class', 'management',
       'management_group', 'payment', 'water_quality', 'quality_group',
       'quantity', 'source', 'source_type', 'source_class', 'waterpoint_type',
       'waterpoint_type_group', 'manage', 'year_recorded', 'month_recorded',
       'day_recorded', 'years', 'years_MISSING'],
      dtype='object')


In [43]:
train,val,test, _, target = setup_data('status_group')

features = allfeatures ## change selected features here

X_train= wrangle(train)[features]
y_train = train[target]
X_val, y_val = arrange(val,features)

X_test = wrangle(test)[features]

rf_pipe.fit(X_train,y_train)
score = rf_pipe.score(X_val,y_val)

dt_pipe.fit(X_train,y_train)

print ('random forest,  untuned validation accuracy',score)
print('decision tree with ordinal encoding', dt_pipe.score(X_val,y_val))



random forest,  untuned validation accuracy 0.8025252525252525
decision tree with ordinal encoding 0.765993265993266


In [45]:
rf_pipe.predict(X_test).shape,test.shape

((14358,), (14358, 40))

In [71]:

sub_rf_oe =pd.DataFrame( data=rf_pipe.predict(X_test), index=test['id'])
sub_dt_oe= pd.DataFrame( data=dt_pipe.predict(X_test), index=test['id'])
sub_rf_oe.columns = ['status_group']
sub_dt_oe.columns = ['status_group']
sub_rf_oe.to_csv('random_forest_oe.csv')
sub_dt_oe.to_csv('decision_tree_oe.csv')

In [72]:
sub_rf_oe

Unnamed: 0_level_0,status_group
id,Unnamed: 1_level_1
50785,functional
51630,functional
17168,functional
45559,non functional
49871,functional
...,...
39307,non functional
18990,functional
28749,functional
33492,functional


train,val,test, allfeatures, target = setup_data('status_group')
catcols = train.select_dtypes(exclude='number')
numcols = train.select_dtypes(include='number').columns.drop('id').tolist()


X= wrangle(train)
y = wrangle(train)[target]
X_val = wrangle(val)
y_val = val[target]
X_test =wrangle(test)
allfeatures = X.columns.drop(target)

card = X.select_dtypes(exclude='number').nunique()
catf = card[card <= 50].index.tolist()
feat_dt = numcols + catf
X= X[numcols]
rf_pipe.fit(X,y)
score = rf_pipe.score(X_val[numcols],y_val)
print ('random forest, numeric cols, untuned Validation accuracy',score)
y_pred = rf_pipe.predict()