Lambda School Data Science

*Unit 2, Sprint 2, Module 2*

---

# Random Forests

## Assignment
- [ ] Read [“Adopting a Hypothesis-Driven Workflow”](https://outline.com/5S5tsB), a blog post by a Lambda DS student about the Tanzania Waterpumps challenge.
- [ ] Continue to participate in our Kaggle challenge.
- [ ] Define a function to wrangle train, validate, and test sets in the same way. Clean outliers and engineer features.
- [ ] Try Ordinal Encoding.
- [ ] Try a Random Forest Classifier.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.

## Stretch Goals

### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Do more exploratory data analysis, data cleaning, feature engineering, and feature selection.
- [ ] Try other [categorical encodings](https://contrib.scikit-learn.org/categorical-encoding/).
- [ ] Get and plot your feature importances.
- [ ] Make visualizations and share on Slack.

### Reading

Top recommendations in _**bold italic:**_

#### Decision Trees
- A Visual Introduction to Machine Learning, [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/), and _**[Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)**_
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU)

#### Random Forests
- [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/), Chapter 8: Tree-Based Methods
- [Coloring with Random Forests](http://structuringtheunstructured.blogspot.com/2017/11/coloring-with-random-forests.html)
- _**[Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)**_

#### Categorical encoding for trees
- [Are categorical variables getting lost in your random forests?](https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/)
- [Beyond One-Hot: An Exploration of Categorical Variables](http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/)
- _**[Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)**_
- _**[Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)**_
- [Mean (likelihood) encodings: a comprehensive study](https://www.kaggle.com/vprokopev/mean-likelihood-encodings-a-comprehensive-study)
- [The Mechanics of Machine Learning, Chapter 6: Categorically Speaking](https://mlbook.explained.ai/catvars.html)

#### Imposter Syndrome
- [Effort Shock and Reward Shock (How The Karate Kid Ruined The Modern World)](http://www.tempobook.com/2014/07/09/effort-shock-and-reward-shock/)
- [How to manage impostor syndrome in data science](https://towardsdatascience.com/how-to-manage-impostor-syndrome-in-data-science-ad814809f068)
- ["I am not a real data scientist"](https://brohrer.github.io/imposter_syndrome.html)
- _**[Imposter Syndrome in Data Science](https://caitlinhudon.com/2018/01/19/imposter-syndrome-in-data-science/)**_


### More Categorical Encodings

**1.** The article **[Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)** mentions 4 encodings:

- **"Categorical Encoding":** This means using the raw categorical values as-is, not encoded. Scikit-learn doesn't support this, but some tree algorithm implementations do, such as [CatBoost](https://catboost.ai/) or R's [rpart](https://cran.r-project.org/web/packages/rpart/index.html) package.
- **Numeric Encoding:** Synonymous with Label Encoding, or "Ordinal" Encoding with random order. We can use [category_encoders.OrdinalEncoder](https://contrib.scikit-learn.org/categorical-encoding/ordinal.html).
- **One-Hot Encoding:** We can use [category_encoders.OneHotEncoder](http://contrib.scikit-learn.org/categorical-encoding/onehot.html).
- **Binary Encoding:** We can use [category_encoders.BinaryEncoder](http://contrib.scikit-learn.org/categorical-encoding/binary.html).
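
To make the difference between these encodings concrete, here is a minimal plain-Python sketch. The helper functions `ordinal_encode`, `one_hot_encode`, and `binary_encode` are hypothetical names written for this illustration. They show the general idea only and are not meant to reproduce the exact output of category_encoders.

```python
def ordinal_encode(values):
    """Map each distinct category to an integer, in order of first appearance."""
    mapping = {}
    for v in values:
        mapping.setdefault(v, len(mapping) + 1)
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Build one 0/1 column per distinct category."""
    categories = list(dict.fromkeys(values))  # distinct values, order preserved
    return [[int(v == c) for c in categories] for v in values], categories


def binary_encode(values):
    """Ordinal-encode, then spread each integer's binary digits across columns."""
    codes, _ = ordinal_encode(values)
    width = max(codes).bit_length()
    return [[(code >> bit) & 1 for bit in reversed(range(width))] for code in codes]


basins = ['Rufiji', 'Pangani', 'Rufiji', 'Lake Nyasa', 'Lake Victoria', 'Lake Rukwa']

ordinal_codes, ordinal_mapping = ordinal_encode(basins)
one_hot_rows, one_hot_columns = one_hot_encode(basins)
binary_rows = binary_encode(basins)

print(ordinal_codes)          # a single integer column
print(len(one_hot_rows[0]))   # one column per distinct category
print(len(binary_rows[0]))    # roughly log2(number of categories) columns
```

With five distinct categories, ordinal encoding needs one column, one-hot encoding needs five, and binary encoding needs three. That gap widens quickly for high-cardinality features.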


**2.** The short video 
**[Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)** introduces an interesting idea: use both X _and_ y to encode categoricals.

Category Encoders has multiple implementations of this general concept:

- [CatBoost Encoder](http://contrib.scikit-learn.org/categorical-encoding/catboost.html)
- [James-Stein Encoder](http://contrib.scikit-learn.org/categorical-encoding/jamesstein.html)
- [Leave One Out](http://contrib.scikit-learn.org/categorical-encoding/leaveoneout.html)
- [M-estimate](http://contrib.scikit-learn.org/categorical-encoding/mestimate.html)
- [Target Encoder](http://contrib.scikit-learn.org/categorical-encoding/targetencoder.html)
- [Weight of Evidence](http://contrib.scikit-learn.org/categorical-encoding/woe.html)

Category Encoders' mean encoding implementations work for regression problems or binary classification problems.

For multi-class classification problems, you will need to temporarily reformulate it as binary classification. For example:

```python
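import category_encoders as ce

encoder = ce.TargetEncoder(min_samples_leaf=..., smoothing=...) # Both parameters > 1 to avoid overfitting
# Fit on the training set only, using a binary version of the target
X_train_encoded = encoder.fit_transform(X_train, y_train=='functional')
# Apply the fitted encoding to the validation features (no labels needed)
X_val_encoded = encoder.transform(X_val)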
```

For this reason, mean encoding won't work well within pipelines for multi-class classification problems.
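
The idea behind mean encoding can be sketched in a few lines of plain Python. This is a simplified illustration, not the formula used by `ce.TargetEncoder`. The function names, the smoothing weight `m`, and the toy data are all assumptions made up for the example. The smoothing here is a simple additive scheme that pulls rare categories toward the overall mean.

```python
from collections import defaultdict


def fit_mean_encoding(categories, is_positive, m=10.0):
    """Learn a smoothed mean of a binary target for each category.

    m is a hypothetical smoothing weight: categories with few rows are
    pulled toward the overall (prior) mean.
    """
    prior = sum(is_positive) / len(is_positive)
    counts = defaultdict(int)
    positives = defaultdict(int)
    for cat, pos in zip(categories, is_positive):
        counts[cat] += 1
        positives[cat] += int(pos)
    encoding = {cat: (positives[cat] + m * prior) / (counts[cat] + m)
                for cat in counts}
    return encoding, prior


def apply_mean_encoding(categories, encoding, prior):
    """Encode with the fitted values; unseen categories fall back to the prior."""
    return [encoding.get(cat, prior) for cat in categories]


# Toy data, made up for illustration (not taken from the waterpumps dataset)
train_quantity = ['enough', 'enough', 'dry', 'enough', 'dry', 'seasonal']
train_status = ['functional', 'functional', 'non functional',
                'functional needs repair', 'non functional', 'functional']
val_quantity = ['dry', 'enough', 'unknown']

# Reformulate the three-class target as binary: functional vs. everything else
train_is_functional = [s == 'functional' for s in train_status]

encoding, prior = fit_mean_encoding(train_quantity, train_is_functional, m=2.0)
train_encoded = apply_mean_encoding(train_quantity, encoding, prior)
val_encoded = apply_mean_encoding(val_quantity, encoding, prior)  # uses no validation labels
print(encoding, val_encoded)
```

The encoding is learned from the training rows only and then applied to the validation rows. Using validation labels at this step would leak the target into the features.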

**3.** The **[dirty_cat](https://dirty-cat.github.io/stable/)** library has a Target Encoder implementation that works with multi-class classification.

```python
dirty_cat.TargetEncoder(clf_type='multiclass-clf')
```
It also implements an interesting idea called ["Similarity Encoder" for dirty categories](https://www.slideshare.net/GaelVaroquaux/machine-learning-on-non-curated-data-154905090).

However, it seems like dirty_cat doesn't handle missing values or unknown categories as well as category_encoders does. And you may need to use it with one column at a time, instead of with your whole dataframe.

**4. [Embeddings](https://www.kaggle.com/learn/embeddings)** can work well with sparse / high cardinality categoricals.

_**I hope it’s not too frustrating or confusing that there’s not one “canonical” way to encode categoricals. It’s an active area of research and experimentation! Maybe you can make your own contributions!**_

### Setup

You can work locally (follow the [local setup instructions](https://lambdaschool.github.io/ds/unit2/local/)) or on Colab (run the code cell below).

In [2]:
%%capture
import sys

# Editing local, bulk processing in Colab!  Hence:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# Primary location:
else:
    DATA_PATH = '../data/'

In [3]:
%%capture
!pip install folium
!pip install category_encoders
!pip install graphviz

In [35]:
import pandas as pd
import numpy as np
import folium
import plotly.express as px
import category_encoders as ce
import graphviz
import random, time, datetime

from sklearn.tree import export_graphviz
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from IPython.display import HTML, display, IFrame
from itertools import combinations

In [46]:
# for filenaming later

def timeStamp():
    ts = time.time()
    st = datetime.datetime.fromtimestamp(ts).strftime('%m%d_%H%M%S')
    return st

In [5]:
train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

train, val = train_test_split(train, train_size=.8, test_size=.2, 
                              stratify=train['status_group'], random_state=55)

train.shape, val.shape, test.shape, sample_submission.shape

((47520, 41), (11880, 41), (14358, 40), (14358, 2))

In [6]:
def cleanData(X):
    
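    # Work on a copy so the caller's dataframe is not modified
    # (this also avoids pandas' SettingWithCopyWarning)
    X = X.copy()
    
    # Latitude values of -2e-08 are effectively zero; set them to 0
    # so the next step treats them as missing
    X['latitude'] = X['latitude'].replace(-2e-08, 0)
    
    # Replace 0s with np.nan so the values can be imputed later
    cols_w_zeros = ['longitude', 'latitude']
    for col in cols_w_zeros:
        X[col] = X[col].replace(0, np.nan)
    
    # Drop quantity_group (a duplicate of quantity) and amount_tsh
    X = X.drop(columns=['quantity_group', 'amount_tsh'])
    
    # Parse the recording date and split it into day, month, and year
    X['date_recorded'] = pd.to_datetime(X['date_recorded'])
    X['dd_recorded'] = X['date_recorded'].dt.day
    X['mm_recorded'] = X['date_recorded'].dt.month
    X['yyyy_recorded'] = X['date_recorded'].dt.year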
    
    return X

In [7]:
train = cleanData(train)
val = cleanData(val)
test = cleanData(test)

In [8]:
# Constant target
target = 'status_group'

# Defining different features for future isolation
features_all = train.drop(columns=['status_group', 'id'])

# Isolate numeric features
features_numeric = features_all.select_dtypes(include='number').columns.tolist()

# Isolate object features
features_obj = features_all.select_dtypes(exclude='number').nunique()

# Split non-numeric features into low and high cardinality (threshold: 45 unique values)
features_obj_low = features_obj[features_obj <= 45].index.tolist()
features_obj_high = features_obj[features_obj > 45].index.tolist()
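
The cardinality split above matters mostly for one-hot encoding, which adds one column per distinct value. The following plain-Python sketch shows the same split on toy data. The function name `split_by_cardinality` and the toy columns are assumptions for illustration. Unlike pandas' `nunique`, this sketch makes no special allowance for missing values.

```python
def split_by_cardinality(columns, threshold=45):
    """Split column names into low- and high-cardinality lists.

    columns maps each column name to its list of values.
    """
    cardinality = {name: len(set(values)) for name, values in columns.items()}
    low = [name for name, n in cardinality.items() if n <= threshold]
    high = [name for name, n in cardinality.items() if n > threshold]
    return low, high, cardinality


# Toy columns (made up): 3 distinct basins, 60 distinct waterpoint names
toy_columns = {
    'basin': ['Rufiji', 'Pangani', 'Lake Nyasa'] * 20,
    'wpt_name': ['pump_%d' % i for i in range(60)],
}

low, high, cardinality = split_by_cardinality(toy_columns)

# One-hot encoding adds one column per distinct value
one_hot_width_low = sum(cardinality[name] for name in low)
one_hot_width_all = sum(cardinality.values())
print(low, high, one_hot_width_low, one_hot_width_all)
```

Keeping only the low-cardinality columns holds the one-hot width at 3 here, compared with 63 if every column were encoded.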

In [82]:
# function to call within models to log Validate Score

def logOutcome(model, features, modelVars, score):
    newLine = pd.DataFrame([[timeStamp(), model, features, 
                             modelVars, score]],
                           columns=['timestamp',
                                    'model',
                                    'featuresUsed',
                                    'modelVars',
                                    'valScore'])
    
    logDf = pd.read_csv('../tracking/loggingProcesses')
    # DataFrame.append was removed in pandas 2.0; pd.concat works across versions
    logDf = pd.concat([logDf, newLine], ignore_index=True)
    logDf.to_csv('../tracking/loggingProcesses', index=False)

In [83]:
# ran once to create csv for logging:
#
# tempDF = pd.DataFrame(
#     {'timestamp': timeStamp(),
#      'model': 'modelUsed',
#      'featuresUsed': ['selectedFeatures'],
#      'modelVars': ['modelVariables'],
#      'valScore': 99.999
#     }
# )

# tempDF.to_csv('../tracking/loggingProcesses', index=False)
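
The logging function relies on the file having been created once by hand. One possible alternative, sketched here with only the standard library, is to write the header the first time the file is used. The name `log_outcome`, the `LOG_COLUMNS` constant, and the scores passed in the demo are placeholders for illustration, not part of the notebook's workflow.

```python
import csv
import os
import tempfile
from datetime import datetime

LOG_COLUMNS = ['timestamp', 'model', 'featuresUsed', 'modelVars', 'valScore']


def log_outcome(path, model, features, model_vars, score):
    """Append one row to the CSV log, writing the header if the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, 'a', newline='') as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(LOG_COLUMNS)
        writer.writerow([datetime.now().strftime('%m%d_%H%M%S'),
                         model, list(features), model_vars, score])


# Demo in a temporary directory so nothing in ../tracking/ is touched
log_path = os.path.join(tempfile.mkdtemp(), 'loggingProcesses')
log_outcome(log_path, 'DecisionTree', ['gps_height', 'longitude'],
            [['leaf', 7], ['depth', 20]], 0.7717)   # placeholder score
log_outcome(log_path, 'RandomForest', ['gps_height', 'longitude'],
            [['estimators', 100]], 0.8049)          # placeholder score

with open(log_path, newline='') as f:
    rows = list(csv.reader(f))
print(rows[0], len(rows))
```

Appending a row this way also avoids rereading and rewriting the whole log on every call.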

In [62]:
# I want to test on multiple fronts, different models & with different features
# Rather than redefine the features variable each time, I will assign chosen sets 

featuresAll = features_all.columns
features01 = features_numeric + features_obj_low
features02 = ['gps_height', 'longitude',
            'latitude', 'num_private', 'basin', 'region',
            'region_code', 'district_code', 'population',
            'public_meeting', 'recorded_by', 'scheme_management',
            'permit', 'construction_year', 'extraction_type',
            'extraction_type_class', 'management',
            'payment', 'water_quality',
            'quantity', 'source', 'source_type', 'source_class',
            'waterpoint_type',
            'mm_recorded', 'yyyy_recorded']

In [76]:
def runDecisionTreeModel(train, val, test, features, leaf, depth):
    '''
    - Runs a decision tree on the train/val/test dataframes
    - Prints train and validation accuracy
    - Logs validation accuracy to an external CSV
    - Returns y_pred (test-set predictions) for use in a competition submission
    '''
    
    state_num = random.randint(1, 100)
    
    pipeline = make_pipeline(
        ce.OneHotEncoder(use_cat_names=True),
        SimpleImputer(strategy='mean'),
        DecisionTreeClassifier(min_samples_leaf = leaf,
                               max_depth = depth,
                               random_state=state_num)
    )
    
    X_train = train[features]
    y_train = train[target]
    X_val = val[features]
    y_val = val[target]
    X_test = test[features]
    
    pipeline.fit(X_train, y_train)
    train_score = pipeline.score(X_train, y_train)
    val_score = pipeline.score(X_val, y_val)
    
    print('Decision Tree Accuracy: \nTrain:    ', train_score,
          '\nValidate: ', val_score, '\n')
    
    modelVars = [['leaf', leaf],['depth', depth]]
    
    logOutcome('DecisionTree', features, modelVars, val_score)
    
    y_pred = pipeline.predict(X_test)
    
    return y_pred


In [108]:
def runRandomForestModel(train, val, test, features, estimators):
    '''
    - Runs a random forest on the train/val/test dataframes
    - Prints train and validation accuracy
    - Logs validation accuracy to an external CSV
    - Returns y_pred (test-set predictions) for use in a competition submission
    '''
    
    state_num = random.randint(1, 100)
    
    pipeline = make_pipeline(
        ce.OneHotEncoder(use_cat_names=True),
        SimpleImputer(strategy='mean'),
        RandomForestClassifier(n_estimators=estimators,
                               random_state=state_num,
                               n_jobs=-2)
    )
    
    X_train = train[features]
    y_train = train[target]
    X_val = val[features]
    y_val = val[target]
    X_test = test[features]
    
    pipeline.fit(X_train, y_train)
    train_score = pipeline.score(X_train, y_train)
    val_score = pipeline.score(X_val, y_val)
    
    print('Random Forest Accuracy: \nTrain:    ', train_score,
          '\nValidate: ', val_score, '\n')
    
    modelVars = [['estimators', estimators]]
    
    logOutcome('RandomForest', features, modelVars, val_score)
    
    y_pred = pipeline.predict(X_test)
    
    return y_pred

In [109]:
runRandomForestModel(train, val, test, features01, 100)

Random Forest Accuracy: 
Train:     0.9977483164983165 
Validate:  0.8035353535353535 



array(['functional', 'functional', 'functional', ..., 'functional',
       'functional', 'non functional'], dtype=object)

In [91]:
runDecisionTreeModel(train, val, test, features02, 6, 21)

Decision Tree Accuracy: 
Train:     0.8431186868686869 
Validate:  0.7718855218855218 



array(['functional', 'functional', 'functional', ..., 'functional',
       'functional', 'non functional'], dtype=object)

In [110]:
pd.read_csv('../tracking/loggingProcesses').head(10)

Unnamed: 0,timestamp,model,featuresUsed,modelVars,valScore
0,1106_133648,modelUsed,selectedFeatures,modelVariables,99.999
1,1106_133749,DecisionTree,"['gps_height', 'longitude', 'latitude', 'num_p...","[['leaf', 7], ['depth', 20]]",0.771717
2,1106_133812,RandomForest,"['gps_height', 'longitude', 'latitude', 'num_p...","[['estimators', 100]]",0.804882
3,1106_133925,RandomForest,"['gps_height', 'longitude', 'latitude', 'num_p...","[['estimators', 100]]",0.808165
4,1106_134115,DecisionTree,"['gps_height', 'longitude', 'latitude', 'num_p...","[['leaf', 6], ['depth', 21]]",0.771886
5,1106_134140,RandomForest,"['gps_height', 'longitude', 'latitude', 'num_p...","[['estimators', 100]]",0.806229
6,1106_135620,RandomForest,"['gps_height', 'longitude', 'latitude', 'num_p...","[['estimators', 100]]",0.805724
7,1106_140752,RandomForest,"['gps_height', 'longitude', 'latitude', 'num_p...","[['estimators', 100]]",0.803535


In [72]:
def saveSubmission(prediction, filename):
    savePath = '../submissions/'
    
    submission = sample_submission.copy()
    submission['status_group'] = prediction
    submission.to_csv(savePath+filename+'.csv', index=False)

In [74]:
y_pred = runDecisionTreeModel(train, val, test, features02, 7, 20)
saveSubmission(y_pred, 'testSave')

Decision Tree Accuracy: 
Train:     0.8332070707070707 
Validate:  0.7717171717171717 



In [21]:
train.T.head(75)

Unnamed: 0,36168,26059,13381,11700,52628,36213,16839,23826,38051,41832,...,7475,8684,22678,58118,11218,20523,35701,51680,35496,35306
id,42669,12317,18536,13292,54594,35130,23993,59447,24944,43279,...,10371,23062,50390,7198,32403,39562,39065,51156,36369,3762
date_recorded,2011-02-25 00:00:00,2011-02-20 00:00:00,2013-03-27 00:00:00,2013-02-01 00:00:00,2011-03-25 00:00:00,2012-10-23 00:00:00,2011-02-18 00:00:00,2011-03-06 00:00:00,2013-03-27 00:00:00,2011-03-12 00:00:00,...,2011-03-13 00:00:00,2013-02-04 00:00:00,2011-03-20 00:00:00,2011-07-22 00:00:00,2011-07-21 00:00:00,2012-10-06 00:00:00,2013-02-02 00:00:00,2013-01-18 00:00:00,2011-07-20 00:00:00,2011-07-20 00:00:00
funder,Mdc,W.B,Mbiuwasa,Rwssp,Nethalan,World Vision,Tasaf,Danida,Unicef,China Government,...,Private Individual,Hesawa,Amref,Tanapa,Swedish,Netherlands,World Vision,Tasaf,Wananchi,P
gps_height,1678,106,1299,0,372,0,1677,1535,1165,239,...,204,1429,46,1040,0,0,0,222,0,0
installer,DWE,RDC,MBIUWASA,DWE,RWE,Consulting Engineer,TASAF,DANID,RWE,Ch,...,WU,Hesawa,AMREF,DWE,Sengerema Water Department,DWE,ADP Busangi,District water department,wananchi,P
longitude,35.2946,39.0643,34.9937,,37.4522,32.8673,35.7603,35.4827,37.3055,38.3105,...,38.3463,34.4029,39.0368,31.1117,32.4386,33.9675,32.5514,38.8984,33.7471,32.9199
latitude,-8.14543,-7.62444,-10.9441,,-6.37716,-4.12384,-7.87806,-7.75062,-3.26251,-6.56136,...,-6.63656,-1.80129,-7.44727,-6.56175,-2.68061,-3.11719,-3.65055,-9.97398,-9.5715,-2.47668
wpt_name,none,Rashid Omari,Karibu Na Kituo Cha Polisi,Kasela,Kwa Ngosha,Kiriyandeta Mbugani,none,none,Kwa Peter Mushi,Shule Ya Zamani,...,Mama Mlamsanga,Kyolang,Nguvu Kazi,Kwa Utitiri,Kwa Makoye,Neema B,Upendeleo,Kaliele,Kwa Enosi Fijabo,Kwa Primi Mazula
num_private,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
basin,Rufiji,Rufiji,Lake Nyasa,Lake Victoria,Wami / Ruvu,Lake Tanganyika,Rufiji,Rufiji,Pangani,Wami / Ruvu,...,Wami / Ruvu,Lake Victoria,Rufiji,Lake Rukwa,Lake Victoria,Lake Victoria,Lake Victoria,Ruvuma / Southern Coast,Lake Nyasa,Lake Victoria


In [None]:
#  
def draw_comm(G, target, pos):
    fig, ax = plt.subplots(figsize = (16,9))
    