Lambda School Data Science

*Unit 2, Sprint 2, Module 2*

---

# Random Forests

## Assignment
- [ ] Read [“Adopting a Hypothesis-Driven Workflow”](http://archive.is/Nu3EI), a blog post by a Lambda DS student about the Tanzania Waterpumps challenge.
- [ ] Continue to participate in our Kaggle challenge.
- [ ] Define a function to wrangle train, validate, and test sets in the same way. Clean outliers and engineer features.
- [ ] Try Ordinal Encoding.
- [ ] Try a Random Forest Classifier.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.

## Stretch Goals

### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Do more exploratory data analysis, data cleaning, feature engineering, and feature selection.
- [ ] Try other [categorical encodings](https://contrib.scikit-learn.org/category_encoders/).
- [ ] Get and plot your feature importances.
- [ ] Make visualizations and share on Slack.

### Reading

Top recommendations in _**bold italic:**_

#### Decision Trees
- A Visual Introduction to Machine Learning, [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/), and _**[Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)**_
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU)

#### Random Forests
- [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/), Chapter 8: Tree-Based Methods
- [Coloring with Random Forests](http://structuringtheunstructured.blogspot.com/2017/11/coloring-with-random-forests.html)
- _**[Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)**_

#### Categorical encoding for trees
- [Are categorical variables getting lost in your random forests?](https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/)
- [Beyond One-Hot: An Exploration of Categorical Variables](http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/)
- _**[Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)**_
- _**[Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)**_
- [Mean (likelihood) encodings: a comprehensive study](https://www.kaggle.com/vprokopev/mean-likelihood-encodings-a-comprehensive-study)
- [The Mechanics of Machine Learning, Chapter 6: Categorically Speaking](https://mlbook.explained.ai/catvars.html)

#### Imposter Syndrome
- [Effort Shock and Reward Shock (How The Karate Kid Ruined The Modern World)](http://www.tempobook.com/2014/07/09/effort-shock-and-reward-shock/)
- [How to manage impostor syndrome in data science](https://towardsdatascience.com/how-to-manage-impostor-syndrome-in-data-science-ad814809f068)
- ["I am not a real data scientist"](https://brohrer.github.io/imposter_syndrome.html)
- _**[Imposter Syndrome in Data Science](https://caitlinhudon.com/2018/01/19/imposter-syndrome-in-data-science/)**_


### More Categorical Encodings

**1.** The article **[Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)** mentions 4 encodings:

- **"Categorical Encoding":** This means using the raw categorical values as-is, not encoded. Scikit-learn doesn't support this, but some tree algorithm implementations do. For example, [CatBoost](https://catboost.ai/), or R's [rpart](https://cran.r-project.org/web/packages/rpart/index.html) package.
- **Numeric Encoding:** Synonymous with Label Encoding, or "Ordinal" Encoding with random order. We can use [category_encoders.OrdinalEncoder](https://contrib.scikit-learn.org/category_encoders/ordinal.html).
- **One-Hot Encoding:** We can use [category_encoders.OneHotEncoder](https://contrib.scikit-learn.org/category_encoders/onehot.html).
- **Binary Encoding:** We can use [category_encoders.BinaryEncoder](https://contrib.scikit-learn.org/category_encoders/binary.html).
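
To make the difference between the last three encodings concrete, here is a minimal sketch on a toy column. It uses only pandas rather than category_encoders, so the hand-rolled binary encoding illustrates the idea and is not guaranteed to match `BinaryEncoder`'s exact output. The column name `quantity` and its values are borrowed for illustration.

```python
import pandas as pd

# Toy column for illustration only.
df = pd.DataFrame({"quantity": ["enough", "dry", "seasonal", "enough", "insufficient"]})

# Numeric / ordinal encoding: one integer per category, in order of first appearance.
codes, categories = pd.factorize(df["quantity"])
ordinal = pd.Series(codes + 1, name="quantity_ordinal")

# One-hot encoding: one 0/1 column per category.
one_hot = pd.get_dummies(df["quantity"], prefix="quantity").astype(int)

# Binary encoding: write each ordinal code in base 2, one column per bit.
n_bits = int(ordinal.max()).bit_length()
binary = pd.DataFrame(
    {f"quantity_bit{b}": (ordinal // 2**b) % 2 for b in reversed(range(n_bits))}
)

print(ordinal.tolist())
print(one_hot.shape, binary.shape)
```

With 4 categories, one-hot needs 4 columns while the binary sketch needs 3 bits; the gap widens quickly as cardinality grows.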


**2.** The short video 
**[Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)** introduces an interesting idea: use both X _and_ y to encode categoricals.

Category Encoders has multiple implementations of this general concept:

- [CatBoost Encoder](https://contrib.scikit-learn.org/category_encoders/catboost.html)
- [Generalized Linear Mixed Model Encoder](https://contrib.scikit-learn.org/category_encoders/glmm.html)
- [James-Stein Encoder](https://contrib.scikit-learn.org/category_encoders/jamesstein.html)
- [Leave One Out](https://contrib.scikit-learn.org/category_encoders/leaveoneout.html)
- [M-estimate](https://contrib.scikit-learn.org/category_encoders/mestimate.html)
- [Target Encoder](https://contrib.scikit-learn.org/category_encoders/targetencoder.html)
- [Weight of Evidence](https://contrib.scikit-learn.org/category_encoders/woe.html)

Category Encoders' mean encoding implementations work for regression problems or binary classification problems.

For multi-class classification problems, you will need to temporarily reformulate it as binary classification. For example:

```python
import category_encoders as ce
encoder = ce.TargetEncoder(min_samples_leaf=..., smoothing=...) # Both parameters > 1 to avoid overfitting
X_train_encoded = encoder.fit_transform(X_train, y_train=='functional')
# Encode validation features with the statistics learned from the training set
X_val_encoded = encoder.transform(X_val)
```

For this reason, mean encoding won't work well within pipelines for multi-class classification problems.
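
One possible workaround, sketched here under stated assumptions, is to compute one smoothed mean encoding per class, treating each class in turn as a temporary binary target. This is a hand-rolled pandas version, not the category_encoders implementation; the helper `mean_encode_one_vs_rest`, the smoothing weight `m`, and the toy data are all hypothetical.

```python
import pandas as pd

def mean_encode_one_vs_rest(train_col, y_train, apply_col, m=10):
    """Smoothed mean encoding of one categorical column, one output column per class.

    Statistics are learned from the training data only, then applied to `apply_col`.
    `m` is a hypothetical smoothing weight: larger values pull rare categories
    toward the overall class frequency.
    """
    encoded = pd.DataFrame(index=apply_col.index)
    for cls in sorted(y_train.unique()):
        is_cls = (y_train == cls).astype(float)   # temporary binary target
        prior = is_cls.mean()                     # overall frequency of this class
        stats = is_cls.groupby(train_col).agg(["sum", "count"])
        smoothed = (stats["sum"] + m * prior) / (stats["count"] + m)
        # Categories unseen in training fall back to the prior.
        encoded[f"{train_col.name}_{cls}"] = apply_col.map(smoothed).fillna(prior)
    return encoded

# Toy data for illustration only.
X_tr = pd.DataFrame({"basin": ["A", "A", "B", "B", "B", "C"]})
y_tr = pd.Series(["functional", "non functional", "functional",
                  "functional", "functional needs repair", "non functional"])
X_va = pd.DataFrame({"basin": ["A", "C", "D"]})  # "D" never appears in training

val_encoded = mean_encode_one_vs_rest(X_tr["basin"], y_tr, X_va["basin"], m=2)
print(val_encoded.round(3))
```

Each row's encoded values sum to 1 across the class columns, and the validation labels are never touched.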

**3.** The **[dirty_cat](https://dirty-cat.github.io/stable/)** library has a Target Encoder implementation that works with multi-class classification.

```python
dirty_cat.TargetEncoder(clf_type='multiclass-clf')
```
It also implements an interesting idea called ["Similarity Encoder" for dirty categories](https://www.slideshare.net/GaelVaroquaux/machine-learning-on-non-curated-data-154905090).

However, it seems like dirty_cat doesn't handle missing values or unknown categories as well as category_encoders does. And you may need to use it with one column at a time, instead of with your whole dataframe.

**4. [Embeddings](https://www.kaggle.com/colinmorris/embedding-layers)** can work well with sparse / high cardinality categoricals.

_**I hope it’s not too frustrating or confusing that there’s not one “canonical” way to encode categoricals. It’s an active area of research and experimentation — maybe you can make your own contributions!**_

### Setup

You can work locally (follow the [local setup instructions](https://lambdaschool.github.io/ds/unit2/local/)) or on Colab (run the code cell below).

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split

train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

train.shape, test.shape

((59400, 41), (14358, 40))

In [3]:
train, val = train_test_split(train, test_size=0.25,
                              stratify=train["status_group"],
                              random_state=37)

train.shape, val.shape

((44550, 41), (14850, 41))

In [4]:
train["status_group"].unique()

array(['non functional', 'functional', 'functional needs repair'],
      dtype=object)

In [5]:
train["status_group"].value_counts(normalize=True)

functional                 0.543075
non functional             0.384242
functional needs repair    0.072682
Name: status_group, dtype: float64
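
The effect of `stratify=` can be checked on a small example. This is a minimal sketch on toy labels whose counts roughly mirror the proportions above; the `toy_*` names are hypothetical. It suggests that stratifying keeps each class's share similar in both splits, which matters here because "functional needs repair" is rare.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy labels for illustration only: 54 / 38 / 8 out of 100.
toy = pd.DataFrame({"status_group": ["functional"] * 54
                                    + ["non functional"] * 38
                                    + ["functional needs repair"] * 8})

toy_train, toy_val = train_test_split(toy, test_size=0.25,
                                      stratify=toy["status_group"],
                                      random_state=37)

# Compare class shares in the two splits.
print(toy_train["status_group"].value_counts(normalize=True))
print(toy_val["status_group"].value_counts(normalize=True))
```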

In [6]:
train["functional"] = (train["status_group"] == "functional").astype(int)

In [7]:
train.shape, val.shape

((44550, 42), (14850, 41))

In [8]:
train["functional"].value_counts(normalize=True)

1    0.543075
0    0.456925
Name: functional, dtype: float64

In [9]:
# from pandas_profiling import ProfileReport
# profile = ProfileReport(train, explorative=True).to_notebook_iframe()

# profile

In [10]:
# import plotly.express as px
# px.scatter(train, x="longitude", y="latitude", color="functional")

In [11]:
train["latitude"].describe()

count    4.455000e+04
mean    -5.696868e+00
std      2.942822e+00
min     -1.164944e+01
25%     -8.525579e+00
50%     -5.007589e+00
75%     -3.324226e+00
max     -2.000000e-08
Name: latitude, dtype: float64

In [12]:
import numpy as np

def wrangle(X):
    """Wrangle train, validation, and test sets in the same way."""
    
    X = X.copy()
    
    X["latitude"] = X["latitude"].replace(-2e-08, 0)
    
    zero_cols = ["longitude", "latitude"]
    for col in zero_cols:
        X[col] = X[col].replace(0, np.nan)
        
#     X["date_recorded"] = pd.to_datetime(X["date_recorded"])

    # Drop highly correlated, redundant, and single-valued columns.
    X = X.drop(["extraction_type_group", "extraction_type_class", "management_group",
               "quality_group", "payment_type", "quantity_group", "source_type",
               "source_class", "waterpoint_type_group", "recorded_by"],
               axis=1)
    
    return X

train = wrangle(train)
val = wrangle(val)
test = wrangle(test)

In [13]:
train["latitude"].isnull().sum()

1339
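
The commented-out `date_recorded` line in `wrangle` hints at a further feature-engineering step. A minimal sketch follows, assuming `date_recorded` holds ISO-formatted date strings and that a `construction_year` of 0 means "unknown" (both are assumptions about the data). The helper name `add_date_features` and the new column names are hypothetical.

```python
import numpy as np
import pandas as pd

def add_date_features(X):
    """Return a copy of X with year/month recorded and pump age at inspection."""
    X = X.copy()
    recorded = pd.to_datetime(X["date_recorded"])
    X["year_recorded"] = recorded.dt.year
    X["month_recorded"] = recorded.dt.month
    # Assumption: a construction_year of 0 means "unknown".
    built = X["construction_year"].replace(0, np.nan)
    X["years_since_construction"] = X["year_recorded"] - built
    return X

# Toy rows for illustration only.
toy = pd.DataFrame({
    "date_recorded": ["2011-03-14", "2013-02-04"],
    "construction_year": [1999, 0],
})
print(add_date_features(toy))
```

Because the function works on a copy, it could be applied to train, validation, and test sets without side effects.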

In [14]:
# px.scatter(train, x="longitude", y="latitude", color="functional")

In [15]:
num_feats = ["amount_tsh", "gps_height", "longitude", "latitude", "num_private", "region_code",
             "district_code", "population", "construction_year"] 
lo_cat_feats = ["basin", "region", "public_meeting","scheme_management", "permit",
             "extraction_type", "management", "payment", "water_quality", "quantity",
             "source", "waterpoint_type"]
hi_cat_feats = ["date_recorded", "funder", "installer", "wpt_name", "subvillage", "lga", "ward", "scheme_name"]
features = num_feats + lo_cat_feats + hi_cat_feats

test = test.drop(columns=["id"])
X_test = test[features]

In [16]:
# Separating numerical features from categorical ones, and high-cardinality from low-cardinality
# categoricals, to make my model easier to train later. The cells below gave me what I needed to split the columns.

target = "status_group"

train_feats = train.drop(columns=[target, "id", "functional"])
val_feats = val.drop(columns=[target, "id"])

numt_feats = train_feats.select_dtypes(include="number").columns.tolist()
numv_feats = val_feats.select_dtypes(include="number").columns.tolist()

T_card = train_feats.select_dtypes(exclude="number").nunique()
V_card = val_feats.select_dtypes(exclude="number").nunique()

lo_cardT = T_card[T_card <= 30].index.tolist()
lo_cardV = V_card[V_card <= 30].index.tolist()

# Strictly greater than 30, so a column with exactly 30 values is not in both lists.
hi_cardT = T_card[T_card > 30].index.tolist()
hi_cardV = V_card[V_card > 30].index.tolist()

In [17]:
train_feats.select_dtypes(include="number").columns.tolist()

['amount_tsh',
 'gps_height',
 'longitude',
 'latitude',
 'num_private',
 'region_code',
 'district_code',
 'population',
 'construction_year']

In [18]:
train_feats.select_dtypes(exclude="number").nunique()

date_recorded          349
funder                1646
installer             1871
wpt_name             28940
basin                    9
subvillage           16648
region                  21
lga                    124
ward                  2079
public_meeting           2
scheme_management       12
scheme_name           2485
permit                   2
extraction_type         18
management              12
payment                  7
water_quality            8
quantity                 5
source                  10
waterpoint_type          7
dtype: int64

In [19]:
hi_cardT

['date_recorded',
 'funder',
 'installer',
 'wpt_name',
 'subvillage',
 'lga',
 'ward',
 'scheme_name']

In [20]:
lo_cardT

['basin',
 'region',
 'public_meeting',
 'scheme_management',
 'permit',
 'extraction_type',
 'management',
 'payment',
 'water_quality',
 'quantity',
 'source',
 'waterpoint_type']
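
The split explored in the cells above could be wrapped in a small helper, so the threshold is stated once and every categorical column lands in exactly one list. This is a minimal sketch; the function name `split_by_cardinality` and the toy frame are hypothetical.

```python
import pandas as pd

def split_by_cardinality(df, threshold=30):
    """Return (numeric, low-cardinality, high-cardinality) column name lists."""
    numeric = df.select_dtypes(include="number").columns.tolist()
    cardinality = df.select_dtypes(exclude="number").nunique()
    low = cardinality[cardinality <= threshold].index.tolist()
    high = cardinality[cardinality > threshold].index.tolist()
    return numeric, low, high

# Toy frame for illustration only.
toy = pd.DataFrame({
    "population": [10, 20, 30, 40],
    "basin": ["a", "b", "a", "b"],
    "wpt_name": ["w1", "w2", "w3", "w4"],
})
num, low, high = split_by_cardinality(toy, threshold=3)
print(num, low, high)
```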

In [21]:
num_feats = ["amount_tsh", "gps_height", "longitude", "latitude", "num_private", "region_code",
             "district_code", "population", "construction_year"] 
lo_cat_feats = ["basin", "region", "public_meeting","scheme_management", "permit",
             "extraction_type", "management", "payment", "water_quality", "quantity",
             "source", "waterpoint_type"]
hi_cat_feats = ["date_recorded", "funder", "installer", "wpt_name", "subvillage", "lga", "ward", "scheme_name"]
features = num_feats + lo_cat_feats + hi_cat_feats

y_train = train[target]
X_train = train[features]
y_val = val[target]
X_val = val[features]

In [22]:
train.shape, X_train.shape

((44550, 32), (44550, 29))

In [23]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

In [24]:
num_feats = ["amount_tsh", "gps_height", "longitude", "latitude", "num_private", "region_code",
             "district_code", "population", "construction_year"] 
lo_cat_feats = ["basin", "region", "public_meeting","scheme_management", "permit",
             "extraction_type", "management", "payment", "water_quality", "quantity",
             "source", "waterpoint_type"]
hi_cat_feats = ["date_recorded", "funder", "installer", "wpt_name", "subvillage", "lga", "ward", "scheme_name"]
cat_feats = lo_cat_feats + hi_cat_feats

In [33]:
estimator = Pipeline([
    ("encoder", ce.OrdinalEncoder()),
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
    ("forest", RandomForestClassifier(criterion="entropy", max_depth=45,
                                      min_samples_split=5,
                                      max_features=15, random_state=15))
])

In [34]:
estimator.fit(X_train, y_train);

In [35]:
print('Training Accuracy', estimator.score(X_train, y_train))
print('Validation Accuracy', estimator.score(X_val, y_val))

Training Accuracy 0.9885521885521885
Validation Accuracy 0.8042424242424242
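
One of the stretch goals is to get and plot feature importances. A minimal sketch of how they might be pulled out of a fitted pipeline follows, on made-up toy data. It swaps in scikit-learn's `OrdinalEncoder` for `ce.OrdinalEncoder` so that it needs only scikit-learn; with ordinal encoding each input column stays a single column, so the importances line up with the original column names.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Toy data; column names borrowed from this notebook, values are made up.
X_toy = pd.DataFrame({
    "quantity": ["enough", "dry", "enough", "dry", "enough", "dry"] * 5,
    "basin": ["a", "b", "c", "a", "b", "c"] * 5,
})
y_toy = (X_toy["quantity"] == "enough").map({True: "functional", False: "non functional"})

toy_pipeline = Pipeline([
    # Stand-in for ce.OrdinalEncoder (an assumption made for this sketch).
    ("encoder", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)),
    ("imputer", SimpleImputer(strategy="median")),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=15)),
])
toy_pipeline.fit(X_toy, y_toy)

# The fitted forest lives in the step named "forest".
importances = pd.Series(
    toy_pipeline.named_steps["forest"].feature_importances_,
    index=X_toy.columns,
).sort_values(ascending=False)
print(importances)
# importances.plot.barh() would draw them, assuming matplotlib is installed.
```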


In [36]:
y_pred = estimator.predict(X_test)

In [37]:
DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
whyse_submission2 = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')
submission2 = whyse_submission2.copy()
submission2['status_group'] = y_pred
submission2.to_csv('whyse-submission2.csv', index=False)

**Used the documentation and previous knowledge from Flexing Unit 2 to build the RandomizedSearchCV parameters.**

In [26]:
# %%time

# # Keys must use the pipeline's step names ("imputer", "forest"), not the class names.
# params = {
#     "imputer__strategy" : ("median", "mean", "most_frequent"),
#     "forest__criterion" : ("gini", "entropy"),
#     "forest__max_depth" : (11, 21, 33),
#     "forest__min_samples_split" : (3, 5, 9),
#     "forest__max_features" : (7, 13, 19),
# }

# best_model = RandomizedSearchCV(
#     estimator,
#     param_distributions=params,
#     n_iter=54,
#     scoring="accuracy",
#     n_jobs=1,
#     cv=5,
#     verbose=1,
#     random_state=15,
#     return_train_score=True
# )

# best_model.fit(X_train, y_train)

# print("Best CV:", best_model.best_score_)  # best_score_ is an attribute, not a method
# print("Estimator:", best_model.best_params_)
# print("Model:", best_model.best_estimator_)
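
A scaled-down, runnable sketch of the same search idea on synthetic data is shown here. Parameter keys follow scikit-learn's `<step name>__<parameter>` convention, so they have to match the names given to the pipeline steps. The data from `make_classification`, the `demo_*` names, and the small grids are for illustration only and are not tuned for the waterpumps data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Synthetic binary classification data, for illustration only.
X_demo, y_demo = make_classification(n_samples=200, n_features=8,
                                     n_informative=4, random_state=15)

demo_pipeline = Pipeline([
    ("imputer", SimpleImputer()),
    ("forest", RandomForestClassifier(n_estimators=20, random_state=15)),
])

# Keys are "<step name>__<parameter>".
demo_params = {
    "imputer__strategy": ["median", "mean"],
    "forest__criterion": ["gini", "entropy"],
    "forest__max_depth": [3, 5, 9],
    "forest__max_features": [2, 4],
}

# Sample 5 of the 24 possible combinations and score each with 3-fold CV.
search = RandomizedSearchCV(demo_pipeline, param_distributions=demo_params,
                            n_iter=5, scoring="accuracy", cv=3, random_state=15)
search.fit(X_demo, y_demo)

print("Best CV score:", search.best_score_)
print("Best params:", search.best_params_)
```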