Lambda School Data Science

*Unit 2, Sprint 2, Module 2*

---

# Random Forests

## Assignment
- [ ] Read [“Adopting a Hypothesis-Driven Workflow”](https://outline.com/5S5tsB), a blog post by a Lambda DS student about the Tanzania Waterpumps challenge.
- [ ] Continue to participate in our Kaggle challenge.
- [ ] Define a function to wrangle train, validate, and test sets in the same way. Clean outliers and engineer features.
- [ ] Try Ordinal Encoding.
- [ ] Try a Random Forest Classifier.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.

## Stretch Goals

### Doing
- [ ] Add your own stretch goal(s) !
- [ ] Do more exploratory data analysis, data cleaning, feature engineering, and feature selection.
- [ ] Try other [categorical encodings](https://contrib.scikit-learn.org/categorical-encoding/).
- [ ] Get and plot your feature importances.
- [ ] Make visualizations and share on Slack.

### Reading

Top recommendations in _**bold italic:**_

#### Decision Trees
- A Visual Introduction to Machine Learning, [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/),  and _**[Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)**_
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU)

#### Random Forests
- [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/), Chapter 8: Tree-Based Methods
- [Coloring with Random Forests](http://structuringtheunstructured.blogspot.com/2017/11/coloring-with-random-forests.html)
- _**[Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)**_

#### Categorical encoding for trees
- [Are categorical variables getting lost in your random forests?](https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/)
- [Beyond One-Hot: An Exploration of Categorical Variables](http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/)
- _**[Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)**_
- _**[Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)**_
- [Mean (likelihood) encodings: a comprehensive study](https://www.kaggle.com/vprokopev/mean-likelihood-encodings-a-comprehensive-study)
- [The Mechanics of Machine Learning, Chapter 6: Categorically Speaking](https://mlbook.explained.ai/catvars.html)

#### Imposter Syndrome
- [Effort Shock and Reward Shock (How The Karate Kid Ruined The Modern World)](http://www.tempobook.com/2014/07/09/effort-shock-and-reward-shock/)
- [How to manage impostor syndrome in data science](https://towardsdatascience.com/how-to-manage-impostor-syndrome-in-data-science-ad814809f068)
- ["I am not a real data scientist"](https://brohrer.github.io/imposter_syndrome.html)
- _**[Imposter Syndrome in Data Science](https://caitlinhudon.com/2018/01/19/imposter-syndrome-in-data-science/)**_


### More Categorical Encodings

**1.** The article **[Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)** mentions 4 encodings:

- **"Categorical Encoding":** This means using the raw categorical values as-is, not encoded. Scikit-learn doesn't support this, but some tree algorithm implementations do. For example, [Catboost](https://catboost.ai/), or R's [rpart](https://cran.r-project.org/web/packages/rpart/index.html) package.
- **Numeric Encoding:** Synonymous with Label Encoding, or "Ordinal" Encoding with random order. We can use [category_encoders.OrdinalEncoder](https://contrib.scikit-learn.org/categorical-encoding/ordinal.html).
- **One-Hot Encoding:** We can use [category_encoders.OneHotEncoder](http://contrib.scikit-learn.org/categorical-encoding/onehot.html).
- **Binary Encoding:** We can use [category_encoders.BinaryEncoder](http://contrib.scikit-learn.org/categorical-encoding/binary.html).


**2.** The short video 
**[Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)** introduces an interesting idea: use both X _and_ y to encode categoricals.

Category Encoders has multiple implementations of this general concept:

- [CatBoost Encoder](http://contrib.scikit-learn.org/categorical-encoding/catboost.html)
- [James-Stein Encoder](http://contrib.scikit-learn.org/categorical-encoding/jamesstein.html)
- [Leave One Out](http://contrib.scikit-learn.org/categorical-encoding/leaveoneout.html)
- [M-estimate](http://contrib.scikit-learn.org/categorical-encoding/mestimate.html)
- [Target Encoder](http://contrib.scikit-learn.org/categorical-encoding/targetencoder.html)
- [Weight of Evidence](http://contrib.scikit-learn.org/categorical-encoding/woe.html)

Category Encoder's mean encoding implementations work for regression problems or binary classification problems. 

For multi-class classification problems, you will need to temporarily reformulate it as binary classification. For example:

```python
encoder = ce.TargetEncoder(min_samples_leaf=..., smoothing=...) # Both parameters > 1 to avoid overfitting
X_train_encoded = encoder.fit_transform(X_train, y_train=='functional')
X_val_encoded = encoder.transform(X_train, y_val=='functional')
```

For this reason, mean encoding won't work well within pipelines for multi-class classification problems.

**3.** The **[dirty_cat](https://dirty-cat.github.io/stable/)** library has a Target Encoder implementation that works with multi-class classification.

```python
 dirty_cat.TargetEncoder(clf_type='multiclass-clf')
```
It also implements an interesting idea called ["Similarity Encoder" for dirty categories](https://www.slideshare.net/GaelVaroquaux/machine-learning-on-non-curated-data-154905090).

However, it seems like dirty_cat doesn't handle missing values or unknown categories as well as category_encoders does. And you may need to use it with one column at a time, instead of with your whole dataframe.

**4. [Embeddings](https://www.kaggle.com/learn/embeddings)** can work well with sparse / high cardinality categoricals.

_**I hope it’s not too frustrating or confusing that there’s not one “canonical” way to encode categoricals. It’s an active area of research and experimentation! Maybe you can make your own contributions!**_

### Setup

You can work locally (follow the [local setup instructions](https://lambdaschool.github.io/ds/unit2/local/)) or on Colab (run the code cell below).

In [1]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*

# If you're working locally:
else:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'

In [2]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

train.shape, test.shape

((59400, 41), (14358, 40))

In [3]:
# Split train into train & val
train, validate = train_test_split(
    train, 
    train_size=0.80, 
    test_size=0.20, 
    stratify=train['status_group'], 
    random_state=42)




In [4]:
def wrangle(X):
    """Wrangle train, validate, and test sets in the same way"""
    
    # Prevent SettingWithCopyWarning
    X = X.copy()
    
    # About 3% of the time, latitude has small values near zero,
    # outside Tanzania, so we'll treat these values like zero.
    X['latitude'] = X['latitude'].replace(-2e-08, 0)
    
    # When columns have zeros and shouldn't, they are like null values.
    # So we will replace the zeros with nulls, and impute missing values later.
    # Also create a "missing indicator" column, because the fact that
    # values are missing may be a predictive signal.
    cols_with_zeros = ['longitude', 'latitude', 'construction_year', 
                       'gps_height', 'population']
    for col in cols_with_zeros:
        X[col] = X[col].replace(0, np.nan)
        X[col+'_MISSING'] = X[col].isnull()
            
    # Drop duplicate columns
    duplicates = ['quantity_group', 'payment_type']
    X = X.drop(columns=duplicates)
    
    # Drop recorded_by (never varies) and id (always varies, random)
    unusable_variance = ['recorded_by', 'id']
    X = X.drop(columns=unusable_variance)
    
    # Convert date_recorded to datetime
    X['date_recorded'] = pd.to_datetime(X['date_recorded'], infer_datetime_format=True)
    
    # Extract components from date_recorded, then drop the original column
    X['year_recorded'] = X['date_recorded'].dt.year
    X['month_recorded'] = X['date_recorded'].dt.month
    X['day_recorded'] = X['date_recorded'].dt.day
    X = X.drop(columns='date_recorded')
    
    # Engineer feature: how many years from construction_year to date_recorded
    X['years'] = X['year_recorded'] - X['construction_year']
    X['years_MISSING'] = X['years'].isnull()
    
    
    # return the wrangled dataframe
    return X



In [5]:
train = wrangle(train)
validate = wrangle(validate)
test = wrangle(test)

In [6]:
# The status_group column is the target
target = 'status_group'

# Get a dataframe with all train columns except the target
train_features = train.drop(columns=[target])

# Get a list of the numeric features
numeric_features = train_features.select_dtypes(include='number').columns.tolist()

# Get a series with the cardinality of the nonnumeric features
cardinality = train_features.select_dtypes(exclude='number').nunique()

# Get a list of all categorical features with cardinality <= 50
categorical_features = cardinality[cardinality<=50].index.tolist()

# Combine the lists 
features = numeric_features + categorical_features 

In [7]:
# Arrange data into X features matrix and y target vector 
X_train = train[features]
y_train = train[target]
X_validate = validate[features]
y_validate = validate[target]
X_test = test[features]

In [10]:
import category_encoders as ce
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

In [12]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    StandardScaler(),
    RandomForestClassifier(bootstrap=True, max_depth=50, max_features='auto', min_samples_leaf= 1, min_samples_split= 5, n_estimators=800, random_state=42, criterion='entropy')
)

pipeline.fit(X_train, y_train)

print ('Train Accuracy', pipeline.score(X_train, y_train))
print ('Validation Accuracy', pipeline.score(X_validate, y_validate))


Train Accuracy 0.9584385521885522
Validation Accuracy 0.8153198653198653


In [15]:
from sklearn.model_selection import RandomizedSearchCV

In [23]:
X_train_encoded = ce.OrdinalEncoder().fit_transform(X_train)
X_train_imputed = SimpleImputer().fit_transform(X_train_encoded)
#X_train_scaled = StandardScaler().fit_transform(X_train_imputed)

X_validate_encoded = ce.OrdinalEncoder().fit_transform(X_validate)
X_validate_imputed = SimpleImputer().fit_transform(X_validate_encoded)
#X_validate_scaled = StandardScaler().fit_transform(X_validate_imputed)

X_test_encoded = ce.OrdinalEncoder().fit_transform(X_test)
X_test_imputed = SimpleImputer().fit_transform(X_test_encoded)
#X_test_scaled = StandardScaler().fit_transform(X_test_imputed)

In [25]:
n_estimators = range(10, 1000, 200)
criterion = ['gini', 'entropy']
max_features = ['auto', 'sqrt', 'log2']
max_depth = [2, 4, 8, 16, 32, 64]
min_samples_split = [2, 4, 6, 8, 10]
min_samples_leaf = [2, 4, 6, 8, 10]
random_state = range(1, 42, 1)
bootstrap = [True, False]

random_grid = {'n_estimators': n_estimators,
               'criterion' : criterion,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'random_state' : random_state,
               'bootstrap': bootstrap}
print(random_grid)

# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestClassifier()
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 1000, cv = 3, verbose=2, random_state=42, n_jobs=-2)
# Fit the random search model
rf_random.fit(X_train_imputed, y_train)


print ('Train Accuracy', pipeline.score(X_train_imputed, y_train))
print ('Validation Accuracy', pipeline.score(X_validate_imputed, y_validate))


rf_random.best_params_

{'n_estimators': range(10, 1000, 200), 'criterion': ['gini', 'entropy'], 'max_features': ['auto', 'sqrt', 'log2'], 'max_depth': [2, 4, 8, 16, 32, 64], 'min_samples_split': [2, 4, 6, 8, 10], 'min_samples_leaf': [2, 4, 6, 8, 10], 'random_state': range(1, 42), 'bootstrap': [True, False]}
Fitting 3 folds for each of 1000 candidates, totalling 3000 fits


[Parallel(n_jobs=-2)]: Using backend LokyBackend with 7 concurrent workers.
[Parallel(n_jobs=-2)]: Done  27 tasks      | elapsed:  2.7min
[Parallel(n_jobs=-2)]: Done 148 tasks      | elapsed: 17.2min


KeyboardInterrupt: 

In [None]:
n_estimators = range(400, 1000, 100)
criterion = ['entropy']
max_features = ['auto', 'sqrt', 'log2']
max_depth = [4]
min_samples_split = [8]
min_samples_leaf = [10]
random_state = range(1, 42, 1)
bootstrap = [True]

random_grid = {'n_estimators': n_estimators,
               'criterion' : criterion,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'random_state' : random_state,
               'bootstrap': bootstrap}
print(random_grid)

# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestClassifier()
# Random search of parameters, using 3 fold cross validation, 
# search across 100 different combinations, and use all available cores
rf_random02 = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 1000, cv = 3, verbose=2, random_state=42, n_jobs=None)
# Fit the random search model
rf_random02.fit(X_train, y_train)

In [10]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy="mean"),
    RandomForestClassifier(random_state=0, n_jobs=-2))
    
%time
pipeline.fit(X_train, y_train)
print("Baseline RandomForestClassifier Training accuracy:", pipeline.score(X_train, y_train))
print("Baseline RandomForestClassifier Validation accuracy:", pipeline.score(X_validate, y_validate))

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 4.77 µs
Baseline RandomForestClassifier Training accuracy: 0.9979166666666667
Baseline RandomForestClassifier Validation accuracy: 0.8113636363636364


In [17]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy="mean"),
    RandomForestClassifier(random_state=0, n_jobs=-2, n_estimators=200))
    
%time
pipeline.fit(X_train, y_train)
print("RandomForestClassifier. n_estimators=200 Training accuracy:", pipeline.score(X_train, y_train))
print("RandomForestClassifier, n_estimators=200 Validation accuracy:", pipeline.score(X_validate, y_validate))

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 6.2 µs
RandomForestClassifier. n_estimators=200 Training accuracy: 0.9999789562289563
RandomForestClassifier, n_estimators=200 Validation accuracy: 0.8104377104377104


In [10]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy="mean"),
    RandomForestClassifier(random_state=0, n_jobs=-2, n_estimators=1000))
    
%time
pipeline.fit(X_train, y_train)
print("RandomForestClassifier. n_estimators=1000 Training accuracy:", pipeline.score(X_train, y_train))
print("RandomForestClassifier, n_estimators=1000 Validation accuracy:", pipeline.score(X_validate, y_validate))

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 4.77 µs
RandomForestClassifier. n_estimators=1000 Training accuracy: 0.9979377104377104
RandomForestClassifier, n_estimators=1000 Validation accuracy: 0.8116161616161616


In [19]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy="mean"),
    RandomForestClassifier(random_state=0, n_jobs=-2, n_estimators=1000, min_samples_leaf=2))
    
%time
pipeline.fit(X_train, y_train)
print("RandomForestClassifier. n_estimators=1000 min_samples_leaf=2 Training accuracy:", pipeline.score(X_train, y_train))
print("RandomForestClassifier, n_estimators=1000 min_samples_leaf=2 Validation accuracy:", pipeline.score(X_validate, y_validate))

CPU times: user 5 µs, sys: 0 ns, total: 5 µs
Wall time: 9.78 µs
RandomForestClassifier. n_estimators=1000 min_samples_leaf=2 Training accuracy: 0.9441077441077441
RandomForestClassifier, n_estimators=1000 min_samples_leaf=2 Validation accuracy: 0.8127104377104377


In [11]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy="mean"),
    RandomForestClassifier(random_state=0, n_jobs=-2, n_estimators=1000, min_samples_leaf=2, max_depth=32))
    
%time
pipeline.fit(X_train, y_train)
print("RandomForestClassifier. n_estimators=1000 min_samples_leaf=2 max_depth=32 Training accuracy:", pipeline.score(X_train, y_train))
print("RandomForestClassifier, n_estimators=1000 min_samples_leaf=2 max_depth=32 Validation accuracy:", pipeline.score(X_validate, y_validate))

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 4.53 µs
RandomForestClassifier. n_estimators=1000 min_samples_leaf=2 max_depth=32 Training accuracy: 0.9232954545454546
RandomForestClassifier, n_estimators=1000 min_samples_leaf=2 max_depth=32 Validation accuracy: 0.8167508417508418


In [22]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy="median"),
    RandomForestClassifier(random_state=0, n_jobs=-2, n_estimators=1000, min_samples_leaf=2, max_depth=32))
    
%time
pipeline.fit(X_train, y_train)
print("RandomForestClassifier. n_estimators=1000 min_samples_leaf=2 max_depth=32 Training accuracy:", pipeline.score(X_train, y_train))
print("RandomForestClassifier, n_estimators=1000 min_samples_leaf=2 max_depth=32 Validation accuracy:", pipeline.score(X_validate, y_validate))

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.48 µs
RandomForestClassifier. n_estimators=1000 min_samples_leaf=2 max_depth=32 Training accuracy: 0.9436868686868687
RandomForestClassifier, n_estimators=1000 min_samples_leaf=2 max_depth=32 Validation accuracy: 0.8141414141414142


In [23]:
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

In [24]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    IterativeImputer(),
    RandomForestClassifier(random_state=0, n_jobs=-2, n_estimators=1000, min_samples_leaf=2, max_depth=32))
    
%time
pipeline.fit(X_train, y_train)
print("RandomForestClassifier. n_estimators=1000 min_samples_leaf=2 max_depth=32 IterativeImputer Training accuracy:", pipeline.score(X_train, y_train))
print("RandomForestClassifier, n_estimators=1000 min_samples_leaf=2 max_depth=32 IterativeImputer Validation accuracy:", pipeline.score(X_validate, y_validate))

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 5.25 µs
RandomForestClassifier. n_estimators=1000 min_samples_leaf=2 max_depth=32 IterativeImputer Training accuracy: 0.9489898989898989
RandomForestClassifier, n_estimators=1000 min_samples_leaf=2 max_depth=32 IterativeImputer Validation accuracy: 0.813973063973064


In [25]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(),
    RandomForestClassifier(random_state=0, n_jobs=-2, max_features=.657, n_estimators=375, max_depth=20))
    
%time
pipeline.fit(X_train, y_train)
print("RandomForestClassifier. max_features=.657, n_estimators=375, max_depth=20 Training accuracy:", pipeline.score(X_train, y_train))
print("RandomForestClassifier, max_features=.657, n_estimators=375, max_depth=20 Validation accuracy:", pipeline.score(X_validate, y_validate))

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.48 µs
RandomForestClassifier. max_features=.657, n_estimators=375, max_depth=20 Training accuracy: 0.9719486531986532
RandomForestClassifier, max_features=.657, n_estimators=375, max_depth=20 Validation accuracy: 0.8088383838383838


In [26]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(),
    RandomForestClassifier(random_state=0, n_jobs=-2, max_features=.657, n_estimators=375, max_depth=20, min_samples_leaf=2))
    
%time
pipeline.fit(X_train, y_train)
print("RandomForestClassifier. max_features=.657, n_estimators=375, max_depth=20, min_samples_leaf=2 Training accuracy:", pipeline.score(X_train, y_train))
print("RandomForestClassifier, max_features=.657, n_estimators=375, max_depth=20, min_samples_leaf=2 Validation accuracy:", pipeline.score(X_validate, y_validate))

CPU times: user 5 µs, sys: 0 ns, total: 5 µs
Wall time: 9.54 µs
RandomForestClassifier. max_features=.657, n_estimators=375, max_depth=20, min_samples_leaf=2 Training accuracy: 0.9449074074074074
RandomForestClassifier, max_features=.657, n_estimators=375, max_depth=20, min_samples_leaf=2 Validation accuracy: 0.8118686868686869


In [29]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(),
    RandomForestClassifier(random_state=0, n_jobs=-2, max_features=.4, n_estimators=400, max_depth=20, min_samples_leaf=2))
    
%time
pipeline.fit(X_train, y_train)
print("RandomForestClassifier. max_features=.657, n_estimators=375, max_depth=20, min_samples_leaf=2 Training accuracy:", pipeline.score(X_train, y_train))
print("RandomForestClassifier, max_features=.657, n_estimators=375, max_depth=20, min_samples_leaf=2 Validation accuracy:", pipeline.score(X_validate, y_validate))

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 6.91 µs
RandomForestClassifier. max_features=.657, n_estimators=375, max_depth=20, min_samples_leaf=2 Training accuracy: 0.9407196969696969
RandomForestClassifier, max_features=.657, n_estimators=375, max_depth=20, min_samples_leaf=2 Validation accuracy: 0.8121212121212121


In [31]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(),
    RandomForestClassifier(random_state=0, n_jobs=-2, max_features=.3, n_estimators=400, max_depth=20, min_samples_leaf=2))
    
%time
pipeline.fit(X_train, y_train)
print("RandomForestClassifier. max_features=.657, n_estimators=375, max_depth=20, min_samples_leaf=2 Training accuracy:", pipeline.score(X_train, y_train))
print("RandomForestClassifier, max_features=.657, n_estimators=375, max_depth=20, min_samples_leaf=2 Validation accuracy:", pipeline.score(X_validate, y_validate))

CPU times: user 7 µs, sys: 0 ns, total: 7 µs
Wall time: 13.1 µs
RandomForestClassifier. max_features=.657, n_estimators=375, max_depth=20, min_samples_leaf=2 Training accuracy: 0.9357744107744108
RandomForestClassifier, max_features=.657, n_estimators=375, max_depth=20, min_samples_leaf=2 Validation accuracy: 0.812962962962963


In [32]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(),
    RandomForestClassifier(random_state=0, n_jobs=-2, max_features=.2, n_estimators=400, max_depth=20, min_samples_leaf=2))
    
%time
pipeline.fit(X_train, y_train)
print("RandomForestClassifier. max_features=.657, n_estimators=375, max_depth=20, min_samples_leaf=2 Training accuracy:", pipeline.score(X_train, y_train))
print("RandomForestClassifier, max_features=.657, n_estimators=375, max_depth=20, min_samples_leaf=2 Validation accuracy:", pipeline.score(X_validate, y_validate))

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.72 µs
RandomForestClassifier. max_features=.657, n_estimators=375, max_depth=20, min_samples_leaf=2 Training accuracy: 0.9287247474747474
RandomForestClassifier, max_features=.657, n_estimators=375, max_depth=20, min_samples_leaf=2 Validation accuracy: 0.8132996632996633


In [33]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(),
    RandomForestClassifier(random_state=0, n_jobs=-2, max_features=.2, n_estimators=1000, max_depth=20, min_samples_leaf=2))
    
%time
pipeline.fit(X_train, y_train)
print("RandomForestClassifier. max_features=.657, n_estimators=375, max_depth=20, min_samples_leaf=2 Training accuracy:", pipeline.score(X_train, y_train))
print("RandomForestClassifier, max_features=.657, n_estimators=375, max_depth=20, min_samples_leaf=2 Validation accuracy:", pipeline.score(X_validate, y_validate))

CPU times: user 6 µs, sys: 0 ns, total: 6 µs
Wall time: 11.7 µs
RandomForestClassifier. max_features=.657, n_estimators=375, max_depth=20, min_samples_leaf=2 Training accuracy: 0.929040404040404
RandomForestClassifier, max_features=.657, n_estimators=375, max_depth=20, min_samples_leaf=2 Validation accuracy: 0.813973063973064


In [34]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(),
    RandomForestClassifier(random_state=0, n_jobs=-2, max_features=.2, n_estimators=1000, max_depth=30, min_samples_leaf=2))
    
%time
pipeline.fit(X_train, y_train)
print("RandomForestClassifier. max_features=.657, n_estimators=375, max_depth=20, min_samples_leaf=2 Training accuracy:", pipeline.score(X_train, y_train))
print("RandomForestClassifier, max_features=.657, n_estimators=375, max_depth=20, min_samples_leaf=2 Validation accuracy:", pipeline.score(X_validate, y_validate))

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 5.96 µs
RandomForestClassifier. max_features=.657, n_estimators=375, max_depth=20, min_samples_leaf=2 Training accuracy: 0.9543981481481482
RandomForestClassifier, max_features=.657, n_estimators=375, max_depth=20, min_samples_leaf=2 Validation accuracy: 0.815993265993266


In [11]:
pipeline = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(),
    RandomForestClassifier(random_state=0, n_jobs=-2, max_features=.2, n_estimators=1000, max_depth=32, min_samples_leaf=2))
    
%time
pipeline.fit(X_train, y_train)
print("RandomForestClassifier. max_features=.657, n_estimators=375, max_depth=20, min_samples_leaf=2 Training accuracy:", pipeline.score(X_train, y_train))
print("RandomForestClassifier, max_features=.657, n_estimators=375, max_depth=20, min_samples_leaf=2 Validation accuracy:", pipeline.score(X_validate, y_validate))

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 4.77 µs
RandomForestClassifier. max_features=.657, n_estimators=375, max_depth=20, min_samples_leaf=2 Training accuracy: 0.927946127946128
RandomForestClassifier, max_features=.657, n_estimators=375, max_depth=20, min_samples_leaf=2 Validation accuracy: 0.8165824915824916


In [139]:
encoder = ce.TargetEncoder(min_samples_leaf=10, smoothing=50) # Both parameters > 1 to avoid overfitting
X_train_encoded = encoder.fit_transform(X_train, y_train=='functional')
X_validate_encoded = encoder.transform(X_validate, y_validate=='functional')

In [140]:
X_train_imputed = SimpleImputer().fit_transform(X_train_encoded)
X_validate_imputed = SimpleImputer().fit_transform(X_validate_encoded)

In [10]:
import xgboost as xgb
model=xgb.XGBClassifier(random_state=1,learning_rate=0.01)

In [12]:
encoder = ce.OrdinalEncoder()
X_train_encoded = encoder.fit_transform(X_train, y_train)
X_validate_encoded = encoder.fit_transform(X_validate, y_validate)

In [13]:
X_train_imputed = SimpleImputer().fit_transform(X_train_encoded)
X_validate_imputed = SimpleImputer().fit_transform(X_validate_encoded)

In [13]:
model.fit(X_train_imputed, y_train)

print(model.score(X_train_imputed, y_train))

print(model.score(X_validate_imputed, y_validate))

0.7184132996632997
0.7187710437710437


In [14]:
model=xgb.XGBClassifier(random_state=1,learning_rate=0.1)

model.fit(X_train_imputed, y_train)

print(model.score(X_train_imputed, y_train))

print(model.score(X_validate_imputed, y_validate))

0.7482954545454545
0.7457070707070707


In [15]:
model=xgb.XGBClassifier(random_state=1,learning_rate=1)

model.fit(X_train_imputed, y_train)

print(model.score(X_train_imputed, y_train))

print(model.score(X_validate_imputed, y_validate))

0.8081018518518519
0.77996632996633


In [17]:
model=xgb.XGBClassifier(random_state=1,learning_rate=1, min_child_weight=2)

model.fit(X_train_imputed, y_train)

print(model.score(X_train_imputed, y_train))

print(model.score(X_validate_imputed, y_validate))

0.8076809764309765
0.7828282828282829


In [24]:
model = RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=32, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=2, min_samples_split=5,
                       min_weight_fraction_leaf=0.0, n_estimators=1000,
                       n_jobs=-1, oob_score=True, random_state=42, verbose=0,
                       warm_start=False)

model.fit(X_train_imputed, y_train)

print(model.score(X_train_imputed, y_train))

print(model.score(X_validate_imputed, y_validate))

0.9172558922558922
0.8163299663299664


In [14]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
param_distributions = {
    'n_estimators': [100, 250, 500, 1000],
    'max_depth': [26, 32, 36, 40],
    'min_samples_split': [1,3,5],
    'min_samples_leaf': [1,2,3,4,5],
    'max_features': [.1,.2,.3,.4,.5,.6,.7, .8,.9,1]
}

rfc = RandomForestClassifier(n_jobs=-1, random_state=8)

final_search = RandomizedSearchCV(
    estimator = rfc, 
    param_distributions=param_distributions,
    n_iter=50,
    scoring='accuracy',
    n_jobs=-1,
    cv=10,
    verbose=50,
    return_train_score=True,
    random_state=8
)

final_search.fit(X_train_imputed, y_train)

print(final_search.score(X_train_imputed, y_train))

print(final_search.score(X_validate_imputed, y_validate))

Fitting 10 folds for each of 50 candidates, totalling 500 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:   37.9s
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:   38.2s
[Parallel(n_jobs=-1)]: Done   3 tasks      | elapsed:   38.4s
[Parallel(n_jobs=-1)]: Done   4 tasks      | elapsed:   38.4s
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:   38.4s
[Parallel(n_jobs=-1)]: Done   6 tasks      | elapsed:   38.6s
[Parallel(n_jobs=-1)]: Done   7 tasks      | elapsed:   38.6s
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:   38.6s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   39.0s
[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:   39.1s
[Parallel(n_jobs=-1)]: Done  11 tasks      | elapsed:   39.3s
[Parallel(n_jobs=-1)]: Done  12 tasks      | elapsed:   39.3s
[Parallel(n_jobs=-1)]: Done  13 tasks      | elapsed:   39.3s
[Parallel(n_jobs=-1)]: Done  14 tasks      | elapsed:  



[Parallel(n_jobs=-1)]: Done  23 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done  27 tasks      | elapsed:  1.5min
[Parallel(n_jobs=-1)]: Done  28 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done  29 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done  30 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done  31 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done  32 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done  35 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done  36 tasks      | elapsed:  1.7min
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done  38 tasks      | elapsed:  1.9min
[Paralle

In [18]:
train.columns

Index(['amount_tsh', 'funder', 'gps_height', 'installer', 'longitude',
       'latitude', 'wpt_name', 'num_private', 'basin', 'subvillage', 'region',
       'region_code', 'district_code', 'lga', 'ward', 'population',
       'public_meeting', 'scheme_management', 'scheme_name', 'permit',
       'construction_year', 'extraction_type', 'extraction_type_group',
       'extraction_type_class', 'management', 'management_group', 'payment',
       'water_quality', 'quality_group', 'quantity', 'source', 'source_type',
       'source_class', 'waterpoint_type', 'waterpoint_type_group',
       'status_group', 'longitude_MISSING', 'latitude_MISSING',
       'construction_year_MISSING', 'gps_height_MISSING', 'population_MISSING',
       'year_recorded', 'month_recorded', 'day_recorded', 'years',
       'years_MISSING'],
      dtype='object')

In [20]:
train["years"]

43360     NaN
7263      3.0
2486      1.0
313       NaN
52726     NaN
         ... 
9795      2.0
58170     NaN
17191     1.0
8192     25.0
49783    29.0
Name: years, Length: 47520, dtype: float64

In [12]:
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')


In [13]:
y_pred = pipeline.predict(X_test)
submission = sample_submission.copy()
submission['status_group'] = y_pred
submission

Unnamed: 0,id,status_group
0,50785,functional
1,51630,functional
2,17168,functional
3,45559,non functional
4,49871,functional
...,...,...
14353,39307,non functional
14354,18990,functional
14355,28749,functional
14356,33492,functional


In [14]:
submission.to_csv('alex-pakalniskis-kaggle-submission-day-2.csv', index=False)