<a href="https://colab.research.google.com/github/Vanagand/DS-Unit-2-Kaggle-Challenge/blob/master/module2-random-forests/Michel_Laporte_LS_DS_221_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Lambda School Data Science

*Unit 2, Sprint 2, Module 1*

---

# Decision Trees

## Assignment
- [ ] [Sign up for a Kaggle account](https://www.kaggle.com/), if you don’t already have one. Go to our Kaggle InClass competition website. You will be given the URL in Slack. Go to the Rules page. Accept the rules of the competition. Notice that the Rules page also has instructions for the Submission process. The Data page has feature definitions.
- [ ] Do train/validate/test split with the Tanzania Waterpumps data.
- [ ] Begin with baselines for classification.
- [ ] Select features. Use a scikit-learn pipeline to encode categoricals, impute missing values, and fit a decision tree classifier.
- [ ] Get your validation accuracy score.
- [ ] Get and plot your feature importances.
- [ ] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [ ] Commit your notebook to your fork of the GitHub repo.
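
For the baseline step, one common starting point is the majority-class baseline: always predict the most frequent class and check what accuracy that gives. The sketch below uses a small made-up label series (`y_toy` is an assumption for illustration, not the competition data); with the real data you would use `train['status_group']` instead.

```python
import pandas as pd

# Toy labels standing in for train['status_group'] (values are made up)
y_toy = pd.Series(['functional'] * 5
                  + ['non functional'] * 4
                  + ['functional needs repair'])

# Class frequencies as proportions, sorted from most to least frequent
class_shares = y_toy.value_counts(normalize=True)

# The majority class and the accuracy of always guessing it
majority_class = class_shares.index[0]
baseline_accuracy = class_shares.iloc[0]

print(majority_class, baseline_accuracy)
```

Any model worth keeping should beat this number on the validation set.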


## Stretch Goals

### Reading

- A Visual Introduction to Machine Learning
  - [Part 1: A Decision Tree](http://www.r2d3.us/visual-intro-to-machine-learning-part-1/)
  - [Part 2: Bias and Variance](http://www.r2d3.us/visual-intro-to-machine-learning-part-2/)
- [Decision Trees: Advantages & Disadvantages](https://christophm.github.io/interpretable-ml-book/tree.html#advantages-2)
- [How a Russian mathematician constructed a decision tree — by hand — to solve a medical problem](http://fastml.com/how-a-russian-mathematician-constructed-a-decision-tree-by-hand-to-solve-a-medical-problem/)
- [How decision trees work](https://brohrer.github.io/how_decision_trees_work.html)
- [Let’s Write a Decision Tree Classifier from Scratch](https://www.youtube.com/watch?v=LDRbO9a6XPU) — _Don’t worry about understanding the code, just get introduced to the concepts. This 10 minute video has excellent diagrams and explanations._
- [Random Forests for Complete Beginners: The definitive guide to Random Forests and Decision Trees](https://victorzhou.com/blog/intro-to-random-forests/)


### Doing
- [ ] Add your own stretch goal(s)!
- [ ] Define a function to wrangle train, validate, and test sets in the same way. Clean outliers and engineer features. (For example, [what columns have zeros and shouldn't?](https://github.com/Quartz/bad-data-guide#zeros-replace-missing-values) What columns are duplicates, or nearly duplicates? Can you extract the year from date_recorded? Can you engineer new features, such as the number of years from waterpump construction to waterpump inspection?)
- [ ] Try other [scikit-learn imputers](https://scikit-learn.org/stable/modules/impute.html).
- [ ] Make exploratory visualizations and share on Slack.


#### Exploratory visualizations

Visualize the relationships between feature(s) and target. I recommend you do this with your training set, after splitting your data. 

For this problem, you may want to create a new column to represent the target as a number, 0 or 1. For example:

```python
train['functional'] = (train['status_group']=='functional').astype(int)
```



You can try [Seaborn "Categorical estimate" plots](https://seaborn.pydata.org/tutorial/categorical.html) for features with reasonably few unique values. (With too many unique values, the plot is unreadable.)

- Categorical features. (If there are too many unique values, you can replace less frequent values with "OTHER.")
- Numeric features. (If there are too many unique values, you can [bin with pandas cut / qcut functions](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html?highlight=qcut#discretization-and-quantiling).)
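
As a rough sketch of the binning idea, the snippet below cuts a numeric feature into quartiles with `pd.qcut` and computes the share of functional pumps per bin, which is the same per-category mean a Seaborn bar plot estimates by default. The frame `toy` and its values are made up for illustration; only the column names mirror the waterpumps data.

```python
import pandas as pd

# Toy frame with made-up values; column names mirror the waterpumps data
toy = pd.DataFrame({
    'population': [1, 5, 20, 50, 100, 150, 300, 500],
    'functional': [0, 0, 1, 0, 1, 1, 1, 1],
})

# Split population into 4 equal-sized bins (quartiles)
toy['population_bin'] = pd.qcut(toy['population'], q=4)

# Share of functional pumps per bin
functional_rate = toy.groupby('population_bin', observed=True)['functional'].mean()
print(functional_rate)
```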

You can try [Seaborn linear model plots](https://seaborn.pydata.org/tutorial/regression.html) with numeric features. For this classification problem, you may want to use the parameter `logistic=True`, but it can be slow.

You do _not_ need to use Seaborn, but it's nice because it includes confidence intervals to visualize uncertainty.

#### High-cardinality categoricals

This code from a previous assignment demonstrates how to replace less frequent values with 'OTHER':

```python
# Reduce cardinality for NEIGHBORHOOD feature ...

# Get a list of the top 10 neighborhoods
top10 = train['NEIGHBORHOOD'].value_counts()[:10].index

# At locations where the neighborhood is NOT in the top 10,
# replace the neighborhood with 'OTHER'
train.loc[~train['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
test.loc[~test['NEIGHBORHOOD'].isin(top10), 'NEIGHBORHOOD'] = 'OTHER'
```


In [0]:
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/data/'
    !pip install category_encoders==2.*
    !pip install pandas-profiling==2.*

# If you're working locally:
else:
    DATA_PATH = '../data/'



In [0]:
import pandas as pd
from sklearn.model_selection import train_test_split

train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

# Split train into train & val
train, val = train_test_split(train, train_size=0.80, test_size=0.20, 
                              stratify=train['status_group'], random_state=42)

train.shape, val.shape, test.shape

((47520, 41), (11880, 41), (14358, 40))

In [0]:
# Check Pandas Profiling version
# import pandas_profiling
# pandas_profiling.__version__

'2.5.0'

In [0]:
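from pandas_profiling import ProfileReport
profile = ProfileReport(train, minimal=True)

# to_notebook_iframe() displays the report and returns None,
# so keep the report object itself in `profile`
profile.to_notebook_iframe()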

In [0]:
val['funder'].describe()

count                      11149
unique                       857
top       Government Of Tanzania
freq                        1763
Name: funder, dtype: object

In [0]:
import numpy as np
from sklearn.model_selection import train_test_split

# Merge train_features.csv & train_labels.csv
train = pd.merge(pd.read_csv(DATA_PATH+'waterpumps/train_features.csv'), 
                 pd.read_csv(DATA_PATH+'waterpumps/train_labels.csv'))

# Read test_features.csv & sample_submission.csv
test = pd.read_csv(DATA_PATH+'waterpumps/test_features.csv')
sample_submission = pd.read_csv(DATA_PATH+'waterpumps/sample_submission.csv')

# Split train into train & val
train, val = train_test_split(train, train_size=0.80, test_size=0.20, 
                              stratify=train['status_group'], random_state=42)


def wrangle(X):
    """Wrangle train, validate, and test sets in the same way"""
    
    # Prevent SettingWithCopyWarning
    X = X.copy()
    
    # About 3% of the time, latitude has small values near zero,
    # outside Tanzania, so we'll treat these values like zero.
    X['latitude'] = X['latitude'].replace(-2e-08, 0)
    
    # When columns have zeros and shouldn't, they are like null values.
    # So we will replace the zeros with nulls, and impute missing values later.
    # Also create a "missing indicator" column, because the fact that
    # values are missing may be a predictive signal.
    cols_with_zeros = ['longitude', 'latitude', 'construction_year', 
                       'gps_height', 'population']
    for col in cols_with_zeros:
        X[col] = X[col].replace(0, np.nan)
        X[col+'_MISSING'] = X[col].isnull()
            
    # Drop duplicate columns
    duplicates = ['quantity_group', 'payment_type']
    X = X.drop(columns=duplicates)
    
    # Drop recorded_by (never varies) and id (always varies, random)
    unusable_variance = ['recorded_by', 'id']
    X = X.drop(columns=unusable_variance)
    
    # Convert date_recorded to datetime
    X['date_recorded'] = pd.to_datetime(X['date_recorded'], infer_datetime_format=True)
    
    # Extract components from date_recorded, then drop the original column
    X['year_recorded'] = X['date_recorded'].dt.year
    X['month_recorded'] = X['date_recorded'].dt.month
    X['day_recorded'] = X['date_recorded'].dt.day
    X = X.drop(columns='date_recorded')
    
    # Engineer feature: how many years from construction_year to date_recorded
    X['years'] = X['year_recorded'] - X['construction_year']
    X['years_MISSING'] = X['years'].isnull()
    
    # return the wrangled dataframe
    return X

In [0]:
train = wrangle(train)
val = wrangle(val)
test = wrangle(test)
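
To make the zero-replacement step inside `wrangle` concrete, here is a small sketch on a made-up frame. The values in `toy` are assumptions for illustration only; the loop mirrors the one in the function above.

```python
import numpy as np
import pandas as pd

# Made-up rows; zeros stand in for values that were never recorded
toy = pd.DataFrame({'construction_year': [1998, 0, 2005],
                    'population': [0, 120, 0]})

for col in ['construction_year', 'population']:
    # Turn placeholder zeros into NaN so an imputer can fill them later
    toy[col] = toy[col].replace(0, np.nan)
    # Record where the value was missing, since that may itself be a signal
    toy[col + '_MISSING'] = toy[col].isnull()

print(toy)
```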

In [0]:
# The status_group column is the target
target = 'status_group'

# Get a dataframe with all train columns except the target
train_features = train.drop(columns=[target])

# Get a list of the numeric features
numeric_features = train_features.select_dtypes(include='number').columns.tolist()

# Get a series with the cardinality of the nonnumeric features
cardinality = train_features.select_dtypes(exclude='number').nunique()

# Keep all categorical features. The cardinality <= 50 filter is disabled;
# the high-cardinality features are reduced to their top 50 values in a later cell.
# categorical_features = cardinality[cardinality <= 50].index.tolist()
categorical_features = cardinality.index.tolist()

# Combine the lists 
features = numeric_features + categorical_features

In [0]:
train['funder'].value_counts().head(50)

Government Of Tanzania            7321
Danida                            2491
Hesawa                            1760
Rwssp                             1107
World Bank                        1058
Kkkt                              1019
World Vision                      1001
Unicef                             840
Tasaf                              719
District Council                   684
Dhv                                673
Private Individual                 659
Dwsp                               645
Norad                              614
0                                  607
Tcrs                               485
Germany Republi                    481
Water                              468
Ministry Of Water                  450
Dwe                                390
Netherlands                        385
Lga                                361
Hifab                              356
Amref                              345
Adb                                345
Fini Water               

In [0]:
# Recover the high-cardinality features that a cardinality <= 50 filter would drop

# Reduce cardinality: keep the 50 most frequent training values of each
# feature and replace everything else with 'else'
high_cardinality = ['funder', 'installer', 'wpt_name', 'subvillage',
                    'lga', 'ward', 'scheme_name']

for col in high_cardinality:
    # Learn the top values from train only, then apply them to all three sets
    top_values = train[col].value_counts()[:50].index
    for df in (train, val, test):
        df.loc[~df[col].isin(top_values), col] = 'else'

In [0]:
X_train = train[features]
y_train = train[target]

X_val = val[features]
y_val = val[target]

X_test = test[features]

In [0]:
X_train.shape, X_val.shape

((47520, 45), (11880, 45))

In [0]:
cardinality[cardinality > 50].index.tolist()

['funder', 'installer', 'wpt_name', 'subvillage', 'lga', 'ward', 'scheme_name']

In [0]:
import category_encoders as ce
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
import xgboost as xgb
from xgboost import XGBClassifier
from xgboost.sklearn import XGBRegressor

import warnings
from sklearn.exceptions import DataConversionWarning, FitFailedWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)
warnings.filterwarnings(action='ignore', category=FitFailedWarning)

# Whiteboard

In [0]:
pipeline = make_pipeline(
    #ce.OneHotEncoder(use_cat_names=True),
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    StandardScaler(),
    RandomForestClassifier(bootstrap=True, max_depth=50, max_features='auto', min_samples_leaf= 1, min_samples_split= 5, n_estimators=800, random_state=42, criterion='entropy')
)

pipeline.fit(X_train, y_train)

print('Train Accuracy', pipeline.score(X_train, y_train))
print('Validation Accuracy', pipeline.score(X_val, y_val))

# RandomForestClassifier(bootstrap=True, max_depth=50, max_features='auto', min_samples_leaf= 1, min_samples_split= 5, n_estimators=800, random_state=42, criterion='entropy')
# Train Accuracy 0.960206228956229
# Validation Accuracy 0.817929292929293

Train Accuracy 0.960206228956229
Validation Accuracy 0.817929292929293
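
The assignment also asks for feature importances. One way to get them from a fitted pipeline is through `named_steps`, sketched below on synthetic data. The toy frame, the `feature_i` column names, and the small forest are assumptions for illustration; this is not the notebook's actual pipeline.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data; the feature names are hypothetical
X_arr, y_toy = make_classification(n_samples=200, n_features=5,
                                   n_informative=3, random_state=42)
X_toy = pd.DataFrame(X_arr, columns=[f'feature_{i}' for i in range(5)])

toy_pipeline = make_pipeline(
    SimpleImputer(strategy='median'),
    RandomForestClassifier(n_estimators=50, random_state=42),
)
toy_pipeline.fit(X_toy, y_toy)

# make_pipeline names each step after its lowercased class name
forest = toy_pipeline.named_steps['randomforestclassifier']

# Pair each importance with its column name and sort for plotting
importances = pd.Series(forest.feature_importances_, index=X_toy.columns)
importances = importances.sort_values()
print(importances)

# With matplotlib available: importances.plot.barh()
```

With the pipeline above, the ordinal encoder keeps one column per feature, so `X_train.columns` should line up with the importances, provided the imputer does not drop any all-missing column.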


In [0]:
# Bag 1000 small random forests (10 trees each)
base_forest = RandomForestClassifier(bootstrap=True, max_depth=50, max_features='auto', min_samples_leaf=1, min_samples_split=5, n_estimators=10, random_state=42, criterion='entropy', n_jobs=None)
bagging01 = BaggingClassifier(base_estimator=base_forest, n_estimators=1000, random_state=42, n_jobs=None)

pipeline = make_pipeline(
    #ce.OneHotEncoder(use_cat_names=True),
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    StandardScaler(),
    bagging01
)

pipeline.fit(X_train, y_train)

print('Train Accuracy', pipeline.score(X_train, y_train))
print('Validation Accuracy', pipeline.score(X_val, y_val))

Train Accuracy 0.9107112794612795
Validation Accuracy 0.8156565656565656


In [0]:
n_estimators = range(10, 1000, 200)
criterion = ['gini', 'entropy']
max_features = ['auto', 'sqrt', 'log2']
max_depth = [2, 4, 8, 16, 32, 64]
min_samples_split = [2, 4, 6, 8, 10]
min_samples_leaf = [2, 4, 6, 8, 10]
random_state = range(1, 42, 1)
bootstrap = [True, False]

random_grid = {'n_estimators': n_estimators,
               'criterion' : criterion,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'random_state' : random_state,
               'bootstrap': bootstrap}
print(random_grid)

# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestClassifier()
# Random search of parameters, using 3-fold cross-validation,
# sampling 1000 combinations on a single core (n_jobs=None)
rf_random = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 1000, cv = 3, verbose=2, random_state=42, n_jobs=None)
# Fit the random search model
rf_random.fit(X_train, y_train)

In [0]:
rf_random.best_params_

{'bootstrap': True,
 'criterion': 'entropy',
 'max_depth': 4,
 'max_features': 'log2',
 'min_samples_leaf': 8,
 'min_samples_split': 10,
 'n_estimators': 810,
 'random_state': 25}

In [0]:
n_estimators = range(400, 1000, 100)
criterion = ['entropy']
max_features = ['auto', 'sqrt', 'log2']
max_depth = [4]
min_samples_split = [8]
min_samples_leaf = [10]
random_state = range(1, 42, 1)
bootstrap = [True]

random_grid = {'n_estimators': n_estimators,
               'criterion' : criterion,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'random_state' : random_state,
               'bootstrap': bootstrap}
print(random_grid)

# Use the random grid to search for best hyperparameters
# First create the base model to tune
rf = RandomForestClassifier()
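# Random search of parameters, using 3-fold cross-validation on a single
# core (n_jobs=None). n_iter=1000 exceeds the 738 combinations in this grid,
# so every combination is tried.
rf_random02 = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, n_iter = 1000, cv = 3, verbose=2, random_state=42, n_jobs=None)
# Fit the random search model
# NOTE: X_train still holds unencoded categoricals and missing values, which is
# the likely cause of the near-instant fits and the ValueError in the output
rf_random02.fit(X_train, y_train)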

{'n_estimators': range(400, 1000, 100), 'criterion': ['entropy'], 'max_features': ['auto', 'sqrt', 'log2'], 'max_depth': [4], 'min_samples_split': [8], 'min_samples_leaf': [10], 'random_state': range(1, 42), 'bootstrap': [True]}
Fitting 3 folds for each of 738 candidates, totalling 2214 fits
[CV] random_state=1, n_estimators=400, min_samples_split=8, min_samples_leaf=10, max_features=auto, max_depth=4, criterion=entropy, bootstrap=True 
[CV]  random_state=1, n_estimators=400, min_samples_split=8, min_samples_leaf=10, max_features=auto, max_depth=4, criterion=entropy, bootstrap=True, total=   0.0s
[CV] random_state=1, n_estimators=400, min_samples_split=8, min_samples_leaf=10, max_features=auto, max_depth=4, criterion=entropy, bootstrap=True 
[CV]  random_state=1, n_estimators=400, min_samples_split=8, min_samples_leaf=10, max_features=auto, max_depth=4, criterion=entropy, bootstrap=True, total=   0.1s
[CV] random_state=1, n_estimators=400, min_samples_split=8, min_samples_leaf=10, max_

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.1s remaining:    0.0s


[CV]  random_state=1, n_estimators=400, min_samples_split=8, min_samples_leaf=10, max_features=auto, max_depth=4, criterion=entropy, bootstrap=True, total=   0.1s
[CV] random_state=2, n_estimators=400, min_samples_split=8, min_samples_leaf=10, max_features=auto, max_depth=4, criterion=entropy, bootstrap=True 
[CV]  random_state=2, n_estimators=400, min_samples_split=8, min_samples_leaf=10, max_features=auto, max_depth=4, criterion=entropy, bootstrap=True, total=   0.1s
[CV] random_state=2, n_estimators=400, min_samples_split=8, min_samples_leaf=10, max_features=auto, max_depth=4, criterion=entropy, bootstrap=True 
[CV]  random_state=2, n_estimators=400, min_samples_split=8, min_samples_leaf=10, max_features=auto, max_depth=4, criterion=entropy, bootstrap=True, total=   0.0s
[CV] random_state=2, n_estimators=400, min_samples_split=8, min_samples_leaf=10, max_features=auto, max_depth=4, criterion=entropy, bootstrap=True 
[CV]  random_state=2, n_estimators=400, min_samples_split=8, min_sa

[Parallel(n_jobs=1)]: Done 2214 out of 2214 | elapsed:  1.9min finished


ValueError: ignored

In [0]:
rf_random02.best_params_

{'bootstrap': True,
 'criterion': 'entropy',
 'max_depth': 4,
 'max_features': 'auto',
 'min_samples_leaf': 10,
 'min_samples_split': 8,
 'n_estimators': 400,
 'random_state': 1}
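
One way to let the imputation (and, in the real notebook, the encoding) run inside each cross-validation fold is to search over the whole pipeline and address the forest's hyperparameters with the `<step name>__<parameter>` prefix. The following is a minimal sketch on synthetic data. The toy frame, the column names, and the deliberately small grid are assumptions for illustration, not tuned settings for the waterpumps problem.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data with a few missing values (not the waterpumps data)
X_arr, y_toy = make_classification(n_samples=150, n_features=6,
                                   n_informative=3, random_state=0)
X_toy = pd.DataFrame(X_arr, columns=[f'feature_{i}' for i in range(6)])
X_toy.iloc[::10, 0] = np.nan

toy_pipeline = make_pipeline(
    SimpleImputer(strategy='median'),
    RandomForestClassifier(random_state=42),
)

# Keys follow the '<step name>__<parameter>' convention
param_distributions = {
    'randomforestclassifier__n_estimators': [10, 20, 30],
    'randomforestclassifier__max_depth': [2, 4, 8],
    'randomforestclassifier__min_samples_leaf': [1, 5],
}

search = RandomizedSearchCV(
    toy_pipeline,
    param_distributions=param_distributions,
    n_iter=5,          # number of sampled combinations
    cv=3,              # 3-fold cross-validation
    random_state=42,
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```

After fitting, `search.best_estimator_` is the whole refitted pipeline, so it can be scored on validation data or used for predictions directly.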

In [0]:
# Save test predictions as a .csv:
# (use a separate name so the wrangled `test` dataframe is not overwritten)
submission = sample_submission.copy()

submission['status_group'] = pipeline.predict(X_test)
submission[['id', 'status_group']].to_csv('test.csv', index=False)