Lambda School Data Science, Unit 2: Predictive Modeling

# Kaggle Challenge, Module 3

## Assignment
- [X] [Review requirements for your portfolio project](https://lambdaschool.github.io/ds/unit2/portfolio-project/ds6), then choose your dataset, and [submit this form](https://forms.gle/nyWURUg65x1UTRNV9), due today at 4pm Pacific.
- [X] Continue to participate in our Kaggle challenge.
- [X] Try xgboost.
- [X] Get your model's permutation importances.
- [X] Try feature selection with permutation importances.
- [X] Submit your predictions to our Kaggle competition. (Go to our Kaggle InClass competition webpage. Use the blue **Submit Predictions** button to upload your CSV file. Or you can use the Kaggle API to submit your predictions.)
- [X] Commit your notebook to your fork of the GitHub repo.

## Stretch Goals

### Doing
- [ ] Add your own stretch goal(s) !
- [X] Do more exploratory data analysis, data cleaning, feature engineering, and feature selection.
- [ ] Try other categorical encodings.
- [ ] Try other Python libraries for gradient boosting.
- [X] Look at the bonus notebook in the repo, about monotonic constraints with gradient boosting.
- [ ] Make visualizations and share on Slack.

### Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - _**[A Kaggle Master Explains Gradient Boosting](http://blog.kaggle.com/2017/01/23/a-kaggle-master-explains-gradient-boosting/)**_
  - [_An Introduction to Statistical Learning_](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf) Chapter 8
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html)
  - _**[Boosting](https://www.youtube.com/watch?v=GM3CDQfQ4sw) (2.5 minute video)**_

#### Categorical encoding for trees
- [Are categorical variables getting lost in your random forests?](https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/)
- [Beyond One-Hot: An Exploration of Categorical Variables](http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/)
- _**[Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)**_
- _**[Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)**_
- [Mean (likelihood) encodings: a comprehensive study](https://www.kaggle.com/vprokopev/mean-likelihood-encodings-a-comprehensive-study)
- [The Mechanics of Machine Learning, Chapter 6: Categorically Speaking](https://mlbook.explained.ai/catvars.html)

#### Imposter Syndrome
- [Effort Shock and Reward Shock (How The Karate Kid Ruined The Modern World)](http://www.tempobook.com/2014/07/09/effort-shock-and-reward-shock/)
- [How to manage impostor syndrome in data science](https://towardsdatascience.com/how-to-manage-impostor-syndrome-in-data-science-ad814809f068)
- ["I am not a real data scientist"](https://brohrer.github.io/imposter_syndrome.html)
- _**[Imposter Syndrome in Data Science](https://caitlinhudon.com/2018/01/19/imposter-syndrome-in-data-science/)**_






### Python libraries for Gradient Boosting
- [scikit-learn Gradient Tree Boosting](https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting) — slower than other libraries, but [the new version may be better](https://twitter.com/amuellerml/status/1129443826945396737)
  - Anaconda: already installed
  - Google Colab: already installed
- [xgboost](https://xgboost.readthedocs.io/en/latest/) — can accept missing values and enforce [monotonic constraints](https://xiaoxiaowang87.github.io/monotonicity_constraint/)
  - Anaconda, Mac/Linux: `conda install -c conda-forge xgboost`
  - Windows: `conda install -c anaconda py-xgboost`
  - Google Colab: already installed
- [LightGBM](https://lightgbm.readthedocs.io/en/latest/) — can accept missing values and enforce [monotonic constraints](https://blog.datadive.net/monotonicity-constraints-in-machine-learning/)
  - Anaconda: `conda install -c conda-forge lightgbm`
  - Google Colab: already installed
- [CatBoost](https://catboost.ai/) — can accept missing values and use [categorical features](https://catboost.ai/docs/concepts/algorithm-main-stages_cat-to-numberic.html) without preprocessing
  - Anaconda: `conda install -c conda-forge catboost`
  - Google Colab: `pip install catboost`

### Categorical Encodings

**1.** The article **[Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)** mentions 4 encodings:

- **"Categorical Encoding":** This means using the raw categorical values as-is, not encoded. Scikit-learn doesn't support this, but some tree algorithm implementations do. For example, [Catboost](https://catboost.ai/), or R's [rpart](https://cran.r-project.org/web/packages/rpart/index.html) package.
- **Numeric Encoding:** Synonymous with Label Encoding, or "Ordinal" Encoding with random order. We can use [category_encoders.OrdinalEncoder](https://contrib.scikit-learn.org/categorical-encoding/ordinal.html).
- **One-Hot Encoding:** We can use [category_encoders.OneHotEncoder](http://contrib.scikit-learn.org/categorical-encoding/onehot.html).
- **Binary Encoding:** We can use [category_encoders.BinaryEncoder](http://contrib.scikit-learn.org/categorical-encoding/binary.html).


**2.** The short video 
**[Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)** introduces an interesting idea: use both X _and_ y to encode categoricals.

Category Encoders has multiple implementations of this general concept:

- [CatBoost Encoder](http://contrib.scikit-learn.org/categorical-encoding/catboost.html)
- [James-Stein Encoder](http://contrib.scikit-learn.org/categorical-encoding/jamesstein.html)
- [Leave One Out](http://contrib.scikit-learn.org/categorical-encoding/leaveoneout.html)
- [M-estimate](http://contrib.scikit-learn.org/categorical-encoding/mestimate.html)
- [Target Encoder](http://contrib.scikit-learn.org/categorical-encoding/targetencoder.html)
- [Weight of Evidence](http://contrib.scikit-learn.org/categorical-encoding/woe.html)

Category Encoder's mean encoding implementations work for regression problems or binary classification problems. 

For multi-class classification problems, you will need to temporarily reformulate it as binary classification. For example:

```python
encoder = ce.TargetEncoder(min_samples_leaf=..., smoothing=...) # Both parameters > 1 to avoid overfitting
X_train_encoded = encoder.fit_transform(X_train, y_train=='functional')
X_val_encoded = encoder.transform(X_train, y_val=='functional')
```

**3.** The **[dirty_cat](https://dirty-cat.github.io/stable/)** library has a Target Encoder implementation that works with multi-class classification.

```python
 dirty_cat.TargetEncoder(clf_type='multiclass-clf')
```
It also implements an interesting idea called ["Similarity Encoder" for dirty categories](https://www.slideshare.net/GaelVaroquaux/machine-learning-on-non-curated-data-154905090).

However, it seems like dirty_cat doesn't handle missing values or unknown categories as well as category_encoders does. And you may need to use it with one column at a time, instead of with your whole dataframe.

**4. [Embeddings](https://www.kaggle.com/learn/embeddings)** can work well with sparse / high cardinality categoricals.

_**I hope it’s not too frustrating or confusing that there’s not one “canonical” way to encode categorcals. It’s an active area of research and experimentation! Maybe you can make your own contributions!**_

In [350]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Merge train_features.csv & train_labels.csv
train = pd.merge(pd.read_csv('../data/tanzania/train_features.csv'), 
                 pd.read_csv('../data/tanzania/train_labels.csv'))

# Read test_features.csv & sample_submission.csv
test = pd.read_csv('../data/tanzania/test_features.csv')
sample_submission = pd.read_csv('../data/tanzania/sample_submission.csv')

In [346]:
### Select features, scikit-learn pipeline to encode categoricals, impute missing values, and fit a decision tree classifier.
import category_encoders as ce

import xgboost as xgb
from xgboost import XGBClassifier

from scipy.stats import randint, uniform

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, IterativeImputer
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, RandomizedSearchCV

import eli5
from eli5.sklearn import PermutationImportance

### Wrangle function
def wrangle(df):
    """I heard you like docstrings"""
    import numpy as np
    
    ### Easy stuff
    # Copy the dataframe to itself so warnings can shsshshsh
    df = df.copy()
    
    df['latitude'] = df['latitude'].replace(-2e-08,0)
    #status_map = {'functional':2,'functional needs repair':1,'non functional':0}
    #df['status_group'] = df['status_group'].map(status_map)
    ### Feature creation
    df['date_recorded'] = pd.to_datetime(df['date_recorded'],infer_datetime_format=True)
    df['year_recorded'] = df['date_recorded'].dt.year
    df['month_recorded'] = df['date_recorded'].dt.month
    df['day_recorded'] = df['date_recorded'].dt.day
    
    # Drop dupes
    df = df.drop(columns=['id','date_recorded','quantity_group',
                          'scheme_management','quality_group',
                          'waterpoint_type_group','extraction_type_group',
                          'payment_type'])
    ### Replacing zeroes and such
    noZeroes = ['longitude','latitude','subvillage',
                'installer',
               ## BELOW
                'population','construction_year','gps_height']
    for col in noZeroes:
        df[col] = df[col].replace(0,np.nan)
        df[col] = df[col].replace('0',np.nan)
        df[col+'_MISSING'] = df[col].isnull()
    
    # Ryan's Engineer feature: how many years from construction_year to date_recorded
    df['years'] = df['year_recorded'] - df['construction_year']
    df['years_MISSING'] = df['years'].isnull()
    ## My feature engineering stuff
    # Seasons
    df['season'] = df['month_recorded']
    df['season'].replace([3,4,5,6,7,8], 'hot', inplace=True)
    df['season'].replace([9,10,11,12,1,2], 'cool', inplace=True)

    # YEEHAW dataframe return
    return df

### Reduce cardinality function for training and validation dataframesdef reduceCard(df1,df2,col,amt):
def reduceCard(df1,df2,col,amt):
    df1[col] = df1[col].astype(str)
    df2[col] = df2[col].astype(str)
    listoftop = df1[col].value_counts()[:amt].index
    df1.loc[~df1[col].isin(listoftop),col] = 'other'
    df2.loc[~df2[col].isin(listoftop),col] = 'other'
    return df1, df2

### Do this cool permutation importance thing
def permutationCreator(X_train, X_val,
                      y_train, y_val):
    ## Transform manually cause pipelines don't work
    transformers = make_pipeline(
        ce.OrdinalEncoder(),
        SimpleImputer(strategy='mean')
    )
    X_train_t = transformers.fit_transform(X_train)
    X_val_t = transformers.transform(X_val)
    
    ## Create model and fit
    # n_esties 550
    model = RandomForestClassifier(n_estimators=200,random_state=4,n_jobs=-1)
    model.fit(X_train_t, y_train)
    
    ## Fit permutation stuff
    permuter = PermutationImportance(
        model,
        scoring = 'accuracy',
        n_iter = 2,
        random_state = 42
    )
    permuter.fit(X_val_t,y_val)
    features = X_train.columns.tolist()
    return permuter, features

### Get that cool green/red graph
def permutationColor(permuter, features):
    x = eli5.show_weights(
        permuter, feature_names=features, top=None
    )
    return x

### Prune by feature importance
def featureImportance(train, val, permuter, min_imp):
    mask = permuter.feature_importances_ > min_imp
    features = train.columns[mask]
    X_train, X_val = train[features], val[features]
    
    return X_train, X_val, features

###########################################################
### Pipeline to pass to cross validation
cv_pip = make_pipeline(
    # All params are going to be in a separate dictionary below X_train
    ce.TargetEncoder(),
    SimpleImputer(),
    SelectKBest(),
    Ridge()
)

### Logistic Regression w/ One-hot encoder
lg = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True),
    ## Imputer completes missing values
    #SimpleImputer(),
    IterativeImputer(),
    ## Scaler scales to unit variance, removing the mean(?)
    StandardScaler(),
    LogisticRegression(solver='lbfgs',multi_class='auto',max_iter=1000)
)

### Decision trees w/ One-hot Encoder
dt = make_pipeline(
    ## One hot encode
    ce.OneHotEncoder(use_cat_names=True),
    #SimpleImputer(),
    IterativeImputer(),
    DecisionTreeClassifier(min_samples_leaf=25,random_state=42)
)

##### !!!!
### Random Forest w/ Ordinal Encoder
rf = make_pipeline(
    ce.OrdinalEncoder(),
    #ce.OneHotEncoder(use_cat_names=True),
    SimpleImputer(strategy='mean'),
    # n_estimators is # of trees in forest
    ## n_estimators=650
    RandomForestClassifier(n_estimators=1000,random_state=4,n_jobs=-1)
)

### Gradient Booster
gb = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='median'),
    GradientBoostingClassifier()
)

### XGB Classifier!!!
### (does not have a score method: use accuracy_score)
xgb = make_pipeline(
    ce.OrdinalEncoder(),
    # No special Early Stopping
    XGBClassifier(n_estimators=100,n_jobs=-1)
)

In [339]:
### Cross-validation
target = 'status_group'

# Assign variables
X_train = wrangle(train)
y_train = X_train[target]

#X_test = wrangle(test)
#status_map = {'functional':1,'non functional':0}
#df['status_group'] = df['status_group'].map(status_map)

param_dist = {
    'simpleimputer__strategy': ['mean','median'],
    'selectkbest__k': range(1,len(X_train.columns)+1),
    'ridge__alpha': uniform(1,10)
}

search = RandomizedSearchCV(
    cv_pip,
    param_distributions=param_dist,
    # Randomly try 100 options sampling
    n_iter=100,
    cv=5,
    scoring='neg_mean_absolute_error',
    verbose=10,
    return_train_score=True,
    n_jobs=-1
)
search.fit(X_train,y_train)

Fitting 5 folds for each of 100 candidates, totalling 500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   1 tasks      | elapsed:    8.9s
[Parallel(n_jobs=-1)]: Done   8 tasks      | elapsed:   11.3s
[Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:   18.6s
[Parallel(n_jobs=-1)]: Done  26 tasks      | elapsed:   26.7s
[Parallel(n_jobs=-1)]: Done  37 tasks      | elapsed:   35.5s
[Parallel(n_jobs=-1)]: Done  48 tasks      | elapsed:   39.0s
[Parallel(n_jobs=-1)]: Done  61 tasks      | elapsed:   52.2s
[Parallel(n_jobs=-1)]: Done  74 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done  89 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: Done 104 tasks      | elapsed:  1.4min
[Parallel(n_jobs=-1)]: Done 121 tasks      | elapsed:  1.6min
[Parallel(n_jobs=-1)]: Done 138 tasks      | elapsed:  1.8min
[Parallel(n_jobs=-1)]: Done 157 tasks      | elapsed:  2.1min
[Parallel(n_jobs=-1)]: Done 176 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 197 tasks      | elapsed:  

RandomizedSearchCV(cv=5, error_score='raise-deprecating',
                   estimator=Pipeline(memory=None,
                                      steps=[('targetencoder',
                                              TargetEncoder(cols=None,
                                                            drop_invariant=False,
                                                            handle_missing='value',
                                                            handle_unknown='value',
                                                            min_samples_leaf=1,
                                                            return_df=True,
                                                            smoothing=1.0,
                                                            verbose=0)),
                                             ('simpleimputer',
                                              SimpleImputer(add_indicator=False,
                                                           

In [341]:
pd.DataFrame(search.cv_results_).sort_values(by='rank_test_score').head()

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_ridge__alpha,param_selectkbest__k,param_simpleimputer__strategy,params,split0_test_score,split1_test_score,...,mean_test_score,std_test_score,rank_test_score,split0_train_score,split1_train_score,split2_train_score,split3_train_score,split4_train_score,mean_train_score,std_train_score
53,6.741357,0.186261,0.475346,0.039904,1.12062,1,median,"{'ridge__alpha': 1.1206150100968428, 'selectkb...",-2.4e-05,-2.4e-05,...,-2.4e-05,5.210146e-08,1,-2.4e-05,-2.4e-05,-2.4e-05,-2.4e-05,-2.4e-05,-2.4e-05,3.351569e-09
80,6.563229,0.240826,0.498597,0.060254,1.08522,3,mean,"{'ridge__alpha': 1.0852214193152456, 'selectkb...",-3e-05,-3e-05,...,-3e-05,1.139385e-07,2,-2.5e-05,-2.5e-05,-2.5e-05,-2.5e-05,-2.5e-05,-2.5e-05,1.880905e-08
45,6.334849,0.435081,0.481739,0.053925,1.10609,9,mean,"{'ridge__alpha': 1.1060927973862935, 'selectkb...",-3.2e-05,-3.2e-05,...,-3.2e-05,1.313771e-07,3,-2.8e-05,-2.8e-05,-2.8e-05,-2.8e-05,-2.8e-05,-2.8e-05,2.900468e-08
66,7.621224,0.404084,0.499595,0.052229,1.08393,38,median,"{'ridge__alpha': 1.0839300928310494, 'selectkb...",-3.3e-05,-3.3e-05,...,-3.3e-05,2.089458e-07,4,-2.8e-05,-2.8e-05,-2.8e-05,-2.8e-05,-2.8e-05,-2.8e-05,3.183432e-08
20,6.710138,0.197392,0.547614,0.046214,1.35027,2,median,"{'ridge__alpha': 1.3502736948261675, 'selectkb...",-3.6e-05,-3.7e-05,...,-3.7e-05,2.179493e-07,5,-3e-05,-3e-05,-3e-05,-3e-05,-3e-05,-3e-05,1.611511e-08


In [356]:
pipeline = search.best_estimator_

In [357]:
X_test = wrangle(test)
#y_test = X_test[target]
#status_map = {2:'functional',1:'functional needs repair',0:'non functional'}
#y_test['status_group'] = y_test['status_group'].map(status_map)

y_pred = pipeline.predict(X_test)

ValueError: Unexpected input dimension 45, expected 46

In [None]:
''' FOR NON-CROSSVALIDATION
# Split data
train, val = train_test_split(train, train_size=0.80, test_size=0.2,
                              stratify=train['status_group'],random_state=42)
# Assign variables
train = wrangle(train)
val = wrangle(val)
test = wrangle(test)

#__________________________
### Assign target(s) and feature(s) [old habits die hard]
target = 'status_group'
features = _

X_train = train.drop(columns=target)
y_train = train[target]
X_val = val.drop(columns=target)
y_val = val[target]

#__________________________
## Get permuter
permuter, features_old = permutationCreator(X_train, X_val, y_train, y_val)

### Get feature importances
X_train, X_val, features = featureImportance(X_train, X_val, permuter, 0)
X_test = test[features]

#___________________________
permutationColor(permuter, features_old)

#___________________________
rf.fit(X_train,y_train)
#accuracy_score(y_val, xgb.predict(X_val))
rf.score(X_val,y_val)

'''

In [None]:
'''
# Put predict on this
y_pred = rf.predict(X_test)
'''
# Write submission csv file
submission = sample_submission.copy()
submission['status_group'] = y_pred
submission.to_csv('../../sub-01.csv', index=False)
#''';