<a href="https://colab.research.google.com/github/bundickm/CheatSheets/blob/master/Machine_Learning_Support_Cheat_Sheet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Resources
[Cheat Sheets](https://github.com/bundickm/CheatSheets)
- [Regression](https://github.com/bundickm/CheatSheets/blob/master/Classification_Validation_Cheat_Sheet.ipynb)

##Universal Workflow of Machine Learning
1. **Define the problem at hand and the data on which you’ll train.** Collect this data, or annotate it with labels if need be.

2. **Choose how you’ll measure success on your problem.** Which [metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values) will you monitor on your validation data?

3. **Determine your evaluation protocol.** Hold-out validation? K-fold validation? Which portion of the data should you use for validation?

4. **Develop a first model that does better than a basic baseline:** a model with statistical power.

5. **Develop a model that overfits.** The universal tension in machine learning is between optimization and generalization; the ideal model is one that stands right at the border between underfitting and overfitting; between undercapacity and overcapacity. To figure out where this border lies, first you must cross it.

6. **Regularize your model and tune its hyperparameters, based on performance on the validation data.** Repeatedly modify your model, train it, evaluate on your validation data (not the test data, at this point), modify it again, and repeat, until the model is as good as it can get. Iterate on feature engineering: add new features, or remove features that don’t seem to be informative.

Once you’ve developed a satisfactory model configuration, you can train your final production model on all the available data (training and validation) and evaluate it one last time on the test set.
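
As a rough sketch of steps 3 and 4, the cell below holds out a test set, carves a validation set out of the remainder, and compares a first model against a mean-predicting baseline. The synthetic data from `make_regression`, the split sizes, and the choice of `RandomForestRegressor` are assumptions made for illustration, not part of the workflow itself.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in data (an assumption; substitute your own X and y).
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=42)

# Step 3: hold out a test set first, then split the rest into train/validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=42)

# Step 4: a basic baseline that always predicts the training mean.
baseline = DummyRegressor(strategy='mean').fit(X_train, y_train)
baseline_mae = mean_absolute_error(y_val, baseline.predict(X_val))

# Step 4, continued: a first model that should beat the baseline.
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
model_mae = mean_absolute_error(y_val, model.predict(X_val))

# Alternative protocol: 5-fold cross-validation instead of one hold-out split.
cv_mae = -cross_val_score(
    RandomForestRegressor(n_estimators=100, random_state=42),
    X_trainval, y_trainval, cv=5, scoring='neg_mean_absolute_error')

print('baseline MAE:', baseline_mae)
print('model MAE:', model_mae)
print('5-fold MAE:', cv_mae.mean())
```

The test set is not touched here; it stays reserved for the single final evaluation described above.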

###Validation Curve
Validation curves visualize the performance metric over a range of values for some hyperparameter.

<center><img src= "https://camo.githubusercontent.com/f89eaf0abb225cda2ab4beb8eee18d621d7cacf4/68747470733a2f2f6a616b657664702e6769746875622e696f2f507974686f6e44617461536369656e636548616e64626f6f6b2f666967757265732f30352e30332d76616c69646174696f6e2d63757276652e706e67" width=400></center>

In [0]:
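%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import validation_curve

#assumes X_train and y_train are already defined
model = RandomForestRegressor(n_estimators=100)

#score the model at each max_depth using 3-fold cross-validation
depth = [2, 3, 4, 5, 6]
train_score, val_score = validation_curve(
    model, X_train, y_train,
    param_name='max_depth', param_range=depth,
    scoring='neg_mean_absolute_error', cv=3)

#plot the median score across folds (negated MAE, so closer to 0 is better)
plt.plot(depth, np.median(train_score, 1), color='blue', label='training score')
plt.plot(depth, np.median(val_score, 1), color='red', label='validation score')
plt.legend(loc='best')
plt.xlabel('depth')
plt.ylabel('negative mean absolute error');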

##Hyperparameter Optimization

**Hyperparameter Optimization** - The problem of choosing a set of optimal hyperparameters for a learning algorithm. The same kind of machine learning model can require different constraints, weights or learning rates to generalize different data patterns. These measures are called hyperparameters, and have to be tuned so that the model can optimally solve the machine learning problem. Hyperparameter optimization finds a tuple of hyperparameters that yields an optimal model which minimizes a predefined loss function on given independent data. The objective function takes a tuple of hyperparameters and returns the associated loss. Cross-validation is often used to estimate this generalization performance.

In [0]:
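import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    'n_estimators':[100,200],
    'max_depth':[4,5],
    #these were named 'mse' and 'mae' in scikit-learn versions before 1.0
    'criterion':['squared_error','absolute_error']
}

#2 x 2 x 2 = 8 combinations, so n_iter=8 tries every one of them
search = RandomizedSearchCV(
    RandomForestRegressor(n_jobs=-1, random_state=42),
    param_distributions=param_distributions, n_iter=8, cv=3,
    scoring='neg_mean_absolute_error', verbose=10,
    return_train_score=True, n_jobs=-1
)

search.fit(X_train, y_train)

#show the best-ranked parameter combination
results = pd.DataFrame(search.cv_results_)
results.sort_values(by='rank_test_score').head(1)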

In [0]:
search.best_estimator_

In [0]:
#Graphing importances in descending order
importances = pd.Series(search.best_estimator_.feature_importances_,X_train.columns)
plt.figure(figsize=(5,10))
importances.sort_values().plot.barh(color='grey');

In [0]:
#Final scoring on test data
from sklearn.metrics import mean_absolute_error

final = search.best_estimator_
y_pred = final.predict(X_test)
test_mae = mean_absolute_error(y_test,y_pred)

print('MAE with holdout test set:',test_mae)

##Feature Engineering
**Feature engineering** - The process of using your own knowledge about the data and about the machine learning algorithm at hand to make the algorithm work better by applying hardcoded (non-learned) transformations to the data before it goes into the model. In many cases, it isn’t reasonable to expect a machine-learning model to be able to learn from completely arbitrary data. The data needs to be presented to the model in a way that will make the model’s job easier.
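
A small illustration of such hardcoded transformations, using pandas. The column names and values below are hypothetical and chosen only to show the pattern: a raw date string is split into numeric parts, and two amounts are combined into a ratio that a model may find easier to use.

```python
import pandas as pd

# Hypothetical raw data: column names and values are made up for illustration.
df = pd.DataFrame({
    'issue_date': ['2019-01-15', '2019-06-30', '2019-12-01'],
    'loan_amount': [10000, 25000, 5000],
    'annual_income': [50000, 100000, 40000],
})

# Parse the date string so its components become usable numeric features.
df['issue_date'] = pd.to_datetime(df['issue_date'])
df['issue_month'] = df['issue_date'].dt.month
df['issue_dayofweek'] = df['issue_date'].dt.dayofweek  # Monday=0

# Combine two raw amounts into a single ratio feature.
df['loan_to_income'] = df['loan_amount'] / df['annual_income']

# Drop the raw datetime column, which most estimators cannot consume directly.
features = df.drop(columns='issue_date')
print(features)
```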

#Visualizations

In [0]:
#pip installs
!pip install eli5
!pip install pdpbox
!pip install shap

##Feature Importances

In [0]:
#Feature Importances
#best is the best model from RandomizedSearchCV (for example, search.best_estimator_)
n = 20
figsize = (5,n//3)

importances = pd.Series(best.feature_importances_, X_train.columns)
top_n = importances.sort_values()[-n:]

plt.figure(figsize=figsize)
top_n.plot.barh(color='gray');

##Permutation Importances
Permutation importance is a compromise between feature importance based on impurity reduction (the fastest method) and drop-column importance (the "best" method, but the most expensive because it retrains the model once per feature).

The ELI5 library documentation explains:

    Importance can be measured by looking at how much the score (accuracy, F1, R^2, etc. - any score we’re interested in) decreases when a feature is not available.

    To do that one can remove feature from the dataset, re-train the estimator and check the score. But it requires re-training an estimator for each feature, which can be computationally intensive. ...

    To avoid re-training the estimator we can remove a feature only from the test part of the dataset, and compute score without using this feature. It doesn’t work as-is, because estimators expect feature to be present. So instead of removing a feature we can replace it with random noise - feature column is still there, but it no longer contains useful information. This method works if noise is drawn from the same distribution as original feature values (as otherwise estimator may fail). The simplest way to get such noise is to shuffle values for a feature, i.e. use other examples’ feature values - this is how permutation importance is computed.

    The method is most suitable for computing feature importances when a number of columns (features) is not huge; it can be resource-intensive otherwise.
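
The shuffling idea described above can be sketched by hand in a few lines. This is an illustrative version, not ELI5's implementation: the synthetic data (where only the first two columns drive the target), the helper name `permutation_importance_manual`, and the use of `model.score` (R^2 for a regressor) are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption): only columns 0 and 1 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=400)
X_train, X_val, y_train, y_val = X[:300], X[300:], y[:300], y[300:]

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_train, y_train)

def permutation_importance_manual(model, X, y, n_repeats=5, seed=0):
    """Mean drop in model.score (R^2 here) when each column is shuffled."""
    shuffler = np.random.default_rng(seed)
    base_score = model.score(X, y)
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_permuted = X.copy()
            # Shuffle one column: it keeps its distribution but loses its
            # relationship with the target. The model is not retrained.
            X_permuted[:, col] = shuffler.permutation(X_permuted[:, col])
            drops.append(base_score - model.score(X_permuted, y))
        importances[col] = np.mean(drops)
    return importances

importances = permutation_importance_manual(model, X_val, y_val)
print(importances.round(3))
```

Columns that the target does not depend on should score close to zero, and can even come out slightly negative by chance.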


In [0]:
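#table of feature importances via permutation importance
import eli5
from eli5.sklearn import PermutationImportance

#scoring='roc_auc' assumes best is a fitted binary classifier;
#for a regressor use a regression metric such as 'neg_mean_absolute_error'
permuter = PermutationImportance(best, scoring='roc_auc', cv='prefit',
                                 n_iter=2, random_state=42)

#cv='prefit' reuses the already fitted model, so fit only shuffles and scores
permuter.fit(X_test.values, y_test)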

feature_names = X_test.columns.tolist()
eli5.show_weights(permuter, top=None, feature_names=feature_names)

In [0]:
#dropping features whose importance is not positive, for faster training with an almost equal score
mask = permuter.feature_importances_ > 0
features = X_train.columns[mask]
X_train = X_train[features]
#select the same columns from X_test before scoring the refit model

###Refit the model after removing features###

##Partial Dependence Plots

When using black-box machine learning algorithms such as random forests and boosting, it is hard to understand the relationship between the predictors and the model outcome. A random forest, for example, gives us only feature importances. These tell us which features strongly influence the outcome, but not in which direction. In most cases the effect is also non-monotonic. We need tools that help us understand the complex relationships between predictors and model predictions.

Partial dependence plots show how a feature affects the predictions of a machine learning model on average.
1. Define a grid of values along the feature
2. Compute model predictions at each grid point, holding the other features at their observed values
3. Draw one line per data instance -> ICE (Individual Conditional Expectation) curve
4. Average the ICE curves to get a PDP (Partial Dependence Plot)
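
The four steps above can be sketched directly with NumPy. This is a hand-rolled illustration rather than what pdpbox does internally; the synthetic data (a U-shaped effect for feature 0) and the helper name `ice_and_pdp` are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption): feature 0 has a non-monotonic, U-shaped effect.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(300, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

def ice_and_pdp(model, X, feature, n_grid=20):
    # 1. Define a grid along the feature.
    grid = np.linspace(X[:, feature].min(), X[:, feature].max(), n_grid)
    # 2. Predict at each grid point, keeping all other features as observed.
    # 3. Each row of `ice` is one instance's ICE curve.
    ice = np.empty((X.shape[0], n_grid))
    for j, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = value
        ice[:, j] = model.predict(X_mod)
    # 4. Average the ICE curves to get the partial dependence curve.
    pdp = ice.mean(axis=0)
    return grid, ice, pdp

grid, ice, pdp = ice_and_pdp(model, X, feature=0)
print(np.column_stack([grid, pdp]).round(2))
```

Plotting `pdp` against `grid` gives the partial dependence plot; plotting each row of `ice` against `grid` gives the individual ICE curves.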




In [0]:
#single feature dependence plot
from pdpbox.pdp import pdp_isolate, pdp_plot

feature = 'sub_grade'

isolated = pdp_isolate(model=best, dataset=X_test, 
                       model_features=X_test.columns, feature=feature)

pdp_plot(isolated, feature_name=feature);

In [0]:
#2 feature interaction partial dependence plot (heatmap)
from pdpbox.pdp import pdp_interact, pdp_interact_plot

features =['sub_grade','dti']

interaction = pdp_interact(model=best, dataset=X_test, 
                           model_features=X_test.columns, features=features)

pdp_interact_plot(interaction, plot_type='grid', feature_names=features)

##Shapley Values

SHAP values (short for SHapley Additive exPlanations) break down a prediction to show the impact of each feature.
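
To make "break down a prediction" concrete, here is a brute-force Shapley value computation for a tiny three-feature function. It is only an illustration of the idea and not how `shap.TreeExplainer` computes its values: the toy `predict` function, the random background data, and the choice to fill "missing" features from background rows are all assumptions.

```python
import itertools
import math
import numpy as np

def predict(X):
    # Hypothetical "model": any function mapping rows to predictions works.
    return 2.0 * X[:, 0] + X[:, 1] * X[:, 2]

def shapley_values(predict, x, background):
    n = x.shape[0]

    def value(subset):
        # Expected prediction when the features in `subset` are fixed to x and
        # the remaining features take their values from the background rows.
        X_mix = background.copy()
        for col in subset:
            X_mix[:, col] = x[col]
        return predict(X_mix).mean()

    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = (math.factorial(size) * math.factorial(n - size - 1)
                      / math.factorial(n))
            for subset in itertools.combinations(others, size):
                phi[i] += weight * (value(subset + (i,)) - value(subset))
    return phi

rng = np.random.default_rng(0)
background = rng.normal(size=(200, 3))
x = np.array([1.0, 2.0, 3.0])

phi = shapley_values(predict, x, background)
expected_value = predict(background).mean()
prediction = predict(x[None, :])[0]

print('per-feature contributions:', phi.round(3))
print('expected value + contributions:', expected_value + phi.sum())
print('model prediction:', prediction)
```

The contributions add up to the difference between this prediction and the average prediction, which is the additive property that a force plot visualizes.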

In [0]:
#SHAP Values diagram
import shap
shap.initjs()

#this is just any entry, may be true positive, false negative, etc.
data_for_prediction = X_test[X_test.index==91777] 

explainer = shap.TreeExplainer(best)
shap_values = explainer.shap_values(data_for_prediction)
shap.force_plot(explainer.expected_value, shap_values, data_for_prediction)

#Gradient Descent
**Gradient descent** - an optimization algorithm used to minimize some function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient. In machine learning, we use gradient descent to update the parameters of our model. Parameters refer to coefficients in linear regression and weights in neural networks.

1.   Define a cost function
2.   Evaluate the slope (gradient) at the current point
3.   Take a small step, scaled by the learning rate (alpha), in the direction of steepest descent (the negative of the gradient)
4.   Repeat steps 2 and 3 until the slope approaches 0
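
The four steps above can be sketched for a one-feature linear regression with a mean squared error cost. The synthetic data, the learning rate of 0.1, and the stopping tolerance are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic data (an assumption): y is roughly 3*x + 2 plus a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 3.0 * X + 2.0 + rng.normal(scale=0.1, size=100)

# 1. Define a cost function: mean squared error of the line w*x + b.
def cost(w, b):
    return np.mean((w * X + b - y) ** 2)

# 2. Evaluate the gradient of the cost at the current (w, b).
def gradient(w, b):
    error = w * X + b - y
    return 2 * np.mean(error * X), 2 * np.mean(error)

w, b = 0.0, 0.0
alpha = 0.1  # learning rate
initial_cost = cost(w, b)

for step in range(5000):
    dw, db = gradient(w, b)
    # 4. Stop once the slope is close to 0.
    if max(abs(dw), abs(db)) < 1e-8:
        break
    # 3. Take a small step against the gradient.
    w -= alpha * dw
    b -= alpha * db

print('w:', w, 'b:', b, 'final cost:', cost(w, b))
```

A learning rate that is too large makes the updates overshoot and diverge, while one that is too small makes convergence slow, so alpha is itself a hyperparameter worth tuning.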

