# Feature Engineering, More Algorithms, Tuning, and Training

### Why?

As with our previous topics, the goal is not to parachute you into a predictive analytics team where you'll start wrangling records.

Instead, we want to focus on process and organization elements -- so that you can plan, manage, hire, and set reasonable expectations. We will continue to include small code examples as part of our mission to remove the mystery around AI so that it can integrate into your business.

## Feature Engineering

Remember that picture from the Google paper, where AI/ML was the little black box surrounded by huge boxes representing data engineering and related work?

A big part of that "other" work -- which turns out to be crucial to getting good AI performance -- is *feature engineering*.

By __feature__ we usually mean an individual property of a data record that we might analyze in our ML work. For example, if we have customer records, and we know the customer's age, that age might be a feature. In some situations, it might be useless -- or even illegal -- to use that feature as-is, so we might want to transform it or leave it out.

We might also have a list of transactions for that customer. Are the transaction properties "features"? They could be... We probably need to decide how to use this list of transactions. A "brute force" approach might be including every aspect of every transaction in a big long list of features. On the other hand, maybe we're more concerned with the total value of the transactions... or the frequency... or the percentile of the customer's spend among our customer base. 
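As a rough sketch of that second option, here is how a variable-length list of transactions might be collapsed into a few per-customer features with pandas. The table, column names, and values are all invented for illustration.

```python
import pandas as pd

# Hypothetical transaction log: one row per purchase (values invented for illustration)
transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3],
    "amount": [20.0, 35.0, 5.0, 200.0, 150.0, 12.0],
})

# Collapse each customer's transactions into a fixed set of summary features
customer_features = transactions.groupby("customer_id")["amount"].agg(
    total_spend="sum",
    n_transactions="count",
    avg_spend="mean",
)

# Percentile rank of each customer's total spend within the customer base
customer_features["spend_percentile"] = customer_features["total_spend"].rank(pct=True)

print(customer_features)
```

Each customer now has the same small number of columns no matter how many transactions they made, which is the shape most modeling algorithms expect.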

There is no simple or automatic "right answer" and -- at least as of today -- there are no automated tools that can find the optimal features in every case, so we need __feature engineering__.

Feature engineering involves several different activities, including:
* Removing irrelevant features
* Determining correlated or redundant features and possibly removing them
* Learning which features contribute the most (or the least) information about our problem
* Creating new features by combining or altering existing information
* Changing feature values by scaling or bucketing them, or mapping categorical features to numeric values

Note that feature engineering in practice requires both __business domain knowledge__ and __algorithm knowledge__.
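To make a few of the activities listed above concrete, here is a minimal sketch on a small made-up table. The column names (`age`, `plan`, `age_in_months`) and the bucket boundaries are assumptions chosen only for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small made-up table; columns and values are purely illustrative
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.integers(18, 80, size=200),
    "plan": rng.choice(["basic", "plus", "pro"], size=200),
})
toy["age_in_months"] = toy["age"] * 12  # deliberately redundant with "age"

# 1. Redundant features: a correlation near 1 suggests one of the pair can go
corr = toy[["age", "age_in_months"]].corr().loc["age", "age_in_months"]
toy = toy.drop(columns="age_in_months")

# 2. Bucketing: replace a raw value with a coarse band
toy["age_band"] = pd.cut(toy["age"], bins=[17, 30, 50, 80],
                         labels=["18-30", "31-50", "51-80"])

# 3. Scaling: shift and rescale to zero mean and unit variance
toy["age_scaled"] = StandardScaler().fit_transform(toy[["age"]]).ravel()

# 4. Categorical to numeric: one-hot encode the text and band columns
encoded = pd.get_dummies(toy, columns=["plan", "age_band"])

print("correlation between age and age_in_months: %.3f" % corr)
print(encoded.head())
```

Which of these steps actually helps depends on the algorithm and on the business meaning of each column, which is why both kinds of knowledge are needed.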

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

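df = pd.read_csv('data/diamonds.csv')
df2 = df.drop(df.columns[0], axis=1)  # drop the first column (the row index in the usual diamonds.csv)
df3 = pd.get_dummies(df2)             # one-hot encode the categorical columns

y = df3['price']                      # target

X = df3.drop('price', axis=1)         # predictors: everything except the target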

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

Now we'll run the original linear regression without doing any new featurization. Why? When working with AI models, it is critical to maintain a collection of __baseline__ models and performance numbers.

Since there are so many ways to change our modeling, it is critical to be able to measure the statistical significance of performance changes, determine whether there is real improvement, and -- even when there is -- weigh that improvement against the costs incurred.

For example, swapping in a more complex model can easily have significant costs that ripple throughout...
* the software process lifecycle
* data acquisition, processing, and storage costs
* hardware provisioning
* business continuity
* personnel
* legal/compliance/regulatory
* PR/communications

So: whether a model is "better" is a complex business decision; more statistical accuracy may not be worth it!
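One way a team might check whether a candidate model really beats a baseline is to score both on the same cross-validation folds and look at the fold-by-fold differences. The sketch below uses synthetic data (not the diamonds dataset) and a trivial "always predict the mean" baseline; both choices are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data
X_demo, y_demo = make_regression(n_samples=500, n_features=5, noise=20.0, random_state=0)

# Use the same folds for both models so the scores are directly comparable
folds = KFold(n_splits=5, shuffle=True, random_state=0)

def cv_rmse(model):
    # scikit-learn reports negated MSE so that "higher is better"; flip the sign back
    neg_mse = cross_val_score(model, X_demo, y_demo, cv=folds,
                              scoring="neg_mean_squared_error")
    return np.sqrt(-neg_mse)

baseline_rmse = cv_rmse(DummyRegressor(strategy="mean"))  # always predicts the mean
candidate_rmse = cv_rmse(LinearRegression())

improvement = baseline_rmse - candidate_rmse  # per-fold improvement in RMSE
print("baseline  RMSE per fold:", baseline_rmse.round(1))
print("candidate RMSE per fold:", candidate_rmse.round(1))
print("mean improvement %.1f (std across folds %.1f)"
      % (improvement.mean(), improvement.std()))
```

If the improvement is small compared to its spread across folds, the "better" model may just be noise, and it is probably not worth the costs listed above.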

Our current baseline:

In [None]:
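# Import the class directly so that the variable `linear_model` below
# does not shadow the sklearn.linear_model module
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lr = LinearRegression()
linear_model = lr.fit(X_train, y_train)  # fit() returns the fitted estimator

y_pred = linear_model.predict(X_test)
print("RMSE %f" % np.sqrt(mean_squared_error(y_test, y_pred)))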

## Example: Deskewing

Let us imagine that we want to improve our linear regression model on the diamonds dataset.

After running some initial experiments and exploratory analysis, we notice that some predictors -- e.g., carat -- appear to roughly follow a power-law distribution:

In [None]:
from matplotlib import pyplot as plt

%matplotlib inline

plt.hist(df.carat, bins=20)

Some kinds of models fit better (or more efficiently) when predictors are normally distributed. Suppose that we want to try to make the `carat` predictor more Gaussian. Scikit-learn and other libraries implement feature transformations that can help do this. Let's try one.

In [None]:
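from sklearn.preprocessing import PowerTransformer

# Box-Cox only works on strictly positive values, which is fine for carat
pt = PowerTransformer(method='box-cox')

# Learn the transform parameters from the column, then apply them
pt_transformer = pt.fit(X[['carat']])

carat_processed = pt_transformer.transform(X[['carat']])

plt.hist(carat_processed, bins=20)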

Ok, that kind of looks a bit more Gaussian. To complete the example, how would we use this on the full dataset and modeling process?

The API is straightforward, but a key point is that we don't want to calculate parameters for the transformation from test data and then use them in our model (that would effectively contaminate the model training with information from the test set). So we'll "learn" the transform params just from the training data, and then apply the same fitted transform to both training and test:

In [None]:
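# The default method (Yeo-Johnson) also handles zeros, so it can be applied
# to every column, including the 0/1 dummy columns
pt = PowerTransformer()

# Learn the transform parameters from the training data only
X_train = pt.fit_transform(X_train)

linear_model = lr.fit(X_train, y_train)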

Now, to test, *we want to use the fitted (from training) transform* on the test data.

Does this improve things?

In [None]:
X_test = pt.transform(X_test)

y_pred = linear_model.predict(X_test)
print("RMSE %f" % np.sqrt(mean_squared_error(y_test, y_pred)) )

Ok, this is definitely worse!

But the goal here is just to demonstrate the workflow.
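The same "fit on training data, reuse on test data" workflow can also be bundled into a single object with scikit-learn's `Pipeline`, which makes it harder to leak test data by accident. This is a sketch on synthetic data; the skewed predictor below is an invented stand-in, not the diamonds dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PowerTransformer

# Synthetic, right-skewed predictor standing in for something like carat
rng = np.random.default_rng(0)
X_demo = rng.lognormal(mean=0.0, sigma=0.5, size=(1000, 1))
y_demo = 3.0 * np.log(X_demo[:, 0]) + rng.normal(scale=0.1, size=1000)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

# fit() learns the transform parameters AND the regression from training data only
pipe = make_pipeline(PowerTransformer(), LinearRegression())
pipe.fit(Xd_train, yd_train)

# predict() reuses the already-fitted transform on the test data
yd_pred = pipe.predict(Xd_test)
pipe_rmse = np.sqrt(mean_squared_error(yd_test, yd_pred))
print("RMSE %f" % pipe_rmse)
```

Because the transform and the model travel together, the whole pipeline can be handed to cross-validation or tuning tools as if it were one model.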

Let's try a lab exercise in feature engineering where we apply a transformation that may not improve modeling accuracy, but will help give us more insight into the data and model.

[Feature Engineering Lab](./03a-FeatureEng-Lab-StandardScaling.ipynb)

## Example: Decision Tree

Decision trees are conceptually simple -- they correspond to your intuitive notions about applying business rules to determine a result. For example, you may have a set of conditions which affect whether to grant credit to a customer, and you arrive at the answer by asking a set of questions about the customer, purchase, etc., until you arrive at a result.

<img src="https://materials.s3.amazonaws.com/i/augvwPa.jpg">

One tricky -- or interesting, depending upon how you look at it -- part is deciding which decisions or "splits" to use, and where to place them in the tree.

For today, we will not get into the relevant math, but suffice it to say that there are several widely used approaches to choosing good splits from the data, while certain decisions -- such as how deep to let a tree grow when there are lots of possible decision splits -- are things you'll need to settle through experimentation or tuning.

In [None]:
from sklearn.tree import DecisionTreeRegressor

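# Re-split from the original (untransformed) features, using the same seed as the baseline split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)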
dt = DecisionTreeRegressor()
model = dt.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("RMSE %f" % np.sqrt(mean_squared_error(y_test, y_pred)) )

That looks like a nice improvement. We're in the neighborhood of our kNN model, but with the portability and speed of a smaller model. Moreover, decision trees are easily interpretable by humans, so we can inspect and audit the model if we need to. (That said, most statisticians consider trees to be nonparametric or semiparametric models because the values they use come directly from the data samples, rather than representing a mathematical abstraction, but we don't need to worry about that.)

So what could go wrong?

It turns out that the values in the tree are a bit __too__ closely derived from (i.e., sensitive to) the original data.

Let's take a look, then we'll discuss why this is a problem, and what we can do about it.

In [None]:
# Run this helper function, but it's not necessary to walk through all of the code

from sklearn.tree import _tree

def tree_to_code(tree, feature_names):
    tree_ = tree.tree_
    feature_name = [
        feature_names[i] if i != _tree.TREE_UNDEFINED else "undefined!"
        for i in tree_.feature
    ]
    print("def tree({}):".format(", ".join(feature_names)))

    def recurse(node, depth):
        indent = "  " * depth
        if tree_.feature[node] != _tree.TREE_UNDEFINED:
            name = feature_name[node]
            threshold = tree_.threshold[node]
            print("{}if {} <= {}:".format(indent, name, threshold))
            recurse(tree_.children_left[node], depth + 1)
            print("{}else:  # if {} > {}".format(indent, name, threshold))
            recurse(tree_.children_right[node], depth + 1)
        else:
            print("{}return {}".format(indent, tree_.value[node]))

    recurse(0, 1)

To keep this output reasonable, we'll build a smaller model, limiting max depth to 5:

In [None]:
model = DecisionTreeRegressor(max_depth=5).fit(X_train, y_train)

# Now we'll convert our tree model to a Python function ... 
# a simple, textual representation of the decision splits:

tree_to_code(model, list(X.columns))

As you can see, a lot of the values used in this tree are fairly precise ... probably artificially so. And they depend on the specific items in the training set. 

Moreover, they depend on some assumptions we baked into the tree-building algorithm (though it's hard to see, because we used mostly default configurations).

So consider the following: if we allow a tree to grow arbitrarily deep, it will eventually develop leaves for most if not all of our data points. At that point, we will have built an interesting data structure for storing the information we have about our diamonds -- we call this "memorizing the training set." But it may not make the right choices for new diamonds.

When a model is allowed to develop complexity that is tailored to a dataset and limits its generality, we call this "overfitting" because the model has fit the training data too closely.
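Here is a small sketch of that "memorizing" effect, using synthetic data rather than the diamonds dataset. With no depth limit, the tree ends up with roughly one leaf per training row and a training error near zero, while the error on held-out data stays much higher.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data with some noise in the target
X_demo, y_demo = make_regression(n_samples=600, n_features=5, noise=30.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

def rmse(model, X, y):
    return np.sqrt(mean_squared_error(y, model.predict(X)))

# No max_depth: the tree keeps splitting until it cannot split any further
deep_tree = DecisionTreeRegressor(random_state=0).fit(Xd_train, yd_train)

train_rmse = rmse(deep_tree, Xd_train, yd_train)
test_rmse = rmse(deep_tree, Xd_test, yd_test)

print("leaves: %d for %d training rows" % (deep_tree.get_n_leaves(), len(Xd_train)))
print("train RMSE %.2f, test RMSE %.2f" % (train_rmse, test_rmse))
```

A large gap between training and test error like this one is the usual warning sign of overfitting.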

## Bias-Variance Tradeoff

Do all models have this "overfitting" tendency? 

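No ... some models are so simple that they aren't flexible enough to overfit. On the other hand, those models may be so simple that they cannot accommodate complex nuance in the data (we say they "underfit").

Simple, rigid models are said to have high __bias__; flexible models whose results swing with the particular training sample are said to have high __variance__. This spectrum is called the __bias-variance tradeoff__.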

<img src="https://materials.s3.amazonaws.com/i/L8Lv2N1.png">
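For squared-error loss, the tradeoff can be written down precisely. As a sketch, assume (this notation is ours, not taken from the figure) that the data follow $y = f(x) + \varepsilon$, where the noise $\varepsilon$ has mean zero and variance $\sigma^2$, and that $\hat{f}$ is a model trained on a randomly drawn training set. The expected error at a point $x$ then splits into three parts:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{noise}}
$$

The bias term is large when the model family is too rigid to represent $f$; the variance term is large when the fitted model changes a lot from one training set to the next; the noise term cannot be reduced by any model.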

## Ensemble Methods

Is there any way to mitigate overfitting tendencies in high-variance model families?

Yes, there are several techniques.

One approach, which we will leave as an optional lab, is to perform two steps.

First, we limit an algorithm's ability to overfit -- for example, with a decision tree, we can restrict it to being just 1 or 2 splits deep. That will produce a poor model.

But then we do a second step: we train a whole bunch of these "weak" models and combine their predictions (e.g., by averaging or voting). The result is a nice compromise, featuring higher accuracy without the overfitting. Why? How? The basic idea is that each model will make mistakes, but their mistakes will be different and so will tend to cancel out ... whereas the correct parts of the prediction, since they are based on real information in the data, should be more likely to reinforce each other.

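If we follow this pattern with trees, and we make the trees differ from one another by randomly selecting subsets of the training records and of the features each time, we get a __Random Forest__. (In practice, the individual trees in a random forest are often allowed to grow fairly deep, and the averaging does much of the work of reining in overfitting.) Random forests are one of the most popular algorithms because they meet these objectives well, and are still relatively simple models.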

If you'd like to take a quick try at this, see http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
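As a quick sketch of what that might look like, the block below compares a single unrestricted tree with a forest of 100 trees on synthetic data (not the diamonds dataset). The settings `n_estimators=100` and `max_features=0.5` are arbitrary choices for illustration, not tuned values.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data
X_demo, y_demo = make_regression(n_samples=600, n_features=8, noise=30.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

# One unrestricted tree (prone to overfitting)
single_tree = DecisionTreeRegressor(random_state=0).fit(Xd_train, yd_train)

# Many trees, each trained on a bootstrap sample of rows and limited to a
# random half of the features at every split; predictions are averaged
forest = RandomForestRegressor(n_estimators=100, max_features=0.5,
                               random_state=0).fit(Xd_train, yd_train)

tree_rmse = np.sqrt(mean_squared_error(yd_test, single_tree.predict(Xd_test)))
forest_rmse = np.sqrt(mean_squared_error(yd_test, forest.predict(Xd_test)))

print("single tree RMSE %.2f" % tree_rmse)
print("forest      RMSE %.2f" % forest_rmse)
```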

#### What if we don't choose our forest randomly?

Random forests are great because they can be quick, easy, trivially parallelized, etc.

But we can do better. We can create a series of trees, where each tree is trained to correct mistakes from the previous trees in the series. If we construct the trees -- and the definitions of "correct" and "mistakes" -- properly, we get a very powerful technique called gradient boosting.
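To show the idea of "correcting mistakes from the previous trees," here is a deliberately stripped-down sketch for squared-error loss, where the "mistakes" are simply the residuals (actual minus predicted). It is hand-rolled for illustration on synthetic data and is not how production libraries are implemented; the `learning_rate`, `n_rounds`, and `max_depth` values are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = make_regression(n_samples=600, n_features=5, noise=10.0, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0)

learning_rate = 0.1  # how much of each correction to apply
n_rounds = 100       # how many small trees to chain together

# Start from a constant prediction: the mean of the training targets
train_pred = np.full(len(yd_train), yd_train.mean())
test_pred = np.full(len(yd_test), yd_train.mean())
rmse_by_round = []

for _ in range(n_rounds):
    residuals = yd_train - train_pred  # what the trees so far got wrong
    small_tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    small_tree.fit(Xd_train, residuals)  # learn to predict the remaining error
    train_pred += learning_rate * small_tree.predict(Xd_train)
    test_pred += learning_rate * small_tree.predict(Xd_test)
    rmse_by_round.append(np.sqrt(mean_squared_error(yd_test, test_pred)))

print("test RMSE after round 1:   %.2f" % rmse_by_round[0])
print("test RMSE after round %d: %.2f" % (n_rounds, rmse_by_round[-1]))
```

In practice you would reach for a library implementation such as scikit-learn's `GradientBoostingRegressor` rather than writing this loop yourself.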

Gradient-boosted trees (or tree ensembles) are some of the most powerful, versatile, and robust machine learning models available. 

* You may have heard of __xgboost__, a popular implementation that has been called the "winningest" algorithm on Kaggle, an ML competition site now owned by Google.
  * https://github.com/dmlc/xgboost
* Some companies use enhanced, proprietary tools like TreeNet 
  * https://www.salford-systems.com/products/treenet
  
Let's try Gradient Boosting on our Diamonds regression, and see if we can do better than before!


[GBT Lab](./03c-Lab-GBT.ipynb)

## Measuring

We've talked about the "performance" of a model, and we've looked at one statistic that measures this performance, RMSE.

What are some other common ways of measuring model behavior?

Common metrics for regression problems include
* RMSE
* MSE (mean square error)
* MAE (mean absolute error)
* r2 (r-squared, or coefficient of determination)

For classification problems, we often use
* Accuracy
* Precision / Recall
* AUC (area under ROC curve)
* F1-score

The details of all these metrics are a bit beyond our goals for today. The key things to remember are:
1. The team working on AI accuracy should have good reasoning for the metrics they are using on a project
2. Each metric measures some important aspect(s) of a model, but at the expense of some other(s), so there's no one "right" metric
3. In some cases, there are definitely "wrong" metrics: the classic example is using accuracy (fraction of correct predictions) on very rare events -- if my fraud rate is 1 in a million, and my model always predicts "not fraud," its accuracy will be 99.9999%, but of course it will be totally useless
4. If you're curious, this is a helpful summary: https://en.wikipedia.org/wiki/Precision_and_recall
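A short sketch of computing several of these metrics with scikit-learn follows. The numbers are made up, and the fraud example is scaled down to 1 case in 1,000 (rather than 1 in a million) to keep it small.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, mean_absolute_error,
                             mean_squared_error, r2_score, recall_score)

# --- Regression metrics on a few invented values ---
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_hat = np.array([2.5, 5.0, 4.0, 8.0])

mse = mean_squared_error(y_true, y_hat)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_hat)
r2 = r2_score(y_true, y_hat)
print("MSE %.4f  RMSE %.4f  MAE %.4f  r2 %.4f" % (mse, rmse, mae, r2))

# --- The accuracy trap on rare events ---
labels = np.zeros(1000, dtype=int)
labels[0] = 1                               # a single fraud case
always_negative = np.zeros(1000, dtype=int)  # a "model" that always says "not fraud"

accuracy = accuracy_score(labels, always_negative)
recall = recall_score(labels, always_negative)  # fraction of fraud cases caught
print("accuracy %.3f, recall %.1f" % (accuracy, recall))
```

The always-negative model looks excellent on accuracy while catching none of the fraud, which is why a metric like recall is the more telling one here.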

## Tuning

Aside from changing the algorithm or the features, another major way to improve the performance of a modeling approach is via __hyperparameter tuning__.

*Hyperparameter* refers to values that are inputs to a modeling approach -- chosen by us before training, rather than learned by the algorithm -- and are not contained in the data records themselves.

So far, we've seen (or brushed past) a few examples of hyperparameters:
* in our k-nearest-neighbors example, the best value for "k" or how many neighbors to look at
* in the decision tree, the maximum depth for the tree to prevent overfitting
* for a random forest, how many trees to include in the ensemble

We've generally used defaults so far, which is why these values have not played a large part in the story.

But to get good results from any of these approaches, we will want to __tune__ the hyperparameters.

#### How does hyperparameter tuning work?

Unfortunately, there is no magic formula or arcane knowledge that reveals the ideal hyperparams for any particular problem.

There are some general ranges (some quite large) which usually work.

So we try a collection of values distributed across these ranges, and build models with each of the combinations of values. We then compare the performance of all of those models, and decide whether to keep a set of nice parameters, or to go back and repeat the tuning.
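Written out by hand, that procedure is just a loop. The sketch below tries several values of "k" for a k-nearest-neighbors classifier on synthetic data and scores each with 5-fold cross-validation; the candidate values and the dataset are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in classification data
X_demo, y_demo = make_classification(n_samples=300, n_features=6, n_informative=4,
                                     random_state=0)

candidate_ks = [1, 3, 5, 11, 21]  # values spread across a plausible range
mean_scores = {}

for k in candidate_ks:
    model = KNeighborsClassifier(n_neighbors=k)
    # Each candidate gets the same 5-fold cross-validation treatment
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    mean_scores[k] = scores.mean()
    print("k=%2d  mean accuracy %.3f" % (k, mean_scores[k]))

best_k = max(mean_scores, key=mean_scores.get)
print("best k:", best_k)
```

With more than one hyperparameter the loop becomes nested, and the number of models to train multiplies quickly, which is where a dedicated tool helps.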

### Iris Dataset and Grid Search with Crossvalidation

This example is from the scikit-learn documentation and uses a well-known but small dataset of iris (flower) features, along with two different flavors of SVM (support vector machine) -- the linear one we've already seen, and a more mathematically complex (but much more powerful) one called "SVM with RBF Kernel," where RBF stands for radial basis function.

The math isn't the critical part here. Rather, we want to see how we can use a tool to run multiple configurations of our modeling algorithm, and compare results:

In [None]:
from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svc = svm.SVC()
clf = GridSearchCV(svc, parameters)
clf.fit(iris.data, iris.target)
print("best score %f" % clf.best_score_)
print("best params %s" % clf.best_params_)
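To compare all of the configurations rather than only seeing the winner, the fitted search object exposes a `cv_results_` dictionary that loads neatly into a pandas table. This sketch repeats the search so that it runs on its own; the names `search` and `results` are ours.

```python
import pandas as pd
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
parameters = {"kernel": ("linear", "rbf"), "C": [1, 10]}

# 2 kernels x 2 values of C = 4 configurations, each scored with 5-fold cross-validation
search = GridSearchCV(svm.SVC(), parameters, cv=5).fit(iris.data, iris.target)

results = pd.DataFrame(search.cv_results_)[
    ["param_kernel", "param_C", "mean_test_score", "std_test_score", "rank_test_score"]
]
print(results.sort_values("rank_test_score"))
```

Looking at the spread (`std_test_score`) alongside the mean is a useful sanity check: if the top configurations differ by less than their fold-to-fold variation, the "best" one may not be meaningfully better.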