# Modeling

Running a model is usually the easiest part of model building. The bigger challenge is model evaluation. In this module we will go through a framework for estimator selection, learn how to initialize and train a model, and spend most of our time on model evaluation and selection.

## Estimator Choice

scikit-learn has a helpful chart for choosing the right model type. There are a few key decisions you will need to make to get to the best model type:
- are you predicting a categorical variable or a continuous one?
    - if the variable is continuous, is it bounded in any way (e.g. [0, 1], [-1, 1])?
- how many data samples do you have for training?
- are there any statistical properties of your data that you need to deal with, e.g.:
    - degree of outliers / low sample features that can cause overfitting
    - very high feature:sample size ratio
    - multi-collinearity
- do you need any intermediate outputs from the model?  e.g.:
    - do you need to get feature importance?
    - do you need to explain the decision boundary?


![Model Type](ml_map.png)

## Building a model and predicting

We can build any model by:
1. initializing the model and
2. calling the `fit` method with our training data

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lr = LinearRegression()

In [None]:
lr.fit([[0, 0], [1, 1], [2, 2]], [0, 1, 2])

After fitting the model, we can call `predict` to get predictions of the target variable for new inputs:

In [None]:
lr.predict([[4, 4], [6, 6], [4, 9]])

And modeling is done!

However, how do we know that this model is good?  How do we define what good is?

To solve this problem, we will need to introduce the following concepts:
1. splitting our data set into training and test data
2. using a model evaluation metric to test the performance of our model
3. cross-validation

## Training, Testing and Model Evaluation

The primary goal of building a model is to use seen samples to infer a relationship between features and the target, and use this model to predict unseen samples.  The classic example of this is to use historical prices to build a model that predicts future prices of an asset.

One major issue that arises is overfitting: it is very easy to build a model that fits the intricacies of the seen samples very well but does not generalize to unseen samples, which defeats the goal of building the model.

To help mitigate this, it's best practice during model training to hold out part of the available samples as a test set. This gives us a better gauge of whether the trained model can generalize to out-of-sample data points.

To do this with sklearn we can leverage `train_test_split` to split our data:

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [None]:
# make_regression was imported directly above, so call it without the `datasets.` prefix
X, y = make_regression(random_state=0, n_features=1, noise=1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

In the example above, `train_test_split` randomly holds out 40% of our sample data for testing and returns the corresponding X and y arrays for both training and testing. We can then train the model on the training set and evaluate it on the test set.

First we can verify that test is 40% of the total data set:

In [None]:
X.shape, X_train.shape, X_test.shape

In [None]:
y.shape, y_train.shape, y_test.shape

Next, we can use the training data set to train our model:

In [None]:
model = LinearRegression().fit(X_train, y_train)

And finally, we can assess the quality of our model by using the test set:

In [None]:
mean_squared_error(y_test, model.predict(X_test))

Now we have a single metric to show how our trained model performs against an out of sample test set.
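Comparing the out-of-sample score with the in-sample score is one way to spot the overfitting described earlier. The following is a minimal sketch, not part of the original walkthrough: the `DecisionTreeRegressor`, the sample count and the noise level are all illustrative assumptions, chosen because an unconstrained tree can memorize its training data.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (sizes and noise level are illustrative assumptions).
X, y = make_regression(n_samples=200, n_features=1, noise=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

results = {}
for name, estimator in [
    ("linear", LinearRegression()),
    # No depth limit: the tree is free to fit every training point, noise included.
    ("tree", DecisionTreeRegressor(random_state=0)),
]:
    estimator.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, estimator.predict(X_train))
    test_mse = mean_squared_error(y_test, estimator.predict(X_test))
    results[name] = (train_mse, test_mse)
    print(f"{name}: train MSE = {train_mse:.1f}, test MSE = {test_mse:.1f}")
```

A large gap between the training and test error, as the unconstrained tree is expected to show here, is the typical signature of a model that has fit the noise in the seen samples.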

We can also use other types of metrics depending on our problem space. For example:
- mean absolute error may be more appropriate than mean squared error (e.g. if large outliers are not something you want to penalize the model for)
- explained variance can be good for measuring how much variability you're explaining with the model
- max error captures the _worst_ error that can be generated

We simply need to import the metrics and evaluate them using the model prediction and the test target variable:

In [None]:
from sklearn.metrics import mean_absolute_error, max_error, explained_variance_score

In [None]:
mean_absolute_error(y_test, model.predict(X_test))

In [None]:
max_error(y_test, model.predict(X_test))

In [None]:
explained_variance_score(y_test, model.predict(X_test))
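To see what these metrics actually compute, here is a small sketch that reproduces them by hand with NumPy on a toy example. The `y_true` and `y_pred` values are made up for illustration and are unrelated to the model above.

```python
import numpy as np
from sklearn.metrics import (
    explained_variance_score,
    max_error,
    mean_absolute_error,
    mean_squared_error,
)

# Toy targets and predictions (illustrative values only).
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])
residuals = y_true - y_pred  # [0.5, -0.5, 0.0, -1.0]

mse = np.mean(residuals ** 2)            # squaring weights large errors more heavily -> 0.375
mae = np.mean(np.abs(residuals))         # every error counts in proportion to its size -> 0.5
worst = np.max(np.abs(residuals))        # the single largest absolute error -> 1.0
evs = 1 - np.var(residuals) / np.var(y_true)  # share of target variance explained

# The hand-rolled values should agree with scikit-learn's implementations.
print(np.isclose(mse, mean_squared_error(y_true, y_pred)))
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)))
print(np.isclose(worst, max_error(y_true, y_pred)))
print(np.isclose(evs, explained_variance_score(y_true, y_pred)))
```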

Using these metrics we can now engineer and transform features and change the model type to try to lower the target metric.

In addition, the appropriate metric depends heavily on the problem space. For example, if we are performing classification, we probably want to look at `roc_curve` or `precision_recall_curve`, whereas if we're predicting a continuous variable we're more likely to use `mean_squared_error`.
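As a rough illustration of the classification case, the sketch below computes both curves for a binary classifier. The synthetic data set from `make_classification` and the choice of `LogisticRegression` are assumptions made for the example; any classifier that outputs scores or probabilities would do.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (illustrative assumption).
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

clf = LogisticRegression().fit(X_train, y_train)

# Both curves need a continuous score, so use the predicted
# probability of the positive class rather than the hard labels.
proba = clf.predict_proba(X_test)[:, 1]

fpr, tpr, roc_thresholds = roc_curve(y_test, proba)
precision, recall, pr_thresholds = precision_recall_curve(y_test, proba)

# The area under the ROC curve condenses the curve into one number.
auc = roc_auc_score(y_test, proba)
print("ROC AUC:", auc)
```

Each curve traces the trade-off across all decision thresholds, which is why these functions take scores as input instead of the output of `predict`.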

### Time Series

One note on time series: because the data is time dependent, `train_test_split` is not appropriate, since it randomly selects which samples go into the test set. The samples in a time series are ordered in time, so a random split lets the model train on data points that come after the ones it is tested on. The model should never see the future during training; otherwise its test performance will look better than it really is.

Luckily, sklearn has `TimeSeriesSplit` which allows us to easily split out train and test data sets while making sure that we do not accidentally see into the future.

In [None]:
from sklearn.model_selection import TimeSeriesSplit
import numpy as np

In [None]:
X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([1, 2, 3, 4, 5, 6])

In [None]:
tss = TimeSeriesSplit(n_splits=3, test_size=1)

In [None]:
for train_idx, test_idx in tss.split(X):
    print("TRAIN:", train_idx, "TEST:", test_idx)
    

We can now iteratively predict the next n items (here n = 1, set by `test_size`), using all historical data for training.
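Putting the splitter to work, a walk-forward evaluation might look like the sketch below. The trending series, the `LinearRegression` model and the split sizes are all illustrative assumptions rather than part of the example above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# A made-up series: a linear trend plus noise, indexed by time step.
rng = np.random.default_rng(0)
t = np.arange(30)
X = t.reshape(-1, 1)
y = 2.0 * t + rng.normal(scale=1.0, size=t.size)

tss = TimeSeriesSplit(n_splits=5, test_size=2)

fold_errors = []
ordering_ok = []
for train_idx, test_idx in tss.split(X):
    # Every training index precedes every test index, so no look-ahead.
    ordering_ok.append(train_idx.max() < test_idx.min())

    # Refit on all history available so far, then predict the next 2 steps.
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    preds = model.predict(X[test_idx])
    fold_errors.append(mean_absolute_error(y[test_idx], preds))

print("per-fold MAE:", fold_errors)
print("mean MAE:", np.mean(fold_errors))
```

The training window grows with each fold, so later folds are fitted on more history than earlier ones.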

## Cross Validation (Part 1)

In our simple example above, we used train/test splitting to help assess our model performance on out of sample data.  However, this may still not be sufficient, especially if we have models that have hyperparameters that need to be tuned.  

In this situation we need to re-run train/test multiple times with different hyperparameters to optimize them. However, this can lead to hyperparameter overfitting, because we end up tuning the hyperparameters to the test data.

The solution is a technique called cross-validation. We split the training data into k equally sized sets, or folds (e.g. 5). Then k-1 folds are used to train the model, and the remaining fold is used to evaluate it. This is repeated k times so that each fold serves as the evaluation set once, and the performance of the model is the average over all k runs. After the best hyperparameters are chosen, we can evaluate the model on the test set to get the final result.

![Cross Validation](cross_validation.png)
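What happens in each of the k runs can be sketched by hand with `KFold`. This is only an illustration of the idea under stated assumptions: it uses shuffled, unstratified folds on the iris data, which is not necessarily the splitting strategy a helper such as `cross_val_score` applies by default.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Shuffle before splitting: the iris samples are ordered by class,
# so unshuffled folds would each be dominated by a single class.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kf.split(X):
    # Train on k-1 folds, evaluate on the one fold left out.
    clf = SVC(kernel="linear", C=1)
    clf.fit(X[train_idx], y[train_idx])
    fold_scores.append(clf.score(X[val_idx], y[val_idx]))

# The cross-validated score is the average over the k runs.
cv_score = np.mean(fold_scores)
print("fold accuracies:", fold_scores)
print("mean accuracy:", cv_score)
```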

An example using `cross_val_score`, comparing three values of the `SVC` regularization parameter `C`:

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.datasets import load_iris

In [None]:
# load_iris was imported directly above, so call it without the `datasets.` prefix
X, y = load_iris(return_X_y=True)

In [None]:
clf = SVC(kernel='linear', C=1, random_state=42)

In [None]:
scores = cross_val_score(clf, X, y, cv=5)

In [None]:
scores.mean()

In [None]:
clf = SVC(kernel='linear', C=0.1, random_state=42)
scores = cross_val_score(clf, X, y, cv=5)
scores.mean()

In [None]:
clf = SVC(kernel='linear', C=100, random_state=42)
scores = cross_val_score(clf, X, y, cv=5)
scores.mean()
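The cells above cross-validate on the full data set. To follow the full procedure described earlier, we would first hold out a test set, cross-validate on the training portion only, and touch the test set once at the end. A minimal sketch of that workflow, assuming the same three candidate values of `C` and a stratified 60/40 split (both illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hold out a test set that plays no part in hyperparameter selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y
)

# Cross-validate each candidate value of C on the training data only.
cv_means = {}
for C in [0.1, 1, 100]:
    clf = SVC(kernel="linear", C=C, random_state=42)
    cv_means[C] = cross_val_score(clf, X_train, y_train, cv=5).mean()

# Pick the candidate with the best mean cross-validation score.
best_C = max(cv_means, key=cv_means.get)

# Refit on the whole training set and score once on the held-out test set.
final_model = SVC(kernel="linear", C=best_C, random_state=42).fit(X_train, y_train)
test_score = final_model.score(X_test, y_test)

print("mean CV accuracy per C:", cv_means)
print("best C:", best_C)
print("test accuracy:", test_score)
```

scikit-learn's `GridSearchCV` automates the search loop, but writing it out makes it clear that the test set is never used to choose `C`.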