# Model Validation in Python

# TO ADD TO CAPSTONE:
* $\star$ Compare how well the model performs on data it *has* seen versus data it *hasn't* seen to help determine whether it is over- or underfitting.
    * Goal: evaluation metrics should be similar on data the model has seen and data it has not seen.

Without proper validation, the results of running new data through a model might not be as accurate as expected. Model validation allows analysts to confidently answer the question: how good is your model? The goal of these notes is to cover the basics of model validation, discuss various validation techniques, and begin to develop tools for creating validated, high-performing models.

* **Model Validation** consists of various steps and processes that ensure your model performs as expected on new data. The most common way to do this is to test your model's accuracy (or the evaluation metric of your choice) on data it has never seen before (called a **holdout set**). If your model's accuracy is similar for the data it was trained on and the holdout data, you can claim that your model is validated. The ultimate goal of model validation is to end up with the best-performing model possible, one that achieves high accuracy on new data.

#### Model validation consists of:
* Ensuring your model performs as expected on new data
* Testing model performance on holdout datasets
* Selecting the best model, parameters, and accuracy metrics
* Achieving the best accuracy for the data given
    
### Scikit-learn modeling review
#### Basic modeling steps
* 1. Create a model by specifying model type and its parameters
* 2. Fit the model using the `.fit()` method
* 3. To assess model accuracy, we generate predictions for data using the `.predict()` method. 
* 4. Look at accuracy metrics.

* **The process of generating a model, fitting, predicting, and then reviewing model accuracy was introduced earlier in:**
    * Intermediate Python
    * Supervised Learning with scikit-learn

```
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae

model = RandomForestRegressor(n_estimators=500, random_state=1111)
model.fit(X=X_train, y=y_train)
predictions = model.predict(X_test)
print("{0:.2f}".format(mae(y_true=y_test, y_pred=predictions)))
```
* **Model validation's main goal is to ensure that a predictive model will perform as expected on new data.**
* Training data = seen data

```
model = RandomForestRegressor(n_estimators=500, random_state=1111)
model.fit(X_train, y_train)
train_predictions = model.predict(X_train)
```

* Testing data = unseen data

```
model = RandomForestRegressor(n_estimators = 500, random_state=1111)
model.fit(X_train, y_train)
test_predictions = model.predict(X_test)
```

* If your training and testing errors are vastly different, it may be a sign that your model is overfitted
* Use model validation to make sure you get the best testing error possible
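The seen-versus-unseen comparison above can be sketched end to end as follows. This is a minimal illustration, not the notes' original exercise: the data comes from `make_regression` (an assumption, since no dataset is named here), and the model uses fewer trees to keep it quick.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; the notes do not name a dataset here)
X, y = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=1111)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1111
)

model = RandomForestRegressor(n_estimators=100, random_state=1111)
model.fit(X_train, y_train)

# Error on seen (training) data versus unseen (testing) data
train_mae = mean_absolute_error(y_train, model.predict(X_train))
test_mae = mean_absolute_error(y_test, model.predict(X_test))

print("Training MAE: {0:.2f}".format(train_mae))
print("Testing MAE:  {0:.2f}".format(test_mae))
```

A large gap between the two numbers is the warning sign for overfitting described above.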

### Regression models
* More specifically: Random Forest Regression models using scikit-learn
* Random forest algorithms have a lot of parameters, but here we focus on three:
    * **`n_estimators`:** is the number of trees in the forest
    * **`max_depth`:** the maximum depth of the trees (or how many times we can split the data). Also described as the maximum length from the beginning of a tree to the tree's end nodes 
    * **`random_state`:** random seed; allows us to create reproducible models
* The most common way to set model parameters is to pass them when instantiating the model
* However, they can also be set later by assigning a new value to the corresponding attribute of the model
    * This second method can be helpful when testing different sets of parameters
    
```
rfr = RandomForestRegressor(random_state=1111)
rfr.n_estimators = 50
rfr.max_depth = 10
```

#### Feature importance
* After a model is created, we can assess how important different features (or columns) of the data were in the model by using the **`.feature_importances_`** attribute.

* The code below assumes the features are stored in a pandas DataFrame `X`, so that `X.columns` provides the feature names

```
for i, item in enumerate(rfr.feature_importances_):
    print("{0:s}: {1:.2f}".format(X.columns[i], item))
```

* **The larger this number is, the more important that column was in the model.**
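As a self-contained sketch of the same idea, the snippet below fits a small forest and sorts the importances so the most influential columns appear first. The data is synthetic and the column names (`feat_a` to `feat_d`) are hypothetical, chosen only for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with hypothetical column names (illustration only)
X_arr, y = make_regression(
    n_samples=200, n_features=4, n_informative=2, random_state=1111
)
X = pd.DataFrame(X_arr, columns=["feat_a", "feat_b", "feat_c", "feat_d"])

rfr = RandomForestRegressor(n_estimators=50, max_depth=6, random_state=1111)
rfr.fit(X, y)

# Pair each importance with its column name and sort, largest first
importances = pd.Series(rfr.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances.round(2))
```

For a fitted random forest, the importances are normalized, so they should sum to 1 across columns.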

```
# Set the number of trees
rfr.n_estimators = 100

# Add a maximum depth
rfr.max_depth = 6

# Set the random state
rfr.random_state = 1111

# Fit the model
rfr.fit(X_train, y_train)
```

#### Classification models
* Several methods are shared across all scikit-learn models, but some are unique to the specific type of model
* View how many observations were assigned to each class by turning the array of predictions into a pandas Series and then using the method `.value_counts()`:
* `pd.Series(rfc.predict(X_test)).value_counts()`

* Another prediction method is `.predict_proba()`, which returns an array of predicted probabilities for each class
* Sometimes in model validation, we want to know the probability values and not just the classification
* Each entry of the array returned by `.predict_proba()` contains probabilities that sum to 1
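The two prediction methods can be compared side by side. The sketch below assumes a synthetic binary dataset from `make_classification`, which is not part of the original notes.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=6, random_state=1111)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1111
)

rfc = RandomForestClassifier(n_estimators=50, random_state=1111)
rfc.fit(X_train, y_train)

# Hard class labels, and how many observations landed in each class
class_counts = pd.Series(rfc.predict(X_test)).value_counts()
print(class_counts)

# One row per observation, one column per class (ordered as in rfc.classes_)
probabilities = rfc.predict_proba(X_test)
print(probabilities[:3])

# Each row of probabilities sums to 1
print(np.allclose(probabilities.sum(axis=1), 1.0))
```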

* **`rfc.get_params()`:**
    * is used to review which parameters went into a scikit-learn model 
    * will print out a dictionary of parameters and their values, allowing us to see exactly which parameters were used
    * Knowing a model's parameters is essential when assessing model quality, rerunning models, and even parameter tuning
* **`.score()`:**
    * A quick way to look at the overall accuracy of the classification model
    
* Replicating model performance is vital in model validation. Replication is also important when sharing models with co-workers, reusing models on new data or asking questions on a website such as Stack Overflow. 
* The best way to do this is to replicate your work by reusing model parameters.

```
rfc = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)

# Print the classification model
print(rfc)

# Print the classification model's random state parameter
print('The random state is: {}'.format(rfc.random_state))

# Print all parameters
print('Printing the parameters dictionary: {}'.format(rfc.get_params()))
```
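One way to reuse parameters, sketched here as an illustration rather than a prescribed workflow, is to keep the dictionary returned by `.get_params()` and unpack it into a new model. The names `saved_params` and `rfc_copy` are made up for this example.

```python
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)

# Keep the full parameter dictionary (for example, alongside your results)
saved_params = rfc.get_params()

# Later, or on a co-worker's machine: rebuild an identically configured model
rfc_copy = RandomForestClassifier(**saved_params)

print(rfc_copy.get_params() == saved_params)
```

Because `random_state` is part of the dictionary, fitting both models on the same data should give the same forest.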

#### Creating train, test, and validation datasets
* "Seen" data (used for training)
* "Unseen" data (unavailable for training)
* In model validation, we use **holdout samples** to replicate this idea
* We define a **holdout dataset** as any data that is not used for training and is only used to assess model performance
    * The available data is split into two datasets:
        * One used for training
        * One that is simply off-limits while we are training our models (called a test, or holdout, dataset)
        
* When comparing a model's performance across different parameter values, use a **validation set** rather than the testing set
* The **validation set** should be the same size as the testing set

* To create **training**, **validation**, and **testing** datasets, call `train_test_split()` **twice**:
    * The first call creates a temporary training dataset and the testing dataset, like normal
    * The second call splits that temporary training dataset into the final training and validation datasets
 
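```
from sklearn.model_selection import train_test_split

# First split: temporary training data and the final testing data
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=1111)
# Second split: final training and validation data (both X_temp and y_temp must be passed)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=1111)
```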

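To see why the two test sizes are 0.2 and 0.25, it helps to check the resulting row counts. The sketch below uses 100 placeholder rows (an assumption, chosen so the percentages are easy to read).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 placeholder rows so the resulting sizes are easy to read
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# First call: hold out 20% of all rows as the testing set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1111
)
# Second call: 25% of the remaining 80% equals 20% of the original data
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1111
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The validation and testing sets end up the same size, which matches the guideline above.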
## Accuracy metrics: Regression models

#### Mean Absolute Error (MAE)
* The simplest and most intuitive error metric
* This metric weights all errors equally, so it is less sensitive to outliers than squared-error metrics
* When dealing with applications where we don't want large errors to have a major impact, the mean absolute error can be used
* The MAE is expressed in the same units as the target. As an example: if the target is a percentage, an MAE of 10 means that our predictions are about 10 percentage points off (on average)

#### Mean squared error (MSE)
* The most widely used error metric for regression models
* It is calculated similarly to the mean absolute error, but this time we square the difference term
* Because the differences are squared, larger errors have a disproportionately large impact on the metric
* This lets outlier errors contribute more to the overall error

* **Choosing between MAE and MSE comes down to the application**
* Accuracy metrics are always application-specific
* MAE and MSE error terms are in different units entirely and should not be directly compared 
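A small worked example makes the difference concrete. The values below are made up for illustration, with one deliberately large error, and both metrics are computed by hand and with scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values; the last prediction is a deliberate outlier error
y_true = np.array([10, 20, 30, 40])
y_pred = np.array([12, 18, 33, 20])

errors = y_true - y_pred  # [-2, 2, -3, 20]

mae_manual = np.mean(np.abs(errors))  # (2 + 2 + 3 + 20) / 4 = 6.75
mse_manual = np.mean(errors ** 2)     # (4 + 4 + 9 + 400) / 4 = 104.25

print(mae_manual, mean_absolute_error(y_true, y_pred))
print(mse_manual, mean_squared_error(y_true, y_pred))
```

The single error of 20 accounts for about 74% of the MAE total (20 of 27) but about 96% of the MSE total (400 of 417), which is what "outliers contribute more" means in practice.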

#### Accuracy for a subset of data
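* For example, how does this particular model perform on *only chocolate* candies from the Ultimate Halloween candy dataset?
* Filter the dataset into chocolate and not-chocolate candies and compute the accuracy metric of your choice on each subset
* The code below applies the same idea to a different grouping: teams labeled as East conference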

```
from sklearn.metrics import mean_absolute_error

# Find the East conference teams
# (labels is an array of conference labels aligned row by row with y_test)
east_teams = labels == "E"

# Create arrays for the true and predicted values
true_east = y_test[east_teams]
preds_east = predictions[east_teams]

# Print the accuracy metrics
print('The MAE for East teams is {}'.format(mean_absolute_error(true_east, preds_east)))
```

#### Classification metrics
* There are a lot of accuracy metrics available for classification problems
    * Accuracy
    * Precision 
    * Recall (Sensitivity)
    * F1 Score
    * Alternate F Scores
    * Specificity
* One way to calculate these metrics is to use the values from the confusion matrix
* When making predictions, especially if there is a binary outcome, this matrix is one of the first outputs you should review
* When we have a binary outcome, the confusion matrix is a 2x2 matrix that shows how your predictions fared across the two outcomes
* **Create a confusion matrix using: `confusion_matrix()`**

```
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_predictions)
print(cm)
```
* In this matrix, the row index represents the true category and the column index represents the predicted category.
* **Accuracy** represents the overall ability of your model to predict the correct classification: the share of all predictions that were right.
* **Precision** is the number of true positives out of all predicted positive values.
    * Precision is used when we don't want to overpredict positive values
* **Recall** is about finding all positive values
    * Recall is used when we can't afford to miss any positive values
    
* The functions for accuracy, precision, and recall are all called the same way: pass the true values first, then the predictions

```
from sklearn.metrics import accuracy_score, precision_score, recall_score
accuracy_score(y_test, test_predictions)
precision_score(y_test, test_predictions)
recall_score(y_test, test_predictions)
```

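```
# TN, FP, FN, and TP are the four counts in the 2x2 confusion matrix;
# with sklearn they can be unpacked via: TN, FP, FN, TP = cm.ravel()

# Calculate and print the accuracy
# (the denominator is the total number of observations, 953 in the original example)
accuracy = (TN + TP) / (TN + FP + FN + TP)
print("The overall accuracy is {0:0.2f}".format(accuracy))

# Calculate and print the precision
precision = (TP) / (TP + FP)
print("The precision is {0:0.2f}".format(precision))

# Calculate and print the recall
recall = (TP) / (TP + FN)
print("The recall is {0:0.2f}".format(recall))
```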

```
from sklearn.metrics import confusion_matrix

# Create predictions
test_predictions = rfc.predict(X_test)

# Create and print the confusion matrix
cm = confusion_matrix(y_test, test_predictions)
print(cm)

# Print the true positives (actual 1s that were predicted 1s)
print("The number of true positives is: {}".format(cm[1, 1]))
```
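To tie the manual formulas to the scikit-learn functions, the sketch below computes each metric both ways on a small set of made-up labels (not data from the notes) and prints them side by side.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Small made-up label arrays (illustration only)
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]

# For binary labels, ravel() flattens the 2x2 matrix row by row:
# [[TN, FP], [FN, TP]] -> TN, FP, FN, TP
TN, FP, FN, TP = confusion_matrix(y_true, y_pred).ravel()

accuracy = (TN + TP) / (TN + FP + FN + TP)
precision = TP / (TP + FP)
recall = TP / (TP + FN)

print("Accuracy:  {0:.2f} vs {1:.2f}".format(accuracy, accuracy_score(y_true, y_pred)))
print("Precision: {0:.2f} vs {1:.2f}".format(precision, precision_score(y_true, y_pred)))
print("Recall:    {0:.2f} vs {1:.2f}".format(recall, recall_score(y_true, y_pred)))
```

Here the counts are TN = 3, FP = 1, FN = 2, TP = 4, so accuracy is 7/10, precision is 4/5, and recall is 4/6.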

#### Bias-variance trade-off

* How to identify when we have a good-fitting model? One way is to consider bias and variance
* **Variance** occurs when a model pays too close attention to the training data and fails to generalize to the testing data
    * Models with high variance perform well on only the training data, but not the testing data, and are considered to be **overfit**
    * Overfitting occurs when our model starts to attach meaning to the noise in the training data
    * Overfitting is easy to identify, as the training error will be a lot lower than the testing error
* **Bias** occurs when the model fails to find the relationships between the data and the response value.
    * Bias leads to high errors on both the training and testing datasets and is associated with an **underfit** model.
    * Underfitting occurs when the model could not find the underlying patterns available in the data
    * Underfitting might occur if we don't have enough trees or if the trees aren't deep enough
    * Underfitting is more difficult to identify because the training and testing errors will both be high, and it's difficult to know if we got the most out of the data, or if we can improve the testing error.
    
* When our model is getting the most out of the training data while still performing well on the testing data, we have **optimal performance**
* How do we tell if we have a good fit, or if we are just underfitting?
    * For random forest models, some parameters that affect performance are max depth and max features 
    * One way to check for a poorly fit model is to try additional parameter sets and check both the training and testing error metrics
   
* **As you run more random forest models, you will get a better sense of which parameters you should tweak.**
* **We always compare how well the model performed on the data it has seen to the data it has not seen.**
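One way to act on this, sketched under the assumption of synthetic data and an arbitrary list of depths, is to loop over a parameter such as `max_depth` and record the training and testing errors for each setting.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration)
X, y = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=1111)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1111
)

results = []
for depth in [1, 3, 6, 12]:
    rfr = RandomForestRegressor(n_estimators=50, max_depth=depth, random_state=1111)
    rfr.fit(X_train, y_train)
    train_mae = mean_absolute_error(y_train, rfr.predict(X_train))
    test_mae = mean_absolute_error(y_test, rfr.predict(X_test))
    results.append((depth, train_mae, test_mae))
    print("max_depth={0:>2}: train MAE {1:.1f}, test MAE {2:.1f}".format(
        depth, train_mae, test_mae))
```

Both errors being high at shallow depths suggests underfitting, while a training error that keeps falling as the testing error stalls suggests overfitting.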

## Cross-validation
#### The problem with holdout sets
* Repeating the validation process with a different random seed (or with no seed specified at all) will produce different results
* **The split matters.**
* Cross-validation is the gold standard for model validation
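The effect of the split can be seen by changing only the seed passed to `train_test_split()` while keeping the model fixed. The data below is synthetic and the seeds are arbitrary, both assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for illustration)
X, y = make_regression(n_samples=200, n_features=8, noise=20.0, random_state=1111)

holdout_errors = []
for seed in [1, 2, 3, 4, 5]:
    # Only the split changes; the model settings stay fixed
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    rfr = RandomForestRegressor(n_estimators=50, random_state=1111)
    rfr.fit(X_train, y_train)
    holdout_errors.append(mean_absolute_error(y_test, rfr.predict(X_test)))

print(["{0:.1f}".format(e) for e in holdout_errors])
```

The spread between the smallest and largest value shows how much a single holdout estimate can depend on which rows happened to be held out.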

#### Cross-validation
* Cross-validation uses multiple training/validation splits: it runs a single model on various training-validation combinations and gives us a lot more confidence in our final metrics
    * We can do this in such a manner that each data point appears in only one of the validation sets, i.e. sampling *without* replacement. In this way we can ensure that every point is used for validation exactly one time
    * Although using each point in only one validation set is not required in cross-validation, it is often good practice to do so.
    
* sklearn's **`KFold`** class gives us a few options for splitting data into several training and validation sets
    * **`n_splits`:** specify the number of splits that we want
    * **`shuffle`:** boolean indicating to shuffle data before splitting
    * **`random_state`:** random seed
    
```
import numpy as np
from sklearn.model_selection import KFold

X = np.array(range(40))
y = np.array([0] * 20 + [1] * 20)

kf = KFold(n_splits = 5)
splits = kf.split(X)
```
* **This only generates indices for us to use (*not* training and validation datasets)**
* Here, we create a list of indices that can be used for our splits
* **KFold** is generally used when we want to fit the same model on each of the k training/validation splits
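A quick check, sketched here with the same 40-element toy array, confirms that the generated indices cover every point exactly once across the validation folds. Wrapping the result in `list()` is a choice made for this example so the splits can be inspected more than once.

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.array(range(40))
kf = KFold(n_splits=5, shuffle=True, random_state=1111)

# kf.split() returns a generator, so collect it in a list if you need to reuse it
splits = list(kf.split(X))

# Gather the validation indices from all five folds
val_indices = np.concatenate([val_index for _, val_index in splits])

# Every one of the 40 indices shows up in exactly one validation fold
print(sorted(val_indices.tolist()) == list(range(40)))  # True
print([len(val_index) for _, val_index in splits])      # [8, 8, 8, 8, 8]
```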

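```
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rfr = RandomForestRegressor(n_estimators=25, random_state=1111)
errors = []
# X must be a 2-D NumPy array of features here; kf is the KFold object created above
for train_index, val_index in kf.split(X):
    X_train, y_train = X[train_index], y[train_index]
    X_val, y_val = X[val_index], y[val_index]

    rfr.fit(X_train, y_train)
    # Predict on the validation fold, not on a separate test set
    predictions = rfr.predict(X_val)
    # Any error metric works here; MSE is used only as an example
    errors.append(mean_squared_error(y_val, predictions))
```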
```
from sklearn.model_selection import KFold

# Use KFold
kf = KFold(n_splits=5, shuffle=True, random_state=1111)

# Create splits
splits = kf.split(X)

# Print the number of indices
for train_index, val_index in splits:
    print("Number of training indices: %s" % len(train_index))
    print("Number of validation indices: %s" % len(val_index))
```

```
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rfr = RandomForestRegressor(n_estimators=25, random_state=1111)

# Access the training and validation indices of each split.
# kf.split(X) is called again because a generator is exhausted after one pass.
for train_index, val_index in kf.split(X):
    # Set up the training and validation data
    X_train, y_train = X[train_index], y[train_index]
    X_val, y_val = X[val_index], y[val_index]
    # Fit the random forest model
    rfr.fit(X_train, y_train)
    # Make predictions, and print the error for this split
    predictions = rfr.predict(X_val)
    print("Split MSE: " + str(mean_squared_error(y_val, predictions)))
```


#### sklearn's cross_val_score()
* KFold is a great way to create indices that we can use for cross-validation
* If you want to jump straight into cross-validation, and don't want to mess with the indices, you can use sklearn's **`cross_val_score()`** function
* **`cross_val_score()`:** 
    * is typically called with four parameters:
        * `estimator` or specific model you want to use
        * `X` to specify the complete training data set
        * `y` to specify the response values
        * `cv` the number of cross-validation splits

```
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()

cross_val_score(estimator=rfc, X=X, y=y, cv=5)
```
* By default, `cross_val_score()` uses the default scoring function of whichever model you have specified (accuracy for most classifiers, $R^2$ for most regressors)
* If you want to use a different scoring function, you can create a scorer with the `make_scorer()` function and specify the metric that you want to use.

```
from sklearn.metrics import mean_absolute_error, make_scorer
mae_scorer = make_scorer(mean_absolute_error)
cross_val_score(<estimator>, <X>, <y>, cv=5, scoring=mae_scorer)
```

```
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error, make_scorer

rfc = RandomForestRegressor(n_estimators=20, max_depth=5, random_state=1111)
mse = make_scorer(mean_squared_error)

cv_results = cross_val_score(rfc, X, y, cv=5, scoring=mse)

print(cv_results)
```
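As an alternative to `make_scorer()` that is not covered in the notes above, scikit-learn also accepts scoring strings such as `"neg_mean_absolute_error"`. The sketch below assumes synthetic data and shows that these built-in error scorers return negated values, which need to be flipped back before reporting.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption for illustration)
X, y = make_regression(n_samples=200, n_features=8, noise=20.0, random_state=1111)
rfr = RandomForestRegressor(n_estimators=20, max_depth=5, random_state=1111)

# Built-in scorers follow a "higher is better" convention,
# so error metrics come back as negative numbers
neg_mae = cross_val_score(rfr, X, y, cv=5, scoring="neg_mean_absolute_error")

# Flip the sign to get the usual (positive) MAE for each of the 5 folds
mae_per_fold = -neg_mae
print(mae_per_fold)
```

The sign convention matters mostly when the scorer is used for model selection; for simple reporting, a scorer built with `make_scorer()` as shown earlier gives the positive values directly.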

* When we use cross-validation, we usually report the mean of the errors; this is a much more realistic estimate for the out-of-sample accuracy that we can expect to see on new data

```
cv_results.mean()
cv_results.std()
```
* Find the mean and standard deviation of your cross-validation results to determine their average and spread; the smaller the standard deviation, the more consistent your 5 fold scores were
