# Model Validation in Python

Without proper validation, a model's results on new data might not be as accurate as expected. Model validation allows analysts to confidently answer the question: how good is your model? Goal: cover the basics of model validation, discuss various validation techniques, and begin to develop tools for creating validated, high-performing models.

* **Model validation** consists of the steps and processes that ensure your model performs as expected on new data. The most common way to do this is to test your model's accuracy (or another evaluation metric of your choice) on data it has never seen before, called a **holdout set**. If your model's accuracy is similar on the data it was trained on and on the holdout data, you can claim that your model is validated. The ultimate goal of model validation is to end up with the best-performing model possible, one that achieves high accuracy on new data.

#### Model validation consists of:
* Ensuring your model performs as expected on new data
* Testing model performance on holdout datasets
* Selecting the best model, parameters, and accuracy metrics
* Achieving the best accuracy for the data given
    
### Scikit-learn modeling review
#### Basic modeling steps
* 1. Create a model by specifying model type and its parameters
* 2. Fit the model using the `.fit()` method
* 3. To assess model accuracy, we generate predictions for data using the `.predict()` method. 
* 4. Look at accuracy metrics.

* **The process of generating a model, fitting, predicting, and then reviewing model accuracy was introduced earlier in:**
    * Intermediate Python
    * Supervised Learning with scikit-learn

```
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae

model = RandomForestRegressor(n_estimators=500, random_state=1111)
model.fit(X=X_train, y=y_train)
predictions = model.predict(X_test)
print("{0:.2f}".format(mae(y_true=y_test, y_pred=predictions)))
```
* **Model validation's main goal is to ensure that a predictive model will perform as expected on new data.**
* Training data = seen data

```
model = RandomForestRegressor(n_estimators=500, random_state=1111)
model.fit(X_train, y_train)
train_predictions = model.predict(X_train)
```

* Testing data = unseen data

```
model = RandomForestRegressor(n_estimators = 500, random_state=1111)
model.fit(X_train, y_train)
test_predictions = model.predict(X_test)
```

* If your training and testing errors are vastly different, it may be a sign that your model is overfitted
* Use model validation to make sure you get the best testing error possible
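
The comparison between training and testing error can be sketched as follows. This is a minimal illustration rather than a definitive workflow: the synthetic data from `make_regression`, the 80/20 split, and the choice of 50 trees are all assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data stands in for a real dataset (an assumption for illustration)
X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=1111)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1111
)

model = RandomForestRegressor(n_estimators=50, random_state=1111)
model.fit(X_train, y_train)

# Error on seen (training) data versus unseen (testing) data
train_error = mean_absolute_error(y_train, model.predict(X_train))
test_error = mean_absolute_error(y_test, model.predict(X_test))

print("Training MAE: {0:.2f}".format(train_error))
print("Testing MAE: {0:.2f}".format(test_error))
```

A training error far below the testing error suggests the model has memorized the training data rather than learned patterns that generalize.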

### Regression models
* More specifically: Random Forest Regression models using scikit-learn
* Random forest algorithms have a lot of parameters, but here we focus on three:
    * **`n_estimators`:** the number of trees in the forest
    * **`max_depth`:** the maximum depth of the trees (or how many times we can split the data). Also described as the maximum length from the beginning of a tree to the tree's end nodes
    * **`random_state`:** random seed; allows us to create reproducible models
* The most common way to set model parameters is to do so when instantiating the model
* However, they can also be set later, by assigning a new value to a model's attribute.
    * This second method could be helpful when testing out different sets of parameters
    
```
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor(random_state=1111)
rfr.n_estimators = 50
rfr.max_depth = 10
```
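
As a rough sketch of why assigning attributes after creation can help when testing different sets of parameters, the loop below changes `max_depth` on a single model object and refits it each time. The synthetic data, the `X_val`/`y_val` holdout split, and the candidate depths are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data and a simple holdout split (assumptions for illustration)
X, y = make_regression(n_samples=300, n_features=5, noise=20.0, random_state=1111)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=1111
)

rfr = RandomForestRegressor(n_estimators=50, random_state=1111)

# Reuse the same model object, changing one parameter per iteration
results = {}
for depth in [2, 6, 10]:
    rfr.max_depth = depth
    rfr.fit(X_train, y_train)  # refits from scratch with the new depth
    results[depth] = mean_absolute_error(y_val, rfr.predict(X_val))

for depth, error in results.items():
    print("max_depth={0}: MAE={1:.2f}".format(depth, error))
```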

#### Feature importance
* After a model is created, we can assess how important different features (or columns) of the data were in the model by using the **`.feature_importances_`** attribute.

* The code below works as long as the data `X` is a pandas DataFrame, because it reads the feature names from `X.columns`

```
for i, item in enumerate(rfr.feature_importances_):
    print("{0:s}: {1:.2f}".format(X.columns[i], item))
```

* **The larger this number is, the more important that column was in the model.**

```
# Set the number of trees
rfr.n_estimators = 100

# Add a maximum depth
rfr.max_depth = 6

# Set the random state
rfr.random_state = 1111

# Fit the model
rfr.fit(X_train, y_train)
```
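
One possible way to read the importances is to pair them with column names in a pandas Series and sort them. The sketch below assumes synthetic data and hypothetical column names (`feat_a` to `feat_d`); neither comes from the original notes.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with hypothetical column names (assumptions for illustration)
X_array, y = make_regression(
    n_samples=200, n_features=4, n_informative=2, random_state=1111
)
X = pd.DataFrame(X_array, columns=["feat_a", "feat_b", "feat_c", "feat_d"])

rfr = RandomForestRegressor(n_estimators=50, max_depth=6, random_state=1111)
rfr.fit(X, y)

# Pair each importance with its column name and sort, largest first
importances = pd.Series(rfr.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

For a fitted random forest the importances are normalized, so they sum to 1 and can be read as relative shares.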

### Classification models
* Several methods are shared across all scikit-learn models, but some are unique to the specific type of model
* View how many observations were assigned to each class by turning the array of predictions into a pandas Series and then using the method `.value_counts()`:
* `pd.Series(rfc.predict(X_test)).value_counts()`

* Another prediction method is `.predict_proba()`, which returns an array of predicted probabilities for each class
* Sometimes in model validation, we want to know the probability values and not just the classification
* Each row of the array returned by `.predict_proba()` corresponds to one observation and contains class probabilities that sum to 1
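
The two prediction methods above can be sketched side by side. This is a minimal example under stated assumptions: the data is synthetic (`make_classification` with two classes), and the model settings are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=6, random_state=1111)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1111
)

rfc = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)
rfc.fit(X_train, y_train)

# How many test observations were assigned to each class
class_counts = pd.Series(rfc.predict(X_test)).value_counts()
print(class_counts)

# One row per observation, one column per class
probabilities = rfc.predict_proba(X_test)
print(probabilities[:5])

# Every row of probabilities sums to 1
print(np.allclose(probabilities.sum(axis=1), 1.0))
```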

* **`rfc.get_params()`:**
    * Is used to review which parameters went into a scikit-learn model
    * Returns a dictionary of parameters and their values, allowing us to see exactly which parameters were used
    * Knowing a model's parameters is essential when assessing model quality, rerunning models, and tuning parameters
* **`.score()`:**
    * A quick way to look at the overall accuracy of a classification model: it returns the mean accuracy on the data and labels passed in
    
* Replicating model performance is vital in model validation. Replication is also important when sharing models with co-workers, reusing models on new data or asking questions on a website such as Stack Overflow. 
* The most reliable way to replicate a model is to reuse its parameters, including `random_state`.

```
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)

# Print the classification model
print(rfc)

# Print the classification model's random state parameter
print('The random state is: {}'.format(rfc.random_state))

# Print all parameters
print('Printing the parameters dictionary: {}'.format(rfc.get_params()))
```
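
To illustrate replication, the sketch below rebuilds a second classifier from the dictionary returned by `get_params()` and checks that both models score the same on a holdout set. The synthetic data and the name `rfc_copy` are assumptions introduced for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic two-class data (an assumption for illustration)
X, y = make_classification(n_samples=300, n_features=6, random_state=1111)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1111
)

rfc = RandomForestClassifier(n_estimators=50, max_depth=6, random_state=1111)
rfc.fit(X_train, y_train)

# Rebuild an identical model from the saved parameter dictionary
params = rfc.get_params()
rfc_copy = RandomForestClassifier(**params)
rfc_copy.fit(X_train, y_train)

# .score() returns mean accuracy for classifiers
original_score = rfc.score(X_test, y_test)
copy_score = rfc_copy.score(X_test, y_test)
print("Original: {0:.3f}, copy: {1:.3f}".format(original_score, copy_score))
```

Because `random_state` is part of the parameter dictionary, the copy is trained with the same seed on the same data and should reproduce the original model's accuracy.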

#### Creating train, test, and validation datasets
* "Seen" data (used for training)
* "Unseen" data (unavailable for training)
* In model validation, we use **holdout samples** to replicate this idea
* We define a **holdout dataset** as any data that is not used for training and is only used to assess model performance
    * The available data is split into two datasets:
        * One used for training
        * One that is simply off-limits while we are training our models (called a test, or holdout, dataset)
        
* When evaluating a model's performance using different parameter values, use a **validation set**
* The **validation set** should be the same size as the testing set

* To create **training**, **validation**, and **testing** datasets, we call `train_test_split()` **TWICE**:
* The first call creates a temporary training dataset and the testing dataset, like normal.
* The second call splits the temporary training dataset into the final training and validation datasets.
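
The arithmetic behind the two calls is worth spelling out: holding out 20% for testing leaves 80%, and taking 25% of that 80% gives a validation set that is 20% of the original data, the same size as the testing set. A minimal sketch with 100 made-up rows (the arrays here are placeholders, not real data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 placeholder rows with 2 features each (made-up data for illustration)
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# First call: 20% of all rows go to the testing set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1111
)

# Second call: 25% of the remaining 80 rows go to the validation set
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=1111
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```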
 
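```
from sklearn.model_selection import train_test_split

X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=1111)
# Pass y_temp as well so the features and labels are split together
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=1111)
```
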

## Accuracy metrics: Regression models

#### Mean Absolute Error (MAE)
* The simplest and most intuitive error metric
* This metric treats all points equally and is less sensitive to outliers than squared-error metrics
* When dealing with applications where we don't want large errors to have a major impact, the mean absolute error can be used
* MAE is expressed in the same units as the target. As an example: if the target is a percentage, an MAE of 10 would mean that we are about 10 percentage points off (on average) when predicting the values of our dataset
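
MAE is the average of the absolute differences between the true and predicted values. The sketch below computes it by hand with NumPy and compares the result with scikit-learn's `mean_absolute_error`; the four values are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true and predicted values (for illustration only)
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

# Absolute errors are 2, 2, 3, 3, so their mean is 2.5
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)
```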

#### Mean squared error (MSE)
* The most widely used error metric for regression models
* It is calculated similarly to the mean absolute error, but this time we square the difference term
* Because each error is squared, larger errors have a disproportionately larger impact on the metric
* Allows outlier errors to contribute more to the overall error

* **Choosing between MAE and MSE comes down to the application**
* Accuracy metrics are always application-specific
* MAE is in the units of the target, while MSE is in squared units, so the two values should not be directly compared
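
A small made-up example can show how the two metrics react to an outlier. Both prediction sets below have the same total absolute error, so their MAE is identical (2), but the set with one large miss has an MSE of 16 instead of 4. The arrays and names are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([10.0, 20.0, 30.0, 40.0])
preds_even = np.array([12.0, 18.0, 32.0, 38.0])     # four errors of size 2
preds_outlier = np.array([10.0, 20.0, 30.0, 48.0])  # one error of size 8

# MAE weighs every unit of error equally
mae_even = mean_absolute_error(y_true, preds_even)
mae_outlier = mean_absolute_error(y_true, preds_outlier)

# MSE squares each error, so the single large miss dominates
mse_even = mean_squared_error(y_true, preds_even)
mse_outlier = mean_squared_error(y_true, preds_outlier)

print("MAE:", mae_even, mae_outlier)
print("MSE:", mse_even, mse_outlier)
```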

#### Accuracy for a subset of data
* For example, how does this particular model perform on *only chocolate* candies from the Ultimate Halloween candy dataset?
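* Filter the dataset into chocolate and not-chocolate candies and run the accuracy metric of your choice on each subset
* The code below applies the same idea to a different dataset, filtering on East conference teams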

```
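from sklearn.metrics import mean_absolute_error

# Find the East conference teams (assumes `labels`, `y_test`, and `predictions` are aligned arrays)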
east_teams = labels == "E"

# Create arrays for the true and predicted values
true_east = y_test[east_teams]
preds_east = predictions[east_teams]

# Print the accuracy metrics
print('The MAE for East teams is {}'.format(mean_absolute_error(true_east, preds_east)))
```