# Model Validation in Python

Without proper validation, a model may be less accurate on new data than expected. Model validation lets analysts confidently answer the question: how good is your model? The goal here is to cover the basics of model validation, discuss various validation techniques, and begin developing tools for creating validated, high-performing models.

* **Model validation** consists of the steps and processes that ensure your model performs as expected on new data. The most common approach is to test your model's accuracy (or another evaluation metric of your choice) on data it has never seen before, called a **holdout set**. If the model's accuracy is similar on the training data and the holdout data, you can claim that the model is validated. The ultimate goal of model validation is to end up with the best-performing model possible, one that achieves high accuracy on new data.

#### Model validation consists of:
* Ensuring your model performs as expected on new data
* Testing model performance on holdout datasets
* Selecting the best model, parameters, and accuracy metrics
* Achieving the best accuracy for the data given
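
As a minimal sketch of how a holdout set might be created, the example below uses scikit-learn's `train_test_split`. The synthetic data from `make_regression` and the 80/20 split are assumptions made for illustration; substitute your own `X` and `y`.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption, not part of the original notes)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1111)

# Hold out 20% of the rows; the model never sees them during fitting
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1111
)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible, so later comparisons between models use the same holdout rows.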
    
### Scikit-learn modeling review
#### Basic modeling steps
* 1. Create a model by specifying model type and its parameters
* 2. Fit the model using the `.fit()` method
* 3. Generate predictions for the data using the `.predict()` method
* 4. Compare the predictions with the true values using an accuracy metric

* **The process of generating a model, fitting, predicting, and then reviewing model accuracy was introduced earlier in:**
    * Intermediate Python
    * Supervised Learning with scikit-learn

```
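from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae  # assumed meaning of mae

# Assumes X_train, X_test, y_train, and y_test already exist
model = RandomForestRegressor(n_estimators=500, random_state=1111)
model.fit(X=X_train, y=y_train)
predictions = model.predict(X_test)
print("{0:.2f}".format(mae(y_true=y_test, y_pred=predictions)))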
```
* **Model validation's main goal is to ensure that a predictive model will perform as expected on new data.**
* Training data = seen data

```
model = RandomForestRegressor(n_estimators=500, random_state=1111)
model.fit(X_train, y_train)
train_predictions = model.predict(X_train)
```

* Testing data = unseen data

```
model = RandomForestRegressor(n_estimators=500, random_state=1111)
model.fit(X_train, y_train)
test_predictions = model.predict(X_test)
```

* If your training and testing errors are vastly different, it may be a sign that your model is overfitted
* Use model validation to make sure you get the lowest testing error possible
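
The comparison of training and testing errors described above can be sketched as follows. The synthetic data, the 75/25 split, and the use of mean absolute error as the metric are assumptions chosen for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae
from sklearn.model_selection import train_test_split

# Synthetic, noisy stand-in data (an assumption; use your own X and y)
X, y = make_regression(n_samples=300, n_features=8, noise=25.0, random_state=1111)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1111
)

model = RandomForestRegressor(n_estimators=100, random_state=1111)
model.fit(X_train, y_train)

# Error on seen data versus error on unseen data
train_error = mae(y_true=y_train, y_pred=model.predict(X_train))
test_error = mae(y_true=y_test, y_pred=model.predict(X_test))

print("Train MAE: {0:.2f}".format(train_error))
print("Test MAE:  {0:.2f}".format(test_error))
```

A training error far below the testing error suggests the model has memorized the training data rather than learned patterns that generalize.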

### Regression models
* More specifically: Random Forest Regression models using scikit-learn
* Random forest algorithms have a lot of parameters, but here we focus on three:
    * **`n_estimators`:** the number of trees in the forest
    * **`max_depth`:** the maximum depth of the trees (how many successive times the data can be split), that is, the longest path from a tree's root to its end nodes
    * **`random_state`:** random seed; allows us to create reproducible models
* The most common way to set model parameters is when initializing the model
* However, they can also be set later by assigning a new value to the corresponding attribute of the model before calling `.fit()`
    * This second method could be helpful when testing out different sets of parameters
    
```
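from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor(random_state=1111)
# Parameters can be reassigned after creation, before fitting
rfr.n_estimators = 50
rfr.max_depth = 10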
```
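
To illustrate why assigning attributes after creation can help when testing different sets of parameters, here is a rough sketch that refits one model at several depths. The synthetic data and the candidate depths `[2, 6, 10]` are hypothetical choices, not values from the original notes.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; use your own X and y)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=1111)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1111
)

rfr = RandomForestRegressor(n_estimators=50, random_state=1111)

test_errors = {}
for depth in [2, 6, 10]:  # hypothetical candidate depths
    rfr.max_depth = depth       # reassign the parameter...
    rfr.fit(X_train, y_train)   # ...then refit so it takes effect
    test_errors[depth] = mae(y_test, rfr.predict(X_test))

for depth, error in test_errors.items():
    print("max_depth={0:d}: MAE {1:.2f}".format(depth, error))
```

Calling `rfr.set_params(max_depth=depth)` is scikit-learn's method form of the same assignment.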

#### Feature importance
* After a model is created, we can assess how important different features (or columns) of the data were in the model by using the **`.feature_importances_`** attribute.

* The code below works as long as `X` is a pandas DataFrame, since it reads the feature names from `X.columns`

```
for i, item in enumerate(rfr.feature_importances_):
    print("{0:s}: {1:.2f}".format(X.columns[i], item))
```

* **The larger this number is, the more important that column was in the model.**
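
One possible way to rank features from most to least important is to wrap the importances in a pandas Series. This is only a sketch: the synthetic data and the column names `f0` to `f3` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data in which only two of four features carry signal (an assumption)
X_array, y = make_regression(
    n_samples=200, n_features=4, n_informative=2, random_state=1111
)
X = pd.DataFrame(X_array, columns=["f0", "f1", "f2", "f3"])  # hypothetical names

rfr = RandomForestRegressor(n_estimators=100, max_depth=6, random_state=1111)
rfr.fit(X, y)

# Pair each importance with its column name and sort, largest first
importances = pd.Series(rfr.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

The importances are normalized to sum to 1, so each value can be read as a feature's share of the total.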

```
# Set the number of trees
rfr.n_estimators = 100

# Add a maximum depth
rfr.max_depth = 6

# Set the random state
rfr.random_state = 1111

# Fit the model
rfr.fit(X_train, y_train)
```