# Avoiding Overfitting

The ultimate goal of machine learning is to make accurate predictions on unseen data. One of the benefits of using EvalML to build models is that it provides guardrails to ensure you are building pipelines that will perform reliably in the future.

The sections on this page describe the various ways EvalML helps you avoid overfitting to your data. In the end, EvalML aims to help you build a model that will perform as you expect once it is deployed in the real world.

In [1]:
import evalml

## Detects Label Leakage

A common problem is having features in your training data that include information from your label. By default, EvalML will provide a warning when it detects this may be the case.

In [2]:
import pandas as pd

X = pd.DataFrame({
    "leaked_feature": [6, 6, 10, 5, 5, 11, 5, 10, 11, 4],
    "leaked_feature_2": [3, 2.5, 5, 2.5, 3, 5.5, 2, 5, 5.5, 2],
    "correct_feature": [3, 1, 3, 2, 4, 6, 1, 3, 3, 11]
})

y = pd.Series([1, 1, 0, 1, 1, 0, 1, 0, 0, 1])

clf = evalml.AutoClassifier(
    max_pipelines=1,
    model_types=["linear_model"],
    detect_label_leakage=True,
)
clf.fit(X, y)

*****************************
* Beginning pipeline search *
*****************************

Optimizing for Precision. Greater score is better.

Searching up to 1 pipelines. No time limit is set. Set one using max_time parameter.

Possible model types: linear_model

Testing LogisticRegression w/ imputation + scaling: 100%|██████████| 1/1 [00:01<00:00,  1.54s/it]

✔ Optimization finished


In the example above, EvalML warned about the input features "leaked_feature" and "leaked_feature_2", which are both very closely correlated with the label we are trying to predict. If you'd like to turn this check off, set `detect_label_leakage=False`.
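To make the idea concrete, here is a minimal sketch of a correlation-based leakage check, using only the Python standard library and the data from the cell above. The helper names `pearson` and `find_leaky_features` and the 0.95 threshold are assumptions made for illustration; this page does not state how EvalML implements its check.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def find_leaky_features(features, label, threshold=0.95):
    """Return the names of features whose |correlation| with the label
    meets the threshold. The threshold is an illustrative assumption."""
    return [
        name
        for name, values in features.items()
        if abs(pearson(values, label)) >= threshold
    ]

# Same data as the example above.
features = {
    "leaked_feature": [6, 6, 10, 5, 5, 11, 5, 10, 11, 4],
    "leaked_feature_2": [3, 2.5, 5, 2.5, 3, 5.5, 2, 5, 5.5, 2],
    "correct_feature": [3, 1, 3, 2, 4, 6, 1, 3, 3, 11],
}
label = [1, 1, 0, 1, 1, 0, 1, 0, 0, 1]

flagged = find_leaky_features(features, label)
print(flagged)  # ['leaked_feature', 'leaked_feature_2']
```

Both leaked features have a correlation with the label of roughly -0.97, while "correct_feature" is close to 0, so only the first two are flagged.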

The second way to find features that may be leaking label information is to look at the top features of the model. As we can see below, the two leaked features have the largest importance magnitudes in our model.

In [3]:
best_pipeline = clf.best_pipeline
best_pipeline.feature_importances

| | feature | importance |
|---|---|---|
| 0 | leaked_feature | -1.773115 |
| 1 | leaked_feature_2 | -1.731261 |
| 2 | correct_feature | -0.247665 |



## Performs cross-validation for pipeline evaluation

By default, EvalML performs 3-fold cross validation when building pipelines. This means that it evaluates each pipeline 3 times, using different subsets of the data for training and testing each time. In each trial, the data used for testing has no overlap with the data used for training, so that the scores are not inflated by overfitting.
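As a rough illustration of what "no overlap" means, the sketch below splits row indices into 3 folds with the standard library and checks that each test fold is disjoint from its training rows. The function `k_fold_indices` is a hypothetical helper written for this page, not part of EvalML, and it does not shuffle or stratify.

```python
def k_fold_indices(n_samples, n_folds=3):
    """Split range(n_samples) into n_folds (train, test) index pairs.

    Each row lands in exactly one test fold; no shuffling is applied.
    """
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_folds)
    folds = []
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds take one additional row.
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        folds.append((train, test))
        start += size
    return folds

folds = k_fold_indices(10, n_folds=3)
for train, test in folds:
    # Training and testing rows never overlap within a trial.
    assert not set(train) & set(test)
    print(len(train), len(test))
```

Every row is used for testing exactly once and for training in the other two trials.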

While this is a good baseline approach, you can pass your own cross validation object to be used during modeling. The cross validation object can be any of the CV methods defined in [scikit-learn](https://scikit-learn.org/stable/modules/cross_validation.html), or any object with a compatible API.

For example, if we wanted to do a time series split:

In [4]:
from sklearn.model_selection import TimeSeriesSplit

X, y = evalml.demos.load_breast_cancer()

clf = evalml.AutoClassifier(
    cv=TimeSeriesSplit(n_splits=6), 
    max_pipelines=1
)

clf.fit(X, y)

*****************************
* Beginning pipeline search *
*****************************

Optimizing for Precision. Greater score is better.

Searching up to 1 pipelines. No time limit is set. Set one using max_time parameter.

Possible model types: linear_model, random_forest, xgboost

Testing XGBoost w/ imputation: 100%|██████████| 1/1 [00:00<00:00,  1.70it/s]

✔ Optimization finished


If we describe the 1 pipeline we built, we can see the scores for each of the 6 splits as determined by the cross-validation object we provided.

In [5]:
clf.describe_pipeline(0)

************************
* Pipeline Description *
************************

Pipeline Name: XGBoost w/ imputation
Model type: xgboost
Objective: Precision (greater is better)
Total training time (including CV): 0.6 seconds

Parameters
• eta: 0.5928446182250184
• min_child_weight: 8.598391737229157
• max_depth: 4
• impute_strategy: most_frequent
• percent_features: 0.6273280598181127

Cross Validation
               F1  Precision  Recall   AUC  Log Loss  # Training  # Testing
0           0.822      0.974   0.822 0.950     0.578      83.000         81
1           0.988      1.000   0.988 1.000     0.163     164.000         81
2           0.972      0.964   0.972 0.968     0.134     245.000         81
3           0.955      1.000   0.955 0.997     0.106     326.000         81
4           0.968      1.000   0.968 0.998     0.116     407.000         81
5           0.983      0.983   0.983 0.998     0.077     488.000         81
mean        0.948      0.987   0.948 0.98
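The "# Training" and "# Testing" columns above follow an expanding-window pattern: each split trains on all rows before the test window, and the test window slides forward by a fixed size. The sketch below reproduces those counts with the standard library. It assumes 569 rows (implied by the last split, 488 + 81) and mirrors the default behavior of scikit-learn's `TimeSeriesSplit` as I understand it; `expanding_window_splits` is a hypothetical helper, not a scikit-learn or EvalML function.

```python
def expanding_window_splits(n_samples, n_splits):
    """Return (train, test) index lists for an expanding-window split.

    The test window has a fixed size; the training set is every row
    that comes before the test window.
    """
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    splits = []
    for test_start in range(first_test_start, n_samples, test_size):
        train = list(range(test_start))
        test = list(range(test_start, test_start + test_size))
        splits.append((train, test))
    return splits

splits = expanding_window_splits(569, 6)
sizes = [(len(train), len(test)) for train, test in splits]
print(sizes)
# [(83, 81), (164, 81), (245, 81), (326, 81), (407, 81), (488, 81)]
```

Because the training rows always precede the test rows, the model is never evaluated on data older than what it was trained on.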

## Detects unstable pipelines

When we perform cross validation, we are trying to generate an estimate of pipeline performance. EvalML does this by taking the mean of the scores across the folds. If the performance across the folds varies greatly, it indicates that the estimated value may be unreliable.

To protect the user against this, EvalML checks whether the performance of the pipeline has high variance between folds. It triggers a warning if the coefficient of variation of the pipeline's scores (the standard deviation divided by the mean) exceeds 0.2.

This warning will appear in the pipeline rankings under `high_variance_cv`.

In [6]:
clf.rankings

| | id | pipeline_name | score | high_variance_cv | parameters |
|---|---|---|---|---|---|
| 0 | 0 | XGBoostPipeline | 0.986776 | False | {'eta': 0.5928446182250184, 'min_child_weight'... |
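The variance check described in this section can be sketched in a few lines. The example uses the per-fold precision values from the pipeline description earlier on this page. Whether EvalML uses the population or the sample standard deviation is not stated here, so the use of `statistics.pstdev` is an assumption, and `has_high_variance` is a hypothetical name.

```python
import statistics

def has_high_variance(scores, threshold=0.2):
    """Flag scores whose coefficient of variation (std / mean) exceeds
    the threshold. Assumes a nonzero mean and population std."""
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores)
    return abs(std / mean) > threshold

# Per-fold precision from the 6-split pipeline description above.
fold_precision = [0.974, 1.000, 0.964, 1.000, 1.000, 0.983]
print(has_high_variance(fold_precision))  # False

# A made-up set of scores that varies a lot between folds.
print(has_high_variance([0.2, 0.9, 0.5]))  # True
```

The fold scores for this pipeline differ by only a few percent of their mean, which is consistent with `high_variance_cv` being `False` in the rankings.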


## Create holdout for model validation

EvalML offers a method to quickly create a holdout validation set. A holdout validation set is data that is not used during the process of optimizing or training the model. You should only use this validation set once you've picked the final model you'd like to use.

Below we create a holdout set of 20% of our data.

In [7]:
X, y = evalml.demos.load_breast_cancer()
X_train, X_holdout, y_train, y_holdout = evalml.preprocessing.split_data(X, y, test_size=.2)
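For intuition, a shuffled 80/20 split can be sketched with the standard library as follows. This is only an illustration of the concept: `holdout_split` is a hypothetical helper, and the shuffling, rounding, and seeding choices are assumptions rather than a description of what `split_data` does internally.

```python
import random

def holdout_split(rows, test_size=0.2, seed=0):
    """Shuffle row positions and set aside a fraction as a holdout set."""
    rng = random.Random(seed)
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_holdout = round(len(rows) * test_size)
    holdout_idx = indices[:n_holdout]
    train_idx = indices[n_holdout:]
    train = [rows[i] for i in train_idx]
    holdout = [rows[i] for i in holdout_idx]
    return train, holdout

rows = list(range(10))
train_rows, holdout_rows = holdout_split(rows, test_size=0.2)
print(len(train_rows), len(holdout_rows))  # 8 2
```

The key property is that the holdout rows never appear in the training rows, so a score computed on them is not influenced by the search.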

In [8]:
clf = evalml.AutoClassifier(
    max_pipelines=3,
    detect_label_leakage=True
)
clf.fit(X_train, y_train)
clf.rankings

*****************************
* Beginning pipeline search *
*****************************

Optimizing for Precision. Greater score is better.

Searching up to 3 pipelines. No time limit is set. Set one using max_time parameter.

Possible model types: linear_model, random_forest, xgboost

Testing XGBoost w/ imputation: 100%|██████████| 3/3 [00:01<00:00,  2.07it/s]                     

✔ Optimization finished


| | id | pipeline_name | score | high_variance_cv | parameters |
|---|---|---|---|---|---|
| 0 | 0 | LogisticRegressionPipeline | 0.965669 | False | {'penalty': 'l2', 'C': 8.444214828324364, 'imp... |
| 1 | 1 | LogisticRegressionPipeline | 0.965669 | False | {'penalty': 'l2', 'C': 6.239401330891865, 'imp... |
| 2 | 2 | XGBoostPipeline | 0.955733 | False | {'eta': 0.5928446182250184, 'min_child_weight'... |


Then we can retrain the best pipeline on all of our training data and see how it performs on the holdout set compared to the cross-validation estimate.

In [9]:
pipeline = clf.best_pipeline
pipeline.fit(X_train, y_train)
pipeline.score(X_holdout, y_holdout)

0.9726027397260274