# Using Data Checks in AutoML

The ultimate goal of machine learning is to make accurate predictions on unseen data. EvalML aims to help you build a model that will perform as you expect once it is deployed in the real world.

One of the benefits of using EvalML to build models is that it provides data checks to help ensure you are building pipelines that will perform reliably in the future. This page describes how data checks are used by default during the search process and how you can customize or disable them.

In [None]:
import evalml

## Default data checks in AutoML

By default, AutoML runs the series of data checks in `DefaultDataChecks` when `automl.search()` is called, to verify that the inputs are valid before running the search and fitting pipelines. Currently, `DefaultDataChecks` contains a data check that flags any column that is 95% or more null, since such a column likely carries little or no useful information.
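To make the 95% null criterion concrete, here is a minimal pandas-only sketch of how such columns could be identified by hand. The helper `highly_null_columns` and its `threshold` parameter are hypothetical names used for illustration; this is not EvalML's implementation, only an approximation of the rule described above.

```python
import pandas as pd


def highly_null_columns(X, threshold=0.95):
    """Hypothetical helper (not part of EvalML): return the names of columns
    whose fraction of missing values is at or above `threshold`."""
    null_fraction = X.isnull().mean()  # per-column share of missing values
    return null_fraction[null_fraction >= threshold].index.tolist()


X_example = pd.DataFrame({'lots_of_null': [None] * 19 + [5],
                          'no_null': range(20)})
flagged = highly_null_columns(X_example)
print(flagged)
```

Running a quick check like this before calling `automl.search()` can show which columns are likely to trigger the default data check.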

If the data checks return any error or warning messages, `automl.search()` will raise a `ValueError` and quit before searching. This lets users address any issues before running the potentially time-intensive search process. For example, the data below contain a column that is almost entirely null, so the checks in `DefaultDataChecks` cause a `ValueError` to be raised when we try to run the search.

In [None]:
import pandas as pd
X = pd.DataFrame({'lots_of_null': [None] * 19 + [5],
                  'no_null': range(20)})
y = pd.Series([1, 0] * 10)
automl = evalml.AutoClassificationSearch(max_pipelines=1)
automl.search(X, y)


To see the exact warning and error messages our data checks returned, we can inspect `automl.latest_data_check_results`.

In [None]:
for message in automl.latest_data_check_results:
    print(message.message)

## Using your own data check with AutoML

If you'd prefer to use your own data check, you can do so by passing a `DataChecks` object as the value of the `data_checks` parameter of `search()`.

In [None]:

from evalml.data_checks import DataCheck, DataChecks
from evalml.data_checks.data_check_message import DataCheckWarning

class ZeroVarianceDataCheck(DataCheck):
    """Warn about columns that contain only a single unique value."""

    def validate(self, X, y):
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        warning_msg = "Column '{}' has zero variance"
        # Emit one warning per column whose values are all identical
        return [DataCheckWarning(warning_msg.format(column), self.name)
                for column in X.columns
                if len(X[column].unique()) == 1]


data_checks = DataChecks(data_checks=[ZeroVarianceDataCheck()])

X = pd.DataFrame({'no_var': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
                  'any_average_col': [2, 0, 1, 2, 1, 2, 0, 1, 2, 1],
                  'another_average_col': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]})
y = pd.Series([0, 1, 1, 0, 0, 0, 1, 1, 0, 0])

automl = evalml.AutoClassificationSearch(max_pipelines=1)
automl.search(X, y, data_checks=data_checks)


Inspecting `automl.latest_data_check_results` shows the messages raised by our data checks, which helps us begin to address the issues they found.

In [None]:
for message in automl.latest_data_check_results:
    print(message.message)
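One possible way to address a zero-variance warning is to drop the offending columns before searching again. The sketch below uses only pandas; `drop_zero_variance_columns` is a hypothetical helper introduced here for illustration and is not part of EvalML.

```python
import pandas as pd


def drop_zero_variance_columns(X):
    """Hypothetical helper (not part of EvalML): keep only the columns
    that have more than one distinct value."""
    # dropna=False counts NaN as its own value, matching Series.unique()
    n_unique = X.nunique(dropna=False)
    return X.loc[:, n_unique > 1]


X_example = pd.DataFrame({'no_var': [1] * 10,
                          'another_average_col': range(10)})
X_cleaned = drop_zero_variance_columns(X_example)
print(list(X_cleaned.columns))
```

Whether dropping a column is the right fix depends on the data; in some cases a constant column points to an upstream extraction problem that is better fixed at the source.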

## Disabling data checks

If you'd prefer not to run any data checks before running search, you can provide an `EmptyDataChecks` instance to `search()` instead.

In [None]:
from evalml.data_checks import EmptyDataChecks
import pandas as pd

automl = evalml.AutoClassificationSearch(max_pipelines=1)
automl.search(X, y, data_checks=EmptyDataChecks())

Unlike above, no data checks are run here, so the same input data that raised an error earlier no longer does, and the search process continues.