# Data Checks

EvalML provides data checks to help guide you in achieving the highest performing model. These utility functions help deal with problems such as overfitting, abnormal data, and missing data. These data checks can be found under `evalml/data_checks`. Below we will cover examples for each available data check in EvalML, as well as the `DefaultDataChecks` used in `AutoMLSearch.search`.

## Missing Data

Missing data, or rows with `NaN` values, pose many challenges for machine learning pipelines. In the worst case, many algorithms simply will not run with missing data! EvalML pipelines contain imputation [components](../user_guide/components.ipynb) to ensure that doesn't happen. Imputation works by approximating missing values with existing values. However, if a column contains a high number of missing values, a large percentage of the column would be approximated from a small percentage of its values. This could potentially create a column without useful information for machine learning pipelines. With the `HighlyNullDataCheck` data check, EvalML alerts you to this potential problem by returning the columns that exceed the missing values threshold.

In [None]:
import numpy as np
import pandas as pd

from evalml.data_checks import HighlyNullDataCheck

X = pd.DataFrame([[1, 2, 3], 
                  [0, 4, np.nan],
                  [1, 4, np.nan],
                  [9, 4, np.nan],
                  [8, 6, np.nan]])

# Flag columns whose fraction of missing values reaches the threshold
null_check = HighlyNullDataCheck(pct_null_threshold=0.8)

for message in null_check.validate(X):
    print(message.message)
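
To make the threshold concrete, the sketch below computes the fraction of missing values per column with plain pandas. It is an illustration only, not EvalML's implementation: `highly_null_columns` is a hypothetical helper, and treating a column that sits exactly at the threshold as flagged (`>=`) is an assumption.

```python
import numpy as np
import pandas as pd


def highly_null_columns(X, pct_null_threshold=0.8):
    """Return {column: fraction_null} for columns at or above the threshold.

    Hypothetical helper for illustration; the ">=" comparison is an assumption.
    """
    # Mean of a boolean mask gives the share of missing values per column
    fraction_null = X.isnull().mean()
    flagged = fraction_null[fraction_null >= pct_null_threshold]
    return flagged.to_dict()


X = pd.DataFrame([[1, 2, 3],
                  [0, 4, np.nan],
                  [1, 4, np.nan],
                  [9, 4, np.nan],
                  [8, 6, np.nan]])

# Only the last column contains missing values (4 of its 5 entries)
print(highly_null_columns(X, pct_null_threshold=0.8))
```

Lowering the threshold flags more columns, so the value to choose depends on how much imputation you are willing to accept in a single column.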

## Abnormal Data

EvalML provides a few data checks to check for abnormal data: 
- `OutliersDataCheck`
- `ClassImbalanceDataCheck`
- `IDColumnsDataCheck`
- `NoVarianceDataCheck`
- `HighVarianceCVDataCheck`
- `InvalidTargetDataCheck`

### Zero Variance

Data with zero variance indicates that all values are identical. If a feature has zero variance, it is not likely to be a useful feature. Similarly, if the target has zero variance, there is likely something wrong. `NoVarianceDataCheck` checks if the target or any feature has only one unique value and alerts you to any such columns.

In [None]:
from evalml.data_checks import NoVarianceDataCheck

X = pd.DataFrame([[0, 53, 1, 5],
                  [0, 90, 3, 10],
                  [0, 90, 18, 20]])
y = pd.Series([1, 0, 1])
no_variance_data_check = NoVarianceDataCheck()

for message in no_variance_data_check.validate(X, y):
    print(message.message)

Note that `NaN` values count as a unique value, but `NoVarianceDataCheck` will still return a warning if there is only one unique non-`NaN` value in a given column.

In [None]:
from evalml.data_checks import NoVarianceDataCheck

X = pd.DataFrame([[0, np.nan, 1, 5],
                  [10, 90, 3, 10],
                  [0, 90, 18, 20]])
y = pd.Series([1, 0, 1])

no_variance_data_check = NoVarianceDataCheck()

for message in no_variance_data_check.validate(X, y):
    print(message.message)
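
The distinction between counting `NaN` as a value and ignoring it can be seen directly in pandas. The sketch below is not how `NoVarianceDataCheck` is implemented; it only uses `DataFrame.nunique` on the same data to show why the second column has two unique values when `NaN` is counted but only one when it is not.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame([[0, np.nan, 1, 5],
                  [10, 90, 3, 10],
                  [0, 90, 18, 20]])

# Count unique values per column, treating NaN as a value of its own
unique_with_nan = X.nunique(dropna=False)

# Count unique values per column, ignoring NaN
unique_without_nan = X.nunique(dropna=True)

print(pd.DataFrame({"with_nan": unique_with_nan,
                    "without_nan": unique_without_nan}))
```

Only the second column differs between the two counts, which is the case the note above describes.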

### Class Imbalance

For classification problems, the distribution of examples across each class can vary. For small variations, this is normal and expected. However, when the number of examples for each class label is disproportionately biased or skewed towards a particular class (or classes), it can be difficult for machine learning models to predict well. In addition, having a low number of examples for a given class could mean that one or more of the CV folds generated for the training data have few or no examples from the minority class. The model may then only need to predict the majority class correctly, resulting in a poorly performing model.

`ClassImbalanceDataCheck` checks if the target labels are imbalanced beyond a specified threshold for a certain number of CV folds. It returns errors for any classes that have fewer samples than double the number of CV folds, and warnings for any classes that fall below the set threshold percentage.

In [None]:
from evalml.data_checks import ClassImbalanceDataCheck

X = pd.DataFrame([[1, 2, 0, 1],
                  [4, 1, 9, 0],
                  [4, 4, 8, 3],
                  [9, 2, 7, 1],
                  [5, 3, 6, 2]])  # illustrative fifth row added so X and y have the same length
y = pd.Series([0, 1, 1, 1, 1])
class_imbalance_check = ClassImbalanceDataCheck(threshold=0.25)

for message in class_imbalance_check.validate(X, y):
    print(message.message)
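
The two rules described above can be sketched with plain pandas. This is an illustration under stated assumptions rather than EvalML's implementation: `imbalance_report` is a hypothetical helper, and its default values for `threshold` and `num_cv_folds` are chosen for the example only.

```python
import pandas as pd


def imbalance_report(y, threshold=0.1, num_cv_folds=3):
    """Return (classes too small for CV, classes below the threshold fraction).

    Hypothetical helper for illustration; the defaults are assumptions.
    """
    counts = y.value_counts()
    fractions = counts / len(y)
    # Rule 1: fewer samples than double the number of CV folds
    too_few_for_cv = sorted(counts[counts < 2 * num_cv_folds].index.tolist())
    # Rule 2: share of the target falls below the threshold
    below_threshold = sorted(fractions[fractions < threshold].index.tolist())
    return too_few_for_cv, below_threshold


y = pd.Series([0, 1, 1, 1, 1])
too_few, below = imbalance_report(y, threshold=0.25, num_cv_folds=3)

print("Too few samples for CV:", too_few)
print("Below threshold:", below)
```

With only five samples, both classes have fewer than `2 * 3 = 6` examples, while only class `0` (one sample out of five) falls below the 25% threshold.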

### Target Leakage

### Invalid Target Data 

### ID Columns

ID columns in your dataset provide little to no benefit to a machine learning pipeline, as the pipeline cannot extrapolate useful information from unique identifiers. Thus, `IDColumnsDataCheck` alerts you if these columns exist. In the example below, the `user_number` and `id` columns are both identified as potentially being unique identifiers that should be removed.

In [None]:
from evalml.data_checks import IDColumnsDataCheck

X = pd.DataFrame([[0, 53, 6325, 5],
                  [1, 90, 6325, 10],
                  [2, 90, 18, 20]],
                 columns=['user_number', 'cost', 'revenue', 'id'])
id_col_check = IDColumnsDataCheck(id_threshold=0.9)

for message in id_col_check.validate(X):
    print(message.message)

### High Variance Cross-Validation Scores

### Outliers

Outliers are observations that differ significantly from other observations in the same sample. Many machine learning pipelines suffer in performance if outliers are not dropped from the training set as they are not representative of the data. `OutliersDataCheck()` uses Isolation Forests to notify you if a sample can be considered an outlier.

Below we generate a random dataset with some outliers.

In [None]:
data = np.random.randn(100, 100)
X = pd.DataFrame(data=data)

# generate some outliers in rows 3, 25, 55, and 72
X.iloc[3, :] = pd.Series(np.random.randn(100) * 10)
X.iloc[25, :] = pd.Series(np.random.randn(100) * 20)
X.iloc[55, :] = pd.Series(np.random.randn(100) * 100)
X.iloc[72, :] = pd.Series(np.random.randn(100) * 100)

We then utilize `OutliersDataCheck()` to rediscover these outliers.

In [None]:
from evalml.data_checks import OutliersDataCheck

outliers_check = OutliersDataCheck()

for message in outliers_check.validate(X):
    print(message.message)

## DefaultDataChecks

`AutoMLSearch.search` can run a set of data checks to ensure that the input data will not run into some common issues before running a potentially time-consuming search. This is controlled by the `data_checks` parameter. By default, `data_checks` is set to `'auto'`, which will run the collection of data checks stored in `DefaultDataChecks`.
We can also pass in our own list of data checks by 'none' or 

### Default Data Checks

By default, `AutoMLSearch.search` runs a collection of data checks before it searches and iterates over pipelines. This collection of data checks is stored in the `DefaultDataChecks` class. It consists of a few data checks that are generally helpful for any machine learning problem. They are:
- `HighlyNullDataCheck`
- `IDColumnsDataCheck`
- `TargetLeakageDataCheck`
- `InvalidTargetDataCheck`
- `ClassImbalanceDataCheck`
- `NoVarianceDataCheck`

## Writing Your Own Data Check

If you would prefer to write your own data check, you can do so by extending the `DataCheck` class and implementing the `validate(self, X, y)` method. Below, we've created a new `DataCheck`, `ZeroVarianceDataCheck`.

In [None]:
import pandas as pd

from evalml.data_checks import DataCheck
from evalml.data_checks.data_check_message import DataCheckError

class ZeroVarianceDataCheck(DataCheck):
    def validate(self, X, y):
        # Accept array-like input by converting it to a DataFrame
        if not isinstance(X, pd.DataFrame):
            X = pd.DataFrame(X)
        error_msg = "Column '{}' has zero variance"
        # Return one error per column that holds a single unique value
        return [DataCheckError(error_msg.format(column), self.name)
                for column in X.columns
                if len(X[column].unique()) == 1]