# Getting started with EvalML

EvalML streamlines the creation and implementation of machine learning models for tabular data. One of its features is data checks, which assess the health of the data before we train a model on it. The default data checks are:
- HighlyNullDataCheck: Checks whether the rows or columns are highly null
- IDColumnsDataCheck: Checks for columns that could be ID columns
- TargetLeakageDataCheck: Checks if any of the input features have a high association with the targets
- InvalidTargetDataCheck: Checks if there are null or other invalid values in the targets
- NoVarianceDataCheck: Checks if any targets or features have no variance
- NaturalLanguageNaNDataCheck: Checks if any natural language columns have missing data
- DateTimeNaNDataCheck: Checks if any datetime columns have missing data
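
To make a few of these checks concrete, here is a minimal pandas sketch of the kind of per-column statistics such checks could look at: the fraction of missing values, the number of distinct values, and whether an integer column is unique per row. This is not EvalML's implementation. The helper name `summarize_health`, the `null_threshold` value, and the ID heuristic are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def summarize_health(df, null_threshold=0.95):
    """Flag columns that look highly null, constant, or ID-like (illustrative only)."""
    null_fraction = df.isna().mean()         # share of missing values per column
    unique_counts = df.nunique(dropna=True)  # distinct non-null values per column
    n_rows = len(df)
    return {
        "highly_null": null_fraction[null_fraction >= null_threshold].index.tolist(),
        "no_variance": unique_counts[unique_counts <= 1].index.tolist(),
        # assumed heuristic: an integer column with one distinct value per row
        "id_like": [
            col for col in df.columns
            if pd.api.types.is_integer_dtype(df[col]) and unique_counts[col] == n_rows
        ],
    }


toy = pd.DataFrame({
    "amount": [10.5, 3.2, 8.8, 1.0, 4.4, 7.1],
    "constant": [1, 1, 1, 1, 1, 1],
    "row_id": [1, 2, 3, 4, 5, 6],
    "sparse": [np.nan, np.nan, np.nan, np.nan, 2.0, 3.0],
})
# a low threshold is used here only because the toy frame is tiny
report = summarize_health(toy, null_threshold=0.5)
print(report)
```

The real data checks also return structured warnings, errors, and suggested actions, which we look at later in this walkthrough.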

EvalML provides additional data checks through its API. They are documented in the [API reference](https://evalml.alteryx.com/en/stable/api_index.html#data-checks), with steps to use them in the [user guide](https://evalml.alteryx.com/en/stable/user_guide/data_checks.html). Here, we walk through example usage of the default data checks that EvalML provides.


First, we import the libraries needed to demonstrate these checks.

In [None]:
import numpy as np
import woodwork as ww
import pandas as pd
from evalml import AutoMLSearch
from evalml.demos import load_fraud

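Let's look at the X data. EvalML uses the [Woodwork](https://woodwork.alteryx.com/en/stable/) library to represent this data. The demo data that EvalML returns is a pandas DataFrame and Series with Woodwork typing information attached, which is why later cells call `X.ww.init()` and `ww.init_series` after modifying the data.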

In [None]:
X, y = load_fraud(n_rows=1000)
X

This data is already clean and compatible with EvalML's ``AutoMLSearch``. To demonstrate the default data checks, we add some noise and unhealthy data to it. The changes are:
- A row of null values
- A column of mostly null values (0.5% non-null)
- An ID column
- A column with low/no variance
- A missing target value

Note that these aren't all of the scenarios that the default data checks can catch.

In [None]:
# add a column with no variance in the data
X['no variance'] = [1 for _ in range(X.shape[0])]

# add an ID column
X['id'] = [i+1 for i in range(X.shape[0])]

# make row 1 all nan values
X.iloc[1] = [np.nan] * X.shape[1]

# add a column with 99.5% null values (only the last 5 of the 1000 rows hold a value)
X['mostly_nulls'] = [np.nan] * (X.shape[0] - 5) + list(range(5))

# make one of the target values null
y[990] = None

# since we changed the data, let's reinitialize the Woodwork typing information
X.ww.init()
y = ww.init_series(y)

# let's take another look at the new X data
X
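
The cell above hardcodes values that only fit a 1000-row dataset. As a rough sketch of the same idea on data of any length, the hypothetical helper below (`inject_issues` is not part of EvalML) applies the same kinds of corruption to copies of a toy DataFrame and Series, then confirms that each injected problem is present. It uses `mask` rather than direct assignment so that integer and boolean columns are upcast cleanly when nulls are introduced.

```python
import numpy as np
import pandas as pd


def inject_issues(X, y, n_non_null=5):
    """Return corrupted copies of X and y (hypothetical helper for this walkthrough)."""
    X, y = X.copy(), y.copy()
    n_rows = len(X)
    X["no variance"] = 1                 # constant column
    X["id"] = np.arange(1, n_rows + 1)   # one unique integer per row

    # null out the second row across every existing column
    row_mask = np.zeros(X.shape, dtype=bool)
    row_mask[1, :] = True
    X = X.mask(row_mask)

    # mostly-null column: only the last `n_non_null` rows hold a value
    X["mostly_nulls"] = [np.nan] * (n_rows - n_non_null) + list(range(n_non_null))

    # one missing target value, placed on the last row
    y = y.mask(np.arange(n_rows) == n_rows - 1)
    return X, y


X_toy = pd.DataFrame({"amount": np.linspace(1.0, 20.0, 20)})
y_toy = pd.Series([True, False] * 10)
X_bad, y_bad = inject_issues(X_toy, y_toy)

print(X_bad.isna().mean())          # null fraction per column
print(X_bad.iloc[1].isna().all())   # is the second row entirely null?
print(y_bad.isna().sum())           # number of missing targets
```

Because the helper works on copies, the original `X_toy` and `y_toy` stay untouched and can be reused for comparison.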

If we call `AutoMLSearch` on this data, the search fails because of the issues we just added to the input data.

In [None]:
automl = AutoMLSearch(X_train=X, y_train=y, problem_type='binary')
try:
    automl.search()
except ValueError as e:
    print("Search errored out! Message received is: {}".format(e))

We can use the `search` function provided in EvalML to determine what potential health issues our data has. Note that this `search` is a standalone public function available through `evalml.automl`, and is different from the `search` method of the `AutoMLSearch` class.

In [None]:
from evalml.automl import search
results = search(X, y, problem_type='binary')
results

The return value of the `search` function above is a tuple. The first element is the `AutoMLSearch` object if the search runs (`None` otherwise), and the second is a dictionary of potential warnings and errors that the default data checks find in the passed-in `X` and `y` data. We can look at the `actions` key of the dictionary to see how we can fix and clean the data.

In [None]:
results[1]['actions']

There are four actions we can take to clean the data. Three of them ask us to drop a row or column in the features, and one asks us to impute the target value.

In [None]:
# The first action is to drop the row at index 1, which we made entirely null
X.drop(1, axis=0, inplace=True)
# drop the matching entry from y so that features and targets stay aligned
y.drop(index=1, inplace=True)

print("The new length of X is {} and y is {}".format(len(X),len(y)))
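
Dropping a row from `X` without dropping it from `y` would leave the features and targets misaligned. A minimal sketch of a guard against that, using a hypothetical `drop_rows` helper on toy data, is to drop the same index labels from both objects and check that the resulting indexes match:

```python
import pandas as pd


def drop_rows(X, y, rows):
    """Drop the same index labels from features and target, returning new objects."""
    X_clean = X.drop(index=rows)
    y_clean = y.drop(index=rows)
    # the two indexes must stay identical so each target still matches its features
    assert X_clean.index.equals(y_clean.index)
    return X_clean, y_clean


X_toy = pd.DataFrame({"amount": [10.0, None, 7.5, 3.0]})
y_toy = pd.Series([True, False, False, True])
X_clean, y_clean = drop_rows(X_toy, y_toy, rows=[1])

print(len(X_clean), len(y_clean))
print(list(X_clean.index))
```

Note that the remaining index labels are not renumbered, so later label-based lookups still refer to the original positions.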

In [None]:
# Remove the 'mostly_nulls' column from X, which is the second action item
X.drop('mostly_nulls', axis=1, inplace=True)
X.head()

In [None]:
# Address the null in targets, which is the third action item
y.fillna(False, inplace=True)
y.isna().any()
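
The cell above fills the missing target with `False` directly. An alternative, shown here as a sketch on toy data, is to derive the fill value from the target itself by taking its most frequent class. Whether that matches the strategy the action suggests is an assumption and should be checked against the action's contents.

```python
import pandas as pd

# toy target with one missing label
y_toy = pd.Series([True, False, None, False, False], dtype="object")

# most frequent non-null class
most_frequent = y_toy.mode(dropna=True).iloc[0]

# fill the gap, then restore a plain boolean dtype
y_filled = y_toy.fillna(most_frequent).astype(bool)

print(most_frequent)
print(y_filled.isna().any())
```

For a heavily imbalanced target, imputing the majority class changes very few labels, but dropping the affected rows is also a reasonable choice when only a handful of targets are missing.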

In [None]:
# Finally, we can drop the 'no variance' column, which is the final action item
X.drop('no variance', axis=1, inplace=True)
X.head()

In [None]:
# let's reinitialize the dataframe using Woodwork and try the search again
X.ww.init()
results = search(X, y, problem_type='binary')

This time, `search` returns an `AutoMLSearch` object along with an empty dictionary of warnings and errors. We can use the `AutoMLSearch` object as needed, and we can confirm that the warnings dictionary is empty.

In [None]:
aml = results[0]
aml.rankings

In [None]:
warnings_dic = results[1]
warnings_dic

In the future, we aim to provide a helper function to allow users to quickly clean the data by taking in the list of actions and creating an appropriate pipeline of transformers to alter the data.
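
As a rough idea of what such a helper might look like, the sketch below applies a list of actions to toy data with plain pandas. The action format used here (dictionaries with `code` and `metadata` keys, and the codes `DROP_ROWS`, `DROP_COL`, and `IMPUTE_TARGET`) is hypothetical and chosen only for illustration, as is the `apply_actions` name. It is not EvalML's API, and it uses direct pandas calls rather than a pipeline of transformers.

```python
import pandas as pd


def apply_actions(X, y, actions):
    """Apply cleanup actions to copies of X and y (hypothetical action format)."""
    X, y = X.copy(), y.copy()
    for action in actions:
        code, meta = action["code"], action["metadata"]
        if code == "DROP_ROWS":
            # remove the same rows from features and target
            X = X.drop(index=meta["rows"])
            y = y.drop(index=meta["rows"])
        elif code == "DROP_COL":
            X = X.drop(columns=meta["column"])
        elif code == "IMPUTE_TARGET":
            # fill missing targets with the most frequent value
            y = y.fillna(y.mode(dropna=True).iloc[0])
        else:
            raise ValueError(f"Unknown action code: {code}")
    return X, y


X_toy = pd.DataFrame({
    "amount": [10.0, None, 7.5, 3.0],
    "constant": [1.0, None, 1.0, 1.0],
})
y_toy = pd.Series([0.0, 1.0, None, 0.0])
actions = [
    {"code": "DROP_ROWS", "metadata": {"rows": [1]}},
    {"code": "DROP_COL", "metadata": {"column": "constant"}},
    {"code": "IMPUTE_TARGET", "metadata": {}},
]
X_clean, y_clean = apply_actions(X_toy, y_toy, actions)

print(X_clean.shape)
print(y_clean.tolist())
```

Raising on unknown codes keeps the helper from silently skipping an action it does not understand.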