# ZenML Data Validation With Evidently

## Purpose

Data profiling and validation is the process of examining and analyzing data to understand its characteristics, patterns, and quality. The goal of this process is to gain insight into the data, identify potential issues or errors, and ensure that the data is fit for its intended use.

Evidently is a Python package that provides tools for data profiling and validation. Evidently makes it easy to generate reports on your data, which can provide insights into its distribution, missing values, correlation, and other characteristics. These reports can be visualized and examined to better understand the data and identify any potential issues or errors.

Data validation involves testing the quality and consistency of the data. This can be done using a variety of techniques, such as checking for missing values, duplicate records, and outliers, as well as testing the consistency and accuracy of the data. Evidently provides a suite of tests that can be used to evaluate the quality of the data, and provides scores and metrics for each test, as well as an overall data quality score.
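
To make two of the checks above concrete, here is a minimal, library-free sketch that counts missing values and exact duplicate rows. The toy records and the helper names `count_missing` and `count_duplicates` are hypothetical and are not part of Evidently or ZenML. Evidently's own tests cover far more than this.

```python
from collections import Counter

# Hypothetical toy records; None marks a missing value.
rows = [
    {"age": 34, "rating": 5},
    {"age": None, "rating": 4},
    {"age": 34, "rating": 5},  # exact duplicate of the first row
    {"age": 29, "rating": None},
]


def count_missing(records):
    """Count None values per column."""
    missing = Counter()
    for record in records:
        for column, value in record.items():
            if value is None:
                missing[column] += 1
    return dict(missing)


def count_duplicates(records):
    """Count rows that repeat an earlier row exactly."""
    seen = Counter(tuple(sorted(record.items())) for record in records)
    return sum(count - 1 for count in seen.values())


missing_per_column = count_missing(rows)
duplicate_rows = count_duplicates(rows)
```

Checks of this kind are what a validation library automates and reports on across a whole dataset.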

ZenML implements some standard steps that you can use to get reports or test your
data with Evidently for quality and other purposes. These steps are:

* `EvidentlyReportStep` and `EvidentlySingleDatasetReportStep`: These steps generate
a report for one or two given datasets. Similar to how you configure an Evidently
Report, you can configure a list of metrics, metric presets or metric generators
for the step as parameters. The full list of metrics can be found
[here](https://docs.evidentlyai.com/reference/all-metrics/).

* `EvidentlyTestStep` and `EvidentlySingleDatasetTestStep`: These steps test one
or two given datasets using various Evidently tests. Similar to how you configure
an Evidently TestSuite, you can configure a list of tests, test presets or
test generators for the step as parameters. The full list of tests can be found
[here](https://docs.evidentlyai.com/reference/all-tests/).

If you want to run this notebook in an interactive environment, feel free to run
it in a [Google Colab](https://colab.research.google.com/github/zenml-io/zenml/blob/main/examples/evidently_drift_detection/evidently.ipynb)
or view it on [GitHub](https://github.com/zenml-io/zenml/tree/main/examples/evidently_drift_detection) directly.

## Install libraries

In [None]:
# Install the ZenML CLI tool, Evidently and scikit-learn

!pip install zenml 
!zenml integration install evidently sklearn -y
!pip install pyarrow

Once the installation is completed, you can go ahead and create a ZenML repository for this project by running:

In [None]:
# Initialize a ZenML repository
!zenml init

Now, the setup is completed. For the next steps, just make sure that you are executing the code within your ZenML repository.

## Set up the Stack

You need an Evidently Data Validator component in your stack to be able to use Evidently data profiling in your ZenML pipelines. Creating such a stack is easily accomplished:

In [None]:
!zenml data-validator register evidently -f evidently
!zenml stack register evidently_stack -o default -a default -dv evidently --set

## Import relevant packages

We will use pipelines and steps to load, split, and profile our data.

In [None]:
import pandas as pd
from rich import print
from sklearn import datasets

from zenml.pipelines import pipeline
from zenml.steps import Output, step

## Define ZenML Steps

In the code that follows, we define the various steps of our pipeline. Each custom step is decorated with `@step`, the main abstraction currently available for creating pipeline steps. The exception is the built-in Evidently report step, which ships with ZenML.

The first step is a `data_loader` step that downloads the OpenML women's e-commerce clothing reviews dataset and returns it as a pandas DataFrame. We'll use this as the source for the reference and comparison datasets in our example.

In [None]:
@step
def data_loader() -> pd.DataFrame:
    """Load the OpenML women's e-commerce clothing reviews dataset."""
    reviews_data = datasets.fetch_openml(
        name="Womens-E-Commerce-Clothing-Reviews", version=2, as_frame="auto"
    )
    reviews = reviews_data.frame
    return reviews


We then add a `data_splitter` step that takes the input dataset and splits it into two subsets. Later on, in the pipeline, we'll compare these datasets against each other using Evidently and generate a data report.

In [None]:
@step
def data_splitter(
    reviews: pd.DataFrame,
) -> Output(reference_dataset=pd.DataFrame, comparison_dataset=pd.DataFrame):
    """Splits the dataset into two subsets, the reference dataset and the
    comparison dataset.
    """
    ref_df = reviews[reviews.Rating > 3].sample(
        n=5000, replace=True, ignore_index=True, random_state=42
    )
    comp_df = reviews[reviews.Rating < 3].sample(
        n=5000, replace=True, ignore_index=True, random_state=42
    )
    return ref_df, comp_df
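
The splitting logic can be illustrated without pandas. The sketch below is a stand-in under stated assumptions: the toy `ratings` list and the helper `split_by_rating` are hypothetical, and `random.choices` replaces `DataFrame.sample(replace=True)`, so it will not reproduce the same rows as the step. It shows two properties of the step: rows with a rating of exactly 3 land in neither subset, and sampling with replacement lets you draw more rows than a subset contains.

```python
import random

# Hypothetical toy ratings standing in for the `Rating` column.
ratings = [1, 2, 2, 3, 3, 4, 4, 5, 5, 5]


def split_by_rating(values, n, seed=42):
    """Sample n high-rated and n low-rated values, with replacement."""
    rng = random.Random(seed)
    high = [v for v in values if v > 3]
    low = [v for v in values if v < 3]
    # random.choices samples with replacement, so n may exceed len(low).
    return rng.choices(high, k=n), rng.choices(low, k=n)


reference, comparison = split_by_rating(ratings, n=8)
```

Here only three low ratings exist, yet eight are drawn, which mirrors why the step passes `replace=True` when asking for a fixed 5000 rows per subset.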

Next, we add an Evidently step that takes in the reference dataset and the comparison dataset and generates a data report. This step is already defined as part of the ZenML library, so we only need to add it to our pipeline with a custom configuration. Under the hood, ZenML uses Evidently in the implementation of this step to generate Evidently reports, and Materializers to automatically persist them as Artifacts in the Artifact Store.

In [None]:
from zenml.integrations.evidently.metrics import EvidentlyMetricConfig
from zenml.integrations.evidently.steps import (
    EvidentlyColumnMapping,
    EvidentlyReportParameters,
    evidently_report_step,
)

text_data_report = evidently_report_step(
    step_name="text_data_report",
    params=EvidentlyReportParameters(
        column_mapping=EvidentlyColumnMapping(
            target="Rating",
            numerical_features=["Age", "Positive_Feedback_Count"],
            categorical_features=[
                "Division_Name",
                "Department_Name",
                "Class_Name",
            ],
            text_features=["Review_Text", "Title"],
        ),
        metrics=[
            EvidentlyMetricConfig.metric("DataQualityPreset"),
            EvidentlyMetricConfig.metric(
                "TextOverviewPreset", column_name="Review_Text"
            ),
            EvidentlyMetricConfig.metric_generator(
                "ColumnRegExpMetric",
                columns=["Review_Text", "Title"],
                reg_exp=r"[A-Z][A-Za-z0-9 ]*",
            ),
        ],
        # We need to download the NLTK data for the TextOverviewPreset
        download_nltk_data=True,
    ),
)
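
To get a feel for the regular expression passed to `ColumnRegExpMetric`, the sketch below applies the same pattern to a few hypothetical strings using Python's `re` module. It shows both prefix matching (`re.match`) and whole-string matching (`re.fullmatch`). Which of the two Evidently applies internally is not verified here, so treat this purely as an illustration of the pattern itself.

```python
import re

# Same pattern as passed to ColumnRegExpMetric in the step configuration.
pattern = re.compile(r"[A-Z][A-Za-z0-9 ]*")

# Hypothetical sample values, not taken from the dataset.
samples = ["Love this dress", "great fit", "Runs small!", ""]

# Prefix matching: the string must start with an uppercase letter.
prefix_matches = [bool(pattern.match(s)) for s in samples]

# Whole-string matching: additionally, only letters, digits and spaces
# may follow, so punctuation such as "!" causes a mismatch.
full_matches = [bool(pattern.fullmatch(s)) for s in samples]
```

Under prefix matching, "Runs small!" counts as a match. Under whole-string matching it does not, because of the trailing exclamation mark.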

This next step serves as an example showing how the Evidently report returned as output from the previous step can be used in other steps in the pipeline to analyze the results in detail and take different actions depending on them.

In [None]:
import json

@step
def text_analyzer(
    report: str,
) -> Output(ref_missing_values=int, comp_missing_values=int):
    """Analyze the Evidently text Report and return the number of missing
    values in the reference and comparison datasets.
    """
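    result = json.loads(report)["metrics"][0]["result"]
    # Evidently labels the comparison dataset "current"; return the values in
    # the declared output order (reference first, then comparison).
    return (
        result["reference"]["number_of_missing_values"],
        result["current"]["number_of_missing_values"],
    )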
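
The step relies on the JSON report having a particular shape: the first entry in `metrics` carries a `result` with `reference` and `current` sections, each holding `number_of_missing_values`. The sketch below parses a hypothetical, heavily trimmed payload that contains only those keys. The numbers and the helper name `missing_values_by_section` are made up for illustration, and a real Evidently report contains many more fields.

```python
import json

# Hypothetical payload: only the keys the step reads, with made-up counts.
sample_report = json.dumps(
    {
        "metrics": [
            {
                "result": {
                    "reference": {"number_of_missing_values": 12},
                    "current": {"number_of_missing_values": 30},
                }
            }
        ]
    }
)


def missing_values_by_section(report_json):
    """Return the missing-value count for each section of the first metric."""
    result = json.loads(report_json)["metrics"][0]["result"]
    return {
        section: result[section]["number_of_missing_values"]
        for section in ("reference", "current")
    }


counts = missing_values_by_section(sample_report)
```

Keying the counts by section name makes it harder to mix up the two datasets than returning a bare tuple.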


## Define ZenML Pipeline

A pipeline is defined with the `@pipeline` decorator. This defines the various steps of the pipeline and specifies the dependencies between the steps, thereby determining the order in which they will be run.

Note how the ZenML Evidently step returns two artifacts: the Evidently Report in both JSON and HTML formats. We only use the JSON report in the pipeline, while the HTML report will be extracted and rendered separately in the post execution workflow, via the ZenML Evidently visualizer.

In [None]:
@pipeline(enable_cache=False)
def text_data_report_test_pipeline(
    data_loader,
    data_splitter,
    text_report,
    text_analyzer,
):
    """Links all the steps together in a pipeline."""
    data = data_loader()
    reference_dataset, comparison_dataset = data_splitter(data)
    report, _ = text_report(
        reference_dataset=reference_dataset,
        comparison_dataset=comparison_dataset,
    )
    text_analyzer(report)


## Run the pipeline

Running the pipeline is as simple as calling the `run()` method on an instance of the defined pipeline.

In [None]:
pipeline_instance = text_data_report_test_pipeline(
    data_loader=data_loader(),
    data_splitter=data_splitter(),
    text_report=text_data_report,
    text_analyzer=text_analyzer(),
)
pipeline_instance.run()

## Post execution workflow

As mentioned above, the Materializer takes care of persisting the Evidently HTML reports in the Artifact Store. These artifacts can be extracted and visualized after the pipeline run is complete.

In [None]:
from zenml.integrations.evidently.visualizers import EvidentlyVisualizer

last_run = pipeline_instance.get_runs()[0]
text_analysis_step = last_run.get_step(step="text_analyzer")

print(
    "Reference missing values: ",
    text_analysis_step.outputs["ref_missing_values"].read(),
)
print(
    "Comparison missing values: ",
    text_analysis_step.outputs["comp_missing_values"].read(),
)

The ZenML Evidently visualizer takes in a ZenML pipeline step run and renders the Evidently report that was generated during its execution.

In [None]:
text_report_step = last_run.get_step(step="text_report")

EvidentlyVisualizer().visualize(text_report_step)

## Congratulations!

You have successfully used ZenML and Evidently to generate and visualize data reports.

For more ZenML features and use cases, check out some of the other ZenML examples. You should also take a look at our [docs](https://docs.zenml.io/) or our [GitHub](https://github.com/zenml-io/zenml) repo, or even better, join us on our [Slack channel](https://zenml.io/slack-invite).

Cheers!