# Automated Data Quality Checks with the Data Profiler and Great Expectations

## Overview
An engineer can use the DataProfiler tool to compute column-level and table-level metadata metrics that describe their data, and can use the DataProfiler `diff()` functionality to compare two profiles with each other. This example, however, walks through a scenario in which an engineer wants to set up automated data quality checks for future data that will be appended to the original dataset.

In this example, an engineer is going to generate a DataProfiler report on an initial batch of data and then use the report to set up automated data quality checks. Then the engineer will run those checks against a future batch of data in order to identify any data quality concerns.
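The idea of deriving checks from an initial profile and applying them to a later batch can be illustrated without either library. The following is a minimal sketch, not how the DataProfiler or Great Expectations implement it. The `initial_report` dict is a hand-written stand-in that assumes the report exposes per-column `min` and `max` statistics under `data_stats`; the values are made up, and `build_range_checks` and `find_violations` are hypothetical helper names.

```python
# Hand-written stand-in for a small part of a profile report.
# Assumption: per-column statistics live under report["data_stats"].
# The numbers are illustrative, not taken from the taxi dataset.
initial_report = {
    "data_stats": [
        {"column_name": "passenger_count", "statistics": {"min": 1, "max": 6}},
        {"column_name": "trip_distance", "statistics": {"min": 0.0, "max": 30.5}},
    ]
}


def build_range_checks(report):
    """Map each column name to the (min, max) range seen in the initial batch."""
    checks = {}
    for column in report["data_stats"]:
        stats = column["statistics"]
        if "min" in stats and "max" in stats:
            checks[column["column_name"]] = (stats["min"], stats["max"])
    return checks


def find_violations(rows, checks):
    """Return (row_index, column, value) for every value outside its range."""
    violations = []
    for index, row in enumerate(rows):
        for column, (low, high) in checks.items():
            value = row.get(column)
            # A missing value is also reported as a violation.
            if value is None or not (low <= value <= high):
                violations.append((index, column, value))
    return violations


# A made-up "future batch" with one out-of-range passenger count.
future_batch = [
    {"passenger_count": 2, "trip_distance": 3.2},
    {"passenger_count": 9, "trip_distance": 1.1},
]

range_checks = build_range_checks(initial_report)
violations = find_violations(future_batch, range_checks)
print(violations)  # [(1, 'passenger_count', 9)]
```

In the rest of this example, the data assistant plays the role of `build_range_checks` and the checkpoint plays the role of `find_violations`.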

### Download the Data
Download the CSV file that contains the data used in this example from [here](https://github.com/great-expectations/gx_tutorials/blob/main/data/yellow_tripdata_sample_2019-01.csv), then save it to the `/examples/great_expectations/data` directory.

### Imports
First, import Great Expectations and the DataProfiler.

In [5]:
import os
from pathlib import Path

import great_expectations as ge
from great_expectations.cli.datasource import sanitize_yaml_and_save_datasource, check_if_datasource_name_exists
from great_expectations.core.batch import BatchRequest
from great_expectations.checkpoint import SimpleCheckpoint
from great_expectations.rule_based_profiler.data_assistant_result import (
    DataAssistantResult,
)
import dataprofiler as dp

from capitalone_dataprofiler_expectations.rule_based_profiler.data_assistant.data_profiler_structured_data_assistant import (
    DataProfilerStructuredDataAssistant,
)
from capitalone_dataprofiler_expectations.rule_based_profiler.data_assistant_result.data_profiler_structured_data_assistant_result import (
    DataProfilerStructuredDataAssistantResult,
)
import capitalone_dataprofiler_expectations.metrics.data_profiler_metrics

### Great Expectations Setup
Next, create and load the DataContext, build the batch request, retrieve the validator from the context, and create a checkpoint.

In [6]:
# Prepare Batch Request
context = ge.get_context()
datasource_config = f"""
    name: taxi_multi_batch_datasource
    class_name: Datasource
    module_name: great_expectations.datasource
    execution_engine:
        module_name: great_expectations.execution_engine
        class_name: PandasExecutionEngine
    data_connectors:
        inferred_data_connector_all_years:
            class_name: InferredAssetFilesystemDataConnector
            base_directory: ../data
            default_regex:
                group_names:
                    - data_asset_name
                    - year
                    - month
                pattern: (yellow_tripdata_sample)_(\\d.*)-(\\d.*)\\.csv
"""
context.test_yaml_config(yaml_config=datasource_config)
sanitize_yaml_and_save_datasource(context, datasource_config, overwrite_existing=False)
batch_request = {
    "datasource_name": "taxi_multi_batch_datasource",
    "data_connector_name": "inferred_data_connector_all_years",
    "data_asset_name": "yellow_tripdata_sample",
    "data_connector_query": {"index": 0},
}

# Prepare a new expectation suite
context = ge.data_context.DataContext()
expectation_suite_name = "yellow_tripdata_sample"
expectation_suite = context.create_expectation_suite(
    expectation_suite_name=expectation_suite_name, overwrite_existing=True
)

validator = context.get_validator(
    batch_request=BatchRequest(**batch_request),
    expectation_suite_name=expectation_suite_name,
)

checkpoint_config = {
    "class_name": "SimpleCheckpoint",
    "validations": [
        {
            "batch_request": batch_request,
            "expectation_suite_name": expectation_suite_name,
        }
    ],
}

checkpoint = SimpleCheckpoint(
    f"{validator.active_batch_definition.data_asset_name}_{expectation_suite_name}",
    context,
    **checkpoint_config,
)

Attempting to instantiate class from config...
	Instantiating as a Datasource, since class_name is Datasource
	Successfully instantiated Datasource


ExecutionEngine class name: PandasExecutionEngine
Data Connectors:
	inferred_data_connector_all_years : InferredAssetFilesystemDataConnector

	Available data_asset_names (1 of 1):
		yellow_tripdata_sample (1 of 1): ['yellow_tripdata_sample_2019-01.csv']

	Unmatched data_references (0 of 0):[]
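The `default_regex` in the datasource config pairs each capture group in `pattern` with an entry in `group_names`. The sketch below reproduces that pairing with Python's `re` module to show why the sample file is listed under the `yellow_tripdata_sample` asset. It uses `re.fullmatch` as an assumption; how the data connector applies the pattern internally is not shown in this example, and `parse_file_name` is a hypothetical helper.

```python
import re

# The pattern from the datasource config. In the config it is written with
# doubled backslashes ("\\d") because it sits inside a regular Python string.
pattern = re.compile(r"(yellow_tripdata_sample)_(\d.*)-(\d.*)\.csv")
group_names = ["data_asset_name", "year", "month"]


def parse_file_name(file_name):
    """Pair each capture group with its group name, or return None on no match."""
    match = pattern.fullmatch(file_name)
    if match is None:
        return None
    return dict(zip(group_names, match.groups()))


parsed = parse_file_name("yellow_tripdata_sample_2019-01.csv")
print(parsed)
# {'data_asset_name': 'yellow_tripdata_sample', 'year': '2019', 'month': '01'}

# A file that does not follow the naming scheme is not matched.
print(parse_file_name("green_tripdata_2019-01.csv"))  # None
```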



### Run the data assistant and save the expectation suite
First, profile the initial batch with the DataProfiler and save the profile to disk. Then pass the path of the saved profile to the DataProfiler data assistant to generate the expectation suite.

In [7]:
# Path to the raw CSV data and path where the pickled profile is written
profile_path: str = os.path.join(os.getcwd(), "data/yellow_tripdata_sample_2019-01.csv")
saved_profile_path: str = os.path.join(os.getcwd(), "data/yellow_tripdata_sample_2019-01.pkl")
data = dp.Data(profile_path)
# Disable the data labeler and pass the options to the profiler so they take effect
profiler_options = dp.ProfilerOptions()
profiler_options.set({"data_labeler.is_enabled": False})
profiler = dp.Profiler(data, options=profiler_options)
profiler.save(filepath=saved_profile_path)
report = profiler.report()

INFO:DataProfiler.profilers.profile_builder: Finding the Null values in the columns...  (with 15 processes)

100%|██████████| 18/18 [00:07<00:00,  2.53it/s]

INFO:DataProfiler.profilers.profile_builder: Calculating the statistics...  (with 4 processes)

100%|██████████| 18/18 [00:06<00:00,  2.71it/s]


In [8]:
exclude_column_names = [
    # "vendor_id",
    "pickup_datetime",
    "dropoff_datetime",
    # "passenger_count",
    # "trip_distance",
    # "rate_code_id",
    "store_and_fwd_flag",
    # "pickup_location_id",
    # "dropoff_location_id",
    # "payment_type",
    # "fare_amount",
    # "extra",
    # "mta_tax",
    # "tip_amount",
    # "tolls_amount",
    # "improvement_surcharge",
    # "total_amount",
    "congestion_surcharge",
]


result: DataAssistantResult = context.assistants.data_profiler.run(
    batch_request=batch_request,
    exclude_column_names=exclude_column_names,
    numeric_rule={
        "profile_path": saved_profile_path,
    },
)




Generating Expectations:   0%|          | 0/1 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/2 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

Profiling Dataset:         0%|          | 0/14 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/4 [00:00<?, ?it/s]


### Run validation against the expectation suite
Next, attach the generated expectation suite to the validator and save it, then run the checkpoint.

In [None]:
validator.expectation_suite = result.get_expectation_suite(
    expectation_suite_name=expectation_suite_name
)

validator.save_expectation_suite(discard_failed_expectations=False)

checkpoint_result = checkpoint.run()

Calculating Metrics:   0%|          | 0/59 [00:00<?, ?it/s]
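Before opening the data docs, it can help to list which expectations failed. The sketch below works on a hand-written dict that assumes the shape of a single serialized validation result (a top-level `success` flag and a `results` list whose entries carry `success` and `expectation_config`). The expectation types, column names, and the `summarize_failures` helper are illustrative assumptions, not output from the run above.

```python
# Hand-written stand-in for one validation result serialized to a dict.
# Assumption: each entry in "results" has "success" and "expectation_config".
validation_result = {
    "success": False,
    "results": [
        {
            "success": True,
            "expectation_config": {
                "expectation_type": "expect_column_min_to_be_between",
                "kwargs": {"column": "passenger_count"},
            },
        },
        {
            "success": False,
            "expectation_config": {
                "expectation_type": "expect_column_max_to_be_between",
                "kwargs": {"column": "fare_amount"},
            },
        },
    ],
}


def summarize_failures(result):
    """Return (column, expectation_type) for every failed expectation."""
    failures = []
    for item in result["results"]:
        if not item["success"]:
            config = item["expectation_config"]
            # Table-level expectations have no "column" kwarg, hence .get().
            failures.append((config["kwargs"].get("column"), config["expectation_type"]))
    return failures


failed = summarize_failures(validation_result)
for column, expectation_type in failed:
    print(f"FAILED {expectation_type} on column {column}")
# FAILED expect_column_max_to_be_between on column fare_amount
```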

### Build and display the data docs
The checkpoint results from above are used to populate the data docs, which display the expectation suite validation results against the new data.

In [None]:
context.build_data_docs()

validation_result_identifier = checkpoint_result.list_validation_result_identifiers()[0]
context.open_data_docs(resource_identifier=validation_result_identifier)