## Overview
This notebook demonstrates how to use Great Expectations with the Data Profiler. The expectation used in this example checks that the difference in selected metrics between two profile reports falls within a specified range.

In [None]:
import os

import pandas as pd
import numpy as np

import dataprofiler as dp
import great_expectations as ge
from capitalone_dataprofiler_expectations.expectations.expect_column_values_confidence_for_data_label_to_be_greater_than_or_equal_to_threshold import (
    ExpectColumnValuesConfidenceForDataLabelToBeGreaterThanOrEqualToThreshold,
)
from great_expectations.self_check.util import build_pandas_validator_with_data

# Create the Great Expectations data context once all imports are done.
context = ge.get_context()

Next, we load a CSV file of gun crime statistics into a pandas DataFrame.

In [None]:
guns_data_path = "../../dataprofiler/tests/data/csv/guns.csv"
df = pd.read_csv(guns_data_path)
df

For this expectation, we compare the min and max values of the `age` and `education` columns across two time frames in this dataset. First, we gather the years recorded in the dataset, ordered from most recent to oldest.

In [None]:
df.sort_values(by="year", axis=0, inplace=True)
years = df["year"].unique().tolist()
years.reverse()
years

Now that we have the years, we place the records for each year in their own DataFrame, dropping the `year` and `month` columns, so that each year can be profiled separately.

In [None]:
individual_dataframes = []
for year in years:
    # Keep only this year's rows, then drop the date columns before profiling.
    current_year_df = df.loc[df["year"] == year]
    current_year_df = current_year_df.drop(columns=["year", "month"])
    individual_dataframes.append(current_year_df)
individual_dataframes[0]
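
As an aside, the same per-year split can be sketched with `groupby`. This is a minimal sketch, not the notebook's method: `toy_df` is hypothetical stand-in data (only the column names mirror the dataset), and `frames_by_year` and `ordered_years` are illustrative names.

```python
import pandas as pd

# Hypothetical toy data standing in for guns.csv; the values are made up.
toy_df = pd.DataFrame(
    {
        "year": [2012, 2013, 2013, 2014],
        "month": [1, 2, 3, 4],
        "age": [34, 21, 60, 45],
    }
)

# Group once by year, then drop the date columns from each group.
frames_by_year = {
    year: group.drop(columns=["year", "month"])
    for year, group in toy_df.groupby("year")
}

# Newest year first, matching the ordering used in the notebook.
ordered_years = sorted(frames_by_year, reverse=True)
print(ordered_years)
print(frames_by_year[2013])
```

Keying the frames by year, rather than relying on list position, makes it harder to mix up which DataFrame belongs to which time frame.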

Next, we profile the first DataFrame, which corresponds to the year 2014, with the data labeler disabled. We save the profile to disk so the expectation can load it later, then look up the max value of one column in the compact report.

In [None]:
profiler_options = dp.ProfilerOptions()
profiler_options.set({"data_labeler.is_enabled": False})

profile = dp.Profiler(
    individual_dataframes[0],
    len(individual_dataframes[0]),
    options=profiler_options,
)
profile.save(filepath="previous_profile.pkl")
report = profile.report(report_options={"output_format": "compact"})
report["data_stats"][3]["statistics"]["max"]
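
Indexing `data_stats` by position depends on column order. The sketch below looks a column up by name instead, assuming each `data_stats` entry carries a `column_name` key. The helper `column_statistic` and the trimmed `toy_report` are hypothetical illustrations, not part of the Data Profiler API. The value 102 is the max age quoted later in this notebook.

```python
def column_statistic(report, column_name, statistic):
    """Return one statistic for the named column from a profiler-style report."""
    for column in report["data_stats"]:
        if column["column_name"] == column_name:
            return column["statistics"][statistic]
    raise KeyError(f"column {column_name!r} not found in report")


# Hypothetical, heavily trimmed report that only mimics the nesting used above.
toy_report = {
    "data_stats": [
        {"column_name": "sex", "statistics": {}},
        {"column_name": "age", "statistics": {"max": 102}},
    ]
}

print(column_statistic(toy_report, "age", "max"))
```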

In [None]:
validator = build_pandas_validator_with_data(individual_dataframes[1])
results = validator.expect_profile_numeric_columns_diff_between_inclusive_threshold_range(
    profile_path="previous_profile.pkl",
    limit_check_report_keys={
        "age": {
            "min": {"lower": 0, "upper": 10.0},
            "max": {"lower": 10, "upper": 30.0},
        },
        "education": {
            "min": {"lower": 0, "upper": 2},
            "max": {"lower": 0, "upper": 4},
        },
    },
)
results

The results above show the max of `age` listed in `unexpected_values`. The max age is 102 in the first profile and 107 in the second, so the difference between the two profiles is -5, which falls outside the allowed range of 10 to 30. All other checked statistics fall within the bounds specified above.
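
The arithmetic behind this failure can be sketched in plain Python. This simplifies the check and is not the expectation's actual implementation. The helper `diff_within_inclusive_range` is a hypothetical name, and the subtraction order (first profile minus second) is assumed from the -5 reported above.

```python
def diff_within_inclusive_range(first_value, second_value, lower, upper):
    """Return the difference and whether it lies in [lower, upper], bounds included."""
    diff = first_value - second_value
    return diff, lower <= diff <= upper


# Max age: 102 in the first profile, 107 in the second; allowed range is 10 to 30.
diff, ok = diff_within_inclusive_range(102, 107, lower=10, upper=30)
print(diff, ok)
```

Because the range is inclusive, a difference of exactly 10 or exactly 30 would pass this check.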