## Overview

- **Expectation Used:** [expect_profile_numeric_columns_diff_between_inclusive_threshold_range](https://github.com/great-expectations/great_expectations/blob/develop/contrib/capitalone_dataprofiler_expectations/capitalone_dataprofiler_expectations/expectations/expect_profile_numeric_columns_diff_between_inclusive_threshold_range.py)
- **Expectation Description:** This expectation checks the user-specified difference between user-specified metrics from an original report generated by the Data Profiler and a new report. The two reports are generated on two different sets of data with matching schemas.
- **Use Case:** Imagine a user whose dataset records the daily sales count of a product sold by their business. Assume that monthly sales records are appended to the dataset at the end of every month. With this expectation, the user can profile the new data with the Data Profiler before it is appended and check whether the `median` metric in the report falls outside the range they expect. This can help them decide whether to scale production up or down based on sales.
- **Example Details:**
This notebook demonstrates how to use Great Expectations with the Data Profiler. The expectation used in this example expects the difference in metrics between two reports to fall within the specified range.

This expectation is useful for users who want to monitor how report metrics change between older and newer data. With thresholds configured like those in the validator cell below, the user gets an expectation violation when the metrics of the new data differ too much from those of the old data.
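The inclusive range check described above can be sketched in plain Python. This is a minimal illustration and not the library's implementation: `find_violations`, the `sales_count` column, and the numbers are all hypothetical, and the metric differences are assumed to have been computed already.

```python
def find_violations(diffs, limits):
    """Collect every metric whose difference falls outside its inclusive range."""
    violations = {}
    for column, metrics in limits.items():
        for metric, bounds in metrics.items():
            value = diffs[column][metric]
            # Inclusive on both ends: a value equal to a bound passes.
            if not (bounds["lower"] <= value <= bounds["upper"]):
                violations.setdefault(column, {})[metric] = {
                    "lower_bound": bounds["lower"],
                    "upper_bound": bounds["upper"],
                    "value_found": value,
                }
    return violations


# Hypothetical thresholds and metric differences, for illustration only.
limits = {"sales_count": {"median": {"lower": 10, "upper": 50}}}
in_range = find_violations({"sales_count": {"median": 10}}, limits)
out_of_range = find_violations({"sales_count": {"median": 75}}, limits)
```

Here `in_range` is empty because a difference equal to the lower bound still passes, while `out_of_range` reports the offending metric together with its bounds.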

In [None]:
import os

import pandas as pd
import numpy as np

# Great Expectations imports
import great_expectations as ge
# Importing the expectation class makes it available on the validator.
from capitalone_dataprofiler_expectations.expectations.expect_profile_numeric_columns_diff_between_inclusive_threshold_range import (
    ExpectProfileNumericColumnsDiffBetweenInclusiveThresholdRange,
)
from great_expectations.self_check.util import build_pandas_validator_with_data

# Data Profiler import
import dataprofiler as dp

### Setup
Below we import a dataset from the Data Profiler test suite. This CSV file holds information on the salaries of individuals in the data science field from all over the world.

In [None]:
context = ge.get_context()

In [None]:
data_path = "../../dataprofiler/tests/data/csv/ds_salaries.csv"
df = pd.read_csv(data_path)
df.head()

For this expectation, we will compare the median `salary_in_usd` between `work_year` 2020 and 2022. First, we gather the distinct years recorded in this dataset.

In [None]:
df.sort_values(by="work_year", axis=0, inplace=True)
years = df["work_year"].unique().tolist()
years

Now that we have the years, we will capture all records from each year in their own dataframes so that we can process them separately.

In [None]:
individual_dataframes = []
for year in years:
    current_year_df = df.loc[df["work_year"]==year]
    current_year_df = current_year_df.drop("work_year", axis=1)
    individual_dataframes.append(current_year_df)
individual_dataframes[0]

Now we will create a report on the first individual dataframe, which corresponds to the year 2020, and save the profile to disk so the expectation can load it later. Then we will output the median `salary_in_usd` from this dataframe.

In [None]:
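# Disable the data labeler; only the numeric statistics are needed here.
profiler_options = dp.ProfilerOptions()
profiler_options.set({"data_labeler.is_enabled": False})

# Profile every row of the first dataframe and save the profile for later comparison.
profile = dp.Profiler(
    individual_dataframes[0],
    len(individual_dataframes[0]),
    options=profiler_options,
)
profile.save(filepath="previous_profile.pkl")
report = profile.report(report_options={"output_format": "compact"})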

In [None]:
report['data_stats'][6]['statistics']['median']
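The positional index `6` above depends on the column order of the dataframe. As a hedged alternative, the sketch below looks the column up by name instead. It assumes each entry of `report["data_stats"]` carries a `column_name` key next to `statistics`; the helper `get_column_statistic` and the toy report are hypothetical and only mimic that shape.

```python
def get_column_statistic(report, column_name, statistic):
    """Return one statistic for the named column of a profiler-style report."""
    for column in report["data_stats"]:
        if column["column_name"] == column_name:
            return column["statistics"][statistic]
    raise KeyError(f"column {column_name!r} not found in report")


# Toy report with made-up numbers, used only to exercise the helper.
toy_report = {
    "data_stats": [
        {"column_name": "salary", "statistics": {"median": 1.0}},
        {"column_name": "salary_in_usd", "statistics": {"median": 2.0}},
    ]
}
toy_median = get_column_statistic(toy_report, "salary_in_usd", "median")
```

With the real report from this notebook, the equivalent call would be `get_column_statistic(report, "salary_in_usd", "median")`, provided the report has the assumed structure.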

With the validator below, we set up an expectation that the difference between the `median` `salary_in_usd` values is at least 8000 and no more than 30000. In other words, this validator checks whether the `median` `salary_in_usd` has increased by 8000 to 30000 from 2020 to 2022.

In [20]:
validator = build_pandas_validator_with_data(individual_dataframes[1])
results = validator.expect_profile_numeric_columns_diff_between_inclusive_threshold_range(
    profile_path="previous_profile.pkl",
    limit_check_report_keys={
        "salary_in_usd": {
            "median": {"lower": 8000, "upper": 30000},
        },
    },
)


### Results
From the output below, the data owner can see that the expectation failed with one unexpected value. The difference between the two profiles' `median` `salary_in_usd` is slightly less than 7000, which is below the lower bound of 8000.

In [21]:
results

{
  "success": false,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "unexpected_values": {
      "salary_in_usd": {
        "median": {
          "lower_bound": 8000,
          "upper_bound": 30000,
          "value_found": 6968.818625
        }
      }
    }
  },
  "meta": {}
}
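To turn a result like the one above into readable messages, one option is to walk the `unexpected_values` mapping. The sketch below assumes the result has been converted to a plain dictionary with the structure printed above; `describe_failures` is a hypothetical helper, not part of Great Expectations.

```python
def describe_failures(result_dict):
    """Build one message per metric reported under result.unexpected_values."""
    messages = []
    unexpected = result_dict.get("result", {}).get("unexpected_values", {})
    for column, metrics in unexpected.items():
        for metric, details in metrics.items():
            value = details["value_found"]
            lower = details["lower_bound"]
            upper = details["upper_bound"]
            if value < lower:
                note = f"{lower - value:.2f} below the lower bound of {lower}"
            else:
                note = f"{value - upper:.2f} above the upper bound of {upper}"
            messages.append(f"{column}.{metric}: {value:.2f} is {note}")
    return messages


# Plain-dict copy of the result shown above.
example_result = {
    "success": False,
    "result": {
        "unexpected_values": {
            "salary_in_usd": {
                "median": {
                    "lower_bound": 8000,
                    "upper_bound": 30000,
                    "value_found": 6968.818625,
                }
            }
        }
    },
}
failure_messages = describe_failures(example_result)
```

A passing result has no entries to report, so the helper returns an empty list in that case.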