# Historical Profiler x Great Expectations Proof of Concept

## Overview:

This notebook demonstrates how to generate meaningful thresholds for new data from a history of older snapshots of the same dataset. Specifically, we measure the changes between consecutive snapshots in the dataset.

## Idea:
The `HistoricalProfiler` provides methods for generating reports that detail how a dataset changes over time. These reports can be used in conjunction with Capital One's DataProfiler Expectations package in `Great Expectations` to generate meaningful thresholds that *new* snapshots of a historical dataset could be expected to be bounded by. This notebook covers the following:
- Installing all packages needed to use the HistoricalProfiler with Great Expectations
- Instantiating a HistoricalProfiler from a list of `Profiler` objects taken on different snapshots of the same dataset
- Importing and using custom expectations from the `capitalone_dataprofiler_expectations` package
- Generating meaningful threshold bounds that new data could be expected to reside within, which can be used to automate data quality checks

## Instructions

### Offline work
Begin by creating a new directory for your project. We advise using a virtual environment which you can make by executing the following commands:
 - Initialize your virtual environment: `python3 -m venv venv`
 - Activate the virtual environment: `source venv/bin/activate`

Now install the following packages:
- DataProfiler: `pip install DataProfiler`
- Great Expectations: `pip install great_expectations`
- Capital One's DataProfiler Expectations: `pip install capitalone_dataprofiler_expectations`
    - NOTE: this package is currently not published. You can download the package [here](https://github.com/great-expectations/great_expectations/tree/develop/contrib/capitalone_dataprofiler_expectations), and install it using: `pip install -e <path_to_downloaded_package>`

Initialize Great Expectations:
- Run the following command to initialize Great Expectations: `great_expectations init`
    - NOTE: This step is crucial in order to generate the `DataContext` that we will obtain later

### Online Work (python)

#### Imports
First, we will need to import the packages we just installed. In this demo, we will be using `pandas` as well to load and preprocess our data. 

Special Instructions:
- Before you import the specific custom expectation you'd like to use from Capital One's DataProfiler Expectation package, first you must import `great_expectations` and load your data context using: `context = ge.get_context()`.
- Then import the custom expectation of your choosing and register it properly with great_expectations by: `from capitalone_dataprofiler_expectations.expectations.<snake_case_expectation_name> import <CamelCaseExpectationName>`

This can be seen in the code cell below.

In [None]:
import dataprofiler as dp
import pandas as pd
import numpy as np
import great_expectations as ge
context = ge.get_context()
from capitalone_dataprofiler_expectations.expectations.expect_profile_numeric_columns_diff_between_inclusive_threshold_range import ExpectProfileNumericColumnsDiffBetweenInclusiveThresholdRange
from great_expectations.self_check.util import build_pandas_validator_with_data
import os

#### Preprocessing Data for Historical Profiler

In this demo, we will be using a dataset that contains records of gun reports. Each record has a year between 2012 and 2014 and a month. We will split this dataset into 36 individual datasets, one per month, in order (i.e. Jan. 2012 - Dec. 2014).

In [None]:
guns_data_path = "../dataprofiler/tests/data/csv/guns.csv"
df = pd.read_csv(guns_data_path)
df

Now, we will split the dataset by year, then by month. You can see from the output of the cell below that the years and months are sorted in reverse order. We do this because the historical profiler *assumes* that the list of input profiles is sorted from most to least recent.
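
The ordering requirement can be illustrated on toy data. The following is a minimal sketch in plain Python, not the notebook's pandas code; the `records` list and its values are hypothetical and are not taken from `guns.csv`. It groups rows by a `(year, month)` key and orders the resulting snapshots from most to least recent.

```python
from collections import defaultdict

# Hypothetical toy records: (year, month, value). Not taken from guns.csv.
records = [
    (2012, 1, "a"),
    (2013, 2, "b"),
    (2012, 12, "c"),
    (2013, 2, "d"),
]

# Group rows by their (year, month) snapshot key.
snapshots = defaultdict(list)
for year, month, value in records:
    snapshots[(year, month)].append(value)

# Tuples compare by year first, then by month, so a reverse sort
# orders the snapshot keys from most to least recent.
ordered_keys = sorted(snapshots, reverse=True)
ordered_snapshots = [snapshots[key] for key in ordered_keys]
```

This sketch assumes months are stored as integers, so that they sort chronologically within a year.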

In [None]:
df.sort_values(by="year", axis=0, inplace=True)
years = df["year"].unique().tolist()
years.reverse()
df.sort_values(by="month", axis=0, inplace=True)
months = df["month"].unique().tolist()
months.reverse()

years, months

In [None]:
individual_dataframes = []
for year in years:
    current_year_df = df.loc[df["year"]==year]
    current_year_df = current_year_df.drop("year", axis=1)
    for month in months:
        current_month_df = current_year_df.loc[current_year_df["month"]==month]
        current_month_df = current_month_df.drop("month", axis=1)
        individual_dataframes.append(current_month_df)
individual_dataframes[0]

Now, we will profile each dataset. This operation can become very time-consuming, so the data labeler has been disabled in the profiler options. We also suggest saving the profiles of previous snapshots and loading them when needed, as this *drastically* reduces the time it takes to get the profiles necessary to make a `HistoricalProfiler`.

In [None]:
profiler_options = dp.ProfilerOptions()
profiler_options.set({"data_labeler.is_enabled": False})
profiler_objs = []
for data in individual_dataframes:
    profiler_objs.append(dp.Profiler(data, len(data), options=profiler_options))
len(profiler_objs)

The below function is used for this demo only. It will initialize the HistoricalProfiler and return all the data we need for use with our custom expectation. 

In a real, ideal production environment, you would already have your HistoricalProfiler. Then, this expectation would primarily be used in the case that you have a *new* snapshot of the same dataset that you have yet to add to the HistoricalProfiler.

The custom expectation would be used to check that the difference between the new dataset and the most recent previous dataset is within expected bounds. In other words, we expect that the change from the last snapshot to this snapshot is no larger in magnitude than any change seen historically.
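
The idea can be sketched for a single statistic in plain Python. This is a conceptual illustration only, not the `HistoricalProfiler` implementation; the function names `delta_bounds` and `is_within_history` and the sample values are hypothetical. It assumes each delta is computed as newer minus older and that the history is ordered from most to least recent.

```python
def delta_bounds(history):
    """Return (min, max) of the deltas between consecutive snapshots.

    `history` holds one statistic per snapshot, most recent first,
    and each delta is computed as newer minus older.
    """
    deltas = [newer - older for newer, older in zip(history, history[1:])]
    return min(deltas), max(deltas)


def is_within_history(new_value, history):
    """Check that the newest delta lies within the historical delta range."""
    lower, upper = delta_bounds(history)
    new_delta = new_value - history[0]
    return lower <= new_delta <= upper


# Hypothetical values of one statistic (e.g. a column mean), most recent first.
mean_history = [10.0, 9.0, 9.5, 8.0]
bounds = delta_bounds(mean_history)
```

With these sample values, a new snapshot whose statistic moves by an amount already seen in the history passes the check, while a larger jump or drop does not.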

In [None]:
#Initialize Historical Profiler & Get Args for Expectation
def get_historical_profiler_and_data_for_demo(profilers, dataframes, index):
    # Profiles from `index` onward (most recent first) form the history.
    historical_profile = dp.HistoricalProfiler(profilers[index:])
    # The snapshot just before `index` plays the role of the "new" data.
    new_dataframe = dataframes[index - 1]
    # Save the most recent historical profile next to this notebook.
    dir_path = os.path.dirname(os.path.abspath("./historical_profile_great_expectations_proof_of_concept.ipynb"))
    profile_path = os.path.join(dir_path, "previous_profile.pkl")
    profilers[index].save(profile_path)
    # Actual difference between the new and the previous snapshot (demo only).
    actual_diff = profilers[index - 1].diff(profilers[index])
    return historical_profile, new_dataframe, profile_path, actual_diff

ge_historical_profile, new_dataframe, profile_path, actual_diff = get_historical_profiler_and_data_for_demo(
    profiler_objs,
    individual_dataframes,
    1
)

In our example:
- `ge_historical_profile` refers to the historical profile containing snapshot profiles n-1 -> 0.
- `new_dataframe` refers to the newest snapshot of the dataset at time n.
- `profile_path` refers to a path to a saved copy of the profile at time n-1
- `actual_diff` is the actual difference between profile n and n-1. This is used only for the purposes of this demo.

Now, we will generate the `limit_check_report_keys` to be used in our custom expectation. This dictionary contains the lower and upper bound thresholds for each statistic key for each column. 

In [None]:
# Expected bounds for the difference between the new dataframe and the most recent profile in the historical profile
diff_min_max_report = ge_historical_profile.get_diff_min_and_max_report()
limit_check_report_keys = ge_historical_profile.get_statistics_limit_check_report_keys(
    diff_min_max_report
)
limit_check_report_keys

Below is the actual difference for comparison. We are expecting that these difference values fall between `lower` and `upper` for each column and each statistic key in `limit_check_report_keys`, respectively.

In [None]:
actual_diff["data_stats"][3]

After comparing the two, we can see that, for the `age` column, the `stddev` and `variance` deltas are lower than the corresponding deltas between any of the profiles in this historical profile. In other words, `stddev` and `variance` for the `age` column have decreased *more* than they ever have historically.

We expect our expectation to pick up on this and let us know about this unprecedented change.
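
The manual comparison above can be sketched as a small helper. This is an illustration only, not the expectation's actual logic; the function `find_violations`, the dictionary layouts, and all numbers are hypothetical. It assumes the limits map each column to its statistics, each with a `lower` and `upper` bound, and that the diff values being checked are numeric.

```python
def find_violations(diff_stats, limits):
    """Return (column, statistic, value) triples that fall outside [lower, upper]."""
    violations = []
    for column, stat_limits in limits.items():
        for stat, bounds in stat_limits.items():
            value = diff_stats.get(column, {}).get(stat)
            if value is None:
                continue  # statistic missing from the diff; nothing to check
            if not bounds["lower"] <= value <= bounds["upper"]:
                violations.append((column, stat, value))
    return violations


# Hypothetical bounds and diff values, for illustration only.
limits = {
    "age": {
        "stddev": {"lower": -0.5, "upper": 0.8},
        "mean": {"lower": -1.0, "upper": 1.0},
    }
}
diff_stats = {"age": {"stddev": -0.9, "mean": 0.2}}

violations = find_violations(diff_stats, limits)
```

Here the `stddev` delta falls below its lower bound and is reported, while the `mean` delta stays within its range.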

#### Running the Expectation

To run the custom expectation in our Python code, we need to obtain a `validator` from Great Expectations. The validator acts as a wrapper around our data and allows us to call the custom expectation that we imported earlier. We can obtain this validator by calling `build_pandas_validator_with_data(<dataframe>)`.

Then, to run the expectation, simply call `validator.<snake_case_expectation_name>(args)`. This will return a dictionary containing the results of the expectation.

In [None]:
#Run Expectation
validator = build_pandas_validator_with_data(new_dataframe)
result = validator.expect_profile_numeric_columns_diff_between_inclusive_threshold_range(
    profile_path=profile_path, 
    limit_check_report_keys=limit_check_report_keys,   
    )
result