## Overview
This notebook demonstrates how to use Great Expectations with the Data Profiler. The expectation used in this example checks that the values in a specified column are less than or equal to the maximum value recorded for that column in a previously generated profile report.
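To make the idea concrete before involving either library, here is a minimal sketch of the same check in plain pandas. The two small dataframes and the variable names are invented for illustration, and this is not how the expectation is implemented internally.

```python
import pandas as pd

# Toy data, invented for illustration only.
reference = pd.DataFrame({"age": [21, 35, 64]})
candidate = pd.DataFrame({"age": [18, 64, 70, None]})

# Step 1: the "profile max" is the largest value in the reference column.
profile_max = reference["age"].max()

# Step 2: every non-missing candidate value should be <= that maximum.
non_missing = candidate["age"].dropna()
unexpected = non_missing[non_missing > profile_max]

print("profile max:", profile_max)
print("values above profile max:", unexpected.tolist())
```

Missing values are set aside rather than counted as failures here, which is an assumption made for this sketch.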

## Instructions

### Offline work
Begin by creating a new directory for your project. We recommend using a virtual environment, which you can create with the following commands:
 - Initialize your virtual environment: `python3 -m venv venv`
 - Activate the virtual environment: `source venv/bin/activate`

Now install the following packages:
- DataProfiler: `pip install DataProfiler`
- Great Expectations: `pip install great_expectations`
- Capital One's DataProfiler Expectations: `pip install capitalone_dataprofiler_expectations`
    - NOTE: this package is currently not published. You can download the package [here](https://github.com/great-expectations/great_expectations/tree/develop/contrib/capitalone_dataprofiler_expectations), and install it using: `pip install -e <path_to_downloaded_package>`
    - If `pip install` fails after the package is downloaded, you may need to add the following line to `great_expectations/contrib/capitalone_dataprofiler_expectations/setup.py`:
        - `py_modules=[]`

Initialize Great Expectations:
- Run the following command to initialize Great Expectations: `great_expectations init`
    - NOTE: This step is required to generate the `DataContext` that we will load later
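Before running the notebook cells, it can help to confirm that the three packages are importable from the active environment. The following is a small sketch using only the standard library; the helper name `missing_modules` is hypothetical, and the module names are taken from the import statements used in this notebook.

```python
from importlib.util import find_spec

# Top-level module names as imported in this notebook.
REQUIRED_MODULES = [
    "dataprofiler",
    "great_expectations",
    "capitalone_dataprofiler_expectations",
]


def missing_modules(names):
    """Return the names that cannot be located on the current import path."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Not installed in this environment:", ", ".join(missing))
else:
    print("All required packages are importable.")
```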

In [None]:
import os

import numpy as np
import pandas as pd
import dataprofiler as dp
import great_expectations as ge
# Importing the expectation class makes it available on the validator
from capitalone_dataprofiler_expectations.expectations.expect_column_values_to_be_equal_to_or_less_than_profile_max import ExpectColumnValuesToBeEqualToOrLessThanProfileMax
from great_expectations.self_check.util import build_pandas_validator_with_data

# Load the DataContext created by `great_expectations init`
context = ge.get_context()

Below we load a CSV file containing gun crime statistics.

In [None]:
guns_data_path = "../dataprofiler/tests/data/csv/guns.csv"
df = pd.read_csv(guns_data_path)
df

For this expectation we will compare the maximum value of one column across two different time frames in this dataset. Below we collect the years recorded in the dataset, ordered from most recent to oldest.

In [None]:
df.sort_values(by="year", axis=0, inplace=True)
years = df["year"].unique().tolist()
years.reverse()
years

Now that we have the years, we will capture all records from each year in their own dataframes so we can process them separately.

In [None]:
individual_dataframes = []
for year in years:
    current_year_df = df.loc[df["year"]==year]
    current_year_df = current_year_df.drop("year", axis=1).drop("month", axis=1)
    individual_dataframes.append(current_year_df)
individual_dataframes[0]
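
As an alternative to the explicit loop, the same per-year split can be sketched with `groupby`, keyed by year so that each frame can be looked up by name rather than by list position. The dataframe below is a toy stand-in with invented values, and `frames_by_year` is a hypothetical name not used elsewhere in this notebook.

```python
import pandas as pd

# Toy stand-in for the guns dataset; values are invented.
df = pd.DataFrame(
    {
        "year": [2012, 2013, 2013, 2014],
        "month": [1, 2, 3, 4],
        "age": [30, 41, 52, 63],
    }
)

# One dataframe per year, with the time columns dropped as in the loop above.
frames_by_year = {
    year: group.drop(columns=["year", "month"])
    for year, group in df.groupby("year")
}

latest_year = max(frames_by_year)
print(latest_year, frames_by_year[latest_year]["age"].tolist())
```

Keying by year avoids relying on the order of a list when choosing which time frame to profile and which to validate.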

Now we will create a report on the first individual dataframe, which corresponds to the year 2014. We then output the maximum age found in the report's statistics for the "age" column.

In [None]:
profiler_options = dp.ProfilerOptions()
profiler_options.set({"data_labeler.is_enabled": False})
profile = dp.Profiler(individual_dataframes[0], len(individual_dataframes[0]), options=profiler_options)
report = profile.report(report_options={"output_format": "compact"})
report['data_stats'][3]['statistics']['max']
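
The lookup above depends on "age" being the fourth entry in `data_stats`. A sketch of a lookup by column name is shown below; it assumes each entry in `data_stats` carries a `column_name` key alongside `statistics`. The helper `column_statistic` is hypothetical, and `mock_report` is a hand-built stand-in with invented values rather than real profiler output.

```python
def column_statistic(report, column_name, statistic):
    """Return one statistic for the named column from a profile report dict."""
    for column_stats in report["data_stats"]:
        if column_stats["column_name"] == column_name:
            return column_stats["statistics"][statistic]
    raise KeyError(f"column {column_name!r} not found in report")


# Hand-built stand-in for a report; the values are invented.
mock_report = {
    "data_stats": [
        {"column_name": "sex", "statistics": {}},
        {"column_name": "age", "statistics": {"max": 90.0}},
    ]
}

print(column_statistic(mock_report, "age", "max"))
```

With the real report from the cell above, the call would be `column_statistic(report, "age", "max")`.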

In [None]:
# Validate the second dataframe against the report built from the first one
validator = build_pandas_validator_with_data(individual_dataframes[1])
results = validator.expect_column_values_to_be_equal_to_or_less_than_profile_max(
    column='age',
    profile=report
)
results

After running the expectation, we find one row whose value, 107, exceeds the maximum age from the 2014 report, as well as 11 rows with missing values.
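
To pull those numbers out programmatically, the validation result can be reduced to a few fields. The sketch below works on a plain dictionary and assumes the result uses the `unexpected_count`, `missing_count`, and `partial_unexpected_list` keys that Great Expectations typically reports for column-level checks. The helper `summarize` is hypothetical, and `mock_validation` is hand-built to mirror the counts described above; its `success` value is an assumption.

```python
def summarize(validation):
    """Reduce a validation result dict to the fields discussed above."""
    result = validation["result"]
    return {
        "success": validation["success"],
        "unexpected_count": result.get("unexpected_count"),
        "missing_count": result.get("missing_count"),
        "examples": result.get("partial_unexpected_list", []),
    }


# Hand-built stand-in mirroring the counts described above.
mock_validation = {
    "success": False,  # assumed, since one value exceeds the profile max
    "result": {
        "unexpected_count": 1,
        "missing_count": 11,
        "partial_unexpected_list": [107.0],
    },
}

summary = summarize(mock_validation)
print(summary)
```

With the real `results` object, converting it first with `results.to_json_dict()` should yield a dictionary of this general shape, though the exact keys may vary with the Great Expectations version and result format.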