## Overview
- **Expectation Used:** [expect_column_values_to_be_equal_to_or_greater_than_profile_min](https://github.com/great-expectations/great_expectations/blob/develop/contrib/capitalone_dataprofiler_expectations/capitalone_dataprofiler_expectations/expectations/expect_column_values_to_be_equal_to_or_greater_than_profile_min.py)
- **Expectation Description:** This expectation takes the profile report of an initial dataset and compares it against an additional dataset of the same schema. Every value in the user-specified column of the new dataset is expected to be greater than or equal to the `min` metric recorded for that column in the report of the initial dataset.
- **Use Case:** If a user has data that tracks the daily spending on an account, they might want a data quality check that flags when daily spending reaches an all-time low. As new daily spending data arrives, this expectation raises a violation if any new value is lower than the min daily spending previously recorded for the account. This is one practical use for fraud monitoring and detection.
- **Example Details:** In this example, we use this expectation to check the min `salary_in_usd` found in the original dataset's report against all the `salary_in_usd` values in a new dataset. All salaries in the new dataset are expected to be greater than or equal to the min salary of the original dataset; otherwise a violation is raised indicating exactly which values caused it.
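
The comparison described above can be sketched in plain Python. This is only an illustration of the rule, not the expectation's actual implementation: the helper `values_below_profile_min` and the spending figures are hypothetical, and `min(baseline)` stands in for the `min` metric that the profile report would supply.

```python
from typing import List, Sequence


def values_below_profile_min(
    baseline: Sequence[float], new_values: Sequence[float]
) -> List[float]:
    """Return the new values that fall below the baseline minimum."""
    # Stand-in for the `min` metric found in the profile report.
    profile_min = min(baseline)
    # Values equal to the minimum are allowed; only strictly lower ones are flagged.
    return [value for value in new_values if value < profile_min]


# Hypothetical daily-spending figures, not taken from the dataset.
baseline_spending = [50.0, 20.0, 35.0]
new_spending = [25.0, 20.0, 10.0, 40.0]

unexpected = values_below_profile_min(baseline_spending, new_spending)
success = len(unexpected) == 0
print(unexpected, success)
```

Here only `10.0` is flagged, because `20.0` equals the baseline minimum and so still satisfies "equal to or greater than".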

### Imports

In [11]:
import os

import pandas as pd
import numpy as np

# Great expectations imports
import great_expectations as ge
from capitalone_dataprofiler_expectations.expectations. \
    expect_column_values_to_be_equal_to_or_greater_than_profile_min \
    import ExpectColumnValuesToBeEqualToOrGreaterThanProfileMin
from great_expectations.self_check.util import build_pandas_validator_with_data

# Data Profiler import
import dataprofiler as dp

### Setup
Below we import a dataset from the Data Profiler testing suite. This CSV holds information on the salaries of individuals in the data science field from all over the world.

In [12]:
context = ge.get_context()

In [13]:
data_path = "../../dataprofiler/tests/data/csv/ds_salaries.csv"
df = pd.read_csv(data_path)
df.head()

Unnamed: 0,uuid,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,0c898791-6efc-4410-983d-c1346e1cb390,2020,MI,FT,Data Scientist,70000,EUR,79833,DE,0,DE,L
1,bf898c23-1c84-441f-addf-9cbf1fc16a90,2020,SE,FT,Machine Learning Scientist,260000,USD,260000,JP,0,JP,S
2,e7ccbded-d7f8-470e-a5d0-45c84913c51e,2020,SE,FT,Big Data Engineer,85000,GBP,109024,GB,50,GB,M
3,ac158555-0bb5-48c4-927a-22984062adeb,2020,MI,FT,Product Data Analyst,20000,USD,20000,HN,0,HN,S
4,4be685e7-12c5-4171-8a19-4091b7f20699,2020,SE,FT,Machine Learning Engineer,150000,USD,150000,US,50,US,L


In this example we are going to split up the dataset into three separate years so we can simulate a dataset which will have a yearly aggregation of data.

In [14]:
df.sort_values(by="work_year", axis=0, inplace=True)
years = df["work_year"].unique().tolist()
years

[2020, 2021, 2022]

Now that we have the years, we will capture all records from each year in their own dataframes, so we can process them separately.

In [15]:
individual_dataframes = []
for year in years:
    current_year_df = df.loc[df["work_year"]==year]
    current_year_df = current_year_df.drop("work_year", axis=1)
    individual_dataframes.append(current_year_df)
individual_dataframes[0].head()

Unnamed: 0,uuid,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,0c898791-6efc-4410-983d-c1346e1cb390,MI,FT,Data Scientist,70000,EUR,79833,DE,0,DE,L
51,715fb124-30a9-408e-831a-aa5d6d6e1c9e,EN,FT,Data Analyst,91000,USD,91000,US,100,US,L
50,2f6505ef-1d37-4cd9-894d-6ee6adf0745c,EN,FT,Data Analyst,450000,INR,6072,IN,0,IN,S
49,eb7a27be-3112-453b-9717-737e57f531ad,MI,FT,Data Engineer,61500,EUR,70139,FR,50,FR,L
48,4cb2acbd-f98f-40f5-ae52-626f65a726a1,MI,FT,Data Scientist,105000,USD,105000,US,100,US,L


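Now we will create a report on the first entry of `individual_dataframes`, which corresponds to the year 2020, then we will output the `max` metric from the `salary_in_usd` column as found in the report. The expectation itself compares against the `min` metric stored in the same column statistics.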

In [16]:
profiler_options = dp.ProfilerOptions()
profiler_options.set({"data_labeler.is_enabled": False})
profile = dp.Profiler(individual_dataframes[0], len(individual_dataframes[0]), options=profiler_options)
report = profile.report(report_options={"output_format": "compact"})

INFO:DataProfiler.profilers.profile_builder: Finding the Null values in the columns...  (with 11 processes)


100%|██████████| 11/11 [00:05<00:00,  2.03it/s]
  df_series = df_series.loc[true_sample_list]


INFO:DataProfiler.profilers.profile_builder: Calculating the statistics...  (with 4 processes)


100%|██████████| 11/11 [00:03<00:00,  3.05it/s]
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if isinstance(val, np.float):


In [17]:
report['data_stats'][6]['statistics']['max']

450000.0
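
The positional lookup `report['data_stats'][6]` depends on the column order of the dataframe. A minimal sketch of a lookup by name follows, assuming each entry in `data_stats` carries a `column_name` key next to `statistics`. The helper `column_statistics` is hypothetical, and `sample_report` is a hand-built stand-in with the same nesting rather than real profiler output; only the `max` value is taken from the cell above.

```python
from typing import Any, Dict


def column_statistics(report: Dict[str, Any], column_name: str) -> Dict[str, Any]:
    """Find the statistics dict for a named column in a profiler report."""
    for column in report["data_stats"]:
        # Assumption: every entry exposes its name under "column_name".
        if column.get("column_name") == column_name:
            return column["statistics"]
    raise KeyError(f"column {column_name!r} not found in report")


# Hand-built stand-in for the report; not generated by Data Profiler.
sample_report = {
    "data_stats": [
        {"column_name": "salary_currency", "statistics": {}},
        {"column_name": "salary_in_usd", "statistics": {"max": 450000.0}},
    ]
}

stats = column_statistics(sample_report, "salary_in_usd")
print(stats["max"])
```

Looking up by name keeps the cell correct if columns are added, dropped, or reordered before profiling.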

### Running the Expectation
We build the validator by passing in the dataframe from `individual_dataframes` corresponding to 2021. Then we use the expectation below to check whether any values in the `salary_in_usd` column fall below the `min` metric for that same column in `report`.

In [19]:
validator = build_pandas_validator_with_data(individual_dataframes[1])
results = validator.expect_column_values_to_be_equal_to_or_greater_than_profile_min(
    column='salary_in_usd',
    profile=report
)

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

### Results
After we run the expectation we find that it fails: five values in `salary_in_usd` (4000, 4000, 5679, 5409, and 2859) fall below the min salary from the 2020 report, which is about 2.3% of the 217 rows. There are no missing values.

In [20]:
results

{
  "result": {
    "element_count": 217,
    "unexpected_count": 5,
    "unexpected_percent": 2.3041474654377883,
    "partial_unexpected_list": [
      4000,
      4000,
      5679,
      5409,
      2859
    ],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 2.3041474654377883,
    "unexpected_percent_nonmissing": 2.3041474654377883
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "success": false,
  "meta": {}
}
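
The figures in the result can be cross-checked with a few lines of Python. This sketch works on a plain dict copied from the `result` section of the output above; how the validation result object is converted to a dict is left out here, and the variable names are illustrative only.

```python
# Values copied from the "result" section of the output above.
result = {
    "element_count": 217,
    "unexpected_count": 5,
    "partial_unexpected_list": [4000, 4000, 5679, 5409, 2859],
    "missing_count": 0,
}

# With no missing values, the nonmissing count equals the element count.
nonmissing = result["element_count"] - result["missing_count"]
unexpected_percent = 100 * result["unexpected_count"] / nonmissing
lowest_unexpected = min(result["partial_unexpected_list"])

print(
    f"{result['unexpected_count']} of {nonmissing} values "
    f"({unexpected_percent:.2f}%) fall below the profile min"
)
print(f"lowest unexpected value: {lowest_unexpected}")
```

Because `missing_count` is 0, `unexpected_percent`, `unexpected_percent_total`, and `unexpected_percent_nonmissing` all agree in the reported result.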