## Overview
This notebook demonstrates how to use Great Expectations with the Data Profiler. The expectation used in this example checks that every value in a specified column is less than or equal to the maximum value recorded for that column in a previously generated profile report.

In [7]:
import os

import pandas as pd
import numpy as np

# Great expectations imports
import great_expectations as ge
from capitalone_dataprofiler_expectations.expectations. \
    expect_column_values_to_be_equal_to_or_less_than_profile_max \
    import ExpectColumnValuesToBeEqualToOrLessThanProfileMax
from great_expectations.self_check.util import build_pandas_validator_with_data

# Data Profiler import
import dataprofiler as dp

Below we import a CSV file containing gun crime statistics.

In [8]:
context = ge.get_context()
guns_data_path = "../../dataprofiler/tests/data/csv/guns.csv"
df = pd.read_csv(guns_data_path)
df

Unnamed: 0,year,month,intent,police,sex,age,race,hispanic,place,education
0,2012,1,Suicide,0,M,34.0,Asian/Pacific Islander,100,Home,4.0
1,2012,1,Suicide,0,F,21.0,White,100,Street,3.0
2,2012,1,Suicide,0,M,60.0,White,100,Other specified,4.0
3,2012,2,Suicide,0,M,64.0,White,100,Home,4.0
4,2012,2,Suicide,0,M,31.0,White,100,Other specified,2.0
...,...,...,...,...,...,...,...,...,...,...
100793,2014,12,Homicide,0,M,36.0,Black,100,Home,2.0
100794,2014,12,Homicide,0,M,19.0,Black,100,Street,2.0
100795,2014,12,Homicide,0,M,20.0,Black,100,Street,2.0
100796,2014,12,Homicide,0,M,22.0,Hispanic,260,Street,1.0


For this expectation, we will compare the maximum value of one column across two different time frames in the dataset. Below, we gather the distinct years recorded in the dataset.

In [9]:
df.sort_values(by="year", axis=0, inplace=True)
years = df["year"].unique().tolist()
years.reverse()
years

[2014, 2013, 2012]

Now that we have the years, we will capture all records from each year in their own dataframes so we can process them separately.

In [10]:
individual_dataframes = []
for year in years:
    current_year_df = df.loc[df["year"]==year]
    current_year_df = current_year_df.drop("year", axis=1).drop("month", axis=1)
    individual_dataframes.append(current_year_df)
individual_dataframes[0]

Unnamed: 0,intent,police,sex,age,race,hispanic,place,education
89596,Suicide,0,M,28.0,White,100,Home,2.0
89597,Suicide,0,M,51.0,White,100,Other specified,2.0
89598,Suicide,0,M,43.0,Asian/Pacific Islander,100,Home,4.0
89599,Homicide,0,M,43.0,Black,100,Other unspecified,2.0
89600,Homicide,0,F,22.0,White,100,Other specified,2.0
...,...,...,...,...,...,...,...,...
78390,Homicide,0,M,25.0,Black,100,Street,2.0
78389,Suicide,0,M,23.0,White,100,Other specified,3.0
78388,Homicide,0,M,23.0,Black,100,Street,1.0
78386,Homicide,0,M,23.0,Black,100,Street,2.0
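
The per-year split above can also be sketched with `groupby`, which avoids filtering the full dataframe once per year. This is a minimal sketch on a small made-up frame: the rows are illustrative and not taken from guns.csv, and `split_by_year` is a hypothetical helper name, not part of pandas or the Data Profiler.

```python
import pandas as pd


def split_by_year(frame, drop_columns=("year", "month")):
    """Return {year: sub-frame} with the time columns dropped, newest year first."""
    frames = {}
    for year, group in frame.groupby("year"):
        frames[int(year)] = group.drop(columns=list(drop_columns))
    # Put the most recent year first, mirroring `years.reverse()` in the notebook
    return dict(sorted(frames.items(), reverse=True))


# Illustrative data only; these rows are not from guns.csv
toy = pd.DataFrame({
    "year": [2012, 2013, 2013, 2014],
    "month": [1, 2, 3, 4],
    "age": [34.0, 21.0, 60.0, 28.0],
})
frames_by_year = split_by_year(toy)
```

Keying the result by year rather than by list position makes it harder to mix up which dataframe is profiled and which one is validated.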


Now we will create a report on the first individual dataframe, which corresponds to the year 2014, and then output the maximum age found in the report's "age" column.

In [11]:
profiler_options = dp.ProfilerOptions()
profiler_options.set({"data_labeler.is_enabled": False})
profile = dp.Profiler(individual_dataframes[0], len(individual_dataframes[0]), options=profiler_options)
report = profile.report(report_options={"output_format": "compact"})
report['data_stats'][3]['statistics']['max']

INFO:DataProfiler.profilers.profile_builder: Finding the Null values in the columns...  (with 8 processes)


100%|██████████| 8/8 [00:04<00:00,  1.60it/s]
  df_series = df_series.loc[true_sample_list]


INFO:DataProfiler.profilers.profile_builder: Calculating the statistics...  (with 4 processes)


100%|██████████| 8/8 [00:03<00:00,  2.23it/s]
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if isinstance(val, np.float):


102.0
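
The lookup above relies on the "age" column sitting at position 3 in `data_stats`. A name-based lookup is less fragile if the column order changes. The sketch below assumes each `data_stats` entry carries its column label under a `"column_name"` key; `column_max` is a hypothetical helper, and `toy_report` is a hand-written stand-in for a real report, with an illustrative value for the "police" column.

```python
def column_max(report, column_name):
    """Return the 'max' statistic for the named column in a profiler report dict."""
    for column_stats in report["data_stats"]:
        # Assumption: each entry stores its column label under "column_name"
        if column_stats.get("column_name") == column_name:
            return column_stats["statistics"]["max"]
    raise KeyError(f"no column named {column_name!r} in report")


# Hand-written stand-in for a compact report; only the keys used above are included
toy_report = {
    "data_stats": [
        {"column_name": "police", "statistics": {"max": 1.0}},
        {"column_name": "age", "statistics": {"max": 102.0}},
    ]
}
age_max = column_max(toy_report, "age")
```

With a real report, `column_max(report, "age")` would stand in for `report['data_stats'][3]['statistics']['max']`, provided the key assumption holds.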

In [12]:
validator = build_pandas_validator_with_data(individual_dataframes[1])
results = validator.expect_column_values_to_be_equal_to_or_less_than_profile_max(
    column='age',
    profile=report
)
results

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "result": {
    "element_count": 33636,
    "unexpected_count": 1,
    "unexpected_percent": 0.002973977695167286,
    "partial_unexpected_list": [
      107.0
    ],
    "missing_count": 11,
    "missing_percent": 0.03270305624925675,
    "unexpected_percent_total": 0.0029730051135687953,
    "unexpected_percent_nonmissing": 0.002973977695167286
  },
  "success": false,
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

After running the expectation, we find one row whose age (107) exceeds the maximum age of 102 from the previous report, as well as 11 rows with missing values.
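
The counts and percentages in the result can be reproduced by hand, which helps when reading the output. The sketch below is not the Great Expectations implementation; it is a dependency-free approximation of the arithmetic, run on synthetic ages sized to match the counts reported above. `summarize_against_max` is a hypothetical helper name.

```python
import math


def summarize_against_max(values, profile_max):
    """Count missing values and values above profile_max, with percentages."""
    element_count = len(values)
    # Treat None and NaN as missing
    present = [v for v in values if v is not None and not math.isnan(v)]
    missing_count = element_count - len(present)
    unexpected_count = sum(1 for v in present if v > profile_max)
    return {
        "element_count": element_count,
        "missing_count": missing_count,
        "unexpected_count": unexpected_count,
        # Share of all rows, including missing ones
        "unexpected_percent_total": (
            100.0 * unexpected_count / element_count if element_count else 0.0
        ),
        # Share of rows that actually hold a value
        "unexpected_percent_nonmissing": (
            100.0 * unexpected_count / len(present) if present else 0.0
        ),
    }


# Synthetic ages sized to match the result: 33636 rows, 11 missing, one value of 107
ages = [40.0] * 33624 + [107.0] + [float("nan")] * 11
summary = summarize_against_max(ages, profile_max=102.0)
```

The two percentages differ only in the denominator: `unexpected_percent_total` divides by all 33636 rows, while `unexpected_percent_nonmissing` divides by the 33625 rows that have a value.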