## Overview
This notebook demonstrates how to use Great Expectations with the DataProfiler. The expectation used in this example checks that values in a specified column are labeled with a confidence greater than or equal to the threshold given in the expectation.

## Instructions

### Offline work
Begin by creating a new directory for your project. We recommend using a virtual environment, which you can create with the following commands:
 - Initialize your virtual environment: `python3 -m venv venv`
 - Activate the virtual environment: `source venv/bin/activate`
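
To confirm from within Python that the virtual environment is active, one common check compares `sys.prefix` with `sys.base_prefix`. This is a minimal sketch that assumes the environment was created with `venv` as shown above; the helper name `in_virtualenv` is hypothetical.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
```

If this prints `False` after activation, the notebook kernel is probably using a different interpreter than the one in `venv`.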

Now install the following packages:
- In the `great_expectations_examples` directory run `pip install -r requirements.txt`
- Capital One's DataProfiler Expectations: `pip install capitalone_dataprofiler_expectations`
    - NOTE: this package is currently not published. You can download the package [here](https://github.com/great-expectations/great_expectations/tree/develop/contrib/capitalone_dataprofiler_expectations), and install it using: `pip install -e <path_to_downloaded_package>`
    - If `pip install` fails after the package is downloaded, you may need to add the following line to `great_expectations/contrib/capitalone_dataprofiler_expectations/setup.py`:
        - `py_modules=[]`

Initialize Great Expectations:
- Run the following command to initialize Great Expectations: `great_expectations init`
    - NOTE: This step is crucial in order to generate the `DataContext` that we will obtain later
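
Before moving on, it can help to confirm that the initialization produced a project folder. The sketch below assumes the command created a `great_expectations/` directory containing `great_expectations.yml`, which is the layout used by the legacy CLI; other releases may use a different folder name. The helper `find_ge_config` is hypothetical and is not part of Great Expectations.

```python
from pathlib import Path
from typing import Optional

# Assumed layout produced by `great_expectations init`; adjust if your version differs.
CONFIG_RELATIVE_PATH = Path("great_expectations") / "great_expectations.yml"


def find_ge_config(start: Path) -> Optional[Path]:
    """Search `start` and each of its parent directories for the assumed config file."""
    start = start.resolve()
    for directory in (start, *start.parents):
        candidate = directory / CONFIG_RELATIVE_PATH
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    config = find_ge_config(Path.cwd())
    if config is None:
        print("No Great Expectations project found; run `great_expectations init` first.")
    else:
        print(f"Found Great Expectations config at {config}")
```

Searching parent directories as well means the check also works when the notebook lives in a subfolder of the project.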

In [None]:
import os

import pandas as pd
import numpy as np

import dataprofiler as dp
import great_expectations as ge
from capitalone_dataprofiler_expectations.expectations.expect_column_values_confidence_for_data_label_to_be_greater_than_or_equal_to_threshold import ExpectColumnValuesConfidenceForDataLabelToBeGreaterThanOrEqualToThreshold
from great_expectations.self_check.util import build_pandas_validator_with_data

# Load the DataContext created by `great_expectations init`
context = ge.get_context()

Below we import a CSV file into a DataFrame. This file has a column named "srcip", which is used further below.

In [None]:
# Path is relative to the notebook's working directory
data_path = "../../dataprofiler/tests/data/csv/aws_honeypot_marx_geo.csv"
df = pd.read_csv(data_path)
df

We will use the expectation below to find values in the "srcip" column that are labeled as "IPV4" with a confidence of 0.85 or higher.

In [None]:
validator = build_pandas_validator_with_data(df)
results = validator.expect_column_values_confidence_for_data_label_to_be_greater_than_or_equal_to_threshold(
    column='srcip',
    data_label='IPV4',
    threshold=.85
)
results

Here you can see that there are 9 values in the "srcip" column which are detected by our expectation with a confidence value greater than or equal to 0.85.
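
The rule applied to each value can be pictured as a simple inclusive comparison. The sketch below is not the DataProfiler's implementation and does not call it: the confidence scores are made-up illustrative numbers, and `values_meeting_threshold` is a hypothetical helper. It only shows that a value whose confidence equals the threshold counts as meeting it.

```python
from typing import Dict, List


def values_meeting_threshold(confidences: Dict[str, float], threshold: float) -> List[str]:
    """Return the values whose label confidence is >= threshold (inclusive)."""
    return [value for value, score in confidences.items() if score >= threshold]


# Hypothetical confidence scores for the "IPV4" label; not real profiler output.
example_confidences = {
    "192.0.2.10": 0.97,
    "198.51.100.7": 0.85,  # exactly at the threshold, so it is included
    "not-an-ip": 0.12,
}

passing = values_meeting_threshold(example_confidences, threshold=0.85)
print(passing)  # ['192.0.2.10', '198.51.100.7']
```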