This notebook creates an Expectation Suite and validates data against it.

In [None]:
# Step 1 - Install required software

!pip install -r requirements.txt

In [11]:
# Step 2 - Imports

import great_expectations as gx
from great_expectations.data_context import FileDataContext
from great_expectations.core.expectation_configuration import ExpectationConfiguration

In [12]:
# Step 3 - Initiate a Filesystem Data Context
# Note: Replace /Users/fernandoembrioni/Documents/Fer/repos/ with the path to the folder that contains this repository on your machine

path_to_empty_folder = "/Users/fernandoembrioni/Documents/Fer/repos/fer-gx-validator/filecontext"
context = FileDataContext.create(project_root_dir=path_to_empty_folder)
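
A hard-coded absolute path ties the notebook to one machine. The following is a minimal sketch of deriving the folder instead, assuming the notebook is started from the repository root and the folder name `filecontext` is kept. `resolve_context_dir` is a hypothetical helper, not part of Great Expectations.

```python
from pathlib import Path


def resolve_context_dir(base_dir=None, folder_name="filecontext"):
    """Return an absolute path to the Data Context folder, creating it if missing.

    base_dir defaults to the current working directory (assumption: the
    notebook is launched from the repository root).
    """
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    context_dir = (base / folder_name).resolve()
    # Create the folder so it can be passed as the project root directory.
    context_dir.mkdir(parents=True, exist_ok=True)
    return context_dir
```

With this helper, `path_to_empty_folder = str(resolve_context_dir())` could replace the literal path in Step 3.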

In [13]:
# Step 4 - Create a Validator by connecting to data

validator = context.sources.pandas_default.read_csv(
    "https://raw.githubusercontent.com/great-expectations/gx_tutorials/main/data/yellow_tripdata_sample_2019-01.csv"
)

In [14]:
# Step 5 - Create Expectations and save them to the context

# IMPORTANT:
# Each Expectation is validated against the data source as soon as it is created, and
# its result (success or failure) decides whether it is kept in the Expectation Suite.
# Here I expect every 'rate_code_id' value to be in the set {1}, but the data source
# also contains the values {2, 3, 4, 5, 99} in this column. The validation would
# therefore fail and the Expectation would be left out of the suite.
# Workaround: set 'mostly' to 0.0 so the Expectation always succeeds and is saved,
# then edit the suite afterwards (Step 9) to raise 'mostly' to the value I need.

column_list = [
    "vendor_id",
    "pickup_datetime",
    "dropoff_datetime",
    "passenger_count",
    "trip_distance",
    "rate_code_id",
    "store_and_fwd_flag",
    "pickup_location_id",
    "dropoff_location_id",
    "payment_type",
    "fare_amount",
    "extra",
    "mta_tax",
    "tip_amount",
    "tolls_amount",
    "improvement_surcharge",
    "total_amount",
    "congestion_surcharge",
]

validator.expect_table_columns_to_match_ordered_list(column_list=column_list)
validator.expect_column_values_to_be_in_set(column='rate_code_id', value_set={1}, mostly=0.0)
validator.expect_column_values_to_not_be_null(column='vendor_id', mostly=0.95)
validator.save_expectation_suite()
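
To see why `mostly=0.0` always succeeds while `mostly=1.0` fails on this data, here is a small sketch of the threshold logic in plain Python. It is an illustration, not the library's implementation: `passes_with_mostly` is a hypothetical helper, the sample values are made up, and treating an empty column as a success is an assumption of this sketch.

```python
def passes_with_mostly(values, value_set, mostly=1.0):
    """Return (success, fraction_expected) for the non-null values.

    The check succeeds when the fraction of non-null values that belong to
    value_set is at least `mostly`.
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        # Assumption of this sketch: nothing to check counts as success.
        return True, 1.0
    expected = sum(1 for v in non_null if v in value_set)
    fraction = expected / len(non_null)
    return fraction >= mostly, fraction


# Hypothetical sample: three of the five values are in the set {1}.
sample_rate_codes = [1, 1, 1, 2, 99]

lenient_ok, fraction = passes_with_mostly(sample_rate_codes, {1}, mostly=0.0)
strict_ok, _ = passes_with_mostly(sample_rate_codes, {1}, mostly=1.0)
print(lenient_ok, strict_ok, fraction)
```

Depending on the installed release, `save_expectation_suite` may also accept `discard_failed_expectations=False`, which would keep failing Expectations without the `mostly=0.0` workaround. Check the API reference for your version before relying on it.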


In [15]:
# Step 6 - Add a checkpoint

checkpoint = context.add_or_update_checkpoint(
    name="my_checkpoint",
    validator=validator,
)

In [17]:
# Step 7 - Retrieve the checkpoint from the context and run it

checkpoint = context.get_checkpoint("my_checkpoint")
result = checkpoint.run()


In [8]:
# Step 8 - View an HTML representation of the validation results

context.view_validation_result(result)
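
Besides the HTML page, the outcome of each Expectation can be summarized in code. The sketch below works on a plain dictionary and assumes the validation result has been converted to one (for example with a `to_json_dict()` style method) and uses the keys `results`, `success`, `expectation_config`, `expectation_type` and `kwargs`. That layout and the sample data are assumptions; inspect your own result object to confirm them. `summarize_validation` is a hypothetical helper.

```python
def summarize_validation(validation_result):
    """Return one (expectation_type, column, success) tuple per Expectation."""
    rows = []
    for item in validation_result.get("results", []):
        config = item.get("expectation_config", {})
        kwargs = config.get("kwargs", {})
        rows.append(
            (config.get("expectation_type"), kwargs.get("column"), item.get("success"))
        )
    return rows


# Hypothetical, hand-written data in the assumed layout.
sample_validation_result = {
    "success": False,
    "results": [
        {
            "success": True,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "vendor_id", "mostly": 0.95},
            },
        },
        {
            "success": False,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_be_in_set",
                "kwargs": {"column": "rate_code_id", "value_set": [1], "mostly": 1.0},
            },
        },
    ],
}

summary = summarize_validation(sample_validation_result)
failed_columns = [column for _, column, success in summary if not success]
print(failed_columns)
```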

## Step 9 - Manually edit the Suite

As you may notice, every Expectation ran with a 'success' status. However, the 'rate_code_id' column contains many unexpected values, and I would like that Expectation to be reported as a failure.

To achieve this, change the value of 'mostly' to 1.0 (100%) for that Expectation.

Fortunately, Great Expectations allows us to edit the suite.

Step 9.1 - Click the yellow "How to Edit This Suite" button in the HTML page.

Step 9.2 - Copy the command `great_expectations suite edit default` from the popup window and paste it into your terminal. Make sure you are in your project's `filecontext` folder when you run it.

Step 9.3 - Select option 1 to manually edit the Expectation Suite and hit Enter.

Step 9.4 - A Jupyter notebook named `edit_default` opens in your web browser. Select the `Python 3 (ipykernel)` kernel to start editing. If the notebook does not open, use one of the URLs printed by the `great_expectations` command (tip: look for the "Or copy and paste one of these URLs" message) and open the `edit_default.ipynb` file.

Step 9.5 - Run the first and second cells of the notebook.

Step 9.6 - In the third cell, change `"mostly": 0.0,` to `"mostly": 1.0,`, then run that cell and all the remaining ones.

Step 9.7 - Close the `jupyter` page in your web browser.
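
The manual edit in Step 9 can also be scripted. This is a sketch only: it assumes the suite is stored as JSON with an `expectations` list whose entries carry `expectation_type` and `kwargs`, so verify that layout against the suite file in your project before using it. `set_mostly` is a hypothetical helper that operates on the loaded dictionary; it is not a Great Expectations API, and the sample suite is hand-written.

```python
import copy


def set_mostly(suite, expectation_type, column, mostly):
    """Return a copy of the suite with `mostly` updated on matching Expectations."""
    updated = copy.deepcopy(suite)
    changed = 0
    for expectation in updated.get("expectations", []):
        kwargs = expectation.setdefault("kwargs", {})
        if (
            expectation.get("expectation_type") == expectation_type
            and kwargs.get("column") == column
        ):
            kwargs["mostly"] = mostly
            changed += 1
    if changed == 0:
        raise ValueError(f"No {expectation_type} found for column {column!r}")
    return updated


# Hypothetical suite content in the assumed layout.
sample_suite = {
    "expectation_suite_name": "default",
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_be_in_set",
            "kwargs": {"column": "rate_code_id", "value_set": [1], "mostly": 0.0},
        },
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "vendor_id", "mostly": 0.95},
        },
    ],
}

edited_suite = set_mostly(
    sample_suite, "expect_column_values_to_be_in_set", "rate_code_id", 1.0
)
print(edited_suite["expectations"][0]["kwargs"]["mostly"])
```

The helper returns a modified copy instead of editing in place, so the original dictionary stays available for comparison before anything is written back to disk.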

In [18]:
# Step 10 - Retrieve the checkpoint again and re-run it against the edited suite

checkpoint = context.get_checkpoint("my_checkpoint")
result = checkpoint.run()


In [19]:
# Step 11 - View an HTML representation of the validation results

context.view_validation_result(result)

# Notice that the Expectation for the 'rate_code_id' column now fails because the
# column contains values outside the expected set