# Great Expectations

This notebook aims to show the key concepts of Great Expectations (*GX*) in a few minutes. 🕙

In [1]:
from great_expectations.checkpoint import SimpleCheckpoint
import ruamel.yaml as yaml
from typing import Any, Dict
from great_expectations.data_context import BaseDataContext
from great_expectations.data_context.types.base import DataContextConfig
from google.cloud import storage
from pathlib import Path
import great_expectations as gx
import os
import time

## Create Context

The context is the entry point to all GX functionality. Here, it is loaded from the configuration file. However, it is also possible to describe all settings directly in code via a so-called *Ephemeral Data Context*.

In [2]:
context = gx.get_context()

## Build an expectation suite

A so-called expectation suite stores a set of expectations.

In [3]:
EXPECTATION_SUITE_NAME = "suite01"

In [4]:
from great_expectations.exceptions import DataContextError

try:
    suite = context.get_expectation_suite(expectation_suite_name=EXPECTATION_SUITE_NAME)
    print(f'Loaded ExpectationSuite "{suite.expectation_suite_name}" containing {len(suite.expectations)} expectations.')
except DataContextError:
    suite = context.create_expectation_suite(expectation_suite_name=EXPECTATION_SUITE_NAME)
    print(f'Created ExpectationSuite "{suite.expectation_suite_name}".')

Loaded ExpectationSuite "suite01" containing 2 expectations.


### Define a batch request

The next cells decide whether the so-called interactive mode is used. In interactive mode, each expectation is evaluated directly against a batch of the data while it is being formulated. If the amount of data is large, it is advisable to limit the batch size.

In [5]:
INTERACTIVE_EVALUATION = True

In [6]:
batch_request = {'datasource_name': 'source_local',
                'data_connector_name': 'default_inferred_data_connector_name',
                'data_asset_name': 'Housing.csv',
                }
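As a minimal sketch of the limit mentioned above: the batch request is a plain dictionary, so a row limit can be added as one more key. That `BatchRequest` accepts a `limit` argument in the GX version used here is an assumption, and the value 1000 is arbitrary.

```python
# Same batch request as in the notebook, repeated so this sketch stands alone.
batch_request = {
    "datasource_name": "source_local",
    "data_connector_name": "default_inferred_data_connector_name",
    "data_asset_name": "Housing.csv",
}

# Assumption: `limit` is a supported BatchRequest argument in this GX version.
# The value 1000 is only an example.
batch_request_limited = {**batch_request, "limit": 1000}

print(batch_request_limited)
```

The copy leaves the original `batch_request` untouched, so the full data set can still be validated later.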

In [7]:
from great_expectations.core.batch import BatchRequest

validator = context.get_validator(
    batch_request=BatchRequest(**batch_request),
    expectation_suite_name=EXPECTATION_SUITE_NAME
)

validator.interactive_evaluation = INTERACTIVE_EVALUATION

for old_expectation in suite.expectations:
    validator.remove_expectation(old_expectation)

### Formulate expectations

There is a whole range of off-the-shelf expectations. The list can be found [here](https://greatexpectations.io/expectations/).

But there are also ways to define and customize your own expectations. If you do so, it is worth sharing them. (Community 💪)

In [8]:
validator.expect_column_max_to_be_between(column="parking", min_value=3, max_value=4)

{
  "success": true,
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "observed_value": 3
  }
}

In [9]:
validator.expect_column_mean_to_be_between(column="price", min_value=4000000, max_value=5000000)

{
  "success": true,
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "observed_value": 4766729.247706422
  }
}
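To make the idea behind such an expectation concrete, here is a minimal pure-Python sketch of a "mean between" check. It is not how GX implements it. The helper `check_column_mean_between` and the sample prices are made up for illustration, and inclusive bounds are an assumption.

```python
from statistics import mean
from typing import Any, Dict, Sequence


def check_column_mean_between(
    values: Sequence[float], min_value: float, max_value: float
) -> Dict[str, Any]:
    """Hypothetical helper (not part of GX) that mimics the result shape above."""
    observed = mean(values)
    return {
        # Assumption: both bounds are inclusive.
        "success": min_value <= observed <= max_value,
        "result": {"observed_value": observed},
    }


# Made-up prices for illustration only; they are not taken from Housing.csv.
prices = [4_200_000, 4_800_000, 5_100_000]
outcome = check_column_mean_between(prices, min_value=4_000_000, max_value=5_000_000)
print(outcome)
```

The real expectation additionally records metadata and exception information, as the output above shows.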

In [10]:
# validator.expect_column_stdev_to_be_between(column="price", min_value=1000000, max_value=1100000) 

### Save it

In [11]:
validator.save_expectation_suite(
    discard_failed_expectations=False,
    discard_include_config_kwargs=False)

In [12]:
suite_saved = context.get_expectation_suite(expectation_suite_name=EXPECTATION_SUITE_NAME)

In [13]:
suite_saved

{
  "ge_cloud_id": null,
  "meta": {
    "great_expectations_version": "0.15.28"
  },
  "expectations": [
    {
      "meta": {},
      "expectation_type": "expect_column_max_to_be_between",
      "kwargs": {
        "column": "parking",
        "max_value": 4,
        "min_value": 3
      }
    },
    {
      "meta": {},
      "expectation_type": "expect_column_mean_to_be_between",
      "kwargs": {
        "column": "price",
        "max_value": 5000000,
        "min_value": 4000000
      }
    }
  ],
  "expectation_suite_name": "suite01",
  "data_asset_type": null
}
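Since the saved suite is shown as plain JSON, it can also be inspected with the standard library alone. The following sketch parses a copy of the output above and lists which expectation applies to which column. It does not use the GX API.

```python
import json

# JSON copied from the suite output shown above.
suite_json = """
{
  "ge_cloud_id": null,
  "meta": {"great_expectations_version": "0.15.28"},
  "expectations": [
    {
      "meta": {},
      "expectation_type": "expect_column_max_to_be_between",
      "kwargs": {"column": "parking", "max_value": 4, "min_value": 3}
    },
    {
      "meta": {},
      "expectation_type": "expect_column_mean_to_be_between",
      "kwargs": {"column": "price", "max_value": 5000000, "min_value": 4000000}
    }
  ],
  "expectation_suite_name": "suite01",
  "data_asset_type": null
}
"""

suite_dict = json.loads(suite_json)

# Collect (expectation type, column) pairs in the order they are stored.
summary = [
    (expectation["expectation_type"], expectation["kwargs"]["column"])
    for expectation in suite_dict["expectations"]
]

for expectation_type, column in summary:
    print(f"{column}: {expectation_type}")
```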

## Build a checkpoint and run a validation

> Checkpoints provide a convenient abstraction for bundling the Validation of a Batch (or Batches) of data against an Expectation Suite (or several), as well as the Actions that should be taken after the validation.

Here, we build a *SimpleCheckpoint*. It consists of:
* validating the expectation suite itself
* storing the results
* updating the so-called *data docs*, a kind of report about past validation runs
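The three steps listed above can be pictured with a toy model: run a validation, then pass its result to each follow-up action. This is only a conceptual sketch, not the GX implementation. `run_toy_checkpoint`, `store_result`, and `update_report` are hypothetical names.

```python
from typing import Any, Callable, Dict, List

Validation = Callable[[], Dict[str, Any]]
Action = Callable[[Dict[str, Any]], str]


def run_toy_checkpoint(
    validation: Validation, actions: Dict[str, Action]
) -> Dict[str, Any]:
    """Hypothetical helper: validate first, then run every action on the result."""
    validation_result = validation()
    actions_results = {
        name: action(validation_result) for name, action in actions.items()
    }
    return {
        "validation_result": validation_result,
        "actions_results": actions_results,
    }


# In-memory stand-in for a validation result store.
stored: List[Dict[str, Any]] = []


def store_result(result: Dict[str, Any]) -> str:
    stored.append(result)
    return "stored"


def update_report(result: Dict[str, Any]) -> str:
    return "report: " + ("passed" if result["success"] else "failed")


toy_result = run_toy_checkpoint(
    validation=lambda: {"success": True},
    actions={
        "store_validation_result": store_result,
        "update_data_docs": update_report,
    },
)
print(toy_result["actions_results"])
```

The point of the bundling is that validation and follow-up actions always run together, so a stored result and an updated report exist for every run.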

In [14]:
EXPECTATION_SUITE_NAME = "suite01"

In [15]:
batch_request = {'datasource_name': 'source_local',
                'data_connector_name': 'default_inferred_data_connector_name',
                'data_asset_name': 'Housing.csv',
                }

In [16]:
from great_expectations.checkpoint import SimpleCheckpoint

In [17]:
checkpoint_config = {
    "class_name": "SimpleCheckpoint",
    "validations": [
        {
            "batch_request": batch_request,
            "expectation_suite_name": EXPECTATION_SUITE_NAME
        }
    ]
}
checkpoint = SimpleCheckpoint(
    f"simple_checkpoint_{EXPECTATION_SUITE_NAME}",
    context,
    **checkpoint_config
)

In [18]:
checkpoint_result = checkpoint.run()

## Take a look at the response

In [19]:
checkpoint_result.keys()

dict_keys(['_run_id', '_run_results', '_checkpoint_config', '_success', '_validation_results', '_data_assets_validated', '_data_assets_validated_by_batch_id', '_validation_result_identifiers', '_expectation_suite_names', '_data_asset_names', '_validation_results_by_expectation_suite_name', '_validation_results_by_data_asset_name', '_batch_identifiers', '_statistics', '_validation_statistics', '_validation_results_by_validation_result_identifier'])

In [20]:
identifier = list(checkpoint_result["run_results"].keys())[0]

### One can access the overall success...

In [21]:
checkpoint_result["run_results"][identifier]["validation_result"]["success"]

True

### ... as well as detailed results.

In [22]:
# suite_saved = context.get_expectation_suite(expectation_suite_name=EXPECTATION_SUITE_NAME)

In [23]:
suite_saved

{
  "ge_cloud_id": null,
  "meta": {
    "great_expectations_version": "0.15.28"
  },
  "expectations": [
    {
      "meta": {},
      "expectation_type": "expect_column_max_to_be_between",
      "kwargs": {
        "column": "parking",
        "max_value": 4,
        "min_value": 3
      }
    },
    {
      "meta": {},
      "expectation_type": "expect_column_mean_to_be_between",
      "kwargs": {
        "column": "price",
        "max_value": 5000000,
        "min_value": 4000000
      }
    }
  ],
  "expectation_suite_name": "suite01",
  "data_asset_type": null
}

In [24]:
checkpoint_result["run_results"][identifier]["validation_result"]["results"]

[{
   "success": true,
   "meta": {},
   "exception_info": {
     "raised_exception": false,
     "exception_traceback": null,
     "exception_message": null
   },
   "result": {
     "observed_value": 3
   }
 },
 {
   "success": true,
   "meta": {},
   "exception_info": {
     "raised_exception": false,
     "exception_traceback": null,
     "exception_message": null
   },
   "result": {
     "observed_value": 4766729.247706422
   }
 }]
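For more than a handful of expectations, a short summary is easier to read than the raw list. The sketch below works on plain dictionaries that mirror the output above. The real result objects may need to be converted to dictionaries first, and `summarize_results` is a hypothetical helper.

```python
from typing import Any, Dict, List


def summarize_results(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Hypothetical helper: condense a list of per-expectation results."""
    failed_indices = [
        index for index, result in enumerate(results) if not result["success"]
    ]
    return {
        "evaluated": len(results),
        "successful": len(results) - len(failed_indices),
        "failed_indices": failed_indices,
        "observed_values": [
            result["result"].get("observed_value") for result in results
        ],
    }


# Values copied from the output above, reduced to the fields used here.
results = [
    {"success": True, "result": {"observed_value": 3}},
    {"success": True, "result": {"observed_value": 4766729.247706422}},
]

results_summary = summarize_results(results)
print(results_summary)
```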

### Actions


In [25]:
checkpoint_result["run_results"][identifier]["actions_results"]

{'store_validation_result': {'class': 'StoreValidationResultAction'},
 'store_evaluation_params': {'class': 'StoreEvaluationParametersAction'},
 'update_data_docs': {'local_site': 'file:///Users/carsten/Desktop/pydata_uk_2023_gx/great_expectations/uncommitted/data_docs/local_site/validations/suite01/__none__/20230603T133440.077339Z/9b1a7743ca605aec17d0eb64dfdcd7b9.html',
  'class': 'UpdateDataDocsAction'}}

## Last, let's take a look at the data docs...

In [26]:
# context.open_data_docs()

# Wrap-Up

* The context forms the entry point to GX. It contains the information about where the storage locations are hosted and which data sources are available.
* One or many expectations are combined in one suite.
* Expectations can be defined via pure Python code. They are stored as a JSON configuration in the background.
* The execution engine (pandas, Spark, or SQLAlchemy) executes the actual queries and produces the metrics.
* Checkpoints bundle validation with potential follow-up actions.
* One action can be the update of the so-called data docs, a convenient report that gives an overview of suites and validation runs.

# References
* [Official documentation](https://docs.greatexpectations.io/docs/)