# Great Expectations (GX) - Data Integrity and Profiling


## Overview

- Set up a GX environment
- Connect to data
- Define Expectations
- Run Validations
- Trigger Actions and Checkpointing

- [Great Expectations - Web](https://docs.greatexpectations.io/docs/home)
- [Great Expectations - Git](https://github.com/great-expectations/great_expectations)

## Installation

In [None]:
!pip install great-expectations
!python -m pip install 'great_expectations[spark]'

Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.

[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip available: [0m[31;49m22.3.1[0m[39;49m -> [0m[32;49m24.3.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


## Imports and Setup

In [None]:
import warnings
warnings.filterwarnings("ignore")

In [None]:
import great_expectations as gx
from great_expectations.checkpoint import Checkpoint
from great_expectations.core.expectation_suite import ExpectationSuite
from pyspark.sql import SparkSession

In [None]:
# `{table_name}` is a placeholder: substitute the table to validate before running this cell.
df = spark.sql("SELECT * FROM {table_name} WHERE current_date = '2024-12-01'")


## Create Data Context

In [None]:
context = gx.get_context(mode="file", project_root_dir="./new_context_folder")
print(context)


{
  "checkpoint_store_name": "checkpoint_store",
  "config_variables_file_path": "uncommitted/config_variables.yml",
  "config_version": 4.0,
  "data_context_id": "dccd67e1-b148-474c-96f2-a8166d17a174",
  "data_docs_sites": {
    "local_site": {
      "class_name": "SiteBuilder",
      "show_how_to_buttons": true,
      "store_backend": {
        "class_name": "TupleFilesystemStoreBackend",
        "base_directory": "uncommitted/data_docs/local_site/"
      },
      "site_index_builder": {
        "class_name": "DefaultSiteIndexBuilder"
      }
    }
  },
  "expectations_store_name": "expectations_store",
  "fluent_datasources": {},
  "plugins_directory": "plugins/",
  "stores": {
    "expectations_store": {
      "class_name": "ExpectationsStore",
      "store_backend": {
        "class_name": "TupleFilesystemStoreBackend",
        "base_directory": "expectations/"
      }
    },
    "validation_results_store": {
      "class_name": "ValidationResultsStore",
      "store_backend": {
 

## Connect to Data

**Create Data Source**

In [None]:
data_source_name = "churn_users"
data_source = context.data_sources.add_spark(name=data_source_name)

**Create Data Asset**

In [None]:
data_asset_name = "churn_users_gp"
data_asset = data_source.add_dataframe_asset(name=data_asset_name)

**Batch Definition**

In [None]:
batch_definition_name = "churn_users_gp_batch"

batch_definition = data_asset.add_batch_definition_whole_dataframe(
    batch_definition_name
)

In [None]:
batch_parameters = {"dataframe": df}

## Scenarios


**1. `user_id` should not be null**

**2. `user_id` should not contain duplicates**

**3. `current_day_order_count` may be null only intermittently**

**4. Total number of rows should be between 20M and 21M**

**5. lifetime_gsv should have data type of `FloatType`**

**6. Minimum lifetime_gsv should be 0**
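Scenario 3 tolerates occasional nulls instead of forbidding them, and the expectation cell further down expresses that tolerance with `mostly=0.999`. As a rough sketch of the arithmetic, assuming a check passes when the share of non-null rows is at least `mostly` (the helpers `passes_mostly` and `max_unexpected` are hypothetical and not part of GX), the tolerance for the 20,832,782 rows reported by `df.count()` works out as follows:

```python
from fractions import Fraction


def passes_mostly(total: int, unexpected: int, mostly: float) -> bool:
    """Return True when the share of expected rows is at least `mostly`."""
    if total == 0:
        return True  # assumption: an empty batch has nothing to violate
    # Fractions avoid floating-point rounding right at the threshold.
    return Fraction(total - unexpected, total) >= Fraction(str(mostly))


def max_unexpected(total: int, mostly: float) -> int:
    """Largest number of unexpected rows that still passes the check."""
    allowed = total * (1 - Fraction(str(mostly)))
    return allowed.numerator // allowed.denominator


total_rows = 20_832_782  # row count reported by df.count()
limit = max_unexpected(total_rows, 0.999)
print(limit)                                        # 20832
print(passes_mostly(total_rows, limit, 0.999))      # True
print(passes_mostly(total_rows, limit + 1, 0.999))  # False
```

Under that assumption, roughly 20.8 thousand null values of `current_day_order_count` would still pass, and one more would fail the check.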

**Create an Expectation Suite**

In [None]:
expectation_suite = context.suites.add(
    gx.core.expectation_suite.ExpectationSuite(name="users gp data expectations for user_id_and_current_day_order_expectations")
)

**Define expectations**

In [None]:
df.count()

20832782

In [None]:
# Missingness
user_id_not_null_expectation = gx.expectations.ExpectColumnValuesToNotBeNull(column="user_id")
# `mostly=0.999` tolerates nulls in up to 0.1% of rows
order_count_null_expectation = gx.expectations.ExpectColumnValuesToNotBeNull(column="current_day_order_count", mostly=0.999)

# Uniqueness
user_id_not_duplicated_expectation = gx.expectations.ExpectColumnValuesToBeUnique(column="user_id")

# Volume: use ** for powers; ^ is bitwise XOR in Python, so 20*10^6 evaluates to 206
total_rows_expectation = gx.expectations.ExpectTableRowCountToBeBetween(min_value=20 * 10**6, max_value=21 * 10**6)

# Schema validation: the Spark backend takes Spark type names such as "FloatType" (see scenario 5)
gsv_schema_expectation = gx.expectations.ExpectColumnValuesToBeOfType(column="lifetime_gsv", type_="FloatType")
col_presence_expectation = gx.expectations.ExpectColumnToExist(column="user_id")

# Register each expectation with the suite
expectation_suite.add_expectation(user_id_not_null_expectation)
expectation_suite.add_expectation(order_count_null_expectation)
expectation_suite.add_expectation(user_id_not_duplicated_expectation)
expectation_suite.add_expectation(total_rows_expectation)
expectation_suite.add_expectation(gsv_schema_expectation)
expectation_suite.add_expectation(col_presence_expectation)


ExpectColumnToExist(id='36b1d13e-b7a3-4e36-92e4-4c60c6be14ed', meta=None, notes=None, result_format=<ResultFormat.BASIC: 'BASIC'>, description=None, catch_exceptions=False, rendered_content=None, windows=None, batch_id=None, column='user_id', column_index=None)
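The row-count bounds deserve a note on Python operator semantics. `^` is bitwise XOR, not exponentiation, and it binds more loosely than `*`, so an expression such as `20*10^6` evaluates to 206. That is the value the stored suite reports as `min_value` (with 212 as `max_value`) in the Validation Definition printed later in this notebook. The snippet below is plain Python and only demonstrates the operator behavior:

```python
# `^` is bitwise XOR and binds more loosely than `*`,
# so 20*10^6 is parsed as (20*10) ^ 6.
xor_min = 20 * 10 ^ 6
xor_max = 21 * 10 ^ 6
print(xor_min, xor_max)  # 206 212

# `**` is exponentiation and binds more tightly than `*`.
pow_min = 20 * 10**6
pow_max = 21 * 10**6
print(pow_min, pow_max)  # 20000000 21000000

row_count = 20_832_782  # value reported by df.count()
print(xor_min <= row_count <= xor_max)  # False
print(pow_min <= row_count <= pow_max)  # True
```

With XOR-derived bounds, the row-count expectation cannot pass for a table of this size, which is consistent with the overall `"success": false` in the recorded validation output.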

**Validation**

In [None]:
batch = batch_definition.get_batch(batch_parameters=batch_parameters)
batch

<great_expectations.datasource.fluent.interfaces.Batch at 0xffff1f7d1c30>

In [None]:

# validation_result = batch.validate(expectation_suite)

# print(f"Expectation Suite passed: {validation_result['success']}\n")

# for result in validation_result["results"]:
#     expectation_type = result["expectation_config"]["type"]
#     # col_name = result["expectation_config"]["kwargs"]["column"]
#     expectation_passed = result["success"]
#     print(f"{expectation_type} : {expectation_passed}")

In [None]:
definition_name = "users gp data validation"
validation_definition = gx.ValidationDefinition(
    data=batch_definition, suite=expectation_suite, name=definition_name
)
validation_definition = context.validation_definitions.add(validation_definition)

In [None]:
validation_definition

ValidationDefinition(name='users gp data validation', data=BatchDefinition(id=UUID('07cceaea-84a3-4f66-b87e-7bb11273984b'), name='churn_users_gp_batch', partitioner=None), suite={
  "name": "users gp data expectations for user_id_and_current_day_order_expectations",
  "id": "0ecfc93d-37b5-421b-889c-c418a91c3c43",
  "expectations": [
    {
      "type": "expect_column_values_to_not_be_null",
      "kwargs": {
        "column": "user_id"
      },
      "meta": {},
      "id": "bec2c2ac-406d-4b0c-bce1-b9d3af791ba2"
    },
    {
      "type": "expect_column_values_to_not_be_null",
      "kwargs": {
        "column": "current_day_order_count",
        "mostly": 0.999
      },
      "meta": {},
      "id": "59ab5584-8fbf-42f0-a215-45670347744f"
    },
    {
      "type": "expect_table_row_count_to_be_between",
      "kwargs": {
        "min_value": 206,
        "max_value": 212
      },
      "meta": {},
      "id": "2a680a2a-6a38-4fc7-b108-a2c9c2b1b43b"
    },
    {
      "type": "expect_co

In [None]:
validation_definition = context.validation_definitions.get(definition_name)
validation_results = validation_definition.run(batch_parameters=batch_parameters)
print(validation_results)


{
  "success": false,
  "results": [
    {
      "success": true,
      "expectation_config": {
        "type": "expect_column_values_to_not_be_null",
        "kwargs": {
          "batch_id": "churn_users-churn_users_gp",
          "column": "user_id"
        },
        "meta": {},
        "id": "bec2c2ac-406d-4b0c-bce1-b9d3af791ba2"
      },
      "result": {
        "element_count": 20832782,
        "unexpected_count": 0,
        "unexpected_percent": 0.0,
        "partial_unexpected_list": [],
        "partial_unexpected_counts": []
      },
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_traceback": null,
        "exception_message": null
      }
    },
    {
      "success": true,
      "expectation_config": {
        "type": "expect_column_to_exist",
        "kwargs": {
          "batch_id": "churn_users-churn_users_gp",
          "column": "user_id"
        },
        "meta": {},
        "id": "36b1d13e-b7a3-4e36-92e4-4c60c6be1

In [None]:
validation_results["results"][0]["expectation_config"]


{"type": "expect_column_values_to_not_be_null", "kwargs": {"batch_id": "churn_users-churn_users_gp", "column": "user_id"}, "meta": {}, "id": "bec2c2ac-406d-4b0c-bce1-b9d3af791ba2"}
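The printed result is verbose, so a compact per-expectation summary is often easier to read. The sketch below works on a plain dictionary shaped like the output above; `summarize` is a hypothetical helper, and `sample` is hand-written for illustration (the outcome of the row-count entry is an assumption, since that part of the recorded output is truncated).

```python
from typing import Any


def summarize(validation_result: dict[str, Any]) -> list[tuple[str, str, bool]]:
    """Collect (expectation type, column, success) for each result."""
    rows = []
    for result in validation_result["results"]:
        config = result["expectation_config"]
        # Table-level expectations carry no "column" kwarg.
        column = config["kwargs"].get("column", "<table>")
        rows.append((config["type"], column, result["success"]))
    return rows


# Hand-written sample mirroring the structure of the printed results.
# The third entry's outcome is assumed, not copied from the notebook output.
sample = {
    "success": False,
    "results": [
        {
            "success": True,
            "expectation_config": {
                "type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "user_id"},
            },
        },
        {
            "success": True,
            "expectation_config": {
                "type": "expect_column_to_exist",
                "kwargs": {"column": "user_id"},
            },
        },
        {
            "success": False,
            "expectation_config": {
                "type": "expect_table_row_count_to_be_between",
                "kwargs": {"min_value": 206, "max_value": 212},
            },
        },
    ],
}

for expectation_type, column, passed in summarize(sample):
    print(f"{expectation_type} [{column}]: {'PASS' if passed else 'FAIL'}")
```

Because the notebook already indexes `validation_results` with string keys, the same loop should apply to the real result, though that has not been run here.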

## Action Trigger

In [None]:
from great_expectations.checkpoint import (
    SlackNotificationAction,
    UpdateDataDocsAction,
)

In [None]:
validation_definitions = [
    context.validation_definitions.get(definition_name)
]
validation_definitions

[ValidationDefinition(name='users gp data validation', data=BatchDefinition(id=UUID('07cceaea-84a3-4f66-b87e-7bb11273984b'), name='churn_users_gp_batch', partitioner=None), suite={
   "name": "users gp data expectations for user_id_and_current_day_order_expectations",
   "id": "0ecfc93d-37b5-421b-889c-c418a91c3c43",
   "expectations": [
     {
       "type": "expect_column_values_to_not_be_null",
       "kwargs": {
         "column": "user_id"
       },
       "meta": {},
       "id": "bec2c2ac-406d-4b0c-bce1-b9d3af791ba2"
     },
     {
       "type": "expect_column_values_to_not_be_null",
       "kwargs": {
         "column": "current_day_order_count",
         "mostly": 0.999
       },
       "meta": {},
       "id": "59ab5584-8fbf-42f0-a215-45670347744f"
     },
     {
       "type": "expect_table_row_count_to_be_between",
       "kwargs": {
         "min_value": 206,
         "max_value": 212
       },
       "meta": {},
       "id": "2a680a2a-6a38-4fc7-b108-a2c9c2b1b43b"
     },


In [None]:
action_list = [
    SlackNotificationAction(
        name="send_slack_notification_on_failed_expectations",
        slack_token="${validation_notification_slack_webhook}",
        slack_channel="${validation_notification_slack_channel}",
        notify_on="failure",
        show_failed_expectations=True,
    ),
    # This Action updates the Data Docs static website with the Validation
    #   Results after the Checkpoint is run.
    UpdateDataDocsAction(
        name="update_all_data_docs",
    ),
]
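The `${...}` strings passed to `SlackNotificationAction` are placeholders, not literal credentials. GX resolves them from environment variables or from the `config_variables.yml` file referenced in the Data Context configuration. The snippet below only mimics that substitution with the standard library's `string.Template`; `resolve` and the values in `config_variables` are hypothetical stand-ins, not GX code or real settings.

```python
from string import Template

# Hypothetical values; real ones would come from config_variables.yml
# or from environment variables, never from the notebook itself.
config_variables = {
    "validation_notification_slack_webhook": "https://hooks.example.invalid/placeholder",
    "validation_notification_slack_channel": "#data-quality",
}


def resolve(value: str, variables: dict[str, str]) -> str:
    """Replace ${name} placeholders with values from `variables`."""
    # substitute() raises KeyError when a placeholder has no value,
    # which surfaces a missing setting early.
    return Template(value).substitute(variables)


print(resolve("${validation_notification_slack_channel}", config_variables))
```

Keeping the placeholder in the notebook and the value in configuration means the credential never appears in the saved cell or its output.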

## Checkpoints
A Checkpoint executes one or more Validation Definitions and then performs a set of Actions based on the Validation Results.

In [None]:
checkpoint_name = "user_gp_data_checkpoint"
checkpoint = gx.Checkpoint(
    name=checkpoint_name,
    validation_definitions=validation_definitions,
    actions=action_list,
    result_format={"result_format": "COMPLETE"},
)

Add the Checkpoint to your Data Context.

In [None]:
context.checkpoints.add(checkpoint)

Run the Checkpoint.

- Running a Checkpoint validates all of its Validation Definitions, executes its Actions based on the results those Validation Definitions return, and finally returns the Validation Results.

In [None]:
validation_results = checkpoint.run(
    batch_parameters=batch_parameters
)
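The run semantics described above (validate, act on the results, return the results) can be modeled with a small toy in plain Python. This is a sketch of the control flow only, not GX's implementation: `ToyAction` and `run_toy_checkpoint` are hypothetical names, and treating the Data Docs update as an action that always fires is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToyAction:
    """Stand-in for a Checkpoint Action (hypothetical, not a GX class)."""

    name: str
    notify_on: str  # "all", "success", or "failure"


def run_toy_checkpoint(
    validations: dict[str, Callable[[], bool]],
    actions: list[ToyAction],
) -> dict:
    # Step 1: run every validation and record whether it passed.
    results = {name: check() for name, check in validations.items()}
    success = all(results.values())

    # Step 2: fire the actions whose trigger matches the overall outcome.
    fired = [
        action.name
        for action in actions
        if action.notify_on == "all"
        or (action.notify_on == "success" and success)
        or (action.notify_on == "failure" and not success)
    ]

    # Step 3: hand the results back to the caller.
    return {"success": success, "results": results, "actions_fired": fired}


outcome = run_toy_checkpoint(
    validations={"users gp data validation": lambda: False},
    actions=[
        ToyAction("send_slack_notification_on_failed_expectations", "failure"),
        ToyAction("update_all_data_docs", "all"),
    ],
)
print(outcome["success"], outcome["actions_fired"])
```

In this model a failed validation triggers both the Slack notification (configured with `notify_on="failure"`) and the Data Docs update, while a passing run would trigger only the latter.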