
# How to use Great Expectations with AWS Glue Data Catalog

In this notebook, we will walk step by step through setting up Great Expectations with the AWS Glue Data Catalog. Before you can run this notebook, you must start an interactive session.

## 0. Set Up an AWS Glue Interactive Session

Before you start coding, be aware that there are several options for configuring an AWS Glue interactive session. The following code cell uses some of them. Refer to the [AWS Glue interactive session documentation](https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions-magics.html) for the complete list of available options.

Run the next code cell to set up the interactive session:

In [None]:
%additional_python_modules great_expectations
%glue_version 3.0
%number_of_workers 2

Once you run the cell, the interactive session will be configured to use Glue 3.0, allocate 2 DPUs and install the Great Expectations package. Feel free to add or modify this setup as you like. The interactive session will not be created until we execute some code. Let’s start the session by running the following code cell to import some standard Glue libraries:

In [None]:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)


## 1. Create a Data Context

For the sake of simplicity, we will create the Great Expectations Data Context in-memory using an Amazon S3 bucket as the store backend. The following example shows a Data Context configuration with a Spark datasource using the new AWS Glue Catalog Data Connector. Refer to the [following documentation](https://docs.greatexpectations.io/docs/guides/setup/configuring_data_contexts/how_to_instantiate_a_data_context_without_a_yml_file#2-pass-this-datacontextconfig-as-a-project_config-to-basedatacontext) for a better understanding.

Run the following cell to create the Data Context:


In [None]:
from great_expectations.data_context.types.base import DataContextConfig, DatasourceConfig, S3StoreBackendDefaults
from great_expectations.data_context import BaseDataContext

data_connectors = {
    "Runtime": {
        "class_name": "RuntimeDataConnector",
        "batch_identifiers": ["batch_id"]
    },
    "InferredGlue": {
        "class_name": "InferredAssetAWSGlueDataCatalogDataConnector"
    },
    "ConfiguredGlue": {
        "class_name": "ConfiguredAssetAWSGlueDataCatalogDataConnector",
        "assets": {
            "nyc_trip_data_asset": {
                "table_name": "tb_nyc_trip_data",
                "database_name": "db_ge_with_glue_demo"
            }
        }
    }
}

glue_data_source = DatasourceConfig(
    class_name="Datasource",
    execution_engine={
        "class_name": "SparkDFExecutionEngine",
        "force_reuse_spark_context": True,
    },
    data_connectors=data_connectors
)

metastore_backed = S3StoreBackendDefaults(default_bucket_name="great-expectations-glue-demo-<AWS_ACCOUNT_ID>-us-east-1")

data_context_config = DataContextConfig(
    datasources={"GlueDataSource": glue_data_source},
    store_backend_defaults=metastore_backed,
)

context = BaseDataContext(project_config=data_context_config)

Now that we have a DataContext, we can check for the available data assets that the AWS Glue Data Connector was able to get from the Glue Catalog. Run the following code to get a list of available data assets:

In [None]:
from pprint import pprint

pprint(context.get_available_data_asset_names())

If you have deployed the provided Terraform code into your AWS account, you should see that the InferredGlue connector returned the table *db_ge_with_glue_demo.tb_nyc_trip_data*. This table was created as part of the solution deployment. If you have more tables in the Glue Catalog, this connector will list all of them.
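
To check the result programmatically rather than by eye, the sketch below searches a nested dictionary for an asset name. It assumes the return value of `get_available_data_asset_names()` is shaped as datasource name, then connector name, then a list of asset names. The `sample` dictionary and the `find_asset` helper are hypothetical and only illustrate that shape.

```python
def find_asset(available, asset_name):
    """Return the (datasource, connector) pairs whose asset list contains asset_name."""
    return [
        (datasource, connector)
        for datasource, connectors in available.items()
        for connector, assets in connectors.items()
        if asset_name in assets
    ]


# Hypothetical sample; in the notebook, use context.get_available_data_asset_names() instead.
sample = {
    "GlueDataSource": {
        "Runtime": [],
        "InferredGlue": ["db_ge_with_glue_demo.tb_nyc_trip_data"],
        "ConfiguredGlue": ["nyc_trip_data_asset"],
    }
}

matches = find_asset(sample, "db_ge_with_glue_demo.tb_nyc_trip_data")
print(matches)
```

An empty list would suggest that the table was not deployed or that the Glue role cannot see it.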

For fine-grained control over which tables are available, use the **ConfiguredGlue** connector as an example. The Configured connector requires you to define each database table you want to validate. Be aware that the connector does not check whether the tables you define exist or whether you have permission to access them. Note: If you’re using AWS Lake Formation, check that the Glue IAM role has permission to access the tables.
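
Because the Configured connector does not check that a table exists, one option is to check up front with the Glue API. The sketch below is a minimal illustration, not part of Great Expectations. `glue_table_exists` is a hypothetical helper built around the boto3 Glue `get_table` call, and `FakeGlueClient` is a stand-in so the example runs without AWS credentials.

```python
def glue_table_exists(glue_client, database_name, table_name):
    """Return True if the Glue Data Catalog has the table, False if it is not found."""
    try:
        glue_client.get_table(DatabaseName=database_name, Name=table_name)
        return True
    except glue_client.exceptions.EntityNotFoundException:
        return False


# Hypothetical stand-in for boto3.client("glue"), used only to make the sketch runnable.
class _TableNotFound(Exception):
    pass


class FakeGlueClient:
    class exceptions:
        EntityNotFoundException = _TableNotFound

    def __init__(self, tables):
        self._tables = set(tables)

    def get_table(self, DatabaseName, Name):
        if (DatabaseName, Name) not in self._tables:
            raise _TableNotFound(f"{DatabaseName}.{Name}")
        return {"Table": {"DatabaseName": DatabaseName, "Name": Name}}


client = FakeGlueClient([("db_ge_with_glue_demo", "tb_nyc_trip_data")])

# Same shape as the "assets" section of the ConfiguredGlue connector above.
assets = {
    "nyc_trip_data_asset": {
        "table_name": "tb_nyc_trip_data",
        "database_name": "db_ge_with_glue_demo",
    }
}

missing = [
    name
    for name, cfg in assets.items()
    if not glue_table_exists(client, cfg["database_name"], cfg["table_name"])
]
print(missing)
```

In the interactive session you would pass `boto3.client("glue")` instead of the fake client. Permission problems surface as other exceptions and are deliberately not swallowed here.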

## 2. Create Expectations

Once the Data Context is set up, you can start creating Expectations for your tables. Run the following code cell to create an Expectation Suite and a validator that we can use to create the Expectations interactively.

In [None]:
from great_expectations.core.batch import BatchRequest

expectation_suite_name = "demo.taxi_trip.warning"

# Create Expectation Suite
suite = context.create_expectation_suite(
    expectation_suite_name=expectation_suite_name,
    overwrite_existing=True,
)

# Batch Request
batch_request = BatchRequest(
    datasource_name="GlueDataSource",
    data_connector_name="InferredGlue",
    data_asset_name="db_ge_with_glue_demo.tb_nyc_trip_data",
    data_connector_query={
        "batch_filter_parameters": {
            "year": "2022", 
            "month": "03"
        }
    }
)

# Validator
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name=expectation_suite_name,
)

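# Preview the first rows of the batch and print a column summary.
# df.info() prints directly and returns None, so it is not wrapped in pprint.
df = validator.head(n_rows=5, fetch_all=False)
df.info()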

Be aware that the table used in this demo is partitioned by year and month. Check the table in the Glue Data Catalog and you should see three partitions: 2022-01, 2022-02, and 2022-03. For each table partition, the connector creates a batch identifier that lets us filter a batch of data by partition values. The code above loads partition 2022-03. If you do not specify a partition, the connector gets the first partition available by default. Note: filtering by partition is only available if your table is partitioned; otherwise, the table data is loaded in a single batch.
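
To make the filtering idea concrete, here is a small plain-Python sketch of matching `batch_filter_parameters` against per-partition batch identifiers. It illustrates the concept only and is not the connector's actual implementation. `filter_batches` and the `partitions` list are hypothetical.

```python
# One batch identifier per table partition, mirroring the three partitions in the demo table.
partitions = [
    {"year": "2022", "month": "01"},
    {"year": "2022", "month": "02"},
    {"year": "2022", "month": "03"},
]


def filter_batches(batch_identifiers, batch_filter_parameters=None):
    """Keep the identifiers whose values equal every requested filter parameter."""
    params = batch_filter_parameters or {}
    return [
        ids
        for ids in batch_identifiers
        if all(ids.get(key) == value for key, value in params.items())
    ]


selected = filter_batches(partitions, {"year": "2022", "month": "03"})
print(selected)

# With no filter, every partition remains a candidate batch.
print(len(filter_batches(partitions)))
```

The sketch compares values as strings, matching the `"2022"` and `"03"` strings used in the batch request above.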

After running the code cell above, you can start defining the Expectations you want. Use the following code as an example of how to define Expectations and save them:


In [None]:
# Define the Expectations
validator.expect_table_row_count_to_be_between(min_value=1, max_value=None)
validator.expect_column_values_to_not_be_null(column="vendorid")
validator.expect_column_values_to_be_between(column="passenger_count", min_value=0, max_value=9)

# Save the Expectation Suite
validator.save_expectation_suite(discard_failed_expectations=False)

Verify that the Expectation Suite was saved to the S3 bucket we defined as the store backend.
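
One way to know where to look in the bucket is to derive the expected object key from the suite name. The sketch below rests on two assumptions: that the S3 store defaults use an `expectations` prefix, and that dots in the suite name become path separators. Neither is stated in this notebook, so confirm them against your bucket. `expected_suite_key` is a hypothetical helper.

```python
def expected_suite_key(expectation_suite_name, prefix="expectations"):
    """Build the S3 key where the suite JSON is assumed to be stored."""
    # Assumption: each dot-separated part of the suite name becomes a path segment.
    path = "/".join(expectation_suite_name.split("."))
    return f"{prefix}/{path}.json"


key = expected_suite_key("demo.taxi_trip.warning")
print(key)
```

You can then look for that key in the S3 console. Calling `context.list_expectation_suite_names()` in the session should also list the suite.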

## 3. Validate Data

With the Expectation Suite already created, we can create a Checkpoint to validate data. Run the following code cell to create, save, and run the Checkpoint to get the validation results:

In [None]:
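from ruamel.yaml import YAML

# Use a safe loader instance; the module-level yaml.load() is deprecated in recent ruamel.yaml releases
yaml = YAML(typ="safe")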

# Create Checkpoint
my_checkpoint_name = "demo.taxi_trip.checkpoint" 

yaml_config = f"""
name: {my_checkpoint_name}
config_version: 1.0
module_name: great_expectations.checkpoint
class_name: Checkpoint
run_name_template: "%Y%m%d-TaxiTrip-GlueInferred"
action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction
  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction
      site_names: []
validations:
  - batch_request:
      datasource_name: GlueDataSource
      data_connector_name: InferredGlue
      data_asset_name: db_ge_with_glue_demo.tb_nyc_trip_data
      data_connector_query:
        batch_filter_parameters:
          year: '2022'
          month: '03'
    expectation_suite_name: {expectation_suite_name}
"""

# Save Checkpoint
_ = context.add_checkpoint(**yaml.load(yaml_config))

# Run Checkpoint
r = context.run_checkpoint(checkpoint_name=my_checkpoint_name)
pprint(r)
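
The printed result can be long. As a rough sketch, the helper below condenses a dictionary-shaped result into the overall outcome plus per-validation statistics. It assumes the layout produced by the result's `to_json_dict()`: a top-level `success` flag and a `run_results` mapping whose entries hold a `validation_result` with `success` and `statistics`. The `sample_result` values are made up for illustration, and `summarize_run` is a hypothetical helper.

```python
def summarize_run(result_dict):
    """Condense a checkpoint result dictionary into a short summary."""
    validations = []
    for run_key, run in result_dict.get("run_results", {}).items():
        validation = run["validation_result"]
        stats = validation.get("statistics", {})
        validations.append(
            {
                "run": run_key,
                "success": validation.get("success"),
                "evaluated": stats.get("evaluated_expectations"),
                "failed": stats.get("unsuccessful_expectations"),
            }
        )
    return {"success": result_dict.get("success"), "validations": validations}


# Hypothetical sample; in the notebook, use r.to_json_dict() instead.
sample_result = {
    "success": True,
    "run_results": {
        "example-validation-id": {
            "validation_result": {
                "success": True,
                "statistics": {
                    "evaluated_expectations": 3,
                    "successful_expectations": 3,
                    "unsuccessful_expectations": 0,
                },
            }
        }
    },
}

summary = summarize_run(sample_result)
print(summary)
```

In a scheduled Glue job, you could raise an exception when the summary's `success` value is false so that the job run is marked as failed.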

🚀🚀 Congratulations! 🚀🚀 You’ve successfully connected Great Expectations with your data through AWS Glue Data Catalog.