# Great Expectations 

Run the following command to install GX in your Python environment:

In [1]:
%pip install great-expectations




Run the following command to import the GX library and the `Checkpoint` class that you'll use in the following steps:

In [2]:
import great_expectations as gx

from great_expectations.checkpoint import Checkpoint

## Set up GX

To avoid configuring external resources, you'll use your local filesystem for your Metadata Stores and Data Docs store.

Run the following code to create a Data Context with the default settings:

In [3]:
context = gx.get_context()

## Connect to your data

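Use a `connection_string` to connect to your PostgreSQL instance. To keep the connection secure, avoid embedding the password in the string; the commented-out line shows a form that substitutes it from the `MY_DB_PW` variable instead. For example: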

In [4]:
# POSTGRES_CONNECTION_STRING=postgresql://postgres:${MY_DB_PW}@localhost:5432/postgres
PG_CONNECTION_STRING = "postgresql+psycopg2://postgres:rinintha@localhost:5432/postgres"
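One way to keep the password out of the notebook is to assemble the connection string at runtime. The sketch below is a minimal example, assuming the password lives in an environment variable named `MY_DB_PW` (the name used in the comment above); the helper `build_pg_connection_string` and its defaults are hypothetical and not part of GX.

```python
import os
from urllib.parse import quote_plus


def build_pg_connection_string(
    user: str = "postgres",
    host: str = "localhost",
    port: int = 5432,
    database: str = "postgres",
    password_env_var: str = "MY_DB_PW",  # assumed variable name
) -> str:
    """Assemble a psycopg2-style SQLAlchemy URL without hard-coding the password."""
    password = os.environ.get(password_env_var)
    if password is None:
        raise RuntimeError(f"Set the {password_env_var} environment variable first.")
    # quote_plus escapes characters such as '@' or ':' that would otherwise break the URL.
    return f"postgresql+psycopg2://{user}:{quote_plus(password)}@{host}:{port}/{database}"
```

With the variable set, `PG_CONNECTION_STRING = build_pg_connection_string()` could replace the hard-coded assignment.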

Run the following command to create a Data Source to represent the data available in your PostgreSQL database:

In [5]:
pg_datasource = context.sources.add_sql(
    name="pg_datasource", connection_string=PG_CONNECTION_STRING
)

Run the following command to create a Data Asset to represent a discrete set of data:

In [6]:
pg_datasource.add_table_asset(
    name="car_assignment", table_name="car_assignment"
)

TableAsset(name='car_assignment', type='table', id=None, order_by=[], batch_metadata={}, splitter=None, table_name='car_assignment', schema_name=None)

Run the following command to build a Batch Request using the Data Asset you configured previously:

In [7]:
batch_request = pg_datasource.get_asset("car_assignment").build_batch_request()

## Create Expectations

You'll use a Validator to interact with your batch of data and generate an Expectation Suite.

Every time you evaluate an Expectation with `validator.expect_*`, it is immediately validated against your data. This instant feedback helps you identify unexpected data and removes the guesswork from data exploration. The Expectation configuration is stored in the Validator. When you are finished running the Expectations on the dataset, you can use `validator.save_expectation_suite()` to save all of your Expectation configurations into an Expectation Suite for later use in a Checkpoint.

Run the following command to create the suite and get a Validator:

In [8]:
expectation_suite_name = "hacktiv8_car_assignment"
context.add_or_update_expectation_suite(expectation_suite_name=expectation_suite_name)
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name=expectation_suite_name,
)

print(validator.head())

Calculating Metrics: 100%|██████████| 1/1 [00:00<00:00,  8.58it/s]

   car_id  symboling                   carname fueltype aspiration doornumber  \
0       1          3        alfa-romero giulia      gas        std        two   
1       2          3       alfa-romero stelvio      gas        std        two   
2       3          1  alfa-romero Quadrifoglio      gas        std        two   
3       4          2               audi 100 ls      gas        std       four   
4       5          2                audi 100ls      gas        std       four   

       carbody drivewheel enginelocation  wheelbase  ...  enginesize  \
0  convertible        rwd          front       88.6  ...         130   
1  convertible        rwd          front       88.6  ...         130   
2    hatchback        rwd          front       94.5  ...         152   
3        sedan        fwd          front       99.8  ...         109   
4        sedan        4wd          front       99.4  ...         136   

   fuelsystem  boreratio  stroke compressionratio horsepower  peakrpm citympg  \




## Expectations
An Expectation is a verifiable assertion about source data. Similar to assertions in traditional Python unit tests, Expectations provide a flexible, declarative language for describing expected behaviors. Unlike traditional unit tests, which describe the expected behavior of code given a specific input, Expectations apply to the input data itself. For example, you can define an Expectation that a column contains no null values. When GX runs that Expectation on your data, it generates a report that indicates whether a null value was found.

Expectations can be built directly from the domain knowledge of subject matter experts, interactively while introspecting a set of data, or through automated tools provided by GX.

You can see the gallery of Expectations here: https://docs.greatexpectations.io/en/latest/reference/glossary_of_expectations.html

In short, Expectations describe what we expect from the data.

For example, we can expect the values in a column:
- not to be null (empty)
- to be unique
- to be between x and y
- to match a regular expression
- and many more

Let's start with the simplest one: `expect_column_values_to_not_be_null` which checks if the column has null values.

In [9]:
validator.expect_column_values_to_not_be_null(column="car_id")

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

Calculating Metrics: 100%|██████████| 8/8 [00:00<00:00, 74.40it/s]


{
  "result": {
    "element_count": 205,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": []
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "success": true,
  "meta": {}
}
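To make the fields in this result easier to read, here is a plain-Python sketch of how such a summary could be computed for a list of values. It is only an illustration, not GX's implementation; the function name `summarize_not_null` and the `max_examples` cutoff are made up for this example.

```python
from typing import Any, Dict, List, Optional


def summarize_not_null(values: List[Optional[Any]], max_examples: int = 20) -> Dict[str, Any]:
    """Toy not-null check: every None counts as an unexpected value."""
    element_count = len(values)
    unexpected = [v for v in values if v is None]
    unexpected_count = len(unexpected)
    unexpected_percent = 100.0 * unexpected_count / element_count if element_count else 0.0
    return {
        "success": unexpected_count == 0,
        "result": {
            "element_count": element_count,
            "unexpected_count": unexpected_count,
            "unexpected_percent": unexpected_percent,
            # Only a sample of the offending values is kept.
            "partial_unexpected_list": unexpected[:max_examples],
        },
    }


# 205 non-null ids, mirroring the element_count reported above.
clean = summarize_not_null(list(range(1, 206)))
# A made-up column where half of the values are missing.
dirty = summarize_not_null([1, None, 3, None])
```

Here `clean["success"]` is `True` with an `unexpected_count` of 0, while `dirty` fails with an `unexpected_percent` of 50.0.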

We can also use `expect_column_values_to_be_unique` to check that every value in the column is unique, or `expect_column_values_to_be_between` to check that the values fall within a range.

In [10]:
validator.expect_column_values_to_be_between(
    column="car_id", min_value=0, max_value=300
)

Calculating Metrics:  18%|█▊        | 2/11 [00:00<00:00, 668.31it/s]

Calculating Metrics: 100%|██████████| 11/11 [00:00<00:00, 167.83it/s]


{
  "result": {
    "element_count": 205,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "success": true,
  "meta": {}
}
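This result adds `missing_count` and two variants of the unexpected percentage. The sketch below shows how we understand these fields to relate, assuming that missing (null) values are skipped by the range check, that `unexpected_percent_total` is measured against all rows, and that `unexpected_percent_nonmissing` is measured against non-null rows only. It is a plain-Python illustration with made-up values, not GX code.

```python
from typing import List, Optional


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage, or 0.0 when the denominator is zero."""
    return 100.0 * part / whole if whole else 0.0


def summarize_between(values: List[Optional[float]], min_value: float, max_value: float) -> dict:
    """Toy range check that reports missing and unexpected values separately."""
    element_count = len(values)
    non_missing = [v for v in values if v is not None]
    missing_count = element_count - len(non_missing)
    # Only non-null values are compared against the range.
    unexpected_count = sum(1 for v in non_missing if not (min_value <= v <= max_value))
    return {
        "element_count": element_count,
        "unexpected_count": unexpected_count,
        "missing_count": missing_count,
        "missing_percent": percent(missing_count, element_count),
        "unexpected_percent_total": percent(unexpected_count, element_count),
        "unexpected_percent_nonmissing": percent(unexpected_count, len(non_missing)),
    }


# Hypothetical column: one null and one out-of-range value among four rows.
summary = summarize_between([1, 150, None, 999], min_value=0, max_value=300)
```

For this made-up column, one of four rows is unexpected (25% of the total) but one of three non-null rows is unexpected (about 33.3%), which is why the two percentages can differ when data is missing.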

Let’s take a closer look at the primary key column. First, we check that the column exists in the dataset by passing its name, `car_id`, to `expect_column_to_exist`. The result shows whether the test succeeds or fails.

In [11]:
validator.expect_column_to_exist(column="car_id")

Calculating Metrics: 100%|██████████| 2/2 [00:00<00:00, 249.34it/s]


{
  "result": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "success": true,
  "meta": {}
}

Next, we find out if our primary key is unique. This Expectation returns a little more information about the test: the status (a success), the total record count, and the number and percentage of missing values, both of which are zero. That is a good sign; our primary key column looks in good shape. The null test we ran earlier on `car_id` was a success as well, so our source system is producing some good data.

Let’s wrap it up with a data type test. Here we switch to the `horsepower` column: we expect it to be of type integer, so we put this assumption to the test.

In [12]:
validator.expect_column_values_to_be_unique('car_id')

Calculating Metrics:  60%|██████    | 6/10 [00:00<00:00, 166.57it/s]

Calculating Metrics: 100%|██████████| 10/10 [00:00<00:00, 93.87it/s] 


{
  "result": {
    "element_count": 205,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "success": true,
  "meta": {}
}

In [13]:
validator.expect_column_values_to_be_in_type_list("horsepower", ["INTEGER"])

Calculating Metrics: 100%|██████████| 1/1 [00:00<00:00, 200.29it/s] 


{
  "result": {
    "observed_value": "INTEGER"
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "success": true,
  "meta": {}
}

Let’s move on to other columns in our dataset. We can check whether a column contains only values from a given set. This test suits columns with few distinct values, such as `carbody` or `drivewheel`. For example, the `enginelocation` column groups the cars into two categories. We pass the column name and the list of expected values to `expect_column_values_to_be_in_set` to check that `enginelocation` contains only these two values. This assumption is correct.

In [14]:
validator.expect_column_values_to_be_in_set("enginelocation", ['front', 'rear'])

Calculating Metrics: 100%|██████████| 11/11 [00:00<00:00, 175.91it/s]

Calculating Metrics: 100%|██████████| 11/11 [00:00<00:00, 173.12it/s]


{
  "result": {
    "element_count": 205,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "success": true,
  "meta": {}
}

Next, we test an aggregate of a numeric column instead of each individual value. `expect_column_max_to_be_between` checks that the column's maximum falls within a range; here we check that the largest `peakrpm` value is between 1 and 7000.

In [15]:
validator.expect_column_max_to_be_between("peakrpm", 1, 7000)

Calculating Metrics: 100%|██████████| 6/6 [00:00<00:00, 203.27it/s]


{
  "result": {
    "observed_value": 6600
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "success": true,
  "meta": {}
}

We can also check that the mean of a column falls within a range.

In [16]:
validator.expect_column_mean_to_be_between("price", 100, 14000)

Calculating Metrics: 100%|██████████| 6/6 [00:00<00:00, 97.52it/s] 


{
  "result": {
    "observed_value": 13276.710575457317
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "success": true,
  "meta": {}
}

For the next step, let's look at a failing Expectation. We check whether the mean of the `horsepower` column is between 50 and 100. The observed mean is about 104.12, so the test fails and the result reports `"success": false`.

In [17]:
validator.expect_column_mean_to_be_between("horsepower", 50, 100)

Calculating Metrics:  83%|████████▎ | 5/6 [00:00<00:00, 121.95it/s]
Using lossy conversion for decimal 104.1170731707317073 to float object to support serialization.
Calculating Metrics: 100%|██████████| 6/6 [00:01<00:00,  4.15it/s] 


{
  "result": {
    "observed_value": 104.1170731707317
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "success": false,
  "meta": {}
}
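The logic behind a mean-in-range check is small enough to sketch in plain Python. The example below is an illustration only, not GX's implementation, and the sample values are made up rather than taken from the `car_assignment` table.

```python
from typing import Dict, List


def mean_is_between(values: List[float], min_value: float, max_value: float) -> Dict[str, object]:
    """Toy aggregate check: compare the column mean against an inclusive range."""
    observed_value = sum(values) / len(values)
    return {
        "success": min_value <= observed_value <= max_value,
        "result": {"observed_value": observed_value},
    }


# Hypothetical horsepower values with a mean of 105.0.
sample_horsepower = [90, 100, 110, 120]

too_narrow = mean_is_between(sample_horsepower, 50, 100)   # mean lies above the range
wide_enough = mean_is_between(sample_horsepower, 50, 150)  # mean lies inside the range
```

As in the real result above, a failing check still returns the observed value, which makes it easy to see how far the data is from the expected range.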

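Run the following command to save your Expectation Suite (all the unique Expectation Configurations from each run of `validator.expect_*`) to your Expectation Store. Passing `discard_failed_expectations=False` keeps the failing `horsepower` Expectation in the saved suite: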

In [18]:
validator.save_expectation_suite(discard_failed_expectations=False)

## Validate your data

You'll create and store a Checkpoint for your batch, which you can use to validate and run post-validation actions.

Run the following command to create the Checkpoint configuration that uses your Data Context:

In [19]:
my_checkpoint_name = "hacktiv8_cars_checkpoint"

checkpoint = Checkpoint(
    name=my_checkpoint_name,
    run_name_template="%Y%m%d-%H%M%S-hacktiv8_cars_checkpoint",
    data_context=context,
    batch_request=batch_request,
    expectation_suite_name=expectation_suite_name,
    action_list=[
        {
            "name": "store_validation_result",
            "action": {"class_name": "StoreValidationResultAction"},
        },
        {"name": "update_data_docs", "action": {"class_name": "UpdateDataDocsAction"}},
    ],
)
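The `run_name_template` above contains `strftime`-style directives. Assuming GX fills them in from the time of the run, the resulting run name can be previewed with the standard library alone. The timestamp below is hypothetical.

```python
from datetime import datetime

run_name_template = "%Y%m%d-%H%M%S-hacktiv8_cars_checkpoint"

# A made-up run time, used only to preview the expanded name.
example_run_time = datetime(2024, 1, 15, 9, 30, 0)
run_name = example_run_time.strftime(run_name_template)

print(run_name)  # 20240115-093000-hacktiv8_cars_checkpoint
```

Putting the timestamp first means run names sort chronologically, which can make individual runs easier to find later.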

Run the following command to save the Checkpoint:

In [20]:
context.add_or_update_checkpoint(checkpoint=checkpoint)

{
  "action_list": [
    {
      "name": "store_validation_result",
      "action": {
        "class_name": "StoreValidationResultAction"
      }
    },
    {
      "name": "update_data_docs",
      "action": {
        "class_name": "UpdateDataDocsAction"
      }
    }
  ],
  "batch_request": {
    "datasource_name": "pg_datasource",
    "data_asset_name": "car_assignment",
    "options": {}
  },
  "class_name": "Checkpoint",
  "config_version": 1.0,
  "evaluation_parameters": {},
  "expectation_suite_name": "hacktiv8_car_assignment",
  "module_name": "great_expectations.checkpoint",
  "name": "hacktiv8_cars_checkpoint",
  "profilers": [],
  "run_name_template": "%Y%m%d-%H%M%S-my-run-name-template",
  "runtime_configuration": {},
  "validations": []
}

Run the following command to run the Checkpoint. Your Batch Request (your data) and your Expectation Suite (your tests) were already set in the Checkpoint configuration, so `run()` needs no arguments here:

In [21]:
checkpoint_result = checkpoint.run()

Calculating Metrics:  33%|███▎      | 12/36 [00:00<00:00, 39.41it/s]
Using lossy conversion for decimal 104.1170731707317073 to float object to support serialization.
Calculating Metrics: 100%|██████████| 36/36 [00:00<00:00, 113.01it/s]
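In a pipeline you will usually want to act on the outcome of the run instead of only viewing it. The sketch below assumes the object returned by `checkpoint.run()` exposes a boolean `success` attribute; the helper `require_success` is hypothetical, and the stand-in objects exist only so the example runs without a database.

```python
from types import SimpleNamespace


def require_success(result) -> None:
    """Raise if a checkpoint-style result reports that validation failed."""
    if not result.success:
        raise ValueError("Validation failed; review the Data Docs for details.")


# Stand-ins for real results, used here instead of a live checkpoint run.
passing_result = SimpleNamespace(success=True)
failing_result = SimpleNamespace(success=False)

require_success(passing_result)  # returns quietly

try:
    require_success(failing_result)
except ValueError as error:
    print(error)
```

Because the saved suite still contains the failing `horsepower` Expectation, we would expect `require_success(checkpoint_result)` to raise for this particular run.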


Your Checkpoint configuration includes the store_validation_result and update_data_docs actions. The store_validation_result action saves your validation results from the Checkpoint run and allows the results to be persisted for future use. The update_data_docs action builds Data Docs files for the validations run in the Checkpoint.

## Build and view Data Docs

Your Checkpoint contained an UpdateDataDocsAction, so your Data Docs have already been built from the validation you ran and your Data Docs store contains a new rendered validation result.

Run the following command to open your Data Docs and review the results of your Checkpoint run:

In [22]:
context.open_data_docs()