# Great Expectations tutorial

### What is Great Expectations?

Great Expectations is a tool for testing batch data. It also generates reports that document the data; this documentation is derived from the test definitions.

### Why you should use it: data quality

You should test your data for two main reasons:
- better data quality leads to better predictions and insights relying on the data,
- it's an additional way to test data pipelines.

### Key concepts and terminology

A *Datasource* is a source of data to be tested. For example, a SQL database.

A *data asset* is a subset of the data in a datasource whose records share the same structure. For example, a table in a SQL database.

An *expectation* is the definition of a test on a data asset.

An *expectation suite* is a set of expectations on a data asset.

### Install and set up great_expectations

To install Great Expectations, run `pip install great_expectations` in your terminal. It is good practice to install packages with `pip` inside a virtual environment.

To initialize Great Expectations for a project, run `great_expectations init` in your terminal in the project's directory and follow the instructions.

For more information on setup, have a look at the getting-started guide at https://docs.greatexpectations.io/en/latest/guides/tutorials/getting_started.html, or refer to the rest of the official documentation.

### How it works

Great Expectations stores everything related to a project in the `great_expectations` subdirectory in the project's directory.


# Getting started

Let's jump into it then!

In [55]:
import great_expectations as ge

TODO: explain what a context is

When we create a new `DataContext` object, great_expectations reads the configuration we have already set up for you. Don't worry about that for now; we'll get back to it later.

In [56]:
context = ge.data_context.DataContext()

Now that we have our data context ready, we can add an expectation suite. Think of this like a test suite, but for your data instead of for your code.

In [24]:
suite = context.create_expectation_suite(
    "check_avocado_data",
    overwrite_existing=True
)

Next, we load our dataset, `avocado.csv`, from our data context. Again, don't worry about this too much; great_expectations usually handles this for you.

In [25]:
batch_kwargs = {
    'datasource': 'data_dir',
    'path': 'data/avocado.csv',
}
batch = context.get_batch(batch_kwargs, suite)

Alright, that's it for setup! Now let's have a look at the data we are working with here.

In [60]:
batch.head()

| Unnamed: 0.1 | Unnamed: 0 | Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2015-12-27 | 1.33 | 64236.62 | 1036.74 | 54454.85 | 48.16 | 8696.87 | 8603.62 | 93.25 | 0.0 | conventional | 2015 | Albany |
| 1 | 1 | 2015-12-20 | 1.35 | 54876.98 | 674.28 | 44638.81 | 58.33 | 9505.56 | 9408.07 | 97.49 | 0.0 | conventional | 2015 | Albany |
| 2 | 2 | 2015-12-13 | 0.93 | 118220.22 | 794.7 | 109149.67 | 130.5 | 8145.35 | 8042.21 | 103.14 | 0.0 | conventional | 2015 | Albany |
| 3 | 3 | 2015-12-06 | 1.08 | 78992.15 | 1132.0 | 71976.41 | 72.58 | 5811.16 | 5677.4 | 133.76 | 0.0 | conventional | 2015 | Albany |
| 4 | 4 | 2015-11-29 | 1.28 | 51039.6 | 941.48 | 43838.39 | 75.78 | 6183.95 | 5986.26 | 197.69 | 0.0 | conventional | 2015 | Albany |


Some documentation that came with the data:
- Date - the date of the observation
- AveragePrice - the average price of a single avocado
- type - agriculture type: conventional or organic
- region - the city or region of the observation
- Total Volume - total number of avocados sold
- 4046 - total number of avocados with PLU 4046 sold (small Hass)
- 4225 - total number of avocados with PLU 4225 sold (large Hass)
- 4770 - total number of avocados with PLU 4770 sold (extra large Hass)
 
These descriptions certainly help us understand the dataset a bit better, but they don't provide many guarantees. When consuming this dataset, what assumptions can we make? Will the `region` field always be specified? Will the `Date` field always be in the same format? Those sales counts, are they supposed to add up?
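To make that last question concrete, here is a small hand-written check in plain Python (no great_expectations involved) on the first row of the output above. The helper names and the rounding tolerance are our own choices for this sketch, and the idea that `Total Volume` should equal the three PLU counts plus `Total Bags` is an assumption that the dataset documentation does not confirm.

```python
import math

# First row of the `batch.head()` output above, keyed by column name.
row = {
    "Total Volume": 64236.62,
    "4046": 1036.74,
    "4225": 54454.85,
    "4770": 48.16,
    "Total Bags": 8696.87,
    "Small Bags": 8603.62,
    "Large Bags": 93.25,
    "XLarge Bags": 0.0,
}


def bags_add_up(row, tolerance=0.01):
    """Return True if the three bag sizes sum to `Total Bags`, within rounding."""
    parts = row["Small Bags"] + row["Large Bags"] + row["XLarge Bags"]
    return math.isclose(parts, row["Total Bags"], abs_tol=tolerance)


def volume_adds_up(row, tolerance=0.01):
    """Return True if PLU counts plus `Total Bags` sum to `Total Volume` (assumed relation)."""
    parts = row["4046"] + row["4225"] + row["4770"] + row["Total Bags"]
    return math.isclose(parts, row["Total Volume"], abs_tol=tolerance)


print(bags_add_up(row))
print(volume_adds_up(row))
```

Both checks hold for this one row, but that tells us nothing about the other 18248 records, and the check lives only in our heads and in this cell.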

great_expectations helps us to codify these properties by writing `Expectations`. Think of it like a unit test, but for data.

We'll create a simple one to get started! Maybe we can just check whether a certain column is present in the dataset.

In [27]:
batch.expect_column_to_exist('Date')

{
  "success": true,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {},
  "meta": {}
}

Great success! That column does indeed seem to exist.

We received a dict describing the result of the check. Since this was a very basic expectation, there is not that much in there, but keep an eye on the results as we proceed to more complicated expectations.

Now let's address one of the concerns we raised: can we add an `Expectation` that ensures every record will have its `region` specified?

In [64]:
batch.expect_column_values_to_not_be_null('region')

{
  "success": true,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "element_count": 18249,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {}
}

That worked! In the `result` section, we can now see that all 18249 records passed the check.

We can also add a check for the value type, so that we don't end up with numeric regions.

In [71]:
batch.expect_column_values_to_be_of_type('region', 'str')

{
  "success": true,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "element_count": 18249,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {}
}

You can do many different kinds of checks with great_expectations. For example, we can make sure all the listed avocado prices are reasonable.

In [42]:
batch.expect_column_values_to_be_between('AveragePrice', min_value=0.5, max_value=3.0)

{
  "success": false,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "element_count": 18249,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 11,
    "unexpected_percent": 0.06027727546714888,
    "unexpected_percent_nonmissing": 0.06027727546714888,
    "partial_unexpected_list": [
      0.49,
      0.46,
      3.03,
      3.12,
      3.25,
      0.44,
      0.49,
      0.48,
      3.05,
      3.04,
      3.17
    ]
  },
  "meta": {}
}
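To help read the numbers in that result, here is a hand-rolled range check in plain Python that returns a dict shaped like the one above. It is only a sketch of the idea, not how great_expectations implements it: the function name, the sample prices, and the cap of 20 listed values are assumptions made for illustration. Note that `unexpected_percent` is a percentage, not a fraction: 11 out of 18249 is `100 * 11 / 18249`, or roughly 0.06 percent.

```python
def check_between(values, min_value, max_value):
    """Hand-rolled range check returning a dict shaped like the result above."""
    unexpected = [v for v in values if not (min_value <= v <= max_value)]
    percent = 100.0 * len(unexpected) / len(values) if values else 0.0
    return {
        "success": len(unexpected) == 0,
        "result": {
            "element_count": len(values),
            "unexpected_count": len(unexpected),
            # A percentage (0-100), not a fraction (0-1).
            "unexpected_percent": percent,
            # Only a truncated sample of offending values (cap of 20 assumed here).
            "partial_unexpected_list": unexpected[:20],
        },
    }


# Hypothetical sample: three prices from the first rows plus two of the outliers above.
prices = [1.33, 1.35, 0.93, 0.49, 3.25]
outcome = check_between(prices, min_value=0.5, max_value=3.0)
print(outcome["success"], outcome["result"]["unexpected_count"])
```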

Oops! That failed. Looks like we have some outliers here!

If we want to allow this, we can add some tolerance to the check by using the `mostly` parameter. Let's settle for having 99% of avocados being priced within the range we specified.

In [74]:
batch.expect_column_values_to_be_between('AveragePrice', min_value=0.5, max_value=3.0, mostly=0.99)

{
  "success": true,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "element_count": 18249,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 11,
    "unexpected_percent": 0.06027727546714888,
    "unexpected_percent_nonmissing": 0.06027727546714888,
    "partial_unexpected_list": [
      0.49,
      0.46,
      3.03,
      3.12,
      3.25,
      0.44,
      0.49,
      0.48,
      3.05,
      3.04,
      3.17
    ]
  },
  "meta": {}
}
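The effect of `mostly` can be sketched in a few lines of plain Python. This is an illustration under our own assumptions rather than the library's code: the function name is made up, and we ignore missing values, which is harmless here because `missing_count` is 0.

```python
def passes_with_mostly(element_count, unexpected_count, mostly=1.0):
    """Return True if the fraction of values meeting the expectation is at least `mostly`."""
    if element_count == 0:
        return True  # assumption: an empty column has nothing that can fail
    fraction_ok = (element_count - unexpected_count) / element_count
    return fraction_ok >= mostly


# Counts taken from the two results above: 11 of 18249 prices fall outside the range.
print(passes_with_mostly(18249, 11))               # strict check, every value must pass
print(passes_with_mostly(18249, 11, mostly=0.99))  # tolerant check
```

With these counts about 99.94% of the prices are in range, so the strict check fails while the one with `mostly=0.99` passes. The offending values are still reported either way.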

If we are using categorical values, such as the `type` field in our dataset, we can easily check that only known values show up:

In [75]:
batch.expect_column_distinct_values_to_be_in_set('type', ['conventional', 'organic'])

{
  "success": true,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "observed_value": [
      "conventional",
      "organic"
    ],
    "element_count": 18249,
    "missing_count": null,
    "missing_percent": null
  },
  "meta": {}
}
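Conceptually, this check boils down to a set comparison. The plain-Python sketch below shows the idea; the function name, the `unexpected` key, and the `biodynamic` value are hypothetical and only serve the illustration.

```python
def distinct_values_in_set(values, value_set):
    """Check that every distinct value in `values` is one of the allowed values."""
    observed = set(values)
    unexpected = observed - set(value_set)
    return {
        "success": not unexpected,
        "observed_value": sorted(observed),
        "unexpected": sorted(unexpected),
    }


allowed = ["conventional", "organic"]
types = ["conventional", "organic", "conventional"]

ok = distinct_values_in_set(types, allowed)
# A hypothetical new category sneaking into the data makes the check fail.
bad = distinct_values_in_set(types + ["biodynamic"], allowed)
print(ok["success"], bad["success"], bad["unexpected"])
```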

We could even add a check on the value frequencies! For example, if we expect organic and conventional avocados to show up roughly equally often, we could check the [Kullback-Leibler divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) between our assumed distribution and the one that is observed in the dataset.

In [85]:
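# Expected distribution: both types are equally likely.
partition_object = {
    'values': ['conventional', 'organic'],
    'weights': [0.5, 0.5],
}
# Pass if the KL divergence from the expected distribution is below 0.1.
batch.expect_column_kl_divergence_to_be_less_than('type', partition_object, 0.1)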

{
  "success": true,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "observed_value": 1.351245850704074e-08,
    "element_count": 18249,
    "missing_count": null,
    "missing_percent": null
  },
  "meta": {}
}
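If you are curious where such a tiny `observed_value` comes from, here is a minimal plain-Python sketch of the Kullback-Leibler divergence for a categorical column. It assumes natural logarithms and the direction KL(observed || expected); whether great_expectations uses exactly this convention is not shown in this notebook. The counts (9126 conventional, 9123 organic) are taken from a check further down in this notebook.

```python
import math


def kl_divergence(observed_counts, expected_weights):
    """KL(P || Q) in nats, with P the observed frequencies and Q the expected weights."""
    total = sum(observed_counts.values())
    divergence = 0.0
    for value, count in observed_counts.items():
        p = count / total
        q = expected_weights[value]
        if p > 0:  # a term with p == 0 contributes nothing
            divergence += p * math.log(p / q)
    return divergence


# 18249 rows in total, split almost evenly between the two types.
observed = {"conventional": 9126, "organic": 9123}
expected = {"conventional": 0.5, "organic": 0.5}

kl = kl_divergence(observed, expected)
print(kl, kl < 0.1)
```

The divergence is on the order of 1e-8, far below our threshold of 0.1, because the observed split is almost exactly 50/50.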

Now that we checked out some expectations, maybe try adding one yourself? You can check out the [glossary of expectations](https://docs.greatexpectations.io/en/latest/reference/glossary_of_expectations.html) for a complete list of what you can do. Go wild!

In [80]:
# all yours

## Tests == docs

So, while we were experimenting up there, great_expectations remembered all the expectations we ran. Now we can easily inspect them:

In [88]:
batch.get_expectation_suite()

2021-01-13T17:20:18+0100 - INFO - 	6 expectation(s) included in expectation_suite. Omitting 1 expectation(s) that failed when last run; set discard_failed_expectations=False to include them. result_format settings filtered.


{
  "meta": {
    "great_expectations_version": "0.13.4"
  },
  "expectations": [
    {
      "kwargs": {
        "column": "Date"
      },
      "expectation_type": "expect_column_to_exist",
      "meta": {}
    },
    {
      "kwargs": {
        "column": "type",
        "value_set": [
          "conventional",
          "organic"
        ]
      },
      "expectation_type": "expect_column_distinct_values_to_be_in_set",
      "meta": {}
    },
    {
      "kwargs": {
        "column": "region"
      },
      "expectation_type": "expect_column_values_to_not_be_null",
      "meta": {}
    },
    {
      "kwargs": {
        "column": "AveragePrice",
        "min_value": 0.5,
        "max_value": 3.0,
        "mostly": 0.99
      },
      "expectation_type": "expect_column_values_to_be_between",
      "meta": {}
    },
    {
      "kwargs": {
        "column": "type",
        "partition_object": {
          "values": [
            "conventional",
            "organic"
          ],
      

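The returned dict shows us how great_expectations keeps track of your expectation suite internally.

This representation shows us, for each expectation, its `expectation_type` and the `kwargs` we passed in. The log line above the dict also tells us that expectations that failed when last run are omitted by default. Next, we save the suite to disk: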

In [99]:
batch.save_expectation_suite()

2021-01-14T09:21:25+0100 - INFO - 	6 expectation(s) included in expectation_suite. Omitting 1 expectation(s) that failed when last run; set discard_failed_expectations=False to include them. result_format settings filtered.


Now let's take a look at the file structure of our great_expectations setup, and find out where our file went.

In [110]:
!tree great_expectations -I "uncommitted"

great_expectations
├── checkpoints
├── expectations
│   └── check_avocado_data.json
├── great_expectations.yml
├── notebooks
│   ├── pandas
│   │   └── validation_playground.ipynb
│   ├── spark
│   │   └── validation_playground.ipynb
│   └── sql
│       └── validation_playground.ipynb
└── plugins
    └── custom_data_docs
        ├── renderers
        ├── styles
        │   └── data_docs_custom_styles.css
        └── views

11 directories, 6 files


It's right there in the `expectations` folder! Remember that we named our suite `check_avocado_data` back at the start.

If you don't trust us, feel free to check for yourself ;-)

In [None]:
!cat great_expectations/expectations/check_avocado_data.json
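Since the suite is just JSON, you can also inspect it with the standard library. The sketch below assumes the saved file has the same layout as the dict returned by `get_expectation_suite()` earlier; the helper name is ours, and the inline excerpt stands in for the real file so that the cell runs anywhere.

```python
import json


def summarize_suite(suite):
    """Return (expectation_type, column) pairs from an expectation suite dict."""
    return [
        (expectation["expectation_type"], expectation["kwargs"].get("column"))
        for expectation in suite.get("expectations", [])
    ]


# Inline excerpt mirroring the suite shown earlier. To read the real file instead,
# open great_expectations/expectations/check_avocado_data.json and use json.load.
suite_json = """
{
  "meta": {"great_expectations_version": "0.13.4"},
  "expectations": [
    {"kwargs": {"column": "Date"},
     "expectation_type": "expect_column_to_exist", "meta": {}},
    {"kwargs": {"column": "region"},
     "expectation_type": "expect_column_values_to_not_be_null", "meta": {}}
  ]
}
"""

summary = summarize_suite(json.loads(suite_json))
for expectation_type, column in summary:
    print(f"{column}: {expectation_type}")
```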

In [89]:
results = context.run_validation_operator("action_list_operator", assets_to_validate=[batch])

2021-01-13T17:20:47+0100 - INFO - Setting run_name to: 20210113T162047.180704Z
2021-01-13T17:20:47+0100 - INFO - 	7 expectation(s) included in expectation_suite.


TODO: make sure these docs are in the repo, and link to them for online readers

In [115]:
validation_result_identifier = results.list_validation_result_identifiers()[0]
context.build_data_docs()
context.open_data_docs()

In [97]:
context.list_validation_operators()

[{'class_name': 'ActionListValidationOperator',
  'action_list': [{'name': 'store_validation_result',
    'action': {'class_name': 'StoreValidationResultAction'}},
   {'name': 'store_evaluation_params',
    'action': {'class_name': 'StoreEvaluationParametersAction'}},
   {'name': 'update_data_docs',
    'action': {'class_name': 'UpdateDataDocsAction'}}],
  'name': 'action_list_operator'}]

## Metrics

## Profiling: generating expectations

## Setting up data context and source

## (Airflow integration)

## (Spark)

In [None]:
!cat great_expectations/great_expectations.yml

In [1]:
import great_expectations as ge

In [2]:
my_df = ge.read_csv("data/avocado.csv")

In [3]:
my_df.head()
my_df.expect_column_values_to_be_in_set("type", ["conventional"])

{
  "success": false,
  "meta": {},
  "result": {
    "element_count": 18249,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 9123,
    "unexpected_percent": 49.991780371527206,
    "unexpected_percent_nonmissing": 49.991780371527206,
    "partial_unexpected_list": [
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic"
    ]
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [4]:
my_df.get_expectation_suite()

{
  "meta": {
    "great_expectations_version": "0.13.4"
  },
  "expectation_suite_name": "default",
  "data_asset_type": "Dataset",
  "expectations": []
}