# Great Expectations tutorial

TODO: update the following paragraph as the project evolves

Welcome to this hands-on tutorial on Great Expectations! We'll show you why and how to use Great Expectations to enhance the quality of your data. We'll first introduce you to the tool and show you how to get started. Then we'll go over expectations and expectation suites, which are the key building blocks to test your data. We'll show you how to generate beautiful reports on your data. We'll then build a checkpoint, and finally introduce you to more advanced stuff.

### What is Great Expectations exactly?

Great Expectations is a tool that allows you to test batch data. It generates reports about the data, containing documentation of the data translated from the definitions of the tests.

### Why you should use it: data quality

You should test your data for two main reasons:
- better data quality leads to better predictions and insights relying on the data,
- it's an additional way to test data pipelines.

# Getting started

Let's jump into it then!

In [None]:
import great_expectations as ge

First we'll need a `DataContext`. This represents a Great Expectations project, holding all your configurations, expectation suites, data sources and so on. We'll go over how to configure a `DataContext` yourself later [TODO: jump-ahead], but to get started we shipped a simple one with this tutorial.

We'll load that one right now. By default, Great Expectations will look for your configuration in the `great_expectations` directory.

In [None]:
context = ge.data_context.DataContext()

Now that we have our `DataContext` ready, we can add an expectation suite. Think of this like a test suite, but for your data instead of for your code. Usually you'll do this through the CLI, but we will get to that later [TODO: jump-ahead]. We'll name the suite `check_avocado_data`.

In [None]:
suite = context.create_expectation_suite(
    "check_avocado_data",
    overwrite_existing=True
)

Next, we load our dataset, `avocado.csv`, from our data context. This involves a bit of configuration, but don't worry about it too much for now. We'll get back to that later [TODO: when + link].

In [None]:
batch_kwargs = {
    'path': 'data/avocado.csv',
    'datasource': 'data_dir',
    'reader_method': 'read_csv',
    'data_asset_name': 'avocado',
}
batch = context.get_batch(batch_kwargs, suite)

Alright, that's it for setup!

Let's continue to our avocado sales data.

In [None]:
batch.head()

This is the documentation that came with the data:
 - Date - The date of the observation
 - AveragePrice - the average price of a single avocado
 - type - agriculture type: conventional or organic
 - Region - the city or region of the observation
 - Total Volume - Total number of avocados sold
 - 4046 - Total number of avocados with PLU 4046 sold (small Hass)
 - 4225 - Total number of avocados with PLU 4225 sold (large Hass)
 - 4770 - Total number of avocados with PLU 4770 sold (extra large Hass)
 
These descriptions sure help us understand the dataset a bit better, but they don't exactly provide many guarantees. When consuming this dataset, what assumptions can we make? Will the `region` field always be specified? Will the `Date` field always be in the same format? Those sales counts, are they supposed to add up?

Great Expectations helps us to codify these properties in a set of `Expectations`. An `Expectation` is, well, something that you expect to be true in your data. Again, think of it like a unit test for your dataset.

Let's run a basic `Expectation` to get started. For example, we could check whether the `Date` column is present in our dataset.

In [None]:
batch.expect_column_to_exist('Date')

Great success! That column does indeed seem to exist.

The result `dict` we got back might seem a bit weird at first, but it will start to make sense as we move along the tutorial. Promise!

This was a simple check that only assesses the data shape, but doesn't touch the values in there (it is a _table-level check_).

Let's try adding a check for the values now. Maybe we can address one of the concerns we raised: can we add an `Expectation` that ensures every record will have its `region` specified?

In [None]:
batch.expect_column_values_to_not_be_null('region')

That worked! This time we got a bit more info back: the `result` section now contains some metrics about our data. We can see that all 18249 records passed the check, and there were no unexpected (i.e. `null`) values. If Great Expectations finds any offending values, they will be listed in the `partial_unexpected_list`.

Now let's do something that's a bit more strict. It would be nice, for example, to make sure that all `region`s are actually strings, so that we don't end up with numeric regions.

In [None]:
batch.expect_column_values_to_be_of_type('region', 'str')

Note that metrics on the number of missing values were still collected. This way, we can distinguish between missing values and incorrect values. In case you were wondering, the `unexpected_percent_nonmissing` refers to the percentage of present (non-null) values that did not meet our expectation (they were not a string). If other metrics are unclear to you, check out [this documentation page](https://docs.greatexpectations.io/en/latest/reference/core_concepts/expectations/result_format.html#behavior-for-summary).
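
To make the difference between missing and unexpected values concrete, here is a minimal plain-Python sketch. It is not how Great Expectations computes its results; `summarize_column_check` and the sample values are hypothetical, and only the field names mirror the metrics discussed above.

```python
from typing import Any, Callable, Dict, List, Optional


def summarize_column_check(
    values: List[Optional[Any]],
    is_expected: Callable[[Any], bool],
) -> Dict[str, Any]:
    """Summarize a per-value check, keeping missing and wrong values apart."""
    # Missing values are set aside first, so they are never counted as wrong.
    non_missing = [v for v in values if v is not None]
    unexpected = [v for v in non_missing if not is_expected(v)]
    element_count = len(values)
    missing_count = element_count - len(non_missing)
    return {
        "element_count": element_count,
        "missing_count": missing_count,
        "missing_percent": (
            100.0 * missing_count / element_count if element_count else None
        ),
        "unexpected_count": len(unexpected),
        # Share of the *present* values that failed the check.
        "unexpected_percent_nonmissing": (
            100.0 * len(unexpected) / len(non_missing) if non_missing else None
        ),
    }


# Made-up sample: one missing region and one numeric region.
sample_regions = ["Albany", "Boston", None, 42, "Denver"]
summary = summarize_column_check(sample_regions, lambda v: isinstance(v, str))
print(summary)
```

In this made-up sample, one of five values is missing and one of the four remaining values is not a string, so the two percentages differ.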

Now that we covered the basics, let's get to some fancier expectations. For example, we could make sure that all `Date`s are in the expected format:

In [None]:
batch.expect_column_values_to_match_strftime_format('Date', "%Y-%m-%d")
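
If you are wondering what "matching a strftime format" means for a single value, the idea can be sketched with the standard library alone. The helper `matches_strftime_format` and the sample dates below are hypothetical and only illustrate the concept; they are not part of Great Expectations.

```python
from datetime import datetime


def matches_strftime_format(value: str, fmt: str) -> bool:
    """Return True if `value` can be parsed with the given strftime format."""
    try:
        datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True


# Made-up sample values; only the first one is a valid "%Y-%m-%d" date.
sample_dates = ["2015-12-27", "27/12/2015", "2015-13-01"]
flags = [matches_strftime_format(d, "%Y-%m-%d") for d in sample_dates]
print(flags)
```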

Another example: we can make sure all the listed avocado prices are reasonable.

In [None]:
batch.expect_column_values_to_be_between('AveragePrice', min_value=0.5, max_value=3.0)

Oops! That failed. Looks like we have some outliers here! Great Expectations helpfully collected them for us. By default, it will collect up to 20 examples of values that didn't meet the expectation (that's why it's called the _partial_ unexpected list).

If we want to allow these outliers, we can add some tolerance to the check by using the `mostly` parameter. Let's settle for having 99% of avocados being priced within the range we specified.

In [None]:
batch.expect_column_values_to_be_between('AveragePrice', min_value=0.5, max_value=3.0, mostly=0.99)
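
To see how a tolerance like `mostly` changes the outcome, here is a small plain-Python sketch of a range check. The function `check_between` and the prices are made up for illustration; the parameter names follow the ones used in this tutorial, but this is not the library's implementation.

```python
from typing import Any, Dict, List


def check_between(
    values: List[float],
    min_value: float,
    max_value: float,
    mostly: float = 1.0,
    partial_unexpected_count: int = 20,
) -> Dict[str, Any]:
    """Range check that passes when at least `mostly` of the values are in range."""
    unexpected = [v for v in values if not (min_value <= v <= max_value)]
    fraction_ok = (len(values) - len(unexpected)) / len(values) if values else 1.0
    return {
        "success": fraction_ok >= mostly,
        "unexpected_count": len(unexpected),
        # Only a capped sample of the offending values is reported.
        "partial_unexpected_list": unexpected[:partial_unexpected_count],
    }


# Made-up prices: 199 reasonable values and a single outlier.
prices = [1.0] * 199 + [3.25]
strict = check_between(prices, 0.5, 3.0)
tolerant = check_between(prices, 0.5, 3.0, mostly=0.99)
print(strict["success"], tolerant["success"])
```

With one outlier in 200 values, 99.5% of the prices are in range: the strict check fails, while the 99% tolerance lets it pass.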

Another common use case is when you only expect a certain set of values to show up in a column. This is the case for our `type` column, since we only know about `conventional` and `organic` grown avocados. Let's add a check for that:

In [None]:
batch.expect_column_distinct_values_to_be_in_set('type', ['conventional', 'organic'])
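
Conceptually, this kind of check boils down to a set comparison. The sketch below uses a hypothetical helper, `distinct_values_in_set`, and made-up values to show the idea; the returned keys are our own choice, not the library's result format.

```python
from typing import Any, Dict, Iterable, List


def distinct_values_in_set(values: Iterable[str], allowed: Iterable[str]) -> Dict[str, Any]:
    """Check that every distinct value belongs to the allowed set."""
    observed = set(values)
    unexpected: List[str] = sorted(observed - set(allowed))
    return {
        "success": not unexpected,
        "observed_values": sorted(observed),
        "unexpected_values": unexpected,
    }


# Made-up sample with one value we did not anticipate.
sample_types = ["conventional", "organic", "organic", "hydroponic"]
outcome = distinct_values_in_set(sample_types, ["conventional", "organic"])
print(outcome)
```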

We could even add a check on the value frequencies! For example, if we want the ratio of organic to conventional to be roughly equal, we could check the [Kullback-Leibler divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) between our assumed distribution and the one that is observed in the dataset.

In [None]:
partition_object = {
    'values': ['conventional', 'organic'],
    'weights': [0.5, 0.5],
}
batch.expect_column_kl_divergence_to_be_less_than('type', partition_object, 0.1)
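
For intuition about the number being compared against the `0.1` threshold, here is a small standalone sketch of the Kullback-Leibler divergence for categorical data. It assumes the divergence is taken from the observed distribution to the expected one and uses the natural logarithm; both are assumptions of this sketch, and the helpers and sample counts are made up.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def observed_weights(values: Sequence[str]) -> Dict[str, float]:
    """Turn raw values into relative frequencies per category."""
    counts = Counter(values)
    return {category: count / len(values) for category, count in counts.items()}


def kl_divergence(observed: Dict[str, float], expected: Dict[str, float]) -> float:
    """KL(observed || expected) in nats for two categorical distributions."""
    total = 0.0
    for category, p in observed.items():
        if p == 0.0:
            continue  # a zero-probability term contributes nothing
        q = expected.get(category, 0.0)
        if q == 0.0:
            return math.inf  # observed a category the expectation rules out
        total += p * math.log(p / q)
    return total


# Made-up sample: a 60/40 split compared against an assumed 50/50 split.
sample_types = ["conventional"] * 60 + ["organic"] * 40
expected = {"conventional": 0.5, "organic": 0.5}
divergence = kl_divergence(observed_weights(sample_types), expected)
print(round(divergence, 4))
```

A perfectly balanced sample gives a divergence of zero, and the value grows as the observed split drifts away from the assumed one.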

Now that we checked out some expectations, maybe try adding one yourself? You can check out the [glossary of expectations](https://docs.greatexpectations.io/en/latest/reference/glossary_of_expectations.html) for a complete list of what you can do. Go wild!

In [None]:
# The stage is all yours

## The Expectation Suite

So, while we were experimenting up there, great_expectations remembered all the expectations we ran. We can now retrieve the suite contents as follows:

In [None]:
batch.get_expectation_suite()

That gave us the `dict` representation Great Expectations uses under the hood to keep track of our expectation suite. Can you recognise some of the expectations we wrote?

This representation can then be saved to a file, so that we can load it again at another time, without depending on the Python code that produced it.

Note that by default, expectations that failed on the `batch` we ran them against will be omitted. If you want to include them anyway, you could add the `discard_failed_expectations=False` parameter.

In [None]:
batch.save_expectation_suite()

Now let us get back to that configuration we mentioned earlier. As we said, it's just some files living in the `great_expectations` directory. This is what it looks like:

In [None]:
!tree great_expectations -I "uncommitted"

So, we have the main `great_expectations.yml` configuration file, a folder with checkpoints, a folder with expectation suites, some playground notebooks, and a folder for plugins.

As you can see, the `save_expectation_suite` command saved our `check_avocado_data` suite to the `expectations` folder! That's all there is to it, the expectation suite is just a file. It contains that same internal representation that we just retrieved. You can check if you like!

In [None]:
!cat great_expectations/expectations/check_avocado_data.json
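
Because the suite is plain JSON, it can be inspected without Great Expectations at all. The sketch below writes a trimmed-down, hypothetical suite to a temporary file and reads it back; the key names are assumptions about what the saved file looks like, and the real file holds more metadata.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical, trimmed-down suite content used only for this illustration.
suite_dict = {
    "expectation_suite_name": "check_avocado_data",
    "expectations": [
        {
            "expectation_type": "expect_column_to_exist",
            "kwargs": {"column": "Date"},
        },
        {
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": "region"},
        },
    ],
}

# Round-trip through a temporary file instead of the project directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    suite_path = Path(tmp_dir) / "check_avocado_data.json"
    suite_path.write_text(json.dumps(suite_dict, indent=2))
    loaded = json.loads(suite_path.read_text())

expectation_types = [e["expectation_type"] for e in loaded["expectations"]]
print(expectation_types)
```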

Now that we added our expectation suite to our `DataContext`, we can try running the entire suite. This is done by applying a `ValidationOperator` to the suite and the dataset. `ValidationOperator`s for your project are defined in the `great_expectations.yml` file. We already provided a `ValidationOperator` called `action_list_operator` [TODO: this is the default name, should we change it?] which will run the expectation suite and record its results.

If you'd like to know more you can check out the [validation operators and actions](https://docs.greatexpectations.io/en/latest/reference/core_concepts/validation_operators_and_actions.html) or [how to add a validation operator](https://docs.greatexpectations.io/en/latest/guides/how_to_guides/validation/how_to_add_a_validation_operator.html) documentation pages, but for now it's sufficient to understand that you can configure these operators yourself.



In [18]:
results = context.run_validation_operator("action_list_operator", assets_to_validate=[batch])

We didn't specify our expectation suite for that command, but remember that the `batch` dataset kept track of our suite for us, so it will know what to do.

That produced a big data dump for us. Let's inspect it.

In [19]:
results

{
  "validation_operator_config": {
    "class_name": "ActionListValidationOperator",
    "module_name": "great_expectations.validation_operators",
    "name": "action_list_operator",
    "kwargs": {
      "action_list": [
        {
          "name": "store_validation_result",
          "action": {
            "class_name": "StoreValidationResultAction"
          }
        },
        {
          "name": "store_evaluation_params",
          "action": {
            "class_name": "StoreEvaluationParametersAction"
          }
        },
        {
          "name": "update_data_docs",
          "action": {
            "class_name": "UpdateDataDocsAction"
          }
        }
      ],
      "result_format": {
        "result_format": "SUMMARY",
        "partial_unexpected_count": 20
      }
    }
  },
  "run_id": {
    "run_name": "20210115T165147.798358Z",
    "run_time": "2021-01-15T16:51:47.798358+00:00"
  },
  "evaluation_parameters": null,
  "run_results": {
    "ValidationResultIdenti

This is called a *validation result*. Validation results are kept in the *validation store*, which by default is the `great_expectations/uncommitted/validations` directory.

In [20]:
!tree great_expectations/uncommitted/validations

great_expectations/uncommitted/validations
└── check_avocado_data
    └── 20210115T165147.798358Z
        └── 20210115T165147.798358Z
            └── a52e8a35d5f03815b708c7306612dbde.json

3 directories, 1 file


Great Expectations also allows you to set other backends as a validation store, such as an S3 bucket or a SQL database. Check out [metadata stores](https://docs.greatexpectations.io/en/latest/guides/how_to_guides/configuring_metadata_stores.html) if you would like to learn more!
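
A validation result like the one printed earlier in this section is a nested structure, so a few lines of Python are enough to pull out what you care about. The output shown in this tutorial is truncated, so the sketch below works on a hand-written stand-in: the `validation_result` and `success` keys, as well as the identifiers, are assumptions made for illustration.

```python
from typing import Any, Dict, List


def failed_runs(results: Dict[str, Any]) -> List[str]:
    """List the identifiers of runs whose validation did not succeed."""
    return [
        str(identifier)
        for identifier, run in results["run_results"].items()
        if not run["validation_result"]["success"]
    ]


# Hand-written stand-in for a result object; identifiers are hypothetical.
mock_results = {
    "run_results": {
        "suite_a/run_1": {"validation_result": {"success": True}},
        "suite_b/run_1": {"validation_result": {"success": False}},
    }
}
print(failed_runs(mock_results))
```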

## Data Docs

We can render these results to a friendly report, called a data doc. These data docs will describe the expectations that the data should meet, as well as the metrics detailing how well the data meets the requirements. This is how Great Expectations combines testing with documenting. Running the code below will generate the data docs and open them in a new tab. Make sure to have a look around! You'll see the code we ran above reflected in the different sections, and it's pretty self-explanatory.

TODO: make sure these docs are in the repo, and add a link to them for online readers

In [21]:
context.build_data_docs()

# get the result identifier for our run
validation_result_identifier = list(results["run_results"].keys())[0]
context.open_data_docs(validation_result_identifier)

# Checkpoints

To get started, we manually ran our expectation suite against our dataset. While that worked, there is a better way: checkpoints.

A checkpoint couples expectation suites with datasets that they will be run on. \
In Great Expectations parlance, we call these datasets *data assets*. Data assets live within a *datasource*. \
A datasource could for example be an SQL database, and a data asset could be one of its tables. The datasource we had preconfigured is just a simple folder with the `avocado` csv file inside. We named it `data_dir`.

The check is run using a *validation operator*. For now, we are just using the default one generated by great_expectations, called `action_list_operator`. It runs the expectation suites and generates the data docs we previously saw. Don't worry about this one yet, just remember that it can be configured.

We can create a checkpoint by adding a file in the `checkpoints` directory of our great_expectations configuration:

In [23]:
%%writefile great_expectations/checkpoints/avocado_data.yml

validation_operator_name: action_list_operator
batches:
  - batch_kwargs:
      path: data/avocado.csv
      datasource: data_dir
      reader_method: read_csv
      data_asset_name: avocado
    expectation_suite_names:
      - check_avocado_data

Writing great_expectations/checkpoints/avocado_data.yml


`batches` is a list of (data asset, expectation suites) pairs. `batch_kwargs` specifies how the data asset should be loaded. You might recognise the parameters from earlier! \
Note that it is possible to add multiple expectation suites to check one batch.

The checkpoint can be executed by using the great_expectations CLI:
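
To make the pairing explicit, here is the same checkpoint written as a Python dictionary, together with a hypothetical helper, `planned_validations`, that lists which suite would be run against which file. This only mirrors the structure of the YAML above; it is not how Great Expectations reads checkpoints.

```python
from typing import Any, Dict, List, Tuple

# The checkpoint from the YAML file above, mirrored as a plain dictionary.
checkpoint: Dict[str, Any] = {
    "validation_operator_name": "action_list_operator",
    "batches": [
        {
            "batch_kwargs": {
                "path": "data/avocado.csv",
                "datasource": "data_dir",
                "reader_method": "read_csv",
                "data_asset_name": "avocado",
            },
            "expectation_suite_names": ["check_avocado_data"],
        },
    ],
}


def planned_validations(config: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Expand a checkpoint into (data path, suite name) pairs."""
    return [
        (batch["batch_kwargs"]["path"], suite_name)
        for batch in config["batches"]
        for suite_name in batch["expectation_suite_names"]
    ]


print(planned_validations(checkpoint))
```

Adding a second name under `expectation_suite_names` would yield a second pair for the same file.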

In [24]:
!great_expectations checkpoint run avocado_data

Heads up! This feature is Experimental. It may change. Please give us your feedback!
Validation succeeded!

Suite Name                                   Status     Expectations met
- check_avocado_data                         ✔ Passed   7 of 7 (100.0 %)

So, to summarize: a checkpoint is a _runnable check_ for your data. Checkpoints are your first stop for integrating Great Expectations into your pipelines and workflows.
For more info on how to do that, refer to the [validation guides](https://docs.greatexpectations.io/en/latest/guides/how_to_guides/validation.html), or the [workflows and patterns](https://docs.greatexpectations.io/en/latest/guides/workflows_patterns.html) guides.
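
One simple way to wire a checkpoint into a pipeline is to run the CLI command from a script and stop when it fails. The sketch below assumes the command signals a failed validation through a non-zero exit code, which you should verify for your version. The helper `command_succeeded` is hypothetical, and stand-in commands are used so the example runs without Great Expectations installed.

```python
import subprocess
import sys
from typing import Sequence


def command_succeeded(command: Sequence[str]) -> bool:
    """Run a command and report whether it exited with status 0."""
    completed = subprocess.run(list(command), check=False)
    return completed.returncode == 0


# Stand-in commands that simply exit with a chosen status code.
passing_command = [sys.executable, "-c", "raise SystemExit(0)"]
failing_command = [sys.executable, "-c", "raise SystemExit(1)"]

print(command_succeeded(passing_command), command_succeeded(failing_command))

# In a real pipeline the command would be the one used in this tutorial:
# command_succeeded(["great_expectations", "checkpoint", "run", "avocado_data"])
```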

# Profiling

In the previous sections we explored how we could get some metrics about our data using expectations. But what if you don't know what exactly to expect of your data? Well, you could try using Great Expectations' profiling feature, which can try to extract some useful metrics from your data. To try profiling our preconfigured `data_dir` data source, we can use the CLI:

In [None]:
!great_expectations datasource profile data_dir -y

Running that command should have presented you with freshly built data docs. You can find the results in the `Profiling Results` tab. The profiler also generated an expectation suite based on its observations, which you can find in the `Expectation Suites` tab. Be mindful that this is an experimental feature and the generated suite is usually not that helpful, but it could be a good starting point for writing your own.


If you'd like to know more about profiling, the [profiling reference](https://docs.greatexpectations.io/en/latest/reference/spare_parts/profiling_reference.html) can help you out.

## Install and set up great_expectations

Here are a few guidelines if you want to set up everything yourself for your own projects.

To install Great Expectations, run `pip install great_expectations` in your terminal. It is good practice to install packages with `pip` inside a virtual environment.

To initialize Great Expectations for a project, run `great_expectations init` in your terminal in the project's directory and follow the instructions.

For more information on how to set up everything, have a look at https://docs.greatexpectations.io/en/latest/guides/tutorials/getting_started.html and feel free to refer to the official documentation.
