# Great Expectations tutorial
Welcome! In this tutorial we'll have a look at Great Expectations, a tool written and configured in Python that aids you in keeping an eye on your data quality. It provides a batteries-included solution for testing and documenting your data, so that nobody has to run into any surprises when consuming it. To achieve this, you create _expectation suites_. You can think of them as unit tests, but for data. They also double as documentation for your dataset, so that you won't have to repeat yourself.

What do we mean by data quality? Well, bad quality data can happen for different reasons. Usually, data has bad quality if its structure (for example the columns and their types in a table) or its contents (specific cells in a table) are not what you expected.

For more background on Great Expectations and the problems it solves, we can recommend the authors' blogpost: [Down with Pipeline debt / Introducing Great Expectations](https://medium.com/@expectgreatdata/down-with-pipeline-debt-introducing-great-expectations-862ddc46782a). It's a good read!

## What is Great Expectations exactly?

<img src='figures/in_out.png' width=800px>

Great Expectations can be used with your existing data assets - it is able to use different backends such as SQL databases, Spark clusters, or just your plain old filesystem. It will execute your expectation suites on these backends, and generate reports on the results of your validation.  
Writing your expectation suite is usually done through Jupyter notebooks, so you'll feel at home. This notebook itself would be an example of how that works!


## In this tutorial
We'll give you a brief introduction to the main concepts used in Great Expectations, walking you through writing your first expectations and generating your first data report. We have added many links to the official documentation that you can refer to when you are configuring your own setup.

Contents:
- [Getting started](#section-getting-started)
- [The Expectation Suite](#section-expectation-suite)
- [Data Docs](#section-data-docs)
- [Data Context](#section-data-context)
- [Checkpoints](#section-checkpoints)
- [Data Profiling](#section-profiling)
- [The Great Expectations CLI](#section-cli)
- [Setting up your own project](#section-setup)

## Running on Google Colab

If you are running this on Google Colab, make sure to run the cell below to set everything up.

In [None]:
%%bash
if [[ ! -d great_expectations ]]
then 
  git init
  git remote add origin https://github.com/datarootsio/tutorial-great-expectations.git
  git pull origin main
  pip install great_expectations==0.13
fi

<a id="section-getting-started"></a>
## Getting started

Let's jump into it then!

In [2]:
import great_expectations as ge

First we'll need a `DataContext`. This represents a Great Expectations project, holding all your configurations, expectation suites, data sources and so on. We'll have a better look at the data context later [[Data Context]](#section-data-context), but just to get started we shipped a simple one with this tutorial.

We'll load that one right now. By default, Great Expectations will look for your configuration in the `great_expectations` directory.

In [3]:
context = ge.data_context.DataContext()

Now that we have our `DataContext` ready, we can add an expectation suite. Think of this like a test suite, but for your data instead of for your code. Usually you'll do this through the CLI, but we will get to that later [[The Great Expectations CLI]](#section-cli). We'll name the suite `check_avocado_data`.

In [4]:
suite = context.create_expectation_suite(
    'check_avocado_data',
    overwrite_existing=True
)

Next, we load our dataset, `avocado.csv`, from our data context. This involves a bit of configuration, but don't worry about it too much for now. We'll get back to that later [[Data Context]](#section-data-context).

In [5]:
batch_kwargs = {
    'path': 'data/avocado.csv',
    'datasource': 'data_dir',
    'data_asset_name': 'avocado',
    'reader_method': 'read_csv',
    'reader_options': {
        'index_col': 0,
    }
}
batch = context.get_batch(batch_kwargs, suite)

Alright, that's it for setup!

Let's continue to our avocado sales data.

In [6]:
batch.head()

Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
3,2015-12-06,1.08,78992.15,1132.0,71976.41,72.58,5811.16,5677.4,133.76,0.0,conventional,2015,Albany
4,2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany


This is the documentation that came with the data:
 - Date - The date of the observation
 - AveragePrice - the average price of a single avocado
 - type - agriculture type: conventional or organic
 - Region - the city or region of the observation
 - Total Volume - Total number of avocados sold
 - 4046 - Total number of avocados with PLU 4046 sold (small Hass)
 - 4225 - Total number of avocados with PLU 4225 sold (large Hass)
 - 4770 - Total number of avocados with PLU 4770 sold (extra large Hass)
 
These descriptions sure help us to understand the dataset a bit better, but they don't exactly provide many guarantees. When consuming this dataset, what expectations can we have? Will the `region` field always be specified? Will the `Date` field always be in the same format? Those sales counts, are they supposed to add up?

Great Expectations helps us to codify these properties in a set of `Expectations`. An `Expectation` is, well, something that you expect to be true in your data. Again, think of it as a unit test for your dataset.

Let's run a basic `Expectation` to get started. We want to check whether our expectation that the Date column is present holds true.

In [7]:
batch.expect_column_to_exist('Date')

{
  "success": true,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "result": {}
}

The resulting `dict` we got back might feel a bit weird at first, but you'll see later on how this output is used to generate reports [[Data Docs]](#section-data-docs). For now, just note that `success` has the value `true`, indicating that our expectation passed!

This was a simple check that only assesses the data shape, but doesn't touch the values in there (it is a _table-level check_).

Let's try adding a check for the values now. Maybe we can address one of the concerns we raised: can we add an `Expectation` that ensures every record will have its `region` specified?

In [8]:
batch.expect_column_values_to_not_be_null('region')

{
  "success": true,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "result": {
    "element_count": 18249,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": []
  }
}

That worked! This time we got a bit more info back: the `result` section now contains some metrics about our data. We can see that all 18249 records passed the check, and there were no unexpected (i.e. `null`) values. If Great Expectations finds any offending values, they will be listed in the `partial_unexpected_list`.
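
To make these metrics a bit more tangible, here is a minimal sketch of how such a summary could be computed for a plain Python list. The `summarize_not_null` helper and the sample values are hypothetical and only mirror the field names from the output above; this is not how Great Expectations implements the check internally.

```python
from typing import Any, Dict, List, Optional


def summarize_not_null(values: List[Optional[Any]], max_examples: int = 20) -> Dict[str, Any]:
    """Hypothetical helper: summarize a not-null check over a list of values."""
    # For a not-null check, the "unexpected" values are the nulls themselves.
    unexpected = [v for v in values if v is None]
    element_count = len(values)
    unexpected_count = len(unexpected)
    unexpected_percent = 100.0 * unexpected_count / element_count if element_count else 0.0
    return {
        "success": unexpected_count == 0,
        "result": {
            "element_count": element_count,
            "unexpected_count": unexpected_count,
            "unexpected_percent": unexpected_percent,
            # Only a limited number of offending values is kept as examples.
            "partial_unexpected_list": unexpected[:max_examples],
        },
    }


# Made-up sample with one missing region.
regions = ["Albany", "Atlanta", None, "Boston"]
report = summarize_not_null(regions)
print(report["success"], report["result"]["unexpected_count"])
```

With one missing region out of four, the sketch reports a failed check and an `unexpected_percent` of 25.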

Now let's do something that's a bit more strict. It would be nice, for example, to make sure that all `region`s are actually strings, so that we don't end up with numeric regions. Note that the type you specify here should match your backend - you can't expect a Spark backend to have PostgreSQL types. Refer to [the documentation](https://docs.greatexpectations.io/en/latest/autoapi/great_expectations/expectations/core/expect_column_values_to_be_of_type/index.html) to see what type name you should use.

In [9]:
batch.expect_column_values_to_be_of_type('region', 'str')

{
  "success": true,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "result": {
    "element_count": 18249,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  }
}

Note that metrics on the amount of missing values were still collected. This way, we can disambiguate between missing values and incorrect values. In case you were wondering, the `unexpected_percent_nonmissing` refers to the percentage of present (non-null) values that did not meet our expectation (they were not a string). If other metrics are unclear to you, check out [this documentation page](https://docs.greatexpectations.io/en/latest/reference/core_concepts/expectations/result_format.html#behavior-for-summary).
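
The difference between the two percentages is easiest to see on a column that has both missing and wrong values. The sketch below uses a hypothetical `summarize_type_check` helper on made-up values. It assumes that `unexpected_percent` is taken over all rows and `unexpected_percent_nonmissing` over the non-null rows only, so treat it as an illustration rather than the library's exact arithmetic.

```python
from typing import Any, Dict, List


def summarize_type_check(values: List[Any], expected_type: type) -> Dict[str, float]:
    """Hypothetical helper: separate missing values from values of the wrong type."""
    element_count = len(values)
    nonmissing = [v for v in values if v is not None]
    missing_count = element_count - len(nonmissing)
    unexpected_count = sum(1 for v in nonmissing if not isinstance(v, expected_type))
    return {
        "element_count": element_count,
        "missing_count": missing_count,
        "missing_percent": 100.0 * missing_count / element_count if element_count else 0.0,
        "unexpected_count": unexpected_count,
        # Assumption: share of all rows that are unexpected.
        "unexpected_percent": 100.0 * unexpected_count / element_count if element_count else 0.0,
        # Share of the non-null rows that are unexpected.
        "unexpected_percent_nonmissing": (
            100.0 * unexpected_count / len(nonmissing) if nonmissing else 0.0
        ),
    }


# Two missing values and one numeric "region" out of five rows.
stats = summarize_type_check(["Albany", None, 42, "Boston", None], str)
print(stats)
```

Here the single numeric value accounts for 20% of all rows, but for a third of the rows that actually contain a value.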

Now that we covered the basics, let's get to some fancier expectations. For example, we could make sure that all `Date`s are in the expected format:

In [10]:
batch.expect_column_values_to_match_strftime_format('Date', "%Y-%m-%d")

{
  "success": true,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "result": {
    "element_count": 18249,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  }
}
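
If you are wondering what "matching a strftime format" means in practice, one way to picture it is trying to parse every value with that format and collecting the ones that fail. The sketch below does this with the standard library only. The `matches_strftime_format` helper and the sample dates are ours, and we are not claiming that Great Expectations performs the check exactly like this.

```python
from datetime import datetime
from typing import List


def matches_strftime_format(value: str, fmt: str) -> bool:
    """Return True if `value` can be parsed with the strftime-style format `fmt`."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


# Two dates in the expected format and one made-up offender.
dates: List[str] = ["2015-12-27", "2015-12-20", "27/12/2015"]
unexpected_dates = [d for d in dates if not matches_strftime_format(d, "%Y-%m-%d")]
print(unexpected_dates)
```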

Another example: we can make sure all the listed avocado prices are reasonable.

In [11]:
batch.expect_column_values_to_be_between('AveragePrice', min_value=0.5, max_value=3.0)

{
  "success": false,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "result": {
    "element_count": 18249,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 11,
    "unexpected_percent": 0.06027727546714888,
    "unexpected_percent_nonmissing": 0.06027727546714888,
    "partial_unexpected_list": [
      0.49,
      0.46,
      3.03,
      3.12,
      3.25,
      0.44,
      0.49,
      0.48,
      3.05,
      3.04,
      3.17
    ]
  }
}

Oops! That failed. Looks like we have some outliers here! Great Expectations helpfully collected them for us. By default, it will collect up to 20 examples of values that didn't meet the expectation (that's why it's called the _partial_ unexpected list).

If we want to allow these outliers, we can add some tolerance to the check by using the `mostly` parameter. Let's replace that expectation with a new one that only expects 99% of avocados to be priced within the range we specified.

In [12]:
batch.expect_column_values_to_be_between('AveragePrice', min_value=0.5, max_value=3.0, mostly=0.99)

{
  "success": true,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "result": {
    "element_count": 18249,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 11,
    "unexpected_percent": 0.06027727546714888,
    "unexpected_percent_nonmissing": 0.06027727546714888,
    "partial_unexpected_list": [
      0.49,
      0.46,
      3.03,
      3.12,
      3.25,
      0.44,
      0.49,
      0.48,
      3.05,
      3.04,
      3.17
    ]
  }
}
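
Why does the same data pass now? A rough way to think about `mostly` is as a threshold on the fraction of values that meet the expectation. The sketch below illustrates that idea with a hypothetical `between_with_mostly` helper, and then redoes the arithmetic with the counts from the output above. It is a simplification under that assumption, not the library's implementation.

```python
from typing import List


def between_with_mostly(values: List[float], min_value: float, max_value: float,
                        mostly: float = 1.0) -> bool:
    """Hypothetical helper: pass when at least `mostly` of the values are in range."""
    if not values:
        return True
    in_range = sum(1 for v in values if min_value <= v <= max_value)
    return in_range / len(values) >= mostly


# Counts taken from the output above: 11 of 18249 prices fall outside the range.
element_count = 18249
unexpected_count = 11
passing_fraction = (element_count - unexpected_count) / element_count
print(f"{passing_fraction:.4%} of prices are in range")
print(passing_fraction >= 0.99)  # tolerant check, like mostly=0.99
print(passing_fraction >= 1.0)   # strict check, like the first attempt

# The same idea on a tiny made-up list: one of four prices is too low.
sample_prices = [0.49, 1.0, 1.2, 1.5]
print(between_with_mostly(sample_prices, 0.5, 3.0))
print(between_with_mostly(sample_prices, 0.5, 3.0, mostly=0.75))
```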

Another common use case is when you only expect a certain set of values to show up in a column. This is the case for our `type` column, since we only know about `conventional` and `organic` avocados. Let's add a check for that:

In [13]:
batch.expect_column_distinct_values_to_be_in_set('type', ['conventional', 'organic'])

{
  "success": true,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "result": {
    "observed_value": [
      "conventional",
      "organic"
    ],
    "element_count": 18249,
    "missing_count": null,
    "missing_percent": null
  }
}
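
Conceptually, this check boils down to a subset test: the distinct values observed in the column should all be members of the allowed set. Here is a small sketch of that idea in plain Python. The `distinct_values_in_set` helper and the extra `hydroponic` value are made up for illustration.

```python
from typing import Iterable, Set


def distinct_values_in_set(values: Iterable[str], allowed: Set[str]) -> bool:
    """Return True if every distinct value in `values` is a member of `allowed`."""
    return set(values) <= allowed


avocado_types = ["conventional", "organic", "organic", "conventional"]
allowed_types = {"conventional", "organic"}
print(distinct_values_in_set(avocado_types, allowed_types))
print(distinct_values_in_set(avocado_types + ["hydroponic"], allowed_types))
```

Note that a subset test like this one would also pass if one of the allowed values never showed up at all.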

We could even add a check on the value frequencies! For example, if we expect organic and conventional avocados to occur roughly equally often, we could check the [Kullback-Leibler divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) between our assumed distribution and the one that is observed in the dataset.

In [14]:
partition_object = {
    'values': ['conventional', 'organic'],
    'weights': [0.5, 0.5],
    
}
batch.expect_column_kl_divergence_to_be_less_than('type', partition_object, 0.1)

{
  "success": true,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "result": {
    "observed_value": 1.351245850704074e-08,
    "element_count": 18249,
    "missing_count": null,
    "missing_percent": null
  }
}
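
For the curious, the divergence itself is easy to compute by hand for a two-valued column. The sketch below uses the standard definition, the sum of p * log(p / q) over all categories, with natural logarithms. The counts are hypothetical, and which distribution plays the role of p (observed or expected) is an assumption on our side, so the exact number may differ from what the library reports.

```python
import math
from typing import Sequence


def kl_divergence(observed: Sequence[float], expected: Sequence[float]) -> float:
    """KL divergence D(observed || expected) for two discrete distributions, in nats."""
    # Categories with zero observed probability contribute nothing to the sum.
    return sum(p * math.log(p / q) for p, q in zip(observed, expected) if p > 0)


# Hypothetical counts for the two avocado types, used only for illustration.
counts = {"conventional": 5010, "organic": 4990}
total = sum(counts.values())
observed_weights = [counts["conventional"] / total, counts["organic"] / total]
expected_weights = [0.5, 0.5]

divergence = kl_divergence(observed_weights, expected_weights)
print(divergence, divergence < 0.1)
```

A divergence of zero means the two distributions are identical, and the value grows as the observed frequencies drift away from the expected 50/50 split.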

Now that we checked out some expectations, maybe try adding one yourself? You can check out the [glossary of expectations](https://docs.greatexpectations.io/en/latest/reference/glossary_of_expectations.html) for a complete list of what you can do. Go wild!

In [15]:
# The stage is all yours

<a id="section-expectation-suite"></a>
## The Expectation Suite

So, while we were experimenting up there, Great Expectations remembered all the expectations we ran. We can now retrieve the suite contents as follows:

In [16]:
batch.get_expectation_suite()

{
  "data_asset_type": "Dataset",
  "expectation_suite_name": "check_avocado_data",
  "meta": {
    "great_expectations_version": "0.13.5"
  },
  "expectations": [
    {
      "meta": {},
      "kwargs": {
        "column": "Date"
      },
      "expectation_type": "expect_column_to_exist"
    },
    {
      "meta": {},
      "kwargs": {
        "column": "region"
      },
      "expectation_type": "expect_column_values_to_not_be_null"
    },
    {
      "meta": {},
      "kwargs": {
        "column": "region",
        "type_": "str"
      },
      "expectation_type": "expect_column_values_to_be_of_type"
    },
    {
      "meta": {},
      "kwargs": {
        "column": "Date",
        "strftime_format": "%Y-%m-%d"
      },
      "expectation_type": "expect_column_values_to_match_strftime_format"
    },
    {
      "meta": {},
      "kwargs": {
        "column": "AveragePrice",
        "min_value": 0.5,
        "max_value": 3.0,
        "mostly": 0.99
      },
      "expectation_type":

That gave us the `dict` representation Great Expectations uses under the hood to keep track of our expectation suite. Can you recognise some of the expectations we wrote?

An expectation suite is just a sequence of expectations, as shown below.
<img src="figures/expectation_suite.png">
This representation can then be saved to a file, so that we can load it again at another time, without depending on the Python code that produced it.

Note that by default, expectations that failed on the `batch` we ran them against will be omitted. If you want to include them anyways, you could add the `discard_failed_expectations=False` parameter.

In [18]:
batch.save_expectation_suite()

What did that command do? Let's open up our configuration folder to try and find our expectation suite.

In [19]:
!tree great_expectations -nI "uncommitted"

great_expectations
├── checkpoints
├── expectations
│   └── check_avocado_data.json
├── great_expectations.yml
├── notebooks
│   ├── pandas
│   │   └── validation_playground.ipynb
│   ├── spark
│   │   └── validation_playground.ipynb
│   └── sql
│       └── validation_playground.ipynb
└── plugins
    └── custom_data_docs
        ├── renderers
        ├── styles
        │   └── data_docs_custom_styles.css
        └── views

11 directories, 6 files


We will get back to the configuration in a minute [[Data Context]](#section-data-context), so don't get confused about this yet.

As you can see, the `save_expectation_suite` command saved our `check_avocado_data` suite to the `expectations` folder. That's all there is to it: the expectation suite is just a JSON file. It contains the same internal representation that we retrieved from `get_expectation_suite()`. You can check it out if you like.

In [20]:
!cat great_expectations/expectations/check_avocado_data.json

{
  "data_asset_type": "Dataset",
  "expectation_suite_name": "check_avocado_data",
  "expectations": [
    {
      "expectation_type": "expect_column_to_exist",
      "kwargs": {
        "column": "Date"
      },
      "meta": {}
    },
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {
        "column": "region"
      },
      "meta": {}
    },
    {
      "expectation_type": "expect_column_values_to_be_of_type",
      "kwargs": {
        "column": "region",
        "type_": "str"
      },
      "meta": {}
    },
    {
      "expectation_type": "expect_column_values_to_match_strftime_format",
      "kwargs": {
        "column": "Date",
        "strftime_format": "%Y-%m-%d"
      },
      "meta": {}
    },
    {
      "expectation_type": "expect_column_values_to_be_between",
      "kwargs": {
        "column": "AveragePrice",
        "max_value": 3.0,
        "min_value": 0.5,
        "mostly": 0.99
      }

Expectations are stored in the *expectation store*, which by default is the `expectations` folder inside your configuration, but you can use other storage backends as well, such as a SQL database or cloud storage (S3, Azure Blob Storage or GCS). See [metadata stores](https://docs.greatexpectations.io/en/latest/guides/how_to_guides/configuring_metadata_stores.html) for more information.
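
Because a suite is plain JSON, you can inspect it with nothing but the standard library. The sketch below parses a shortened copy of the suite structure shown above and lists which expectation applies to which column. The `list_expectations` helper is hypothetical; in a real project you would read the file from your expectation store instead of an inline string.

```python
import json
from typing import List, Tuple

# A shortened suite, mirroring the structure of the file shown above.
suite_json = """
{
  "data_asset_type": "Dataset",
  "expectation_suite_name": "check_avocado_data",
  "expectations": [
    {"expectation_type": "expect_column_to_exist",
     "kwargs": {"column": "Date"}, "meta": {}},
    {"expectation_type": "expect_column_values_to_not_be_null",
     "kwargs": {"column": "region"}, "meta": {}}
  ]
}
"""


def list_expectations(suite_dict: dict) -> List[Tuple[str, str]]:
    """Hypothetical helper: return (expectation_type, column) pairs from a suite dict."""
    return [
        (e["expectation_type"], e["kwargs"].get("column", ""))
        for e in suite_dict["expectations"]
    ]


suite_dict = json.loads(suite_json)
for expectation_type, column in list_expectations(suite_dict):
    print(f"{column}: {expectation_type}")
```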

<a id="validation-results"></a>
Now that we added our expectation suite to our `DataContext`, we can try running the entire suite.
Validating your data against an expectation suite is done by running a **validation operator**. A validation operator describes what should be done with your validation results. Here, we would like to store the results on disk and generate a friendly report on them. We'll show you how this is configured in the [Data Context section](#section-data-context). In the meantime, we'll use `my_validation_operator`, which we shipped with the configuration.


In [22]:
results = context.run_validation_operator('my_validation_operator', assets_to_validate=[batch])

One validation run can include multiple batches and expectation suites. This way, it is possible to test multiple files in the same run. Compare this to how one run of your test suite can test multiple software modules.

We didn't explicitly specify the expectation suite to use with our data batch, but remember that the `batch` variable kept track of the expectation suite for us.

Now that we got through that, let's have a look at the results.

In [23]:
results

{
  "run_results": {
    "ValidationResultIdentifier::check_avocado_data/20210121T134619.799634Z/20210121T134619.799634Z/966da3deeba5d9b2be246213aa75e7b7": {
      "validation_result": {
        "success": true,
        "statistics": {
          "evaluated_expectations": 7,
          "successful_expectations": 7,
          "unsuccessful_expectations": 0,
          "success_percent": 100.0
        },
        "meta": {
          "great_expectations_version": "0.13.5",
          "expectation_suite_name": "check_avocado_data",
          "run_id": {
            "run_time": "2021-01-21T13:46:19.799634+00:00",
            "run_name": "20210121T134619.799634Z"
          },
          "batch_kwargs": {
            "path": "data/avocado.csv",
            "datasource": "data_dir",
            "data_asset_name": "avocado",
            "reader_method": "read_csv",
            "reader_options": {
              "index_col": 0
            }
          },
          "batch_markers": {
            "ge_load

This is called a *validation result*. Validation results are kept in the *validation store*, which is the `great_expectations/uncommitted/validations` directory by default.

In [24]:
!tree -n great_expectations/uncommitted/validations

great_expectations/uncommitted/validations
└── check_avocado_data
    └── 20210121T134619.799634Z
        └── 20210121T134619.799634Z
            └── 966da3deeba5d9b2be246213aa75e7b7.json

3 directories, 1 file


Great Expectations also allows you to set other backends as a validation store, such as your favourite cloud storage offering, or a SQL database. Check out [metadata stores](https://docs.greatexpectations.io/en/latest/guides/how_to_guides/configuring_metadata_stores.html) if you would like to learn more!
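
The `statistics` block of a validation result is a simple roll-up of the individual expectation results. As a rough sketch of that aggregation, assuming each result is a dict with a boolean `success` key as in the outputs we saw earlier, a hypothetical `summarize_run` helper could look like this:

```python
from typing import Dict, List


def summarize_run(expectation_results: List[dict]) -> Dict[str, float]:
    """Hypothetical helper: aggregate per-expectation results into run statistics."""
    evaluated = len(expectation_results)
    successful = sum(1 for r in expectation_results if r["success"])
    return {
        "evaluated_expectations": evaluated,
        "successful_expectations": successful,
        "unsuccessful_expectations": evaluated - successful,
        # Assumption of this sketch: an empty run is reported as 0 percent.
        "success_percent": 100.0 * successful / evaluated if evaluated else 0.0,
    }


# Seven passing results, matching the counts in the statistics shown earlier.
run_stats = summarize_run([{"success": True}] * 7)
print(run_stats)
```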

<a id="section-data-docs"></a>
## Data Docs

We can render these results to a friendly report, called a data doc. These data docs will describe the expectations that the data should meet, as well as the metrics detailing how well the data meets the requirements. This is how Great Expectations combines testing with documenting. Running the code below will generate the data docs and open them in a new tab. Make sure to have a look around. You'll see the code we ran above reflected in the different sections - it's pretty self-explanatory!

In [25]:
context.open_data_docs()

You can now have a look at the data docs. Just go to http://127.0.0.1:8888/view/great_expectations/uncommitted/data_docs/local_site/index.html in your current browser.

If you are just reading along, you can check out the generated docs [here](https://dataroots.gitlab.io/internal-public/tutorial-great_expectations/validations/check_avocado_data/20210119T131032.261169Z/20210119T131032.261169Z/a52e8a35d5f03815b708c7306612dbde.html).

Just like for validation results, different storage backends can be configured for your data docs. You could, for example, host them on cloud storage for easy viewing. Refer to [configuring data docs](https://docs.greatexpectations.io/en/latest/guides/how_to_guides/configuring_data_docs.html) for more information.

<a id="section-data-context"></a>
## Data Context

Before we move on, let's take a moment to look at the `DataContext`, which represents your Great Expectations setup. It consists of a directory holding configuration files, named `great_expectations` by default.

Note: we are omitting the `uncommitted` directory here. It contains output files (such as rendered data docs), which are not part of the configuration.

In [26]:
!tree great_expectations -nI 'uncommitted'

great_expectations
├── checkpoints
├── expectations
│   └── check_avocado_data.json
├── great_expectations.yml
├── notebooks
│   ├── pandas
│   │   └── validation_playground.ipynb
│   ├── spark
│   │   └── validation_playground.ipynb
│   └── sql
│       └── validation_playground.ipynb
└── plugins
    └── custom_data_docs
        ├── renderers
        ├── styles
        │   └── data_docs_custom_styles.css
        └── views

11 directories, 6 files


The main configuration is located in `great_expectations.yml`. We won't go into all the details here, you can refer to the [data context reference](https://docs.greatexpectations.io/en/latest/reference/spare_parts/data_context_reference.html) for that. Instead, we'll just introduce some concepts you'll want to be familiar with:

- A **data source** is something that can provide data to Great Expectations, such as an SQL database.
- A **data asset** is one dataset that lives in a *data source*, such as an SQL table.

In the configuration we provided, there is one *data source* named `data_dir`, which is just a folder with CSV files inside. The `avocado.csv` file we are working with is a *data asset*.
More information on data sources can be found in the [data context reference](https://docs.greatexpectations.io/en/latest/reference/spare_parts/data_context_reference.html#datasources). For configuring your own, refer to the [configuring datasources](https://docs.greatexpectations.io/en/latest/guides/how_to_guides/configuring_datasources.html) guides.

- A **validation operator** specifies what should be done with your validation results. Some examples could be writing the validation results to a database, publishing data docs, or sending a notification to a Slack channel.
    If you'd like to know more you can check out the [validation operators and actions](https://docs.greatexpectations.io/en/latest/reference/core_concepts/validation_operators_and_actions.html) and [how to add a validation operator](https://docs.greatexpectations.io/en/latest/guides/how_to_guides/validation/how_to_add_a_validation_operator.html) documentation pages.


- **stores** can be used to configure how expectation and validation data will be stored. See [configuring metadata stores](https://docs.greatexpectations.io/en/latest/guides/how_to_guides/configuring_metadata_stores.html) if you're interested.

These are all configured in the `great_expectations.yml` file. We'll have a brief look at its contents now, but don't mind it too much, this is here for illustration purposes only.

In [27]:
!cat great_expectations/great_expectations.yml

# Welcome to Great Expectations! Always know what to expect from your data.
#
# Here you can define datasources, batch kwargs generators, integrations and
# more. This file is intended to be committed to your repo. For help with
# configuration please:
#   - Read our docs: https://docs.greatexpectations.io/en/latest/how_to_guides/spare_parts/data_context_reference.html#configuration
#   - Join our slack channel: http://greatexpectations.io/slack

# config_version refers to the syntactic version of this config file, and is used in maintaining backwards compatibility
# It is auto-generated and usually does not need to be changed.
config_version: 2.0

# Datasources tell Great Expectations where your data lives and how to get it.
# You can use the CLI command `great_expectations datasource new` to help you
# add a new datasource. Read more at https://docs.greatexpectations.io/en/latest/reference/core_concepts/datasource_reference.html
datasources:
  data_dir:
    batch_kwa

In addition, we also have two important directories: `expectations`, which holds our expectation suites, and `checkpoints`, which we'll check out next.

The diagram below shows a representation of our data context.
<img src="figures/data_context.png" width=800px>

<a id="section-checkpoints"></a>
## Checkpoints

Remember how we launched a validation run back in the [Expectation Suite section](#section-expectation-suite)? There, we wrote code to run the validation on the data batch and expectation suite that we defined earlier on. If we bundle all these run parameters in a single configuration file, we could easily rerun the validation, for example each time our data changes. Such a configuration file is called a `Checkpoint` in Great Expectations.

As a quick reminder, for running a validation we need:
- A *validation operator* to handle the validation results
- A list of *batches*, each consisting of
    - A batch of data to check
    - expectation suites to check against
    
To create a checkpoint, we simply create a file in the `checkpoints` directory of our great_expectations configuration.

In [29]:
%%writefile great_expectations/checkpoints/avocado_data.yml

validation_operator_name: my_validation_operator
batches:
  - batch_kwargs:
      path: data/avocado.csv
      datasource: data_dir
      data_asset_name: avocado
      reader_method: read_csv
      reader_options:
        index_col: 0
    expectation_suite_names:
      - check_avocado_data

Writing great_expectations/checkpoints/avocado_data.yml
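
To see how the pieces of this file fit together, it can help to picture the checkpoint as a nested structure: one validation operator, plus a list of batches that each carry their own list of suites. The sketch below writes that structure as a plain Python dict and expands it into the individual validations it describes. The `iter_validations` helper is hypothetical and only illustrates the shape of the configuration; it does not run anything in Great Expectations.

```python
from typing import Iterator, Tuple

# The checkpoint as a plain dict, using the values from this tutorial.
checkpoint_config = {
    "validation_operator_name": "my_validation_operator",
    "batches": [
        {
            "batch_kwargs": {
                "path": "data/avocado.csv",
                "datasource": "data_dir",
                "data_asset_name": "avocado",
                "reader_method": "read_csv",
                "reader_options": {"index_col": 0},
            },
            "expectation_suite_names": ["check_avocado_data"],
        }
    ],
}


def iter_validations(config: dict) -> Iterator[Tuple[str, str]]:
    """Hypothetical helper: yield one (data path, suite name) pair per validation."""
    for batch_config in config["batches"]:
        for suite_name in batch_config["expectation_suite_names"]:
            yield batch_config["batch_kwargs"]["path"], suite_name


for data_path, suite_name in iter_validations(checkpoint_config):
    print(f"validate {data_path} against {suite_name}")
```

Adding a second entry under `batches`, or a second suite name for the same batch, would simply add more pairs to the run.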


The `batch_kwargs` property specifies how the data asset should be loaded. You might recognise the parameters from when we first loaded the `avocado.csv` file.

This might also be a good time to point out that our data batch will get read by pandas under the hood (we configured that in the `data_dir` data source). In `batch_kwargs`, we specify that we'd like to use the pandas `read_csv` method, which will receive the `reader_options` dict as additional parameters.

We created the file manually here for demonstration purposes, but when doing this in your own project you probably want to use the CLI [[The Great Expectations CLI]](#section-cli), which will also help you set the right parameters. If you need to configure them further, try [creating batches](https://docs.greatexpectations.io/en/latest/guides/how_to_guides/creating_batches.html).

The checkpoint can be executed using the Great Expectations CLI:

In [30]:
!great_expectations checkpoint run avocado_data

Heads up! This feature is Experimental. It may change. Please give us your feedback!
Validation succeeded!

Suite Name                                   Status     Expectations met
- check_avocado_data                         ✔ Passed   7 of 7 (100.0 %)


So, to summarize: a checkpoint is a _runnable check_ for your data. Checkpoints are your first stop for integrating Great Expectations into your pipelines and workflows.
For more info on how to do that, refer to the [validation guides](https://docs.greatexpectations.io/en/latest/guides/how_to_guides/validation.html), or the [workflows and patterns](https://docs.greatexpectations.io/en/latest/guides/workflows_patterns.html) guides.

Checkpoints and batches are represented conceptually below.

<img src="figures/checkpoint.png" width=600px>
<img src="figures/batch.png" width=600px>

<a id="section-profiling"></a>
## Profiling

In the previous sections we explored how we could get some metrics about our data using expectations. But what if you don't know what exactly to expect of your data? Well, you could try using Great Expectations' profiling feature, which can try to extract some useful metrics from your data. To try profiling our preconfigured `data_dir` data source, we can use the CLI:

In [33]:
!great_expectations datasource profile data_dir -y

Heads up! This feature is Experimental. It may change. Please give us your feedback!
Profiling 'data_dir' will create expectations and documentation.
            Preparing column 1 of 14: Unnamed: 0
            Preparing column 2 of 14: Date
            Preparing column 3 of 14: AveragePrice
            Preparing column 4 of 14: Total Volume
            Preparing column 5 of 14: 4046
            Preparing column 6 of 14: 4225
            Preparing column 7 of 14: 4770
            Preparing column 8 of 14: Total Bags
            Preparing column 9 of 14: Small Bags
            Preparing column 10 of 14: Large Bags
            Preparing column 11 of 14: XLarge Bags
            Preparing column 12 of 14: type
            Preparing column 13 of 14: year
            Preparing column 14 of 14: region


Great Expectations is building Data Docs from the data you just profiled!

The following Data Docs sites will be built:

 - local_site: file:///home/ilion/src/tut

Running that command should have presented you with freshly built data docs. You can find the results in the `Profiling Results` tab. The profiler also generated an expectation suite based on its observations, which you can find in the `Expectation Suites` tab. Be mindful that this is an experimental feature and the generated suite is usually not that helpful, but it could be a good starting point for writing your own.


If you'd like to know more about profiling, the [profiling reference](https://docs.greatexpectations.io/en/latest/reference/spare_parts/profiling_reference.html) can help you out.
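
To get a feel for what a profiler does, and why its output needs a critical eye, here is a deliberately naive sketch. The hypothetical `profile_column` function looks at the values of one column and proposes expectations that happen to hold for them. It is a toy written for this tutorial and has nothing to do with the actual profiler in Great Expectations.

```python
from typing import Any, Dict, List


def profile_column(name: str, values: List[Any]) -> List[Dict[str, Any]]:
    """Hypothetical toy profiler: propose expectations from observed values."""
    proposals: List[Dict[str, Any]] = []
    nonnull = [v for v in values if v is not None]
    # No nulls observed, so propose a not-null expectation.
    if len(nonnull) == len(values):
        proposals.append({
            "expectation_type": "expect_column_values_to_not_be_null",
            "kwargs": {"column": name},
        })
    # All values numeric, so propose the observed range as bounds.
    if nonnull and all(isinstance(v, (int, float)) for v in nonnull):
        proposals.append({
            "expectation_type": "expect_column_values_to_be_between",
            "kwargs": {"column": name, "min_value": min(nonnull), "max_value": max(nonnull)},
        })
    return proposals


# The first five prices from the table at the start of this tutorial.
proposed = profile_column("AveragePrice", [1.33, 1.35, 0.93, 1.08, 1.28])
for proposal in proposed:
    print(proposal["expectation_type"], proposal["kwargs"])
```

Notice how the proposed bounds simply echo the data that was seen. A generated suite like this describes what the data looked like once, not what it should look like, which is why it works better as a starting point than as a finished suite.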

<a id="section-cli"></a>
## The Great Expectations CLI

For the purposes of this tutorial, we mostly interacted directly with Great Expectations. If you are going to set up and use Great Expectations for yourself, we recommend using the CLI as much as possible. The concepts should be familiar by now - refer to the [CLI guide](https://docs.greatexpectations.io/en/latest/guides/how_to_guides/miscellaneous/command_line.html) for more.

In [34]:
!great_expectations --help

Usage: great_expectations [OPTIONS] COMMAND [ARGS]...

  Welcome to the great_expectations CLI!

  Most commands follow this format: great_expectations <NOUN> <VERB>

  The nouns are: datasource, docs, project, suite, validation-operator

  Most nouns accept the following verbs: new, list, edit

  In particular, the CLI supports the following special commands:

  - great_expectations init : create a new great_expectations project

  - great_expectations datasource profile : profile a datasource

  - great_expectations docs build : compile documentation from expectations

Options:
  --version      Show the version and exit.
  -v, --verbose  Set great_expectations to use verbose output.
  --help         Show this message and exit.

Commands:
  checkpoint           Checkpoint operations
  datasource           Datasource operations
  docs                 Data Docs operations
  init                 Initialize a new Great Expectations project.
  project           

<a id="section-setup"></a>
## Setting up your own project

To initialize your own project, run `great_expectations init` and follow the instructions. This will scaffold a simple configuration for you, just like the one we provided.

Once you created your suite using `great_expectations suite new`, you can use the `great_expectations suite edit` command to open up an auto-generated notebook that you can use to set up your suite. You should be able to recognise the structure of the first part of this notebook a bit ;-)

The [getting started guide](https://docs.greatexpectations.io/en/latest/guides/tutorials/getting_started.html) can help you along the way. For ideas on how Great Expectations can fit into your workflow, check out [Deployment patterns](https://docs.greatexpectations.io/en/latest/reference/core_concepts/validation.html#deployment-patterns).

<a id="section-conclusion"></a>
## Final words

Just to recap, in this tutorial notebook, we started by giving you an overview of the tool and its purpose. We then showed you how to get started with the Python library and define your expectations. We saw that expectations can be bundled as suites, which can be used with validation operators to produce validation results. We had a look at data docs, a clean way to visualize your results and data documentation. We then dived into the data context, showing how the tool is configured. We had a look at checkpoints, which allow you to automate your data testing. We talked a bit about profiling, an experimental feature to generate expectations from given data. Finally, we introduced you to the CLI and set you on the right path to start using Great Expectations right away!

We hope you enjoyed the tutorial and wish you all the best in using Great Expectations with your projects!

Interested in support? Feel free to reach out to info@dataroots.io.