# What is Great Expectations?

# Getting started

Let's jump into it then!

In [55]:
import great_expectations as ge

TODO: explain what a context is

Creating a new `DataContext` object makes great_expectations read the configuration we have already set up for you. Don't worry about that for now; we'll get back to it later.

In [56]:
context = ge.data_context.DataContext()

Now that we have our data context ready, we can add an expectation suite. Think of this like a test suite, but for your data instead of for your code.

In [24]:
suite = context.create_expectation_suite(
    "check_avocado_data",
    overwrite_existing=True
)

Next, we load our dataset, `avocado.csv`, from our data context. Again, don't worry about this too much: great_expectations usually handles this for you.

In [25]:
batch_kwargs = {
    'datasource': 'data_dir',
    'path': 'data/avocado.csv',
}
batch = context.get_batch(batch_kwargs, suite)

Alright, that's it for setup! Now let's have a look at the data we are working with here.

In [60]:
batch.head()

| Unnamed: 0.1 | Unnamed: 0 | Date | AveragePrice | Total Volume | 4046 | 4225 | 4770 | Total Bags | Small Bags | Large Bags | XLarge Bags | type | year | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2015-12-27 | 1.33 | 64236.62 | 1036.74 | 54454.85 | 48.16 | 8696.87 | 8603.62 | 93.25 | 0.0 | conventional | 2015 | Albany |
| 1 | 1 | 2015-12-20 | 1.35 | 54876.98 | 674.28 | 44638.81 | 58.33 | 9505.56 | 9408.07 | 97.49 | 0.0 | conventional | 2015 | Albany |
| 2 | 2 | 2015-12-13 | 0.93 | 118220.22 | 794.7 | 109149.67 | 130.5 | 8145.35 | 8042.21 | 103.14 | 0.0 | conventional | 2015 | Albany |
| 3 | 3 | 2015-12-06 | 1.08 | 78992.15 | 1132.0 | 71976.41 | 72.58 | 5811.16 | 5677.4 | 133.76 | 0.0 | conventional | 2015 | Albany |
| 4 | 4 | 2015-11-29 | 1.28 | 51039.6 | 941.48 | 43838.39 | 75.78 | 6183.95 | 5986.26 | 197.69 | 0.0 | conventional | 2015 | Albany |


TODO: explain dataset columns?

TODO: what is an expectation?

Now, let's run our very first expectation!

In [27]:
batch.expect_column_to_exist('Date')

{
  "success": true,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {},
  "meta": {}
}

The expectation succeeded! We received a JSON object describing the result of the check. Since this was a very basic expectation, there is not much info here, but keep an eye on the results as we proceed to more complicated expectations.
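If you want to act on a result rather than just look at it, the two fields to check first are `success` and `exception_info`. Here is a minimal sketch that parses the JSON text printed above with the standard library. The variable names (`raw_result`, `passed`, `crashed`) are made up for this illustration, and the sketch works on the printed text only, not on the object great_expectations actually returns.

```python
import json

# The JSON text printed by the expectation above, pasted in as a string.
raw_result = """
{
  "success": true,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {},
  "meta": {}
}
"""

result = json.loads(raw_result)

# "success" says whether the expectation held for the data.
passed = result["success"]
# "exception_info" says whether the check itself raised an error.
crashed = result["exception_info"]["raised_exception"]

print(f"passed={passed}, crashed={crashed}")
```

A failed expectation and a crashed expectation are different things: the first means the data did not match, the second means the check could not be evaluated at all.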

A common use case is checking the completeness of your data. We'll add a check to make sure that the `region` field is present in every record.

In [64]:
batch.expect_column_values_to_not_be_null('region')

{
  "success": true,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "element_count": 18249,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {}
}

That worked! In the `result` section we can see that all 18249 records passed the check.
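To get a feel for what the numbers in `result` mean, here is a plain-Python sketch of a not-null check. It is not how great_expectations implements the expectation. The function name `summarize_not_null` and the toy data are hypothetical, and only `None` is treated as missing here.

```python
def summarize_not_null(values):
    """Count missing entries and report them using the field names shown above."""
    element_count = len(values)
    # In this sketch only None counts as "null".
    unexpected_count = sum(1 for v in values if v is None)
    unexpected_percent = (
        100.0 * unexpected_count / element_count if element_count else 0.0
    )
    return {
        "success": unexpected_count == 0,
        "result": {
            "element_count": element_count,
            "unexpected_count": unexpected_count,
            "unexpected_percent": unexpected_percent,
        },
    }


# Toy column with one missing region (made-up data, not the avocado file).
toy_regions = ["Albany", "Albany", None, "Albany"]
summary = summarize_not_null(toy_regions)
print(summary)
```

With one missing value out of four, the check fails and `unexpected_percent` is reported on a 0 to 100 scale, just like in the output above.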

We can also check for the value types:

In [71]:
batch.expect_column_values_to_be_of_type('region', 'str')

{
  "success": true,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "element_count": 18249,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {}
}

In [42]:
batch.expect_column_values_to_be_between('AveragePrice', min_value=0.5, max_value=3.0)

{
  "success": false,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "element_count": 18249,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 11,
    "unexpected_percent": 0.06027727546714888,
    "unexpected_percent_nonmissing": 0.06027727546714888,
    "partial_unexpected_list": [
      0.49,
      0.46,
      3.03,
      3.12,
      3.25,
      0.44,
      0.49,
      0.48,
      3.05,
      3.04,
      3.17
    ]
  },
  "meta": {}
}

Oops! That failed. The `result` section shows that 11 of the 18249 prices fall outside the range, so it looks like we have some outliers here!

If we want to allow this, we can add some tolerance to the check by using the `mostly` parameter. Let's settle for having 99% of avocados being priced within the range we specified.

In [74]:
batch.expect_column_values_to_be_between('AveragePrice', min_value=0.5, max_value=3.0, mostly=0.99)

{
  "success": true,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "element_count": 18249,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 11,
    "unexpected_percent": 0.06027727546714888,
    "unexpected_percent_nonmissing": 0.06027727546714888,
    "partial_unexpected_list": [
      0.49,
      0.46,
      3.03,
      3.12,
      3.25,
      0.44,
      0.49,
      0.48,
      3.05,
      3.04,
      3.17
    ]
  },
  "meta": {}
}
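The arithmetic behind `mostly` is easy to check by hand. The sketch below redoes it in plain Python using the counts from the output above (18249 records, 11 unexpected). The helper `passes_with_mostly` is a hypothetical name, and the sketch assumes there are no missing values, which is the case here.

```python
def passes_with_mostly(element_count, unexpected_count, mostly=1.0):
    """Return (success, fraction_ok) for a check with a tolerance threshold."""
    fraction_ok = (element_count - unexpected_count) / element_count
    # The check succeeds when at least `mostly` of the values are as expected.
    return fraction_ok >= mostly, fraction_ok


# Without tolerance, a single outlier is enough to fail the check.
strict_ok, fraction_ok = passes_with_mostly(18249, 11)

# With mostly=0.99, up to 1% of the values may fall outside the range.
tolerant_ok, _ = passes_with_mostly(18249, 11, mostly=0.99)

print(f"fraction within range: {fraction_ok:.6f}")
print(f"strict: {strict_ok}, with mostly=0.99: {tolerant_ok}")
```

Note that the reported counts and `partial_unexpected_list` are identical in both runs; only the `success` flag changes.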

If we only want to allow certain values, such as in categorical data, we can easily check for that:

In [75]:
batch.expect_column_distinct_values_to_be_in_set('type', ['conventional', 'organic'])

{
  "success": true,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "observed_value": [
      "conventional",
      "organic"
    ],
    "element_count": 18249,
    "missing_count": null,
    "missing_percent": null
  },
  "meta": {}
}
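Conceptually, this expectation looks at the set of distinct values in the column rather than at individual rows. A rough plain-Python equivalent, with a hypothetical helper name and toy data, might look like this:

```python
def distinct_values_in_set(values, allowed):
    """Check that every distinct value in the column is one of the allowed values."""
    observed = sorted(set(values))
    return {
        "success": set(observed) <= set(allowed),
        "observed_value": observed,
    }


# Toy column (made-up data) that only contains the two allowed types.
toy_types = ["conventional", "organic", "organic", "conventional"]
ok = distinct_values_in_set(toy_types, ["conventional", "organic"])

# The same column checked against a narrower set fails.
not_ok = distinct_values_in_set(toy_types, ["conventional"])

print(ok)
print(not_ok)
```

Because the check works on distinct values, the result reports an `observed_value` list instead of per-row counts such as `unexpected_count`.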

Suppose we would like our dataset to be balanced between conventional and organic avocados. great_expectations has us covered! Let's put an expectation on the distribution of avocado types, using the [Kullback-Leibler divergence](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence) from an equal distribution.

In [76]:
partition_object = {
    'values': ['conventional', 'organic'],
    'weights': [0.5, 0.5],
}
batch.expect_column_kl_divergence_to_be_less_than('type', partition_object, 0.1)

{
  "success": true,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "result": {
    "observed_value": 1.351245850704074e-08,
    "element_count": 18249,
    "missing_count": null,
    "missing_percent": null
  },
  "meta": {}
}
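To see where an `observed_value` that small comes from, we can compute the divergence by hand. The sketch below makes a few assumptions. The class counts are derived from the `expect_column_values_to_be_in_set` output further down in this notebook (9123 organic records out of 18249, which leaves 9126 conventional). The direction (observed relative to expected) and the natural logarithm are assumptions that appear consistent with the value reported above.

```python
import math


def kl_divergence(observed_weights, expected_weights):
    """KL divergence of the observed distribution from the expected one (natural log)."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(observed_weights, expected_weights)
        if p > 0  # terms with zero observed weight contribute nothing
    )


# Counts derived from the notebook outputs, see the note above.
counts = {"conventional": 9126, "organic": 9123}
total = sum(counts.values())

observed = [counts[name] / total for name in ("conventional", "organic")]
expected = [0.5, 0.5]

kl = kl_divergence(observed, expected)
print(f"KL divergence: {kl:.6e}")
print(f"below threshold 0.1: {kl < 0.1}")
```

A perfectly balanced dataset would give a divergence of exactly 0; the further the split drifts from 50/50, the larger the value gets.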

Now that we have tried out some expectations, why not add one yourself? You can check out the [glossary of expectations](https://docs.greatexpectations.io/en/latest/reference/glossary_of_expectations.html) for a complete list of what you can do.

In [63]:
# all yours

## Metrics

## Profiling: generating expectations

## Tests == docs

## Setting up data context and source

## (Airflow integration)

## (Spark)


In [None]:
!cat great_expectations/great_expectations.yml

In [1]:
import great_expectations as ge

In [2]:
my_df = ge.read_csv("data/avocado.csv")

In [3]:
my_df.head()
my_df.expect_column_values_to_be_in_set("type", ["conventional"])

{
  "success": false,
  "meta": {},
  "result": {
    "element_count": 18249,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 9123,
    "unexpected_percent": 49.991780371527206,
    "unexpected_percent_nonmissing": 49.991780371527206,
    "partial_unexpected_list": [
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic"
    ]
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
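The numbers in this failing result can be reproduced with a small row-level sketch in plain Python. It rebuilds a column with the same class counts as the output above (9123 organic out of 18249) and is not the library's implementation. The helper name is hypothetical, and the cap of 20 sample values is simply what the printed `partial_unexpected_list` suggests.

```python
def values_in_set(values, allowed, partial_limit=20):
    """Row-level check: every value must be one of the allowed values."""
    allowed = set(allowed)
    unexpected = [v for v in values if v not in allowed]
    return {
        "success": not unexpected,
        "result": {
            "element_count": len(values),
            "unexpected_count": len(unexpected),
            "unexpected_percent": 100.0 * len(unexpected) / len(values),
            # Only a sample of the offending values is kept.
            "partial_unexpected_list": unexpected[:partial_limit],
        },
    }


# Synthetic column with the same class counts as the avocado data.
column = ["conventional"] * 9126 + ["organic"] * 9123
summary = values_in_set(column, ["conventional"])

print(summary["success"])
print(summary["result"]["unexpected_count"])
print(len(summary["result"]["partial_unexpected_list"]))
```

Unlike the distinct-values check used earlier, this one counts every offending row, which is why roughly half the dataset shows up as unexpected.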

In [4]:
my_df.get_expectation_suite()

{
  "meta": {
    "great_expectations_version": "0.13.4"
  },
  "expectation_suite_name": "default",
  "data_asset_type": "Dataset",
  "expectations": []
}

## Expectations
