# Great Expectations tutorial

### What is Great Expectations?

Great Expectations is a tool for testing batch data. It also generates reports that document the data; this documentation is derived directly from the test definitions.

### Why you should use it: data quality

You should test your data for two main reasons:
- better data quality leads to better predictions and insights derived from that data,
- it's an additional way to test data pipelines.

### Key concepts and terminology

A *Datasource* is a source of data to be tested, for example a SQL database.

A *data asset* is a subset of the data in a Datasource whose records all share the same structure, for example a table in a SQL database.

An *expectation* is the definition of a test on a data asset.

An *expectation suite* is a set of expectations on a data asset.
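
To make these terms concrete, here is a rough sketch of what an expectation suite might look like as plain data. The outer layout mirrors the JSON that `get_expectation_suite()` prints later in this tutorial. The `expectation_type` and `kwargs` keys are an assumption about how individual expectations are stored, and `make_expectation` is a hypothetical helper introduced only for illustration.

```python
import json


def make_expectation(expectation_type, **kwargs):
    """Build a plain dict describing one test on a data asset (hypothetical helper)."""
    return {"expectation_type": expectation_type, "kwargs": kwargs}


# An expectation suite groups the expectations defined for one data asset.
suite = {
    "expectation_suite_name": "check_avocado_data",
    "expectations": [
        make_expectation(
            "expect_column_values_to_be_in_set",
            column="type",
            value_set=["conventional", "organic"],
        ),
    ],
}

print(json.dumps(suite, indent=2))
```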

### Install and set up Great Expectations

To install Great Expectations, run `pip install great_expectations` in your terminal. Installing packages with `pip` inside a virtual environment is good practice, because it keeps the project's dependencies separate from the rest of your system.
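
One way to confirm that the installation worked is to ask Python which version it can see. This is a small sketch using only the standard library (Python 3.8 or later); `installed_version` is a hypothetical helper, not part of Great Expectations.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("great_expectations")
if version is None:
    print("great_expectations is not installed in this environment")
else:
    print(f"great_expectations {version} is installed")
```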

To initialize Great Expectations for a project, run `great_expectations init` from the project's directory and follow the instructions.

For more information on setup, see the getting-started tutorial at https://docs.greatexpectations.io/en/latest/guides/tutorials/getting_started.html and the rest of the official documentation.

### How it works

Great Expectations stores everything related to a project in the `great_expectations` subdirectory in the project's directory.
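
To see what ends up in that subdirectory, you can list its files. The sketch below uses only the standard library; `list_project_files` is a hypothetical helper, and it assumes the notebook is run from the project's directory.

```python
from pathlib import Path


def list_project_files(project_root="."):
    """Return the relative paths of all files under <project_root>/great_expectations."""
    ge_dir = Path(project_root) / "great_expectations"
    if not ge_dir.is_dir():
        return []  # the project has not been initialized here
    return sorted(
        p.relative_to(ge_dir).as_posix() for p in ge_dir.rglob("*") if p.is_file()
    )


for name in list_project_files():
    print(name)
```

An empty result means no `great_expectations` subdirectory was found in the directory you passed in.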


# Getting started

In [23]:
import datetime
import great_expectations as ge
import great_expectations.jupyter_ux
from great_expectations.data_context.types.resource_identifiers import ValidationResultIdentifier

# Load the project configuration from the great_expectations/ subdirectory
context = ge.data_context.DataContext()

In [24]:
# Create an empty expectation suite, replacing any existing suite with the same name
suite = context.create_expectation_suite(
    "check_avocado_data",
    overwrite_existing=True
)

In [25]:
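# Describe which data to load: a CSV file read through the 'data_dir' datasource
batch_kwargs = {
    'datasource': 'data_dir',
    'path': 'data/avocado.csv',
}
# Load the data as a batch tied to the suite, so expectations can be run against it
batch = context.get_batch(batch_kwargs, suite)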

In [26]:
batch.head()

Unnamed: 0.1,Unnamed: 0,Date,AveragePrice,Total Volume,4046,4225,4770,Total Bags,Small Bags,Large Bags,XLarge Bags,type,year,region
0,0,2015-12-27,1.33,64236.62,1036.74,54454.85,48.16,8696.87,8603.62,93.25,0.0,conventional,2015,Albany
1,1,2015-12-20,1.35,54876.98,674.28,44638.81,58.33,9505.56,9408.07,97.49,0.0,conventional,2015,Albany
2,2,2015-12-13,0.93,118220.22,794.7,109149.67,130.5,8145.35,8042.21,103.14,0.0,conventional,2015,Albany
3,3,2015-12-06,1.08,78992.15,1132.0,71976.41,72.58,5811.16,5677.4,133.76,0.0,conventional,2015,Albany
4,4,2015-11-29,1.28,51039.6,941.48,43838.39,75.78,6183.95,5986.26,197.69,0.0,conventional,2015,Albany


## Expectations

## Metrics

## Profiling: generating expectations

## Tests == docs

## Setting up data context and source

## (Airflow integration)

## (Spark)


In [None]:
!cat great_expectations/great_expectations.yml

In [1]:
import great_expectations as ge

In [2]:
my_df = ge.read_csv("data/avocado.csv")

In [3]:
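my_df.head()  # not displayed: a notebook only renders the cell's last expression
# Check that every value in the "type" column is "conventional"
my_df.expect_column_values_to_be_in_set("type", ["conventional"])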

{
  "success": false,
  "meta": {},
  "result": {
    "element_count": 18249,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 9123,
    "unexpected_percent": 49.991780371527206,
    "unexpected_percent_nonmissing": 49.991780371527206,
    "partial_unexpected_list": [
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic",
      "organic"
    ]
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
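
The main fields of this result can be reproduced by hand. The following is a minimal pure-Python sketch of the bookkeeping, not the library's implementation: `check_values_in_set` is a hypothetical helper and the toy column is made up. Treating `None` as a missing value and capping the sample list at 20 entries are assumptions based on the output shown above.

```python
def check_values_in_set(values, value_set, partial_count=20):
    """Summarize which values fall outside value_set (hypothetical helper)."""
    allowed = set(value_set)
    nonmissing = [v for v in values if v is not None]  # assumption: None means missing
    unexpected = [v for v in nonmissing if v not in allowed]
    percent = 100.0 * len(unexpected) / len(nonmissing) if nonmissing else None
    return {
        "success": not unexpected,  # passes only if nothing unexpected was found
        "result": {
            "element_count": len(values),
            "missing_count": len(values) - len(nonmissing),
            "unexpected_count": len(unexpected),
            "unexpected_percent_nonmissing": percent,
            "partial_unexpected_list": unexpected[:partial_count],
        },
    }


toy_column = ["conventional", "organic", "conventional", "organic"]
report = check_values_in_set(toy_column, ["conventional"])
print(report["success"])                                  # False
print(report["result"]["unexpected_count"])               # 2
print(report["result"]["unexpected_percent_nonmissing"])  # 50.0
```

With the real data, 9123 of the 18249 values are `organic`, which gives the 49.99% reported above.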

In [4]:
my_df.get_expectation_suite()

{
  "meta": {
    "great_expectations_version": "0.13.4"
  },
  "expectation_suite_name": "default",
  "data_asset_type": "Dataset",
  "expectations": []
}
