# Validation Playground

**Watch** a [short tutorial video](https://greatexpectations.io/videos/getting_started/integrate_expectations) or **read** [the written tutorial](https://docs.greatexpectations.io/en/latest/tutorials/validate_data.html?utm_source=notebook&utm_medium=validate_data).

We'd love it if you **reach out for help on** the [**Great Expectations Slack Channel**](https://greatexpectations.io/slack)

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
S3_ENDPOINT = "https://minio.cwerner.ai"
BUCKET_NAME = 'ge-example'

In [3]:
# monkey patch boto3
import functools
import boto3
import os
boto3.client = functools.partial(boto3.client, endpoint_url=S3_ENDPOINT)
boto3.resource = functools.partial(boto3.resource, endpoint_url=S3_ENDPOINT)
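
The monkey patch above relies on `functools.partial`, which returns a callable with some keyword arguments already bound. The following is a minimal sketch of that mechanism only; `make_client` is a hypothetical stand-in written for this example and is not part of boto3.

```python
import functools

ENDPOINT = "https://minio.cwerner.ai"  # same value as S3_ENDPOINT above


def make_client(service_name, endpoint_url=None):
    """Hypothetical stand-in for a client factory such as boto3.client."""
    return {"service": service_name, "endpoint_url": endpoint_url}


# Pre-bind endpoint_url so every later call picks it up by default.
patched_make_client = functools.partial(make_client, endpoint_url=ENDPOINT)

default_client = patched_make_client("s3")
# A keyword passed at call time still overrides the pre-bound value.
custom_client = patched_make_client("s3", endpoint_url="http://localhost:9000")

print(default_client)
print(custom_client)
```

Because the original function is replaced on the module itself, any library that later calls `boto3.client` without an explicit `endpoint_url` should pick up the MinIO endpoint.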

In [4]:
# minio credentials
AWS_ACCESS_KEY_ID = "ENXJEPJGSyWjaCc63bci" 
AWS_SECRET_ACCESS_KEY = "yT9m0N2VjWGgqkQ51E94EgpcZVyzPJBU" 
S3_ENDPOINT = "https://minio.cwerner.ai"
S3DIRECT_REGION = "us-east-1"
os.environ['AWS_ACCESS_KEY_ID'] = AWS_ACCESS_KEY_ID
os.environ['AWS_SECRET_ACCESS_KEY'] = AWS_SECRET_ACCESS_KEY
os.environ['S3DIRECT_REGION'] = S3DIRECT_REGION
os.environ['S3_ENDPOINT'] = S3_ENDPOINT

In [5]:
import json
import great_expectations as ge
from great_expectations.profile import ColumnsExistProfiler
import great_expectations.jupyter_ux
from great_expectations.datasource.types import BatchKwargs
from datetime import datetime

2019-12-05T09:16:42+0100 - INFO - Great Expectations logging enabled at INFO level by JupyterUX module.


## 1. Get a DataContext
This represents your **project** that you just created using `great_expectations init`. [Read more in the tutorial](https://docs.greatexpectations.io/en/latest/tutorials/validate_data.html?utm_source=notebook&utm_medium=validate_data#get-a-datacontext-object)

In [6]:
context = ge.data_context.DataContext()

2019-12-05T09:16:42+0100 - INFO - Using project config: /Users/werner-ch/Documents/Repos/ge-example/great_expectations/great_expectations.yml


## 2. List the CSVs in your folder

The `DataContext` will now introspect your pandas `Datasource` and list the CSVs it finds. [Read more in the tutorial](https://docs.greatexpectations.io/en/latest/tutorials/validate_data.html?utm_source=notebook&utm_medium=validate_data#list-data-assets)

In [7]:
ge.jupyter_ux.list_available_data_asset_names(context)

Inspecting your data sources. This may take a moment...


## 3. Pick a CSV and the expectation suite

Internally, Great Expectations represents CSVs and dataframes as `DataAsset`s and uses this notion to link them to `Expectation Suites`. [Read more in the tutorial](https://docs.greatexpectations.io/en/latest/tutorials/validate_data.html?utm_source=notebook&utm_medium=validate_data#pick-a-data-asset-and-expectation-suite)


In [8]:
data_asset_name = "data_2018" # TODO: replace with your value!
normalized_data_asset_name = context.normalize_data_asset_name(data_asset_name)
normalized_data_asset_name

NormalizedDataAssetName(datasource='tereno_fendt', generator='s3', generator_asset='data_2018')
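
The normalized name printed above has three parts: datasource, generator, and generator asset. As a rough illustration, the sketch below uses a plain `namedtuple` called `AssetName` (a hypothetical stand-in, not the library's class) to show how the three parts line up with the slash-separated `data_asset_name` that appears in the stored suite.

```python
from collections import namedtuple

# Hypothetical stand-in for the object printed above; not the library's class.
AssetName = namedtuple("AssetName", ["datasource", "generator", "generator_asset"])

name = AssetName(datasource="tereno_fendt", generator="s3", generator_asset="data_2018")

# Joining the three parts with "/" gives the flat form of the name.
flat_name = "/".join(name)
print(flat_name)  # tereno_fendt/s3/data_2018
```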

We recommend naming your first expectation suite for a table `warning`. Later, as you identify some of the expectations that you add to this suite as critical, you can move these expectations into another suite and call it `failure`. [Read more in the tutorial](https://docs.greatexpectations.io/en/latest/getting_started/pipeline_integration.html?utm_source=notebook&utm_medium=integrate_validation#choose-data-asset-and-expectation-suite)
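
One way to picture the `warning` / `failure` split is as a small decision rule in the pipeline. The sketch below is an assumption about how such a policy could look; `pipeline_decision` and its return values are hypothetical names and do not come from Great Expectations.

```python
def pipeline_decision(failure_suite_ok, warning_suite_ok):
    """Return what a pipeline might do given the two suite outcomes (hypothetical policy)."""
    if not failure_suite_ok:
        # Critical expectations were not met: do not process the batch.
        return "stop"
    if not warning_suite_ok:
        # Only non-critical expectations were missed: keep going, but flag it.
        return "continue_with_warning"
    return "continue"


for failure_ok, warning_ok in [(True, True), (True, False), (False, True)]:
    print(failure_ok, warning_ok, pipeline_decision(failure_ok, warning_ok))
```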

In [9]:
expectation_suite_name = "basic" # TODO: replace with your value!

#### 3.a. If you don't have an expectation suite, let's create a simple one

You need expectations to validate your data. Expectations are grouped into Expectation Suites. 

If you don't have an expectation suite for this data asset, the notebook's next cell will create a suite of very basic expectations, so that you have some expectations to play with. The expectation suite will have `expect_column_to_exist` expectations for each column.

If you created an expectation suite for this data asset, you can skip executing the next cell (if you execute it, it will do nothing).

To create a more interesting suite, open the [create_expectations.ipynb](create_expectations.ipynb) notebook.



In [10]:
try:
    print('1')  # marker: looking for an existing suite
    context.get_expectation_suite(normalized_data_asset_name, expectation_suite_name)
except great_expectations.exceptions.DataContextError:
    print('2')  # marker: no suite found, creating a basic one
    context.create_expectation_suite(
        data_asset_name=normalized_data_asset_name,
        expectation_suite_name=expectation_suite_name,
        overwrite_existing=True,
    )
    batch_kwargs = context.yield_batch_kwargs(data_asset_name)
    batch = context.get_batch(normalized_data_asset_name, expectation_suite_name, batch_kwargs)
    # Add one expect_column_to_exist expectation per column, then persist the suite.
    ColumnsExistProfiler().profile(batch)
    batch.save_expectation_suite()
    expectation_suite = context.get_expectation_suite(normalized_data_asset_name, expectation_suite_name)
    context.build_data_docs()


1


In [31]:
context.get_expectation_suite(normalized_data_asset_name, "basic")

{'data_asset_name': 'tereno_fendt/s3/data_2018',
 'expectation_suite_name': 'basic',
 'meta': {'great_expectations.__version__': '0.8.6'},
 'expectations': [{'expectation_type': 'expect_column_values_to_not_be_null',
   'kwargs': {'column': 'BattV_Avg'}},
  {'expectation_type': 'expect_column_mean_to_be_between',
   'kwargs': {'column': 'PTemp_C_Avg', 'min_value': -20, 'max_value': 40}},
  {'expectation_type': 'expect_column_values_to_not_be_null',
   'kwargs': {'column': 'PTemp_C_Avg'}},
  {'expectation_type': 'expect_column_min_to_be_between',
   'kwargs': {'column': 'Ramount', 'min_value': 0, 'max_value': 10}},
  {'expectation_type': 'expect_column_values_to_not_be_null',
   'kwargs': {'column': 'Ramount'}},
  {'expectation_type': 'expect_column_values_to_not_be_null',
   'kwargs': {'column': 'Rduration_Avg'}}],
 'data_asset_type': 'Dataset'}
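
Since a suite is returned as plain nested data, it can be summarized with ordinary Python. The sketch below copies the `expectations` list from the output above into a literal and counts expectations per type and per column; it is only an illustration of the structure and does not call Great Expectations.

```python
from collections import Counter

# Expectations copied from the suite shown above.
expectations = [
    {"expectation_type": "expect_column_values_to_not_be_null",
     "kwargs": {"column": "BattV_Avg"}},
    {"expectation_type": "expect_column_mean_to_be_between",
     "kwargs": {"column": "PTemp_C_Avg", "min_value": -20, "max_value": 40}},
    {"expectation_type": "expect_column_values_to_not_be_null",
     "kwargs": {"column": "PTemp_C_Avg"}},
    {"expectation_type": "expect_column_min_to_be_between",
     "kwargs": {"column": "Ramount", "min_value": 0, "max_value": 10}},
    {"expectation_type": "expect_column_values_to_not_be_null",
     "kwargs": {"column": "Ramount"}},
    {"expectation_type": "expect_column_values_to_not_be_null",
     "kwargs": {"column": "Rduration_Avg"}},
]

# How many expectations of each type does the suite contain?
type_counts = Counter(e["expectation_type"] for e in expectations)
# Which columns are covered by at least one expectation?
columns = sorted({e["kwargs"]["column"] for e in expectations})

for expectation_type, count in type_counts.items():
    print(count, expectation_type)
print(columns)
```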

## 4. Load a batch of data you want to validate

To learn more about `get_batch` with other data types (such as existing pandas dataframes, SQL tables or Spark), see [this tutorial](https://docs.greatexpectations.io/en/latest/tutorials/validate_data.html?utm_source=notebook&utm_medium=validate_data#load-a-batch-of-data-to-validate)


In [32]:
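# Read the column names from the first line; strip() drops the trailing newline
# that would otherwise end up in the last column name.
with open('/Users/werner-ch/Documents/Repos/ge-example/data/colnames.csv') as f:
    colnames = f.readline().strip().split(',')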


batch_kwargs = context.yield_batch_kwargs(data_asset_name, 
                                          reader_options={'encoding': 'utf-8',
                                                          'names': colnames,
                                                          'index_col': 'TIMESTAMP',
                                                          'parse_dates': ['TIMESTAMP']
                                                         })

batch_kwargs = dict(batch_kwargs)
batch_kwargs['reader_method'] = 'csv'
print(batch_kwargs)
batch = context.get_batch(normalized_data_asset_name, expectation_suite_name, batch_kwargs)
batch.head()

{'s3': 's3a://ge-example/raw/tereno_fendt/2018/Fen_M_18_017.dat', 'reader_options': {'sep': ',', 'header': None, 'index_col': 'TIMESTAMP', 'na_values': 'NAN', 'engine': 'python', 'encoding': 'utf-8', 'names': ['TIMESTAMP', 'BattV_Avg', 'PTemp_C_Avg', 'Wdmin_Min', 'Wdavg', 'Wdmax_Max', 'Wsmin_Min', 'Wsavg_Avg', 'Wsmax_Max', 'airtemp_Avg', 'relhumidity_Avg', 'airpressure_Avg', 'Ramount', 'Rduration_Avg', 'Rintensity_Avg', 'Hamount', 'Hduration_Avg', 'Hintensity_Avg', 'T107_2_West_Avg', 'T107_6_West_Avg', 'T107_12_West_Avg', 'T107_25_West_Avg', 'T107_35_West_Avg', 'T107_50_West_Avg', 'T107_2_Mitte_Avg', 'T107_6_Mitte_Avg', 'T107_12_Mitte_Avg', 'T107_25_Mitte_Avg', 'T107_35_Mitte_Avg', 'T107_50_Mitte_Avg', 'T107_2_Ost_Avg', 'T107_6_Ost_Avg', 'T107_12_Ost_Avg', 'T107_25_Ost_Avg', 'T107_35_Ost_Avg', 'T107_50_Ost_Avg', 'VWC_2_West_Avg', 'VWC_6_West_Avg', 'VWC_12_West_Avg', 'VWC_25_West_Avg', 'VWC_35_West_Avg', 'VWC_50_West_Avg', 'VWC_2_Mitte_Avg', 'VWC_6_Mitte_Avg', 'VWC_12_Mitte_Avg', 'VWC_2

| TIMESTAMP | BattV_Avg | PTemp_C_Avg | Wdmin_Min | Wdavg | Wdmax_Max | Wsmin_Min | Wsavg_Avg | Wsmax_Max | airtemp_Avg | relhumidity_Avg | ... | IR_TempC_Avg | Total_Avg | Diffuse_Avg | Sun | H_Flux_sc_9_Ost_Avg | H_Flux_sc_8_fernerOst_Avg | H_Flux_sc_8_Mitte_Avg | shf_cal(1) | shf_cal(2) | shf_cal(3)\n |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2018-01-17 00:00:00 | 13.28 | 2.899 | 265 | 277 | 292 | 1.3 | 1.8 | 2.4 | 2.2 | 89.8 | ... | 1.908 | 19.41 | 20.56 | 0 | -6.334002 | -5.797156 | -5.934458 | 0.028665 | 0.049997 | 0.048279 |
| 2018-01-17 00:01:00 | 13.28 | 2.897 | 213 | 238 | 279 | 1.9 | 3.0 | 4.3 | 2.3 | 90.1 | ... | 2.187 | 3.125 | 4.112 | 0 | -6.217782 | -5.697205 | -5.796448 | 0.028665 | 0.049997 | 0.048279 |
| 2018-01-17 00:02:00 | 13.48 | 2.894 | 129 | 206 | 223 | 0.7 | 2.5 | 3.4 | 2.4 | 90.0 | ... | 2.16 | 0.987 | 1.974 | 0 |  |  |  | 0.028665 | 0.049997 | 0.048279 |
| 2018-01-17 00:03:00 | 13.47 | 2.899 | 223 | 268 | 315 | 1.0 | 2.5 | 3.6 | 2.5 | 89.7 | ... | 2.185 | 1.151 | 1.809 | 0 |  |  |  | 0.028665 | 0.049997 | 0.048279 |
| 2018-01-17 00:04:00 | 13.54 | 2.91 | 218 | 237 | 275 | 1.6 | 2.1 | 2.3 | 2.6 | 89.1 | ... | 2.165 | 17.93 | 18.91 | 0 |  |  |  | 0.028665 | 0.049997 | 0.048279 |
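
The printed `batch_kwargs` suggest that the `reader_options` passed at call time are layered on top of the defaults configured for the generator: `index_col` is `0` in the config returned by `context.get_config()` later in this notebook, but `'TIMESTAMP'` in the printed kwargs. The sketch below reproduces that effect with a plain dictionary update. It is an inference from the output, not a description of the library's internals, and the column list is shortened for the example.

```python
# Defaults as they appear in the generator config later in this notebook.
configured_options = {
    "sep": ",",
    "header": None,
    "index_col": 0,
    "na_values": "NAN",
    "engine": "python",
}

# Options passed at call time (column list shortened for this example).
call_options = {
    "encoding": "utf-8",
    "names": ["TIMESTAMP", "BattV_Avg"],
    "index_col": "TIMESTAMP",
    "parse_dates": ["TIMESTAMP"],
}

# Start from the configured defaults, then let call-time values win.
merged = dict(configured_options)
merged.update(call_options)

print(merged["index_col"])
print(list(merged))
```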


## 5. Get a pipeline run id

Generate a run id: a timestamp or another meaningful string that will help you refer to validation results later. We recommend making run ids chronologically sortable.
[Read more in the tutorial](https://docs.greatexpectations.io/en/latest/getting_started/pipeline_integration.html?utm_source=notebook&utm_medium=validate_data#set-a-run-id)

In [33]:
# Let's make a simple sortable timestamp. Note this could come from your pipeline runner.
run_id = datetime.utcnow().isoformat().replace(":", "") + "Z"
run_id

'2019-12-05T084120.328550Z'
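
To see why this format is chronologically sortable, the sketch below builds ids for three hypothetical runs with the same formatting and sorts them as plain strings. The helper name `make_run_id` and the timestamps are assumptions made for this example.

```python
from datetime import datetime, timedelta


def make_run_id(ts):
    """Format a timestamp the same way as the cell above."""
    return ts.isoformat().replace(":", "") + "Z"


start = datetime(2019, 12, 5, 8, 41, 20, 328550)
# Three hypothetical pipeline runs, one hour apart, listed out of order.
timestamps = [start + timedelta(hours=2), start, start + timedelta(hours=1)]
run_ids = [make_run_id(ts) for ts in timestamps]

# Plain string sorting puts the ids in chronological order.
for rid in sorted(run_ids):
    print(rid)
```

One caveat: `isoformat()` omits the fractional part when the microsecond value is exactly 0, so ids can differ in length. Passing `timespec="microseconds"` to `isoformat()` keeps the width fixed if that matters for your sorting.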

## 6. Validate the batch

This is the "workhorse" of Great Expectations. Call it in your pipeline code after loading data and just before passing it to your computation.

[Read more about the validate method in the tutorial](https://docs.greatexpectations.io/en/latest/tutorials/validate_data.html?utm_source=notebook&utm_medium=validate_data#validate-the-batch)


In [35]:
#context.expect

In [36]:
# Reload the batch with the same reader options as in step 4, then validate it.
print(data_asset_name)
batch_kwargs = context.yield_batch_kwargs(data_asset_name, 
                                          reader_options={'encoding': 'utf-8',
                                                          'names': colnames,
                                                          'index_col': 'TIMESTAMP',
                                                          'parse_dates': ['TIMESTAMP']
                                                         })

batch_kwargs = dict(batch_kwargs)
batch_kwargs['reader_method'] = 'csv'
batch = context.get_batch(normalized_data_asset_name, expectation_suite_name, batch_kwargs)

validation_result = batch.validate(run_id=run_id, result_format='COMPLETE')

if validation_result["success"]:
    print("This data meets all expectations for {}".format(str(data_asset_name)))
    print(json.dumps(validation_result, indent=4))
else:
    print("This data is not a valid batch of {}".format(str(data_asset_name)))

data_2018
2019-12-05T09:41:29+0100 - INFO - 	6 expectation(s) included in expectation_suite.
This data meets all expectations for data_2018
{
    "results": [
        {
            "success": true,
            "result": {
                "element_count": 1439,
                "unexpected_count": 0,
                "unexpected_percent": 0.0,
                "partial_unexpected_list": [],
                "unexpected_list": [],
                "unexpected_index_list": []
            },
            "expectation_config": {
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {
                    "column": "BattV_Avg",
                    "result_format": "COMPLETE"
                }
            },
            "exception_info": {
                "raised_exception": false,
                "exception_message": null,
                "exception_traceback": null
            }
        },
        {
            "success": true,
            "result": {

In [38]:
# update docs
context.build_data_docs()



{'local_site': '/Users/werner-ch/Documents/Repos/ge-example/great_expectations/uncommitted/data_docs/local_site/index.html'}

In [39]:
context.open_data_docs()

#### 6.a. OPTIONAL: Review the JSON validation results

Don't worry: this blob of JSON is meant for machines. Uncomment the next cell to inspect it, or skip ahead to see the same results in Data Docs!

In [31]:
#print(json.dumps(validation_result, indent=4))
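
If the full JSON is too noisy, a short summary of the failed expectations is often enough. The sketch below works on `sample_result`, a hand-written dictionary in the same shape as the output shown in step 6; its second entry is hypothetical and only there to show what a failure would look like.

```python
# Hand-written sample in the same shape as the validation output in step 6.
# The second entry is hypothetical and exists only to illustrate a failure.
sample_result = {
    "success": False,
    "results": [
        {
            "success": True,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "BattV_Avg"},
            },
        },
        {
            "success": False,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "Ramount"},
            },
        },
    ],
}

# Collect (expectation type, column) for every expectation that did not pass.
failed = [
    (r["expectation_config"]["expectation_type"],
     r["expectation_config"]["kwargs"].get("column"))
    for r in sample_result["results"]
    if not r["success"]
]

print("{} of {} expectations failed".format(len(failed), len(sample_result["results"])))
for expectation_type, column in failed:
    print("  {} on column {}".format(expectation_type, column))
```

The same loop should work on the real `validation_result` from step 6, assuming it keeps the `results` / `expectation_config` layout shown there.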

## 7. Validation Operators

The `validate` method evaluates one batch of data against one expectation suite and returns a dictionary of validation results. This is sufficient when you explore your data and get to know Great Expectations.
When deploying Great Expectations in a **real data pipeline, you will typically discover additional needs**:

* validating a group of batches that are logically related
* validating a batch against several expectation suites such as using a tiered pattern like `warning` and `failure`
* doing something with the validation results (e.g., saving them for a later review, sending notifications in case of failures, etc.).

`Validation Operators` provide a convenient abstraction for both bundling the validation of multiple expectation suites and the actions that should be taken after the validation.

[Read more about Validation Operators in the tutorial](https://docs.greatexpectations.io/en/latest/tutorials/validate_data.html?utm_source=notebook&utm_medium=validate_data#save-validation-results)
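
The idea of bundling validation with follow-up actions can be sketched in a few lines of plain Python. `SimpleActionListOperator`, the toy `validate` rule, and the two actions below are hypothetical names written for this illustration; they are not the Great Expectations implementation.

```python
class SimpleActionListOperator:
    """Hypothetical stand-in: validate each batch, then run every action on the result."""

    def __init__(self, validate, actions):
        self.validate = validate  # callable: batch -> result dict
        self.actions = actions    # list of callables: (batch, result) -> None

    def run(self, batches):
        results = []
        for batch in batches:
            result = self.validate(batch)
            for action in self.actions:
                action(batch, result)
            results.append(result)
        return results


stored = []  # stands in for a validation result store
alerts = []  # stands in for a notification channel


def store_result(batch, result):
    stored.append(result)


def alert_on_failure(batch, result):
    if not result["success"]:
        alerts.append("validation failed for {}".format(batch["name"]))


def validate(batch):
    # Toy rule: a batch passes when it has at least one row.
    return {"success": len(batch["rows"]) > 0}


operator = SimpleActionListOperator(validate, [store_result, alert_on_failure])
op_results = operator.run([
    {"name": "good", "rows": [1, 2]},
    {"name": "empty", "rows": []},
])

print(op_results)
print(alerts)
```

Keeping the actions as a list means new behavior (storing, notifying, rebuilding docs) can be added without touching the validation step itself.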

In [32]:
# This is an example of invoking a validation operator that is configured by default in the great_expectations.yml file


results = context.run_validation_operator(
    assets_to_validate=[batch],
    run_id=run_id,
    validation_operator_name="action_list_operator"
)

2019-11-28T13:56:58+0100 - INFO - 	6 expectation(s) included in expectation_suite.


In [33]:
context.get_config()


ordereddict([('config_version', 1), ('datasources', ordereddict([('tereno_fendt', ordereddict([('class_name', 'PandasDatasource'), ('data_asset_type', ordereddict([('class_name', 'PandasDataset')])), ('generators', ordereddict([('s3', ordereddict([('class_name', 'S3Generator'), ('bucket', 'ge-example'), ('delimiter', '/'), ('reader_options', ordereddict([('sep', ','), ('header', None), ('index_col', 0), ('na_values', 'NAN'), ('engine', 'python')])), ('assets', ordereddict([('data_2018', ordereddict([('prefix', 'raw/tereno_fendt/2018/'), ('regex_filter', '.*.dat')])), ('data_2019', ordereddict([('prefix', 'raw/tereno_fendt/2019/'), ('regex_filter', '.*.dat')]))]))]))]))]))])), ('config_variables_file_path', 'uncommitted/config_variables.yml'), ('plugins_directory', 'plugins/'), ('validation_operators', ordereddict([('action_list_operator', {'class_name': 'ActionListValidationOperator', 'action_list': [{'name': 'store_validation_result', 'action': {'class_name': 'StoreAction'}}, {'name':

In [34]:
run_id

'2019-11-28T125654.187738Z'

## 8. View the Validation Results in Data Docs

Let's now build and look at your Data Docs. These will now include a **data quality report** built from the `ValidationResults` you just created that helps you communicate about your data with both machines and humans.

[Read more about Data Docs in the tutorial](https://docs.greatexpectations.io/en/latest/tutorials/validate_data.html?utm_source=notebook&utm_medium=validate_data#view-the-validation-results-in-data-docs)

In [35]:
context.open_data_docs()

## Congratulations! You ran Validations!

## Next steps:

### 1. Author more interesting Expectations

Here we used some **extremely basic** `Expectations`. To really harness the power of Great Expectations you can author much more interesting and specific `Expectations` to protect your data pipelines and defeat pipeline debt. Go to [create_expectations.ipynb](create_expectations.ipynb) to see how!

### 2. Explore the documentation & community

You are now among the elite data professionals who know how to build robust descriptions of your data and protections for pipelines and machine learning models. Join the [**Great Expectations Slack Channel**](https://greatexpectations.io/slack) to see how others are wielding these superpowers.

In [87]:
# validation_operator_name must be the key configured in great_expectations.yml
# ("action_list_operator"), not the operator's class name.
results = context.run_validation_operator(
    assets_to_validate=[batch],
    run_id="first-batch-test",
    validation_operator_name="action_list_operator",
)