# Create Expectation Interactively

In [63]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


# Author Expectations

Watch a [short tutorial video](https://greatexpectations.io/videos/getting_started/create_expectations?utm_source=notebook&utm_medium=create_expectations) or read [the written tutorial](https://docs.greatexpectations.io/en/latest/tutorials/create_expectations.html?utm_source=notebook&utm_medium=create_expectations).

We'd love it if you **reach out for help on** the [**Great Expectations Slack Channel**](https://greatexpectations.io/slack)

In [64]:
S3_ENDPOINT = "https://minio.cwerner.ai"
#S3_ENDPOINT = "https://s3.imk-ifu.kit.edu:8082"
BUCKET_NAME = 'ge-example'

In [65]:
# monkey patch boto3
import functools
import boto3
boto3.client = functools.partial(boto3.client, endpoint_url=S3_ENDPOINT)
boto3.resource = functools.partial(boto3.resource, endpoint_url=S3_ENDPOINT)

import json
import os
import great_expectations as ge
import great_expectations.jupyter_ux
import pandas as pd
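The monkey patch works because `functools.partial` pre-binds the `endpoint_url` keyword, so every later call picks up the custom endpoint without passing it explicitly. Here is a minimal sketch of that mechanism. It uses a hypothetical stand-in function, `fake_client`, instead of `boto3`, and a placeholder endpoint, so it runs without any S3 access:

```python
import functools


def fake_client(service_name, endpoint_url=None):
    """Hypothetical stand-in for a client factory; it just echoes its arguments."""
    return {"service_name": service_name, "endpoint_url": endpoint_url}


EXAMPLE_ENDPOINT = "https://minio.example.invalid"  # placeholder, not a real server

# Pre-bind endpoint_url so callers no longer have to pass it.
patched_client = functools.partial(fake_client, endpoint_url=EXAMPLE_ENDPOINT)

default_call = patched_client("s3")
# A keyword given at call time still overrides the pre-bound value.
override_call = patched_client("s3", endpoint_url="https://other.example.invalid")

print(default_call["endpoint_url"])
print(override_call["endpoint_url"])
```

Because a keyword passed at call time overrides the pre-bound one, code that sets its own `endpoint_url` keeps working after the patch.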

In [66]:
# minio credentials
AWS_ACCESS_KEY_ID = "ENXJEPJGSyWjaCc63bci" 
AWS_SECRET_ACCESS_KEY = "yT9m0N2VjWGgqkQ51E94EgpcZVyzPJBU" 
S3_ENDPOINT = "https://minio.cwerner.ai"
S3DIRECT_REGION = "us-east-1"
os.environ['AWS_ACCESS_KEY_ID'] = AWS_ACCESS_KEY_ID
os.environ['AWS_SECRET_ACCESS_KEY'] = AWS_SECRET_ACCESS_KEY
os.environ['S3DIRECT_REGION'] = S3DIRECT_REGION
os.environ['S3_ENDPOINT'] = S3_ENDPOINT
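If you would rather not keep credentials in the notebook itself, one option is to export them in your shell and only read them here. The sketch below assumes the same variable names as the cell above. The helper `read_s3_settings` and the fallback region are hypothetical, not part of Great Expectations or boto3:

```python
import os

REQUIRED_KEYS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def read_s3_settings(env=None):
    """Collect S3 settings from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))
    return {
        "access_key": env["AWS_ACCESS_KEY_ID"],
        "secret_key": env["AWS_SECRET_ACCESS_KEY"],
        "endpoint": env.get("S3_ENDPOINT"),  # None means "use the default endpoint"
        "region": env.get("S3DIRECT_REGION", "us-east-1"),  # assumed fallback
    }


# Demonstrate with a plain dict so the sketch does not depend on your shell.
example_env = {"AWS_ACCESS_KEY_ID": "example-id", "AWS_SECRET_ACCESS_KEY": "example-secret"}
settings = read_s3_settings(example_env)
print(settings["region"])
```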

In [67]:
ge.__version__

'0.8.6'

In [68]:
! echo $S3_ENDPOINT $AWS_ACCESS_KEY_ID

https://minio.cwerner.ai ENXJEPJGSyWjaCc63bci


## 1. Get a DataContext
This represents your **project** that you just created using `great_expectations init`. [Read more in the tutorial](https://docs.greatexpectations.io/en/latest/tutorials/create_expectations.html?utm_source=notebook&utm_medium=create_expectations#get-a-datacontext-object)

In [69]:
context = ge.data_context.DataContext()
context

2019-12-05T09:45:59+0100 - INFO - Using project config: /Users/werner-ch/Documents/Repos/ge-example/great_expectations/great_expectations.yml


<great_expectations.data_context.data_context.DataContext at 0x11f9e44c0>

## 2. List the CSVs in your folder

The `DataContext` will now introspect your pandas `Datasource` and list the CSVs it finds. [Read more in the tutorial](https://docs.greatexpectations.io/en/latest/tutorials/create_expectations.html?utm_source=notebook&utm_medium=create_expectations#list-data-assets)

In [70]:
great_expectations.jupyter_ux.list_available_data_asset_names(context)

Inspecting your data sources. This may take a moment...


In [71]:
context.list_expectation_suite_keys()

[{'data_asset_name': tereno_fendt/s3/data_2018,
  'expectation_suite_name': 'basic'},
 {'data_asset_name': tereno_fendt/s3/data_2018/basic,
  'expectation_suite_name': '2019-11-28T083557.585527Z'},
 {'data_asset_name': tereno_fendt/s3/data_2018/basic,
  'expectation_suite_name': '2019-11-28T083953.866517Z'},
 {'data_asset_name': tereno_fendt/s3/data_2018/basic,
  'expectation_suite_name': '2019-11-28T084656.492846Z'},
 {'data_asset_name': tereno_fendt/s3/data_2018/basic,
  'expectation_suite_name': '2019-11-28T111151.292794Z'},
 {'data_asset_name': tereno_fendt/s3/data_2018/basic,
  'expectation_suite_name': '2019-11-28T111746.568767Z'},
 {'data_asset_name': tereno_fendt/s3/data_2018/basic,
  'expectation_suite_name': '2019-11-28T113138.181305Z'},
 {'data_asset_name': tereno_fendt/s3/data_2018/basic,
  'expectation_suite_name': '2019-11-28T124021.705060Z'},
 {'data_asset_name': tereno_fendt/s3/data_2018/basic,
  'expectation_suite_name': '2019-11-28T124702.340515Z'},
 {'data_asset_name

## 3. Pick a CSV and set the expectation suite name

Internally, Great Expectations represents CSVs and dataframes as `DataAsset`s and uses this notion to link them to `Expectation Suites`. [Read more in the tutorial](https://docs.greatexpectations.io/en/latest/tutorials/create_expectations.html?utm_source=notebook&utm_medium=create_expectations#pick-a-data-asset-and-set-the-expectation-suite-name)


In [72]:
data_asset_name = "data_2018" #"Fen_M_19_234" 
normalized_data_asset_name = context.normalize_data_asset_name(data_asset_name)
normalized_data_asset_name

NormalizedDataAssetName(datasource='tereno_fendt', generator='s3', generator_asset='data_2018')
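The normalized name has three parts, and they match the slash-separated form `tereno_fendt/s3/data_2018` that appears in the suite listings. To make that relationship concrete, here is a small sketch using a plain `namedtuple`. `AssetName` and `split_asset_name` are illustrative names only, not the library's own implementation:

```python
from collections import namedtuple

# Illustrative three-part name: datasource / generator / generator_asset.
AssetName = namedtuple("AssetName", ["datasource", "generator", "generator_asset"])


def split_asset_name(full_name):
    """Split 'datasource/generator/asset' into its three components."""
    parts = full_name.split("/")
    if len(parts) != 3:
        raise ValueError("expected exactly three '/'-separated parts: %r" % full_name)
    return AssetName(*parts)


name = split_asset_name("tereno_fendt/s3/data_2018")
print(name)
print("/".join(name))
```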

We recommend naming your first expectation suite for a table `warning`. Later, as you identify some of the expectations that you add to this suite as critical, you can move these expectations into another suite and call it `failure`.

In [73]:
expectation_suite_name = "basic2" # TODO: replace with your value!

In [74]:
print(data_asset_name)
print(expectation_suite_name)

data_2018
basic2


## 4. Create a new empty expectation suite

In [75]:
context.create_expectation_suite(data_asset_name=data_asset_name, expectation_suite_name=expectation_suite_name, overwrite_existing=True)

{'data_asset_name': 'tereno_fendt/s3/data_2018',
 'expectation_suite_name': 'basic2',
 'meta': {'great_expectations.__version__': '0.8.6'},
 'expectations': []}

## 5. Load a batch of data you want to use to create `Expectations`

To learn more about `get_batch` with other data types (such as existing pandas dataframes, SQL tables or Spark), see [this tutorial](https://docs.greatexpectations.io/en/latest/tutorials/create_expectations.html?utm_source=notebook&utm_medium=create_expectations#load-a-batch-of-data-to-create-expectations)

In [76]:
with open('/Users/werner-ch/Documents/Repos/ge-example/data/colnames.csv') as f:
    colnames = f.readline().rstrip('\n').split(',')
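A note on splitting a header line into column names: slicing with `[:-1]` drops the last character whether or not it is a newline, while `rstrip` only removes a trailing line ending. The sketch below shows the difference on in-memory text. `read_column_names` is a hypothetical helper, and the three column names are a shortened sample of the real header:

```python
import io


def read_column_names(handle):
    """Return the comma-separated names from the first line of a text file."""
    first_line = handle.readline()
    return first_line.rstrip("\r\n").split(",")


with_newline = io.StringIO("TIMESTAMP,BattV_Avg,PTemp_C_Avg\n")
without_newline = io.StringIO("TIMESTAMP,BattV_Avg,PTemp_C_Avg")

names_a = read_column_names(with_newline)
names_b = read_column_names(without_newline)
# Slicing damages the last name when the file has no trailing newline.
sliced = "TIMESTAMP,BattV_Avg,PTemp_C_Avg"[:-1].split(",")

print(names_a)
print(names_b)
print(sliced)
```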


In [77]:
batch_kwargs = context.yield_batch_kwargs(data_asset_name, 
                                          reader_options={'encoding': 'utf-8',
                                                          'names': colnames,
                                                          'index_col': 'TIMESTAMP',
                                                          'parse_dates': ['TIMESTAMP']
                                                         })
batch_kwargs = dict(batch_kwargs)
print(batch_kwargs)
batch_kwargs['reader_method'] = 'csv'

#batch_kwargs = {'path': '/Users/werner-ch/Documents/Repos/ge-example/great_expectations/../data/Fen_M_19_234.dat', 'header': 0}

{'s3': 's3a://ge-example/raw/tereno_fendt/2018/Fen_M_18_001.dat', 'reader_options': {'sep': ',', 'header': None, 'index_col': 'TIMESTAMP', 'na_values': 'NAN', 'engine': 'python', 'encoding': 'utf-8', 'names': ['TIMESTAMP', 'BattV_Avg', 'PTemp_C_Avg', 'Wdmin_Min', 'Wdavg', 'Wdmax_Max', 'Wsmin_Min', 'Wsavg_Avg', 'Wsmax_Max', 'airtemp_Avg', 'relhumidity_Avg', 'airpressure_Avg', 'Ramount', 'Rduration_Avg', 'Rintensity_Avg', 'Hamount', 'Hduration_Avg', 'Hintensity_Avg', 'T107_2_West_Avg', 'T107_6_West_Avg', 'T107_12_West_Avg', 'T107_25_West_Avg', 'T107_35_West_Avg', 'T107_50_West_Avg', 'T107_2_Mitte_Avg', 'T107_6_Mitte_Avg', 'T107_12_Mitte_Avg', 'T107_25_Mitte_Avg', 'T107_35_Mitte_Avg', 'T107_50_Mitte_Avg', 'T107_2_Ost_Avg', 'T107_6_Ost_Avg', 'T107_12_Ost_Avg', 'T107_25_Ost_Avg', 'T107_35_Ost_Avg', 'T107_50_Ost_Avg', 'VWC_2_West_Avg', 'VWC_6_West_Avg', 'VWC_12_West_Avg', 'VWC_25_West_Avg', 'VWC_35_West_Avg', 'VWC_50_West_Avg', 'VWC_2_Mitte_Avg', 'VWC_6_Mitte_Avg', 'VWC_12_Mitte_Avg', 'VWC_2

Load a batch of data and take a peek at the first few rows.

In [78]:
batch = context.get_batch(data_asset_name, expectation_suite_name, batch_kwargs)
batch.head()

TIMESTAMP,BattV_Avg,PTemp_C_Avg,Wdmin_Min,Wdavg,Wdmax_Max,Wsmin_Min,Wsavg_Avg,Wsmax_Max,airtemp_Avg,relhumidity_Avg,...,IR_TempC_Avg,Total_Avg,Diffuse_Avg,Sun,H_Flux_sc_9_Ost_Avg,H_Flux_sc_8_fernerOst_Avg,H_Flux_sc_8_Mitte_Avg,shf_cal(1),shf_cal(2),shf_cal(3)
2018-01-01 00:00:00,13.6,-0.053,180,199,211,0.6,0.8,1.0,0.6,89.1,...,-1.93,17.77,18.43,0,-11.27593,-10.07353,-9.107799,0.028665,0.04813,0.048112
2018-01-01 00:01:00,13.33,-0.053,136,150,167,0.3,0.5,0.7,0.5,88.5,...,-1.972,0.987,1.81,0,-11.27593,-10.03892,-9.107799,0.028665,0.04813,0.048112
2018-01-01 00:02:00,13.33,-0.056,156,166,183,0.5,0.7,0.8,0.5,88.3,...,-1.902,1.645,2.797,0,,,,0.028665,0.04813,0.048112
2018-01-01 00:03:00,13.52,-0.058,162,178,187,0.6,0.7,0.8,0.5,88.5,...,-1.874,17.93,2.139,0,,,,0.028665,0.04813,0.048112
2018-01-01 00:04:00,13.58,-0.053,172,180,188,0.7,0.9,1.1,0.5,88.8,...,-1.741,17.44,18.59,0,,,,0.028665,0.04813,0.048112


#### Optionally, customize and review batch options

`BatchKwargs` are extremely flexible - to learn more [read the tutorial](https://docs.greatexpectations.io/en/latest/tutorials/create_expectations.html?utm_source=notebook&utm_medium=create_expectations#load-a-batch-of-data-to-create-expectations)

Here are the batch kwargs used to load your batch

In [79]:
#batch.batch_kwargs

In [80]:
# The datasource can add and store additional identifying information to ensure you can track a batch through
# your pipeline
batch.batch_id

{'timestamp': 1575535624.957027,
 'fingerprint': '9b28f421378c31ccd52dfe0780a5bd9d'}

## 6. Author Expectations

With a batch, you can add expectations by calling specific expectation methods. They all begin with `.expect_` which makes autocompleting easy.

See available expectations in the [expectation glossary](https://docs.greatexpectations.io/en/latest/glossary.html?utm_source=notebook&utm_medium=create_expectations).
You can also see available expectations by hovering over data elements in the HTML page generated by profiling your dataset.

Below is an example expectation that checks that the values in the batch's first column are not null.

[Read more in the tutorial](https://docs.greatexpectations.io/en/latest/tutorials/create_expectations.html?utm_source=notebook&utm_medium=create_expectations#author-expectations)

In [81]:
column_name = batch.get_table_columns()[0]
batch.expect_column_values_to_not_be_null(column_name)

{'success': True,
 'result': {'element_count': 1440,
  'unexpected_count': 0,
  'unexpected_percent': 0.0,
  'partial_unexpected_list': []}}
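To get a feel for what the fields in this result mean, here is a plain-Python sketch of a not-null check over a list of values. It is not how Great Expectations computes the result. `check_not_null`, the `partial_limit` cutoff, and the 0 to 100 scale of `unexpected_percent` are all assumptions made for illustration:

```python
def check_not_null(values, partial_limit=20):
    """Summarise null values in a column, mirroring the result fields shown above."""
    unexpected = [v for v in values if v is None]
    element_count = len(values)
    # Assumed scale: percentage between 0 and 100.
    unexpected_percent = 100.0 * len(unexpected) / element_count if element_count else 0.0
    return {
        "success": not unexpected,
        "result": {
            "element_count": element_count,
            "unexpected_count": len(unexpected),
            "unexpected_percent": unexpected_percent,
            # Only a capped sample of the offending values is reported.
            "partial_unexpected_list": unexpected[:partial_limit],
        },
    }


clean = check_not_null([13.6, 13.33, 13.33, 13.52])
gappy = check_not_null([13.6, None, 13.33, None])
print(clean["success"], gappy["success"], gappy["result"]["unexpected_percent"])
```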

# Some experiments

In [82]:
column_name = 'PTemp_C_Avg'
batch.expect_column_mean_to_be_between(column_name, min_value=-20, max_value=40)
batch.expect_column_values_to_not_be_null(column_name)

{'success': True,
 'result': {'element_count': 1440,
  'unexpected_count': 0,
  'unexpected_percent': 0.0,
  'partial_unexpected_list': []}}

Add more expectations here. **Hint:** type `batch.expect_` and press Tab to let Jupyter's autocomplete list all the expectations!

In [83]:
batch.expect_column_min_to_be_between('Ramount', 0, 10)

{'success': True,
 'result': {'observed_value': 0.0,
  'element_count': 1440,
  'missing_count': 0,
  'missing_percent': 0.0}}
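Aggregate expectations such as this one report a single `observed_value` rather than a list of unexpected elements. The following sketch imitates that result shape in plain Python. `check_min_between`, its treatment of missing values, and the percentage scale are assumptions, not the library's implementation:

```python
def check_min_between(values, min_value, max_value):
    """Check that the smallest non-missing value lies in [min_value, max_value]."""
    present = [v for v in values if v is not None]
    missing_count = len(values) - len(present)
    observed = min(present) if present else None
    success = observed is not None and min_value <= observed <= max_value
    return {
        "success": success,
        "result": {
            "observed_value": observed,
            "element_count": len(values),
            "missing_count": missing_count,
            "missing_percent": 100.0 * missing_count / len(values) if values else 0.0,
        },
    }


rain = [0.0, 0.2, None, 1.4]  # made-up sample values
outcome = check_min_between(rain, 0, 10)
print(outcome["success"], outcome["result"]["observed_value"])
```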

In [84]:
batch

TIMESTAMP,BattV_Avg,PTemp_C_Avg,Wdmin_Min,Wdavg,Wdmax_Max,Wsmin_Min,Wsavg_Avg,Wsmax_Max,airtemp_Avg,relhumidity_Avg,...,IR_TempC_Avg,Total_Avg,Diffuse_Avg,Sun,H_Flux_sc_9_Ost_Avg,H_Flux_sc_8_fernerOst_Avg,H_Flux_sc_8_Mitte_Avg,shf_cal(1),shf_cal(2),shf_cal(3)
2018-01-01 00:00:00,13.60,-0.053,180,199,211,0.6,0.8,1.0,0.6,89.1,...,-1.930,17.770,18.430,0,-11.27593,-10.073530,-9.107799,0.028665,0.04813,0.048112
2018-01-01 00:01:00,13.33,-0.053,136,150,167,0.3,0.5,0.7,0.5,88.5,...,-1.972,0.987,1.810,0,-11.27593,-10.038920,-9.107799,0.028665,0.04813,0.048112
2018-01-01 00:02:00,13.33,-0.056,156,166,183,0.5,0.7,0.8,0.5,88.3,...,-1.902,1.645,2.797,0,,,,0.028665,0.04813,0.048112
2018-01-01 00:03:00,13.52,-0.058,162,178,187,0.6,0.7,0.8,0.5,88.5,...,-1.874,17.930,2.139,0,,,,0.028665,0.04813,0.048112
2018-01-01 00:04:00,13.58,-0.053,172,180,188,0.7,0.9,1.1,0.5,88.8,...,-1.741,17.440,18.590,0,,,,0.028665,0.04813,0.048112
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2018-01-01 23:55:00,13.60,4.865,249,259,266,4.1,4.9,5.5,5.9,71.2,...,3.787,0.493,1.480,0,91.05236,-9.793701,-9.567810,0.028665,0.04813,0.048396
2018-01-01 23:56:00,13.58,4.872,207,231,253,5.2,6.2,7.3,5.9,71.0,...,3.816,0.658,1.480,0,87.67697,-9.516279,-9.463995,0.028665,0.04813,0.048396
2018-01-01 23:57:00,13.60,4.880,236,245,255,4.3,5.5,6.7,5.9,70.8,...,3.709,0.493,1.480,0,94.70739,-9.827722,-9.670483,0.028665,0.04813,0.048396
2018-01-01 23:58:00,13.58,4.888,193,205,213,5.3,6.2,6.9,5.9,71.1,...,3.819,19.570,19.740,0,99.94019,-9.724253,-9.636411,0.028665,0.04813,0.048396


In [59]:
batch.expect_column_values_to_not_be_null('Ramount')

{'success': True,
 'result': {'element_count': 1440,
  'unexpected_count': 0,
  'unexpected_percent': 0.0,
  'partial_unexpected_list': []}}

In [60]:
batch.expect_column_values_to_not_be_null('Rduration_Avg')

{'success': True,
 'result': {'element_count': 1440,
  'unexpected_count': 0,
  'unexpected_percent': 0.0,
  'partial_unexpected_list': []}}

## 7. Review and save your Expectations

Expectations that are `True` on this data batch are added automatically. Let's view all the expectations you created in machine-readable JSON.

In [61]:
batch.get_expectation_suite()

2019-12-05T09:27:12+0100 - INFO - 	6 expectation(s) included in expectation_suite. result_format settings filtered.


{'data_asset_name': 'tereno_fendt/s3/data_2018',
 'expectation_suite_name': 'basic',
 'meta': {'great_expectations.__version__': '0.8.6'},
 'expectations': [{'expectation_type': 'expect_column_values_to_not_be_null',
   'kwargs': {'column': 'BattV_Avg'}},
  {'expectation_type': 'expect_column_mean_to_be_between',
   'kwargs': {'column': 'PTemp_C_Avg', 'min_value': -20, 'max_value': 40}},
  {'expectation_type': 'expect_column_values_to_not_be_null',
   'kwargs': {'column': 'PTemp_C_Avg'}},
  {'expectation_type': 'expect_column_min_to_be_between',
   'kwargs': {'column': 'Ramount', 'min_value': 0, 'max_value': 10}},
  {'expectation_type': 'expect_column_values_to_not_be_null',
   'kwargs': {'column': 'Ramount'}},
  {'expectation_type': 'expect_column_values_to_not_be_null',
   'kwargs': {'column': 'Rduration_Avg'}}],
 'data_asset_type': 'Dataset'}
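Since the suite is an ordinary nested dictionary, you can inspect it with standard Python tools. As a small sketch, the snippet below copies the six expectations from the output above into a literal and counts them by type. The variable names are illustrative:

```python
from collections import Counter

# The expectations shown in the suite output above, copied into a literal.
suite = {
    "expectation_suite_name": "basic",
    "expectations": [
        {"expectation_type": "expect_column_values_to_not_be_null",
         "kwargs": {"column": "BattV_Avg"}},
        {"expectation_type": "expect_column_mean_to_be_between",
         "kwargs": {"column": "PTemp_C_Avg", "min_value": -20, "max_value": 40}},
        {"expectation_type": "expect_column_values_to_not_be_null",
         "kwargs": {"column": "PTemp_C_Avg"}},
        {"expectation_type": "expect_column_min_to_be_between",
         "kwargs": {"column": "Ramount", "min_value": 0, "max_value": 10}},
        {"expectation_type": "expect_column_values_to_not_be_null",
         "kwargs": {"column": "Ramount"}},
        {"expectation_type": "expect_column_values_to_not_be_null",
         "kwargs": {"column": "Rduration_Avg"}},
    ],
}

by_type = Counter(e["expectation_type"] for e in suite["expectations"])
columns = sorted({e["kwargs"]["column"] for e in suite["expectations"]})

print(by_type.most_common())
print(columns)
```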

    
    
If you decide not to save some of the expectations you created, use the [`remove_expectation` method](https://docs.greatexpectations.io/en/latest/module_docs/data_asset_module.html?highlight=remove_expectation&utm_source=notebook&utm_medium=create_expectations#great_expectations.data_asset.data_asset.DataAsset.remove_expectation). You can also choose not to filter out expectations that were `False` on this batch.
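Conceptually, removing an expectation means dropping the matching entry from the suite's `expectations` list. The sketch below does this on a plain dictionary so you can see the effect. `drop_expectation` is a hypothetical helper for illustration and is not the library method linked above:

```python
def drop_expectation(suite, expectation_type, column):
    """Return a copy of the suite without the matching expectation(s)."""
    kept = [
        e for e in suite["expectations"]
        if not (e["expectation_type"] == expectation_type
                and e["kwargs"].get("column") == column)
    ]
    return {**suite, "expectations": kept}


suite = {
    "expectation_suite_name": "basic2",
    "expectations": [
        {"expectation_type": "expect_column_values_to_not_be_null",
         "kwargs": {"column": "Ramount"}},
        {"expectation_type": "expect_column_min_to_be_between",
         "kwargs": {"column": "Ramount", "min_value": 0, "max_value": 10}},
    ],
}

trimmed = drop_expectation(suite, "expect_column_min_to_be_between", "Ramount")
print(len(suite["expectations"]), len(trimmed["expectations"]))
```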


The following method will save the expectation suite as a JSON file in the `great_expectations/expectations` directory of your project:
    

In [62]:
batch.save_expectation_suite()

2019-12-05T09:27:55+0100 - INFO - 	6 expectation(s) included in expectation_suite. result_format settings filtered.
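A saved suite is plain JSON, so it can be read back with the standard library. The round trip below is only a sketch. It writes to a temporary directory, and the file name `basic2.json` is an assumption rather than the layout Great Expectations uses inside `great_expectations/expectations`:

```python
import json
import os
import tempfile

suite = {
    "data_asset_name": "tereno_fendt/s3/data_2018",
    "expectation_suite_name": "basic2",
    "expectations": [
        {"expectation_type": "expect_column_min_to_be_between",
         "kwargs": {"column": "Ramount", "min_value": 0, "max_value": 10}},
    ],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "basic2.json")  # assumed file name
    with open(path, "w") as f:
        json.dump(suite, f, indent=2)
    with open(path) as f:
        reloaded = json.load(f)

print(reloaded == suite)
```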


## 8. View the Expectations in Data Docs

Let's now build and look at your Data Docs. These will now include an **Expectation Suite Overview** built from the expectations you just created that helps you communicate about your data with both machines and humans.

In [54]:
context.build_data_docs()



{'local_site': '/Users/werner-ch/Documents/Repos/ge-example/great_expectations/uncommitted/data_docs/local_site/index.html'}

In [55]:
context.open_data_docs()

## Congratulations! You created and saved Expectations

## Next steps:

### 1. Play with Validation

Validation is the process of checking whether new batches of this data meet your expectations before they are processed by your pipeline. Go to [validation_playground.ipynb](validation_playground.ipynb) to see how!
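To illustrate the idea before you open the playground notebook, here is a toy validator that replays two saved expectations against new batches held in plain dictionaries. Everything in it (`CHECKS`, `validate`, the sample batches) is hypothetical. The real validation API is covered in the playground notebook:

```python
def not_null(values):
    """True when no value is missing."""
    return all(v is not None for v in values)


def min_between(values, min_value, max_value):
    """True when the smallest non-missing value lies in [min_value, max_value]."""
    present = [v for v in values if v is not None]
    return bool(present) and min_value <= min(present) <= max_value


# Map expectation types to the toy checks above.
CHECKS = {
    "expect_column_values_to_not_be_null": not_null,
    "expect_column_min_to_be_between": min_between,
}


def validate(batch, expectations):
    """Run every expectation against the batch and collect the outcomes."""
    results = []
    for expectation in expectations:
        kwargs = dict(expectation["kwargs"])
        column = kwargs.pop("column")
        check = CHECKS[expectation["expectation_type"]]
        results.append(check(batch[column], **kwargs))
    return {"success": all(results), "results": results}


expectations = [
    {"expectation_type": "expect_column_values_to_not_be_null",
     "kwargs": {"column": "Ramount"}},
    {"expectation_type": "expect_column_min_to_be_between",
     "kwargs": {"column": "Ramount", "min_value": 0, "max_value": 10}},
]

good_batch = {"Ramount": [0.0, 0.1, 0.0]}   # made-up values
bad_batch = {"Ramount": [0.0, None, -1.0]}  # made-up values

good_report = validate(good_batch, expectations)
bad_report = validate(bad_batch, expectations)
print(good_report["success"], bad_report["success"])
```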


### 2. Explore the documentation & community

You are now among the elite data professionals who know how to build robust descriptions of your data and protections for pipelines and machine learning models. Join the [**Great Expectations Slack Channel**](https://greatexpectations.io/slack) to see how others are wielding these superpowers.