# Validation Playground

**Watch** a [short tutorial video](https://greatexpectations.io/videos/getting_started/integrate_expectations) or **read** [the written tutorial](https://docs.greatexpectations.io/en/latest/tutorials/validate_data.html?utm_source=notebook&utm_medium=validate_data)

#### This notebook assumes that you have already created at least one expectation suite in your project.
#### Here you will learn how to validate data loaded into a PySpark DataFrame against an expectation suite.


We'd love it if you **reach out for help on** the [**Great Expectations Slack Channel**](https://greatexpectations.io/slack)

In [1]:
import json
import great_expectations as ge
import great_expectations.jupyter_ux
from great_expectations.datasource.types import BatchKwargs
from datetime import datetime

2020-04-21T16:42:33-0700 - INFO - Great Expectations logging enabled at 20 level by JupyterUX module.


## 1. Get a DataContext
This represents your **project** that you just created using `great_expectations init`.

In [2]:
context = ge.data_context.DataContext()

## 2. Choose an Expectation Suite

List the expectation suites that you created in your project:

In [3]:
context.list_expectation_suite_names()

['Titanic_Test_Alex_0_ExpectationSuite']

In [4]:
expectation_suite_name = 'Titanic_Test_Alex_0_ExpectationSuite' # TODO: set to a name from the list above

## 3. Load a batch of data you want to validate

To learn more about `get_batch`, see [this tutorial](https://docs.greatexpectations.io/en/latest/tutorials/validate_data.html?utm_source=notebook&utm_medium=validate_data#load-a-batch-of-data-to-validate)


In [5]:
# list datasources of the type SparkDFDatasource in your project
[datasource['name'] for datasource in context.list_datasources() if datasource['class_name'] == 'SparkDFDatasource']

['leo_datasource']

In [6]:
datasource_name = 'leo_datasource' # TODO: set to a datasource name from above

In [None]:
# # If you would like to validate a file on a filesystem:
# batch_kwargs = {'path': "YOUR_FILE_PATH", 'datasource': datasource_name}
# # To customize how Spark reads the file, add options under the 'reader_options' key in batch_kwargs (e.g., header='true')

# # If you have already loaded the data into a PySpark DataFrame (pass the DataFrame object itself, not a string):
# batch_kwargs = {'dataset': YOUR_DATAFRAME, 'datasource': datasource_name}


# batch = context.get_batch(batch_kwargs, expectation_suite_name)
# batch.head()

In [7]:
path = '/Users/alexsherstinsky/Downloads/GESupport04202020/JohnCostanzo/Verify/Titanic.csv'

In [8]:
batch_kwargs = {'path': path, 'datasource': datasource_name}

In [9]:
batch = context.get_batch(batch_kwargs, expectation_suite_name)
batch.head()

2020-04-21T17:03:20-0700 - INFO - 	8 expectation(s) included in expectation_suite.


Unnamed: 0,_c0,_c1,_c2,_c3,_c4,_c5,_c6
0,,Name,PClass,Age,Sex,Survived,SexCode
1,1.0,"Allen, Miss Elisabeth Walton",1st,29,female,1,1
2,2.0,"Allison, Miss Helen Loraine",1st,2,female,0,1
3,3.0,"Allison, Mr Hudson Joshua Creighton",1st,30,male,0,0
4,4.0,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25,female,0,1
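
The preview above shows generic column names (`_c0` to `_c6`) with the CSV header as the first data row, because Spark's CSV reader does not treat the first line as a header unless told to. As the commented cell earlier notes, reader settings can go under the `reader_options` key of `batch_kwargs`. Below is a minimal sketch, assuming the datasource forwards `reader_options` to Spark's reader unchanged. The helper name `build_path_batch_kwargs` is hypothetical and is not part of Great Expectations.

```python
def build_path_batch_kwargs(path, datasource_name, header=True):
    """Return batch_kwargs for a file, asking Spark to parse the header row.

    Hypothetical helper: only the 'path', 'datasource', and 'reader_options'
    keys come from this notebook.
    """
    return {
        "path": path,
        "datasource": datasource_name,
        # Assumption: these options are passed through to Spark's CSV reader.
        "reader_options": {"header": header},
    }


example_kwargs = build_path_batch_kwargs("YOUR_FILE_PATH", "leo_datasource")
print(example_kwargs)
```

Passing such a dictionary to `context.get_batch(batch_kwargs, expectation_suite_name)` should then produce a batch whose columns carry the names from the file's header row.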


In [12]:
import os
import sys
import io

import time
import datetime

from pyspark import SQLContext

from pyspark.context import SparkContext
from pyspark.sql import SparkSession

In [13]:
from pyspark.sql import functions as F

In [14]:
sys.version_info

sys.version_info(major=3, minor=7, micro=4, releaselevel='final', serial=0)

In [15]:
os.environ.get('PYSPARK_PYTHON')

In [16]:
# Build (or reuse) a local Spark session with two worker threads.
spark_session = (
    SparkSession.builder.appName("pytest-pyspark-local-notebook-manage_expectations")
    .master("local[2]")
    .config("spark.executor.memory", "6g")
    .config("spark.driver.memory", "6g")
    .config("spark.ui.showConsoleProgress", "false")
    .config("spark.sql.shuffle.partitions", "2")
    .config("spark.default.parallelism", "4")
    .enableHiveSupport()
    .getOrCreate()
)
sc = spark_session.sparkContext

In [17]:
# SQLContext is deprecated as of Spark 3.0; the SparkSession exposes the same .read API.
spark = spark_session

In [21]:
df = spark.read.csv(path, header=True)

In [22]:
df.show(20, False)

+---+------------------------------------------------+------+----+------+--------+-------+
|_c0|Name                                            |PClass|Age |Sex   |Survived|SexCode|
+---+------------------------------------------------+------+----+------+--------+-------+
|1  |Allen, Miss Elisabeth Walton                    |1st   |29  |female|1       |1      |
|2  |Allison, Miss Helen Loraine                     |1st   |2   |female|0       |1      |
|3  |Allison, Mr Hudson Joshua Creighton             |1st   |30  |male  |0       |0      |
|4  |Allison, Mrs Hudson JC (Bessie Waldo Daniels)   |1st   |25  |female|0       |1      |
|5  |Allison, Master Hudson Trevor                   |1st   |0.92|male  |1       |0      |
|6  |Anderson, Mr Harry                              |1st   |47  |male  |1       |0      |
|7  |Andrews, Miss Kornelia Theodosia                |1st   |63  |female|1       |1      |
|8  |Andrews, Mr Thomas, jr                          |1st   |39  |male  |0       |0      |

In [23]:
batch_kwargs = {'dataset': df, 'datasource': datasource_name}

In [24]:
batch = context.get_batch(batch_kwargs, expectation_suite_name)
batch.head()

2020-04-21T17:16:24-0700 - INFO - 	8 expectation(s) included in expectation_suite.


Unnamed: 0,_c0,Name,PClass,Age,Sex,Survived,SexCode
0,1,"Allen, Miss Elisabeth Walton",1st,29.0,female,1,1
1,2,"Allison, Miss Helen Loraine",1st,2.0,female,0,1
2,3,"Allison, Mr Hudson Joshua Creighton",1st,30.0,male,0,0
3,4,"Allison, Mrs Hudson JC (Bessie Waldo Daniels)",1st,25.0,female,0,1
4,5,"Allison, Master Hudson Trevor",1st,0.92,male,1,0


In [27]:
profile_results = context.profile_data_asset(
    datasource_name=datasource_name,  # reuse the name selected above instead of hard-coding it
    expectation_suite_name=expectation_suite_name,
    batch_kwargs=batch_kwargs,
)

if profile_results["success"]:
    print("Profiling completed")
    context.build_data_docs()
else:
    print("Profiling failed")


2020-04-21T17:21:08-0700 - INFO - Profiling 'leo_datasource' with 'BasicDatasetProfiler'
2020-04-21T17:21:08-0700 - INFO -             Preparing column 1 of 7: _c0
2020-04-21T17:21:08-0700 - INFO -             Preparing column 2 of 7: Name
2020-04-21T17:21:08-0700 - INFO -             Preparing column 3 of 7: PClass
2020-04-21T17:21:08-0700 - INFO -             Preparing column 4 of 7: Age
2020-04-21T17:21:09-0700 - INFO -             Preparing column 5 of 7: Sex
2020-04-21T17:21:09-0700 - INFO -             Preparing column 6 of 7: Survived
2020-04-21T17:21:09-0700 - INFO -             Preparing column 7 of 7: SexCode
2020-04-21T17:21:09-0700 - INFO - 	49 expectation(s) included in expectation_suite.
2020-04-21T17:21:11-0700 - INFO - 	Profiled 7 columns using 1313 rows from None (2.634 sec)
2020-04-21T17:21:11-0700 - INFO - 
Profiled the data asset, with 1313 total rows and 7 columns in 2.64 seconds.
Generated, evaluated, and stored 49 Expectations during profiling. Please review resu

In [None]:
# df = spark.read.parquet(hdfs_file)
# df.show(20, False)

# context = ge.DataContext()
# batch_kwargs = {
#     "datasource": "leo_datasource",
#     "dataset": df.limit(1000),
# }

# # profile the dataset and create or overwrite the given expectation suite name
# profile_results = context.profile_data_asset(
#     datasource_name="leo_datasource",
#     expectation_suite_name=category,
#     batch_kwargs=batch_kwargs,
# )

# if profile_results["success"]:
#     logger.info("Profiling completed")
#     context.build_data_docs()
# else:
#     logger.error("Profiling failed")


In [31]:
results = context.run_validation_operator(
    "action_list_operator", 
    assets_to_validate=[batch], 
)

if results["success"]:
    print("Validations completed")
    context.build_data_docs()
else:
    print("Validations failed")


2020-04-21T17:23:02-0700 - INFO - Setting run_id to: 20200422T002302.411112Z
2020-04-21T17:23:02-0700 - INFO - 	8 expectation(s) included in expectation_suite.
Validations completed
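
The `success` flag only says whether everything passed. Below is a minimal sketch of listing which expectations failed. It assumes a single validation result serialized to a plain dictionary, with a `results` list whose entries carry `success` and `expectation_config`. The sample data is made up for illustration, the function name `failed_expectations` is hypothetical, and the exact structure returned by a validation operator may differ between Great Expectations versions.

```python
def failed_expectations(validation_result):
    """Return (expectation_type, column) pairs for every failed expectation.

    Assumes the dictionary layout described in the text above; missing keys
    are tolerated and yield None for the corresponding field.
    """
    failures = []
    for item in validation_result.get("results", []):
        if not item.get("success", False):
            config = item.get("expectation_config", {})
            column = config.get("kwargs", {}).get("column")
            failures.append((config.get("expectation_type"), column))
    return failures


# Illustrative data only; not output from the notebook run above.
sample_result = {
    "success": False,
    "results": [
        {
            "success": True,
            "expectation_config": {
                "expectation_type": "expect_column_to_exist",
                "kwargs": {"column": "Name"},
            },
        },
        {
            "success": False,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "Age"},
            },
        },
    ],
}

print(failed_expectations(sample_result))
```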


In [None]:
# Note (Alex): the code below was not run in this session; it is included only as an example.

## 4. Validate the batch with Validation Operators

`Validation Operators` provide a convenient way to bundle the validation of
multiple expectation suites and the actions that should be taken after validation.

When deploying Great Expectations in a **real data pipeline, you will typically discover these needs**:

* validating a group of batches that are logically related
* validating a batch against several expectation suites such as using a tiered pattern like `warning` and `failure`
* doing something with the validation results (e.g., saving them for a later review, sending notifications in case of failures, etc.).

[Read more about Validation Operators in the tutorial](https://docs.greatexpectations.io/en/latest/tutorials/validate_data.html?utm_source=notebook&utm_medium=validate_data#save-validation-results)
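
The tiered `warning` and `failure` pattern mentioned above can be sketched in plain Python. This is an illustration of the decision logic only, not a Great Expectations API: the function name `decide_pipeline_action` and the action labels are hypothetical, and the input is assumed to be a mapping from tier name to whether that tier's suite passed.

```python
def decide_pipeline_action(suite_outcomes):
    """Map per-tier validation outcomes to a pipeline action.

    suite_outcomes: dict of tier name ('failure' or 'warning') -> bool,
    where True means the suite for that tier passed. A tier that is
    absent from the dict is treated as passed.
    """
    if not suite_outcomes.get("failure", True):
        # Critical expectations were violated: halt the pipeline.
        return "stop"
    if not suite_outcomes.get("warning", True):
        # Only softer expectations were violated: continue, but alert someone.
        return "notify"
    return "continue"


print(decide_pipeline_action({"failure": True, "warning": False}))
```

Checking the `failure` tier first means a critical problem is never downgraded to a notification, regardless of how the `warning` suite fared.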

In [None]:
# This is an example of invoking a validation operator that is configured by default in the great_expectations.yml file.
# Re-import the class here: the earlier `import datetime` cell rebinds the name `datetime` to the module.
from datetime import datetime

# Generate a run id: a timestamp or another meaningful string that will help you refer to validation results. We recommend that run ids be chronologically sortable.
# Let's make a simple sortable timestamp. Note that this could come from your pipeline runner (e.g., the Airflow run id).
run_id = datetime.utcnow().isoformat().replace(":", "") + "Z"

results = context.run_validation_operator(
    "action_list_operator", 
    assets_to_validate=[batch], 
    run_id=run_id)
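
As a variation on the run id above, here is a minimal sketch that uses a timezone-aware UTC timestamp and produces ids in the same shape as the one in the earlier log line (`20200422T002302.411112Z`). The helper name `make_run_id` is hypothetical. Because the fields are fixed-width and ordered from most to least significant, sorting the strings sorts the runs chronologically.

```python
from datetime import datetime, timezone


def make_run_id(moment=None):
    """Return a chronologically sortable run id such as 20200422T002302.411112Z.

    moment: an optional timezone-aware datetime; defaults to the current time.
    The value is converted to UTC before formatting.
    """
    if moment is None:
        moment = datetime.now(timezone.utc)
    return moment.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%S.%fZ")


earlier = make_run_id(datetime(2020, 4, 22, 0, 23, 2, 411112, tzinfo=timezone.utc))
later = make_run_id(datetime(2020, 4, 22, 0, 23, 3, 0, tzinfo=timezone.utc))
print(earlier, later)
```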

## 5. View the Validation Results in Data Docs

Let's now build and look at your Data Docs. They will now include a **data quality report** built from the `ValidationResults` you just created, which helps you communicate about your data with both machines and humans.

[Read more about Data Docs in the tutorial](https://docs.greatexpectations.io/en/latest/tutorials/validate_data.html?utm_source=notebook&utm_medium=validate_data#view-the-validation-results-in-data-docs)

In [None]:
context.open_data_docs()

## Congratulations! You ran Validations!

## Next steps:

### 1. Read about the typical workflow with Great Expectations:

[typical workflow](https://docs.greatexpectations.io/en/latest/getting_started/typical_workflow.html?utm_source=notebook&utm_medium=validate_data#view-the-validation-results-in-data-docs)

### 2. Explore the documentation & community

You are now among the elite data professionals who know how to build robust descriptions of your data and protections for pipelines and machine learning models. Join the [**Great Expectations Slack Channel**](https://greatexpectations.io/slack) to see how others are wielding these superpowers.