## Why Data Quality


- Data-driven organizations and products
- Wrong decisions, bad products
- It's all about trust

## Data Quality Effect


- Is the data correct?
- If so, as of what time?


![](dashboard.png)

- It is best to find violations as early as possible

- For data work, productivity and integrity are the same thing
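Finding violations early can be sketched as a check at the pipeline boundary that raises before any downstream step runs. This is a minimal illustration in plain Python, not the Great Expectations API; `DataQualityError`, `validate_batch`, and `pipeline` are hypothetical names, and the rules (age between 0 and 80, survival flag of 0 or 1) are assumptions chosen to match the Titanic example used later.

```python
class DataQualityError(ValueError):
    """Raised when a batch violates a rule at the pipeline boundary (hypothetical)."""


def validate_batch(rows):
    """Return the rows as a list, or raise on the first violation found."""
    checked = []
    for i, row in enumerate(rows):
        age = row.get("Age")
        # Missing ages are tolerated in this sketch; out-of-range ages are not.
        if age is not None and not (0 <= age <= 80):
            raise DataQualityError(f"row {i}: Age={age} is outside [0, 80]")
        if row.get("Survived") not in (0, 1):
            raise DataQualityError(f"row {i}: Survived={row.get('Survived')!r} is not 0 or 1")
        checked.append(row)
    return checked


def pipeline(rows):
    """Toy pipeline: validate first, then compute the mean age."""
    clean = validate_batch(rows)  # fails here, before any downstream work
    ages = [r["Age"] for r in clean if r.get("Age") is not None]
    return sum(ages) / len(ages)
```

With this shape, a bad batch stops at ingestion with a message naming the offending row, instead of surfacing later as a wrong number on a dashboard.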

![](identify_the_problem.png)

![](pull.png)


![](push.png)

#### Trying to maintain data systems that are untested, undocumented and unstable is nearly impossible


## Data Quality characteristics

- **Accuracy**: the data must correctly describe what it represents.
- **Relevancy**: the data should meet the requirements of its intended use.
- **Completeness**: the data should have no missing values and no missing records.
- **Timeliness**: the data should be up to date.
- **Consistency**: the data should follow the expected format and give the same results when cross-referenced.
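Some of these characteristics translate directly into small checks. The sketch below covers completeness, timeliness, and consistency in plain Python; accuracy and relevancy usually need domain knowledge and are left out. The function names, the sample records, and the 24-hour freshness window are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def completeness(records, column):
    """Fraction of records that have a non-missing value in `column`."""
    if not records:
        return 0.0
    present = sum(1 for r in records if r.get(column) is not None)
    return present / len(records)


def is_timely(last_updated, now, max_age):
    """True when the data was refreshed within `max_age` of `now`."""
    return now - last_updated <= max_age


def is_consistent(records, column, allowed):
    """True when every non-missing value in `column` is one of `allowed`."""
    return all(
        r[column] in allowed for r in records if r.get(column) is not None
    )


# Hypothetical sample records, shaped like rows of the Titanic dataset.
records = [
    {"Age": 22.0, "Sex": "male"},
    {"Age": None, "Sex": "female"},
    {"Age": 35.0, "Sex": "female"},
]

age_completeness = completeness(records, "Age")  # 2 of the 3 rows have an Age
sex_consistent = is_consistent(records, "Sex", {"male", "female"})
fresh = is_timely(
    last_updated=datetime(2020, 10, 21, 6, 0),
    now=datetime(2020, 10, 21, 9, 0),
    max_age=timedelta(hours=24),
)
```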


## Data Quality tools


- [Great Expectations](https://github.com/great-expectations/great_expectations)
- [Deequ](https://github.com/awslabs/deequ)
- [TensorFlow Data Validation](https://www.tensorflow.org/tfx/data_validation/get_started)


## Great Expectations Native

In [2]:
import pandas as pd
import great_expectations as ge
import great_expectations.jupyter_ux
import json

from datetime import datetime
from sklearn.model_selection import train_test_split

2020-10-21T09:11:58+0300 - INFO - Great Expectations logging enabled at 20 level by JupyterUX module.


In [3]:
df = ge.read_csv("data/titanic.csv")
train, test = train_test_split(df, test_size=0.3)

#### Expectations are assertions about data
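To make the idea concrete, here is a rough sketch of what one expectation does conceptually: it evaluates a rule over a column and returns a structured result instead of raising. This is plain Python and not the library's implementation; the function name, the use of `re.search`, the skipping of missing values, and the reading of `mostly` as a minimum passing fraction are assumptions for illustration.

```python
import re


def expect_values_to_match_regex(values, pattern, mostly=1.0):
    """Check that at least a `mostly` fraction of non-missing values match `pattern`."""
    compiled = re.compile(pattern)
    non_missing = [v for v in values if v is not None]
    unexpected = [v for v in non_missing if compiled.search(v) is None]
    if non_missing:
        fraction_ok = 1 - len(unexpected) / len(non_missing)
    else:
        fraction_ok = 1.0  # nothing to check, so nothing violates the rule
    return {
        "success": fraction_ok >= mostly,
        "result": {
            "element_count": len(values),
            "missing_count": len(values) - len(non_missing),
            "unexpected_count": len(unexpected),
        },
    }


# Hypothetical sample: two well-formed names, one malformed, one missing.
names = ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "unknown", None]
lenient = expect_values_to_match_regex(names, r"[A-Z][a-z]+, ", mostly=0.6)
strict = expect_values_to_match_regex(names, r"[A-Z][a-z]+, ", mostly=0.95)
```

Here two of the three non-missing names match, so the lenient check passes and the strict one fails, while both report the same counts.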

In [4]:
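# Value-level checks: each value in the column must satisfy the rule
train.expect_column_values_to_be_between("Age", 0, 80)
train.expect_column_values_to_be_in_set("Survived", [1, 0])
# Aggregate check on the column mean
train.expect_column_mean_to_be_between("Age", 20, 40)
# Raw string avoids invalid escape sequences; mostly=.95 tolerates up to 5% mismatches
train.expect_column_values_to_match_regex("Name", r"[A-Z][a-z]+(?: \([A-Z][a-z]+\))?, ", mostly=.95)
train.expect_column_values_to_be_in_set("Sex", ["male", "female"])

# Validate once and reuse the result; the key is lowercase "success"
results = train.validate()
if results["success"]:
    ...
results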

2020-10-21T09:11:59+0300 - INFO - 	5 expectation(s) included in expectation_suite.


{
  "evaluation_parameters": {},
  "statistics": {
    "evaluated_expectations": 5,
    "successful_expectations": 5,
    "unsuccessful_expectations": 0,
    "success_percent": 100.0
  },
  "success": true,
  "meta": {
    "great_expectations_version": "0.12.4",
    "expectation_suite_name": "default",
    "run_id": {
      "run_time": "2020-10-21T06:11:59.247162+00:00",
      "run_name": null
    },
    "batch_kwargs": {
      "ge_batch_id": "5406c6de-1364-11eb-ac9b-acde48001122"
    },
    "batch_markers": {},
    "batch_parameters": {},
    "validation_time": "20201021T061159.247065Z"
  },
  "results": [
    {
      "success": true,
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_message": null,
        "exception_traceback": null
      },
      "result": {
        "element_count": 39,
        "missing_count": 7,
        "missing_percent": 17.94871794871795,
        "unexpected_count": 0,
        "unexpected_percent": 0.0,
        "u

In [5]:
my_expectations = train.get_expectation_suite()
test.validate(expectation_suite=my_expectations)

2020-10-21T09:11:59+0300 - INFO - 	5 expectation(s) included in expectation_suite. result_format settings filtered.


{
  "evaluation_parameters": {},
  "statistics": {
    "evaluated_expectations": 5,
    "successful_expectations": 5,
    "unsuccessful_expectations": 0,
    "success_percent": 100.0
  },
  "success": true,
  "meta": {
    "great_expectations_version": "0.12.4",
    "expectation_suite_name": "default",
    "run_id": {
      "run_time": "2020-10-21T06:11:59.270239+00:00",
      "run_name": null
    },
    "batch_kwargs": {
      "ge_batch_id": "5406d8cc-1364-11eb-ac9b-acde48001122"
    },
    "batch_markers": {},
    "batch_parameters": {},
    "validation_time": "20201021T061159.270174Z"
  },
  "results": [
    {
      "success": true,
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_message": null,
        "exception_traceback": null
      },
      "result": {
        "element_count": 18,
        "missing_count": 1,
        "missing_percent": 5.555555555555555,
        "unexpected_count": 0,
        "unexpected_percent": 0.0,
        "u

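The reuse shown above works because a suite is data rather than code: a list of expectation configurations that can be saved and replayed against any batch. A minimal sketch of that idea in plain Python follows; the `CHECKS` registry, the config layout, and the sample rows are hypothetical and do not mirror the library's internal format.

```python
import json

# Hypothetical registry mapping an expectation type to a per-value check.
CHECKS = {
    "values_between": lambda v, min_value, max_value: min_value <= v <= max_value,
    "values_in_set": lambda v, value_set: v in value_set,
}


def validate(rows, suite):
    """Apply every expectation in `suite` to `rows`, skipping missing values."""
    results = []
    for exp in suite:
        check = CHECKS[exp["type"]]
        values = [r.get(exp["column"]) for r in rows]
        values = [v for v in values if v is not None]
        ok = all(check(v, **exp["kwargs"]) for v in values)
        results.append({"expectation": exp, "success": ok})
    return {"success": all(r["success"] for r in results), "results": results}


suite = [
    {"type": "values_between", "column": "Age",
     "kwargs": {"min_value": 0, "max_value": 80}},
    {"type": "values_in_set", "column": "Survived",
     "kwargs": {"value_set": [0, 1]}},
]

# The suite survives a JSON round trip, so it can be stored next to the data.
restored = json.loads(json.dumps(suite))

train_rows = [{"Age": 22.0, "Survived": 0}, {"Age": 38.0, "Survived": 1}]
test_rows = [{"Age": None, "Survived": 1}, {"Age": 35.0, "Survived": 0}]

train_report = validate(train_rows, suite)
test_report = validate(test_rows, restored)
```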
## Great Expectations Validation in Your pipeline

In [10]:
# ! great_expectations init

In [15]:
context = ge.data_context.DataContext()
context.list_expectation_suite_names()

['titanic_validation']

In [16]:
expectation_suite_name = " "
batch_kwargs = {'path': "https://github.com/plotly/datasets/raw/master/titanic.csv",
                'datasource': "titanic"}
batch = context.get_batch(batch_kwargs, my_expectations)
batch.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | | S |


In [17]:
run_id = {"run_name": "First validation", "run_time": datetime.now()}
results = context.run_validation_operator("action_list_operator", 
                                          assets_to_validate=[batch], 
                                          run_id=run_id)
results

2020-10-21T11:51:14+0300 - INFO - 	5 expectation(s) included in expectation_suite.


{
  "validation_operator_config": {
    "class_name": "ActionListValidationOperator",
    "module_name": "great_expectations.validation_operators",
    "name": "action_list_operator",
    "kwargs": {
      "action_list": [
        {
          "name": "store_validation_result",
          "action": {
            "class_name": "StoreValidationResultAction"
          }
        },
        {
          "name": "store_evaluation_params",
          "action": {
            "class_name": "StoreEvaluationParametersAction"
          }
        },
        {
          "name": "update_data_docs",
          "action": {
            "class_name": "UpdateDataDocsAction"
          }
        }
      ],
      "result_format": {
        "result_format": "SUMMARY",
        "partial_unexpected_count": 20
      }
    }
  },
  "evaluation_parameters": null,
  "success": true,
  "run_results": {
    "ValidationResultIdentifier::default/First validation/20201021T115114.262382Z/dc6329347f045298907901acf6f3f906": {
  

## Tests as Data Documentation

#### Your docs are your tests and your tests are your docs

In [19]:
context.build_data_docs()
context.open_data_docs()
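The reason tests can double as documentation is that each expectation is a declarative statement about a column, so it can be rendered as a sentence. The sketch below shows the idea in plain Python with a hypothetical suite layout and renderer; it is not how the library builds its data docs.

```python
def render_expectation(exp):
    """Turn one expectation config into a human-readable bullet."""
    column = exp["column"]
    kwargs = exp["kwargs"]
    if exp["type"] == "values_between":
        return f"- `{column}` must be between {kwargs['min_value']} and {kwargs['max_value']}."
    if exp["type"] == "values_in_set":
        allowed = ", ".join(repr(v) for v in kwargs["value_set"])
        return f"- `{column}` must be one of: {allowed}."
    # Fallback for expectation types this sketch does not know how to phrase.
    return f"- `{column}`: {exp['type']} {kwargs}"


def render_docs(suite_name, suite):
    """Render a whole suite as a small Markdown document."""
    lines = [f"## {suite_name}", ""]
    lines += [render_expectation(e) for e in suite]
    return "\n".join(lines)


# Hypothetical suite in the same spirit as the expectations defined earlier.
suite = [
    {"type": "values_between", "column": "Age",
     "kwargs": {"min_value": 0, "max_value": 80}},
    {"type": "values_in_set", "column": "Sex",
     "kwargs": {"value_set": ["male", "female"]}},
]

docs = render_docs("titanic_validation", suite)
```

Because the documentation is generated from the same configs that run as tests, the two cannot drift apart.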

## Additional Resources

- [Great Expectations 101](https://www.youtube.com/watch?v=uM9DB2ca8T8)
- [Great Expectations 201](https://www.youtube.com/watch?v=LuLhS_oLhS8)
- [Great Expectations 301](https://www.youtube.com/watch?v=pq5CBea12v4)