[![AWS SDK for pandas](_static/logo.png "AWS SDK for pandas")](https://github.com/aws/aws-sdk-pandas)

# 34 - Glue Data Quality

AWS Glue Data Quality helps you evaluate and monitor the quality of your data.

## Create test data

First, create some test data, write it to S3, and register it in the Glue Data Catalog.

In [None]:
import awswrangler as wr
import pandas as pd

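# Glue Data Catalog database and table to register the data under
glue_database = "aws_sdk_pandas"
glue_table = "glue_table"
# S3 location for the dataset (placeholder)
path = "s3://..."

df = pd.DataFrame({"c0": [0, 1, 2], "c1": [0, 1, 2], "c2": [0, 0, 1]})
# Write the data frame as a Parquet dataset and register it in the Glue Data Catalog
wr.s3.to_parquet(df, path, dataset=True, database=glue_database, table=glue_table)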

## Start with recommended data quality rules

AWS Glue Data Quality computes statistics for your data and can recommend a set of data quality rules so you can get started quickly.

Running Glue Data Quality recommendation and evaluation tasks requires an IAM role. This role must have permission to access resources that various AWS Glue Data Quality processes require to run on your behalf. To find out more, check [Authorization](https://docs.aws.amazon.com/glue/latest/dg/data-quality-authorization.html).

In [3]:
ruleset_name = "ruleset"
iam_role_arn = "arn:aws:iam::..."

wr.data_quality.create_recommendation_ruleset(
    name=ruleset_name,
    database=glue_database,
    table=glue_table,
    iam_role_arn=iam_role_arn,
)

|   | rule_type | parameter | expression |
|---|---|---|---|
| 0 | RowCount | | between 1 and 6 |
| 1 | IsComplete | c0 | |
| 2 | Uniqueness | c0 | > 0.95 |
| 3 | ColumnValues | c0 | <= 2 |
| 4 | IsComplete | c1 | |
| 5 | Uniqueness | c1 | > 0.95 |
| 6 | ColumnValues | c1 | <= 2 |
| 7 | IsComplete | c2 | |
| 8 | ColumnValues | c2 | <= 1 |
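
To build some intuition for the recommended rules above, the following is a minimal local sketch that computes comparable statistics for the test data in plain Python. It is not how Glue computes them: the helpers `completeness` and `uniqueness` are hypothetical, and the definition of uniqueness used here (the fraction of values that occur exactly once) is an assumption.

```python
from collections import Counter

# Same values as the test data frame created earlier in this tutorial.
data = {"c0": [0, 1, 2], "c1": [0, 1, 2], "c2": [0, 0, 1]}


def completeness(values):
    """Fraction of values that are not None (hypothetical helper)."""
    return sum(v is not None for v in values) / len(values)


def uniqueness(values):
    """Fraction of values that occur exactly once (assumed definition)."""
    counts = Counter(values)
    return sum(1 for v in values if counts[v] == 1) / len(values)


# Per-column statistics that loosely correspond to the
# IsComplete, Uniqueness, and ColumnValues rule types.
metrics = {
    column: {
        "completeness": completeness(values),
        "uniqueness": uniqueness(values),
        "max": max(values),
    }
    for column, values in data.items()
}
row_count = len(data["c0"])

print("row_count", row_count)
for column, stats in metrics.items():
    print(column, stats)
```

Under these assumptions, `c2` has a uniqueness of one third because the value 0 appears twice, which would be consistent with the recommendation containing no `Uniqueness` rule for that column.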


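## Run a data quality task

Evaluating the ruleset runs each rule against the table and returns a data frame with one result row per rule.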

In [4]:
wr.data_quality.evaluate_ruleset(
    name=ruleset_name,
    iam_role_arn=iam_role_arn,
)

|   | Name | Description | Result | ResultId |
|---|---|---|---|---|
| 0 | Rule_1 | RowCount between 1 and 6 | PASS | dqresult-a0113154a72e9abf6b6a9b4ee52c2d6618b0485f |
| 1 | Rule_2 | IsComplete "c0" | PASS | dqresult-a0113154a72e9abf6b6a9b4ee52c2d6618b0485f |
| 2 | Rule_3 | Uniqueness "c0" > 0.95 | PASS | dqresult-a0113154a72e9abf6b6a9b4ee52c2d6618b0485f |
| 3 | Rule_4 | ColumnValues "c0" <= 2 | PASS | dqresult-a0113154a72e9abf6b6a9b4ee52c2d6618b0485f |
| 4 | Rule_5 | IsComplete "c1" | PASS | dqresult-a0113154a72e9abf6b6a9b4ee52c2d6618b0485f |
| 5 | Rule_6 | Uniqueness "c1" > 0.95 | PASS | dqresult-a0113154a72e9abf6b6a9b4ee52c2d6618b0485f |
| 6 | Rule_7 | ColumnValues "c1" <= 2 | PASS | dqresult-a0113154a72e9abf6b6a9b4ee52c2d6618b0485f |
| 7 | Rule_8 | IsComplete "c2" | PASS | dqresult-a0113154a72e9abf6b6a9b4ee52c2d6618b0485f |
| 8 | Rule_9 | ColumnValues "c2" <= 1 | PASS | dqresult-a0113154a72e9abf6b6a9b4ee52c2d6618b0485f |
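
In a pipeline you would typically act on the evaluation result rather than read it by eye. The sketch below shows one possible way to pick out rules that did not pass. The records are made up for illustration (including the `FAIL` value, which does not come from the run above), and `failed_rules` is a hypothetical helper; with a real result data frame, records of this shape could be obtained with `to_dict("records")`.

```python
# Hypothetical records shaped like the result table above.
# The FAIL entry is invented purely to exercise the filter.
results = [
    {"Name": "Rule_1", "Description": "RowCount between 1 and 6", "Result": "PASS"},
    {"Name": "Rule_2", "Description": 'IsComplete "c0"', "Result": "PASS"},
    {"Name": "Rule_3", "Description": 'Uniqueness "c0" > 0.95', "Result": "FAIL"},
]


def failed_rules(records):
    """Return the records whose Result is anything other than PASS."""
    return [record for record in records if record["Result"] != "PASS"]


failures = failed_rules(results)
for rule in failures:
    print(f"{rule['Name']} did not pass: {rule['Description']}")
```

Filtering on "not PASS" rather than on a specific failure value keeps the check conservative: any outcome other than a pass is surfaced.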


## Create ruleset from Data Quality Definition Language definition

Data Quality Definition Language (DQDL) is a domain-specific language that you use to define rules for AWS Glue Data Quality. For the full syntax reference, see [DQDL](https://docs.aws.amazon.com/glue/latest/dg/dqdl.html).

In [6]:
dqdl_rules = (
    "Rules = ["
    "RowCount between 1 and 6,"
    'IsComplete "c0",'
    'Uniqueness "c0" > 0.95,'
    'ColumnValues "c0" <= 2,'
    'IsComplete "c1",'
    'Uniqueness "c1" > 0.95,'
    'ColumnValues "c1" <= 2,'
    'IsComplete "c2",'
    'ColumnValues "c2" <= 1'
    "]"
)

wr.data_quality.create_ruleset(
    name="ruleset2",
    database=glue_database,
    table=glue_table,
    dqdl_rules=dqdl_rules,
)

## Create ruleset from data frame

AWS SDK for pandas also allows you to create a ruleset from a pandas data frame with the columns `rule_type`, `parameter`, and `expression`.

In [8]:
df_rules = pd.DataFrame({
    "rule_type": ["RowCount", "IsComplete", "Uniqueness"],
    "parameter": [None, "c0", "c0"],
    "expression": ["between 1 and 6", None, "> 0.95"],
})

wr.data_quality.create_ruleset(
    name="ruleset3",
    database=glue_database,
    table=glue_table,
    df_rules=df_rules,
)
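
Comparing the rule tables with the DQDL string earlier in this tutorial suggests that each row corresponds to one DQDL rule of the form `rule_type "parameter" expression`. The following is a minimal sketch of that mapping in plain Python. The helpers `rule_to_dqdl` and `rules_to_dqdl` are hypothetical, and the mapping is inferred from the examples here; it is not necessarily how the library performs the conversion internally.

```python
def rule_to_dqdl(rule_type, parameter=None, expression=None):
    """Render one rule as DQDL text, skipping missing parts (assumed mapping)."""
    parts = [rule_type]
    if parameter is not None:
        # Column names appear in double quotes in the DQDL examples.
        parts.append(f'"{parameter}"')
    if expression is not None:
        parts.append(expression)
    return " ".join(parts)


def rules_to_dqdl(rows):
    """Join (rule_type, parameter, expression) rows into a DQDL ruleset string."""
    return "Rules = [" + ",".join(rule_to_dqdl(*row) for row in rows) + "]"


# Same three rules as the data frame above.
rows = [
    ("RowCount", None, "between 1 and 6"),
    ("IsComplete", "c0", None),
    ("Uniqueness", "c0", "> 0.95"),
]

dqdl = rules_to_dqdl(rows)
print(dqdl)
```

A string built this way has the same shape as the `dqdl_rules` value used in the DQDL section, so the two ways of defining a ruleset can be seen as different representations of the same rules.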