# Data Quality Tutorial - Python 2

## Prerequisites
You should have worked through:
1. The Data Quality Core training module, which is the foundation for all data quality training. This explains the Data Quality Dimensions (like Uniqueness) and how each measure is calculated.
2. The Python 1 Tutorial (found in this directory). This equips you to run all of the data quality functions against dataframes, interpret the output, and export the final report.

### Prior coding experience required
* You should know what pandas dataframes are
* You should have some basic python knowledge

## Aims of Python 2
* To be able to write a data quality config file
    * this is one or more text files that define your data quality rules
* To be able to run this config file against your dataframe
* To write and use a regular expression config file as part of this workflow
    * this allows you to store all your regular expressions in one place, and makes the data quality config file more readable
* To understand how to obtain a config file directly from a data quality report
    * this assists a typical workflow where you might play around with several rules in a notebook, but then want a production-ready config file afterwards


In [None]:
from datetime import datetime

import pandas as pd

from gchq_data_quality import (
    DataQualityConfig,
    UniquenessRule,
    ValidityRegexRule,
)

Let's re-use the same dataframe we used in Python 1

In [None]:
df = pd.DataFrame(
    {
        "id": [1, 2, 3, 3, 5],  # 4 /5 unique
        "name": ["John", "Jane", "Dave", None, "Missing"],  # 1 null value
        "age": [30, 25, 102, 15, -5],  # a negative age
        "email": [
            "john@example.com",
            "jane@example.com",
            "dave@example",
            "test@test.com",
            "alice@example.com",
        ],  # invalid 3rd email
        "category": ["A", "B", "C", "D", "X"],
        "score": [
            10,
            20,
            30,
            40,
            -1,
        ],  # missing scores are defined as -1
        "date": [
            datetime(2023, 1, 1),
            datetime(2023, 2, 1),
            datetime(2023, 3, 1),
            datetime(2021, 1, 1),  # one date too old
            datetime(2023, 5, 1),
        ],
    }
)

df.head()

# Config Files
We define our data quality rules within one or more YAML files. YAML is a data format, similar to JSON, that was chosen because it is simple and more human-readable than JSON.

We will validate this file against the key-value pairs in our DataQualityConfig class.

The YAML file (and DataQualityConfig class) has:
1. Some overall settings for the measurement you are undertaking (all of which are optional)
    1. dataset_name
    2. dataset_id
    3. measurement_sample
    4. lifecycle_stage
    5. measurement_time (if you want to override the default of 'now')
2. A list of data quality rules, where each rule specifies:
    1. The function you are calling (like 'uniqueness' or 'validity_regex')
    2. The parameters of that function (like 'field' and 'regex_pattern')

    The function names correspond to the rule classes, for example 'validity_regex' = ValidityRegexRule
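
As a rough illustration of that naming convention, the sketch below converts a snake_case function name into the matching rule class name. `rule_class_name` is a hypothetical helper written for this tutorial, not part of gchq_data_quality. It reflects only the two pairs that appear in this notebook (`uniqueness` / `UniquenessRule` and `validity_regex` / `ValidityRegexRule`), so treat the pattern as an assumption for any other rule.

```python
def rule_class_name(function_name: str) -> str:
    """Guess the rule class name for a YAML 'function' value (hypothetical helper)."""
    # Capitalise each underscore-separated part, then append 'Rule'
    camel = "".join(part.capitalize() for part in function_name.split("_"))
    return f"{camel}Rule"


print(rule_class_name("uniqueness"))      # UniquenessRule
print(rule_class_name("validity_regex"))  # ValidityRegexRule
```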

In [None]:
# We can find out the overall settings like this:
DataQualityConfig.model_fields

In [None]:
# within 'rules' the parameters will vary based on the function
UniquenessRule.model_fields

In [None]:
ValidityRegexRule.model_fields  # notice how this has an additional field of 'regex_pattern'

## Writing this as a YAML file
YAML files are written as key-value pairs (using a colon to separate the key from the value).
List items are indicated with a single hyphen (-).

### Example YAML file with 2 rules

```yaml
dataset_name: My Source Data
measurement_sample: 10% of records
lifecycle_stage: null # or you can just omit the entry if it's optional and null
rules:
- field: id
  function: uniqueness
- field: name
  na_values: ''
  function: validity_regex
  regex_pattern: '[A-Za-z0-9_]'
```
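
One thing to watch in character classes like the one in the `name` rule: in ASCII, the range `A-z` is wider than `A-Za-z`, because six punctuation characters sit between `Z` and `a`. The check below uses only Python's `re` module to show the difference. Whether it matters for your data depends on how the pattern is applied, which this sketch does not assume.

```python
import re

# Characters covered by the range A-z that are not ASCII letters
extras = [chr(c) for c in range(ord("A"), ord("z") + 1) if not chr(c).isalpha()]
print(extras)  # ['[', '\\', ']', '^', '_', '`']

loose = re.compile(r"[A-z]")
strict = re.compile(r"[A-Za-z]")

print(bool(loose.fullmatch("^")))   # True: '^' sits between 'Z' and 'a'
print(bool(strict.fullmatch("^")))  # False
```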

#### Writing lists
There are two ways to write lists: as individual elements with '-', or inline as you would in Python.

```yaml
- field: category
  function: accuracy
  valid_values: [A,B,C,D] # simplest way
  valid_values: ['A','B','C','D'] # equivalent - but the quotation marks for strings are not necessary
  valid_values:
  - A
  - B
  - C
  - D
  # most verbose way, but for long lists this might be preferable
```

#### Writing REGEX
To ensure characters aren't escaped incorrectly, always surround regex_pattern with single quotation marks (')

```yaml
- field: your_field
  function: validity_regex
  regex_pattern: '[A-Za-z]+' #standard regex
  regex_pattern: '\d{4}-\d{2}-\d{2}' # if you use single quotes, you don't need to 'double escape' for special characters like \d
  regex_pattern: "\\d{4}-\\d{2}-\\d{2}" # this is more confusing; equivalent pattern if you surround your regex with double quotation marks (") rather than single (')
  regex_pattern: 'don''t' # if you need to use the single quotation mark as part of your pattern, you type it twice. This pattern matches the word "don't" 

```
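
If you are more at home in Python than YAML, the quoting rules have a close analogue: a YAML single-quoted string behaves much like a Python raw string (backslashes are kept literally), while a YAML double-quoted string processes backslash escapes like a normal Python string. The sketch below uses only Python's `re` module to show that the two spellings describe the same pattern. It does not parse YAML, and the use of `fullmatch` is an assumption for illustration, not a statement about how validity_regex applies the pattern.

```python
import re

raw_style = r"\d{4}-\d{2}-\d{2}"        # like YAML single quotes: '\d{4}-\d{2}-\d{2}'
escaped_style = "\\d{4}-\\d{2}-\\d{2}"  # like YAML double quotes: "\\d{4}-\\d{2}-\\d{2}"

print(raw_style == escaped_style)  # True: both hold the same pattern text

date_pattern = re.compile(raw_style)
print(bool(date_pattern.fullmatch("2023-01-01")))  # True
print(bool(date_pattern.fullmatch("01/01/2023")))  # False
```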
 
### Try it out yourself
Complete the EXERCISE_rules.yaml, so that we have:
- uniqueness rule running on the id field
- completeness rule running on the name field
- validity_regex rule running on the email field
- accuracy rule running on the category field (with valid values of A,B,C,D)



In [None]:
# this function will create a DataQualityConfig object from one or more yaml files
dc_solution = DataQualityConfig.from_yaml("resources/SOLUTION_rules.yaml")

In [None]:
# UNCOMMENT this line and run once you have tried the exercise
# dc_exercise = DataQualityConfig.from_yaml("resources/EXERCISE_rules.yaml")

### Check your answers
The equality below should evaluate to `True`. If it doesn't, the cell below will print everything out side by side.

In [None]:
# dc_solution == dc_exercise

### Loading YAML files
You might find it useful to separate your data quality rules into sections, such as
- 'biographical.yaml'
- 'customer_data.yaml'

or, as you create more rules, separate them into directories


The pattern for loading these is:
```python
config = DataQualityConfig.from_yaml(['biographical.yaml', 'customer_data.yaml'])
```
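
If your rules are spread across directories, one way to build that list is with `pathlib`. The sketch below creates a throwaway directory tree so it can run on its own. The directory names `people` and `sales` are made up for illustration, and the final `from_yaml` call is left as a comment because it needs the library and real rule files.

```python
import tempfile
from pathlib import Path

# Build a temporary directory tree so the example is self-contained
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "people").mkdir()
    (root / "sales").mkdir()
    (root / "people" / "biographical.yaml").write_text("rules: []\n")
    (root / "sales" / "customer_data.yaml").write_text("rules: []\n")

    # Recursively find every .yaml file; sort for a stable order
    yaml_paths = sorted(str(p) for p in root.rglob("*.yaml"))
    file_names = [Path(p).name for p in yaml_paths]
    print(file_names)  # ['biographical.yaml', 'customer_data.yaml']

    # The list could then be passed on:
    # config = DataQualityConfig.from_yaml(yaml_paths)
```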

## Running the rules against your data
Once you've loaded a DataQualityConfig object, you can run that against a dataframe

In [None]:
from datetime import UTC

dc_solution.measurement_sample = "Our Test Data"
dc_solution.dataset_name = "Overwrite Dataset Name"
dc_solution.measurement_time = datetime.now(tz=UTC)
report = dc_solution.execute(df)

In [None]:
report.to_dataframe(measurement_time_format="%Y-%m-%d %H:%M")

### Retrieving your rules from the Data Quality Report
A useful workflow is to iterate on rules in a notebook to create a draft data quality report. Once you've settled on the rules you want to run, you can pull them out into a YAML file ready to deploy into a production setting. This can be faster or more convenient than writing that YAML file from scratch.

You can then tweak / edit the YAML file to suit

Note: if you overwrite the default rule_description values in your YAML file, then this two-way conversion won't work for those rules (they will be skipped).

In [None]:
report = dc_solution.execute(
    df
)  # let's remove the overwritten values and get a fresh report
config_from_report = DataQualityConfig.from_report(report)
# we can recover the data quality config from a DataQualityReport
# but ONLY if the rule_description is left in the default state (which is a dictionary of all parameters)
dc_solution == config_from_report

### Saving your rules to a YAML file
To avoid writing out the YAML file by hand, you can write a DataQualityConfig to a yaml file.
Note that it will output all rules, but only output the first values for the measurement information (source_data, measurement_sample etc).

It will tend to output a more verbose YAML file than you need. For example, if a value is 'null' it will write that to the YAML file (rather than omitting it).
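
If you want to trim that output, one option is to load the file into plain Python data, drop the null entries, and write it back. The helper below is a hypothetical sketch, not part of gchq_data_quality. It works on ordinary dicts and lists, and the sample data is made up for illustration. It also assumes every null belongs to an optional setting that can safely be omitted, which you should check for your own config.

```python
def drop_nulls(value):
    """Recursively remove dict entries whose value is None (hypothetical helper)."""
    if isinstance(value, dict):
        return {k: drop_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_nulls(item) for item in value]
    return value


# Illustrative data shaped like a parsed config file
verbose = {
    "dataset_name": "My Source Data",
    "lifecycle_stage": None,
    "rules": [{"field": "id", "function": "uniqueness", "rule_description": None}],
}
print(drop_nulls(verbose))
# {'dataset_name': 'My Source Data', 'rules': [{'field': 'id', 'function': 'uniqueness'}]}
```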

In [None]:
config_from_report.to_yaml("resources/yaml_from_report.yaml", overwrite=True)

## Managing your Regular Expressions
It can be easier to manage your regular expressions in one place, especially if you use lots of them.

We provide a pattern for that:
```python
DataQualityConfig.from_yaml('my_config.yaml', regex_yaml_path='my_regex_patterns.yaml')
``` 

This assumes your regex file looks like this:
```yaml
EMAIL_REGEX: '[a-Z...regex here]'
PHONE_REGEX: '[0-9]+'
```

and it will swap out the values for EMAIL_REGEX etc in your config file, which can look like this:

```yaml
- field: email
  function: validity_regex
  regex_pattern: EMAIL_REGEX
```
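
To make the idea concrete, here is a minimal sketch of how such a swap could work on plain dictionaries. It is an illustration only and not the library's implementation: `swap_regex_names` and the `phone` field are hypothetical, and the real behaviour of `regex_yaml_path` may differ in its details.

```python
def swap_regex_names(rules, patterns):
    """Replace regex_pattern values that name a key in `patterns` (illustrative only)."""
    swapped = []
    for rule in rules:
        rule = dict(rule)  # copy so the input is left untouched
        name = rule.get("regex_pattern")
        if name in patterns:
            rule["regex_pattern"] = patterns[name]
        swapped.append(rule)
    return swapped


patterns = {"PHONE_REGEX": "[0-9]+"}
rules = [
    {"field": "phone", "function": "validity_regex", "regex_pattern": "PHONE_REGEX"},
    {"field": "id", "function": "uniqueness"},
]
print(swap_regex_names(rules, patterns)[0]["regex_pattern"])  # [0-9]+
```

Rules without a `regex_pattern`, or with a pattern that is not a key in the regex file, pass through unchanged in this sketch.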

We provide a simple regex_patterns.yaml file for you to try



In [None]:
config_without_swap = DataQualityConfig.from_yaml(
    "resources/SOLUTION_rules_with_regex.yaml"
)
print(
    f"The regex_pattern in the YAML file is this: {config_without_swap.rules[3].regex_pattern}"
)

In [None]:
config_with_regex_swap = DataQualityConfig.from_yaml(
    "resources/SOLUTION_rules_with_regex.yaml",
    regex_yaml_path="resources/regex_patterns.yaml",
)
print(
    f"EMAIL_REGEX gets swapped if we pass a regex YAML file to: {config_with_regex_swap.rules[3].regex_pattern}"
)

You can then run this swapped config against your data

In [None]:
config_with_regex_swap.execute(df).to_dataframe()

## Changing the number of output samples and invalid row numbers
These are controlled with global settings (default is 10)

```python
from gchq_data_quality.globals import SampleConfig
SampleConfig.RECORDS_FAILED_SAMPLE_SIZE = 25
```