### Considerations for the Demonstration

- The dataset we will work with is a synthetic sample of credit card transactions containing 1,000 records.
- The dataset includes the following fields: `transaction_id`, `customer_id`, `transaction_date`, `transaction_amount`, `account_type`, `transaction_type`, and `status`.
- The goal is to detect data quality issues in two dimensions:
  - **Accuracy**: Transaction amounts must not be negative. A small share of negative values is tolerated: the check passes as long as at least 95% of the amounts are non-negative (`mostly=0.95`).
  - **Completeness**: Transaction types must not contain missing (null) values.
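
Before bringing in Great Expectations, it can help to see what the two rules amount to in plain pandas. The snippet below is a minimal sketch on a hypothetical five-row sample (the `sample` frame and its values are invented for illustration and are not part of the dataset); it only shows the idea behind each check, not how Great Expectations evaluates them internally.

```python
import numpy as np
import pandas as pd

# Hypothetical five-row sample; the values are made up for illustration.
sample = pd.DataFrame({
    "transaction_amount": [120.0, -15.5, 980.25, 40.0, 310.0],
    "transaction_type": ["deposit", None, "withdrawal", "transfer", np.nan],
})

# Accuracy: share of amounts that are non-negative.
non_negative_share = (sample["transaction_amount"] >= 0).mean()

# Completeness: number of missing transaction types.
missing_types = int(sample["transaction_type"].isna().sum())

print(f"Non-negative share: {non_negative_share:.2f}")
print(f"Missing transaction_type values: {missing_types}")
```

With a 95% tolerance, this sample would fail the accuracy rule (only 4 of 5 amounts are non-negative) and would also fail the completeness rule (2 missing types).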

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from tabulate import tabulate
import great_expectations as gx

df = pd.read_csv('../data/synthetic_data.csv')
print(df[['transaction_id', 'customer_id', 'transaction_type', 'transaction_amount']].head(9))

context = gx.get_context()

data_source = context.data_sources.add_pandas("pandas")
data_asset = data_source.add_dataframe_asset(name="pd dataframe asset")
batch_definition = data_asset.add_batch_definition_whole_dataframe("batch definition")
batch = batch_definition.get_batch(batch_parameters={"dataframe": df})

   transaction_id  customer_id transaction_type  transaction_amount
0               1         2685       withdrawal         2310.159223
1               2         1769       withdrawal         2069.799784
2               3         7949       withdrawal         6383.228325
3               4         3433       withdrawal         2576.269638
4               5         6311       withdrawal         6323.834705
5               6         6051          deposit          593.961595
6               7         7420         transfer         1426.326191
7               8         2184          deposit         2752.657137
8               9         5555       withdrawal         5974.498700


## Implementing Data Quality Checks with Great Expectations

We'll now define and execute expectations for our two focus dimensions:

### 1. Accuracy Expectations
- Transaction amounts should be non-negative (`min_value=0`); up to 5% of the values may violate this rule (`mostly=0.95`)

In [2]:

# Collect the result of each expectation for the summary table
expectations_results = []

# Accuracy Expectations
print("Executing Accuracy Expectations...")

# 1. Transaction Amount Validation
expectation = gx.expectations.ExpectColumnValuesToBeBetween(
    column="transaction_amount", 
    min_value=0,
    mostly=0.95
)

validation_result = batch.validate(expectation)

expectations_results.append({
    'Data Quality Issue': 'Accuracy',
    'Expectation': 'Positive Transaction Amounts',
    'Success': validation_result['success'],
    'Total records': validation_result['result']['element_count'],
    'Unexpected records': validation_result['result']['unexpected_count'],
    'Unexpected percentage': f"{validation_result['result']['unexpected_percent']:.2f}%",
    #'Partial List': validation_result['result']['partial_unexpected_list'],
})
# Summarize the results collected so far (tabulate is imported in the first cell)
df_results = pd.DataFrame(expectations_results)

print(tabulate(df_results, headers='keys', tablefmt='github'))


Executing Accuracy Expectations...



|    | Data Quality Issue   | Expectation                  | Success   |   Total records |   Unexpected records | Unexpected percentage   |
|----|----------------------|------------------------------|-----------|-----------------|----------------------|-------------------------|
|  0 | Accuracy             | Positive Transaction Amounts | True      |            1000 |                   19 | 1.96%                   |
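
The table reports 19 unexpected records out of 1,000, yet the percentage reads 1.96% rather than 1.90%. A likely explanation, stated here as an assumption, is that Great Expectations computes `unexpected_percent` over non-null values only, which would imply roughly 30 null amounts in the column. The sketch below shows how a `mostly` threshold could turn such a count into a pass or fail. `passes_mostly` is a hypothetical helper, not a Great Expectations function, and the 969 non-null rows are an assumed figure chosen to match the reported percentage.

```python
def passes_mostly(unexpected_count: int, non_null_count: int, mostly: float) -> bool:
    """Return True when the share of valid values meets the `mostly` threshold."""
    if non_null_count == 0:
        return True  # nothing to evaluate; treated as passing in this sketch
    valid_share = 1 - unexpected_count / non_null_count
    return valid_share >= mostly


# 19 unexpected values come from the table above; 969 non-null rows is an
# assumed figure chosen because 19 / 969 rounds to the reported 1.96%.
unexpected, non_null = 19, 969

print(f"{unexpected / non_null:.2%}")             # 1.96%
print(passes_mostly(unexpected, non_null, 0.95))  # True
print(passes_mostly(unexpected, non_null, 0.99))  # False
```

Under these assumptions, about 98% of the amounts are valid, so the expectation passes at `mostly=0.95` but would fail at a stricter threshold such as 0.99.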


### 2. Completeness Expectations
- `transaction_type` values must not be missing (null)

In [7]:
# 2. Transaction Type Completeness Validation
expectation = gx.expectations.ExpectColumnValuesToNotBeNull(
    column="transaction_type",
)

print("Executing Completeness Expectations...")

validation_result = batch.validate(expectation)

expectations_results.append({
    'Data Quality Issue': 'Completeness',
    'Expectation': 'Expect transaction_type column values to not be null',
    'Success': validation_result['success'],
    'Total records': validation_result['result']['element_count'],
    'Unexpected records': validation_result['result']['unexpected_count'],
    'Unexpected percentage': f"{validation_result['result']['unexpected_percent']:.2f}%",
    #'Partial List': validation_result['result']['partial_unexpected_list'],
})

# Rebuild the summary table with both expectations
df_results = pd.DataFrame(expectations_results)
print(tabulate(df_results, headers='keys', tablefmt='github'))

Executing Completeness Expectations...



|    | Data Quality Issue   | Expectation                                          | Success   |   Total records |   Unexpected records | Unexpected percentage   |
|----|----------------------|------------------------------------------------------|-----------|-----------------|----------------------|-------------------------|
|  0 | Accuracy             | Positive Transaction Amounts                         | False     |            1000 |                   19 | 1.96%                   |
|  1 | Completeness         | Expect transaction_type column values to not be null | False     |            1000 |                   92 | 9.20%                   |
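
The summary table gives counts but not the offending rows. One way to inspect them is to filter the DataFrame directly with pandas, as in the minimal sketch below. It uses a hypothetical four-row `sample` frame with invented values in place of `df`; the same two filters should apply to the real data, assuming the column names listed at the start of this section.

```python
import pandas as pd

# Hypothetical stand-in for `df`; the values are invented for illustration.
sample = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4],
    "transaction_type": ["deposit", None, "transfer", None],
    "transaction_amount": [250.0, 80.0, -30.0, 410.0],
})

# Rows that would fail the completeness check.
missing_type_rows = sample[sample["transaction_type"].isna()]

# Rows that would fail the accuracy check.
negative_amount_rows = sample[sample["transaction_amount"] < 0]

print(missing_type_rows["transaction_id"].tolist())     # [2, 4]
print(negative_amount_rows["transaction_id"].tolist())  # [3]
```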
