# 1. Preface

---
```plain
This notebook is designed to leverage Great Expectations for data validation, ensuring that the dataset meets quality standards before analysis. By integrating Great Expectations, we can create data profiles, perform checks to verify data consistency, and generate comprehensive validation reports. This process helps maintain data integrity and prepares it for more reliable downstream tasks.
```
---

# 2. Import Libraries

In [1]:
# Import the file-backed data context class
from great_expectations.data_context import FileDataContext

# 3. Preparing Data for Great Expectations

## 3.1. Instantiate Data Context

In this section, I will create a file-backed data context with `FileDataContext`, which stores the Great Expectations configuration in the current project directory.

In [2]:
# Create a data context
context = FileDataContext.create(project_root_dir='./')

## 3.2. Connect to A Datasource

In this section, I will assign unique names to the Datasource and the Data Asset. The Datasource is a pandas Datasource registered for the sales data, and the Data Asset points to the CSV file `data/clean_data.csv` within that Datasource. Afterward, I will build a batch request that Great Expectations uses to load the data.

In [3]:
# Give a name to a Datasource. This name must be unique between Datasources.
datasource_name = 'csv-data-sales'
datasource = context.sources.add_pandas(datasource_name)

# Give a name to a data asset
asset_name = 'sales'
path_to_data = 'data/clean_data.csv'
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)

# Build batch request
batch_request = asset.build_batch_request()

## 3.3. Create an Expectation Suite

In this section, I will create an expectation suite named `expectation-sales-dataset` to store the data validation rules. I will then create a validator that applies this suite to the data retrieved by the batch request created earlier.

In [4]:
# Create an expectation suite
expectation_suite_name = 'expectation-sales-dataset'
context.add_or_update_expectation_suite(expectation_suite_name)

# Create a validator using above expectation suite
validator = context.get_validator(
    batch_request = batch_request,
    expectation_suite_name = expectation_suite_name
)

# Check the validator
validator.head()

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

| | transaction_id | customer_id | age | gender | loyalty_member | product_type | sku | rating | order_status | payment_method | total_price | unit_price | quantity | purchase_date | shipping_type | add_ons_purchased | add_on_total | revenue |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1000 | 53 | Male | No | Smartphone | SKU1004 | 2 | Cancelled | Credit Card | 5538.33 | 791.19 | 7 | 2024-03-20 | Standard | Add ons | 40.21 | 5578.54 |
| 1 | 2 | 1000 | 53 | Male | No | Tablet | SKU1002 | 3 | Completed | Paypal | 741.09 | 247.03 | 3 | 2024-04-20 | Overnight | Add ons | 26.09 | 767.18 |
| 2 | 3 | 1002 | 41 | Male | No | Laptop | SKU1005 | 3 | Completed | Credit Card | 1855.84 | 463.96 | 4 | 2023-10-17 | Express | No Add ons | 0.0 | 1855.84 |
| 3 | 4 | 1002 | 41 | Male | Yes | Smartphone | SKU1004 | 2 | Completed | Cash | 3164.76 | 791.19 | 4 | 2024-08-09 | Overnight | Add ons | 60.16 | 3224.92 |
| 4 | 5 | 1003 | 75 | Male | Yes | Smartphone | SKU1001 | 5 | Completed | Cash | 41.5 | 20.75 | 2 | 2024-05-21 | Express | Add ons | 35.56 | 77.06 |


# 4. Expectation

## 4.1. To be unique

The `transaction_id` column must have a unique value for each transaction. By using the expectation `expect_column_values_to_be_unique`, we can validate that no value in this column is repeated, which prevents duplicated transactions from entering the analysis.

In [5]:
validator.expect_column_values_to_be_unique('transaction_id')

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 20000,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The `transaction_id` column meets the expectation that each data entry is unique, with no duplicate data present.
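
To make the fields in the result above more concrete, here is a minimal plain-Python sketch of what a uniqueness check counts. It is an illustration only, not Great Expectations' implementation: the helper name `uniqueness_report` and the sample IDs are hypothetical, and the sketch assumes that every occurrence of a repeated value counts as unexpected.

```python
from collections import Counter


def uniqueness_report(values):
    """Summarize duplicates in a list of values (hypothetical helper)."""
    counts = Counter(values)
    # Assumption: every occurrence of a repeated value is flagged as unexpected.
    unexpected = [v for v in values if counts[v] > 1]
    percent = 100.0 * len(unexpected) / len(values) if values else 0.0
    return {
        "success": not unexpected,
        "element_count": len(values),
        "unexpected_count": len(unexpected),
        "unexpected_percent": percent,
    }


# Hypothetical sample IDs: one clean list and one with repeated values.
clean = uniqueness_report([1, 2, 3, 4, 5])
dirty = uniqueness_report([1, 2, 2, 3, 3])
print(clean)
print(dirty)
```

In the second call, four of the five entries take part in a duplicate, so the check fails even though only two distinct values are repeated.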

## 4.2. To be between

The `age` column must contain values between 18 and 80. The dataset covers sales of electronic goods, so ages outside this range are treated as unrealistic or irrelevant entries, for example children or customers older than 80, who rarely purchase electronic items. By using the expectation `expect_column_values_to_be_between`, we can validate that every value falls within this range and flag any that do not.

In [6]:
validator.expect_column_values_to_be_between(column='age', min_value=18, max_value=80)

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 20000,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The `age` column meets the expectation that all data values must fall within the range of 18 to 80.
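
The range check can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions, not the library's code: the helper `between_report` is hypothetical, both bounds are treated as inclusive, and missing values (`None`) are counted separately instead of being flagged. The ages 53, 41 and 75 come from the preview above, while 17, 81 and the missing entry are made up to show a failing case.

```python
def between_report(values, min_value, max_value):
    """Check that every non-missing value lies in [min_value, max_value]."""
    present = [v for v in values if v is not None]
    # Assumption: both bounds are inclusive.
    unexpected = [v for v in present if not (min_value <= v <= max_value)]
    return {
        "success": not unexpected,
        "element_count": len(values),
        "missing_count": len(values) - len(present),
        "unexpected_count": len(unexpected),
        "partial_unexpected_list": unexpected[:20],
    }


report = between_report([53, 41, 75, 17, None, 81], min_value=18, max_value=80)
print(report)
```

Listing the offending values, as `partial_unexpected_list` does in the real output, is what makes a failed check actionable.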

## 4.3. To be in set

The `product_type` column must contain values from a specific set, namely 'Smartphone', 'Tablet', 'Laptop', 'Smartwatch', 'Headphones', as the store only sells these products. By using `expect_column_values_to_be_in_set`, we can ensure that the column contains only product types that match the categories being sold, thereby preventing misclassification or irrelevant data.

In [7]:
validator.expect_column_values_to_be_in_set(column='product_type', value_set=['Smartphone', 'Tablet', 'Laptop', 'Smartwatch', 'Headphones'])

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 20000,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The `product_type` column meets expectations, with all values falling within the specified set.
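
A set-membership check compares values exactly, so differences in case or spacing count as separate categories. The sketch below illustrates this with a hypothetical helper, `unexpected_categories`, and made-up sample values. It is not how Great Expectations computes the result internally.

```python
ALLOWED = {"Smartphone", "Tablet", "Laptop", "Smartwatch", "Headphones"}


def unexpected_categories(values, allowed):
    """Return the distinct values not in `allowed`, sorted for stable output."""
    return sorted(set(values) - set(allowed))


# Hypothetical sample: "laptop" and "Smart watch" differ from the allowed
# spellings only in case or spacing, yet they are still reported.
sample = ["Smartphone", "Tablet", "laptop", "Laptop", "Smart watch"]
print(unexpected_categories(sample, ALLOWED))
```

If a check like this fails on near-matches, normalizing the column during cleaning is usually a better fix than widening the allowed set.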

## 4.4. To be in type list

The `total_price` column must have the data type 'float64' to ensure accurate price calculations and analysis. By using `expect_column_values_to_be_in_type_list`, we can ensure that the column has the appropriate data type, preventing errors in the calculation and further analysis processes.

In [10]:
validator.expect_column_values_to_be_in_type_list(column='total_price', type_list=['float64'])

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "observed_value": "float64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The `total_price` column meets expectations, with the data type being 'float64'.

## 4.5. To match regex

The `purchase_date` column must have a consistent date format of YYYY-MM-DD to maintain data consistency and prevent incorrect interpretation or invalid date entries. By using `expect_column_values_to_match_regex`, we can ensure that each value in the column follows the correct date format, thereby avoiding data inconsistencies that could interfere with analysis and reporting.

In [11]:
# Anchor the pattern so the whole value must match YYYY-MM-DD
validator.expect_column_values_to_match_regex(column="purchase_date", regex=r"^\d{4}-\d{2}-\d{2}$")

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 20000,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The `purchase_date` column meets expectations, with the date format being consistent as YYYY-MM-DD.
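
A digit pattern of this kind checks only the shape of the string. Depending on whether the pattern is applied as a search or as a full match, it may also accept values with extra characters around the date, and it does not reject impossible dates such as month 13. The sketch below shows the difference using Python's `re` and `datetime` modules. The helper names and sample strings are hypothetical, and the stricter calendar check is an optional extra, not something the notebook performs.

```python
import re
from datetime import datetime

SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")


def has_date_shape(value):
    """True if the whole string has the form YYYY-MM-DD (digits only)."""
    return SHAPE.fullmatch(value) is not None


def is_calendar_date(value):
    """True if the string has the right shape and is a real calendar date."""
    if not has_date_shape(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False
    return True


for candidate in ["2024-03-20", "2024-13-45", "x2024-03-20y"]:
    print(
        candidate,
        SHAPE.search(candidate) is not None,  # unanchored search
        has_date_shape(candidate),            # full-string match
        is_calendar_date(candidate),          # shape plus calendar validity
    )
```

Only the first candidate passes all three checks. The second has the right shape but is not a real date, and the third passes only the unanchored search.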

## 4.6. Unique value count to be between

The `payment_method` column must have a number of unique values $\leq$ 6, as the store only accepts a maximum of 6 different payment methods. By using `expect_column_unique_value_count_to_be_between`, we can validate that the number of payment types listed in the column does not exceed the specified limit, ensuring the data remains consistent with the store's policy and preventing invalid entries.

In [12]:
validator.expect_column_unique_value_count_to_be_between(column='payment_method', min_value=1, max_value=6)

Calculating Metrics:   0%|          | 0/4 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "observed_value": 6
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The `payment_method` column meets expectations, with the number of unique values being $\leq$ 6.
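
This check bounds how many distinct values appear, not which ones. The following sketch, with a hypothetical helper `distinct_value_report` and partly made-up sample values, shows how spelling variants inflate the count. It illustrates the idea and is not the library's implementation.

```python
from collections import Counter


def distinct_value_report(values, min_value, max_value):
    """Count distinct values and compare the count with inclusive bounds."""
    counts = Counter(values)
    observed = len(counts)
    return {
        "success": min_value <= observed <= max_value,
        "observed_value": observed,
        "value_counts": dict(counts),
    }


# "Paypal" and "PayPal" are counted as two different payment methods.
sample = ["Credit Card", "Paypal", "Credit Card", "Cash", "PayPal"]
report = distinct_value_report(sample, min_value=1, max_value=6)
print(report)
```

Inspecting the per-value counts alongside the total helps spot such variants even when the count stays within the limit.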

## 4.7. Table column count to equal

The dataset must have exactly 18 columns so that its structure matches the predefined format. This keeps the data compatible with downstream analysis or system integration and catches missing or unexpected columns. By using `expect_table_column_count_to_equal`, we can ensure that the number of columns in the dataset matches the specified value.

In [13]:
validator.expect_table_column_count_to_equal(value=18)

Calculating Metrics:   0%|          | 0/3 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "observed_value": 18
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The number of columns in the dataset meets expectations, with a total of 18 columns.
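
As a quick cross-check outside Great Expectations, the column count can be read from a CSV header with the standard `csv` module. In the sketch below, the header string is assembled from the column names shown in the `validator.head()` preview, on the assumption that the unnamed leading column there is the row index and not a data column. The helper `column_count` is hypothetical.

```python
import csv
import io

# Column names as shown in the preview, excluding the row index.
HEADER = (
    "transaction_id,customer_id,age,gender,loyalty_member,product_type,sku,"
    "rating,order_status,payment_method,total_price,unit_price,quantity,"
    "purchase_date,shipping_type,add_ons_purchased,add_on_total,revenue"
)


def column_count(csv_text):
    """Return the number of fields in the first row of CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    return len(next(reader))


print(column_count(HEADER))  # 18
```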

# 5. Saving Great Expectations Results

## 5.1. Save into Expectation Suite

In this section, I will save the expectations defined above to the Expectation Suite. Passing `discard_failed_expectations=False` keeps every expectation in the suite, including any that failed during validation.

In [14]:
# Save into Expectation Suite
validator.save_expectation_suite(discard_failed_expectations=False)

## 5.2. Create a Checkpoint

In this section, I will create a checkpoint named `checkpoint_1` from the validator created earlier, so that the same validation can be rerun later. Afterward, I will run this checkpoint to verify whether the data meets the defined expectations.

In [15]:
# Create a checkpoint
checkpoint_1 = context.add_or_update_checkpoint(
    name = 'checkpoint_1',
    validator = validator,
)

In [16]:
# Run a checkpoint
checkpoint_result = checkpoint_1.run()

Calculating Metrics:   0%|          | 0/33 [00:00<?, ?it/s]
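
Conceptually, a checkpoint run evaluates every expectation in the suite, and the run passes only if all of them pass. The sketch below models that aggregation with plain dictionaries shaped like the JSON results shown earlier. The `summarize` helper and the failing entry are hypothetical, and the sketch does not use the actual result object returned by `checkpoint_1.run()`.

```python
def summarize(results):
    """Aggregate per-expectation results keyed by expectation name."""
    failed = [name for name, result in results.items() if not result["success"]]
    return {
        "success": not failed,
        "evaluated": len(results),
        "failed": failed,
    }


# Two results mirror the notebook; the third is a made-up failure.
results = {
    "expect_column_values_to_be_unique": {"success": True},
    "expect_table_column_count_to_equal": {"success": True},
    "hypothetical_failing_expectation": {"success": False},
}
summary = summarize(results)
print(summary)
```

A single failing expectation is enough to mark the whole run as unsuccessful, which is why it helps to report the names of the failures along with the overall status.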

## 5.3. Create Data Docs

In this section, I will build Data Docs, the documentation that Great Expectations generates from the validation results. The resulting report lists all the expectations applied and the validation status of each one.

In [None]:
# Build data docs
context.build_data_docs()