# Great Expectations Validation  

---

Author: Ayudha Amari Hirtranusi  

Objective: This notebook demonstrates how to use `Great Expectations` to validate the quality of the **Amazon Prime userbase dataset**. We create a suite of expectations for the dataset and validate the data against them.

---

## Import Libraries

In this section, the necessary libraries for working with Great Expectations are imported. The `great_expectations` library is imported as `gx`, which provides the core functionality for defining and validating expectations on the dataset. Additionally, the `FileDataContext` class is imported from `great_expectations.data_context`, which allows for the creation and management of the Great Expectations context.

In [1]:
# Importing the necessary libraries
import great_expectations as gx
from great_expectations.data_context import FileDataContext

In [2]:
# Checking the version of Great Expectations
print(gx.__version__)

0.18.9


# Connect to a Datasource


To connect to the Amazon Prime userbase dataset, a data source is created using the Great Expectations context. The `FileDataContext.create()` method is used to initialize the context, specifying the project root directory. Then, a unique name is given to the data source using the `datasource_name` variable. The `context.sources.add_pandas()` method is called to add the data source, indicating that it is a Pandas data source. Finally, the dataset is added as a data asset using the `datasource.add_csv_asset()` method, providing the asset name and the file path to the CSV file containing the dataset.

In [3]:
# Initialize the Great Expectations context
context = FileDataContext.create(project_root_dir='./')

In [4]:
# Give a name to a Datasource. This name must be unique between Datasources.
datasource_name = 'amazon_prime_userbase_data_m3.csv'
datasource = context.sources.add_pandas(datasource_name)

# Give a name to a data asset
asset_name = 'amazon_prime_userbase'
path_to_data = 'P2M3_Ayudha_Amari_data_clean.csv'
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)

# Build batch request
batch_request = asset.build_batch_request()

# Create an Expectation Suite


An expectation suite is created to define a set of expectations for the Amazon Prime userbase dataset. The `context.add_or_update_expectation_suite()` method is used to create a new expectation suite with a specified name. A validator is then created using the `context.get_validator()` method, which takes the batch request and the expectation suite name as parameters. The validator allows for the validation of the dataset against the defined expectations. The `validator.head()` method is called to preview the first few rows of the dataset and ensure that the validator is set up correctly.

In [5]:
# Create an expectation suite
expectation_suite_name = 'expectation-amazon_prime_userbase'
context.add_or_update_expectation_suite(expectation_suite_name)

# Create a validator using the expectation suite above
validator = context.get_validator(
    batch_request = batch_request,
    expectation_suite_name = expectation_suite_name
)

# Check the validator
validator.head()

Unnamed: 0,user_id,name,email_address,username,date_of_birth,gender,location,membership_start_date,membership_end_date,subscription_plan,payment_information,renewal_status,usage_frequency,purchase_history,favorite_genres,devices_used,engagement_metrics,feedbackratings,customer_support_interactions
0,1,Ronald Murphy,williamholland@example.com,williamholland,6/3/1953,Male,Rebeccachester,1/15/2024,1/14/2025,Annual,Mastercard,Manual,Regular,Electronics,Documentary,Smart TV,Medium,3.6,3
1,2,Scott Allen,scott22@example.org,scott22,7/8/1978,Male,Mcphersonview,1/7/2024,1/6/2025,Monthly,Visa,Manual,Regular,Electronics,Horror,Smartphone,Medium,3.8,7
2,3,Jonathan Parrish,brooke16@example.org,brooke16,12/6/1994,Female,Youngfort,4/13/2024,4/13/2025,Monthly,Mastercard,Manual,Regular,Books,Comedy,Smart TV,Low,3.3,8
3,4,Megan Williams,elizabeth31@example.net,elizabeth31,12/22/1964,Female,Feliciashire,1/24/2024,1/23/2025,Monthly,Amex,Auto-renew,Regular,Electronics,Documentary,Smart TV,High,3.3,7
4,5,Kathryn Brown,pattersonalexandra@example.org,pattersonalexandra,6/4/1961,Male,Port Deborah,2/14/2024,2/13/2025,Annual,Visa,Auto-renew,Frequent,Clothing,Drama,Smart TV,Low,4.3,1


## Expectations

## Expectation to be unique

In [6]:
# 1. Expect column `user_id` to be unique

validator.expect_column_values_to_be_unique('user_id')

{
  "success": true,
  "result": {
    "element_count": 2500,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The validation is done to ensure that each user in the dataset has a unique identifier. This is crucial for maintaining data integrity and preventing duplication or confusion between users.

By verifying that all `user_id` values are unique, we can be confident that each row in the dataset represents a distinct user. This allows for accurate analysis and segmentation of the user base, as well as personalized interactions and recommendations based on individual user behavior and preferences.

Unique user IDs also facilitate tracking user activity across different systems or databases, enabling a holistic view of each user's journey and engagement with the Amazon Prime platform.
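
As a rough illustration of what a uniqueness check looks for, the sketch below counts repeated values with the standard library. It is not how Great Expectations implements the expectation; the helper name `find_duplicates` and the sample values are hypothetical.

```python
from collections import Counter


def find_duplicates(values):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(values)
    return [value for value, n in counts.items() if n > 1]


# Hypothetical sample values, not taken from the dataset
sample_user_ids = [1, 2, 3, 4, 5]
sample_with_repeats = [1, 2, 2, 3, 3, 3]

# A column passes the uniqueness idea when no value repeats
unique_ok = len(find_duplicates(sample_user_ids)) == 0
duplicates = find_duplicates(sample_with_repeats)

print(unique_ok)
print(duplicates)
```

An empty duplicates list corresponds to the `"unexpected_count": 0` reported in the result above.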

## Expectation to match regex

In [7]:
# 2. Expect 'email_address' to match email format
validator.expect_column_values_to_match_regex('email_address', r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')


{
  "success": true,
  "result": {
    "element_count": 2500,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The validation is done to ensure that all email addresses in the dataset follow a valid format. This helps maintain data quality and prevents issues arising from incorrect or malformed email addresses.

A valid email address format typically includes a username, followed by the "@" symbol, a domain name, and a top-level domain (e.g., .com, .org). By using a regular expression to match this format, we can identify and flag any email addresses that do not conform to the expected structure.

Having accurate email addresses is essential for communication with users, such as sending important notifications, promotional offers, or account-related information. Invalid email addresses can lead to bounced emails, undelivered messages, and a poor user experience.
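
To see what the pattern accepts and rejects, the sketch below applies the same regular expression with Python's `re` module. The first two addresses come from the preview rows shown earlier; the malformed addresses and the helper `is_valid_email` are hypothetical examples added for illustration.

```python
import re

# Same pattern as in the expectation above
EMAIL_PATTERN = re.compile(r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$')


def is_valid_email(value):
    """Return True when the whole string matches the email pattern."""
    return EMAIL_PATTERN.match(value) is not None


# Addresses taken from the dataset preview
well_formed = ['williamholland@example.com', 'scott22@example.org']

# Hypothetical malformed addresses
malformed = [
    'no-at-sign.example.com',   # missing "@"
    'user@domain',              # missing top-level domain
    'user@@example.com',        # two "@" symbols
    'user name@example.com',    # space in the username
]

well_formed_results = [is_valid_email(v) for v in well_formed]
malformed_results = [is_valid_email(v) for v in malformed]

print(well_formed_results)
print(malformed_results)
```

Note that the pattern only checks the shape of the address; it cannot tell whether the mailbox actually exists.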

## Expectation to match date time format

In [8]:
# 3. Expect 'date_of_birth' to be in a specific date format
validator.expect_column_values_to_match_strftime_format('date_of_birth', '%Y-%m-%d')

{
  "success": false,
  "result": {
    "element_count": 2500,
    "unexpected_count": 2500,
    "unexpected_percent": 100.0,
    "partial_unexpected_list": [
      "6/3/1953",
      "7/8/1978",
      "12/6/1994",
      "12/22/1964",
      "6/4/1961",
      "9/19/1954",
      "2/9/2003",
      "10/4/1946",
      "12/24/1950",
      "6/16/1963",
      "11/28/1997",
      "9/8/1952",
      "11/18/1981",
      "8/17/1989",
      "4/7/2003",
      "3/19/1993",
      "5/2/1952",
      "11/16/1978",
      "8/30/1945",
      "9/29/1979"
    ],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 100.0,
    "unexpected_percent_nonmissing": 100.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

This validation checks that all date of birth values in the dataset follow a consistent format. A standardized date format is important for accurate data processing, analysis, and reporting.

The expected format is 'YYYY-MM-DD' (`%Y-%m-%d`): year, month, and day, separated by hyphens. This format is widely used and easily parsed by programming languages and data analysis tools. The expectation fails here: all 2,500 values are reported as unexpected, because the column stores dates as month/day/year separated by slashes and without zero padding (for example, `6/3/1953`). The column is internally consistent, but it does not match the expected format. Either the values should be converted to 'YYYY-MM-DD' during cleaning, or the expectation should be changed to match the stored format.

Consistent date formatting allows for efficient filtering, sorting, and aggregation of user data based on age or birth date. It also helps prevent errors that may arise from mixed date formats across the dataset.
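
The sketch below uses `datetime.strptime` from the standard library to show why values such as `6/3/1953` do not match `%Y-%m-%d`. The alternative format `%m/%d/%Y` is an assumption inferred from the sample values in the result above (for instance, `12/22/1964` can only be month/day/year); the helper `matches_format` is hypothetical.

```python
from datetime import datetime


def matches_format(value, fmt):
    """Return True when `value` can be parsed with the strftime format `fmt`."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


# Values copied from the partial_unexpected_list above
samples = ['6/3/1953', '7/8/1978', '12/22/1964']

# The format used in the expectation
iso_matches = [matches_format(v, '%Y-%m-%d') for v in samples]

# Assumed actual format of the column: month/day/year
us_matches = [matches_format(v, '%m/%d/%Y') for v in samples]

# One possible fix: convert the values to YYYY-MM-DD
iso_dates = [
    datetime.strptime(v, '%m/%d/%Y').strftime('%Y-%m-%d') for v in samples
]

print(iso_matches)
print(us_matches)
print(iso_dates)
```

If the month/day/year assumption holds for the whole column, either converting the values as in `iso_dates` or validating against `%m/%d/%Y` would bring the data and the expectation into agreement.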

## Expectation to be in set

In [9]:
# 4. Expect 'gender' to be either 'Male' or 'Female'
validator.expect_column_values_to_be_in_set('gender', ['Male', 'Female'])

{
  "success": true,
  "result": {
    "element_count": 2500,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The validation is done to ensure that the 'gender' column only contains two possible values: 'Male' or 'Female'. This helps maintain data consistency and avoids any unexpected or invalid gender values.

By restricting the gender values to a predefined set, we can perform accurate demographic analysis and segmentation based on gender. This information can be valuable for understanding user preferences, tailoring marketing strategies, and ensuring fair representation and inclusion.

Having a clear and consistent representation of gender also simplifies data processing and reduces the risk of errors or inconsistencies that may arise from free-form or ambiguous gender values.
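
A set-membership check can be sketched in a few lines of plain Python. The sample values and the helper `unexpected_values` are hypothetical; the point is that the comparison is exact, so variants such as `'male'` or `'F'` would be flagged.

```python
# Allowed values, as in the expectation above
ALLOWED_GENDERS = {'Male', 'Female'}


def unexpected_values(values, allowed):
    """Return the values that are not members of the allowed set."""
    return [v for v in values if v not in allowed]


# Hypothetical sample containing two non-conforming spellings
sample = ['Male', 'Female', 'male', 'F', 'Female']
unexpected = unexpected_values(sample, ALLOWED_GENDERS)

print(unexpected)
```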

## Expectation to be between

In [10]:
# 5. Expect 'customer_support_interactions' to be between 0 and 10
validator.expect_column_values_to_be_between(
    'customer_support_interactions', 
    min_value=0, 
    max_value=10
)

{
  "success": true,
  "result": {
    "element_count": 2500,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The validation is done to ensure that the number of customer support interactions for each user falls within a reasonable range of 0 to 10. This helps identify any outliers or anomalies in the data that may require further investigation.

Setting a lower limit of 0 ensures that the dataset does not include any negative values, which would be invalid for customer support interactions. The upper limit of 10 is set based on the assumption that a user is unlikely to have an exceptionally high number of support interactions within the given timeframe.

By validating the range of customer support interactions, we can gain insights into the typical support needs of users and identify any users who may be experiencing recurring issues or requiring additional assistance. This information can help optimize support resources and improve the overall customer experience.
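
The range check can be illustrated with a small standard-library sketch. The bounds are treated as inclusive here, so 0 and 10 themselves are accepted; the sample values and the helper `out_of_range` are hypothetical.

```python
def out_of_range(values, min_value, max_value):
    """Return the values that fall outside the inclusive range."""
    return [v for v in values if not (min_value <= v <= max_value)]


# Hypothetical interaction counts, including two invalid ones
sample = [3, 7, 8, 0, 10, -1, 11]
flagged = out_of_range(sample, min_value=0, max_value=10)

print(flagged)
```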

## Expectation to not be null

In [11]:
# 6. Expectation for column values to not be null
validator.expect_column_values_to_not_be_null('name')

{
  "success": true,
  "result": {
    "element_count": 2500,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The validation is done to ensure that all users in the dataset have a non-null value in the 'name' column. This helps maintain data completeness and avoids any missing or undefined user names.

Having a complete set of user names is important for personalized communication, addressing users appropriately, and providing a seamless user experience. Null or missing names can lead to confusion, incorrect salutations, or difficulties in identifying and assisting specific users.

By validating that all name values are present, we can ensure that each user in the dataset has a valid and accessible name, enabling effective user management and enhancing the overall data quality.
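
As a minimal sketch of what counts as missing, the code below treats `None` and floating-point NaN as null. This mirrors the usual pandas convention but is an assumption, not the library's implementation; the sample list and the helper `is_missing` are hypothetical. An empty string is not treated as null here, so blank names would need a separate check.

```python
import math


def is_missing(value):
    """Treat None and floating-point NaN as missing values."""
    if value is None:
        return True
    return isinstance(value, float) and math.isnan(value)


# Hypothetical sample: two real names, two missing entries, one empty string
names = ['Ronald Murphy', None, float('nan'), '', 'Scott Allen']
missing_positions = [i for i, v in enumerate(names) if is_missing(v)]

print(missing_positions)
```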

## Expectation to be in type list

In [12]:
# 7. Expectation for feedback ratings to be in a specific type
validator.expect_column_values_to_be_in_type_list('feedbackratings', ['FLOAT', 'float', 'DECIMAL'])

{
  "success": true,
  "result": {
    "observed_value": "float64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

This validation checks that the 'feedbackratings' column has one of the listed data types: 'FLOAT', 'float', or 'DECIMAL'. The observed type is `float64`, which satisfies the expectation. This helps maintain data consistency and avoids unexpected or incompatible data types.

Feedback ratings are typically represented as numeric values, often in the form of floating-point numbers or decimals. By restricting the data types to a predefined list, we can ensure that all feedback ratings are stored in a compatible format that can be easily processed and analyzed.

Having consistent data types for feedback ratings allows for accurate calculations, such as averaging ratings, determining user satisfaction levels, and comparing ratings across different products or services. It also prevents any data type mismatches or errors that may occur during data manipulation or analysis.

# Validate Data

In [13]:
# Run the validation
results = validator.validate()

# Print the results
print(results)

{
  "evaluation_parameters": {},
  "success": false,
  "statistics": {
    "evaluated_expectations": 7,
    "successful_expectations": 6,
    "unsuccessful_expectations": 1,
    "success_percent": 85.71428571428571
  },
  "results": [
    {
      "success": true,
      "result": {
        "element_count": 2500,
        "unexpected_count": 0,
        "unexpected_percent": 0.0,
        "partial_unexpected_list": [],
        "missing_count": 0,
        "missing_percent": 0.0,
        "unexpected_percent_total": 0.0,
        "unexpected_percent_nonmissing": 0.0
      },
      "expectation_config": {
        "kwargs": {
          "column": "user_id",
          "batch_id": "amazon_prime_userbase_data_m3.csv-amazon_prime_userbase"
        },
        "expectation_type": "expect_column_values_to_be_unique",
        "meta": {}
      },
      "meta": {},
      "exception_info": {
        "raised_exception": false,
        "exception_traceback": null,
        "exception_message": null
      }
    

In [14]:
# check results statistics
print(results.statistics)

{'evaluated_expectations': 7, 'successful_expectations': 6, 'unsuccessful_expectations': 1, 'success_percent': 85.71428571428571}


The results statistics summarize the validation run on the Amazon Prime userbase dataset. The key metrics are:

1. **Evaluated Expectations**: The `evaluated_expectations` metric is the total number of expectations evaluated during validation. In this case, 7 expectations were evaluated, covering uniqueness, format, set membership, range, completeness, and data type.

2. **Successful Expectations**: The `successful_expectations` metric is the number of expectations that passed. Here, 6 of the 7 expectations passed.

3. **Unsuccessful Expectations**: The `unsuccessful_expectations` metric is the number of expectations that failed. Here, 1 expectation failed: the date format check on `date_of_birth`, where all 2,500 values use a month/day/year format such as `6/3/1953` instead of the expected `%Y-%m-%d`.

4. **Success Percent**: The `success_percent` metric is the percentage of evaluated expectations that passed. Here it is about 85.71% (6 of 7), so the overall validation result is `"success": false`.

The statistics show that the dataset meets the expectations on user IDs, email addresses, gender values, customer support interactions, names, and feedback ratings, but does not meet the expected date format.

This suggests that the dataset is largely clean, with one open issue: the `date_of_birth` values should be converted to 'YYYY-MM-DD', or the expectation should be updated to match the format actually stored.

Overall, the dataset passes most checks, but it should not be treated as fully validated until the date format mismatch is resolved.
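
The success percentage can be recomputed from the counts as a quick sanity check. The dictionary below copies the values printed by `results.statistics` above; the variable names `recomputed` and `all_passed` are illustrative.

```python
# Values copied from the printed results.statistics output above
statistics = {
    'evaluated_expectations': 7,
    'successful_expectations': 6,
    'unsuccessful_expectations': 1,
    'success_percent': 85.71428571428571,
}

# Share of evaluated expectations that passed, as a percentage
recomputed = (
    100
    * statistics['successful_expectations']
    / statistics['evaluated_expectations']
)

# The run only counts as fully successful when nothing failed
all_passed = statistics['unsuccessful_expectations'] == 0

print(f"{recomputed:.2f}% of expectations passed; all passed: {all_passed}")
```

Six passing expectations out of seven gives roughly 85.71%, which matches the reported `success_percent` and confirms that one expectation failed.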

# Checkpoint

A checkpoint in Great Expectations is a reusable validation point that can be run repeatedly to check data quality and consistency over time. Here, the expectation suite is first saved with `validator.save_expectation_suite()`, keeping failed expectations (`discard_failed_expectations=False`). A checkpoint named `checkpoint_1` is then created from the validator, and running it (`checkpoint_1.run()`) validates the data against the saved expectations.

This checkpoint serves as a safeguard for regularly verifying the integrity of the Amazon Prime userbase dataset as it evolves or is updated. By incorporating checkpoints into the data pipeline, deviations or anomalies can be identified and addressed quickly, so that the dataset remains reliable for ongoing analysis and decision-making.

In [15]:
# Save into Expectation Suite
validator.save_expectation_suite(discard_failed_expectations=False)

In [16]:
# Create a checkpoint
checkpoint_1 = context.add_or_update_checkpoint(
    name = 'checkpoint_1',
    validator = validator,
)

In [17]:
# Run a checkpoint
checkpoint_result = checkpoint_1.run()

# Data Docs

Data Docs is a powerful feature in Great Expectations that automatically generates interactive documentation for our data validation results. In this case, the `context.build_data_docs()` function was used to build the data docs for the Amazon Prime userbase dataset.

The generated data docs provide an overview of the dataset's validation results, including the expectation suite, validation statistics, and detailed information about each expectation. They offer a user-friendly interface for exploring the validation outcomes.

In [18]:
# Build data docs

context.build_data_docs()

{'local_site': 'file://c:\\COLLEGE\\HACKTIV8\\phase2\\official\\MILESTONE3\\p2-ftds007-bsd-m3-ayudhaamari\\gx\\uncommitted/data_docs/local_site/index.html'}

By building data docs, stakeholders and team members can easily access and review the dataset's quality and integrity without needing to dive into the technical details of the validation process. The interactive nature of data docs allows users to filter and search for specific expectations, view the validation results for each expectation, and gain insights into any failures or anomalies.

Data docs serve as a centralized documentation hub, promoting collaboration and transparency among team members. They support a shared understanding of the dataset's quality, enabling informed decision-making and trust in the data.

Moreover, data docs can be easily shared and distributed, making it convenient for stakeholders to access the validation results and stay informed about the dataset's health. They give a clear and concise view of the dataset's quality, including any failed expectations that still need attention.

In summary, building data docs using Great Expectations enhances the visibility and accessibility of data validation results, promoting collaboration, transparency, and trust in the dataset's quality.