# Milestone 3 - Great Expectations
---
Name: Basyira Sabita  
Batch: HCK-012

In [1]:
# Install the library

# !pip install -q great-expectations

# **Instantiate the Data Context**

In [2]:
# Create a data context

from great_expectations.data_context import FileDataContext

context = FileDataContext.create(project_root_dir='./')

First, we create a file data context in the project root. It stores the project configuration and the expectations on disk.

# **Connect to a Datasource**

In [3]:
# Give a name to a Datasource. This name must be unique between Datasources.
datasource_name = 'csv-cust-shopping-trends'
datasource = context.sources.add_pandas(datasource_name)

# Give a name to a data asset
asset_name = 'cust-shopping-trends'
path_to_data = './dags/P2M3_basyira_sabita_data_clean.csv'
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)

# Build batch request
batch_request = asset.build_batch_request()

Next, we register a datasource and a data asset that point to the CSV file to be inspected, and build a batch request for it.

# **Create an Expectation Suite**

In [4]:
# Create an expectation suite
expectation_suite_name = 'expectation-cust-shopping-dataset'
context.add_or_update_expectation_suite(expectation_suite_name)

# Create a validator using above expectation suite
validator = context.get_validator(
    batch_request = batch_request,
    expectation_suite_name = expectation_suite_name
)

# Check the validator
validator.head()

| Unnamed: 0 | customer_id | age | gender | item_purchased | category | purchase_amount_usd | location | size | color | season | review_rating | subscription_status | shipping_type | discount_applied | promo_code_used | previous_purchases | preferred_payment_method | frequency_of_purchases |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 55 | Male | Blouse | Clothing | 53 | Kentucky | L | Gray | Winter | 3.1 | Yes | Express | Yes | Yes | 14 | Venmo | Fortnightly |
| 1 | 2 | 19 | Male | Sweater | Clothing | 64 | Maine | L | Maroon | Winter | 3.1 | Yes | Express | Yes | Yes | 2 | Cash | Fortnightly |
| 2 | 3 | 50 | Male | Jeans | Clothing | 73 | Massachusetts | S | Maroon | Spring | 3.1 | Yes | Free Shipping | Yes | Yes | 23 | Credit Card | Weekly |
| 3 | 4 | 21 | Male | Sandals | Footwear | 90 | Rhode Island | M | Maroon | Spring | 3.5 | Yes | Next Day Air | Yes | Yes | 49 | PayPal | Weekly |
| 4 | 5 | 45 | Male | Blouse | Clothing | 49 | Oregon | M | Turquoise | Spring | 2.7 | Yes | Free Shipping | Yes | Yes | 31 | PayPal | Annually |


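Here we create an expectation suite, which collects the rules the data must follow, and a validator that checks the batch against that suite. `validator.head()` previews the first five rows.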

## **Expectation 1** - Column `customer_id` must be unique

In [5]:
# Expectation 1 - Column `customer_id` must be unique
validator.expect_column_values_to_be_unique('customer_id')

{
  "success": true,
  "result": {
    "element_count": 3900,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The data needs to have a unique identifier for each row in order to be identifiable and distinct. In this dataset, the customer_id column acts as the unique identifier for each row of data, and therefore its values must be unique. As seen above, the expectation for the column is a success, indicating that the values in the column are indeed unique.
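To make the fields in the result block easier to read, here is a minimal plain-Python sketch of how such a summary could be computed. It is not Great Expectations' implementation: the helper name `uniqueness_summary` is hypothetical, and the sketch assumes that every occurrence of a repeated value counts as unexpected.

```python
from collections import Counter


def uniqueness_summary(values):
    """Hypothetical helper: summarize duplicates in the shape of the result block."""
    counts = Counter(values)
    # Assumption: every occurrence of a repeated value is counted as unexpected.
    unexpected = [v for v in values if counts[v] > 1]
    element_count = len(values)
    percent = 100.0 * len(unexpected) / element_count if element_count else 0.0
    return {
        "success": not unexpected,
        "element_count": element_count,
        "unexpected_count": len(unexpected),
        "unexpected_percent": percent,
    }


# 3900 distinct ids, matching the element_count reported above (ids are illustrative).
clean = uniqueness_summary(list(range(1, 3901)))
# A made-up column in which id 2 appears twice.
dirty = uniqueness_summary([1, 2, 2, 3])

print(clean)
print(dirty)
```

With the made-up `dirty` column, both occurrences of `2` are flagged, so half of the four elements are unexpected.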

## **Expectation 2** - Column `age` values must be between 16 and 99

In [6]:
# Expectation 2 - Column `age` values must be between 16 and 99
validator.expect_column_values_to_be_between(
    column='age', min_value=16, max_value=99
)

{
  "success": true,
  "result": {
    "element_count": 3900,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The `age` column must not contain negative or implausibly large values. To keep the data meaningful in the context of customer shopping trends, we set the minimum `age` to 16 (treated here as the minimum customer age) and the maximum `age` to 99. The expectation succeeds: every value in the `age` column falls within the range of 16 to 99.
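As far as I know, the range check treats both bounds as inclusive by default, so ages of exactly 16 and 99 are accepted. A minimal plain-Python sketch of that logic follows; the helper `out_of_range` is hypothetical and only mirrors the idea, not the library's code.

```python
def out_of_range(values, min_value, max_value):
    """Hypothetical helper: return the values outside [min_value, max_value]."""
    # Both bounds are inclusive, so min_value and max_value themselves pass.
    return [v for v in values if not (min_value <= v <= max_value)]


# Boundary values plus two ages taken from the preview rows.
accepted = out_of_range([16, 19, 55, 99], 16, 99)
# Made-up values just outside the range.
rejected = out_of_range([15, 16, 100], 16, 99)

print(accepted)  # []
print(rejected)  # [15, 100]
```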

## **Expectation 3** - Column `gender` values must be `Male` or `Female`

In [7]:
# Expectation 3 - Column `gender` values must be either `Male` or `Female`
validator.expect_column_values_to_be_in_set(
    column='gender', value_set=['Male', 'Female']
)

{
  "success": true,
  "result": {
    "element_count": 3900,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The `gender` column is expected to contain only `Male` and `Female`, the two groups that make up the brand's target market. The expectation succeeds, which confirms that every value in the column is one of these two labels. On its own, this check does not confirm that both labels actually occur in the data.
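An in-set check and a check that both values are present are different conditions. The plain-Python sketch below illustrates the difference with made-up columns; the helper names are hypothetical. If presence of both values matters, Great Expectations also offers `expect_column_distinct_values_to_contain_set`, which could be added as a separate expectation.

```python
def all_in_set(values, allowed):
    """True when every value belongs to `allowed` (membership only)."""
    return all(v in allowed for v in values)


def contains_all(values, required):
    """True when every member of `required` appears at least once."""
    return set(required) <= set(values)


allowed = {"Male", "Female"}
only_male = ["Male", "Male", "Male"]   # made-up column with a single gender
mixed = ["Male", "Female", "Male"]     # made-up column with both genders

print(all_in_set(only_male, allowed), contains_all(only_male, allowed))  # True False
print(all_in_set(mixed, allowed), contains_all(mixed, allowed))          # True True
```

The single-gender column passes the membership check but fails the presence check, which is why the two should not be read as equivalent.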

## **Expectation 4** - Column `purchase_amount_usd` needs to be numerical

In [8]:
# Expectation 4 - Column `purchase_amount_usd` must consist of numerical values
validator.expect_column_values_to_be_in_type_list(
    column='purchase_amount_usd', type_list=['int', 'float']
)

{
  "success": true,
  "result": {
    "observed_value": "int64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The `purchase_amount_usd` column values are expected to be numerical since they represent monetary amounts in USD. Given that purchase amounts are typically numerical values, it's essential for the column values to be of a numerical data type for accurate processing. Upon inspection, we confirm that all values in the column are numerical (int64), aligning with the expected data type.

## **Expectation 5** - Column `category` distinct values must equal a given set

In [9]:
# Expectation 5 - Column `category` distinct values must equal
# the following set: ['Clothing', 'Footwear', 'Outerwear', 'Accessories']
validator.expect_column_distinct_values_to_equal_set(
    column='category',
    value_set=['Clothing', 'Footwear', 'Outerwear', 'Accessories']
)

{
  "success": true,
  "result": {
    "observed_value": [
      "Accessories",
      "Clothing",
      "Footwear",
      "Outerwear"
    ],
    "details": {
      "value_counts": [
        {
          "value": "Accessories",
          "count": 1240
        },
        {
          "value": "Clothing",
          "count": 1737
        },
        {
          "value": "Footwear",
          "count": 599
        },
        {
          "value": "Outerwear",
          "count": 324
        }
      ]
    }
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The distinct values of the `category` column are expected to match the following set exactly: Clothing, Footwear, Outerwear, and Accessories. This requirement ensures that the categories in the dataset align with the brand's product offerings, which are limited to these four types. The expectation succeeds: the column contains exactly these four values, with no extra category and none missing.
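The set-equality idea can be sketched in plain Python. The column below is rebuilt from the value counts reported in the result block; the extra category `Toys` is a made-up value used only to show a failing case.

```python
from collections import Counter

expected = {"Clothing", "Footwear", "Outerwear", "Accessories"}

# Rebuild a column from the value counts shown in the result above.
category = (
    ["Accessories"] * 1240
    + ["Clothing"] * 1737
    + ["Footwear"] * 599
    + ["Outerwear"] * 324
)

value_counts = Counter(category)
observed = set(value_counts)

print(len(category))         # 3900
print(observed == expected)  # True

# Equality fails in both directions: an extra value or a missing one.
with_extra = observed | {"Toys"}          # hypothetical unexpected category
without_one = observed - {"Outerwear"}    # hypothetical missing category
print(with_extra == expected, without_one == expected)  # False False
```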

## **Expectation 6** - Column `size` value length must be between 1 and 2 characters

In [10]:
# Expectation 6 - Column `size` value length must be between 1 and 2 characters (S - XL only)
validator.expect_column_value_lengths_to_be_between(
    column='size',
    min_value=1,
    max_value=2
)

{
  "success": true,
  "result": {
    "element_count": 3900,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

Given that the company offers item sizes ranging from S to XL only, we need to ensure that the values in the `size` column fall within this range. A simple proxy for this is to check that each value is 1 to 2 characters long. The expectation succeeds: all values in the column meet this criterion.
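A length check is a loose proxy, because any one- or two-character string passes it. The sketch below compares it with a stricter membership check in plain Python. It assumes the allowed sizes are S, M, L, and XL; the invalid labels and helper names are made up for illustration.

```python
def length_between(values, min_len, max_len):
    """True when every value has between min_len and max_len characters."""
    return all(min_len <= len(v) <= max_len for v in values)


def only_allowed(values, allowed):
    """True when every value is one of the allowed labels."""
    return all(v in allowed for v in values)


allowed_sizes = {"S", "M", "L", "XL"}   # assumption: the full S - XL range
valid = ["S", "M", "L", "XL"]
invalid = ["S", "XS", "Q"]              # made-up labels outside the range

print(length_between(valid, 1, 2), only_allowed(valid, allowed_sizes))      # True True
print(length_between(invalid, 1, 2), only_allowed(invalid, allowed_sizes))  # True False
```

If stricter validation is wanted, the same in-set expectation used for `gender` could be applied to `size`.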

## **Expectation 7** - Column `location` must have exactly 50 unique values

In [11]:
# Expectation 7 - Column `location` must have exactly 50 unique values
# to ensure there is data for every state in the US
validator.expect_column_unique_value_count_to_be_between(
    column='location',
    min_value=50,
    max_value=50
)

{
  "success": true,
  "result": {
    "observed_value": 50
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

The `location` column must have exactly 50 unique values to ensure that there is data for every state in the US. Setting both `min_value` and `max_value` to 50 expresses this equality. The expectation succeeds: the observed unique value count for the `location` column is exactly 50.
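A distinct-value count is sensitive to spelling, case, and whitespace variants, so a count of 50 indicates full coverage only when the labels are consistent. The plain-Python sketch below uses made-up dirty values to show how a variant inflates the count; the helper name and the normalization step are illustrative assumptions, not part of the pipeline above.

```python
def distinct_count(values):
    """Number of distinct values in a column."""
    return len(set(values))


# Three real states, plus a made-up variant with different case and a trailing space.
locations = ["Kentucky", "Maine", "Oregon", "oregon ", "Maine"]
raw_count = distinct_count(locations)

# Illustrative normalization: trim whitespace and title-case each label.
normalized = [v.strip().title() for v in locations]
clean_count = distinct_count(normalized)

print(raw_count)    # 4
print(clean_count)  # 3
```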

# **Save the Expectation Suite**

In [22]:
# Save into Expectation Suite

validator.save_expectation_suite(discard_failed_expectations=False)

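Finally, we save the expectation suite to the data context. Setting `discard_failed_expectations=False` keeps every expectation in the suite, including any that failed during validation.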