# 1. Introduction

Name: Ghassani Nurbaningtyas

Batch: SBY-003

After cleaning the data, we validate it using Great Expectations. In this notebook, I validate the clean dataset against 7 Expectations, and all of them return `"success": true`.

Great Expectations helps us ensure that our data is accurate, consistent, and compliant with our business rules. How does Great Expectations work? First, we define expectations about our data (values must be unique, must be in a type list, etc.). Then Great Expectations validates our data against those expectations: it checks whether the data meets the specified criteria and generates a report highlighting any issues. We can use that report to fix our data quality problems.
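To make the define, validate, report cycle concrete, here is a minimal sketch in plain pandas. It is not how Great Expectations is implemented; the toy data, the `expectations` list, and the report fields are assumptions made up for illustration.

```python
import pandas as pd

# Toy data (made up): customer_id 2 is deliberately repeated.
toy = pd.DataFrame({"customer_id": [1, 2, 2], "review_rating": [3.1, 4.5, 2.7]})

# Step 1: define expectations as (description, row-level check) pairs.
expectations = [
    ("customer_id is unique", lambda f: ~f["customer_id"].duplicated(keep=False)),
    ("review_rating is between 0 and 5", lambda f: f["review_rating"].between(0, 5)),
]

# Step 2: validate the data against each expectation.
# Step 3: collect the outcome in a small report.
report = []
for description, check in expectations:
    passed = check(toy)  # boolean Series, one entry per row
    report.append({
        "expectation": description,
        "success": bool(passed.all()),
        "unexpected_count": int((~passed).sum()),
    })

for entry in report:
    print(entry)
```

The first entry fails because two rows share the same `customer_id`, and the second passes. This mirrors the `success` and `unexpected_count` fields that appear in the validation results later in this notebook.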

Benefits of using Great Expectations:
- Improved data quality
By identifying and fixing data quality issues, we can improve the accuracy and reliability of our data.
- Improved productivity
Great Expectations can automate the data validation process.
- Reduced cost
Data quality problems can lead to costly errors and delays. Great Expectations can help us avoid these costs by ensuring that our data is clean and accurate.

# 2. Import Libraries

In [19]:
# For loading the dataset:
import pandas as pd

# For Dataset Validation:
from great_expectations.data_context import FileDataContext

# 3. Data Load

Before we validate the dataset, let's check whether the clean dataset is really clean. First, we load the clean data.

In [20]:
# Load the clean data
df = pd.read_csv('P2M3_Tyas_data_clean.csv')
df.head()

Unnamed: 0,customer_id,age,gender,item_purchased,category,purchase_amount_usd_,location,size,color,season,review_rating,subscription_status,shipping_type,discount_applied,promo_code_used,previous_purchases,payment_method,frequency_of_purchases
0,1,55,Male,Blouse,Clothing,53,Kentucky,L,Gray,Winter,3.1,Yes,Express,Yes,Yes,14,Venmo,Fortnightly
1,2,19,Male,Sweater,Clothing,64,Maine,L,Maroon,Winter,3.1,Yes,Express,Yes,Yes,2,Cash,Fortnightly
2,3,50,Male,Jeans,Clothing,73,Massachusetts,S,Maroon,Spring,3.1,Yes,Free Shipping,Yes,Yes,23,Credit Card,Weekly
3,4,21,Male,Sandals,Footwear,90,Rhode Island,M,Maroon,Spring,3.5,Yes,Next Day Air,Yes,Yes,49,PayPal,Weekly
4,5,45,Male,Blouse,Clothing,49,Oregon,M,Turquoise,Spring,2.7,Yes,Free Shipping,Yes,Yes,31,PayPal,Annually


Let's check the data type of each column.

In [21]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3900 entries, 0 to 3899
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   customer_id             3900 non-null   int64  
 1   age                     3900 non-null   int64  
 2   gender                  3900 non-null   object 
 3   item_purchased          3900 non-null   object 
 4   category                3900 non-null   object 
 5   purchase_amount_usd_    3900 non-null   int64  
 6   location                3900 non-null   object 
 7   size                    3900 non-null   object 
 8   color                   3900 non-null   object 
 9   season                  3900 non-null   object 
 10  review_rating           3900 non-null   float64
 11  subscription_status     3900 non-null   object 
 12  shipping_type           3900 non-null   object 
 13  discount_applied        3900 non-null   object 
 14  promo_code_used         3900 non-null   

The data type of each column is appropriate. Next, let's check for duplicate rows.

In [22]:
# Check for duplicate rows
df.duplicated().sum()

0

There are no duplicate rows in the clean dataset. Next, let's check for missing values.

In [23]:
# Check for missing values
df.isna().sum()

customer_id               0
age                       0
gender                    0
item_purchased            0
category                  0
purchase_amount_usd_      0
location                  0
size                      0
color                     0
season                    0
review_rating             0
subscription_status       0
shipping_type             0
discount_applied          0
promo_code_used           0
previous_purchases        0
payment_method            0
frequency_of_purchases    0
dtype: int64

There are no missing values in the clean dataset. Because the data types are correct and there are no duplicate rows or missing values, the clean dataset is ready for validation.
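The two checks above can be bundled into one reusable function. The sketch below uses a hypothetical helper, `is_clean`, and two small made-up frames; it is not part of the original notebook.

```python
import pandas as pd

def is_clean(frame):
    """Hypothetical helper: True when a frame has no duplicate rows and no missing values."""
    no_duplicates = frame.duplicated().sum() == 0
    no_missing = frame.isna().sum().sum() == 0
    return bool(no_duplicates and no_missing)

# Made-up frames for illustration.
sample = pd.DataFrame({"customer_id": [1, 2, 3], "age": [55, 19, 50]})
dirty = pd.DataFrame({"customer_id": [1, 1, 3], "age": [55, 55, None]})

clean_result = is_clean(sample)  # no duplicates, no gaps
dirty_result = is_clean(dirty)   # one repeated row and one missing age

print(clean_result, dirty_result)
```

In this notebook, `is_clean(df)` would give the same verdict as the separate `df.duplicated().sum()` and `df.isna().sum()` cells.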

# 4. Initiate Data Context

In [24]:
# Create the Great Expectations project folder (gx/) in the current directory
context = FileDataContext.create(project_root_dir='./')

# 5. Connect to A Datasource

In [25]:
# Give a name to a Datasource. This name must be unique between Datasources.
datasource_name = 'milestone_clean_data'
datasource = context.sources.add_pandas(datasource_name)

# Give a name to a data asset
asset_name = 'data_clean'
path_to_data = 'P2M3_Tyas_data_clean.csv'
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)

# Build batch request
batch_request = asset.build_batch_request()

# 6. Create an Expectation Suite

In [26]:
# Create an expectation suite
expectation_suite_name = 'expectation-clean-dataset'
context.add_or_update_expectation_suite(expectation_suite_name)

# Create a validator using above expectation suite
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name=expectation_suite_name
)

# Check the validator
validator.head()

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

Unnamed: 0,customer_id,age,gender,item_purchased,category,purchase_amount_usd_,location,size,color,season,review_rating,subscription_status,shipping_type,discount_applied,promo_code_used,previous_purchases,payment_method,frequency_of_purchases
0,1,55,Male,Blouse,Clothing,53,Kentucky,L,Gray,Winter,3.1,Yes,Express,Yes,Yes,14,Venmo,Fortnightly
1,2,19,Male,Sweater,Clothing,64,Maine,L,Maroon,Winter,3.1,Yes,Express,Yes,Yes,2,Cash,Fortnightly
2,3,50,Male,Jeans,Clothing,73,Massachusetts,S,Maroon,Spring,3.1,Yes,Free Shipping,Yes,Yes,23,Credit Card,Weekly
3,4,21,Male,Sandals,Footwear,90,Rhode Island,M,Maroon,Spring,3.5,Yes,Next Day Air,Yes,Yes,49,PayPal,Weekly
4,5,45,Male,Blouse,Clothing,49,Oregon,M,Turquoise,Spring,2.7,Yes,Free Shipping,Yes,Yes,31,PayPal,Annually


# 7. Expectation

## 7.1 Expectation to be unique

For Expectation 1, we check that every value in the `customer_id` column is unique, so no customer appears twice.

In [27]:
# Expectation 1 : Column `customer_id` must be unique
validator.expect_column_values_to_be_unique('customer_id')

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 3900,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

As you can see, the expectation `to be unique` on column `customer_id` returns `"success": true`. The test confirms that all values in the **customer_id column are unique**, so there are no duplicates.
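As a quick cross-check outside Great Expectations, the same question can be asked in plain pandas. This is a sketch on made-up IDs, not the validator's internals.

```python
import pandas as pd

# Made-up IDs for illustration.
unique_ids = pd.Series([1, 2, 3, 4, 5])
repeated_ids = pd.Series([1, 2, 2, 3])

ids_are_unique = unique_ids.is_unique
repeats_are_unique = repeated_ids.is_unique

# keep=False marks every occurrence of a repeated value, not only the later ones.
repeated_values = repeated_ids[repeated_ids.duplicated(keep=False)].tolist()

print(ids_are_unique, repeats_are_unique, repeated_values)
```

On the real data, `df['customer_id'].is_unique` should agree with the validation result above.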

## 7.2 Expectation to be between min value and max value

For Expectation 2, we check that every value in the `review_rating` column is between 0 and 5.

In [28]:
# Expectation 2 : Column `review_rating` must be between 0 and 5

validator.expect_column_values_to_be_between(
    column='review_rating', min_value=0, max_value=5
)

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 3900,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

As you can see, the expectation `to be between min value and max value` on column `review_rating` returns `"success": true`. The test confirms that all values in the **review_rating column are in the range 0-5**.
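The range check can be sketched in pandas with `Series.between`, which is inclusive on both ends by default. To our understanding the expectation above is also inclusive unless strict bounds are requested, but treat that as an assumption. The ratings below are made up.

```python
import pandas as pd

# Made-up ratings; 5.5 and -1.0 are deliberately out of range.
ratings = pd.Series([3.1, 5.0, 0.0, 5.5, -1.0])

# between() is inclusive on both ends by default, so 0.0 and 5.0 pass.
in_range = ratings.between(0, 5)
all_in_range = bool(in_range.all())
unexpected_values = ratings[~in_range].tolist()

print(all_in_range, unexpected_values)
```

Listing the offending values this way is similar in spirit to the `partial_unexpected_list` field in the validation result.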

## 7.3 Expectation to be in set

For Expectation 3, we check that the `subscription_status` column contains only Yes or No.

In [29]:
# Expectation 3 : Column 'subscription_status' must contain one of the following 2 things :
# Yes
# No

validator.expect_column_values_to_be_in_set('subscription_status',['Yes','No'])

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "element_count": 3900,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

As you can see, the expectation `to be in set` on column `subscription_status` returns `"success": true`. The test confirms that every value in the **subscription_status column is either Yes or No**, so no other value appears.
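Set membership is an exact, case-sensitive match. The pandas sketch below, on made-up values, shows which kinds of entries would be flagged.

```python
import pandas as pd

# Made-up values; 'yes' (lowercase) and 'N/A' are not in the allowed set.
status = pd.Series(["Yes", "No", "yes", "N/A"])
allowed = ["Yes", "No"]

is_allowed = status.isin(allowed)
all_allowed = bool(is_allowed.all())
unexpected_values = status[~is_allowed].tolist()

print(all_allowed, unexpected_values)
```

This is why consistent capitalization during cleaning matters before running an in-set check.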

## 7.4 Expectation to be in type list

For Expectation 4, we check that the `age` column has the data type int64.

In [30]:
# Expectation 4 : Column 'age' must have data type int64

validator.expect_column_values_to_be_in_type_list('age', ['int64'])

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "observed_value": "int64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

As you can see, the expectation `to be in type list` on column `age` returns `"success": true`. The test confirms that the **age column has data type int64**.
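A type check is useful because a single missing value silently changes an integer column into a float column in pandas. The sketch below uses made-up ages to show this.

```python
import pandas as pd

# Made-up ages; the dtype is set explicitly so the example is platform independent.
ages = pd.Series([55, 19, 50], dtype="int64")
ages_with_gap = pd.Series([55, None, 50])  # None becomes NaN, which forces float64

allowed_types = ["int64"]
ages_ok = str(ages.dtype) in allowed_types
gap_dtype = str(ages_with_gap.dtype)
gap_ok = gap_dtype in allowed_types

print(ages_ok, gap_dtype, gap_ok)
```

So a passing type check on `age` also hints that no missing values slipped into the column.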

## 7.5 Expectation to exist

For Expectation 5, we check that the column 'gender' exists in the dataset.

In [31]:
# Expectation 5 : Column 'gender' must exist in the dataset
validator.expect_column_to_exist('gender')

Calculating Metrics:   0%|          | 0/2 [00:00<?, ?it/s]

{
  "success": true,
  "result": {},
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

As you can see, the expectation `to exist` on column `gender` returns `"success": true`. The test confirms that the **gender column exists** in the dataset.
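Column existence is a match on the exact column name. A small pandas sketch, with a made-up frame and a hypothetical `required` list, shows how missing columns could be listed.

```python
import pandas as pd

# Made-up frame with three of the dataset's column names.
sample = pd.DataFrame({"customer_id": [1], "age": [55], "gender": ["Male"]})

# Hypothetical list of required columns; 'Gender' differs only by case.
required = ["gender", "Gender", "size"]
missing_columns = [name for name in required if name not in sample.columns]

print(missing_columns)
```

Only `gender` is found. Because the match is case-sensitive, normalizing column names during cleaning avoids false failures.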

## 7.6 Expectation median values to be between

For Expectation 6, we check that the median of the 'purchase_amount_usd_' column is between 10 and 70.

In [32]:
# Expectation 6 : The median of column 'purchase_amount_usd_' must be between 10 and 70.

validator.expect_column_median_to_be_between('purchase_amount_usd_', min_value=10, max_value=70)

Calculating Metrics:   0%|          | 0/4 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "observed_value": 60.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

As you can see, the expectation `median to be between` on column `purchase_amount_usd_` returns `"success": true`. The observed median is 60.0, which confirms that the **median of the purchase_amount_usd_ column is between the minimum value (10) and the maximum value (70)**.
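Unlike the row-level checks, this expectation tests one aggregate number. The sketch below computes a median in pandas for the five purchase amounts shown in the `df.head()` preview only, so its result differs from the full-column median of 60.0.

```python
import pandas as pd

# The five purchase amounts from the df.head() preview (not the full column).
amounts = pd.Series([53, 64, 73, 90, 49])

median_value = float(amounts.median())
median_in_bounds = 10 <= median_value <= 70

print(median_value, median_in_bounds)
```

Note that individual values such as 73 and 90 are above 70, yet the check passes, because only the median has to fall inside the bounds.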

## 7.7 Expectation minimum value to be between

For Expectation 7, we check that the minimum of the 'previous_purchases' column is between 0 and 50.

In [33]:
# Expectation 7 : The minimum of column 'previous_purchases' must be between 0 and 50.

validator.expect_column_min_to_be_between('previous_purchases', min_value=0, max_value=50)

Calculating Metrics:   0%|          | 0/4 [00:00<?, ?it/s]

{
  "success": true,
  "result": {
    "observed_value": 1
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

As you can see, the expectation `min to be between` on column `previous_purchases` returns `"success": true`. The observed minimum is 1, which confirms that the **minimum of the previous_purchases column is between 0 and 50**.
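It helps to remember that this expectation constrains only the smallest value and says nothing about the largest one. The pandas sketch below uses the `previous_purchases` values from the `df.head()` preview plus a made-up series with an outlier.

```python
import pandas as pd

# The five values from the df.head() preview (not the full column).
previous = pd.Series([14, 2, 23, 49, 31])
min_value = int(previous.min())
min_in_bounds = 0 <= min_value <= 50

# Made-up series: the minimum is still 2, but 500 is far above 50.
with_outlier = pd.Series([2, 14, 500])
outlier_min_in_bounds = 0 <= int(with_outlier.min()) <= 50
outlier_all_in_range = bool(with_outlier.between(0, 50).all())

print(min_value, min_in_bounds, outlier_min_in_bounds, outlier_all_in_range)
```

If every value should stay within 0 and 50, a row-level check like the one in section 7.2 would be the better fit.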

# 8. Save Into Expectation Suite

In [34]:
# Save into Expectation Suite

validator.save_expectation_suite(discard_failed_expectations=True)

# 9. Checkpoint

In [35]:
# Create a checkpoint

checkpoint_1 = context.add_or_update_checkpoint(
    name = 'checkpoint_1',
    validator = validator,
)

# Run a checkpoint

checkpoint_result = checkpoint_1.run()

Calculating Metrics:   0%|          | 0/26 [00:00<?, ?it/s]

# 10. Data Docs

In [36]:
# Build data docs

context.build_data_docs()

{'local_site': 'file://c:\\Users\\DELL\\data\\Tugas\\p2-ftds003-sby-m3-ghssni\\gx\\uncommitted/data_docs/local_site/index.html'}

# 11. Conclusion

The validation with Great Expectations reports `"success": true` for all 7 Expectations.
Data that has been validated with Great Expectations can now be processed further for data analysis and data visualization.
