# **MILESTONE 3 - GREAT EXPECTATIONS**

## **Name :** Dewa Dwi Al-matin
## **Batch :** FTDS HCK-013

___

# IMPORT LIBRARIES

In [1]:
from great_expectations.data_context import FileDataContext

___

# SETUP

In [2]:
# Initialize a file-based Data Context rooted at the current directory
context = FileDataContext.create(project_root_dir='./')

In [3]:
# Give a name to a Datasource. This name must be unique between Datasources.
datasource_name = 'milestone'
datasource = context.sources.add_pandas(datasource_name)

# Give a name to a data asset
asset_name = 'table_m3'
path_to_data = 'P2M3_dewa_almatin_data_clean.csv'
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)

# Build batch request
batch_request = asset.build_batch_request()

In [4]:
# Create an expectation suite
expectation_suite_name = 'expectation-trip-dataset'
context.add_or_update_expectation_suite(expectation_suite_name)

# Create a validator using the expectation suite above
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name=expectation_suite_name
)

# Preview the first rows of the batch to confirm the validator loaded the data
validator.head()

Unnamed: 0,row_id,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,postal_code,region,product_id,category,sub_category,product_name,sales
0,1,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96
1,2,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,42420.0,South,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94
2,3,CA-2017-138688,2017-06-12,2017-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,90036.0,West,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62
3,4,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O Donnel,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775
4,5,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O Donnel,Consumer,United States,Fort Lauderdale,Florida,33311.0,South,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold N Roll Cart System,22.368


___

# EXPECTATIONS

## Unique 'row_id'
This expectation checks that every value in the 'row_id' column is unique. Downstream steps rely on 'row_id' as a unique identifier, so duplicate values would cause errors or inconsistent results.

In [5]:
validator.expect_column_values_to_be_unique('row_id')

{
  "result": {
    "element_count": 9789,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "success": true
}

The result confirms that every 'row_id' value is unique: none of the 9789 rows is flagged as unexpected.
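
To make the check concrete, here is a minimal plain-Python sketch of what a uniqueness test looks for. It is not how Great Expectations implements the expectation; the helper `find_duplicates` and the sample values are hypothetical and only illustrate the idea.

```python
from collections import Counter


def find_duplicates(values):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(values)
    return [value for value, n in counts.items() if n > 1]


# Hypothetical sample, not taken from the dataset
sample_row_ids = [1, 2, 3, 3, 4, 1]
print(find_duplicates(sample_row_ids))  # [1, 3]
```

An empty list corresponds to the passing case above, where `unexpected_count` is 0.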

## 'postal_code' range
This expectation uses the expect_column_values_to_be_between method from the Great Expectations library to check that every value in the 'postal_code' column lies within a given range. Here the range runs from 0 to 99999, which covers all five-digit postal codes.

In [6]:
validator.expect_column_values_to_be_between(
    column='postal_code', min_value=0, max_value=99999
)

{
  "result": {
    "element_count": 9789,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "success": true
}
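
The range check can be pictured with a short plain-Python sketch. It assumes both bounds are inclusive and that missing values are skipped rather than counted as failures; the helper `values_outside_range` and the sample values are hypothetical.

```python
def values_outside_range(values, min_value, max_value):
    """Return the non-missing values that fall outside [min_value, max_value]."""
    return [
        v for v in values
        if v is not None and not (min_value <= v <= max_value)
    ]


# Hypothetical sample: two valid codes, two out-of-range values, one missing
sample_postal_codes = [42420.0, 90036.0, -5.0, 123456.0, None]
print(values_outside_range(sample_postal_codes, 0, 99999))  # [-5.0, 123456.0]
```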

## 'category' allowed values

This expectation uses the expect_column_values_to_be_in_set method from the Great Expectations library to check that every value in the 'category' column belongs to a predefined set. Here the allowed values are 'Furniture', 'Office Supplies', and 'Technology'.

In [7]:
validator.expect_column_values_to_be_in_set('category', ['Furniture', 'Office Supplies', 'Technology'])

{
  "result": {
    "element_count": 9789,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "success": true
}
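
A set-membership check is an exact comparison, so spelling and letter case matter. The sketch below illustrates that in plain Python; the helper `values_not_in_set` and the sample values are hypothetical and are not part of the Great Expectations API.

```python
def values_not_in_set(values, allowed):
    """Return the distinct values that are not in the allowed set, sorted."""
    allowed = set(allowed)
    return sorted({v for v in values if v not in allowed})


allowed_categories = ['Furniture', 'Office Supplies', 'Technology']

# Hypothetical sample with a lowercase variant and an abbreviation
sample_categories = ['Furniture', 'Technology', 'furniture', 'Office Supplies', 'Tech']
print(values_not_in_set(sample_categories, allowed_categories))  # ['Tech', 'furniture']
```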

## 'sales' data type

This expectation uses the expect_column_values_to_be_in_type_list method from the Great Expectations library to check that the 'sales' column has one of the listed data types. Here the only accepted type is 'float'; the observed type reported in the result, float64, satisfies it.

In [8]:
validator.expect_column_values_to_be_in_type_list('sales', ['float'])

{
  "result": {
    "observed_value": "float64"
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "success": true
}

## 'order_id' not null
This expectation uses the expect_column_values_to_not_be_null method from the Great Expectations library to check that the 'order_id' column contains no null (missing) values.

In [9]:
validator.expect_column_values_to_not_be_null('order_id')

{
  "result": {
    "element_count": 9789,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": []
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "success": true
}
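
The result fields above can be read as simple counts. The sketch below rebuilds them for a not-null check in plain Python, assuming `unexpected_percent` is the unexpected count divided by the element count, expressed on a 0 to 100 scale. The helpers `is_missing` and `summarize_not_null` and the sample values are hypothetical.

```python
import math


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def summarize_not_null(values):
    """Build a small summary shaped like the result shown above."""
    element_count = len(values)
    unexpected_count = sum(1 for v in values if is_missing(v))
    unexpected_percent = (
        100.0 * unexpected_count / element_count if element_count else 0.0
    )
    return {
        "element_count": element_count,
        "unexpected_count": unexpected_count,
        "unexpected_percent": unexpected_percent,
        "success": unexpected_count == 0,
    }


# Hypothetical sample: two valid IDs and two missing entries
sample_order_ids = ['CA-2017-152156', None, 'US-2016-108966', float('nan')]
print(summarize_not_null(sample_order_ids))
```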

## 'order_date' format
This expectation uses the expect_column_values_to_match_strftime_format method from the Great Expectations library to check that every value in the 'order_date' column matches a given strftime format. Here the expected format is '%Y-%m-%d' (year-month-day, for example 2017-11-08).

In [10]:
validator.expect_column_values_to_match_strftime_format('order_date', '%Y-%m-%d')

{
  "result": {
    "element_count": 9789,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "success": true
}
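
A format check of this kind can be sketched with Python's `datetime.strptime`, used here as a stand-in for whatever the library does internally. A value passes only if it parses under the given format, so a wrong separator or an impossible month is rejected. The helper `mismatched_dates` and the sample values are hypothetical.

```python
from datetime import datetime


def mismatched_dates(values, fmt):
    """Return the strings that cannot be parsed with the given strftime format."""
    bad = []
    for value in values:
        try:
            datetime.strptime(value, fmt)
        except ValueError:
            bad.append(value)
    return bad


# Hypothetical sample: wrong separator/order and an impossible month
sample_dates = ['2017-11-08', '08/11/2017', '2017-13-01', '2016-10-11']
print(mismatched_dates(sample_dates, '%Y-%m-%d'))  # ['08/11/2017', '2017-13-01']
```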

## Single value for 'country'

This expectation uses the expect_column_unique_value_count_to_be_between method from the Great Expectations library to check that the number of distinct values in the 'country' column lies within a given range. Setting both the minimum and the maximum to 1 requires exactly one distinct value: the dataset should contain no country other than the United States.

In [11]:
validator.expect_column_unique_value_count_to_be_between(column='country', min_value=1, max_value=1)

{
  "result": {
    "observed_value": 1
  },
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  },
  "meta": {},
  "success": true
}
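
Unlike the row-level checks, this one tests a single aggregate: the number of distinct values. A minimal plain-Python sketch follows, assuming missing values are excluded from the count; the helper `distinct_count_in_range` and the sample values are hypothetical.

```python
def distinct_count_in_range(values, min_value, max_value):
    """Return (observed distinct count, whether it lies in [min_value, max_value])."""
    observed = len({v for v in values if v is not None})
    return observed, min_value <= observed <= max_value


# Hypothetical samples
print(distinct_count_in_range(['United States'] * 3, 1, 1))         # (1, True)
print(distinct_count_in_range(['United States', 'Canada'], 1, 1))   # (2, False)
```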

In [12]:
# Save the suite to the Data Context, keeping any expectations that failed
validator.save_expectation_suite(discard_failed_expectations=False)

In [13]:
# Register a checkpoint that reruns the saved suite against this batch
checkpoint_1 = context.add_or_update_checkpoint(
    name='checkpoint_1',
    validator=validator,
)

In [14]:
# Run the checkpoint and keep the validation result
checkpoint_result = checkpoint_1.run()

In [15]:
# Build the HTML Data Docs site from the stored validation results
context.build_data_docs()

{'local_site': 'file:///Users/dewaalmatin/Documents/Hacktiv8/Graded Challenge/p2-ftds013-hck-m3-dewaalmatin/gx/uncommitted/data_docs/local_site/index.html'}

___

# CONCLUSION

In conclusion, all seven expectations defined with the Great Expectations library passed on the 9789 rows of the dataset:

- The 'row_id' column contains unique values, ensuring data integrity and uniqueness.
- The 'postal_code' column values fall within the expected range of 0 to 99999.
- The 'category' column contains values exclusively from the set ['Furniture', 'Office Supplies', 'Technology'] as expected.
- The 'sales' column has the data type float64, which satisfies the expected 'float' type.
- The 'order_id' column does not contain any null (missing) values, ensuring completeness.
- The 'order_date' column values adhere to the specified '%Y-%m-%d' format, maintaining consistency in date representation.
- Lastly, the 'country' column contains exactly one unique value, aligning with our expectations.

Overall, for the properties checked above, the data meets our quality standards and can be used for analysis or downstream processes.