# `Data Validation`

# 1. Introduction

Name: Celine Clarissa

Original Dataset: [Kaggle](https://www.kaggle.com/datasets/athu1105/book-genre-prediction/data)

---

## Identifying the Problem

#### `Background`

A company that sells products to customers must understand the demographics and characteristics of those customers in order to
minimize customer churn.

#### `Problem Statement and Objectives (SMART Framework)`
 
A data analyst at a company needs to understand the market and extract business insights from data. Analyzing customer churn data
reveals user demographics as well as browsing behavior. With that information, the company can plan strategies that follow from
the business insights. The insights are to be presented in the form of a dashboard within 5 working days.

---
---

## 2. Import

In [1]:
!pip install -q great-expectations

---
---
## 3. Data Validation

In [2]:
# create data context
from great_expectations.data_context import FileDataContext
context = FileDataContext.create(project_root_dir='./')

In [3]:
# give datasource name
datasource_name = 'csv-milestone-celine'
datasource = context.sources.add_pandas(datasource_name)

# give data asset name
asset_name = 'celine'
path_to_data = 'P2M3_Celine_Clarissa_Data_Clean.csv'
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)

# build batch request
batch_request = asset.build_batch_request()

In [5]:
# create expectation suite
expectation_suite_name = 'expectation-trip-dataset'
context.add_or_update_expectation_suite(expectation_suite_name)

# create validator using above expectation suite
validator = context.get_validator(
    batch_request = batch_request,
    expectation_suite_name = expectation_suite_name
)

# check the validator
validator.head()

| | unnamed:_0 | age | gender | security_no | region_category | membership_category | joining_date | joined_through_referral | referral_id | preferred_offer_types | ... | average_time_spent | average_transaction_value | average_frequency_login_days | points_in_wallet | used_special_discount | offer_application_preference | past_complaint | complaint_status | feedback | churn_risk_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 18 | F | XW0DQ7H | Village | Platinum Membership | 2017-08-17 | No | xxxxxxxx | Gift Vouchers/Coupons | ... | 300.63 | 53005.25 | 17.0 | 781.75 | Yes | Yes | No | Not Applicable | Products always in Stock | 0 |
| 1 | 1 | 32 | F | 5K0N3X1 | City | Premium Membership | 2017-08-28 | ? | CID21329 | Gift Vouchers/Coupons | ... | 306.34 | 12838.38 | 10.0 | 686.919871 | Yes | No | Yes | Solved | Quality Customer Care | 0 |
| 2 | 2 | 44 | F | 1F2TCL3 | Town | No Membership | 2016-11-11 | Yes | CID12313 | Gift Vouchers/Coupons | ... | 516.16 | 21027.0 | 22.0 | 500.69 | No | Yes | Yes | Solved in Follow-up | Poor Website | 1 |
| 3 | 3 | 37 | M | VJGJ33N | City | No Membership | 2016-10-29 | Yes | CID3793 | Gift Vouchers/Coupons | ... | 53.27 | 25239.56 | 6.0 | 567.66 | No | Yes | Yes | Unsolved | Poor Website | 1 |
| 4 | 4 | 31 | F | SVZXCWB | City | No Membership | 2017-09-12 | No | xxxxxxxx | Credit/Debit Card Offers | ... | 113.13 | 24483.66 | 16.0 | 663.06 | No | Yes | Yes | Solved | Poor Website | 1 |


In [6]:
# expectation 1: column `security_no` must be unique

validator.expect_column_values_to_be_unique('security_no')

{
  "success": true,
  "result": {
    "element_count": 36704,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
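
To make the result fields above easier to read, here is a minimal standard-library sketch of what a uniqueness check computes. It is not the Great Expectations implementation. The helper name `uniqueness_report` is hypothetical, and counting every occurrence of a repeated value as unexpected is an assumption about how the library reports duplicates. The five IDs come from the preview table; the appended duplicate is made up for illustration.

```python
from collections import Counter


def uniqueness_report(values):
    """Mimic the headline fields of a uniqueness check (hypothetical helper)."""
    non_null = [v for v in values if v is not None]
    counts = Counter(non_null)
    # Assumption: every occurrence of a repeated value is flagged, not only the extras.
    unexpected = [v for v in non_null if counts[v] > 1]
    return {
        "success": not unexpected,
        "element_count": len(values),
        "unexpected_count": len(unexpected),
        "partial_unexpected_list": unexpected[:20],
    }


# The five `security_no` values shown in the preview above.
preview_ids = ["XW0DQ7H", "5K0N3X1", "1F2TCL3", "VJGJ33N", "SVZXCWB"]
clean_report = uniqueness_report(preview_ids)

# Hypothetical data: repeat the first ID to show what a failing result looks like.
dirty_report = uniqueness_report(preview_ids + ["XW0DQ7H"])

print(clean_report["success"], dirty_report["unexpected_count"])
```

With the preview IDs alone the check succeeds. Once the hypothetical duplicate is appended, both copies of `XW0DQ7H` are listed as unexpected.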

In [9]:
# expectation 2: every value in column `age` must be in the range 10-70

validator.expect_column_values_to_be_between('age', 10, 70)

{
  "success": true,
  "result": {
    "element_count": 36704,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
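
Note that `expect_column_values_to_be_between` tests each value individually, which is not the same as testing the column average (Great Expectations has a separate `expect_column_mean_to_be_between` for that). The sketch below illustrates the difference in plain Python. The helper names `values_between` and `mean_between` are hypothetical, the bounds are treated as inclusive, the first five ages come from the preview table, and the age of 85 is an invented outlier.

```python
from statistics import mean


def values_between(values, low, high):
    """Per-value check: every value must satisfy low <= value <= high."""
    unexpected = [v for v in values if not (low <= v <= high)]
    return {"success": not unexpected, "unexpected_count": len(unexpected)}


def mean_between(values, low, high):
    """Aggregate check: only the column mean must fall inside [low, high]."""
    observed = mean(values)
    return {"success": low <= observed <= high, "observed_value": observed}


preview_ages = [18, 32, 44, 37, 31]   # ages from the preview rows
with_outlier = preview_ages + [85]    # hypothetical out-of-range customer

per_value = values_between(with_outlier, 10, 70)
aggregate = mean_between(with_outlier, 10, 70)

print(per_value)   # the per-value check flags the outlier
print(aggregate)   # the mean check does not
```

A single out-of-range age fails the per-value check while the mean stays comfortably inside 10-70, so the two expectations answer different questions about the column.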

In [7]:
# expectation 3: column `region_category` must contain one of the following: 'City', 'Town', 'Village'

validator.expect_column_values_to_be_in_set('region_category', ['City', 'Town', 'Village'])

{
  "success": true,
  "result": {
    "element_count": 36704,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 5379,
    "missing_percent": 14.655078465562339,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
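
The result above succeeds even though 5,379 rows have no `region_category`, because missing values are counted separately rather than as unexpected. The following sketch reproduces that bookkeeping with the standard library only. It is an approximation of the reported fields, not the library's code: `in_set_report` is a hypothetical helper, `None` stands in for a missing cell, and the extra `None` row is invented for illustration.

```python
def in_set_report(values, allowed):
    """Mimic the missing/unexpected bookkeeping of a set-membership check."""
    total = len(values)
    non_missing = [v for v in values if v is not None]
    missing = total - len(non_missing)
    # Missing values are skipped here, so they never count as unexpected.
    unexpected = [v for v in non_missing if v not in allowed]
    return {
        "success": not unexpected,
        "element_count": total,
        "missing_count": missing,
        "missing_percent": 100 * missing / total if total else None,
        "unexpected_count": len(unexpected),
        "unexpected_percent_nonmissing": (
            100 * len(unexpected) / len(non_missing) if non_missing else None
        ),
    }


allowed_regions = {"City", "Town", "Village"}

# Regions from the preview rows plus one hypothetical missing value.
sample_regions = ["Village", "City", "Town", "City", "City", None]
report = in_set_report(sample_regions, allowed_regions)

# Reproduce the missing_percent reported above from its two counts.
reported_missing_percent = 100 * 5379 / 36704

print(report["success"], report["missing_count"], reported_missing_percent)
```

If missing regions should fail validation, a separate not-null expectation on `region_category` would be needed; the set check alone does not cover that case.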

In [10]:
# expectation 4: values in column `points_in_wallet` must be of type integer or float

validator.expect_column_values_to_be_in_type_list('points_in_wallet', ['integer', 'float'])

{
  "success": true,
  "result": {
    "observed_value": "float64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [11]:
# expectation 5: column `churn_risk_score` must be present

validator.expect_column_to_exist('churn_risk_score')

{
  "success": true,
  "result": {},
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [12]:
# expectation 6: number of columns must be equal to 24

validator.expect_table_column_count_to_equal(24)

{
  "success": true,
  "result": {
    "observed_value": 24
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [13]:
# expectation 7: number of rows must be in range 0-36,992

validator.expect_table_row_count_to_be_between(0, 36992)

{
  "success": true,
  "result": {
    "observed_value": 36704
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
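
Expectations 6 and 7 are table-level checks: they look at the shape of the data rather than at individual values. As a rough illustration, the same two checks can be sketched with the standard `csv` module. The in-memory CSV below is a hypothetical three-column extract of the preview rows (the real file has 24 columns), so the expected column count is 3 here rather than 24.

```python
import csv
import io

# Hypothetical extract: three columns and three rows taken from the preview table.
sample_csv = io.StringIO(
    "age,gender,region_category\n"
    "18,F,Village\n"
    "32,F,City\n"
    "44,F,Town\n"
)

reader = csv.reader(sample_csv)
header = next(reader)   # first line holds the column names
rows = list(reader)     # remaining lines are data rows

# Analogue of expect_table_column_count_to_equal (3 for this extract).
column_count_ok = len(header) == 3

# Analogue of expect_table_row_count_to_be_between with the same bounds as above.
row_count_ok = 0 <= len(rows) <= 36992

print(len(header), len(rows), column_count_ok, row_count_ok)
```

One design point worth noting: with a lower bound of 0, an empty table would also pass the row-count check, so a higher minimum may be preferable if an empty file should be treated as an error.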