================================================================================

This program performs data validation using the Python library Great Expectations.

This notebook was run on Google Colab.

The dataset used here was cleaned beforehand using Apache Airflow.

================================================================================

# A. Install the Great Expectations Package

First, I install compatible versions of Great Expectations and NumPy so that the rest of the code runs smoothly.

In [1]:
# Install the library

!pip install -q "great-expectations==0.18.19" "numpy==1.24.3"

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dep

In [1]:
import numpy as np

np.__version__

'1.24.3'
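The same check can be scripted for both pins at once. This is a minimal sketch, assuming the two version strings from the install cell; `PINS` and `check_pins` are hypothetical names I introduce here, and `importlib.metadata.version` is the standard-library lookup for an installed package's version.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install cell above.
PINS = {"great-expectations": "0.18.19", "numpy": "1.24.3"}


def check_pins(pins, get_version=version):
    """Return {package: (expected, found)} for every pin that does not match."""
    mismatches = {}
    for package, expected in pins.items():
        try:
            found = get_version(package)
        except PackageNotFoundError:
            found = None  # the package is not installed at all
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches
```

Calling `check_pins(PINS)` returns an empty dictionary when both installed versions match the pins.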

The NumPy version has been changed successfully, so next we create the data context.

# B. Instantiate Data Context

I will create a data context in Google Colab to store the validation results.

In [2]:
# Create a data context

from great_expectations.data_context import FileDataContext

context = FileDataContext.create(project_root_dir='./')


# C. Connect to A Datasource

Next, I connect to the datasource. I uploaded the data to Google Colab myself from my local repo. This data is the output of the cleaning step in Apache Airflow.

In [3]:
# Give a name to a Datasource. This name must be unique between Datasources.
datasource_name = 'milestone3_csv'
datasource = context.sources.add_pandas(datasource_name)

# Give a name to a data asset
asset_name = 'p2m3_clean_data'
path_to_data = './P2M3_ade_indra_data_clean.csv'
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)

# Build batch request
batch_request = asset.build_batch_request()
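Because the CSV is uploaded to Colab by hand, it can be worth confirming that the file is actually present before registering it as an asset. This is a minimal sketch using only the standard library; `read_csv_header` is a hypothetical helper, not part of Great Expectations.

```python
import csv
from pathlib import Path


def read_csv_header(path):
    """Return the column names from the first row of a CSV file."""
    csv_path = Path(path)
    if not csv_path.is_file():
        raise FileNotFoundError(f"CSV not found: {csv_path}")
    with csv_path.open(newline="", encoding="utf-8") as handle:
        # An empty file has no header row, so fall back to an empty list.
        return next(csv.reader(handle), [])
```

In this notebook it would be called as `read_csv_header(path_to_data)`; a `FileNotFoundError` at this point usually means the upload step was skipped.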

# D. Create an Expectation Suite

Let's define the rules by creating a validator.

In [4]:
# Create an expectation suite
expectation_suite_name = 'expectation-employee-survey-dataset'
context.add_or_update_expectation_suite(expectation_suite_name)

# Create a validator using above expectation suite
validator = context.get_validator(
    batch_request = batch_request,
    expectation_suite_name = expectation_suite_name
)

# Check the validator
validator.head()


Unnamed: 0,empid,gender,age,maritalstatus,joblevel,experience,dept,emptype,wlb,workenv,...,sleephours,commutemode,commutedistance,numcompanies,teamsize,numreports,edulevel,haveot,traininghoursperyear,jobsatisfaction
0,6,Male,32,Married,Mid,7,IT,FullTime,1,1,...,7.6,Car,20,3,12,0,Bachelor,True,33.5,5
1,11,Female,34,Married,Mid,12,Finance,FullTime,1,1,...,7.9,Car,15,4,11,0,Bachelor,False,36.0,5
2,33,Female,23,Single,Intern/Fresher,1,Marketing,FullTime,2,4,...,6.5,Motorbike,17,0,30,0,Bachelor,True,10.5,5
3,20,Female,29,Married,Junior,6,IT,Contract,2,2,...,7.5,Public Transport,13,2,9,0,Bachelor,True,23.0,5
4,28,Other,23,Single,Junior,1,Sales,PartTime,3,1,...,4.9,Car,20,0,7,0,Bachelor,False,20.5,5


In [None]:
validator.columns()

# E. Expectations

I will validate the data using 7 Expectations from the [GX](https://greatexpectations.io/expectations/) website:

- expect_column_values_to_be_unique
- expect_column_values_to_be_between (with min_value and max_value)
- expect_column_values_to_be_in_set
- expect_column_values_to_be_in_type_list
- expect_column_pair_values_a_to_be_greater_than_b
- expect_table_row_count_to_be_between
- expect_table_column_count_to_equal

---

## Let's run the validations!

In [6]:
# Expectation 1 : Column `empid` must be unique

validator.expect_column_values_to_be_unique('empid')





{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_unique",
    "kwargs": {
      "column": "empid",
      "batch_id": "milestone3_csv-p2m3_clean_data"
    },
    "meta": {}
  },
  "result": {
    "element_count": 3025,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
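To make the fields in this result concrete, here is a plain-Python sketch of a uniqueness check on a toy list. It is an illustration, not GX's implementation; `uniqueness_report` is a hypothetical helper, and it assumes that every occurrence of a duplicated value counts as unexpected, which is my understanding of how the pandas backend behaves.

```python
from collections import Counter


def uniqueness_report(values):
    """Summarize a uniqueness check using field names from the result above."""
    counts = Counter(values)
    # Every occurrence of a repeated value is treated as unexpected.
    unexpected = [v for v in values if counts[v] > 1]
    element_count = len(values)
    # Toy convention: report 0.0 percent for an empty input.
    percent = 100.0 * len(unexpected) / element_count if element_count else 0.0
    return {
        "success": not unexpected,
        "element_count": element_count,
        "unexpected_count": len(unexpected),
        "unexpected_percent": percent,
    }


print(uniqueness_report([6, 11, 33, 20, 28]))  # ids from validator.head()
print(uniqueness_report([6, 11, 6, 20]))       # 6 appears twice
```

The first call reports success with zero unexpected values; the second flags both rows that carry the id 6.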

In [7]:
# Expectation 2 : Column `age` must be within range 20-60

validator.expect_column_values_to_be_between(
    column='age', min_value=20, max_value=60
)





{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_between",
    "kwargs": {
      "column": "age",
      "min_value": 20,
      "max_value": 60,
      "batch_id": "milestone3_csv-p2m3_clean_data"
    },
    "meta": {}
  },
  "result": {
    "element_count": 3025,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
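Both bounds are inclusive unless `strict_min` or `strict_max` is set, so an employee aged exactly 20 or 60 passes. The sketch below shows that boundary behaviour in plain Python; `values_outside_range` is a hypothetical helper and the boundary list is made up for illustration.

```python
def values_outside_range(values, min_value, max_value):
    """Return the values outside the inclusive range [min_value, max_value]."""
    return [v for v in values if not (min_value <= v <= max_value)]


# Ages from the first five rows shown by validator.head().
print(values_outside_range([32, 34, 23, 29, 23], 20, 60))  # []

# Made-up boundary cases: 20 and 60 pass, 19 and 61 do not.
print(values_outside_range([19, 20, 60, 61], 20, 60))      # [19, 61]
```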

In [8]:
# Expectation 3 : Column `jobsatisfaction` must contain scale 1-5

validator.expect_column_values_to_be_in_set('jobsatisfaction', [1, 2, 3, 4, 5])





{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_in_set",
    "kwargs": {
      "column": "jobsatisfaction",
      "value_set": [
        1,
        2,
        3,
        4,
        5
      ],
      "batch_id": "milestone3_csv-p2m3_clean_data"
    },
    "meta": {}
  },
  "result": {
    "element_count": 3025,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
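One thing worth keeping in mind with set membership on CSV data is that the comparison is type-sensitive: the string "5" is not the integer 5. The sketch below illustrates the idea in plain Python; it is not GX's implementation, and `values_not_in_set` is a hypothetical helper.

```python
def values_not_in_set(values, value_set):
    """Return the values that are not members of value_set, in input order."""
    allowed = set(value_set)
    return [v for v in values if v not in allowed]


scale = [1, 2, 3, 4, 5]

# Integer scores on the 1-5 scale all pass.
print(values_not_in_set([5, 5, 3, 1], scale))  # []

# The same scores read as strings would all be flagged.
print(values_not_in_set(["5", "3"], scale))    # ['5', '3']
```

If a rating column ever fails this expectation unexpectedly, checking the column's dtype is a reasonable first step.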

In [9]:
# Expectation 4 : Column `physicalactivityhours` must be of integer or float type

validator.expect_column_values_to_be_in_type_list('physicalactivityhours', ['integer', 'float'])





{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_in_type_list",
    "kwargs": {
      "column": "physicalactivityhours",
      "type_list": [
        "integer",
        "float"
      ],
      "batch_id": "milestone3_csv-p2m3_clean_data"
    },
    "meta": {}
  },
  "result": {
    "observed_value": "float64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [11]:
# Expectation 5 : Column 'age' must be larger than column 'experience'

validator.expect_column_pair_values_a_to_be_greater_than_b('age', 'experience')





{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_pair_values_a_to_be_greater_than_b",
    "kwargs": {
      "column_A": "age",
      "column_B": "experience",
      "batch_id": "milestone3_csv-p2m3_clean_data"
    },
    "meta": {}
  },
  "result": {
    "element_count": 3025,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
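This expectation compares the two columns row by row, and by default the comparison is strict; the expectation accepts an `or_equal` argument to allow ties. A plain-Python sketch of that row-wise logic follows. `rows_where_a_not_greater` is a hypothetical helper, and the last row in the second example is made up to show a tie.

```python
def rows_where_a_not_greater(rows, column_a, column_b, or_equal=False):
    """Return the indexes of rows where column_a is not greater than column_b."""
    failing = []
    for index, row in enumerate(rows):
        a, b = row[column_a], row[column_b]
        passed = a >= b if or_equal else a > b
        if not passed:
            failing.append(index)
    return failing


# age and experience from the first five rows shown by validator.head().
head_rows = [
    {"age": 32, "experience": 7},
    {"age": 34, "experience": 12},
    {"age": 23, "experience": 1},
    {"age": 29, "experience": 6},
    {"age": 23, "experience": 1},
]
print(rows_where_a_not_greater(head_rows, "age", "experience"))  # []

# A made-up tie fails the strict check but passes with or_equal=True.
tie = head_rows + [{"age": 20, "experience": 20}]
print(rows_where_a_not_greater(tie, "age", "experience"))                 # [5]
print(rows_where_a_not_greater(tie, "age", "experience", or_equal=True))  # []
```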

In [15]:
# Expectation 6 : Table row count must be between 3000 and 5000

validator.expect_table_row_count_to_be_between(min_value=3000, max_value=5000)


{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_table_row_count_to_be_between",
    "kwargs": {
      "min_value": 3000,
      "max_value": 5000,
      "batch_id": "milestone3_csv-p2m3_clean_data"
    },
    "meta": {}
  },
  "result": {
    "observed_value": 3025
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [18]:
# Expectation 7 : Table column count must equal 23

validator.expect_table_column_count_to_equal(23)





{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_table_column_count_to_equal",
    "kwargs": {
      "value": 23,
      "batch_id": "milestone3_csv-p2m3_clean_data"
    },
    "meta": {}
  },
  "result": {
    "observed_value": 23
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
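With seven results on screen it is easy to miss a failure. The sketch below condenses result dictionaries shaped like the JSON printed above into one line each. `summarize_results` is a hypothetical helper, the sample input is abbreviated from this notebook's outputs, and I am assuming each result object can be converted to such a dictionary (in GX 0.18, via `to_json_dict()`).

```python
def summarize_results(results):
    """Return (all_passed, lines) for a list of expectation result dicts."""
    lines = []
    for result in results:
        name = result["expectation_config"]["expectation_type"]
        status = "PASS" if result["success"] else "FAIL"
        lines.append(f"{status}  {name}")
    all_passed = all(result["success"] for result in results)
    return all_passed, lines


# Abbreviated copies of the last two results above.
sample = [
    {
        "success": True,
        "expectation_config": {"expectation_type": "expect_table_row_count_to_be_between"},
    },
    {
        "success": True,
        "expectation_config": {"expectation_type": "expect_table_column_count_to_equal"},
    },
]
all_passed, lines = summarize_results(sample)
print("\n".join(lines))
```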

Finally, I save the expectations above as a single suite and build the data documentation.

In [None]:
# Save into Expectation Suite

validator.save_expectation_suite(discard_failed_expectations=False)

# Create a checkpoint
checkpoint_1 = context.add_or_update_checkpoint(
    name = 'checkpoint_1',
    validator = validator,
)

# Build data docs
context.build_data_docs()
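The cell above registers `checkpoint_1` but does not run it. A minimal sketch of that follow-up step is shown here, assuming GX 0.18, where `Checkpoint.run()` returns a result object that exposes a `success` attribute; `run_checkpoint` is a hypothetical wrapper of my own.

```python
def run_checkpoint(checkpoint):
    """Run a checkpoint and return True when every expectation passed."""
    result = checkpoint.run()
    return bool(result.success)
```

In this notebook it would be called as `run_checkpoint(checkpoint_1)`. As far as I understand, running the checkpoint is what stores a validation result, so doing it before `context.build_data_docs()` should let the docs show the outcome alongside the suite.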