# Milestone 3 - Great Expectations

Name: Fauzan Rahmat Farghani

Batch: HCK-028

This notebook validates the cleaned dataset using Great Expectations. The dataset contains employee information and is validated with 7 expectations: to_be_unique, to_be_between, to_be_in_set, to_be_in_type_list, and 3 others.

# Import Libraries

Libraries used:

In [37]:
from great_expectations.data_context import FileDataContext
import pandas as pd

Load the data and add a column:

In [38]:
# Read the data
df = pd.read_csv('P2M3_fauzanfarghani_data_clean.csv')

# Add the employee_id column
df['employee_id'] = df['employeenumber'].astype(str) + '_' + df['department']

# Save back to the file (optional, only if the file should be updated)
df.to_csv('P2M3_fauzanfarghani_data_clean.csv', index=False)
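
The composite key built above can be sanity-checked before any Great Expectations run. The following is a minimal sketch in plain Python rather than pandas; the sample rows and the helper names `make_employee_id` and `find_duplicates` are hypothetical and only illustrate the idea.

```python
from collections import Counter


def make_employee_id(employee_number: int, department: str) -> str:
    """Join the employee number and department with an underscore."""
    return f"{employee_number}_{department}"


def find_duplicates(ids: list[str]) -> list[str]:
    """Return every ID that occurs more than once, in first-seen order."""
    counts = Counter(ids)
    return [value for value, n in counts.items() if n > 1]


# Hypothetical sample rows: (employeenumber, department)
rows = [(1, "Sales"), (2, "Research & Development"), (4, "Research & Development")]
ids = [make_employee_id(num, dept) for num, dept in rows]
duplicates = find_duplicates(ids)
```

An empty `duplicates` list means every generated ID is distinct, which is the same property the uniqueness expectation checks later on the full column.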

# Instantiate Data Context

In [39]:
context = FileDataContext.create(project_root_dir='./')

The Data Context is stored in the current working directory, namely the P2-M3/fauzanfarghani folder.

# Connect to a Data Source

In [None]:
# Give a name to a Datasource.
datasource_name = 'P2M3_fauzanfarghani'
try:
    datasource = context.get_datasource(datasource_name)
except Exception:
    datasource = context.sources.add_pandas(datasource_name)

# Give a name to a data asset
asset_name = 'fauzanfarghani_data_clean'
try:
    asset = datasource.get_asset(asset_name)
except Exception:
    path_to_data = 'P2M3_fauzanfarghani_data_clean.csv'
    asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)

# Build batch request
batch_request = asset.build_batch_request()

Initializing the Data Source provides an API for accessing data from the underlying source. Initializing the Data Asset defines a collection of records within that Data Source, here the CSV file.

# Expectation Suite Creation

In [41]:
# Create an expectation suite
expectation_suite_name = 'expectation_fauzanfarghani_dataset'
context.add_or_update_expectation_suite(expectation_suite_name)

# Create a validator using above expectation suite
validator = context.get_validator(
    batch_request = batch_request,
    expectation_suite_name = expectation_suite_name
)

# Check the validator
validator.head()

| Unnamed: 0 | age | attrition | businesstravel | dailyrate | department | distancefromhome | education | educationfield | employeenumber | environmentsatisfaction | ... | relationshipsatisfaction | stockoptionlevel | totalworkingyears | trainingtimeslastyear | worklifebalance | yearsatcompany | yearsincurrentrole | yearssincelastpromotion | yearswithcurrmanager | employee_id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 41 | Yes | Travel_Rarely | 1102 | Sales | 1 | College | Life Sciences | 1 | Medium | ... | Low | 0 | 8 | 0 | Bad | 6 | 4 | 0 | 5 | 1_Sales |
| 1 | 49 | No | Travel_Frequently | 279 | Research & Development | 8 | Below College | Life Sciences | 2 | High | ... | Very High | 1 | 10 | 3 | Better | 10 | 7 | 1 | 7 | 2_Research & Development |
| 2 | 37 | Yes | Travel_Rarely | 1373 | Research & Development | 2 | College | Other | 4 | Very High | ... | Medium | 0 | 7 | 3 | Better | 0 | 0 | 0 | 0 | 4_Research & Development |
| 3 | 33 | No | Travel_Frequently | 1392 | Research & Development | 3 | Master | Life Sciences | 5 | Very High | ... | High | 0 | 8 | 3 | Better | 8 | 7 | 3 | 0 | 5_Research & Development |
| 4 | 27 | No | Travel_Rarely | 591 | Research & Development | 2 | Below College | Medical | 7 | Low | ... | Very High | 1 | 6 | 3 | Better | 2 | 2 | 2 | 2 | 7_Research & Development |


The Expectation Suite groups multiple data validations into a single collection that, taken together, describes the data. The validator initialized above is the object used to interact with the data.

# Expectations

### 1. employee_id must be unique

In [42]:
validator.expect_column_values_to_be_unique(column='employee_id')

{
  "success": true,
  "result": {
    "element_count": 1470,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
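
The count and percentage fields in a result like the one above are related. The sketch below recomputes the percentages from the raw counts, assuming each percentage is the unexpected count divided by the relevant row count (all rows for `unexpected_percent_total`, non-missing rows for `unexpected_percent_nonmissing`). The helper `summarize` is hypothetical and not part of Great Expectations.

```python
import json

# Trimmed copy of the result printed above.
raw = """
{
  "success": true,
  "result": {
    "element_count": 1470,
    "unexpected_count": 0,
    "missing_count": 0
  }
}
"""


def summarize(result_json: str) -> dict:
    """Recompute the percentage fields from the raw counts (assumed formulas)."""
    payload = json.loads(result_json)
    result = payload["result"]
    total = result["element_count"]
    nonmissing = total - result["missing_count"]
    unexpected = result["unexpected_count"]
    return {
        "success": payload["success"],
        "unexpected_percent_total": 100.0 * unexpected / total if total else 0.0,
        "unexpected_percent_nonmissing": (
            100.0 * unexpected / nonmissing if nonmissing else 0.0
        ),
    }


summary = summarize(raw)
```

With 0 unexpected values out of 1470 rows, both recomputed percentages are 0.0, in line with the printed result.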

### 2. age between 18 and 65

In [43]:
validator.expect_column_values_to_be_between(column='age', min_value=18, max_value=65)

{
  "success": true,
  "result": {
    "element_count": 1470,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
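
By default the bounds of `expect_column_values_to_be_between` are inclusive, so ages of exactly 18 and 65 pass. A minimal plain-Python sketch of that rule follows; the sample ages and the helper `values_outside` are hypothetical.

```python
def values_outside(values, min_value, max_value):
    """Return the values that fall outside the inclusive range [min_value, max_value]."""
    return [v for v in values if not (min_value <= v <= max_value)]


# Hypothetical ages, including both boundary values
sample_ages = [17, 18, 41, 65, 66]
unexpected = values_outside(sample_ages, min_value=18, max_value=65)
```

Only 17 and 66 end up in `unexpected`; the boundary values 18 and 65 are accepted.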

### 3. department must be in a specific set

In [44]:
validator.expect_column_values_to_be_in_set(column='department', value_set=['Sales', 'Research & Development', 'Human Resources'])

{
  "success": true,
  "result": {
    "element_count": 1470,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

### 4. monthlyincome must be an integer or float type

In [45]:
validator.expect_column_values_to_be_in_type_list(column='monthlyincome', type_list=['int', 'float'])

{
  "success": true,
  "result": {
    "observed_value": "int64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

### 5. jobrole must contain only letters and spaces (match regex pattern)

In [46]:
validator.expect_column_values_to_match_regex(column='jobrole', regex=r'^[A-Za-z\s]+$')

{
  "success": true,
  "result": {
    "element_count": 1470,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
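
To see what the pattern accepts, it can be tried directly with Python's `re` module. This is a minimal sketch; the sample strings and the helper `matches_pattern` are hypothetical and are not taken from the dataset.

```python
import re

# Same pattern as in the expectation: one or more ASCII letters or whitespace characters
JOBROLE_PATTERN = re.compile(r"^[A-Za-z\s]+$")


def matches_pattern(value: str) -> bool:
    """True when the string is non-empty and holds only letters and whitespace."""
    return JOBROLE_PATTERN.search(value) is not None


# Hypothetical job role strings
samples = ["Sales Executive", "Research Scientist", "Manager 2", "R&D Lead", ""]
results = {s: matches_pattern(s) for s in samples}
```

Strings containing a digit or `&` are rejected, and so is the empty string because `+` requires at least one character. The pattern does not allow digits, so it is stricter than "alphanumeric".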

### 6. percentsalaryhike must be between 11 and 25

In [47]:
validator.expect_column_values_to_be_between(column='percentsalaryhike', min_value=11, max_value=25)

{
  "success": true,
  "result": {
    "element_count": 1470,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

### 7. worklifebalance must have specific length based on criteria

In [48]:
# Allowed values: 'Bad', 'Good', 'Better', 'Best' (lengths 3 to 6)
validator.expect_column_value_lengths_to_be_between(column='worklifebalance', min_value=3, max_value=6)

{
  "success": true,
  "result": {
    "element_count": 1470,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
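
The bounds 3 and 6 follow from the lengths of the four criteria values. A minimal sketch that derives them; the helper `length_in_range` and the extra test value `'Poor'` are hypothetical.

```python
CRITERIA = ["Bad", "Good", "Better", "Best"]

# Length of each allowed value
lengths = {value: len(value) for value in CRITERIA}
min_length = min(lengths.values())
max_length = max(lengths.values())


def length_in_range(value: str, low: int = min_length, high: int = max_length) -> bool:
    """True when the string length lies in the inclusive range [low, high]."""
    return low <= len(value) <= high


# 'Poor' is not one of the criteria, yet its length (4) is within the range
poor_passes = length_in_range("Poor")
```

A length check is looser than a set-membership check: any string of 3 to 6 characters passes, including values outside the four criteria.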

In [49]:
# Save into Expectation Suite

validator.save_expectation_suite(discard_failed_expectations=False)

`discard_failed_expectations` is set to `False` so that Great Expectations saves every expectation in the suite, whether it passed or failed.

# Checkpoint

In [None]:
# Create a checkpoint
checkpoint_1 = context.add_or_update_checkpoint(
    name = 'checkpoint_1',
    validator = validator,
)

In [51]:
# Run a checkpoint

checkpoint_result = checkpoint_1.run()

A Checkpoint runs all validations in a structured way and stores the results.

# Data Docs

In [52]:
# Build data docs

context.build_data_docs()

{'local_site': 'file:///Users/fauzanfarghani/Hacktiv8/Phase 2/M3/p2-ftds028-hck-m3-fauzanfarghani/P2-M3/fauzanfarghani/gx/uncommitted/data_docs/local_site/index.html'}

The validation results documentation can be opened in a browser.
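
One way to get from the returned dictionary to the page is Python's standard library. The sketch below assumes a POSIX-style `file://` URL like the one printed above; the helpers `local_site_path` and `open_site` and the shortened example path are hypothetical.

```python
import webbrowser
from urllib.parse import unquote, urlparse


def local_site_path(sites: dict, site_name: str = "local_site") -> str:
    """Turn the file:// URL for a site into a plain filesystem path."""
    parsed = urlparse(sites[site_name])
    if parsed.scheme != "file":
        raise ValueError(f"expected a file:// URL, got {sites[site_name]!r}")
    return unquote(parsed.path)


def open_site(sites: dict, site_name: str = "local_site") -> bool:
    """Open the site in the default browser and return what webbrowser.open reports."""
    return webbrowser.open(sites[site_name])


# Hypothetical, shortened dictionary in the same shape as the output above
example_sites = {
    "local_site": "file:///tmp/gx/uncommitted/data_docs/local_site/index.html"
}
index_path = local_site_path(example_sites)
```

Calling `open_site` with the dictionary returned by `context.build_data_docs()` would open the page; it is left uncalled here so the sketch has no side effects.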

# Conclusion
All seven validations returned `success: true`. The new `employee_id` column, built by combining `employeenumber` and `department`, gives each row a unique identifier that can be used for analysis. Together, the seven expectations cover uniqueness, range checks, set membership, type checking, regex patterns, a value range for salary hikes, and value length.