---
# **Great Expectations**
---
This program was built to ensure that the data used across projects meets the defined standards, which reduces the risk of errors and keeps analysis results reliable.

In [1]:
# ignore warnings
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Install the library

!pip install -q "great-expectations==0.18.19"

# **Create directory**

In [3]:
# Create a data context

from great_expectations.data_context import FileDataContext

context = FileDataContext.create(project_root_dir='./')

# **Connect Data Source**

In [4]:
# Give a name to a Datasource. This name must be unique between Datasources.
datasource_name = 'csv-data-m3'
datasource = context.sources.add_pandas(datasource_name)

# Give a name to a data asset
asset_name = 'milestone3'
path_to_data = '/content/P2M3_danisa_data_clean.csv'
asset = datasource.add_csv_asset(asset_name, filepath_or_buffer=path_to_data)

# Build batch request
batch_request = asset.build_batch_request()


# **Create Expectation Suite**

In [5]:
# Create an expectation suite
expectation_suite_name = 'expectation-m3-dataset'
context.add_or_update_expectation_suite(expectation_suite_name)

# Create a validator using above expectation suite
validator = context.get_validator(
    batch_request = batch_request,
    expectation_suite_name = expectation_suite_name
)

# Check the validator
validator.head()

Unnamed: 0,id,date,day,month,year,customer_age,age_group,customer_gender,country,state,product_category,sub_category,product,order_quantity,unit_cost,unit_price,profit,cost,revenue
0,106740,2014-01-05,1,May,2014,35,Adults (35-64),M,Germany,Nordrhein-Westfalen,Accessories,Tires and Tubes,LL Road Tire,16,8,21,164,128,292
1,30758,2014-06-27,27,June,2014,50,Adults (35-64),M,United States,Oregon,Accessories,Helmets,"Sport-100 Helmet, Blue",25,13,35,428,325,753
2,40669,2016-05-29,29,May,2016,47,Adults (35-64),F,United States,California,Accessories,Helmets,"Sport-100 Helmet, Blue",12,13,35,256,156,412
3,53764,2014-02-22,22,February,2014,24,Youth (<25),M,Australia,Queensland,Bikes,Mountain Bikes,"Mountain-200 Silver, 46",1,1266,2320,683,1266,1949
4,36840,2015-12-12,12,December,2015,47,Adults (35-64),M,Canada,British Columbia,Accessories,Helmets,"Sport-100 Helmet, Blue",19,13,35,411,247,658


-----
# **Expectation**
-----

# **Expectation 1**

This expectation validates whether a column's values are unique. In this dataset, the `id` column is the one that should be unique, so we validate it with `expect_column_values_to_be_unique`.

In [6]:
# Expectation 1 : Column `id` must be unique

validator.expect_column_values_to_be_unique('id')

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_unique",
    "kwargs": {
      "column": "id",
      "batch_id": "csv-data-m3-milestone3"
    },
    "meta": {}
  },
  "result": {
    "element_count": 10000,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
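As a rough illustration of what the fields in this result mean, the sketch below reproduces a uniqueness check in plain Python on a small sample list. It is not the library's implementation: the helper name `check_unique` and the duplicated sample value are assumptions made for the example. As far as we understand the pandas backend, every occurrence of a repeated value counts as unexpected, and the sketch follows that rule.

```python
from collections import Counter


def check_unique(values):
    """Hypothetical helper: summarize a uniqueness check like the `result` block."""
    counts = Counter(values)
    # Every occurrence of a repeated value is treated as unexpected.
    unexpected = [v for v in values if counts[v] > 1]
    total = len(values)
    return {
        "success": not unexpected,
        "element_count": total,
        "unexpected_count": len(unexpected),
        "unexpected_percent": 100.0 * len(unexpected) / total if total else 0.0,
        "partial_unexpected_list": unexpected[:20],
    }


# Small sample of IDs; the duplicate in the second list is introduced on purpose.
clean = check_unique([106740, 30758, 40669])
dirty = check_unique([106740, 30758, 106740, 40669])
print(clean["success"], dirty["unexpected_count"], dirty["unexpected_percent"])
```

An `unexpected_count` of 0 together with `"success": true`, as in the output above, therefore means that no `id` value appears more than once.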

# **Expectation 2**
In the second validation, we expect the data to cover only the years 2011 through 2016, so we validate the `year` column against a minimum value of `2011` and a maximum value of `2016`.

In [7]:
# Expectation 2 : Column `year` must be between 2011 and 2016 (inclusive)

validator.expect_column_values_to_be_between(
    column='year',
    min_value=2011,
    max_value=2016,
)

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_between",
    "kwargs": {
      "column": "year",
      "min_value": 2011,
      "max_value": 2016,
      "batch_id": "csv-data-m3-milestone3"
    },
    "meta": {}
  },
  "result": {
    "element_count": 10000,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
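The range check treats both bounds as inclusive unless `strict_min` or `strict_max` is set, so 2011 and 2016 themselves pass. A minimal plain-Python sketch of that behavior follows. The helper name `check_between` and the out-of-range years are assumptions added for illustration, not values from the dataset.

```python
def check_between(values, min_value, max_value):
    """Hypothetical helper: flag values outside the inclusive range."""
    unexpected = [v for v in values if not (min_value <= v <= max_value)]
    return {"success": not unexpected, "unexpected_list": unexpected}


# 2010 and 2017 are added on purpose to show values that would fail.
in_range = check_between([2011, 2014, 2016], min_value=2011, max_value=2016)
out_of_range = check_between([2011, 2010, 2017], min_value=2011, max_value=2016)
print(in_range["success"], out_of_range["unexpected_list"])
```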

# **Expectation 3**
This validation ensures that the column contains no values other than those listed in the code.

In [8]:
# Expectation 3 : Column `customer_gender` must contain one of the following :
# M = Male
# F = Female

validator.expect_column_values_to_be_in_set('customer_gender', ['M', 'F'])

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_in_set",
    "kwargs": {
      "column": "customer_gender",
      "value_set": [
        "M",
        "F"
      ],
      "batch_id": "csv-data-m3-milestone3"
    },
    "meta": {}
  },
  "result": {
    "element_count": 10000,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
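To make the membership rule concrete, here is a plain-Python sketch under two assumptions: missing values are skipped and counted separately (which is how the `missing_count` field above reads), and the comparison is case-sensitive. The helper name `check_in_set` and the sample values are hypothetical.

```python
def check_in_set(values, value_set):
    """Hypothetical helper: flag non-missing values that are not in value_set."""
    allowed = set(value_set)
    non_missing = [v for v in values if v is not None]
    unexpected = [v for v in non_missing if v not in allowed]
    return {
        "success": not unexpected,
        "missing_count": len(values) - len(non_missing),
        "unexpected_list": unexpected,
    }


# The lowercase "m" and the None are made-up entries to show both cases.
result = check_in_set(["M", "F", None, "m", "F"], ["M", "F"])
print(result)
```

In this sketch the lowercase `"m"` is reported as unexpected, which is the kind of inconsistent coding this expectation is meant to catch.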

# **Expectation 4**

This validation checks whether the numeric data has the data type it is supposed to have.

In [9]:
# Expectation 4 : Column `revenue` must be of type integer or float

validator.expect_column_values_to_be_in_type_list('revenue', ['int', 'float'])

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_in_type_list",
    "kwargs": {
      "column": "revenue",
      "type_list": [
        "int",
        "float"
      ],
      "batch_id": "csv-data-m3-milestone3"
    },
    "meta": {}
  },
  "result": {
    "observed_value": "int64"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
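The output above reports a single `observed_value` (`int64`), which suggests the check is made on the column's dtype as a whole. The sketch below shows a per-value variant of the same idea in plain Python, mainly to show what a failure would look like. The helper name `check_type_list`, the name-to-type mapping, and the string value `"1949"` are assumptions made for the example.

```python
def check_type_list(values, type_list):
    """Hypothetical helper: flag values whose Python type is not in type_list."""
    type_map = {"int": int, "float": float, "str": str}  # assumed mapping
    allowed = tuple(type_map[name] for name in type_list)
    # bool is a subclass of int in Python, so it is rejected explicitly.
    unexpected = [
        v for v in values if isinstance(v, bool) or not isinstance(v, allowed)
    ]
    return {"success": not unexpected, "unexpected_list": unexpected}


# "1949" is a number stored as text, added on purpose to show a failure.
result = check_type_list([292, 753, 412.5, "1949"], ["int", "float"])
print(result)
```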

# **Expectation 5**
The fifth validation applies the expectation that the values in the selected columns must be unique within each record (row). In other words, we check that no two of the selected columns hold the same value in the same row. **Revenue** is the total income generated from sales, while **Profit** is the net income after all costs are deducted. Logically, the two values normally differ, because profit is what remains after expenses are subtracted from revenue. We therefore expect no row in which one of these values was entered in place of the other.

In [10]:
# Expectation 5 : Columns `revenue` and `profit` must not hold the same value within a row

validator.expect_select_column_values_to_be_unique_within_record(
    column_list=['revenue', 'profit'],
)

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_select_column_values_to_be_unique_within_record",
    "kwargs": {
      "column_list": [
        "revenue",
        "profit"
      ],
      "batch_id": "csv-data-m3-milestone3"
    },
    "meta": {}
  },
  "result": {
    "element_count": 10000,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
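Because this check compares columns within a row rather than values down a column, a small plain-Python sketch may help. The helper name `check_unique_within_record` is hypothetical. The first two rows use the `revenue` and `profit` values from the preview above, and the third row is made up to show a failure.

```python
def check_unique_within_record(rows, column_list):
    """Hypothetical helper: flag rows where the selected columns share a value."""
    unexpected = []
    for row in rows:
        selected = [row[col] for col in column_list]
        # A row fails when any two selected columns hold the same value.
        if len(set(selected)) != len(selected):
            unexpected.append(row)
    return {"success": not unexpected, "unexpected_count": len(unexpected)}


rows = [
    {"revenue": 292, "profit": 164},
    {"revenue": 753, "profit": 428},
    {"revenue": 500, "profit": 500},  # made-up row with identical values
]
result = check_unique_within_record(rows, ["revenue", "profit"])
print(result)
```

Note that the check only says the two values differ. It does not verify that profit is smaller than revenue.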

# **Expectation 6**
In the sixth validation, we expect the `country` column to contain no swapped entries, that is, no row where a `state` name was entered in the place of a country.

In [11]:
# Expectation 6 : Column `country` must not contain any value that is actually a `state`
validator.expect_column_values_to_not_be_in_set(
    column="country",
    value_set=['Nordrhein-Westfalen', 'Oregon', 'California', 'Queensland',
       'British Columbia', 'Bayern', 'England', 'Washington',
       'New South Wales', 'Saarland', 'Seine (Paris)', 'Victoria',
       'Seine Saint Denis', 'Tasmania', 'South Australia', 'Yveline',
       'Hamburg', 'Essonne', 'Nord', 'Hessen', "Val d'Oise",
       'Seine et Marne', 'Moselle', 'Loiret', 'Hauts de Seine', 'Ontario',
       'Wyoming', 'Alberta', 'South Carolina', 'Val de Marne', 'Somme',
       'Garonne (Haute)', 'Ohio', 'Montana', 'Pas de Calais',
       'Loir et Cher', 'Minnesota', 'New York', 'Brandenburg', 'Florida',
       'Charente-Maritime', 'Texas', 'Georgia', 'Illinois', 'Arizona',
       'Massachusetts', 'Kentucky']
)

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_not_be_in_set",
    "kwargs": {
      "column": "country",
      "value_set": [
        "Nordrhein-Westfalen",
        "Oregon",
        "California",
        "Queensland",
        "British Columbia",
        "Bayern",
        "England",
        "Washington",
        "New South Wales",
        "Saarland",
        "Seine (Paris)",
        "Victoria",
        "Seine Saint Denis",
        "Tasmania",
        "South Australia",
        "Yveline",
        "Hamburg",
        "Essonne",
        "Nord",
        "Hessen",
        "Val d'Oise",
        "Seine et Marne",
        "Moselle",
        "Loiret",
        "Hauts de Seine",
        "Ontario",
        "Wyoming",
        "Alberta",
        "South Carolina",
        "Val de Marne",
        "Somme",
        "Garonne (Haute)",
        "Ohio",
        "Montana",
        "Pas de Calais",
        "Loir et Cher",
        "Minnesota",
        "New York"
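The state list in this expectation is hard-coded. One possible alternative, shown here only as a sketch, is to derive the set of state names from the `state` column itself and look for any of them in `country`. The helper name `countries_that_look_like_states` and the third sample row are assumptions. The approach would also flag a legitimate case where a country and a state share a name.

```python
def countries_that_look_like_states(rows):
    """Hypothetical helper: list country values that also appear as a state."""
    states = {row["state"] for row in rows}
    return sorted({row["country"] for row in rows if row["country"] in states})


rows = [
    {"country": "Germany", "state": "Nordrhein-Westfalen"},
    {"country": "United States", "state": "Oregon"},
    {"country": "Oregon", "state": "Oregon"},  # made-up swapped entry
]
suspects = countries_that_look_like_states(rows)
print(suspects)
```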

# **Expectation 7**
Because the dataset we use contains `10,000` rows, this validation checks whether the actual row count matches that number.

In [12]:
# Expectation 7 : The dataset should contain 10,000 rows; check that the row count equals 10,000
validator.expect_table_row_count_to_equal(
    value=10000
)

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_table_row_count_to_equal",
    "kwargs": {
      "value": 10000,
      "batch_id": "csv-data-m3-milestone3"
    },
    "meta": {}
  },
  "result": {
    "observed_value": 10000
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}
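For a quick cross-check outside the library, the row count of a CSV file can be obtained with the standard `csv` module, which handles quoted fields such as `"Sport-100 Helmet, Blue"` correctly. The sketch below uses a small in-memory sample rather than the real file. The helper name `count_rows` and the sample text are assumptions made for the example.

```python
import csv
import io


def count_rows(csv_text):
    """Hypothetical helper: count data rows, excluding the header line."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    return sum(1 for _ in reader)


# Three data rows; two contain a comma inside a quoted field.
sample = (
    "id,product\n"
    '30758,"Sport-100 Helmet, Blue"\n'
    '40669,"Sport-100 Helmet, Blue"\n'
    "106740,LL Road Tire\n"
)
observed = count_rows(sample)
print({"success": observed == 3, "observed_value": observed})
```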

# **Save Expectation**

In [13]:
# Save into Expectation Suite

validator.save_expectation_suite(discard_failed_expectations=False)