# Milestone 3

Name  : M. Arindra Jehan
Batch : HCK-015

This program checks a set of expectations against a dataset. The dataset used covers real estate sales from 2001 to 2020.

## Libraries

In [8]:
import pandas as pd
import great_expectations as gx
import numpy as np

## Load Data

In [2]:
# Load Dataset
df = pd.read_csv('P2M3_Ari_data_clean.csv')

# Wrap the DataFrame as a Great Expectations dataset
ge_df = gx.from_pandas(df)

## Expectations

#### Expectation 1 : **to be unique**

In [10]:
# create an ID column from random integers (random draws are not guaranteed to be unique)
df["unique_id"] = np.random.randint(low=1, high=10000000, size=len(df))

In [12]:
ge_df = gx.from_pandas(df)

In [13]:
expectation_1 = ge_df.expect_column_values_to_be_unique('unique_id')
print(expectation_1)

{
  "success": false,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_unique",
    "kwargs": {
      "column": "unique_id",
      "result_format": "BASIC"
    },
    "meta": {}
  },
  "result": {
    "element_count": 1045454,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 103671,
    "unexpected_percent": 9.916361695493059,
    "unexpected_percent_total": 9.916361695493059,
    "unexpected_percent_nonmissing": 9.916361695493059,
    "partial_unexpected_list": [
      1019945,
      3733316,
      1871202,
      2696219,
      2880573,
      8295322,
      6043673,
      8670682,
      1839713,
      6267599,
      8854697,
      2453184,
      8882556,
      5847385,
      2867995,
      9546191,
      9948075,
      875810,
      6259657,
      323517
    ]
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}


**description** :

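This expectation checks that the unique_id column has no duplicates. Here it fails (success is false): 103,671 of the 1,045,454 values (about 9.9%) are not unique, because IDs drawn with np.random.randint are random and can repeat.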
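
The share of non-unique IDs reported above can be estimated from the row count and the size of the random range. The sketch below assumes the IDs are drawn independently and uniformly from 1 to 9,999,999, which is what `np.random.randint(low=1, high=10000000)` does. The sequential-ID lines are a hypothetical alternative, not part of the original notebook.

```python
import math

import numpy as np

n_rows = 1_045_454    # element_count reported in the result above
n_values = 9_999_999  # randint(low=1, high=10000000) draws from 1..9,999,999

# Probability that a given row shares its ID with at least one other row.
p_shared = 1 - (1 - 1 / n_values) ** (n_rows - 1)
print(f"expected share of non-unique IDs: {p_shared:.2%}")

# Hypothetical alternative: sequential IDs cannot collide.
sequential_ids = np.arange(1, n_rows + 1)
all_unique = len(np.unique(sequential_ids)) == n_rows
print("sequential IDs unique:", all_unique)
```

The estimate comes out at roughly 9.9%, in line with the `unexpected_percent` shown in the result, so the failure is an expected consequence of random IDs rather than a problem in the data itself.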

#### Expectation 2 : **to be between min_value and max_value**

In [55]:
# Cell 2: Expectation to be between min_value and max_value
expectation_2 = ge_df.expect_column_values_to_be_between('sales_ratio', min_value=0, max_value=1300000)
print(expectation_2)

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_between",
    "kwargs": {
      "column": "sales_ratio",
      "min_value": 0,
      "max_value": 1300000,
      "result_format": "BASIC"
    },
    "meta": {}
  },
  "result": {
    "element_count": 1045454,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}


**description** :

success means that every value in the sales_ratio column lies between 0 and 1,300,000 (bounds included)
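
As a cross-check, the same range test can be sketched in plain pandas. The toy DataFrame below is hypothetical and deliberately contains one out-of-range value; in the notebook the check would run on `df['sales_ratio']`.

```python
import pandas as pd

# Hypothetical toy data standing in for df['sales_ratio'].
toy = pd.DataFrame({"sales_ratio": [0.0, 0.72, 1.15, 1_300_000.0, -0.5]})

# Series.between includes both bounds by default.
in_range = toy["sales_ratio"].between(0, 1_300_000)
unexpected = toy.loc[~in_range, "sales_ratio"].tolist()

print("success:", bool(in_range.all()))
print("unexpected values:", unexpected)
```

Listing the values that fall outside the range, rather than only a pass/fail flag, makes it easier to decide whether the bounds or the data need adjusting.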

#### Expectation 3 : **to be in set**


In [54]:
# Cell 3: Expectation to be in set
properties = ['Residential', 'Commercial', 'Vacant Land', 'Public Utility', 'Apartments', 'Single Family', 'Industrial', 'Condo', 'Two Family', 'Three Family', 'Four Family']
expectation_3 = ge_df.expect_column_values_to_be_in_set('property_type', properties)
print(expectation_3)

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_be_in_set",
    "kwargs": {
      "column": "property_type",
      "value_set": [
        "Residential",
        "Commercial",
        "Vacant Land",
        "Public Utility",
        "Apartments",
        "Single Family",
        "Industrial",
        "Condo",
        "Two Family",
        "Three Family",
        "Four Family"
      ],
      "result_format": "BASIC"
    },
    "meta": {}
  },
  "result": {
    "element_count": 1045454,
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}


**description** :

success means that every value in the property_type column belongs to the properties list
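
A set difference gives a quick way to see which categories, if any, fall outside the allowed list. This is a minimal pandas sketch on hypothetical toy data; the `Warehouse` value is invented purely to show a failing case.

```python
import pandas as pd

# Same allowed values as the properties list in the cell above.
allowed = {
    "Residential", "Commercial", "Vacant Land", "Public Utility",
    "Apartments", "Single Family", "Industrial", "Condo",
    "Two Family", "Three Family", "Four Family",
}

# Hypothetical toy data standing in for df['property_type'].
toy = pd.DataFrame(
    {"property_type": ["Condo", "Single Family", "Warehouse", "Condo"]}
)

# Categories present in the column but missing from the allowed set.
unexpected = sorted(set(toy["property_type"].dropna()) - allowed)
print(unexpected)
```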

#### Expectation 4 : **to be of type**


In [25]:
# Cell 4: Expectation to be of type
expectation_4 = ge_df.expect_column_values_to_be_of_type('town', 'object')
print(expectation_4)

{
  "success": true,
  "expectation_config": {
    "expectation_type": "_expect_column_values_to_be_of_type__aggregate",
    "kwargs": {
      "column": "town",
      "type_": "object",
      "result_format": "BASIC"
    },
    "meta": {}
  },
  "result": {
    "observed_value": "object_"
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}


**description** :

success means that the town column has the pandas dtype object, which pandas typically uses to store strings
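
An `object` column can hold any Python object, so the dtype check alone does not guarantee that every entry is a string. The sketch below, on hypothetical toy data with made-up town names, shows a column that has dtype `object` but still contains a non-string value, together with a stricter per-value check.

```python
import pandas as pd

# Hypothetical toy data: two strings and one integer in the same column.
toy = pd.DataFrame({"town": ["Town A", "Town B", 42]})

# Mixed Python objects are stored with dtype object.
is_object_dtype = toy["town"].dtype == object

# Stricter check: is every individual value a str?
all_strings = bool(toy["town"].map(lambda v: isinstance(v, str)).all())

print("dtype is object:", is_object_dtype)
print("every value is a string:", all_strings)
```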

#### Expectation 5 : **not to be null**


In [35]:
# Cell 5: Expectation to not be null
expectation_5 = ge_df.expect_column_values_to_not_be_null('date_recorded')
print(expectation_5)

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_column_values_to_not_be_null",
    "kwargs": {
      "column": "date_recorded",
      "result_format": "BASIC"
    },
    "meta": {}
  },
  "result": {
    "element_count": 1045454,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "partial_unexpected_list": []
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}


**description** :

success means that there are no null values in the date_recorded column

#### Expectation 6 : **table row to be between x and y**


In [36]:
# Cell 6: Expectation of table rows to be between x and y
expectation_6 = ge_df.expect_table_row_count_to_be_between(900000, 1200000)
print(expectation_6)

{
  "success": true,
  "expectation_config": {
    "expectation_type": "expect_table_row_count_to_be_between",
    "kwargs": {
      "min_value": 900000,
      "max_value": 1200000,
      "result_format": "BASIC"
    },
    "meta": {}
  },
  "result": {
    "observed_value": 1045454
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}


**description** :

success means that the table row count is between 900,000 and 1,200,000; the observed count is 1,045,454 rows, so the dataset is large

#### Expectation 7 : **proportion of unique values to be between min_value and max_value**


In [53]:
# Cell 7: Expectation of the proportion of unique values to be between min_value and max_value
expectation_7 = ge_df.expect_column_proportion_of_unique_values_to_be_between('list_year', min_value=0.0, max_value=1.0)
expectation_7

{
  "success": true,
  "result": {
    "observed_value": 2.0086967001895828e-05,
    "element_count": 1045454,
    "missing_count": null,
    "missing_percent": null
  },
  "meta": {},
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

**description** :

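success means that the proportion of unique values in the list_year column is between 0 and 1. A proportion always lies in that range, so these bounds pass for any non-empty column. The observed proportion is about 2.0e-05, which corresponds to 21 distinct values across 1,045,454 rows.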
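
The proportion of unique values is the number of distinct values divided by the number of rows. A minimal pandas sketch on hypothetical toy data follows, with an arithmetic check of the observed value above. The `max_expected` bound is an illustrative assumption and is not taken from the notebook.

```python
import math

import pandas as pd

# Hypothetical toy data standing in for df['list_year'].
toy = pd.DataFrame({"list_year": [2001, 2001, 2002, 2003, 2003, 2003]})

# 3 distinct years over 6 rows.
proportion = toy["list_year"].nunique() / len(toy)
print(proportion)

# The observed value in the result above matches 21 distinct values
# over 1,045,454 rows.
observed = 21 / 1_045_454
print(math.isclose(observed, 2.0086967001895828e-05, rel_tol=1e-9))

# A tighter, illustrative upper bound is more informative than 1.0
# for a year column, which should have few distinct values.
max_expected = 0.001
print(observed <= max_expected)
```

With bounds of 0.0 and 1.0 the expectation cannot fail on non-empty data, so a narrower range is what turns it into a meaningful check.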