# Machine Learning Operations Project
The aim of the project is to simulate the realworld process of deploying machine learning models, using the concepts that we have discussed during the classes. This notebook is focuses on the implementation of ``Great Expectations`` with a connected user interface (UI).

## Metadata

**UTC**:
Timestamp UTC seconds

**Temperature[C]**:
Air Temperature

**Humidity[%]**:
Air Humidity

**TVOC[ppb]**:
Total Volatile Organic Compounds; measured in parts per billion

**eCO2[ppm]**:
CO2 equivalent concentration; calculated from different values like TVOC

**Raw H2**:
Raw molecular hydrogen; not compensated (Bias, temperature, etc.)

**Raw Ethanol**:
Raw ethanol gas

**Pressure[hPa]**:
Air Pressure

**PM1.0**:
Particulate matter size < 1.0 µm (PM1.0). 1.0 µm < 2.5 µm (PM2.5)

**PM2.5**:
Particulate matter size < 1.0 µm (PM1.0). 1.0 µm < 2.5 µm (PM2.5)

**NC0.5**:
Number concentration of particulate matter. This differs from PM because NC gives the actual number of particles in the air. The raw NC is also classified by the particle size: < 0.5 µm (NC0.5); 0.5 µm < 1.0 µm (NC1.0); 1.0 µm < 2.5 µm (NC2.5);

**NC1.0**:
Number concentration of particulate matter. This differs from PM because NC gives the actual number of particles in the air. The raw NC is also classified by the particle size: < 0.5 µm (NC0.5); 0.5 µm < 1.0 µm (NC1.0); 1.0 µm < 2.5 µm (NC2.5);

**NC2.5**:
Number concentration of particulate matter. This differs from PM because NC gives the actual number of particles in the air. The raw NC is also classified by the particle size: < 0.5 µm (NC0.5); 0.5 µm < 1.0 µm (NC1.0); 1.0 µm < 2.5 µm (NC2.5);

**CNT**:
Sample counter

**Fire Alarm (Target)**:
Ground truth is "1" if a fire is there

## Imports

In [None]:
# Basic imports
import os
import pandas as pd
import numpy as np
import seaborn as sns
from typing import Tuple
import matplotlib.pyplot as plt
# import great_expectations.jupyter_ux

In [None]:
# dataframe = pd.read_csv('Data/smoke_detection.csv', index_col=[0])

In [None]:
# def split_dataframe(dataframe: pd.DataFrame, output_filenames: list) -> None:
#     # Order dataframe by month
#     dataframe = dataframe.sort_values(by='UTC').reset_index(drop=True)

#     # Split dataframe into 3 equal-sized datasets
#     size = len(dataframe)
#     third = size // 3

#     # Split the dataframe
#     dataframe_1 = dataframe.iloc[:third]
#     dataframe_2 = dataframe.iloc[third:third*2]
#     dataframe_3 = dataframe.iloc[third*2:]

#     # Save each split dataframe as a CSV file
#     for df, filename in zip([dataframe_1, dataframe_2, dataframe_3], output_filenames):
#         df.to_csv(filename, index=False)

# # Apply function
# split_dataframe(dataframe, ['df_one.csv', 'df_two.csv', 'df_three.csv'])

# <span style="color: #FE6C1B;">Great Expectations</span>


## Terminologies

*Context*: A context in is the main object that manages the overall configuration and execution of the data expectations. It serves as a container for storing and organizing expectations, data sources, and validation results. The context allows to define, execute, and manage our data expectations.

*Validator*: A validator is responsible for evaluating expectations on a given batch of data. Validators are used to validate data against a set of predefined expectations. They help to assess data quality, perform data validation, and monitor data pipelines.

*Suite*: An Expectation Suite is a collection of expectations that define the desired properties and characteristics of our data. It serves as a set of rules against which your data can be validated. The suite contains a set of expectations that can be applied to one or more batches of data. 

*Batch*: A batch represents a subset of data that we want to evaluate against our expectations. It can be a collection of rows, a partitioned dataset, a file, a table, or any other logical grouping of data. Batches are used as inputs to validation processes and contain the data you want to validate.

*Checkpoint*: A Checkpoint is a way to operationalize data validation using Expectation Suites. It allows you to define a pipeline-like flow for performing data validation on batches of data. It helps automate the validation process by defining the steps to be executed on data batches and tracking the results.

## Installation

- Open Anaconda Prompt Terminal
- After satisfying ``!pip install great_expectations``, run ``great_expectations init`` and confirm with 'Y'
- Verify the installation by running the version ``!great_expectations --version``
- Our version used: version 0.16.13


More information: https://docs.greatexpectations.io/docs/tutorials/quickstart/

In [None]:
# After setting up great_expectations init
for i in os.listdir('/Users/jlutt/great_expectations'):
    print(i)

.gitignore
checkpoints
expectations
great_expectations.yml
plugins
profilers
uncommitted


## Getting started

In [None]:
# Set up
import great_expectations as gx
from great_expectations.checkpoint import SimpleCheckpoint

# Create data context
context = gx.get_context(
    context_root_dir='/Users/jlutt/great_expectations'
)

# Connect to data
validator = context.sources.pandas_default.read_csv('../data/01_raw/smoke_detection_1.csv')

# Extract column names
column_names = [f"{column_name}" for column_name in validator.columns()]
print(f"Columns: {', '.join(column_names)}.")
print(validator.head(n_rows=5, fetch_all=False))

# Create expectation suite
expectation_suite_name = "smoke_detection_suite"
suite = context.create_expectation_suite(expectation_suite_name=expectation_suite_name,overwrite_existing=True)

Calculating Metrics:   0%|          | 0/2 [00:00<?, ?it/s]

Columns: Unnamed: 0, UTC, Temperature[C], Humidity[%], TVOC[ppb], eCO2[ppm], Raw H2, Raw Ethanol, Pressure[hPa], PM1.0, PM2.5, NC0.5, NC1.0, NC2.5, CNT, Fire Alarm.


Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

   Unnamed: 0         UTC  Temperature[C]  Humidity[%]  TVOC[ppb]  eCO2[ppm]  \
0           0  1654712187           27.45        43.27         48        488   
1           1  1654712188           27.41        43.54         32        457   
2           2  1654712189           27.36        43.76         34        455   
3           3  1654712190           27.32        43.84         29        454   
4           4  1654712191           27.27        43.98         28        456   

   Raw H2  Raw Ethanol  Pressure[hPa]  PM1.0  PM2.5  NC0.5  NC1.0  NC2.5  CNT  \
0   12844        20723        937.586   2.04   2.12  14.05  2.191  0.049    0   
1   12857        20743        937.589   2.16   2.24  14.83  2.313  0.052    1   
2   12857        20747        937.604   2.19   2.28  15.07  2.350  0.053    2   
3   12858        20752        937.610   2.24   2.32  15.39  2.400  0.054    3   
4   12860        20751        937.601   2.26   2.35  15.58  2.429  0.055    4   

   Fire Alarm  
0           0  


## Expectations
Now we use that data source for profiling, validation and documentation. More information regarding expectations, can be found here: https://legacy.docs.greatexpectations.io/en/latest/reference/glossary_of_expectations.html

In [None]:
# Use the Validator to create and run an Expectation
# Assert column count
validator.expect_table_column_count_to_equal(15)

Calculating Metrics:   0%|          | 0/3 [00:00<?, ?it/s]

{
  "meta": {},
  "result": {
    "observed_value": 16
  },
  "success": false,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [None]:
# Assert data types
validator.expect_column_values_to_be_of_type("UTC", "int64")
validator.expect_column_values_to_be_of_type("Temperature[C]", "float")
validator.expect_column_values_to_be_of_type("Humidity[%]", "float")
validator.expect_column_values_to_be_of_type("TVOC[ppb]", "int64")
validator.expect_column_values_to_be_of_type("eCO2[ppm]", "int64")
validator.expect_column_values_to_be_of_type("Raw H2", "int64")
validator.expect_column_values_to_be_of_type("Raw Ethanol", "int64")
validator.expect_column_values_to_be_of_type("Pressure[hPa]", "float")
validator.expect_column_values_to_be_of_type("PM1.0", "float")
validator.expect_column_values_to_be_of_type("PM2.5", "float")
validator.expect_column_values_to_be_of_type("NC0.5", "float")
validator.expect_column_values_to_be_of_type("NC1.0", "float")
validator.expect_column_values_to_be_of_type("NC2.5", "float")
validator.expect_column_values_to_be_of_type("CNT", "int64")
validator.expect_column_values_to_be_of_type("Fire Alarm", "int64")

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/1 [00:00<?, ?it/s]

{
  "meta": {},
  "result": {
    "observed_value": "int64"
  },
  "success": true,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [None]:
 # Assert no missing values
for column_name in column_names:
    validator.expect_column_values_to_not_be_null(column_name)

Calculating Metrics:   0%|          | 0/6 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/6 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/6 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/6 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/6 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/6 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/6 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/6 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/6 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/6 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/6 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/6 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/6 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/6 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/6 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/6 [00:00<?, ?it/s]

In [None]:
# Assert relationships between columns
validator.expect_column_pair_values_A_to_be_greater_than_B("PM2.5", "PM1.0").success = False

Calculating Metrics:   0%|          | 0/7 [00:00<?, ?it/s]

In [32]:
# Assert target variable
validator.expect_column_values_to_be_in_set("Fire Alarm", [0, 1])

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "meta": {},
  "result": {
    "element_count": 31315,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "success": true,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

In [33]:
# Assert value ranges to detect outliers
validator.expect_column_values_to_be_between("Temperature[C]", min_value=-23, max_value=60)
validator.expect_column_values_to_be_between("Humidity[%]", min_value=10, max_value=76)
validator.expect_column_values_to_be_between("TVOC[ppb]", min_value=0, max_value=None)
validator.expect_column_values_to_be_between("eCO2[ppm]", min_value=0, max_value=None)
validator.expect_column_values_to_be_between("Pressure[hPa]", min_value=930, max_value=941)
validator.expect_column_values_to_be_between("PM1.0", min_value=0, max_value=None)
validator.expect_column_values_to_be_between("PM2.5", min_value=0, max_value=None)
validator.expect_column_values_to_be_between("NC0.5", min_value=0, max_value=None)
validator.expect_column_values_to_be_between("NC1.0", min_value=0, max_value=None)
validator.expect_column_values_to_be_between("NC2.5", min_value=0, max_value=None)

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

Calculating Metrics:   0%|          | 0/8 [00:00<?, ?it/s]

{
  "meta": {},
  "result": {
    "element_count": 31315,
    "unexpected_count": 0,
    "unexpected_percent": 0.0,
    "partial_unexpected_list": [],
    "missing_count": 0,
    "missing_percent": 0.0,
    "unexpected_percent_total": 0.0,
    "unexpected_percent_nonmissing": 0.0
  },
  "success": true,
  "exception_info": {
    "raised_exception": false,
    "exception_traceback": null,
    "exception_message": null
  }
}

## Great Expectations UI

In [34]:
# Review and save our expectation suite
print(validator.get_expectation_suite(discard_failed_expectations=False))
validator.save_expectation_suite('data/09_data_unit_tests/expectations.json')

# Create checkpoint
checkpoint = SimpleCheckpoint(
    "smoke_detection_checkpoint",
    context,
    validator=validator,
)

# Run checkpoint to validate data 
checkpoint_result = checkpoint.run()

# View results
context.build_data_docs()
validation_result_identifier = checkpoint_result.list_validation_result_identifiers()[0]
context.open_data_docs(resource_identifier=validation_result_identifier)

{
  "meta": {
    "great_expectations_version": "0.16.16"
  },
  "data_asset_type": null,
  "ge_cloud_id": null,
  "expectations": [
    {
      "meta": {},
      "kwargs": {
        "value": 15
      },
      "expectation_type": "expect_table_column_count_to_equal"
    },
    {
      "meta": {},
      "kwargs": {
        "column": "UTC",
        "type_": "int64"
      },
      "expectation_type": "expect_column_values_to_be_of_type"
    },
    {
      "meta": {},
      "kwargs": {
        "column": "Temperature[C]",
        "type_": "float"
      },
      "expectation_type": "expect_column_values_to_be_of_type"
    },
    {
      "meta": {},
      "kwargs": {
        "column": "Humidity[%]",
        "type_": "float"
      },
      "expectation_type": "expect_column_values_to_be_of_type"
    },
    {
      "meta": {},
      "kwargs": {
        "column": "TVOC[ppb]",
        "type_": "int64"
      },
      "expectation_type": "expect_column_values_to_be_of_type"
    },
    {
      "meta

FileNotFoundError: [Errno 2] No such file or directory: 'data/09_data_unit_tests/expectations.json'

## References
- https://medium.com/@mostsignificant/python-data-validation-made-easy-with-the-great-expectations-package-8d1be266fd3f
- https://towardsdatascience.com/great-expectations-automated-testing-for-data-science-and-engineering-teams-1e7c78f1d2d5
- https://towardsdatascience.com/a-great-python-library-great-expectations-6ac6d6fe822e
- https://github.com/datarootsio/tutorial-great-expectations/blob/main/tutorial_great_expectations.ipynb