# Machine Learning Operations Project
The aim of the project is to simulate the real-world process of deploying machine learning models, using the concepts discussed in class. This notebook focuses on implementing ``Great Expectations`` with a connected user interface (UI).

## Metadata

**UTC**:
Timestamp in UTC seconds

**Temperature[C]**:
Air Temperature

**Humidity[%]**:
Air Humidity

**TVOC[ppb]**:
Total Volatile Organic Compounds; measured in parts per billion

**eCO2[ppm]**:
CO2 equivalent concentration; calculated from different values like TVOC

**Raw H2**:
Raw molecular hydrogen; not compensated for bias, temperature, etc.

**Raw Ethanol**:
Raw ethanol gas

**Pressure[hPa]**:
Air Pressure

**PM1.0**:
Particulate matter with size < 1.0 µm

**PM2.5**:
Particulate matter with size between 1.0 µm and 2.5 µm

**NC0.5**:
Number concentration of particulate matter with size < 0.5 µm. Unlike PM, NC gives the actual number of particles in the air. The raw NC is classified by particle size: < 0.5 µm (NC0.5), 0.5 µm to 1.0 µm (NC1.0), and 1.0 µm to 2.5 µm (NC2.5).

**NC1.0**:
Number concentration of particulate matter with size between 0.5 µm and 1.0 µm. See NC0.5 for how NC differs from PM.

**NC2.5**:
Number concentration of particulate matter with size between 1.0 µm and 2.5 µm. See NC0.5 for how NC differs from PM.

**CNT**:
Sample counter

**Fire Alarm (Target)**:
Ground truth: "1" if a fire is present, "0" otherwise
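
The column list above can double as a lightweight schema check before the data is handed to Great Expectations. The following is a minimal sketch in plain Python; `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced here, and the column names are assumed to match the CSV header exactly (the target is written as `Fire Alarm`, as in the expectations later in this notebook).

```python
# Column names as listed in the Metadata section (assumed to match the CSV header).
EXPECTED_COLUMNS = [
    "UTC", "Temperature[C]", "Humidity[%]", "TVOC[ppb]", "eCO2[ppm]",
    "Raw H2", "Raw Ethanol", "Pressure[hPa]", "PM1.0", "PM2.5",
    "NC0.5", "NC1.0", "NC2.5", "CNT", "Fire Alarm",
]


def missing_columns(header, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from header, in order."""
    present = set(header)
    return [name for name in expected if name not in present]


# Synthetic header for illustration only: one column deliberately left out.
header = [name for name in EXPECTED_COLUMNS if name != "Pressure[hPa]"]
print(missing_columns(header))  # ['Pressure[hPa]']
```

With a pandas DataFrame, the same helper could be called as `missing_columns(df.columns)`.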

## Imports

In [None]:
# Basic imports
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import great_expectations.jupyter_ux

In [None]:
for i in os.listdir('/Users/jlutt/great_expectations'):
    print(i)

# <span style="color: #FE6C1B;">Great Expectations</span>


## Installation

- Open the Anaconda Prompt terminal
- Run ``!pip install great_expectations``; once it completes, run ``great_expectations init`` and confirm with 'Y'
- Verify the installation by checking the version with ``!great_expectations --version``
- Version used in this notebook: 0.16.13


More information: https://docs.greatexpectations.io/docs/tutorials/quickstart/
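
The version can also be checked from inside the notebook rather than from the terminal. This is a small sketch using only the standard library (Python 3.8+); `installed_version` is a hypothetical helper name, and the distribution name `great_expectations` is assumed to be the one used by `pip`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


gx_version = installed_version("great_expectations")
if gx_version is None:
    print("great_expectations is not installed in this environment")
else:
    print(f"great_expectations {gx_version}")
```

Comparing the printed value with 0.16.13 is a quick way to spot API differences before running the cells below.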

## Terminology

*Context*: A context is the main object that manages the overall configuration and execution of the data expectations. It serves as a container for storing and organizing expectations, data sources, and validation results. The context allows us to define, execute, and manage our data expectations.

*Validator*: A validator is responsible for evaluating expectations on a given batch of data. Validators are used to validate data against a set of predefined expectations. They help to assess data quality, perform data validation, and monitor data pipelines.

*Suite*: An Expectation Suite is a collection of expectations that define the desired properties and characteristics of our data. It serves as a set of rules against which our data can be validated, and it can be applied to one or more batches of data.

*Batch*: A batch represents a subset of data that we want to evaluate against our expectations. It can be a collection of rows, a partitioned dataset, a file, a table, or any other logical grouping of data. Batches are the inputs to the validation process.

*Checkpoint*: A Checkpoint is a way to operationalize data validation using Expectation Suites. It allows us to define a pipeline-like flow for validating batches of data. It helps automate the validation process by defining the steps to be executed on data batches and tracking the results.
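
To make the relationship between these terms concrete, here is a toy model in plain Python. It is not the Great Expectations API: `ToyExpectation` and `validate` are hypothetical names, and the three-row batch is synthetic. The sketch only illustrates that a suite is a collection of rules, a batch is the data under test, and a validator applies the first to the second.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]
Batch = List[Row]  # a batch: the slice of data under test


@dataclass
class ToyExpectation:
    """One named rule that a batch either satisfies or violates."""
    name: str
    check: Callable[[Batch], bool]


def validate(batch, suite):
    """Toy validator: evaluate every expectation in the suite on the batch."""
    return {expectation.name: expectation.check(batch) for expectation in suite}


# A suite: the collection of rules the data should satisfy.
suite = [
    ToyExpectation("CNT is unique", lambda b: len({r["CNT"] for r in b}) == len(b)),
    ToyExpectation("Fire Alarm in {0, 1}", lambda b: all(r["Fire Alarm"] in (0, 1) for r in b)),
]

# Synthetic batch for illustration: the last row violates both rules.
batch = [
    {"CNT": 0, "Fire Alarm": 0},
    {"CNT": 1, "Fire Alarm": 1},
    {"CNT": 1, "Fire Alarm": 2},
]

results = validate(batch, suite)
print(results)

# A checkpoint, in this toy picture, is "run the suite on the batch and act on the outcome".
print("passed" if all(results.values()) else "failed")
```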

## Getting started

In [None]:
# Set up
import great_expectations as gx
from great_expectations.checkpoint import SimpleCheckpoint

# Create data context
context = gx.get_context(
    context_root_dir='/Users/jlutt/great_expectations'
)

# Connect to data
validator = context.sources.pandas_default.read_csv("Data/smoke_detection.csv")

# Extract column names
column_names = [str(column_name) for column_name in validator.columns()]
print(f"Columns: {', '.join(column_names)}.")
print(validator.head(n_rows=5, fetch_all=False))

# Create expectation suite
expectation_suite_name = "smoke_detection_suite"
suite = context.create_expectation_suite(
    expectation_suite_name=expectation_suite_name,
    overwrite_existing=True,
)

## Expectations
Now we use that data source for profiling, validation, and documentation. More information on expectations can be found here: https://legacy.docs.greatexpectations.io/en/latest/reference/glossary_of_expectations.html

In [None]:
# Use the Validator to create and run an Expectation
validator.expect_table_columns_to_match_ordered_list(column_names)

In [None]:
validator.expect_column_values_to_be_unique("UTC")

In [None]:
validator.expect_column_to_exist("Temperature[C]")

In [None]:
validator.expect_column_to_exist("Humidity[%]")

In [None]:
validator.expect_column_to_exist("TVOC[ppb]")

In [None]:
validator.expect_column_to_exist("eCO2[ppm]")

In [None]:
validator.expect_column_to_exist("Raw H2")

In [None]:
validator.expect_column_to_exist("Raw Ethanol")

In [None]:
validator.expect_column_to_exist("PM1.0")

In [None]:
validator.expect_column_to_exist("PM2.5")

In [None]:
validator.expect_column_to_exist("NC0.5")

In [None]:
validator.expect_column_to_exist("NC1.0")

In [None]:
validator.expect_column_to_exist("NC2.5")
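
The repeated `expect_column_to_exist` cells above could be collapsed into a loop. Below is a minimal sketch: `columns_not_found` is a hypothetical helper, and `StubValidator` is a stand-in so that the example runs without Great Expectations installed. It assumes the object returned by `expect_column_to_exist` exposes a boolean `success` attribute.

```python
from types import SimpleNamespace


def columns_not_found(validator, columns):
    """Run expect_column_to_exist for each name and return the names that failed."""
    failed = []
    for name in columns:
        result = validator.expect_column_to_exist(name)
        if not result.success:
            failed.append(name)
    return failed


class StubValidator:
    """Hypothetical stand-in for a validator; only mimics expect_column_to_exist."""

    def __init__(self, columns):
        self._columns = set(columns)

    def expect_column_to_exist(self, column):
        return SimpleNamespace(success=column in self._columns)


stub = StubValidator(["PM1.0", "PM2.5"])
print(columns_not_found(stub, ["PM1.0", "PM2.5", "NC0.5"]))  # ['NC0.5']
```

In this notebook the helper would be called with the real `validator` and the list of sensor column names in place of the stub.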

In [None]:
validator.expect_column_values_to_be_unique("CNT")

In [None]:
validator.expect_column_values_to_be_in_set("Fire Alarm",[0,1])

In [None]:
# Review and save our expectation suite
print(validator.get_expectation_suite(discard_failed_expectations=False))
validator.save_expectation_suite(discard_failed_expectations=False)

# Create checkpoint
checkpoint = SimpleCheckpoint(
    "smoke_detection_checkpoint",
    context,
    validator=validator,
)

# Run checkpoint to validate data
checkpoint_result = checkpoint.run()

# View results
context.build_data_docs()
validation_result_identifier = checkpoint_result.list_validation_result_identifiers()[0]
context.open_data_docs(resource_identifier=validation_result_identifier)
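
In an automated pipeline the result would usually be checked in code rather than by opening the Data Docs. The sketch below shows one way to stop a run when validation fails. `ValidationFailed` and `require_success` are hypothetical names, and the example assumes the checkpoint result exposes a boolean `success` attribute; a stub object is used so that the snippet runs on its own.

```python
from types import SimpleNamespace


class ValidationFailed(RuntimeError):
    """Raised when a checkpoint run reports that validation did not succeed."""


def require_success(checkpoint_result):
    """Return the result unchanged if validation succeeded, otherwise raise."""
    if not checkpoint_result.success:
        raise ValidationFailed("Data validation failed; see the Data Docs for details.")
    return checkpoint_result


# Stub results for illustration; in the notebook this would be checkpoint_result.
passing = SimpleNamespace(success=True)
failing = SimpleNamespace(success=False)

require_success(passing)  # returns silently
try:
    require_success(failing)
except ValidationFailed as error:
    print(f"Pipeline stopped: {error}")
```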
