<h1>Part 2 - Data Management</h1>

<font size="3">
The goal of this section is to learn more about data management tools, in particular Great Expectations. This package lets you assess data quality and set up alerting in your data projects.

We will keep using the TLC trip record data and dig into these datasets to catch data quality issues and encode our own set of rules & triggers.
</font>

## 1 - Getting familiar with the data

### 1-1. Downloading our data

To start with, let's download our data: we will use a larger dataset containing several details about January 2022 "for hire vehicles" trips in NYC (Uber, Lyft...). This dataset has been lightly modified for the purpose of our exercise. Let's download it & save it under our data folder. 

In [None]:
import gdown

gdown.download(
    "https://drive.google.com/uc?id=1xQ8heQzUkKehOUPYvrHIqQ_pDJNCH9tT",
    "../data/taxi-trips-2022-01.parquet",
    quiet=False,
)
gdown.download(
    "https://drive.google.com/uc?id=11kOFkDJIXSW2Hu0o2o-PWBhTJi0msYfH",
    "../data/taxi-trips-2022-02.parquet",
    quiet=False,
)

### 1-2. Loading our data

Our data is composed of several columns, the most interesting ones being:
- `hvfhs_license_num`: the Taxi & Limousine Commission (TLC) license number of the company operating the trip. Possible values are HV0002 (Juno), HV0003 (Uber), HV0004 (Via), HV0005 (Lyft).
- `request_datetime`, `on_scene_datetime`, `pickup_datetime`, `dropoff_datetime`: timestamps of the ride request, the driver's arrival, the pickup and the drop-off of the passenger(s).
- `PULocationID`, `DOLocationID`: where the trip began & ended. Those are `int` values.
- `trip_miles`, `trip_time`: trip distance in miles & total trip time in seconds.
- `base_passenger_fare`: base fare excluding tolls (`tolls`), tips (`tips`), taxes (`sales_tax`) and fees (`airport_fee`, `congestion_surcharge`, `bcf`).
- `driver_pay`: total driver pay (excluding tolls, tips, commission, taxes...)
- `shared_match_flag`: did the passenger share the vehicle with another passenger who booked separately? (Y/N)

Let's load it and print the first rows.

In [None]:
import pandas as pd

data = pd.read_parquet("../data/taxi-trips-2022-01.parquet")
data.head()

### 1-3. Exploring our data

Let's assume our goal will be to create an application able to predict the fare of a trip, from the pick-up & dropoff locations. We will eventually be using the following columns:
- `base_passenger_fare`: our target variable
- `hvfhs_license_num`: fare might depend on operating company
- `request_datetime`, `on_scene_datetime`, `pickup_datetime`, `dropoff_datetime`: fare might depend on congestion & time of pickup
- `PULocationID`, `DOLocationID`: fare will depend on pick up and drop off location
- `trip_miles`, `trip_time`: these fields might be useful to normalize training data

Let's first explore quality of these key fields. What can you see? Is data quality sufficient?
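
One possible starting point is a small summary of nulls and suspicious values for the numeric fields. The sketch below is only an illustration: `summarize_quality` is a hypothetical helper (not part of pandas or Great Expectations), and the toy frame contains made-up values so that the example runs without the downloaded files.

```python
import pandas as pd


def summarize_quality(df, numeric_columns):
    """Count nulls, negative values and zeros for each numeric column."""
    rows = []
    for col in numeric_columns:
        series = df[col]
        rows.append(
            {
                "column": col,
                "n_null": int(series.isna().sum()),
                "n_negative": int((series < 0).sum()),
                "n_zero": int((series == 0).sum()),
                "min": series.min(),
                "max": series.max(),
            }
        )
    return pd.DataFrame(rows).set_index("column")


# Made-up values, only to illustrate the helper.
toy = pd.DataFrame(
    {
        "base_passenger_fare": [12.5, -3.0, None, 40.0],
        "trip_miles": [1.2, 0.0, 3.4, 250.0],
        "trip_time": [600, 0, 900, 1200],
    }
)
summary = summarize_quality(toy, ["base_passenger_fare", "trip_miles", "trip_time"])
print(summary)
```

On the real dataset you would call `summarize_quality(data, ["base_passenger_fare", "trip_miles", "trip_time"])` and complement it with checks on the categorical and datetime columns.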

**Conclusion:** several data quality issues might have a detrimental impact on our model. Which ones?
- ?

We could simply discard these errors, but if we were to continuously retrain a fare forecasting algorithm, it might become biased or simply break on these data quality issues as soon as a new batch of training data arrives. Let's put some control layers in place!

## 2 - Installing Great Expectations

Great Expectations allows us to:
- define data quality rules in a language-agnostic format (as config files)
- run these data quality checks & rules on various types of data sources
- trigger actions & alerting whenever a rule breaks
- generate data quality reports easily from our set of rules

Your environment should already contain Great Expectations as a Python library. Otherwise, you can install it by following these instructions: https://docs.greatexpectations.io/docs/guides/setup/installation/local

## 3 - Getting to know Great Expectations

### 3-1. Connecting to our data

As we will see, Great Expectations works with a lot of configuration files (`.yml`, `.json`). This enables us to stay language & datasource agnostic, and to have our rules & checks documented as config rather than hard-coded.

The main entry point & best practice to manage 'rules' is to have a `great_expectations` folder where we store all our config.

Before implementing checks & triggers, we first need to tell Great Expectations how to connect to a dataset. This is usually done in the main config file: `great_expectations/great_expectations.yml`.

In [None]:
import yaml
from pprint import pprint

with open("../great_expectations/great_expectations.yml", "r") as stream:
    try:
        ge_config = yaml.safe_load(stream)
    except yaml.YAMLError as exc:
        print(exc)

pprint(ge_config["datasources"])

Part of the task is already done: this file tells Great Expectations where to find our dataset and how to read it (using pandas & its parquet read function).

### 3-2. Writing a first expectation
Data quality rules (or "expectations") can also be written in config files and are stored in the `great_expectations/expectations/` folder.
We have already written one expecting the base fare not to be negative.
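
For orientation, a suite file of this kind typically looks like the sketch below. It is an assumption about the general layout, not a copy of the file in this repository: the suite name is hypothetical, and the actual file may use a different expectation type or carry extra metadata. `expect_column_values_to_be_between` with only a `min_value` is one common way to express "not negative".

```python
import json

# Hypothetical suite, shown as a Python dict mirroring the JSON layout.
example_suite = {
    "expectation_suite_name": "example-suite",  # hypothetical name
    "expectations": [
        {
            # Which rule to apply...
            "expectation_type": "expect_column_values_to_be_between",
            # ...and its parameters: the column and the lower bound.
            "kwargs": {"column": "base_passenger_fare", "min_value": 0},
            "meta": {},
        }
    ],
    "meta": {},
}

print(json.dumps(example_suite, indent=2))
```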

In [None]:
import json

with open("../great_expectations/expectations/taxi-trips-expectations.json") as f:
    expectation = json.load(f)

pprint(expectation["expectations"])

### 3-3. Checking our data
Now that we can connect to our data and have defined a set of data quality rules, how do we apply these rules to our datasources? As you might expect, Great Expectations also uses configuration files to run data checks; they live in the `great_expectations/checkpoints/` folder. A checkpoint binds a datasource (more precisely, a data asset) to a suite of expectations.

In [None]:
with open("../great_expectations/checkpoints/taxi-trips-checkpoint.yml", "r") as stream:
    try:
        chkp_config = yaml.safe_load(stream)
    except yaml.YAMLError as exc:
        print(exc)

pprint(chkp_config)

Before running our checkpoint, let's introduce the `great_expectations.data_context`: this object scans your repository and stores all datasources, checkpoints & expectations you have defined. You can then handle them from your code.

In [None]:
import great_expectations as ge

# The data context scans the `great_expectations/` folder and exposes
# everything declared there: expectation suites, datasources, checkpoints.
context = ge.data_context.DataContext()
print(context.list_expectation_suite_names())
print([datasource["name"] for datasource in context.list_datasources()])
print(context.list_checkpoints())

You can now run a checkpoint:

In [None]:
context.run_checkpoint(checkpoint_name="taxi-trips-checkpoint")
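
The output of a checkpoint run can be long, so it often helps to reduce it to the list of rules that failed. The sketch below works on a plain dictionary shaped like a serialized validation result (a top-level `success` flag plus a `results` list). Both the helper `failed_expectations` and the sample dictionary are hypothetical; how you obtain such a dictionary from the object returned by `run_checkpoint` depends on your GE version (validation results usually offer a `to_json_dict()` method), so check this against your installation.

```python
def failed_expectations(validation_result):
    """Return a short description of every expectation that did not pass."""
    failures = []
    for item in validation_result.get("results", []):
        if not item.get("success", False):
            config = item.get("expectation_config", {})
            # Table-level expectations have no "column" kwarg.
            column = config.get("kwargs", {}).get("column", "<table>")
            failures.append(f"{config.get('expectation_type')} on {column}")
    return failures


# Hand-written sample, only to illustrate the expected shape.
sample_result = {
    "success": False,
    "results": [
        {
            "success": False,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_be_between",
                "kwargs": {"column": "base_passenger_fare", "min_value": 0},
            },
        },
        {
            "success": True,
            "expectation_config": {
                "expectation_type": "expect_column_values_to_not_be_null",
                "kwargs": {"column": "PULocationID"},
            },
        },
    ],
}

print(failed_expectations(sample_result))
```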

Note that GE allows you to export your results in a simple HTML format:

In [None]:
context.open_data_docs()

## 4 - More expectations & more data!

### 4-1. More expectations
Now use what you have learnt to create 2 or 3 more expectations for your data. You can look for ideas here: https://greatexpectations.io/expectations/
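
If you prefer to generate the JSON entries rather than type them by hand, a small helper can build them. This is a sketch under assumptions: `make_expectation` is a hypothetical helper, the three expectation types are taken from the public gallery, and the chosen columns and bounds are illustrative rather than the workshop's solution. The license values come from the column description above.

```python
import json


def make_expectation(expectation_type, column, **kwargs):
    """Build one entry for the "expectations" list of a suite file."""
    return {
        "expectation_type": expectation_type,
        "kwargs": {"column": column, **kwargs},
        "meta": {},
    }


extra_expectations = [
    # Only the four known operators should appear.
    make_expectation(
        "expect_column_values_to_be_in_set",
        "hvfhs_license_num",
        value_set=["HV0002", "HV0003", "HV0004", "HV0005"],
    ),
    # Every trip needs a pickup location.
    make_expectation("expect_column_values_to_not_be_null", "PULocationID"),
    # Distances cannot be negative (illustrative bound).
    make_expectation(
        "expect_column_values_to_be_between", "trip_miles", min_value=0
    ),
]

print(json.dumps(extra_expectations, indent=2))
```

The printed entries can then be pasted into the `expectations` list of your suite file.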

In [None]:
import json

with open(
    "../great_expectations/expectations/taxi-trips-expectations-solution.json"
) as f:
    expectation = json.load(f)

pprint(expectation["expectations"])

### 4-2. More data assets
Before running our checkpoint, change the `great_expectations.yml` file so that it also picks up the February data (`data/taxi-trips-2022-02.parquet`).

In [None]:
with open("../great_expectations/great_expectations-solution.yml", "r") as stream:
    try:
        ge_config = yaml.safe_load(stream)
    except yaml.YAMLError as exc:
        print(exc)

pprint(ge_config["datasources"])

### 4-3. Running our new expectations
Update your checkpoint file & run the expectations you have just created. 

## 5 - Wrapping up

In this short tutorial, you have seen how to configure a simple Great Expectations project & run a few data quality rules. The main takeaway is that GE lets you create expectations & run them entirely through configuration, keeping the details of connecting to data sources out of your code.

Other exercises you could work on:
- Connecting to a distant datasource (s3, BigQuery...)
- Writing your own expectation (not available in the gallery)
- Using great expectations actions to avoid deploying if data quality is not as expected