### 1-2. Loading our data

Our data is composed of several columns, the most interesting ones being:
- `hvfhs_license_num`: the Taxi & Limousine Commission (TLC) license number of the company operating the trip. Possible values are HV0002 (Juno), HV0003 (Uber), HV0004 (Via), HV0005 (Lyft).
- `request_datetime`, `on_scene_datetime`, `pickup_datetime`, `dropoff_datetime`: timestamps for when the ride was requested, when the driver arrived, and when the passenger(s) were picked up & dropped off.
- `PULocationID`, `DOLocationID`: where the trip began & ended. Those are `int` values.
- `trip_miles`, `trip_time`: distance of the passenger trip in miles & total trip time in seconds
- `base_passenger_fare`: base fare excluding tolls (`tolls`), tips (`tips`), taxes (`sales_tax`) and fees (`airport_fee`, `congestion_surcharge`, `bcf`).
- `driver_pay`: total driver pay (excluding tolls, tips, commission, taxes...)
- `shared_match_flag`: did the passenger share the vehicle with another passenger who booked separately? (Y/N)

Let's load it and print the first rows.

In [7]:
import pandas as pd

data = pd.read_csv("../../data/train_and_test2.csv")
data.head()

Unnamed: 0,Passengerid,Age,Fare,Sex,sibsp,zero,zero.1,zero.2,zero.3,zero.4,...,zero.12,zero.13,zero.14,Pclass,zero.15,zero.16,Embarked,zero.17,zero.18,2urvived
0,1,22.0,7.25,0,1,0,0,0,0,0,...,0,0,0,3,0,0,2.0,0,0,0
1,2,38.0,71.2833,1,1,0,0,0,0,0,...,0,0,0,1,0,0,0.0,0,0,1
2,3,26.0,7.925,1,0,0,0,0,0,0,...,0,0,0,3,0,0,2.0,0,0,1
3,4,35.0,53.1,1,1,0,0,0,0,0,...,0,0,0,1,0,0,2.0,0,0,1
4,5,35.0,8.05,0,0,0,0,0,0,0,...,0,0,0,3,0,0,2.0,0,0,0


### 1-3. Exploring our data

Let's assume our goal will be to create an application able to predict the fare of a trip, from the pick-up & dropoff locations. We will eventually be using the following columns:
- `base_passenger_fare`: our target variable
- `hvfhs_license_num`: fare might depend on operating company (HV0003 or HV0005)
- `request_datetime`, `on_scene_datetime`, `pickup_datetime`, `dropoff_datetime`: fare might depend on congestion & time of pickup
- `PULocationID`, `DOLocationID`: fare will depend on pick up and drop off location
- `trip_miles`, `trip_time`: these fields might be useful to normalize training data

Let's first explore the quality of these key fields. What can you see? Is the data quality sufficient?

In [8]:
# Distribution of the `Age` column
data["Age"].describe()

count    1309.000000
mean       29.503186
std        12.905241
min         0.170000
25%        22.000000
50%        28.000000
75%        35.000000
max        80.000000
Name: Age, dtype: float64

In [9]:
# Value counts of the `Fare` column
data["Fare"].value_counts()

8.0500     60
13.0000    59
7.7500     55
26.0000    50
7.8958     49
           ..
7.7417      1
8.1583      1
8.4583      1
7.8000      1
7.7208      1
Name: Fare, Length: 281, dtype: int64

In [10]:
# Distribution of the `2urvived` column
data["2urvived"].describe()

count    1309.000000
mean        0.261268
std         0.439494
min         0.000000
25%         0.000000
50%         0.000000
75%         1.000000
max         1.000000
Name: 2urvived, dtype: float64

**Conclusion:** several data quality issues might have a detrimental impact on our model.
- Our target is sometimes negative... which should not happen.
- A few trips are associated with an unknown company...
- A few trips have abnormally high miles recorded at the mileage counter

We could simply discard these errors, but if we were to continuously train a fare forecasting algorithm, it might become biased or simply break under these data quality issues as soon as a new batch of training data arrives. Let's put some control layers in place!
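To make the three issues above concrete, here is a minimal hand-written sketch of the kind of control we would like to automate. It is plain Python so that it runs without any dependency. The sample rows, the `MAX_PLAUSIBLE_MILES` threshold and the helper name `count_violations` are made up for illustration; only the column names and license numbers come from the data description in section 1-2.

```python
# Hypothetical sample rows: values are invented so that each rule is broken once.
trips = [
    {"hvfhs_license_num": "HV0003", "base_passenger_fare": 12.5, "trip_miles": 2.1},
    {"hvfhs_license_num": "HV0005", "base_passenger_fare": -3.0, "trip_miles": 4.0},
    {"hvfhs_license_num": "HV0099", "base_passenger_fare": 20.0, "trip_miles": 3.3},
    {"hvfhs_license_num": "HV0003", "base_passenger_fare": 8.75, "trip_miles": 950.0},
]

KNOWN_COMPANIES = {"HV0002", "HV0003", "HV0004", "HV0005"}
MAX_PLAUSIBLE_MILES = 300  # assumed threshold, to be tuned on real data

# Each rule returns True when a row violates it.
RULES = {
    "negative_fare": lambda t: t["base_passenger_fare"] < 0,
    "unknown_company": lambda t: t["hvfhs_license_num"] not in KNOWN_COMPANIES,
    "abnormal_miles": lambda t: t["trip_miles"] > MAX_PLAUSIBLE_MILES,
}


def count_violations(rows):
    """Return, for each rule, how many rows violate it."""
    return {
        name: sum(1 for row in rows if is_broken(row))
        for name, is_broken in RULES.items()
    }


issue_counts = count_violations(trips)
print(issue_counts)
```

Checks written this way live in code and have to be re-run, reported and maintained by hand, which is the gap a dedicated data quality tool is meant to fill.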

## 2 - Installing Great Expectations

Great Expectations allows us to:
- define data quality rules in a language-agnostic format (as config files)
- run these data quality checks & rules on various types of data sources
- trigger actions & alerting whenever a rule breaks
- generate data quality reports easily from our set of rules

Your environment should already contain Great Expectations as a Python library. Otherwise, follow the installation guide: https://docs.greatexpectations.io/docs/guides/setup/installation/local

In [11]:
%pip list | grep great-expectations

great-expectations            0.15.50
Note: you may need to restart the kernel to use updated packages.


## 3 - Getting to know Great Expectations

### 3-1. Connecting to our data

As we will see, Great Expectations works with a lot of configuration files (`.yml`, `.json`). This enables us to stay language- and datasource-agnostic, and to have our rules & checks documented as config rather than hard-coded.

The main entry point & best practice to manage 'rules' is to have a folder `gx` where we will store all our config.

Before we start implementing checks & triggers, we first need to tell Great Expectations how to connect to our dataset. This is usually best done in the main file: `gx/great_expectations.yml`.

In [29]:
import yaml
from pprint import pprint

with open("great_expectations/great-expectations.yml", "r") as stream:
    try:
        ge_config = yaml.safe_load(stream)
    except yaml.YAMLError as exc:
        print(exc)

pprint(ge_config["datasources"], indent=0)

{'titanic_trips': {'class_name': 'Datasource',
                 'data_connectors': {'parquet_data_connector': {'assets': {'titanic': {'pattern': 'train_and_test2.csv'}},
                                                              'base_directory': '../../../data_validation',
                                                              'batch_spec_passthrough': {'reader_method': 'read_csv',
                                                                                        'reader_options': {}},
                                                              'class_name': 'ConfiguredAssetFilesystemDataConnector',
                                                              'module_name': 'great_expectations.datasource.data_connector'}},
                 'execution_engine': {'class_name': 'PandasExecutionEngine',
                                     'module_name': 'great_expectations.execution_engine'},
                 'module_name': 'great_expectations.datasource'}}


Part of our task is already done: we have told Great Expectations where to find our dataset and how to read it (using the Pandas execution engine and the `read_csv` reader method).

### 3-2. Writing a first expectation
Data quality rules (or "expectations") can also be written in config files and are stored in the `gx/expectations/` folder.
We have already written one expecting the base fare not to be negative.

In [30]:
import json

# Open the expectation suite file (stored in `suite` so the `data` DataFrame loaded earlier is not overwritten)
with open("great_expectations/expectations/titanic-expectations.json", "r") as f:
    suite = json.load(f)

# Pretty print JSON data
print(json.dumps(suite, indent=4))

{
    "data_asset_type": null,
    "expectation_suite_name": "titanic-expectations",
    "expectations": [
        {
            "expectation_type": "expect_column_min_to_be_between",
            "kwargs": {
                "column": "Age",
                "min_value": 1
            },
            "meta": {
                "notes": {
                    "format": "markdown",
                    "content": "Target variable should not be negative as drivers should be paid a positive amount."
                }
            }
        }
    ],
    "ge_cloud_id": null,
    "meta": {
        "great_expectations_version": "0.15.46"
    }
}


You will have to define your own expectations afterwards. Feel free to [explore the documentation](https://docs.greatexpectations.io/docs/guides/expectations/how_to_create_and_edit_expectations_based_on_domain_knowledge_without_inspecting_data_directly) to understand the JSON definition of expectations.
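As an illustration of that JSON structure, the sketch below builds a small suite in Python and serializes it with the standard `json` module. The helper `make_expectation`, the note texts and the choice of rules are assumptions made for this example; the two expectation types (`expect_column_values_to_be_between`, `expect_column_values_to_be_in_set`) are taken from the Great Expectations gallery, and the column names come from the data description in section 1-2.

```python
import json


def make_expectation(expectation_type, notes, **kwargs):
    """Build one expectation entry with the same keys as the suite file above."""
    return {
        "expectation_type": expectation_type,
        "kwargs": kwargs,
        "meta": {"notes": {"format": "markdown", "content": notes}},
    }


# Hypothetical suite for the taxi trips data (names and rules are illustrative).
taxi_suite = {
    "data_asset_type": None,
    "expectation_suite_name": "taxi-trips-expectations",
    "expectations": [
        make_expectation(
            "expect_column_values_to_be_between",
            "Fares should never be negative.",
            column="base_passenger_fare",
            min_value=0,
        ),
        make_expectation(
            "expect_column_values_to_be_in_set",
            "Only known operating companies are expected.",
            column="hvfhs_license_num",
            value_set=["HV0002", "HV0003", "HV0004", "HV0005"],
        ),
    ],
    "ge_cloud_id": None,
    "meta": {"great_expectations_version": "0.15.50"},
}

# Serialize to the same layout as the file printed above.
taxi_suite_json = json.dumps(taxi_suite, indent=4)
print(taxi_suite_json)
```

Writing `taxi_suite_json` to a file under the expectations folder would be one way to produce a suite, although editing the JSON by hand works just as well for a handful of rules.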

### 3-3. Checking our data
Now that we can connect to our data and have defined a set of data quality rules, how do we apply these rules to our datasources? As you would expect, Great Expectations also uses configuration files to run data checks, found in the `great_expectations/checkpoints/` folder. A checkpoint binds a datasource (and in particular a data asset) to a suite of expectations.

In [31]:
with open("great_expectations/checkpoint/titanic-checkpoint.yml", "r") as stream:
    try:
        chkp_config = yaml.safe_load(stream)
    except yaml.YAMLError as exc:
        print(exc)

pprint(chkp_config)

{'class_name': 'SimpleCheckpoint',
 'config_version': 1.0,
 'name': 'titanic-checkpoint',
 'run_name_template': '%Y%m%d-%H%M%S-my-run-name-template',
 'validations': [{'batch_request': {'data_asset_name': 'titanic',
                                    'data_connector_name': 'parquet_data_connector',
                                    'datasource_name': 'titanic_trips'},
                  'expectation_suite_name': 'titanic-expectations'}]}


Again, take some time to [follow the documentation](https://docs.greatexpectations.io/docs/guides/validation/checkpoints/how_to_create_a_new_checkpoint/) to understand the content of this file.

Before running our checkpoint, let's introduce the `great_expectations.data_context`: this object scans your repository and stores all datasources, checkpoints & expectations you have defined. You can then handle them from your code.

In [33]:
import great_expectations as gx
import great_expectations.jupyter_ux
from great_expectations.datasource.types import BatchKwargs

import datetime

#context = gx.get_context()
context = gx.data_context.DataContext()
print(context.list_expectation_suite_names())
print([datasource["name"] for datasource in context.list_datasources()])
print(context.list_checkpoints())

2023-03-19T11:57:33+0000 - INFO - FileDataContext loading zep config
2023-03-19T11:57:33+0000 - INFO - GxConfig.parse_yaml() failed with errors - [{'loc': ('xdatasources',), 'msg': 'field required', 'type': 'value_error.missing'}]
2023-03-19T11:57:33+0000 - INFO - GxConfig.parse_yaml() returning empty `xdatasources`
2023-03-19T11:57:33+0000 - INFO - Loading 'datasources' ->
{}
2023-03-19T11:57:33+0000 - INFO - Loaded 'datasources' ->
{}
['titanic-expectations']
['titanic_trips']
['titanic-checkpoint']


In [34]:
context.list_checkpoints()

['titanic-checkpoint']

You can now run a checkpoint:

In [36]:
context.run_checkpoint(checkpoint_name="titanic-checkpoint")

InvalidBatchRequestError: Validator could not be created because BatchRequest returned an empty batch_list.
                Please check your parameters and try again.

Note that GE allows you to export your results in a simple HTML format:

In [None]:
context.open_data_docs()

## 4 - More expectations & more data!

### 4-1. More expectations
Now use what you have learnt to create 2 or 3 more expectations for your data. You can look for ideas here: https://greatexpectations.io/expectations/

TODO: Create your expectations in the `great_expectations/expectations/taxi-trips-expectations.json` file and, once it's done, run the code below to make sure they work.

In [None]:
import json

with open("great_expectations/expectations/taxi-trips-expectations.json") as f:
    expectation = json.load(f)

pprint(expectation["expectations"])

### 4-2. Running our new expectations
Update your checkpoint file & run the expectations you have just created. 
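One possible shape for the updated checkpoint is sketched below as a Python dictionary, mirroring the keys of the `titanic-checkpoint` printed in section 3-3. The helper `make_checkpoint` and the datasource, connector and asset names (`taxi_trips`, `taxi_data_connector`, `trips`) are placeholders invented for this example: replace them with the names declared in your own datasource config.

```python
from pprint import pprint


def make_checkpoint(name, datasource, connector, asset, suite):
    """Build a checkpoint config binding one data asset to one expectation suite."""
    return {
        "name": name,
        "config_version": 1.0,
        "class_name": "SimpleCheckpoint",
        "run_name_template": "%Y%m%d-%H%M%S-my-run-name-template",
        "validations": [
            {
                "batch_request": {
                    "datasource_name": datasource,
                    "data_connector_name": connector,
                    "data_asset_name": asset,
                },
                "expectation_suite_name": suite,
            }
        ],
    }


# Placeholder names: they must match what your datasource config declares.
taxi_checkpoint = make_checkpoint(
    name="taxi-trips-checkpoint",
    datasource="taxi_trips",
    connector="taxi_data_connector",
    asset="trips",
    suite="taxi-trips-expectations",
)
pprint(taxi_checkpoint)
```

Since checkpoint files are YAML, the dictionary could be written out with `yaml.safe_dump(taxi_checkpoint)` from the PyYAML package already imported in this notebook.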

In [None]:
context.run_checkpoint(checkpoint_name="taxi-trips-checkpoint")
context.open_data_docs()

## 5 - Wrapping up

In this short tutorial, you have seen how to configure a simple Great Expectations project & run a few data quality rules. The main takeaway is that GE allows you to create expectations & run them entirely with configuration, abstracting away the connection to the underlying data sources.

Other exercises you could work on:
- Connecting to a remote datasource (S3, BigQuery...)
- Writing your own expectation (not available in the gallery)
- Using great expectations actions to avoid deploying if data quality is not as expected