# EM-TEST Examples

This Jupyter notebook demonstrates how to apply and customize data schema validations on EM-DAT (Emergency Events Database) data stored in an xlsx file, using `emtest` together with the [`pandas`](https://pandas.pydata.org/) and [`pandera`](https://pandera.readthedocs.io/en/stable/) Python packages. The latter is an open-source framework for precision data testing.

In particular, we will:
1. Load a fake EM-DAT dataset with potential error entries included for the purpose of showcasing data validation techniques.
2. Apply a predefined schema to validate the dataframe, which covers key features in the dataset.
3. Use lazy validation to accumulate all schema errors and present them in a concise report.
4. Modify the existing data schema, for example by elevating warnings to errors to tighten the accuracy of our data.
5. Demonstrate how to apply specific schemas targeting specific disaster types, such as earthquakes.
6. Customize checks for specific columns in the dataset, reflecting the diversity and complexity of disaster data.

By the end of the notebook, you will understand how to combine pandas, emtest, and pandera to validate EM-DAT data and report potential issues.

## Validate the EM-DAT Data

We load the EM-DAT data into a [`pandas.DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) and use a [`pandera.DataFrameSchema`](https://pandera.readthedocs.io/en/stable/dataframe_schemas.html) to validate the dataframe. A validation schema is already implemented in `emtest`, so we only need to import it.

### Load a Fake EM-DAT Dataset

We load a fake EM-DAT dataset that contains 56 errors and 9 warnings and parse it into a `pandas.DataFrame` object. We use the "DisNo." column as the index and list the date fields that need to be parsed as dates.

In [1]:
%xmode Minimal
import pandas as pd
import pandera as pa

emdat = pd.read_excel(
    '../data/fake_emdat_test.xlsx',
    index_col='DisNo.',
    parse_dates=['Entry Date', 'Last Update']
)

Exception reporting mode: Minimal
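Before validating, it can help to confirm that the load options took effect. The sketch below is only an illustration: it reads two made-up rows from an in-memory CSV with `pd.read_csv` instead of the xlsx file, using the same `index_col` and `parse_dates` arguments. The "Historic" column is included only as an example of a non-date field.

```python
import io

import pandas as pd

# Two made-up rows standing in for the spreadsheet (assumption: CSV, not xlsx).
raw = io.StringIO(
    "DisNo.,Entry Date,Last Update,Historic\n"
    "1999-0001-XXX,2003-07-01,2023-09-25,No\n"
    "2001-0002-YYY,2005-02-14,2023-09-25,No\n"
)

sample = pd.read_csv(
    raw,
    index_col="DisNo.",
    parse_dates=["Entry Date", "Last Update"],
)

print(sample.index.name)  # name of the column used as the index
print(sample.dtypes)      # the two date columns should have a datetime dtype
```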


### Validate with `emdat_schema`

To validate a loaded EM-DAT dataframe, we import the predefined `emtest` schema, named `emdat_schema`, and pass the dataframe to its `validate` method.

In [2]:
from emtest import emdat_schema
emdat_schema.validate(emdat)



SchemaError: Column 'Historic' failed element-wise validator number 0: Invalid Historic value failure cases: wrong_historic

With this approach, we notice two warnings and a `SchemaError`.

The two warnings indicate that some country names and ISO codes are not in the current reference used by EM-DAT. They point to historical country names and codes in the dataframe that are no longer in use today, e.g., the Soviet Union or Yugoslavia. Users should be notified of historical countries that are missing from the current reference, but since these entries are expected, they should not raise errors.

The error message is:
```
SchemaError: Column 'Historic' failed element-wise validator number 0: isin(['Yes', 'No']) failure cases: wrong_historic
```
The validation error is raised as soon as one of the assumptions specified in the schema is violated, here for the first column, 'Historic'. This behavior is fine if the goal is only to confirm that the EM-DAT dataset passes all the tests, without needing a detailed list of every error. Users interested in seeing all errors raised during the `validate` call can rely on
[lazy validation](https://pandera.readthedocs.io/en/stable/lazy_validation.html).


### Lazy Validation & Error Reports

Lazy validation can be activated by setting the `lazy` keyword argument to `True` while calling the `validate` method. We can then catch all the `failure_cases` associated with `SchemaErrors`.

In [3]:
try:
    emdat_schema.validate(emdat, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(f"{len(exc.failure_cases)} failure cases:")
    print(exc.failure_cases)
    # use exc.failure_cases.to_csv() method to export



2665 failure cases:
       schema_context        column                       check  check_number  \
40              Index        DisNo.      Invalid DisNo. Pattern             0   
1332  DataFrameSchema      Location           Invalid magnitude             9   
1780  DataFrameSchema    Start Year           Invalid magnitude             9   
1772  DataFrameSchema    Start Year           Invalid magnitude             9   
1773  DataFrameSchema    Start Year           Invalid magnitude             9   
...               ...           ...                         ...           ...   
26             Column  No. Homeless  Invalid No. Homeless value             0   
25             Column  No. Affected  Invalid No. Affected value             0   
24             Column   No. Injured   Invalid No. Injured value             0   
23             Column  Total Deaths  Invalid Total Deaths value             0   
0              Column      Historic      Invalid Historic value             0   

       

For a less verbose alternative, one can use the `get_validation_report` function in `emtest` to catch the failure cases.

In [4]:
from emtest.utils import get_validation_report
get_validation_report(emdat, emdat_schema)



| Unnamed: 0 | schema_context | column | check | check_number | failure_case | index |
|---|---|---|---|---|---|---|
| 40 | Index | DisNo. | Invalid DisNo. Pattern | 0 | wrong_disno | 0 |
| 1652 | DataFrameSchema | Magnitude | Invalid magnitude | 9 | 0.0 | 2007-0092-LTU |
| 369 | DataFrameSchema | Magnitude | Invalid coldwave magnitude | 6 | 25.0 | 1986-0015-CHN |
| 407 | DataFrameSchema | Magnitude | Invalid earthquake magnitude | 7 | 2.0 | 2022-0231-BIH |
| 408 | DataFrameSchema | Magnitude | Invalid earthquake magnitude | 7 | 17.1 | 1976-0071-IDN |
| 120 | DataFrameSchema | Start Day | Missing start month value | 1 | 14.0 | 2015-0465-ITA |
| 154 | DataFrameSchema | End Day | Missing end month value | 2 | 22.0 | 2013-0306-CHN |
| 178 | DataFrameSchema | Start Year | Start date inconsistency at the year resolution | 3 | 1975 | 1974-0088-USA |
| 73 | DataFrameSchema | Latitude | Missing latitude or longitude coordinates | 0 | 60.0 | 2022-0698-MEX |
| 74 | DataFrameSchema | Longitude | Missing latitude or longitude coordinates | 0 | 60.0 | 2009-0033-AFG |


Overall, we captured the 56 errors in our test dataset. However, we may also want to inspect the warnings more closely and add them to the report. This feature is not directly implemented in `pandera`, but `emtest` provides two alternative ways to do it.

## Add Warnings to the Report

The basic `emtest` strategy for adding warnings to the error report is to redefine the warnings in the schema as errors, so that they are naturally included.
This can be done with the `set_warnings_to_errors` function.

In [5]:
from emtest.utils import set_warnings_to_errors
updated_schema = set_warnings_to_errors(emdat_schema)
try:
    updated_schema.validate(emdat, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(f"{len(exc.failure_cases)} failure cases:")
    print(exc.failure_cases)

2674 failure cases:
       schema_context      column                    check  check_number  \
49              Index      DisNo.   Invalid DisNo. Pattern             0   
1337  DataFrameSchema    Location        Invalid magnitude             9   
1787  DataFrameSchema  Start Year        Invalid magnitude             9   
1779  DataFrameSchema  Start Year        Invalid magnitude             9   
1780  DataFrameSchema  Start Year        Invalid magnitude             9   
...               ...         ...                      ...           ...   
31             Column     End Day       isin(range(1, 32))             0   
30             Column   End Month  Invalid End Month value             0   
29             Column    End Year   Invalid End Year value             0   
28             Column   Start Day  Invalid Start Day value             0   
0              Column    Historic   Invalid Historic value             0   

                      failure_case          index  
49             

Alternatively, one may pass the `add_warnings` keyword argument set to `True` to the `get_validation_report` function. The report then also displays the warnings, now treated as errors, associated with the "ISO" or "Country" columns.

In [6]:
get_validation_report(emdat, emdat_schema, add_warnings=True)

| Unnamed: 0 | schema_context | column | check | check_number | failure_case | index |
|---|---|---|---|---|---|---|
| 49 | Index | DisNo. | Invalid DisNo. Pattern | 0 | wrong_disno | 0 |
| 1661 | DataFrameSchema | Magnitude | Invalid magnitude | 9 | 0.0 | 2007-0092-LTU |
| 378 | DataFrameSchema | Magnitude | Invalid coldwave magnitude | 6 | 25.0 | 1986-0015-CHN |
| 416 | DataFrameSchema | Magnitude | Invalid earthquake magnitude | 7 | 2.0 | 2022-0231-BIH |
| 417 | DataFrameSchema | Magnitude | Invalid earthquake magnitude | 7 | 17.1 | 1976-0071-IDN |
| ... | ... | ... | ... | ... | ... | ... |
| 31 | Column | End Day | isin(range(1, 32)) | 0 | 12.5 | 1990-0630-IND |
| 30 | Column | End Month | Invalid End Month value | 0 | -2.0 | 1998-0261-AUS |
| 29 | Column | End Year | Invalid End Year value | 0 | 2100 | 1981-0106-LKA |
| 28 | Column | Start Day | Invalid Start Day value | 0 | 32.0 | 2002-0492-PNG |


## Customize Columns' Checks

The last section of this tutorial shows how type-specific schemas, or other schemas derived from the main `emdat_schema`, can be built without implementing an entirely new schema. For that purpose, the `update_column_checks` function makes it possible to redefine the checks of a single column of a schema. We just need to pass the `schema`, the column name `col_name`, and the new list of checks in `new_checks`.

The following example updates the schema to recreate a specific check for cold waves, validating that the cold wave magnitude is below 10°C.

In [7]:
from emtest.utils import update_column_checks
coldwave_schema = update_column_checks(
    schema=emdat_schema,
    col_name='Magnitude',
    new_checks=[
        pa.Check.less_than(
            max_value=10,  # degrees Celsius
            description="Test whether value is less than 10",
            name="check_coldwave_magnitude",
            error="Invalid coldwave magnitude 2"
        )
    ]
)

In our fake dataset, one cold wave event is recorded with a magnitude of 25°C; the new check now flags it.

In [8]:
get_validation_report(emdat[emdat['Disaster Subtype'] == 'Cold wave'], coldwave_schema)

| Unnamed: 0 | schema_context | column | check | check_number | failure_case | index |
|---|---|---|---|---|---|---|
| 15 | DataFrameSchema | Magnitude | Invalid coldwave magnitude | 6 | 25.0 | 1986-0015-CHN |
| 0 | Column | Magnitude | Invalid coldwave magnitude 2 | 0 | 25.0 | 1986-0015-CHN |
