# EM-TEST Examples

This Jupyter notebook demonstrates how to apply and customize data schema validations on EM-DAT (Emergency Events Database) xlsx data using `emtest` together with the [`pandas`]() and [`pandera`]() Python packages.

In particular, we will:
1. Load a fake EM-DAT dataset that deliberately includes erroneous entries to showcase data validation techniques.
2. Apply a predefined schema that covers the key columns of the dataset to validate the dataframe.
3. Use lazy validation to accumulate all schema errors and present them in a concise report.
4. Modify the existing data schema, for example by elevating warnings to errors to tighten validation.
5. Apply schemas targeting specific disaster types, such as earthquakes.
6. Customize checks for specific columns in the dataset, reflecting the diversity and complexity of disaster data.

By the end of the notebook, you will understand how to combine pandas with emtest and pandera to validate EM-DAT data and look for potential issues.

## Validate the EM-DAT Data

We load the EM-DAT data into a [`pandas.DataFrame`]() and use a pandera schema to validate it.

### Load a Fake EM-DAT Dataset

We load a fake EM-DAT dataset that contains 42 errors and 6 warnings and parse it into a `pandas.DataFrame` object. We use the "DisNo." column as the index and specify which date fields need to be parsed.

In [1]:
import pandas as pd
import pandera as pa

emdat = pd.read_excel(
    '../data/fake_emdat_test.xlsx',
    index_col='DisNo.',
    parse_dates=['Entry Date', 'Last Update']
)
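
If the xlsx file is not at hand, the effect of `index_col` and `parse_dates` can be tried on a small in-memory table. The sketch below uses `pd.read_csv`, which accepts the same two arguments. The rows and identifiers are made up for illustration and do not come from the fake dataset.

```python
import io

import pandas as pd

# Hypothetical two-row extract; identifiers and dates are invented.
csv_text = """DisNo.,Disaster Type,Entry Date,Last Update
2000-0001-AAA,Flood,2003-07-01,2023-09-25
2000-0002-BBB,Earthquake,2005-02-11,2023-09-25
"""

sample = pd.read_csv(
    io.StringIO(csv_text),
    index_col="DisNo.",                        # use the disaster number as index
    parse_dates=["Entry Date", "Last Update"]  # convert these columns to datetimes
)

print(sample.index.name)
print(sample.dtypes)
```

The date columns come back with a datetime dtype, so date-based checks can compare them directly instead of parsing strings.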

### Validate with `emdat_schema`

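To validate the dataframe, we call the `validate` method of the predefined `emdat_schema`. By default, validation stops at the first failing check and raises a `SchemaError`.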

In [2]:
from emtest import emdat_schema
emdat_schema.validate(emdat)



SchemaError: Column 'Historic' failed element-wise validator number 0: isin(['Yes', 'No']) failure cases: wrong_historic

### Lazy Validation & Error Reports

In [3]:
try:
    emdat_schema.validate(emdat, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(f"{len(exc.failure_cases)} failure cases:")
    print(exc.failure_cases)
    # use exc.failure_cases.to_csv() method to export

42 failure cases:
   schema_context                                     column  \
41          Index                                     DisNo.   
40          Index                                     DisNo.   
39          Index                                     DisNo.   
30         Column  Reconstruction Costs, Adjusted ('000 US$)   
23         Column                                    End Day   
24         Column                               Total Deaths   
25         Column                                No. Injured   
26         Column                               No. Affected   
27         Column                               No. Homeless   
28         Column                             Total Affected   
29         Column            Reconstruction Costs ('000 US$)   
32         Column        Insured Damage, Adjusted ('000 US$)   
31         Column                  Insured Damage ('000 US$)   
1          Column                         Classification Key   
33         Column     



In [6]:
from emtest.utils import get_validation_report
get_validation_report(emdat, emdat_schema)



Unnamed: 0,schema_context,column,check,check_number,failure_case,index
41,Index,DisNo.,Invalid DisNo. Pattern,0.0,wrong_disno,0
40,Index,DisNo.,field_uniqueness,,1927-0012-DZA,11
39,Index,DisNo.,field_uniqueness,,1927-0012-DZA,2
30,Column,"Reconstruction Costs, Adjusted ('000 US$)",greater_than(0.0),0.0,-10.0,2009-0122-CHN
23,Column,End Day,"isin(range(1, 32))",0.0,12.5,1990-0630-IND
24,Column,Total Deaths,greater_than(0.0),0.0,-10.0,2013-0259-CHN
25,Column,No. Injured,greater_than(0.0),0.0,-10.0,2004-0294-FJI
26,Column,No. Affected,greater_than(0.0),0.0,-10.0,1983-0440-PAK
27,Column,No. Homeless,greater_than(0.0),0.0,-10.0,2011-0368-IRN
28,Column,Total Affected,greater_than(0.0),0.0,-10.0,2013-0565-CHN


## Add Warnings to the Report

In [9]:
from emtest.utils import set_warnings_to_errors
updated_schema = set_warnings_to_errors(emdat_schema)
try:
    updated_schema.validate(emdat, lazy=True)
except pa.errors.SchemaErrors as exc:
    print(f"{len(exc.failure_cases)} failure cases:")
    print(exc.failure_cases)

48 failure cases:
   schema_context                                     column  \
47          Index                                     DisNo.   
46          Index                                     DisNo.   
45          Index                                     DisNo.   
35         Column            Reconstruction Costs ('000 US$)   
26         Column                                  Start Day   
27         Column                                   End Year   
28         Column                                  End Month   
29         Column                                    End Day   
30         Column                               Total Deaths   
31         Column                                No. Injured   
32         Column                               No. Affected   
33         Column                               No. Homeless   
34         Column                             Total Affected   
36         Column  Reconstruction Costs, Adjusted ('000 US$)   
1          Column     

In [10]:
get_validation_report(emdat, emdat_schema, add_warnings=True)

Unnamed: 0,schema_context,column,check,check_number,failure_case,index
47,Index,DisNo.,Invalid DisNo. Pattern,0.0,wrong_disno,0
46,Index,DisNo.,field_uniqueness,,1927-0012-DZA,11
45,Index,DisNo.,field_uniqueness,,1927-0012-DZA,2
35,Column,Reconstruction Costs ('000 US$),greater_than(0.0),0.0,-10.0,1990-0508-MEX
26,Column,Start Day,"isin(range(1, 32))",0.0,32.0,2002-0492-PNG
27,Column,End Year,"in_range(1900, 2024)",0.0,2100,1981-0106-LKA
28,Column,End Month,"isin(range(1, 13))",0.0,-2.0,1998-0261-AUS
29,Column,End Day,"isin(range(1, 32))",0.0,12.5,1990-0630-IND
30,Column,Total Deaths,greater_than(0.0),0.0,-10.0,2013-0259-CHN
31,Column,No. Injured,greater_than(0.0),0.0,-10.0,2004-0294-FJI


## Type-specific Schemas

In [13]:
from emtest import earthquake_schema

emdat_eq = emdat[emdat['Disaster Type'] == 'Earthquake']

get_validation_report(emdat_eq, earthquake_schema)

Unnamed: 0,schema_context,column,check,check_number,failure_case,index
3,Index,DisNo.,Invalid DisNo. Pattern,0,wrong_disno,0
0,Column,Magnitude,Invalid earthquake magnitude,0,5.7,1978-0057-ITA
1,Column,Magnitude,Invalid earthquake magnitude,0,5.6,2022-0231-BIH
2,Column,Insured Damage ('000 US$),greater_than(0.0),0,-10.0,1978-0057-ITA


## Customize Columns' Checks

In [12]:
from emtest.utils import update_column_checks
coldwave_schema = update_column_checks(
    schema=emdat_schema,
    col_name='Magnitude',
    new_checks=[
        pa.Check.less_than(
            max_value=0,
            description="Test whether value is below 0",
            name="check_coldwave_magnitude",
            error="Invalid coldwave magnitude"
        )
    ]
)

In [14]:
emdat_cw = emdat[emdat['Disaster Subtype'] == 'Cold wave']
get_validation_report(emdat_cw, coldwave_schema)

Unnamed: 0,schema_context,column,check,check_number,failure_case,index
0,Column,Magnitude,Invalid coldwave magnitude,0,10.0,1986-0015-CHN
