# Input data Excel validation. 
This notebook contains the steps and main logic to validate the input data from the user's excel.  
The input Excel is generated with custom pre-validations to ensure that it complies with the expected data, and is also subject to a geocoding validation. The validation shown here is focused on checking that the fields in the Excel are filled correctly (in terms of required fields and expected values), prior to data ingestion.

It is structured in two steps:  
1) By-column check: required fills, correct formats, etc...  
2) By-row check: correct/incorrect combinations of location information

In [None]:
!pip install pandera --user

In [1]:
import pandas as pd
import numpy as np
import pandera as pa
from pandera.typing import Series
import re

## Read data. 
- Read sheet from Google sheets  
- Get correct field names, all in lowercase, replace spacing and symbols
- Remove rows with usage notes

In [2]:
sheet_url = "https://docs.google.com/spreadsheets/d/16sQlhPXGaFpDPi_QWDsUCTZVJQMl9C8z6_KFJoBUR1Y/edit#gid=0"
url = sheet_url.replace('/edit#gid=', '/export?format=csv&gid=')
df = pd.read_csv(url)

df.columns = df.iloc[2].str.lower().str.strip().str.replace(' ', '_')
df = df.rename(columns = lambda x: re.sub('_\(º[en]\)', '', x))
df = df.drop([0, 1, 2])

df.head()

2,material,business_unit,tier_1_supplier,producer,location_type,country,address,latitude,longitude,2010_tons,...,2015_tons,2016_tons,2017_tons,2018_tons,2019_tons,2020_tons,....1,data_source,comments,....2
3,10.05 Maize (corn),Accessories,Cargill,Moll,Unknown,Lebanon,,,,2400,...,2522,2547,2572,2598,2624,2650,,,aoeuijj,
4,10.05 Maize (corn),Accessories,,Moll,Unknown,Malaysia,,,,1300,...,1366,1380,1394,1408,1422,1436,,,,
5,"09.01 Coffee, whether or not roasted or decaff...",Accessories,,Moll,Unknown,United States of America,,,,1000,...,1050,1061,1072,1083,1094,1105,,,,
6,"08.03 Bananas, including plantains; fresh or d...",Accessories,,Moll,Unknown,Japan,,,,730,...,767,775,783,791,799,807,,,,
7,40 Rubber and articles thereof,Accessories,,Moll,Unknown,India,,,,490,...,515,520,525,530,535,540,,,,


## Validate fields

Create schema to validate the main fields:  
- material, bussiness_unit, location_type, country and tonnage are **required** and must be filled  
- tonnage is an integer (or coerce into it) greater than 0 
- latitude and longitude as float and within especific ranges (lat: -90 to 90; long: -180 to 180)

In [8]:
class data_validation(pa.SchemaModel):
    material: Series[str] = pa.Field(str_matches= "[A-Za-z]*", allow_duplicates=True, nullable=False)
    business_unit: Series[str] = pa.Field(str_matches= "[A-Za-z]*", allow_duplicates=True, nullable=False)
    location_type: Series[str] = pa.Field(str_matches= "[A-Za-z]*", allow_duplicates=True, nullable=False)
    country: Series[str] = pa.Field(str_matches= "[A-Za-z]*", allow_duplicates=True, nullable=False)
    tons: Series[int] = pa.Field(alias ='(.*_tons)', nullable=False, allow_duplicates=True, regex=True, coerce=True, in_range={"min_value": 0, "max_value": np.iinfo(np.int32).max})
    latitude: Series[float] = pa.Field(nullable=True, allow_duplicates=True, coerce=True, in_range={"min_value": -90, "max_value": 90})
    longitude: Series[float] = pa.Field(nullable=True, allow_duplicates=True, coerce=True, in_range={"min_value": -180, "max_value": 180})

        
        
    

In [9]:
data_validation.validate(df).head()

2,material,business_unit,tier_1_supplier,producer,location_type,country,address,latitude,longitude,2010_tons,...,2015_tons,2016_tons,2017_tons,2018_tons,2019_tons,2020_tons,....1,data_source,comments,....2
3,10.05 Maize (corn),Accessories,Cargill,Moll,Unknown,Lebanon,,,,2400,...,2522,2547,2572,2598,2624,2650,,,aoeuijj,
4,10.05 Maize (corn),Accessories,,Moll,Unknown,Malaysia,,,,1300,...,1366,1380,1394,1408,1422,1436,,,,
5,"09.01 Coffee, whether or not roasted or decaff...",Accessories,,Moll,Unknown,United States of America,,,,1000,...,1050,1061,1072,1083,1094,1105,,,,
6,"08.03 Bananas, including plantains; fresh or d...",Accessories,,Moll,Unknown,Japan,,,,730,...,767,775,783,791,799,807,,,,
7,40 Rubber and articles thereof,Accessories,,Moll,Unknown,India,,,,490,...,515,520,525,530,535,540,,,,


## Locations validation. 

Check if location data is in the correct format, according to the following logic:  
(note: Country info and coordinates are validated in the previous step)  

- if location_type is Unknown, it should not contain info on address and coordinates (only country)
- if location_type is Country of production or Origin country, also no info on address and coordinates
- if location type is Point of production, Aggregation point or Origin suplier facility, it MUST contain address or coordinates

The result is a log of the outcomes (correct or type of error) for each entry.

In [14]:
def location_validation(df):
    for l in range(len(df)):
        if 'country' in df.iloc[l]['location_type'].lower():
            if not pd.isna(df.iloc[l]['address']) or not pd.isna(df.iloc[l]['latitude']) or not pd.isna(df.iloc[l]['longitude']):
                print(f'Location entry {l+1}: WARNING location type can be updated')
            else:
                e=0 
            
        if 'unknown' in df.iloc[l]['location_type'].lower():
            if not pd.isna(df.iloc[l]['address']) or not pd.isna(df.iloc[l]['latitude']) or not pd.isna(df.iloc[l]['longitude']):
                print(f'Location entry {l+1}: WARNING location type can be updated')
            else:
                e=0            
    
        if 'point' in df.iloc[l]['location_type'].lower():
            if pd.isna(df.iloc[l]['address']):
                if pd.isna(df.iloc[l]['latitude']) or pd.isna(df.iloc[l]['longitude']):
                    print(f'LOCATION ERROR ON ENTRY {l+1}: address or latitude/longitude REQUIRED')
                else:
                    e=0
            else:
                e=0           
        if 'facility' in df.iloc[l]['location_type'].lower():
            if pd.isna(df.iloc[l]['address']):
                if pd.isna(df.iloc[l]['latitude']) or pd.isna(df.iloc[l]['longitude']):
                    print(f'LOCATION ERROR ON ENTRY {l+1}: address or latitude/longitude REQUIRED')  
                else:
                    e=0
            else:
                e=0   
                
        if e == 0:
            print(f'Location entry {l+1}: OK') 

In [15]:
location_validation(df)

Location entry 1: OK
Location entry 2: OK
Location entry 3: OK
Location entry 4: OK
Location entry 5: OK
Location entry 6: OK
Location entry 7: OK
Location entry 8: OK
Location entry 9: OK
Location entry 10: OK
Location entry 11: OK
Location entry 12: OK
Location entry 13: OK
Location entry 14: OK
Location entry 15: OK
Location entry 16: OK
Location entry 17: OK
Location entry 18: OK
Location entry 19: OK
Location entry 20: OK
Location entry 21: OK
Location entry 22: OK
Location entry 23: OK
Location entry 24: OK
Location entry 25: OK
Location entry 26: OK
Location entry 27: OK
Location entry 28: OK
Location entry 29: OK
Location entry 30: OK
Location entry 31: OK
Location entry 32: OK
Location entry 33: OK
Location entry 34: OK
Location entry 35: OK
Location entry 36: OK
Location entry 37: OK
Location entry 38: OK
Location entry 39: OK
Location entry 40: OK
Location entry 41: OK
Location entry 42: OK
Location entry 43: OK
Location entry 44: OK
Location entry 45: OK
Location entry 46: 

## Change some location data to check error detection. 

In [12]:
df_invalid = df.copy()

In [16]:
df_invalid.iloc[1]['address'] = 'Fake street'
df_invalid.iloc[14]['address'] = np.nan
df_invalid.iloc[20]['latitude'] = np.nan
#df_invalid.head(21)

location_validation(df_invalid)

Location entry 1: OK
Location entry 2: OK
Location entry 3: OK
Location entry 4: OK
Location entry 5: OK
Location entry 6: OK
Location entry 7: OK
Location entry 8: OK
Location entry 9: OK
Location entry 10: OK
Location entry 11: OK
Location entry 12: OK
Location entry 13: OK
Location entry 14: OK
LOCATION ERROR ON ENTRY 15: address or latitude/longitude REQUIRED
Location entry 15: OK
Location entry 16: OK
Location entry 17: OK
Location entry 18: OK
Location entry 19: OK
Location entry 20: OK
LOCATION ERROR ON ENTRY 21: address or latitude/longitude REQUIRED
Location entry 21: OK
Location entry 22: OK
Location entry 23: OK
Location entry 24: OK
Location entry 25: OK
Location entry 26: OK
Location entry 27: OK
Location entry 28: OK
Location entry 29: OK
Location entry 30: OK
Location entry 31: OK
Location entry 32: OK
Location entry 33: OK
Location entry 34: OK
Location entry 35: OK
Location entry 36: OK
Location entry 37: OK
Location entry 38: OK
Location entry 39: OK
Location entry 40