# Converting institution data to AEZs and cleaning the data

The validation data is originally labelled by institution. This notebook clips the institutionally labelled data to the agro-ecological zones (AEZs) of Africa. After the spatial clip, the data is cleaned: duplicates are removed, as are observations that were cloudy, returned an 'NA' value from the datacube query, or were mislabelled.

**Input data** : `<INSTITUTION_NAME>_wofs_ls_valid.csv`

**Output data** : `<AEZ>_wofs_ls_validation_points.csv`

Last modified: 13/02/2023

In [1]:
import numpy as np
import xarray as xr
import pandas as pd
import geopandas as gpd
from shapely.geometry import Polygon



In [2]:
path = '../02_Validation_results/WOfS_Assessment/wofs_ls/Institutions/'

AGRYHMET = pd.read_csv(path+'AGRHYMET_wofs_ls_valid.csv')
RCMRD = pd.read_csv(path+'RCMRD_wofs_ls_valid.csv')
OSS = pd.read_csv(path+'OSS_wofs_ls_valid.csv')
AFRIGIST = pd.read_csv(path+'AFRIGIST_wofs_ls_valid.csv')

In [3]:
# concatenate the institution data into a big table and check total length
combined = pd.concat([AGRYHMET, RCMRD, OSS, AFRIGIST]).reset_index(drop=True).drop('Unnamed: 0', axis=1)
print('Total Number of points:', len(combined))

Total Number of points: 40131
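
The `Unnamed: 0` column dropped in the cell above is what pandas produces when a CSV that was saved with its row index is read back in. That this is how the institution files were written is an assumption, since the files themselves are not shown. A minimal sketch with made-up in-memory tables (`a`, `b` and `roundtrip` are hypothetical names used only for illustration):

```python
import io
import pandas as pd

# Two tiny, made-up tables standing in for the institution CSVs.
a = pd.DataFrame({"PLOT_ID": [1, 2], "LAT": [0.5, 1.5], "LON": [30.0, 31.0]})
b = pd.DataFrame({"PLOT_ID": [3], "LAT": [-2.0], "LON": [28.0]})

def roundtrip(df):
    """Write df to CSV text with the default index=True, then read it back."""
    return pd.read_csv(io.StringIO(df.to_csv()))

frames = [roundtrip(a), roundtrip(b)]
# The saved row index comes back as an extra column called 'Unnamed: 0'.
print(frames[0].columns.tolist())

# Same pattern as the cell above: stack, renumber rows, drop the stale index column.
combined_demo = (
    pd.concat(frames)
    .reset_index(drop=True)
    .drop(columns="Unnamed: 0")
)
print(combined_demo)
```

Reading with `pd.read_csv(..., index_col=0)` or writing with `to_csv(..., index=False)` would avoid the extra column in the first place.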


Now that the institution tables are combined into one, we convert the result to a geopandas `GeoDataFrame` so that the points can be compared with the AEZ shapefiles.

In [4]:
# make a geopandas object
geo_combined = gpd.GeoDataFrame(combined, geometry=gpd.points_from_xy(combined.LON, combined.LAT), crs='EPSG:4326')


## Clip points to AEZs

In [5]:
path = '../02_Validation_data/AEZ_shapefiles/'
east = gpd.read_file(path+'Eastern.shp')
west = gpd.read_file(path+'Western.shp')
north = gpd.read_file(path+'Northern.shp')
south = gpd.read_file(path+'Southern.shp')
sahel = gpd.read_file(path+'Sahel.shp')
central = gpd.read_file(path+'Central.shp')
io = gpd.read_file(path+'Indian_ocean.shp')

shapes = [east,west,north,south,sahel,central,io]
aezs= ['Eastern', 'Western', 'Northern', 'Southern', 'Sahel', 'Central', 'Indian_ocean']
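
The clipping in the loop that follows uses `gpd.overlay(..., how='intersection')` on a point layer and a polygon layer. As a rough illustration of what that call keeps, here is a sketch on made-up data: the three points and the unit-square `region` are hypothetical and stand in for the validation points and an AEZ shapefile.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import box

# Made-up points; LON is the x coordinate and LAT is the y coordinate.
pts = pd.DataFrame({
    "PLOT_ID": [1, 2, 3],
    "LON": [0.5, 1.5, 0.2],
    "LAT": [0.5, 0.5, 0.9],
})
geo_pts = gpd.GeoDataFrame(
    pts, geometry=gpd.points_from_xy(pts.LON, pts.LAT), crs="EPSG:4326"
)

# Hypothetical one-polygon "region": the unit square from (0, 0) to (1, 1).
region = gpd.GeoDataFrame(
    {"REGION": ["demo"]}, geometry=[box(0, 0, 1, 1)], crs="EPSG:4326"
)

# Keep only the points inside the region; columns from both layers are carried over.
clipped = gpd.overlay(geo_pts, region, how="intersection")
print(sorted(clipped.PLOT_ID.tolist()))
print(clipped.columns.tolist())
```

Point 2 lies outside the square and is dropped, while the surviving rows gain the polygon layer's `REGION` attribute. Both layers should share the same CRS before they are overlaid.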

### Loop through AEZs and clip points to region, clean data

In [6]:
# running tallies across all AEZs
total_before = []
total_after = []
total_sites_before = []
total_sites_after = []
for s, a in zip(shapes, aezs):
    
    # rename the label column (the rename is idempotent, so repeating it on every pass is harmless)
    geo_combined = geo_combined.rename(columns={"WATERFLAG": "ACTUAL"})
    # clip data to AEZ boundary
    gdf = gpd.overlay(geo_combined, s, how='intersection')
    
    #tally number of samples
    raw_n = len(gdf)
    print(a,'Before cleaning:', raw_n)
    total_before.append(raw_n)
    #tally number of unique locations
    total_sites_before.append(len((gdf.PLOT_ID).unique()))
        
    # derive the prediction from CLASS_WET: "1" when CLASS_WET >= 1, otherwise "0" (stored as strings)
    gdf["PREDICTION"] = gdf["CLASS_WET"].apply(lambda x: "1" if x >= 1 else "0")
    
    # Remove duplicated entries: rows sharing a PLOT_ID and MONTH, i.e. plots labelled more than once (0, 1, 2 or 3) for the same month.
    # find all duplicated entries
    dup = gdf[gdf.duplicated(["PLOT_ID", "MONTH"], keep=False)]
    while len(dup)>0:
        # select the first row of each (PLOT_ID, MONTH, ACTUAL) combination; entries that repeat the same ACTUAL value are still valid, so one copy of them survives the loop
        dup = dup.drop_duplicates(["PLOT_ID", "MONTH", "ACTUAL"])
        # drop the selected rows, then look for remaining duplicates
        gdf = gdf.drop(dup.index)
        dup = gdf[gdf.duplicated(["PLOT_ID", "MONTH"], keep=False)]
    
    # check for duplicates again
    if len(gdf[gdf.duplicated(["PLOT_ID", "MONTH"], keep=False)])>0:
        print("dups?")
        break
    
    #tally number of duplicates removed
    n_dup = raw_n - len(gdf)
    print('   No. of duplicates:' , n_dup)
    
    # Filter out rows labelled greater than 1, or with no clear WOfS/SCL observations
    not_clear = gdf[(gdf["ACTUAL"] > 1) | (gdf["CLEAR_OBS"] == 0.0)].index
    gdf = gdf.drop(not_clear)
    
    # drop rows where CLEAR_OBS is NaN
    nans = gdf[gdf["CLEAR_OBS"].isna()].index
    gdf = gdf.drop(nans)
    
    #tally number of cloudy obs
    print('   No. of cloudy obs:' , len(not_clear))
    print('   No. of NaNs:' , len(nans))
    
    #tally samples left after cleaning
    print('  ',a,'After cleaning:', len(gdf))
    print('\n')
    total_after.append(len(gdf))
    
    #tally number of unique validation locations
    print('   No. of unique locations:' , len((gdf.PLOT_ID).unique()))
    print('\n')
    total_sites_after.append(len((gdf.PLOT_ID).unique()))

    # save out to file for the accuracy assessments
    gdf.to_csv('../02_Validation_results/WOfS_Assessment/wofs_ls/'+a+'_wofs_ls_validation_points.csv')

Eastern Before cleaning: 6093
   No. of duplicates: 243
   No. of cloudy obs: 3248
   No. of NaNs: 28
   Eastern After cleaning: 2574


   No. of unique locations: 461


Western Before cleaning: 8038
   No. of duplicates: 4003
   No. of cloudy obs: 2894
   No. of NaNs: 28
   Western After cleaning: 1113


   No. of unique locations: 402


Northern Before cleaning: 3597
   No. of duplicates: 4
   No. of cloudy obs: 2409
   No. of NaNs: 4
   Northern After cleaning: 1180


   No. of unique locations: 229


Southern Before cleaning: 7216
   No. of duplicates: 2368
   No. of cloudy obs: 2653
   No. of NaNs: 113
   Southern After cleaning: 2082


   No. of unique locations: 410


Sahel Before cleaning: 3577
   No. of duplicates: 18
   No. of cloudy obs: 2314
   No. of NaNs: 9
   Sahel After cleaning: 1236


   No. of unique locations: 255


Central Before cleaning: 7937
   No. of duplicates: 3732
   No. of cloudy obs: 3068
   No. of NaNs: 41
   Central After cleaning: 1096


   No. of uniqu
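
The duplicate handling and the two filters in the loop above can be hard to follow from the tallies alone. The sketch below replays the same steps on a small made-up table using plain pandas. The `demo` values and the helper name `drop_conflicting_duplicates` are hypothetical; only the column names come from the notebook.

```python
import numpy as np
import pandas as pd

def drop_conflicting_duplicates(df):
    """Replay the notebook's while-loop on a plain DataFrame."""
    keys = ["PLOT_ID", "MONTH"]
    dup = df[df.duplicated(keys, keep=False)]
    while len(dup) > 0:
        # First row of every (PLOT_ID, MONTH, ACTUAL) combination among the duplicates.
        first_rows = dup.drop_duplicates(keys + ["ACTUAL"])
        df = df.drop(first_rows.index)
        dup = df[df.duplicated(keys, keep=False)]
    return df

demo = pd.DataFrame({
    "PLOT_ID":   [1, 1, 2, 2, 3, 4, 5],
    "MONTH":     [1, 1, 1, 1, 1, 1, 1],
    "ACTUAL":    [1, 1, 0, 1, 0, 3, 1],
    "CLEAR_OBS": [5.0, 5.0, 4.0, 4.0, 0.0, 6.0, np.nan],
})

deduped = drop_conflicting_duplicates(demo)

# Same filters as the loop: labels above 1, or zero clear observations, are discarded.
not_clear = deduped[(deduped["ACTUAL"] > 1) | (deduped["CLEAR_OBS"] == 0.0)].index
cleaned = deduped.drop(not_clear)
# Rows with a missing CLEAR_OBS pass the comparison above, so they are dropped separately.
nans = cleaned[cleaned["CLEAR_OBS"].isna()].index
cleaned = cleaned.drop(nans)

print(deduped.index.tolist())
print(cleaned.index.tolist())
```

In this toy table, plot 1 has two rows with the same label and keeps one of them, while plot 2 has two rows with different labels and loses both. Plot 3 (no clear observations), plot 4 (label above 1) and plot 5 (missing `CLEAR_OBS`) are then removed by the filters, leaving a single row.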

### Sum of data points before and after cleaning

In [7]:
print('Total samples before cleaning: ', sum(total_before))
print('Total samples after cleaning: ', sum(total_after))
print('Total sample sites before cleaning: ', sum(total_sites_before))
print('Total sample sites after cleaning: ', sum(total_sites_after))

Total samples before cleaning:  40131
Total samples after cleaning:  11363
Total sample sites before cleaning:  2900
Total sample sites after cleaning:  2377
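
To put the totals in proportion, the short sketch below computes the share of samples and sites that survive cleaning. The four numbers are copied from the printed output above; the variable names are only illustrative.

```python
# Totals copied from the printed output above.
samples_before, samples_after = 40131, 11363
sites_before, sites_after = 2900, 2377

sample_retention = samples_after / samples_before
site_retention = sites_after / sites_before

print(f"Samples retained: {sample_retention:.1%}")
print(f"Sites retained:   {site_retention:.1%}")
```

Roughly 28% of the samples remain after cleaning, compared with about 82% of the unique sites, so most locations keep at least one usable observation.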
