# Converting Institution data to AEZs and Data cleaning

The validation data is originally labelled by institution. This notebook clips the institutionally labelled data to the agro-ecological zones (AEZs) of Africa. After the spatial clip, the data is cleaned: duplicates are removed, as are observations that were cloudy, returned an 'NA' value from the datacube query, or were mislabelled.

**Input data**: `<INSTITUTION_NAME>_wofs_ls_valid.csv`

**Output data**: `<AEZ>_wofs_ls_validation_points.csv`

Last modified: 03/02/2022

In [1]:
import numpy as np
import xarray as xr
import pandas as pd
import geopandas as gpd
from shapely.geometry import Polygon

In [2]:
# read in the institution files, preferably the ones that have had columns dropped already in the processing step
AGRYHMET = pd.read_csv('../02_Validation_results/WOfS_Assessment/wofs_ls/Institutions/AGRYHMET_wofs_ls_valid.csv')
RCMRD = pd.read_csv('../02_Validation_results/WOfS_Assessment/wofs_ls/Institutions/RCMRD_wofs_ls_valid.csv')
OSS = pd.read_csv('../02_Validation_results/WOfS_Assessment/wofs_ls/Institutions/OSS_wofs_ls_valid.csv')
AFRIGIST = pd.read_csv('../02_Validation_results/WOfS_Assessment/wofs_ls/Institutions/AFRIGIST_wofs_ls_valid.csv')

In [3]:
# concatenate the institution data into a big table and check total length
combined = pd.concat([AGRYHMET, RCMRD, OSS, AFRIGIST]).reset_index(drop=True).drop('Unnamed: 0', axis=1)
print('Total Number of points:', len(combined))

Total Number of points: 40131
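
As a minimal sketch of what the concatenation step does, the toy example below stacks two small tables and tidies the result. The frames `inst_a` and `inst_b` and their values are hypothetical stand-ins for the institution CSVs; an `Unnamed: 0` column typically appears when a CSV was written with its index and then read back without `index_col`.

```python
import pandas as pd

# Hypothetical stand-ins for two institution tables (the real CSVs have more columns)
inst_a = pd.DataFrame({"Unnamed: 0": [0, 1], "LAT": [1.0, 2.0], "LON": [30.0, 31.0]})
inst_b = pd.DataFrame({"Unnamed: 0": [0], "LAT": [3.0], "LON": [32.0]})

toy_combined = (
    pd.concat([inst_a, inst_b])
    .reset_index(drop=True)      # replace the repeated 0, 1, 0 index with 0, 1, 2
    .drop(columns="Unnamed: 0")  # discard the leftover index column from the CSVs
)

print(list(toy_combined.index), list(toy_combined.columns))
```

Resetting the index matters here because later steps drop rows by index label, and repeated labels would remove more rows than intended.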


Now that the institution tables are combined into one, we need to convert it into a GeoDataFrame so that it can be intersected with the AEZ shapefiles.

In [5]:
# make a geopandas object
geo_combined = gpd.GeoDataFrame(combined, geometry=gpd.points_from_xy(combined.LON, combined.LAT), crs='EPSG:4326')


## Clip points to AEZs

In [6]:
east = gpd.read_file('../02_Validation_data/AEZ_shapefiles/Eastern.shp')
west = gpd.read_file('../02_Validation_data/AEZ_shapefiles/Western.shp')
north = gpd.read_file('../02_Validation_data/AEZ_shapefiles/Northern.shp')
south = gpd.read_file('../02_Validation_data/AEZ_shapefiles/Southern.shp')
sahel = gpd.read_file('../02_Validation_data/AEZ_shapefiles/Sahel.shp')
central = gpd.read_file('../02_Validation_data/AEZ_shapefiles/Central.shp')
io = gpd.read_file('../02_Validation_data/AEZ_shapefiles/Indian_ocean.shp')

shapes = [east,west,north,south,sahel,central,io]
aezs= ['Eastern', 'Western', 'Northern', 'Southern', 'Sahel', 'Central', 'Indian_ocean']

### Loop through AEZs and clip points to region

In [7]:
# rename the reference label column once, before looping
geo_combined = geo_combined.rename(columns={"WATERFLAG": "ACTUAL"})

total_before=[]
total_after=[]
for s, a in zip(shapes, aezs):
    
    #clip data to AEZ boundary
    gdf = gpd.overlay(geo_combined, s, how='intersection')
    
    #tally number of samples
    raw_n = len(gdf)
    print(a,'Before cleaning:', raw_n)
    total_before.append(raw_n)
    
    # derive a binary PREDICTION label from CLASS_WET: "1" (wet) if CLASS_WET >= 1, otherwise "0"
    gdf["PREDICTION"] = gdf["CLASS_WET"].apply(lambda x: "1" if x >= 1 else "0")
    
    # Remove points labelled more than once for the same month (e.g. as 0, 1, 2 or 3); keep=False drops every copy
    gdf = gdf.drop_duplicates(["LAT", "LON", "MONTH"], keep=False)
    
    #tally number of duplicates removed
    dup = raw_n - len(gdf)
    print('   n duplicates:' , dup)
    
    # Filter out rows whose ACTUAL label is greater than 1, or that have no clear WOfS/SCL observations
    indexNames = gdf[
        (gdf["ACTUAL"] > 1) | (gdf["CLEAR_OBS"] == 0.0) | (gdf["CLEAR_OBS"].isna())
    ].index
    gdf = gdf.drop(indexNames)
    
    #tally rows removed by the filter above (cloudy, NA, or labelled > 1)
    not_clear = (raw_n-dup) - len(gdf)
    print('   n cloudy obs:' , not_clear)
    
    #tally samples left after cleaning
    print('  ',a,'After cleaning:', len(gdf))
    total_after.append(len(gdf))
    
    # save out to file for the accuracy assessments
    gdf.to_csv('../02_Validation_results/WOfS_Assessment/wofs_ls/'+a+'_wofs_ls_validation_points.csv')

Eastern Before cleaning: 6093
   n duplicates: 243
   n cloudy obs: 3181
   Eastern After cleaning: 2669
Western Before cleaning: 8038
   n duplicates: 4021
   n cloudy obs: 2962
   Western After cleaning: 1055
Northern Before cleaning: 3597
   n duplicates: 4
   n cloudy obs: 2468
   Northern After cleaning: 1125
Southern Before cleaning: 7216
   n duplicates: 2377
   n cloudy obs: 2684
   Southern After cleaning: 2155
Sahel Before cleaning: 3577
   n duplicates: 18
   n cloudy obs: 2335
   Sahel After cleaning: 1224
Central Before cleaning: 7937
   n duplicates: 6600
   n cloudy obs: 731
   Central After cleaning: 606
Indian_ocean Before cleaning: 3673
   n duplicates: 158
   n cloudy obs: 1232
   Indian_ocean After cleaning: 2283
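
To make the two cleaning steps in the loop concrete, here is a minimal sketch on a toy table. The rows are hypothetical and chosen only to trigger each rule; the column names match those used in the loop. Note that `keep=False` removes every member of a duplicated group rather than keeping one copy, which helps explain why the duplicate counts above can be large.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; values are illustrative, not taken from the validation data
toy = pd.DataFrame(
    {
        "LAT": [1.0, 1.0, 2.0, 3.0, 4.0, 5.0],
        "LON": [30.0, 30.0, 31.0, 32.0, 33.0, 34.0],
        "MONTH": [1, 1, 1, 2, 2, 3],
        "ACTUAL": [0, 1, 1, 2, 0, 1],
        "CLEAR_OBS": [3.0, 3.0, 2.0, 4.0, 0.0, np.nan],
    }
)

# Rows 0 and 1 share LAT/LON/MONTH, so keep=False drops both of them
deduped = toy.drop_duplicates(["LAT", "LON", "MONTH"], keep=False)

# Same conditions as the loop, written as a boolean mask of rows to discard
bad = (
    (deduped["ACTUAL"] > 1)
    | (deduped["CLEAR_OBS"] == 0.0)
    | deduped["CLEAR_OBS"].isna()
)
cleaned = deduped[~bad]

print(len(toy), len(deduped), len(cleaned))
```

Of the six toy rows, two are removed as duplicates and three more fail the label or clear-observation checks, leaving a single row.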


### Sum of data points before and after cleaning

In [9]:
print('Total samples before cleaning: ', sum(total_before))
print('Total samples after cleaning: ', sum(total_after))


Total samples before cleaning:  40131
Total samples after cleaning:  11117
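
The printed tallies can also be turned into retention rates per AEZ. The sketch below copies the before/after counts from the loop output above into a dictionary (the names `tallies`, `n_before`, `n_after` and `retention` are introduced here for illustration) and computes the fraction of samples that survive cleaning.

```python
# (before, after) counts copied from the per-AEZ output printed by the loop
tallies = {
    "Eastern": (6093, 2669),
    "Western": (8038, 1055),
    "Northern": (3597, 1125),
    "Southern": (7216, 2155),
    "Sahel": (3577, 1224),
    "Central": (7937, 606),
    "Indian_ocean": (3673, 2283),
}

n_before = sum(before for before, _ in tallies.values())
n_after = sum(after for _, after in tallies.values())

# fraction of clipped samples that remain after cleaning, per AEZ
retention = {aez: after / before for aez, (before, after) in tallies.items()}

for aez, frac in retention.items():
    print(f"{aez}: {frac:.1%} retained")
print(f"Overall: {n_after / n_before:.1%} retained")
```

The sums reproduce the totals above (40131 and 11117), and the per-AEZ fractions show that Central retains the smallest share of its samples while Indian_ocean retains the largest.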
