# Training data (Label data) preparation

This notebook prepares the input for the notebook 'croptype_mapping_using_LightGMB.ipynb'. \
Here, we load and prepare an already classified training dataset from the South Africa Crop Type Competition, hosted on Radiant MLHub. \
Source: https://mlhub.earth/data/ref_south_africa_crops_competition_v1 \
It was downloaded beforehand using the provided script: https://github.com/radiantearth/mlhub-tutorials/blob/main/notebooks/South%20Africa%20Crop%20Types%20Competition/south_africa_crop_type_competition_load_asset_paths.ipynb \
This training dataset is used in the main notebook to train a LightGBM classification model.

## Load packages

In [1]:
import json
import os
from shapely.geometry import Polygon, box
from glob import glob
import rioxarray as rio
from rioxarray.merge import merge_arrays
import matplotlib.pyplot as plt

## Extract training data 

We use the predefined area below as the bounding box because it lies within the overlap between the training data and the available Sentinel-2 data in the SALDi cube. \
Only a subset of the training data is used (roughly 50 to 60 tiles).

In [2]:
# find all tile directories of the training labels (each holds a stac.json with the tile's bbox)
train_label_subdirectories = glob("/home/datacube/work/data/yumyumyumi/project/ref_south_africa_crops_competition_v1_train_labels/*/", recursive = True)
train_label_subdirectories.remove('/home/datacube/work/data/yumyumyumi/project/ref_south_africa_crops_competition_v1_train_labels/_common/')

# define the bbox and buffer it by 0.08 (in coordinate units) so that shapely's contains() keeps about 50 training tiles from the whole dataset
my_bbox_shapely = box(18.938369750976562,  -33.625197399207, 19.03003692626953, -33.52880293198198).buffer(0.08)

training_file_list_inside_bbox = list()

for sub in train_label_subdirectories:
    # read the tile metadata; the with-block closes the file handle again
    with open(sub + 'stac.json') as f:
        json_file = json.load(f)
    json_file_bbox = tuple(json_file['bbox'])
    minx, miny, maxx, maxy = json_file_bbox
    train_bbox_shapely = box(minx, miny, maxx, maxy)
    if my_bbox_shapely.contains(train_bbox_shapely): 
        training_file_list_inside_bbox.append(sub)

print('Number of training data tiles:', len(training_file_list_inside_bbox))

Number of training data tiles: 52
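
The selection step above can be tried out without the real dataset. The following is a minimal sketch using only the standard library: it writes a few hypothetical `stac.json` files to a temporary directory and keeps the tiles whose bounding box lies fully inside the search area. The tile names and their coordinates are made up for illustration, and the helpers `expand_bbox`, `bbox_contains` and `tiles_inside` are not part of the notebook. Note one assumption: the buffer is approximated by enlarging the rectangle on every side, whereas shapely's `buffer` rounds the corners, so this version is slightly more permissive near the corners.

```python
import json
import tempfile
from pathlib import Path


def expand_bbox(bbox, margin):
    """Grow a (minx, miny, maxx, maxy) box by `margin` on every side."""
    minx, miny, maxx, maxy = bbox
    return (minx - margin, miny - margin, maxx + margin, maxy + margin)


def bbox_contains(outer, inner):
    """Return True if `inner` lies entirely within `outer` (edges may touch)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])


def tiles_inside(root, search_bbox):
    """Collect names of tile directories whose stac.json bbox is inside `search_bbox`."""
    selected = []
    for stac_path in sorted(Path(root).glob("*/stac.json")):
        with open(stac_path) as f:
            tile_bbox = tuple(json.load(f)["bbox"])
        if bbox_contains(search_bbox, tile_bbox):
            selected.append(stac_path.parent.name)
    return selected


# Hypothetical demo tiles, written to a temporary directory.
demo_root = Path(tempfile.mkdtemp())
demo_tiles = {
    "tile_inside": [18.95, -33.60, 18.97, -33.58],
    "tile_partly_outside": [19.10, -33.60, 19.13, -33.58],
    "tile_far_away": [20.00, -34.00, 20.02, -33.98],
}
for name, bbox in demo_tiles.items():
    tile_dir = demo_root / name
    tile_dir.mkdir()
    (tile_dir / "stac.json").write_text(json.dumps({"bbox": bbox}))

# Same corner coordinates and margin as in the cell above.
search_bbox = expand_bbox(
    (18.938369750976562, -33.625197399207, 19.03003692626953, -33.52880293198198),
    0.08,
)
selected_tiles = tiles_inside(demo_root, search_bbox)
print(selected_tiles)
```

A tile that only partly overlaps the search area is dropped, which matches the behaviour of a containment test as opposed to an intersection test.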


In [3]:
# open the label GeoTIFF of each selected tile and collect the arrays in a list
training_tifs = []
for tile_dir in training_file_list_inside_bbox:
    data = rio.open_rasterio(tile_dir + 'labels.tif')
    training_tifs.append(data)


In [4]:
# persist the list with IPython's %store magic; the other notebook can restore it with `%store -r training_tifs`
%store training_tifs

Stored 'training_tifs' (list)
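
As an alternative to storing the opened arrays, the selected tile directories could be written to a small JSON file and reopened in the other notebook. This is only a sketch of that idea, not what this notebook does: the helpers `save_tile_paths` and `load_label_paths`, the file name `training_tiles.json`, and the demo directories are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


def save_tile_paths(tile_dirs, out_file):
    """Write the list of tile directories to a JSON file."""
    Path(out_file).write_text(json.dumps([str(d) for d in tile_dirs], indent=2))


def load_label_paths(in_file, label_name="labels.tif"):
    """Read the tile directories back and build the path to each label raster."""
    tile_dirs = json.loads(Path(in_file).read_text())
    return [str(Path(d) / label_name) for d in tile_dirs]


# Hypothetical tile directories standing in for training_file_list_inside_bbox.
demo_dirs = ["/data/train_labels/tile_0001/", "/data/train_labels/tile_0002/"]

paths_file = Path(tempfile.mkdtemp()) / "training_tiles.json"
save_tile_paths(demo_dirs, paths_file)

# In the other notebook, each path could then be passed to rio.open_rasterio.
label_paths = load_label_paths(paths_file)
print(label_paths)
```

The design trade-off is that only plain strings are persisted, so the second notebook has to open the rasters itself, but the stored file stays small and human-readable.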
