# Feature Extraction
* **Products used:** 
[dem_cop_30](https://explorer.digitalearth.africa/products/dem_cop_30), [s2_l2a](https://explorer.digitalearth.africa/products/s2_l2a), [dem_srtm](https://explorer.digitalearth.africa/products/dem_srtm), [dem_srtm_deriv](https://explorer.digitalearth.africa/products/dem_srtm_deriv)

## Background:

Training data extraction plays a crucial role in training machine learning models. The process involves extracting relevant feature layers from a geospatial dataset based on predefined geometries or regions of interest. This enables the creation of accurate and reliable classification models for various applications such as land cover mapping, crop monitoring, and environmental analysis.

To facilitate this task, the `deafrica_tools.classification` script provides a function called `collect_training_data`. It is designed to extract training data from the Open Data Cube using geometries defined within a GeoJSON file. The GeoJSON file contains the points or polygons that delineate the regions of interest for which training data needs to be extracted.

## Description:

This notebook focuses on the extraction of training data (feature layers) from the open-data-cube using geometries defined within a GeoJSON file. It follows a step-by-step approach to guide users in utilizing the "collect_training_data" function effectively. The goal is to enable users to extract the appropriate training data for their specific use case.

The main steps in this notebook are as follows:

1. **Previewing the Training Data:** The notebook starts by plotting the polygons from the training data on a basemap. This visualization provides users with a visual representation of the regions of interest for which training data will be extracted.

2. **Defining the Feature Layer Function:** Next, a feature layer function is defined. This function specifies the set of feature layers to be extracted from the open-data-cube. These layers are carefully selected based on their relevance to the classification task at hand.

3. **Extracting Training Data:** The "collect_training_data" function is then employed to extract the training data from the datacube. It utilizes the predefined geometries from the GeoJSON file and retrieves the corresponding feature layers. This step ensures that the extracted data aligns precisely with the defined regions of interest.

4. **Exporting Training Data:** Finally, the extracted training data is exported and saved to disk. This facilitates its subsequent use in other scripts or machine learning workflows for training classification models.

By following the steps outlined in this notebook, users can leverage the "collect_training_data" function to efficiently extract training data from the open-data-cube. 

## Getting started
To run this analysis, run all the cells in the notebook, starting with the "Load packages" cell.

In [1]:
!pip install richdem



### Load packages

In [2]:
%matplotlib inline

import os 
import math
import datacube
import warnings
import rioxarray
import rasterio
import numpy as np
import pandas as pd
import xarray as xr
import richdem as rd
import geopandas as gpd
import matplotlib.pyplot as plt
# from odc.algo import xr_geomedian
from odc.io.cgroups import get_cpu_quota
from datacube.testutils.io import rio_slurp_xarray


from deafrica_tools.datahandling import load_ard
from deafrica_tools.plotting import map_shapefile
from deafrica_tools.bandindices import calculate_indices
from classification import collect_training_data


import io
# import contextlib
from externaldrive import list_gdrive, read_tif_from_gdrive

## Analysis parameters
 * `path`: The path to the input vector file from which we will extract training data. A default GeoJSON is provided.
 * `field`: The name of the column in the attribute table of the input file that contains the class labels. The class labels must be integers.
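Because the class labels must be integers, it can help to validate the label column before starting a long extraction. The following is a minimal sketch using a small hypothetical table in place of the real GeoJSON attributes. The helper name `check_integer_labels` is an assumption made for illustration and is not part of `deafrica_tools`.

```python
import pandas as pd


def check_integer_labels(table, field):
    """Return True if every value in `field` is a whole number (hypothetical helper)."""
    values = pd.to_numeric(table[field], errors="coerce")
    # NaN marks missing or non-numeric labels; a non-zero remainder marks fractional labels
    return bool(values.notna().all() and (values % 1 == 0).all())


# Hypothetical attribute table standing in for the GeoJSON attributes
example = pd.DataFrame({"class_id": [0, 0, 4, 2], "class_name": ["a", "b", "c", "d"]})
labels_ok = check_integer_labels(example, "class_id")
labels_bad = check_integer_labels(example.assign(class_id=[0, 1.5, 4, 2]), "class_id")
```

The same check could be applied to `training_points` with `field='class_id'` once the file has been read.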

In [3]:
# Specify a prefix to identify the area of interest in the saved outputs
# By assigning the desired prefix, you can easily identify the outputs associated with the specific area of interest.
prefix = 'Test'

field = 'class_id'
path = f'data/{prefix}_training_samples.geojson'

print(path)

# Load the input training data (GeoJSON)
training_points = gpd.read_file(path)
training_points.head()

data/Test_training_samples.geojson


| | class_id | original_class | class_name | geometry |
|---|---|---|---|---|
| 0 | 0 | | Non-wetland | POINT (2915136.140 -3458929.894) |
| 1 | 0 | | Non-wetland | POINT (2804676.213 -3684577.413) |
| 2 | 0 | | Non-wetland | POINT (2944255.625 -3666822.009) |
| 3 | 0 | | Non-wetland | POINT (2883949.495 -3284985.865) |
| 4 | 4 | CVB | Channelled valley-bottom | POINT (2726360.915 -3805228.972) |


In [4]:
# Set a flag to convert to polygons:
use_polygons = True

if use_polygons:
    # Convert from lat,lon to EPSG:6933 (projection in metres)
    training_points = training_points.to_crs("EPSG:6933")

    # Buffer geometry to get a square - only if trying to sample multiple pixels
    buffer_radius_m = 10
    training_points.geometry = training_points.geometry.buffer(buffer_radius_m, cap_style=3)
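To see what the square buffer produces, the sketch below buffers a single hypothetical point at the origin with Shapely, the geometry library that GeoPandas builds on. It assumes the same `buffer_radius_m = 10` and `cap_style=3` (square caps) as the cell above.

```python
from shapely.geometry import Point

buffer_radius_m = 10

# cap_style=3 requests square caps, so a buffered point becomes a square
square = Point(0, 0).buffer(buffer_radius_m, cap_style=3)

# Width of the square and its area, in the units of the CRS (metres for EPSG:6933)
side_m = square.bounds[2] - square.bounds[0]
area_m2 = square.area
```

The buffer distance is a half-width, so a 10 m buffer gives a 20 m by 20 m square. At a 10 m resolution this covers roughly a 2 x 2 block of pixels, depending on how the square lines up with the pixel grid.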

#### Plot on interactive map 

In [5]:
points = training_points
training_points.explore(
    tiles = "https://mt1.google.com/vt/lyrs=s&x={x}&y={y}&z={z}", 
    attr ='Imagery @2022 Landsat/Copernicus, Map data @2022 Google',
    popup=True,
    cmap='viridis',
    style_kwds=dict(radius= 5, color= 'red', fillOpacity= 0.8, fillColor= 'red', weight= 3),
    )

## Defining Query

The function `collect_training_data` takes our geojson containing class labels and extracts training data (features) from the datacube over the locations specified by the input geometries. The function will also pre-process our training data by stacking the arrays into a useful format and removing any `NaN` or `inf` values.
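As an illustration of the clean-up step, the sketch below drops every row that contains a `NaN` or `inf` value from a small hypothetical array. It shows the idea only and is not the code used inside `collect_training_data`.

```python
import numpy as np

# Hypothetical stacked samples: one row per sample, first column is the class label
samples = np.array([
    [0.0, 0.31, 1200.0],
    [4.0, np.nan, 950.0],
    [0.0, 0.28, np.inf],
    [2.0, 0.45, 700.0],
])

# Keep only rows where every value is finite (neither NaN nor +/-inf)
finite_rows = np.isfinite(samples).all(axis=1)
clean = samples[finite_rows]
n_removed = int((~finite_rows).sum())
```

Rows removed this way reduce the number of training samples, so the output shape can be smaller than the number of input geometries.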

The below variables can be set within the `collect_training_data` function:

* `field`: The name of the column in your GeoJSON file attribute table that contains the class labels. This corresponds to the `field` variable that we defined earlier.
* `zonal_stats`: An optional string giving the names of zonal statistics to calculate across each geometry (polygon or point). Default is None (all pixel values are returned). Supported values are 'mean', 'median', 'max', and 'min'.
* `dc_query`: A datacube query dictionary for the Open Data Cube query such as `measurements` (the bands to load from the satellite), the `resolution` (the cell size), and the `output_crs` (the output projection). 
* `feature_func`:  A function for generating feature layers that is applied to the data within the bounds of the input geometry. This function will take the 'dc_query' as the only argument.
* `return_coords`: If True, then the training data will contain two extra columns ‘x_coord’ and ‘y_coord’ corresponding to the x,y coordinate of each sample.

> Note: `collect_training_data` also has a number of additional parameters for handling ODC I/O read failures, where polygons that return an excessive number of null values can be resubmitted to the multiprocessing queue.  Check out the [docs](https://github.com/digitalearthafrica/deafrica-sandbox-notebooks/blob/83116e80ebb4f8744e3de74e7a713aadd0a7577a/Tools/deafrica_tools/classification.py#L565) to learn more.
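The effect of `zonal_stats` can be pictured with a small NumPy sketch. The pixel values are hypothetical and the reduction is written out by hand; it illustrates the concept rather than reproducing the function's internals.

```python
import numpy as np

# Hypothetical pixel values inside one geometry: 4 pixels x 2 feature layers
pixels = np.array([
    [0.2, 10.0],
    [0.4, 12.0],
    [0.6, 14.0],
    [0.8, 16.0],
])

# zonal_stats=None: every pixel is kept as its own training sample
all_pixels = pixels

# zonal_stats='mean' or 'median': the geometry collapses to a single sample
zonal_mean = pixels.mean(axis=0)
zonal_median = np.median(pixels, axis=0)
```

With a zonal statistic, each geometry contributes one row to the training data, which keeps the output small when geometries cover several pixels.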

We will define the first three parameters and describe the `feature_func` separately in a moment.

In [6]:
#set up our inputs to collect_training_data
zonal_stats = 'mean'

# Set up the inputs for the ODC query
time = ('2018')

resolution = (-10,10)

output_crs='epsg:6933'

Note that the inputs above do not include `measurements`: the bands to load are specified inside the feature function defined below, so each product can request its own bands. It is advised that you test and select the bands based on your own classification task. Using the variables above, we can generate a datacube query object:

In [7]:
query = {
    'time': time,
    'output_crs': output_crs,
    'resolution': resolution,
}
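The query holds only the settings shared by every product. A feature function can add product-specific settings by unpacking the dictionary with `**query`, as sketched below in plain Python. The product name and measurements are illustrative values, and `load_kwargs` is a hypothetical name for the combined keyword arguments.

```python
# The same query dictionary as above
query = {
    'time': '2018',
    'output_crs': 'epsg:6933',
    'resolution': (-10, 10),
}

# Product-specific settings are combined with the shared query at load time;
# this mirrors a call such as dc.load(product=..., measurements=..., **query)
load_kwargs = dict(product='gm_s2_annual', measurements=['red', 'nir_1'], **query)
```

Unpacking builds a new dictionary, so `query` itself stays unchanged and can be reused for each product.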

## Defining feature function

To create the desired feature layers, we pass instructions to `collect_training_data` through the `feature_func` parameter. The `feature_func` must accept a `dc_query` dictionary and return a single `xarray.Dataset` or `xarray.DataArray` containing 2D coordinates (i.e. x, y, with no time dimension). For example:

          def feature_function(query):
              dc = datacube.Datacube(app='feature_layers')
              ds = dc.load(**query)
              ds = ds.mean('time')
              return ds

Below, we will define a more complicated feature layer function than the brief example shown above. First, we use the `calculate_indices` function to calculate three band indices from the Sentinel-2 bands: the Normalised Difference Vegetation Index (NDVI), the Modified Normalised Difference Water Index (MNDWI), which is commonly used to distinguish between water and non-water land cover classes, and the Tasseled Cap Wetness index (TCW).

In addition, we use temporal statistics to help distinguish wetland classes. To reduce data size, the function loads the Sentinel-2 annual geomedian (sometimes referred to as the 'geometric median') for each pixel location, together with its three median absolute deviation layers (`emad`, `smad`, `bcmad`), which summarise variation through the year.
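For reference, the two normalised difference indices can be written out with NumPy. This sketch uses the standard definitions, NDVI = (NIR - red) / (NIR + red) and MNDWI = (green - SWIR1) / (green + SWIR1), with hypothetical reflectance values. The exact band pairing used by `calculate_indices` for Sentinel-2 should be checked against its documentation. TCW is omitted because it is a weighted sum of bands rather than a normalised difference.

```python
import numpy as np


def normalised_difference(band_a, band_b):
    """(a - b) / (a + b), computed element-wise."""
    band_a = np.asarray(band_a, dtype=float)
    band_b = np.asarray(band_b, dtype=float)
    return (band_a - band_b) / (band_a + band_b)


# Hypothetical surface reflectance values for two pixels: vegetation, open water
green = np.array([600.0, 500.0])
red = np.array([400.0, 300.0])
nir = np.array([3600.0, 100.0])
swir_1 = np.array([1800.0, 100.0])

ndvi = normalised_difference(nir, red)        # high for vegetation
mndwi = normalised_difference(green, swir_1)  # positive for water
```

In this example the vegetated pixel has a high NDVI and a negative MNDWI, while the water pixel shows the opposite pattern.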

### Read terrain indices from a Google Drive Folder. 
Make sure you have followed the instructions to set up the connection with [Google Drive API using a service account](https://docs.digitalearthafrica.org/en/latest/platform_tools/googledrive_access.html). This code should only be used when the terrain attribute data is located in a Google Drive to save on Sandbox disk space.

In [8]:
# # Capture the list of TIFF files from Google Drive
# tif_files = list_gdrive()

# # Initialize an empty dataset for merging
# terrain_stacked = xr.Dataset()

# # Ensure tif_files contains the expected structure
# if isinstance(tif_files, list):
#     # Filter the TIFF files from the list
#     tif_files = [file for file in tif_files if file['name'].endswith('.tif')]

#     # Display the TIFF files with their IDs
#     if tif_files:
#         print("Available TIFF files:")
#         for tif in tif_files:
#             print(f"{tif['name']} (ID: {tif['id']})")

#         # Lists to store the data arrays and their extents
#         data_arrays = []
#         extents = []
#         titles = []

#         # Read and merge all the TIFF files into the dataset
#         for tif in tif_files:
#             selected_file_id = tif['id']  # Select the current TIFF file
            
#             # Read the selected TIFF file from Google Drive
#             data_array, transform = read_tif_from_gdrive(selected_file_id)

#             # Check if the data was read successfully
#             if data_array is not None:
#                 print(f"Data for {tif['name']} read successfully!")

#                 # Convert to a dataset with the filename as the variable name
#                 tif_dataset = data_array.to_dataset(name=tif['name'].replace('.tif', ''))

#                 # Merge with the existing stacked dataset
#                 terrain_stacked = xr.merge([terrain_stacked, tif_dataset], compat='override')

#                 # Store the data array and its extent
#                 data_arrays.append(data_array)
#                 x_min, x_max = transform[2], transform[2] + transform[0] * data_array.shape[1]
#                 y_min, y_max = transform[5] + transform[4] * data_array.shape[0], transform[5]
#                 extents.append((x_min, x_max, y_min, y_max))
#                 titles.append(tif['name'])  # Store the title for the plot

#             else:
#                 print(f"Failed to read data for {tif['name']}.")

#         # Plot all TIFF data in subplots after reading them all
#         # Calculate the number of rows needed
#         num_files = len(data_arrays)
#         num_columns = 4
#         num_rows = math.ceil(num_files / num_columns)
        
#         # Create subplots with the desired number of rows and columns
#         fig, axes = plt.subplots(nrows=num_rows, ncols=num_columns, figsize=(15, 5 * num_rows))

#         # Flatten the axes array for easier indexing
#         axes = axes.flatten()
        
#         # Loop through the files and plot them
#         for i in range(num_files):
#             im = axes[i].imshow(data_arrays[i], cmap='gray', extent=extents[i])
#             axes[i].set_title(titles[i])
#             axes[i].set_xlabel('X Coordinate')
#             axes[i].set_ylabel('Y Coordinate')
        
#             # Add a color bar to each subplot
#             cbar = fig.colorbar(im, ax=axes[i], orientation='vertical', fraction=0.046, pad=0.04)
#             cbar.set_label('Pixel Value')
        
#         # Hide axes for unused subplots if any
#         for i in range(num_files, len(axes)):
#             axes[i].axis('off')
        
#         # Adjust layout to prevent overlap
#         plt.tight_layout()
#         plt.show()

#         # Print a summary of the final merged dataset
#         print(f"Final stacked dataset contains {len(terrain_stacked.data_vars)} variables.")
#     else:
#         print("No TIFF files found.")
# else:
#     print("Failed to retrieve files from Google Drive.")
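The extent arithmetic in the commented cell above can be checked in isolation. The sketch below assumes that `read_tif_from_gdrive` returns a rasterio-style affine transform for a north-up raster, which is what the indexing `transform[0]` to `transform[5]` suggests. The function name `extent_from_transform` and the example values are hypothetical.

```python
def extent_from_transform(transform, n_rows, n_cols):
    """Return (x_min, x_max, y_min, y_max) for a north-up raster.

    `transform` holds the six affine coefficients in the order
    (pixel_width, row_rotation, x_origin, col_rotation, -pixel_height, y_origin).
    """
    x_min = transform[2]
    x_max = transform[2] + transform[0] * n_cols
    y_max = transform[5]
    y_min = transform[5] + transform[4] * n_rows  # transform[4] is negative
    return x_min, x_max, y_min, y_max


# Hypothetical 90 m raster with 100 rows and 200 columns
example_transform = (90.0, 0.0, 2700000.0, 0.0, -90.0, -3200000.0)
extent = extent_from_transform(example_transform, n_rows=100, n_cols=200)
```

The resulting tuple is in the order that the `extent` argument of `imshow` expects.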

In [9]:
def feature_layers(query, terrain_stacked=None): # Make sure to change None to terrain_stacked if using terrain indices from a google drive folder
    # connect to the datacube
    dc = datacube.Datacube(app='feature_layers')
    
    # load s2 annual geomedian
    ds = dc.load(
        product='gm_s2_annual',
        measurements=['blue', 'green', 'red', 'nir_1', 'nir_2', 'swir_1', 'swir_2', 'emad', 'smad', 'bcmad'],
        **query)
    
    # calculate some band indices
    ds = calculate_indices(ds, index=['NDVI', 'MNDWI', 'TCW'], drop=False, satellite_mission='s2')
    
    # Specify the variables you want to keep
    variables_to_keep = ['NDVI', 'MNDWI', 'TCW', 'emad', 'smad', 'bcmad']
    
    # Drop variables that are not in the keep list
    ds = ds.drop_vars([var for var in ds.data_vars if var not in variables_to_keep])
    
    # Add a prefix "Annual" to the band names
    new_band_names = ['Annual_' + band_name for band_name in ds.data_vars]
    ds = ds.rename({old_band_name: new_band_name for old_band_name, new_band_name in zip(ds.data_vars, new_band_names)})

    # Stack multi-temporal measurements and rename them
    n_time = ds.sizes['time']
    list_measurements = list(ds.keys())
    list_stack_measures = []
    for j in range(len(list_measurements)):
        for k in range(n_time):
            variable_name = list_measurements[j] + '_' + str(k)
            measure_single = ds[list_measurements[j]].isel(time=k).rename(variable_name)
            list_stack_measures.append(measure_single)
    ds_stacked = xr.merge(list_stack_measures, compat='override')

    # Load the Sentinel-1 data    
    ds_s1 = dc.load(product=["s1_rtc"], measurements=['vv', 'vh'], group_by="solar_day", **query)

    # Add a prefix "sentinel-1_" to the variables in ds_s1
    ds_s1 = ds_s1.rename({old_var: 'sentinel-1_' + old_var for old_var in ds_s1.data_vars})

    # Take the median over all dimensions (time, y, x), giving one value per variable
    median_s1 = ds_s1[['sentinel-1_vv', 'sentinel-1_vh']].median()

    # Add ALOS L-Band Annual mosaic
    ds_alos = dc.load(product='alos_palsar_mosaic', measurements=['hh', 'hv'], **query)
    
    # Add a prefix "alos_palsar" to the variables in ds_alos
    ds_alos = ds_alos.rename({old_var: 'alos_palsar_' + old_var for old_var in ds_alos.data_vars})  
    median_alos = ds_alos[['alos_palsar_hh', 'alos_palsar_hv']].median()

    # Add WOfS Annual summary
    wofs_annual = dc.load(product='wofs_ls_summary_annual', like=ds.geobox, time=query['time'])
    wofs_annual_frequency = wofs_annual.frequency
    wofs_annual_frequency.name = 'WOfS'

    # Choose one of the following methods to load terrain attributes:

    # Option 1 (active by default): the terrain attribute files are located in a folder within the Sandbox.
    # Comment out this block and uncomment Option 2 below if the files are in a Google Drive folder instead.
    # Loop through the terrain attribute files and add them to the dataset.
    # Note: `prefix` is the global variable set in the "Analysis parameters" cell.
    
    folder = os.path.join("data/terrain_attributes/", prefix)
    for filename in os.listdir(folder):
        if filename.endswith('.tif'):
            filepath = os.path.join(folder, filename)
            tif = rio_slurp_xarray(filepath, gbox=ds.geobox)
            tif = tif.to_dataset(name=filename.replace('.tif', ''))
            ds_stacked = xr.merge([ds_stacked, tif], compat='override')
                
    # Option 2: Use this code if the terrain attribute files are in a Google Drive folder
    # Uncomment the code below and comment out the Sandbox folder option above if using this method.
    # Make sure to change None to terrain_stacked in the function argument at the top if using terrain indices from a google drive folder

    # bbox = ds.geobox.extent.boundingbox
    # if terrain_stacked is not None:  
    #     terrain_stacked.attrs['crs'] = ds.geobox.crs
    #     terrain_stacked = terrain_stacked.sel(x=slice(bbox.left, bbox.right), y=slice(bbox.top, bbox.bottom))
    #     terrain_stacked = terrain_stacked.rio.reproject_match(ds)
    #     ds_stacked = xr.merge([ds_stacked, terrain_stacked], compat='override', combine_attrs='override')

    # Merge all the datasets into a single dataset
    ds_stacked = xr.merge([ds_stacked, median_s1, median_alos, wofs_annual_frequency], compat='override', combine_attrs='override')
    
    return ds_stacked
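The step in the function above that stacks multi-temporal measurements can be tried on a small synthetic dataset. The sketch below uses a hypothetical `NDVI` variable with two time steps and shows how each time step becomes its own 2D variable.

```python
import numpy as np
import xarray as xr

# Hypothetical dataset: one variable, two time steps, 2 x 2 pixels
ds = xr.Dataset(
    {"NDVI": (("time", "y", "x"), np.arange(8, dtype=float).reshape(2, 2, 2))},
    coords={"time": [0, 1], "y": [10.0, 0.0], "x": [0.0, 10.0]},
)

# Turn each time step of each variable into its own 2D variable, e.g. NDVI_0, NDVI_1
layers = []
for name in ds.data_vars:
    for k in range(ds.sizes["time"]):
        layers.append(ds[name].isel(time=k).rename(f"{name}_{k}"))

# compat='override' picks the first value when layers disagree,
# for example on the leftover scalar 'time' coordinate
stacked = xr.merge(layers, compat="override")
```

The annual geomedian has a single time step per year, which is why the feature names in this notebook end in `_0`.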


Now let's run the `collect_training_data` function. This may take minutes to hours depending on the number of training geometries, the number of measurements loaded, and the amount of computation in the feature function. The feature function above loads several products for every geometry (Sentinel-2 annual geomedian, Sentinel-1, ALOS PALSAR, WOfS, and the terrain attributes), so extracting the full set of training features can be very time-consuming. To test the workflow quickly, you can pass `gdf=training_points[0:10]` to run the code over only the first 10 geometries; the cell below passes the full `training_points` set. The extracted training data are written to the `results/` folder at the end of this notebook for use in the next module of the workflow.

> **Note**: With supervised classification, it's common to have many labelled geometries in the training data. `collect_training_data` can parallelize across the geometries to speed up the extraction of training data. Setting `ncpus>1` will automatically trigger the parallelization. However, it's best to set `ncpus=1` to begin with to assist with debugging before triggering the parallelization.

In [10]:
##### detect the number of CPUs
ncpus=round(get_cpu_quota())
print('ncpus = '+str(ncpus))

# collect training data
column_names, model_input = collect_training_data(
    gdf=training_points,
    dc_query=query,
    ncpus=ncpus,
    field=field, # integer class label
    zonal_stats="mean",
    feature_func=feature_layers,
    return_coords=True)

ncpus = 7
Taking zonal statistic: mean
Collecting training data in parallel mode


  0%|          | 0/3997 [00:00<?, ?it/s]

Error opening source dataset: s3://deafrica-sentinel-1/s1_rtc/S29E030/2018/02/10/011358/s1_rtc_011358_S29E030_2018_02_10_VV.tif
Error opening source dataset: s3://deafrica-sentinel-1/s1_rtc/S27E030/2018/03/21/024487/s1_rtc_024487_S27E030_2018_03_21_VH.tif


Percentage of possible fails after run 1 = 0.0 %
Removed 453 rows wth NaNs &/or Infs
Output shape:  (3542, 25)


In [11]:
print(column_names)

['class_id', 'Annual_emad_0', 'Annual_smad_0', 'Annual_bcmad_0', 'Annual_NDVI_0', 'Annual_MNDWI_0', 'Annual_TCW_0', 'Slope_90m', 'Curvature_450m', 'TWI', 'Slope_450m', 'Elevation', 'DTW', 'TPI_90m', 'MrRTF', 'MrVBF', 'TPI_450m', 'Curvature_90m', 'sentinel-1_vv', 'sentinel-1_vh', 'alos_palsar_hh', 'alos_palsar_hv', 'WOfS', 'x_coord', 'y_coord']


In [12]:
pd_training_features=pd.DataFrame(data=model_input,columns=column_names)
pd_training_features

| | class_id | Annual_emad_0 | Annual_smad_0 | Annual_bcmad_0 | Annual_NDVI_0 | Annual_MNDWI_0 | Annual_TCW_0 | Slope_90m | Curvature_450m | TWI | ... | MrVBF | TPI_450m | Curvature_90m | sentinel-1_vv | sentinel-1_vh | alos_palsar_hh | alos_palsar_hv | WOfS | x_coord | y_coord |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1195.421387 | 0.010374 | 0.084440 | 0.328569 | -0.570528 | -0.231002 | 19.826857 | -0.293095 | 3.314414 | ... | 2.0 | -0.305664 | -0.039978 | 0.047480 | 0.011059 | 2698.0 | 1143.0 | 0.000000 | 2915135.0 | -3458925.0 |
| 1 | 0.0 | 1655.406128 | 0.016796 | 0.152221 | 0.286133 | -0.511735 | -0.209967 | 56.770924 | -0.136111 | 3.134199 | ... | 2.0 | -0.469727 | -0.099939 | 0.080999 | 0.011475 | 3074.0 | 1414.0 | 0.000000 | 2883945.0 | -3284985.0 |
| 2 | 0.0 | 1000.161560 | 0.006482 | 0.070875 | 0.478075 | -0.466149 | -0.196494 | 76.067909 | -0.339499 | -0.775587 | ... | 0.0 | -1.701538 | 0.080005 | 0.063203 | 0.012917 | 3535.0 | 1643.0 | 0.000000 | 2888895.0 | -3622025.0 |
| 3 | 4.0 | 763.524170 | 0.008826 | 0.071589 | 0.483557 | -0.632315 | -0.193676 | 73.798500 | 0.542211 | 0.776343 | ... | 1.0 | -2.634949 | 0.090027 | 0.052554 | 0.014792 | 2904.5 | 1359.5 | 0.000000 | 2726360.0 | -3805225.0 |
| 4 | 0.0 | 1023.866943 | 0.006194 | 0.063918 | 0.372740 | -0.535694 | -0.239757 | 66.580383 | 0.514251 | 0.518476 | ... | 1.0 | 1.791870 | 0.040039 | 0.058811 | 0.013001 | 3142.0 | 1197.0 | 0.000000 | 2804675.0 | -3684575.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3537 | 0.0 | 1051.719238 | 0.010760 | 0.090678 | 0.412452 | -0.571128 | -0.217700 | 65.644821 | -0.058259 | 0.727977 | ... | 2.0 | -2.095093 | -0.000073 | 0.049470 | 0.009775 | 3650.0 | 1124.0 | 0.000000 | 2884925.0 | -3338475.0 |
| 3538 | 0.0 | 785.821106 | 0.002905 | 0.059831 | 0.279992 | -0.568576 | -0.263844 | 64.332672 | -0.052280 | -0.077230 | ... | 1.0 | -0.637329 | 0.019983 | 0.022200 | 0.002797 | 1192.5 | 405.5 | 0.000000 | 2736615.0 | -3362150.0 |
| 3539 | 0.0 | 512.061523 | 0.002213 | 0.033848 | 0.435172 | -0.544383 | -0.225588 | 62.499538 | 0.369242 | -0.756742 | ... | 0.0 | 4.367310 | 0.280005 | 0.063852 | 0.009558 | 1478.0 | 714.0 | 0.000000 | 2841585.0 | -3703585.0 |
| 3540 | 2.0 | 693.201599 | 0.002756 | 0.063478 | 0.413032 | -0.580784 | -0.207667 | 16.216364 | -0.006287 | 1.629385 | ... | 4.0 | 0.013672 | -0.120020 | 0.051136 | 0.007798 | 2092.0 | 561.0 | 0.065591 | 2754125.0 | -3689465.0 |


### Export training features

In [13]:
# convert the data to a pandas dataframe
pd_training_features=pd.DataFrame(data=model_input,columns=column_names)
# set the name and location of the output file
# output_file = "results/training_features.txt"
output_file = f"results/{prefix}_training_features.txt"
#Export files to disk
pd_training_features.to_csv(output_file, header=True, index=None, sep=' ')
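A later script can read the exported file back with `pandas.read_csv`, provided it uses the same separator. The round trip is sketched below with a hypothetical two-row table, written to an in-memory buffer rather than to disk. As in the export cell, it uses a space separator and writes no index column.

```python
import io

import pandas as pd

# Hypothetical miniature of the exported table
features = pd.DataFrame({"class_id": [0.0, 4.0], "Annual_NDVI_0": [0.33, 0.48]})

# Write with a space separator and no index column, into memory instead of to disk
buffer = io.StringIO()
features.to_csv(buffer, header=True, index=False, sep=' ')

# Read it back: the separator must match the one used when writing
buffer.seek(0)
restored = pd.read_csv(buffer, sep=' ')
```

Reading with the default comma separator would place each line in a single column, so passing `sep=' '` matters.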


In [15]:
# create geopandas dataframe
gpd_training_features=gpd.GeoDataFrame(pd_training_features, 
geometry=gpd.points_from_xy(model_input[:,-2], model_input[:,-1],crs=output_crs))

#####  Add a column for binary (wetland/non-wetland) classification

In [16]:
# Check if unique values in 'class_id' are only 0 and 1
unique_values = gpd_training_features['class_id'].unique()
if len(unique_values) == 2 and set(unique_values) == {0, 1}:
    # Replace 'class_id' with 'class_id_binary'
    gpd_training_features.rename(columns={'class_id': 'class_id_binary'}, inplace=True)
else:
    # Create 'class_id_binary' column based on condition
    gpd_training_features['class_id_binary'] = gpd_training_features['class_id'].apply(lambda x: 1 if x != 0 else 0)
    gpd_training_features.rename(columns={'class_id': 'class_id_type'}, inplace=True)

# Move the new column to the first position
gpd_training_features.insert(0, 'class_id_binary', gpd_training_features.pop('class_id_binary'))
print(gpd_training_features.columns)

Index(['class_id_binary', 'class_id_type', 'Annual_emad_0', 'Annual_smad_0',
       'Annual_bcmad_0', 'Annual_NDVI_0', 'Annual_MNDWI_0', 'Annual_TCW_0',
       'Slope_90m', 'Curvature_450m', 'TWI', 'Slope_450m', 'Elevation', 'DTW',
       'TPI_90m', 'MrRTF', 'MrVBF', 'TPI_450m', 'Curvature_90m',
       'sentinel-1_vv', 'sentinel-1_vh', 'alos_palsar_hh', 'alos_palsar_hv',
       'WOfS', 'x_coord', 'y_coord', 'geometry'],
      dtype='object')
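The binary relabelling can also be written as a vectorised comparison instead of `apply` with a lambda. The sketch below uses a hypothetical label column in which, as in the training data above, class 0 is non-wetland and any other value is a wetland type.

```python
import pandas as pd

# Hypothetical class labels: 0 = non-wetland, other values = wetland types
labels = pd.DataFrame({"class_id": [0.0, 4.0, 0.0, 2.0]})

# Any non-zero class becomes 1 (wetland); zero stays 0 (non-wetland)
labels["class_id_binary"] = (labels["class_id"] != 0).astype(int)
```

Both forms give the same result; the vectorised version avoids calling a Python function for every row.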


In [17]:
# # Replace non-zero values in the 'class_id' column with 1
# gpd_training_features['class_id_binary'] = gpd_training_features['class_id_type'].apply(lambda x: 1 if x != 0 else 0)
# # Insert the new column at the second position
# gpd_training_features.insert(1, 'class_id_binary', gpd_training_features.pop('class_id_binary'))
# gpd_training_features

In [18]:
# save as geojson file
# gpd_training_features.to_file('results/training_features.geojson', driver="GeoJSON")
geojson_file = f"results/{prefix}_training_features.geojson"
gpd_training_features.to_file(geojson_file, driver="GeoJSON")

***

## Additional information

**License:** The code in this notebook is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0). 
Digital Earth Africa data is licensed under the [Creative Commons by Attribution 4.0](https://creativecommons.org/licenses/by/4.0/) license.

**Contact:** If you need assistance, please post a question on the [Open Data Cube Slack channel](http://slack.opendatacube.org/) or on the [GIS Stack Exchange](https://gis.stackexchange.com/questions/ask?tags=open-data-cube) using the `open-data-cube` tag (you can view previously asked questions [here](https://gis.stackexchange.com/questions/tagged/open-data-cube)).
If you would like to report an issue with this notebook, you can file one on [Github](https://github.com/digitalearthafrica/deafrica-sandbox-notebooks).

**Last tested:**

In [19]:
from datetime import datetime
datetime.today().strftime('%Y-%m-%d')

'2024-12-21'