<a href="https://colab.research.google.com/github/ck1972/University-GeoAI/blob/main/1c_Preparing_LandCover_Training_Datasets_GEE_Bulawayo_2024_Tutorial1_01.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Preparing Training Data for Land Cover Mapping
## Introduction
Preparing training samples is a crucial step in any geospatial machine learning pipeline, requiring particular attention due to the unique characteristics of spatial data. Unlike traditional datasets, geospatial information combines both spatial and temporal dimensions, often integrating multiple data sources with varying resolutions, projections, and formats. The challenges of preparing such data for machine learning applications are manifold, ranging from handling missing values in satellite imagery to aligning datasets from different coordinate systems.

## Requirements
To run this script, the user must have an Earth Engine account. In addition, the user must authenticate the Earth Engine Python API. See the instructions [here](https://developers.google.com/earth-engine/guides/auth).


This script will use the [geemap](https://geemap.org) Python package to display the maps. Geemap enables users to interactively explore and visualize Earth Engine datasets within a Jupyter-based environment with minimal coding. To learn more about geemap, check out https://geemap.org.

The following steps prepare land cover training data for Bulawayo.

## Initialize and Authenticate Earth Engine
To get started with Google Earth Engine (GEE), you need to initialize and authenticate the Earth Engine API. Follow these steps.


First, import the Earth Engine API by importing the ee module into your Python environment. This module allows you to interact with the Earth Engine platform.


In [None]:
# Import the API
import ee

# Import the geemap library
import geemap

Next, authenticate and initialize the Earth Engine API. Both steps are required before you can use any Earth Engine functionality. The first time you run ee.Authenticate(), you may be prompted to log in with your Google account in a web browser window and grant Earth Engine access. ee.Initialize() then starts the session for the Cloud project you specify.

In [None]:
# Trigger the authentication flow.
ee.Authenticate()

# Initialize the library.
ee.Initialize(project='ee-xxx-test') # Change to your EE project

## Define the boundary
First, we define the area of interest as the boundary of Bulawayo, taken from the FAO GAUL level 2 FeatureCollection. We filter the collection to the features whose first-level administrative name (ADM1_NAME) is "Bulawayo" and extract their combined geometry. We then create a geemap map centered on this boundary at zoom level 12 and draw the boundary outline on it.

In [None]:
# Define the area of interest
boundary = ee.FeatureCollection("FAO/GAUL_SIMPLIFIED_500m/2015/level2") \
    .filter(ee.Filter.eq('ADM1_NAME', 'Bulawayo')) \
    .geometry()

# Create a map
map1 = geemap.Map()
map1.centerObject(boundary, 12)

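# Define styling for the boundary outline.
# ee.Image().paint() returns an image, so the color is set with 'palette'
# ('color' and 'fillColor' only apply to FeatureCollection.style()).
boundary_style = {
    'palette': ['blue']  # Outline color
}

# Add the boundary layer to the map (outline only, 2 pixels wide)
map1.addLayer(ee.Image().paint(boundary, 1, 2), boundary_style, "Bulawayo")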

# Display the map
map1


## Prepare land cover samples
### Extract land cover training samples from the ESA land cover dataset
Next, we load the ESA WorldCover 2020 dataset and clip it to the boundary. The visualization parameters select the 'Map' band, which holds the land cover class codes. To prepare training samples, we set up a stratified sampling scheme: a list of land cover classes (tree cover, shrubland, grassland, cropland, bare/sparse vegetation, and permanent water bodies) and the number of points to sample for each class. We then run the stratified sampling on the clipped land cover image at a 10 m scale with a fixed random seed for reproducibility, and print the total number of collected training samples. The built-up class (50) is excluded here because it is sampled separately from building footprints in the next step.

In [None]:
# Load ESA WorldCover 2020 data and clip to boundary
esa_landcover = ee.ImageCollection('ESA/WorldCover/v100').first()
esa_landcover_clipped = esa_landcover.clip(boundary)

# Prepare visualization parameters
visualization = {
    'bands': ['Map'],
}

### Prepare training samples
# Define classes and samples per class
# ESA classes: 10 = Tree cover, 20 = Shrubland, 30 = Grassland, 40 = Cropland,
# 60 = Bare/sparse vegetation, 80 = Permanent water bodies
class_values = [10, 20, 30, 40, 60, 80]
class_points = [1200, 1200, 1600, 1600, 1600, 600]

# Stratified sampling
other_classes = esa_landcover_clipped.stratifiedSample(
    classBand='Map',
    classValues=class_values,
    classPoints=class_points,
    numPoints=0,
    region=boundary,
    scale=10,
    geometries=True,
    seed=42
)

print('Collected Training Samples:', other_classes.size().getInfo())

Collected Training Samples: 7800
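
As a quick sanity check on the sampling plan, the per-class counts can be tabulated in plain Python before any Earth Engine call is made. The sketch below is illustrative only: `CLASS_NAMES`, `sampling_plan`, and `expected_total` are hypothetical names rather than part of the Earth Engine API, and the class codes and counts are the ones used in the cell above.

```python
# Class codes and labels as listed in the sampling cell above.
CLASS_NAMES = {
    10: "Tree cover",
    20: "Shrubland",
    30: "Grassland",
    40: "Cropland",
    60: "Bare/sparse vegetation",
    80: "Permanent water bodies",
}

class_values = [10, 20, 30, 40, 60, 80]
class_points = [1200, 1200, 1600, 1600, 1600, 600]

# classValues and classPoints are paired by position,
# so the two lists must have the same length.
assert len(class_values) == len(class_points)

sampling_plan = dict(zip(class_values, class_points))
expected_total = sum(sampling_plan.values())

for code, n_points in sampling_plan.items():
    print(f"{code:>3}  {CLASS_NAMES[code]:<24} {n_points:>5}")
print("Requested samples:", expected_total)
```

The per-class counts passed to `stratifiedSample` are upper limits, so a class with too few pixels inside the region would return fewer points. Here the requested total matches the 7800 samples reported above, which suggests every class had enough pixels.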


### Extract building footprints from the global Google-Microsoft open buildings dataset
Next, we load the global Google-Microsoft open buildings dataset and keep only the buildings that lie within the boundary. We define a function that extracts the centroid (center point) of each building footprint and map it over the collection to create a new feature collection of centroids. To keep the sample manageable and random, we add a random column to each feature, sort the collection by it, and keep the first 1600 points. Finally, we set a property named 'Map' to 50 on each feature so that it matches the built-up class code of the ESA land cover 'Map' band, and print the number of built-up training samples.

In [None]:
# Load the combined building footprints dataset
buildings = ee.FeatureCollection('projects/sat-io/open-datasets/VIDA_COMBINED/ZWE')

# Filter buildings within the boundary
buildings_filtered = buildings.filterBounds(boundary)

# Function to extract center points of building footprints
def extract_centroids(feature):
    return ee.Feature(feature.centroid())

# Extract centroids of buildings
built_up_points = buildings_filtered.map(extract_centroids)

# Add a random column and limit the collection to 1600 points
built_up_points = built_up_points.randomColumn('random').sort('random').limit(1600)

# Set the 'Map' property to 50 for built-up class
built_up_points = built_up_points.map(lambda feature: feature.set('Map', 50))

print('Total Training Samples:', built_up_points.size().getInfo())

Total Training Samples: 1600
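
The `randomColumn` / `sort` / `limit` chain is a shuffle-and-truncate subsample: each feature receives a random key, the collection is ordered by that key, and the first 1600 features are kept. A minimal pure-Python analogue may make the idea clearer. This is not Earth Engine code; `random_subsample` and the toy `building_ids` list are hypothetical stand-ins.

```python
import random

def random_subsample(items, n, seed=0):
    """Pick n items by attaching a random key, sorting by it, and truncating."""
    rng = random.Random(seed)
    keyed = [(rng.random(), item) for item in items]  # analogue of randomColumn('random')
    keyed.sort(key=lambda pair: pair[0])              # analogue of sort('random')
    return [item for _, item in keyed[:n]]            # analogue of limit(n)

building_ids = list(range(10_000))  # stand-in for the building centroids
subset = random_subsample(building_ids, 1600, seed=0)
print(len(subset))
```

Because `randomColumn` uses a fixed default seed of 0, the Earth Engine selection is repeatable across runs. Passing a different `seed` to `randomColumn` would draw a different subset.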


### Combine land cover training samples
Next, we merge the two point collections, the built-up points derived from the building footprints and the other land cover classes derived from the ESA land cover, into a single dataset called 'training_data'. We keep only the 'Map' property, which holds the land cover class of each point, and print the total number of training samples. To check the class distribution, we compute a histogram of the 'Map' values and print it as a dictionary that maps each class value to its number of training points.

In [None]:
# Combine 'built_up_points' and 'other_classes' points
training_data = built_up_points.merge(other_classes)

training_data = training_data.select(['Map'])

print('Total Training Samples:', training_data.size().getInfo())

# Compute a histogram of the number of training points per class
class_histogram = training_data.aggregate_histogram('Map')

# Print the histogram (a dictionary mapping class values to counts)
print('Number of training data points per class:', class_histogram.getInfo())

Total Training Samples: 9400
Number of training data points per class: {'10': 1200, '20': 1200, '30': 1600, '40': 1600, '50': 1600, '60': 1600, '80': 600}
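
The histogram can be turned into class shares to judge how balanced the training set is. The sketch below works on the dictionary printed above, copied in by hand so that it runs without Earth Engine. The names `class_share` and `imbalance_ratio` are illustrative, not part of any API.

```python
# Histogram as printed above; the class codes come back as string keys.
class_histogram = {'10': 1200, '20': 1200, '30': 1600, '40': 1600,
                   '50': 1600, '60': 1600, '80': 600}

total = sum(class_histogram.values())
class_share = {code: count / total for code, count in class_histogram.items()}

# Print the share of each class, ordered by numeric class code.
for code, share in sorted(class_share.items(), key=lambda kv: int(kv[0])):
    print(f"class {code}: {share:.1%}")

# Ratio between the largest and smallest class, a simple imbalance indicator.
imbalance_ratio = max(class_histogram.values()) / min(class_histogram.values())
print(f"largest/smallest class ratio: {imbalance_ratio:.2f}")
```

In the notebook itself, the same dictionary would come from `class_histogram.getInfo()` rather than being typed in.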


### Display the training points
Next, we create an interactive geemap map centered on the boundary at zoom level 12. We add the clipped ESA WorldCover 2020 image as a land cover layer, using the visualization parameters defined earlier, and overlay the training sample points with the default styling.

In [None]:
# Initialize our map.
map2 = geemap.Map()
map2.centerObject(boundary, 12)

# Add the ESA land cover layer to the map.
map2.addLayer(esa_landcover_clipped, visualization, 'ESA Landcover 2020')

# Add training samples to the map
map2.addLayer(training_data, {}, 'Training Samples')

# Display the map with layer control.
map2.addLayerControl()
map2


### Export the training points
We export the 'training_data' feature collection as an Earth Engine asset using ee.batch.Export.table.toAsset, specifying the collection to export, a description for the task, and the destination asset ID. After configuring the export, we start it with task.start() and print the task ID to confirm that the export has been submitted. Progress can also be checked in the Tasks tab of the Earth Engine Code Editor.

In [None]:
# Configure and start the export (change assetId to a path in your own EE project)
task = ee.batch.Export.table.toAsset(
    collection=training_data,
    description='Bul_training_data_export',
    assetId='projects/ee-kamusoko-test/assets/Bul_training_data1'
)
task.start()

# Print the task ID
print(f"Export task started: {task.id}")

Export task started: O3EDMJ3GFOD47WR54WNKA5YU
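
To follow the export from Python instead of the web interface, the task can be polled until it finishes. The helper below is a minimal sketch, not part of Earth Engine or geemap. `wait_for_task` and `TERMINAL_STATES` are hypothetical names, and the code assumes that `task.status()` returns a dictionary with a `'state'` key, as `ee.batch.Task` objects do.

```python
import time

# States after which an Earth Engine task will no longer change.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}

def wait_for_task(task, poll_seconds=30, timeout_seconds=3600):
    """Poll task.status() until a terminal state is reached or the timeout expires."""
    waited = 0
    while True:
        state = task.status().get("state")
        if state in TERMINAL_STATES:
            return state
        if waited >= timeout_seconds:
            raise TimeoutError(f"Task still {state} after {timeout_seconds} s")
        time.sleep(poll_seconds)
        waited += poll_seconds

# Usage with the export task started above (requires an authenticated session):
# final_state = wait_for_task(task)
# print("Export finished with state:", final_state)
```

A fixed polling interval keeps the sketch short. For long exports, a longer interval reduces the number of status requests.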
