## Geospatial Data Processing and Vegetation Index Calculation

### Description

This Jupyter Notebook builds a pandas DataFrame of labeled geospatial points, enriched with reflectance values sampled from each label's source imagery and with spectral indices computed from those values. It uses the SpatioTemporal Asset Catalog (STAC) API to access the datasets, and Python libraries such as pystac_client, geopandas, and rasterio to manipulate the data.

### Sections

* **Setting up the Environment**: The notebook starts by importing the required Python libraries and specifying the STAC API endpoint.

* **Data Retrieval from STAC API**: The notebook connects to the STAC API and searches the "ai-extensions-svv-dataset-labels" collection for labeled geospatial items.

* **Utility Functions**: This section defines utility functions for computing normalized differences (used for NDVI, NDWI1, and NDWI2), reading GeoJSON label data from an S3 bucket, and sampling band values at the label points.

* **Data Sampling and Processing**: Each labeled geospatial item is processed to sample reflectance values from the "coastal", "red", "green", "blue", "nir", "nir08", "nir09", "swir16", and "swir22" spectral bands. The notebook then computes three normalized-difference indices from the sampled data: NDVI (NIR and red), NDWI1 (green and NIR), and NDWI2 (NIR and SWIR16).

* **Creating Enriched DataFrame**: The processed data is organized into a pandas DataFrame, which combines the labeled geospatial data with reflectance values and vegetation indices.

* **Data Description and Saving**: The notebook provides a descriptive summary of the enriched DataFrame, including statistics and characteristics of the data. The DataFrame is then saved as a pickle file for further analysis and visualization.
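
The index calculations named in the list above can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's own code: the function name `normalized_difference`, the zero-denominator guard, and the digital numbers are assumptions made for the example. The band pairings and the 10000 scale factor follow the notebook's utility functions.

```python
def normalized_difference(a: float, b: float) -> float:
    """Return (a - b) / (a + b), or NaN when the denominator is zero."""
    total = a + b
    if total == 0:
        return float("nan")
    return (a - b) / total


# Hypothetical digital numbers for one sampled point (illustrative values only).
digital_numbers = {"green": 800, "red": 600, "nir": 3000, "swir16": 1500}

# Scale to reflectance; the 10000 divisor mirrors the one used in the notebook.
reflectance = {band: dn / 10000 for band, dn in digital_numbers.items()}

indices = {
    "ndvi": normalized_difference(reflectance["nir"], reflectance["red"]),
    "ndwi1": normalized_difference(reflectance["green"], reflectance["nir"]),
    "ndwi2": normalized_difference(reflectance["nir"], reflectance["swir16"]),
}

for name, value in indices.items():
    print(f"{name}: {value:.3f}")
```

The same `(a - b) / (a + b)` expression applies element-wise to pandas Series, which is how the notebook uses it on whole columns.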

The resulting DataFrame combines the labels with reflectance values and spectral indices, and can serve as a starting point for tasks such as land cover classification, vegetation health monitoring, and environmental assessment. The notebook can be adapted to other datasets by changing the STAC endpoint, the collection, and the list of bands.
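
Reading the label GeoJSON from S3 depends on splitting the asset href into a bucket and an object key. The sketch below shows that step in isolation, assuming hrefs of the form `s3://bucket/key`. The helper name `split_s3_href`, the scheme check, and the example href are hypothetical and are not part of the notebook.

```python
from typing import Tuple
from urllib.parse import urlparse


def split_s3_href(href: str) -> Tuple[str, str]:
    """Split an s3://bucket/key href into (bucket, key)."""
    parsed = urlparse(href)
    if parsed.scheme != "s3":
        raise ValueError(f"Expected an s3:// href, got: {href}")
    # netloc holds the bucket; the path keeps a leading slash that is not part of the key.
    return parsed.netloc, parsed.path.lstrip("/")


# Hypothetical href, for illustration only.
bucket, key = split_s3_href("s3://example-bucket/labels/item-001/labels.geojson")
print(bucket)
print(key)
```

The resulting pair is what an S3 client's `get_object(Bucket=..., Key=...)` call expects.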



In [None]:
from pystac_client import Client
from utils import (
    UserSettings,
    get_asset_by_common_name,
    convert_coordinates,
)
from pystac import read_file
from urllib.parse import urlparse
import geopandas as gpd
import pandas as pd
import botocore.session
import io
import os
from pystac.stac_io import DefaultStacIO, StacIO
import rasterio
from pystac import Item
from loguru import logger

In [None]:
stac_endpoint = "https://stac-api-dev.terradue.com/"

headers = {}

cat = Client.open(stac_endpoint, headers=headers, ignore_conformance=True)
cat

In [None]:
collections = ["ai-extensions-svv-dataset-labels"]

query = cat.search(collections=collections)

In [None]:
[item.get_assets()["labels"] for item in query.item_collection()]

In [None]:
settings = UserSettings("usersettings.json")

settings.set_s3_environment(
    query.item_collection()[0].get_assets()["labels"].get_absolute_href()
)

# Confirm the credentials were loaded without printing the key itself
print("AWS_ACCESS_KEY_ID" in os.environ)

In [None]:
StacIO.set_default(DefaultStacIO)

In [None]:
label_item = query.item_collection()[0]
label_item

In [None]:
label_item.get_assets()

In [None]:
# Normalized difference, element-wise on pandas Series (zero denominators yield inf/NaN)
def nd(a, b):
    return (a - b) / (a + b)


def read_geojson(
    label_item: Item, user_settings="usersettings.json", asset_key="labels"
):
    settings = UserSettings(user_settings)

    settings.set_s3_environment(label_item.get_assets()[asset_key].get_absolute_href())
    session = botocore.session.Session()

    s3_client = session.create_client(
        service_name="s3",
        region_name=os.environ.get("AWS_REGION"),
        use_ssl=True,
        endpoint_url=os.environ.get("AWS_S3_ENDPOINT"),
        aws_access_key_id=os.environ.get("AWS_ACCESS_KEY_ID"),
        aws_secret_access_key=os.environ.get("AWS_SECRET_ACCESS_KEY"),
    )

    parsed = urlparse(label_item.get_assets()[asset_key].get_absolute_href())

    bucket = parsed.netloc
    key = parsed.path[1:]

    obj = s3_client.get_object(Bucket=bucket, Key=key)

    return gpd.read_file(io.BytesIO(obj["Body"].read()))


def sample_data(label_item, source_item=None, common_bands=("red", "nir")):
    gdf = read_geojson(label_item)

    if source_item is None:
        source_item = read_file(
            [link.target for link in label_item.get_links() if link.rel in ["source"]][
                0
            ]
        )

    dataset = {}
    for common_band in common_bands:
        logger.info(f"Reading {common_band} band")
        dataset[common_band] = rasterio.open(
            get_asset_by_common_name(source_item, common_band).get_absolute_href()
        )

    def convert_row(row, target_crs):
        """Add utm_x/utm_y by converting the point from EPSG:4326 to target_crs."""
        longitude = row.geometry.x
        latitude = row.geometry.y

        src_crs = "EPSG:4326"

        row["utm_x"], row["utm_y"] = convert_coordinates(
            src_crs, target_crs, longitude, latitude
        )

        return pd.Series(row)

    crs_info = dataset[common_bands[0]].crs
    target_crs = f"EPSG:{crs_info.to_epsg()}"
    gdf = gdf.apply(convert_row, target_crs=target_crs, axis=1)

    points_utm = [(x, y) for x, y in zip(gdf["utm_x"], gdf["utm_y"])]

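    for common_band in common_bands:
        logger.info(f"Sampling {common_band} band")
        # Sample band 1 at each point and scale digital numbers to reflectance
        gdf[common_band] = [
            val[0] / 10000 for val in dataset[common_band].sample(points_utm, 1)
        ]
        # Release the raster file handle once the band has been sampled
        dataset[common_band].close()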

    if "red" in common_bands and "nir" in common_bands:
        gdf["ndvi"] = nd(gdf["nir"], gdf["red"])
    if "green" in common_bands and "nir" in common_bands:
        gdf["ndwi1"] = nd(gdf["green"], gdf["nir"])
    if "nir" in common_bands and "swir16" in common_bands:
        gdf["ndwi2"] = nd(gdf["nir"], gdf["swir16"])

    return gdf

In [None]:
tmp_gdfs = []

common_bands = [
    "coastal", "red", "green", "blue",
    "nir", "nir08", "nir09", "swir16", "swir22",
]

for label_item in query.item_collection():
    sampled_data = sample_data(label_item=label_item, common_bands=common_bands)
    tmp_gdfs.append(sampled_data)

gdf = pd.concat(tmp_gdfs)


In [None]:
gdf.describe()

In [None]:
gdf.to_pickle('sprint-0-STAC-labels-to-dataframe.pkl')