# Agricultural Fields Data - Timeseries Endpoint

#### Data was requested with the following specifications:

* Years: 2008-2024
* Months: 4, 5, 6, 7, 8, 9
* Type: Remote Sensing:
    * **Dataset**: Landsat 5/7/8/9 SR – 30m **Variable**: NDVI, MSAVI, NDWI
    * **Dataset**: OpenET – 30m – Monthly **Variable**: ETa: eeMETRIC, geeSEBAL, DISALEXI
    * **Dataset**: Sentinel 2 SR – 10m – 5day **Variable**: NDVI, MSAVI, NDWI, NDRE, BSI
 
Variable names:
* Landsat: L-NDVI, L-MSAVI, L-NDWI
* OpenET: eeMETRIC, geeSEBAL, DISALEXI
* Sentinel 2: S-NDVI, S-MSAVI, S-NDWI, S-NDRE, S-BSI

*Variables derived from Landsat are prefixed with `L-` and those derived from Sentinel with `S-`. Sentinel data is not available before roughly 2017.*
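
The naming convention can be sketched as a small lookup. This is only an illustration: `SHORT_NAMES`, `PREFIXES`, and `output_name` are hypothetical names that do not exist elsewhere in this notebook, and the mapping assumes the API variable names used in the Variables section further down.

```python
# Hypothetical mapping from API variable names to the short names listed above.
# Variables not listed here (NDVI, MSAVI, NDRE, BSI) keep their API name.
SHORT_NAMES = {
    'NDWI_NIR_SWIR_Gao': 'NDWI',
    'et_eemetric': 'eeMETRIC',
    'et_geesebal': 'geeSEBAL',
    'et_disalexi': 'DISALEXI',
}

# Sensor prefix per dataset; OpenET variables are left unprefixed.
PREFIXES = {'LANDSAT_SR': 'L-', 'OPENET_CONUS': '', 'SENTINEL2_SR': 'S-'}


def output_name(dataset: str, variable: str) -> str:
    """Return the output name for an API variable, e.g. 'NDVI' -> 'L-NDVI'."""
    return PREFIXES[dataset] + SHORT_NAMES.get(variable, variable)
```

For example, `output_name('SENTINEL2_SR', 'NDWI_NIR_SWIR_Gao')` gives `'S-NDWI'`.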

The requested area reducer is the median. The requested temporal reducers, applied per month, are the mean and the max.

#### Notes
This notebook uses the [timeseries endpoint](https://docs.climateengine.org/docs/build/html/timeseries.html#rst-timeseries-native-coordinates).

## Setup

In [None]:
import os
import logging
from dotenv import load_dotenv

# AUTHENTICATION
load_dotenv()
CLIMATE_ENGINE_API_KEY = os.environ.get('CLIMATE_ENGINE_API_KEY')

HEADERS = {
    'Accept': 'application/json',
    'Authorization': CLIMATE_ENGINE_API_KEY
}

# LOGGING
logging.basicConfig(
    level=logging.DEBUG, # INFO for useful info, DEBUG for uglier, verbose info
    format="%(asctime)s | %(levelname)-8s | %(name)s | %(message)s",
    force=True
)

logger = logging.getLogger("climateengine.scraper.timeseries")

## Metadata
### Datasets
Available datasets from ClimateEngine API: https://docs.climateengine.org/docs/build/html/datasets.html

API Parameters:
* Landsat 5/7/8/9 SR - 30m: LANDSAT_SR
* OpenET - 30m - Monthly: OPENET_CONUS
* Sentinel 2 SR - 10m - 5day: SENTINEL2_SR

In [None]:
DATASETS = ['LANDSAT_SR', 'OPENET_CONUS', 'SENTINEL2_SR']

### Variables
The API allows us to see the available variables for a given dataset: https://api.climateengine.org/docs#/metadata/metadata_dataset_variables_metadata_dataset_variables_get

In [None]:
from scrape_utils import synchronous_fetch_with_retry

# Requested variables for the dataset at the corresponding index in DATASETS
# The most important NDWI variable is NDWI_NIR_SWIR_Gao (vegetation moisture)
VARIABLES = [ 
    ['NDVI', 'MSAVI', 'NDWI_NIR_SWIR_Gao'],
    ['et_eemetric', 'et_geesebal', 'et_disalexi'], 
    ['NDVI', 'MSAVI', 'NDWI_NIR_SWIR_Gao', 'NDRE', 'BSI']]

for i, dataset in enumerate(DATASETS):
    res = synchronous_fetch_with_retry(f'https://api.climateengine.org/metadata/dataset_variables?dataset={dataset}', headers=HEADERS)

    api_variables = set(res.get('Data').get('variables'))
    missing = set(VARIABLES[i]).difference(api_variables)

    if missing:
        logger.info(f'{dataset}: ✗ API missing requested variables {missing}')
    else:
        logger.info(f'{dataset}: ✓ all variables available')


### Dates
The API allows us to see the minimum and maximum dates for a given dataset: https://api.climateengine.org/docs#/metadata/metadata_dataset_dates_metadata_dataset_dates_get

In [None]:
# Requested [start, end] year range (inclusive) for the dataset at the corresponding index in DATASETS
YEARS = [
    [2008, 2024],
    [2008, 2024],
    [2017, 2024]]

# Explicit enumeration of desired months
MONTHS = [
    [4, 5, 6, 7, 8, 9],
    [4, 5, 6, 7, 8, 9],
    [4, 5, 6, 7, 8, 9]]
    
for i, dataset in enumerate(DATASETS):
    res = synchronous_fetch_with_retry(f'https://api.climateengine.org/metadata/dataset_dates?dataset={dataset}', headers=HEADERS)
    
    data = res.get('Data')
    date_min = int(data['min'][:4])  # Extract year from date string
    date_max = int(data['max'][:4])
    req_min, req_max = YEARS[i]
    available = '✓' if req_min >= date_min and req_max <= date_max else '✗'
    logger.info(f'{dataset}: {available} (available: {date_min}-{date_max}, requested: {req_min}-{req_max})') 
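
The check above compares years only, so a dataset whose last image falls before the end of the final requested month would still pass. A stricter comparison on full dates might look like the sketch below. It assumes the API returns the `min` and `max` dates as strings beginning with `YYYY-MM-DD`; `covers_request` is a hypothetical helper, and the dates in the usage note are made up for illustration.

```python
import calendar
from datetime import date


def covers_request(date_min: str, date_max: str, years: list, months: list) -> bool:
    """Check that the requested window lies inside the dataset's available range.

    The window runs from the first day of the first requested month in the
    start year to the last day of the last requested month in the end year.
    """
    first = date(years[0], months[0], 1)
    last_day = calendar.monthrange(years[1], months[-1])[1]
    last = date(years[1], months[-1], last_day)

    # Only the leading YYYY-MM-DD part is parsed (assumed format)
    available_from = date.fromisoformat(date_min[:10])
    available_to = date.fromisoformat(date_max[:10])
    return available_from <= first and last <= available_to
```

With a made-up range of `'1984-03-16'` to `'2024-06-30'`, `covers_request(..., [2008, 2024], [4, 5, 6, 7, 8, 9])` returns `False` because the request runs to 2024-09-30, whereas the year-only check would report the range as available.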
    

## Data Collection

### Prepare Agricultural Fields

In [None]:
import pandas as pd

AG_FIELDS_URL = 'https://wc.bearhive.duckdns.org/weppcloud/runs/copacetic-note/ag-fields/browse/ag_fields/CSB_2008_2024_Hangman_with_Crop_and_Performance.geojson?raw=true'

fields_data = synchronous_fetch_with_retry(AG_FIELDS_URL)

# Extract field data
fields = []
for feature in fields_data['features']:
    properties = feature['properties']
    field_id = properties.get('field_ID')
    geometry = feature['geometry']
    
    fields.append({
        'field_id': field_id,
        'geometry': geometry,
    })

fields_df = pd.DataFrame(fields)
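# DataFrame.info() prints to stdout and returns None, so capture its output for the logger
import io
info_buffer = io.StringIO()
fields_df.info(buf=info_buffer)
logger.info(info_buffer.getvalue())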

### Fetch Data

In [None]:
import aiohttp
import asyncio
import json
import calendar
from tqdm.asyncio import tqdm
from scrape_utils import asynchronous_fetch_with_retry

semaphore = asyncio.Semaphore(10)

async def fetch_data(dataset: str, dataset_index: int, session: aiohttp.ClientSession):
    tasks = [
        asynchronous_fetch_with_retry(
            session=session,
            url='https://api.climateengine.org/timeseries/native/coordinates',
            semaphore=semaphore,
            headers=HEADERS,
            params={
                'dataset': dataset,
                'variable': ','.join(VARIABLES[dataset_index]),
                'start_date': f'{YEARS[dataset_index][0]}-{MONTHS[dataset_index][0]:02d}-01',
                'end_date': f'{YEARS[dataset_index][1]}-{MONTHS[dataset_index][-1]:02d}-{calendar.monthrange(YEARS[dataset_index][1], MONTHS[dataset_index][-1])[1]}',
                'area_reducer': 'median',
                'coordinates': json.dumps(row['geometry']['coordinates'])
            })
        # NOTE: .head(3) limits the request to the first three fields; drop it to fetch every field
        for _, row in fields_df.head(3).iterrows()
    ]

    return await tqdm.gather(*tasks)

def convert_results_to_pandas(results) -> pd.DataFrame:
    rows = []

    # Order of asyncio results is preserved which allows us to determine field ID
    # https://docs.python.org/3/library/asyncio-task.html#running-tasks-concurrently
    for i, result in enumerate(results):
        field_id = fields_df.iloc[i]['field_id']
        
        # Each result['Data'] is a list with one item containing the timeseries
        if 'Data' in result and len(result['Data']) > 0:
            timeseries_data = result['Data'][0]['Data']
            
            # Each item in timeseries_data is a dict with Date and variable values
            for data_point in timeseries_data:
                row = {
                    'field_id': field_id,
                    'date': data_point.get('Date'),
                    **{k: v for k, v in data_point.items() if k != 'Date'}  # All variables
                }
                rows.append(row)
    return pd.DataFrame(rows)
    
# Main Loop
all_dataset_results = {}
async with aiohttp.ClientSession(raise_for_status=True, timeout=aiohttp.ClientTimeout(total=None)) as session:
    for i, dataset in enumerate(DATASETS):

        # Fetch dataset data
        logger.info(f'{dataset}: starting...')
        results = await fetch_data(dataset=dataset, dataset_index=i, session=session)
        logger.info(f'{dataset}: fetched {len(results)} results')

        # Write dataset data to dataframe
        result_df = convert_results_to_pandas(results)

        # Filter out unwanted months (years should already be capped to the specified range)
        all_dataset_results[dataset] = result_df[pd.to_datetime(result_df['date']).dt.month.isin(MONTHS[i])]
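
The fetch helpers are imported from the local `scrape_utils` module, which is not shown in this notebook. As a rough idea of the pattern such a helper may follow, here is a minimal, stdlib-only sketch of a semaphore-bounded retry with exponential backoff. `call_with_retry` and its parameters are hypothetical and do not reflect the actual signature of `asynchronous_fetch_with_retry`.

```python
import asyncio
import random


async def call_with_retry(make_request, semaphore, max_attempts=4, base_delay=1.0):
    """Await make_request() under the semaphore, retrying failures with backoff.

    make_request is a zero-argument coroutine function (hypothetical interface).
    """
    for attempt in range(1, max_attempts + 1):
        try:
            async with semaphore:  # cap the number of requests in flight
                return await make_request()
        except Exception:
            # A real helper would likely catch only transient errors (timeouts, 429, 5xx)
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error

        # Sleep outside the semaphore so a waiting retry does not hold a slot
        delay = base_delay * 2 ** (attempt - 1)
        await asyncio.sleep(delay + random.uniform(0, delay / 2))
```

Releasing the semaphore before sleeping lets other requests proceed while a failed one waits, and the random jitter keeps retries from firing in lockstep.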

### Save Results to File

In [None]:
# Save each dataset's DataFrame to a separate parquet file
import os

output_dir = 'data/output'
os.makedirs(output_dir, exist_ok=True)

for dataset_name, df in all_dataset_results.items():
    output_file = f'{output_dir}/{dataset_name.lower()}_timeseries.parquet'
    df.to_parquet(output_file, engine='pyarrow', compression='snappy', index=False)
    logger.info(f'{dataset_name}: saved {len(df)} rows to {output_file}')
    
logger.info(f'All datasets saved to {output_dir}/')

# Save a combined file with all datasets
combined_df = pd.concat([
    df.assign(dataset=dataset_name) 
    for dataset_name, df in all_dataset_results.items()
], ignore_index=True)

combined_file = f'{output_dir}/all_datasets_combined.parquet'
combined_df.to_parquet(combined_file, engine='pyarrow', compression='snappy', index=False)
logger.info(f'Combined: saved {len(combined_df)} rows to {combined_file}')

## Processing
In the final output, only the monthly mean and max of each variable are needed. The timeseries API response holds one value per image in the time period, which is too granular.
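
To make the reduction concrete, here is a toy example on made-up NDVI readings for a single field. The values and the `toy` and `monthly` names are illustrative only; the notebook's own processing cell applies the same idea to every variable.

```python
import pandas as pd

# Made-up per-image NDVI readings for one field (two images per month)
toy = pd.DataFrame({
    'field_id': [1, 1, 1, 1],
    'date': pd.to_datetime(['2020-04-03', '2020-04-19', '2020-05-05', '2020-05-21']),
    'NDVI': [0.25, 0.75, 0.5, 0.75],
})

# Collapse the per-image rows to one row per field and month
monthly = (
    toy.assign(year=toy['date'].dt.year, month=toy['date'].dt.month)
       .groupby(['field_id', 'year', 'month'])['NDVI']
       .agg(['mean', 'max'])
       .reset_index()
)
```

Here `monthly` has two rows: April with mean 0.5 and max 0.75, and May with mean 0.625 and max 0.75.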


### Load Data

In [None]:
import pandas as pd
import os

output_dir = 'data/output'
os.makedirs(output_dir, exist_ok=True)

landsat_sr_raw = pd.read_parquet('data/output/landsat_sr_timeseries.parquet')
openet_conus_raw = pd.read_parquet('data/output/openet_conus_timeseries.parquet')
sentinel2_sr_raw = pd.read_parquet('data/output/sentinel2_sr_timeseries.parquet')
raw_dfs = {
    'landsat_sr': landsat_sr_raw, 
    'openet_conus': openet_conus_raw, 
    'sentinel2_sr': sentinel2_sr_raw
}

### Process Data

In [None]:
processed_dfs = {}

# Processing
for dataset_name, raw_df in raw_dfs.items():
    # Replace the -9999 placeholder with NA so it does not skew the mean and max
    df = raw_df.replace(-9999, pd.NA)
    
    df['date'] = pd.to_datetime(df['date'])
    df['year'] = df['date'].dt.year
    df['month'] = df['date'].dt.month

    keys = ['field_id', 'date', 'year', 'month']
    value_cols = [col for col in df.columns if col not in keys]

    df[value_cols] = (
        df[value_cols]
        .apply(pd.to_numeric, errors='coerce')
    )
    
    # Group by field_id, year, and month, then calculate the mean and max of each variable
    agg_df = df.groupby(['field_id', 'year', 'month'])[value_cols].agg(['mean', 'max']).reset_index()

    # Flatten MultiIndex (cleaner column names)
    agg_df.columns = [
        f"{col}_{stat}" if stat else col
        for col, stat in agg_df.columns
    ]

    # Round values to 4 decimals
    stat_cols = [c for c in agg_df.columns if c not in ['field_id', 'year', 'month']]
    agg_df[stat_cols] = agg_df[stat_cols].round(4)

    # Add finished data to final dictionary
    processed_dfs[dataset_name] = agg_df

# Write finished processed dataframes to files
for dataset_name, processed_df in processed_dfs.items():

    output_file = f'{output_dir}/{dataset_name}_timeseries_agg.parquet'
    processed_df.to_parquet(output_file)
