# Agricultural Fields Data Request

Data was requested with the following specifications:

* Years: 2008-2024
* Months: 4, 5, 6, 7, 8, 9
* Type: Remote Sensing
    * **Dataset**: Landsat 5/7/8/9 SR – 30m **Variable**: NDVI, MSAVI, NDWI
    * **Dataset**: OpenET – 30m – Monthly **Variable**: ETa: eeMETRIC, geeSEBAL, DISALEXI
    * **Dataset**: Sentinel 2 SR – 10m – 5day **Variable**: NDVI, MSAVI, NDWI, NDRE, BSI
 
Variable names:

* Landsat: L-NDVI, L-MSAVI, L-NDWI
* OpenET: eeMETRIC, geeSEBAL, DISALEXI
* Sentinel 2: S-NDVI, S-MSAVI, S-NDWI, S-NDRE, S-BSI

*Variables derived from Landsat carry an `L-` prefix and those derived from Sentinel 2 carry an `S-` prefix; OpenET variables are unprefixed. Sentinel 2 has no data before roughly 2017.*
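
The naming convention above can be sketched as a small lookup. This is only an illustration: the `PREFIXES` table and the `output_name` function are hypothetical names that do not appear in the notebook, and the dataset keys are the API identifiers listed under Metadata.

```python
# Hypothetical helper: PREFIXES and output_name are illustrative names,
# not part of the original notebook.
PREFIXES = {'LANDSAT_SR': 'L', 'SENTINEL2_SR': 'S'}


def output_name(dataset: str, variable: str) -> str:
    """Return the requested output name for a dataset/variable pair."""
    prefix = PREFIXES.get(dataset)
    # OpenET variables carry no prefix in the request, so they pass through unchanged
    return f'{prefix}-{variable}' if prefix else variable


print(output_name('LANDSAT_SR', 'NDVI'))
print(output_name('SENTINEL2_SR', 'NDRE'))
print(output_name('OPENET_CONUS', 'eeMETRIC'))
```

Keeping the prefix in one table means a later renaming step only has to change in one place.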

## Setup

In [None]:
import os
import logging
from dotenv import load_dotenv

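# AUTHENTICATION
load_dotenv()  # Read variables from a local .env file, if one exists
CLIMATE_ENGINE_API_KEY = os.environ.get('CLIMATE_ENGINE_API_KEY')
if not CLIMATE_ENGINE_API_KEY:
    # Fail early instead of sending requests with an empty Authorization header
    raise RuntimeError('CLIMATE_ENGINE_API_KEY is not set; add it to the environment or to a .env file')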

HEADERS = {
    'Accept': 'application/json',
    'Authorization': CLIMATE_ENGINE_API_KEY
}

# LOGGING
logging.basicConfig(
    level=logging.DEBUG, # Use INFO for routine progress messages, DEBUG for verbose output
    format="%(asctime)s | %(levelname)-8s | %(name)s | %(message)s",
    force=True
)

logger = logging.getLogger("climateengine.scraper")

## Metadata
### Datasets
Available datasets from ClimateEngine API: https://docs.climateengine.org/docs/build/html/datasets.html

API Parameters:
* Landsat 5/7/8/9 SR - 30m: LANDSAT_SR
* OpenET - 30m - Monthly: OPENET_CONUS
* Sentinel 2 SR - 10m - 5day: SENTINEL2_SR

In [None]:
DATASETS = ['LANDSAT_SR', 'OPENET_CONUS', 'SENTINEL2_SR']

### Variables
The API allows us to see the available variables for a given dataset: https://api.climateengine.org/docs#/metadata/metadata_dataset_variables_metadata_dataset_variables_get

In [None]:
from scrape_utils import synchronous_fetch_with_retry

# Requested variables; VARIABLES[i] belongs to DATASETS[i].
# 'NDWI_NIR_SWIR2' is temporarily removed.
VARIABLES = [
    # LANDSAT_SR
    ['NDVI', 'MSAVI', 'NDWI_NIR_SWIR_Gao', 'NDWI_Green_NIR_McFeeters', 'NDWI_Green_SWIR_Xu', 'NDWI_Green_SWIR_Hall', 'NDWI_SWIR_Green_Allen'],
    # OPENET_CONUS
    ['et_eemetric', 'et_geesebal', 'et_disalexi'],
    # SENTINEL2_SR (the requested NDRE and BSI are not listed here)
    ['NDVI', 'MSAVI', 'NDWI_NIR_SWIR_Gao', 'NDWI_Green_NIR_McFeeters', 'NDWI_Green_SWIR_Xu', 'NDWI_Green_SWIR_Hall', 'NDWI_SWIR_Green_Allen'],
]

for i, dataset in enumerate(DATASETS):
    res = synchronous_fetch_with_retry(f'https://api.climateengine.org/metadata/dataset_variables?dataset={dataset}', headers=HEADERS)

    api_variables = set(res.get('Data').get('variables'))
    missing = set(VARIABLES[i]).difference(api_variables)

    if missing:
        logger.warning(f'{dataset}: ✗ API missing requested variables {sorted(missing)}')
    else:
        logger.info(f'{dataset}: ✓ all variables available')
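
The `scrape_utils` module is not shown in this notebook, so the behavior of `synchronous_fetch_with_retry` is not known here. As a rough sketch of the general idea only, a retry wrapper with exponential backoff might look like the following. The name `call_with_retry`, its parameters, and the fake `flaky` request are all assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar('T')


def call_with_retry(func: Callable[[], T], retries: int = 3, base_delay: float = 1.0) -> T:
    """Call func(), retrying with exponential backoff if it raises.

    Hypothetical helper; not the implementation used by scrape_utils.
    """
    if retries < 1:
        raise ValueError('retries must be at least 1')
    for attempt in range(retries):
        try:
            return func()
        except Exception:
            # Re-raise once the final attempt has failed
            if attempt == retries - 1:
                raise
            # Wait base_delay, 2 * base_delay, 4 * base_delay, ... between attempts
            time.sleep(base_delay * 2 ** attempt)
    raise AssertionError('unreachable')


# Fake request that fails twice before succeeding (made-up data, no network access)
attempts = {'count': 0}


def flaky():
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise ConnectionError('temporary failure')
    return {'Data': {'variables': ['NDVI']}}


result = call_with_retry(flaky, retries=3, base_delay=0)
print(result, attempts['count'])
```

A real fetch helper would likely retry only on transient errors (timeouts, rate limits) rather than on every exception.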


### Dates
The API allows us to see the minimum and maximum dates for a given dataset: https://api.climateengine.org/docs#/metadata/metadata_dataset_dates_metadata_dataset_dates_get

In [None]:
# Requested [first_year, last_year]; DATES[i] and MONTHS[i] belong to DATASETS[i]
DATES = [
    [2008, 2024],
    [2008, 2024],
    [2017, 2024]]

MONTHS = [
    [4, 5, 6, 7, 8, 9],
    [4, 5, 6, 7, 8, 9],
    [4, 5, 6, 7, 8, 9]]
    
for i, dataset in enumerate(DATASETS):
    res = synchronous_fetch_with_retry(f'https://api.climateengine.org/metadata/dataset_dates?dataset={dataset}', headers=HEADERS)
    
    data = res.get('Data')
    date_min = int(data['min'][:4])  # Extract year from date string
    date_max = int(data['max'][:4])
    req_min, req_max = DATES[i]
    # Compared by year only, so partial coverage of the first or last year is not detected
    available = '✓' if req_min >= date_min and req_max <= date_max else '✗'
    logger.info(f'{dataset}: {available} (available: {date_min}-{date_max}, requested: {req_min}-{req_max})')
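
The check above compares years only. If finer resolution matters, one option is to compare full dates. The sketch below assumes the API returns `min` and `max` as strings that begin with an ISO `YYYY-MM-DD` date, which is consistent with the year slicing above but is not confirmed here. `covers_request` is a hypothetical name and the dates in the usage lines are made up.

```python
import calendar
from datetime import date


def covers_request(date_min: str, date_max: str,
                   start_year: int, end_year: int, months: list) -> bool:
    """Return True if [date_min, date_max] contains the whole requested window.

    Assumes date_min and date_max start with an ISO 'YYYY-MM-DD' date.
    """
    available_start = date.fromisoformat(date_min[:10])
    available_end = date.fromisoformat(date_max[:10])
    # Requested window: first day of the first month to last day of the last month
    requested_start = date(start_year, min(months), 1)
    last_day = calendar.monthrange(end_year, max(months))[1]
    requested_end = date(end_year, max(months), last_day)
    return available_start <= requested_start and requested_end <= available_end


growing_season = [4, 5, 6, 7, 8, 9]
# Made-up availability dates, for illustration only
print(covers_request('2000-01-01', '2024-05-01', 2008, 2024, growing_season))
print(covers_request('2000-01-01', '2024-12-31', 2008, 2024, growing_season))
```

In the first call the year-only comparison would report the range as available even though the dataset ends before the requested September.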
    

## Data Collection

### Prepare Agricultural Fields

In [None]:
import pandas as pd

AG_FIELDS_URL = 'https://wc.bearhive.duckdns.org/weppcloud/runs/copacetic-note/ag-fields/browse/ag_fields/CSB_2008_2024_Hangman_with_Crop_and_Performance.geojson?raw=true'

fields_data = synchronous_fetch_with_retry(AG_FIELDS_URL)

# Extract field data
fields = []
for feature in fields_data['features']:
    properties = feature['properties']
    field_id = properties.get('field_ID')
    geometry = feature['geometry']
    
    fields.append({
        'field_id': field_id,
        'geometry': geometry,
    })

fields_df = pd.DataFrame(fields)
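# DataFrame.info() prints to stdout and returns None, so capture its output before logging it
import io
info_buffer = io.StringIO()
fields_df.info(buf=info_buffer)
logger.info(info_buffer.getvalue())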
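
The extraction loop above keeps every feature, including any that lack a `field_ID` or a geometry. A minimal sketch of a more defensive variant follows, assuming such features should simply be skipped. `extract_fields` is a hypothetical name and the sample FeatureCollection is made up.

```python
def extract_fields(feature_collection: dict) -> list:
    """Collect field_id/geometry pairs, skipping incomplete features."""
    fields = []
    for feature in feature_collection.get('features', []):
        # 'properties' may be missing or null in GeoJSON, so fall back to an empty dict
        field_id = (feature.get('properties') or {}).get('field_ID')
        geometry = feature.get('geometry')
        if field_id is None or geometry is None:
            continue
        fields.append({'field_id': field_id, 'geometry': geometry})
    return fields


# Made-up sample: one complete feature and one without an ID or geometry
sample = {
    'type': 'FeatureCollection',
    'features': [
        {'type': 'Feature', 'properties': {'field_ID': 1},
         'geometry': {'type': 'Polygon',
                      'coordinates': [[[0, 0], [0, 1], [1, 1], [0, 0]]]}},
        {'type': 'Feature', 'properties': {}, 'geometry': None},
    ],
}
rows = extract_fields(sample)
print(len(rows), rows[0]['field_id'])
```

The returned list can be passed to `pd.DataFrame` in the same way as the `fields` list above.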

### Fetch Data

In [None]:
import aiohttp
import asyncio
import json
import calendar
from tqdm.asyncio import tqdm
from scrape_utils import asynchronous_fetch_with_retry

semaphore = asyncio.Semaphore(10)

async def fetch_data(dataset: str, dataset_index: int, session: aiohttp.ClientSession):
    tasks = [
        asynchronous_fetch_with_retry(
            session=session,
            url='https://api.climateengine.org/timeseries/native/coordinates',
            semaphore=semaphore,
            headers=HEADERS,
            params={
                'dataset': dataset,
                'variable': ','.join(VARIABLES[dataset_index]),
                'start_date': f'{DATES[dataset_index][0]}-{MONTHS[dataset_index][0]:02d}-01',
                'end_date': f'{DATES[dataset_index][1]}-{MONTHS[dataset_index][-1]:02d}-{calendar.monthrange(DATES[dataset_index][1], MONTHS[dataset_index][-1])[1]}',
                'area_reducer': 'mean',
                'coordinates': json.dumps(row['geometry']['coordinates'])
            })
        # Limited to the first 10 fields; remove .head(10) to request every field
        for _, row in fields_df.head(10).iterrows()
    ]

    return await tqdm.gather(*tasks)

# ClientTimeout values are in seconds, so total=600000 allows each request almost seven days
async with aiohttp.ClientSession(raise_for_status=True, timeout=aiohttp.ClientTimeout(total=600000)) as session:
    for i, dataset in enumerate(DATASETS):
        logger.info(f'{dataset}: starting...')
        results = await fetch_data(dataset=dataset, dataset_index=i, session=session)
        logger.info(f'{dataset}: fetched {len(results)} results')
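
Each request above covers one contiguous range, from the first requested month of the first year to the last requested month of the final year, so the response presumably also includes October through March of the years in between. One way to restrict requests to the requested months is to build a separate window per year, as sketched below. `seasonal_windows` is a hypothetical helper and assumes the requested months are contiguous within a year.

```python
import calendar


def seasonal_windows(start_year: int, end_year: int, months: list) -> list:
    """Return one (start_date, end_date) pair of ISO strings per year.

    Assumes the months form a contiguous run within a single calendar year.
    """
    first_month, last_month = min(months), max(months)
    windows = []
    for year in range(start_year, end_year + 1):
        # monthrange returns (weekday of the first day, number of days in the month)
        last_day = calendar.monthrange(year, last_month)[1]
        windows.append((
            f'{year}-{first_month:02d}-01',
            f'{year}-{last_month:02d}-{last_day:02d}',
        ))
    return windows


windows = seasonal_windows(2008, 2024, [4, 5, 6, 7, 8, 9])
print(len(windows), windows[0], windows[-1])
```

This multiplies the number of requests by the number of years. Keeping the single request and filtering the returned rows by month afterwards is an alternative, depending on how the API counts requests and how large the responses are.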