# Download Missing MYD021KM Data for Existing EMIT-Aqua Coincident Pairs

**Purpose:**  
This notebook scans existing coincident data directories, reads AIRS filenames to determine acquisition times, and downloads the corresponding MYD021KM (MODIS Calibrated Radiance) data that was missing from the original download.

**Requirements:**
+ A NASA [Earthdata Login](https://urs.earthdata.nasa.gov/) account
+ A `.netrc` file configured with your NASA Earthdata credentials
+ The existing coincident data directory structure from the original download
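
One way to confirm the `.netrc` requirement before running the notebook is to look for an Earthdata Login entry with Python's standard `netrc` module. This is a minimal sketch: the helper name `has_earthdata_login` and the demo credentials are illustrative and not part of the original notebook.

```python
import netrc
import tempfile
from pathlib import Path

EARTHDATA_HOST = 'urs.earthdata.nasa.gov'


def has_earthdata_login(netrc_path=None):
    """Return True if the netrc file holds an entry usable for Earthdata Login."""
    path = Path(netrc_path) if netrc_path else Path.home() / '.netrc'
    if not path.exists():
        return False
    try:
        auth = netrc.netrc(str(path)).authenticators(EARTHDATA_HOST)
    except netrc.NetrcParseError:
        # A malformed file is treated the same as a missing entry
        return False
    return auth is not None


# Demo against a throwaway file with made-up credentials
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / 'netrc_demo'
    demo_file.write_text(
        'machine urs.earthdata.nasa.gov login demo_user password demo_pass\n'
    )
    demo_result = has_earthdata_login(demo_file)

print(demo_result)
```

Calling `has_earthdata_login()` with no argument checks `~/.netrc` instead of the demo file.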

---

## Import Required Packages

In [1]:
import os
import re
import subprocess
import requests
import datetime as dt
from pathlib import Path
from collections import defaultdict

## Configuration

Set the base directory where your coincident data was downloaded. The notebook will scan subdirectories to find existing AIRS files and use them to determine when to search for MODIS data.

In [2]:
# Base directory containing your coincident pair subdirectories
base_data_dir = '/Users/andrewbuggee/Documents/MATLAB/Matlab-Research/Hyperspectral_Cloud_Retrievals/Batch_Scripts/Paper-2/coincident_EMIT_Aqua_data/'

# CMR API base URL
cmrurl = 'https://cmr.earthdata.nasa.gov/search/'

# MYD021KM product information
modis_product = {
    'MYD021KM': {
        'doi': '10.5067/MODIS/MYD021KM.061',
        'concept_id': None,  # Will be fetched
        'name': 'Aqua/MODIS Level-1B Calibrated Radiances 1km',
        'description': 'MODIS/Aqua Calibrated Radiances 5-Min L1B Swath 1km'
    }
}


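# Products to search. Only MYD021KM is configured here; add a 'MYD03' entry
# (with its DOI) if geolocation files also need to be downloaded.
all_products = {**modis_product}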

# Verify base directory exists
if not os.path.exists(base_data_dir):
    raise ValueError(f"Base data directory not found: {base_data_dir}")

print(f"Base data directory: {base_data_dir}")
print(f"Directory exists: {os.path.exists(base_data_dir)}")

Base data directory: /Users/andrewbuggee/Documents/MATLAB/Matlab-Research/Hyperspectral_Cloud_Retrievals/Batch_Scripts/Paper-2/coincident_EMIT_Aqua_data/
Directory exists: True


## Get MODIS Product Concept IDs

In [3]:
# Fetch concept IDs for MODIS products
for product_key, product_info in all_products.items():
    doi = product_info['doi']
    doisearch = cmrurl + 'collections.json?doi=' + doi
    
    try:
        # Use a timeout so a stalled connection does not hang the notebook
        response = requests.get(doisearch, timeout=30)
        response.raise_for_status()
        concept_id = response.json()['feed']['entry'][0]['id']
        all_products[product_key]['concept_id'] = concept_id
        print(f"{product_key}: {concept_id}")
        print(f"  {product_info['name']}")
    except Exception as e:
        print(f"Error fetching concept ID for {product_key}: {e}")
        print(f"  DOI: {doi}")

MYD021KM: C1379758607-LAADS
  Aqua/MODIS Level-1B Calibrated Radiances 1km
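
The lookup above assumes the DOI search returns at least one collection. A small, hypothetical helper such as `first_concept_id` (not part of the original notebook) could make the empty-result case explicit. The sample payload below is trimmed to the fields the notebook reads and reuses the concept ID printed above.

```python
def first_concept_id(payload):
    """Return the first collection's concept ID, or None if the search was empty."""
    entries = payload.get('feed', {}).get('entry', [])
    if not entries:
        return None
    return entries[0].get('id')


# Payload shapes mirroring what the notebook indexes: ['feed']['entry'][0]['id']
sample_hit = {'feed': {'entry': [{'id': 'C1379758607-LAADS'}]}}
sample_empty = {'feed': {'entry': []}}

print(first_concept_id(sample_hit))
print(first_concept_id(sample_empty))
```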


## Parse Existing Data Directories

Scan the coincident data directories and extract timing information from AIRS filenames. We'll use AIRS timing to search for coincident MODIS data since both are on the Aqua satellite.
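
As a worked example of the timing arithmetic, the sketch below converts the granule number in an AIRS filename to an approximate UTC time (6 minutes per granule, the same approximation this notebook uses) and builds a ±5 minute CMR temporal string. The function name `airs_search_window` is illustrative. The time is built by adding a `timedelta` to midnight, so the last granule of the day (240) rolls over to the next day rather than producing an invalid hour of 24.

```python
import datetime as dt
import re

# Date and granule fields of an AIRS filename: AIRS.YYYY.MM.DD.GGG
AIRS_PATTERN = re.compile(r'AIRS\.(\d{4})\.(\d{2})\.(\d{2})\.(\d{3})')


def airs_search_window(filename, window_minutes=5):
    """Return a CMR temporal string centred on the approximate granule time."""
    match = AIRS_PATTERN.search(filename)
    if match is None:
        return None
    year, month, day, granule = (int(g) for g in match.groups())
    # Approximate time: 6 minutes per granule, counted from midnight UTC
    obs_time = dt.datetime(year, month, day) + dt.timedelta(minutes=granule * 6)
    delta = dt.timedelta(minutes=window_minutes)
    fmt = '%Y-%m-%dT%H:%M:%SZ'
    return (obs_time - delta).strftime(fmt) + ',' + (obs_time + delta).strftime(fmt)


# Granule 193 -> 193 * 6 = 1158 minutes -> 19:18 UTC
print(airs_search_window('AIRS.2024.05.16.193.L2.RetStd.v7.0.7.0.G24137155634.hdf'))
# -> 2024-05-16T19:13:00Z,2024-05-16T19:23:00Z

# Granule 240 (shortened, hypothetical filename) rolls over to the next day
print(airs_search_window('AIRS.2024.05.16.240'))
# -> 2024-05-16T23:55:00Z,2024-05-17T00:05:00Z
```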

In [4]:
def parse_airs_filename(filename):
    """
    Extract date and time from AIRS filenames.
    
    Example: AIRS.2024.05.16.193.L2.RetStd.v7.0.7.0.G24137155634.hdf
    Format: AIRS.YYYY.MM.DD.GGG where GGG is the granule number (001-240)
    
    Returns:
        dict: {'year': int, 'month': int, 'day': int, 'hour': int,
               'minute': int, 'granule': int}
              or None if parsing fails
    """
    # AIRS format: AIRS.YYYY.MM.DD.GGG (where GGG is granule number, ~6 min each)
    airs_match = re.search(r'AIRS\.(\d{4})\.(\d{2})\.(\d{2})\.(\d{3})', filename)
    if airs_match:
        year = int(airs_match.group(1))
        month = int(airs_match.group(2))
        day = int(airs_match.group(3))
        granule = int(airs_match.group(4))
        
        # Convert granule number to approximate UTC time
        # AIRS has 240 granules per day (6 minute granules)
        minutes_since_midnight = granule * 6
        hour = minutes_since_midnight // 60
        minute = minutes_since_midnight % 60
        
        return {
            'year': year,
            'month': month,
            'day': day,
            'hour': hour,
            'minute': minute,
            'granule': granule
        }
    
    return None


def create_temporal_search_string(time_info, window_minutes=5):
    """
    Create CMR temporal search string with a time window around the observation.
    
    Since AIRS and MODIS are both on Aqua, they observe nearly simultaneously.
    We use a smaller window (±5 minutes) compared to AMSR-E.
    
    Args:
        time_info: dict with year, month, day, hour, minute
        window_minutes: half-width of the search window in minutes (default ±5 minutes)
    
    Returns:
        str: CMR temporal search string
    """
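    # Build the time from midnight with a timedelta so that an hour of 24
    # (AIRS granule 240) rolls over to the next day instead of raising ValueError
    obs_time = dt.datetime(
        time_info['year'],
        time_info['month'],
        time_info['day']
    ) + dt.timedelta(hours=time_info['hour'], minutes=time_info['minute'])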
    
    start_time = obs_time - dt.timedelta(minutes=window_minutes)
    end_time = obs_time + dt.timedelta(minutes=window_minutes)
    
    dt_format = '%Y-%m-%dT%H:%M:%SZ'
    return start_time.strftime(dt_format) + ',' + end_time.strftime(dt_format)


def create_spatial_search_bbox(emit_files):
    """
    Extract bounding box from EMIT filenames if possible.
    This can help narrow down MODIS granule searches.
    
    For now, returns None - spatial filtering can be added if needed.
    """
    # TODO: Could parse EMIT metadata or filenames for spatial bounds
    return None


# Scan directories for existing data
print("Scanning data directories...\n")
print("=" * 70)

pair_info = {}  # Dictionary to store info for each pair

# Get all subdirectories in base_data_dir
subdirs = [d for d in Path(base_data_dir).iterdir() if d.is_dir()]

for subdir in sorted(subdirs):
    pair_name = subdir.name
    
    # Check if directory has AIRS files but no MYD021KM files
    files = list(subdir.glob('*'))
    airs_files = [f for f in files if 'AIRS' in f.name and f.suffix in ['.hdf', '.nc']]
    myd021km_files = [f for f in files if 'MYD021KM' in f.name and f.suffix in ['.hdf', '.nc']]
    myd03_files = [f for f in files if 'MYD03' in f.name and f.suffix in ['.hdf', '.nc']]
    emit_files = [f for f in files if 'EMIT' in f.name]
    
    # Skip if no AIRS files
    if not airs_files:
        continue
    
    # Check if MYD021KM already exists
    if myd021km_files:
        print(f"✓ {pair_name}: MYD021KM data already exists ({len(myd021km_files)} files) - SKIPPING")
        continue
    
    # Parse timing from AIRS file
    time_info = None
    source_file = None
    
    for airs_file in airs_files:
        time_info = parse_airs_filename(airs_file.name)
        if time_info:
            source_file = airs_file.name
            break
    
    if time_info:
        pair_info[pair_name] = {
            'directory': subdir,
            'time_info': time_info,
            'source_file': source_file,
            'temporal_str': create_temporal_search_string(time_info),
            'airs_count': len(airs_files),
            'emit_count': len(emit_files),
            'has_myd03': len(myd03_files) > 0
        }
        
        print(f"✗ {pair_name}: Missing MYD021KM data")
        print(f"    Time: {time_info['year']:04d}-{time_info['month']:02d}-{time_info['day']:02d} "
              f"{time_info['hour']:02d}:{time_info['minute']:02d} UTC (from AIRS granule {time_info['granule']})")
        print(f"    Files: {len(airs_files)} AIRS, {len(emit_files)} EMIT, "
              f"MYD03: {'Yes' if myd03_files else 'No'}")
    else:
        print(f"⚠ {pair_name}: Could not parse AIRS timing - SKIPPING")

print("=" * 70)
print(f"\nFound {len(pair_info)} pair(s) missing MYD021KM data\n")

Scanning data directories...

✓ 2023_9_16_T191106_2: MYD021KM data already exists (1 files) - SKIPPING
✓ 2023_9_16_T191106_3: MYD021KM data already exists (3 files) - SKIPPING
✓ 2023_9_16_T191118_1: MYD021KM data already exists (1 files) - SKIPPING
✓ 2023_9_16_T191118_2: MYD021KM data already exists (3 files) - SKIPPING
✗ 2023_9_16_T191130_1: Missing MYD021KM data
    Time: 2023-09-16 19:06 UTC (from AIRS granule 191)
    Files: 1 AIRS, 3 EMIT, MYD03: Yes
✓ 2023_9_16_T191130_2: MYD021KM data already exists (2 files) - SKIPPING
✓ 2023_9_16_T191142: MYD021KM data already exists (3 files) - SKIPPING
✓ 2024-09-12-T1955: MYD021KM data already exists (3 files) - SKIPPING
✓ 2024-09-12-T2000: MYD021KM data already exists (3 files) - SKIPPING
✓ 2024_05_17-T1835: MYD021KM data already exists (1 files) - SKIPPING
✓ 2024_11_14_T193337: MYD021KM data already exists (3 files) - SKIPPING
✓ 2024_1_12_T185446: MYD021KM data already exists (3 files) - SKIPPING
✓ 2024_1_12_T185458: MYD021KM data already 

## Search and Download MYD021KM Data

For each pair missing MYD021KM data, search CMR for coincident MODIS granules and download them.
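
To make the search step concrete, the sketch below builds a granule query URL and filters result links offline, without contacting CMR. It assumes the `granules.json` form of the CMR endpoint, whereas the notebook cell uses `granules` with an `Accept` header. `build_granule_query` and `is_data_link` are illustrative names, the `example.com` links are made up, and `is_data_link` is a stricter variant of the notebook's filter: it requires the link to end in a data extension rather than merely contain one.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed JSON endpoint for CMR granule searches
CMR_GRANULES = 'https://cmr.earthdata.nasa.gov/search/granules.json'
DATA_EXTENSIONS = ('.hdf', '.nc', '.h5', '.he5')


def build_granule_query(concept_id, temporal_str, page_size=2000):
    """Return the full granule search URL for one collection and time window."""
    params = {
        'concept_id': concept_id,
        'temporal': temporal_str,
        'page_size': page_size,
    }
    return CMR_GRANULES + '?' + urlencode(params)


def is_data_link(href):
    """Keep only HTTPS links that end in a data file extension."""
    return href.startswith('https') and href.endswith(DATA_EXTENSIONS)


url = build_granule_query(
    'C1379758607-LAADS',
    '2023-09-16T19:01:00Z,2023-09-16T19:11:00Z',
)
print(url)
# Decoding the query string recovers the original parameters
print(parse_qs(urlsplit(url).query))

# Hypothetical links of the kinds a granule entry may list
links = [
    'https://example.com/data/granule_a.hdf',
    'https://example.com/data/granule_a.hdf.xml',
    'https://example.com/data/granule_a.hdf.dmrpp',
    's3://example-bucket/data/granule_a.hdf',
]
data_links = [href for href in links if is_data_link(href)]
print(data_links)
```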

In [5]:
def search_cmr_modis(concept_id, temporal_str, page_size=2000):
    """
    Search CMR for MODIS granules matching temporal criteria.
    
    Returns:
        list: URLs of matching granules
    """
    granule_search_url = cmrurl + 'granules'
    
    search_params = {
        'concept_id': concept_id,
        'temporal': temporal_str,
        'page_size': page_size,
    }
    
    headers = {'Accept': 'application/json'}
    
    try:
        # Use a timeout so a stalled connection does not hang the notebook
        response = requests.get(granule_search_url, params=search_params, headers=headers, timeout=60)
        response.raise_for_status()
        granules = response.json()['feed']['entry']
        
        # Extract data file URLs (exclude metadata and auxiliary files)
        urls = []
        for g in granules:
            file_urls = [
                x['href'] for x in g.get('links', [])
                if 'https' in x['href']
                and any(ext in x['href'] for ext in ['.hdf', '.nc', '.h5', '.he5'])
                and '.dmrpp' not in x['href']
                and not any(x['href'].endswith(f'.{digit}') for digit in '0123456789')
                and not x['href'].endswith(('.xml', '.qa', '.ph', '.html'))
            ]
            urls.extend(file_urls)
        
        return urls
    
    except Exception as e:
        print(f"    Error searching CMR: {e}")
        return []


# Search for and download MODIS data for each pair
print("=" * 70)
print("SEARCHING FOR MYD021KM DATA")
print("=" * 70)
print()

# Maps pair name -> dict with 'urls', 'directory', and 'count'
download_summary = {}

for pair_name, info in pair_info.items():
    print(f"Pair: {pair_name}")
    print(f"  Time: {info['temporal_str']}")
    
    pair_urls = []
    
    # Search MYD021KM (always needed)
    product_key = 'MYD021KM'
    product_data = all_products[product_key]
    
    if not product_data['concept_id']:
        print(f"  - {product_key}: No concept ID available - SKIPPING")
    else:
        urls = search_cmr_modis(product_data['concept_id'], info['temporal_str'])
        
        if urls:
            print(f"  - {product_key}: Found {len(urls)} file(s)")
            pair_urls.extend(urls)
        else:
            print(f"  - {product_key}: No files found")
    
    # Search MYD03 (geolocation) only if it's not already present
    if not info['has_myd03']:
        product_key = 'MYD03'
        # MYD03 may not be configured in all_products, so avoid a KeyError
        product_data = all_products.get(product_key, {})
        
        if not product_data.get('concept_id'):
            print(f"  - {product_key}: No concept ID available - SKIPPING")
        else:
            urls = search_cmr_modis(product_data['concept_id'], info['temporal_str'])
            
            if urls:
                print(f"  - {product_key}: Found {len(urls)} file(s)")
                pair_urls.extend(urls)
            else:
                print(f"  - {product_key}: No files found")
    else:
        print("  - MYD03: Already present - SKIPPING")
    
    if pair_urls:
        download_summary[pair_name] = {
            'urls': pair_urls,
            'directory': info['directory'],
            'count': len(pair_urls)
        }
        print(f"  Total files to download: {len(pair_urls)}\n")
    else:
        print(f"  ⚠ No MODIS data found for this time period\n")

print("=" * 70)
print(f"Total pairs with MODIS data found: {len(download_summary)}")
print(f"Total files to download: {sum(v['count'] for v in download_summary.values())}")
print("=" * 70)
print()

SEARCHING FOR MYD021KM DATA

Pair: 2023_9_16_T191130_1
  Time: 2023-09-16T19:01:00Z,2023-09-16T19:11:00Z
  - MYD021KM: Found 3 file(s)
  - MYD03: Already present - SKIPPING
  Total files to download: 3

Total pairs with MODIS data found: 1
Total files to download: 3



## Download MODIS Files

Download the identified MYD021KM (and MYD03 if needed) files to their respective pair directories.
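
Before handing a URL list to `wget`, it can help to drop files that are already present in the pair directory. The sketch below is one possible pre-filter and is not part of the original notebook. The helper name `pending_urls` and the `example.com` URLs are illustrative. It compares filenames only, so a partially downloaded file would be treated as complete.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlsplit


def pending_urls(urls, directory):
    """Return the URLs whose filename does not yet exist in directory."""
    directory = Path(directory)
    pending = []
    for url in urls:
        filename = Path(urlsplit(url).path).name
        if not (directory / filename).exists():
            pending.append(url)
    return pending


# Demo with a temporary directory that already holds one of two files
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'granule_a.hdf').write_bytes(b'')
    demo_urls = [
        'https://example.com/data/granule_a.hdf',
        'https://example.com/data/granule_b.hdf',
    ]
    still_needed = pending_urls(demo_urls, tmp)

print(still_needed)
```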

In [6]:
print("=" * 70)
print("DOWNLOADING MODIS FILES")
print("=" * 70)
print()

total_downloaded = 0
total_failed = 0

for pair_name, download_info in download_summary.items():
    pair_dir = download_info['directory']
    urls = download_info['urls']
    
    print(f"Downloading to: {pair_name}/")
    print(f"  Files: {len(urls)}")
    
    # Create URL file for wget
    url_file = pair_dir / 'modis_urls_to_download.txt'
    with open(url_file, 'w') as f:
        for url in urls:
            f.write(url + '\n')
    
    # Download using wget
    try:
        result = subprocess.run(
            ['wget', '-P', str(pair_dir), '-i', str(url_file)],
            capture_output=True,
            text=True
        )
        
        # Count successful downloads. wget logs each completed file to stderr
        # with the word 'saved', so this is a rough count rather than an exact one.
        saved_count = result.stderr.count('saved') if result.stderr else 0
        total_downloaded += saved_count
        print(f"  ✓ Downloaded {saved_count} file(s)")
        
        # Clean up URL file if successful
        if result.returncode == 0:
            url_file.unlink()
        else:
            # Count only the files that were not reported as saved
            total_failed += max(len(urls) - saved_count, 0)
            print(f"  ⚠ Download completed with warnings (return code: {result.returncode})")
            print(f"    URL file saved: {url_file.name}")
    
    except Exception as e:
        total_failed += len(urls)
        print(f"  ✗ Error downloading: {e}")
        print(f"    URLs saved to: {url_file.name}")
    
    print()

print("=" * 70)
print("DOWNLOAD COMPLETE")
print("=" * 70)
print(f"Successfully downloaded: {total_downloaded} files")
if total_failed > 0:
    print(f"Failed/warnings: {total_failed} files")
print(f"Data location: {base_data_dir}")
print("=" * 70)

DOWNLOADING MODIS FILES

Downloading to: 2023_9_16_T191130_1/
  Files: 3
  ✓ Downloaded 3 file(s)

DOWNLOAD COMPLETE
Successfully downloaded: 3 files
Data location: /Users/andrewbuggee/Documents/MATLAB/Matlab-Research/Hyperspectral_Cloud_Retrievals/Batch_Scripts/Paper-2/coincident_EMIT_Aqua_data/


## Summary

The notebook has:
1. Scanned your existing coincident data directories
2. Identified pairs missing MYD021KM (MODIS calibrated radiance) data
3. Extracted timing information from AIRS filenames
4. Searched NASA CMR for matching MODIS granules (±5 minute window)
5. Downloaded MYD021KM files (and MYD03 geolocation if needed) to the appropriate directories

**About MYD021KM:**
- MODIS Level-1B calibrated radiances at 1km resolution
- Contains reflective solar and thermal emissive bands for atmospheric/surface studies
- Often used with MYD03 (geolocation) for precise georeferencing

**Note:** If you encounter download issues, check the following:
- Your `.netrc` file contains the correct NASA Earthdata credentials
- The file permissions are restrictive: `chmod 600 ~/.netrc`
- Whether any `*_urls_to_download.txt` files remain in the pair directories; these indicate incomplete downloads that can be retried manually
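
To locate URL lists left behind by incomplete downloads, a short scan of the pair directories is enough. The sketch below is illustrative: `find_leftover_url_lists` and `read_url_list` are hypothetical helpers, and the demo uses a temporary directory with a made-up URL rather than real data.

```python
import tempfile
from pathlib import Path


def find_leftover_url_lists(base_dir):
    """Return URL list files remaining in the pair subdirectories of base_dir."""
    return sorted(Path(base_dir).glob('*/*_urls_to_download.txt'))


def read_url_list(path):
    """Return the non-empty lines of a URL list file."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


# Demo: one pair directory with a leftover list containing a single URL
with tempfile.TemporaryDirectory() as tmp:
    pair_dir = Path(tmp) / 'pair_a'
    pair_dir.mkdir()
    (pair_dir / 'modis_urls_to_download.txt').write_text(
        'https://example.com/data/granule_a.hdf\n\n'
    )
    leftovers = find_leftover_url_lists(tmp)
    leftover_names = [p.name for p in leftovers]
    leftover_urls = [read_url_list(p) for p in leftovers]

print(leftover_names)
print(leftover_urls)
```

Each list returned by `read_url_list` can be written back to a file and passed to `wget -i`, as in the download cell.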