# NASA UMM-G to GeoCroissant Conversion - RAW Implementation

1. **Data Acquisition**: Fetching satellite imagery metadata from NASA Earthdata
2. **Data Processing**: Cloud coverage filtering and granule selection
3. **Format Conversion**: Converting UMM-G metadata to GeoCroissant with the RAW converter


## Step 1: Install Required Dependencies

Before starting the conversion process, we need to install the necessary Python packages:

- **`python-cmr`**: NASA Common Metadata Repository client
- **`earthaccess`**: Simplified access to NASA Earth science data
- **`mlcroissant`**: Machine Learning Croissant format support
- **`gdal`**: Geospatial Data Abstraction Library for raster processing

In [1]:
!pip install python-cmr
!pip install earthaccess
!pip install mlcroissant
!pip install gdal
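
After installation, a quick import check can confirm that the packages are visible to the kernel. The sketch below is an illustration and not part of the original notebook: the helper name `check_imports` is hypothetical. It relies on the fact that `python-cmr` and `gdal` are imported as `cmr` and `osgeo`, which differ from their pip names.

```python
import importlib.util

# pip package name -> module name used in `import` statements
IMPORT_NAMES = {
    "python-cmr": "cmr",
    "earthaccess": "earthaccess",
    "mlcroissant": "mlcroissant",
    "gdal": "osgeo",
}

def check_imports(import_names):
    """Return {pip_name: bool}, True when the top-level module can be located."""
    return {
        pip_name: importlib.util.find_spec(module) is not None
        for pip_name, module in import_names.items()
    }

# Report the status of each dependency without importing it
for pip_name, found in check_imports(IMPORT_NAMES).items():
    print(f"{pip_name}: {'installed' if found else 'missing'}")
```

A `missing` entry for `gdal` usually means the pip build could not find the system GDAL library, which has to be installed separately.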



## Step 2: NASA Earthdata Authentication & Data Query

This section demonstrates how to:
1. **Authenticate** with NASA Earthdata Login system
2. **Define geographic bounds** for Huntsville, Alabama area
3. **Query satellite data** with cloud coverage filters
4. **Retrieve granule metadata** for further processing

In [2]:
import earthaccess
import json

# Step 1: Authenticate with NASA Earthdata
earthaccess.login()

# Step 2: Define Huntsville, Alabama coordinates
# Huntsville coordinates: approximately 34.7304° N, 86.5861° W
huntsville_lat = 34.7304
huntsville_lon = -86.5861

# Create a bounding box around Huntsville (about 0.5 degrees in each direction)
bbox = [huntsville_lon - 0.5, huntsville_lat - 0.5, huntsville_lon + 0.5, huntsville_lat + 0.5]

# Step 3: Build the search query for Huntsville with cloud coverage filter
query = (earthaccess.DataGranules()
         .concept_id("C2021957295-LPCLOUD")
         .bounding_box(*bbox)  # Huntsville area
         .cloud_hosted(True)
         .cloud_cover(0, 20))  # Filter for 0-20% cloud coverage

# Step 4: Get total number of matching granules with low cloud cover
print(f"Total matching granules over Huntsville, AL with 0-20% cloud coverage: {query.hits()}")

# Step 5: Fetch just the first few granules
granules = list(query.get(10))  # Get first 10 granules
print(f"Fetched {len(granules)} granules")

# Step 6: Print the first sample
if granules:
    sample = granules[0]
    print("\nFirst granule metadata for Huntsville, AL:")
    #print(json.dumps(sample, indent=2))
    
    # Extract cloud coverage info if available
    if hasattr(sample, 'get') and 'CloudCover' in str(sample):
        print(f"\nCloud coverage information found in metadata")
    else:
        # Look for cloud coverage in the metadata structure
        metadata_str = json.dumps(sample, indent=2)
        if 'cloud' in metadata_str.lower():
            print(f"\nCloud-related information found in metadata")
else:
    print("No granules found over Huntsville, AL with 0-20% cloud coverage")

Enter your Earthdata Login username:  harshinde
Enter your Earthdata password:  ········


Total matching granules over Huntsville, AL with 0-20% cloud coverage: 1494
Fetched 10 granules

First granule metadata for Huntsville, AL:

Cloud-related information found in metadata
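
The bounding box in the query above is ordered west, south, east, north (lower-left longitude and latitude, then upper-right). The sketch below wraps that arithmetic in a small helper so the ordering is explicit; `bbox_around` and its `half_width_deg` parameter are hypothetical names introduced for illustration, and the range checks are an added assumption about what a caller would want.

```python
def bbox_around(lat, lon, half_width_deg=0.5):
    """Return (west, south, east, north) for a square box centred on (lat, lon)."""
    west, east = lon - half_width_deg, lon + half_width_deg
    south, north = lat - half_width_deg, lat + half_width_deg
    if south < -90 or north > 90:
        raise ValueError("box extends past a pole")
    if west < -180 or east > 180:
        raise ValueError("box crosses the antimeridian; split it into two boxes")
    return (west, south, east, north)

# Huntsville, Alabama, with the same half-width as the query above
bbox = bbox_around(34.7304, -86.5861)
print(bbox)
```

The returned tuple can be unpacked into the query in the same way as the list above, for example `.bounding_box(*bbox)`.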


## Step 3: Cloud Coverage Analysis

This section implements cloud coverage extraction and analysis:

### Cloud Coverage Extraction Function
- Parses the `AdditionalAttributes` array in UMM-G metadata
- Searches for `CLOUD_COVERAGE` attribute
- Returns percentage value for quality assessment

### Quality Assessment
- Displays cloud coverage for each retrieved granule
- Helps identify the best quality data for conversion
- Enables selection of optimal granules for processing

In [3]:
# Extract cloud coverage from the metadata
def get_cloud_coverage(granule):
    additional_attrs = granule.get('umm', {}).get('AdditionalAttributes', [])
    for attr in additional_attrs:
        if attr.get('Name') == 'CLOUD_COVERAGE':
            return float(attr.get('Values', ['0'])[0])
    return None

# Check the cloud coverage of your sample
cloud_cover = get_cloud_coverage(sample)
print(f"Cloud coverage: {cloud_cover}%")

# Check cloud coverage for all fetched granules
for i, granule in enumerate(granules):
    cloud_cover = get_cloud_coverage(granule)
    granule_id = granule.get('umm', {}).get('GranuleUR', 'Unknown')
    print(f"Granule {i+1}: {granule_id} - Cloud coverage: {cloud_cover}%")

Cloud coverage: 3.0%
Granule 1: HLS.S30.T16SED.2016011T162642.v2.0 - Cloud coverage: 3.0%
Granule 2: HLS.S30.T16SEE.2016011T162642.v2.0 - Cloud coverage: 2.0%
Granule 3: HLS.S30.T16SEC.2016011T162642.v2.0 - Cloud coverage: 13.0%
Granule 4: HLS.S30.T16SEC.2016041T162412.v2.0 - Cloud coverage: 13.0%
Granule 5: HLS.S30.T16SDE.2016114T163322.v2.0 - Cloud coverage: 18.0%
Granule 6: HLS.S30.T16SEC.2016114T163322.v2.0 - Cloud coverage: 8.0%
Granule 7: HLS.S30.T16SED.2016114T163322.v2.0 - Cloud coverage: 13.0%
Granule 8: HLS.S30.T16SDC.2016114T163322.v2.0 - Cloud coverage: 3.0%
Granule 9: HLS.S30.T16SDD.2016114T163322.v2.0 - Cloud coverage: 3.0%
Granule 10: HLS.S30.T16SDD.2016134T163332.v2.0 - Cloud coverage: 0.0%
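
Note that `get_cloud_coverage` above falls back to `'0'` when an attribute has no `Values` key, which would report a granule with missing data as perfectly clear, and it raises `IndexError` on an empty `Values` list. A more defensive variant might look like the sketch below. The function name `get_cloud_coverage_safe` and the mock records are hypothetical and exist only to exercise the edge cases.

```python
from typing import Any, Dict, Optional

def get_cloud_coverage_safe(granule: Dict[str, Any]) -> Optional[float]:
    """Return CLOUD_COVERAGE as a float, or None if absent, empty, or non-numeric."""
    attrs = granule.get("umm", {}).get("AdditionalAttributes", [])
    for attr in attrs:
        if attr.get("Name") != "CLOUD_COVERAGE":
            continue
        values = attr.get("Values") or []
        if not values:
            return None  # attribute present but carries no value
        try:
            return float(values[0])
        except (TypeError, ValueError):
            return None  # value is not a number
    return None

# Hypothetical, hand-written records shaped like the UMM-G fragment used above
mock_granules = [
    {"umm": {"GranuleUR": "A", "AdditionalAttributes": [{"Name": "CLOUD_COVERAGE", "Values": ["13"]}]}},
    {"umm": {"GranuleUR": "B", "AdditionalAttributes": [{"Name": "CLOUD_COVERAGE", "Values": []}]}},
    {"umm": {"GranuleUR": "C", "AdditionalAttributes": [{"Name": "CLOUD_COVERAGE", "Values": ["N/A"]}]}},
    {"umm": {"GranuleUR": "D"}},
]
for g in mock_granules:
    print(g["umm"]["GranuleUR"], get_cloud_coverage_safe(g))
```

Returning `None` for unknown values keeps them distinguishable from a genuine 0% reading when granules are ranked later.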


## Step 4: Granule Selection and Data Persistence

### Optimal Granule Selection
- Selects the granule with 0% cloud coverage (Granule 10 in the listing above) for best quality
- Saves the complete UMM-G metadata to JSON file
- Preserves all original metadata fields for comprehensive conversion

### Output File: `nasa_ummg_h.json`
This file contains the complete NASA UMM-G metadata structure including:
- Granule identification and temporal information
- Spatial extent and geometric data
- Platform and instrument details
- Additional attributes and quality metrics
- Data access URLs and distribution information

In [4]:
# Select Granule 10 (0% cloud coverage); list indices are zero-based, so it sits at index 9
selected_granule = granules[9]

# Save the complete UMM-G metadata to a JSON file
with open("nasa_ummg_h.json", "w") as f:
    json.dump(selected_granule, f, indent=2)

# Report the granule that was actually saved, read from its own metadata
granule_id = selected_granule.get('umm', {}).get('GranuleUR', 'Unknown')
cloud_cover = get_cloud_coverage(selected_granule)
print(f"Saved Granule 10 ({granule_id} - {cloud_cover}% cloud coverage) to nasa_ummg_h.json")
print(f"Granule ID: {granule_id}")
print(f"Cloud Coverage: {cloud_cover}%")

Saved Granule 10 (HLS.S30.T16SDD.2016134T163332.v2.0 - 0.0% cloud coverage) to nasa_ummg_h.json
Granule ID: HLS.S30.T16SDD.2016134T163332.v2.0
Cloud Coverage: 0.0%
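
Picking the granule by a fixed list index only works while the query returns results in the same order. As a sketch of an alternative, the clearest granule can be chosen by value. The helper names `cloud_coverage` and `pick_clearest` are hypothetical, and the mock list reuses the ten cloud coverage values printed in Step 3 with placeholder granule names.

```python
def cloud_coverage(granule):
    """Return the CLOUD_COVERAGE attribute as a float, or None if unavailable."""
    for attr in granule.get("umm", {}).get("AdditionalAttributes", []):
        if attr.get("Name") == "CLOUD_COVERAGE" and attr.get("Values"):
            return float(attr["Values"][0])
    return None

def pick_clearest(granules):
    """Return (index, cover) of the lowest cloud coverage; the first one wins a tie."""
    best_index, best_cover = None, None
    for i, granule in enumerate(granules):
        cover = cloud_coverage(granule)
        if cover is not None and (best_cover is None or cover < best_cover):
            best_index, best_cover = i, cover
    return best_index, best_cover

# Mock granules carrying the cloud coverage values from the Step 3 listing
covers = ["3", "2", "13", "13", "18", "8", "13", "3", "3", "0"]
mock_granules = [
    {"umm": {"GranuleUR": f"mock-{n}",
             "AdditionalAttributes": [{"Name": "CLOUD_COVERAGE", "Values": [c]}]}}
    for n, c in enumerate(covers, start=1)
]

index, cover = pick_clearest(mock_granules)
print(f"Clearest granule: list index {index} (Granule {index + 1}), {cover}% cloud coverage")
```

Granules without a usable cloud coverage value are skipped rather than treated as 0%.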


## Step 5: RAW Converter Implementation

### NASA UMM-G to GeoCroissant Converter
This is the core RAW converter. It maps UMM-G identification, spatial, temporal, platform, and distribution fields to GeoCroissant properties and wraps the result in a `cr:RecordSet`.


In [5]:
import json
from typing import Dict, List, Any, Optional

class NASAUMMGToGeoCroissantConverterCorrected:
    def __init__(self):
        self.namespaces = {
            "@context": {
                "sc": "http://schema.org/",
                "cr": "http://mlcommons.org/croissant/",
                "geocr": "http://mlcommons.org/croissant/geocr/",
                "dct": "http://purl.org/dc/terms/"
            }
        }

    def convert_polygon_to_wkt(self, points: List[Dict[str, float]]) -> str:
        if not points:
            return ""
        coords = [f"{p['Longitude']} {p['Latitude']}" for p in points]
        if coords and coords[0] != coords[-1]:
            coords.append(coords[0])
        return f"POLYGON(({', '.join(coords)}))"

    def calculate_bounding_box(self, points: List[Dict[str, float]]) -> Dict[str, float]:
        if not points:
            return {}
        lons = [p.get('Longitude', 0) for p in points]
        lats = [p.get('Latitude', 0) for p in points]
        return {
            "west": min(lons),
            "south": min(lats),
            "east": max(lons),
            "north": max(lats)
        }

    def find_additional_attribute(self, attributes: List[Dict], name: str) -> Optional[str]:
        for attr in attributes:
            if attr.get('Name') == name:
                values = attr.get('Values', [])
                return values[0] if values else None
        return None

    def extract_data_urls(self, related_urls: List[Dict]) -> List[str]:
        return [url_info.get('URL', '') for url_info in related_urls if url_info.get('Type') == 'GET DATA']

    def convert_to_geocroissant(self, ummg_data: Dict[str, Any]) -> Dict[str, Any]:
        meta = ummg_data.get('meta', {})
        umm = ummg_data.get('umm', {})
        geocroissant = {"@type": "geocr:SatelliteImagery"}

        if meta.get('concept-id'):
            geocroissant["@id"] = meta.get('concept-id')
        if umm.get('GranuleUR'):
            geocroissant["cr:name"] = umm.get('GranuleUR')
        if umm.get('CollectionReference', {}).get('EntryTitle'):
            geocroissant["cr:description"] = umm.get('CollectionReference', {}).get('EntryTitle')
        if meta.get('revision-date'):
            geocroissant["dct:temporal"] = meta.get('revision-date')

        spatial_extent = umm.get('SpatialExtent', {})
        if spatial_extent:
            horizontal_domain = spatial_extent.get('HorizontalSpatialDomain', {})
            geometry = horizontal_domain.get('Geometry', {})
            polygons = geometry.get('GPolygons', [])
            if polygons:
                points = polygons[0].get('Boundary', {}).get('Points', [])
                if points:
                    geocroissant["geocr:Geometry"] = self.convert_polygon_to_wkt(points)
                    bbox = self.calculate_bounding_box(points)
                    if bbox:
                        geocroissant["geocr:BoundingBox"] = bbox

        additional_attrs = umm.get('AdditionalAttributes', [])
        spatial_resolution = self.find_additional_attribute(additional_attrs, 'SPATIAL_RESOLUTION')
        if spatial_resolution:
            geocroissant["geocr:Resolution"] = float(spatial_resolution)

        temporal_extent = umm.get('TemporalExtent', {})
        if temporal_extent:
            range_datetime = temporal_extent.get('RangeDateTime', {})
            if range_datetime:
                geocroissant["geocr:temporalExtent"] = {
                    "geocr:start": range_datetime.get('BeginningDateTime'),
                    "geocr:end": range_datetime.get('EndingDateTime')
                }

        related_urls = umm.get('RelatedUrls', [])
        data_urls = self.extract_data_urls(related_urls)
        if data_urls:
            geocroissant["cr:distribution"] = [
                {
                    "@type": "cr:Distribution",
                    "sc:contentUrl": url,
                    "sc:encodingFormat": "image/tiff"
                } for url in data_urls
            ]

        platforms = umm.get('Platforms', [])
        if platforms:
            platform = platforms[0]
            if platform.get('ShortName'):
                geocroissant["geocr:observatory"] = platform.get('ShortName')
            instruments = platform.get('Instruments', [])
            if instruments and instruments[0].get('ShortName'):
                geocroissant["geocr:instrument"] = instruments[0].get('ShortName')

        geocroissant["geocr:measurementType"] = "multispectral_imagery"
        geocroissant["geocr:fileType"] = "GeoTIFF"
        geocroissant["geocr:numericalData"] = {
            "dataType": "surface_reflectance",
            "format": "raster"
        }

        allowed_fields = {
            "@id", "@type", "cr:name", "cr:description", "dct:temporal",
            "geocr:BoundingBox", "geocr:Geometry", "geocr:Resolution", "geocr:temporalExtent",
            "geocr:measurementType", "geocr:numericalData", "geocr:fileType",
            "geocr:instrument", "geocr:observatory", "cr:distribution",
            "sc:contentUrl", "sc:encodingFormat", "sc:format", "sc:temporal",
            "cr:Field", "cr:RecordSet"
        }
        filtered_geocroissant = {k: v for k, v in geocroissant.items() if k in allowed_fields}
        return filtered_geocroissant

    def convert_to_recordset(self, ummg_data: Dict[str, Any]) -> Dict[str, Any]:
        record = self.convert_to_geocroissant(ummg_data)
        recordset = {
            "@context": self.namespaces["@context"],
            "@type": "cr:RecordSet",
            "cr:record": [record]
        }
        return recordset
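
The two geometry helpers in the class above can be exercised on their own. The sketch below re-implements them as standalone functions (`polygon_to_wkt` and `bounding_box` are names chosen for this example) and feeds them the four corner points that appear in the `geocr:Geometry` value printed at the end of this notebook.

```python
from typing import Dict, List

def polygon_to_wkt(points: List[Dict[str, float]]) -> str:
    """Build a WKT POLYGON from UMM-G boundary points, closing the ring if needed."""
    if not points:
        return ""
    coords = [f"{p['Longitude']} {p['Latitude']}" for p in points]
    if coords[0] != coords[-1]:
        coords.append(coords[0])  # WKT rings must end on their first vertex
    return f"POLYGON(({', '.join(coords)}))"

def bounding_box(points: List[Dict[str, float]]) -> Dict[str, float]:
    """Return the min/max envelope of the points as west/south/east/north."""
    if not points:
        return {}
    lons = [p["Longitude"] for p in points]
    lats = [p["Latitude"] for p in points]
    return {"west": min(lons), "south": min(lats), "east": max(lons), "north": max(lats)}

# Corner points taken from the sample granule's printed geometry
points = [
    {"Longitude": -88.0861147, "Latitude": 34.24811019},
    {"Longitude": -86.8939975, "Latitude": 34.25287576},
    {"Longitude": -86.89272491, "Latitude": 35.24303017},
    {"Longitude": -88.09948036, "Latitude": 35.23808354},
]
print(polygon_to_wkt(points))
print(bounding_box(points))
```

A min/max envelope like this is only meaningful for polygons that do not cross the antimeridian. Unlike `calculate_bounding_box`, this variant raises `KeyError` on a point with a missing coordinate instead of substituting 0, which would silently stretch the box.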

## Step 6: Execute Conversion Process

### Conversion Workflow:
1. **Load UMM-G data** from the saved JSON file
2. **Initialize converter** with proper namespace configuration
3. **Execute conversion** using the RAW converter implementation
4. **Generate RecordSet** wrapper for Croissant compliance
5. **Save output** to `geocroissant_output_corrected.json`
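
Before relying on the saved file, it can help to run a few structural checks on the RecordSet. The sketch below is a hypothetical helper (`check_recordset`), not a Croissant schema validation: the keys it looks for are an assumption based on what the converter in Step 5 emits, and the sample record reuses the `@id` and `cr:name` values from this notebook's output.

```python
def check_recordset(recordset):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    if recordset.get("@type") != "cr:RecordSet":
        problems.append("top-level @type is not cr:RecordSet")
    context = recordset.get("@context", {})
    for prefix in ("sc", "cr", "geocr", "dct"):
        if prefix not in context:
            problems.append(f"missing namespace prefix: {prefix}")
    records = recordset.get("cr:record", [])
    if not records:
        problems.append("cr:record is empty")
    for i, record in enumerate(records):
        for key in ("@type", "@id", "cr:name"):
            if key not in record:
                problems.append(f"record {i}: missing {key}")
    return problems

# Minimal sample shaped like the converter output
sample_recordset = {
    "@context": {
        "sc": "http://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
        "geocr": "http://mlcommons.org/croissant/geocr/",
        "dct": "http://purl.org/dc/terms/",
    },
    "@type": "cr:RecordSet",
    "cr:record": [{
        "@type": "geocr:SatelliteImagery",
        "@id": "G2700719831-LPCLOUD",
        "cr:name": "HLS.S30.T16SDD.2016134T163332.v2.0",
    }],
}
print(check_recordset(sample_recordset) or "all checks passed")
```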

In [6]:
# Load the NASA UMM-G JSON
with open('nasa_ummg_h.json', 'r') as f:
    ummg_data = json.load(f)

# Convert to GeoCroissant and wrap in cr:RecordSet
converter = NASAUMMGToGeoCroissantConverterCorrected()
recordset_data = converter.convert_to_recordset(ummg_data)

# Save the output
with open('geocroissant_output_corrected.json', 'w') as f:
    json.dump(recordset_data, f, indent=2)

print("Conversion completed.")
print(f"GeoCroissant fields used: {list(recordset_data['cr:record'][0].keys())}")
print(f"Number of records: {len(recordset_data['cr:record'])}")
print(f"Output type: {recordset_data['@type']}")
print(f"Namespaces: {list(recordset_data['@context'].keys())}")

Conversion completed.
GeoCroissant fields used: ['@type', '@id', 'cr:name', 'cr:description', 'dct:temporal', 'geocr:Geometry', 'geocr:BoundingBox', 'geocr:Resolution', 'geocr:temporalExtent', 'cr:distribution', 'geocr:observatory', 'geocr:instrument', 'geocr:measurementType', 'geocr:fileType', 'geocr:numericalData']
Number of records: 1
Output type: cr:RecordSet
Namespaces: ['sc', 'cr', 'geocr', 'dct']


## Step 7: Field Mapping Analysis

### Comprehensive Field Mapping Assessment
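This analysis compares the key paths of the original UMM-G structure with the keys of the converted GeoCroissant record. The comparison is a case-insensitive substring match on key names, so it is only a heuristic: a field that the converter maps under a different name (for example `GranuleUR`, which becomes `cr:name`) is still reported as unmapped.

### Understanding Unmapped Fields
Fields that are genuinely unmapped remain so because of: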
- **Schema differences**: No equivalent representation in GeoCroissant
- **Technical constraints**: Fields requiring complex transformation
- **Namespace limitations**: Fields outside current GeoCroissant specification

In [7]:
def get_all_keys(obj, prefix=''):
    keys = set()
    if isinstance(obj, dict):
        for k, v in obj.items():
            full_key = f'{prefix}.{k}' if prefix else k
            keys.add(full_key)
            keys.update(get_all_keys(v, full_key))
    elif isinstance(obj, list):
        for item in obj:  # list items share the parent's key prefix
            keys.update(get_all_keys(item, prefix))
    return keys

ummg_keys = get_all_keys(ummg_data.get('umm', {}))
geocroissant_keys = get_all_keys(recordset_data['cr:record'][0])
allowed_geocroissant_fields = {
    "@id", "@type", "cr:name", "cr:description", "dct:temporal",
    "geocr:BoundingBox", "geocr:Geometry", "geocr:Resolution", "geocr:temporalExtent",
    "geocr:measurementType", "geocr:numericalData", "geocr:fileType",
    "geocr:instrument", "geocr:observatory", "cr:distribution",
    "sc:contentUrl", "sc:encodingFormat", "sc:format", "sc:temporal",
    "cr:Field", "cr:RecordSet"
}
mapped = set()
for gk in geocroissant_keys:
    for uk in ummg_keys:
        if gk.split(':')[-1].lower() in uk.lower() or uk.lower() in gk.split(':')[-1].lower():
            mapped.add(uk)
unmapped = sorted(ummg_keys - mapped)
print("Unmapped UMM-G fields:")
for field in unmapped:
    print(f"  {field}")

Unmapped UMM-G fields:
  AdditionalAttributes
  AdditionalAttributes.Values
  CollectionReference
  CollectionReference.EntryTitle
  DataGranule
  DataGranule.DayNightFlag
  DataGranule.Identifiers
  DataGranule.Identifiers.Identifier
  DataGranule.Identifiers.IdentifierType
  DataGranule.ProductionDateTime
  GranuleUR
  MetadataSpecification
  MetadataSpecification.URL
  MetadataSpecification.Version
  Platforms
  ProviderDates
  ProviderDates.Date
  ProviderDates.Type
  RelatedUrls
  RelatedUrls.Type
  RelatedUrls.URL
  SpatialExtent
  SpatialExtent.HorizontalSpatialDomain
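
Because the substring comparison above flags fields such as `GranuleUR` and `RelatedUrls.URL` even though the converter reads them, an explicit mapping table gives a more faithful picture. The sketch below is one possible approach, not the notebook's method: `EXPLICIT_MAPPING` is transcribed by hand from `convert_to_geocroissant`, and the helper names and the small sample key set are hypothetical.

```python
# UMM-G key path -> GeoCroissant property, transcribed from the Step 5 converter
EXPLICIT_MAPPING = {
    "GranuleUR": "cr:name",
    "CollectionReference.EntryTitle": "cr:description",
    "SpatialExtent.HorizontalSpatialDomain.Geometry.GPolygons.Boundary.Points":
        "geocr:Geometry, geocr:BoundingBox",
    "AdditionalAttributes": "geocr:Resolution (SPATIAL_RESOLUTION only)",
    "TemporalExtent.RangeDateTime.BeginningDateTime": "geocr:temporalExtent",
    "TemporalExtent.RangeDateTime.EndingDateTime": "geocr:temporalExtent",
    "RelatedUrls.URL": "cr:distribution",
    "RelatedUrls.Type": "cr:distribution (GET DATA filter)",
    "Platforms.ShortName": "geocr:observatory",
    "Platforms.Instruments.ShortName": "geocr:instrument",
}

def is_covered(key, mapped_paths):
    """True if key is a mapped path, a container of one, or nested under one."""
    return any(
        key == path or path.startswith(key + ".") or key.startswith(path + ".")
        for path in mapped_paths
    )

def unmapped_keys(ummg_keys, mapping):
    """Return the sorted UMM-G key paths that no mapping entry accounts for."""
    return sorted(k for k in ummg_keys if not is_covered(k, mapping))

# A small sample of key paths in the dotted form produced by get_all_keys
sample_keys = {
    "GranuleUR", "CollectionReference", "CollectionReference.EntryTitle",
    "DataGranule", "DataGranule.DayNightFlag", "DataGranule.ProductionDateTime",
    "ProviderDates", "ProviderDates.Date", "RelatedUrls", "RelatedUrls.URL",
    "Platforms", "Platforms.ShortName", "MetadataSpecification.Version",
}
for key in unmapped_keys(sample_keys, EXPLICIT_MAPPING):
    print(f"  {key}")
```

Path-level coverage is still coarse: `AdditionalAttributes` counts as covered even though only the `SPATIAL_RESOLUTION` attribute is carried over. The `@id` and `dct:temporal` properties come from the `meta` block, so they do not appear in a table keyed on `umm` paths.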


In [8]:
import json

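# Load and pretty-print the JSON.
# Note: this reads geocroissant_output_h.json, a different file from the
# geocroissant_output_corrected.json written in Step 6.
with open("geocroissant_output_h.json", "r") as f: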
    data = json.load(f)

# Pretty print with indentation
print(json.dumps(data, indent=2))

{
  "@context": {
    "sc": "http://schema.org/",
    "cr": "http://mlcommons.org/croissant/",
    "geocr": "http://mlcommons.org/croissant/geocr/",
    "dct": "http://purl.org/dc/terms/"
  },
  "@type": "cr:RecordSet",
  "cr:record": [
    {
      "@context": {
        "sc": "http://schema.org/",
        "cr": "http://mlcommons.org/croissant/",
        "geocr": "http://mlcommons.org/croissant/geocr/",
        "dct": "http://purl.org/dc/terms/"
      },
      "@type": "geocr:SatelliteImagery",
      "@id": "G2700719831-LPCLOUD",
      "cr:name": "HLS.S30.T16SDD.2016134T163332.v2.0",
      "cr:description": "HLS Sentinel-2 Multi-spectral Instrument Surface Reflectance Daily Global 30m v2.0",
      "dct:temporal": "2023-05-31T04:12:47.704Z",
      "geocr:Geometry": "POLYGON((-88.0861147 34.24811019, -86.8939975 34.25287576, -86.89272491 35.24303017, -88.09948036 35.23808354, -88.0861147 34.24811019))",
      "geocr:BoundingBox": {
        "west": -88.09948036,
        "south": 34.2481101