# Processing and Correcting NEON Hyperspectral Flight Lines for Scalable Spectral Data Analysis

Welcome to this vignette! This guide provides a detailed walkthrough for processing NEON (National Ecological Observatory Network) flight line data, taking you from raw downloads to actionable outputs. The workflow includes converting raw NEON flight lines into ENVI-compatible formats, applying essential data corrections (such as topographic and BRDF adjustments), and extracting hyperspectral data to build comprehensive tables for numerical and statistical analysis.

This workflow is designed for large datasets: it processes data in chunks and writes results to disk as it goes, so a full flight line can be handled on a machine with about 250 GB of RAM. By the end of this guide, you will be able to turn raw hyperspectral data into corrected datasets ready for ecological and environmental research.

---

## Table of Contents
1. [Introduction](#1-introduction)
2. [Prerequisites](#2-prerequisites)
3. [Python Setup](#3-python-setup)
4. [Understanding and Finding NEON Flight Lines](#4-understanding-and-finding-neon-flight-lines)
5. [Running the `jefe` Function](#5-running-the-jefe-function)
6. [Handling Large Data Processing](#handling-large-data-processing)
7. [Conclusion](#conclusion)

---

## 1. Introduction

Hyperspectral data collected along NEON (National Ecological Observatory Network) flight lines provides a high-resolution spectral view of the Earth's surface, capturing detailed information about vegetation, soils, water, and other environmental components. However, NEON distributes the reflectance data as HDF5 (`.h5`) files, which must be converted and corrected before they can be used for meaningful analysis.

This vignette walks step by step through downloading, converting, and processing NEON flight line data. Along the way, we create the specific file types needed for both corrections and analysis (ENVI-format rasters, masks, headers, and JSON configuration files), bridging the gap between raw hyperspectral data and actionable insights.

---

### **Why Extract Hyperspectral Signals?**
Hyperspectral data is invaluable for ecological and environmental research as it provides detailed spectral signatures across hundreds of bands. By extracting these signals and applying corrections, researchers can:
- **Translate Patterns Across Scales:** Connect fine-scale field measurements to broader regional or global observations.
- **Quantify Environmental Changes:** Monitor vegetation health, water quality, or land cover changes over time.
- **Improve Decision-Making Tools:** Build robust models for ecological resilience, biodiversity, and conservation planning.

---

### **Applications of This Workflow**

This workflow is particularly suited for:
- **Scaling Insights Across Spatial Domains:** Translating fine-scale hyperspectral data to broader landscapes ensures consistency and comparability across scales.
- **Monitoring Environmental Changes:** Creating corrected and high-fidelity datasets to track vegetation health, water quality, or land cover over time.
- **Enabling Cross-Sensor Calibration:** Harmonizing hyperspectral data across platforms by applying consistent corrections and resampling techniques.

---

## 2. Prerequisites

Before you begin, ensure you have the following:

- **Hardware Requirements:**
  - A machine with at least **250 GB RAM** to handle large datasets efficiently.

- **Software Requirements:**
  - Access to a pre-configured Python environment with necessary libraries installed, including:
    - `geopandas`, `rasterio`, `pandas`, `numpy`, `hytools`, `scikit-learn`, `matplotlib`, `requests`, `h5py`, `ray`, `spectral`.

- **Data Requirements:**
  - NEON flight line data.
  - Corresponding flight codes to identify and process the relevant flight lines.

- **Additional Tools:**
  - A Jupyter Notebook interface to follow this vignette step by step.
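
Before starting, it can help to confirm that the libraries listed above are importable. The following is a minimal sketch using only the standard library. The helper name `find_missing_modules` is illustrative, and the import names (for example `sklearn` for `scikit-learn`) are assumptions about how each package is imported.

```python
import importlib.util

# Import names assumed for the packages listed above; note that
# scikit-learn is imported as "sklearn".
REQUIRED_MODULES = [
    "geopandas", "rasterio", "pandas", "numpy", "hytools",
    "sklearn", "matplotlib", "requests", "h5py", "ray",
]


def find_missing_modules(module_names):
    """Return the names in module_names that cannot be found for import."""
    missing = []
    for name in module_names:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

Running this in the notebook kernel checks the same environment the rest of the vignette will use.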

---

## 3. Python Setup

To follow this vignette, you'll need a Python environment configured with the necessary dependencies. If you haven't set up your environment yet, follow the steps below to install the required tools and libraries. This guide assumes you are working in a Jupyter Notebook.

### Required Libraries
Ensure the following libraries are available in your environment:

In [31]:
import hytools as ht
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import random
import nbconvert
import time

In [32]:
%pip install spectral

Note: you may need to restart the kernel to use updated packages.


In [34]:
### Loading Earth Lab Spectral Tools

# 1. Enable autoreload in your Jupyter Notebook:

%load_ext autoreload
%autoreload 2

# 2. Import the custom tools module:

import spectral_unmixing_tools_original as el_spectral

# 3. Verify that the tools loaded correctly by printing the module's directory:

print(dir(el_spectral))

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
['ENVIProcessor', 'GradientBoostingRegressor', '__builtins__', '__cached__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'apply_topo_and_brdf_corrections', 'boosted_quantile_plot', 'boosted_quantile_plot_by_sensor', 'box', 'clean_data_and_write_to_csv', 'concatenate_sensors', 'control_function', 'download_neon_file', 'download_neon_flight_lines', 'extract_overlapping_layers_to_2d_dataframe', 'find_raster_files', 'fit_models_with_different_alpha', 'flight_lines_to_envi', 'generate_config_json', 'generate_correction_configs', 'generate_correction_configs_for_directory', 'get_spectral_data_and_wavelengths', 'glob', 'go_forth_and_multiply', 'gpd', 'h5py', 'ht', 'jefe', 'json', 'load_and_combine_rasters', 'load_spectra', 'mask', 'np', 'open_image', 'os', 'pd', 'plot_each_sensor_with_highlight', 'plot_spectral_data', 'plot_with_highlighted_sensors', 'plt', 'prepare_spectral_data', 'pr


## 4. Understanding and Finding NEON Flight Lines

NEON flight lines are aerial survey paths designed to collect high-resolution spectral data across various ecological sites. These datasets are vital for studying vegetation, soil, water bodies, and other environmental parameters, forming the foundation for many ecological and environmental analyses.

### **How to Find Flight Codes**
To process NEON flight lines with the `jefe` function, you’ll need the flight codes corresponding to your desired data. Follow these steps to find them:
1. **Access the NEON Data Portal:** Visit the [NEON Data Portal](https://data.neonscience.org/) to browse available datasets.
2. **Navigate to Flight Line Data:** Locate the section for flight line spectral data at your site of interest.
3. **Identify Relevant Flight Codes:** Each flight line has associated metadata, including its unique flight code. Record the codes for the lines you wish to process.
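
Flight codes can also be listed programmatically. The sketch below assumes the public NEON data API endpoint `https://data.neonscience.org/api/v0/data/{product}/{site}/{year-month}` and assumes its JSON response lists files under `data.files[].name`. The helper names are illustrative and are not part of the vignette's tools. Flight line identifiers are extracted from file names that follow the `NEON_<flight line>_reflectance.h5` pattern seen in this guide.

```python
import json
import re
import urllib.request

# Assumed endpoint layout of the public NEON data API.
NEON_API = "https://data.neonscience.org/api/v0/data/{product}/{site}/{year_month}"

# File names in this guide look like
# NEON_D13_NIWO_DP1_20200807_170802_reflectance.h5
FLIGHT_LINE_RE = re.compile(
    r"^NEON_(D\d{2}_[A-Z]{4}_DP1_\d{8}_\d{6})_reflectance\.h5$"
)


def flight_lines_from_names(file_names):
    """Return the sorted, unique flight line IDs found in file_names."""
    found = set()
    for name in file_names:
        match = FLIGHT_LINE_RE.match(name)
        if match:
            found.add(match.group(1))
    return sorted(found)


def list_flight_lines(product_code, site_code, year_month, timeout=60):
    """Query the NEON API (network access required) and list flight line IDs."""
    url = NEON_API.format(
        product=product_code, site=site_code, year_month=year_month
    )
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    # Assumed response shape: {"data": {"files": [{"name": ...}, ...]}}
    files = payload.get("data", {}).get("files", [])
    return flight_lines_from_names([entry.get("name", "") for entry in files])
```

If the assumptions above hold, a call such as `list_flight_lines("DP1.30006.001", "NIWO", "2020-08")` should return identifiers in the same form as `D13_NIWO_DP1_20200807_170802`.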

### **Important Considerations**
1. **Data Availability:**
   - NEON’s Airborne Observation Platform (AOP) data is generally available 60 days after the last collection day at a site.
   - Data collection schedules may shift due to weather or logistical factors. For the latest updates, consult the [NEON Flight Schedules and Coverage page](https://www.neonscience.org/data-collection/flight-schedules-coverage).

2. **Data Quality Updates:**
   - NEON regularly updates its data products to address quality concerns or implement new processing methods.
   - Stay informed about updates or changes that could affect your datasets by checking the [AOP Data Availability Notification](https://www.neonscience.org/impact/observatory-blog/aop-data-availability-notification-release-2024).

---

## 5. Running the `jefe` Function

The `jefe` function orchestrates the entire workflow, including converting flight lines into appropriate file formats, applying corrections, and extracting pixel data to build tables.

### Parameters for `jefe`

Each parameter of `jefe` is described below, along with where to find the value it needs.

#### **`base_folder` (str)**
- **Description:** The directory where output files will be stored.
- **How to Specify:** Choose or create a directory path on your local system where you want the processed data to be saved.

#### **`site_code` (str)**
- **Description:** The NEON site code representing the specific field site.
- **How to Find:**
  - NEON assigns unique four-letter codes to each field site (e.g., "NIWO" for Niwot Ridge).
  - You can find these codes on the [NEON Field Sites page](https://www.neonscience.org/field-sites/explore).

#### **`product_code` (str)**
- **Description:** The NEON data product code identifying the specific data product.
- **How to Find:**
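  - NEON data products have unique identifiers (e.g., "DP1.30003.001" for discrete return LiDAR point cloud data). The example in this vignette uses "DP1.30006.001", the flight line spectrometer reflectance product.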
  - Browse the [NEON Data Products Catalog](https://data.neonscience.org/data-products/explore) to locate the product code relevant to your research.

#### **`year_month` (str)**
- **Description:** The year and month of data collection in `'YYYY-MM'` format.
- **How to Determine:**
  - Data collection periods vary by site and product. Consult the [NEON Data Availability page](https://data.neonscience.org/visualizations/data-availability) to check when data was collected for your site and product of interest.
  - **Important Note:** Data availability is subject to change due to factors like weather conditions and program planning adjustments.

#### **`flight_lines` (list)**
- **Description:** A list of flight line codes to process.
- **How to Find:**
  - Flight line codes correspond to specific aerial survey paths.
  - Access the [NEON Data Portal](https://data.neonscience.org/) and navigate to the desired data product and site.
  - Flight line codes are typically listed in the metadata associated with each dataset.
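
Because a run takes hours, it may be worth checking the arguments before calling `jefe`. The sketch below is one way to do that. The patterns are inferred from the example values in this guide and are assumptions, not rules published by NEON or enforced by `jefe`, and `check_jefe_args` is a hypothetical helper.

```python
import re

# Patterns inferred from the examples in this guide; they are assumptions,
# not rules published by NEON or enforced by `jefe`.
SITE_RE = re.compile(r"^[A-Z]{4}$")
PRODUCT_RE = re.compile(r"^DP\d\.\d{5}\.\d{3}$")
YEAR_MONTH_RE = re.compile(r"^\d{4}-(0[1-9]|1[0-2])$")
FLIGHT_LINE_RE = re.compile(r"^D\d{2}_[A-Z]{4}_DP1_\d{8}_\d{6}$")


def check_jefe_args(base_folder, site_code, product_code, year_month, flight_lines):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if not isinstance(base_folder, str) or not base_folder:
        problems.append("base_folder must be a non-empty string")
    if not SITE_RE.match(site_code):
        problems.append(f"site_code {site_code!r} is not four uppercase letters")
    if not PRODUCT_RE.match(product_code):
        problems.append(f"product_code {product_code!r} does not look like 'DP1.30006.001'")
    if not YEAR_MONTH_RE.match(year_month):
        problems.append(f"year_month {year_month!r} is not in 'YYYY-MM' format")
    if not flight_lines:
        problems.append("flight_lines must contain at least one code")
    for code in flight_lines:
        if not FLIGHT_LINE_RE.match(code):
            problems.append(f"flight line {code!r} has an unexpected format")
        elif f"_{site_code}_" not in code:
            problems.append(f"flight line {code!r} does not mention site {site_code!r}")
    return problems
```

An empty result only means the values look plausible; it does not confirm that the data exists on the NEON portal.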

---




### Example Usage

In [35]:
# jefe takes 3-5 hours to run and creates a lot of files. You should have 200+ GB of RAM and storage available.
base_folder = "Next_try"
site_code = 'NIWO'
product_code = 'DP1.30006.001'
year_month = '2020-08'
flight_lines = [
    'D13_NIWO_DP1_20200807_170802'
]
# Known issue: BRDF correction fails when only one flight line is provided, but works when the list has more than one.

# Known issue: the run errors out after correction, when it should move on to translation. This is the step still being isolated.
# Run the jefe function with the provided example parameters
el_spectral.jefe(base_folder, site_code, product_code, year_month, flight_lines)

Processing flight line: D13_NIWO_DP1_20200807_170802
Data retrieved successfully for 2020-08!
Downloading NEON_D13_NIWO_DP1_20200807_170802_reflectance.h5 from https://storage.googleapis.com/neon-aop-products/2020/FullSite/D13/2020_NIWO_4/L1/Spectrometer/ReflectanceH5/2020080714/NEON_D13_NIWO_DP1_20200807_170802_reflectance.h5
Download completed for NEON_D13_NIWO_DP1_20200807_170802_reflectance.h5
Download completed.

Processing: ./NEON_D13_NIWO_DP1_20200807_170802_reflectance.h5
Error executing command: /opt/conda/envs/macrosystems/bin/python neon2envi2_generic.py --images './NEON_D13_NIWO_DP1_20200807_170802_reflectance.h5' --output_dir 'Next_try' -anc
Standard Output: Here we GO!

Error Output: 2024-12-18 20:58:58,945	ERROR services.py:1329 -- Failed to start the dashboard , return code 1
2024-12-18 20:58:58,946	ERROR services.py:1354 -- Error should be written to 'dashboard.log' or 'dashboard.err'. We are printing the last 20 lines for you. See 'https://docs.ray.io/en/master/ray-ob

In [3]:
import os
import glob
import numpy as np
import pandas as pd
import json
import rasterio
from spectral import open_image

# ----- Supporting Classes and Functions -----

class ENVIProcessor:
    def __init__(self, file_path):
        self.file_path = file_path
        self.data = None  # This will hold the raster data array
        self.file_type = "envi"

    def load_data(self):
        """Loads the raster data from the file_path into self.data"""
        with rasterio.open(self.file_path) as src:
            self.data = src.read()  # Read all bands

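    def get_chunk_from_extent(self, corrections=None, resample=False):
        """Returns the full raster array; `corrections` and `resample` are accepted but not used here."""
        self.load_data()  # Ensure data is loaded
        return self.data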


def find_raster_files(directory):
    """
    Searches for raster files in the given directory, capturing both original and corrected ENVI files,
    plus resampled ones, while excluding .hdr, .json, .csv, and any files containing '_mask' or '_ancillary'.
    We'll look for filenames containing '_reflectance' (original) or '_envi' (corrected/resampled).
    """
    pattern = "*"
    full_pattern = os.path.join(directory, pattern)
    all_files = glob.glob(full_pattern)

    filtered_files = [
        file for file in all_files
        if (
            ('_reflectance' in os.path.basename(file) or '_envi' in os.path.basename(file)) and
            '_mask' not in os.path.basename(file) and
            '_ancillary' not in os.path.basename(file) and
            not file.endswith('.hdr') and
            not file.endswith('.json') and
            not file.endswith('.csv')
        )
    ]

    found_files_set = set(filtered_files)
    found_files = list(found_files_set)
    found_files.sort()

    return found_files


def load_and_combine_rasters(raster_paths):
    """
    Loads and combines raster data from a list of file paths.
    Assumes each raster has shape (bands, rows, cols) and that
    all rasters can be concatenated along the band dimension.
    """
    chunks = []
    for path in raster_paths:
        processor = ENVIProcessor(path)
        chunk = processor.get_chunk_from_extent(corrections=['some_correction'], resample=False)
        chunks.append(chunk)
    combined_array = np.concatenate(chunks, axis=0)  # Combine along the first axis (bands)
    return combined_array


def process_and_flatten_array(array, json_dir='Resampling', original_bands=426, corrected_bands=426,
                              original_wavelengths=None, corrected_wavelengths=None, folder_name=None,
                              map_info=None):
    """
    Processes a 3D numpy array to a DataFrame, adds metadata columns, 
    renames columns dynamically based on JSON configuration, and adds Pixel_id.
    Uses provided wavelength lists to name original and corrected bands, and includes geocoordinates.

    Parameters:
    - array: A 3D numpy array of shape (bands, rows, cols).
    - json_dir: Directory containing the landsat_band_parameters.json file.
    - original_bands: Number of original bands expected.
    - corrected_bands: Number of corrected bands expected.
    - original_wavelengths: List of wavelengths for the original bands (floats).
    - corrected_wavelengths: List of wavelengths for the corrected bands (floats).
    - folder_name: Name of the subdirectory (flight line identifier).
    - map_info: The map info array from the metadata for georeferencing.

    Returns:
    - A pandas DataFrame with additional metadata columns and renamed band columns.
    """
    if len(array.shape) != 3:
        raise ValueError("Input array must be 3-dimensional. Expected (bands, rows, cols).")

    json_file = os.path.join(json_dir, 'landsat_band_parameters.json')
    if not os.path.isfile(json_file):
        raise FileNotFoundError(f"JSON file not found: {json_file}")

    with open(json_file, 'r') as f:
        config = json.load(f)

    bands, rows, cols = array.shape
    print(f"[DEBUG] array shape: bands={bands}, rows={rows}, cols={cols}")

    reshaped_array = array.reshape(bands, -1).T  # (pixels, bands)
    pixel_indices = np.indices((rows, cols)).reshape(2, -1).T  # (pixels, 2)
    df = pd.DataFrame(reshaped_array, columns=[f'Band_{i+1}' for i in range(bands)])

    # Extract map info for georeferencing.
    # ENVI format: [projection, x_pixel_start, y_pixel_start, map_x, map_y, x_res, y_res, ...]
    # Typically:
    #   x_pixel_start, y_pixel_start = 1, 1 (1-based reference pixel, usually the upper-left pixel)
    #   map_x, map_y = map coordinates of that reference pixel
    #   x_res, y_res = pixel sizes (y_res is stored as a positive value; northing decreases as the row index increases)
    if map_info is not None and len(map_info) >= 7:
        projection = map_info[0]
        x_pixel_start = float(map_info[1])
        y_pixel_start = float(map_info[2])
        map_x = float(map_info[3])
        map_y = float(map_info[4])
        x_res = float(map_info[5])
        y_res = float(map_info[6])
    else:
        # Fallback if map_info is not provided
        projection = 'Unknown'
        x_pixel_start, y_pixel_start = 1.0, 1.0
        map_x, map_y = 0.0, 0.0
        x_res, y_res = 1.0, 1.0

    # Compute Easting, Northing
    # Pixel_row and Pixel_col are zero-based. 
    # According to ENVI conventions:
    # Easting = map_x + (pixel_col - (x_pixel_start - 1)) * x_res
    # Northing = map_y - (pixel_row - (y_pixel_start - 1)) * y_res
    pixel_row = pixel_indices[:, 0]
    pixel_col = pixel_indices[:, 1]
    Easting = map_x + (pixel_col - (x_pixel_start - 1)) * x_res
    Northing = map_y - (pixel_row - (y_pixel_start - 1)) * y_res

    # Insert Pixel info and coordinates
    df.insert(0, 'Pixel_Col', pixel_col)
    df.insert(0, 'Pixel_Row', pixel_row)
    df.insert(0, 'Pixel_id', np.arange(len(df)))
    df.insert(3, 'Easting', Easting)
    df.insert(4, 'Northing', Northing)

    # Check we have enough bands
    if bands < (original_bands + corrected_bands):
        raise ValueError(
            f"Not enough bands. Expected at least {original_bands + corrected_bands} (original+corrected), but got {bands}."
        )

    # Determine Corrected and Resampled flags
    remaining_bands = bands - (original_bands + corrected_bands)
    corrected_flag = "Yes" if corrected_bands > 0 else "No"
    resampled_flag = "Yes" if remaining_bands > 0 else "No"

    # Metadata columns: Subdirectory, Data_Source, Sensor_Type, Corrected, Resampled
    # Insert these at the very front
    df.insert(0, 'Resampled', resampled_flag)
    df.insert(0, 'Corrected', corrected_flag)
    df.insert(0, 'Sensor_Type', 'Hyperspectral')
    df.insert(0, 'Data_Source', 'Flight line')
    df.insert(0, 'Subdirectory', folder_name if folder_name else 'Unknown')

    # Rename bands with wavelengths
    band_names = []
    # Original bands
    if original_wavelengths is not None and len(original_wavelengths) >= original_bands:
        for i in range(original_bands):
            wl = original_wavelengths[i]
            band_names.append(f"Original_band_{i+1}_wl_{wl}nm")
    else:
        for i in range(1, original_bands + 1):
            band_names.append(f"Original_band_{i}")

    # Corrected bands
    if corrected_wavelengths is not None and len(corrected_wavelengths) >= corrected_bands:
        for i in range(corrected_bands):
            wl = corrected_wavelengths[i]
            band_names.append(f"Corrected_band_{i+1}_wl_{wl}nm")
    elif original_wavelengths is not None and len(original_wavelengths) >= corrected_bands:
        for i in range(corrected_bands):
            wl = original_wavelengths[i]
            band_names.append(f"Corrected_band_{i+1}_wl_{wl}nm")
    else:
        for i in range(1, corrected_bands + 1):
            band_names.append(f"Corrected_band_{i}")

    print(f"[DEBUG] remaining_bands for resampled sensors: {remaining_bands}")

    sensor_bands_assigned = 0
    for sensor, details in config.items():
        wavelengths = details.get('wavelengths', [])
        for i, wl in enumerate(wavelengths, start=1):
            if sensor_bands_assigned < remaining_bands:
                band_names.append(f"{sensor}_band_{i}_wl_{wl}nm")
                sensor_bands_assigned += 1
            else:
                break
        if sensor_bands_assigned >= remaining_bands:
            break

    if sensor_bands_assigned < remaining_bands:
        extra = remaining_bands - sensor_bands_assigned
        print(f"[DEBUG] {extra} leftover bands have no matching sensors/wavelengths in JSON. Naming them generically.")
        for i in range(1, extra + 1):
            band_names.append(f"Unassigned_band_{i}")

    # Now we have Pixel_id, Pixel_Row, Pixel_Col, Easting, Northing, and multiple metadata columns.
    # Determine how many leading metadata columns we have before bands:
    # Currently: Subdirectory, Data_Source, Sensor_Type, Corrected, Resampled, Pixel_id, Pixel_Row, Pixel_Col, Easting, Northing
    # That's 10 columns before bands start.
    metadata_count = 10

    new_columns = list(df.columns[:metadata_count]) + band_names
    if len(new_columns) != df.shape[1]:
        raise ValueError(
            f"Band naming mismatch: {len(new_columns)} columns assigned vs {df.shape[1]} in df. Check indexing."
        )

    df.columns = new_columns

    print(f"[DEBUG] Final DataFrame shape: {df.shape}")
    print("[DEBUG] Columns assigned successfully.")

    return df


def clean_data_and_write_to_csv(df, output_csv_path, chunk_size=100000):
    """
    Cleans a large DataFrame by processing it in chunks and then writes it to a CSV file.
    """
    total_rows = df.shape[0]
    num_chunks = (total_rows // chunk_size) + (1 if total_rows % chunk_size else 0)

    print(f"Cleaning data and writing to CSV in {num_chunks} chunk(s).")

    first_chunk = True
    for i, start_row in enumerate(range(0, total_rows, chunk_size)):
        chunk = df.iloc[start_row:start_row + chunk_size].copy()
        # Spectral columns are everything except pixel indices, coordinates, and metadata
        spectral_cols = [col for col in chunk.columns if not col.startswith('Pixel') and 
                         col not in ['Subdirectory','Data_Source','Sensor_Type','Corrected','Resampled',
                                     'Easting','Northing']]

        # Replace the -9999 no-data value with NaN
        chunk[spectral_cols] = chunk[spectral_cols].apply(
            lambda x: np.where(np.isclose(x, -9999, atol=1), np.nan, x)
        )

        # Drop rows where every spectral column is NaN
        chunk.dropna(subset=spectral_cols, how='all', inplace=True)

        # Write the header only once, then append
        mode = 'w' if first_chunk else 'a'
        header = first_chunk
        chunk.to_csv(output_csv_path, mode=mode, header=header, index=False)

        print(f"Chunk {i+1}/{num_chunks} processed and written.")
        first_chunk = False

    print(f"Data cleaning complete. Output written to: {output_csv_path}")


def control_function(directory):
    """
    Orchestrates the finding, loading, processing of raster files found in a specified directory,
    cleans the processed data, and saves it to a CSV file in the same directory.
    """
    raster_paths = find_raster_files(directory)

    if not raster_paths:
        print(f"No matching raster files found in {directory}.")
        return

    # Assume original file name (without _envi etc.) is the directory name
    base_name = os.path.basename(os.path.normpath(directory))
    hdr_file = os.path.join(os.path.dirname(directory), base_name + '.hdr')
    if not os.path.isfile(hdr_file):
        hdr_file = os.path.join(directory, base_name + '.hdr')

    original_wavelengths = None
    map_info = None
    if os.path.isfile(hdr_file):
        img = open_image(hdr_file)
        original_wavelengths = img.metadata.get('wavelength', [])
        # Convert to float if they are strings
        original_wavelengths = [float(w) for w in original_wavelengths]
        map_info = img.metadata.get('map info', None)
    else:
        print(f"No HDR file found at {hdr_file}. Will use generic band names and no geocoords.")

    corrected_wavelengths = original_wavelengths

    # Load and combine raster data
    combined_array = load_and_combine_rasters(raster_paths)  
    print(f"Combined array shape for directory {directory}: {combined_array.shape}")

    # Attempt to process and flatten the array into a DataFrame
    try:
        df_processed = process_and_flatten_array(
            combined_array,
            json_dir='Resampling',
            original_bands=426,
            corrected_bands=426,
            original_wavelengths=original_wavelengths,
            corrected_wavelengths=corrected_wavelengths,
            folder_name=base_name,
            map_info=map_info
        )  
        print(f"DataFrame shape after flattening for directory {directory}: {df_processed.shape}")
    except ValueError as e:
        print(f"ValueError encountered during processing of {directory}: {e}")
        print("Check the number of bands vs. the expected column names in process_and_flatten_array().")
        return
    except Exception as e:
        print(f"An unexpected error occurred while processing {directory}: {e}")
        return

    # Extract the folder name from the directory path
    folder_name = os.path.basename(os.path.normpath(directory))
    output_csv_name = f"{folder_name}_spectral_data_all_sensors.csv"
    output_csv_path = os.path.join(directory, output_csv_name)

    # Always overwrite if CSV exists
    if os.path.exists(output_csv_path):
        print(f"CSV {output_csv_path} already exists and will be overwritten.")

    # Clean data and write to CSV
    clean_data_and_write_to_csv(df_processed, output_csv_path)  
    print(f"Processed and cleaned data saved to {output_csv_path}")


def process_all_subdirectories(parent_directory):
    """
    Searches for all subdirectories within the given parent directory, excluding non-directory files,
    and applies raster file processing to each subdirectory found.
    """
    for item in os.listdir(parent_directory):
        full_path = os.path.join(parent_directory, item)
        if os.path.isdir(full_path):
            try:
                control_function(full_path)
                print(f"Finished processing for directory: {full_path}")
            except Exception as e:
                print(f"Error processing directory '{full_path}': {e}")
        else:
            print(f"Skipping non-directory item: {full_path}")


In [4]:
base_folder = "New_Test"
process_all_subdirectories(base_folder)

Combined array shape for directory New_Test/NEON_D13_NIWO_DP1_20200807_170802_reflectance: (897, 10487, 994)
[DEBUG] array shape: bands=897, rows=10487, cols=994
[DEBUG] remaining_bands for resampled sensors: 45
[DEBUG] Final DataFrame shape: (10424078, 907)
[DEBUG] Columns assigned successfully.
DataFrame shape after flattening for directory New_Test/NEON_D13_NIWO_DP1_20200807_170802_reflectance: (10424078, 907)
CSV New_Test/NEON_D13_NIWO_DP1_20200807_170802_reflectance/NEON_D13_NIWO_DP1_20200807_170802_reflectance_spectral_data_all_sensors.csv already exists and will be overwritten.
Cleaning data and writing to CSV in 105 chunk(s).
Chunk 1/105 processed and written.
Chunk 2/105 processed and written.
Chunk 3/105 processed and written.
Chunk 4/105 processed and written.
Chunk 5/105 processed and written.
Chunk 6/105 processed and written.
Chunk 7/105 processed and written.
Chunk 8/105 processed and written.
Chunk 9/105 processed and written.
Chunk 10/105 processed and written.
Chunk 1

KeyboardInterrupt: 

In [17]:
import pandas as pd

# Define the file path
csv_file = "New_Test/NEON_D13_NIWO_DP1_20200807_170802_reflectance/NEON_D13_NIWO_DP1_20200807_170802_reflectance_spectral_data_all_sensors.csv"

# Load the CSV file
try:
    data = pd.read_csv(csv_file)
    
    # Preview the 12th column
    if data.shape[1] >= 12:  # Ensure there are at least 12 columns
        twelfth_column = data.iloc[:, 11]  # Column indices are zero-based
        twelfth_column_cleaned = twelfth_column.dropna()  # Remove NaN values
        print("12th Column Data (NaN values removed):")
        print(twelfth_column_cleaned)
    else:
        print("The file does not have 12 columns.")
except FileNotFoundError:
    print(f"The file at {csv_file} was not found.")
except Exception as e:
    print(f"An error occurred: {e}")


12th Column Data (NaN values removed):
651          0.0
1639        17.0
1640         0.0
1641         0.0
1642         0.0
           ...  
4117842    124.0
4117843     79.0
4117844     98.0
4117845    227.0
4117846      0.0
Name: Original_band_2_wl_386.674988nm, Length: 1972503, dtype: float64
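
The cell above loads the entire CSV before selecting one column. For a table with millions of rows and roughly 900 columns, reading only the needed column in chunks keeps memory use lower. The sketch below uses the `usecols` and `chunksize` options of `pandas.read_csv`; `preview_column` is a hypothetical helper, not part of the vignette's tools.

```python
import pandas as pd


def preview_column(csv_path, column_index, chunk_size=100_000):
    """Read one column of a large CSV in chunks and return it without NaN values."""
    pieces = []
    # usecols limits parsing to one column; chunksize bounds the rows held at once.
    reader = pd.read_csv(csv_path, usecols=[column_index], chunksize=chunk_size)
    for chunk in reader:
        pieces.append(chunk.iloc[:, 0].dropna())
    if not pieces:
        return pd.Series(dtype="float64")
    return pd.concat(pieces)
```

For example, `preview_column(csv_file, 11)` would select the same zero-based column index as the cell above.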


### What Happens When `jefe` Runs

When you run the `jefe` function, a sequence of operations is executed, and multiple outputs are generated. Here's a detailed breakdown:

1. **Downloading Raw Data:**
   - The original NEON flight line folder is downloaded to the specified output directory.
   - The raw folder contains the reflectance data and associated metadata files.

2. **Conversion to Multiple Formats:**
   - The downloaded folder is processed to generate additional formats required for analysis.
   - These files are named systematically to represent the processing step or correction applied. For example:
     - **`_envi`:** Reflectance data in ENVI format.
     - **`_envi_mask`:** Mask files indicating areas to include or exclude during analysis.
     - **`.hdr`:** Header files describing the structure of the associated data.
     - **`.json`:** Configuration files for corrections and processing steps.

3. **Application of Corrections:**
   - Topographic corrections (TOPO) and bidirectional reflectance distribution function (BRDF) corrections are applied to ensure data accuracy.
   - Outputs include:
     - **`_brdf_coeffs__envi.json`:** Coefficients for BRDF corrections.
     - **`_topo_coeffs__envi.json`:** Coefficients for topographic corrections.

4. **Data Extraction and Processing:**
   - Spectral data is extracted pixel by pixel and saved in tabular formats for further analysis.
   - These extractions are saved incrementally to avoid memory overuse.
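
The extraction step amounts to reshaping a `(bands, rows, cols)` cube into one row per pixel and deriving map coordinates from the ENVI `map info` values. The sketch below follows the same formulas used in `process_and_flatten_array`; the function names are illustrative and the map values in the example are made up.

```python
import numpy as np


def flatten_cube(cube):
    """Reshape a (bands, rows, cols) array into (rows * cols, bands)."""
    bands, rows, cols = cube.shape
    return cube.reshape(bands, rows * cols).T


def pixel_coordinates(rows, cols, map_x, map_y, x_res, y_res, ref_col=1.0, ref_row=1.0):
    """Return (easting, northing) arrays, one value per pixel in row-major order.

    map_x and map_y are the coordinates tied to the 1-based reference pixel
    (ref_col, ref_row); northing decreases as the row index grows.
    """
    row_idx, col_idx = np.indices((rows, cols))
    easting = map_x + (col_idx.ravel() - (ref_col - 1)) * x_res
    northing = map_y - (row_idx.ravel() - (ref_row - 1)) * y_res
    return easting, northing


# Tiny made-up example: 2 bands, 2 rows, 3 columns, 1 m pixels.
cube = np.arange(12).reshape(2, 2, 3)
table = flatten_cube(cube)
easting, northing = pixel_coordinates(
    2, 3, map_x=500.0, map_y=1000.0, x_res=1.0, y_res=1.0
)
```

Each row of `table` holds the band values of one pixel, in the same order as `easting` and `northing`.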

---

### Example Outputs from a Single Flight Line

After running the `jefe` function, the output directory contains processed files at the top level and a folder for the original raw data. Here’s what you can expect for a single trial run:

---

#### **Main Output Directory:**
- **Processed Files:** Includes ENVI-format files, masks, headers, and configuration files. These represent the final processed outputs ready for analysis.
- **Raw Folder:** A subdirectory containing the original reflectance data downloaded from NEON.

| File Name                                             | Description                                         |
|-------------------------------------------------------|-----------------------------------------------------|
| `NEON_D13_NIWO_DP1_20200807_170802_reflectance__envi` | Reflectance data converted to ENVI format.         |
| `NEON_D13_NIWO_DP1_20200807_170802_reflectance__mask` | Mask file for the reflectance data.                |
| `NEON_D13_NIWO_DP1_20200807_170802_reflectance.hdr`   | Header file describing the ENVI data structure.    |
| `NEON_D13_NIWO_DP1_20200807_170802_reflectance__brdf_coeffs__envi.json` | BRDF correction coefficients. |
| `NEON_D13_NIWO_DP1_20200807_170802_reflectance__topo_coeffs__envi.json` | TOPO correction coefficients. |

---

#### **Raw Folder (Inside the Output Directory):**
- **Original Files:** Contains the raw reflectance data downloaded directly from NEON before any processing steps.

| File Name                                             | Description                                         |
|-------------------------------------------------------|-----------------------------------------------------|
| `NEON_D13_NIWO_DP1_20200807_170802_reflectance`       | Original reflectance data from NEON.               |
| `NEON_D13_NIWO_DP1_20200807_170802_reflectance_ancillary` | Ancillary metadata for corrections.               |
| `NEON_D13_NIWO_DP1_20200807_170802_reflectance_config__envi.json` | Configuration for ENVI data processing. |
| `NEON_D13_NIWO_DP1_20200807_170802_reflectance_config__anc.json`  | Configuration for ancillary corrections. |

---

This structure ensures that:
1. The **processed files** are readily available in the main directory for analysis.
2. The **raw data** is preserved in its original form for reference or reprocessing if needed.

By organizing outputs this way, you can easily navigate between raw and processed data while maintaining a clear workflow history.
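
To check a finished run against the tables above, a small inventory script can help. The following is a minimal sketch that relies only on the file-name suffixes shown in the example; the role labels and the `classify` and `inventory` helpers are hypothetical names introduced for illustration and are not part of the `jefe` workflow.

```python
from pathlib import Path

# Suffixes copied from the example file names above. The role labels on the
# right are illustrative. More specific suffixes are listed first so that
# "__brdf_coeffs__envi.json" is not mistaken for a generic match.
ROLE_SUFFIXES = [
    ("__brdf_coeffs__envi.json", "brdf_coefficients"),
    ("__topo_coeffs__envi.json", "topo_coefficients"),
    ("_config__envi.json", "envi_config"),
    ("_config__anc.json", "ancillary_config"),
    ("__envi", "envi_reflectance"),
    ("__mask", "mask"),
    (".hdr", "envi_header"),
    ("_ancillary", "ancillary"),
    ("_reflectance", "raw_reflectance"),
]


def classify(name: str) -> str:
    """Return the role of a file based on the first matching suffix."""
    for suffix, role in ROLE_SUFFIXES:
        if name.endswith(suffix):
            return role
    return "other"


def inventory(output_dir) -> dict:
    """Map every file under output_dir (searched recursively) to a role.

    Keys are paths relative to output_dir, written with forward slashes.
    """
    root = Path(output_dir)
    return {
        p.relative_to(root).as_posix(): classify(p.name)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
```

Calling `inventory("path/to/output_dir")` returns a dictionary that can be printed or compared against the expected roles. Because the search is recursive, files in the raw subfolder are included whatever that folder is named.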


### Process Overview

1. **Data Conversion:**
   - Converts NEON reflectance data to formats compatible with ENVI tools and downstream analyses.

2. **Data Corrections:**
   - Applies topographic and BRDF corrections to improve data quality.

3. **Outputs Generated:**
   - Corrected reflectance data in ENVI format.
   - Mask files for regions of interest.
   - Configuration files describing the processing steps.
   - Coefficients for TOPO and BRDF corrections.

By the end of this process, you will have a comprehensive set of files ready for analysis, including corrected reflectance data, metadata, and configurations.
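
Before passing a converted file to downstream tools, it can be useful to confirm its dimensions from the ENVI header. ENVI headers are plain text: the first line is `ENVI`, followed by `key = value` entries, where a value wrapped in braces may span several lines. The sketch below is a minimal parser under those assumptions; `parse_envi_header` and `cube_shape` are hypothetical helper names, and a brace value is always split on commas, which is adequate for numeric lists such as `wavelength` but would also split a free-text `description`.

```python
import re

# One header entry: a key, an equals sign, then either a brace-wrapped value
# (which may span lines) or the remainder of the line.
_ENTRY = re.compile(
    r"^[ \t]*([^=\n]+?)[ \t]*=[ \t]*(\{.*?\}|[^\n]*)",
    re.MULTILINE | re.DOTALL,
)


def parse_envi_header(text: str) -> dict:
    """Parse ENVI header text into a dict with lowercase keys.

    Scalar values stay as strings; brace-wrapped values become lists of strings.
    """
    lines = text.strip().splitlines()
    if not lines or lines[0].strip() != "ENVI":
        raise ValueError("not an ENVI header: first line must be 'ENVI'")
    body = "\n".join(lines[1:])

    fields = {}
    for key, raw in _ENTRY.findall(body):
        raw = raw.strip()
        if raw.startswith("{") and raw.endswith("}"):
            items = [item.strip() for item in raw[1:-1].split(",")]
            fields[key.lower()] = [item for item in items if item]
        else:
            fields[key.lower()] = raw
    return fields


def cube_shape(fields: dict) -> tuple:
    """Return (lines, samples, bands) as integers from parsed header fields."""
    return int(fields["lines"]), int(fields["samples"]), int(fields["bands"])
```

A typical use is `cube_shape(parse_envi_header(Path(hdr_path).read_text()))`, where `hdr_path` points at one of the `.hdr` files in the output directory.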

---

## 7. Handling Large Data Processing<a name="handling-large-data-processing"></a>

Processing NEON flight lines involves managing large amounts of spectral data. This workflow incorporates strategies to optimize memory usage and prevent bottlenecks.

### Key Strategies

1. **Chunk Processing:** Processes data in smaller chunks to avoid memory overload.
2. **Direct Disk Writing:** Writes intermediate and final results directly to disk instead of accumulating them in memory.
3. **Optimized Data Structures:** Uses efficient formats like NumPy arrays and Pandas DataFrames.
4. **Parallel Processing:** Utilizes libraries like `ray` for distributed processing.
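
The first two strategies can be sketched together. The example below is an illustration rather than the workflow's actual implementation: it assumes the cube is a headerless binary file of `float32` values laid out as (lines, samples, bands), and `write_pixel_table` is a hypothetical function name. It uses `numpy.memmap` so that only the current block of image lines is read into RAM, and it appends each block's rows to a CSV file as soon as they are ready. Parallel execution with `ray` is not shown here.

```python
import csv

import numpy as np


def write_pixel_table(cube_path, shape, csv_path, chunk_rows=256, dtype=np.float32):
    """Stream a (lines, samples, bands) cube to a per-pixel CSV in row chunks.

    Assumes a headerless binary file in C order with the given dtype.
    Returns the number of pixel rows written.
    """
    n_lines, n_samples, n_bands = shape
    # memmap maps the file without loading it; slices are read on demand.
    cube = np.memmap(cube_path, dtype=dtype, mode="r", shape=shape)

    rows_written = 0
    with open(csv_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        header = ["line", "sample"] + [f"band_{b + 1}" for b in range(n_bands)]
        writer.writerow(header)

        for start in range(0, n_lines, chunk_rows):
            stop = min(start + chunk_rows, n_lines)
            # Copy only this block of image lines into memory.
            block = np.asarray(cube[start:stop])
            pixels = block.reshape(-1, n_bands)
            for i, spectrum in enumerate(pixels):
                line_offset, sample = divmod(i, n_samples)
                # Rows go straight to disk; nothing accumulates across chunks.
                writer.writerow([start + line_offset, sample, *spectrum.tolist()])
            rows_written += pixels.shape[0]
    return rows_written
```

Peak memory is governed by `chunk_rows * samples * bands` values rather than by the size of the whole flight line, so `chunk_rows` is the parameter to lower if memory runs short.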

---


## 10. Conclusion<a name="conclusion"></a>

This vignette provided a comprehensive, step-by-step guide to processing NEON flight line data, highlighting key techniques and strategies for handling large, complex datasets. The workflow included downloading NEON flight lines, converting them into suitable file formats, applying critical corrections, and extracting hyperspectral data from pixels before writing the results to CSV files for further numerical analysis.

By completing this process, you gain the ability to transform raw NEON airborne data into actionable datasets, enabling robust ecological and environmental research. This workflow is designed to balance efficiency, accuracy, and scalability, so that even very large datasets can be processed within the memory limits of a single machine.

### **Key Takeaways**
1. **Efficient Data Handling:** 
   - From downloading raw flight line data to saving corrected and processed outputs, this workflow demonstrates how to manage large-scale operations effectively.
   - Chunk processing and direct-to-disk writing ensure that memory constraints are respected while maintaining high data fidelity.

2. **Robust Data Corrections:** 
   - Topographic and BRDF corrections make the processed data more reliable for downstream analysis: the topographic correction accounts for terrain-driven differences in illumination, and the BRDF correction accounts for the dependence of reflectance on sun and view angles.

3. **Hyperspectral Data for Analysis:** 
   - The extraction of hyperspectral data from individual pixels provides a valuable resource for detailed numerical and statistical studies, enabling deeper insights into ecological and environmental processes.

4. **Scalability and Reproducibility:** 
   - This workflow is scalable to handle additional flight lines, datasets, and sites, making it a versatile tool for researchers working across diverse geographies and ecological systems.
   - By following standardized steps and leveraging robust tools, you can ensure that your processing is reproducible and aligned with scientific best practices.

---

## 11. References<a name="references"></a>

- **NEON Data Portal:** [https://data.neonscience.org/](https://data.neonscience.org/)
- **GeoPandas Documentation:** [https://geopandas.org/](https://geopandas.org/)
- **Rasterio Documentation:** [https://rasterio.readthedocs.io/](https://rasterio.readthedocs.io/)
- **NumPy Documentation:** [https://numpy.org/doc/](https://numpy.org/doc/)
- **HyTools Documentation:** [https://hytools.readthedocs.io/](https://hytools.readthedocs.io/)
- **Ray Documentation:** [https://docs.ray.io/en/latest/](https://docs.ray.io/en/latest/)
- **NEON Field Sites Page:** [https://www.neonscience.org/field-sites/explore](https://www.neonscience.org/field-sites/explore)
- **NEON Data Products Catalog:** [https://data.neonscience.org/data-products/explore](https://data.neonscience.org/data-products/explore)
- **NEON Data Availability Page:** [https://data.neonscience.org/visualizations/data-availability](https://data.neonscience.org/visualizations/data-availability)
- **NEON Flight Schedules and Coverage:** [https://www.neonscience.org/data-collection/flight-schedules-coverage](https://www.neonscience.org/data-collection/flight-schedules-coverage)
- **AOP Data Availability Notification:** [https://www.neonscience.org/impact/observatory-blog/aop-data-availability-notification-release-2024](https://www.neonscience.org/impact/observatory-blog/aop-data-availability-notification-release-2024)

---