# Advanced Microwave Scanning Radiometer (AMSR)
## Introduction:

The Advanced Microwave Scanning Radiometer (AMSR) is a satellite-based instrument designed to observe Earth's surface and atmosphere at microwave frequencies. AMSR produces a range of weather and climate data products, including snow water equivalent (SWE), soil moisture, sea ice concentration, rainfall rates, and ocean surface wind speeds. The snowcast_wormhole workflow uses AMSR data to improve snowfall prediction accuracy.

## AMSR Data download
The first step of the data download is to build the links to the daily AMSR snow data files, using the script in the **amsr_swe_data_download** file. The script uses Python's datetime module to create one link for each day within a range of years, so that every daily file in the period is covered.

The **generate_links** function creates these download links for a specified range of years. It initializes the base URL, the date format, and a one-day time delta. A while loop then steps through each date within the given years, formats the date string, and builds the link by appending the dated directory and file name to the base URL. Finally, the function returns a list of links covering the entire period.

After gathering the links to the AMSR daily snow data files, the next step is to download these files for further analysis.


In [None]:
from datetime import datetime, timedelta
import os
import subprocess

def generate_links(start_year, end_year):
    '''
    Generate a list of download links for AMSR daily snow data files.

    Args:
        start_year (int): The starting year.
        end_year (int): The ending year (inclusive).

    Returns:
        list: A list of download links for AMSR daily snow data files.
    '''
    base_url = "https://n5eil01u.ecs.nsidc.org/AMSA/AU_DySno.001/"
    date_format = "%Y.%m.%d"
    delta = timedelta(days=1)

    start_date = datetime(start_year, 1, 1)
    end_date = datetime(end_year + 1, 1, 1)

    links = []
    current_date = start_date

    while current_date < end_date:
        date_str = current_date.strftime(date_format)
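        # The directory name uses YYYY.MM.DD, while the file name uses YYYYMMDD
        # (the same pattern as the curl commands later in this notebook).
        file_date = current_date.strftime("%Y%m%d")
        link = base_url + date_str + "/AMSR_U2_L3_DailySnow_B02_" + file_date + ".he5"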
        links.append(link)
        current_date += delta

    return links

if __name__ == "__main__":
    start_year = 2019
    end_year = 2022

    links = generate_links(start_year, end_year)
    save_location = "C:/Users/srila/gridmet_test_run/amsr"
    with open("C:/Users/srila/gridmet_test_run/amsr/download_links.txt", "w") as txt_file:
        for link in links:
            # Write one URL per line (" ".join(link) would insert a space between every character)
            txt_file.write(link + "\n")

    #if not os.path.exists(save_location):
    #    os.makedirs(save_location)

    #for link in links:
    #    filename = link.split("/")[-1]
    #    save_path = os.path.join(save_location, filename)
    #    curl_cmd = f"curl -b ~/.urs_cookies -c ~/.urs_cookies -L -n -o {save_path} {link}"
    #    subprocess.run(curl_cmd, shell=True, check=True)
        # print(f"Downloaded: {filename}")
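
As a quick sanity check on the link generation, the sketch below rebuilds the daily URLs and counts them. It assumes the file name carries the date as YYYYMMDD, as in the curl commands later in this notebook. The helper name `daily_amsr_links` is hypothetical and not part of the original script.

```python
from datetime import date, timedelta

# Base URL taken from the notebook; the helper name is hypothetical.
BASE_URL = "https://n5eil01u.ecs.nsidc.org/AMSA/AU_DySno.001/"

def daily_amsr_links(start_year, end_year):
    """Return one URL per day from Jan 1 of start_year to Dec 31 of end_year."""
    links = []
    current = date(start_year, 1, 1)
    end = date(end_year + 1, 1, 1)
    while current < end:
        folder = current.strftime("%Y.%m.%d")  # directory name, e.g. 2019.01.01
        stamp = current.strftime("%Y%m%d")     # file-name date, e.g. 20190101
        links.append(f"{BASE_URL}{folder}/AMSR_U2_L3_DailySnow_B02_{stamp}.he5")
        current += timedelta(days=1)
    return links

links = daily_amsr_links(2019, 2022)
print(len(links))  # one entry per calendar day, leap day included
print(links[0])
print(links[-1])
```

Comparing the number of links with the number of calendar days in the range is a cheap way to catch an off-by-one error in the loop bounds before any download starts.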


The **perform_download.sh** script utilizes the wget command-line tool to automate the downloading process. The download_links.txt file contains URLs pointing to the AMSR data files. The script reads each URL from this file, one by one, and downloads the corresponding data file using wget. Before downloading, it sets up some common options for wget, such as authentication credentials, cookie handling, and specifying the output directory. By executing this shell script, we efficiently fetch the required data files and store them in the designated output directory, ensuring that we have the necessary data ready for analysis and modeling tasks in the subsequent steps of our workflow. 

In [None]:
#!/bin/bash

# Specify the file containing the download links
input_file="C:/Users/srila/gridmet_test_run/amsr/download_links.txt"

# Specify the base wget command with common options
# (bash assignments must not have spaces around "=")
base_wget_command="wget --http-user=<your_username> --http-password=<your_password> --load-cookies C:/Users/srila/gridmet_test_run/amsr/mycookies.txt --save-cookies mycookies.txt --keep-session-cookies --no-check-certificate"

# Specify the output directory for downloaded files
output_directory="C:/Users/srila/gridmet_test_run/amsr"

# Ensure the output directory exists
mkdir -p "$output_directory"

# Loop through each line (URL) in the input file and download it using wget
while IFS= read -r url; do
    echo "Downloading: $url"
    $base_wget_command -P "$output_directory" "$url"
done < "$input_file"
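
The same loop can also be driven from Python. The sketch below only builds the wget argument list for each URL and prints it; nothing is downloaded. `build_wget_command` is a hypothetical helper, the cookie file and output directory are placeholder values, and credentials are deliberately left out.

```python
import shlex

def build_wget_command(url, output_dir, cookie_file):
    """Build the wget argument list for one URL (hypothetical helper)."""
    return [
        "wget",
        "--load-cookies", cookie_file,
        "--save-cookies", cookie_file,
        "--keep-session-cookies",
        "-P", output_dir,  # directory where the file is saved
        url,
    ]

# Placeholder inputs; in the workflow these would come from download_links.txt.
urls = [
    "https://n5eil01u.ecs.nsidc.org/AMSA/AU_DySno.001/2019.01.01/AMSR_U2_L3_DailySnow_B02_20190101.he5",
    "",  # blank lines in the links file are skipped
]

commands = [build_wget_command(u, "amsr", "mycookies.txt") for u in urls if u.strip()]
for cmd in commands:
    print(shlex.join(cmd))
```

Passing each list to `subprocess.run(cmd, check=True)` would run the download. Using an argument list instead of a single shell string avoids quoting problems with paths that contain spaces.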

## Extracting Features from AMSR data:
Once the AMSR data files are downloaded, the next step is to extract relevant features from them. The **amsr_features** script does this by processing each AMSR data file and extracting snow water equivalent (SWE) values for the grid cells that correspond to SNOTEL weather stations. Let's break down each step involved in feature extraction.
### copy_he5_files(source_dir, destination_dir)
The function **copy_he5_files(source_dir, destination_dir)** is used to copy files with the extension ".he5" from a specified source directory to a destination directory. It is used to transfer files containing AMSR data from one location to another, for organization or preprocessing purposes. 


In [None]:
import os
import shutil

def copy_he5_files(source_dir, destination_dir):
    '''
    Copy .he5 files from the source directory to the destination directory.

    Args:
        source_dir (str): The source directory containing .he5 files to copy.
        destination_dir (str): The destination directory where .he5 files will be copied.

    Returns:
        None
    '''
    # Get a list of all subdirectories and files in the source directory
    for root, dirs, files in os.walk(source_dir):
        for file in files:
            if file.endswith('.he5'):
                # Get the absolute path of the source file
                source_file_path = os.path.join(root, file)
                # Copy the file to the destination directory
                shutil.copy(source_file_path, destination_dir)
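
A small usage sketch of the same idea, run against a temporary directory so that it is safe to try. `copy_he5_files_sketch` is a hypothetical variant: unlike the original, it creates the destination directory and returns the names it copied, both of which are additions made for the demonstration.

```python
import os
import shutil
import tempfile

def copy_he5_files_sketch(source_dir, destination_dir):
    """Copy every .he5 file under source_dir (recursively) into destination_dir."""
    os.makedirs(destination_dir, exist_ok=True)  # added for the demo
    copied = []
    for root, _dirs, files in os.walk(source_dir):
        for name in files:
            if name.endswith(".he5"):
                shutil.copy(os.path.join(root, name), destination_dir)
                copied.append(name)
    return sorted(copied)

with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny source tree: one .he5 file and one unrelated file
    src = os.path.join(tmp, "src", "2022.10.01")
    dst = os.path.join(tmp, "dst")
    os.makedirs(src)
    for name in ("a.he5", "notes.txt"):
        with open(os.path.join(src, name), "w") as f:
            f.write("placeholder")

    copied = copy_he5_files_sketch(os.path.join(tmp, "src"), dst)
    dst_listing = sorted(os.listdir(dst))
    print(copied)
    print(dst_listing)
```

Because all files land in one flat directory, two source files with the same name in different subdirectories would overwrite each other. The dated AMSR file names make that unlikely here.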

### find_closest_index
The **find_closest_index** function finds the AMSR grid cell closest to the latitude and longitude of each SNOTEL station, which makes it possible to map AMSR data to specific geographic locations. Closeness is measured as the sum of the absolute latitude and longitude differences, in degrees.


In [None]:
import numpy as np

def find_closest_index(target_latitude, target_longitude, lat_grid, lon_grid):
    '''
    Find the index of the grid cell with the closest coordinates to the target latitude and longitude.

    Args:
        target_latitude (float): The target latitude.
        target_longitude (float): The target longitude.
        lat_grid (numpy.ndarray): An array of latitude values.
        lon_grid (numpy.ndarray): An array of longitude values.

    Returns:
        Tuple[int, int, float, float]: A tuple containing the row index, column index, closest latitude, and closest longitude.
    '''
    # Compute the absolute differences between target and grid coordinates
    lat_diff = np.abs(lat_grid - target_latitude)
    lon_diff = np.abs(lon_grid - target_longitude)

    # Find the indices corresponding to the minimum differences
    lat_idx, lon_idx = np.unravel_index(np.argmin(lat_diff + lon_diff), lat_grid.shape)

    return lat_idx, lon_idx, lat_grid[lat_idx, lon_idx], lon_grid[lat_idx, lon_idx]
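
To see the search in action, the sketch below runs the same logic on a made-up 3x3 grid. The grid values and the target point are illustrative only and do not come from an AMSR file.

```python
import numpy as np

def find_closest_index(target_latitude, target_longitude, lat_grid, lon_grid):
    """Same logic as above: minimize |dlat| + |dlon| over the 2D grid."""
    distance = np.abs(lat_grid - target_latitude) + np.abs(lon_grid - target_longitude)
    lat_idx, lon_idx = np.unravel_index(np.argmin(distance), lat_grid.shape)
    return lat_idx, lon_idx, lat_grid[lat_idx, lon_idx], lon_grid[lat_idx, lon_idx]

# Toy grid: lat_grid[i, j] = lats[i], lon_grid[i, j] = lons[j]
lats = np.array([40.0, 41.0, 42.0])
lons = np.array([-110.0, -109.0, -108.0])
lat_grid, lon_grid = np.meshgrid(lats, lons, indexing="ij")

row, col, cell_lat, cell_lon = find_closest_index(41.2, -108.4, lat_grid, lon_grid)
print(row, col, cell_lat, cell_lon)  # → 1 2 41.0 -108.0
```

The distance used here is a simple sum of degree differences, not a great-circle distance. Since a degree of longitude covers less ground at higher latitudes, the chosen cell may occasionally differ from the geographically nearest one.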

### create_snotel_station_to_amsr_mapper
The **create_snotel_station_to_amsr_mapper** function reads a list of SNOTEL (Snow Telemetry) station locations from a CSV file and maps each station to the corresponding grid cell in an AMSR (Advanced Microwave Scanning Radiometer) dataset. The function calculates the closest grid cell in the AMSR dataset to each SNOTEL station using the **find_closest_index** function. It then creates a new CSV file containing the mapped coordinates of each SNOTEL station along with the corresponding coordinates of the closest grid cell in the AMSR dataset. If the CSV file already exists, the function reads it directly instead of recalculating the mappings. Additionally, the function downloads the required AMSR dataset if it is not already available locally. 


In [None]:
def create_snotel_station_to_amsr_mapper(
  new_base_station_list_file, 
  target_csv_path
):
    station_data = pd.read_csv(new_base_station_list_file)
    
    
    date = "2022-10-01"
    date = date.replace("-", ".")
    he5_date = date.replace(".", "")
    
    # Check if the CSV already exists
    
    if os.path.exists(target_csv_path):
        print(f"File {target_csv_path} already exists, skipping..")
        df = pd.read_csv(target_csv_path)
        return df
    
    target_amsr_hdf_path = f"{work_dir}/amsr_testing/testing_amsr_{date}.he5"
    if os.path.exists(target_amsr_hdf_path):
        print(f"File {target_amsr_hdf_path} already exists, skip downloading..")
    else:
        cmd = f"curl --output {target_amsr_hdf_path} -b ~/.urs_cookies -c ~/.urs_cookies -L -n -O https://n5eil01u.ecs.nsidc.org/AMSA/AU_DySno.001/{date}/AMSR_U2_L3_DailySnow_B02_{he5_date}.he5"
        print(f'Running command: {cmd}')
        subprocess.run(cmd, shell=True)
    
    df = pd.DataFrame(columns=['amsr_lat', 'amsr_lon', 
                               'amsr_lat_idx', 'amsr_lon_idx',
                               'station_lat', 'station_lon'])
    # Read the HDF
    file = h5py.File(target_amsr_hdf_path, 'r')
    hem_group = file['HDFEOS/GRIDS/Northern Hemisphere']
    lat = hem_group['lat'][:]
    lon = hem_group['lon'][:]
    
    # Replace NaN values with 0
    lat = np.nan_to_num(lat, nan=0.0)
    lon = np.nan_to_num(lon, nan=0.0)
    
    # Find the closest AMSR grid cell for each SNOTEL station
    for idx, row in station_data.iterrows():
        target_lat = row['latitude']
        target_lon = row['longitude']
        
        # compare the performance and find the fastest way to search nearest point
        closest_lat_idx, closest_lon_idx, closest_lat, closest_lon = find_closest_index(target_lat, target_lon, lat, lon)
        df.loc[len(df.index)] = [closest_lat, 
                                 closest_lon,
                                 closest_lat_idx,
                                 closest_lon_idx,
                                 target_lat,
                                 target_lon]
    
    # Save the new converted AMSR to CSV file
    df.to_csv(target_csv_path, index=False)
  
    print('AMSR mapper csv is created.')
    return df
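
The comment in the loop above asks for a faster way to search for the nearest point. One option, sketched here under the assumption that all stations fit in memory at once, is to compute the distances for every station in a single NumPy broadcast. `find_closest_indices` is a hypothetical helper that uses the same distance measure as find_closest_index.

```python
import numpy as np

def find_closest_indices(station_lats, station_lons, lat_grid, lon_grid):
    """Vectorized nearest-cell search for many stations (hypothetical helper)."""
    flat_lat = lat_grid.ravel()
    flat_lon = lon_grid.ravel()
    s_lat = np.asarray(station_lats, dtype=float)[:, None]
    s_lon = np.asarray(station_lons, dtype=float)[:, None]
    # distance[i, j] = |lat_j - station_lat_i| + |lon_j - station_lon_i|
    distance = np.abs(flat_lat[None, :] - s_lat) + np.abs(flat_lon[None, :] - s_lon)
    flat_idx = distance.argmin(axis=1)
    # Convert flat positions back to (row, column) indices of the 2D grid
    return np.unravel_index(flat_idx, lat_grid.shape)

# Toy grid and two made-up stations
lats = np.array([40.0, 41.0, 42.0])
lons = np.array([-110.0, -109.0, -108.0])
lat_grid, lon_grid = np.meshgrid(lats, lons, indexing="ij")

rows, cols = find_closest_indices([41.2, 39.9], [-108.4, -110.1], lat_grid, lon_grid)
print(rows, cols)
```

The intermediate distance matrix has one row per station and one column per grid cell, so memory use grows with their product. For large station lists, processing the stations in chunks keeps that under control.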

### extract_amsr_values_save_to_csv
The **extract_amsr_values_save_to_csv** function processes AMSR (Advanced Microwave Scanning Radiometer) data files and extracts relevant information to create a CSV file. It first prepares a mapping between SNOTEL (Snow Telemetry) station coordinates and the corresponding grid cells in the AMSR dataset using the create_snotel_station_to_amsr_mapper function. Then, it iterates through the AMSR data files, extracts the snow water equivalent (SWE) values for each SNOTEL station's mapped grid cell, and saves the results along with the corresponding dates and station coordinates to a CSV file. The function handles parallel processing of multiple data files using Dask, a Python library for parallel computing.  


In [None]:
def extract_amsr_values_save_to_csv(amsr_data_dir, output_csv_file, new_base_station_list_file, start_date, end_date):
    if os.path.exists(output_csv_file):
        os.remove(output_csv_file)
    
    target_csv_path = f'{work_dir}/training_snotel_station_to_amsr_mapper.csv'
    mapper_df = create_snotel_station_to_amsr_mapper(new_base_station_list_file, 
                                         target_csv_path)
        
    # station_data = pd.read_csv(new_base_station_list_file)

    start_date = datetime.strptime(start_date, "%Y-%m-%d")
    end_date = datetime.strptime(end_date, "%Y-%m-%d")

    # Create a Dask DataFrame
    dask_station_data = dd.from_pandas(mapper_df, npartitions=1)

    # Function to process each file
    def process_file(filename):
        file_path = os.path.join(amsr_data_dir, filename)
        print(file_path)
        
        file = h5py.File(file_path, 'r')
        hem_group = file['HDFEOS/GRIDS/Northern Hemisphere']

        date_str = filename.split('_')[-1].split('.')[0]
        date = datetime.strptime(date_str, '%Y%m%d')

        if not (start_date <= date <= end_date):
            print(f"{date} is not in the training period, skipping..")
            return None

        new_date_str = date.strftime("%Y-%m-%d")
        swe = hem_group['Data Fields/SWE_NorthernDaily'][:]
        flag = hem_group['Data Fields/Flags_NorthernDaily'][:]
        # Create an empty Pandas DataFrame with the desired columns
        result_df = pd.DataFrame(columns=['date', 'lat', 'lon', 'AMSR_SWE'])

        # Sample loop to add rows to the Pandas DataFrame using dask.delayed
        @delayed
        def process_row(row, swe, new_date_str):
          closest_lat_idx = int(row['amsr_lat_idx'])
          closest_lon_idx = int(row['amsr_lon_idx'])
          closest_swe = swe[closest_lat_idx, closest_lon_idx]
          
          return pd.DataFrame([[
            new_date_str, 
            row['station_lat'],
            row['station_lon'],
            closest_swe]], 
            columns=result_df.columns
          )


        # List of delayed computations
        delayed_results = [process_row(row, swe, new_date_str) for _, row in mapper_df.iterrows()]

        # Compute the delayed results and concatenate them into a Pandas DataFrame
        result_df = dask.compute(*delayed_results)
        result_df = pd.concat(result_df, ignore_index=True)

        # Print the final Pandas DataFrame
        #print(result_df)
          
        return result_df

    # Get the list of files
    files = [f for f in os.listdir(amsr_data_dir) if f.endswith('.he5')]

    # Create a Dask Bag from the files
    dask_bag = db.from_sequence(files, npartitions=2)

    # Process files in parallel
    processed_data = dask_bag.map(process_file).filter(lambda x: x is not None).compute()

    # Concatenate the processed data
    combined_df = pd.concat(processed_data, ignore_index=True)

    # Save the combined DataFrame to a CSV file
    combined_df.to_csv(output_csv_file, index=False)

    print(f"Merged data saved to {output_csv_file}")
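
The per-row lookup inside process_file can also be written as one array indexing step. The sketch below shows that alternative together with the file name parsing, using a made-up 3x3 SWE array and a two-row mapper. `parse_amsr_date` and `extract_station_swe` are hypothetical helper names; the column names follow the mapper CSV described above.

```python
from datetime import datetime

import numpy as np
import pandas as pd

def parse_amsr_date(filename):
    """Extract the date from a name like AMSR_U2_L3_DailySnow_B02_20221001.he5."""
    date_str = filename.split("_")[-1].split(".")[0]
    return datetime.strptime(date_str, "%Y%m%d")

def extract_station_swe(swe, mapper_df, date):
    """Look up the SWE of every station's mapped grid cell in one indexing step."""
    rows = mapper_df["amsr_lat_idx"].to_numpy(dtype=int)
    cols = mapper_df["amsr_lon_idx"].to_numpy(dtype=int)
    return pd.DataFrame({
        "date": date.strftime("%Y-%m-%d"),
        "lat": mapper_df["station_lat"].to_numpy(),
        "lon": mapper_df["station_lon"].to_numpy(),
        "AMSR_SWE": swe[rows, cols],  # fancy indexing: one value per station
    })

# Made-up inputs for illustration
swe = np.arange(9).reshape(3, 3)
mapper_df = pd.DataFrame({
    "amsr_lat_idx": [1, 0],
    "amsr_lon_idx": [2, 0],
    "station_lat": [41.2, 39.9],
    "station_lon": [-108.4, -110.1],
})

date = parse_amsr_date("AMSR_U2_L3_DailySnow_B02_20221001.he5")
result = extract_station_swe(swe, mapper_df, date)
print(result)
```

Indexing the whole array at once avoids creating one delayed task and one single-row DataFrame per station, which may matter when the station list is long.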


Overall, the **amsr_features** script automates the process of extracting relevant AMSR data features and integrating them with SNOTEL station data, streamlining the workflow for further analysis and modeling tasks related to snowfall prediction.

## Automating Real-time Retrieval and Processing of AMSR Snow Data
The **amsr_testing_realtime** script automates the retrieval and processing of AMSR snow data in near real time. It downloads the AMSR snow data for a specified date, converts it into a format compatible with the Digital Elevation Model (DEM) grid, and saves the result as a CSV file. Along the way, it downloads AMSR data from an online repository, reads the HDF5 files with the h5py library, converts the AMSR grid data to a format suitable for analysis, and fills missing values in the dataset by interpolation. The script relies on pandas for data manipulation, scipy for spatial operations, and subprocess for running shell commands. The target date for retrieval and processing can be customized to suit specific requirements.

### prepare_amsr_grid_mapper
The **prepare_amsr_grid_mapper** function creates a mapping between the AMSR grid and the gridMET (Gridded Surface Meteorological dataset) grid, so that AMSR snow data can be converted into a gridMET-compatible format. It first reads the latitude and longitude arrays from a specified AMSR HDF5 file. It then loads the coordinates of the gridMET grid points in the Western US. For each gridMET point, it finds the nearest AMSR grid cell and stores the latitude, longitude, and grid indices in a CSV file. Later steps in the script rely on this mapping to retrieve and process AMSR snow data for specific locations efficiently.


In [None]:
def prepare_amsr_grid_mapper():
    df = pd.DataFrame(columns=['amsr_lat', 'amsr_lon', 
                               'amsr_lat_idx', 'amsr_lon_idx',
                               'gridmet_lat', 'gridmet_lon'])
    date = test_start_date
    date = date.replace("-", ".")
    he5_date = date.replace(".", "")
    
    # Check if the CSV already exists
    target_csv_path = f'{work_dir}/amsr_to_gridmet_mapper.csv'
    if os.path.exists(target_csv_path):
        print(f"File {target_csv_path} already exists, skipping..")
        return
    
    target_amsr_hdf_path = f"{work_dir}/amsr_testing/testing_amsr_{date}.he5"
    if os.path.exists(target_amsr_hdf_path):
        print(f"File {target_amsr_hdf_path} already exists, skip downloading..")
    else:
        cmd = f"curl --output {target_amsr_hdf_path} -b ~/.urs_cookies -c ~/.urs_cookies -L -n -O https://n5eil01u.ecs.nsidc.org/AMSA/AU_DySno.001/{date}/AMSR_U2_L3_DailySnow_B02_{he5_date}.he5"
        print(f'Running command: {cmd}')
        subprocess.run(cmd, shell=True)
    
    # Read the HDF
    file = h5py.File(target_amsr_hdf_path, 'r')
    hem_group = file['HDFEOS/GRIDS/Northern Hemisphere']
    lat = hem_group['lat'][:]
    lon = hem_group['lon'][:]
    
    # Replace NaN values with 0
    lat = np.nan_to_num(lat, nan=0.0)
    lon = np.nan_to_num(lon, nan=0.0)
    
    # Convert the AMSR grid into our gridMET 1km grid
    western_us_df = pd.read_csv(western_us_coords)
    for idx, row in western_us_df.iterrows():
        target_lat = row['Latitude']
        target_lon = row['Longitude']
        
        # compare the performance and find the fastest way to search nearest point
        closest_lat_idx, closest_lon_idx, closest_lat, closest_lon = find_closest_index(target_lat, target_lon, lat, lon)
        df.loc[len(df.index)] = [closest_lat, 
                                 closest_lon,
                                 closest_lat_idx,
                                 closest_lon_idx,
                                 target_lat, 
                                 target_lon]
    
    # Save the new converted AMSR to CSV file
    df.to_csv(target_csv_path, index=False)
  
    print('AMSR mapper csv is created.')

### download_amsr_and_convert_grid
The download_amsr_and_convert_grid function is a core component of the "amsr_testing_realtime" script, responsible for downloading AMSR (Advanced Microwave Scanning Radiometer) snow data for a specific date and converting it into a format compatible with DEM (Digital Elevation Model) grids.
Here's a breakdown of its functionality:

**Download AMSR Data**: This function first constructs the URL for the AMSR data corresponding to the specified date and attempts to download it using the curl command. It ensures that the necessary cookies are available for authentication.

**Read HDF Data**: Once the data is downloaded, the function reads the HDF5 file using the h5py library, extracting the latitude, longitude, snow water equivalent (SWE), and flag information from the file.

**Data Conversion**: It then converts the AMSR grid into a format compatible with the DEM grid. Rather than searching again, it uses the mapper CSV produced by prepare_amsr_grid_mapper, which already stores the indices of the nearest AMSR grid cell for each target grid point.

**Custom Calculation**: For each target grid point, the function looks up the SWE and flag values of that nearest AMSR grid cell.

**Save to CSV**: Finally, the function saves the converted data, including the latitude, longitude, SWE, and flag information, into a CSV file. This file can be further processed or analyzed in subsequent steps of the script.
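
The URL construction in the first step can be sketched as follows. The base URL and file name pattern are taken from the curl command used in this notebook. `build_amsr_url` is a hypothetical helper that parses the date with datetime instead of string replacement, so a malformed date raises an error before any download is attempted.

```python
from datetime import datetime

# Base URL and file name pattern as used in the notebook's curl command.
BASE_URL = "https://n5eil01u.ecs.nsidc.org/AMSA/AU_DySno.001"

def build_amsr_url(target_date):
    """Build the daily snow file URL for a date given as YYYY-MM-DD."""
    day = datetime.strptime(target_date, "%Y-%m-%d")  # raises ValueError on bad input
    folder = day.strftime("%Y.%m.%d")  # directory name, e.g. 2022.10.01
    stamp = day.strftime("%Y%m%d")     # file-name date, e.g. 20221001
    return f"{BASE_URL}/{folder}/AMSR_U2_L3_DailySnow_B02_{stamp}.he5"

url = build_amsr_url("2022-10-01")
print(url)
```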

In [None]:
def download_amsr_and_convert_grid(target_date = test_start_date):
    """
    Download AMSR snow data, convert it to DEM format, and save as a CSV file.
    """
    
    
    
    # the mapper
    target_mapper_csv_path = f'{work_dir}/amsr_to_gridmet_mapper.csv'
    mapper_df = pd.read_csv(target_mapper_csv_path)
    #print(mapper_df.head())
    
    df = pd.DataFrame(columns=['date', 'lat', 
                               'lon', 'AMSR_SWE', 
                               'AMSR_Flag'])
    date = target_date
    date = date.replace("-", ".")
    he5_date = date.replace(".", "")
    
    # Check if the CSV already exists
    target_csv_path = f'{work_dir}/testing_ready_amsr_{date}.csv'
    if os.path.exists(target_csv_path):
        print(f"File {target_csv_path} already exists, skipping..")
        return target_csv_path
    
    target_amsr_hdf_path = f"{work_dir}/amsr_testing/testing_amsr_{date}.he5"
    if os.path.exists(target_amsr_hdf_path) and is_binary(target_amsr_hdf_path):
        print(f"File {target_amsr_hdf_path} already exists, skip downloading..")
    else:
        cmd = f"curl --output {target_amsr_hdf_path} -b ~/.urs_cookies -c ~/.urs_cookies -L -n -O https://n5eil01u.ecs.nsidc.org/AMSA/AU_DySno.001/{date}/AMSR_U2_L3_DailySnow_B02_{he5_date}.he5"
        print(f'Running command: {cmd}')
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        # Check the exit code
        if result.returncode != 0:
            print(f"Command failed with exit code {result.returncode}.")
            if os.path.exists(target_amsr_hdf_path):
              os.remove(target_amsr_hdf_path)
              print(f"Wrong {target_amsr_hdf_path} removed successfully.")
            raise Exception(f"Failed to download {target_amsr_hdf_path} - {result.stderr}")
    
    # Read the HDF
    print(f"Reading {target_amsr_hdf_path}")
    file = h5py.File(target_amsr_hdf_path, 'r')
    hem_group = file['HDFEOS/GRIDS/Northern Hemisphere']
    lat = hem_group['lat'][:]
    lon = hem_group['lon'][:]
    
    # Replace NaN values with 0
    lat = np.nan_to_num(lat, nan=0.0)
    lon = np.nan_to_num(lon, nan=0.0)
    
    swe = hem_group['Data Fields/SWE_NorthernDaily'][:]
    flag = hem_group['Data Fields/Flags_NorthernDaily'][:]
    date = datetime.strptime(date, '%Y.%m.%d')
    
    # Convert the AMSR grid into our DEM 1km grid
    
    def get_swe(row):
        # Perform your custom calculation here
        closest_lat_idx = int(row['amsr_lat_idx'])
        closest_lon_idx = int(row['amsr_lon_idx'])
        closest_swe = swe[closest_lat_idx, closest_lon_idx]
        return closest_swe
    
    def get_swe_flag(row):
        # Perform your custom calculation here
        closest_lat_idx = int(row['amsr_lat_idx'])
        closest_lon_idx = int(row['amsr_lon_idx'])
        closest_flag = flag[closest_lat_idx, closest_lon_idx]
        return closest_flag
    
    # Use the apply function to apply the custom function to each row
    mapper_df['AMSR_SWE'] = mapper_df.apply(get_swe, axis=1)
    mapper_df['AMSR_Flag'] = mapper_df.apply(get_swe_flag, axis=1)
    mapper_df['date'] = date
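    # Note: the mapper CSV stores 'gridmet_lat'/'gridmet_lon', so these renames
    # match no column and leave the frame unchanged; the cumulative step later
    # merges on the gridmet_* names.
    mapper_df.rename(columns={'dem_lat': 'lat'}, inplace=True)
    mapper_df.rename(columns={'dem_lon': 'lon'}, inplace=True)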
    mapper_df = mapper_df.drop(columns=['amsr_lat',
                                        'amsr_lon',
                                        'amsr_lat_idx',
                                        'amsr_lon_idx'])
    
    print("result df: ", mapper_df.head())
    # Save the new converted AMSR to CSV file
    print(f"saving the new AMSR SWE to csv: {target_csv_path}")
    mapper_df.to_csv(target_csv_path, index=False)
    
    print('Completed AMSR testing data collection.')
    return target_csv_path

def add_cumulative_column(df, column_name):
    df[f'cumulative_{column_name}'] = df[column_name].sum()
    return df

### get_cumulative_amsr_data
The **get_cumulative_amsr_data** function in the "amsr_testing_realtime" script collects, processes, and aggregates AMSR snow data. Starting from the most recent October 1st on or before the target date, it iterates through each date up to the target date, downloading the corresponding AMSR data and converting it to CSV format. It merges the daily results into a single DataFrame keyed by grid point latitude and longitude. It then handles missing values: SWE values above 240 are treated as gaps and filled by linear interpolation across dates, and any gaps that remain are set to 0. Finally, it adds a cumulative SWE column and writes the result, containing latitude, longitude, SWE, flags, and cumulative SWE, to a CSV file for the specified target date.
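
The choice of starting date can be sketched on its own. The rule is the one described above: dates from January through September count back to October 1st of the previous year, while dates from October onward start in the same year. `season_start` is a hypothetical helper name.

```python
from datetime import datetime

def season_start(target_date):
    """Return the most recent October 1st on or before target_date (YYYY-MM-DD)."""
    selected = datetime.strptime(target_date, "%Y-%m-%d")
    year = selected.year if selected.month >= 10 else selected.year - 1
    return datetime(year, 10, 1)

for d in ("2023-03-15", "2022-11-01", "2022-10-01"):
    start = season_start(d)
    # Number of daily files the accumulation loop would visit, both ends included
    n_days = (datetime.strptime(d, "%Y-%m-%d") - start).days + 1
    print(d, start.date(), n_days)
```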

In [None]:
   
def get_cumulative_amsr_data(target_date = test_start_date, force=False):
    
    selected_date = datetime.strptime(target_date, "%Y-%m-%d")
    print(selected_date)
    if selected_date.month < 10:
      past_october_1 = datetime(selected_date.year - 1, 10, 1)
    else:
      past_october_1 = datetime(selected_date.year, 10, 1)

    # Traverse and print every day from past October 1 to the specific date
    current_date = past_october_1
    target_csv_path = f'{work_dir}/testing_ready_amsr_{target_date}_cumulative.csv'

    columns_to_be_cumulated = ["AMSR_SWE"]
    
    gap_filled_csv = f"{target_csv_path}_gap_filled.csv"
    if os.path.exists(gap_filled_csv) and not force:
      print(f"{gap_filled_csv} already exists, skipping..")
      df = pd.read_csv(gap_filled_csv)
      print(df["AMSR_SWE"].describe())
    else:
      date_keyed_objects = {}
      data_dict = {}
      new_df = None
      while current_date <= selected_date:
        print(current_date.strftime('%Y-%m-%d'))
        current_date_str = current_date.strftime('%Y-%m-%d')

        data_dict[current_date_str] = download_amsr_and_convert_grid(current_date_str)
        current_df = pd.read_csv(data_dict[current_date_str])
        current_df.drop(columns=["date"], inplace=True)

        if current_date != selected_date:
          current_df.rename(columns={
            "AMSR_SWE": f"AMSR_SWE_{current_date_str}",
            "AMSR_Flag": f"AMSR_Flag_{current_date_str}",
          }, inplace=True)
        #print(current_df.head())

        if new_df is None:
          new_df = current_df
        else:
          new_df = pd.merge(new_df, current_df, on=['gridmet_lat', 'gridmet_lon'])
          #new_df = new_df.append(current_df, ignore_index=True)

        current_date += timedelta(days=1)

      print("new_df.columns = ", new_df.columns)
      print("new_df.head = ", new_df.head())
      df = new_df

      #df.sort_values(by=['gridmet_lat', 'gridmet_lon', 'date'], inplace=True)
      print("All current head: ", df.head())
      print("the new_df.shape: ", df.shape)

      print("Start to fill in the missing values")
      #grouped = df.groupby(['gridmet_lat', 'gridmet_lon'])
      filled_data = pd.DataFrame()

      # Apply the function to each group
      for column_name in columns_to_be_cumulated:
        start_time = time.time()
        #filled_data = df.apply(lambda row: interpolate_missing_and_add_cumulative_inplace(row, column_name), axis=1)
        #alike_columns = filled_data.filter(like=column_name)
        #filled_data[f'cumulative_{column_name}'] = alike_columns.sum(axis=1)
        print("filled_data.columns = ", filled_data.columns)
        filtered_columns = df.filter(like=column_name)
        print(filtered_columns.columns)
        filtered_columns = filtered_columns.mask(filtered_columns > 240)
        filtered_columns.interpolate(axis=1, method='linear', inplace=True)
        filtered_columns.fillna(0, inplace=True)
        
        sum_column = filtered_columns.sum(axis=1)
        # Define a specific name for the new column
        df[f'cumulative_{column_name}'] = sum_column
        df[filtered_columns.columns] = filtered_columns
        
        if filtered_columns.isnull().any().any():
          print("filtered_columns :", filtered_columns)
          raise ValueError("Single group: shouldn't have null values here")
        
        
        

        # Concatenate the original DataFrame with the Series containing the sum
        #df = pd.concat([df, sum_column.rename(new_column_name)], axis=1)
#         cumulative_column = filled_data.filter(like=column_name).sum(axis=1)
#         filled_data[f'cumulative_{column_name}'] = cumulative_column
        #filled_data = pd.concat([filled_data, cumulative_column], axis=1)
        print("filled_data.columns: ", filled_data.columns)
        end_time = time.time()
        # Calculate the elapsed time
        elapsed_time = end_time - start_time
        print(f"calculate column {column_name} elapsed time: {elapsed_time} seconds")

#       if any(filled_data['AMSR_SWE'] > 240):
#         raise ValueError("Error: shouldn't have AMSR_SWE > 240 at this point")
      filled_data = df
      filled_data["date"] = target_date
      print("Finished correctly ", filled_data.head())
      filled_data.to_csv(gap_filled_csv, index=False)
      print(f"New filled values csv is saved to {gap_filled_csv}")
      df = filled_data
    
    result = df
    print("result.head = ", result.head())
    # fill in the rest NA as 0
    if result.isnull().any().any():
      print("result :", result)
      raise ValueError("Single group: shouldn't have null values here")
    
    # only retain the rows of the target date
    print(result['date'].unique())
    print(result.shape)
    print(result[["AMSR_SWE", "AMSR_Flag"]].describe())
    result.to_csv(target_csv_path, index=False)
    print(f"New data is saved to {target_csv_path}")
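
The gap-filling step in the middle of this function can be tried in isolation. The sketch below applies the same pandas calls (mask, interpolate, fillna, sum) to a made-up two-row table. The values are illustrative, with 255 standing in for a flagged reading above the 240 threshold.

```python
import pandas as pd

# Made-up data: one column per day, with the target date keeping the plain name
df = pd.DataFrame({
    "gridmet_lat": [40.0, 41.0],
    "gridmet_lon": [-110.0, -109.0],
    "AMSR_SWE_2022-10-01": [10.0, 0.0],
    "AMSR_SWE_2022-10-02": [255.0, 5.0],
    "AMSR_SWE": [30.0, 10.0],
})

swe_cols = df.filter(like="AMSR_SWE")                     # all daily SWE columns
swe_cols = swe_cols.mask(swe_cols > 240)                  # values above 240 become NaN
swe_cols = swe_cols.interpolate(axis=1, method="linear")  # fill along the date axis
swe_cols = swe_cols.fillna(0)                             # leading gaps stay NaN, so zero them

df[swe_cols.columns] = swe_cols
df["cumulative_AMSR_SWE"] = swe_cols.sum(axis=1)
print(df)
```

The interpolation runs across columns, so it depends on the daily columns being in chronological order. The merge loop above adds them in that order.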

### interpolate_missing_and_add_cumulative_inplace
The **interpolate_missing_and_add_cumulative_inplace** function in the "amsr_testing_realtime" script is designed to fill missing values in the AMSR snow data and compute cumulative values. It operates on one row of the DataFrame at a time. According to its docstring, for a specified column (e.g., snow water equivalent, or SWE) it estimates missing values by polynomial interpolation and keeps the results within reasonable bounds (e.g., 0 to 240 for SWE). In the version shown here, however, the body only selects the columns whose names start with the given column name and stores their sum in a cumulative column. The gap filling itself is done with vectorized pandas operations in get_cumulative_amsr_data, where the row-wise call to this function is commented out.
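
For reference, the polynomial gap filling that the docstring describes could look roughly like the sketch below. This is an assumption about the intended behavior, not the author's implementation: `fill_gaps_polyfit` is a hypothetical helper that fits a polynomial to the valid points of one row with NumPy, evaluates it at the gaps, and clips the result to the allowed range.

```python
import numpy as np

def fill_gaps_polyfit(values, degree=1, lower=0.0, upper=240.0):
    """Fill NaN or out-of-range entries from a polynomial fit of the valid ones."""
    y = np.asarray(values, dtype=float)
    x = np.arange(y.size)  # positions stand in for consecutive days

    # Valid points are finite and inside [lower, upper]
    valid = np.isfinite(y)
    valid[valid] = (y[valid] >= lower) & (y[valid] <= upper)
    if valid.sum() <= degree:
        raise ValueError("not enough valid points for the requested degree")

    coeffs = np.polyfit(x[valid], y[valid], degree)
    filled = y.copy()
    # Evaluate the fit at the gaps and keep the estimates inside the bounds
    filled[~valid] = np.clip(np.polyval(coeffs, x[~valid]), lower, upper)
    return filled, filled.sum()

# Made-up row: one missing value and one flagged value (255)
filled, cumulative = fill_gaps_polyfit([0.0, 10.0, np.nan, 30.0, 255.0], degree=1)
print(filled, cumulative)
```

Unlike the linear interpolation between neighbors used in get_cumulative_amsr_data, a global fit also extrapolates at the ends of the series, which is why the clipping step matters.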

In [None]:
def interpolate_missing_and_add_cumulative_inplace(row, column_name, degree=1):
  """
  Add a cumulative column to a row by summing every column whose name
  starts with `column_name`.

  Parameters:
    - row (pd.Series): The input row; its index holds the column names.
    - column_name (str): The prefix of the columns to be summed (e.g. 'AMSR_SWE').
    - degree (int, optional): Kept for compatibility; currently unused. Default is 1.

  Returns:
    - pd.Series: The row with an added `cumulative_<column_name>` entry.

  Note:
    - No interpolation is performed here. Values above 240 are masked and
      gap-filled in `get_cumulative_amsr_data`.

  Examples:
    ```python
    # Example usage:
    row_with_sum = interpolate_missing_and_add_cumulative_inplace(my_row, 'AMSR_SWE')
    ```

  """
  
  # Select the columns whose names start with column_name (one per date)
  x_all_key = row.index
  x_subset_key = x_all_key[x_all_key.str.startswith(column_name)]

  # Debug output for rows where every selected value lies in [1, 239]
  all_values_within_bounds = row[x_subset_key].between(1, 239).all()
  if all_values_within_bounds:
    print("row[x_subset_key] = ", row[x_subset_key])
    print("row[x_subset_key].sum() = ", row[x_subset_key].sum())
  # Store the sum over the selected columns as the cumulative column
  row[f"cumulative_{column_name}"] = row[x_subset_key].sum()
  return row
    
    
def get_cumulative_amsr_data(target_date = test_start_date, force=False):
    
    selected_date = datetime.strptime(target_date, "%Y-%m-%d")
    print(selected_date)
    if selected_date.month < 10:
      past_october_1 = datetime(selected_date.year - 1, 10, 1)
    else:
      past_october_1 = datetime(selected_date.year, 10, 1)

    # Traverse and print every day from past October 1 to the specific date
    current_date = past_october_1
    target_csv_path = f'{work_dir}/testing_ready_amsr_{target_date}_cumulative.csv'

    columns_to_be_cumulated = ["AMSR_SWE"]
    
    gap_filled_csv = f"{target_csv_path}_gap_filled.csv"
    if os.path.exists(gap_filled_csv) and not force:
      print(f"{gap_filled_csv} already exists, skipping..")
      df = pd.read_csv(gap_filled_csv)
      print(df["AMSR_SWE"].describe())
    else:
      date_keyed_objects = {}
      data_dict = {}
      new_df = None
      while current_date <= selected_date:
        print(current_date.strftime('%Y-%m-%d'))
        current_date_str = current_date.strftime('%Y-%m-%d')

        data_dict[current_date_str] = download_amsr_and_convert_grid(current_date_str)
        current_df = pd.read_csv(data_dict[current_date_str])
        current_df.drop(columns=["date"], inplace=True)

        if current_date != selected_date:
          current_df.rename(columns={
            "AMSR_SWE": f"AMSR_SWE_{current_date_str}",
            "AMSR_Flag": f"AMSR_Flag_{current_date_str}",
          }, inplace=True)
        #print(current_df.head())

        if new_df is None:
          new_df = current_df
        else:
          new_df = pd.merge(new_df, current_df, on=['gridmet_lat', 'gridmet_lon'])
          #new_df = new_df.append(current_df, ignore_index=True)

        current_date += timedelta(days=1)

      print("new_df.columns = ", new_df.columns)
      print("new_df.head = ", new_df.head())
      df = new_df

      #df.sort_values(by=['gridmet_lat', 'gridmet_lon', 'date'], inplace=True)
      print("All current head: ", df.head())
      print("the new_df.shape: ", df.shape)

      print("Start to fill in the missing values")
      #grouped = df.groupby(['gridmet_lat', 'gridmet_lon'])
      filled_data = pd.DataFrame()

      # Apply the function to each group
      for column_name in columns_to_be_cumulated:
        start_time = time.time()
        #filled_data = df.apply(lambda row: interpolate_missing_and_add_cumulative_inplace(row, column_name), axis=1)
        #alike_columns = filled_data.filter(like=column_name)
        #filled_data[f'cumulative_{column_name}'] = alike_columns.sum(axis=1)
        print("filled_data.columns = ", filled_data.columns)
        filtered_columns = df.filter(like=column_name)
        print(filtered_columns.columns)
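        # Treat values above 240 as gaps
        filtered_columns = filtered_columns.mask(filtered_columns > 240)
        # Fill gaps linearly along each row (across dates); assign the result
        # rather than relying on inplace=True
        filtered_columns = filtered_columns.interpolate(axis=1, method='linear')
        # Gaps that interpolation cannot reach (e.g. leading NaNs) become 0
        filtered_columns = filtered_columns.fillna(0)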
        
        sum_column = filtered_columns.sum(axis=1)
        # Define a specific name for the new column
        df[f'cumulative_{column_name}'] = sum_column
        df[filtered_columns.columns] = filtered_columns
        
        if filtered_columns.isnull().any().any():
          print("filtered_columns :", filtered_columns)
          raise ValueError("Single group: shouldn't have null values here")
        
        
        

        # Concatenate the original DataFrame with the Series containing the sum
        #df = pd.concat([df, sum_column.rename(new_column_name)], axis=1)
#         cumulative_column = filled_data.filter(like=column_name).sum(axis=1)
#         filled_data[f'cumulative_{column_name}'] = cumulative_column
        #filled_data = pd.concat([filled_data, cumulative_column], axis=1)
        print("filled_data.columns: ", filled_data.columns)
        end_time = time.time()
        # Calculate the elapsed time
        elapsed_time = end_time - start_time
        print(f"calculate column {column_name} elapsed time: {elapsed_time} seconds")

#       if any(filled_data['AMSR_SWE'] > 240):
#         raise ValueError("Error: shouldn't have AMSR_SWE > 240 at this point")
      filled_data = df
      filled_data["date"] = target_date
      print("Finished correctly ", filled_data.head())
      filled_data.to_csv(gap_filled_csv, index=False)
      print(f"New filled values csv is saved to {gap_filled_csv}")
      df = filled_data
    
    result = df
    print("result.head = ", result.head())
    # sanity check: no null values should remain at this point
    if result.isnull().any().any():
      print("result :", result)
      raise ValueError("Single group: shouldn't have null values here")
    
    # report the dates, shape, and summary statistics, then save the result
    print(result['date'].unique())
    print(result.shape)
    print(result[["AMSR_SWE", "AMSR_Flag"]].describe())
    result.to_csv(target_csv_path, index=False)
    print(f"New data is saved to {target_csv_path}")
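
The date range that **get_cumulative_amsr_data** traverses starts at the most recent October 1 on or before the target date. The sketch below isolates that logic so it can be checked on its own; the helper names `most_recent_october_1` and `dates_to_accumulate` are hypothetical and do not exist in the script.

```python
from datetime import datetime, timedelta

def most_recent_october_1(target_date):
    """Return the latest October 1 that is on or before target_date (YYYY-MM-DD)."""
    selected = datetime.strptime(target_date, "%Y-%m-%d")
    # Before October, the accumulation period started in the previous calendar year
    year = selected.year if selected.month >= 10 else selected.year - 1
    return datetime(year, 10, 1)

def dates_to_accumulate(target_date):
    """List every day from the most recent October 1 through target_date, inclusive."""
    current = most_recent_october_1(target_date)
    end = datetime.strptime(target_date, "%Y-%m-%d")
    days = []
    while current <= end:
        days.append(current.strftime("%Y-%m-%d"))
        current += timedelta(days=1)
    return days

days = dates_to_accumulate("2024-01-03")
print(days[0], days[-1], len(days))
```

Because one daily file is processed per entry in this list, the cost of a run grows with the distance between the target date and the preceding October 1.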