# GridMET Climatology Data Downloader

In this chapter, we aim to:

- **Download GridMET Data**: Obtain specific meteorological variables from the GridMET climatology dataset for a user-specified year, using netCDF4 and urllib for file handling and downloading.

- **Visualize Data**: Create custom color maps and scatter plots to visualize meteorological variables spatially across the western United States. This functionality aids in understanding geographical patterns and trends in meteorological data.

- **Generate Cumulative History CSVs**: Aggregate meteorological data over a specified date range so that users can analyze historical patterns and long-term trends.

We retrieve specific meteorological variables from the GridMET climatology dataset for a user-defined year, using netCDF4, urllib, and pandas for downloading and data handling. We also provide custom color mapping for visualizing meteorological patterns with matplotlib, and we generate cumulative history CSVs that aggregate data from the most recent October 1st up to the specified target date.

The outputs are the downloaded meteorological variables in NetCDF format, color-mapped visualizations of meteorological patterns, and cumulative history CSV files containing aggregated data for trend analysis.

In [3]:
import os
import numpy as np
import pandas as pd
import netCDF4 as nc
import urllib.request
from datetime import datetime, timedelta, date
import matplotlib.pyplot as plt

train_start_date = "2018-01-03"
train_end_date = "2021-12-31"
homedir = os.path.expanduser('~')
work_dir = f"{homedir}/gridmet_test_run"

# Define the folder to store downloaded files
gridmet_folder_name = f'{work_dir}/gridmet_climatology'

western_us_coords = f'{work_dir}/dem_file.tif.csv'


gridmet_var_mapping = {
  "etr": "potential_evapotranspiration",
  "pr":"precipitation_amount",
  "rmax":"relative_humidity",
  "rmin":"relative_humidity",
  "tmmn":"air_temperature",
  "tmmx":"air_temperature",
  "vpd":"mean_vapor_pressure_deficit",
  "vs":"wind_speed",
}
# Define the custom colormap with specified colors and ranges
colors = [
    (0.8627, 0.8627, 0.8627),  # #DCDCDC - 0 - 1
    (0.8627, 1.0000, 1.0000),  # #DCFFFF - 1 - 2
    (0.6000, 1.0000, 1.0000),  # #99FFFF - 2 - 4
    (0.5569, 0.8235, 1.0000),  # #8ED2FF - 4 - 6
    (0.4509, 0.6196, 0.8745),  # #739EDF - 6 - 8
    (0.4157, 0.4706, 1.0000),  # #6A78FF - 8 - 10
    (0.4235, 0.2784, 1.0000),  # #6C47FF - 10 - 12
    (0.5529, 0.0980, 1.0000),  # #8D19FF - 12 - 14
    (0.7333, 0.0000, 0.9176),  # #BB00EA - 14 - 16
    (0.8392, 0.0000, 0.7490),  # #D600BF - 16 - 18
    (0.7569, 0.0039, 0.4549),  # #C10074 - 18 - 20
    (0.6784, 0.0000, 0.1961),  # #AD0032 - 20 - 30
    (0.5020, 0.0000, 0.0000)   # #800000 - > 30
]

First, we import the libraries needed for data manipulation, file handling, and visualization. We then define key variables such as folder paths, set up a mapping from GridMET short variable names to the variable names used inside the NetCDF files, and establish a custom color scheme for visualization.


In [4]:
def create_color_maps_with_value_range(df_col, value_ranges=None):
  if value_ranges is None:
    max_value = df_col.max()
    min_value = df_col.min()
    if min_value < 0:
      min_value = 0
    step_size = (max_value - min_value) / 12

    # Create 12 evenly spaced thresholds starting at min_value
    new_value_ranges = [min_value + i * step_size for i in range(12)]
  else:
    # Use the caller-supplied thresholds as-is
    new_value_ranges = value_ranges

  # Map a single data value to a color
  def map_value_to_color(value):
    # Iterate through the thresholds to find the appropriate color index
    for i, range_max in enumerate(new_value_ranges):
      if value <= range_max:
        return colors[i]

    # If the value is greater than the largest threshold, return the last color
    return colors[-1]

  # Map every value in the column to a color
  color_mapping = [map_value_to_color(value) for value in df_col.values]
  return color_mapping, new_value_ranges

We generate color mappings for data values based on specified or automatically calculated value ranges. If no ranges are supplied, we compute them from the data: we take the minimum (clamped to 0 if negative) and the maximum, and derive 12 evenly spaced thresholds between them. For each value, we find the first threshold it does not exceed and assign the corresponding color; values above the largest threshold receive the last color. The function returns the list of colors together with the thresholds used.
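
The binning idea can be illustrated on a much smaller scale. The following is a minimal sketch, not the chapter's function: it assumes a hypothetical three-color `palette` and uses `np.digitize` in place of the explicit loop, with equal-width bins between the (clamped) minimum and the maximum.

```python
import numpy as np

# Hypothetical three-color palette; the chapter's `colors` list has 13 entries.
palette = ["lightgray", "skyblue", "navy"]

def bin_values_to_colors(values, n_bins=3):
    """Assign each value to one of n_bins equal-width bins and return one color per value."""
    values = np.asarray(values, dtype=float)
    low = max(values.min(), 0.0)   # a negative minimum is clamped to 0, as in the text
    high = values.max()
    # Interior cut points: n_bins - 1 edges strictly between low and high
    edges = np.linspace(low, high, n_bins + 1)[1:-1]
    # right=True places a value equal to an edge in the lower bin
    idx = np.digitize(values, edges, right=True)
    return [palette[i] for i in idx], edges

colors_out, edges = bin_values_to_colors([0.0, 1.0, 5.0, 9.0])
print(colors_out, edges)
```

With more colors than bins the extra colors are simply unused, so the palette length only needs to be at least `n_bins`.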

# Gridmet

In [6]:
def get_current_year():
    """
    Get the current year.

    Returns:
        int: The current year.
    """
    now = datetime.now()
    current_year = now.year
    return current_year

We retrieve the current year from the system's date and time settings.

In [7]:
def remove_files_in_folder(folder_path, current_year):
    """
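    Remove the NetCDF files for a given year from a folder.

    Parameters:
        folder_path (str): Path to the folder to remove files from.
        current_year (int): Only files whose path contains this year and ends with ".nc" are removed.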
    """
    # Get a list of files in the folder
    files = os.listdir(folder_path)

    # Loop through the files and remove them
    for file in files:
        file_path = os.path.join(folder_path, file)
        if os.path.isfile(file_path) and str(current_year) in file_path and file_path.endswith(".nc"):
            os.remove(file_path)
            print(f"Deleted file: {file_path}")

We iterate through the files within a designated folder and delete those that meet certain criteria, such as being associated with the current year and having a specific file extension. This ensures that only relevant files are removed from the directory.
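
To see the filter in action without touching real data, here is a small sketch that runs against a temporary folder. The helper name `remove_year_netcdf` is hypothetical, and unlike the function above it matches the year against the file name only, which avoids accidental matches when the folder path itself happens to contain the year.

```python
import os
import tempfile

def remove_year_netcdf(folder, year):
    """Delete only .nc files whose name contains the given year; return the names removed."""
    removed = []
    for name in sorted(os.listdir(folder)):
        path = os.path.join(folder, name)
        if os.path.isfile(path) and str(year) in name and name.endswith(".nc"):
            os.remove(path)
            removed.append(name)
    return removed

with tempfile.TemporaryDirectory() as tmp:
    # Create three empty placeholder files
    for name in ["pr_2023.nc", "pr_2024.nc", "pr_2024.csv"]:
        open(os.path.join(tmp, name), "w").close()
    removed = remove_year_netcdf(tmp, 2024)
    remaining = sorted(os.listdir(tmp))

print(removed, remaining)
```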

In [8]:
def download_file(url, target_file_path, variable):
    """
    Download a file from a URL and save it to a specified location.

    Parameters:
        url (str): URL of the file to download.
        target_file_path (str): Path where the downloaded file should be saved.
        variable (str): Name of the meteorological variable being downloaded.
    """
    try:
        with urllib.request.urlopen(url) as response:
            print(f"Downloading {url}")
            file_content = response.read()
        save_path = target_file_path
        with open(save_path, 'wb') as file:
            file.write(file_content)
        print(f"File downloaded successfully and saved as: {save_path}")
    except Exception as e:
        print(f"An error occurred while downloading the file: {str(e)}")

We download a file from a given URL and store it in a specified location. The `variable` argument names the meteorological variable being downloaded; the function body does not currently use it. On success, we print a confirmation message containing the saved file's path. If an error occurs during the download, we print an error message instead of raising an exception.
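
Because later steps skip any file that already exists, a partially written download could be mistaken for a complete one. One possible variation, sketched here under that assumption, streams the response to a temporary `.part` file and renames it only on success. The helper name `download_to` is hypothetical, and the demonstration uses a local `file://` URL so that it runs without network access.

```python
import os
import shutil
import tempfile
import urllib.request
from pathlib import Path

def download_to(url, target_path, chunk_size=1024 * 1024):
    """Stream url to target_path via a temporary '.part' file; return True on success."""
    part_path = f"{target_path}.part"
    try:
        with urllib.request.urlopen(url) as response, open(part_path, "wb") as out:
            shutil.copyfileobj(response, out, chunk_size)
        os.replace(part_path, target_path)  # rename only after the full body was written
        return True
    except OSError as exc:  # urllib.error.URLError is a subclass of OSError
        print(f"Download failed for {url}: {exc}")
        if os.path.exists(part_path):
            os.remove(part_path)
        return False

with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.nc"
    source.write_bytes(b"fake netcdf bytes")
    target = os.path.join(tmp, "pr_2024.nc")
    ok = download_to(source.as_uri(), target)
    copied = Path(target).read_bytes()
    # A URL that cannot be opened reports failure instead of raising
    missing_ok = download_to((Path(tmp) / "absent.nc").as_uri(), os.path.join(tmp, "x.nc"))
```

Streaming in chunks also avoids holding a whole yearly NetCDF file in memory, which `response.read()` does.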

In [9]:
def download_gridmet_of_specific_variables(year_list):
    """
    Download specific meteorological variables from the GridMET climatology dataset.
    """
    # Make a directory to store the downloaded files
    os.makedirs(gridmet_folder_name, exist_ok=True)

    base_metadata_url = "http://www.northwestknowledge.net/metdata/data/"
    variables_list = ['tmmn', 'tmmx', 'pr', 'vpd', 'etr', 'rmax', 'rmin', 'vs']

    for var in variables_list:
        for y in year_list:
            download_link = base_metadata_url + var + '_' + '%s' % y + '.nc'
            target_file_path = os.path.join(gridmet_folder_name, var + '_' + '%s' % y + '.nc')
            if not os.path.exists(target_file_path):
                download_file(download_link, target_file_path, var)
            else:
                print(f"File {target_file_path} exists")

We download specific meteorological variables from the GridMET climatology dataset by iterating over years and variables. If a file for a specific year and variable combination doesn't exist, we download it.
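
The naming scheme is easy to inspect without downloading anything. The sketch below only builds the list of (URL, local path) pairs; `build_download_plan` is a hypothetical helper, while the base URL and the `<variable>_<year>.nc` pattern are taken from the function above.

```python
import os

BASE_URL = "http://www.northwestknowledge.net/metdata/data/"

def build_download_plan(variables, years, folder):
    """Return (url, local_path) pairs for every variable and year combination."""
    plan = []
    for var in variables:
        for year in years:
            file_name = f"{var}_{year}.nc"
            plan.append((BASE_URL + file_name, os.path.join(folder, file_name)))
    return plan

plan = build_download_plan(["pr", "vs"], [2023, 2024], "gridmet_climatology")
for url, path in plan:
    print(url, "->", path)
```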


In [11]:
def get_file_name_from_path(file_path):
    # Get the file name from the file path
    file_name = os.path.basename(file_path)
    return file_name

In [12]:
def get_var_from_file_name(file_name):
    # Assuming the file name format is "<variable>_<year>.nc", e.g. "tmmn_2024.nc"
    var_name = str(file_name.split('_')[0])
    return var_name

We extract the file name from a given file path, then take the text before the first underscore as the variable name.
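
Both steps can be combined when the year is needed as well. This is a small sketch with a hypothetical helper, `split_gridmet_file_name`, assuming the `<variable>_<year>.nc` naming used by the downloader.

```python
import os

def split_gridmet_file_name(path):
    """Split '<variable>_<year>.nc' into (variable, year)."""
    stem = os.path.splitext(os.path.basename(path))[0]
    var_name, year = stem.split("_", 1)
    return var_name, int(year)

parsed = split_gridmet_file_name("/data/gridmet_climatology/tmmx_2024.nc")
print(parsed)
```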

In [13]:
def get_coordinates_of_template_tif():
    # Load the CSV file and extract coordinates
    coordinates = []
    df = pd.read_csv(western_us_coords)
    for index, row in df.iterrows():
        # Each row holds one DEM grid point
        lat, lon = float(row["Latitude"]), float(row["Longitude"])
        coordinates.append((lat, lon))
    return coordinates

We load a CSV file and extract coordinates from it, resulting in a list of latitude and longitude tuples.

In [14]:
def find_nearest_index(array, value):
    # Find the index of the element in the array that is closest to the given value
    return (abs(array - value)).argmin()

We find the index of the element in the array that is closest to the given value.
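
A quick check on a synthetic latitude axis shows how the lookup behaves. The array values are made up for illustration; when two elements are equally close, `argmin` returns the first one.

```python
import numpy as np

def find_nearest_index(array, value):
    """Index of the element closest to value."""
    return int(np.abs(np.asarray(array) - value).argmin())

lats = np.array([49.0, 48.5, 48.0, 47.5])  # synthetic grid latitudes
idx = find_nearest_index(lats, 48.3)
print(idx, lats[idx])
```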

In [16]:
def create_gridmet_to_dem_mapper(nc_file):
    western_us_dem_df = pd.read_csv(western_us_coords)
    # Check if the CSV already exists
    target_csv_path = f'{work_dir}/gridmet_to_dem_mapper.csv'
    if os.path.exists(target_csv_path):
        print(f"File {target_csv_path} already exists, skipping..")
        return
    
    # Read the grid coordinates from the NetCDF file; they are used below to find,
    # for every coordinate in the DEM CSV, the nearest GridMET cell
    with nc.Dataset(nc_file) as dataset:
      latitudes = dataset.variables['lat'][:]
      longitudes = dataset.variables['lon'][:]
      
      def get_gridmet_var_value(row):
        # Perform your custom calculation here
        gridmet_lat_index = find_nearest_index(latitudes, float(row["Latitude"]))
        gridmet_lon_index = find_nearest_index(longitudes, float(row["Longitude"]))
        return latitudes[gridmet_lat_index], longitudes[gridmet_lon_index], gridmet_lat_index, gridmet_lon_index
    
      # Use the apply function to apply the custom function to each row
      western_us_dem_df[['gridmet_lat', 'gridmet_lon', 
                         'gridmet_lat_idx', 'gridmet_lon_idx',]] = western_us_dem_df.apply(lambda row: pd.Series(get_gridmet_var_value(row)), axis=1)
      western_us_dem_df.rename(columns={"Latitude": "dem_lat", 
                                        "Longitude": "dem_lon"}, inplace=True)
      
    # print(western_us_dem_df.head())
    
    # Save the DEM-to-GridMET index mapping to a CSV file
    western_us_dem_df.to_csv(target_csv_path, index=False)
    
    return western_us_dem_df

We build a mapping between the coordinates in the DEM template CSV file and the GridMET grid. For each DEM point, we find the nearest GridMET latitude and longitude and record both the matched coordinates and their array indices. The mapping is saved to `gridmet_to_dem_mapper.csv` so that later steps can look up values by index instead of repeating the nearest-neighbor search; if the file already exists, the function skips the computation.
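
The same mapping can be sketched on synthetic data without a NetCDF file. The example below is an illustration under assumptions: the grid arrays and DEM points are invented, and it replaces the row-wise `apply` with NumPy broadcasting, which computes all nearest indices at once.

```python
import numpy as np
import pandas as pd

# Synthetic grid axes and DEM points (illustrative values only)
grid_lats = np.array([42.0, 41.5, 41.0])
grid_lons = np.array([-120.0, -119.5, -119.0])
dem = pd.DataFrame({"Latitude": [41.6, 41.1], "Longitude": [-119.9, -119.2]})

# Broadcasting: (n_points, 1) minus (n_grid,) gives an (n_points, n_grid) distance table
lat_idx = np.abs(dem["Latitude"].to_numpy()[:, None] - grid_lats).argmin(axis=1)
lon_idx = np.abs(dem["Longitude"].to_numpy()[:, None] - grid_lons).argmin(axis=1)

mapper = pd.DataFrame({
    "dem_lat": dem["Latitude"],
    "dem_lon": dem["Longitude"],
    "gridmet_lat": grid_lats[lat_idx],
    "gridmet_lon": grid_lons[lon_idx],
    "gridmet_lat_idx": lat_idx,
    "gridmet_lon_idx": lon_idx,
})
print(mapper)
```

For a large DEM the distance table can become big, so processing the points in chunks may be preferable.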

In [17]:
def get_operation_day():
  # Get the current date and time
  current_date = datetime.now()

  # Calculate three days ago
  three_days_ago = current_date - timedelta(days=3)

  # Format the date as a string
  three_days_ago_string = three_days_ago.strftime("%Y-%m-%d")

  print(three_days_ago_string)
  return three_days_ago_string

test_start_date = get_operation_day()

2024-03-25


We retrieve the date of an operational day by calculating the date three days prior to the current date. This operational day is crucial for setting up various time-sensitive operations or analyses. 
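
For testing, it helps to make the reference time injectable so that the result does not depend on when the code runs. This is a minimal sketch; the `now` and `lag_days` parameters are assumptions added for illustration and are not part of the function above.

```python
from datetime import datetime, timedelta

def get_operation_day(now=None, lag_days=3):
    """Return the date lag_days before `now` (default: the current time) as 'YYYY-MM-DD'."""
    now = now or datetime.now()
    return (now - timedelta(days=lag_days)).strftime("%Y-%m-%d")

fixed = get_operation_day(datetime(2024, 3, 28))
across_month = get_operation_day(datetime(2024, 3, 2))
print(fixed, across_month)
```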

In [18]:
def get_nc_csv_by_coords_and_variable(nc_file, var_name, target_date=test_start_date):
    
    create_gridmet_to_dem_mapper(nc_file)
  	
    mapper_df = pd.read_csv(f'{work_dir}/gridmet_to_dem_mapper.csv')
    
    # get the netcdf file and generate the csv file for every coordinate in the dem_template.csv
    selected_date = datetime.strptime(target_date, "%Y-%m-%d")
    # Read the NetCDF file
    with nc.Dataset(nc_file) as nc_file:
      # Load the coordinate arrays and the requested variable
      latitudes = nc_file.variables['lat'][:]
      longitudes = nc_file.variables['lon'][:]
      day = nc_file.variables['day'][:]
      long_var_name = gridmet_var_mapping[var_name]
      var_col = nc_file.variables[long_var_name][:]
      #print("val_col.shape: ", var_col.shape)
      
      # Calculate the day of the year
      day_of_year = selected_date.timetuple().tm_yday
      day_index = day_of_year - 1
      #print('day_index:', day_index)
      
      def get_gridmet_var_value(row):
        # Perform your custom calculation here
        lat_index = int(row["gridmet_lat_idx"])
        lon_index = int(row["gridmet_lon_idx"])
        var_value = var_col[day_index, lat_index, lon_index]
        
        return var_value
    
      # Use the apply function to apply the custom function to each row
      # print(mapper_df.columns)
      # print(mapper_df.head())
      mapper_df[var_name] = mapper_df.apply(get_gridmet_var_value, axis=1)
      
      # print("mapper_df[var_name]: ", mapper_df[var_name].describe())
      
      # drop useless columns
      mapper_df = mapper_df[["dem_lat", "dem_lon", var_name]]
      mapper_df.rename(columns={"dem_lat": "Latitude",
                               "dem_lon": "Longitude"}, inplace=True)

      
    # print(mapper_df.head())
    return mapper_df

We extract the values of one variable from a NetCDF file for a given date. Using the precomputed DEM-to-GridMET mapping, we convert the target date to a zero-based day-of-year index and read the value at each DEM point's grid indices. The result is a DataFrame with `Latitude`, `Longitude`, and one column of values named after the variable.
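
The date-to-index step is worth checking in isolation. The sketch below assumes, as the function above does, that each yearly file stores one slice per day starting on January 1 with axes ordered (day, lat, lon); the array is synthetic.

```python
from datetime import datetime
import numpy as np

def day_index_for(date_str):
    """Zero-based day-of-year index for a 'YYYY-MM-DD' date."""
    return datetime.strptime(date_str, "%Y-%m-%d").timetuple().tm_yday - 1

# Synthetic (day, lat, lon) cube standing in for one year of a variable
cube = np.arange(366 * 2 * 2).reshape(366, 2, 2)

idx = day_index_for("2024-03-25")
value = cube[idx, 1, 0]  # value at lat index 1, lon index 0 on that day
print(idx, value)
```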

In [19]:
def turn_gridmet_nc_to_csv(target_date=test_start_date):
    
    selected_date = datetime.strptime(target_date, "%Y-%m-%d")
    generated_csvs = []
    for root, dirs, files in os.walk(gridmet_folder_name):
        for file_name in files:
            
            if str(selected_date.year) in file_name and file_name.endswith(".nc"):
                print(f"Checking file: {file_name}")
                var_name = get_var_from_file_name(file_name)
                # print("Variable name:", var_name)
                res_csv = f"{work_dir}/testing_output/{str(selected_date.year)}_{var_name}_{target_date}.csv"

                if os.path.exists(res_csv):
                    #os.remove(res_csv)
                    print(f"{res_csv} already exists. Skipping..")
                    generated_csvs.append(res_csv)
                    continue

                # Perform operations on each file here
                netcdf_file_path = os.path.join(root, file_name)
                print("Processing file:", netcdf_file_path)
                file_name = get_file_name_from_path(netcdf_file_path)

                df = get_nc_csv_by_coords_and_variable(netcdf_file_path, 
                                                       var_name, target_date)
                df.replace('--', pd.NA, inplace=True)
                df.to_csv(res_csv, index=False)
                print("gridmet var saved: ", res_csv)
                generated_csvs.append(res_csv)
    return generated_csvs   


We convert GridMET NetCDF files to CSV format for a specified date. We iterate through the files in the GridMET folder and select the `.nc` files whose names contain the year of the selected date. For each matching file, we extract the variable name and generate a CSV file containing that day's data. If the CSV file already exists, we skip it and reuse the existing file. The function returns the list of CSV paths for the date.

In [20]:
def plot_gridmet(target_date=test_start_date):
  selected_date = datetime.strptime(target_date, "%Y-%m-%d")
  var_name = "pr"
  test_csv = f"{work_dir}/testing_output/{str(selected_date.year)}_{var_name}_{target_date}.csv"
  gridmet_var_df = pd.read_csv(test_csv)
  gridmet_var_df.replace('--', pd.NA, inplace=True)
  gridmet_var_df.dropna(inplace=True)
  gridmet_var_df['pr'] = pd.to_numeric(gridmet_var_df['pr'], errors='coerce')
  #print(gridmet_var_df.head())
  #print(gridmet_var_df["Latitude"].describe())
  #print(gridmet_var_df["Longitude"].describe())
  #print(gridmet_var_df["pr"].describe())
  
  colormaplist, value_ranges = create_color_maps_with_value_range(gridmet_var_df[var_name])
  
  # Create a scatter plot
  plt.scatter(gridmet_var_df["Longitude"].values, 
              gridmet_var_df["Latitude"].values, 
              label='Precipitation', 
              color=colormaplist, 
              marker='o')

  # Add labels and a legend
  plt.xlabel('Longitude')
  plt.ylabel('Latitude')
  plt.title(f'GridMET {var_name} on {target_date}')
  plt.legend()
  
  res_png_path = f"{work_dir}/testing_output/{str(selected_date.year)}_{var_name}_{target_date}.png"
  plt.savefig(res_png_path)
  print(f"test image is saved at {res_png_path}")

We plot GridMET meteorological data for a specific variable and date. We read the data from a corresponding CSV file and preprocess it, ensuring valid numerical values. Then, we create a scatter plot, mapping the variable values to geographic coordinates. The color of each point on the plot represents the magnitude of the variable value. Finally, we save the plot as a PNG image for further analysis and visualization.

In [21]:
def prepare_folder_and_get_year_list(target_date=test_start_date):
  # Check if the folder exists, if not, create it
  if not os.path.exists(gridmet_folder_name):
      os.makedirs(gridmet_folder_name)

  selected_date = datetime.strptime(target_date, "%Y-%m-%d")
  if selected_date.month < 10:
    past_october_1 = datetime(selected_date.year - 1, 10, 1)
  else:
    past_october_1 = datetime(selected_date.year, 10, 1)
  year_list = [selected_date.year, past_october_1.year]

  # Remove any existing files in the folder
  if selected_date.year == datetime.now().year:
    # check if the current year's netcdf contains the selected date
    # read the tmmx netcdf for that year
    nc_file = f"{gridmet_folder_name}/tmmx_{selected_date.year}.nc"
    ifremove = False
    if os.path.exists(nc_file):
      with nc.Dataset(nc_file) as ncd:
        day = ncd.variables['day'][:]
        # Calculate the day of the year
        day_of_year = selected_date.timetuple().tm_yday
        day_index = day_of_year - 1
        if len(day) <= day_index:
          ifremove = True
    
    if ifremove:
      print("The current year netcdf has new data. Redownloading..")
      remove_files_in_folder(gridmet_folder_name, selected_date.year)  # only redownload when the year is the current year
    else:
      print("The existing netcdf already covers the selected date. Avoid downloading..")
  return year_list

We make sure the folder for GridMET data exists and build the list of years to download: the year of the target date and the year of the most recent October 1st. When the target date falls in the current year, we also check whether the local `tmmx` file already contains that day; if it does not, we delete the current year's NetCDF files so that they are downloaded again.
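
The October 1st rule can be isolated into two small helpers. This is a sketch with hypothetical function names; unlike the list built above, `years_needed` removes the duplicate that arises when the target date and the preceding October 1st fall in the same year.

```python
from datetime import datetime

def past_october_first(target_date):
    """Most recent October 1st on or before the target date."""
    d = datetime.strptime(target_date, "%Y-%m-%d")
    start_year = d.year - 1 if d.month < 10 else d.year
    return datetime(start_year, 10, 1)

def years_needed(target_date):
    """Sorted, de-duplicated years spanned from the past October 1st to the target date."""
    d = datetime.strptime(target_date, "%Y-%m-%d")
    return sorted({past_october_first(target_date).year, d.year})

print(years_needed("2024-03-25"), years_needed("2023-11-15"))
```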

In [22]:
def add_cumulative_column(df, column_name):
  df[f'cumulative_{column_name}'] = df[column_name].sum()
  return df

We add a column named `cumulative_<column_name>` to a DataFrame. Note that `sum()` returns a single total, so every row receives the same value; this is the total over all rows in the DataFrame rather than a running sum.
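
The difference between a broadcast total and a running sum is easiest to see side by side. The following sketch uses invented precipitation values; `cumsum()` is shown only as a contrast and is not what the function above calls.

```python
import pandas as pd

df = pd.DataFrame({
    "date": ["2023-10-01", "2023-10-02", "2023-10-03"],
    "pr": [1.0, 0.0, 2.5],
})

# sum() yields one scalar that pandas broadcasts to every row
df["total_pr"] = df["pr"].sum()
# cumsum() yields a running total, one value per row
df["running_pr"] = df["pr"].cumsum()
print(df)
```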

In [23]:
def prepare_cumulative_history_csvs(target_date=test_start_date, force=False):
  """
    Prepare cumulative history CSVs for a specified target date.

    Parameters:
    - target_date (str, optional): The target date in the format 'YYYY-MM-DD'. Defaults to `test_start_date`.
    - force (bool, optional): If True, regenerate the cumulative CSVs even if they already exist. Default is False.

    Returns:
    None

    This function traverses the date range from the past October 1 to the target date,
    downloads GridMET data, and converts it to one CSV per variable and day. It then writes
    a `<csv>_cumulative.csv` file for each variable of the target date.

    Example:
        prepare_cumulative_history_csvs(target_date='2023-01-01', force=True)

    Note: This function relies on the following helper functions:
    - download_gridmet_of_specific_variables
    - prepare_folder_and_get_year_list
    - turn_gridmet_nc_to_csv
  """
  selected_date = datetime.strptime(target_date, "%Y-%m-%d")
  print(selected_date)
  if selected_date.month < 10:
    past_october_1 = datetime(selected_date.year - 1, 10, 1)
  else:
    past_october_1 = datetime(selected_date.year, 10, 1)

  # Traverse and print every day from past October 1 to the specific date
  current_date = past_october_1
  
  date_keyed_objects = {}
  
  download_gridmet_of_specific_variables(
    prepare_folder_and_get_year_list(target_date=target_date)
  )
  
  while current_date <= selected_date:
    print(current_date.strftime('%Y-%m-%d'))
    current_date_str = current_date.strftime('%Y-%m-%d')
    
    
    generated_csvs = turn_gridmet_nc_to_csv(target_date=current_date_str)
    
    # read the csv into dataframe and merge to the big dataframe
    date_keyed_objects[current_date_str] = generated_csvs
    
    current_date += timedelta(days=1)
    
  print("date_keyed_objects: ", date_keyed_objects)
  target_generated_csvs = date_keyed_objects[target_date]
  for index, single_csv in enumerate(target_generated_csvs):
    # traverse the variables of gridmet here
    # each variable is a loop
    print(f"creating cumulative for {single_csv}")
    
    cumulative_target_path = f"{single_csv}_cumulative.csv"
    print("cumulative_target_path = ", cumulative_target_path)
    
    if os.path.exists(cumulative_target_path) and not force:
      print(f"{cumulative_target_path} already exists, skipping..")
      continue
    
    # Extract the file name without extension
    file_name = os.path.splitext(os.path.basename(single_csv))[0]
    gap_filled_csv = f"{cumulative_target_path}_gap_filled.csv"

	# Split the file name using underscores
    var_name = file_name.split('_')[1]
    print(f"Found variable name {var_name}")
    current_date = past_october_1
    new_df = pd.read_csv(single_csv)
    print(new_df.head())
    
    all_df = pd.read_csv(f"{work_dir}/testing_output/{str(selected_date.year)}_{var_name}_{target_date}.csv")
    all_df["date"] = target_date
    all_df[var_name] = pd.to_numeric(all_df[var_name], errors='coerce')
    filled_data = all_df
    filled_data = filled_data[(filled_data['date'] == target_date)]
    filled_data.fillna(0, inplace=True)
    print("Finished correctly ", filled_data.head())
    filled_data = filled_data[['Latitude', 'Longitude', 
                               var_name, 
#                                f'cumulative_{var_name}'
                              ]]
    print(filled_data.shape)
    filled_data.to_csv(cumulative_target_path, index=False)
    print(f"new df is saved to {cumulative_target_path}")
    print(filled_data.describe())


We prepare cumulative history CSVs for a specified target date. We traverse the date range from the past October 1 to the target date, download GridMET data, and convert it to CSV for each day. In the version shown here, the `cumulative_<variable>` column is commented out, so each `_cumulative.csv` file contains the target date's values with missing entries filled with 0.
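
As an illustration of what a per-location cumulative column could look like, the sketch below stacks daily values into one DataFrame and computes a running total per coordinate pair. This is one possible approach under stated assumptions, not the chapter's implementation: the data is synthetic, and the choice of `groupby(...).cumsum()` is ours.

```python
import pandas as pd

# Synthetic daily values for two locations over two days
daily = pd.DataFrame({
    "date": ["2023-10-01", "2023-10-01", "2023-10-02", "2023-10-02"],
    "Latitude": [40.0, 41.0, 40.0, 41.0],
    "Longitude": [-120.0, -120.0, -120.0, -120.0],
    "pr": [1.0, 0.0, 2.0, float("nan")],
})

daily["pr"] = daily["pr"].fillna(0)              # treat missing values as 0
daily = daily.sort_values("date", kind="stable")  # running totals need chronological order
daily["cumulative_pr"] = daily.groupby(["Latitude", "Longitude"])["pr"].cumsum()

# Keep only the rows for the target date
latest = daily[daily["date"] == "2023-10-02"]
print(latest[["Latitude", "Longitude", "pr", "cumulative_pr"]])
```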

In [25]:
if __name__ == "__main__":
  # Run the download function
  #   download_gridmet_of_specific_variables(prepare_folder_and_get_year_list())
  #   turn_gridmet_nc_to_csv()
  #   plot_gridmet()

  # prepare testing data with cumulative variables
  prepare_cumulative_history_csvs(force=True)

2024-03-25 00:00:00
The existing netcdf already covers the selected date. Avoid downloading..
File /Users/meghana/gridmet_test_run/gridmet_climatology/tmmn_2024.nc exists
File /Users/meghana/gridmet_test_run/gridmet_climatology/tmmn_2023.nc exists
File /Users/meghana/gridmet_test_run/gridmet_climatology/tmmx_2024.nc exists
File /Users/meghana/gridmet_test_run/gridmet_climatology/tmmx_2023.nc exists
File /Users/meghana/gridmet_test_run/gridmet_climatology/pr_2024.nc exists
File /Users/meghana/gridmet_test_run/gridmet_climatology/pr_2023.nc exists
File /Users/meghana/gridmet_test_run/gridmet_climatology/vpd_2024.nc exists
File /Users/meghana/gridmet_test_run/gridmet_climatology/vpd_2023.nc exists
File /Users/meghana/gridmet_test_run/gridmet_climatology/etr_2024.nc exists
File /Users/meghana/gridmet_test_run/gridmet_climatology/etr_2023.nc exists
File /Users/meghana/gridmet_test_run/gridmet_climatology/rmax_2024.nc exists
File /Users/meghana/gridmet_test_run/gridmet_climatology/rmax_2023.

When we run the script directly (`__name__ == "__main__"`), we call `prepare_cumulative_history_csvs` with `force=True`. This downloads any missing GridMET files for the required variables, converts them into CSV format for each day in the date range, and writes the cumulative CSVs. Setting `force=True` ensures that the cumulative CSVs are regenerated even if they already exist.

# GridMET Climatology Data Retrieval and Analysis

In this chapter, we cover:

- **Data Collection:** The script fetches GridMET climatology data from a specified source for various meteorological variables (e.g., temperature, precipitation) and multiple years. It ensures that the data is downloaded for each variable and year required for analysis.

- **Data Processing:** After downloading, the script extracts relevant data for specific geographical locations corresponding to weather stations. It organizes the data into structured formats (CSV files) for easier handling and analysis.

- **Data Integration:** We merge similar variables obtained from different years into separate CSV files. We then combine all variables into a single dataset for further analysis and modeling tasks, so that long-term weather patterns and trends can be studied in one place.


In [1]:
import os
import glob
import urllib.request
from datetime import date, datetime

import pandas as pd
import xarray as xr
from pathlib import Path
import warnings

homedir = os.path.expanduser('~')
train_start_date = "2018-01-03"
train_end_date = "2021-12-31"

work_dir = f"{homedir}/gridmet_test_run"

We import the modules and libraries the script needs: functionality for interacting with the operating system, handling file paths, making URL requests, managing dates and times, manipulating tabular data, working with multi-dimensional arrays, and suppressing warnings.

In [2]:

# Suppress FutureWarnings
warnings.filterwarnings("ignore", category=FutureWarning)

start_date = datetime.strptime(train_start_date, "%Y-%m-%d")
end_date = datetime.strptime(train_end_date, "%Y-%m-%d")

year_list = [start_date.year + i for i in range(end_date.year - start_date.year + 1)]

working_dir = work_dir
#stations = pd.read_csv(f'{working_dir}/station_cell_mapping.csv')
stations = pd.read_csv(f"{work_dir}/all_snotel_cdec_stations_active_in_westus.csv")
gridmet_save_location = f'{working_dir}/gridmet_climatology'
final_merged_csv = f"{work_dir}/training_all_active_snotel_station_list_elevation.csv_gridmet.csv"


### Explanation:

In this code snippet, FutureWarnings are suppressed using `warnings.filterwarnings("ignore", category=FutureWarning)`. Following that, the start and end dates for data processing are defined by converting the `train_start_date` and `train_end_date` strings into datetime objects. Subsequently, a list of years between the start and end dates is generated. The working directory path and file paths for SNOTEL stations data, GridMET climatology data storage, and the final merged CSV file are then set up accordingly. Overall, this snippet prepares the necessary parameters and file paths for subsequent data processing steps.
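
The year list can be checked on its own. This sketch wraps the same arithmetic in a hypothetical helper, `years_between`, and uses `range` instead of an offset-based list comprehension.

```python
from datetime import datetime

def years_between(start, end):
    """Inclusive list of years between two 'YYYY-MM-DD' dates."""
    start_year = datetime.strptime(start, "%Y-%m-%d").year
    end_year = datetime.strptime(end, "%Y-%m-%d").year
    return list(range(start_year, end_year + 1))

years = years_between("2018-01-03", "2021-12-31")
print(years)
```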


In [4]:
def get_files_in_directory():
    f = list()
    for files in glob.glob(gridmet_save_location + "/*.nc"):
        f.append(files)
    return f

We collect the names of files with the extension ".nc" within a specified directory by iterating through all files, appending their names to a list, and returning the list.

In [5]:
def download_file(url, save_location):
    try:
        print("download_file")
        with urllib.request.urlopen(url) as response:
            file_content = response.read()
        file_name = os.path.basename(url)
        save_path = os.path.join(save_location, file_name)
        with open(save_path, 'wb') as file:
            file.write(file_content)
        print(f"File downloaded successfully and saved as: {save_path}")
    except Exception as e:
        print(f"An error occurred while downloading the file: {str(e)}")

We attempt to download a file from a specified URL. We then save the downloaded file to a specified location.

In [6]:
def download_gridmet_climatology():
    folder_name = gridmet_save_location
    if not os.path.exists(folder_name):
        os.makedirs(folder_name)

    base_metadata_url = "http://www.northwestknowledge.net/metdata/data/"
    variables_list = ['tmmn', 'tmmx', 'pr', 'vpd', 'etr', 'rmax', 'rmin', 'vs']

    for var in variables_list:
        for y in year_list:
            download_link = base_metadata_url + var + '_' + '%s' % y + '.nc'
            print("downloading", download_link)
            if not os.path.exists(os.path.join(folder_name, var + '_' + '%s' % y + '.nc')):
                download_file(download_link, folder_name)

We set up a folder to store data. Then, we gather GridMET climatology data for a range of variables over multiple years. This involves collecting minimum and maximum temperature, precipitation, vapor pressure deficit, reference evapotranspiration, maximum and minimum relative humidity, and wind speed. Files that already exist locally are not downloaded again.

In [7]:
import xarray as xr
from pathlib import Path

def get_gridmet_variable(file_name):
    print(f"reading values from {file_name}")

    # Skip files that have already been converted to CSV.
    csv_file = f'{gridmet_save_location}/{Path(file_name).stem}.csv'
    if os.path.exists(csv_file):
        print(f"The file '{csv_file}' exists.")
        return

    ds = xr.open_dataset(file_name)
    var_to_extract = list(ds.keys())
    print(var_to_extract)

    frames = []
    for idx, row in stations.iterrows():
        lat = row['latitude']
        lon = row['longitude']

        # Take the grid cell nearest to the station, then label the rows with
        # the station's own coordinates rather than the grid cell's.
        subset_data = ds.sel(lat=lat, lon=lon, method='nearest')
        subset_data['lat'] = lat
        subset_data['lon'] = lon
        converted_df = subset_data.to_dataframe().reset_index(drop=False)
        converted_df = converted_df.drop('crs', axis=1)
        frames.append(converted_df)

    # Concatenate once at the end instead of growing a DataFrame in the loop.
    result_df = pd.concat(frames, ignore_index=True)
    print("got result_df : ", result_df.head())
    result_df.to_csv(csv_file, index=False)
    print(f'completed extracting data for {file_name}')


We open one GridMET NetCDF file and, for every station in the `stations` table, select the grid cell nearest to the station's latitude and longitude. Each selection is converted to a DataFrame with one row per day, carrying the station's `lat` and `lon` and the value of the variable, and the `crs` column is dropped. The rows for all stations are combined and written to a CSV file named after the NetCDF file. If that CSV already exists, the function returns without doing any work.
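
To make the nearest-cell selection concrete, the sketch below does the same lookup by hand with NumPy on a tiny made-up grid. It is only an illustration of the idea behind `sel(..., method='nearest')`, which picks the closest coordinate along each axis independently; `nearest_index`, the coordinates, and the values are all hypothetical.

```python
import numpy as np


def nearest_index(axis_values, target):
    """Index of the grid coordinate closest to `target`."""
    return int(np.abs(np.asarray(axis_values) - target).argmin())


# Hypothetical 3 x 4 grid with made-up coordinates and values.
lats = np.array([49.0, 48.0, 47.0])
lons = np.array([-120.0, -119.0, -118.0, -117.0])
grid = np.arange(12.0).reshape(3, 4)  # grid[i, j] belongs to (lats[i], lons[j])

# A station at (47.9, -118.2) falls closest to the cell at (48.0, -118.0).
i = nearest_index(lats, 47.9)
j = nearest_index(lons, -118.2)
value = grid[i, j]
```

Note that the lookup returns the value of the grid cell, so two stations that fall in the same cell receive identical series.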

In [8]:

def merge_similar_variables_from_different_years():
    files = os.listdir(gridmet_save_location)
    file_groups = {}

    # Group yearly CSVs such as "tmmn_2018.csv" by variable name. Requiring a
    # numeric second part keeps earlier "*_merged.csv" outputs out of the groups,
    # so rerunning the function does not merge a merged file into itself.
    for filename in files:
        base_name, ext = os.path.splitext(filename)
        parts = base_name.split('_')
        if len(parts) == 2 and parts[1].isdigit() and ext == '.csv':
            file_groups.setdefault(parts[0], []).append(filename)

    for var_name, file_list in file_groups.items():
        if len(file_list) > 1:
            # Read the yearly files in name order, i.e. chronologically.
            dfs = []
            for filename in sorted(file_list):
                df = pd.read_csv(os.path.join(gridmet_save_location, filename))
                dfs.append(df)
            merged_df = pd.concat(dfs, ignore_index=True)
            merged_filename = f"{var_name}_merged.csv"
            merged_df.to_csv(os.path.join(gridmet_save_location, merged_filename), index=False)
            print(f"Merged {file_list} into {merged_filename}")

We list the files in the save directory and group the CSV files by the variable name that precedes the underscore in their file names. For each variable with more than one file, we read every file into a DataFrame, concatenate the DataFrames row-wise, and save the result as `<variable>_merged.csv`. A message then reports which files were merged. Variables with only a single file are left as they are, so no merged file is written for them.
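
The grouping step depends entirely on the file-naming pattern, so it can help to test it on plain file names before touching any data. The sketch below is one way to express it with a regular expression; `YEARLY_CSV` and `group_yearly_csvs` are hypothetical names, and the pattern assumes names of the form `<variable>_<4-digit year>.csv`.

```python
import re
from collections import defaultdict

# Matches e.g. "tmmn_2018.csv" but not "tmmn_merged.csv" or "tmmn_2018.nc".
YEARLY_CSV = re.compile(r"^(?P<var>[A-Za-z]+)_(?P<year>\d{4})\.csv$")


def group_yearly_csvs(filenames):
    """Group yearly CSV file names by variable, in sorted (chronological) order."""
    groups = defaultdict(list)
    for name in sorted(filenames):
        match = YEARLY_CSV.match(name)
        if match:
            groups[match.group("var")].append(name)
    return dict(groups)
```

Matching the year explicitly means that previously written `*_merged.csv` files are ignored if the step is run a second time.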

In [9]:
def merge_all_variables_together():
    # Some variables share a column name in their source files
    # (rmin/rmax: relative_humidity, tmmn/tmmx: air_temperature), so rename
    # those columns before combining to keep them distinguishable.
    renames = {
        'rmin': {'relative_humidity': 'relative_humidity_rmin'},
        'rmax': {'relative_humidity': 'relative_humidity_rmax'},
        'tmmn': {'air_temperature': 'air_temperature_tmmn'},
        'tmmx': {'air_temperature': 'air_temperature_tmmx'},
    }
    for var, mapping in renames.items():
        path = os.path.join(gridmet_save_location, f'{var}_merged.csv')
        df = pd.read_csv(path)
        df.rename(columns=mapping, inplace=True)
        # index=False avoids writing an extra "Unnamed: 0" index column.
        df.to_csv(path, index=False)

    file_paths = []
    for filename in os.listdir(gridmet_save_location):
        if filename.endswith("_merged.csv"):
            file_paths.append(os.path.join(gridmet_save_location, filename))

    if file_paths:
        # Columns are placed side by side by row position, so every file must
        # list the same (day, lat, lon) rows in the same order.
        merged_df = pd.concat([pd.read_csv(p) for p in file_paths], axis=1)
        # Keep only the first copy of repeated columns such as day, lat, lon.
        merged_df = merged_df.loc[:, ~merged_df.columns.duplicated()]
        merged_df.to_csv(final_merged_csv, index=False)


We collect every CSV file in the save directory whose name ends in `_merged.csv`. Because `rmin` and `rmax` both store their values in a column named `relative_humidity`, and `tmmn` and `tmmx` both use `air_temperature`, we rename those columns with the variable name as a suffix and overwrite the four merged files with the renamed versions. If at least one merged file is found, we concatenate all of them column-wise into a single DataFrame, drop columns whose names repeat (such as `day`, `lat`, and `lon`), and save the result to `final_merged_csv`. The column-wise concatenation aligns rows by position, so it relies on every merged file containing the same rows in the same order.
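
If the per-variable files cannot be guaranteed to share the same row order, joining on the key columns is a more defensive alternative to positional concatenation. The sketch below shows that alternative on made-up data; it is not what the function above does. The key columns `day`, `lat`, and `lon` come from the CSV layout described earlier, while the values and the `precipitation_amount` column name are assumptions for illustration.

```python
from functools import reduce

import pandas as pd

KEYS = ["day", "lat", "lon"]

# Made-up rows standing in for two per-variable merged files.
df_tmmn = pd.DataFrame({
    "day": ["2018-10-01", "2018-10-02"],
    "lat": [47.0, 47.0],
    "lon": [-120.0, -120.0],
    "air_temperature_tmmn": [275.1, 276.4],
})
# Same keys, deliberately in a different row order.
df_pr = pd.DataFrame({
    "day": ["2018-10-02", "2018-10-01"],
    "lat": [47.0, 47.0],
    "lon": [-120.0, -120.0],
    "precipitation_amount": [1.5, 0.0],
})

# Join every frame on the shared keys, so row order no longer matters.
combined = reduce(
    lambda left, right: left.merge(right, on=KEYS, how="outer"),
    [df_tmmn, df_pr],
)
```

The trade-off is speed and memory: a key-based join is slower than side-by-side concatenation on large tables, but it cannot silently pair values from different days or stations.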

In [10]:
if __name__ == "__main__":
    
    download_gridmet_climatology()
    
    nc_files = get_files_in_directory()
    # Use a loop name other than `nc`, which is the alias of the netCDF4 import.
    for nc_file in nc_files:
        get_gridmet_variable(nc_file)
    
    merge_similar_variables_from_different_years()
    merge_all_variables_together()

downloading http://www.northwestknowledge.net/metdata/data/tmmn_2018.nc
downloading http://www.northwestknowledge.net/metdata/data/tmmn_2019.nc
downloading http://www.northwestknowledge.net/metdata/data/tmmn_2020.nc
downloading http://www.northwestknowledge.net/metdata/data/tmmn_2021.nc
downloading http://www.northwestknowledge.net/metdata/data/tmmx_2018.nc
downloading http://www.northwestknowledge.net/metdata/data/tmmx_2019.nc
downloading http://www.northwestknowledge.net/metdata/data/tmmx_2020.nc
downloading http://www.northwestknowledge.net/metdata/data/tmmx_2021.nc
downloading http://www.northwestknowledge.net/metdata/data/pr_2018.nc
downloading http://www.northwestknowledge.net/metdata/data/pr_2019.nc
downloading http://www.northwestknowledge.net/metdata/data/pr_2020.nc
downloading http://www.northwestknowledge.net/metdata/data/pr_2021.nc
downloading http://www.northwestknowledge.net/metdata/data/vpd_2018.nc
downloading http://www.northwestknowledge.net/metdata/data/vpd_2019.nc
do

When the script is executed directly, it runs the full pipeline in order. First, it downloads the GridMET NetCDF files for every variable and year. Next, it lists the downloaded ".nc" files and extracts the station values from each one into a CSV file. It then merges the yearly CSV files of each variable into one file per variable, and finally combines the per-variable files into a single dataset. The output above shows the download step printing one line per requested file.