## USGS and EMIT Data Matchup
In this notebook we search the USGS database by state code and parameter code(s) to retrieve a list of sites. We then use the site coordinates to find matching EMIT granules and gather data around each EMIT granule's timestamp.

### 1. Retrieving site codes
First, import the required packages and the utilities module.

In [3]:
import dataretrieval.nwis as nwis
import geopandas as gpd
from shapely.geometry import Point, box, Polygon, MultiPolygon
import requests
import pandas as pd
import datetime as dt
import earthaccess
from tqdm import tqdm
import sys
sys.path.append('modules/')
from retrieval_utils import get_param_sites, get_all_site_granules, match_granules

Next, we can find active parameter codes using the USGS website. A separate guide for this, a PDF called "Get param codes", is in the GitHub repository.

Then we can define the time frame, state codes, and parameter codes and call the function. Note: all three are required for the function to work.

In [1]:
'''
param_codes = ['32316'] # chla fluorescence
param_codes_str = ','.join(param_codes) 
# state_code = '06' # california
state_codes = [f"{i:02d}" for i in range(1, 57)]
site_types = ['LK', 'ES'] # lakes, estuaries
site_list = get_param_sites(param_codes_str, state_codes, site_types)

print(site_list.head())


'''

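One detail of the disabled cell above is worth noting: `range(1, 57)` also produces `03`, `07`, `14`, `43`, and `52`, which are not assigned to any state or to DC in the FIPS state code list. Whether `get_param_sites` tolerates such codes is not shown here, so the following is only a minimal sketch of building the list without them; the name `UNASSIGNED_FIPS` is introduced for illustration.

```python
# FIPS codes between 01 and 56 that are not assigned to a state or DC
# (UNASSIGNED_FIPS is a hypothetical helper name, not part of retrieval_utils)
UNASSIGNED_FIPS = {3, 7, 14, 43, 52}

# Zero-padded two-digit codes for the 50 states plus DC
state_codes = [f"{i:02d}" for i in range(1, 57) if i not in UNASSIGNED_FIPS]

print(len(state_codes))
print(state_codes[:6])
```

Skipping the unassigned codes would also avoid five requests that cannot return sites, assuming the function queries one state at a time.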

In [3]:

spec_df = pd.read_csv('data/sites_spectra_app2.csv', dtype={'ID': str})

def fix_id(id_value):
    try:
        # Check if the ID contains 'E' or 'e' indicating scientific notation
        if 'E' in id_value.upper():
            # Convert the scientific notation string to a float, then to an integer, then back to a string
            id_fixed = str(int(float(id_value)))
            return id_fixed
        else:
            return id_value  # Return the ID as is if it's not in scientific notation
    except Exception as e:
        print(f"Error converting ID {id_value}: {e}")
        return id_value  # Return the original value if conversion fails

# Apply the function to the 'ID' column
spec_df['ID'] = spec_df['ID'].apply(fix_id)

# Verify the IDs after conversion
print(spec_df.head())

                          Category               ID   Latitude   Longitude  \
0  BLUE RIVER LAKE NEAR BLUE RIVER  441022000000000  44.173194 -122.324222   
1  BLUE RIVER LAKE NEAR BLUE RIVER  441022000000000  44.173194 -122.324222   
2  BLUE RIVER LAKE NEAR BLUE RIVER  441022000000000  44.173194 -122.324222   
3  BLUE RIVER LAKE NEAR BLUE RIVER  441022000000000  44.173194 -122.324222   
4  BLUE RIVER LAKE NEAR BLUE RIVER  441022000000000  44.173194 -122.324222   

                      Date  Band  wavelength   fwhm  reflectance  \
0  2023-08-12 22:32:34 UTC  B001     381.006  8.415     0.020886   
1  2023-08-12 22:32:34 UTC  B002     388.409  8.415     0.021061   
2  2023-08-12 22:32:34 UTC  B003     395.816  8.415     0.022508   
3  2023-08-12 22:32:34 UTC  B004     403.225  8.415     0.025560   
4  2023-08-12 22:32:34 UTC  B005     410.638  8.417     0.027297   

   good_wavelengths        elev  
0               1.0  369.370332  
1               1.0  369.370332  
2               1.0 
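The IDs in the output above end in a long run of zeros, which is the pattern you would expect if the CSV stored the 15-digit site number in scientific notation (for example after a spreadsheet round trip; that cause is an assumption). Expanding the notation gives a plain digit string, but it cannot restore digits that were already dropped. The sketch below shows the same expansion using `Decimal` instead of `float`; `expand_sci_id` is a hypothetical name and the input string is illustrative.

```python
from decimal import Decimal, InvalidOperation

def expand_sci_id(id_value: str) -> str:
    """Expand an ID written in scientific notation; return other IDs unchanged."""
    if "E" not in id_value.upper():
        return id_value
    try:
        # Decimal parses the string exactly, with no binary floating point step
        return str(int(Decimal(id_value)))
    except (InvalidOperation, ValueError):
        # Not a number after all: keep the original value
        return id_value

print(expand_sci_id("4.41022E+14"))  # only six significant digits survive
print(expand_sci_id("11455508"))     # plain IDs pass through untouched
```

If the original site numbers are needed for matching against USGS records, they would have to be recovered from the source data rather than from this file.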

In [15]:
spec_df = pd.read_csv('data/site_spectra_app2.csv', dtype={'ID': str})

grouped = spec_df.groupby(['ID', 'Date'])

scenes = []

for (site_id, date), group in grouped:
    # Collect spectral data into a list of dictionaries
    spectral_data = group[['Band', 'wavelength', 'reflectance']].to_dict('records')
    
    # Create a dictionary for the scene
    scene = {
        'site_no': site_id,
        'datetime': date,
        'station_nm': group['Category'].iloc[0],
        'lat': group['Latitude'].iloc[0],
        'lon': group['Longitude'].iloc[0],
        'spectra': spectral_data
    }
    
    scenes.append(scene)

scenes_df = pd.DataFrame(scenes)

scenes_df['datetime'] = pd.to_datetime(scenes_df['datetime'])
scenes_df['datetime'] = scenes_df['datetime'].dt.strftime('%Y-%m-%dT%H:%M:%S.000Z')

# Count the unique sites
print(len(scenes_df['site_no'].unique()))



184
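To make the reshaping in the cell above concrete, here is a small dependency-free sketch of the same idea: long-format rows (one per band) are collapsed into one record per site and acquisition time, and the timestamp is rewritten in the `YYYY-MM-DDTHH:MM:SS.000Z` form. The first two rows reuse values from the output shown earlier; the third row and the helper name `to_iso_z` are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime

rows = [
    {"ID": "441022000000000", "Date": "2023-08-12 22:32:34 UTC", "Band": "B001", "reflectance": 0.020886},
    {"ID": "441022000000000", "Date": "2023-08-12 22:32:34 UTC", "Band": "B002", "reflectance": 0.021061},
    # Made-up second site, only to show that a new (ID, Date) pair starts a new scene
    {"ID": "00000000", "Date": "2023-08-13 10:00:00 UTC", "Band": "B001", "reflectance": 0.03},
]

def to_iso_z(raw: str) -> str:
    """Convert 'YYYY-MM-DD HH:MM:SS UTC' to 'YYYY-MM-DDTHH:MM:SS.000Z'."""
    parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S UTC")
    return parsed.strftime("%Y-%m-%dT%H:%M:%S.000Z")

# Collect the per-band records under their (site, time) key
grouped = defaultdict(list)
for row in rows:
    grouped[(row["ID"], row["Date"])].append(
        {"Band": row["Band"], "reflectance": row["reflectance"]}
    )

# One scene per (site, time) pair, with the spectrum nested inside
scenes = [
    {"site_no": site_id, "datetime": to_iso_z(date), "spectra": spectra}
    for (site_id, date), spectra in grouped.items()
]

print(len(scenes))
print(scenes[0]["datetime"])
```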


### 2. Retrieving granules based on site locations
Now that we have the site list, we can use the site coordinates to search for matching granules.

Next, set up the granule search and call the function.

In [10]:
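# start_date and end_date are 'YYYY-MM-DD' strings that must be defined before this cell runs
start_date_dt = dt.datetime.strptime(start_date, '%Y-%m-%d')
end_date_dt = dt.datetime.strptime(end_date, '%Y-%m-%d')
# Build the 'start,end' temporal string used for the granule search
dt_format = '%Y-%m-%dT%H:%M:%SZ'
temporal_str = start_date_dt.strftime(dt_format) + ',' + end_date_dt.strftime(dt_format)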


site_granules = get_all_site_granules(site_list, temporal_str)
df_granules = pd.DataFrame(site_granules)
print(df_granules.head())

Processing sites: 100%|███████████████████████████| 5/5 [00:05<00:00,  1.05s/it]

    site_no                                      station_nm     site_lat  \
0  11455508  SUISUN BAY A VAN SICKLE ISLAND NR PITTSBURG CA  38.04953056   
1  11455508  SUISUN BAY A VAN SICKLE ISLAND NR PITTSBURG CA  38.04953056   
2  11455508  SUISUN BAY A VAN SICKLE ISLAND NR PITTSBURG CA  38.04953056   
3  11455508  SUISUN BAY A VAN SICKLE ISLAND NR PITTSBURG CA  38.04953056   
4  11455508  SUISUN BAY A VAN SICKLE ISLAND NR PITTSBURG CA  38.04953056   

     site_lon                                       granule_urls  \
0  -121.88755  [https://data.lpdaac.earthdatacloud.nasa.gov/l...   
1  -121.88755  [https://data.lpdaac.earthdatacloud.nasa.gov/l...   
2  -121.88755  [https://data.lpdaac.earthdatacloud.nasa.gov/l...   
3  -121.88755  [https://data.lpdaac.earthdatacloud.nasa.gov/l...   
4  -121.88755  [https://data.lpdaac.earthdatacloud.nasa.gov/l...   

                   datetime  
0  2023-03-27T23:01:16.000Z  
1  2023-05-27T22:53:15.000Z  
2  2023-08-07T18:27:32.000Z  
3  2023-08-14T
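The search cell above parses `end_date` at midnight, so the temporal window closes at `00:00:00` on the end date and granules acquired later that day fall outside it. Whether that is intended is not stated. As a sketch under that assumption, the variant below extends the window to the last second of the end date; the two date values are hypothetical placeholders, not the ones used for the results shown here.

```python
import datetime as dt

# Hypothetical search window; substitute the dates you actually need
start_date = "2023-01-01"
end_date = "2023-12-31"

dt_format = "%Y-%m-%dT%H:%M:%SZ"
start_dt = dt.datetime.strptime(start_date, "%Y-%m-%d")
# Move the end of the window from 00:00:00 to 23:59:59 on the end date
end_dt = dt.datetime.strptime(end_date, "%Y-%m-%d") + dt.timedelta(days=1, seconds=-1)

temporal_str = start_dt.strftime(dt_format) + "," + end_dt.strftime(dt_format)
print(temporal_str)
```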




In [11]:
print(len(df_granules))

45


In [16]:
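# Note: dtype={'ID': str} does not apply to the 'site_no' column used below,
# so site_no is parsed as int64 (see the output) and loses any leading zeros.
site_list = pd.read_csv('data/chla_sites2.csv', dtype={'ID': str})

scenes_df['site_no'] = scenes_df['site_no'].astype(str)
# Comparing string IDs against integer IDs never matches
matching_ids = scenes_df['site_no'].isin(site_list['site_no'])
# This counts matching rows, while total_ids counts unique IDs
num_matching_ids = matching_ids.sum()
total_ids = len(scenes_df['site_no'].unique())

print(f"Number of matching IDs: {num_matching_ids} out of {total_ids}")

# Optionally, list IDs that do not match
non_matching_ids = scenes_df.loc[~matching_ids, 'site_no'].unique()
print("IDs in scenes_df not found in site_list_clean:")
# Note: the next line prints the boolean match mask rather than non_matching_ids
print(matching_ids)
print(site_list['site_no'])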

Number of matching IDs: 0 out of 184
IDs in scenes_df not found in site_list_clean:
0       False
1       False
2       False
3       False
4       False
        ...  
1552    False
1553    False
1554    False
1555    False
1556    False
Name: site_no, Length: 1557, dtype: bool
0             72632996
1              7362591
2             11173200
3             11273400
4             11312676
            ...       
191            4084500
192            4085059
193           40851385
194            5545750
195    465130091060701
Name: site_no, Length: 196, dtype: int64
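The zero-match result above is what a type mismatch would produce: `scenes_df['site_no']` holds strings, while `site_list['site_no']` is `int64`, and `isin` does not treat `'11173200'` and `11173200` as equal. The sketch below reproduces the effect on a made-up two-row CSV and shows that reading the column as a string restores the matches and the leading zeros. The CSV content and scene IDs are invented for illustration.

```python
import io
import pandas as pd

# Made-up stand-in for the site list file
csv_text = "site_no,station_nm\n07362591,SITE A\n11173200,SITE B\n"

as_int = pd.read_csv(io.StringIO(csv_text))                           # site_no inferred as integer
as_str = pd.read_csv(io.StringIO(csv_text), dtype={"site_no": str})   # site_no kept as text

# Made-up scene IDs, stored as strings like scenes_df['site_no']
scene_ids = pd.Series(["07362591", "11173200", "99999999"], name="site_no")

mask_int = scene_ids.isin(as_int["site_no"])  # string vs integer comparison
mask_str = scene_ids.isin(as_str["site_no"])  # string vs string comparison

matched = scene_ids[mask_str].unique()
unmatched = scene_ids[~mask_str].unique()
print(f"int column: {mask_int.sum()} matching rows")
print(f"str column: {len(matched)} of {scene_ids.nunique()} unique IDs matched")
print("unmatched:", list(unmatched))
```

Counting unique matched IDs, as in the last lines, keeps the numerator and denominator of the summary on the same basis. Even with matching types, IDs that lost digits earlier in the pipeline would still fail to match.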


In [None]:
results = match_granules(scenes_df, ['32316'])
print(len(results))
results.to_csv('data/results_chla_app2.csv', index=False)