## USGS and EMIT Data Matchup
In this notebook we search the USGS database by state code and parameter code(s) to retrieve a list of monitoring sites. We then use the site coordinates to find matching EMIT granules and gather USGS data around each granule's time stamp.

### 1. Retrieving site codes
First, import the required packages and the helper functions from the utils file.

In [1]:
import dataretrieval.nwis as nwis
import geopandas as gpd
from shapely.geometry import Point, box, Polygon, MultiPolygon
import requests
import pandas as pd
import datetime as dt
import earthaccess
from tqdm import tqdm
import sys
sys.path.append('modules/')
from retrieval_utils import get_param_sites, get_all_site_granules, match_granules

Next, we can look up active parameter codes on the USGS website. A separate guide to this, a PDF called "Get param codes", is available in the GitHub repository.

Then we define the time frame, state codes, and parameter codes and call `get_param_sites`. Note: all three are required for the function to work.

In [2]:
param_codes = ['32315'] # chla fluorescence
param_codes_str = ','.join(param_codes) 
state_codes = [f"{i:02d}" for i in range(1, 57)]
start_date = '2022-01-01'
end_date = '2024-10-03'

site_list = get_param_sites(param_codes_str, state_codes, start_date, end_date)
print(site_list.head())
print(len(site_list))

Processing sites: 100%|█████████████████████████| 56/56 [01:52<00:00,  2.00s/it]

196
     site_no                                         station_nm   dec_lat_va  \
0  072632996   Lk Maumelle Raw Water Intake nr Natural Steps,AR  34.85194444   
1   07362591  Alum Fork Saline River at Winona Dam at Reform...  34.79777778   
2   11173200                        ARROYO HONDO NR SAN JOSE CA  37.46160337   
3   11273400             SAN JOAQUIN R AB MERCED R NR NEWMAN CA   37.3472151   
4   11312676                        MIDDLE R AT MIDDLE RIVER CA  37.94226944   

    dec_long_va  
0   -92.4891667  
1   -92.8455556  
2   -121.769397  
3  -120.9761777  
4     -121.5337  
196





In [3]:
site_list.to_csv('data/chla_sites2.csv', index=False)
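
A note on the state list used above: `range(1, 57)` produces 56 two-digit codes, which matches the 56 iterations in the progress bar. Assuming the service interprets these as FIPS state codes, five of them (03, 07, 14, 43, 52) are not assigned to any state or to DC, so those requests presumably return no sites. The sketch below builds the list without them. The helper name `us_state_fips_codes` is hypothetical and not part of `retrieval_utils`.

```python
# FIPS state codes in 01-56 that are not assigned to a state or DC.
UNASSIGNED_FIPS = {3, 7, 14, 43, 52}


def us_state_fips_codes():
    """Return zero-padded FIPS codes for the 50 states plus DC."""
    return [f"{i:02d}" for i in range(1, 57) if i not in UNASSIGNED_FIPS]


valid_state_codes = us_state_fips_codes()
print(len(valid_state_codes))  # 51
```

Territories such as Puerto Rico use codes above 56, so they are excluded by both this list and the original range.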

### 2. Retrieving granules based on site locations
Now that we have the site list, we can use the site coordinates to search for matching granules.

Next, set up the granule search and call the function.
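
The search window is passed as a single string holding two UTC timestamps separated by a comma. As a minimal sketch, the formatting can be wrapped in a small helper that also rejects a reversed date range. `build_temporal_str` is a hypothetical name and is not part of `retrieval_utils`.

```python
import datetime as dt

DT_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def build_temporal_str(start_date, end_date):
    """Return 'start,end' as UTC timestamps from two 'YYYY-MM-DD' strings."""
    start = dt.datetime.strptime(start_date, "%Y-%m-%d")
    end = dt.datetime.strptime(end_date, "%Y-%m-%d")
    if end < start:
        raise ValueError(f"end_date {end_date} is before start_date {start_date}")
    return f"{start.strftime(DT_FORMAT)},{end.strftime(DT_FORMAT)}"


window_example = build_temporal_str("2022-01-01", "2024-10-03")
print(window_example)  # 2022-01-01T00:00:00Z,2024-10-03T00:00:00Z
```

Because the dates are parsed without a time of day, the end timestamp is midnight at the start of the end date, so granules acquired later on that day fall outside the window.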

In [4]:

start_date_dt = dt.datetime.strptime(start_date, '%Y-%m-%d')
end_date_dt = dt.datetime.strptime(end_date, '%Y-%m-%d')
dt_format = '%Y-%m-%dT%H:%M:%SZ'
temporal_str = start_date_dt.strftime(dt_format) + ',' + end_date_dt.strftime(dt_format)

site_granules = get_all_site_granules(site_list, temporal_str)
df_granules = pd.DataFrame(site_granules)
print(df_granules.head())
print(len(df_granules))

Processing sites: 100%|███████████████████████| 196/196 [04:12<00:00,  1.29s/it]

     site_no                                         station_nm     site_lat  \
0  072632996   Lk Maumelle Raw Water Intake nr Natural Steps,AR  34.85194444   
1  072632996   Lk Maumelle Raw Water Intake nr Natural Steps,AR  34.85194444   
2  072632996   Lk Maumelle Raw Water Intake nr Natural Steps,AR  34.85194444   
3   07362591  Alum Fork Saline River at Winona Dam at Reform...  34.79777778   
4   11173200                        ARROYO HONDO NR SAN JOSE CA  37.46160337   

      site_lon                                       granule_urls  \
0  -92.4891667  [https://data.lpdaac.earthdatacloud.nasa.gov/l...   
1  -92.4891667  [https://data.lpdaac.earthdatacloud.nasa.gov/l...   
2  -92.4891667  [https://data.lpdaac.earthdatacloud.nasa.gov/l...   
3  -92.8455556  [https://data.lpdaac.earthdatacloud.nasa.gov/l...   
4  -121.769397  [https://data.lpdaac.earthdatacloud.nasa.gov/l...   

                   datetime  
0  2024-01-28T19:13:25.000Z  
1  2024-01-28T19:13:37.000Z  
2  2024-02-26T




### 3. Collecting and matching data based on granule times

Next, we use the granule times and locations to collect and match the USGS data.
For each granule, the function selects the USGS measurement closest in time to the granule, provided it falls within the time window.
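
To make the matching rule concrete, here is a minimal standard-library sketch of nearest-in-time matching. It is not the code inside `match_granules`, which lives in `retrieval_utils` and may differ. The names `parse_utc` and `closest_within_window`, the `window` parameter, and the observation values are all assumptions made for illustration.

```python
import datetime as dt
from bisect import bisect_left


def parse_utc(stamp):
    """Parse e.g. '2024-01-28T19:13:25.000Z' into a timezone-aware UTC datetime."""
    parsed = dt.datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=dt.timezone.utc)


def closest_within_window(granule_time, observations, window):
    """Return the (time, value) pair nearest to granule_time, or None.

    observations must be sorted by time; window is a timedelta (assumed parameter).
    """
    times = [t for t, _ in observations]
    i = bisect_left(times, granule_time)
    # Only the neighbours on either side of the insertion point can be nearest.
    candidates = observations[max(i - 1, 0): i + 1]
    if not candidates:
        return None
    best = min(candidates, key=lambda obs: abs(obs[0] - granule_time))
    if abs(best[0] - granule_time) > window:
        return None
    return best


# Made-up observations, for illustration only.
granule = parse_utc("2024-01-28T19:13:25.000Z")
obs = [
    (parse_utc("2024-01-28T19:00:00.000Z"), 1.0),
    (parse_utc("2024-01-28T19:15:00.000Z"), 2.0),
    (parse_utc("2024-01-28T19:30:00.000Z"), 3.0),
]
match = closest_within_window(granule, obs, dt.timedelta(minutes=30))
print(match[1])  # 2.0
```

Returning `None` when nothing falls inside the window keeps unmatched granules explicit instead of pairing them with a distant measurement.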

Call `match_granules` and optionally store the results as a CSV file.

In [5]:
results = match_granules(df_granules, param_codes)
print(results.head())
print(len(results))
#results.to_csv('results.csv', index=False)

Processing granules: 100%|██████████████████| 1945/1945 [14:22<00:00,  2.26it/s]

     site_no                                        station_nm     site_lat  \
0  072632996  Lk Maumelle Raw Water Intake nr Natural Steps,AR  34.85194444   
1  072632996  Lk Maumelle Raw Water Intake nr Natural Steps,AR  34.85194444   
2  072632996  Lk Maumelle Raw Water Intake nr Natural Steps,AR  34.85194444   
3   11273400            SAN JOAQUIN R AB MERCED R NR NEWMAN CA   37.3472151   
4   11273400            SAN JOAQUIN R AB MERCED R NR NEWMAN CA   37.3472151   

       site_lon              granule_time  \
0   -92.4891667 2024-01-28 19:13:25+00:00   
1   -92.4891667 2024-01-28 19:13:37+00:00   
2   -92.4891667 2024-02-26 15:23:32+00:00   
3  -120.9761777 2022-08-10 17:42:13+00:00   
4  -120.9761777 2022-08-14 16:05:05+00:00   

                                        granule_urls result result_unit  \
0  [https://data.lpdaac.earthdatacloud.nasa.gov/l...   3.62         RFU   
1  [https://data.lpdaac.earthdatacloud.nasa.gov/l...   3.62         RFU   
2  [https://data.lpdaac.earth


