# Notebook 6 - Using Total Sky Imager data to pick GOES cloud thresholds

Using the Friday Harbor Labs camera data, we found cloud thresholds based on visible reflectivity (green) and infrared (red). These thresholds performed well over the San Juan Islands but struggled significantly in mountainous areas with snow and glaciers. The decision tree did not use near IR (blue) because it is unnecessary for predicting cloud cover over ocean and forest. To find a threshold combination that works for snowy scenes and correctly distinguishes clouds from snow, we will use Total Sky Imager (TSI) data from Kettle Ponds, Colorado.

Methods:
1. Process and load TSI data
    - need to ID what spatial domain TSI is looking at to select the correct GOES pixels
2. Load in GOES data for the East River and adjacent environs for Dec-Feb (high snow cover) and Jul-Sep (low/no snow cover)
3. Run the decision tree to find thresholds
    - add depth/branches to the tree to ensure it is using near IR as well as IR and visible
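
Step 1 asks which spatial domain the TSI sees. As a rough sketch, for a flat cloud layer at height h above the instrument, a cloud seen at zenith angle θ lies a horizontal distance h·tan(θ) away. The cloud-base height, maximum zenith angle, and site coordinates below are illustrative assumptions, not values taken from the TSI or from this notebook, and the helper names are hypothetical.

```python
import math

KM_PER_DEG_LAT = 111.32  # approximate length of one degree of latitude


def tsi_footprint_radius_km(cloud_base_km, max_zenith_deg):
    """Horizontal radius (km) of sky seen at cloud-base height, flat-layer geometry."""
    return cloud_base_km * math.tan(math.radians(max_zenith_deg))


def footprint_box(lat_deg, lon_deg, radius_km):
    """Approximate lat/lon bounding box around a site for a given radius."""
    dlat = radius_km / KM_PER_DEG_LAT
    dlon = radius_km / (KM_PER_DEG_LAT * math.cos(math.radians(lat_deg)))
    return (lat_deg - dlat, lat_deg + dlat), (lon_deg - dlon, lon_deg + dlon)


# Hypothetical inputs: 2 km cloud base, usable view out to 70 degrees from zenith,
# and placeholder site coordinates (not the surveyed TSI location).
radius = tsi_footprint_radius_km(cloud_base_km=2.0, max_zenith_deg=70.0)
lat_bounds, lon_bounds = footprint_box(lat_deg=38.94, lon_deg=-106.97, radius_km=radius)
print(f"radius ~ {radius:.1f} km, lat {lat_bounds}, lon {lon_bounds}")
```

Because the radius scales with cloud-base height, low clouds map to only a few GOES pixels while high clouds cover a much wider box, so a single fixed box is a compromise.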

In [20]:
import xarray as xr
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import classification_report, accuracy_score


### Section 1 - Process TSI data

In [21]:
tsi_ds = xr.open_dataset('/storage/cdalden/goes/surface_obs/total_sky_imager/tsi_skycover_20210901_20230616.nc')

# Create a new cloud binary variable
tsi_ds['cloud_binary'] = xr.where(
    tsi_ds['percent_opaque'] > 0.75, 1,  # Cloudy: Set to 1 when > 0.75
    xr.where(
        (tsi_ds['percent_opaque'] >= 0) & (tsi_ds['percent_opaque'] < 0.25), 0,  # Clear: Set to 0 when >= 0 and < 0.25
        np.nan  # Otherwise, set to NaN (nighttime and mixed cloud cover)
    )
)

tsi_clouds_df = tsi_ds['cloud_binary'].to_dataframe()

tsi_clouds_df

time,cloud_binary
2021-09-01 00:00:00,1.0
2021-09-01 00:00:30,1.0
2021-09-01 00:01:00,1.0
2021-09-01 00:01:30,1.0
2021-09-01 00:02:00,1.0
...,...
2023-06-16 15:57:30,1.0
2023-06-16 15:58:00,1.0
2023-06-16 15:58:30,1.0
2023-06-16 15:59:00,1.0
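
Every visible row above is labeled cloudy, which makes it worth checking the units of `percent_opaque`. The thresholds 0.25 and 0.75 assume a 0-1 fraction; if the variable is stored as 0-100 percent, nearly every nonzero value would be labeled cloudy. The sketch below uses synthetic values and a hypothetical helper, `to_cloud_binary`, to show the same three-way labeling with an explicit scale. The real units should be read from the dataset attributes.

```python
import numpy as np


def to_cloud_binary(opaque, full_scale=1.0, clear_max=0.25, cloudy_min=0.75):
    """Label sky cover as 1 (cloudy), 0 (clear), or NaN (mixed, missing, or negative).

    `full_scale` is the value meaning fully overcast: 1.0 for fractions, 100.0 for percent.
    """
    frac = np.asarray(opaque, dtype=float) / full_scale
    labels = np.full(frac.shape, np.nan)
    labels[(frac >= 0) & (frac < clear_max)] = 0.0
    labels[frac > cloudy_min] = 1.0
    return labels


# Synthetic sky-cover values on a 0-100 scale (illustrative only)
percent = np.array([0.0, 10.0, 50.0, 90.0, np.nan, -100.0])

as_fraction = to_cloud_binary(percent)                   # wrong scale assumed
as_percent = to_cloud_binary(percent, full_scale=100.0)  # matching scale
print(as_fraction)
print(as_percent)
```

With the wrong scale, the 10% and 50% samples are both labeled cloudy; with the matching scale, 10% is clear and 50% is dropped as mixed.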


In [22]:
# Subset to Nov-Dec 2022
tsi_ds_2022 = tsi_ds.sel(time=slice('2022-11-01', '2022-12-31'))
# tsi_ds_2022.percent_opaque.plot()

### Section 2 - Compare to GOES pixels

In [34]:
goes_path = '/storage/cdalden/goes/colorado/goes16/rgb_composite/'
goes_file = 'goes16_C02_C05_C13_rgb_colorado_20230101.nc'
goes_ds = xr.open_dataset(goes_path + goes_file, engine='netcdf4')
# Select pixels with latitude between 38.904 and 39.065 and longitude between -107.08 and -106.993
# goes_ds = goes_ds.sel(
#     latitude=slice(38.904, 39.065),
#     longitude=slice(-107.08, -106.993)
# )
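
The commented-out selection above only returns pixels if the slice bounds run in the same direction as the coordinate values; with a latitude axis stored north-to-south, `slice(38.904, 39.065)` selects nothing. A minimal sketch of an order-aware box selection follows, on a synthetic grid. `select_box` is a hypothetical helper, and it assumes `latitude` and `longitude` are 1-D, monotonic dimension coordinates.

```python
import numpy as np
import xarray as xr


def select_box(ds, lat_bounds, lon_bounds, lat_name="latitude", lon_name="longitude"):
    """Select a lat/lon box regardless of whether each axis ascends or descends."""
    indexers = {}
    for name, bounds in ((lat_name, lat_bounds), (lon_name, lon_bounds)):
        lo, hi = sorted(bounds)
        values = ds[name].values
        ascending = values[0] <= values[-1]
        indexers[name] = slice(lo, hi) if ascending else slice(hi, lo)
    return ds.sel(indexers)


# Synthetic grid with latitude stored north-to-south (descending)
lat = np.linspace(40.0, 38.0, 201)
lon = np.linspace(-108.0, -106.0, 201)
demo = xr.Dataset(
    {"red": (("latitude", "longitude"), np.random.default_rng(0).random((201, 201)))},
    coords={"latitude": lat, "longitude": lon},
)

box = select_box(demo, (38.904, 39.065), (-107.08, -106.993))
print(box.sizes)
```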

In [37]:
zarr_path = '/storage/cdalden/goes/colorado/goes16/C02/'
zarr_file = 'goes16_C02_colorado_20230109.zarr'
zarr_ds = xr.open_zarr(zarr_path + zarr_file, consolidated=True)
zarr_ds

Output (dask array summaries):
Bytes,Shape,Chunk shape,Chunks,Data type
679.22 MiB,"(288, 640, 966)","(36, 80, 242)",256,float32
1.33 GiB,"(288, 640, 966)","(36, 80, 121)",512,int64
679.22 MiB,"(288, 640, 966)","(36, 80, 242)",256,float32
1.33 GiB,"(288, 640, 966)","(36, 80, 121)",512,float64
1.33 GiB,"(288, 640, 966)","(36, 80, 121)",512,float64
679.22 MiB,"(288, 640, 966)","(36, 80, 242)",256,float32


In [35]:
goes_ds

In [30]:
# Compute spatial averages for red, green, and blue bands over x and y dimensions
spatial_avg = goes_ds[['red', 'green', 'blue']].mean(dim=['latitude', 'longitude'])

goes_uerw_pixels_df = spatial_avg.to_dataframe()
goes_uerw_pixels_df

t,red,green,blue
2023-01-01 00:02:30,,,
2023-01-01 00:07:30,,,
2023-01-01 00:12:30,,,
2023-01-01 00:17:30,,,
2023-01-01 00:22:30,,,
...,...,...,...
2023-01-01 23:37:30,,,
2023-01-01 23:42:30,,,
2023-01-01 23:47:30,,,
2023-01-01 23:52:30,,,
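
Every spatial average shown above is NaN. Since xarray's `mean` skips NaNs by default for float data, an all-NaN mean implies that every pixel of that band was NaN at that timestep. A minimal sketch for checking how much valid data each band holds per timestep follows, on synthetic data; `valid_fraction` is a hypothetical helper and the dimension names are assumed to match this notebook.

```python
import numpy as np
import pandas as pd
import xarray as xr


def valid_fraction(ds, bands=("red", "green", "blue"), dims=("latitude", "longitude")):
    """Fraction of non-NaN pixels per band and timestep."""
    return ds[list(bands)].notnull().mean(dim=list(dims))


times = pd.date_range("2023-01-01 00:02:30", periods=3, freq="5min")
data = np.random.default_rng(0).random((3, 4, 5))
data[0] = np.nan         # first timestep entirely missing
data[1, :2, :] = np.nan  # second timestep half missing
demo = xr.Dataset(
    {band: (("t", "latitude", "longitude"), data.copy()) for band in ("red", "green", "blue")},
    coords={"t": times},
)

frac = valid_fraction(demo)
print(frac.to_dataframe())
```

Timesteps with a valid fraction of zero can then be traced back to their cause (for example nighttime scans for reflective bands, or a file whose bands were never populated) before they reach the merge.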


### Section 3 - Align GOES and TSI data

In [25]:
# Perform an asof merge to pair each GOES timestamp with the nearest TSI timestamp
goes_tsi_df = pd.merge_asof(
    goes_uerw_pixels_df,  # Left dataframe (GOES)
    tsi_clouds_df,  # Right dataframe (TSI)
    left_index=True,  # Use the index (t) for GOES
    right_index=True,  # Use the index (time) for TSI
    direction='nearest'  # Match to the nearest TSI timestamp
)

# Drop rows with NaN values in 'cloud_binary' or any of the RGB columns
goes_tsi_df.dropna(subset=['cloud_binary', 'red', 'green', 'blue'], inplace=True)
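
With `direction='nearest'` and no `tolerance`, `merge_asof` pairs every GOES scan with some TSI sample, however far apart in time the two are. As a sketch, a `tolerance` limits matches to samples within a chosen window; the one-minute window and the synthetic timestamps below are assumptions for illustration.

```python
import pandas as pd

goes = pd.DataFrame(
    {"red": [0.10, 0.20, 0.30]},
    index=pd.DatetimeIndex(
        ["2023-01-01 18:02:30", "2023-01-01 18:07:30", "2023-01-01 18:12:30"], name="t"
    ),
)
tsi = pd.DataFrame(
    {"cloud_binary": [1.0, 0.0, 1.0]},
    index=pd.DatetimeIndex(
        ["2023-01-01 18:02:30", "2023-01-01 18:07:00", "2023-01-01 18:20:00"], name="time"
    ),
)

# Both indexes must be sorted for merge_asof; sort_index() makes that explicit.
aligned = pd.merge_asof(
    goes.sort_index(),
    tsi.sort_index(),
    left_index=True,
    right_index=True,
    direction="nearest",
    tolerance=pd.Timedelta("1min"),  # assumed window; unmatched rows get NaN
)
matched = aligned.dropna(subset=["cloud_binary"])
print(matched)
```

The third GOES scan has no TSI sample within one minute, so it receives NaN and is dropped rather than inheriting a label from several minutes away.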

In [26]:
goes_tsi_df

t,red,green,blue,cloud_binary


### Section 4 - Create Decision Tree to find thresholds 

In [27]:
# Assumes goes_tsi_df (the merged GOES/TSI dataframe) has already been built
# Separate features and target
X = goes_tsi_df[['red', 'green', 'blue']]  # Features
y = goes_tsi_df['cloud_binary']  # Target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the decision tree classifier
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))

# Visualize the decision tree
plt.figure(figsize=(12, 8))
plot_tree(clf, feature_names=['red', 'green', 'blue'], class_names=['0', '1'], filled=True)
plt.show()

ValueError: With n_samples=0, test_size=0.2 and train_size=None, the resulting train set will be empty. Adjust any of the aforementioned parameters.
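
The `ValueError` above arises because `goes_tsi_df` has no rows by the time it reaches `train_test_split`. A minimal sketch of the same workflow follows on synthetic data, with an explicit empty-input check, a `max_depth` cap so the tree stays readable, and `export_text` to print the split thresholds. The synthetic band values, the helper name `fit_threshold_tree`, and `max_depth=3` are illustrative assumptions, not results.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

FEATURES = ["red", "green", "blue"]


def fit_threshold_tree(df, max_depth=3, random_state=42):
    """Fit a shallow decision tree; raise a clear error if no samples remain."""
    if df.empty:
        raise ValueError("No aligned GOES/TSI samples; check the merge and NaN filtering.")
    X_train, X_test, y_train, y_test = train_test_split(
        df[FEATURES], df["cloud_binary"], test_size=0.2, random_state=random_state
    )
    clf = DecisionTreeClassifier(max_depth=max_depth, random_state=random_state)
    clf.fit(X_train, y_train)
    return clf, clf.score(X_test, y_test)


# Synthetic data: the label depends only on the 'blue' column (purely illustrative)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((200, 3)), columns=FEATURES)
demo["cloud_binary"] = (demo["blue"] > 0.5).astype(float)

clf, test_accuracy = fit_threshold_tree(demo)
print(export_text(clf, feature_names=FEATURES))  # text view of the split thresholds
print("Test accuracy:", test_accuracy)
```

The text export lists each split as a `feature <= threshold` rule, which is a convenient way to read candidate thresholds directly instead of from the plotted tree.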