## Using manually classified timelapse camera data to find GOES multi-spectral thresholds

We want to find a combination of thresholds from the GOES Day Cloud Phase RGB bands that will help predict cloud/no cloud for all pixels over Western Washington. We have manually identified cloud types for daylight hours between 13 July 2022 and 30 September 2022 for Friday Harbor and Cattle Point.

#### Methods:
1. Create RGB composite files for all the desired dates
2. Match the RGB composites for the pixel just east of the cameras to the corresponding camera timesteps
    - GOES data are at 5-minute resolution, but the camera photos are every 30 minutes
    - one pixel east because the cameras look east towards Lopez and Shaw islands, ~1-2 km away
3. Only keep times with blue sky only, or with 2 or fewer cloud types
    - it gets too messy when there is patchy cloud cover or many types of clouds
4. Drop the time index and combine both sites' cloud data with their corresponding GOES pixels
5. Use a decision tree to find the thresholds
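
Step 2 comes down to an inner join on a shared time index. Below is a minimal sketch on synthetic data, assuming both tables already carry the same timezone-aware index; the `green` values and cloud labels are made up for illustration.

```python
import pandas as pd

# Synthetic 5-minute "GOES" series covering one hour (values are placeholders)
goes_idx = pd.date_range('2022-07-13 06:00', periods=13, freq='5min',
                         tz='America/Los_Angeles')
goes = pd.DataFrame({'green': range(13)}, index=goes_idx)

# Synthetic 30-minute "camera" labels over the same hour
cam_idx = pd.date_range('2022-07-13 06:00', periods=3, freq='30min',
                        tz='America/Los_Angeles')
camera = pd.DataFrame({'clouds': ['b', 's', 'sm']}, index=cam_idx)

# The inner join keeps only GOES timesteps that have a camera label
matched = goes.join(camera, how='inner')
print(matched)
```

Only the 06:00, 06:30, and 07:00 GOES rows survive, one per camera photo.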

#### Next Steps:
1. Redownload GOES data for the same dates but a larger spatial domain
    - see manuscript for domain bounds
2. Create a cloud/no cloud frequency map
3. Compare to the GOES cloud mask, which only uses thermal
    - in theory, the two should differ most over the ocean
    - GOES cloud mask confuses cold water with low clouds
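
One possible shape for the cloud/no cloud frequency map in step 2, sketched with NumPy on a tiny synthetic stack. The thresholds are the rounded values from the decision tree printed later in this notebook, and the `(time, y, x)` array layout and the function name `clear_sky_mask` are assumptions for illustration, not the final method.

```python
import numpy as np

def clear_sky_mask(red, green, green_max=0.12, green_min=0.01, red_max=0.0):
    """Boolean mask that is True where a pixel is classified as clear sky.

    Threshold defaults are the 2-decimal values printed by export_text,
    so they are approximate.
    """
    return (green <= green_max) & (green > green_min) & (red <= red_max)

# Synthetic stack: 4 timesteps, 1 row, 2 columns
green = np.array([[[0.05, 0.05]],
                  [[0.05, 0.05]],
                  [[0.30, 0.05]],
                  [[0.30, 0.05]]])
red = np.zeros_like(green)

clear = clear_sky_mask(red, green)
# Fraction of timesteps classified as cloudy at each pixel
cloud_frequency = 1.0 - clear.mean(axis=0)
print(cloud_frequency)
```

The left pixel is cloudy in half of the timesteps and the right pixel is never cloudy.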


In [1]:
import xarray as xr
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_text


In [12]:
camera_data = pd.read_csv('CP2022photos.csv', index_col=0, parse_dates=True)
camera_data.index = camera_data.index.tz_localize('America/Los_Angeles')
camera_data.index = camera_data.index.round('5min')
camera_data = camera_data.loc['2022-07-13':'2022-09-30']

In [13]:
goes_data_july = {}
goes_df_july = {}
for i in range(13, 32):
    key = f"goes_data_july{i}"  # Dynamically generate the key
    filename = f"/storage/cdalden/goes/washington/goes17/rgb_composite/goes17_C02_C05_C13_rgb_washington_202207{i}.nc"  # Dynamically generate the filename
    goes_data_july[key] = xr.open_dataset(filename)

    # Select the GOES pixel nearest 48.46N, 122.9W
    lat = 48.464462
    lon = -122.9 # offset by ~0.05deg (1 grid cell) to the east, since that is where the camera looks
    # Select data and convert to dataframe
    df_key = f"goes_df_july{i}"
    goes_df_july[df_key] = (
        goes_data_july[key]
        .sel(latitude=lat, longitude=lon, method='nearest')
        .drop_vars(['latitude', 'longitude'])
        .to_dataframe()
    )
# Concatenate all dataframes into a single dataframe
goes_df_july_all = pd.concat(goes_df_july.values(), axis=0)

# Convert goes times to PDT
goes_df_july_all.index = goes_df_july_all.index.tz_localize('UTC').tz_convert('America/Los_Angeles')
goes_df_july_all.index = goes_df_july_all.index.round('5min')

# Subset to daylight hours (06:00 to 19:30 local time)
goes_df_july_all = goes_df_july_all.between_time('06:00', '19:30')
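
A small sketch of the time handling used above, run on three made-up UTC timestamps: localize to UTC, convert to Pacific time, round to 5 minutes, then keep daylight hours. The timestamps and `green` values are illustrative only.

```python
import pandas as pd

# Three synthetic scan times in UTC (naive, as they would come out of the file)
utc_times = pd.DatetimeIndex(['2022-07-13 13:01:17',
                              '2022-07-13 13:06:17',
                              '2022-07-14 03:01:17'])
demo = pd.DataFrame({'green': [0.1, 0.2, 0.3]}, index=utc_times)

# Attach UTC, convert to local time, and snap to the 5-minute grid
demo.index = (demo.index.tz_localize('UTC')
                        .tz_convert('America/Los_Angeles')
                        .round('5min'))

# between_time works on local clock time, so the 20:00 PDT row is dropped
demo_day = demo.between_time('06:00', '19:30')
print(demo_day)
```

Rounding can map two scans onto the same timestamp when they fall in the same 5-minute bin, which is one way duplicate index values can appear in the combined dataframe.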



In [14]:
goes_data_aug = {}
goes_df_aug = {}
for i in range(1, 8):
    key = f"goes_data_aug{i}"  # Dynamically generate the key
    filename = f"/storage/cdalden/goes/washington/goes17/rgb_composite/goes17_C02_C05_C13_rgb_washington_2022080{i}.nc"  # Dynamically generate the filename
    goes_data_aug[key] = xr.open_dataset(filename)

    # Select the GOES pixel nearest 48.46N, 122.9W
    lat = 48.464462
    lon = -122.9 # offset by ~0.05deg (1 grid cell) to the east, since that is where the camera looks
    # Select data and convert to dataframe
    df_key = f"goes_df_aug{i}"
    goes_df_aug[df_key] = (
        goes_data_aug[key]
        .sel(latitude=lat, longitude=lon, method='nearest')
        .drop_vars(['latitude', 'longitude'])
        .to_dataframe()
    )
# Concatenate all dataframes into a single dataframe
goes_df_aug_all = pd.concat(goes_df_aug.values(), axis=0)

# Convert goes times to PDT
goes_df_aug_all.index = goes_df_aug_all.index.tz_localize('UTC').tz_convert('America/Los_Angeles')
goes_df_aug_all.index = goes_df_aug_all.index.round('5min')

# Subset to daylight hours (06:00 to 19:30 local time)
goes_df_aug_all = goes_df_aug_all.between_time('06:00', '19:30')


In [15]:
goes_data_sep = {}
goes_df_sep = {}
for i in range(1, 23):
    # Make i two digits
    day = f"{i:02}"
    key = f"goes_data_sep{day}"  # Dynamically generate the key
    filename = f"/storage/cdalden/goes/washington/goes17/rgb_composite/goes17_C02_C05_C13_rgb_washington_202209{day}.nc"  # Dynamically generate the filename
    goes_data_sep[key] = xr.open_dataset(filename)

    # Select the GOES pixel nearest 48.46N, 122.85W
    lat = 48.464462
    lon = -122.85 # offset by ~0.1deg (2 grid cells) to the east, since that is where the camera looks; note the July and August cells use -122.9
    # Select data and convert to dataframe
    df_key = f"goes_df_sep{day}"
    goes_df_sep[df_key] = (
        goes_data_sep[key]
        .sel(latitude=lat, longitude=lon, method='nearest')
        .drop_vars(['latitude', 'longitude'])
        .to_dataframe()
    )
# Concatenate all dataframes into a single dataframe
goes_df_sep_all = pd.concat(goes_df_sep.values(), axis=0)

# Convert goes times to PDT
goes_df_sep_all.index = goes_df_sep_all.index.tz_localize('UTC').tz_convert('America/Los_Angeles')
goes_df_sep_all.index = goes_df_sep_all.index.round('5min')

# Subset to daylight hours (06:00 to 19:30 local time)
goes_df_sep_all = goes_df_sep_all.between_time('06:00', '19:30')
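
The three loading cells differ only in their date range, so the filenames could come from one helper. A sketch, assuming the path template from the cells above; the helper name `composite_paths` is hypothetical, and actually opening the files is left out.

```python
import pandas as pd

# Path template copied from the loading cells above
TEMPLATE = ("/storage/cdalden/goes/washington/goes17/rgb_composite/"
            "goes17_C02_C05_C13_rgb_washington_{date}.nc")

def composite_paths(start, end):
    """Return one composite filename per day between start and end (inclusive)."""
    days = pd.date_range(start, end, freq='D')
    return [TEMPLATE.format(date=day.strftime('%Y%m%d')) for day in days]

# Same date ranges as the July, August, and September cells
july_paths = composite_paths('2022-07-13', '2022-07-31')
aug_paths = composite_paths('2022-08-01', '2022-08-07')
sep_paths = composite_paths('2022-09-01', '2022-09-22')
print(len(july_paths), len(aug_paths), len(sep_paths))
```

Each path could then be passed to `xr.open_dataset` in a single loop, with the pixel selection done once instead of three times.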

In [16]:
goes_df_sep_all

| t | green | blue | red |
|---|---|---|---|
| 2022-08-31 17:00:00-07:00 | 0.075575 | 0.009040 | 0.0 |
| 2022-08-31 17:10:00-07:00 | 0.073576 | 0.009040 | 0.0 |
| 2022-08-31 17:10:00-07:00 | 0.068380 | 0.004683 | 0.0 |
| 2022-08-31 17:20:00-07:00 | 0.070378 | 0.000325 | 0.0 |
| 2022-08-31 17:20:00-07:00 | 0.065981 | 0.000000 | 0.0 |
| ... | ... | ... | ... |
| 2022-09-22 16:40:00-07:00 | 0.106339 | 0.099220 | 0.0 |
| 2022-09-22 16:40:00-07:00 | 0.082231 | 0.028131 | 0.0 |
| 2022-09-22 16:50:00-07:00 | 0.252558 | 0.273710 | 0.0 |
| 2022-09-22 16:50:00-07:00 | 0.068790 | 0.081986 | 0.0 |


In [17]:
# Combine July, August, and September data
goes_df_summer_all = pd.concat([goes_df_july_all, goes_df_aug_all, goes_df_sep_all])
# Inner join on the time index so only timesteps present in both GOES and camera data remain
data = goes_df_summer_all.join(camera_data, how='inner')
data = data.dropna()

# Drop timesteps with more than 2 cloud types
data = data[data['clouds'].str.len() <= 2]

# Drop times when bluesky and string is 2 characters long (i.e., bluesky + 1 cloud type)
data = data[~((data['clouds'].str.len() == 2) & (data['clouds'].str.contains('b')))]

print(data['clouds'].value_counts())

clouds
b     808
s     428
f      91
sm     38
sc     34
m      26
fs     14
sr     12
r      10
cr      8
c       2
h       2
f!      2
Name: count, dtype: int64
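
To make the two filters above concrete, here is a sketch on a handful of made-up label strings. It assumes each character in `clouds` is one cloud-type code and that `b` means blue sky, as in the cell above; the other letters are not decoded here.

```python
import pandas as pd

# Made-up examples of label strings
labels = pd.Series(['b', 's', 'sm', 'bs', 'scm', 'f!'], name='clouds')

# Filter 1: drop anything with more than 2 characters ('scm')
keep = labels[labels.str.len() <= 2]

# Filter 2: drop 2-character labels that mix blue sky with a cloud type ('bs')
keep = keep[~((keep.str.len() == 2) & keep.str.contains('b'))]

kept = keep.tolist()
print(kept)
```

Note that a label such as `f!` passes both filters because the length test counts every character, which is why it shows up in the counts above.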


In [18]:
# Group by time and clouds to count occurrences
cloud_counts = data.groupby([data.index, 'clouds']).size().unstack(fill_value=0)

# Normalize counts to get proportions
cloud_proportions = cloud_counts.div(cloud_counts.sum(axis=1), axis=0)

# # Plot the stacked area chart
# cloud_proportions.plot(kind='area', stacked=True, alpha=0.7, figsize=(10, 6))
# plt.xlabel('Time')
# plt.ylabel('Proportion')
# plt.title('Cloud Categories Proportion Over Time')
# plt.legend(title='Clouds')
# plt.show()

## Decision Tree

In [19]:
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score, f1_score, precision_score, recall_score

# Create binary target variable: 1 if 'clouds' contains 'b' (no clouds), 0 otherwise (clouds)
data['no_clouds'] = data['clouds'].str.contains('b').astype(int)

# Split into features (X) and target (y)
X = data[['red', 'green', 'blue']]  # RGB composite channel values at the GOES pixel
y = data['no_clouds']  # Binary target variable

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a decision tree
tree_model = DecisionTreeClassifier(max_depth=3, min_samples_split=10, class_weight='balanced', random_state=42)
tree_model.fit(X_train, y_train)

# Print the decision tree rules as text
tree_rules = export_text(tree_model, feature_names=['red', 'green', 'blue'])
print("Decision Tree Rules:\n", tree_rules)

# Make predictions on the test set
y_pred = tree_model.predict(X_test)

# Detailed classification report
print("\nClassification Report:\n", classification_report(y_test, y_pred))

Decision Tree Rules:
 |--- green <= 0.12
|   |--- red <= 0.00
|   |   |--- green <= 0.01
|   |   |   |--- class: 0
|   |   |--- green >  0.01
|   |   |   |--- class: 1
|   |--- red >  0.00
|   |   |--- class: 0
|--- green >  0.12
|   |--- green <= 0.13
|   |   |--- class: 0
|   |--- green >  0.13
|   |   |--- class: 0


Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.82      0.90       123
           1       0.89      0.99      0.94       172

    accuracy                           0.92       295
   macro avg       0.94      0.91      0.92       295
weighted avg       0.93      0.92      0.92       295
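
`export_text` rounds thresholds to two decimals by default, so a rule like `red <= 0.00` may hide a small nonzero split value. A sketch of reading the unrounded thresholds from a fitted tree through its `tree_` attribute, shown on synthetic data; the helper name `split_thresholds` is hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def split_thresholds(fitted_tree, feature_names):
    """Return (feature_name, threshold) for every internal node, unrounded."""
    tree = fitted_tree.tree_
    splits = []
    for node in range(tree.node_count):
        # Leaves have identical left and right child ids (both -1)
        if tree.children_left[node] != tree.children_right[node]:
            name = feature_names[tree.feature[node]]
            splits.append((name, float(tree.threshold[node])))
    return splits

# Synthetic data: only 'green' varies, and the label flips between 0.12 and 0.13
green = np.linspace(0, 1, 101)
X_demo = np.column_stack([np.zeros(101), green, np.full(101, 0.1)])
y_demo = (green <= 0.125).astype(int)

demo_tree = DecisionTreeClassifier(max_depth=1, random_state=42).fit(X_demo, y_demo)
demo_splits = split_thresholds(demo_tree, ['red', 'green', 'blue'])
print(demo_splits)
```

Calling `split_thresholds(tree_model, ['red', 'green', 'blue'])` on the model fitted above would list its splits in the same form.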



In [20]:
# Evaluate metrics
print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))

Accuracy: 0.9220338983050848
F1 Score: 0.936986301369863
Precision: 0.8860103626943006
Recall: 0.9941860465116279
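
As a sanity check, the scores above can be reproduced from a single set of confusion counts. The counts below are inferred from the reported support (172 clear, 123 cloudy) and scores, not read from the model, so treat them as a reconstruction.

```python
# Inferred counts for class 1 (no clouds) as the positive class
tp, fn = 171, 1    # 172 clear-sky test samples
fp, tn = 22, 101   # 123 cloudy test samples

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
```

Under these counts nearly all of the error comes from cloudy scenes being labeled clear (22 cases), with only one clear scene labeled cloudy.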


In [21]:
from sklearn.metrics import f1_score

# Classify with hand-set thresholds.
# Note: these differ from the tree printed above, which splits on green at
# 0.12 and 0.01 and does not use blue.
def classify_based_on_thresholds(row):
    if row['green'] <= 0.13:
        if row['red'] <= 0.00:
            if row['blue'] <= 0.26:
                return 1  # Class 1 (no clouds)
    return 0  # Class 0 (clouds)

# Apply the function to the test set
manual_predictions = X_test.apply(classify_based_on_thresholds, axis=1)

# Calculate the F1 score
f1 = f1_score(y_test, manual_predictions)
print("F1 Score based on thresholds:", f1)

F1 Score based on thresholds: 0.7468354430379747
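
For the larger spatial domain, a row-wise `apply` will get slow. Here is a vectorized sketch of the same three-threshold rule, using the same hand-set defaults as the cell above; the function name `classify_vectorized` and the demo rows are made up for illustration.

```python
import pandas as pd

def classify_vectorized(df, green_max=0.13, red_max=0.00, blue_max=0.26):
    """Return 1 (no clouds) where all three thresholds are met, else 0 (clouds)."""
    clear = ((df['green'] <= green_max)
             & (df['red'] <= red_max)
             & (df['blue'] <= blue_max))
    return clear.astype(int)

# Four made-up pixels: only the first satisfies all three conditions
demo = pd.DataFrame({
    'red':   [0.0, 0.0, 0.1, 0.0],
    'green': [0.05, 0.20, 0.05, 0.05],
    'blue':  [0.10, 0.10, 0.10, 0.30],
})
preds = classify_vectorized(demo)
print(preds.tolist())
```

Because the thresholds are keyword arguments, the same function can be rerun with values read from the fitted tree to compare F1 scores.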
