## Gaussian Proccess (GP) on M1 and S2 Site Data
We wish to determine the seasonal and diurnal cycles of supermicron aerosols/bioaerosols. In this notebook we focus on the seasonal trends. We will split the data up by seasons and fit a GP on a subset of the data. Using a confidence interval, we will determine if the data indicates seasonal cycles. 

In [3]:
import pandas as pd

# Define file paths
file_m1 = r"C:\Users\396760\lanl_data\PM_clean\ARMSAILM1_cleaned.csv"
file_s2 = r'C:\Users\396760\lanl_data\PM_clean\ARMSAILS2_cleaned.csv'
# Load the data
m1data = pd.read_csv(file_m1)
s2data = pd.read_csv(file_s2)

# Convert timestamps to datetime format
m1data['Time(UTC)'] = pd.to_datetime(m1data['Time(UTC)'])
s2data['Time(UTC)'] = pd.to_datetime(s2data['Time(UTC)'])

# Handle missing values (NaN); fill with mean value
m1data.fillna(m1data.mean(), inplace=True)
s2data.fillna(s2data.mean(), inplace=True)

# Separate data by seasons for M1 data
winterM1 = m1data[(m1data['Time(UTC)'] >= '2022-12-21') & (m1data['Time(UTC)'] < '2023-03-20')]
springM1 = m1data[(m1data['Time(UTC)'] >= '2023-03-21') & (m1data['Time(UTC)'] < '2023-06-20')]  

#-----------------------------------------------------------------------------------------------#

#subset of collumns for PM1 data
collumnsPM1 = ['sample_rh_pct', 'sample_temp_C', 'sample_pres_mmHg', 'pm_1_ug_per_m3']
#subset of collumns for PM2.5 data
collumnsPM25 = ['sample_rh_pct', 'sample_temp_C', 'sample_pres_mmHg', 'pm_2_5_ug_per_m3']
#subset of collumns for pm10 data
collumnsPM10 = ['sample_rh_pct', 'sample_temp_C', 'sample_pres_mmHg', 'pm_10_ug_per_m3']

# Define features and target variable
features = ['sample_rh_pct', 'sample_temp_C', 'sample_pres_mmHg']
targetPM1= 'pm_1_ug_per_m3'

## Scaling Issues
A well known issue of Gaussian Processes is their limited scalability due to $O(n^3)$ complexity. We attempt to use a Sparse Gaussian Process (SGP) by means of Kmeans Clustering. *Note, for daily averages we do not need to use SGP since $n<1000$ for both Spring and Winter datasets

In [4]:
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Function to generate inducing points
def inducing_points(data_set, sample_size, subset, num_of_inducing_points, features, random_state):
    subset = data_set.sample(n=sample_size, random_state=random_state)
    kmeans = KMeans(n_clusters=num_of_inducing_points, random_state=random_state).fit(subset[features]) 
    inducing_points = kmeans.cluster_centers_ #cluster centers
    silhouette_avg = silhouette_score(subset[features], kmeans.labels_) #score quantifying choice of clusters, closer to 1 is better

    return inducing_points, silhouette_avg
  

# GP

In [5]:
import numpy as np
from scipy.spatial.distance import cdist

# Matérn kernel function
def matern_kernel(X,  Y=None, length_scale=1.0, nu=1.5,):
    
    dists = cdist(X, Y, metric='euclidean')
    
    if nu == 0.5:
        K = np.exp(-dists / length_scale)
    elif nu == 1.5:
        sqrt3 = np.sqrt(3)
        K = (1.0 + sqrt3 * dists / length_scale) * np.exp(-sqrt3 * dists / length_scale)
    elif nu == 2.5:
        sqrt5 = np.sqrt(5)
        K = (1.0 + sqrt5 * dists / length_scale + 5 * dists**2 / (3 * length_scale**2)) * np.exp(-sqrt5 * dists / length_scale)
    else:
        raise ValueError("Unsupported nu value. Only 0.5, 1.5, and 2.5 are supported.")
    
    return K


In [6]:
def run_gp_regression(X_train, y_train, X_test, noise=1e-6):

    #reshaping for linear algebra operations
    X_train = np.array(X_train)
    X_test = np.array(X_test)   
    X_train = X_train.reshape(-1, 1)
    X_test = X_test.reshape(-1, 1)



    print(X_train.shape, y_train.shape, X_test.shape)
    K = matern_kernel(X_train, X_train, length_scale=1.0, nu=1.5)
    K_star = matern_kernel(X_train, X_test, length_scale=1.0, nu=1.5)
    K_star_star = matern_kernel(X_test, X_test, length_scale=1.0, nu=1.5)

    L = np.linalg.cholesky(K + noise * np.eye(K.shape[0]))
    temp = np.linalg.solve(L, y_train)
    alpha = np.linalg.solve(L.T, temp)

    f_star = K_star.T @ alpha
    v = np.linalg.solve(L, K_star)
    var_f_star = K_star_star - v.T @ v

    
    std_f_star = np.sqrt(np.diag(var_f_star))

    #log_marginal_likelihood = -0.5 * y_train.T @ alpha - np.sum(np.log(np.diag(L))) - 0.5 * y_train.shape[0] * np.log(2 * np.pi)

    return f_star, var_f_star, std_f_star #log_marginal_likelihood





# Daily Averages
Here we take the daily averages from each season, and run the GP on those averages

In [7]:
# Extract date from the datetime for both seasons
winterM1.loc[:, 'Date'] = winterM1['Time(UTC)'].dt.date
springM1.loc[:, 'Date'] = springM1['Time(UTC)'].dt.date

# Group by date and calculate daily average for winter season
daily_winterM1 = winterM1.groupby('Date').mean().reset_index()
daily_winterM1PM1 = daily_winterM1[collumnsPM1]

# Group by date and calculate daily average for spring season
daily_springM1 = springM1.groupby('Date').mean().reset_index()
daily_springM1PM1 = daily_springM1[collumnsPM1]


#Running GP on winter data
day_wint = daily_winterM1[collumnsPM1].sample(n=50 , random_state=42)

day_wint_feature = day_wint[features]
day_wint_feature_values = day_wint[targetPM1].values


test_day = winterM1.sample(n=30, random_state=42)['Time(UTC)'].values

run_gp_regression(day_wint_feature, day_wint_feature_values, test_day, noise=1e-6)


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  winterM1.loc[:, 'Date'] = winterM1['Time(UTC)'].dt.date
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  springM1.loc[:, 'Date'] = springM1['Time(UTC)'].dt.date


(150, 1) (50,) (30, 1)


DTypePromotionError: The DTypes <class 'numpy.dtypes.Float64DType'> and <class 'numpy.dtypes.DateTime64DType'> do not have a common DType. For example they cannot be stored in a single array unless the dtype is `object`.