# Evaluating Models

This notebook loads our trained models and evaluates their performance. Everything should run if you follow these steps:

1) Download the models from the shared Google Drive (I've uploaded them) and make sure you have the data saved as well (you should have this from last time).
2) Make sure the model files are in the right place. I recommend putting them under `ml4ms_bandgap_final/models/`.
3) Get the correct data and model paths. You can right-click on your file/directory and select "copy path". Alternatively, open a terminal (any terminal will do), `cd` into the `ml4ms_bandgap_final` directory, then `cd` into wherever you keep the data (e.g. `cd data`) and type `pwd`. This prints the path to your data. Do the same to get the path to your models (`cd models`, etc.).
4) Once you have those paths, update `data_path` and `model_dir` in the cells below. This may take some effort, but read the error messages: they will tell you where things are failing.
5) If everything works, you should be able to hit "Run All" at the top, and evaluation metrics will be output for each model.
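
If you want to check the paths from steps 3 and 4 before running everything, a minimal sketch along these lines may help. The helper names `resolve_project_paths` and `missing_paths` are hypothetical (they are not part of this notebook), and the project location passed in is an assumption you should replace with your own path.

```python
from pathlib import Path


def resolve_project_paths(project_root):
    """Return the expected data file and model directory under project_root."""
    root = Path(project_root).expanduser()
    return {
        "data_path": root / "data" / "soap_and_atomic_features.pkl.gz",
        "model_dir": root / "models",
    }


def missing_paths(paths):
    """Return the names of the entries that do not exist on disk."""
    return [name for name, p in paths.items() if not p.exists()]


# Assumed location, relative to where the notebook runs; use your own path here.
paths = resolve_project_paths("ml4ms_bandgap_final")
print("Missing:", missing_paths(paths))
```

The loaders further down build file names with `model_dir + model_file`, so if you reuse these `Path` objects, convert them first, e.g. `str(paths["model_dir"]) + "/"`.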

Look at the outputs and try to understand them. They tell us how each model performed, and they will go in the report too. If you are unsure what some of the metrics mean (F1 score, recall, etc.), look them up; they are typically pretty simple! Hopefully this helps! Let me know if you have questions!
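
As a quick reference for the metrics mentioned above, here is a small sketch on made-up labels (not project data) showing how precision, recall, and F1 follow from the counts of true/false positives and negatives. It uses scikit-learn's `precision_score`, `recall_score`, and `f1_score`.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels: 1 = "Has Gap", 0 = "No Gap" (made-up values for illustration)
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

# Counts for the positive class: TP = 2, FN = 2, FP = 1, TN = 3
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2/3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2/4
f1 = f1_score(y_true, y_pred)                # 2 * P * R / (P + R) = 4/7

print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")
```

Precision asks "of everything predicted to have a gap, how much really does?", while recall asks "of everything that really has a gap, how much did we find?". F1 is the harmonic mean of the two.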

In [22]:
import numpy as np
import joblib
import matplotlib.pyplot as plt
from tensorflow.keras.models import load_model
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error, classification_report

from tensorflow.keras.metrics import AUC, Precision, Recall, MeanAbsoluteError
import gzip
import pickle
import absl.logging
import logging

In [None]:
data_path = "/Users/cadenmyers/billingelab/dev/ml4ms_bandgap_final/data/soap_and_atomic_features.pkl.gz"
# Step 1: Read the compressed pickle
with gzip.open(data_path, 'rb') as f:
    data_df = pickle.load(f)

data_df = data_df.dropna()
print(data_df.shape)
data_df.head()

In [None]:
X_atomic_8 = data_df[[
    'electronegativity_mean', 'electronegativity_std', 
    'atomic_radius_mean', 'atomic_radius_std',
    'ionenergies_mean', 'ionenergies_std', 
    'covalent_radius_mean', 'covalent_radius_std',
]]
soaps = np.array(data_df['padded_soap'].tolist())
X_soap_2d = soaps[..., np.newaxis]  # add channel dim: (N, 64, 800, 1)
X_atomic_8 = X_atomic_8.to_numpy()
print('X_soap_2d shape:', X_soap_2d.shape)
print('X_atomic_8 shape:', X_atomic_8.shape)
print('--------------------')

# Step 2: Split the data into training and testing sets
X_soap_train, X_soap_test, X_atomic_8_train, X_atomic_8_test, y_train, y_test = train_test_split(
    X_soap_2d, X_atomic_8, data_df['gap opt'], test_size=0.2, random_state=42
)

bg_threshold = 0.02 # eV
y_train_binary = (y_train > bg_threshold).astype(int)
y_test_binary = (y_test > bg_threshold).astype(int)

# Flatten the soap descriptors for scaling
X_soap_train_flat = X_soap_train.reshape(X_soap_train.shape[0], -1)
X_soap_test_flat = X_soap_test.reshape(X_soap_test.shape[0], -1)

# scale soap descriptors
scaler_soap = MinMaxScaler()

# Fit and transform the training set, and transform the test set
X_soap_train_scaled = scaler_soap.fit_transform(X_soap_train_flat)
X_soap_test_scaled = scaler_soap.transform(X_soap_test_flat)

# Reshape back to the original shape (N, 64, 800, 1)
X_soap_train_scaled = X_soap_train_scaled.reshape(X_soap_train.shape)
X_soap_test_scaled = X_soap_test_scaled.reshape(X_soap_test.shape)

# scale atomic input data
scaler_atomic_8 = MinMaxScaler()

X_atomic_8_train_scaled = scaler_atomic_8.fit_transform(X_atomic_8_train)
X_atomic_8_test_scaled = scaler_atomic_8.transform(X_atomic_8_test)

print('X_soap_train_scaled shape:', X_soap_train_scaled.shape)
print('X_soap_test_scaled shape:', X_soap_test_scaled.shape)
print('X_atomic_8_train_scaled shape:', X_atomic_8_train_scaled.shape)
print('X_atomic_8_test_scaled shape:', X_atomic_8_test_scaled.shape)
print('--------------------')
print('y_train_binary shape:', y_train_binary.shape)
print('y_test_binary shape:', y_test_binary.shape)
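
The flatten, scale, reshape pattern used above can be checked on a small random array. This is only a sketch with made-up shapes (4 x 5 instead of 64 x 800). `MinMaxScaler` scales each column independently, so after flattening every SOAP entry is treated as its own feature, and the scaler is fit on the training split only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
train = rng.normal(size=(6, 4, 5, 1))  # stand-in for X_soap_train
test = rng.normal(size=(3, 4, 5, 1))   # stand-in for X_soap_test

scaler = MinMaxScaler()
# Flatten to (N, features), scale, then restore the original 4D shape
train_scaled = scaler.fit_transform(train.reshape(len(train), -1)).reshape(train.shape)
test_scaled = scaler.transform(test.reshape(len(test), -1)).reshape(test.shape)

print(train_scaled.shape, train_scaled.min(), train_scaled.max())
print(test_scaled.shape, test_scaled.min(), test_scaled.max())
```

The scaled training values span 0 to 1 by construction. The test values can fall slightly outside that range because the scaler never saw them during fitting; that is expected and avoids leaking test information into the preprocessing.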

In [None]:
X_atomic_20 = data_df.drop(columns=['formula', 'mpid', 'gap opt', 'padded_soap', 'soap_flat']).to_numpy()
print('X_soap_2d shape:', X_soap_2d.shape)
print('X_atomic_20 shape:', X_atomic_20.shape)
print('--------------------')
# Re-split with the same test_size and random_state so the rows match the split above
_, _, X_atomic_20_train, X_atomic_20_test, y_train, y_test = train_test_split(
    X_soap_2d, X_atomic_20, data_df['gap opt'], test_size=0.2, random_state=42
)

# scale atomic input data
scaler_atomic_20 = MinMaxScaler()

X_atomic_20_train_scaled = scaler_atomic_20.fit_transform(X_atomic_20_train)
X_atomic_20_test_scaled = scaler_atomic_20.transform(X_atomic_20_test)

print('X_soap_train_scaled shape:', X_soap_train_scaled.shape)
print('X_soap_test_scaled shape:', X_soap_test_scaled.shape)
print('---------------------')
print('X_atomic_20_train_scaled shape:', X_atomic_20_train_scaled.shape)
print('X_atomic_20_test_scaled shape:', X_atomic_20_test_scaled.shape)
print('X_atomic_8_train_scaled shape:', X_atomic_8_train_scaled.shape)
print('X_atomic_8_test_scaled shape:', X_atomic_8_test_scaled.shape)
print('--------------------')
print('y_train_binary shape:', y_train_binary.shape)
print('y_test_binary shape:', y_test_binary.shape)
print('y_train shape:', y_train.shape)
print('y_test shape:', y_test.shape)
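
The second split above relies on `train_test_split` picking the same rows whenever the number of samples, `test_size`, and `random_state` are unchanged. A minimal sketch on toy arrays (made-up values) shows this, which is why the 20-feature arrays line up with the SOAP and 8-feature arrays from the first split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

idx = np.arange(20)        # row ids, so we can see which rows were picked
a = np.arange(20) * 10     # stand-in for one feature set
b = np.arange(20) * 100    # stand-in for another feature set

# Two separate calls with identical test_size and random_state
_, idx_test_1, _, a_test = train_test_split(idx, a, test_size=0.2, random_state=42)
_, idx_test_2, _, b_test = train_test_split(idx, b, test_size=0.2, random_state=42)

same_rows = np.array_equal(idx_test_1, idx_test_2)
print("Same test rows in both splits:", same_rows)
```

If the DataFrame were filtered or reordered between the two calls, this guarantee would no longer hold.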

# Functions for Evaluating Models

In [None]:
# Suppress absl warnings about compiled metrics
absl.logging.set_verbosity(absl.logging.ERROR)
logging.getLogger('absl').setLevel(logging.ERROR)

model_dir = '/Users/cadenmyers/billingelab/dev/ml4ms_bandgap_final/models/'
model_files = [
    'bandgap_classifier_pt_slicing_std_avg.h5',
    'bandgap_classifier_pt_slicing.h5',
    'bandgap_classifier.h5',
    'bandgap_cnn_model.h5'
]

def evaluate_classifier_model(model_file, X_atomic_test, X_soap_test=X_soap_test, y_test_binary=y_test_binary, model_dir=model_dir):
    """
    Evaluate a classifier model and print metrics.
    Args:
        model_file (str): Name of the model file inside model_dir.
        X_atomic_test (np.ndarray): Test atomic features.
        X_soap_test (np.ndarray): Test SOAP features.
        y_test_binary (np.ndarray): Binary test labels.
        model_dir (str): Directory containing the model file (must end with '/').
    Note:
        The defaults are the unscaled test arrays. If a model was trained on
        MinMax-scaled inputs, pass the matching *_scaled arrays instead.
    """

    print('-' * 100)
    print('Loading classifier model:', model_file)
    print('-' * 100)

    # Classification-specific metrics
    custom_objects = {
        'auc_5': AUC(name='auc_5'),
        'precision_5': Precision(name='precision_5'),
        'recall_5': Recall(name='recall_5'),
    }

    # Load model
    model = load_model(model_dir + model_file, custom_objects=custom_objects)

    # Evaluate
    results = model.evaluate(
        {'soap_input': X_soap_test, 'periodic_features': X_atomic_test},
        y_test_binary,
        verbose=0
    )
    print(f"\nEvaluation Metrics for {model_file}:")
    print(dict(zip(model.metrics_names, results)))

    # Predict and classify
    y_pred_probs = model.predict({'soap_input': X_soap_test, 'periodic_features': X_atomic_test}, verbose=0)
    y_pred = (y_pred_probs > 0.5).astype(int)

    print(f"\nClassification Report: {model_file}")
    print(classification_report(y_test_binary, y_pred, target_names=["No Gap", "Has Gap"]))


def evaluate_regressor_model(model_file, X_soap_test, y_test=y_test, model_dir=model_dir):
    """
    Evaluate a regression model and print metrics.
    Args:
        model_file (str): Name of the model file inside model_dir.
        X_soap_test (np.ndarray): Test SOAP features.
        y_test (np.ndarray): Test labels.
        model_dir (str): Directory containing the model file.
    """
    print('-' * 100)
    print('Loading regression model:', model_file)
    print('-' * 100)
    # Custom objects (for handling 'mae' string in saved model)
    custom_objects = {
        'mae': MeanAbsoluteError()
    }
    # Load model
    model = load_model(model_dir + model_file, custom_objects=custom_objects)

    # Evaluate
    results = model.evaluate(X_soap_test, y_test, verbose=0)
    print(f"\nEvaluation Metrics for {model_file}:")
    print(dict(zip(model.metrics_names, results)))

    # Predict
    y_pred = model.predict(X_soap_test, verbose=0)

    # Prediction summary
    print(f"\nPrediction Summary for {model_file}:")
    print(f"MAE: {mean_absolute_error(y_test, y_pred):.4f} eV")
    return y_pred, y_test


def evaluate_rf_model(rf_file, X_soap_test_flat, y_test=y_test, model_dir=model_dir):
    """
    Evaluate the Random Forest model on the test set.
    Args:
        rf_file (str): Name of the Random Forest model file inside model_dir.
        X_soap_test_flat (np.ndarray): Flattened test SOAP features.
        y_test (np.ndarray): Test labels.
        model_dir (str): Directory containing the model file.
    """
    print('-' * 100)
    print('Loading Random Forest model:', rf_file)
    print('-' * 100)


    rf_model = joblib.load(model_dir + rf_file)
    y_pred = rf_model.predict(X_soap_test_flat)
    mae = mean_absolute_error(y_test, y_pred)
    print(f"Random Forest MAE: {mae:.4f} eV")
    return y_pred, y_test, mae
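
One thing to watch when reusing the regression outputs: assuming the CNN ends in a single-unit output layer, Keras `predict` returns an array of shape `(N, 1)`, while the true values are one-dimensional. `mean_absolute_error` handles that, but manual NumPy arithmetic broadcasts to `(N, N)` and silently gives a different number. A sketch with made-up values:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([0.0, 1.0, 2.0])          # shape (3,)
y_pred = np.array([[0.5], [1.0], [1.0]])    # shape (3, 1), like a Keras output

# Broadcasting (3,) against (3, 1) yields a (3, 3) matrix of all pairs
wrong = np.abs(y_true - y_pred)

# Flatten the predictions first to compare element by element
mae_manual = np.abs(y_true - y_pred.ravel()).mean()
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(wrong.shape, wrong.mean(), mae_manual, mae_sklearn)
```

Calling `.ravel()` on the predictions before any hand-written error calculation avoids the problem.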

# Classifier: std and avg atomic values
**Neural Net**: loads and displays metrics for `bandgap_classifier_pt_slicing_std_avg.h5`

In [None]:
evaluate_classifier_model(model_files[0], X_atomic_8_test)


# Classifier: std, avg, max, and min atomic values
**Neural Net**: loads and displays metrics for `bandgap_classifier_pt_slicing.h5`


In [None]:
evaluate_classifier_model(model_files[1], X_atomic_20_test)


# Classifier: std, avg, min and max atomic values, no slicing
**Neural Net**: loads and displays metrics for `bandgap_classifier.h5`. This model treats the atomic values the same as the SOAP values.

## **Double check this by rerunning this model!!**


In [None]:
evaluate_classifier_model(model_files[2], X_atomic_20_test)


# Regressor: only SOAP descriptors
**Convolutional Neural Net**: loads and displays metrics for `bandgap_cnn_model.h5`. This model takes only the SOAP descriptors as input and tries to **predict actual band gap values**.

In [None]:
y_pred, y_test = evaluate_regressor_model(model_files[3], X_soap_test)


# Plotting the predictions
plt.figure(figsize=(10, 10))
plt.scatter(y_test, y_pred, alpha=0.5)
# Plot parity line
min_val = min(y_test.min(), y_pred.min())
max_val = max(y_test.max(), y_pred.max())
plt.plot([min_val, max_val], [min_val, max_val], 'r--', lw=2)
plt.xlabel("True Band Gap (eV)")
plt.ylabel("Predicted Band Gap (eV)")
plt.title("Parity Plot: True vs Predicted Band Gap")
plt.grid(True)
plt.show()

# Regressor: Random forest on SOAP data
**Random Forest**: loads and displays metrics for `random_forest_model.pkl`. This model was trained on all data. The filtered model (`filtered_random_forest_model.pkl`) was trained only on SOAP descriptors of entries whose band gap is greater than 0.02 eV.

In [None]:
rf_files = ['random_forest_model.pkl', 'filtered_random_forest_model.pkl']

y_pred_rf, y_test_rf, mae_rf = evaluate_rf_model(rf_files[0], X_soap_test_flat)

plt.figure(figsize=(10, 10))
plt.scatter(y_test_rf, y_pred_rf, alpha=0.5)
# Plot parity line
min_val = min(y_test_rf.min(), y_pred_rf.min())
max_val = max(y_test_rf.max(), y_pred_rf.max())
plt.plot([min_val, max_val], [min_val, max_val], 'r--', lw=2)
plt.xlabel("True Band Gap (eV)")
plt.ylabel("Predicted Band Gap (eV)")