# Notebook 1: Training Sargassum Detection Models

This notebook provides a step-by-step guide to training the Sargassum detection models. It mirrors `1_train_models.py` (Keras MLP + 1D CNN) and `1_train_models_ML.py` (XGBoost GPU).

**Models Trained:**
- **Keras MLP** — Mixed Precision float16, saved as `.keras` and quantized to TFLite INT8 / FLOAT16
- **1D CNN** — Mixed Precision float16, saved as `.keras` and quantized to TFLite INT8 / FLOAT16
- **XGBoost GPU** — GPU-accelerated, hyperparameter-tuned via `GridSearchCV`

**Workflow:**
1. Environment Setup
2. Data Loading & Exploration
3. Preprocessing
4. Keras MLP Training
5. 1D CNN Training
6. TFLite Quantization (INT8 + FLOAT16)
7. XGBoost GPU Training
8. Evaluation & Performance Summary

## 1. Environment Setup & Dependencies

Install all required packages with:
```bash
pip install -r requirements.txt
```

**Key Library Versions:**
- `numpy==1.26.4`
- `pandas==2.3.0`
- `scikit-learn==1.5.2`
- `xgboost==2.1.4`
- `tensorflow==2.17.0`
- `matplotlib==3.9.2`
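
To confirm that the active environment matches these pins before training, a small standard-library check along the following lines may help. It is only a sketch: `check_versions` and `PINNED` are hypothetical names introduced here, and the pins are copied from the list above.

```python
from importlib import metadata

# Pins copied from the "Key Library Versions" list above.
PINNED = {
    "numpy": "1.26.4",
    "pandas": "2.3.0",
    "scikit-learn": "1.5.2",
    "xgboost": "2.1.4",
    "tensorflow": "2.17.0",
    "matplotlib": "3.9.2",
}


def check_versions(pins):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


for pkg, (want, got) in check_versions(PINNED).items():
    print(f"{pkg}: expected {want}, found {got or 'not installed'}")
```

An empty result means every pinned package is installed at the listed version.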

In [None]:
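# --- Set Matplotlib Backend (before any pyplot import) ---
# 'Agg' is non-interactive: figures are written to disk via savefig(),
# and plt.show() will not render them inline in the notebook.
import matplotlib
matplotlib.use('Agg')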

# --- Standard Imports ---
import pandas as pd
import numpy as np
import os
import joblib
import time
import traceback
import matplotlib.pyplot as plt
import seaborn as sns

# --- Scikit-learn (utilities only) ---
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, confusion_matrix, ConfusionMatrixDisplay, classification_report
)

# --- XGBoost ---
try:
    import xgboost as xgb
    XGB_AVAILABLE = True
    print('XGBoost loaded successfully.')
except ImportError:
    XGB_AVAILABLE = False
    print('Warning: XGBoost not found. XGBoost training will be skipped.')

# --- Reproducibility Seed ---
SEED = 42
np.random.seed(SEED)

# --- TensorFlow / Keras ---
TF_AVAILABLE = False
try:
    import tensorflow as tf
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Dense, Conv1D, MaxPooling1D, Flatten, Dropout, Input
    from tensorflow.keras.callbacks import EarlyStopping
    from tensorflow.keras.metrics import Precision as KerasPrecision, Recall as KerasRecall
    from tensorflow.keras import mixed_precision
    try:
        AdamOptimizer = tf.keras.optimizers.Adam
    except AttributeError:
        AdamOptimizer = tf.keras.optimizers.legacy.Adam
    tf.random.set_seed(SEED)

    # Configure GPU memory growth
    gpus = tf.config.list_physical_devices('GPU')
    if gpus:
        try:
            for gpu in gpus:
                tf.config.experimental.set_memory_growth(gpu, True)
            print(f'Set memory growth for {len(gpus)} GPU(s).')
        except RuntimeError as e:
            print(f'Error setting GPU memory growth: {e}')

    # Enable mixed precision for faster training on compatible GPUs
    policy = mixed_precision.Policy('mixed_float16')
    mixed_precision.set_global_policy(policy)
    print(f'Mixed precision policy: {policy.name}')
    TF_AVAILABLE = True
    print('TensorFlow/Keras loaded successfully.')
except ImportError:
    print('Warning: TensorFlow not found. Keras models will be skipped.')

# --- Global Configuration ---
CSV_PATH          = 'sargassum_data.csv'
OUTPUT_DIR_MODELS = 'output/models_classification'
OUTPUT_DIR_PLOTS  = 'output/plots_classification'
OUTPUT_DIR_TABLES = 'output/tables_classification'
POSITIVE_CLASS_LABEL = 'sargassum'
CLASS_COLUMN      = 'class'
TEST_SIZE         = 0.3

# Keras training hyperparameters (matching 1_train_models.py)
MLP_EPOCHS             = 100
CNN_EPOCHS             = 100
BATCH_SIZE             = 64
EARLY_STOPPING_PATIENCE = 15
VALIDATION_SPLIT       = 0.2

os.makedirs(OUTPUT_DIR_MODELS, exist_ok=True)
os.makedirs(OUTPUT_DIR_PLOTS,  exist_ok=True)
os.makedirs(OUTPUT_DIR_TABLES, exist_ok=True)
print('Directories created. Setup complete.')

## 2. Data Loading and Exploration

We load the spectral library `sargassum_data.csv`, which contains ~196 K labelled pixel samples with 5 reflectance bands and a binary `class` column (`sargassum` / `no_sargassum`).

In [None]:
df = pd.read_csv(CSV_PATH)

# value_counts() sorts in descending order, so the majority class comes first
class_counts = df[CLASS_COLUMN].value_counts()

print('Dataset shape:', df.shape)
print('\nClass distribution:')
print(class_counts)
print(f'\nClass balance ratio (majority:minority): {class_counts.iloc[0] / class_counts.iloc[1]:.1f}:1')

print('\nFirst 5 rows:')
display(df.head())

print('\nFeature statistics:')
display(df.describe())

## 3. Preprocessing: Splitting and Scaling

1. **Label Encoding** — convert string labels to integers (0 / 1).
2. **Train-Test Split** — 70 / 30 stratified split (`SEED=42`).
3. **Feature Scaling** — `StandardScaler` fit **only** on training data to prevent leakage.

Both the scaler and label encoder are saved to `output/models_classification/` for use during inference.
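
The leakage point in step 3 can be illustrated with plain NumPy. This is a minimal sketch on made-up data, not the notebook's actual preprocessing: the array `X_toy` and its values are assumptions, and the arithmetic mirrors what `StandardScaler` does with its default settings (mean and population standard deviation per column).

```python
import numpy as np

rng = np.random.default_rng(42)
# Toy stand-in for the reflectance matrix: 10 samples x 5 bands (made-up values)
X_toy = rng.normal(loc=0.2, scale=0.05, size=(10, 5))
X_toy_train, X_toy_test = X_toy[:7], X_toy[7:]

# Statistics come from the training rows only
mean = X_toy_train.mean(axis=0)
std = X_toy_train.std(axis=0)  # population std (ddof=0)

X_toy_train_scaled = (X_toy_train - mean) / std
X_toy_test_scaled = (X_toy_test - mean) / std  # reuse the training statistics

print(X_toy_train_scaled.mean(axis=0).round(6))  # approximately zero per band
print(X_toy_test_scaled.mean(axis=0).round(3))   # generally non-zero, which is expected
```

The test rows are not centred exactly at zero, because they never contributed to `mean` and `std`. The same applies at inference time, which is why the fitted scaler is saved alongside the models.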

In [None]:
# 1. Separate features and target
feature_cols = [c for c in df.columns if c != CLASS_COLUMN]
X = df[feature_cols].values.astype(np.float64)
y_labels_str = df[CLASS_COLUMN].values

# 2. Label encoding
le = LabelEncoder()
y = le.fit_transform(y_labels_str)
sargassum_encoded_value = le.transform([POSITIVE_CLASS_LABEL])[0]
print(f'Class mapping: {list(zip(le.classes_, le.transform(le.classes_)))}')
print(f'Positive class "{POSITIVE_CLASS_LABEL}" encoded as: {sargassum_encoded_value}')

# 3. Train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=SEED, stratify=y
)
print(f'\nTraining set: {X_train.shape[0]:,} samples')
print(f'Test set:     {X_test.shape[0]:,} samples')

# 4. Feature scaling
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled  = scaler.transform(X_test)

# Save scaler and label encoder
joblib.dump(le,     os.path.join(OUTPUT_DIR_MODELS, 'label_encoder_classification.joblib'))
joblib.dump(scaler, os.path.join(OUTPUT_DIR_MODELS, 'scaler_classification.joblib'))
print('\nLabelEncoder and Scaler saved.')

## 4. Keras MLP Training (Mixed Precision)

A compact 3-layer MLP (100 → 50 → 1) trained under the `mixed_float16` policy (float16 compute, float32 variables) with early stopping on validation loss.
The trained model is saved as `.keras` and is quantized to TFLite in Section 6.
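
As a rough size check, the number of trainable parameters in the 100 → 50 → 1 stack can be worked out by hand. The sketch below assumes the 5 reflectance bands are the only feature columns; `dense_params` is a hypothetical helper, and dropout layers add no parameters.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


n_features = 5  # assumption: the 5 reflectance bands are the only feature columns
layer_sizes = [n_features, 100, 50, 1]

per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(per_layer)
print(per_layer, total)  # [600, 5050, 51] 5701
```

Under that assumption the total should agree with the count reported by `mlp_model.summary()`.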

In [None]:
# ── Model builder functions ───────────────────────────────────────────────────
def build_mlp_classifier(input_shape):
    """Keras MLP for binary classification."""
    if not TF_AVAILABLE:
        return None
    model = Sequential([
        Input(shape=input_shape),
        Flatten(),
        Dense(100, activation='relu'),
        Dropout(0.4),
        Dense(50, activation='relu'),
        Dropout(0.4),
        Dense(1, activation='sigmoid'),
    ])
    model.compile(
        optimizer=AdamOptimizer(learning_rate=0.001),
        loss='binary_crossentropy',
        metrics=['accuracy', KerasPrecision(name='precision'), KerasRecall(name='recall')],
    )
    return model


def build_cnn_classifier(input_shape):
    """1D CNN for binary classification."""
    if not TF_AVAILABLE:
        return None
    model = Sequential([
        Input(shape=input_shape),
        Conv1D(32, kernel_size=3, activation='relu', padding='same'),
        MaxPooling1D(pool_size=2),
        Dropout(0.3),
        Conv1D(64, kernel_size=3, activation='relu', padding='same'),
        MaxPooling1D(pool_size=2),
        Dropout(0.4),
        Flatten(),
        Dense(128, activation='relu'),
        Dropout(0.5),
        Dense(1, activation='sigmoid'),
    ])
    model.compile(
        optimizer=AdamOptimizer(learning_rate=0.001),
        loss='binary_crossentropy',
        metrics=['accuracy', KerasPrecision(name='precision'), KerasRecall(name='recall')],
    )
    return model


def plot_cm(y_true, y_pred, class_names, model_name):
    """Plot and save a confusion matrix."""
    cm = confusion_matrix(y_true, y_pred, labels=np.arange(len(class_names)))
    fig, ax = plt.subplots(figsize=(6, 5))
    ConfusionMatrixDisplay(cm, display_labels=class_names).plot(ax=ax, cmap=plt.cm.Blues, xticks_rotation=45)
    ax.set_title(f'Confusion Matrix - {model_name}')
    plt.tight_layout()
    fig.savefig(os.path.join(OUTPUT_DIR_PLOTS, f'cm_{model_name.lower().replace(" ","_")}.png'), dpi=150)
    plt.show()
    plt.close(fig)


# ── Train Keras MLP ───────────────────────────────────────────────────────────
results = {}
mlp_model = None

if TF_AVAILABLE:
    print('--- Training Keras MLP (Mixed Precision) ---')
    mlp_start = time.time()
    try:
        mlp_model = build_mlp_classifier((X_train_scaled.shape[1],))
        mlp_model.summary()
        early_stop = EarlyStopping(
            monitor='val_loss', patience=EARLY_STOPPING_PATIENCE,
            restore_best_weights=True, verbose=1,
        )
        mlp_model.fit(
            X_train_scaled, y_train,
            epochs=MLP_EPOCHS, batch_size=BATCH_SIZE,
            validation_split=VALIDATION_SPLIT,
            callbacks=[early_stop], verbose=2,
        )
        mlp_duration = time.time() - mlp_start
        y_pred_mlp = (mlp_model.predict(X_test_scaled).flatten() > 0.5).astype(int)
        results['Keras_MLP_Mixed'] = {
            'Accuracy':       accuracy_score(y_test, y_pred_mlp),
            'Precision (w)':  precision_score(y_test, y_pred_mlp, average='weighted', zero_division=0),
            'Recall (w)':     recall_score(y_test, y_pred_mlp, average='weighted', zero_division=0),
            'F1-Score (w)':   f1_score(y_test, y_pred_mlp, average='weighted', zero_division=0),
            'Train Time (s)': mlp_duration,
        }
        print(classification_report(y_test, y_pred_mlp, target_names=le.classes_, digits=4))
        plot_cm(y_test, y_pred_mlp, le.classes_, 'Keras_MLP_Mixed')
        mlp_path = os.path.join(OUTPUT_DIR_MODELS, 'mlp_classifier_mixed.keras')
        mlp_model.save(mlp_path)
        print(f'Keras MLP saved -> {mlp_path}')
    except Exception as e:
        print(f'Error training Keras MLP: {e}')
        traceback.print_exc()
        results['Keras_MLP_Mixed'] = {k: float('nan') for k in ['Accuracy','Precision (w)','Recall (w)','F1-Score (w)','Train Time (s)']}
else:
    print('Skipping Keras MLP (TensorFlow not available).')

## 5. 1D CNN Training (Mixed Precision)

A 1D Convolutional Neural Network treating each pixel's spectrum as a 1D signal of length 5.
Architecture: `Conv1D(32) → Pool → Conv1D(64) → Pool → Dense(128) → Dense(1)`.
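
To see how a length-5 spectrum moves through this architecture, the shapes and parameter counts can be traced by hand. The sketch below assumes an input of shape `(5, 1)` and Keras defaults for pooling (stride equal to pool size, `'valid'` padding); `conv1d_same` and `max_pool1d` are hypothetical helpers, not Keras functions.

```python
def conv1d_same(length, channels_in, filters, kernel_size):
    """'same' padding with stride 1 keeps the length; returns (out_shape, n_params)."""
    params = kernel_size * channels_in * filters + filters
    return (length, filters), params


def max_pool1d(length, channels, pool_size):
    """Stride equal to pool_size with 'valid' padding floors the length."""
    return (length // pool_size, channels)


shape = (5, 1)  # assumption: 5 bands, one channel
trace = [shape]
n_params = 0

shape, p = conv1d_same(shape[0], shape[1], filters=32, kernel_size=3)
n_params += p
trace.append(shape)
shape = max_pool1d(shape[0], shape[1], pool_size=2)
trace.append(shape)

shape, p = conv1d_same(shape[0], shape[1], filters=64, kernel_size=3)
n_params += p
trace.append(shape)
shape = max_pool1d(shape[0], shape[1], pool_size=2)
trace.append(shape)

flat = shape[0] * shape[1]       # size after Flatten
n_params += flat * 128 + 128     # Dense(128)
n_params += 128 * 1 + 1          # Dense(1)

print(trace)           # [(5, 1), (5, 32), (2, 32), (2, 64), (1, 64)]
print(flat, n_params)  # 64 14785
```

With only five bands, the second pooling layer leaves a single time step, so the dense head sees 64 values per pixel.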

In [None]:
cnn_model = None

if TF_AVAILABLE:
    print('--- Training 1D CNN (Mixed Precision) ---')
    cnn_start = time.time()
    try:
        X_train_cnn = X_train_scaled.reshape(X_train_scaled.shape[0], X_train_scaled.shape[1], 1)
        X_test_cnn  = X_test_scaled.reshape(X_test_scaled.shape[0],  X_test_scaled.shape[1],  1)
        cnn_model = build_cnn_classifier((X_train_cnn.shape[1], 1))
        cnn_model.summary()
        early_stop = EarlyStopping(
            monitor='val_loss', patience=EARLY_STOPPING_PATIENCE,
            restore_best_weights=True, verbose=1,
        )
        cnn_model.fit(
            X_train_cnn, y_train,
            epochs=CNN_EPOCHS, batch_size=BATCH_SIZE,
            validation_split=VALIDATION_SPLIT,
            callbacks=[early_stop], verbose=2,
        )
        cnn_duration = time.time() - cnn_start
        y_pred_cnn = (cnn_model.predict(X_test_cnn).flatten() > 0.5).astype(int)
        results['1D_CNN_Mixed'] = {
            'Accuracy':       accuracy_score(y_test, y_pred_cnn),
            'Precision (w)':  precision_score(y_test, y_pred_cnn, average='weighted', zero_division=0),
            'Recall (w)':     recall_score(y_test, y_pred_cnn, average='weighted', zero_division=0),
            'F1-Score (w)':   f1_score(y_test, y_pred_cnn, average='weighted', zero_division=0),
            'Train Time (s)': cnn_duration,
        }
        print(classification_report(y_test, y_pred_cnn, target_names=le.classes_, digits=4))
        plot_cm(y_test, y_pred_cnn, le.classes_, '1D_CNN_Mixed')
        cnn_path = os.path.join(OUTPUT_DIR_MODELS, 'cnn_classifier_mixed.keras')
        cnn_model.save(cnn_path)
        print(f'1D CNN saved -> {cnn_path}')
    except Exception as e:
        print(f'Error training 1D CNN: {e}')
        traceback.print_exc()
        results['1D_CNN_Mixed'] = {k: float('nan') for k in ['Accuracy','Precision (w)','Recall (w)','F1-Score (w)','Train Time (s)']}
else:
    print('Skipping 1D CNN (TensorFlow not available).')

## 6. TFLite Quantization

Both Keras models are quantized to two compact TFLite formats for edge deployment:
- **INT8** — smallest size, fastest CPU inference, uses a calibration dataset (1000 samples).
- **FLOAT16** — half the size of FP32, compatible with GPU delegate on mobile devices.

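Before quantization, each model is rebuilt under the `float32` policy and loaded with the trained weights. Keras keeps layer variables in float32 even under `mixed_float16`, so the weights transfer without conversion.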
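
The idea behind INT8 quantization can be illustrated with a generic affine mapping from floats to 8-bit integers. This is a conceptual sketch only: the TFLite converter picks its own quantization parameters from the calibration data and does not necessarily use this exact formula. The functions `quantize_int8` and `dequantize` and the sample values are made up for illustration.

```python
import numpy as np


def quantize_int8(x):
    """Affine-quantize a float array to int8; returns (q, scale, zero_point)."""
    # The represented range must include 0 so that 0.0 maps to an exact integer
    lo = min(float(x.min()), 0.0)
    hi = max(float(x.max()), 0.0)
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 for all-zero input
    zero_point = int(round(-128 - lo / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Map int8 codes back to approximate float values."""
    return (q.astype(np.float32) - zero_point) * scale


calib = np.array([-1.7, -0.3, 0.0, 0.4, 2.1], dtype=np.float32)  # made-up scaled features
q, scale, zp = quantize_int8(calib)
restored = dequantize(q, scale, zp)

print(q, scale, zp)
print(np.abs(restored - calib).max())  # round-trip error, below one quantization step
```

The calibration samples matter because they determine the observed value range, and therefore the step size `scale`, for each quantized tensor.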

In [None]:
def _get_fp32_copy(keras_path):
    """Rebuild a model under the float32 policy and copy the trained weights into it."""
    try:
        mx = tf.keras.models.load_model(keras_path)
        w  = mx.get_weights()
    except Exception as e:
        print(f'Error loading {keras_path}: {e}')
        return None
    # Pick the builder from the file name only, not from the directory part of the path
    is_cnn = 'cnn' in os.path.basename(keras_path)
    mixed_precision.set_global_policy('float32')
    try:
        builder = build_cnn_classifier if is_cnn else build_mlp_classifier
        fp32 = builder(mx.input_shape[1:])
        fp32.set_weights(w)
    except Exception as e:
        print(f'Error building FP32 model: {e}')
        fp32 = None
    finally:
        # Always restore the training policy, even if the rebuild failed
        mixed_precision.set_global_policy('mixed_float16')
    return fp32


def _quantize_int8(fp32_model, out_path, rep_gen):
    """Convert to INT8 TFLite (int8 internal ops, float32 input and output)."""
    c = tf.lite.TFLiteConverter.from_keras_model(fp32_model)
    c.optimizations = [tf.lite.Optimize.DEFAULT]
    c.representative_dataset = rep_gen  # calibration samples for activation ranges
    c.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    # Keep float32 I/O so callers can pass scaled features directly
    c.inference_input_type = tf.float32
    c.inference_output_type = tf.float32
    # Convert first so a failed conversion does not leave an empty file behind
    tflite_bytes = c.convert()
    with open(out_path, 'wb') as f:
        f.write(tflite_bytes)
    print(f'INT8  saved -> {out_path}')


def _quantize_f16(fp32_model, out_path):
    """Convert to FLOAT16 TFLite."""
    c = tf.lite.TFLiteConverter.from_keras_model(fp32_model)
    c.optimizations = [tf.lite.Optimize.DEFAULT]
    c.target_spec.supported_types = [tf.float16]
    tflite_bytes = c.convert()
    with open(out_path, 'wb') as f:
        f.write(tflite_bytes)
    print(f'F16   saved -> {out_path}')


if TF_AVAILABLE:
    N_REP = min(1000, len(X_train_scaled))

    # ── MLP ──
    mlp_keras = os.path.join(OUTPUT_DIR_MODELS, 'mlp_classifier_mixed.keras')
    if os.path.exists(mlp_keras):
        print('--- Quantizing MLP ---')
        fp32_mlp = _get_fp32_copy(mlp_keras)
        if fp32_mlp is not None:
            def _rep_mlp():
                for i in range(N_REP):
                    yield [X_train_scaled[i:i+1].astype(np.float32)]
            _quantize_int8(fp32_mlp, os.path.join(OUTPUT_DIR_MODELS, 'mlp_classifier_int8.tflite'), _rep_mlp)
            _quantize_f16(fp32_mlp,  os.path.join(OUTPUT_DIR_MODELS, 'mlp_classifier_f16.tflite'))

    # ── CNN ──
    cnn_keras = os.path.join(OUTPUT_DIR_MODELS, 'cnn_classifier_mixed.keras')
    if os.path.exists(cnn_keras):
        print('--- Quantizing CNN ---')
        fp32_cnn = _get_fp32_copy(cnn_keras)
        if fp32_cnn is not None:
            def _rep_cnn():
                for i in range(N_REP):
                    s = X_train_scaled[i:i+1]
                    yield [s.reshape(s.shape[0], s.shape[1], 1).astype(np.float32)]
            _quantize_int8(fp32_cnn, os.path.join(OUTPUT_DIR_MODELS, 'cnn_classifier_int8.tflite'), _rep_cnn)
            _quantize_f16(fp32_cnn,  os.path.join(OUTPUT_DIR_MODELS, 'cnn_classifier_f16.tflite'))
else:
    print('Skipping TFLite quantization (TensorFlow not available).')

## 7. XGBoost GPU Training

XGBoost gradient boosting runs on the GPU via `device='cuda'` and `tree_method='hist'`.
A `GridSearchCV` with 5-fold stratified cross-validation searches over 48 hyperparameter combinations, scored by weighted F1.
`scale_pos_weight` is computed from the training-set class counts (negatives / positives) to compensate for class imbalance.
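
The size of the search and the imbalance weight can both be checked with a few lines of arithmetic. In this sketch the grid values are the ones used in the cell below, while the class counts are made up for illustration.

```python
from itertools import product

# Tuned hyperparameters (scale_pos_weight has a single value, so it adds no combinations)
param_grid = {
    "n_estimators": [100, 150],
    "learning_rate": [0.05, 0.1],
    "max_depth": [3, 5, 7],
    "subsample": [0.7, 1.0],
    "colsample_bytree": [0.7, 1.0],
}
n_splits = 5

combos = list(product(*param_grid.values()))
n_fits = len(combos) * n_splits
print(len(combos), n_fits)  # 48 240

# scale_pos_weight = negatives / positives; these counts are made up
count_neg, count_pos = 9_000, 1_000
scale_pos_weight = count_neg / count_pos if count_pos > 0 else 1.0
print(scale_pos_weight)  # 9.0
```

Because `GridSearchCV` refits the best configuration on the full training set by default (`refit=True`), one further fit follows the 240 cross-validation fits.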

In [None]:
xgb_best = None

if XGB_AVAILABLE:
    print('--- Training XGBoost GPU ---')

    # Class-imbalance weight
    count_neg = np.sum(y_train != sargassum_encoded_value)
    count_pos = np.sum(y_train == sargassum_encoded_value)
    scale_pos_weight = count_neg / count_pos if count_pos > 0 else 1.0
    print(f'Class ratio (neg/pos) = {scale_pos_weight:.2f}')

    xgb_template = xgb.XGBClassifier(
        random_state=SEED,
        device='cuda',
        tree_method='hist',
        eval_metric='logloss',
    )
    param_grid = {
        'n_estimators':     [100, 150],
        'learning_rate':    [0.05, 0.1],
        'max_depth':        [3, 5, 7],
        'subsample':        [0.7, 1.0],
        'colsample_bytree': [0.7, 1.0],
        'scale_pos_weight': [scale_pos_weight],
    }
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
    gs = GridSearchCV(xgb_template, param_grid, cv=cv, scoring='f1_weighted', n_jobs=1, verbose=1)

    xgb_start = time.time()
    try:
        gs.fit(X_train_scaled, y_train)
        xgb_best = gs.best_estimator_
        xgb_duration = time.time() - xgb_start
        print(f'Best params:  {gs.best_params_}')
        print(f'Best CV F1 (weighted): {gs.best_score_:.4f}')

        y_pred_xgb = xgb_best.predict(X_test_scaled)
        results['XGBoost_GPU'] = {
            'Accuracy':       accuracy_score(y_test, y_pred_xgb),
            'Precision (w)':  precision_score(y_test, y_pred_xgb, average='weighted', zero_division=0),
            'Recall (w)':     recall_score(y_test, y_pred_xgb, average='weighted', zero_division=0),
            'F1-Score (w)':   f1_score(y_test, y_pred_xgb, average='weighted', zero_division=0),
            'Train Time (s)': xgb_duration,
        }
        print(classification_report(y_test, y_pred_xgb, target_names=le.classes_, digits=4))
        plot_cm(y_test, y_pred_xgb, le.classes_, 'XGBoost_GPU')

        xgb_path = os.path.join(OUTPUT_DIR_MODELS, 'xgboost_gpu_classifier.joblib')
        joblib.dump(xgb_best, xgb_path)
        print(f'XGBoost saved -> {xgb_path}')
    except Exception as e:
        print(f'Error training XGBoost: {e}')
        traceback.print_exc()
        results['XGBoost_GPU'] = {k: float('nan') for k in ['Accuracy','Precision (w)','Recall (w)','F1-Score (w)','Train Time (s)']}
else:
    print('Skipping XGBoost (not available).')

## 8. Performance Summary

Compile all test-set metrics into one table. Precision, recall, and F1 are averaged over both classes, weighted by each class's support.
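
To make the weighting concrete, the sketch below computes a support-weighted F1 by hand from a made-up 2×2 confusion matrix and compares it with the unweighted (macro) average. The counts and the helper `prf` are illustrative assumptions, not results from this notebook.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up confusion matrix: rows are true classes, columns are predicted classes
cm = [
    [90, 10],  # true no_sargassum
    [5, 45],   # true sargassum
]

supports = [sum(row) for row in cm]  # number of true samples per class
total = sum(supports)

f1_per_class = []
recall_per_class = []
for k in range(2):
    tp = cm[k][k]
    fp = cm[1 - k][k]  # other class predicted as k
    fn = cm[k][1 - k]  # class k predicted as the other class
    _, recall, f1 = prf(tp, fp, fn)
    recall_per_class.append(recall)
    f1_per_class.append(f1)

weighted_f1 = sum(f * s for f, s in zip(f1_per_class, supports)) / total
macro_f1 = sum(f1_per_class) / len(f1_per_class)
weighted_recall = sum(r * s for r, s in zip(recall_per_class, supports)) / total
accuracy = (cm[0][0] + cm[1][1]) / total

print(round(weighted_f1, 4), round(macro_f1, 4))
print(weighted_recall, accuracy)
```

Support-weighted recall is algebraically identical to accuracy, so the `Recall (w)` and `Accuracy` columns in the table should match. With imbalanced classes, the weighted scores are dominated by the majority class, so the per-class rows of `classification_report` remain the better view of sargassum detection quality.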

In [None]:
import pandas as pd

if results:
    results_df = pd.DataFrame(results).T
    cols = ['Accuracy', 'Precision (w)', 'Recall (w)', 'F1-Score (w)', 'Train Time (s)']
    results_df = results_df[[c for c in cols if c in results_df.columns]]

    print('--- Overall Test Set Performance ---')
    display(results_df.sort_values('F1-Score (w)', ascending=False).style.format('{:.4f}'))

    out_path = os.path.join(OUTPUT_DIR_TABLES, 'classification_performance_summary.csv')
    results_df.to_csv(out_path, float_format='%.4f')
    print(f'Performance summary saved to: {out_path}')
else:
    print('No results to display — all models were skipped.')

## Conclusion

Training is complete. All models and artefacts have been saved to `output/models_classification/`:

| File | Description |
|------|-------------|
| `mlp_classifier_mixed.keras` | Full Keras MLP (trained with mixed precision) |
| `mlp_classifier_int8.tflite` | INT8-quantized TFLite MLP |
| `mlp_classifier_f16.tflite` | FLOAT16-quantized TFLite MLP |
| `cnn_classifier_mixed.keras` | Full Keras 1D CNN (trained with mixed precision) |
| `cnn_classifier_int8.tflite` | INT8-quantized TFLite CNN |
| `cnn_classifier_f16.tflite` | FLOAT16-quantized TFLite CNN |
| `xgboost_gpu_classifier.joblib` | XGBoost classifier |
| `scaler_classification.joblib` | Feature StandardScaler |
| `label_encoder_classification.joblib` | Label encoder |
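
Since any model whose library was unavailable is skipped, it may be worth confirming which of these files actually exist before moving on. The following is a minimal sketch; `EXPECTED` and `missing_artifacts` are hypothetical names, and the file list is copied from the table above.

```python
import os

# File names copied from the table above
EXPECTED = [
    "mlp_classifier_mixed.keras",
    "mlp_classifier_int8.tflite",
    "mlp_classifier_f16.tflite",
    "cnn_classifier_mixed.keras",
    "cnn_classifier_int8.tflite",
    "cnn_classifier_f16.tflite",
    "xgboost_gpu_classifier.joblib",
    "scaler_classification.joblib",
    "label_encoder_classification.joblib",
]


def missing_artifacts(model_dir, expected=EXPECTED):
    """Return the expected file names that are not present in model_dir."""
    return [
        name for name in expected
        if not os.path.isfile(os.path.join(model_dir, name))
    ]


missing = missing_artifacts("output/models_classification")
print("All artefacts present." if not missing else f"Missing: {missing}")
```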

Proceed to **`Notebook_2_classify_sargassum.ipynb`** to apply these models to satellite imagery.