# E-tivity 1 (Weeks 1-2)

* Barry Clarke

* 24325082

## Anomaly Detection

### Context
We have a mystery dataset with nine explanatory variables and one response variable. The response variable is the last column and indicates whether the sample is anomalous (1 = anomalous, 0 = valid). The dataset is provided as "data.csv".

Of course, in this case we could use supervised learning to build a model and detect anomalies in new data. However, the focus here is on autoencoders; anomaly detection is just one of their potential uses.

So we are going to pretend that we do not know which data are anomalous, but we do know that the anomaly rate is small. Use an autoencoder to detect anomalies in the data. The correctness of the model can of course be checked afterwards against the response variable.

### Task 4: VAE (completed by Sunday Week 2)

This task is an individual task and should **not** be uploaded to the Group Locker. No direct support should be given via the forums. Marks will be deducted if the instructions are not followed (see rubrics). This part should be uploaded directly to Brightspace.

Change the network to be a VAE. Again, determine the optimal cutoff and plot the latent variables. Check how good the cutoff was by constructing a confusion matrix or generating a classification report. Obviously, for this task you need to use the Anom column.
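As a rough illustration of the cutoff step, the sketch below picks a threshold from per-sample reconstruction errors given an assumed anomaly rate, then counts the confusion-matrix cells by hand. The helper names `cutoff_from_rate` and `confusion_counts`, the 8% rate, and the synthetic errors are all assumptions made for the example; they stand in for the real reconstruction errors and the Anom column.

```python
import numpy as np

def cutoff_from_rate(errors, anomaly_rate):
    """Error value above which roughly `anomaly_rate` of the samples fall."""
    return np.quantile(errors, 1.0 - anomaly_rate)

def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels where 1 means anomalous."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))
    tn = int(np.sum(~y_true & ~y_pred))
    fp = int(np.sum(~y_true & y_pred))
    fn = int(np.sum(y_true & ~y_pred))
    return tn, fp, fn, tp

# Synthetic stand-in data (NOT the real dataset): anomalies get larger errors.
rng = np.random.default_rng(0)
labels = (rng.random(1000) < 0.08).astype(int)
errors = rng.gamma(shape=2.0, scale=0.5, size=1000) + 3.0 * labels

# Assumed anomaly rate of 8%; in practice this is a guess to be tuned.
threshold = cutoff_from_rate(errors, anomaly_rate=0.08)
y_pred = (errors > threshold).astype(int)

tn, fp, fn, tp = confusion_counts(labels, y_pred)
print(f"threshold={threshold:.3f}  TN={tn} FP={fp} FN={fn} TP={tp}")
```

Sweeping `anomaly_rate` over a small grid and comparing the resulting counts is one simple way to judge whether a given cutoff is close to optimal.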

**Hint:** you can use the model topology from the AE (with the obvious modifications). I found that I had a good model (almost as good as the supervised learning model) when the KL divergence was small. You can print out both the KL divergence and the reconstruction loss for each epoch. It can be tricky to train these types of models, so do not be surprised if you do not get a stellar result. What is more important is that you have the correct code to implement the VAE.
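To make the two loss terms in the hint concrete, here is a minimal NumPy sketch of the closed-form KL divergence between a diagonal Gaussian and a standard normal, alongside a summed squared-error reconstruction loss. It assumes the encoder outputs a mean and a log-variance per latent dimension; the function names are hypothetical and chosen only for this example.

```python
import numpy as np

def kl_to_standard_normal(mean, logvar):
    """Per-sample KL( N(mean, exp(logvar)) || N(0, I) ), summed over latent dims."""
    return -0.5 * np.sum(1.0 + logvar - mean**2 - np.exp(logvar), axis=1)

def reconstruction_sse(x, x_hat):
    """Per-sample sum of squared errors between input and reconstruction."""
    return np.sum((x - x_hat) ** 2, axis=1)

# Two toy samples with a 2-D latent space.
mean = np.array([[0.0, 0.0], [1.0, 0.0]])
logvar = np.zeros((2, 2))  # log-variance 0 means unit variance

kl = kl_to_standard_normal(mean, logvar)

x = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
x_hat = np.array([[1.0, 2.0, 2.0], [0.0, 0.0, 0.0]])
recon = reconstruction_sse(x, x_hat)

# The batch loss is the mean of (reconstruction + KL) over samples.
total = np.mean(recon + kl)
print("KL per sample:", kl)
print("Reconstruction per sample:", recon)
print("Total loss:", total)
```

The KL term is zero only when the encoder output matches the prior exactly (zero mean, unit variance), so printing it each epoch shows how far the latent codes have drifted from the prior.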

### Final Submission (complete by Sunday Week 2)

Submit Tasks 1-4 in a single notebook before the deadline on Sunday.


In [None]:
# ==========================================
# Task 4: Variational Autoencoder (VAE)
# ==========================================

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
import time  # used to time the training run
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

# ---------------------------------------------------------
# 1. Data Preprocessing
# ---------------------------------------------------------
print("--- Data Preprocessing ---")
# Load Data
df = pd.read_csv('data.csv')

# Separate features and target
X = df.drop('Anom', axis=1)
y = df['Anom']

# Split Data (80/20)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale Data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(f"Training shape: {X_train_scaled.shape}")
print(f"Test shape: {X_test_scaled.shape}")

# ---------------------------------------------------------
# 2. VAE Model Definition
# ---------------------------------------------------------
class DenseVAE(tf.keras.Model):
    def __init__(self, input_dim, latent_dim=2):
        super(DenseVAE, self).__init__()
        self.latent_dim = latent_dim

        # --- Encoder ---
        self.encoder = tf.keras.Sequential([
            tf.keras.layers.InputLayer(shape=(input_dim,)),
            Dense(7, activation='relu'),
            Dense(4, activation='relu'),
            Dense(latent_dim + latent_dim) # Output: mean and logvar
        ])

        # --- Decoder ---
        self.decoder = tf.keras.Sequential([
            tf.keras.layers.InputLayer(shape=(latent_dim,)),
            Dense(4, activation='relu'),
            Dense(7, activation='relu'),
            Dense(input_dim, activation='linear')
        ])

        # Trackers for metrics
        self.total_loss_tracker = tf.keras.metrics.Mean(name="total_loss")
        self.reconstruction_loss_tracker = tf.keras.metrics.Mean(name="reconstruction_loss")
        self.kl_loss_tracker = tf.keras.metrics.Mean(name="kl_loss")

    @property
    def metrics(self):
        return [
            self.total_loss_tracker,
            self.reconstruction_loss_tracker,
            self.kl_loss_tracker,
        ]

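    def encode(self, x):
        mean, logvar = tf.split(self.encoder(x), num_or_size_splits=2, axis=1)
        # Clip logvar so tf.exp(logvar) cannot overflow to inf, which is one
        # possible source of NaN losses. The bounds are a judgment call, not tuned.
        logvar = tf.clip_by_value(logvar, -10.0, 10.0)
        return mean, logvar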

    def reparameterize(self, mean, logvar):
        eps = tf.random.normal(shape=tf.shape(mean))
        return eps * tf.exp(logvar * 0.5) + mean

    def decode(self, z):
        return self.decoder(z)

    def call(self, inputs):
        mean, logvar = self.encode(inputs)
        z = self.reparameterize(mean, logvar)
        return self.decode(z)

    def train_step(self, data):
        if isinstance(data, tuple):
            data = data[0]

        with tf.GradientTape() as tape:
            # 1. Forward pass
            mean, logvar = self.encode(data)
            z = self.reparameterize(mean, logvar)
            reconstruction = self.decode(z)

            # 2. Calculate Losses
            reconstruction_loss = tf.reduce_mean(
                tf.reduce_sum(tf.square(data - reconstruction), axis=1)
            )
            kl_loss = -0.5 * (1 + logvar - tf.square(mean) - tf.exp(logvar))
            kl_loss = tf.reduce_mean(tf.reduce_sum(kl_loss, axis=1))
            total_loss = reconstruction_loss + kl_loss

        # 3. Backpropagation
        grads = tape.gradient(total_loss, self.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.trainable_variables))

        # 4. Update metrics
        self.total_loss_tracker.update_state(total_loss)
        self.reconstruction_loss_tracker.update_state(reconstruction_loss)
        self.kl_loss_tracker.update_state(kl_loss)

        return {
            "loss": self.total_loss_tracker.result(),
            "reconstruction_loss": self.reconstruction_loss_tracker.result(),
            "kl_loss": self.kl_loss_tracker.result(),
        }

# ---------------------------------------------------------
# 3. Model Training
# ---------------------------------------------------------
# Define a callback to print progress every 10 epochs
class PrintProgress(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        if (epoch + 1) % 10 == 0:
            print(f"Epoch {epoch + 1}: Loss={logs['loss']:.4f}, Recon={logs['reconstruction_loss']:.4f}, KL={logs['kl_loss']:.4f}")

input_dim = X_train_scaled.shape[1]
vae = DenseVAE(input_dim, latent_dim=2)
vae.compile(optimizer=Adam(learning_rate=0.0005))

print("\n--- Starting VAE Training ---")

# Start Timer
start_time = time.time()

history = vae.fit(
    X_train_scaled,
    epochs=150,
    batch_size=32,
    shuffle=True,
    verbose=0,
    callbacks=[PrintProgress()]
)

# End Timer
end_time = time.time()
total_seconds = end_time - start_time
minutes, seconds = divmod(total_seconds, 60)

print("Training complete.")
print(f"Total training time: {int(minutes)} min {seconds:.2f} sec")

# Plot Loss Curves
plt.figure(figsize=(10, 5))
plt.plot(history.history['reconstruction_loss'], label='Reconstruction Loss')
plt.plot(history.history['kl_loss'], label='KL Loss')
plt.plot(history.history['loss'], label='Total Loss', linestyle='--')
plt.title('VAE Training Loss Components')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

# ---------------------------------------------------------
# 4. Anomaly Detection
# ---------------------------------------------------------
# Predict on test set
reconstructions = vae.predict(X_test_scaled)
mse = np.mean(np.power(X_test_scaled - reconstructions, 2), axis=1)

# Determine Cutoff (92nd percentile)
threshold = np.percentile(mse, 92)
print(f"\nCalculated Threshold (92nd percentile): {threshold:.4f}")

# Histogram
plt.figure(figsize=(10, 6))
counts, bins, patches = plt.hist(mse, bins=50, alpha=0.75, label='Reconstruction Error')
plt.axvline(threshold, color='red', linestyle='--', label=f'Threshold: {threshold:.4f}')
plt.text(threshold, max(counts)*0.8, f' Cutoff\n {threshold:.4f}', color='red')
plt.title('VAE Reconstruction Error Histogram')
plt.legend()
plt.show()

# ---------------------------------------------------------
# 5. Evaluation
# ---------------------------------------------------------
y_pred = (mse > threshold).astype(int)

print("\nVAE Classification Report:")
print(classification_report(y_test, y_pred))

print("Confusion Matrix:")
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix (Task 4: VAE)')
plt.show()

# ---------------------------------------------------------
# 6. Latent Space Visualization
# ---------------------------------------------------------
z_mean, z_log_var = vae.encode(X_test_scaled)
z_mean = z_mean.numpy()

plt.figure(figsize=(10, 8))
scatter = plt.scatter(z_mean[:, 0], z_mean[:, 1], c=y_test, cmap='coolwarm', alpha=0.6, s=10)
plt.colorbar(scatter, label='Anomaly (0=Normal, 1=Anom)')
plt.title('Latent Space Representation (2D Mean)')
plt.xlabel('Latent Dim 1')
plt.ylabel('Latent Dim 2')
plt.grid(True, alpha=0.3)
plt.show()

--- Data Preprocessing ---
Training shape: (39277, 9)
Test shape: (9820, 9)

--- Starting VAE Training ---
Epoch 10: Loss=nan, Recon=nan, KL=nan
Epoch 20: Loss=nan, Recon=nan, KL=nan
Epoch 30: Loss=nan, Recon=nan, KL=nan
Epoch 40: Loss=nan, Recon=nan, KL=nan
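
The NaN losses above mean the run did not train. The log alone does not show why, but two common causes are worth ruling out: non-finite values in the input (to my knowledge `StandardScaler` ignores NaNs when fitting and passes them through on transform), and overflow in `exp(logvar)`. The sketch below is a NumPy-only diagnostic under those assumptions; `report_non_finite` and `safe_std` are hypothetical helper names, and the clipping bounds are arbitrary.

```python
import numpy as np

def report_non_finite(X, name="X"):
    """Count NaN and +/-inf entries in an array and print a short summary."""
    X = np.asarray(X, dtype=float)
    n_nan = int(np.isnan(X).sum())
    n_inf = int(np.isinf(X).sum())
    print(f"{name}: {n_nan} NaN, {n_inf} inf out of {X.size} values")
    return n_nan, n_inf

def safe_std(logvar, low=-10.0, high=10.0):
    """Standard deviation from log-variance, clipped so exp cannot overflow."""
    return np.exp(0.5 * np.clip(logvar, low, high))

# 1. Check the inputs: a single NaN feature makes every loss in its batch NaN.
toy = np.array([[1.0, np.nan], [np.inf, 2.0]])
report_non_finite(toy, name="toy")

# 2. Show the overflow: float32 exp overflows to inf for large log-variances.
with np.errstate(over="ignore"):
    unclipped = np.exp(np.float32(100.0))
print("exp(100) in float32:", unclipped)
print("clipped std for logvar=100:", safe_std(np.array([100.0]))[0])
```

If `report_non_finite` finds missing values in the scaled training data, they would need to be dropped or imputed before training; if the inputs are clean, clipping the log-variance or lowering the learning rate are the next things to try.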


## Reflection

There are no specific marks allocated for a reflection. However, due consideration will be given if pertinent comments or valuable insights are made.