In [None]:
Use Autoencoder to implement anomaly detection. Build the model by using:
a. Import required libraries
b. Upload / access the dataset
c. Encoder converts it into latent representation
d. Decoder networks convert it back to the original input
e. Compile the models with Optimizer, Loss, and Evaluation Metrics

In [1]:
# Step a: Import required libraries

# pandas is used for data manipulation and analysis. It provides data structures like DataFrames, 
# which are useful for handling tabular data (like CSV files).
import pandas as pd  

# numpy is a library for numerical computing in Python. It provides support for large, multi-dimensional arrays 
# and matrices, and includes mathematical functions to operate on these arrays.
import numpy as np  

# matplotlib.pyplot is a plotting library for creating static, animated, and interactive visualizations in Python.
# It is commonly used for generating graphs, charts, and other visualizations.
import matplotlib.pyplot as plt  


In [2]:
# Import train_test_split to split data into training and testing sets
from sklearn.model_selection import train_test_split  

# Import StandardScaler to standardize the features (mean = 0, std = 1)
from sklearn.preprocessing import StandardScaler  

# Import Model to define a custom Keras model (functional API)
from keras.models import Model  

# Import Input to define the input layer and Dense for fully connected layers
from keras.layers import Input, Dense  

# Import regularizers to apply regularization (L1, L2) to the model
from keras import regularizers  

# Import EarlyStopping to halt training if the validation loss doesn't improve (prevents overfitting)
from keras.callbacks import EarlyStopping  





In [3]:
# Load the credit card fraud detection dataset from a CSV file
data = pd.read_csv("creditcard.csv")

# Standardize the 'Amount' column to have a mean of 0 and standard deviation of 1 (for better model performance)
data['Amount'] = StandardScaler().fit_transform(data['Amount'].values.reshape(-1, 1))

# Drop the 'Time' column as it is not useful for the anomaly detection task
data = data.drop(['Time'], axis=1)


In [4]:
# Split the data into training and testing sets (80% for training, 20% for testing)
X_train, X_test = train_test_split(data, test_size=0.2, random_state=42)

# Only use non-fraudulent transactions (Class == 0) for training the Autoencoder
# Drop the 'Class' column from the training set as it is not used in the input for anomaly detection
X_train = X_train[X_train.Class == 0].drop(['Class'], axis=1).values

# Separate the 'Class' column (target variable) from the test set for evaluation
y_test = X_test['Class']

# Drop the 'Class' column from the test set to leave only features for prediction
X_test = X_test.drop(['Class'], axis=1).values
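
The scaler in the earlier cell is fitted on the full dataset before the split, so test rows influence the mean and standard deviation. A common alternative is to fit on the training rows only and reuse those statistics for the test rows. The sketch below shows the idea in plain Python. The amounts and the helper `standardize` are hypothetical, and it divides by the population standard deviation, as `StandardScaler` does.

```python
import statistics

# Hypothetical transaction amounts (not taken from creditcard.csv).
train_amounts = [12.5, 3.0, 250.0, 47.9, 8.75, 99.0]
test_amounts = [15.0, 1200.0]

# Fit: compute the statistics on the training rows only.
train_mean = statistics.fmean(train_amounts)
train_std = statistics.pstdev(train_amounts)  # population std (ddof = 0)

def standardize(values):
    """Scale values with the statistics learned from the training rows."""
    return [(v - train_mean) / train_std for v in values]

train_scaled = standardize(train_amounts)
test_scaled = standardize(test_amounts)

print("train mean after scaling:", round(statistics.fmean(train_scaled), 6))
print("test values after scaling:", [round(v, 3) for v in test_scaled])
```

With scikit-learn the same pattern would be `fit` on the training split followed by `transform` on both splits.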


In [5]:
# Set the input dimension based on the number of features in the training data
input_dim = X_train.shape[1]

# Define the encoding dimension (size of the latent space representation)
encoding_dim = 14

# Input layer: Defines the shape of the input (based on the number of features in the dataset)
input_layer = Input(shape=(input_dim,))

# Encoder: first fully connected layer with 'tanh' activation.
# The L1 activity regularizer (strength 1e-4) penalizes the layer's outputs, not its weights.
encoder = Dense(encoding_dim, activation="tanh", activity_regularizer=regularizers.l1(1e-4))(input_layer)

# Encoder: second fully connected layer with 'relu' activation; this is the bottleneck (encoding_dim // 2 = 7 units)
encoder = Dense(encoding_dim // 2, activation="relu")(encoder)

# Decoder: first fully connected layer with 'tanh' activation, same width as the bottleneck
decoder = Dense(encoding_dim // 2, activation='tanh')(encoder)

# Decoder: final fully connected layer; outputs a reconstruction with the same shape as the input.
# Note: 'relu' outputs are never negative, so negative input values cannot be reproduced exactly.
decoder = Dense(input_dim, activation='relu')(decoder)

# Autoencoder model: Defines the complete model with input and output layers
autoencoder = Model(inputs=input_layer, outputs=decoder)
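
The output layer above uses `relu`, whose outputs are never negative, while the standardized `Amount` column necessarily contains negative values. The sketch below shows the smallest squared error a `relu` output can achieve for a given target value. The helper `min_relu_error` and the target values are hypothetical. A linear output layer (`activation='linear'` in Keras) is a common alternative, mentioned here as a suggestion and not as part of the original model.

```python
def relu(x):
    """Rectified linear unit: negative inputs become 0."""
    return max(0.0, x)

def min_relu_error(target):
    """Smallest squared error a relu output can reach for this target.

    A relu unit can output any value >= 0, so the closest reachable
    value to the target is relu(target) itself.
    """
    closest = relu(target)
    return (target - closest) ** 2

# Hypothetical standardized feature values; standardized data is centered on 0,
# so some values are negative.
for target in (-1.5, -0.2, 0.0, 0.8, 2.0):
    print(f"target={target:+.1f}  unavoidable squared error={min_relu_error(target):.2f}")
```

Non-negative targets can be matched exactly, while every negative target leaves an error equal to its square, regardless of how well the network is trained.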






In [6]:
# Compile the autoencoder model using the Adam optimizer and mean squared error loss function
autoencoder.compile(optimizer='adam', loss='mean_squared_error')

# Define early stopping to halt training if validation loss does not improve for 10 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)

# Train the autoencoder model with the training data, validate on test data, and apply early stopping
autoencoder.fit(X_train, X_train, epochs=20, batch_size=32, validation_data=(X_test, X_test), callbacks=[early_stop], verbose=1)



Epoch 1/20

Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.src.callbacks.History at 0x17a1374cb80>

In [7]:
# Make predictions using the trained autoencoder model on the test data
predictions = autoencoder.predict(X_test)

# Compute the Mean Squared Error (MSE) between the original and predicted data for each test sample
mse = np.mean(np.power(X_test - predictions, 2), axis=1)

# Set a threshold for anomaly detection (samples with MSE greater than the threshold are considered anomalies)
threshold = 50  # Set based on analysis

# Classify anomalies: If MSE is greater than the threshold, mark as anomaly (1), else normal (0)
y_pred = (mse > threshold).astype(int)
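
The threshold of 50 above is fixed by hand. One common alternative, sketched here as an assumption and not as the notebook's method, is to take a high percentile of the reconstruction errors measured on normal data. The error values and the helper `percentile_threshold` are hypothetical.

```python
import math
import random

def percentile_threshold(errors, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of errors at or below it."""
    ordered = sorted(errors)
    rank = math.ceil(pct * len(ordered) / 100)
    rank = max(1, min(len(ordered), rank))  # clamp to a valid position
    return ordered[rank - 1]

random.seed(0)
# Hypothetical reconstruction errors for normal training samples (not real notebook output).
normal_errors = [random.expovariate(1.0) for _ in range(10_000)]

suggested_threshold = percentile_threshold(normal_errors, 99)
flagged = sum(e > suggested_threshold for e in normal_errors)
print(f"threshold={suggested_threshold:.3f}, flags {flagged} of {len(normal_errors)} normal samples")
```

Choosing the 99th percentile means accepting that roughly 1% of normal samples will be flagged; a higher percentile trades fewer false alarms for more missed anomalies.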




In [8]:
# Import metrics from sklearn to evaluate the performance of the model
from sklearn.metrics import accuracy_score, recall_score, precision_score, confusion_matrix


In [9]:
print("Accuracy:", accuracy_score(y_test, y_pred))


Accuracy: 0.9983146659176293


In [10]:
print("Recall:", recall_score(y_test, y_pred))


Recall: 0.25510204081632654


In [11]:
print("Precision:", precision_score(y_test, y_pred))


Precision: 0.5208333333333334


In [12]:
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

Confusion Matrix:
 [[56841    23]
 [   73    25]]
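
The metrics printed in the previous cells can all be derived from the four counts in the confusion matrix. The sketch below recomputes them by hand and adds two values the notebook does not print: the F1-score and the accuracy of a trivial model that labels every transaction as normal. The variable names are illustrative.

```python
# Counts copied from the confusion matrix printed above
# (rows: actual class, columns: predicted class).
tn, fp = 56841, 23
fn, tp = 73, 25
total = tn + fp + fn + tp

accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
false_positive_rate = fp / (fp + tn)

# Accuracy of a model that never flags anything as fraud.
baseline_accuracy = (tn + fp) / total

print(f"accuracy            = {accuracy:.6f}")
print(f"baseline accuracy   = {baseline_accuracy:.6f}")
print(f"precision           = {precision:.4f}")
print(f"recall              = {recall:.4f}")
print(f"F1-score            = {f1:.4f}")
print(f"false positive rate = {false_positive_rate:.6f}")
```

The accuracy is only slightly above the trivial baseline, which is why precision, recall, and F1 are more informative for this dataset.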


In [None]:
Here are the key theory concepts associated with the Autoencoder-based anomaly detection practical that an external examiner might ask about:

1. **Autoencoder Architecture**:
   - **Encoder**: Compresses the input data into a lower-dimensional latent representation.
   - **Decoder**: Reconstructs the input data from the latent representation.
   - **Symmetric Architecture**: Often, the encoder and decoder have symmetric layers.

2. **Loss Function**:
   - **Mean Squared Error (MSE)**: Commonly used in Autoencoders for reconstruction tasks, measures the difference between original and reconstructed data.
   - **Loss Function for Anomaly Detection**: The reconstruction error (MSE) is higher for anomalies, as the Autoencoder fails to accurately reconstruct abnormal data.

3. **Anomaly Detection**:
   - Anomalies are detected by thresholding the reconstruction error (MSE) — samples with high reconstruction errors are considered anomalies.
   - **Threshold**: A pre-set threshold (`50` in this notebook) determines whether the reconstruction error is high enough to classify a sample as an anomaly.

4. **Activation Functions**:
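   - **ReLU (Rectified Linear Unit)**: Introduces non-linearity by passing positive values unchanged and mapping negative values to zero. In this notebook it is used in the second encoder layer and the final decoder layer.
   - **Tanh**: Maps values to the range -1 to 1. In this notebook it is used in the first encoder layer and the first decoder layer.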

5. **Regularization**:
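   - **L1 Regularization**: Adds a penalty proportional to the sum of absolute values of the regularized quantity. Applied to weights, it pushes many of them to exactly zero, which helps prevent overfitting.
   - **Activity Regularizer**: Applies the penalty to a layer's outputs (activations) rather than to its weights. This notebook uses an L1 activity regularizer on the first encoder layer, which encourages sparse activations.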

6. **Training Techniques**:
   - **Early Stopping**: A technique used to stop training when the validation loss stops improving, which helps prevent overfitting.
   - **Adam Optimizer**: Adaptive learning rate optimizer widely used in deep learning tasks.
   
7. **Model Evaluation**:
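   - **Accuracy**: Measures the percentage of correct predictions. On heavily imbalanced data it is dominated by the majority class: here accuracy is about 0.998 even though recall is only about 0.255.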
   - **Recall**: Focuses on detecting anomalies (true positives), measuring the ability to identify all actual anomalies.
   - **Precision**: Measures the correctness of predicted anomalies.
   - **Confusion Matrix**: Provides insight into true positives, false positives, true negatives, and false negatives.

8. **Dimensionality Reduction**:
   - Autoencoders can be viewed as a form of unsupervised **dimensionality reduction** that compresses the data into a smaller representation.

These concepts are essential for understanding the functioning of Autoencoders in anomaly detection and can help explain the workings and rationale behind the model in your practical.
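
The encoder, decoder, and reconstruction-error ideas above can be sketched without a neural network. The example below uses a one-dimensional linear bottleneck fitted by least squares as a stand-in for a trained Autoencoder. The data, the fitting method, and the function names are all assumptions made for illustration.

```python
import math
import random

random.seed(0)

# Hypothetical "normal" data: 2-D points close to the line y = 2x.
normal = []
for _ in range(500):
    t = random.gauss(0, 1)
    normal.append((t + random.gauss(0, 0.05), 2 * t + random.gauss(0, 0.05)))

# "Training": fit a unit direction through the origin by least squares.
slope = sum(x * y for x, y in normal) / sum(x * x for x, _ in normal)
norm = math.hypot(1.0, slope)
direction = (1.0 / norm, slope / norm)

def encode(point):
    """Compress a 2-D point to a single latent value."""
    return point[0] * direction[0] + point[1] * direction[1]

def decode(latent):
    """Expand the latent value back to 2-D."""
    return (latent * direction[0], latent * direction[1])

def reconstruction_mse(point):
    """Mean squared error between a point and its reconstruction."""
    rebuilt = decode(encode(point))
    return sum((a - b) ** 2 for a, b in zip(point, rebuilt)) / len(point)

on_pattern = (1.0, 2.0)     # follows the pattern seen in training
off_pattern = (1.0, -2.0)   # violates it

print(f"on-pattern MSE:  {reconstruction_mse(on_pattern):.4f}")
print(f"off-pattern MSE: {reconstruction_mse(off_pattern):.4f}")
```

The point that follows the training pattern is reconstructed almost perfectly, while the point that violates it has a much larger error, which is the signal that thresholding relies on.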

Here are a few more minor points that could be relevant for an external examiner to ask:

1. **Data Preprocessing**:
   - **Standardization**: Why is it important to standardize features (e.g., 'Amount' in the dataset)? Standardization ensures that all features have the same scale, improving model training and performance.
   - **Handling Imbalanced Data**: Fraud cases (anomalies) are rare; the confusion matrix in this notebook shows 98 fraud cases among 56,962 test transactions. Training the Autoencoder only on normal transactions sidesteps the imbalance during training, but the imbalance still has to be considered when choosing evaluation metrics.

2. **Latent Space**:
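   - The **latent space** is the compressed representation of the input data in the Autoencoder. In this notebook the first encoder layer has 14 units (`encoding_dim`) and the narrowest layer, the bottleneck, has 7 (`encoding_dim / 2`). The bottleneck size controls how much information the model can carry through to the decoder.
   - **Dimensionality Choice**: The choice of the latent dimension is crucial. Too small and important features are lost, so even normal data reconstructs poorly; too large and the model can reconstruct anomalies as well, which weakens the separation between normal and anomalous errors.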

3. **Overfitting and Underfitting**:
   - **Overfitting** occurs when the model learns to memorize the training data, including noise, which can reduce its ability to generalize to new data.
   - **Underfitting** happens when the model is too simple to capture the underlying patterns, leading to poor performance.

4. **Reconstruction Error**:
   - An **anomaly** is detected when the reconstruction error is large because the Autoencoder is trained primarily on normal data. When it encounters abnormal data, it fails to reconstruct it well.
   - **Threshold for Anomaly Detection**: The threshold value for MSE is critical — choosing a wrong value might lead to too many false positives or false negatives.

5. **Activation Functions (continued)**:
   - **Tanh**: Has a range between -1 and 1, making it suitable for encoding and decoding, where the output values should be normalized or centered around zero.
   - **ReLU**: Converts negative values to zero but keeps positive values intact. It is cheap to compute and does not saturate for positive inputs, which helps gradients flow during training.

6. **Model Evaluation in Anomaly Detection**:
   - **Precision vs Recall**: In fraud detection, recall is often more important than precision because detecting all fraud cases is critical, even if some normal cases are flagged as anomalies (false positives).
   - **False Positive Rate**: An examiner might ask about the false positive rate, especially in fraud detection. A higher threshold for MSE will reduce false positives but may increase false negatives.

7. **Training Time and Complexity**:
   - **Epochs**: The number of epochs defines how many times the model sees the entire dataset. Too many epochs may cause overfitting, while too few may cause underfitting.
   - **Batch Size**: Determines how many samples are used to calculate the gradient update at each step. Larger batch sizes can speed up training but require more memory.

8. **Anomaly Detection Applications**:
   - **Real-world use cases**: Autoencoders are widely used in **fraud detection**, **network intrusion detection**, **machine health monitoring**, and other areas where anomalies in data need to be identified. 

These points add more depth to the concepts and help an examiner understand your knowledge beyond just the practical implementation.
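
The trade-off between false positives and false negatives described above can be made concrete by sweeping the threshold over a set of reconstruction errors. The errors below are randomly generated stand-ins, not values from this notebook, and `confusion_counts` is a hypothetical helper.

```python
import random

random.seed(1)
# Hypothetical reconstruction errors: anomalies tend to score higher than normal samples.
normal_errors = [random.gauss(1.0, 0.5) for _ in range(1000)]
anomaly_errors = [random.gauss(3.0, 1.0) for _ in range(50)]

def confusion_counts(threshold):
    """Count outcomes when samples with error > threshold are flagged as anomalies."""
    fp = sum(e > threshold for e in normal_errors)
    tp = sum(e > threshold for e in anomaly_errors)
    fn = len(anomaly_errors) - tp
    tn = len(normal_errors) - fp
    return tp, fp, fn, tn

for threshold in (1.0, 2.0, 3.0, 4.0):
    tp, fp, fn, tn = confusion_counts(threshold)
    recall = tp / (tp + fn)
    fpr = fp / (fp + tn)
    print(f"threshold={threshold:.1f}  recall={recall:.2f}  false positive rate={fpr:.3f}")
```

Raising the threshold can only lower the false positive count and can only raise the false negative count, so the choice depends on which kind of error is more costly.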

Here are a few additional minor points that could be useful:

1. **Autoencoder Variants**:
   - **Denoising Autoencoders**: This variant adds noise to the input data and trains the model to reconstruct the clean data. It helps the model generalize better and is often used when noise is present in the dataset.
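   - **Variational Autoencoders (VAE)**: VAEs take a probabilistic approach: the encoder outputs a distribution over latent variables instead of a single point. They are best known for generative tasks, and they can also be applied to anomaly detection, for example by scoring samples on reconstruction probability.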

2. **Role of Anomalies in the Dataset**:
   - **Anomaly Definition**: Anomalies (or outliers) are rare events or observations that deviate significantly from the general pattern of the data. The goal of anomaly detection is to identify these outliers.
   - **One-Class Classification**: Autoencoders are often used in **one-class classification** for anomaly detection, where the model is only trained on normal data, and anomalies are identified by high reconstruction errors.

3. **Reconstruction vs Prediction**:
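   - In an Autoencoder, the goal is not to predict a label (as in traditional supervised learning) but to reconstruct the input. This is a key distinction between unsupervised learning (Autoencoder) and supervised learning (e.g., classification tasks). In this notebook the `Class` labels are used only to select normal transactions for training and to evaluate the results, never as a training target.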

4. **Optimizers**:
   - **Adam Optimizer**: Combines the benefits of two other extensions of stochastic gradient descent, namely **Momentum** and **RMSprop**, making it more efficient in practice for a variety of tasks.
   - **Learning Rate**: The learning rate is a key hyperparameter that controls how much the weights are updated in each training step. Too high a learning rate can make the updates overshoot the minimum, so the loss oscillates or diverges; too low a rate makes training very slow.

5. **Model Interpretability**:
   - **Latent Space Visualization**: In some cases, it may be useful to visualize the latent space to understand how the model has compressed the data. This can help in understanding what the model has learned, especially when dealing with complex datasets.
   - **Feature Importance**: While Autoencoders are typically not as interpretable as decision trees or linear models, breaking the reconstruction error down per feature can show which features contribute most to a sample being flagged as an anomaly.

6. **Handling Missing Data**:
   - **Imputation**: If the dataset has missing values, techniques such as imputation (e.g., mean or median imputation) might be used before training the model to ensure the data is complete.
   - **Impact of Missing Data**: Missing data can affect the performance of the Autoencoder, as it may prevent the model from learning accurate representations of the input.

7. **Evaluation Metrics for Anomaly Detection**:
   - **F1-Score**: The harmonic mean of precision and recall. It is useful in anomaly detection because it is high only when both false positives and false negatives are kept low.
   - **ROC Curve & AUC**: The **Receiver Operating Characteristic (ROC)** curve plots the true positive rate against the false positive rate, and the **Area Under the Curve (AUC)** provides a single value that summarizes the model's ability to distinguish between normal and anomalous data.

8. **Batch Normalization**:
   - **Batch Normalization**: This technique is used in deep learning to normalize the inputs to each layer so that the model trains faster and more stably. It can sometimes help when training deeper Autoencoder models.

9. **Overfitting in Anomaly Detection**:
   - **Overfitting** in anomaly detection happens when the model is too complex and memorizes the normal data so well that it fails to generalize to new or unseen normal data. Early stopping, regularization, or simpler models are commonly used to mitigate overfitting.

10. **Effect of Noise on Anomaly Detection**:
    - **Noisy Data**: Noisy data can increase the reconstruction error for normal data, leading to false positives (misclassifying normal data as anomalies). Preprocessing steps like noise filtering or using Denoising Autoencoders can help mitigate this.

11. **Sparse Autoencoders**:
    - A **sparse Autoencoder** adds a sparsity constraint to the hidden layer, so that only a small subset of its neurons is active for any given input. The L1 activity regularizer on this notebook's first encoder layer is one such constraint. Sparsity encourages each neuron to respond to a distinct pattern, which is useful for feature learning.
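
The effect of the learning rate mentioned under Optimizers can be seen on a toy problem. The sketch below runs plain gradient descent (not Adam) on `f(w) = w**2`, whose minimum is at `w = 0`; the function name and the chosen rates are illustrative assumptions.

```python
def gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimize f(w) = w**2 with plain gradient descent and return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2 * w                  # derivative of w**2
        w -= learning_rate * gradient     # standard update rule
    return w

for lr in (0.001, 0.1, 1.1):
    print(f"learning rate {lr}: w after 20 steps = {gradient_descent(lr):.4f}")
```

Each step multiplies `w` by `1 - 2 * learning_rate`, so a rate of 0.001 barely moves, 0.1 shrinks `w` steadily toward 0, and 1.1 makes `w` flip sign and grow on every step.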

These additional points cover more advanced concepts, variations, and nuances that could enrich your understanding of Autoencoders and anomaly detection. They might help you answer more specific or detailed questions if asked by the examiner.
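
Because AUC does not depend on a particular threshold, it is a useful complement to the thresholded metrics. It equals the fraction of (anomaly, normal) pairs in which the anomaly receives the higher score. The sketch below computes it directly from that definition; the scores and the function `roc_auc` are hypothetical.

```python
def roc_auc(normal_scores, anomaly_scores):
    """AUC as the fraction of (anomaly, normal) pairs ranked correctly; ties count half."""
    wins = 0.0
    for a in anomaly_scores:
        for n in normal_scores:
            if a > n:
                wins += 1.0
            elif a == n:
                wins += 0.5
    return wins / (len(anomaly_scores) * len(normal_scores))

# Hypothetical reconstruction errors used as anomaly scores.
normal_scores = [0.1, 0.2, 0.3, 0.4]
anomaly_scores = [0.35, 0.8]

print("AUC:", roc_auc(normal_scores, anomaly_scores))
```

Here 7 of the 8 pairs are ranked correctly, giving an AUC of 0.875. In the notebook itself, passing the labels and the per-sample errors to scikit-learn's `roc_auc_score` (for example `roc_auc_score(y_test, mse)`) should give the same kind of summary without the quadratic loop.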