# LSTM Classification Model

This notebook shows how to load data, preprocess it, define an LSTM model, train it, and evaluate its performance. The data is assumed to be a directory of semicolon-delimited CSV files.

## Setup

First, we need to install the necessary libraries. Run the following cell to install them.

In [69]:
%pip install torch torchvision torchaudio
%pip install pandas scikit-learn
%pip install wandb onnx -Uq
%pip install joblib



## Import Libraries and Set Seed
Import the libraries needed for data processing, model building, training, and evaluation. The `set_seed` function seeds Python, NumPy, and PyTorch so that random number generation is consistent across runs.

In [127]:
import os
import random

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
import joblib
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler, LabelEncoder
from torch.utils.data import DataLoader, TensorDataset

import wandb

def set_seed(seed):
    np.random.seed(seed)
    torch.manual_seed(seed)
    random.seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)


Using device: cpu


In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
wandb.login()

[34m[1mwandb[0m: Using wandb-core as the SDK backend. Please refer to https://wandb.me/wandb-core for more information.


<IPython.core.display.Javascript object>

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

## Load Data from GitHub Repository


In [5]:
# Remove any previous PIC-PAPER-01 checkout
!rm -rf PIC-PAPER-01

# # Download GitHub repo (private): https://stackoverflow.com/questions/74532852/clone-github-repo-with-fine-grained-token/78280453#78280453
# # Never commit a real token; <GITHUB_TOKEN> is a placeholder for a fine-grained personal access token.
# !git clone --no-checkout https://<GITHUB_TOKEN>@github.com/danimp94/PIC-PAPER-01.git

# # To clone data folder only:
# %cd PIC-PAPER-01 # Navigate to the repository directory
# !git sparse-checkout init --cone # Initialize sparse-checkout
# !git sparse-checkout set data # Set the sparse-checkout to include only the data/ folder
# !git checkout # Checkout the specified folder

In [71]:
def load_data_from_directory(input_path):
    data_frames = []
    for file in os.listdir(input_path):
        if file.endswith('.csv'):
            df = pd.read_csv(os.path.join(input_path, file), delimiter=';', header=0)
            data_frames.append(df)
    data = pd.concat(data_frames, ignore_index=True)

    print(data)
    print(data.shape)

    return data
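
As a quick check of the loading pattern, the sketch below writes two small semicolon-delimited files to a temporary directory and concatenates them the same way. The file names and values are made up for illustration; only the column names mirror the real data, and `load_csv_directory` is a hypothetical stand-in for `load_data_from_directory`.

```python
import os
import tempfile

import pandas as pd

def load_csv_directory(input_path):
    """Concatenate every semicolon-delimited .csv file in input_path."""
    frames = [
        pd.read_csv(os.path.join(input_path, name), delimiter=';', header=0)
        for name in sorted(os.listdir(input_path))
        if name.endswith('.csv')
    ]
    return pd.concat(frames, ignore_index=True)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Two hypothetical measurement files with the same columns as the real data
    for name, sample in [('a1.csv', 'A1'), ('ref.csv', 'REF')]:
        pd.DataFrame({
            'Sample': [sample] * 3,
            'Frequency (GHz)': [100.0] * 3,
            'LG (mV)': [1.0, 2.0, 3.0],
            'HG (mV)': [0.1, 0.2, 0.3],
            'Thickness (mm)': [0.2] * 3,
        }).to_csv(os.path.join(tmp_dir, name), sep=';', index=False)
    # A non-CSV file that the loader should skip
    open(os.path.join(tmp_dir, 'notes.txt'), 'w').close()

    demo = load_csv_directory(tmp_dir)

print(demo.shape)
```

Sorting the file names makes the row order deterministic, which `os.listdir` alone does not guarantee.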

## Preprocessing Data
Define functions to preprocess the data: window the raw measurements into per-window means and standard deviations, encode the categorical labels, and standardize the features.

In [72]:
def calculate_averages_and_dispersion(data, data_percentage):
    """Summarise each (sample, frequency) group in consecutive windows.

    Each window spans data_percentage percent of the group's rows and is
    reduced to the mean and standard deviation of the LG and HG channels.
    """
    results = []
    for (sample, freq), group in data.groupby(['Sample', 'Frequency (GHz)']):
        window_size = max(1, int(len(group) * data_percentage / 100))
        for start in range(0, len(group), window_size):
            window_data = group.iloc[start:start + window_size]
            mean_values = window_data[['LG (mV)', 'HG (mV)']].mean()
            std_deviation_values = window_data[['LG (mV)', 'HG (mV)']].std()
            results.append({
                'Frequency (GHz)': freq,
                'LG (mV) mean': mean_values['LG (mV)'],
                'HG (mV) mean': mean_values['HG (mV)'],
                'LG (mV) std deviation': std_deviation_values['LG (mV)'],
                'HG (mV) std deviation': std_deviation_values['HG (mV)'],
                'Thickness (mm)': window_data['Thickness (mm)'].iloc[0],
                # 'Sample' must stay last: preprocess_data treats the last column as the target
                'Sample': sample,
            })
    results_df = pd.DataFrame(results)
    print(results_df)
    return results_df
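
The window arithmetic is easy to misjudge, so here is a minimal sketch of it on a hypothetical group of 100 rows. The helper `window_bounds` is not part of the notebook; it only reproduces the `max(1, int(...))` rule and the `range` stepping used above. It also shows one thing to watch for: when the group length is not a multiple of the window size, the trailing window can hold a single row, and the sample standard deviation of one value is NaN in pandas.

```python
import math

import pandas as pd

def window_bounds(n_rows, data_percentage):
    """Return (start, stop) row ranges using the same arithmetic as the windowing loop."""
    window_size = max(1, int(n_rows * data_percentage / 100))
    return [(start, min(start + window_size, n_rows))
            for start in range(0, n_rows, window_size)]

# Hypothetical group of 100 rows with a 3.7 % window: int(3.7) gives 3 rows per window
bounds = window_bounds(100, 3.7)
print(len(bounds), bounds[0], bounds[-1])

# The trailing window holds a single row, so its sample standard deviation is NaN
group = pd.DataFrame({'LG (mV)': [float(i) for i in range(100)]})
start, stop = bounds[-1]
last_std = group.iloc[start:stop]['LG (mV)'].std()
print(last_std)
```

Whether such short trailing windows should be dropped, merged, or kept depends on the data; checking the windowed frame for NaN values before scaling is a cheap safeguard.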

In [136]:
def preprocess_data(data, data_percentage):
    # Window the data into per-window means and standard deviations
    data = calculate_averages_and_dispersion(data, data_percentage)
    print(data.shape)

    # The last column ('Sample') is the target; the rest are features
    X = data.iloc[:, :-1].values
    y = data.iloc[:, -1].values

    # Encode the target labels as integers. The encoder is always fitted so
    # that `le` is defined for the dump and the mapping table below.
    le = LabelEncoder()
    y = le.fit_transform(y)

    # Persist the fitted LabelEncoder so inference can decode predictions
    joblib.dump(le, 'label_encoder.pkl')

    # Get the original labels and their encoded values
    original_labels = le.classes_
    encoded_values = le.transform(original_labels)

    # Create a DataFrame to display the mapping
    label_mapping_df = pd.DataFrame({
        'Original Label': original_labels,
        'Encoded Value': encoded_values
    })
    print(label_mapping_df)

    # Standardize the features.
    # NOTE: the scaler is fitted on all rows, before the train/test split
    # performed later in model_pipeline.
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    # Convert to PyTorch tensors
    X = torch.tensor(X, dtype=torch.float32)
    y = torch.tensor(y, dtype=torch.long)

    return X, y
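
To make the label encoding step concrete, the following sketch fits a `LabelEncoder` on a handful of made-up labels and decodes them again. The label values are illustrative only; the point is that `LabelEncoder` assigns integer codes in sorted label order and that `inverse_transform` recovers the original strings.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample labels, deliberately out of order
labels = np.array(['REF', 'A1', 'B1', 'A1', 'REF'], dtype=object)

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

# Codes follow the sorted order of the distinct labels
mapping = {
    str(label): int(code)
    for label, code in zip(le_demo.classes_, le_demo.transform(le_demo.classes_))
}
decoded = le_demo.inverse_transform(encoded)

print(mapping)
print(encoded.tolist())
```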

In [137]:
input_path = '/content/drive/MyDrive/PhD/Colab Notebooks/training_data/'
data = load_data_from_directory(input_path)

# Load and preprocess data
X, y = preprocess_data(data, data_percentage=3.7) # 1s window size


# print(le.classes_)

        Sample  Frequency (GHz)     LG (mV)    HG (mV)  Thickness (mm)
0           A1            100.0   -7.080942  -0.854611             0.2
1           A1            100.0   67.024785   0.244141             0.2
2           A1            100.0  124.893178  -1.098776             0.2
3           A1            100.0   91.075571   0.000000             0.2
4           A1            100.0   48.956174   0.122094             0.2
...        ...              ...         ...        ...             ...
2737958    REF            600.0    0.366256  16.237333             0.0
2737959    REF            600.0    0.000000  -7.080942             0.0
2737960    REF            600.0   -0.244170  15.260652             0.0
2737961    REF            600.0    0.366256  20.021975             0.0
2737962    REF            600.0    0.122085  13.185203             0.0

[2737963 rows x 5 columns]
(2737963, 5)
       Frequency (GHz)  LG (mV) mean  HG (mV) mean  LG (mV) std deviation  \
0                100.0     54.

## Config

In [75]:
config = dict(
    epochs=100,
    seed=40,
    classes=data['Sample'].nunique(),  # Each different sample is a different class
    k_folds=4,  # Number of folds for cross-validation
    batch_size=128,
    learning_rate=0.001,
    dataset="experiment_1",
    architecture="LSTM",
    hidden_dim=64,
)

print(config)

{'epochs': 100, 'seed': 40, 'classes': 17, 'k_folds': 4, 'batch_size': 128, 'learning_rate': 0.001, 'dataset': 'experiment_1', 'architecture': 'LSTM', 'hidden_dim': 64}
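
A useful reference point when reading the loss values later in the notebook: a classifier that assigns equal probability to each of C classes has a cross-entropy of ln C, and `nn.CrossEntropyLoss` uses the natural logarithm. The sketch below computes that baseline for the 17 classes reported in the config output; the helper name is hypothetical.

```python
import math

def chance_level_cross_entropy(num_classes):
    """Cross-entropy (in nats) of a uniform prediction over num_classes classes."""
    return -math.log(1.0 / num_classes)

baseline = chance_level_cross_entropy(17)
print(f"{baseline:.4f}")
```

A loss close to this value suggests the model is doing little better than a uniform guess, while values well below it indicate that something has been learned.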


## Define Model
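Define the LSTM model architecture: a single LSTM layer, dropout applied to the output of the last time step, and a linear classification layer.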

In [76]:
class LSTMModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(LSTMModel, self).__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(p=0.2)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        out, _ = self.lstm(x)
        out = self.dropout(out[:, -1, :])
        out = self.fc(out)
        return out
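
The tensor shapes are worth checking once. The sketch below redefines the same layer layout under a different name (`TinyLSTMClassifier`, so the snippet runs on its own) and pushes random placeholder data through it. It assumes 6 input features and 17 classes, matching the model summary printed later in the notebook, and a sequence length of 1, which is what `unsqueeze(1)` produces in the training code.

```python
import torch
import torch.nn as nn

class TinyLSTMClassifier(nn.Module):
    """Same layer layout as LSTMModel, redefined so the snippet is self-contained."""
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.dropout = nn.Dropout(p=0.2)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        out, _ = self.lstm(x)              # (batch, seq_len, hidden_dim)
        out = self.dropout(out[:, -1, :])  # last time step: (batch, hidden_dim)
        return self.fc(out)                # (batch, output_dim)

features = torch.randn(8, 6)       # 8 windows, 6 features each (random placeholder data)
sequences = features.unsqueeze(1)  # (8, 1, 6): each window becomes a length-1 sequence

demo_model = TinyLSTMClassifier(input_dim=6, hidden_dim=64, output_dim=17)
demo_model.eval()
with torch.no_grad():
    logits = demo_model(sequences)

print(tuple(sequences.shape), tuple(logits.shape))
```

With a sequence length of 1 the LSTM sees each window in isolation, so no information is carried from one window to the next.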


## Train Model
Define a function that trains the model and logs the average training and validation loss for each epoch to W&B.

In [77]:
def train_model(model, train_loader, val_loader, criterion, optimizer, device, config):
    num_epochs = config.epochs
    for epoch in range(num_epochs):
        # Training pass
        model.train()
        running_loss = 0.0
        for X_batch, y_batch in train_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)

            outputs = model(X_batch)
            loss = criterion(outputs, y_batch)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            running_loss += loss.item()

        # Validation pass (no gradient tracking, dropout disabled)
        val_loss = 0.0
        model.eval()
        with torch.no_grad():
            for X_batch, y_batch in val_loader:
                X_batch, y_batch = X_batch.to(device), y_batch.to(device)
                outputs = model(X_batch)
                loss = criterion(outputs, y_batch)
                val_loss += loss.item()

        # Average the per-batch losses over the epoch
        train_loss = running_loss / len(train_loader)
        val_loss = val_loss / len(val_loader)

        # Log metrics to W&B
        wandb.log({"epoch": epoch, "train_loss": train_loss, "val_loss": val_loss})
        print(f"Epoch [{epoch+1}/{num_epochs}], Train Loss: {train_loss:.4f}, Val Loss: {val_loss:.4f}")





## Evaluate Model


In [78]:
def evaluate_model(model, test_loader, device):
    model.eval()
    correct = 0
    total = 0
    with torch.no_grad():
        for X_batch, y_batch in test_loader:
            X_batch, y_batch = X_batch.to(device), y_batch.to(device)
            outputs = model(X_batch)
            predicted = outputs.argmax(dim=1)  # index of the highest logit per sample
            total += y_batch.size(0)
            correct += (predicted == y_batch).sum().item()

    print(f'Accuracy of the model on the test set: {100 * correct / total:.2f}%')

In [79]:
def make(config, X, y):
    # K-Fold Cross-Validation
    kfold = KFold(n_splits=config.k_folds, shuffle=True, random_state=config.seed)

    for fold, (train_idx, val_idx) in enumerate(kfold.split(X)):
        print(f'Fold {fold+1}/{config.k_folds}')

        # Create DataLoader for training and validation sets
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]

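        # Ensure float/long tensors and add a sequence-length dimension of 1.
        # torch.as_tensor avoids the copy-construct UserWarning that
        # torch.tensor() raises when its input is already a tensor.
        X_train = torch.as_tensor(X_train, dtype=torch.float32).unsqueeze(1)
        X_val = torch.as_tensor(X_val, dtype=torch.float32).unsqueeze(1)
        y_train = torch.as_tensor(y_train, dtype=torch.long)
        y_val = torch.as_tensor(y_val, dtype=torch.long)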

        train_dataset = TensorDataset(X_train, y_train)
        val_dataset = TensorDataset(X_val, y_val)
        train_loader = DataLoader(train_dataset, batch_size=config.batch_size, shuffle=True)
        val_loader = DataLoader(val_dataset, batch_size=config.batch_size, shuffle=False)

        # Initialize the model, loss function, and optimizer
        input_dim = X_train.shape[2]
        hidden_dim = config.hidden_dim
        output_dim = config.classes
        model = LSTMModel(input_dim, hidden_dim, output_dim).to(device)
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.Adam(model.parameters(), lr=config.learning_rate)

        yield model, train_loader, val_loader, criterion, optimizer
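
To see what `KFold` hands to each iteration, here is a minimal sketch on 10 stand-in row indices with the same `n_splits`, `shuffle`, and `random_state` settings as the config. The 10-row size is an arbitrary choice for illustration. Every row lands in exactly one validation fold, and the fold sizes differ by at most one.

```python
import numpy as np
from sklearn.model_selection import KFold

indices = np.arange(10)  # stand-in for 10 training rows
kfold = KFold(n_splits=4, shuffle=True, random_state=40)

val_sizes = []
seen = []
for fold, (train_idx, val_idx) in enumerate(kfold.split(indices)):
    # Training and validation indices never overlap within a fold
    assert set(train_idx).isdisjoint(val_idx)
    val_sizes.append(len(val_idx))
    seen.extend(val_idx.tolist())
    print(f"Fold {fold + 1}: train={len(train_idx)} val={len(val_idx)}")
```

`KFold` ignores the labels when splitting. If class balance within each fold matters, scikit-learn's `StratifiedKFold` is an alternative, though the notebook does not use it.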

In [80]:
def model_pipeline(hyperparameters):
    with wandb.init(project="PIC-PAPER-01-exp-1", config=hyperparameters):
        config = wandb.config

        # Set seed for reproducibility
        set_seed(config.seed)

        # Split the data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=config.seed)

        # K-Fold Cross-Validation
        for model, train_loader, val_loader, criterion, optimizer in make(config, X_train, y_train):
            print(model)

            # Train the model
            train_model(model, train_loader, val_loader, criterion, optimizer, device, config)

            # Evaluate the model on the validation set
            evaluate_model(model, val_loader, device)

        # Evaluate the model from the last fold on the held-out test set
        X_test = torch.as_tensor(X_test, dtype=torch.float32).unsqueeze(1)  # Add sequence length dimension
        y_test = torch.as_tensor(y_test, dtype=torch.long)
        test_dataset = TensorDataset(X_test, y_test)
        test_loader = DataLoader(test_dataset, batch_size=config.batch_size, shuffle=False)
        evaluate_model(model, test_loader, device)

    return model


## Run Training

In [16]:
model = model_pipeline(config)

[34m[1mwandb[0m: Currently logged in as: [33mdanimp94[0m ([33mdanimp94-university-carlos-iii-of-madrid[0m). Use [1m`wandb login --relogin`[0m to force relogin


Fold 1/5


LSTMModel(
  (lstm): LSTM(6, 64, batch_first=True)
  (dropout): Dropout(p=0.2, inplace=False)
  (fc): Linear(in_features=64, out_features=17, bias=True)
)
Epoch [1/100], Train Loss: 2.7165, Val Loss: 2.4939
Epoch [2/100], Train Loss: 2.3578, Val Loss: 2.2630
Epoch [3/100], Train Loss: 2.2268, Val Loss: 2.1543
Epoch [4/100], Train Loss: 2.0991, Val Loss: 1.9984
Epoch [5/100], Train Loss: 1.9357, Val Loss: 1.8339
Epoch [6/100], Train Loss: 1.7893, Val Loss: 1.7010
Epoch [7/100], Train Loss: 1.6713, Val Loss: 1.5944
Epoch [8/100], Train Loss: 1.5775, Val Loss: 1.5061
Epoch [9/100], Train Loss: 1.4983, Val Loss: 1.4322
Epoch [10/100], Train Loss: 1.4295, Val Loss: 1.3656
Epoch [11/100], Train Loss: 1.3674, Val Loss: 1.3079
Epoch [12/100], Train Loss: 1.3132, Val Loss: 1.2539
Epoch [13/100], Train Loss: 1.2678, Val Loss: 1.2077
Epoch [14/100], Train Loss: 1.2218, Val Loss: 1.1647
Epoch [15/100], Train Loss: 1.1808, Val Loss: 1.1277
Epoch [16/100], Train Loss: 1.1422, Val Loss: 1.0913
Epoch 

Epoch [1/100], Train Loss: 2.7291, Val Loss: 2.5225
Epoch [2/100], Train Loss: 2.3592, Val Loss: 2.2838
Epoch [3/100], Train Loss: 2.2194, Val Loss: 2.1711
Epoch [4/100], Train Loss: 2.0869, Val Loss: 2.0123
Epoch [5/100], Train Loss: 1.9227, Val Loss: 1.8502
Epoch [6/100], Train Loss: 1.7750, Val Loss: 1.7181
Epoch [7/100], Train Loss: 1.6584, Val Loss: 1.6099
Epoch [8/100], Train Loss: 1.5639, Val Loss: 1.5204
Epoch [9/100], Train Loss: 1.4837, Val Loss: 1.4473
Epoch [10/100], Train Loss: 1.4175, Val Loss: 1.3849
Epoch [11/100], Train Loss: 1.3581, Val Loss: 1.3276
Epoch [12/100], Train Loss: 1.3078, Val Loss: 1.2781
Epoch [13/100], Train Loss: 1.2597, Val Loss: 1.2305
Epoch [14/100], Train Loss: 1.2189, Val Loss: 1.1894
Epoch [15/100], Train Loss: 1.1787, Val Loss: 1.1519
Epoch [16/100], Train Loss: 1.1437, Val Loss: 1.1188
Epoch [17/100], Train Loss: 1.1076, Val Loss: 1.0877
Epoch [18/100], Train Loss: 1.0785, Val Loss: 1.0544
Epoch [19/100], Train Loss: 1.0475, Val Loss: 1.0270
Ep

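Run history:

| Metric | History |
|--------|---------|
| epoch | ▁▃▄▅▇▁▁▂▆▇▇▁▂▂▃▃▃▄▄▅▆▆▇▇█▂▃▃▄▅████▁▃▄▅▆▇ |
| train_loss | ▆▄▃▃▃▂▂▁▁▁▇▄▄▃▃▁▁▄▄▃▂▁▁▁▄▂▁▁▁▁█▆▄▄▂▂▁▁▁▁ |
| val_loss | ▅▂▂▁▁▁▁▁▁▁█▆▄▂▂▁▁▇▃▃▂▂▂▂▂▁▃▂▂▂▁▁▁▇▃▂▁▁▁▁ |

Run summary:

| Metric | Final value |
|--------|-------------|
| epoch | 99.0 |
| train_loss | 0.44973 |
| val_loss | 0.43477 |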


## Save the model

In [17]:
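# Save the trained weights to the current working directory.
# The inference cell loads 'lstm_model.pth' from Google Drive, so copy the
# file there or save directly to that path.
torch.save(model.state_dict(), 'lstm_model.pth')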

# # Save the model as onnx
# torch.onnx.export(model, X_train, 'lstm_model.onnx')
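
Saving a `state_dict` stores only the weights, so loading requires an instance of the same architecture. The round trip can be sketched as follows, using a small `nn.Linear` as a stand-in for the trained model and a temporary directory instead of the real paths; both are assumptions made for illustration.

```python
import os
import tempfile

import torch
import torch.nn as nn

source = nn.Linear(6, 17)    # stand-in for the trained model
restored = nn.Linear(6, 17)  # freshly initialised model with the same architecture

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_model.pth')  # hypothetical file name
    torch.save(source.state_dict(), path)
    # map_location lets weights saved on a GPU load on a CPU-only machine
    state = torch.load(path, map_location='cpu')
    restored.load_state_dict(state)

same = all(
    torch.equal(p, q)
    for p, q in zip(source.state_dict().values(), restored.state_dict().values())
)
print(same)
```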

In [171]:
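def preprocess_test_data(data, data_percentage):
    # Window the data into per-window means and standard deviations
    data = calculate_averages_and_dispersion(data, data_percentage)
    print(data.shape)

    # The last column ('Sample') is the target; the rest are features
    X = data.iloc[:, :-1].values
    y = data.iloc[:, -1].values

    # Encode the target labels with the LabelEncoder saved during training
    label_encoder_path = '/content/drive/MyDrive/PhD/Colab Notebooks/label_encoder.pkl'
    le = joblib.load(label_encoder_path)
    y = le.transform(y)
    print('y: ', y)

    # Standardize the features.
    # NOTE: this fits a new scaler on the test data, so the features are scaled
    # with different statistics than during training. A column that is constant
    # in the test set (here 'Thickness (mm)') collapses to ~0. Reusing the scaler
    # fitted in preprocess_data would keep the two consistent.
    scaler = StandardScaler()
    X = scaler.fit_transform(X)

    # Convert to PyTorch tensors
    X = torch.tensor(X, dtype=torch.float32)
    y = torch.tensor(y, dtype=torch.long)

    return X, y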
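
One way to keep training and inference scaling consistent is to persist the fitted scaler with `joblib`, as the notebook already does for the label encoder. The sketch below is an assumption about how that could look, not the notebook's current behaviour: the file name `scaler.pkl` and the toy arrays are hypothetical. It compares reusing the training scaler with refitting on a new file whose second column is constant.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: column 0 varies, column 1 mimics 'Thickness (mm)'
X_train_demo = np.array([[1.0, 0.0], [2.0, 0.2], [3.0, 0.4]])
X_new_demo = np.array([[1.5, 0.07], [2.5, 0.07]])  # thickness is constant in the new file

with tempfile.TemporaryDirectory() as tmp_dir:
    scaler_path = os.path.join(tmp_dir, 'scaler.pkl')  # hypothetical file name
    joblib.dump(StandardScaler().fit(X_train_demo), scaler_path)
    train_scaler = joblib.load(scaler_path)

reused = train_scaler.transform(X_new_demo)         # scaled with training statistics
refit = StandardScaler().fit_transform(X_new_demo)  # scaled with the new file's own statistics

print(reused[:, 1])
print(refit[:, 1])
```

With the reused scaler the thickness value keeps its position relative to the training distribution; with a refit scaler the constant column carries no information at all.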

## Load New Testing Data

In [172]:
# Load new data
input_data_test = '/content/drive/MyDrive/PhD/Colab Notebooks/test_data/'
print(os.listdir(input_data_test))

data_test = load_data_from_directory(input_data_test)

# Load and preprocess data
X_test, y_test = preprocess_test_data(data_test, data_percentage=8.33) # 1s window size


['H1_1 - Copy.csv']
      Sample  Frequency (GHz)    LG (mV)    HG (mV)  Thickness (mm)
0         H1              100  69.100232   0.244141            0.07
1         H1              100  53.229153   0.366211            0.07
2         H1              100  62.019289   1.587129            0.07
3         H1              100  67.268954  -0.244141            0.07
4         H1              100  75.326578   1.220798            0.07
...      ...              ...        ...        ...             ...
63947     H1              600   0.244170  24.417043            0.07
63948     H1              600  -0.732511  12.086436            0.07
63949     H1              600   0.122085  29.300451            0.07
63950     H1              600  -0.244170   1.220852            0.07
63951     H1              600  -0.610426  33.573434            0.07

[63952 rows x 5 columns]
(63952, 5)
     Frequency (GHz)  LG (mV) mean  HG (mV) mean  LG (mV) std deviation  \
0                100     66.101648     -0.055687    

## Run inference

In [173]:
# Set device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print('Using device:', device)

# Initialize the model with the dimensions used during training
input_dim = X_test.shape[1]
hidden_dim = config['hidden_dim']
output_dim = config['classes']
model = LSTMModel(input_dim, hidden_dim, output_dim).to(device)

# Load the label encoder fitted during training
label_encoder_path = '/content/drive/MyDrive/PhD/Colab Notebooks/label_encoder.pkl'
le = joblib.load(label_encoder_path)

# Load the pretrained weights. map_location keeps this working on CPU-only
# machines; weights_only=True restricts unpickling to tensor data and needs
# a PyTorch version that supports the argument.
model_path = '/content/drive/MyDrive/PhD/Colab Notebooks/lstm_model.pth'
model.load_state_dict(torch.load(model_path, map_location=device, weights_only=True))
model.eval()

with torch.no_grad():
    X_test = X_test.unsqueeze(1).to(device)  # Add sequence length dimension
    print(X_test)
    outputs = model(X_test)
    predicted = outputs.argmax(dim=1)

# Decode the predicted class indices back to the original labels
predicted_labels = le.inverse_transform(predicted.cpu().numpy())

print(predicted_labels)

Using device: cpu
tensor([[[-1.6984e+00,  2.3475e+00, -6.4605e-01,  3.2998e+00, -8.6543e-01,
          -1.3878e-17]],

        [[-1.6984e+00,  2.3206e+00, -6.4533e-01,  3.2246e+00, -8.5962e-01,
          -1.3878e-17]],

        [[-1.6984e+00,  2.3179e+00, -6.4635e-01,  3.3801e+00, -8.5971e-01,
          -1.3878e-17]],

        ...,

        [[ 1.6984e+00, -4.5441e-01, -5.1670e-01, -3.4733e-01, -1.1629e-02,
          -1.3878e-17]],

        [[ 1.6984e+00, -4.5986e-01, -5.1558e-01, -3.5466e-01, -4.6631e-02,
          -1.3878e-17]],

        [[ 1.6984e+00, -4.7086e-01, -5.4707e-01, -4.3536e-01,  4.9528e-01,
          -1.3878e-17]]])
['B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1'
 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1'
 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1'
 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1'
 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'B1' 'A1' 'B1' 'B1' 'B1' 'B1' 'B1'
 'B1' '


Label mapping printed by `preprocess_data`:

| Original Label | Encoded Value |
|----------------|---------------|
| A1             | 0             |
| B1             | 1             |
| C1             | 2             |
| D1             | 3             |
| E1             | 4             |
| E2             | 5             |
| E3             | 6             |
| F1             | 7             |
| G1             | 8             |
| H1             | 9             |
| I1             | 10            |
| J1             | 11            |
| K1             | 12            |
| L1             | 13            |
| M1             | 14            |
| N1             | 15            |
| REF            | 16            |
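
The inference cell prints one predicted label per window, which is hard to read at a glance. A possible way to summarise such output is to count the labels and take the most frequent one. The sketch below uses a made-up array of predictions rather than the real `predicted_labels`, and the majority vote is an illustrative choice, not something the notebook does.

```python
import numpy as np

# Hypothetical per-window predictions for one measurement file
predicted_demo = np.array(['B1', 'B1', 'A1', 'B1', 'H1', 'B1'])

labels_found, counts = np.unique(predicted_demo, return_counts=True)
summary = {str(label): int(n) for label, n in zip(labels_found, counts)}
majority_label = str(labels_found[counts.argmax()])

print(summary)
print(majority_label)
```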