# LSTM Model for Cache Hit Prediction

This notebook implements a Long Short-Term Memory (LSTM) neural network to predict cache hits based on memory access patterns. The model analyzes sequences of memory addresses to learn patterns that indicate whether a future memory access will result in a cache hit or miss.

In [1]:
# PyTorch libraries for deep learning
import torch
import torch.nn as nn

# NumPy for numerical operations
import numpy as np

# Matplotlib for visualization
import matplotlib.pyplot as plt

# PyTorch data utilities
from torch.utils.data import DataLoader, TensorDataset

# Dictionary with default values, useful for counting and aggregation
from collections import defaultdict

## Imports

Importing the necessary libraries:
- `torch` and `torch.nn`: PyTorch deep learning framework
- `numpy`: For numerical operations
- `matplotlib.pyplot`: For plotting and visualization
- `DataLoader` and `TensorDataset`: For batching and dataset handling
- `defaultdict`: Dictionary that supplies a default value for missing keys (useful for counting and aggregation)

In [2]:
# Open the trace file containing memory access data
with open("test1.out") as f:
    lines = f.readlines()

# Split each line into a hit count and a memory address
# The format is: 'hit_count memory_address'
hits, addrs = zip(*(line.rstrip().split(' ') for line in lines))

# Convert hit counts to integers
hits = [int(hit) for hit in hits]

# Convert memory addresses from hex to integers
addrs = [int(addr, 16) for addr in addrs]

## Data Loading

Loading the cache access data from a file. Each line contains:
- A hit count (0 for a miss; any positive value is treated as a hit)
- A memory address in hexadecimal format
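
As a minimal sketch of parsing this format, the snippet below reads two made-up trace lines from memory instead of `test1.out`. The helper name `parse_trace` and the sample lines are assumptions for illustration, not part of the notebook.

```python
import io

def parse_trace(lines):
    """Parse 'hit_count hex_address' lines into two parallel lists."""
    hits, addrs = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        hit, addr = line.split()
        hits.append(int(hit))
        addrs.append(int(addr, 16))  # interpret the address as base 16
    return hits, addrs

# Two made-up lines standing in for the real trace file
sample = io.StringIO("0 1fffff58\n2 300120a8\n")
hits, addrs = parse_trace(sample)
print(hits, addrs)
```

Skipping blank lines makes the parser tolerant of a trailing newline at the end of the file, which the one-line `zip` version would reject.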

In [3]:
hits[:10], addrs[:10]

([0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 [536870744,
  536870752,
  805385384,
  805385392,
  805385376,
  805385400,
  536870728,
  536870720,
  536870712,
  536870704])

In [4]:
# Convert hit counts to binary target values
# Any positive hit count becomes 1, 0 remains 0
y = np.asarray(hits)
y = (y > 0).astype(np.int_)  # 1 for hit, 0 for miss

# Convert memory addresses to feature vectors
# TODO: ideally we'd want PCs or to separate address space
X = np.asarray(addrs)

# Standardize features to have zero mean and unit variance
# This is important for neural network training
X = (X - X.mean()) / (X.std())

## Data Preparation

Preprocessing the data for model training:
1. Converting hit counts to binary values (0 = miss, 1 = hit)
2. Normalizing memory addresses to have zero mean and unit standard deviation

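Note: Standardization matters here because the raw addresses are in the hundreds of millions. Inputs that large would saturate the LSTM's sigmoid and tanh gates and stall training; rescaling to zero mean and unit variance keeps the inputs in a range where gradients remain informative.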
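
The two preprocessing steps can be sketched on a handful of values. The hit counts below are made up; the addresses are taken from the first rows printed earlier.

```python
import numpy as np

# Made-up hit counts paired with four addresses from the trace preview
hits = np.array([0, 3, 0, 1])
addrs = np.array([536870744, 536870752, 805385384, 805385392], dtype=np.float64)

y = (hits > 0).astype(np.int_)             # any positive count becomes 1
X = (addrs - addrs.mean()) / addrs.std()   # z-score: zero mean, unit std

print(y)
print(X.mean(), X.std())
```

Note that the statistics come from the whole array. If a held-out set is introduced later, it would normally be scaled with the mean and standard deviation of the training portion only.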

In [5]:
X[:10]

array([-1.05364059, -1.05364053,  0.94539084,  0.9453909 ,  0.94539079,
        0.94539096, -1.05364071, -1.05364077, -1.05364083, -1.05364089])

In [6]:
y[:10]

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [7]:
y[16]

np.int64(0)

## Sequence Generation

The LSTM model requires sequential data. This function builds sequences of memory addresses (X) with their corresponding future cache hit/miss outcome (y).

For each sequence:
- Input: n consecutive memory addresses
- Target: Whether the (n+1)-th memory access, the one immediately after the window, results in a hit or miss
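
To make the windowing concrete, here is a small sketch on toy data. It uses NumPy's `sliding_window_view` (available from NumPy 1.20) as a vectorized way to produce the same windows; this is an illustration, not the notebook's own implementation.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

X = np.arange(6, dtype=np.float64)   # toy "addresses": 0..5
y = np.array([0, 1, 0, 1, 1, 0])     # toy hit/miss labels
n = 3

# All length-n windows; drop the last one because no access follows it
xs = sliding_window_view(X, n)[:-1]
# The target for window i is the label at position i + n
ys = y[n:]

print(xs)
print(ys)
```

Six samples with `n = 3` yield three (window, target) pairs, which matches the `len(X) - n` iterations of a plain Python loop over the same data.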

In [8]:
def build_seqs(X, y, n):
    '''
    Builds sequences from the dataset for LSTM training.
    
    Parameters:
        X (array): Normalized memory addresses
        y (array): Hit/miss indicators (1/0)
        n (int): Length of input sequence (window size)
        
    Returns:
        tuple: (xs, ys) where:
            - xs is an array of input sequences of length n
            - ys is an array of corresponding target values
    '''
    assert len(X) == len(y)
    xs = []  # Will hold input sequences
    ys = []  # Will hold target values
    
    # Create sliding windows of size n
    for i in range(len(X)-n):
        # Extract a sequence of n consecutive addresses
        x_sample = X[i:(i+n)]
        # Target is whether the next address (after the sequence) results in a hit
        y_sample = y[i+n]
        xs.append(x_sample)
        ys.append(y_sample)
        
    return np.array(xs), np.array(ys)

In [9]:
# Build sequences with a window size of 15 addresses
# Each sequence contains 15 normalized memory addresses
# The target is whether the 16th access is a hit (1) or miss (0)
Xs, ys = build_seqs(X, y, 15)

In [10]:
Xs[0]

array([-1.05364059, -1.05364053,  0.94539084,  0.9453909 ,  0.94539079,
        0.94539096, -1.05364071, -1.05364077, -1.05364083, -1.05364089,
        0.9453859 , -1.05364089,  0.94514165,  0.94514207,  0.94514249])

In [11]:
ys[0]

np.int64(0)

In [12]:
# Check if CUDA-compatible GPU is available, otherwise use CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

## Device Configuration

Selecting the computation device: the GPU (CUDA) if one is available, otherwise the CPU.

In [13]:
# Convert input sequences (X) to PyTorch tensors
# Shape becomes [num_samples, sequence_length, 1] with the added dimension for LSTM input
# The None adds a dimension for the feature channel (required by LSTM)
trainX = torch.tensor(Xs[:, :, None], dtype=torch.float32).to(device)

# Convert target values (y) to PyTorch tensors
# Shape becomes [num_samples, 1]
trainY = torch.tensor(ys[:, None], dtype=torch.float32).to(device)

## Tensor Conversion

Converting NumPy arrays to PyTorch tensors, which are required for model training:
- Adding a dimension for the feature channel (required by LSTM)
- Moving tensors to the selected device (GPU/CPU)
- Setting the appropriate data type (float32)
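
The effect of the added feature dimension is easiest to see on shapes. A small sketch with zero-filled stand-in arrays (kept on the CPU so it runs anywhere):

```python
import numpy as np
import torch

Xs = np.zeros((4, 15))          # toy stand-in: 4 sequences of 15 addresses
ys = np.array([0, 1, 0, 1])     # toy targets

# Indexing with None appends an axis of length 1
x_t = torch.tensor(Xs[:, :, None], dtype=torch.float32)  # [4, 15, 1]
y_t = torch.tensor(ys[:, None], dtype=torch.float32)     # [4, 1]

print(x_t.shape, y_t.shape)
```

At the notebook's scale, a `[4999985, 15, 1]` tensor in float32 holds about 75 million values, roughly 300 MB, which is why moving the whole dataset to the GPU at once is feasible here.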

In [14]:
trainX

tensor([[[-1.0536],
         [-1.0536],
         [ 0.9454],
         ...,
         [ 0.9451],
         [ 0.9451],
         [ 0.9451]],

        [[-1.0536],
         [ 0.9454],
         [ 0.9454],
         ...,
         [ 0.9451],
         [ 0.9451],
         [-1.0536]],

        [[ 0.9454],
         [ 0.9454],
         [ 0.9454],
         ...,
         [ 0.9451],
         [-1.0536],
         [-1.0536]],

        ...,

        [[-1.0537],
         [ 0.9454],
         [-1.0537],
         ...,
         [-1.0537],
         [-1.0537],
         [-1.0537]],

        [[ 0.9454],
         [-1.0537],
         [ 0.9450],
         ...,
         [-1.0537],
         [-1.0537],
         [-1.0537]],

        [[-1.0537],
         [ 0.9450],
         [ 0.9450],
         ...,
         [-1.0537],
         [-1.0537],
         [ 0.9454]]], device='cuda:0')

In [15]:
trainY

tensor([[0.],
        [0.],
        [0.],
        ...,
        [1.],
        [0.],
        [1.]], device='cuda:0')

In [16]:
trainX.shape

torch.Size([4999985, 15, 1])

## LSTM Model Definition

Defining the Long Short-Term Memory (LSTM) neural network architecture:

- **Input**: Sequences of normalized memory addresses
- **LSTM Layer**: Processes sequences and captures temporal patterns
- **Fully Connected Layer**: Maps LSTM output to binary prediction

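The forward pass accepts optional initial hidden (h) and cell (c) states and returns the final ones, so the training loop can carry state from one batch to the next. Note that with a shuffled DataLoader, consecutive batches are not contiguous in the trace, so the carried state does not describe the accesses that actually preceded a given sequence.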
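
A short sketch of the tensor shapes involved, using a bare `nn.LSTM` with a small hidden size chosen only for illustration:

```python
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=1, hidden_size=8, num_layers=1, batch_first=True)
x = torch.randn(4, 15, 1)            # [batch, seq_len, feature]

out, (hn, cn) = lstm(x)              # initial states default to zeros
last = out[:, -1, :]                 # output at the final timestep

print(out.shape, hn.shape, cn.shape, last.shape)
```

For a single-layer, unidirectional LSTM, the output at the last timestep is the same tensor as the final hidden state `hn[-1]`, so either can feed the fully connected layer.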

In [21]:
class LSTMModel(nn.Module):
    def __init__(self, input_dim, hidden_dim, layer_dim, output_dim):
        '''
        Initialize the LSTM model.
        
        Parameters:
            input_dim (int): Size of input feature dimension (1 for scalar addresses)
            hidden_dim (int): Size of the hidden state
            layer_dim (int): Number of LSTM layers
            output_dim (int): Size of output (1 for binary prediction)
        '''
        super(LSTMModel, self).__init__()
        self.hidden_dim = hidden_dim
        self.layer_dim = layer_dim
        
        # LSTM layer with batch_first=True means input shape is [batch, seq, feature]
        self.lstm = nn.LSTM(input_dim, hidden_dim, layer_dim, batch_first=True)
        
        # Fully connected layer to produce output prediction
        self.fc = nn.Linear(hidden_dim, output_dim)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x, h0=None, c0=None):
        '''
        Forward pass through the network.
        
        Parameters:
            x (tensor): Input tensor of shape [batch_size, seq_len, input_dim]
            h0 (tensor, optional): Initial hidden state
            c0 (tensor, optional): Initial cell state
            
        Returns:
            out (tensor): Output predictions
            hn (tensor): Final hidden state
            cn (tensor): Final cell state
        '''
        # Initialize hidden states if not provided
        if h0 is None or c0 is None:
            # Create zero tensors for hidden and cell states
            # Shape: [num_layers, batch_size, hidden_dim]
            h0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim, device=x.device)
            c0 = torch.zeros(self.layer_dim, x.size(0), self.hidden_dim, device=x.device)
        
        # Forward propagate the LSTM
        # out shape: [batch_size, seq_len, hidden_dim]
        # hn and cn shape: [num_layers, batch_size, hidden_dim]
        out, (hn, cn) = self.lstm(x, (h0, c0))
        
        # Decode the hidden state of the last time step
        # Only take the output from the final timestep
        out = self.fc(out[:, -1, :])
        
        return self.sigmoid(out), hn, cn

In [18]:
# Initialize the LSTM model
# - input_dim=1: Each timestep has a single feature (normalized address)
# - hidden_dim=100: Size of the hidden state vector
# - layer_dim=1: Single LSTM layer
# - output_dim=1: Binary output (hit probability)
model = LSTMModel(input_dim=1, hidden_dim=100, layer_dim=1, output_dim=1).to(device)

# Define loss function - Mean Squared Error
criterion = nn.MSELoss()

# Define optimizer - Adam with learning rate of 0.01
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

## Model Initialization

Initializing the LSTM model with the following configuration:
- Input dimension: 1 (single feature per timestep - normalized address)
- Hidden dimension: 100 (size of the LSTM hidden and cell state vectors)
- Layer dimension: 1 (single LSTM layer)
- Output dimension: 1 (binary prediction)

Also configuring the loss function (Mean Squared Error) and optimizer (Adam).
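
The notebook trains with MSE. For a sigmoid output with 0/1 targets, binary cross-entropy (`nn.BCELoss`) is a common alternative. The sketch below compares the two on made-up predictions; it is offered as an option to try, not as the notebook's method.

```python
import math
import torch
import torch.nn as nn

pred = torch.tensor([[0.5], [0.9]])    # made-up sigmoid outputs
target = torch.tensor([[1.0], [1.0]])  # both samples are hits

mse = nn.MSELoss()(pred, target)       # mean of squared errors
bce = nn.BCELoss()(pred, target)       # mean of -log(p) for the true class

print(mse.item(), bce.item())
```

Cross-entropy penalizes confident wrong predictions much more steeply than MSE, which often gives stronger gradients early in training for classification tasks.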

In [19]:
# Set batch size for training
# Larger batch size can speed up training but requires more memory
batch_size = 16384

# Create PyTorch Dataset from our tensors
dataset = TensorDataset(trainX, trainY)

# Create DataLoader for batch processing
# - shuffle=True: Randomizes data order in each epoch
# - drop_last=True: Drops the last incomplete batch
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True, drop_last=True)

## DataLoader Configuration

Preparing the data for batch processing using PyTorch's DataLoader:
- Creates a TensorDataset from input sequences and target values
- Configures a batch size of 16384, reshuffling at every epoch, and dropping of the final incomplete batch
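
A toy sketch of how `drop_last=True` affects the number of batches, using zero-filled tensors and a small batch size chosen for illustration:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

data_x = torch.zeros(10, 15, 1)   # 10 toy sequences
data_y = torch.zeros(10, 1)
loader = DataLoader(TensorDataset(data_x, data_y),
                    batch_size=4, shuffle=True, drop_last=True)

print(len(loader))                # number of full batches: 10 // 4
shapes = [tuple(xb.shape) for xb, yb in loader]
print(shapes)
```

With the notebook's 4,999,985 sequences and a batch size of 16384, the same arithmetic gives 305 batches per epoch, with 2,865 sequences left out each epoch. Keeping every batch the same size also matters when hidden state is carried between batches, because the state tensor has the batch size as one of its dimensions.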

In [20]:
# Number of complete passes through the dataset
num_epochs = 10

# Initialize hidden and cell states as None (will be created in first forward pass)
h0, c0 = None, None

# Training loop
for epoch in range(num_epochs):
    # Set model to training mode
    model.train()

    # Process each batch in the dataset
    for i, batch in enumerate(dataloader):
        # Unpack inputs and targets from batch
        x_batch, y_batch = batch
        
        # Reset gradients so they do not accumulate across batches
        optimizer.zero_grad()
        
        # Forward pass: compute predictions and get new hidden states
        outputs, h0, c0 = model(x_batch, h0, c0)

        # Calculate batch accuracy (threshold at 0.5 for binary classification)
        accuracy = ((outputs > 0.5) == y_batch).sum() / x_batch.shape[0]
        
        # Calculate loss
        loss = criterion(outputs, y_batch)
        
        # Backward pass: compute gradients
        loss.backward()
        
        # Update model parameters based on gradients
        optimizer.step()
    
        # Detach hidden states from the computation graph so the next batch's
        # backward pass does not try to reach back into this batch's graph
        h0 = h0.detach()
        c0 = c0.detach()
        
        # Print progress every 10 batches
        if i % 10 == 0:
            print(f"Epoch {epoch}, Batch {i}, Batch Loss: {loss.item():.4f}, Batch Accuracy: {accuracy.item():.4f}")
    
    # Alternative epoch-level reporting (commented out)
    #print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')

TypeError: Sigmoid.__init__() takes 1 positional argument but 2 were given

## Model Training

Training the LSTM model with the following process:
1. Iterate through epochs
2. For each batch in the dataloader:
   - Forward pass through the model
   - Calculate accuracy and loss
   - Backward pass to compute gradients
   - Update model parameters

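Note: The hidden and cell states are carried from one batch to the next but detached from the computation graph, so each backward pass stops at the start of its own batch (truncated backpropagation through time). Without the detach, the second backward pass would try to traverse the previous batch's graph, which has already been freed.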
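
The per-batch steps can be sketched end to end on random toy data. `TinyLSTM` is a hypothetical, smaller stand-in for `LSTMModel`, and the inputs and labels are synthetic rather than cache data. Gradients are cleared at the start of every batch.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class TinyLSTM(nn.Module):  # hypothetical stand-in for LSTMModel
    def __init__(self, hidden_dim=8):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, x, state=None):
        out, state = self.lstm(x, state)
        return torch.sigmoid(self.fc(out[:, -1, :])), state

model = TinyLSTM()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

x = torch.randn(32, 15, 1)                # random toy inputs
y = (x[:, -1, :] > 0).float()             # synthetic labels, not cache data

state = None
losses = []
for epoch in range(5):
    for start in range(0, 32, 8):         # four fixed-size batches
        xb, yb = x[start:start + 8], y[start:start + 8]
        optimizer.zero_grad()             # clear gradients from the previous batch
        pred, state = model(xb, state)    # forward pass with carried state
        loss = criterion(pred, yb)
        loss.backward()                   # gradients for this batch only
        optimizer.step()
        # Cut the graph at the batch boundary before the next forward pass
        state = tuple(s.detach() for s in state)
        losses.append(loss.item())

print(losses[0], losses[-1])
```

Passing `state=None` on the first call lets `nn.LSTM` start from zeros, so no explicit zero tensors are needed in this sketch.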

In [None]:
# Example code for model evaluation (uncomment to use)

# def evaluate_model(model, X_test, y_test):
#     '''
#     Evaluate the model on test data
#     
#     Parameters:
#         model: Trained LSTM model
#         X_test: Test input sequences
#         y_test: Test labels
#     
#     Returns:
#         accuracy: Model accuracy on test data
#     '''
#     model.eval()  # Set model to evaluation mode
#     with torch.no_grad():  # Disable gradient computation
#         # Convert test data to tensors
#         X_test_tensor = torch.tensor(X_test[:, :, None], dtype=torch.float32).to(device)
#         y_test_tensor = torch.tensor(y_test[:, None], dtype=torch.float32).to(device)
#         
#         # Forward pass
#         outputs, _, _ = model(X_test_tensor)
#         
#         # Calculate accuracy
#         predictions = (outputs > 0.5).float()
#         accuracy = (predictions == y_test_tensor).sum() / len(y_test_tensor)
#     
#     return accuracy.item()

# # Plot training history
# def plot_training_history(history):
#     '''
#     Plot the training loss and accuracy over epochs
#     
#     Parameters:
#         history: Dictionary containing 'loss' and 'accuracy' lists
#     '''
#     plt.figure(figsize=(12, 5))
#     
#     # Plot loss
#     plt.subplot(1, 2, 1)
#     plt.plot(history['loss'])
#     plt.title('Training Loss')
#     plt.xlabel('Epoch')
#     plt.ylabel('Loss')
#     
#     # Plot accuracy
#     plt.subplot(1, 2, 2)
#     plt.plot(history['accuracy'])
#     plt.title('Training Accuracy')
#     plt.xlabel('Epoch')
#     plt.ylabel('Accuracy')
#     
#     plt.tight_layout()
#     plt.show()

## Model Evaluation and Further Steps

After training, the model can be used to predict cache hits for new memory address sequences. Possible next steps:

1. **Model Evaluation**: Test the model on a separate validation set to assess generalization performance
2. **Hyperparameter Tuning**: Experiment with different LSTM configurations (hidden size, number of layers)
3. **Feature Engineering**: Consider additional features like program counter values
4. **Visualization**: Plot the training loss and accuracy curves
5. **Inference**: Use the trained model to predict cache behavior for new memory traces
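
For the validation step, one option is a chronological split that holds out the tail of the trace. The helper `chronological_split` and the 20% fraction are assumptions for illustration.

```python
import numpy as np

def chronological_split(Xs, ys, val_fraction=0.2):
    """Hold out the last `val_fraction` of the sequences for validation."""
    split = len(Xs) - int(len(Xs) * val_fraction)
    return (Xs[:split], ys[:split]), (Xs[split:], ys[split:])

# Toy sequences: 10 windows of length 2
Xs = np.arange(20, dtype=np.float64).reshape(10, 2)
ys = np.arange(10) % 2

(train_X, train_y), (val_X, val_y) = chronological_split(Xs, ys)
print(len(train_X), len(val_X))
```

Because consecutive windows overlap in all but one address, a random split would place near-duplicate sequences on both sides. Splitting by position in the trace limits that overlap to the few windows at the boundary.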