# Assignment 5: Extended Long Short-Term Memory (xLSTM)

*Author:* Philipp Seidl

*Copyright statement:* This material, no matter whether in printed or electronic form, may be used for personal and non-commercial educational use only. Any reproduction of this manuscript, no matter whether as a whole or in parts, no matter whether in printed or in electronic form, requires explicit prior acceptance of the authors.

In this assignment, we will explore the xLSTM architecture, a novel extension of the classic LSTM model. The paper can be found here: https://arxiv.org/abs/2405.04517

## Background
Recurrent Neural Networks (RNNs), particularly LSTMs, have proven highly effective in various sequence modeling tasks. However, the emergence of Transformers, with their parallel processing capabilities, has shifted the focus away from LSTMs, especially in large-scale language modeling.
The xLSTM architecture aims to bridge this gap by enhancing LSTMs with mechanisms inspired by modern LLMs (e.g. block structure and residual connections). It further introduces:
- Exponential gating with normalization and stabilization techniques, which improves gradient flow and memory capacity.
- Modifications to the LSTM memory structure, resulting in two variants:
    - sLSTM: Employs a scalar memory with a scalar update rule and a new memory mixing technique through recurrent connections.
    - mLSTM: Features a matrix memory, employs a covariance update rule, and is fully parallelizable, making it suitable for scaling.

By integrating these extensions into residual block backbones, xLSTM blocks are formed, which can then be residually stacked to create complete xLSTM architectures.
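To make the exponential gating with normalization and stabilization more concrete, here is a minimal sketch of a single scalar sLSTM update, following the stabilized recurrence described in the paper. It assumes an exponential forget gate (the paper also allows a sigmoid one) and omits heads and memory mixing. The function names and the choice to start the stabilizer state at negative infinity are assumptions of this sketch, not the repository's implementation.

```python
import math


def slstm_step(c_prev, n_prev, m_prev, i_tilde, f_tilde, z, o):
    """One stabilized scalar sLSTM update with exponential input and forget gates.

    i_tilde and f_tilde are gate pre-activations, z is the cell input and
    o is the (already activated) output gate.
    """
    # Stabilizer state: running maximum of the log-gates.
    m = max(f_tilde + m_prev, i_tilde)
    i_gate = math.exp(i_tilde - m)           # stabilized input gate
    f_gate = math.exp(f_tilde + m_prev - m)  # stabilized forget gate
    c = f_gate * c_prev + i_gate * z         # cell state
    n = f_gate * n_prev + i_gate             # normalizer state
    h = o * c / n                            # normalized hidden state
    return c, n, m, h


def run_sequence(inputs):
    """Apply slstm_step to a list of (i_tilde, f_tilde, z, o) tuples.

    The memory starts empty; m = -inf (a choice of this sketch) makes the
    first step depend only on the input gate.
    """
    c, n, m = 0.0, 0.0, -math.inf
    hidden = []
    for i_tilde, f_tilde, z, o in inputs:
        c, n, m, h = slstm_step(c, n, m, i_tilde, f_tilde, z, o)
        hidden.append(h)
    return hidden


# Pre-activations this large would overflow a plain exp();
# the stabilized form stays finite.
print(run_sequence([(1000.0, 900.0, 0.5, 1.0), (950.0, 980.0, -0.25, 0.5)]))
```

Because the cell state and the normalizer are rescaled by the same factor, the hidden state is unchanged by the stabilization; only the intermediate values stay in a safe numeric range.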

## Exercise 1: Environment Setup

When working with new architectures or specialized frameworks, it's essential to set up the environment correctly to ensure reproducibility. This exercise focuses on setting up the environment for working with the `xlstm` repository.

1. Visit and clone the official repository: [https://github.com/NX-AI/xlstm](https://github.com/NX-AI/xlstm).  
2. Set up the environment  
3. Document your setup:  
   - OS, Python version, Environment setup, CUDA version (if applicable), and GPU details.  
   - Note any challenges you faced and how you resolved them. 
4. Submit your setup as a bash script using the IPython `%%bash` magic. Ensure it is reproducible.

It is fine to get only mLSTM working if you encounter issues with the sLSTM CUDA kernels.

> **Note**: Depending on your system setup, you may need to adjust the `environment_pt220cu121.yaml` file, for example to match your CUDA version. We recommend running this assignment on a GPU. If you don't have one, consider using [Colab](https://colab.research.google.com/notebooks/welcome.ipynb#recent=true) or other online resources.

> **Recommendations**: While the repository suggests using `conda`, we recommend `mamba` or `micromamba` instead, as they are considerably faster (unless you are using Colab). Learn more about them here: [https://mamba.readthedocs.io/en/latest/index.html](https://mamba.readthedocs.io/en/latest/index.html).

In [None]:
%%bash
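# Create the virtual environment on the first run only (assumes python3.11 is installed).
[ -d venv ] || python3.11 -m venv venv
source venv/bin/activate

# Upgrade pip before installing any packages.
pip install --upgrade pip
pip install xlstm

# Editable install; assumes the working directory is the root of the cloned xlstm repository.
pip install -e .
pip install mlstm_kernels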
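For step 3 (documenting the setup), the details can also be collected programmatically. The following is a minimal sketch; the helper name `collect_environment_info` is hypothetical, and the PyTorch fields are only filled in if `torch` can be imported.

```python
import platform
import sys


def collect_environment_info():
    """Return a dict describing the OS, Python and (if installed) PyTorch/CUDA setup."""
    info = {
        "os": platform.platform(),
        "python": sys.version.split()[0],
    }
    try:
        import torch  # optional: only reported if PyTorch is installed
    except ImportError:
        info["torch"] = "not installed"
        return info
    info["torch"] = torch.__version__
    # torch.version.cuda is None for CPU-only builds.
    info["cuda"] = torch.version.cuda or "not available"
    info["gpu"] = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none"
    return info


for key, value in collect_environment_info().items():
    print(f"{key:>7}: {value}")
```

Pasting this output next to the bash cell gives a record of the machine the setup was tested on.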


In [1]:
# Verify your installation of xLSTM:
from omegaconf import OmegaConf
from dacite import from_dict
from dacite import Config as DaciteConfig
from xlstm import xLSTMBlockStack, xLSTMBlockStackConfig
import os
import torch

DEVICE = "cuda" if torch.cuda.is_available() else 'cpu'

use_slstm_kernels = False # set to True if you want to check if sLSTM cuda kernels are working

xlstm_cfg = f"""
mlstm_block:
  mlstm:
    conv1d_kernel_size: 4
    qkv_proj_blocksize: 4
    num_heads: 4
slstm_block:
  slstm:
    backend: {'cuda' if use_slstm_kernels else 'vanilla'}
    num_heads: 4
    conv1d_kernel_size: 4
    bias_init: powerlaw_blockdependent
  feedforward:
    proj_factor: 1.3
    act_fn: gelu
context_length: 32
num_blocks: 7
embedding_dim: 64
slstm_at: [] # empty = mLSTM only
"""
cfg = OmegaConf.create(xlstm_cfg)
cfg = from_dict(data_class=xLSTMBlockStackConfig, data=OmegaConf.to_container(cfg), config=DaciteConfig(strict=True))
xlstm_stack = xLSTMBlockStack(cfg)

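# Forward a random batch of shape (batch, context_length, embedding_dim) through the stack.
x = torch.randn(4, 32, 64, device=DEVICE)
xlstm_stack = xlstm_stack.to(DEVICE)
y = xlstm_stack(x)
# The block stack should preserve the input shape.
y.shape == x.shape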

True

## Exercise 2: Understanding xLSTM Hyperparameters
Explain the key hyperparameters that influence the performance and behavior of the xLSTM architecture, and how each of them affects the total parameter count.
The explanation should cover: `proj_factor`, `num_heads`, `act_fn`, `context_length`, `num_blocks`, `embedding_dim`, `hidden_size`, `dropout`, `slstm_at`, `qkv_proj_blocksize`, `conv1d_kernel_size`. Also explain how the matrix memory size of mLSTM is determined.
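One way to check such explanations empirically is to build the model with different settings and compare the parameter counts. Below is a minimal sketch of two hypothetical helpers; it only assumes the PyTorch module interface (`parameters()`, `requires_grad`, `numel()`), which the `xLSTMBlockStack` from the verification cell should provide as a regular PyTorch module.

```python
def count_trainable_parameters(model):
    """Sum the element counts of all parameters that receive gradients."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


def format_count(n):
    """Format a parameter count compactly, e.g. for a run name or a log entry."""
    for unit, scale in (("B", 1e9), ("M", 1e6), ("K", 1e3)):
        if n >= scale:
            return f"{n / scale:.2f}{unit}"
    return str(n)
```

For example, calling `count_trainable_parameters(xlstm_stack)` once with `num_blocks: 7` and once with `num_blocks: 14` shows how the depth affects the total.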

In [None]:
########## SOLUTION BEGIN ##########

# proj_factor: Scaling factor for the hidden dimension of the feedforward layers
#   relative to the input dimension (the mLSTM layer has an analogous up-projection factor).
#   Influence: the projection weights grow roughly linearly with proj_factor, so large values
#   raise the parameter count and the risk of overfitting on small datasets.

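# num_heads: Number of heads the hidden dimension is split into. Each head keeps its own
#   memory; xLSTM has no attention layer, so these are not attention heads.
#   Influence: per-head dimension = hidden dimension / num_heads. The large projections do not
#   grow with num_heads; in sLSTM the recurrent weights are block-diagonal per head,
#   so more heads mean fewer recurrent parameters.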

# act_fn: Activation function of the feedforward layers (gelu in the config above).
#   No effect on the parameter count.

# context_length:
#   Length of the input sequence processed by the model, i.e. how much past information
#   the model can consider at once.
#   A longer context_length needs more compute and memory.
#   No direct effect on the parameter count.


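# num_blocks:
#   Number of residually stacked xLSTM blocks (the depth of the model).
#   The parameter count grows roughly linearly with num_blocks.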


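# embedding_dim:
#   Size of the feature vector per input token, i.e. the width of the block stack.
#   Larger values let the model learn more detailed patterns but increase memory usage.
#   Influence: the dense projections scale with embedding_dim times the projected dimension,
#   so for a fixed proj_factor the parameter count grows roughly quadratically.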

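# hidden_size:
#   Dimension of the hidden state of the recurrent cell (not the number of hidden layers).
#   Similar to embedding_dim: a larger hidden state allows more detailed learning
#   but uses more memory.
#   Increases the number of parameters, since gate and recurrent weights scale with it.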

# dropout:
#   Probability of dropping activations during training, for regularization.
#   Reduces overfitting by adding noise to the training process.
#   No impact on the parameter count.


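# slstm_at:
#   List of block positions at which an sLSTM block is used instead of an mLSTM block;
#   an empty list gives an mLSTM-only stack.
#   Changes the parameter count only as far as sLSTM and mLSTM blocks differ in size.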


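# qkv_proj_blocksize:
#   Block size of the block-diagonal query/key/value projections in the mLSTM layer
#   (xLSTM has no attention layer).
#   Each projection has about (projected dimension * qkv_proj_blocksize) weights,
#   so the parameter count grows linearly with the block size.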


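# conv1d_kernel_size:
#   Kernel size of the causal 1D convolution in the mLSTM and sLSTM blocks.
#   Adds weights proportional to the kernel size; a small contribution compared
#   with the projections.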


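# Matrix memory size in mLSTM:
#   Each head stores a (d_head x d_head) matrix, with d_head = inner dimension / num_heads,
#   where the inner dimension is the up-projected embedding_dim of the mLSTM layer.
#   The memory is recurrent state, not a parameter, so it does not add to the parameter count.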



########## YOUR SOLUTION HERE ##########
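The effect of a block-diagonal projection and of the head count on the memory size can be illustrated with a few lines of arithmetic. This is a sketch under stated assumptions, not a count of the actual implementation: it assumes the q/k/v projections are square and block-diagonal with blocks of size `qkv_proj_blocksize`, that the inner dimension is split evenly across heads, and it uses an assumed up-projection factor of 2. All function names are hypothetical.

```python
def dense_params(in_features, out_features):
    """Weights of a dense (fully connected) projection without bias."""
    return in_features * out_features


def blockdiag_params(features, blocksize):
    """Weights of a square block-diagonal projection without bias."""
    if features % blocksize != 0:
        raise ValueError("features must be divisible by blocksize")
    return (features // blocksize) * blocksize * blocksize


def matrix_memory_shape(inner_dim, num_heads):
    """Shape (heads, d_head, d_head) of the mLSTM matrix memory for one sequence."""
    if inner_dim % num_heads != 0:
        raise ValueError("inner_dim must be divisible by num_heads")
    d_head = inner_dim // num_heads
    return (num_heads, d_head, d_head)


# Illustrative values: embedding_dim = 64 as in the config above, and an
# ASSUMED up-projection factor of 2, giving an inner dimension of 128.
inner_dim = 2 * 64
print("dense projection:       ", dense_params(inner_dim, inner_dim))
print("block-diagonal (size 4):", blockdiag_params(inner_dim, 4))
print("matrix memory shape:    ", matrix_memory_shape(inner_dim, 4))
```

The block-diagonal form needs `features * blocksize` weights instead of `features ** 2`, which is why the block size enters the parameter count linearly. The memory shape, in contrast, describes state that is computed at run time and is not learned.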

## Exercise 3: Train an xLSTM model on the Trump Dataset from the previous exercise
Your task is to train an xLSTM model on the Trump Dataset from the previous exercise. 
- The goal is to achieve an average validation loss $\mathcal{L}_{\text{val}} < 1.35$. 
- You do not need to perform an extensive hyperparameter search, but you should document your runs. Log each run together with its hyperparameters, using a tool such as wandb, neptune or mlflow, or a similar setup. For each run, log the training/validation loss and the learning rate over steps, as well as the total number of trainable parameters of the model.
- You can use the training setup from the previous exercises or any setup of your choice built on high-level training libraries.
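If none of the listed tracking tools is available, a "similar setup" can be as simple as a JSON-lines file per run. The sketch below is one hypothetical way to do this with the standard library only; `RunLogger` and `weighted_average_loss` are illustrative names, not part of any of the tools mentioned above. The second helper averages per-batch mean losses weighted by batch size, so a smaller final batch does not distort the average validation loss.

```python
import json
import time
from pathlib import Path


class RunLogger:
    """Append the run configuration and per-step metrics to a JSON-lines file."""

    def __init__(self, path, hparams, num_params):
        self.path = Path(path)
        self._write({
            "type": "config",
            "time": time.time(),
            "hparams": hparams,
            "trainable_params": num_params,
        })

    def log(self, step, **metrics):
        """Record arbitrary metrics (e.g. train_loss, val_loss, lr) for one step."""
        self._write({"type": "metrics", "step": step, **metrics})

    def _write(self, record):
        with self.path.open("a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")


def weighted_average_loss(batch_losses, batch_sizes):
    """Average per-batch mean losses, weighting each batch by its number of targets."""
    total = sum(loss * size for loss, size in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)
```

A run would create one logger, e.g. `RunLogger("run1.jsonl", {"lr": 1e-3, "num_blocks": 7}, num_params)`, call `log(step, train_loss=..., lr=...)` during training, and compare `weighted_average_loss(...)` on the validation set against the target of 1.35.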

In [None]:
########## SOLUTION BEGIN ##########

########## YOUR SOLUTION HERE ##########

## Exercise 4: Utilizing a Pretrained Model (Bonus)

Foundation models, i.e. models pretrained on large amounts of data, are becoming more and more important. We can fine-tune such models on our dataset rather than training a model from scratch.
Here are the things to consider:

- Model Selection: Choose a pretrained language model from an online repository. Hint: You can explore platforms like Hugging Face (huggingface.co), which host numerous pretrained models.

- Dataset: Use the Trump dataset with the same training and validation split as in previous exercises. You do not need to use character tokenization.

- Performance Evaluation: Evaluate the performance of the pretrained model on the validation set before and during fine-tuning. Report the average cross-entropy (CE) loss as well as an example sequence generated from the same prompt for each epoch.
 
- Fine-tuning: Adjust the learning rate, potentially freeze some layers, and train for a few epochs with a framework of your choice (e.g. [lightning](https://lightning.ai/docs/pytorch/stable/), [huggingface](https://huggingface.co/models), ...).

- Computational Resources: Be mindful of the computational demands of pretrained models. You might need access to GPUs. Try to keep the model size to a minimum, e.g. by choosing distilled versions or other small LMs.

- Hyperparameter Tuning: You can experiment with different learning rates and potentially other hyperparameters during fine-tuning, but there is no need to do this in depth.

By completing this exercise, you will gain experience with using pretrained models, understanding their capabilities, and the process of fine-tuning. A decrease in the validation loss can be seen as a success for this exercise.

> **Note**: This is a standalone exercise and doesn't build upon the previous tasks.
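Since a pretrained model usually comes with its own (often subword) tokenizer, its average CE loss per token is not directly comparable with a character-level loss. If such a comparison is wanted, the loss can be renormalized per character, because the total negative log-likelihood of a text does not depend on how it is split into tokens. The helpers below are a hypothetical sketch of that conversion; they assume losses in nats and that the token and character counts refer to the same text.

```python
import math


def per_character_loss(mean_token_loss, num_tokens, num_characters):
    """Convert a mean per-token CE loss (nats) into a mean per-character loss (nats)."""
    if num_characters <= 0:
        raise ValueError("num_characters must be positive")
    total_nll = mean_token_loss * num_tokens  # total negative log-likelihood of the text
    return total_nll / num_characters


def perplexity(mean_loss):
    """Perplexity corresponding to a mean CE loss given in nats."""
    return math.exp(mean_loss)


# Example: 250 tokens covering 1000 characters at a mean loss of 4.0 nats per token.
print(per_character_loss(4.0, 250, 1000))
```

In practice the counts are approximate, since special tokens and the first (unpredicted) token shift them slightly, so this is best treated as a rough comparison rather than an exact equivalence.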

In [None]:
########## SOLUTION BEGIN ##########

########## YOUR SOLUTION HERE ##########