# Flash Attention with Docker for Local Development and Scaling

This guide shows how to set up Flash Attention in Docker for local development and scaling. It covers building the Docker image, running the container, and using Flash Attention in your projects.

## Prerequisites
- Docker installed on your machine
- A compatible NVIDIA GPU
- NVIDIA Docker runtime (NVIDIA Container Toolkit) installed

## Sections
- [Building the Docker Image](#building-the-docker-image)
- [Using Flash Attention Locally](#using-flash-attention)
- [Scaling with GCP Cloud Computing](#scaling-with-gcp-cloud-computing)

## Building the Docker Image

To build the Docker image for Flash Attention, follow these steps.

1. Clone the repository:
   ```bash
   git clone https://github.com/gabenavarro/MLContainerLab.git
   cd MLContainerLab
   ```
2. Build the Docker image:
   ```bash
   docker build -f ./assets/build/Dockerfile.flashattn.cu128py26cp312 -t flash-attention:128-26-312 .
   ```

   > Note: The following steps are for development within the container. The tutorials run inside the container, so you can skip these steps if you are not interested in development.

3. Run the Docker container detached with terminal access and GPUs connected:
   ```bash
   docker run -dt \
   --gpus all \
   -v "$(pwd):/workspace" \
   --name flash-attention \
   --env NVIDIA_VISIBLE_DEVICES=all \
   --env GOOGLE_APPLICATION_CREDENTIALS=/workspace/assets/secrets/gcp-key.json \
   --env SYNAPSE_TOKEN=YOUR_SYNAPSE_TOKEN \
   flash-attention:128-26-312
   ```
   > Note: The `-v $(pwd):/workspace` option mounts the current directory to `/workspace` in the container, allowing you to access your files from within the container. <br>
   > Note: The `--env` options set environment variables for GPU visibility, the path to the Google Cloud credentials file, and the Synapse access token. <br>
   > Note: The `--gpus all` option allows the container to use all available GPUs. <br>
   > Note: The `--name` option names the container `flash-attention`, which you can use to reference it later. <br>
   > Note: The `-dt` option runs the container in detached mode with terminal access. <br>
   > Note: Create a personal access token on [Synapse](https://synapse.org/) and pass it to the container as the `SYNAPSE_TOKEN` environment variable. You can also set it in your local environment, but this is not recommended for security reasons. <br>
   > Note: Get a GCP key from [Google Cloud](https://cloud.google.com/docs/authentication/getting-started) and save it as `assets/secrets/gcp-key.json`, so that `GOOGLE_APPLICATION_CREDENTIALS` points to it inside the container. You can also set it in your local environment, but this is not recommended for security reasons. <br>

4. Open the container in VSCode: 
   ```bash
   code --folder-uri vscode-remote://dev-container+flash-attention/workspace
   ```

   If you have the Dev Containers extension (formerly Remote - Containers) installed, this command opens the container's `/workspace` directory in VSCode, allowing you to edit files directly in the container.
   If this fails, you can attach to the container through the GUI:
   - Open VSCode
   - Press `F1` and type `Dev Containers: Attach to Running Container...` (`Remote-Containers: Attach to Running Container...` in older versions)
   - Select the `flash-attention` container from the list
   - This will open the container in a new VSCode window
   - Set the workspace to `/workspace` in the container

   > Note: You may need to install the Dev Containers (formerly Remote - Containers) extension in VSCode to attach to the container. <br>
   > Note: You may need to install the Python extension in VSCode to run Python code in the container. <br>
   > Note: You may need to install the Jupyter extension in VSCode to run the notebooks in the container. <br>
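
Once inside the container, it can help to confirm that the variables passed with `--env` in step 3 actually arrived. The following is a minimal sketch using only the Python standard library. The helper name `missing_settings` is hypothetical, and the variable names are taken from the `docker run` command above.

```python
import os
from typing import List, Mapping

# Variable names taken from the `docker run` command in step 3.
REQUIRED_VARS = ("GOOGLE_APPLICATION_CREDENTIALS", "SYNAPSE_TOKEN")


def missing_settings(env: Mapping[str, str]) -> List[str]:
    """Return a list of human-readable problems found in `env` (hypothetical helper)."""
    # Flag variables that are absent or empty.
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]
    # The credentials variable holds a path, so also check that the file exists.
    key_path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if key_path and not os.path.isfile(key_path):
        problems.append(f"GCP key file not found at {key_path}")
    return problems


# Check the environment of the current process and report anything missing.
for problem in missing_settings(os.environ):
    print("WARNING:", problem)
```

An empty result means both variables are set and the key file is readable at the mounted path.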

## Using Flash Attention

### Dataset Preparation
Let's set up a simple example to run Flash Attention in the container. First, let's download a sample dataset.

In [1]:
TIME_SERIES_CSV = "/workspace/datasets/btcusd_1-min_data.csv"
PROCESSED_DATA_DIR = "/workspace/datasets/processed_timeseries"

In [None]:
%%capture
# Download and unzip the Bitcoin historical data dataset from Kaggle
!curl -L -o /workspace/datasets/bitcoin-historical-data.zip \
  https://www.kaggle.com/api/v1/datasets/download/mczielinski/bitcoin-historical-data \
    && unzip -o /workspace/datasets/bitcoin-historical-data.zip -d /workspace/datasets/ \
    && rm /workspace/datasets/bitcoin-historical-data.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  113M  100  113M    0     0  28.8M      0  0:00:03  0:00:03 --:--:-- 36.0M
Archive:  /workspace/datasets/bitcoin-historical-data.zip
  inflating: /workspace/datasets/btcusd_1-min_data.csv  


After downloading the dataset, we can load it into a Pandas DataFrame and take a look at the first few rows. Each row covers one minute of Bitcoin price history: a Unix timestamp, the open, high, low, and close prices, the traded volume, and a human-readable datetime.

In [None]:
import pandas as pd
pd.set_option('display.max_columns', None)
pd.read_csv(TIME_SERIES_CSV, low_memory=False, nrows=5).head()

|   | Timestamp | Open | High | Low | Close | Volume | datetime |
|---|-----------|------|------|-----|-------|--------|----------|
| 0 | 1325412000.0 | 4.58 | 4.58 | 4.58 | 4.58 | 0.0 | 2012-01-01 10:01:00+00:00 |
| 1 | 1325412000.0 | 4.58 | 4.58 | 4.58 | 4.58 | 0.0 | 2012-01-01 10:02:00+00:00 |
| 2 | 1325412000.0 | 4.58 | 4.58 | 4.58 | 4.58 | 0.0 | 2012-01-01 10:03:00+00:00 |
| 3 | 1325412000.0 | 4.58 | 4.58 | 4.58 | 4.58 | 0.0 | 2012-01-01 10:04:00+00:00 |
| 4 | 1325412000.0 | 4.58 | 4.58 | 4.58 | 4.58 | 0.0 | 2012-01-01 10:05:00+00:00 |
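
The `Timestamp` values in the preview look identical, while the `datetime` column advances by one minute per row. This is most likely a display rounding effect. The sketch below assumes that `Timestamp` holds Unix epoch seconds in UTC, which the `datetime` column suggests but the source does not state. The value `1325412060` is not read from the file; it is the epoch second worked out from the first row's `datetime`.

```python
from datetime import datetime, timezone


def epoch_to_utc(seconds: float) -> datetime:
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Assumed epoch second for the first preview row (2012-01-01 10:01 UTC).
first_row = epoch_to_utc(1325412060)
print(first_row.isoformat(sep=" "))  # 2012-01-01 10:01:00+00:00

# One-minute bars are 60 seconds apart, so the next row is 10:02.
print(epoch_to_utc(1325412060 + 60).isoformat(sep=" "))  # 2012-01-01 10:02:00+00:00
```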


Now that we have the raw data, let's convert it into an optimized dataset for the model using litdata.

In [None]:
%%capture
import pandas as pd
import litdata as ld
from typing import Any, Callable, Dict, Optional
import torch

def process_timeseries(
    file_path: str, sequence_length: int = 2048
) -> Callable[[int], Optional[Dict[str, Any]]]:
    """Build a function that turns a row index of a timeseries CSV into a sample for autoregressive modeling"""
    
    # Read the CSV file (low_memory=False infers column dtypes from the whole file at once)
    df = pd.read_csv(file_path, low_memory=False)
    
    # Convert datetime to a proper datetime object if it's not already
    if 'datetime' in df.columns:
        df['datetime'] = pd.to_datetime(df['datetime'])
    
    # Sort by timestamp to ensure chronological order
    df = df.sort_values('Timestamp')
    
    # Select the numerical columns for prediction
    numerical_features = ['Open', 'High', 'Low', 'Close', 'Volume']
    
    # Function to create samples for litdata. `sequence_length` (how many previous
    # timesteps to use) comes from the argument of the enclosing function.
    def create_timeseries_sample(index: int) -> Optional[Dict[str, Any]]:
        if index < sequence_length or index >= len(df):
            # Not enough previous data or beyond the dataset
            return None
            
        # Get the sequence of previous data points
        input_sequence = df.iloc[index-sequence_length:index][numerical_features].values
        # Target is the next timestep's Close price
        target = df.iloc[index]['Close']
        
        # Convert to appropriate tensor formats
        input_tensor = torch.tensor(input_sequence, dtype=torch.float32)
        target_tensor = torch.tensor(target, dtype=torch.float32)
        
        timestamp = df.iloc[index]['Timestamp']
        
        return {
            "index": index,
            "inputs": input_tensor,
            "target": target_tensor,
            "timestamp": timestamp
        }
    
    return create_timeseries_sample

# Set the sequence length for your model
sequence_length = 2048

# Get the processing function configured for your specific file
process_function = process_timeseries(TIME_SERIES_CSV, sequence_length)

# Determine the number of valid samples (will depend on your data size)
df_length = len(pd.read_csv(TIME_SERIES_CSV, usecols=['Timestamp']))
# The first `sequence_length` rows have no full history window, so skip them
valid_indices = list(range(sequence_length, df_length))

# The optimize function writes data in an optimized format
ld.optimize(
    fn=process_function,              # the function that processes each sample
    inputs=valid_indices,             # the indices of valid samples
    output_dir=PROCESSED_DATA_DIR,    # optimized data is stored here
    num_workers=4,                    # The number of workers on the same machine
    chunk_bytes="64MB"                # size of each chunk
)



Create an account on https://lightning.ai/ to optimize your data faster using multiple nodes and large machines.
Setting multiprocessing start_method to fork. Tip: Libraries relying on lock can hang with `fork`. To use `spawn` in notebooks, move your code to files and import it within the notebook.
Storing the files under /workspace/datasets/processed_timeseries
Setup started with fast_dev_run=False.
Setup finished in 0.029 seconds. Found 7009037 items to process.
Starting 4 workers with 7009037 items. The progress bar is only updated when a worker finishes.
Workers are ready ! Starting data processing...


Progress:   0%|          | 0/7009037 [00:00<?, ?it/s]

Rank 0 inferred the following `['int', 'tensor', 'tensor', 'float']` data format.
Rank 1 inferred the following `['int', 'tensor', 'tensor', 'float']` data format.
Rank 2 inferred the following `['int', 'tensor', 'tensor', 'float']` data format.
Rank 3 inferred the following `['int', 'tensor', 'tensor', 'float']` data format.
Worker 0 is terminating.
Worker 0 is done.
Worker 1 is terminating.
Worker 1 is done.
Worker 2 is terminating.
Worker 2 is done.
Worker 3 is terminating.
Worker 3 is done.
Workers are finished.
Finished data processing!
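
The windowing done by `create_timeseries_sample` can be illustrated on a toy series. This is a minimal pure-Python sketch, not the litdata pipeline itself. It uses a single feature and a window of 3 instead of five features and 2048 steps, and the helper name `make_sample` is hypothetical.

```python
from typing import List, Optional, Tuple


def make_sample(
    series: List[float], index: int, sequence_length: int
) -> Optional[Tuple[List[float], float]]:
    """Return (history window, next value) for `index`, or None if no full window exists."""
    if index < sequence_length or index >= len(series):
        return None
    # The window holds the `sequence_length` values strictly before `index`;
    # the target is the value at `index` itself.
    return series[index - sequence_length:index], series[index]


closes = [10.0, 11.0, 12.0, 13.0, 14.0]
print(make_sample(closes, 3, 3))  # ([10.0, 11.0, 12.0], 13.0)
print(make_sample(closes, 2, 3))  # None

# Valid indices start at `sequence_length`, mirroring `valid_indices` above.
valid = list(range(3, len(closes)))
print(valid)  # [3, 4]
```

Consecutive samples overlap in all but one timestep, so the optimized dataset is much larger than the raw CSV.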


Next, let's split the dataset into training, validation, and test sets. We will use 80% of the data for training, 10% for validation, and 10% for testing. Dataloaders can then be built on top of each split.

In [2]:
from litdata import StreamingDataset, train_test_split
streaming_dataset = StreamingDataset(PROCESSED_DATA_DIR) # data are read from the local optimized directory

print(len(streaming_dataset)) # display the length of your data
# out: 7009037

train_dataset, val_dataset, test_dataset = train_test_split(streaming_dataset, splits=[0.8, 0.1, 0.1])

# Printing a split shows the StreamingDataset object; use len(train_dataset) to get its size
print("Train ", train_dataset)

print("Validation ", val_dataset)

print("Test", test_dataset)

7009037
Train  <litdata.streaming.dataset.StreamingDataset object at 0x7f5b78db5dc0>
Validation  <litdata.streaming.dataset.StreamingDataset object at 0x7f5b79363620>
Test <litdata.streaming.dataset.StreamingDataset object at 0x7f5b78ede840>
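
The prints above show dataset objects rather than sizes. As a rough guide to what an 80/10/10 split of 7,009,037 samples yields, the sketch below assumes each split size is the floor of fraction times length. That rounding rule is an assumption about `train_test_split`, so treat `len(train_dataset)` as the source of truth. The helper `split_sizes` is hypothetical.

```python
import math
from typing import List, Sequence


def split_sizes(n_items: int, fractions: Sequence[float]) -> List[int]:
    """Floor-based split sizes (assumed rounding rule, not litdata's documented behaviour)."""
    if sum(fractions) > 1.0 + 1e-9:
        raise ValueError("fractions must sum to at most 1")
    return [math.floor(n_items * fraction) for fraction in fractions]


sizes = split_sizes(7009037, [0.8, 0.1, 0.1])
print(sizes)  # [5607229, 700903, 700903]

# Flooring can leave a few samples unassigned to any split.
print(7009037 - sum(sizes))  # 2
```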


### Model Definition

Now that the dataset is prepared, we can define the model. Each transformer layer is a class that inherits from `nn.Module`, with its architecture defined in the `__init__` method and its forward pass in the `forward` method. The full model stacks these layers inside a Lightning `LightningModule`, which also holds the training and validation logic.

The model architecture consists of an embedding layer, a stack of transformer layers with causal self-attention, and a linear output layer. The embedding layer projects the input features into a higher-dimensional space, the transformer layers process the sequence using self-attention, and the linear layer outputs the final prediction.
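
To make the causal self-attention concrete, here is a minimal, unoptimized pure-Python sketch of single-head causal scaled dot-product attention. It is not how the `flash_attn` kernels are implemented. Flash Attention computes the same exact attention with fused, tiled GPU kernels, and this sketch only illustrates the masking rule. The function name `causal_attention` and the toy inputs are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def causal_attention(q: List[Vector], k: List[Vector], v: List[Vector]) -> List[Vector]:
    """Single-head causal scaled dot-product attention over one sequence."""
    scale = 1.0 / math.sqrt(len(q[0]))
    outputs = []
    for t, q_t in enumerate(q):
        # Causal mask: position t may only attend to positions 0..t.
        scores = [scale * sum(a * b for a, b in zip(q_t, k[s])) for s in range(t + 1)]
        # Numerically stable softmax over the visible positions.
        peak = max(scores)
        exps = [math.exp(score - peak) for score in scores]
        total = sum(exps)
        # Output is the softmax-weighted sum of the visible value vectors.
        out = [
            sum(e / total * v[s][d] for s, e in enumerate(exps))
            for d in range(len(v[0]))
        ]
        outputs.append(out)
    return outputs


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = causal_attention(x, x, x)
print(y[0])  # [1.0, 0.0] because the first position can only see itself
```

Because of the mask, changing a later timestep never changes the outputs at earlier positions, which is what makes next-step prediction from past prices valid.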

In [3]:
# Transformer Layer
from flash_attn.modules.mha import MHA
from flash_attn.ops.rms_norm import RMSNorm
from flash_attn.ops.fused_dense import FusedMLP
from torch import nn, optim, Tensor
import lightning as pl


class TransformerLayer(nn.Module):
    def __init__(
            self,
            layer_idx: int,
            embed_dim: int,
            num_heads: int,
            mlp_ratio: float = 4.0,
            proj_groups: int = 1,
            fast_attention: bool = True,
        ):
        super().__init__()
        self.attention = MHA(
            embed_dim = embed_dim,                   # Dimension of the model
            num_heads = num_heads,                   # Number of attention heads
            causal = True,                           # Causal attention
            layer_idx = layer_idx,                   # Layer index (used for the inference KV cache)
            num_heads_kv = num_heads // proj_groups, # Number of heads for key/value
            rotary_emb_dim = embed_dim // num_heads, # Rotary embedding dimension (one full head)
            use_flash_attn = fast_attention,         # Use flash attention (needs fp16/bf16 inputs on GPU)
            return_residual = False                  # Return only the attention output, not the input too
        )
        self.norm1 = RMSNorm(embed_dim)
        self.mlp = FusedMLP(embed_dim, int(embed_dim * mlp_ratio))
        self.norm2 = RMSNorm(embed_dim)

    def forward(self, x: Tensor) -> Tensor:
        attn_output = self.attention(x)
        x = x + self.norm1(attn_output)
        mlp_output = self.mlp(x)
        x = x + self.norm2(mlp_output)
        return x
    
# Transformer Model
class TransformerModel(pl.LightningModule):
    def __init__(self, embed_dim: int, num_heads: int, num_layers: int,
                 mlp_ratio: float = 4.0, num_features: int = 5):
        super().__init__()
        # Embedding layer: project the input features (Open, High, Low, Close, Volume) to embed_dim
        self.embed = nn.Linear(num_features, embed_dim)
        self.layers = nn.ModuleList([
            TransformerLayer(layer_idx, embed_dim, num_heads, mlp_ratio) for layer_idx in range(num_layers)
        ])
        self.fc_out = nn.Linear(embed_dim, 1)  # Output layer for regression
        self.loss_fn = nn.MSELoss()

    # In pl.LightningModule, the forward method is used for inference
    def forward(self, x: Tensor) -> Tensor:
        x = self.embed(x)                      # (batch, seq, features) -> (batch, seq, embed_dim)
        for layer in self.layers:
            x = layer(x)
        # Predict from the last timestep only, so the output shape (batch,) matches the target
        return self.fc_out(x[:, -1]).squeeze(-1)

    # Shared by training and validation: forward pass followed by the MSE loss
    def _compute_loss(self, batch) -> Tensor:
        outputs = self(batch['inputs'])
        return self.loss_fn(outputs, batch['target'])
    
    # Training step is where the model learns from the data
    def training_step(self, batch, batch_idx):
        loss = self._compute_loss(batch)
        # Log the loss for monitoring
        self.log('train_loss', loss)
        return loss
    
    # Validation step is where the model is evaluated on the validation set
    def validation_step(self, batch, batch_idx):
        loss = self._compute_loss(batch)
        # Log the loss for monitoring
        self.log('val_loss', loss)
        return loss
    
    # Configuring the optimizer
    def configure_optimizers(self):
        optimizer = optim.Adam(self.parameters(), lr=1e-4)
        return optimizer
    

# Initialize the model
embed_dim = 128
num_heads = 8
num_layers = 6
model = TransformerModel(embed_dim, num_heads, num_layers)

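ModuleNotFoundError: No module named 'dropout_layer_norm'

This error comes from the `flash_attn.ops.rms_norm` import. That module depends on the optional `dropout_layer_norm` CUDA extension, which is built separately from the main `flash-attn` package (from `csrc/layer_norm` in the flash-attention repository) and is not installed in this environment.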
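
For reference, the normalization that the `RMSNorm` layers in `TransformerLayer` apply can be written in a few lines. This is a minimal pure-Python sketch of the standard RMSNorm formula, not the fused `flash_attn` implementation. The function name `rms_norm` is hypothetical, and the default `eps` is an assumption.

```python
import math
from typing import List, Optional


def rms_norm(x: List[float], weight: Optional[List[float]] = None, eps: float = 1e-5) -> List[float]:
    """Scale `x` by the reciprocal of its root mean square, then by a per-feature weight."""
    # Root mean square over the feature dimension, with eps for numerical stability.
    rms = math.sqrt(sum(value * value for value in x) / len(x) + eps)
    # A weight of all ones corresponds to a freshly initialized layer.
    weight = weight if weight is not None else [1.0] * len(x)
    return [w * value / rms for value, w in zip(x, weight)]


normalized = rms_norm([3.0, 4.0], eps=0.0)
# After normalization the mean of the squared features is 1 (up to rounding).
mean_square = sum(value * value for value in normalized) / len(normalized)
print(round(mean_square, 6))  # 1.0
```

Unlike LayerNorm, RMSNorm does not subtract the mean, so it only rescales each feature vector.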