<a href="https://colab.research.google.com/github/antalvdb/olifant/blob/main/timbl_llm_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Training an Olifant memory-based language model

Olifant is a memory-based language model that offers an eco-friendly alternative to neural LLMs. Olifant models run on CPUs; no GPUs or TPUs are required. Training is costly in terms of RAM, but not in terms of time or compute. This notebook shows how to train an Olifant model. Olifant comes in two flavors:

1.   **IB1**: k-nearest neighbor classification; accurate, but slow and RAM-intensive.
2.   **IGTree**: decision-tree classification based on prefix tree retrieval; fast and compact, but less accurate.

This notebook assumes that you have uploaded a raw text file to Colab, or that you have placed the file in My Drive on Google Drive and mounted that drive. A sample text file is downloaded below.

##Installing packages

We start by installing the `olifant` package. Because we want to run `timbl` on the command line, we install that as well. Among the installed dependencies is `codecarbon`, which we will use to track CO2 emissions; it is recommended to log all estimated emissions when doing experiments.

In [None]:
!pip install olifant
!apt-get install -y timbl

import timbl

##Loading the sample training data

We download some sample training data. This file contains the first 100,000 lines of the first shard of the FineWeb-Edu (EduFineWeb) refined web corpus by Hugging Face.

In [None]:
!wget https://antalvandenbosch.nl/mblm/edufineweb_train_000001-100k.txt

##Tokenizing the training data

We tokenize the training data with the GPT-2 tokenizer, originally developed by OpenAI for GPT-2 and available through the Hugging Face platform. Once you train a model with a given tokenizer, all data you later feed to that model must be processed with the same tokenizer.

In [None]:
import os
import pandas as pd
from transformers import AutoTokenizer

def process_file(input_filename):
    # Generate the output filename by adding "_tok" before the file extension
    base, ext = os.path.splitext(input_filename)
    output_filename = f"{base}_tok{ext}"

    # Read the input file
    with open(input_filename, 'r', encoding='utf-8') as file:
        lines = file.readlines()

    # Create a DataFrame
    df = pd.DataFrame(lines, columns=['text'])

    # Initialize the tokenizer
    tokenizer = AutoTokenizer.from_pretrained('gpt2')

    # Tokenize the text
    df['tokens'] = df['text'].apply(lambda x: tokenizer.tokenize(x))

    # Write the tokens to the output file
    with open(output_filename, 'w', encoding='utf-8') as file:
        for tokens in df['tokens']:
            # Join the tokens list into a single string and write to the file
            file.write(' '.join(tokens) + '\n')

    print(f"Processed file saved as {output_filename}")

# Specify the filename directly
filename = "edufineweb_train_000001-100k.txt"  # Replace with your actual filename
input_filename = os.path.join('/content', filename)

# Process the file
process_file(input_filename)

##Generating training instances for TiMBL

We then create a file of fixed-width training instances. Each instance consists of the four preceding tokens as features, followed by the next token as the class label to be predicted. Positions before the start of a text are padded with underscores.
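To make the instance format concrete, here is a minimal sketch on a toy token list. The function name `toy_windowed_instances` and the hand-written GPT-2-style tokens are illustrative assumptions, not part of Olifant. The sketch mirrors the loop bounds of the notebook's own function, which means the final token of a block is not emitted as a target.

```python
def toy_windowed_instances(tokens, window_size=4, pad="_"):
    """Return (context, target) pairs using the same loop bounds as the notebook."""
    padded = [pad] * window_size + tokens
    instances = []
    # Same range as generate_windowed_instances: it stops one short of the end,
    # so the final token of the block never appears as a target
    for i in range(window_size, len(padded) - 1):
        instances.append((padded[i - window_size:i], padded[i]))
    return instances

# Hand-written GPT-2-style tokens, for illustration only
tokens = ["The", "Ġcat", "Ġsat", "Ġon", "Ġthe", "Ġmat"]
instances = toy_windowed_instances(tokens)
for context, target in instances:
    print(" ".join(context), target)
```

Each printed line has five space-separated fields: four context tokens and one target. With these loop bounds, six input tokens yield five instances.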

In [None]:
import os

def generate_windowed_instances(file_path, output_file, window_size=4):
    # Start with an empty list to accumulate tokens for each block
    tokenized_text = []

    with open(file_path, 'r', encoding='utf-8') as file, open(output_file, 'w', encoding='utf-8') as outfile:
        for line in file:
            # Strip leading/trailing whitespace from the line
            stripped_line = line.strip()

            # Check if the line is empty, indicating the end of a block
            if not stripped_line:
                # Process the accumulated tokens for the current block
                if tokenized_text:
                    # Pad the beginning of the tokenized text with underscores
                    padded_text = ["_"] * window_size + tokenized_text

                    # Write each windowed instance for this block
                    # (the range stops at len - 1, so a block's final token is never a target)
                    for i in range(window_size, len(padded_text) - 1):
                        context = padded_text[i - window_size:i]
                        target = padded_text[i]
                        outfile.write(f"{' '.join(context)} {target}\n")

                    # Reset tokenized_text for the next block (outside the loop,
                    # so it also happens when the loop body never runs)
                    tokenized_text = []

            else:
                # Append tokens from the non-empty line to the current block
                tokenized_text.extend(stripped_line.split())

        # Process any remaining tokens after the last line
        if tokenized_text:
            padded_text = ["_"] * window_size + tokenized_text
            for i in range(window_size, len(padded_text) - 1):
                context = padded_text[i - window_size:i]
                target = padded_text[i]
                outfile.write(f"{' '.join(context)} {target}\n")

# Specify the input and output filenames directly
input_filename = "edufineweb_train_000001-100k_tok.txt"  # Replace with your actual input filename
output_filename = input_filename.replace(".txt", ".l4r0")  # Generate output filename

input_file_path = os.path.join('/content', input_filename)
output_file_path = os.path.join('/content', output_filename)

# Call the function to generate windowed instances and write to the output file
generate_windowed_instances(input_file_path, output_file_path)
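Since the instances are meant to be fixed-width, it can be worth checking that every line has the same number of fields before training. The following is a minimal sketch of such a check; the helper name `field_count_histogram` and the sample lines are assumptions made for illustration.

```python
from collections import Counter

def field_count_histogram(lines):
    """Count how many non-empty lines have each number of space-separated fields."""
    return Counter(len(line.split()) for line in lines if line.strip())

# Hand-written sample lines in the 4-context + 1-target format
sample = [
    "_ _ _ _ The\n",
    "_ _ _ The Ġcat\n",
    "_ _ The Ġcat Ġsat\n",
]
histogram = field_count_histogram(sample)
print(histogram)

# To check the real file, pass an open file handle instead of `sample`:
# with open('/content/edufineweb_train_000001-100k_tok.l4r0', encoding='utf-8') as f:
#     print(field_count_histogram(f))
```

With a window size of 4, a well-formed file should produce a histogram with a single key, 5.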

##Training

Now we train our memory-based language model (MBLM) with TiMBL. This can take a while and may consume large amounts of RAM.

The end result is `edufineweb_train_000001-100k_tok.l4r0.ibase`, an indexed and compressed instance base suitable for TiMBL classification. In LLM terms, this is the model file that you will need at inference time.

Again, TiMBL allows for two flavors:

1. The option `-a0` means that the training set is compressed losslessly, with compression rates around 10-30%. This is the setting that implements **IB1**, k-Nearest Neighbor classification.

2. With `-a1`, a strong lossy compression is applied, yielding higher compression rates of around 90-95% and considerably faster but less accurate inference. This is TiMBL's **IGTree** option.

In this example, TiMBL is called from the notebook's shell command line (it can also be called through the python-timbl bindings). The call is wrapped in a codecarbon CO2 emission tracker, so TiMBL's rather verbose output is interleaved with the equally verbose codecarbon information. TiMBL goes through three phases:

1. Examining: reading all instances once into memory;
2. Indexing: building an index on all feature values and class labels;
3. Learning: storing all instances into a decision tree (lossless with **IB1**, lossy with **IGTree**).



In [None]:
from codecarbon import track_emissions

@track_emissions(project_name="mblm-edufineweb_train_000001-100k_tok.l4r0")
def train_model():
    !timbl -f edufineweb_train_000001-100k_tok.l4r0 -a0 +D -I edufineweb_train_000001-100k_tok.l4r0.ibase

train_model()
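The two flavors differ only in the `-a` flag of the training command. As a sketch, the command can be assembled in Python so that switching flavors is a one-argument change. The helper `timbl_train_command` and the `ALGORITHM_FLAGS` mapping are hypothetical names introduced here; they are not part of Olifant or TiMBL. Only the flags already used in this notebook (`-f`, `-a0`/`-a1`, `+D`, `-I`) appear.

```python
import shlex

# Hypothetical mapping (not part of olifant or TiMBL) from flavor name to -a flag
ALGORITHM_FLAGS = {"IB1": "-a0", "IGTree": "-a1"}

def timbl_train_command(train_file, model_file, flavor="IB1"):
    """Build the TiMBL training command as a list of arguments."""
    return ["timbl", "-f", train_file, ALGORITHM_FLAGS[flavor], "+D", "-I", model_file]

train_file = "edufineweb_train_000001-100k_tok.l4r0"
cmd = timbl_train_command(train_file, train_file + ".ibase", flavor="IB1")
print(shlex.join(cmd))

# To actually run the command (assumes timbl is installed and on the PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

When training the IGTree flavor, pass `flavor="IGTree"` and an output filename of your choosing, so that it does not overwrite the IB1 model file.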