<a href="https://colab.research.google.com/github/antalvdb/mblm/blob/main/timbl_llm_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Training a memory-based language model

Memory-based language modeling (MBLM) offers an eco-friendly alternative to neural LLMs. MBLMs run on CPUs; no GPUs or TPUs are required. Training an MBLM is costly in terms of RAM, but not in terms of time or computing resources. This notebook shows how to train an MBLM. MBLM comes in two flavors:

1.   **IB1**, k-Nearest Neighbor classification; accurate, but slow and RAM-intensive;
2.   **IGTree**, decision-tree classification based on prefix tree retrieval; fast and compact, but less accurate.
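
To make the contrast between the two flavors concrete, here is a toy sketch in plain Python. It is not TiMBL's implementation: the training data, the function names, and the two-token context are all made up for illustration, and the tree version leaves out the pruning and compression that the real IGTree applies. The IB1-style function compares the query against every stored instance, while the IGTree-style function follows a single path through a prefix tree, starting from the context token nearest to the prediction point, and falls back to the most frequent class at the deepest node it can reach.

```python
from collections import Counter

# Hypothetical toy data: (two-token context, next token).
TRAIN = [
    (("the", "cat"), "sat"),
    (("the", "dog"), "sat"),
    (("a", "cat"), "ran"),
    (("a", "dog"), "barked"),
    (("the", "cat"), "sat"),
]

def ib1_predict(context, train=TRAIN):
    """Nearest-neighbour vote (k=1) under the overlap metric."""
    # Distance = number of context positions whose tokens differ.
    scored = [(sum(a != b for a, b in zip(context, feats)), label)
              for feats, label in train]
    best = min(dist for dist, _ in scored)
    votes = Counter(label for dist, label in scored if dist == best)
    return votes.most_common(1)[0][0]

def new_node():
    return {"counts": Counter(), "children": {}}

def build_prefix_tree(train=TRAIN):
    """Store class counts along each context, nearest token first."""
    root = new_node()
    for feats, label in train:
        node = root
        node["counts"][label] += 1
        for value in reversed(feats):
            node = node["children"].setdefault(value, new_node())
            node["counts"][label] += 1
    return root

def igtree_like_predict(context, tree):
    """Follow matching tokens down the tree; stop at the first mismatch."""
    node = tree
    for value in reversed(context):
        child = node["children"].get(value)
        if child is None:
            break
        node = child
    # Most frequent class seen at the deepest matching node.
    return node["counts"].most_common(1)[0][0]

tree = build_prefix_tree()
print(ib1_predict(("my", "cat")))                # sat
print(igtree_like_predict(("my", "cat"), tree))  # sat
print(igtree_like_predict(("a", "dog"), tree))   # barked
```

The full scan in `ib1_predict` is what makes the k-NN flavor slow and memory-hungry on millions of instances; the single-path lookup in `igtree_like_predict` is what makes the tree flavor fast.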

This notebook assumes that you have uploaded a raw text file to Colab, or have placed it in Google Drive (under My Drive) and mounted that drive. A sample text file can be downloaded from [here](https://antalvandenbosch.nl/mblm/edufineweb_train_000001-100k.txt). This file contains the first 100,000 lines of the first shard of the EduFineWeb refined web corpus by Hugging Face.

We start by installing TiMBL, the MBLM engine, together with its Python bindings. We also install codecarbon to track CO2 emissions.

In [3]:
!apt install timbl
!pip install python3-timbl
!pip install codecarbon

import timbl

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
timbl is already the newest version (6.5-3).
0 upgraded, 0 newly installed, 0 to remove and 34 not upgraded.


We tokenize the training data with bert-base-cased, a Hugging Face tokenizer. This is the same tokenizer we will use for other data that will be handled by our model.

In [4]:
import os
import pandas as pd
from transformers import AutoTokenizer

def process_file(input_filename):
    # Generate the output filename by inserting "_tok" before the file extension
    base, ext = os.path.splitext(input_filename)
    output_filename = f"{base}_tok{ext}"

    # Read the input file
    with open(input_filename, 'r') as file:
        lines = file.readlines()

    # Create a DataFrame
    df = pd.DataFrame(lines, columns=['text'])

    # Initialize the tokenizer
    tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')

    # Tokenize the text
    df['tokens'] = df['text'].apply(lambda x: tokenizer.tokenize(x))

    # Write the tokens to the output file
    with open(output_filename, 'w') as file:
        for tokens in df['tokens']:
            # Join the tokens list into a single string and write to the file
            file.write(' '.join(tokens) + '\n')

    print(f"Processed file saved as {output_filename}")

# Specify the filename directly
filename = "edufineweb_train_000001-100k.txt"  # Replace with your actual filename
input_filename = os.path.join('/content', filename)

# Process the file
process_file(input_filename)

Token indices sequence length is longer than the specified maximum sequence length for this model (585 > 512). Running this sequence through the model will result in indexing errors


Processed file saved as /content/edufineweb_train_000001-100k_tok.txt


We then create a file with fixed-width training instances. Each instance consists of a 16-token context as features, followed by the next token as the class label to be predicted.

In [5]:
import os

def generate_windowed_instances(file_path, output_file, window_size=16):
    # Start with an empty list to accumulate tokens for each block
    tokenized_text = []

    with open(file_path, 'r') as file, open(output_file, 'w') as outfile:
        for line in file:
            # Strip leading/trailing whitespace from the line
            stripped_line = line.strip()

            # Check if the line is empty, indicating the end of a block
            if not stripped_line:
                # Process the accumulated tokens for the current block
                if tokenized_text:
                    # Pad the beginning of the tokenized text with underscores
                    padded_text = ["_"] * window_size + tokenized_text

                    # Generate and write each windowed instance for this block
                    for i in range(window_size, len(padded_text) - 1):
                        context = padded_text[i - window_size:i]
                        target = padded_text[i]
                        outfile.write(f"{' '.join(context)} {target}\n")

                # Reset tokenized_text for the next block
                tokenized_text = []

            else:
                # Append tokens from the non-empty line to the current block
                tokenized_text.extend(stripped_line.split())

        # Process any remaining tokens after the last line
        if tokenized_text:
            padded_text = ["_"] * window_size + tokenized_text
            for i in range(window_size, len(padded_text) - 1):
                context = padded_text[i - window_size:i]
                target = padded_text[i]
                outfile.write(f"{' '.join(context)} {target}\n")

# Specify the input and output filenames directly
input_filename = "edufineweb_train_000001-100k_tok.txt"  # Replace with your actual input filename
output_filename = input_filename.replace(".txt", ".l16r0")  # Generate output filename

input_file_path = os.path.join('/content', input_filename)
output_file_path = os.path.join('/content', output_filename)

# Call the function to generate windowed instances and write to the output file
generate_windowed_instances(input_file_path, output_file_path)
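
To see what the resulting instances look like, here is a small self-contained sketch that applies the same windowing to a six-token block, with a window of 3 instead of 16. The helper name `toy_windowed_instances` is made up for this illustration. It uses the same loop bounds as the cell above, which stop one position before the end of the block, so the final token of a block is not emitted as a target.

```python
def toy_windowed_instances(tokens, window_size=3):
    """Return 'context ... target' strings for a single block of tokens."""
    # Left-pad with underscores so the first token also gets a full context.
    padded = ["_"] * window_size + tokens
    instances = []
    # Same bounds as the notebook cell: stop one position before the end.
    for i in range(window_size, len(padded) - 1):
        context = padded[i - window_size:i]
        target = padded[i]
        instances.append(f"{' '.join(context)} {target}")
    return instances

for line in toy_windowed_instances(["The", "cat", "sat", "on", "the", "mat"]):
    print(line)
# _ _ _ The
# _ _ The cat
# _ The cat sat
# The cat sat on
# cat sat on the
```

Each printed line is one training instance: the first three columns are features and the last column is the class label.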

Now we train our MBLM with TiMBL. This can take a while and may consume large amounts of RAM.

The end result is `edufineweb_train_000001-100k_tok.l16r0.ibase`, an indexed and compressed instance base suitable for TiMBL classification. In LLM terms, this is the model file that you will need for your favorite LLM inference steps.

The option `-a0` means that the training set is compressed losslessly, with compression rates around 10-30%. This is the setting that implements **IB1**, k-Nearest Neighbor classification.

With `-a1`, a strong lossy compression is applied, yielding higher compression levels around 90-95%, and considerably faster but less accurate inference. This is TiMBL's **IGTree** option.

TiMBL is called from the notebook's shell command line. The call is wrapped inside a codecarbon CO2 emission measurement. TiMBL's quite verbose output is interleaved with the equally verbose codecarbon information.
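
Since the two flavors differ only in the `-a` option, switching between them amounts to changing one flag. The sketch below assembles the command line as a Python list, using only the options that appear in this notebook (`-f`, `-a0`/`-a1`, `+D`, `-I`). The helper name `timbl_train_command` and its `ibase_file` parameter are hypothetical, and the sketch only prints the command rather than running it.

```python
import shlex

def timbl_train_command(instance_file, algorithm=0, ibase_file=None):
    """Build the TiMBL training command as an argument list.

    algorithm=0 selects IB1 (-a0); algorithm=1 selects IGTree (-a1).
    """
    if algorithm not in (0, 1):
        raise ValueError("this sketch only covers -a0 (IB1) and -a1 (IGTree)")
    if ibase_file is None:
        # Default mirrors the notebook: instance file name plus ".ibase".
        ibase_file = f"{instance_file}.ibase"
    return ["timbl", "-f", instance_file, f"-a{algorithm}", "+D", "-I", ibase_file]

cmd = timbl_train_command("edufineweb_train_000001-100k_tok.l16r0", algorithm=1)
print(shlex.join(cmd))
# timbl -f edufineweb_train_000001-100k_tok.l16r0 -a1 +D -I edufineweb_train_000001-100k_tok.l16r0.ibase
```

With the default file name, an IGTree run would write to the same `.ibase` file as an IB1 run, so pass a distinct `ibase_file` if you want to keep both models.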



In [None]:
from codecarbon import track_emissions

@track_emissions(project_name="mblm-edufineweb_train_000001-100k_tok.l16r0")
def train_model():
    !timbl -f edufineweb_train_000001-100k_tok.l16r0 -a0 +D -I edufineweb_train_000001-100k_tok.l16r0.ibase

train_model()

[codecarbon INFO @ 15:43:40] [setup] RAM Tracking...
[codecarbon INFO @ 15:43:40] [setup] CPU Tracking...
 Linux OS detected: Please ensure RAPL files exist at /sys/class/powercap/intel-rapl/subsystem to measure CPU

[codecarbon INFO @ 15:43:41] CPU Model on constant consumption mode: Intel(R) Xeon(R) CPU @ 2.20GHz
[codecarbon INFO @ 15:43:41] [setup] GPU Tracking...
[codecarbon INFO @ 15:43:41] No GPU found.
[codecarbon INFO @ 15:43:41] The below tracking methods have been set up:
                RAM Tracking Method: RAM power estimation model
                CPU Tracking Method: global constant
                GPU Tracking Method: Unspecified
            
[codecarbon INFO @ 15:43:41] >>> Tracker's metadata:
[codecarbon INFO @ 15:43:41]   Platform system: Linux-6.1.123+-x86_64-with-glibc2.35
[codecarbon INFO @ 15:43:41]   Python version: 3.11.12
[codecarbon INFO @ 15:43:41]   CodeCarbon version: 3.0.1
[codecarbon INFO @ 15:43:41]   Available RAM : 12.674 GB
[codecarbon INFO @ 15:43:41

TiMBL 6.5 (c) CLST/ILK/CLIPS 1998 - 2020.
Tilburg Memory Based Learner
Centre for Language and Speech Technology, Radboud University
Induction of Linguistic Knowledge Research Group, Tilburg University
CLiPS Computational Linguistics Group, University of Antwerp
Thu May  8 15:43:41 2025

Examine datafile 'edufineweb_train_000001-100k_tok.l16r0' gave the following results:
Number of Features: 16
InputFormat       : Columns

Phase 1: Reading Datafile: edufineweb_train_000001-100k_tok.l16r0
Start:          0 @ Thu May  8 15:43:41 2025
Examining: 100000 @ Thu May  8 15:43:45 2025
Examining: 200000 @ Thu May  8 15:43:48 2025
Examining: 300000 @ Thu May  8 15:43:53 2025


[codecarbon INFO @ 15:43:56] Energy consumed for RAM : 0.000042 kWh. RAM Power : 10.0 W
[codecarbon INFO @ 15:43:56] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 15:43:56] Energy consumed for All CPU : 0.000177 kWh
[codecarbon INFO @ 15:43:56] 0.000219 kWh of electricity used since the beginning.


Examining: 400000 @ Thu May  8 15:43:57 2025
Examining: 500000 @ Thu May  8 15:44:02 2025
Examining: 600000 @ Thu May  8 15:44:07 2025


[codecarbon INFO @ 15:44:11] Energy consumed for RAM : 0.000083 kWh. RAM Power : 10.0 W
[codecarbon INFO @ 15:44:11] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 15:44:11] Energy consumed for All CPU : 0.000354 kWh
[codecarbon INFO @ 15:44:11] 0.000437 kWh of electricity used since the beginning.


Examining: 700000 @ Thu May  8 15:44:12 2025
Examining: 800000 @ Thu May  8 15:44:16 2025
Examining: 900000 @ Thu May  8 15:44:22 2025


[codecarbon INFO @ 15:44:26] Energy consumed for RAM : 0.000125 kWh. RAM Power : 10.0 W
[codecarbon INFO @ 15:44:26] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 15:44:26] Energy consumed for All CPU : 0.000531 kWh
[codecarbon INFO @ 15:44:26] 0.000656 kWh of electricity used since the beginning.


Examining: 1000000 @ Thu May  8 15:44:27 2025
Examining: 1100000 @ Thu May  8 15:44:32 2025
Examining: 1200000 @ Thu May  8 15:44:37 2025


[codecarbon INFO @ 15:44:41] Energy consumed for RAM : 0.000167 kWh. RAM Power : 10.0 W
[codecarbon INFO @ 15:44:41] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 15:44:41] Energy consumed for All CPU : 0.000708 kWh
[codecarbon INFO @ 15:44:41] 0.000875 kWh of electricity used since the beginning.


Examining: 1300000 @ Thu May  8 15:44:42 2025
Examining: 1400000 @ Thu May  8 15:44:48 2025
Examining: 1500000 @ Thu May  8 15:44:53 2025


[codecarbon INFO @ 15:44:56] Energy consumed for RAM : 0.000208 kWh. RAM Power : 10.0 W
[codecarbon INFO @ 15:44:56] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 15:44:56] Energy consumed for All CPU : 0.000886 kWh
[codecarbon INFO @ 15:44:56] 0.001094 kWh of electricity used since the beginning.


Examining: 1600000 @ Thu May  8 15:44:58 2025
Examining: 1700000 @ Thu May  8 15:45:04 2025
Examining: 1800000 @ Thu May  8 15:45:09 2025


[codecarbon INFO @ 15:45:11] Energy consumed for RAM : 0.000250 kWh. RAM Power : 10.0 W
[codecarbon INFO @ 15:45:11] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 15:45:11] Energy consumed for All CPU : 0.001063 kWh
[codecarbon INFO @ 15:45:11] 0.001313 kWh of electricity used since the beginning.


Examining: 1900000 @ Thu May  8 15:45:15 2025
Examining: 2000000 @ Thu May  8 15:45:20 2025
Examining: 2100000 @ Thu May  8 15:45:26 2025


[codecarbon INFO @ 15:45:26] Energy consumed for RAM : 0.000292 kWh. RAM Power : 10.0 W
[codecarbon INFO @ 15:45:26] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 15:45:26] Energy consumed for All CPU : 0.001240 kWh
[codecarbon INFO @ 15:45:26] 0.001531 kWh of electricity used since the beginning.


Examining: 2200000 @ Thu May  8 15:45:31 2025
Examining: 2300000 @ Thu May  8 15:45:37 2025


[codecarbon INFO @ 15:45:41] Energy consumed for RAM : 0.000333 kWh. RAM Power : 10.0 W
[codecarbon INFO @ 15:45:41] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 15:45:41] Energy consumed for All CPU : 0.001417 kWh
[codecarbon INFO @ 15:45:41] 0.001750 kWh of electricity used since the beginning.
[codecarbon INFO @ 15:45:41] 0.010761 g.CO2eq/s mean an estimation of 339.34522920213897 kg.CO2eq/year


Examining: 2400000 @ Thu May  8 15:45:42 2025
Examining: 2500000 @ Thu May  8 15:45:48 2025
Examining: 2600000 @ Thu May  8 15:45:54 2025


[codecarbon INFO @ 15:45:56] Energy consumed for RAM : 0.000375 kWh. RAM Power : 10.0 W
[codecarbon INFO @ 15:45:56] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 15:45:56] Energy consumed for All CPU : 0.001594 kWh
[codecarbon INFO @ 15:45:56] 0.001969 kWh of electricity used since the beginning.


Examining: 2700000 @ Thu May  8 15:45:59 2025
Examining: 2800000 @ Thu May  8 15:46:05 2025
Examining: 2900000 @ Thu May  8 15:46:11 2025


[codecarbon INFO @ 15:46:11] Energy consumed for RAM : 0.000417 kWh. RAM Power : 10.0 W
[codecarbon INFO @ 15:46:11] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 15:46:11] Energy consumed for All CPU : 0.001771 kWh
[codecarbon INFO @ 15:46:11] 0.002188 kWh of electricity used since the beginning.


Examining: 3000000 @ Thu May  8 15:46:17 2025
Examining: 3100000 @ Thu May  8 15:46:22 2025


[codecarbon INFO @ 15:46:26] Energy consumed for RAM : 0.000458 kWh. RAM Power : 10.0 W
[codecarbon INFO @ 15:46:26] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 15:46:26] Energy consumed for All CPU : 0.001948 kWh
[codecarbon INFO @ 15:46:26] 0.002406 kWh of electricity used since the beginning.


Examining: 3200000 @ Thu May  8 15:46:28 2025
Examining: 3300000 @ Thu May  8 15:46:33 2025
Examining: 3400000 @ Thu May  8 15:46:39 2025


[codecarbon INFO @ 15:46:41] Energy consumed for RAM : 0.000500 kWh. RAM Power : 10.0 W
[codecarbon INFO @ 15:46:41] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 15:46:42] Energy consumed for All CPU : 0.002125 kWh
[codecarbon INFO @ 15:46:42] 0.002625 kWh of electricity used since the beginning.


Examining: 3500000 @ Thu May  8 15:46:45 2025
Examining: 3600000 @ Thu May  8 15:46:51 2025


[codecarbon INFO @ 15:46:56] Energy consumed for RAM : 0.000542 kWh. RAM Power : 10.0 W
[codecarbon INFO @ 15:46:57] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 15:46:57] Energy consumed for All CPU : 0.002302 kWh
[codecarbon INFO @ 15:46:57] 0.002844 kWh of electricity used since the beginning.


Examining: 3700000 @ Thu May  8 15:46:57 2025
Examining: 3800000 @ Thu May  8 15:47:03 2025
Examining: 3900000 @ Thu May  8 15:47:09 2025


[codecarbon INFO @ 15:47:11] Energy consumed for RAM : 0.000583 kWh. RAM Power : 10.0 W
[codecarbon INFO @ 15:47:11] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 15:47:12] Energy consumed for All CPU : 0.002479 kWh
[codecarbon INFO @ 15:47:12] 0.003062 kWh of electricity used since the beginning.


Examining: 4000000 @ Thu May  8 15:47:14 2025
Examining: 4100000 @ Thu May  8 15:47:20 2025
Examining: 4200000 @ Thu May  8 15:47:26 2025


[codecarbon INFO @ 15:47:27] Energy consumed for RAM : 0.000625 kWh. RAM Power : 10.0 W
[codecarbon INFO @ 15:47:27] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 15:47:27] Energy consumed for All CPU : 0.002656 kWh
[codecarbon INFO @ 15:47:27] 0.003281 kWh of electricity used since the beginning.


Examining: 4300000 @ Thu May  8 15:47:32 2025
Examining: 4400000 @ Thu May  8 15:47:38 2025


[codecarbon INFO @ 15:47:42] Energy consumed for RAM : 0.000667 kWh. RAM Power : 10.0 W
[codecarbon INFO @ 15:47:42] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 15:47:42] Energy consumed for All CPU : 0.002833 kWh
[codecarbon INFO @ 15:47:42] 0.003500 kWh of electricity used since the beginning.
[codecarbon INFO @ 15:47:42] 0.010760 g.CO2eq/s mean an estimation of 339.3284895404536 kg.CO2eq/year


Examining: 4500000 @ Thu May  8 15:47:44 2025
Examining: 4600000 @ Thu May  8 15:47:50 2025
Examining: 4700000 @ Thu May  8 15:47:56 2025


[codecarbon INFO @ 15:47:57] Energy consumed for RAM : 0.000708 kWh. RAM Power : 10.0 W
[codecarbon INFO @ 15:47:57] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 15:47:57] Energy consumed for All CPU : 0.003010 kWh
[codecarbon INFO @ 15:47:57] 0.003718 kWh of electricity used since the beginning.


Examining: 4800000 @ Thu May  8 15:48:02 2025
Examining: 4900000 @ Thu May  8 15:48:08 2025


[codecarbon INFO @ 15:48:12] Energy consumed for RAM : 0.000750 kWh. RAM Power : 10.0 W
[codecarbon INFO @ 15:48:12] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 15:48:12] Energy consumed for All CPU : 0.003187 kWh
[codecarbon INFO @ 15:48:12] 0.003937 kWh of electricity used since the beginning.


Examining: 5000000 @ Thu May  8 15:48:13 2025
Finished:  5016728 @ Thu May  8 15:48:14 2025
Calculating Entropy         Thu May  8 15:48:14 2025
Lines of data     : 5016728


[codecarbon INFO @ 15:48:27] Energy consumed for RAM : 0.000791 kWh. RAM Power : 10.0 W
[codecarbon INFO @ 15:48:27] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 15:48:27] Energy consumed for All CPU : 0.003365 kWh
[codecarbon INFO @ 15:48:27] 0.004156 kWh of electricity used since the beginning.


DB Entropy        : 10.739261
Number of Classes : 26915

Feats	Vals	InfoGain	GainRatio
    1  26915	2.1906111	0.20398148
    2  26915	2.1920201	0.20411270
    3  26915	2.1954564	0.20443265
    4  26915	2.1958033	0.20446497
    5  26915	2.2009414	0.20494344
    6  26915	2.2021521	0.20505616
    7  26915	2.2081435	0.20561406
    8  26915	2.2123939	0.20600985
    9  26915	2.2217645	0.20688242
   10  26915	2.2388949	0.20847754
   11  26915	2.2593096	0.21037848
   12  26915	2.2870857	0.21296491
   13  26915	2.3497464	0.21879964
   14  26915	2.4916539	0.23201355
   15  26915	2.7844700	0.25927949
   16  26915	3.7981671	0.35367119

Preparation took 286 seconds, 345 milliseconds and 785 microseconds
Feature Permutation based on GainRatio/Values :
< 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 >
Phase 2: Building multi index on Datafile: edufineweb_train_000001-100k_tok.l16r0
Start:          0 @ Thu May  8 15:48:28 2025
Indexing:  100000 @ Thu May  8 15:48:30 2025
Indexing:  200000 @ Th

[codecarbon INFO @ 15:48:42] Energy consumed for RAM : 0.000833 kWh. RAM Power : 10.0 W
[codecarbon INFO @ 15:48:42] Delta energy consumed for CPU with constant : 0.000177 kWh, power : 42.5 W
[codecarbon INFO @ 15:48:42] Energy consumed for All CPU : 0.003542 kWh
[codecarbon INFO @ 15:48:42] 0.004375 kWh of electricity used since the beginning.


Indexing:  600000 @ Thu May  8 15:48:42 2025
Indexing:  700000 @ Thu May  8 15:48:44 2025
