# Tokenize and Prep

This notebook loads the cleaned text data (`raw_data/cleaned_combined.txt`), tokenizes it **line by line** with the GPT-2 tokenizer, and concatenates all token IDs into a single flat sequence.

It then splits the sequence into fixed-length blocks of 256 tokens using a sliding window with a stride of 128 tokens, so each training sample shares 128 tokens with the next one. Trailing tokens that do not fill a complete block are dropped.
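To make the overlap concrete, here is a toy sketch of the same windowing on ten integer "tokens", with a block size of 4 and a stride of 2 standing in for 256 and 128. The helper name `sliding_windows` is illustrative and does not appear in the notebook.

```python
def sliding_windows(tokens, block_size, stride):
    """Return the full-length windows of `tokens`, stepping by `stride`."""
    last_start = len(tokens) - block_size
    return [tokens[i:i + block_size] for i in range(0, last_start + 1, stride)]

toy_tokens = list(range(10))
windows = sliding_windows(toy_tokens, block_size=4, stride=2)
for w in windows:
    print(w)
# [0, 1, 2, 3]
# [2, 3, 4, 5]
# [4, 5, 6, 7]
# [6, 7, 8, 9]
```

Each window repeats the second half of the previous one, which is the overlap the 256/128 setting produces at full scale.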

The final dataset is saved to `tokenized_data/cooper_tokenized_dataset/` in Hugging Face `datasets` format, ready for use during Cooper Model training.


In [1]:
from transformers import GPT2TokenizerFast
from datasets import Dataset
import os
from tqdm import tqdm

## Load GPT-2 Tokenizer

In [2]:
# Load the pretrained GPT-2 tokenizer
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
# GPT-2 defines no padding token, so reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token

## Load and Tokenize Cleaned Text

In [3]:
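# Tokenize the entire dataset line by line before applying the sliding window
block_size = 256
stride = 128
tokens = []

with open("raw_data/cleaned_combined.txt", "r", encoding="utf-8") as f:
    for line in tqdm(f, desc="Tokenizing lines"):
        # No truncation and no special tokens: long lines are kept whole here
        # and are only split into 256-token blocks by the sliding window later
        ids = tokenizer(
            line,
            return_tensors=None,
            truncation=False,
            add_special_tokens=False
        )["input_ids"]

        # Defensive: flatten if the tokenizer returned a nested list
        if ids and isinstance(ids[0], list):
            ids = [item for sublist in ids for item in sublist]

        tokens.extend(ids)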

Tokenizing lines: 0it [00:00, ?it/s]
Token indices sequence length is longer than the specified maximum sequence length for this model (8026 > 1024). Running this sequence through the model will result in indexing errors
Tokenizing lines: 41094it [03:48, 180.15it/s]
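The loop above calls the tokenizer once per line. Hugging Face tokenizers also accept a list of strings and return one ID list per string, so grouping lines into batches may reduce per-call overhead. The sketch below shows one way to structure that; `batched_lines`, `tokenize_in_batches`, and the `encode_batch` callable are hypothetical names, the stand-in encoder is for demonstration only, and no speed-up has been measured here.

```python
from itertools import islice

def batched_lines(lines, batch_size):
    """Yield lists of up to `batch_size` lines from any iterable of strings."""
    it = iter(lines)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def tokenize_in_batches(lines, encode_batch, batch_size=1000):
    """Flatten the token IDs of all lines, calling `encode_batch` once per batch.

    `encode_batch` is an assumed callable: list[str] -> list[list[int]].
    """
    tokens = []
    for batch in batched_lines(lines, batch_size):
        for ids in encode_batch(batch):
            tokens.extend(ids)
    return tokens

# Stand-in encoder for demonstration only: one "token" per character code.
def fake_encode_batch(batch):
    return [[ord(ch) for ch in line] for line in batch]

demo = tokenize_in_batches(["ab\n", "c\n"], fake_encode_batch, batch_size=1)
print(demo)  # [97, 98, 10, 99, 10]
```

With the notebook's tokenizer, `encode_batch` could be `lambda batch: tokenizer(batch, add_special_tokens=False)["input_ids"]`, and `lines` could be the open file handle.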


In [4]:
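# Apply a sliding window of 256 tokens with a 128-token stride
def chunk_tokens(tokens, block_size, stride):
    # Only full-length windows are kept; leftover tokens at the end are dropped
    last_start = len(tokens) - block_size
    return [tokens[i:i + block_size] for i in range(0, last_start + 1, stride)]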

chunks = chunk_tokens(tokens, block_size, stride)
print(f"Number of samples: {len(chunks)}")

Number of samples: 1213568
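The sample count follows directly from the token count: for N tokens, a sliding window yields floor((N - block_size) / stride) + 1 full windows when N is at least block_size, and none otherwise. A minimal sketch of that arithmetic, using the illustrative helper names `count_windows` and `dropped_tail` (neither is part of the notebook):

```python
def count_windows(num_tokens, block_size, stride):
    """Number of full windows a sliding window produces over `num_tokens` tokens."""
    if num_tokens < block_size:
        return 0
    return (num_tokens - block_size) // stride + 1

def dropped_tail(num_tokens, block_size, stride):
    """Tokens after the end of the last full window, which appear in no sample."""
    n = count_windows(num_tokens, block_size, stride)
    if n == 0:
        return num_tokens
    covered = (n - 1) * stride + block_size
    return num_tokens - covered

print(count_windows(1000, 256, 128))  # 6
print(dropped_tail(1000, 256, 128))   # 104
```

In the notebook, `count_windows(len(tokens), block_size, stride)` should equal `len(chunks)`, which gives a quick consistency check before saving.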


## Wrap as Dataset

In [5]:
dataset = Dataset.from_dict({"input_ids": chunks})
dataset = dataset.map(lambda e: {"attention_mask": [1] * len(e["input_ids"])})

os.makedirs("tokenized_data", exist_ok=True)
dataset.save_to_disk("tokenized_data/cooper_tokenized_dataset")

print(f"Saved {len(dataset):,} training samples to tokenized_data/cooper_tokenized_dataset")

Saved 1,213,568 training samples to tokenized_data/cooper_tokenized_dataset
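The `map` call above runs the lambda once per example. `Dataset.map` also accepts `batched=True`, in which case the function receives a dict of columns (lists) rather than a single row. The sketch below shows a batch-wise version of the mask function; `add_attention_mask` is a hypothetical name, and whether it is faster on this dataset has not been measured.

```python
def add_attention_mask(batch):
    """Build an all-ones attention mask for every sample in a batch.

    `batch` is a dict of columns, as passed by `Dataset.map(..., batched=True)`.
    Every chunk is a full block with no padding, so each position is attended to.
    """
    return {"attention_mask": [[1] * len(ids) for ids in batch["input_ids"]]}

# Plain-dict demonstration; the `datasets` library is not needed to run this part.
toy_batch = {"input_ids": [[10, 11, 12], [20, 21, 22]]}
print(add_attention_mask(toy_batch))
# {'attention_mask': [[1, 1, 1], [1, 1, 1]]}
```

In the notebook this would be applied as `dataset = dataset.map(add_attention_mask, batched=True)`.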


## Check

In [6]:
from datasets import load_from_disk

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
dataset = load_from_disk("tokenized_data/cooper_tokenized_dataset")

for i in range(3):
    sample = dataset[i]["input_ids"]
    print(tokenizer.decode(sample))
    print("=" * 80)


Anarchism is a political philosophy and movement that is sceptical of authority and rejects all involuntary, coercive forms of hierarchy. Anarchism calls for the abolition of the state, which it holds to be unnecessary, undesirable, and harmful. As a historically left-wing movement, placed on the farthest left of the political spectrum, it is usually described alongside communalism and libertarian Marxism as the libertarian wing (libertarian socialism) of the socialist movement, and has a strong historical association with anti-capitalism and socialism. Humans lived in societies without formal hierarchies long before the establishment of formal states, realms, or empires. With the rise of organised hierarchical bodies, scepticism toward authority also rose. Although traces of anarchist thought are found throughout history, modern anarchism emerged from the Enlightenment. During the latter half of the 19th and the first decades of the 20th century, the anarchist movement flourished in m
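Decoding a few samples confirms that the text is readable. A complementary check is that consecutive samples really overlap: with a stride of 128, the last 128 tokens of one sample should equal the first 128 tokens of the next. A minimal sketch on toy data, where `overlaps_correctly` is a hypothetical helper:

```python
def overlaps_correctly(first, second, stride):
    """True if `second` starts exactly `stride` tokens after `first` begins."""
    return first[stride:] == second[:len(first) - stride]

# Toy samples with block size 4 and stride 2 (stand-ins for 256 and 128).
sample_0 = [0, 1, 2, 3]
sample_1 = [2, 3, 4, 5]
print(overlaps_correctly(sample_0, sample_1, stride=2))      # True
print(overlaps_correctly(sample_0, [9, 9, 4, 5], stride=2))  # False
```

On the saved dataset, the equivalent call would be `overlaps_correctly(dataset[0]["input_ids"], dataset[1]["input_ids"], stride=128)`.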