# Training a GPT-2 Tokenizer on a Text Corpus

This notebook trains a GPT-2-style byte-level BPE tokenizer from scratch on a custom text corpus with the Hugging Face `tokenizers` library, loads it as a `GPT2TokenizerFast` from `transformers`, and uses it to fine-tune GPT-2 on the same corpus.

## 1. Installing Dependencies
```bash
!pip install tokenizers transformers datasets
```

## 2. Imports and Setup

In [1]:
from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer
from transformers import GPT2TokenizerFast
import os

## 3. Preparing the How2Sign Dataset

### Example with a Local File

In [2]:
import pandas as pd
from pathlib import Path

# 1) Path to the tab-separated file (despite the .csv extension)
csv_path = Path("/home/how2sign_realigned_train.csv")

# 2) We load it using tab as the separator; quoting=3 is csv.QUOTE_NONE,
#    so double quotes inside sentences are kept as literal text
df = pd.read_csv(csv_path, sep='\t', engine="python", quoting=3)

# 3) We check that the 'SENTENCE' column exists
print("Columns found:", df.columns.tolist())

# 4) We extract and clean the sentences
sentences = (
    df["SENTENCE"]
      .dropna()           # Remove missing entries
      .astype(str)        # Ensure everything is a string
      .str.strip()        # Remove leading/trailing whitespace
      .tolist()           # Convert to list of strings
)

# 5) We save one sentence per line
out = Path("sentences.txt")
out.write_text("\n".join(sentences), encoding="utf-8")
print(f"{len(sentences)} sentences written to {out}")


Columns found: ['VIDEO_ID', 'VIDEO_NAME', 'SENTENCE_ID', 'SENTENCE_NAME', 'START_REALIGNED', 'END_REALIGNED', 'SENTENCE']
31165 sentences written to sentences.txt
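
The `quoting=3` argument matters when sentences contain stray double quotes. The following is a minimal sketch using only the standard-library `csv` module; the sample row is invented for illustration and is not taken from How2Sign. It shows how the default quoting mode can swallow a tab delimiter, whereas `csv.QUOTE_NONE` keeps the fields apart. The pandas `python` engine is assumed to follow the same quoting convention.

```python
import csv

# Hypothetical tab-separated row with an unbalanced double quote (not real How2Sign data).
rows = ['clip_01\t"So today we will learn\t0.0']

# Default quoting: the quote opens a quoted field, so the following tab is read as text.
default_row = next(csv.reader(rows, delimiter="\t"))

# QUOTE_NONE (the integer 3): quote characters are ordinary text, tabs always split fields.
plain_row = next(csv.reader(rows, delimiter="\t", quoting=csv.QUOTE_NONE))

print(len(default_row), default_row)
print(len(plain_row), plain_row)
```

With the default mode the row comes back with two fields instead of three, which is the kind of silent column shift that `QUOTE_NONE` avoids.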


In [3]:
from pathlib import Path
from sklearn.model_selection import train_test_split

# 1. We load all sentences from the file
sentences = Path("sentences.txt").read_text(encoding="utf-8").splitlines()

# 2. We reserve 10% for testing, then 11% of the remaining 90% (about 10% of the total) for validation
train_and_val, test = train_test_split(
    sentences, test_size=0.10, random_state=42
)
train, val = train_test_split(
    train_and_val, test_size=0.11, random_state=42
)


# 3. We save each split into a separate file
out = Path(".")
out.joinpath("train.txt").write_text("\n".join(train), encoding="utf-8")
out.joinpath("val.txt").write_text("\n".join(val),   encoding="utf-8")
out.joinpath("test.txt").write_text("\n".join(test), encoding="utf-8")

# 4. We print the number of lines in each split
print(f"train: {len(train)} lines")
print(f"val:   {len(val)} lines")
print(f"test:  {len(test)} lines")


train: 24962 lines
val:   3086 lines
test:  3117 lines
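
The split sizes can be checked by hand. The sketch below assumes that `train_test_split` rounds the held-out share up and assigns the remainder to the other side, which reproduces the counts printed above; the variable names are illustrative only.

```python
import math

total = 31165  # number of sentences written to sentences.txt

# First split: 10% of everything goes to the test set (rounded up).
n_test = math.ceil(0.10 * total)
n_train_val = total - n_test

# Second split: 11% of what is left goes to validation (rounded up).
n_val = math.ceil(0.11 * n_train_val)
n_train = n_train_val - n_val

for name, n in [("train", n_train), ("val", n_val), ("test", n_test)]:
    print(f"{name}: {n} ({n / total:.1%} of the corpus)")
```

Taking 11% of the remaining 90% yields a validation set of roughly 9.9% of the corpus, close to the 10% test share.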


## 4. Training the Byte-Level BPE Tokenizer

We use a Byte-Level BPE tokenizer (as in GPT-2) and define:

- `vocab_size`: target size of the vocabulary (50257, the same as GPT-2)
- `min_frequency`: minimum number of times a pair must occur to be merged (2)
- `special_tokens`: special tokens added to the vocabulary (`<s>`, `<pad>`, `</s>`, `<unk>`, `<mask>`)
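
To make the roles of `vocab_size` and `min_frequency` concrete, here is a toy BPE trainer in plain Python. It is a character-level simplification written for illustration, not the byte-level algorithm implemented by `tokenizers`; the function name `train_toy_bpe`, its tie-breaking rule, and the sample words are assumptions of this sketch.

```python
from collections import Counter

def train_toy_bpe(words, vocab_size, min_frequency):
    """Learn BPE merges over a list of words (character-level toy version)."""
    # Each distinct word is stored as a tuple of symbols with its corpus count.
    corpus = Counter(tuple(w) for w in words)
    vocab = sorted({ch for w in corpus for ch in w})
    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, count in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += count
        if not pairs:
            break
        # Most frequent pair; ties are broken by comparing the pairs themselves.
        best, freq = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        if freq < min_frequency:
            break  # no remaining pair is frequent enough to merge
        merges.append(best)
        vocab.append("".join(best))
        # Apply the new merge to every word in the corpus.
        new_corpus = Counter()
        for symbols, count in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if symbols[i:i + 2] == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += count
        corpus = new_corpus
    return vocab, merges

words = "low low low lower lowest new newer".split()
vocab, merges = train_toy_bpe(words, vocab_size=50, min_frequency=2)
print(merges)
print(vocab)
```

Training stops either when the vocabulary reaches `vocab_size` or when no pair occurs at least `min_frequency` times, whichever comes first. On a small corpus the second condition usually ends training well before the vocabulary target is reached.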

In [18]:
!pip install --upgrade transformers




In [5]:
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import (
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import Dataset

# ─── 1) Build vocab only on train split ─────────────────────────────────────────
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
trainer_bpe = trainers.BpeTrainer(
    vocab_size=50257, min_frequency=2,
    special_tokens=["<s>","<pad>","</s>","<unk>","<mask>"]
)
# Train only on the training split (/content/train.txt)
tokenizer.train(files=["/content/train.txt"], trainer=trainer_bpe)
os.makedirs("tokenizer_gpt2", exist_ok=True)
tokenizer.save("tokenizer_gpt2/tokenizer.json")

# ─── 2) Prepare HF datasets from our splits ─────────────────────────────────────
def load_split(path):
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return Dataset.from_dict({"text": lines})

train_ds = load_split("/content/train.txt")
val_ds   = load_split("/content/val.txt")

tokenizer_fast = GPT2TokenizerFast.from_pretrained(
    "tokenizer_gpt2",
    unk_token="<unk>", pad_token="<pad>",
    bos_token="<s>", eos_token="</s>"
)
def tok(ex):
    return tokenizer_fast(ex["text"], truncation=True, max_length=128)
train_ds = train_ds.map(tok, batched=True, remove_columns=["text"])
val_ds   = val_ds.map(tok,   batched=True, remove_columns=["text"])

# ─── 3) Fine-tune with W&B logging ──────────────────────────────────────────────
import wandb
wandb.init(project="gpt2-finetune", name="we_run", reinit=True)

model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer_fast))

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer_fast, mlm=False
)

training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,

    num_train_epochs=10,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
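    do_train=True,
    do_eval=True,
    # eval_steps only takes effect when the evaluation strategy is "steps";
    # with the default strategy, evaluation runs only in trainer.evaluate() below
    eval_steps=500,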

    save_steps=500,
    logging_steps=100,

    learning_rate=5e-5,
    weight_decay=0.01,
    warmup_steps=200,

    report_to=["wandb"],
    run_name="we_run",
    logging_dir="./logs",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    data_collator=data_collator,
)
trainer.train()
trainer.evaluate()


W&B run summary:

| Metric | Value |
|---|---|
| train/epoch | 0.06408 |
| train/global_step | 200.0 |
| train/grad_norm | 25.36269 |
| train/learning_rate | 5e-05 |
| train/loss | 6.2795 |


| Step | Training Loss |
|---|---|
| 100 | 8.3805 |
| 200 | 6.2795 |
| 300 | 5.9518 |
| 400 | 5.8538 |
| 500 | 5.7317 |
| 600 | 5.6358 |
| 700 | 5.5436 |
| 800 | 5.4937 |
| 900 | 5.4619 |
| 1000 | 5.3286 |


{'eval_loss': 4.571997165679932,
 'eval_runtime': 5.2496,
 'eval_samples_per_second': 587.859,
 'eval_steps_per_second': 73.53,
 'epoch': 10.0}
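
The numbers above can be related to the training configuration with a little arithmetic. This sketch assumes a single device and no gradient accumulation (neither is stated in the notebook), and it interprets `eval_loss` as a mean cross-entropy in nats, so its exponential is a perplexity.

```python
import math

n_train = 24962      # training examples (size of the train split)
batch_size = 8       # per_device_train_batch_size
epochs = 10          # num_train_epochs
save_steps = 500

# Assumption: one device, no gradient accumulation.
steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

# Last checkpoint written on the regular save_steps schedule.
last_periodic_checkpoint = (total_steps // save_steps) * save_steps

eval_loss = 4.571997165679932
perplexity = math.exp(eval_loss)  # cross-entropy (nats) -> perplexity

print(steps_per_epoch, total_steps, last_periodic_checkpoint)
print(round(perplexity, 1))
```

Under these assumptions the last periodic checkpoint falls at step 31000, which is consistent with the `checkpoint-31000` directory archived in the next cell.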

In [8]:

!zip -r result_tokenizer_gpt2.zip /content/results/checkpoint-31000
from google.colab import files
files.download('result_tokenizer_gpt2.zip')

  adding: content/results/checkpoint-31000/ (stored 0%)
  adding: content/results/checkpoint-31000/model.safetensors (deflated 7%)
  adding: content/results/checkpoint-31000/rng_state.pth (deflated 25%)
  adding: content/results/checkpoint-31000/generation_config.json (deflated 24%)
  adding: content/results/checkpoint-31000/vocab.json (deflated 60%)
  adding: content/results/checkpoint-31000/optimizer.pt (deflated 9%)
  adding: content/results/checkpoint-31000/tokenizer_config.json (deflated 76%)
  adding: content/results/checkpoint-31000/config.json (deflated 51%)
  adding: content/results/checkpoint-31000/scheduler.pt (deflated 56%)
  adding: content/results/checkpoint-31000/merges.txt (deflated 58%)
  adding: content/results/checkpoint-31000/tokenizer.json (deflated 83%)
  adding: content/results/checkpoint-31000/trainer_state.json (deflated 78%)
  adding: content/results/checkpoint-31000/special_tokens_map.json (deflated 79%)
  adding: content/results/checkpoint-31000/training_arg

## 5. Loading into a `GPT2TokenizerFast`

We load the trained tokenizer through the `transformers` API so that it can be used like any other pretrained tokenizer:

In [9]:
gpt2_tokenizer = GPT2TokenizerFast.from_pretrained(
    "tokenizer_gpt2",
    unk_token="<unk>",
    pad_token="<pad>",
    bos_token="<s>",
    eos_token="</s>"
)

## 6. Testing the Tokenizer

In [13]:
from transformers import GPT2TokenizerFast

# Load the trained tokenizer
tokenizer = GPT2TokenizerFast.from_pretrained(
    "tokenizer_gpt2",
    unk_token="<unk>",
    pad_token="<pad>",
    bos_token="<s>",
    eos_token="</s>"
)

text = "Hello! This is a test to verify our trained tokenizer for com304-course."
encoded = tokenizer(text)
decoded = tokenizer.decode(encoded["input_ids"], clean_up_tokenization_spaces=True)

print("Decoded:", decoded)



Decoded: H ell o! ĠThis Ġis Ġa Ġtest Ġto Ġver ify Ġour Ġtrained Ġto ken izer Ġfor Ġcom 30 4 - cour se.
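
The `Ġ` characters in the output above come from the byte-to-character table that GPT-2-style byte-level BPE uses to make every byte printable. The sketch below reimplements that table in plain Python, following the widely published GPT-2 `bytes_to_unicode` construction; the helper names `to_visible` and `from_visible` are hypothetical and exist only for this example.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    # Remaining bytes (space, control characters, ...) are shifted above U+0100.
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte_to_char = bytes_to_unicode()
char_to_byte = {c: b for b, c in byte_to_char.items()}

def to_visible(text):
    """Text -> byte-level symbols, as seen inside the vocabulary."""
    return "".join(byte_to_char[b] for b in text.encode("utf-8"))

def from_visible(symbols):
    """Byte-level symbols -> original text (what a byte-level decoder does)."""
    return bytes(char_to_byte[c] for c in symbols).decode("utf-8")

visible = to_visible(" This is a test")
print(visible)
print(repr(from_visible(visible)))
```

The space byte (32) is not in the printable set, so it is mapped to `Ġ` (U+0120); reversing the table restores the original spaces.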


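> Note: The decoded string shows sub-word fragments (e.g. “Hello” becomes “H ell o”) and the “Ġ” marker instead of the original text. Splitting into sub-words is expected: BPE learns frequently occurring merges rather than whole words, and byte-level BPE writes the space that precedes a word as “Ġ”. The markers and extra spaces in the *decoded* output, however, are most likely due to the tokenizer built in Section 4 having no decoder attached, so `decode` joins the raw tokens with spaces instead of mapping them back to text; setting `tokenizer.decoder = decoders.ByteLevel()` (from `tokenizers`) before saving should restore the original string. “Hello” may also split more than expected because `GPT2TokenizerFast` defaults to `add_prefix_space=False`, whereas the tokenizer was trained with `add_prefix_space=True`, so a sentence-initial word is encoded without the leading-space form seen during training.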
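
One likely reason the marker survives decoding is that no decoder is attached to the tokenizer. The sketch below, which assumes the `tokenizers` library is installed and uses a tiny invented corpus rather than the How2Sign sentences, compares decoding before and after setting a `ByteLevel` decoder.

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Tiny invented corpus, only to keep the example self-contained.
corpus = ["this is a test", "this is another test", "a test of the tokenizer"] * 10

tok = Tokenizer(models.BPE(unk_token="<unk>"))
tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
trainer = trainers.BpeTrainer(vocab_size=300, special_tokens=["<unk>"])
tok.train_from_iterator(corpus, trainer=trainer)

ids = tok.encode("this is a test").ids

raw = tok.decode(ids)                # no decoder: raw tokens are joined with spaces
tok.decoder = decoders.ByteLevel()   # map byte-level symbols back to text
clean = tok.decode(ids)

print(repr(raw))
print(repr(clean))
```

Applied to the notebook, the same one-line decoder assignment would go before `tokenizer.save(...)`, so that the saved `tokenizer.json` carries the decoder and `GPT2TokenizerFast` picks it up when loading.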


---

### End of Notebook

This notebook trained a GPT-2-style tokenizer on the How2Sign sentence corpus, used it to fine-tune GPT-2, and checked its encoding and decoding on a sample sentence.