# Training a Model From Scratch

## Dataset

The dataset was cleaned and uploaded to the Hugging Face Hub: `thisisfrantz/haitian-creole-english-train` holds the train set and `thisisfrantz/haitian-creole-english-test` holds the test set.

"koman _" -> "koman ou ye"

In [1]:
from datasets import load_dataset

dataset = load_dataset("thisisfrantz/haitian-creole-english-train")

Generating train split: 100%|██████████| 10813/10813 [00:00<00:00, 400935.39 examples/s]


In [2]:
print(dataset)
print(dataset['train'][0])

DatasetDict({
    train: Dataset({
        features: ['id', 'lang1', 'lang2'],
        num_rows: 10813
    })
})
{'id': 3042, 'lang1': 'Lidè Kiben an te di, ata John Kennedy dwe cheche fason pou kontoune anbago a.', 'lang2': 'Even John F. Kennedy had to find a way around the embargo, the Cuban leader said.'}
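
Before tokenizing, it can help to confirm that every pair has text on both sides. The sketch below is a plain-Python check over records shaped like the one printed above (`id`, `lang1`, `lang2`). `find_incomplete_pairs` is a hypothetical helper, not part of `datasets`, and the second record is invented for illustration.

```python
def find_incomplete_pairs(records):
    """Return the ids of records whose source or target text is empty or whitespace-only."""
    bad_ids = []
    for record in records:
        source = (record.get("lang1") or "").strip()
        target = (record.get("lang2") or "").strip()
        if not source or not target:
            bad_ids.append(record["id"])
    return bad_ids


# First record copied from the output above; the second is an invented bad row.
sample = [
    {
        "id": 3042,
        "lang1": "Lidè Kiben an te di, ata John Kennedy dwe cheche fason pou kontoune anbago a.",
        "lang2": "Even John F. Kennedy had to find a way around the embargo, the Cuban leader said.",
    },
    {"id": -1, "lang1": "koman ou ye", "lang2": "   "},
]
print(find_incomplete_pairs(sample))
```

Iterating over `dataset['train']` yields dicts of this shape, so the same function could be given the real split.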


## Load Pretrained Tokenizer

I wanted to train a custom tokenizer but don't have enough data :( so I'm loading the pretrained `Helsinki-NLP/opus-mt-ht-en` tokenizer instead.

In [4]:
from transformers import MarianTokenizer

tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-ht-en")

In [7]:

example = dataset['train'][0]

source_text = example['lang1']
target_text = example['lang2']
print("Source text:", source_text)
print("Target text:", target_text)

# Calling the tokenizer returns a BatchEncoding (input_ids + attention_mask), not a list of tokens
source_encoding = tokenizer(source_text)
# input_ids already ends with the end-of-sequence id that the tokenizer appends
source_ids = source_encoding['input_ids']

# Tokenize target (as target tokenizer)
with tokenizer.as_target_tokenizer():
    target_tokens = tokenizer.tokenize(target_text)
    target_ids = tokenizer.convert_tokens_to_ids(target_tokens)

print("\nSource Encoding:", source_encoding)
print("Source Token IDs:", source_ids)

print("\nTarget Tokens:", target_tokens)
print("Target Token IDs:", target_ids)

Source text: Lidè Kiben an te di, ata John Kennedy dwe cheche fason pou kontoune anbago a.
Target text: Even John F. Kennedy had to find a way around the embargo, the Cuban leader said.

Source Encoding: {'input_ids': [116, 16401, 61, 14693, 32, 7, 48, 2, 8611, 424, 29389, 9732, 108, 20212, 113, 14, 300, 2556, 4424, 276, 5887, 8, 3, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Source Token IDs: [116, 16401, 61, 14693, 32, 7, 48, 2, 8611, 424, 29389, 9732, 108, 20212, 113, 14, 300, 2556, 4424, 276, 5887, 8, 3, 0]

Target Tokens: ['▁Even', '▁John', '▁F', '.', '▁Kenne', 'dy', '▁had', '▁to', '▁find', '▁a', '▁way', '▁around', '▁the', '▁emb', 'ar', 'go', ',', '▁the', '▁Cuba', 'n', '▁leader', '▁said', '.']
Target Token IDs: [871, 424, 1316, 3, 29389, 9732, 129, 10, 504, 8, 222, 1293, 6, 30292, 4423, 5887, 2, 6, 22471, 430, 5026, 260, 3]




In [8]:
# Tokenize the whole dataset
def tokenize_function(examples):
    # With batched=True, `examples` is a dict of lists (one list per column).
    # text_target tokenizes the English side as targets and stores the ids under 'labels'.
    return tokenizer(
        examples['lang1'],
        text_target=examples['lang2'],
        truncation=True,
        padding='max_length',
        max_length=128,
    )

tokenized_dataset = dataset.map(tokenize_function, batched=True)

Map: 100%|██████████| 10813/10813 [00:04<00:00, 2194.66 examples/s]
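
Because every target is padded to 128 tokens, `labels` ends up containing many padding ids. If the training loss follows the PyTorch `CrossEntropyLoss` default of `ignore_index=-100` (an assumption, since the training code is not shown here), a common follow-up is to replace those padding ids with `-100` so they don't contribute to the loss. A minimal sketch in plain Python: `mask_padding` is a hypothetical helper, and `99` is a placeholder for the real pad id (in practice `tokenizer.pad_token_id`).

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def mask_padding(label_batch, pad_id, ignore_index=IGNORE_INDEX):
    """Replace pad ids in each label sequence so the loss can skip them."""
    return [
        [ignore_index if token_id == pad_id else token_id for token_id in labels]
        for labels in label_batch
    ]


# Toy batch: 99 stands in for the real pad id.
demo_labels = [[871, 424, 3, 99, 99], [129, 10, 99, 99, 99]]
masked = mask_padding(demo_labels, pad_id=99)
print(masked)
```

In the mapping function, this would be applied to the label ids before they are stored under `labels`.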


## DataLoader

In [10]:
from torch.utils.data import DataLoader

# PyTorch Format
tokenized_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

# Create DataLoader
train_loader = DataLoader(tokenized_dataset['train'], batch_size=8, shuffle=True)
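
Padding every example to 128 tokens up front is simple, but it spends compute on padding when most sentences are short. An alternative is to pad each batch only to its longest sequence. The sketch below shows the idea in plain Python, assuming the examples were tokenized without `padding='max_length'` and are still Python lists. `pad_batch` is a hypothetical helper and `99` is a placeholder pad id. In a real pipeline the lists would still need converting to tensors, and `transformers` provides `DataCollatorForSeq2Seq` for the same job.

```python
def pad_batch(features, pad_id, label_pad_id=-100):
    """Pad a list of unpadded examples to the longest sequence in the batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        input_gap = max_input - len(f["input_ids"])
        label_gap = max_label - len(f["labels"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * input_gap)
        # 1 marks real tokens, 0 marks padding
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * input_gap)
        # Labels are padded with label_pad_id so a loss with ignore_index=-100 skips them
        batch["labels"].append(f["labels"] + [label_pad_id] * label_gap)
    return batch


# Toy examples with made-up ids; 99 stands in for the real pad id.
demo = [
    {"input_ids": [116, 61, 0], "labels": [871, 0]},
    {"input_ids": [424, 0], "labels": [129, 10, 504, 0]},
]
padded = pad_batch(demo, pad_id=99)
print(padded["input_ids"])
print(padded["attention_mask"])
print(padded["labels"])
```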