<a href="https://colab.research.google.com/github/Vritigarg/Dialect-Conversion-from-UK-to-US/blob/main/Dialect_Conversion_from_UK_to_US.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import torch
import pandas as pd
from transformers import T5ForConditionalGeneration, T5Tokenizer, AdamW
from sklearn.model_selection import train_test_split
from torch.utils.data import Dataset, DataLoader

In [2]:
df = pd.read_csv("CozmoX Assignment Dataset.csv")  # Replace with actual dataset path

In [3]:
# Preprocessing
def preprocess_text(text):
    return text.strip()

df['input_text'] = df['input_text'].apply(preprocess_text)
df['target_text'] = df['target_text'].apply(preprocess_text)

# Train-test split
train_data, val_data = train_test_split(df, test_size=0.2, random_state=42)


In [13]:
# Dataset Class
class DialectDataset(Dataset):
    def __init__(self, data, tokenizer, max_length=128):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        uk_text = self.data.iloc[idx]['input_text']
        us_text = self.data.iloc[idx]['target_text']

        inputs = self.tokenizer("translate British English to American English: " + uk_text.lower(), max_length=self.max_length, padding='max_length', truncation=True, return_tensors="pt")
        targets = self.tokenizer(us_text, max_length=self.max_length, padding='max_length', truncation=True, return_tensors="pt")

        labels = targets['input_ids'].squeeze().clone()
        labels[labels == self.tokenizer.pad_token_id] = -100

        return {
            'input_ids': inputs['input_ids'].squeeze(),
            'attention_mask': inputs['attention_mask'].squeeze(),
            'labels': labels
        }

In [5]:
# Load tokenizer & dataset
tokenizer = T5Tokenizer.from_pretrained("t5-small")
train_dataset = DialectDataset(train_data, tokenizer)
val_dataset = DialectDataset(val_data, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/2.32k [00:00<?, ?B/s]

spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.39M [00:00<?, ?B/s]

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565


In [6]:
# Load Model
model = T5ForConditionalGeneration.from_pretrained("t5-small").to("cuda" if torch.cuda.is_available() else "cpu")
optimizer = AdamW(model.parameters(), lr=3e-5)

config.json:   0%|          | 0.00/1.21k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/242M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/147 [00:00<?, ?B/s]



In [7]:
# Training Loop
def train(model, train_loader, val_loader, optimizer, epochs=10):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.train()
    for epoch in range(epochs):
        total_loss = 0
        for batch in train_loader:
            optimizer.zero_grad()
            inputs = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(input_ids=inputs, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        print(f"Epoch {epoch+1}, Training Loss: {total_loss/len(train_loader)}")
        validate(model, val_loader)

In [8]:
# Validation Function
def validate(model, dataloader):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for batch in dataloader:
            inputs = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(input_ids=inputs, attention_mask=attention_mask, labels=labels)
            total_loss += outputs.loss.item()

    print(f"Validation Loss: {total_loss / len(dataloader)}")

In [14]:
# Inference Function
def infer(text, model, tokenizer):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.eval()
    inputs = tokenizer("translate British English to American English: " + text.lower(), return_tensors="pt", padding=True, truncation=True).to(device)
    outputs = model.generate(**inputs, max_length=50, num_beams=5)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

In [10]:
# Training the Model
train(model, train_loader, val_loader, optimizer, epochs=10)

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.48.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


Epoch 1, Training Loss: 1.558734154701233
Validation Loss: 1.3084328770637512
Epoch 2, Training Loss: 1.3392638206481933
Validation Loss: 1.0667398571968079
Epoch 3, Training Loss: 1.0689674735069274
Validation Loss: 0.8249203860759735
Epoch 4, Training Loss: 0.7958863377571106
Validation Loss: 0.6437747329473495
Epoch 5, Training Loss: 0.6264368891716003
Validation Loss: 0.5321972966194153
Epoch 6, Training Loss: 0.4968629777431488
Validation Loss: 0.4516753554344177
Epoch 7, Training Loss: 0.39591753780841826
Validation Loss: 0.3897443860769272
Epoch 8, Training Loss: 0.32865403294563295
Validation Loss: 0.3427242562174797
Epoch 9, Training Loss: 0.26913669109344485
Validation Loss: 0.3055140934884548
Epoch 10, Training Loss: 0.22651545703411102
Validation Loss: 0.27586475014686584


In [15]:
# Example Inference
sample_text = "I colour the centre of my favourite book."
converted_text = infer(sample_text, model, tokenizer)
print(f"Converted Text: {converted_text}")

Converted Text: i color the center of my favorite book.


# README Section
README = """
### Instructions to Run the Notebook
1. Install dependencies using `pip install torch transformers pandas scikit-learn`.
2. Ensure the dataset is available at the specified path.
3. Run the notebook sequentially to train the model and perform inference.

### Dependencies and Installation
- Python 3.8+
- PyTorch
- Transformers (Hugging Face)
- Pandas
- scikit-learn

### Known Limitations and Potential Improvements
- The model may require fine-tuning on a larger dataset for better accuracy.
- Beam search parameters can be adjusted to improve text fluency.
- Limited training data may result in suboptimal translations.

### Handling Time Constraints
- Reduce the number of epochs for faster training.
- Use a smaller dataset subset for quick testing.
"""
print(README)

Why are we Using T5 Transformer?

I am using T5 because it is designed to handle tasks like text translation and text transformation. It works by treating everything as a text-to-text problem. In your case, you are turning UK English into US English, and T5 is good for this because:

    1.  It can easily handle tasks like translation (in your case, dialect conversion).
    2. It has been trained on lots of data and can generalize to different language tasks.
    3. The model works by taking an input text (UK English) and producing an output text (US English) using its encoder-decoder architecture.

In short, T5 is a good fit because it’s flexible, powerful, and trained for tasks like the one you're working on.