# UK-US Dialect Converter

This notebook fine-tunes a T5 transformer model to convert text from UK English to US English.

## Overview
Dialect conversion is a useful Natural Language Processing (NLP) task: it adapts spelling and vocabulary so that text reads naturally to audiences in a different region. We fine-tune a **T5 Transformer model** to convert UK English into US English.

### Objectives:
- Preprocess and analyze the dataset.
- Train a transformer-based model for dialect conversion.
- Evaluate the performance of the model.
- Implement an inference pipeline for real-time dialect translation.
- Save and deploy the model for future use.

## Setup Instructions
1. Install dependencies:
```bash
pip install torch transformers datasets pandas numpy scikit-learn matplotlib seaborn tqdm sentencepiece cmake
```
2. Ensure project files are in place.
3. Run the cells in order to train and evaluate the model.

## 1. Justification of Model Choice
We chose the **T5 Transformer model** for dialect conversion because:
- **Pretrained on text-to-text tasks, including translation**: T5 casts every task as text in, text out, and its pretraining mixture included translation (for example English to German), so rewriting a sentence into another dialect fits its interface directly.
- **Uses sentence context**: Unlike a word-by-word rule-based mapping, T5 conditions on the whole input sentence when generating each output token.
- **Flexible and scalable**: It can be fine-tuned on small datasets and scaled up with more data or a larger T5 checkpoint.

### Alternative Approaches Considered:
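- **Seq2Seq models trained from scratch**: Require more training data than fine-tuning a pretrained model and are less flexible. (T5 is itself a Seq2Seq model; its advantage here comes from pretraining.)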
- **Rule-based conversion**: Works only for predefined words, failing for complex sentence transformations.
- **GPT-based models**: More powerful but computationally expensive for a simple task like dialect conversion.
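
To make the rule-based alternative concrete, here is a minimal sketch of a word-level lookup converter. The `UK_TO_US` table and the `rule_based_convert` helper are hypothetical names introduced for illustration; they are not part of this notebook's pipeline. The sketch shows why the approach stops at predefined words: anything outside the table passes through unchanged.

```python
import re

# Hypothetical lookup table for illustration; a real one would need far more entries.
UK_TO_US = {
    "theatre": "theater",
    "colour": "color",
    "autumn": "fall",
}

def rule_based_convert(text):
    """Replace each known UK word with its US counterpart, keeping a leading capital."""
    def swap(match):
        word = match.group(0)
        replacement = UK_TO_US.get(word.lower())
        if replacement is None:
            return word  # unknown word: leave it untouched
        # Keep a leading capital, e.g. "Colour" -> "Color"
        return replacement.capitalize() if word[0].isupper() else replacement
    return re.sub(r"[A-Za-z]+", swap, text)

print(rule_based_convert("The colour of the autumn leaves was beautiful."))
print(rule_based_convert("She bought some sweets at the shop."))
```

The second sentence is returned unchanged because none of its words appear in the table, which is the limitation described in the list above.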

In [8]:
%pip install torch transformers datasets pandas numpy scikit-learn matplotlib seaborn tqdm sentencepiece


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.3.1[0m[39;49m -> [0m[32;49m25.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython3 -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


## 2. Setup and Imports

In [9]:
import os
import torch
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
from transformers import T5Tokenizer, T5ForConditionalGeneration

os.makedirs('../data/raw', exist_ok=True)
os.makedirs('../models', exist_ok=True)
print("Setup complete.")

Setup complete.


## 3. Data Loading and Preprocessing

In [10]:
# Creating a small dataset of UK to US English translations
sample_data = pd.DataFrame({
    'uk_text': [
        'I went to the theatre yesterday.',
        'The colour of the autumn leaves was beautiful.'
    ],
    'us_text': [
        'I went to the theater yesterday.',
        'The color of the fall leaves was beautiful.'
    ]
})

# Saving the dataset
sample_data.to_csv('../data/raw/sample_data.csv', index=False)
print("Sample dataset saved.")

Sample dataset saved.
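
Two pairs are enough to exercise the pipeline but leave nothing to validate on. The following is a sketch of how a larger CSV with the same `uk_text` and `us_text` columns might be cleaned and split. It assumes the standard library only (the notebook itself uses pandas), and `load_pairs` and `split_pairs` are hypothetical helpers.

```python
import csv
import io
import random

def load_pairs(csv_file):
    """Read (uk_text, us_text) pairs, dropping empty rows and exact duplicates."""
    seen, pairs = set(), []
    for row in csv.DictReader(csv_file):
        pair = (row["uk_text"].strip(), row["us_text"].strip())
        if all(pair) and pair not in seen:
            seen.add(pair)
            pairs.append(pair)
    return pairs

def split_pairs(pairs, val_fraction=0.2, seed=42):
    """Shuffle deterministically and hold out a validation slice (at least one pair)."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]

# In-memory stand-in for '../data/raw/sample_data.csv', with one duplicated row.
demo_csv = io.StringIO(
    "uk_text,us_text\n"
    "I went to the theatre yesterday.,I went to the theater yesterday.\n"
    "The colour of the autumn leaves was beautiful.,The color of the fall leaves was beautiful.\n"
    "I went to the theatre yesterday.,I went to the theater yesterday.\n"
)
pairs = load_pairs(demo_csv)
train_pairs, val_pairs = split_pairs(pairs)
print(len(pairs), len(train_pairs), len(val_pairs))
```

With a real file, `open('../data/raw/sample_data.csv', newline='')` would replace the `io.StringIO` object.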


## 4. Model Selection and Implementation
We use **T5-small**, a pretrained transformer checkpoint, which we then fine-tune for our dialect conversion task.

In [11]:
model_name = "t5-small"
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)
print("Model loaded and ready.")

Model loaded and ready.


## 5. Training the Model
We fine-tune the T5 model by providing UK text as input and US text as the expected output.

In [12]:
# transformers.AdamW is deprecated; its warning recommends torch.optim.AdamW instead
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

def train_model(train_data, epochs=3):
    """Fine-tune on (uk_text, us_text) rows, one example per optimizer step."""
    model.train()
    for epoch in range(epochs):
        for idx, row in train_data.iterrows():
            # The task prefix must match the one used at inference time
            input_text = "translate English to US: " + row['uk_text']
            target_text = row['us_text']

            inputs = tokenizer(input_text, return_tensors='pt', padding=True, truncation=True, max_length=128)
            targets = tokenizer(target_text, return_tensors='pt', padding=True, truncation=True, max_length=128)

            # Move the encoded input and the target token ids to the model's device
            inputs = {k: v.to(device) for k, v in inputs.items()}
            labels = targets['input_ids'].to(device)

            # Passing labels makes the model compute the cross-entropy loss itself
            outputs = model(**inputs, labels=labels)

            loss = outputs.loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

            if idx % 2 == 0:
                print(f"Epoch {epoch+1}, Step {idx}, Loss: {loss.item()}")

train_model(sample_data)



Epoch 1, Step 0, Loss: 0.9991738200187683
Epoch 2, Step 0, Loss: 0.8417315483093262
Epoch 3, Step 0, Loss: 1.0190391540527344
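
The three logged losses do not trend downward, which is unsurprising after only six single-example updates. If the loop is later extended to mini-batches, padded label positions should not contribute to the loss; Hugging Face seq2seq models skip label positions set to `-100`. Below is a minimal sketch of that masking step on plain Python lists. `mask_padding` is a hypothetical helper, the token ids are made up, and the default pad id of `0` matches the T5 tokenizer's pad token.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss in Hugging Face models

def mask_padding(batch_label_ids, pad_token_id=0):
    """Return a copy of the label batch with pad positions replaced by IGNORE_INDEX."""
    return [
        [IGNORE_INDEX if token_id == pad_token_id else token_id for token_id in sequence]
        for sequence in batch_label_ids
    ]

# Two target sequences padded to the same length (token ids are illustrative only).
labels = [
    [27, 1437, 12, 1],
    [37, 1, 0, 0],
]
print(mask_padding(labels))
```

With tensors, the equivalent is `labels[labels == tokenizer.pad_token_id] = -100` before passing `labels` to the model.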


## 6. Model Evaluation and Inference

In [13]:
def convert_dialect(text):
    model.eval()
    input_text = "translate English to US: " + text
    inputs = tokenizer(input_text, return_tensors='pt', padding=True, truncation=True, max_length=128)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    with torch.no_grad():
        output_ids = model.generate(**inputs)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

test_sentence = "She bought some sweets at the shop."
us_translation = convert_dialect(test_sentence)
print(f"UK: {test_sentence}")
print(f"US: {us_translation}")

UK: She bought some sweets at the shop.
US: Sie kaufte einige Süßigkeiten im Laden.
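
The output above is German, not US English. A likely cause (an assumption, not verified here) is that the prefix `translate English to US:` closely resembles `translate English to German:`, one of the task prefixes T5 was pretrained with, and six training steps are too few to override that behaviour. The notebook has no quantitative evaluation yet, so here is a minimal sketch of an exact-match metric. `exact_match_accuracy` is a hypothetical helper; it takes the conversion function as an argument so it can be tried with a stub, and in the notebook one would pass `convert_dialect` and held-out pairs.

```python
from typing import Callable, Iterable, Tuple

def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial differences are not counted as errors."""
    return " ".join(text.lower().split())

def exact_match_accuracy(
    convert: Callable[[str], str],
    pairs: Iterable[Tuple[str, str]],
) -> float:
    """Fraction of (uk_text, us_text) pairs whose conversion equals the reference."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one evaluation pair")
    hits = sum(normalize(convert(uk)) == normalize(us) for uk, us in pairs)
    return hits / len(pairs)

# Stub converter standing in for convert_dialect; it only knows one spelling change.
def stub_convert(text: str) -> str:
    return text.replace("theatre", "theater")

eval_pairs = [
    ("I went to the theatre yesterday.", "I went to the theater yesterday."),
    ("The colour of the autumn leaves was beautiful.", "The color of the fall leaves was beautiful."),
]
print(exact_match_accuracy(stub_convert, eval_pairs))
```

Exact match is strict: one wrong word fails the whole sentence. That suits dialect conversion, where most of the sentence should be copied verbatim.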


## 7. Model Saving & Deployment
We save the trained model for future inference and deployment.

In [14]:
model_save_path = "../models/t5_dialect_converter"
model.save_pretrained(model_save_path)
tokenizer.save_pretrained(model_save_path)
print(f"Model saved to {model_save_path}")

Model saved to ../models/t5_dialect_converter
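
For later inference, the saved directory can be loaded back with `from_pretrained`. The sketch below wraps this in a hypothetical `load_converter` helper that fails early if the directory is missing. The `transformers` import is deferred into the function so that the path check runs first.

```python
from pathlib import Path

def load_converter(model_dir="../models/t5_dialect_converter", device="cpu"):
    """Load the tokenizer and model written by save_pretrained and switch to eval mode."""
    path = Path(model_dir)
    if not path.is_dir():
        raise FileNotFoundError(f"No saved model at {path}; run the saving cell first.")
    # Deferred import: only needed once the directory is known to exist.
    from transformers import T5ForConditionalGeneration, T5Tokenizer
    tokenizer = T5Tokenizer.from_pretrained(str(path))
    model = T5ForConditionalGeneration.from_pretrained(str(path)).to(device)
    model.eval()
    return tokenizer, model
```

A fresh session would call `tokenizer, model = load_converter()` and then reuse the inference logic from section 6.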


## 8. Conclusion
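- We implemented an end-to-end T5-based pipeline for UK-US dialect conversion: data preparation, fine-tuning, inference, and saving.
- The model was trained on only two sentence pairs, which is not enough to learn the task: the loss did not decrease steadily and the test sentence came back in German. The same pipeline can be fine-tuned on larger datasets.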
- Future improvements include:
  - Using a larger dataset for training.
  - Fine-tuning other generative transformer models such as GPT or BART.
  - Deploying the model as a web-based API.