
# DSN Bootcamp 2024 Hackathon
This notebook is a starter guide for participants in the ML track of the DSN AI Bootcamp 2024 In-house Hackathon. It was written because the original starter notebook offered little guidance.

## Setup

### Mount Drive

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Install Required Libraries

In [2]:
%%capture
!pip install --upgrade datasets
!pip install evaluate
!pip install sacrebleu
!pip install git+https://github.com/csebuetnlp/normalizer
!pip install sentencepiece

### Importing the libraries

In [3]:
import pandas as pd
import torch
import unicodedata
from datasets import Dataset
from transformers import MT5ForConditionalGeneration, T5Tokenizer, Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq, AutoTokenizer
import evaluate
import os

In [None]:
bleu = evaluate.load("sacrebleu")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



## Loading the Dataset

In [4]:
input_dir = "/content/drive/MyDrive/Machine Learning/"

train_dataset = pd.read_csv(f'{input_dir}Data TSV/train.tsv', sep='\t', names=['English', 'Yoruba'])
val_dataset = pd.read_csv(f'{input_dir}Data TSV/dev.tsv', sep='\t', names=['English', 'Yoruba'])
test_dataset = pd.read_csv(f'{input_dir}Data TSV/test.tsv', sep='\t', names=['English', 'Yoruba'])

In [5]:
# show the loaded datasets
print("Train:\n")
display(train_dataset.head())

print("\nValidation:\n")
display(val_dataset.head())

print("\nTest:\n")
display(test_dataset.head())

Train:



| | English | Yoruba |
|---|---|---|
| 0 | English | Yoruba |
| 1 | Unit 1: What is Creative Commons? | Ìdá 1: Kín ni Creative Commons? |
| 2 | This work is licensed under a Creative Commons... | Iṣẹ́ yìí wà lábẹ́ àṣẹ Creative Commons Attribu... |
| 3 | Creative Commons is a set of legal tools, a no... | Creative Commons jẹ́ àwọn ọ̀kan-ò-jọ̀kan ohun-... |
| 4 | Creative Commons began in response to an outda... | Creative Commons bẹ̀rẹ̀ láti wá wọ̀rọ̀kọ̀ fi ṣ... |



Validation:



| | English | Yoruba |
|---|---|---|
| 0 | English | Yoruba |
| 1 | We prepare the saddle, and the goat presents i... | A di gàárì sílẹ̀ ewúrẹ́ ń yọjú; ẹrù ìran rẹ̀ ni? |
| 2 | You have been crowned a king, and yet you make... | A fi ọ́ jọba ò ń ṣàwúre o fẹ́ jẹ Ọlọ́run ni? |
| 3 | By dancing we take possession of Awà; through ... | A fijó gba Awà; a fìjà gba Awà; bí a ò bá jó, ... |
| 4 | We lift a saddle and the goat (kin) scowls; it... | A gbé gàárì ọmọ ewúrẹ́ ń rojú; kì í ṣe ẹrù àgù... |



Test:



| | English | Yoruba |
|---|---|---|
| 0 | English | |
| 1 | Pending the time she would finally pack and go... | |
| 2 | She knew how best she was going to take care o... | |
| 3 | Alamu Should learn to look after himself. | |
| 4 | His old Mama should not come back again and be... | |
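
Row 0 of each table above holds the strings `English` and `Yoruba` because the column names are supplied explicitly while the files already start with a header line. The sketch below reproduces the same effect with the standard-library `csv` module instead of pandas. The two-line sample is a stand-in built from the first training pair, not the real file.

```python
import csv
import io

# Two-line stand-in for train.tsv: a header line plus the first training pair.
sample_tsv = "English\tYoruba\nUnit 1: What is Creative Commons?\tÌdá 1: Kín ni Creative Commons?\n"

# Supplying field names tells the reader the file has no header line,
# so the real header comes back as the first data row.
with_names = list(csv.DictReader(io.StringIO(sample_tsv),
                                 fieldnames=["English", "Yoruba"],
                                 delimiter="\t"))

# Letting the reader consume the first line as the header avoids the stray row.
from_header = list(csv.DictReader(io.StringIO(sample_tsv), delimiter="\t"))

print(len(with_names), with_names[0])
print(len(from_header), from_header[0])
```

In pandas, passing `header=0` together with `names=` is the documented way to replace an existing header line, which would make a separate header-removal step unnecessary.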


In [8]:
def clean_data(data):
    # Remove the header row that was read in as the first data row
    cleaned_data = data.iloc[1:].reset_index(drop=True)
    return cleaned_data

cleaned_train_data = clean_data(train_dataset)
cleaned_test_data = clean_data(test_dataset)
cleaned_val_data = clean_data(val_dataset)

def normalize_unicode(text):
    return unicodedata.normalize('NFKD', text)

cleaned_train_data['Yoruba'] = cleaned_train_data['Yoruba'].apply(normalize_unicode)
cleaned_val_data['Yoruba'] = cleaned_val_data['Yoruba'].apply(normalize_unicode)
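
NFKD splits every accented letter into a base letter followed by combining marks, so normalized Yoruba strings contain more code points than they appear to. The sketch below compares NFKD with the composed form NFC on the word "Iṣẹ́" from the training data. Which form suits a given tokenizer better is not established in this notebook. Whichever is chosen should be applied consistently to training targets, predictions, and references.

```python
import unicodedata

# "Iṣẹ́", written with escapes so the exact code points are unambiguous.
word = "I\u1e63\u1eb9\u0301"

nfc = unicodedata.normalize("NFC", word)    # composed form
nfkd = unicodedata.normalize("NFKD", word)  # decomposed form, as used above

print(len(nfc), len(nfkd))                        # 4 6
print(nfc == nfkd)                                # False: same rendering, different code points
print(unicodedata.normalize("NFC", nfkd) == nfc)  # True: the two forms are interconvertible
```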



In [10]:
# show the cleaned datasets
print("Train:\n")
display(cleaned_train_data.head())

print("\nValidation:\n")
display(cleaned_val_data.head())



Train:



| | English | Yoruba |
|---|---|---|
| 0 | Unit 1: What is Creative Commons? | Ìdá 1: Kín ni Creative Commons? |
| 1 | This work is licensed under a Creative Commons... | Iṣẹ́ yìí wà lábẹ́ àṣẹ Creative Commo... |
| 2 | Creative Commons is a set of legal tools, a no... | Creative Commons jẹ́ àwọn ọ̀kan-ò-jọ̀kan... |
| 3 | Creative Commons began in response to an outda... | Creative Commons bẹ̀rẹ̀ láti wá wọ̀rọ̀ko... |
| 4 | CC licenses are built on copyright and are des... | Àwọn àṣẹ CC jẹ mọ́ àṣẹ ẹni tí ó n... |



Validation:



| | English | Yoruba |
|---|---|---|
| 0 | We prepare the saddle, and the goat presents i... | A di gàárì sílẹ̀ ewúrẹ́ ń yọjú; ẹru... |
| 1 | You have been crowned a king, and yet you make... | A fi ọ́ jọba ò ń ṣàwúre o fẹ́ jẹ Ọlo... |
| 2 | By dancing we take possession of Awà; through ... | A fijó gba Awà; a fìjà gba Awà; bí a ò ... |
| 3 | We lift a saddle and the goat (kin) scowls; it... | A gbé gàárì ọmọ ewúrẹ́ ń rojú; kì i... |
| 4 | One does not share a farm boundary with a king... | A kì í bá ọba pàlà kí ọkọ́ ọba má s... |


In [20]:
# Check for empty (null) values in both columns
empty_rows = cleaned_train_data[cleaned_train_data.isnull().any(axis=1)]

In [21]:
empty_rows

| | English | Yoruba |
|---|---|---|


In [22]:
# Check for potential misalignment by comparing sentence lengths (English vs. Yoruba)
cleaned_train_data['English_Length'] = cleaned_train_data['English'].apply(len)
cleaned_train_data['Yoruba_Length'] = cleaned_train_data['Yoruba'].apply(len)

In [23]:
# Calculate the difference in lengths between English and Yoruba
cleaned_train_data['Length_Difference'] = abs(cleaned_train_data['English_Length'] - cleaned_train_data['Yoruba_Length'])

In [30]:
cleaned_train_data.iloc[2]

| | 2 |
|---|---|
| English | Creative Commons is a set of legal tools, a no... |
| Yoruba | Creative Commons jẹ́ àwọn ọ̀kan-ò-jọ̀kan... |
| English_Length | 239 |
| Yoruba_Length | 407 |
| Length_Difference | 168 |
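
An absolute length difference grows with sentence length, so long but well-aligned pairs such as the one above (239 vs. 407 characters) can look suspicious. A length ratio is one alternative. The sketch below is a minimal illustration on plain Python strings. The names `length_ratio` and `flag_suspect_pairs` and the thresholds are assumptions made for this example, not part of the notebook.

```python
def length_ratio(source: str, target: str) -> float:
    """Return len(target) / len(source); 0.0 when the source is empty."""
    return len(target) / len(source) if source else 0.0


def flag_suspect_pairs(pairs, low=0.5, high=2.5):
    """Return (index, ratio) for pairs whose ratio falls outside [low, high].

    The default thresholds are illustrative assumptions and should be tuned
    against the real data.
    """
    flagged = []
    for i, (source, target) in enumerate(pairs):
        ratio = length_ratio(source, target)
        if not low <= ratio <= high:
            flagged.append((i, round(ratio, 2)))
    return flagged


# Invented toy pairs: the second target is far too short for its source.
pairs = [
    ("a" * 100, "b" * 170),  # ratio 1.7, plausible
    ("a" * 100, "b" * 10),   # ratio 0.1, likely misaligned
]
print(flag_suspect_pairs(pairs))  # [(1, 0.1)]
```

The same ratio could be computed as a DataFrame column by dividing `Yoruba_Length` by `English_Length`.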


## Translation Model

In [6]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load translation model
translate_tokenizer = AutoTokenizer.from_pretrained("Davlan/m2m100_418M-eng-yor-mt")
translate_model = AutoModelForSeq2SeqLM.from_pretrained("Davlan/m2m100_418M-eng-yor-mt")


In [7]:
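# Define the translation function
def translate_text_to_yoruba(text):
    """Translate one English string to Yoruba with the M2M100 model."""
    # Keep the inputs on the same device as the model
    model_inputs = translate_tokenizer(text, return_tensors="pt").to(translate_model.device)
    # Force the first generated token to be the Yoruba language tag
    gen_tokens = translate_model.generate(**model_inputs, forced_bos_token_id=translate_tokenizer.get_lang_id("yo"))
    translation = translate_tokenizer.batch_decode(gen_tokens, skip_special_tokens=True)[0]
    return translation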
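
Calling the function once per sentence is simple but slow on larger sets. The sketch below shows only the chunking logic for batched translation. `chunked`, `translate_in_batches`, and `fake_translate_batch` are hypothetical names for this example. A real `translate_batch` would pass the whole list to the tokenizer with `padding=True` and decode every output with `batch_decode`.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `batch_size` entries."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def translate_in_batches(
    texts: Sequence[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 16,
) -> List[str]:
    """Apply `translate_batch` chunk by chunk, preserving input order."""
    results: List[str] = []
    for batch in chunked(texts, batch_size):
        results.extend(translate_batch(batch))
    return results


# Stand-in for a real batched translator, used only to make this sketch runnable.
def fake_translate_batch(batch: List[str]) -> List[str]:
    return [text.upper() for text in batch]


print(translate_in_batches(["a", "b", "c"], fake_translate_batch, batch_size=2))
# ['A', 'B', 'C']
```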


In [None]:
# Test the translation function interactively:
# type any English sentence at the prompt

english_text = input("Enter an English text: ")
yoruba_translation = translate_text_to_yoruba(english_text)
print(f"Yoruba Translation: {yoruba_translation}")


Enter an English text: I am so happy to be a part of this year's bootcamp
Yoruba Translation: Inú mi dùn gan-an pé mo wà lára àwọn tó ń ṣiṣẹ́ níbi ìpàgọ́ ìkọ́lé ti ọdún yìí


In [None]:
# Evaluate the model on the validation set using sacrebleu

# Use a small subset of the validation set to keep this demonstration fast.
# Note: val_dataset is the uncleaned frame, so row 0 is the header row
# ("English", "Yoruba"); cleaned_val_data has that row removed.
# .copy() makes the subset an independent DataFrame, which avoids pandas'
# SettingWithCopyWarning when the new column is added.
val_test_subset = val_dataset.head(10).copy()

# Apply the translation function to the subset
val_test_subset['Yoruba_Translation'] = val_test_subset['English'].apply(translate_text_to_yoruba)

val_test_subset.head()


| | English | Yoruba | Yoruba_Translation |
|---|---|---|---|
| 0 | English | Yoruba | Gẹ̀ẹ́sì |
| 1 | We prepare the saddle, and the goat presents i... | A di gàárì sílẹ̀ ewúrẹ́ ń yọjú; ẹrù ìran rẹ̀ ni? | À ń pèsè àkàbà, ewúrẹ́ ń fún ara rẹ̀; ṣé ẹrù n... |
| 2 | You have been crowned a king, and yet you make... | A fi ọ́ jọba ò ń ṣàwúre o fẹ́ jẹ Ọlọ́run ni? | A ti fi ọ́ jẹ ọba ní adé, síbẹ̀ ìwọ ṣe oògùn o... |
| 3 | By dancing we take possession of Awà; through ... | A fijó gba Awà; a fìjà gba Awà; bí a ò bá jó, ... | Nípa ijó jíjó ni a ń gba Awà; nípa ìjà ni a ń ... |
| 4 | We lift a saddle and the goat (kin) scowls; it... | A gbé gàárì ọmọ ewúrẹ́ ń rojú; kì í ṣe ẹrù àgù... | A gbé àkísà àti àkísà ewúrẹ́; kì í ṣe ẹrù ìnir... |


In [None]:
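# Pass plain lists: one prediction per sentence and a list of references per sentence
bleu_score = bleu.compute(
    predictions=val_test_subset['Yoruba_Translation'].tolist(),
    references=[[ref] for ref in val_test_subset['Yoruba'].tolist()],
)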
print(f"BLEU Score: {bleu_score['score']}")

BLEU Score: 13.397405416153369
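
BLEU combines clipped n-gram precisions (for n = 1 to 4) with a brevity penalty. The sketch below illustrates only the clipped-precision building block on whitespace tokens. It is a simplified teaching example, not sacrebleu's implementation, which applies its own tokenization and corpus-level aggregation. `ngrams` and `clipped_precision` are names made up for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hypothesis: str, reference: str, n: int = 1) -> float:
    """Clipped n-gram precision of a whitespace-tokenized hypothesis against one reference."""
    hyp_counts = ngrams(hypothesis.split(), n)
    ref_counts = ngrams(reference.split(), n)
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # Each hypothesis n-gram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


# Repeating a correct word does not help: only one "the" is credited.
print(clipped_precision("the the the", "the cat"))
# An exact match scores 1.0 at every n-gram order it is long enough for.
print(clipped_precision("the cat sat", "the cat sat", n=2))
```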


## Generate Submission

In [None]:
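# Generate predictions for submission.
# cleaned_test_data is used so that the header row read in as data
# (row 0 of test_dataset) is not translated and written to the file.
cleaned_test_data['Yoruba'] = cleaned_test_data['English'].apply(translate_text_to_yoruba)
cleaned_test_data.head()

cleaned_test_data.to_csv("Baseline Predictions.csv", index=False)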
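
Before uploading, it can help to run a quick sanity check on the predictions file. The sketch below is one possible check, written against CSV text so that it runs without the real file. `check_submission` is a hypothetical helper, the sample rows are invented, and the hackathon's required submission format is not stated in this notebook, so the checks are assumptions.

```python
import csv
import io


def check_submission(csv_text: str, expected_rows: int, column: str = "Yoruba"):
    """Return a list of problems found in a predictions CSV (an empty list means none found)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    problems = []
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    for i, row in enumerate(rows):
        # A missing or blank prediction usually means a translation step failed.
        if not (row.get(column) or "").strip():
            problems.append(f"row {i}: empty {column}")
        # A data row whose English cell is the word "English" is a header read in as data.
        if row.get("English") == "English":
            problems.append(f"row {i}: looks like a leftover header row")
    return problems


# Invented sample: one leftover header row and one empty prediction.
sample = "English,Yoruba\nEnglish,Gẹ̀ẹ́sì\nHello,\n"
for problem in check_submission(sample, expected_rows=1):
    print(problem)
```

For the real file, the text could be read with `open("Baseline Predictions.csv", encoding="utf-8").read()` and `expected_rows` set to the number of test sentences.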