# Building a Simple Machine Translation Model

Machine Translation (MT) is the process of automatically translating text from one language to another. Two common approaches are:

  - Statistical Machine Translation (SMT): Uses statistical models to find the most probable translation.

  - Neural Machine Translation (NMT): Uses neural networks to model translation, often resulting in higher-quality translations.

In [4]:
!pip install nltk
!pip install numpy




In [6]:
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [7]:

# Example sentences
source_sentence = "Hello, how are you?"
target_sentence = "Bonjour, comment ça va?"

# Tokenize
source_tokens = word_tokenize(source_sentence.lower())
target_tokens = word_tokenize(target_sentence.lower())


In [9]:
#  Build a Simple Translation Model

from collections import defaultdict

# Example alignments (manually defined for demonstration)
word_alignments = defaultdict(lambda: defaultdict(float))

# Fill the dictionary with example probabilities (placeholder values chosen by hand, not estimated from data)
word_alignments["hello"]["bonjour"] = 0.5
word_alignments["how"]["comment"] = 0.5
word_alignments["are"]["sont"] = 0.5
word_alignments["you"]["vous"] = 0.5
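
The probabilities above are set by hand. As a rough sketch of where such numbers could come from, the snippet below estimates p(target | source) by relative frequency from word pairs that are assumed to be aligned already. The `aligned_pairs` list and the `estimate_probabilities` helper are hypothetical and not part of the notebook.

```python
from collections import Counter, defaultdict

def estimate_probabilities(aligned_pairs):
    """Estimate p(target | source) as count(source, target) / count(source)."""
    pair_counts = Counter(aligned_pairs)
    source_counts = Counter(src for src, _ in aligned_pairs)
    probabilities = defaultdict(dict)
    for (src, tgt), count in pair_counts.items():
        probabilities[src][tgt] = count / source_counts[src]
    return probabilities

# Hypothetical aligned word pairs, invented for illustration only
aligned_pairs = [
    ("you", "vous"), ("you", "vous"), ("you", "tu"), ("you", "vous"),
    ("hello", "bonjour"), ("hello", "salut"),
]

probs = estimate_probabilities(aligned_pairs)
print(probs["you"])    # {'vous': 0.75, 'tu': 0.25}
print(probs["hello"])  # {'bonjour': 0.5, 'salut': 0.5}
```

The resulting nested mapping has the same shape as `word_alignments`, so it could be passed to a function that picks the most probable target word for each source word.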


In [10]:
# Translation function

def translate_sentence(source_tokens, word_alignments):
    translation = []
    for token in source_tokens:
        if token in word_alignments:
            # Select the most probable translation
            target_word = max(word_alignments[token], key=word_alignments[token].get)
            translation.append(target_word)
        else:
            translation.append(token)  # Use the original word if no translation is found
    return ' '.join(translation)

# Translate a sentence
translated_sentence = translate_sentence(source_tokens, word_alignments)
print("Translated Sentence:", translated_sentence)


Translated Sentence: bonjour , comment sont vous ?


In [11]:
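# Evaluating a translation with BLEU

from nltk.translate.bleu_score import sentence_bleu

# Example reference and candidate sentences.
# Note: the candidate is hard-coded to be identical to the reference, so the
# score is 1.0 by construction; it does not score translate_sentence's output.
reference = [['bonjour', 'comment', 'ça', 'va']]
candidate = ['bonjour', 'comment', 'ça', 'va']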

# Calculate BLEU score
bleu_score = sentence_bleu(reference, candidate)
print("BLEU Score:", bleu_score)


BLEU Score: 1.0
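
The score of 1.0 comes from comparing a sentence with an identical copy of itself. To get a feel for how the first model's actual output compares with the target, the sketch below computes the clipped (modified) n-gram precision that BLEU is built on. It is plain Python rather than NLTK, the helper names `ngrams` and `modified_precision` are introduced here for illustration, and the token lists are copied by hand from the sentences above.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Clipped n-gram precision of a candidate against a single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    if not cand_counts:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

# Tokens of the target sentence and of the first model's output
reference_tokens = ['bonjour', ',', 'comment', 'ça', 'va', '?']
candidate_tokens = ['bonjour', ',', 'comment', 'sont', 'vous', '?']

p1 = modified_precision(reference_tokens, candidate_tokens, 1)
p2 = modified_precision(reference_tokens, candidate_tokens, 2)
print(f"1-gram precision: {p1:.2f}")  # 0.67
print(f"2-gram precision: {p2:.2f}")  # 0.40
```

Full BLEU combines such precisions for n = 1 to 4 as a geometric mean and applies a brevity penalty, so these two numbers are only part of the picture.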


# Translation Model 2


In [12]:
!pip install nltk numpy pandas




In [13]:
# collect data
# Sample data (English to French)
data = [
    ("Hello, how are you?", "Bonjour, comment ça va?"),
    ("I am fine, thank you.", "Je vais bien, merci."),
    ("What is your name?", "Quel est ton nom?"),
    ("My name is John.", "Je m'appelle John."),
    ("I like to learn languages.", "J'aime apprendre des langues.")
]


In [14]:
# Preprocessing

import nltk
from nltk.tokenize import word_tokenize

# Ensure necessary resources are downloaded
nltk.download('punkt')

def preprocess_data(data):
    source_sentences, target_sentences = zip(*data)
    source_tokens = [word_tokenize(sentence.lower()) for sentence in source_sentences]
    target_tokens = [word_tokenize(sentence.lower()) for sentence in target_sentences]
    return source_tokens, target_tokens

source_tokens, target_tokens = preprocess_data(data)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [15]:
# Building the model

from collections import defaultdict

# Build a translation dictionary by pairing source and target tokens
# position by position. This naive alignment assumes both sentences have
# the same length and word order; zip() silently drops unpaired tokens.
translation_dict = defaultdict(list)

for source_sentence, target_sentence in data:
    source_words = word_tokenize(source_sentence.lower())
    target_words = word_tokenize(target_sentence.lower())
    for src_word, tgt_word in zip(source_words, target_words):
        translation_dict[src_word].append(tgt_word)

def translate(sentence):
    words = word_tokenize(sentence.lower())
    # Use the first translation seen for each word; fall back to the word itself if none exists
    translated_words = [translation_dict.get(word, [word])[0] for word in words]
    return ' '.join(translated_words)

# Test translation
test_sentence = "Hello, how are you?"
print("Original:", test_sentence)
print("Translated:", translate(test_sentence))


Original: Hello, how are you?
Translated: bonjour , comment ça va ?


In [16]:
# Evaluation

# Print each translation next to its target for a side-by-side comparison
for src_sentence, tgt_sentence in data:
    translation = translate(src_sentence)
    print(f"Original: {src_sentence}")
    print(f"Translation: {translation}")
    print(f"Target: {tgt_sentence}")
    print()


Original: Hello, how are you?
Translation: bonjour , comment ça va ?
Target: Bonjour, comment ça va?

Original: I am fine, thank you.
Translation: je vais bien , merci va .
Target: Je vais bien, merci.

Original: What is your name?
Translation: quel est ton nom ?
Target: Quel est ton nom?

Original: My name is John.
Translation: je nom est . .
Target: Je m'appelle John.

Original: I like to learn languages.
Translation: je apprendre des langues . .
Target: J'aime apprendre des langues.



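Observation:

The model reproduces the first and third sentences because their source and target token sequences have the same length and word order. The other three come out wrong: pairing words by position misaligns sentences of different lengths (for example, "john" ends up paired with "."), and `translate` always returns the first pairing seen for a word, so "you" is rendered as "va" and "I" as "je" regardless of context.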
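
To make the failure on the second sentence concrete, the sketch below pairs its tokens by position, the same way the dictionary-building loop does. The token lists are written out by hand so the snippet runs without NLTK; that they match what `word_tokenize` produces is an assumption.

```python
# Tokens for "I am fine, thank you." and "Je vais bien, merci.", written out by hand
source_words = ['i', 'am', 'fine', ',', 'thank', 'you', '.']
target_words = ['je', 'vais', 'bien', ',', 'merci', '.']

# zip() stops at the shorter list, so the last source token is left unpaired
pairs = list(zip(source_words, target_words))
for src_word, tgt_word in pairs:
    print(f"{src_word!r} -> {tgt_word!r}")

unpaired = source_words[len(pairs):]
print("Unpaired source tokens:", unpaired)
```

Here `'you'` is paired with `'.'` and the final `'.'` of the source has no partner. The same kind of off-by-one pairing in the fourth and fifth sentence pairs is why their translations end in `. .`.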

Refinement and Improvement
  For a more sophisticated model, consider the following:

  - Data Size: Use a larger and more diverse parallel corpus.
  - Preprocessing: Implement more advanced preprocessing, such as stemming or lemmatization.
  - Model Complexity: Use more advanced techniques such as statistical machine translation (SMT) or neural machine translation (NMT). Libraries such as OpenNMT or Hugging Face Transformers are better suited to building such models.
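
As one small step toward the SMT approach mentioned above, the sketch below estimates word translation probabilities with the EM procedure of IBM Model 1 instead of pairing words by position. It is a simplified illustration (no NULL word, a fixed number of iterations), and the three-sentence corpus, `train_ibm_model1`, and `best_translation` are hypothetical names introduced here.

```python
from collections import defaultdict

def train_ibm_model1(corpus, iterations=10):
    """Estimate t(target | source) with IBM Model 1 EM (no NULL word, for brevity)."""
    target_vocab = {tgt for _, tgt_sent in corpus for tgt in tgt_sent}
    # Start from a uniform distribution over target words
    t = defaultdict(lambda: 1.0 / len(target_vocab))

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of each (source, target) pair
        total = defaultdict(float)  # expected count of each source word
        # E-step: split each target word's mass over the source words in its sentence
        for src_sent, tgt_sent in corpus:
            for tgt in tgt_sent:
                norm = sum(t[(src, tgt)] for src in src_sent)
                for src in src_sent:
                    share = t[(src, tgt)] / norm
                    count[(src, tgt)] += share
                    total[src] += share
        # M-step: renormalize the expected counts into probabilities
        for (src, tgt), c in count.items():
            t[(src, tgt)] = c / total[src]
    # Return only the pairs that co-occurred in at least one sentence pair
    return dict(t)

def best_translation(t, src):
    """Return the target word with the highest estimated probability for src."""
    candidates = [tgt for (s, tgt) in t if s == src]
    return max(candidates, key=lambda tgt: t[(src, tgt)])

# Hypothetical toy corpus of tokenized sentence pairs, for illustration only
corpus = [
    (['the', 'house'], ['la', 'maison']),
    (['the', 'book'], ['le', 'livre']),
    (['a', 'book'], ['un', 'livre']),
]

t = train_ibm_model1(corpus)
for src in ['book', 'a']:
    print(src, '->', best_translation(t, src))
```

On this toy corpus "book" resolves to "livre" and "a" to "un", because "livre" is the only target word that appears in both sentences containing "book". Words that only ever appear together, such as "house" with "la" and "maison", stay ambiguous, which is one reason a larger corpus matters.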

# Translation Model using Pretrained Models

Using the Hugging Face Transformers library with the pretrained `Helsinki-NLP/opus-mt-en-fr` MarianMT model.


In [17]:
from transformers import MarianMTModel, MarianTokenizer

model_name = 'Helsinki-NLP/opus-mt-en-fr'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

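def translate_with_model(sentence):
    # Tokenize the input and return PyTorch tensors
    inputs = tokenizer(sentence, return_tensors="pt")
    # Generate the translated token IDs
    translated = model.generate(**inputs)
    # Decode the first (and only) output sequence back to text
    translation = tokenizer.decode(translated[0], skip_special_tokens=True)
    return translation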

# Test translation with a pre-trained model
print("Original:", test_sentence)
print("Translated with model:", translate_with_model(test_sentence))


Original: Hello, how are you?
Translated with model: Bonjour, comment allez-vous?
