<a href="https://colab.research.google.com/github/Bhavnicksm/marathi-neural-machine-translation/blob/main/GT_bleu_score_calc.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook calculates the BLEU score of translations produced by Google Translate, so that it can be compared with the score of our own model. Let's begin.

**Before beginning, make sure `data.csv` is in the working directory.**

In [1]:
#!pip install googletrans
#!pip install mtranslate

In [2]:
#!python -m spacy download en

In [3]:
#!pip install torchtext==0.8.0

## Importing the data

In [4]:
import pandas as pd

In [5]:
data = pd.read_csv("data.csv", header=None)  # the file has no header row
data.columns = ['english', 'marathi']
data.tail()

|  | english | marathi |
|---|---|---|
| 40746 | Just saying you don't like fish because of the... | हड्डींमुळे मासे आवडत नाही असं म्हणणं हे काय मा... |
| 40747 | The Japanese Parliament today officially elect... | आज जपानी संसदेने अधिकृतरित्या र्‍यौतारौ हाशिमो... |
| 40748 | Tom tried to sell his old VCR instead of throw... | टॉमने त्याचा जुना व्ही.सी.आर फेकून टाकण्याऐवजी... |
| 40749 | You can't view Flash content on an iPad. Howev... | आयपॅडवर फ्लॅश आशय बघता येत नाही. पण तुम्ही त्य... |
| 40750 | In 1969, Roger Miller recorded a song called "... | १९६९मध्ये रॉजर मिलरने "यू डोन्ट वॉन्ट माय लव्ह... |


## Setting up the translator

In [6]:
import googletrans
from googletrans import Translator

In [7]:
import mtranslate

In [8]:
sent_mr = data['marathi'][20000]
sent_en = data['english'][20000]
print(sent_mr)
print(sent_en)

ते काय म्हणाले ते सोडा.
Never mind what he said.


In [9]:
mtranslate.translate(sent_mr,'en','mr')

'Leave what they said.'

## BLEU Score

In [10]:
# Before computing BLEU, we need an English tokenizer.

In [11]:
import torchtext
from torchtext.data.metrics import bleu_score
import spacy
# 'en' is a spaCy 2.x shortcut link; spaCy 3+ needs the full model name, e.g. 'en_core_web_sm'
spacy_en = spacy.load('en')

In [12]:
def tokenize_en(text):
  return [tok.text for tok in spacy_en.tokenizer(text)]

In [13]:
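def calculate_bleu_score(data, trg_tokenizer):
  """Translate each Marathi sentence to English and return the corpus-level BLEU against the references."""
  trgs = []   # for each sentence: a list of reference token lists
  preds = []  # for each sentence: the candidate token list

  for src, trg in zip(data['marathi'], data['english']):
    # mtranslate.translate(text, to_language, from_language)
    pred = mtranslate.translate(src, 'en', 'mr')

    preds.append(trg_tokenizer(pred))
    trgs.append([trg_tokenizer(trg)])  # bleu_score expects a list of references per candidate

  return bleu_score(preds, trgs)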
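
The loop above sends one web request per row, and the data preview shows tens of thousands of rows, so a single transient network error would abort the whole run. One possible safeguard is a small retry wrapper. This is only a sketch: `translate_with_retry`, `translate_fn`, `retries` and `delay` are hypothetical names introduced here, not part of `mtranslate`, and catching every `Exception` is an assumption about how failures surface.

```python
import time


def translate_with_retry(translate_fn, text, retries=3, delay=1.0):
    """Call translate_fn(text), retrying when it raises.

    All names here are illustrative; translate_fn is any callable that
    takes a source string and returns its translation.
    """
    last_error = None
    for attempt in range(retries):
        try:
            return translate_fn(text)
        except Exception as err:  # assumption: failures raise ordinary exceptions
            last_error = err
            time.sleep(delay * (attempt + 1))  # wait a little longer after each failure
    raise RuntimeError(f"translation failed after {retries} attempts") from last_error
```

Inside the loop, the translation call could then become `translate_with_retry(lambda s: mtranslate.translate(s, 'en', 'mr'), src)`.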

In [14]:
bleu = calculate_bleu_score(data, tokenize_en)
print(f"The BLEU score is {bleu*100:.2f}")

The BLEU score is 63.80
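
For intuition about what this number measures, here is a minimal pure-Python sketch of corpus-level BLEU: clipped n-gram precision for n = 1..4 with uniform weights, multiplied by a brevity penalty. It is not the `torchtext` implementation, only an illustration of the standard definition; `simple_corpus_bleu` and `ngram_counts` are names made up for this sketch, and the example sentences are split on whitespace rather than with spaCy.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram of length n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_corpus_bleu(candidates, references, max_n=4):
    """Corpus-level BLEU with uniform weights over 1..max_n grams.

    candidates: list of token lists
    references: list of lists of token lists (one or more references each)
    """
    clipped = [0] * max_n  # candidate n-grams that also occur in a reference
    total = [0] * max_n    # all candidate n-grams
    cand_len = 0
    ref_len = 0

    for cand, refs in zip(candidates, references):
        cand_len += len(cand)
        # Brevity penalty uses the reference length closest to the candidate's.
        ref_len += min((len(r) for r in refs), key=lambda l: abs(l - len(cand)))
        for n in range(1, max_n + 1):
            cand_counts = ngram_counts(cand, n)
            max_ref = Counter()
            for r in refs:
                max_ref |= ngram_counts(r, n)  # element-wise maximum
            clipped[n - 1] += sum((cand_counts & max_ref).values())  # element-wise minimum
            total[n - 1] += sum(cand_counts.values())

    if min(clipped) == 0:
        return 0.0  # any n-gram order with no match zeroes the geometric mean
    log_precision = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    brevity_penalty = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    return brevity_penalty * math.exp(log_precision)


# The sample pair from earlier in the notebook, with punctuation split off by hand.
reference = "Never mind what he said .".split()
candidate = "Leave what they said .".split()
print(simple_corpus_bleu([candidate], [[reference]]))  # 0.0: the pair shares no trigram
print(simple_corpus_bleu([reference], [[reference]]))  # 1.0: identical sentences
```

The zero for a perfectly reasonable translation shows why BLEU is best read as a corpus-level statistic: counts are pooled across all sentences before the precisions are computed, so individual misses are averaged out.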
