<a href="https://colab.research.google.com/github/davidbaines/translate_docx/blob/main/Text_Translation_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### This notebook shows a full pipeline for text language identification and translation using two Facebook models: fastText and No Language Left Behind (NLLB).

First, we take an input text in any language and detect its language code using fastText.

Then we feed the text and the predicted language label to NLLB, which translates the text from the source language into any target language that NLLB supports.
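
The two stages can be sketched as one small function that receives the detector and the translator as arguments. This is only an illustration: `detect_and_translate`, `fake_detect`, and `fake_translate` are hypothetical names that do not belong to fastText or NLLB, and the stand-ins exist so the sketch runs without downloading any model.

```python
from typing import Callable


def detect_and_translate(
    text: str,
    detect: Callable[[str], str],
    translate: Callable[[str, str, str], str],
    target_lang: str,
) -> str:
    """Detect the source language of `text`, then translate it to `target_lang`."""
    label = detect(text)                            # e.g. '__label__eng_Latn'
    source_lang = label.replace("__label__", "")    # -> 'eng_Latn'
    return translate(text, source_lang, target_lang)


# Hypothetical stand-ins for the real detector and translator.
def fake_detect(text: str) -> str:
    return "__label__eng_Latn"


def fake_translate(text: str, src: str, tgt: str) -> str:
    return f"[{src} -> {tgt}] {text}"


result = detect_and_translate("Good morning.", fake_detect, fake_translate, "spa_Latn")
print(result)  # [eng_Latn -> spa_Latn] Good morning.
```

Passing the two stages in as callables keeps the language-code handling separate from the models, so either stage can be swapped or tested on its own.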

Source: https://colab.research.google.com/drive/1fsbzykS5ANEMVcn7gtp8Wl7gkmRzbwOW

Webpage: https://medium.com/mlearning-ai/text-translation-using-nllb-and-huggingface-tutorial-7e789e0f7816

In [58]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [59]:
# download the language model pretrained file
!wget https://dl.fbaipublicfiles.com/nllb/lid/lid218e.bin

--2023-01-29 21:15:11--  https://dl.fbaipublicfiles.com/nllb/lid/lid218e.bin
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 172.67.9.4, 104.22.74.142, 104.22.75.142, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|172.67.9.4|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1176355829 (1.1G) [application/octet-stream]
Saving to: ‘lid218e.bin.1’


2023-01-29 21:15:47 (31.9 MB/s) - ‘lid218e.bin.1’ saved [1176355829/1176355829]



# Imports and functions

In [60]:
!pip install python-docx
!pip install fasttext
!pip install nltk
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [61]:
#Test sentence tokenizer:
para = "Hello World. It's good to see you. Thanks for buying this book."
sent_tokenize(para)

['Hello World.', "It's good to see you.", 'Thanks for buying this book.']

In [62]:
import fasttext

pretrained_lang_model = "/content/lid218e.bin" # path of pretrained model file
model = fasttext.load_model(pretrained_lang_model)



Now let's enter a test text in the source language. Here we translate from Arabic to Spanish.

In [63]:
text = "صباح الخير، الجو جميل اليوم والسماء صافية."

In [64]:
predictions = model.predict(text, k=1) 
print(predictions)

(('__label__arb_Arab',), array([0.99960977]))


In [65]:
input_lang = predictions[0][0].replace('__label__', '')
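
The prediction is a pair of (labels, probabilities), so the label can be unpacked together with its confidence. The helper below is a minimal sketch: `top_language` is a hypothetical name, and the default threshold of 0.5 is an arbitrary assumption rather than a value recommended by fastText.

```python
from typing import Optional, Sequence, Tuple


def top_language(
    predictions: Tuple[Sequence[str], Sequence[float]],
    min_confidence: float = 0.5,  # assumed threshold, tune for your data
) -> Optional[str]:
    """Return the top NLLB language code, or None if the confidence is too low."""
    labels, probabilities = predictions
    if not labels:
        return None
    if float(probabilities[0]) < min_confidence:
        return None
    return labels[0].replace("__label__", "")


# Same shape as the prediction printed above (a plain list replaces the NumPy array).
example = (("__label__arb_Arab",), [0.99960977])
print(top_language(example))                      # arb_Arab
print(top_language(example, min_confidence=1.0))  # None
```

Returning `None` for low-confidence predictions lets the caller decide whether to skip the text or ask for the language explicitly.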


# Text Translation

In [66]:
!pip install -U pip transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/

In [67]:
!pip install sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/

In [68]:
# Smallest 600M parameter model - distilled
checkpoint = 'facebook/nllb-200-distilled-600M'

# Medium 1.3B parameter model - distilled
# checkpoint = 'facebook/nllb-200-distilled-1.3B'

# Medium 1.3B parameter model
# checkpoint = 'facebook/nllb-200-1.3B'

# Large 3.3B parameter model
# checkpoint = 'facebook/nllb-200-3.3B'



In [69]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Test translation with model.

In [70]:
target_lang = 'spa_Latn'
translation_pipeline = pipeline('translation', 
                                model=model, 
                                tokenizer=tokenizer, 
                                src_lang=input_lang, 
                                tgt_lang=target_lang, 
                                max_length = 400)
output = translation_pipeline(text)
print(output[0]['translation_text'])

Buenos días, el clima es hermoso y el cielo está limpio.


In [92]:
import argparse
import docx
from docx import Document
import glob
from pathlib import Path
import shutil
from os.path import join, isfile


def get_paragraphs_from_docx(file):
    """Return one dict per paragraph holding its style name and its sentences."""
    paras = []
    # Open the Word document.
    doc = docx.Document(file)

    # Read each paragraph and store its style name with its sentences.
    for para in doc.paragraphs:
        paras.append({
            'style': para.style.name,
            'sentences': sent_tokenize(para.text),
        })

    return paras


def put_sentences_to_docx(sentences, file_in, file_out):
    # Unfinished stub: it opens the output document but does not write anything yet.
    # Sentences should be a dictionary of sent
    paras = []

    # Open connection to Word Document
    doc = docx.Document(file_out)

    # Output the paragraphs and sentences with the correct style.
    # for sentence in sentences:


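def translate_sentences(file_in, input_lang, output_lang, file_out):
    # Collect every sentence from every paragraph of the input document.
    input_file = Path(file_in)
    sentences = [
        sentence
        for paragraph in get_paragraphs_from_docx(input_file)
        for sentence in paragraph['sentences']
    ]
    for sentence in sentences:
        print(sentence)

    print(f"Found {len(sentences)} sentences in {input_file.resolve()}")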

# From https://github.com/henrihapponen/docxedit/blob/main/docxedit.py


def replace_string(doc: object, old_string: str, new_string: str):
    """
    Replaces an old string (placeholder) with a new string
    without changing the formatting of the text.
    Args:
        doc (Object): The docx document object.
        old_string (String): The old string to replace.
        new_string (String): The new string to replace the old one with.
    Returns:
        bool: True if old_string was found inside a single run and replaced, False otherwise.
    """

    for paragraph in doc.paragraphs:
        if old_string in paragraph.text:
            inline = paragraph.runs
            print(f"Inline is {inline}")
            
            for i in range(len(inline)):
                if old_string in inline[i].text:
                    text = inline[i].text.replace(str(old_string), str(new_string))
                    inline[i].text = text
                    return True
    
    # Couldn't find old_string inside any single run (it may be split across several runs).
    
    return False


def copy_tree(source, destination):
    shutil.copytree(source, destination, symlinks=False, ignore=None, dirs_exist_ok=True)
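
`replace_string` only succeeds when the whole sentence sits inside one run, so a sentence split over several runs (for example, where part of it is formatted differently) is reported as not found. One possible workaround is sketched below. It is not the notebook's method: `replace_across_runs` is a hypothetical helper, and `FakeRun` is a stand-in for a python-docx run that models only the `.text` attribute. The trade-off is that formatting that differed between the merged runs is lost.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FakeRun:
    """Hypothetical stand-in for a python-docx Run; only `.text` is used."""
    text: str


def replace_across_runs(runs: List[FakeRun], old_string: str, new_string: str) -> bool:
    """Replace old_string even when it spans several runs.

    The merged text is written to the first run and the other runs are
    emptied, so per-run formatting differences are not preserved.
    """
    full_text = "".join(run.text for run in runs)
    if not runs or old_string not in full_text:
        return False
    runs[0].text = full_text.replace(old_string, new_string)
    for run in runs[1:]:
        run.text = ""
    return True


# One sentence split over two runs, as in a paragraph with mixed formatting.
runs = [
    FakeRun("You may introduce the book of Mark through story form "),
    FakeRun("in as concrete a way as possible."),
]
replaced = replace_across_runs(
    runs,
    "You may introduce the book of Mark through story form in as concrete a way as possible.",
    "Yu ken stori long buk Mak long rot bilong stori long rot i stret.",
)
print(replaced, [run.text for run in runs])
```

With real documents the same function could be called with `paragraph.runs`, since python-docx runs expose a writable `text` attribute.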


In [88]:
# Languages and folders for a test run:
input_lang = 'eng_Latn'
output_lang = 'tpi_Latn'  # Tok Pisin (the output folder name says Tagalog, which would be tgl_Latn)

input_folder = Path("drive/MyDrive/EIL-Mark")
output_folder = Path("drive/MyDrive/EIL-Mark-Tagalog")
ext_in = 'docx'
ext_out = 'docx'

# Copy the input files to the output directory and work only on the copies.
copy_tree(input_folder, output_folder)


translation_pipeline = pipeline('translation', 
                        model=model, 
                        tokenizer=tokenizer, 
                        src_lang=input_lang, 
                        tgt_lang=output_lang, 
                        max_length = 400)
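
This notebook works with several language pairs (Arabic to Spanish, English to Tok Pisin, and Tok Pisin back to English). One way to avoid rebuilding a pipeline each time a pair is needed is to cache one translator per pair. The sketch below uses `functools.lru_cache`; `build_translator` is a hypothetical stand-in for the `pipeline('translation', ...)` call so that the example runs without a model.

```python
from functools import lru_cache

build_calls = []  # records each build so the effect of caching is visible


def build_translator(src_lang: str, tgt_lang: str):
    """Hypothetical stand-in for pipeline('translation', ..., src_lang=..., tgt_lang=...)."""
    build_calls.append((src_lang, tgt_lang))
    return lambda text: f"[{src_lang} -> {tgt_lang}] {text}"


@lru_cache(maxsize=None)
def get_translator(src_lang: str, tgt_lang: str):
    """Return one translator per language pair, building it only on first use."""
    return build_translator(src_lang, tgt_lang)


for sentence in ["Mark Introduction", "Help them if they forget parts of the story."]:
    print(get_translator("eng_Latn", "tpi_Latn")(sentence))

print(len(build_calls))  # 1: the second sentence reused the cached translator
```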

In [95]:
# Get list of copied files.
output_files = [file for file in output_folder.rglob("*." + ext_in)]
print(f"Found {len(output_files)} {ext_in} files in {output_folder.resolve()}")

# Process the first two files as a test:
for output_file in output_files[:2]:
    
    output_file = output_file.resolve()
    print(f"Initial output_file is {output_file}")

    # Skip files that have already been translated.
    #if output_file.with_suffix(".translated.docx").is_file():
    #    continue

    # Open the output as a document
    document = Document(output_file)

    # Save it with a new name.
    translated_file = output_file.with_suffix(".translated.docx").resolve()
    document.save(translated_file)

    paragraphs_in = get_paragraphs_from_docx(output_file)
    #paragraphs_out = paragraphs_in

    for paragraph_in in paragraphs_in:
        # Paragraphs without text have an empty sentence list, so the inner loop skips them.
        for sentence in paragraph_in['sentences']:
            translated_sentence = translation_pipeline(sentence)[0]['translation_text']
            if not replace_string(document, sentence, translated_sentence):
                print(f"Couldn't replace \n{sentence}\n  with:\n{translated_sentence}\n in file:\n{translated_file}")
            else:
                print(f"{sentence}        ->         {translated_sentence}")
                # Save after each replaced sentence.
                document.save(translated_file)

    document.save(translated_file)
    print(f"Wrote translated file to {translated_file}")

   

Found 323 docx files in /content/drive/MyDrive/EIL-Mark-Tagalog
Initial output_file is /content/drive/MyDrive/EIL-Mark-Tagalog/0.intro/Mark Introduction 1.docx
Inline is [<docx.text.run.Run object at 0x7f8da97324f0>]
Mark Introduction        ->         Toktok Bilong Mak
Inline is [<docx.text.run.Run object at 0x7f8da9732eb0>]
Part 1: The author’s and audience’s story        ->         Hap 1: Stori bilong ol man i raitim ol dispela stori na stori bilong ol manmeri
Inline is [<docx.text.run.Run object at 0x7f8da9732e50>, <docx.text.run.Run object at 0x7f8da9863850>]
You may introduce the book of Mark through story form in as concrete a way as possible.        ->         Yu ken stori long buk Mak long rot bilong stori long rot i stret.
Inline is [<docx.text.run.Run object at 0x7f8da9732e50>, <docx.text.run.Run object at 0x7f8daace8fd0>]
One way to do this is to tell the following story and have your translation team act it out.        ->         Wanpela rot bilong mekim olsem em long stor
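
Because `rglob("*.docx")` also matches files named `*.translated.docx`, a second run would treat earlier outputs as new inputs. A filter along the following lines could be applied to the file list first. It is a sketch under assumptions: `files_to_translate` is a hypothetical helper that mirrors the commented-out skip check in the loop, and it is demonstrated on a temporary folder with empty placeholder files.

```python
import tempfile
from pathlib import Path
from typing import List


def files_to_translate(folder: Path, ext: str = "docx") -> List[Path]:
    """List source files, skipping translated outputs and files already translated."""
    translated_suffix = f".translated.{ext}"
    pending = []
    for path in sorted(folder.rglob(f"*.{ext}")):
        if path.name.endswith(translated_suffix):
            continue  # this file is the output of an earlier run
        if path.with_suffix(translated_suffix).is_file():
            continue  # a translation already exists next to it
        pending.append(path)
    return pending


# Demonstration on a throwaway folder with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ["a.docx", "a.translated.docx", "b.docx"]:
        (root / name).touch()
    pending_names = [path.name for path in files_to_translate(root)]

print(pending_names)  # ['b.docx']
```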

In [None]:
# Create the output document.
# Assumes output_file is a Path to one of the copied .docx files (for example, left over from the loop above).
document = docx.Document(output_file)

paragraphs_in = get_paragraphs_from_docx(output_file)

for paragraph_in in paragraphs_in:
    # Paragraphs without text have an empty sentence list and are skipped.
    for sentence in paragraph_in['sentences']:
        # Reuse the translation_pipeline built in an earlier cell.
        translated_sentence = translation_pipeline(sentence)[0]['translation_text']
        if not replace_string(document, sentence, translated_sentence):
            print(f"Couldn't replace \n{sentence}\n  with:\n{translated_sentence}\n in file:\n{output_file.resolve()}")

document.save(output_file.resolve())
print(f"Wrote translated file to {output_file.resolve()}")


In [None]:


# Early draft. Assumes `paragraphs` is a list of lists of the form
# [style_name, sentence_1, sentence_2, ...]; it is not defined elsewhere in this notebook.
translated_paras = []
translations = []
for paragraph in paragraphs[:3]:
    translated_paras.append(paragraph[0])
    translated_sentences = []
    if len(paragraph) > 1:
        # Skip the style name at index 0 and translate the sentences,
        # reusing the translation_pipeline built in an earlier cell.
        for sentence in paragraph[1:]:
            output = translation_pipeline(sentence)
            translated_sentence = output[0]['translation_text']

            translated_sentences.append(translated_sentence)
            print(f"Source: {sentence}\nTarget: {translated_sentence}")

    # A paragraph that contains only style info contributes an empty list.
    translations.append(translated_sentences)

for translated_para in translated_paras:
    print(translated_para)

In [None]:
input_texts = [r'You may introduce the book of Mark through story form in as concrete a way as possible. ',
               r'One way to do this is to tell the following story and have your translation team act it out. ',
               r'Ahead of time, choose the following characters: Mark, Peter, Jesus, Paul, and Barnabas. ',
               r'The rest of the team can play the parts of the followers and believers. ',
               r'Choose parts of the room to represent different parts of the world: Jerusalem, Antioch, Cyprus, and Rome. ',
               r'These four places could be the four corners of the room. ',
               r'As you tell the story, the characters in that part of the story can walk to the part of the room that is representing the place in the story.',
               r'Have the characters act out the journeys and the actions as you tell the story.',
               r'You may have the team retell the story after you tell it, again acting it out.',
               r'Help them if they forget parts of the story.',]      

In [None]:

outputs = []
# Length in characters of the longest input sentence (not a token count).
max_length = max(len(input_text) for input_text in input_texts)
print(f"Max length is {max_length}")

# Build the pipeline once and reuse it for every sentence.
translation_pipeline = pipeline('translation',
                                model=model,
                                tokenizer=tokenizer,
                                src_lang=input_lang,
                                tgt_lang=output_lang,
                                max_length = 400)

for input_text in input_texts:
    outputs.append(translation_pipeline(input_text))

for output in outputs:
    print(output[0]['translation_text'])



Max length is 141
Yu ken kamapim buk Mak long rot bilong stori long rot i stret tru. 
Wanpela rot bilong mekim olsem em long storiim stori i kamap bihain na tokim ol lain bilong tanim tok long mekim olsem. 
Taim yu laik kisim sampela hap tok, makim ol dispela man: Mak, Pita, Jisas, Pol, na Barnabas. 
Ol narapela insait long lain inap mekim wok bilong ol disaipel na ol bilipman. 
Pinis long makim ol hap bilong rum bilong makim ol narapela hap bilong graun: Jerusalem, Antiok, Saiprus, na Rom. 
Dispela 4-pela hap inap makim 4-pela kona bilong rum.
Taim yu stori, ol man i stap long dispela hap stori i ken wokabaut i go long hap bilong rum em ples i stap long stori.
Taim yu stori, yu mas tokim ol man long ol samting yu mekim long rot bilong raun na wokabaut.
Ating bai yu tokim lain long stori gen taim yu stori pinis, na bihain bai yu mekim olsem.
Sapos ol i lusim tingting long sampela hap stori, orait helpim ol.
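
The Hugging Face translation pipeline can also be called with a list of strings, in which case it returns one result dictionary per input, so the sentences could be sent in chunks instead of one at a time. The sketch below shows the chunking logic only: `chunks` and `translate_all` are hypothetical helpers, the batch size of 4 is arbitrary, and `fake_translator` is a stand-in that imitates the pipeline's return shape without loading a model.

```python
from typing import Callable, Dict, Iterator, List


def chunks(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def translate_all(
    translator: Callable[[List[str]], List[Dict[str, str]]],
    texts: List[str],
    batch_size: int = 4,  # arbitrary value chosen for illustration
) -> List[str]:
    """Translate `texts` in batches and return the translated strings in order."""
    translated = []
    for batch in chunks(texts, batch_size):
        for result in translator(batch):
            translated.append(result["translation_text"])
    return translated


def fake_translator(batch: List[str]) -> List[Dict[str, str]]:
    """Hypothetical stand-in: upper-cases the text instead of translating it."""
    return [{"translation_text": text.upper()} for text in batch]


sample = ["one", "two", "three", "four", "five"]
print(translate_all(fake_translator, sample))  # ['ONE', 'TWO', 'THREE', 'FOUR', 'FIVE']
```

In the notebook, `translation_pipeline` would take the place of `fake_translator` and `input_texts` the place of `sample`.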


In [None]:
tpi_output_texts = []
for output in outputs:
    tpi_output_texts.append(output[0]['translation_text'])

In [None]:
input_lang = 'tpi_Latn'
output_lang = 'eng_Latn'

tpi_input_texts = tpi_output_texts
eng_outputs = []
eng_output_texts = []

# Build the reverse pipeline once and reuse it for every sentence.
translation_pipeline = pipeline('translation',
                                model=model,
                                tokenizer=tokenizer,
                                src_lang=input_lang,
                                tgt_lang=output_lang,
                                max_length = 400)

for input_text in tpi_input_texts:
    eng_outputs.append(translation_pipeline(input_text))

eng_output_texts = [eng_output[0]['translation_text'] for eng_output in eng_outputs]

In [None]:
for eng_output_text in eng_output_texts:
    print(eng_output_text)


You can make Mark's book sound by telling the truth.
And the way of the translator is to make a report, and the way of the translator is to make a report.
Now if you want to take part in the discussion, choose these men: Mark, Peter, Jesus, Paul and Barnabas.
The rest of the congregation can share in the ministry of the disciples and the faithful.
And the city of Jerusalem, and the city of Antioch, and the city of Cyprus, and the city of Rome, were chosen.
And the four corners of the house were four corners.
When you tell the story, the people in the story can walk to the room where the story is told.
When thou speakest, thou shalt speak thy words, and thy ways.
And thou shalt speak, and thou shalt speak, and thou shalt speak.
If they forget a thing, help them.
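
The back-translations can be compared with the original English sentences to flag the ones that drifted furthest. The sketch below uses `difflib.SequenceMatcher` from the standard library. Its character-level ratio is only a crude proxy and not a translation-quality metric, and `round_trip_score` is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def round_trip_score(original: str, back_translated: str) -> float:
    """Return a similarity ratio between 0 and 1 for a sentence and its back-translation."""
    a = original.strip().lower()
    b = back_translated.strip().lower()
    return SequenceMatcher(None, a, b).ratio()


# Two original sentences paired with their back-translations from the output above.
pairs = [
    ("Help them if they forget parts of the story.",
     "If they forget a thing, help them."),
    ("These four places could be the four corners of the room.",
     "And the four corners of the house were four corners."),
]
for original, back in pairs:
    print(f"{round_trip_score(original, back):.2f}  {original!r} -> {back!r}")
```

Sorting the sentences by this score would put the most heavily altered ones first for manual review.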
