<a href="https://colab.research.google.com/github/davidbaines/translate_docx/blob/main/MyDrive_copy_of_Text_Translation_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### This notebook shows a full translation pipeline using Facebook's No Language Left Behind (NLLB) model.

It translates text from the original language into any language that NLLB supports.

Source: https://colab.research.google.com/drive/1fsbzykS5ANEMVcn7gtp8Wl7gkmRzbwOW

Webpage: https://medium.com/mlearning-ai/text-translation-using-nllb-and-huggingface-tutorial-7e789e0f7816

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
# download the language model pretrained file
#!wget https://dl.fbaipublicfiles.com/nllb/lid/lid218e.bin
#!pip install fasttext

#import fasttext

#pretrained_lang_model = "/content/lid218e.bin" # path of pretrained model file
#model = fasttext.load_model(pretrained_lang_model)

#text = "صباح الخير، الجو جميل اليوم والسماء صافية."
#predictions = model.predict(text, k=1) 
#print(predictions)
#input_lang = predictions[0][0].replace('__label__', '')
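
The commented-out cell above strips the `__label__` prefix from the fastText prediction to get a language code. A minimal sketch of that step, assuming `model.predict(text, k=1)` returns a `(labels, probabilities)` pair for a single string and that the labels use the same codes as NLLB. The function name `label_to_nllb_code` and the stand-in prediction are hypothetical.

```python
def label_to_nllb_code(predictions):
    """Extract a language code such as 'arb_Arab' from a fastText prediction.

    `predictions` is assumed to be the (labels, probabilities) pair that
    fastText's model.predict(text, k=1) returns for a single string.
    """
    labels, _probabilities = predictions
    top_label = labels[0]
    prefix = "__label__"
    if not top_label.startswith(prefix):
        raise ValueError(f"Unexpected label format: {top_label!r}")
    return top_label[len(prefix):]


# Stand-in for real model output, so the sketch runs without fastText.
fake_predictions = (("__label__arb_Arab",), [0.99])
input_lang = label_to_nllb_code(fake_predictions)
print(input_lang)
```

Unlike `.replace('__label__', '')`, this raises an error if the label does not have the expected shape.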

# Downloads and Imports

In [None]:
!pip install -U pip transformers
!pip install sentencepiece
!pip install python-docx
!pip install nltk
import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt')

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [None]:
import docx
import glob
from pathlib import Path

# Download NLLB model

In [None]:
# Smallest 600M parameter model - distilled
checkpoint = 'facebook/nllb-200-distilled-600M'

# Medium 1.3B parameter model - distilled
# checkpoint = 'facebook/nllb-200-distilled-1.3B'

# Medium 1.3B parameter model
# checkpoint = 'facebook/nllb-200-1.3B'

# Large 3.3B parameter model
# checkpoint = 'facebook/nllb-200-3.3B'
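
Instead of commenting lines in and out, the checkpoint could be chosen by name. This is only a sketch: the dictionary keys and the `select_checkpoint` helper are hypothetical, and the non-distilled 1.3B identifier is assumed to follow the same naming pattern as the others.

```python
# Hypothetical short names mapped to Hugging Face model identifiers.
NLLB_CHECKPOINTS = {
    "distilled-600M": "facebook/nllb-200-distilled-600M",
    "distilled-1.3B": "facebook/nllb-200-distilled-1.3B",
    "1.3B": "facebook/nllb-200-1.3B",  # assumed to follow the same pattern
    "3.3B": "facebook/nllb-200-3.3B",
}


def select_checkpoint(size="distilled-600M"):
    """Return the model identifier for a size name, or raise ValueError."""
    try:
        return NLLB_CHECKPOINTS[size]
    except KeyError:
        choices = ", ".join(sorted(NLLB_CHECKPOINTS))
        raise ValueError(f"Unknown size {size!r}; choose from: {choices}") from None


checkpoint = select_checkpoint("distilled-600M")
print(checkpoint)
```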


In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Test tokenizer and translation.

In [None]:
#Test sentence tokenizer:
para = "Hello World. It's good to see you. Thanks for buying this book."
sent_tokenize(para)

['Hello World.', "It's good to see you.", 'Thanks for buying this book.']

In [None]:
input_lang = 'eng_Latn'
target_lang = 'spa_Latn'
translation_pipeline = pipeline('translation', 
                                model=model, 
                                tokenizer=tokenizer, 
                                src_lang=input_lang, 
                                tgt_lang=target_lang, 
                                max_length = 400)
output = translation_pipeline(para)
print(output[0]['translation_text'])

Hola Mundo, me alegra verte, gracias por comprar este libro.
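
The language codes used in this notebook, such as `eng_Latn` and `spa_Latn`, combine a three-letter language code with a four-letter script code. The sketch below checks only that shape. It does not check whether NLLB supports the language. The helper name `split_language_code` is hypothetical.

```python
import re

# Three lowercase letters (language), an underscore, then a script code
# written as one uppercase letter followed by three lowercase letters.
CODE_PATTERN = re.compile(r"^([a-z]{3})_([A-Z][a-z]{3})$")


def split_language_code(code):
    """Split a code such as 'eng_Latn' into ('eng', 'Latn')."""
    match = CODE_PATTERN.match(code)
    if match is None:
        raise ValueError(f"Not a language_Script code: {code!r}")
    return match.group(1), match.group(2)


for code in ("eng_Latn", "spa_Latn", "tpi_Latn"):
    language, script = split_language_code(code)
    print(f"{code}: language={language}, script={script}")
```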


# Define Functions

In [None]:

def replace_string(doc: object, old_string: str, new_string: str):
    
    # From https://github.com/henrihapponen/docxedit/blob/main/docxedit.py
    """
    Replaces an old string (placeholder) with a new string
    without changing the formatting of the text.
    Args:
        doc (Object): The docx document object.
        old_string (String): The old string to replace.
        new_string (String): The new string to replace the old one with.
    Returns:
        modified document
    """

    for paragraph in doc.paragraphs:
        if old_string in paragraph.text:
            inline = paragraph.runs

            for i in range(len(inline)):
                if old_string in inline[i].text:
                    inline[i].text = inline[i].text.replace(old_string, new_string)
                    return doc, True

                # Handle old_string split across two adjacent runs: merge the
                # pair into the first run. The second run's formatting is lost.
                elif i + 1 < len(inline) and old_string in inline[i].text + inline[i+1].text:
                    combined = inline[i].text + inline[i+1].text
                    inline[i].text = combined.replace(old_string, new_string)
                    inline[i+1].text = ""
                    return doc, True

    # Couldn't find old_string. Print to help find out why:
    print(f"Couldn't find old_string:\n{old_string} in text runs:")
    for paragraph in doc.paragraphs:
        for run in paragraph.runs:
            print(run.text)

    return doc, False



def replace_string_simple(doc: object, old_string: str, new_string: str):
    # Simpler, untested alternative to replace_string. It has its own name so
    # that it does not override the definition above. Assigning paragraph.text
    # collapses the paragraph into a single run, so run-level formatting is lost.
    for paragraph in doc.paragraphs:
        if old_string in paragraph.text:
            paragraph.text = paragraph.text.replace(old_string, new_string)
            return doc, True
    
    # Couldn't find old_string. Print to help find out why:
    print(f"Couldn't find old_string:\n{old_string} in paragraph texts:")
    for paragraph in doc.paragraphs:
        print(paragraph.text)

    return doc, False
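
A sentence can be split across more than two runs. One way to handle any number of runs is to search the concatenated text and then map the match back onto the individual runs. The sketch below works on plain strings so it can be tested without python-docx. The function name `replace_across_runs` is hypothetical, and writing the results back to each `run.text` is left as an assumption about how it would be used.

```python
def replace_across_runs(run_texts, old_string, new_string):
    """Replace the first occurrence of old_string in the concatenated runs.

    Returns a new list with the same number of runs, or None if old_string
    is not found. The replacement text is placed in the run where the match
    starts; matched text in later runs is removed.
    """
    if not old_string:
        return None
    full_text = "".join(run_texts)
    start = full_text.find(old_string)
    if start == -1:
        return None
    end = start + len(old_string)

    result = []
    offset = 0          # position of the current run within full_text
    inserted = False    # has new_string been placed yet?
    for text in run_texts:
        run_start, run_end = offset, offset + len(text)
        offset = run_end
        if run_end <= start or run_start >= end:
            # The run lies entirely outside the match: keep it unchanged.
            result.append(text)
            continue
        # Keep the parts of this run before and after the matched span.
        before = text[:max(start - run_start, 0)]
        after = text[end - run_start:]
        middle = "" if inserted else new_string
        inserted = True
        result.append(before + middle + after)
    return result


runs = ["The quick ", "brown", " fox jumps"]
print(replace_across_runs(runs, "quick brown fox", "slow red cat"))
```

Keeping the number of runs unchanged means each run keeps its own formatting, and only the text moves.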


In [None]:
def get_paragraphs_from_docx(file):
        
    paras = []
    # Open connection to Word Document
    doc = docx.Document(file)
    
    # Read each paragraph in the file and store its style name with its sentences.
    for para in doc.paragraphs:
        this_para = {'style': para.style.name}
        this_para['sentences'] = sent_tokenize(para.text)
        paras.append(this_para)

    return paras


def translate_docx(file):
    
    paras = []

    # Open connection to Word Document
    doc = docx.Document(file)  

    # Read each paragraph in the file and store its style name with it.
    for para in doc.paragraphs:
        this_para = {'style': para.style.name}
        sentences = sent_tokenize(para.text)
        translations = [translation_pipeline(sentence)[0]['translation_text'] for sentence in sentences]

        this_para['sentences'] = sentences
        this_para['translations'] = translations
        paras.append(this_para)

        # Leave paragraphs without text untouched: assigning para.text
        # replaces every run, which would also drop inline images.
        if not sentences:
            continue

        # Replace the whole paragraph text rather than finding and replacing
        # each sentence. This collapses the paragraph into a single run, so
        # run-level formatting (bold, italics) is lost.
        para.text = " ".join(translations)

        # Probably redundant, since setting para.text should not change the style.
        para.style = this_para['style']

    # Note: this overwrites the input file in place.
    doc.save(file)
    return paras
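
The per-paragraph step in `translate_docx` (split into sentences, translate each one, join the results) can be tried without a Word file or the model. In this sketch the regex splitter is a rough stand-in for `sent_tokenize`, and `fake_translate` is a stand-in for the pipeline call. Both, along with the name `translate_paragraph`, are hypothetical.

```python
import re


def split_sentences(text):
    """Rough stand-in for nltk's sent_tokenize: split after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def translate_paragraph(text, translate_fn, split_fn=split_sentences):
    """Translate one paragraph sentence by sentence and rejoin the result."""
    sentences = split_fn(text)
    translations = [translate_fn(sentence) for sentence in sentences]
    return {
        "sentences": sentences,
        "translations": translations,
        "text": " ".join(translations),
    }


def fake_translate(sentence):
    """Stand-in translator so the sketch runs without the model."""
    return sentence.upper()


record = translate_paragraph("Hello World. It's good to see you.", fake_translate)
print(record["text"])
```

In the notebook, `translate_fn` would be something like `lambda s: translation_pipeline(s)[0]['translation_text']`.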

In [None]:
input_lang = 'eng_Latn'
output_lang = 'tpi_Latn'  # Tok Pisin. The output folder below is named Tagalog, whose code is 'tgl_Latn'.

input_folder = Path("drive/MyDrive/EIL-Mark")
output_folder = Path("drive/MyDrive/EIL-Mark-Tagalog")
ext_in = 'docx'
ext_out = 'docx'

# Copy the input files to the output directory and work only on the copies.
# copy_tree(input_folder, output_folder)


translation_pipeline = pipeline('translation', 
                        model=model, 
                        tokenizer=tokenizer, 
                        src_lang=input_lang, 
                        tgt_lang=output_lang, 
                        max_length = 400)

# Get list of copied files.
output_files = list(output_folder.rglob("*." + ext_in))
print(f"Found {len(output_files)} {ext_in} files in {output_folder.resolve()}")

Found 325 docx files in /content/drive/MyDrive/EIL-Mark-Tagalog
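
The copy step is commented out in the cell above. A minimal sketch of copying a folder tree and listing the copied Word files with the standard library, run here on a temporary directory. The helper name `copy_and_list` is hypothetical. Skipping names that start with `~$` assumes Word's temporary lock files may be present in the source folder.

```python
import shutil
import tempfile
from pathlib import Path


def copy_and_list(input_folder, output_folder, ext="docx"):
    """Copy a folder tree, then return the copied files with the given extension."""
    shutil.copytree(input_folder, output_folder, dirs_exist_ok=True)
    return sorted(
        path
        for path in Path(output_folder).rglob(f"*.{ext}")
        if not path.name.startswith("~$")  # skip Word lock files
    )


with tempfile.TemporaryDirectory() as tmp:
    # Build a tiny stand-in for the input folder.
    source = Path(tmp) / "EIL-Mark"
    (source / "0.intro").mkdir(parents=True)
    (source / "0.intro" / "Mark Introduction 1.docx").write_bytes(b"placeholder")
    (source / "0.intro" / "~$rk Introduction 1.docx").write_bytes(b"lock file")
    (source / "notes.txt").write_text("not a docx")

    copies = copy_and_list(source, Path(tmp) / "EIL-Mark-copy")
    copied_names = [path.name for path in copies]
    print(copied_names)
```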


In [None]:
# Try a version which replaces all the existing text.
#input_files = [output_folder / "0.intro" / "Mark Introduction 1.docx"]
#input_files = [output_folder /"1.1-1.13" / file for file in ["1 - Hear and Heart.docx"]]
#input_files = [output_folder /"1.1-1.13" / file for file in ["2 - Setting the Stage.docx"]]
input_files = [output_folder /"1.1-1.13" / file for file in ["3 - Defining the Scenes.docx", "4 - Embodying the Text.docx", "5 - Filling the Gaps.docx" ]]

input_files = [file for file in output_folder.rglob("*.docx")]

print(f"Found {len(input_files)} input files.")

from zipfile import BadZipFile

# Translate every file found above, skipping any that cannot be opened.
for input_file in input_files:

    print(f"Opening {input_file}")

    # The copies are translated in place, so the output path is the input path.
    #output_file = input_file.with_suffix(".translated.docx").resolve()
    output_file = input_file.resolve()

    # Translate the content. translate_docx opens, rewrites and saves the file.
    try:
        paragraphs = translate_docx(output_file)
    except BadZipFile:
        print(f"BadZipFile Error on opening {input_file}")
        continue

    print(f"Saved the translated file to {output_file}")

    #for paragraph in paragraphs:
    #    print(paragraph)


Found 328 input files.
Opening drive/MyDrive/EIL-Mark-Tagalog/10.13-10.31/4 - Embodying the Text.docx
Saved the translated file to /content/drive/MyDrive/EIL-Mark-Tagalog/10.13-10.31/4 - Embodying the Text.docx
Opening drive/MyDrive/EIL-Mark-Tagalog/10.13-10.31/5 - Filling the Gaps.docx
Saved the translated file to /content/drive/MyDrive/EIL-Mark-Tagalog/10.13-10.31/5 - Filling the Gaps.docx
Opening drive/MyDrive/EIL-Mark-Tagalog/10.13-10.31/1 - Hear and Heart.docx
Saved the translated file to /content/drive/MyDrive/EIL-Mark-Tagalog/10.13-10.31/1 - Hear and Heart.docx
Opening drive/MyDrive/EIL-Mark-Tagalog/10.13-10.31/2 - Setting the Stage.docx
Saved the translated file to /content/drive/MyDrive/EIL-Mark-Tagalog/10.13-10.31/2 - Setting the Stage.docx
Opening drive/MyDrive/EIL-Mark-Tagalog/10.13-10.31/3 - Defining the Scenes.docx
Saved the translated file to /content/drive/MyDrive/EIL-Mark-Tagalog/10.13-10.31/3 - Defining the Scenes.docx
Opening drive/MyDrive/EIL-Mark-Tagalog/10.13-10.3

BadZipFile: ignored
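
A `BadZipFile` error means that a file with a `.docx` extension could not be read as a zip archive. Files like this could be filtered out before the translation loop starts. The sketch below is a cheap pre-check, not a full validation. The helper name `looks_like_docx` is hypothetical, and `word/document.xml` is assumed to be the usual location of the main document part.

```python
import tempfile
import zipfile
from pathlib import Path


def looks_like_docx(path):
    """Return True if the file is a zip archive containing word/document.xml."""
    if not zipfile.is_zipfile(path):
        return False
    try:
        with zipfile.ZipFile(path) as archive:
            return "word/document.xml" in archive.namelist()
    except zipfile.BadZipFile:
        return False


with tempfile.TemporaryDirectory() as tmp:
    # A minimal archive with the expected member, and a file that is not a zip.
    good = Path(tmp) / "good.docx"
    with zipfile.ZipFile(good, "w") as archive:
        archive.writestr("word/document.xml", "<document/>")
    bad = Path(tmp) / "bad.docx"
    bad.write_bytes(b"this is not a zip archive")

    results = {path.name: looks_like_docx(path) for path in (good, bad)}
    print(results)
```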

In [None]:
test_file = output_folder / "0.intro" / "Mark Introduction 1.docx"
input_files = [output_folder /"1.1-1.13" / file for file in ["2 - Setting the Stage.docx"]]
print(f"Found {len(input_files)} input files.")

# Process one to test:
for input_file in input_files:
    
    print(f"Opening {input_file}")
    # Open the input file as a Word document
    document = docx.Document(input_file)
    
    # Get output filename.
    #output_file = input_file.with_suffix(".translated.docx").resolve()
    
    output_file = input_file.resolve()
    
    # Save the file.
    document.save(input_file)
    print(f"Saved output_doc as {input_file}")

    # From docx help
    #document = Document('existing-document-file.docx')
    #document.save('new-file-name.docx')
    
    # Get the content to translate
    paragraphs_in = get_paragraphs_from_docx(output_file)
    
    print(paragraphs_in)

    replacements = []
    failed = False
    for para, paragraph_in in enumerate(paragraphs_in):

        if failed:
            break
        if not paragraph_in['sentences'] :
            continue
        else :
            for sent, sentence in enumerate(paragraph_in['sentences']):
                translated_sentence = translation_pipeline(sentence)[0]['translation_text']

                doc, replaced = replace_string(document, sentence, translated_sentence)
                if not replaced:
                    replacements.append(f"Failed   : {sentence}        ->         {translated_sentence}")
                    print(f"Couldn't replace \n{sentence}\n  with:\n{translated_sentence}\n in file:\n{output_file}")
                    failed = True
                    break
                else: 
                    replacements.append(f"Replaced : {sentence}        ->         {translated_sentence}")
                    #print(f"{sentence}        ->         {translated_sentence}")
                    

    document.save(output_file)
    
    for replacement in replacements:
        print(replacement)
    print(f"Wrote translated file to {output_file}")
   

Found 1 input files.
Opening drive/MyDrive/EIL-Mark-Tagalog/1.1-1.13/2 - Setting the Stage.docx
Saved output_doc as drive/MyDrive/EIL-Mark-Tagalog/1.1-1.13/2 - Setting the Stage.docx
[{'style': 'Normal', 'sentences': ['STATING']}, {'style': 'Normal', 'sentences': ['MARK 1:1-13']}, {'style': 'Normal', 'sentences': []}, {'style': 'Normal', 'sentences': ['Listen to the text once in the easiest to understand version.']}, {'style': 'Normal', 'sentences': []}, {'style': 'Normal', 'sentences': ["The beginning of the book of Mark sets the stage for the beginning of Jesus' ministry.", 'John comes to prepare the way for Jesus.', 'Mark immediately tells us that Jesus is the Son of God.', 'God shows us this is true by telling us that Jesus is his son, and that he loves and approves of Jesus.', 'Immediately, Jesus is in conflict with Satan, which shows us another theme of Mark--God and Satan at war with each other.']}, {'style': 'Normal', 'sentences': []}, {'style': 'Normal', 'sentences': ['This st

UnboundLocalError: ignored
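
Python raises an `UnboundLocalError` like the one above when one branch reads a local variable that only another branch assigns. This can happen, for example, when a string is found only across two runs. A minimal reproduction on plain strings follows. The function names `buggy` and `fixed` are hypothetical.

```python
def buggy(first, second, old, new):
    """Reads `text` on a path where it was never assigned."""
    if old in first:
        text = first.replace(old, new)
        return text
    elif old in first + second:
        return text  # `text` is only assigned in the branch above


def fixed(first, second, old, new):
    """Compute the replacement inside the branch that uses it."""
    if old in first:
        return first.replace(old, new), second
    elif old in first + second:
        return (first + second).replace(old, new), ""
    return first, second


try:
    buggy("Hello wor", "ld", "world", "mundo")
except UnboundLocalError as error:
    print(f"buggy() raised: {type(error).__name__}")

print(fixed("Hello wor", "ld", "world", "mundo"))
```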

In [None]:
# Create the output document
document = docx.Document(output_file)

paragraphs_in = get_paragraphs_from_docx(input_file)
#paragraphs_out = paragraphs_in

# Build the pipeline once and reuse it for every sentence.
translation_pipeline = pipeline('translation', 
                                model=model, 
                                tokenizer=tokenizer, 
                                src_lang=input_lang, 
                                tgt_lang=output_lang, 
                                max_length = 400)

for paragraph_in in paragraphs_in:
    for sentence in paragraph_in['sentences']:
        translated_sentence = translation_pipeline(sentence)[0]['translation_text']

        # replace_string may return a bare flag or a (doc, flag) pair,
        # depending on which definition is active.
        result = replace_string(document, sentence, translated_sentence)
        replaced = result[-1] if isinstance(result, tuple) else result
        if not replaced:
            print(f"Couldn't replace \n{sentence}\n  with:\n{translated_sentence}\n in file:\n{output_file.resolve()}")

document.save(output_file.resolve())
print(f"Wrote translated file to {output_file.resolve()}")


In [None]:


# Translate the first three paragraphs and print source and target together.
# Each paragraph is a dict with 'style' and 'sentences' keys.
translation_pipeline = pipeline('translation', 
                                model=model, 
                                tokenizer=tokenizer, 
                                src_lang=input_lang, 
                                tgt_lang=output_lang, 
                                max_length = 400)

translated_paras = []
for paragraph in paragraphs[:3]:
    translated_sentences = []
    # Paragraphs with no sentences carry only style information.
    for sentence in paragraph['sentences']:
        translated_sentence = translation_pipeline(sentence)[0]['translation_text']
        translated_sentences.append(translated_sentence)
        print(f"Source: {sentence}\nTarget: {translated_sentence}")

    translated_paras.append({'style': paragraph['style'], 'translations': translated_sentences})

for translated_para in translated_paras:
    print(translated_para)

In [None]:
input_texts = [r'You may introduce the book of Mark through story form in as concrete a way as possible. ',
               r'One way to do this is to tell the following story and have your translation team act it out. ',
               r'Ahead of time, choose the following characters: Mark, Peter, Jesus, Paul, and Barnabas. ',
               r'The rest of the team can play the parts of the followers and believers. ',
               r'Choose parts of the room to represent different parts of the world: Jerusalem, Antioch, Cyprus, and Rome. ',
               r'These four places could be the four corners of the room. ',
               r'As you tell the story, the characters in that part of the story can walk to the part of the room that is representing the place in the story.',
               r'Have the characters act out the journeys and the actions as you tell the story.',
               r'You may have the team retell the story after you tell it, again acting it out.',
               r'Help them if they forget parts of the story.',]      

In [None]:

outputs = []
# Longest input in characters. The pipeline's max_length counts tokens, not characters.
max_length = max(len(input_text) for input_text in input_texts)
print(f"Max length is {max_length}")

# Build the pipeline once and reuse it for every sentence.
translation_pipeline = pipeline('translation', 
                                model=model, 
                                tokenizer=tokenizer, 
                                src_lang=input_lang, 
                                tgt_lang=output_lang, 
                                max_length = 400)

for input_text in input_texts:
    outputs.append(translation_pipeline(input_text))

for output in outputs:
    print(output[0]['translation_text'])



Max length is 141
Yu ken kamapim buk Mak long rot bilong stori long rot i stret tru. 
Wanpela rot bilong mekim olsem em long storiim stori i kamap bihain na tokim ol lain bilong tanim tok long mekim olsem. 
Taim yu laik kisim sampela hap tok, makim ol dispela man: Mak, Pita, Jisas, Pol, na Barnabas. 
Ol narapela insait long lain inap mekim wok bilong ol disaipel na ol bilipman. 
Pinis long makim ol hap bilong rum bilong makim ol narapela hap bilong graun: Jerusalem, Antiok, Saiprus, na Rom. 
Dispela 4-pela hap inap makim 4-pela kona bilong rum.
Taim yu stori, ol man i stap long dispela hap stori i ken wokabaut i go long hap bilong rum em ples i stap long stori.
Taim yu stori, yu mas tokim ol man long ol samting yu mekim long rot bilong raun na wokabaut.
Ating bai yu tokim lain long stori gen taim yu stori pinis, na bihain bai yu mekim olsem.
Sapos ol i lusim tingting long sampela hap stori, orait helpim ol.


In [None]:
tpi_output_texts = []
for output in outputs:
    tpi_output_texts.append(output[0]['translation_text'])

In [None]:
input_lang = 'tpi_Latn'
output_lang = 'eng_Latn'

tpi_input_texts = tpi_output_texts
eng_outputs = []
eng_output_texts = []

# Build the reverse pipeline once and reuse it for every sentence.
translation_pipeline = pipeline('translation', 
                                model=model, 
                                tokenizer=tokenizer, 
                                src_lang=input_lang, 
                                tgt_lang=output_lang, 
                                max_length = 400)

for input_text in tpi_input_texts:
    eng_outputs.append(translation_pipeline(input_text))

eng_output_texts = [eng_output[0]['translation_text'] for eng_output in eng_outputs]

In [None]:
for eng_output_text in eng_output_texts:
    print(eng_output_text)


You can make Mark's book sound by telling the truth.
And the way of the translator is to make a report, and the way of the translator is to make a report.
Now if you want to take part in the discussion, choose these men: Mark, Peter, Jesus, Paul and Barnabas.
The rest of the congregation can share in the ministry of the disciples and the faithful.
And the city of Jerusalem, and the city of Antioch, and the city of Cyprus, and the city of Rome, were chosen.
And the four corners of the house were four corners.
When you tell the story, the people in the story can walk to the room where the story is told.
When thou speakest, thou shalt speak thy words, and thy ways.
And thou shalt speak, and thou shalt speak, and thou shalt speak.
If they forget a thing, help them.
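
The round trip above can be compared with the original sentences by eye, but a rough score helps flag the sentences that drifted furthest. The sketch below uses word-level matching from the standard library. It is a crude heuristic rather than a translation quality metric, and the helper name `round_trip_similarity` is hypothetical. The two example pairs are taken from the cells above.

```python
from difflib import SequenceMatcher


def round_trip_similarity(original, back_translated):
    """Word-level similarity in [0, 1] between a sentence and its round trip."""
    original_words = original.lower().split()
    back_words = back_translated.lower().split()
    return SequenceMatcher(None, original_words, back_words).ratio()


pairs = [
    ("Help them if they forget parts of the story.",
     "If they forget a thing, help them."),
    ("These four places could be the four corners of the room.",
     "And the four corners of the house were four corners."),
]

for original, back in pairs:
    score = round_trip_similarity(original, back)
    print(f"{score:.2f}  {original}")
```

A low score only shows that the wording changed. A faithful paraphrase can score low, and a fluent mistranslation can score high.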
