# 📕 DOCX localisation

A [JupyterLite](https://jupyterlite.readthedocs.io/en/stable/index.html#) instance that automates the localisation of clinical trial documents. This notebook runs entirely in the browser: computation, file handling, and data storage all happen locally on your machine, and no data is sent to a server.

## How to use?

1. Upload the source documents from the coordination team to the `source_documents` folder on the left. You can either drag and drop the files or use the upload icon at the top of the file browser on the left. The files include the source docx (MS Word) files and the placeholder logo.
2. Upload your site's logo and `localisation.csv` to the `localisation_information` folder. The coordination team provides `localisation.csv`; you must fill in the relevant column.
3. Press the `Run all` button at the top of this page. This runs the code in the cells below.
4. Once all cells have finished, download your files from the `localised_documents` folder.

## Setting up the notebook

### Package installations

The packages need to be re-installed, so this step might take a while.

In [None]:
%pip install python-docx --no-deps

In [None]:
%pip install lxml

In [None]:
%pip install pillow

In [None]:
%pip install imagehash

In [None]:
print("Packages installed successfully!")

### Library imports

In [None]:
# IMPORTS
import os
import csv
import json
from docx import Document

import zipfile
import shutil
from PIL import Image
import imagehash

In [None]:
print("Libraries imported successfully!")

## Text replacements defined in `localisation.csv`

This CSV file holds the document names, the variables in these documents, and the site localisations.

The `localisation.csv` file must follow the coordination team's structure. You will most likely open and edit it in Excel. After you complete your amendments, save the file and drag and drop it into the `localisation_information` folder on the left side.
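
As a rough illustration of the expected shape: the reading code in this notebook assumes three columns per row (document file name, template text to find, localised replacement text). The header names, file names, and values below are hypothetical placeholders, not the coordination team's actual template, and `parse_localisation_rows` is a helper made up for this sketch.

```python
import csv
import io

# Hypothetical sample; the real header names and values come from the coordination team.
sample_csv = (
    "document,template,localised\n"
    "consent_form.docx,[SITE_NAME],Example Hospital\n"
    "consent_form.docx,[PI_NAME],Dr A. Example\n"
    "info_sheet.docx,[SITE_NAME],Example Hospital\n"
)

def parse_localisation_rows(text):
    """Group template -> localised pairs by document name."""
    changes = {}
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed rows
        document, template, localised = row
        changes.setdefault(document, {})[template] = localised
    return changes

example_changes = parse_localisation_rows(sample_csv)
```

Each document name becomes a key whose value maps every template string to its localised replacement.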

### Read the `localisation.csv`

In [None]:
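# Define the path to your CSV file
file_path = 'localisation_information/localisation.csv'

# Nested dictionary: {document name: {template text: localised text}}
changes_json = {}

# Open the CSV file and read the data
with open(file_path, mode='r', newline='', encoding='utf-8') as file:
    reader = csv.reader(file)
    next(reader)  # Skip the header row
    for row in reader:
        if not any(cell.strip() for cell in row):
            continue  # Skip blank rows, which would otherwise fail to unpack
        # Only the first three columns are used; extra trailing columns are ignored
        document, template, localised = row[:3]
        if document not in changes_json:
            changes_json[document] = {}
        changes_json[document][template] = localised

# Uncomment to inspect the parsed structure
# print(json.dumps(changes_json, indent=2, ensure_ascii=False))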

### Amend the `docx` files in `source_documents` folder and save in the `localised_documents` folder

This process checks all paragraphs, tables, headers, and footers, and makes the text replacements defined in `localisation.csv`.

The formatting of the replaced text is **not** preserved. 
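
To see why formatting is lost: a Word paragraph is stored as a sequence of runs, each with its own character formatting, and a single visible placeholder may be split across several runs. The sketch below models runs as plain strings (the run texts and placeholder are made-up examples) and mirrors the approach used in this notebook: join the runs, replace, blank the old runs, and append one new run.

```python
# Hypothetical run texts: one visible placeholder split across three runs.
run_texts = ["Site: [SITE", "_NAME", "] (main campus)"]
placeholder = "[SITE_NAME]"
replacement = "Example Hospital"

# Searching each run on its own misses the placeholder.
found_in_single_run = any(placeholder in text for text in run_texts)

# Joining the runs first makes the placeholder visible again.
joined = "".join(run_texts)
found_in_joined = placeholder in joined

# Blank every existing run, then append one new run holding the updated text.
rebuilt_runs = ["" for _ in run_texts] + [joined.replace(placeholder, replacement)]
```

Because the whole paragraph text ends up in the single appended run, any bold, italic, or colour that was set on the original runs no longer applies to it.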

Keep in mind that the file names are taken from `localisation.csv`. For the process to work, the document names in the CSV file must match the file names in the `source_documents` folder exactly. --> *Update needed here*
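
A mismatch between the CSV and the folder only shows up as an error when the document is opened, so a quick pre-flight check can help. The sketch below is an optional addition, not part of the notebook's pipeline; `find_missing_documents` and the file names are hypothetical.

```python
import os
import tempfile

def find_missing_documents(document_names, source_folder):
    """Return the names from the CSV that have no matching file in source_folder."""
    available = set(os.listdir(source_folder))
    return sorted(name for name in document_names if name not in available)

# Self-contained demonstration with a throwaway folder and made-up file names.
with tempfile.TemporaryDirectory() as demo_folder:
    open(os.path.join(demo_folder, "consent_form.docx"), "w").close()
    missing = find_missing_documents(
        ["consent_form.docx", "Consent_Form.docx", "info_sheet.docx"], demo_folder
    )
```

The comparison is an exact string match, so a name that differs only in capitalisation is reported as missing. In this notebook the check could be run as `find_missing_documents(changes_json.keys(), 'source_documents')` before the amendment step.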


In [None]:
# def replace_text_in_paragraphs(paragraphs, changes):
#     for paragraph in paragraphs:
#         for old_text, new_text in changes.items():
#             if old_text in paragraph.text:
#                 paragraph.text = paragraph.text.replace(old_text, new_text)

def replace_text_in_paragraphs(paragraphs, changes):
    for paragraph in paragraphs:
        # Join the runs so placeholders split across several runs are still found
        full_text = ''.join(run.text for run in paragraph.runs)
        updated_text = full_text
        for old_text, new_text in changes.items():
            if old_text:  # An empty search string would match everywhere
                updated_text = updated_text.replace(old_text, new_text)
        if updated_text != full_text:
            # Clear all runs, then add a single new run with the updated text
            # (this is why the original run formatting is not preserved)
            for run in paragraph.runs:
                run.text = ''
            paragraph.add_run(updated_text)

def replace_text_in_tables(tables, changes):
    for table in tables:
        for row in table.rows:
            for cell in row.cells:
                replace_text_in_paragraphs(cell.paragraphs, changes)

def replace_text_in_headers_footers(sections, changes):
    for section in sections:
        # Check and replace text in default header and footer
        replace_text_in_paragraphs(section.header.paragraphs, changes)
        replace_text_in_paragraphs(section.footer.paragraphs, changes)
        # Check and replace text in first page header and footer if they exist
        if section.first_page_header is not None:
            replace_text_in_paragraphs(section.first_page_header.paragraphs, changes)
        if section.first_page_footer is not None:
            replace_text_in_paragraphs(section.first_page_footer.paragraphs, changes)
        # Check and replace text in even page header and footer if they exist
        if section.even_page_header is not None:
            replace_text_in_paragraphs(section.even_page_header.paragraphs, changes)
        if section.even_page_footer is not None:
            replace_text_in_paragraphs(section.even_page_footer.paragraphs, changes)


def ensure_directory_empty(directory):
    if os.path.exists(directory):
        for filename in os.listdir(directory):
            file_path = os.path.join(directory, filename)
            try:
                if os.path.isfile(file_path) or os.path.islink(file_path):
                    os.unlink(file_path)  # Removes file or link
                elif os.path.isdir(file_path):
                    shutil.rmtree(file_path)  # Removes directory recursively
            except Exception as e:
                print(f'Failed to delete {file_path}. Reason: {e}')
    else:
        os.makedirs(directory)  # Create the directory if it does not exist

def apply_changes_to_document(changes_json, source_folder, target_folder):
    # Ensure the target directory is empty before processing
    ensure_directory_empty(target_folder)

    for doc_path, templates in changes_json.items():
        source_doc_path = os.path.join(source_folder, doc_path)
        print(f"Processing document: {source_doc_path}")
        doc = Document(source_doc_path)
        
        for template, localised in templates.items():
            print(f"Applying change: {template} -> {localised}")
            replace_text_in_paragraphs(doc.paragraphs, {template: localised})
            replace_text_in_tables(doc.tables, {template: localised})
            replace_text_in_headers_footers(doc.sections, {template: localised})
        
        modified_doc_path = os.path.join(target_folder, f'localised_{doc_path}')
        doc.save(modified_doc_path)
        print(f"Saved modified document to: {modified_doc_path}")

# Executing docx amendment
source_folder = 'source_documents'
target_folder = 'localised_documents'
apply_changes_to_document(changes_json, source_folder, target_folder)

## Logo replacement

The current implementation works only for a single logo. The process below uses image recognition (perceptual hashing) to find the image inside each document that matches the placeholder logo in the `source_documents` folder, and then replaces it with the logo in the `localisation_information` folder.
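
The matching step relies on `imagehash.average_hash` and treats a hash difference of 0 as a match. The sketch below is a simplified, pure-Python illustration of the average-hash idea on made-up 4x4 grayscale grids. It is not the library's implementation, which also converts the image to grayscale and shrinks it to a small grid before hashing.

```python
def average_hash_bits(pixels):
    """Return one bit per pixel: True where the pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [value > mean for value in flat]

def hamming_distance(bits_a, bits_b):
    """Count the positions at which two bit lists differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))

# Made-up grayscale grids (0 = black, 255 = white).
placeholder_pixels = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
]
# Same pattern, slightly darker overall (as might happen after re-saving).
darker_pixels = [[max(value - 10, 0) for value in row] for row in placeholder_pixels]
# A different pattern altogether.
other_pixels = [
    [255, 255, 255, 255],
    [255, 255, 255, 255],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

placeholder_bits = average_hash_bits(placeholder_pixels)
same_distance = hamming_distance(placeholder_bits, average_hash_bits(darker_pixels))
other_distance = hamming_distance(placeholder_bits, average_hash_bits(other_pixels))
```

Here the darker copy still hashes identically (distance 0) while the different pattern does not (distance 8), which is why small brightness changes do not prevent a match but a different image does.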

In [None]:
# These imports moved up
# import os
# import zipfile
# import shutil
# from PIL import Image
# import imagehash
# from docx import Document

def extract_images(doc_path, extract_dir):
    """ Unzip a docx file and return the paths of all images in its word/media folder. """
    with zipfile.ZipFile(doc_path, 'r') as zip_ref:
        zip_ref.extractall(extract_dir)
    media_dir = os.path.join(extract_dir, 'word', 'media')
    if not os.path.isdir(media_dir):
        return []  # The document contains no embedded images
    return [os.path.join(media_dir, name) for name in os.listdir(media_dir)]

def find_matching_image(images, placeholder_image_path):
    """ Find and return the path of the image that matches the placeholder image. """
    placeholder_hash = imagehash.average_hash(Image.open(placeholder_image_path))
    for image_path in images:
        try:
            current_hash = imagehash.average_hash(Image.open(image_path))
        except Exception:
            continue  # Skip media files that Pillow cannot read (e.g. EMF/WMF graphics)
        if current_hash - placeholder_hash == 0:  # A difference of 0 means identical hashes
            return image_path
    return None

def replace_image(doc_path, image_to_replace_path, new_image_path):
    """ Replace a specified image in the docx file with a new image. """
    # Extract docx to temp directory
    temp_dir = 'temp_docx'
    if os.path.exists(temp_dir):
        shutil.rmtree(temp_dir)
    os.makedirs(temp_dir)

    try:
        images = extract_images(doc_path, temp_dir)
        matching_image_path = find_matching_image(images, image_to_replace_path)

        if matching_image_path:
            # Overwrite the placeholder with the new logo, keeping the original file name
            shutil.copy(new_image_path, matching_image_path)

            # Repackage the DOCX with compression
            with zipfile.ZipFile(doc_path, 'w', zipfile.ZIP_DEFLATED) as docx_zip:
                for root, dirs, files in os.walk(temp_dir):
                    for file in files:
                        full_path = os.path.join(root, file)
                        docx_zip.write(full_path, os.path.relpath(full_path, temp_dir))
    finally:
        shutil.rmtree(temp_dir)  # Clean up even if something fails

def get_image_path(folder_path):
    """Finds the path of the first image file in a folder.

    Args:
        folder_path (str): The path to the folder containing the image.

    Returns:
        str: The full path to the image file, or None if no image is found.
    """

    # Use os.scandir() for potentially better performance and resource management
    for entry in os.scandir(folder_path):
        if entry.is_file() and entry.name.lower().endswith((".jpg", ".jpeg", ".png", ".bmp", ".gif")):
            return os.path.join(folder_path, entry.name)

    return None  # No image found

def list_docx_files(folder_path):
    """Lists all files ending with .docx in a folder.

    Args:
        folder_path (str): The path to the folder containing the files.

    Returns:
        list: A list of full paths to the .docx files, or an empty list if none are found.
    """

    docx_paths = []
    for entry in os.scandir(folder_path):
        if entry.is_file() and entry.name.lower().endswith(".docx"):
            docx_paths.append(os.path.join(folder_path, entry.name))

    return docx_paths

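# Define variables and change the logo
placeholder_logo_path = get_image_path('source_documents')
new_logo_path = get_image_path('localisation_information')
docx_files = list_docx_files('localised_documents')

if placeholder_logo_path is None or new_logo_path is None:
    # Without both images there is nothing to match or to insert
    print("Logo replacement skipped: placeholder logo or site logo not found.")
else:
    for docx_file in docx_files:
        replace_image(docx_file, placeholder_logo_path, new_logo_path)
    print("Image replacement complete.")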

# Zip the localised documents
shutil.make_archive('localised_documents', 'zip', 'localised_documents')

print("Localised documents zipped successfully. You can download the zip file now from the left sidebar.")
