**Description of project**:
This project started with Eileen and Angela. Angela used to sift manually through Annex Reviews documents from our stakeholders, a time-consuming task, especially for documents running to a hundred pages. I streamlined this process by developing a Python script that analyzes how often specific terms or phrases appear in a document. The automation saves time and makes our document reviews more accurate, helping us check that stakeholders follow the correct templates and use consistent terminology.

**1. install packages**

In [1]:
!pip install pymupdf python-docx

Collecting pymupdf
  Downloading PyMuPDF-1.24.1-cp310-none-manylinux2014_x86_64.whl (3.9 MB)
Collecting python-docx
  Downloading python_docx-1.1.0-py3-none-any.whl (239 kB)
Collecting PyMuPDFb==1.24.1 (from pymupdf)
  Downloading PyMuPDFb-1.24.1-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (30.8 MB)
Installing collected packages: python-docx, PyMuPDFb, pymupdf
Successfully installed PyMuPDFb-1.24.1 pymupdf-1.24.1 python-docx-1.1.0


**2. import modules**

In [None]:
import fitz  # PyMuPDF
import docx
import docx.enum.text  # WD_COLOR_INDEX, used for highlighting
from docx.shared import RGBColor
from docx.oxml.ns import nsdecls
from docx.oxml import parse_xml

**3. script that searches for specific keywords in a PDF file and creates a Word document highlighting those keywords**

The script below searches a PDF file for specific keywords and creates a Word document that shows each keyword, highlighted, with some surrounding context.

1. Read the Inputs: It takes a PDF file path and a list of keywords.
2. Extract Pages from PDF: It opens the PDF file and extracts the text of each page.
3. Search for Keywords: For each page, it checks whether any of the keywords appears in the text. The match is case-sensitive.
4. Extract Context: If a keyword is found, it extracts up to 200 characters on either side of the keyword to provide context.
5. Create a Word Document: It creates a Word document and adds a paragraph for each occurrence of a keyword along with the context.
6. Highlight Keywords: It highlights the keyword in yellow within the context.
7. Save the Output: It saves the resulting Word document to a specified file path.

The result is a Word document that lists the keyword hits page by page, so a reviewer can scan them without reading the whole PDF.
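The context-extraction step can be illustrated on a plain string, without a PDF. The following is a minimal sketch rather than the project's actual code: the helper name `find_contexts`, its `context_chars` parameter, and the sample sentence are all made up for illustration.

```python
def find_contexts(text, keyword, context_chars=200):
    """Return (before, match, after) triples for every occurrence of keyword in text."""
    hits = []
    if not keyword:
        return hits  # an empty keyword would match everywhere
    start = text.find(keyword)
    while start != -1:
        end = start + len(keyword)
        # Clamp the window at the start of the text; slicing clamps the end automatically
        before = text[max(0, start - context_chars):start]
        after = text[end:end + context_chars]
        hits.append((before, text[start:end], after))
        # Resume the search after this match so that every occurrence is reported
        start = text.find(keyword, end)
    return hits


# Hypothetical sample text, with a small window so the output stays short
sample = "Regional coordination is vital. Emergency coordination saves lives."
for before, match, after in find_contexts(sample, "coordination", context_chars=10):
    print(f"...{before}[{match}]{after}...")
```

Returning the text before and after the match as separate pieces means the keyword's position is known exactly, so it can be highlighted without searching the context a second time.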

In [None]:
def search_keywords_in_pdf(pdf_path, keywords, context_chars=200):
    """Return a Word document listing every keyword hit in the PDF with its context."""
    doc = fitz.open(pdf_path)
    output_doc = docx.Document()

    for page_number in range(doc.page_count):
        page = doc[page_number]
        text = page.get_text()

        for keyword in keywords:
            if not keyword:
                continue  # an empty keyword would match everywhere

            # Walk through every occurrence on the page (matching is case-sensitive)
            match_start = text.find(keyword)
            while match_start != -1:
                match_end = match_start + len(keyword)

                # Take up to context_chars characters on each side of the keyword
                context_start = max(0, match_start - context_chars)
                context_end = min(len(text), match_end + context_chars)

                # One paragraph per hit: header, text before, highlighted keyword, text after
                p = output_doc.add_paragraph(f"Page {page_number + 1}, Keyword: {keyword}:\n")
                p.add_run(text[context_start:match_start])
                keyword_run = p.add_run(text[match_start:match_end])
                keyword_run.font.highlight_color = docx.enum.text.WD_COLOR_INDEX.YELLOW
                p.add_run(text[match_end:context_end])

                # Continue searching after this occurrence
                match_start = text.find(keyword, match_end)

    doc.close()
    return output_doc

def save_output_to_file(output_doc, output_file):
    output_doc.save(output_file)

if __name__ == "__main__":
    pdf_path = "/content/HPP_2021-2022_NRW-HCC_BurnSurgeAnnex_Final.pdf"
    keywords = ["coordination", "emergency", "third_keyword"]
    output_file = "/content/output.docx"

    result_doc = search_keywords_in_pdf(pdf_path, keywords)

    if len(result_doc.paragraphs) > 0:
        save_output_to_file(result_doc, output_file)
        print(f"Output saved to {output_file}")
    else:
        print("No occurrences of any keyword found in the PDF.")


Output saved to /content/output.docx
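The project description mentions analyzing how often specific terms appear, which the highlighting script does not report directly. A frequency tally could be sketched as follows. This is an assumption about how such a count might be done, not part of the original script: the function name `count_keywords`, its `ignore_case` option, and the sample page texts are hypothetical.

```python
import re
from collections import Counter


def count_keywords(page_texts, keywords, ignore_case=True):
    """Count non-overlapping occurrences of each keyword across a list of page texts."""
    flags = re.IGNORECASE if ignore_case else 0
    counts = Counter()
    for keyword in keywords:
        if not keyword:
            continue  # an empty pattern would match at every position
        # re.escape treats the keyword as literal text, not as a regular expression
        pattern = re.compile(re.escape(keyword), flags)
        counts[keyword] = sum(len(pattern.findall(text)) for text in page_texts)
    return counts


# Stand-in page texts; with PyMuPDF these would come from page.get_text() for each page
pages = [
    "Emergency coordination begins at the regional level.",
    "Coordination with burn centers continues during the emergency.",
]
for keyword, total in count_keywords(pages, ["coordination", "emergency"]).items():
    print(f"{keyword}: {total}")
```

Counting is case-insensitive by default here so that a keyword at the start of a sentence is included; passing `ignore_case=False` gives the same case-sensitive matching as the `in` check.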
