**Description of project**:
This project started with Eileen and Angela. Angela used to sift manually through Annex Reviews documents from our stakeholders, a time-consuming task, especially for documents running to a hundred pages. I streamlined the process with a Python script that finds where specific terms or phrases occur in a document and reports the pages on which they appear. Automating the search saves time and makes our document reviews more consistent, helping us check that stakeholders follow the correct templates and use consistent terminology.

**1. install packages**

In [None]:
!pip install pymupdf python-docx

Collecting pymupdf
  Downloading PyMuPDF-1.24.0-cp310-none-manylinux2014_x86_64.whl (3.9 MB)
Collecting python-docx
  Downloading python_docx-1.1.0-py3-none-any.whl (239 kB)
Collecting PyMuPDFb==1.24.0 (from pymupdf)
  Downloading PyMuPDFb-1.24.0-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (30.8 MB)
Installing collected packages: python-docx, PyMuPDFb, pymupdf
Successfully installed PyMuPDFb-1.24.0 pymupdf-1.24.0 python-docx-1.1.0


**2. import modules**

In [None]:
import fitz  # PyMuPDF
import docx

**3. search for specific keywords in a PDF file and save the matches, with page numbers and context, to a Word document**

The function below searches a PDF file for specific keywords and writes each match, with its page number and context, to a Word document. Its steps are:

1. **Opening PDF File**: It starts by opening the PDF file specified by `pdf_path` using PyMuPDF (imported as `fitz`).

2. **Initializing Word Document**: It creates an empty Word document using python-docx (imported as `docx`).

3. **Storing Keyword Pages**: It creates an empty dictionary called `keyword_pages` to store the page numbers where each keyword is found.

4. **Searching Keywords in PDF**: It iterates through each page of the PDF and searches for each keyword provided in the `keywords` list. If a keyword is found on a page, it stores the page number and adds information about the keyword's context in the Word document.

5. **Saving Word Document**: Once every page has been searched, it saves the Word document to the specified path (`output_docx_path`).

6. **Displaying Keyword Pages**: Finally, it prints out the pages where each keyword is found.
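
The bookkeeping in steps 3, 4 and 6 can be tried out without a PDF. The sketch below is only an illustration: `find_keyword_pages` and `sample_pages` are hypothetical names, plain strings stand in for page text, and matching is a simple case-insensitive substring test rather than PyMuPDF's own search.

```python
def find_keyword_pages(page_texts, keywords):
    """Map each keyword to the 1-based page numbers whose text contains it."""
    # Start every keyword with an empty list so unmatched keywords still appear
    keyword_pages = {keyword: [] for keyword in keywords}

    for page_number, text in enumerate(page_texts, start=1):
        lowered = text.lower()
        for keyword in keywords:
            # Case-insensitive substring match (an assumption of this sketch)
            if keyword.lower() in lowered:
                keyword_pages[keyword].append(page_number)

    return keyword_pages


# Hypothetical page texts standing in for a three-page PDF
sample_pages = [
    "Regional coordination of burn surge resources.",
    "Emergency response roles and contacts.",
    "Coordination with partner hospitals during an emergency.",
]
result = find_keyword_pages(sample_pages, ["coordination", "emergency", "response"])
print(result)
# {'coordination': [1, 3], 'emergency': [2, 3], 'response': [2]}
```

Pre-filling the dictionary with empty lists means a keyword that never appears still shows up in the final summary, just with no pages.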

The function `search_keywords_in_pdf_and_save_docx` takes three parameters:
- `pdf_path`: The path of the PDF file to search in.
- `keywords`: A list of keywords to search for in the PDF.
- `output_docx_path`: The path where the output Word document will be saved.
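
The function uses the `keywords` list as given, so a repeated entry would be searched twice and an empty string is not a meaningful search term. One possible precaution, sketched here with a hypothetical helper `clean_keywords` that is not part of the original script, is to tidy the list first. Duplicates are compared case-insensitively because PyMuPDF's text search is documented as case-insensitive.

```python
def clean_keywords(keywords):
    """Strip whitespace, drop empty entries, and remove case-insensitive duplicates."""
    cleaned = []
    seen = set()
    for keyword in keywords:
        stripped = keyword.strip()
        key = stripped.lower()
        # Keep the first spelling of each keyword and skip blanks
        if stripped and key not in seen:
            seen.add(key)
            cleaned.append(stripped)
    return cleaned


print(clean_keywords([" coordination", "Emergency", "emergency ", "", "response"]))
# ['coordination', 'Emergency', 'response']
```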

The example at the end demonstrates how to use this function. It searches for the keywords "coordination", "emergency", and "response" in the PDF file located at `pdf_path` and writes every match, with its page number and context, to the Word document at `output_docx_path`.

In [None]:
def search_keywords_in_pdf_and_save_docx(pdf_path, keywords, output_docx_path):
    # Open the PDF file
    pdf_document = fitz.open(pdf_path)

    # Initialize a Word document
    doc = docx.Document()

    # Initialize a dictionary to store page numbers where each keyword is found
    keyword_pages = {keyword: [] for keyword in keywords}

    # Iterate through each page in the PDF
    for page_number in range(pdf_document.page_count):
        # Get the page
        page = pdf_document[page_number]

        # Search for each keyword in the text of the page
        for keyword in keywords:
            keyword_instances = page.search_for(keyword)

            # If the keyword is found on the page, store the page number and add information to the Word document
            if keyword_instances:
                keyword_pages[keyword].append(page_number + 1)  # Page numbers start from 1

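                # Record the line of text around each match. Clipping to the match
                # rectangle alone returns little more than the keyword itself, so
                # the clip is widened to the full page width at the same height.
                for inst in keyword_instances:
                    line_rect = fitz.Rect(page.rect.x0, inst.y0, page.rect.x1, inst.y1)
                    context = " ".join(page.get_text("text", clip=line_rect).split())
                    entry = f"Page {page_number + 1}, Keyword: {keyword}, Context: {context}"
                    print(entry)

                    # Add the same information to the Word document
                    doc.add_paragraph(entry)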

    # Close the PDF document
    pdf_document.close()

    # Save the Word document
    doc.save(output_docx_path)

    # Display the pages where each keyword is found
    for keyword, pages in keyword_pages.items():
        if pages:
            print(f"\nKeyword '{keyword}' found on pages: {', '.join(map(str, pages))}")
        else:
            print(f"\nKeyword '{keyword}' not found")


# Example usage with multiple keywords
pdf_path = "/content/HPP_2021-2022_NRW-HCC_BurnSurgeAnnex_Final.pdf"
output_docx_path = "/content/output_document.docx"
keywords_to_search = ["coordination", "emergency", "response"]
search_keywords_in_pdf_and_save_docx(pdf_path, keywords_to_search, output_docx_path)
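
The function above reports where each keyword appears but not how often. If a frequency count is also wanted, one possible approach is sketched below. It is not part of the original script: `count_keyword_occurrences` and `sample_pages` are hypothetical names, and the sketch works on plain page text and counts substring matches, so "response" would also match "responses".

```python
import re
from collections import Counter


def count_keyword_occurrences(page_texts, keywords):
    """Count case-insensitive occurrences of each keyword across all pages."""
    counts = Counter({keyword: 0 for keyword in keywords})
    for text in page_texts:
        for keyword in keywords:
            # re.escape keeps punctuation in a keyword from acting as regex syntax
            pattern = re.compile(re.escape(keyword), re.IGNORECASE)
            counts[keyword] += len(pattern.findall(text))
    return counts


# Hypothetical page texts; with PyMuPDF they could come from
# [page.get_text() for page in fitz.open(pdf_path)]
sample_pages = [
    "Emergency coordination. Coordination is reviewed yearly.",
    "Response teams follow the emergency plan.",
]
counts = count_keyword_occurrences(sample_pages, ["coordination", "emergency", "response"])
for keyword, total in counts.most_common():
    print(f"{keyword}: {total}")
```

Keeping the counting separate from the PDF handling makes it easy to check on a few sample strings before running it on a hundred-page document.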

