This Python script extracts highlighted text from PDF files and saves it to an Excel spreadsheet. It uses PyMuPDF (imported as `fitz`) to read the PDFs and `openpyxl` to write the workbook. The script is built around two functions:

1. **`extract_highlighted_text(pdf_path)`**: Takes the path to a PDF file and returns the text of every highlight in it. It iterates over the pages, reads each page's annotations (highlights are one kind of annotation), and keeps only those whose type code is 8, which is PyMuPDF's code for a highlight. The text associated with each highlight annotation is collected into a list.

2. **`extract_to_excel(folder_path, excel_path)`**: Iterates over the PDF files in a folder and calls `extract_highlighted_text` on each one. It creates a new workbook, writes a header row, and then adds one row per highlight with two columns: the filename and the highlighted text. Finally, it saves the workbook to `excel_path`, overwriting any existing file at that path.

The script concludes with a demonstration of how to use these functions, specifying a folder containing PDF files and the path where the Excel file should be saved.
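A highlight that spans several lines is stored as a single annotation with four corner points per highlighted line. The sketch below is an illustration and not part of the script: it assumes those points are available as a flat list of `(x, y)` tuples (PyMuPDF exposes them as `annot.vertices`) and groups them into one bounding box per line. The helper name `quads_to_rects` and the sample coordinates are hypothetical.

```python
def quads_to_rects(vertices):
    """Group a flat list of (x, y) points into one (x0, y0, x1, y1) box per quad."""
    rects = []
    # Each highlighted line contributes four corner points; ignore a trailing partial group.
    for i in range(0, len(vertices) - len(vertices) % 4, 4):
        xs = [p[0] for p in vertices[i:i + 4]]
        ys = [p[1] for p in vertices[i:i + 4]]
        rects.append((min(xs), min(ys), max(xs), max(ys)))
    return rects


# Two highlighted lines, four corners each (made-up coordinates).
sample = [(72, 100), (300, 100), (72, 112), (300, 112),
          (72, 114), (180, 114), (72, 126), (180, 126)]
print(quads_to_rects(sample))
# → [(72, 100, 300, 112), (72, 114, 180, 126)]
```

Each box could then be used as the clip area when reading text from the page, which should pick up less stray text than the single bounding rectangle of the whole annotation.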

### Usage Instructions:

1. **Install Dependencies**: The script needs PyMuPDF (imported as `fitz`) and `openpyxl`. Install the `PyMuPDF` package rather than the unrelated `fitz` package on PyPI:
   ```
   pip install PyMuPDF openpyxl
   ```

2. **Prepare PDF Files**: Place all the PDF files you want to extract highlighted text from in a single folder.

3. **Configure Paths**: Modify the `folder_path` and `excel_path` variables in the script to match the locations on your system where the PDFs are stored and where you want the Excel file to be saved, respectively.

4. **Run the Script**: Execute the script. It will process each PDF file in the specified folder, extract highlighted texts, and save the results in an Excel file at the specified path.
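Before running the script, it can help to confirm that the configured paths are usable. The following is a minimal sketch, not part of the script itself; the helper name `check_paths` is hypothetical, and the example paths are the placeholder values used in the script.

```python
from pathlib import Path


def check_paths(folder_path, excel_path):
    """Return a list of problems found with the two configured paths."""
    problems = []
    folder = Path(folder_path)
    target = Path(excel_path)
    if not folder.is_dir():
        problems.append(f"PDF folder not found: {folder}")
    if target.suffix.lower() != ".xlsx":
        problems.append(f"Output file should end in .xlsx: {target}")
    if not target.parent.is_dir():
        problems.append(f"Output folder not found: {target.parent}")
    return problems


# Print any problems with the placeholder paths; an empty list means both look fine.
for problem in check_paths("C:/Users/user/Desktop", "C:/Users/user/Desktop/highlights.xlsx"):
    print(problem)
```

An empty result means the folder exists, the output name has the `.xlsx` extension, and the output folder is present.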

### Notes:

- The script collects only annotations of type 8 (highlight). Text marked with other markup annotations, such as underline, squiggly underline, or strikeout, has a different type code and is skipped; extend the check in `extract_highlighted_text` if you need it.
- The script is designed for simplicity and does not handle potential errors, such as file access issues or corrupt PDFs. Depending on your use case, you may want to add error handling mechanisms.
- The performance of the script can vary based on the size and number of PDF files processed. For large datasets, consider implementing progress indicators or optimizing the PDF processing loop.
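The error handling and progress reporting mentioned in the notes above could be sketched as follows. This is one possible approach under stated assumptions, not the script's own behavior: `collect_highlights` is a hypothetical helper that takes the extraction function as a parameter, and `fake_extractor` stands in for `extract_highlighted_text` so the example runs without any PDF files.

```python
import os
import tempfile


def collect_highlights(folder_path, extractor):
    """Run `extractor` on every PDF in a folder, skipping files that fail."""
    rows, failures = [], []
    pdf_names = sorted(
        name for name in os.listdir(folder_path) if name.lower().endswith(".pdf")
    )
    for index, name in enumerate(pdf_names, start=1):
        print(f"[{index}/{len(pdf_names)}] {name}")  # simple progress indicator
        try:
            texts = extractor(os.path.join(folder_path, name))
        except Exception as exc:  # e.g. an unreadable or corrupt file
            failures.append((name, str(exc)))
            continue
        rows.extend((name, text) for text in texts)
    return rows, failures


def fake_extractor(path):
    """Stand-in for extract_highlighted_text; fails on one file on purpose."""
    if path.endswith("bad.pdf"):
        raise ValueError("cannot open file")
    return ["first highlight", "second highlight"]


with tempfile.TemporaryDirectory() as folder:
    # Create empty placeholder files; only the names matter for this demo.
    for name in ("a.pdf", "bad.pdf", "notes.txt"):
        open(os.path.join(folder, name), "w").close()
    rows, failures = collect_highlights(folder, fake_extractor)

print(rows)      # rows for a.pdf only
print(failures)  # bad.pdf with its error message
```

In the script, `extract_highlighted_text` would be passed as the extractor, and each `(filename, text)` row could be appended to the sheet while `failures` is reported at the end.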

This script is a useful tool for researchers, students, or professionals who frequently work with PDFs and need an efficient way to compile and analyze highlighted text across multiple documents.

In [None]:
import os
import fitz  # PyMuPDF
from openpyxl import Workbook

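def extract_highlighted_text(pdf_path):
    """ Extracts highlighted text from a given PDF file. """
    highlighted_text = []
    
    with fitz.open(pdf_path) as doc:
        for page in doc:
            # page.annots() yields nothing when the page has no annotations
            for annot in page.annots():
                if annot.type[0] == fitz.PDF_ANNOT_HIGHLIGHT:  # type code 8
                    # annot.info["content"] usually holds only an attached comment,
                    # so read the page text under the highlight's bounding rectangle.
                    # For multi-line highlights this may include some neighboring text.
                    text = page.get_textbox(annot.rect).strip()
                    if text:
                        highlighted_text.append(text)

    return highlighted_text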

def extract_to_excel(folder_path, excel_path):
    """ Extracts highlighted text from all PDFs in a folder and saves to an Excel file. """
    workbook = Workbook()
    sheet = workbook.active
    sheet.title = 'Highlighted Texts'
    sheet.append(['File Name', 'Highlighted Text'])

    for filename in sorted(os.listdir(folder_path)):  # sorted for a stable row order
        if filename.lower().endswith('.pdf'):  # also match .PDF
            pdf_path = os.path.join(folder_path, filename)
            highlights = extract_highlighted_text(pdf_path)
            
            for text in highlights:
                sheet.append([filename, text])

    workbook.save(excel_path)
    print(f"Data saved to {excel_path}")

# Set folder_path to the folder containing your PDFs
# Set excel_path to the location where the Excel file should be saved
folder_path = 'C:/Users/user/Desktop'
excel_path = 'C:/Users/user/Desktop/highlights.xlsx'

extract_to_excel(folder_path, excel_path)