@jmd1010
Contributor

@jmd1010 jmd1010 commented Feb 25, 2025

PDF TO MARKDOWN CONVERSION IMPLEMENTATION

  • PDF to Markdown conversion functionality for the web interface
  • Automatic detection and processing of PDF files in chat
  • Conversion to markdown format for LLM processing
  • Installation instructions from the pdf-to-markdown repository
  • Testing steps for verification

The PDF conversion module is integrated into the Svelte web interface. Once installed, it detects PDF files in the chat interface and converts them to Markdown for LLM processing. No extra servers are required; it works with the existing backend and frontend Svelte servers.

🎥 Demo Video (see at 4 min.)

https://youtu.be/bhwtWXoMASA

This document explains the new PDF to Markdown conversion implementation, detailing its functionality, installation process, and the file changes involved. The converter library comes from https://github.com/jzillmann/pdf-to-markdown/tree/modularize.

Integration with Svelte

The integration approach focused on using the library's high-level API while maintaining SSR compatibility:

  • Create PdfConversionService for PDF processing
  • Detect PDF file uploads in the ChatInput component
  • Convert PDF content to markdown text
  • Integrate with existing chat processing flow
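
One way to keep a browser-only PDF library out of the server-side render is to load it lazily on the client. The sketch below is an illustration, not the code in this PR: `createClientOnlyLoader` is a hypothetical helper, and both the loader and the browser check are passed in so nothing here depends on a specific library API.

```typescript
// Hypothetical helper (not part of this PR): defers loading a browser-only
// module until first use, and refuses to load it during server-side rendering.
type Loader<T> = () => Promise<T>;

function createClientOnlyLoader<T>(
  load: Loader<T>,
  isBrowser: () => boolean
): Loader<T> {
  let cached: Promise<T> | undefined;
  return () => {
    if (!isBrowser()) {
      // On the server we never touch the PDF code path.
      return Promise.reject(
        new Error("PDF conversion is only available in the browser")
      );
    }
    if (cached === undefined) {
      // Load once and reuse the same promise for later calls.
      cached = load();
    }
    return cached;
  };
}
```

In a SvelteKit app, `isBrowser` could wrap the `browser` flag from `$app/environment`, and `load` could wrap a dynamic `import()` of the conversion module; both wirings are assumptions about how the integration might be set up.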

How it Works

The PDF to Markdown conversion is implemented as a separate module located in the pdf-to-markdown directory. It leverages the pdf-parse library (likely via PdfParser.ts) to parse PDF documents and extract text content. The core logic resides in PdfPipeline.ts, which orchestrates the PDF parsing and conversion process. pdf-to-markdown builds on pdf.js, Mozilla's PDF parsing and rendering platform, which it uses as a raw parser.

Here's a simplified breakdown of the process:

  1. PDF Parsing: The PdfParser.ts uses pdf-parse to read the PDF file and extract text content from each page.
  2. Content Extraction: The extracted text content is processed to identify text elements, formatting, and structure.
  3. Markdown Conversion: The PdfPipeline.ts then converts the extracted and processed text content into Markdown format. This involves mapping PDF elements to Markdown syntax, attempting to preserve formatting like headings, lists, and basic text styles.
  4. Frontend Integration: The PdfConversionService.ts in the web/src/lib/services directory acts as a frontend service that utilizes the pdf-to-markdown module. It provides a convertToMarkdown function that takes a File object (PDF file) as input, calls the pdf-to-markdown module to perform the conversion, and returns the Markdown output as a string.
  5. Chat Input Integration: The ChatInput.svelte component uses the PdfConversionService to convert uploaded PDF files to Markdown before sending the content to the chat service for pattern processing.
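
The frontend service described in step 4 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual PdfConversionService.ts: the `PdfToMarkdown` function type is a hypothetical stand-in for the pdf-to-markdown pipeline (its real API is not shown here), and `UploadedFile` lists only the parts of the browser `File` object that the sketch needs.

```typescript
// Minimal shape of the browser File API used here (a real File satisfies it).
interface UploadedFile {
  name: string;
  type: string;
  arrayBuffer(): Promise<ArrayBuffer>;
}

// Hypothetical stand-in for the pdf-to-markdown pipeline: PDF bytes in,
// Markdown text out. The library's real entry point may look different.
type PdfToMarkdown = (data: Uint8Array) => Promise<string>;

class PdfConversionService {
  readonly convert: PdfToMarkdown;

  constructor(convert: PdfToMarkdown) {
    this.convert = convert;
  }

  // Reads the uploaded file into memory and returns its Markdown rendering.
  async convertToMarkdown(file: UploadedFile): Promise<string> {
    const buffer = await file.arrayBuffer();
    try {
      return await this.convert(new Uint8Array(buffer));
    } catch (err) {
      // Wrap the failure so the chat UI can report which file was affected.
      const reason = err instanceof Error ? err.message : String(err);
      throw new Error(`Failed to convert ${file.name} to Markdown: ${reason}`);
    }
  }
}
```

Passing the converter in through the constructor keeps the service easy to test with a fake converter and avoids tying the sketch to one version of the library.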

Installation

HOW TO INSTALL

FROM FABRIC ROOT DIRECTORY

cd web

Install in this sequence:

Step 1

npm install -D patch-package

Step 2

npm install -D pdfjs-dist@2.5.207

Step 3

npm install -D github:jzillmann/pdf-to-markdown#modularize

File Changes

The following files were added or modified to implement the PDF to Markdown conversion:

  • web/src/lib/services/PdfConversionService.ts: (New file)

Modified files:

  • web/src/lib/components/chat/ChatInput.svelte:
    • Modified to import and use the PdfConversionService in the readFileContent function to handle PDF files.
    • Modified readFileContent to call pdfService.convertToMarkdown for PDF files.

These file changes introduce the new PDF to Markdown conversion functionality and integrate it into the chat input component of the web interface.
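
The change to readFileContent can be pictured as a small dispatch: PDFs go through the conversion service, everything else is read as plain text. The sketch below is an approximation, not the code in ChatInput.svelte. The detection rule (MIME type or `.pdf` extension) and the `ReadableFile` and `MarkdownConverter` interfaces are assumptions made for illustration.

```typescript
// Only the parts of the browser File API this sketch relies on.
interface ReadableFile {
  name: string;
  type: string;
  text(): Promise<string>;
}

// Anything exposing convertToMarkdown, such as the PdfConversionService.
interface MarkdownConverter<F> {
  convertToMarkdown(file: F): Promise<string>;
}

// Assumed detection rule: trust the MIME type, fall back to the extension.
function isPdf(file: { name: string; type: string }): boolean {
  return (
    file.type === "application/pdf" ||
    file.name.toLowerCase().endsWith(".pdf")
  );
}

// Returns text ready for the chat service: Markdown for PDFs,
// the raw text content for every other file.
async function readFileContent<F extends ReadableFile>(
  file: F,
  pdfService: MarkdownConverter<F>
): Promise<string> {
  if (isPdf(file)) {
    return pdfService.convertToMarkdown(file);
  }
  return file.text();
}
```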

@eugeis
Collaborator

eugeis commented Feb 25, 2025

Please resolve the merge conflict.

@jmd1010
Contributor Author

jmd1010 commented Feb 26, 2025

Please resolve the merge conflict.

@eugeis Yes Eugen, will do tomorrow.
Could you please review PR 1321, which just merged? The whole point of that PR was to update the instruction video, and the previous version is still present. Updated video: https://youtu.be/bhwtWXoMASA Thanks, Jean.

@eugeis
Collaborator

eugeis commented Feb 26, 2025

Hi @jmd1010,

I watched the video and really liked it!

The new folders with uppercase letters don't look very clean. It would be better to move them into a subfolder and use shorter, lowercase names for a neater structure.
Also, it would be good to keep all tooling implemented in Go; we have the plugins/tools package for it. I like the description generator from the patterns body; we could implement it as a tool for one or all patterns. There were already some ideas to have additional metadata for patterns in YAML format, so such a tool could generate those meta files in each pattern folder.

Please update the link of the updated video in this PR.

@jmd1010
Contributor Author

jmd1010 commented Feb 26, 2025

@eugeis Yes, agreed. We will make the folder names lowercase, and we can move them around, within the web folder maybe? Good idea to package with Go; we should also do it for pdf-to-markdown, which I'm doing a final testing run on today. Can we collaborate on this Go packaging? Don't want to mess it up...

@eugeis
Collaborator

eugeis commented Feb 26, 2025

  1. Yes, please move it to the web folder for now.
  2. No worries! Feel free to implement your cool tooling inside plugins/go/tools/, and I can clean it up and refactor it afterward.

Contributor Author

@jmd1010 jmd1010 left a comment

@eugeis

  1. Simplified install process V1 + Retest process.
  2. Moved 3 out of 4 README files to /web. The conflict prevented me from moving the pr-1284-update.md file, updating the video, and deleting the folder. Hopefully you can finalize from your end. It's a pesky behind-the-scenes Git sync issue that I can't resolve from my end.
  3. We'll address the Go implementation in the next step. Got to get back to work :)
  4. Finally brought back some .png files that were excluded by my .gitignore, I suppose. Cheers, Jean

@jmd1010 jmd1010 force-pushed the pdf-integration-clean branch from 6c9a3b7 to a74da4a Compare February 27, 2025 04:15
@eugeis eugeis merged commit 0bec533 into danielmiessler:main Feb 27, 2025
1 check passed