Replace PyPDF2 with pypdfium2 #38

yiwei-ang · 2023-08-23T05:06:38Z

I really appreciate @alejandro-ao for creating a great video demonstrating the perfect blend of OpenAI, PDF readers, and Streamlit!

I've tried the tool on several PDFs and found an issue with the text extraction quality of PyPDF2: the contents of a PDF are not always extracted completely.

After looking into https://github.com/py-pdf/benchmarks, it seems we can switch to pypdfium2, which offers similar functionality with better text extraction quality and faster processing (verified on my end!).
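
For readers who want to see what the swap might look like, here is a minimal sketch, not the actual diff from this PR. The helper name `extract_pdf_text` and the assumption that uploads arrive as binary file-like objects (as Streamlit's uploader provides) are hypothetical. It assumes pypdfium2 4.x, where `PdfTextPage.get_text_range()` returns the text of a page.

```python
from typing import BinaryIO, Iterable


def extract_pdf_text(pdf_docs: Iterable[BinaryIO]) -> str:
    """Concatenate the text of every page of every PDF (hypothetical helper)."""
    # Imported lazily so the module can be loaded even if pypdfium2 is missing.
    import pypdfium2 as pdfium

    chunks = []
    for pdf_file in pdf_docs:
        # Read the upload into memory; PdfDocument accepts raw bytes.
        pdf = pdfium.PdfDocument(pdf_file.read())
        try:
            for index in range(len(pdf)):
                page = pdf[index]
                textpage = page.get_textpage()
                # With no arguments, get_text_range() covers the whole page.
                chunks.append(textpage.get_text_range())
                # Close child objects before their parents.
                textpage.close()
                page.close()
        finally:
            pdf.close()
    return "\n".join(chunks)
```

In a Streamlit app this would presumably be called with the list returned by `st.file_uploader(..., accept_multiple_files=True)`. Extraction quality still depends on the PDF itself, so it is worth comparing the output against PyPDF2 on your own documents.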

IlianP · 2023-09-08T14:20:04Z

As a side note, LangChain also supports pypdfium2 as a document loader:
https://python.langchain.com/docs/modules/data_connection/document_loaders/pdf#using-pypdfium2

costabm · 2023-11-02T14:26:15Z

I have added this important feature to my larger pull request (my first one ever). I gave you credit there, but I'm not sure this is the right way to do it.

yiwei-ang added 3 commits August 23, 2023 12:45

replace PyPDF2 with pypdfium2

0a88e85

replace PyPDF2 with pypdfium2

3557fae

cleanup

f89146b

yiwei-ang changed the title ~~Replace pypdfium2 with~~ Replace PyPDF2 with pypdfium2 Aug 23, 2023
