Cannot load PDF file that contains non-ascii characters in file path on Windows #8

Closed
khanhtd36 opened this issue Oct 16, 2021 · 5 comments


khanhtd36 commented Oct 16, 2021

OS: Windows 10 x64 21H1 (build 19043.1288)
Python: CPython 3.6.8 x64
Library version: pypdfium==0.0.15

A PDF file whose path contains non-ASCII characters (e.g. tênfilechứakýtựđặcbiệt.pdf) cannot be read by the library.

doc = pdfium.FPDF_LoadDocument('tênfilechứakýtựnonasscii.pdf', None)
page_count = pdfium.FPDF_GetPageCount(doc)  # page_count = 0 !?
Contributor

adam-huganir commented Oct 21, 2021

I can't authoritatively say this is not an issue with pypdfium itself, since library bindings are outside my normal wheelhouse. But because pypdfium looks like a direct interface to the pdfium library (Google's library), I suspect nothing can be done on this side without making this project more than just an interface to the pdfium binary.

The relevant source shows that pypdfium just forwards its arguments to the linked library rather than doing its own transformations:

if _libs["pdfium"].has("FPDF_LoadDocument", "cdecl"):

I have run into libraries that do not support certain filenames before. The best options I have found are to rename the file, open it, and then rename the output, or possibly to use FPDF_LoadMemDocument64 to load the document from bytes you read yourself with open("tênfilechứakýtựnonasscii.pdf", "rb") in Python.

API Docs for pdfium FPDF_LoadMemDocument function here:
https://developers.foxit.com/resources/pdf-sdk/c_api_reference_pdfium/group___f_p_d_f_i_u_m.html#gafc1bfd72af5ccb5d6b9112c50fe90834
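
The rename workaround mentioned above could be sketched roughly as follows. This is only an illustration that copies the file instead of renaming it: the helper name `ascii_temp_copy` and its parameters are hypothetical and not part of pypdfium. It also assumes the temporary directory path itself is ASCII-only, which may not hold if, for example, the Windows user name contains non-ASCII characters; in that case pass a safe `base_dir`.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def ascii_temp_copy(src_path, suffix=".pdf", base_dir=None):
    """Yield the path of a temporary copy of src_path with an ASCII-only file name.

    Hypothetical helper, not part of pypdfium. base_dir is an optional
    directory known to have an ASCII-only path.
    """
    tmp_dir = tempfile.mkdtemp(prefix="pdf_ascii_", dir=base_dir)
    tmp_path = os.path.join(tmp_dir, "input" + suffix)
    try:
        # Python's own file APIs handle the non-ASCII source path.
        shutil.copyfile(src_path, tmp_path)
        yield tmp_path
    finally:
        # Remove the copy once the caller is done with it.
        shutil.rmtree(tmp_dir, ignore_errors=True)


# Possible usage, assuming `pdfium` is the module used elsewhere in this thread:
# with ascii_temp_copy("tênfilechứakýtựnonasscii.pdf") as path:
#     doc = pdfium.FPDF_LoadDocument(path, None)
#     ...  # work with doc and close it before leaving the block
```

The document should be closed before the `with` block ends, since the temporary copy is deleted at that point.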

Collaborator

mara004 commented Oct 23, 2021

Confirmed this is a platform-specific issue, see https://github.com/mara004/pypdfium-reboot/issues/1#issuecomment-950174027

Author

khanhtd36 commented Oct 27, 2021

Thank you. Using FPDF_LoadMemDocument, I can now load any PDF file without error.

Here is my code for anyone who needs it.

def _load_pdf(pdf_filepath, password=None):
  # return None if file not found
  # return pdfium.FPDF_LoadDocument(f'{pdf_filepath}', password)

  try:
    with open(f'{pdf_filepath}', 'rb') as pdf_file:
      pdf_content_buff = pdf_file.read()
      buff_size = len(pdf_content_buff)
      return pdfium.FPDF_LoadMemDocument(pdf_content_buff, buff_size, password)
  except FileNotFoundError:
    return None

khanhtd36 changed the title from "Cannot load PDF file that contains non-ascii characters in file path" to "Cannot load PDF file that contains non-ascii characters in file path on Windows" on Oct 27, 2021
Collaborator

mara004 commented Sep 8, 2022

This is now fixed upstream. pypdfium no longer receives updates, but pypdfium2 wheels have contained the fix for quite some time. pypdfium2 now also properly offers other loading strategies in the support mode, including FPDF_LoadMemDocument() and FPDF_LoadCustomDocument().

Important note: the FPDF_LoadMemDocument() code as shown above is not complete; callers need to ensure that the buffer object stays referenced, otherwise it may be garbage collected while PDFium still assumes the memory to contain the PDF. This may lead to random content errors or segmentation faults.
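
One way to keep the buffer referenced is to return it together with the document handle, so that both live and die together. The following is a minimal sketch, not pypdfium or pypdfium2 API: `LoadedPdf` and `load_pdf_keeping_buffer` are hypothetical names, and the loader is passed in as a parameter on the assumption that it has the same call shape as the `pdfium.FPDF_LoadMemDocument(buffer, size, password)` call shown earlier in this thread.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class LoadedPdf:
    """Pairs a document handle with the bytes it was loaded from."""
    handle: Any    # whatever the loader returned
    buffer: bytes  # stays referenced for as long as this object is alive


def load_pdf_keeping_buffer(
    path: str,
    load_mem_document: Callable[[bytes, int, Optional[str]], Any],
    password: Optional[str] = None,
) -> Optional[LoadedPdf]:
    """Read the file in Python, then hand the bytes to the given loader.

    Returns None if the file does not exist, like the snippet above.
    """
    try:
        with open(path, "rb") as pdf_file:
            data = pdf_file.read()
    except FileNotFoundError:
        return None
    handle = load_mem_document(data, len(data), password)
    # Returning the buffer alongside the handle keeps it from being
    # garbage collected while the handle is still in use.
    return LoadedPdf(handle=handle, buffer=data)


# Possible usage, assuming `pdfium` is the module used elsewhere in this thread:
# loaded = load_pdf_keeping_buffer("tênfilechứakýtựnonasscii.pdf",
#                                  pdfium.FPDF_LoadMemDocument)
# page_count = pdfium.FPDF_GetPageCount(loaded.handle)
```

The caller then holds on to the `LoadedPdf` object, rather than only the handle, until the document has been closed.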

@mara004 mara004 closed this as completed Sep 8, 2022