Cannot load PDF file that contains non-ascii characters in file path on Windows #8

khanhtd36 · 2021-10-16T22:44:24Z

OS: Windows 10 x64 21H1 (build 19043.1288)
Python: CPython 3.6.8 x64
Library version: pypdfium==0.0.15

An input PDF file path that contains non-ASCII characters (e.g. tênfilechứakýtựđặcbiệt.pdf) cannot be read by the library.

doc = pdfium.FPDF_LoadDocument('tênfilechứakýtựnonasscii.pdf', None)
page_count = pdfium.FPDF_GetPageCount(doc)  # page_count = 0 !?

adam-huganir · 2021-10-21T13:33:02Z

I can't say authoritatively that this is not an issue in pypdfium itself, since library bindings are outside my normal wheelhouse. However, pypdfium looks like a direct interface to the pdfium library (as in the Google library), so I suspect nothing can be done on this side without making this project more than just an interface to the pdfium binary.

The relevant source shows that pypdfium just forwards the arguments to the linked library rather than doing its own transformations:

pypdfium/pypdfium/pypdfium.py

Line 1170 in 5807c8a

if _libs["pdfium"].has("FPDF_LoadDocument", "cdecl"):

I have run into libraries that don't support such filenames before. The best options I have come up with are to rename the file, open it, and then rename the output, or possibly to use FPDF_LoadMemDocument64 to load from the bytes of a file you open manually in Python with open("tênfilechứakýtựnonasscii.pdf", "rb").

API Docs for pdfium FPDF_LoadMemDocument function here:
https://developers.foxit.com/resources/pdf-sdk/c_api_reference_pdfium/group___f_p_d_f_i_u_m.html#gafc1bfd72af5ccb5d6b9112c50fe90834
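The rename workaround mentioned above could be sketched roughly as follows, using a temporary copy rather than renaming the original. The helper name ascii_named_copy is hypothetical and not part of pypdfium. It also assumes the temporary directory itself has an ASCII-only path, which may not hold on Windows when the user name contains non-ASCII characters.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def ascii_named_copy(src_path, suffix=".pdf"):
    """Yield the path of a temporary copy of src_path with an ASCII file name.

    Hypothetical helper: only the file name is guaranteed to be ASCII,
    not the directory that tempfile chooses.
    """
    fd, tmp_path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # only the path is needed; copyfile reopens it
    try:
        shutil.copyfile(src_path, tmp_path)  # Python handles the Unicode source path
        yield tmp_path
    finally:
        os.remove(tmp_path)  # clean up the copy when the block exits
```

The yielded path can then be passed to a path-based loader inside the with block. Any document opened from it should be closed before the block exits, since the copy is deleted at that point.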

mara004 · 2021-10-23T16:07:16Z

Confirmed this is a platform-specific issue, see https://github.com/mara004/pypdfium-reboot/issues/1#issuecomment-950174027

khanhtd36 · 2021-10-27T11:04:07Z

Thank you. Using FPDF_LoadMemDocument, I can now load any PDF file without error.

Here is my code for anyone who needs it.

def _load_pdf(pdf_filepath, password=None):
  # return None if file not found
  # return pdfium.FPDF_LoadDocument(f'{pdf_filepath}', password)

  try:
    with open(f'{pdf_filepath}', 'rb') as pdf_file:
      pdf_content_buff = pdf_file.read()
      buff_size = len(pdf_content_buff)
      return pdfium.FPDF_LoadMemDocument(pdf_content_buff, buff_size, password)
  except FileNotFoundError:
    return None

mara004 · 2022-09-08T14:57:56Z

This is now fixed upstream. pypdfium obviously doesn't get updates anymore, but pypdfium2 wheels have contained the fix for quite some time. pypdfium2 now also properly offers other loading strategies in the support mode, including FPDF_LoadMemDocument() and FPDF_LoadCustomDocument().

Important note: the FPDF_LoadMemDocument() code as shown above is not complete; callers need to ensure that the buffer object stays referenced, otherwise it may be garbage collected while PDFium still assumes the memory to contain the PDF. This may lead to random content errors or segmentation faults.
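One way to address the lifetime concern is to return the buffer together with the document handle, so the caller holds both for as long as the document is in use. The following is only a sketch: LoadedPdf and load_pdf_in_memory are hypothetical names, and the loader is passed in as a parameter on the assumption that it is called like pdfium.FPDF_LoadMemDocument(buffer, size, password) in the code above.

```python
import ctypes
from pathlib import Path


class LoadedPdf:
    """Pairs a document handle with the buffer that backs it (hypothetical)."""

    def __init__(self, doc, buffer):
        self.doc = doc          # handle returned by the loader
        self._buffer = buffer   # referenced here so it is not garbage collected


def load_pdf_in_memory(pdf_filepath, load_mem_document, password=None):
    """Read the file in Python and pass the bytes to a memory loader.

    load_mem_document is assumed to take (buffer, size, password), like
    pdfium.FPDF_LoadMemDocument in the snippet above.
    Returns None if the file does not exist.
    """
    try:
        data = Path(pdf_filepath).read_bytes()  # Python handles the Unicode path
    except FileNotFoundError:
        return None
    # Copy the bytes into a ctypes array owned by the returned object.
    buffer = (ctypes.c_char * len(data)).from_buffer_copy(data)
    doc = load_mem_document(buffer, len(data), password)
    return LoadedPdf(doc, buffer)
```

A caller would keep the returned LoadedPdf object around, use its doc attribute with the other FPDF_* functions, and drop the object only after closing the document.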


mara004 mentioned this issue Oct 23, 2021

FPDF_LoadDocument() fails with non-ascii filenames on Windows pypdfium2-team/pypdfium2#2

Closed

khanhtd36 changed the title ~~Cannot load PDF file that contains non-ascii characters in file path~~ Cannot load PDF file that contains non-ascii characters in file path on Windows Oct 27, 2021

mara004 closed this as completed Sep 8, 2022
