This repository has been archived by the owner on Apr 15, 2024. It is now read-only.

Can't extract text objects #319

Open
tuyenta opened this issue Aug 9, 2022 · 0 comments

tuyenta commented Aug 9, 2022

Hi,

When using pdfminer.six to extract text elements from a PDF file, I found that extraction fails in some cases.

PDF files:
2022 Mar quarterly report_ Ali.pdf
SIA_AR_2021.pdf

Description:

  • File 1: no text can be extracted. However, extraction works after the original PDF is converted to a printed PDF.
  • File 2: only part of the text can be extracted.

Code used:

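  from pdfminer.converter import PDFPageAggregator
  from pdfminer.layout import (
      LAParams,
      LTChar,
      LTImage,
      LTTextLineHorizontal,
      LTTextLineVertical,
  )
  from pdfminer.pdfdocument import PDFDocument, PDFTextExtractionNotAllowed
  from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
  from pdfminer.pdfpage import PDFPage
  from pdfminer.pdfparser import PDFParser


  def get_page_layout(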
      filename,
      line_overlap=0.5,
      char_margin=1.0,
      line_margin=0.5,
      word_margin=0.1,
      boxes_flow=0.5,
      detect_vertical=True,
      all_texts=True,
  ):
      """Returns a PDFMiner LTPage object and page dimension of a single
      page pdf. To get the definitions of kwargs, see
      https://pdfminersix.rtfd.io/en/latest/reference/composable.html.
      Parameters
      ----------
      filename : string
          Path to pdf file.
      line_overlap : float
      char_margin : float
      line_margin : float
      word_margin : float
      boxes_flow : float
      detect_vertical : bool
      all_texts : bool
      Returns
      -------
      layout : object
          PDFMiner LTPage object.
      dim : tuple
          Dimension of pdf page in the form (width, height).
      """
      with open(filename, "rb") as f:
          parser = PDFParser(f)
          document = PDFDocument(parser)
          if not document.is_extractable:
              raise PDFTextExtractionNotAllowed(
                  f"Text extraction is not allowed: {filename}"
              )
          laparams = LAParams(
              line_overlap=line_overlap,
              char_margin=char_margin,
              line_margin=line_margin,
              word_margin=word_margin,
              boxes_flow=boxes_flow,
              detect_vertical=detect_vertical,
              all_texts=all_texts,
          )
          rsrcmgr = PDFResourceManager()
          device = PDFPageAggregator(rsrcmgr, laparams=laparams)
          interpreter = PDFPageInterpreter(rsrcmgr, device)
          for page_num, page in enumerate(PDFPage.create_pages(document)):
              interpreter.process_page(page)
              layout = device.get_result()
              width = layout.bbox[2]
              height = layout.bbox[3]
              dim = (width, height)
          # Only the layout of the last processed page is returned.
          return layout, dim
  
  
  def get_text_objects(layout, ltype="char", t=None):
      """Recursively parses pdf layout to get a list of
      PDFMiner text objects.
      Parameters
      ----------
      layout : object
          PDFMiner LTPage object.
      ltype : string
          Specify 'char', 'image', 'horizontal_text', or 'vertical_text' to get
          LTChar, LTImage, LTTextLineHorizontal, and LTTextLineVertical objects
          respectively.
      t : list
      Returns
      -------
      t : list
          List of PDFMiner text objects.
      """
      if ltype == "char":
          LTObject = LTChar
      elif ltype == "image":
          LTObject = LTImage
      elif ltype == "horizontal_text":
          LTObject = LTTextLineHorizontal
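      elif ltype == "vertical_text":
          LTObject = LTTextLineVertical
      else:
          # Fail early instead of hitting an unbound LTObject below.
          raise ValueError(f"Unknown ltype: {ltype!r}")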
      if t is None:
          t = []
      try:
          for obj in layout._objs:
              if isinstance(obj, LTObject):
                  t.append(obj)
              else:
                  t += get_text_objects(obj, ltype=ltype)
      except AttributeError:
          pass
      return t
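
For anyone trying to reproduce this, the recursive walk in get_text_objects can be sketched without pdfminer.six at all. The version below is only an illustration, not the library's API: iter_objects, Char, and Container are hypothetical names, with Char and Container standing in for LTChar and for layout containers such as LTPage. It assumes containers are iterable, so it avoids the private _objs attribute and the blanket AttributeError handler. Unlike the original, it also checks the root node itself.

```python
from typing import Iterator, Type, TypeVar

T = TypeVar("T")


def iter_objects(node: object, wanted: Type[T]) -> Iterator[T]:
    """Yield every object of type `wanted` found in `node` or its descendants."""
    if isinstance(node, wanted):
        yield node
        return  # as in the original, do not descend into a match
    if isinstance(node, (str, bytes)):
        return  # strings are iterable but never hold layout objects
    try:
        children = iter(node)
    except TypeError:
        return  # leaf object that is not a container
    for child in children:
        yield from iter_objects(child, wanted)


class Char:
    """Hypothetical stand-in for LTChar."""

    def __init__(self, text: str) -> None:
        self.text = text


class Container(list):
    """Hypothetical stand-in for an iterable layout container."""


# A nested container, a direct character, and an unrelated leaf.
page = Container([Container([Char("a"), Char("b")]), Char("c"), 42])
chars = list(iter_objects(page, Char))
print("".join(c.text for c in chars))
```

Running the same walk on each page separately and counting the objects found per type would show whether a file yields no LTChar objects at all or only loses them on some pages.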