Duplicate strings assigned to the same cell #15

Closed
vinayak-mehta opened this issue Jul 4, 2019 · 12 comments · Fixed by #206
Labels: bug (Something isn't working)

@vinayak-mehta
Member

Check out this birdisland.pdf output here.

@davidkong0987
Contributor

I believe this occurs when bold text is produced by drawing the same characters several times instead of widening the glyphs. I've noticed it often creates four copies of each character, although in your example it is two. I think this happens at the PDF level, because the duplicated bold characters don't differ from each other in font or any other characteristic.
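A rough sketch of how one might confirm this at the character level, by counting how often the same character is drawn at nearly the same position. `Char` and `count_overprints` are hypothetical stand-ins invented for illustration, not pdfminer or Camelot types, and rounding to whole points is an assumed tolerance that can miss pairs straddling a rounding boundary.

```python
from collections import Counter
from typing import NamedTuple


class Char(NamedTuple):
    # Hypothetical stand-in for a PDF character: its text and lower-left corner.
    text: str
    x0: float
    y0: float


def count_overprints(chars, ndigits=0):
    """Count how many times each (text, rounded position) pair is drawn."""
    return Counter(
        (c.text, round(c.x0, ndigits), round(c.y0, ndigits)) for c in chars
    )


# "Hi" drawn twice with a tiny horizontal offset, which renders as bold.
chars = [
    Char("H", 10.0, 50.0), Char("i", 18.0, 50.0),
    Char("H", 10.2, 50.0), Char("i", 18.2, 50.0),
]
counts = count_overprints(chars)
duplicated = [key for key, n in counts.items() if n > 1]
```

If `duplicated` is non-empty for text that looks bold on the page, that is consistent with the overprinting explanation.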

@davidkong0987
Contributor

In addition, this is made worse by the fact that in some duplicates the LTHorizontal object splits the line into two, while in others it is not split.

@TheNetJedi

Yep, facing the same issue.
And yes, this only occurs with bold characters AFAIK.
Any workaround for this apart from fixing the PDFs?

@davidkong0987
Contributor

There's a relatively easy fix that probably works most of the time (haven't seen a counter example but assume there might be some) by simply eliminating any fully overlapping LTHorizontals.

@TheNetJedi

@davidkong0987

Can you please guide me on how I would do that?
I'm a noob.

@davidkong0987
Contributor

You need to change the source code, so this isn't a great task if you're not comfortable with programming.

Wherever you see horizontals = get_text_objects(ltype=LThorizontal), you can add the following code right after it to delete the duplicate horizontals.

        from pdfminer.layout import LTAnno

        def text_bbox(line):
            # Bounding box of the non-blank characters in a text line.
            chars = [t for t in line
                     if not isinstance(t, LTAnno) and t.get_text().strip()]
            if not chars:
                return None  # blank line: nothing to compare
            return (min(t.x0 for t in chars), min(t.y0 for t in chars),
                    max(t.x1 for t in chars), max(t.y1 for t in chars))

        deletes = []
        for i in horizontals:
            outer = text_bbox(i)
            if i in deletes or outer is None:
                continue
            for obj in horizontals:
                inner = text_bbox(obj)
                if obj is i or obj in deletes or inner is None:
                    continue
                # obj lies inside i, with a 1-point tolerance on every side
                if (inner[0] > outer[0] - 1 and inner[1] > outer[1] - 1
                        and inner[2] < outer[2] + 1 and inner[3] < outer[3] + 1):
                    deletes.append(obj)
                    # mark the surviving line and its characters as bold
                    i.customBold = True
                    for char in i:
                        char.customBold = True
        horizontals = [obj for obj in horizontals if obj not in deletes]

If anyone notices cases that this does not cover, please let me know.

@TheNetJedi

@davidkong0987

Thanks, I'll try this out and get back to you!

@davidkong0987
Contributor

Sometimes text is stacked on top of each other intentionally; this doesn't adjust for that.
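One possible way to reduce that risk, sketched here purely as an assumption and not as what the library does: only drop a contained line when its text also appears in the line that contains it. `Line`, `contains`, and `drop_duplicates` are hypothetical names for illustration; real pdfminer objects would need adapting.

```python
from typing import NamedTuple


class Line(NamedTuple):
    # Hypothetical stand-in for a text line: its text and bounding box.
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def contains(outer, inner, tol=1.0):
    """True if inner's box lies inside outer's box, within tol points."""
    return (inner.x0 > outer.x0 - tol and inner.y0 > outer.y0 - tol
            and inner.x1 < outer.x1 + tol and inner.y1 < outer.y1 + tol)


def drop_duplicates(lines, tol=1.0):
    """Drop a line only if a kept line contains its box AND repeats its text."""
    # Visit larger boxes first so a split fragment is compared to the full line.
    order = sorted(
        range(len(lines)),
        key=lambda i: (lines[i].x1 - lines[i].x0) * (lines[i].y1 - lines[i].y0),
        reverse=True,
    )
    kept = []
    for i in order:
        text = lines[i].text.strip()
        if text and any(contains(lines[k], lines[i], tol) and text in lines[k].text
                        for k in kept):
            continue  # same text inside a kept line: treat as a bold overprint
        kept.append(i)
    return [lines[i] for i in sorted(kept)]  # restore the original order


lines = [
    Line("Total", 10.0, 50.0, 40.0, 60.0),
    Line("Total", 10.3, 50.0, 40.3, 60.0),  # bold overprint of the first line
    Line("Tot", 10.1, 50.0, 28.0, 60.0),    # split fragment of a duplicate
    Line("x", 20.0, 52.0, 25.0, 58.0),      # different text stacked on top
]
result = drop_duplicates(lines)
```

Here the overprint and the fragment are removed while the stacked "x" survives. Identical text that is stacked on purpose would still be dropped, so this narrows the problem rather than solving it.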

@vinayak-mehta
Member Author

> There's a relatively easy fix that probably works most of the time (haven't seen a counter example but assume there might be some) by simply eliminating any fully overlapping LTHorizontals.

Yes! Let me see if I can get this into the library. Would you like to raise a PR with a corresponding test with the example PDF?

> Sometimes text is stacked on top of each other intentionally; this doesn't adjust for that.

Yes.

@edugonza

Hi guys, I've sent a PR with a working solution to this issue. It adds a unit test that uses the PDF file mentioned in the first comment.

vinayak-mehta added a commit that referenced this issue Oct 28, 2020
[MRG] Fix #15 extraction of cell data discarding overlapping text boxes
@vinayak-mehta
Member Author

vinayak-mehta commented Oct 28, 2020

@edugonza Thank you for fixing this! The PR looked good! Thank you for adding a test too 👍

I'll start working on a release soon.

@rain01

rain01 commented Feb 4, 2021

Can't wait. Any idea when it will be released?

tomprogrammer pushed a commit to tomprogrammer/camelot that referenced this issue May 10, 2023
…ctions/actions/setup-python-4

Bump actions/setup-python from 3 to 4