Getting word coordinates in each cell #45

Closed
satheeshkatipomu opened this issue Aug 1, 2019 · 4 comments

Comments

@satheeshkatipomu

I am able to get bounding boxes for each cell using table.cells. Any pointers on getting bounding boxes for each word in each cell?

@akshowhini

@sathk882 Could you please describe what you tried? If there was an error, please include the steps to reproduce the problem.

@satheeshkatipomu
Author

satheeshkatipomu commented Aug 1, 2019

@udayrddy, I am able to get bounding boxes for each cell using table.cells, but I do not know whether it is possible to get individual word coordinates within a particular cell. I tried table._text, but it gives the bounding box of a whole text segment, not of each word.

E.g.: [Screenshot 2019-08-01 at 7 10 53 PM]

I am able to get the coordinates of "Effective Date" in the above image using table._text, but I want the coordinates of the words "Effective" and "Date" separately, and I currently have no idea how to do that.

@akshowhini

@sathk882
Listed below are the attributes of the result object, which is limited to its intended purpose:
['cols', 'rows', 'cells', 'df', 'shape', 'accuracy', 'whitespace', 'order', 'page', 'flavor', '_text', '_image', '_segments', '_textedges', '_bbox']

As a workaround, you can map tables[-1].df to tables[-1].cells to get each cell's location, as below:

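# Pair each cell's text (from the DataFrame) with the matching cell object, row by row.
for row_texts, row_cells in zip(tables[-1].df.values.tolist(), tables[-1].cells):
    for text, cell in zip(row_texts, row_cells):
        print(text, cell)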

As for your request, I feel it is completely out of scope: the split of a cell's value into individual words cannot be obtained from the result. You might want to use PyPDF2 and adapt it to your needs.
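One possible way to approach the word-level split outside of the table result is sketched below. It is only an illustration under stated assumptions: it presumes you already have per-character bounding boxes as (text, x0, y0, x1, y1) tuples (for example, taken from a PDF layout library), and the helper names group_chars_into_words and words_in_cell are hypothetical, not part of camelot or PyPDF2. Whitespace characters are treated as word separators and their coordinates are ignored.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]        # (x0, y0, x1, y1)
Char = Tuple[str, float, float, float, float]  # (text, x0, y0, x1, y1)


def _merge(chars: List[Char]) -> Tuple[str, Box]:
    """Join characters into one word and take the union of their boxes."""
    text = "".join(c[0] for c in chars)
    box = (
        min(c[1] for c in chars),
        min(c[2] for c in chars),
        max(c[3] for c in chars),
        max(c[4] for c in chars),
    )
    return text, box


def group_chars_into_words(chars: List[Char]) -> List[Tuple[str, Box]]:
    """Merge consecutive non-whitespace characters into (word, bbox) pairs."""
    words: List[Tuple[str, Box]] = []
    current: List[Char] = []
    for ch in chars:
        if ch[0].isspace():
            # A whitespace character closes the word collected so far.
            if current:
                words.append(_merge(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append(_merge(current))
    return words


def words_in_cell(words: List[Tuple[str, Box]], cell_box: Box) -> List[Tuple[str, Box]]:
    """Keep the words whose centre point lies inside the cell's bounding box."""
    cx0, cy0, cx1, cy1 = cell_box
    kept = []
    for text, (x0, y0, x1, y1) in words:
        mid_x, mid_y = (x0 + x1) / 2, (y0 + y1) / 2
        if cx0 <= mid_x <= cx1 and cy0 <= mid_y <= cy1:
            kept.append((text, (x0, y0, x1, y1)))
    return kept
```

With character boxes for a segment such as "Effective Date", group_chars_into_words would return one entry per word, and words_in_cell could then be called once per cell with that cell's coordinates. The centre-point test is a design choice that tolerates words slightly overlapping a cell border; a stricter containment test may suit other layouts better.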

Also, this request should have been asked as a Stack Overflow question rather than raised as an issue. So please help the admins by closing the issue.

@satheeshkatipomu
Author

satheeshkatipomu commented Aug 2, 2019

@udayrddy I know it is not an issue; I was just asking for pointers on how to do this. I do not think asking on Stack Overflow will help, because I want to get word coordinates specifically in camelot. Anyway, closing the issue.
