Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Grobid Full Text Parser / Existing Document Augmenter #204

Merged
merged 10 commits into from
Mar 9, 2023

Conversation

geli-gel
Copy link
Contributor

@geli-gel geli-gel commented Mar 8, 2023

full text grobid parser, works like this (examples/grobid_full_text_parser/grobid_bibs.ipynb):

parser.parse takes in an optional output directory to save the raw grobid xml.
the parser takes in a config set with the basic grobid_client setup + an additionally requested coordinates piece "head" which contains the authors section. (had to look in chrome dev tools to find that out because I could only get the authors in the gui but not from the client, and couldn't find the info anywhere, also could not read the grobid js code!)

this is the first part of https://github.com/allenai/scholar/issues/35751 -- second part is going to be taking these raw grobid bibs and updating the spangroups so that the SpanGroup spans include tokens that indicate a bib's number, as well as use the SpanGroup span boxes which essentially consolidate the grobid boxes which are drawn on a per-line basis (whereas the spangroup span boxes are drawn on a per text block basis, more similar to the bib detector we are using grobid in place of)

Next steps for this grobid parser are to make it output more pieces of the grobid output.

image

@geli-gel

"""
from grobid_client.grobid_client import GrobidClient
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this need to be added as a dependency somewhere for this parser to work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh that's right! let me fix that


return doc

def _xml_coords_to_boxes(self, coords_attribute: str, page_sizes: dict):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a way to test that this works as expected? Like, that this conversion from Grobid's output gives us something close to what we get via the mmda bib detector, which (I believe) follows a different path to boxes?

you mention this as work still to come:

use the SpanGroup span boxes which essentially consolidate the grobid boxes which are drawn on a per-line basis (whereas the spangroup span boxes are drawn on a per text block basis, more similar to the bib detector we are using grobid in place of

I'm not understanding yet how this is something that still needs to be done...can you explain, or annotate the code somehow so it is clearer what step might still be missing? or is there a way a test could clarify what this means?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant that as a note in regard to the overarching ticket, that this PR represents basically the first half of the work -- getting the raw grobid bibs into MMDA format - so our work in this repo is essentially finished for the bib detection replacement work. In the spp-grobid-api is where I was imagining the SpanGroup span altering code would live.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a way to test that this works as expected? Like, that this conversion from Grobid's output gives us something close to what we get via the mmda bib detector, which (I believe) follows a different path to boxes?

Once we have this code plus the other stuff in spp-grobid would be the better time to test, but as a super simple test I included a small sample in the python notebook showing the text extracted from the grobid boxes:
image

@@ -23,7 +23,7 @@ jobs:

- name: Test with Python ${{ matrix.python-version }}
run: |
pip install -e .[dev,pysbd_predictors,hf_predictors,lp_predictors]
pip install -e .[dev,pysbd_predictors,hf_predictors,lp_predictors,grobid_full_text_parser]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this not require the grobid binary to be downloaded to perform this test? (kinda like the run apt-get install poppler-utils stuff that we need under test_vila_predictors separate github workflow

Copy link
Collaborator

@kyleclo kyleclo Mar 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah nvm i see, we dont actually test running grobid itself in the repo, just parsing the xml

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the test mocks hitting grobid by just returning the xml that grobid would return so that it doesn't need grobid at all (I copied the method used for the test for the pre-existing grobid parser)

class TestGrobidFullTextParser(unittest.TestCase):
@um.patch("requests.request", side_effect=mock_request)
def test_processes_full_text(self, mock_request):
logging.getLogger("pdfminer").setLevel(logging.WARNING)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wats goin on here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure why, but when I ran pytest on this file, a bunch of pdfminer DEBUG logs were popping up, so google told me to do this to turn it down a notch.

Copy link
Collaborator

@kyleclo kyleclo Mar 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah gotcha! i think i kno wats happening here, thx for the solution, might actually add this closer to where pdfminer appears, not just as a test. not relevant to this PR, future thing

NS = {"tei": "http://www.tei-c.org/ns/1.0"}


class GrobidFullTextParser(Parser):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we rename this to make it clear in the name that this is an unusual Parser that takes as input also a separate Document? like GrobidAugmentExistingDocumentParser? then it wouldnt clash if we add a GrobidFullTextParser later which strictly creates new documents from the Xml itself without any mapping

Copy link
Collaborator

@kyleclo kyleclo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm; minor note about naming

@geli-gel geli-gel changed the title Grobid Full Text Parser Grobid Full Text Parser / Existing Document Augmenter Mar 9, 2023
@geli-gel geli-gel merged commit 62221cf into main Mar 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants