-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Grobid Full Text Parser / Existing Document Augmenter #204
Conversation
@geli-gel | ||
|
||
""" | ||
from grobid_client.grobid_client import GrobidClient |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this need to be added as a dependency somewhere for this parser to work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh that's right! let me fix that
|
||
return doc | ||
|
||
def _xml_coords_to_boxes(self, coords_attribute: str, page_sizes: dict): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a way to test that this works as expected? Like, that this conversion from Grobid's output gives us something close to what we get via the mmda bib detector, which (I believe) follows a different path to boxes?
you mention this as work still to come:
use the SpanGroup span boxes which essentially consolidate the grobid boxes which are drawn on a per-line basis (whereas the spangroup span boxes are drawn on a per text block basis, more similar to the bib detector we are using grobid in place of
I'm not understanding yet how this is something that still needs to be done...can you explain, or annotate the code somehow so it is clearer what step might still be missing? or is there a way a test could clarify what this means?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant that as a note in regard to the overarching ticket, that this PR represents basically the first half of the work -- getting the raw grobid bibs into MMDA format - so our work in this repo is essentially finished for the bib detection replacement work. In the spp-grobid-api is where I was imagining the SpanGroup span altering code would live.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a way to test that this works as expected? Like, that this conversion from Grobid's output gives us something close to what we get via the mmda bib detector, which (I believe) follows a different path to boxes?
Once we have this code plus the other stuff in spp-grobid would be the better time to test, but as a super simple test I included a small sample in the python notebook showing the text extracted from the grobid boxes:
.github/workflows/mmda-ci.yml
Outdated
@@ -23,7 +23,7 @@ jobs: | |||
|
|||
- name: Test with Python ${{ matrix.python-version }} | |||
run: | | |||
pip install -e .[dev,pysbd_predictors,hf_predictors,lp_predictors] | |||
pip install -e .[dev,pysbd_predictors,hf_predictors,lp_predictors,grobid_full_text_parser] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this not require the grobid binary to be downloaded to perform this test? (kinda like the run apt-get install poppler-utils
stuff that we need under test_vila_predictors
separate github workflow
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah nvm i see, we dont actually test running grobid itself in the repo, just parsing the xml
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the test mocks hitting grobid by just returning the xml that grobid would return so that it doesn't need grobid at all (I copied the method used for the test for the pre-existing grobid parser)
class TestGrobidFullTextParser(unittest.TestCase): | ||
@um.patch("requests.request", side_effect=mock_request) | ||
def test_processes_full_text(self, mock_request): | ||
logging.getLogger("pdfminer").setLevel(logging.WARNING) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wats goin on here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure why, but when I ran pytest on this file, a bunch of pdfminer DEBUG logs were popping up, so google told me to do this to turn it down a notch.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah gotcha! i think i kno wats happening here, thx for the solution, might actually add this closer to where pdfminer appears, not just as a test. not relevant to this PR, future thing
NS = {"tei": "http://www.tei-c.org/ns/1.0"} | ||
|
||
|
||
class GrobidFullTextParser(Parser): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we rename this to make it clear in the name that this is an unusual Parser that takes as input also a separate Document
? like GrobidAugmentExistingDocumentParser
? then it wouldnt clash if we add a GrobidFullTextParser
later which strictly creates new documents from the Xml itself without any mapping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm; minor note about naming
full text grobid parser, works like this (examples/grobid_full_text_parser/grobid_bibs.ipynb):
parser.parse takes in an optional output directory to save the raw grobid xml.
the parser takes in a config set with the basic grobid_client setup + an additionally requested coordinates piece "head" which contains the authors section. (had to look in chrome dev tools to find that out because I could only get the authors in the gui but not from the client, and couldn't find the info anywhere, also could not read the grobid js code!)
this is the first part of https://github.com/allenai/scholar/issues/35751 -- second part is going to be taking these raw grobid bibs and updating the spangroups so that the SpanGroup spans include tokens that indicate a bib's number, as well as use the SpanGroup span boxes which essentially consolidate the grobid boxes which are drawn on a per-line basis (whereas the spangroup span boxes are drawn on a per text block basis, more similar to the bib detector we are using grobid in place of)
Next steps for this grobid parser are to make it output more pieces of the grobid output.