Grobid Full Text Parser / Existing Document Augmenter #204

geli-gel · 2023-03-08T18:43:20Z

full text grobid parser, works like this (examples/grobid_full_text_parser/grobid_bibs.ipynb):

parser.parse takes in an optional output directory to save the raw grobid xml.
the parser takes in a config set with the basic grobid_client setup + an additionally requested coordinates piece "head" which contains the authors section. (had to look in chrome dev tools to find that out because I could only get the authors in the gui but not from the client, and couldn't find the info anywhere, also could not read the grobid js code!)

this is the first part of https://github.com/allenai/scholar/issues/35751 -- second part is going to be taking these raw grobid bibs and updating the spangroups so that the SpanGroup spans include tokens that indicate a bib's number, as well as use the SpanGroup span boxes which essentially consolidate the grobid boxes which are drawn on a per-line basis (whereas the spangroup span boxes are drawn on a per text block basis, more similar to the bib detector we are using grobid in place of)

Next steps for this grobid parser are to make it output more pieces of the grobid output.

regan-huff · 2023-03-08T19:29:03Z

src/mmda/parsers/grobid_full_text_parser.py

+@geli-gel
+
+"""
+from grobid_client.grobid_client import GrobidClient


does this need to be added as a dependency somewhere for this parser to work?

oh that's right! let me fix that

regan-huff · 2023-03-08T20:09:50Z

src/mmda/parsers/grobid_full_text_parser.py

+
+        return doc
+
+    def _xml_coords_to_boxes(self, coords_attribute: str, page_sizes: dict):


is there a way to test that this works as expected? Like, that this conversion from Grobid's output gives us something close to what we get via the mmda bib detector, which (I believe) follows a different path to boxes?

you mention this as work still to come:

use the SpanGroup span boxes which essentially consolidate the grobid boxes which are drawn on a per-line basis (whereas the spangroup span boxes are drawn on a per text block basis, more similar to the bib detector we are using grobid in place of

I'm not understanding yet how this is something that still needs to be done...can you explain, or annotate the code somehow so it is clearer what step might still be missing? or is there a way a test could clarify what this means?

I meant that as a note in regard to the overarching ticket, that this PR represents basically the first half of the work -- getting the raw grobid bibs into MMDA format - so our work in this repo is essentially finished for the bib detection replacement work. In the spp-grobid-api is where I was imagining the SpanGroup span altering code would live.

is there a way to test that this works as expected? Like, that this conversion from Grobid's output gives us something close to what we get via the mmda bib detector, which (I believe) follows a different path to boxes?

Once we have this code plus the other stuff in spp-grobid would be the better time to test, but as a super simple test I included a small sample in the python notebook showing the text extracted from the grobid boxes:

kyleclo · 2023-03-09T01:36:05Z

.github/workflows/mmda-ci.yml

@@ -23,7 +23,7 @@ jobs:

      - name: Test with Python ${{ matrix.python-version }}
        run: |
-          pip install -e .[dev,pysbd_predictors,hf_predictors,lp_predictors]
+          pip install -e .[dev,pysbd_predictors,hf_predictors,lp_predictors,grobid_full_text_parser]


does this not require the grobid binary to be downloaded to perform this test? (kinda like the run apt-get install poppler-utils stuff that we need under test_vila_predictors separate github workflow

ah nvm i see, we dont actually test running grobid itself in the repo, just parsing the xml

the test mocks hitting grobid by just returning the xml that grobid would return so that it doesn't need grobid at all (I copied the method used for the test for the pre-existing grobid parser)

kyleclo · 2023-03-09T01:37:37Z

tests/test_parsers/test_grobid_full_text_parser.py

+class TestGrobidFullTextParser(unittest.TestCase):
+    @um.patch("requests.request", side_effect=mock_request)
+    def test_processes_full_text(self, mock_request):
+        logging.getLogger("pdfminer").setLevel(logging.WARNING)


wats goin on here?

I'm not sure why, but when I ran pytest on this file, a bunch of pdfminer DEBUG logs were popping up, so google told me to do this to turn it down a notch.

ah gotcha! i think i kno wats happening here, thx for the solution, might actually add this closer to where pdfminer appears, not just as a test. not relevant to this PR, future thing

kyleclo · 2023-03-09T01:45:38Z

src/mmda/parsers/grobid_full_text_parser.py

+NS = {"tei": "http://www.tei-c.org/ns/1.0"}
+
+
+class GrobidFullTextParser(Parser):


can we rename this to make it clear in the name that this is an unusual Parser that takes as input also a separate Document? like GrobidAugmentExistingDocumentParser? then it wouldnt clash if we add a GrobidFullTextParser later which strictly creates new documents from the Xml itself without any mapping

kyleclo

lgtm; minor note about naming

full text grobid parser gets bibs in example ipynb

ef7940e

geli-gel requested review from kyleclo and regan-huff March 8, 2023 18:43

include paragraph coords in XML output

a68eaf3

regan-huff reviewed Mar 8, 2023

View reviewed changes

add in the grobid_client dependency

807ac6b

regan-huff approved these changes Mar 8, 2023

View reviewed changes

geli-gel added 5 commits March 8, 2023 15:36

different method to get xml from string

daa6645

added a legit test file

2abfc43

oopsies make the test pass when server's not actually running

00819ea

ok rlly fix the test

c670ff9

erase the horrible horrible space

3100d74

kyleclo reviewed Mar 9, 2023

View reviewed changes

kyleclo approved these changes Mar 9, 2023

View reviewed changes

rename the parser

d4c8ddc

geli-gel changed the title ~~Grobid Full Text Parser~~ Grobid Full Text Parser / Existing Document Augmenter Mar 9, 2023

update example nb

de60b9f

geli-gel merged commit 62221cf into main Mar 9, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Grobid Full Text Parser / Existing Document Augmenter #204

Grobid Full Text Parser / Existing Document Augmenter #204

geli-gel commented Mar 8, 2023 •

edited

Loading

regan-huff Mar 8, 2023

geli-gel Mar 8, 2023

regan-huff Mar 8, 2023

geli-gel Mar 8, 2023

geli-gel Mar 8, 2023

kyleclo Mar 9, 2023

kyleclo Mar 9, 2023 •

edited

Loading

geli-gel Mar 9, 2023

kyleclo Mar 9, 2023

geli-gel Mar 9, 2023

kyleclo Mar 9, 2023 •

edited

Loading

kyleclo Mar 9, 2023

kyleclo left a comment


		return doc

		def _xml_coords_to_boxes(self, coords_attribute: str, page_sizes: dict):

		NS = {"tei": "http://www.tei-c.org/ns/1.0"}


		class GrobidFullTextParser(Parser):

Grobid Full Text Parser / Existing Document Augmenter #204

Grobid Full Text Parser / Existing Document Augmenter #204

Conversation

geli-gel commented Mar 8, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kyleclo Mar 9, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kyleclo Mar 9, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kyleclo left a comment

Choose a reason for hiding this comment

geli-gel commented Mar 8, 2023 •

edited

Loading

kyleclo Mar 9, 2023 •

edited

Loading

kyleclo Mar 9, 2023 •

edited

Loading