
Bib Entry Parser/Predictor #83

Merged
merged 5 commits into main from bibentrypredictor on Jun 29, 2022

Conversation

@stefanc-ai2 (Contributor) commented Jun 17, 2022

https://github.com/allenai/scholar/issues/32461

Pretty standard model and interface implementation. You can see the result at http://bibentry-predictor.v0.dev.models.s2.allenai.org/ or via:

curl -X 'POST' \
  'http://bibentry-predictor.v0.dev.models.s2.allenai.org/invocations' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "instances": [
    {
      "bib_entry": "[4] Wei Zhuo, Qianyi Zhan, Yuan Liu, Zhenping Xie, and Jing Lu. Context attention heterogeneous network embed- ding. Computational Intelligence and Neuroscience , 2019. doi: 10.1155/2019/8106073."
    }
  ]
}'
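Or, equivalently, a minimal Python sketch of the same call (assuming the requests package; the endpoint and payload are copied verbatim from the curl example above):

import requests

url = "http://bibentry-predictor.v0.dev.models.s2.allenai.org/invocations"
payload = {
    "instances": [
        {
            "bib_entry": (
                "[4] Wei Zhuo, Qianyi Zhan, Yuan Liu, Zhenping Xie, and Jing Lu. "
                "Context attention heterogeneous network embed- ding. "
                "Computational Intelligence and Neuroscience , 2019. "
                "doi: 10.1155/2019/8106073."
            )
        }
    ]
}

# POST the bib entry string and print the parsed fields.
response = requests.post(url, json=payload)
response.raise_for_status()
print(response.json())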

Tests: integration test passed and dev deployment works

TODO: release as version 0.0.10 after merge

url: Optional[StringWithSpan]


class BibEntryPredictor:
stefanc-ai2 (Contributor, Author):

I've been calling it BibEntryParser, but since it's under hf_predictor I renamed it to BibEntryPredictor. Which name do you prefer, BibEntryParser or BibEntryPredictor?

Contributor:

The model's name was Bib Entry Parser, like Layout Parser, and the Predictor is part of what is built for that model, not the entire model itself. I feel like sticking with Bib Entry Parser is more consistent, though it does produce some clunky names and redundancy like 'BibEntryParserPredictor'.
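As an aside on the hunk above: a rough guess at the shape of a StringWithSpan, i.e. an extracted field value together with its character offsets in the source bib entry. This is purely a hypothetical illustration; the actual class is defined elsewhere in this PR.

from dataclasses import dataclass

@dataclass
class StringWithSpan:
    # Hypothetical shape: the extracted text plus its character
    # offsets within the original bib entry string.
    string: str
    start: int
    end: int

entry = "... doi: 10.1155/2019/8106073."
doi = "10.1155/2019/8106073"
start = entry.find(doi)
doi_field = StringWithSpan(string=doi, start=start, end=start + len(doi))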

tokenized_inputs = self.tokenizer(bib_entries, padding=True, truncation=True, return_tensors="pt")
predictions = self.model(**tokenized_inputs)

pred_ids = predictions.logits.argmax(2).tolist() # check GPU vs CPU; might need to .detach()
stefanc-ai2 (Contributor, Author):

Kyle mentioned checking GPU vs CPU, and that this might need .detach(). I checked that it works fine on both.
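For illustration, a hedged sketch of the device handling in question, assuming a standard Hugging Face token-classification setup; this is not the PR's actual code:

import torch

def predict_label_ids(model, tokenizer, bib_entries):
    # Run on whichever device the model lives on, then bring the
    # logits back to CPU before converting to Python lists.
    device = next(model.parameters()).device
    inputs = tokenizer(bib_entries, padding=True, truncation=True, return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    # .detach() drops the autograd graph (redundant under no_grad(),
    # but harmless); .cpu() is what makes .tolist() safe on GPU.
    return outputs.logits.detach().cpu().argmax(dim=2).tolist()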

)

@staticmethod
def _get_word_level_prediction(word_ids: List[Optional[int]], predictions: List[int]) -> List[int]:
stefanc-ai2 (Contributor, Author):

Token manipulation gymnastics from here down.
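For readers outside the PR, a rough sketch of what a helper with this signature typically does (my illustration, not the PR's actual implementation): Hugging Face fast tokenizers report, for each subword token, the index of the word it came from (None for special tokens), and keeping the prediction of each word's first subword yields word-level labels.

from typing import List, Optional

def word_level_prediction(word_ids: List[Optional[int]], predictions: List[int]) -> List[int]:
    # Keep one prediction per word: the label of its first subword.
    # word_ids[i] is None for special tokens ([CLS], [SEP], padding).
    word_preds: List[int] = []
    previous: Optional[int] = None
    for word_id, pred in zip(word_ids, predictions):
        if word_id is not None and word_id != previous:
            word_preds.append(pred)
        previous = word_id
    return word_preds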

python_version: 3.7

# Whether this model supports CUDA GPU acceleration
cuda: False
stefanc-ai2 (Contributor, Author):

Not sure about this one. It currently works on a CPU-only instance, but that might not be optimal.

Contributor:

If you want, you can set this to True. The resulting image will be runnable on CPU or GPU infra; it just takes longer to build and is larger than the cuda: False variant.

@Whattabatt (Contributor) left a comment:

Looks good to me

@cmwilhelm (Contributor) left a comment:

The TIMO-ification aspects here look good to me, thanks for working through that!

The part that I'd like to get more clarity on before we merge is what we talked about in sprint planning this morning: whether we're going to use the mmda data model.

As far as I'm aware, all the existing mmda predictors use a shared data model to represent documents and their components. I don't fully understand what the ramifications of NOT using it for the bib entry predictor are, other than that it complicates integration in SPP.

@kyleclo @lolipopshock what is your guidance on this front? Should all MMDA models be using the MMDA data model? And if so, how should it be applied to bib entry?


setup.py Outdated

@@ -2,7 +2,7 @@
 setuptools.setup(
     name="mmda",
-    version="0.0.8",
+    version="0.0.9rc1",
Contributor:

When this is ready to merge, you should probably change this to 0.0.9.

@kyleclo (Collaborator) commented Jun 22, 2022

@cmwilhelm Not using the MMDA model likely results in a few things:

  1. MMDA is undergoing some refactoring, and it's easier to refactor if all the Predictors adhere to the same input/output than if each Predictor were doing something on its own.
  2. MMDA has a standard way of performing serialization/instantiation to/from JSON. Deviating from the underlying data model likely means SPP (or whatever downstream user of MMDA) will need to implement/maintain their own serialization (see the sketch after this list).
  3. MMDA abstracts away some annoying boilerplate one would otherwise have to write to obtain bounding boxes for Predictor outputs. For example, if we ever want to highlight BibEntry titles/authors/DOIs for whatever reason and need the exact bounding boxes of those predictions, obtaining them from the other MMDA objects is probably harder than if the predictor had adhered to the MMDA conventions from the start.
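A minimal sketch of what point 2 means in practice (the record type and helpers below are hypothetical illustrations, not mmda's or this PR's actual API): without the shared data model, each downstream consumer has to define and maintain its own serialization format.

from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class BibEntryStructure:
    # Hypothetical output record for the bib entry predictor.
    title: Optional[str] = None
    doi: Optional[str] = None
    url: Optional[str] = None

    def to_json(self) -> dict:
        return asdict(self)

    @classmethod
    def from_json(cls, d: dict) -> "BibEntryStructure":
        return cls(**d)

# SPP (or any other consumer) would have to agree on and maintain this
# round-trip itself, since it falls outside mmda's shared conventions.
record = BibEntryStructure(doi="10.1155/2019/8106073")
assert BibEntryStructure.from_json(record.to_json()) == record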

For Stefan's specific instance, none of these are particularly severe/important; just a bit of hassle later for #1 and #2. #3 seems unlikely (there are no plans for bounding boxes on the fields that Stefan is extracting), and even if it comes to it later, it's not really hard to adapt.

In fact, #3 is an interesting one because Stefan's model doesn't have to be part of MMDA. The value of MMDA is mostly to handle a lot of annoying code one would otherwise write to handle text + visual annotations on a PDF simultaneously. Citation & BibEntry highlights, for example, really benefit from this library. But models like TLDRs or SPECTER, while they could conform to MMDA conventions, don't have to, as they don't benefit from anything MMDA provides. One can view Stefan's model, which produces a JSON of metadata fields given a BibEntry string, as falling within the same category of model as TLDRs (give me some text, I'll spit out some other text), as opposed to the category of model for VILA (give me a PDF, I'll spit out the locations of the highlighted units).

In all, I think it's not a big deal if Stefan's model doesn't adhere to MMDA conventions. In fact, it's probably even fine for now if Stefan's model lives outside of MMDA on its own, similar to TLDRs or SPECTER, until there's a real reason for it to be in MMDA. But it's also not a big deal if it does live in MMDA for now; I'll maybe get around to refactoring it to conform to MMDA conventions, but it's lower priority than ensuring Yogi/Angeles components conform.

Thoughts?

@stefanc-ai2 (Contributor, Author) commented Jun 22, 2022

Good and interesting point about the bib entry predictor not being part of MMDA and instead being more similar to TLDR. I actually like that better, because it may be used by other services, such as citation matching in Semantic Reader's LaTeX pipeline.

@cmwilhelm (Contributor):

@kyleclo @stefanc-ai2

I'm catching up on the comments above.

The only question I have re: to-mmda-or-not-to-mmda is where the input for Stefan's model is coming from. It's coming from @geli-gel's model, right?

So if @geli-gel's model needs to produce MMDA output, we need something that can identify and pull out the relevant strings to run through @stefanc-ai2's model. I see three possibilities:

  1. Angele's model just outputs strings: no bboxes or other mmda stuff. This would be the simplest to pipe directly into Stefan's model, but would lose all of the mmda features of these annotations.
  2. Angele's model outputs MMDA data. SPP's Scala code somehow reads that and identifies the strings that need structure inferencing. No further work is needed in Stefan's model, but now the logic for how to operate over MMDA data has to be reimplemented in Scala. Potentially scary, especially if you're refactoring how MMDA works.
  3. Angele's model outputs MMDA data, and Stefan's model takes in that same mmda data model. More work for Stefan, but it has all the utilities necessary in mmda.git to do the work.

@stefanc-ai2 (Contributor, Author):

Merging this PR. Tracking the conversion to taking Document as input and outputting Annotation in this ticket: https://github.com/allenai/scholar/issues/32722

@stefanc-ai2 stefanc-ai2 merged commit 7df6968 into main Jun 29, 2022
@stefanc-ai2 stefanc-ai2 deleted the bibentrypredictor branch June 29, 2022 21:56
soldni added a commit that referenced this pull request Jul 5, 2022
* Speed up vila pre-processing (#84)

* bump version and fix vila test fixture inclusion (#85)

Co-authored-by: Chris Wilhelm <chris@allenai.org>

* Bib Entry Parser/Predictor (#83)

* Passed tt verify

* fix variant name

* up version num

* rename enum

* Pins vila to 0.3.0. (#86)

Unbounded upper version was pulling in breaking
changes transitively in builds.

Co-authored-by: Chris Wilhelm <chris@allenai.org>

* update demo with the rasterizer (#71)

* Add to model device in bib entry predictor (#87)

* bibentrypredictor model device

* up setup.py

Co-authored-by: Rodney Kinney <rodneyk@allenai.org>
Co-authored-by: Chris Wilhelm <chris.wilhelm@gmail.com>
Co-authored-by: Chris Wilhelm <chris@allenai.org>
Co-authored-by: Stefan Candra <52424731+stefanc-ai2@users.noreply.github.com>