
Bib Entry Parser/Predictor #83

Merged
merged 5 commits into main from bibentrypredictor on Jun 29, 2022

Conversation

@stefanc-ai2 (Contributor) commented Jun 17, 2022

https://github.com/allenai/scholar/issues/32461

Pretty standard model and interface implementation. You can see the result at http://bibentry-predictor.v0.dev.models.s2.allenai.org/ or via:

curl -X 'POST' \
  'http://bibentry-predictor.v0.dev.models.s2.allenai.org/invocations' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "instances": [
    {
      "bib_entry": "[4] Wei Zhuo, Qianyi Zhan, Yuan Liu, Zhenping Xie, and Jing Lu. Context attention heterogeneous network embed- ding. Computational Intelligence and Neuroscience , 2019. doi: 10.1155/2019/8106073."
    }
  ]
}'
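Or, equivalently, a minimal Python sketch of the same call (assuming the requests package; the endpoint and payload are copied verbatim from the curl example above):

import requests

url = "http://bibentry-predictor.v0.dev.models.s2.allenai.org/invocations"
payload = {
    "instances": [
        {
            "bib_entry": (
                "[4] Wei Zhuo, Qianyi Zhan, Yuan Liu, Zhenping Xie, and Jing Lu. "
                "Context attention heterogeneous network embed- ding. "
                "Computational Intelligence and Neuroscience , 2019. "
                "doi: 10.1155/2019/8106073."
            )
        }
    ]
}

# POST the bib entry string and print the parsed fields.
response = requests.post(url, json=payload)
response.raise_for_status()
print(response.json())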

Tests: integration test passed and dev deployment works

TODO: release as version 0.0.10 after merge

url: Optional[StringWithSpan]


class BibEntryPredictor:
stefanc-ai2 (Contributor, Author):

I've been calling it BibEntryParser, but since it's under hf_predictor I renamed it to BibEntryPredictor. Which name do you prefer, BibEntryParser or BibEntryPredictor?

Contributor:

The model's name was Bib Entry Parser, like Layout Parser, and the Predictor is part of what is built for that model, not the entire model itself. I feel like sticking with Bib Entry Parser is more consistent, though it does produce some clunky names and redundancy like 'BibEntryParserPredictor'.
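As an aside on the hunk above: a rough guess at the shape of a StringWithSpan, i.e. an extracted field value together with its character offsets in the source bib entry. This is purely a hypothetical illustration; the actual class is defined elsewhere in this PR.

from dataclasses import dataclass

@dataclass
class StringWithSpan:
    # Hypothetical shape: the extracted text plus its character
    # offsets within the original bib entry string.
    string: str
    start: int
    end: int

entry = "... doi: 10.1155/2019/8106073."
doi = "10.1155/2019/8106073"
start = entry.find(doi)
doi_field = StringWithSpan(string=doi, start=start, end=start + len(doi))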

tokenized_inputs = self.tokenizer(bib_entries, padding=True, truncation=True, return_tensors="pt")
predictions = self.model(**tokenized_inputs)

pred_ids = predictions.logits.argmax(2).tolist() # check GPU vs CPU; might need to .detach()
stefanc-ai2 (Contributor, Author):

Kyle mentioned checking GPU vs CPU, and that this might need .detach(). I checked that it works fine on both.
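For illustration, a hedged sketch of the device handling in question, assuming a standard Hugging Face token-classification setup; this is not the PR's actual code:

import torch

def predict_label_ids(model, tokenizer, bib_entries):
    # Run on whichever device the model lives on, then bring the
    # logits back to CPU before converting to Python lists.
    device = next(model.parameters()).device
    inputs = tokenizer(bib_entries, padding=True, truncation=True, return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        outputs = model(**inputs)
    # .detach() drops the autograd graph (redundant under no_grad(),
    # but harmless); .cpu() is what makes .tolist() safe on GPU.
    return outputs.logits.detach().cpu().argmax(dim=2).tolist()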

)

@staticmethod
def _get_word_level_prediction(word_ids: List[Optional[int]], predictions: List[int]) -> List[int]:
stefanc-ai2 (Contributor, Author):

Token manipulation gymnastics from here down.
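For readers outside the PR, a rough sketch of what a helper with this signature typically does (my illustration, not the PR's actual implementation): Hugging Face fast tokenizers report, for each subword token, the index of the word it came from (None for special tokens), and keeping the prediction of each word's first subword yields word-level labels.

from typing import List, Optional

def word_level_prediction(word_ids: List[Optional[int]], predictions: List[int]) -> List[int]:
    # Keep one prediction per word: the label of its first subword.
    # word_ids[i] is None for special tokens ([CLS], [SEP], padding).
    word_preds: List[int] = []
    previous: Optional[int] = None
    for word_id, pred in zip(word_ids, predictions):
        if word_id is not None and word_id != previous:
            word_preds.append(pred)
        previous = word_id
    return word_preds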

python_version: 3.7

# Whether this model supports CUDA GPU acceleration
cuda: False
stefanc-ai2 (Contributor, Author):

Not sure about this one. It currently works on a CPU-only instance, but that might not be optimal.

Contributor:

If you want, you can set this to True. The resulting image will be runnable on CPU or GPU infra; it just takes longer to build and is larger than the cuda: False variant.

@Whattabatt (Contributor) left a comment:

Looks good to me

@cmwilhelm (Contributor) left a comment:

The TIMO-ification aspects here look good to me, thanks for working through that!

The part that I'd like to get more clarity on before we merge is what we talked about in sprint planning this morning: whether we're going to use the mmda data model.

As far as I'm aware, all the existing mmda predictors use a shared data model to represent documents and their components. I don't fully understand what the ramifications of NOT using it for the bib entry predictor are, other than that it complicates integration in SPP.

@kyleclo @lolipopshock what is your guidance on this front? Should all MMDA models be using the MMDA data model? And if so, how should it be applied to bib entry?


setup.py Outdated

@@ -2,7 +2,7 @@
 setuptools.setup(
     name="mmda",
-    version="0.0.8",
+    version="0.0.9rc1",
Contributor:

When this is ready to merge, you should probably change this to 0.0.9.

@kyleclo (Collaborator) commented Jun 22, 2022

@cmwilhelm Not using the MMDA model likely results in a few things:

  1. MMDA is undergoing some refactoring, and it's easier to refactor if all the Predictors adhere to the same input/output than if each Predictor were doing something on its own.
  2. MMDA has a standard way of performing serialization/instantiation to/from JSON. Deviating from the underlying data model likely means SPP (or whatever downstream user of MMDA) will need to implement/maintain their own serialization (see the sketch after this list).
  3. MMDA abstracts away some annoying boilerplate one would otherwise have to write to obtain bounding boxes for Predictor outputs. For example, if we ever want to highlight BibEntry titles/authors/DOIs for whatever reason and need the exact bounding boxes of those predictions, obtaining them from the other MMDA objects is probably harder than if the predictor had adhered to the MMDA conventions from the start.
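A minimal sketch of what point 2 means in practice (the record type and helpers below are hypothetical illustrations, not mmda's or this PR's actual API): without the shared data model, each downstream consumer has to define and maintain its own serialization format.

from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class BibEntryStructure:
    # Hypothetical output record for the bib entry predictor.
    title: Optional[str] = None
    doi: Optional[str] = None
    url: Optional[str] = None

    def to_json(self) -> dict:
        return asdict(self)

    @classmethod
    def from_json(cls, d: dict) -> "BibEntryStructure":
        return cls(**d)

# SPP (or any other consumer) would have to agree on and maintain this
# round-trip itself, since it falls outside mmda's shared conventions.
record = BibEntryStructure(doi="10.1155/2019/8106073")
assert BibEntryStructure.from_json(record.to_json()) == record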

For Stefan's specific instance, none of these are particularly severe/important; just a bit of hassle later for #1 and #2. #3 seems unlikely (there are no plans for bounding boxes on the fields that Stefan is extracting), and even if it comes to it later, it's not really hard to adapt.

In fact, #3 is an interesting one because Stefan's model doesn't have to be part of MMDA. The value of MMDA is mostly to handle a lot of annoying code one would otherwise write to handle text + visual annotations on a PDF simultaneously. Citation & BibEntry highlights, for example, really benefit from this library. But models like TLDRs or SPECTER, while they could conform to MMDA conventions, don't have to, as they don't benefit from anything MMDA provides. One can view Stefan's model, which produces a JSON of metadata fields given a BibEntry string, as falling within the same category of model as TLDRs (give me some text, I'll spit out some other text), as opposed to the category of model for VILA (give me a PDF, I'll spit out the locations of the highlighted units).

In all, I think it's not a big deal if Stefan's model doesn't adhere to MMDA conventions. In fact, it's probably even fine for now if Stefan's model lives outside of MMDA on its own, similar to TLDRs or SPECTER, until there's a real reason for it to be in MMDA. But it's also not a big deal if it does live in MMDA for now; I'll maybe get around to refactoring it to conform to MMDA conventions, but it's lower priority than ensuring Yogi/Angeles components conform.

Thoughts?

@stefanc-ai2 (Contributor, Author) commented Jun 22, 2022

Good and interesting point about the bib entry predictor not being part of MMDA and instead being more similar to TLDR. I actually like that better, because it may be used by other services, such as citation matching in Semantic Reader's LaTeX pipeline.

@cmwilhelm (Contributor):

@kyleclo @stefanc-ai2

I'm catching up on the comments above.

The only question I have re: to-mmda-or-not-to-mmda is where the input for Stefan's model is coming from. It's coming from @geli-gel's model, right?

So if @geli-gel's model needs to produce MMDA output, we need something that can identify and pull out the relevant strings to run through @stefanc-ai2's model. I see three possibilities:

  1. Angele's model just outputs strings: no bboxes or other mmda stuff. This would be the simplest to pipe directly into Stefan's model, but would lose all of the mmda features of these annotations.
  2. Angele's model outputs MMDA data. SPP's Scala code somehow reads that and identifies the strings that need structure inferencing. No further work is needed in Stefan's model, but now the logic for how to operate over MMDA data has to be reimplemented in Scala. Potentially scary, especially if you're refactoring how MMDA works.
  3. Angele's model outputs MMDA data, and Stefan's model takes in that same mmda data model. More work for Stefan, but it has all the utilities necessary in mmda.git to do the work.

@stefanc-ai2 (Contributor, Author):

Merging this PR. Tracking the conversion to taking Document as input and outputting Annotation in this ticket: https://github.com/allenai/scholar/issues/32722

@stefanc-ai2 stefanc-ai2 merged commit 7df6968 into main Jun 29, 2022
@stefanc-ai2 stefanc-ai2 deleted the bibentrypredictor branch June 29, 2022 21:56
soldni added a commit that referenced this pull request Jul 5, 2022
* Speed up vila pre-processing (#84)

* bump version and fix vila test fixture inclusion (#85)

Co-authored-by: Chris Wilhelm <chris@allenai.org>

* Bib Entry Parser/Predictor (#83)

* Passed tt verify

* fix variant name

* up version num

* rename enum

* Pins vila to 0.3.0. (#86)

Unbounded upper version was pulling in breaking
changes transitively in builds.

Co-authored-by: Chris Wilhelm <chris@allenai.org>

* update demo with the rasterizer (#71)

* Add to model device in bib entry predictor (#87)

* bibentrypredictor model device

* up setup.py

Co-authored-by: Rodney Kinney <rodneyk@allenai.org>
Co-authored-by: Chris Wilhelm <chris.wilhelm@gmail.com>
Co-authored-by: Chris Wilhelm <chris@allenai.org>
Co-authored-by: Stefan Candra <52424731+stefanc-ai2@users.noreply.github.com>