Bib Entry Parser/Predictor #83
Conversation
url: Optional[StringWithSpan]

class BibEntryPredictor:
I've been calling it BibEntryParser, but since it's under hf_predictor I renamed it to BibEntryPredictor. Which name do you prefer: BibEntryParser or BibEntryPredictor?
The model's name was Bib Entry Parser, like Layout Parser, and the Predictor is part of what is built for that model, not the entire model itself. I feel like sticking with Bib Entry Parser is more consistent, though it does produce some clunky names and redundancy, like 'BibEntryParserPredictor'.
tokenized_inputs = self.tokenizer(bib_entries, padding=True, truncation=True, return_tensors="pt")
predictions = self.model(**tokenized_inputs)

pred_ids = predictions.logits.argmax(2).tolist()  # check GPU vs CPU; might need to .detach()
Kyle mentioned checking GPU vs CPU and that it might need .detach(). I checked that it works fine on both.
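For reference, a minimal sketch of the pattern under discussion; the checkpoint name and inputs are placeholders, not the PR's actual model:

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Placeholder checkpoint for illustration only.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased")
model.eval()

tokenized_inputs = tokenizer(
    ["A. Author. Some paper title. 2020."],
    padding=True, truncation=True, return_tensors="pt",
)
with torch.no_grad():  # inference only; no autograd graph is built
    predictions = model(**tokenized_inputs)

# .detach() strips any autograd history (redundant under no_grad, but safe)
# and .cpu() moves the tensor to host memory; both are no-ops on a CPU
# tensor, so the same line behaves identically on either device.
pred_ids = predictions.logits.argmax(dim=2).detach().cpu().tolist()

Note that .tolist() already copies values to host memory regardless of device, which is likely why the unmodified line worked fine on both.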
)

@staticmethod
def _get_word_level_prediction(word_ids: List[Optional[int]], predictions: List[int]) -> List[int]:
Token manipulation gymnastics from here down.
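For context on what those gymnastics typically involve, here is a hedged sketch (not necessarily the PR's implementation) of the common HuggingFace pattern for collapsing sub-token predictions back to one label per word, using the fast tokenizer's word_ids():

from typing import List, Optional

def get_word_level_prediction(word_ids: List[Optional[int]],
                              predictions: List[int]) -> List[int]:
    """Collapse sub-token predictions to one label per word.

    `word_ids` comes from a fast tokenizer's BatchEncoding.word_ids():
    one entry per token, None for special tokens ([CLS], [SEP], padding),
    otherwise the index of the source word. This sketch keeps the first
    sub-token's label for each word, a common convention.
    """
    word_preds: List[int] = []
    previous_word_id: Optional[int] = None
    for word_id, pred in zip(word_ids, predictions):
        if word_id is not None and word_id != previous_word_id:
            word_preds.append(pred)  # first sub-token of a new word
        previous_word_id = word_id
    return word_preds

# e.g. word_ids [None, 0, 0, 1, 2, 2, None] with predictions
# [0, 3, 3, 5, 7, 7, 0] collapse to [3, 5, 7]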
python_version: 3.7

# Whether this model supports CUDA GPU acceleration
cuda: False
Not sure about this one. It currently works on a CPU-only instance, but might not be optimal.
If you want you can set this to True. The image that results will be runnable on CPU or GPU infra. It just takes longer to build and is larger in size than the cuda: False variant.
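For reference, a common device-agnostic loading pattern that would support either setting; this is a sketch with a placeholder checkpoint, not the PR's code:

import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Pick the GPU when present so the same image runs on CPU-only or GPU infra.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased")
model.to(device).eval()

# Inputs must live on the same device as the model.
inputs = tokenizer(["A. Author. Some paper title. 2020."], return_tensors="pt")
inputs = {k: v.to(device) for k, v in inputs.items()}
with torch.no_grad():
    logits = model(**inputs).logits

The "Add to model device in bib entry predictor (#87)" commit in the merge history below suggests device placement was in fact revisited after this PR merged.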
Looks good to me
The TIMO-ification aspects here look good to me, thanks for working through that!
The part that I'd like to get more clarity on before we merge is what we talked about in sprint planning this morning: whether we're going to use the mmda data model.
As far as I'm aware, all the existing mmda predictors use a shared data model to represent documents and their components. I don't fully understand what the ramifications are for NOT using them for the bib entry predictor, other than that it complicates integration in SPP.
@kyleclo @lolipopshock what is your guidance on this front? Should all MMDA models be using the MMDA data model? And if so, how should it be applied to bib entry?
setup.py (Outdated)

@@ -2,7 +2,7 @@

 setuptools.setup(
     name="mmda",
-    version="0.0.8",
+    version="0.0.9rc1",
When this is ready to merge, you should probably change this to 0.0.9.
@cmwilhelm Not using the MMDA model likely results in a few things:
For Stefan's specific instance, none of these are particularly severe/important; just a bit of hassle later for #1 and #2. In fact, #3 is an interesting one because Stefan's model doesn't have to be part of MMDA. The value of MMDA is mostly to handle a lot of annoying code one would otherwise write to handle text + visual annotations on a PDF simultaneously. Citation & BibEntry highlights, for example, really benefit from this library. But models like TLDRs or SPECTER, while they could conform to MMDA conventions, don't have to, as they don't benefit from anything MMDA provides. One can view Stefan's model, which produces a JSON of metadata fields given a BibEntry string, as falling within the same category of model as TLDRs (give me some text, I'll spit out some other text), as opposed to the category of model for VILA (give me a PDF, I'll spit out the locations of the highlighted units).

In all, I think it's not a big deal if Stefan's model doesn't adhere to MMDA conventions. In fact, it's probably even fine for now if Stefan's model lives outside of MMDA on its own, similar to TLDRs or SPECTER, until there's a real reason that it has to be in MMDA. But it's also not a big deal if it does live in MMDA for now; I'll maybe get around to refactoring it to conform to MMDA conventions, but it's lower priority than ensuring Yogi/Angeles components conform. Thoughts?
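To make the "give me some text, I'll spit out some other text" category concrete, a hypothetical sketch of that interface shape (the field names and function are illustrative, not the PR's actual schema):

from dataclasses import dataclass, asdict
from typing import List, Optional

# Hypothetical illustration of a TLDR-like, MMDA-free interface:
# plain strings in, plain dictionaries of metadata fields out.
@dataclass
class BibEntryFields:
    title: Optional[str] = None
    authors: Optional[str] = None
    year: Optional[str] = None
    url: Optional[str] = None

def predict_fields(bib_entries: List[str]) -> List[dict]:
    # A real implementation would run the token-classification model here;
    # this stub only shows the input/output contract.
    return [asdict(BibEntryFields()) for _ in bib_entries]

Nothing here needs MMDA's Document type; an MMDA-conformant version would instead annotate spans on a shared document object.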
Good and interesting point about the bib entry predictor not being part of MMDA and instead being more similar to TLDR. I actually like that better because it may be used by other services, e.g. as part of Semantic Reader's LaTeX pipeline's citation matching.
I'm catching up on the comments above. The only question I have re: to-mmda-or-not-to-mmda is where the input for Stefan's model is coming from. So if @geli-gel's model needs to produce MMDA output, we need something that can identify and pull out the relevant strings to run through @stefanc-ai2's model. I see three possibilities:
Merging in this PR. Tracking conversion to taking in
* Speed up vila pre-processing (#84)
* bump version and fix vila test fixture inclusion (#85)
  Co-authored-by: Chris Wilhelm <chris@allenai.org>
* Bib Entry Parser/Predictor (#83)
  * Passed tt verify
  * fix variant name
  * up version num
  * rename enum
* Pins vila to 0.3.0. (#86)
  Unbounded upper version was pulling in breaking changes transitively in builds.
  Co-authored-by: Chris Wilhelm <chris@allenai.org>
* update demo with the rasterizer (#71)
* Add to model device in bib entry predictor (#87)
  * bibentrypredictor model device
  * up setup.py

Co-authored-by: Rodney Kinney <rodneyk@allenai.org>
Co-authored-by: Chris Wilhelm <chris.wilhelm@gmail.com>
Co-authored-by: Chris Wilhelm <chris@allenai.org>
Co-authored-by: Stefan Candra <52424731+stefanc-ai2@users.noreply.github.com>
https://github.com/allenai/scholar/issues/32461
Pretty standard model and interface implementation. You can see the result on http://bibentry-predictor.v0.dev.models.s2.allenai.org/ or
Tests: integration test passed and dev deployment works.

TODO: release as version 0.0.10 after merge.