Support automatic citation extraction from PDF attachments #61
An additional gateway to a locally running instance of Excite may be of interest to @cboulanger. |
The whole business of running local servers that provide PDF extraction and citation matching has become much easier thanks to Docker. See, for example, these instructions for running Grobid and reference-matching servers: https://github.com/kermitt2/biblio-glutton#running-with-docker |
It might be enough to define a registration API into which backend connectors can hook via separate plugins - one for PDF extraction (e.g., given a PDF, return Zotero JSON), and one for matching (e.g., given an array of Zotero JSON items, return an array of arrays of Zotero JSON items containing the best matches). I would leave the implementation completely outside of CITA. |
Continuing this thought, what about this: why not implement an experimental internal API in https://github.com/diegodlh/zotero-cita/blob/master/src/extract.js without a GUI yet (so that no false expectations are created)? This API is then made accessible in the "Run JavaScript" console (I have just asked about how to "require()" plugin classes there) so one can run tests with it. Then, we create a plugin with a reference mock implementation which simply returns static content. On this basis, extraction service add-ons can be developed without the CITA core having to do anything. Once at least one such add-on has matured and delivers reliable results, the internal API can be frozen (and maybe versioned) and made accessible via the GUI. I can envision that this will enable the implementation of a GROBID service based on its Docker image pretty quickly, and I will be very much interested in providing an "EXcite" add-on. Here's a first idea for a minimalistic API (TypeScript-ish pseudo-code):

```typescript
interface Item {
  // Zotero item data
}

interface AbstractConnector {
  readonly id: string      // unique identifier
  readonly label: string   // label to be translated; I don't know how this works in Zotero
  connect(): Promise<void> // throws if no connection can be established
}

interface ExtractionConnector extends AbstractConnector {
  extract(pdfFile: File): Promise<Item[]>
}

interface MatchConnector extends AbstractConnector {
  match(item: Item): Promise<Item[]>
}

abstract class AbstractRegistry {
  connectors: AbstractConnector[]
  register(connector: AbstractConnector): void // stores the connector
}

class ExtractorRegistry extends AbstractRegistry {}
class MatcherRegistry extends AbstractRegistry {}

export default class Extraction {
  /**
   * Registers the connector with the ExtractorRegistry or the
   * MatcherRegistry, depending on its type.
   */
  static register(connector: AbstractConnector): void

  /**
   * Extracts references from a given attachment item, using the given
   * extractor. Will use registered matcher connectors to look up unique
   * ids (DOI, ISBN, etc.) and complete/correct metadata for the extracted
   * references.
   * @param extractorId the id of the extractor, from some UI setting
   *   or manually passed to the method in the console
   * @param item a Zotero.Item object of type attachment from which a
   *   PDF file can be retrieved (either stored locally or by way of download)
   */
  static extract(extractorId: string, item: Zotero.Item): Promise<Item[]>

  /**
   * @param matcherId the id of the matcher, from some UI setting
   *   or manually passed to the method in the console
   * @param item Zotero JSON item from reference extraction (or somewhere else)
   */
  static match(matcherId: string, item: Item): Promise<Item[]>
}
```

Maybe it makes sense to separate matching from extraction, since matching could be called outside of extraction with items that already exist in Zotero. Any thoughts? |
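To make the mock-first idea above concrete, here is a minimal sketch of how a registry and a static mock extractor could fit together. It follows the pseudo-code in the previous comment, but none of these names exist in Cita yet, and the standalone `extract()` helper merely stands in for whatever `Extraction.extract()` would do internally.

```typescript
// Sketch only: a registry plus a mock extraction connector that always
// "extracts" the same static content, so add-on development can start
// before any real service is wired up.

interface Item {
  itemType: string;
  title?: string;
  [key: string]: unknown; // rest of the Zotero item data
}

interface AbstractConnector {
  readonly id: string;
  readonly label: string;
  connect(): Promise<void>; // throws if no connection can be established
}

interface ExtractionConnector extends AbstractConnector {
  extract(pdfFile: unknown): Promise<Item[]>;
}

class Registry<C extends AbstractConnector> {
  private connectors = new Map<string, C>();
  register(connector: C): void {
    this.connectors.set(connector.id, connector);
  }
  get(id: string): C {
    const connector = this.connectors.get(id);
    if (!connector) throw new Error(`no connector registered as "${id}"`);
    return connector;
  }
}

const extractors = new Registry<ExtractionConnector>();

// Reference mock implementation: returns static content.
extractors.register({
  id: "mock",
  label: "Mock extractor",
  async connect() {}, // a mock can always "connect"
  async extract(): Promise<Item[]> {
    return [{ itemType: "journalArticle", title: "Static mock reference" }];
  },
});

// Roughly what a console-accessible Extraction.extract() could do.
async function extract(extractorId: string, attachment: unknown): Promise<Item[]> {
  const connector = extractors.get(extractorId);
  await connector.connect();
  return connector.extract(attachment);
}
```

With this in place, `extract("mock", null)` in the "Run JavaScript" console resolves to the single static item regardless of the attachment passed in, which is exactly what is needed for exercising the API without a real backend.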
An implementation using https://ref.scholarcy.com/api/ could be done pretty quickly, I think. |
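With a hosted service like that, most of a connector's work is just mapping the service's JSON onto Zotero-style items. A sketch of that mapping step - note that the `reference_string`/`doi` field names are invented for illustration and are not Scholarcy's documented schema:

```typescript
// Hypothetical mapping from a reference-extraction service response to
// Zotero-style items. Field names here are assumptions, not any real schema.

interface ZoteroishItem {
  itemType: string;
  DOI?: string;
  extra?: string;
}

// Shape we assume the service might return per extracted reference.
interface ExtractedReference {
  reference_string: string; // the raw citation text
  doi?: string;             // present when the service resolved a DOI
}

function toZoteroItems(refs: ExtractedReference[]): ZoteroishItem[] {
  return refs.map((ref) => ({
    // Placeholder: a real connector would need item-type detection.
    itemType: "journalArticle",
    DOI: ref.doi,
    // Keep the raw string so a matcher connector can refine it later.
    extra: ref.reference_string,
  }));
}
```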
A rough outline of the required steps to get this up and running. The main issue is finding a service to do the reference extraction from full-text documents. After that, wiring into the Cita workflow should be pretty straightforward, maybe with a validation step.
- Test different citation extraction services (e.g. Grobid https://github.com/kermitt2/grobid, Scholarcy http://ref.scholarcy.com/api/, but I guess there are many more) - how well do they perform on a selection of PDFs?
  - Easiest for us would be an online API (like Crossref or Wikidata) - is this available?
  - Otherwise, we could potentially set up a server using Wikimedia's infrastructure that runs the service of choice
  - As a last resort, we could run a service locally, but that would require more setup for users. This could even be a separate add-on.
- When selected, get an attachment item and send it to this service (do we only support PDFs? Are other formats necessary?)
- Parse the returned references
- Potentially have a validation step - showing the PDF text and parsed output for each reference
|
In this workflow, an initial step might be to check if this document has already had its citations extracted. These citations may be stored locally or in a larger graph somewhere like Crossref.
All the best,
-Hugh
|
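For the "has this document already had its citations extracted?" check against a larger graph, Crossref is one concrete option: its REST API returns a work's deposited reference list (when the publisher has deposited one) under `message.reference` in the `works/{doi}` response. A sketch of that check; the function names are ours, not an existing API:

```typescript
// Sketch: look up already-deposited references for a DOI via Crossref
// before running any extraction of our own.

interface CrossrefReference {
  key?: string;
  DOI?: string;
  unstructured?: string;
}

interface CrossrefWorkMessage {
  message: { reference?: CrossrefReference[] };
}

// Pure helper: pull the deposited references out of a works/{doi} response.
function depositedReferences(work: CrossrefWorkMessage): CrossrefReference[] {
  return work.message.reference ?? [];
}

// Network wrapper (untested sketch): empty result means nothing deposited,
// so extraction from the PDF would still be needed.
async function fetchCrossrefReferences(doi: string): Promise<CrossrefReference[]> {
  const res = await fetch(`https://api.crossref.org/works/${encodeURIComponent(doi)}`);
  if (!res.ok) return [];
  return depositedReferences((await res.json()) as CrossrefWorkMessage);
}
```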
I have been working on an extraction workflow based on https://github.com/inukshuk/anystyle , which is a lightweight alternative to GROBID.[1] My suggestion would be to define a minimal REST API which can be used both with an endpoint that you set up and with a port on localhost - so that people running their own extraction servers with custom models can integrate them into CITA. [1] There is a general problem with extracting citations from Humanities scholarship which has no bibliography but puts all citation information in the footnotes. All of the existing solutions perform very badly on such literature, but I have promising results with AnyStyle based on a dataset of annotated documents for training a custom model. GROBID requires more complex training material because of the importance it places on document layout. |
Ahh nice, and is it lightweight enough to run locally in a Zotero add-on? Yeah, I guess we should define some sort of minimal API an extraction service should provide (i.e. given a PDF, return the list of references in BibTeX or some other format). I guess it would be nice to have a lightweight extraction service living either in Cita itself or in a Zotero add-on, for ease of installation / getting things up and running. Then users who want to can run more advanced citation extraction services, and those will still play nicely with Cita. |
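One way to pin down such a minimal API: a single POST endpoint that accepts a PDF and returns a list of reference strings, used identically whether the service runs on localhost or on a remote server. The `/extract` path and response shape below are made up for illustration; only the contract matters:

```typescript
// Hypothetical minimal extraction-service contract: POST a PDF, get back
// reference strings (e.g. BibTeX entries or plain citation strings).

interface ExtractionResponse {
  references: string[];
}

// Runtime type guard so the client fails loudly on unexpected responses.
function isExtractionResponse(value: unknown): value is ExtractionResponse {
  return (
    typeof value === "object" &&
    value !== null &&
    Array.isArray((value as ExtractionResponse).references) &&
    (value as ExtractionResponse).references.every((r) => typeof r === "string")
  );
}

// Client sketch: the same code works against http://localhost:port or a
// hosted endpoint, which is the point of standardizing the contract.
async function extractReferences(baseUrl: string, pdf: Blob): Promise<string[]> {
  const body = new FormData();
  body.append("file", pdf, "document.pdf");
  const res = await fetch(`${baseUrl}/extract`, { method: "POST", body });
  const json: unknown = await res.json();
  if (!isExtractionResponse(json)) throw new Error("unexpected response shape");
  return json.references;
}
```

A local AnyStyle or GROBID wrapper in Docker and a hosted service could then both satisfy the same client code.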
AnyStyle is written in Ruby and depends on a C extension (which makes it VERY fast), so unless we can transpile those two things to JavaScript, I fear we are out of luck as far as a Zotero plugin is concerned. However, it is really trivial (at least once you have learned Ruby, like I had to) to expose the desired functionality in a Docker container, which can be deployed very easily locally or on a server. |
Technically it should be possible to compile the C to WASM and call that, but I've found this kind of thing non-trivial. |
It would require compiling https://wapiti.limsi.fr/ to JavaScript (a library that is unfortunately no longer developed, but performs really well compared to a CRF Python implementation I had been working with before), plus translating the Ruby bridge https://github.com/inukshuk/wapiti-ruby and the citation extraction library https://github.com/inukshuk/anystyle from Ruby to JavaScript: quite an endeavour. What is really nice about AnyStyle is that it is very well and cleanly written (even though in-code documentation is largely missing), so a translation into JavaScript should at least be tractable. |
Pinging @inukshuk - maybe he has an opinion on this. |
Just saw this: https://www.ruby2js.com/ |
Ok, that seems like an endeavour, but feasible at least. It'd be nice to have at least one extraction service that runs out of the box in Zotero (for the average user who doesn't want to set up Docker and so on), whether that be something we get to run locally in JS or can host on a server somewhere. I guess a PDF-to-reference-list service could be of interest to Wikipedia/Wikidata people more generally, so maybe we could organise some way to host it if that's the best option. |
With Heroku's free tier gone, I don't readily know of a service where you can host something like this at a controlled cost. As for transpiling: converting between language idioms frequently yields convoluted results for all but the most trivial examples. It's good for cases where the source stays Ruby, but for migration, my experiences weren't great. |
Hey all, I am a humanities researcher / student who is using Cita and have been in contact with Diego before. I know a few people in leading university library IT departments, and at this point I might just reach out to them with something specific like citation extraction. I am also presenting at the International Art Libraries conference in the fall, where I will reference Cita and discuss citation tracing for earlier 19th- and 20th-century sources, which are often digitized but where citation extraction is even harder due to script and other variations. Some library heads and IT people will be there. Just wanted to let you know: citation extraction from PDFs (or scans) is specific and important enough to possibly get some experienced people interested, if you are interested in that.
Cheers
|
Thanks a lot! Yeah, you're right - citation extraction is a problem in and of itself, not only for Cita. If you know more people working on this problem (who are nice enough to host a free service online 😛), that'd be great. In general, any efforts in this direction to produce a solution are something we should align with; Cita's focus should then just be to make this accessible to the average Zotero user. Any experience you have with this landscape would be super valuable. And if there's anything we can do to help, let us know. |
Hello, I am a researcher working on a project that seeks bibliometric data on a large set of journal articles. The number of references per article is one of our measurements of focus. I am unacquainted with GROBID, but from the documentation it seems it may accomplish what we need (see the reference segmenter feature): https://grobid.readthedocs.io/en/latest/training/Bibliographical-references/

Equally promising, I have discovered the 'zotero-reference' plugin, which parses and counts the references from a PDF in a Zotero library. The key feature this plugin lacks, though, is the option to view 'reference count' as a column in Zotero's main library view. This is crucial given that our study will compare a multitude of bibliographic measurements at once. Please see the zotero-reference project here (and Google/Chrome translate if necessary): https://github.com/MuiseDestiny/zotero-reference Fwiw, they seem to be open collaborators and are responsive to requests (see user 'polygon'): https://forums.zotero.org/discussion/comment/429031#Comment_429031

Writing here to notify the Cita developers of the need, and to ask whether there are still plans to incorporate this feature in the Zotero plugin via GROBID, Scholarcy, or another service. Thank you! |
Include Grobid and Scholarcy Reference Extraction API. See corresponding section in grant proposal.