some questions about 0.jsonl for example #6
What do you mean by latex in, for example, ref_spans and eq_spans? |
Hi again! grobid_parse = {'abstract': None, 'body_text': None, 'ref_entries': None, 'bib_entries': {}} for paper_id = 10022478 ============================== but paper_id = 199502503 |
Hey @Mayar2009, can you edit your comment to shorten the large dump of data? It's a bit hard to address your questions because I have to scroll through a lot of text to find them. Thanks! |
yes I did @kyleclo |
are you looking at the grobid parse or the latex parse? |
If a paper did not come with an accompanying PDF, there was nothing to parse, so we leave the parse as None. If the paper came with an accompanying PDF and the PDF parsing executed successfully, there will be a dictionary of fields. For transparency, we wanted to keep these two cases separate (that is, "We got a PDF but didn't process it correctly" vs "We never got a PDF"), but I can see how it's confusing. We'll consider removing these in a future release. |
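As a rough sketch of the distinction described above, the check below treats a paper as having no parsed PDF content when every parsed field is None. The field names come from the grobid_parse example quoted earlier in this thread; the helper name pdf_was_parsed is hypothetical and not part of the dataset tooling.

```python
# Field names are taken from the grobid_parse example quoted in this thread.
# The helper name `pdf_was_parsed` is hypothetical.
PARSED_FIELDS = ("abstract", "body_text", "ref_entries")


def pdf_was_parsed(paper):
    """Return True if the paper carries any parsed PDF content."""
    parse = paper.get("grobid_parse")
    if parse is None:
        return False
    # A field left as None means there was nothing to parse for it.
    return any(parse.get(field) is not None for field in PARSED_FIELDS)


# Same shape as the example reported for paper_id 10022478.
no_pdf = {
    "grobid_parse": {
        "abstract": None,
        "body_text": None,
        "ref_entries": None,
        "bib_entries": {},
    }
}
print(pdf_was_parsed(no_pdf))
```

Filtering on a check like this before reading body_text avoids treating "no PDF" papers as parsing failures.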
Nice catch! That's a bug, thanks for identifying it. I'll look into fixing it. |
All these issues are in 0.jsonl alone, and I have not finished exploring everything I want to. |
Yeah, as with all large data releases, there are going to be things we didn't catch; thanks for identifying these. We'll make adjustments in subsequent releases. |
It does exist in the grobid parse.
Thanks for your response! |
Yes, the keys exist in all |
OK, may I ask when the future release will be ready? |
Likely sometime in May |
thanks! we are waiting) |
@kyleclo
my questions are: |
Hey @Mayar2009, the abstract in "metadata" and the abstract in "grobid_parse" are different. The former is a gold abstract sourced directly from the publisher (or whichever source we got the paper from). It can have mistakes, but in general we trust these the most. The latter is any abstract parsed from the PDF directly. It may not exist because (1) we don't have the PDF, (2) we don't have permission to release text from the PDF, (3) our PDF parsing failed to find the abstract, (4) the PDF was distributed without an abstract [unlikely], or (5) the abstract was parsed but mis-detected as a body paragraph. Citations are allowed in abstracts; it's rare, but it happens. None vs [] is our way of documenting whether there was nothing to parse (None) or parsing failed (empty list). We're reconsidering whether that was a good decision, since it seems to be confusing. |
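One way to act on this is to prefer the gold abstract and fall back to the parsed one. The sketch below makes assumptions that are not confirmed in this thread: that the gold abstract is a string at paper['metadata']['abstract'], and that the parsed abstract is a list of paragraph dicts, each with a 'text' key. The helper name pick_abstract is hypothetical.

```python
def pick_abstract(paper):
    """Return (text, source); source is 'metadata', 'grobid_parse', or None."""
    # Assumption: the gold abstract is a plain string under "metadata".
    gold = (paper.get("metadata") or {}).get("abstract")
    if gold:
        return gold, "metadata"

    # Assumption: the parsed abstract is a list of {"text": ...} paragraphs.
    parsed = (paper.get("grobid_parse") or {}).get("abstract")
    if parsed:  # both None (nothing to parse) and [] (nothing found) are skipped
        return " ".join(paragraph["text"] for paragraph in parsed), "grobid_parse"

    return None, None
```

If the two empty cases need to be told apart, compare against None explicitly rather than relying on truthiness.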
@kyleclo thanks for the immediate response! The section field of any paragraph in paper['grobid_parse']['body_text'] is always None. |
I do not understand why many papers have get_citation_contexts = [] even though the paper passed these conditions, for example this condition:
cite_ref in paper['grobid_parse']['bib_entries'] |
@kyleclo |
Hi! |
@kyleclo |
@kyleclo I mean there are papers that have both. Why? So s2_pdf_hash is the SHA1 of the PDF used to produce the grobid_parse, and it is not related to the paper ID in the Semantic Scholar database? |
There are papers for which we have both a PDF and a LaTeX file, in which case, both parses are available.
@kyleclo |
No worries, let me try explaining a different way. Most papers on arXiv have an uploaded PDF as well as a LaTeX source file dump. We wrote separate parsers for both the PDF as well as the LaTeX. We don't want to force people to use one text source versus the other, so we included both of these for that same arXiv paper. It's up to you whether you want to use the PDF-parse, the LaTeX-parse, or both, or neither. We don't want them in separate JSON files because they're technically the same paper, and we want to ensure one-JSON-per-paper. Think of it more as different representations of the same paper. For example, in the future, when we parse XML or HTML representations of papers, we might have 3 keys: |
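Since both representations can live in the same JSON object, a reader can check which ones are usable before choosing. This is a minimal sketch, assuming latex_parse mirrors grobid_parse and that a non-empty body_text marks a usable parse; the helper name available_parses is hypothetical.

```python
# Parse keys mentioned in this thread; more could be appended if future
# releases add other representations.
PARSE_KEYS = ("grobid_parse", "latex_parse")


def available_parses(paper):
    """Return the parse keys that hold a non-empty body_text."""
    return [
        key
        for key in PARSE_KEYS
        # `or {}` covers both a missing key and a parse stored as None.
        if (paper.get(key) or {}).get("body_text")
    ]
```

A caller that prefers the LaTeX parse could, for example, take the last element of the returned list when it is non-empty.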
Thanks for the helpful explanation. |
@kyleclo |
Hey @Mayar2009, thanks I'm looking into it; this is most definitely a situation where we couldn't find a better link & forgot to enforce a hard constraint about self-citation. I'd consider self-linked references should be |
Hey @Mayar2009, would it be alright if I closed this issue? It's a bit hard to follow since there are a lot of things being discussed in one thread. I believe with the new release of version |
1. Why is the doi field sometimes a string and sometimes a list of strings, for example
["10.1029/2002JB001919"]?
2. Why do you have two abstract fields in the jsonl files, one in metadata and the other in grobid_parse or latex_parse?
What is the difference between them? I did not understand.
3. In grobid_parse, why is abstract sometimes [] and sometimes null?
for example
4. What do you mean by other_ids? Is it just the DOI?
5. In bib_entries (I do not remember which paper, but as an example)
for b10
but the reference in the paper was for 'BIBREF10', so why do we need ref_id in 'BIBREF10'?
'cite_spans': [{'end': 1698,
'latex': None,
'ref_id': 'BIBREF10',
'start': 1678,
'text': '[Lyons et al., '
'2012]'},
{'end': 1863,
'latex': None,
'ref_id': None,
'start': 1857,
'text': '[2011]'}],
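To make the role of ref_id concrete, here is a small sketch that looks up each cite span in bib_entries and leaves unlinked spans (ref_id of None) unresolved. The span values are copied from the excerpt above; the bib_entries content is a placeholder, and the helper name resolve_cite_spans is hypothetical.

```python
def resolve_cite_spans(cite_spans, bib_entries):
    """Pair each cite span with its bibliography entry, or None if unlinked."""
    resolved = []
    for span in cite_spans:
        ref_id = span.get("ref_id")
        # A ref_id of None means the span was not linked to any entry.
        entry = bib_entries.get(ref_id) if ref_id is not None else None
        resolved.append((span["text"], ref_id, entry))
    return resolved


cite_spans = [
    {"start": 1678, "end": 1698, "latex": None,
     "ref_id": "BIBREF10", "text": "[Lyons et al., 2012]"},
    {"start": 1857, "end": 1863, "latex": None,
     "ref_id": None, "text": "[2011]"},
]
# Placeholder entry, not real data from the corpus.
bib_entries = {"BIBREF10": {"title": "placeholder title"}}

for text, ref_id, entry in resolve_cite_spans(cite_spans, bib_entries):
    print(text, ref_id, entry)
```

Using dict.get for the lookup also tolerates a ref_id that has no matching key in bib_entries.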