Return None in get_paper and get_papers when data is none instead of failing to construct a Paper #80

Closed
nathimel opened this issue Dec 28, 2023 · 7 comments
Labels
bug Something isn't working

Comments

@nathimel

I am using semanticscholar to get a large number of papers while traversing the S2AG iteratively, and sometimes a SemanticScholar.get_paper query ends up with data being None.

This causes an error here in Paper._init_attributes of course, because that method assumes that data is a dict.

As a temporary workaround, I've been writing custom get_paper and get_papers methods that change the line return Paper(data) to return Paper(data) if data is not None else None. (See here for an example.) Otherwise I like to use semanticscholar as is. But this gets tedious to maintain in parallel with updates to semanticscholar; for example, I now need to write more complicated functions to keep up with AsyncSemanticScholar.

Would the developers consider returning None, or otherwise not causing SemanticScholar to throw an error that would stop a loop that I might not be monitoring?

I suppose the alternative is to include try except blocks in my code and manually return None, but that seems uglier. On the other hand, the developers might feel it is preferred. Thanks for your consideration.
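The try/except alternative mentioned above could be factored into one small helper so it is not repeated at every call site. This is only a minimal sketch: `get_or_none` is a hypothetical name, not part of semanticscholar, and the default exception types (TypeError, AttributeError) are an assumption about how constructing a Paper from None fails, since the thread does not state the exact exception.

```python
from typing import Any, Callable, Optional, Tuple, Type


def get_or_none(
    fetch: Callable[..., Any],
    *args: Any,
    exceptions: Tuple[Type[BaseException], ...] = (TypeError, AttributeError),
    **kwargs: Any,
) -> Optional[Any]:
    """Call fetch(*args, **kwargs) and return None if it raises one of `exceptions`.

    The default exception types are an assumption; adjust them to whatever
    the failing call actually raises. Other exceptions still propagate.
    """
    try:
        return fetch(*args, **kwargs)
    except exceptions:
        return None
```

With a client such as `sch = SemanticScholar()`, a call might look like `paper = get_or_none(sch.get_paper, paper_id)`. The helper only covers synchronous calls; an async client would need an awaiting variant.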

@danielnsilva danielnsilva added the bug Something isn't working label Dec 28, 2023
@danielnsilva
Owner

@nathimel Thanks for pointing out the issue. If you could share an example of when this error happens, that'd be helpful for testing.

@kochbj

kochbj commented Jan 3, 2024

Rather than creating a new issue, I believe I'm having the same bug. Part of the issue outside of the package is that papers seem to disappear in SS from time to time. But this makes it impossible to bulk download because if one paper doesn't exist out of 500, the whole function call fails:

Reproducible code:

import pandas as pd
from semanticscholar import SemanticScholar
'''
{"corpusid":211530585,"externalids":{"ACL":null,"DBLP":"conf/aaai/DorrB0MHSCSZS20","ArXiv":null,"MAG":"2998331601","CorpusId":"211530585","PubMed":null,"DOI":"10.1609/AAAI.V34I05.6269","PubMedCentral":null},"url":"https://www.semanticscholar.org/paper/77e61c39ee59a03be8813f961cb1b327926dcce2","title":"Detecting Asks in Social Engineering Attacks: Impact of Linguistic and Structural Knowledge","authors":[{"authorId":"1752326","name":"B. Dorr"},{"authorId":"2632964","name":"Archna Bhatia"},{"authorId":"36235196","name":"Adam Dalton"},{"authorId":"1505503134","name":"Brodie Mather"},{"authorId":"1505542669","name":"Bryanna Hebenstreit"},{"authorId":"50629423","name":"Sashank Santhanam"},{"authorId":"1470639547","name":"Zhuo Cheng"},{"authorId":"145102721","name":"Samira Shaikh"},{"authorId":"1877429","name":"Alan Zemel"},{"authorId":"1791072","name":"T. Strzalkowski"}],"venue":"AAAI","year":2020,"referencecount":64,"citationcount":3,"influentialcitationcount":0,"isopenaccess":true,"s2fieldsofstudy":[{"category":"Computer Science","source":"s2-fos-model"},{"category":"Computer Science","source":"external"}],"updated":"2022-03-10T06:13:39.710Z"}
'''
sch = SemanticScholar()
corpus_ids = ['CorpusId:211530585']
sch.get_papers(corpus_ids, fields=[
    'title', 'year', 'publicationDate', 'citations.publicationDate',
    'venue', 'citations.year', 'citations.title', 'citations.externalIds',
    'publicationTypes', 'publicationVenue', 'externalIds', 'citations.venue',
])

@danielnsilva
Owner

danielnsilva commented Jan 4, 2024

Hi @kochbj,

I believe the issue you're facing with get_papers() is different from @nathimel's. When get_paper() encounters a non-existent ID, it correctly throws an ObjectNotFoundException due to a 404 error. However, with get_papers(), the problem seems to be how it handles batches when one of the IDs was not found in S2, returning null instead.

A better approach might be for get_papers() to return data for existing papers and list non-existent ones in a warning. This would allow bulk downloads to continue smoothly despite missing papers.
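The behavior proposed above (keep the papers that exist, warn about the ones that do not) can be sketched roughly as follows. This is not the library's implementation: `split_batch` is a hypothetical helper, and it assumes the batch response is a list aligned position by position with the requested IDs, with None standing in for each ID that was not found.

```python
import warnings
from typing import Any, Dict, List, Optional, Sequence, Tuple


def split_batch(
    requested_ids: Sequence[str],
    response: Sequence[Optional[Dict[str, Any]]],
) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Separate a batch response into found records and not-found IDs.

    Assumes response[i] corresponds to requested_ids[i] and that a missing
    paper is represented by None.
    """
    if len(requested_ids) != len(response):
        raise ValueError("response length does not match the number of requested IDs")

    found: List[Dict[str, Any]] = []
    not_found: List[str] = []
    for paper_id, item in zip(requested_ids, response):
        if item is None:
            not_found.append(paper_id)
        else:
            found.append(item)

    # Warn instead of raising, so an unattended loop keeps running.
    if not_found:
        warnings.warn(f"IDs not found: {not_found}")
    return found, not_found
```

Warning rather than raising lets a long-running loop continue while still leaving a trace of what was skipped.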

In contrast, @nathimel's issue is different. It's not about missing IDs but about receiving a response with no data, leading to failures.

@kochbj

kochbj commented Jan 4, 2024

Hi @danielnsilva , thanks for the quick reply! I believe the solution you proposed would be great. My use case is that I want to download metadata for 200,000 papers using ArxivIds. If I can do that for 500 papers at a time great. But if even 40 papers don't have IDs, I have to go back to doing 50 at a time, then 10 at a time, then 1 at a time until I find the bad ID. It uses up my API calls. If you could simply return None and/or throw a warning that would be splendid!
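To make the API-call cost of that fallback concrete, here is a rough sketch of hunting for bad IDs by retrying smaller and smaller batches. It halves the failing batch each time rather than using the 500/50/10/1 step sizes described above, and both `find_bad_ids` and the `batch_ok` callback are hypothetical stand-ins (`batch_ok` represents one API call that succeeds only if every ID in the batch exists).

```python
from typing import Callable, List, Sequence, Tuple


def find_bad_ids(
    ids: Sequence[str],
    batch_ok: Callable[[List[str]], bool],
) -> Tuple[List[str], int]:
    """Locate IDs that make a batch fail by splitting failing batches in half.

    Returns the bad IDs (in their original order) and the number of calls
    made to batch_ok, i.e. the number of API requests spent on the search.
    """
    bad: List[str] = []
    calls = 0
    stack: List[List[str]] = [list(ids)]
    while stack:
        batch = stack.pop()
        if not batch:
            continue
        calls += 1
        if batch_ok(batch):
            continue  # every ID in this batch exists
        if len(batch) == 1:
            bad.append(batch[0])  # isolated a single failing ID
            continue
        mid = len(batch) // 2
        # Push the right half first so the left half is processed first.
        stack.append(batch[mid:])
        stack.append(batch[:mid])
    return bad, calls
```

Every failing batch costs a request that returns no data, which is why a batch call that tolerates missing IDs is cheaper for large downloads.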

danielnsilva added a commit that referenced this issue Jan 7, 2024
Update the get_papers() method to support returning a list of not found paper IDs. When the return_not_found parameter is set to True, the method now returns a tuple containing both a list of found papers and a list of not found IDs. This enhancement addresses the issue where handling of missing papers was not clear.
danielnsilva added a commit that referenced this issue Jan 7, 2024
Update the get_papers() method to support returning a list of not
found paper IDs. When the return_not_found parameter is set to True,
the method now returns a tuple containing both a list of found papers
and a list of not found IDs.
@danielnsilva
Owner

danielnsilva commented Jan 7, 2024

@kochbj I've fixed get_papers() so it no longer fails when some IDs are not found. You can use it as usual for bulk downloads. Additionally, if you need a list of not found IDs, just set return_not_found to True, and it will return both the papers and the missing IDs. If you want to try it out, you need to install it directly from the source.

pip install --no-cache-dir --force-reinstall git+https://github.com/danielnsilva/semanticscholar@issue-80
from semanticscholar import SemanticScholar
sch = SemanticScholar()
list_of_paper_ids = [
     'CorpusId:211530585',
     'CorpusId:470667',
     '10.2139/ssrn.2250500',
     '0f40b1f08821e22e859c6050916cec3667778613'
]
list_of_papers, list_of_not_found_ids = sch.get_papers(list_of_paper_ids, return_not_found=True)
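For a large ID list, the same call could be wrapped in a simple batching loop. The sketch below is an illustration only: `fetch_all` is a hypothetical helper, the default batch size of 500 is taken from the discussion above rather than from any documented limit, and it assumes `get_papers` accepts `return_not_found=True` and returns a `(papers, not_found_ids)` tuple as in the example above.

```python
from typing import Any, Callable, List, Sequence, Tuple


def fetch_all(
    get_papers: Callable[..., Tuple[List[Any], List[str]]],
    ids: Sequence[str],
    batch_size: int = 500,
) -> Tuple[List[Any], List[str]]:
    """Fetch papers in fixed-size batches, collecting found papers and missing IDs.

    `get_papers` is expected to behave like sch.get_papers called with
    return_not_found=True.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    papers: List[Any] = []
    missing: List[str] = []
    for start in range(0, len(ids), batch_size):
        batch = list(ids[start:start + batch_size])
        found, not_found = get_papers(batch, return_not_found=True)
        papers.extend(found)
        missing.extend(not_found)
    return papers, missing
```

Usage might look like `papers, missing = fetch_all(sch.get_papers, list_of_paper_ids)`. Passing the method in as a callable keeps the loop independent of the client class, so a stub can replace it in tests.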

@kochbj

kochbj commented Jan 7, 2024

Wonderful. Can't wait to try it out. :) Thanks!

@danielnsilva
Owner

I couldn't reproduce @nathimel's issue. I'm closing this issue for now, but feel free to reopen it if necessary.

3 participants