Return None in get_paper and get_papers when data is none instead of failing to construct a Paper #80

nathimel · 2023-12-28T02:54:39Z

I am using semanticscholar to get a large number of papers while traversing the S2AG iteratively, and sometimes a query with SemanticScholar.get_paper comes back with data being None.

This causes an error here in Paper._init_attributes of course, because that method assumes that data is a dict.

As a temporary workaround, I've been writing custom get_paper and get_papers methods that change the line return Paper(data) to return Paper(data) if data is not None else None. (See here for an example.) Otherwise I like to use semanticscholar as is. But this gets tedious to maintain in parallel with updates to semanticscholar; for example, I now need to write more complicated functions to keep up with AsyncSemanticScholar.

Would the developers consider returning None, or otherwise not causing SemanticScholar to throw an error that would stop a loop that I might not be monitoring?

I suppose the alternative is to include try except blocks in my code and manually return None, but that seems uglier. On the other hand, the developers might feel it is preferred. Thanks for your consideration.
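For reference, the try/except alternative mentioned above could look roughly like the sketch below. It is only an illustration: get_or_none is a hypothetical helper, not part of semanticscholar, and since the thread does not say which exception is raised when data is None, the sketch catches Exception broadly. The fetch function is passed in as an argument, so sch.get_paper is one possible candidate.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


def get_or_none(fetch: Callable[[str], Any], paper_id: str) -> Optional[Any]:
    """Call fetch(paper_id) and return None instead of raising.

    Hypothetical helper: `fetch` could be, e.g., SemanticScholar().get_paper.
    """
    try:
        return fetch(paper_id)
    except Exception as exc:  # assumption: the exact exception type is unknown
        logger.warning("Skipping %s: %r", paper_id, exc)
        return None
```

In a traversal loop, paper = get_or_none(sch.get_paper, paper_id) followed by a check for None would let the loop continue past a failing ID.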


danielnsilva · 2023-12-28T13:27:05Z

@nathimel Thanks for pointing out the issue. If you could share an example of when this error happens, that'd be helpful for testing.

kochbj · 2024-01-03T18:47:45Z

Rather than creating a new issue, I'm posting here because I believe I'm hitting the same bug. Part of the problem lies outside the package: papers seem to disappear from SS from time to time. But this makes it impossible to bulk download, because if one paper out of 500 doesn't exist, the whole function call fails:

Reproducible code:

import pandas as pd
from semanticscholar import SemanticScholar
'''
{"corpusid":211530585,"externalids":{"ACL":null,"DBLP":"conf/aaai/DorrB0MHSCSZS20","ArXiv":null,"MAG":"2998331601","CorpusId":"211530585","PubMed":null,"DOI":"10.1609/AAAI.V34I05.6269","PubMedCentral":null},"url":"https://www.semanticscholar.org/paper/77e61c39ee59a03be8813f961cb1b327926dcce2","title":"Detecting Asks in Social Engineering Attacks: Impact of Linguistic and Structural Knowledge","authors":[{"authorId":"1752326","name":"B. Dorr"},{"authorId":"2632964","name":"Archna Bhatia"},{"authorId":"36235196","name":"Adam Dalton"},{"authorId":"1505503134","name":"Brodie Mather"},{"authorId":"1505542669","name":"Bryanna Hebenstreit"},{"authorId":"50629423","name":"Sashank Santhanam"},{"authorId":"1470639547","name":"Zhuo Cheng"},{"authorId":"145102721","name":"Samira Shaikh"},{"authorId":"1877429","name":"Alan Zemel"},{"authorId":"1791072","name":"T. Strzalkowski"}],"venue":"AAAI","year":2020,"referencecount":64,"citationcount":3,"influentialcitationcount":0,"isopenaccess":true,"s2fieldsofstudy":[{"category":"Computer Science","source":"s2-fos-model"},{"category":"Computer Science","source":"external"}],"updated":"2022-03-10T06:13:39.710Z"}
'''
sch = SemanticScholar()
corpus_ids = ['CorpusId:211530585']
sch.get_papers(
    corpus_ids,
    fields=[
        'title', 'year', 'publicationDate', 'citations.publicationDate',
        'venue', 'citations.year', 'citations.title', 'citations.externalIds',
        'publicationTypes', 'publicationVenue', 'externalIds', 'citations.venue',
    ],
)

danielnsilva · 2024-01-04T20:45:47Z

Hi @kochbj,

I believe the issue you're facing with get_papers() is different from @nathimel's. When get_paper() encounters a non-existent ID, it correctly throws an ObjectNotFoundException due to a 404 error. However, with get_papers(), the problem seems to be how it handles batches when one of the IDs was not found in S2, returning null instead.

A better approach might be for get_papers() to return data for existing papers and list non-existent ones in a warning. This would allow bulk downloads to continue smoothly despite missing papers.

In contrast, @nathimel's issue is different. It's not about missing IDs but about receiving a response with no data, leading to failures.
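A rough sketch of the behaviour proposed above for get_papers(): keep the papers that exist and report the missing IDs in a warning. This is not the library's implementation. split_found is a hypothetical helper, and it assumes the batch response contains one entry per requested ID, in the same order, with None (null) in place of papers that were not found.

```python
import warnings
from typing import Any, List, Optional, Sequence, Tuple


def split_found(
    ids: Sequence[str], items: Sequence[Optional[Any]]
) -> Tuple[List[Any], List[str]]:
    """Separate found items from the IDs whose entry is None.

    Assumption: items[i] is the response entry for ids[i].
    """
    found = [item for item in items if item is not None]
    missing = [pid for pid, item in zip(ids, items) if item is None]
    if missing:
        # Warn instead of raising so a bulk download can continue.
        warnings.warn(f"IDs not found: {missing}")
    return found, missing
```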

kochbj · 2024-01-04T21:00:38Z

Hi @danielnsilva, thanks for the quick reply! I believe the solution you proposed would be great. My use case is that I want to download metadata for 200,000 papers using ArxivIds. If I can do that for 500 papers at a time, great. But if even 40 papers don't have IDs, I have to go back to doing 50 at a time, then 10 at a time, then 1 at a time until I find the bad ID. It uses up my API calls. If you could simply return None and/or throw a warning, that would be splendid!
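To illustrate why the shrinking-batch workaround described above is costly, here is a minimal sketch of it as a recursive bisection. fetch_with_bisect and its fetch_batch argument are hypothetical names, and the sketch assumes the batch call raises whenever any ID in the batch is bad, as the thread reports for the behaviour before the fix. Every failed batch costs one extra call per split.

```python
from typing import Any, Callable, List, Sequence, Tuple


def fetch_with_bisect(
    ids: Sequence[str],
    fetch_batch: Callable[[Sequence[str]], Sequence[Any]],
) -> Tuple[List[Any], List[str]]:
    """Return (results, bad_ids), halving any batch that fails."""
    if not ids:
        return [], []
    try:
        return list(fetch_batch(ids)), []
    except Exception:  # assumption: any failure means the batch holds a bad ID
        if len(ids) == 1:
            return [], list(ids)
        mid = len(ids) // 2
        left_ok, left_bad = fetch_with_bisect(ids[:mid], fetch_batch)
        right_ok, right_bad = fetch_with_bisect(ids[mid:], fetch_batch)
        return left_ok + right_ok, left_bad + right_bad
```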

Update the get_papers() method to support returning a list of not found paper IDs. When the return_not_found parameter is set to True, the method now returns a tuple containing both a list of found papers and a list of not found IDs. This enhancement addresses the issue where handling of missing papers was not clear.


danielnsilva · 2024-01-07T14:03:56Z

@kochbj I've fixed get_papers() so it no longer fails when some IDs are not found. You can use it as usual for bulk downloads. Additionally, if you need a list of not found IDs, just set return_not_found to True, and it will return both the papers and the missing IDs. If you want to try it out, you need to install it directly from the source.

pip install --no-cache-dir --force-reinstall git+https://github.com/danielnsilva/semanticscholar@issue-80

from semanticscholar import SemanticScholar
sch = SemanticScholar()
list_of_paper_ids = [
    'CorpusId:211530585',
    'CorpusId:470667',
    '10.2139/ssrn.2250500',
    '0f40b1f08821e22e859c6050916cec3667778613'
]
list_of_papers, list_of_not_found_ids = sch.get_papers(list_of_paper_ids, return_not_found=True)
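For a large download such as the 200,000-ID use case mentioned earlier, the call above could be wrapped in a simple batching loop. The sketch below is an assumption-laden illustration, not part of the library: fetch_in_batches is a hypothetical helper, the batch size of 500 is taken from the discussion above, and get_papers is passed in as a callable that accepts return_not_found=True (for example sch.get_papers from the issue-80 branch).

```python
from typing import Any, Callable, List, Sequence, Tuple


def fetch_in_batches(
    ids: Sequence[str],
    get_papers: Callable[..., Tuple[Sequence[Any], Sequence[str]]],
    batch_size: int = 500,
) -> Tuple[List[Any], List[str]]:
    """Fetch ids in consecutive batches and collect found and missing IDs."""
    found: List[Any] = []
    missing: List[str] = []
    for start in range(0, len(ids), batch_size):
        batch = list(ids[start:start + batch_size])
        # Assumption: the callable returns (papers, not_found_ids).
        papers, not_found = get_papers(batch, return_not_found=True)
        found.extend(papers)
        missing.extend(not_found)
    return found, missing
```

A call such as fetch_in_batches(list_of_paper_ids, sch.get_papers) would then return the combined results across all batches.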

kochbj · 2024-01-07T19:06:19Z

Wonderful. Can't wait to try it out. :) Thanks!

danielnsilva · 2024-01-14T16:21:08Z

I couldn't reproduce @nathimel's issue. I'm closing this issue for now, but feel free to reopen it if necessary.

danielnsilva added the bug label Dec 28, 2023

danielnsilva closed this as completed Jan 14, 2024

danielnsilva mentioned this issue Mar 15, 2024

How to get list of outputs without concerning several invalid inputs #85

Closed

danielnsilva mentioned this issue Apr 30, 2024

get_authors sometimes returns None values in a list of Author objects. #87

Closed
