Async functionality implemented. #56

gabriel-trigo · 2023-09-04T18:01:53Z

Hi! I have a project that required many API calls, and being able to do them asynchronously saves a huge amount of time.

The changes I made are as follows:

Added a new file called AsyncSemanticScholar.py, which is essentially the same as SemanticScholar.py, but with async methods.
Modified ApiRequester.py to support both sync and async requests.
Modified PaginatedResults to create a subclass AsyncPaginatedResults that makes Async requests. Notice that to instantiate an object of this class (asynchronously), the .create() method must be used instead of the default constructor.
Added async versions of all tests from the current test suite. They all seem to be passing. Also added a decorator to test methods to ignore the memory-related warnings that pop up when the tests are run.
Modified requirements.txt to include the httpx library used to make async requests.
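
Since `__init__` cannot be a coroutine, the `.create()` requirement mentioned above is the usual async factory pattern. A minimal sketch of that pattern follows; the class `AsyncPaginatedSketch` and the helper `fake_fetch` are hypothetical stand-ins invented for illustration, not the code in this PR, and the network call is replaced by `asyncio.sleep(0)`.

```python
import asyncio


class AsyncPaginatedSketch:
    """Hypothetical stand-in for AsyncPaginatedResults; not the PR's code."""

    def __init__(self, fetch_page):
        # __init__ cannot await anything, so it only stores state.
        self._fetch_page = fetch_page
        self.items = []
        self._offset = 0

    @classmethod
    async def create(cls, fetch_page):
        # Async factory: build the object, then await the first page.
        self = cls(fetch_page)
        await self.next_page()
        return self

    async def next_page(self):
        # Fetch the page starting at the current offset and accumulate it.
        page = await self._fetch_page(self._offset)
        self.items.extend(page)
        self._offset += len(page)
        return page


async def fake_fetch(offset, size=3, total=7):
    # Placeholder for an HTTP request; yields control like real I/O would.
    await asyncio.sleep(0)
    return list(range(offset, min(offset + size, total)))


async def main():
    results = await AsyncPaginatedSketch.create(fake_fetch)  # first page
    await results.next_page()                                # second page
    return results.items


items = asyncio.run(main())
```

Calling `AsyncPaginatedSketch(fake_fetch)` directly would give an object with no pages loaded, which is why the factory is the intended entry point in this sketch.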

Comments:

Originally, I planned to add the async functionality as an argument on the already existing methods. However, since an async function must be marked as async, I could not find a way to do this without requiring users to use the "await" keyword before invoking the library methods. Since I didn't want to alter functionality for existing users, I was left with no choice other than to duplicate the SemanticScholar class to make an async version.
The same constraint also showed up in the modifications made to other files.
Unfortunately, this led to quite a bit of repeated code, but I couldn't find a better way to do it, my apologies.

@danielnsilva would you be able to review this? I'm a Brazilian CS student and I'm trying to begin contributing to open source projects. I found this project very cool, and have already used it quite a bit, so I was hoping to be able to contribute to it.

Best,
Gabriel

Below: picture of a simple use case demonstrating the advantage of async requests.
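
A minimal sketch of the kind of comparison such a use case involves, assuming each API call is dominated by network latency. The function `fake_api_call` is a hypothetical placeholder that sleeps instead of contacting the Semantic Scholar API, so the timings only illustrate the principle and are not measurements of the library.

```python
import asyncio
import time


async def fake_api_call(paper_id, delay=0.1):
    # Placeholder for one HTTP request; the sleep simulates network latency.
    await asyncio.sleep(delay)
    return {"paperId": paper_id}


async def fetch_sequentially(ids):
    # Each call waits for the previous one to finish.
    return [await fake_api_call(i) for i in ids]


async def fetch_concurrently(ids):
    # All calls are in flight at once; gather returns results in input order.
    return await asyncio.gather(*(fake_api_call(i) for i in ids))


def timed(coro):
    # Run a coroutine to completion and report the elapsed wall-clock time.
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


ids = ["p1", "p2", "p3", "p4", "p5"]
seq_result, seq_time = timed(fetch_sequentially(ids))
con_result, con_time = timed(fetch_concurrently(ids))
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

With five simulated calls, the sequential version takes roughly the sum of the delays while the concurrent version takes roughly the longest single delay.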

danielnsilva · 2023-09-04T23:35:41Z

Hi @gabriel-trigo, thanks for the PR and the clear explanations! It touches some core parts, so I'll need a little bit of time to review it. #52

gabriel-trigo · 2023-09-08T02:38:11Z

Hi @danielnsilva, I found some issues with the test cases that I hadn't noticed before, so I'm fixing them and will probably submit one more commit tomorrow.

gabriel-trigo · 2023-09-09T15:23:23Z

Hi @danielnsilva, I have corrected the testing problems (the async tests were not running properly before), and now they seem to all be passing. I committed the .yaml vcr files that were generated when testing locally.

I noticed that a few of the tests, particularly the ones iterating through PaginatedResults, often took a couple of tries to pass (and to generate the correct .yaml file), because calls to next_page() would frequently time out on the server side.

I think everything checks out now!

danielnsilva · 2023-09-10T23:52:07Z

Hi @gabriel-trigo

Initially, I believe that if we're adopting httpx, which supports both synchronous and asynchronous requests, there's no need to keep the requests lib.

Duplicated code can make maintenance tricky, and that concerns me. Also, it's not my intention to introduce breaking changes. A possible solution is to migrate everything to async methods and create methods that make the async ones behave synchronously. In this way, we concentrate all application logic in the async methods, which will simplify maintenance, but we still maintain methods with synchronous behavior.

To illustrate, the ApiRequester class could be structured like this...

import asyncio
from typing import List, Union

import httpx
from tenacity import (retry, retry_if_exception_type, stop_after_attempt,
                      wait_fixed)

from semanticscholar.SemanticScholarException import \
    BadQueryParametersException, ObjectNotFoundException


class BaseRequester:

    def __init__(self, timeout) -> None:
        '''
        :param float timeout: an exception is raised \
            if the server has not issued a response for timeout seconds.
        '''
        self.timeout = timeout

    @property
    def timeout(self) -> int:
        '''
        :type: :class:`int`
        '''
        return self._timeout

    @timeout.setter
    def timeout(self, timeout: int) -> None:
        '''
        :param int timeout:
        '''
        self._timeout = timeout


class ApiRequester(BaseRequester):
    '''
    This class handles calls to Semantic Scholar API.
    '''

    def __init__(self, timeout) -> None:
        super().__init__(timeout)
        self._async_requester = AsyncApiRequester(timeout)
    
    def get_data(
                self,
                url: str,
                parameters: str,
                headers: dict,
                payload: dict = None
            ) -> Union[dict, List[dict]]:
        loop = asyncio.get_event_loop()
        return loop.run_until_complete(
            self._async_requester.get_data(url, parameters, headers, payload))


class AsyncApiRequester(BaseRequester):

    @retry(
        wait=wait_fixed(30),
        retry=retry_if_exception_type(ConnectionRefusedError),
        stop=stop_after_attempt(10)
    )
    async def get_data(
                self,
                url: str,
                parameters: str,
                headers: dict,
                payload: dict = None
            ) -> Union[dict, List[dict]]:
        '''Get data from Semantic Scholar API

        :param str url: absolute URL to API endpoint.
        :param str parameters: the parameters to add in the URL.
        :param dict headers: request headers.
        :param dict payload: data for POST requests.
        :returns: data or empty :class:`dict` if not found.
        :rtype: :class:`dict` or :class:`List` of :class:`dict`
        '''

        url = f'{url}?{parameters}'
        method = 'POST' if payload else 'GET'

        async with httpx.AsyncClient() as client:
            r = await client.request(
                method, url, timeout=self._timeout, headers=headers, json=payload)

        data = {}
        if r.status_code == 200:
            data = r.json()
            if len(data) == 1 and 'error' in data:
                data = {}
        elif r.status_code == 400:
            data = r.json()
            raise BadQueryParametersException(data['error'])
        elif r.status_code == 403:
            raise PermissionError('HTTP status 403 Forbidden.')
        elif r.status_code == 404:
            data = r.json()
            raise ObjectNotFoundException(data['error'])
        elif r.status_code == 429:
            raise ConnectionRefusedError('HTTP status 429 Too Many Requests.')
        elif r.status_code in [500, 504]:
            data = r.json()
            raise Exception(data['message'])

        return data

Notice the get_data() method that uses run_until_complete(), which essentially makes the method behave synchronously.
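
One caveat of this technique is worth keeping in mind: `run_until_complete()` raises `RuntimeError` when it is called while an event loop is already running in the same thread, so a sync wrapper built this way cannot be called from inside a coroutine. The sketch below demonstrates that behaviour under stated assumptions: `async_get_data` and `sync_get_data` are hypothetical stand-ins for the methods in the snippet, and the wrapper uses a fresh loop from `asyncio.new_event_loop()` rather than `asyncio.get_event_loop()`.

```python
import asyncio


async def async_get_data():
    # Placeholder for the real request logic.
    await asyncio.sleep(0)
    return {"ok": True}


def sync_get_data():
    # Sync wrapper: own a fresh event loop for the duration of the call.
    loop = asyncio.new_event_loop()
    coro = async_get_data()
    try:
        return loop.run_until_complete(coro)
    finally:
        coro.close()   # harmless if finished; avoids a warning if never started
        loop.close()


async def caller_inside_loop():
    # Here a loop is already running, so the sync wrapper is refused.
    try:
        sync_get_data()
    except RuntimeError as exc:
        return str(exc)
    return None


plain_result = sync_get_data()                       # works from sync code
error_message = asyncio.run(caller_inside_loop())    # captures the RuntimeError
```

Code that already runs inside an event loop would therefore await the async method directly instead of going through the sync wrapper.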

I'm concerned about potentially introducing a poor design into the project, but I believe the SemanticScholar and PaginatedResults classes could follow the same approach. This approach seems preferable to duplicating everything... I'm not entirely sure, but it's my bet! 🤷‍♂️

There's one more thing I've been pondering. Shouldn't paginated results always be sequential? When using asynchronous methods and iterating over all results, can we ensure they'll be gathered in the correct order? I'll need to take a closer look into this.

gabriel-trigo · 2023-09-11T01:02:57Z

Hi @danielnsilva, these are very good points. In this approach that you suggested, would the SemanticScholar methods still be duplicated? I had difficulty not duplicating the SemanticScholar methods because even if I could get the methods to behave synchronously, I still had to declare them as async to support the async functionality. This meant that even if users wanted to use the method synchronously, they still had to put the keyword await before using the method.

In this suggestion, would the SemanticScholar methods be declared as Async?

await method(async=False) (Users would need to do something like this).

Regarding PaginatedResults, the async implementation actually behaves almost identically to the synchronous one. The only difference is that the constructor and next_page() methods are non-blocking. But the whole idea of iterating through pages sequentially is kept the same!

I'm more than happy to work on the changes you deem necessary based on this discussion!

Best

danielnsilva · 2023-09-11T02:44:27Z

> Hi @danielnsilva, these are very good points. In this approach that you suggested, would the SemanticScholar methods still be duplicated? I had difficulty not duplicating the SemanticScholar methods because even if I could get the methods to behave synchronously, I still had to declare them as async to support the async functionality. This meant that even if users wanted to use the method synchronously, they still had to put the keyword await before using the method.
>
> In this suggestion, would the SemanticScholar methods be declared as Async?
>
> await method(async=False) (Users would need to do something like this).

Let me clarify this. I understand we might not be able to completely get rid of code duplication. But I think if we can focus most of the application logic in the async version, that'd be a step in the right direction. This approach could be better than having the same logic in both sync and async methods.

This means we'll always have two methods, sync and async. But the sync version will be just an async method called synchronously.

I'm not sure if using a sync/async parameter would help achieve our goal.

Here's a quick overview:

A BaseRequester class was created to consolidate shared code.
An AsyncApiRequester class was introduced to embody the async version of ApiRequester.
In this class, the get_data() method is asynchronous and contains the application logic.
In the original ApiRequester class, there's an attribute representing an instance of the asynchronous class version.
Using this attribute, it's feasible to craft synchronous versions of the methods by invoking them synchronously.

Building on the idea, within SemanticScholar, we can structure it to have both BaseSemanticScholar and AsyncSemanticScholar.

In AsyncSemanticScholar, methods like get_whatever() would be asynchronous and house the main logic. Meanwhile, in the original class, sync methods would exist mainly to invoke their respective asynchronous versions.

For clarity, here's how it'd look:

# For synchronous calls
sch.get_paper()

# For asynchronous calls
await async_sch.get_paper()

This approach aligns closely with what you had proposed, with the primary advantage being that the synchronous version doesn't replicate all the method's internal code—it simply calls it synchronously.
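
A minimal sketch of this split, assuming hypothetical class names `AsyncClientSketch` and `ClientSketch` in place of AsyncSemanticScholar and SemanticScholar. The sync method holds no logic of its own and only runs the async one to completion; `asyncio.run()` is used here instead of `loop.run_until_complete()` purely to keep the example short.

```python
import asyncio


class AsyncClientSketch:
    """Hypothetical stand-in for AsyncSemanticScholar."""

    async def get_paper(self, paper_id):
        # All application logic lives in the async method.
        await asyncio.sleep(0)  # placeholder for the real request
        return {"paperId": paper_id, "title": f"Title of {paper_id}"}


class ClientSketch:
    """Hypothetical stand-in for the synchronous SemanticScholar class."""

    def __init__(self):
        self._async = AsyncClientSketch()

    def get_paper(self, paper_id):
        # No logic here; just run the async version to completion.
        return asyncio.run(self._async.get_paper(paper_id))


# For synchronous calls
sch = ClientSketch()
paper = sch.get_paper("abc123")


# For asynchronous calls
async def main():
    async_sch = AsyncClientSketch()
    return await async_sch.get_paper("abc123")


async_paper = asyncio.run(main())
```

Both entry points return the same data, and a change to the request logic only has to be made in the async class.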

> Regarding PaginatedResults, the async implementation actually behaves almost identically to the synchronous one. The only difference is that the constructor and next_page() methods are non-blocking. But the whole idea of iterating through pages sequentially is kept the same!

That's fine. My concern was related to iterating over the results, and if there's a possibility that the next page could be invoked even before the current page request completes. This could lead to potential out-of-order results. This is just a hypothetical scenario that crossed my mind, and I haven't looked into it in depth.
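
A small sketch of the ordering question, using a hypothetical `fetch_page` coroutine in which later pages deliberately finish sooner. When each page is awaited before the next is requested, the next request cannot start early, so the order is fixed by construction. Even when all pages are requested at once, `asyncio.gather()` returns results in the order the awaitables were passed, not in completion order.

```python
import asyncio


async def fetch_page(page_number):
    # Later pages finish sooner here, to stress the ordering question.
    await asyncio.sleep(0.05 * (3 - page_number))
    return [f"page{page_number}-item{i}" for i in range(2)]


async def iterate_sequentially(n_pages):
    # Each page is awaited before the next request is issued.
    items = []
    for page in range(n_pages):
        items.extend(await fetch_page(page))
    return items


async def fetch_all_at_once(n_pages):
    # Requests overlap, but gather keeps results in input order.
    pages = await asyncio.gather(*(fetch_page(p) for p in range(n_pages)))
    return [item for page in pages for item in page]


sequential = asyncio.run(iterate_sequentially(3))
gathered = asyncio.run(fetch_all_at_once(3))
```

Out-of-order results would only appear if pages were collected in completion order, for example with `asyncio.as_completed()`.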

gabriel-trigo · 2023-09-13T16:44:45Z

Hi @danielnsilva, I thought a lot about this and I think I found a good approach following the intuition you provided:

There will be only one requester class, Requester, which only makes async requests.
There is an AsyncSemanticScholar class which provides async versions of all the usual methods.
There is a SemanticScholar class which provides synchronous versions of all the usual methods. Instead of duplicating all the code from AsyncSemanticScholar, it simply instantiates an AsyncSemanticScholar object in the constructor, and then calls the async methods synchronously using loop.run_until_complete(async_method()).
There will be only one PaginatedResults class, which offers an asynchronous constructor, create, and an async method, async_next_page. In SemanticScholar, the constructor is run using run_until_complete() so that it behaves synchronously. The iteration still occurs synchronously to make sure the results stay in order.

I pushed a commit with these modifications. How does it sound to you?

gabriel-trigo · 2023-09-26T18:17:58Z

@danielnsilva what do you think about the solution?

danielnsilva · 2023-10-01T21:40:05Z

I'm back! 😄 I think you did a great job @gabriel-trigo. I've made some remarks, but I believe we have a good version to add the async option to the project.

gabriel-trigo · 2023-10-03T03:09:44Z

Hi @danielnsilva, thank you for taking the time to review this. I'll revise the code based on the points you mentioned and update the pull request this week!

gabriel-trigo · 2023-10-11T20:11:24Z

Hi @danielnsilva, thank you for the suggestions once again. I have implemented all of them and updated this PR 👍

gabriel-trigo added 2 commits September 4, 2023 13:42

implementing async functionality

0b9cea2

finalizing async functionality

d0e3c97

Fixing test cases

e02220e

gabriel-trigo added 3 commits September 9, 2023 01:17

Only 1 test not working. Modified Api_requester to pass object instead of json string for httpx

0e62a04

forgot to remove local path variable

d15987f

Removing test notebook

fe7a679

danielnsilva self-assigned this Sep 9, 2023

All tests passing

1540c5c

gabriel-trigo and others added 4 commits September 13, 2023 23:06

Architecture overhauled to support async_requests.

69ef3ee

Merge branch 'danielnsilva:master' into master

136a506

Forgot to remove local path before pushing

cb67f57

Merge branch 'master' of https://github.com/gabriel-trigo/semanticscholar

63269d8

danielnsilva reviewed Oct 1, 2023

gabriel-trigo added 2 commits October 11, 2023 16:08

Implemented final changes suggested by Daniel to Async PR. (kept method signatures, kept sync get_data method and added deprecation warning, kept ApiRequester Class name)

602bb12

Forgot to remove local path from test_semanticscholar.py file

f57a089

danielnsilva added 3 commits November 16, 2023 12:32

Merge branch 'master' into master

f33ab59

fix: missing variable bug in Paginated REsults

ca36b07

test: ✅ update test_semanticscholar.py with drop_unused_requests option

2741953

danielnsilva merged commit 4c930db into danielnsilva:master Nov 17, 2023
5 checks passed
