Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Async functionality implemented. #56

Merged
merged 16 commits into from Nov 17, 2023

Conversation

gabriel-trigo
Copy link
Contributor

Hi! I have a project that required many API calls, and being able to do them asynchronously saves a huge amount of time.

The changes I made are as follows:

  • Added a new file called AsyncSemanticScholar.py, which is essentially the same as SemanticScholar.py, but with async methods.
  • Modified ApiRequester.py to support both sync and async requests.
  • Modified PaginatedResults to create a subclass AsyncPaginatedResults that makes Async requests. Notice that to instantiate an object of this class (asynchronously), the .create() method must be used instead of the default constructor.
  • Replicated added async versions of all tests from the current test suite. They all seem to be passing. Also added a decorator to test methods to ignore the memory-related warnings that pop up when tests are ran.
  • Modified requirements.txt to include the httpx library used to make async requests.

Comments:

  • Originally, I planned to add the async functionality as an argument on the already existing methods. However, since an async function must be marked as async, I could not found a way to do this without requiring users to use the "await" keyword before invoking the library methods. Since I didn't want to alter functionality for already existing users, I was left with no choice other than duplicate the SemanticScholar class to make an async version.
  • This same theme also showed up in other modifications made in other files.
  • This unfortunately evidently led to quite a bit of repeated code, but I couldn't find a better way to do it, my apologies.

@danielnsilva would you be able to review this? I'm a Brazilian CS student and I'm trying to begin contributing to open source projects. I found this project very cool, and have already used it quite a bit, so I was hoping to be able to contribute to it.

Best,
Gabriel

Below: picture of a simple usecase demonstrating the advantage of async requests.
image
async

@danielnsilva
Copy link
Owner

danielnsilva commented Sep 4, 2023

Hi @gabriel-trigo, thanks for the PR and the clear explanations! It touches some core parts, so I'll need a little bit of time review it. #52

@gabriel-trigo
Copy link
Contributor Author

Hi @danielnsilva , I noticed there are some issues with the test cases that I hadn't noticed, so I'm fixing them, and will probably submit one more commit tomorrow.

@danielnsilva danielnsilva self-assigned this Sep 9, 2023
@gabriel-trigo
Copy link
Contributor Author

Hi @danielnsilva, I have corrected the testing problems (the async tests were not running properly before), and now they seem to all be passing. I committed the .yaml vcr files that were generated when testing locally.

I noticed that a few of the tests, particularly the ones iterating through PaginatedResults often took a couple of tries to work out (and generate the correct .yaml file), because often calls to next_page() would time out on the server side.

I think everything checks out now!

@danielnsilva
Copy link
Owner

Hi @gabriel-trigo

Initially, I believe that if we're adopting httpx, which supports both synchronous and asynchronous requests, there's no need to keep the requests lib.

Duplicated code can make maintenance tricky, and that concerns me. Also, it's not my intention to introduce breaking changes. A possible solution is to migrate everything to async methods and create methods that make the async ones behave synchronously. In this way, we concentrate all application logic in the async methods, which will simplify maintenance, but we still maintain methods with synchronous behavior.

To illustrate, the ApiRequester class could be structured like this...

import asyncio
from typing import List, Union

import httpx
from tenacity import (retry, retry_if_exception_type, stop_after_attempt,
                      wait_fixed)

from semanticscholar.SemanticScholarException import \
    BadQueryParametersException, ObjectNotFoundException


class BaseRequester:

    def __init__(self, timeout) -> None:
        '''
        :param float timeout: an exception is raised \
            if the server has not issued a response for timeout seconds.
        '''
        self.timeout = timeout

    @property
    def timeout(self) -> int:
        '''
        :type: :class:`int`
        '''
        return self._timeout

    @timeout.setter
    def timeout(self, timeout: int) -> None:
        '''
        :param int timeout:
        '''
        self._timeout = timeout


class ApiRequester(BaseRequester):
    '''
    This class handles calls to Semantic Scholar API.
    '''

    def __init__(self, timeout) -> None:
        super().__init__(timeout)
        self._async_requester = AsyncApiRequester(timeout)
    
    def get_data(
                self,
                url: str,
                parameters: str,
                headers: dict,
                payload: dict = None
            ) -> Union[dict, List[dict]]:
        loop = asyncio.get_event_loop()
        return loop.run_until_complete(
            self._async_requester.get_data(url, parameters, headers, payload))


class AsyncApiRequester(BaseRequester):

    @retry(
        wait=wait_fixed(30),
        retry=retry_if_exception_type(ConnectionRefusedError),
        stop=stop_after_attempt(10)
    )
    async def get_data(
                self,
                url: str,
                parameters: str,
                headers: dict,
                payload: dict = None
            ) -> Union[dict, List[dict]]:
        '''Get data from Semantic Scholar API

        :param str url: absolute URL to API endpoint.
        :param str parameters: the parameters to add in the URL.
        :param str headers: request headers.
        :param dict payload: data for POST requests.
        :returns: data or empty :class:`dict` if not found.
        :rtype: :class:`dict` or :class:`List` of :class:`dict`
        '''

        url = f'{url}?{parameters}'
        method = 'POST' if payload else 'GET'

        async with httpx.AsyncClient() as client:
            r = await client.request(
                method, url, timeout=self._timeout, headers=headers, json=payload)

        data = {}
        if r.status_code == 200:
            data = r.json()
            if len(data) == 1 and 'error' in data:
                data = {}
        elif r.status_code == 400:
            data = r.json()
            raise BadQueryParametersException(data['error'])
        elif r.status_code == 403:
            raise PermissionError('HTTP status 403 Forbidden.')
        elif r.status_code == 404:
            data = r.json()
            raise ObjectNotFoundException(data['error'])
        elif r.status_code == 429:
            raise ConnectionRefusedError('HTTP status 429 Too Many Requests.')
        elif r.status_code in [500, 504]:
            data = r.json()
            raise Exception(data['message'])

        return data

Notice the get_data() method that uses run_until_complete(), which essentially makes the method behave synchronously.

I'm concerned about potentially introducing a poor design into the project, but I believe the SemanticScholar and PaginatedResults classes could follow the same approach. This approach seems preferable to duplicating everything... I'm not entirely sure, but it's my bet! 🤷‍♂️

There's one more thing I've been pondering. Shouldn't paginated results always be sequential? When using asynchronous methods and iterating over all results, can we ensure they'll be gathered in the correct order? I'll need to take a closer look into this.

@gabriel-trigo
Copy link
Contributor Author

gabriel-trigo commented Sep 11, 2023

Hi @danielnsilva, these are very good points. In this approach that you suggested, would the SemanticScholar methods still be duplicated? I had difficulty not duplicating the SemanticScholar methods because even if I could get the methods to behave synchronously, I still had to declare them as async to support the async functionality. This meant that even if users wanted to use the method synchronously, they still had to put the keyword await before using the method.

In this suggestion, would the SemanticScholar methods be declared as Async?

await method(async=False) (Users would need to do something like this).

Regarding PaginatedResults, the Async implementation actually behaves almost identically as the synchronous one. The only difference is that the constructor and next_page() methods are non-blocking. But the whole idea of iterating through pages sequentially is all kept the same!

I'm more than happy to work on the changes you deem necessary based on this discussion!

Best

@danielnsilva
Copy link
Owner

Hi @danielnsilva, these are very good points. In this approach that you suggested, would the SemanticScholar methods still be duplicated? I had difficulty not duplicating the SemanticScholar methods because even if I could get the methods to behave synchronously, I still had to declare them as async to support the async functionality. This meant that even if users wanted to use the method synchronously, they still had to put the keyword await before using the method.

In this suggestion, would the SemanticScholar methods be declared as Async?

await method(async=False) (Users would need to do something like this).

Let me clarify this. I understand we might not be able to completely get rid of code duplication. But I think if we can focus most of the application logic in the async version, that'd be a step in the right direction. This approach could be better than having the same logic in both sync and async methods.

This means we'll always have two methods, sync and async. But the sync version will be just an async method called synchronously.

I'm not sure if using a sync/async parameter would help achieve our goal.

Here's a quick overview:

  1. A BaseRequester class was created to consolidate shared code.
  2. An AsyncApiRequester class was introduced to embody the async version of ApiRequester.
  3. In this class, the get_data() method is asynchronous and contains the application logic.
  4. In the original ApiRequester class, there's an attribute representing an instance of the asynchronous class version.
  5. Using this attribute, it's feasible to craft synchronous versions of the methods by invoking them synchronously.

Building on the idea, within SemanticScholar, we can structure it to have both BaseSemanticScholar and AsyncSemanticScholar.

In AsyncSemanticScholar, methods like get_whatever() would be asynchronous and house the main logic. Meanwhile, in the original class, sync methods would exist mainly to invoke their respective asynchronous versions.

For clarity, here's how it'd look:

# For synchronous calls
sch.get_paper()

# For asynchronous calls
await async_sch.get_paper()

This approach aligns closely with what you had proposed, with the primary advantage being that the synchronous version doesn't replicate all the method's internal code—it simply calls it synchronously.

Regarding PaginatedResults, the Async implementation actually behaves almost identically as the synchronous one. The only difference is that the constructor and next_page() methods are non-blocking. But the whole idea of iterating through pages sequentially is all kept the same!

That's fine. My concern was related to iterating over the results, and if there's a possibility that the next page could be invoked even before the current page request completes. This could lead to potential out-of-order results. This is just a hypothetical scenario that crossed my mind, and I haven't looked into it in depth.

@gabriel-trigo
Copy link
Contributor Author

gabriel-trigo commented Sep 13, 2023

Hi @danielnsilva, I thought a lot about this and I think I found a good approach following the intuition you provided:

  • There will be only one requester class, Requester, which only does Async Requests
  • There is an AsyncSemanticScholar class which provides async versions of all the usual methods.
  • There is a SemanticScholar class which provides synchronous versions of all the usual methods. Instead of duplicating all the code from AsyncSemanticScholar, it simpy instantiates an object of AsyncSemanticScholar on the constructor, and then calls the async methods synchronously using loop.run_unti_complete(async_method())
  • There will be only an PaginatedResults class, which offers an asynchronous constructor create and an async async_next_page method. In SemanticScholar, the constructor is ran using run_until_complete() to behave synchronously. The iteration still occurs synchronously to make sure the order makes sense.

I pushed a commit with these modifications, how does it sound to you?

@gabriel-trigo
Copy link
Contributor Author

@danielnsilva what do you think about the solution?

semanticscholar/ApiRequester.py Outdated Show resolved Hide resolved
semanticscholar/ApiRequester.py Outdated Show resolved Hide resolved
semanticscholar/ApiRequester.py Show resolved Hide resolved
semanticscholar/SemanticScholar.py Outdated Show resolved Hide resolved
@danielnsilva
Copy link
Owner

danielnsilva commented Oct 1, 2023

I'm back! 😄 I think you did a great job @gabriel-trigo. I've made some remarks, but I believe we have a good version to add the async option to the project.

@gabriel-trigo
Copy link
Contributor Author

Hi @danielnsilva, thank you for taking the time to review this. I'll review the code based on the points you mentioned, and I'll update the pull request by this week then!

…od signatures, kept sync get_data method and added deprecation warning, kept ApiRequester Class name)
@gabriel-trigo
Copy link
Contributor Author

Hi @danielnsilva, thank you for the suggestions once again. I have implemented all of them and updated this PR 👍

@danielnsilva danielnsilva merged commit 4c930db into danielnsilva:master Nov 17, 2023
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants