
Asynchronous methods for fetching URLs, parsing HTML, and exporting data #32

Closed
8 tasks
tarasivashchuk opened this issue Oct 12, 2020 · 10 comments

Comments

@tarasivashchuk

tarasivashchuk commented Oct 12, 2020

Introduction

I was looking over the code for this project and am impressed with its simplicity of design and brilliant approach to this problem. However, one thing that jumped out at me was the lack of asynchronous methods, which would allow for a huge speed increase, especially as the number of pages to scrape grows. I am quite familiar with the standard libraries used to meet this goal and propose the following changes:

Let me know your thoughts and if you're interested in the idea. The performance gains would be immense! Thanks!


Technical changes and additions proposal

  • 1. Subclass AutoScraper with AsyncAutoScraper, which would require the packages aiohttp, aiofiles, and aiosql, along with a few purely optional ones for extra speed: uvloop, brotlipy, cchardet, and aiodns

  • 2. Refactor the _get_soup method by extracting an async method to download HTML asynchronously using aiohttp (see the sketch after this list)

  • 3. Refactor the get_results* and _build* functions to also be async (simply adding the async keyword), making sure to call them via a multiprocessing/threading pool

    • a. The get_* functions should handle the calling of these in an executor set to the aforementioned pool
    • b. Pools are created using concurrent.futures.*
    • c. Inner-method logic should remain untouched since parsing is a CPU-bound task
  • 4. Use aiofiles in the save method so that many individual JSON files can be exported quickly if desired; likewise for the load method when multiple sources are used

  • 5. Add functionality for exporting to an SQL database asynchronously using aiosql
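
To make items 1-3 concrete, here is a rough sketch of the direction I mean. The names AsyncAutoScraper, _fetch_html, and get_result_similar_async are illustrative, not existing code:

import asyncio
import functools
from concurrent.futures import ThreadPoolExecutor

import aiohttp

from autoscraper import AutoScraper


class AsyncAutoScraper(AutoScraper):
    # Hypothetical subclass sketching items 1-3 of the proposal.

    async def _fetch_html(self, session, url):
        # Item 2: download the HTML asynchronously with aiohttp.
        async with session.get(url) as response:
            return await response.text()

    async def get_result_similar_async(self, url, session, executor):
        html = await self._fetch_html(session, url)
        loop = asyncio.get_running_loop()
        # Item 3: parsing is CPU-bound, so run the untouched sync logic
        # in a concurrent.futures pool rather than on the event loop.
        # (A ThreadPoolExecutor avoids pickling concerns; a
        # ProcessPoolExecutor would need a picklable scraper instance.)
        parse = functools.partial(self.get_result_similar, html=html)
        return await loop.run_in_executor(executor, parse)


async def scrape_all(scraper, urls):
    # Fan out over many URLs with one shared session and one pool.
    with ThreadPoolExecutor() as executor:
        async with aiohttp.ClientSession() as session:
            tasks = [scraper.get_result_similar_async(u, session, executor)
                     for u in urls]
            return await asyncio.gather(*tasks)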


References

@alirezamika

@alirezamika
Owner

alirezamika commented Oct 13, 2020

Hey, thanks for the suggestion. I like the idea of having async support.

Here are my thoughts:
The interface should be identical to the sync one. I guess the bottleneck is the GET request, which should be async.
About the SQL functionality, I think we should keep the scraper focused on scraping: keep it simple but flexible enough for developers to use in any scenario. Just use the scraper to get the data; after that, it is out of the scope of the scraper. You can save it to a SQL db, text file, CSV, etc., or do whatever you want in the best possible way for your needs.
This library is built with this idea in mind; even now you can use any async/threading method to get the HTML contents and then pass them to the scraper. It will add less than 10 lines of code.

To demonstrate the flexibility, consider this example:

scraper.get_result(url)

For async usage, you can just do:

html = await fetch(url)
scraper.get_result(url=url, html=html)
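
Here, fetch can be any coroutine that returns the page HTML; a minimal aiohttp helper (not part of the library) might look like:

import aiohttp


async def fetch(url):
    # Minimal helper assumed by the snippet above; not part of AutoScraper.
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()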

So I'm not sure whether we should implement it.

Let me know what you think. Thanks.

@tarasivashchuk
Author

tarasivashchuk commented Oct 13, 2020

OK, I actually agree with keeping a narrow focus on scraping for this package. In that case, your example works well, although a bit more work is needed to set up the aiohttp.ClientSession and related plumbing. Are you familiar with the aiohttp library? I could simply inherit from AutoScraper, refactor the functionality related to fetching requests, and send you a pull request.

@alirezamika
Owner

Yeah great.
Thanks.

@tarasivashchuk
Author

OK, so I forked this project and was going through the AutoScraper class, but it seems the class is built for handling one URL at a time rather than passing multiple URLs to a get_result method. In this case, the aiohttp.ClientSession would need to be created outside the class and optionally passed to a get_result method, in which case it would use the asynchronous version of the _get_soup method. What do you think about this?
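
Roughly what I have in mind; the get_result_async name and the session parameter are hypothetical:

import aiohttp

from autoscraper import AutoScraper


class AsyncAutoScraper(AutoScraper):
    async def get_result_async(self, url, session, **kwargs):
        # The caller creates and owns the aiohttp.ClientSession, so
        # many URLs can share one connection pool.
        async with session.get(url) as response:
            html = await response.text()
        # Hand the fetched HTML to the existing synchronous path.
        return self.get_result(url=url, html=html, **kwargs)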

@alirezamika
Owner

We can pass the session. Or another approach is to implement the __aenter__ and __aexit__ methods (create and close the session inside the class) and use it like this:

async with AutoScraper() as scraper:
    ...
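
A sketch of those methods (shown on a subclass here just to keep the snippet self-contained; the _session attribute name is illustrative):

import aiohttp

from autoscraper import AutoScraper


class AutoScraperWithSession(AutoScraper):
    async def __aenter__(self):
        # Create one shared session for the lifetime of the block.
        self._session = aiohttp.ClientSession()
        return self

    async def __aexit__(self, exc_type, exc, tb):
        # Close the session even if the block raised an exception.
        await self._session.close()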

Let me know what you think.
Thanks.

@tarasivashchuk
Author

I think using __aenter__ and __aexit__ is a great idea. That's the approach I'm going to take.

@tarasivashchuk
Author

Hey @alirezamika - I will be sending you the version with async today.

@mzeidhassan

Thanks @tarasivashchuk and @alirezamika!
Just wondering: is this already implemented in the latest release?
Thanks!

@alirezamika
Owner

@mzeidhassan you can use async requests easily like this:

...
# assumes `scraper` is an AutoScraper with a model already built,
# and that this code runs inside an async function
async with aiohttp.ClientSession() as session:
    async with session.get(url) as response:
        html = await response.text()
        result = scraper.get_result_similar(html=html)
        ...

@mzeidhassan

Thanks a million, @alirezamika! I appreciate it. I will let you know if I have any further questions.
