
Asynchronous methods for fetching URLs, parsing HTML, and exporting data #32

Closed
8 tasks
tarasivashchuk opened this issue Oct 12, 2020 · 10 comments

Comments

@tarasivashchuk

tarasivashchuk commented Oct 12, 2020

Introduction

I was looking over the code for this project and am impressed with its simplicity of design and brilliant approach to this problem. However, one thing that jumped out at me was the lack of asynchronous methods, which would allow for a huge speed increase, especially as the number of pages to scrape grows. I am quite familiar with the standard libraries used to meet this goal and propose the following changes:

Let me know your thoughts and if you're interested in the idea. The performance gains would be immense! Thanks!


Technical changes and additions proposal

  • 1. Subclass AutoScraper with AsyncAutoScraper, which would require the packages aiohttp, aiofiles, and aiosql, along with a few purely optional ones for extra speed: uvloop, brotlipy, cchardet, and aiodns

  • 2. Refactor the _get_soup method by extracting an async method to download HTML asynchronously using aiohttp (see the sketch after this list)

  • 3. Refactor the get_results* and _build* functions to also be async (simply adding the async keyword), making sure to call them via a multiprocessing/threading pool

    • a. The get_* functions should handle the calling of these in an executor set to the aforementioned pool
    • b. Pools are created using concurrent.futures.*
    • c. Inner-method logic should remain untouched since parsing is a CPU-bound task
  • 4. Use aiofiles in the save method so that many individual JSON files can be exported quickly if desired; likewise for the load method when multiple sources are used

  • 5. Add functionality for exporting to an SQL database asynchronously using aiosql
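
To make items 1-3 concrete, here is a rough sketch of the direction I mean. The names AsyncAutoScraper, _fetch_html, and get_result_similar_async are illustrative, not existing code:

import asyncio
import functools
from concurrent.futures import ThreadPoolExecutor

import aiohttp

from autoscraper import AutoScraper


class AsyncAutoScraper(AutoScraper):
    # Hypothetical subclass sketching items 1-3 of the proposal.

    async def _fetch_html(self, session, url):
        # Item 2: download the HTML asynchronously with aiohttp.
        async with session.get(url) as response:
            return await response.text()

    async def get_result_similar_async(self, url, session, executor):
        html = await self._fetch_html(session, url)
        loop = asyncio.get_running_loop()
        # Item 3: parsing is CPU-bound, so run the untouched sync logic
        # in a concurrent.futures pool rather than on the event loop.
        # (A ThreadPoolExecutor avoids pickling concerns; a
        # ProcessPoolExecutor would need a picklable scraper instance.)
        parse = functools.partial(self.get_result_similar, html=html)
        return await loop.run_in_executor(executor, parse)


async def scrape_all(scraper, urls):
    # Fan out over many URLs with one shared session and one pool.
    with ThreadPoolExecutor() as executor:
        async with aiohttp.ClientSession() as session:
            tasks = [scraper.get_result_similar_async(u, session, executor)
                     for u in urls]
            return await asyncio.gather(*tasks)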


References

@alirezamika

@alirezamika
Owner

alirezamika commented Oct 13, 2020

Hey, thanks for the suggestion. I like the idea of having async support.

Here are my thoughts:
The interface should be identical to the sync one. I guess the bottleneck is the GET request, which should be async.
About the SQL functionality, I think we should keep the scraper focused on scraping: keep it simple but flexible enough for developers to use in any scenario. Just use the scraper to get the data; after that, it is out of the scope of the scraper. You can save it to a SQL db, text file, CSV, etc., or do whatever you want in the best possible way for your needs.
This library is built with this idea in mind; even now you can use any async/threading method to get the HTML contents and then pass them to the scraper. It will add less than 10 lines of code.

To demonstrate the flexibility, consider this example:

scraper.get_result(url)

For async usage, you can just do:

html = await fetch(url)
scraper.get_result(url=url, html=html)
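
Here, fetch can be any coroutine that returns the page HTML; a minimal aiohttp helper (not part of the library) might look like:

import aiohttp


async def fetch(url):
    # Minimal helper assumed by the snippet above; not part of AutoScraper.
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()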

So I'm not sure whether we should implement it.

Let me know what you think. Thanks.

@tarasivashchuk
Author

tarasivashchuk commented Oct 13, 2020

OK, I actually agree with keeping a narrow focus on scraping for this package. In that case, your example works well, although a bit more work is needed to set up the aiohttp.ClientSession and related plumbing. Are you familiar with the aiohttp library? I could simply inherit from AutoScraper, refactor the functionality related to fetching requests, and send you a pull request.

@alirezamika
Owner

Yeah great.
Thanks.

@tarasivashchuk
Author

OK, so I forked this project and was going through the AutoScraper class, but it seems the class is built for handling one URL at a time rather than passing multiple URLs to a get_result method. In this case, the aiohttp.ClientSession would need to be created outside the class and optionally passed to a get_result method, in which case it would use the asynchronous version of the _get_soup method. What do you think about this?
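
Roughly what I have in mind; the get_result_async name and the session parameter are hypothetical:

import aiohttp

from autoscraper import AutoScraper


class AsyncAutoScraper(AutoScraper):
    async def get_result_async(self, url, session, **kwargs):
        # The caller creates and owns the aiohttp.ClientSession, so
        # many URLs can share one connection pool.
        async with session.get(url) as response:
            html = await response.text()
        # Hand the fetched HTML to the existing synchronous path.
        return self.get_result(url=url, html=html, **kwargs)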

@alirezamika
Owner

We can pass the session. Or another approach is to implement the __aenter__ and __aexit__ methods (create and close the session inside the class) and use it like this:

async with AutoScraper() as scraper:
    ...
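
A sketch of those methods (shown on a subclass here just to keep the snippet self-contained; the _session attribute name is illustrative):

import aiohttp

from autoscraper import AutoScraper


class AutoScraperWithSession(AutoScraper):
    async def __aenter__(self):
        # Create one shared session for the lifetime of the block.
        self._session = aiohttp.ClientSession()
        return self

    async def __aexit__(self, exc_type, exc, tb):
        # Close the session even if the block raised an exception.
        await self._session.close()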

Let me know what you think.
Thanks.

@tarasivashchuk
Author

I think using __aenter__ and __aexit__ is a great idea. That's the approach I'm going to take.

@tarasivashchuk
Author

Hey @alirezamika - I will be sending you the version with async today.

@mzeidhassan

Thanks @tarasivashchuk and @alirezamika!
Just wondering: is this already implemented in the latest release?
Thanks!

@alirezamika
Owner

@mzeidhassan you can use async requests easily like this:

...
# assumes `scraper` is an AutoScraper with a model already built,
# and that this code runs inside an async function
async with aiohttp.ClientSession() as session:
    async with session.get(url) as response:
        html = await response.text()
        result = scraper.get_result_similar(html=html)
        ...

@mzeidhassan

Thanks a million, @alirezamika! I appreciate it. I will let you know if I have any further questions.
