Asynchronous methods for fetching URLs, parsing HTML, and exporting data #32
Comments
Hey, thanks for the suggestion. I like the idea of having async support. Here are my thoughts: to demonstrate the flexibility, consider this example:

```python
scraper.get_result(url)
```

For async usage, you can just do:

```python
html = await fetch(url)
scraper.get_result(url=url, html=html)
```

So I'm not sure whether we should implement it. Let me know what you think. Thanks.
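A self-contained sketch of that pattern; the `fetch` helper and the model file name are illustrative assumptions, not part of the package:

```python
import asyncio

import aiohttp
from autoscraper import AutoScraper

async def fetch(url):
    # The fetch() helper from the comment above, sketched with aiohttp:
    # download the page ourselves instead of letting AutoScraper do it.
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()

async def main():
    scraper = AutoScraper()
    scraper.load('model-file')  # illustrative: a previously saved model
    url = 'https://example.com/page'
    html = await fetch(url)
    # The scraper itself stays synchronous; it only parses pre-fetched HTML.
    result = scraper.get_result(url=url, html=html)
    print(result)

asyncio.run(main())
```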
Ok, I actually agree with keeping a narrow focus on scraping for this package. In that case, your example is great, although there needs to be a bit more work to set up the session.
Yeah, great.
Ok, so I forked this project and was going through the code.
We can pass the session. Or another approach is to implement enter and exit methods (create and close the session inside the class) and use it like this:

```python
async with AutoScraper() as scraper:
    ...
```

Let me know what you think.
I think using `__aenter__` and `__aexit__` is a great idea. That's the approach I'm gonna take.
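A minimal sketch of what that context-manager approach might look like; the `AsyncAutoScraper` subclass, its `_session` attribute, and the `fetch` helper are all assumptions for illustration, not the package's actual API:

```python
import aiohttp
from autoscraper import AutoScraper

class AsyncAutoScraper(AutoScraper):
    """Hypothetical subclass sketching the __aenter__/__aexit__ idea."""

    async def __aenter__(self):
        # Create the shared session when entering the context.
        self._session = aiohttp.ClientSession()
        return self

    async def __aexit__(self, exc_type, exc, tb):
        # Close the session on exit, even if an error occurred.
        await self._session.close()

    async def fetch(self, url):
        # Assumed helper: download HTML through the shared session.
        async with self._session.get(url) as response:
            return await response.text()

# Usage, matching the example above:
#
#     async with AsyncAutoScraper() as scraper:
#         html = await scraper.fetch(url)
#         result = scraper.get_result_similar(html=html)
```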
Hey - I will be sending you the version with async today.

Thanks @tarasivashchuk and @alirezamika!
@mzeidhassan you can use async requests easily like this:

```python
...
async with aiohttp.ClientSession() as session:
    async with session.get(url) as response:
        html = await response.text()
result = scraper.get_result_similar(html=html)
...
```
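Where this pattern really pays off is across many pages. A sketch of fetching several URLs concurrently with `asyncio.gather` (the URLs and model file name are placeholders):

```python
import asyncio

import aiohttp
from autoscraper import AutoScraper

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def main():
    scraper = AutoScraper()
    scraper.load('model-file')  # placeholder: a previously saved model
    urls = ['https://example.com/1', 'https://example.com/2']
    async with aiohttp.ClientSession() as session:
        # All downloads run concurrently instead of one after another.
        pages = await asyncio.gather(*(fetch(session, u) for u in urls))
    results = [scraper.get_result_similar(html=html) for html in pages]
    print(results)

asyncio.run(main())
```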
Thanks a million, @alirezamika! I appreciate it. I will let you know if I have any further questions.
Introduction
I was looking over the code for this project and am impressed with its simple design and brilliant approach to the problem. However, one thing that jumped out at me was the lack of asynchronous methods, which would allow for a huge speed increase, especially as the number of pages to scrape grows. I am quite familiar with the standard libraries used to meet this goal and propose the changes below.

Let me know your thoughts and whether you're interested in the idea. The performance gains would be immense! Thanks!
Technical changes and additions proposal
1. Subclass `AutoScraper` with `AsyncAutoScraper`, which would require the packages `aiohttp`, `aiofiles`, and `aiosql`, along with a few purely optional ones to increase speed: `uvloop`, `brotlipy`, `cchardet`, and `aiodns`.
2. Refactor the `_get_soup` method by extracting an `async` method that downloads HTML asynchronously using `aiohttp`.
3. Refactor the `get_result*` and `_build*` functions to also be `async` (simply adding the keyword), and make sure to call them via a multiprocessing/threading pool; the `get_*` functions should handle calling these in an executor backed by the aforementioned `concurrent.futures.*` pool (see the sketch after this list).
4. Use `aiofiles` in the `save` method to be able to export many individual JSON files quickly if desired, and the same for the `load` method if multiple sources are being used (also sketched below).
5. Add functionality for exporting to an SQL database asynchronously using `aiosql`.
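To make items 2 and 3 concrete, here is a rough sketch; the `scrape_many` and `scrape_one` helpers are hypothetical. Pages are downloaded concurrently with `aiohttp`, while the existing synchronous parsing is handed to a `concurrent.futures` pool via `run_in_executor` so it never blocks the event loop:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

import aiohttp
from autoscraper import AutoScraper

async def scrape_many(scraper: AutoScraper, urls):
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor() as pool:
        async with aiohttp.ClientSession() as session:

            async def scrape_one(url):
                async with session.get(url) as response:
                    html = await response.text()
                # Hand the blocking parsing work to the pool so the event
                # loop stays free to keep downloading the remaining pages.
                return await loop.run_in_executor(
                    pool, lambda: scraper.get_result_similar(html=html)
                )

            return await asyncio.gather(*(scrape_one(u) for u in urls))
```

Likewise for item 4, a sketch of an async save path with `aiofiles`; the `save_async` name is made up, and it assumes the model serializes to JSON as the current `save` method does. An `aiosql` export for item 5 would follow the same shape with a query file and an async database driver:

```python
import json

import aiofiles

async def save_async(path, stack_list):
    # aiofiles only replaces the blocking file write; the JSON
    # serialization itself is unchanged.
    data = json.dumps({'stack_list': stack_list})
    async with aiofiles.open(path, 'w') as f:
        await f.write(data)
```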
References

- aiohttp
- aiofiles
- aiosql
@alirezamika