otto next page extraction does not work as expected for some pages #120

Open · BigDatalex opened this issue Jan 25, 2023 · 5 comments
Labels: bug (Something isn't working)

Comments

@BigDatalex (Collaborator)

I just noticed in the otto log files that the next page extraction is not working as expected for some pages, and the following error shows up:

2023-01-21 04:38:57 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.otto.de/schuhe/halbschuhe/?nachhaltigkeit=alle-nachhaltigen-artikel&zielgruppe=herren&l=gq&o=120 via http://splash:8050/execute> (referer: None)
[...]
  File "/tmp/scraping-1674237617-grym39ye.egg/scraping/spiders/otto_de.py", line 65, in parse_SERP
    if int(pagination_info["o"]) > response.meta.get("o", 0):
ValueError: invalid literal for int() with base 10: ''

This error shows up 18 times in the log file. In #115 we updated the next page extraction in order to scrape products without filtering for sustainable products only. This change could be the cause, but it needs some inspection.
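A minimal sketch of a defensive guard, assuming the offset is read from the next-page URL's query string (parse_offset and next_page_url are hypothetical names, not the actual implementation in otto_de.py):

from urllib.parse import parse_qs, urlparse

def parse_offset(url: str, default: int = 0) -> int:
    # Return the "o" (offset) query parameter of a SERP URL, falling back
    # to `default` when the parameter is missing or empty -- an empty
    # string is exactly what raises the ValueError above.
    value = parse_qs(urlparse(url).query).get("o", [""])[0]
    return int(value) if value.isdigit() else default

# The guard in parse_SERP (line 65 of the traceback) could then become:
# if parse_offset(next_page_url) > response.meta.get("o", 0):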

@itrajanovska (Collaborator)

Ah, I see, but if this is an issue in the scraping process, why would it affect the extraction step?
But maybe I'm missing something; I'll investigate it as well so I get a better understanding.

@BigDatalex (Collaborator, Author)

In addition, the otto job is still running (4 days, 19:16:53) and the number of products has increased more than I would have expected (from about 43k scraped on 13.01.2023 to more than 58k on 20.01.2023).

This increase is probably related to the 4 additional electronics categories we added in #115:

"laptop": ProductCategory.LAPTOP.value,
"tablet": ProductCategory.TABLET.value,
"audio/kopfhoerer": ProductCategory.HEADPHONES.value,
"fernseher": ProductCategory.TV.value,

However, since we do not yield additional requests in the extractor to retrieve the sustainability information, we can probably decrease the otto DOWNLOAD_DELAY to speed up the scraping, for example by adding a custom setting as we do for zalando:

custom_settings = {"DOWNLOAD_DELAY": 2}

The default delay is 5 seconds; maybe 4 seconds is already enough, but we could also try 3 seconds directly.
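A minimal sketch of where such an override would live, assuming a standard Scrapy spider class (the class name and the 3-second value are placeholders, not the actual otto_de.py code):

import scrapy

class OttoSpider(scrapy.Spider):
    # Hypothetical class name; the real spider lives in scraping/spiders/otto_de.py.
    name = "otto_de"
    # Per-spider override of the 5-second default; since no extra requests
    # are yielded for sustainability information, a lower delay should be safe.
    custom_settings = {"DOWNLOAD_DELAY": 3}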

@BigDatalex (Collaborator, Author)

> Ah, I see, but if this is an issue in the scraping process, why would it affect the extraction step? But maybe I'm missing something; I'll investigate it as well so I get a better understanding.

It is not affecting the extraction step, at least not in the first place... The log file is from scrapyd, and we extract the next pages during the scraping process.

@itrajanovska (Collaborator)

Ah sorry, I didn't notice this was another issue.

@itrajanovska (Collaborator) commented Jan 25, 2023

> In addition, the otto job is still running (4 days, 19:16:53) and the number of products has increased more than I would have expected (from about 43k scraped on 13.01.2023 to more than 58k on 20.01.2023).
>
> This increase is probably related to the 4 additional electronics categories we added in #115:
>
> "laptop": ProductCategory.LAPTOP.value,
> "tablet": ProductCategory.TABLET.value,
> "audio/kopfhoerer": ProductCategory.HEADPHONES.value,
> "fernseher": ProductCategory.TV.value,
>
> However, since we do not yield additional requests in the extractor to retrieve the sustainability information, we can probably decrease the otto DOWNLOAD_DELAY to speed up the scraping, for example by adding a custom setting as we do for zalando:
>
> custom_settings = {"DOWNLOAD_DELAY": 2}
>
> The default delay is 5 seconds; maybe 4 seconds is already enough, but we could also try 3 seconds directly.

Right, and maybe for now we can skip the headphones and TVs as well, as I didn't expect they would have that big of an impact.

Update 17.02.2023
To tackle these comments, we created a new issue: #126
