otto next page extraction does not work as expected for some pages #120

Open · BigDatalex opened this issue Jan 25, 2023 · 5 comments
Labels: bug (Something isn't working)

Comments

@BigDatalex (Collaborator)

I just noticed in the otto log files that the next page extraction is not working as expected for some pages, and the following error shows up:

2023-01-21 04:38:57 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.otto.de/schuhe/halbschuhe/?nachhaltigkeit=alle-nachhaltigen-artikel&zielgruppe=herren&l=gq&o=120 via http://splash:8050/execute> (referer: None)
[...]
  File "/tmp/scraping-1674237617-grym39ye.egg/scraping/spiders/otto_de.py", line 65, in parse_SERP
    if int(pagination_info["o"]) > response.meta.get("o", 0):
ValueError: invalid literal for int() with base 10: ''

This error shows up 18 times in the log file. In #115 we updated the next page extraction in order to scrape products without filtering for sustainable products only. This change could be the cause, but it needs some inspection.
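A minimal sketch of a defensive guard, assuming the offset is read from the next-page URL's query string (parse_offset and next_page_url are hypothetical names, not the actual implementation in otto_de.py):

from urllib.parse import parse_qs, urlparse

def parse_offset(url: str, default: int = 0) -> int:
    # Return the "o" (offset) query parameter of a SERP URL, falling back
    # to `default` when the parameter is missing or empty -- an empty
    # string is exactly what raises the ValueError above.
    value = parse_qs(urlparse(url).query).get("o", [""])[0]
    return int(value) if value.isdigit() else default

# The guard in parse_SERP (line 65 of the traceback) could then become:
# if parse_offset(next_page_url) > response.meta.get("o", 0):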

@itrajanovska (Collaborator)

Ah, I see, but if this is an issue in the scraping process, why would it affect the extraction step?
But maybe I'm missing something; I'll investigate it as well so I get a better understanding.

@BigDatalex (Collaborator, Author)

In addition, the otto job is still running (4 days, 19:16:53) and the number of products has increased more than I would have expected (from about 43k scraped on 13.01.2023 to more than 58k on 20.01.2023).

This increase is probably related to the 4 additional electronics categories we added in #115:

"laptop": ProductCategory.LAPTOP.value,
"tablet": ProductCategory.TABLET.value,
"audio/kopfhoerer": ProductCategory.HEADPHONES.value,
"fernseher": ProductCategory.TV.value,

However, since we do not yield additional requests in the extractor to retrieve the sustainability information, we can probably decrease the otto DOWNLOAD_DELAY to speed up the scraping, for example by adding a custom setting as we do for zalando:

custom_settings = {"DOWNLOAD_DELAY": 2}

The default delay is 5 seconds; maybe 4 seconds is already enough, but we could also try 3 seconds directly.
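A minimal sketch of where such an override would live, assuming a standard Scrapy spider class (the class name and the 3-second value are placeholders, not the actual otto_de.py code):

import scrapy

class OttoSpider(scrapy.Spider):
    # Hypothetical class name; the real spider lives in scraping/spiders/otto_de.py.
    name = "otto_de"
    # Per-spider override of the 5-second default; since no extra requests
    # are yielded for sustainability information, a lower delay should be safe.
    custom_settings = {"DOWNLOAD_DELAY": 3}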

@BigDatalex (Collaborator, Author)

> Ah, I see, but if this is an issue in the scraping process, why would it affect the extraction step? But maybe I'm missing something; I'll investigate it as well so I get a better understanding.

It is not affecting the extraction step, at least not in the first place... The log file is from scrapyd, and we extract the next pages during the scraping process.

@itrajanovska (Collaborator)

Ah sorry, I didn't notice this was another issue.

@itrajanovska (Collaborator) commented Jan 25, 2023

> In addition, the otto job is still running (4 days, 19:16:53) and the number of products has increased more than I would have expected (from about 43k scraped on 13.01.2023 to more than 58k on 20.01.2023).
>
> This increase is probably related to the 4 additional electronics categories we added in #115:
>
> "laptop": ProductCategory.LAPTOP.value,
> "tablet": ProductCategory.TABLET.value,
> "audio/kopfhoerer": ProductCategory.HEADPHONES.value,
> "fernseher": ProductCategory.TV.value,
>
> However, since we do not yield additional requests in the extractor to retrieve the sustainability information, we can probably decrease the otto DOWNLOAD_DELAY to speed up the scraping, for example by adding a custom setting as we do for zalando:
>
> custom_settings = {"DOWNLOAD_DELAY": 2}
>
> The default delay is 5 seconds; maybe 4 seconds is already enough, but we could also try 3 seconds directly.

Right, and maybe for now we can skip the headphones and TVs as well, as I didn't expect they would have that big of an impact.

Update 17.02.2023
To tackle these comments, we created a new issue: #126
