Zalando extractor fails to extract sustainability labels #74

en-GB · 2022-06-13T13:57:10Z

We used to extract labels directly from the rendered HTML.
Since Splash can no longer render Zalando product pages, we now extract them from the JSON embedded in the page:

green-db/extract/extract/extractors/zalando.py

Line 152 in bf77115

    
           beautiful_soup.find("script", {"type": "application/json", "class": "re-1-13"}).get_text()

but occasionally some labels will be missing.
This only affects ~10 products in any given run, and I've only seen it happen on zalando.co.uk.

Switching the Zalando scraper to Playwright would probably fix it, though.

se-jaeger · 2022-07-07T07:13:20Z

With the latest changes from #79, the extractor can't find any sustainability labels, so no products are created.

BigDatalex · 2022-07-07T09:08:01Z

I just updated the Zalando extractor. It was only a minor change, needed because a class name in the HTML changed. See: 53601e8
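For illustration, one way to make this kind of lookup less sensitive to a class rename is to scan every `application/json` script tag and return an empty list when nothing matches, instead of calling `.get_text()` on a `find(...)` result that may be `None`. This is only a stdlib-based sketch, not what 53601e8 does: the extractor itself uses BeautifulSoup, and the `sustainability_labels` key and the flat JSON layout are assumptions, not Zalando's actual schema.

```python
import json
from html.parser import HTMLParser
from typing import List


class JsonScriptCollector(HTMLParser):
    """Collects the text of every <script type="application/json"> element."""

    def __init__(self) -> None:
        super().__init__()
        self._inside = False
        self.payloads: List[str] = []

    def handle_starttag(self, tag, attrs):
        # Match on the type attribute only, so a renamed class does not matter.
        if tag == "script" and dict(attrs).get("type") == "application/json":
            self._inside = True
            self.payloads.append("")

    def handle_data(self, data):
        # Script content may arrive in several chunks, so concatenate.
        if self._inside:
            self.payloads[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False


def find_labels(html: str, key: str = "sustainability_labels") -> List[str]:
    """Return labels from the first JSON script containing `key` (hypothetical key)."""
    collector = JsonScriptCollector()
    collector.feed(html)
    for payload in collector.payloads:
        try:
            data = json.loads(payload)
        except json.JSONDecodeError:
            continue  # not valid JSON, try the next script tag
        if isinstance(data, dict) and key in data:
            return list(data[key])
    return []  # nothing found: the caller decides how to handle it
```

Returning an empty list keeps the failure visible to the caller, which can then log the product URL rather than crash on an `AttributeError`.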

BigDatalex · 2022-07-07T13:12:23Z

There are two commits from @en-GB that might be more robust and improve the extraction of the Zalando sustainability labels, see:

We (@en-GB) should check whether these behave the same as the original approach (i.e. extract the same sustainability labels) or whether there are any implications. So far, in our Zalando tests, they achieve the same results.
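A check like this could be sketched as a small comparison of the labels each approach extracts per product. The function name, the product ids, and the label names below are hypothetical placeholders; the inputs would come from running both extractor versions on the same pages.

```python
from typing import Dict, Set, Tuple


def diff_labels(
    old: Dict[str, Set[str]], new: Dict[str, Set[str]]
) -> Dict[str, Tuple[Set[str], Set[str]]]:
    """Map product id -> (labels only in old, labels only in new) for each mismatch."""
    mismatches = {}
    for product_id in sorted(old.keys() | new.keys()):
        before = old.get(product_id, set())
        after = new.get(product_id, set())
        if before != after:
            mismatches[product_id] = (before - after, after - before)
    return mismatches


# Placeholder data: product "b" loses a label, product "c" only appears in the new run.
old_run = {"a": {"GOTS"}, "b": {"FAIRTRADE"}}
new_run = {"a": {"GOTS"}, "b": set(), "c": {"OEKO"}}
for product_id, (lost, gained) in diff_labels(old_run, new_run).items():
    print(product_id, "lost:", sorted(lost), "gained:", sorted(gained))
```

An empty result would mean both approaches agree on every product in the sample.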

se-jaeger · 2022-08-30T10:43:02Z

@en-GB what's the status of this one? Is this still an issue, or can we just close it? Especially after the latest changes in #83.

en-GB added the low priority label Jun 13, 2022

en-GB changed the title ~~Zalando scraper sometimes misses sustainability labels on certain products~~ Zalando scraper sometimes misses sustainability labels Jun 13, 2022

en-GB changed the title ~~Zalando scraper sometimes misses sustainability labels~~ Zalando extractor sometimes misses sustainability labels Jun 13, 2022

se-jaeger added the bug and high priority labels and removed the low priority label Jul 7, 2022

se-jaeger changed the title ~~Zalando extractor sometimes misses sustainability labels~~ Zalando extractor fails to extract sustainability labels Jul 7, 2022

BigDatalex mentioned this issue Jul 7, 2022

Incorporate source, country, gender & age information and change color & size to arrays #79

Merged

se-jaeger added the enhancement label and removed the bug and high priority labels Jul 7, 2022

en-GB closed this as completed Sep 5, 2022
