Extracting with include_images option causes subsequent extractions to fail #51

TommiNieminen · 2021-01-05T11:30:43Z

There is a bug in extraction with the include_images option: it works only once, and subsequent extractions then fail. The bug is in htmlprocessing.py, lines 46 to 50:

  if include_images is True:
      # Many websites have <img> inside <figure> or <picture> or <source> tag
      for element in ['figure', 'picture', 'source']:
          MANUALLY_CLEANED.remove(element)
      MANUALLY_STRIPPED.remove('img')

This code runs on every extraction, but since the tags have already been removed from the lists during the first extraction, remove() raises a ValueError on every later extraction. I tested that adding existence checks solves the issue:

  if include_images is True:
      # Many websites have <img> inside <figure> or <picture> or <source> tag
      for element in ['figure', 'picture', 'source']:
          if element in MANUALLY_CLEANED:
              MANUALLY_CLEANED.remove(element)
      if 'img' in MANUALLY_STRIPPED:
          MANUALLY_STRIPPED.remove('img')
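
For reference, a minimal sketch of the failure mode outside the library. This is not the actual htmlprocessing.py code: the list contents and the function names `configure_unguarded` and `second_call_fails` are made up for illustration, and only the two list names come from the snippet above.

```python
# Hypothetical stand-ins for the module-level lists; contents are illustrative only.
MANUALLY_CLEANED = ['aside', 'figure', 'picture', 'source']
MANUALLY_STRIPPED = ['abbr', 'img']


def configure_unguarded(include_images):
    """Mimic the reported pattern: mutate the shared lists on every call."""
    if include_images is True:
        for element in ['figure', 'picture', 'source']:
            MANUALLY_CLEANED.remove(element)  # raises ValueError if element is absent
        MANUALLY_STRIPPED.remove('img')


def second_call_fails():
    """Return True if calling the unguarded version twice raises ValueError."""
    configure_unguarded(True)  # first call: all items are present, so it succeeds
    try:
        configure_unguarded(True)  # second call: the items were removed by the first
    except ValueError:
        return True
    return False
```

The existence checks avoid the exception, but the shared lists still stay modified after the first call, so a later extraction without include_images would presumably see the reduced lists as well.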

adbar · 2021-01-05T18:54:26Z

Thanks, the way the lists were used was inconsistent.
It should be fixed in 8a5123f. Can you confirm by running the latest version from the repository?

TommiNieminen · 2021-01-06T21:43:46Z

Thanks, the cleaning_list removal works now, but the stripping_list removal will still cause a ValueError with repeated extractions, since the code tries to remove 'img' that was already removed during the first extraction.

        cleaning_list = [e for e in cleaning_list if e
                         not in ('figure', 'picture', 'source')]
        stripping_list.remove('img')

adbar · 2021-01-07T14:02:43Z

I got it wrong: without copy() the lists were linked to the same object and thus identical. It should be fixed now (33cd96b).
Can you please confirm?
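
A minimal sketch of the aliasing issue described above, under assumptions: the list contents and the function name `stripping_tags` are hypothetical, and this is not the code from 33cd96b. It only illustrates why a per-call copy() keeps the shared default intact.

```python
# Hypothetical module-level default; contents are illustrative only.
MANUALLY_STRIPPED = ['abbr', 'img']


def stripping_tags(include_images):
    """Return the tags to strip for one extraction without touching the default."""
    stripping_list = MANUALLY_STRIPPED.copy()  # new list object for this call
    if include_images is True:
        stripping_list.remove('img')  # only the per-call copy is modified
    return stripping_list


first = stripping_tags(True)
second = stripping_tags(True)  # no ValueError: the default still contains 'img'

# Plain assignment does not copy: both names refer to the same list object,
# so removing from `alias` would also change MANUALLY_STRIPPED.
alias = MANUALLY_STRIPPED
```

Here list.copy() is a shallow copy, which is enough for a flat list of tag names.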

TommiNieminen · 2021-01-07T14:45:06Z

Seems to work now, thanks.

adbar added the "bug" label Jan 5, 2021

adbar closed this as completed Jan 7, 2021

adbar mentioned this issue Jan 8, 2021: No Formatting in Plain Text Output #48 (Closed)

carschno mentioned this issue Apr 12, 2022: include_images changes text extraction #194 (Open)
