Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extracting with include_images option causes subsequent extractions to fail #51

Closed
TommiNieminen opened this issue Jan 5, 2021 · 4 comments
Labels
bug Something isn't working

Comments

@TommiNieminen
Copy link

There's a bug with extraction using include_images option, it only works once and then extractions start to fail. The bug is in htmlprocessing.py, lines 46 to 50:

  if include_images is True:
      # Many websites have <img> inside <figure> or <picture> or <source> tag
      for element in ['figure', 'picture', 'source']:
          MANUALLY_CLEANED.remove(element)
      MANUALLY_STRIPPED.remove('img')

This bit of code will be repeated for all extractions, but since the tags have been removed on the first extraction, a ValueError will be thrown for all other extractions. I tested that adding existence checks solves the issue:

if include_images is True:
        # Many websites have <img> inside <figure> or <picture> or <source> tag
        for element in ['figure', 'picture', 'source']:
            if element in MANUALLY_CLEANED:
                MANUALLY_CLEANED.remove(element)
        if 'img' in MANUALLY_STRIPPED:
            MANUALLY_STRIPPED.remove('img')
@adbar
Copy link
Owner

adbar commented Jan 5, 2021

Thanks, the way the lists were used was inconsistent.
It should be fixed in 8a5123f, can you confirm by running the last version from the repository?

@adbar adbar added the bug Something isn't working label Jan 5, 2021
@TommiNieminen
Copy link
Author

Thanks, the cleaning_list removal works now, but the stripping_list removal will still cause a ValueError with repeated extractions, since the code tries to remove 'img' that was already removed during the first extraction.

        cleaning_list = [e for e in cleaning_list if e
                         not in ('figure', 'picture', 'source')]
        stripping_list.remove('img')

@adbar
Copy link
Owner

adbar commented Jan 7, 2021

I got it wrong: without copy() the lists were linked to the same object and thus identical. It should be fixed now (33cd96b).
Can you please confirm?

@TommiNieminen
Copy link
Author

Seems to work now, thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants