There's a bug with extraction using the include_images option: it only works once, and then every later extraction fails. The bug is in htmlprocessing.py, lines 46 to 50:
if include_images is True:
    # Many websites have <img> inside <figure> or <picture> or <source> tag
    for element in ['figure', 'picture', 'source']:
        MANUALLY_CLEANED.remove(element)
    MANUALLY_STRIPPED.remove('img')
This bit of code runs on every extraction, but the tags are already removed from the shared lists during the first one, so list.remove() raises a ValueError on every later extraction. I tested that adding existence checks solves the issue:
if include_images is True:
    # Many websites have <img> inside <figure> or <picture> or <source> tag
    for element in ['figure', 'picture', 'source']:
        if element in MANUALLY_CLEANED:
            MANUALLY_CLEANED.remove(element)
    if 'img' in MANUALLY_STRIPPED:
        MANUALLY_STRIPPED.remove('img')
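To make the failure mode concrete, here is a minimal standalone sketch that does not import the library. The list contents and the function names `remove_unguarded` and `remove_guarded` are made up for illustration; only the removal logic mirrors the snippets above.

```python
TAGS = ['figure', 'picture', 'source']


def remove_unguarded(cleaned, stripped):
    # Mirrors the original code: list.remove() raises ValueError
    # when the element is no longer in the list.
    for element in TAGS:
        cleaned.remove(element)
    stripped.remove('img')


def remove_guarded(cleaned, stripped):
    # Mirrors the proposed fix: only remove what is still present.
    for element in TAGS:
        if element in cleaned:
            cleaned.remove(element)
    if 'img' in stripped:
        stripped.remove('img')


# Hypothetical stand-ins for the shared lists (real contents differ).
cleaned = ['aside', 'figure', 'picture', 'source']
stripped = ['abbr', 'img']

remove_unguarded(cleaned, stripped)      # first "extraction" succeeds
try:
    remove_unguarded(cleaned, stripped)  # second one hits the emptied lists
    second_call_failed = False
except ValueError:
    second_call_failed = True

cleaned_ok = ['aside', 'figure', 'picture', 'source']
stripped_ok = ['abbr', 'img']
remove_guarded(cleaned_ok, stripped_ok)
remove_guarded(cleaned_ok, stripped_ok)  # repeating the call is harmless
```

The guarded version is idempotent, which is what matters here, because the lists outlive a single extraction call.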
Thanks, the cleaning_list removal works now, but the stripping_list removal will still cause a ValueError on repeated extractions, since the code tries to remove an 'img' entry that was already removed during the first extraction:
cleaning_list = [e for e in cleaning_list
                 if e not in ('figure', 'picture', 'source')]
stripping_list.remove('img')
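One possible way to close the remaining gap is to treat stripping_list the same way as cleaning_list and build a filtered copy instead of calling remove(). This is only a sketch under assumptions: the helper name `filter_tag_lists` and the sample list contents are hypothetical and are not taken from the library.

```python
def filter_tag_lists(cleaning_list, stripping_list, include_images):
    """Return per-call copies of both lists, leaving the inputs untouched."""
    if include_images:
        # Filtering never raises, whether or not the tags are present.
        cleaning_list = [e for e in cleaning_list
                         if e not in ('figure', 'picture', 'source')]
        stripping_list = [e for e in stripping_list if e != 'img']
    return list(cleaning_list), list(stripping_list)


# Hypothetical shared lists (real contents differ).
shared_cleaning = ['aside', 'figure', 'picture', 'source']
shared_stripping = ['abbr', 'img']

first = filter_tag_lists(shared_cleaning, shared_stripping, True)
second = filter_tag_lists(shared_cleaning, shared_stripping, True)
```

Because nothing shared is mutated, a later extraction without include_images would still see the full lists.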