I have been using trafilatura to extract text from HTML pages. I have noticed that text placed directly inside an unordered list, between the opening <ul> tag and the first list item, is sometimes not extracted: the list items are extracted, but not the text that follows the opening tag.
<ul>Description of the list:
<li>List item 1</li>
<li>List item 2</li>
<li>List item 3</li>
</ul>
In the previous code example, the extracted text would be:
List item 1
List item 2
List item 3
"Description of the list:" would not appear in the extracted text. This is probably due to non-standard HTML coding practices, but I'm wondering whether Trafilatura can capture that text.
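To illustrate why such text can be missed, here is a minimal sketch using only the Python standard library. It assumes an ElementTree-style tree, where text placed directly after an opening tag is stored in the element's `text` attribute rather than in any child. The helper names `items_only` and `items_with_leading_text` are hypothetical and are not part of trafilatura; this is not how trafilatura itself is implemented, only a reproduction of the symptom.

```python
import xml.etree.ElementTree as ET

# The fragment from the example above (well-formed, so ElementTree can parse it).
FRAGMENT = """<ul>Description of the list:
<li>List item 1</li>
<li>List item 2</li>
<li>List item 3</li>
</ul>"""


def items_only(ul):
    """Collect text from <li> elements only, ignoring the <ul>'s own text."""
    return [li.text.strip() for li in ul.iter("li") if li.text]


def items_with_leading_text(ul):
    """Also keep text sitting directly inside <ul>, before the first <li>."""
    lines = []
    if ul.text and ul.text.strip():
        lines.append(ul.text.strip())
    lines.extend(items_only(ul))
    return lines


ul = ET.fromstring(FRAGMENT)
print(items_only(ul))               # list items only
print(items_with_leading_text(ul))  # description line, then list items
```

An extractor that walks only the `<li>` children behaves like `items_only` and silently drops the description, whereas also checking the parent's own text recovers it.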
Hi, thanks for your feedback. As you say, it does happen, although it's not standard.
Just fixed it in 3e4e6aa, please install the latest version from the repository if you want the changes to take effect: pip3 install -U git+https://github.com/adbar/trafilatura.git