Extracting Text from HTML: Unordered List Description\Header #58

zmeharen · 2021-03-01T18:42:23Z

I have been using Trafilatura to extract text from HTML pages. I have noticed that text placed directly inside an unordered list, right after the opening <ul> tag, is sometimes not extracted: the list items come through, but the text preceding them does not.

<ul>Description of the list:
	<li>List item 1</li>
	<li>List item 2</li>
	<li>List item 3</li>
</ul>

For the example above, the extracted text is:

List item 1
List item 2
List item 3

"Description of the list:" is missing from the extracted text. This is probably due to non-standard HTML coding practices, but I'm wondering whether Trafilatura can capture that text.
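To illustrate where that text lives in the parsed tree, here is a minimal sketch using Python's standard-library ElementTree. It is not Trafilatura's code; it only assumes that the extractor works on a tree with the same text/tail model (Trafilatura builds on lxml, which shares it). The helper name list_text is hypothetical. The description is stored on the <ul> element itself, so a routine that reads only the <li> children never sees it.

```python
import xml.etree.ElementTree as ET

# The snippet from the report, which happens to be well-formed XML.
SNIPPET = """<ul>Description of the list:
	<li>List item 1</li>
	<li>List item 2</li>
	<li>List item 3</li>
</ul>"""


def list_text(ul):
    """Hypothetical helper: leading text of the list, then each item's text."""
    lines = []
    # Text placed directly inside <ul>, before the first <li>, is stored
    # in ul.text, not on any child element.
    if ul.text and ul.text.strip():
        lines.append(ul.text.strip())
    for li in ul.iter("li"):
        item = "".join(li.itertext()).strip()
        if item:
            lines.append(item)
    return lines


ul = ET.fromstring(SNIPPET)

# Reading only the <li> children reproduces the reported behaviour.
items_only = [li.text for li in ul.findall("li")]
print(items_only)

# Also checking ul.text recovers the description line.
print(list_text(ul))
```

The second call returns the description followed by the three items, which is the output the report asks for.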


adbar · 2021-03-02T12:50:14Z

Hi, thanks for your feedback. As you say, this does happen, although it's not standard HTML.

I just fixed it in 3e4e6aa. Please install the latest version from the repository if you want the changes to take effect:
pip3 install -U git+https://github.com/adbar/trafilatura.git

adbar added the bug label Mar 2, 2021

adbar closed this as completed Mar 5, 2021
