Extracting Text from HTML: Unordered List Description\Header #58

Closed
zmeharen opened this issue Mar 1, 2021 · 1 comment
Labels
bug Something isn't working

zmeharen commented Mar 1, 2021

I have been using trafilatura to extract text from HTML pages. I have noticed that text placed directly after an opening unordered list tag is sometimes not extracted: the list items are extracted, but not the text between the <ul> tag and the first list item.

<ul>Description of the list:
	<li>List item 1</li>
	<li>List item 2</li>
	<li>List item 3</li>
</ul>

In the previous code example, the extracted text would be:

  • List item 1
  • List item 2
  • List item 3

"Description of the list" would not be extracted into the text file. This is probably due to incorrect HTML coding practices but I'm wondering if Trafilatura can capture that text.

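One way to see why such text is easy to miss: in an ElementTree-style tree (the model used by lxml, which trafilatura builds on), text that appears before the first child is stored on the parent element's `.text`, not on any `<li>`. The following is a minimal sketch using only the Python standard library. It is not trafilatura's actual code, and `list_texts` is a hypothetical helper introduced for illustration.

```python
import xml.etree.ElementTree as ET

# The fragment from the report above; it happens to be well-formed XML,
# so the standard-library parser can read it directly.
SNIPPET = """<ul>Description of the list:
	<li>List item 1</li>
	<li>List item 2</li>
	<li>List item 3</li>
</ul>"""


def list_texts(markup):
    """Return (leading_text, item_texts) for a single <ul> fragment.

    Hypothetical helper: leading_text is whatever sits between the
    opening <ul> tag and the first child element.
    """
    ul = ET.fromstring(markup)
    # Text before the first child lives on the parent's .text attribute.
    leading = (ul.text or "").strip()
    # Each item's own text lives on the corresponding <li> element.
    items = [(li.text or "").strip() for li in ul.iter("li")]
    return leading, items


leading, items = list_texts(SNIPPET)
print(leading)
print(items)
```

Code that only walks the `<li>` children never sees the description, which would be consistent with the behaviour reported here; whether that was the actual cause in trafilatura is an assumption.
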
adbar (Owner) commented Mar 2, 2021

Hi, thanks for your feedback. As you say, it does happen, although it's not standard.

Just fixed it in 3e4e6aa. Please install the latest version from the repository if you want the changes to take effect:
pip3 install -U git+https://github.com/adbar/trafilatura.git

@adbar adbar added the bug Something isn't working label Mar 2, 2021
@adbar adbar closed this as completed Mar 5, 2021