feat: change HTML conversion backend from boilerpy3 to Trafilatura #7705

Merged (2 commits) on May 17, 2024

anakin87
Member

Related Issues

Proposed Changes:

As discussed offline, we want to replace boilerpy3 with Trafilatura, which is robust and well-maintained.
During my recent work on AutoQuizzer, I battle-tested this library, which worked well for a diverse range of HTML pages.

I tried not to break the existing API, and the new implementation is simpler.

How did you test it?

CI, new tests.

Checklist

anakin87 requested review from a team as code owners May 16, 2024 17:17
anakin87 requested review from dfokina and julian-risch and removed request for a team May 16, 2024 17:17
anakin87 requested review from masci and vblagoje and removed request for julian-risch May 16, 2024 17:17
@@ -57,7 +57,7 @@ dependencies = [
   "more-itertools", # TextDocumentSplitter
   "networkx", # Pipeline graphs
   "typing_extensions>=4.7", # typing support for Python 3.8
-  "boilerpy3", # Fulltext extraction from HTML pages
+  "trafilatura", # Fulltext extraction from HTML pages
Member Author

This is my main concern.
boilerpy3 is 163 kB, while trafilatura is 1390 kB.

If you think it's better, I can add trafilatura as an optional dependency and wrap it in a lazy import block.
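
For reference, here is a minimal sketch of the optional-dependency approach described above. It uses a plain try/except guard rather than any project-specific helper, and `html_to_text` is a hypothetical name used only for illustration; the only library call assumed is `trafilatura.extract`, which returns the extracted text or `None`.

```python
from typing import Optional

try:
    import trafilatura  # optional third-party dependency
    _IMPORT_ERROR: Optional[ImportError] = None
except ImportError as err:
    # Remember the failure, but let this module load without trafilatura.
    trafilatura = None
    _IMPORT_ERROR = err


def html_to_text(html: str) -> Optional[str]:
    """Extract the main text of an HTML page (hypothetical helper)."""
    if trafilatura is None:
        # Fail only when the feature is actually used, with an actionable hint.
        raise ImportError(
            "html_to_text needs trafilatura: run 'pip install trafilatura'"
        ) from _IMPORT_ERROR
    # trafilatura.extract returns the extracted text, or None if nothing was found.
    return trafilatura.extract(html)
```

With this pattern the import cost and the install size are only paid by users who actually convert HTML; everyone else can import the package without trafilatura being present.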

Member

1.4 MB is still OK, let's keep it.

@coveralls
Collaborator

Pull Request Test Coverage Report for Build 9116483899

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.04%) to 90.55%

| Files with Coverage Reduction | New Missed Lines | % |
| components/converters/html.py | 2 | 95.0% |

| Totals | Coverage Status |
| Change from base Build 9103206687 | -0.04% |
| Covered Lines | 6583 |
| Relevant Lines | 7270 |
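
As a quick sanity check, the overall coverage figure in this report can be reproduced from the totals, assuming coverage is simply covered lines divided by relevant lines (the variable names below are illustrative):

```python
# Totals taken from the coverage report above.
covered_lines = 6583
relevant_lines = 7270

# Coverage as a percentage of relevant lines.
coverage_pct = 100 * covered_lines / relevant_lines
print(f"{coverage_pct:.2f}%")  # prints 90.55%
```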

💛 - Coveralls

Member

@vblagoje vblagoje left a comment

Amazing! Code reduction, simplification, better features - a heaven on earth!

anakin87 merged commit 7181f6b into main May 17, 2024
27 checks passed
anakin87 deleted the trafilatura-html-converter branch May 17, 2024 08:38
Development

Successfully merging this pull request may close these issues.

Better HTML converter using trafilatura