Skip to content
master
Switch branches/tags
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 
 
 

ArticleParse

License

Library that strips boilerplate HTML from news articles and performs heuristic analysis to determine the body of the article. Ranks text sections of the website by probability of being news content.

Currently uses for analysis:

  • Section Length
  • Section Position
  • Number of Anchors in a Section
  • Anchor Density in a Section
  • Word Count
  • Uppercase Word Count
  • Average Word Length
  • Average Sentence Length
  • Number of Sentences

This is a work in progress. I have manually tested it on several news websites, but extensive testing still needs to be performed.

Supports Python3

About

Heuristic text extraction from news sites in Python3

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages