Given an HTML document, it pulls out the main body text and cleans it up.
This is a Python port of a Ruby port of arc90's readability project.
Installation is easy using pip. Just run:
$ pip install readability-lxml
>>> import requests
>>> from readability import Document
>>>
>>> response = requests.get('http://example.com')
>>> doc = Document(response.text)
>>> doc.title()
'Example Domain'
- 0.7 Improved HTML5 tags handling. Heuristics were changed for a lot of sites: Fixed an important bug with stripping unwanted HTML nodes (only first matching node was removed before).
- 0.6 Finally a release which supports Python versions 2.6, 2.7, 3.3
- 0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4
- 0.4 Added Videos loading and allowed more images per paragraph
- 0.3 Added Document.encoding, positive_keywords and negative_keywords
This code is licensed under the Apache License 2.0.