Fast Python port of arc90's Readability tool, updated to match the latest readability.js.
Latest commit 97e86c4 Nov 27, 2017 @buriy buriy Merge pull request #101 from hugovk/add-3.5-3.6
Add support for Python 3.5 and 3.6, drop support for Python 3.3 and 2.6

README.rst

python-readability

Given an HTML document, it extracts the main body text and cleans it up.

This is a Python port of a Ruby port of arc90's Readability project.

Installation

Install it with pip:

$ pip install readability-lxml

Usage

>>> import requests
>>> from readability import Document
>>>
>>> response = requests.get('http://example.com')
>>> doc = Document(response.text)
>>> doc.title()
'Example Domain'
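
Beyond the title, Document.summary() returns the cleaned-up HTML of the main content. The following is a minimal sketch of wrapping that call and reducing the result to plain text. The helper names extract_article and html_to_text are hypothetical and not part of the library; the tag stripping uses only the standard library and is deliberately naive.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects every text node and ignores the tags around them."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(fragment):
    """Strip tags from an HTML fragment and collapse runs of whitespace."""
    collector = _TextCollector()
    collector.feed(fragment)
    collector.close()
    return " ".join("".join(collector.parts).split())


def extract_article(html):
    """Return (title, cleaned_html, plain_text) for an HTML string.

    Hypothetical helper: readability is imported lazily so that
    html_to_text stays usable without readability-lxml installed.
    """
    from readability import Document  # provided by readability-lxml

    doc = Document(html)
    cleaned = doc.summary()  # cleaned-up HTML of the main content
    return doc.title(), cleaned, html_to_text(cleaned)
```

With the response from the example above, this could be called as title, cleaned, text = extract_article(response.text).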

Change Log

  • 0.3 Added Document.encoding, positive_keywords and negative_keywords
  • 0.4 Added video loading and allowed more images per paragraph
  • 0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4
  • 0.6 Finally a release which supports Python versions 2.6, 2.7, 3.3 and 3.4

Licensing

This code is licensed under the Apache License 2.0.

Thanks to

  • Latest readability.js
  • Ruby port by starrhorne and iterationlabs
  • Python port by gfxmonk
  • Decruft effort to move to lxml
  • "BR to P" fix from readability.js which improves quality for smaller texts
  • Contributions from GitHub users