Fast Python port of arc90's readability tool, updated to match the latest readability.js!



Given an HTML document, it pulls out the main body text and cleans it up.

This is a Python port of a Ruby port of arc90's readability project.
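
To give a rough feel for what "pulling out the main body text" means, here is a deliberately simplified sketch that uses only the standard library. It is not the algorithm this package implements (the real code is built on lxml and uses considerably richer heuristics). The names `ParagraphScorer`, `extract_main_text` and `SAMPLE` are hypothetical and exist only for this illustration. The sketch credits each paragraph to its nearest enclosing container and keeps the container holding the most paragraph text, assuming reasonably well-nested markup.

```python
from html.parser import HTMLParser

# Tags treated as candidate containers (an assumption made for this sketch).
CONTAINER_TAGS = {"div", "article", "section", "td"}


class ParagraphScorer(HTMLParser):
    """Credit the text of each <p> to its nearest enclosing container."""

    def __init__(self):
        super().__init__()
        self.blocks = []           # one list of paragraph strings per container
        self.open_containers = []  # stack of indexes into self.blocks
        self.in_paragraph = False
        self.buffer = []

    def _flush_paragraph(self):
        # Store the paragraph collected so far, with whitespace normalised.
        text = " ".join("".join(self.buffer).split())
        if self.in_paragraph and text and self.open_containers:
            self.blocks[self.open_containers[-1]].append(text)
        self.in_paragraph = False
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush_paragraph()  # copes with an unclosed previous <p>
            self.in_paragraph = True
        elif tag in CONTAINER_TAGS:
            self._flush_paragraph()
            self.blocks.append([])
            self.open_containers.append(len(self.blocks) - 1)

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush_paragraph()
        elif tag in CONTAINER_TAGS:
            self._flush_paragraph()
            if self.open_containers:
                self.open_containers.pop()

    def handle_data(self, data):
        if self.in_paragraph:
            self.buffer.append(data)


def extract_main_text(html):
    """Return the paragraphs of the container with the most paragraph text."""
    scorer = ParagraphScorer()
    scorer.feed(html)
    scorer.close()
    if not scorer.blocks:
        return ""
    best = max(scorer.blocks, key=lambda paragraphs: sum(len(p) for p in paragraphs))
    return "\n\n".join(best)


SAMPLE = """
<html><body>
  <div id="nav"><p>Home</p><p>About</p></div>
  <div id="content">
    <p>Readability tools look for the block that holds most of the text.</p>
    <p>Short navigation and footer blocks score low and are dropped.</p>
  </div>
  <div id="footer"><p>Copyright</p></div>
</body></html>
"""

if __name__ == "__main__":
    print(extract_main_text(SAMPLE))
```

Scoring by raw text length keeps the sketch short. A real extractor also has to deal with nested containers, link-heavy blocks and malformed markup, which is where this package's heuristics come in.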


Installation is easy with pip. Just run:

$ pip install readability-lxml


>>> import requests
>>> from readability import Document
>>> response = requests.get('')
>>> doc = Document(response.text)
>>> doc.title()
'Example Domain'

Change Log

  • 0.7 Improved handling of HTML5 tags. Heuristics were changed for a lot of sites: fixed an important bug with stripping unwanted HTML nodes (only the first matching node was removed before).
  • 0.6 Finally a release which supports Python versions 2.6, 2.7, 3.3 and 3.4
  • 0.5 Preparing a release to support Python versions 2.6, 2.7, 3.3 and 3.4
  • 0.4 Added Videos loading and allowed more images per paragraph
  • 0.3 Added Document.encoding, positive_keywords and negative_keywords


This code is licensed under the Apache License 2.0.

Thanks to

  • Latest readability.js
  • Ruby port by starrhorne and iterationlabs
  • Python port by gfxmonk
  • Decruft effort to move to lxml
  • "BR to P" fix from readability.js which improves quality for smaller texts
  • Contributions from GitHub users