Python wrapper of the C++ Linux tool html2text

pyhtml2text: Python wrapper for html2text

This is a Python wrapper for the C++ html2text tool. The original C++ html2text project has been slightly modified so that Python can call its functions through cffi.

html2text was written by Arno Unkrig for GMRS up to version 1.2.2, and by Martin Bayer up to version 1.3.2. Debian currently maintains an active fork.

Installation

pip install git+https://github.com/carsonip/pyhtml2text.git

Example

>>> from pyhtml2text import html2text
>>> html2text('<div>hello world</div>')
'hello world\n'
>>> html2text('<ol><li>one</li><li>two</li><li>three</li></ol>')
'   1. one\n   2. two\n   3. three\n'

Development

  1. From the project directory, cd into the C++ project and compile html2text as a shared library.

    cd c/html2text
    ./configure
    make
  2. There should now be a libhtml2text.so under c/html2text. Copy it next to the Python code.

    cp libhtml2text.so ../../pyhtml2text
  3. The cffi code in the Python package should now be able to load the .so.
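
To check step 3 by hand, something along the following lines may help. It is only a sketch, not the package's actual loading code: the helper names `locate_library` and `load_library` are hypothetical, and it assumes the library was copied into the `pyhtml2text` directory as described in step 2. The `ffi.cdef(...)` declarations a real wrapper needs are left out because they depend on the modified C++ sources.

```python
from pathlib import Path

LIB_NAME = "libhtml2text.so"  # file name taken from the steps above


def locate_library(package_dir):
    """Return the path to libhtml2text.so inside package_dir, or None if absent."""
    candidate = Path(package_dir) / LIB_NAME
    return candidate if candidate.is_file() else None


def load_library(path):
    """Open the shared library with cffi in ABI mode (hypothetical helper).

    cffi is imported lazily so that locate_library() works without it.
    Calling exported functions would additionally require ffi.cdef(...)
    declarations, which are omitted in this sketch.
    """
    from cffi import FFI

    ffi = FFI()
    return ffi, ffi.dlopen(str(path))


if __name__ == "__main__":
    found = locate_library("pyhtml2text")
    if found is None:
        print("libhtml2text.so not found; build it and copy it first")
    else:
        print(f"found {found}")
```

If `locate_library` returns a path but `load_library` raises `OSError`, the file is present but could not be opened as a shared library, which usually points back to the build step.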

FAQ

Q: There's already a Python html2text. What's the difference?

A: The two projects share a common goal, but the Python html2text has some extra features, such as converting to Markdown and preserving styles and links. This pyhtml2text project aims to provide a Python interface to the C++ html2text project and to produce the same output as the C++ tool. The two projects produce different output because they handle wrapping and spacing differently. At the time of writing, pyhtml2text (using C++ html2text) produces output closer to what is expected than Python html2text does. For example, on inputs like <div><br></div>, Python html2text yields extra newlines, which is unexpected. Also note that pyhtml2text is significantly faster than Python html2text. Please refer to the benchmarks under benchmarks/.
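
For a quick timing comparison on your own inputs, a helper like the one below could be used. This is a minimal sketch built on the standard `timeit` module, not the code under benchmarks/. The names `time_converter` and `strip_placeholder` are hypothetical, and `strip_placeholder` is only a stand-in so the sketch runs without either library installed.

```python
import timeit


def time_converter(convert, html, number=1000):
    """Return the average seconds per call of convert(html) over `number` runs."""
    total = timeit.timeit(lambda: convert(html), number=number)
    return total / number


def strip_placeholder(html):
    """Hypothetical stand-in converter; replace with a real html2text function."""
    return html.replace("<div>", "").replace("</div>", "\n")


if __name__ == "__main__":
    sample = "<div>hello world</div>"
    per_call = time_converter(strip_placeholder, sample)
    print(f"{per_call * 1e6:.2f} microseconds per call")
```

To compare the two projects, pass each library's `html2text` function as `convert` with the same `html` input and compare the returned averages.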

License

The html2text C++ code is licensed under GPLv2. Therefore, this wrapper is also licensed under GPLv2.