pyhtml2text: Python wrapper for html2text
html2text was written up to version 1.2.2 by Arno Unkrig for GMRS, and up to version 1.3.2 by Martin Bayer. An active fork is currently maintained by Debian.
pip install git+https://github.com/carsonip/pyhtml2text.git
>>> from pyhtml2text import html2text
>>> html2text('<div>hello world</div>')
'hello world\n'
>>> html2text('<ol><li>one</li><li>two</li><li>three</li></ol>')
' 1. one\n 2. two\n 3. three\n'
In the project directory, cd into the C++ project, then compile html2text as a shared library.
cd c/html2text
./configure
make
In c/html2text, there should now be a libhtml2text.so. Place it next to the Python code.
cp libhtml2text.so ../../pyhtml2text
The cffi function in the Python code should then be able to load the shared library.
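To check that the build and copy steps worked, a minimal sketch along the following lines may help. It assumes the library was copied into the pyhtml2text package directory as libhtml2text.so and that cffi is used in ABI mode via FFI().dlopen. The helper names find_shared_library and load_shared_library are hypothetical and are not part of pyhtml2text; the project's own loading code may differ.

```python
import os


def find_shared_library(package_dir, name="libhtml2text.so"):
    """Return the absolute path of the library in package_dir, or None if it is missing.

    Hypothetical helper for illustration; not part of pyhtml2text.
    """
    candidate = os.path.join(os.path.abspath(package_dir), name)
    return candidate if os.path.isfile(candidate) else None


def load_shared_library(package_dir):
    """Open the shared library with cffi (ABI mode), failing early if it is missing."""
    path = find_shared_library(package_dir)
    if path is None:
        raise FileNotFoundError(
            "libhtml2text.so not found in %s; build it and copy it there first"
            % package_dir
        )
    # Imported lazily so the path check above works even without cffi installed.
    from cffi import FFI

    ffi = FFI()
    # No C declarations are given here; the exported functions would still
    # need ffi.cdef(...) declarations before they can be called.
    return ffi.dlopen(path)
```

Calling load_shared_library("pyhtml2text") from the project root raises FileNotFoundError with a hint if the cp step was skipped, rather than failing later inside cffi.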
Q: There's already a Python html2text. What's the difference?
A: The two projects share a common goal, but the Python html2text has some extra features, such as converting to Markdown and preserving styles and links. This pyhtml2text project aims to provide a Python interface to the C++ html2text project and to produce the same output as C++ html2text does. The two projects produce different output because they differ in wrapping and spacing. At the time of writing, pyhtml2text (using C++ html2text) produces output closer to what is expected than Python html2text does. For example, on inputs like <div><br></div>, Python html2text yields extra newlines, which is unexpected. Also note that pyhtml2text is significantly faster than Python html2text. Please refer to the benchmarks under
The html2text C++ code is licensed under GPLv2. Therefore, this wrapper is also licensed under GPLv2.