Commit
[core] Add tidylib, a python wrapper for libtidy for HTML validation
bc Wong committed Apr 7, 2012
1 parent dae55a7, commit cdd1f98
Showing 36 changed files with 4,056 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
Copyright 2009 Jason Stitt | ||
|
||
Permission is hereby granted, free of charge, to any person obtaining a copy | ||
of this software and associated documentation files (the "Software"), to deal | ||
in the Software without restriction, including without limitation the rights | ||
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell | ||
copies of the Software, and to permit persons to whom the Software is | ||
furnished to do so, subject to the following conditions: | ||
|
||
The above copyright notice and this permission notice shall be included in | ||
all copies or substantial portions of the Software. | ||
|
||
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR | ||
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, | ||
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE | ||
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER | ||
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, | ||
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN | ||
THE SOFTWARE. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
include README | ||
include LICENSE | ||
include MANIFEST.in | ||
include tidylib/*.py | ||
include tests/*.py | ||
include *.py | ||
include docs/pytidylib.pdf | ||
include docs/html/*.html | ||
include docs/html/*.js | ||
include docs/html/_static/*.* | ||
include docs/html/_sources/*.* | ||
include docs/rst/*.* |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,54 @@ | ||
Metadata-Version: 1.0 | ||
Name: pytidylib | ||
Version: 0.2.1 | ||
Summary: Python wrapper for HTML Tidy (tidylib) | ||
Home-page: http://countergram.com/open-source/pytidylib/ | ||
Author: Jason Stitt | ||
Author-email: js@jasonstitt.com | ||
License: UNKNOWN | ||
Download-URL: http://cloud.github.com/downloads/countergram/pytidylib/pytidylib-0.2.1.tar.gz | ||
Description: 0.2.0: Works on Windows! See documentation for available DLL download | ||
locations. Documentation rewritten and expanded. | ||
|
||
`PyTidyLib`_ is a Python package that wraps the `HTML Tidy`_ library. This | ||
allows you, from Python code, to "fix" invalid (X)HTML markup. Some of the | ||
library's many capabilities include: | ||
|
||
* Clean up unclosed tags and unescaped characters such as ampersands | ||
* Output HTML 4 or XHTML, strict or transitional, and add missing doctypes | ||
* Convert named entities to numeric entities, which can then be used in XML | ||
documents without an HTML doctype. | ||
* Clean up HTML from programs such as Word (to an extent) | ||
* Indent the output, including proper (i.e. no) indenting for ``pre`` elements, | ||
which some (X)HTML indenting code overlooks. | ||
|
||
Small example of use | ||
==================== | ||
|
||
The following code cleans up an invalid HTML document and sets an option:: | ||
|
||
from tidylib import tidy_document | ||
document, errors = tidy_document('''<p>fõo <img src="bar.jpg">''', | ||
options={'numeric-entities':1}) | ||
print document | ||
print errors | ||
|
||
Docs | ||
==== | ||
|
||
Documentation is shipped with the source distribution and is available at | ||
the `PyTidyLib`_ web page. | ||
|
||
.. _`HTML Tidy`: http://tidy.sourceforge.net/ | ||
.. _`PyTidyLib`: http://countergram.com/open-source/pytidylib/ | ||
|
||
Platform: UNKNOWN | ||
Classifier: Development Status :: 4 - Beta | ||
Classifier: Environment :: Other Environment | ||
Classifier: Intended Audience :: Developers | ||
Classifier: License :: OSI Approved :: MIT License | ||
Classifier: Programming Language :: Python | ||
Classifier: Natural Language :: English | ||
Classifier: Topic :: Utilities | ||
Classifier: Topic :: Text Processing :: Markup :: HTML | ||
Classifier: Topic :: Text Processing :: Markup :: XML |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
For documentation, see docs/html/index.html in this distribution, or | ||
http://countergram.com/open-source/pytidylib/ | ||
|
||
Small example of use: | ||
|
||
from tidylib import tidy_document | ||
document, errors = tidy_document('''<p>fõo <img src="bar.jpg">''', | ||
options={'numeric-entities':1}) | ||
print document | ||
print errors |
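The README example above uses Python 2 `print` statements. As a hedged sketch only, the same call might look like this under Python 3, assuming a PyTidyLib release that supports Python 3 and an installed HTML Tidy shared library. The helper name `tidy_or_none` is hypothetical and not part of PyTidyLib; it returns `None` when either dependency is unavailable instead of raising.

```python
def tidy_or_none(markup):
    """Return (document, errors) from tidy_document, or None if unavailable.

    Hypothetical helper for illustration; not part of PyTidyLib.
    """
    try:
        # Imported lazily so that a missing package is handled here.
        from tidylib import tidy_document
        return tidy_document(markup, options={"numeric-entities": 1})
    except (ImportError, OSError):
        # ImportError: PyTidyLib is not installed.
        # OSError: assumed to be the error raised when the HTML Tidy
        # shared library cannot be loaded.
        return None


result = tidy_or_none('<p>fõo <img src="bar.jpg">')
if result is None:
    print("PyTidyLib or HTML Tidy is not available")
else:
    document, errors = result
    print(document)
    print(errors)
```

Catching the failure in one place keeps the example usable on machines where only one of the two pieces is installed.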
100 changes: 100 additions & 0 deletions
desktop/core/ext-py/pytidylib-0.2.1/docs/html/_sources/index.txt
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,100 @@ | ||
PyTidyLib: A Python Interface to HTML Tidy | ||
------------------------------------------ | ||
|
||
`PyTidyLib`_ is a Python package that wraps the `HTML Tidy`_ library. This allows you, from Python code, to "fix" invalid (X)HTML markup. Some of the library's many capabilities include: | ||
|
||
* Clean up unclosed tags and unescaped characters such as ampersands | ||
* Output HTML 4 or XHTML, strict or transitional, and add missing doctypes | ||
* Convert named entities to numeric entities, which can then be used in XML documents without an HTML doctype. | ||
* Clean up HTML from programs such as Word (to an extent) | ||
* Indent the output, including proper (i.e. no) indenting for ``pre`` elements, which some (X)HTML indenting code overlooks. | ||
|
||
PyTidyLib is intended as a replacement for uTidyLib, which fills a similar purpose. The author previously used uTidyLib but found several areas for improvement, including OS X support, 64-bit platform support, Unicode support, fixing a memory leak, and better speed. | ||
|
||
Naming conventions | ||
================== | ||
|
||
`HTML Tidy`_ is a longstanding open-source library written in C that implements the actual functionality of cleaning up (X)HTML markup. It provides a shared library (``so``, ``dll``, or ``dylib``) that can variously be called ``tidy``, ``libtidy``, or ``tidylib``, as well as a command-line executable named ``tidy``. For clarity, this document will consistently refer to it by the project name, HTML Tidy. | ||
|
||
`PyTidyLib`_ is the name of the Python package discussed here. As this is the package name, ``easy_install pytidylib`` or ``pip install pytidylib`` is correct (they are case-insensitive). The *module* name is ``tidylib``, so ``import tidylib`` is correct in Python code. This document will consistently use the package name, PyTidyLib, outside of code examples. | ||
|
||
Installing HTML Tidy | ||
==================== | ||
|
||
You must have both `HTML Tidy`_ and `PyTidyLib`_ installed in order to use the functionality described here. There is no affiliation between the two projects. The following briefly outlines what you must do to install HTML Tidy. See the `HTML Tidy`_ web site for more information. | ||
|
||
**Linux/BSD or similar:** First, try to use your distribution's package management system (``apt-get``, ``yum``, etc.) to install HTML Tidy. It might go under the name ``libtidy``, ``tidylib``, ``tidy``, or something similar. Otherwise see *Building from Source*, below. | ||
|
||
**OS X:** You may already have HTML Tidy installed. In the Terminal, run ``locate libtidy`` and see if you get any results, which should end in ``dylib``. Otherwise see *Building from Source*, below. | ||
|
||
**Windows:** (Use PyTidyLib version 0.2 or later!) Prebuilt HTML Tidy DLLs are available from at least two locations. The `int64.org Tidy Binaries`_ page provides binaries that were built in 2005, for both 32-bit and 64-bit Windows, against a patched version of the source. The `HTML Tidy`_ web site links to a DLL built in 2006, for 32-bit Windows only, using the vanilla source (scroll near the bottom to "Other Builds" -- use the one that reads "exe/lib/dll", *not* the "exe"-only version.) | ||
|
||
Once you have a DLL (which may be named ``tidy.dll``, ``libtidy.dll``, or ``tidylib.dll``), you must place it in a directory on your system path. If you are running Python from the command-line, placing the DLL in the present working directory will work, but this is unreliable otherwise (e.g. for server software). | ||
|
||
See the articles `How to set the path in Windows 2000/Windows XP <http://www.computerhope.com/issues/ch000549.htm>`_ (ComputerHope.com) and `Modify a Users Path in Windows Vista <http://www.question-defense.com/2009/06/22/modify-a-users-path-in-windows-vista-vista-path-environment-variable/>`_ (Question Defense) for more information on your system path. | ||
|
||
**Building from Source:** The HTML Tidy developers have chosen to make the source code downloadable *only* through CVS, and not from the web site. Use the following CVS checkout at the command line:: | ||
|
||
cvs -z3 -d:pserver:anonymous@tidy.cvs.sourceforge.net:/cvsroot/tidy co -P tidy | ||
|
||
Then see the instructions packaged with the source code or on the `HTML Tidy`_ web site. | ||
|
||
Installing PyTidyLib | ||
==================== | ||
|
||
PyTidyLib is available on the Python Package Index and may be installed in the usual ways if you have `pip`_ or `setuptools`_ installed:: | ||
|
||
pip install pytidylib | ||
# or: | ||
easy_install pytidylib | ||
|
||
You can also download the latest source distribution from the `PyTidyLib`_ web site. | ||
|
||
Small example of use | ||
==================== | ||
|
||
The following code cleans up an invalid HTML document and sets an option:: | ||
|
||
from tidylib import tidy_document | ||
document, errors = tidy_document('''<p>fõo <img src="bar.jpg">''', | ||
options={'numeric-entities':1}) | ||
print document | ||
print errors | ||
|
||
Configuration options | ||
===================== | ||
|
||
The Python interface allows you to pass options directly to HTML Tidy. For a complete list of options, see the `HTML Tidy Configuration Options Quick Reference`_ or, from the command line, run ``tidy -help-config``. | ||
|
||
.. _`HTML Tidy Configuration Options Quick Reference`: http://tidy.sourceforge.net/docs/quickref.html | ||
|
||
This module sets certain default options, as follows:: | ||
|
||
BASE_OPTIONS = { | ||
"output-xhtml": 1, # XHTML instead of HTML4 | ||
"indent": 1, # Pretty; not too much of a performance hit | ||
"tidy-mark": 0, # No tidy meta tag in output | ||
"wrap": 0, # No wrapping | ||
"alt-text": "", # Help ensure validation | ||
"doctype": 'strict', # Little sense in transitional for tool-generated markup... | ||
"force-output": 1, # May not get what you expect but you will get something | ||
} | ||
|
||
If you do not like these options to be set for you, do the following after importing ``tidylib``:: | ||
|
||
tidylib.BASE_OPTIONS = {} | ||
|
||
Function reference | ||
================== | ||
|
||
.. autofunction:: tidylib.tidy_document | ||
|
||
.. autofunction:: tidylib.tidy_fragment | ||
|
||
.. autofunction:: tidylib.release_tidy_doc | ||
|
||
.. _`HTML Tidy`: http://tidy.sourceforge.net/ | ||
.. _`PyTidyLib`: http://countergram.com/open-source/pytidylib/ | ||
.. _`int64.org Tidy Binaries`: http://int64.org/projects/tidy-binaries | ||
.. _`setuptools`: http://pypi.python.org/pypi/setuptools | ||
.. _`pip`: http://pypi.python.org/pypi/pip |
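The configuration section in the file above describes module-level defaults (`BASE_OPTIONS`) alongside the per-call `options` argument. A minimal sketch of how such layering might work follows, assuming per-call options take precedence over the defaults; that precedence is an assumption, not something the documentation states. `effective_options` is a hypothetical helper, the dictionary is copied from the listing above, and nothing here calls HTML Tidy.

```python
# Defaults copied from the BASE_OPTIONS listing in the documentation.
BASE_OPTIONS = {
    "output-xhtml": 1,
    "indent": 1,
    "tidy-mark": 0,
    "wrap": 0,
    "alt-text": "",
    "doctype": "strict",
    "force-output": 1,
}


def effective_options(options=None, base=BASE_OPTIONS):
    """Return a new dict with per-call options layered over the defaults.

    Hypothetical helper for illustration; not part of PyTidyLib.
    """
    merged = dict(base)           # copy so the defaults are never mutated
    merged.update(options or {})  # assumed precedence: per-call values win
    return merged


merged = effective_options({"numeric-entities": 1, "indent": 0})
print(merged["indent"])            # 0: overridden by the per-call value
print(merged["doctype"])           # strict: inherited from the defaults
print(effective_options(base={}))  # {}: mirrors clearing BASE_OPTIONS
```

Copying the defaults before updating means one call's overrides cannot leak into the next.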