jusText

jusText is a tool for removing boilerplate content, such as navigation links, headers, and footers, from HTML pages. It is designed to preserve mainly text containing full sentences, which makes it well suited for creating linguistic resources such as Web corpora. You can try it online.

This is a fork of the original (currently unmaintained) jusText code hosted on Google Code. Below are some alternatives that I found:

Installation

Make sure you have Python 2.6+/3.2+ and pip (Windows, Linux) installed. The preferred way to install is:

$ [sudo] pip install justext

Or, for the latest development version:

$ [sudo] pip install git+https://github.com/miso-belica/jusText.git

Or, if you have to, install from the source archive:

$ wget https://github.com/miso-belica/jusText/archive/master.zip # download the sources
$ unzip master.zip # extract the downloaded file
$ cd jusText-master/ # enter the extracted directory
$ [sudo] python setup.py install # install the package

Dependencies

lxml>=2.2.4

Usage

$ python -m justext -s Czech -o text.txt http://www.zdrojak.cz/clanky/automaticke-zabezpeceni/
$ python -m justext -s English -o plain_text.txt english_page.html
$ python -m justext --help # for more info
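
To run the same command over several saved pages, the invocation can be scripted. The following is a minimal sketch, assuming the -s and -o options behave as in the examples above. The helper names build_command and convert_all and the .txt output naming are illustrative and not part of jusText, and the script itself assumes Python 3.5+ (for subprocess.run and pathlib).

```python
import subprocess
import sys
from pathlib import Path


def build_command(source, output, language="English"):
    """Return the argument list for one `python -m justext` run."""
    return [
        sys.executable, "-m", "justext",
        "-s", language,      # stoplist language, as in the examples above
        "-o", str(output),   # file that receives the extracted text
        str(source),         # local HTML file or URL
    ]


def convert_all(sources, language="English"):
    """Run jusText on each HTML file, writing <name>.txt next to it."""
    for source in map(Path, sources):
        output = source.with_suffix(".txt")
        # check=True raises CalledProcessError if the command exits with an error
        subprocess.run(build_command(source, output, language), check=True)
```

For example, convert_all(["english_page.html"]) would run the second command above with english_page.txt as the output file.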

Python API

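import requests
import justext

# download the page; response.content holds the raw bytes of the HTML
response = requests.get("http://planet.python.org/")
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))
for paragraph in paragraphs:
    # keep only paragraphs that were not classified as boilerplate
    if not paragraph.is_boilerplate:
        print(paragraph.text)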
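
When the cleaned text is needed as a single string rather than printed, the loop can be wrapped in small helpers. This is a sketch only: content_text and extract_text are hypothetical names, not part of the jusText API, and they rely solely on the calls and attributes used in the example above (justext.justext, justext.get_stoplist, paragraph.is_boilerplate, paragraph.text).

```python
def content_text(paragraphs, separator="\n\n"):
    """Join the text of all paragraphs not marked as boilerplate."""
    return separator.join(p.text for p in paragraphs if not p.is_boilerplate)


def extract_text(html, language="English"):
    """Classify the paragraphs of an HTML document and return the content text."""
    # Imported here so content_text() stays usable without jusText installed.
    import justext

    paragraphs = justext.justext(html, justext.get_stoplist(language))
    return content_text(paragraphs)
```

With the response object from the example above, extract_text(response.content) would return the non-boilerplate paragraphs separated by blank lines.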

Testing

Run the tests with:

$ nosetests tests

Acknowledgements

This software was developed at the Natural Language Processing Centre of Masaryk University in Brno with financial support from PRESEMT and Lexical Computing Ltd. It also relates to the PhD research of Jan Pomikálek.
