The ExtractContent will extract content from HTML.
It has been rewritten in PHP from Ruby and Perl.
Syuyo Nakatani:: http://labs.cybozu.co.jp/blog/nakatani/downloads/extractcontent.rb
Ina Lintaro:: http://search.cpan.org/dist/HTML-ExtractContent/lib/HTML/ExtractContent.pm
Clone the repo, git clone git://github.com/aoiaoi/ExtractContent.git: , or download the latest release.
<?php
require_once 'ExtractContent.php';
$extractor = new ExtractContent();
$extractor = new ExtractContent();
$extractor->extract($html);
echo $extractor->asText();
// if retrieve as HTML:
echo $extractor->asHtml();
$extractor = new ExtractContent();
$extractor->extract($html);
echo $extractor->getTitle();
$extractor = new ExtractContent(array('g_adsense' => true));
// This is actually exactly the same:
$extractor = new ExtractContent();
$extractor->setOptions('g_adsense' => true);
threshold | integer 60
min_length | integer 30
decay_factor | integer 0.75
no_body_factor | integer 0.72
continuous_factor | integer 1.62
punctuation_weight | integer 10
min_nolink | integer 8
nolist_ratio | integer 0.2
g_adsense | boolean false
debug | boolian false
ExtractContent::setOptions(array $options);
ExtractContent::extract(string $html);
ExtractContent::getTitle();
ExtractContent::asText();
ExtractContent::asHtml();
PHP 5.4.1-rc2
Copyright (c) 2012 aoiaoi. All rights reserved.
Copyright (c) 2007-2012 Nakatani Shuyo / Cybozu Labs Inc. All rights reserved.
The files in this archive are released under the New BSD license. You can find a copy of this license in LICENSE.txt