Commit
Item2190: extracted from KinoSearchContrib, some simplifications and fixes
git-svn-id: http://svn.foswiki.org/trunk/StringifierContrib@5196 0b4bb1d4-4e5a-0410-9cc4-b2b747904278
MichaelDaum authored and MichaelDaum committed Oct 2, 2009
0 parents
commit de4311a
Showing 72 changed files with 2,492 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,134 @@ | ||
%META:TOPICINFO{author="micha" comment="reprev" date="1254480148" format="1.1" reprev="1.2" version="1.2"}% | ||
%META:TOPICPARENT{name="Contribs"}% | ||
---+ StringifierContrib | ||
%SHORTDESCRIPTION% | ||
|
||
This extension has been extracted from Foswiki:Extensions/KinoSearchContrib to make it available | ||
for search engines other than kinosearch. | ||
|
||
%TOC% | ||
|
||
---++ Supported file formats | ||
|
||
* =.txt= | ||
* =.html= | ||
* =.xml= | ||
* =.doc= | ||
* =.docx= | ||
* =.xls= | ||
* =.xlsx= | ||
* =.ppt= | ||
* =.pptx= | ||
* =.pdf= | ||
|
||
You can change this with the | ||
=$Foswiki::cfg{StringifierContrib}{IndexExtensions}= setting in =configure=. | ||
|
||
Files with any other extension that you add are treated as plain ASCII text. If needed, | ||
you can add more specialised stringifiers for further document types (see "Further Development" below). | ||
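
As an illustration, the setting could look like this once saved. This is only a sketch of a =LocalSite.cfg= fragment; the default list is copied from the =Config.spec= in this commit, and the =.csv= entry is a hypothetical addition.

```perl
# LocalSite.cfg fragment (normally written by configure, not by hand).
# '.csv' is a hypothetical extra extension; no dedicated stringifier for it
# is part of this commit, so such files would be handled as plain text.
$Foswiki::cfg{StringifierContrib}{IndexExtensions} =
  '.txt, .html, .xml, .doc, .docx, .xls, .xlsx, .ppt, .pptx, .pdf, .csv';
```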
|
||
---++ Backend for Indexing Word 2003 Documents | ||
|
||
To index Word 2003 Documents (=.doc=) you will need to install one of the following: | ||
|
||
* =antiword= (recommended) | ||
* =abiword= | ||
* =wvWare= | ||
|
||
You can then select the tool to use in =configure=. | ||
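
A sketch of what the resulting configuration might look like, using the key names from the =Config.spec= in this commit. The command path is an assumption and will differ per host.

```perl
# LocalSite.cfg fragment (normally written by configure, not by hand).
# Values offered by Config.spec in this commit: antiword, wv, abiword.
$Foswiki::cfg{StringifierContrib}{WordIndexer} = 'wv';

# Assumption: wvHtml is installed in /usr/bin on this host.
$Foswiki::cfg{StringifierContrib}{wvHtmlCmd} = '/usr/bin/wvHtml';
```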
|
||
---++ Backend for PDF | ||
|
||
To index =.pdf= files you need to install =xpdf-utils=. | ||
|
||
---++ Backend for PPT | ||
|
||
To index =.ppt= files you need to install =ppthtml=. | ||
|
||
---++ Backends for DOCX, PPTX | ||
|
||
To index these file types, you need to install the following tools from SourceForge: | ||
* [[http://sourceforge.net/projects/docx2txt/][docx2txt]] for =.docx= | ||
* [[http://sourceforge.net/projects/pptx2txt/][pptx2txt]] for =.pptx= | ||
|
||
Then set the command path to these tools in =configure=. | ||
|
||
---++ Installing the Contrib | ||
|
||
%$INSTALL_INSTRUCTIONS% | ||
|
||
---++ Configuration | ||
|
||
Several settings must be set in =configure= (under =Extensions=, =StringifierContrib=) before you can use the Contrib, in particular the commands for the backends described above. | ||
|
||
---++ Test of the Installation | ||
|
||
* Test if the installation was successful: | ||
* Check that =antiword=, =abiword= or =wvHtml= is in place: type =antiword=, =abiword= or =wvHtml= at the prompt and check that the command exists. | ||
* Check that =pdftotext= is in place: type =pdftotext= at the prompt and check that the command exists. | ||
* Check that =ppthtml= is in place: type =ppthtml= at the prompt and check that the command exists. | ||
* Change the working directory to the =kinosearch/bin= Foswiki installation directory. | ||
* Run =./kinoindex= | ||
* Once finished, open a browser window and point it to the =[[System.KinoSearch]]= topic. | ||
* Just type a query and check the results. | ||
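
The command checks in the list above can also be scripted. The following is a minimal sketch, not part of the Contrib: it looks for each backend on =PATH= without running it. The helper name =find_in_path= is hypothetical, and the tool list simply repeats the commands named above.

```perl
use strict;
use warnings;
use File::Spec;

# Return the full path of $program if an executable file with that name
# is found in one of the PATH directories, otherwise return nothing.
sub find_in_path {
    my ($program) = @_;
    for my $dir (File::Spec->path()) {
        my $candidate = File::Spec->catfile($dir, $program);
        return $candidate if -f $candidate && -x _;
    }
    return;
}

# Report each backend; only one of the three Word tools is required.
for my $tool (qw(antiword abiword wvHtml pdftotext ppthtml)) {
    my $path = find_in_path($tool);
    printf "%-10s %s\n", $tool, defined $path ? $path : 'NOT FOUND';
}
```

If a tool is installed outside =PATH=, set its full path in the matching =...Cmd= setting in =configure= instead.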
|
||
---++ Test of Stringification with =stringify= | ||
|
||
Some users report problems with stringification: the stringifier scripts fail | ||
or take too long on some attachments. This is sometimes caused by installation | ||
errors, especially in the backends used for stringification. | ||
|
||
=stringify= gives you the opportunity to test the stringification in advance. | ||
|
||
Usage: =stringify file_name= | ||
|
||
The output shows which stringifier was used and the text that the | ||
stringification produced. | ||
|
||
Example: | ||
|
||
<verbatim> | ||
stringify /path/to/foswiki/StringifierContrib/test/unit/StringifierContrib/attachement_examples/Simple_example.doc | ||
|
||
Simple example | ||
|
||
Keyword: dummy | ||
|
||
Umlaute: Grober, Uberschall, Anderung | ||
</verbatim> | ||
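
The same check can be made from Perl through the =stringFor= class method that this commit adds in =StringifierContrib.pm=. This is a hedged sketch: it assumes the Foswiki =lib= directory is already on =@INC=, the helper name =stringify_file= is hypothetical, and the default path is a placeholder.

```perl
use strict;
use warnings;

# Return the extracted text, or undef if the module cannot be loaded
# or the file is not readable.
sub stringify_file {
    my ($filename) = @_;

    # Assumption: Foswiki's lib directory is already on @INC.
    return unless eval { require Foswiki::Contrib::StringifierContrib; 1 };
    return scalar Foswiki::Contrib::StringifierContrib->stringFor($filename);
}

# Placeholder path; pass a real attachment as the first argument.
my $file = @ARGV ? $ARGV[0] : '/path/to/Simple_example.doc';
my $text = stringify_file($file);
print defined $text ? $text : "no text extracted from $file\n";
```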
|
||
---++ Further Development | ||
|
||
In this extension, a plug-in mechanism is implemented, so that additional | ||
stringifiers can be added without changing the existing code. All stringifier | ||
plugins are stored in the directory =lib/Foswiki/Contrib/StringifierContrib/Plugins=. | ||
|
||
You can add a new stringifier plugin simply by adding a new file to this directory. At a | ||
minimum, a plugin must meet these requirements: | ||
|
||
* The plugin must inherit from =Foswiki::Contrib::StringifierContrib::Base= | ||
* The plugin must register itself with =__PACKAGE__->register_handler($application, $file_extension)=; an argument containing =/= is treated as a MIME type, any other argument as a file extension | ||
* The plugin must implement the method =$text = stringForFile ($filename)= | ||
|
||
All the stringifiers have unit tests associated with them, and we would | ||
encourage you to provide unit tests for any you wish to contribute. See | ||
Foswiki:Development/UnitTests for more information on unit testing. | ||
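
The three plugin requirements listed above can be sketched as follows. To keep the example runnable without Foswiki, =Sketch::Base= is a hypothetical stand-in that copies only the registry logic of =Foswiki::Contrib::StringifierContrib::Base=; a real plugin would inherit from the real base class and live in the =Plugins= directory. The CSV plugin and its MIME type are illustrative assumptions, not part of this commit.

```perl
use strict;
use warnings;

# Hypothetical stand-in for Foswiki::Contrib::StringifierContrib::Base,
# reduced to the handler registry so the sketch runs without Foswiki.
package Sketch::Base;
{
    my (%mime_handlers, %extension_handlers);

    # Specs containing "/" are MIME types; everything else is an extension.
    sub register_handler {
        my ($package, @specs) = @_;
        for my $spec (@specs) {
            if ($spec =~ m{/}) { $mime_handlers{$spec}      = $package }
            else               { $extension_handlers{$spec} = $package }
        }
    }

    # A MIME match wins; otherwise fall back to the file extension.
    sub handler_for {
        my ($class, $filename, $mime) = @_;
        return $mime_handlers{$mime} if exists $mime_handlers{$mime};
        for my $spec (keys %extension_handlers) {
            return $extension_handlers{$spec}
              if lc($filename) =~ /\Q$spec\E$/;
        }
        return;    # the real Base returns its default text handler here
    }
}
sub new { my ($handler) = @_; return bless {}, $handler }

# The plugin itself: inherit, register, implement stringForFile.
package Sketch::Plugins::CSV;
our @ISA = ('Sketch::Base');
__PACKAGE__->register_handler('text/csv', '.csv');

sub stringForFile {
    my ($self, $filename) = @_;
    open(my $fh, '<', $filename) or return '';
    local $/;             # slurp the whole file
    my $text = <$fh>;
    close($fh);
    $text =~ tr/,/ /;     # turn separators into spaces for the indexer
    return $text;
}

package main;
my $handler = Sketch::Base->handler_for('report.CSV', 'application/octet-stream');
print "handler: $handler\n";
```

One deliberate difference from the code in this commit: the sketch quotes the extension with =\Q...\E= before matching, so the leading dot is matched literally rather than as a regular-expression wildcard.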
|
||
See Foswiki:Tasks/StringifierContrib for currently open tasks. | ||
|
||
---++ Contrib Info | ||
|
||
<!-- | ||
* Set SHORTDESCRIPTION = Helper library to stringify binary document formats | ||
--> | ||
|
||
| Author(s): | Foswiki:Main.MarkusHesse, Foswiki:Main.SvenDowideit, Foswiki:Main.MichaelDaum & Foswiki:Main.AndrewJones | | ||
| Copyright: | © 2007, Foswiki:Main.MarkusHesse; © 2009, Foswiki Contributors | | ||
| Release: | %$RELEASE% | | ||
| Version: | %$VERSION% | | ||
| Change History: | <!-- versions below in reverse order --> | | ||
| 02 Oct 2009: | extracted from Foswiki:Extensions/KinoSearchContrib (MD) | | ||
| Dependencies: | %$DEPENDENCIES% | | ||
| Home: | Foswiki:Extensions/%TOPIC% | | ||
| Support: | Foswiki:Support/%TOPIC% | |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,42 @@ | ||
# Copyright (C) 2009 Foswiki Contributors | ||
# | ||
# For licensing info read LICENSE file in the Foswiki root. | ||
# This program is free software; you can redistribute it and/or | ||
# modify it under the terms of the GNU General Public License | ||
# as published by the Free Software Foundation; either version 2 | ||
# of the License, or (at your option) any later version. | ||
# | ||
# This program is distributed in the hope that it will be useful, | ||
# but WITHOUT ANY WARRANTY; without even the implied warranty of | ||
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | ||
# GNU General Public License for more details, published at | ||
# http://www.gnu.org/copyleft/gpl.html | ||
|
||
|
||
package Foswiki::Contrib::StringifierContrib; | ||
use strict; | ||
use base "Foswiki::Contrib::StringifierContrib::Base"; | ||
use Carp; | ||
use File::MMagic; | ||
use File::Spec::Functions qw(rel2abs); | ||
use File::Basename; | ||
use File::stat; | ||
|
||
use vars qw($VERSION $RELEASE $magic); | ||
|
||
$VERSION = '$Rev: 4426 (2009-07-03) $'; | ||
$RELEASE = '1.0'; | ||
$magic = File::MMagic->new(); | ||
|
||
sub stringFor { | ||
my ($class, $filename, $encoding) = @_; | ||
return unless -r $filename; | ||
my $mime = $magic->checktype_filename($filename); | ||
|
||
#print STDERR "filename=$filename, mime=$mime\n"; | ||
my $self = $class->handler_for($filename, $mime)->new(); | ||
|
||
return $self->stringForFile($filename); | ||
} | ||
|
||
1; |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,92 @@ | ||
# Copyright (C) 2009 Foswiki Contributors | ||
# | ||
# For licensing info read LICENSE file in the Foswiki root. | ||
# This program is free software; you can redistribute it and/or | ||
# modify it under the terms of the GNU General Public License | ||
# as published by the Free Software Foundation; either version 2 | ||
# of the License, or (at your option) any later version. | ||
# | ||
# This program is distributed in the hope that it will be useful, | ||
# but WITHOUT ANY WARRANTY; without even the implied warranty of | ||
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | ||
# GNU General Public License for more details, published at | ||
# http://www.gnu.org/copyleft/gpl.html | ||
|
||
package Foswiki::Contrib::StringifierContrib::Base; | ||
use strict; | ||
|
||
use Module::Pluggable (require => 1, search_path => [qw/Foswiki::Contrib::StringifierContrib::Plugins/]); | ||
|
||
__PACKAGE__->plugins; | ||
|
||
use constant DEFAULT_HANDLER => "Foswiki::Contrib::StringifierContrib::Plugins::Text"; | ||
{ | ||
my %mime_handlers; | ||
my %extension_handlers; | ||
sub register_handler { | ||
my ($package, @specs) = @_; | ||
|
||
for my $spec (@specs) { | ||
if ($spec =~ m{/}) { | ||
$mime_handlers{$spec} = $package; | ||
} else { | ||
$extension_handlers{$spec} = $package; | ||
} | ||
} | ||
} | ||
sub handler_for { | ||
my ($self, $filename, $mime) = @_; | ||
if (exists $mime_handlers{$mime}) { return $mime_handlers{$mime} } | ||
$filename = lc($filename); | ||
for my $spec (keys %extension_handlers) { | ||
if ($filename =~ /$spec$/) { return $extension_handlers{$spec} } | ||
} | ||
return DEFAULT_HANDLER; | ||
} | ||
|
||
# Returns 1, if the program can be called. | ||
# This is a service method that a subclass can use to decide | ||
# whether it wants to register or not. | ||
sub _programExists { | ||
my ($self, $program) = @_; | ||
|
||
return defined(`$program 2>&1`); | ||
} | ||
} | ||
|
||
sub new { | ||
my ($handler) = @_; | ||
my $self = bless {}, $handler; | ||
|
||
$self; | ||
} | ||
|
||
# Service method to remove the directory $dir and | ||
# all of its contents, including subdirectories. | ||
sub rmtree { | ||
my ($self, $dir) = @_; | ||
local *DIR; | ||
|
||
# If the dir is in fact a file, just delete it. | ||
if (-f $dir) { | ||
unlink($dir); | ||
} | ||
|
||
opendir (DIR, $dir) || return 0; | ||
while (my $file = readdir(DIR)) { | ||
# Ignores . and .. | ||
next if ($file =~ /^\.{1,2}$/); | ||
|
||
$file = "$dir/$file"; | ||
if (-d $file) { | ||
$self->rmtree($file); | ||
} elsif (-f $file) { | ||
unlink($file); | ||
} | ||
} | ||
closedir DIR; | ||
rmdir($dir); | ||
return 1; | ||
} | ||
|
||
1; |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,64 @@ | ||
# ---+ Extensions | ||
# ---++ StringifierContrib | ||
|
||
# **STRING** | ||
# Comma-separated list of webs to skip | ||
$Foswiki::cfg{StringifierContrib}{SkipWebs} = 'Trash, Sandbox'; | ||
|
||
# **STRING** | ||
# Comma-separated list of extensions to index | ||
$Foswiki::cfg{StringifierContrib}{IndexExtensions} = '.txt, .html, .xml, .doc, .docx, .xls, .xlsx, .ppt, .pptx, .pdf'; | ||
|
||
# **STRING** | ||
# List of attachments to skip | ||
# For example: Web.SomeTopic.AnAttachment.txt, Web.OtherTopic.OtherAttachment.pdf | ||
$Foswiki::cfg{StringifierContrib}{SkipAttachments} = ''; | ||
|
||
# **STRING** | ||
# List of topics to skip. | ||
# Topics can be in the form of Web.MyTopic, or if you want a topic to be excluded from all webs just enter MyTopic. | ||
# For example: Main.WikiUsers, WebStatistics | ||
$Foswiki::cfg{StringifierContrib}{SkipTopics} = ''; | ||
|
||
# **SELECT antiword,wv,abiword** | ||
# Select which MS Word indexer to use (you need to have antiword, abiword or wvHtml installed) | ||
# <dl> | ||
# <dt>antiword</dt><dd>is the default, and should be used on Linux/Unix.</dd> | ||
# <dt>wvHtml</dt><dd> is recommended for use on Windows.</dd> | ||
# <dt>abiword</dt><dd></dd> | ||
# </dl> | ||
$Foswiki::cfg{StringifierContrib}{WordIndexer} = 'antiword'; | ||
|
||
# **COMMAND** | ||
# abiword command | ||
$Foswiki::cfg{StringifierContrib}{abiwordCmd} = 'abiword'; | ||
|
||
# **COMMAND** | ||
# antiword command | ||
$Foswiki::cfg{StringifierContrib}{antiwordCmd} = 'antiword'; | ||
|
||
# **COMMAND** | ||
# wvHtml command | ||
$Foswiki::cfg{StringifierContrib}{wvHtmlCmd} = 'wvHtml'; | ||
|
||
# **COMMAND** | ||
# ppthtml command | ||
$Foswiki::cfg{StringifierContrib}{ppthtmlCmd} = 'ppthtml'; | ||
|
||
# **COMMAND** | ||
# pdftotext command | ||
$Foswiki::cfg{StringifierContrib}{pdftotextCmd} = 'pdftotext'; | ||
|
||
# **COMMAND** | ||
# pptx2txt.pl command | ||
$Foswiki::cfg{StringifierContrib}{pptx2txtCmd} = '../tools/pptx2txt.pl'; | ||
|
||
# **COMMAND** | ||
# docx2txt.pl command | ||
$Foswiki::cfg{StringifierContrib}{docx2txtCmd} = '../tools/docx2txt.pl'; | ||
|
||
# **BOOLEAN** | ||
# Debug setting | ||
$Foswiki::cfg{StringifierContrib}{Debug} = '0'; | ||
|
||
1; |