Item2190: extracted from KinoSearchContrib, some simplifications and fixes

git-svn-id: http://svn.foswiki.org/trunk/StringifierContrib@5196 0b4bb1d4-4e5a-0410-9cc4-b2b747904278
MichaelDaum authored and MichaelDaum committed Oct 2, 2009
0 parents commit de4311a
Showing 72 changed files with 2,492 additions and 0 deletions.
134 changes: 134 additions & 0 deletions data/System/StringifierContrib.txt
@@ -0,0 +1,134 @@
%META:TOPICINFO{author="micha" comment="reprev" date="1254480148" format="1.1" reprev="1.2" version="1.2"}%
%META:TOPICPARENT{name="Contribs"}%
---+ StringifierContrib
%SHORTDESCRIPTION%

This extension has been extracted from Foswiki:Extensions/KinoSearchContrib to make it available
for search engines other than kinosearch.

%TOC%

---++ Supported file formats

* =.txt=
* =.html=
* =.xml=
* =.doc=
* =.docx=
* =.xls=
* =.xlsx=
* =.ppt=
* =.pptx=
* =.pdf=

You can change this with the
=$Foswiki::cfg{StringifierContrib}{IndexExtensions}= setting in =configure=.

If you add other file extensions, they are treated as ASCII files. If needed,
you can add more specialised stringifiers for further document types (see below).
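
As an illustration of how such a comma-separated setting might be consumed, the following sketch splits the default value into a list and checks whether a file name would be indexed. The helper name =is_indexable= and the parsing rules are assumptions made for this example; they are not taken from the contrib's code.

```perl
use strict;
use warnings;

# Default value copied from Config.spec; a real caller would read it from
# $Foswiki::cfg{StringifierContrib}{IndexExtensions}.
my $setting = '.txt, .html, .xml, .doc, .docx, .xls, .xlsx, .ppt, .pptx, .pdf';

# Hypothetical helper: true if the file name ends in one of the extensions.
sub is_indexable {
    my ($filename, $extensions) = @_;
    for my $ext (grep { length } split /\s*,\s*/, $extensions) {
        # \Q...\E quotes the dot so that ".doc" does not match "xdoc".
        return 1 if $filename =~ /\Q$ext\E$/i;
    }
    return 0;
}

print is_indexable('Report.PDF', $setting) ? "index\n" : "skip\n";
print is_indexable('photo.jpg',  $setting) ? "index\n" : "skip\n";
```

The match is case-insensitive here so that =Report.PDF= and =report.pdf= are treated alike; whether that is desirable depends on your attachments.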

---++ Backend for Indexing Word 2003 Documents

To index Word 2003 Documents (=.doc=) you will need to install one of the following:

* =antiword= (recommended)
* =abiword=
* =wvWare= (provides the =wvHtml= command)

You can then select the tool to use in =configure=.

---++ Backend for PDF

To index =.pdf= files you need to install =xpdf-utils=, which provides the =pdftotext= command.

---++ Backend for PPT

To index =.ppt= files you need to install =ppthtml=.

---++ Backends for DOCX, PPTX

To index these file types, you will need to install the following tools from Sourceforge:
* [[http://sourceforge.net/projects/docx2txt/][docx2txt]] for =.docx=
* [[http://sourceforge.net/projects/pptx2txt/][pptx2txt]] for =.pptx=

Then set the command path to these tools in =configure=.

---++ Installing the Contrib

%$INSTALL_INSTRUCTIONS%

---++ Configuration

There are a number of settings that need to be set in =configure= before you can use the Contrib.
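
For orientation, =configure= stores these settings in =lib/LocalSite.cfg= as ordinary Perl assignments. The fragment below is only a sketch: the keys are the ones defined in this contrib's =Config.spec=, while all paths are placeholders that depend on your system.

```perl
# Excerpt of lib/LocalSite.cfg (all paths are assumptions; adjust them)
$Foswiki::cfg{StringifierContrib}{WordIndexer}  = 'antiword';
$Foswiki::cfg{StringifierContrib}{antiwordCmd}  = '/usr/bin/antiword';
$Foswiki::cfg{StringifierContrib}{pdftotextCmd} = '/usr/bin/pdftotext';
$Foswiki::cfg{StringifierContrib}{ppthtmlCmd}   = '/usr/bin/ppthtml';
$Foswiki::cfg{StringifierContrib}{docx2txtCmd}  = '/usr/local/bin/docx2txt.pl';
$Foswiki::cfg{StringifierContrib}{pptx2txtCmd}  = '/usr/local/bin/pptx2txt.pl';
```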

---++ Test of the Installation

* Test whether the installation was successful:
* Check that =antiword=, =abiword= or =wvHtml= is in place: type =antiword=, =abiword= or =wvHtml= at the prompt and check that the command exists.
* Check that =pdftotext= is in place: type =pdftotext= at the prompt and check that the command exists.
* Check that =ppthtml= is in place: type =ppthtml= at the prompt and check that the command exists.
* Change the working directory to the =kinosearch/bin= directory of your Foswiki installation.
* Run =./kinoindex=
* Once finished, open a browser window and point it to the =[[System.KinoSearch]]= topic.
* Just type a query and check the results.
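
The command checks above can also be scripted. The following is a minimal sketch, assuming the backends are expected on the =PATH= under the names used in this topic; the helper =find_on_path= is hypothetical and not part of the contrib. Unlike typing the command, it only looks for the executable and does not run it.

```perl
use strict;
use warnings;
use File::Spec;

# Return the full path of $program if an executable file of that name is
# found in one of the PATH directories, otherwise undef.
sub find_on_path {
    my ($program) = @_;
    for my $dir (File::Spec->path()) {
        my $candidate = File::Spec->catfile($dir, $program);
        return $candidate if -f $candidate && -x _;
    }
    return;
}

# Command names as used in this topic; adjust if configure points elsewhere.
for my $program (qw(antiword abiword wvHtml pdftotext ppthtml)) {
    my $path = find_on_path($program);
    printf "%-10s %s\n", $program, defined $path ? $path : 'NOT FOUND';
}
```

Only one of the three Word backends needs to be found. On Windows the executables usually carry an =.exe= suffix, which this sketch does not handle.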

---++ Test of Stringification with =stringify=

Some users report problems with stringification: the stringifier scripts
fail or take too long on some attachments. Sometimes this results from
installation errors, especially in the installation of the stringification
backends.

=stringify= gives you the opportunity to test stringification in advance.

Usage: =stringify file_name=

The output shows which stringifier is used and the result of the
stringification.

Example:

<verbatim>
stringify /path/to/foswiki/StringifierContrib/test/unit/StringifierContrib/attachement_examples/Simple_example.doc

Simple example

Keyword: dummy

Umlaute: Grober, Uberschall, Anderung
</verbatim>

---++ Further Development

In this extension, a plug-in mechanism is implemented, so that additional
stringifiers can be added without changing the existing code. All stringifier
plugins are stored in the directory =lib/Foswiki/Contrib/StringifierContrib/Plugins=.

You can add new stringifier plugins by just adding new files here. The minimum
things to be implemented are:

* The plugin must inherit from =Foswiki::Contrib::StringifierContrib::Base=
* The plugin must register itself by =__PACKAGE__->register_handler($application, $file_extension);=
* The plugin must implement the method =$text = stringForFile ($filename)=
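
The three requirements above can be sketched as follows. To keep the example runnable without Foswiki, it uses a stand-in base class, =Sketch::Base=, that mirrors =register_handler=, =handler_for= and =new= from the real =Base= module; the package names and the CSV plugin are hypothetical. A real plugin would instead inherit from the contrib's =Base= class and live in the =Plugins= directory.

```perl
use strict;
use warnings;

# Stand-in for Foswiki::Contrib::StringifierContrib::Base so that the sketch
# runs without Foswiki; the real class also loads plugins via Module::Pluggable.
package Sketch::Base;    # hypothetical name
my (%mime_handlers, %extension_handlers);

sub register_handler {
    my ($package, @specs) = @_;
    for my $spec (@specs) {
        # Specs containing a slash are MIME types, all others are extensions.
        if ($spec =~ m{/}) { $mime_handlers{$spec} = $package }
        else               { $extension_handlers{$spec} = $package }
    }
}

sub handler_for {
    my ($class, $filename, $mime) = @_;
    return $mime_handlers{$mime} if exists $mime_handlers{$mime};
    for my $spec (keys %extension_handlers) {
        return $extension_handlers{$spec} if lc($filename) =~ /\Q$spec\E$/;
    }
    return;    # the real Base falls back to its Text plugin here
}

sub new { my ($handler) = @_; return bless {}, $handler }

# The plugin itself: inherit, register, implement stringForFile.
package Sketch::Plugins::CSV;    # hypothetical plugin
our @ISA = ('Sketch::Base');
__PACKAGE__->register_handler('text/csv', '.csv');

sub stringForFile {
    my ($self, $filename) = @_;
    open(my $fh, '<', $filename) or return '';
    local $/;                    # slurp the whole file
    my $text = <$fh>;
    close($fh);
    $text =~ s/[,;]+/ /g;        # separators carry no meaning for indexing
    return $text;
}

package main;
my $handler = Sketch::Base->handler_for('Prices.CSV', 'application/octet-stream');
print "handler: $handler\n";
```

Registering both a MIME type and an extension means the plugin is still selected when MIME detection returns something generic.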

All the stringifiers have unit tests associated with them, and we would
encourage you to provide unit tests for any you wish to contribute. See
Foswiki:Development/UnitTests for more information on unit testing.

See Foswiki:Tasks/StringifierContrib for currently open tasks.

---++ Contrib Info

<!--
* Set SHORTDESCRIPTION = Helper library to stringify binary document formats
-->

| Author(s): | Foswiki:Main.MarkusHesse, Foswiki:Main.SvenDowideit, Foswiki:Main.MichaelDaum & Foswiki:Main.AndrewJones |
| Copyright: | &copy; 2007, Foswiki:Main.MarkusHesse; &copy; 2009, Foswiki Contributors |
| Release: | %$RELEASE% |
| Version: | %$VERSION% |
| Change History: | <!-- versions below in reverse order -->&nbsp; |
| 02 Oct 2009: | extracted from Foswiki:Extensions/KinoSearchContrib (MD) |
| Dependencies: | %$DEPENDENCIES% |
| Home: | Foswiki:Extensions/%TOPIC% |
| Support: | Foswiki:Support/%TOPIC% |
42 changes: 42 additions & 0 deletions lib/Foswiki/Contrib/StringifierContrib.pm
@@ -0,0 +1,42 @@
# Copyright (C) 2009 Foswiki Contributors
#
# For licensing info read LICENSE file in the Foswiki root.
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details, published at
# http://www.gnu.org/copyleft/gpl.html


package Foswiki::Contrib::StringifierContrib;
use strict;
use base "Foswiki::Contrib::StringifierContrib::Base";
use Carp;
use File::MMagic;
use File::Spec::Functions qw(rel2abs);
use File::Basename;
use File::stat;

use vars qw($VERSION $RELEASE $magic);

$VERSION = '$Rev: 4426 (2009-07-03) $';
$RELEASE = '1.0';
$magic = File::MMagic->new();

sub stringFor {
    my ($class, $filename, $encoding) = @_;
    return unless -r $filename;
    my $mime = $magic->checktype_filename($filename);

    #print STDERR "filename=$filename, mime=$mime\n";
    my $self = $class->handler_for($filename, $mime)->new();

    return $self->stringForFile($filename);
}

1;
92 changes: 92 additions & 0 deletions lib/Foswiki/Contrib/StringifierContrib/Base.pm
@@ -0,0 +1,92 @@
# Copyright (C) 2009 Foswiki Contributors
#
# For licensing info read LICENSE file in the Foswiki root.
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details, published at
# http://www.gnu.org/copyleft/gpl.html

package Foswiki::Contrib::StringifierContrib::Base;
use strict;

use Module::Pluggable (require => 1, search_path => [qw/Foswiki::Contrib::StringifierContrib::Plugins/]);

__PACKAGE__->plugins;

use constant DEFAULT_HANDLER => "Foswiki::Contrib::StringifierContrib::Plugins::Text";
{
    my %mime_handlers;
    my %extension_handlers;

    sub register_handler {
        my ($package, @specs) = @_;

        for my $spec (@specs) {
            if ($spec =~ m{/}) {
                $mime_handlers{$spec} = $package;
            } else {
                $extension_handlers{$spec} = $package;
            }
        }
    }

    sub handler_for {
        my ($self, $filename, $mime) = @_;
        if (exists $mime_handlers{$mime}) { return $mime_handlers{$mime} }
        $filename = lc($filename);
        for my $spec (keys %extension_handlers) {
            # Quote the spec so that the dot in ".doc" matches literally.
            if ($filename =~ /\Q$spec\E$/) { return $extension_handlers{$spec} }
        }
        return DEFAULT_HANDLER;
    }

    # Returns true if the program can be called.
    # This is a service method that a subclass can use to decide
    # whether it wants to register or not.
    sub _programExists {
        my ($self, $program) = @_;

        return defined(`$program 2>&1`);
    }
}

sub new {
    my ($handler) = @_;
    my $self = bless {}, $handler;

    $self;
}

# Service method to remove the directory $dir and
# all its contents, including subdirectories.
sub rmtree {
    my ($self, $dir) = @_;
    local *DIR;

    # If $dir is in fact a file, just delete it.
    if (-f $dir) {
        unlink($dir);
    }

    opendir (DIR, $dir) || return 0;
    while (my $file = readdir(DIR)) {
        # Ignore . and ..
        next if ($file =~ /^\.{1,2}$/);

        $file = "$dir/$file";
        if (-d $file) {
            $self->rmtree($file);
        } elsif (-f $file) {
            unlink($file);
        }
    }
    closedir DIR;
    rmdir($dir);
    return 1;
}

1;
64 changes: 64 additions & 0 deletions lib/Foswiki/Contrib/StringifierContrib/Config.spec
@@ -0,0 +1,64 @@
# ---+ Extensions
# ---++ StringifierContrib

# **STRING**
# Comma separated list of webs to skip
$Foswiki::cfg{StringifierContrib}{SkipWebs} = 'Trash, Sandbox';

# **STRING**
# Comma separated list of extensions to index
$Foswiki::cfg{StringifierContrib}{IndexExtensions} = '.txt, .html, .xml, .doc, .docx, .xls, .xlsx, .ppt, .pptx, .pdf';

# **STRING**
# List of attachments to skip
# For example: Web.SomeTopic.AnAttachment.txt, Web.OtherTopic.OtherAttachment.pdf
$Foswiki::cfg{StringifierContrib}{SkipAttachments} = '';

# **STRING**
# List of topics to skip.
# Topics can be in the form of Web.MyTopic, or if you want a topic to be excluded from all webs just enter MyTopic.
# For example: Main.WikiUsers, WebStatistics
$Foswiki::cfg{StringifierContrib}{SkipTopics} = '';

# **SELECT antiword,wv,abiword**
# Select which MS Word indexer to use (you need to have antiword, abiword or wvHtml installed)
# <dl>
# <dt>antiword</dt><dd>is the default, and should be used on Linux/Unix.</dd>
# <dt>wvHtml</dt><dd> is recommended for use on Windows.</dd>
# <dt>abiword</dt><dd></dd>
# </dl>
$Foswiki::cfg{StringifierContrib}{WordIndexer} = 'antiword';

# **COMMAND**
# abiword command
$Foswiki::cfg{StringifierContrib}{abiwordCmd} = 'abiword';

# **COMMAND**
# antiword command
$Foswiki::cfg{StringifierContrib}{antiwordCmd} = 'antiword';

# **COMMAND**
# wvHtml command
$Foswiki::cfg{StringifierContrib}{wvHtmlCmd} = 'wvHtml';

# **COMMAND**
# ppthtml command
$Foswiki::cfg{StringifierContrib}{ppthtmlCmd} = 'ppthtml';

# **COMMAND**
# pdftotext command
$Foswiki::cfg{StringifierContrib}{pdftotextCmd} = 'pdftotext';

# **COMMAND**
# pptx2txt.pl command
$Foswiki::cfg{StringifierContrib}{pptx2txtCmd} = '../tools/pptx2txt.pl';

# **COMMAND**
# docx2txt.pl command
$Foswiki::cfg{StringifierContrib}{docx2txtCmd} = '../tools/docx2txt.pl';

# **BOOLEAN**
# Debug setting
$Foswiki::cfg{StringifierContrib}{Debug} = '0';

1;