Baagle Desktop Search v2.0
Background
I wrote this on a dare in 2004, after the release of Google Desktop Search, when I said "one could write something like that in 2 days" and a friend said "no you can't", so hey. Is it good for much these days? Probably not. But I shall let GitHub judge!
Description
This is a work-alike to things like Google/Yahoo Desktop Search.
It is primarily designed for UNIX-like systems, but works passably on Windows (it's received about 10 minutes of testing using ActivePerl and Swish-E for Windows; YMMV). The basic principle is that you run an indexer that indexes all of your files, then run a simple dummy webserver which serves you the results of searches in a convenient web environment.
Requirements
This package has the following requirements:
- Swish-e 2.4.x
- Perl 5.6+
Additionally, the following non-standard Perl modules are required:
- YAML
- Tie::IxHash
- LWP::UserAgent
- Parallel::ForkManager
- POE::Component::Server::HTTPServer
(Some of these, particularly the last, may have their own requirements.) For all of the requirements I'd recommend your operating system's native package installer, followed by Perl's own module installer (PPM or CPAN), followed by installing them by hand. So:
Windows:
- Grab and use installers for Swish-E and ActiveState Perl
- Run PPM to install the Perl modules
Unix:
- Use apt-get, yum, portinstall, etc. to grab and install everything for you
- In the likely case that you can't find packages for one or more of the Perl modules, use the Perl CPAN module to install the rest
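To see which of the Perl modules listed above still need installing, a small check like the following may help. It is only a sketch and not part of Baagle: it assumes a perl binary named "perl" on your PATH, and the function names are made up for illustration. It relies on the fact that "perl -MSome::Module -e 1" exits non-zero when the module cannot be loaded.

```python
import subprocess

# Module names are taken from the requirements list above.
REQUIRED_MODULES = [
    "YAML",
    "Tie::IxHash",
    "LWP::UserAgent",
    "Parallel::ForkManager",
    "POE::Component::Server::HTTPServer",
]


def module_check_command(module, perl="perl"):
    """Build a command that exits 0 only if Perl can load `module`."""
    return [perl, "-M" + module, "-e", "1"]


def missing_modules(modules, run=subprocess.run):
    """Return the modules that Perl fails to load.

    `run` is injectable so the logic can be exercised without Perl installed.
    """
    missing = []
    for module in modules:
        try:
            result = run(module_check_command(module), capture_output=True)
        except FileNotFoundError:
            # No perl binary on PATH, so nothing can be loaded.
            missing.append(module)
            continue
        if result.returncode != 0:
            missing.append(module)
    return missing
```

Calling missing_modules(REQUIRED_MODULES) returns the names that still need to be installed with your package manager, PPM, or CPAN.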
You may wish to install the zlib library for compressed index support, to save on space. Disk-space requirements are moderate; if you don't have enough disk to spare, you probably shouldn't be running this.
The SWISH::Filter module is now used for document conversions. Only HTML, TXT, and XML are natively supported; everything else requires a converter. To add support for another format, add a module to the SWISH/Filters directory; see http://swish-e.com/docs/filter.html#writing_filters for more details. If you install one of the packages below or write your own filter after indexing, you will need to either run a full reindex with -F or update the modification times (see touch(1)) of all of the files that were missed. The following conversions are supported:
File Format        Requirements
-----------        ------------
Microsoft Word     catdoc (for basic text conversion) OR
                   wvWare (for nicer html conversion)
Rich Text Format   rtfreader
Microsoft Excel    Spreadsheet::ParseExcel perl module
Adobe PDF          pdftotext and pdfinfo (part of the xpdf package)
MP3 Audio          MP3::Tag perl module
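If you would rather update modification times than run a full reindex after adding a converter, something along these lines could stand in for touch(1). This is a sketch only: the function name and the choice to match files by extension are assumptions for illustration, not anything Baagle provides.

```python
import os
from pathlib import Path


def touch_matching(root, extensions):
    """Set the modification time of matching files under `root` to now.

    Returns the touched paths, sorted, so the caller can review them.
    """
    wanted = {ext.lower() for ext in extensions}
    touched = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in wanted:
            os.utime(path, None)  # None means "use the current time"
            touched.append(path)
    return sorted(touched)
```

For example, touch_matching("/home/me/docs", [".pdf"]) after installing xpdf would make every PDF under that (hypothetical) directory look newly modified, so the next incremental run should pick them up.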
Installation
- Copy baagle.conf.sample to baagle.conf and edit it. At a minimum you will need to set SWISH_E, SWISH_PERL_LIB, and SEARCH_DIRS or WEB_HISTORY (or both), and maybe PORT to choose a different server port. If you really want to power-use, you can set OPENERS to configure programs for the system to run for you when you click on certain files.
- Run ./indexer (Windows: run "perl indexer")
- (Optionally) put entries in your crontab to rerun indexer whenever you'd like; I'd suggest something like this:

  42 */2 * * * /path/to/this/dir/indexer >/dev/null 2>&1
  12 2 * * 0 /path/to/this/dir/indexer -F >/dev/null 2>&1

  That will run an incremental index (very fast) once every other hour, and a full index once a week. If you have frequent changes to a small set of files, you may wish to increase the frequency of the first; if you have a large number of changes, you may need to increase the frequency of the second. If you simply have a large dataset, you may wish to decrease the frequency of both.
  Windows: Uh, I don't know... some sort of Scheduled Task or something?
- Run ./server (Windows: run "perl server")
- Point your browser at http://localhost:2986/ (if you changed PORT from the default, use that in place of 2986). You probably want to bookmark this URL.
- To kill the server, go to http://localhost:2986/quit
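If you want to stop the server from a script rather than a browser, a sketch like this would do the same thing as visiting the /quit URL. The helper names are made up for illustration; only the default port 2986 and the /quit path come from the steps above.

```python
from urllib.request import urlopen

DEFAULT_PORT = 2986  # the default port named in the steps above


def server_url(path="", port=DEFAULT_PORT):
    """Build a URL for the local server, e.g. server_url("quit")."""
    return "http://localhost:%d/%s" % (port, path.lstrip("/"))


def stop_server(port=DEFAULT_PORT, timeout=5):
    """Request the /quit URL; return True if the server answered."""
    try:
        with urlopen(server_url("quit", port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and HTTP error responses.
        return False
```

If you changed PORT in baagle.conf, pass the same value as the port argument.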
Notes
This is designed for single-user systems. If you run the indexer as yourself, information from files only readable by you will be saved. If you run the server as yourself, any security problems that crop up in the server will run as you. If the server port is accessible by other people, and you have LOCALHOST_ONLY set to false (not the default), other people may be able to access your index. If other people have local access to your machine, they will be able to access your index, period.
The indexer will always build a full index instead of an incremental when:
- The baagle.conf file has been modified (the most common case, I'd imagine)
- The indexer script has been modified (but you shouldn't really have to do this)
- The swish-e.conf file has been modified (you also shouldn't have to do this)
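One way a check like this could work is to compare each file's modification time against the time of the last index run. The sketch below illustrates that idea only; it is not necessarily how the indexer script does it, and the stamp file that records the last run is hypothetical.

```python
import os

# File names come from the list above; the stamp file is hypothetical.
TRIGGER_FILES = ["baagle.conf", "indexer", "swish-e.conf"]


def needs_full_index(stamp_path, trigger_paths=TRIGGER_FILES):
    """Return True if any trigger file changed after the last index run.

    `stamp_path` is a hypothetical file whose mtime records the last run.
    """
    if not os.path.exists(stamp_path):
        return True  # never indexed before, so start from scratch
    last_run = os.path.getmtime(stamp_path)
    return any(
        os.path.exists(p) and os.path.getmtime(p) > last_run
        for p in trigger_paths
    )
```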
This whole thing is Copyright (c) 2004 Floating Sheep Studios
Bugs
There's a bug in POE::Component::Server::HTTP which results in most browsers producing doubled log lines for the server. A note and patch are at http://www.mail-archive.com/poe@perl.org/msg02900.html if you care. The next version of PCSH will probably fix this.
To Do
- Clean up summary text some more
- Use SWISH::API after beating ports@freebsd.org over the head (see below)
- Build a scheduler into the server so you don't have to run the indexer out of cron
- Index filenames of files we don't handle, too, so at the very least you can find extensionless files you care about
FAQ
Q) Why are you calling the swish-e binary directly? Why not use SWISH::API?
A) The FreeBSD port of swish-e does not install those modules for some reason and I haven't gotten around to bugging ports@freebsd.org about it (there doesn't appear to be a maintainer, currently). Since I don't have it installed without extra work, it's reasonable to assume other people won't either. Running swish-e by hand is plenty fast enough anyway; this is a server for running on your desktop, not serving hundreds of hits a second.
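For the curious, running the binary directly looks roughly like the sketch below. This is an illustration, not Baagle's actual code: the function names are invented, the index file path is whatever your setup uses, and the parser assumes swish-e's default result line of rank, path, quoted title, and size. The -f (index file), -w (search words), and -m (maximum results) switches are standard swish-e options.

```python
import re
import subprocess

# Assumed default result line: rank, path, quoted title, size in bytes.
RESULT_LINE = re.compile(r'^(\d+) (.+) "(.*)" (\d+)$')


def search_command(query, index_file, swish_e="swish-e", max_results=20):
    """Build the argument list for one search (no shell quoting needed)."""
    return [swish_e, "-f", index_file, "-w", query, "-m", str(max_results)]


def parse_results(output):
    """Extract (rank, path, title, size) tuples, skipping '#' header lines."""
    hits = []
    for line in output.splitlines():
        if line.startswith("#") or line == ".":
            continue
        match = RESULT_LINE.match(line)
        if match:
            rank, path, title, size = match.groups()
            hits.append((int(rank), path, title, int(size)))
    return hits


def search(query, index_file):
    """Run one swish-e search and return the parsed hits."""
    done = subprocess.run(search_command(query, index_file),
                          capture_output=True, text=True)
    return parse_results(done.stdout)
```

Spawning one process per query costs a little startup time, which is negligible for a single desktop user.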
Q) This is kind of lame compared to (Google|Yahoo) Desktop Search. Why did you bother?
A) Because:
- Someone claimed I couldn't do it in a weekend, so of course I had to (that was version 1.0).
- [GY]DS don't run on UNIX boxen.
Q) I have bags of money and want you to do some short-deadline demo-like project for me. Who are you guys?