This repository has been archived by the owner on Sep 7, 2023. It is now read-only.

recoll engine #1257

Closed
wants to merge 22 commits into from

@Yetangitu Yetangitu commented Apr 3, 2018

Recoll is a local search engine based on Xapian:

http://www.lesbonscomptes.com/recoll/

Although Recoll seems to be aimed mostly at desktop users, the engine can be run without any graphical interface or interaction. I never use the GUI tools, and I have been running it for many years over a large (currently ~2TB) document collection.

By itself recoll does not offer web or API access; this can be achieved using recoll-webui:

https://github.com/koniu/recoll-webui

As recoll-webui by default does not support paged JSON results it is advisable to use a patched version which does:

https://github.com/Yetangitu/recoll-webui/tree/jsonpage

(A pull request was sent upstream; if it is merged, the patched version will no longer be needed.)

This engine uses a custom files result template included in this PR. The template currently exists only for the oscar theme using the logicdev style; I can make versions for other themes/styles if this PR goes through.

Use:

set base_url to the location where recoll-webui can be reached
set dl_prefix to a location where the file hierarchy as indexed by recoll can be reached
set search_dir to the part of the indexed file hierarchy to be searched, use an empty string to search the entire search domain

Example from settings.yml:

    # this entry (with search_dir set to an empty string) covers the entire recoll search domain
  - name : library
    engine : recoll
    shortcut : lib
    base_url: 'https://recoll.example.org/'
    search_dir : ''
    dl_prefix : 'https://download.example.org'
    timeout : 30.0
    categories : files

    # this entry only searches the 'reference' directory
  - name : library reference
    engine : recoll
    base_url: 'https://recoll.example.org/'
    search_dir : reference
    dl_prefix : 'https://download.example.org'
    shortcut : libr
    timeout : 30.0
    categories : files
    disabled : True
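
As a rough illustration of how these settings could be combined into a request, here is a minimal sketch. The `json` path and the parameter names `query`, `dir` and `page` are assumptions about the patched recoll-webui, and `build_search_url` is a hypothetical helper, not code from this PR.

```python
from urllib.parse import urlencode


def build_search_url(base_url, search_dir, query, pageno=1):
    """Build a recoll-webui JSON query URL from the engine settings.

    Assumption: the web UI answers on '<base_url>/json' and accepts the
    parameters 'query', 'dir' and 'page'. Check the web UI you actually run.
    """
    args = urlencode({'query': query, 'dir': search_dir, 'page': pageno})
    # Strip a trailing slash so the path does not end up with '//'.
    return base_url.rstrip('/') + '/json?' + args


if __name__ == '__main__':
    print(build_search_url('https://recoll.example.org/', 'reference',
                           'linux kernel', pageno=2))
```

An empty `search_dir` simply yields an empty `dir=` parameter, which matches the "search the entire domain" case described above.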

Example output:

image

BTW, I'm using Searx with a custom theme, so I adapted the oscar theme for this PR. I did not test the adaptations, so if something doesn't work this might be the cause. The result should look more or less like the image above.

   recoll is a local search engine based on Xapian:
   http://www.lesbonscomptes.com/recoll/

   By itself recoll does not offer web or API access,
   this can be achieved using recoll-webui:
   https://github.com/koniu/recoll-webui

   As recoll-webui by default does not support paged JSON
   results it is advisable to use a patched version which does:
   https://github.com/Yetangitu/recoll-webui/tree/jsonpage
   A pull request was sent upstream, if this is merged the patched
   version is no longer needed

   This engine uses a custom 'files' result template

   set base_url to the location where recoll-webui can be reached
   set dl_prefix to a location where the file hierarchy as indexed by recoll can be reached
   set search_dir to the part of the indexed file hierarchy to be searched, use an empty string to search the entire search domain
patched recoll-webui supports paged JSON on that endpoint
@Yetangitu
Contributor Author

FYI, the Filesize reported by Recoll does not always correspond to the actual file size. This problem lies two steps upstream (engine -> recoll-webui -> Recoll), so it is not something I can easily fix here. The discrepancy seems to arise from the way Recoll handles compound file types such as epub (i.e. compressed container files containing multiple individual files representing pages or chapters): it reports the size of the contained file instead of the containing file. For these files the Type will also be reported incorrectly; for epub containers it will generally report Type text/html.

Review thread on searx/settings.yml (outdated)
Review thread on searx/engines/recoll.py (outdated)
@Yetangitu
Contributor Author

URL handling sanitised and engine disabled by default

@Yetangitu
Contributor Author

Yetangitu commented Apr 8, 2018

BTW, if this PR is to be merged I - or someone else - should add template/style support for the 'files' template to the other styles as it currently only works as intended in oscar/logicdev. I'll only spend time doing so if it will be merged as I use searx with a custom theme and style.

@Yetangitu
Contributor Author

The commits above add number_of_results to the engine output. This only works in combination with a patched version of recoll-webui, as the standard version doesn't produce result numbers.
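
A minimal sketch of what handling such a reply could look like follows. The JSON keys `results` and `nres` are hypothetical placeholders for whatever the patched recoll-webui emits, and `parse_response` is an illustrative helper. The `{'number_of_results': N}` entry follows the convention searx engines use for reporting totals, as far as I know.

```python
import json


def parse_response(text):
    """Split a JSON reply into result rows plus an optional total count.

    Assumption: the reply looks like {"results": [...], "nres": N}.
    Both key names are placeholders, not a documented format.
    """
    data = json.loads(text)
    results = list(data.get('results', []))
    total = data.get('nres')
    if total is not None:
        # Only emitted when the web UI supplies a count (patched version).
        results.append({'number_of_results': int(total)})
    return results


if __name__ == '__main__':
    print(parse_response('{"results": [{"label": "a"}], "nres": 12}'))
```

With an unpatched web UI the count is simply absent and only the result rows are returned.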

It looks like recoll-webui is unmaintained (no commits since Sept. 2016) so I won't hold my breath for the PR to be merged.

@Yetangitu
Contributor Author

PS: please hold off on merging for a bit. I'm adding some features at the moment (embedded preview), and one part of the code (the download logic) turns out to be specific to my network and needs to be generalised.

    * add mount_prefix parameter, set this to location where _local_
      filesystem covered by index is mounted, used to create
      download path, see explanation in settings.yml
    * add preview support for audio, video and image types
 - settings.yml:
    * add mount_prefix plus explanation on how to use it
 - templates/.../files.html
    * add generic media preview support
@Yetangitu
Contributor Author

The commit above adds preview support for audio, video and image types. It also adds a mandatory mount_prefix parameter to settings.yml with explanation on how to use it.
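
To illustrate how mount_prefix and dl_prefix could work together, here is a minimal sketch under stated assumptions: it assumes the web UI returns `file://` URLs for indexed documents, and both helper names (`download_url`, `preview_kind`) are hypothetical rather than taken from the engine.

```python
def download_url(file_url, mount_prefix, dl_prefix):
    """Map a local file:// URL from the index onto a download URL.

    Assumption: results carry 'file://<mount_prefix>/<relative path>' URLs.
    Returns None when the file lies outside the mounted hierarchy.
    """
    local_root = 'file://' + mount_prefix.rstrip('/')
    if not file_url.startswith(local_root):
        return None
    rest = file_url[len(local_root):]
    # Reject look-alike prefixes such as /export/doc vs /export/documents.
    if rest and not rest.startswith('/'):
        return None
    return dl_prefix.rstrip('/') + rest


def preview_kind(mime_type):
    """Return 'audio', 'video' or 'image' for previewable types, else None."""
    kind = mime_type.split('/', 1)[0]
    return kind if kind in ('audio', 'video', 'image') else None


if __name__ == '__main__':
    print(download_url('file:///export/documents/reference/a.pdf',
                       '/export/documents', 'https://download.example.org'))
    print(preview_kind('audio/mpeg'))
```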

Review thread on searx/settings.yml (outdated)
@kvch
Member

kvch commented Apr 13, 2018

@Yetangitu What are the dependencies of the webui? I am trying to start it, but it fails with ImportError: No module named recoll. However, I have installed python-recoll using my package manager.

@Yetangitu
Contributor Author

Yetangitu commented Apr 13, 2018

Try to run the webui by itself first using webui-standalone.py; this will produce somewhat more informative error messages. To run it in any sensible way you'll want a working Recoll installation (with the python-recoll module included) for it to dig through. Apart from that it just depends on Python (2.x) plus some common modules, optionally complemented by ujson for somewhat faster JSON handling.

I.e. install Recoll, have it index something, point the webui at this config and it should work. You don't need to start the Recoll GUI (I never touch it) to get it going. Here's a quick example Recoll config to get started:

https://gist.github.com/Yetangitu/1bb4c5cd4b35e2911123d71b6ca3cc1c

Dump it in a directory (preferably hosted on SSD) which has enough capacity to hold the expected index files; these can become quite big when indexing a large set.

To actually have Recoll index the contents of the files you'll want to run the recollindex tool which is part of the Recoll package. I find it easiest to run it from a script:

https://gist.github.com/Yetangitu/dbb624db032fc217cf97898008a27a71

@Yetangitu
Contributor Author

Any progress on deciding whether to merge this PR? I've been using it for a long time and I assume this functionality can be of benefit to others who maintain a large document collection.

@return42
Contributor

return42 commented Jan 6, 2020

@kvch do you have time to continue with your review?

@Yetangitu
Contributor Author

Yetangitu commented May 16, 2020

FYI, I submitted a PR to the Python3-version of the Recoll web interface to make it work better in combination with this change - it allows the use of more than one Recoll instance in a query. (edit: the PR was merged)

For those using this engine it may be interesting to review a bug report I submitted to the Recoll project related to degraded indexing and searching performance in later versions using a specific database format. The bug report contains a suggestion on how to work around the issue.


# helper functions
def get_time_range(time_range):
    sw = {
Member

This could be a global variable.
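
One way to read this suggestion is sketched below: the lookup table moves to module level so it is built once at import time rather than on every call. The excerpt above is truncated, so the keys and offsets in `TIME_RANGE_OFFSETS`, the constant's name, and the extra `today` parameter are illustrative assumptions, not the PR's actual contents.

```python
from datetime import date, timedelta

# Module-level constant, built once at import time.
# Keys and offsets are illustrative assumptions (the original dict is elided).
TIME_RANGE_OFFSETS = {
    'day': timedelta(days=1),
    'week': timedelta(days=7),
    'month': timedelta(days=30),
    'year': timedelta(days=365),
}


def get_time_range(time_range, today=None):
    """Return the ISO start date for a named time range, or '' if unknown."""
    offset = TIME_RANGE_OFFSETS.get(time_range)
    if offset is None:
        return ''
    today = today or date.today()
    return (today - offset).isoformat()


if __name__ == '__main__':
    print(get_time_range('week'))
```

Passing `today` explicitly is only there to make the function easy to test; callers can omit it.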

@kvch
Member

kvch commented Oct 8, 2020

@Yetangitu Could you please rebase your PR? After that, I will approve it and merge it to master.

@kvch kvch mentioned this pull request Nov 19, 2020
@dalf dalf closed this in #2325 Nov 30, 2020
ameisehaufen pushed a commit to ameisehaufen/kmrecollwebui that referenced this pull request Jan 4, 2021
This change adds the possibility to run queries over more than one database by pointing the program at extra recoll configuration directories using the RECOLL_EXTRACONFDIR environment variable. This variable can contain space-separated recoll configuration directories (i.e. directories which contain `recoll.conf`) which are parsed to find out the indexed *topdirs* and the location of the database directory. The _topdirs_ are added to the directory tree, the databases are added to the `extradbs` list. When running a query over the entire tree (using `<all>`, the default) all databases are searched. When the query is limited to a subdirectory the searched set is limited to only those databases which cover the related _topdir_, thus reducing search time and overhead.

The raison d'être for this change is to allow the web interface to be used to search a large index split over several databases, e.g. _fiction_, _nonfiction_ and _audio_. This in turn is used in the _recoll engine_ for the _Searx_ meta-search engine, see searx/searx#1257.

This is a further development of an earlier change I submitted to Github, most of which was merged but for the extra databases.
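
The selection logic described in this commit message can be sketched roughly as follows. This is not the code from the referenced commit: the helper names are hypothetical, and the mapping from database path to _topdirs_ is assumed to have been read from each `recoll.conf` beforehand.

```python
import os
import posixpath


def extra_confdirs(environ=None):
    """Split RECOLL_EXTRACONFDIR into a list of configuration directories."""
    environ = os.environ if environ is None else environ
    return environ.get('RECOLL_EXTRACONFDIR', '').split()


def databases_for(query_dir, topdirs_by_db):
    """Pick the databases whose topdirs cover query_dir.

    topdirs_by_db maps a database path to the topdirs it indexes
    (assumed to be parsed from recoll.conf elsewhere).
    '<all>' selects every database.
    """
    if query_dir == '<all>':
        return list(topdirs_by_db)
    wanted = posixpath.normpath(query_dir)
    selected = []
    for db, topdirs in topdirs_by_db.items():
        for top in topdirs:
            top = posixpath.normpath(top)
            # Match the topdir itself or anything below it.
            if wanted == top or wanted.startswith(top + '/'):
                selected.append(db)
                break
    return selected


if __name__ == '__main__':
    dbs = {'/db/fiction': ['/srv/lib/fiction'], '/db/audio': ['/srv/lib/audio']}
    print(databases_for('/srv/lib/fiction/scifi', dbs))
```

Limiting a query to one subdirectory then touches only the databases that cover it, which is the reduction in search time and overhead the commit message describes.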