recoll engine #1257

Yetangitu · 2018-04-03T22:10:42Z

Recoll is a local search engine based on Xapian:

http://www.lesbonscomptes.com/recoll/

Although Recoll seems to be mostly aimed at desktop users, the engine can be run without any graphical interface or interaction. I never use the GUI tools, and I have been running it for many years over a large (currently ~2TB) document collection.

By itself Recoll does not offer web or API access; this can be achieved using recoll-webui:

https://github.com/koniu/recoll-webui

As recoll-webui does not support paged JSON results by default, it is advisable to use a patched version which does:

https://github.com/Yetangitu/recoll-webui/tree/jsonpage

(A pull request was sent upstream; if this is merged, the patched version is no longer needed.)

This engine uses a custom files result template included in this PR (only for the oscar theme using the logicdev style; I can make versions for other themes/styles if this PR goes through).

Use:

set base_url to the location where recoll-webui can be reached
set dl_prefix to a location where the file hierarchy as indexed by recoll can be reached
set search_dir to the part of the indexed file hierarchy to be searched; use an empty string to search the entire search domain

Example from settings.yml:

  # this entry (with search_dir set to an empty string) covers the entire recoll search domain
  - name : library
    engine : recoll
    shortcut : lib
    base_url : 'https://recoll.example.org/'
    search_dir : ''
    dl_prefix : 'https://download.example.org'
    timeout : 30.0
    categories : files

  # this entry only searches the 'reference' directory
  - name : library reference
    engine : recoll
    base_url : 'https://recoll.example.org/'
    search_dir : reference
    dl_prefix : 'https://download.example.org'
    shortcut : libr
    timeout : 30.0
    categories : files
    disabled : True
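
A minimal sketch of how base_url and search_dir could come together when a query is sent to recoll-webui. The `json` endpoint path and the `query`, `dir` and `page` parameter names are assumptions made for illustration, and `build_search_url` is a hypothetical helper, not code taken from this PR.

```python
from urllib.parse import urlencode


def build_search_url(base_url, search_dir, query, pageno=1):
    """Build a request URL for a recoll-webui instance (hypothetical helper).

    The 'json' path and the parameter names are assumptions.
    """
    # make sure base_url ends in exactly one slash before appending the path
    base = base_url.rstrip('/') + '/'
    # an empty search_dir stands for the entire search domain
    args = urlencode({'query': query, 'dir': search_dir, 'page': pageno})
    return base + 'json?' + args


if __name__ == '__main__':
    print(build_search_url('https://recoll.example.org/', 'reference', 'xapian'))
```

Normalising the trailing slash in one place means base_url can be configured with or without it.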

Example output:

BTW, I'm using Searx with a custom theme, so I adapted the oscar theme for this PR. I did not test the adaptations, so if something doesn't work this might be the cause. The result should more or less look like the image above.


Yetangitu · 2018-04-05T15:43:33Z

FYI, the Filesize reported by Recoll does not always correspond to the actual file size. This problem lies two steps upstream (engine -> recoll-webui -> Recoll), so it is not something I can easily fix here. The discrepancy seems to arise from the way Recoll handles compound file types (i.e. compressed container files containing multiple individual files representing pages or chapters) like epub: it reports the size of the contained file instead of the containing file. For these files the Type will also be reported incorrectly; for epub containers it will generally report Type text/html.


Yetangitu · 2018-04-07T22:50:46Z

URL handling sanitised and engine disabled by default

Yetangitu · 2018-04-08T10:53:55Z

BTW, if this PR is to be merged I - or someone else - should add template/style support for the 'files' template to the other styles as it currently only works as intended in oscar/logicdev. I'll only spend time doing so if it will be merged as I use searx with a custom theme and style.


Yetangitu · 2018-04-08T18:43:58Z

The commits above add number_of_results to the engine output. This only works in combination with a patched version of recoll-webui, as the standard version doesn't produce result numbers.
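
A minimal sketch of such a guard, assuming the JSON response is a dictionary in which the patched recoll-webui includes an `nres` key (the key name is taken from the commit message in this PR; the rest of the response shape and the helper name `result_count` are assumptions):

```python
def result_count(response_json):
    """Return a number_of_results entry, or None when 'nres' is absent.

    An unpatched recoll-webui is assumed to omit the key entirely.
    """
    nres = response_json.get('nres')
    if nres is None:
        return None
    return {'number_of_results': int(nres)}


if __name__ == '__main__':
    print(result_count({'results': [], 'nres': 42}))
    print(result_count({'results': []}))
```

Returning None rather than raising lets the caller simply skip the entry when the standard recoll-webui is in use.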

It looks like recoll-webui is unmaintained (no commits since Sept. 2016) so I won't hold my breath for the PR to be merged.

Yetangitu · 2018-04-08T23:08:23Z

PS: please hold off on merging for a bit. I'm adding some features at the moment (embedded preview), and one part of the code, the download logic, turns out to be specific to my network and needs to be generalised.

* add mount_prefix parameter, set this to location where _local_ filesystem covered by index is mounted, used to create download path, see explanation in settings.yml
* add preview support for audio, video and image types
- settings.yml:
  * add mount_prefix plus explanation on how to use it
- templates/.../files.html
  * add generic media preview support

Yetangitu · 2018-04-09T07:34:44Z

The commit above adds preview support for audio, video and image types. It also adds a mandatory mount_prefix parameter to settings.yml with explanation on how to use it.
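
A rough sketch of how mount_prefix and dl_prefix might combine into a download link, and how a MIME type could select a preview. It assumes results carry `file://` URLs rooted at mount_prefix; the function names and example paths are hypothetical.

```python
def download_url(file_url, mount_prefix, dl_prefix):
    """Map a file:// URL reported by the index to a download URL."""
    local = 'file://' + mount_prefix
    if not file_url.startswith(local):
        # not below the mounted tree; leave the URL untouched
        return file_url
    return dl_prefix + file_url[len(local):]


def preview_kind(mime_type):
    """Return 'audio', 'video' or 'image' if a preview applies, else None."""
    main_type = mime_type.split('/', 1)[0]
    return main_type if main_type in ('audio', 'video', 'image') else None


if __name__ == '__main__':
    print(download_url('file:///export/library/reference/a.pdf',
                       '/export/library', 'https://download.example.org'))
    print(preview_kind('audio/mpeg'))
```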


kvch · 2018-04-13T19:59:03Z

@Yetangitu What are the dependencies of the webui? I am trying to start it, but it fails with ImportError: No module named recoll. However, I have installed python-recoll using my package manager.

Yetangitu · 2018-04-13T20:59:30Z

Try to run the webui by itself first using webui-standalone.py; this will produce somewhat more informative error messages. To run it in any sensible way you'll want a working Recoll installation (with the python-recoll module included) for it to dig through. Apart from that it just depends on Python (2.x) plus some common modules, optionally complemented by ujson for somewhat faster JSON handling.

I.e. install Recoll, have it index something, point the webui at this config and it should work. You don't need to start the Recoll GUI (I never touch it) to get it going. Here's a quick example Recoll config to get started:

https://gist.github.com/Yetangitu/1bb4c5cd4b35e2911123d71b6ca3cc1c

Dump it in a directory (preferably hosted on SSD) which has enough capacity to hold the expected index files; these can become quite big when indexing a large set.

To actually have Recoll index the contents of the files you'll want to run the recollindex tool which is part of the Recoll package. I find it easiest to run it from a script:

https://gist.github.com/Yetangitu/dbb624db032fc217cf97898008a27a71
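
As for the ImportError, one possible cause (an assumption, not something confirmed in this thread) is that python-recoll was installed for a different interpreter than the one running the webui. A small check along these lines, run with that same interpreter, shows whether the `recoll` module is visible to it; `module_available` is a hypothetical helper.

```python
import importlib
import sys


def module_available(name):
    """Return True if `name` can be imported by the running interpreter."""
    try:
        importlib.import_module(name)
    except ImportError:
        return False
    return True


if __name__ == '__main__':
    # report which interpreter is in use and whether it can see the module
    print('interpreter: %s' % sys.executable)
    print('recoll importable: %s' % module_available('recoll'))
```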

Yetangitu · 2020-01-03T22:47:00Z

Any progress on deciding whether to merge this PR? I've been using it for a long time and I assume this functionality can be of benefit to others who maintain a large document collection.

return42 · 2020-01-06T08:54:07Z

@kvch do you have time to continue with your review?


Yetangitu · 2020-05-16T22:21:07Z

FYI, I submitted a PR to the Python3-version of the Recoll web interface to make it work better in combination with this change - it allows the use of more than one Recoll instance in a query. (edit: the PR was merged)

For those using this engine it may be interesting to review a bug report I submitted to the Recoll project related to degraded indexing and searching performance in later versions using a specific database format. The bug report contains a suggestion on how to work around the issue.

kvch · 2020-10-08T17:54:07Z

searx/engines/recoll.py

+
+# helper functions
+def get_time_range(time_range):
+    sw = {


This could be a global variable.
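
A sketch of what that could look like, with the mapping moved to module level. The diff excerpt shows only the opening of the dict, so its contents here are assumed; the optional `today` argument is likewise an addition to make the function easy to test.

```python
from datetime import date, timedelta

# hypothetical contents; the diff excerpt shows only the start of the dict
TIME_RANGE_DAYS = {'day': 1, 'week': 7, 'month': 30, 'year': 365}


def get_time_range(time_range, today=None):
    """Return the ISO date that starts the given range, or '' if unknown."""
    days = TIME_RANGE_DAYS.get(time_range)
    if days is None:
        return ''
    today = today or date.today()
    return (today - timedelta(days=days)).isoformat()
```

Defining the dict once at import time avoids rebuilding it on every call and keeps the supported ranges visible at the top of the module.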

kvch · 2020-10-08T17:58:14Z

@Yetangitu Could you please rebase your PR? After that, I will approve it and merge it to master.

This change adds the possibility to run queries over more than one database by pointing the program at extra Recoll configuration directories using the RECOLL_EXTRACONFDIR environment variable. This variable can contain space-separated Recoll configuration directories (i.e. directories which contain `recoll.conf`), which are parsed to find the indexed _topdirs_ and the location of the database directory. The _topdirs_ are added to the directory tree, the databases are added to the `extradbs` list.

When running a query over the entire tree (using `<all>`, the default) all databases are searched. When the query is limited to a subdirectory, the searched set is limited to only those databases which cover the related _topdir_, thus reducing search time and overhead.

The raison d'être for this change is to allow the web interface to be used to search a large index split over several databases, e.g. _fiction_, _nonfiction_ and _audio_. This in turn is used in the _recoll engine_ for the _Searx_ meta-search engine, see searx/searx#1257. This is a further development of an earlier change I submitted to GitHub, most of which was merged except for the extra databases.
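
The selection logic described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from that change: it does not parse `recoll.conf`, but takes an already-built mapping from each _topdir_ to its database directory, and the function names and example paths are hypothetical.

```python
import os
import posixpath


def extra_config_dirs(environ=os.environ):
    """Split RECOLL_EXTRACONFDIR into a list of configuration directories."""
    return environ.get('RECOLL_EXTRACONFDIR', '').split()


def databases_for(query_dir, topdir_to_db):
    """Select the databases whose topdir covers query_dir.

    topdir_to_db maps an indexed topdir to its database directory;
    '<all>' selects every database.
    """
    if query_dir == '<all>':
        return sorted(topdir_to_db.values())
    query = posixpath.normpath(query_dir)
    selected = []
    for topdir, dbdir in topdir_to_db.items():
        top = posixpath.normpath(topdir)
        # match the topdir itself or any path below it, not mere name prefixes
        if query == top or query.startswith(top + '/'):
            selected.append(dbdir)
    return sorted(selected)


if __name__ == '__main__':
    dbs = {'/srv/fiction': '/idx/fiction/xapiandb',
           '/srv/audio': '/idx/audio/xapiandb'}
    print(databases_for('/srv/fiction/scifi', dbs))
```

Comparing against `top + '/'` keeps a query for `/srv/fictional` from matching a topdir named `/srv/fiction`.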

Yetangitu added 4 commits April 3, 2018 23:35

- recoll engine, change search path to generic JSON now that the patched recoll-webui supports paged JSON on that endpoint

b6a4cfc

- recoll.py PEPped up

06d1618

- recoll.py even more PEPped

485779b

Merge branch 'master' into recollengine

6173750

kvch reviewed Apr 7, 2018

searx/settings.yml (outdated, resolved)

kvch reviewed Apr 7, 2018

searx/engines/recoll.py (outdated, resolved)

Yetangitu added 3 commits April 8, 2018 00:43

- recoll.py: sanitize url handling

9e0b6c3

- settings.yml: disable recoll engine by default

c5c7cbe

Merge branch 'recollengine' of github.com:Yetangitu/searx into recoll… …engine remote commit

c537a19

Yetangitu added 2 commits April 8, 2018 19:55

- recoll.py: add number_of_results

b5bcf19

- recoll.py: guard access to nres parameter, unpatched recoll-webui does not produce it.

5c8c852

Yetangitu added 6 commits April 9, 2018 09:36

- recoll.py: type -> mtype

20004f7

- recoll.py: type -> ttype (tag type)

b895fbb

- templates/.../files.html: add (back) mime subtype

ce7fdd6

Merge branch 'master' into recollengine

a28e975

- settings.yml: fix url in config example

3bc0172

Merge branch 'recollengine' of github.com:Yetangitu/searx into recoll… …engine remote commit

1a1f9f1

kvch reviewed Apr 13, 2018

searx/settings.yml (outdated, resolved)

Yetangitu added 2 commits April 13, 2018 23:01

- settings.yml: comment out recoll engine by default

a40bff4

Merge branch 'master' into recollengine

78c9196

asciimoo mentioned this pull request Apr 16, 2018

Consider stop being just a meta search engine #1269

Closed

dalf added the engine label May 12, 2018

Yetangitu added 2 commits August 22, 2018 22:32

Merge branch 'master' into recollengine

f740c9e

Merge branch 'master' into recollengine

c586415

Merge remote-tracking branch 'upstream/master' into recollengine

115e40a

merge upstream


kvch mentioned this pull request Nov 19, 2020

Add recoll engine #2325

Merged

dalf closed this in #2325 Nov 30, 2020

OliveiraHermogenes mentioned this pull request Feb 7, 2021

[feat] recoll: paged json support #2539

Merged
