Use semantic line breaks
gforcada committed Mar 21, 2016
1 parent a04c9dd commit e6eb98d
Showing 22 changed files with 233 additions and 180 deletions.
9 changes: 6 additions & 3 deletions docs/base/dependencies.rst
@@ -2,6 +2,9 @@ Dependencies
------------

Currently we depend on `collective.indexing` as a means to hook into the normal catalog machinery of Plone to detect content changes.
`c.indexing` before version two had some persistent data structures that frequently caused problems when removing the add-on.
These problems have been fixed in version two.
Unfortunately `c.indexing` still has to hook the catalog machinery in various evil ways,
as the machinery lacks the required hooks for its use-case.
Going forward it is expected for `c.indexing` to be merged into the underlying `ZCatalog` implementation,
at which point `collective.solr` can use those hooks directly.
2 changes: 0 additions & 2 deletions docs/base/index.rst
@@ -1,8 +1,6 @@
Basic information on how Solr and the integration of Solr and Plone work
========================================================================



Architecture
------------

44 changes: 31 additions & 13 deletions docs/base/indexing.rst
@@ -3,37 +3,55 @@ Indexing

Solr is not transaction aware, nor does it support any kind of rollback or undo.
We therefore only send data to Solr at the end of any successful request.
This is done via collective.indexing,
a transaction manager and an end-of-request transaction hook.
This means you won't see any changes done to content inside a request when doing Solr searches later on in the same request.
Inside tests you need to either commit real transactions or otherwise flush the Solr connection.
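
For example, a test might commit for real to flush the queued operations to Solr (a minimal sketch; the ``portal`` fixture and the ``solr_search`` helper are assumptions from your own test setup, not part of collective.solr)::

    import transaction

    def test_new_document_is_searchable(portal, solr_search):
        portal.invokeFactory('Document', 'doc', title='My Document')
        # Data only reaches Solr through the end-of-request/transaction
        # hook, so commit explicitly to flush the queued operations.
        transaction.commit()
        assert len(solr_search('My Document')) > 0
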
There's no transaction concept,
so one request doing a search might get some results early on,
then a different request might add new information to Solr.
If the first request is still running and does the same search again, it might get different results, taking the changes from the second request into account.

Solr is not a real-time search engine.
While there's work under way to make Solr capable of delivering real-time results,
there's currently always a certain delay, up to some minutes, from the time data is sent to Solr to when it is available in searches.

Search results are returned in Solr by distinct search threads.
These search threads hold a great number of caches which are crucial for Solr's performance.
When index or unindex operations are sent to Solr,
it will keep those in memory until a commit is executed on its own search index.
When a commit occurs, all search threads and thus all caches are thrown away and new threads are created reflecting the data after the commit.
While there's a certain amount of cache data that is copied to the new search threads,
this data has to be validated against the new index which takes some time.
The `useColdSearcher` and `maxWarmingSearchers` options of the Solr recipe relate to this aspect.
While cache data is copied over and validated for a new search thread, the searcher is `warming up`.
If the warming up is not yet completed, the searcher is considered to be `cold`.
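
In a buildout these two options would be set on the Solr recipe, roughly like the following sketch (consult the recipe's documentation for the exact option names and defaults)::

    [solr-instance]
    recipe = collective.recipe.solrinstance
    # Serve requests from a not-yet-warmed (cold) searcher instead of blocking.
    useColdSearcher = false
    # Limit how many searchers may warm up concurrently.
    maxWarmingSearchers = 4
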

In order to get really good performance out of Solr,
we need to minimize the number of commits against the Solr index.
We can achieve this by turning off `auto-commit` and using `commitWithin` instead.
So we don't send a `commit` to Solr at the end of each index/unindex request on the Plone side.
Instead we tell Solr to commit the data to its index at most after a certain time interval.
Values from one minute up to 15 minutes work well for this interval.
The larger you can make this interval,
the better the performance of Solr will be,
at the cost of search results lagging behind a bit.
In this setup we also need to configure the `autoCommitMaxTime` option of the Solr server,
as `commitWithin` only works for index but not unindex operations.
Otherwise a large number of unindex operations without any intervening index operations would not be reflected in the index for a long time.
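
In Solr's own configuration the `autoCommitMaxTime` option boils down to an ``autoCommit`` section like the following sketch (900000 milliseconds, i.e. 15 minutes, is an example value)::

    <autoCommit>
      <maxTime>900000</maxTime>
    </autoCommit>
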

As a result of all the above,
the Solr index and the Plone site will always have slightly diverging contents.
If you use Solr to do searches, you need to be aware of this,
as you might get results for objects that no longer exist.
So any `brain/getObject` call on the Plone side needs to have error handling code around it, as the object might not be there anymore and traversing to it can throw an exception.
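
A defensive pattern for this could look like the following sketch (which exact exceptions traversal raises depends on your Plone version, so the tuple below is an assumption)::

    def objects_from_brains(brains):
        """Resolve brains to objects, skipping stale Solr entries."""
        objects = []
        for brain in brains:
            try:
                obj = brain.getObject()
            except (AttributeError, KeyError):
                continue  # object was removed after the last Solr commit
            objects.append(obj)
        return objects
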

When adding new content, deleting old content or changing the workflow state of an item,
you will also not see those actions reflected in searches right away,
but only after a delay of at most the `commitWithin` interval.
After a `commitWithin` operation is sent to Solr,
any other operations happening during that time window will be executed after the first interval is over.
So with a 15 minute interval,
if document A is indexed at 5:15,
B at 5:20 and C at 5:35,
both A & B will be committed at 5:30 and C at 5:50.
61 changes: 42 additions & 19 deletions docs/base/searching.rst
@@ -2,10 +2,13 @@ Searching
*********

Information retrieval is a complex science.
We try to give a very brief explanation here;
refer to the literature and documentation of Lucene/Solr for much more detailed information.

If you do searches in normal Plone,
you have a search term and query the SearchableText index with it.
The SearchableText is a simple concatenation of all searchable fields,
by default title, description and the body text.

The default ZCTextIndex in Plone uses a simplified version of the Okapi BM25 algorithm described in papers published in 1998.
It uses two metrics to score documents:
@@ -15,40 +15,60 @@ It uses two metrics to score documents:
Terms only occurring in a few documents are scored higher than those occurring in many documents.

It calculates the sum of these scores for every term common to the query and a given document.
So for a query with two terms,
a document is likely to score higher if it contains both terms,
unless one of them is a very common term and another document contains the rarer term more often.
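
As an illustration of this kind of scoring, here is a deliberately naive tf-idf sum in Python (the underlying idea only, not ZCTextIndex's actual Okapi BM25 implementation)::

    import math

    def score(query_terms, doc, corpus):
        """Sum tf * idf over all terms common to the query and the document."""
        total = 0.0
        for term in set(query_terms):
            tf = doc.count(term)  # term frequency in this document
            if tf == 0:
                continue
            df = sum(1 for d in corpus if term in d)  # document frequency
            total += tf * math.log(len(corpus) / df)  # rare terms score higher
        return total

    # Two tiny "documents" as token lists:
    corpus = [["solr", "plone", "index"], ["plone", "plone", "search"]]
    score(["plone", "solr"], corpus[0], corpus)  # rare "solr" outweighs common "plone"
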

The similarity function used in Solr/Lucene uses a different algorithm,
based on a combination of a boolean and vector space model,
but taking the same underlying metrics into account.
In addition to the term frequency and inverse document frequency, Solr respects some more metrics (combined roughly as in the formula sketched after this list):

- length normalization: The number of all terms in a field.
Shorter fields contribute higher scores compared to long fields.
- boost values: There's a variety of boost values that can be applied, both index-time document boost values as well as boost values per search field or search term.
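
For reference, Lucene's classic practical scoring function combines these metrics roughly as follows (a sketch of the pre-BM25 default similarity; the exact factors vary between Lucene versions):

.. math::

    \mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)^2 \cdot \mathrm{boost}(t) \cdot \mathrm{norm}(t,d)

Here :math:`\mathrm{norm}(t,d)` folds the length normalization and any index-time boosts into a single factor.
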

In its pre 2.0 versions,
collective.solr used a naive approach and mirrored the approach taken by ZCTextIndex.
So it sent each search query as one query and matched it against the full SearchableText field inside Solr.
By doing that, Solr basically used the same algorithm as ZCTextIndex, as it only had one field to match against, containing the entire text.
The only difference was the use of the length normalization,
so shorter documents ranked higher than those with longer texts.
This actually caused search quality to be worse,
as you'd frequently find folders, links or otherwise rather empty documents.
The Okapi BM25 implementation in ZCTextIndex deliberately ignores the document length for that reason.

In order to get good or better search quality from Solr,
we have to query it in a different way.
Instead of concatenating all fields into one big text,
we need to preserve the individual fields and use their intrinsic importance.
We get the main benefit by realizing that matches on the title and description are more important than matches on the body text or other fields in a document.
collective.solr 2.0+ does exactly that by introducing a `search-pattern` to be used for text searches.
In its default form it causes each query to work against the title,
description and full searchable text fields,
boosting the title by a high value and the description by a medium one.
The length normalization already provides an improvement for these fields,
as the title is likely short,
the description a bit longer and the full text even longer.
By using explicit boost values the effect gets to be more pronounced.
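
A pattern of that shape could look roughly like this (an illustrative sketch, not the literal shipped default; ``{value}`` is replaced with the user's search term)::

    +(Title:{value}^5 OR Description:{value}^2 OR SearchableText:{value})
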

If you do custom searches or want to include more fields into the full text search, you need to keep the above in mind.
Simply setting the `searchable` attribute on the schema of a field to `True` will only include it in the big searchable text stream.
If you for example include a field containing tags,
the simple tag names will likely 'drown' in the full body text.
You might instead want to change the search pattern to include the field and potentially put a boost value on it, though it will already be more important as it's likely to be extremely short.
Similarly, extracting the full text of binary files and simply appending it to the search stream might not be the best approach.
You should rather index those in a separate field and then maybe use a boost value of less than one to make the field less important.

Given two documents with the same content,
one as a normal page and one as a binary file,
you'll likely want to find the page first,
as it's faster to access and read than the file.

There's a good number of other improvements you can make using query-time and index-time boost values.
To provide index time boost values,
you can provide a skin script called `solr_boost_index_values` which gets the object to be indexed and the data sent to Solr as arguments and returns a dictionary of field names to boost values for each document.
The safest is to return a boost value for the empty string,
which results in a document boost value.
Field level boost values don't work with all searches,
especially wildcard searches as done by most simple web searches.
The index time boost allows you to implement policies like boosting certain content types over others,
taking into account ratings or number of comments as a measure of user feedback or anything else that can be derived from each content item.
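
Such a skin script could look like the following sketch (the content type check and the boost numbers are made-up examples)::

    ## Script (Python) "solr_boost_index_values"
    ##parameters=data
    # `context` is the object being indexed, `data` the fields sent to Solr.
    boosts = {}
    if getattr(context, 'portal_type', None) == 'News Item':
        # A boost for the empty string becomes a document-level boost.
        boosts[''] = 2.0
    return boosts
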
4 changes: 2 additions & 2 deletions docs/development/TODO.rst
@@ -14,8 +14,8 @@ TODOs:
* evaluate http://www.gnuenterprise.org/~jcater/solr.py as a replacement
(also see http://tinyurl.com/2zcogf)
* evaluate sunburnt as a replacement https://pypi.python.org/pypi/sunburnt
* evaluate mysolr as backend https://pypi.python.org/pypi/mysolr
* implement LocalParams to have a nicer facet view http://wiki.apache.org/solr/SimpleFacetParameters#Multi-Select_Faceting_and_LocalParams
* Use current search view and get rid of ancient search override
* Implement a push-only and read-only mode
* Play nice with eea.facetednavigation
18 changes: 13 additions & 5 deletions docs/features/atomic_updates.rst
@@ -1,13 +1,21 @@
Partial indexing documents (AtomicUpdates)
******************************************

This means whenever possible,
only the necessary/specified attributes get updated in Solr,
and more importantly,
re-indexed by Plone's indexers.

With collective.recipe.solr a new configuration option is introduced,
called ``updateLog``.
``updateLog`` is enabled by default and allows atomic updates.
In detail it adds a new field ``_version_`` to the schema and also adds ``<updateLog />`` to your Solr config.

Furthermore, all your indexes configured in solr.cfg need the ``stored:true`` attribute (except the ``default`` field).
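
An index definition in solr.cfg would then look roughly like this sketch (following the Solr recipe's ``index`` syntax; the field list is an example)::

    index =
        name:Title       type:text stored:true copyfield:default
        name:Description type:text stored:true copyfield:default
        name:default     type:text stored:false
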

See http://wiki.apache.org/solr/Atomic_Updates for details.


Also note that the AtomicUpdate feature is not compatible with the "Index time boost" feature.
You have to decide whether to use atomic updates or index time boosting.
You can enable/disable atomic updates through the collective.solr control panel.
Atomic updates are enabled by default.
12 changes: 8 additions & 4 deletions docs/features/binary.rst
@@ -1,14 +1,18 @@
Indexing binary documents
*************************

At this point collective.solr uses Plone's default capabilities to index binary documents.
It does so via `portal_transforms` and command line tools like `wv2` or `pdftotext`.
Work is under way to expose and use the `Apache Tika`_ Solr integration available via the `update/extract` handler.

Once finished this will speed up indexing of binary documents considerably,
as the extraction will happen out-of-process on the Solr server side.
`Apache Tika`_ also supports a much larger list of formats than can be supported by adding external command line tools.
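
Once available, the handler can be exercised directly against Solr, following the standard `ExtractingRequestHandler` examples (URL, core and parameters depend on your setup)::

    curl "http://localhost:8983/solr/update/extract?literal.id=doc1&commitWithin=900000" \
         -F "myfile=@document.pdf"
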

There is room for more improvements in this area,
as collective.solr will still send the binary data to Solr as part of the end-user request/transaction.
To further optimize this,
Solr index operations can be stored in a task queue as provided by `plone.app.async` or solutions built on top of `Celery`.
This is currently outside the scope of `collective.solr`.

.. _`Apache Tika`: http://tika.apache.org/
6 changes: 4 additions & 2 deletions docs/features/exclude.rst
@@ -3,12 +3,14 @@ Exclude from search and elevation

By default this add-on introduces two new fields to the default content types or any custom type derived from ATContentTypes.

The `showinsearch` boolean field lets you hide specific content items from the search results,
by setting the value to `false`.

The `searchwords` lines field allows you to specify multiple phrases per content item.
A phrase is specified per line.
User searches containing any of these phrases will show the content item as the first result for the search.
This technique is also known as `elevation`.
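
In code the two fields are ordinary content attributes, so using them might look like this sketch (the AT-generated mutator names are assumptions for illustration)::

    # Hide one item from searches, elevate another for a phrase.
    folder['internal-note'].setShowinsearch(False)
    folder['press-release'].setSearchwords(['quarterly results'])
    folder['press-release'].reindexObject()
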

Both of these features depend on the `search-pattern` including the required parts, as the default configuration does.
The `searchwords` approach to elevation doesn't depend on the Solr elevation feature,
as that would require maintaining an XML file as part of the Solr server configuration.
