Rework the Search class and provide a search service #6359

Toflar · 2015-05-04T09:53:28Z

This is a rough concept:

SearchInterface
- index(DocumentInterface $doc)
- indexLazy(DocumentInterface $doc)
- triggerLazyIndex()
- search(QueryInterface $query)
- etc.

DocumentInterface
- getUrl()
- getContent()
- getProtected()
- etc.

Every controller or page type decides on its own whether to index something. It does so by calling either $this->search->index() for immediate indexing, or $this->search->indexLazy() to queue the content in the service, which then performs a bulk insert when triggerLazyIndex() is called.
The AddToSearchIndexListener executes $this->search->triggerLazyIndex().
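The split between immediate and lazy indexing can be sketched as follows. This is only an illustration in Python, not the proposed PHP API: the `Document` dataclass, the `InMemorySearch` backend, the snake_case method names and the plain string query (instead of a QueryInterface) are all assumptions made to keep the example short.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Hypothetical stand-in for DocumentInterface (url, content, protected)."""
    url: str
    content: str
    protected: bool = False


class SearchInterface(ABC):
    """Mirrors the method list from the concept, with names adapted to Python."""

    @abstractmethod
    def index(self, doc: Document) -> None: ...

    @abstractmethod
    def index_lazy(self, doc: Document) -> None: ...

    @abstractmethod
    def trigger_lazy_index(self) -> None: ...

    @abstractmethod
    def search(self, query: str) -> list[Document]: ...


class InMemorySearch(SearchInterface):
    """Toy backend (assumption): keeps documents in a list, matches by substring."""

    def __init__(self) -> None:
        self._indexed: list[Document] = []
        self._pending: list[Document] = []

    def index(self, doc: Document) -> None:
        # Immediate indexing: the document is searchable right away.
        self._indexed.append(doc)

    def index_lazy(self, doc: Document) -> None:
        # Deferred indexing: only queue the document.
        self._pending.append(doc)

    def trigger_lazy_index(self) -> None:
        # Bulk insert of everything queued so far, then clear the queue.
        self._indexed.extend(self._pending)
        self._pending.clear()

    def search(self, query: str) -> list[Document]:
        needle = query.lower()
        return [d for d in self._indexed if needle in d.content.lower()]


search = InMemorySearch()
search.index(Document("/a", "Contao search service"))
search.index_lazy(Document("/b", "Lazy search document"))
before = search.search("search")  # the queued document is not visible yet
search.trigger_lazy_index()       # the step the listener would perform
after = search.search("search")   # both documents are visible now
```

Any other backend would only need to implement the same four methods, which is the point of the interface.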

Advantages over the current implementation:

- We get rid of one more "legacy" class (Search).
- The Search service can easily be replaced by Elasticsearch and Co. All they need to do is implement our SearchInterface.
- It's possible to add content from different content providers for the same URL.
- In theory, rebuilding the index can use sub requests instead of real requests.


Toflar · 2015-05-21T15:19:23Z

Also, drop the current getSearchablePages() behaviour in favour of a simple web crawler (see contao/core#6942). Note that the crawler must update the index on a regular basis via a cron job, and it must also be possible to trigger it manually in the back end somehow.
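To make the crawler idea concrete, here is a minimal sketch in Python rather than PHP. The `crawl` function, the `LinkExtractor` helper, the injected `fetch` callable and the fake `SITE` data are all hypothetical and not part of Contao. It ignores robots.txt, redirects and scheduling; a cron job or a back end action would simply call `crawl()` and pass the pages to the indexer.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href values of all <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url: str, fetch: Callable[[str], str], max_pages: int = 100) -> dict[str, str]:
    """Breadth-first crawl of pages on the start URL's host; returns {url: html}."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    pages: dict[str, str] = {}

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments.
            link, _fragment = urldefrag(urljoin(url, href))
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)

    return pages


# Fake site so the sketch runs without network access.
SITE = {
    "https://example.com/": '<a href="/news.html">News</a> <a href="https://other.example/">Ext</a>',
    "https://example.com/news.html": '<a href="/">Home</a> <a href="#top">Top</a>',
}
pages = crawl("https://example.com/", fetch=lambda url: SITE.get(url, ""))
```

Injecting `fetch` keeps the traversal logic independent of the HTTP client, so the same loop could in theory be driven by sub requests instead of real requests.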

Metis77 · 2017-06-07T13:51:29Z

+1

HolyMacarony · 2019-03-28T01:00:09Z

+1

leofeyer · 2020-06-11T15:44:43Z

This is a long-running task, which will eventually result in the https://github.com/contao/search library.

Description
-----------

This PR is based on #1678

It improves the performance of the search further by storing the words in their own table with a unique index. It also changes the check that ensures that all keywords are matched, which should now be faster and more accurate. This fixes bugs like searching for [`Contao Conta*`](https://contao.org/de/suche.html?keywords=Contao+Conta*)

In my tests with a `tl_search_index` table of about 1.5 million rows, it brought a search like `*foo*` down from 10 seconds to 0.2 seconds, and `*foo* *bar*` from 12 to 0.3 seconds.

#### ToDo (for this pull request)

- [ ] ~~Check if some of the optimizations are bug fixes that need to be added to #1678~~
- [x] Rebase once #1678 got merged upstream
- [x] Update the index process to save the words in the new table
- [x] Check how and when to update the `vectorLength` of the documents

#### ToDo (for a contao/search library)

- [ ] Functional Tests (if possible)
- [ ] Move logic to a search service or library
- [ ] ~~Use doctrine entities instead of DCA~~

#### Further ideas

- [ ] Only store the IDs of the results in the cache JSON and load the text from the database when it is used. Related: https://github.com/contao/core-bundle/issues/242

### How the search works now (updated 2020-06-25):

1. The `tl_search` table holds all documents (pages) as one row for every document.
2. In `tl_search_words` all the words of the whole corpus are stored (one row per unique word) together with the number of documents the word appears in (document frequency)
3. `tl_search_index` is the connection between words and documents (one row for every unique word/document combination) and stores how often the word appears in the document (term frequency)

When we do an actual search we calculate the similarity between the query and all matching documents using the [cosine similarity algorithm](https://en.wikipedia.org/wiki/Cosine_similarity) with [tf-idf weighted](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) vector values:

$$
\mathrm{Query}_i = \log\left(1 + \frac{N}{n_t}\right) \qquad \mathrm{Document}_i = 1 + \log f_{t,d}
$$

$$
\cos(\phi) = \frac{\sum_i \mathrm{Query}_i \times \mathrm{Document}_i}{\sqrt{\sum_i \mathrm{Query}_i^2} \times \sqrt{\sum_i \mathrm{Document}_i^2}}
$$

This formula results in a similarity score between `0` (doesn’t match at all) and `1` (exact same words as the query). With *idf* we make sure that rare words in the whole corpus get high weights while very common words get low weights. The `tf` score is used to give more weight to words that appear very often in the same document. The cosine similarity is then used to help normalize the length of the documents. This means that you cannot “trick” the search index by creating a document that just has every word multiple times in it.
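The ranking described above can be tried out on a toy corpus. The following Python sketch is not the SQL the PR uses: the in-memory dictionaries merely imitate the three tables, and it assumes the usual tf-idf reading of the symbols (N is the number of documents, n_t the number of documents containing the term, f_t,d the term's count in the document) as well as the natural logarithm, none of which the description spells out.

```python
import math
from collections import Counter

# Toy corpus; each key plays the role of a row in tl_search.
documents = {
    1: "contao search index contao",
    2: "search service for contao",
    3: "unrelated page about cats",
}

# Term frequency per (document, word), analogous to tl_search_index.
term_freq = {doc_id: Counter(text.split()) for doc_id, text in documents.items()}

# Document frequency per word, analogous to tl_search_words.
doc_freq = Counter(word for counts in term_freq.values() for word in counts)

N = len(documents)


def query_weight(word: str) -> float:
    # Query_i = log(1 + N / n_t); words missing from the corpus get weight 0.
    n_t = doc_freq.get(word, 0)
    return math.log(1 + N / n_t) if n_t else 0.0


def document_weight(doc_id: int, word: str) -> float:
    # Document_i = 1 + log(f_td); 0 if the word does not occur in the document.
    f_td = term_freq[doc_id].get(word, 0)
    return 1 + math.log(f_td) if f_td else 0.0


def vector_length(doc_id: int) -> float:
    # Euclidean length of the document vector over all of its words.
    return math.sqrt(sum(document_weight(doc_id, w) ** 2 for w in term_freq[doc_id]))


def relevance(query: str, doc_id: int) -> float:
    words = set(query.split())
    dot = sum(query_weight(w) * document_weight(doc_id, w) for w in words)
    query_length = math.sqrt(sum(query_weight(w) ** 2 for w in words))
    if dot == 0 or query_length == 0:
        return 0.0  # no overlap; also avoids a division by zero
    return dot / (query_length * vector_length(doc_id))


scores = {doc_id: relevance("contao search", doc_id) for doc_id in documents}
ranking = sorted(scores, key=scores.get, reverse=True)
```

With this corpus, document 1 ranks above document 2 because "contao" appears twice in it, and document 3 scores zero. Precomputing `vector_length` per document is presumably what the `vectorLength` column is for, since it depends only on the document and not on the query.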

Commits
-------

f0955a7 Improve search query performance
0fcb18d Only count wildcard matches if necessary
20e8880 Move search words to their own table
7f0d95e Improve search performance
da22036 Use cosine similarity to rank search results
e84a6af Rename cosineSimilarity back to relevance
b9db167 Use MySQL variables to prevent multiple count computations
4a48b46 Coding style
803855a Fix division by zero
bb4cb66 Adapt indexing to the new data structure
d14eb23 Drop search tables instead of migrating the data
e7bb1f0 Remove obsolete language column
08350d0 Add unique index for word and pid
4f48c96 Fix division by zero
a8b7052 Merge branch master into feature/efficient-search-storage
        Conflicts: core-bundle/src/Resources/contao/library/Contao/Search.php
67fa2c7 Remove unnecessary default values
99c999d Update vectorLength of 100 random documents when indexing
51f0dd0 Coding style
c29db11 Coding style
3b6c19c Comment the vector length update process
fba1e98 Rename tl_search_words to tl_search_term
463ed5b Rename tl_search_words to tl_search_term
debb634 Ensure that the relevance is always above zero
bf86114 CS fixes
a9698c2 Added missing default value for vectorLength
cd2ae6e Also delete search entries from the tl_search_term table
235bdaf Add tl_search_term to maintenance description
3431d42 Fix syntax error
6fe7dc9 Use contao.search.indexer service to purge deleted pages
8e513c4 Fix missing group by clause
d3ec34d Cast integer terms to string
14098c9 Fix unsigned value is out of range error
52b69b8 Try to prevent deadlocks
86e7d73 Fix concurrent indexing of the same page
a0e28fa Fix duplicate error for tl_search_index termId-pid
7e96c37 Add index for documentFrequency to prevent deadlocks
dbc587b Remove obsolete index
dc23f81 Try to fix another deadlock
fb90e2f Revert "Try to fix another deadlock"
        This reverts commit dc23f81.
84c7e43 Fix bug with division by zero
6053561 Lock tables to prevent deadlocks
7a3009e CS fixes
8450dcf Merge branch 'master' into feature/efficient-search-storage


leofeyer added the feature label May 15, 2015

leofeyer assigned Toflar May 15, 2015

Toflar mentioned this issue May 21, 2015

interne Suchfunktion / Suchindex (internal search function / search index) contao/core#6942

Closed

ausi mentioned this issue Jun 12, 2020

Make the search storage more efficient #1679

Merged

8 tasks

leofeyer transferred this issue from contao/core-bundle Sep 1, 2023
