Rework the Search class and provide a search service #6359
Also, drop the current
+1
+1
This is a long-running task, which will eventually result in the https://github.com/contao/search library.
leofeyer referenced this issue Jul 3, 2020
Description
-----------

This PR is based on #1678

It improves the performance of the search further by storing the words in their own table with a unique index. It also changes how the check works that ensures that all keywords are matched. Should be faster now and also more accurate. Fixes bugs like searching for [`Contao Conta*`](https://contao.org/de/suche.html?keywords=Contao+Conta*)

In my tests with a `tl_search_index` table with about 1.5 million rows it took down a search like `*foo*` from 10 seconds to 0.2 seconds, and `*foo* *bar*` from 12 to 0.3 seconds.

#### ToDo (for this pull request)

- [ ] ~~Check if some of the optimizations are bug fixes that need to be added to #1678~~
- [x] Rebase once #1678 got merged upstream
- [x] Update the index process to save the words in the new table
- [x] Check how and when to update the `vectorLength` of the documents

#### ToDo (for a contao/search library)

- [ ] Functional Tests (if possible)
- [ ] Move logic to a search service or library
- [ ] ~~Use doctrine entities instead of DCA~~

#### Further ideas

- [ ] Only store the IDs of the results in the cache JSON and load the text from the database when it is used. Related: https://github.com/contao/core-bundle/issues/242

### How the search works now (updated 2020-06-25):

1. The `tl_search` table holds all documents (pages) as one row for every document.
2. In `tl_search_words` all the words of the whole corpus are stored (one row per unique word) together with the number of documents the word appears in (document frequency)
3. `tl_search_index` is the connection between words and documents (one row for every unique word/document combination) and stores how often the word appears in the document (term frequency)

When we do an actual search we calculate the similarity between the query and all matching documents using the [cosine similarity algorithm](https://en.wikipedia.org/wiki/Cosine_similarity) with [tf-idf weighted](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) vector values:

```math
\begin{aligned}
\mathrm{Query}_i &= \log\left(1 + \frac{N}{n_t}\right) \\
\mathrm{Document}_i &= 1 + \log f_{t,d} \\
\cos(\phi) &= \frac{\sum_i \mathrm{Query}_i \times \mathrm{Document}_i}{\sqrt{\sum_i \mathrm{Query}_i^2} \times \sqrt{\sum_i \mathrm{Document}_i^2}}
\end{aligned}
```

This formula results in a similarity score between `0` (doesn’t match at all) and `1` (exact same words as the query). With *idf* we make sure that rare words in the whole corpus get high weights while very common words get low weights. The `tf` score is used to give more weight to words that appear very often in the same document. The cosine similarity is then used to help normalize the length of the documents. This means that you cannot “trick” the search index by creating a document that just has every word multiple times in it.
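As a rough illustration of the weighting described above, here is a minimal in-memory sketch in Python. It is not the PR's code: the PR works in PHP and SQL against the three tables, while this sketch keeps everything in dictionaries and recomputes the document vector length on every call instead of storing a `vectorLength`. All function names (`build_index`, `query_weight`, `document_weight`, `cosine_similarity`, `rank`) are hypothetical, and whitespace tokenization is an assumption made for brevity.

```python
import math
from collections import Counter


def build_index(documents):
    """Build per-document term counts and corpus-wide document frequencies.

    `documents` maps a document id to its text. Tokenization by whitespace
    is an assumption of this sketch, not Contao's behavior.
    """
    term_frequency = {
        doc_id: Counter(text.lower().split())
        for doc_id, text in documents.items()
    }
    document_frequency = Counter()
    for counts in term_frequency.values():
        # Each document contributes at most 1 per distinct term.
        document_frequency.update(counts.keys())
    return term_frequency, document_frequency


def query_weight(term, document_frequency, total_documents):
    """Query_i = log(1 + N / n_t); unknown terms get weight 0."""
    n_t = document_frequency.get(term, 0)
    if n_t == 0:
        return 0.0
    return math.log(1 + total_documents / n_t)


def document_weight(count):
    """Document_i = 1 + log(f_t,d) for terms that occur in the document."""
    return 1 + math.log(count) if count > 0 else 0.0


def cosine_similarity(query_terms, doc_counts, document_frequency, total_documents):
    """Cosine of the angle between the query vector and the document vector."""
    query = {
        term: query_weight(term, document_frequency, total_documents)
        for term in set(query_terms)
    }
    doc = {term: document_weight(count) for term, count in doc_counts.items()}

    dot = sum(weight * doc.get(term, 0.0) for term, weight in query.items())
    query_length = math.sqrt(sum(w * w for w in query.values()))
    doc_length = math.sqrt(sum(w * w for w in doc.values()))

    # Guard against division by zero for empty queries or documents.
    if query_length == 0 or doc_length == 0:
        return 0.0
    return dot / (query_length * doc_length)


def rank(query_terms, term_frequency, document_frequency):
    """Return (doc_id, score) pairs with a positive score, best match first."""
    total = len(term_frequency)
    scored = [
        (doc_id, cosine_similarity(query_terms, counts, document_frequency, total))
        for doc_id, counts in term_frequency.items()
    ]
    return sorted((s for s in scored if s[1] > 0), key=lambda s: s[1], reverse=True)
```

Because both vectors are divided by their lengths, a document that only repeats a query word many times scores no higher than one that contains it once, which is the normalization effect the description refers to.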
Commits
-------

- f0955a7 Improve search query performance
- 0fcb18d Only count wildcard matches if necessary
- 20e8880 Move search words to their own table
- 7f0d95e Improve search performance
- da22036 Use cosine similarity to rank search results
- e84a6af Rename cosineSimilarity back to relevance
- b9db167 Use MySQL variables to prevent multiple count computations
- 4a48b46 Coding style
- 803855a Fix division by zero
- bb4cb66 Adapt indexing to the new data structure
- d14eb23 Drop search tables instead of migrating the data
- e7bb1f0 Remove obsolete language column
- 08350d0 Add unique index for word and pid
- 4f48c96 Fix division by zero
- a8b7052 Merge branch master into feature/efficient-search-storage Conflicts: core-bundle/src/Resources/contao/library/Contao/Search.php
- 67fa2c7 Remove unnecessary default values
- 99c999d Update vectorLength of 100 random documents when indexing
- 51f0dd0 Coding style
- c29db11 Coding style
- 3b6c19c Comment the vector length update process
- fba1e98 Rename tl_search_words to tl_search_term
- 463ed5b Rename tl_search_words to tl_search_term
- debb634 Ensure that the relevance is always above zero
- bf86114 CS fixes
- a9698c2 Added missing default value for vectorLength
- cd2ae6e Also delete search entries from the tl_search_term table
- 235bdaf Add tl_search_term to maintenance description
- 3431d42 Fix syntax error
- 6fe7dc9 Use contao.search.indexer service to purge deleted pages
- 8e513c4 Fix missing group by clause
- d3ec34d Cast integer terms to string
- 14098c9 Fix unsigned value is out of range error
- 52b69b8 Try to prevent deadlocks
- 86e7d73 Fix concurrent indexing of the same page
- a0e28fa Fix duplicate error for tl_search_index termId-pid
- 7e96c37 Add index for documentFrequency to prevent deadlocks
- dbc587b Remove obsolete index
- dc23f81 Try to fix another deadlock
- fb90e2f Revert "Try to fix another deadlock" This reverts commit dc23f81.
- 84c7e43 Fix bug with division by zero
- 6053561 Lock tables to prevent deadlocks
- 7a3009e CS fixes
- 8450dcf Merge branch 'master' into feature/efficient-search-storage
leofeyer referenced this issue in contao/core-bundle Jul 3, 2020
AlexejKossmann referenced this issue in AlexejKossmann/contao Apr 6, 2021
This is a rough concept:

Every controller or page type decides on its own whether to index something, and does so by calling either `$this->search->index()` for immediate indexing or `$this->search->indexLazy()` to add the content to the service and perform a bulk insert when `triggerLazyIndex()` is called. The `AddToSearchIndexListener` executes `$this->search->triggerLazyIndex()`.

Advantages over the current implementation:

- […] `Search`).
- […] `SearchInterface`.
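The immediate versus lazy indexing idea from the concept above could look roughly like the following Python sketch. It only illustrates the control flow. The class `InMemorySearch`, the `Document` type, and the snake_case method names are hypothetical stand-ins for the `index()`, `indexLazy()` and `triggerLazyIndex()` calls named in the issue, and a dictionary stands in for the search tables.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical value object for something a controller wants indexed."""
    url: str
    content: str


class InMemorySearch:
    """Hypothetical search service with immediate and lazy indexing."""

    def __init__(self):
        self.indexed = {}      # url -> content; stands in for the search tables
        self._pending = []     # documents queued by index_lazy()
        self.bulk_inserts = 0  # number of bulk writes performed so far

    def index(self, document):
        """Index a single document right away."""
        self.indexed[document.url] = document.content

    def index_lazy(self, document):
        """Queue a document; nothing is written until trigger_lazy_index()."""
        self._pending.append(document)

    def trigger_lazy_index(self):
        """Write all queued documents in one bulk step and return the count."""
        if not self._pending:
            return 0
        batch, self._pending = self._pending, []
        self.indexed.update({doc.url: doc.content for doc in batch})
        self.bulk_inserts += 1
        return len(batch)


# Usage: controllers queue content during the request; a listener flushes once.
search = InMemorySearch()
search.index_lazy(Document("/news/1", "First article"))
search.index_lazy(Document("/news/2", "Second article"))
search.trigger_lazy_index()
```

In this sketch the flush call plays the role the issue assigns to `AddToSearchIndexListener`: however many controllers queue content during a request, the store is written to once.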