Rework the Search class and provide a search service #6359

Open
Toflar opened this issue May 4, 2015 · 4 comments

@Toflar
Member

Toflar commented May 4, 2015

This is a rough concept:

  • SearchInterface
    • index(DocumentInterface $doc)
    • indexLazy(DocumentInterface $doc)
    • triggerLazyIndex()
    • search(QueryInterface $query)
    • etc.
  • DocumentInterface
    • getUrl()
    • getContent()
    • getProtected()
    • etc.

Every controller or page type decides on its own whether to index something. It calls either $this->search->index() to index immediately or $this->search->indexLazy() to queue the content in the service, which then does a bulk insert when triggerLazyIndex() is called.
The AddToSearchIndexListener executes $this->search->triggerLazyIndex().
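
The immediate versus lazy indexing flow can be illustrated with a minimal sketch. It is written in Python rather than PHP and is not the proposed Contao API: `Document`, `InMemorySearch`, the snake_case method names and the plain-string query (instead of a QueryInterface) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Hypothetical stand-in for DocumentInterface."""
    url: str
    content: str
    protected: bool = False


class InMemorySearch:
    """Hypothetical backend; a real one could wrap a database or Elasticsearch."""

    def __init__(self):
        self._indexed = {}   # url -> list of documents (several providers per URL)
        self._pending = []   # documents queued by index_lazy()

    def index(self, doc):
        # Immediate indexing: the document is searchable right away.
        self._indexed.setdefault(doc.url, []).append(doc)

    def index_lazy(self, doc):
        # Only queue the document; nothing is searchable yet.
        self._pending.append(doc)

    def trigger_lazy_index(self):
        # Bulk step: index everything that was queued, then clear the queue.
        for doc in self._pending:
            self.index(doc)
        self._pending.clear()

    def search(self, query):
        # Naive substring match, only there to make the sketch observable.
        needle = query.lower()
        return sorted(
            url
            for url, docs in self._indexed.items()
            if any(needle in d.content.lower() for d in docs)
        )
```

In this sketch a listener would call `trigger_lazy_index()` once at the end of the request, so documents queued with `index_lazy()` only become searchable after that call.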

Advantages over the current implementation:

  • We get rid of one more "legacy" class (Search).
  • The Search service can easily be replaced by Elasticsearch and Co. All they need to do is implement our SearchInterface.
  • Different content providers can add content for the same URL.
  • In theory, rebuilding the index can use subrequests instead of real requests.
@Toflar
Member Author

Toflar commented May 21, 2015

Also, drop the current getSearchablePages() behaviour in favour of a simple web crawler (see contao/core#6942). Note that the crawler must run on a regular basis via a cron job and must also be triggerable manually in the back end in some way.
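
As a rough idea of what a "simple web crawler" could mean here, the following Python sketch walks a site breadth-first. It is not taken from contao/core#6942: the `crawl` function, the `fetch_links` callback and the `SITE` dictionary (which stands in for real HTTP requests and link extraction) are hypothetical.

```python
from collections import deque


def crawl(start, fetch_links, max_pages=100):
    """Breadth-first crawl; returns the URLs in the order they were visited."""
    seen = {start}
    queue = deque([start])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)  # a real crawler would pass the page to the indexer here
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical site structure used instead of real requests.
SITE = {
    "/": ["/news", "/about"],
    "/news": ["/news/1", "/"],
    "/about": [],
    "/news/1": ["/news"],
}

pages = crawl("/", lambda url: SITE.get(url, []))
```

The same function could be called from a cron job and from a back end action, which matches the two triggers mentioned above.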

@Metis77

Metis77 commented Jun 7, 2017

+1

@HolyMacarony

+1

@leofeyer
Member

This is a long-running task, which will eventually result in the https://github.com/contao/search library.

leofeyer referenced this issue Jul 3, 2020
Description
-----------

This PR is based on #1678

It improves the performance of the search further by storing the words in their own table with a unique index.

It also changes the check that ensures that all keywords are matched, which should now be faster and more accurate. This fixes bugs such as the one triggered by searching for [`Contao Conta*`](https://contao.org/de/suche.html?keywords=Contao+Conta*)

In my tests with a `tl_search_index` table of about 1.5 million rows, a search like `*foo*` went from 10 seconds to 0.2 seconds, and `*foo* *bar*` from 12 seconds to 0.3 seconds.
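
The "all keywords are matched" check can be pictured as follows. This is a Python sketch of the idea only, not the SQL used in the PR: `matches_all_keywords` is a hypothetical helper, and `fnmatchcase` is used as a convenient stand-in for `*` wildcard matching (it also treats `?` and `[...]` as special, which the search itself may not).

```python
from fnmatch import fnmatchcase


def matches_all_keywords(keywords, terms):
    """True if every keyword (optionally with * wildcards) matches at least one term.

    `terms` is assumed to be the set of lowercased words of one document.
    """
    return all(
        any(fnmatchcase(term, keyword.lower()) for term in terms)
        for keyword in keywords
    )


# Both keywords are satisfied by the same word "contao".
ok = matches_all_keywords(["Contao", "Conta*"], {"contao", "cms"})
```

Checking each keyword on its own means overlapping keywords such as `Contao` and `Conta*` are both satisfied by a single word in the document.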

#### ToDo (for this pull request)
- [ ] ~~Check if some of the optimizations are bug fixes that need to be added to #1678~~
- [x] Rebase once #1678 got merged upstream
- [x] Update the index process to save the words in the new table
- [x] Check how and when to update the `vectorLength` of the documents

#### ToDo (for a contao/search library)
- [ ] Functional Tests (if possible)
- [ ] Move logic to a search service or library
- [ ] ~~Use doctrine entities instead of DCA~~

#### Further ideas
- [ ] Only store the IDs of the results in the cache JSON and load the text from the database when it is used.

Related: https://github.com/contao/core-bundle/issues/242

### How the search works now (updated 2020-06-25):

1. The `tl_search` table holds all documents (pages), one row per document.
2. `tl_search_words` stores all the words of the whole corpus (one row per unique word) together with the number of documents each word appears in (document frequency). The commit list below renames this table to `tl_search_term`.
3. `tl_search_index` connects words and documents (one row for every unique word/document combination) and stores how often the word appears in the document (term frequency).
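
To make the relationship between the three tables concrete, here is a small Python sketch that derives the same two counters from a toy corpus. It only mirrors the description above: `build_index`, the `\w+` tokenizer and the dictionary layout are assumptions, not the actual schema or indexing code.

```python
import re
from collections import Counter


def build_index(documents):
    """Return (document_frequency, term_frequency) for a {doc_id: text} corpus."""
    document_frequency = Counter()  # like the words table: one entry per unique word
    term_frequency = {}             # like tl_search_index: one entry per word/document pair
    for doc_id, text in documents.items():
        counts = Counter(re.findall(r"\w+", text.lower()))
        for word, count in counts.items():
            document_frequency[word] += 1         # the word occurs in one more document
            term_frequency[(word, doc_id)] = count
    return document_frequency, term_frequency


# The documents dictionary plays the role of tl_search (one entry per page).
df, tf = build_index({1: "Contao search search", 2: "Contao CMS"})
```

Here `df["contao"]` is 2 because the word occurs in both documents, while `tf[("search", 1)]` is 2 because it occurs twice in document 1.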

When we do an actual search we calculate the similarity between the query and all matching documents using the [cosine similarity algorithm](https://en.wikipedia.org/wiki/Cosine_similarity) with [tf-idf weighted](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) vector values:

```
Queryᵢ = log(1+(N/nt))
Documentᵢ = 1 + log ƒt,d

                     ___                                   
                     ╲                                     
                     ╱    Queryᵢ × Documentᵢ           
                     ‾‾‾                                   
cos(ϕ) = ──────────────────────────────────────────────────
               ________________         ___________________
              ╱  ___                   ╱  ___              
             ╱   ╲          2         ╱   ╲             2
            ╱    ╱    Queryᵢ    ×    ╱    ╱    Documentᵢ 
          ╲╱     ‾‾‾               ╲╱     ‾‾‾              
```

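Here `N` is the total number of documents, `nt` is the number of documents that contain term `t` (document frequency), and `ƒt,d` is the number of times term `t` occurs in document `d` (term frequency). The formula results in a similarity score between `0` (doesn’t match at all) and `1` (exact same words as the query).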

With *idf* we make sure that words that are rare in the whole corpus get high weights while very common words get low weights. The *tf* score gives more weight to words that appear very often in the same document.

The cosine similarity then normalizes for the length of the documents. This means that you cannot “trick” the search index by creating a document that just repeats every word multiple times.
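
The formula and the length normalization can be tried out with a short Python sketch. It follows the weights given above (`log(1 + N/nt)` for the query, `1 + log ƒt,d` for the document) but is not the SQL from the PR: the `relevance` function, its parameters and the toy numbers are assumptions.

```python
import math
from collections import Counter


def relevance(query_terms, doc_terms, document_frequency, total_documents):
    """Cosine similarity between a query and one document (sketch)."""
    # Query weights: log(1 + N / n_t); terms unknown to the corpus are skipped.
    query = {
        t: math.log(1 + total_documents / document_frequency[t])
        for t in set(query_terms)
        if document_frequency.get(t, 0) > 0
    }
    # Document weights: 1 + log(f_td) for every term that occurs in the document.
    doc = {t: 1 + math.log(f) for t, f in Counter(doc_terms).items()}

    dot = sum(weight * doc.get(t, 0.0) for t, weight in query.items())
    query_length = math.sqrt(sum(w * w for w in query.values()))
    doc_length = math.sqrt(sum(w * w for w in doc.values()))
    if query_length == 0 or doc_length == 0:
        return 0.0  # avoid a division by zero for empty vectors
    return dot / (query_length * doc_length)


df = {"contao": 2, "search": 1, "cms": 1}
plain = ["contao", "search"]
score_plain = relevance(["search"], plain, df, 2)
score_stuffed = relevance(["search"], plain * 10, df, 2)  # every word repeated 10 times
```

In this sketch `score_plain` and `score_stuffed` are equal: repeating every word the same number of times scales all document weights by the same factor, so the direction of the document vector, and with it the cosine, does not change.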

Commits
-------

f0955a7 Improve search query performance
0fcb18d Only count wildcard matches if necessary
20e8880 Move search words to their own table
7f0d95e Improve search performance
da22036 Use cosine similarity to rank search results
e84a6af Rename cosineSimilarity back to relevance
b9db167 Use MySQL variables to prevent multiple count computations
4a48b46 Coding style
803855a Fix division by zero
bb4cb66 Adapt indexing to the new data structure
d14eb23 Drop search tables instead of migrating the data
e7bb1f0 Remove obsolete language column
08350d0 Add unique index for word and pid
4f48c96 Fix division by zero
a8b7052 Merge branch master into feature/efficient-search-storage

Conflicts:
	core-bundle/src/Resources/contao/library/Contao/Search.php
67fa2c7 Remove unnecessary default values
99c999d Update vectorLength of 100 random documents when indexing
51f0dd0 Coding style
c29db11 Coding style
3b6c19c Comment the vector length update process
fba1e98 Rename tl_search_words to tl_search_term
463ed5b Rename tl_search_words to tl_search_term
debb634 Ensure that the relevance is always above zero
bf86114 CS fixes
a9698c2 Added missing default value for vectorLength
cd2ae6e Also delete search entries from the tl_search_term table
235bdaf Add tl_search_term to maintenance description
3431d42 Fix syntax error
6fe7dc9 Use contao.search.indexer service to purge deleted pages
8e513c4 Fix missing group by clause
d3ec34d Cast integer terms to string
14098c9 Fix unsigned value is out of range error
52b69b8 Try to prevent deadlocks
86e7d73 Fix concurrent indexing of the same page
a0e28fa Fix duplicate error for tl_search_index termId-pid
7e96c37 Add index for documentFrequency to prevent deadlocks
dbc587b Remove obsolete index
dc23f81 Try to fix another deadlock
fb90e2f Revert "Try to fix another deadlock"

This reverts commit dc23f81.
84c7e43 Fix bug with division by zero
6053561 Lock tables to prevent deadlocks
7a3009e CS fixes
8450dcf Merge branch 'master' into feature/efficient-search-storage
leofeyer referenced this issue in contao/core-bundle Jul 3, 2020
AlexejKossmann referenced this issue in AlexejKossmann/contao Apr 6, 2021
@leofeyer leofeyer transferred this issue from contao/core-bundle Sep 1, 2023