build(docs-infra): improve search quality #25750
You can preview f2fa79e at https://pr25750-f2fa79e.ngbuilds.io/.
This seems to be better. @jenniferfell can you test your queries with this version please?
token = token
  .trim()
  .replace(/^[_\-"'`({[<$*]+/, '')
  .replace(/[_\-"'`)}\]>.]+$/, '');
This will not strip away matching brackets (e.g. foo(), bar[]). Is that intentional?
Is there any downside in including both opening and closing symbols in both .replace() RegExps?
Hmm, good spot. This means that we might end up with things like foo( in the index...
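To make the problem concrete, here is a minimal sketch. The function names are hypothetical; the two regexes in `trimAsymmetric` are copied from the diff above, while the symmetric character class is only an assumption about what a fix could look like, not what the PR ended up using.

```javascript
// Regexes copied from the diff: opening symbols are stripped only at the
// start, closing symbols only at the end.
function trimAsymmetric(token) {
  return token
    .trim()
    .replace(/^[_\-"'`({[<$*]+/, '')
    .replace(/[_\-"'`)}\]>.]+$/, '');
}

// Assumed variant: the same set of symbols is stripped from both ends.
const LEADING = /^[_\-"'`(){}\[\]<>$*.]+/;
const TRAILING = /[_\-"'`(){}\[\]<>$*.]+$/;

function trimSymmetric(token) {
  return token.trim().replace(LEADING, '').replace(TRAILING, '');
}

console.log(trimAsymmetric('foo()')); // foo(
console.log(trimAsymmetric('bar[]')); // bar[
console.log(trimSymmetric('foo()'));  // foo
console.log(trimSymmetric('bar[]'));  // bar
```

One trade-off of the symmetric variant is that it also strips a leading `.` and a trailing `$` or `*`, which may or may not be desirable for tokens such as file extensions.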
In the query lexers (in the browser) we now split the terms by whitespace only. So the following query, https://pr25750-f2fa79e.ngbuilds.io/?search=downgradeModule(), matches no terms. Given that we will not keep such punctuation in the index, we should modify the lexer to strip it from query terms as well.
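One way to keep queries and index in step is to normalize each query term before searching. The sketch below is an assumption about how that could be done, not the change that was actually made: `normalizeQuery` and the character set in `EDGE_PUNCTUATION` are hypothetical. The set deliberately leaves out `*` and `:` so that wildcard and field syntax such as `titleWords:foo*` would survive.

```javascript
// Hypothetical pattern: punctuation to drop from either end of a query term.
const EDGE_PUNCTUATION = /^[_\-"'`(){}\[\]<>]+|[_\-"'`(){}\[\]<>.]+$/g;

// Split the query on whitespace, strip edge punctuation from each term,
// and drop any term that ends up empty.
function normalizeQuery(query) {
  return query
    .trim()
    .split(/\s+/)
    .map((term) => term.replace(EDGE_PUNCTUATION, ''))
    .filter((term) => term.length > 0)
    .join(' ');
}

console.log(normalizeQuery('downgradeModule()')); // downgradeModule
```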
Fixed
We considered just zipping up and downloading the entire text content of the docs, then building the index directly from the original content. This is problematic in a number of ways:
// E.g. if the search is "ngCont guide" then we search for "ngCont guide titleWords:*ngCont*"
var titleQuery = 'titleWords:*' + query.split(' ', 1)[0] + '*';
results = index.search(query + ' ' + titleQuery);
}
Basically, we now attempt a straight search with the given query and only fall back on a more relaxed title-based search if the original query returns nothing. This results in fewer false positives.
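The two-step strategy can be sketched as a standalone function. The name `searchWithFallback` is hypothetical, and `index` is assumed to be anything with a lunr-style `search(queryString)` method that returns an array; the fallback query is built the same way as in the diff above.

```javascript
// Try the query as given; only if it yields nothing, retry with an extra
// wildcard clause on the title words built from the first query word.
function searchWithFallback(index, query) {
  let results = index.search(query);
  if (results.length === 0) {
    const firstWord = query.split(' ', 1)[0];
    const titleQuery = 'titleWords:*' + firstWord + '*';
    results = index.search(query + ' ' + titleQuery);
  }
  return results;
}

// Usage with a stub index (illustration only) that records what it was asked.
const asked = [];
const stubIndex = {
  search(q) {
    asked.push(q);
    return q.includes('titleWords:') ? [{ ref: 'guide/example' }] : [];
  },
};
searchWithFallback(stubIndex, 'ngCont guide');
console.log(asked); // [ 'ngCont guide', 'ngCont guide titleWords:*ngCont*' ]
```

Because the relaxed query only runs when the strict one is empty, pages that merely have a similar title no longer dilute results for queries that already match.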
@@ -27,12 +27,13 @@ self.onmessage = handleMessage;
// Create the lunr index - the docs should be an array of objects, each object containing
// the path and search terms for a page
function createIndex(addFn) {
  lunr.QueryLexer.termSeparator = lunr.tokenizer.separator = /\s+/;
The built-in lexer was splitting on some punctuation characters that are actually significant when searching code. This was causing false negatives. Now we just split on whitespace.
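The effect of the separator can be seen with plain `String.prototype.split`, without loading lunr. This sketch assumes that lunr's default separator is `/[\s\-]+/` (whitespace and hyphens), which is how I understand lunr 2.x; treat that pattern, the `splitTerms` helper, and the sample text as assumptions for illustration.

```javascript
// Assumed lunr default: split on runs of whitespace and hyphens.
const ASSUMED_DEFAULT_SEPARATOR = /[\s\-]+/;
// Separator set in the diff above: split on whitespace only.
const WHITESPACE_ONLY = /\s+/;

// Hypothetical helper: split text into terms with the given separator.
function splitTerms(text, separator) {
  return text.trim().split(separator);
}

const sample = 'ng-content in a structural-directive';
console.log(splitTerms(sample, ASSUMED_DEFAULT_SEPARATOR));
// [ 'ng', 'content', 'in', 'a', 'structural', 'directive' ]
console.log(splitTerms(sample, WHITESPACE_ONLY));
// [ 'ng-content', 'in', 'a', 'structural-directive' ]
```

With the hyphen-splitting separator, a term like `ng-content` never exists as a single token, so an exact search for it cannot match.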
@@ -37,7 +37,7 @@ module.exports = new Package('angular.io', [gitPackage, apiPackage, contentPacka
 checkAnchorLinksProcessor.$runBefore = ['convertToJsonProcessor'];
 checkAnchorLinksProcessor.$runAfter = ['fixInternalDocumentLinks'];
 // We only want to check docs that are going to be output as JSON docs.
-checkAnchorLinksProcessor.checkDoc = (doc) => doc.path && doc.outputPath && extname(doc.outputPath) === '.json';
+checkAnchorLinksProcessor.checkDoc = (doc) => doc.path && doc.outputPath && extname(doc.outputPath) === '.json' && doc.docType !== 'json-doc';
If we don't ignore json-doc files, the anchor checker gets confused by the JSON-escaped URLs in the rendered content of the search term file.
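As a rough illustration of what the updated predicate lets through, here it is as a standalone function. The predicate body is taken from the diff; the sample doc objects, their paths, and the `content` docType value are made up for the example.

```javascript
const { extname } = require('path');

// Same condition as the new checkDoc in the diff above.
const checkDoc = (doc) =>
  doc.path && doc.outputPath && extname(doc.outputPath) === '.json' && doc.docType !== 'json-doc';

// Hypothetical docs, for illustration only.
const guideDoc = { path: 'guide/forms', outputPath: 'guide/forms.json', docType: 'content' };
const searchDataDoc = { path: 'search-data', outputPath: 'search-data.json', docType: 'json-doc' };

console.log(Boolean(checkDoc(guideDoc)));      // true
console.log(Boolean(checkDoc(searchDataDoc))); // false
```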
@jenniferfell suggested supporting quoted "phrase searching" to get fewer false positives, but lunr does not support this out of the box, and the effort required to implement it on top of lunr did not seem worth the benefit.
Still to be done (in another PR):
You can preview 66784da at https://pr25750-66784da.ngbuilds.io/.
You can preview d64323c at https://pr25750-d64323c.ngbuilds.io/.
82d5d0c to 5aaa179
You can preview 5aaa179 at https://pr25750-5aaa179.ngbuilds.io/.
5aaa179 to 4bfae61
You can preview 4bfae61 at https://pr25750-4bfae61.ngbuilds.io/.
Site search for http:
Site search for node.js
API index page:
Badges:
API package and type pages for deprecated APIs:
No show stoppers. MUCH IMPROVED!!
Great. I will fix the broken Travis test and swap the colours. Then mark it for merge.
4bfae61 to c316ecd
You can preview c316ecd at https://pr25750-c316ecd.ngbuilds.io/.
c316ecd to e638808
You can preview e638808 at https://pr25750-e638808.ngbuilds.io/.
There's something weird going on with the search results, it seems. I was exploring the various modules to see their badge color and started searching for Regardless of this minor thing, this PR brings a huge improvement to the search results overall! |
@JoostK thanks for testing this. I can explain why this is happening but I am not sure we can do much better. Here is what happens...
We don't have any control over the stemmer, AFAIK, so there is not much we can do.
e638808 to 8ca28bb
You can preview 8ca28bb at https://pr25750-8ca28bb.ngbuilds.io/.
Fixes #24380
Fixes #25721