New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
builtins: Implement unaccent function #54628
Conversation
Thank you for contributing to CockroachDB. Please ensure you have followed the guidelines for creating a PR. My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI. I have added a few people who may be able to assist in reviewing:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
hELLO` | ||
|
||
query T | ||
SELECT unaccent('something') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is actually U+00AD
in the middle, it's visible in git diff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can combine these into one test case that's easier to mess with if you do:
SELECT unaccent(str) FROM ( VALUES
('no_special_CHARACTERS1!'),
('Żółć')
) tbl(str)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, done
I forgot to add a label "first PR", sorry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hello @mknycha, thanks for doing this!
I was away last week, so apologies for the delay.
I have a couple of small comments.
@@ -0,0 +1,1519 @@ | |||
// Copyright 2015 The Cockroach Authors. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: 2020.
I think this may be useful as a separate package, i.e. in pkg/util/unaccent/unaccent.go
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, done
hELLO` | ||
|
||
query T | ||
SELECT unaccent('something') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can combine these into one test case that's easier to mess with if you do:
SELECT unaccent(str) FROM ( VALUES
('no_special_CHARACTERS1!'),
('Żółć')
) tbl(str)
|
||
package builtins | ||
|
||
var unaccentDictionary = map[string]string{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i'm curious - should this be map[rune]string
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(if not, I'm not sure how strings.Split(s, "")
above works with this dictionary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, I have just realized that strings.Split
may not work correctly. I guess that map[rune]string
is indeed a way to go. Some "characters" in the dictionary consist of two code points, so I need to modify this dictionary a bit.
pkg/sql/sem/builtins/builtins.go
Outdated
"unaccent": makeBuiltin(tree.FunctionProperties{Category: categoryString}, | ||
stringOverload1( | ||
func(evalCtx *tree.EvalContext, s string) (tree.Datum, error) { | ||
separatedStrings := strings.Split(s, "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this is more efficient if we use a strings.Builder
to write out the whole string.
I feel as if we should be iterating rune by rune, something like
var b strings.Builder
for _, ch := range s {
if repl, ok := unaccent.Characters[ch]; ok {
b.WriteString(repl)
} else {
b.WriteRune(ch)
}
}
return tree.NewDString(b.String()), nil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, done
546cf70
to
34f6864
Compare
Thank you for updating your pull request. My owl senses detect your PR is good for review. Please keep an eye out for any test failures in CI. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
d327e71
to
e4eea43
Compare
looks like some linting issues -- i'll merge it after those are fixed! |
Unaccent function should work the same as in PostgreSQL. Release note (sql change): Implement the string builtin unaccent.
e4eea43
to
d833390
Compare
bors r+ thank you for the contribution! |
Build succeeded: |
Unaccent function should work the same as in PostgreSQL.
Release note (sql change): Implement the string builtin
unaccent
.Described in #41299