stri_*_fixed: add an option to perform a simple case-insensitive match #110

Closed
gagolews opened this Issue Oct 31, 2014 · 2 comments

Owner

gagolews commented Oct 31, 2014

This will not be valid for natural language processing, but UChar32 u_toupper(UChar32 c) may be called on each code point in str and pattern to perform a simple case-insensitive match in the fixed engine.

I reckon that it may be useful for some users...

However, the ICU documentation for u_toupper warns: "This function only returns the simple, single-code point case mapping. Full case mappings should be used whenever possible because they produce better results by working on whole strings. They take into account the string context and the language and can map to a result string with a different length as appropriate. Full case mappings are applied by the string case mapping functions, see ustring.h and the UnicodeString class. See also the User Guide chapter on C/POSIX migration: http://icu-project.org/userguide/posix.html#case_mappings"
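To make the idea above concrete, here is a minimal sketch of a fixed-pattern search that upper-cases every code point before comparing. It is written in Python rather than against ICU, so `simple_upper` and `find_fixed_ci` are hypothetical names, and `simple_upper` only approximates `u_toupper`: Python's `str.upper()` applies full case mapping, so the sketch falls back to the original character whenever the result is not a single code point. It is not how stringi implements the feature.

```python
def simple_upper(ch: str) -> str:
    """Approximate a single-code-point uppercase mapping (assumption:
    stands in for ICU's u_toupper). If the full mapping expands to more
    than one code point, keep the character unchanged."""
    up = ch.upper()
    return up if len(up) == 1 else ch


def find_fixed_ci(haystack: str, needle: str) -> int:
    """Return the code point index of the first simple case-insensitive
    match of needle in haystack, or -1 if there is none."""
    h = [simple_upper(c) for c in haystack]
    n = [simple_upper(c) for c in needle]
    if not n:
        return -1  # an empty pattern is treated as "no match" in this sketch
    for i in range(len(h) - len(n) + 1):
        if h[i:i + len(n)] == n:
            return i
    return -1
```

With this sketch, `find_fixed_ci("Hello World", "WORLD")` finds the match at index 6, while `find_fixed_ci("straße", "STRASSE")` fails: matching ß against SS needs a full case mapping that changes the string length, which is exactly the limitation the ICU documentation describes.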

Owner

gagolews commented Dec 7, 2014

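According to UnicodeData.txt, there are characters x for which numbytes(x) != numbytes(u_toupper(x)), i.e., the simple uppercase mapping changes the number of UTF-8 bytes.

Here they are (V1: code point, V2: name, V13/V15: simple uppercase/titlecase mapping, X*: the corresponding characters, S*: their UTF-8 byte counts):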

       V1                                        V2  V13  V15 X1    X13    X15 S1 S13 S15
306  0131              LATIN SMALL LETTER DOTLESS I 0049 0049  ı      I      I  2   1   1
384  017F                 LATIN SMALL LETTER LONG S 0053 0053  ſ      S      S  2   1   1
576  023F      LATIN SMALL LETTER S WITH SWASH TAIL 2C7E 2C7E  ȿ \u2c7e \u2c7e  2   3   3
577  0240      LATIN SMALL LETTER Z WITH SWASH TAIL 2C7F 2C7F  ɀ \u2c7f \u2c7f  2   3   3
593  0250               LATIN SMALL LETTER TURNED A 2C6F 2C6F  ɐ      Ɐ      Ɐ  2   3   3
594  0251                  LATIN SMALL LETTER ALPHA 2C6D 2C6D  ɑ      Ɑ      Ɑ  2   3   3
595  0252           LATIN SMALL LETTER TURNED ALPHA 2C70 2C70  ɒ \u2c70 \u2c70  2   3   3
614  0265               LATIN SMALL LETTER TURNED H A78D A78D  ɥ \ua78d \ua78d  2   3   3
620  026B    LATIN SMALL LETTER L WITH MIDDLE TILDE 2C62 2C62  ɫ      Ɫ      Ɫ  2   3   3
626  0271            LATIN SMALL LETTER M WITH HOOK 2C6E 2C6E  ɱ      Ɱ      Ɱ  2   3   3
638  027D            LATIN SMALL LETTER R WITH TAIL 2C64 2C64  ɽ      Ɽ      Ɽ  2   3   3
6890 1FBE                      GREEK PROSGEGRAMMENI 0399 0399  ι      Ι      Ι  3   2   2
9835 2C65          LATIN SMALL LETTER A WITH STROKE 023A 023A  ⱥ      Ⱥ      Ⱥ  3   2   2
9836 2C66 LATIN SMALL LETTER T WITH DIAGONAL STROKE 023E 023E  ⱦ      Ⱦ      Ⱦ  3   2   2
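The byte counts in the table can be re-checked with a short sketch like the one below. It assumes that Python's `str.upper()` agrees with the simple uppercase mapping for these particular code points (it applies full case mapping in general) and that the interpreter ships a recent enough Unicode database; `utf8_len` and `length_change` are hypothetical helper names. Presumably the point of the list is that byte offsets found in an upper-cased copy of a string cannot be assumed to line up with byte offsets in the original.

```python
# Code points from the table whose UTF-8 length changes under upper-casing.
CODE_POINTS = [
    0x0131, 0x017F, 0x023F, 0x0240, 0x0250, 0x0251, 0x0252,
    0x0265, 0x026B, 0x0271, 0x027D, 0x1FBE, 0x2C65, 0x2C66,
]


def utf8_len(s: str) -> int:
    """Number of bytes in the UTF-8 encoding of s."""
    return len(s.encode("utf-8"))


def length_change(cp: int) -> tuple:
    """Return (bytes before, bytes after) upper-casing code point cp."""
    ch = chr(cp)
    return utf8_len(ch), utf8_len(ch.upper())


if __name__ == "__main__":
    for cp in CODE_POINTS:
        before, after = length_change(cp)
        print(f"U+{cp:04X}: {before} -> {after} bytes")
```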
Owner

gagolews commented Dec 7, 2014

Not done yet: ASan errors:

test-locate-fixed.R : ...........................................................................................................................................=================================================================
==27707== ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60040007b48c at pc 0x7f877a2964df bp 0x7fff8ba33a40 sp 0x7fff8ba33a30
READ of size 4 at 0x60040007b48c thread T0

gagolews closed this in 377ed8a on Dec 7, 2014
