# Python string operations: casing and matching

|Operation   |Python   |Pandas  |PyICU  |
|----------- |-------- |------- |------ |
|Lowercasing |[str.lower()](https://docs.python.org/3/library/stdtypes.html#str.lower)  |[pandas.Series.str.lower()](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.lower.html?highlight=lower#pandas-series-str-lower) |icu.UnicodeString.toLower() |
|Uppercasing |[str.upper()](https://docs.python.org/3/library/stdtypes.html#str.upper)  |[pandas.Series.str.upper()](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.upper.html#pandas-series-str-upper) |icu.UnicodeString.toUpper() |
|Titlecasing |[str.title()](https://docs.python.org/3/library/stdtypes.html#str.title)  |[pandas.Series.str.title()](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.title.html) |icu.UnicodeString.toTitle() |
|Casefolding |[str.casefold()](https://docs.python.org/3/library/stdtypes.html#str.casefold) |[pandas.Series.str.casefold()](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.casefold.html) |icu.UnicodeString.foldCase() |

The operations [str.capitalize()](https://docs.python.org/3/library/stdtypes.html#str.capitalize)/[pandas.Series.str.capitalize()](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.capitalize.html#pandas-series-str-capitalize) and [str.swapcase()](https://docs.python.org/3/library/stdtypes.html#str.swapcase)/[pandas.Series.str.swapcase()](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.swapcase.html#pandas-series-str-swapcase), although string operations, aren't necessarily casing operations.
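
As a small illustration of why these methods sit apart from the casing operations in the table, the sketch below shows that `str.swapcase()` is not reversible. The variable names are illustrative only.

```python
# swapcase() inverts the case of each character, so applying it twice
# does not necessarily restore the original string.
word = "straße"
swapped = word.swapcase()      # ß has no single-character uppercase form
restored = swapped.swapcase()

print(swapped)
print(restored)
print(restored == word)
```

The eszett expands to two characters on the first call, so the second call cannot recover it.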

N.B. we will not explore the differences between an [object and `StringDtype`](https://pandas.pydata.org/docs/user_guide/text.html#behavior-differences) in Pandas.

In [4]:
from el_internationalisation import cp, cpnames, udata

## Python casing operations

Unicode contains a set of special casing mappings. These are divided into unconditional and conditional mappings. All casing operations should support unconditional special mappings by default.

Python's casing operations are language-insensitive; that is, language is not taken into account when casing operations occur. The current locale has no impact on casing operations, so language-sensitive mappings are unsupported.
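
A minimal sketch of what language-insensitive means in practice: Turkish and Azeri expect I to pair with dotless ı and İ to pair with i, but Python always applies the default mappings. The variable names below are illustrative.

```python
DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I

# Default (language-insensitive) mappings are applied whatever the locale.
lower_capital_i = "I".lower()        # Turkish would expect dotless ı here
upper_small_i = "i".upper()          # Turkish would expect İ (U+0130) here
upper_dotless_i = DOTLESS_I.upper()

for label, value in [
    ("I.lower()", lower_capital_i),
    ("i.upper()", upper_small_i),
    ("ı.upper()", upper_dotless_i),
]:
    print(f"{label}: {value} ({' '.join(f'{ord(c):04X}' for c in value)})")
```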

Unconditional mappings:

  * Eszett (ß) casing
  * Preserving canonical equivalence of I WITH DOT ABOVE (İ)
  * Ligatures (Latin and Armenian script)
  * Lowercase characters that have no corresponding precomposed uppercase character
  * Greek letters with hupogegramménē (ὑπογεγραμμένη) or prosgegramménē (προσγεγραμμένη) have special uppercase equivalents
  * Some Greek letters with hupogegramménē (ὑπογεγραμμένη) have no titlecase

Conditional mappings:
  1. Language-Insensitive Mappings
    * Final form of Greek sigma
  2. Language-Sensitive Mappings
    * Lithuanian retains the dot in a lowercase i/j when followed by accents
    * For Turkish and Azeri, I and dotless ı form a case pair, as do dotted İ and i

See [SpecialCasing.txt](https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt), which forms part of the Unicode Character Database (UCD).
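
The one language-insensitive conditional mapping, final sigma, can be observed directly with `str.lower()`. This is a small sketch with illustrative variable names; the `codepoints` helper is a hypothetical stand-in for the `cp` function imported earlier.

```python
def codepoints(text: str) -> str:
    """Return the code points of text as space-separated hex values."""
    return " ".join(f"{ord(ch):04X}" for ch in text)

word_final = "ΟΔΟΣ".lower()    # sigma at the end of a word
word_medial = "ΣΟΦΟΣ".lower()  # sigma at the start and at the end
isolated = "Σ".lower()         # sigma with no preceding cased letter

for text in (word_final, word_medial, isolated):
    print(f"{text}: {codepoints(text)}")
```

Only a sigma that follows a cased letter and ends the word becomes ς (U+03C2); the others become σ (U+03C3).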

### Unconditional mappings

Python's lowercasing and uppercasing operations support Unicode's unconditional special mappings.

|Character  |Lowercase  |Titlecase  |Uppercase  |Notes  |
|---------- |---------- |---------- |---------- |------ |
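
As a rough illustration, the sketch below uppercases one character from several of the unconditional categories listed earlier. The sample characters are my own selection from SpecialCasing.txt, and `codepoints` is a hypothetical stand-in for the `cp` helper imported above.

```python
import unicodedata

def codepoints(text: str) -> str:
    """Return the code points of text as space-separated hex values."""
    return " ".join(f"{ord(ch):04X}" for ch in text)

# One sample per category: Latin ligature, no precomposed uppercase,
# Armenian ligature, Greek letter with ypogegrammeni.
samples = ["\uFB01", "\u01F0", "\uFB13", "\u1FB3"]

uppercased = {ch: ch.upper() for ch in samples}

for ch, upper in uppercased.items():
    print(f"{unicodedata.name(ch)}: {codepoints(ch)} ⇒ {upper} ({codepoints(upper)})")
```

Each of these characters expands to two code points when uppercased, which is why a simple one-to-one case mapping is not sufficient.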

#### Latin script

In [20]:
# ß
ESZETT = "ß"
print(f'{ESZETT} ({cp(ESZETT)}) ⇒ {ESZETT.upper()} ({cp(ESZETT.upper())})')
print("Titlecase: should not appear word initial.")

# I WITH DOT ABOVE
IDOT = "\u0130"
print(f'{IDOT.lower()} ({cp(IDOT.lower())}) ⇐ {IDOT} ({cp(IDOT)})')
print(f'Titlecase: {"i̇".title()} ({cp("i̇".title())})')

ß (00DF) ⇒ SS (0053 0053)
Titlecase: should not appear word initial.
i̇ (0069 0307) ⇐ İ (0130)
Titlecase: İ (0049 0307)


Note that Python titlecasing does not resolve back to the precomposed U+0130. This is part of a wider issue with Python titlecasing: unlike uppercasing and lowercasing, titlecasing does not adhere to the Unicode specification.
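
For this isolated case, the precomposed character can be recovered by normalising the result to NFC. This is only a sketch of one possible workaround using the standard `unicodedata` module; it does not address the wider titlecasing issue.

```python
import unicodedata

titled = "i\u0307".title()                       # I followed by COMBINING DOT ABOVE
composed = unicodedata.normalize("NFC", titled)  # canonical composition

print(" ".join(f"{ord(ch):04X}" for ch in titled))
print(" ".join(f"{ord(ch):04X}" for ch in composed))
```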

If we take the name of the Turkish city İstanbul:

In [26]:
print(f'İstanbul: {cp("İstanbul")}')
istanbul = "İstanbul".lower()
print(f'{istanbul}: {cp(istanbul)}')
istanbul_title = istanbul.title()
print(f'Titlecase: {istanbul_title} ({cp(istanbul_title)})')

İstanbul: 0130 0073 0074 0061 006E 0062 0075 006C
i̇stanbul: 0069 0307 0073 0074 0061 006E 0062 0075 006C
Titlecase: İStanbul (0049 0307 0053 0074 0061 006E 0062 0075 006C)


The first three characters in the titlecased string are U+0049 U+0307 U+0053. Python titlecases the first alphabetic character after a non-alphabetic character. Combining diacritics are not considered alphabetic characters:

In [27]:
istanbul.isalpha()

False

So __i__ is uppercased to __I__, U+0307 is treated as a non-alphabetic character and the titlecasing operation titlecases the __s__, giving us İStanbul as the titlecased version of the string.
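
The behaviour described above can be inspected character by character. This sketch uses the standard `unicodedata` module, and the `report` variable is illustrative.

```python
import unicodedata

istanbul = "İstanbul".lower()

# Code point, general category, and isalpha() for the first three characters.
report = [
    (f"U+{ord(ch):04X}", unicodedata.category(ch), ch.isalpha())
    for ch in istanbul[:3]
]

for code_point, category, is_alpha in report:
    print(f"{code_point} category={category} isalpha={is_alpha}")
```

The combining dot above has general category Mn (nonspacing mark), so it breaks the run of alphabetic characters between __i__ and __s__.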

It is important to note that the Unicode definition also excludes marks, like combining diacritics, but Unicode titlecasing does not apply an alphabetic mask.
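
One way to work around this in pure Python is to treat combining marks as part of the word they attach to. The function below is a hypothetical sketch, not the Unicode titlecasing algorithm: `title_mark_aware` is a name invented here, it handles characters one at a time (so it does not apply the final sigma rule), and it normalises its result to NFC.

```python
import unicodedata

def title_mark_aware(text: str) -> str:
    """Titlecase text without letting combining marks start a new word."""
    out = []
    in_word = False
    for ch in text:
        if unicodedata.category(ch).startswith("M"):
            # Combining mark: copy it through and keep the current word state.
            out.append(ch)
            continue
        if ch.isalpha():
            # First letter of a word is titlecased, the rest are lowercased.
            out.append(ch.lower() if in_word else ch.title())
            in_word = True
        else:
            out.append(ch)
            in_word = False
    # Recompose sequences such as I + U+0307 into U+0130.
    return unicodedata.normalize("NFC", "".join(out))

result = title_mark_aware("İstanbul".lower())
print(result, " ".join(f"{ord(ch):04X}" for ch in result))
```

For language-aware or fully conformant titlecasing, a library such as PyICU remains the more appropriate choice.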