### Case-Insensitive String Comparisons

In Python, string comparisons using `==` are case-sensitive.

For example, the following strings are not equal:

In [1]:
'python' == 'PYTHON'

False

The typical technique for doing a case-insensitive comparison is to use the `lower()` (or `upper()`) method in the `str` class, to compare lower-case (or upper-case) versions of both strings:

In [2]:
'python'.lower() == 'PYTHON'.lower()

True

> But there's a slight problem with that!

The technique above uses something called **case mapping**, or **case conversion**. 

It is basically a process that converts strings to a particular form, such as lower case, upper case, or title case, and should primarily be used for **display purposes**, **not** for comparison purposes!

Here's an example where case mapping fails:

In [3]:
s1 = 'STRASSE'
s2 = 'Straße'

Technically, from a case-insensitive comparison perspective, these two strings are equal!

But look at what happens when we do a lower-case comparison:

In [4]:
s1.lower(), s2.lower(), s1.lower() == s2.lower()

('strasse', 'straße', False)

If we had done an upper-case comparison instead, that would actually have worked:

In [5]:
s1.upper(), s2.upper(), s1.upper() == s2.upper()

('STRASSE', 'STRASSE', True)

The better alternative for case-insensitive comparisons is to use **case folding**.

Case folding essentially provides us a more consistent method that we can use to compare two strings:

In [6]:
s1.casefold(), s2.casefold(), s1.casefold() == s2.casefold()

('strasse', 'strasse', True)
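As another illustration of the difference between case mapping and case folding, here is a small sketch using the Greek letter sigma, which has two lower-case forms (a regular one and a word-final one). The variable names are only for illustration:

```python
sigma_lower = "\u03c3"   # σ  GREEK SMALL LETTER SIGMA
sigma_final = "\u03c2"   # ς  GREEK SMALL LETTER FINAL SIGMA

# lower() leaves both characters unchanged, so they stay distinct
lower_equal = sigma_final.lower() == sigma_lower.lower()

# casefold() maps the final sigma to the regular sigma, so they compare equal
casefold_equal = sigma_final.casefold() == sigma_lower.casefold()

print(lower_equal, casefold_equal)  # False True
```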

So `casefold` can address **some** of the issues surrounding case-insensitive comparisons.

But not all!

Consider the following two strings:

In [7]:
s1 = 'ê'
s2 = 'ê'

These may **look** like the same character, but:

In [8]:
s1 == s2

False

Even though these two characters look the same (and we probably would want them to compare equal), `casefold` will not help us here:

In [9]:
s1.casefold(), s2.casefold(), s1.casefold() == s2.casefold()

('ê', 'ê', False)

This happens because the two strings use **different sequences of Unicode code points** to represent the same character.

The first one is a **single** code point, whereas the second one is actually **two** code points (Python's `len` counts code points):

In [10]:
s1, len(s1), s2, len(s2)

('ê', 1, 'ê', 2)

We can see what those two characters are:

In [11]:
import unicodedata

In [12]:
unicodedata.name(s1)

'LATIN SMALL LETTER E WITH CIRCUMFLEX'

In [13]:
[unicodedata.name(c) for c in s2]

['LATIN SMALL LETTER E', 'COMBINING CIRCUMFLEX ACCENT']

As you can see, the string `s1` is a single Unicode character (code point `U+00EA`), whereas `s2` consists of the base `e` character (`U+0065`) followed by a **combining** circumflex accent (`U+0302`).
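If you want to see the code points themselves rather than their names, a minimal sketch using the built-in `ord` could look like this (`code_points` is a hypothetical helper name, not part of any library):

```python
def code_points(s):
    """Return the code points of s in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in s]

precomposed = "\u00ea"   # single code point
decomposed = "e\u0302"   # base letter followed by a combining mark

print(code_points(precomposed))  # ['U+00EA']
print(code_points(decomposed))   # ['U+0065', 'U+0302']
```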

We can actually create these two from the unicode codes like this:

In [14]:
'\u00ea'

'ê'

In [15]:
'\u0065\u0302'

'ê'

You can refer to this link to see info about ê, where you'll notice an entry called `Decomposition`:
https://www.compart.com/en/unicode/U+00EA

So, they look the same, and in most cases we would want to consider them equal since, as we see from the definition, they are really just two ways of describing the same character.

So, we need to perform an extra step to avoid this pitfall, called **unicode normalization**.

In this case we can use **NFD** (Normalization Form D), which decomposes each character into its base character followed by any combining marks (see D145 in the official Unicode documentation here: http://www.unicode.org/versions/Unicode9.0.0/ch03.pdf).

We can achieve this using Python's `unicodedata.normalize` function:

In [16]:
unicodedata.normalize('NFD', s1) == unicodedata.normalize('NFD', s2)

True
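To make it clearer what normalization does to the strings, here is a short sketch that converts in both directions: NFD decomposes, while the related NFC form composes. The variable names are just for illustration:

```python
import unicodedata

precomposed = "\u00ea"   # single code point
decomposed = "e\u0302"   # base letter followed by a combining mark

nfd = unicodedata.normalize("NFD", precomposed)  # decompose
nfc = unicodedata.normalize("NFC", decomposed)   # compose

print(len(nfd), nfd == decomposed)   # 2 True
print(len(nfc), nfc == precomposed)  # 1 True
```

Either form works for comparisons, as long as both strings are normalized to the same one.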

Now let's say we want a case-insensitive comparison of the following two strings. We can do that by combining NFD normalization **and** case folding.

In [17]:
s1 = '\u0065\u0302'
s2 = '\u00ca'
s1, s2

('ê', 'Ê')

Just case folding will not work:

In [18]:
s1.casefold() == s2.casefold()

False

Just normalization will not work either, since the characters are obviously not the same case:

In [19]:
unicodedata.normalize('NFD', s1) == unicodedata.normalize('NFD', s2)

False

But by combining the two, we get the desired result:

In [20]:
unicodedata.normalize('NFD', s1).casefold() == unicodedata.normalize('NFD', s2).casefold()

True

I usually end up with a small helper function:

In [21]:
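def strcomp(a, b, case_insensitive=False):
    """Compare two strings after NFD normalization, optionally ignoring case."""
    # normalize first so composed and decomposed forms compare equal
    a = unicodedata.normalize('NFD', a)
    b = unicodedata.normalize('NFD', b)
    if case_insensitive:
        # case folding (not lower/upper) is the right tool for comparisons
        return a.casefold() == b.casefold()
    else:
        return a == b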

And this will work with all our examples so far:

In [22]:
s1 = '\u0065\u0302tre'
s2 = '\u00caTRE'
print(f"{s1=}, {s2=}")
print('case sensitive:', strcomp(s1, s2))
print('case insensitive:', strcomp(s1, s2, True))

s1='être', s2='ÊTRE'
case sensitive: False
case insensitive: True


In [23]:
s1 = 'STRASSE'
s2 = 'Straße'
print(f"{s1=}, {s2=}")
print('case sensitive:', strcomp(s1, s2))
print('case insensitive:', strcomp(s1, s2, True))

s1='STRASSE', s2='Straße'
case sensitive: False
case insensitive: True
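For what it's worth, the canonical caseless match in D145 of the Unicode standard normalizes a second time *after* case folding, because for a few characters case folding can return a string that is no longer normalized. The Python Unicode HOWTO shows a similar approach. Here is a minimal sketch of that stricter variant (`nfd` and `caseless_equal` are hypothetical names chosen for this example):

```python
import unicodedata

def nfd(s):
    """Shorthand for NFD normalization."""
    return unicodedata.normalize("NFD", s)

def caseless_equal(a, b):
    """Normalize, case fold, then normalize again before comparing."""
    return nfd(nfd(a).casefold()) == nfd(nfd(b).casefold())

print(caseless_equal("e\u0302tre", "\u00caTRE"))  # True
print(caseless_equal("STRASSE", "Stra\u00dfe"))   # True
print(caseless_equal("python", "pythons"))        # False
```

For the examples in this section the extra normalization makes no difference, so the simpler helper above gives the same results.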


Case-insensitive comparisons can be quite simple using case folding, as long as the character set you are dealing with is something like the ASCII characters, but once you start considering internationalization issues, things get more complicated very fast! 

This is just the tip of the iceberg. Depending on the particular language you are dealing with, things can get even more complicated.