A diacritic is a mark, point, or sign that is added to a letter or character to: 

* Distinguish it from another of similar form
* Give it a particular phonetic value
* Indicate stress

Diacritics are often loosely called "accents". They are written above, below, or through certain letters of the alphabet, usually to indicate something about their pronunciation.


Examples of diacritics include the cedilla (ç), tilde (ñ), circumflex (ê), and macron (ā).
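
As a rough illustration of these examples, the sketch below looks up the Unicode name of the combining mark inside one sample letter per diacritic. The sample letters and the helper name `combining_mark_names` are assumptions chosen for illustration, not part of the original notebook.

```python
import unicodedata

# Sample letters (illustrative choice): one precomposed letter per diacritic
samples = {
    "cedilla": "\u00e7",     # ç
    "tilde": "\u00f1",       # ñ
    "circumflex": "\u00ea",  # ê
    "macron": "\u0101",      # ā
}

def combining_mark_names(ch):
    """Return the Unicode names of the combining marks in the NFD form of ch."""
    decomposed = unicodedata.normalize("NFD", ch)
    # unicodedata.combining() returns a nonzero class for combining characters
    return [unicodedata.name(c) for c in decomposed if unicodedata.combining(c)]

for label, ch in samples.items():
    print(label, ch, combining_mark_names(ch))
```

Each sample letter should report exactly one combining mark, for example `COMBINING CEDILLA` for ç.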

In [2]:
import unicodedata

str1 = "Héllò, Wórld!"

# Canonical composition (NFC): each accented letter becomes a single code point
normalized_str1 = unicodedata.normalize("NFC", str1)

print(normalized_str1)

Héllò, Wórld!


In [3]:
for c in unicodedata.normalize('NFD', str1):
    print(c)

H
e
́
l
l
o
̀
,
 
W
o
́
r
l
d
!


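As the output above shows, NFD (canonical decomposition) does not delete diacritics from the text. It splits each accented letter into its base letter followed by a separate combining mark, which is why the marks print on their own lines.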
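
To make the difference between the two forms concrete, here is a minimal sketch that compares the number of code points in the NFC and NFD versions of the same text. The variable names `text`, `nfc`, and `nfd` are illustrative.

```python
import unicodedata

# Same text as the example above, written with escapes so the starting form is unambiguous
text = "H\u00e9ll\u00f2, W\u00f3rld!"

nfc = unicodedata.normalize("NFC", text)  # composed: one code point per accented letter
nfd = unicodedata.normalize("NFD", text)  # decomposed: base letter + combining mark

print(len(nfc), len(nfd))                        # code point counts of each form
print(nfc == nfd)                                # the two forms are different strings
print(unicodedata.normalize("NFC", nfd) == nfc)  # recomposing NFD gives back NFC
```

The NFD string is three code points longer, one for each accented letter, and recomposing it with NFC restores the original string.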

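Below, we use the `unicodedata.category()` function to identify the combining marks, which have the category `Mn` (Mark, nonspacing), so that they can be filtered out of the text.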

In [4]:
for c in unicodedata.normalize('NFD', str1):
    print(f"Literal: {c}, Category: {unicodedata.category(c)}")

Literal: H, Category: Lu
Literal: e, Category: Ll
Literal: ́, Category: Mn
Literal: l, Category: Ll
Literal: l, Category: Ll
Literal: o, Category: Ll
Literal: ̀, Category: Mn
Literal: ,, Category: Po
Literal:  , Category: Zs
Literal: W, Category: Lu
Literal: o, Category: Ll
Literal: ́, Category: Mn
Literal: r, Category: Ll
Literal: l, Category: Ll
Literal: d, Category: Ll
Literal: !, Category: Po


In [17]:
import string

# Characters allowed in the output: ASCII letters plus a few punctuation marks
all_letters = string.ascii_letters + " .,;'"
char_string = "Ślusàrski"

In [18]:

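# Decompose the string, then keep only characters that are not
# combining marks (category 'Mn') and that appear in all_letters
for c in unicodedata.normalize('NFD', char_string):
    if unicodedata.category(c) != 'Mn' and c in all_letters:
        print(c)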

S
l
u
s
a
r
s
k
i
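
The loop above can be wrapped in a reusable function. The following is a minimal sketch, assuming we want a string back instead of printed characters; the function name `strip_diacritics` and its `allowed` parameter are hypothetical and not part of the original notebook.

```python
import string
import unicodedata

all_letters = string.ascii_letters + " .,;'"

def strip_diacritics(text, allowed=None):
    """Return text with combining marks removed (hypothetical helper).

    If allowed is given, characters outside that set are dropped as well,
    mirroring the `c in all_letters` check in the loop above.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        c for c in decomposed
        if unicodedata.category(c) != "Mn"
        and (allowed is None or c in allowed)
    )

print(strip_diacritics("\u015alus\u00e0rski", all_letters))       # input is "Ślusàrski"
print(strip_diacritics("H\u00e9ll\u00f2, W\u00f3rld!", all_letters))  # input is "Héllò, Wórld!"
```

Note that this approach only handles letters that decompose under NFD. A letter such as "ł" has no canonical decomposition, so it is kept unchanged when `allowed` is `None` and dropped entirely when `allowed` is `all_letters`.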
