
Normalize Unicode throughout CLTK #94

Closed
kylepjohnson opened this issue Sep 2, 2015 · 25 comments

Comments

@kylepjohnson
Member

I've been reading about normalize() and hope it will prevent normalization problems in the future. This built-in function solves the problem of accented characters built from combining diacritics not comparing equal to their precomposed equivalents. Examples of this appear in the test suite, where I have struggled to make two strings of accented Greek equal one another.

Example of normalize() from Fluent Python by Luciano Ramalho (117-118):

>>> from unicodedata import normalize
>>> s1 = 'café' # composed "e" with acute accent
>>> s2 = 'cafe\u0301' # decomposed "e" and acute accent 
>>> len(s1), len(s2)
(4, 5)
>>> len(normalize('NFC', s1)), len(normalize('NFC', s2)) 
(4, 4)
>>> len(normalize('NFD', s1)), len(normalize('NFD', s2)) 
(5, 5)
>>> normalize('NFC', s1) == normalize('NFC', s2)
True
>>> normalize('NFD', s1) == normalize('NFD', s2) 
True

Solutions

  1. In core, use normalize() with the argument 'NFC', as Fluent Python recommends. Not all Greek combining sequences may reduce to precomposed characters, so this will need to be tested.

  2. In tests, especially for assertEqual(), check that more complicated strings equal one another. Use normalize('NFC', <text>) on the comparison strings, too, if necessary (see the sketch below the docs link).

  3. Use this to strip out accented characters coming from the PHI, which I currently don't do very gracefully here: https://github.com/kylepjohnson/cltk/blob/master/cltk/corpus/utils/formatter.py#L94

Docs: https://docs.python.org/3.4/library/unicodedata.html#unicodedata.normalize
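
For solution 2, a test could look roughly like this (a sketch only; the Greek string and test name are just placeholders):

import unicodedata
import unittest


class TestNormalization(unittest.TestCase):

    def test_greek_nfc_equality(self):
        """Composed and decomposed accented Greek compare equal after NFC."""
        composed = 'θε\u03ac'       # precomposed alpha with tonos
        decomposed = 'θεα\u0301'    # alpha + combining acute accent
        self.assertNotEqual(composed, decomposed)
        self.assertEqual(unicodedata.normalize('NFC', composed),
                         unicodedata.normalize('NFC', decomposed))


if __name__ == '__main__':
    unittest.main()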

@fractaledmind
Contributor

So there is a wrapper function in the Alfred-Workflow repo that I have used in all of my projects to handle Unicode strings consistently:

def decode(self, text, encoding='utf-8', normalization='NFC'):
    """Return ``text`` as normalised unicode.

    :param text: string
    :type text: encoded or Unicode string. If ``text`` is already a
        Unicode string, it will only be normalised.
    :param encoding: The text encoding to use to decode ``text`` to
        Unicode.
    :type encoding: ``unicode`` or ``None``
    :param normalization: The normalisation form to apply to ``text``.
    :type normalization: ``unicode`` or ``None``
    :returns: decoded and normalised ``unicode``

    :class:`Workflow` uses "NFC" normalisation by default. This is the
    standard for Python and will work well with data from the web (via
    :mod:`~workflow.web` or :mod:`json`).

    OS X, on the other hand, uses "NFD" normalisation (nearly), so data
    coming from the system (e.g. via :mod:`subprocess` or
    :func:`os.listdir`/:mod:`os.path`) may not match. You should either
    normalise this data, too, or change the default normalisation used by
    :class:`Workflow`.
    """
    if not isinstance(text, unicode):
        text = unicode(text, encoding)
    return unicodedata.normalize(normalization, text)
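
For a Python 3 code base, the same idea collapses to a much shorter standalone function (a rough sketch, not taken from any existing library):

import unicodedata


def decode(text, encoding='utf-8', normalization='NFC'):
    """Return ``text`` as a normalised str, decoding from bytes first if needed."""
    if isinstance(text, bytes):
        text = text.decode(encoding)
    return unicodedata.normalize(normalization, text)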

@kylepjohnson
Member Author

Thanks, @smargh, for sharing. You inspire me to try something like this soon.

@jtauber

jtauber commented Mar 3, 2016

I actually end up using NFKC just so alt forms of theta and phi, etc, get normalized.

I have a (very) work-in-progress article about all this stuff: http://jktauber.com/articles/python-unicode-ancient-greek/
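
For example (a quick REPL check), the symbol variants fold to the standard letters only under the compatibility forms:

>>> from unicodedata import normalize
>>> normalize('NFC', '\u03d1'), normalize('NFKC', '\u03d1')   # GREEK THETA SYMBOL
('ϑ', 'θ')
>>> normalize('NFC', '\u03d5'), normalize('NFKC', '\u03d5')   # GREEK PHI SYMBOL
('ϕ', 'φ')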

@coderbhupendra
Contributor

@jtauber thanks for this wonderful article; I was googling to understand normalization myself in order to solve this issue, and it cleared up a few of my doubts.
@kylepjohnson I think I will be able to do this after asking a few questions.

@kylepjohnson
Member Author

Yes, @coderbhupendra, once you finish #95, let's talk about this one. It's a somewhat similar task, so it'd be great for you to do it.

I will also appreciate your help in checking this in some Indic languages (namely Sanskrit).

@jtauber

jtauber commented Mar 4, 2016

@coderbhupendra happy to help with any Unicode / Python / Greek questions too (and hoping to learn some Sanskrit along the way)

@coderbhupendra
Contributor

@kylepjohnson I would now like to start on this.
We can strip the acute using James's code: first remove the acute, then remove punctuation afterward. Is this what you meant by doing it gracefully?

import unicodedata
from string import punctuation

text = 'café'
ACUTE = "\u0301"

# decompose, drop the combining acute, then recompose
text_acute_free = unicodedata.normalize("NFC", "".join(
    ch
    for ch in unicodedata.normalize("NFD", text)
    if ch not in [ACUTE]))

# after this we will do the remaining things
new_text = ''
for char in text_acute_free:
    if char not in punctuation:
        new_text += char

@coderbhupendra
Contributor

@kylepjohnson
For problem 1 ("in core, use normalize with the argument 'NFC'") and problem 2 ("in tests, especially for assertEqual()"), can you point out these pieces of code?

@coderbhupendra
Contributor

@kylepjohnson can you give some detail on whether that approach for stripping the acute is right, and about the other two problems?

@kylepjohnson
Member Author

@coderbhupendra The main point of this task isn't about stripping accents, but turning combining diacritics into precomposed characters. Please try the code I use in the above example.

The function we need will look something like:

def cltk_normalize(text):
    return normalize('NFKC', text)
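
For Greek, this is the sort of thing such a function would do (an illustrative REPL session):

>>> from unicodedata import normalize
>>> decomposed = '\u03b1\u0313\u0301ειδε'   # alpha + combining smooth breathing + combining acute, then 'ειδε'
>>> normalize('NFKC', decomposed) == '\u1f04ειδε'   # precomposed ἄ
True
>>> len(decomposed), len(normalize('NFKC', decomposed))
(7, 5)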

Would you do us the favor of testing this out with some Devanagari and putting it in a Gist? I'll do the same, soon, with Greek. Then we can talk about any necessary changes, and then you can do a pull request. Sound good?

@coderbhupendra
Contributor

@kylepjohnson I think normalization is not working for Sanskrit. What you wanted to do is remove all punctuation and -\n|«|»|<|>|...|‘|’|_|{.+?}|(.+?)|[a-zA-Z0-9] from the text, but while stripping you didn't want to strip/separate the symbols (i.e. the diacritics) from the letters.

But for Sanskrit, normalization is not helping. Here s is a Sanskrit sentence:

s = 'मनोहारि देहं महच्चित्तगेहम्'

On printing [ch for ch in normalize('NFKC', s)]:
['म', 'न', 'ो', 'ह', 'ा', 'र', 'ि', ' ', 'द', 'े', 'ह', 'ं', ' ', 'म', 'ह', 'च', '्', 'च', 'ि', 'त', '्', 'त', 'ग', 'े', 'ह', 'म', '्']

On printing [ch for ch in s]:
['म', 'न', 'ो', 'ह', 'ा', 'र', 'ि', ' ', 'द', 'े', 'ह', 'ं', ' ', 'म', 'ह', 'च', '्', 'च', 'ि', 'त', '्', 'त', 'ग', 'े', 'ह', 'म', '्']
Even after normalization, the character-by-character split is exactly the same, and the length is unchanged:

len(s) == len(normalize('NFKC', s))
True

@jtauber

jtauber commented Mar 7, 2016

Unicode Normalization doesn't strip anything. Normalization just makes sure that if there is more than one way to express something in Unicode, a consistent choice is made.
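
Concretely: Greek ά can be written either as one precomposed code point or as alpha plus a combining acute, so normalization picks one of the two; a Devanagari matra only has one encoding in the first place, so there is nothing to choose between (a rough illustration):

>>> from unicodedata import normalize
>>> len('\u03ac'), len('\u03b1\u0301')          # ά precomposed vs. decomposed
(1, 2)
>>> normalize('NFC', '\u03b1\u0301') == '\u03ac'
True
>>> len('कि'), len(normalize('NFC', 'कि'))       # ka + i-matra: no precomposed form exists
(2, 2)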

@coderbhupendra
Contributor

@jtauber by stripping I meant this: [ch for ch in normalize('NFKC', s)]. You can see that the results for [ch for ch in normalize('NFKC', s)] and [ch for ch in s] are the same in the case of Sanskrit.

In the case of the 'café' example, though, you can see the difference:

In [41]: s1 = 'cafe\u0301'   # decomposed: 'e' + combining acute

In [42]: [ch for ch in normalize('NFKC', s1)]
Out[42]: ['c', 'a', 'f', 'é']

In [43]: [ch for ch in s1]
Out[43]: ['c', 'a', 'f', 'e', '́']

@kylepjohnson kylepjohnson self-assigned this Mar 7, 2016
@kylepjohnson
Member Author

@coderbhupendra It's helpful to know that normalize() doesn't handle Sanskrit well.

Unless there are objections, I'm going to assign this to myself and pass at least some Greek through it in the core.

At some point, someone might want to look at what this library, Indic NLP, does (note that it is Python 2.7).

@jtauber

jtauber commented Mar 7, 2016

@coderbhupendra I'm still not sure it's doing anything wrong for Sanskrit. The fact that Greek has precomposed characters for most diacritic combinations (but not all; see http://jktauber.com/2016/02/09/updated-solution-polytonic-greek-unicodes-problems/ ) is actually for political rather than technical reasons, and the Unicode Technical Committee assure me that if they were adding Greek now, it would work the way Sanskrit works in your example.

@kylepjohnson
Member Author

Thanks, James, for your insights. You are surely correct, I see, when you put it this way.

If the precomposed characters are merely political, and not how other languages are handled, what is your opinion of how polytonic Greek ought to be handled? Is there a benefit to the precomposed forms?


@jtauber

jtauber commented Mar 7, 2016

It's still useful to normalize to something just because otherwise you can't test equality properly.

To be honest, I've only recently been told by members of the Unicode Technical Committee why new precomposed characters will not be introduced to handle length marking + other diacritics, so I'm still trying to work things out.

Keeping things NFC or NFKC certainly makes them work better in Python 3 (until you hit length marking + other diacritics).

Note, however, that pyuca (and in fact UCA in general) converts to NFD before looking up collation elements. My diacritic stripping / adding code converts to NFD first too.

Still trying to work out the best trade off. Honestly, my world was somewhat turned upside down by the UTC members' feedback on my complaints about the lack of precomposed length + other diacritics.

This turns out to be something Perl 6 is good at, I also recently discovered.
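
For what it's worth, the usual stripping pattern (a separate operation from normalization itself) is to decompose first and then drop the combining marks; a rough sketch:

import unicodedata


def strip_diacritics(text):
    """Drop combining marks: decompose, filter them out, then recompose what is left."""
    decomposed = unicodedata.normalize('NFD', text)
    stripped = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize('NFC', stripped)

For polytonic Greek this gives, e.g., strip_diacritics('θεά') == 'θεα'.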

@coderbhupendra
Contributor

@kylepjohnson if I have understood the problem of "combining diacritics not equaling precomposed characters," can you tell me whether the changes I made in my gist are correct or not?
https://gist.github.com/coderbhupendra/a5016a8e52b480c14fa1

And @jtauber, to my understanding, diacritics are, in simple words, just special symbols added to the 24 Greek letters. Likewise, in Sanskrit we have "matras," which are added to the Sanskrit letters. Above I did the same thing for a Greek sentence and a Sanskrit sentence, and in the case of Sanskrit the "matras" stay separate after normalization; normalization has no effect. You may try it yourself. So I think it is not working for Sanskrit.

@kylepjohnson
Member Author

@coderbhupendra Yes, for our purposes here, accents == polytonic diacritics == matras. For your commit, this is close, but I don't want you to implement it yet on any texts; I'll do this one-by-one. Thus, make it like this:

def cltk_normalize(text, compatibility=True):
    if compatibility:
        return normalize('NFKC', text)
    else:
        return normalize('NFC', text)

Also remember to import normalize at the top of the file, and add two tests, one each for compatibility=True and compatibility=False, in cltk/tests/test_corpus.py. Use the 'café' example above.
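
Something along these lines would do (a sketch only, assuming cltk_normalize is imported into the existing test class in cltk/tests/test_corpus.py):

    def test_cltk_normalize_compatible(self):
        """NFKC: decomposed 'cafe' + combining acute equals precomposed 'café'."""
        s1 = 'café'
        s2 = 'cafe\u0301'
        self.assertEqual(cltk_normalize(s1, compatibility=True),
                         cltk_normalize(s2, compatibility=True))

    def test_cltk_normalize_noncompatible(self):
        """NFC: the same pair also compares equal without compatibility folding."""
        s1 = 'café'
        s2 = 'cafe\u0301'
        self.assertEqual(cltk_normalize(s1, compatibility=False),
                         cltk_normalize(s2, compatibility=False))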

@kylepjohnson
Member Author

@jtauber I see what you're saying and this is making my head spin, too.

I agree that we're best off normalizing to something … I'll want to hear more about what you come to think of all this with Python 3.

@kylepjohnson kylepjohnson removed their assignment Mar 8, 2016
@kylepjohnson
Member Author

@coderbhupendra If you accept the Core Member team request, we can assign this ticket to you.

@coderbhupendra
Contributor

@kylepjohnson Send it and I will accept it.

kylepjohnson added a commit that referenced this issue Mar 8, 2016
Regarding Issue:Normalize Unicode throughout CLTK #94
@kylepjohnson
Member Author

Thank you @coderbhupendra, I have merged PR #182.

I'll close this ticket for now and will reopen it when I implement this in places such as the Greek TLG output and tests.

@coderbhupendra
Contributor

OK @kylepjohnson, and thanks for your help too.

kylepjohnson added a commit to kylepjohnson/cltk that referenced this issue Aug 21, 2020