
Normalize Unicode throughout CLTK #94

Closed
kylepjohnson opened this issue Sep 2, 2015 · 25 comments

Comments

@kylepjohnson
Member

I've been reading about normalize() and hope it will prevent normalization problems in the future. This built-in function solves the problem of accented characters built from combining diacritics not comparing equal to their precomposed equivalents. Examples of this appear in the test suite, where I have struggled to make two strings of accented Greek equal one another.

Example of normalize() from Fluent Python by Luciano Ramalho (117-118):

>>> from unicodedata import normalize
>>> s1 = 'café' # composed "e" with acute accent
>>> s2 = 'cafe\u0301' # decomposed "e" and acute accent 
>>> len(s1), len(s2)
(4, 5)
>>> len(normalize('NFC', s1)), len(normalize('NFC', s2)) 
(4, 4)
>>> len(normalize('NFD', s1)), len(normalize('NFD', s2)) 
(5, 5)
>>> normalize('NFC', s1) == normalize('NFC', s2)
True
>>> normalize('NFD', s1) == normalize('NFD', s2) 
True

Solutions

  1. In core, use normalize() with the argument 'NFC', as Fluent Python recommends. Not all Greek combining sequences may reduce to precomposed characters, so this will need to be tested.

  2. In tests, especially for assertEqual(), check that more complicated strings equal one another. Use normalize('NFC', <text>) on the comparison strings, too, if necessary (see the sketch below the docs link).

  3. Use this to strip out accented characters coming from the PHI, which I currently don't do very gracefully here: https://github.com/kylepjohnson/cltk/blob/master/cltk/corpus/utils/formatter.py#L94

Docs: https://docs.python.org/3.4/library/unicodedata.html#unicodedata.normalize
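
For solution 2, a test could look roughly like this (a sketch only; the Greek string and test name are just placeholders):

import unicodedata
import unittest


class TestNormalization(unittest.TestCase):

    def test_greek_nfc_equality(self):
        """Composed and decomposed accented Greek compare equal after NFC."""
        composed = 'θε\u03ac'       # precomposed alpha with tonos
        decomposed = 'θεα\u0301'    # alpha + combining acute accent
        self.assertNotEqual(composed, decomposed)
        self.assertEqual(unicodedata.normalize('NFC', composed),
                         unicodedata.normalize('NFC', decomposed))


if __name__ == '__main__':
    unittest.main()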

@fractaledmind
Contributor

So there is a wrapper function in the Alfred-Workflow repo that I have used in all of my projects to handle Unicode strings consistently:

def decode(self, text, encoding='utf-8', normalization='NFC'):
    """Return ``text`` as normalised unicode.

    :param text: string
    :type text: encoded or Unicode string. If ``text`` is already a
        Unicode string, it will only be normalised.
    :param encoding: The text encoding to use to decode ``text`` to
        Unicode.
    :type encoding: ``unicode`` or ``None``
    :param normalization: The normalisation form to apply to ``text``.
    :type normalization: ``unicode`` or ``None``
    :returns: decoded and normalised ``unicode``

    :class:`Workflow` uses "NFC" normalisation by default. This is the
    standard for Python and will work well with data from the web (via
    :mod:`~workflow.web` or :mod:`json`).

    OS X, on the other hand, uses "NFD" normalisation (nearly), so data
    coming from the system (e.g. via :mod:`subprocess` or
    :func:`os.listdir`/:mod:`os.path`) may not match. You should either
    normalise this data, too, or change the default normalisation used by
    :class:`Workflow`.
    """
    if not isinstance(text, unicode):
        text = unicode(text, encoding)
    return unicodedata.normalize(normalization, text)
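
For a Python 3 code base, the same idea collapses to a much shorter standalone function (a rough sketch, not taken from any existing library):

import unicodedata


def decode(text, encoding='utf-8', normalization='NFC'):
    """Return ``text`` as a normalised str, decoding from bytes first if needed."""
    if isinstance(text, bytes):
        text = text.decode(encoding)
    return unicodedata.normalize(normalization, text)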

@kylepjohnson
Member Author

Thanks, @smargh, for sharing. You inspire me to try something like this soon.

@jtauber

jtauber commented Mar 3, 2016

I actually end up using NFKC just so alt forms of theta and phi, etc, get normalized.

I have a (very) work-in-progress article about all this stuff: http://jktauber.com/articles/python-unicode-ancient-greek/
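
For example (a quick REPL check), the symbol variants fold to the standard letters only under the compatibility forms:

>>> from unicodedata import normalize
>>> normalize('NFC', '\u03d1'), normalize('NFKC', '\u03d1')   # GREEK THETA SYMBOL
('ϑ', 'θ')
>>> normalize('NFC', '\u03d5'), normalize('NFKC', '\u03d5')   # GREEK PHI SYMBOL
('ϕ', 'φ')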

@coderbhupendra
Contributor

@jtauber thanks for this wonderful article; I was googling to understand normalization myself in order to solve this issue, and it cleared up a few of my doubts.
@kylepjohnson I think I will be able to do this after asking a few questions.

@kylepjohnson
Member Author

Yes, @coderbhupendra, once you finish #95, let's talk about this one. It's a somewhat similar task, so it'd be great for you to do it.

I will also appreciate your help in checking this in some Indic languages (namely Sanskrit).

@jtauber

jtauber commented Mar 4, 2016

@coderbhupendra happy to help with any Unicode / Python / Greek questions too (and hoping to learn some Sanskrit along the way)

@coderbhupendra
Contributor

@kylepjohnson I would now like to start on this.
We can strip the acute using James's code: first remove the acute, then remove punctuation afterward. Is this what you meant by doing it gracefully?

import unicodedata
from string import punctuation

text = 'café'
ACUTE = "\u0301"

# decompose, drop the combining acute, then recompose
text_acute_free = unicodedata.normalize("NFC", "".join(
    ch
    for ch in unicodedata.normalize("NFD", text)
    if ch not in [ACUTE]))

# after this we will do the remaining things
new_text = ''
for char in text_acute_free:
    if char not in punctuation:
        new_text += char

@coderbhupendra
Contributor

@kylepjohnson
For problem 1 ("in core, use normalize with the argument 'NFC'") and problem 2 ("in tests, especially for assertEqual()"), can you point out these pieces of code?

@coderbhupendra
Contributor

@kylepjohnson can you give some detail on whether that approach for stripping the acute is right, and about the other two problems?

@kylepjohnson
Member Author

@coderbhupendra The main point of this task isn't about stripping accents, but turning combining diacritics into precomposed characters. Please try the code I use in the above example.

The function we need will look something like:

def cltk_normalize(text):
    return normalize('NFKC', text)
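
For Greek, this is the sort of thing such a function would do (an illustrative REPL session):

>>> from unicodedata import normalize
>>> decomposed = '\u03b1\u0313\u0301ειδε'   # alpha + combining smooth breathing + combining acute, then 'ειδε'
>>> normalize('NFKC', decomposed) == '\u1f04ειδε'   # precomposed ἄ
True
>>> len(decomposed), len(normalize('NFKC', decomposed))
(7, 5)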

Would you do us the favor of testing this out with some Devanagari and putting it in a Gist? I'll do the same, soon, with Greek. Then we can talk about any necessary changes, and then you can do a pull request. Sound good?

@coderbhupendra
Contributor

@kylepjohnson I think normalization is not working for Sanskrit. What you wanted to do is remove all punctuation and -\n|«|»|<|>|...|‘|’|_|{.+?}|(.+?)|[a-zA-Z0-9] from the text, but while stripping you didn't want to strip/separate the symbols (i.e. the diacritics) from the letters.

But for Sanskrit, normalization is not helping. Here s is a Sanskrit sentence:

s = 'मनोहारि देहं महच्चित्तगेहम्'

On printing [ch for ch in normalize('NFKC', s)]:
['म', 'न', 'ो', 'ह', 'ा', 'र', 'ि', ' ', 'द', 'े', 'ह', 'ं', ' ', 'म', 'ह', 'च', '्', 'च', 'ि', 'त', '्', 'त', 'ग', 'े', 'ह', 'म', '्']

On printing [ch for ch in s]:
['म', 'न', 'ो', 'ह', 'ा', 'र', 'ि', ' ', 'द', 'े', 'ह', 'ं', ' ', 'म', 'ह', 'च', '्', 'च', 'ि', 'त', '्', 'त', 'ग', 'े', 'ह', 'म', '्']
Even after normalization, the character-by-character split is exactly the same, and the length is unchanged:

len(s) == len(normalize('NFKC', s))
True

@jtauber

jtauber commented Mar 7, 2016

Unicode Normalization doesn't strip anything. Normalization just makes sure that if there is more than one way to express something in Unicode, a consistent choice is made.
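
Concretely: Greek ά can be written either as one precomposed code point or as alpha plus a combining acute, so normalization picks one of the two; a Devanagari matra only has one encoding in the first place, so there is nothing to choose between (a rough illustration):

>>> from unicodedata import normalize
>>> len('\u03ac'), len('\u03b1\u0301')          # ά precomposed vs. decomposed
(1, 2)
>>> normalize('NFC', '\u03b1\u0301') == '\u03ac'
True
>>> len('कि'), len(normalize('NFC', 'कि'))       # ka + i-matra: no precomposed form exists
(2, 2)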

@coderbhupendra
Contributor

@jtauber by stripping I meant this: [ch for ch in normalize('NFKC', s)]. You can see that the results for [ch for ch in normalize('NFKC', s)] and [ch for ch in s] are the same in the case of Sanskrit.

In the case of the 'café' example, though, you can see the difference:

In [41]: s1 = 'cafe\u0301'   # decomposed: 'e' + combining acute

In [42]: [ch for ch in normalize('NFKC', s1)]
Out[42]: ['c', 'a', 'f', 'é']

In [43]: [ch for ch in s1]
Out[43]: ['c', 'a', 'f', 'e', '́']

@kylepjohnson kylepjohnson self-assigned this Mar 7, 2016
@kylepjohnson
Member Author

@coderbhupendra It's helpful to know that normalize() doesn't handle Sanskrit well.

Unless there are objections, I'm going to assign this to myself and pass at least some Greek through it in the core.

At some point, someone might want to look at what this library, Indic NLP, does (note that it is Python 2.7).

@jtauber

jtauber commented Mar 7, 2016

@coderbhupendra I'm still not sure it's doing anything wrong for Sanskrit. The fact that Greek has precomposed characters for most diacritic combinations (but not all; see http://jktauber.com/2016/02/09/updated-solution-polytonic-greek-unicodes-problems/ ) is actually for political rather than technical reasons, and the Unicode Technical Committee assure me that if they were adding Greek now, it would work the way Sanskrit works in your example.

@kylepjohnson
Member Author

Thanks, James, for your insights. You are surely correct, I see, when you put it this way.

If the precomposed characters are merely political, and not how other languages are handled, what is your opinion of how polytonic Greek ought to be handled? Is there a benefit to the precomposed forms?


@jtauber

jtauber commented Mar 7, 2016

It's still useful to normalize to something just because otherwise you can't test equality properly.

To be honest, I've only recently been told by members of the Unicode Technical Committee why new precomposed characters will not be introduced to handle length marking + other diacritics, so I'm still trying to work things out.

Keeping things NFC or NFKC certainly makes them work better in Python 3 (until you hit length marking + other diacritics).

Note, however, that pyuca (and in fact UCA in general) converts to NFD before looking up collation elements. My diacritic stripping / adding code converts to NFD first too.

Still trying to work out the best trade off. Honestly, my world was somewhat turned upside down by the UTC members' feedback on my complaints about the lack of precomposed length + other diacritics.

This turns out to be something Perl 6 is good at, I also recently discovered.
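
For what it's worth, the usual stripping pattern (a separate operation from normalization itself) is to decompose first and then drop the combining marks; a rough sketch:

import unicodedata


def strip_diacritics(text):
    """Drop combining marks: decompose, filter them out, then recompose what is left."""
    decomposed = unicodedata.normalize('NFD', text)
    stripped = ''.join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize('NFC', stripped)

For polytonic Greek this gives, e.g., strip_diacritics('θεά') == 'θεα'.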

@coderbhupendra
Contributor

@kylepjohnson if I have understood the problem of "combining diacritics not equaling precomposed characters," can you tell me whether the changes I made in my gist are correct or not?
https://gist.github.com/coderbhupendra/a5016a8e52b480c14fa1

And @jtauber, to my understanding, diacritics are, in simple words, just special symbols added to the 24 Greek letters. Likewise, in Sanskrit we have "matras," which are added to the Sanskrit letters. Above I did the same thing for a Greek sentence and a Sanskrit sentence, and in the case of Sanskrit the "matras" stay separate after normalization; normalization has no effect. You may try it yourself. So I think it is not working for Sanskrit.

@kylepjohnson
Member Author

@coderbhupendra Yes, for our purposes here, accents == polytonic diacritics == matras. For your commit, this is close, but I don't want you to implement it yet on any texts; I'll do this one-by-one. Thus, make it like this:

def cltk_normalize(text, compatibility=True):
    if compatibility:
        return normalize('NFKC', text)
    else:
        return normalize('NFC', text)

Also remember to import normalize at the top of the file, and add two tests, one each for compatibility=True and compatibility=False, in cltk/tests/test_corpus.py. Use the 'café' example above.
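
Something along these lines would do (a sketch only, assuming cltk_normalize is imported into the existing test class in cltk/tests/test_corpus.py):

    def test_cltk_normalize_compatible(self):
        """NFKC: decomposed 'cafe' + combining acute equals precomposed 'café'."""
        s1 = 'café'
        s2 = 'cafe\u0301'
        self.assertEqual(cltk_normalize(s1, compatibility=True),
                         cltk_normalize(s2, compatibility=True))

    def test_cltk_normalize_noncompatible(self):
        """NFC: the same pair also compares equal without compatibility folding."""
        s1 = 'café'
        s2 = 'cafe\u0301'
        self.assertEqual(cltk_normalize(s1, compatibility=False),
                         cltk_normalize(s2, compatibility=False))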

@kylepjohnson
Member Author

@jtauber I see what you're saying and this is making my head spin, too.

I agree that we're best off normalizing to something … I'll want to hear more about what you come to think of all this with Python 3.

@kylepjohnson kylepjohnson removed their assignment Mar 8, 2016
@kylepjohnson
Member Author

@coderbhupendra If you accept the Core Member team request, we can assign this ticket to you.

@coderbhupendra
Contributor

@kylepjohnson Send it and I will accept it.

kylepjohnson added a commit that referenced this issue Mar 8, 2016
Regarding Issue:Normalize Unicode throughout CLTK #94
@kylepjohnson
Member Author

Thank you @coderbhupendra, I have merged PR #182.

I'll close this ticket for now and will reopen it when I implement this in places such as the Greek TLG output and tests.

@coderbhupendra
Contributor

OK @kylepjohnson, and thanks for your help too.

kylepjohnson added a commit to kylepjohnson/cltk that referenced this issue Aug 21, 2020