Normalize Unicode throughout CLTK #94
Comments
So there is a wrapper function in the Alfred-Workflow repo that I have used in all of my projects to handle Unicode strings consistently:
Thanks, @smargh, for sharing. You inspire me to try something like this soon.
I actually end up using NFKC just so alt forms of theta and phi, etc., get normalized. I have a (very) work-in-progress article about all this stuff: http://jktauber.com/articles/python-unicode-ancient-greek/
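For instance, a quick illustration of the alternate-form folding mentioned here: the Greek "symbol" variants of theta and phi have compatibility (not canonical) decompositions, so only NFKC folds them into the ordinary letters.

```python
from unicodedata import normalize

# GREEK THETA SYMBOL (U+03D1) and GREEK PHI SYMBOL (U+03D5) carry
# compatibility decompositions to the plain letters theta and phi,
# so NFKC folds them together while NFC leaves them distinct.
theta_symbol = "\u03d1"  # ϑ
phi_symbol = "\u03d5"    # ϕ

print(normalize("NFC", theta_symbol))   # ϑ (unchanged)
print(normalize("NFKC", theta_symbol))  # θ
print(normalize("NFKC", phi_symbol))    # φ
```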
@jtauber thanks for this wonderful article. I was googling to understand normalization in order to solve this issue, and it cleared up a few of my doubts.
Yes, @coderbhupendra, once you finish #95, let's talk about this one. It's a somewhat similar task, so it'd be great for you to do it. I would also appreciate your help in checking this in some Indic languages (Sanskrit, namely).
@coderbhupendra happy to help with any Unicode / Python / Greek questions too (and hoping to learn some Sanskrit along the way)
@kylepjohnson I would now like to start on this.

@kylepjohnson can you give some detail on whether that approach for stripping the acute is right, and about the other two problems?
@coderbhupendra The main point of this task isn't stripping accents, but turning combining diacritics into precomposed characters. Please try the code I use in the example above. The function we need will look something like:

```python
def cltk_normalize(text):
    return normalize('NFKC', text)
```

Would you do us the favor of testing this out with some Devanagari and putting it in a Gist? I'll do the same, soon, with Greek. Then we can talk about any changes necessary, and then you can do a pull request. Sound good?
@kylepjohnson now I understand the benefit of normalization. I hope it's correct; below I have pointed out the changes: https://gist.github.com/coderbhupendra/a5016a8e52b480c14fa1#file-gistfile1-txt-L76
@ I think normalization is not working in Sanskrit. What you wanted to do is remove all punctuation and `-\n|«|»|<|>|...|‘|’|_|{.+?}|(.+?)|[a-zA-Z0-9]` from the text, but while stripping you didn't want to strip/separate the symbols (i.e., diacritics) from the alphabet characters. In Sanskrit, however, normalization is not helping.
Unicode normalization doesn't strip anything. Normalization just makes sure that if there is more than one way to express something in Unicode, a consistent choice is made.
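A small illustration of that "consistent choice," using Greek alpha with acute (strings chosen here for illustration):

```python
from unicodedata import normalize

# Two encodings of ά: precomposed U+03AC vs. decomposed α + combining acute.
precomposed = "\u03ac"
decomposed = "\u03b1\u0301"

print(precomposed == decomposed)                    # False: different code points
print(normalize("NFC", decomposed) == precomposed)  # True: NFC composes
print(normalize("NFD", precomposed) == decomposed)  # True: NFD decomposes
```

Nothing is removed in either direction; each form simply picks one canonical spelling.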
@jtauber by stripping I meant this: `[ch for ch in normalize('NFKC', s)]`. You can see the results above; in the case of Greek you can see the difference.
@coderbhupendra It's helpful to know that. Unless there are objections, I'm going to assign this to myself and pass at least some Greek through it in the core. At some point, someone might want to look at what this library, Indic NLP, does (note that it is Python 2.7).
@coderbhupendra I'm still not sure it's doing anything wrong for Sanskrit. The fact that Greek has precomposed characters for most diacritic combinations (but not all, see http://jktauber.com/2016/02/09/updated-solution-polytonic-greek-unicodes-problems/ ) is actually for political rather than technical reasons, and members of the Unicode Technical Committee assure me that if they were adding Greek now, it would work the way Sanskrit is working in your example.
Thanks, James, for your insights. You are surely correct, I see, when you […] If the combined characters are merely political, and not how other […]

On Monday, March 7, 2016, James Tauber notifications@github.com wrote: […]

Kyle P. Johnson, Ph.D. | Natural language processing, data science, architecture | Classical Language Toolkit, Founder
It's still useful to normalize to something, just because otherwise you can't test equality properly. To be honest, I've only recently been told by members of the Unicode Technical Committee why new precomposed characters will not be introduced to handle length marking plus other diacritics, and I'm still trying to work things out. Keeping things NFC or NFKC certainly makes them work better in Python 3 (until you hit length marking plus other diacritics). Note, however, that pyuca (and in fact the UCA in general) converts to NFD before looking up collation elements. My diacritic stripping/adding code converts to NFD first too. I'm still trying to work out the best trade-off. Honestly, my world was somewhat turned upside down by the UTC members' feedback on my complaints about the lack of precomposed length-plus-other-diacritic characters. This turns out to be something Perl 6 is good at, I also recently discovered.
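The NFD-first stripping approach mentioned here can be sketched like this (a hypothetical helper for illustration, not the actual CLTK or pyuca code):

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Decompose to NFD, drop combining marks, then recompose.

    Illustrative only: decomposing first guarantees every diacritic is a
    separate combining code point that unicodedata.combining() can detect.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)

print(strip_diacritics("μῆνιν ἄειδε θεὰ"))  # μηνιν αειδε θεα
```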
@kylepjohnson if I have understood the problem of "combining diacritics not equaling precomposed characters," can you tell me whether the changes I made in my gist are correct or not?

And @jtauber, according to my understanding, diacritics are, in simple words, special symbols that are added to the 24 Greek letters. Likewise, in Sanskrit we have "matras," which are added over the Sanskrit letters. Above, I did the same thing for a Greek and a Sanskrit sentence, and in the case of Sanskrit the matras stayed separated after normalization; normalization had no effect. You may try it yourself. So I think it is not working for Sanskrit.
@coderbhupendra Yes, for our purposes here, accents == polytonic diacritics == matras. Your commit is close, but I don't want you to implement it yet on any texts; I'll do this one-by-one. Thus, make it like this:

```python
def cltk_normalize(text, compatibility=True):
    if compatibility:
        return normalize('NFKC', text)
    else:
        return normalize('NFC', text)
```

Also remember to import at the top of the file, and add two tests, one each for compatibility True/False, in […]
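Put together with its import and a pair of simple checks (the test strings below are chosen here for illustration, not taken from the CLTK test suite), that might look like:

```python
from unicodedata import normalize

def cltk_normalize(text, compatibility=True):
    """Normalize to NFKC by default, or NFC when compatibility=False."""
    if compatibility:
        return normalize('NFKC', text)
    return normalize('NFC', text)

# compatibility=True: composes α + combining acute AND folds ϑ (U+03D1) to θ
assert cltk_normalize('\u03b1\u0301\u03d1') == '\u03ac\u03b8'
# compatibility=False: composes the diacritic but leaves ϑ as-is
assert cltk_normalize('\u03b1\u0301\u03d1', compatibility=False) == '\u03ac\u03d1'
```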
@jtauber I see what you're saying, and this is making my head spin too. I agree that we're best off normalizing to something … I'll want to hear more about what you come to think of all this with Python 3.
@coderbhupendra If you accept the Core Member team request, we can assign this ticket to you.
@kylepjohnson You send it, I will accept it.
Regarding issue: Normalize Unicode throughout CLTK #94
Thank you @coderbhupendra, I have merged PR #182. I'll close this ticket for now, will open when I implement it in places, such as for Greek TLG output and tests. |
OK @kylepjohnson, and thanks for your help too.
I've been reading about `normalize()` and hope it will prevent normalization problems in the future. This builtin method solves the problem of accented characters made with combining diacritics not equaling precomposed characters. Examples of this appear in the testing library, where I have struggled to make two strings of accented Greek equal one another.

There is an example of `normalize()` in Fluent Python by Luciano Ramalho (pp. 117–118).

Solutions
- In core, use `normalize` with the argument 'NFC', as Fluent Python recommends. Not all Greek combining forms may reduce into precomposed characters; this will need to be tested out.
- In tests, especially for `assertEqual()`, check that more complicated strings equal one another. Use `normalize('NFC', <text>)` on the comparison strings, too, if necessary.
- Use this to strip out accented characters coming from the PHI, which I don't do very gracefully here: https://github.com/kylepjohnson/cltk/blob/master/cltk/corpus/utils/formatter.py#L94
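As a concrete check of the equality problem described above (the 'café' string is the example Fluent Python itself uses; any combining-vs-precomposed pair behaves the same way):

```python
from unicodedata import normalize

# Two spellings of 'café': precomposed é (U+00E9) vs. e + combining acute.
s1 = "caf\u00e9"
s2 = "cafe\u0301"

print(s1 == s2)          # False, though they render identically
print(len(s1), len(s2))  # 4 5
print(normalize("NFC", s1) == normalize("NFC", s2))  # True
```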
Docs: https://docs.python.org/3.4/library/unicodedata.html#unicodedata.normalize