 Russian Language NLP

<br><h4>While researching natural language processing for Russian, I found a library called pymorphy2.</h4><br>

In [1]:
import nltk
import spacy
from nltk.tokenize import word_tokenize, WordPunctTokenizer 
from nltk.tag import pos_tag

In [2]:
rus_text = "Его зовут Д.А. Редько! А как зовут вас? "

In [3]:
nltk.sent_tokenize(rus_text)

['Его зовут Д.А.', 'Редько!', 'А как зовут вас?']

<br><h4>The result changes if we set the language to Russian: in Russian, 'Д.А.' is a pair of initials (an abbreviated first name and patronymic), so the period after it does not start a new sentence.<br></h4>

In [4]:
nltk.sent_tokenize(rus_text, language='russian')

['Его зовут Д.А. Редько!', 'А как зовут вас?']

In [5]:
import pymorphy2

<br><h4>Morphological analysis is the determination of the characteristics of a word based on how the word is spelled. Morphological analysis does not use information about neighboring words.<br></h4>

In [6]:
morph = pymorphy2.MorphAnalyzer()

<br><h4>Here I use forms of the verb 'run' (future and present tense) to demonstrate how parsing works. The words are stored in a dictionary of the form {'english': 'russian'}.<br></h4>

In [7]:
run = {'will_run': 'побегу', 'run':'бегаю', 'running': 'бегу'}

In [8]:
run_english = list(run.keys())
run_russian = list(run.values())

In [9]:
print(run_english)

['will_run', 'run', 'running']


In [10]:
print(run_russian)

['побегу', 'бегаю', 'бегу']


<br><h4>Let's get some data about each word: for example, whether it is a noun or a verb, singular or plural, what its normal form is, and so on.<br></h4>

In [11]:
for eng_run_word, rus_run_word in run.items():
    print('______________________________________________________________')
    print('')
    print('forms of ENG: {}, RUS: {}'.format(eng_run_word, rus_run_word))
    print('')

    # morph.parse returns every possible analysis of the word, best score first
    for form in morph.parse(rus_run_word):
        # tag in Latin (OpenCorpora) notation
        print("ENG:")
        print("{} : {}; score: {}".format(form.word, form.tag, form.score))
        # the same tag in Cyrillic notation
        print("RUS:")
        print("{} : {}; score: {}".format(form.word, form.tag.cyr_repr, form.score))
        print("normal form: "+form.normal_form)
        print('')


______________________________________________________________

forms of ENG: will_run, RUS: побегу

ENG:
побегу : VERB,perf,intr sing,1per,futr,indc; score: 0.666666
RUS:
побегу : ГЛ,сов,неперех ед,1л,буд,изъяв; score: 0.666666
normal form: побежать

ENG:
побегу : NOUN,inan,masc sing,datv; score: 0.333333
RUS:
побегу : СУЩ,неод,мр ед,дт; score: 0.333333
normal form: побег

______________________________________________________________

forms of ENG: run, RUS: бегаю

ENG:
бегаю : VERB,impf,intr sing,1per,pres,indc; score: 1.0
RUS:
бегаю : ГЛ,несов,неперех ед,1л,наст,изъяв; score: 1.0
normal form: бегать

______________________________________________________________

forms of ENG: running, RUS: бегу

ENG:
бегу : NOUN,inan,masc,Sgtm sing,loc2; score: 0.428571
RUS:
бегу : СУЩ,неод,мр,sg ед,пр2; score: 0.428571
normal form: бег

ENG:
бегу : NOUN,inan,masc,Sgtm sing,datv; score: 0.285714
RUS:
бегу : СУЩ,неод,мр,sg ед,дт; score: 0.285714
normal form: бег

ENG:
бегу : VERB,perf,intr sing,1pe

<br><h4>A lexeme is a unit of lexical meaning that underlies a set of words related through inflection. It is a basic abstract unit of meaning, a unit of morphological analysis in linguistics that roughly corresponds to the set of forms taken by a single root word (Wikipedia). Let's apply this to our analysis.<br></h4>

## Score
<br><h4>
The score is an estimate of P(tag | word). At present, estimates based on OpenCorpora (an open-source project) are available for about 20 thousand words (from about 250 thousand observations). For words without such an estimate, P(tag | word) is either taken to be uniform (for dictionary words) or estimated from empirical rules (for non-dictionary words). Parses are sorted by score in descending order, so the examples here always take the first parse.
For example:<br></h4>

In [12]:
run_rus_word = morph.parse(run['run'])[0]
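Taking the first parse is fine when the top-scoring reading is the one you want, but 'побегу' in the output earlier has both a verb reading and a noun reading. The following is a minimal sketch, not part of the original notebook, of picking the best-scoring parse for a given part of speech. It assumes each parse exposes `.score` and `.tag.POS`, as pymorphy2 parse results do; `best_parse` and the stand-in objects are hypothetical names introduced only for illustration.

```python
from types import SimpleNamespace


def best_parse(parses, pos):
    """Return the highest-scoring parse whose part of speech equals `pos`, or None."""
    candidates = [p for p in parses if p.tag.POS == pos]
    return max(candidates, key=lambda p: p.score, default=None)


# Hypothetical stand-ins mirroring the two parses of 'побегу' printed earlier.
# With the real library you would pass morph.parse('побегу') instead.
demo_parses = [
    SimpleNamespace(normal_form='побежать', score=0.666666,
                    tag=SimpleNamespace(POS='VERB')),
    SimpleNamespace(normal_form='побег', score=0.333333,
                    tag=SimpleNamespace(POS='NOUN')),
]

noun_reading = best_parse(demo_parses, 'NOUN')
print(noun_reading.normal_form)  # the noun reading, 'побег'
```

Returning `None` when no parse matches keeps the caller in charge of the fallback, for example defaulting to the first parse.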

## Lexeme
<br><h4>Russian morphology can be complicated: the lexeme of 'бегаю' ('run') contains 70 forms:<br></h4>

In [13]:
rus_run_lexeme = run_rus_word.lexeme
len(rus_run_lexeme)

70

<br><h4>Let's print all of them:<br></h4>

In [14]:
for rus_run_word in rus_run_lexeme:
    print(rus_run_word.word)

бегать
бегаю
бегаем
бегаешь
бегаете
бегает
бегают
бегал
бегала
бегало
бегали
бегай
бегайте
бегающий
бегающего
бегающему
бегающего
бегающий
бегающим
бегающем
бегающая
бегающей
бегающей
бегающую
бегающей
бегающею
бегающей
бегающее
бегающего
бегающему
бегающее
бегающим
бегающем
бегающие
бегающих
бегающим
бегающих
бегающие
бегающими
бегающих
бегавший
бегавшего
бегавшему
бегавшего
бегавший
бегавшим
бегавшем
бегавшая
бегавшей
бегавшей
бегавшую
бегавшей
бегавшею
бегавшей
бегавшее
бегавшего
бегавшему
бегавшее
бегавшим
бегавшем
бегавшие
бегавших
бегавшим
бегавших
бегавшие
бегавшими
бегавших
бегая
бегав
бегавши


## Words after numbers
<br><h4>Russian uses different noun forms after different numbers: '4 years' and '5 years' take different forms of the word 'year'.<br></h4>

In [15]:
year = morph.parse('год')[0]

In [16]:
year.make_agree_with_number(4).word

'года'

In [17]:
year.make_agree_with_number(6).word

'годов'

<br><h4>This is not the form a speaker would use: after 6, 'year' takes a different word, 'лет' ('6 лет'). This looks like a bug or limitation in the library. Let's try another word, 'тарелка' (a plate).<br></h4>

In [18]:
plate = morph.parse('тарелка')[0]

In [19]:
plate.make_agree_with_number(1).word

'тарелка'

In [20]:
plate.make_agree_with_number(2).word

'тарелки'

In [21]:
plate.make_agree_with_number(6).word

'тарелок'

<br><h4>This works exactly as expected.<br></h4>
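For a noun like 'год', whose counted form is a different word, one workaround is to supply the three forms by hand and apply the standard Russian numeral agreement rule directly. The sketch below is an illustration under that assumption, not the library's implementation; `agree_with_number`, `YEAR_FORMS`, and `PLATE_FORMS` are hypothetical names.

```python
def agree_with_number(number, forms):
    """Pick the noun form that follows `number` in Russian.

    `forms` is a triple: (form after 1, form after 2-4, form after 5 and up).
    """
    one, few, many = forms
    n = abs(number) % 100
    # 11-14 always take the "many" form, regardless of the last digit
    if 11 <= n <= 14:
        return many
    last_digit = n % 10
    if last_digit == 1:
        return one
    if 2 <= last_digit <= 4:
        return few
    return many


YEAR_FORMS = ('год', 'года', 'лет')
PLATE_FORMS = ('тарелка', 'тарелки', 'тарелок')

for n in (1, 4, 6, 11, 21):
    print(n, agree_with_number(n, YEAR_FORMS))
```

The same function reproduces the plate results above when called with `PLATE_FORMS`.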

## Change form of a word

In [29]:
plate

Parse(word='тарелка', tag=OpencorporaTag('NOUN,inan,femn sing,nomn'), normal_form='тарелка', score=1.0, methods_stack=((<DictionaryAnalyzer>, 'тарелка', 8, 0),))

In [30]:
plate.inflect({'gent'})

Parse(word='тарелки', tag=OpencorporaTag('NOUN,inan,femn sing,gent'), normal_form='тарелка', score=1.0, methods_stack=((<DictionaryAnalyzer>, 'тарелки', 8, 1),))

In [31]:
plate.inflect({'accs'})

Parse(word='тарелку', tag=OpencorporaTag('NOUN,inan,femn sing,accs'), normal_form='тарелка', score=1.0, methods_stack=((<DictionaryAnalyzer>, 'тарелку', 8, 3),))

In [32]:
plate.inflect({'loct'})

Parse(word='тарелке', tag=OpencorporaTag('NOUN,inan,femn sing,loct'), normal_form='тарелка', score=1.0, methods_stack=((<DictionaryAnalyzer>, 'тарелке', 8, 6),))
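When the requested form does not exist, `inflect` returns `None` instead of a parse, so chaining `.word` onto the result can fail. Below is a minimal sketch of a guard around it. `inflect_or_keep` and `StubParse` are hypothetical names; the stub only imitates the `word` attribute and `inflect` method so that the example runs without the library installed.

```python
def inflect_or_keep(parse, grammemes):
    """Return the inflected word, or the original word if no such form exists."""
    inflected = parse.inflect(grammemes)
    return parse.word if inflected is None else inflected.word


class StubParse:
    """Hypothetical stand-in for a parse object, used only to make this sketch runnable."""

    def __init__(self, word, forms):
        self.word = word
        self._forms = forms  # maps a frozenset of grammemes to a word form

    def inflect(self, grammemes):
        form = self._forms.get(frozenset(grammemes))
        return None if form is None else StubParse(form, self._forms)


# Forms copied from the inflect outputs above
plate_stub = StubParse('тарелка', {
    frozenset({'gent'}): 'тарелки',
    frozenset({'accs'}): 'тарелку',
    frozenset({'loct'}): 'тарелке',
})

print(inflect_or_keep(plate_stub, {'accs'}))  # known form
print(inflect_or_keep(plate_stub, {'masc'}))  # no such form: falls back to the original word
```

With the real library, the `plate` object from the cells above could be passed in place of `plate_stub`.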

## Gender of a word
<br><h4>In Russian, every noun has a grammatical gender. For example, the words for a male cat and a female cat are different. Let's see the difference.<br></h4>

In [26]:
male_cat = morph.parse('кот')[0]
female_cat = morph.parse('кошка')[0]

In [27]:
male_cat.tag.gender

'masc'

In [28]:
female_cat.tag.gender

'femn'