# Morphological Analysis

In [11]:
import sys
import os.path as p

In [12]:
%load_ext autoreload
%autoreload 2

In [13]:
exp_dir = "/media/data/code/polyglot/"

if exp_dir not in sys.path:
  sys.path.insert(0, exp_dir)

In [14]:
import polyglot

## Language Coverage

The models were trained on a combination of:
- Original CoNLL datasets, after the tags were converted using the [universal POS tables](http://universaldependencies.github.io/docs/tagset-conversion/index.html).
- Universal Dependencies 1.0 corpora, where available.

In [5]:
from __future__ import print_function

In [6]:
from polyglot.downloader import downloader
print(", ".join(sorted(downloader.supported_languages("morph2"))))

Afrikaans, Albanian, Alemannic, Amharic, Arabic, Aragonese, Armenian, Assamese, Asturian, Azerbaijani, Bashkir, Basque, Bavarian, Belarusian, Bengali, Bishnupriya Manipuri, Bosnian, Bosnian-Croatian-Serbian, Breton, Bulgarian, Burmese, Catalan; Valencian, Cebuano, Chechen, Chinese, Chuvash, Croatian, Czech, Danish, Divehi; Dhivehi; Maldivian;, Dutch, Egyptian Arabic, English, Esperanto, Estonian, Faroese, Fiji Hindi, Finnish, French, Galician, Gan Chinese, Georgian, German, Greek, Modern, Gujarati, Haitian; Haitian Creole, Hebrew (modern), Hindi, Hungarian, Icelandic, Ido, Ilokano, Indonesian, Interlingua, Irish, Italian, Japanese, Javanese, Kannada, Kapampangan, Kazakh, Khmer, Kirghiz, Kyrgyz, Korean, Kurdish, Latin, Latvian, Limburgish, Limburgan, Limburger, Lithuanian, Lombard language, Luxembourgish, Letzeburgesch, Macedonian, Malagasy, Malay, Malayalam, Maltese, Manx, Marathi (Marāṭhī), Mongolian, Nepali, Northern Sami, Norwegian, Norwegian Nynorsk, Occitan, Oriya, Ossetian, Osset

#### Download Necessary Models

In [7]:
%%bash
polyglot download morph2.en morph2.ar

[polyglot_data] Downloading package morph2.en to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package morph2.en is already up-to-date!
[polyglot_data] Downloading package morph2.ar to
[polyglot_data]     /home/rmyeid/polyglot_data...
[polyglot_data]   Package morph2.ar is already up-to-date!


## Library Interface

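The `morphemes` property segments a text into its morphemes. In this example the input contains no spaces, and the language is set explicitly to English.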

In [15]:
from polyglot.text import Text, Word

In [16]:
blob = "Wewillmeettoday."
text = Text(blob)
text.language = "en"

We can query all the morphemes found in the text:

In [17]:
text.morphemes

WordList([u'We', u'will', u'meet', u'to', u'day', u'.'])

A single word can be segmented in the same way through the `morphemes` property of a `Word` object:

In [20]:
words = ["preprocessing", "processor", "invaluable", "thankful", "crossed"]
for w in words:
  w2 = Word(w, language="en")
  print("{:<20}{}".format(w2, w2.morphemes))

preprocessing       ['pre', 'process', 'ing']
processor           ['process', 'or']
invaluable          ['in', 'valuable']
thankful            ['thank', 'ful']
crossed             ['cross', 'ed']
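
In the outputs shown above, the morphemes of each input concatenate back to the original surface form. The following minimal sketch works only on segmentations copied from those outputs and does not call polyglot. It checks that property and reports the character offsets of the morpheme boundaries. The `boundaries` helper is hypothetical and not part of the library.

```python
# Segmentations copied from the outputs shown above (no polyglot call is made).
segmentations = {
    "Wewillmeettoday.": ["We", "will", "meet", "to", "day", "."],
    "preprocessing": ["pre", "process", "ing"],
    "processor": ["process", "or"],
    "invaluable": ["in", "valuable"],
    "thankful": ["thank", "ful"],
    "crossed": ["cross", "ed"],
}


def boundaries(morphemes):
    """Return the character offsets at which the surface form is split.

    Hypothetical helper for illustration; it is not part of polyglot.
    """
    offsets = []
    position = 0
    for morpheme in morphemes[:-1]:
        position += len(morpheme)
        offsets.append(position)
    return offsets


for surface, morphemes in segmentations.items():
    # For the examples above, joining the morphemes restores the input.
    assert "".join(morphemes) == surface
    print("{:<20}{:<25}{}".format(surface, "+".join(morphemes), boundaries(morphemes)))
```

Offsets like these can be handy when the segmentation has to be aligned with character-level annotations on the original word.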


## Command Line Interface

#### Tokenization

Note that if we do not pass the language code with `--lang`, the language detector will be used to detect the language of the document.

In [21]:
%%bash
tok_file=/tmp/cricket.tok.txt
polyglot --lang en tokenize --input testdata/cricket.txt > $tok_file
head -n 2 $tok_file

Australia posted a World Cup record total of 417 - 6 as they beat Afghanistan by 275 runs .
David Warner hit 178 off 133 balls , Steve Smith scored 95 while Glenn Maxwell struck 88 in 39 deliveries in the Pool A encounter in Perth .


#### Morphemes

In [8]:
%%bash
tok_file=/tmp/cricket.tok.txt
polyglot --lang en morph --input $tok_file | tail -n 50

-               -    
4               4    
against         a_gain_st
West            West 
Indies          In_dies
and             and  
Ireland         Ireland
respectively    re_spective_ly
.               .    

The             The  
winning         winning
margin          margin
beats           beat_s
the             the  
257             2_57 
-               -    
run             run  
amount          amount
by              by   
which           which
India           In_dia
beat            beat 
Bermuda         Ber_mud_a
in              in   
Port            Port 
of              of   
Spain           Spa_in
in              in   
2007            2007 
,               ,    
which           which
was             wa_s 
equalled        equal_led
five            five 
days            day_s
ago             ago  
by              by   
South           South
Africa          Africa
in              in   
their           t_heir
victory         victor_y
over            over 
West            
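
The output above has two whitespace-separated columns: the token and its segmentation, with morphemes joined by `_` and a blank line between sentences. The following minimal sketch reads that layout back into Python lists. It assumes the format is exactly as printed here; `parse_morph_output` is a hypothetical helper, not a polyglot function, and treating a line with a missing second column as unsegmented is also an assumption.

```python
def parse_morph_output(text):
    """Parse two-column morph output into sentences of (word, morphemes).

    Hypothetical helper; it assumes the layout printed above. A token that
    is itself an underscore would be split incorrectly by this sketch.
    """
    sentences = []
    current = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        word = fields[0]
        # Assumption: a line without a second column is left unsegmented.
        segmentation = fields[1] if len(fields) > 1 else word
        current.append((word, segmentation.split("_")))
    if current:
        sentences.append(current)
    return sentences


# A few lines copied from the output above.
sample = """\
respectively    re_spective_ly
.               .

The             The
beats           beat_s
"""

for sentence in parse_morph_output(sample):
    print(sentence)
```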

#### Nesting steps
We can nest the tokenization and morphological analysis in a simple bash pipeline:

In [9]:
!polyglot --lang en tokenize --input testdata/cricket.txt |  polyglot --lang en morph | tail -n 30

which           which
India           In_dia
beat            beat 
Bermuda         Ber_mud_a
in              in   
Port            Port 
of              of   
Spain           Spa_in
in              in   
2007            2007 
,               ,    
which           which
was             wa_s 
equalled        equal_led
five            five 
days            day_s
ago             ago  
by              by   
South           South
Africa          Africa
in              in   
their           t_heir
victory         victor_y
over            over 
West            West 
Indies          In_dies
in              in   
Sydney          Syd_ney
.               .    
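
A further processing step could be added after the pipeline to summarize its output. The following minimal sketch counts how often each morpheme occurs, assuming the two-column layout printed above. `count_morphemes` is a hypothetical helper and not part of polyglot; the sample lines are copied from the output.

```python
from collections import Counter


def count_morphemes(lines):
    """Count morphemes in two-column morph output (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            # Skip blank sentence separators and incomplete lines.
            continue
        counts.update(fields[1].split("_"))
    return counts


# A few lines copied from the output above.
sample_lines = [
    "was             wa_s",
    "equalled        equal_led",
    "five            five",
    "days            day_s",
    "",
]

counts = count_morphemes(sample_lines)
for morpheme, frequency in counts.most_common(3):
    print("{:<10}{}".format(morpheme, frequency))
```

Because the function accepts any iterable of lines, it could also be given `sys.stdin` in a script placed at the end of the pipeline.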



### Citation

This work is a direct implementation of the research described in the [Polyglot: Distributed Word Representations for Multilingual NLP](http://www.aclweb.org/anthology/W13-3520) paper.
The author of this library strongly encourages you to cite this paper if you are using this software.

## References

- [Universal Part of Speech Tagging](http://universaldependencies.github.io/docs/u/pos/index.html)
- [Universal Dependencies 1.0](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1464)