# References

- [1] https://polyglot.readthedocs.io/en/latest/
- [2] https://www.geeksforgeeks.org/natural-language-processing-using-polyglot-introduction/
- [3] https://blog.jcharistech.com/2018/12/10/introduction-to-natural-language-processing-with-polyglot/

# Intro

- Polyglot is a natural language pipeline that supports massive multilingual applications [1]
- Polyglot's API resembles TextBlob's, so anyone who already knows TextBlob should find the learning curve similar [3]
- Developed by Rami Al-Rfou
- Core Features:
  - Tokenization
  - Language detection
  - Named Entity Recognition
  - Part of Speech (POS) Tagging
  - Sentiment Analysis
  - Word Embeddings
  - Morphological Analysis
  - Transliteration

In [1]:
# Install polyglot itself (uncomment if it is not already installed)
# !pip install polyglot

# Install the dependency packages [2]
!pip install pyicu morfessor pycld2
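
Before continuing, it can help to confirm that the packages actually import. The sketch below uses only the standard library. The import names (`icu` for the `pyicu` distribution, and `polyglot`, `morfessor`, `pycld2` for the rest) are assumptions about how these packages are imported, and the helper name `check_dependencies` is hypothetical.

```python
import importlib.util

# Import names assumed for each pip distribution (pyicu is imported as "icu").
REQUIRED_MODULES = ["polyglot", "icu", "morfessor", "pycld2"]


def check_dependencies(modules=REQUIRED_MODULES):
    """Return a dict mapping each module name to True if it can be located."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


# Print one status line per module without importing anything heavy.
for name, available in check_dependencies().items():
    print("{:<12}{}".format(name, "ok" if available else "missing"))
```

A `missing` entry suggests re-running the matching `pip install` line before moving on.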



# Quick Tutorial [1]

In [2]:
import polyglot
from polyglot.text import Text, Word

## Language Detection

In [3]:
text = Text("Bonjour, Mesdames.")
print("Language Detected: Code={}, Name={}\n".format(text.language.code, text.language.name))

Language Detected: Code=fr, Name=French



## Tokenization

In [4]:
zen = Text("Beautiful is better than ugly. "
           "Explicit is better than implicit. "
           "Simple is better than complex.")
print(zen.words)

['Beautiful', 'is', 'better', 'than', 'ugly', '.', 'Explicit', 'is', 'better', 'than', 'implicit', '.', 'Simple', 'is', 'better', 'than', 'complex', '.']


In [5]:
print(zen.sentences)

[Sentence("Beautiful is better than ugly."), Sentence("Explicit is better than implicit."), Sentence("Simple is better than complex.")]


## POS Tagging

In [6]:
# Download the models used in this quick tutorial
!polyglot download embeddings2.pt
!polyglot download pos2.pt
!polyglot download embeddings2.de
!polyglot download ner2.de
!polyglot download sentiment2.en
!polyglot download sgns2.en
!polyglot download morph2.en
!polyglot download transliteration2.ru

[polyglot_data] Downloading package embeddings2.pt to
[polyglot_data]     /Users/enlik/polyglot_data...
[polyglot_data]   Package embeddings2.pt is already up-to-date!
[polyglot_data] Downloading package pos2.pt to
[polyglot_data]     /Users/enlik/polyglot_data...
[polyglot_data]   Package pos2.pt is already up-to-date!
[polyglot_data] Downloading package embeddings2.de to
[polyglot_data]     /Users/enlik/polyglot_data...
[polyglot_data]   Package embeddings2.de is already up-to-date!
[polyglot_data] Downloading package ner2.de to
[polyglot_data]     /Users/enlik/polyglot_data...
[polyglot_data]   Package ner2.de is already up-to-date!
[polyglot_data] Downloading package sentiment2.en to
[polyglot_data]     /Users/enlik/polyglot_data...
[polyglot_data]   Package sentiment2.en is already up-to-date!
[polyglot_data] Downloading package sgns2.en to
[polyglot_data]     /Users/enlik/polyglot_data...
[polyglot_data]   Package sgns2.en is already up-to-date!
[polyglot_data] Downloading packag

In [7]:
text = Text(u"O primeiro uso de desobediência civil em massa ocorreu em setembro de 1906.")

print("{:<16}{}".format("Word", "POS Tag")+"\n"+"-"*30)
for word, tag in text.pos_tags:
    print(u"{:<16}{:>2}".format(word, tag))

Word            POS Tag
------------------------------
O               DET
primeiro        ADJ
uso             NOUN
de              ADP
desobediência   NOUN
civil           ADJ
em              ADP
massa           NOUN
ocorreu         ADJ
em              ADP
setembro        NOUN
de              ADP
1906            NUM
.               PUNCT
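
A common next step is to summarize the tags. The sketch below counts tags with `collections.Counter`, using the (word, tag) pairs copied from the output above so that it runs without the models; with polyglot installed, `text.pos_tags` should supply the same kind of pairs.

```python
from collections import Counter

# (word, tag) pairs copied from the output above.
pos_tags = [
    ("O", "DET"), ("primeiro", "ADJ"), ("uso", "NOUN"), ("de", "ADP"),
    ("desobediência", "NOUN"), ("civil", "ADJ"), ("em", "ADP"),
    ("massa", "NOUN"), ("ocorreu", "ADJ"), ("em", "ADP"),
    ("setembro", "NOUN"), ("de", "ADP"), ("1906", "NUM"), (".", "PUNCT"),
]

# Count how often each tag occurs, most frequent first.
tag_counts = Counter(tag for _, tag in pos_tags)
for tag, count in tag_counts.most_common():
    print("{:<8}{}".format(tag, count))
```

Note that the model labels the verb `ocorreu` as `ADJ`, so any statistics built on the tags inherit the tagger's mistakes.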


## Named Entity Recognition

In [8]:
text = Text(u"In Großbritannien war Gandhi mit dem westlichen Lebensstil vertraut geworden")
print(text.entities)

[I-LOC(['Großbritannien']), I-PER(['Gandhi'])]
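
To work with the result, it is often convenient to group entities by their label. This is a minimal sketch on plain (tag, words) pairs copied from the output above; it assumes that each polyglot entity exposes its label as `.tag` and can be iterated as a list of words, so the pairs could be built from `text.entities`. The helper name `group_by_tag` is hypothetical.

```python
from collections import defaultdict

# (tag, words) pairs copied from the output above.
entities = [("I-LOC", ["Großbritannien"]), ("I-PER", ["Gandhi"])]


def group_by_tag(chunks):
    """Map each entity tag to the list of entity strings found for it."""
    grouped = defaultdict(list)
    for tag, words in chunks:
        # Multi-word entities are joined into a single string.
        grouped[tag].append(" ".join(words))
    return dict(grouped)


print(group_by_tag(entities))
```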


## Polarity

In [9]:
print("{:<16}{}".format("Word", "Polarity")+"\n"+"-"*30)
for w in zen.words[:6]:
    print("{:<16}{:>2}".format(w, w.polarity))

Word            Polarity
------------------------------
Beautiful        0
is               0
better           1
than             0
ugly            -1
.                0


## Embeddings

In [10]:
word = Word("Obama", language="en")
print("Neighbors (Synonyms) of {}".format(word)+"\n"+"-"*30)
for w in word.neighbors:
    print("{:<16}".format(w))
print("\n\nThe first 10 dimensions out of the {} dimensions\n".format(word.vector.shape[0]))
print(word.vector[:10])

Neighbors (Synonyms) of Obama
------------------------------
Bush
Reagan
Clinton
Ahmadinejad
Nixon
Karzai
McCain
Biden
Huckabee
Lula


The first 10 dimensions out of the 256 dimensions

[-2.5738235   1.5217597   0.51070285  1.0867867  -0.7438695  -1.1861616
  2.9278462  -0.25694436 -1.4095867  -2.396754  ]
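
Neighbors come from comparing word vectors. As an illustration only, the sketch below computes cosine similarity on toy 3-dimensional vectors; the vectors are invented, and the distance measure polyglot uses internally may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for illustration (the vectors above have 256 dimensions).
v1 = [1.0, 2.0, 0.0]
v2 = [2.0, 4.0, 0.0]  # points in the same direction as v1
v3 = [0.0, 0.0, 3.0]  # orthogonal to v1

print(cosine_similarity(v1, v2))  # close to 1: same direction
print(cosine_similarity(v1, v3))  # 0: nothing in common
```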


## Morphology

In [11]:
word = Text("Preprocessing is an essential step.").words[0]
print(word.morphemes)

['Pre', 'process', 'ing']


## Transliteration

In [12]:
from polyglot.transliteration import Transliterator
transliterator = Transliterator(source_lang="en", target_lang="ru")
print(transliterator.transliterate(u"preprocessing"))

препрокессинг


# Introduction to Natural Language Processing with Polyglot [3]

In [13]:
# Dependencies
!polyglot download embeddings2.en
!polyglot download ner2.en
!polyglot download sentiment2.en
!polyglot download pos2.en
!polyglot download morph2.en
!polyglot download transliteration2.ar
!polyglot download transliteration2.fr

[polyglot_data] Downloading package embeddings2.en to
[polyglot_data]     /Users/enlik/polyglot_data...
[polyglot_data]   Package embeddings2.en is already up-to-date!
[polyglot_data] Downloading package ner2.en to
[polyglot_data]     /Users/enlik/polyglot_data...
[polyglot_data]   Package ner2.en is already up-to-date!
[polyglot_data] Downloading package sentiment2.en to
[polyglot_data]     /Users/enlik/polyglot_data...
[polyglot_data]   Package sentiment2.en is already up-to-date!
[polyglot_data] Downloading package pos2.en to
[polyglot_data]     /Users/enlik/polyglot_data...
[polyglot_data]   Package pos2.en is already up-to-date!
[polyglot_data] Downloading package morph2.en to
[polyglot_data]     /Users/enlik/polyglot_data...
[polyglot_data]   Package morph2.en is already up-to-date!
[polyglot_data] Downloading package transliteration2.ar to
[polyglot_data]     /Users/enlik/polyglot_data...
[polyglot_data]   Package transliteration2.ar is already up-to-date!
[polyglot_data] Downlo

## Tokenization

In [14]:
# Load packages
import polyglot
from polyglot.text import Text, Word

# Word Tokens
docx = Text(u"He likes reading and painting")
docx.words

WordList(['He', 'likes', 'reading', 'and', 'painting'])

In [15]:
docx2 = Text(u"He exclaimed, 'what're you doing? Reading?'.")
docx2.words

WordList(['He', 'exclaimed', ',', "'", "what're", 'you', 'doing', '?', 'Reading', '?', "'", '.'])

In [16]:
# Sentence tokens
docx3 = Text(u"He likes reading and painting.He exclaimed, 'what're you doing? Reading?'.")
docx3.sentences

[Sentence("He likes reading and painting.He exclaimed, 'what're you doing?"),
 Sentence("Reading?'.")]
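
The first two sentences were not separated, and the input has no space after `painting.`. One possible workaround is to normalize the text before building the `Text` object. The regular expression below is a rough sketch, not part of polyglot: it would also add spaces inside abbreviations such as `U.S.A.`, and the helper name `add_space_after_period` is hypothetical.

```python
import re


def add_space_after_period(text):
    """Insert a space after a period that is directly followed by a capital letter."""
    return re.sub(r"\.(?=[A-Z])", ". ", text)


raw = "He likes reading and painting.He exclaimed, 'what're you doing? Reading?'."
print(add_space_after_period(raw))
```

The cleaned string can then be passed to `Text(...)` to check whether the sentence boundaries improve.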

## POS Tagging

In [17]:
docx

Text("He likes reading and painting")

In [18]:
docx.pos_tags

[('He', 'PRON'),
 ('likes', 'VERB'),
 ('reading', 'VERB'),
 ('and', 'CONJ'),
 ('painting', 'NOUN')]

## Language Detection

In [19]:
docx

Text("He likes reading and painting")

In [20]:
docx.language.name

'English'

In [21]:
docx.language.code

'en'

In [22]:
from polyglot.detect import Detector

en_text = "He is a student "
fr_text = "Il est un étudiant"
ru_text = "Он студент"

detect_en = Detector(en_text)
detect_fr = Detector(fr_text)
detect_ru = Detector(ru_text)

Detector is not able to detect the language reliably.
Detector is not able to detect the language reliably.


In [23]:
print(detect_en.language)

name: English     code: en       confidence:  94.0 read bytes:   704


In [24]:
print(detect_fr.language)

name: French      code: fr       confidence:  95.0 read bytes:   870


In [25]:
print(detect_ru.language)

name: Serbian     code: sr       confidence:  95.0 read bytes:   614
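
The Russian phrase is reported as Serbian, which shows how fragile detection is on very short input. A defensive wrapper might refuse to guess in that case. The sketch below assumes that `Detector` accepts `quiet=True` to suppress the warning and exposes a `reliable` flag; the `min_chars` threshold of 20 and the helper name `detect_or_none` are hypothetical choices, not polyglot defaults.

```python
def detect_or_none(text, min_chars=20):
    """Return a language code, or None when the guess should not be trusted."""
    # Very short inputs are rejected up front (threshold is an arbitrary choice).
    if len(text.strip()) < min_chars:
        return None
    try:
        from polyglot.detect import Detector
    except ImportError:
        # polyglot is not installed, so no detection is possible.
        return None
    # Assumed API: quiet=True silences the warning, .reliable flags confidence.
    detector = Detector(text, quiet=True)
    return detector.language.code if detector.reliable else None


print(detect_or_none("Он студент"))
```

With the default threshold, the short Russian phrase yields `None` rather than a wrong label.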


## Polarity

In [26]:
docx4 = Text(u"He hates reading and playing")

In [27]:
docx

Text("He likes reading and painting")

In [28]:
docx.polarity

1.0

In [29]:
docx4.polarity

-1.0
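
These sentence scores are consistent with averaging the non-zero word polarities, although that is an assumption about polyglot's internals rather than something the output proves. A minimal sketch of that rule, using the word scores from the Polarity table in the Quick Tutorial and a hypothetical helper `average_nonzero`:

```python
def average_nonzero(scores):
    """Mean of the non-zero scores, or 0.0 when every score is zero."""
    nonzero = [s for s in scores if s != 0]
    return sum(nonzero) / len(nonzero) if nonzero else 0.0


# Word polarities for "Beautiful is better than ugly ." from the earlier table.
zen_scores = [0, 0, 1, 0, -1, 0]
print(average_nonzero(zen_scores))

# Hypothetical word scores for "He likes reading and painting":
# only "likes" is assumed to carry a non-zero polarity.
print(average_nonzero([0, 1, 0, 0, 0]))
```

Under this rule one positive and one negative word cancel out, so a score of 0 does not necessarily mean the text is neutral.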

## Named Entities

In [30]:
docx5 = Text(u"John Jones was a FBI detector")
docx5.entities

[I-PER(['John', 'Jones']), I-ORG(['FBI'])]

## Morphology

In [31]:
docx6 = Text(u"preprocessing")
docx6.morphemes

Detector is not able to detect the language reliably.


WordList(['pre', 'process', 'ing'])

## Transliteration

In [41]:
# Load the transliterator
from polyglot.transliteration import Transliterator
translit = Transliterator(source_lang='en', target_lang='ar')
translit.transliterate(u"hello")

'هيلو'