# Exploring Foreign Languages

So far, we have been learning general ways to explore texts by manipulating strings and using regular expressions. Today, we will focus on what we can do when texts are in languages other than English. This is only an introduction to a few of the many modules that can be used for these tasks. The goal is to learn some tools, including Polyglot and translation packages, that can serve as jumping-off points for seeing what you may or may not need going forward.

### Lesson Outline:
- Q&A about what we've gone over so far
- Examples (with Sara's data)
- Practice!

## Installations
Uncomment and run the cell below!

In [None]:
#!pip install translation
#!pip install py-translate
#!pip install morfessor
#!pip install polyglot
#!pip install pycld2
#!brew install intltool icu4c gettext
#!brew link icu4c gettext --force
#!CFLAGS=-I/usr/local/opt/icu4c/include LDFLAGS=-L/usr/local/opt/icu4c/lib pip3 install pyicu

## Importing Text

In [None]:
import codecs
# reading in the text; errors='ignore' skips any bytes that are not valid UTF-8
with codecs.open('Skyggebilleder af en Reise til Harzen.txt', 'r', encoding='utf-8', errors='ignore') as f:
    read_text = f.read()
read_text

In [None]:
# pulling out a subsection of text for our examples
text_snippet = read_text[20000:23000]
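
The `errors='ignore'` argument used when opening the file silently drops any bytes that are not valid UTF-8. Here is a minimal sketch of what that means in practice. It uses a made-up byte string (not our course data) in which one character was encoded as Latin-1 instead of UTF-8:

```python
# made-up sample: "å" is encoded as Latin-1, so the byte 0xE5 is not valid UTF-8 here
raw = "Reise p".encode("utf-8") + "å".encode("latin-1") + " Harzen".encode("utf-8")

# errors='ignore' drops the undecodable byte without any warning
ignored = raw.decode("utf-8", errors="ignore")

# errors='replace' leaves a visible U+FFFD marker where the bad byte was
replaced = raw.decode("utf-8", errors="replace")

print(ignored)
print(replaced)
```

If characters seem to be missing from your text, switching to `errors='replace'` is one way to see where they were lost.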

## Translating Text

There are many different ways to translate text within Python, but one of the easiest is the package `translation`, which makes use of existing online translators. The module used to include a method for Google Translate, but the site no longer allows easy access. Bing is probably its most useful remaining method.

**Pros:**
* Easy to set up
* Runs quickly

**Cons:**
* Not always accurate
* Internet connection needed
* Language limitations

The documentation (or lack thereof): https://pypi.python.org/pypi/translation

In [None]:
import translation

In [None]:
translation.bing(text_snippet, dst = 'en')

Other alternatives for translating your text include:
* `py-translate`
    * Makes use of Google Translate
    * Often returns errors / gets blocked
    * Can be used from the command line
    * Documentation: https://pypi.python.org/pypi/py-translate


* API calls to Google Translate
    * Takes a little more set-up
    * Can be customized a little bit more
    * Can translate a LOT of text

In [None]:
# using py-translate
from translate import translator

# calling the translator function with the source language ('da'),
# the destination language ('en'), and the text to translate
translator('da', 'en', text_snippet[:200])
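
The call above only sends the first 200 characters. For longer passages, one option is to split the text into pieces and translate them one at a time. The sketch below is an assumption about how you might do that and is not part of either package: `chunk_text` and `max_chars` are hypothetical names, and the length limit is purely illustrative.

```python
def chunk_text(text, max_chars=200):
    """Split text into pieces of at most max_chars, breaking on whitespace."""
    chunks, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) > max_chars and current:
            # adding this word would exceed the limit, so start a new piece
            chunks.append(current)
            current = word
        else:
            # a single word longer than the limit is kept whole rather than cut
            current = candidate
    if current:
        chunks.append(current)
    return chunks

sample = "Skyggebilleder af en Reise til Harzen " * 10
pieces = chunk_text(sample, max_chars=80)
print(len(pieces), [len(p) for p in pieces])
```

Each piece could then be passed to the translator in a loop and the results joined back together. Note that breaking on whitespace can split a sentence across two pieces, which may hurt translation quality.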

## Polyglot

Polyglot is "a natural language pipeline that supports massive multilingual applications." In other words, it does a lot of stuff. It is a sort of one-stop shop for many different functions that you may want to apply to your text, and it supports many different languages. We are going to run through some of its functionalities.

Docs: http://polyglot.readthedocs.io/en/latest/

#### Language Detection

In [None]:
from polyglot.detect import Detector

# create a Detector object from read_text
# and assign it to the name `detected`
detected = Detector(read_text)

# the .language attribute holds the language that makes up most of
# the text, along with how confident the system is about it
print(detected.language)

In [None]:
# sometimes there will be multiple languages within
# the text, and you will want to see all of them
for language in detected.languages:
    print(language)
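
To get some intuition for what a language detector is doing, here is a toy sketch that guesses a language by counting common function words. This is not how Polyglot works internally; it is a swapped-in, much simpler technique. The word lists and the `guess_language` function are made up for illustration.

```python
# tiny, hand-picked stop-word lists (illustrative only)
STOP_WORDS = {
    "da": {"og", "jeg", "det", "at", "en", "den", "til", "er", "som", "ikke"},
    "en": {"and", "i", "it", "that", "a", "the", "to", "is", "as", "not"},
}

def guess_language(text):
    """Return the language code whose stop words appear most often, or None."""
    words = text.lower().split()
    scores = {lang: sum(w in stops for w in words)
              for lang, stops in STOP_WORDS.items()}
    best = max(scores, key=scores.get)
    # with no stop-word hits at all there is nothing to base a guess on
    return best if scores[best] > 0 else None

print(guess_language("jeg er ikke en fugl og det er sandt"))
print(guess_language("it is not a bird and that is the truth"))
print(guess_language("4"))
```

The last call shows why very short strings are a problem: there is simply not enough evidence to decide.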

In [None]:
# if you try to pass in a string that is too short
# for the system to get a good read on, it will throw
# an error, alerting you to this fact
Detector("4")

In [None]:
# we can override that with the optional argument 'quiet=True'
print(Detector("4", quiet=True))

In [None]:
# here are all of the languages supported for language detection
from polyglot.utils import pretty_list
print(pretty_list(Detector.supported_languages()))

#### Tokenization

Similar to what we saw with NLTK, Polyglot can break our text up into words and sentences. Polyglot has the advantage of spanning multiple languages, and thus is more likely to identify the proper breakpoints in languages other than English.

In [None]:
from polyglot.text import Text

# creating a Text object that analyzes our text_snippet
text = Text(text_snippet)

In [None]:
# Text also has a language instance variable
print(text.language)

# here, we are looking at text_snippet tokenized into words
text.words

In [None]:
# now we are looking at text_snippet broken down into sentences
text.sentences
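
For comparison, here is a rough sketch of tokenizing with nothing but regular expressions, like the ones we used earlier in the course. This is not Polyglot's tokenizer, and the sample sentence is made up. The naive sentence rule will break on abbreviations, which is one reason to prefer a dedicated tool.

```python
import re

sample = "Jeg så Harzen. Hvor smukt det var! Vi gik videre på veien."

# \w is Unicode-aware in Python 3, so letters such as å, æ and ø stay inside words
words = re.findall(r"\w+", sample)

# naive sentence split: break on whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", sample)

print(words)
print(sentences)
```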

#### Side Notes: Important Package Information

Polyglot does not ship with every package for every functionality in every language. Instead of forcing you to download a lot of files in the beginning, the creators decided that it would be better for language extensions to be downloaded on an 'as-necessary' basis. You will occasionally be told that you're lacking a package, and you will need to download it. You can do that either with the built-in downloader or from the command line.

In [None]:
# staying within python
from polyglot.downloader import downloader
downloader.download("embeddings2.en")

In [None]:
# alternate command line method
!polyglot download embeddings2.da pos2.da

Also, if you want to know what Polyglot lets you do with a particular language, the downloader provides a `supported_tasks` method.

In [None]:
# tasks available for english
downloader.supported_tasks(lang="en")

In [None]:
# tasks available for danish
downloader.supported_tasks(lang="da")

#### Part of Speech Tagging

Polyglot supports POS tagging for several languages.

In [None]:
# languages that polyglot supports for part of speech tagging
print(downloader.supported_languages_table("pos2"))

In [None]:
text.pos_tags
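
The tags come back as (word, tag) pairs. A quick way to summarize them is to tally the tags with `collections.Counter`. The pairs below are hand-written stand-ins rather than real Polyglot output, so the example runs without any downloads.

```python
from collections import Counter

# hand-written (word, tag) pairs standing in for the output of text.pos_tags
sample_tags = [
    ("Jeg", "PRON"), ("så", "VERB"), ("en", "DET"), ("stor", "ADJ"),
    ("skov", "NOUN"), ("og", "CONJ"), ("en", "DET"), ("sø", "NOUN"),
    (".", "PUNCT"),
]

# count how often each part of speech occurs
tag_counts = Counter(tag for word, tag in sample_tags)

for tag, count in tag_counts.most_common():
    print(tag, count)
```

With real data, you would swap `sample_tags` for `text.pos_tags`.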

#### Named Entity Recognition

Polyglot can tag named entities and group them into three main categories:
* Locations (Tag: I-LOC): cities, countries, regions, continents, neighborhoods, administrative divisions ...
* Organizations (Tag: I-ORG): sports teams, newspapers, banks, universities, schools, non-profits, companies ...
* Persons (Tag: I-PER): politicians, scientists, artists, athletes ...

In [None]:
# languages that polyglot supports for named entity recognition
print(downloader.supported_languages_table("ner2", 3))

In [None]:
#!polyglot download ner2.da
text.entities

#### Other Features of Polyglot
* Nearest Neighbors -- http://polyglot.readthedocs.io/en/latest/Embeddings.html
* Morpheme Generation -- http://polyglot.readthedocs.io/en/latest/MorphologicalAnalysis.html
* Sentiment Analysis -- http://polyglot.readthedocs.io/en/latest/Sentiment.html
* Transliteration -- http://polyglot.readthedocs.io/en/latest/Transliteration.html

## Code Summary:

#### Translation:
* `translation.bing(your_string, dst = 'en')`

#### Polyglot:
* `<Detector>.language`
* `<Detector>.languages`
* `<Text>.language`
* `<Text>.words`
* `<Text>.sentences`
* `<Text>.pos_tags`
* `<Text>.entities`

### Extra

In [None]:
# importing some more packages
from datascience import *
%matplotlib inline
import seaborn as sns

In [None]:
# analyzing our text with a Polyglot Text object
whole_text = Text(read_text)

In [None]:
# the language of our text
print(whole_text.language)

In [None]:
# getting the part of speech tags for our corpus
print(whole_text.pos_tags)
words_and_poss = list(whole_text.pos_tags)

In [None]:
# putting those word / part of speech pairs into a table
wrd = Table(['Word', 'Part of Speech']).with_rows(words_and_poss)
# grouping those by part of speech to get the most commonly occurring parts of speech
df = wrd.group('Part of Speech').sort('count', descending=True).to_df()
df

In [None]:
# plotting the counts for each part of speech using seaborn
sns.barplot(x='Part of Speech', y='count', data=df)

In [None]:
# getting the most popular word for each part of speech type
wrd_counts = wrd.group('Word').join('Word', wrd).sort('count', descending=True)
wrd_counts.group(2, lambda x: x.item(0)).show(16)

In [None]:
# that's not very informative, so let's pull out the stop words
# using a list from http://snowball.tartarus.org/algorithms/danish/stop.txt
danish_stop_words = """og,
i,
jeg,
det,
at,
en,
den,
til,
er,
som,
på,
de,
med,
han,
af,
for,
ikke,
der,
var,
mig,
sig,
men,
et,
har,
om,
vi,
min,
havde,
ham,
hun,
nu,
over,
da,
fra,
du,
ud,
sin,
dem,
os,
op,
man,
hans,
hvor,
eller,
hvad,
skal,
selv,
her,
alle,
vil,
blev,
kunne,
ind,
når,
være,
dog,
noget,
ville,
jo,
deres,
efter,
ned,
skulle,
denne,
end,
dette,
mit,
også,
under,
have,
dig,
anden,
hende,
mine,
alt,
meget,
sit,
sine,
vor,
mod,
disse,
hvis,
din,
nogle,
hos,
blive,
mange,
ad,
bliver,
hendes,
været,
thi,
jer,
sådan"""
splt = danish_stop_words.split(',\n')
print(splt)

In [None]:
# determining which rows to keep: compare against the list of stop words (splt),
# not the raw string, so that we match whole words rather than substrings
not_in_stop_words = [x not in splt for x in wrd_counts['Word']]
# most common words for each part of speech no longer including the stop words
wrd_counts.where(not_in_stop_words).group(2, lambda x: x.item(0)).show(16)
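
When filtering against a stop list, it matters what kind of object you test membership in. The sketch below uses a shortened, made-up stop list in the same comma-plus-newline layout to show the difference, and also lowercases words before comparing. The variable names are hypothetical.

```python
# a shortened stop list in the same comma-plus-newline layout
stop_string = "og,\ni,\njeg,\ndet,\nat"
stop_set = set(stop_string.split(",\n"))

words = ["Jeg", "og", "a", "Harzen", "det"]

# `in` on the raw string is a substring test: "a" is found inside "at"
in_string = "a" in stop_string
# `in` on the set compares whole words only
in_set = "a" in stop_set
print(in_string, in_set)

# lowercase before comparing so that "Jeg" matches "jeg"
kept = [w for w in words if w.lower() not in stop_set]
print(kept)
```

A set also makes each lookup fast, which helps when the text is long.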

In [None]:
# retrieving all of the named entities that Polyglot detected
ner = str(whole_text.entities).split('I-')[1:]
ner[:5]

In [None]:
# splitting up the type and the name
split_type = [x.split('([') for x in ner]
split_type[:5]

In [None]:
# making a table out of that
entities = Table(['Type', 'Name']).with_rows(split_type)
entities
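
Splitting the printed representation works, but it is fragile and leaves stray brackets in the names. Polyglot's documentation shows each entity exposing a `.tag` attribute and behaving like a list of words, so the same rows can probably be built directly. The sketch below uses a stand-in class (`FakeEntity`, hypothetical) so that it runs without Polyglot; `entities_to_rows` is also a made-up helper name.

```python
class FakeEntity(list):
    """Stand-in for a Polyglot entity chunk: a list of words plus a .tag."""
    def __init__(self, tag, words):
        super().__init__(words)
        self.tag = tag

def entities_to_rows(entities):
    """Turn entity chunks into [type, name] rows."""
    rows = []
    for entity in entities:
        entity_type = entity.tag.split("-")[-1]  # 'I-PER' -> 'PER'
        rows.append([entity_type, " ".join(entity)])
    return rows

sample = [
    FakeEntity("I-LOC", ["Harzen"]),
    FakeEntity("I-PER", ["Hans", "Christian", "Andersen"]),
]
print(entities_to_rows(sample))
```

With real data, you would pass `whole_text.entities` instead of `sample` and hand the resulting rows to `with_rows`.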

In [None]:
# how many of each type of entity there are
entities.group('Type')

In [None]:
# finding the most commonly occurring entities
entities.group('Name').sort('count', descending=True)

In [None]:
# possibly the most common names of people
entities.where('Type', 'PER').group('Name').sort('count', descending=True)