# Text Processing with TextBlob

In [None]:
%matplotlib inline

#### TextBlob is already installed, but we still need to download its data files (corpora)

In [None]:
!python -m textblob.download_corpora

### Now import some modules

In [None]:
import sqlite3 as sqlite
import pandas as pd
import os
from textblob import TextBlob, Word
from ipywidgets import widgets, interact, interactive, fixed

from IPython.display import clear_output, display, HTML
import time


### Read in Data

#### Data is stored in a [SQLite](https://docs.python.org/2/library/sqlite3.html) database. Use Pandas to read it in.

In [None]:
DATADIR = os.path.join("..","Resources")
con = sqlite.connect(os.path.join(DATADIR,"reports.sqlite"))
df = pd.read_sql("SELECT * from reports", con)
df
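
Before running `SELECT *`, it can help to check which tables and columns a SQLite file actually contains. The sketch below uses only the standard-library `sqlite3` module and an in-memory database as a stand-in for `reports.sqlite`; the table layout (a single text column named `0`) is an assumption made for illustration.

```python
import sqlite3

# In-memory stand-in for reports.sqlite; this table layout is hypothetical.
con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE reports ("0" TEXT)')
con.execute("INSERT INTO reports VALUES (?)", ("Example report text.",))

# sqlite_master lists every table, index, and view in the database.
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
print(tables)

# PRAGMA table_info returns one row per column; index 1 holds the column name.
columns = [row[1] for row in con.execute("PRAGMA table_info(reports)")]
print(columns)

con.close()
```

The same two queries can be run against the real connection to confirm the table name before handing the query to Pandas.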

### We can rename a column to make it more friendly

In [None]:
df = df.rename(columns = {'0':'report'})

In [None]:
df

In [None]:
with open(os.path.join(DATADIR,"atlantic_article.txt")) as f0:
    article = f0.read()

In [None]:
article

In [None]:
display(HTML(article.replace('\n', "<br>")))
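
Raw text passed to `HTML` is interpreted as markup, so characters such as `<` or `&` in the article could be swallowed or misrendered. A minimal sketch of a safer conversion, using the standard-library `html.escape`; the helper name `text_to_html` is our own, not part of IPython.

```python
import html

def text_to_html(text):
    """Escape HTML-special characters, then turn newlines into <br> tags."""
    return html.escape(text).replace("\n", "<br>")

sample = "Eggs < bacon & toast\nSecond line"
print(text_to_html(sample))
```

The returned string can be passed to `display(HTML(...))` in place of the bare `replace` call.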

### Create a TextBlob Object

When we create a ``TextBlob`` object, it does a lot of behind-the-scenes processing for us, such as

* Breaking the text into sentences
* Breaking the text into words
* Breaking the text into tokens

In [None]:
blob = TextBlob(article)

## TextBlob can tokenize (break into pieces) the Text

In [None]:
for s in blob.sentences[:5]:
    print (s)
    print("-"*42)

### Print out the words

In [None]:
for w in blob.words[:40]:
    print (w)

### Tokens include punctuation

In [None]:
for t in blob.tokens[:40]:
    print (t)
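
To make the word/token distinction concrete without TextBlob, here is a rough regular-expression sketch. It is not TextBlob's tokenizer (which, for example, handles contractions differently); it only illustrates that tokens keep punctuation while words drop it.

```python
import re

text = "Dr. Smith's eggs weren't fresh, were they?"

# Tokens: runs of word characters (allowing internal apostrophes)
# or any single non-space punctuation character.
tokens = re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

# Words: the tokens that start with a word character.
words = [t for t in tokens if re.match(r"\w", t)]

print(tokens)
print(words)
```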

## [Sentiment Analysis](http://en.wikipedia.org/wiki/Sentiment_analysis)

>Generally speaking, sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document. The attitude may be his or her judgment or evaluation (see appraisal theory), affective state (that is to say, the emotional state of the author when writing), or the intended emotional communication (that is to say, the emotional effect the author wishes to have on the reader). (Wikipedia)

### Sentiment Analysis

>A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level — whether the expressed opinion in a document, a sentence or an entity feature/aspect is positive, negative, or neutral. (Wikipedia)

>Another research direction is subjectivity/objectivity identification. This task is commonly defined as classifying a given text (usually a sentence) into one of two classes: objective or subjective. (Wikipedia)

### TextBlob does Some Sentiment Analysis
#### Sentiment can be computed on a document and sentence level

In [None]:
blob.sentiment

In [None]:
for s in blob.sentences:
    print (s)
    print (s.sentiment)
    print ('-'*42)
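
`sentiment` is a named tuple with a `polarity` in [-1.0, 1.0] and a `subjectivity` in [0.0, 1.0]. TextBlob's default analyzer is lexicon-based; the sketch below is a toy stand-in for that idea, not its actual implementation. The lexicon scores are invented for illustration, and the function ignores negation and intensifiers.

```python
from collections import namedtuple

Sentiment = namedtuple("Sentiment", ["polarity", "subjectivity"])

# Hypothetical (polarity, subjectivity) scores, invented for illustration only.
TOY_LEXICON = {
    "great": (0.8, 0.75),
    "good": (0.7, 0.6),
    "bad": (-0.7, 0.67),
    "terrible": (-1.0, 1.0),
}

def toy_sentiment(text):
    """Average the scores of the words found in the toy lexicon."""
    scored = [TOY_LEXICON[w] for w in text.lower().split() if w in TOY_LEXICON]
    if not scored:
        # No known words: treat the text as neutral and objective.
        return Sentiment(0.0, 0.0)
    polarity = sum(p for p, _ in scored) / len(scored)
    subjectivity = sum(s for _, s in scored) / len(scored)
    return Sentiment(polarity, subjectivity)

print(toy_sentiment("a great start but a terrible ending"))
```

Averaging is why a sentence with one strongly positive and one strongly negative word can come out close to neutral.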

### Pre-processing Words

* We often want to transform all the variations of words into a single form

#### TextBlob Can Singularize and Pluralize Words, But Not Well

In [None]:
s = blob.sentences[5]

In [None]:
for w in s.words:
    print (w,w.singularize(),w.pluralize())
    print("-"*42)
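
One reason the output above looks poor is that the loop applies noun inflection to every word, including verbs and determiners. The toy singularizer below (our own rules, much simpler than whatever TextBlob uses) shows how suffix rules behave sensibly on regular nouns and badly on everything else.

```python
def naive_singularize(word):
    """Toy rule set covering a few common English plural endings."""
    if word.endswith("ies") and len(word) > 3:
        return word[:-3] + "y"
    if word.endswith("es") and word[:-2].endswith(("s", "x", "z", "ch", "sh")):
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

for w in ["eggs", "boxes", "studies", "glass", "was", "series"]:
    print(w, "->", naive_singularize(w))
```

The last two inputs are mangled because the rules cannot tell a verb or an invariant noun from a regular plural.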

#### TextBlob Can Lemmatize Words

>Lemmatisation (or lemmatization) in linguistics is the process of grouping together the different inflected forms of a word so they can be analysed as a single item. (Wikipedia)

In [None]:
s = blob.sentences[15]
for w in s.words:
    if w != w.lemmatize():
        print ("Modified:\t",w,"->",w.lemmatize())
    else:
        print("Unchanged:\t",w)

### We can guide the lemmatization by specifying the part of speech

In [None]:
ww = Word("imagining")
print(ww.lemmatize('v'))


### TextBlob will do part-of-speech tagging

A list of part-of-speech tag abbreviations can be seen [here](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html).

In [None]:
type(blob.tags)

In [None]:
# blob.tags is a list of (word, tag) tuples; slicing it directly keeps repeated words
for item in blob.tags[:20]:
    print(item)
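
The tags above are Penn Treebank tags, while `lemmatize` takes a one-letter WordNet part of speech such as `'v'`. A common way to connect the two is a small mapping function. The helper below is a sketch of that idea (the name `penn_to_wordnet` and the sample pairs are ours), not a function exposed by TextBlob.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, or None if unmapped."""
    if tag.startswith("JJ"):
        return "a"  # adjective
    if tag.startswith("VB"):
        return "v"  # verb
    if tag.startswith("NN"):
        return "n"  # noun
    if tag.startswith("RB"):
        return "r"  # adverb
    return None

# Hypothetical (word, tag) pairs in the same shape as blob.tags.
tagged = [("The", "DT"), ("hens", "NNS"), ("were", "VBD"),
          ("laying", "VBG"), ("quickly", "RB")]
pos_hints = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(pos_hints)
```

With a real blob, each hint that is not `None` could be passed along as `Word(word).lemmatize(pos)`.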

### TextBlob Can (Try to) Provide Definitions

If we do not provide a part of speech, TextBlob will try to return definitions for all parts of speech. Each definition corresponds to a WordNet [synset](https://en.wikipedia.org/wiki/Synonym_ring).

>In metadata a synonym ring or synset, is a group of data elements that are considered semantically equivalent for the purposes of information retrieval. These data elements are frequently found in different metadata registries. Although a group of terms can be considered equivalent, metadata registries store the synonyms at a central location called the preferred data element.

In [None]:
for w in s.words:
    print (w)
    for d in w.definitions:
        print("*",d)
    print ('-'*42)

In [None]:
w1 = Word('apical')

In [None]:
print(w1.definitions)
w1.lemmatize('a')

In [None]:
blob.words.count("eggs",case_sensitive=True)

In [None]:
blob.words.count("eggs",case_sensitive=False)
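
The difference between the two calls is whether words are case-folded before comparison. A plain-Python sketch of the same idea with `collections.Counter`, using a made-up word list:

```python
from collections import Counter

# Made-up word list for illustration.
words = ["Eggs", "are", "eggs", "but", "EGGS", "are", "loud"]

exact = Counter(words)                      # case-sensitive counts
folded = Counter(w.lower() for w in words)  # case-insensitive counts

print(exact["eggs"])   # only the exact lowercase form
print(folded["eggs"])  # every capitalization of the word
```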

### TextBlob can try to translate
#### NOTE: Google has eliminated their free translation service

* Translation is done via Google Translate

[Language Codes Can be Found Here](https://cloud.google.com/translate/v2/using_rest#language-params)

In [None]:
codes = [l.split("\t") for l in """Afrikaans 	af
Albanian 	sq
Arabic 	ar
Armenian 	hy
Azerbaijani 	az
Basque 	eu
Belarusian 	be
Bengali 	bn
Bosnian 	bs
Bulgarian 	bg
Catalan 	ca
Cebuano 	ceb
Chichewa 	ny
Chinese Simplified 	zh-CN
Chinese Traditional 	zh-TW
Croatian 	hr
Czech 	cs
Danish 	da
Dutch 	nl
English 	en
Esperanto 	eo
Estonian 	et
Filipino 	tl
Finnish 	fi
French 	fr
Galician 	gl
Georgian 	ka
German 	de
Greek 	el
Gujarati 	gu
Haitian Creole 	ht
Hausa 	ha
Hebrew 	iw
Hindi 	hi
Hmong 	hmn
Hungarian 	hu
Icelandic 	is
Igbo 	ig
Indonesian 	id
Irish 	ga
Italian 	it
Japanese 	ja
Javanese 	jw
Kannada 	kn
Kazakh 	kk
Khmer 	km
Korean 	ko
Lao 	lo
Latin 	la
Latvian 	lv
Lithuanian 	lt
Macedonian 	mk
Malagasy 	mg
Malay 	ms
Malayalam 	ml
Maltese 	mt
Maori 	mi
Marathi 	mr
Mongolian 	mn
Myanmar (Burmese) 	my
Nepali 	ne
Norwegian 	no
Persian 	fa
Polish 	pl
Portuguese 	pt
Punjabi 	pa
Romanian 	ro
Russian 	ru
Serbian 	sr
Sesotho 	st
Sinhala 	si
Slovak 	sk
Slovenian 	sl
Somali 	so
Spanish 	es
Sundanese 	su
Swahili 	sw
Swedish 	sv
Tajik 	tg
Tamil 	ta
Telugu 	te
Thai 	th
Turkish 	tr
Ukrainian 	uk
Urdu 	ur
Uzbek 	uz
Vietnamese 	vi
Welsh 	cy
Yiddish 	yi
Yoruba 	yo
Zulu 	zu""".split("\n")]
codes
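
Because each line in the table has a space before the tab, the language names in `codes` keep a trailing space. A minimal sketch of building a clean name-to-code lookup, shown on three rows of the table (the variable names are ours):

```python
# Three rows from the table above, with the same "name<space><tab>code" layout.
raw = "Afrikaans \taf\nAlbanian \tsq\nChinese Simplified \tzh-CN"

name_to_code = {}
for line in raw.split("\n"):
    name, code = line.split("\t")
    # strip() removes the stray space left before the tab.
    name_to_code[name.strip()] = code.strip()

print(name_to_code)
```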

In [None]:
s = blob.sentences[0]
type(s)
type(s.raw)

In [None]:
@interact(s=[s.raw for s in blob.sentences],code={c[0]:c[1] for c in codes})
def translate_sentence(s,code):
    #clear_output()
    blob = TextBlob(s)
    try:
        display(HTML("<h3>Original</h3><p>%s</p><h2>To: %s</h2>"%(s,code)))
        display(HTML(blob.sentences[0].translate(to=code).raw))
    except Exception as error:
        display(HTML("<h3>Could not translate: %s</h3>"%error))

    


## TextBlob can create n-grams (think k-mers)

In [None]:
for s in blob.sentences:
    print( len(s.ngrams(2)),len(s.ngrams(3)),len(s.ngrams(4)))
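
An n-gram list is a sliding window over the word list, so a sentence of `m` words has `m - n + 1` n-grams (or none when `m < n`). That is why each count printed above drops by one as `n` grows. A plain-Python sketch of the window, independent of TextBlob:

```python
def ngrams(words, n):
    """Return every run of n consecutive words as a list of lists."""
    return [words[i:i + n] for i in range(len(words) - n + 1)]

words = "the hen laid four eggs".split()
for n in (2, 3, 4):
    print(n, len(ngrams(words, n)), ngrams(words, n)[0])
```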

In [None]:
seq=" ".join(list("""AACGAACGCTGGCGGCATGCCTAACACATGCAAGTCGAACGAGACCTTCGGGTCTAGTGGCGCACGGGTGCGTAACGCGTGGGAA"""\
    +"""TCTGCCCTTGGGTACGGAATAACAGTTAGAAATGACTGCTAATACC"""))
sblob = TextBlob(seq)

In [None]:
sblob.words

In [None]:
ngrams=sblob.ngrams(7)
ngrams.sort()

In [None]:
ngs = [''.join(n) for n in ngrams]
print (len(ngs))
for n in ngs:
    print (n)
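
Splitting the sequence into single-letter "words" is a trick to reuse `ngrams`. For comparison, k-mers can also be taken directly from the string with slicing and tallied with `Counter`. The sketch below uses only the first 15 bases of the sequence above and `k = 4` to keep the output short.

```python
from collections import Counter

seq = "AACGAACGCTGGCGG"  # first 15 bases of the sequence above
k = 4

# Slide a window of width k across the string, one base at a time.
kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
counts = Counter(kmers)

print(len(kmers))
for kmer, count in sorted(counts.items()):
    print(kmer, count)
```

Counting rather than just sorting makes repeated k-mers easy to spot.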