# Getting started

## Read the docs
* [TextBlobDE](https://pypi.python.org/pypi/textblob-de)
* [spaCy](https://spacy.io/api/doc)

## Installation
On Windows, open the conda shell and run the following commands to install the required libraries and extensions.

### Install spaCy
1. Basic installation: `conda install -c conda-forge spacy`
2. Download the German model: `python -m spacy download de`

If you are on a Windows machine, make sure to run the shell as administrator; otherwise, linking the model will fail.

### Install TextBlob
1. Basic installation: `conda install -c conda-forge textblob`
2. Download the German extension: `pip install -U textblob-de`
3. Fetch models/data/corpora: `python -m textblob.download_corpora`
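
Before moving on, it can help to confirm that the packages are actually importable. The following is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper name, and the check only tests that the modules can be found, not that the German model or the corpora were downloaded.

```python
import importlib.util


def missing_packages(names):
    """Return the subset of top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Module names as they are imported later in this notebook
required = ["spacy", "textblob", "textblob_de"]
missing = missing_packages(required)
if missing:
    print("Not installed: " + ", ".join(missing))
else:
    print("All required packages were found.")
```

An empty result means the imports in the next section should succeed.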



# Imports

In [1]:
# for handling German text containing umlauts
from __future__ import unicode_literals
from textblob_de import TextBlobDE as TextBlob
import spacy
# Load the German annotation models
spacy_nlp = spacy.load('de')

## Tokenization
Tokenization is the first step in most natural language processing tasks: it divides a sequence such as a sentence into lexical units (words). Several points make even this basic task challenging:
* Should we split only on whitespace? No. This would leave punctuation glued to the actual words.
* Should we additionally split on punctuation? Yes, but it depends.
 * How many tokens are in "St. Nick" or "Dr. Schmitz"? Two or three? And what about "e.g.", "i.e." or "z.B."?
 * Should we keep punctuation information? Yes. We need it to treat the examples above, and for any other kind of abbreviation or acronym.
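
The points above can be made concrete with a toy tokenizer. This is only a sketch under stated assumptions, not how TextBlob or spaCy work internally: `ABBREVIATIONS` is a hand-picked, hypothetical list, and `naive_tokenize` is an illustrative name.

```python
import re

# Hypothetical, hand-picked abbreviation list; real tokenizers ship far larger ones.
ABBREVIATIONS = {"z.B.", "e.g.", "i.e.", "Dr.", "St."}


def naive_tokenize(text):
    """Split on whitespace, then peel punctuation off every chunk
    that is not a known abbreviation."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            # Keep the trailing period attached to the abbreviation
            tokens.append(chunk)
            continue
        # Words (optionally hyphenated) or single punctuation marks
        tokens.extend(re.findall(r"\w+(?:-\w+)*|[^\w\s]", chunk))
    return tokens


sentence = "Ich suche einen guten Arzt, z.B. so jemanden wie Dr. Karl-Heinz Schmitz."
print(sentence.split()[4])       # whitespace only: 'Arzt,' keeps its comma
print(naive_tokenize(sentence))
```

Splitting on whitespace alone leaves `Arzt,` with its comma attached, while the abbreviation list keeps `z.B.` and `Dr.` as single tokens. Any abbreviation missing from the list would still be split apart, which is why this approach does not scale.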
 
Let's see what TextBlob and spaCy do. 

In [2]:
text = "Ich suche einen guten Arzt, z.B. so jemanden wie Dr. Karl-Heinz Schmitz."

In [3]:
textblob_doc = TextBlob(text)    
print(' '.join('\'{w}\''.format(w=t) for t in textblob_doc.tokens))    

'Ich' 'suche' 'einen' 'guten' 'Arzt' ',' 'z.B.' 'so' 'jemanden' 'wie' 'Dr.' 'Karl-Heinz' 'Schmitz' '.'


In [4]:
spacy_doc = spacy_nlp(text)    
print(' '.join('\'{w}\''.format(w=t) for t in spacy_doc)) 

'Ich' 'suche' 'einen' 'guten' 'Arzt' ',' 'z.B.' 'so' 'jemanden' 'wie' 'Dr.' 'Karl-Heinz' 'Schmitz' '.'


In [5]:
text = "Wie funktioniert das mit dem \"Hand-Out\"?"
spacy_doc = spacy_nlp(text)    
print(' '.join('\'{w}\''.format(w=t) for t in spacy_doc)) 

textblob_doc = TextBlob(text)    
print(' '.join('\'{w}\''.format(w=t) for t in textblob_doc.tokens))    

'Wie' 'funktioniert' 'das' 'mit' 'dem' '"' 'Hand-Out' '"' '?'
'Wie' 'funktioniert' 'das' 'mit' 'dem' '``' 'Hand-Out' '''' '?'


Great! Both tools provide decent tokenization capabilities. Let's check sentence splitting next.
Sentence splitting (or paragraph splitting) divides a longer text document into larger units. Ideally, it should distinguish headlines from the sentences that follow, not split on acronym punctuation, and so on.
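
To illustrate why acronym punctuation matters, here is a minimal rule-based splitter. It is a sketch, not the algorithm either library uses: the `ACRONYM` pattern is a heuristic assumption (short letter groups each followed by a period), and `naive_sentences` is a hypothetical name.

```python
import re

# Heuristic assumption: one- or two-letter groups, each followed by a period,
# e.g. "A.B.C." or "z.B."
ACRONYM = re.compile(r"^(?:\w{1,2}\.)+$")


def naive_sentences(text):
    """Close a sentence after '.', '?' or '!' unless the chunk looks like an acronym."""
    sentences, current = [], []
    for chunk in text.split():
        current.append(chunk)
        if chunk[-1] in ".?!" and not ACRONYM.match(chunk):
            sentences.append(" ".join(current))
            current = []
    if current:
        # Trailing words without closing punctuation
        sentences.append(" ".join(current))
    return sentences


text = 'Wie funktioniert das mit dem "Hand-Out"? Können wir das mit de A.B.C. Methode lösen?'
for sentence in naive_sentences(text):
    print(repr(sentence))
```

The heuristic has an obvious blind spot: an acronym that really does end a sentence would not trigger a split.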

In [34]:
text = "Wie funktioniert das mit dem \"Hand-Out\"? Können wir das mit de A.B.C. Methode lösen?"
spacy_doc = spacy_nlp(text)    
print(' '.join('\'{w}\''.format(w=t) for t in spacy_doc.sents)) 

textblob_doc = TextBlob(text)    
print(' '.join('\'{w}\''.format(w=t) for t in textblob_doc.sentences))    

'Wie funktioniert das mit dem "Hand-Out"?' 'Können wir das mit de A.B.C. Methode lösen?'
'Wie funktioniert das mit dem "Hand-Out"?' 'Können wir das mit de A.B.C.' 'Methode lösen?'


spaCy wins, as it correctly detects the acronym and does not start a new sentence.

## Part-of-speech Tagging
Part-of-speech tagging assigns each word in a sequence (in most cases a sentence) a category such as verb, noun, pronoun, adjective, and so on.

TextBlob uses the [Penn Treebank tag set](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html), while spaCy uses the [TIGER Treebank tag set](http://www.ims.uni-stuttgart.de/forschung/ressourcen/korpora/TIGERCorpus/annotation/index.html) and also cleverly maps its tags to a [simplified, universal tag set](http://www.petrovi.de/data/lrec.pdf).

Why clever? First, such a reduced tag set is easier to grasp for somebody without a background in linguistics. Second, it simplifies the manual creation of rules based on POS-tag patterns, as often used in sentiment analysis, for instance.
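
As a rough illustration of both points, the sketch below maps fine-grained tags to coarse ones and then matches a rule against the coarse tags. The tag pairs in `FINE_TO_COARSE` are copied from the spaCy outputs shown in this notebook; the table is partial and is not spaCy's internal mapping. `coarse_tags` and `find_pattern` are hypothetical helper names.

```python
# Fine-grained -> coarse tag pairs as they appear in this notebook's spaCy outputs.
# Partial and illustrative only.
FINE_TO_COARSE = {
    "PPER": "PRON", "PIS": "PRON", "PPOSAT": "DET",
    "VAFIN": "AUX", "VMFIN": "VERB", "VVINF": "VERB",
    "NN": "NOUN", "ADJD": "ADJ", "ADV": "ADV", "APPR": "ADP", "$.": "PUNCT",
}


def coarse_tags(fine_tags):
    """Map fine tags to coarse tags; unknown tags fall back to 'X' (other)."""
    return [FINE_TO_COARSE.get(tag, "X") for tag in fine_tags]


def find_pattern(tagged, pattern):
    """Return all word sequences whose coarse tags equal `pattern`."""
    words = [word for word, _ in tagged]
    tags = coarse_tags([tag for _, tag in tagged])
    size = len(pattern)
    return [words[i:i + size]
            for i in range(len(tags) - size + 1)
            if tags[i:i + size] == list(pattern)]


tagged = [("Könnte", "VMFIN"), ("man", "PIS"), ("etwas", "ADV"), ("gegen", "APPR"),
          ("meine", "PPOSAT"), ("Kopfschmerzen", "NN"), ("tun", "VVINF"), ("?", "$.")]
print(find_pattern(tagged, ("DET", "NOUN")))
```

A rule written against two coarse tags (`DET`, `NOUN`) covers what would otherwise need one rule per fine-grained determiner and noun tag.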


Our example sentence will be "Ich habe Kopfweh", which translates to "I have a headache" in English.
Let's first check how robust the tools are when the text ignores capitalization rules.

In [9]:
textblob_doc = TextBlob('ich habe kopfweh')
for word, tag in textblob_doc.tags:
    print("word: %s \t tag: %s" % (word,tag))

word: ich 	 tag: PRP
word: habe 	 tag: VB
word: kopfweh 	 tag: NN


All good. Let's see how spaCy performs.

In [10]:
spacy_doc = spacy_nlp(u'ich habe kopfweh')
for token in spacy_doc:
    print('word: %s \t coarse-tag: %s \t fine-tag: %s' % (token.text, token.pos_, token.tag_))

word: ich 	 coarse-tag: PRON 	 fine-tag: PPER
word: habe 	 coarse-tag: AUX 	 fine-tag: VAFIN
word: kopfweh 	 coarse-tag: ADJ 	 fine-tag: ADJD


While _ich_ (English: I) and _habe_ (English: have) are tagged correctly, _kopfweh_ (English: headache) was tagged as an adjective (`ADJD`) instead of a noun.

Now let's use the proper case for every word and see what happens. 

In [11]:
textblob_doc = TextBlob('Ich habe Kopfweh')
for word, tag in textblob_doc.tags:
    print("word: %s \t tag: %s" % (word,tag))

word: Ich 	 tag: PRP
word: habe 	 tag: VB
word: Kopfweh 	 tag: NN


All good! Not surprising, since the first variant was also correct.

In [12]:
spacy_doc = spacy_nlp(u'Ich habe Kopfweh')
for token in spacy_doc:
    print("word: %s \t coarse-tag: %s \t fine-tag: %s" % (token.text, token.pos_, token.tag_))

word: Ich 	 coarse-tag: PRON 	 fine-tag: PPER
word: habe 	 coarse-tag: AUX 	 fine-tag: VAFIN
word: Kopfweh 	 coarse-tag: NOUN 	 fine-tag: NN


All good now!

## Noun-phrase chunking

In [13]:
text = 'Könnte man etwas gegen meine Kopfschmerzen tun?'

In [14]:
spacy_doc = spacy_nlp(text)
for token in spacy_doc:
    print("word: %s \t coarse-tag: %s \t fine-tag: %s" % (token.text, token.pos_, token.tag_))
        
# show noun chunks
for chunk in spacy_doc.noun_chunks:
    print("chunk: %s \t" % chunk.text)    

word: Könnte 	 coarse-tag: VERB 	 fine-tag: VMFIN
word: man 	 coarse-tag: PRON 	 fine-tag: PIS
word: etwas 	 coarse-tag: ADV 	 fine-tag: ADV
word: gegen 	 coarse-tag: ADP 	 fine-tag: APPR
word: meine 	 coarse-tag: DET 	 fine-tag: PPOSAT
word: Kopfschmerzen 	 coarse-tag: NOUN 	 fine-tag: NN
word: tun 	 coarse-tag: VERB 	 fine-tag: VVINF
word: ? 	 coarse-tag: PUNCT 	 fine-tag: $.
chunk: man 	
chunk: meine Kopfschmerzen 	


In [15]:
from textblob_de import PatternParser

In [16]:
textblob_doc = TextBlob(text)
for word, tag in textblob_doc.tags:
    print("word: %s \t tag: %s" % (word,tag))

word: Könnte 	 tag: VB
word: man 	 tag: DT
word: etwas 	 tag: DT
word: gegen 	 tag: IN
word: meine 	 tag: VB
word: Kopfschmerzen 	 tag: NN
word: tun 	 tag: VB


In [17]:
for chunk in textblob_doc.noun_phrases:
    print(chunk)

By default, only noun phrases that consist of two or more meaningful parts are displayed.
The word _Kopfschmerzen_ is correctly tagged as a noun (`NN`) and, as the parser output below shows, chunked as `NP`, which stands for noun phrase. But _meine_ is incorrectly tagged as a verb instead of a determiner, which is probably why no noun phrase is found.
Looking at the lemma TextBlob assigns to this word, we will see below that it mistakes _meine_ for a form of the verb _meinen_ (to mean) instead of the possessive _mein_ (my).
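
The suspected cause can be reproduced with a toy chunker. This is a sketch to test the hypothesis, not TextBlob's actual chunking algorithm: `noun_phrases` is a hypothetical function, the rule (optional determiner, adjectives, then nouns) is an assumption, and `PRP$` is used here as the Penn Treebank tag for possessive pronouns.

```python
def noun_phrases(tagged, min_words=2):
    """Greedy chunker: optional determiner, any adjectives, one or more nouns.
    Chunks shorter than `min_words` are dropped."""
    phrases, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] in ("DT", "PRP$"):
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1].startswith("NN"):
            k += 1
        if k > j and k - i >= min_words:
            phrases.append(" ".join(word for word, _ in tagged[i:k]))
            i = k
        else:
            i += 1
    return phrases


# Tags exactly as TextBlob returned them above ('meine' tagged as a verb)
textblob_tags = [("Könnte", "VB"), ("man", "DT"), ("etwas", "DT"), ("gegen", "IN"),
                 ("meine", "VB"), ("Kopfschmerzen", "NN"), ("tun", "VB")]
# Same sentence with 'meine' corrected by hand to a possessive
corrected_tags = [(w, "PRP$" if w == "meine" else t) for w, t in textblob_tags]

print(noun_phrases(textblob_tags))
print(noun_phrases(corrected_tags))
```

With the original tags the chunker finds nothing, because _Kopfschmerzen_ alone is a single-word chunk; with the corrected tag it yields `meine Kopfschmerzen`. This is consistent with the explanation above, although it does not prove that TextBlob fails for the same reason.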

## Lemmatization

In [18]:
from textblob_de import PatternParser

In [19]:
blob = TextBlob(text, parser=PatternParser(pprint=True, lemmata=True))
blob.parse()
    
for t in blob.words.lemmatize():
    print(t)

          WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA           
                                                                     
        Könnte   VB     VP      -      -      -      können          
           man   DT     -       -      -      -      man             
         etwas   DT     -       -      -      -      etwas           
         gegen   IN     PP      -      -      -      gegen           
         meine   VB     VP      -      -      -      meinen          
 Kopfschmerzen   NN     NP      -      -      -      kopfschmerzen   
           tun   VB     VP      -      -      -      tun             
             ?   .      -       -      -      -      ?               
können
man
etwas
gegen
meinen
Kopfschmerzen
tun


In [20]:
# show lemmas produced by spaCy
print(' '.join('{word}/{lemma}'.format(word=t.orth_, lemma=t.lemma_) for t in spacy_doc))

Könnte/Könnte man/man etwas/etwas gegen/gegen meine/meinen Kopfschmerzen/Kopfschmerzen tun/tun ?/?


## Sentiment Analysis

In [21]:
blob = TextBlob("TextBlob ist richtig super")
blob.sentiment

Sentiment(polarity=1.0, subjectivity=0.0)

In [22]:
blob = TextBlob("TextBlob ist nicht super")
blob.sentiment

Sentiment(polarity=0.0, subjectivity=0.0)

In [23]:
blob = TextBlob("TextBlob ist super schlecht")
blob.sentiment

Sentiment(polarity=-1.0, subjectivity=0.0)

## Pluralization

In [24]:
from textblob import Word
w = Word("university")
print(w.pluralize())


textblob_doc = TextBlob('ich habe kopfweh')
for word, tag in textblob_doc.tags:
    print("word: %s \t plural: %s" % (word, Word(word).pluralize()))

universities
word: ich 	 plural: iches
word: habe 	 plural: habes
word: kopfweh 	 plural: kopfwehs


## Word Vectors

In [27]:
# word vectors are attached to tokens and can be accessed via t.vector
word1 = spacy_doc[2:3]
word2 = spacy_doc[4:5]
word1.similarity(word2)

0.21771714593758457
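
By default, spaCy's `similarity` is the cosine similarity of the two vectors. The sketch below shows that computation in plain Python. The vectors are made-up toy values, not real word embeddings, and `cosine_similarity` is a hypothetical helper written for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors, invented for this example
vec_a = [1.0, 2.0, 0.0]
vec_b = [2.0, 4.0, 0.0]   # same direction as vec_a
vec_c = [0.0, 0.0, 3.0]   # orthogonal to vec_a

print(cosine_similarity(vec_a, vec_b))
print(cosine_similarity(vec_a, vec_c))
```

Vectors pointing in the same direction score close to 1, orthogonal vectors score 0, so the value of about 0.22 above indicates only a weak similarity between the two words.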

## Dependency Parsing

In [42]:
# show dependency arcs
print('\n'.join('{child:<8} <{label:-^7} {head}'.format(child=t.orth_, label=t.dep_, head=t.head.orth_) for t in spacy_doc))

Wie      <--mo--- funktioniert
funktioniert <-ROOT-- funktioniert
das      <--sb--- funktioniert
mit      <--mo--- funktioniert
dem      <--nk--- Hand-Out
"        <--pnc-- Hand-Out
Hand-Out <--nk--- mit
"        <-punct- Hand-Out
?        <-punct- funktioniert
Können   <-ROOT-- Können
wir      <--sb--- Können
das      <--oa--- lösen
mit      <--mo--- lösen
de       <--pnc-- A.B.C.
A.B.C.   <--nk--- mit
Methode  <--nk--- mit
lösen    <--oc--- Können
?        <-punct- Können


## Named Entities

In [44]:
# show named entities
for ent in spacy_doc.ents:
    print(ent.text)
      
    
print([w[0] for w in textblob_doc.tags if w[1] == u'NNP'])    

Hand-Out
[]


# Links
* **English only?** [TextAnalysis Api](http://textanalysisonline.com/) provides customized text analysis and text mining services such as tokenization, POS tagging, stemming, lemmatization, chunking, parsing, sentence segmentation, grammar checking, sentiment analysis, text summarization, text classification, and other text analysis tasks. It stands on the shoulders of NLP tools such as NLTK, TextBlob, Pattern, and MBSP.