# Lab 2: NLP Pipelines

In this lab, we look at NLP pipelines and their role in text processing.

### Pipeline
A pipeline is a set of processors combined into a chain. The user feeds input into one end of the pipeline and gets the desired output from the other end. If you are familiar with the Linux command line, you may have used the pipe operator (`|`), where the output of one command becomes the input to the next. For example:

$ less text.txt | grep winter

Here `less` first reads the text file, so its output is the whole file content. We then pipe this output into `grep`, which searches it for lines containing the word winter. We can pipe that output into yet another command, and so on. NLP pipelines work the same way: we start with one processing task (generally tokenization or sentence segmentation) and then use its results for the next task, such as part-of-speech tagging.
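
The same chaining idea can be sketched in plain Python. This is only an illustration of the concept, not how any particular NLP library is implemented; the names `tokenize`, `lowercase`, `keep_matching`, and `run_pipeline` are hypothetical and invented for this example.

```python
from functools import reduce

def tokenize(text):
    """Split raw text into whitespace-separated tokens."""
    return text.split()

def lowercase(tokens):
    """Lowercase every token."""
    return [token.lower() for token in tokens]

def keep_matching(word):
    """Build a processor that keeps only tokens equal to `word`."""
    def processor(tokens):
        return [token for token in tokens if token == word]
    return processor

def run_pipeline(data, processors):
    """Feed the output of each processor into the next one."""
    return reduce(lambda value, processor: processor(value), processors, data)

result = run_pipeline(
    "Winter is coming and winter is cold",
    [tokenize, lowercase, keep_matching("winter")],
)
print(result)
```

Each processor only needs to understand the output of the one before it, which is why the order of the processors matters.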

Many NLP pipelines are available. A small subset of usable pipelines:
* [spaCy](https://spacy.io/) - a single implementation for each NLP component
* [Stanza](https://stanfordnlp.github.io/stanza/) - highly accurate neural network components, easy to train your own models, supports 66 languages
* [NLTK](https://www.nltk.org/) - multiple implementations for each NLP component, lets you build your own pipeline
* [UDPipe](https://ufal.mff.cuni.cz/udpipe/1) - trainable, language-agnostic pipeline
* [Forte](https://github.com/asyml/forte) - toolkit for building NLP pipelines that decomposes a problem into data, models, and tasks. Better suited to building integrated systems (searching documents, analyzing them, extracting information, etc.)
* [TextBlob](https://textblob.readthedocs.io/en/dev/) - a simplified extension of NLTK, good for small projects where state-of-the-art results are not needed
* [CogCompNLP](https://github.com/CogComp/cogcomp-nlp) (Java tool) - developed by the University of Illinois, can process text locally and remotely, offers many components

In this lab, we will use the [Stanza](https://stanfordnlp.github.io/stanza/) pipeline.

In [None]:
!pip install stanza

In [None]:
import stanza
stanza.download('en') # download the appropriate models 
import pandas as pd

Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/master/resources_1.2.0.json: 128kB [00:00, 30.2MB/s]                    
2021-02-16 08:49:15 INFO: Downloading default packages for language: en (English)...
Downloading http://nlp.stanford.edu/software/stanza/1.2.0/en/default.zip: 100%|██████████| 411M/411M [03:34<00:00, 1.91MB/s]
2021-02-16 08:52:58 INFO: Finished downloading models and saved to /root/stanza_resources.



<img src="https://stanfordnlp.github.io/stanza/assets/images/pipeline.png" >




The Stanza pipeline contains many processors that depend on each other:
* tokenize processor
* multi-word token (MWT) processor
* POS processor
* lemma processor
* depparse processor
* NER (named entity recognition) processor
* sentiment processor

Each of these processors has specific requirements. For example, the depparse processor needs tokenize, MWT, POS, and lemma annotations.
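
To make the idea of processor requirements concrete, here is a minimal sketch that checks a processor list for missing prerequisites. It is not Stanza's own validation code. Only the `depparse` entry in `REQUIRES` comes from the text above; the other entries, the helper `missing_requirements`, and its `optional` parameter are assumptions made for this illustration.

```python
# Prerequisites per processor. Only the depparse entry is taken from the
# text above; every other entry is an assumption for this illustration.
REQUIRES = {
    "tokenize": set(),
    "mwt": {"tokenize"},
    "pos": {"tokenize", "mwt"},
    "lemma": {"tokenize", "mwt", "pos"},
    "depparse": {"tokenize", "mwt", "pos", "lemma"},
    "ner": {"tokenize", "mwt"},
    "sentiment": {"tokenize", "mwt"},
}

def missing_requirements(processors, optional=frozenset()):
    """Return {processor: missing prerequisites} for a comma-separated list.

    Names in `optional` are never reported as missing, which covers
    languages that have no model for a given processor.
    """
    requested = {name.strip() for name in processors.split(",")}
    problems = {}
    for name in sorted(requested):
        missing = REQUIRES[name] - requested - set(optional)
        if missing:
            problems[name] = sorted(missing)
    return problems

print(missing_requirements("tokenize,depparse"))
print(missing_requirements("tokenize,pos,lemma,depparse", optional={"mwt"}))
```

The first call reports the annotations that depparse would be missing; the second call returns an empty dictionary because every prerequisite is either requested or treated as optional.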

We can define the pipeline as follows: 

In [None]:
# mwt is omitted: it is not available from the official model list
nlp = stanza.Pipeline(lang='en', processors='tokenize,pos,lemma,depparse,ner,sentiment')


In [None]:
raw_text = "The brown fox is quick and he is jumping over the lazy dog"
print(raw_text)

Let's analyse this text with the Stanza pipeline:

In [None]:
doc = ...
print(doc)

After analysis, the pipeline returns a Document object.
A Document has the following properties: text, sentences, entities, num_tokens, num_words. Each Sentence object inside the Document represents one sentence and contains a list of Tokens; each Token in turn contains one or more Words.



In [None]:
print(f"Text: {...}")
print(f"Dependencies: {...}")
print(f"Tokens: {...}")
print(f"Words: {...}")
print(f"XPOS: {...}")
print(f"Entities: {...}")
print(f"Sentiment: {...}") # Available for English, Chinese, German

Finally, a Word object holds all the analysis results, which can be accessed through the following attributes: id, text, lemma, xpos, upos, feats (morphological features), head, deprel (dependency relation between this word and its head), and misc.
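
The nesting of sentences and words can be walked with two loops. The sketch below is one possible way to collect word attributes into rows; the helper `word_rows` is a hypothetical name, and `toy_doc` is a hand-made stand-in with made-up values so that the example runs without downloading any models. It assumes only that the document exposes `sentences` and that each sentence exposes `words`.

```python
from types import SimpleNamespace

def word_rows(doc, attributes):
    """Collect the given attributes of every word, one row per word."""
    return [
        tuple(getattr(word, name) for name in attributes)
        for sentence in doc.sentences
        for word in sentence.words
    ]

# Hand-made stand-in that mimics the Document -> Sentence -> Word nesting.
# The annotation values are made up and are not pipeline output.
toy_doc = SimpleNamespace(sentences=[
    SimpleNamespace(words=[
        SimpleNamespace(id=1, text="Dogs", lemma="dog", upos="NOUN"),
        SimpleNamespace(id=2, text="bark", lemma="bark", upos="VERB"),
    ])
])

rows = word_rows(toy_doc, ["id", "text", "lemma", "upos"])
print(rows)
```

Rows in this shape can be passed directly to `pd.DataFrame` together with a list of column names.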

### Part of Speech 
Parts of speech (POS) are specific lexical categories to which words are assigned based on their syntactic context and role. The main POS are nouns, verbs, adjectives, and adverbs. The process of classifying words and labeling them with POS tags is called part-of-speech tagging (POS tagging).

Stanza outputs universal POS tags (UPOS) and language-specific POS tags (XPOS).



In [None]:
pos_tagged = [...]
pd.DataFrame(pos_tagged, columns=['Word', 'UPOS', 'XPOS']).T

### Morphological Tagging
By definition, a morpheme is the smallest unit of language that has a distinctive meaning. This includes words, prefixes, suffixes, and so on, each of which carries its own meaning. Morphology is the study of the structure and meaning of these units. Specific rules govern the way morphemes can combine. For example, the word *unbreakable* is composed of three morphemes:
* *un* - a bound morpheme (prefix) signifying *not*
* *break* - the root morpheme, which is free because it can stand alone as a word
* *able* - a bound morpheme (suffix) signifying *can be done*

In the UD corpora, morphological attributes are annotated as feature-value pairs for each token, for example `Number=Sing`.
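
Feature-value pairs are easier to work with as a dictionary. The sketch below assumes the UD convention of `Name=Value` pairs joined by `|`, with `_` or an empty value meaning that no features are present; `parse_feats` is a hypothetical helper name.

```python
def parse_feats(feats):
    """Turn a UD feature string such as 'Number=Sing|Person=3' into a dict."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

print(parse_feats("Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin"))
print(parse_feats(None))
```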

In [None]:
morph_tagged = [...]
pd.DataFrame(morph_tagged, columns=['Word', 'Feats'])

### Dependency Parsing

Syntax usually involves the study of sentences, phrases, words, and their structures. This includes researching how words are combined grammatically to form phrases and sentences. The order of words in a phrase or sentence matters, since a different order can change the meaning entirely.

In dependency-based parsing, we use dependency grammars to analyze and infer both the structure of a sentence and the dependency relationships between its tokens.

The basic principle behind a dependency grammar is that in any sentence, all words except one depend on some other word in the sentence. The word that depends on no other word is called the root of the sentence. In most cases the verb is taken as the root. All the other words are directly or indirectly linked to the root through links, which are the dependencies.

Stanza outputs `head` and `deprel` for each word. `head` is the ID of the word's syntactic head, or zero if the word is the root; `deprel` is the dependency relation between the word and its head.
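
The head IDs become easier to read once they are resolved back to words. The following sketch turns `(id, text, head, deprel)` rows into readable triples; `dependency_triples` is a hypothetical helper, the toy sentence is annotated by hand rather than produced by the pipeline, and the sketch assumes word IDs are 1-based within a sentence with 0 reserved for the root.

```python
def dependency_triples(words):
    """Map (id, text, head, deprel) rows to (dependent, relation, head) triples."""
    text_by_id = {word_id: text for word_id, text, _, _ in words}
    triples = []
    for _, text, head, deprel in words:
        # A head of 0 means the word has no head: it is the root.
        head_text = "ROOT" if head == 0 else text_by_id[head]
        triples.append((text, deprel, head_text))
    return triples

# Hand-annotated toy sentence (not pipeline output): "The dog barks"
toy_sentence = [
    (1, "The", 2, "det"),
    (2, "dog", 3, "nsubj"),
    (3, "barks", 0, "root"),
]

for dependent, relation, head in dependency_triples(toy_sentence):
    print(f"{dependent} --{relation}--> {head}")
```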



In [None]:
synt_tagged = [...]
pd.DataFrame(synt_tagged, columns=['Word', 'Head', 'Deprel']).T

We can also visualize a dependency syntax tree using spaCy. Note that displaCy draws spaCy's own parse of the sentence, not the Stanza analysis above:

In [None]:
import spacy
from spacy import displacy
!python -m spacy download en_core_web_sm


In [None]:
nlp_spacy = spacy.load("en_core_web_sm")
displacy.render(nlp_spacy(raw_text), jupyter=True, options={'distance': 100, 'arrow_stroke': 1.5, 'arrow_width': 8})

### Named Entity Recognition

A classical problem in information extraction is to recognize and extract mentions of
named entities in text. In news documents, the core entity types are people, locations, and
organizations; more recently, the task has been extended to include amounts of money,
percentages, dates, and times.

BIO notation is usually used for named entity recognition. Each token at the beginning of a name span is labeled with a B- prefix; each token inside a name span is labeled with an I- prefix. These prefixes are followed by a tag for the entity type, e.g. B-LOC for the beginning of a location. Tokens that are not part of a name span are labeled O.

Stanza uses the BIOES representation, where E marks the last token of a multi-token span and S marks a span consisting of a single token.
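
The relationship between the two schemes can be sketched as a conversion from BIO to BIOES. This is an illustrative helper (`bio_to_bioes` is a hypothetical name), not Stanza code, and the example tags are labelled by hand rather than taken from pipeline output.

```python
def bio_to_bioes(tags):
    """Convert a BIO tag sequence to BIOES."""
    bioes = []
    for index, tag in enumerate(tags):
        if tag == "O":
            bioes.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[index + 1] if index + 1 < len(tags) else "O"
        # The span continues only if the next tag is I- with the same label.
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            bioes.append(f"B-{label}" if continues else f"S-{label}")
        else:  # prefix == "I"
            bioes.append(f"I-{label}" if continues else f"E-{label}")
    return bioes

# Hand-labelled example; the labels are illustrative, not pipeline output.
tokens = ["The", "U.S.", "Army", "captured", "Atlanta", "."]
bio = ["O", "B-ORG", "I-ORG", "O", "B-GPE", "O"]
print(list(zip(tokens, bio_to_bioes(bio))))
```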

In [None]:
ner_text = "The U.S. Army captured Atlanta on May 14, 1864."
ner_doc = ...
ner_tagged = [...]
pd.DataFrame(ner_tagged, columns=['Word', 'NER']).T

Now we can put the whole output in [CoNLL-U format](https://universaldependencies.org/format.html):

In [None]:
columns = ['ID', 'FORM', 'LEMMA', 'UPOS', 'XPOS', 'FEATS', 'HEAD', 'DEPREL', 'DEPS', 'MISC']
tagged = [...]
pd.DataFrame(tagged, columns=columns)
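
For reference, a CoNLL-U file stores one word per line as ten tab-separated fields, with `_` standing in for an empty field. A minimal sketch of writing such a line, assuming each word is available as a plain dictionary; `to_conllu_line` and `CONLLU_FIELDS` are hypothetical names and the example values are made up:

```python
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

def to_conllu_line(word):
    """Format one word (a dict) as a tab-separated CoNLL-U line."""
    values = []
    for field in CONLLU_FIELDS:
        value = word.get(field)
        # Missing or empty fields are written as an underscore.
        values.append("_" if value is None or value == "" else str(value))
    return "\t".join(values)

# Hand-made example word; the annotation values are illustrative.
word = {"id": 1, "form": "Dogs", "lemma": "dog", "upos": "NOUN",
        "xpos": "NNS", "feats": "Number=Plur", "head": 2, "deprel": "nsubj"}
print(to_conllu_line(word))
```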

### Performance 

You can check the performance for other treebanks [here](https://stanfordnlp.github.io/stanza/performance.html). 

In [None]:
columns = ['Treebank', 'Tokens', 'Sentences', 'UPOS', 'XPOS', 'Feats', 'UAS', 'LAS', 'LEMMAS']
results = [('UD_English-EWT', 99.01, 81.13, 95.4, 95.12, 96.11, 86.22, 83.59, 97.21),
           ('UD_Estonian-EDT', 99.96, 93.32, 97.19, 98.04, 95.77, 86.68, 83.82, 96.05),
           ('UD_Russian-SynTagRus', 99.57, 98.86, 98.2, 99.57, 95.91, 92.38, 90.6, 97.51)]
pd.DataFrame(results, columns=columns)

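|   | Treebank | Tokens | Sentences | UPOS | XPOS | Feats | UAS | LAS | LEMMAS |
|---|----------|--------|-----------|------|------|-------|-----|-----|--------|
| 0 | UD_English-EWT | 99.01 | 81.13 | 95.4 | 95.12 | 96.11 | 86.22 | 83.59 | 97.21 |
| 1 | UD_Estonian-EDT | 99.96 | 93.32 | 97.19 | 98.04 | 95.77 | 86.68 | 83.82 | 96.05 |
| 2 | UD_Russian-SynTagRus | 99.57 | 98.86 | 98.2 | 99.57 | 95.91 | 92.38 | 90.6 | 97.51 |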
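
The UAS and LAS columns are attachment scores for dependency parsing: UAS (unlabeled attachment score) is the share of words that receive the correct head, and LAS (labeled attachment score) is the share that receive both the correct head and the correct relation. A simplified sketch, assuming gold and predicted analyses share the same tokenization (official evaluation scripts also have to align differing tokenizations); `attachment_scores` is a hypothetical helper and the data is invented:

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) in percent for equal-length lists of (head, deprel)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sentence")
    correct_heads = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    correct_both = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100 * correct_heads / total, 100 * correct_both / total

# Invented example: four words, one wrong head and one wrong label.
gold = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "obj")]
pred = [(2, "det"), (3, "obj"), (0, "root"), (2, "obj")]
uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.1f}, LAS = {las:.1f}")
```

LAS can never exceed UAS, because a word counted for LAS must already have the correct head.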


What can we do with all of this information?

1. Use it as input to another NLP task, such as summarization, information extraction, or machine translation.
2. Extract phrases or sentences (for example, extract NOUN-VERB pairs for further analysis such as clustering).
3. Any more ideas?

Why might it be a bad idea to use pretrained pipelines?
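
As one possible reading of the NOUN-VERB idea, the sketch below pairs each noun with its head when that head is a verb. The row layout `(id, lemma, upos, head)`, the helper name `noun_verb_pairs`, and the toy annotations are all assumptions made for this illustration.

```python
def noun_verb_pairs(words):
    """Return (noun, verb) lemma pairs where the noun's head is a verb."""
    by_id = {word_id: (lemma, upos) for word_id, lemma, upos, _ in words}
    pairs = []
    for _, lemma, upos, head in words:
        if upos == "NOUN" and head != 0:
            head_lemma, head_upos = by_id[head]
            if head_upos == "VERB":
                pairs.append((lemma, head_lemma))
    return pairs

# Hand-annotated toy sentence (not pipeline output): "The dog chases the cat"
toy = [
    (1, "the", "DET", 2),
    (2, "dog", "NOUN", 3),
    (3, "chase", "VERB", 0),
    (4, "the", "DET", 5),
    (5, "cat", "NOUN", 3),
]
print(noun_verb_pairs(toy))
```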





### Analysing the movie "The Room"
Let's analyse one of the greatest movies of all time, "The Room" by Tommy Wiseau.

We know that the main character, Johnny, shoots himself at the end. The question is: can we see his mood change from his lines?

In [None]:
johnny_lines = []
with open('the_room.txt', 'r', encoding='utf-8') as f:
  for line in f:
    ...

In [None]:
johnny_lines[0], johnny_lines[-1]

In [None]:
from tqdm.notebook import tqdm

analysed_lines = []
for johnny_line in tqdm(johnny_lines): 
  doc = ...
  ...

In [None]:
import matplotlib.pyplot as plt
...

Did it work and why? 

In [None]:
... # alternate approach