<a href="https://colab.research.google.com/github/dgromann/viva/blob/master/NLP_syntactic_processing_miniexample.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## VIVA Summer School - Basic syntactic NLP processing

This is a very short introduction to building an NLP pipeline with [SpaCy](https://nlpforhackers.io/complete-guide-to-spacy/), a popular NLP library and toolkit that offers a more recent alternative to the long-established Natural Language Toolkit (NLTK) in Python.

First, we need to download the English model for SpaCy.

In [0]:
!python -m spacy download en_core_web_sm

# Tokenization
Next, we will import the SpaCy library into Python and give it a first statement to tokenize.

In [0]:
import spacy
nlp = spacy.load('en_core_web_sm')  # full package name; the 'en' shortcut no longer loads in spaCy v3
doc = nlp("Welcome to the NLP pipeline in SpaCy!")
for token in doc:
    print('"' + token.text + '"', token.idx)
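
The second value printed for each token, `token.idx`, is the character offset at which the token starts in the original string. A minimal sketch to check this, assuming a tokenizer-only pipeline from `spacy.blank("en")` (used here only so that the example runs without the downloaded model; the names `blank_nlp` and `check_doc` are illustrative):

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough to
# illustrate offsets; it needs no downloaded model.
blank_nlp = spacy.blank("en")
text = "Welcome to the NLP pipeline in SpaCy!"
check_doc = blank_nlp(text)

# token.idx is the character offset of the token in the original text,
# so slicing the text at that offset recovers the token.
offsets = [(token.text, token.idx) for token in check_doc]
for token in check_doc:
    assert text[token.idx:token.idx + len(token)] == token.text

print(offsets)
```

These offsets are what make it possible to map tokens back to positions in the raw text, for example when highlighting them.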
 

# Sentence detection

We can also detect individual sentences in a longer text. A convenient source of such text is NLTK, which comes with a number of preprocessed and cleaned corpora, such as the Brown corpus or the Project Gutenberg selection used here.

In [0]:
import nltk
nltk.download('gutenberg')  # only needed the first time
from nltk.corpus import gutenberg

gutenberg.fileids()

In [0]:
excerpt = gutenberg.raw('austen-sense.txt')[0:1000]
print(excerpt)

In [0]:
processed_text = nlp(excerpt)
for sent in processed_text.sents:
    print(sent)
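
The loop above relies on the sentence boundaries set by the loaded model. As a point of comparison, here is a minimal sketch of purely rule-based splitting with SpaCy's built-in `sentencizer` component, which splits on sentence-final punctuation. It assumes spaCy v3, where built-in components are added by name; `rule_nlp` and the sample text are illustrative.

```python
import spacy

# Assumption: spaCy v3, where built-in components are added by name.
rule_nlp = spacy.blank("en")
rule_nlp.add_pipe("sentencizer")

# Short illustrative sample text.
sample = "The family of Dashwood had long been settled in Sussex. Their estate was large."
sample_doc = rule_nlp(sample)

# .sents is a generator of Span objects, so collect it to count or index.
sentences = [sent.text for sent in sample_doc.sents]
print(len(sentences), sentences)
```

A punctuation rule like this is fast but can be misled by abbreviations, which is one reason to prefer the model-based boundaries for real text.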

# POS Tagging

Part-of-speech (POS) tagging is the process of assigning a word class, such as noun, verb, or adjective, to each individual token.

For instance, "NN" refers to a singular or mass noun and "CC" denotes a coordinating conjunction. The full set of these widely used tags can be found in the [Penn Treebank listing](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). In SpaCy, the fine-grained Penn Treebank tag of a token is available as `token.tag_`.

In [0]:
print([(token.text, token.tag_) for token in processed_text])
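
The tag abbreviations can also be looked up from within Python. A small sketch using `spacy.explain`, which returns SpaCy's glossary description for a tag or label (or `None` if the term is unknown); the tag list is just an illustrative selection:

```python
import spacy

# Illustrative selection of Penn Treebank tags; extend as needed.
example_tags = ["NN", "NNS", "CC", "VBD", "JJ"]

# spacy.explain looks the abbreviation up in spaCy's built-in glossary
# and needs no loaded model.
tag_descriptions = {tag: spacy.explain(tag) for tag in example_tags}
for tag, description in tag_descriptions.items():
    print(tag, "->", description)
```

The same lookup works for the coarse-grained tags stored in `token.pos_`, such as `NOUN` or `VERB`.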

Which process does the following code perform? Which elements of the text are being extracted?

GPE, for instance, denotes a geopolitical entity, and FAC refers to "Buildings, airports, highways, bridges, etc."

In [0]:
for ent in processed_text.ents:
    print(ent.text, ent.label_)
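
Each element of `.ents` is a span of tokens with a label attached. To make that structure visible without relying on a model's predictions, the following sketch attaches a label by hand; the sentence, the token position, and the label are chosen for illustration and are not model output.

```python
import spacy
from spacy.tokens import Span

# Tokenizer-only pipeline: no statistical model, so nothing is predicted.
ner_demo_nlp = spacy.blank("en")
demo_doc = ner_demo_nlp("The family of Dashwood had long been settled in Sussex.")

# Hand-assigned annotation (an assumption for illustration): token 9 is "Sussex".
demo_doc.ents = [Span(demo_doc, 9, 10, label="GPE")]

for ent in demo_doc.ents:
    # start_char and end_char are character offsets into the original text.
    print(ent.text, ent.label_, ent.start_char, ent.end_char, spacy.explain(ent.label_))
```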


And its visual representation:

In [0]:
from spacy import displacy
displacy.render(processed_text, style='ent', jupyter=True)

# Dependency Parsing 

What can we learn from dependency parsing?

In [0]:
displacy.render(processed_text[13:24], style='dep', jupyter=True, options={'distance': 90})
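
The arrows in the rendering correspond to token attributes that can be read directly: every token has a syntactic head (`token.head`), a relation label (`token.dep_`), and dependents (`token.children`). The sketch below builds a tiny hand-annotated parse so that it runs without a model. It assumes spaCy v3, where the `Doc` constructor accepts `heads` and `deps`; the sentence and its annotation are written by hand for illustration and are not parser output.

```python
import spacy
from spacy.tokens import Doc

vocab = spacy.blank("en").vocab
words = ["Their", "estate", "was", "large"]
# Hand-written annotation (assumption): heads are absolute token indices,
# and the root token points to itself.
heads = [1, 2, 2, 2]
deps = ["poss", "nsubj", "ROOT", "acomp"]
dep_doc = Doc(vocab, words=words, heads=heads, deps=deps)

for token in dep_doc:
    children = [child.text for child in token.children]
    print(token.text, token.dep_, token.head.text, children)

# Example of what the parse makes easy: pick out the grammatical subject.
subjects = [token.text for token in dep_doc if token.dep_ == "nsubj"]
print(subjects)
```

With a loaded model, the same attributes are available on `processed_text`, so questions such as "who did what" can be answered by following subject and object relations from each verb.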