<a href="https://colab.research.google.com/github/dgromann/viva/blob/master/NLP_syntactic_processing_miniexample.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## VIVA Summer School - Basic syntactic NLP processing

This is a very short introduction to building an NLP pipeline with [SpaCy](https://nlpforhackers.io/complete-guide-to-spacy/), a popular NLP library and toolkit that offers a more recent alternative to the conventional Natural Language Toolkit (NLTK) in Python.

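First we need to download the English model for SpaCy. The shortcut name `en` used here works in SpaCy 2.x; in SpaCy 3 and later the shortcut is deprecated and `spacy.load('en')` no longer works, so use the full model name `en_core_web_sm` there.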

In [1]:
!python -m spacy download en

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


# Tokenization
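Next, we will import the SpaCy library into Python and give it a first sentence to tokenize. For each token we print its text and `token.idx`, the character offset at which the token starts in the input string.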

In [2]:
import spacy
nlp = spacy.load('en')
doc = nlp("Welcome to the NLP pipeline in SpaCy!")
for token in doc:
    print('"' + token.text + '"', token.idx)
 

"Welcome" 0
"to" 8
"the" 11
"NLP" 15
"pipeline" 19
"in" 28
"SpaCy" 31
"!" 36

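To see what the offsets in the second column (`token.idx`) mean, the sketch below recomputes them with a plain regular expression from the standard library. The pattern is an assumption made for this example and is far simpler than SpaCy's actual tokenizer; it merely agrees with it on this sentence.

```python
import re

text = "Welcome to the NLP pipeline in SpaCy!"

# Hypothetical toy pattern: a run of word characters, or one punctuation mark.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def toy_tokenize(s):
    """Return (token_text, start_offset) pairs for a string."""
    return [(m.group(), m.start()) for m in TOKEN_PATTERN.finditer(s)]

tokens = toy_tokenize(text)
for tok, idx in tokens:
    # Slicing the original string at the offset recovers the token text.
    assert text[idx:idx + len(tok)] == tok
    print('"' + tok + '"', idx)
```

Because each offset points into the original string, tokens can always be mapped back to their exact position in the source text.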

# Sentence detection

We can also detect individual sentences in a longer text. A nice feature of NLTK is that it ships with a number of preprocessed and cleaned corpora, such as the Brown corpus; here we use its Project Gutenberg selection as sample text.

In [3]:
import nltk
# Download the corpus; only needed the first time
nltk.download('gutenberg')
from nltk.corpus import gutenberg

gutenberg.fileids()

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [4]:
excerpt = gutenberg.raw('austen-sense.txt')[0:1000]
print(excerpt)

[Sense and Sensibility by Jane Austen 1811]

CHAPTER 1


The family of Dashwood had long been settled in Sussex.
Their estate was large, and their residence was at Norland Park,
in the centre of their property, where, for many generations,
they had lived in so respectable a manner as to engage
the general good opinion of their surrounding acquaintance.
The late owner of this estate was a single man, who lived
to a very advanced age, and who for many years of his life,
had a constant companion and housekeeper in his sister.
But her death, which happened ten years before his own,
produced a great alteration in his home; for to supply
her loss, he invited and received into his house the family
of his nephew Mr. Henry Dashwood, the legal inheritor
of the Norland estate, and the person to whom he intended
to bequeath it.  In the society of his nephew and niece,
and their children, the old Gentleman's days were
comfortably spent.  His attachment to them all increased.
The constant attention 

In [5]:
processed_text = nlp(excerpt)
for sent in processed_text.sents: 
  print(sent)

[Sense and Sensibility by Jane Austen 1811]

CHAPTER 1



The family of Dashwood had long been settled in Sussex.

Their estate was large, and their residence was at Norland Park,
in the centre of their property, where, for many generations,
they had lived in so respectable a manner as to engage
the general good opinion of their surrounding acquaintance.

The late owner of this estate was a single man, who lived
to a very advanced age, and who for many years of his life,
had a constant companion and housekeeper in his sister.

But her death, which happened ten years before his own,
produced a great alteration in his home; for to supply
her loss, he invited and received into his house the family
of his nephew Mr. Henry Dashwood, the legal inheritor
of the Norland estate, and the person to whom he intended
to bequeath it.  
In the society of his nephew and niece,
and their children, the old Gentleman's days were
comfortably spent.  
His attachment to them all increased.

The constant att
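
Note that the output above keeps "Mr. Henry Dashwood" inside a single sentence. The sketch below hints at why a trained model helps here: it applies a naive rule that splits after every full stop, question mark or exclamation mark followed by whitespace. The rule is an assumption for illustration and is not how SpaCy segments text; the sample is a shortened, lightly adapted fragment of the excerpt.

```python
import re

# Shortened fragment adapted from the excerpt above (two sentences).
sample = ("He invited and received into his house the family of his nephew "
          "Mr. Henry Dashwood. His attachment to them all increased.")

def naive_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

pieces = naive_split(sample)
for piece in pieces:
    print(repr(piece))
```

The naive rule returns three pieces instead of two, because it also breaks after the abbreviation "Mr.".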

# POS Tagging

Part-of-speech (POS) tagging is the process of assigning a word class to each individual token.

For instance, "NN" marks a singular or mass noun and "CC" a coordinating conjunction. The full set of these conventional tags can be found in the [Penn Treebank listing](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html). Tags such as "_SP" in the output below mark whitespace tokens; they are added by SpaCy and are not part of that listing.

In [6]:
print([(token.text, token.tag_) for token in processed_text])

[('[', '-LRB-'), ('Sense', 'NN'), ('and', 'CC'), ('Sensibility', 'NNP'), ('by', 'IN'), ('Jane', 'NNP'), ('Austen', 'NNP'), ('1811', 'CD'), (']', '-RRB-'), ('\n\n', '_SP'), ('CHAPTER', 'NNP'), ('1', 'CD'), ('\n\n\n', '_SP'), ('The', 'DT'), ('family', 'NN'), ('of', 'IN'), ('Dashwood', 'NNP'), ('had', 'VBD'), ('long', 'RB'), ('been', 'VBN'), ('settled', 'VBN'), ('in', 'IN'), ('Sussex', 'NNP'), ('.', '.'), ('\n', '_SP'), ('Their', 'PRP$'), ('estate', 'NN'), ('was', 'VBD'), ('large', 'JJ'), (',', ','), ('and', 'CC'), ('their', 'PRP$'), ('residence', 'NN'), ('was', 'VBD'), ('at', 'IN'), ('Norland', 'NNP'), ('Park', 'NNP'), (',', ','), ('\n', '_SP'), ('in', 'IN'), ('the', 'DT'), ('centre', 'NN'), ('of', 'IN'), ('their', 'PRP$'), ('property', 'NN'), (',', ','), ('where', 'WRB'), (',', ','), ('for', 'IN'), ('many', 'JJ'), ('generations', 'NNS'), (',', ','), ('\n', '_SP'), ('they', 'PRP'), ('had', 'VBD'), ('lived', 'VBN'), ('in', 'RP'), ('so', 'RB'), ('respectable', 'JJ'), ('a', 'DT'), ('manner'
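
To get a feel for the tag distribution, the tagged pairs can be counted. The sketch below does not call SpaCy; it copies the pairs of the first sentence from the output above into a list. The short glosses in `TAG_GLOSS` are paraphrased from the Penn Treebank tag set and are included only for readability.

```python
from collections import Counter

# (text, tag) pairs of the first sentence, copied from the output above.
tagged = [
    ('The', 'DT'), ('family', 'NN'), ('of', 'IN'), ('Dashwood', 'NNP'),
    ('had', 'VBD'), ('long', 'RB'), ('been', 'VBN'), ('settled', 'VBN'),
    ('in', 'IN'), ('Sussex', 'NNP'), ('.', '.'),
]

# Brief glosses for the tags that occur here (paraphrased, not exhaustive).
TAG_GLOSS = {
    'DT': 'determiner',
    'NN': 'noun, singular or mass',
    'IN': 'preposition or subordinating conjunction',
    'NNP': 'proper noun, singular',
    'VBD': 'verb, past tense',
    'RB': 'adverb',
    'VBN': 'verb, past participle',
    '.': 'sentence-final punctuation',
}

# Count how often each tag occurs in the sentence.
tag_counts = Counter(tag for _, tag in tagged)
for tag, count in tag_counts.most_common():
    print(tag, count, TAG_GLOSS[tag])
```

With a loaded document, the same count can be taken directly over `token.tag_` for every token in `processed_text`.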

Which process does the following cell perform? Which elements of the text are being extracted?

As a hint on the labels: GPE denotes a geopolitical entity, and FAC refers to "Buildings, airports, highways, bridges, etc."

In [8]:
for ent in processed_text.ents:
    print(ent.text, ent.label_)


Jane Austen PERSON
1811 ORDINAL
CHAPTER 1


 ORG
Dashwood ORG
Sussex GPE
Norland Park FAC
many years DATE
ten years DATE
Henry Dashwood PERSON
Norland GPE
Gentleman PERSON
days DATE


The same result can also be displayed visually:

In [9]:
from spacy import displacy
displacy.render(processed_text, style='ent', jupyter=True)
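
The entity list printed earlier can also be summarised per label. The sketch below copies the (text, label) pairs from that output into a list, leaving out the one entry whose text contains line breaks, and groups them with the standard library. With a loaded document, the same loop would run over `processed_text.ents`.

```python
from collections import defaultdict

# (text, label) pairs copied from the entity output above.
entities = [
    ("Jane Austen", "PERSON"), ("1811", "ORDINAL"), ("Dashwood", "ORG"),
    ("Sussex", "GPE"), ("Norland Park", "FAC"), ("many years", "DATE"),
    ("ten years", "DATE"), ("Henry Dashwood", "PERSON"), ("Norland", "GPE"),
    ("Gentleman", "PERSON"), ("days", "DATE"),
]

# Collect the entity texts under their label, preserving document order.
by_label = defaultdict(list)
for text, label in entities:
    by_label[label].append(text)

for label in sorted(by_label):
    print(label, by_label[label])
```

Grouping makes it easier to spot labels that may deserve a second look, for example "Dashwood" as ORG or "1811" as ORDINAL.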

# Dependency Parsing 

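What can we learn from dependency parsing? The slice `processed_text[13:24]` used below covers the tokens of the first sentence of the excerpt, from "The" to the final full stop.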

In [10]:
displacy.render(processed_text[13:24], style='dep', jupyter=True, options={'distance': 90})
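
One thing a dependency parse makes explicit is which word governs which. The sketch below encodes a parse of the first sentence as (token, head index, relation) triples and reads off the root and its direct dependents. The heads and relation labels are assumptions written by hand for illustration; they are not copied from SpaCy's output, which may differ.

```python
# Hand-annotated, illustrative parse (an assumption, not SpaCy output).
# Each entry: (token text, index of its head, relation label).
# The root points to itself, mirroring how SpaCy represents the root token.
parse = [
    ("The", 1, "det"),
    ("family", 7, "nsubjpass"),
    ("of", 1, "prep"),
    ("Dashwood", 2, "pobj"),
    ("had", 7, "aux"),
    ("long", 7, "advmod"),
    ("been", 7, "auxpass"),
    ("settled", 7, "ROOT"),
    ("in", 7, "prep"),
    ("Sussex", 8, "pobj"),
    (".", 7, "punct"),
]

# The root is the only token that is its own head.
root_index = next(i for i, (_text, head, _rel) in enumerate(parse) if head == i)

# Map every head index to the indices of its direct dependents.
children = {i: [] for i in range(len(parse))}
for i, (_text, head, _rel) in enumerate(parse):
    if head != i:
        children[head].append(i)

root = parse[root_index][0]
dependents = [(parse[i][0], parse[i][2]) for i in children[root_index]]
print("root:", root)
print("dependents of root:", dependents)
```

With a parsed document, the corresponding information is available through `token.head`, `token.dep_` and `token.children`.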