<a href="https://colab.research.google.com/github/cltl/python-for-text-analysis/blob/colab/Chapters-colab/Chapter_19_More_about_Natural_Language_Processing_Tools_(spaCy).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
%%capture
!wget https://github.com/cltl/python-for-text-analysis/raw/master/zips/Data.zip
!wget https://github.com/cltl/python-for-text-analysis/raw/master/zips/images.zip
!wget https://github.com/cltl/python-for-text-analysis/raw/master/zips/Extra_Material.zip

!unzip Data.zip -d ../
!unzip images.zip -d ./
!unzip Extra_Material.zip -d ../

!rm Data.zip
!rm Extra_Material.zip
!rm images.zip

# Chapter 19 - More about Natural Language Processing Tools (spaCy)

Text data is unstructured. If you want to extract information from text, you often need to process that data into a more structured representation. The common idea behind all Natural Language Processing (NLP) tools is that they try to structure or transform text in some meaningful way. You have already learned about four basic NLP steps: sentence splitting, tokenization, POS-tagging and lemmatization. For all of these, we have used the NLTK library, which is widely used in the field of NLP. However, there are some competitors out there that are worth a look. One of them is spaCy, which is fast and accurate and supports multiple languages.

**At the end of this chapter, you will be able to:**
- work with spaCy
- find some additional NLP tools

## 1. The NLP pipeline

There are many tools and libraries designed to solve NLP problems. In Chapter 15, we already saw the NLTK library for tokenization, sentence splitting, part-of-speech tagging and lemmatization. However, there are many more NLP tasks and off-the-shelf tools to perform them. These tasks often depend on each other and are therefore put into a sequence; such a sequence of NLP tasks is called an NLP pipeline. Some of the most common NLP tasks are:

* **Tokenization:** splitting texts into individual words
* **Sentence splitting:** splitting texts into sentences
* **Part-of-speech (POS) tagging:** identifying the parts of speech of words in context (verbs, nouns, adjectives, etc.)
* **Morphological analysis:** separating words into morphemes and identifying their classes (e.g. tense/aspect of verbs)
* **Stemming:** identifying the stems of words in context by removing inflectional/derivational affixes, such as 'troubl' for 'trouble/troubling/troubled'
* **Lemmatization:** identifying the lemmas (dictionary forms) of words in context, such as 'go' for 'go/goes/going/went'
* **Word Sense Disambiguation (WSD):** assigning the correct meaning to words in context
* **Stop words recognition:** identifying commonly used words (such as 'the', 'a(n)', 'in', etc.) in text, possibly to ignore them in other tasks
* **Named Entity Recognition (NER):** identifying people, locations, organizations, etc. in text
* **Constituency/dependency parsing:** analyzing the grammatical structure of a sentence
* **Semantic Role Labeling (SRL):** analyzing the semantic structure of a sentence (*who* does *what* to *whom*, *where* and *when*)
* **Sentiment Analysis:** determining whether a text is mostly positive or negative
* **Word Vectors (or Word Embeddings) and Semantic Similarity:** representing the meaning of words as vectors of real-valued numbers, where each dimension captures an aspect of the word's meaning and where semantically similar words have similar vectors (very popular these days)

You don't always need all these modules. But it's important to know that they are
there, so that you can use them when the need arises.
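
To make the idea of a pipeline concrete, here is a minimal sketch in plain Python that chains three of the tasks above: sentence splitting, tokenization and stop word removal. The function names, the regular expressions and the tiny stop word list are illustrative assumptions, not how NLTK or spaCy implement these steps.

```python
import re

# Tiny illustrative stop word list (an assumption, not a real resource)
STOP_WORDS = {"the", "a", "an", "in", "on", "i", "it"}

def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Naive tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", sentence)

def remove_stop_words(tokens):
    """Drop tokens whose lowercased form is in the stop word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

def toy_pipeline(text):
    """Each step consumes the output of the previous one."""
    return [remove_stop_words(tokenize(s)) for s in split_sentences(text)]

result = toy_pipeline("I have an awesome cat. It sits on the mat.")
print(result)
```

Note how the order matters: the stop word filter needs tokens, and the tokenizer here works per sentence, which is exactly why such steps are arranged in a sequence.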

### 1.1 How can you use these modules?

Let's be clear about this: **you don't always need to use Python for this**. There are
some very strong NLP programs out there that don't rely on Python. You can typically
call these programs from the command line. Some examples are:

* [Treetagger](http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/) is a POS-tagger
  and lemmatizer in one. It provides support for many different languages. If you want to
  call Treetagger from Python, use [treetaggerwrapper](http://treetaggerwrapper.readthedocs.io/).
  [Treetagger-python](https://github.com/miotto/treetagger-python) also works, but is much slower.

* [Stanford's CoreNLP](http://stanfordnlp.github.io/CoreNLP/) is a very powerful system
  that is able to process English, German, Spanish, French, Chinese and Arabic. (Each to
  a different extent, though. The pipeline for English is most complete.) There are also
  Python wrappers available, such as [py-corenlp](https://github.com/smilli/py-corenlp).

* [The Maltparser](http://www.maltparser.org/) has models for English, Swedish, French, and Spanish.
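
If you do want to drive a command-line program from Python, the standard library's `subprocess` module is usually enough. The sketch below is only an illustration: the helper name `run_command_line_tool` is hypothetical, and the "tool" is a stand-in Python one-liner rather than one of the programs listed above, whose actual commands and options you would need to look up in their documentation.

```python
import subprocess
import sys

def run_command_line_tool(command, input_text):
    """Send input_text to an external program on stdin and return its stdout."""
    completed = subprocess.run(
        command, input=input_text, capture_output=True, text=True, check=True
    )
    return completed.stdout

# Stand-in "tool": prints one whitespace-separated token per line.
# Replace this list with the command of a real tagger or parser.
stand_in_tool = [
    sys.executable,
    "-c",
    "import sys; print('\\n'.join(sys.stdin.read().split()))",
]

output = run_command_line_tool(stand_in_tool, "I have a cat")
tokens = output.split()
print(tokens)
```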


Having said that, there are many **NLP-tools that have been developed for Python**:

* [Natural Language ToolKit (NLTK)](http://www.nltk.org/): Incredibly versatile library with a bit of everything.
  The only downside is that it's not the fastest library out there, and it lags behind the
  state-of-the-art.
    * Access to several corpora.
    * Create a POS-tagger. (Some of these are actually state-of-the-art if you have enough training data.)
    * Perform corpus analyses.
    * Interface with [WordNet](https://wordnet.princeton.edu/).
* [Pattern](http://www.clips.ua.ac.be/pattern): A module that describes itself as a 'web mining module'. Implements a
    tokenizer, tagger, parser, and sentiment analyzer for multiple different languages.
    Also provides an API for Google, Twitter, Wikipedia and Bing.
* [Textblob](http://textblob.readthedocs.io/en/dev/): Another general NLP library that builds on the NLTK and Pattern.
* [Gensim](http://radimrehurek.com/gensim/): For building vector spaces and topic models.
* [Corpkit](http://corpkit.readthedocs.io/en/latest/) is a module for corpus building and corpus management. Includes an interface to the Stanford CoreNLP parser.
* [spaCy](https://spacy.io/): Tokenizer, POS-tagger, parser and named entity recognizer for English, German, Spanish, Portuguese, French, Italian and Dutch (more languages in progress). It can also predict similarity using word embeddings.

## 2. spaCy

[spaCy](https://spacy.io/) provides a rather complete NLP pipeline: it takes a raw document and performs tokenization, POS-tagging, stop word recognition, morphological analysis, lemmatization, sentence splitting, dependency parsing and Named Entity Recognition (NER). It also supports similarity prediction, but that is outside the scope of this notebook. spaCy's main advantages are its speed and its good accuracy. In addition, it supports multiple languages, including English, German, Spanish, Portuguese, French, Italian and Dutch.

In this notebook, we will show you the basic usage. If you want to learn more, please visit spaCy's website; it has extensive documentation and provides excellent user guides. 

### 2.1 Installing and loading spaCy

To install spaCy, check out the instructions [here](https://spacy.io/usage). That page explains exactly how to install spaCy for your operating system, package manager and desired language model(s). Simply run the suggested commands in your terminal or cmd. Alternatively, you can probably also just run the following cells in this notebook:

In [None]:
!pip install -U spacy



In [None]:
%%bash
python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.1.0
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.1.0/en_core_web_sm-3.1.0-py3-none-any.whl (13.6 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


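Now, let's first load spaCy. We import the spaCy module and load the small English pipeline, which includes a tokenizer, tagger, parser, lemmatizer and NER. (The small model does not ship with static word vectors; the medium and large models do.)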

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm') # other languages have their own models, e.g. de_core_news_sm or nl_core_news_sm

`nlp` is now a Python object representing the English NLP pipeline that we can use to process a text. 
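
To get a feeling for what such a pipeline object is, the following sketch builds a tiny pipeline by hand. It assumes spaCy v3 and needs no downloaded model: `spacy.blank("en")` creates an English pipeline that only tokenizes, and we add a rule-based sentence splitter to it. The names `blank_nlp` and `demo_doc` are chosen here so that the `nlp` object loaded above stays untouched.

```python
import spacy

# A blank English pipeline: tokenizer only, no trained components
blank_nlp = spacy.blank("en")

# Add a rule-based sentence splitter as the only pipeline component
blank_nlp.add_pipe("sentencizer")

# pipe_names lists the components that run after tokenization
print(blank_nlp.pipe_names)

demo_doc = blank_nlp("I have a cat. It sleeps.")
print([sent.text for sent in demo_doc.sents])
```

Running `nlp.pipe_names` on the model loaded above should show a longer list, since that pipeline also contains trained components such as the tagger and the parser.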

#### EXTRA: Larger models

For English, there are three [models](https://spacy.io/usage/models) ranging from 'small' to 'large':

- en_core_web_sm
- en_core_web_md
- en_core_web_lg

Above, we loaded the smallest one. Larger models are generally more accurate, but they take longer to download and load. If you like, you can use them instead. You will first need to download them.

In [None]:
#%%bash
#python -m spacy download en_core_web_md

In [None]:
#%%bash
#python -m spacy download en_core_web_lg

In [None]:
# uncomment one of the lines below if you want to load the medium or large model instead of the small one
# nlp = spacy.load('en_core_web_md')  
# nlp = spacy.load('en_core_web_lg') 
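
If you are not sure which models are installed, one option is to try them in order of preference. The sketch below assumes spaCy v3, where `spacy.load` raises an `OSError` when a model package cannot be found; the helper `load_first_available` is a hypothetical name introduced for this example, and falling back to a blank pipeline is just one possible design choice.

```python
import spacy

def load_first_available(model_names):
    """Return the first pipeline that loads; fall back to a blank English pipeline."""
    for name in model_names:
        try:
            return spacy.load(name)
        except OSError:
            # The model package is not installed: try the next one
            continue
    return spacy.blank("en")

# Prefer the large model, then medium, then small
demo_nlp = load_first_available(["en_core_web_lg", "en_core_web_md", "en_core_web_sm"])
print(demo_nlp.lang, demo_nlp.pipe_names)
```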

### 2.2 Using spaCy

Parsing a text with spaCy after loading a language model is as easy as follows:

In [None]:
doc = nlp("I have an awesome cat. It's sitting on the mat that I bought yesterday.")

`doc` is now a Python object of the class `Doc`. It is a container for accessing linguistic annotations and a sequence of `Token` objects.

#### Doc, Token and Span objects

At this point, there are three important types of objects to remember:

* A `Doc` is a sequence of `Token` objects.
* A `Token` object represents an individual token — i.e. a word, punctuation symbol, whitespace, etc. It has attributes representing linguistic annotations. 
* A `Span` object is a slice from a `Doc` object and a sequence of `Token` objects.

Since `Doc` is a sequence of `Token` objects, we can iterate over all of the tokens in the text as shown below, or select a single token from the sequence: 

In [None]:
# Iterate over the tokens
for token in doc:
    print(token)
print()

# Select one single token by index
first_token = doc[0]
print("First token:", first_token)

I
have
an
awesome
cat
.
It
's
sitting
on
the
mat
that
I
bought
yesterday
.

First token: I


Please note that even though these look like strings, they are not:

In [None]:
for token in doc:
    print(token, "\t", type(token))

I 	 <class 'spacy.tokens.token.Token'>
have 	 <class 'spacy.tokens.token.Token'>
an 	 <class 'spacy.tokens.token.Token'>
awesome 	 <class 'spacy.tokens.token.Token'>
cat 	 <class 'spacy.tokens.token.Token'>
. 	 <class 'spacy.tokens.token.Token'>
It 	 <class 'spacy.tokens.token.Token'>
's 	 <class 'spacy.tokens.token.Token'>
sitting 	 <class 'spacy.tokens.token.Token'>
on 	 <class 'spacy.tokens.token.Token'>
the 	 <class 'spacy.tokens.token.Token'>
mat 	 <class 'spacy.tokens.token.Token'>
that 	 <class 'spacy.tokens.token.Token'>
I 	 <class 'spacy.tokens.token.Token'>
bought 	 <class 'spacy.tokens.token.Token'>
yesterday 	 <class 'spacy.tokens.token.Token'>
. 	 <class 'spacy.tokens.token.Token'>


These `Token` objects have many useful methods and *attributes*, which we can list by using `dir()`. We haven't really talked about attributes during this course, but while methods are operations or activities performed by that object, attributes are 'static' features of the objects. Methods are called using parentheses (as we have seen with `str.upper()`, for instance), while attributes are accessed without parentheses. We will see some examples below.

You can find more detailed information about the token methods and attributes in the [documentation](https://spacy.io/api/token).
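
The difference between methods and attributes can be illustrated with a small class of our own. `ToyToken` is a made-up class for this example and has nothing to do with spaCy's implementation of `Token`.

```python
class ToyToken:
    """A toy stand-in for a token, only meant to contrast attributes and methods."""

    def __init__(self, text):
        # Attributes: values stored on the object
        self.text = text
        self.is_title = text.istitle()

    def upper(self):
        # Method: an operation that is carried out when called
        return self.text.upper()

toy = ToyToken("Harry")
print(toy.text)      # attribute: no parentheses
print(toy.is_title)  # attribute: no parentheses
print(toy.upper())   # method: called with parentheses
```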

In [None]:
dir(first_token)

['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 'ancestors',
 'check_flag',
 'children',
 'cluster',
 'conjuncts',
 'dep',
 'dep_',
 'doc',
 'ent_id',
 'ent_id_',
 'ent_iob',
 'ent_iob_',
 'ent_kb_id',
 'ent_kb_id_',
 'ent_type',
 'ent_type_',
 'get_extension',
 'has_dep',
 'has_extension',
 'has_head',
 'has_morph',
 'has_vector',
 'head',
 'i',
 'idx',
 'iob_strings',
 'is_alpha',
 'is_ancestor',
 'is_ascii',
 'is_bracket',
 'is_currency',
 'is_digit',
 'is_left_punct',
 'is_lower',
 'is_oov',
 'is_punct',
 'is_quote',
 'is_right_punct',
 'is_sent_end',
 'is_sent_start',
 'is_space',
 'is_stop',
 'is_title',
 'is_upper',
 'lang',
 'lang_',
 'le

Let's inspect some of the attributes of the tokens. Can you figure out what they mean? Feel free to try out a few more.

In [None]:
# Print attributes of tokens
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_, token.shape_)

I I PRON PRP nsubj X
have have VERB VBP ROOT xxxx
an an DET DT det xx
awesome awesome ADJ JJ amod xxxx
cat cat NOUN NN dobj xxx
. . PUNCT . punct .
It it PRON PRP nsubj Xx
's be AUX VBZ aux 'x
sitting sit VERB VBG ROOT xxxx
on on ADP IN prep xx
the the DET DT det xxx
mat mat NOUN NN pobj xxx
that that DET WDT dobj xxxx
I I PRON PRP nsubj X
bought buy VERB VBD relcl xxxx
yesterday yesterday NOUN NN npadvmod xxxx
. . PUNCT . punct .


Notice that some of the attributes end with an underscore. For example, tokens have both `lemma` and `lemma_` attributes. The `lemma` attribute represents the id of the lemma (integer), while the `lemma_` attribute represents the unicode string representation of the lemma. In practice, you will mostly use the `lemma_` attribute.

In [None]:
for token in doc:
    print(token.lemma, token.lemma_)

4690420944186131903 I
14692702688101715474 have
15099054000809333061 an
3240785716591152042 awesome
5439657043933447811 cat
12646065887601541794 .
10239237003504588839 it
10382539506755952630 be
14192039007865877226 sit
5640369432778651323 on
7425985699627899538 the
11408774834842292007 mat
4380130941430378203 that
4690420944186131903 I
9457496526477982497 buy
1756787072497230782 yesterday
12646065887601541794 .
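
The integer ids are hashes that spaCy keeps in a lookup table, the vocabulary's string store, so that ids and strings can be converted into each other. The sketch below assumes spaCy v3 and uses a blank pipeline, which has no lemmatizer; it therefore demonstrates the same pairing with `orth` and `orth_` (the id and the string of the token text) instead of `lemma` and `lemma_`.

```python
import spacy

blank_nlp = spacy.blank("en")
demo_token = blank_nlp("cat")[0]
string_store = blank_nlp.vocab.strings

# Integer id and string version of the same attribute
print(demo_token.orth, demo_token.orth_)

# The string store maps ids to strings and strings to ids
id_to_string = string_store[demo_token.orth]
string_to_id = string_store["cat"]
print(id_to_string, string_to_id)
```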


You can also use `spacy.explain()` to find out more about certain labels:

In [None]:
# try out some more, such as NN, ADP, PRP, VBD, VBP, VBZ, WDT, aux, nsubj, pobj, dobj, npadvmod
spacy.explain("VBZ")

'verb, 3rd person singular present'

You can create a `Span` object from the slice doc[start : end]. For instance, doc[2:5] produces a span consisting of tokens 2, 3 and 4. Stepped slices (e.g. doc[start : end : step]) are not supported, as `Span` objects must be contiguous (cannot have gaps). You can use negative indices and open-ended ranges, which have their normal Python semantics.

In [None]:
# Create a Span
a_slice = doc[2:5]
print(a_slice, type(a_slice))

# Iterate over Span
for token in a_slice:
    print(token.lemma_, token.pos_)

an awesome cat <class 'spacy.tokens.span.Span'>
an DET
awesome ADJ
cat NOUN
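
Negative indices and open-ended ranges can be tried out in the same way. This sketch assumes spaCy v3 and uses a blank pipeline (tokenizer only), so no model is required; it also shows one workaround for stepped slices, namely converting the `Doc` to a list of tokens first.

```python
import spacy

blank_nlp = spacy.blank("en")
demo_doc = blank_nlp("I have an awesome cat.")

# Negative start index with an open end: the last two tokens
last_two = demo_doc[-2:]
print(last_two.text)

# Open start: the first three tokens
first_three = demo_doc[:3]
print(first_three.text)

# Stepped slices are not supported on a Doc, but they work on a plain list of tokens
every_other = list(demo_doc)[::2]
print([token.text for token in every_other])
```

The first two results are `Span` objects, whereas the last one is an ordinary Python list of `Token` objects.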


#### Text, sentences and noun_chunks

If you call the `dir()` function on a `Doc` object, you will see that it has a range of methods and attributes. You can read more about them in the [documentation](https://spacy.io/api/doc). Below, we highlight three of them: `text`, `sents` and `noun_chunks`.

In [None]:
dir(doc)

['_',
 '__bytes__',
 '__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__iter__',
 '__le__',
 '__len__',
 '__lt__',
 '__ne__',
 '__new__',
 '__pyx_vtable__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__unicode__',
 '_bulk_merge',
 '_get_array_attrs',
 '_py_tokens',
 '_realloc',
 '_vector',
 '_vector_norm',
 'cats',
 'char_span',
 'copy',
 'count_by',
 'doc',
 'ents',
 'extend_tensor',
 'from_array',
 'from_bytes',
 'from_dict',
 'from_disk',
 'from_docs',
 'get_extension',
 'get_lca_matrix',
 'has_annotation',
 'has_extension',
 'has_unknown_spaces',
 'has_vector',
 'is_nered',
 'is_parsed',
 'is_sentenced',
 'is_tagged',
 'lang',
 'lang_',
 'mem',
 'noun_chunks',
 'noun_chunks_iterator',
 'remove_extension',
 'retokenize',
 'sentiment',
 'sents',
 'set_ents',
 'se

First of all, `text` simply gives you the whole document as a string:

In [None]:
print(doc.text)
print(type(doc.text))

I have an awesome cat. It's sitting on the mat that I bought yesterday.
<class 'str'>


`sents` can be used to get all the sentences. Notice that it will create a so-called 'generator'. For now, you don't have to understand exactly what a generator is (if you like, you can read more about them online). Just remember that we can use generators to iterate over an object in a fast and efficient way.

In [None]:
# Get all the sentences as a generator 
print(doc.sents, type(doc.sents))

# We can use the generator to loop over the sentences; each sentence is a span of tokens
for sentence in doc.sents:
    print(sentence, type(sentence))

<generator object at 0x7f303bb74aa0> <class 'generator'>
I have an awesome cat. <class 'spacy.tokens.span.Span'>
It's sitting on the mat that I bought yesterday. <class 'spacy.tokens.span.Span'>


If you find this difficult to comprehend, you can also simply convert it to a list and then loop over the list. Remember that this is less efficient, though.

In [None]:
# You can also store the sentences in a list and then loop over the list 
sentences = list(doc.sents)
for sentence in sentences:
    print(sentence, type(sentence))

I have an awesome cat. <class 'spacy.tokens.span.Span'>
It's sitting on the mat that I bought yesterday. <class 'spacy.tokens.span.Span'>


The benefit of converting it to a list is that we can use indices to select certain sentences. For example, in the following we only print some information about the tokens in the second sentence.

In [None]:
# Print some information about the tokens in the second sentence.
sentences = list(doc.sents)
for token in sentences[1]:
    data = '\t'.join([token.orth_,
                      token.lemma_,
                      token.pos_,
                      token.tag_,
                      str(token.i),    # Turn index into string
                      str(token.idx)]) # Turn index into string
    print(data)

It	it	PRON	PRP	6	23
's	be	AUX	VBZ	7	25
sitting	sit	VERB	VBG	8	28
on	on	ADP	IN	9	36
the	the	DET	DT	10	39
mat	mat	NOUN	NN	11	43
that	that	DET	WDT	12	47
I	I	PRON	PRP	13	52
bought	buy	VERB	VBD	14	54
yesterday	yesterday	NOUN	NN	15	61
.	.	PUNCT	.	16	70
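
The difference between a generator and a list, as used with `sents` above, can also be seen in plain Python. The generator function `count_up_to` below is a made-up example: a generator produces its items one at a time and is used up after one pass, while a list keeps all items in memory and supports indexing.

```python
def count_up_to(limit):
    """Yield the numbers 1..limit one at a time."""
    for number in range(1, limit + 1):
        yield number

numbers = count_up_to(3)
first_pass = [n for n in numbers]
second_pass = [n for n in numbers]  # the generator is already exhausted

print(first_pass)
print(second_pass)

# Converting to a list stores all items, so indexing and repeated loops work
as_list = list(count_up_to(3))
print(as_list[1])
```

With spaCy you will rarely notice the exhaustion, because each access to `doc.sents` appears to hand out a fresh generator; it does matter as soon as you store a generator in a variable and loop over it twice.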


Similarly, `noun_chunks` can be used to create a generator for all noun chunks in the text. 

In [None]:
# Get all the noun chunks as a generator 
print(doc.noun_chunks, type(doc.noun_chunks))

# You can loop over a generator; each noun chunk is a span of tokens
for chunk in doc.noun_chunks:
    print(chunk, type(chunk))
    print()

<generator object at 0x7f310c13ae10> <class 'generator'>
I <class 'spacy.tokens.span.Span'>

an awesome cat <class 'spacy.tokens.span.Span'>

It <class 'spacy.tokens.span.Span'>

the mat <class 'spacy.tokens.span.Span'>

I <class 'spacy.tokens.span.Span'>



#### Named Entities

Finally, we can also very easily access the Named Entities in a text using `ents`. As you can see below, it will create a tuple of the entities recognized in the text. Each entity is again a span of tokens, and you can access the type of the entity with the `label_` attribute of `Span`.

In [None]:
# Here's a slightly longer text, from the Wikipedia page about Harry Potter.
harry_potter = "Harry Potter is a series of fantasy novels written by British author J. K. Rowling.\
The novels chronicle the life of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley,\
all of whom are students at Hogwarts School of Witchcraft and Wizardry.\
The main story arc concerns Harry's struggle against Lord Voldemort, a dark wizard who intends to become immortal,\
overthrow the wizard governing body known as the Ministry of Magic, and subjugate all wizards and Muggles."

doc = nlp(harry_potter)
print(doc.ents)
print(type(doc.ents))

(Harry Potter, British, J. K. Rowling, Harry Potter, Hermione Granger, Ron Weasley, Hogwarts School of Witchcraft and, Wizardry, Harry, Voldemort, the Ministry of Magic, Muggles)
<class 'tuple'>


In [None]:
# Each entity is a span of tokens and is labeled with the type of entity
for entity in doc.ents:
    print(entity, "\t", entity.label_, "\t", type(entity))

Harry Potter 	 PERSON 	 <class 'spacy.tokens.span.Span'>
British 	 NORP 	 <class 'spacy.tokens.span.Span'>
J. K. Rowling 	 PERSON 	 <class 'spacy.tokens.span.Span'>
Harry Potter 	 PERSON 	 <class 'spacy.tokens.span.Span'>
Hermione Granger 	 PERSON 	 <class 'spacy.tokens.span.Span'>
Ron Weasley 	 PERSON 	 <class 'spacy.tokens.span.Span'>
Hogwarts School of Witchcraft and 	 ORG 	 <class 'spacy.tokens.span.Span'>
Wizardry 	 ORG 	 <class 'spacy.tokens.span.Span'>
Harry 	 PERSON 	 <class 'spacy.tokens.span.Span'>
Voldemort 	 PERSON 	 <class 'spacy.tokens.span.Span'>
the Ministry of Magic 	 ORG 	 <class 'spacy.tokens.span.Span'>
Muggles 	 PERSON 	 <class 'spacy.tokens.span.Span'>
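
A common next step is to group the recognized entities by their label. The sketch below works on a few (text, label) pairs copied from the output above, so it runs without a model; with a real `Doc`, you could build the same pairs from `doc.ents` using `entity.text` and `entity.label_`.

```python
from collections import defaultdict

# A few (text, label) pairs copied from the output above
entity_pairs = [
    ("Harry Potter", "PERSON"),
    ("British", "NORP"),
    ("J. K. Rowling", "PERSON"),
    ("Harry Potter", "PERSON"),
    ("the Ministry of Magic", "ORG"),
]

# Map each label to the list of distinct entity texts, in order of appearance
entities_by_label = defaultdict(list)
for text, label in entity_pairs:
    if text not in entities_by_label[label]:
        entities_by_label[label].append(text)

for label, texts in entities_by_label.items():
    print(label, texts)
```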


Pretty cool, but what does NORP mean? Again, you can use `spacy.explain("NORP")` to find out: it returns 'Nationalities or religious or political groups'.

## 3. EXTRA: Stanford CoreNLP

Another very popular NLP pipeline is [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/index.html). You can use the tool from the command line, but there are also some useful Python wrappers that make use of the Stanford CoreNLP API, such as pycorenlp and the CoreNLP client that ships with Stanza. As you might want to use this in the future, we will provide you with a quick start guide. To use pycorenlp with a server that you start yourself, you will have to do the following:

1. Download Stanford CoreNLP [here](https://stanfordnlp.github.io/CoreNLP/download.html).
2. Install pycorenlp (run `pip install pycorenlp` in your terminal, or simply run the cell below).
3. Open a terminal and run the following commands (replace with the correct directory names):  
   `cd LOCATION_OF_CORENLP/stanford-corenlp-full-2018-02-27`  
   `java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer`  
   You will have to repeat this step every time you want to use the Stanford CoreNLP API.


In [None]:
%%bash
pip install pycorenlp

Collecting pycorenlp
  Downloading pycorenlp-0.3.0.tar.gz (1.3 kB)
Building wheels for collected packages: pycorenlp
  Building wheel for pycorenlp (setup.py): started
  Building wheel for pycorenlp (setup.py): finished with status 'done'
  Created wheel for pycorenlp: filename=pycorenlp-0.3.0-py3-none-any.whl size=2143 sha256=73aaffc3983ca2999eff87f27a40b645838900690424884546aed39218cba047
  Stored in directory: /root/.cache/pip/wheels/83/d8/ad/6b2276343ac605ee47e6beddb28331e96377909e5c816539c3
Successfully built pycorenlp
Installing collected packages: pycorenlp
Successfully installed pycorenlp-0.3.0


In [None]:
# https://colab.research.google.com/github/stanfordnlp/stanza/blob/master/demo/Stanza_CoreNLP_Interface.ipynb
# Install stanza; note that the prefix "!" is not needed if you are running in a terminal
!pip install stanza

# Import stanza
import stanza

# Download the Stanford CoreNLP package with Stanza's installation command
# This'll take several minutes, depending on the network speed
corenlp_dir = './corenlp'
stanza.install_corenlp(dir=corenlp_dir)

# Set the CORENLP_HOME environment variable to point to the installation location
import os
os.environ["CORENLP_HOME"] = corenlp_dir

# Examine the CoreNLP installation folder to make sure the installation is successful
!ls $CORENLP_HOME

Collecting stanza
  Downloading stanza-1.2.3-py3-none-any.whl (342 kB)
[?25l[K     |█                               | 10 kB 24.5 MB/s eta 0:00:01[K     |██                              | 20 kB 28.1 MB/s eta 0:00:01[K     |██▉                             | 30 kB 30.0 MB/s eta 0:00:01[K     |███▉                            | 40 kB 32.0 MB/s eta 0:00:01[K     |████▉                           | 51 kB 34.3 MB/s eta 0:00:01[K     |█████▊                          | 61 kB 35.6 MB/s eta 0:00:01[K     |██████▊                         | 71 kB 29.8 MB/s eta 0:00:01[K     |███████▋                        | 81 kB 29.6 MB/s eta 0:00:01[K     |████████▋                       | 92 kB 31.1 MB/s eta 0:00:01[K     |█████████▋                      | 102 kB 32.2 MB/s eta 0:00:01[K     |██████████▌                     | 112 kB 32.2 MB/s eta 0:00:01[K     |███████████▌                    | 122 kB 32.2 MB/s eta 0:00:01[K     |████████████▌                   | 133 kB 32.2 MB/s eta 0:0

2021-09-29 10:36:48 INFO: Installing CoreNLP package into ./corenlp...


Downloading http://nlp.stanford.edu/software/stanford-corenlp-latest.zip:   0%|          | 0.00/504M [00:00<?,…



build.xml				  jollyday.jar
corenlp.sh				  LIBRARY-LICENSES
CoreNLP-to-HTML.xsl			  LICENSE.txt
ejml-core-0.39.jar			  Makefile
ejml-core-0.39-sources.jar		  patterns
ejml-ddense-0.39.jar			  pom-java-11.xml
ejml-ddense-0.39-sources.jar		  pom.xml
ejml-simple-0.39.jar			  protobuf-java-3.11.4.jar
ejml-simple-0.39-sources.jar		  README.txt
input.txt				  RESOURCE-LICENSES
input.txt.out				  SemgrexDemo.java
input.txt.xml				  ShiftReduceDemo.java
istack-commons-runtime-3.0.7.jar	  slf4j-api.jar
istack-commons-runtime-3.0.7-sources.jar  slf4j-simple.jar
javax.activation-api-1.2.0.jar		  stanford-corenlp-4.2.2.jar
javax.activation-api-1.2.0-sources.jar	  stanford-corenlp-4.2.2-javadoc.jar
javax.json-api-1.0-sources.jar		  stanford-corenlp-4.2.2-models.jar
javax.json.jar				  stanford-corenlp-4.2.2-sources.jar
jaxb-api-2.4.0-b180830.0359.jar		  StanfordCoreNlpDemo.java
jaxb-api-2.4.0-b180830.0359-sources.jar   StanfordDependenciesManual.pdf
jaxb-impl-2.4.0-b180830.0438.jar	  sutime
jaxb-i

In [None]:
# Import client module
from stanza.server import CoreNLPClient

In [None]:
# Construct a CoreNLPClient with some basic annotators, a memory allocation of 4GB, and port number 9001
client = CoreNLPClient(
    annotators=['tokenize','ssplit', 'pos', 'lemma', 'ner'], 
    memory='4G', 
    endpoint='http://localhost:9001',
    be_quiet=True)
print(client)

# Start the background server and wait for some time
# Note that in practice this is totally optional, as by default the server will be started when the first annotation is performed
client.start()
import time; time.sleep(10)

2021-09-29 10:40:31 INFO: Writing properties to tmp file: corenlp_server-d22036b299da4f0f.props
2021-09-29 10:40:31 INFO: Starting server with command: java -Xmx4G -cp ./corenlp/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9001 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-d22036b299da4f0f.props -annotators tokenize,ssplit,pos,lemma,ner -preload -outputFormat serialized


<stanza.server.client.CoreNLPClient object at 0x7f303a400d90>


In [None]:
# Print background processes and look for java
# You should be able to see a StanfordCoreNLPServer java process running in the background
!ps -o pid,cmd | grep java

    281 java -Xmx4G -cp ./corenlp/* edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9001 -timeout 60000 -threads 5 -maxCharLength 100000 -quiet True -serverProperties corenlp_server-d22036b299da4f0f.props -annotators tokenize,ssplit,pos,lemma,ner -preload -outputFormat serialized
    304 /bin/bash -c ps -o pid,cmd | grep java
    306 grep java


In [None]:
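# Alternative client for a server started manually from the terminal (default port 9000).
# A separate name keeps the spaCy `nlp` object from being overwritten.
from pycorenlp import StanfordCoreNLP
corenlp_nlp = StanfordCoreNLP('http://localhost:9000')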

Next, you will want to define which [annotators](https://stanfordnlp.github.io/CoreNLP/annotators.html) to use and which [output format](https://stanfordnlp.github.io/CoreNLP/cmdline.html#output-options) should be produced (text, json, xml, conll, conllu, serialized). Annotating the document then is very easy. Note that Stanford CoreNLP uses some large models that can take [a long time](https://stackoverflow.com/questions/11219392/stanford-corenlp-very-slow) to load. You can read more about it [here](https://stanfordnlp.github.io/CoreNLP/memory-time.html).

In [None]:
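# Annotate some text
# text = "Albert Einstein was a German-born theoretical physicist. He developed the theory of relativity."
# Note: the backslash continuations below join the lines without a space,
# which is why tokens such as 'Rowling.The' show up in the output.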

text = "Harry Potter is a series of fantasy novels written by British author J. K. Rowling.\
The novels chronicle the life of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley,\
all of whom are students at Hogwarts School of Witchcraft and Wizardry.\
The main story arc concerns Harry's struggle against Lord Voldemort, a dark wizard who intends to become immortal,\
overthrow the wizard governing body known as the Ministry of Magic, and subjugate all wizards and Muggles."

properties = {'annotators': 'tokenize, ssplit, pos, lemma, parse',
              'outputFormat': 'json'}

doc = client.annotate(text, properties=properties)
print(type(doc))

<class 'dict'>


In the next cells, we will simply show some examples of how to access the linguistic annotations if you use the properties as shown above. If you'd like to continue working with Stanford CoreNLP in the future, you will likely have to experiment a bit more.

In [None]:
doc.keys()

dict_keys(['sentences'])

In [None]:
sentences = doc["sentences"]
first_sentence = sentences[0]
first_sentence.keys()

dict_keys(['index', 'parse', 'basicDependencies', 'enhancedDependencies', 'enhancedPlusPlusDependencies', 'tokens'])

In [None]:
first_sentence["parse"]

"(ROOT\n  (S\n    (NP (NNP Harry) (NNP Potter))\n    (VP (VBZ is)\n      (NP\n        (NP (DT a) (NN series))\n        (PP (IN of)\n          (NP\n            (NP (NN fantasy) (NNS novels))\n            (VP (VBN written)\n              (PP (IN by)\n                (NP\n                  (NP\n                    (NML (JJ British) (NN author))\n                    (NNP J.) (NNP K.) (NNP Rowling.The))\n                  (NP (NNS novels))\n                  (NP (NN chronicle))\n                  (NP\n                    (NP (DT the) (NN life))\n                    (PP (IN of)\n                      (NP\n                        (NP (DT a) (JJ young) (NN wizard))\n                        (, ,)\n                        (NP (NNP Harry) (NNP Potter))\n                        (, ,))))\n                  (CC and)\n                  (NP (PRP$ his) (NNS friends))\n                  (NP\n                    (NP (NNP Hermione) (NNP Granger))\n                    (CC and)\n                    (NP (NNP

In [None]:
first_sentence["basicDependencies"]

[{'dep': 'ROOT',
  'dependent': 5,
  'dependentGloss': 'series',
  'governor': 0,
  'governorGloss': 'ROOT'},
 {'dep': 'compound',
  'dependent': 1,
  'dependentGloss': 'Harry',
  'governor': 2,
  'governorGloss': 'Potter'},
 {'dep': 'nsubj',
  'dependent': 2,
  'dependentGloss': 'Potter',
  'governor': 5,
  'governorGloss': 'series'},
 {'dep': 'cop',
  'dependent': 3,
  'dependentGloss': 'is',
  'governor': 5,
  'governorGloss': 'series'},
 {'dep': 'det',
  'dependent': 4,
  'dependentGloss': 'a',
  'governor': 5,
  'governorGloss': 'series'},
 {'dep': 'case',
  'dependent': 6,
  'dependentGloss': 'of',
  'governor': 8,
  'governorGloss': 'novels'},
 {'dep': 'compound',
  'dependent': 7,
  'dependentGloss': 'fantasy',
  'governor': 8,
  'governorGloss': 'novels'},
 {'dep': 'nmod',
  'dependent': 8,
  'dependentGloss': 'novels',
  'governor': 5,
  'governorGloss': 'series'},
 {'dep': 'acl',
  'dependent': 9,
  'dependentGloss': 'written',
  'governor': 8,
  'governorGloss': 'novels'},
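
Dependency edges in this dictionary format are easier to read when printed as `relation(governor, dependent)` triples. The sketch below uses three edges copied from the output above, so it runs without a CoreNLP server; the helper `format_dependency` is a name made up for this example.

```python
# Three edges copied from the output above
basic_dependencies = [
    {"dep": "ROOT", "dependent": 5, "dependentGloss": "series",
     "governor": 0, "governorGloss": "ROOT"},
    {"dep": "nsubj", "dependent": 2, "dependentGloss": "Potter",
     "governor": 5, "governorGloss": "series"},
    {"dep": "cop", "dependent": 3, "dependentGloss": "is",
     "governor": 5, "governorGloss": "series"},
]

def format_dependency(edge):
    """Render one edge as relation(governor-index, dependent-index)."""
    return "{dep}({governorGloss}-{governor}, {dependentGloss}-{dependent})".format(**edge)

triples = [format_dependency(edge) for edge in basic_dependencies]
for triple in triples:
    print(triple)
```

With a live server, the same helper could be applied to every edge in `first_sentence["basicDependencies"]`.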


In [None]:
first_sentence["tokens"]

[{'after': ' ',
  'before': '',
  'characterOffsetBegin': 0,
  'characterOffsetEnd': 5,
  'index': 1,
  'lemma': 'Harry',
  'originalText': 'Harry',
  'pos': 'NNP',
  'word': 'Harry'},
 {'after': ' ',
  'before': ' ',
  'characterOffsetBegin': 6,
  'characterOffsetEnd': 12,
  'index': 2,
  'lemma': 'Potter',
  'originalText': 'Potter',
  'pos': 'NNP',
  'word': 'Potter'},
 {'after': ' ',
  'before': ' ',
  'characterOffsetBegin': 13,
  'characterOffsetEnd': 15,
  'index': 3,
  'lemma': 'be',
  'originalText': 'is',
  'pos': 'VBZ',
  'word': 'is'},
 {'after': ' ',
  'before': ' ',
  'characterOffsetBegin': 16,
  'characterOffsetEnd': 17,
  'index': 4,
  'lemma': 'a',
  'originalText': 'a',
  'pos': 'DT',
  'word': 'a'},
 {'after': ' ',
  'before': ' ',
  'characterOffsetBegin': 18,
  'characterOffsetEnd': 24,
  'index': 5,
  'lemma': 'series',
  'originalText': 'series',
  'pos': 'NN',
  'word': 'series'},
 {'after': ' ',
  'before': ' ',
  'characterOffsetBegin': 25,
  'characterOffset

In [None]:
for sent in doc["sentences"]:
    for token in sent["tokens"]:
        word = token["word"]
        lemma = token["lemma"]
        pos = token["pos"]
        print(word, lemma, pos)

Harry Harry NNP
Potter Potter NNP
is be VBZ
a a DT
series series NN
of of IN
fantasy fantasy NN
novels novel NNS
written write VBN
by by IN
British british JJ
author author NN
J. J. NNP
K. K. NNP
Rowling.The Rowling.The NNP
novels novel NNS
chronicle chronicle NN
the the DT
life life NN
of of IN
a a DT
young young JJ
wizard wizard NN
, , ,
Harry Harry NNP
Potter Potter NNP
, , ,
and and CC
his he PRP$
friends friend NNS
Hermione Hermione NNP
Granger Granger NNP
and and CC
Ron Ron NNP
Weasley Weasley NNP
, , ,
all all DT
of of IN
whom whom WP
are be VBP
students student NNS
at at IN
Hogwarts Hogwarts NNP
School School NNP
of of IN
Witchcraft Witchcraft NNP
and and CC
Wizardry.The Wizardry.The NNP
main main JJ
story story NN
arc arc NN
concerns concern NNS
Harry Harry NNP
's 's POS
struggle struggle NN
against against IN
Lord Lord NNP
Voldemort Voldemort NNP
, , ,
a a DT
dark dark JJ
wizard wizard NN
who who WP
intends intend VBZ
to to TO
become become VB
immortal immortal JJ
, , ,
overt
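
Because the output is plain nested dictionaries and lists, you can also aggregate over it instead of printing every token. The sketch below counts POS tags and collects the tokens whose lemma differs from the word form. `sample_doc` is a small hand-made stand-in that mirrors the structure shown above (only the first five tokens); it is not the full CoreNLP output.

```python
from collections import Counter

# Hand-made stand-in with the same nesting as the CoreNLP output above.
# Only the keys used in this example are included.
sample_doc = {
    "sentences": [
        {"tokens": [
            {"word": "Harry", "lemma": "Harry", "pos": "NNP"},
            {"word": "Potter", "lemma": "Potter", "pos": "NNP"},
            {"word": "is", "lemma": "be", "pos": "VBZ"},
            {"word": "a", "lemma": "a", "pos": "DT"},
            {"word": "series", "lemma": "series", "pos": "NN"},
        ]}
    ]
}

# Count how often each POS tag occurs across all sentences.
pos_counts = Counter(
    token["pos"]
    for sent in sample_doc["sentences"]
    for token in sent["tokens"]
)

# Collect (word, lemma) pairs where lemmatization changed the form.
changed = [
    (token["word"], token["lemma"])
    for sent in sample_doc["sentences"]
    for token in sent["tokens"]
    if token["word"] != token["lemma"]
]

print(pos_counts.most_common())
print(changed)
```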

In [None]:
# find out what the entity label 'NORP' means
spacy.explain("NORP")

'Nationalities or religious or political groups'

## 4. NLTK vs. spaCy vs. CoreNLP

You may have different reasons for choosing NLTK, spaCy or Stanford CoreNLP: the tools differ in efficiency, quality, user-friendliness, functionality, output formats, etc. At this moment, we advise you to go with spaCy because of its ease of use and high-quality output.

Here's an example of both NLTK and spaCy in action. 

* The example text is a case in point. What goes wrong here?
* Try experimenting with the text to see what the differences are.

In [None]:
import nltk
import spacy

nlp = spacy.load('en_core_web_sm')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

In [None]:
text = "I like cheese very much"

print("NLTK results:")
nltk_tagged = nltk.pos_tag(text.split())
print(nltk_tagged)

print()

print("spaCy results:")
doc = nlp(text)
spacy_tagged = []
for token in doc:
    tag_data = (token.orth_, token.tag_,)
    spacy_tagged.append(tag_data)
print(spacy_tagged)

NLTK results:
[('I', 'PRP'), ('like', 'VBP'), ('cheese', 'JJ'), ('very', 'RB'), ('much', 'JJ')]

spaCy results:
[('I', 'PRP'), ('like', 'VBP'), ('cheese', 'NN'), ('very', 'RB'), ('much', 'RB')]
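
To spot the differences quickly, you can pair up the two outputs and keep only the tokens on which the taggers disagree. The sketch below copies the two result lists from the output above as literals. It assumes that both tools produced the same tokens in the same order, which holds for this sentence but not in general (for example, when punctuation is attached to a word after `text.split()`).

```python
# Results copied from the output above.
nltk_tagged = [("I", "PRP"), ("like", "VBP"), ("cheese", "JJ"),
               ("very", "RB"), ("much", "JJ")]
spacy_tagged = [("I", "PRP"), ("like", "VBP"), ("cheese", "NN"),
                ("very", "RB"), ("much", "RB")]

# Keep (word, NLTK tag, spaCy tag) for every position where the tags differ.
# Assumes a one-to-one alignment between the two token lists.
disagreements = [
    (word, nltk_tag, spacy_tag)
    for (word, nltk_tag), (_, spacy_tag) in zip(nltk_tagged, spacy_tagged)
    if nltk_tag != spacy_tag
]

for word, nltk_tag, spacy_tag in disagreements:
    print(f"{word}: NLTK={nltk_tag}, spaCy={spacy_tag}")
```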


Do you want to learn more about the differences between NLTK, spaCy and CoreNLP? Here are some links:
- [Facts & Figures (spaCy)](https://spacy.io/usage/facts-figures)
- [About speed (CoreNLP vs. spaCy)](https://nlp.stanford.edu/software/tokenizer.html#Speed)
- [NLTK vs. spaCy: Natural Language Processing in Python](https://blog.thedataincubator.com/2016/04/nltk-vs-spacy-natural-language-processing-in-python/) 
- [What are the advantages of Spacy vs NLTK?](https://www.quora.com/What-are-the-advantages-of-Spacy-vs-NLTK) 
- [5 Heroic Python NLP Libraries](https://elitedatascience.com/python-nlp-libraries)


## 5. Some other useful modules for cleaning and preprocessing

Data is often messy, noisy or includes irrelevant information. Therefore, chances are high that you will need to do some cleaning before you can start your analysis. This is especially true for social media texts, such as tweets, chats, and emails. Typically, these texts are informal and notoriously noisy. Normalising them so that they can be processed with NLP tools is an NLP challenge in itself, and fully discussing it goes beyond the scope of this course. However, you may find the following modules useful in your project:

- [tweet-preprocessor](https://pypi.python.org/pypi/tweet-preprocessor/0.4.0): This library makes it easy to clean, parse or tokenize the tweets. It supports cleaning, tokenizing and parsing of URLs, hashtags, reserved words, mentions, emojis and smileys.
- [emot](https://pypi.python.org/pypi/emot/1.0): Emot is a Python library to extract emojis and emoticons from a text (string). All the emojis and emoticons are taken from a reliable source, i.e. Wikipedia.org.
- [autocorrect](https://pypi.python.org/pypi/autocorrect/0.1.0): Spelling corrector (Python 3).
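- [html](https://docs.python.org/3/library/html.html#module-html): Can be used to escape and unescape HTML entities (e.g. turning `&amp;` back into `&` with `html.unescape`); its submodule `html.parser` can help with stripping HTML tags.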
- [chardet](https://pypi.python.org/pypi/chardet): Universal encoding detector for Python 2 and 3.
- [ftfy](https://pypi.python.org/pypi/ftfy): Fixes broken unicode strings.
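
As a rough illustration of what such cleaning can look like, here is a minimal sketch that uses only the standard library (`html.unescape` and `re`). The function name `clean_text`, the regular expressions and the example string are assumptions made for this illustration; the dedicated libraries listed above handle many more cases.

```python
import html
import re


def clean_text(raw):
    """Rough normalisation of a social-media style string (illustrative only)."""
    text = html.unescape(raw)                 # decode entities such as &amp;
    text = re.sub(r"https?://\S+", "", text)  # drop URLs
    text = re.sub(r"@\w+", "", text)          # drop @mentions
    text = re.sub(r"\s+", " ", text)          # collapse runs of whitespace
    return text.strip()


# Made-up example input.
example = "@alice I &lt;3 cheese &amp; wine!   https://example.com/cheese"
cleaned = clean_text(example)
print(cleaned)
```

Whether you want to remove mentions and URLs or keep them as features depends on your research question, so treat each step as optional.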

If you are interested in reading more about these topics, these papers discuss preprocessing and normalization:

* [Assessing the Consequences of Text Preprocessing Decisions](http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2849145) (Denny & Spirling 2016). This paper is a bit long, but it provides a nice discussion of common preprocessing steps and their potential effects.
* [What to do about bad language on the internet](http://www.cc.gatech.edu/~jeisenst/papers/naacl2013-badlanguage.pdf) (Eisenstein 2013). This is a quick read that we recommend everyone to at least look through.

And [here](https://www.kaggle.com/rtatman/character-encodings-tips-tricks/) is a nice blog about character encoding.
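
To get a feeling for what goes wrong with character encodings, the small sketch below encodes a string as UTF-8 and then decodes the bytes with the wrong codec (Latin-1). The example word is made up. The repair step only works here because we know which wrong codec was used.

```python
original = "café"

# Encode the text as UTF-8 bytes; the "é" takes two bytes.
encoded = original.encode("utf-8")

# Decoding with the wrong codec does not fail, but yields garbled text.
garbled = encoded.decode("latin-1")

# Reversing the wrong step recovers the original string.
restored = garbled.encode("latin-1").decode("utf-8")

print(encoded)
print(garbled)
print(restored)
```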

## Exercises

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')

### Exercise 1:
1. What is the difference between token.pos\_ and token.tag\_? Read [the docs](https://spacy.io/api/annotation#pos-tagging) to find out.

2. What do the different labels mean? Use `spacy.explain` to inspect some of them. You can also refer to [this page](http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) for a complete overview. 

In [None]:
doc = nlp("I have an awesome cat. It's sitting on the mat that I bought yesterday.")
for token in doc:
    print(token.pos_, token.tag_)

PRON PRP
VERB VBP
DET DT
ADJ JJ
NOUN NN
PUNCT .
PRON PRP
AUX VBZ
VERB VBG
ADP IN
DET DT
NOUN NN
DET WDT
PRON PRP
VERB VBD
NOUN NN
PUNCT .


In [None]:
spacy.explain("PRON")

'pronoun'

### Exercise 2:

Let's practice a bit with processing files. Open the file `charlie.txt` for reading and use `read()` to read its content as a string. Then use spaCy to annotate this string and print the information below. Remember: you can use `dir()` to remind yourself of the attributes.

For each **token** in the text:
1. Text 
2. Lemma
3. POS tag
4. Whether it's a stopword or not
5. Whether it's a punctuation mark or not

For each **sentence** in the text:
1. The complete text
2. The number of tokens
3. The complete text in lowercase letters
4. The text, lemma and POS of the first word

For each **noun chunk** in the text:
1. The complete text
2. The number of tokens
3. The complete text in lowercase letters
4. The text, lemma and POS of the first word

For each **named entity** in the text:
1. The complete text
2. The number of tokens
3. The complete text in lowercase letters
4. The text, lemma and POS of the first word

In [None]:
filename = "../Data/Charlie/charlie.txt"

# read the file and process with spaCy

In [None]:
# print all information about the tokens

In [None]:
# print all information about the sentences

In [None]:
# print all information about the noun chunks

In [None]:
# print all information about the entities

### Exercise 3:

Remember how we can use the `os` and `glob` modules to process multiple files? For example, we can read all `.txt` files in the `Dreams` folder like this:

In [None]:
import glob
filenames = glob.glob("../Data/Dreams/*.txt")
print(filenames)

[]
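
An empty list means that the pattern did not match any files; relative paths are resolved from the current working directory, so check where your notebook is running. As a reminder of the general pattern, here is a self-contained sketch that creates a temporary folder with a few made-up files, finds the `.txt` files with `glob` and reads them. The file names and contents are invented for this illustration.

```python
import glob
import os
import tempfile

# Made-up files; only the two .txt files should be matched.
files = [
    ("a.txt", "I dreamt of cheese."),
    ("b.txt", "I dreamt of toys."),
    ("notes.md", "not a text file"),
]

with tempfile.TemporaryDirectory() as folder:
    for name, content in files:
        with open(os.path.join(folder, name), "w", encoding="utf-8") as outfile:
            outfile.write(content)

    # Build the pattern with os.path.join so it works on every platform.
    pattern = os.path.join(folder, "*.txt")
    found = sorted(glob.glob(pattern))

    # Read each matched file into a list of strings.
    texts = []
    for path in found:
        with open(path, encoding="utf-8") as infile:
            texts.append(infile.read())

basenames = [os.path.basename(path) for path in found]
print(basenames)
print(texts)
```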


Now create a function called `get_vocabulary` that takes one positional parameter `filenames`. It should read in all `filenames` and return a set called `unique_words` that contains all unique words in the files.

In [None]:
def get_vocabulary(filenames):
    # your code here

# test your function here
unique_words = get_vocabulary(filenames)
print(unique_words, len(unique_words))
assert len(unique_words) == 415 # if your code is correct, this should not raise an error

IndentationError: ignored

### Exercise 4:
Create a function called `get_sentences_with_keyword` that takes one positional parameter `filenames` and one keyword parameter `keyword` with default value `None`. It should read in all `filenames` and return a list called `sentences` that contains all sentences (the complete texts) with the keyword. 

Hints:
- It's best to check for the *lemmas* of each token
- Lowercase both your keyword and the lemma

In [None]:
import glob
filenames = glob.glob("../Data/Dreams/*.txt")
print(filenames)

[]


In [None]:
def get_sentences_with_keyword(filenames, keyword=None):
    # your code here

# test your function here
sentences = get_sentences_with_keyword(filenames, keyword="toy")
print(sentences)
assert len(sentences) == 4 # if your code is correct, this should not raise an error

IndentationError: ignored