Basic NLP with re, NLTK and spaCy
===

This is an easy notebook for [Hack the Burgh V](https://2019.hacktheburgh.com/) that demonstrates some basic NLP, including sentence splitting, tokenization, and named entity recognition.

If you have any questions or feedback, feel free to ping me on Twitter @mstdan or find me at https://danduma.github.io

Installation instructions 
---

These instructions are for Linux and macOS computers. For anything else, please google :)

> pip install nltk spacy

Download the spaCy model:
> python -m spacy download en_core_web_lg

You may also need some NLTK resources (the sentence tokenizer and the POS tagger used below rely on downloadable data). If so, run this in Python:

> import nltk
> nltk.download()

A nice interface will appear to guide you through the download.


In [2]:
# Let's import the basics
import re
import nltk
%matplotlib inline

In [3]:
# And now spaCy. This will take a while.
import spacy

nlp = spacy.load("en_core_web_lg") # The large model is slower to load, but it is more accurate than the small one (en_core_web_sm)

And now let's grab a few sentences that make sense together. Why not grab them from George Clooney's Wikipedia page?

In [4]:
text = """George Timothy Clooney (born May 6, 1961) is an American actor, filmmaker and businessman. He is the 
recipient of three Golden Globe Awards and two Academy Awards, one for acting in Syriana (2006) and the other 
for co-producing Argo (2012). In 2018, he was the recipient of the AFI Lifetime Achievement Award, at the age of 57.""".replace("\n", "")

Maybe we don't care about his date of birth or the release years of his movies. Let's get rid of all the dates with a regular expression (regex). This is very old-style text processing, but it's surprisingly effective for text normalisation, and it is still widely used in industrial NLP pipelines.

Python's re module handles regex. We'll use re.sub(), which finds every match of a pattern and replaces it with something else. It's like CRISPR for text.

Learn more about regex [here](https://s3.amazonaws.com/assets.datacamp.com/production/course_5064/slides/chapter1.pdf).
If you need to write regexes, use https://regex101.com/ to test them.
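
To make re.sub() a bit more concrete before the big pattern below, here is a minimal sketch on a made-up sentence (the `sample` string is just an illustration, not part of the notebook's data). It shows a plain deletion and a replacement that reuses a capture group.

```python
import re

sample = "Syriana (2006) and Argo (2012) both won awards."

# \( and \) match literal parentheses; \d{4} matches exactly four digits.
# Replacing with "" deletes every match.
no_years = re.sub(r"\(\d{4}\)", "", sample)
print(no_years)

# A capture group lets the replacement reuse part of the match:
# here we keep the year (\1) but drop the parentheses around it.
bare_years = re.sub(r"\((\d{4})\)", r"\1", sample)
print(bare_years)
```

Note that the first call leaves double spaces behind where the years used to be, which is why a whitespace clean-up pass usually follows.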

In [5]:
regex=r"\((\w+\s+)?(Jan(uary)?|Feb(ruary)?|Mar(ch)?|Apr(il)?|May|Jun(e)?|Jul(y)?|Aug(ust)?|Sep(tember)?|Oct(ober)?|Nov(ember)?|Dec(ember)?)?(\s+\d{1,2},\s+)?\d{4}\)"

text = re.sub(regex, "", text) # let's get rid of those pesky dates
text = re.sub(r"\s+", " ", text) # and let's make sure we don't leave extra spaces behind (raw string, so \s is not treated as a string escape)
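
Long patterns like the one above are hard to read. As a sketch, the same idea can be spelled out with re.VERBOSE, which lets us add whitespace and comments inside the pattern. This is a simplified restatement, not an exact copy: month names are matched as a three-letter prefix plus any lowercase letters, and the helper name `strip_dates` is made up for this example. It also removes the stray space that deletion leaves before punctuation.

```python
import re

# Verbose restatement of the date pattern (simplified: a three-letter
# month prefix followed by any lowercase letters).
DATE_IN_PARENS = re.compile(
    r"""
    \(                       # opening parenthesis
    (?:\w+\s+)?              # optional leading word, e.g. 'born '
    (?:(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*)?  # optional month
    (?:\s+\d{1,2},\s+)?      # optional day, e.g. ' 6, '
    \d{4}                    # four-digit year
    \)                       # closing parenthesis
    """,
    re.VERBOSE,
)

def strip_dates(text):
    """Remove parenthesised dates, then tidy up the whitespace."""
    text = DATE_IN_PARENS.sub("", text)
    text = re.sub(r"\s+", " ", text)               # collapse whitespace runs
    return re.sub(r"\s+([.,;:!?])", r"\1", text)   # no space before punctuation

sample = "George Clooney (born May 6, 1961) co-produced Argo (2012)."
print(strip_dates(sample))
```

The last substitution is optional, but without it a sentence ending in a removed date keeps a space before its full stop.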



In [6]:
# Let's first split the sentences.
sentences=nltk.sent_tokenize(text)
sentences

['George Timothy Clooney is an American actor, filmmaker and businessman.',
 'He is the recipient of three Golden Globe Awards and two Academy Awards, one for acting in Syriana and the other for co-producing Argo .',
 'In 2018, he was the recipient of the AFI Lifetime Achievement Award, at the age of 57.']
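
Why use a trained sentence splitter at all? As a rough comparison, here is a naive regex-only splitter (the function name `naive_sent_split` is made up for this sketch). It works on simple text but trips over abbreviations, which is the kind of case a tokenizer like NLTK's is designed to handle better.

```python
import re

def naive_sent_split(text):
    # Split on whitespace that follows ., ! or ?
    # (the lookbehind keeps the punctuation attached to the sentence).
    return re.split(r"(?<=[.!?])\s+", text.strip())

# Fine on simple text...
simple = naive_sent_split("He won two awards. In 2018, he won another.")
print(simple)

# ...but an abbreviation is wrongly treated as a sentence boundary.
tricky = naive_sent_split("Dr. Ross was played by George Clooney.")
print(tricky)
```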

In [14]:
# We could use NLTK's tokenizer, which is fast and simple.
tokens = nltk.word_tokenize(sentences[0])
tokens

['George',
 'Timothy',
 'Clooney',
 'is',
 'an',
 'American',
 'actor',
 ',',
 'filmmaker',
 'and',
 'businessman',
 '.']

In [16]:
# We could now get the POS tags
pos_tags = nltk.pos_tag(tokens)
pos_tags

[('George', 'NNP'),
 ('Timothy', 'NNP'),
 ('Clooney', 'NNP'),
 ('is', 'VBZ'),
 ('an', 'DT'),
 ('American', 'JJ'),
 ('actor', 'NN'),
 (',', ','),
 ('filmmaker', 'NN'),
 ('and', 'CC'),
 ('businessman', 'NN'),
 ('.', '.')]
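
NLTK uses Penn Treebank tags here: NN and NNP are common and proper nouns, VBZ is a third-person singular present verb, DT a determiner, JJ an adjective and CC a coordinating conjunction. A small sketch of what we can do with them, using the output above copied in by hand so it runs without NLTK:

```python
# Output of nltk.pos_tag() copied from the cell above.
pos_tags = [
    ("George", "NNP"), ("Timothy", "NNP"), ("Clooney", "NNP"),
    ("is", "VBZ"), ("an", "DT"), ("American", "JJ"), ("actor", "NN"),
    (",", ","), ("filmmaker", "NN"), ("and", "CC"),
    ("businessman", "NN"), (".", "."),
]

# All noun tags (NN, NNS, NNP, NNPS) start with "NN".
nouns = [word for word, tag in pos_tags if tag.startswith("NN")]
print(nouns)

# Proper nouns only.
proper_nouns = [word for word, tag in pos_tags if tag in ("NNP", "NNPS")]
print(proper_nouns)
```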

But let's do this the modern way and use spaCy instead:

In [17]:
# Let's look at the first sentence
s = nlp(sentences[0])
s

George Timothy Clooney is an American actor, filmmaker and businessman.

In [18]:
# Maybe we want to iterate over tokens. We do it like so:
for token in s:
    print(token)

George
Timothy
Clooney
is
an
American
actor
,
filmmaker
and
businessman
.


In [19]:
# Let's see what named entities we have in here.
s.ents

(George Timothy Clooney, American)

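Oops. "American" is not the kind of named entity we are after (spaCy most likely labelled it NORP, its category for nationalities and religious or political groups). spaCy, like any statistical model, is not perfectly accurate and will not always give us what we want.

We could filter the named entities, keeping only those that contain at least one noun or proper-noun token.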

In [10]:
actual_entities=[]

for ent in s.ents:
    for token in ent:
        if token.pos_ in ["PROPN", "NOUN"]:
            actual_entities.append(ent)
            break

print(actual_entities)

[George Timothy Clooney]
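
Another option, sketched here, is to filter on the entity label (`ent.label_`) rather than on part-of-speech tags. The helper name `filter_by_label` and the default list of wanted labels are choices made for this example, and the stand-in objects below only mimic spaCy entities (with assumed labels) so that the snippet runs without loading a model.

```python
from types import SimpleNamespace

def filter_by_label(ents, wanted=("PERSON", "ORG", "GPE", "WORK_OF_ART")):
    """Keep only entities whose label_ is one of the wanted labels."""
    return [ent for ent in ents if ent.label_ in wanted]

# Stand-ins for spaCy entities; the labels are assumed for illustration.
fake_ents = [
    SimpleNamespace(text="George Timothy Clooney", label_="PERSON"),
    SimpleNamespace(text="American", label_="NORP"),
]

kept = [ent.text for ent in filter_by_label(fake_ents)]
print(kept)
```

With a real document you would call `filter_by_label(s.ents)` instead. Printing `[(ent.text, ent.label_) for ent in s.ents]` first is a good way to see which labels the model actually assigned.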


In [11]:
# Time to see what part-of-speech tags we got
pos_tags = [(t.text,t.pos_) for t in s]
pos_tags

[('George', 'PROPN'),
 ('Timothy', 'PROPN'),
 ('Clooney', 'PROPN'),
 ('is', 'VERB'),
 ('an', 'DET'),
 ('American', 'ADJ'),
 ('actor', 'NOUN'),
 (',', 'PUNCT'),
 ('filmmaker', 'NOUN'),
 ('and', 'CCONJ'),
 ('businessman', 'NOUN'),
 ('.', 'PUNCT')]

In [13]:
# And what about noun phrases? That is, groups of words with a noun as a head?
list(s.noun_chunks)

[George Timothy Clooney, an American actor, filmmaker, businessman]

In [83]:
# As we know, spaCy comes with pre-trained word vectors
print([t.vector for t in s])

[array([-2.8908e-01,  2.8379e-01,  3.1507e-01, -3.3824e-01,  5.4726e-01,
        4.5746e-02, -6.7819e-02, -3.5722e-01,  2.2688e-01,  9.4739e-01,
       -3.9282e-01, -4.8832e-01,  2.9361e-01, -1.0110e-01, -1.3484e-01,
       -2.9496e-01, -1.6695e-01,  6.1507e-01, -2.5881e-01,  4.3098e-01,
        3.0641e-01,  9.8365e-02,  2.7190e-01, -1.3004e-01,  5.1339e-01,
       -2.0091e-01, -6.5646e-01,  4.4247e-01,  2.2802e-01,  5.2060e-01,
       -1.6910e-01,  7.0353e-01, -5.4488e-02,  8.4449e-02,  4.0558e-04,
        3.7785e-01, -6.2604e-02,  2.8059e-01, -4.4648e-01,  1.5252e-01,
       -6.8788e-02,  2.7290e-01, -3.4336e-01,  1.0463e-01, -6.2990e-02,
       -9.7598e-03,  2.5322e-01, -2.7117e-01,  5.9682e-01,  2.6988e-01,
        2.3757e-02,  2.0391e-01,  2.1597e-01, -2.5550e-01, -1.5896e-01,
       -2.4954e-01, -4.6604e-03,  6.0487e-02,  4.3354e-02, -1.3683e-01,
       -2.2944e-01,  2.8641e-01,  2.2141e-01, -1.4532e-01, -3.9184e-02,
       -1.5529e-01, -3.4816e-01,  4.7308e-01, -5.4173e-02, -7.2

These pre-trained vectors can come in handy if we are building, say, [a sentiment analysis](https://github.com/explosion/spaCy/blob/master/examples/deep_learning_keras.py) application :)
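
The usual way to compare two word vectors is cosine similarity. Here is a minimal pure-Python sketch with tiny made-up vectors (real spaCy vectors are much longer, and the numbers below are invented for illustration).

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

# Toy vectors, invented for this example.
actor = [0.9, 0.1, 0.3]
filmmaker = [0.8, 0.2, 0.4]
comma = [-0.1, 0.9, -0.5]

print(cosine_similarity(actor, filmmaker))  # close to 1: similar direction
print(cosine_similarity(actor, comma))      # negative: very different direction
```

The same function works on `token.vector` values, since NumPy arrays can be iterated like lists. spaCy also offers `token.similarity(other_token)`, which saves you from writing this yourself.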

In [91]:
# If we want a tree representation of the sentence, we can start a visualisation server.
# Note: displacy.serve() blocks the cell until the server is interrupted;
# displacy.render(s, style='dep', jupyter=True) draws the tree inline instead.
from spacy import displacy
displacy.serve(s, style='dep', options={"compact": True})

    Serving on port 5000...
    Using the 'dep' visualizer



127.0.0.1 - - [15/Mar/2019 23:29:01] "GET / HTTP/1.1" 200 7590
127.0.0.1 - - [15/Mar/2019 23:29:01] "GET /favicon.ico HTTP/1.1" 200 7590



    Shutting down server on port 5000.

