# An Introduction to Natural Language in Python using spaCy
Based on a tutorial at https://colab.research.google.com/github/DerwenAI/spaCy_tuTorial/blob/master/spaCy_tuTorial.ipynb#scrollTo=nEzq2vNoUJz5

In this assignment we will create data from natural language text.  We'll use spaCy to identify people, organizations and places mentioned in the State Department cables that we used in an earlier assignment. Here are the steps:

1) Use a Colab notebook to download the cables from October 1973 and run spaCy to find the named mentions in the actual body of each message. To do this you will need to remove message headers and other formatted text that is also present in the msgtext field.

2) Then run coreferee to resolve nominal and pronominal references and create mention chains.

3) Then count the number of references to each unique entity two ways: (1) direct mentions of the named entity, and (2) any reference to a coreference chain that contains that entity.

Create an Excel spreadsheet in which you provide two sorted lists showing the most commonly referenced entities, in decreasing order of mentions or references, as counted in step 3.

Upload to ELMS your spreadsheet and a link to the Colab notebook that you used to assemble the raw data behind those statistics.
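
Step 1 asks for the message body without headers. As a minimal sketch, assuming (as the notebook code further down does) that the body starts after the line containing `SUBJ` and ends at the `NNN` marker, the extraction could look like this. The function name `extract_body` and the sample cable are invented for illustration, and a subject that wraps over several lines would still leak into the body.

```python
# Classification markings to drop; "UNCLASSIFIED" must come before "CLASSIFIED".
MARKINGS = ("UNCLASSIFIED", "LIMITED OFFICIAL USE", "CLASSIFIED")

def extract_body(msgtext):
    """Return the text between the SUBJ line and the NNN marker, or None."""
    for marking in MARKINGS:
        msgtext = msgtext.replace(marking, "")
    subj = msgtext.find("SUBJ")
    if subj == -1:
        return None
    end = msgtext.find("NNN", subj)
    line_end = msgtext.find("\n", subj)  # end of the subject line
    if end == -1 or line_end == -1 or line_end > end:
        return None
    return msgtext[line_end:end].strip()

# Invented sample, shaped loosely like a cable.
sample = (
    "UNCLASSIFIED\n"
    "PAGE 01  STATE 000000\n"
    "TO AMEMBASSY MADRID\n"
    "SUBJ: VISIT OF TRADE DELEGATION\n"
    "1. THE DELEGATION ARRIVES MONDAY.\n"
    "2. PLEASE CONFIRM SCHEDULE.\n"
    "NNN\n"
)
body = extract_body(sample)
print(body)
```

Returning `None` for messages that lack either marker lets the caller skip them explicitly instead of parsing an empty string.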

In [None]:
!pip install coreferee
!python -m coreferee install en

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting coreferee
  Downloading coreferee-1.3.0-py3-none-any.whl (182 kB)
Installing collected packages: coreferee
Successfully installed coreferee-1.3.0
2022-10-11 22:28:31.210316: E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NO_DEVICE: no CUDA-capable device is detected
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting https://github.com/explosion/coreferee/raw/master/models/coreferee_model_en.zip
  Downloading https://github.com/explosion/coreferee/raw/master/models/coreferee_model_en.zip (65.4 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done

In [None]:
import spacy

nlp = spacy.load("en_core_web_sm")

import coreferee
nlp.add_pipe('coreferee')

<coreferee.manager.CorefereeBroker at 0x7fc9841ffdd0>

In [None]:
!wget https://users.umiacs.umd.edu/~oard/cables.zip
print('starting unzip')
!unzip -u -q cables.zip
print('unzip complete, files stored in cables/')

--2022-10-11 22:30:06--  https://users.umiacs.umd.edu/~oard/cables.zip
Resolving users.umiacs.umd.edu (users.umiacs.umd.edu)... 128.8.120.33
Connecting to users.umiacs.umd.edu (users.umiacs.umd.edu)|128.8.120.33|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 509664391 (486M) [application/zip]
Saving to: ‘cables.zip’


2022-10-11 22:30:25 (27.9 MB/s) - ‘cables.zip’ saved [509664391/509664391]

starting unzip
unzip complete, files stored in cables/


This next chunk takes ~32 minutes to run. 

In [None]:
import xml.etree.ElementTree as ET

# Classification markings that appear inside msgtext but are not message content.
MARKINGS = ("UNCLASSIFIED", "LIMITED OFFICIAL USE", "CLASSIFIED")

tree = ET.parse('cables/CFPF.TEL.OCT73.PU')
root = tree.getroot()
chaindict = {}    # word -> mentions summed over every chain that contains the word
directcount = {}  # word -> number of times the word itself appears in a chain

for sasdoc in root.iter('sasdoc'):
  for message in sasdoc.iter('msgtext'):
    text = message.text or ""
    # Skip messages without a subject line or an end-of-message marker.
    if "SUBJ" not in text or "NNN" not in text: continue
    usable = text
    for marking in MARKINGS:
      usable = usable.replace(marking, "")
    # The body runs from the end of the SUBJ line to the NNN marker.
    start = usable.index("SUBJ") + 4
    newline = usable.find("\n", start)
    if newline != -1: start = newline
    end = usable.index("NNN")
    body = usable[start:end]
    doc = nlp(body)
    for chain in doc._.coref_chains:
      found = set()  # words already credited with this chain
      for link in chain:
        for index in link:
          word = doc[index]
          if word.is_stop: continue
          subject = str(word)
          directcount[subject] = directcount.get(subject, 0) + 1
          # Credit the chain length only once per word per chain.
          if subject not in found:
            chaindict[subject] = chaindict.get(subject, 0) + len(chain)
            found.add(subject)
print(chaindict)
print(directcount)
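import csv

# csv.writer converts the integer counts to text and quotes words containing commas.
with open('output.csv', 'w', newline='') as f:
  writer = csv.writer(f)
  writer.writerow(['word', 'directReferences', 'directAndIndirectReferences'])
  for key in chaindict:
    writer.writerow([key, directcount[key], chaindict[key]])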
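
The assignment asks for two lists sorted by decreasing count. Here is a small sketch of that ranking step, using made-up counts in the same shape as `directcount` and `chaindict`. The names `ranked`, `sample_direct` and `sample_chain` are hypothetical, and the numbers are not results from the cables.

```python
def ranked(counts):
    """Return (word, count) pairs, highest count first; ties sort alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Invented counts, for illustration only.
sample_direct = {"EMBASSY": 12, "SECRETARY": 30, "ISRAEL": 21}
sample_chain = {"EMBASSY": 40, "SECRETARY": 95, "ISRAEL": 33}

for word, count in ranked(sample_direct):
    print(word, count)
for word, count in ranked(sample_chain):
    print(word, count)
```

Note that the two rankings can disagree: a word that is rarely named directly may still sit in long chains of pronouns.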
    




The `nlp` variable created at the top of the notebook is your gateway to all things _spaCy_; it is loaded with `en_core_web_sm`, the small model for English.
Next, let's run a small "document" through the natural language parser:

In [None]:
text = "The rain in Spain falls mainly on the plain."
doc = nlp(text)

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.is_stop)

The the DET True
rain rain NOUN False
in in ADP True
Spain Spain PROPN False
falls fall VERB False
mainly mainly ADV False
on on ADP True
the the DET True
plain plain NOUN False
. . PUNCT False


First we created a [doc](https://spacy.io/api/doc) from the text, which is a container for a document and all of its annotations. Then we iterated through the document to see what _spaCy_ had parsed.

Good, but it's a lot of info and a bit difficult to read. Let's reformat the _spaCy_ parse of that sentence as a [pandas](https://pandas.pydata.org/) dataframe:

In [None]:
import pandas as pd

cols = ("text", "lemma", "POS", "explain", "stopword")
rows = []

for t in doc:
    row = [t.text, t.lemma_, t.pos_, spacy.explain(t.pos_), t.is_stop]
    rows.append(row)

df = pd.DataFrame(rows, columns=cols)
    
df

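|   | text | lemma | POS | explain | stopword |
|---|------|-------|-----|---------|----------|
| 0 | The | the | DET | determiner | True |
| 1 | rain | rain | NOUN | noun | False |
| 2 | in | in | ADP | adposition | True |
| 3 | Spain | Spain | PROPN | proper noun | False |
| 4 | falls | fall | VERB | verb | False |
| 5 | mainly | mainly | ADV | adverb | False |
| 6 | on | on | ADP | adposition | True |
| 7 | the | the | DET | determiner | True |
| 8 | plain | plain | NOUN | noun | False |
| 9 | . | . | PUNCT | punctuation | False |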


Much more readable!
In this simple case, the entire document is merely one short sentence.
For each word in that sentence _spaCy_ has created a [token](https://spacy.io/api/token), and we accessed fields in each token to show:

 - raw text
 - [lemma](https://en.wikipedia.org/wiki/Lemma_(morphology)) – a root form of the word
 - [part of speech](https://en.wikipedia.org/wiki/Part_of_speech)
 - a flag for whether the word is a _stopword_ – i.e., a common word that may be filtered out

A document with several sentences requires some kind of sentence boundary detection, which _spaCy_ exposes through `doc.sents`:

In [None]:
text = "We were all out at the zoo one day, I was doing some acting, walking on the railing of the gorilla exhibit. I fell in. Everyone screamed and Tommy jumped in after me, forgetting that he had blueberries in his front pocket. The gorillas just went wild."

doc = nlp(text)

for sent in doc.sents:
    print(">", sent)

> We were all out at the zoo one day, I was doing some acting, walking on the railing of the gorilla exhibit.
> I fell in.
> Everyone screamed and Tommy jumped in after me, forgetting that he had blueberries in his front pocket.
> The gorillas just went wild.
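
To see why sentence boundary detection needs more than punctuation, here is a deliberately naive splitter for contrast. It is a sketch only and not how _spaCy_ works: with this pipeline, sentence boundaries come from the dependency parse. The function name `naive_sentences` is made up for this example.

```python
import re

def naive_sentences(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

# Works on simple text...
print(naive_sentences("I fell in. The gorillas just went wild."))

# ...but an abbreviation is wrongly treated as the end of a sentence.
print(naive_sentences("Dr. Smith arrived. He sat down."))
```

The second call splits after "Dr.", which is the kind of mistake a statistical model is meant to avoid.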


When _spaCy_ creates a document, it uses a principle of _non-destructive tokenization_, meaning that the tokens, sentences, etc., are simply indexes into a long array. In other words, _spaCy_ doesn't carve the text stream into little pieces. So each sentence is a [span](https://spacy.io/api/span) with a _start_ and an _end_ index into the document array:

In [None]:
for sent in doc.sents:
    print(">", sent.start, sent.end)

> 0 25
> 25 29
> 29 48
> 48 54


We can index into the document array to pull out the tokens for one sentence:

In [None]:
doc[48:54]

The gorillas just went wild.

Or simply index into a specific token, such as the verb `went` in the last sentence:

In [None]:
token = doc[51]
print(token.text, token.lemma_, token.pos_)

went go VERB
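
The idea behind non-destructive tokenization can be sketched in plain Python: keep the original string untouched and store only offsets into it. This is an illustration of the principle, not _spaCy_'s implementation; the regular expression and the helper `span_text` are assumptions made for the example.

```python
import re

text = "I fell in. The gorillas just went wild."

# Each token is just a (start, end) pair of character offsets into `text`.
tokens = [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]

def span_text(text, tokens, start, end):
    """Recover the text covered by tokens[start:end] by slicing the original."""
    return text[tokens[start][0]:tokens[end - 1][1]]

# Sentences are (start, end) pairs of token indexes, like sent.start and sent.end.
sentences = [(0, 4), (4, 10)]
for start, end in sentences:
    print(">", start, end, span_text(text, tokens, start, end))
```

Because nothing is copied or cut, any span can always be mapped back to its exact position in the source text.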


At this point we can parse a document, segment that document into sentences, then look at annotations about the tokens in each sentence. That's a good start.

## Acquiring Text

Now that we can parse texts, where do we get texts?
One quick source is to leverage the interwebs.
Of course when we download web pages we'll get HTML, and then need to extract text from them.
[Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) is a popular package for that.

First, a little housekeeping:

In [None]:
import sys
import warnings

warnings.filterwarnings("ignore")

### Character Encoding

The following shows examples of how to use [codecs](https://docs.python.org/3/library/codecs.html) and [normalize unicode](https://docs.python.org/3/library/unicodedata.html#unicodedata.normalize). NB: the example text comes from the article "[Metal umlaut](https://en.wikipedia.org/wiki/Metal_umlaut)".

In [None]:
x = "Rinôçérôse screams ﬂow not unlike an encyclopædia, \
'TECHNICIÄNS ÖF SPÅCE SHIP EÅRTH THIS IS YÖÜR CÄPTÅIN SPEÄKING YÖÜR ØÅPTÅIN IS DEA̋D' to Spın̈al Tap."

type(x)

str

In [None]:
import unicodedata

unicodedata.normalize('NFKD', x).encode('ascii','ignore')

b"Rinocerose screams flow not unlike an encyclopdia, 'TECHNICIANS OF SPACE SHIP EARTH THIS IS YOUR CAPTAIN SPEAKING YOUR APTAIN IS DEAD' to Spnal Tap."
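
Notice that the output above lost some letters entirely ("encyclopdia", "APTAIN"). A short sketch of why: NFKD splits a character into a base letter plus combining marks only when Unicode defines a decomposition for it, and `encode('ascii', 'ignore')` silently drops whatever is still non-ASCII. The helper name `to_ascii` is made up for this example.

```python
import unicodedata

def to_ascii(s):
    """NFKD-normalize, then drop every character that is still not ASCII."""
    return unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")

for ch in "ôﬂæØ":
    # An empty decomposition string means the character cannot be split up.
    decomposition = unicodedata.decomposition(ch) or "(none)"
    print(repr(ch), decomposition, "->", repr(to_ascii(ch)))
```

Characters such as "æ" and "Ø" have no decomposition, so they vanish; if they matter for your text, replace them explicitly before normalizing.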

Even before this normalization and encoding, you may need to convert some characters explicitly **before** parsing. For example:

In [None]:
x = "The sky “above” the port … was the color of ‘cable television’ – tuned to the Weather Channel®"

ascii(x)

"'The sky \\u201cabove\\u201d the port \\u2026 was the color of \\u2018cable television\\u2019 \\u2013 tuned to the Weather Channel\\xae'"

In [None]:
x = x.replace('“', '"').replace('”', '"')
x = x.replace("‘", "'").replace("’", "'")
x = x.replace('…', '...').replace('–', '-')

x = unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode('utf-8')
print(x)

The sky "above" the port ... was the color of 'cable television' - tuned to the Weather Channel
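
When there are many characters to replace, a translation table avoids the long chain of `replace()` calls. This is a sketch of the same cleanup using `str.maketrans`; the names `PUNCT` and `clean` are hypothetical.

```python
import unicodedata

# Map typographic punctuation to plain ASCII equivalents.
PUNCT = str.maketrans({
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u2026": "...",                # horizontal ellipsis
    "\u2013": "-",                  # en dash
})

def clean(s):
    """Replace typographic punctuation, then drop any remaining non-ASCII."""
    s = s.translate(PUNCT)
    return unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode("ascii")

sample = ("The sky \u201cabove\u201d the port \u2026 was the color of "
          "\u2018cable television\u2019 \u2013 tuned to the Weather Channel\u00ae")
print(clean(sample))
```

The table is applied in a single pass over the string, and adding another character means adding one dictionary entry.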


### Parsing HTML

In the following function `get_text()` we'll parse the HTML to find all of the `<p>` tags, then extract the text for those:

In [None]:
from bs4 import BeautifulSoup
import requests
import traceback

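def get_text(url):
    """Fetch a web page and return the text of its <p> elements, one per line."""
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()  # treat HTTP errors (404, 500, ...) as failures
        soup = BeautifulSoup(response.text, "html.parser")

        return "\n".join(p.get_text() for p in soup.find_all("p"))
    except Exception:
        # Print the traceback and stop, since the following cells need this text.
        print(traceback.format_exc())
        sys.exit(-1)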
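
To try the paragraph-extraction idea without a network connection, here is a sketch that does the same job on an inline HTML string using only the standard library's `html.parser`. It is a stand-in for illustration, not a replacement for Beautiful Soup; the class `ParagraphText`, the helper `paragraphs_text` and the sample markup are all made up.

```python
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Collect the text found inside <p> elements, ignoring everything else."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_p = False

    def handle_data(self, data):
        # Text inside nested tags such as <b> still arrives here while in_p is set.
        if self.in_p:
            self.paragraphs[-1] += data

def paragraphs_text(html):
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.paragraphs)

html_doc = ("<html><body><h1>Title</h1><p>First <b>bold</b> one.</p>"
            "<div>skip</div><p>Second.</p></body></html>")
print(paragraphs_text(html_doc))
```

Headings and `<div>` content are skipped, which is the same trade-off `get_text()` makes: simple, but it misses any text that is not inside a paragraph.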

Now let's grab some text from online sources.
We can compare open source licenses hosted on the [Open Source Initiative](https://opensource.org/licenses/) site:

In [None]:
lic = {}
lic["mit"] = nlp(get_text("https://opensource.org/licenses/MIT"))
lic["asl"] = nlp(get_text("https://opensource.org/licenses/Apache-2.0"))
lic["bsd"] = nlp(get_text("https://opensource.org/licenses/BSD-3-Clause"))

for sent in lic["bsd"].sents:
    print(">", sent)

> SPDX short identifier: BSD-3-Clause
 

Note: This license has also been called the "New BSD License" or "Modified BSD License".
> See also the 2-clause BSD License.

> Copyright <YEAR> <COPYRIGHT HOLDER>
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

> 1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

> 2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

> 3.
> Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

> THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIE

One common use case for natural language work is to compare texts. For example, with those open source licenses we can download their text, parse, then compare [similarity](https://spacy.io/api/doc#similarity) metrics among them:

In [None]:
pairs = [
    ["mit", "asl"],
    ["asl", "bsd"],
    ["bsd", "mit"]
]

for a, b in pairs:
    print(a, b, lic[a].similarity(lic[b]))

mit asl 0.9085436503462764
asl bsd 0.8981709726047463
bsd mit 0.9801954262145988


That is interesting, since the [BSD](https://opensource.org/licenses/BSD-3-Clause) and [MIT](https://opensource.org/licenses/MIT) licenses appear to be the most similar documents.
In fact they are closely related.

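Admittedly, each document includes some extra text from the OSI disclaimer in the page footer, but this still provides a reasonable approximation for comparing the licenses. Keep in mind that `en_core_web_sm` does not ship with word vectors, so these similarity scores are less reliable than those from a model that includes vectors, such as `en_core_web_md`.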
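
The _spaCy_ documentation describes the default similarity estimate as cosine similarity over an average of word vectors. Here is a minimal sketch of that calculation with tiny invented vectors; real vectors have hundreds of dimensions, and the names `cosine` and `mean_vector` are assumptions for this example.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors, component by component."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Invented two-dimensional "word vectors" for two short documents.
doc_a = mean_vector([[1.0, 0.0], [0.0, 1.0]])
doc_b = mean_vector([[1.0, 0.0], [1.0, 0.0]])
print(cosine(doc_a, doc_b))
```

Averaging discards word order, which is one reason long documents on similar topics tend to score close to one another.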

## Natural Language Understanding

Now let's dive into some of the _spaCy_ features for NLU.
Given that we have a parse of a document, from a purely grammatical standpoint we can pull the [noun chunks](https://spacy.io/usage/linguistic-features#noun-chunks), i.e., each of the noun phrases:

In [None]:
text = "Steve Jobs and Steve Wozniak incorporated Apple Computer on January 3, 1977, in Cupertino, California."
doc = nlp(text)

for chunk in doc.noun_chunks:
    print(chunk.text)

Steve Jobs
Steve Wozniak
Apple Computer
January
Cupertino
California


Not bad. The noun phrases in a sentence generally carry most of its information content, so extracting them works as a simple filter that reduces a long document to a more "distilled" representation.

We can take this approach further and identify [named entities](https://spacy.io/usage/linguistic-features#named-entities) within the text, i.e., the names of specific people, organizations, places, dates, and so on:

In [None]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Steve Jobs PERSON
Steve Wozniak PERSON
Apple Computer ORG
January 3, 1977 DATE
Cupertino GPE
California GPE


The _displaCy_ library provides an excellent way to visualize named entities:

In [None]:
from spacy import displacy

displacy.render(doc, style="ent", jupyter=True)

Now let's add co-reference resolution.  This is essentially clustering, in which we seek to associate instances of pronouns (e.g., he or she) and nominal references (e.g., country) with the named entities to which they refer.

In [None]:
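import coreferee, spacy

# coreferee was already added to the pipeline at the top of the notebook;
# adding the same component a second time raises an error.
if 'coreferee' not in nlp.pipe_names:
  nlp.add_pipe('coreferee')
doc = nlp('Although he was very busy with his work, Peter had had enough of it. He and his wife decided they needed a holiday. They travelled to Spain because they loved the country very much.')
doc._.coref_chains.print()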

0: he(1), his(6), Peter(9), He(16), his(18)
1: work(7), it(14)
2: [He(16); wife(19)], they(21), They(26), they(31)
3: Spain(29), country(34)
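
These chains are what step 3 of the assignment counts. As a sketch of the two counting methods, the chains printed above are copied by hand into plain lists below (one inner list per mention), and the entity mentions are assumed to come from `doc.ents`. The function `count_references` and this data layout are illustrative assumptions, not part of coreferee's API.

```python
from collections import Counter

# The four chains printed above; each mention is a list of its tokens.
chains = [
    [["he"], ["his"], ["Peter"], ["He"], ["his"]],
    [["work"], ["it"]],
    [["He", "wife"], ["they"], ["They"], ["they"]],
    [["Spain"], ["country"]],
]

# Assumed named-entity mentions for the same text (one string per mention).
entity_mentions = ["Peter", "Spain"]

def count_references(entity_mentions, chains):
    """Return (direct, via_chain) counters keyed by entity name."""
    direct = Counter(entity_mentions)
    via_chain = Counter()
    for chain in chains:
        tokens = {token for mention in chain for token in mention}
        for name in direct:
            if name in tokens:
                # Every mention in the chain is a reference to this entity.
                via_chain[name] += len(chain)
    return direct, via_chain

direct, via_chain = count_references(entity_mentions, chains)
print(direct)
print(via_chain)
```

Here "Peter" is named once but referred to five times through its chain, which is exactly the gap between the two lists the assignment asks for.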
