## spaCy 101: doing more and adding pipeline modules

[spaCy](https://spacy.io/) is an open-source, multilingual NLP library. Its components are not state of the art, but they are robust, easy to use, and fast.

This notebook shows additional things you can do with it, including adding pipeline modules for sentiment analysis and coreference resolution.

You may need to do the following:
 * pip install spacy
 * python -m spacy download en_core_web_md

In [1]:
import spacy
from spacy import displacy

### Load spaCy's medium English language model

In [2]:
nlp = spacy.load("en_core_web_md")



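### Text similarity via word embeddings is available in the medium and larger language models
* the similarity score is the cosine similarity of the two texts' averaged word vectors; for ordinary text it usually falls between 0.0 and 1.0, and higher values mean the texts are closer in meaning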
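
As a rough illustration of how such a score can arise, the sketch below averages word vectors and takes the cosine of the two averages. The three-dimensional vectors in `TOY_VECTORS` are made up for the example and are not values from the real model, and the helper names `mean_vector` and `cosine` are hypothetical.

```python
import math

# Made-up 3-dimensional vectors, for illustration only (not from en_core_web_md).
TOY_VECTORS = {
    "mouse": [0.9, 0.1, 0.0],
    "rodent": [0.8, 0.2, 0.1],
    "cheese": [0.1, 0.9, 0.0],
    "computers": [0.0, 0.1, 0.9],
}

def mean_vector(words):
    """Average the vectors of the given words, component by component."""
    vectors = [TOY_VECTORS[w] for w in words]
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def cosine(u, v):
    """Cosine of the angle between u and v; it lies between -1.0 and 1.0."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

close = cosine(mean_vector(["mouse", "cheese"]), mean_vector(["rodent", "cheese"]))
far = cosine(mean_vector(["mouse", "cheese"]), mean_vector(["computers"]))
print(f"close: {close:.2f}, far: {far:.2f}")
```

Texts whose averaged vectors point in nearly the same direction score close to 1.0, while unrelated ones score much lower.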

In [3]:
doc1 = nlp("A mouse ate my cheese.")
doc2 = nlp("Some cheese was eaten by a rodent!")
doc3 = nlp("All of the cheddar was eaten by rats!")
doc4 = nlp("Computers can analyze language today.")
print(f"Similarity of docs 1 and 2: {doc1.similarity(doc2):.2f}")
print(f"Similarity of docs 1 and 3: {doc1.similarity(doc3):.2f}")
print(f"Similarity of docs 1 and 4: {doc1.similarity(doc4):.2f}")

Similarity of docs 1 and 2: 0.89
Similarity of docs 1 and 3: 0.82
Similarity of docs 1 and 4: 0.56


### Because it averages word vectors, this simple similarity model ignores word order, but it is still useful

In [4]:
doc1 = nlp("Alice killed Bob")
doc2 = nlp("Bob killed Alice")
print(f"Similarity of doc1 and doc2: {doc1.similarity(doc2):.2f}")


Similarity of doc1 and doc2: 1.00
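
The score of 1.00 follows from averaging: a mean does not depend on the order of its inputs, so both sentences end up with the same document vector. A minimal sketch with made-up two-dimensional vectors (the values in `toy_vectors` and the helper `doc_vector` are assumptions for illustration, not the model's):

```python
# Made-up 2-dimensional vectors, for illustration only.
toy_vectors = {"Alice": (1.0, 0.0), "killed": (0.0, 1.0), "Bob": (0.5, 0.5)}

def doc_vector(text):
    """Average the toy vectors of the whitespace-separated words in text."""
    vectors = [toy_vectors[word] for word in text.split()]
    return tuple(sum(component) / len(vectors) for component in zip(*vectors))

v1 = doc_vector("Alice killed Bob")
v2 = doc_vector("Bob killed Alice")
print(v1 == v2)  # True: same words, same average, whatever the order
```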


### spaCy has multiple language models, and even the medium one scores similar people as closer together

In [6]:
doc1 = nlp("Jennifer Aniston")
doc2 = nlp("Brad Pitt")
doc3 = nlp("Marie Currie")

print(f"Similarity of Jennifer Aniston and Brad Pitt: {doc1.similarity(doc2):.2f}")
print(f"Similarity of Marie Currie and Brad Pitt: {doc3.similarity(doc2):.2f}")

Similarity of Jennifer Aniston and Brad Pitt: 0.67
Similarity of Marie Currie and Brad Pitt: 0.48


### Extending the pipeline to compute the sentiment expressed in text
 * [**spaCyTextBlob**](https://spacy.io/universe/project/spacy-textblob) is a simple sentiment analyzer that can be added to the pipeline
 * You need to install it first with **pip install spacytextblob**
 * It adds attributes representing a text's polarity, subjectivity, and assessments
 * spaCy requires that custom attributes like these be accessed through the **._.** prefix, see below

In [7]:
from spacytextblob.spacytextblob import SpacyTextBlob
nlp.add_pipe('spacytextblob')

<spacytextblob.spacytextblob.SpacyTextBlob at 0x7f9b73b652b0>

In [8]:
doc = nlp("I hated calculus 2 and had a difficult time. I didn't like the course and got a bad grade.")
print(f"Polarity: {doc._.polarity:.2f}")         # -1 (negative) to +1 (positive)
print(f"Subjectivity: {doc._.subjectivity:.2f}") # 0 (objective) to 1 (subjective)
print(f"Assessments: {doc._.assessments}")       # words that indicate the sentiment

Polarity: -0.70
Subjectivity: 0.79
Assessments: [(['hated'], -0.9, 0.7, None), (['difficult'], -0.5, 1.0, None), (['bad'], -0.6999999999999998, 0.6666666666666666, None)]
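
The document-level scores above are consistent with a plain average over the assessments. The sketch below recomputes them from the tuples printed in the output; treating the aggregation as an unweighted mean is an assumption that matches these numbers, not a claim about every input.

```python
# Assessments copied from the output above: (words, polarity, subjectivity, label).
assessments = [
    (["hated"], -0.9, 0.7, None),
    (["difficult"], -0.5, 1.0, None),
    (["bad"], -0.6999999999999998, 0.6666666666666666, None),
]

# Assumption: document scores are the unweighted mean of the per-word scores.
polarity = sum(a[1] for a in assessments) / len(assessments)
subjectivity = sum(a[2] for a in assessments) / len(assessments)

print(f"Polarity: {polarity:.2f}")
print(f"Subjectivity: {subjectivity:.2f}")
```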


### This is an example of a more objective (i.e., non-emotional) expression of positive sentiment

In [9]:
doc = nlp("Tesla's Model 3 is worth the price.")
print(f"Polarity: {doc._.polarity:.2f}")         # -1 (negative) to +1 (positive)
print(f"Subjectivity: {doc._.subjectivity:.2f}") # 0 (objective) to 1 (subjective)
print(f"Assessments: {doc._.assessments}")       # words that indicate the sentiment

Polarity: 0.30
Subjectivity: 0.10
Assessments: [(['worth'], 0.3, 0.1, None)]


### Text that expresses neither positive nor negative sentiment and is neither clearly objective nor clearly subjective

In [10]:
doc = nlp("The store will open at 9:00 am and close at 5:00 pm.")
print(f"Polarity: {doc._.polarity:.2f}")         # -1 (negative) to +1 (positive)
print(f"Subjectivity: {doc._.subjectivity:.2f}") # 0 (objective) to 1 (subjective)
print(f"Assessments: {doc._.assessments}")       # words that indicate the sentiment

Polarity: 0.00
Subjectivity: 0.50
Assessments: [(['open'], 0.0, 0.5, None)]


### Adding the [Coreferee](https://pypi.org/project/coreferee/) coreference model

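 * Coreference occurs when two or more expressions in a text refer to the same entity, e.g. **John** went home because **he** was tired.
 * It's critical for many tasks, such as information extraction, where a fact stated about a pronoun has to be attributed to the right entity
 * Using the largest language models yields the best results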

You will need to run these commands to use Coreferee on English text
 * python3 -m pip install coreferee
 * python3 -m coreferee install en

In [11]:
# import and add to the pipeline
import coreferee
nlp.add_pipe('coreferee')

<coreferee.manager.CorefereeBroker at 0x7f9b76c46fd0>

In [12]:
doc = nlp("""Although he was very busy with his work, Peter Piper had had enough of it. \
He and his wife decided they needed a holiday. They travelled to Spain because \
she loved the country very much.""")

A **coref chain** is a list of names, nouns, and/or pronouns in the text that refer to the same entity. Coreferee finds five chains in this text. Note that
 * The number after each word in a chain is the index of that token in the text
 * A reference to a named entity points to its last token (e.g., Piper)
 * Some items in a chain are lists of tokens that together form one mention (e.g., [He(17); wife(20)] in chain #2)

In [13]:
doc._.coref_chains.print()

0: he(1), his(6), Piper(10), He(17), his(19)
1: work(7), it(15)
2: [He(17); wife(20)], they(22), They(27)
3: wife(20), she(32)
4: Spain(30), country(35)
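
To check how the numbers in the printout map onto the text, the sketch below re-creates the token positions with a simple regular-expression tokenizer. The regex is a stand-in for spaCy's tokenizer and is assumed to agree with it only on plain text like this. The `chains` dictionary is copied by hand from the output above rather than read from Coreferee.

```python
import re

text = ("Although he was very busy with his work, Peter Piper had had enough of it. "
        "He and his wife decided they needed a holiday. They travelled to Spain because "
        "she loved the country very much.")

# Stand-in tokenizer: runs of word characters, or single punctuation marks.
tokens = re.findall(r"\w+|[^\w\s]", text)

# Token indexes copied from the printed chains; a mention with several indexes
# is one mention made up of several tokens, such as "He and his wife".
chains = {
    0: [[1], [6], [10], [17], [19]],
    1: [[7], [15]],
    2: [[17, 20], [22], [27]],
    3: [[20], [32]],
    4: [[30], [35]],
}

for chain_id, mentions in chains.items():
    words = [" + ".join(tokens[i] for i in mention) for mention in mentions]
    print(chain_id, words)
```

Each index recovers the word shown next to it in the printout, which confirms that the numbers are zero-based token positions counting punctuation as tokens.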


Here's an example relevant to cybersecurity

In [24]:
doc = nlp("""APT41 is a state-sponsored espionage group. It is based in China and attacks \
higher education, travel services and media firms from that country.""")
print(f"Named entities: {doc.ents}")
print(f"Noun chunks: {list(doc.noun_chunks)}")
print("\nCoreference chains:")
doc._.coref_chains.print()

Named entities: (China,)
Noun chunks: [APT41, a state-sponsored espionage group, It, China, higher education, travel services, media firms, that country]

Coreference chains:
0: APT41(0), It(9)
1: China(13), country(26)
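
One way chains like these get used downstream is to rewrite a later mention with the first one, so that each sentence names its subject explicitly. The sketch below does this by hand for chain 0. The regex tokenizer is a stand-in for spaCy's, the token indexes are copied from the output above, and picking the first mention as the replacement is an assumption made for the example.

```python
import re

text = ("APT41 is a state-sponsored espionage group. It is based in China and attacks "
        "higher education, travel services and media firms from that country.")

# Stand-in tokenizer; on this text it yields the same token positions as the output above.
tokens = re.findall(r"\w+|[^\w\s]", text)

# Chain 0 from the output: APT41(0), It(9).
chain = [0, 9]

# Assumption: use the first mention in the chain as the replacement text.
resolved = list(tokens)
for index in chain[1:]:
    resolved[index] = tokens[chain[0]]

print(" ".join(resolved[9:14]))
```

After the substitution, the second sentence states directly which group is based in China.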


*** 
*The End*