# Natural Language Processing

## Part 1: Introduction to SpaCy

spaCy is one of the leading frameworks (alongside NLTK) for natural language processing, and it is hard to do day-to-day NLP work without it. Here we explore (1) the philosophy of spaCy, (2) the concept of pipelines, (3) creating pipelines, and (4) some interesting case studies.

## 1. How to Install spaCy

In order to install spaCy, I recommend visiting their website, here: https://spacy.io/usage

If you are using pip, simply run:

`pip install spacy` or `pip install -U 'spacy[cuda-autodetect]'`

`python -m spacy download en_core_web_sm` 

`python -m spacy download en_core_web_md` 

To get a different English model, replace `sm` in `en_core_web_sm` with `md` (medium), `lg` (large), or `trf` (transformer-based). Larger models are generally more accurate, at the cost of download size and speed; `md` and `lg` also ship with word vectors, which `sm` does not.

Now that we've installed spaCy, let's import it to make sure we installed it correctly.

In [1]:
import spacy
spacy.__version__

'3.4.3'

Great! Now, let's make sure we downloaded the model successfully with the command below.

In [2]:
nlp = spacy.load("en_core_web_sm")

Excellent! spaCy is now installed correctly and we have successfully downloaded the small English model.

## 2. Data structures

Let's learn the data structures of spaCy. The overarching container is the `Doc`, which holds a sequence of `Token` objects; a slice of consecutive tokens forms a `Span`. There are more data structures than these three, but we will meet them gradually as they come up.

<img src = "figures/container.svg" width="300">

### The Doc object

The Doc is one of the central data structures in spaCy. It's created automatically when you process a text with the nlp object. But you can also instantiate the class manually.

After creating the nlp object, we can import the Doc class from `spacy.tokens`.

Here we're creating a Doc from three words. The spaces are a list of boolean values indicating whether the word is followed by a space. Every token includes that information – even the last one!

The Doc class takes three arguments: the shared vocab, the words and the spaces.

In [3]:
# Create an nlp object
from spacy.lang.en import English
nlp = English()

# Import the Doc class
from spacy.tokens import Doc

# The words and spaces to create the doc from
words = ['Hello', 'world', '!']
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)
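To make the role of `spaces` concrete, here is a minimal pure-Python sketch of how the original string can be rebuilt from `words` and `spaces`. The helper `rebuild_text` is a hypothetical name used only for illustration; it is not part of spaCy, it just mirrors the idea that each token remembers whether a space follows it.

```python
def rebuild_text(words, spaces):
    """Join words, adding one space after each word whose flag is True."""
    if len(words) != len(spaces):
        raise ValueError("words and spaces must have the same length")
    return "".join(word + (" " if space else "") for word, space in zip(words, spaces))


# Same values as in the cell above
words = ["Hello", "world", "!"]
spaces = [True, False, False]

text = rebuild_text(words, spaces)
print(text)  # Hello world!
```

Because the trailing-space flag is stored per token, no information about the original spacing is lost when a text is split into tokens.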

<img src = "figures/span2.png">

In [4]:
# Import the Doc and Span classes
from spacy.tokens import Doc, Span

# The words and spaces to create the doc from
words = ['Hello', 'world', '!']
spaces = [True, False, False]

# Create a doc manually
doc = Doc(nlp.vocab, words=words, spaces=spaces)

# Create a span manually
span = Span(doc, 0, 2)

# Create a span with a label
span_with_label = Span(doc, 0, 2, label="GREETING")

# Add span to the doc.ents
doc.ents = [span_with_label]

## 3. Creating a Doc Container

Great! With the model loaded, let's go ahead and import our text from Wikipedia.

In [5]:
with open("data/wiki_us.txt", "r", encoding="utf-8") as f:
    text = f.read()

Now, let's see what this text looks like. It can be a bit difficult to read in a JupyterBook, but notice the horizontal slider below. You don't need to read this in its entirety.

In [6]:
print (text)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.


With the data loaded in, it's time to make our first Doc container.

In [7]:
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

Great! Let's see what this looks like.

In [8]:
print (doc)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.


**So what's the difference between the raw text and the Doc?**

In [9]:
#the length is different, why?
print (len(doc))
print (len(text))

156
787


Hmm... What's going on here? Same text, but different length. Why does this occur? To answer that, let's explore it more deeply and try and print off each item in each object.

In [10]:
for char in text[:10]:
    print (char)

T
h
e
 
U
n
i
t
e
d


As we would expect. We have printed off each character, including white spaces. Let's try and do the same with the Doc container.

In [11]:
for token in doc[:10]:
    print (token)

The
United
States
of
America
(
U.S.A.
or
USA
)


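And now we see the difference. Iterating over a `Doc` yields `Token` objects, produced by the tokenizer of the `en_core_web_sm` pipeline, rather than single characters. That is why `len(doc)` counts tokens while `len(text)` counts characters. Another difference is `doc.sents`, which iterates over the sentences of the document.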

In [12]:
sentence1 = list(doc.sents)[0]
print (sentence1)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
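Returning to the length mismatch between `text` and `doc`: the sketch below counts characters and tokens for a short string using a deliberately naive regex tokenizer. This is an assumption-laden stand-in and not spaCy's tokenizer, which applies language-specific rules and exceptions; the point is only that a token count and a character count measure different things.

```python
import re

sample = "The United States of America (U.S.A. or USA)"

# Naive rule: a token is either a run of word characters
# or a single non-space, non-word character (punctuation).
naive_tokens = re.findall(r"\w+|[^\w\s]", sample)

print(len(sample))        # number of characters
print(len(naive_tokens))  # number of naive tokens
print(naive_tokens)
```

Notice that this naive rule splits `U.S.A.` into six pieces, whereas the spaCy output above keeps `U.S.A.` as a single token. Handling abbreviations like this is one of the reasons to rely on a trained pipeline's tokenizer.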


We can dig deeper into **token attributes**, which expose a lot of useful linguistic information.

In [13]:
#let's pick a token (the third one in the sentence)
token2 = sentence1[2]
print (token2)

States


### Text

In [14]:
#the raw text
token2.text

'States'

### Head

In [15]:
#the syntactic parent (head) of this token in the dependency tree
token2.head

is

### Left Edge

In [16]:
#the leftmost token of this token's syntactic subtree, i.e. where the phrase it heads begins
token2.left_edge

The

### Right Edge

In [17]:
#the rightmost token of this token's syntactic subtree, i.e. where the phrase it heads ends
token2.right_edge

,
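To see what "edges of a subtree" means, here is a small pure-Python sketch. It assumes a hand-written toy parse (the `heads` list below is made up for illustration, not produced by spaCy) and a hypothetical helper `subtree_edges`. Following spaCy's convention, the root token is its own head.

```python
def subtree_edges(heads, i):
    """Return (leftmost, rightmost) token index of the subtree rooted at token i."""

    def is_descendant(j):
        # Walk up the tree from j until we reach i or the root
        while True:
            if j == i:
                return True
            if heads[j] == j:  # reached the root without passing through i
                return False
            j = heads[j]

    members = [j for j in range(len(heads)) if is_descendant(j)]
    return members[0], members[-1]


# Toy parse (assumed): "The" and "United" attach to "States",
# "States" and "large" attach to the root "is".
tokens = ["The", "United", "States", "is", "large"]
heads = [2, 2, 3, 3, 3]

left, right = subtree_edges(heads, 2)  # subtree headed by "States"
print(tokens[left], tokens[right])     # The States
```

In the real sentence above, the right edge of "States" is the comma because the parser attached the parenthetical and the "commonly known as ..." material underneath "States".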

### Entity Type

In [18]:
#type of entity, as an integer hash
token2.ent_type

384

In [19]:
#the actual name of the entity
token2.ent_type_

'GPE'

In [20]:
spacy.explain('GPE') #geo political entities

'Countries, cities, states'

### Ent IOB

In [21]:
#“B” means the token begins an entity, “I” means it is inside an entity, 
# “O” means it is outside an entity, and "" means no entity tag is set.
token2.ent_iob_

'I'
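The IOB scheme is easy to reproduce by hand. Below is a minimal sketch that turns entity spans into per-token `(iob, label)` pairs, loosely mirroring what `ent_iob_` and `ent_type_` report. The helper `spans_to_iob` and the entity spans are assumptions written for illustration; they are not spaCy output.

```python
def spans_to_iob(n_tokens, spans):
    """Convert (start, end, label) spans (end exclusive) into per-token (iob, label) pairs."""
    tags = [("O", "")] * n_tokens
    for start, end, label in spans:
        tags[start] = ("B", label)          # first token begins the entity
        for i in range(start + 1, end):
            tags[i] = ("I", label)          # remaining tokens are inside it
    return tags


tokens = ["The", "United", "States", "of", "America", "borders", "Canada"]
# Hand-written entity spans (assumed for this example)
spans = [(0, 5, "GPE"), (6, 7, "GPE")]

for token, (iob, label) in zip(tokens, spans_to_iob(len(tokens), spans)):
    print(token, iob, label)
```

Under these assumed spans, "States" is tagged `I` because it sits inside an entity that started on an earlier token, which is consistent with the `'I'` shown above.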

### Lemma

In [22]:
#the base (dictionary) form of the word
print(sentence1[12])
sentence1[12].lemma_ #without _, it will give you an integer representation instead

known


'know'

### Part of Speech

In [23]:
token2.pos_

'PROPN'

### Syntactic Dependency

<img src = "figures/dep_example.png">

In [24]:
token2.dep_

'nsubj'

In [25]:
spacy.explain('nsubj')

'nominal subject'

### Language

In [26]:
token2.lang_

'en'

### Boolean attributes

In [27]:
token2.is_alpha

True

In [28]:
token2.is_punct

False

In [29]:
token2.like_email

False

## 4. Part of Speech Tagging (POS)

In the field of computational linguistics, understanding parts-of-speech is essential. SpaCy offers an easy way to parse a text and identify its parts of speech. Below, we will iterate across each token (word or punctuation) in the text and identify its part of speech.

In [30]:
doc = nlp("Peter loves eating and Paris very much.")
sentence1 = list(doc.sents)[0]
sentence1

Peter loves eating and Paris very much.

In [31]:
for token in sentence1:
    print (token.text, token.pos_, token.dep_)

Peter PROPN nsubj
loves VERB ROOT
eating VERB xcomp
and CCONJ cc
Paris PROPN conj
very ADV advmod
much ADV advmod
. PUNCT punct


Here, we can see three pieces of information for each token: the string, its part-of-speech (pos), and its syntactic dependency label (dep). For a complete list of the pos labels, see the spaCy documentation (https://spacy.io/api/annotation#pos-tagging). Most of these, however, should be apparent, e.g. PROPN is proper noun, AUX is auxiliary verb, ADJ is adjective, etc. We can visualize this sentence with a diagram through spaCy's displaCy Notebook feature.

In [32]:
from spacy import displacy
displacy.render(sentence1, style="dep")
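Once tokens carry POS tags, simple tallies and filters become one-liners. The sketch below copies the `(text, pos)` pairs from the printed output above into a plain list, so it runs without a model; with a real `Doc` you would build the same list from `token.text` and `token.pos_`.

```python
from collections import Counter

# (text, pos) pairs copied from the output of the tagging cell above
tagged = [
    ("Peter", "PROPN"), ("loves", "VERB"), ("eating", "VERB"),
    ("and", "CCONJ"), ("Paris", "PROPN"), ("very", "ADV"),
    ("much", "ADV"), (".", "PUNCT"),
]

# How often does each part of speech occur?
pos_counts = Counter(pos for _, pos in tagged)
print(pos_counts)

# Keep only the proper nouns
proper_nouns = [text for text, pos in tagged if pos == "PROPN"]
print(proper_nouns)  # ['Peter', 'Paris']
```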

In [33]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Berlin is a nice city")

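# Iterate over the tokens
for token in doc:
    # Check if the current token is a proper noun and is not the last token
    # (otherwise doc[token.i + 1] would raise an IndexError)
    if token.pos_ == "PROPN" and token.i + 1 < len(doc):
        # Check if the next token is a verb; copulas such as "is" are
        # tagged AUX rather than VERB, so accept both
        if doc[token.i + 1].pos_ in ("VERB", "AUX"):
            print("Found proper noun before a verb:", token.text)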

## 5. Named Entity Recognition

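Another essential task in NLP is named entity recognition (NER): finding spans of text that name real-world things such as people, places, and organizations, and labeling each with a type. spaCy exposes the entities it finds through `doc.ents`.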

In [34]:
for ent in doc.ents:
    print (ent.text, ent.label_)

Berlin GPE


We can use `displaCy` to display the text with its NER annotations.

In [35]:
displacy.render(doc, style="ent")

## 6. Word Vectors

Let's explore word vectors a bit.  

To do so, we shall use `en_core_web_md` which contains word vectors.

In [36]:
nlp = spacy.load("en_core_web_md")
with open("data/wiki_us.txt", "r", encoding="utf-8") as f:
    text = f.read()
doc = nlp(text)
sentence1 = list(doc.sents)[0]

In [37]:
sentence1

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.

In [38]:
sentence1[0]

The

In [39]:
print(sentence1[0].vector.shape)
sentence1[0].vector[:5] #300 dimensions representing "The"

(300,)


array([-7.2681 , -0.85717,  5.8105 ,  1.9771 ,  8.8147 ], dtype=float32)

### Similarity

<img src = "figures/data-struct.png">

In [40]:
#before similarity, let's learn about nlp.vocab.strings
doc = nlp("I love coffee")
print('hash value:', nlp.vocab.strings['coffee'])
print('string value:', nlp.vocab.strings[3197928453018144401])

hash value: 3197928453018144401
string value: coffee
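The two-way lookup shown above can be pictured as a small hash table. The sketch below is a toy model only: `ToyStringStore` is a hypothetical class, and it uses `hashlib.blake2b` as a stand-in 64-bit hash, so its hash values will not match the ones spaCy prints.

```python
import hashlib


class ToyStringStore:
    """Toy two-way mapping between strings and 64-bit integer hashes."""

    def __init__(self):
        self._by_hash = {}

    @staticmethod
    def _hash(string):
        # Stand-in 64-bit hash (assumption: not the hash function spaCy uses)
        digest = hashlib.blake2b(string.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "little")

    def add(self, string):
        key = self._hash(string)
        self._by_hash[key] = string
        return key

    def __getitem__(self, item):
        if isinstance(item, str):
            return self._hash(item)   # string -> hash can always be computed
        return self._by_hash[item]    # hash -> string only works for seen strings


store = ToyStringStore()
store.add("coffee")

key = store["coffee"]
print("hash value:", key)
print("string value:", store[key])
```

The asymmetry is the important part: a hash can be computed for any string, but a hash can only be turned back into a string if that string was added to the store earlier, for example by processing a text that contains it.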


In [41]:
#first, numericalize: look up the hash of the string
integer = nlp.vocab.strings['dog']
integer

7562983679033046312

In [42]:
#let's obtain the vector
vector  = nlp.vocab.vectors[integer]
vector[:5]  #too ugly to print all...

array([  1.233 ,   4.2963,  -7.9738, -10.121 ,   1.8207], dtype=float32)

In [43]:
import numpy as np

#obtain the words that are most similar to this vector
#need to put [] because most_similar accepts a 2D numpy array of query vectors
close_words = nlp.vocab.vectors.most_similar(np.asarray([vector]), n=10)
close_words  #(vocab keys, row indices in the vectors table, similarity scores)

(array([[ 7918624946109788756,  4969328240109515165,  4560869431627726864,
         17429802345416193488,  6017664905485703127, 14534804554944721111,
           173986088034745168, 15668852121853073894, 11567120971096873637,
         15872191516786115817]], dtype=uint64),
 array([[ 1147,  2545,  3201,  9003,  3828, 18829,  5845, 11580,  7045,
         18612]], dtype=int32),
 array([[1.    , 0.8334, 0.8221, 0.8108, 0.7856, 0.7195, 0.685 , 0.6328,
         0.6148, 0.5966]], dtype=float32))
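To make the three returned arrays less mysterious, here is a brute-force sketch of a nearest-neighbour search over a tiny, made-up vector table. The function `toy_most_similar` and the 3-dimensional vectors are assumptions for illustration; the sketch returns keys, row indices, and cosine similarity scores in that order, mirroring the shape of the tuple above but not spaCy's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def toy_most_similar(query, keys, table, n):
    """Return (keys, rows, scores) of the n table rows closest to the query."""
    scored = [(cosine(query, row), i) for i, row in enumerate(table)]
    scored.sort(reverse=True)  # highest similarity first
    top = scored[:n]
    return [keys[i] for _, i in top], [i for _, i in top], [s for s, _ in top]


# Made-up 3-dimensional "word vectors"
keys = ["dog", "puppy", "cat", "car"]
table = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.7, 0.6, 0.5],
    [0.0, 0.1, 1.0],
]

best_keys, best_rows, scores = toy_most_similar(table[0], keys, table, n=3)
print(best_keys)
print(best_rows)
print([round(s, 4) for s in scores])
```

The first score is 1.0 because the query vector is itself a row of the table, exactly as in the spaCy output above. The scores are similarities, not probabilities, so they do not need to sum to 1.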

In [44]:
#let's look at only the vocab id (the first of the tuple)
close_words[0].shape
similar_vocabs = close_words[0].reshape(-1) #flatten for easy looping

In [45]:
words = [nlp.vocab.strings[word] for word in similar_vocabs]
words

['dogsbody',
 'wolfdogs',
 'Baeg',
 'duppy',
 'pet(s',
 'postcanine',
 'Kebira',
 'uppies',
 'Toropets',
 'moggie']

### Doc Similarity

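In spaCy we can do this same thing at the document level. By default, `Doc.similarity` is the cosine similarity between the two documents' vectors, and each document vector is the average of its token vectors.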

In [46]:
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")

# Similarity of two documents
print(doc1, "<->", doc2, doc1.similarity(doc2))

I like salty fries and hamburgers. <-> Fast food tastes very good. 0.691649353055761
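The score above can be reproduced in spirit with a few lines of plain Python: average the word vectors of each document, then take the cosine of the angle between the two averages. The 3-dimensional vectors below are invented for illustration (real `en_core_web_md` vectors have 300 dimensions), and `mean_vector` and `cosine` are hypothetical helper names.

```python
import math


def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Made-up word vectors (assumption: values chosen only for this example)
toy_vectors = {
    "fries": [1.0, 0.0, 1.0],
    "burgers": [1.0, 0.2, 0.8],
    "tax": [-1.0, 1.0, 0.0],
}

doc_a = mean_vector([toy_vectors["fries"], toy_vectors["burgers"]])
doc_b = mean_vector([toy_vectors["burgers"]])
doc_c = mean_vector([toy_vectors["tax"]])

print(cosine(doc_a, doc_b))  # close to 1: similar topics
print(cosine(doc_a, doc_c))  # negative: dissimilar
```

Because the document vector is a plain average, word order is ignored, which is worth keeping in mind when interpreting document-level scores.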


### Word Similarity

We can also calculate the similarity between tokens and spans. Here we compare the span "salty fries" with the token "hamburgers".

In [47]:
# Similarity of tokens and spans
salty_fries = doc1[2:4]
burgers = doc1[5]
print(salty_fries, "<->", burgers, salty_fries.similarity(burgers))

salty fries <-> hamburgers 0.6938489675521851


### Span Similarity

In [48]:
import spacy

doc = nlp("This was a great restaurant. Afterwards, we went to a really nice bar.")

# Create spans for "great restaurant" and "really nice bar"
span1 = doc[3:5]
span2 = doc[12:15]

# Get the similarity of the spans
similarity = span1.similarity(span2)
print(similarity)

0.6348510384559631
