In [1]:
import spacy



Next, we use the load function from the spacy library to load the core English language model. The model is stored in the sp variable.

In [2]:
sp = spacy.load('en_core_web_sm')

Let's now create a small document using this model. A document can be a single sentence or a group of sentences; its length is limited only by the pipeline's max_length setting and the available memory. The following script creates a simple spaCy document.

In [3]:
sentence = sp(u'Manchester United is looking to sign a forward for $90 million')

spaCy automatically breaks your text into tokens when a document is created using the model.

A token simply refers to an individual part of a sentence having some semantic value. Let's see what tokens we have in our document:

In [4]:
for word in sentence:
    print(word.text)

Manchester
United
is
looking
to
sign
a
forward
for
$
90
million


These are the tokens in our document. We can also see the part of speech of each of these tokens using the .pos_ attribute shown below:

In [5]:
for word in sentence:
    print(word.text,  word.pos_)

Manchester PROPN
United PROPN
is AUX
looking VERB
to PART
sign VERB
a DET
forward NOUN
for ADP
$ SYM
90 NUM
million NUM


You can see that each word or token in our sentence has been assigned a part of speech. For instance, "Manchester" has been tagged as a proper noun, "looking" has been tagged as a verb, and so on.

Finally, in addition to the parts of speech, we can also see the dependencies.

Let's create another document:

In [6]:
sentence2 = sp(u"Manchester United isn't looking to sign any forward.")

For dependency parsing, the attribute dep_ is used as shown below:

In [7]:
for word in sentence2:
    print(word.text,  word.pos_, word.dep_)

Manchester PROPN compound
United PROPN nsubj
is AUX aux
n't PART neg
looking VERB ROOT
to PART aux
sign VERB xcomp
any DET det
forward ADV advmod
. PUNCT punct


From the output, you can see that spaCy identifies the dependencies between the tokens. For instance, the sentence contains the word isn't. The tokenizer has split it into two tokens, is and n't, and the dependency parser labels n't as neg, that is, a negation modifier.
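
The abbreviated labels in the two outputs above can be decoded with spacy.explain, the same helper this notebook uses for entity labels further down. The sketch below needs only the spaCy package, not a trained model; the two lists simply repeat labels that appear in the outputs above.

```python
import spacy

# Labels copied from the outputs above: part-of-speech tags first,
# then dependency labels.
pos_tags = ["PROPN", "AUX", "VERB", "PART", "DET", "NOUN", "ADP", "SYM", "NUM"]
dep_labels = ["compound", "nsubj", "aux", "neg", "xcomp", "det", "advmod", "punct"]

# spacy.explain looks a label up in spaCy's glossary and returns None
# for labels it does not know.
explanations = {label: spacy.explain(label) for label in pos_tags + dep_labels}

for label, meaning in explanations.items():
    print(f"{label:10} {meaning}")
```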

In addition to printing the words, you can also print sentences from a document.

In [8]:
document = sp(u'Hello from Stackabuse. The site with the best Python Tutorials. What are you looking for?')

Now, we can iterate through each sentence using the following script:

In [9]:
for sentence in document.sents:
    print(sentence)

Hello from Stackabuse.
The site with the best Python Tutorials.
What are you looking for?


You can also check whether a particular token starts a sentence. Individual tokens can be accessed by index with square brackets, as with a list:

In [10]:
document[4]

The

To check whether this token, The, starts a sentence, we can use the is_sent_start attribute as shown below:



In [11]:
document[4].is_sent_start

True
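
The same attribute can be used to list every sentence-opening token at once. The sketch below is an illustration under assumptions: it uses a blank English pipeline with spaCy's rule-based sentencizer component (spaCy v3 API) instead of the parser in en_core_web_sm, so it runs without a trained model; the variable names are made up for this example.

```python
import spacy

# Assumption: a blank English pipeline plus the rule-based "sentencizer"
# stands in for the parser of en_core_web_sm (spaCy v3 API).
nlp_rules = spacy.blank("en")
nlp_rules.add_pipe("sentencizer")

doc = nlp_rules("Hello from Stackabuse. The site with the best Python Tutorials. "
                "What are you looking for?")

# Collect every token that opens a sentence, together with its index.
sentence_starts = [(token.i, token.text) for token in doc if token.is_sent_start]
print(sentence_starts)
```

The sentencizer only splits on sentence-final punctuation, so on less tidy text its boundaries may differ from those of the parser.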

# Tokenization

In [12]:
sentence3 = sp(u'"They\'re leaving U.K. for U.S.A."')
print(sentence3)

"They're leaving U.K. for U.S.A."


You can see the sentence contains quotes at the beginning and at the end. It also contains periods inside the abbreviations "U.K." and "U.S.A."

Let's see how spaCy tokenizes this sentence.

In [13]:
for word in sentence3:
    print(word.text)

"
They
're
leaving
U.K.
for
U.S.A.
"


In the output, you can see that spaCy has tokenized the opening and closing double quotes separately. However, it is intelligent enough not to split on the periods inside abbreviations such as U.K. and U.S.A.
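
Tokenization in spaCy is non-destructive: every token remembers its character offset and trailing whitespace, so the original string can be rebuilt. The sketch below shows this with a blank English pipeline, on the assumption that the tokenizer rules ship with the language class and need no trained model; tokenizer_only is a name chosen for this example.

```python
import spacy

# Assumption: the tokenizer rules ship with the language class, so a blank
# English pipeline is enough to try them out.
tokenizer_only = spacy.blank("en")
text = '"They\'re leaving U.K. for U.S.A."'
doc = tokenizer_only(text)

for token in doc:
    # idx is the character offset of the token in the original string;
    # whitespace_ is the trailing space (if any) that followed it.
    print(token.idx, repr(token.text), repr(token.whitespace_))

# Joining each token with its trailing whitespace gives back the input.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```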

In [14]:
sentence4 = sp(u"Hello, I am non-vegetarian, email me the menu at abc-xyz@gmai.com")
print(sentence4)

Hello, I am non-vegetarian, email me the menu at abc-xyz@gmai.com


In the sentence above, we have a dash in the word "non-vegetarian" and in the email address. Let's see how spaCy will tokenize this:

In [15]:
for word in sentence4:
    print(word.text)

Hello
,
I
am
non
-
vegetarian
,
email
me
the
menu
at
abc-xyz@gmai.com


It is evident from the output that spaCy recognized the email address and kept it as a single token despite the "-" it contains. On the other hand, the word "non-vegetarian" was split into three tokens.

Let's now see how we can count the words in the document:

In [16]:
len(sentence4)

14
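
Note that len counts tokens, not words: in the tokenized output above, two commas and a hyphen are among the 14 tokens. A minimal sketch for counting only word-like tokens filters on the is_punct flag. It assumes a blank English pipeline is sufficient, since is_punct is a lexical attribute; token_counter is a name chosen for this example.

```python
import spacy

token_counter = spacy.blank("en")
doc = token_counter("Hello, I am non-vegetarian, email me the menu at abc-xyz@gmai.com")

# is_punct is a lexical flag, so it works without a trained model.
words = [token.text for token in doc if not token.is_punct]
punctuation = [token.text for token in doc if token.is_punct]

print(len(doc), len(words), len(punctuation))
print(punctuation)
```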

# Detecting Entities


In addition to tokenizing the documents to words, you can also find if the word is an entity such as a company, place, building, currency, institution, etc.

Let's see a simple example of named entity recognition:

In [17]:
sentence5 = sp(u'Manchester United is looking to sign Harry Kane for $90 million') 

First, let's look at what the plain tokenizer gives us:

In [18]:
for word in sentence5:
    print(word.text)

Manchester
United
is
looking
to
sign
Harry
Kane
for
$
90
million


We would like to treat "Manchester United" as a single unit rather than as two separate words. Similarly, "Harry Kane" is the name of a person, and "$90 million" is a currency value; each of these should be handled as one unit too.

This is where named entity recognition comes into play. To get the named entities from a document, you have to use the ents attribute. Let's retrieve the named entities from the above sentence. Execute the following script:

In [20]:
for entity in sentence5.ents:
    print(entity.text + ' - ' + entity.label_ + ' - ' + str(spacy.explain(entity.label_)))

Manchester United - GPE - Countries, cities, states
Harry Kane - PERSON - People, including fictional
$90 million - MONEY - Monetary values, including unit


# Detecting Nouns

In addition to detecting named entities, noun phrases can also be detected. To do so, the noun_chunks attribute is used. Consider the following sentence:



In [21]:
sentence5 = sp(u'Latest Rumours: Manchester United is looking to sign Harry Kane for $90 million') 

Let's find the noun chunks:

In [22]:
for noun in sentence5.noun_chunks:
    print(noun.text)

Latest Rumours
Manchester United
Harry Kane


# Stemming

Stemming refers to reducing a word to its root form. While performing natural language processing tasks, you will encounter various scenarios where different words share the same root, for instance compute, computer, computing, and computed. You may want to reduce such words to their root form for the sake of uniformity. This is where stemming comes into play.

It might be surprising to you but spaCy doesn't contain any function for stemming as it relies on lemmatization only. Therefore, in this section, we will use NLTK for stemming.

NLTK provides several stemmers; here we look at two of them, the Porter stemmer and the Snowball stemmer. They are implemented using different algorithms.

# Porter Stemmer

In [23]:
import nltk

from nltk.stem.porter import PorterStemmer

In [24]:
stemmer = PorterStemmer()

In [25]:
tokens = ['compute', 'computer', 'computed', 'computing']

In [26]:
for token in tokens:
    print(token + ' --> ' + stemmer.stem(token))

compute --> comput
computer --> comput
computed --> comput
computing --> comput


You can see that all four words have been reduced to "comput", which isn't a word at all.

# Snowball Stemmer


Snowball stemmer is a slightly improved version of the Porter stemmer and is usually preferred over the latter. Let's see snowball stemmer in action:

In [27]:
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer(language='english')

tokens = ['compute', 'computer', 'computed', 'computing']

for token in tokens:
    print(token + ' --> ' + stemmer.stem(token))

compute --> comput
computer --> comput
computed --> comput
computing --> comput


You can see that the results are the same. We still got "comput" as the stem. Again, this word "comput" actually isn't a dictionary word.
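
The two stemmers do not always agree. The sketch below compares them side by side on a few words; the word list is chosen for this example, on the assumption that adverbs ending in "-ly" are a typical case where the outputs differ.

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer(language="english")

# Adverbs ending in "-ly" are a common case where the two algorithms disagree.
words = ["clearly", "fairly", "computing"]
comparison = {word: (porter.stem(word), snowball.stem(word)) for word in words}

for word, (porter_stem, snowball_stem) in comparison.items():
    print(f"{word:10} porter={porter_stem:10} snowball={snowball_stem}")
```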

This is where lemmatization comes in handy. Lemmatization reduces a word to its lemma, the base form under which it appears in the dictionary. The lemmas returned are actual dictionary words and are semantically complete, unlike the stems returned by a stemmer.

# Lemmatization

Though we could not perform stemming with spaCy, we can perform lemmatization using spaCy.

To do so, we read the lemma_ attribute of each token in a spaCy document. Suppose we have the following sentence:

In [28]:
sentence6 = sp(u'compute computer computed computing')

In [29]:
for word in sentence6:
    print(word.text,  word.lemma_)

compute compute
computer computer
computed compute
computing computing


You can see that unlike stemming where the root we got was "comput", the roots that we got here are actual words in the dictionary.
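
The difference between the two approaches can be sketched in plain Python. This is a toy illustration only, not how NLTK or spaCy work internally: the suffix list and the lookup table below are hypothetical and cover just a handful of words.

```python
# Toy illustration only: real stemmers and lemmatizers are far more elaborate.
SUFFIXES = ("ing", "ed", "er", "e")   # hypothetical suffix list
LEMMA_TABLE = {                       # hypothetical dictionary
    "computed": "compute",
    "written": "write",
    "been": "be",
}


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemma(word):
    """Look the word up in a dictionary; fall back to the word itself."""
    return LEMMA_TABLE.get(word, word)


for word in ["compute", "computer", "computed", "computing", "written"]:
    print(word, toy_stem(word), toy_lemma(word))
```

The stemmer applies blind string rules and so produces "comput", while the lookup can only return forms that someone has entered in the dictionary, which is why an irregular form such as "written" is handled by the lookup but left untouched by the suffix rules.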

Lemmatization also maps inflected verb forms, such as the past tense and the past participle (often called the second and third forms), to their base form. Look at the following example:

In [30]:
sentence7 = sp(u'A letter has been written, asking him to be released')

for word in sentence7:
    print(word.text + '  ===>', word.lemma_)

A  ===> a
letter  ===> letter
has  ===> have
been  ===> be
written  ===> write
,  ===> ,
asking  ===> ask
him  ===> he
to  ===> to
be  ===> be
released  ===> release


In [32]:
Example_Sentence = "Patients who in late middle age have smoked 20 cigarettes a day since their teens constitute an at-risk group. One thing they’re clearly at risk for is the acute sense of guilt that a clinician can incite, which immediately makes a consultation tense."
#If the model is not installed yet, loading it raises an error; follow the instructions in the error message to download it.

In [33]:
nlp = spacy.load('en_core_web_sm')

In [31]:
def spacy_process(text):
    doc = nlp(text)
    
    #Tokenization and lemmatization are done with the spacy nlp pipeline commands
    lemma_list = [token.lemma_ for token in doc]
    print("Tokenize+Lemmatize:")
    print(lemma_list)

    #Filter the stopword
    filtered_sentence = [word for word in lemma_list if not nlp.vocab[word].is_stop]

    #Remove punctuation by building a new list; removing items from a list
    #while iterating over it can skip elements
    punctuations = set("?:!.,;")
    filtered_sentence = [word for word in filtered_sentence if word not in punctuations]
    print(" ")
    print("Remove stopword & punctuation: ")
    print(filtered_sentence)


In [36]:
spacy_process(Example_Sentence)

Tokenize+Lemmatize:
['patient', 'who', 'in', 'late', 'middle', 'age', 'have', 'smoke', '20', 'cigarette', 'a', 'day', 'since', 'their', 'teen', 'constitute', 'an', 'at', '-', 'risk', 'group', '.', 'one', 'thing', 'they', '’re', 'clearly', 'at', 'risk', 'for', 'be', 'the', 'acute', 'sense', 'of', 'guilt', 'that', 'a', 'clinician', 'can', 'incite', ',', 'which', 'immediately', 'make', 'a', 'consultation', 'tense', '.']
 
Remove stopword & punctuation: 
['patient', 'late', 'middle', 'age', 'smoke', '20', 'cigarette', 'day', 'teen', 'constitute', '-', 'risk', 'group', 'thing', 'clearly', 'risk', 'acute', 'sense', 'guilt', 'clinician', 'incite', 'immediately', 'consultation', 'tense']
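
The punctuation step deserves some care: removing items from a list while looping over that same list can skip elements, whereas building a new list does not. The example sentence never has two punctuation tokens in a row, so the output above is the same either way. The sketch below uses a made-up token list to show the difference.

```python
punctuations = set("?:!.,;")
tokens = ["wait", ".", ".", "ok", "!"]

# Removing from the list being looped over shifts the remaining items left,
# so the element right after a removed one is never inspected.
in_place = list(tokens)
for word in in_place:
    if word in punctuations:
        in_place.remove(word)

# Building a new list inspects every element exactly once.
rebuilt = [word for word in tokens if word not in punctuations]

print(in_place)
print(rebuilt)
```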


In [34]:
import nltk
from nltk.tokenize import word_tokenize  
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.snowball import SnowballStemmer
from nltk.stem.porter import PorterStemmer

p_stemmer = PorterStemmer()
#s_stemmer = SnowballStemmer(language='english')

def nltk_process(text):
    #Tokenization
    nltk_tokenList = word_tokenize(text)
    
    #Stemming
    nltk_stemedList = []
    for word in nltk_tokenList:
        nltk_stemedList.append(p_stemmer.stem(word))
        #nltk_stemedList.append(s_stemmer.stem(word))
    
    #Lemmatization
    wordnet_lemmatizer = WordNetLemmatizer()
    nltk_lemmaList = []
    for word in nltk_stemedList:
        nltk_lemmaList.append(wordnet_lemmatizer.lemmatize(word))
    
    print("Stemming + Lemmatization")
    print(nltk_lemmaList)

    #Filter stopword
    filtered_sentence = []  
    nltk_stop_words = set(stopwords.words("english"))
    for w in nltk_lemmaList:  
        if w not in nltk_stop_words:  
            filtered_sentence.append(w)  

    #Removing Punctuation
    #Build a new list; removing items from a list while iterating over it can skip elements
    punctuations = set("?:!.,;")
    filtered_sentence = [word for word in filtered_sentence if word not in punctuations]
    print(" ")
    print("Remove stopword & Punctuation")
    print(filtered_sentence)

In [37]:
nltk_process(Example_Sentence)

Stemming + Lemmatization
['patient', 'who', 'in', 'late', 'middl', 'age', 'have', 'smoke', '20', 'cigarett', 'a', 'day', 'sinc', 'their', 'teen', 'constitut', 'an', 'at-risk', 'group', '.', 'one', 'thing', 'they', '’', 're', 'clearli', 'at', 'risk', 'for', 'is', 'the', 'acut', 'sen', 'of', 'guilt', 'that', 'a', 'clinician', 'can', 'incit', ',', 'which', 'immedi', 'make', 'a', 'consult', 'ten', '.']
 
Remove stopword & Punctuation
['patient', 'late', 'middl', 'age', 'smoke', '20', 'cigarett', 'day', 'sinc', 'teen', 'constitut', 'at-risk', 'group', 'one', 'thing', '’', 'clearli', 'risk', 'acut', 'sen', 'guilt', 'clinician', 'incit', 'immedi', 'make', 'consult', 'ten']
