# Testing NLP Libraries
* We'll see brief examples of how to use the NLTK and spaCy libraries
* We'll perform exercises on PoS tagging, NER, tokenization, lemmatization, and stemming

## Tokenization Examples

In [1]:
# Importing libraries and downloading contents
import nltk
from nltk.corpus import names
nltk.download('popular')



[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to
[nltk_data]    |     C:\Users\52556\AppData\Roaming\nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to
[nltk_data]    |     C:\Users\52556\AppData\Roaming\nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to
[nltk_data]    |     C:\Users\52556\AppData\Roaming\nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to
[nltk_data]    |     C:\Users\52556\AppData\Roaming\nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to
[nltk_data]    |     C:\Users\52556\AppData\Roaming\nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]   

True

In [2]:
# Checking the first 10 names
print(names.words()[:10])
print(f"Our names corpus has: {len(names.words())} names")


['Abagael', 'Abagail', 'Abbe', 'Abbey', 'Abbi', 'Abbie', 'Abby', 'Abigael', 'Abigail', 'Abigale']
Our names corpus has: 7944 names


In [3]:
# Tokenization example with NLTK
from nltk.tokenize import word_tokenize
sent = """I am reading a book.
    It is Python Machine Learning By Example,
    3rd edition."""

print(word_tokenize(sent))


['I', 'am', 'reading', 'a', 'book', '.', 'It', 'is', 'Python', 'Machine', 'Learning', 'By', 'Example', ',', '3rd', 'edition', '.']


In [4]:
# A more complex example of tokenization
sent2 = "I've been to U.K. and U.S.A"
print(word_tokenize(sent2))

['I', "'ve", 'been', 'to', 'U.K.', 'and', 'U.S.A']


In [5]:
# Using spaCy to tokenize the same example
import spacy

nlp = spacy.load('en_core_web_sm')
tokens2 = nlp(sent2)
print([token.text for token in tokens2])


['I', "'ve", 'been', 'to', 'U.K.', 'and', 'U.S.A']
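Both libraries split the contraction and keep the abbreviations intact. To see why that takes more than splitting on whitespace, here is a minimal sketch using only Python's re module. The pattern and the function name simple_tokenize are assumptions made for this illustration; they are not the rules that NLTK or spaCy apply internally.

```python
import re

sent2 = "I've been to U.K. and U.S.A"

# Splitting on whitespace alone leaves the contraction "I've" in one piece
whitespace_tokens = sent2.split()
print(whitespace_tokens)

# Hypothetical pattern (an assumption for this sketch, not a library rule set)
TOKEN_PATTERN = re.compile(
    r"""
    (?:[A-Za-z]\.)+[A-Za-z]?   # dotted abbreviations: U.K. or U.S.A
  | '[a-z]+                    # clitics split off a contraction: 've
  | \w+                        # ordinary words and numbers
  | [^\w\s]                    # any remaining punctuation mark
    """,
    re.VERBOSE,
)

def simple_tokenize(text):
    """Return the tokens matched by TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)

regex_tokens = simple_tokenize(sent2)
print(regex_tokens)
```

A pattern like this is brittle: a sentence that ends in a single letter followed by a period would be read as an abbreviation. That is one reason to rely on a library tokenizer in practice.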


In [6]:
# Segmenting the text based on sentences
from nltk.tokenize import sent_tokenize
print(sent_tokenize(sent))


['I am reading a book.', 'It is Python Machine Learning By Example,\n    3rd edition.']
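For comparison, a naive rule such as "split after a period, exclamation mark, or question mark that is followed by whitespace" handles this text but breaks on abbreviations. The sketch below uses only the standard library; naive_sent_split is a hypothetical helper for this illustration, not how sent_tokenize works (NLTK relies on a pretrained Punkt model instead).

```python
import re

def naive_sent_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows."""
    return re.split(r"(?<=[.!?])\s+", text.strip())

sent = """I am reading a book.
    It is Python Machine Learning By Example,
    3rd edition."""

# Works on the example text: two sentences
simple_split = naive_sent_split(sent)
print(simple_split)

# The same rule wrongly splits after the abbreviation "U.K."
abbrev_split = naive_sent_split("I've been to U.K. and U.S.A")
print(abbrev_split)
```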


## PoS Tagging
* We use NLTK's built-in tagging function, pos_tag
* It returns a list of tuples, each holding a word and its part-of-speech tag. To see what a tag means, we can use nltk.help.upenn_tagset()

In [7]:
tokens = word_tokenize(sent)
print(nltk.pos_tag(tokens))

[('I', 'PRP'), ('am', 'VBP'), ('reading', 'VBG'), ('a', 'DT'), ('book', 'NN'), ('.', '.'), ('It', 'PRP'), ('is', 'VBZ'), ('Python', 'NNP'), ('Machine', 'NNP'), ('Learning', 'NNP'), ('By', 'IN'), ('Example', 'NNP'), (',', ','), ('3rd', 'CD'), ('edition', 'NN'), ('.', '.')]


In [9]:
# Checking the meaning of tags
nltk.download('tagsets')
nltk.help.upenn_tagset('PRP')

[nltk_data] Downloading package tagsets to
[nltk_data]     C:\Users\52556\AppData\Roaming\nltk_data...


PRP: pronoun, personal
    hers herself him himself hisself it itself me myself one oneself ours
    ourselves ownself self she thee theirs them themselves they thou thy us


[nltk_data]   Unzipping help\tagsets.zip.


In [12]:
# Using spaCy to get the PoS tags of sent2 (tokens2 was created in the tokenization example)
print([(token.text, token.pos_) for token in tokens2])

[('I', 'PRON'), ("'ve", 'AUX'), ('been', 'AUX'), ('to', 'ADP'), ('U.K.', 'PROPN'), ('and', 'CCONJ'), ('U.S.A', 'PROPN')]
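The two libraries report different tag sets: nltk.pos_tag returns Penn Treebank tags such as PRP and NN, while spaCy's token.pos_ returns coarse universal tags such as PRON and PROPN (spaCy also exposes fine-grained tags through token.tag_). The sketch below converts part of the NLTK output shown earlier into spaCy-style coarse labels. The mapping table is hand-written and partial, an assumption for illustration that covers only the tags seen in this notebook.

```python
# NLTK output for the first sentence, copied from the pos_tag result above
nltk_tagged = [('I', 'PRP'), ('am', 'VBP'), ('reading', 'VBG'),
               ('a', 'DT'), ('book', 'NN'), ('.', '.')]

# Hand-written, partial mapping (an assumption for illustration only)
PENN_TO_COARSE = {
    'PRP': 'PRON', 'VBP': 'VERB', 'VBG': 'VERB', 'VBZ': 'VERB',
    'DT': 'DET', 'NN': 'NOUN', 'NNP': 'PROPN', 'IN': 'ADP',
    'CD': 'NUM', '.': 'PUNCT', ',': 'PUNCT',
}

def to_coarse(tagged_tokens):
    """Replace each Penn Treebank tag with a coarse label; unknown tags become 'X'."""
    return [(word, PENN_TO_COARSE.get(tag, 'X')) for word, tag in tagged_tokens]

coarse_tagged = to_coarse(nltk_tagged)
print(coarse_tagged)
```

A lookup on the tag alone cannot separate auxiliaries from main verbs, which is why spaCy's model labels 've and been as AUX while this table would call every verb form VERB.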


## NER
* Named entity recognition (NER) allows us to identify words or phrases that belong to definite categories, such as names of persons, companies, locations, and dates
* For a full list of named entity tags: https://spacy.io/api/annotation#section-named-entities

In [16]:
# Processing a sentence with spaCy so that we can extract its named entities
tokens3 = nlp("The book written by Hayden Liu in 2023 was sold in $30 in America.")


In [17]:
# The tokens3 object has an attribute called ents, which holds the named entities.
# We extract the text and label of each entity as follows:
print([(token_ent.text, token_ent.label_) for token_ent in tokens3.ents])

[('Hayden Liu', 'PERSON'), ('2023', 'DATE'), ('30', 'MONEY'), ('America', 'GPE')]
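Once the entities are extracted as (text, label) pairs, a common next step is to group them by category. The sketch below does this with the standard library, starting from the output printed above so that it runs without a spaCy model. The helper name group_by_label is an assumption for this illustration.

```python
from collections import defaultdict

# (text, label) pairs copied from the spaCy output above
entities = [('Hayden Liu', 'PERSON'), ('2023', 'DATE'),
            ('30', 'MONEY'), ('America', 'GPE')]

def group_by_label(pairs):
    """Collect entity texts under their label, keeping the original order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)

entities_by_label = group_by_label(entities)
print(entities_by_label)
```

With a live spaCy document, the same function could be fed the list comprehension over tokens3.ents used in the cell above.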


## Stemming and Lemmatization
* Stemming reverts an inflected or derived word to its root form.

* Lemmatization is a more cautious form of stemming: it considers the PoS of a word and traces the word back to its lemma, that is, its dictionary form.

In [25]:
# Using Porter (one of the stemming algorithms built into NLTK)
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()

# Stemming the words 'machines' and 'learning'
machine = porter_stemmer.stem('machines')
learning = porter_stemmer.stem('learning')

print(machine)

print(learning)

machin
learn


**Note** Stemming sometimes chops off letters that belong to the word, as with machin in the output above, so a stem is not always a valid word.
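To make the idea of rule-based chopping concrete, here is a deliberately crude suffix stripper. It is not the Porter algorithm, which applies several ordered rule steps with conditions on the remaining stem. The suffix list, the minimum stem length, and the name naive_stem are all assumptions chosen for this sketch.

```python
# Suffixes are tried in this order; the list is an assumption for the sketch
SUFFIXES = ('ing', 'es', 's')
MIN_STEM_LENGTH = 3  # refuse to strip if fewer than 3 letters would remain

def naive_stem(word):
    """Strip the first matching suffix, without consulting any dictionary."""
    word = word.lower()
    for suffix in SUFFIXES:
        stem = word[:-len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem
    return word

for w in ('machines', 'learning', 'is'):
    print(w, '->', naive_stem(w))
```

Because no dictionary is involved, the function has no way to know that machin is not a word. This is the same trade-off the note describes.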

In [27]:
# Importing a lemmatization algorithm based on the built-in WordNet corpus
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

# Lemmatizing 'machines' and 'learning'
machine_lemma = lemmatizer.lemmatize('machines')
learning_lemma = lemmatizer.lemmatize('learning')

print(machine_lemma)
print(learning_lemma)

machine
learning


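**Note** By default, WordNetLemmatizer.lemmatize() treats every word as a noun (pos='n'), which is why learning is returned unchanged. Passing pos='v' lemmatizes the word as a verb instead.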
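The effect of the PoS argument can be sketched with a toy lookup table. This is not WordNet: the table entries and the function toy_lemmatize are hypothetical, and the default of 'n' is chosen only to mirror the noun default described in the note.

```python
# Toy lemma table keyed by (word, pos); hypothetical data, not WordNet
TOY_LEMMAS = {
    ('machines', 'n'): 'machine',
    ('learning', 'v'): 'learn',
    ('was', 'v'): 'be',
}

def toy_lemmatize(word, pos='n'):
    """Look up the lemma for (word, pos); return the word unchanged if absent."""
    return TOY_LEMMAS.get((word, pos), word)

print(toy_lemmatize('machines'))           # looked up as a noun
print(toy_lemmatize('learning'))           # as a noun: no entry, so unchanged
print(toy_lemmatize('learning', pos='v'))  # as a verb: maps to its base form
```

The same word gives different lemmas depending on the PoS supplied, which is why a lemmatizer is usually paired with a PoS tagger.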