## **Context**

**Working with text data presents a number of challenges, such as the use of special characters, accented characters, or word inflections.** Libraries designed for working with text make it much simpler to extract meaningful information from text and prepare it for modeling, and they are now a crucial part of basic Natural Language Processing (NLP).


**These libraries have allowed NLP researchers and developers to simplify the handling and processing of text data.** This has helped NLP grow rapidly and find applications in a wide variety of domains, which befits the ubiquity of text data and the value that mining it holds for almost any line of business.


Two libraries in particular that provide great ease of use in working with text data are:
- **NLTK**
- **spaCy**

## **Objective**
The NLTK and spaCy libraries have plenty of modules and functions that are used in text preprocessing, text mining and model building. 

**We are going to learn about both libraries and compare their performances on various tasks.**

Let's first look at the **NLTK** Library. Before proceeding to use that library, we will have to install it into our working environment.

## **NLTK**
<center><img src="https://files.anaconda.com/production/resources/open-source/nltk-logo.svg" width="450"></img></center>

Source: <a href="https://www.nltk.org/">NLTK</a>

**First, we need to install the NLTK library** using the `pip` command, prefixed with an exclamation mark (`!`) so that the notebook runs it as a shell command.

In [22]:
!pip install nltk



Let's check out the NLTK library version that is installed in our working environment:

In [23]:
import nltk
print(nltk.__version__)

# Helps to create the data frames
import pandas as pd

3.7


We have successfully installed the NLTK library, version 3.7. NLTK relies on separate data packages (corpora and pre-trained models) that must be downloaded before they can be used. We will now download all the packages required for the NLP operations that follow.

In [3]:
# Downloading the 'punkt' module that will be helpful for tokenization
nltk.download('punkt')

# Downloading the 'stopwords' module that will be helpful for stop word removal
nltk.download('stopwords')

# Downloading the 'wordnet' module that will be helpful for lemmatization
nltk.download('wordnet')

# Downloading 'omw-1.4' (Open Multilingual WordNet), which the WordNet corpus reader depends on
nltk.download('omw-1.4')

# Downloading the 'averaged_perceptron_tagger' for POS_Tagging
nltk.download('averaged_perceptron_tagger')

# Downloading  the required modules that are used in NER tagging
nltk.download('maxent_ne_chunker')
nltk.download('words')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\aravind\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\aravind\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\aravind\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\aravind\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\aravind\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\aravind\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunke

True

**Let's import all the necessary functions that are required to perform tasks on text data using the downloaded modules.**

In [4]:
# Helpful to remove the stopwords
from nltk.corpus import stopwords

# Helpful in Lemmatization
from nltk.stem import WordNetLemmatizer

# Helpful in Tokenization
from nltk.tokenize import word_tokenize, sent_tokenize

# Used in Stemming
from nltk.stem.porter import PorterStemmer


# Used in NER Tagging
from nltk.corpus import treebank_chunk
from nltk.chunk import ne_chunk

## **spaCy**

<img src="https://spacy.io/static/social_default-1d3b50b1eba4c2b06244425ff0c49570.jpg"></img>

Source: <a href = "https://spacy.io/">spaCy</a>

**spaCy:** spaCy is an open-source software library for advanced Natural Language Processing implemented in Python and Cython. It is intended for production use, and aids in the development of programs that process and "understand" massive amounts of text. It can be used to create systems for information extraction and natural language interpretation, as well as to preprocess text for Deep Learning.

Like NLTK, it can perform tasks such as tokenization and text classification, but it also provides pre-trained language models and pipelines that users can customize. These features are not limited to the English language.

The two libraries trade off time and memory differently. spaCy typically returns results much faster than NLTK, although it uses more memory (spaCy is a memory hog!). spaCy’s speed is attributed to the fact that it was written from the ground up in Cython (a high-performance hybrid of C and Python).

Similar to NLTK, we will need to install the spaCy library using the **pip** command, prefixed with an exclamation mark (`!`).

! pip uninstall spacy

In [5]:
!pip install spacy==3.4.1



**Let's check the spaCy library version that is installed in our working environment**

In [6]:
import spacy
print(spacy.__version__)

3.4.1


In the spaCy library, we have an English language module that helps to perform language model operations on English text. Let's download the English module that will help us with the operations we need to perform.

In [7]:
# To download the spaCy English language pipeline
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.4.1
  Using cached https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl (12.8 MB)
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


**Now load the 'en_core_web_sm'. It is a small English pipeline trained on written web text (blogs, news, comments), that includes vocabulary, syntax and entities.**

In [8]:
# Loading the small English pipeline
spacy.load('en_core_web_sm')

<spacy.lang.en.English at 0x1e1c3a0acd0>

**Similarly, we can download and load up other language models. Let's see how we can do this for the French language.**

In [9]:
!python -m spacy download fr_core_news_sm

Collecting fr-core-news-sm==3.4.0
  Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.4.0/fr_core_news_sm-3.4.0-py3-none-any.whl (16.3 MB)
Installing collected packages: fr-core-news-sm
  Attempting uninstall: fr-core-news-sm
    Found existing installation: fr-core-news-sm 3.5.0
    Uninstalling fr-core-news-sm-3.5.0:
      Successfully uninstalled fr-core-news-sm-3.5.0
Successfully installed fr-core-news-sm-3.4.0
[+] Download and installation successful
You can now load the package via spacy.load('fr_core_news_sm')


In [10]:
# Loading the small French pipeline
spacy.load('fr_core_news_sm')

<spacy.lang.fr.French at 0x1e1d3f7ea30>

**Let's now import the necessary functions from the spaCy library that are required to perform tasks on text data using the downloaded modules.**

In [11]:
# English() creates a blank English pipeline that contains only the tokenizer; we instantiate it once and reuse it for processing:
from spacy.lang.en import English
nlp = English()

We have successfully installed and loaded the NLTK and spaCy libraries as well as their dependencies into our working environment. Before jumping into NLP operations, we still have to install a few other libraries that will be helpful during the text preprocessing and text mining stage.

**Unidecode**

The unidecode library takes a Unicode string and transliterates it into the closest possible ASCII representation, which is returned as a regular Python 3 string.

In [12]:
!pip install unidecode
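
To get a feel for what transliteration to ASCII does, here is a minimal sketch that uses only the standard library's `unicodedata` module. It is an approximation and not how `unidecode` works internally: it only strips accents, and characters without an accent-free decomposition are simply dropped. The helper name `strip_accents` is hypothetical.

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Approximate ASCII transliteration using only the standard library."""
    # NFKD splits an accented character into its base letter plus combining marks
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding to ASCII with errors="ignore" drops the combining marks
    # (and any other character that has no ASCII equivalent)
    return decomposed.encode("ascii", "ignore").decode("ascii")

print(strip_accents("café naïve résumé"))
```

For real projects, the `unidecode` library covers far more scripts and characters than this accent-stripping shortcut.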



**Spell Checking**

The `autocorrect` library helps correct spelling mistakes in our text. We can install this library into our working environment using the pip command. 

Please run the below command for installation.

In [13]:
!pip install autocorrect
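
As a rough illustration of the idea behind spell correction, the sketch below picks the closest word from a known vocabulary using the standard library's `difflib`. This is not the `autocorrect` API and not its algorithm; the vocabulary `TOY_VOCAB`, the function `correct`, and the `cutoff` value are all assumptions made for the example.

```python
import difflib

# Hypothetical toy vocabulary; a real spell checker uses a large word list
TOY_VOCAB = ["language", "library", "processing", "tokenization"]

def correct(word, vocabulary=TOY_VOCAB, cutoff=0.8):
    """Return the closest vocabulary word, or the word itself if none is close."""
    matches = difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word

for misspelled in ["langauge", "procesing", "xyz"]:
    print(misspelled, "->", correct(misspelled))
```

Words that have no sufficiently similar entry in the vocabulary are returned unchanged, which avoids "correcting" names and rare terms into something unrelated.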



Now let's apply some text preprocessing and language model techniques on our text data using both the NLTK and spaCy libraries.

In [14]:
text = '''When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously. I can tell you very senior CEOs of major American car companies would shake my hand and turn away because I wasn’t worth talking to, said Thrun, in an interview with Recode earlier this week'''

In [15]:
# Let's print the data
print(text)

When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously. I can tell you very senior CEOs of major American car companies would shake my hand and turn away because I wasn’t worth talking to, said Thrun, in an interview with Recode earlier this week


In [55]:
# Load English tokenizer, tagger, parser and Named Entity Recognition
en = spacy.load("en_core_web_sm")

# Holds the stopwords
sp_stopwords = en.Defaults.stop_words

In [56]:
# Pass the text to the nlp pipeline and store the resulting Doc object in the doc variable
doc = nlp(text)

spaCy provides a variety of linguistic annotations to give you insights into a text's grammatical structure. This includes word types, like the Parts of Speech, and how the words are related to each other.

## **Stop Word Removal**

First, we shall remove the stop words from our text data. Stop words are common words that are often omitted when text is analyzed. Both the NLTK and spaCy toolkits provide lists of stop words. 

In [59]:
# NLTK
sorted(nltk.corpus.stopwords.words('english'))[:10]

['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an']

**Let's now look at the stopwords available in the spaCy library.**

In [60]:
sorted(sp_stopwords)[10:20]

['after',
 'afterwards',
 'again',
 'against',
 'all',
 'almost',
 'alone',
 'along',
 'already',
 'also']

We observe that there are a lot of stop words, none of which we really need in order to understand the context of the text. Now let's remove these stop words from the text data that we have.

### **NLTK**


In [61]:
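data = text.split()

# Build the stop word set once instead of reloading the list for every word
nltk_stopwords = set(stopwords.words('english'))

# Removing the stopwords
words = [word for word in data if word not in nltk_stopwords]
words = ' '.join(words)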

**Let's find out the number of words present in the text data.**

In [62]:
print(" No of Words : ", len(text.split()))

 No of Words :  56


**Let's find out the number of words present in the text data after removing the stopwords.**

In [63]:
print(" No of Words after removing stopwords :", len(words.split()))

 No of Words after removing stopwords : 38


So out of a total of 56 words, NLTK removed 18 stop words from the text data, leaving us with 38 words.

### **spaCy**

In [93]:
data = text.split()

# Removing the stopwords
words = [word for word in data if word not in sp_stopwords]

In [94]:
print(" No of Words : ",len(words))

 No of Words :  37


- spaCy ends up removing just one more stop word than NLTK, leaving us with 37 words in the end. Both libraries hence seem to remove stop words in more or less a similar manner.
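
Both cells above compare the raw whitespace-separated words with the stop word lists, so a capitalized word such as `When` or a word with trailing punctuation is never matched. A minimal sketch of a filter that normalizes each word before the comparison is shown below; it uses a small hypothetical stop word set (`TOY_STOPWORDS`) rather than either library's list, and the function name `remove_stopwords` is also an assumption.

```python
import string

# Hypothetical toy stop word set; swap in either library's list in practice
TOY_STOPWORDS = {"when", "on", "at", "in", "of", "the", "few", "him"}

def remove_stopwords(text, stop_words):
    """Keep the words whose normalized form is not a stop word."""
    kept = []
    for raw in text.split():
        # Strip surrounding punctuation and lowercase before comparing
        key = raw.strip(string.punctuation).lower()
        if key not in stop_words:
            kept.append(raw)
    return kept

sample = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people noticed."
print(remove_stopwords(sample, TOY_STOPWORDS))
```

The original words are kept in the output; only the comparison key is normalized, so the casing and punctuation of the remaining text are preserved.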

## **Tokenization**

Now let's look at Tokenization, the process of breaking a stream of raw text down into small chunks, such as words or sentences, called tokens.

### **NLTK**

**Word Tokenization**

This is the most popular method - it divides a piece of text into individual words based on a specific delimiter. Let's implement this using NLTK.

In [66]:
# Word Tokenization using NLTK Library
result = word_tokenize(text)
result

['When',
 'Sebastian',
 'Thrun',
 'started',
 'working',
 'on',
 'self-driving',
 'cars',
 'at',
 'Google',
 'in',
 '2007',
 ',',
 'few',
 'people',
 'outside',
 'of',
 'the',
 'company',
 'took',
 'him',
 'seriously',
 '.',
 'I',
 'can',
 'tell',
 'you',
 'very',
 'senior',
 'CEOs',
 'of',
 'major',
 'American',
 'car',
 'companies',
 'would',
 'shake',
 'my',
 'hand',
 'and',
 'turn',
 'away',
 'because',
 'I',
 'wasn',
 '’',
 't',
 'worth',
 'talking',
 'to',
 ',',
 'said',
 'Thrun',
 ',',
 'in',
 'an',
 'interview',
 'with',
 'Recode',
 'earlier',
 'this',
 'week']

In [67]:
len(result)

62

- As we see, we got 62 word tokens for this text data. 

**Now let's look at Sentence Tokenization and how it breaks the text down by sentence.**

We shall implement this with NLTK as well.

In [68]:
# NLTK

result = sent_tokenize(text)
print(result)

print(" \nNumber of Sentence Tokens", len(result))

['When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously.', 'I can tell you very senior CEOs of major American car companies would shake my hand and turn away because I wasn’t worth talking to, said Thrun, in an interview with Recode earlier this week']
 
Number of Sentence Tokens 2


- We observe that our text data is composed of merely 2 sentence tokens. 
- This may hence be a more apt method for larger documents with multiple sentences.

### **spaCy**

Let's now apply word tokenization and sentence tokenization using the spaCy library.

In [69]:
tokens = []
for token in doc:
    tokens.append(token.text)

tokens

['When',
 'Sebastian',
 'Thrun',
 'started',
 'working',
 'on',
 'self',
 '-',
 'driving',
 'cars',
 'at',
 'Google',
 'in',
 '2007',
 ',',
 'few',
 'people',
 'outside',
 'of',
 'the',
 'company',
 'took',
 'him',
 'seriously',
 '.',
 'I',
 'can',
 'tell',
 'you',
 'very',
 'senior',
 'CEOs',
 'of',
 'major',
 'American',
 'car',
 'companies',
 'would',
 'shake',
 'my',
 'hand',
 'and',
 'turn',
 'away',
 'because',
 'I',
 'was',
 'n’t',
 'worth',
 'talking',
 'to',
 ',',
 'said',
 'Thrun',
 ',',
 'in',
 'an',
 'interview',
 'with',
 'Recode',
 'earlier',
 'this',
 'week']

In [70]:
len(tokens)

63

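- Compared to NLTK's 62 tokens, spaCy created 63 word tokens for the given text data. 
- The difference comes from two places: spaCy splits 'self-driving' into three tokens ('self', '-', 'driving') where NLTK keeps it as one, and spaCy splits 'wasn’t' into two tokens ('was', 'n’t') where NLTK produces three ('wasn', '’', 't').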
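
How a tokenizer treats hyphens is a design choice. The sketch below illustrates two such policies with regular expressions from the standard library. Neither NLTK nor spaCy uses these exact rules; the patterns and function names are assumptions made for illustration only.

```python
import re

def tokenize_keep_hyphens(text):
    """Keep words joined by internal hyphens or apostrophes as one token."""
    # Anything that is neither a word character nor whitespace becomes its own token
    return re.findall(r"\w+(?:[-'’]\w+)*|[^\w\s]", text)

def tokenize_split_hyphens(text):
    """Treat every run of word characters and every punctuation mark as a token."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "self-driving cars, in 2007."
print(tokenize_keep_hyphens(sample))
print(tokenize_split_hyphens(sample))
```

On this sample the first policy yields 6 tokens and the second yields 8, which is the same kind of gap that shows up between the two libraries above.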

In [33]:
# Sentence Tokenization using spaCy

# Adding the rule-based 'sentencizer' component to the pipeline
sbd = nlp.add_pipe('sentencizer')

# The "nlp" object is used to create documents with linguistic annotations
content = nlp(text)

# Create the list of sentence tokens
sents_list = []

for sent in content.sents:
    sents_list.append(sent.text)
sents_list

['When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously.',
 'I can tell you very senior CEOs of major American car companies would shake my hand and turn away because I wasn’t worth talking to, said Thrun, in an interview with Recode earlier this week']

- spaCy also creates the same two sentence tokens from the text data, the way NLTK does.

## **Stemming**

Stemming is the process of reducing a word to its stem or root form.

Let us implement Stemming using the NLTK library.

In [71]:
# Creating object for the PorterStemmer() class
ps = PorterStemmer()

porter_stems = []

# Word Tokenization using NLTK Library
token_data = word_tokenize(text)

for word in token_data:
    # Taking root word from actual word
    result = ps.stem(word)
    # Appending the root word into porter_stems list
    porter_stems.append(result)

In [72]:
porter_stems

['when',
 'sebastian',
 'thrun',
 'start',
 'work',
 'on',
 'self-driv',
 'car',
 'at',
 'googl',
 'in',
 '2007',
 ',',
 'few',
 'peopl',
 'outsid',
 'of',
 'the',
 'compani',
 'took',
 'him',
 'serious',
 '.',
 'i',
 'can',
 'tell',
 'you',
 'veri',
 'senior',
 'ceo',
 'of',
 'major',
 'american',
 'car',
 'compani',
 'would',
 'shake',
 'my',
 'hand',
 'and',
 'turn',
 'away',
 'becaus',
 'i',
 'wasn',
 '’',
 't',
 'worth',
 'talk',
 'to',
 ',',
 'said',
 'thrun',
 ',',
 'in',
 'an',
 'interview',
 'with',
 'recod',
 'earlier',
 'thi',
 'week']

- As we can see, we have converted each token into its respective root word / stem. Let's create a dataframe and compare how the tokens get converted into their root words.
- Similar to how this was done with the Porter Stemmer, you can implement this for the other available stemming techniques, such as the Snowball Stemmer and the Lancaster Stemmer.

**Note:**
spaCy does not provide stemmers; it relies on lemmatization instead.

In [73]:
# Creating the data frame with actual token and its respective stemmed word

df = pd.DataFrame({'tokens':token_data,'stem_words':porter_stems})

In [74]:
df.head()

|   | tokens | stem_words |
|---|--------|------------|
| 0 | When | when |
| 1 | Sebastian | sebastian |
| 2 | Thrun | thrun |
| 3 | started | start |
| 4 | working | work |


## **Lemmatization**

Lemmatization is similar to but subtly different from Stemming - it is the transformation that uses a dictionary to map a word’s variant back to its root format.
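
The difference can be sketched with two toy functions: one strips suffixes by rule, the other looks words up in a dictionary. This is a deliberately naive illustration and not the Porter algorithm or the WordNet lemmatizer; the suffix list, the lookup table, and both function names are hypothetical.

```python
# Hypothetical suffix rules, far simpler than Porter's rule set
SUFFIXES = ("ing", "ed", "es", "s")

# Hypothetical toy dictionary; a real lemmatizer relies on a full lexicon
LEMMA_TABLE = {"took": "take", "said": "say", "companies": "company"}

def toy_stem(word):
    """Strip the first matching suffix, whether or not the result is a real word."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip when at least three characters of stem remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    return LEMMA_TABLE.get(word.lower(), word)

for w in ["working", "started", "companies", "took"]:
    print(w, toy_stem(w), toy_lemmatize(w))
```

The rule-based function turns `companies` into the non-word `compani` and cannot handle the irregular form `took`, while the dictionary lookup returns valid words but only for entries it knows about.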

### **NLTK**

In [75]:
# Implementation using NLTK

lemmatizer = WordNetLemmatizer()

lemma = []

# Word tokenization using NLTK Library
token_data = word_tokenize(text)

for word in token_data:
    # Taking lemma word from actual word
    result = lemmatizer.lemmatize(word)
    # Appending the lemma into result list   
    lemma.append(result)

We have converted each token into its respective root lemma. Let's create a DataFrame and compare how the tokens get converted into their root words.

In [76]:
# Creating the data frame with actual token and its respective lemma word for visibility
new_df = pd.DataFrame({'tokens':token_data,'lemma_words':lemma})

In [77]:
new_df.tail()

|   | tokens | lemma_words |
|---|--------|-------------|
| 57 | with | with |
| 58 | Recode | Recode |
| 59 | earlier | earlier |
| 60 | this | this |
| 61 | week | week |


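- We observe that in these examples, there is no change between the tokens and their lemmas, as the tokens already appear to be in their root form. Note that `lemmatize()` treats every word as a noun unless a `pos` argument is passed, so verb forms such as 'started' are also left unchanged.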

### **spaCy**

In [78]:
# Processing the text with the full en_core_web_sm pipeline (en), since the blank nlp pipeline has no lemmatizer
doc = en(text)

In [79]:
tokens = []
spacy_lemma = []

for word in doc:
    # Appending tokens into tokens list
    tokens.append(word.text)
    # Storing the lemma into spacy_lemma list
    spacy_lemma.append(word.lemma_)

In [80]:
# Creating the data frame with actual token and its respective lemma word for visibility
sp_df = pd.DataFrame({'tokens':tokens,'lemma_words':spacy_lemma})

In [81]:
sp_df.head()

|   | tokens | lemma_words |
|---|--------|-------------|
| 0 | When | when |
| 1 | Sebastian | Sebastian |
| 2 | Thrun | Thrun |
| 3 | started | start |
| 4 | working | work |


## **Part-of-speech Tagging (POS Tagging)**

Part-of-speech (POS) tagging is a popular Natural Language Processing method which refers to categorizing words in a text (corpus) in correspondence with a particular part-of-speech, depending on the definition of the word and its context.

For example:
The word **when** can be tagged as a **subordinating conjunction**.


### **NLTK**

In [83]:
output = nltk.pos_tag(token_data)
output

[('When', 'WRB'),
 ('Sebastian', 'JJ'),
 ('Thrun', 'NNP'),
 ('started', 'VBD'),
 ('working', 'VBG'),
 ('on', 'IN'),
 ('self-driving', 'JJ'),
 ('cars', 'NNS'),
 ('at', 'IN'),
 ('Google', 'NNP'),
 ('in', 'IN'),
 ('2007', 'CD'),
 (',', ','),
 ('few', 'JJ'),
 ('people', 'NNS'),
 ('outside', 'IN'),
 ('of', 'IN'),
 ('the', 'DT'),
 ('company', 'NN'),
 ('took', 'VBD'),
 ('him', 'PRP'),
 ('seriously', 'RB'),
 ('.', '.'),
 ('I', 'PRP'),
 ('can', 'MD'),
 ('tell', 'VB'),
 ('you', 'PRP'),
 ('very', 'RB'),
 ('senior', 'JJ'),
 ('CEOs', 'NNP'),
 ('of', 'IN'),
 ('major', 'JJ'),
 ('American', 'JJ'),
 ('car', 'NN'),
 ('companies', 'NNS'),
 ('would', 'MD'),
 ('shake', 'VB'),
 ('my', 'PRP$'),
 ('hand', 'NN'),
 ('and', 'CC'),
 ('turn', 'VB'),
 ('away', 'RB'),
 ('because', 'IN'),
 ('I', 'PRP'),
 ('wasn', 'VBP'),
 ('’', 'JJ'),
 ('t', 'NN'),
 ('worth', 'NN'),
 ('talking', 'VBG'),
 ('to', 'TO'),
 (',', ','),
 ('said', 'VBD'),
 ('Thrun', 'NNP'),
 (',', ','),
 ('in', 'IN'),
 ('an', 'DT'),
 ('interview', 'NN'),
 (

### **spaCy**

In [84]:
# Tagging the tokens
for token in doc:
    print(token, token.pos_)

When SCONJ
Sebastian PROPN
Thrun PROPN
started VERB
working VERB
on ADP
self NOUN
- PUNCT
driving VERB
cars NOUN
at ADP
Google PROPN
in ADP
2007 NUM
, PUNCT
few ADJ
people NOUN
outside ADV
of ADP
the DET
company NOUN
took VERB
him PRON
seriously ADV
. PUNCT
I PRON
can AUX
tell VERB
you PRON
very ADV
senior ADJ
CEOs NOUN
of ADP
major ADJ
American ADJ
car NOUN
companies NOUN
would AUX
shake VERB
my PRON
hand NOUN
and CCONJ
turn VERB
away ADV
because SCONJ
I PRON
was AUX
n’t PART
worth ADJ
talking VERB
to ADP
, PUNCT
said VERB
Thrun PROPN
, PUNCT
in ADP
an DET
interview NOUN
with ADP
Recode PROPN
earlier ADV
this DET
week NOUN


- As we see, each token gets tagged with its respective part of speech. Let's understand a few of these tags:
  - VERB: verbs (all tenses and modes)
  - NOUN: common nouns
  - PROPN: proper nouns
  - PRON: pronouns
  - ADJ: adjectives
  - ADV: adverbs
  - ADP: adpositions (prepositions and postpositions)
  - CCONJ / SCONJ: coordinating and subordinating conjunctions

spaCy's `pos_` attribute uses the Universal POS tag set, whereas NLTK's `pos_tag` returns Penn Treebank tags such as `WRB` and `NNP`. The Penn Treebank tags can be understood by referring to [this article](https://cs.nyu.edu/~grishman/jet/guide/PennPOS.html).

In [85]:
# Collect the verb tokens
print("Verbs:", [token.text for token in doc if token.pos_ == "VERB"])

Verbs: ['started', 'working', 'driving', 'took', 'tell', 'shake', 'turn', 'talking', 'said']


In [86]:
# Collect the noun tokens
print("Nouns:", [token.text for token in doc if token.pos_ == "NOUN"])

Nouns: ['self', 'cars', 'people', 'company', 'CEOs', 'car', 'companies', 'hand', 'interview', 'week']
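
The two filters above scan the document once per tag. When several tags are of interest, the tokens can be grouped in a single pass instead. The sketch below works on plain `(token, tag)` pairs, so it applies equally to pairs built from spaCy tokens or to the list returned by `nltk.pos_tag`; the function name `group_by_tag` is hypothetical, and the sample pairs are copied from the first rows of the spaCy output above.

```python
from collections import defaultdict

# (token, tag) pairs copied from the first rows of the spaCy output
tagged = [
    ("When", "SCONJ"), ("Sebastian", "PROPN"), ("Thrun", "PROPN"),
    ("started", "VERB"), ("working", "VERB"), ("on", "ADP"),
]

def group_by_tag(pairs):
    """Map each tag to the list of tokens that carry it, in document order."""
    groups = defaultdict(list)
    for token, tag in pairs:
        groups[tag].append(token)
    return dict(groups)

print(group_by_tag(tagged))
```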


- We have successfully tagged every word token present in the data. Now let's look at collocations.

## **Collocations: Bigrams and Trigrams**

What are Collocations?

Collocations are groups of words that occur together many times in a document. They are scored by how often the group occurs relative to the overall word count of the document. The candidate groups are N-grams: contiguous sequences of N words, symbols, or tokens in a document.

- **Unigrams:** An N-gram consisting of a single item from a sequence.
- **Bigrams:** An N-gram consisting of a combination of two words from a sequence.
- **Trigrams:** An N-gram consisting of a combination of three words from a sequence.

**Unigrams** are the same as the word tokens generated from the corpus. 

We've already tokenized the data to obtain these Unigrams, so let's now obtain Bigrams and Trigrams from the same corpus.
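
Before generating the N-grams with NLTK, here is a minimal sketch of the counting idea behind collocations: count how often each pair of adjacent tokens occurs and relate that to the total number of pairs. The toy sentence is made up for illustration, and raw relative frequency is only the simplest possible score.

```python
from collections import Counter

# Made-up toy sentence with a repeated word pair
tokens = "the car company and the car dealer and the car".split()

# Pair each token with its successor to form the bigrams
bigrams = list(zip(tokens, tokens[1:]))
counts = Counter(bigrams)
total = len(bigrams)

# Show the two most frequent pairs with their relative frequency
for pair, count in counts.most_common(2):
    print(pair, count, round(count / total, 2))
```

Pairs that occur far more often than the rest, such as `('the', 'car')` here, are the collocation candidates.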

### **Bigrams**

In [87]:
Tokens = nltk.word_tokenize(text)
output = list(nltk.bigrams(Tokens))
print(output)

[('When', 'Sebastian'), ('Sebastian', 'Thrun'), ('Thrun', 'started'), ('started', 'working'), ('working', 'on'), ('on', 'self-driving'), ('self-driving', 'cars'), ('cars', 'at'), ('at', 'Google'), ('Google', 'in'), ('in', '2007'), ('2007', ','), (',', 'few'), ('few', 'people'), ('people', 'outside'), ('outside', 'of'), ('of', 'the'), ('the', 'company'), ('company', 'took'), ('took', 'him'), ('him', 'seriously'), ('seriously', '.'), ('.', 'I'), ('I', 'can'), ('can', 'tell'), ('tell', 'you'), ('you', 'very'), ('very', 'senior'), ('senior', 'CEOs'), ('CEOs', 'of'), ('of', 'major'), ('major', 'American'), ('American', 'car'), ('car', 'companies'), ('companies', 'would'), ('would', 'shake'), ('shake', 'my'), ('my', 'hand'), ('hand', 'and'), ('and', 'turn'), ('turn', 'away'), ('away', 'because'), ('because', 'I'), ('I', 'wasn'), ('wasn', '’'), ('’', 't'), ('t', 'worth'), ('worth', 'talking'), ('talking', 'to'), ('to', ','), (',', 'said'), ('said', 'Thrun'), ('Thrun', ','), (',', 'in'), ('in'

We observe that we have created the Bigrams, i.e., combinations of two words, from the available text.

### **Trigrams**

In [88]:
Tokens = nltk.word_tokenize(text)
output = list(nltk.trigrams(Tokens))
print(output)

[('When', 'Sebastian', 'Thrun'), ('Sebastian', 'Thrun', 'started'), ('Thrun', 'started', 'working'), ('started', 'working', 'on'), ('working', 'on', 'self-driving'), ('on', 'self-driving', 'cars'), ('self-driving', 'cars', 'at'), ('cars', 'at', 'Google'), ('at', 'Google', 'in'), ('Google', 'in', '2007'), ('in', '2007', ','), ('2007', ',', 'few'), (',', 'few', 'people'), ('few', 'people', 'outside'), ('people', 'outside', 'of'), ('outside', 'of', 'the'), ('of', 'the', 'company'), ('the', 'company', 'took'), ('company', 'took', 'him'), ('took', 'him', 'seriously'), ('him', 'seriously', '.'), ('seriously', '.', 'I'), ('.', 'I', 'can'), ('I', 'can', 'tell'), ('can', 'tell', 'you'), ('tell', 'you', 'very'), ('you', 'very', 'senior'), ('very', 'senior', 'CEOs'), ('senior', 'CEOs', 'of'), ('CEOs', 'of', 'major'), ('of', 'major', 'American'), ('major', 'American', 'car'), ('American', 'car', 'companies'), ('car', 'companies', 'would'), ('companies', 'would', 'shake'), ('would', 'shake', 'my'),

We observe that here we have created Trigrams, i.e., combinations of three words, from the available text.

**Note:** For the spaCy library, we don't have direct functions to create Bigrams, Trigrams and N-grams. That can be implemented through simple Python functions where we use N as a control parameter to slice that number of words at a time from the whole corpus.
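
A minimal sketch of such a function is shown below. It works on any list of tokens, so with spaCy the input could be built as `[token.text for token in doc]`; the function name `ngrams` and the short sample list are assumptions made for the example.

```python
def ngrams(tokens, n):
    """Return all contiguous windows of n tokens as tuples."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Slide a window of size n over the tokens, one position at a time
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sample_tokens = ["When", "Sebastian", "Thrun", "started", "working"]
print(ngrams(sample_tokens, 2))
print(ngrams(sample_tokens, 3))
```

If the token list is shorter than N, the function returns an empty list rather than raising an error.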

## **Named Entity Recognition (NER)**

Named Entity Recognition (NER) is an NLP-based technique to identify mentions of rigid designators from the text belonging to particular semantic types such as a person, location, organization etc.


<table width="700">
<tr>
<th>Type</th>
<th>Description</th>
</tr>
<tr>
<td>LOC</td>
<td> Non-GPE locations, mountain ranges, bodies of water</td></tr>
<tr>
<td>MONEY</td>
<td> Monetary values, including unit.</td></tr>
<tr>
<td>ORG</td>
<td> Companies, agencies, institutions etc.</td></tr>
<tr>
<td>PERSON</td>
<td> People, including fictional</td></tr>
<tr>
<td>NORP</td>
<td> Nationalities or religious or political groups</td></tr>
<tr>
<td>FAC</td>
<td> Buildings, airports, highways, bridges, etc.</td></tr>
<tr>
<td>GPE</td>
<td> Countries, cities, states.</td></tr>
<tr><td>LANGUAGE</td>
<td> Any named language</td></tr>
<tr><td>PRODUCT</td>
<td> Objects, vehicles, foods, etc. (Not services)</td></tr>
<tr><td>LAW</td>
<td>Named documents made into laws</td></tr>
<tr><td>EVENT</td>
<td> Named hurricanes, battles, wars, sports events, etc.</td></tr>
<tr><td>DATE</td>
<td> Absolute or relative dates or periods</td></tr>
<tr><td>TIME</td>
<td> Times smaller than a day</td></tr>
</table>

We observe from the above table that there are several types of Named Entities with their own descriptions.

Now, let's implement the same using both the NLTK and spaCy libraries.

### **NLTK**

In [89]:
# NLTK's ne_chunk() recognizes named entities using a classifier that adds category labels such as PERSON, ORGANIZATION, and GPE.
# The text is first tokenized, then each token is POS-tagged, and finally the entities are identified and labeled.

ne_tree = ne_chunk(nltk.pos_tag(word_tokenize(text)))

# Displaying the results
print(ne_tree)

(S
  When/WRB
  (PERSON Sebastian/JJ Thrun/NNP)
  started/VBD
  working/VBG
  on/IN
  self-driving/JJ
  cars/NNS
  at/IN
  (ORGANIZATION Google/NNP)
  in/IN
  2007/CD
  ,/,
  few/JJ
  people/NNS
  outside/IN
  of/IN
  the/DT
  company/NN
  took/VBD
  him/PRP
  seriously/RB
  ./.
  I/PRP
  can/MD
  tell/VB
  you/PRP
  very/RB
  senior/JJ
  (ORGANIZATION CEOs/NNP)
  of/IN
  major/JJ
  (GPE American/JJ)
  car/NN
  companies/NNS
  would/MD
  shake/VB
  my/PRP$
  hand/NN
  and/CC
  turn/VB
  away/RB
  because/IN
  I/PRP
  wasn/VBP
  ’/JJ
  t/NN
  worth/NN
  talking/VBG
  to/TO
  ,/,
  said/VBD
  (PERSON Thrun/NNP)
  ,/,
  in/IN
  an/DT
  interview/NN
  with/IN
  (PERSON Recode/NNP)
  earlier/RBR
  this/DT
  week/NN)


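- We observe that NLTK groups some tokens into named-entity chunks (PERSON, ORGANIZATION, GPE), while the remaining tokens keep only their POS tags. Not every label is correct: for example, **CEOs** is tagged as an ORGANIZATION and **Recode** as a PERSON.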

### **spaCy**

In [90]:
print("Named entities in the given text are\n")

# doc is the processed text; doc.ents gives the entities found in it

for ent in doc.ents: 

    # Printing the entity text and its label
    print(ent.text,ent.label_)

Named entities in the given text are

Sebastian Thrun PERSON
Google ORG
2007 DATE
American NORP
Thrun GPE
Recode ORG
earlier this week DATE


- We can infer that **Sebastian Thrun** is the name of a **PERSON**, **2007** is a **DATE**, **American** is a **Nationality** (NORP), etc. Note that the second mention of **Thrun** is mislabeled as a **GPE**.
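
To compare the two results side by side, the sketch below puts the entities from both outputs above into dictionaries and reports where the libraries agree. The entity lists are copied by hand from the outputs; treating NLTK's `ORGANIZATION` as equivalent to spaCy's `ORG` is an assumption made here to line the label sets up.

```python
# Entities copied from the NLTK ne_chunk output above
nltk_ents = {
    "Sebastian Thrun": "PERSON", "Google": "ORGANIZATION", "CEOs": "ORGANIZATION",
    "American": "GPE", "Thrun": "PERSON", "Recode": "PERSON",
}

# Entities copied from the spaCy doc.ents output above
spacy_ents = {
    "Sebastian Thrun": "PERSON", "Google": "ORG", "2007": "DATE", "American": "NORP",
    "Thrun": "GPE", "Recode": "ORG", "earlier this week": "DATE",
}

# Assumed mapping between the two label sets
ALIASES = {"ORGANIZATION": "ORG"}

def normalize(label):
    return ALIASES.get(label, label)

shared = sorted(nltk_ents.keys() & spacy_ents.keys())
agree = [e for e in shared if normalize(nltk_ents[e]) == spacy_ents[e]]
differ = [e for e in shared if normalize(nltk_ents[e]) != spacy_ents[e]]

print("Same label:     ", agree)
print("Different label:", differ)
print("Only in NLTK:   ", sorted(nltk_ents.keys() - spacy_ents.keys()))
print("Only in spaCy:  ", sorted(spacy_ents.keys() - nltk_ents.keys()))
```

On this text the two libraries agree only on **Sebastian Thrun** and **Google**, and only spaCy reports the two DATE entities.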

## **Performance Comparison**

Finally, let's take a quantitative look at the speed of both libraries. 

For simplicity, let us look at the Word Tokenization operation.

### **Word Tokenization on NLTK vs spaCy**

In [91]:
# NLTK
Nw = %timeit -o nltk.tokenize.word_tokenize(text)

346 µs ± 24.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [92]:
# spaCy
Sw = %timeit -o nlp(doc)

41.5 µs ± 2.33 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In this run, **the spaCy call takes much less computational time than the NLTK call on the same corpus.** Note, however, that the spaCy cell passes the already-processed `doc` rather than the raw `text`, so the two timings are not strictly like-for-like.

spaCy is generally preferred in performance and deployment contexts for building finalized NLP business solutions. NLTK, on the other hand, is often preferred at the research and prototyping stage when coming up with proofs of concept, since speed matters less at that stage and NLTK offers comprehensive features for all major NLP operations.
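
Outside of a notebook, where the `%timeit` magic is not available, a comparison like this can be run with the standard library's `timeit` module. The sketch below is a small harness under stated assumptions: `time_call` is a hypothetical helper, and `str.split` is only a stand-in for a tokenizer. For a fair comparison, each tokenizer should receive the same raw string.

```python
import timeit

def time_call(func, arg, repeat=5, number=1000):
    """Return the best average time per call, in microseconds."""
    # Each run calls func(arg) `number` times; the fastest run is the least noisy
    runs = timeit.repeat(lambda: func(arg), repeat=repeat, number=number)
    return min(runs) / number * 1e6

sample = "When Sebastian Thrun started working on self-driving cars at Google in 2007"

# str.split stands in for a tokenizer such as word_tokenize or a spaCy pipeline
print(f"str.split: {time_call(str.split, sample):.2f} µs per call")
```

To time the libraries themselves, the same helper could be called with `word_tokenize` and with the spaCy pipeline object, passing `text` to both.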