**Created by Berkay Alan**

**Natural Language Processing Series 3 - Object Standardization and Linguistic Features**

**25 February 2022**

For more tutorials: https://www.kaggle.com/berkayalan

# Content

- Object Standardization

- Linguistic Features : N-Gram

- Linguistic Features : Part of speech tagging (POS)

- Linguistic Features : Chunking (Shallow Parsing)

- Linguistic Features : Noun Chunks

- Linguistic Features : Named Entity Recognition (NER)

- Linguistic Features : Visualization in Spacy

***

For the string essentials, please check the first part [here](https://github.com/berkayalan/Data-Science-Tutorials/blob/master/Natural%20Language%20Processing/Natural%20Language%20Proccessing%20Series%201-%20String%20Essentials.ipynb).

For the text preprocessing, please check the second part [here](https://github.com/berkayalan/Data-Science-Tutorials/blob/master/Natural%20Language%20Processing/Natural%20Language%20Proccessing%20Series%202-%20Text%20Preprocessing.ipynb).

# Resources

- [**What Is Natural Language Processing?**](https://machinelearningmastery.com/natural-language-processing/)

- [**Text Processing in Python**](https://towardsdatascience.com/text-processing-in-python-29e86ea4114c)

- [**Ultimate Guide to Understand and Implement Natural Language Processing (with codes in Python)**](https://www.analyticsvidhya.com/blog/2017/01/ultimate-guide-to-understand-implement-natural-language-processing-codes-in-python/)

- [**Natural Language Processing (NLP) with BERT**](https://www.udemy.com/course/natural-language-processing-with-bert/learn/lecture/18889316?start=0#overview)

- [**How to Clean Text for Machine Learning with Python**](https://machinelearningmastery.com/clean-text-machine-learning-python/)

- [**Dropping common terms: stop words by StanfordNLP**](https://nlp.stanford.edu/IR-book/html/htmledition/dropping-common-terms-stop-words-1.html)

- [**Natural Language Processing With Python's NLTK Package**](https://realpython.com/nltk-nlp-python/)

- [**What Are n-grams and How to Implement Them in Python?**](https://www.analyticsvidhya.com/blog/2021/09/what-are-n-grams-and-how-to-implement-them-in-python/#:~:text=N%2Dgrams%20are%20continuous%20sequences,(Natural%20Language%20Processing)%20tasks.)

- [**N-Gram Language Modelling with NLTK**](https://www.geeksforgeeks.org/n-gram-language-modelling-with-nltk/)

- [**Neural Network Embeddings Explained**](https://towardsdatascience.com/neural-network-embeddings-explained-4d028e6f0526)

- [**Linguistic Features of Spacy**](https://spacy.io/usage/linguistic-features)

# Importing Libraries

In [3]:
import nltk
#nltk.download('stopwords')
#nltk.download("punkt")
#nltk.download("wordnet")
#nltk.download("averaged_perceptron_tagger")
#nltk.download('maxent_ne_chunker')
#nltk.download('words')
from nltk.corpus import stopwords
from nltk import word_tokenize, pos_tag, ne_chunk
from textblob import TextBlob, Word
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn import model_selection, preprocessing, linear_model, naive_bayes, metrics
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import re
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import spacy
from spacy.tokens import Span
from spacy.matcher import PhraseMatcher
from spacy import displacy
import en_core_web_sm

from warnings import filterwarnings
filterwarnings('ignore')

We also need to load the *en_core_web_sm* model for the **spaCy** library.

In [25]:
nlp = en_core_web_sm.load()

To see all rows and columns, we raise the display limits that pandas applies to dataframes.

In [26]:
pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth', None)  # None disables truncation; -1 is deprecated

# Object Standardization

Text data often contains words or phrases that are not present in any standard lexical dictionary. Search engines and models do not recognize these pieces.

Examples include acronyms, hashtags with attached words, and colloquial slang. This type of noise can be fixed with regular expressions and manually prepared data dictionaries. The code below uses a dictionary lookup to replace social media slang in a text.

In [27]:
abbr_dict = {'rt':'Retweet', 'dm':'direct message', "awsm" : "awesome", "luv" :"love", "gg":"good game",
               'fb': 'Facebook' , 'ig': 'Instagram','li': 'LinkedIn','sc': 'SnapChat', 'tw': 'Twitter',
               'yt': 'YouTube', 'ds':'data science'}

In [28]:
def standardize_text(input_text):
    # Split on whitespace and replace each known abbreviation
    words = input_text.split()
    new_words = []
    for word in words:
        if word.lower() in abbr_dict:
            word = abbr_dict[word.lower()]
        new_words.append(word)
    # Join once, after the loop, so empty input returns an empty string
    return " ".join(new_words)

In [29]:
standardize_text("""Did you see the new post of Andrew NG on li ?
                 He was sad about Elon Musk's rt on TW saying ds is a hype.""")

"Did you see the new post of Andrew NG on LinkedIn ? He was sad about Elon Musk's Retweet on Twitter saying data science is a hype."

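The lookup above only matches tokens that are separated by whitespace, so an abbreviation with attached punctuation such as `li?` would be missed. A minimal sketch of a regex-based variant, assuming the same kind of dictionary (the names `SLANG` and `standardize_with_regex` are illustrative and not part of the original notebook):

```python
import re

# Illustrative subset of the abbreviation dictionary used above.
SLANG = {"rt": "Retweet", "li": "LinkedIn", "tw": "Twitter", "ds": "data science"}

# One alternation of all keys, anchored on word boundaries so that
# "ds" matches as a whole word but not inside "words".
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SLANG)) + r")\b",
    flags=re.IGNORECASE,
)

def standardize_with_regex(text):
    """Replace every whole-word abbreviation, keeping punctuation intact."""
    return _PATTERN.sub(lambda m: SLANG[m.group(0).lower()], text)

print(standardize_with_regex("Saw it on li? The rt on TW says ds words matter."))
```

Very short keys can still collide with ordinary words or names, so a dictionary like this is best kept small and domain-specific.
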
# Linguistic Features

 Processing raw text intelligently is difficult: most words are rare, and it’s common for words that look completely different to mean almost the same thing. The same words in a different order can mean something completely different. Even splitting text into useful word-like units can be difficult in many languages. While it’s possible to solve some problems starting from only the raw characters, it’s usually better to use linguistic knowledge to add useful information.

## N-Gram

An N-gram is a contiguous sequence of N items. Depending on the application, the items can be letters, words, or base pairs. N-grams are typically collected from a text or speech corpus (a long text dataset).

**N-gram Language Model**:

An N-gram language model estimates the probability of a given N-gram within any sequence of words in the language. A good N-gram model can predict the next word in a sentence, i.e. the value of p(w|h): the probability of word w given the history h of preceding words.
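
As a rough illustration of p(w|h) for a bigram model, where the history h is just the previous word, the probability can be estimated by maximum likelihood as count(h, w) / count(h). This is a minimal sketch on a toy sentence with no smoothing; the function name `bigram_probability` is an assumption made for this example:

```python
from collections import Counter

def bigram_probability(tokens, history, word):
    """Maximum-likelihood estimate of p(word | history) from a token list."""
    # Count how often each adjacent pair occurs.
    pair_counts = Counter(zip(tokens, tokens[1:]))
    # Count how often each word occurs as the first element of a pair.
    history_counts = Counter(tokens[:-1])
    if history_counts[history] == 0:
        return 0.0  # unseen history: no estimate without smoothing
    return pair_counts[(history, word)] / history_counts[history]

tokens = "the lion is the king of the jungle".split()
print(bigram_probability(tokens, "the", "lion"))  # "the" occurs 3 times, once before "lion"
print(bigram_probability(tokens, "lion", "is"))   # "lion" is always followed by "is"
```

Real language models add smoothing so that unseen pairs do not get a probability of exactly zero.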

**What are typical applications of N-gram models?**

N-grams with N > 1 are generally more informative as features than single words (unigrams), because they capture some of the word order. Bigrams (N = 2) are often considered among the most useful of these features.

For example, the unigrams of "This article is on NLP" are ("This", "article", "is", "on", "NLP") and its bigrams are ("This article", "article is", "is on", "on NLP").
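
The same unigrams and bigrams can be produced in plain Python before reaching for a library. A minimal sketch, where `make_ngrams` is a hypothetical helper name:

```python
def make_ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    # zip over n shifted views of the list; zip stops at the shortest view.
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["This", "article", "is", "on", "NLP"]
print(make_ngrams(tokens, 1))
print(make_ngrams(tokens, 2))
```

A list of k tokens yields k - n + 1 n-grams, so the five tokens here give five unigrams and four bigrams.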

We will use the **ngrams()** function of the TextBlob library here. It takes *N* as an input.

In [30]:
text = """Companies are very complex so sometimes it's hard to enforce the law when 
they are spread out around the world. 
Can we punish a parent company for something it's responsible for in another country?"""

In [31]:
TextBlob(text).ngrams(2)

[WordList(['Companies', 'are']),
 WordList(['are', 'very']),
 WordList(['very', 'complex']),
 WordList(['complex', 'so']),
 WordList(['so', 'sometimes']),
 WordList(['sometimes', 'it']),
 WordList(['it', "'s"]),
 WordList(["'s", 'hard']),
 WordList(['hard', 'to']),
 WordList(['to', 'enforce']),
 WordList(['enforce', 'the']),
 WordList(['the', 'law']),
 WordList(['law', 'when']),
 WordList(['when', 'they']),
 WordList(['they', 'are']),
 WordList(['are', 'spread']),
 WordList(['spread', 'out']),
 WordList(['out', 'around']),
 WordList(['around', 'the']),
 WordList(['the', 'world']),
 WordList(['world', 'Can']),
 WordList(['Can', 'we']),
 WordList(['we', 'punish']),
 WordList(['punish', 'a']),
 WordList(['a', 'parent']),
 WordList(['parent', 'company']),
 WordList(['company', 'for']),
 WordList(['for', 'something']),
 WordList(['something', 'it']),
 WordList(['it', "'s"]),
 WordList(["'s", 'responsible']),
 WordList(['responsible', 'for']),
 WordList(['for', 'in']),
 WordList(['in', 'anot

In [32]:
TextBlob(text).ngrams(3)

[WordList(['Companies', 'are', 'very']),
 WordList(['are', 'very', 'complex']),
 WordList(['very', 'complex', 'so']),
 WordList(['complex', 'so', 'sometimes']),
 WordList(['so', 'sometimes', 'it']),
 WordList(['sometimes', 'it', "'s"]),
 WordList(['it', "'s", 'hard']),
 WordList(["'s", 'hard', 'to']),
 WordList(['hard', 'to', 'enforce']),
 WordList(['to', 'enforce', 'the']),
 WordList(['enforce', 'the', 'law']),
 WordList(['the', 'law', 'when']),
 WordList(['law', 'when', 'they']),
 WordList(['when', 'they', 'are']),
 WordList(['they', 'are', 'spread']),
 WordList(['are', 'spread', 'out']),
 WordList(['spread', 'out', 'around']),
 WordList(['out', 'around', 'the']),
 WordList(['around', 'the', 'world']),
 WordList(['the', 'world', 'Can']),
 WordList(['world', 'Can', 'we']),
 WordList(['Can', 'we', 'punish']),
 WordList(['we', 'punish', 'a']),
 WordList(['punish', 'a', 'parent']),
 WordList(['a', 'parent', 'company']),
 WordList(['parent', 'company', 'for']),
 WordList(['company', 'for'

We can also generate n-grams with scikit-learn's `CountVectorizer`.

In [33]:
corpus = pd.Series([
    "The Lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species"
])

In [34]:
# Unigrams only (n=1)
vectorizer_ng1 = CountVectorizer(ngram_range=(1, 1))
ng1 = vectorizer_ng1.fit_transform(corpus)

# Unigrams and bigrams (n up to 2)
vectorizer_ng2 = CountVectorizer(ngram_range=(1, 2))
ng2 = vectorizer_ng2.fit_transform(corpus)

# Unigrams, bigrams and trigrams (n up to 3)
vectorizer_ng3 = CountVectorizer(ngram_range=(1, 3))
ng3 = vectorizer_ng3.fit_transform(corpus)


In [35]:
print(ng1)

  (0, 12)	3
  (0, 8)	1
  (0, 4)	1
  (0, 6)	1
  (0, 10)	1
  (0, 5)	1
  (1, 10)	1
  (1, 9)	1
  (1, 3)	1
  (1, 7)	1
  (1, 1)	1
  (2, 12)	1
  (2, 8)	1
  (2, 4)	1
  (2, 0)	1
  (2, 2)	1
  (2, 11)	1


In [36]:
# Print the number of features for each model
print("ng1, ng2 and ng3 have %i, %i and %i features respectively" % (ng1.shape[1], ng2.shape[1], ng3.shape[1]))

ng1, ng2 and ng3 have 13, 27 and 39 features respectively
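
To see where these feature counts come from, the vocabulary can be rebuilt by hand. The sketch below assumes `CountVectorizer`'s default behaviour (lowercasing and keeping only tokens of two or more word characters, which drops "a") and is not scikit-learn's actual implementation:

```python
import re

corpus = [
    "The Lion is the king of the jungle",
    "Lions have lifespans of a decade",
    "The lion is an endangered species",
]

def vocabulary(docs, max_n):
    """Collect the distinct 1..max_n word n-grams across all documents."""
    vocab = set()
    for doc in docs:
        # Lowercase, then keep tokens of two or more word characters.
        tokens = re.findall(r"\b\w\w+\b", doc.lower())
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                vocab.add(" ".join(tokens[i:i + n]))
    return vocab

print([len(vocabulary(corpus, n)) for n in (1, 2, 3)])  # [13, 27, 39]
```

Each step adds only the new n-grams of that length: 14 distinct bigrams on top of 13 unigrams, then 12 distinct trigrams.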


## Part of speech tagging (POS)

Part of speech is a grammatical term that deals with the roles words play when we use them together in sentences. Tagging parts of speech, or POS tagging, is the task of labeling the words in our text according to their part of speech.

We can get parts of speech from both TextBlob (`TextBlob.tags`) and spaCy (`token.pos_` and `token.tag_`). The fine-grained Penn Treebank tags and the coarse-grained universal tags are listed below.

| Abbrv| Details |
| --- | --- |
| CC | coordinating conjunction |
| CD | cardinal digit |
| DT | determiner |
| EX | existential there (like: “there is” … think of it like “there exists”) |
| FW | foreign word |
| IN | preposition/subordinating conjunction |
| JJ | adjective ‘big’ |
| JJR | adjective, comparative ‘bigger’  |
| JJS | adjective, superlative ‘biggest’  |
| LS | list marker 1) |
| MD | modal could, will |
| NN | noun, singular ‘desk’ |
| NNS | noun plural ‘desks’  |
| NNP | proper noun, singular ‘Harrison’  |
| NNPS | proper noun, plural ‘Americans’ |
| PDT | predeterminer ‘all the kids’  |
| POS | possessive ending parent’s |
| PRP | personal pronoun I, he, she |
| `PRP$` | possessive pronoun my, his, hers |
| RB | adverb very, silently |
| RBR | adverb, comparative better  |
| RBS | adverb, superlative best  |
| RP |  particle give up |
| TO |  to go ‘to‘ the store. |
| UH |  interjection errrrrrrrm |
| VB |  verb, base form take |
| VBD | verb, past tense took  |
| VBG | verb, gerund/present participle taking  |
| VBN | verb, past participle taken  |
| VBP | verb, non-3rd person singular present take |
| VBZ | verb, 3rd person sing. present takes  |
| WDT | wh-determiner which  |
| WP |  wh-pronoun who, what |
| WP$ | possessive wh-pronoun whose  |
| WRB | wh-adverb where, when |
| ADJ | adjective | 
| ADP | adposition | 
| ADV | adverb | 
| AUX | auxiliary | 
| CONJ | conjunction | 
| CCONJ | coordinating conjunction |
| DET | determiner | 
| INTJ | interjection | 
| NOUN | noun | 
| NUM | numeral | 
| PART | particle | 
| PRON | pronoun | 
| PROPN | proper noun |
| PUNCT | punctuation |
| SCONJ | subordinating conjunction |
| SYM | symbol | 
| VERB | verb | 

For a current list of tags for all languages visit https://spacy.io/api/annotation#pos-tagging

Now let's work through an example and assign each word its corresponding part of speech.

*Berkay is an amazing football player*.

**POS Tagging**

Berkay -> proper noun

is -> verb

an -> determiner

amazing -> adjective

football -> noun

player -> noun
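
The hand-made tagging above uses coarse labels, while taggers such as TextBlob return fine-grained Penn Treebank tags. A minimal sketch of collapsing fine tags into coarse ones; the mapping covers only a handful of tags and is an illustrative assumption, not the mapping that spaCy or NLTK use internally:

```python
# Hypothetical, partial mapping from fine-grained tags to coarse tags.
FINE_TO_COARSE = {
    "NN": "NOUN", "NNS": "NOUN",
    "NNP": "PROPN", "NNPS": "PROPN",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "DT": "DET", "IN": "ADP", "CD": "NUM", "UH": "INTJ",
}

def coarse_tag(fine_tag):
    """Map a fine-grained tag to a coarse one; 'X' marks unmapped tags."""
    return FINE_TO_COARSE.get(fine_tag, "X")

# The example sentence, tagged by hand with fine-grained tags.
tagged = [("Berkay", "NNP"), ("is", "VBZ"), ("an", "DT"),
          ("amazing", "JJ"), ("football", "NN"), ("player", "NN")]
print([(word, coarse_tag(tag)) for word, tag in tagged])
```

A real tagger may disagree with this table in places; spaCy, for instance, labels forms of "be" as `AUX` rather than `VERB`.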

Now let's get started with *TextBlob*.

In [37]:
text = """The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
"""

In [38]:
TextBlob(text).tags

[('The', 'DT'),
 ('titular', 'JJ'),
 ('threat', 'NN'),
 ('of', 'IN'),
 ('The', 'DT'),
 ('Blob', 'NNP'),
 ('has', 'VBZ'),
 ('always', 'RB'),
 ('struck', 'VBN'),
 ('me', 'PRP'),
 ('as', 'IN'),
 ('the', 'DT'),
 ('ultimate', 'JJ'),
 ('movie', 'NN'),
 ('monster', 'NN'),
 ('an', 'DT'),
 ('insatiably', 'RB'),
 ('hungry', 'JJ'),
 ('amoeba-like', 'JJ'),
 ('mass', 'NN'),
 ('able', 'JJ'),
 ('to', 'TO'),
 ('penetrate', 'VB'),
 ('virtually', 'RB'),
 ('any', 'DT'),
 ('safeguard', 'NN'),
 ('capable', 'JJ'),
 ('of', 'IN'),
 ('as', 'IN'),
 ('a', 'DT'),
 ('doomed', 'JJ'),
 ('doctor', 'NN'),
 ('chillingly', 'RB'),
 ('describes', 'VBZ'),
 ('it', 'PRP'),
 ('assimilating', 'VBG'),
 ('flesh', 'NN'),
 ('on', 'IN'),
 ('contact', 'NN'),
 ('Snide', 'JJ'),
 ('comparisons', 'NNS'),
 ('to', 'TO'),
 ('gelatin', 'VB'),
 ('be', 'VB'),
 ('damned', 'VBN'),
 ('it', 'PRP'),
 ("'s", 'VBZ'),
 ('a', 'DT'),
 ('concept', 'NN'),
 ('with', 'IN'),
 ('the', 'DT'),
 ('most', 'RBS'),
 ('devastating', 'JJ'),
 ('of', 'IN'),
 ('potenti

In [39]:
text = """Companies are very complex so sometimes it's hard to enforce the law when 
they are spread out around the world. 
Can we punish a parent company for something it's responsible for in another country?"""

In [40]:
TextBlob(text).pos_tags # pos_tags is also possible

[('Companies', 'NNS'),
 ('are', 'VBP'),
 ('very', 'RB'),
 ('complex', 'JJ'),
 ('so', 'RB'),
 ('sometimes', 'VBZ'),
 ('it', 'PRP'),
 ("'s", 'VBZ'),
 ('hard', 'JJ'),
 ('to', 'TO'),
 ('enforce', 'VB'),
 ('the', 'DT'),
 ('law', 'NN'),
 ('when', 'WRB'),
 ('they', 'PRP'),
 ('are', 'VBP'),
 ('spread', 'VBN'),
 ('out', 'RP'),
 ('around', 'IN'),
 ('the', 'DT'),
 ('world', 'NN'),
 ('Can', 'MD'),
 ('we', 'PRP'),
 ('punish', 'VB'),
 ('a', 'DT'),
 ('parent', 'NN'),
 ('company', 'NN'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('it', 'PRP'),
 ("'s", 'VBZ'),
 ('responsible', 'JJ'),
 ('for', 'IN'),
 ('in', 'IN'),
 ('another', 'DT'),
 ('country', 'NN')]

Let's do the same with the **spaCy** library.

We can also obtain a particular token by its index position.

* To view the coarse POS tag, we can use `token.pos_`
* To view the fine-grained tag, we can use `token.tag_`
* To view the description of either type of tag, we can use `spacy.explain(tag)`

Note that `token.pos` and `token.tag` return integer hash values; by adding the underscores we get the text equivalent that lives in **doc.vocab**.

In [41]:
text = """Companies are very complex so sometimes it's hard to enforce the law when 
they are spread out around the world. 
Can we punish a parent company for something it's responsible for in another country?"""

In [42]:
doc = nlp(text)

In [43]:
doc.text

"Companies are very complex so sometimes it's hard to enforce the law when \nthey are spread out around the world. \nCan we punish a parent company for something it's responsible for in another country?"

In [44]:
doc[3].text

'complex'

In [45]:
doc[3].pos_

'ADJ'

In [46]:
doc[3].tag_

'JJ'

In [47]:
spacy.explain(doc[3].tag_)

'adjective (English), other noun-modifier (Chinese)'

**Note:** In the English language, the same string of characters can have different meanings, even within the same sentence. For this reason, morphology is important. spaCy uses machine learning algorithms to best predict the use of a token in a sentence.

We can apply this technique to the entire Doc object:

In [48]:
for token in doc:
    print(f'{token.text:{10}} {token.pos_:{8}} {token.tag_:{6}} {spacy.explain(token.tag_)}')

Companies  NOUN     NNS    noun, plural
are        AUX      VBP    verb, non-3rd person singular present
very       ADV      RB     adverb
complex    ADJ      JJ     adjective (English), other noun-modifier (Chinese)
so         ADV      RB     adverb
sometimes  ADV      RB     adverb
it         PRON     PRP    pronoun, personal
's         AUX      VBZ    verb, 3rd person singular present
hard       ADJ      JJ     adjective (English), other noun-modifier (Chinese)
to         PART     TO     infinitival "to"
enforce    VERB     VB     verb, base form
the        DET      DT     determiner
law        NOUN     NN     noun, singular or mass
when       SCONJ    WRB    wh-adverb

          SPACE    _SP    whitespace
they       PRON     PRP    pronoun, personal
are        AUX      VBP    verb, non-3rd person singular present
spread     VERB     VBN    verb, past participle
out        ADP      RP     adverb, particle
around     ADP      IN     conjunction, subordinating or preposition
the      

In [49]:
nlp = en_core_web_sm.load()

Let's define a function that returns the number of proper nouns in a text.

In [50]:
def proper_nouns(text, model=nlp):
    # Create doc object
    doc = model(text)
    # Generate list of coarse POS tags
    pos = [token.pos_ for token in doc]

    # Return number of proper nouns (count "NOUN" instead for common nouns)
    return pos.count("PROPN")

In [51]:
print(proper_nouns("Abdul, Bill and Cathy went to the market to buy apples.", nlp))

3


The `Doc.count_by()` method accepts a specific token attribute as its argument, and returns a frequency count of the given attribute as a dictionary object. Keys in the dictionary are the integer values of the given attribute ID, and values are the frequency. Counts of zero are not included.

In [52]:
text

"Companies are very complex so sometimes it's hard to enforce the law when \nthey are spread out around the world. \nCan we punish a parent company for something it's responsible for in another country?"

In [53]:
doc = nlp(text)

Let's count the frequencies of different coarse-grained POS tags. It returns a dictionary with vocab indexes and counts.

In [54]:
doc.count_by(spacy.attrs.POS)

{92: 6,
 87: 5,
 86: 3,
 84: 3,
 95: 5,
 94: 1,
 100: 3,
 90: 4,
 98: 1,
 103: 2,
 85: 5,
 97: 2}

In [55]:
doc.vocab[92].text

'NOUN'

In [56]:
for key,value in sorted(doc.count_by(spacy.attrs.POS).items()):
    print(f"{key}. {doc.vocab[key].text:{7}} {value}")

84. ADJ     3
85. ADP     5
86. ADV     3
87. AUX     5
90. DET     4
92. NOUN    6
94. PART    1
95. PRON    5
97. PUNCT   2
98. SCONJ   1
100. VERB    3
103. SPACE   2
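
A comparable frequency table can be built without spaCy from any list of (word, tag) pairs, for example the TextBlob output shown earlier. A minimal sketch with `collections.Counter`, using only the first few pairs of that output as sample data:

```python
from collections import Counter

# First few (word, tag) pairs from the TextBlob output shown earlier.
tagged = [("Companies", "NNS"), ("are", "VBP"), ("very", "RB"),
          ("complex", "JJ"), ("so", "RB"), ("sometimes", "VBZ"),
          ("it", "PRP"), ("'s", "VBZ"), ("hard", "JJ")]

tag_counts = Counter(tag for _, tag in tagged)

# As with Doc.count_by(), tags that never occur are simply absent.
for tag, count in sorted(tag_counts.items()):
    print(f"{tag:{5}} {count}")
```

Here the keys are the tag strings themselves, so no lookup through a vocabulary is needed to make the table readable.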


## Chunking (shallow parsing)

Chunking makes use of POS tags to group words and apply chunk tags to those groups. Chunks don’t overlap, so one instance of a word can be in only one chunk at a time.

In [57]:
text = """Companies are very complex so sometimes it's hard to enforce the law when 
they are spread out around the world. 
Can we punish a parent company for something it's responsible for in another country?"""


In [58]:
speech_tags = TextBlob(text).tags

In [59]:
speech_tags

[('Companies', 'NNS'),
 ('are', 'VBP'),
 ('very', 'RB'),
 ('complex', 'JJ'),
 ('so', 'RB'),
 ('sometimes', 'VBZ'),
 ('it', 'PRP'),
 ("'s", 'VBZ'),
 ('hard', 'JJ'),
 ('to', 'TO'),
 ('enforce', 'VB'),
 ('the', 'DT'),
 ('law', 'NN'),
 ('when', 'WRB'),
 ('they', 'PRP'),
 ('are', 'VBP'),
 ('spread', 'VBN'),
 ('out', 'RP'),
 ('around', 'IN'),
 ('the', 'DT'),
 ('world', 'NN'),
 ('Can', 'MD'),
 ('we', 'PRP'),
 ('punish', 'VB'),
 ('a', 'DT'),
 ('parent', 'NN'),
 ('company', 'NN'),
 ('for', 'IN'),
 ('something', 'NN'),
 ('it', 'PRP'),
 ("'s", 'VBZ'),
 ('responsible', 'JJ'),
 ('for', 'IN'),
 ('in', 'IN'),
 ('another', 'DT'),
 ('country', 'NN')]

In [60]:
regex = "NP: {<DT>?<JJ>*<NN>}"
rp = nltk.RegexpParser(regex)

In [61]:
results = rp.parse(speech_tags)

In [62]:
print(results)

(S
  Companies/NNS
  are/VBP
  very/RB
  complex/JJ
  so/RB
  sometimes/VBZ
  it/PRP
  's/VBZ
  hard/JJ
  to/TO
  enforce/VB
  (NP the/DT law/NN)
  when/WRB
  they/PRP
  are/VBP
  spread/VBN
  out/RP
  around/IN
  (NP the/DT world/NN)
  Can/MD
  we/PRP
  punish/VB
  (NP a/DT parent/NN)
  (NP company/NN)
  for/IN
  (NP something/NN)
  it/PRP
  's/VBZ
  responsible/JJ
  for/IN
  in/IN
  (NP another/DT country/NN))
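
To make the grammar `{<DT>?<JJ>*<NN>}` concrete, the sketch below applies the same pattern by hand: an optional determiner, any number of adjectives, then exactly one `NN`. It is a simplified illustration and not how `nltk.RegexpParser` is implemented; `np_chunks` is a hypothetical name:

```python
def np_chunks(tagged):
    """Group (word, tag) pairs into chunks: optional DT, any JJ, one NN."""
    chunks, i = [], 0
    while i < len(tagged):
        j = i
        if tagged[j][1] == "DT":                          # <DT>?
            j += 1
        while j < len(tagged) and tagged[j][1] == "JJ":   # <JJ>*
            j += 1
        if j < len(tagged) and tagged[j][1] == "NN":      # <NN>
            chunks.append(" ".join(word for word, _ in tagged[i:j + 1]))
            i = j + 1   # continue after the chunk, so chunks never overlap
        else:
            i += 1      # no chunk starts at this position
    return chunks

tagged = [("punish", "VB"), ("a", "DT"), ("parent", "NN"),
          ("company", "NN"), ("for", "IN"), ("something", "NN")]
print(np_chunks(tagged))  # ['a parent', 'company', 'something']
```

Because the pattern ends in a single `<NN>`, "a parent" and "company" become separate chunks, as in the output above. Ending the grammar with `<NN>+` instead would keep "a parent company" together.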


Now we will draw the tree.

In [63]:
results.draw()

## Noun Chunks

Like `Doc.ents`, **Doc.noun_chunks** is a property of the Doc object. Noun chunks are "base noun phrases" – flat phrases that have a noun as their head. We can think of noun chunks as a noun plus the words describing the noun – for example, “the lavish green grass” or “the world’s largest tech fund”.

In [64]:
doc = nlp("Autonomous cars shift insurance liability toward manufacturers")

In [65]:
for chunk in doc.noun_chunks:
    print(chunk.text)

Autonomous cars
insurance liability
manufacturers


In [66]:
doc = nlp(u"Big houses carry higher insurance rates.")

for chunk in doc.noun_chunks:
    print(chunk.text)

Big houses
higher insurance rates


## Named Entity Recognition

Named entities are noun phrases that refer to specific locations, people, organizations, and so on. With named entity recognition, we can find the named entities in our texts and also determine what kind of named entity they are.

We can use **nltk.ne_chunk()** to recognize named entities.

In [67]:
sentence = """This page gives an overview of all articles in the 1911 
Brittanica which are alphabetized under Nau to Ner."""

In [68]:
print(ne_chunk(pos_tag(word_tokenize(sentence))))

(S
  This/DT
  page/NN
  gives/VBZ
  an/DT
  overview/NN
  of/IN
  all/DT
  articles/NNS
  in/IN
  the/DT
  1911/CD
  (GPE Brittanica/NNP)
  which/WDT
  are/VBP
  alphabetized/VBN
  under/IN
  (PERSON Nau/NNP)
  to/TO
  (PERSON Ner/NNP)
  ./.)


In [69]:
def extract_ne(quote):
    words = word_tokenize(quote, language="english")
    tags = nltk.pos_tag(words)
    tree = nltk.ne_chunk(tags, binary=True)
    return set(
        " ".join(i[0] for i in t)
        for t in tree
        if hasattr(t, "label") and t.label() == "NE")

In [70]:
extract_ne(sentence)

{'Nau', 'Ner'}

Now let's do that with the **spaCy** library.

To find named entities, we read the `label_` attribute of each entity in `doc.ents`. To get a description of an entity type, we can use the *spacy.explain* function.


Tags are accessible through the `.label_` property of an entity.

<table>
<tr><th>TYPE</th><th>DESCRIPTION</th><th>EXAMPLE</th></tr>
<tr><td>`PERSON`</td><td>People, including fictional.</td><td>*Fred Flintstone*</td></tr>
<tr><td>`NORP`</td><td>Nationalities or religious or political groups.</td><td>*The Republican Party*</td></tr>
<tr><td>`FAC`</td><td>Buildings, airports, highways, bridges, etc.</td><td>*Logan International Airport, The Golden Gate*</td></tr>
<tr><td>`ORG`</td><td>Companies, agencies, institutions, etc.</td><td>*Microsoft, FBI, MIT*</td></tr>
<tr><td>`GPE`</td><td>Countries, cities, states.</td><td>*France, UAR, Chicago, Idaho*</td></tr>
<tr><td>`LOC`</td><td>Non-GPE locations, mountain ranges, bodies of water.</td><td>*Europe, Nile River, Midwest*</td></tr>
<tr><td>`PRODUCT`</td><td>Objects, vehicles, foods, etc. (Not services.)</td><td>*Formula 1*</td></tr>
<tr><td>`EVENT`</td><td>Named hurricanes, battles, wars, sports events, etc.</td><td>*Olympic Games*</td></tr>
<tr><td>`WORK_OF_ART`</td><td>Titles of books, songs, etc.</td><td>*The Mona Lisa*</td></tr>
<tr><td>`LAW`</td><td>Named documents made into laws.</td><td>*Roe v. Wade*</td></tr>
<tr><td>`LANGUAGE`</td><td>Any named language.</td><td>*English*</td></tr>
<tr><td>`DATE`</td><td>Absolute or relative dates or periods.</td><td>*20 July 1969*</td></tr>
<tr><td>`TIME`</td><td>Times smaller than a day.</td><td>*Four hours*</td></tr>
<tr><td>`PERCENT`</td><td>Percentage, including "%".</td><td>*Eighty percent*</td></tr>
<tr><td>`MONEY`</td><td>Monetary values, including unit.</td><td>*Twenty Cents*</td></tr>
<tr><td>`QUANTITY`</td><td>Measurements, as of weight or distance.</td><td>*Several kilometers, 55kg*</td></tr>
<tr><td>`ORDINAL`</td><td>"first", "second", etc.</td><td>*9th, Ninth*</td></tr>
<tr><td>`CARDINAL`</td><td>Numerals that do not fall under another type.</td><td>*2, Two, Fifty-two*</td></tr>
</table>

In [71]:
# Create a Doc instance 
text = 'Huawei to build a Istanbul factory for $4 million.'
doc = nlp(text)

In [72]:
# Print all named entities and their labels
for ent in doc.ents:
    print(ent.text, ent.label_, str(spacy.explain(ent.label_)))

Huawei ORG Companies, agencies, institutions, etc.
Istanbul GPE Countries, cities, states
$4 million MONEY Monetary values, including unit


In [73]:
len(doc.ents)

3

Let's write a function to display this information:

In [74]:
def show_ents(doc):
    if doc.ents:
        for ent in doc.ents:
            print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))
    else:
        print('There are no named entities.')

In [75]:
show_ents(doc)

Huawei - ORG - Companies, agencies, institutions, etc.
Istanbul - GPE - Countries, cities, states
$4 million - MONEY - Monetary values, including unit


Now let's create a function to get persons in a text.

In [76]:
news = """Germany's conservative Christian Democrats have chosen Friedrich Merz,
a critic of former Chancellor Angela Merkel, as their new leader in the hope of 
reviving the party devastated by last year's general election defeat."""

In [77]:
def find_persons(text):
    # Create Doc object
    doc = nlp(text)

    # Identify the persons
    persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']

    # Return persons
    return persons

print(find_persons(news))

['Friedrich Merz', 'Angela Merkel']
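
The same filter generalises to any label. A minimal sketch that groups entity (text, label) pairs by label; the pairs are copied from the outputs above rather than computed, so the example runs without a spaCy model, and `group_by_label` is a hypothetical helper:

```python
from collections import defaultdict

def group_by_label(entities):
    """Group (text, label) pairs into a {label: [texts]} dictionary."""
    grouped = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)

# With spaCy these pairs would come from
# [(ent.text, ent.label_) for ent in doc.ents].
entities = [("Huawei", "ORG"), ("Istanbul", "GPE"), ("$4 million", "MONEY"),
            ("Friedrich Merz", "PERSON"), ("Angela Merkel", "PERSON")]

grouped = group_by_label(entities)
print(grouped["PERSON"])
```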


For all details of named entities in spaCy, please check [here](https://spacy.io/usage/linguistic-features#named-entities).

***

**Adding Named Entities to All Matching Spans**

What if we want to tag *all* occurrences of a term that the model does not label, such as "smartphone" and its spelling variants? Here we show how to use the PhraseMatcher to identify a series of spans in the Doc.

In [78]:
text = 'Huawei to build a Istanbul factory for $4 million. But their new smartphone are not loved by the audience. Smart-phones are very important for these companies.'
doc = nlp(text)

In [79]:
show_ents(doc)

Huawei - ORG - Companies, agencies, institutions, etc.
Istanbul - GPE - Countries, cities, states
$4 million - MONEY - Monetary values, including unit


In [80]:
matcher = PhraseMatcher(nlp.vocab)

In [81]:
matcher

<spacy.matcher.phrasematcher.PhraseMatcher at 0x7ff149b67190>

In [82]:
phrase_list = ['smartphone', 'Smart-phones']
phrase_patterns = [nlp(text) for text in phrase_list]

In [83]:
# Apply the patterns to our matcher object
# (spaCy v3 takes the patterns as a list in the second argument):
matcher.add('newproduct', phrase_patterns)

# Apply the matcher to our Doc object:
matches = matcher(doc)

In [84]:
matches

[(2689272359382549672, 14, 15), (2689272359382549672, 22, 25)]

In [85]:
PROD = doc.vocab.strings[u'PRODUCT']

In [86]:
new_ents = [Span(doc, match[1],match[2],label=PROD) for match in matches]

In [87]:
doc.ents = list(doc.ents) + new_ents

In [88]:
show_ents(doc)

Huawei - ORG - Companies, agencies, institutions, etc.
Istanbul - GPE - Countries, cities, states
$4 million - MONEY - Monetary values, including unit
smartphone - PRODUCT - Objects, vehicles, foods, etc. (not services)
Smart-phones - PRODUCT - Objects, vehicles, foods, etc. (not services)


## Visualization in Spacy

 spaCy includes a built-in visualization tool called **displaCy**. displaCy is able to detect whether you're working in a Jupyter notebook, and will return markup that can be rendered in a cell right away. When we export our notebook, the visualizations will be included as HTML.
 
 You can check [here](https://spacy.io/usage/visualizers) for more info.

The dependency visualizer, *dep*, shows part-of-speech tags and syntactic dependencies.

In [89]:
doc = nlp(u'Huawei to build a Istanbul factory for $4 million.')

displacy.render(doc, style='dep', jupyter=True, options={'distance': 90,"color":"red"})

The *options* argument lets us specify a dictionary of settings to customize the layout.

The entity visualizer, *ent*, highlights named entities and their labels in a text.

In [90]:
doc = nlp(u'Over the last quarter Amazon sold nearly 30 thousand books for a profit of $5 million.')
displacy.render(doc, style='ent', jupyter=True)

If we want, we can filter entities with the *options* parameter.

In [91]:
doc = nlp(u"""With a market capitalization of 2.25 trillion U.S. dollars as of April 2021, 
Apple was the world’s largest company in 2021. Rounding out the top five were some of the 
world’s most recognizable brands: Microsoft, Saudi Arabian Oil Company (Saudi Aramco), Amazon, 
and Google’s parent company Alphabet. Saudi Aramco led the ranking of the world's most profitable 
companies in 2019, with a net income of 88.21 billion U.S. dollars.
""")

In [92]:
for sent in doc.sents:
    displacy.render(sent, style='ent', jupyter=True,options={"ents":["ORG"]})

Rendering several large documents on one page can easily become confusing. To add a headline to each visualization, we can add a title to its *user_data*. User data is never touched or modified by spaCy.

In [93]:
doc = nlp(u'Huawei to build a Istanbul factory for $4 million.')

In [94]:
doc.user_data["title"] = "News"
displacy.render(doc, style="ent")

***

**Upcoming Tutorials**

- Text Feature Engineering 

- Bag of Words

- Text Visualisation : Bar Plot

- Text Visualisation : Frequency Visualisation

- Text Visualisation : WordCloud

- Transformers, Encoders and Decoders

- Different Models : Bert, HuggingFace, StanfordNLP, NLTK, LSTM etc.

- Sentiment Analysis with Logistic Regression

- Sentiment Analysis with Naive Bayes

- Vector Space Models

- Neural Machine Translation

- Text Summarization

- Classification with Bert

- Working with PDF Files

***

**Future Plans**

- Regex

- Turkish NLP

- Spacy