# Text Feature Extraction

<div class="alert alert-block alert-info">
<b>What is Feature Extraction in NLP?</b>


In NLP, feature extraction is the process of transforming raw text data into a set of numerical features that can be used for machine learning or other NLP tasks. 

The goal of feature extraction is to capture relevant information from the text data in a way that can be easily processed by a machine learning algorithm. 
</div>

<div class="alert alert-block alert-info">
<b>What are some common feature extraction techniques?</b>


__`1. Bag-of-Words (BoW):`__ BoW represents a document as a vector of word counts: count how often each vocabulary word occurs in the document and use those counts as the features. BoW is simple to implement and can be used for a variety of NLP tasks, such as text classification and clustering.

__`2. Term Frequency-Inverse Document Frequency (TF-IDF):`__ TF-IDF assigns each word in a document a weight that measures its importance. It takes into account not only the frequency of a word in a document but also its frequency across the corpus. Words that are common in a document but rare across the corpus are given higher weights, while words that are common across the corpus are given lower weights.

__`3. Word Embeddings:`__ Word embeddings are dense vector representations of words that capture the meaning and context of words in a corpus. They are generated by training a neural network on a large corpus of text, such as Wikipedia or news articles. Word embeddings can be used for a variety of NLP tasks, such as sentiment analysis, named entity recognition, and machine translation.

__`4. Named Entity Recognition (NER):`__ NER is a technique that involves identifying and extracting named entities such as person names, organization names, and locations from text. NER can be used to generate features that represent the presence or absence of named entities in a document.

__`5. Part-of-Speech (POS) tagging:`__ POS tagging involves labeling the part of speech (e.g. noun, verb, adjective) for each word in a sentence. POS tags can be used to generate features that capture syntactic information about a sentence.
</div>

## NLP Terminologies Primer

<div class="alert alert-block alert-success">
 
Here is a list of common terminologies used in Natural Language Processing (NLP):

`1. Corpus:` A collection of text documents (data points) or transcribed speech used for analysis and study in NLP. By analogy, if each document is a sentence, the corpus is the paragraph that contains those sentences.
    

`2. Documents:` A document is a single unit of text in the corpus. In the paragraph analogy, each sentence is a document, and the group of documents makes up the paragraph/corpus.
    
   

`3. Vocabulary:` The set of unique words present in the corpus. Its size is the number of unique words.
    

`4. Tokenization:` The process of splitting text into individual units, usually words or subwords, for further analysis.
    

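`5. Stemming:` The process of reducing a word to a stem by stripping suffixes with heuristic rules. The result need not be a valid word (e.g. *studies* may become *studi*).
    

`6. Lemmatization:` The process of reducing a word to its dictionary form (lemma) using vocabulary and morphological analysis (e.g. *studies* becomes *study*). It is usually more accurate than stemming but slower.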
    

`7. Stop words:` Common words that are usually removed from text during preprocessing, as they carry little meaning.
    

`8. N-grams:` A sequence of N words that are adjacent to each other in a text.
    

`9. Part-of-speech (POS) tagging:` The process of labeling each word in a text with its corresponding part of speech.
    

`10. Named Entity Recognition (NER):` The process of identifying and labeling named entities such as people, organizations, and locations in a text.
    

`11. Dependency parsing:` The process of analyzing the grammatical structure of a sentence by identifying the relationships between words.

`12. Sentiment analysis:` The process of determining the emotional tone or attitude of a text, typically positive, negative, or neutral.
   
`13. Topic modeling:` The process of identifying the underlying topics or themes in a collection of text documents.
    

`14. Bag of Words (BoW):` A simple method for representing text as a collection of word frequencies.
    

`15. Term frequency-inverse document frequency (TF-IDF):` A statistical measure used to evaluate the importance of a word in a document, based on how frequently it appears in the document and how rare it is in the corpus.

`16. Word embedding:` A technique for representing words as dense vectors of real numbers, which capture their semantic meaning and relationships.

`17. Neural machine translation:` A method of machine translation that uses neural networks to learn the mapping between languages.

`18. Language models:` Models that estimate the probability of a sequence of words in a language, typically used for tasks such as speech recognition and text generation.

</div>
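
A few of the terms above (tokenization, stop words, vocabulary) are easier to see on a tiny example. The sketch below uses only the standard library; the two-document corpus and the stop-word set are made up for illustration and are not a standard list.

```python
import re

# Hypothetical mini-corpus: each string is one document.
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
]

# Illustrative stop-word set (an assumption, not a standard list).
STOP_WORDS = {"the", "on"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

# Tokenization: one list of tokens per document.
tokenized = [tokenize(doc) for doc in corpus]

# Stop-word removal: drop tokens that appear in STOP_WORDS.
filtered = [[t for t in doc if t not in STOP_WORDS] for doc in tokenized]

# Vocabulary: the sorted set of unique remaining tokens in the corpus.
vocabulary = sorted({t for doc in filtered for t in doc})

print(tokenized[0])   # ['the', 'cat', 'sat', 'on', 'the', 'mat']
print(filtered[0])    # ['cat', 'sat', 'mat']
print(vocabulary)     # ['cat', 'chased', 'dog', 'mat', 'sat']
```

Real tokenizers (such as spaCy's) handle punctuation, contractions, and other languages far more carefully than this regular expression does.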

<br>

<br>

## Bag of Words (BoW)

<div class="alert alert-block alert-info">

__`Pros:`__

`1. Simplicity:` BoW is a simple and intuitive method for representing text as numerical vectors, making it easy to understand and implement.

`2. Flexibility:` BoW can be used with a variety of machine learning algorithms, making it a versatile tool for text analysis.

`3. Efficiency:` BoW can be computationally efficient, especially when used with sparse representations.

`4. Robustness:` BoW is robust to changes in word order and grammar, making it suitable for a wide range of text data.
    
__`Cons:`__

`1.  Loss of information:` BoW treats each word in the text as independent and ignores the order of the words, resulting in the loss of important contextual information.

`2. Vocabulary size:` BoW can result in a high-dimensional vector representation, especially when using a large vocabulary, which can make it computationally expensive and difficult to interpret.

`3. Out-of-vocabulary words:` Words that were not in the vocabulary when the model was fitted have no column in the feature matrix, so they are typically dropped when new text is transformed.

`4. Semantic ambiguity:` BoW does not capture word meanings, so words with similar meanings (e.g. *good* and *great*) are represented as unrelated, separate features.
</div>
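
To make the out-of-vocabulary point concrete, here is a minimal pure-Python sketch of a bag-of-words transform. The helper names (`fit_vocabulary`, `transform`) and the training sentences are hypothetical and only loosely mirror what a real vectorizer does.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def fit_vocabulary(docs):
    """Map each unique training token to a column index (alphabetical order)."""
    tokens = sorted({t for d in docs for t in tokenize(d)})
    return {tok: i for i, tok in enumerate(tokens)}

def transform(doc, vocab):
    """Count known tokens; tokens missing from vocab are silently skipped."""
    vec = [0] * len(vocab)
    for tok in tokenize(doc):
        if tok in vocab:
            vec[vocab[tok]] += 1
    return vec

train_docs = ["good movie", "bad movie"]
vocab = fit_vocabulary(train_docs)
print(vocab)      # {'bad': 0, 'good': 1, 'movie': 2}

# "terrible" was never seen during fitting, so it leaves no trace in the vector.
new_vec = transform("terrible movie", vocab)
print(new_vec)    # [0, 0, 1]
```

The new review ends up indistinguishable from a document that only says "movie", which is the information loss the con above describes.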

In [1]:
import spacy
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
nlp = spacy.load('en_core_web_sm')

In [2]:
para = "This is the first sentence. This is the second sentence. And this is the third sentence."

In [3]:
doc = nlp(para)
doc

This is the first sentence. This is the second sentence. And this is the third sentence.

In [4]:
for i in doc.sents:
    print(f"{type(i)} --> {i}")
    print(f"{type(i.text)} --> {i.text}")
  

<class 'spacy.tokens.span.Span'> --> This is the first sentence.
<class 'str'> --> This is the first sentence.
<class 'spacy.tokens.span.Span'> --> This is the second sentence.
<class 'str'> --> This is the second sentence.
<class 'spacy.tokens.span.Span'> --> And this is the third sentence.
<class 'str'> --> And this is the third sentence.


In [5]:
corpus = [i.text for i in doc.sents]
corpus

['This is the first sentence.',
 'This is the second sentence.',
 'And this is the third sentence.']

In [6]:
vectorizer = CountVectorizer(lowercase=True, ngram_range=(1,1), max_features=None)
X = vectorizer.fit_transform(corpus)
X.shape

(3, 8)

__`Note:`__ An ngram_range of (1, 1) means only unigrams, (1, 2) means unigrams and bigrams, and (2, 2) means only bigrams. 

In [7]:
vectorizer.get_feature_names_out()

array(['and', 'first', 'is', 'second', 'sentence', 'the', 'third', 'this'],
      dtype=object)

In [8]:
vectorizer.get_stop_words()

In [9]:
vectorizer.vocabulary_

{'this': 7,
 'is': 2,
 'the': 5,
 'first': 1,
 'sentence': 4,
 'second': 3,
 'and': 0,
 'third': 6}

In [10]:
corpus

['This is the first sentence.',
 'This is the second sentence.',
 'And this is the third sentence.']

In [11]:
X.toarray()

array([[0, 1, 1, 0, 1, 1, 0, 1],
       [0, 0, 1, 1, 1, 1, 0, 1],
       [1, 0, 1, 0, 1, 1, 1, 1]], dtype=int64)
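
The count matrix above can be reproduced by hand, which helps show what each column means. This is a minimal standard-library sketch, not scikit-learn's implementation; it assumes the default settings used above (lowercasing, tokens of two or more word characters, alphabetically sorted features).

```python
import re
from collections import Counter

corpus = [
    "This is the first sentence.",
    "This is the second sentence.",
    "And this is the third sentence.",
]

# Tokens are runs of two or more word characters, after lowercasing.
TOKEN_RE = re.compile(r"\b\w\w+\b")

def tokenize(text):
    return TOKEN_RE.findall(text.lower())

# One column per unique token, in alphabetical order.
features = sorted({tok for doc in corpus for tok in tokenize(doc)})

# One row per document: how many times each feature occurs in it.
rows = []
for doc in corpus:
    counts = Counter(tokenize(doc))
    rows.append([counts[f] for f in features])

print(features)
for row in rows:
    print(row)
# ['and', 'first', 'is', 'second', 'sentence', 'the', 'third', 'this']
# [0, 1, 1, 0, 1, 1, 0, 1]
# [0, 0, 1, 1, 1, 1, 0, 1]
# [1, 0, 1, 0, 1, 1, 1, 1]
```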

### Visualize the Bag Of Words 
 
The fitted `vocabulary_` dictionary maps each word (key) to its column index (value) in the feature matrix. Placing the matrix in a DataFrame with the feature names as column headers makes the counts easier to read.

In [12]:
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df

   and  first  is  second  sentence  the  third  this
0    0      1   1       0         1    1      0     1
1    0      0   1       1         1    1      0     1
2    1      0   1       0         1    1      1     1


### Tweak with ngrams

In [13]:
corpus

['This is the first sentence.',
 'This is the second sentence.',
 'And this is the third sentence.']

In [14]:
vectorizer = CountVectorizer(lowercase=True, ngram_range=(2,2), max_features=None)
X = vectorizer.fit_transform(corpus)
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df

   and this  first sentence  is the  second sentence  the first  the second  the third  third sentence  this is
0         0               1       1                0          1           0          0               0        1
1         0               0       1                1          0           1          0               0        1
2         1               0       1                0          0           0          1               1        1


In [15]:
vectorizer = CountVectorizer(lowercase=True, ngram_range=(1,2), max_features=None)
X = vectorizer.fit_transform(corpus)
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
df

   and  and this  first  first sentence  is  is the  second  second sentence  sentence  the  the first  the second  the third  third  third sentence  this  this is
0    0         0      1               1   1       1       0                0         1    1          1           0          0      0               0     1        1
1    0         0      0               0   1       1       1                1         1    1          0           1          0      0               0     1        1
2    1         1      0               0   1       1       0                0         1    1          0           0          1      1               1     1        1
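
The effect of `ngram_range` on the vocabulary can be sketched in a few lines of plain Python. The `ngrams` and `vocabulary` helpers below are hypothetical names for this illustration; they assume the same tokenization as the default vectorizer settings used above.

```python
import re

corpus = [
    "This is the first sentence.",
    "This is the second sentence.",
    "And this is the third sentence.",
]

def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

def ngrams(tokens, n_min, n_max):
    """Return all space-joined n-grams with n_min <= n <= n_max."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

def vocabulary(n_min, n_max):
    """Sorted unique n-grams over the whole corpus."""
    return sorted({g for doc in corpus for g in ngrams(tokenize(doc), n_min, n_max)})

unigrams = vocabulary(1, 1)
bigrams = vocabulary(2, 2)
both = vocabulary(1, 2)

print(bigrams)
print(len(unigrams), len(bigrams), len(both))   # 8 9 17
```

The (1, 2) vocabulary is simply the union of the unigram and bigram vocabularies, which is why the last table has 8 + 9 = 17 columns. On a real corpus the number of distinct n-grams grows quickly with n, so the feature matrix gets much wider.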


<br>

## TF-IDF

<div class="alert alert-block alert-info">

TF-IDF refines Bag-of-Words by weighting each count: a word scores high when it is frequent in a document but rare across the corpus, and low when it appears in most documents.

This helps to overcome one limitation of the Bag-of-Words model, where very common words dominate the counts even though they say little about any particular document.


__`Term Frequency (TF)`__
    
TF(t, d) = (Number of times term t appears in document d) / (Total number of terms in document d)

__`Inverse Document Frequency (IDF)`__
    
IDF(t, D) = log_e(Total number of documents in the corpus D / Number of documents with term t in it)

__`TF-IDF`__
    
TF-IDF(t, d, D) = TF(t, d) * IDF(t, D)

where:

t: term (word)

d: document (sentence)

D: corpus (collection of documents/sentences)

</div>
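
The three formulas above translate almost directly into code. The following is a minimal sketch using only the standard library, applied to the same three-sentence corpus; the function names are chosen for this illustration, and the tokenization (lowercased tokens of two or more word characters) is an assumption.

```python
import math
import re

corpus = [
    "This is the first sentence.",
    "This is the second sentence.",
    "And this is the third sentence.",
]

def tokenize(text):
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(d) for d in corpus]

def tf(term, doc):
    """Occurrences of term in doc divided by the number of terms in doc."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Natural log of (number of documents / documents containing term).

    Raises ZeroDivisionError if the term appears in no document.
    """
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "first" occurs in only one of three documents, so it gets a positive weight.
print(round(tf_idf("first", docs[0], docs), 4))   # 0.2197

# "the" occurs in every document: IDF = log(3/3) = 0, so the weight is 0.
print(tf_idf("the", docs[0], docs))               # 0.0
```

With this plain formula, a word that appears in every document is weighted exactly zero. Library implementations often smooth the IDF term, so their numbers can differ from these.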

<div class="alert alert-block alert-info">

__`Pros:`__

`1. Gives importance to important words:` TF-IDF down-weights the common words and up-weights the important words, making it more effective in capturing the essence of the document.

`2. Handles long documents better:` The term frequency is divided by the total number of terms in the document, so a long document does not receive larger weights simply because it contains more words.

`3. Improves search relevance:` Because TF-IDF gives more weight to the important words and less weight to the common words, it can help improve the relevance of search results.

`4. Widely used:` TF-IDF is a widely used technique in text mining and information retrieval.
    
    

__`Cons:`__

`1. Assumes independence between terms:` TF-IDF assumes that the terms in a document are independent of each other, which is not always true. This can lead to inaccurate results in some cases.

`2. Ignores word order and context:` TF-IDF does not take into account the order of words or the context in which they appear, which can limit its usefulness in some cases.

`3. Requires a large corpus:` The inverse document frequency is estimated from the corpus, so a small or unrepresentative corpus gives unreliable weights.

`4. Can be sensitive to document length:` Raw term counts grow with document length, so without length normalization longer documents tend to receive larger weights.
</div>

In [16]:
corpus

['This is the first sentence.',
 'This is the second sentence.',
 'And this is the third sentence.']

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [18]:
tfidf_vec = TfidfVectorizer()

In [19]:
x = tfidf_vec.fit_transform(corpus)
x.shape

(3, 8)

In [20]:
x.toarray()

array([[0.        , 0.64612892, 0.38161415, 0.        , 0.38161415,
        0.38161415, 0.        , 0.38161415],
       [0.        , 0.        , 0.38161415, 0.64612892, 0.38161415,
        0.38161415, 0.        , 0.38161415],
       [0.54270061, 0.        , 0.32052772, 0.        , 0.32052772,
        0.32052772, 0.54270061, 0.32052772]])

In [21]:
tfidf_vec.get_feature_names_out()

array(['and', 'first', 'is', 'second', 'sentence', 'the', 'third', 'this'],
      dtype=object)

In [22]:
df_tfidf = pd.DataFrame(x.toarray(), columns=tfidf_vec.get_feature_names_out())

In [23]:
df_tfidf

        and     first        is    second  sentence       the     third      this
0  0.000000  0.646129  0.381614  0.000000  0.381614  0.381614  0.000000  0.381614
1  0.000000  0.000000  0.381614  0.646129  0.381614  0.381614  0.000000  0.381614
2  0.542701  0.000000  0.320528  0.000000  0.320528  0.320528  0.542701  0.320528


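__`NOTE:`__ Each row is normalized to unit length, and words that appear in every sentence (`is`, `sentence`, `the`, `this`) receive lower weights than words unique to one sentence (`first`, `second`, `and`, `third`). The weights reflect how distinctive a word is for a sentence/document within this corpus, not its semantic meaning.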
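
The numbers in this table do not match the textbook formula given earlier, where a word present in every document would score zero. They are consistent with a smoothed variant: raw counts as TF, IDF computed as `ln((1 + n) / (1 + df)) + 1`, and each row scaled to unit Euclidean length. The sketch below assumes those settings, which correspond to scikit-learn's documented defaults, and reproduces the matrix with the standard library only.

```python
import math
import re

corpus = [
    "This is the first sentence.",
    "This is the second sentence.",
    "And this is the third sentence.",
]

def tokenize(text):
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(d) for d in corpus]
features = sorted({t for d in docs for t in d})
n_docs = len(docs)

def smoothed_idf(term):
    """Assumed smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    df = sum(1 for d in docs if term in d)
    return math.log((1 + n_docs) / (1 + df)) + 1

def tfidf_row(doc):
    """Raw count times smoothed IDF, then L2-normalize the row."""
    raw = [doc.count(t) * smoothed_idf(t) for t in features]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

matrix = [tfidf_row(d) for d in docs]
for row in matrix:
    print([round(v, 6) for v in row])
# [0.0, 0.646129, 0.381614, 0.0, 0.381614, 0.381614, 0.0, 0.381614]
# [0.0, 0.0, 0.381614, 0.646129, 0.381614, 0.381614, 0.0, 0.381614]
# [0.542701, 0.0, 0.320528, 0.0, 0.320528, 0.320528, 0.542701, 0.320528]
```

Because of the `+ 1` in the IDF, words that occur in every sentence keep a small non-zero weight instead of vanishing.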

<br>

## Part-of-Speech (POS) Tagging

<div class="alert alert-block alert-info">
    
__`Part-of-speech (POS)`__ tagging is a process in natural language processing (NLP) that involves assigning a syntactic category to each word in a sentence based on its definition, context, and relationships with other words in the sentence.
    
<br>
Processing raw text intelligently is difficult: most words are rare, and it's common for words that look completely different to mean almost the same thing. The same words in a different order can mean something completely different. Even splitting text into useful word-like units can be difficult in many languages. 
    
While it's possible to solve some problems starting from only the raw characters, it's usually better to use linguistic knowledge to add useful information. That's exactly what spaCy is designed to do: you put in raw text, and get back a **Doc** object, that comes with a variety of annotations.
    
</div>

### Coarse-Grained POS Tags

<div class="alert alert-block alert-success">
Every token is assigned a POS Tag from the following list:
    
<br>
<br>


<table><tr><th>POS</th><th>DESCRIPTION</th><th>EXAMPLES</th></tr>
    
<tr><td>ADJ</td><td>adjective</td><td>*big, old, green, incomprehensible, first*</td></tr>
<tr><td>ADP</td><td>adposition</td><td>*in, to, during*</td></tr>
<tr><td>ADV</td><td>adverb</td><td>*very, tomorrow, down, where, there*</td></tr>
<tr><td>AUX</td><td>auxiliary</td><td>*is, has (done), will (do), should (do)*</td></tr>
<tr><td>CONJ</td><td>conjunction</td><td>*and, or, but*</td></tr>
<tr><td>CCONJ</td><td>coordinating conjunction</td><td>*and, or, but*</td></tr>
<tr><td>DET</td><td>determiner</td><td>*a, an, the*</td></tr>
<tr><td>INTJ</td><td>interjection</td><td>*psst, ouch, bravo, hello*</td></tr>
<tr><td>NOUN</td><td>noun</td><td>*girl, cat, tree, air, beauty*</td></tr>
<tr><td>NUM</td><td>numeral</td><td>*1, 2017, one, seventy-seven, IV, MMXIV*</td></tr>
<tr><td>PART</td><td>particle</td><td>*'s, not*</td></tr>
<tr><td>PRON</td><td>pronoun</td><td>*I, you, he, she, myself, themselves, somebody*</td></tr>
<tr><td>PROPN</td><td>proper noun</td><td>*Mary, John, London, NATO, HBO*</td></tr>
<tr><td>PUNCT</td><td>punctuation</td><td>*., (, ), ?*</td></tr>
<tr><td>SCONJ</td><td>subordinating conjunction</td><td>*if, while, that*</td></tr>
<tr><td>SYM</td><td>symbol</td><td>*$, %, §, ©, +, −, ×, ÷, =, :), 😝*</td></tr>
<tr><td>VERB</td><td>verb</td><td>*run, runs, running, eat, ate, eating*</td></tr>
<tr><td>X</td><td>other</td><td>*sfpksdpsxmsa*</td></tr>
<tr><td>SPACE</td><td>space</td><td></td></tr>
</table>
</div>

### Fine-Grained POS Tags
<div class="alert alert-block alert-success">
Tokens are subsequently given a fine-grained tag as determined by morphology:
    
<br>
<br>
<table>
<tr><th>POS</th><th>Description</th><th>Fine-grained Tag</th><th>Description</th><th>Morphology</th></tr>
<tr><td>ADJ</td><td>adjective</td><td>AFX</td><td>affix</td><td>Hyph=yes</td></tr>
<tr><td>ADJ</td><td></td><td>JJ</td><td>adjective</td><td>Degree=pos</td></tr>
<tr><td>ADJ</td><td></td><td>JJR</td><td>adjective, comparative</td><td>Degree=comp</td></tr>
<tr><td>ADJ</td><td></td><td>JJS</td><td>adjective, superlative</td><td>Degree=sup</td></tr>
<tr><td>ADJ</td><td></td><td>PDT</td><td>predeterminer</td><td>AdjType=pdt PronType=prn</td></tr>
<tr><td>ADJ</td><td></td><td>PRP\$</td><td>pronoun, possessive</td><td>PronType=prs Poss=yes</td></tr>
<tr><td>ADJ</td><td></td><td>WDT</td><td>wh-determiner</td><td>PronType=int rel</td></tr>
<tr><td>ADJ</td><td></td><td>WP\$</td><td>wh-pronoun, possessive</td><td>Poss=yes PronType=int rel</td></tr>
<tr><td>ADP</td><td>adposition</td><td>IN</td><td>conjunction, subordinating or preposition</td><td></td></tr>
<tr><td>ADV</td><td>adverb</td><td>EX</td><td>existential there</td><td>AdvType=ex</td></tr>
<tr><td>ADV</td><td></td><td>RB</td><td>adverb</td><td>Degree=pos</td></tr>
<tr><td>ADV</td><td></td><td>RBR</td><td>adverb, comparative</td><td>Degree=comp</td></tr>
<tr><td>ADV</td><td></td><td>RBS</td><td>adverb, superlative</td><td>Degree=sup</td></tr>
<tr><td>ADV</td><td></td><td>WRB</td><td>wh-adverb</td><td>PronType=int rel</td></tr>
<tr><td>CONJ</td><td>conjunction</td><td>CC</td><td>conjunction, coordinating</td><td>ConjType=coor</td></tr>
<tr><td>DET</td><td>determiner</td><td>DT</td><td>determiner</td><td></td></tr>
<tr><td>INTJ</td><td>interjection</td><td>UH</td><td>interjection</td><td></td></tr>
<tr><td>NOUN</td><td>noun</td><td>NN</td><td>noun, singular or mass</td><td>Number=sing</td></tr>
<tr><td>NOUN</td><td></td><td>NNS</td><td>noun, plural</td><td>Number=plur</td></tr>
<tr><td>NOUN</td><td></td><td>WP</td><td>wh-pronoun, personal</td><td>PronType=int rel</td></tr>
<tr><td>NUM</td><td>numeral</td><td>CD</td><td>cardinal number</td><td>NumType=card</td></tr>
<tr><td>PART</td><td>particle</td><td>POS</td><td>possessive ending</td><td>Poss=yes</td></tr>
<tr><td>PART</td><td></td><td>RP</td><td>adverb, particle</td><td></td></tr>
<tr><td>PART</td><td></td><td>TO</td><td>infinitival to</td><td>PartType=inf VerbForm=inf</td></tr>
<tr><td>PRON</td><td>pronoun</td><td>PRP</td><td>pronoun, personal</td><td>PronType=prs</td></tr>
<tr><td>PROPN</td><td>proper noun</td><td>NNP</td><td>noun, proper singular</td><td>NounType=prop Number=sing</td></tr>
<tr><td>PROPN</td><td></td><td>NNPS</td><td>noun, proper plural</td><td>NounType=prop Number=plur</td></tr>
<tr><td>PUNCT</td><td>punctuation</td><td>-LRB-</td><td>left round bracket</td><td>PunctType=brck PunctSide=ini</td></tr>
<tr><td>PUNCT</td><td></td><td>-RRB-</td><td>right round bracket</td><td>PunctType=brck PunctSide=fin</td></tr>
<tr><td>PUNCT</td><td></td><td>,</td><td>punctuation mark, comma</td><td>PunctType=comm</td></tr>
<tr><td>PUNCT</td><td></td><td>:</td><td>punctuation mark, colon or ellipsis</td><td></td></tr>
<tr><td>PUNCT</td><td></td><td>.</td><td>punctuation mark, sentence closer</td><td>PunctType=peri</td></tr>
<tr><td>PUNCT</td><td></td><td>''</td><td>closing quotation mark</td><td>PunctType=quot PunctSide=fin</td></tr>
<tr><td>PUNCT</td><td></td><td>""</td><td>closing quotation mark</td><td>PunctType=quot PunctSide=fin</td></tr>
<tr><td>PUNCT</td><td></td><td>``</td><td>opening quotation mark</td><td>PunctType=quot PunctSide=ini</td></tr>
<tr><td>PUNCT</td><td></td><td>HYPH</td><td>punctuation mark, hyphen</td><td>PunctType=dash</td></tr>
<tr><td>PUNCT</td><td></td><td>LS</td><td>list item marker</td><td>NumType=ord</td></tr>
<tr><td>PUNCT</td><td></td><td>NFP</td><td>superfluous punctuation</td><td></td></tr>
<tr><td>SYM</td><td>symbol</td><td>#</td><td>symbol, number sign</td><td>SymType=numbersign</td></tr>
<tr><td>SYM</td><td></td><td>\$</td><td>symbol, currency</td><td>SymType=currency</td></tr>
<tr><td>SYM</td><td></td><td>SYM</td><td>symbol</td><td></td></tr>
<tr><td>VERB</td><td>verb</td><td>BES</td><td>auxiliary "be"</td><td></td></tr>
<tr><td>VERB</td><td></td><td>HVS</td><td>forms of "have"</td><td></td></tr>
<tr><td>VERB</td><td></td><td>MD</td><td>verb, modal auxiliary</td><td>VerbType=mod</td></tr>
<tr><td>VERB</td><td></td><td>VB</td><td>verb, base form</td><td>VerbForm=inf</td></tr>
<tr><td>VERB</td><td></td><td>VBD</td><td>verb, past tense</td><td>VerbForm=fin Tense=past</td></tr>
<tr><td>VERB</td><td></td><td>VBG</td><td>verb, gerund or present participle</td><td>VerbForm=part Tense=pres Aspect=prog</td></tr>
<tr><td>VERB</td><td></td><td>VBN</td><td>verb, past participle</td><td>VerbForm=part Tense=past Aspect=perf</td></tr>
<tr><td>VERB</td><td></td><td>VBP</td><td>verb, non-3rd person singular present</td><td>VerbForm=fin Tense=pres</td></tr>
<tr><td>VERB</td><td></td><td>VBZ</td><td>verb, 3rd person singular present</td><td>VerbForm=fin Tense=pres Number=sing Person=3</td></tr>
<tr><td>X</td><td>other</td><td>ADD</td><td>email</td><td></td></tr>
<tr><td>X</td><td></td><td>FW</td><td>foreign word</td><td>Foreign=yes</td></tr>
<tr><td>X</td><td></td><td>GW</td><td>additional word in multi-word expression</td><td></td></tr>
<tr><td>X</td><td></td><td>XX</td><td>unknown</td><td></td></tr>
<tr><td>SPACE</td><td>space</td><td>_SP</td><td>space</td><td></td></tr>
<tr><td></td><td></td><td>NIL</td><td>missing tag</td><td></td></tr>
</table>
</div>

* __`token.pos_ :`__ To view the coarse POS tag.
* __`token.tag_ :`__ To view the fine-grained tag.
* __`spacy.explain(tag) :`__ To view the description of either type of tag.

__`Note :`__ You can obtain a particular token by its index position.

In [24]:
doc = nlp(u'I read books on NLP.')
r = doc[1]

print(f'{r.text:{10}} {r.pos_:{8}} {r.tag_:{6}} {spacy.explain(r.tag_)}')

read       VERB     VBP    verb, non-3rd person singular present


In [25]:
doc = nlp(u'I read a book on NLP.')
r = doc[1]

print(f'{r.text:{10}} {r.pos_:{8}} {r.tag_:{6}} {spacy.explain(r.tag_)}')

read       VERB     VBD    verb, past tense


#### Counting POS Tags
The __`Doc.count_by()`__ method accepts a specific token attribute as its argument, and returns a frequency count of the given attribute as a dictionary object. Keys in the dictionary are the integer IDs of the attribute values, and values are the frequencies. Counts of zero are not included.

In [26]:
doc = nlp(u"The quick brown fox jumped over the lazy dog's back.")
# Count the frequencies of different coarse-grained POS tags:
POS_counts = doc.count_by(spacy.attrs.POS)
POS_counts

{90: 2, 84: 3, 92: 3, 100: 1, 85: 1, 94: 1, 97: 1}

In [27]:
# Decode an integer ID to its string (83 is not among the POS IDs counted above)
doc.vocab[83].text

'LANG'

In [28]:
for k,v in sorted(POS_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{5}}: {v}')

84. ADJ  : 3
85. ADP  : 1
90. DET  : 2
92. NOUN : 3
94. PART : 1
97. PUNCT: 1
100. VERB : 1
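
Counts like these can be turned into a fixed-length feature vector for a document. The sketch below is one possible way to do that, using the counts printed above keyed by tag name; the `TAG_ORDER` list is a shortened, illustrative assumption, and a real pipeline would list every tag it expects to see.

```python
# Coarse POS counts copied from the output above, keyed by tag name.
pos_counts = {"ADJ": 3, "ADP": 1, "DET": 2, "NOUN": 3, "PART": 1, "PUNCT": 1, "VERB": 1}

# A fixed tag order so every document maps to a vector of the same length.
# Illustrative subset only (an assumption for this sketch).
TAG_ORDER = ["ADJ", "ADP", "ADV", "DET", "NOUN", "PART", "PUNCT", "VERB"]

# Total number of tagged tokens in the sentence.
total = sum(pos_counts.values())

# Relative frequency of each tag; tags that never occur get 0.
features = [pos_counts.get(tag, 0) / total for tag in TAG_ORDER]

print(total)                              # 12
print([round(f, 3) for f in features])
# [0.25, 0.083, 0.0, 0.167, 0.25, 0.083, 0.083, 0.083]
```

Dividing by the total makes sentences of different lengths comparable; raw counts would work as well if length itself is informative.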


#### Counting Fine-Grained Tags

In [29]:
# Count the different fine-grained tags:
TAG_counts = doc.count_by(spacy.attrs.TAG)

for k,v in sorted(TAG_counts.items()):
    print(f'{k}. {doc.vocab[k].text:{4}}: {v}')

74. POS : 1
1292078113972184607. IN  : 1
10554686591937588953. JJ  : 3
12646065887601541794. .   : 1
15267657372422890137. DT  : 2
15308085513773655218. NN  : 3
17109001835818727656. VBD : 1


<br>

## Named Entity Recognition (NER)

<div class="alert alert-block alert-info">
    
__`NER (Named Entity Recognition)`__ is a natural language processing (NLP) technique that involves identifying and categorizing named entities in text into predefined categories such as person names, organizations, locations, medical codes, etc. The goal of NER is to extract useful information from unstructured text data by identifying important named entities and classifying them into meaningful categories.

<br>
For example, consider the following sentence:

__"Barack Obama was the 44th President of the United States."__

NER would involve identifying "Barack Obama" as a person name and "the United States" as a location.

NER is a useful technique in a variety of NLP applications such as information extraction, text classification, and question answering. Many NLP libraries, including spaCy and NLTK, provide pre-trained models for NER that can be used out of the box, or trained on custom data to recognize specific entities relevant to a particular domain or application.
    
</div>

### Entity annotations

<div class="alert alert-block alert-success">
`Doc.ents` are token spans with their own set of annotations.
    
<br>
    <br>
<table>
<tr><td>`ent.text`</td><td>The original entity text</td></tr>
<tr><td>`ent.label`</td><td>The entity type's hash value</td></tr>
<tr><td>`ent.label_`</td><td>The entity type's label as a string</td></tr>
<tr><td>`ent.start`</td><td>The span's *start* token index in the Doc</td></tr>
<tr><td>`ent.end`</td><td>The span's *end* token index in the Doc (exclusive)</td></tr>
<tr><td>`ent.start_char`</td><td>The entity text's *start* character offset in the Doc</td></tr>
<tr><td>`ent.end_char`</td><td>The entity text's *end* character offset in the Doc (exclusive)</td></tr>
</table>
    </div>

In [30]:
doc = nlp(u'Can I please borrow 500 dollars from you to buy some Microsoft stock?')

for ent in doc.ents:
    print(ent.text, ent.start, ent.end, ent.start_char, ent.end_char, ent.label_)

500 dollars 4 6 20 31 MONEY
Microsoft 11 12 53 62 ORG
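
The two pairs of numbers in each output row are different kinds of index: `start`/`end` count tokens, while `start_char`/`end_char` count characters, and in both cases the end is exclusive. The sketch below checks this with plain Python slicing on the values printed above. Splitting on whitespace is only a stand-in for spaCy's tokenizer; it happens to give the same positions for these two entities.

```python
text = "Can I please borrow 500 dollars from you to buy some Microsoft stock?"

# (start, end, start_char, end_char, label) copied from the output above.
entities = [
    (4, 6, 20, 31, "MONEY"),
    (11, 12, 53, 62, "ORG"),
]

# Whitespace split as a rough approximation of tokenization (assumption).
tokens = text.split()

recovered = []
for start, end, start_char, end_char, label in entities:
    by_chars = text[start_char:end_char]        # character offsets, end exclusive
    by_tokens = " ".join(tokens[start:end])     # token indices, end exclusive
    recovered.append((label, by_chars, by_tokens))
    print(label, repr(by_chars), repr(by_tokens))
# MONEY '500 dollars' '500 dollars'
# ORG 'Microsoft' 'Microsoft'
```

Character offsets are what you need to highlight an entity in the original string; token indices are what you need to build a `Span` over the Doc.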


<div class="alert alert-block alert-success">

Tags are accessible through the `.label_` property of an entity.
<br>
<br>

<table>
<tr><th>TYPE</th><th>DESCRIPTION</th><th>EXAMPLE</th></tr>
<tr><td>`PERSON`</td><td>People, including fictional.</td><td>*Fred Flintstone*</td></tr>
<tr><td>`NORP`</td><td>Nationalities or religious or political groups.</td><td>*The Republican Party*</td></tr>
<tr><td>`FAC`</td><td>Buildings, airports, highways, bridges, etc.</td><td>*Logan International Airport, The Golden Gate*</td></tr>
<tr><td>`ORG`</td><td>Companies, agencies, institutions, etc.</td><td>*Microsoft, FBI, MIT*</td></tr>
<tr><td>`GPE`</td><td>Countries, cities, states.</td><td>*France, UAR, Chicago, Idaho*</td></tr>
<tr><td>`LOC`</td><td>Non-GPE locations, mountain ranges, bodies of water.</td><td>*Europe, Nile River, Midwest*</td></tr>
<tr><td>`PRODUCT`</td><td>Objects, vehicles, foods, etc. (Not services.)</td><td>*Formula 1*</td></tr>
<tr><td>`EVENT`</td><td>Named hurricanes, battles, wars, sports events, etc.</td><td>*Olympic Games*</td></tr>
<tr><td>`WORK_OF_ART`</td><td>Titles of books, songs, etc.</td><td>*The Mona Lisa*</td></tr>
<tr><td>`LAW`</td><td>Named documents made into laws.</td><td>*Roe v. Wade*</td></tr>
<tr><td>`LANGUAGE`</td><td>Any named language.</td><td>*English*</td></tr>
<tr><td>`DATE`</td><td>Absolute or relative dates or periods.</td><td>*20 July 1969*</td></tr>
<tr><td>`TIME`</td><td>Times smaller than a day.</td><td>*Four hours*</td></tr>
<tr><td>`PERCENT`</td><td>Percentage, including "%".</td><td>*Eighty percent*</td></tr>
<tr><td>`MONEY`</td><td>Monetary values, including unit.</td><td>*Twenty Cents*</td></tr>
<tr><td>`QUANTITY`</td><td>Measurements, as of weight or distance.</td><td>*Several kilometers, 55kg*</td></tr>
<tr><td>`ORDINAL`</td><td>"first", "second", etc.</td><td>*9th, Ninth*</td></tr>
<tr><td>`CARDINAL`</td><td>Numerals that do not fall under another type.</td><td>*2, Two, Fifty-two*</td></tr>
</table>
</div>

In [31]:
# Write a function to display basic entity info:
def show_ents(doc):
    if doc.ents:
        for ent in doc.ents:
            print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))
    else:
        print('No named entities found.')

In [32]:
doc = nlp(u'May I go to Washington, DC next May to see the Washington Monument?')

show_ents(doc)

Washington, DC - GPE - Countries, cities, states
next May - DATE - Absolute or relative dates or periods
the Washington Monument - ORG - Companies, agencies, institutions, etc.


### Adding a Named Entity to a Span
Normally we would have spaCy build a library of named entities by training it on several samples of text.<br>In this case, we only want to add one value:

In [33]:
doc = nlp(u'Tesla to build a U.K. factory for $6 million')

show_ents(doc)

U.K. - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit


In [34]:
doc.ents #Right now, spaCy does not recognize "Tesla" as a company.

(U.K., $6 million)

In [35]:
from spacy.tokens import Span

# Get the hash value of the ORG entity label
ORG = doc.vocab.strings[u'ORG']  

# Create a Span for the new entity
new_ent = Span(doc, 0, 1, label=ORG)

# Add the entity to the existing Doc object
doc.ents = list(doc.ents) + [new_ent]

In [36]:
doc.ents

(Tesla, U.K., $6 million)

In [37]:
show_ents(doc)

Tesla - ORG - Companies, agencies, institutions, etc.
U.K. - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit


### Counting Entities
While spaCy may not have a built-in tool for counting entities, we can pass a conditional statement into a list comprehension:

In [38]:
doc = nlp(u'Originally priced at $29.50, the sweater was marked down to five dollars.')
show_ents(doc)

29.50 - MONEY - Monetary values, including unit
five dollars - MONEY - Monetary values, including unit


In [39]:
len([ent for ent in doc.ents if ent.label_=='MONEY'])

2
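
To count every entity type at once rather than one label at a time, `collections.Counter` is a natural fit. The sketch below uses plain `(text, label)` pairs copied from the `show_ents` outputs in this section, so it runs without a spaCy model; the variable names are chosen for this illustration.

```python
from collections import Counter

# (text, label) pairs copied from the show_ents outputs earlier in this section.
ents = [
    ("Tesla", "ORG"),
    ("U.K.", "GPE"),
    ("$6 million", "MONEY"),
    ("29.50", "MONEY"),
    ("five dollars", "MONEY"),
]

# Frequency of each entity label.
label_counts = Counter(label for _, label in ents)
print(label_counts)             # Counter({'MONEY': 3, 'ORG': 1, 'GPE': 1})

# Missing labels count as 0, which is convenient for presence/absence features.
print(label_counts["PERSON"])   # 0
has_money = int(label_counts["MONEY"] > 0)
print(has_money)                # 1
```

With a spaCy Doc, the same counter could be built from `ent.label_` for each `ent` in `doc.ents`.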