book: "`Information Extraction: Algorithms and Prospects in a Retrieval Context`"

single input example: https://towardsdatascience.com/named-entity-recognition-and-classification-with-scikit-learn-f05372f07ba2

context information: https://www.depends-on-the-definition.com/introduction-named-entity-recognition-python/

Raw document data may contain `rich linguistic relations between different entities`.

* **One approach**: 
  * remove stop words, 
  * stem the data, 
  * use a bag-of-words representation. 

* **Other methods**:
  * `entity extraction` to determine linguistic relationships.
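
A minimal sketch of the first approach in plain Python. The stop-word list and the suffix-stripping `crude_stem` helper are toy stand-ins invented for illustration, not a real stemmer such as Porter's. Note that the resulting bag keeps no word order and no hint that "Bill Clinton" is one person, which is the gap entity extraction is meant to fill.

```python
from collections import Counter
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "the", "in", "of", "to", "is", "and"}

def crude_stem(word):
    """Toy stemmer: strip a couple of common suffixes. Not a real algorithm."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    """Lowercase, tokenize, drop stop words, stem, and count."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)

bow = bag_of_words("Bill Clinton lives in Chappaqua")
print(bow)
```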



<img src="figures/information-extraction-intro.jpg" width="80%">

# Information Extraction (IE) systems

* **Find and understand** limited relevant parts of texts
* **Gather information** from many pieces of text
* Produce a **structured representation** of relevant information
* Goals:
  1. Organize information so that it is **useful to people**
  2. Put information in a **semantically** precise **form** that allows further inferences to be made by computer algorithms

# Named-Entity Recognition (NER)

<img src="figures/ner-fig.jpg" width="60%">


NER is an important subtask of **information extraction**. 

This approach **locates and classifies** <font style="color:red">atomic elements</font> in text into predefined categories such as 
* names of persons, 
* organizations, 
* locations, 
* actions, 
* numeric quantities, 
* and so on.

**Atomic elements**: used to understand the structure of sentences and complex events. 


### Example
Consider the following sentence:

> Bill Clinton lives in Chappaqua.

Here, 
* “`Bill Clinton`” is the **name** of a person, 
* “`Chappaqua`” is the **name** of a place. 
* The word “`lives`” denotes an **action**. 


**Named Entity Recognition**, also known as **Entity Extraction**, classifies the named entities present in a text into predefined categories such as 
* “individuals”, “companies”, “places”, “organization”, “cities”, “dates”, “product terminologies” etc. 

A very important sub-task: <font style="color:red">**find**</font> and <font style="color:red">**classify**</font> names in text,   
for example: 

<img src="figures/ner-example.png" width="80%">


**Named Entity Extraction** forms a core **subtask** of building knowledge from semi-structured and unstructured text sources. 

Some of the **first researchers** working to extract information from unstructured texts recognized the importance of “<font style="color:red">units of information</font>” like 
* **names** (such as person, organization, and location names) 
* **numeric expressions** (such as time, date, money, and percent expressions). 

They coined the term “Named Entity” in 1996 to represent these. 

`Named Entity Recognition is one of the most common NLP problems.`

### The uses
* Named entities can be **indexed**, linked off, etc.
* **Sentiment** can be attributed to `companies` or `products`
* A lot of **IE relations** are associations between named `entities`
* For **question answering**, answers are often named `entities`.

#### Use case 1: Classifying content for news providers

* Named Entity Recognition can automatically **scan entire articles** and reveal which are 
  * the major `people`, `organizations`, and `places` discussed in them. 
* Knowing the relevant tags for each article **helps in automatically categorizing the articles** in defined hierarchies.

<img src="figures/ner-doc-classification.png" width="70%">

#### Use case 2: Customer Support

<img src="figures/ner-customer-support.png" width="60%">

* Using Named Entity Recognition we know the entities 
  * Bandra (**location**) and Fitbit (**Product**). 
* This can then be used to **categorize** the complaint and **assign it to the relevant department** within the organization.

#### Use case 3: Powering Content Recommendations

By **extracting entities** from a particular article and 

<img src="figures/ner-recommendations.png" width="60%">

recommending the other articles which have the **most similar entities** mentioned in them. 
<img src="figures/ner-recommendations2.png" width="60%">

### Entity Extraction
There are three common methods for approaching entity extraction (and recognition):

* **Entity lists**: used when the list of entities is known and finite (e.g., a list of professional `tennis players` from 2013–2014).
* **Regular expressions**: used when the entity can be defined by a `pattern`. For example, `credit card numbers` are 16 digits beginning with a 4 (Visa), 5 (Mastercard), or 6 (Discover), or 15 digits beginning with 34 or 37 (American Express). Regular expressions can reliably find these entities.
* **Statistical models**: for entities that you cannot exhaustively list, or that overlap too much with non-entities, statistical modeling (a.k.a. **machine learning**) works best because it is context sensitive. Both traditional ML and more recent approaches such as CNNs and RNNs are used here.
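
As a rough sketch of the regular-expression method, the pattern below encodes the credit card rule exactly as stated in the list above. It assumes the digits are written contiguously (no spaces or dashes) and does no checksum validation, so it only illustrates the idea.

```python
import re

# 16 digits starting with 4, 5, or 6, or 15 digits starting with 34 or 37.
CARD_RE = re.compile(r"\b(?:[456]\d{15}|3[47]\d{13})\b")

text = "Paid with 4111111111111111 and 371449635398431, ref 1234."
matches = CARD_RE.findall(text)
print(matches)
```

The word boundaries (`\b`) stop the pattern from matching inside longer digit runs, and the short reference number is ignored because it fits neither length.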

### IOB Inside–Outside–Beginning (tagging)

**IOB** (short for Inside, Outside, Beginning) is a common format for tagging tokens.

* An **I-** prefix before a tag indicates that the token is **inside** a chunk.
* A **B-** prefix before a tag indicates that the token is the **beginning** of a chunk.
* An **O tag** indicates that a token belongs to no chunk (**outside**).
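
To make the scheme concrete, here is a small sketch that groups IOB-tagged tokens back into entity chunks. The helper name `iob_to_spans` is made up for this example, and the tags for the sample sentence were assigned by hand, using the lowercase entity names of the dataset introduced later.

```python
def iob_to_spans(tokens, tags):
    """Group IOB-tagged tokens into (entity_type, text) chunks."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        # A chunk starts at a B- tag, or at an I- tag whose type differs
        # from the chunk currently being built.
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if tag == "O" or starts_new:
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if tag != "O":
            current_type = tag[2:]
            current_tokens.append(token)
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans

tokens = ["Bill", "Clinton", "lives", "in", "Chappaqua", "."]
tags = ["B-per", "I-per", "O", "O", "B-geo", "O"]
spans = iob_to_spans(tokens, tags)
print(spans)
```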

<img src="figures/ner-tags.png" width="80%">

### Evaluation of Named Entity Recognition	

Evaluation extends the standard measures to sequences:
* Precision $P=TP/(TP+FP)$
* Recall $R=TP/(TP+FN)$
* F-measure $F_1=2PR/(P+R)$


**The Named Entity Recognition Task**: Predict entities in a text	
 
  
 <img src="figures/ner-eval-example.png" width="70%">

* **Recall** and **precision** are straightforward for tasks like **text categorization**, where there is only one grain size (documents)
* For IE/NER task evaluation there are <font style="color:blue">boundary errors</font> (which are **common**):

    <font style="color:green">&lt;ORG&gt;</font>First <font style="color:red">Bank of Chicago</font><font style="color:green">&lt;/ORG&gt;</font> announced earnings ... 
  * Predicted: `Bank of Chicago`      **ORG**
  * **FN**: `First Bank of Chicago`
  * **FP**: `Bank of Chicago`
  * **2 errors**: this counts as both an **FP** and an **FN**
  * Selecting <font style="color:red">nothing</font> would have scored better (a single **FN**)
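
The boundary-error example can be checked with a small exact-match scorer. This is only a sketch: entities are represented as hypothetical `(start, end, type)` token spans, and a prediction counts as correct only if all three fields match a gold entity.

```python
def entity_prf(gold, predicted):
    """Exact-match counts and scores over sets of (start, end, type) spans."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    fp = len(predicted - gold)
    fn = len(gold - predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return tp, fp, fn, precision, recall, f1

# Tokens: First(0) Bank(1) of(2) Chicago(3) announced(4) earnings(5)
gold = [(0, 4, "ORG")]       # "First Bank of Chicago"
predicted = [(1, 4, "ORG")]  # "Bank of Chicago" (boundary error)

boundary = entity_prf(gold, predicted)  # one FP and one FN
nothing = entity_prf(gold, [])          # a single FN
print(boundary[:3], nothing[:3])
```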

## The ML sequence model approach to NER

Training 
1. Collect a set of representative **training documents** 
2. Label each token for its **entity class** or **other** (O) 
3. Design **feature** extractors appropriate to the text and classes 
4. **Train** a **sequence classifier** to predict the labels from the data 

Testing 
1. Receive a set of **testing documents** 
2. Run sequence model inference to **label each token** 
3. Appropriately output the recognized entities

### NER pipeline	

 <img src="figures/ner-pipeline.png" width="70%">

### Encoding classes for sequence labeling

 <img src="figures/ner-encoding-example.png" width="40%">

### Features for sequence labeling

Words 
* Current **word** (essentially like a learned dictionary) 
* Previous/next word (**context**) 
* Word Shapes

Other kinds of inferred linguistic classification 
* **Part‐of‐speech** tags 

Label context 
* Previous (and perhaps next) label
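
A sketch of what such a feature extractor might look like for a single token. The function name, the feature keys, and the `<BOS>`/`<EOS>`/`<START>` placeholders are assumptions chosen for illustration. The capitalization flag stands in for the word-shape features described next.

```python
def token_features(words, pos_tags, i, prev_label="<START>"):
    """Build a feature dict for the token at position i."""
    return {
        "word": words[i].lower(),                                   # current word
        "prev_word": words[i - 1].lower() if i > 0 else "<BOS>",    # left context
        "next_word": words[i + 1].lower() if i < len(words) - 1 else "<EOS>",
        "is_capitalized": words[i][0].isupper(),                    # crude shape cue
        "pos": pos_tags[i],                                         # part-of-speech tag
        "prev_label": prev_label,                                   # label context
    }

words = ["marched", "through", "London", "to", "protest"]
pos_tags = ["VBN", "IN", "NNP", "TO", "VB"]
features = token_features(words, pos_tags, 2, prev_label="O")
print(features)
```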

### Features: Word Shapes

Map **words** to a **simplified representation** that encodes attributes such as 
* length, capitalization, numerals, Greek letters, internal punctuation, etc.
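
One simple way to compute a word shape is sketched below, assuming a common convention: uppercase letters become `X`, lowercase letters become `x`, and digits become `d`. Other characters (punctuation, Greek letters) are left unchanged. The second helper collapses repeated symbols, which discards length information.

```python
import re

def word_shape(word):
    """Replace each character class with a placeholder symbol."""
    shape = re.sub(r"[A-Z]", "X", word)
    shape = re.sub(r"[a-z]", "x", shape)
    shape = re.sub(r"\d", "d", shape)
    return shape

def short_shape(word):
    """Collapse runs of the same symbol, e.g. 'Xxxxxx' becomes 'Xx'."""
    return re.sub(r"(.)\1+", r"\1", word_shape(word))

for w in ["London", "mRNA", "CPA1"]:
    print(w, word_shape(w), short_shape(w))
```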

<img src="figures/word-shape-features.png" width="30%">

## Named Entity Recognition with Scikit-Learn
* How to train **machine learning** models for NER using Scikit-Learn’s libraries
* The **goal** is to `develop practical and domain-independent techniques` in order to detect named entities with high accuracy automatically.

In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

## The Data

The data is extracted from the GMB (Groningen Meaning Bank) corpus, which is tagged, annotated, and built specifically to train classifiers to predict **named entities** such as name, location, etc.

The data is a feature-engineered corpus **annotated with** 
* **IOB** (Inside–Outside–Beginning) tags 
* **POS** (Part-Of-Speech) tags 

that can be found at [Kaggle](https://www.kaggle.com/abhinavwalia95/how-to-loading-and-fitting-dataset-to-scikit/data). 

In [16]:
df = pd.read_csv('data/ner_dataset.csv', encoding = "ISO-8859-1")
# The entire data set cannot fit into the memory of a single computer,
# so we select the first 10,000 records.
df = df[:10000]
df.head(30)

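| | Sentence # | Word | POS | Tag |
|---|---|---|---|---|
| 0 | Sentence: 1 | Thousands | NNS | O |
| 1 | NaN | of | IN | O |
| 2 | NaN | demonstrators | NNS | O |
| 3 | NaN | have | VBP | O |
| 4 | NaN | marched | VBN | O |
| 5 | NaN | through | IN | O |
| 6 | NaN | London | NNP | B-geo |
| 7 | NaN | to | TO | O |
| 8 | NaN | protest | VB | O |
| 9 | NaN | the | DT | O |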


### Entities

* geo = Geographical Entity
* org = Organization
* per = Person
* gpe = Geopolitical Entity
* tim = Time indicator
* art = Artifact
* eve = Event
* nat = Natural Phenomenon

In [17]:
df.isnull().sum()

Sentence #    9543
Word             0
POS              0
Tag              0
dtype: int64

* There are many **NaN** values in the `Sentence #` column.

## Data Preprocessing

We fill each NaN with the preceding non-null value (forward fill), so that every token carries its sentence label.

In [18]:
df = df.ffill()  # forward fill; equivalent to fillna(method='ffill')
df.head()

| | Sentence # | Word | POS | Tag |
|---|---|---|---|---|
| 0 | Sentence: 1 | Thousands | NNS | O |
| 1 | Sentence: 1 | of | IN | O |
| 2 | Sentence: 1 | demonstrators | NNS | O |
| 3 | Sentence: 1 | have | VBP | O |
| 4 | Sentence: 1 | marched | VBN | O |


We have 457 sentences containing 2,746 unique words, tagged with 17 distinct tags.

In [19]:
df['Sentence #'].nunique(), df.Word.nunique(), df.Tag.nunique()

(457, 2746, 17)

### The tags are not evenly distributed

In [20]:
df.groupby('Tag').size().reset_index(name='counts')

| | Tag | counts |
|---|---|---|
| 0 | B-art | 28 |
| 1 | B-eve | 10 |
| 2 | B-geo | 244 |
| 3 | B-gpe | 303 |
| 4 | B-nat | 5 |
| 5 | B-org | 176 |
| 6 | B-per | 160 |
| 7 | B-tim | 149 |
| 8 | I-art | 20 |
| 9 | I-eve | 10 |


### Transform text to vector 

We use `DictVectorizer` to one-hot encode the features, then split the data into training and test sets.

In [38]:
X = df.drop('Tag', axis=1)
X.head()

| | Sentence # | Word | POS |
|---|---|---|---|
| 0 | Sentence: 1 | Thousands | NNS |
| 1 | Sentence: 1 | of | IN |
| 2 | Sentence: 1 | demonstrators | NNS |
| 3 | Sentence: 1 | have | VBP |
| 4 | Sentence: 1 | marched | VBN |


In [39]:
X.columns

Index(['Sentence #', 'Word', 'POS'], dtype='object')

In [23]:
v = DictVectorizer(sparse=False)
X = v.fit_transform(X.to_dict('records'))
X.shape

(10000, 3242)
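
To see why three string columns become thousands of numeric ones, here is a plain-Python sketch of the one-hot encoding that `DictVectorizer` applies to string-valued features: every distinct `key=value` pair gets its own column. The helper `one_hot_records` is a simplified illustration, not scikit-learn's implementation.

```python
def one_hot_records(records):
    """Map each distinct key=value pair to its own 0/1 column."""
    feature_names = sorted({f"{k}={v}" for r in records for k, v in r.items()})
    index = {name: j for j, name in enumerate(feature_names)}
    matrix = []
    for r in records:
        row = [0.0] * len(feature_names)
        for k, v in r.items():
            row[index[f"{k}={v}"]] = 1.0
        matrix.append(row)
    return feature_names, matrix

records = [
    {"Word": "through", "POS": "IN"},
    {"Word": "London", "POS": "NNP"},
    {"Word": "of", "POS": "IN"},
]
names, matrix = one_hot_records(records)
print(names)
for row in matrix:
    print(row)
```

In the notebook, each distinct value of `Sentence #`, `Word`, and `POS` contributes one column in the same way, which is how 10,000 rows end up with 3,242 features.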

In [36]:
y = df.Tag.values

In [25]:
classes = np.unique(y)

In [26]:
classes = classes.tolist()
classes

['B-art',
 'B-eve',
 'B-geo',
 'B-gpe',
 'B-nat',
 'B-org',
 'B-per',
 'B-tim',
 'I-art',
 'I-eve',
 'I-geo',
 'I-gpe',
 'I-nat',
 'I-org',
 'I-per',
 'I-tim',
 'O']

In [27]:
X.shape, y.shape

((10000, 3242), (10000,))

In [28]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state=0)

In [29]:
X_train.shape, y_train.shape

((6700, 3242), (6700,))

## Perceptron

* We use algorithms that support `partial_fit` because the dataset is very large. 
* Instead of risking memory problems, the fitting can be done **in smaller batches**.

* Tag “O” (outside) is the **most common tag**, and it would make our results look much better than they actually are. 
* We therefore **remove tag “O”** when we compute the classification metrics.
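
A toy illustration of how the “O” tag inflates scores, using the tags of the ten tokens shown in the data preview above and a hypothetical model that predicts “O” for everything:

```python
# Gold tags for the ten preview tokens ("Thousands of demonstrators ... the").
y_true = ["O", "O", "O", "O", "O", "O", "B-geo", "O", "O", "O"]
y_pred = ["O"] * len(y_true)  # a model that never predicts an entity

# Accuracy over all tokens looks good.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall restricted to entity tokens shows that nothing was found.
entity_pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != "O"]
entity_recall = sum(t == p for t, p in entity_pairs) / len(entity_pairs)

print(accuracy, entity_recall)
```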

In [30]:
new_classes = classes.copy()
new_classes.remove('O')  # drop the "outside" tag from the evaluated labels
new_classes

['B-art',
 'B-eve',
 'B-geo',
 'B-gpe',
 'B-nat',
 'B-org',
 'B-per',
 'B-tim',
 'I-art',
 'I-eve',
 'I-geo',
 'I-gpe',
 'I-nat',
 'I-org',
 'I-per',
 'I-tim']

In [32]:
per = Perceptron(verbose=10, n_jobs=-1, max_iter=5)
per.partial_fit(X_train, y_train, classes)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:    0.1s


-- Epoch 1-- Epoch 1
-- Epoch 1
-- Epoch 1

Norm: 6.16, NNZs: 32, Bias: -2.000000, T: 6700, Avg. loss: 0.001642
Total training time: 0.05 seconds.
-- Epoch 1
Norm: 23.24, NNZs: 354, Bias: -4.000000, T: 6700, Avg. loss: 0.039104
Total training time: 0.05 seconds.
-- Epoch 1
Norm: 24.70, NNZs: 379, Bias: -4.000000, T: 6700, Avg. loss: 0.043433
Total training time: 0.07 seconds.
-- Epoch 1
Norm: 8.25, NNZs: 59, Bias: -4.000000, T: 6700, Avg. loss: 0.006269
Total training time: 0.08 seconds.
-- Epoch 1
Norm: 5.29, NNZs: 20, Bias: -2.000000, T: 6700, Avg. loss: 0.001194
Total training time: 0.05 seconds.
-- Epoch 1
Norm: 19.80, NNZs: 284, Bias: -4.000000, T: 6700, Avg. loss: 0.026866
Total training time: 0.05 seconds.
-- Epoch 1
Norm: 17.75, NNZs: 179, Bias: -3.000000, T: 6700, Avg. loss: 0.011940
Total training time: 0.05 seconds.
Norm: 19.13, NNZs: 255, Bias: -4.000000, T: 6700, Avg. loss: 0.025224
Total training time: 0.06 seconds.
-- Epoch 1
-- Epoch 1
Norm: 7.14, NNZs: 48, Bias: -3.000

[Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:    0.2s
[Parallel(n_jobs=-1)]: Done  12 out of  17 | elapsed:    0.2s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  14 out of  17 | elapsed:    0.3s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done  17 out of  17 | elapsed:    0.3s finished


Perceptron(alpha=0.0001, class_weight=None, early_stopping=False, eta0=1.0,
      fit_intercept=True, max_iter=5, n_iter=None, n_iter_no_change=5,
      n_jobs=-1, penalty=None, random_state=0, shuffle=True, tol=None,
      validation_fraction=0.1, verbose=10, warm_start=False)

In [17]:
print(classification_report(y_pred=per.predict(X_test), y_true=y_test, labels=new_classes))

             precision    recall  f1-score   support

      B-art       0.15      0.12      0.14        24
      B-eve       0.46      0.32      0.37        19
      B-geo       0.42      0.91      0.57      1085
      B-gpe       0.89      0.78      0.83       556
      B-nat       0.11      0.25      0.15        12
      B-org       0.55      0.35      0.43       589
      B-per       0.72      0.43      0.53       564
      B-tim       0.65      0.78      0.71       611
      I-art       0.02      0.08      0.03        12
      I-eve       0.00      0.00      0.00        18
      I-geo       0.81      0.32      0.46       230
      I-gpe       0.00      0.00      0.00        14
      I-nat       0.50      0.50      0.50         2
      I-org       0.71      0.41      0.52       445
      I-per       0.76      0.20      0.32       591
      I-tim       0.26      0.05      0.09       194

avg / total       0.62      0.55      0.53      4966





## Linear classifiers with SGD training

Regularized linear models with stochastic gradient descent (SGD) learning: 
* the gradient of the loss is estimated one sample at a time
* the model is updated along the way with a decreasing strength schedule (a.k.a. the learning rate). 
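
The two points above can be sketched in plain Python for a binary problem with hinge loss and an L2 penalty. This is a didactic sketch, not scikit-learn's implementation: the `1/sqrt(t)` learning-rate schedule, the absence of shuffling, and the toy data are all assumptions made for the example.

```python
import math

def sgd_hinge(samples, labels, epochs=20, eta0=1.0, alpha=0.0001):
    """Train a linear classifier with per-sample hinge-loss SGD and an L2 penalty."""
    w = [0.0] * len(samples[0])
    b = 0.0
    t = 0
    for _ in range(epochs):
        for x, y in zip(samples, labels):  # y is -1 or +1
            t += 1
            eta = eta0 / math.sqrt(t)      # decreasing learning rate (assumed schedule)
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # The L2 penalty shrinks the weights a little on every step.
            w = [wi * (1.0 - eta * alpha) for wi in w]
            if margin < 1.0:               # hinge loss is active: step toward y * x
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

samples = [[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]]
labels = [1, 1, -1, -1]
w, b = sgd_hinge(samples, labels)
predictions = [predict(w, b, x) for x in samples]
print(predictions)
```

Each update uses the gradient from a single sample, so the model changes after every example rather than once per pass over the data.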

In [41]:
sgd = SGDClassifier()
sgd.partial_fit(X_train, y_train, classes)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=None,
       n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',
       power_t=0.5, random_state=None, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False)

In [43]:
print(classification_report(y_pred=sgd.predict(X_test), y_true=y_test, labels=new_classes))

              precision    recall  f1-score   support

       B-art       0.00      0.00      0.00         9
       B-eve       0.00      0.00      0.00         3
       B-geo       0.22      0.93      0.36        69
       B-gpe       0.92      0.45      0.61       102
       B-nat       0.00      0.00      0.00         0
       B-org       1.00      0.05      0.09        63
       B-per       0.94      0.41      0.58        41
       B-tim       1.00      0.48      0.65        52
       I-art       0.00      0.00      0.00        10
       I-eve       0.00      0.00      0.00         3
       I-geo       0.00      0.00      0.00        11
       I-gpe       0.18      0.50      0.26         6
       I-nat       0.00      0.00      0.00         1
       I-org       0.82      0.30      0.44        47
       I-per       0.68      0.23      0.34        66
       I-tim       0.00      0.00      0.00         4

   micro avg       0.43      0.38      0.40       487
   macro avg       0.36   

