# Build your own NER Tagger

__Named Entity Recognition (NER)__, also known as entity chunking or entity extraction, is a popular information extraction technique that identifies and segments named entities in text and classifies them into predefined classes.

There are various off-the-shelf solutions that offer named entity extraction capabilities (some of which we discussed in the previous units). Yet there are times when the requirements go beyond the capabilities of off-the-shelf classifiers.

In this notebook, we will go through an exercise to build our own NER tagger using Conditional Random Fields.
We will use ```sklearn_crfsuite``` to develop it.

## Load Dataset

Named Entity Recognition is a sequence modeling problem at its core. It is closely related to classification, in that we need a labeled dataset to train a model.

There are various labeled datasets for NER problems. We will use a pre-processed version of the __GMB (Groningen Meaning Bank) corpus__ for this notebook. The pre-processed version is available at the following link: [kaggle/ner](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus)

We have provided the dataset in the code repository itself as a gzip-compressed CSV file, which you can load directly with `pandas` as follows.

In [1]:
import pandas as pd

df = pd.read_csv('ner_dataset.csv.gz', compression='gzip', encoding='ISO-8859-1')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 4 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   Sentence #  47959 non-null    object
 1   Word        1048575 non-null  object
 2   POS         1048575 non-null  object
 3   Tag         1048575 non-null  object
dtypes: object(4)
memory usage: 32.0+ MB


In [2]:
df.T

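| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 1048565 | 1048566 | 1048567 | 1048568 | 1048569 | 1048570 | 1048571 | 1048572 | 1048573 | 1048574 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sentence # | Sentence: 1 | | | | | | | | | | ... | | | Sentence: 47959 | | | | | | | |
| Word | Thousands | of | demonstrators | have | marched | through | London | to | protest | the | ... | impact | . | Indian | forces | said | they | responded | to | the | attack |
| POS | NNS | IN | NNS | VBP | VBN | IN | NNP | TO | VB | DT | ... | NN | . | JJ | NNS | VBD | PRP | VBD | TO | DT | NN |
| Tag | O | O | O | O | O | O | B-geo | O | O | O | ... | O | O | B-gpe | O | O | O | O | O | O | O |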


In [3]:
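# Forward-fill the sentence label so that every token row carries its sentence ID
# (DataFrame.ffill() replaces the deprecated fillna(method='ffill') form)
df = df.ffill()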
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1048575 entries, 0 to 1048574
Data columns (total 4 columns):
 #   Column      Non-Null Count    Dtype 
---  ------      --------------    ----- 
 0   Sentence #  1048575 non-null  object
 1   Word        1048575 non-null  object
 2   POS         1048575 non-null  object
 3   Tag         1048575 non-null  object
dtypes: object(4)
memory usage: 32.0+ MB


In [4]:
df.T

| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... | 1048565 | 1048566 | 1048567 | 1048568 | 1048569 | 1048570 | 1048571 | 1048572 | 1048573 | 1048574 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Sentence # | Sentence: 1 | Sentence: 1 | Sentence: 1 | Sentence: 1 | Sentence: 1 | Sentence: 1 | Sentence: 1 | Sentence: 1 | Sentence: 1 | Sentence: 1 | ... | Sentence: 47958 | Sentence: 47958 | Sentence: 47959 | Sentence: 47959 | Sentence: 47959 | Sentence: 47959 | Sentence: 47959 | Sentence: 47959 | Sentence: 47959 | Sentence: 47959 |
| Word | Thousands | of | demonstrators | have | marched | through | London | to | protest | the | ... | impact | . | Indian | forces | said | they | responded | to | the | attack |
| POS | NNS | IN | NNS | VBP | VBN | IN | NNP | TO | VB | DT | ... | NN | . | JJ | NNS | VBD | PRP | VBD | TO | DT | NN |
| Tag | O | O | O | O | O | O | B-geo | O | O | O | ... | O | O | B-gpe | O | O | O | O | O | O | O |


In [5]:
# To get a deeper understanding of the data we are dealing with and the total number
# of annotated tags, we can use the following code.
df['Sentence #'].nunique(), df.Word.nunique(), df.POS.nunique(), df.Tag.nunique()

(47959, 35178, 42, 17)

We have 47959 sentences that contain 35178 unique words. These sentences have 42 unique POS tags and 17 unique NER tags in total.

## Tag Distribution

The GMB dataset uses IOB tagging (_Inside, Outside, Beginning_). IOB is a common format for tagging tokens, which we have discussed earlier. To refresh your memory:

+ __I- prefix__ before a tag indicates that the token is inside a chunk.
+ __B- prefix__ before a tag indicates that the token is the beginning of a chunk.
+ __O tag__ indicates that a token belongs to no chunk (outside).
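
As a small illustration of the scheme above, the following sketch converts entity spans into IOB tags. The helper `spans_to_iob`, the example sentence and the span format (token start index, exclusive end index, label) are assumptions made for illustration; they are not part of the GMB corpus or of any library.

```python
def spans_to_iob(tokens, spans):
    """Return one IOB tag per token.

    `spans` is a list of (start, end, label) tuples over token indices,
    with `end` exclusive (an illustrative format, not a library one).
    """
    tags = ['O'] * len(tokens)           # every token starts outside any chunk
    for start, end, label in spans:
        tags[start] = f'B-{label}'       # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = f'I-{label}'       # remaining tokens of the chunk
    return tags


tokens = ['Mark', 'Zuckerberg', 'visited', 'London', '.']
spans = [(0, 2, 'per'), (3, 4, 'geo')]
iob_tags = spans_to_iob(tokens, spans)
print(list(zip(tokens, iob_tags)))
```

Each token ends up with exactly one tag, which is what lets us treat NER as a per-token sequence labeling problem.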

The tags in this dataset are explained as follows:

+ __geo__ = Geographical Entity
+ __org__ = Organization
+ __per__ = Person
+ __gpe__ = Geopolitical Entity
+ __tim__ = Time indicator
+ __art__ = Artifact
+ __eve__ = Event
+ __nat__ = Natural Phenomenon

Anything outside these classes is labeled as other, denoted by __O__.

The following output shows the unbalanced distribution of tags in the dataset:

In [6]:
df.Tag.value_counts()

O        887908
B-geo     37644
B-tim     20333
B-org     20143
I-per     17251
B-per     16990
I-org     16784
B-gpe     15870
I-geo      7414
I-tim      6528
B-art       402
B-eve       308
I-art       297
I-eve       253
B-nat       201
I-gpe       198
I-nat        51
Name: Tag, dtype: int64
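
To put the imbalance into numbers, here is a minimal sketch that works directly from the counts printed above (copied into a plain dictionary so it runs without the dataset). The cutoff of 1000 tokens for a "rare" tag is an arbitrary choice made for illustration.

```python
# Tag counts copied from the value_counts() output above
tag_counts = {
    'O': 887908, 'B-geo': 37644, 'B-tim': 20333, 'B-org': 20143,
    'I-per': 17251, 'B-per': 16990, 'I-org': 16784, 'B-gpe': 15870,
    'I-geo': 7414, 'I-tim': 6528, 'B-art': 402, 'B-eve': 308,
    'I-art': 297, 'I-eve': 253, 'B-nat': 201, 'I-gpe': 198, 'I-nat': 51,
}

total = sum(tag_counts.values())
o_share = tag_counts['O'] / total

# Illustrative threshold: tags seen fewer than 1000 times
rare_tags = sorted(tag for tag, n in tag_counts.items() if n < 1000)

print(f"Tokens: {total}, share of 'O': {o_share:.1%}")
print("Tags with fewer than 1000 tokens:", rare_tags)
```

Because the __O__ tag dominates so heavily, plain accuracy would look good even for a model that never predicts an entity, which is worth keeping in mind when choosing evaluation metrics.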

## Conditional Random Fields

As mentioned above, NER belongs to the sequence modeling class of problems. There are different algorithms for sequence modeling, and __CRFs__ (_Conditional Random Fields_) are one example. CRFs have been shown to perform well on NER and related tasks. In this notebook, we will develop our own NER tagger based on CRFs.

---

__Question__: What is a CRF and how does it work?

__Wikipedia__ :  CRF is an undirected graphical model whose nodes can be divided into exactly two disjoint sets $X$ and $Y$, the observed and output variables, respectively; the conditional distribution $p(Y|X)$ is then modeled.

For more details, check out the paper [__Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data__](https://repository.upenn.edu/cgi/viewcontent.cgi?article=1162&context=cis_papers)
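
For sequence labeling, the variant typically used is the linear-chain CRF. As a sketch in standard textbook notation (the symbols $f_k$, $\lambda_k$, $T$ and $Z(X)$ are introduced here for illustration and do not appear elsewhere in this notebook), the conditional distribution can be written as:

$$
p(Y \mid X) = \frac{1}{Z(X)} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, X, t) \right),
\qquad
Z(X) = \sum_{Y'} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, X, t) \right)
$$

Here $T$ is the sentence length, each $f_k$ is a feature function that looks at the previous label $y_{t-1}$, the current label $y_t$ and the observed tokens $X$ at position $t$, and $\lambda_k$ is its learned weight. The normalizer $Z(X)$ sums over all possible label sequences $Y'$, so that $p(Y \mid X)$ is a proper distribution over whole sequences rather than over individual tokens.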

## Prepare Data

CRFs train on sequences of input data to learn transitions from one state (label) to another.
To enable such an algorithm, we need to define features that describe each token and its context.
In the function ```word2features()``` below, we transform each word into a feature dictionary with the following attributes:

+ lower-cased form of the word
+ suffix containing the last 3 characters
+ suffix containing the last 2 characters
+ flags indicating whether the word is upper-case, title-case or numeric
+ POS tag and its first two characters

We also attach similar attributes for the previous and next words, and flag the beginning of a sentence (BOS) or end of a sentence (EOS) when there is no previous or next word.

In [7]:
def word2features(sent, i):
    word = sent[i][0]
    postag = sent[i][1]

    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if i > 0:
        word1 = sent[i-1][0]
        postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True

    return features


def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for token, postag, label in sent]

In [8]:
# Let’s now define a function to extract our word token, POS tag, and NER tag triplets from sentences. 
# We will be applying this to all our input sentences
agg_func = lambda s: [(w, p, t) for w, p, t in zip(s['Word'].values.tolist(), 
                                                   s['POS'].values.tolist(), 
                                                   s['Tag'].values.tolist())]

In [9]:
grouped_df = df.groupby('Sentence #').apply(agg_func)

In [10]:
print(grouped_df[grouped_df.index == 'Sentence: 1'].values)

[list([('Thousands', 'NNS', 'O'), ('of', 'IN', 'O'), ('demonstrators', 'NNS', 'O'), ('have', 'VBP', 'O'), ('marched', 'VBN', 'O'), ('through', 'IN', 'O'), ('London', 'NNP', 'B-geo'), ('to', 'TO', 'O'), ('protest', 'VB', 'O'), ('the', 'DT', 'O'), ('war', 'NN', 'O'), ('in', 'IN', 'O'), ('Iraq', 'NNP', 'B-geo'), ('and', 'CC', 'O'), ('demand', 'VB', 'O'), ('the', 'DT', 'O'), ('withdrawal', 'NN', 'O'), ('of', 'IN', 'O'), ('British', 'JJ', 'B-gpe'), ('troops', 'NNS', 'O'), ('from', 'IN', 'O'), ('that', 'DT', 'O'), ('country', 'NN', 'O'), ('.', '.', 'O')])]


In [11]:
grouped_df.shape

(47959,)

In [12]:
# We can now view a sample annotated sentence from our dataset with the following code.
sentences = [s for s in grouped_df]
sentences[0]

[('Thousands', 'NNS', 'O'),
 ('of', 'IN', 'O'),
 ('demonstrators', 'NNS', 'O'),
 ('have', 'VBP', 'O'),
 ('marched', 'VBN', 'O'),
 ('through', 'IN', 'O'),
 ('London', 'NNP', 'B-geo'),
 ('to', 'TO', 'O'),
 ('protest', 'VB', 'O'),
 ('the', 'DT', 'O'),
 ('war', 'NN', 'O'),
 ('in', 'IN', 'O'),
 ('Iraq', 'NNP', 'B-geo'),
 ('and', 'CC', 'O'),
 ('demand', 'VB', 'O'),
 ('the', 'DT', 'O'),
 ('withdrawal', 'NN', 'O'),
 ('of', 'IN', 'O'),
 ('British', 'JJ', 'B-gpe'),
 ('troops', 'NNS', 'O'),
 ('from', 'IN', 'O'),
 ('that', 'DT', 'O'),
 ('country', 'NN', 'O'),
 ('.', '.', 'O')]

In [13]:
# The preceding output shows a standard tokenized sentence with POS and NER tags. Let’s look at how each annotated tokenized
# sentence can be used for our feature engineering with the function we defined earlier.
sent2features(sentences[0][5:7])

[{'bias': 1.0,
  'word.lower()': 'through',
  'word[-3:]': 'ugh',
  'word[-2:]': 'gh',
  'word.isupper()': False,
  'word.istitle()': False,
  'word.isdigit()': False,
  'postag': 'IN',
  'postag[:2]': 'IN',
  'BOS': True,
  '+1:word.lower()': 'london',
  '+1:word.istitle()': True,
  '+1:word.isupper()': False,
  '+1:postag': 'NNP',
  '+1:postag[:2]': 'NN'},
 {'bias': 1.0,
  'word.lower()': 'london',
  'word[-3:]': 'don',
  'word[-2:]': 'on',
  'word.isupper()': False,
  'word.istitle()': True,
  'word.isdigit()': False,
  'postag': 'NNP',
  'postag[:2]': 'NN',
  '-1:word.lower()': 'through',
  '-1:word.istitle()': False,
  '-1:word.isupper()': False,
  '-1:postag': 'IN',
  '-1:postag[:2]': 'IN',
  'EOS': True}]

In [14]:
sent2labels(sentences[0][5:7])

['O', 'B-geo']

## Prepare Train and Test Datasets

In [16]:
from sklearn.model_selection import train_test_split
import numpy as np

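# Sentences have different lengths, so build 1-D object arrays explicitly;
# recent NumPy versions refuse to infer an array from ragged nested lists
X = np.array([sent2features(s) for s in sentences], dtype=object)
y = np.array([sent2labels(s) for s in sentences], dtype=object)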

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train.shape, X_test.shape

# It is now time to start training our model. For this, we use sklearn-crfsuite

((35969,), (11990,))

# Building Models with sklearn-crfsuite

__`sklearn-crfsuite`__ is a thin [CRFsuite (python-crfsuite)](https://github.com/scrapinghub/python-crfsuite) wrapper which provides a scikit-learn-compatible `sklearn_crfsuite.CRF` estimator. This means you can use scikit-learn model selection utilities (cross-validation, hyperparameter optimization) with it, or save and load CRF models using joblib.

In [17]:
!pip install sklearn-crfsuite

Collecting sklearn-crfsuite
  Downloading sklearn_crfsuite-0.3.6-py2.py3-none-any.whl (12 kB)
Collecting python-crfsuite>=0.8.3
  Downloading python_crfsuite-0.9.7-cp37-cp37m-win_amd64.whl (154 kB)
Collecting tabulate
  Downloading tabulate-0.8.7-py3-none-any.whl (24 kB)
Installing collected packages: python-crfsuite, tabulate, sklearn-crfsuite
Successfully installed python-crfsuite-0.9.7 sklearn-crfsuite-0.3.6 tabulate-0.8.7


# Train the model!

Train the model using the default configurations mentioned in the [sklearn-crfsuite API docs](https://sklearn-crfsuite.readthedocs.io/en/latest/api.html)


- __algorithm:__ the training algorithm. We use [L-BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS), a quasi-Newton gradient-based optimization method, to estimate the model parameters
- __c1:__ Coefficient for Lasso (L1) regularization
- __c2:__ Coefficient for Ridge (L2) regularization
- __all_possible_transitions:__ Specify whether CRFsuite generates transition features that do not even occur in the training data
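
To see how `c1` and `c2` enter the training, the objective minimized by L-BFGS can be sketched as a penalized negative log-likelihood. The notation below ($\lambda$ for the feature weights, $N$ training sentences $X^{(i)}$ with label sequences $Y^{(i)}$) is introduced here for illustration, and the exact scaling of the penalty terms inside CRFsuite should be treated as an assumption:

$$
\min_{\lambda} \; -\sum_{i=1}^{N} \log p\big(Y^{(i)} \mid X^{(i)}\big) \; + \; c_1 \lVert \lambda \rVert_1 \; + \; c_2 \lVert \lambda \rVert_2^2
$$

A larger `c1` pushes more weights to exactly zero (fewer active features), while a larger `c2` shrinks all weights smoothly towards zero.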


__Note:__ If the model is taking too long to train, you can load up the pre-trained model using the code after the training cells and use that for predictions.

In [21]:
import sklearn_crfsuite

crf = sklearn_crfsuite.CRF(algorithm='lbfgs',
                           c1=0.1,
                           c2=0.1,
                           max_iterations=100,
                           all_possible_transitions=True,
                           verbose=True)

In [17]:
crf.fit(X_train, y_train)

loading training data to CRFsuite: 100%|███████████████████████████████████████| 35969/35969 [00:15<00:00, 2384.94it/s]



Feature generation
type: CRF1d
feature.minfreq: 0.000000
feature.possible_states: 0
feature.possible_transitions: 1
0....1....2....3....4....5....6....7....8....9....10
Number of features: 133629
Seconds required: 3.486

L-BFGS optimization
c1: 0.100000
c2: 0.100000
num_memories: 6
max_iterations: 100
epsilon: 0.000010
stop: 10
delta: 0.000010
linesearch: MoreThuente
linesearch.max_iterations: 20

Iter 1   time=4.01  loss=1264028.26 active=132637 feature_norm=1.00
Iter 2   time=3.99  loss=994059.01 active=131294 feature_norm=4.42
Iter 3   time=2.00  loss=776413.87 active=125970 feature_norm=3.87
Iter 4   time=11.46 loss=422143.40 active=127018 feature_norm=3.24
Iter 5   time=2.11  loss=355775.44 active=129029 feature_norm=4.04
Iter 6   time=2.23  loss=264125.22 active=124046 feature_norm=6.10
Iter 7   time=2.36  loss=222304.71 active=117183 feature_norm=7.69
Iter 8   time=2.21  loss=197827.17 active=110838 feature_norm=8.75
Iter 9   time=2.09  loss=176877.92 active=105650 feature_norm

CRF(algorithm='lbfgs', all_possible_states=None,
  all_possible_transitions=True, averaging=None, c=None, c1=0.1, c2=0.1,
  calibration_candidates=None, calibration_eta=None,
  calibration_max_trials=None, calibration_rate=None,
  calibration_samples=None, delta=None, epsilon=None, error_sensitive=None,
  gamma=None, keep_tempfiles=None, linesearch=None, max_iterations=100,
  max_linesearch=None, min_freq=None, model_filename=None,
  num_memories=None, pa_type=None, period=None, trainer_cls=None,
  variance=None, verbose=True)

## Use the following to load our pre-trained model if training above takes a lot of time

In [22]:
# joblib is a standalone package; sklearn.externals.joblib was removed from scikit-learn
import joblib

joblib.dump(crf, 'ner_model.pkl')

['ner_model.pkl']

In [23]:
crf = joblib.load('ner_model.pkl')

# Model Evaluation

Let's evaluate our model performance for NER Tagging on the test data now!

Try playing around with the following cells and observe the overall model performance.

We use standard classification metrics like precision, recall and f1-score
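
As a reminder of what these metrics measure at the token level, here is a minimal pure-Python sketch that flattens the per-sentence label lists and computes precision, recall and F1 for a single label. The helper `flat_prf` and the toy data are hypothetical and only illustrate the definitions; they are not how `sklearn_crfsuite` computes its report.

```python
from itertools import chain


def flat_prf(y_true, y_pred, label):
    """Token-level precision, recall and F1 for one label."""
    # Flatten the list of sentences into one long list of tags
    true_flat = list(chain.from_iterable(y_true))
    pred_flat = list(chain.from_iterable(y_pred))
    pairs = list(zip(true_flat, pred_flat))

    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)

    # Report 0.0 whenever a denominator would be zero
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy example with two short "sentences"
toy_true = [['O', 'B-per', 'I-per'], ['B-geo', 'O']]
toy_pred = [['O', 'B-per', 'O'], ['B-geo', 'B-geo']]

for toy_label in ['B-per', 'I-per', 'B-geo']:
    print(toy_label, flat_prf(toy_true, toy_pred, toy_label))
```

Precision penalizes tags predicted where they do not belong (the extra `B-geo`), while recall penalizes entity tokens the model missed (the `I-per` predicted as `O`).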

In [24]:
y_pred = crf.predict(X_test)
print(y_pred[0])

AttributeError: 'NoneType' object has no attribute 'tag'

In [19]:
print(y_test[0])

['O', 'O', 'O', 'O', 'B-per', 'I-per', 'O', 'B-org', 'O', 'O', 'B-gpe', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']


In [20]:
from sklearn_crfsuite import metrics as crf_metrics

labels = list(crf.classes_)
labels.remove('O')

In [21]:
print(crf_metrics.flat_classification_report(y_test, y_pred, labels=labels))

              precision    recall  f1-score   support

       B-org       0.81      0.73      0.77      5116
       B-per       0.85      0.84      0.84      4239
       I-per       0.85      0.90      0.88      4273
       B-geo       0.86      0.91      0.89      9403
       I-geo       0.81      0.80      0.81      1826
       B-tim       0.93      0.89      0.91      5095
       I-org       0.82      0.79      0.80      4195
       B-gpe       0.97      0.94      0.96      3961
       I-tim       0.84      0.81      0.82      1604
       B-nat       0.50      0.24      0.32        55
       B-eve       0.51      0.33      0.40        80
       B-art       0.36      0.14      0.20       102
       I-art       0.24      0.07      0.10        90
       I-eve       0.45      0.19      0.27        74
       I-gpe       0.86      0.53      0.66        36
       I-nat       0.57      0.22      0.32        18

   micro avg       0.86      0.85      0.86     40167
   macro avg       0.70   

We have intentionally left out the ___Others___ (__O__) tag to understand the performance of the model on the remaining tags. The evaluation statistics above show a model that seems to have learned the transitions quite well, giving us an overall micro-averaged F1-score of 0.86!

We can achieve even better results by fine-tuning the feature engineering step along with hyperparameter tuning.

## End-to-End NER Tagger with trained NER Model

There is no fun (or value!) if we cannot use our model to tag new sentences, assuming we want to put this model in production. Let's build an end-to-end workflow to perform NER tagging on a sample document. First, we perform NER tagging with spaCy to remind you what it looks like.

### Prepare Sample Document

In [2]:
import re

text = """Three more countries have joined an “international grand committee” of parliaments, adding to calls for 
Facebook’s boss, Mark Zuckerberg, to give evidence on misinformation to the coalition. Brazil, Latvia and Singapore 
bring the total to eight different parliaments across the world, with plans to send representatives to London on 27 
November with the intention of hearing from Zuckerberg. Since the Cambridge Analytica scandal broke, the Facebook chief 
has only appeared in front of two legislatures: the American Senate and House of Representatives, and the European parliament. 
Facebook has consistently rebuffed attempts from others, including the UK and Canadian parliaments, to hear from Zuckerberg. 
He added that an article in the New York Times on Thursday, in which the paper alleged a pattern of behaviour from Facebook 
to “delay, deny and deflect” negative news stories, “raises further questions about how recent data breaches were allegedly 
dealt with within Facebook.”
"""

text = re.sub(r'\n', '', text)
text

'Three more countries have joined an “international grand committee” of parliaments, adding to calls for Facebook’s boss, Mark Zuckerberg, to give evidence on misinformation to the coalition. Brazil, Latvia and Singapore bring the total to eight different parliaments across the world, with plans to send representatives to London on 27 November with the intention of hearing from Zuckerberg. Since the Cambridge Analytica scandal broke, the Facebook chief has only appeared in front of two legislatures: the American Senate and House of Representatives, and the European parliament. Facebook has consistently rebuffed attempts from others, including the UK and Canadian parliaments, to hear from Zuckerberg. He added that an article in the New York Times on Thursday, in which the paper alleged a pattern of behaviour from Facebook to “delay, deny and deflect” negative news stories, “raises further questions about how recent data breaches were allegedly dealt with within Facebook.”'

In [3]:
text = """οκτώ 8 Παρασκευή 3 Νοεμβρίου 2018 Zuckeberg Ευρωπαϊκή Ένωση: αποτελείται από 27 μέλη, τα οποία συναντιούνται κατ' ιδίαν για να συμφωνήσουν σχετικά με τις κοινές θέσεις τους και αντιπροσωπεύεται από τη χώρα που έχει την προεδρία. Σημειώνεται ότι η Ευρωπαϊκή Ένωση είναι η πιο ενεργή ομάδα όσον αφορά στις διαπραγματεύσεις για την προστασία του περιβάλλοντος και πιέζει συνεχώς για τη λήψη αυστηρών μέτρων. Σημειώνεται ότι την περίοδο των διαπραγματεύσεων η Ευρωπαϊκή Ένωση αποτελούνταν από 15 κράτη μέλη, με αυτά όμως συμμάχησαν και τα 12 νέα μέλη της διεύρυνσης.
«Λέσχη του Άνθρακα» (“Carbon Club”): περιλαμβάνει τις χώρες «JUSCANZ» (από τα αρχικά των χωρών Ιαπωνία, ΗΠΑ, Καναδάς, Αυστραλία, Νέα Ζηλανδία στα Αγγλικά), τις χώρες μέλη του ΟΠΕΚ, τη Ρωσία και τη Νορβηγία, στις οποίες γενικά τα συμφέροντά τους θίγονται από το Πρωτόκολλο του Κιότο (είτε επειδή θα πρέπει να μειώσουν την παραγωγή τους είτε επειδή προτείνεται η στροφή προς διαφορετικά καύσιμα) και κατά συνέπεια αντιτίθενται στην καθιέρωση των δικαιωμάτων και στη λήψη αυστηρών μέτρων.
Συμμαχία των Μικρών Νησιωτικών Κρατών (AOSIS): είναι ένας συνασπισμός περίπου 43 μικρών νησιωτικών κρατών, τα οποία είναι ιδιαίτερα ευάλωτα στην άνοδο της στάθμης της θάλασσας. Τα κράτη αυτά κινδυνεύουν να εξαφανιστούν από το χάρτη εξαιτίας του μικρού τους υψομέτρου σε σχέση με το επίπεδο της θάλασσας και επομένως απειλείται άμεσα η ίδια τους η επιβίωση. Οι χώρες της ομάδας αυτής ήταν μάλιστα οι πρώτες που πρότειναν ένα σχέδιο κειμένου κατά τη διάρκεια των διαπραγματεύσεων του πρωτοκόλλου του Κιότο ζητώντας μία μείωση στις εκπομπές διοξειδίου του άνθρακα της τάξης του 20% έως το 2005 σε σχέση με τα επίπεδα του 1990.
Λιγότερο αναπτυγμένες χώρες: πρόκειται για 48 χώρες, οι οποίες συμμετείχαν όλο και πιο ενεργά στη διαδικασία των διαπραγματεύσεων για την αλλαγή του κλίματος, συχνά για να υπερασπιστούν τα ιδιαίτερα συμφέροντά τους και την εύθραυστη οικονομία τους, όπως για παράδειγμα την παροχή μέτρων για να μπορέσουν να προσαρμοστούν στην αλλαγή του κλίματος και να μην είναι τόσο ευάλωτες.
Ομάδα των 77 (G-77): πρόκειται για εκείνες τις αναπτυσσόμενες χώρες που είναι αναδυόμενες, όπως η Ινδία και η Κίνα, που θεωρούν ότι βρίσκονται σε τροχιά ανάπτυξης και ότι είναι εις βάρος τους να δεσμευτούν να περιορίσουν τις εκπομπές τους. Η δε απαίτηση των βιομηχανικών χωρών (που είναι κυρίως υπεύθυνες για τις μεγαλύτερες εκπομπές αερίων του θερμοκηπίου παγκοσμίως) να αντιμετωπιστούν επί ίσοις όροις με τις αναπτυσσόμενες χώρες τους φαίνεται άδικη και παράλογη.
"""

text = re.sub(r'\n', '', text)
text

"οκτώ 8 Παρασκευή 3 Νοεμβρίου 2018 Zuckeberg Ευρωπαϊκή Ένωση: αποτελείται από 27 μέλη, τα οποία συναντιούνται κατ' ιδίαν για να συμφωνήσουν σχετικά με τις κοινές θέσεις τους και αντιπροσωπεύεται από τη χώρα που έχει την προεδρία. Σημειώνεται ότι η Ευρωπαϊκή Ένωση είναι η πιο ενεργή ομάδα όσον αφορά στις διαπραγματεύσεις για την προστασία του περιβάλλοντος και πιέζει συνεχώς για τη λήψη αυστηρών μέτρων. Σημειώνεται ότι την περίοδο των διαπραγματεύσεων η Ευρωπαϊκή Ένωση αποτελούνταν από 15 κράτη μέλη, με αυτά όμως συμμάχησαν και τα 12 νέα μέλη της διεύρυνσης.«Λέσχη του Άνθρακα» (“Carbon Club”): περιλαμβάνει τις χώρες «JUSCANZ» (από τα αρχικά των χωρών Ιαπωνία, ΗΠΑ, Καναδάς, Αυστραλία, Νέα Ζηλανδία στα Αγγλικά), τις χώρες μέλη του ΟΠΕΚ, τη Ρωσία και τη Νορβηγία, στις οποίες γενικά τα συμφέροντά τους θίγονται από το Πρωτόκολλο του Κιότο (είτε επειδή θα πρέπει να μειώσουν την παραγωγή τους είτε επειδή προτείνεται η στροφή προς διαφορετικά καύσιμα) και κατά συνέπεια αντιτίθενται στην καθιέρω

### NER Tagging with SpaCy

In [4]:
import spacy
from spacy import displacy

import el_core_news_md
nlp = el_core_news_md.load()
# nlp = spacy.load('en_core_web_sm')
text_nlp = nlp(text)
displacy.render(text_nlp, style='ent', jupyter=True)

### Pipeline Step 1

- Tokenize Text
- POS Tagging

In [6]:
import nltk

text_tokens = nltk.word_tokenize(text)
text_pos = nltk.pos_tag(text_tokens)
text_pos[:20]

[('οκτώ', 'RB'),
 ('8', 'CD'),
 ('Παρασκευή', 'JJ'),
 ('3', 'CD'),
 ('Νοεμβρίου', 'JJ'),
 ('2018', 'CD'),
 ('Zuckeberg', 'NNP'),
 ('Ευρωπαϊκή', 'NNP'),
 ('Ένωση', 'NN'),
 (':', ':'),
 ('αποτελείται', 'JJ'),
 ('από', '$'),
 ('27', 'CD'),
 ('μέλη', 'NNP'),
 (',', ','),
 ('τα', 'NNP'),
 ('οποία', 'NNP'),
 ('συναντιούνται', 'NNP'),
 ('κατ', 'NNP'),
 ("'", 'POS')]

### Pipeline Step 2
- Extract Features from the POS tagged text document
- Hint: Use `sent2features`

In [7]:
features = [sent2features(text_pos)]
features[0][0]

NameError: name 'sent2features' is not defined

### Pipeline Step 3
- Use the CRF Model `crf` to predict on the features

In [30]:
labels = crf.predict(features)
doc_labels = labels[0]
doc_labels[10:20]

AttributeError: 'NoneType' object has no attribute 'tag'

### Pipeline Step 4
- Combine text tokens with NER Tags
- Retrieve relevant named entities from NER Tags

In [31]:
text_ner = [(token, tag) for token, tag in zip(text_tokens, doc_labels)]
print(text_ner)

NameError: name 'doc_labels' is not defined

In [28]:
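named_entities = []
temp_entity_name = ''
temp_named_entity = None
for term, tag in text_ner:
    if tag != 'O':
        # Extend the current entity with this token and remember its latest tag
        temp_entity_name = ' '.join([temp_entity_name, term]).strip()
        temp_named_entity = (temp_entity_name, tag)
    else:
        # An 'O' tag closes any entity that is currently open
        if temp_named_entity:
            named_entities.append(temp_named_entity)
            temp_entity_name = ''
            temp_named_entity = None

# Flush an entity that runs up to the last token of the document
if temp_named_entity:
    named_entities.append(temp_named_entity)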

In [29]:
import pandas as pd

pd.DataFrame(named_entities, columns=['Entity', 'Tag'])

| | Entity | Tag |
|---|---|---|
| 0 | Facebook ’ | I-art |
| 1 | Mark Zuckerberg | I-per |
| 2 | Brazil | B-geo |
| 3 | Latvia and Singapore | I-org |
| 4 | London | B-geo |
| 5 | 27 November | I-tim |
| 6 | Zuckerberg | B-geo |
| 7 | Cambridge Analytica | I-org |
| 8 | Facebook | B-org |
| 9 | American Senate and House of Representatives | I-org |
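
The loop in Pipeline Step 4 groups every run of non-`O` tokens into one entity and reports the last tag it saw (for example `I-per`). As a possible refinement, the sketch below uses the `B-`/`I-` prefixes to split adjacent entities and reports the entity type without the prefix. The helper `iob_to_entities` and the sample input are hypothetical and hand-written for illustration; they are not output from our model.

```python
def iob_to_entities(tagged_tokens):
    """Group (token, IOB tag) pairs into (entity text, entity type) tuples."""
    entities = []
    words, ent_type = [], None
    for token, tag in tagged_tokens:
        prefix, _, label = tag.partition('-')   # 'B-per' -> ('B', '-', 'per')
        # An I- tag of the same type continues the open entity
        if prefix == 'I' and label == ent_type:
            words.append(token)
            continue
        # Anything else closes the open entity, if there is one
        if words:
            entities.append((' '.join(words), ent_type))
            words, ent_type = [], None
        # B- starts a new entity; a stray I- is treated as a start too
        if prefix in ('B', 'I'):
            words, ent_type = [token], label
    # Flush an entity that runs up to the last token
    if words:
        entities.append((' '.join(words), ent_type))
    return entities


sample = [('Mark', 'B-per'), ('Zuckerberg', 'I-per'), ('visited', 'O'),
          ('Brazil', 'B-geo'), ('Latvia', 'B-geo'), ('today', 'B-tim')]
sample_entities = iob_to_entities(sample)
print(sample_entities)
```

With this grouping, two neighbouring `B-geo` tokens stay separate entities instead of being merged into one string.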
