# NER Tagging by CRF Suite

The dataset used here is the GMB (Groningen Meaning Bank) corpus for entity classification. I downloaded it from Kaggle for this demonstration.

## Step 1: Importing Data
The data is stored as JSON with the following layout:
* Each element of the parsed document (`doc`) is a sentence
* Each element of a sentence sublist is a token containing the primary word features, in the form *\[\[word, pos_tag\], ner_tag(label)\]*
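
Judging by the tokens printed later in this notebook, each token is nested as `[[word, pos_tag], ner_tag]`. A minimal sketch of that nesting, using a hand-written sample instead of the real file (`sample_doc` and `unpack_token` are hypothetical names used only for illustration):

```python
# Hand-made sample mirroring the token layout; it is not read from data.json.
sample_doc = [
    [[["London", "NNP"], "B-geo"], [["marched", "VBN"], "O"]],
]

def unpack_token(token):
    """Split one [[word, pos_tag], ner_tag] entry into a flat tuple."""
    (word, pos_tag), ner_tag = token
    return word, pos_tag, ner_tag

for sentence in sample_doc:
    for token in sentence:
        print(unpack_token(token))
```

Flattening a token this way makes the indexing used later (`token[0][0]` for the word, `token[0][1]` for the POS tag, `token[1]` for the label) easier to follow.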

In [54]:
import json
import pprint
# Only the first line of the file is parsed; it holds the full list of sentences.
with open("data.json", 'rb') as json_data:
    first_line = json_data.readline()
    doc = json.loads(first_line)

### Example of parsing elements from dataset

In [46]:
# How to parse words from sentences
for sentence_idx, sentence in enumerate(doc):
    print('sentence_idx : ', sentence_idx)
    for word_idx, word in enumerate(sentence):
        print('word : ', word)
    break  # stop after the first sentence to keep the output short


sentence_idx :  0
word :  [['Thousands', 'NNS'], 'O']
word :  [['of', 'IN'], 'O']
word :  [['demonstrators', 'NNS'], 'O']
word :  [['have', 'VBP'], 'O']
word :  [['marched', 'VBN'], 'O']
word :  [['through', 'IN'], 'O']
word :  [['London', 'NNP'], 'B-geo']
word :  [['to', 'TO'], 'O']
word :  [['protest', 'VB'], 'O']
word :  [['the', 'DT'], 'O']
word :  [['war', 'NN'], 'O']
word :  [['in', 'IN'], 'O']
word :  [['Iraq', 'NNP'], 'B-geo']
word :  [['and', 'CC'], 'O']
word :  [['demand', 'VB'], 'O']
word :  [['the', 'DT'], 'O']
word :  [['withdrawal', 'NN'], 'O']
word :  [['of', 'IN'], 'O']
word :  [['British', 'JJ'], 'B-gpe']
word :  [['troops', 'NNS'], 'O']
word :  [['from', 'IN'], 'O']
word :  [['that', 'DT'], 'O']
word :  [['country', 'NN'], 'O']
word :  [['.', '.'], 'O']


## Step 2: Splitting the dataset into train and test sets

In [49]:
total_sentences = len(doc)
split_ratio = 0.8
split_idx = int(total_sentences * split_ratio)  # first 80% for training, the rest for testing
train_doc, test_doc = doc[:split_idx], doc[split_idx:]
print('original doc length : ', len(doc), 'sentences')
print('train split length  : ', len(train_doc), 'sentences')
print('test split length   : ', len(test_doc), 'sentences')

original doc length :  62010 sentences
train split length  :  49608 sentences
test split length   :  12402 sentences
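
The split above takes the first 80% of the sentences in file order. If the corpus happens to be ordered by source or topic, a shuffled split may be preferable. The following is only a sketch of that alternative, not what this notebook does; `shuffled_split`, the seed value, and the toy input are assumptions made for illustration:

```python
import random

def shuffled_split(sentences, split_ratio=0.8, seed=42):
    """Return (train, test) after shuffling a copy with a fixed seed."""
    shuffled = list(sentences)             # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(len(shuffled) * split_ratio)
    return shuffled[:cut], shuffled[cut:]

toy_sentences = [f"sentence-{i}" for i in range(10)]
train_part, test_part = shuffled_split(toy_sentences)
print(len(train_part), len(test_part))
```

With the real data this would be called as `shuffled_split(doc)` in place of the slicing step.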


## Step 3: Feature Engineering for each word
For each word in a sentence we need to create a feature vector. I have used the features below:  
* **POS Tag** : *An obvious choice; it gives the part of speech of the word (noun, verb, etc.)*  
* **Preceding Word** : *The word before the current word*  
* **Succeeding Word** : *The word right after the current word*  

and so on. These are the kinds of features I will use for tagging.
As a reference, I have used the [sklearn-crfsuite NER tagging tutorial](http://eli5.readthedocs.io/en/latest/tutorials/sklearn_crfsuite.html)

In [65]:
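# Build the feature dict for one word from the word itself and its immediate neighbours.
def word2feature(sentence, word_idx):
    word = sentence[word_idx][0][0]    # token layout is [[word, pos_tag], ner_tag]
    postag = sentence[word_idx][0][1]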
    
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'postag': postag,
        'postag[:2]': postag[:2],
    }
    if word_idx > 0:
        word1 = sentence[word_idx-1][0][0]
        postag1 = sentence[word_idx-1][0][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            '-1:postag': postag1,
            '-1:postag[:2]': postag1[:2],
        })
    else:
        features['BOS'] = True

    if word_idx < len(sentence)-1:
        word1 = sentence[word_idx+1][0][0]
        postag1 = sentence[word_idx+1][0][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            '+1:postag': postag1,
            '+1:postag[:2]': postag1[:2],
        })
    else:
        features['EOS'] = True

    return features

def sentence2feature(sentence):
    return [word2feature(sentence,idx) for idx in range(len(sentence))]
def sentence2label(sentence):
    return [word[1] for word in sentence]


# Trying out word2feature on a single word (the first word of the first sentence)
print("Features for word '" + doc[0][0][0][0] + "'\n in sentence '" + ' '.join([word[0][0] for word in doc[0]]) + "'\n")
pprint.pprint(word2feature(doc[0], 0))


Features for word 'Thousands'
 in sentence 'Thousands of demonstrators have marched through London to protest the war in Iraq and demand the withdrawal of British troops from that country .'

{'+1:postag': 'IN',
 '+1:postag[:2]': 'IN',
 '+1:word.istitle()': False,
 '+1:word.isupper()': False,
 '+1:word.lower()': 'of',
 'BOS': True,
 'bias': 1.0,
 'postag': 'NNS',
 'postag[:2]': 'NN',
 'word.isdigit()': False,
 'word.istitle()': True,
 'word.lower()': 'thousands',
 'word[-3:]': 'nds'}


## Step 4: Getting features for train and test split

In [66]:
train_data = [sentence2feature(sentence) for sentence in train_doc]
train_label = [sentence2label(sentence) for sentence in train_doc]
test_data = [sentence2feature(sentence) for sentence in test_doc]
test_label = [sentence2label(sentence) for sentence in test_doc]
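
Each sentence must end up with exactly one label per feature dict, otherwise training will fail or silently misalign. A small sanity check can catch this early. This is a sketch only; `check_alignment` is a hypothetical helper and the toy inputs stand in for `train_data` and `train_label`:

```python
def check_alignment(features, labels):
    """Raise ValueError if feature dicts and labels do not line up per sentence."""
    if len(features) != len(labels):
        raise ValueError("different number of sentences")
    for idx, (sent_feats, sent_labels) in enumerate(zip(features, labels)):
        if len(sent_feats) != len(sent_labels):
            raise ValueError(
                f"sentence {idx}: {len(sent_feats)} feature dicts "
                f"vs {len(sent_labels)} labels"
            )
    return True

# Toy stand-ins for one two-word sentence.
toy_features = [[{"bias": 1.0}, {"bias": 1.0}]]
toy_labels = [["B-geo", "O"]]
print(check_alignment(toy_features, toy_labels))
```

On the real data the call would be `check_alignment(train_data, train_label)` and likewise for the test split.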

## Step 5: Training the CRF model

In [83]:
import sklearn_crfsuite
import eli5
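crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',               # gradient-based training with L-BFGS
    c1=0.1,                          # L1 regularization coefficient
    c2=0.1,                          # L2 regularization coefficient
    max_iterations=20,               # cap on optimizer iterations
    all_possible_transitions=False,  # only learn transitions seen in the training data
)
crf.fit(train_data, train_label);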

In [84]:
eli5.show_weights(crf, top=5)

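| From \ To | O | B-art | I-art | B-eve | I-eve | B-geo | I-geo | B-gpe | I-gpe | B-nat | I-nat | B-org | I-org | B-per | I-per | B-tim | I-tim |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| O | 2.79 | 0.096 | 0.0 | 0.124 | 0.0 | 1.097 | 0.0 | 0.79 | 0.0 | 0.058 | 0.0 | 1.085 | 0.0 | 1.138 | 0.0 | 1.716 | 0.0 |
| B-art | -0.048 | 0.0 | 0.517 | 0.0 | 0.0 | -0.129 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.013 | 0.0 | -0.093 | 0.0 | -0.005 | 0.0 |
| I-art | -0.069 | 0.0 | 0.282 | 0.0 | 0.0 | -0.159 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.016 | 0.0 | -0.01 | 0.0 |
| B-eve | -0.084 | 0.0 | 0.0 | 0.0 | 0.523 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.005 | 0.0 |
| I-eve | -0.063 | 0.0 | 0.0 | 0.0 | 0.26 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.006 | 0.0 |
| B-geo | 0.567 | -0.086 | 0.0 | -0.2 | 0.0 | 0.0 | 4.441 | 0.017 | 0.0 | 0.0 | 0.0 | -0.555 | 0.0 | -1.467 | 0.0 | 0.67 | 0.0 |
| I-geo | 0.215 | -0.006 | 0.0 | 0.0 | 0.0 | 0.0 | 1.928 | -0.117 | 0.0 | 0.0 | 0.0 | -0.192 | 0.0 | -0.293 | 0.0 | 0.135 | 0.0 |
| B-gpe | 0.685 | -0.064 | 0.0 | -0.056 | 0.0 | -0.129 | 0.0 | 0.0 | 0.385 | 0.0 | 0.0 | 0.159 | 0.0 | 0.23 | 0.0 | -0.446 | 0.0 |
| I-gpe | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | -0.009 | 0.0 | 0.0 | 0.024 | 0.0 | 0.0 | 0.0 | 0.0 | -0.001 | 0.0 | 0.0 | 0.0 |
| B-nat | -0.07 | 0.0 | 0.0 | 0.0 | 0.0 | -0.063 | 0.0 | 0.0 | 0.0 | 0.0 | 0.119 | 0.0 | 0.0 | -0.029 | 0.0 | -0.021 | 0.0 |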
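
In the transition weights above, every `O` to `I-*` entry is 0.0 while `B-geo` to `I-geo` is strongly positive (4.441). That fits the BIO scheme, where an `I-x` tag should only follow `B-x` or `I-x`. A minimal sketch of a checker for that rule, which could be run on predicted tag sequences (`invalid_bio_transitions` is a hypothetical helper, not part of sklearn-crfsuite or eli5):

```python
def invalid_bio_transitions(tags):
    """Return (position, previous_tag, tag) for every I-x not preceded by B-x or I-x."""
    problems = []
    previous = "O"  # treat the sentence start like an outside tag
    for position, tag in enumerate(tags):
        if tag.startswith("I-"):
            entity = tag[2:]
            if previous not in (f"B-{entity}", f"I-{entity}"):
                problems.append((position, previous, tag))
        previous = tag
    return problems

print(invalid_bio_transitions(["B-per", "I-per", "O", "B-tim"]))
print(invalid_bio_transitions(["O", "I-geo", "B-per", "I-org"]))
```

The first call finds nothing to report; the second flags the `I-geo` that follows `O` and the `I-org` that follows `B-per`.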

**y=O**

| Weight? | Feature |
|---|---|
| +3.332 | bias |
| +2.983 | postag:PRP |
| +2.907 | BOS |
| +2.823 | postag[:2]:VB |
| … 26275 more positive … | |
| … 7521 more negative … | |
| -2.919 | postag:NNP |

**y=B-art**

| Weight? | Feature |
|---|---|
| … 862 more positive … | |
| … 82 more negative … | |
| -0.181 | postag:IN |
| -0.181 | postag[:2]:IN |
| -0.276 | postag:NNS |
| -0.319 | BOS |
| -0.804 | -1:postag:NNP |

**y=I-art**

| Weight? | Feature |
|---|---|
| +0.214 | -1:postag:NNP |
| +0.183 | -1:postag[:2]:NN |
| +0.170 | -1:word.istitle() |
| … 793 more positive … | |
| … 105 more negative … | |
| -0.273 | postag:NNS |
| -0.356 | -1:postag[:2]:VB |

**y=B-eve**

| Weight? | Feature |
|---|---|
| +0.227 | word.lower():ii |
| +0.227 | word[-3:]:II |
| … 382 more positive … | |
| … 78 more negative … | |
| -0.228 | +1:postag[:2]:JJ |
| -0.288 | postag:NN |
| -0.391 | BOS |

**y=I-eve**

| Weight? | Feature |
|---|---|
| … 386 more positive … | |
| … 74 more negative … | |
| -0.272 | -1:postag:DT |
| -0.272 | -1:postag[:2]:DT |
| -0.359 | -1:postag[:2]:VB |
| -0.389 | postag:JJ |
| -0.420 | postag[:2]:JJ |

**y=B-geo**

| Weight? | Feature |
|---|---|
| +1.875 | word.lower():iran |
| +1.626 | -1:word.lower():in |
| … 6846 more positive … | |
| … 2100 more negative … | |
| -1.400 | word[-3:]:ber |
| -1.890 | postag:NNS |
| -4.713 | word[-3:]:day |

**y=I-geo**

| Weight? | Feature |
|---|---|
| +1.135 | -1:word.lower():south |
| +1.079 | word.lower():states |
| +0.963 | word.lower():korea |
| +0.957 | word[-3:]:rea |
| … 3028 more positive … | |
| … 828 more negative … | |
| -0.891 | word[-3:]:day |

**y=B-gpe**

| Weight? | Feature |
|---|---|
| +4.581 | word.istitle() |
| +2.239 | word[-3:]:ese |
| +1.983 | word.lower():iraqi |
| +1.976 | word[-3:]:aqi |
| … 2711 more positive … | |
| … 1078 more negative … | |
| -1.878 | postag:NNP |

**y=I-gpe**

| Weight? | Feature |
|---|---|
| +0.203 | -1:word.lower():bosnian |
| +0.168 | word.lower():serb |
| … 213 more positive … | |
| … 44 more negative … | |
| -0.224 | bias |
| -0.435 | -1:postag[:2]:IN |
| -0.435 | -1:postag:IN |

**y=B-nat**

| Weight? | Feature |
|---|---|
| … 204 more positive … | |
| … 55 more negative … | |
| -0.213 | -1:postag[:2]:NN |
| -0.216 | bias |
| -0.305 | BOS |
| -0.339 | postag:JJ |
| -0.367 | postag[:2]:JJ |

**y=I-nat**

| Weight? | Feature |
|---|---|
| … 78 more positive … | |
| … 43 more negative … | |
| -0.182 | +1:postag[:2]:IN |
| -0.340 | postag[:2]:NN |
| -0.344 | word.istitle() |
| -0.384 | +1:postag[:2]:NN |
| -0.835 | bias |

**y=B-org**

| Weight? | Feature |
|---|---|
| +1.896 | word[-3:]:ban |
| +1.338 | postag:NNP |
| +1.290 | word.lower():taleban |
| +1.286 | word[-3:]:ATO |
| … 7340 more positive … | |
| … 1958 more negative … | |
| -1.519 | word[-3:]:day |

**y=I-org**

| Weight? | Feature |
|---|---|
| +1.302 | word[-3:]:ons |
| +1.251 | word.lower():nations |
| +1.078 | word[-3:]:ion |
| … 8273 more positive … | |
| … 1944 more negative … | |
| -1.076 | -1:word.lower():minister |
| -1.272 | word[-3:]:day |

**y=B-per**

| Weight? | Feature |
|---|---|
| +3.738 | word[-3:]:Mr. |
| +3.738 | word.lower():mr. |
| +2.229 | word.lower():president |
| +1.721 | word.lower():prime |
| +1.702 | word[-3:]:ime |
| … 8831 more positive … | |
| … 1387 more negative … | |

**y=I-per**

| Weight? | Feature |
|---|---|
| +2.320 | -1:word.lower():president |
| +1.028 | -1:postag:NNP |
| … 9083 more positive … | |
| … 1353 more negative … | |
| -1.069 | word[-3:]:ion |
| -1.566 | +1:postag:NN |
| -1.706 | word[-3:]:day |

**y=B-tim**

| Weight? | Feature |
|---|---|
| +4.985 | word[-3:]:day |
| +2.643 | -1:word.lower():in |
| +2.410 | word[-3:]:ber |
| +1.555 | +1:word.lower():years |
| +1.486 | -1:word.lower():on |
| … 4316 more positive … | |
| … 1215 more negative … | |

**y=I-tim**

| Weight? | Feature |
|---|---|
| +1.036 | word[-3:]:day |
| +0.939 | -1:word.lower():since |
| +0.926 | postag:NN |
| +0.736 | word.isdigit() |
| … 2455 more positive … | |
| … 435 more negative … | |
| -0.739 | +1:postag[:2]:VB |

In [104]:
# Prediction for a single sentence
dummy_sentence = test_doc[0]
words = [word[0][0] for word in dummy_sentence]
predicted_labels = crf.predict_single(sentence2feature(dummy_sentence))
gold_labels = sentence2label(dummy_sentence)
# Each printed row is (word, predicted label, gold label)
for (word, predicted, gold) in zip(words, predicted_labels, gold_labels):
    print('(', word, ' , ', predicted, ' , ', gold, ')')

( Turkey  ,  B-geo  ,  B-org )
( 's  ,  O  ,  O )
( prime  ,  O  ,  O )
( minister  ,  O  ,  O )
( ,  ,  O  ,  O )
( Recep  ,  B-per  ,  B-org )
( Tayyip  ,  I-per  ,  I-org )
( Erdogan  ,  I-per  ,  I-org )
( ,  ,  O  ,  O )
( Tuesday  ,  B-tim  ,  B-tim )
( appealed  ,  O  ,  O )
( for  ,  O  ,  O )
( calm  ,  O  ,  O )
( and  ,  O  ,  O )
( national  ,  O  ,  O )
( unity  ,  O  ,  O )
( .  ,  O  ,  O )
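
In the rows above, the second column is the output of `crf.predict_single` and the third column is the label stored in the dataset. A quick way to summarise such a comparison is token-level accuracy plus a list of the disagreements. This is a minimal sketch over the rows copied from the output above; `token_accuracy` and `mismatches` are hypothetical helpers:

```python
# (word, predicted, gold) rows copied from the printed output above.
rows = [
    ("Turkey", "B-geo", "B-org"), ("'s", "O", "O"), ("prime", "O", "O"),
    ("minister", "O", "O"), (",", "O", "O"), ("Recep", "B-per", "B-org"),
    ("Tayyip", "I-per", "I-org"), ("Erdogan", "I-per", "I-org"), (",", "O", "O"),
    ("Tuesday", "B-tim", "B-tim"), ("appealed", "O", "O"), ("for", "O", "O"),
    ("calm", "O", "O"), ("and", "O", "O"), ("national", "O", "O"),
    ("unity", "O", "O"), (".", "O", "O"),
]

def token_accuracy(rows):
    """Fraction of tokens whose predicted label equals the gold label."""
    correct = sum(1 for _, predicted, gold in rows if predicted == gold)
    return correct / len(rows)

def mismatches(rows):
    """Rows where the prediction and the gold label disagree."""
    return [(word, predicted, gold) for word, predicted, gold in rows if predicted != gold]

print(f"token accuracy: {token_accuracy(rows):.3f}")
for row in mismatches(rows):
    print(row)
```

Token accuracy is dominated by the frequent `O` tag, so per-label precision and recall over the whole test split would give a fairer picture than a single sentence.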
