In the provided training data, every token carries one of the following tags: 'O', 'B-location', 'I-location', 'B-group', 'B-corporation', 'B-person', 'B-creative-work', 'B-product', 'I-person', 'I-creative-work', 'I-corporation', 'I-group', 'I-product'. The tag O marks tokens that are not nouns or that do not fall into any of the defined categories. The B and I prefixes mark the beginning and the continuation of an entity that spans several words, such as a name. For a name like John Smith, John would be tagged 'B-person' and Smith would be tagged 'I-person'.
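To make the B/I scheme concrete, here is a minimal sketch of how the words of one multi-word entity would be tagged. The helper name bio_tags is made up for illustration and is not part of the notebook or of any library.

```python
def bio_tags(words, category):
    """Tag the words of one entity: B- for the first word, I- for the rest."""
    return [
        (word, ("B-" if i == 0 else "I-") + category)
        for i, word in enumerate(words)
    ]


# The example from the text: John Smith as a person
print(bio_tags(["John", "Smith"], "person"))
# [('John', 'B-person'), ('Smith', 'I-person')]
```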

This problem could be solved with several libraries, such as spaCy, Stanford NLP NER, or NLTK (the Natural Language Toolkit), which is the one I used. I chose NLTK because it offers the simplest access to the processed data. Since the categories in the train.txt file are very specific, I believed that NLTK would be the best fit for this problem.


In [1]:
import nltk

I first find the tags that I mentioned above by running a simple for loop through the text data, making sure to skip the "." elements which signify the end of a sentence. I place these tags in a list that is aptly named.

In [2]:
#getting the unique tags from train.txt
tags_from_train = []

with open('train.txt', 'r', encoding="utf8") as fhand:
    for line in fhand:
        pair = line.split()
        #skip blank lines and lines without a tag
        if len(pair) < 2:
            continue
        word = pair[0]
        tag = pair[1]
        #"." marks the end of a sentence
        if word == '.':
            continue
        #unique tags (the O tag is kept, so it appears in the list as well)
        if tag not in tags_from_train:
            tags_from_train.append(tag)
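As a quick way to check this logic without the real file, the same idea can be run on a few in-memory lines. This is only a sketch: the sample lines are invented, and the word-then-tag layout separated by whitespace is assumed from the way the cell above splits each line.

```python
def unique_tags(lines):
    """Return tags in order of first appearance, skipping blank lines and '.' rows."""
    seen = {}
    for line in lines:
        pair = line.split()
        if len(pair) < 2 or pair[0] == ".":
            continue
        seen.setdefault(pair[1], None)  # dicts keep insertion order
    return list(seen)


# Invented sample in the assumed "word<TAB>tag" layout
sample = [
    "John\tB-person\n",
    "Smith\tI-person\n",
    "lives\tO\n",
    "\n",
    ".\tO\n",
    "Paris\tB-location\n",
]
print(unique_tags(sample))
# ['B-person', 'I-person', 'O', 'B-location']
```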

I collect the test data in a string to pass to the nltk functions. In the same loop, the sentence list gathers the words of the current sentence and the test list gathers the finished sentences, so test is a list of sentences that are themselves lists of words.

In [3]:
#placing test data in a string
sentence = []
test = []
data = ""
with open('test.txt', 'r', encoding="utf8") as fhand:
    for line in fhand:
        pair = line.split()
        if len(pair) == 0 :
            continue
        word = pair[0]
        data = data + word + " \n"
        sentence.append(word)
        #"." ends the current sentence
        if word == '.' :
            test.append(sentence)
            sentence = []

Here I use the nltk library to tokenize the data, tag each token with its part of speech, and then chunk the tagged tokens into named entities. Tokenizing turns the string into a flat list of tokens.


In [4]:
#data gets converted to the words
words = nltk.word_tokenize(data)
#gives pos tags to the words
pos_tags = nltk.pos_tag(words)
#groups the tagged words into named-entity chunks; binary=False keeps the entity type (PERSON, GPE, ...)
chunks = nltk.ne_chunk(pos_tags, binary=False)
print(words)
print(pos_tags)
print(chunks)

['&', 'gt', ';', '*', 'The', 'soldier', 'was', 'killed', 'when', 'another', 'avalanche', 'hit', 'an', 'army', 'barracks', 'in', 'the', 'northern', 'area', 'of', 'Sonmarg', ',', 'said', 'a', 'military', 'spokesman', '.', '&', 'gt', ';', '*', 'Police', 'last', 'week', 'evacuated', '80', 'villagers', 'from', 'Waltengoo', 'Nar', 'where', 'dozens', 'were', 'killed', 'after', 'a', 'series', 'of', 'avalanches', 'hit', 'the', 'area', 'in', '2005', 'in', 'the', 'south', 'of', 'the', 'territory', '.', '&', 'gt', ';', '*', 'The', 'army', 'on', 'Thursday', 'recovered', 'the', 'bodies', 'of', 'ten', 'of', 'its', 'men', 'who', 'were', 'killed', 'in', 'an', 'avalanche', 'the', 'previous', 'day', '.', '&', 'gt', ';', '*', 'The', 'four', 'civilians', 'killed', 'included', 'two', 'children', 'of', 'a', 'family', 'whose', 'house', 'was', 'hit', 'by', 'a', 'separate', 'avalanche', ',', 'also', 'on', 'Wednesday', ',', 'a', 'police', 'spokesman', 'said', '.', 'The', 'bodies', 'of', 'the', 'soldiers', 'were'

I build two parallel lists, entities and labels, which store each entity and its tag at the same index.

In [5]:
entities =[]
labels =[]
for chunk in chunks:
    # a chunk with a label is a named entity; its words are joined into one string
    if hasattr(chunk,'label'):
        entities.append(' '.join(c[0] for c in chunk))
        labels.append(chunk.label())
    # every other element is a (word, POS tag) pair
    else:
        entities.append(chunk[0])
        labels.append(chunk[1])

The entities in the test data are printed in a list.

In [6]:
print(entities)

['&', 'gt', ';', '*', 'The', 'soldier', 'was', 'killed', 'when', 'another', 'avalanche', 'hit', 'an', 'army', 'barracks', 'in', 'the', 'northern', 'area', 'of', 'Sonmarg', ',', 'said', 'a', 'military', 'spokesman', '.', '&', 'gt', ';', '*', 'Police', 'last', 'week', 'evacuated', '80', 'villagers', 'from', 'Waltengoo Nar', 'where', 'dozens', 'were', 'killed', 'after', 'a', 'series', 'of', 'avalanches', 'hit', 'the', 'area', 'in', '2005', 'in', 'the', 'south', 'of', 'the', 'territory', '.', '&', 'gt', ';', '*', 'The', 'army', 'on', 'Thursday', 'recovered', 'the', 'bodies', 'of', 'ten', 'of', 'its', 'men', 'who', 'were', 'killed', 'in', 'an', 'avalanche', 'the', 'previous', 'day', '.', '&', 'gt', ';', '*', 'The', 'four', 'civilians', 'killed', 'included', 'two', 'children', 'of', 'a', 'family', 'whose', 'house', 'was', 'hit', 'by', 'a', 'separate', 'avalanche', ',', 'also', 'on', 'Wednesday', ',', 'a', 'police', 'spokesman', 'said', '.', 'The', 'bodies', 'of', 'the', 'soldiers', 'were', '

Every so often, an ORGANIZATION, PERSON, or GPE (geopolitical entity) label shows up in the labels list among the part-of-speech tags.

In [7]:
print(labels)

['CC', 'NN', ':', 'CC', 'DT', 'NN', 'VBD', 'VBN', 'WRB', 'DT', 'NN', 'VBD', 'DT', 'NN', 'NNS', 'IN', 'DT', 'JJ', 'NN', 'IN', 'GPE', ',', 'VBD', 'DT', 'JJ', 'NN', '.', 'CC', 'NN', ':', 'NNP', 'NNP', 'JJ', 'NN', 'VBD', 'CD', 'NNS', 'IN', 'PERSON', 'WRB', 'NNS', 'VBD', 'VBN', 'IN', 'DT', 'NN', 'IN', 'NNS', 'VBP', 'DT', 'NN', 'IN', 'CD', 'IN', 'DT', 'NN', 'IN', 'DT', 'NN', '.', 'CC', 'NN', ':', 'CC', 'DT', 'NN', 'IN', 'NNP', 'VBD', 'DT', 'NNS', 'IN', 'NN', 'IN', 'PRP$', 'NNS', 'WP', 'VBD', 'VBN', 'IN', 'DT', 'NN', 'DT', 'JJ', 'NN', '.', 'CC', 'NN', ':', 'CC', 'DT', 'CD', 'NNS', 'VBN', 'VBD', 'CD', 'NNS', 'IN', 'DT', 'NN', 'WP$', 'NN', 'VBD', 'VBN', 'IN', 'DT', 'JJ', 'NN', ',', 'RB', 'IN', 'NNP', ',', 'DT', 'NN', 'NN', 'VBD', '.', 'DT', 'NNS', 'IN', 'DT', 'NNS', 'VBD', 'VBN', 'IN', 'DT', 'JJ', 'NNS', 'IN', 'DT', 'ORGANIZATION', '(', 'ORGANIZATION', ')', ',', 'WDT', 'VBZ', 'VBN', 'TO', 'VB', 'IN', 'JJ', 'NN', 'CC', 'NN', 'NNS', '.', 'CC', 'NN', ':', 'CC', 'NNS', 'VBP', 'IN', 'NN', 'TO', 'VB'
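To put a number on how rarely these labels appear, one could count them with collections.Counter. The sketch below uses only the start of the printed list above as sample input. The set of entity labels is limited to the three named in the text, which is an assumption about what is of interest rather than a full list of what nltk can return.

```python
from collections import Counter

# Start of the labels list printed above
sample_labels = [
    'CC', 'NN', ':', 'CC', 'DT', 'NN', 'VBD', 'VBN', 'WRB', 'DT', 'NN', 'VBD',
    'DT', 'NN', 'NNS', 'IN', 'DT', 'JJ', 'NN', 'IN', 'GPE', ',', 'VBD', 'DT',
    'JJ', 'NN', '.', 'CC', 'NN', ':', 'NNP', 'NNP', 'JJ', 'NN', 'VBD', 'CD',
    'NNS', 'IN', 'PERSON',
]

# Assumption: only the three entity labels mentioned in the text are counted
ENTITY_LABELS = {"ORGANIZATION", "PERSON", "GPE"}

entity_counts = Counter(label for label in sample_labels if label in ENTITY_LABELS)
print(entity_counts["GPE"], entity_counts["PERSON"], entity_counts["ORGANIZATION"])
# 1 1 0
```

On the full labels list, the same two lines would show how many entities of each kind nltk found.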

Since the nltk library names these entities differently from the train.txt file, I have to change the labels. After making the required changes, I write the output to a file that replicates the formatting of the training data.

In [8]:
#nltk label -> category name used in train.txt
#organization becomes corporation, person remains person, gpe becomes location
label_map = {"ORGANIZATION": "corporation", "PERSON": "person", "GPE": "location"}

#"w" instead of "a" so that re-running the cell does not append duplicate rows
with open("test_results.txt", 'w', encoding="utf8") as f:
    for i in range(len(entities)):
        label = labels[i]
        if label in label_map:
            #an entity right after one with the same label continues it (I-), otherwise it begins one (B-)
            if i > 0 and labels[i-1] == label:
                tag = "I-" + label_map[label]
            else:
                tag = "B-" + label_map[label]
        #default is the letter O, as in train.txt
        else:
            tag = "O"
        f.write(entities[i] + "\t" + tag + "\n")
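One limitation is that the words of a multi-word chunk were joined into a single string earlier, so a name such as "Waltengoo Nar" lands on one row instead of one row per word. A possible alternative, sketched below, walks the chunks directly and emits one row per token with B- on the first word and I- on the rest. Everything here is illustrative: FakeChunk is a hypothetical stand-in for the labelled subtrees that nltk.ne_chunk returns, so the sketch runs without nltk, and chunk_rows and LABEL_MAP are names made up for this example.

```python
LABEL_MAP = {"ORGANIZATION": "corporation", "PERSON": "person", "GPE": "location"}


class FakeChunk(list):
    """Hypothetical stand-in for a labelled subtree: a list of (word, pos) pairs."""

    def __init__(self, label, leaves):
        super().__init__(leaves)
        self._label = label

    def label(self):
        return self._label


def chunk_rows(chunks):
    """Yield one (word, tag) row per token in the train.txt tag scheme."""
    for chunk in chunks:
        if hasattr(chunk, "label"):
            # Labels without a counterpart in train.txt fall back to O
            category = LABEL_MAP.get(chunk.label())
            for i, (word, _pos) in enumerate(chunk):
                if category is None:
                    yield word, "O"
                else:
                    yield word, ("B-" if i == 0 else "I-") + category
        else:
            # Plain (word, pos) pair outside any entity
            word, _pos = chunk
            yield word, "O"


chunks = [
    ("from", "IN"),
    FakeChunk("PERSON", [("Waltengoo", "NNP"), ("Nar", "NNP")]),
    ("where", "WRB"),
]
rows = list(chunk_rows(chunks))
for word, tag in rows:
    print(word + "\t" + tag)
```

With the real output of nltk.ne_chunk, the same function should work unchanged, since it relies only on the label() method and on iterating over (word, pos) pairs, which is what the notebook's own loop relies on.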