<h1>Named Entities Extraction From Online News</h1><br><b><i>Python Notebook 1 of 3</i></b><br>The following notebook contains all the steps needed to pre-process the data and produce the Gold Corpus.
<br>The steps are, in order:<br>1. Load raw data<br>2. Split content into individual documents<br>3. Extract news content<br>4. Feed each document into SpaCy (open source language processing library)<br>5. Save each tagged document into a separate file

<h3>Load Raw Data, Split into Documents and Extract Content</h3>

In [2]:
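#Preprocess files.
import os, re, glob
#Build the search pattern with os.path.join so it is not tied to Windows path separators
path = os.path.join('.', 'source', '*.txt')
files = glob.glob(path)
document = {}
n = 0
#Go through every .txt file in the source file directory
for file in files:
    #Read the whole file; the with-block closes it afterwards
    with open(file, 'r') as f:
        raw = f.read()
    #Documents are separated by a long run of underscores within one file - the next line splits them into individual documents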
    docs = raw.split("__________________________________________________________")
    for doc in docs:
        text = ''
        line_start = False
        #Keep only the lines from the 'Full text: ' marker onwards and do any required cleaning here
        for line in doc.split('\n'):
            if re.match(r'Full text: ', line):
                line_start = True
                line = re.sub(r'Full text: ', '', line)
                line = re.sub(r'Credit: Staff ', '', line)
            if line_start:
                text += line + "\n"
        document[n] = text
        n += 1
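
To see what the loop above does to a single file, here is a minimal sketch that runs the same logic on an in-memory string. The helper name `split_documents`, the sample text, and the shortened `SEPARATOR` are assumptions made for illustration; only the `Full text: ` and `Credit: Staff ` markers come from the cell above.

```python
import re

# Stand-in separator (assumption): the real files use the longer underscore
# run shown in the cell above.
SEPARATOR = "_" * 20

def split_documents(raw):
    """Return the text after 'Full text: ' for each separator-delimited chunk."""
    bodies = []
    for chunk in raw.split(SEPARATOR):
        text = ""
        line_start = False
        for line in chunk.split("\n"):
            if re.match(r"Full text: ", line):
                line_start = True
                line = re.sub(r"Full text: ", "", line)
                line = re.sub(r"Credit: Staff ", "", line)
            if line_start:
                text += line + "\n"
        bodies.append(text)
    return bodies

# Hypothetical sample, not taken from the real source files.
sample = (
    "Title: Example one\n"
    "Full text: First article, line one.\n"
    "Second line.\n"
    + SEPARATOR +
    "\nTitle: Example two\n"
    "Full text: Credit: Staff Second article.\n"
)

for body in split_documents(sample):
    print(repr(body))
```

Everything before the `Full text: ` line (titles and other metadata) is dropped, and a chunk without that marker yields an empty string.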
       

Check corpus size - number of documents produced.

In [3]:
#Print Corpus Size
print("Corpus Size:",len(document),"articles\n")

#Preview sample document
print("Sample Document:")
print(document[114][0:500])

Corpus Size: 2116 articles

Sample Document:
Workers at auto-body shops deliberately damaged cars; installed used parts but billed for new ones; or invoiced for phantom repairs, according to an investigation by a Canadian insurer that is calling on government to help in curbing the problem.
Aviva Canada found about half the total expenses submitted for repairs to crashed vehicles during its investigation in Ontario were bogus - an amount the company estimates adds up to hundreds of millions of dollars a year.
"Nobody has ever really sample


<h3>Prepare Tagged Dataset Using SpaCy Open Source Application for Natural Language Processing.</h3><br>You will need to open your Anaconda prompt and execute the following commands:<br>
- pip install spacy<br>
- python -m spacy download en_core_web_sm<br>
- Restart Jupyter Notebook using the command: jupyter notebook

The following code generates our tagged corpus and saves it to external files in the form <b><i>word pos iob-ner</i></b>. This makes it easier to manually tag and validate the output of our NER corpus. For example, Super Bowl and Sleeping Beauty were not labeled correctly.<br>Since this will be used as the Gold Standard Corpus, it is important to ensure that the data is free of errors.<br>Also, the SpaCy NER labels are not exactly the same as the NLTK ones; for example, the SpaCy label for organization is ORG, while in NLTK it is ORGANIZATION. To adhere to the NLTK format, the labels had to be converted to the NLTK scheme in this step.

In [None]:
import os, re
import nltk
import spacy

#In the following steps we used the pre-trained SpaCy model 'en_core_web_sm' to tag the dataset.
nlp = spacy.load('en_core_web_sm')

#The following mapping ensures our tagged dataset contains the same labels used in most NLP entity systems.
entLabelDict = {'ORG':'ORGANIZATION',
                'PERSON':'PERSON',
                'LOC':'LOCATION',
                'DATE':'DATE',
                'TIME':'TIME',
                'MONEY':'MONEY',
                'PERCENT':'PERCENT',
                'FACILITY':'FACILITY',
                'GPE':'GPE'}

#Create NER_Tagger function to reuse for pre-processing all documents into the tagged corpus.
def NER_Tagger(doc):
    sents = nltk.sent_tokenize(doc)
    sentences = ''
    #Process each sentence
    for sent in sents:
        tagged_sent = nlp(sent.strip())
        tokens = list(tagged_sent)
        #for each sentence with more than 3 tokens return word,pos,iob-NER
        if len(tokens) > 3:
            for t in tokens:
                #skip empty and whitespace-only tokens (SpaCy tags the latter '_SP')
                if t.is_space or t.tag_ == '_SP' or t.text == '':
                    continue
                if t.ent_type_ in entLabelDict:
                    word = t.text + ' ' + t.tag_ + ' ' + t.ent_iob_ + '-' + entLabelDict[t.ent_type_]
                else:
                    #tokens outside an entity, or with an entity type missing from entLabelDict, are written as O
                    word = t.text + ' ' + t.tag_ + ' O'
                #append tagged word to sentence
                sentences += word + '\n'
            #add new line for sentence boundary (***required for CoNLL Corpus Reader)
            sentences += '\n'
    #return all sentences associated with document
    return sentences

#loop through each document - tag using NER_Tagger and store into external corpus file

#make sure the output directory exists
os.makedirs("dataset5", exist_ok=True)

#open the file that collects the entries needing manual correction
with open("dataset5/correction_needed.txt", "w", encoding="utf-8") as correction:
    #step into each document; n is the document key (starting at 0) and also names the output file
    for n in document:
        #process document once using NER_Tagger function above
        text = NER_Tagger(document[n])
        #create file for saving current dataset file and write into it using utf-8 encoding
        with open("dataset5/"+str(n)+".txt", "wb") as f:
            f.write(text.encode('utf-8'))
        #the following lines were used to track additional entities that could not get labeled
        #those entities and their referenced file are stored in a separate file, correction_needed.txt, for easy retrieval
        correction.write("\n--- Document: "+str(n)+" ------------------------------------\n\n")
        #join all entities not correctly labeled into a variable
        text_need_fixing = '\n'.join(re.findall(r'.*NNP I\n', text))
        #append corrections into file
        correction.write(text_need_fixing)
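
To make the <b><i>word pos iob-ner</i></b> layout concrete without loading a SpaCy model, the sketch below formats a few hand-written tuples the same way the tagger does. The function `format_token`, the trimmed `LABEL_MAP`, and the sample tuples (including the `EVENT` type) are hypothetical and stand in for real SpaCy tokens.

```python
# Trimmed copy of the idea behind entLabelDict (assumption: three labels
# are enough for the illustration).
LABEL_MAP = {"ORG": "ORGANIZATION", "PERSON": "PERSON", "GPE": "GPE"}

def format_token(text, pos, iob, ent_type, label_map=LABEL_MAP):
    """Build one 'word pos iob-ner' line; unmapped entity types become 'O'."""
    if ent_type in label_map:
        return f"{text} {pos} {iob}-{label_map[ent_type]}"
    return f"{text} {pos} O"

# Hypothetical (text, pos, iob, ent_type) tuples standing in for SpaCy tokens.
tokens = [
    ("Aviva", "NNP", "B", "ORG"),
    ("Canada", "NNP", "I", "ORG"),
    ("found", "VBD", "O", ""),
    ("Super", "NNP", "B", "EVENT"),
]

lines = [format_token(*tok) for tok in tokens]
print("\n".join(lines))
```

The last tuple shows why some entities end up needing manual correction: any entity type that is not a key of the mapping is written out as `O`.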

<h3>Load pre-tagged corpus using ConllCorpusReader format</h3>Here we use NLTK's ConllCorpusReader to load the tagged files and ensure the data is in the right order.

In [4]:
from nltk.corpus import ConllCorpusReader
#raw strings keep the backslashes in the path and in the file pattern literal
my_corpus = ConllCorpusReader(r'.\dataset4', r'.*\.txt', columntypes=('words', 'pos','chunk'), encoding='utf8')
#my_corpus.iob_words('1.txt')

#Extract Named Entities only - tokens whose label is not O
named_entities = [(e[0],e[2]) for e in my_corpus.iob_words() if e[2] != 'O']
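
Because the files are edited by hand before they are loaded, a quick format check can catch lines that no longer have three columns. The sketch below is one possible check, assuming every non-blank line should hold exactly three whitespace-separated fields; `check_conll_lines` and the sample lines are hypothetical and not part of NLTK.

```python
def check_conll_lines(lines):
    """Return (line_number, line) pairs that are not blank and not 3 columns."""
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if fields and len(fields) != 3:
            bad.append((number, line.rstrip("\n")))
    return bad

# Hypothetical file content: the fourth line lost its NER column.
sample_lines = [
    "The DT B-ORGANIZATION\n",
    "Globe NNP I-ORGANIZATION\n",
    "\n",
    "late JJ\n",
]

print(check_conll_lines(sample_lines))
```

In practice the lines could come from `open(path, encoding='utf8')` for each file in the dataset directory.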

In [63]:
print(named_entities[0:50])

[('The', 'B-ORGANIZATION'), ('Globe', 'I-ORGANIZATION'), ('and', 'I-ORGANIZATION'), ('Mail', 'I-ORGANIZATION'), ('National', 'B-ORGANIZATION'), ('Ballet', 'I-ORGANIZATION'), ('of', 'I-ORGANIZATION'), ('Canada', 'I-ORGANIZATION'), ('Canada', 'B-GPE'), ('Sophie', 'B-LOCATION'), ('Letendre', 'I-LOCATION'), ('the', 'B-ORGANIZATION'), ('Walter', 'I-ORGANIZATION'), ('Carsen', 'I-ORGANIZATION'), ('Centre', 'I-ORGANIZATION'), ('Toronto', 'B-GPE'), ('Feb.', 'B-DATE'), ('28', 'I-DATE'), ('to', 'I-DATE'), ('March', 'I-DATE'), ('4', 'I-DATE'), ('Letendre', 'B-ORGANIZATION'), ('The', 'B-ORGANIZATION'), ('Globe', 'I-ORGANIZATION'), ('and', 'I-ORGANIZATION'), ('Mail', 'I-ORGANIZATION'), ('late', 'B-DATE'), ('last', 'I-DATE'), ('month', 'I-DATE'), ('Today', 'B-DATE'), ('Letendre', 'B-ORGANIZATION'), ('Letendre', 'B-ORGANIZATION'), ('2006', 'B-DATE'), ('a', 'B-DATE'), ('year', 'I-DATE'), ('later', 'I-DATE'), ('Letendre', 'B-ORGANIZATION'), ('the', 'B-ORGANIZATION'), ('Canadian', 'I-ORGANIZATION'), ('Ac
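
The list above is token-level. A minimal sketch of merging consecutive B-/I- tokens into complete entity names follows; it assumes the input contains only B- and I- tags (the O tokens were already filtered out), and `group_entities` is a hypothetical helper. The sample pairs are taken from the output above.

```python
def group_entities(pairs):
    """Merge consecutive B-/I- tokens into (entity text, label) tuples."""
    entities = []
    current_words, current_label = [], None
    for word, tag in pairs:
        prefix, _, label = tag.partition("-")
        # An I- tag with the same label continues the current entity.
        if prefix == "I" and label == current_label and current_words:
            current_words.append(word)
            continue
        # Anything else starts a new entity, so flush the previous one first.
        if current_words:
            entities.append((" ".join(current_words), current_label))
        current_words, current_label = [word], label
    if current_words:
        entities.append((" ".join(current_words), current_label))
    return entities

sample_pairs = [
    ("The", "B-ORGANIZATION"), ("Globe", "I-ORGANIZATION"),
    ("and", "I-ORGANIZATION"), ("Mail", "I-ORGANIZATION"),
    ("Canada", "B-GPE"), ("Feb.", "B-DATE"), ("28", "I-DATE"),
]

print(group_entities(sample_pairs))
```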

<h3>Get some statistics on the named_entities</h3>

In [77]:
#Get all the unique labels from the named entities
labels = set([element[1] for element in named_entities])
print("Entities Count of Labels and Unique Words")
for lbl in sorted(labels):
    ent_list = [e[1] for e in named_entities if e[1] == lbl]
    words = [e[0] for e in named_entities if e[1] == lbl]
    print(str(lbl) + ',' + str(len(ent_list)) +',' + str(len(set(words))))


Entities Count of Labels and Unique Words
B-DATE,22093,882
B-GPE,21036,2069
B-LOCATION,1922,332
B-MONEY,5893,1940
B-ORGANIZATION,26639,5501
B-PERCENT,40,34
B-PERSON,30522,5694
B-TIME,1887,280
I-DATE,24643,827
I-GPE,3824,423
I-LOCATION,1712,392
I-MONEY,6952,551
I-ORGANIZATION,31036,4599
I-PERCENT,44,8
I-PERSON,16445,6104
I-TIME,2633,248
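
The same counts can be gathered in a single pass over the list instead of re-scanning it once per label. The sketch below is one way to do that with `collections`; the helper name `label_stats` and the small sample list are assumptions for illustration.

```python
from collections import Counter, defaultdict

def label_stats(pairs):
    """Map each label to (token count, unique word count)."""
    counts = Counter(label for _, label in pairs)
    vocab = defaultdict(set)
    for word, label in pairs:
        vocab[label].add(word)
    return {label: (counts[label], len(vocab[label])) for label in sorted(counts)}

# Hypothetical sample in the same (word, label) shape as named_entities.
sample_entities = [
    ("The", "B-ORGANIZATION"), ("Globe", "I-ORGANIZATION"),
    ("Canada", "B-GPE"), ("Canada", "B-GPE"), ("Toronto", "B-GPE"),
]

for label, (total, unique) in label_stats(sample_entities).items():
    print(f"{label},{total},{unique}")
```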


<h3>Print Entities in tree format</h3>

In [35]:
import nltk
from nltk import ne_chunk, pos_tag, word_tokenize

def extract_entities(text):
    entities = []
    for sentence in nltk.sent_tokenize(text):
        #sentence = re.sub(r'.\-.','_',sentence)
        chunks = ne_chunk(pos_tag(word_tokenize(sentence)))
        #named-entity chunks are subtrees with a label; plain tokens are (word, tag) tuples
        entities.extend([chunk for chunk in chunks if hasattr(chunk, 'label')])
    return entities

#Print entities (tree form - complete entity name)
ent_tree = extract_entities(document[183])
print(ent_tree)



[Tree('PERSON', [('Wendy', 'NNP')]), Tree('PERSON', [('Massie', 'NNP')]), Tree('GPE', [('Pointe', 'NNP')]), Tree('PERSON', [('Baril', 'NNP')]), Tree('GPE', [('Georgian', 'NNP')]), Tree('PERSON', [('Alex', 'NNP')]), Tree('PERSON', [('Wendy', 'NNP')]), Tree('GPE', [('Blood', 'NN')]), Tree('PERSON', [('Wendy', 'NNP')]), Tree('PERSON', [('Wendy', 'NNP')]), Tree('PERSON', [('Massie', 'NNP')]), Tree('PERSON', [('Alex', 'NNP')]), Tree('ORGANIZATION', [('Jeongson', 'NNP'), ('Alpine', 'NNP'), ('Centre', 'NNP')]), Tree('ORGANIZATION', [('Pyeongchang', 'NNP'), ('Paralympics', 'NNPS')]), Tree('GPE', [('New', 'NNP'), ('Zealand', 'NNP')]), Tree('PERSON', [('Carl', 'NNP'), ('Murphy', 'NNP')]), Tree('PERSON', [('Alex', 'NNP')]), Tree('PERSON', [('Alex', 'NNP')]), Tree('GPE', [('Beijing', 'NNP')]), Tree('GPE', [('Finland', 'NNP')]), Tree('PERSON', [('Matti', 'NNP')]), Tree('GPE', [('Massie', 'NNP')]), Tree('GPE', [('American', 'NNP')]), Tree('PERSON', [('Brenna', 'NNP'), ('Huckaby', 'NNP')]), Tree('GPE
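
Each `Tree` in the output above can be flattened into a label and the full entity name. The sketch below assumes NLTK is installed and uses two trees copied from the output; `tree_to_entity` is a hypothetical helper.

```python
from nltk import Tree

def tree_to_entity(tree):
    """Return (label, entity text) for one ne_chunk subtree."""
    return tree.label(), " ".join(word for word, _tag in tree.leaves())

sample_trees = [
    Tree('PERSON', [('Carl', 'NNP'), ('Murphy', 'NNP')]),
    Tree('GPE', [('New', 'NNP'), ('Zealand', 'NNP')]),
]

print([tree_to_entity(t) for t in sample_trees])
```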