# Testing the Named Entities Recognition engine of Spacy with the SE articles

### Step 1. Loading Spacy models
***

We install Spacy's language library for the first run. Then we can comment-out the download command. Note that we are loading Spacy's "medium" model.


In [1]:
import re
import pandas as pd
import spacy
import sys
from collections import Counter
import re
#import pprint

## Run to install the language library, then comment-out
## !{sys.executable} -m spacy download en
!{sys.executable} -m spacy download en_core_web_md

nlp = spacy.load('en_core_web_md')
print('Finished loading.')

[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
Finished loading.


### Step 2. Pre-processing
***

* Read the file with the scraped content from the SE articles. 
* In later versions, the corresponding tables will be directly exported from the database.
* Discard records with duplicate titles and/or abstracts and/or contents and do some data cleansing.



In [2]:
dat = pd.read_excel('articles_4_30_12_26.xlsx')
dat = dat[['title','abstract','categories','raw content']]

dat = dat.dropna(axis=0,subset=["title"])
dat = dat.dropna(axis=0,subset=["abstract"])
dat = dat.dropna(axis=0,subset=["raw content"])
dat.reset_index(drop=True, inplace=True)

dat['raw content'] = dat['raw content'].apply(lambda x: re.sub("[^a-z\\.,A-Z0-9]", " ",x)) ## replace anything except digits,letters,comma and dot by space 
dat['raw content'] = dat['raw content'].apply(lambda x: re.sub(' +', ' ',x)) ## remove more than one spaces
dat['raw content'] = dat['raw content'].apply(lambda x: re.sub('^ +| +$', '',x)) ## remove start and end spaces
dat['raw content'] = dat['raw content'].apply(lambda x: re.sub(' ,',',',x)) ## space-comma-space -> comma-space
dat['raw content'] = dat['raw content'].apply(lambda x: re.sub(r'\D\d{4}', '.', x))## replace a non-digit+4digits with a dot e.g. I5530Camping-> .Camping
dat['raw content'] = dat['raw content'].apply(lambda x: re.sub(r'[a-zA-Z]+\d+', '.', x))## letters+digits -> .

dat

Unnamed: 0,title,abstract,categories,raw content
0,Air safety statistics in the EU,Detailed data from the European Aviation Safe...,"['Air', 'Passengers', 'Statistical article', '...",Overview of fatalities in air transport in the...
1,ASEAN-EU - international trade in goods statis...,This article provides a picture of the inte...,"['Non-EU countries', 'Trade in goods', 'Statis...",ASEAN countries trade in goods with main partn...
2,Air pollution statistics - emission inventories,This article is about emissions of air pollut...,"['Air pollution', 'Environment', 'Health', 'He...",General overview Air pollution harms human hea...
3,Absences from work - quarterly statistics,Absences from work can be classified into two...,"['Employment', 'Labour market', 'Statistical a...",Absences from work sharply increase in first h...
4,Annual national accounts - evolution of the in...,This article explains the income components of...,"['Authored article', 'National accounts (incl....","Shares of income components to GDP in. In., co..."
...,...,...,...,...
610,Accommodation and food service statistics - NA...,This article presents an overview of statist...,"['Services', 'Statistical article', 'Structura...",Structural profile The accommodation and food ...
611,Accidents at work statistics,This article presents a set of main statistica...,"['Accidents at work', 'Health', 'Health and sa...","Number of accidents In., there were 3.1 millio..."
612,Accidents at work - statistics on causes and c...,This article presents a set of main statistica...,"['Accidents at work', 'Health', 'Health and sa...","Workstation accidents Non fatal accidents In.,..."
613,Accidents at work - statistics by economic act...,This article presents a set of main statistic...,"['Accidents at work', 'Health', 'Health and sa...",Developments over time Non fatal accidents In....


### Step 3. Apply the NER engine
***

Create columns ORG, GPE, NORP, LOCATION which will hold dictionaries with entities recognized as: 
* Organizations; 
* Countries, cities, states;
* Nationalities or religious or political groups;
* Non-GPE locations, mountain ranges, bodies of water, respectively. 

In each dictionary in a record, the key is the entity and the values are a list with the token span's *start* index position, the token span's *stop* index position and the count of occurence in the content of the SE article.

In [3]:
nlp.max_length = 1500000

dat['ORG'] = [dict() for i in range(len(dat))]
dat['GPE'] = [dict() for i in range(len(dat))]
dat['NORP'] = [dict() for i in range(len(dat))]
dat['LOCATION'] = [dict() for i in range(len(dat))]

for i in range(len(dat)):
    if i % 100 == 0: print('i = ',i,' of ',len(dat))
    tokens = nlp(dat.loc[i,'raw content'])
    entities = tokens.ents
    for ent in entities:
        #print(ent.text, ent.label_)
        if ent.label_ == 'ORG':
            if ent.text.upper() in dat.loc[i,'ORG'].keys():
                dat.loc[i,'ORG'][ent.text.upper()][0].append((ent.start,ent.end)) 
                dat.loc[i,'ORG'][ent.text.upper()][1] += 1 
            else:    
                dat.loc[i,'ORG'][ent.text.upper()] = [[(ent.start,ent.end)],1]
        
        elif ent.label_ == 'GPE':
            if ent.text.upper() in dat.loc[i,'GPE'].keys():
                dat.loc[i,'GPE'][ent.text.upper()][0].append((ent.start,ent.end)) 
                dat.loc[i,'GPE'][ent.text.upper()][1] += 1 
            else:    
                dat.loc[i,'GPE'][ent.text.upper()] = [[(ent.start,ent.end)],1]
                
        elif ent.label_ == 'NORP':
            if ent.text.upper() in dat.loc[i,'NORP'].keys():
                dat.loc[i,'NORP'][ent.text.upper()][0].append((ent.start,ent.end)) 
                dat.loc[i,'NORP'][ent.text.upper()][1] += 1 
            else:    
                dat.loc[i,'NORP'][ent.text.upper()] = [[(ent.start,ent.end)],1]
                
        elif ent.label_ == 'LOCATION':
            if ent.text.upper() in dat.loc[i,'LOCATION'].keys():
                dat.loc[i,'LOCATION'][ent.text.upper()][0].append((ent.start,ent.end)) 
                dat.loc[i,'LOCATION'][ent.text.upper()][1] += 1 
            else:    
                dat.loc[i,'LOCATION'][ent.text.upper()] = [[(ent.start,ent.end)],1]         
    
dat

#PERSON People, including fictional
#NORP Nationalities or religious or political groups
#FACILITY Buildings, airports, highways, bridges, etc.
#ORGANIZATION Companies, agencies, institutions, etc.
#GPE Countries, cities, states
#LOCATION Non-GPE locations, mountain ranges, bodies of water
#PRODUCT Vehicles, weapons, foods, etc. (Not services)
#EVENT Named hurricanes, battles, wars, sports events, etc.
#WORK OF ART Titles of books, songs, etc.
#LAW Named documents made into laws 
#LANGUAGE Any named language
#The following values are also annotated in a style similar to names:
#DATE Absolute or relative dates or periods
#TIME Times smaller than a day
#PERCENT Percentage (including “%”)
#MONEY Monetary values, including unit
#QUANTITY Measurements, as of weight or distance
#ORDINAL “first”, “second”
#CARDINAL Numerals that do not fall under another typ

i =  0  of  615
i =  100  of  615
i =  200  of  615
i =  300  of  615
i =  400  of  615
i =  500  of  615
i =  600  of  615


Unnamed: 0,title,abstract,categories,raw content,ORG,GPE,NORP,LOCATION
0,Air safety statistics in the EU,Detailed data from the European Aviation Safe...,"['Air', 'Passengers', 'Statistical article', '...",Overview of fatalities in air transport in the...,"{'EU': [[(8, 9), (80, 81), (286, 287), (305, 3...","{'THE MEMBER STATES': [[(89, 92)], 1], 'FRANCE...","{'GERMAN': [[(361, 362)], 1], 'FRENCH': [[(365...",{}
1,ASEAN-EU - international trade in goods statis...,This article provides a picture of the inte...,"['Non-EU countries', 'Trade in goods', 'Statis...",ASEAN countries trade in goods with main partn...,"{'ASEAN': [[(0, 1), (16, 17), (39, 40), (57, 5...","{'CHINA': [[(26, 27), (79, 80)], 2], 'JAPAN': ...","{'GERMAN': [[(982, 983)], 1]}",{}
2,Air pollution statistics - emission inventories,This article is about emissions of air pollut...,"['Air pollution', 'Environment', 'Health', 'He...",General overview Air pollution harms human hea...,"{'NOX': [[(40, 41), (134, 135)], 2], 'EU': [[(...","{'NH': [[(168, 169)], 1]}",{},{}
3,Absences from work - quarterly statistics,Absences from work can be classified into two...,"['Employment', 'Labour market', 'Statistical a...",Absences from work sharply increase in first h...,"{'EU': [[(58, 59), (180, 181), (259, 260), (79...","{'THE MEMBER STATES': [[(1089, 1092), (1404, 1...",{},{}
4,Annual national accounts - evolution of the in...,This article explains the income components of...,"['Authored article', 'National accounts (incl....","Shares of income components to GDP in. In., co...","{'EUROPEAN UNION': [[(20, 22)], 1], 'EU': [[(4...","{'THE MEMBER STATES': [[(96, 99), (648, 651)],...",{},{}
...,...,...,...,...,...,...,...,...
610,Accommodation and food service statistics - NA...,This article presents an overview of statist...,"['Services', 'Statistical article', 'Structura...",Structural profile The accommodation and food ...,"{'EU': [[(17, 18), (100, 101), (151, 152), (26...","{'GERMANY': [[(713, 714), (837, 838), (1389, 1...","{'GREEK': [[(1738, 1739)], 1]}",{}
611,Accidents at work statistics,This article presents a set of main statistica...,"['Accidents at work', 'Health', 'Health and sa...","Number of accidents In., there were 3.1 millio...","{'EU': [[(32, 33), (71, 72), (95, 96), (144, 1...","{'FINLAND': [[(372, 373), (1668, 1669)], 2], '...",{},{}
612,Accidents at work - statistics on causes and c...,This article presents a set of main statistica...,"['Accidents at work', 'Health', 'Health and sa...","Workstation accidents Non fatal accidents In.,...","{'EU': [[(26, 27), (216, 217), (381, 382), (63...","{'SLOVENIA': [[(1796, 1797), (2353, 2354), (29...","{'IRISH': [[(2383, 2384), (2886, 2887)], 2], '...",{}
613,Accidents at work - statistics by economic act...,This article presents a set of main statistic...,"['Accidents at work', 'Health', 'Health and sa...",Developments over time Non fatal accidents In....,"{'EU': [[(30, 31), (47, 48), (118, 119), (190,...",{},{},{}


### Step 4. Gathering the most common entities: example with ORG entities
***

We can see several errors and repetitions. These require some further cleansing steps and fine-tuning of the NER engine (not yet carried out). There are in total 1502 terms identified as named entities - organizations.


In [4]:
from itertools import chain
org_list=sorted(list(chain.from_iterable(dat['ORG'].apply(lambda x: x.keys()))))
org_all_freqs = sorted(Counter(org_list))
print('Total terms identified as ORG: ',len(org_all_freqs))

print('\n100 most common:\n')
org_common_freqs = Counter(org_list).most_common(100)
org_common = sorted([x[0] for x in org_common_freqs])
org_common_freqs

Total terms identified as ORG:  1273

100 most common:



[('EU', 597),
 ('CZECHIA', 214),
 ('EUROSTAT', 84),
 ('EFTA', 66),
 ('EUR', 55),
 ('THE EUROPEAN UNION', 52),
 ('NACE', 45),
 ('PPS', 35),
 ('ISCED', 27),
 ('DATA', 26),
 ('EA', 26),
 ('THE EUROPEAN COMMISSION', 24),
 ('OECD', 21),
 ('SITC', 21),
 ('SDG', 19),
 ('EUROPEAN UNION', 17),
 ('STATE', 16),
 ('EHIS', 15),
 ('FDI', 15),
 ('CO 2', 14),
 ('ENP EAST', 13),
 ('EU 27', 13),
 ('UAA', 13),
 ('ASEAN', 12),
 ('EEA', 12),
 ('GHG', 12),
 ('ICT', 12),
 ('LFS', 11),
 ('DE BRUXELLES CAPITALE', 10),
 ('ENP SOUTH', 10),
 ('EU SILC', 10),
 ('MEMBER STATE', 10),
 ('PROV', 10),
 ('THE EUROPEAN ENVIRONMENT AGENCY EEA', 10),
 ('THE LE DE FRANCE', 10),
 ('IRELAND EUR', 9),
 ('NEET', 9),
 ('THE EU STATISTICS ON INCOME AND LIVING CONDITIONS EU SILC', 9),
 ('ATTIKI', 8),
 ('EC', 8),
 ('ESA', 8),
 ('HEALTHCARE', 8),
 ('THE UNITED ARAB EMIRATES', 8),
 ('UN', 8),
 ('GNI', 7),
 ('HOUSEHOLDS', 7),
 ('ASEM', 6),
 ('COFOG', 6),
 ('COICOP', 6),
 ('EASTERN', 6),
 ('NATURA', 6),
 ('NESA', 6),
 ('NPISH', 6),
 ('

### Step 5. Storing information on these most common entities per article: example with ORG entities
***

This is one way of storing the information on both all entities and counts and on the most common ones in a Pandas dataframe.


In [5]:
dat['ORG_COMMON_100'] = dat[['ORG']].apply(lambda x: {y:x[y] for y in x.keys() if y in org_common})
dat

Unnamed: 0,title,abstract,categories,raw content,ORG,GPE,NORP,LOCATION,ORG_COMMON_100
0,Air safety statistics in the EU,Detailed data from the European Aviation Safe...,"['Air', 'Passengers', 'Statistical article', '...",Overview of fatalities in air transport in the...,"{'EU': [[(8, 9), (80, 81), (286, 287), (305, 3...","{'THE MEMBER STATES': [[(89, 92)], 1], 'FRANCE...","{'GERMAN': [[(361, 362)], 1], 'FRENCH': [[(365...",{},{}
1,ASEAN-EU - international trade in goods statis...,This article provides a picture of the inte...,"['Non-EU countries', 'Trade in goods', 'Statis...",ASEAN countries trade in goods with main partn...,"{'ASEAN': [[(0, 1), (16, 17), (39, 40), (57, 5...","{'CHINA': [[(26, 27), (79, 80)], 2], 'JAPAN': ...","{'GERMAN': [[(982, 983)], 1]}",{},
2,Air pollution statistics - emission inventories,This article is about emissions of air pollut...,"['Air pollution', 'Environment', 'Health', 'He...",General overview Air pollution harms human hea...,"{'NOX': [[(40, 41), (134, 135)], 2], 'EU': [[(...","{'NH': [[(168, 169)], 1]}",{},{},
3,Absences from work - quarterly statistics,Absences from work can be classified into two...,"['Employment', 'Labour market', 'Statistical a...",Absences from work sharply increase in first h...,"{'EU': [[(58, 59), (180, 181), (259, 260), (79...","{'THE MEMBER STATES': [[(1089, 1092), (1404, 1...",{},{},
4,Annual national accounts - evolution of the in...,This article explains the income components of...,"['Authored article', 'National accounts (incl....","Shares of income components to GDP in. In., co...","{'EUROPEAN UNION': [[(20, 22)], 1], 'EU': [[(4...","{'THE MEMBER STATES': [[(96, 99), (648, 651)],...",{},{},
...,...,...,...,...,...,...,...,...,...
610,Accommodation and food service statistics - NA...,This article presents an overview of statist...,"['Services', 'Statistical article', 'Structura...",Structural profile The accommodation and food ...,"{'EU': [[(17, 18), (100, 101), (151, 152), (26...","{'GERMANY': [[(713, 714), (837, 838), (1389, 1...","{'GREEK': [[(1738, 1739)], 1]}",{},
611,Accidents at work statistics,This article presents a set of main statistica...,"['Accidents at work', 'Health', 'Health and sa...","Number of accidents In., there were 3.1 millio...","{'EU': [[(32, 33), (71, 72), (95, 96), (144, 1...","{'FINLAND': [[(372, 373), (1668, 1669)], 2], '...",{},{},
612,Accidents at work - statistics on causes and c...,This article presents a set of main statistica...,"['Accidents at work', 'Health', 'Health and sa...","Workstation accidents Non fatal accidents In.,...","{'EU': [[(26, 27), (216, 217), (381, 382), (63...","{'SLOVENIA': [[(1796, 1797), (2353, 2354), (29...","{'IRISH': [[(2383, 2384), (2886, 2887)], 2], '...",{},
613,Accidents at work - statistics by economic act...,This article presents a set of main statistic...,"['Accidents at work', 'Health', 'Health and sa...",Developments over time Non fatal accidents In....,"{'EU': [[(30, 31), (47, 48), (118, 119), (190,...",{},{},{},


### Step 6. Exporting the dataframe to Excel
***

This is useful for the manual inspection and the design of rules for the fine-tuning of the NER engine. This output can then directly be imported in the database.


In [6]:
dat.to_excel('SE_NERs.xlsx')