# Testing the Named Entities Recognition engine of Spacy with the SE articles

### Step 1. Loading Spacy models
***

We install Spacy's language library for the first run. Then we can comment-out the download command. Note that we are loading Spacy's "medium" model.


In [9]:
import re
import pandas as pd
import spacy
import sys
from collections import Counter
#import pprint

## Run to install the language library, then comment-out
## !{sys.executable} -m spacy download en
!{sys.executable} -m spacy download en_core_web_md

nlp = spacy.load('en_core_web_md')
print('Finished loading.')

[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
Finished loading.


### Step 2. Pre-processing
***

* Read the file with the scraped content from the SE articles. 
* In later versions, the corresponding tables will be directly exported from the database.
* Discard records with duplicate titles and/or abstracts and/or contents and do some data cleansing.



In [10]:
dat = pd.read_excel('articles_5_1_15_25.xlsx')
dat = dat[['title','abstract','categories','raw content']]

dat = dat.dropna(axis=0,subset=["title"])
dat = dat.dropna(axis=0,subset=["abstract"])
dat = dat.dropna(axis=0,subset=["raw content"])
dat.reset_index(drop=True, inplace=True)

dat['raw content'] = dat['raw content'].apply(lambda x: re.sub("[^a-z\\.,A-Z0-9]", " ",x)) ## replace anything except digits,letters,comma and dot by space 
dat['raw content'] = dat['raw content'].apply(lambda x: re.sub(r'[a-zA-Z]+\d+', ' ', x))## letters+digits -> space
dat['raw content'] = dat['raw content'].apply(lambda x: re.sub(' +', ' ',x)) ## remove more than one spaces
dat['raw content'] = dat['raw content'].apply(lambda x: re.sub('^ +| +$', '',x)) ## remove start and end spaces
dat['raw content'] = dat['raw content'].apply(lambda x: re.sub(' ,',',',x)) ## space-comma-space -> comma-space

dat

Unnamed: 0,title,abstract,categories,raw content
0,Absences from work - quarterly statistics,Absences from work can be classified into two...,"['Employment', 'Labour market', 'Statistical a...",Absences from work sharply increase in first h...
1,Balance of payments statistics - quarterly data,This article presents quarterly statistics on ...,"['Balance of payments', 'Statistical article']",Current account. The EU non seasonally adjuste...
2,Accidents and injuries statistics,This article presents an overview of European...,"['Health', 'Health status', 'Statistical artic...","Deaths from accidents, injuries and assault. I..."
3,Accidents at work statistics,This article presents a set of main statistica...,"['Accidents at work', 'Health', 'Health and sa...","Number of accidents. In 2018, there were 3.1 m..."
4,Accidents at work - statistics by economic act...,This article presents a set of main statistic...,"['Accidents at work', 'Health', 'Health and sa...",Developments over time. Non fatal accidents In...
...,...,...,...,...
610,Africa-EU - international trade in goods stati...,This article provides a picture of internation...,"['Non-EU countries', 'Trade in goods', 'Statis...",Africa s main trade in goods partner is the EU...
611,Adult learning statistics,This article provides an overview of adult l...,"['Education and training', 'Lifelong learning'...",Participation rate of adults in learning in th...
612,Acquisition of citizenship statistics,This article presents recent statistics on th...,"['Asylum and migration', 'Population', 'Acquis...",EU 27 Member States granted citizenship to 706...
613,Adult learning statistics - characteristics of...,This article presents an overview of European...,"['Education and training', 'Participation in e...",Formal and non formal adult education and trai...


### Step 3. Apply the NER engine
***

Create columns ORG, GPE, NORP, LOCATION which will hold dictionaries with entities recognized as: 
* Organizations; 
* Countries, cities, states;
* Nationalities or religious or political groups;
* Non-GPE locations, mountain ranges, bodies of water, respectively. 

In each dictionary in a record, the key is the entity and the values are a list with the token span's *start* index position, the token span's *stop* index position and the count of occurence in the content of the SE article.

In [11]:
nlp.max_length = 1500000

dat['ORG'] = [dict() for i in range(len(dat))]
dat['GPE'] = [dict() for i in range(len(dat))]
dat['NORP'] = [dict() for i in range(len(dat))]
dat['LOCATION'] = [dict() for i in range(len(dat))]

for i in range(len(dat)):
    if i % 100 == 0: print('i = ',i,' of ',len(dat))
    tokens = nlp(dat.loc[i,'raw content'])
    entities = tokens.ents
    for ent in entities:
        #print(ent.text, ent.label_)
        if ent.label_ == 'ORG':
            if ent.text.upper() in dat.loc[i,'ORG'].keys():
                dat.loc[i,'ORG'][ent.text.upper()][0].append((ent.start,ent.end)) 
                dat.loc[i,'ORG'][ent.text.upper()][1] += 1 
            else:    
                dat.loc[i,'ORG'][ent.text.upper()] = [[(ent.start,ent.end)],1]
        
        elif ent.label_ == 'GPE':
            if ent.text.upper() in dat.loc[i,'GPE'].keys():
                dat.loc[i,'GPE'][ent.text.upper()][0].append((ent.start,ent.end)) 
                dat.loc[i,'GPE'][ent.text.upper()][1] += 1 
            else:    
                dat.loc[i,'GPE'][ent.text.upper()] = [[(ent.start,ent.end)],1]
                
        elif ent.label_ == 'NORP':
            if ent.text.upper() in dat.loc[i,'NORP'].keys():
                dat.loc[i,'NORP'][ent.text.upper()][0].append((ent.start,ent.end)) 
                dat.loc[i,'NORP'][ent.text.upper()][1] += 1 
            else:    
                dat.loc[i,'NORP'][ent.text.upper()] = [[(ent.start,ent.end)],1]
                
        elif ent.label_ == 'LOCATION':
            if ent.text.upper() in dat.loc[i,'LOCATION'].keys():
                dat.loc[i,'LOCATION'][ent.text.upper()][0].append((ent.start,ent.end)) 
                dat.loc[i,'LOCATION'][ent.text.upper()][1] += 1 
            else:    
                dat.loc[i,'LOCATION'][ent.text.upper()] = [[(ent.start,ent.end)],1]         
    
dat

#PERSON People, including fictional
#NORP Nationalities or religious or political groups
#FACILITY Buildings, airports, highways, bridges, etc.
#ORGANIZATION Companies, agencies, institutions, etc.
#GPE Countries, cities, states
#LOCATION Non-GPE locations, mountain ranges, bodies of water
#PRODUCT Vehicles, weapons, foods, etc. (Not services)
#EVENT Named hurricanes, battles, wars, sports events, etc.
#WORK OF ART Titles of books, songs, etc.
#LAW Named documents made into laws 
#LANGUAGE Any named language
#The following values are also annotated in a style similar to names:
#DATE Absolute or relative dates or periods
#TIME Times smaller than a day
#PERCENT Percentage (including “%”)
#MONEY Monetary values, including unit
#QUANTITY Measurements, as of weight or distance
#ORDINAL “first”, “second”
#CARDINAL Numerals that do not fall under another typ

i =  0  of  615
i =  100  of  615
i =  200  of  615
i =  300  of  615
i =  400  of  615
i =  500  of  615
i =  600  of  615


Unnamed: 0,title,abstract,categories,raw content,ORG,GPE,NORP,LOCATION
0,Absences from work - quarterly statistics,Absences from work can be classified into two...,"['Employment', 'Labour market', 'Statistical a...",Absences from work sharply increase in first h...,"{'EU': [[(59, 60), (181, 182), (261, 262), (80...","{'THE MEMBER STATES': [[(1099, 1102), (1413, 1...",{},{}
1,Balance of payments statistics - quarterly data,This article presents quarterly statistics on ...,"['Balance of payments', 'Statistical article']",Current account. The EU non seasonally adjuste...,"{'EU': [[(4, 5), (148, 149), (470, 471), (569,...","{'THE MEMBER STATES': [[(202, 205)], 1], 'MEMB...",{},{}
2,Accidents and injuries statistics,This article presents an overview of European...,"['Health', 'Health status', 'Statistical artic...","Deaths from accidents, injuries and assault. I...","{'EU': [[(18, 19), (55, 56), (110, 111), (132,...","{'SLOVENIA': [[(44, 45), (531, 532), (563, 564...","{'BALTIC': [[(161, 162), (258, 259), (708, 709...",{}
3,Accidents at work statistics,This article presents a set of main statistica...,"['Accidents at work', 'Health', 'Health and sa...","Number of accidents. In 2018, there were 3.1 m...","{'EU': [[(33, 34), (72, 73), (96, 97), (145, 1...","{'FINLAND': [[(374, 375), (1679, 1680)], 2], '...",{},{}
4,Accidents at work - statistics by economic act...,This article presents a set of main statistic...,"['Accidents at work', 'Health', 'Health and sa...",Developments over time. Non fatal accidents In...,"{'EU': [[(31, 32), (48, 49), (119, 120), (191,...",{},{},{}
...,...,...,...,...,...,...,...,...
610,Africa-EU - international trade in goods stati...,This article provides a picture of internation...,"['Non-EU countries', 'Trade in goods', 'Statis...",Africa s main trade in goods partner is the EU...,"{'EU': [[(9, 10), (22, 23), (68, 69), (184, 18...","{'CHINA': [[(42, 43), (53, 54)], 2], 'NORTHERN...","{'AFRICAN': [[(38, 39), (59, 60), (437, 438), ...",{}
611,Adult learning statistics,This article provides an overview of adult l...,"['Education and training', 'Lifelong learning'...",Participation rate of adults in learning in th...,"{'EU': [[(80, 81), (156, 157), (255, 256), (29...","{'DENMARK': [[(146, 147)], 1], 'FINLAND': [[(1...","{'EUROPEAN': [[(16, 17)], 1]}",{}
612,Acquisition of citizenship statistics,This article presents recent statistics on th...,"['Asylum and migration', 'Population', 'Acquis...",EU 27 Member States granted citizenship to 706...,"{'EU': [[(0, 1), (23, 24), (133, 134), (282, 2...","{'GERMANY': [[(47, 48), (136, 137), (360, 361)...","{'GERMAN': [[(54, 55), (678, 679), (698, 699)]...",{}
613,Adult learning statistics - characteristics of...,This article presents an overview of European...,"['Education and training', 'Participation in e...",Formal and non formal adult education and trai...,"{'EU': [[(35, 36), (112, 113), (239, 240), (39...","{'NETHERLANDS': [[(218, 219), (277, 278)], 2],...",{},{}


### Step 4. Gathering the most common entities: example with ORG entities
***

We can see a few errors and repetitions. These require some further cleansing steps and fine-tuning of the NER engine (not yet carried out). There are in total 1240 terms identified as named entities - organizations.


In [23]:
from itertools import chain
org_list=sorted(list(chain.from_iterable(dat['ORG'].apply(lambda x: x.keys()))))
org_all_freqs = sorted(Counter(org_list))
print('Total terms identified as ORG: ',len(org_all_freqs))

print('\n100 most common:\n')
org_common_freqs = Counter(org_list).most_common(100)
org_common = sorted([x[0] for x in org_common_freqs])
#import pprint
#pp = pprint.PrettyPrinter(indent=4)
print(org_common_freqs)

Total terms identified as ORG:  1240

100 most common:

['AAGR', 'ALGECIRAS', 'AROPE', 'ASEAN', 'ASEM', 'ATTIKI', 'BAYERN', 'BMI', 'CO 2', 'COFOG', 'COICOP', 'COMMISSION', 'CORSE', 'COVID 19', 'CZECHIA', 'CZECHIA EUR', 'DATA', 'DE BRUXELLES CAPITALE', 'DMC', 'EA', 'EA 19', 'EASTERN', 'EC', 'EEA', 'EFTA', 'EHIS', 'ENP EAST', 'ENP SOUTH', 'ESA', 'EU', 'EU 27', 'EU 28', 'EU SILC', 'EUR', 'EUROBASE', 'EUROPEAN UNION', 'EUROSTAT', 'FDI', 'FLEVOLAND', 'FOOD', 'FTE', 'GHG', 'GNI', 'GROUP', 'GVA', 'HBS', 'HEALTHCARE', 'HICP', 'HOUSEHOLDS', 'ICD', 'ICT', 'INSTAGRAM', 'INTRA EU', 'IRELAND EUR', 'ISCED', 'ISCO', 'LFS', 'LPG', 'MACHINERY', 'MAYOTTE', 'MEMBER STATE', 'METROPOLITANA DE LISBOA', 'NACE', 'NATURA', 'NEET', 'NESA', 'NPISH', 'OECD', 'PLI', 'PPS', 'PRINCIPADO DE ASTURIAS', 'PROV', 'REGI N DE MURCIA', 'REGULATION EC', 'SDG', 'SEVEREN', 'SITC', 'SOSTIN', 'STATE', 'STS', 'THE COMUNIDAD DE MADRID', 'THE EU LABOUR FORCE SURVEY', 'THE EU STATISTICS ON INCOME AND LIVING CONDITIONS EU SILC', 'THE

### Step 5. Storing information on these most common entities per article: example with ORG entities
***

This is one way of storing the information on both all entities and counts and on the most common ones in a Pandas dataframe.


In [62]:
dat['ORG_COMMON_100'] = dat['ORG'].apply(lambda x: {y:x[y] for y in x.keys() if y in org_common})
dat

Unnamed: 0,title,abstract,categories,raw content,ORG,GPE,NORP,LOCATION,ORG_COMMON_100
0,Absences from work - quarterly statistics,Absences from work can be classified into two...,"['Employment', 'Labour market', 'Statistical a...",Absences from work sharply increase in first h...,"{'EU': [[(59, 60), (181, 182), (261, 262), (80...","{'THE MEMBER STATES': [[(1099, 1102), (1413, 1...",{},{},"{'EU': [[(59, 60), (181, 182), (261, 262), (80..."
1,Balance of payments statistics - quarterly data,This article presents quarterly statistics on ...,"['Balance of payments', 'Statistical article']",Current account. The EU non seasonally adjuste...,"{'EU': [[(4, 5), (148, 149), (470, 471), (569,...","{'THE MEMBER STATES': [[(202, 205)], 1], 'MEMB...",{},{},"{'EU': [[(4, 5), (148, 149), (470, 471), (569,..."
2,Accidents and injuries statistics,This article presents an overview of European...,"['Health', 'Health status', 'Statistical artic...","Deaths from accidents, injuries and assault. I...","{'EU': [[(18, 19), (55, 56), (110, 111), (132,...","{'SLOVENIA': [[(44, 45), (531, 532), (563, 564...","{'BALTIC': [[(161, 162), (258, 259), (708, 709...",{},"{'EU': [[(18, 19), (55, 56), (110, 111), (132,..."
3,Accidents at work statistics,This article presents a set of main statistica...,"['Accidents at work', 'Health', 'Health and sa...","Number of accidents. In 2018, there were 3.1 m...","{'EU': [[(33, 34), (72, 73), (96, 97), (145, 1...","{'FINLAND': [[(374, 375), (1679, 1680)], 2], '...",{},{},"{'EU': [[(33, 34), (72, 73), (96, 97), (145, 1..."
4,Accidents at work - statistics by economic act...,This article presents a set of main statistic...,"['Accidents at work', 'Health', 'Health and sa...",Developments over time. Non fatal accidents In...,"{'EU': [[(31, 32), (48, 49), (119, 120), (191,...",{},{},{},"{'EU': [[(31, 32), (48, 49), (119, 120), (191,..."
...,...,...,...,...,...,...,...,...,...
610,Africa-EU - international trade in goods stati...,This article provides a picture of internation...,"['Non-EU countries', 'Trade in goods', 'Statis...",Africa s main trade in goods partner is the EU...,"{'EU': [[(9, 10), (22, 23), (68, 69), (184, 18...","{'CHINA': [[(42, 43), (53, 54)], 2], 'NORTHERN...","{'AFRICAN': [[(38, 39), (59, 60), (437, 438), ...",{},"{'EU': [[(9, 10), (22, 23), (68, 69), (184, 18..."
611,Adult learning statistics,This article provides an overview of adult l...,"['Education and training', 'Lifelong learning'...",Participation rate of adults in learning in th...,"{'EU': [[(80, 81), (156, 157), (255, 256), (29...","{'DENMARK': [[(146, 147)], 1], 'FINLAND': [[(1...","{'EUROPEAN': [[(16, 17)], 1]}",{},"{'EU': [[(80, 81), (156, 157), (255, 256), (29..."
612,Acquisition of citizenship statistics,This article presents recent statistics on th...,"['Asylum and migration', 'Population', 'Acquis...",EU 27 Member States granted citizenship to 706...,"{'EU': [[(0, 1), (23, 24), (133, 134), (282, 2...","{'GERMANY': [[(47, 48), (136, 137), (360, 361)...","{'GERMAN': [[(54, 55), (678, 679), (698, 699)]...",{},"{'EU': [[(0, 1), (23, 24), (133, 134), (282, 2..."
613,Adult learning statistics - characteristics of...,This article presents an overview of European...,"['Education and training', 'Participation in e...",Formal and non formal adult education and trai...,"{'EU': [[(35, 36), (112, 113), (239, 240), (39...","{'NETHERLANDS': [[(218, 219), (277, 278)], 2],...",{},{},"{'EU': [[(35, 36), (112, 113), (239, 240), (39..."


### Step 6. Exporting the dataframe to Excel
***

This is useful for the manual inspection and the design of rules for the fine-tuning of the NER engine. This output can then directly be imported in the database.


In [14]:
dat.to_excel('SE_NERs.xlsx')