# Testing the Named Entities Recognition engine of Spacy with the SE articles

### Revised (January 2022) to read all data from the database

### Installation instructions
*    Download the notebook as "raw" file and save it with extension .ipynb (cut the .txt extension which is added)
*    Install the necessary libraries from your jupyter command prompt. These, together with the versions used, are:     
    *    pyodbc==4.0.32
    *    spacy==3.2.1 
    *    pandas==1.3.5
    *    numpy==1.20.3
*   Launch the notebook and put your own credentials for access to the Virtuoso database in the call to pyodbc.connect() in Step 2 "Pre-processing"    

### Step 1. Loading Spacy models
***

We install Spacy's language library for the first run. Then we can comment-out the download command. Note that we are loading Spacy's "medium" model.


In [1]:
import re
import pandas as pd
import numpy as np
import spacy
import sys
from collections import Counter
#import pprint

## Run to install the language library, then comment-out
## !{sys.executable} -m spacy download en
!{sys.executable} -m spacy download en_core_web_md

nlp = spacy.load('en_core_web_md')
print('Finished loading.')

import pyodbc

Collecting en-core-web-md==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.2.0/en_core_web_md-3.2.0-py3-none-any.whl (45.7 MB)
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
Finished loading.


### Step 2. Pre-processing
***
#### The data cleansing function



In [2]:
import re
import unicodedata as ud

def clean(x, quotes=True):
    if pd.isnull(x): return x  
    x = x.strip()
    
    ## make letter-question mark-letter -> letter-quote-space-letter !!! but NOT in the lists of URLs!!!
    if quotes:
        x = re.sub(r'([A-Za-z])\?([A-Za-z])','\\1\' \\2',x) ## NEW
    
    ## make letter-question mark-space lower case letter letter-quote-space letter
    x = re.sub(r'([A-Za-z])\? ([a-z])','\\1\' \\2',x) ## NEW

    ## delete ,000 commas in numbers    
    x = re.sub(r'\b(\d+),(\d+)\b','\\1\\2',x) ## CORRECTED
    
    ## delete  000 spaces in numbers
    x = re.sub(r'\b(\d+) (\d+)\b','\\1\\2',x) ## CORRECTED
    
    ## remove more than one spaces
    x = re.sub(r' +', ' ',x)
    
    ## remove start and end spaces
    x = re.sub(r'^ +| +$', '',x,flags=re.MULTILINE) 
    
    ## space-comma -> comma
    x = re.sub(r' \,',',',x)
    
    ## space-dot -> dot
    x = re.sub(r' \.','.',x)
    
    #x = x.encode('latin1').decode('utf-8') ## â\x80\x99
    x = ud.normalize('NFKD',x).encode('ascii', 'ignore').decode()
    
    return x

#### Import Statistics Explained data from the database

* Id, context and last update from table dat_article.  
* Title and url from table dat_link_info, on matching id and resource_information_id=1 (i.e. Eurostat).
* Abstract from field content in table dat_article_paragraph, on matching article_id and abstract=1 ("yes").
* Apply data cleansing.


In [3]:
c = pyodbc.connect('DSN=Virtuoso All;DBA=ESTAT;UID=xxxxx;PWD=xxxxx')
cursor = c.cursor()

SQLCommand = """SELECT T1.id, T1.context, T1.last_update, T2.title, T2.url, T3.content 
                FROM ESTAT.V1.dat_article as T1 
                INNER JOIN ESTAT.V1.dat_link_info as T2  
                  ON T1.id=T2.id  
                INNER JOIN ESTAT.V1.dat_article_paragraph as T3  
                  ON T2.id=T3.article_id  
                WHERE T2.resource_information_id=1 AND T3.abstract=1"""

SE_df = pd.read_sql(SQLCommand,c)
SE_df.rename(columns={'content':'abstract'},inplace=True)
SE_df = SE_df[['id','context','title','abstract','url','last_update']]

SE_df['context'] = SE_df['context'].apply(clean)
SE_df['title'] = SE_df['title'].apply(clean)
SE_df['abstract'] = SE_df['abstract'].apply(clean)

SE_df

Unnamed: 0,id,context,title,abstract,url,last_update
0,7,"A safe, healthy working environment is a cruci...",Accidents at work statistics,This article presents a set of main statistica...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-26 16:06:00
1,13,"European institutions, governments, central ba...",National accounts and GDP,National accounts are the source for a multitu...,https://ec.europa.eu/eurostat/statistics-expla...,2021-06-28 16:29:00
2,16,National rail networks have different technica...,Railway safety statistics in the EU,"In 2019, 1516 significant railway accidents we...",https://ec.europa.eu/eurostat/statistics-expla...,2021-06-25 18:31:00
3,17,The content of this statistical article is bas...,Railway freight transport statistics,This article focuses on recent rail freight tr...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-27 18:19:00
4,18,The content of this statistical article is bas...,Railway passenger transport statistics - quart...,This article takes a look at recent annual and...,https://ec.europa.eu/eurostat/statistics-expla...,2021-07-07 10:30:00
...,...,...,...,...,...,...
885,10456,,"Merging statistics and geospatial information,...",This article forms part of Eurostat as statist...,https://ec.europa.eu/eurostat/statistics-expla...,2019-06-05 16:37:00
886,10470,,"Merging statistics and geospatial information,...",This article forms part of Eurostat as statist...,https://ec.europa.eu/eurostat/statistics-expla...,2019-06-05 16:35:00
887,10506,Eurostat compiles European Union and euro area...,Methods for compiling PEEIs in short-term busi...,Eurostat compiles European Union (EU) and euro...,https://ec.europa.eu/eurostat/statistics-expla...,2018-09-10 15:07:00
888,10531,,Building the System of National Accounts - adm...,This article is part of a set of background ar...,https://ec.europa.eu/eurostat/statistics-expla...,2019-11-13 15:48:00


#### Add paragraphs titles and contents

* From dat_article_paragraph with abstract=0 (i.e. "no").
* Match article_id from dat_article_paragraph with id from dat_article.
* Carry out data cleansing on titles and paragraph contents.

In [4]:
SQLCommand = """SELECT article_id, title, content 
                FROM ESTAT.V1.dat_article_paragraph
                WHERE abstract=0 AND article_id IN (SELECT id FROM ESTAT.V1.dat_article) """

add_content = pd.read_sql(SQLCommand,c)
add_content['title'] = add_content['title'].apply(clean)
add_content['content'] = add_content['content'].apply(clean)
add_content

Unnamed: 0,article_id,title,content
0,2905,Absences from work sharply increase in first h...,Absences from work recorded unprecedented high...
1,2905,Absences: 9.5 % of employment in Q4 2019 and 1...,The article's next figure (Figure 4) compares ...
2,2905,Higher share of absences from work among women...,"Considering all four quarters of 2020, the sha..."
3,2905,Absences from work due to own illness or disab...,"From Q4 2019 to Q4 2020, the number of people ..."
4,2905,Absences from work due to holidays,"Expressed as a share of employed people, absen..."
...,...,...,...
3854,10539,General presentation and definition,Scope of asylum statistics and Dublin statisti...
3855,10539,Methodological aspects in asylum statistics,Annual aggregate of the number of asylum appli...
3856,10539,Methodological aspects in Dublin statistics,Asymmetries For most of the collected Dublin s...
3857,10539,What questions can or cannot be answered with ...,How many asylum seekers are entering EU Member...


#### Aggregate above paragraph titles and contents from SE articles paragraphs by article id
* Create a column raw content which gathers all paragraph titles and contents in one text per article.

In [5]:
add_content_grouped = add_content.groupby(['article_id'])[['title','content']].aggregate(lambda x: list(x))
add_content_grouped.reset_index(drop=False, inplace=True)
for i in range(len(add_content_grouped)):
    add_content_grouped.loc[i,'raw content'] = ''
    for (a,b) in zip(add_content_grouped.loc[i,'title'],add_content_grouped.loc[i,'content']):
        add_content_grouped.loc[i,'raw content'] += a + '. ' + b
add_content_grouped = add_content_grouped[['article_id','raw content']]    

add_content_grouped

Unnamed: 0,article_id,raw content
0,7,"Number of accidents. In 2018, there were 3.1 m..."
1,13,Developments for GDP in the EU-27: growth sinc...
2,16,Fall in the number of railway accidents. 9 % f...
3,17,Downturn for EU transport performance in 2019....
4,18,Rail passenger transport performance continued...
...,...,...
860,10456,Problem. After successfully identifying and jo...
861,10470,"Problem. In France, there was significant room..."
862,10506,General overview. Nine PEEIs concern short-ter...
863,10531,What are administrative sources?. The term aad...


#### Merge raw content of SE articles with main file

* Add the title to column "raw content".

In [6]:
SE_df = pd.merge(SE_df,add_content_grouped,left_on='id',right_on='article_id',how='inner')
SE_df.drop(['article_id'],axis=1,inplace=True)

SE_df['raw content'] = SE_df['title'] +'. ' + SE_df['raw content']

SE_df

Unnamed: 0,id,context,title,abstract,url,last_update,raw content
0,7,"A safe, healthy working environment is a cruci...",Accidents at work statistics,This article presents a set of main statistica...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-26 16:06:00,Accidents at work statistics. Number of accide...
1,13,"European institutions, governments, central ba...",National accounts and GDP,National accounts are the source for a multitu...,https://ec.europa.eu/eurostat/statistics-expla...,2021-06-28 16:29:00,National accounts and GDP. Developments for GD...
2,16,National rail networks have different technica...,Railway safety statistics in the EU,"In 2019, 1516 significant railway accidents we...",https://ec.europa.eu/eurostat/statistics-expla...,2021-06-25 18:31:00,Railway safety statistics in the EU. Fall in t...
3,17,The content of this statistical article is bas...,Railway freight transport statistics,This article focuses on recent rail freight tr...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-27 18:19:00,Railway freight transport statistics. Downturn...
4,18,The content of this statistical article is bas...,Railway passenger transport statistics - quart...,This article takes a look at recent annual and...,https://ec.europa.eu/eurostat/statistics-expla...,2021-07-07 10:30:00,Railway passenger transport statistics - quart...
...,...,...,...,...,...,...,...
860,10456,,"Merging statistics and geospatial information,...",This article forms part of Eurostat as statist...,https://ec.europa.eu/eurostat/statistics-expla...,2019-06-05 16:37:00,"Merging statistics and geospatial information,..."
861,10470,,"Merging statistics and geospatial information,...",This article forms part of Eurostat as statist...,https://ec.europa.eu/eurostat/statistics-expla...,2019-06-05 16:35:00,"Merging statistics and geospatial information,..."
862,10506,Eurostat compiles European Union and euro area...,Methods for compiling PEEIs in short-term busi...,Eurostat compiles European Union (EU) and euro...,https://ec.europa.eu/eurostat/statistics-expla...,2018-09-10 15:07:00,Methods for compiling PEEIs in short-term busi...
863,10531,,Building the System of National Accounts - adm...,This article is part of a set of background ar...,https://ec.europa.eu/eurostat/statistics-expla...,2019-11-13 15:48:00,Building the System of National Accounts - adm...


#### Read categories of SE articles from the database

In [7]:
import ast

SQLCommand = """SELECT article_id, categories 
                FROM ESTAT.V1.SE_articles_categories """

categories = pd.read_sql(SQLCommand,c)
categories['categories']=categories['categories'].apply(ast.literal_eval)
categories

Unnamed: 0,article_id,categories
0,7,"[Accidents at work, Health, Health and safety,..."
1,13,"[National accounts (incl. GDP), Statistical ar..."
2,16,"[Rail, Statistical article, Transport, Transpo..."
3,17,"[Freight, Rail, Statistical article, Transport]"
4,18,"[Passengers, Rail, Statistical article, Transp..."
...,...,...
600,9472,"[International trade, Trade in goods, Trade in..."
601,9477,"[Trade in goods, Statistical article]"
602,9479,"[Trade in goods, Statistical article, Internat..."
603,9492,"[Household composition and family situation, L..."


#### Merge with the main file

In [8]:
SE_df = pd.merge(SE_df,categories,left_on='id',right_on='article_id',how='inner')
SE_df.drop(['article_id'],axis=1,inplace=True)
SE_df

Unnamed: 0,id,context,title,abstract,url,last_update,raw content,categories
0,7,"A safe, healthy working environment is a cruci...",Accidents at work statistics,This article presents a set of main statistica...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-26 16:06:00,Accidents at work statistics. Number of accide...,"[Accidents at work, Health, Health and safety,..."
1,13,"European institutions, governments, central ba...",National accounts and GDP,National accounts are the source for a multitu...,https://ec.europa.eu/eurostat/statistics-expla...,2021-06-28 16:29:00,National accounts and GDP. Developments for GD...,"[National accounts (incl. GDP), Statistical ar..."
2,16,National rail networks have different technica...,Railway safety statistics in the EU,"In 2019, 1516 significant railway accidents we...",https://ec.europa.eu/eurostat/statistics-expla...,2021-06-25 18:31:00,Railway safety statistics in the EU. Fall in t...,"[Rail, Statistical article, Transport, Transpo..."
3,17,The content of this statistical article is bas...,Railway freight transport statistics,This article focuses on recent rail freight tr...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-27 18:19:00,Railway freight transport statistics. Downturn...,"[Freight, Rail, Statistical article, Transport]"
4,18,The content of this statistical article is bas...,Railway passenger transport statistics - quart...,This article takes a look at recent annual and...,https://ec.europa.eu/eurostat/statistics-expla...,2021-07-07 10:30:00,Railway passenger transport statistics - quart...,"[Passengers, Rail, Statistical article, Transp..."
...,...,...,...,...,...,...,...,...
600,9472,Trade is an important indicator of Europeas pr...,EU trade in COVID-19 related products,To help prevent the spread of the COVID-19 pan...,https://ec.europa.eu/eurostat/statistics-expla...,2021-03-31 13:04:00,EU trade in COVID-19 related products. Sharp i...,"[International trade, Trade in goods, Trade in..."
601,9477,Trade is an important indicator of Europeas pr...,EU international trade in goods - latest devel...,This article provides a picture of the interna...,https://ec.europa.eu/eurostat/statistics-expla...,2021-07-02 16:55:00,EU international trade in goods - latest devel...,"[Trade in goods, Statistical article]"
602,9479,Trade is an important indicator of Europeas pr...,EU and main world traders,International trade a especially the size and ...,https://ec.europa.eu/eurostat/statistics-expla...,2020-10-07 15:19:00,EU and main world traders. Main world traders:...,"[Trade in goods, Statistical article, Internat..."
603,9492,"In addition to the Labour Force Survey (LFS), ...",Age of young people leaving their parental hou...,Leaving the parental home is considered as a m...,https://ec.europa.eu/eurostat/statistics-expla...,2021-06-30 14:54:00,Age of young people leaving their parental hou...,"[Household composition and family situation, L..."


* Discard records which have empty strings in any of the columns [titles, abstracts, raw contents].
* Discard records with duplicate titles and/or abstracts and/or raw contents (but not context sections which are frequently the same and some are missing)

In [9]:
SE_df = SE_df.replace('', np.nan) 

SE_df = SE_df.dropna(axis=0,subset=['title','abstract','raw content'],how='any')
SE_df = SE_df.drop_duplicates(subset=["title"])
SE_df = SE_df.drop_duplicates(subset=["abstract"])
SE_df = SE_df.drop_duplicates(subset=["raw content"])
SE_df.reset_index(drop=True, inplace=True)

SE_df

Unnamed: 0,id,context,title,abstract,url,last_update,raw content,categories
0,7,"A safe, healthy working environment is a cruci...",Accidents at work statistics,This article presents a set of main statistica...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-26 16:06:00,Accidents at work statistics. Number of accide...,"[Accidents at work, Health, Health and safety,..."
1,13,"European institutions, governments, central ba...",National accounts and GDP,National accounts are the source for a multitu...,https://ec.europa.eu/eurostat/statistics-expla...,2021-06-28 16:29:00,National accounts and GDP. Developments for GD...,"[National accounts (incl. GDP), Statistical ar..."
2,16,National rail networks have different technica...,Railway safety statistics in the EU,"In 2019, 1516 significant railway accidents we...",https://ec.europa.eu/eurostat/statistics-expla...,2021-06-25 18:31:00,Railway safety statistics in the EU. Fall in t...,"[Rail, Statistical article, Transport, Transpo..."
3,17,The content of this statistical article is bas...,Railway freight transport statistics,This article focuses on recent rail freight tr...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-27 18:19:00,Railway freight transport statistics. Downturn...,"[Freight, Rail, Statistical article, Transport]"
4,18,The content of this statistical article is bas...,Railway passenger transport statistics - quart...,This article takes a look at recent annual and...,https://ec.europa.eu/eurostat/statistics-expla...,2021-07-07 10:30:00,Railway passenger transport statistics - quart...,"[Passengers, Rail, Statistical article, Transp..."
...,...,...,...,...,...,...,...,...
574,9472,Trade is an important indicator of Europeas pr...,EU trade in COVID-19 related products,To help prevent the spread of the COVID-19 pan...,https://ec.europa.eu/eurostat/statistics-expla...,2021-03-31 13:04:00,EU trade in COVID-19 related products. Sharp i...,"[International trade, Trade in goods, Trade in..."
575,9477,Trade is an important indicator of Europeas pr...,EU international trade in goods - latest devel...,This article provides a picture of the interna...,https://ec.europa.eu/eurostat/statistics-expla...,2021-07-02 16:55:00,EU international trade in goods - latest devel...,"[Trade in goods, Statistical article]"
576,9479,Trade is an important indicator of Europeas pr...,EU and main world traders,International trade a especially the size and ...,https://ec.europa.eu/eurostat/statistics-expla...,2020-10-07 15:19:00,EU and main world traders. Main world traders:...,"[Trade in goods, Statistical article, Internat..."
577,9492,"In addition to the Labour Force Survey (LFS), ...",Age of young people leaving their parental hou...,Leaving the parental home is considered as a m...,https://ec.europa.eu/eurostat/statistics-expla...,2021-06-30 14:54:00,Age of young people leaving their parental hou...,"[Household composition and family situation, L..."


### Step 3. Apply the NER engine
***

Create columns ORG, GPE, NORP, LOCATION which will hold dictionaries with entities recognized as: 
* Organizations; 
* Countries, cities, states;
* Nationalities or religious or political groups;
* Non-GPE locations, mountain ranges, bodies of water, respectively. 

In each dictionary in a record, the key is the entity and the values are a list with the token span's *start* index position, the token span's *stop* index position and the count of occurence in the content of the SE article.

In [10]:
nlp.max_length = 1500000

SE_df['ORG'] = [dict() for i in range(len(SE_df))]
SE_df['GPE'] = [dict() for i in range(len(SE_df))]
SE_df['NORP'] = [dict() for i in range(len(SE_df))]
SE_df['LOCATION'] = [dict() for i in range(len(SE_df))]

for i in range(len(SE_df)):
    if i % 100 == 0: print('i = ',i,' of ',len(SE_df))
    tokens = nlp(SE_df.loc[i,'raw content'])
    entities = tokens.ents
    for ent in entities:
        #print(ent.text, ent.label_)
        if ent.label_ == 'ORG':
            if ent.text.upper() in SE_df.loc[i,'ORG'].keys():
                SE_df.loc[i,'ORG'][ent.text.upper()][0].append((ent.start,ent.end)) 
                SE_df.loc[i,'ORG'][ent.text.upper()][1] += 1 
            else:    
                SE_df.loc[i,'ORG'][ent.text.upper()] = [[(ent.start,ent.end)],1]
        
        elif ent.label_ == 'GPE':
            if ent.text.upper() in SE_df.loc[i,'GPE'].keys():
                SE_df.loc[i,'GPE'][ent.text.upper()][0].append((ent.start,ent.end)) 
                SE_df.loc[i,'GPE'][ent.text.upper()][1] += 1 
            else:    
                SE_df.loc[i,'GPE'][ent.text.upper()] = [[(ent.start,ent.end)],1]
                
        elif ent.label_ == 'NORP':
            if ent.text.upper() in SE_df.loc[i,'NORP'].keys():
                SE_df.loc[i,'NORP'][ent.text.upper()][0].append((ent.start,ent.end)) 
                SE_df.loc[i,'NORP'][ent.text.upper()][1] += 1 
            else:    
                SE_df.loc[i,'NORP'][ent.text.upper()] = [[(ent.start,ent.end)],1]
                
        elif ent.label_ == 'LOCATION':
            if ent.text.upper() in SE_df.loc[i,'LOCATION'].keys():
                SE_df.loc[i,'LOCATION'][ent.text.upper()][0].append((ent.start,ent.end)) 
                SE_df.loc[i,'LOCATION'][ent.text.upper()][1] += 1 
            else:    
                SE_df.loc[i,'LOCATION'][ent.text.upper()] = [[(ent.start,ent.end)],1]         
    
SE_df

#PERSON People, including fictional
#NORP Nationalities or religious or political groups
#FACILITY Buildings, airports, highways, bridges, etc.
#ORGANIZATION Companies, agencies, institutions, etc.
#GPE Countries, cities, states
#LOCATION Non-GPE locations, mountain ranges, bodies of water
#PRODUCT Vehicles, weapons, foods, etc. (Not services)
#EVENT Named hurricanes, battles, wars, sports events, etc.
#WORK OF ART Titles of books, songs, etc.
#LAW Named documents made into laws 
#LANGUAGE Any named language
#The following values are also annotated in a style similar to names:
#DATE Absolute or relative dates or periods
#TIME Times smaller than a day
#PERCENT Percentage (including “%”)
#MONEY Monetary values, including unit
#QUANTITY Measurements, as of weight or distance
#ORDINAL “first”, “second”
#CARDINAL Numerals that do not fall under another typ

i =  0  of  579
i =  100  of  579
i =  200  of  579
i =  300  of  579
i =  400  of  579
i =  500  of  579


Unnamed: 0,id,context,title,abstract,url,last_update,raw content,categories,ORG,GPE,NORP,LOCATION
0,7,"A safe, healthy working environment is a cruci...",Accidents at work statistics,This article presents a set of main statistica...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-26 16:06:00,Accidents at work statistics. Number of accide...,"[Accidents at work, Health, Health and safety,...","{'EU': [[(439, 440), (537, 538), (1014, 1015),...","{'FINLAND': [[(398, 399), (1773, 1774)], 2], '...",{},{}
1,13,"European institutions, governments, central ba...",National accounts and GDP,National accounts are the source for a multitu...,https://ec.europa.eu/eurostat/statistics-expla...,2021-06-28 16:29:00,National accounts and GDP. Developments for GD...,"[National accounts (incl. GDP), Statistical ar...","{'EU': [[(411, 412), (424, 425), (440, 441), (...","{'JAPAN': [[(50, 51)], 1], 'THE UNITED STATES'...","{'CHINESE': [[(1170, 1171)], 1]}",{}
2,16,National rail networks have different technica...,Railway safety statistics in the EU,"In 2019, 1516 significant railway accidents we...",https://ec.europa.eu/eurostat/statistics-expla...,2021-06-25 18:31:00,Railway safety statistics in the EU. Fall in t...,"[Rail, Statistical article, Transport, Transpo...","{'EU': [[(5, 6)], 1], 'ERAAS': [[(1059, 1060)]...","{'BELGIUM': [[(248, 249), (766, 767), (1425, 1...",{},{}
3,17,The content of this statistical article is bas...,Railway freight transport statistics,This article focuses on recent rail freight tr...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-27 18:19:00,Railway freight transport statistics. Downturn...,"[Freight, Rail, Statistical article, Transport]","{'EU': [[(7, 8), (17, 18), (33, 34), (214, 215...","{'MONTENEGRO': [[(353, 354), (489, 490), (603,...","{'EUROPEAN': [[(963, 964), (1014, 1015)], 2]}",{}
4,18,The content of this statistical article is bas...,Railway passenger transport statistics - quart...,This article takes a look at recent annual and...,https://ec.europa.eu/eurostat/statistics-expla...,2021-07-07 10:30:00,Railway passenger transport statistics - quart...,"[Passengers, Rail, Statistical article, Transp...","{'EU': [[(35, 36), (320, 321), (347, 348), (39...","{'LUXEMBOURG': [[(280, 281), (452, 453), (676,...",{},{}
...,...,...,...,...,...,...,...,...,...,...,...,...
574,9472,Trade is an important indicator of Europeas pr...,EU trade in COVID-19 related products,To help prevent the spread of the COVID-19 pan...,https://ec.europa.eu/eurostat/statistics-expla...,2021-03-31 13:04:00,EU trade in COVID-19 related products. Sharp i...,"[International trade, Trade in goods, Trade in...","{'EU': [[(0, 1), (24, 25), (120, 121), (197, 1...","{'COVID-19': [[(3, 4)], 1], 'UNITED STATES': [...",{},{}
575,9477,Trade is an important indicator of Europeas pr...,EU international trade in goods - latest devel...,This article provides a picture of the interna...,https://ec.europa.eu/eurostat/statistics-expla...,2021-07-02 16:55:00,EU international trade in goods - latest devel...,"[Trade in goods, Statistical article]","{'EU': [[(0, 1), (240, 241), (313, 314), (769,...","{'CHINA': [[(18, 19), (118, 119), (217, 218), ...",{},{}
576,9479,Trade is an important indicator of Europeas pr...,EU and main world traders,International trade a especially the size and ...,https://ec.europa.eu/eurostat/statistics-expla...,2020-10-07 15:19:00,EU and main world traders. Main world traders:...,"[Trade in goods, Statistical article, Internat...","{'EU': [[(0, 1), (10, 11), (81, 82), (125, 126...","{'USA': [[(12, 13)], 1], 'CHINA': [[(14, 15), ...",{},{}
577,9492,"In addition to the Labour Force Survey (LFS), ...",Age of young people leaving their parental hou...,Leaving the parental home is considered as a m...,https://ec.europa.eu/eurostat/statistics-expla...,2021-06-30 14:54:00,Age of young people leaving their parental hou...,"[Household composition and family situation, L...","{'EU': [[(47, 48), (556, 557), (907, 908), (92...","{'CROATIA': [[(51, 52), (124, 125), (442, 443)...",{},{}


### Step 4. Gathering the most common entities: example with ORG entities
***

We can see a few errors and repetitions. These require some further cleansing steps and fine-tuning of the NER engine (not yet carried out). There are in total 1288 terms identified as named entities - organizations.


In [11]:
from itertools import chain
org_list=sorted(list(chain.from_iterable(SE_df['ORG'].apply(lambda x: x.keys()))))
org_all_freqs = sorted(Counter(org_list))
print('Total terms identified as ORG: ',len(org_all_freqs))

print('\n100 most common:\n')
org_common_freqs = Counter(org_list).most_common(100)
org_common = sorted([x[0] for x in org_common_freqs])
#import pprint
#pp = pprint.PrettyPrinter(indent=4)
print(org_common_freqs)

Total terms identified as ORG:  1288

100 most common:

[('EU', 556), ('STATE', 145), ('CZECHIA', 110), ('EUROSTAT', 93), ('THE EUROPEAN UNION', 79), ('EFTA', 61), ('NACE', 46), ('SDG', 38), ('INTRA-EU', 36), ('PPS', 32), ('NON-EU', 31), ('THE EUROPEAN COMMISSION', 27), ('EEA', 23), ('SITC', 23), ('OECD', 19), ('EC', 18), ('DATA', 17), ('EHIS', 17), ('FOOD & DRINK', 17), ('EUROSTATAS', 16), ('UAA', 16), ('FOOD &', 15), ('ICT', 15), ('THE EUROPEAN ENVIRONMENT AGENCY', 14), ('ENP-EAST', 12), ('FDI', 12), ('MOLDOVA', 12), ('NUTS', 12), ('ENP-SOUTH', 11), ('GHG', 11), ('THE EUROPEAN GREEN DEAL', 11), ('THE UNITED NATIONS', 11), ('CO 2', 10), ('EA', 10), ('STATISTICS', 10), ('COFOG', 9), ('EASTERN', 9), ('THE EU STATISTICS', 9), ('EUROPEAN UNION', 8), ('HEALTHCARE', 8), ('ILO', 8), ('NEET', 8), ('UN', 8), ('ATTIKI', 7), ('COMMISSION', 7), ('DMC', 7), ('ICD', 7), ('MONTENEGROAS', 7), ('THE INTERNATIONAL LABOUR ORGANISATION (ILO', 7), ('THE UNITED ARAB EMIRATES', 7), ('ALE DE FRANCE', 6), ('A

### Step 5. Storing information on these most common entities per article: example with ORG entities
***

This is one way of storing the information on both all entities and counts and on the most common ones in a Pandas dataframe.


In [12]:
SE_df['ORG_COMMON_100'] = SE_df['ORG'].apply(lambda x: {y:x[y] for y in x.keys() if y in org_common})
SE_df

Unnamed: 0,id,context,title,abstract,url,last_update,raw content,categories,ORG,GPE,NORP,LOCATION,ORG_COMMON_100
0,7,"A safe, healthy working environment is a cruci...",Accidents at work statistics,This article presents a set of main statistica...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-26 16:06:00,Accidents at work statistics. Number of accide...,"[Accidents at work, Health, Health and safety,...","{'EU': [[(439, 440), (537, 538), (1014, 1015),...","{'FINLAND': [[(398, 399), (1773, 1774)], 2], '...",{},{},"{'EU': [[(439, 440), (537, 538), (1014, 1015),..."
1,13,"European institutions, governments, central ba...",National accounts and GDP,National accounts are the source for a multitu...,https://ec.europa.eu/eurostat/statistics-expla...,2021-06-28 16:29:00,National accounts and GDP. Developments for GD...,"[National accounts (incl. GDP), Statistical ar...","{'EU': [[(411, 412), (424, 425), (440, 441), (...","{'JAPAN': [[(50, 51)], 1], 'THE UNITED STATES'...","{'CHINESE': [[(1170, 1171)], 1]}",{},"{'EU': [[(411, 412), (424, 425), (440, 441), (..."
2,16,National rail networks have different technica...,Railway safety statistics in the EU,"In 2019, 1516 significant railway accidents we...",https://ec.europa.eu/eurostat/statistics-expla...,2021-06-25 18:31:00,Railway safety statistics in the EU. Fall in t...,"[Rail, Statistical article, Transport, Transpo...","{'EU': [[(5, 6)], 1], 'ERAAS': [[(1059, 1060)]...","{'BELGIUM': [[(248, 249), (766, 767), (1425, 1...",{},{},"{'EU': [[(5, 6)], 1], 'CZECHIA': [[(1843, 1844..."
3,17,The content of this statistical article is bas...,Railway freight transport statistics,This article focuses on recent rail freight tr...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-27 18:19:00,Railway freight transport statistics. Downturn...,"[Freight, Rail, Statistical article, Transport]","{'EU': [[(7, 8), (17, 18), (33, 34), (214, 215...","{'MONTENEGRO': [[(353, 354), (489, 490), (603,...","{'EUROPEAN': [[(963, 964), (1014, 1015)], 2]}",{},"{'EU': [[(7, 8), (17, 18), (33, 34), (214, 215..."
4,18,The content of this statistical article is bas...,Railway passenger transport statistics - quart...,This article takes a look at recent annual and...,https://ec.europa.eu/eurostat/statistics-expla...,2021-07-07 10:30:00,Railway passenger transport statistics - quart...,"[Passengers, Rail, Statistical article, Transp...","{'EU': [[(35, 36), (320, 321), (347, 348), (39...","{'LUXEMBOURG': [[(280, 281), (452, 453), (676,...",{},{},"{'EU': [[(35, 36), (320, 321), (347, 348), (39..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
574,9472,Trade is an important indicator of Europeas pr...,EU trade in COVID-19 related products,To help prevent the spread of the COVID-19 pan...,https://ec.europa.eu/eurostat/statistics-expla...,2021-03-31 13:04:00,EU trade in COVID-19 related products. Sharp i...,"[International trade, Trade in goods, Trade in...","{'EU': [[(0, 1), (24, 25), (120, 121), (197, 1...","{'COVID-19': [[(3, 4)], 1], 'UNITED STATES': [...",{},{},"{'EU': [[(0, 1), (24, 25), (120, 121), (197, 1..."
575,9477,Trade is an important indicator of Europeas pr...,EU international trade in goods - latest devel...,This article provides a picture of the interna...,https://ec.europa.eu/eurostat/statistics-expla...,2021-07-02 16:55:00,EU international trade in goods - latest devel...,"[Trade in goods, Statistical article]","{'EU': [[(0, 1), (240, 241), (313, 314), (769,...","{'CHINA': [[(18, 19), (118, 119), (217, 218), ...",{},{},"{'EU': [[(0, 1), (240, 241), (313, 314), (769,..."
576,9479,Trade is an important indicator of Europeas pr...,EU and main world traders,International trade a especially the size and ...,https://ec.europa.eu/eurostat/statistics-expla...,2020-10-07 15:19:00,EU and main world traders. Main world traders:...,"[Trade in goods, Statistical article, Internat...","{'EU': [[(0, 1), (10, 11), (81, 82), (125, 126...","{'USA': [[(12, 13)], 1], 'CHINA': [[(14, 15), ...",{},{},"{'EU': [[(0, 1), (10, 11), (81, 82), (125, 126..."
577,9492,"In addition to the Labour Force Survey (LFS), ...",Age of young people leaving their parental hou...,Leaving the parental home is considered as a m...,https://ec.europa.eu/eurostat/statistics-expla...,2021-06-30 14:54:00,Age of young people leaving their parental hou...,"[Household composition and family situation, L...","{'EU': [[(47, 48), (556, 557), (907, 908), (92...","{'CROATIA': [[(51, 52), (124, 125), (442, 443)...",{},{},"{'EU': [[(47, 48), (556, 557), (907, 908), (92..."


### Step 6. Exporting the dataframe to Excel
***

This is useful for the manual inspection and the design of rules for the fine-tuning of the NER engine. This output can then directly be imported in the database.


In [13]:
SE_df.to_excel('SE_NERs.xlsx')