# Extraction of Subject -   Verb - Object tuples related to Categories and Named Entities of Selected Classes

### Revised (January 2022) to read all data from the database

### Step 1. Loading Spacy models
***

We install Spacy's language library for the first run. Then we can comment-out the download command. Note that we are loading Spacy's "medium" model.


In [1]:
import re
import pandas as pd
import numpy as np
import spacy
import sys

## Run to install the language library, then comment-out
!{sys.executable} -m spacy download en_core_web_md

nlp = spacy.load('en_core_web_md')
print('Finished loading.')

import pyodbc

Collecting en-core-web-md==3.2.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_md-3.2.0/en_core_web_md-3.2.0-py3-none-any.whl (45.7 MB)
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_md')
Finished loading.


### Step 2. Pre-processing
***
#### The data cleansing function

In [2]:
import re
import unicodedata as ud

def clean(x, quotes=True):
    if pd.isnull(x): return x  
    x = x.strip()
    
    ## make letter-question mark-letter -> letter-quote-space-letter !!! but NOT in the lists of URLs!!!
    if quotes:
        x = re.sub(r'([A-Za-z])\?([A-Za-z])','\\1\' \\2',x) ## NEW
    
    ## make letter-question mark-space lower case letter letter-quote-space letter
    x = re.sub(r'([A-Za-z])\? ([a-z])','\\1\' \\2',x) ## NEW

    ## delete ,000 commas in numbers    
    x = re.sub(r'\b(\d+),(\d+)\b','\\1\\2',x) ## CORRECTED
    
    ## delete  000 spaces in numbers
    x = re.sub(r'\b(\d+) (\d+)\b','\\1\\2',x) ## CORRECTED
    
    ## remove more than one spaces
    x = re.sub(r' +', ' ',x)
    
    ## remove start and end spaces
    x = re.sub(r'^ +| +$', '',x,flags=re.MULTILINE) 
    
    ## space-comma -> comma
    x = re.sub(r' \,',',',x)
    
    ## space-dot -> dot
    x = re.sub(r' \.','.',x)
    
    #x = x.encode('latin1').decode('utf-8') ## â\x80\x99
    x = ud.normalize('NFKD',x).encode('ascii', 'ignore').decode()
    
    return x

#### Import Statistics Explained data from the database
***

* Id, context and last update from table dat_article.  
* Title and url from table dat_link_info, on matching id and resource_information_id=1 (i.e. Eurostat).
* Abstract from field content in table dat_article_paragraph, on matching article_id and abstract=1 ("yes").
* Apply data cleansing.


In [3]:

c = pyodbc.connect('DSN=Virtuoso All;DBA=ESTAT;UID=xxxxx;PWD=xxxxx')
cursor = c.cursor()

SQLCommand = """SELECT T1.id, T1.context, T1.last_update, T2.title, T2.url, T3.content 
                FROM ESTAT.V1.dat_article as T1 
                INNER JOIN ESTAT.V1.dat_link_info as T2  
                  ON T1.id=T2.id  
                INNER JOIN ESTAT.V1.dat_article_paragraph as T3  
                  ON T2.id=T3.article_id  
                WHERE T2.resource_information_id=1 AND T3.abstract=1"""

SE_df = pd.read_sql(SQLCommand,c)
SE_df.rename(columns={'content':'abstract'},inplace=True)
SE_df = SE_df[['id','context','title','abstract','url','last_update']]

SE_df['context'] = SE_df['context'].apply(clean)
SE_df['title'] = SE_df['title'].apply(clean)
SE_df['abstract'] = SE_df['abstract'].apply(clean)

SE_df

Unnamed: 0,id,context,title,abstract,url,last_update
0,7,"A safe, healthy working environment is a cruci...",Accidents at work statistics,This article presents a set of main statistica...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-26 16:06:00
1,13,"European institutions, governments, central ba...",National accounts and GDP,National accounts are the source for a multitu...,https://ec.europa.eu/eurostat/statistics-expla...,2021-06-28 16:29:00
2,16,National rail networks have different technica...,Railway safety statistics in the EU,"In 2019, 1516 significant railway accidents we...",https://ec.europa.eu/eurostat/statistics-expla...,2021-06-25 18:31:00
3,17,The content of this statistical article is bas...,Railway freight transport statistics,This article focuses on recent rail freight tr...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-27 18:19:00
4,18,The content of this statistical article is bas...,Railway passenger transport statistics - quart...,This article takes a look at recent annual and...,https://ec.europa.eu/eurostat/statistics-expla...,2021-07-07 10:30:00
...,...,...,...,...,...,...
885,10456,,"Merging statistics and geospatial information,...",This article forms part of Eurostat as statist...,https://ec.europa.eu/eurostat/statistics-expla...,2019-06-05 16:37:00
886,10470,,"Merging statistics and geospatial information,...",This article forms part of Eurostat as statist...,https://ec.europa.eu/eurostat/statistics-expla...,2019-06-05 16:35:00
887,10506,Eurostat compiles European Union and euro area...,Methods for compiling PEEIs in short-term busi...,Eurostat compiles European Union (EU) and euro...,https://ec.europa.eu/eurostat/statistics-expla...,2018-09-10 15:07:00
888,10531,,Building the System of National Accounts - adm...,This article is part of a set of background ar...,https://ec.europa.eu/eurostat/statistics-expla...,2019-11-13 15:48:00


#### Add paragraphs titles and contents

* From dat_article_paragraph with abstract=0 (i.e. "no").
* Match article_id from dat_article_paragraph with id from dat_article.
* Carry out data cleansing on titles and paragraph contents.

In [4]:
SQLCommand = """SELECT article_id, title, content 
                FROM ESTAT.V1.dat_article_paragraph
                WHERE abstract=0 AND article_id IN (SELECT id FROM ESTAT.V1.dat_article) """

add_content = pd.read_sql(SQLCommand,c)
add_content['title'] = add_content['title'].apply(clean)
add_content['content'] = add_content['content'].apply(clean)
add_content

Unnamed: 0,article_id,title,content
0,2905,Absences from work sharply increase in first h...,Absences from work recorded unprecedented high...
1,2905,Absences: 9.5 % of employment in Q4 2019 and 1...,The article's next figure (Figure 4) compares ...
2,2905,Higher share of absences from work among women...,"Considering all four quarters of 2020, the sha..."
3,2905,Absences from work due to own illness or disab...,"From Q4 2019 to Q4 2020, the number of people ..."
4,2905,Absences from work due to holidays,"Expressed as a share of employed people, absen..."
...,...,...,...
3854,10539,General presentation and definition,Scope of asylum statistics and Dublin statisti...
3855,10539,Methodological aspects in asylum statistics,Annual aggregate of the number of asylum appli...
3856,10539,Methodological aspects in Dublin statistics,Asymmetries For most of the collected Dublin s...
3857,10539,What questions can or cannot be answered with ...,How many asylum seekers are entering EU Member...


#### Aggregate above paragraph titles and contents  from SE articles paragraphs by article id

* Create a column _raw content_ which gathers all paragraph titles and contents in one text per article.

In [5]:
add_content_grouped = add_content.groupby(['article_id'])[['title','content']].aggregate(lambda x: list(x))
add_content_grouped.reset_index(drop=False, inplace=True)
for i in range(len(add_content_grouped)):
    add_content_grouped.loc[i,'raw content'] = ''
    for (a,b) in zip(add_content_grouped.loc[i,'title'],add_content_grouped.loc[i,'content']):
        add_content_grouped.loc[i,'raw content'] += a + '. ' + b
add_content_grouped = add_content_grouped[['article_id','raw content']]    

add_content_grouped

Unnamed: 0,article_id,raw content
0,7,"Number of accidents. In 2018, there were 3.1 m..."
1,13,Developments for GDP in the EU-27: growth sinc...
2,16,Fall in the number of railway accidents. 9 % f...
3,17,Downturn for EU transport performance in 2019....
4,18,Rail passenger transport performance continued...
...,...,...
860,10456,Problem. After successfully identifying and jo...
861,10470,"Problem. In France, there was significant room..."
862,10506,General overview. Nine PEEIs concern short-ter...
863,10531,What are administrative sources?. The term aad...


#### Merge raw content of SE articles with main file

* Add the title to column "raw content".

In [6]:
SE_df = pd.merge(SE_df,add_content_grouped,left_on='id',right_on='article_id',how='inner')
SE_df.drop(['article_id'],axis=1,inplace=True)

SE_df['raw content'] = SE_df['title'] +'. ' + SE_df['raw content']

SE_df

Unnamed: 0,id,context,title,abstract,url,last_update,raw content
0,7,"A safe, healthy working environment is a cruci...",Accidents at work statistics,This article presents a set of main statistica...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-26 16:06:00,Accidents at work statistics. Number of accide...
1,13,"European institutions, governments, central ba...",National accounts and GDP,National accounts are the source for a multitu...,https://ec.europa.eu/eurostat/statistics-expla...,2021-06-28 16:29:00,National accounts and GDP. Developments for GD...
2,16,National rail networks have different technica...,Railway safety statistics in the EU,"In 2019, 1516 significant railway accidents we...",https://ec.europa.eu/eurostat/statistics-expla...,2021-06-25 18:31:00,Railway safety statistics in the EU. Fall in t...
3,17,The content of this statistical article is bas...,Railway freight transport statistics,This article focuses on recent rail freight tr...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-27 18:19:00,Railway freight transport statistics. Downturn...
4,18,The content of this statistical article is bas...,Railway passenger transport statistics - quart...,This article takes a look at recent annual and...,https://ec.europa.eu/eurostat/statistics-expla...,2021-07-07 10:30:00,Railway passenger transport statistics - quart...
...,...,...,...,...,...,...,...
860,10456,,"Merging statistics and geospatial information,...",This article forms part of Eurostat as statist...,https://ec.europa.eu/eurostat/statistics-expla...,2019-06-05 16:37:00,"Merging statistics and geospatial information,..."
861,10470,,"Merging statistics and geospatial information,...",This article forms part of Eurostat as statist...,https://ec.europa.eu/eurostat/statistics-expla...,2019-06-05 16:35:00,"Merging statistics and geospatial information,..."
862,10506,Eurostat compiles European Union and euro area...,Methods for compiling PEEIs in short-term busi...,Eurostat compiles European Union (EU) and euro...,https://ec.europa.eu/eurostat/statistics-expla...,2018-09-10 15:07:00,Methods for compiling PEEIs in short-term busi...
863,10531,,Building the System of National Accounts - adm...,This article is part of a set of background ar...,https://ec.europa.eu/eurostat/statistics-expla...,2019-11-13 15:48:00,Building the System of National Accounts - adm...


#### Read categories of SE articles from the database


In [7]:
import ast

SQLCommand = """SELECT article_id, categories 
                FROM ESTAT.V1.SE_articles_categories """

categories = pd.read_sql(SQLCommand,c)
categories['categories']=categories['categories'].apply(ast.literal_eval)
categories

Unnamed: 0,article_id,categories
0,7,"[Accidents at work, Health, Health and safety,..."
1,13,"[National accounts (incl. GDP), Statistical ar..."
2,16,"[Rail, Statistical article, Transport, Transpo..."
3,17,"[Freight, Rail, Statistical article, Transport]"
4,18,"[Passengers, Rail, Statistical article, Transp..."
...,...,...
600,9472,"[International trade, Trade in goods, Trade in..."
601,9477,"[Trade in goods, Statistical article]"
602,9479,"[Trade in goods, Statistical article, Internat..."
603,9492,"[Household composition and family situation, L..."


#### Merge with the main file


In [8]:
SE_df = pd.merge(SE_df,categories,left_on='id',right_on='article_id',how='inner')
SE_df.drop(['article_id'],axis=1,inplace=True)
SE_df

Unnamed: 0,id,context,title,abstract,url,last_update,raw content,categories
0,7,"A safe, healthy working environment is a cruci...",Accidents at work statistics,This article presents a set of main statistica...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-26 16:06:00,Accidents at work statistics. Number of accide...,"[Accidents at work, Health, Health and safety,..."
1,13,"European institutions, governments, central ba...",National accounts and GDP,National accounts are the source for a multitu...,https://ec.europa.eu/eurostat/statistics-expla...,2021-06-28 16:29:00,National accounts and GDP. Developments for GD...,"[National accounts (incl. GDP), Statistical ar..."
2,16,National rail networks have different technica...,Railway safety statistics in the EU,"In 2019, 1516 significant railway accidents we...",https://ec.europa.eu/eurostat/statistics-expla...,2021-06-25 18:31:00,Railway safety statistics in the EU. Fall in t...,"[Rail, Statistical article, Transport, Transpo..."
3,17,The content of this statistical article is bas...,Railway freight transport statistics,This article focuses on recent rail freight tr...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-27 18:19:00,Railway freight transport statistics. Downturn...,"[Freight, Rail, Statistical article, Transport]"
4,18,The content of this statistical article is bas...,Railway passenger transport statistics - quart...,This article takes a look at recent annual and...,https://ec.europa.eu/eurostat/statistics-expla...,2021-07-07 10:30:00,Railway passenger transport statistics - quart...,"[Passengers, Rail, Statistical article, Transp..."
...,...,...,...,...,...,...,...,...
600,9472,Trade is an important indicator of Europeas pr...,EU trade in COVID-19 related products,To help prevent the spread of the COVID-19 pan...,https://ec.europa.eu/eurostat/statistics-expla...,2021-03-31 13:04:00,EU trade in COVID-19 related products. Sharp i...,"[International trade, Trade in goods, Trade in..."
601,9477,Trade is an important indicator of Europeas pr...,EU international trade in goods - latest devel...,This article provides a picture of the interna...,https://ec.europa.eu/eurostat/statistics-expla...,2021-07-02 16:55:00,EU international trade in goods - latest devel...,"[Trade in goods, Statistical article]"
602,9479,Trade is an important indicator of Europeas pr...,EU and main world traders,International trade a especially the size and ...,https://ec.europa.eu/eurostat/statistics-expla...,2020-10-07 15:19:00,EU and main world traders. Main world traders:...,"[Trade in goods, Statistical article, Internat..."
603,9492,"In addition to the Labour Force Survey (LFS), ...",Age of young people leaving their parental hou...,Leaving the parental home is considered as a m...,https://ec.europa.eu/eurostat/statistics-expla...,2021-06-30 14:54:00,Age of young people leaving their parental hou...,"[Household composition and family situation, L..."


#### Related links
* From the dat_article_shared_link table with article_division=2 ("Other articles", see mod_article_division table).
* link_id points to id in dat_link_info (where we select resource_information_id=1).
* Apply data cleansing (with an additional step to replace question marks from the related titles).

In [9]:
SQLCommand = """SELECT T1.article_id, T1.link_id, T2.title, T2.url 
                FROM dat_article_shared_link as T1 
                INNER JOIN ESTAT.V1.dat_link_info as T2  
                  ON T1.link_id=T2.id  
                WHERE T1.article_division_id=2 AND T2.resource_information_id=1
                ORDER BY T1.article_id, T1.link_id """

add_related_links = pd.read_sql(SQLCommand,c)
add_related_links['title'] = add_related_links['title'] .apply(clean)
add_related_links['title'] = add_related_links['title'] .apply(lambda x: re.sub(r'\?','-',x))
add_related_links.head(5)

Unnamed: 0,article_id,link_id,title,url
0,7,229,Health in the European Union a facts and figures,https://ec.europa.eu/eurostat/statistics-expla...
1,7,1157,Health statistics introduced,https://ec.europa.eu/eurostat/statistics-expla...
2,7,2914,Accidents and injuries statistics,https://ec.europa.eu/eurostat/statistics-expla...
3,7,2946,Accidents at work - statistics by economic act...,https://ec.europa.eu/eurostat/statistics-expla...
4,7,2947,Accidents at work - statistics on causes and c...,https://ec.europa.eu/eurostat/statistics-expla...


#### Aggregate above by article id

* Aggregate related titles and URLs in one string.

In [10]:
add_related_grouped = pd.DataFrame(add_related_links.groupby(['article_id'])[['title','url']].aggregate(lambda x: list(x)))
add_related_grouped.reset_index(drop=False, inplace=True)
add_related_grouped.rename(columns={'title':'related_titles','url':'related_urls'},inplace=True)
add_related_grouped.head(5)


Unnamed: 0,article_id,related_titles,related_urls
0,7,[Health in the European Union a facts and figu...,[https://ec.europa.eu/eurostat/statistics-expl...
1,13,"[Sector accounts, European system of national ...",[https://ec.europa.eu/eurostat/statistics-expl...
2,16,"[Railway freight transport statistics, Railway...",[https://ec.europa.eu/eurostat/statistics-expl...
3,17,"[Transport statistics at regional level, All a...",[https://ec.europa.eu/eurostat/statistics-expl...
4,18,"[Railway freight transport statistics, Freight...",[https://ec.europa.eu/eurostat/statistics-expl...


#### Merge with the main file

In [11]:
SE_df = pd.merge(SE_df,add_related_grouped,left_on='id',right_on='article_id',how='inner')
SE_df.drop(['article_id'],axis=1,inplace=True)


SE_df

Unnamed: 0,id,context,title,abstract,url,last_update,raw content,categories,related_titles,related_urls
0,7,"A safe, healthy working environment is a cruci...",Accidents at work statistics,This article presents a set of main statistica...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-26 16:06:00,Accidents at work statistics. Number of accide...,"[Accidents at work, Health, Health and safety,...",[Health in the European Union a facts and figu...,[https://ec.europa.eu/eurostat/statistics-expl...
1,13,"European institutions, governments, central ba...",National accounts and GDP,National accounts are the source for a multitu...,https://ec.europa.eu/eurostat/statistics-expla...,2021-06-28 16:29:00,National accounts and GDP. Developments for GD...,"[National accounts (incl. GDP), Statistical ar...","[Sector accounts, European system of national ...",[https://ec.europa.eu/eurostat/statistics-expl...
2,16,National rail networks have different technica...,Railway safety statistics in the EU,"In 2019, 1516 significant railway accidents we...",https://ec.europa.eu/eurostat/statistics-expla...,2021-06-25 18:31:00,Railway safety statistics in the EU. Fall in t...,"[Rail, Statistical article, Transport, Transpo...","[Railway freight transport statistics, Railway...",[https://ec.europa.eu/eurostat/statistics-expl...
3,17,The content of this statistical article is bas...,Railway freight transport statistics,This article focuses on recent rail freight tr...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-27 18:19:00,Railway freight transport statistics. Downturn...,"[Freight, Rail, Statistical article, Transport]","[Transport statistics at regional level, All a...",[https://ec.europa.eu/eurostat/statistics-expl...
4,18,The content of this statistical article is bas...,Railway passenger transport statistics - quart...,This article takes a look at recent annual and...,https://ec.europa.eu/eurostat/statistics-expla...,2021-07-07 10:30:00,Railway passenger transport statistics - quart...,"[Passengers, Rail, Statistical article, Transp...","[Railway freight transport statistics, Freight...",[https://ec.europa.eu/eurostat/statistics-expl...
...,...,...,...,...,...,...,...,...,...,...
587,9472,Trade is an important indicator of Europeas pr...,EU trade in COVID-19 related products,To help prevent the spread of the COVID-19 pan...,https://ec.europa.eu/eurostat/statistics-expla...,2021-03-31 13:04:00,EU trade in COVID-19 related products. Sharp i...,"[International trade, Trade in goods, Trade in...","[International trade in goods, Extra-EU trade ...",[https://ec.europa.eu/eurostat/statistics-expl...
588,9477,Trade is an important indicator of Europeas pr...,EU international trade in goods - latest devel...,This article provides a picture of the interna...,https://ec.europa.eu/eurostat/statistics-expla...,2021-07-02 16:55:00,EU international trade in goods - latest devel...,"[Trade in goods, Statistical article]","[International trade in goods, Extra-EU trade ...",[https://ec.europa.eu/eurostat/statistics-expl...
589,9479,Trade is an important indicator of Europeas pr...,EU and main world traders,International trade a especially the size and ...,https://ec.europa.eu/eurostat/statistics-expla...,2020-10-07 15:19:00,EU and main world traders. Main world traders:...,"[Trade in goods, Statistical article, Internat...","[International trade in goods, Extra-EU trade ...",[https://ec.europa.eu/eurostat/statistics-expl...
590,9492,"In addition to the Labour Force Survey (LFS), ...",Age of young people leaving their parental hou...,Leaving the parental home is considered as a m...,https://ec.europa.eu/eurostat/statistics-expla...,2021-06-30 14:54:00,Age of young people leaving their parental hou...,"[Household composition and family situation, L...","[Labour market, EU labour force survey, Househ...",[https://ec.europa.eu/eurostat/statistics-expl...


* Discard records which have empty strings in any of the columns [titles, abstracts, raw contents].
* Discard records with duplicate titles and/or abstracts and/or raw contents (but *not* context sections which are frequently the same and some are missing)



In [12]:
SE_df = SE_df.replace('', np.nan) 

SE_df = SE_df.dropna(axis=0,subset=['title','abstract','raw content'],how='any')
SE_df = SE_df.drop_duplicates(subset=["title"])
SE_df = SE_df.drop_duplicates(subset=["abstract"])
SE_df = SE_df.drop_duplicates(subset=["raw content"])
SE_df.reset_index(drop=True, inplace=True)

SE_df

Unnamed: 0,id,context,title,abstract,url,last_update,raw content,categories,related_titles,related_urls
0,7,"A safe, healthy working environment is a cruci...",Accidents at work statistics,This article presents a set of main statistica...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-26 16:06:00,Accidents at work statistics. Number of accide...,"[Accidents at work, Health, Health and safety,...",[Health in the European Union a facts and figu...,[https://ec.europa.eu/eurostat/statistics-expl...
1,13,"European institutions, governments, central ba...",National accounts and GDP,National accounts are the source for a multitu...,https://ec.europa.eu/eurostat/statistics-expla...,2021-06-28 16:29:00,National accounts and GDP. Developments for GD...,"[National accounts (incl. GDP), Statistical ar...","[Sector accounts, European system of national ...",[https://ec.europa.eu/eurostat/statistics-expl...
2,16,National rail networks have different technica...,Railway safety statistics in the EU,"In 2019, 1516 significant railway accidents we...",https://ec.europa.eu/eurostat/statistics-expla...,2021-06-25 18:31:00,Railway safety statistics in the EU. Fall in t...,"[Rail, Statistical article, Transport, Transpo...","[Railway freight transport statistics, Railway...",[https://ec.europa.eu/eurostat/statistics-expl...
3,17,The content of this statistical article is bas...,Railway freight transport statistics,This article focuses on recent rail freight tr...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-27 18:19:00,Railway freight transport statistics. Downturn...,"[Freight, Rail, Statistical article, Transport]","[Transport statistics at regional level, All a...",[https://ec.europa.eu/eurostat/statistics-expl...
4,18,The content of this statistical article is bas...,Railway passenger transport statistics - quart...,This article takes a look at recent annual and...,https://ec.europa.eu/eurostat/statistics-expla...,2021-07-07 10:30:00,Railway passenger transport statistics - quart...,"[Passengers, Rail, Statistical article, Transp...","[Railway freight transport statistics, Freight...",[https://ec.europa.eu/eurostat/statistics-expl...
...,...,...,...,...,...,...,...,...,...,...
561,9472,Trade is an important indicator of Europeas pr...,EU trade in COVID-19 related products,To help prevent the spread of the COVID-19 pan...,https://ec.europa.eu/eurostat/statistics-expla...,2021-03-31 13:04:00,EU trade in COVID-19 related products. Sharp i...,"[International trade, Trade in goods, Trade in...","[International trade in goods, Extra-EU trade ...",[https://ec.europa.eu/eurostat/statistics-expl...
562,9477,Trade is an important indicator of Europeas pr...,EU international trade in goods - latest devel...,This article provides a picture of the interna...,https://ec.europa.eu/eurostat/statistics-expla...,2021-07-02 16:55:00,EU international trade in goods - latest devel...,"[Trade in goods, Statistical article]","[International trade in goods, Extra-EU trade ...",[https://ec.europa.eu/eurostat/statistics-expl...
563,9479,Trade is an important indicator of Europeas pr...,EU and main world traders,International trade a especially the size and ...,https://ec.europa.eu/eurostat/statistics-expla...,2020-10-07 15:19:00,EU and main world traders. Main world traders:...,"[Trade in goods, Statistical article, Internat...","[International trade in goods, Extra-EU trade ...",[https://ec.europa.eu/eurostat/statistics-expl...
564,9492,"In addition to the Labour Force Survey (LFS), ...",Age of young people leaving their parental hou...,Leaving the parental home is considered as a m...,https://ec.europa.eu/eurostat/statistics-expla...,2021-06-30 14:54:00,Age of young people leaving their parental hou...,"[Household composition and family situation, L...","[Labour market, EU labour force survey, Househ...",[https://ec.europa.eu/eurostat/statistics-expl...


#### Load Glossary articles from the database

* Definitions from dat_glossary.
* Titles and urls from dat_link_info (with resource_information_id=1, i.e. Eurostat, see ESTAT.V1.mod_resource_information).
* Match above on id.

In [13]:

SQLCommand = """SELECT T1.id, T1.definition, T2.title, T2.url 
                FROM ESTAT.V1.dat_glossary as T1 
                INNER JOIN ESTAT.V1.dat_link_info as T2  
                  ON T1.id=T2.id 
                WHERE T2.resource_information_id=1 """

GL_df = pd.read_sql(SQLCommand,c)
GL_df = GL_df[['id', 'title', 'definition','url']]

GL_df


Unnamed: 0,id,title,definition,url
0,1,Accident at work,An accident at work in the framework ...,https://ec.europa.eu/eurostat/statistics-expla...
1,5,Fatal accident at work,A fatal accident at work refers to an...,https://ec.europa.eu/eurostat/statistics-expla...
2,6,Non-fatal accident at work,A non-fatal accident at work is...,https://ec.europa.eu/eurostat/statistics-expla...
3,8,Aggregate demand,Aggregate demand is the total amount of ...,https://ec.europa.eu/eurostat/statistics-expla...
4,9,Goods and services account,The goods and services account shows ...,https://ec.europa.eu/eurostat/statistics-expla...
...,...,...,...,...
1309,2319,Actual individual consumption (AIC),"Actual individual consumption , abbrevia...",https://ec.europa.eu/eurostat/statistics-expla...
1310,2321,Activity rate,Activity rate is the percentage of a...,https://ec.europa.eu/eurostat/statistics-expla...
1311,2322,Activation policies,The activation policies are policies ...,https://ec.europa.eu/eurostat/statistics-expla...
1312,2324,Active enterprises - FRIBS,"<Brief user-oriented definition, one or a fe...",https://ec.europa.eu/eurostat/statistics-expla...


#### Read categories of SE Glossary articles

In [14]:
import ast
SQLCommand = """SELECT id,article_id,categories 
                FROM GL_articles_categories """

GL_cats = pd.read_sql(SQLCommand,c)
GL_cats['categories']=GL_cats['categories'].apply(ast.literal_eval)

GL_cats

Unnamed: 0,id,article_id,categories
0,0,1,"[Glossary, Health glossary, Labour market glos..."
1,1,5,"[Glossary, Health glossary, Labour market glos..."
2,2,6,"[Glossary, Health glossary, Labour market glos..."
3,3,8,"[Economy and finance glossary, Glossary, Natio..."
4,4,9,"[Economy and finance glossary, Glossary, Natio..."
...,...,...,...
1229,1229,2312,"[Glossary, Transport glossary]"
1230,1230,2319,"[Economy and finance glossary, Glossary, Natio..."
1231,1231,2321,"[Economy and finance glossary, Glossary, Labou..."
1232,1232,2322,"[Economy and finance glossary, Glossary, Labou..."


#### Merge with the main file

In [15]:
GL_df = pd.merge(GL_df,GL_cats,left_on='id',right_on='article_id')
GL_df = GL_df[['title','url','definition','categories']]
GL_df

Unnamed: 0,title,url,definition,categories
0,Accident at work,https://ec.europa.eu/eurostat/statistics-expla...,An accident at work in the framework ...,"[Glossary, Health glossary, Labour market glos..."
1,Fatal accident at work,https://ec.europa.eu/eurostat/statistics-expla...,A fatal accident at work refers to an...,"[Glossary, Health glossary, Labour market glos..."
2,Non-fatal accident at work,https://ec.europa.eu/eurostat/statistics-expla...,A non-fatal accident at work is...,"[Glossary, Health glossary, Labour market glos..."
3,Aggregate demand,https://ec.europa.eu/eurostat/statistics-expla...,Aggregate demand is the total amount of ...,"[Economy and finance glossary, Glossary, Natio..."
4,Goods and services account,https://ec.europa.eu/eurostat/statistics-expla...,The goods and services account shows ...,"[Economy and finance glossary, Glossary, Natio..."
...,...,...,...,...
1229,Airport,https://ec.europa.eu/eurostat/statistics-expla...,An airport is a defined area of land ...,"[Glossary, Transport glossary]"
1230,Actual individual consumption (AIC),https://ec.europa.eu/eurostat/statistics-expla...,"Actual individual consumption , abbrevia...","[Economy and finance glossary, Glossary, Natio..."
1231,Activity rate,https://ec.europa.eu/eurostat/statistics-expla...,Activity rate is the percentage of a...,"[Economy and finance glossary, Glossary, Labou..."
1232,Activation policies,https://ec.europa.eu/eurostat/statistics-expla...,The activation policies are policies ...,"[Economy and finance glossary, Glossary, Labou..."


* Discard records with missing titles and/or URLs and/or definitions and do some data cleansing.
* Drop records with duplicate URLs.
* Discard records with definitions which point to redirections ('Redirect to ...) or are the remnants of deleted articles ('The revision #...').
* Discard duplicates in titles and definitions (which point to the same articles).
* Add the titles to the definitions.

In [16]:
GL_df = GL_df.replace('', np.nan) 
GL_df = GL_df.dropna(axis=0,subset=['title','url','definition'],how='any')

GL_df['title'] = GL_df['title'].apply(clean)
GL_df['title'] = GL_df['title'].apply(lambda x: re.sub(r'\?','-',x)) ## also replace question marks by dashes
GL_df['url'] = GL_df['url'].apply(lambda x: x.strip())
GL_df['definition'] = GL_df['definition'].apply(clean)

GL_df = GL_df.drop_duplicates(subset=["url"])

idx = GL_df[GL_df['definition'].str.startswith('The revision #')].index
GL_df.drop(idx , inplace=True)
idx = GL_df[GL_df['definition'].str.startswith('Redirect to')].index
GL_df.drop(idx , inplace=True)

GL_df = GL_df.drop_duplicates(subset=["title","definition"])

GL_df.reset_index(drop=True, inplace=True)

GL_df['definition'] = GL_df['title'] +'. ' + GL_df['definition']


GL_df

Unnamed: 0,title,url,definition,categories
0,Accident at work,https://ec.europa.eu/eurostat/statistics-expla...,Accident at work. An accident at work in the f...,"[Glossary, Health glossary, Labour market glos..."
1,Fatal accident at work,https://ec.europa.eu/eurostat/statistics-expla...,Fatal accident at work. A fatal accident at wo...,"[Glossary, Health glossary, Labour market glos..."
2,Non-fatal accident at work,https://ec.europa.eu/eurostat/statistics-expla...,Non-fatal accident at work. A non-fatal accide...,"[Glossary, Health glossary, Labour market glos..."
3,Aggregate demand,https://ec.europa.eu/eurostat/statistics-expla...,Aggregate demand. Aggregate demand is the tota...,"[Economy and finance glossary, Glossary, Natio..."
4,Goods and services account,https://ec.europa.eu/eurostat/statistics-expla...,Goods and services account. The goods and serv...,"[Economy and finance glossary, Glossary, Natio..."
...,...,...,...,...
1225,Airport,https://ec.europa.eu/eurostat/statistics-expla...,Airport. An airport is a defined area of land ...,"[Glossary, Transport glossary]"
1226,Actual individual consumption (AIC),https://ec.europa.eu/eurostat/statistics-expla...,Actual individual consumption (AIC). Actual in...,"[Economy and finance glossary, Glossary, Natio..."
1227,Activity rate,https://ec.europa.eu/eurostat/statistics-expla...,Activity rate. Activity rate is the percentage...,"[Economy and finance glossary, Glossary, Labou..."
1228,Activation policies,https://ec.europa.eu/eurostat/statistics-expla...,Activation policies. The activation policies a...,"[Economy and finance glossary, Glossary, Labou..."


#### Categories in SE articles

* Create a dataframe _Categories_SE_ with:
    * the unique categories met in the SE articles in column _category_,
    * their stemmed tokens, without stop-words, in column _category tokens_. 
    * Stemming is carried out with library _nltk_ because it is not available in Spacy. 
    * Drop the category _Statistical article_.


In [17]:
## create the Categories dataframe
import nltk
from nltk.stem.porter import *
stemmer = PorterStemmer()
all_stopwords = nlp.Defaults.stop_words

Categories_SE = pd.DataFrame(np.unique([el for i in range(len(SE_df)) 
                                        for el in SE_df.loc[i,'categories']]),
                                        columns=['category'])
Categories_SE['category tokens'] = Categories_SE['category'].apply(lambda x: 
                                                             [stemmer.stem(w.text.lower()) for w in nlp(str(x)) 
                                                             if not w.is_punct and not w.text.lower() in all_stopwords])                                                       

Categories_SE.drop( Categories_SE[ Categories_SE['category'] == 'Statistical article' ].index, inplace=True)

Categories_SE.reset_index(drop=True, inplace=True)

Categories_SE.to_excel('Categories_SE.xlsx')
Categories_SE


Unnamed: 0,category,category tokens
0,Accidents at work,"[accid, work]"
1,Africa,[africa]
2,Agricultural performance,"[agricultur, perform]"
3,Agriculture,[agricultur]
4,Agriculture statistics by area and region,"[agricultur, statist, area, region]"
...,...,...
191,"Wages, earnings and income","[wage, earn, incom]"
192,Waste,[wast]
193,Water,[water]
194,World trade,"[world, trade]"


#### Categories in SE Glossary articles

* Do the same with the categories found in the SE Glossary articles (omitting the "glossary" in the end), drop the categories _Under construction_ and _Glossary_, create a dataframe _Categories_GL_, and 
* Merge the two dataframes into a _Categories_df_ dataframe dropping duplicates.


In [18]:


Categories_GL = pd.DataFrame(np.unique([el for i in range(len(GL_df)) 
                                        for el in GL_df.loc[i,'categories']]),
                                        columns=['category'])
Categories_GL['category'] = Categories_GL['category'].apply(lambda x: re.sub('glossary$','',x)) 
Categories_GL['category tokens'] = Categories_GL['category'].apply(lambda x: 
                                                             [stemmer.stem(w.text.lower()) for w in nlp(str(x)) 
                                                             if not w.is_punct and not w.text.lower() in all_stopwords])                                                       

idx = Categories_GL[ (Categories_GL['category'] == 'Under construction') | (Categories_GL['category'] == 'Glossary') ].index
Categories_GL.drop(idx , inplace=True)

Categories_GL.reset_index(drop=True, inplace=True)


Categories_GL.to_excel('Categories_GL.xlsx')
Categories_GL

Categories_df = pd.concat([Categories_SE,Categories_GL])
Categories_df.drop_duplicates(subset=["category"],inplace=True)
Categories_df.reset_index(drop=True, inplace=True)
del(Categories_SE, Categories_GL)
Categories_df

Unnamed: 0,category,category tokens
0,Accidents at work,"[accid, work]"
1,Africa,[africa]
2,Agricultural performance,"[agricultur, perform]"
3,Agriculture,[agricultur]
4,Agriculture statistics by area and region,"[agricultur, statist, area, region]"
...,...,...
233,Statistical method,"[statist, method]"
234,Structural business statistics,"[structur, busi, statist]"
235,Survey,[survey]
236,Tourism,[tourism]


### Step 3. An improved version of a Subject-Verb-Object extraction function using Spacy
***

* By Peter de Vocht, see [GitHub code](https://github.com/peter3125/enhanced-subject-verb-object-extraction/blob/master/subject_verb_object_extract.py).
* Function needs some **DESCRIPTION**.


In [19]:
# Copyright 2017 Peter de Vocht
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

#import en_core_web_sm
from collections.abc import Iterable

# use spacy small model
#nlp = en_core_web_sm.load()



##ClearNLP Dependency Labels
## https://github.com/clir/clearnlp-guidelines/blob/master/md/specifications/dependency_labels.md

##https://downloads.cs.stanford.edu/nlp/software/dependencies_manual.pdf

## https://www.mathcs.emory.edu/~choi/doc/cu-2012-choi.pdf

# dependency markers for subjects
SUBJECTS = {"nsubj", "nsubjpass", "csubj", "csubjpass", "agent", "expl"}
## nominal subject, nominal subject passive, clausal subject, clausal subject passive, agent (e.g. killed by the "agent"), 
## expletive - an existential “there”

# dependency markers for objects
OBJECTS = {"dobj", "dative", "attr", "oprd"}
## direct object, dative (indirect object), attr: “to be”, “to seem”, “to appear”, object predicate

# POS tags that will break adjoining items
BREAKER_POS = {"CCONJ", "VERB"}
## coordinating conjunction, verb

# words that are negations
NEGATIONS = {"no", "not", "n't", "never", "none"}


# does dependency set contain any coordinating conjunctions?
def contains_conj(depSet):
    return "and" in depSet or "or" in depSet or "nor" in depSet or \
           "but" in depSet or "yet" in depSet or "so" in depSet or "for" in depSet


# get subs joined by conjunctions
def _get_subs_from_conjunctions(subs):
    more_subs = []
    for sub in subs:
        # rights is a generator
        rights = list(sub.rights)
        rightDeps = {tok.lower_ for tok in rights} 
        if contains_conj(rightDeps):
            more_subs.extend([tok for tok in rights if tok.dep_ in SUBJECTS or tok.pos_ == "NOUN"])
            if len(more_subs) > 0:
                more_subs.extend(_get_subs_from_conjunctions(more_subs))
    return more_subs


# get objects joined by conjunctions
def _get_objs_from_conjunctions(objs):
    more_objs = []
    for obj in objs:
        # rights is a generator
        rights = list(obj.rights)
        rightDeps = {tok.lower_ for tok in rights}
        if contains_conj(rightDeps):
            more_objs.extend([tok for tok in rights if tok.dep_ in OBJECTS or tok.pos_ == "NOUN"])
            if len(more_objs) > 0:
                more_objs.extend(_get_objs_from_conjunctions(more_objs))
    return more_objs


# find sub dependencies
def _find_subs(tok):
    head = tok.head
    while head.pos_ != "VERB" and head.pos_ != "NOUN" and head.head != head:
        head = head.head
    if head.pos_ == "VERB":
        subs = [tok for tok in head.lefts if tok.dep_ == "SUB"] ## !!! CHANGE: not stop-words ?
        if len(subs) > 0:
            verb_negated = _is_negated(head)
            subs.extend(_get_subs_from_conjunctions(subs))
            return subs, verb_negated
        elif head.head != head:
            return _find_subs(head)
    elif head.pos_ == "NOUN":
        return [head], _is_negated(tok)
    return [], False


# is the tok set's left or right negated?
def _is_negated(tok):
    parts = list(tok.lefts) + list(tok.rights)
    for dep in parts:
        if dep.lower_ in NEGATIONS:
            return True
    return False


# get all the verbs on tokens with negation marker
def _find_svs(tokens):
    svs = []
    verbs = [tok for tok in tokens if tok.pos_ == "VERB"]
    for v in verbs:
        subs, verbNegated = _get_all_subs(v)
        if len(subs) > 0:
            for sub in subs:
                svs.append((sub.orth_, "!" + v.orth_ if verbNegated else v.orth_))
    return svs


# get grammatical objects for a given set of dependencies (including passive sentences)
def _get_objs_from_prepositions(deps, is_pas):
    objs = []
    for dep in deps:
        if dep.pos_ == "ADP" and (dep.dep_ == "prep" or (is_pas and dep.dep_ == "agent")):
            objs.extend([tok for tok in dep.rights if tok.dep_  in OBJECTS or
                         (tok.pos_ == "PRON" and tok.lower_ == "me") or
                         (is_pas and tok.dep_ == 'pobj')])
    return objs


# get objects from the dependencies using the attribute dependency
def _get_objs_from_attrs(deps, is_pas):
    for dep in deps:
        if dep.pos_ == "NOUN" and dep.dep_ == "attr":
            verbs = [tok for tok in dep.rights if tok.pos_ == "VERB"]
            if len(verbs) > 0:
                for v in verbs:
                    rights = list(v.rights)
                    objs = [tok for tok in rights if tok.dep_ in OBJECTS]
                    objs.extend(_get_objs_from_prepositions(rights, is_pas))
                    if len(objs) > 0:
                        return v, objs
    return None, None


# xcomp; open complement - verb has no suject
def _get_obj_from_xcomp(deps, is_pas):
    for dep in deps:
        if dep.pos_ == "VERB" and dep.dep_ == "xcomp":
            v = dep
            rights = list(v.rights)
            objs = [tok for tok in rights if tok.dep_ in OBJECTS]
            objs.extend(_get_objs_from_prepositions(rights, is_pas))
            if len(objs) > 0:
                return v, objs
    return None, None


# get all functional subjects adjacent to the verb passed in
def _get_all_subs(v):
    verb_negated = _is_negated(v)
    ## !!! CHANGE: exclude stop-words ?
    subs = [tok for tok in v.lefts if tok.dep_ in SUBJECTS and tok.pos_ != "DET"]
    if len(subs) > 0:
        subs.extend(_get_subs_from_conjunctions(subs))
    else:
        foundSubs, verb_negated = _find_subs(v)
        subs.extend(foundSubs)
    return subs, verb_negated


# find the main verb - or any aux verb if we can't find it
## !!! CHANGE: exclude stop-words ?
def _find_verbs(tokens):
    verbs = [tok for tok in tokens if _is_non_aux_verb(tok)] ### !!!
    if len(verbs) == 0:
        verbs = [tok for tok in tokens if _is_verb(tok)] ### !!!
    
    return verbs


# is the token a verb?  (excluding auxiliary verbs)
def _is_non_aux_verb(tok):
    return tok.pos_ == "VERB" and (tok.dep_ != "aux" and tok.dep_ != "auxpass")


# is the token a verb?  (excluding auxiliary verbs)
def _is_verb(tok):
    return tok.pos_ == "VERB" or tok.pos_ == "AUX"


# return the verb to the right of this verb in a CCONJ relationship if applicable
# returns a tuple, first part True|False and second part the modified verb if True
def _right_of_verb_is_conj_verb(v):
    # rights is a generator
    rights = list(v.rights)

    # VERB CCONJ VERB (e.g. he beat and hurt me)
    if len(rights) > 1 and rights[0].pos_ == 'CCONJ':
        for tok in rights[1:]:
            if _is_non_aux_verb(tok):
                return True, tok

    return False, v


# get all objects for an active/passive sentence
def _get_all_objs(v, is_pas):
    # rights is a generator
    rights = list(v.rights)

    objs = [tok for tok in rights if tok.dep_ in OBJECTS or (is_pas and tok.dep_ == 'pobj')]
    objs.extend(_get_objs_from_prepositions(rights, is_pas))

    #potentialNewVerb, potentialNewObjs = _get_objs_from_attrs(rights)
    #if potentialNewVerb is not None and potentialNewObjs is not None and len(potentialNewObjs) > 0:
    #    objs.extend(potentialNewObjs)
    #    v = potentialNewVerb

    potential_new_verb, potential_new_objs = _get_obj_from_xcomp(rights, is_pas)
    if potential_new_verb is not None and potential_new_objs is not None and len(potential_new_objs) > 0:
        objs.extend(potential_new_objs)
        v = potential_new_verb
    if len(objs) > 0:
        objs.extend(_get_objs_from_conjunctions(objs))
    return v, objs


# return true if the sentence is passive - at the moment a sentence is assumed passive if 
# it has an auxpass (auxiliary passive) verb
def _is_passive(tokens):
    for tok in tokens:
        if tok.dep_ == "auxpass":
            return True
    return False


# resolve a 'that' where/if appropriate
def _get_that_resolution(toks):
    for tok in toks:
        if 'that' in [t.orth_ for t in tok.lefts]:
            return tok.head
    return None


# simple stemmer using lemmas
def _get_lemma(word: str):
    tokens = nlp(word)
    if len(tokens) == 1:
        return tokens[0].lemma_
    return word


# print information for displaying all kinds of things of the parse tree
def printDeps(toks):
    for tok in toks:
        print(tok.orth_, tok.dep_, tok.pos_, tok.head.orth_, [t.orth_ for t in tok.lefts], [t.orth_ for t in tok.rights])


# expand an obj / subj np using its chunk
def expand(item, tokens, visited):
    if item.lower_ == 'that':
        temp_item = _get_that_resolution(tokens)
        if temp_item is not None:
            item = temp_item

    parts = []

    if hasattr(item, 'lefts'):
        for part in item.lefts:
            if part.pos_ in BREAKER_POS:
                break
            if not part.lower_ in NEGATIONS:
                parts.append(part)

    parts.append(item)

    if hasattr(item, 'rights'):
        for part in item.rights:
            if part.pos_ in BREAKER_POS:
                break
            if not part.lower_ in NEGATIONS:
                parts.append(part)

    if hasattr(parts[-1], 'rights'):
        for item2 in parts[-1].rights:
            if item2.pos_ == "DET" or item2.pos_ == "NOUN":
                if item2.i not in visited:
                    visited.add(item2.i)
                    parts.extend(expand(item2, tokens, visited))
            break

    return parts


# convert a list of tokens to a string
def to_str(tokens):
    if isinstance(tokens, Iterable):
        return ' '.join([item.text for item in tokens])
    else:
        return ''


# find verbs and their subjects / objects to create SVOs, detect passive/active sentences
def findSVOs(tokens):
    svos = []
    is_pas = _is_passive(tokens) ## is an "auxpass" verb contained in the tokens?
    verbs = _find_verbs(tokens) ## get the main verbs (or aux verbs if none) 
    visited = set()  # recursion detection
    for v in verbs:
        subs, verbNegated = _get_all_subs(v)
        # hopefully there are subs, if not, don't examine this verb any longer
        if len(subs) > 0:
            isConjVerb, conjV = _right_of_verb_is_conj_verb(v)
            if isConjVerb:
                v2, objs = _get_all_objs(conjV, is_pas)
                for sub in subs:
                    for obj in objs:
                        objNegated = _is_negated(obj)
                        if is_pas:  # reverse object / subject for passive
                            svos.append((to_str(expand(obj, tokens, visited)),
                                         "!" + v.lemma_ if verbNegated or objNegated else v.lemma_, to_str(expand(sub, tokens, visited))))
                            svos.append((to_str(expand(obj, tokens, visited)),
                                         "!" + v2.lemma_ if verbNegated or objNegated else v2.lemma_, to_str(expand(sub, tokens, visited))))
                        else:
                            svos.append((to_str(expand(sub, tokens, visited)),
                                         "!" + v.lower_ if verbNegated or objNegated else v.lower_, to_str(expand(obj, tokens, visited))))
                            svos.append((to_str(expand(sub, tokens, visited)),
                                         "!" + v2.lower_ if verbNegated or objNegated else v2.lower_, to_str(expand(obj, tokens, visited))))
            else:
                v, objs = _get_all_objs(v, is_pas)
                for sub in subs:
                    if len(objs) > 0:
                        for obj in objs:
                            objNegated = _is_negated(obj)
                            if is_pas:  # reverse object / subject for passive
                                svos.append((to_str(expand(obj, tokens, visited)),
                                             "!" + v.lemma_ if verbNegated or objNegated else v.lemma_, to_str(expand(sub, tokens, visited))))
                            else:
                                svos.append((to_str(expand(sub, tokens, visited)),
                                             "!" + v.lower_ if verbNegated or objNegated else v.lower_, to_str(expand(obj, tokens, visited))))
                    else:
                        # no obj - just return the SV parts
                        ## !!! CHANGE: return 'Object:None' as object
                        svos.append((to_str(expand(sub, tokens, visited)),
                                     "!" + v.lower_ if verbNegated else v.lower_,'Object:None'))
                        #print('just return SV: ',(to_str(expand(sub, tokens, visited)),
                        #             "!" + v.lower_ if verbNegated else v.lower_,))                              
    
    return svos

### Step 4. Apply the SVO function to the various texts and find tuples relevant to Named Entities and Categories 
***


In each dataframe (SE_df and GL_df), create column **NER** which will hold dictionaries with the entities recognized as: 
 * Companies, agencies, institutions, etc. (code ORG), 
 * Countries, cities, states (code GPE), 
 * Nationalities or religious or political groups (code NORP), 
 * Non-GPE locations, mountain ranges, bodies of water (code LOCATION). 
 * Buildings, airports, highways, bridges, etc. (code FACILITY),
 * Named hurricanes, battles, wars, sports events, etc. (code EVENT),
 * Named documents made into laws (code LAW),
 * Any named language (code LANGUAGE),
 * People, including fictional (code PERSON).

In column **NER** in a record, the key is the entity and the values are:
* a list with the tuples of the occurences of the entity (token span's *start* index position, token span's *stop* index position), 
* a list of the corresponding (coded) sources, and 
* the count of occurences in the content of the text processed.

In each dataframe, we also create column **NER_SVOs** which will hold dictionaries with SVOs involving the above entities. In each dictionary in column NER_SVOs in a record, the key is the entity and the values are:
* a list with the SVO tuples, 
* a list with the corresponding coded sources, 
* three lists with the titles, URLs and sentences  where the corresponding SVOs were found (for debugging purposes), 
* the count of occurences in the content of the text processed.

Column **NER_SVOs** will also store keys of the form **"Cat:category_name"** corresponding to the **Categories**, with values **the lists of SVOs whose tokenized and stemmed terms have an [overlap coefficient](https://en.wikipedia.org/wiki/Overlap_coefficient) with some category's stemmed tokens** $\ge$ 0.4. This value was found after experimentation. 

Finally, we also create a separate dictionary **Glob_NER_SVOs** gathering the above SVOs information from all texts.



In [20]:
SE_df['NER'] = [dict() for i in range(len(SE_df))]
SE_df['NER_SVOs'] = [dict() for i in range(len(SE_df))]
GL_df['NER'] = [dict() for i in range(len(GL_df))]
GL_df['NER_SVOs'] = [dict() for i in range(len(GL_df))]
Glob_NER_SVOs = dict() ## a separate dictionary holding all SVOs from all articles
Cat_threshold = 0.4

In [21]:
def Overlap(lst1, lst2):
    return len(set(lst1).intersection(lst2))/min(len(set(lst1)),len(set(lst1)))

def process_texts(dat,source,column):

    nlp.max_length = 1500000
    
    for i in range(len(dat)):
        if (i+1) % 100 == 0: print('article i = ',i+1,' of ',len(dat))
        if all(dat.loc[i,[column]].isna()): continue    
        doc = nlp(dat.loc[i,column]) ## pre-process text
        url = dat.loc[i,'url']

        sents = doc.sents ## segment into sentences
        sents_list = [sent for sent in doc.sents]
        num_sents = len(sents_list)
        if num_sents ==0: 
            print(sents_list)
            raise Exception("Error A!") 

        for (j,sent) in enumerate(sents_list): ## Loop A over sentences #column 8
            #----------------------------------------------------------
            doc_sent = nlp(sent.text) ## pre-process sentence # column 12
            
            entities = doc_sent.ents ## general entities in sentence       
            selected_ents=[]
            if len(entities) > 0: ## otherwise proceed with SVOs vs. the categories only
                for ent in entities: ## just a check to verify the span of each entity IN THE SENTENCE
                    if ent.text != doc_sent.text[ent.start_char: ent.end_char]:
                        raise Exception("Error B!")             
            
                ## continue with selected named entities if any
                selected_ents = [ent for ent in entities if ent.label_ in ['ORG','GPE','NORP','LOCATION','FACILITY','EVENT','LAW','LANGUAGE','PERSON']] ## selected  entities
                ## cut +8.3, -17.4, 31353, etc.
                selected_ents = [ent for ent in selected_ents if not re.search(r'^[\d\+\-\.\,%\-]+$',ent.text) ] 


            svos = findSVOs(doc_sent) 
            for sv in svos: ## loop B1 over SVOs in sentence
            #--------------------------------------------------------------   
                if sv[-1] == 'Object:None': 
                    continue
                if '-' in sv or '%' in sv: 
                    continue
                if any([x.startswith('Figure') or x.startswith('Table') for x in sv]):
                    continue
                if any([re.search(r'(\d|\.|\+|\-)+',x) for x in sv]):
                    continue
                if sum([1 for x in sv if x.lower() in all_stopwords])>=1:
                    continue
                    
                ## open a parenthesis and then a number
                sv = tuple(re.sub(r'(\(|\))$','',x) for x in sv)    
                sv = tuple(re.sub(r'(\(|\))$','',x) for x in sv)  
                #print(sv)
                    
                for s in sv: ## loop C1 over each SVO # column 16
                #------------------------------------------------    
                    #print('searching in: ',s)
                    for e in selected_ents: ## loop D1 over each selected entity in an SVO # column 20
                    #----------------------    
                        #print('searching for ',e.text)
                        if s.find(e.text) != -1:
                            #print(sv,' : found ',e.text)
                            key = e.text.upper()
                            if key in dat.loc[i,'NER'].keys():
                                dat.loc[i,'NER'][key][0].append((e.start,e.end)) 
                                dat.loc[i,'NER'][key][1].append(source) 
                                dat.loc[i,'NER'][key][2] += 1 
                            else:    
                                dat.loc[i,'NER'][key] = [[(e.start,e.end)],[source],1]
                        
                            if key in dat.loc[i,'NER_SVOs'].keys():
                                if sv not in dat.loc[i,'NER_SVOs'][key][0]:
                                    dat.loc[i,'NER_SVOs'][key][0].append(sv) 
                                    dat.loc[i,'NER_SVOs'][key][1].append(source) 
                                    dat.loc[i,'NER_SVOs'][key][2] += 1 
                            else:    
                                dat.loc[i,'NER_SVOs'][key] = [[sv],[source],1] 
                        
                            ## global dictionary - avoid duplicates
                            key = e.text.upper()
                            if key in Glob_NER_SVOs.keys():
                                if sv not in Glob_NER_SVOs[key][0]:
                                    Glob_NER_SVOs[key][0].append(sv) 
                                    Glob_NER_SVOs[key][1].append(source)
                                    Glob_NER_SVOs[key][2].append(dat.loc[i,'title'])
                                    Glob_NER_SVOs[key][3].append(dat.loc[i,'url'])
                                    Glob_NER_SVOs[key][4].append(sent.text)
                                    Glob_NER_SVOs[key][5] += 1     
                            else:    
                                Glob_NER_SVOs[key] = [[sv],[source],[dat.loc[i,'title']],[dat.loc[i,'url']],[sent.text],1] 
                
            
                ## Continue loop C1 over each SVO # column 16, now with the Categories
                sj = ' '.join(sv)
                doc_sj = nlp(sj)
                sj = [w.text.lower() for w in doc_sj if not w.is_punct]
                sj = [w for w in sj if not w in all_stopwords]
                sj = [stemmer.stem(w) for w in sj]
                
                # sj = [stemmer.stem(w.text.lower()) for w in doc_sj if not w.is_punct and not w.text.lower() in all_stopwords]
                if len(sj) == 0: continue
                #print('\n',sv_copy)
                ##print('sj = ',sj)
                for m in range(len(Categories_df)): ## loop C2 over categories vs an SVO in a sentence
                #-----------------------------------------------------------------------------------    
                    ##print('cats:',categories_df.loc[m,'Category tokens'])
                    try:
                        overlap = Overlap(sj,Categories_df.loc[m,'category tokens'])
                    except:
                        print('sj = ',sj)
                        print('m=',m)
                        print('cats:',Categories_df.loc[m,'category tokens'])
                        raise
                    if overlap >= Cat_threshold:
                        ##print('sj = ',sj)
                        ##print(categories_df.loc[m,'Category'])
                        key = 'Cat:'+Categories_df.loc[m,'category'].upper()
                        if key in dat.loc[i,'NER_SVOs'].keys():
                            if sv not in dat.loc[i,'NER_SVOs'][key][0]:
                                dat.loc[i,'NER_SVOs'][key][0].append(sv) 
                                dat.loc[i,'NER_SVOs'][key][1].append(source) 
                                dat.loc[i,'NER_SVOs'][key][2] +=1
                        else:
                            dat.loc[i,'NER_SVOs'][key] = [[sv],[source],1]
                            
                        ## global dictionary 
                        if key in Glob_NER_SVOs.keys():
                            if sv not in Glob_NER_SVOs[key][0]:
                                Glob_NER_SVOs[key][0].append(sv) 
                                Glob_NER_SVOs[key][1].append(source) 
                                Glob_NER_SVOs[key][2].append(dat.loc[i,'title'])
                                Glob_NER_SVOs[key][3].append(dat.loc[i,'url'])
                                Glob_NER_SVOs[key][4].append(sent.text)                                
                                Glob_NER_SVOs[key][5] += 1                             
                        else:
                            Glob_NER_SVOs[key] = [[sv],[source],[dat.loc[i,'title']],[dat.loc[i,'url']],[sent.text],1]                             
                       
    return dat                              
 
                
                
                
                




#PERSON People, including fictional
#NORP Nationalities or religious or political groups
#FACILITY Buildings, airports, highways, bridges, etc.
#ORGANIZATION Companies, agencies, institutions, etc.
#GPE Countries, cities, states
#LOCATION Non-GPE locations, mountain ranges, bodies of water
#PRODUCT Vehicles, weapons, foods, etc. (Not services)
#EVENT Named hurricanes, battles, wars, sports events, etc.
#WORK OF ART Titles of books, songs, etc.
#LAW Named documents made into laws 
#LANGUAGE Any named language
#The following values are also annotated in a style similar to names:
#DATE Absolute or relative dates or periods
#TIME Times smaller than a day
#PERCENT Percentage (including “%”)
#MONEY Monetary values, including unit
#QUANTITY Measurements, as of weight or distance
#ORDINAL “first”, “second”
#CARDINAL Numerals that do not fall under another typ



### Step 5. Apply this  procedure to the various texts
***

* Update column NER in both dataframes.
* Update column NER_SVOs in both dataframes. 
* Update the separate global dictionary Glob_NER_SVOs.

#### SE articles titles.

In [22]:
SE_df = process_texts(SE_df,'SE title','title')

article i =  100  of  566
article i =  200  of  566
article i =  300  of  566
article i =  400  of  566
article i =  500  of  566


#### SE articles abstracts.

In [23]:

SE_df = process_texts(SE_df,'SE abstract','abstract')

article i =  100  of  566
article i =  200  of  566
article i =  300  of  566
article i =  400  of  566
article i =  500  of  566


#### SE articles context sections.

In [24]:
SE_df = process_texts(SE_df,'SE context','context')

article i =  100  of  566
article i =  200  of  566
article i =  300  of  566
article i =  400  of  566
article i =  500  of  566


#### SE articles full contents.

In [25]:

SE_df = process_texts(SE_df,'SE content','raw content')
              


article i =  100  of  566
article i =  200  of  566
article i =  300  of  566
article i =  400  of  566
article i =  500  of  566


#### SE Glossary articles titles.

In [26]:
GL_df = process_texts(GL_df,'GL title','title')

article i =  100  of  1230
article i =  200  of  1230
article i =  300  of  1230
article i =  400  of  1230
article i =  500  of  1230
article i =  600  of  1230
article i =  700  of  1230
article i =  800  of  1230
article i =  900  of  1230
article i =  1000  of  1230
article i =  1100  of  1230
article i =  1200  of  1230


#### SE Glossary articles definitions.

In [27]:
GL_df = process_texts(GL_df,'GL definition','definition')

article i =  100  of  1230
article i =  200  of  1230
article i =  300  of  1230
article i =  400  of  1230
article i =  500  of  1230
article i =  600  of  1230
article i =  700  of  1230
article i =  800  of  1230
article i =  900  of  1230
article i =  1000  of  1230
article i =  1100  of  1230
article i =  1200  of  1230


### Step 6. Exporting the dataframes to Excel
***

This is also useful for the manual inspection and the design of rules for the fine-tuning of the NER engine and the SVO extraction. This output can then directly be imported in the database.


In [28]:
import datetime
current_time = datetime.datetime.now() 
outfile1 = 'SE_SVOs_'+str(current_time.month)+ '_' + str(current_time.day) + '_' + str(current_time.hour)+ '_' + str(current_time.minute)  +'.xlsx'
outfile2 = 'GL_SVOs_'+str(current_time.month)+ '_' + str(current_time.day) + '_' + str(current_time.hour)+ '_' + str(current_time.minute)  +'.xlsx'
#SE_df.to_excel(outfile1)
#GL_df.to_excel(outfile2)

SE_df.to_excel('SE_df.xlsx')
GL_df.to_excel('GL_df.xlsx')


### Step 7. Checking the dictionary with all SVOs collected
***
* And write all SVOs to both Excel and text files. The files include all information useful for debugging.

In [29]:
import unidecode
#import pickle


import datetime

def file_name(pre,ext):
    current_time = datetime.datetime.now() 
    return pre + '_'+ str(current_time.month)+ '_' + str(current_time.day) + \
                 '_' + str(current_time.hour)+ '_' + str(current_time.minute)  +'.'+ext
    
outfile3 = file_name('SVOs_all','txt')
outfile3b = file_name('SVOs_all','pkl')
outfile3c = file_name('SVOs_all','xlsx')

#with open(outfile3b, 'wb') as file:
#        pickle.dump(Glob_NER_SVOs, file, pickle.HIGHEST_PROTOCOL)

Glob_NER_SVOs_2 = {k:v for k,v in sorted(Glob_NER_SVOs.items(), key=lambda item: item[0])}

results = pd.DataFrame(index=range(len(Glob_NER_SVOs.items())),columns=['Key','Source','Subject','Verb','Object','Title','URL','Sentence'])
c = -1
with open(outfile3, 'w') as file:
    for key in Glob_NER_SVOs_2.keys():
            print('<'+key+'>',end=' ')
            number = Glob_NER_SVOs_2[key][5]
            print(number, ' entries')
            #Glob_NER_SVOs[key][0].append(sv) 
            #Glob_NER_SVOs[key][1].append(source) 
            #Glob_NER_SVOs[key][2].append(dat.loc[i,'title'])
            #Glob_NER_SVOs[key][3].append(dat.loc[i,'url'])
            #Glob_NER_SVOs[key][4].append(sent.text)                                
            #Glob_NER_SVOs[key][5] += 1    
            phrases, sources, titles, urls, sentences = Glob_NER_SVOs_2[key][0:5]
            for (i,(phrase,source,title,url,sentence)) in enumerate(zip(phrases,sources,titles,urls,sentences)):
                s = unidecode.unidecode(str(phrase))
                s0 = unidecode.unidecode(phrase[0])
                s1 = unidecode.unidecode(phrase[1])
                s2 = unidecode.unidecode(phrase[2])
                st = unidecode.unidecode(title)
                surl=url
                ss = unidecode.unidecode(sentence)
                ## print('{0:70s} / {1:30s} {2:5d}: {3:s}\n'.format(key,source,i,s))
                ##print(ss)
                file.write('{0:70s} / {1:30s} {2:5d}: {3:30s} {4:30s} {5:30s} {6:s} {7:s}\n'.format(unidecode.unidecode(key),source,i,s0,s1,s2,st,ss))
                #file.write('{0:40s} / {1:16s} {2:4d}: {3:s}\n'.format(unidecode.unidecode(key),source,i,s))
                c +=1
                results.loc[c,'Key'] = format(unidecode.unidecode(key))
                results.loc[c,'Source'] = source
                results.loc[c,'Subject'] = s0
                results.loc[c,'Verb'] = s1
                results.loc[c,'Object'] = s2
                results.loc[c,'Title'] = st
                results.loc[c,'URL'] = surl
                results.loc[c,'Sentence'] = ss

#results.to_excel(outfile3c)
results.to_excel('SVOs_all.xlsx')

<A COUNCIL RECOMMENDATION> 3  entries
<A EUROPEAN AGENDA ON> 1  entries
<A EUROPEAN ASYLUM SUPPORT OFFICE (> 1  entries
<A EUROPEAN COMMISSION COMMUNICATION> 2  entries
<A EUROPEAN VACANCY MONITOR> 1  entries
<AAA> 1  entries
<ACCOMMODATIONA> 1  entries
<ACER> 1  entries
<AEA> 7  entries
<AEROTHERMAL> 1  entries
<AES> 4  entries
<AEU> 6  entries
<AEXCESS> 1  entries
<AEXCESS MORTALITYA> 2  entries
<AF> 1  entries
<AFGHAN> 3  entries
<AFGHANS> 1  entries
<AFRICAN> 6  entries
<AGEING WORKING GROUP> 1  entries
<AGRICULTURAL> 1  entries
<AGRICULTURE> 1  entries
<AIRBNB> 1  entries
<ALASKIE> 1  entries
<ALBANIA> 12  entries
<ALBANIAN> 3  entries
<ALBANIANS> 2  entries
<ALENTEJO> 2  entries
<ALGARVE> 1  entries
<ALGECIRAS> 1  entries
<ALGERIA> 12  entries
<ALGERIAN> 2  entries
<ALPS> 1  entries
<ALZHEIMERAS> 5  entries
<AMANUFACTURINGA> 1  entries
<AMEATA> 2  entries
<AMENTAL> 1  entries
<AMSTERDAM> 5  entries
<AMSTERDAM SCHIPHOL> 2  entries
<AN ACTION PLAN> 1  entries
<ANALYTICAL CRM> 1  en

<Cat:LABOUR MARKET > 34  entries
<Cat:LABOUR MARKET - YOUNG PEOPLE> 156  entries
<Cat:LABOUR MARKET BACKGROUND ARTICLES> 44  entries
<Cat:LABOUR MARKET STATISTICS BY AREA AND REGION> 65  entries
<Cat:LABOUR MOBILITY> 2  entries
<Cat:LANGUAGE LEARNING> 15  entries
<Cat:LIFELONG LEARNING> 5  entries
<Cat:LIVING CONDITIONS> 17  entries
<Cat:LIVING CONDITIONS > 17  entries
<Cat:LIVING CONDITIONS - YOUNG PEOPLE> 203  entries
<Cat:MATERIAL FLOWS> 8  entries
<Cat:MIGRANT INTEGRATION> 2  entries
<Cat:MIGRATION AND MIGRANT POPULATION> 5  entries
<Cat:MONETARY AND OTHER FINANCIAL STATISTICS > 3  entries
<Cat:MORTALITY AND LIFE EXPECTANCY> 13  entries
<Cat:NATIONAL ACCOUNTS > 34  entries
<Cat:NATIONAL ACCOUNTS (INCL. GDP)> 34  entries
<Cat:NATURAL GAS> 13  entries
<Cat:NON-EU COUNTRIES> 44  entries
<Cat:PARTICIPATION IN CULTURE> 6  entries
<Cat:PARTICIPATION IN EDUCATION AND TRAINING> 10  entries
<Cat:POPULATION> 1  entries
<Cat:POPULATION > 1  entries
<Cat:POPULATION AGEING> 36  entries
<Cat:POP

<LAU> 1  entries
<LCI> 2  entries
<LCS> 2  entries
<LEBANON> 8  entries
<LEYEN> 3  entries
<LFS> 7  entries
<LIBYA> 2  entries
<LIECHTENSTEIN> 4  entries
<LIETUVA> 1  entries
<LIGURIA> 3  entries
<LISBON> 3  entries
<LISBON STRATEGY> 1  entries
<LITHIANIA> 1  entries
<LITHUANIA> 90  entries
<LITHUANIAN> 1  entries
<LITHUANIANS> 1  entries
<LIVESTOCK> 1  entries
<LKIS> 1  entries
<LMP> 3  entries
<LOMBARDIA> 4  entries
<LOVEHOMESWAP> 1  entries
<LPG> 1  entries
<LSU> 3  entries
<LTR> 2  entries
<LU> 2  entries
<LUCAS> 6  entries
<LUXEMBOURG> 145  entries
<LUXEMBOURGAS> 2  entries
<MAASTRICHT> 5  entries
<MACINTOSH> 1  entries
<MADEIRA> 3  entries
<MALTA> 109  entries
<MALTESE> 3  entries
<MAYOTTE> 5  entries
<MAZOWIECKI> 1  entries
<MBT> 1  entries
<MEDITERRANEAN> 1  entries
<MEHM> 1  entries
<MELILLA> 1  entries
<MEMBER STATES> 1  entries
<MERCOSUR> 1  entries
<MESSINA> 1  entries
<MEXICO> 2  entries
<MFN> 1  entries
<MIP> 2  entries
<MMR> 2  entries
<MMTCDE> 1  entries
<MNC> 1  entrie

<THE UNITED NATIONS> 10  entries
<THE UNITED STATES> 28  entries
<THE VOLKSWAGEN GROUP> 1  entries
<THE WHITE PAPER> 1  entries
<THE WORLD BANK> 5  entries
<THE WORLD HEALTH ORGANISATION> 10  entries
<THE WORLD HEALTH ORGANIZATION> 3  entries
<THE WORLD WIDE WEB> 1  entries
<THE ZARAGOZA DECLARATION> 1  entries
<THEDEFINITION> 1  entries
<TIME> 2  entries
<TINOS> 1  entries
<TISA> 2  entries
<TJ> 1  entries
<TMT> 1  entries
<TOMATO> 1  entries
<TOURISM> 2  entries
<TRAILER> 1  entries
<TRANSFERSA> 1  entries
<TRANSPARENCY INTERNATIONAL> 1  entries
<TREATY> 1  entries
<TREE> 1  entries
<TREES> 1  entries
<TSC> 1  entries
<TUNISIA> 6  entries
<TURKEY> 89  entries
<TURKISH> 1  entries
<UAA> 6  entries
<UCI> 1  entries
<UK> 4  entries
<UKRAINE> 10  entries
<UKRAINEAS> 1  entries
<UKRAINIAN> 5  entries
<UKRAINIANS> 3  entries
<ULTIMATE> 1  entries
<UN> 23  entries
<UNAS> 1  entries
<UNECE> 1  entries
<UNESCO> 2  entries
<UNESCO S&T> 1  entries
<UNFCCC> 1  entries
<UNGA> 3  entries
<UNION> 2

* Verify the information written to the file.

In [30]:
%%script false --no-raise-error

with open(outfile3c, 'r') as f:
    count = 0
 
    while True:
        line = f.readline()
        if not line:
            break
        print(line)
 
