# Statistics Explained articles and OECD's Glossary articles: common noun-phrases

### Revised (January 2022) to read all data from the database

### Objective: to create a common vocabulary for the labelling of both sources
### This common vocabulary is being used in the Power BI application in Use Case B

### Installation instructions
*    For the setup of the Virtuoso ODBC data source please see section 1a in https://github.com/eurostat/NLP4Stat/tree/testing/Software%20Environment
*    Download the notebook as a "raw" file and save it with the extension .ipynb (remove the .txt extension that is added automatically).
*    Install the necessary libraries from your Jupyter command prompt. The libraries and the versions used are:     
    *    pyodbc==4.0.32
    *    pandas==1.3.5
    *    nltk==3.6.5
*   Launch the notebook and enter your own credentials for the Virtuoso database in the call to pyodbc.connect() in the step "Connect to the database".
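
A quick way to confirm that the installed versions match the ones listed above is to query the package metadata. The following is a minimal sketch using only the standard library (Python 3.8 or later); the helper name `check_versions` is an illustrative assumption and is not part of the notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed in the installation instructions above.
EXPECTED = {"pyodbc": "4.0.32", "pandas": "1.3.5", "nltk": "3.6.5"}


def check_versions(expected):
    """Return {package: (installed_version_or_None, matches_expected)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed in this environment
        report[name] = (installed, installed == wanted)
    return report


for name, (installed, ok) in check_versions(EXPECTED).items():
    print(f"{name}: installed={installed}, matches={ok}")
```

A mismatch does not necessarily break the notebook; the listed versions are simply the ones it was run with.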

In [29]:
import re
import pandas as pd
import sys


In [30]:
from datetime import datetime

def file_name(pre,ext):
    current_time = datetime.now() 
    return pre + '_'+ str(current_time.month)+ '_' + str(current_time.day) + \
                 '_' + str(current_time.hour)+ '_' + str(current_time.minute)  +'.'+ext

### Connect to the database

In [31]:
import pyodbc
c = pyodbc.connect('DSN=Virtuoso All;DBA=ESTAT;UID=xxxxx;PWD=xxxxx')
cursor = c.cursor()

In [32]:
import re
#import unicodedata as ud

def clean(x, quotes=True):
    if pd.isnull(x): return x  
    x = x.strip()
    
    ## make letter-question mark-letter -> letter-quote-space-letter !!! but NOT in the lists of URLs!!!
    if quotes:
        x = re.sub(r'([A-Za-z])\?([A-Za-z])','\\1\' \\2',x) 
    
    ## make letter-question mark-space-lower case letter -> letter-quote-space-letter
    x = re.sub(r'([A-Za-z])\? ([a-z])','\\1\' \\2',x) 

    ## delete ,000 commas in numbers    
    x = re.sub(r'\b(\d+),(\d+)\b','\\1\\2',x) ## CORRECTED
    
    ## delete  000 spaces in numbers
    x = re.sub(r'\b(\d+) (\d+)\b','\\1\\2',x) ## CORRECTED
    
    ## collapse multiple spaces into one
    x = re.sub(r' +', ' ',x)
    
    ## remove start and end spaces
    x = re.sub(r'^ +| +$', '',x,flags=re.MULTILINE) 
    
    ## space-comma -> comma
    x = re.sub(r' \,',',',x)
    
    ## space-dot -> dot
    x = re.sub(r' \.','.',x)
    
    x = re.sub(r'â.{2}',"'",x) ### !!! NEW: single quotes are read as: âXX
    
    #x = x.encode('latin1').decode('utf-8') ## â\x80\x99
    #x = ud.normalize('NFKD',x).encode('ascii', 'ignore').decode()
    
    return x

### Statistics explained articles

* IDs and titles from dat_link_info, with resource_information_id=1, i.e. Eurostat (see ESTAT.V1.mod_resource_information) and matching IDs from dat_article.
* Carry out data cleansing on titles.
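
To make the effect of the cleansing concrete, the sketch below applies four of the substitutions from `clean()` to a single made-up title. The function `clean_title_demo` and the sample string are illustrative assumptions; the full `clean()` defined earlier does more than this reduced version.

```python
import re


def clean_title_demo(text):
    """Reduced stand-in for clean(): only four of its substitutions."""
    text = text.strip()
    # letter?letter -> letter' letter (mis-encoded apostrophes)
    text = re.sub(r"([A-Za-z])\?([A-Za-z])", r"\1' \2", text)
    # remove thousands commas inside numbers: 1,250 -> 1250
    text = re.sub(r"\b(\d+),(\d+)\b", r"\1\2", text)
    # collapse repeated spaces
    text = re.sub(r" +", " ", text)
    # space before a full stop -> full stop
    text = re.sub(r" \.", ".", text)
    return text


sample = "  The EU?s exports rose  to 1,250 tonnes .  "  # made-up input
print(clean_title_demo(sample))  # The EU' s exports rose to 1250 tonnes.
```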

In [33]:
SQLCommand = """SELECT id, title 
                FROM ESTAT.V1.dat_link_info 
                WHERE resource_information_id=1 AND id IN (SELECT id FROM ESTAT.V1.dat_article) """

SE_df = pd.read_sql(SQLCommand,c)

SE_df['title'] = SE_df['title'].apply(clean)
SE_df.head(5)


Unnamed: 0,id,title
0,7,Accidents at work statistics
1,13,National accounts and GDP
2,16,Railway safety statistics in the EU
3,17,Railway freight transport statistics
4,18,Railway passenger transport statistics - quart...


### Add paragraphs titles and contents

* From dat_article_paragraph with abstract=0 (i.e. "no").
* Match article_id from dat_article_paragraph with id from dat_article.
* Carry out data cleansing on titles and paragraph contents.

In [34]:
SQLCommand = """SELECT article_id, title, content 
                FROM ESTAT.V1.dat_article_paragraph
                WHERE abstract=0 AND article_id IN (SELECT id FROM ESTAT.V1.dat_article) """

add_content = pd.read_sql(SQLCommand,c)
add_content['title'] = add_content['title'].apply(clean)
add_content['content'] = add_content['content'].apply(clean)
add_content

Unnamed: 0,article_id,title,content
0,2905,Absences from work sharply increase in first h...,Absences from work recorded unprecedented high...
1,2905,Absences: 9.5 % of employment in Q4 2019 and 1...,The article's next figure (Figure 4) compares ...
2,2905,Higher share of absences from work among women...,"Considering all four quarters of 2020, the sha..."
3,2905,Absences from work due to own illness or disab...,"From Q4 2019 to Q4 2020, the number of people ..."
4,2905,Absences from work due to holidays,"Expressed as a share of employed people, absen..."
...,...,...,...
3854,10539,General presentation and definition,Scope of asylum statistics and Dublin statisti...
3855,10539,Methodological aspects in asylum statistics,Annual aggregate of the number of asylum appli...
3856,10539,Methodological aspects in Dublin statistics,Asymmetries For most of the collected Dublin s...
3857,10539,What questions can or cannot be answered with ...,How many asylum seekers are entering EU Member...


### Aggregate the above paragraph titles and contents from SE article paragraphs by article id

* Create a column _raw content_ which gathers all paragraph titles and contents in one text per article.
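
The same aggregation can also be written without an explicit loop. The sketch below is an alternative formulation on made-up rows, not the notebook's own code: it builds one `' title content'` piece per paragraph and joins the pieces per article, which should give the same layout as the loop, including the leading space.

```python
import pandas as pd

# Made-up paragraphs standing in for the rows of dat_article_paragraph.
paragraphs = pd.DataFrame({
    "article_id": [7, 7, 13],
    "title": ["Number of accidents", "Fatal accidents", "GDP growth"],
    "content": ["First text.", "Second text.", "Third text."],
})

# One text piece per paragraph: ' ' + title + ' ' + content.
pieces = " " + paragraphs["title"] + " " + paragraphs["content"]

# Concatenate the pieces of each article in their original order.
raw = (
    pieces.groupby(paragraphs["article_id"])
    .agg("".join)
    .rename("raw content")
    .reset_index()
)
print(raw)
```

This assumes that neither titles nor contents contain missing values; a missing value would make the concatenated piece missing as well.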

In [35]:
add_content_grouped = add_content.groupby(['article_id'])[['title','content']].aggregate(lambda x: list(x))
add_content_grouped.reset_index(drop=False, inplace=True)
for i in range(len(add_content_grouped)):
    add_content_grouped.loc[i,'raw content'] = ''
    for (a,b) in zip(add_content_grouped.loc[i,'title'],add_content_grouped.loc[i,'content']):
        add_content_grouped.loc[i,'raw content'] += ' '+a + ' ' + b
add_content_grouped = add_content_grouped[['article_id','raw content']]    

add_content_grouped

Unnamed: 0,article_id,raw content
0,7,"Number of accidents In 2018, there were 3.1 m..."
1,13,Developments for GDP in the EU-27: growth sin...
2,16,Fall in the number of railway accidents 9 % f...
3,17,Downturn for EU transport performance in 2019...
4,18,Rail passenger transport performance continue...
...,...,...
860,10456,Problem After successfully identifying and jo...
861,10470,"Problem In France, there was significant room..."
862,10506,General overview Nine PEEIs concern short-ter...
863,10531,What are administrative sources? The term 'ad...


### Merge raw content of SE articles with main file

* Also, add the title to the raw content.

In [36]:
SE_df = pd.merge(SE_df,add_content_grouped,left_on='id',right_on='article_id',how='inner')
SE_df.drop(['article_id'],axis=1,inplace=True)

SE_df['raw content'] = SE_df['title'] +'. '+SE_df['raw content']

SE_df.head(5)

Unnamed: 0,id,title,raw content
0,7,Accidents at work statistics,Accidents at work statistics. Number of accid...
1,13,National accounts and GDP,National accounts and GDP. Developments for G...
2,16,Railway safety statistics in the EU,Railway safety statistics in the EU. Fall in ...
3,17,Railway freight transport statistics,Railway freight transport statistics. Downtur...
4,18,Railway passenger transport statistics - quart...,Railway passenger transport statistics - quart...


### Check for missing information

In [37]:
import numpy as np

SE_df = SE_df.replace('', np.nan) 
print(SE_df.isnull().sum())

id             0
title          0
raw content    0
dtype: int64


### Collecting information on noun phrases


In [38]:
import nltk
import re
import pprint
from nltk import Tree

new_patterns = """
    NP:    {<DT><WP><VBP>*<RB>*<VBN><IN><NN>}
           {<NN|NNS|NNP|NNPS><IN>*<NN|NNS|NNP|NNPS>+}
           {<JJ>*<NN|NNS|NNP|NNPS><CC>*<NN|NNS|NNP|NNPS>+}
           {<JJ>*<NN|NNS|NNP|NNPS>+}
           
    """

new_NPChunker = nltk.RegexpParser(new_patterns)

def prepare_text(input):
    tokenized_sentence = nltk.sent_tokenize(input)  # Tokenize the text into sentences.
    tokenized_words = [nltk.word_tokenize(sentence) for sentence in tokenized_sentence]  # Tokenize words in sentences.
    tagged_words = [nltk.pos_tag(word) for word in tokenized_words]  # Tag words for POS in each sentence.
    word_tree = [new_NPChunker.parse(word) for word in tagged_words]  # Identify NP chunks
    return word_tree  # Return the tagged & chunked sentences.


def return_a_list_of_NPs(sentences):
    nps = []  # an empty list in which the NPs will be stored.
    for sent in sentences:
        tree = new_NPChunker.parse(sent)
        for subtree in tree.subtrees():
            if subtree.label() == 'NP':
                t = subtree
                t = ' '.join(word for word, tag in t.leaves())
                nps.append(t)
    return nps
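
The chunk grammar above can be tried on a hand-tagged sentence without downloading any tokenizer or tagger models. This is a minimal sketch: the POS tags below are assigned by hand as an assumption, and `nltk.pos_tag` may tag the same words differently.

```python
import nltk

# Same chunk grammar as in the cell above.
grammar = """
    NP:    {<DT><WP><VBP>*<RB>*<VBN><IN><NN>}
           {<NN|NNS|NNP|NNPS><IN>*<NN|NNS|NNP|NNPS>+}
           {<JJ>*<NN|NNS|NNP|NNPS><CC>*<NN|NNS|NNP|NNPS>+}
           {<JJ>*<NN|NNS|NNP|NNPS>+}
    """
chunker = nltk.RegexpParser(grammar)

# Hand-assigned (word, tag) pairs; hypothetical input for illustration.
tagged = [("Number", "NN"), ("of", "IN"), ("accidents", "NNS"),
          ("includes", "VBZ"), ("non-fatal", "JJ"), ("accidents", "NNS")]

tree = chunker.parse(tagged)

# Collect the text of every NP chunk found in the sentence.
phrases = [" ".join(word for word, tag in subtree.leaves())
           for subtree in tree.subtrees(lambda t: t.label() == "NP")]
print(phrases)
```

With these tags, the second rule (noun, optional preposition, noun) picks up "Number of accidents" and the last rule (optional adjectives, noun) picks up "non-fatal accidents".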


In [39]:
d=[]

for i in range(len(SE_df)):
    sentences = prepare_text(SE_df.loc[i,'raw content'])
    results = return_a_list_of_NPs(sentences)
    results = [(SE_df.loc[i,'id'],l) for l in results]
    d.extend(results)

In [40]:
nphrases_SE = pd.DataFrame(d,columns=["SE_article_id", "noun_phrase"])   
nphrases_SE.drop_duplicates(inplace=True)
nphrases_SE.reset_index(drop=True,inplace=True)
nphrases_SE

Unnamed: 0,SE_article_id,noun_phrase
0,7,Accidents at work statistics
1,7,Number of accidents
2,7,non-fatal accidents
3,7,calendar days
4,7,absence from work
...,...,...
216598,10539,Asylum Procedures Directive
216599,10539,Reception Conditions Directive
216600,10539,EURODAC Regulation
216601,10539,access


### Merge with the file with the SE titles

In [41]:
nphrases_SE2=pd.merge(SE_df[['id','title']],nphrases_SE,left_on='id',right_on='SE_article_id')
nphrases_SE2.rename(columns={'title':'SE_article_title'},inplace=True)
nphrases_SE2.drop(columns=['id'],inplace=True)
nphrases_SE2

Unnamed: 0,SE_article_title,SE_article_id,noun_phrase
0,Accidents at work statistics,7,Accidents at work statistics
1,Accidents at work statistics,7,Number of accidents
2,Accidents at work statistics,7,non-fatal accidents
3,Accidents at work statistics,7,calendar days
4,Accidents at work statistics,7,absence from work
...,...,...,...
216598,Asylum statistics introduced,10539,Asylum Procedures Directive
216599,Asylum statistics introduced,10539,Reception Conditions Directive
216600,Asylum statistics introduced,10539,EURODAC Regulation
216601,Asylum statistics introduced,10539,access


### Normalize noun phrases: lemmatize, drop stop-words, convert to upper case
* NLTK seemed to give better lemmatization than spaCy. Convert to lower case first.
* Keep only words that are purely alphanumeric and drop stop-words.
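
The normalization steps can be illustrated without the NLTK corpora. In the sketch below, `TOY_STOP_WORDS` and `TOY_LEMMAS` are hypothetical stand-ins for NLTK's English stop-word list and `WordNetLemmatizer`; only the order of operations follows the notebook.

```python
# Hypothetical stand-ins for the NLTK stop-word list and WordNet lemmatizer.
TOY_STOP_WORDS = {"of", "from", "at", "the", "in"}
TOY_LEMMAS = {"accidents": "accident", "statistics": "statistic", "days": "day"}


def normalize_demo(phrase):
    """Lower-case, keep alphanumeric non-stop-words, lemmatize, upper-case."""
    kept = [w for w in phrase.lower().split()
            if w.isalnum() and w not in TOY_STOP_WORDS]
    return " ".join(TOY_LEMMAS.get(w, w) for w in kept).upper()


for phrase in ["Accidents at work statistics", "non-fatal accidents", "calendar days"]:
    print(phrase, "->", normalize_demo(phrase))
```

Note that a hyphenated token such as "non-fatal" fails `isalnum()` and is dropped entirely, so "non-fatal accidents" is reduced to the single word ACCIDENT.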

In [42]:
import nltk

from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

stop = stopwords.words('english')
  

In [43]:
def lemmatize_text(text): ## only alphanumeric characters and drop stop-words
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text) if w.isalnum() and w not in stop]

nphrases_SE2['normalized_noun_phrase'] = nphrases_SE2.noun_phrase.apply(lambda x: x.lower())
nphrases_SE2['normalized_noun_phrase'] = nphrases_SE2.normalized_noun_phrase.apply(lemmatize_text)
nphrases_SE2['normalized_noun_phrase'] = [' '.join(map(str, l)) for l in nphrases_SE2['normalized_noun_phrase'] ]
nphrases_SE2['normalized_noun_phrase'] = nphrases_SE2.normalized_noun_phrase.apply(lambda x: x.upper())
nphrases_SE2.drop(columns=['noun_phrase'],inplace=True)
nphrases_SE2

Unnamed: 0,SE_article_title,SE_article_id,normalized_noun_phrase
0,Accidents at work statistics,7,ACCIDENT WORK STATISTIC
1,Accidents at work statistics,7,NUMBER ACCIDENT
2,Accidents at work statistics,7,ACCIDENT
3,Accidents at work statistics,7,CALENDAR DAY
4,Accidents at work statistics,7,ABSENCE WORK
...,...,...,...
216598,Asylum statistics introduced,10539,ASYLUM PROCEDURE DIRECTIVE
216599,Asylum statistics introduced,10539,RECEPTION CONDITION DIRECTIVE
216600,Asylum statistics introduced,10539,EURODAC REGULATION
216601,Asylum statistics introduced,10539,ACCESS


### Drop noun phrases with only one word
* Also drop duplicated records (multiple occurrences of the same normalized noun phrase in an article).
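
The filter and the de-duplication can be expressed compactly with a boolean mask. This is a sketch on made-up rows and an alternative formulation rather than the notebook's code; unlike the notebook, it does not replace commas before counting words.

```python
import pandas as pd

# Made-up rows: one single-word phrase and one duplicated record.
demo = pd.DataFrame({
    "SE_article_id": [7, 7, 7, 7],
    "normalized_noun_phrase": ["NUMBER ACCIDENT", "ACCIDENT",
                               "NUMBER ACCIDENT", "CALENDAR DAY"],
})

# Number of whitespace-separated words in each phrase.
word_count = demo["normalized_noun_phrase"].str.split().str.len()

filtered = (
    demo[word_count > 1]      # keep phrases with at least two words
    .drop_duplicates()        # one record per (article, phrase)
    .reset_index(drop=True)
)
print(filtered)
```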

In [44]:
nphrases_SE2['normalized_noun_phrase_count'] = nphrases_SE2['normalized_noun_phrase'].apply(lambda x: len(x.replace(',',' ').split()))
idx = nphrases_SE2[nphrases_SE2['normalized_noun_phrase_count'] <=1].index
print(idx)

nphrases_SE2.drop(index=idx, inplace = True)
idx = nphrases_SE2[nphrases_SE2['normalized_noun_phrase_count'] <=1].index
print(idx)

nphrases_SE2.drop(columns=['normalized_noun_phrase_count'],inplace=True)
nphrases_SE2.drop_duplicates(inplace=True)
nphrases_SE2.reset_index(drop=True, inplace=True)
nphrases_SE2


Int64Index([     2,      6,      7,      8,     10,     13,     14,     15,
                16,     17,
            ...
            216578, 216579, 216580, 216581, 216583, 216590, 216593, 216595,
            216596, 216601],
           dtype='int64', length=99023)
Int64Index([], dtype='int64')


Unnamed: 0,SE_article_title,SE_article_id,normalized_noun_phrase
0,Accidents at work statistics,7,ACCIDENT WORK STATISTIC
1,Accidents at work statistics,7,NUMBER ACCIDENT
2,Accidents at work statistics,7,CALENDAR DAY
3,Accidents at work statistics,7,ABSENCE WORK
4,Accidents at work statistics,7,FATAL ACCIDENT
...,...,...,...
110291,Asylum statistics introduced,10539,REFERENCE REGULATION
110292,Asylum statistics introduced,10539,ASYLUM PROCEDURE DIRECTIVE
110293,Asylum statistics introduced,10539,RECEPTION CONDITION DIRECTIVE
110294,Asylum statistics introduced,10539,EURODAC REGULATION


### Unique normalized noun phrases in SE articles

In [45]:
unique_nps = nphrases_SE2.groupby(['normalized_noun_phrase']).size().to_frame('size').reset_index() ## unique noun phrases
unique_nps.drop(columns=['size'],inplace=True)
unique_nps

Unnamed: 0,normalized_noun_phrase
0,10TH MAIN FLOW
1,17TH CENTURY
2,1970S EUROSTAT
3,1ST QUARTER
4,20TH ANNIVERSARY
...,...
57128,ZOOM BUTTON
57129,Ã LAND
57130,Ã LAND ISLAND
57131,Ã RDAL


### Selected (manually processed) normalized noun phrases stored in the database
* In previous versions these were read from file _Termino V2.xlsx_.

In [46]:
SQLCommand = """SELECT norm_nphrase 
                FROM ESTAT.V1.Norm_NPs """

selected_df = pd.read_sql(SQLCommand,c)
selected_df.rename(columns={'norm_nphrase':'normalized_noun_phrase'},inplace=True)
selected_df

Unnamed: 0,normalized_noun_phrase
0,ABBREVIATED NEET
1,ABBREVIATION ESA
2,ABDOMEN HEART
3,ABDOMEN ORDER
4,ABDOMEN UTERUS
...,...
49025,ZONE CITY
49026,ZONE EXCLAVES
49027,ZONE FORM
49028,ZONE MATRIX


### Keep only the common noun phrases in the file with the SE titles and article IDs, and restrict the separate file 'unique_nps' of unique noun phrases in the same way

In [47]:
nphrases_SE2 = pd.merge(nphrases_SE2,selected_df,on=['normalized_noun_phrase'])
nphrases_SE2

Unnamed: 0,SE_article_title,SE_article_id,normalized_noun_phrase
0,Accidents at work statistics,7,ACCIDENT WORK STATISTIC
1,Accidents at work statistics,7,NUMBER ACCIDENT
2,Railway safety statistics in the EU,16,NUMBER ACCIDENT
3,Accidents at work ? statistics on causes and c...,2947,NUMBER ACCIDENT
4,Road safety statistics ? characteristics at na...,7156,NUMBER ACCIDENT
...,...,...,...
93347,Asylum statistics introduced,10539,REFERENCE REGULATION
93348,Asylum statistics introduced,10539,ASYLUM PROCEDURE DIRECTIVE
93349,Asylum statistics introduced,10539,RECEPTION CONDITION DIRECTIVE
93350,Asylum statistics introduced,10539,EURODAC REGULATION


In [48]:
unique_nps = pd.merge(unique_nps,selected_df,on=['normalized_noun_phrase'])
unique_nps

Unnamed: 0,normalized_noun_phrase
0,ABBREVIATED NEET
1,ABBREVIATION ESA
2,ABDOMEN HEART
3,ABDOMEN ORDER
4,ABDOMEN UTERUS
...,...
48824,ZONE CITY
48825,ZONE EXCLAVES
48826,ZONE FORM
48827,ZONE MATRIX


### OECD - Glossary of Statistical Terms 
https://stats.oecd.org/glossary/alpha.asp

* Drop records with missing terms or definitions.
* Clean contents.
* Create a column 'raw content' gathering terms, definitions and contexts.
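
One detail worth illustrating is why missing contexts have to be replaced by empty strings before the columns are concatenated: in pandas, string concatenation with a missing value yields a missing value. The rows below are made up for the sketch.

```python
import numpy as np
import pandas as pd

# Made-up glossary rows; the first one has no context.
glossary = pd.DataFrame({
    "term": ["Abatement", "Acute care beds"],
    "definition": ["See Pollution abatement.", "Beds for acute care."],
    "context": [np.nan, "Alternative definitions exist."],
})

# Without fillna(''), the missing context makes the whole result missing.
unfilled = glossary["term"] + ". " + glossary["definition"] + ". " + glossary["context"]

# With empty strings in place of missing contexts, every row gets a text.
glossary["context"] = glossary["context"].fillna("")
glossary["raw content"] = (
    glossary["term"] + ". " + glossary["definition"] + ". " + glossary["context"]
)
print(glossary["raw content"].tolist())
```

A definition that already ends with a full stop produces a doubled full stop in the concatenated text.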

In [49]:
import numpy as np
SQLCommand = """SELECT article_id, term, url, definition, context, theme
                FROM ESTAT.V1.OECD_Glossary """
OECD_df = pd.read_sql(SQLCommand,c)
OECD_df.rename(columns={'article_id':'OECD_term_id','url':'URL'},inplace=True)
OECD_df.replace('',np.nan,inplace=True)
OECD_df


Unnamed: 0,OECD_term_id,term,URL,definition,context,theme
0,1,Abatement,https://stats.oecd.org/glossary/detail.asp?ID=1,See Pollution abatement.,,Environmental statistics
1,2,Absence from work due to illness,https://stats.oecd.org/glossary/detail.asp?ID=2,Absence from work due to illness refers to the...,,Health statistics
2,3,Activity restriction - free expectancy,https://stats.oecd.org/glossary/detail.asp?ID=3,Functional limitation-free life expectancy is ...,,Health statistics
3,4,Acute care,https://stats.oecd.org/glossary/detail.asp?ID=4,Acute care is one in which the principal inten...,,Health statistics
4,5,Acute care beds,https://stats.oecd.org/glossary/detail.asp?ID=5,Acute care beds are beds accommodating patient...,Acute care beds have alternatively been define...,Health statistics
...,...,...,...,...,...,...
6931,7352,European Agricultural Fund for Rural Developme...,https://stats.oecd.org/glossary/detail.asp?ID=...,The Common Agricultural Policy (CAP) is financ...,,
6932,7354,Carbon market,https://stats.oecd.org/glossary/detail.asp?ID=...,A popular (but misleading) term for a trading ...,,
6933,7355,Classification structure,https://stats.oecd.org/glossary/detail.asp?ID=...,Refers to how the categories of a classificati...,,
6934,7356,United Nation Framework Convention on Climate ...,https://stats.oecd.org/glossary/detail.asp?ID=...,The United Nations Framework Convention on Cli...,"The other ?Rio Conventions?, also negotiated a...",


In [50]:
OECD_df.dropna(subset=['term','definition'],inplace=True)
OECD_df.reset_index(drop=True, inplace=True)

OECD_df['term'] = OECD_df['term'].apply(clean)
OECD_df['definition'] = OECD_df['definition'].apply(clean)
OECD_df['context'] = OECD_df['context'].apply(clean)

OECD_df['term'] = OECD_df['term'].apply(lambda x: x.replace(","," ")) ## drop comma in term!

OECD_df['context'] = OECD_df['context'].fillna('')

OECD_df['raw content'] = OECD_df['term'] +'. '+OECD_df['definition'] + '. '+OECD_df['context']
OECD_df['raw content'] = OECD_df['raw content'].apply(lambda x: re.sub(r" +"," ",x)) ## replace multiple spaces
OECD_df.drop(columns=['definition','context'],inplace=True)
OECD_df

Unnamed: 0,OECD_term_id,term,URL,theme,raw content
0,1,Abatement,https://stats.oecd.org/glossary/detail.asp?ID=1,Environmental statistics,Abatement. See Pollution abatement..
1,2,Absence from work due to illness,https://stats.oecd.org/glossary/detail.asp?ID=2,Health statistics,Absence from work due to illness. Absence from...
2,3,Activity restriction - free expectancy,https://stats.oecd.org/glossary/detail.asp?ID=3,Health statistics,Activity restriction - free expectancy. Functi...
3,4,Acute care,https://stats.oecd.org/glossary/detail.asp?ID=4,Health statistics,Acute care. Acute care is one in which the pri...
4,5,Acute care beds,https://stats.oecd.org/glossary/detail.asp?ID=5,Health statistics,Acute care beds. Acute care beds are beds acco...
...,...,...,...,...,...
6928,7352,European Agricultural Fund for Rural Developme...,https://stats.oecd.org/glossary/detail.asp?ID=...,,European Agricultural Fund for Rural Developme...
6929,7354,Carbon market,https://stats.oecd.org/glossary/detail.asp?ID=...,,Carbon market. A popular (but misleading) term...
6930,7355,Classification structure,https://stats.oecd.org/glossary/detail.asp?ID=...,,Classification structure. Refers to how the ca...
6931,7356,United Nation Framework Convention on Climate ...,https://stats.oecd.org/glossary/detail.asp?ID=...,,United Nation Framework Convention on Climate ...


### Normalize in the same way the noun phrases found in OECD's Glossary

In [51]:
d=[]

for i in range(len(OECD_df)):
    #print(i)
    
    sentences = prepare_text(OECD_df.loc[i,'raw content'])
    results = return_a_list_of_NPs(sentences)
    results = [(OECD_df.loc[i,'OECD_term_id'],OECD_df.loc[i,'URL'],OECD_df.loc[i,'term'],l) for l in results]
    d.extend(results)

OECD_df2 = pd.DataFrame(d,columns=["OECD_term_id", "OECD_URL","OECD_term","noun_phrase"])    
OECD_df2    
    
OECD_df2['normalized_noun_phrase'] = OECD_df2.noun_phrase.apply(lambda x: x.lower())

OECD_df2['normalized_noun_phrase'] = OECD_df2.normalized_noun_phrase.apply(lemmatize_text)
OECD_df2['normalized_noun_phrase'] = [' '.join(map(str, l)) for l in OECD_df2['normalized_noun_phrase'] ]
OECD_df2['normalized_noun_phrase'] = OECD_df2.normalized_noun_phrase.apply(lambda x: x.upper())
OECD_df2

Unnamed: 0,OECD_term_id,OECD_URL,OECD_term,noun_phrase,normalized_noun_phrase
0,1,https://stats.oecd.org/glossary/detail.asp?ID=1,Abatement,Abatement,ABATEMENT
1,1,https://stats.oecd.org/glossary/detail.asp?ID=1,Abatement,Pollution abatement ..,POLLUTION ABATEMENT
2,2,https://stats.oecd.org/glossary/detail.asp?ID=2,Absence from work due to illness,Absence from work,ABSENCE WORK
3,2,https://stats.oecd.org/glossary/detail.asp?ID=2,Absence from work due to illness,Absence from work,ABSENCE WORK
4,2,https://stats.oecd.org/glossary/detail.asp?ID=2,Absence from work due to illness,refers,REFERS
...,...,...,...,...,...
100431,7357,https://stats.oecd.org/glossary/detail.asp?ID=...,Carbon Capture and Storage (CCS),energy-related sources,SOURCE
100432,7357,https://stats.oecd.org/glossary/detail.asp?ID=...,Carbon Capture and Storage (CCS),transport,TRANSPORT
100433,7357,https://stats.oecd.org/glossary/detail.asp?ID=...,Carbon Capture and Storage (CCS),storage location,STORAGE LOCATION
100434,7357,https://stats.oecd.org/glossary/detail.asp?ID=...,Carbon Capture and Storage (CCS),longterm isolation,LONGTERM ISOLATION


* Again drop noun phrases with only one word.

In [52]:
OECD_df2['normalized_noun_phrase_count'] = OECD_df2['normalized_noun_phrase'].apply(lambda x: len(x.replace(',',' ').split()))
idx = OECD_df2[OECD_df2['normalized_noun_phrase_count'] <=1].index
print(idx)

OECD_df2.drop(OECD_df2[OECD_df2['normalized_noun_phrase_count'] <=1].index, inplace = True)
idx = OECD_df2[OECD_df2['normalized_noun_phrase_count'] <=1].index
print(idx)

OECD_df2.drop(columns=['noun_phrase','normalized_noun_phrase_count'],inplace=True)

OECD_df2.reset_index(drop=True,inplace=True)
OECD_df2

Int64Index([     0,      4,      6,      7,     12,     15,     18,     19,
                20,     21,
            ...
            100411, 100412, 100422, 100424, 100425, 100427, 100428, 100431,
            100432, 100435],
           dtype='int64', length=51678)
Int64Index([], dtype='int64')


Unnamed: 0,OECD_term_id,OECD_URL,OECD_term,normalized_noun_phrase
0,1,https://stats.oecd.org/glossary/detail.asp?ID=1,Abatement,POLLUTION ABATEMENT
1,2,https://stats.oecd.org/glossary/detail.asp?ID=2,Absence from work due to illness,ABSENCE WORK
2,2,https://stats.oecd.org/glossary/detail.asp?ID=2,Absence from work due to illness,ABSENCE WORK
3,2,https://stats.oecd.org/glossary/detail.asp?ID=2,Absence from work due to illness,NUMBER WORK DAY
4,3,https://stats.oecd.org/glossary/detail.asp?ID=3,Activity restriction - free expectancy,ACTIVITY RESTRICTION
...,...,...,...,...
48753,7357,https://stats.oecd.org/glossary/detail.asp?ID=...,Carbon Capture and Storage (CCS),CARBON CAPTURE
48754,7357,https://stats.oecd.org/glossary/detail.asp?ID=...,Carbon Capture and Storage (CCS),PROCESS CONSISTING
48755,7357,https://stats.oecd.org/glossary/detail.asp?ID=...,Carbon Capture and Storage (CCS),SEPARATION CO2
48756,7357,https://stats.oecd.org/glossary/detail.asp?ID=...,Carbon Capture and Storage (CCS),STORAGE LOCATION


### Find matches per unique noun phrase

* Column 'found_in_OECD_ids': a list with the OECD IDs.
* Column 'found_in_OECD_URLs': a list with the corresponding URLs.
* Column 'found_in_OECD_terms': a list with the corresponding OECD terms.
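
The matching in the next cell uses `str.contains(..., regex=False)`, which is a plain substring test, so a phrase can also match inside a longer word. The sketch below shows this on made-up phrases and, as a possible alternative that the notebook does not use, a whole-word comparison; both function names are hypothetical.

```python
# Made-up normalized OECD noun phrases.
oecd_phrases = ["ABSENCE WORK", "NUMBER WORK DAY", "OZONE LAYER DEPLETION"]


def substring_matches(phrase, candidates):
    """Plain substring test, like str.contains(phrase, regex=False)."""
    return [c for c in candidates if phrase in c]


def whole_word_matches(phrase, candidates):
    """Match only when the phrase's words appear as consecutive whole words."""
    words = phrase.split()
    n = len(words)
    hits = []
    for c in candidates:
        tokens = c.split()
        if any(tokens[k:k + n] == words for k in range(len(tokens) - n + 1)):
            hits.append(c)
    return hits


print(substring_matches("WORK DAY", oecd_phrases))     # ['NUMBER WORK DAY']
print(substring_matches("ZONE LAYER", oecd_phrases))   # ['OZONE LAYER DEPLETION']
print(whole_word_matches("ZONE LAYER", oecd_phrases))  # []
```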


In [53]:
unique_nps['found_in_OECD_ids']=[list() for i in range(len(unique_nps))]
unique_nps['found_in_OECD_URLs']=[list() for i in range(len(unique_nps))]
unique_nps['found_in_OECD_terms']=[list() for i in range(len(unique_nps))]
unique_nps['OECD_matches']=[list() for i in range(len(unique_nps))]
for i in range(len(unique_nps)):
    if i % 1000 ==0: print(i)
    phrase = unique_nps.loc[i,'normalized_noun_phrase']  ## current noun phrase; the name np is reserved for numpy
    idx = list(OECD_df2[OECD_df2['normalized_noun_phrase'].str.contains(phrase,regex=False)].index)
    if len(idx) > 0:
        for j in idx:
            if OECD_df2.loc[j,'OECD_term_id'] not in unique_nps.loc[i,'found_in_OECD_ids']:
                unique_nps.loc[i,'found_in_OECD_ids'].append(OECD_df2.loc[j,'OECD_term_id'])
                unique_nps.loc[i,'found_in_OECD_URLs'].append(OECD_df2.loc[j,'OECD_URL'])
                unique_nps.loc[i,'found_in_OECD_terms'].append(OECD_df2.loc[j,'OECD_term'])
                unique_nps.loc[i,'OECD_matches'].append(OECD_df2.loc[j,'normalized_noun_phrase'])

unique_nps

0
1000
2000
...
48000


Unnamed: 0,normalized_noun_phrase,found_in_OECD_ids,found_in_OECD_URLs,found_in_OECD_terms,OECD_matches
0,ABBREVIATED NEET,[],[],[],[]
1,ABBREVIATION ESA,[],[],[],[]
2,ABDOMEN HEART,[],[],[],[]
3,ABDOMEN ORDER,[],[],[],[]
4,ABDOMEN UTERUS,[],[],[],[]
...,...,...,...,...,...
48824,ZONE CITY,[],[],[],[]
48825,ZONE EXCLAVES,[],[],[],[]
48826,ZONE FORM,[],[],[],[]
48827,ZONE MATRIX,[],[],[],[]


### Merge with the SE articles file

In [54]:
nphrases_SE3 = pd.merge(nphrases_SE2,unique_nps,on='normalized_noun_phrase')

nphrases_SE3


Unnamed: 0,SE_article_title,SE_article_id,normalized_noun_phrase,found_in_OECD_ids,found_in_OECD_URLs,found_in_OECD_terms,OECD_matches
0,Accidents at work statistics,7,ACCIDENT WORK STATISTIC,[],[],[],[]
1,Accidents at work statistics,7,NUMBER ACCIDENT,[],[],[],[]
2,Railway safety statistics in the EU,16,NUMBER ACCIDENT,[],[],[],[]
3,Accidents at work ? statistics on causes and c...,2947,NUMBER ACCIDENT,[],[],[],[]
4,Road safety statistics ? characteristics at na...,7156,NUMBER ACCIDENT,[],[],[],[]
...,...,...,...,...,...,...,...
93347,Asylum statistics introduced,10539,REFERENCE REGULATION,[],[],[],[]
93348,Asylum statistics introduced,10539,ASYLUM PROCEDURE DIRECTIVE,[],[],[],[]
93349,Asylum statistics introduced,10539,RECEPTION CONDITION DIRECTIVE,[],[],[],[]
93350,Asylum statistics introduced,10539,EURODAC REGULATION,[],[],[],[]


### The output file is used in the Power BI application

In [55]:

nphrases_SE3.to_excel('SE_vs_OECD_Glossary_Noun_Phrases.xlsx')