# Statistics Explained articles: noun phrases and matching with the OECD's Glossary

### Installation instructions
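*    Download the notebook as a "raw" file and save it with the extension .ipynb (remove the .txt extension that is added automatically).
*    Install the required libraries from your Jupyter command prompt. The libraries and the versions used are:
    *    pyodbc==4.0.32
    *    pandas==1.3.5
    *    nltk==3.6.5
*    The notebook also imports numpy and bs4 (BeautifulSoup, with the lxml parser) and writes its result with pandas' to_excel(), which needs an Excel writer such as openpyxl. The versions used for these are not recorded here. The NLTK steps additionally need the punkt, averaged_perceptron_tagger, wordnet and stopwords resources, which can be fetched with nltk.download().
*   Launch the notebook and enter your own credentials for the Virtuoso database in the call to pyodbc.connect() in the cell titled "Connect to the database".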

In [1]:
import re
import pandas as pd
import sys


In [2]:
from datetime import datetime

def file_name(pre,ext):
    current_time = datetime.now() 
    return pre + '_'+ str(current_time.month)+ '_' + str(current_time.day) + \
                 '_' + str(current_time.hour)+ '_' + str(current_time.minute)  +'.'+ext

#### Connect to the database

In [3]:
import pyodbc
c = pyodbc.connect('DSN=Virtuoso All;DBA=ESTAT;UID=xxxxx;PWD=xxxxx')
cursor = c.cursor()

In [4]:
import re
#import unicodedata as ud

def clean(x, quotes=True):
    if pd.isnull(x): return x  
    x = x.strip()
    
    ## letter-question mark-letter -> letter-quote-space-letter (pass quotes=False for lists of URLs)
    if quotes:
        x = re.sub(r'([A-Za-z])\?([A-Za-z])','\\1\' \\2',x) 
    
    ## letter-question mark-space-lower case letter -> letter-quote-space-letter
    x = re.sub(r'([A-Za-z])\? ([a-z])','\\1\' \\2',x) 

    ## delete the comma between digit groups in numbers (e.g. 3,100 -> 3100)
    x = re.sub(r'\b(\d+),(\d+)\b','\\1\\2',x)
    
    ## delete the space between digit groups in numbers (e.g. 3 100 -> 3100)
    x = re.sub(r'\b(\d+) (\d+)\b','\\1\\2',x)
    
    ## remove more than one spaces
    x = re.sub(r' +', ' ',x)
    
    ## remove start and end spaces
    x = re.sub(r'^ +| +$', '',x,flags=re.MULTILINE) 
    
    ## space-comma -> comma
    x = re.sub(r' \,',',',x)
    
    ## space-dot -> dot
    x = re.sub(r' \.','.',x)
    
    x = re.sub(r'â.{2}',"'",x) ## mis-decoded single quotes show up as 'â' followed by two characters
    
    #x = x.encode('latin1').decode('utf-8') ## â\x80\x99
    #x = ud.normalize('NFKD',x).encode('ascii', 'ignore').decode()
    
    return x
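
To make the effect of the individual substitutions concrete, here is a minimal sketch that applies a subset of the rules from `clean` to a made-up string. The helper name `demo_clean` and the sample text are illustrative assumptions, not part of the notebook.

```python
import re

def demo_clean(x):
    """Apply a subset of the clean() rules to one string (illustration only)."""
    x = x.strip()
    # letter-question mark-space-lower case letter -> letter-quote-space-letter
    x = re.sub(r"([A-Za-z])\? ([a-z])", r"\1' \2", x)
    # delete the comma between digit groups in numbers
    x = re.sub(r"\b(\d+),(\d+)\b", r"\1\2", x)
    # collapse runs of spaces into one space
    x = re.sub(r" +", " ", x)
    # space-comma -> comma, space-dot -> dot
    x = re.sub(r" ,", ",", x)
    x = re.sub(r" \.", ".", x)
    return x

sample = "  Member States? data covered 3,100  firms , roughly ."
print(demo_clean(sample))
```

On this sample the result is `Member States' data covered 3100 firms, roughly.`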

### Statistics Explained articles

* Take IDs and titles from dat_link_info where resource_information_id=1 (i.e. Eurostat, see ESTAT.V1.mod_resource_information) and the ID also appears in dat_article.
* Carry out data cleansing on titles.

In [5]:
SQLCommand = """SELECT id, title 
                FROM ESTAT.V1.dat_link_info 
                WHERE resource_information_id=1 AND id IN (SELECT id FROM ESTAT.V1.dat_article) """

SE_df = pd.read_sql(SQLCommand,c)

SE_df['title'] = SE_df['title'].apply(clean)
SE_df.head(5)


Unnamed: 0,id,title
0,7,Accidents at work statistics
1,13,National accounts and GDP
2,16,Railway safety statistics in the EU
3,17,Railway freight transport statistics
4,18,Railway passenger transport statistics - quart...


### Add paragraph titles and contents

* From dat_article_paragraph with abstract=0 (i.e. "no").
* Match article_id from dat_article_paragraph with id from dat_article.
* Carry out data cleansing on titles and paragraph contents.

In [6]:
SQLCommand = """SELECT article_id, title, content 
                FROM ESTAT.V1.dat_article_paragraph
                WHERE abstract=0 AND article_id IN (SELECT id FROM ESTAT.V1.dat_article) """

add_content = pd.read_sql(SQLCommand,c)
add_content['title'] = add_content['title'].apply(clean)
add_content['content'] = add_content['content'].apply(clean)
add_content

Unnamed: 0,article_id,title,content
0,2905,Absences from work sharply increase in first h...,Absences from work recorded unprecedented high...
1,2905,Absences: 9.5 % of employment in Q4 2019 and 1...,The article's next figure (Figure 4) compares ...
2,2905,Higher share of absences from work among women...,"Considering all four quarters of 2020, the sha..."
3,2905,Absences from work due to own illness or disab...,"From Q4 2019 to Q4 2020, the number of people ..."
4,2905,Absences from work due to holidays,"Expressed as a share of employed people, absen..."
...,...,...,...
3854,10539,General presentation and definition,Scope of asylum statistics and Dublin statisti...
3855,10539,Methodological aspects in asylum statistics,Annual aggregate of the number of asylum appli...
3856,10539,Methodological aspects in Dublin statistics,Asymmetries For most of the collected Dublin s...
3857,10539,What questions can or cannot be answered with ...,How many asylum seekers are entering EU Member...
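
The query above keeps only non-abstract paragraphs whose article_id also appears in dat_article. The following minimal sketch reproduces that filter on an in-memory SQLite database. SQLite is only a stand-in so that the example runs without the Virtuoso connection; the table names drop the ESTAT.V1 prefix and all rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE dat_article (id INTEGER)")
cur.execute(
    "CREATE TABLE dat_article_paragraph "
    "(article_id INTEGER, title TEXT, content TEXT, abstract INTEGER)"
)
cur.executemany("INSERT INTO dat_article VALUES (?)", [(7,), (13,)])
cur.executemany(
    "INSERT INTO dat_article_paragraph VALUES (?, ?, ?, ?)",
    [
        (7, "Number of accidents", "text a", 0),
        (7, "Abstract", "text b", 1),        # abstract=1: filtered out
        (99, "Orphan", "text c", 0),         # no matching article: filtered out
        (13, "Developments for GDP", "text d", 0),
    ],
)

rows = cur.execute(
    """SELECT article_id, title
       FROM dat_article_paragraph
       WHERE abstract=0 AND article_id IN (SELECT id FROM dat_article)
       ORDER BY article_id"""
).fetchall()
print(rows)
conn.close()
```

Only the two non-abstract paragraphs of articles 7 and 13 survive the filter.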


### Aggregate paragraph titles and contents by article id

* Create a column _raw content_ which gathers all paragraph titles and contents in one text per article.

In [7]:
add_content_grouped = add_content.groupby(['article_id'])[['title','content']].aggregate(lambda x: list(x))
add_content_grouped.reset_index(drop=False, inplace=True)
for i in range(len(add_content_grouped)):
    add_content_grouped.loc[i,'raw content'] = ''
    for (a,b) in zip(add_content_grouped.loc[i,'title'],add_content_grouped.loc[i,'content']):
        add_content_grouped.loc[i,'raw content'] += ' '+a + ' ' + b
add_content_grouped = add_content_grouped[['article_id','raw content']]    

add_content_grouped

Unnamed: 0,article_id,raw content
0,7,"Number of accidents In 2018, there were 3.1 m..."
1,13,Developments for GDP in the EU-27: growth sin...
2,16,Fall in the number of railway accidents 9 % f...
3,17,Downturn for EU transport performance in 2019...
4,18,Rail passenger transport performance continue...
...,...,...
860,10456,Problem After successfully identifying and jo...
861,10470,"Problem In France, there was significant room..."
862,10506,General overview Nine PEEIs concern short-ter...
863,10531,What are administrative sources? The term 'ad...
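
The per-article concatenation can also be written without an explicit loop. This is a minimal sketch on made-up data, assuming that no title or content is missing. The helper column name `piece` is hypothetical, and unlike the loop above this version does not leave a leading space in each text.

```python
import pandas as pd

toy = pd.DataFrame({
    "article_id": [7, 7, 13],
    "title": ["Number of accidents", "Fatal accidents", "Developments for GDP"],
    "content": ["First paragraph.", "Second paragraph.", "Third paragraph."],
})

# one "title content" string per paragraph ...
toy["piece"] = toy["title"] + " " + toy["content"]

# ... then one joined text per article
grouped = (
    toy.groupby("article_id")["piece"]
       .agg(" ".join)
       .rename("raw content")
       .reset_index()
)
print(grouped)
```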


### Merge raw content of SE articles with main file

* Also, add the title to the raw content.

In [8]:
SE_df = pd.merge(SE_df,add_content_grouped,left_on='id',right_on='article_id',how='inner')
SE_df.drop(['article_id'],axis=1,inplace=True)

SE_df['raw content'] = SE_df['title'] +'. '+SE_df['raw content']
SE_df

SE_df.head(5)

Unnamed: 0,id,title,raw content
0,7,Accidents at work statistics,Accidents at work statistics. Number of accid...
1,13,National accounts and GDP,National accounts and GDP. Developments for G...
2,16,Railway safety statistics in the EU,Railway safety statistics in the EU. Fall in ...
3,17,Railway freight transport statistics,Railway freight transport statistics. Downtur...
4,18,Railway passenger transport statistics - quart...,Railway passenger transport statistics - quart...


### Check for missing information

In [9]:
import numpy as np

SE_df = SE_df.replace('', np.nan) 
print(SE_df.isnull().sum())

id             0
title          0
raw content    0
dtype: int64


### Collecting information on noun phrases


In [10]:
import nltk
import re
import pprint
from nltk import Tree

new_patterns = """
    NP:    {<DT><WP><VBP>*<RB>*<VBN><IN><NN>}
           {<NN|NNS|NNP|NNPS><IN>*<NN|NNS|NNP|NNPS>+}
           {<JJ>*<NN|NNS|NNP|NNPS><CC>*<NN|NNS|NNP|NNPS>+}
           {<JJ>*<NN|NNS|NNP|NNPS>+}
           
    """

new_NPChunker = nltk.RegexpParser(new_patterns)

def prepare_text(text):
    tokenized_sentence = nltk.sent_tokenize(text)  # Tokenize the text into sentences.
    tokenized_words = [nltk.word_tokenize(sentence) for sentence in tokenized_sentence]  # Tokenize words in sentences.
    tagged_words = [nltk.pos_tag(word) for word in tokenized_words]  # Tag words for POS in each sentence.
    word_tree = [new_NPChunker.parse(word) for word in tagged_words]  # Identify NP chunks
    return word_tree  # Return the tagged & chunked sentences.


def return_a_list_of_NPs(sentences):
    nps = []  # an empty list in which the NPs will be stored.
    for sent in sentences:
        tree = new_NPChunker.parse(sent)
        for subtree in tree.subtrees():
            if subtree.label() == 'NP':
                t = subtree
                t = ' '.join(word for word, tag in t.leaves())
                nps.append(t)
    return nps
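
To see which rule of the chunk grammar fires on which tokens, the following minimal sketch runs the same grammar on a sentence that is already POS-tagged by hand, so no NLTK data download is needed. The sentence and its tags are made up for illustration.

```python
import nltk

grammar = """
    NP:    {<DT><WP><VBP>*<RB>*<VBN><IN><NN>}
           {<NN|NNS|NNP|NNPS><IN>*<NN|NNS|NNP|NNPS>+}
           {<JJ>*<NN|NNS|NNP|NNPS><CC>*<NN|NNS|NNP|NNPS>+}
           {<JJ>*<NN|NNS|NNP|NNPS>+}
    """
chunker = nltk.RegexpParser(grammar)

# hand-tagged tokens (hypothetical example sentence)
tagged = [
    ("Non-fatal", "JJ"), ("accidents", "NNS"), ("reduce", "VBP"),
    ("the", "DT"), ("number", "NN"), ("of", "IN"),
    ("calendar", "NN"), ("days", "NNS"),
]

tree = chunker.parse(tagged)
nps = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees()
    if subtree.label() == "NP"
]
print(nps)
```

Here the second rule (noun, optional preposition, nouns) picks up "number of calendar days", and the last rule (adjectives followed by nouns) picks up "Non-fatal accidents".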


In [11]:
d=[]

for i in range(len(SE_df)):
    sentences = prepare_text(SE_df.loc[i,'raw content'])
    res = return_a_list_of_NPs(sentences)
    res = [(SE_df.loc[i,'id'],l) for l in res]
    d.extend(res)

In [12]:
nphrases_df = pd.DataFrame(d,columns=["doc_id", "noun_phrase"])    
nphrases_df

Unnamed: 0,doc_id,noun_phrase
0,7,Accidents at work statistics
1,7,Number of accidents
2,7,non-fatal accidents
3,7,calendar days
4,7,absence from work
...,...,...
418461,10539,stateless person
418462,10539,EURODAC Regulation
418463,10539,access
418464,10539,EU fingerprint database record


### Merge with the file with the SE titles

In [13]:
nphrases_df2=pd.merge(SE_df[['id','title']],nphrases_df,left_on='id',right_on='doc_id')
nphrases_df2.drop(columns=['id'],inplace=True)
nphrases_df2

Unnamed: 0,title,doc_id,noun_phrase
0,Accidents at work statistics,7,Accidents at work statistics
1,Accidents at work statistics,7,Number of accidents
2,Accidents at work statistics,7,non-fatal accidents
3,Accidents at work statistics,7,calendar days
4,Accidents at work statistics,7,absence from work
...,...,...,...
418461,Asylum statistics introduced,10539,stateless person
418462,Asylum statistics introduced,10539,EURODAC Regulation
418463,Asylum statistics introduced,10539,access
418464,Asylum statistics introduced,10539,EU fingerprint database record


### Lemmatize noun phrases
* NLTK seems to do better than spaCy at lemmatization here. Convert to lower case first.
* Keep only tokens made up entirely of alphanumeric characters (hyphenated tokens such as "non-fatal" are therefore dropped) and drop stop-words.

In [14]:
import nltk

from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer

w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

stop = stopwords.words('english')
  

In [15]:
def lemmatize_text(text): ## only alphanumeric characters and drop stop-words
    return [lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text) if w.isalnum() and not w in stop]

nphrases_df2['normalized_noun_phrase'] = nphrases_df2.noun_phrase.apply(lambda x: x.lower())
nphrases_df2['normalized_noun_phrase'] = nphrases_df2.normalized_noun_phrase.apply(lemmatize_text)
nphrases_df2['normalized_noun_phrase'] = [' '.join(map(str, l)) for l in nphrases_df2['normalized_noun_phrase'] ]
nphrases_df2['normalized_noun_phrase'] = nphrases_df2.normalized_noun_phrase.apply(lambda x: x.upper())
nphrases_df2

Unnamed: 0,title,doc_id,noun_phrase,normalized_noun_phrase
0,Accidents at work statistics,7,Accidents at work statistics,ACCIDENT WORK STATISTIC
1,Accidents at work statistics,7,Number of accidents,NUMBER ACCIDENT
2,Accidents at work statistics,7,non-fatal accidents,ACCIDENT
3,Accidents at work statistics,7,calendar days,CALENDAR DAY
4,Accidents at work statistics,7,absence from work,ABSENCE WORK
...,...,...,...,...
418461,Asylum statistics introduced,10539,stateless person,STATELESS PERSON
418462,Asylum statistics introduced,10539,EURODAC Regulation,EURODAC REGULATION
418463,Asylum statistics introduced,10539,access,ACCESS
418464,Asylum statistics introduced,10539,EU fingerprint database record,EU FINGERPRINT DATABASE RECORD
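
The token filter explains why "non-fatal accidents" is reduced to a single word in the table above. A minimal sketch of the filtering step alone, with the lemmatizer left out so that it runs without WordNet data (plurals therefore stay as they are). The stop-word set `STOP` and the function name `filter_tokens` are hypothetical stand-ins.

```python
# hypothetical mini stop-word list standing in for stopwords.words('english')
STOP = {"of", "at", "from", "the", "in"}

def filter_tokens(phrase):
    """Lower-case, split on whitespace, keep alphanumeric non-stop-word tokens."""
    return [w for w in phrase.lower().split() if w.isalnum() and w not in STOP]

for phrase in ["Number of accidents", "non-fatal accidents", "absence from work"]:
    print(phrase, "->", " ".join(filter_tokens(phrase)).upper())
```

"non-fatal" contains a hyphen, so `isalnum()` is false and the token is dropped.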


In [16]:
nphrases_df2.replace('', np.nan, inplace=True)
nphrases_df2.dropna(subset=['normalized_noun_phrase'],inplace=True)
nphrases_df2

Unnamed: 0,title,doc_id,noun_phrase,normalized_noun_phrase
0,Accidents at work statistics,7,Accidents at work statistics,ACCIDENT WORK STATISTIC
1,Accidents at work statistics,7,Number of accidents,NUMBER ACCIDENT
2,Accidents at work statistics,7,non-fatal accidents,ACCIDENT
3,Accidents at work statistics,7,calendar days,CALENDAR DAY
4,Accidents at work statistics,7,absence from work,ABSENCE WORK
...,...,...,...,...
418461,Asylum statistics introduced,10539,stateless person,STATELESS PERSON
418462,Asylum statistics introduced,10539,EURODAC Regulation,EURODAC REGULATION
418463,Asylum statistics introduced,10539,access,ACCESS
418464,Asylum statistics introduced,10539,EU fingerprint database record,EU FINGERPRINT DATABASE RECORD


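* Further processing: strip parentheses, closing brackets, percent signs and digits from the normalized phrases, and replace the stray character Â with A.
* Drop noun phrases that are left with at most one word.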

In [17]:
nphrases_df2['normalized_noun_phrase'] = nphrases_df2['normalized_noun_phrase'].apply(lambda x: re.sub(r'[()]','',x))
nphrases_df2['normalized_noun_phrase'] = nphrases_df2['normalized_noun_phrase'].apply(lambda x: re.sub(r'Â','A',x))
nphrases_df2['normalized_noun_phrase'] = nphrases_df2['normalized_noun_phrase'].apply(lambda x: re.sub(r'%','',x))
nphrases_df2['normalized_noun_phrase'] = nphrases_df2['normalized_noun_phrase'].apply(lambda x: re.sub(r'\]','',x))
nphrases_df2['normalized_noun_phrase'] = nphrases_df2['normalized_noun_phrase'].apply(lambda x: re.sub(r'[/]','',x))  ## strip slashes
nphrases_df2['normalized_noun_phrase'] = nphrases_df2['normalized_noun_phrase'].apply(lambda x: re.sub(r'[+]','',x))  ## strip plus signs
nphrases_df2['normalized_noun_phrase'] = nphrases_df2['normalized_noun_phrase'].apply(lambda x: re.sub(r'[-]','',x))  ## strip hyphens
nphrases_df2['normalized_noun_phrase'] = nphrases_df2['normalized_noun_phrase'].apply(lambda x: re.sub(r'\d+','',x))

nphrases_df2['normalized_noun_phrase_count'] = nphrases_df2['normalized_noun_phrase'].apply(lambda x: len(x.replace(',',' ').split()))
idx = nphrases_df2[nphrases_df2['normalized_noun_phrase_count'] <=1].index
print(idx)

nphrases_df2.drop(nphrases_df2[nphrases_df2['normalized_noun_phrase_count'] <=1].index, inplace = True)
idx = nphrases_df2[nphrases_df2['normalized_noun_phrase_count'] <=1].index
print(idx)

nphrases_df2.drop(columns=['normalized_noun_phrase_count'],inplace=True)


Int64Index([     2,      7,      8,      9,     11,     15,     19,     20,
                22,     23,
            ...
            418434, 418437, 418439, 418440, 418443, 418445, 418446, 418452,
            418458, 418463],
           dtype='int64', length=200908)
Int64Index([], dtype='int64')


* Collect overall frequencies.

In [18]:

tmp=nphrases_df2.groupby(by='normalized_noun_phrase').size().to_frame('Overall_Frequencies')
tmp
nphrases_df2 = pd.merge(nphrases_df2,tmp,on='normalized_noun_phrase')
nphrases_df2

Unnamed: 0,title,doc_id,noun_phrase,normalized_noun_phrase,Overall_Frequencies
0,Accidents at work statistics,7,Accidents at work statistics,ACCIDENT WORK STATISTIC,1
1,Accidents at work statistics,7,Number of accidents,NUMBER ACCIDENT,14
2,Accidents at work statistics,7,number of accidents,NUMBER ACCIDENT,14
3,Accidents at work statistics,7,number of accidents,NUMBER ACCIDENT,14
4,Accidents at work statistics,7,number of accidents,NUMBER ACCIDENT,14
...,...,...,...,...,...
182535,Asylum statistics introduced,10539,references Regulation,REFERENCE REGULATION,1
182536,Asylum statistics introduced,10539,Asylum Procedures Directive,ASYLUM PROCEDURE DIRECTIVE,1
182537,Asylum statistics introduced,10539,Reception Conditions Directive,RECEPTION CONDITION DIRECTIVE,1
182538,Asylum statistics introduced,10539,EURODAC Regulation,EURODAC REGULATION,1


* Collect frequencies per document and drop duplicates.

In [19]:
nphrases_df2['Frequencies_per_doc']=nphrases_df2.groupby(['doc_id','normalized_noun_phrase'])['normalized_noun_phrase'].transform('count')
nphrases_df2.drop(columns=['noun_phrase'],inplace=True)
nphrases_df2.drop_duplicates(subset=['title','normalized_noun_phrase'], inplace=True, ignore_index=False)

nphrases_df2


Unnamed: 0,title,doc_id,normalized_noun_phrase,Overall_Frequencies,Frequencies_per_doc
0,Accidents at work statistics,7,ACCIDENT WORK STATISTIC,1,1
1,Accidents at work statistics,7,NUMBER ACCIDENT,14,7
8,Railway safety statistics in the EU,16,NUMBER ACCIDENT,14,2
10,Accidents at work ? statistics on causes and c...,2947,NUMBER ACCIDENT,14,1
11,Road safety statistics ? characteristics at na...,7156,NUMBER ACCIDENT,14,2
...,...,...,...,...,...
182535,Asylum statistics introduced,10539,REFERENCE REGULATION,1,1
182536,Asylum statistics introduced,10539,ASYLUM PROCEDURE DIRECTIVE,1,1
182537,Asylum statistics introduced,10539,RECEPTION CONDITION DIRECTIVE,1,1
182538,Asylum statistics introduced,10539,EURODAC REGULATION,1,1
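
The difference between the two frequency columns can be illustrated on a few made-up rows. This is a minimal sketch using `collections.Counter` instead of pandas; the data is hypothetical.

```python
from collections import Counter

# (doc_id, normalized_noun_phrase) pairs, one per occurrence (toy data)
rows = [
    (7, "NUMBER ACCIDENT"), (7, "NUMBER ACCIDENT"), (7, "CALENDAR DAY"),
    (16, "NUMBER ACCIDENT"),
]

overall = Counter(phrase for _, phrase in rows)  # occurrences across all documents
per_doc = Counter(rows)                          # occurrences within one document

# one row per (document, phrase), as after drop_duplicates
table = sorted(
    {(doc, phrase, overall[phrase], per_doc[(doc, phrase)]) for doc, phrase in rows}
)
for row in table:
    print(row)
```

"NUMBER ACCIDENT" has an overall frequency of 3, split into 2 occurrences in document 7 and 1 in document 16.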


### Unique noun phrases in SE articles

In [20]:
res = nphrases_df2.groupby(['normalized_noun_phrase']).size().to_frame('size').reset_index() ## unique noun phrases
res.drop(columns=['size'],inplace=True)
res

Unnamed: 0,normalized_noun_phrase
0,A LEVEL
1,A SINGLE PERSON
2,AASTERN EUROPEAN COUNTRY
3,ABBREVIATED NEET
4,ABBREVIATION ESA
...,...
57053,ZOOM BUTTON
57054,Ã LAND
57055,Ã LAND ISLAND
57056,Ã RDAL


### OECD - Glossary of Statistical Terms 
https://stats.oecd.org/glossary/alpha.asp

* Scrape terms and lemmatize.

In [21]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

url = "https://stats.oecd.org/glossary/alpha.asp"
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
text = soup.get_text()
    
rows = soup.find_all('tr')
str_cells = str(rows)
cleantext = BeautifulSoup(str_cells, "lxml").get_text()
#print(cleantext)

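list_rows = []
tag_pattern = re.compile('<.*?>')  # named tag_pattern so that the clean() function defined earlier is not overwritten
for row in rows:
    cells = row.find_all('a')
    str_cells = str(cells)
    clean2 = re.sub(tag_pattern, '', str_cells)  # strip the HTML tags, keep the link texts
    list_rows.append(clean2)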
#print(clean2)
#type(clean2)

df = pd.DataFrame(list_rows)
df.head(10)
df[0]=df[0].apply(lambda x: re.sub(r'\[',' ',x))
df[0]=df[0].apply(lambda x: re.sub(r'\]',' ',x))
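df1 = df[0].str.split(',', expand=True)  # one column per comma-separated link text
df_t = df1.T  # transpose: one column per table row of the page
df_t=df_t[[22]]  # keep only the table row (index 22) whose links are used as the glossary terms
df_t = df_t.rename(columns={22: 'normalized_noun_phrase'})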
nan_value = float("NaN")

df_t.replace(" ", nan_value, inplace=True)
df_t.dropna(subset = ["normalized_noun_phrase"], inplace=True)

df_t['normalized_noun_phrase']= df_t['normalized_noun_phrase'].apply(lambda x: x.lower())
df_t['normalized_noun_phrase']= df_t['normalized_noun_phrase'].apply(lemmatize_text)
df_t['normalized_noun_phrase']= [' '.join(map(str, l)) for l in df_t['normalized_noun_phrase']]
df_t['normalized_noun_phrase']= df_t['normalized_noun_phrase'].apply(lambda x: x.upper())

df_t.reset_index(drop=True, inplace=True)
df_t.index.rename('id', inplace=True)
df_t.head()

Unnamed: 0_level_0,normalized_noun_phrase
id,Unnamed: 1_level_1
0,POSTERIORI AUDIT
1,PRIORI AUDIT
2,PROGRAMME LANGUAGE
3,ABATEMENT
4,ABATEMENT COST


### Find matches per unique noun phrase

* Column 'Common' holds, for each unique noun phrase from the SE articles, a list of matches. Each match is a list: (position of the matching OECD term, the full normalized OECD term, the set of words in common).
* Columns 'len_intersect' and 'len_union' hold the sizes of the intersection and the union of the two word sets, which are used to calculate the Jaccard similarities.
* Column 'Jaccard' holds the corresponding Jaccard similarities, i.e. len_intersect / len_union.
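
The Jaccard similarity of two word sets A and B is |A ∩ B| / |A ∪ B|. A minimal sketch for a single pair of normalized phrases; the function name `jaccard` is an illustration and does not appear in the notebook.

```python
def jaccard(phrase_a, phrase_b):
    """Return (common words, |intersection|, |union|, Jaccard similarity)."""
    a = set(phrase_a.split())
    b = set(phrase_b.split())
    common = a & b
    union = a | b
    return common, len(common), len(union), len(common) / len(union)

common, n_common, n_union, score = jaccard("NUMBER ACCIDENT", "COMMUTING ACCIDENT")
print(common, n_common, n_union, score)
```

The two phrases share one word out of three distinct words, so the similarity is 1/3.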


In [22]:
from nltk.corpus import stopwords
stop = stopwords.words('english')

res['Common']=[[] for i in range(len(res))]
res['len_intersect']=[[] for i in range(len(res))]
res['len_union']=[[] for i in range(len(res))]
res['Jaccard']=[[] for i in range(len(res))]

search_in = df_t['normalized_noun_phrase'].apply(lambda x: x.split(' '))
for i in range(len(res)):
    ## word set of the noun phrase (not named np, which is the numpy alias imported earlier)
    phrase_words = res.loc[i,'normalized_noun_phrase'].strip().split(' ')
    phrase_words = [el for el in phrase_words if not el.lower() in stop] ## excluding individual words which are stop-words
    if len(phrase_words) <= 1 : continue
    phrase_words = set(phrase_words)
    if (i+1) % 1000==0:
        print(i+1,' of ',len(res),' unique noun phrases: set of terms: ',phrase_words)
    for (j,x) in enumerate(search_in):
        set_2 = set(x)
        common = phrase_words.intersection(set_2)
        uni = phrase_words.union(set_2)
        if len(common) > 0:
            #print(i+1,' of ',len(res), np, ' with ',df_t.loc[j,'normalized_noun_phrase'],' : ',common)
            #res.loc[i,'Common'].append(list((df_t.loc[j,'id'],df_t.loc[j,'normalized_noun_phrase'],common)))
            res.loc[i,'Common'].append(list((df_t.index.get_loc(j),df_t.loc[j,'normalized_noun_phrase'],common)))
            res.loc[i,'len_intersect'].append(len(common))
            res.loc[i,'len_union'].append(len(uni))
            res.loc[i,'Jaccard'].append(len(common)/len(uni))
            
res            

1000  of  57058  unique noun phrases: set of terms:  {'FUND', 'ADMINISTRATION', 'LAND', 'MUNICIPALITY'}
2000  of  57058  unique noun phrases: set of terms:  {'REVEALS', 'STATE', 'ANALYSIS', 'MEMBER', 'EU'}
3000  of  57058  unique noun phrases: set of terms:  {'LIKE', 'ASSET', 'RESEARCH'}
4000  of  57058  unique noun phrases: set of terms:  {'RENEWAL', 'BC'}
5000  of  57058  unique noun phrases: set of terms:  {'MONITORING', 'SCHEME', 'BUTTERFLY'}
6000  of  57058  unique noun phrases: set of terms:  {'CERTAIN', 'VARIABLE'}
7000  of  57058  unique noun phrases: set of terms:  {'CLOSING', 'VALUE', 'STOCK'}
8000  of  57058  unique noun phrases: set of terms:  {'COMPONENT', 'GFCF'}
9000  of  57058  unique noun phrases: set of terms:  {'IRELAND', 'CONTRACTION'}
10000  of  57058  unique noun phrases: set of terms:  {'CROATIA', 'LUXEMBOURG'}
11000  of  57058  unique noun phrases: set of terms:  {'SECURITY', 'DATA'}
12000  of  57058  unique noun phrases: set of terms:  {'QUALITY', 'DENSITY'}
13

Unnamed: 0,normalized_noun_phrase,Common,len_intersect,len_union,Jaccard
0,A LEVEL,[],[],[],[]
1,A SINGLE PERSON,"[[540, SINGLE DEFINITION, {SINGLE}], [1137, CO...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[3, 5, 4, 3, 3, 3, 3, 3, 3, 4, 3, 2, 3, 4, 5, ...","[0.3333333333333333, 0.2, 0.25, 0.333333333333..."
2,AASTERN EUROPEAN COUNTRY,"[[268, APPLICANT COUNTRY, {COUNTRY}], [298, CO...","[1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[4, 3, 4, 4, 5, 7, 5, 4, 3, 6, 4, 4, 4, 5, 4, ...","[0.25, 0.3333333333333333, 0.25, 0.25, 0.4, 0...."
3,ABBREVIATED NEET,[],[],[],[]
4,ABBREVIATION ESA,"[[201, ALLOCATION PRIMARY INCOME ACCOUNT ESA, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[6, 4, 4, 6, 6, 4, 3, 3, 5, 3, 3, 4, 4, 6, 4, ...","[0.16666666666666666, 0.25, 0.25, 0.1666666666..."
...,...,...,...,...,...
57053,ZOOM BUTTON,[],[],[],[]
57054,Ã LAND,"[[163, AGRICULTURAL LAND, {LAND}], [164, AGRIC...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[3, 5, 5, 5, 3, 3, 5, 4, 3, 6, 6, 3, 3, 3, 3, ...","[0.3333333333333333, 0.2, 0.2, 0.2, 0.33333333..."
57055,Ã LAND ISLAND,"[[163, AGRICULTURAL LAND, {LAND}], [164, AGRIC...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[4, 6, 6, 6, 4, 4, 6, 5, 4, 7, 7, 4, 4, 4, 4, ...","[0.25, 0.16666666666666666, 0.1666666666666666..."
57056,Ã RDAL,[],[],[],[]


In [23]:
res2 = pd.merge(nphrases_df2,res,on='normalized_noun_phrase')
res2


Unnamed: 0,title,doc_id,normalized_noun_phrase,Overall_Frequencies,Frequencies_per_doc,Common,len_intersect,len_union,Jaccard
0,Accidents at work statistics,7,ACCIDENT WORK STATISTIC,1,1,"[[7, ABSENCE WORK DUE ILLNESS, {WORK}], [959, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[6, 6, 4, 4, 5, 4, 4, 4, 5, 4, 4, 4, 5, 4, 7, ...","[0.16666666666666666, 0.16666666666666666, 0.2..."
1,Accidents at work statistics,7,NUMBER ACCIDENT,14,7,"[[976, COMMUTING ACCIDENT, {ACCIDENT}], [2974,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[3, 2, 3, 4, 5, 5, 4, 5, 4, 4, 6, 4, 3, 2, 4, ...","[0.3333333333333333, 0.5, 0.3333333333333333, ..."
2,Railway safety statistics in the EU,16,NUMBER ACCIDENT,14,2,"[[976, COMMUTING ACCIDENT, {ACCIDENT}], [2974,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[3, 2, 3, 4, 5, 5, 4, 5, 4, 4, 6, 4, 3, 2, 4, ...","[0.3333333333333333, 0.5, 0.3333333333333333, ..."
3,Accidents at work ? statistics on causes and c...,2947,NUMBER ACCIDENT,14,1,"[[976, COMMUTING ACCIDENT, {ACCIDENT}], [2974,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[3, 2, 3, 4, 5, 5, 4, 5, 4, 4, 6, 4, 3, 2, 4, ...","[0.3333333333333333, 0.5, 0.3333333333333333, ..."
4,Road safety statistics ? characteristics at na...,7156,NUMBER ACCIDENT,14,2,"[[976, COMMUTING ACCIDENT, {ACCIDENT}], [2974,...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[3, 2, 3, 4, 5, 5, 4, 5, 4, 4, 6, 4, 3, 2, 4, ...","[0.3333333333333333, 0.5, 0.3333333333333333, ..."
...,...,...,...,...,...,...,...,...,...
110226,Asylum statistics introduced,10539,REFERENCE REGULATION,1,1,"[[118, ADMINISTRATIVE REGULATION, {REGULATION}...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[3, 5, 5, 4, 3, 4, 4, 3, 4, 3, 3, 3, 3, 3, 4, ...","[0.3333333333333333, 0.2, 0.2, 0.25, 0.3333333..."
110227,Asylum statistics introduced,10539,ASYLUM PROCEDURE DIRECTIVE,1,1,"[[1886, EDITING PROCEDURE, {PROCEDURE}], [2200...","[1, 1, 1, 1, 1, 1]","[4, 5, 9, 5, 5, 4]","[0.25, 0.2, 0.1111111111111111, 0.2, 0.2, 0.25]"
110228,Asylum statistics introduced,10539,RECEPTION CONDITION DIRECTIVE,1,1,"[[89, ACUTE HEALTH CONDITION, {CONDITION}], [8...","[1, 1, 1, 1, 1]","[5, 5, 5, 5, 8]","[0.2, 0.2, 0.2, 0.2, 0.125]"
110229,Asylum statistics introduced,10539,EURODAC REGULATION,1,1,"[[118, ADMINISTRATIVE REGULATION, {REGULATION}...","[1, 1, 1, 1, 1, 1, 1, 1]","[3, 4, 3, 3, 3, 2, 3, 3]","[0.3333333333333333, 0.25, 0.3333333333333333,..."


In [24]:
outfile = file_name('SE_vs_OECD_Glossary_Noun_Phrases','xlsx')
res2.to_excel(outfile)