# Produce inputs for Power BI
***

### Revised (January 2022) to read all data from the database

* Produces the following inputs to Power BI:  
    * **SE_df_for_PowerBI.xlsx**.
    * **OECD_content_for_PowerBI.xlsx**.

In [1]:
import pandas as pd
import numpy as np

import pyodbc

import gensim


### The data cleansing function

In [2]:
import re
import unicodedata as ud

def clean(x, quotes=True):
    if pd.isnull(x): return x  
    x = x.strip()
    
    ## letter, question mark, letter -> letter, quote, space, letter (call with quotes=False for lists of URLs)
    if quotes:
        x = re.sub(r'([A-Za-z])\?([A-Za-z])','\\1\' \\2',x) ## NEW
    
    ## letter, question mark, space, lower-case letter -> letter, quote, space, letter
    x = re.sub(r'([A-Za-z])\? ([a-z])','\\1\' \\2',x) ## NEW

    ## delete ,000 commas in numbers    
    x = re.sub(r'\b(\d+),(\d+)\b','\\1\\2',x) ## CORRECTED
    
    ## delete  000 spaces in numbers
    x = re.sub(r'\b(\d+) (\d+)\b','\\1\\2',x) ## CORRECTED
    
    ## collapse repeated spaces into one
    x = re.sub(r' +', ' ',x)
    
    ## remove start and end spaces
    x = re.sub(r'^ +| +$', '',x,flags=re.MULTILINE) 
    
    ## space-comma -> comma
    x = re.sub(r' \,',',',x)
    
    ## space-dot -> dot
    x = re.sub(r' \.','.',x)
    
    #x = x.encode('latin1').decode('utf-8') ## â\x80\x99
    x = ud.normalize('NFKD',x).encode('ascii', 'ignore').decode()
    
    return x
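
To make the individual cleansing steps concrete, the sketch below replays a few of them on made-up strings, using only the standard library. The sample strings are invented for illustration; the patterns are copied from `clean`.

```python
import re
import unicodedata as ud

# Made-up sample strings; the patterns are the ones used in clean().
s1 = "Europe?s railways"               # apostrophe lost to a question mark
s2 = "about 3,100 accidents"           # thousands separator
s3 = "Caf\u00e9  prices , rising ."    # accent, double space, space before punctuation

# letter ? letter -> letter ' space letter
step1 = re.sub(r'([A-Za-z])\?([A-Za-z])', '\\1\' \\2', s1)

# drop the comma between two digit groups
step2 = re.sub(r'\b(\d+),(\d+)\b', '\\1\\2', s2)

# collapse spaces, fix punctuation spacing, then fold to ASCII
step3 = re.sub(r' +', ' ', s3)
step3 = re.sub(r' \,', ',', step3)
step3 = re.sub(r' \.', '.', step3)
step3 = ud.normalize('NFKD', step3).encode('ascii', 'ignore').decode()

# a number with two separators only loses the first one in a single pass
step4 = re.sub(r'\b(\d+),(\d+)\b', '\\1\\2', "1,234,567")

print(step1)   # Europe' s railways
print(step2)   # about 3100 accidents
print(step3)   # Cafe prices, rising.
print(step4)   # 1234,567
```

The last line shows a limitation worth keeping in mind: matches do not overlap, so a number with two separators keeps its second comma after one pass.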

### Import Statistics Explained data from the database

* Id, context and last update from table dat_article.  
* Title and url from table dat_link_info, on matching id and resource_information_id=1 (i.e. Eurostat).
* Abstract from field content in table dat_article_paragraph, on matching article_id and abstract=1 ("yes").
* Apply data cleansing.
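
The join logic described above can be tried without database access. The sketch below is a toy stand-in: it builds the three tables in an in-memory SQLite database (not Virtuoso), with invented rows and without the `ESTAT.V1` schema prefix, and runs the same two inner joins and filters.

```python
import sqlite3

# Toy in-memory stand-in for the three tables; all rows are invented.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dat_article (id INTEGER, context TEXT, last_update TEXT);
CREATE TABLE dat_link_info (id INTEGER, title TEXT, url TEXT,
                            resource_information_id INTEGER);
CREATE TABLE dat_article_paragraph (article_id INTEGER, title TEXT,
                                    content TEXT, abstract INTEGER);

INSERT INTO dat_article VALUES (1, 'ctx A', '2021-01-01');
INSERT INTO dat_article VALUES (2, 'ctx B', '2021-02-01');
INSERT INTO dat_link_info VALUES (1, 'Article A', 'http://example.org/a', 1);
INSERT INTO dat_link_info VALUES (2, 'Article B', 'http://example.org/b', 2);
INSERT INTO dat_article_paragraph VALUES (1, NULL, 'Abstract of A', 1);
INSERT INTO dat_article_paragraph VALUES (1, 'Section 1', 'Body of A', 0);
INSERT INTO dat_article_paragraph VALUES (2, NULL, 'Abstract of B', 1);
""")

rows = con.execute("""
    SELECT T1.id, T2.title, T3.content
    FROM dat_article AS T1
    INNER JOIN dat_link_info AS T2 ON T1.id = T2.id
    INNER JOIN dat_article_paragraph AS T3 ON T2.id = T3.article_id
    WHERE T2.resource_information_id = 1 AND T3.abstract = 1
""").fetchall()
con.close()

print(rows)   # [(1, 'Article A', 'Abstract of A')]
```

Article 2 drops out because its link has a different `resource_information_id`, and the non-abstract paragraph of article 1 is filtered by `abstract = 1`.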


In [3]:


c = pyodbc.connect('DSN=Virtuoso All;DBA=ESTAT;UID=xxxxx;PWD=xxxxx')
cursor = c.cursor()

SQLCommand = """SELECT T1.id, T1.context, T1.last_update, T2.title, T2.url, T3.content 
                FROM ESTAT.V1.dat_article as T1 
                INNER JOIN ESTAT.V1.dat_link_info as T2  
                  ON T1.id=T2.id  
                INNER JOIN ESTAT.V1.dat_article_paragraph as T3  
                  ON T2.id=T3.article_id  
                WHERE T2.resource_information_id=1 AND T3.abstract=1"""

SE_df = pd.read_sql(SQLCommand,c)
SE_df.rename(columns={'content':'abstract'},inplace=True)
SE_df = SE_df[['id','context','title','abstract','url','last_update']]

SE_df['context'] = SE_df['context'].apply(clean)
SE_df['title'] = SE_df['title'].apply(clean)
SE_df['abstract'] = SE_df['abstract'].apply(clean)

SE_df.head(5)


Unnamed: 0,id,context,title,abstract,url,last_update
0,7,"A safe, healthy working environment is a cruci...",Accidents at work statistics,This article presents a set of main statistica...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-26 16:06:00
1,13,"European institutions, governments, central ba...",National accounts and GDP,National accounts are the source for a multitu...,https://ec.europa.eu/eurostat/statistics-expla...,2021-06-28 16:29:00
2,16,National rail networks have different technica...,Railway safety statistics in the EU,"In 2019, 1516 significant railway accidents we...",https://ec.europa.eu/eurostat/statistics-expla...,2021-06-25 18:31:00
3,17,The content of this statistical article is bas...,Railway freight transport statistics,This article focuses on recent rail freight tr...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-27 18:19:00
4,18,The content of this statistical article is bas...,Railway passenger transport statistics - quart...,This article takes a look at recent annual and...,https://ec.europa.eu/eurostat/statistics-expla...,2021-07-07 10:30:00


### Paragraph titles and contents

* From the dat_article_paragraph table with abstract=0 and matching article_id.
* Apply data cleansing.

In [4]:
SQLCommand = """SELECT article_id, title, content 
                FROM ESTAT.V1.dat_article_paragraph
                WHERE abstract=0 AND article_id IN (SELECT id FROM ESTAT.V1.dat_article) """

add_content = pd.read_sql(SQLCommand,c)
add_content.sort_values(by=['article_id'],inplace=True)
add_content['title'] = add_content['title'].apply(clean)
add_content['content'] = add_content['content'].apply(clean)
add_content.head(5)

Unnamed: 0,article_id,title,content
9,7,Number of accidents,"In 2018, there were 3.1 million non-fatal acci..."
10,7,Incidence rates,An alternative way to analyse the information ...
11,7,Standardised incidence rates,"When comparing data between countries, inciden..."
12,7,Analysis by activity,"As noted above, one of the main reasons why th..."
13,7,Analysis by type of injury,Figure 6 presents an analysis of data accordin...


### Aggregate the above paragraph titles and contents  

* Create a column _raw content_ that concatenates all paragraph titles and contents into a single text per article.
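
As a plain-Python illustration of what this aggregation produces (independent of pandas), the sketch below concatenates a few invented paragraph rows per article in the same layout as the loop in this notebook: space, title, space, content.

```python
from collections import defaultdict

# Invented (article_id, title, content) rows; the contents are placeholders.
paragraphs = [
    (7, "Number of accidents", "Text A1"),
    (7, "Incidence rates", "Text A2"),
    (13, "Household consumption", "Text B1"),
]

raw_content = defaultdict(str)
for article_id, title, content in paragraphs:
    # Same layout as the notebook loop: ' ' + title + ' ' + content
    raw_content[article_id] += " " + title + " " + content

print(dict(raw_content))
```

With this layout every text starts with a space; a final `.strip()` would remove it if that matters downstream.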

In [5]:
add_content_grouped = add_content.groupby(['article_id'])[['title','content']].aggregate(lambda x: list(x))
add_content_grouped.reset_index(drop=False, inplace=True)
for i in range(len(add_content_grouped)):
    add_content_grouped.loc[i,'raw content'] = ''
    for (a,b) in zip(add_content_grouped.loc[i,'title'],add_content_grouped.loc[i,'content']):
        add_content_grouped.loc[i,'raw content'] += ' '+a + ' ' + b
add_content_grouped = add_content_grouped[['article_id','raw content']]    

add_content_grouped.head(5)

Unnamed: 0,article_id,raw content
0,7,"Number of accidents In 2018, there were 3.1 m..."
1,13,Household consumption Consumption expenditure...
2,16,Suicides on railways Suicides occurring on th...
3,17,Geographical location plays a key role in the...
4,18,Number of passengers transported by rail incr...


### Merge the raw content of the SE articles with the main file


In [6]:
SE_df = pd.merge(SE_df,add_content_grouped,left_on='id',right_on='article_id',how='inner')
SE_df.drop(['article_id'],axis=1,inplace=True)

SE_df.head(5)

Unnamed: 0,id,context,title,abstract,url,last_update,raw content
0,7,"A safe, healthy working environment is a cruci...",Accidents at work statistics,This article presents a set of main statistica...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-26 16:06:00,"Number of accidents In 2018, there were 3.1 m..."
1,13,"European institutions, governments, central ba...",National accounts and GDP,National accounts are the source for a multitu...,https://ec.europa.eu/eurostat/statistics-expla...,2021-06-28 16:29:00,Household consumption Consumption expenditure...
2,16,National rail networks have different technica...,Railway safety statistics in the EU,"In 2019, 1516 significant railway accidents we...",https://ec.europa.eu/eurostat/statistics-expla...,2021-06-25 18:31:00,Suicides on railways Suicides occurring on th...
3,17,The content of this statistical article is bas...,Railway freight transport statistics,This article focuses on recent rail freight tr...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-27 18:19:00,Geographical location plays a key role in the...
4,18,The content of this statistical article is bas...,Railway passenger transport statistics - quart...,This article takes a look at recent annual and...,https://ec.europa.eu/eurostat/statistics-expla...,2021-07-07 10:30:00,Number of passengers transported by rail incr...


### Related links

* From the dat_article_shared_link table with article_division_id=2 ("Other articles", see the mod_article_division table).  
* link_id points to id in dat_link_info (where we select resource_information_id=1).
* Apply data cleansing, with an additional step that replaces any remaining question marks in the related titles with hyphens.


In [7]:
SQLCommand = """SELECT T1.article_id, T1.link_id, T2.title, T2.url 
                FROM dat_article_shared_link as T1 
                INNER JOIN ESTAT.V1.dat_link_info as T2  
                  ON T1.link_id=T2.id  
                WHERE T1.article_division_id=2 AND T2.resource_information_id=1
                ORDER BY T1.article_id, T1.link_id """

add_related_links = pd.read_sql(SQLCommand,c)
add_related_links['title'] = add_related_links['title'].apply(clean)
add_related_links['title'] = add_related_links['title'].apply(lambda x: re.sub(r'\?','-',x))
add_related_links.head(5)

Unnamed: 0,article_id,link_id,title,url
0,7,229,Health in the European Union a facts and figures,https://ec.europa.eu/eurostat/statistics-expla...
1,7,1157,Health statistics introduced,https://ec.europa.eu/eurostat/statistics-expla...
2,7,2914,Accidents and injuries statistics,https://ec.europa.eu/eurostat/statistics-expla...
3,7,2946,Accidents at work - statistics by economic act...,https://ec.europa.eu/eurostat/statistics-expla...
4,7,2947,Accidents at work - statistics on causes and c...,https://ec.europa.eu/eurostat/statistics-expla...


### Aggregate above by article id

* Aggregate the related titles and URLs into one list each per article.

In [8]:
add_related_grouped = pd.DataFrame(add_related_links.groupby(['article_id'])[['title','url']].aggregate(lambda x: list(x)))
add_related_grouped.reset_index(drop=False, inplace=True)
add_related_grouped.rename(columns={'title':'related_titles','url':'related_urls'},inplace=True)
add_related_grouped.head(5)



Unnamed: 0,article_id,related_titles,related_urls
0,7,[Health in the European Union a facts and figu...,[https://ec.europa.eu/eurostat/statistics-expl...
1,13,"[Sector accounts, European system of national ...",[https://ec.europa.eu/eurostat/statistics-expl...
2,16,"[Railway freight transport statistics, Railway...",[https://ec.europa.eu/eurostat/statistics-expl...
3,17,"[Transport statistics at regional level, All a...",[https://ec.europa.eu/eurostat/statistics-expl...
4,18,"[Railway freight transport statistics, Freight...",[https://ec.europa.eu/eurostat/statistics-expl...


### Merge above with the main file


In [9]:
SE_df = pd.merge(SE_df,add_related_grouped,left_on='id',right_on='article_id',how='inner')
SE_df.drop(['article_id'],axis=1,inplace=True)


SE_df.head(5)

Unnamed: 0,id,context,title,abstract,url,last_update,raw content,related_titles,related_urls
0,7,"A safe, healthy working environment is a cruci...",Accidents at work statistics,This article presents a set of main statistica...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-26 16:06:00,"Number of accidents In 2018, there were 3.1 m...",[Health in the European Union a facts and figu...,[https://ec.europa.eu/eurostat/statistics-expl...
1,13,"European institutions, governments, central ba...",National accounts and GDP,National accounts are the source for a multitu...,https://ec.europa.eu/eurostat/statistics-expla...,2021-06-28 16:29:00,Household consumption Consumption expenditure...,"[Sector accounts, European system of national ...",[https://ec.europa.eu/eurostat/statistics-expl...
2,16,National rail networks have different technica...,Railway safety statistics in the EU,"In 2019, 1516 significant railway accidents we...",https://ec.europa.eu/eurostat/statistics-expla...,2021-06-25 18:31:00,Suicides on railways Suicides occurring on th...,"[Railway freight transport statistics, Railway...",[https://ec.europa.eu/eurostat/statistics-expl...
3,17,The content of this statistical article is bas...,Railway freight transport statistics,This article focuses on recent rail freight tr...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-27 18:19:00,Geographical location plays a key role in the...,"[Transport statistics at regional level, All a...",[https://ec.europa.eu/eurostat/statistics-expl...
4,18,The content of this statistical article is bas...,Railway passenger transport statistics - quart...,This article takes a look at recent annual and...,https://ec.europa.eu/eurostat/statistics-expla...,2021-07-07 10:30:00,Number of passengers transported by rail incr...,"[Railway freight transport statistics, Freight...",[https://ec.europa.eu/eurostat/statistics-expl...


### Read categories from the database


In [10]:
import ast

SQLCommand = """SELECT article_id, categories 
                FROM ESTAT.V1.SE_articles_categories 
             """

categories = pd.read_sql(SQLCommand,c)
categories['categories']=categories['categories'].apply(ast.literal_eval)
categories

Unnamed: 0,article_id,categories
0,7,"[Accidents at work, Health, Health and safety,..."
1,13,"[National accounts (incl. GDP), Statistical ar..."
2,16,"[Rail, Statistical article, Transport, Transpo..."
3,17,"[Freight, Rail, Statistical article, Transport]"
4,18,"[Passengers, Rail, Statistical article, Transp..."
...,...,...
600,9472,"[International trade, Trade in goods, Trade in..."
601,9477,"[Trade in goods, Statistical article]"
602,9479,"[Trade in goods, Statistical article, Internat..."
603,9492,"[Household composition and family situation, L..."
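
The `categories` field is parsed with `ast.literal_eval`, which suggests it is stored as the text form of a Python list. A minimal sketch of that parsing step follows; `parse_categories` is a hypothetical helper, not part of the notebook, and falling back to an empty list on bad input is an assumption.

```python
import ast

def parse_categories(text):
    """Parse the text form of a list; return [] if it is not a valid literal."""
    try:
        value = ast.literal_eval(text)   # evaluates literals only, runs no code
    except (ValueError, SyntaxError):
        return []
    return value if isinstance(value, list) else []

stored = "['Freight', 'Rail', 'Statistical article', 'Transport']"
parsed = parse_categories(stored)
rejected = parse_categories("__import__('os')")   # not a literal, so rejected

print(parsed)     # ['Freight', 'Rail', 'Statistical article', 'Transport']
print(rejected)   # []
```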


### Merge with the main file


In [11]:
SE_df = pd.merge(SE_df,categories,left_on='id',right_on='article_id')
SE_df.drop(columns=['article_id'],inplace=True)

SE_df.head(5)

Unnamed: 0,id,context,title,abstract,url,last_update,raw content,related_titles,related_urls,categories
0,7,"A safe, healthy working environment is a cruci...",Accidents at work statistics,This article presents a set of main statistica...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-26 16:06:00,"Number of accidents In 2018, there were 3.1 m...",[Health in the European Union a facts and figu...,[https://ec.europa.eu/eurostat/statistics-expl...,"[Accidents at work, Health, Health and safety,..."
1,13,"European institutions, governments, central ba...",National accounts and GDP,National accounts are the source for a multitu...,https://ec.europa.eu/eurostat/statistics-expla...,2021-06-28 16:29:00,Household consumption Consumption expenditure...,"[Sector accounts, European system of national ...",[https://ec.europa.eu/eurostat/statistics-expl...,"[National accounts (incl. GDP), Statistical ar..."
2,16,National rail networks have different technica...,Railway safety statistics in the EU,"In 2019, 1516 significant railway accidents we...",https://ec.europa.eu/eurostat/statistics-expla...,2021-06-25 18:31:00,Suicides on railways Suicides occurring on th...,"[Railway freight transport statistics, Railway...",[https://ec.europa.eu/eurostat/statistics-expl...,"[Rail, Statistical article, Transport, Transpo..."
3,17,The content of this statistical article is bas...,Railway freight transport statistics,This article focuses on recent rail freight tr...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-27 18:19:00,Geographical location plays a key role in the...,"[Transport statistics at regional level, All a...",[https://ec.europa.eu/eurostat/statistics-expl...,"[Freight, Rail, Statistical article, Transport]"
4,18,The content of this statistical article is bas...,Railway passenger transport statistics - quart...,This article takes a look at recent annual and...,https://ec.europa.eu/eurostat/statistics-expla...,2021-07-07 10:30:00,Number of passengers transported by rail incr...,"[Railway freight transport statistics, Freight...",[https://ec.europa.eu/eurostat/statistics-expl...,"[Passengers, Rail, Statistical article, Transp..."


### Extract last update year

* And check missing values.

In [12]:
SE_df['new_date'] = [d.date() for d in SE_df['last_update']]  
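SE_df['year'] = SE_df['last_update'].dt.year
## label missing years first: astype(str) would turn NaN into the string 'nan', which fillna cannot catch
SE_df['year'] = SE_df['year'].apply(lambda y: 'Not found' if pd.isnull(y) else str(int(y)))

SE_df.replace('', np.nan, inplace=True)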

print(SE_df.isnull().sum(),'\n')

SE_df.reset_index(drop=True,inplace=True)
SE_df.head(5)

id                 0
context           59
title              0
abstract           9
url                0
last_update        0
raw content        0
related_titles     0
related_urls       0
categories         0
new_date           0
year               0
dtype: int64 



Unnamed: 0,id,context,title,abstract,url,last_update,raw content,related_titles,related_urls,categories,new_date,year
0,7,"A safe, healthy working environment is a cruci...",Accidents at work statistics,This article presents a set of main statistica...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-26 16:06:00,"Number of accidents In 2018, there were 3.1 m...",[Health in the European Union a facts and figu...,[https://ec.europa.eu/eurostat/statistics-expl...,"[Accidents at work, Health, Health and safety,...",2020-11-26,2020
1,13,"European institutions, governments, central ba...",National accounts and GDP,National accounts are the source for a multitu...,https://ec.europa.eu/eurostat/statistics-expla...,2021-06-28 16:29:00,Household consumption Consumption expenditure...,"[Sector accounts, European system of national ...",[https://ec.europa.eu/eurostat/statistics-expl...,"[National accounts (incl. GDP), Statistical ar...",2021-06-28,2021
2,16,National rail networks have different technica...,Railway safety statistics in the EU,"In 2019, 1516 significant railway accidents we...",https://ec.europa.eu/eurostat/statistics-expla...,2021-06-25 18:31:00,Suicides on railways Suicides occurring on th...,"[Railway freight transport statistics, Railway...",[https://ec.europa.eu/eurostat/statistics-expl...,"[Rail, Statistical article, Transport, Transpo...",2021-06-25,2021
3,17,The content of this statistical article is bas...,Railway freight transport statistics,This article focuses on recent rail freight tr...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-27 18:19:00,Geographical location plays a key role in the...,"[Transport statistics at regional level, All a...",[https://ec.europa.eu/eurostat/statistics-expl...,"[Freight, Rail, Statistical article, Transport]",2020-11-27,2020
4,18,The content of this statistical article is bas...,Railway passenger transport statistics - quart...,This article takes a look at recent annual and...,https://ec.europa.eu/eurostat/statistics-expla...,2021-07-07 10:30:00,Number of passengers transported by rail incr...,"[Railway freight transport statistics, Freight...",[https://ec.europa.eu/eurostat/statistics-expl...,"[Passengers, Rail, Statistical article, Transp...",2021-07-07,2021


### Add themes / sub-themes information in the articles

* The dictionary _themes_ is created manually.
* It includes some artificial entries (theme: 'Other') to match some of the OECD Glossary themes.
* Each article gets a list of themes and a list of corresponding sub-themes, either of which may be empty. If one of the article's categories is a key of _themes_, that theme is added to the first list. If a category appears among the values of _themes_, i.e. it is a sub-theme, the corresponding key (theme) is added to the first list and the sub-theme to the second.
* Relatively few articles lack this information, see below.
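
The matching rule in the third bullet can be sketched as a small standalone function. `assign_themes` is a hypothetical helper written for illustration, and the dictionary is a two-entry excerpt of _themes_.

```python
# Two-entry excerpt of the themes dictionary, enough for the example.
themes = {
    'Population and social conditions': ['Health', 'Labour market'],
    'Transport': [],
}

def assign_themes(categories, themes):
    """Return (themes, sub_themes) as sets for one article's category list."""
    found_themes, found_sub_themes = set(), set()
    for cat in (c.strip() for c in categories):
        if cat in themes:                      # the category is itself a theme
            found_themes.add(cat)
            continue
        for theme, sub_themes in themes.items():
            if cat in sub_themes:              # the category is a sub-theme
                found_themes.add(theme)
                found_sub_themes.add(cat)
    return found_themes, found_sub_themes

t1, s1 = assign_themes(['Accidents at work', 'Health', 'Labour market'], themes)
t2, s2 = assign_themes(['Freight', 'Rail', 'Statistical article', 'Transport'], themes)

print(';'.join(sorted(t1)), '|', ';'.join(sorted(s1)))
print(';'.join(sorted(t2)), '|', ';'.join(sorted(s2)))
```

Sorting before joining is a choice made here, not something the notebook does: set iteration order is not guaranteed to be stable across runs, so sorting keeps the joined strings reproducible.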


In [13]:
import ast

themes = {'General and regional statistics/EU policies':
          ['Non-EU countries','Regions and cities','Sustainable development goals',
          'Policy indicators'],
          'Economy and finance': 
          ['Balance of payments','Comparative price levels (PPPs)','Consumer prices',
           'Exchange rates and interest rates','Government finance','National accounts (incl. GDP)'],
          'Population and social conditions':
          ['Asylum and migration','Crime','Culture','Education and training','Health',
           'Labour market','Living conditions','Population','Social protection','Sport','Youth'],
          'Industry and services': ['Short-term business statistics','Structural business statistics',
                                    'Business registers','Globalisation in businesses','Production statistics',
                                    'Tourism'],
          'Agriculture, forestry and fisheries':['Agriculture','Fisheries','Forestry'],
          'International trade':['Goods','Services'],
          'Transport':[],
          'Environment and energy':['Energy','Environment'],
          'Science, technology and digital society':['Digital economy and society','Science and technology'],
          'Other':['Methodology','Other']}

SE_df['themes'] = pd.Series([set() for i in range(len(SE_df))])
SE_df['sub_themes'] = pd.Series([set() for i in range(len(SE_df))])
for i in range(len(SE_df)):
    
    cats=SE_df.loc[i,'categories']
    cats = [cat.strip() for cat in cats]

    for cat in cats:
        if cat in themes.keys():
            SE_df.loc[i,'themes'].add(cat)
        else:
            for theme in themes.keys():
                if cat in themes[theme]:
                    SE_df.loc[i,'themes'].add(theme)
                    SE_df.loc[i,'sub_themes'].add(cat)
    
SE_df['themes'] = SE_df['themes'].apply(lambda x: ';'.join(x))    
SE_df['sub_themes'] = SE_df['sub_themes'].apply(lambda x: ';'.join(x))    

#SE_df['categories']= SE_df['categories'].apply(lambda x: ';'.join(x))  ## uncomment to produce the input file for R Shiny, 
## i.e. categories joined by semicolons instead of kept as a list

print(SE_df.isnull().sum(),'\n')

print('No info in themes: ',sum(SE_df['themes']==''))
print('No info in sub_themes: ',sum(SE_df['sub_themes']==''))


SE_df.head(5)

id                 0
context           59
title              0
abstract           9
url                0
last_update        0
raw content        0
related_titles     0
related_urls       0
categories         0
new_date           0
year               0
themes             0
sub_themes         0
dtype: int64 

No info in themes:  48
No info in sub_themes:  83


Unnamed: 0,id,context,title,abstract,url,last_update,raw content,related_titles,related_urls,categories,new_date,year,themes,sub_themes
0,7,"A safe, healthy working environment is a cruci...",Accidents at work statistics,This article presents a set of main statistica...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-26 16:06:00,"Number of accidents In 2018, there were 3.1 m...",[Health in the European Union a facts and figu...,[https://ec.europa.eu/eurostat/statistics-expl...,"[Accidents at work, Health, Health and safety,...",2020-11-26,2020,Population and social conditions,Health;Labour market
1,13,"European institutions, governments, central ba...",National accounts and GDP,National accounts are the source for a multitu...,https://ec.europa.eu/eurostat/statistics-expla...,2021-06-28 16:29:00,Household consumption Consumption expenditure...,"[Sector accounts, European system of national ...",[https://ec.europa.eu/eurostat/statistics-expl...,"[National accounts (incl. GDP), Statistical ar...",2021-06-28,2021,Economy and finance,National accounts (incl. GDP)
2,16,National rail networks have different technica...,Railway safety statistics in the EU,"In 2019, 1516 significant railway accidents we...",https://ec.europa.eu/eurostat/statistics-expla...,2021-06-25 18:31:00,Suicides on railways Suicides occurring on th...,"[Railway freight transport statistics, Railway...",[https://ec.europa.eu/eurostat/statistics-expl...,"[Rail, Statistical article, Transport, Transpo...",2021-06-25,2021,Transport,
3,17,The content of this statistical article is bas...,Railway freight transport statistics,This article focuses on recent rail freight tr...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-27 18:19:00,Geographical location plays a key role in the...,"[Transport statistics at regional level, All a...",[https://ec.europa.eu/eurostat/statistics-expl...,"[Freight, Rail, Statistical article, Transport]",2020-11-27,2020,Transport,
4,18,The content of this statistical article is bas...,Railway passenger transport statistics - quart...,This article takes a look at recent annual and...,https://ec.europa.eu/eurostat/statistics-expla...,2021-07-07 10:30:00,Number of passengers transported by rail incr...,"[Railway freight transport statistics, Freight...",[https://ec.europa.eu/eurostat/statistics-expl...,"[Passengers, Rail, Statistical article, Transp...",2021-07-07,2021,Transport,


### Tokenize and stem the article titles, contexts, abstracts and contents, and produce the first file

* Also remove stop-words.
* Create columns _title tokens_, _context tokens_, _abstract tokens_, _raw content tokens_.

In [14]:
#Stemming.

from gensim.parsing.preprocessing import remove_stopwords
from gensim.parsing.preprocessing import stem_text
from gensim.parsing.porter import PorterStemmer

p = PorterStemmer()

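def text_to_words(text):
    ## lower-case, strip accents and tokenize; str() then turns the token list into its printed form
    words = str(gensim.utils.simple_preprocess(text, deacc=True))
    ## NOTE: the printed list keeps quotes and commas attached to each token, so few (if any) stop-words match here
    words = remove_stopwords(words) 
    ## re-tokenize, which drops the list punctuation again
    words = gensim.utils.tokenize(words)
        
    ## Porter-stem each token and join into one space-separated string
    words = [p.stem(token) for token in words]  
    return ' '.join(words)        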

for i in range(len(SE_df)):
    SE_df.loc[i,'title tokens']=text_to_words(SE_df.loc[i,'title'])
    if not pd.isnull(SE_df.loc[i,'context']):
        SE_df.loc[i,'context tokens']=text_to_words(SE_df.loc[i,'context'])
    else:
        SE_df.loc[i,'context tokens']=''
    if not pd.isnull(SE_df.loc[i,'abstract']):        
        SE_df.loc[i,'abstract tokens']=text_to_words(SE_df.loc[i,'abstract'])
    else:
        SE_df.loc[i,'abstract tokens']=''
    SE_df.loc[i,'raw content tokens']=text_to_words(SE_df.loc[i,'raw content'])

SE_df.rename(columns={'id':'article_id'},inplace=True)


## Produce first file
SE_df.to_excel('SE_df_for_PowerBI.xlsx')
SE_df.head(5)

Unnamed: 0,article_id,context,title,abstract,url,last_update,raw content,related_titles,related_urls,categories,new_date,year,themes,sub_themes,title tokens,context tokens,abstract tokens,raw content tokens
0,7,"A safe, healthy working environment is a cruci...",Accidents at work statistics,This article presents a set of main statistica...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-26 16:06:00,"Number of accidents In 2018, there were 3.1 m...",[Health in the European Union a facts and figu...,[https://ec.europa.eu/eurostat/statistics-expl...,"[Accidents at work, Health, Health and safety,...",2020-11-26,2020,Population and social conditions,Health;Labour market,accid at work statist,safe healthi work environ is crucial factor in...,thi articl present set of main statist find in...,number of accid in there were million non fata...
1,13,"European institutions, governments, central ba...",National accounts and GDP,National accounts are the source for a multitu...,https://ec.europa.eu/eurostat/statistics-expla...,2021-06-28 16:29:00,Household consumption Consumption expenditure...,"[Sector accounts, European system of national ...",[https://ec.europa.eu/eurostat/statistics-expl...,"[National accounts (incl. GDP), Statistical ar...",2021-06-28,2021,Economy and finance,National accounts (incl. GDP),nation account and gdp,european institut govern central bank as well ...,nation account ar the sourc for multitud of we...,household consumpt consumpt expenditur of hous...
2,16,National rail networks have different technica...,Railway safety statistics in the EU,"In 2019, 1516 significant railway accidents we...",https://ec.europa.eu/eurostat/statistics-expla...,2021-06-25 18:31:00,Suicides on railways Suicides occurring on th...,"[Railway freight transport statistics, Railway...",[https://ec.europa.eu/eurostat/statistics-expl...,"[Rail, Statistical article, Transport, Transpo...",2021-06-25,2021,Transport,,railwai safeti statist in the eu,nation rail network have differ technic specif...,in signific railwai accid were report in the e...,suicid on railwai suicid occur on the railwai ...
3,17,The content of this statistical article is bas...,Railway freight transport statistics,This article focuses on recent rail freight tr...,https://ec.europa.eu/eurostat/statistics-expla...,2020-11-27 18:19:00,Geographical location plays a key role in the...,"[Transport statistics at regional level, All a...",[https://ec.europa.eu/eurostat/statistics-expl...,"[Freight, Rail, Statistical article, Transport]",2020-11-27,2020,Transport,,railwai freight transport statist,the content of thi statist articl is base on d...,thi articl focus on recent rail freight transp...,geograph locat plai kei role in the share of i...
4,18,The content of this statistical article is bas...,Railway passenger transport statistics - quart...,This article takes a look at recent annual and...,https://ec.europa.eu/eurostat/statistics-expla...,2021-07-07 10:30:00,Number of passengers transported by rail incr...,"[Railway freight transport statistics, Freight...",[https://ec.europa.eu/eurostat/statistics-expl...,"[Passengers, Rail, Statistical article, Transp...",2021-07-07,2021,Transport,,railwai passeng transport statist quarterli an...,the content of thi statist articl is base on d...,thi articl take look at recent annual and quar...,number of passeng transport by rail increas in...
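
The token columns above still contain words such as "at", "is" and "the". A likely reason is that `text_to_words` passes `str()` of a token list to `remove_stopwords`. The standard-library sketch below mimics that step under the assumption that `remove_stopwords` splits on whitespace and tests each piece against its stop list; `STOP` and `drop_stopwords` are hypothetical stand-ins, not gensim objects.

```python
# Tiny hypothetical stop list; gensim ships a much longer one.
STOP = {'at', 'the', 'of', 'in', 'is'}

def drop_stopwords(text):
    # Whitespace split plus membership test (assumed behaviour of remove_stopwords).
    return ' '.join(w for w in text.split() if w not in STOP)

tokens = ['accidents', 'at', 'work', 'statistics']

via_repr = drop_stopwords(str(tokens))        # what text_to_words passes on
via_join = drop_stopwords(' '.join(tokens))   # space-joined alternative

print(via_repr)   # ['accidents', 'at', 'work', 'statistics']
print(via_join)   # accidents work statistics
```

If stop-word removal is wanted, joining the tokens with spaces before filtering is one way to get it. The token columns, and anything built on them, would change as a result.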


### Read the OECD terms and definitions from the database
* Column 'related' holds the cross-references separated by semicolons; invalid ones (those without a valid URL in 'related_URL') have been removed.
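
For downstream use, the semicolon-separated 'related' value can be turned back into a list. A minimal sketch follows; `split_related` is a hypothetical helper, and treating any non-string value (such as a missing entry) as "no cross-references" is an assumption.

```python
def split_related(related):
    """Split a semicolon-separated 'related' value into a list of terms."""
    if not isinstance(related, str):     # None or NaN: no cross-references
        return []
    return [term.strip() for term in related.split(';') if term.strip()]

two_terms = split_related('Acute care;Long-term care beds in hospitals')
missing = split_related(None)

print(two_terms)   # ['Acute care', 'Long-term care beds in hospitals']
print(missing)     # []
```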


In [15]:
SQLCommand = """SELECT article_id,term,url,definition,context,theme,related,related_url,last_update,source_publ
                FROM ESTAT.V1.OECD_Glossary """

OECD_df = pd.read_sql(SQLCommand,c)
OECD_df.rename(columns={'article_id':'ID','url':'URL','related_url':'related_URL','source_publ':'Source Publication:'},inplace=True)
OECD_df

Unnamed: 0,ID,term,URL,definition,context,theme,related,related_URL,last_update,Source Publication:
0,1,Abatement,https://stats.oecd.org/glossary/detail.asp?ID=1,See Pollution abatement.,,Environmental statistics,Pollution abatement,https://stats.oecd.org/glossary/detail.asp?ID=...,"Thursday, March 14, 2002",
1,2,Absence from work due to illness,https://stats.oecd.org/glossary/detail.asp?ID=2,Absence from work due to illness refers to the...,,Health statistics,,,"Thursday, November 22, 2001",OECD Health Data 2001: A Comparative Analysis ...
2,3,Activity restriction - free expectancy,https://stats.oecd.org/glossary/detail.asp?ID=3,Functional limitation-free life expectancy is ...,,Health statistics,,,"Wednesday, October 31, 2001",OECD Health Data 2001: A Comparative Analysis ...
3,4,Acute care,https://stats.oecd.org/glossary/detail.asp?ID=4,Acute care is one in which the principal inten...,,Health statistics,Acute care beds;Acute care hospital staff rati...,https://stats.oecd.org/glossary/detail.asp?ID=...,"Thursday, April 25, 2013",OECD Health Data 2001: A Comparative Analysis ...
4,5,Acute care beds,https://stats.oecd.org/glossary/detail.asp?ID=5,Acute care beds are beds accommodating patient...,Acute care beds have alternatively been define...,Health statistics,Acute care;Long-term care beds in hospitals,https://stats.oecd.org/glossary/detail.asp?ID=...,"Thursday, April 25, 2013",2001 Data Collection on Education Systems: Def...
...,...,...,...,...,...,...,...,...,...,...
6931,7352,European Agricultural Fund for Rural Developme...,https://stats.oecd.org/glossary/detail.asp?ID=...,The Common Agricultural Policy (CAP) is financ...,,,Common Agricultural Policy (CAP);European Agri...,https://stats.oecd.org/glossary/detail.asp?ID=...,"Wednesday, April 3, 2013","European Commission, Agriculture and Rural Dev..."
6932,7354,Carbon market,https://stats.oecd.org/glossary/detail.asp?ID=...,A popular (but misleading) term for a trading ...,,,Greenhouse gases,https://stats.oecd.org/glossary/detail.asp?ID=...,"Thursday, April 4, 2013",United Nations Framework Convention on Climate...
6933,7355,Classification structure,https://stats.oecd.org/glossary/detail.asp?ID=...,Refers to how the categories of a classificati...,,,Classification,https://stats.oecd.org/glossary/detail.asp?ID=350,"Tuesday, April 9, 2013","United Nations Statistics Division, n.d. UN Gl..."
6934,7356,United Nation Framework Convention on Climate ...,https://stats.oecd.org/glossary/detail.asp?ID=...,The United Nations Framework Convention on Cli...,"The other ?Rio Conventions?, also negotiated a...",,United Nations Conference on Environment and D...,https://stats.oecd.org/glossary/detail.asp?ID=...,"Friday, April 26, 2013",United Nations Framework Convention on Climate...


* Drop records with a missing term or definition and apply data cleansing.

In [16]:
OECD_df.replace('',np.nan,inplace=True)
print(OECD_df.isnull().sum())
OECD_df.dropna(subset=['term','definition'],inplace=True)
OECD_df.reset_index(drop=True, inplace=True)
print(OECD_df.isnull().sum())

OECD_df['term'] = OECD_df['term'].apply(clean)
OECD_df['definition'] = OECD_df['definition'].apply(clean)
OECD_df['context'] = OECD_df['context'].apply(clean)
OECD_df

ID                        0
term                      3
URL                       0
definition                0
context                5538
theme                    35
related                4373
related_URL            4373
last_update            1763
Source Publication:     866
dtype: int64
ID                        0
term                      0
URL                       0
definition                0
context                5536
theme                    35
related                4371
related_URL            4371
last_update            1761
Source Publication:     866
dtype: int64


Unnamed: 0,ID,term,URL,definition,context,theme,related,related_URL,last_update,Source Publication:
0,1,Abatement,https://stats.oecd.org/glossary/detail.asp?ID=1,See Pollution abatement.,,Environmental statistics,Pollution abatement,https://stats.oecd.org/glossary/detail.asp?ID=...,"Thursday, March 14, 2002",
1,2,Absence from work due to illness,https://stats.oecd.org/glossary/detail.asp?ID=2,Absence from work due to illness refers to the...,,Health statistics,,,"Thursday, November 22, 2001",OECD Health Data 2001: A Comparative Analysis ...
2,3,Activity restriction - free expectancy,https://stats.oecd.org/glossary/detail.asp?ID=3,Functional limitation-free life expectancy is ...,,Health statistics,,,"Wednesday, October 31, 2001",OECD Health Data 2001: A Comparative Analysis ...
3,4,Acute care,https://stats.oecd.org/glossary/detail.asp?ID=4,Acute care is one in which the principal inten...,,Health statistics,Acute care beds;Acute care hospital staff rati...,https://stats.oecd.org/glossary/detail.asp?ID=...,"Thursday, April 25, 2013",OECD Health Data 2001: A Comparative Analysis ...
4,5,Acute care beds,https://stats.oecd.org/glossary/detail.asp?ID=5,Acute care beds are beds accommodating patient...,Acute care beds have alternatively been define...,Health statistics,Acute care;Long-term care beds in hospitals,https://stats.oecd.org/glossary/detail.asp?ID=...,"Thursday, April 25, 2013",2001 Data Collection on Education Systems: Def...
...,...,...,...,...,...,...,...,...,...,...
6928,7352,European Agricultural Fund for Rural Developme...,https://stats.oecd.org/glossary/detail.asp?ID=...,The Common Agricultural Policy (CAP) is financ...,,,Common Agricultural Policy (CAP);European Agri...,https://stats.oecd.org/glossary/detail.asp?ID=...,"Wednesday, April 3, 2013","European Commission, Agriculture and Rural Dev..."
6929,7354,Carbon market,https://stats.oecd.org/glossary/detail.asp?ID=...,A popular (but misleading) term for a trading ...,,,Greenhouse gases,https://stats.oecd.org/glossary/detail.asp?ID=...,"Thursday, April 4, 2013",United Nations Framework Convention on Climate...
6930,7355,Classification structure,https://stats.oecd.org/glossary/detail.asp?ID=...,Refers to how the categories of a classificati...,,,Classification,https://stats.oecd.org/glossary/detail.asp?ID=350,"Tuesday, April 9, 2013","United Nations Statistics Division, n.d. UN Gl..."
6931,7356,United Nation Framework Convention on Climate ...,https://stats.oecd.org/glossary/detail.asp?ID=...,The United Nations Framework Convention on Cli...,"The other ?Rio Conventions?, also negotiated a...",,United Nations Conference on Environment and D...,https://stats.oecd.org/glossary/detail.asp?ID=...,"Friday, April 26, 2013",United Nations Framework Convention on Climate...
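
The drop-and-count step above can be tried on a small frame. This is a minimal sketch on made-up rows (the `toy` frame and its values are illustrative, not taken from the database): empty strings are first turned into `NaN`, and only rows lacking a term or a definition are removed, so a missing context is tolerated.

```python
import numpy as np
import pandas as pd

# Illustrative rows only: one complete record, one without a term, one without a definition.
toy = pd.DataFrame({
    "term": ["Abatement", "", "Carbon market"],
    "definition": ["See Pollution abatement.", "Orphan definition", ""],
    "context": ["", "", "Some context"],
})

# Empty strings become NaN so that isnull() and dropna() can see them.
toy = toy.replace("", np.nan)
missing_before = toy.isnull().sum()

# Only 'term' and 'definition' are mandatory; rows with a missing 'context' are kept.
toy = toy.dropna(subset=["term", "definition"]).reset_index(drop=True)
missing_after = toy.isnull().sum()

print(missing_before, missing_after, sep="\n\n")
```

Only the first row survives here, and its context is still missing, which mirrors the non-zero `context` count in the second summary above.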


### Tokenize and stem the articles' terms, definitions and contexts

* Also remove stop-words.
* Create columns _term tokens_, _definition tokens_, _context tokens_.

In [17]:
## Tokenize and stem each text column (text_to_words is defined in an earlier cell)
OECD_df['term tokens'] = OECD_df['term'].apply(text_to_words)
OECD_df['definition tokens'] = OECD_df['definition'].apply(text_to_words)

## Missing contexts get an empty token string
OECD_df['context tokens'] = OECD_df['context'].apply(lambda x: text_to_words(x) if not pd.isnull(x) else '')

OECD_df

Unnamed: 0,ID,term,URL,definition,context,theme,related,related_URL,last_update,Source Publication:,term tokens,definition tokens,context tokens
0,1,Abatement,https://stats.oecd.org/glossary/detail.asp?ID=1,See Pollution abatement.,,Environmental statistics,Pollution abatement,https://stats.oecd.org/glossary/detail.asp?ID=...,"Thursday, March 14, 2002",,abat,see pollut abat,
1,2,Absence from work due to illness,https://stats.oecd.org/glossary/detail.asp?ID=2,Absence from work due to illness refers to the...,,Health statistics,,,"Thursday, November 22, 2001",OECD Health Data 2001: A Comparative Analysis ...,absenc from work due to ill,absenc from work due to ill refer to the numbe...,
2,3,Activity restriction - free expectancy,https://stats.oecd.org/glossary/detail.asp?ID=3,Functional limitation-free life expectancy is ...,,Health statistics,,,"Wednesday, October 31, 2001",OECD Health Data 2001: A Comparative Analysis ...,activ restrict free expect,function limit free life expect is the averag ...,
3,4,Acute care,https://stats.oecd.org/glossary/detail.asp?ID=4,Acute care is one in which the principal inten...,,Health statistics,Acute care beds;Acute care hospital staff rati...,https://stats.oecd.org/glossary/detail.asp?ID=...,"Thursday, April 25, 2013",OECD Health Data 2001: A Comparative Analysis ...,acut care,acut care is on in which the princip intent is...,
4,5,Acute care beds,https://stats.oecd.org/glossary/detail.asp?ID=5,Acute care beds are beds accommodating patient...,Acute care beds have alternatively been define...,Health statistics,Acute care;Long-term care beds in hospitals,https://stats.oecd.org/glossary/detail.asp?ID=...,"Thursday, April 25, 2013",2001 Data Collection on Education Systems: Def...,acut care bed,acut care bed ar bed accommod patient where th...,acut care bed have altern been defin as bed ac...
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6928,7352,European Agricultural Fund for Rural Developme...,https://stats.oecd.org/glossary/detail.asp?ID=...,The Common Agricultural Policy (CAP) is financ...,,,Common Agricultural Policy (CAP);European Agri...,https://stats.oecd.org/glossary/detail.asp?ID=...,"Wednesday, April 3, 2013","European Commission, Agriculture and Rural Dev...",european agricultur fund for rural develop eafrd,the common agricultur polici cap is financ by ...,
6929,7354,Carbon market,https://stats.oecd.org/glossary/detail.asp?ID=...,A popular (but misleading) term for a trading ...,,,Greenhouse gases,https://stats.oecd.org/glossary/detail.asp?ID=...,"Thursday, April 4, 2013",United Nations Framework Convention on Climate...,carbon market,popular but mislead term for trade system thro...,
6930,7355,Classification structure,https://stats.oecd.org/glossary/detail.asp?ID=...,Refers to how the categories of a classificati...,,,Classification,https://stats.oecd.org/glossary/detail.asp?ID=350,"Tuesday, April 9, 2013","United Nations Statistics Division, n.d. UN Gl...",classif structur,refer to how the categori of classif ar arrang...,
6931,7356,United Nation Framework Convention on Climate ...,https://stats.oecd.org/glossary/detail.asp?ID=...,The United Nations Framework Convention on Cli...,"The other ?Rio Conventions?, also negotiated a...",,United Nations Conference on Environment and D...,https://stats.oecd.org/glossary/detail.asp?ID=...,"Friday, April 26, 2013",United Nations Framework Convention on Climate...,unit nation framework convent on climat chang ...,the unit nation framework convent on climat ch...,the other rio convent also negoti at the unit ...
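
The handling of missing contexts in the tokenization step can be checked in isolation. The sketch below assumes nothing about the real `text_to_words`, which is defined outside this section: `toy_text_to_words` is a hypothetical stand-in that only lower-cases and splits on letters, without stemming, and the two rows are illustrative.

```python
import re
import pandas as pd

def toy_text_to_words(text):
    # Hypothetical stand-in for text_to_words: lower-case, keep alphabetic tokens
    # of two or more letters, and join them with spaces. No stemming is done here.
    return " ".join(re.findall(r"[a-z]{2,}", text.lower()))

toy = pd.DataFrame({
    "term": ["Acute care", "Carbon market"],
    "context": ["Acute care beds have been defined.", None],
})

toy["term tokens"] = toy["term"].apply(toy_text_to_words)

# Tokenize only non-null contexts; missing ones get an empty string.
toy["context tokens"] = toy["context"].apply(
    lambda x: toy_text_to_words(x) if pd.notnull(x) else ""
)

print(toy[["term tokens", "context tokens"]])
```

Keeping the null check inside the function passed to `apply` avoids calling the tokenizer on `NaN`, which a string-based tokenizer would typically reject.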


### Read the correspondence table between (a) Eurostat's themes and sub-themes and (b) the OECD Glossary themes

* More than one OECD theme may correspond to a given Eurostat theme and sub-theme combination.

In [18]:
SQLCommand = """SELECT ESTAT_theme, ESTAT_sub_theme, OECD_themes
                FROM ESTAT.V1.Eurostat_OECD_themes """

corresp_df = pd.read_sql(SQLCommand,c)
corresp_df.replace(np.nan,value='',inplace=True)
corresp_df['OECD_themes'] = corresp_df['OECD_themes'].apply(lambda x: x.split(';'))
corresp_df.rename(columns={'OECD_themes':'OECD_theme'},inplace=True)
print(corresp_df.isnull().sum())
corresp_df
  

ESTAT_theme        0
ESTAT_sub_theme    0
OECD_theme         0
dtype: int64


Unnamed: 0,ESTAT_theme,ESTAT_sub_theme,OECD_theme
0,General and regional statistics/EU policies,Non-EU countries,[]
1,General and regional statistics/EU policies,Regions and cities,[]
2,General and regional statistics/EU policies,Sustainable development goals,[]
3,General and regional statistics/EU policies,Policy indicators,[]
4,Economy and finance,Balance of payments,[Financial statistics - Balance of payments]
5,Economy and finance,Comparative price levels (PPPs),[Prices and purchasing power parities - Price ...
6,Economy and finance,Consumer prices,[Prices and purchasing power parities - Price ...
7,Economy and finance,Exchange rates and interest rates,[Financial statistics - Exchange rates]
8,Economy and finance,Government finance,[Financial statistics - Government finance and...
9,Economy and finance,National accounts (incl. GDP),"[National accounts - Input-output tables, Nati..."
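
One detail of the split above is worth keeping in mind: in Python, splitting an empty string returns a list holding one empty string, not an empty list. The rows shown as `[]` therefore most likely contain `['']`, since the table display omits the quotes. A minimal sketch of a stricter splitter follows; `split_themes` is a hypothetical helper, not part of the notebook, and it also trims spaces around each theme, which the original split does not do.

```python
def split_themes(value, sep=";"):
    # Hypothetical helper: return [] for missing or empty input instead of [''].
    if not isinstance(value, str) or not value.strip():
        return []
    # Trim surrounding spaces and skip empty fragments such as those from "A;;B".
    return [part.strip() for part in value.split(sep) if part.strip()]

print("".split(";"))          # what a plain split gives for an empty string
print(split_themes(""))       # what the helper gives
print(split_themes("Financial statistics - Balance of payments"))
```

For the matching done in the next cell the difference is harmless as long as no glossary theme is itself an empty string, which the earlier replacement of empty strings by `NaN` should guarantee.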


### Insert Eurostat's themes - sub-themes information into OECD Glossary articles dataframe

In [19]:
OECD_df['ESTAT_theme']=pd.Series(list() for i in range(len(OECD_df)))
OECD_df['ESTAT_sub_theme']=pd.Series(list() for i in range(len(OECD_df)))
for i in range(len(OECD_df)):
    theme = OECD_df.loc[i,'theme']
    #print(theme)
    for j in range(len(corresp_df)):
        if theme in corresp_df.loc[j,'OECD_theme']:
            if corresp_df.loc[j,'ESTAT_theme'] not in OECD_df.loc[i,'ESTAT_theme']: ## avoid duplicates
                OECD_df.loc[i,'ESTAT_theme'].append(corresp_df.loc[j,'ESTAT_theme'])
            if corresp_df.loc[j,'ESTAT_sub_theme'] not in OECD_df.loc[i,'ESTAT_sub_theme']: ## avoid duplicates               
                OECD_df.loc[i,'ESTAT_sub_theme'].append(corresp_df.loc[j,'ESTAT_sub_theme'])
            

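## Drop glossary entries that could not be mapped to any Eurostat theme
idx=OECD_df[OECD_df['ESTAT_theme'].apply(len)==0].index
OECD_df.drop(index=idx,inplace=True)
## Note: the index is not reset here (the output below keeps the original labels); it is reset in the next cell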
OECD_df

Unnamed: 0,ID,term,URL,definition,context,theme,related,related_URL,last_update,Source Publication:,term tokens,definition tokens,context tokens,ESTAT_theme,ESTAT_sub_theme
0,1,Abatement,https://stats.oecd.org/glossary/detail.asp?ID=1,See Pollution abatement.,,Environmental statistics,Pollution abatement,https://stats.oecd.org/glossary/detail.asp?ID=...,"Thursday, March 14, 2002",,abat,see pollut abat,,[Environment and energy],"[Energy, Environment]"
1,2,Absence from work due to illness,https://stats.oecd.org/glossary/detail.asp?ID=2,Absence from work due to illness refers to the...,,Health statistics,,,"Thursday, November 22, 2001",OECD Health Data 2001: A Comparative Analysis ...,absenc from work due to ill,absenc from work due to ill refer to the numbe...,,[Population and social conditions],[Health]
2,3,Activity restriction - free expectancy,https://stats.oecd.org/glossary/detail.asp?ID=3,Functional limitation-free life expectancy is ...,,Health statistics,,,"Wednesday, October 31, 2001",OECD Health Data 2001: A Comparative Analysis ...,activ restrict free expect,function limit free life expect is the averag ...,,[Population and social conditions],[Health]
3,4,Acute care,https://stats.oecd.org/glossary/detail.asp?ID=4,Acute care is one in which the principal inten...,,Health statistics,Acute care beds;Acute care hospital staff rati...,https://stats.oecd.org/glossary/detail.asp?ID=...,"Thursday, April 25, 2013",OECD Health Data 2001: A Comparative Analysis ...,acut care,acut care is on in which the princip intent is...,,[Population and social conditions],[Health]
4,5,Acute care beds,https://stats.oecd.org/glossary/detail.asp?ID=5,Acute care beds are beds accommodating patient...,Acute care beds have alternatively been define...,Health statistics,Acute care;Long-term care beds in hospitals,https://stats.oecd.org/glossary/detail.asp?ID=...,"Thursday, April 25, 2013",2001 Data Collection on Education Systems: Def...,acut care bed,acut care bed ar bed accommod patient where th...,acut care bed have altern been defin as bed ac...,[Population and social conditions],[Health]
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6919,7343,Statistical products,https://stats.oecd.org/glossary/detail.asp?ID=...,"Statistical products are, generally, informati...",Statistical products include general-purpose t...,Methodological information (metadata),Statistical data,https://stats.oecd.org/glossary/detail.asp?ID=...,"Monday, October 1, 2007","United States Office of Management and Budget,...",statist product,statist product ar gener inform dissemin produ...,statist product includ gener purpos tabul anal...,[Other],[Methodology]
6920,7344,Statistical press release,https://stats.oecd.org/glossary/detail.asp?ID=...,Is an announcement to media of statistical pro...,,Methodological information (metadata),Statistical products,https://stats.oecd.org/glossary/detail.asp?ID=...,"Monday, October 1, 2007","United States Office of Management and Budget,...",statist press releas,is an announc to media of statist product rele...,,[Other],[Methodology]
6921,7345,Press release,https://stats.oecd.org/glossary/detail.asp?ID=...,See Statistical press release.,,Methodological information (metadata),Statistical press release,https://stats.oecd.org/glossary/detail.asp?ID=...,,,press releas,see statist press releas,,[Other],[Methodology]
6922,7346,APW - Average Production Worker,https://stats.oecd.org/glossary/detail.asp?ID=...,An adult full-time worker directly engaged in ...,This definition was last used in tax calculati...,Tax policy & analysis - Taxing Wages,,,"Thursday, January 13, 2011",OECD: The Tax/Benefits Position of Production ...,apw averag product worker,an adult full time worker directli engag in pr...,thi definit wa last us in tax calcul within th...,[Economy and finance],[Government finance]
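
The nested loop above scans the whole correspondence table for every glossary entry. An alternative, sketched here under stated assumptions, is to invert the table once into a dictionary keyed by OECD theme. The three `corresp` rows are illustrative (chosen to agree with the first rows of the output above), and `lookup` and `map_theme` are hypothetical names.

```python
# Illustrative correspondence rows: (ESTAT_theme, ESTAT_sub_theme, [OECD themes]).
corresp = [
    ("Environment and energy", "Energy", ["Environmental statistics"]),
    ("Environment and energy", "Environment", ["Environmental statistics"]),
    ("Population and social conditions", "Health", ["Health statistics"]),
]

# Invert once: OECD theme -> (unique ESTAT themes, unique ESTAT sub-themes), in table order.
lookup = {}
for estat_theme, estat_sub, oecd_themes in corresp:
    for oecd in oecd_themes:
        themes, subs = lookup.setdefault(oecd, ([], []))
        if estat_theme not in themes:   # avoid duplicates
            themes.append(estat_theme)
        if estat_sub not in subs:       # avoid duplicates
            subs.append(estat_sub)

def map_theme(oecd_theme):
    # Unknown or missing themes map to two empty lists (such rows are dropped in the notebook).
    return lookup.get(oecd_theme, ([], []))

print(map_theme("Environmental statistics"))
print(map_theme("Health statistics"))
```

Each glossary entry then needs a single dictionary lookup, and the duplicate checks happen once per correspondence row instead of once per entry.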


* Extract the year of the last update and produce the second file (**OECD_content_for_PowerBI.xlsx**).

In [20]:
def my_split(x):
    if pd.isna(x):
        return []
    else:
        return x.split(';')

OECD_df['related'] = OECD_df['related'].apply(my_split)
## last_update looks like "Thursday, March 14, 2002": the three parts are the weekday, "month day" and the year
## (parts after the first keep the space that follows the comma)
OECD_df[["day", "month", "year"]] = OECD_df["last_update"].str.split(",", expand=True)

## Flag entries without a last update date
OECD_df['year'] = OECD_df['year'].fillna("Not found")
OECD_df.reset_index(drop=True,inplace=True)
OECD_df

Unnamed: 0,ID,term,URL,definition,context,theme,related,related_URL,last_update,Source Publication:,term tokens,definition tokens,context tokens,ESTAT_theme,ESTAT_sub_theme,day,month,year
0,1,Abatement,https://stats.oecd.org/glossary/detail.asp?ID=1,See Pollution abatement.,,Environmental statistics,[Pollution abatement],https://stats.oecd.org/glossary/detail.asp?ID=...,"Thursday, March 14, 2002",,abat,see pollut abat,,[Environment and energy],"[Energy, Environment]",Thursday,March 14,2002
1,2,Absence from work due to illness,https://stats.oecd.org/glossary/detail.asp?ID=2,Absence from work due to illness refers to the...,,Health statistics,[],,"Thursday, November 22, 2001",OECD Health Data 2001: A Comparative Analysis ...,absenc from work due to ill,absenc from work due to ill refer to the numbe...,,[Population and social conditions],[Health],Thursday,November 22,2001
2,3,Activity restriction - free expectancy,https://stats.oecd.org/glossary/detail.asp?ID=3,Functional limitation-free life expectancy is ...,,Health statistics,[],,"Wednesday, October 31, 2001",OECD Health Data 2001: A Comparative Analysis ...,activ restrict free expect,function limit free life expect is the averag ...,,[Population and social conditions],[Health],Wednesday,October 31,2001
3,4,Acute care,https://stats.oecd.org/glossary/detail.asp?ID=4,Acute care is one in which the principal inten...,,Health statistics,"[Acute care beds, Acute care hospital staff ra...",https://stats.oecd.org/glossary/detail.asp?ID=...,"Thursday, April 25, 2013",OECD Health Data 2001: A Comparative Analysis ...,acut care,acut care is on in which the princip intent is...,,[Population and social conditions],[Health],Thursday,April 25,2013
4,5,Acute care beds,https://stats.oecd.org/glossary/detail.asp?ID=5,Acute care beds are beds accommodating patient...,Acute care beds have alternatively been define...,Health statistics,"[Acute care, Long-term care beds in hospitals]",https://stats.oecd.org/glossary/detail.asp?ID=...,"Thursday, April 25, 2013",2001 Data Collection on Education Systems: Def...,acut care bed,acut care bed ar bed accommod patient where th...,acut care bed have altern been defin as bed ac...,[Population and social conditions],[Health],Thursday,April 25,2013
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5667,7343,Statistical products,https://stats.oecd.org/glossary/detail.asp?ID=...,"Statistical products are, generally, informati...",Statistical products include general-purpose t...,Methodological information (metadata),[Statistical data],https://stats.oecd.org/glossary/detail.asp?ID=...,"Monday, October 1, 2007","United States Office of Management and Budget,...",statist product,statist product ar gener inform dissemin produ...,statist product includ gener purpos tabul anal...,[Other],[Methodology],Monday,October 1,2007
5668,7344,Statistical press release,https://stats.oecd.org/glossary/detail.asp?ID=...,Is an announcement to media of statistical pro...,,Methodological information (metadata),[Statistical products],https://stats.oecd.org/glossary/detail.asp?ID=...,"Monday, October 1, 2007","United States Office of Management and Budget,...",statist press releas,is an announc to media of statist product rele...,,[Other],[Methodology],Monday,October 1,2007
5669,7345,Press release,https://stats.oecd.org/glossary/detail.asp?ID=...,See Statistical press release.,,Methodological information (metadata),[Statistical press release],https://stats.oecd.org/glossary/detail.asp?ID=...,,,press releas,see statist press releas,,[Other],[Methodology],,,Not found
5670,7346,APW - Average Production Worker,https://stats.oecd.org/glossary/detail.asp?ID=...,An adult full-time worker directly engaged in ...,This definition was last used in tax calculati...,Tax policy & analysis - Taxing Wages,[],,"Thursday, January 13, 2011",OECD: The Tax/Benefits Position of Production ...,apw averag product worker,an adult full time worker directli engag in pr...,thi definit wa last us in tax calcul within th...,[Economy and finance],[Government finance],Thursday,January 13,2011
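
The year can also be obtained by parsing the date instead of splitting on commas. This is a minimal sketch, assuming every `last_update` value follows the "weekday, month day, year" pattern seen above and that English day and month names are recognised by the active locale; `update_year` is a hypothetical helper.

```python
from datetime import datetime

def update_year(last_update):
    # Hypothetical helper: return the four-digit year as a string,
    # or "Not found" when the value is missing or does not parse.
    if not isinstance(last_update, str):
        return "Not found"
    try:
        parsed = datetime.strptime(last_update.strip(), "%A, %B %d, %Y")
    except ValueError:
        return "Not found"
    return str(parsed.year)

# A plain split keeps the space after each comma.
print("Thursday, March 14, 2002".split(","))

print(update_year("Thursday, March 14, 2002"))
print(update_year(float("nan")))
```

Parsing also rejects malformed dates, at the cost of depending on the locale, whereas the comma split accepts any text that contains two commas.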


In [21]:
## Produce the second file: flatten the list-valued columns to ';'-separated strings, then export

OECD_df['ESTAT_theme'] = OECD_df['ESTAT_theme'].apply(lambda x: ';'.join(x))
OECD_df['ESTAT_sub_theme'] = OECD_df['ESTAT_sub_theme'].apply(lambda x: ';'.join(x))

OECD_df['related'] = OECD_df['related'].apply(lambda x: ';'.join(x))

OECD_df.to_excel('OECD_content_for_PowerBI.xlsx')