# Word vectors trained on Statistics Explained (SE) articles and SE Glossary articles. Application to the identification of Eurostat datasets

This is a Google Colab notebook. You need a Google account with Google Drive to store and load the model. Upload the notebook from its location on GitHub and allow the code to access your Google Drive.

*   Launch the notebook and enter your own credentials in the pyodbc.connect() call in the chunk titled "Connect to Virtuoso database".


### Connect Google Drive

In [1]:
## To store the model (allowing re-runs starting from re-loading the model)

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install pyodbc



Collecting pyodbc
  Downloading pyodbc-4.0.32.tar.gz (280 kB)

In [3]:
!pip install eurostat

Collecting eurostat
  Downloading eurostat-0.2.3-py3-none-any.whl (11 kB)
Collecting pandasdmx<=0.9
  Downloading pandaSDMX-0.9-py2.py3-none-any.whl (45 kB)
Collecting jsonpath-rw
  Downloading jsonpath-rw-1.4.0.tar.gz (13 kB)
Collecting ply
  Downloading ply-3.11-py2.py3-none-any.whl (49 kB)
Building wheels for collected packages: jsonpath-rw
  Building wheel for jsonpath-rw (setup.py) ... done
  Created wheel for jsonpath-rw: filename=jsonpath_rw-1.4.0-py3-none-any.whl size=15147 sha256=5f0def1cfa6923095e933feceff2500686c7c679468b41066237541274c288ba
  Stored in directory: /root/.cache/pip/wheels/58/88/2a/8d619cf38d7cf939e54b6ccdece05d31b64b3eb419c11d1ed3
Successfully built jsonpath-rw
Installing collected packages: ply, jsonpath-rw, pandasdmx, eurostat
Successfully installed eurostat-0.2.3 jsonpath-rw-1.4.0 pandasdmx-0.9 ply-3.11


In [4]:
!apt-get install virtuoso-opensource

Reading package lists... Done
Building dependency tree       
Reading state information... Done
The following package was automatically installed and is no longer required:
  libnvidia-common-470
Use 'apt autoremove' to remove it.
The following additional packages will be installed:
  libvirtodbc0 virtuoso-opensource-6.1 virtuoso-opensource-6.1-bin
  virtuoso-opensource-6.1-common virtuoso-server virtuoso-vad-conductor
  virtuoso-vsp-startpage
Suggested packages:
  virtuoso-vad-doc virtuoso-vad-demo virtuoso-vad-tutorial
  virtuoso-vad-rdfmappers virtuoso-vad-sparqldemo virtuoso-vad-syncml
  virtuoso-vad-bpel virtuoso-vad-isparql virtuoso-vad-ods virtuoso-vad-dbpedia
  virtuoso-vad-facetedbrowser
The following NEW packages will be installed:
  libvirtodbc0 virtuoso-opensource virtuoso-opensource-6.1
  virtuoso-opensource-6.1-bin virtuoso-opensource-6.1-common virtuoso-server
  virtuoso-vad-conductor virtuoso-vsp-startpage
0 upgraded, 8 newly installed, 0 to remove and 39 not upgraded.


In [5]:
import gensim
import pandas as pd
import numpy as np

import re

import pyodbc

### Connect to Virtuoso database

In [6]:
c = pyodbc.connect('DRIVER=/usr/lib/odbc/virtodbc.so;HOST=lod.csd.auth.gr:1111;UID=xxxxx;PWD=xxxxx;DATABASE=ESTAT')

In [7]:
#set encoding
c.setdecoding(pyodbc.SQL_CHAR, encoding='latin-1')
c.setencoding(encoding="latin-1")
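
The `latin-1` setting explains the `âXX` sequences that the cleaning function later replaces with a quote. The following is a minimal sketch, assuming the stored text is UTF-8 (the `â\x80\x99` noted in the notebook's own comments is consistent with this, but it is not confirmed here). It shows how a right single quotation mark turns into three characters when its bytes are decoded as latin-1, and how the round trip restores it.

```python
# Why a curly quote shows up as "â" plus two control characters
# when UTF-8 bytes are decoded as latin-1.
curly_quote = "\u2019"                      # right single quotation mark
raw_bytes = curly_quote.encode("utf-8")     # three bytes: e2 80 99
misread = raw_bytes.decode("latin-1")       # three characters: 'â', '\x80', '\x99'

# Reversing the wrong decoding recovers the original character.
restored = misread.encode("latin-1").decode("utf-8")

print(repr(misread))
print(restored == curly_quote)
```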

In [8]:
cursor = c.cursor()

In [9]:
def load_table(cursor,query):
  cursor.execute(query)
  t1 = cursor.fetchall()
  df = pd.DataFrame.from_records(t1, columns=[x[0] for x in cursor.description])
  return df
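
The column names in `load_table()` come from `cursor.description`, which every DB-API cursor exposes after a query. To try the pattern without access to the Virtuoso server, the sketch below uses the standard-library `sqlite3` module as a stand-in and returns plain dictionaries instead of a DataFrame. The table, its rows and the helper name `load_records` are hypothetical.

```python
import sqlite3

def load_records(cursor, query):
    """Run a query and return the rows as dicts keyed by column name."""
    cursor.execute(query)
    # The first item of each description entry is the column name.
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]

# Hypothetical toy table, only for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE glossary (id INTEGER, title TEXT)")
cur.executemany("INSERT INTO glossary VALUES (?, ?)",
                [(1, "Accident at work"), (8, "Aggregate demand")])

records = load_records(cur, "SELECT id, title FROM glossary ORDER BY id")
print(records)
conn.close()
```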


## The following processing of SE and SE Glossary articles is for the first run only, to save the model. After this, it suffices to load the model



### Glossary articles  

* Definitions from dat_glossary.
* Titles from dat_link_info (with resource_information_id=1, i.e. Eurostat, see ESTAT.V1.mod_resource_information).
* Match above on id.


In [10]:
query = """SELECT T1.id, T1.definition, T2.title 
                FROM ESTAT.V1.dat_glossary as T1 
                INNER JOIN ESTAT.V1.dat_link_info as T2  
                  ON T1.id=T2.id 
                WHERE T2.resource_information_id=1 """
GL_df = load_table(cursor,query)

GL_df = GL_df[['id', 'title', 'definition']]

GL_df


Unnamed: 0,id,title,definition
0,1,Accident at work,An accident at work in the framework ...
1,5,Fatal accident at work,A fatal accident at work refers to an...
2,6,Non-fatal accident at work,A non-fatal accident at work is...
3,8,Aggregate demand,Aggregate demand is the total amount of ...
4,9,Goods and services account,The goods and services account shows ...
...,...,...,...
1309,2319,Actual individual consumption (AIC),"Actual individual consumption , abbrevia..."
1310,2321,Activity rate,Activity rate is the percentage of a...
1311,2322,Activation policies,The activation policies are policies ...
1312,2324,Active enterprises - FRIBS,"<Brief user-oriented definition, one or a fe..."


### Delete records with empty definitions and carry out data cleansing


In [11]:

GL_df = GL_df.dropna(axis=0,how='any')
print(GL_df.isnull().sum())
GL_df.reset_index(drop=True, inplace=True)

#import unicodedata as ud

def clean(x, quotes=True):
    if pd.isnull(x): return x  
    x = x.strip()
    
    ## make letter-question mark-letter -> letter-quote-space-letter !!! but NOT in the lists of URLs!!!
    if quotes:
        x = re.sub(r'([A-Za-z])\?([A-Za-z])','\\1\' \\2',x) 
    
    ## make letter-question mark-space-lower case letter -> letter-quote-space-letter
    x = re.sub(r'([A-Za-z])\? ([a-z])','\\1\' \\2',x) 

    ## delete ,000 commas in numbers    
    x = re.sub(r'\b(\d+),(\d+)\b','\\1\\2',x) ## CORRECTED
    
    ## delete  000 spaces in numbers
    x = re.sub(r'\b(\d+) (\d+)\b','\\1\\2',x) ## CORRECTED
    
    ## collapse runs of spaces into a single space
    x = re.sub(r' +', ' ',x)
    
    ## remove start and end spaces
    x = re.sub(r'^ +| +$', '',x,flags=re.MULTILINE) 
    
    ## space-comma -> comma
    x = re.sub(r' \,',',',x)
    
    ## space-dot -> dot
    x = re.sub(r' \.','.',x)
    
    x = re.sub(r'â.{2}',"'",x) ### !!! NEW: single quotes are read as: âXX
    
    #x = x.encode('latin1').decode('utf-8') ## â\x80\x99
    #x = ud.normalize('NFKD',x).encode('ascii', 'ignore').decode()
    
    return x


GL_df['title'] = GL_df['title'].apply(clean)
GL_df['title'] = GL_df['title'].apply(lambda x: re.sub(r'\?','-',x)) ## also replace question marks by dashes
GL_df['definition'] = GL_df['definition'].apply(clean)

GL_df.head(5)


id            0
title         0
definition    0
dtype: int64


Unnamed: 0,id,title,definition
0,1,Accident at work,An accident at work in the framework of the ad...
1,5,Fatal accident at work,A fatal accident at work refers to an accident...
2,6,Non-fatal accident at work,A non-fatal accident at work is an accident wh...
3,8,Aggregate demand,Aggregate demand is the total amount of goods ...
4,9,Goods and services account,The goods and services account shows the balan...
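
To see what the number and question-mark rules in `clean()` do, here is a small sketch that isolates those regular expressions and applies them to made-up strings. The helper names and the sample strings are hypothetical; the patterns are copied from the function above.

```python
import re

def strip_thousands_separators(text):
    """Join digit groups separated by a comma or a single space (one pass each)."""
    text = re.sub(r"\b(\d+),(\d+)\b", r"\1\2", text)
    text = re.sub(r"\b(\d+) (\d+)\b", r"\1\2", text)
    return text

def question_mark_to_quote(text):
    """Turn letter?letter into letter' letter, as clean() does when quotes=True."""
    return re.sub(r"([A-Za-z])\?([A-Za-z])", r"\1' \2", text)

samples = ["3,100 accidents", "12 500 persons", "Europe?s economy"]
for s in samples:
    print(question_mark_to_quote(strip_thousands_separators(s)))
```

One caveat worth keeping in mind: the space rule cannot tell a thousands separator from two unrelated numbers that happen to be adjacent, so a string such as "2019 27 countries" is also joined into "201927 countries".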


### Delete "special" records

* That is, revision notices and redirections.



In [12]:
## Drop the records with definitions starting with "The revision #" or "Redirect to"

idx = GL_df[GL_df['definition'].str.startswith('The revision #')].index
print(idx)
GL_df.drop(idx , inplace=True)
idx = GL_df[GL_df['definition'].str.startswith('Redirect to')].index
print(idx)
GL_df.drop(idx , inplace=True)
GL_df.reset_index(drop=True, inplace=True)
GL_df

Int64Index([ 230,  292,  384,  386,  433,  436,  438,  439,  504,  519,  530,
             557,  588,  729,  742,  775,  826,  889,  891,  912,  960,  961,
             969, 1003, 1007, 1133, 1144, 1182, 1231],
           dtype='int64')
Int64Index([], dtype='int64')


Unnamed: 0,id,title,definition
0,1,Accident at work,An accident at work in the framework of the ad...
1,5,Fatal accident at work,A fatal accident at work refers to an accident...
2,6,Non-fatal accident at work,A non-fatal accident at work is an accident wh...
3,8,Aggregate demand,Aggregate demand is the total amount of goods ...
4,9,Goods and services account,The goods and services account shows the balan...
...,...,...,...
1280,2319,Actual individual consumption (AIC),"Actual individual consumption, abbreviated as ..."
1281,2321,Activity rate,Activity rate is the percentage of active pers...
1282,2322,Activation policies,The activation policies are policies designed ...
1283,2324,Active enterprises - FRIBS,"<Brief user-oriented definition, one or a few ..."


### Create column "raw content" with the titles and the definitions

In [13]:
GL_df['raw content'] = GL_df['title'] +'. '+GL_df['definition']
#GL_df['source'] = 'GL'
GL_df

Unnamed: 0,id,title,definition,raw content
0,1,Accident at work,An accident at work in the framework of the ad...,Accident at work. An accident at work in the f...
1,5,Fatal accident at work,A fatal accident at work refers to an accident...,Fatal accident at work. A fatal accident at wo...
2,6,Non-fatal accident at work,A non-fatal accident at work is an accident wh...,Non-fatal accident at work. A non-fatal accide...
3,8,Aggregate demand,Aggregate demand is the total amount of goods ...,Aggregate demand. Aggregate demand is the tota...
4,9,Goods and services account,The goods and services account shows the balan...,Goods and services account. The goods and serv...
...,...,...,...,...
1280,2319,Actual individual consumption (AIC),"Actual individual consumption, abbreviated as ...",Actual individual consumption (AIC). Actual in...
1281,2321,Activity rate,Activity rate is the percentage of active pers...,Activity rate. Activity rate is the percentage...
1282,2322,Activation policies,The activation policies are policies designed ...,Activation policies. The activation policies a...
1283,2324,Active enterprises - FRIBS,"<Brief user-oriented definition, one or a few ...",Active enterprises - FRIBS. <Brief user-orient...


### Statistics Explained articles

* IDs, titles from dat_link_info, with resource_information_id=1, i.e. Eurostat (see ESTAT.V1.mod_resource_information) and matching IDs from dat_article.
* Carry out data cleansing on titles.

In [14]:
query =      """SELECT id, title 
                FROM ESTAT.V1.dat_link_info 
                WHERE resource_information_id=1 AND id IN (SELECT id FROM ESTAT.V1.dat_article) """

SE_df = load_table(cursor,query)

SE_df['title'] = SE_df['title'].apply(clean)
SE_df.head(5)

Unnamed: 0,id,title
0,7,Accidents at work statistics
1,13,National accounts and GDP
2,16,Railway safety statistics in the EU
3,17,Railway freight transport statistics
4,18,Railway passenger transport statistics - quart...


### Add paragraphs titles and contents

* From dat_article_paragraph with abstract=0 (i.e. "no").
* Match article_id from dat_article_paragraph with id from dat_article.
* Carry out data cleansing on titles and paragraph contents.

In [15]:
query =      """SELECT article_id, title, content 
                FROM ESTAT.V1.dat_article_paragraph
                WHERE abstract=0 AND article_id IN (SELECT id FROM ESTAT.V1.dat_article) """

add_content = load_table(cursor,query)
add_content['title'] = add_content['title'].apply(clean)
add_content['content'] = add_content['content'].apply(clean)
add_content

Unnamed: 0,article_id,title,content
0,2905,Absences from work sharply increase in first h...,Absences from work recorded unprecedented high...
1,2905,Absences: 9.5 % of employment in Q4 2019 and 1...,The article's next figure (Figure 4) compares ...
2,2905,Higher share of absences from work among women...,"Considering all four quarters of 2020, the sha..."
3,2905,Absences from work due to own illness or disab...,"From Q4 2019 to Q4 2020, the number of people ..."
4,2905,Absences from work due to holidays,"Expressed as a share of employed people, absen..."
...,...,...,...
3854,10539,General presentation and definition,Scope of asylum statistics and Dublin statisti...
3855,10539,Methodological aspects in asylum statistics,Annual aggregate of the number of asylum appli...
3856,10539,Methodological aspects in Dublin statistics,Asymmetries For most of the collected Dublin s...
3857,10539,What questions can or cannot be answered with ...,How many asylum seekers are entering EU Member...


### Aggregate the paragraph titles and contents of SE articles by article id

* Create a column _raw content_ which gathers all paragraph titles and contents in one text per article.

In [16]:
add_content_grouped = add_content.groupby(['article_id'])[['title','content']].aggregate(lambda x: list(x))
add_content_grouped.reset_index(drop=False, inplace=True)
for i in range(len(add_content_grouped)):
    add_content_grouped.loc[i,'raw content'] = ''
    for (a,b) in zip(add_content_grouped.loc[i,'title'],add_content_grouped.loc[i,'content']):
        add_content_grouped.loc[i,'raw content'] += a + '. ' + b
add_content_grouped = add_content_grouped[['article_id','raw content']]    

add_content_grouped

Unnamed: 0,article_id,raw content
0,7,"Number of accidents. In 2018, there were 3.1 m..."
1,13,Developments for GDP in the EU-27: growth sinc...
2,16,Fall in the number of railway accidents. 9 % f...
3,17,Downturn for EU transport performance in 2019....
4,18,Rail passenger transport performance continued...
...,...,...
860,10456,Problem. After successfully identifying and jo...
861,10470,"Problem. In France, there was significant room..."
862,10506,General overview. Nine PEEIs concern short-ter...
863,10531,What are administrative sources?. The term 'ad...
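
The grouping logic can also be sketched without pandas, which may help when checking it on a few rows. The rows below are hypothetical. Note one deliberate difference, which is an assumption of this sketch and not the notebook's behaviour: the pieces are joined with a space, whereas the loop above concatenates them with no separator between one paragraph's content and the next paragraph's title.

```python
from collections import defaultdict

# Hypothetical rows: (article_id, paragraph title, paragraph content).
paragraphs = [
    (7, "Number of accidents", "In 2018, there were many accidents."),
    (7, "Incidence rates", "Rates differ between countries."),
    (13, "Developments for GDP", "GDP grew."),
]

# Collect "title. content" pieces per article, keeping the row order.
pieces = defaultdict(list)
for article_id, title, content in paragraphs:
    pieces[article_id].append(title + ". " + content)

# Join the pieces with a space so words from adjacent paragraphs stay separate.
raw_content = {article_id: " ".join(parts) for article_id, parts in pieces.items()}
print(raw_content[7])
```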


### Merge raw content of SE articles with main file

* Add the title to column "raw content".

In [17]:
SE_df = pd.merge(SE_df,add_content_grouped,left_on='id',right_on='article_id',how='inner')
SE_df.drop(['article_id'],axis=1,inplace=True)

SE_df['raw content'] = SE_df['title'] +'. ' + SE_df['raw content']
#SE_df['source'] = 'SE'

SE_df

Unnamed: 0,id,title,raw content
0,7,Accidents at work statistics,Accidents at work statistics. Number of accide...
1,13,National accounts and GDP,National accounts and GDP. Developments for GD...
2,16,Railway safety statistics in the EU,Railway safety statistics in the EU. Fall in t...
3,17,Railway freight transport statistics,Railway freight transport statistics. Downturn...
4,18,Railway passenger transport statistics - quart...,Railway passenger transport statistics - quart...
...,...,...,...
860,10456,"Merging statistics and geospatial information,...","Merging statistics and geospatial information,..."
861,10470,"Merging statistics and geospatial information,...","Merging statistics and geospatial information,..."
862,10506,Methods for compiling PEEIs in short-term busi...,Methods for compiling PEEIs in short-term busi...
863,10531,Building the System of National Accounts - adm...,Building the System of National Accounts - adm...


### Concatenate the two dataframes 

In [18]:
all_df = pd.concat([GL_df[['id','title','raw content']],SE_df[['id','title','raw content']]],ignore_index=True)
all_df

Unnamed: 0,id,title,raw content
0,1,Accident at work,Accident at work. An accident at work in the f...
1,5,Fatal accident at work,Fatal accident at work. A fatal accident at wo...
2,6,Non-fatal accident at work,Non-fatal accident at work. A non-fatal accide...
3,8,Aggregate demand,Aggregate demand. Aggregate demand is the tota...
4,9,Goods and services account,Goods and services account. The goods and serv...
...,...,...,...
2145,10456,"Merging statistics and geospatial information,...","Merging statistics and geospatial information,..."
2146,10470,"Merging statistics and geospatial information,...","Merging statistics and geospatial information,..."
2147,10506,Methods for compiling PEEIs in short-term busi...,Methods for compiling PEEIs in short-term busi...
2148,10531,Building the System of National Accounts - adm...,Building the System of National Accounts - adm...


### Create 'sentences' list from the 'raw content' column after tokenization and lemmatization

* Keep terms with only alphanumeric characters and drop stop words.
* Drop numbers.
* Keep lemmas from original terms with at least 5 characters.

In [19]:
import nltk

nltk.download('wordnet')
nltk.download('stopwords')

from nltk.corpus import stopwords

w_tokenizer = nltk.tokenize.WordPunctTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()

stop = stopwords.words('english')

def lemmatize_text(text): ## only alphanumeric characters and drop stop-words
    return [lemmatizer.lemmatize(w).lower() for w in w_tokenizer.tokenize(text) if w.isalnum() and not w.lower() in stop
           and not re.match(r'^[0-9]+$',w) and len(w) >=5]

all_df['raw content']=all_df['raw content'].apply(lambda x: lemmatize_text(x))
##all_df         
sentences = all_df['raw content'].to_list()
##sentences

del SE_df, GL_df, all_df

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
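
The filtering rules of `lemmatize_text()` can be tried without downloading any NLTK data. The sketch below is an approximation under stated assumptions: it tokenizes with the regular expression `\w+|[^\w\s]+` (the pattern that NLTK's `WordPunctTokenizer` is built on), uses a tiny hypothetical stop-word set instead of NLTK's English list, and skips lemmatization entirely.

```python
import re

# Hypothetical, tiny stop-word set; the notebook uses NLTK's English list.
TOY_STOPWORDS = {"in", "the", "of", "among", "this", "was"}

def filter_tokens(text, stopwords=TOY_STOPWORDS, min_len=5):
    """Tokenize and filter as lemmatize_text() does, minus the lemmatization step."""
    tokens = re.findall(r"\w+|[^\w\s]+", text)
    return [
        t.lower()
        for t in tokens
        if t.isalnum()                      # alphanumeric tokens only
        and t.lower() not in stopwords      # drop stop words
        and not re.match(r"^[0-9]+$", t)    # drop pure numbers
        and len(t) >= min_len               # keep terms with at least 5 characters
    ]

tokens = filter_tokens("In 2021 unemployment among young people in Greece increased.")
print(tokens)
```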


## Train word vectors on the 'sentences' list and save the trained model in binary file SE_GL_wordvectors.bin

## Processing can then start from the loading of the saved model

### Save the model also in plain text format: 'SE_GL_wordvectors.txt'.
* This has the number of vectors and the dimensionality in the first row.
* Then, for each word in the model, the coordinates of the corresponding vector.
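
As an illustration of that layout, the sketch below parses a tiny, hypothetical file content by hand. It is only meant to show the structure of the file; for the real file, gensim's `KeyedVectors.load_word2vec_format(path, binary=False)` would be the usual way to read it.

```python
def parse_word2vec_text(text):
    """Parse the plain-text word2vec layout into (count, dim, {word: vector})."""
    lines = text.strip().splitlines()
    # Header line: number of vectors and their dimensionality.
    count, dim = (int(n) for n in lines[0].split())
    vectors = {}
    for line in lines[1:]:
        # Each following line: the word, then its coordinates, space-separated.
        word, *values = line.split(" ")
        vectors[word] = [float(v) for v in values]
    return count, dim, vectors

# Hypothetical content: 2 words, 3 dimensions.
sample = "2 3\nmember 0.1 0.2 0.3\ncountry -0.5 0.0 0.25\n"
count, dim, vectors = parse_word2vec_text(sample)
print(count, dim, vectors["country"])
```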


In [20]:
from gensim.models import Word2Vec
from gensim.models import KeyedVectors

from gensim.test.utils import datapath

from gensim.parsing.preprocessing import STOPWORDS
from gensim.parsing.porter import PorterStemmer

p = PorterStemmer()
all_stopwords_gensim = STOPWORDS

print('Type of corpus: ', type(sentences))
print('Length of corpus: ', len(sentences))

model = Word2Vec(sentences=sentences,size=300,workers=4,iter=10) # number of dimensions: 300, 4 threads,
## epochs = 2 x default

word_vectors = model.wv
word_vectors.save('/content/drive/MyDrive/SE_GL_wordvectors.bin') ## store only vectors - not going to update the model

## save also as plain text
## datapath() points into gensim's bundled test data and is not needed for an absolute path
word_vectors.save_word2vec_format('/content/drive/MyDrive/SE_GL_wordvectors.txt', binary=False)

del sentences

model = KeyedVectors.load("/content/drive/MyDrive/SE_GL_wordvectors.bin") 
print('Number of vectors: ',len(model.vocab))

## examples of use
for i in range(10):
    w = model.index2word[i] ## 'model' now holds the KeyedVectors themselves
    print(w,lemmatizer.lemmatize(w),p.stem(w),end='')
    if w in all_stopwords_gensim:
        print(' ***')
    else:
        print()
        similarities = model.most_similar(w,topn=10)
        for w2, score in similarities:
            print('    ',w2,score)

Type of corpus:  <class 'list'>
Length of corpus:  2150
Number of vectors:  6261
member member member
     varied 0.4571118950843811
     nordic 0.4425370991230011
     seven 0.42762595415115356
     repeated 0.4266907572746277
     baltic 0.42584097385406494
     remaining 0.42392611503601074
     exception 0.4040917456150055
     asean 0.3863232135772705
     members 0.3827565312385559
     seventeen 0.38266146183013916
country country countri
     countries 0.41346684098243713
     states 0.41129326820373535
     candidate 0.4083942472934723
     greatly 0.3962809443473816
     interesting 0.3856881260871887
     nationality 0.3811797499656677
     partner 0.376822292804718
     liechtenstein 0.36730408668518066
     asian 0.353677362203598
     widely 0.353537380695343
states state state
     state 0.7291289567947388
     kingdom 0.5600324869155884
     joined 0.5125753879547119
     greatly 0.4204980134963989
     exception 0.4146255552768707
     country 0.41129326820373535
     



## Read the Eurostat's database table of contents

* From the parsed table_of_contents.xml file, stored in the database.
* Keep only the leaf records, which correspond to datasets.
* Column 'Normalized file description': from column 'File description', keeping terms with only alphanumeric characters (but not numbers), dropping stop words, lemmatizing and keeping lemmas from original terms with at least 5 characters.
* Column 'Normalized full path': as above from column 'Names'.

In [21]:
import ast

query =      """SELECT id, number, codes, names, file_descr, file_code, level 
                FROM ESTAT.V1.dat_all_datasets """

crumbs_df = load_table(cursor,query)

crumbs_df['codes'] = crumbs_df['codes'].apply(lambda x: ast.literal_eval(x))
crumbs_df['names'] = crumbs_df['names'].apply(lambda x: ast.literal_eval(x))
crumbs_df.rename(columns={'number':'Number','codes':'Codes','names':'Names','file_descr':'File description','level':'Level','file_code':'File code'},inplace=True)
crumbs_df = crumbs_df[['Number','Codes','Names','File description','File code','Level']].copy()

## Keep only records - leaves, corresponding to datasets.
idx = crumbs_df[crumbs_df['File code']==''].index
crumbs_df.drop(index=idx, inplace=True) 
crumbs_df.reset_index(drop=True,inplace=True)

def lemmatize_list(x):
    text = ' '.join(x) ## avoid shadowing the built-in str
    return ' '.join(lemmatize_text(text))

crumbs_df['Normalized file description'] = crumbs_df['File description'].apply(lambda x: x.lower())
crumbs_df['Normalized file description'] = crumbs_df['Normalized file description'].apply(lambda x: ' '.join(lemmatize_text(x)))

crumbs_df['Normalized full path'] = crumbs_df['Names'].apply(lambda x: [y.lower() for y in x[1:]]) ## exclude first part
crumbs_df['Normalized full path'] = crumbs_df['Normalized full path'].apply(lemmatize_list)

crumbs_df


Unnamed: 0,Number,Codes,Names,File description,File code,Level,Normalized file description,Normalized full path
0,1.1.1.1.1.1,"[data, general, euroind, ei_bcs, ei_bcs_cs, ei...","[Database by themes, General and regional stat...",Consumers - monthly data,ei_bsco_m,5,consumer monthly,general regional statistic european national i...
1,1.1.1.1.1.2,"[data, general, euroind, ei_bcs, ei_bcs_cs, ei...","[Database by themes, General and regional stat...",Consumers - quarterly data,ei_bsco_q,5,consumer quarterly,general regional statistic european national i...
2,1.1.1.1.2.1,"[data, general, euroind, ei_bcs, ei_bcs_bs, ei...","[Database by themes, General and regional stat...",Industry - monthly data,ei_bsin_m_r2,5,industry monthly,general regional statistic european national i...
3,1.1.1.1.2.2,"[data, general, euroind, ei_bcs, ei_bcs_bs, ei...","[Database by themes, General and regional stat...",Industry - quarterly data,ei_bsin_q_r2,5,industry quarterly,general regional statistic european national i...
4,1.1.1.1.2.3,"[data, general, euroind, ei_bcs, ei_bcs_bs, ei...","[Database by themes, General and regional stat...",Construction - monthly data,ei_bsbu_m_r2,5,construction monthly,general regional statistic european national i...
...,...,...,...,...,...,...,...,...
8740,4.8.6.3.3.8,"[cc, sks, sks_dev, sks_devuoe, educ_uoe_enrt, ...","[Cross cutting topics, Skills-related statisti...",Students in tertiary education - as % of 20-24...,educ_uoe_enrt08,5,student tertiary education year population,skill related statistic skill development part...
8741,4.8.6.4.1,"[cc, sks, sks_dev, sks_devict, isoc_ske_ittn2]","[Cross cutting topics, Skills-related statisti...",Enterprises that provided training to develop/...,isoc_ske_ittn2,4,enterprise provided training develop upgrade s...,skill related statistic skill development part...
8742,4.8.6.5.1,"[cc, sks, sks_dev, sks_devcvt, trng_cvt_01s]","[Cross cutting topics, Skills-related statisti...",Enterprises providing training by type of trai...,trng_cvt_01s,4,enterprise providing training training class e...,skill related statistic skill development part...
8743,4.8.6.5.2,"[cc, sks, sks_dev, sks_devcvt, trng_cvt_12s]","[Cross cutting topics, Skills-related statisti...",Participants in CVT courses by sex and size cl...,trng_cvt_12s,4,participant course class person employed enter...,skill related statistic skill development part...
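
The 'codes' and 'names' columns are stored as text and turned back into Python lists with `ast.literal_eval`, which evaluates literals only and does not execute arbitrary code. A minimal sketch with a hypothetical stored value:

```python
import ast

# Hypothetical stored value: a Python list serialized as text.
stored_names = "['Database by themes', 'Theme A', 'Subtheme B']"

names = ast.literal_eval(stored_names)               # back to a real list
path_without_root = [n.lower() for n in names[1:]]   # drop the first level, as the notebook does
print(path_without_root)
```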


### The function scanning the datasets

* Create a zero vector 'scores' with one entry per dataset.
* Function scan_datasets():
    *    Average the vectors of words found in the input sentence using function 'avg_feature_vector()'.
    *    For each entry in the datasets column 'Normalized file description' (option=1) or in the datasets column 'Normalized full path' (option=2), similarly average the vectors of words found.
    *    Calculate the similarity of the input sentence with each dataset description, by calculating the cosine similarity of the two averaged vectors (function sent_similarity()).  
    *    Assign the similarities as scores in vector 'scores'. 
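
In symbols, the score described above can be written as follows. The notation is introduced here for illustration and does not appear in the notebook: $w_1,\dots,w_n$ are the in-vocabulary lemmas of the input sentence $s$, $u_1,\dots,u_m$ those of a dataset description $d$, and $v(\cdot)$ is the trained word vector.

$$
\bar{v}_s = \frac{1}{n}\sum_{i=1}^{n} v(w_i), \qquad
\bar{v}_d = \frac{1}{m}\sum_{j=1}^{m} v(u_j), \qquad
\mathrm{score}(s,d) = \frac{\bar{v}_s \cdot \bar{v}_d}{\lVert \bar{v}_s \rVert \, \lVert \bar{v}_d \rVert}
$$

The score is set to 0 when either averaged vector is all zeros, which is the case when a text contains no in-vocabulary lemma.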


In [22]:
index2_to_key_set = set(model.index2word)
num_features=300

from scipy import spatial

#import warnings
#warnings.filterwarnings("error")

def avg_feature_vector(sentence, model, num_features, index2_to_key_set):
    words = sentence.split()
    feature_vec = np.zeros((num_features, ), dtype='float32')
    n_words = 0
    for word in words:
        if word in index2_to_key_set:
            n_words += 1
            feature_vec = np.add(feature_vec, model[word])
    if (n_words > 0):
        feature_vec = np.divide(feature_vec, n_words)
    return feature_vec

def sent_similarity(sent1,sent2):
    ## cosine similarity of the averaged word vectors of two sentences
    s1_afv = avg_feature_vector(sent1, model=model, num_features=num_features, index2_to_key_set=index2_to_key_set)
    s2_afv = avg_feature_vector(sent2, model=model, num_features=num_features, index2_to_key_set=index2_to_key_set)
    ## a sentence with no in-vocabulary word gets an all-zero vector: score 0
    if not s1_afv.any() or not s2_afv.any():
        sim=0.
    else:
        sim = 1 - spatial.distance.cosine(s1_afv, s2_afv)
    return sim

def scan_datasets(lemmas, option):
    scores = np.zeros(len(crumbs_df))
    sentence = ' '.join(lemmas)
    if option==1:
        col = 'Normalized file description'
    else:
        col = 'Normalized full path'    
    for j in range(len(crumbs_df)):
        file_desc= crumbs_df.loc[j,col]
        scores[j] = sent_similarity(sentence,file_desc)
    return(scores)   
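
The averaging and cosine steps can be checked by hand on a toy example. The sketch below uses plain Python lists and hypothetical 3-dimensional vectors in place of the 300-dimensional model. The function names are illustrative, not the notebook's.

```python
import math

# Hypothetical 3-dimensional vectors standing in for the trained model.
toy_vectors = {
    "unemployment": [1.0, 0.0, 0.0],
    "young":        [0.0, 1.0, 0.0],
    "people":       [0.0, 1.0, 1.0],
}

def average_vector(sentence, vectors, dim=3):
    """Average the vectors of the words that are in the vocabulary."""
    known = [vectors[w] for w in sentence.split() if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

query = average_vector("unemployment young people", toy_vectors)
doc = average_vector("young people training", toy_vectors)  # "training" is out of vocabulary
sim = cosine_similarity(query, doc)
print(round(sim, 4))
```

Out-of-vocabulary words are simply skipped, so a description made up only of unknown words averages to the zero vector and scores 0 against any query.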



In [23]:
import ipywidgets as widgets
from ipywidgets import Layout
layout = widgets.Layout(width='600px', height='60px')

In [24]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

<IPython.core.display.Javascript object>

In [25]:
from IPython.display import HTML, display,clear_output

### The function producing the HTML output to display
*    Sentences: the text entered in the Textarea widget.
*    option: 1 to use the simple descriptions of the datasets, 2 for the full paths.
*    howmany: the number of datasets to display.
*    show_meta: True to display metadata for a selected dataset.
*    show_meta_ind: zero-based index, within the results, of the dataset whose metadata is displayed.

In [26]:
import eurostat
#import qgrid

codes = {'hazard': 16, 'time': 1169, 'nace_r1': 888, 'geo': 4017, 
    'age691': None, 'rbd': 412, 'time1': None, 'nace_r2': 1298, 
    'worktime': 31, 'metroreg': 422, 'geo91': None, 'injury': 50, 'physact': 22, 'ceparema': 37, 
    'modinj': 11, 'train': 12, 'wrkenv': 15, 'frequenc': 61, 'tra_cov': 32, 'stk_flow': 190, 'par_mar': 302, 
    'currency': 193, 'cities': 13507, 'indic_env': 81, 'emp_cont': 17, 'loadstat': 3, 'size_emp': 31, 'airp_pr': 25160, 
    'unit': 668, 'nst07': 104, 'wat_proc': 56, 'comspec': 3, 'sex': 7, 'time91': None, 'age': 651, 
    'workproc': 9, 'age91': None, 'citizen': 4017, 'port_iww': 1032, 'source': 14, 'c_cabot': 4017, 
    'wrkstat': 5, 'c_birth': 4017, 'rep_airp': 1116, 'deviatn': 11, 'matagent': 23, 'diagnose': 28, 
    'rep_mar': 1842, 'partner': 4017, 'c_regis': 4017, 'geo1': None, 'seabasin': 12}
        

def relevant_datasets(Sentences,option,howmany,show_meta,show_meta_ind):
    all_lemmas = lemmatize_text(Sentences)

    h  = '<h3>Scanning for lemmas:</h3>'
    h += '  '.join('[' + lemma + ']' for lemma in all_lemmas)+'<br/>'  
    
    h += '<h3>Datasets descriptions in descending order of score - top '+str(howmany)+':</h3>'
    display(HTML(h))
    
    scores=scan_datasets(all_lemmas,option) 
    idx = np.argsort(scores)[::-1][:howmany] ## indices of top 'howmany' scores
    res2 = crumbs_df.loc[crumbs_df.index[idx],['File description','File code']] ## corresponding records from crumbs_df
    res2['Score'] = scores[idx]
    res2=res2[['Score','File description','File code']]
    res2.reset_index(inplace=True)
    res2.rename(columns={'index':'Dataset id'},inplace=True)

    for i in range(len(res2)):
        url='https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2F'+res2.loc[i,'File code']+'.tsv.gz'
        res2.loc[i,'Link'] = url
        
    display(HTML(res2.to_html()))
    
    if show_meta:

        dst = pd.read_csv(res2.loc[show_meta_ind,'Link'],sep='\t',low_memory=False)
        dcode=res2.loc[show_meta_ind,'File code']
        dms = eurostat.get_sdmx_dims(dcode)
        h1 = '<h3>'+str(show_meta_ind)+': File code: '+dcode+'</h3>'
        h1 += '<h3>'+res2.loc[show_meta_ind,'File description']+'</h3>'
        h1 += '<h3>Dimensions:</h3>'
        h1 += '  '.join('[' + x + ']' for x in dms)
        #display(HTML(h1))

        h2=''
        dims=dst.columns[0].split(',')
        dims = dims[:-1]+dims[-1].split('\\')
        #print(dims)
        
        tmp=dst.iloc[:,0].str.split(',',expand=True)
        #print(tmp.head())
        #print('columns:',len(tmp.columns),': ',tmp.columns)
        for j in range(len(dims)-2): ## first the dimensions separated by comma
            if dims[j] not in codes.keys() or codes[dims[j]] is not None:
                h2+='<h4>'+dims[j]+':</h4>'
                arr = pd.unique(tmp.iloc[:,j])
                dic = eurostat.get_dic(dims[j]) ## fetch the code list once per dimension
                for i in range(len(arr)):
                    h2+= '[ '+arr[i]+': '+dic[arr[i]] +' ]'
        #j=len(dims)-2 ## omit last dimension which is usually time
        #if codes[dims[j]] is not None:    
        #    h2+='<h4>'+dims[j]+':</h4>'
        #    arr = pd.unique(tmp.iloc[:,j])
        #    for i in range(len(arr)):
        #        h2+= '[ '+arr[i]+': '+eurostat.get_dic(dims[j])[arr[i]] +' ]'
                
                #h2+= '</h4>'
        display(HTML(h1))
        display(HTML(h2))            
        display(HTML(dst.head().to_html()))
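
The metadata part of `relevant_datasets()` relies on the layout of the first column of the bulk-download TSV files: the header lists the dimension names separated by commas, with the last two joined by a backslash, and each row key lists one code per comma-separated dimension. The sketch below reproduces that parsing on a hypothetical header and row key. The helper names are illustrative.

```python
def parse_tsv_header(first_column_name):
    """Split a header such as 'unit,sex,age,geo\\time' into dimension names."""
    dims = first_column_name.split(",")
    # The last comma-separated item holds two names joined by a backslash.
    return dims[:-1] + dims[-1].split("\\")

def split_row_key(row_key):
    """Split the first field of a data row into one code per dimension."""
    return row_key.split(",")

# Hypothetical header and row key in the layout the function expects.
dims = parse_tsv_header("unit,sex,age,geo\\time")
codes_in_row = split_row_key("PC,F,Y15-24,EL")
print(dims)
print(dict(zip(dims, codes_in_row)))
```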


## Run the chunk below once, then just change the inputs

In [27]:
def relevant_datasets_to_text(): 
    
    first_text = 'In 2021 unemployment among young people in Greece increased. This was the result of recession over the last 10 years.'  
    style = {'description_width': 'initial'}
    
    textW = widgets.Textarea(
        value=first_text, 
        placeholder='Type something',
        description='',
        disabled=False,
        layout=Layout(width='90%', height='100px')
    )

    button = widgets.Button(description="Search")

    option = widgets.RadioButtons(
    options=['Simple description of datasets', 'Full-path description of datasets'],
    value='Full-path description of datasets', 
    layout={'width': 'max-content'}, 
    disabled=False
    )
    
    howmany = widgets.IntSlider(
        description='Display:',
        #tooltip='maximum:',
        value=20,
        min=1, 
        max = 30,
        style=style )
    howmany.style.handle_color = 'lightblue'
    
    show_more = widgets.Checkbox(
        value=True,
        description='Show metadata for row:',
        disabled=False,
        indent=True
     )
    
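    select_ind = widgets.BoundedIntText(
        value=0,
        min=0,
        max=howmany.value-1,
        step=1,
        description='Row:',
        disabled=False,
        style={'description_width': 'initial'},
        layout = widgets.Layout(width='100px')
    )

    ## keep the upper bound of the row selector in sync with the slider
    def on_howmany_change(change):
        select_ind.max = change['new']-1
    howmany.observe(on_howmany_change, names='value')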

    

    ui1=widgets.HBox([textW])
    ui2=widgets.HBox([option,button,howmany])
    ui3=widgets.HBox([show_more,select_ind])
    ui4=widgets.VBox([ui1,ui2,ui3])
    display(ui4, layout=Layout(align_items='center'))

    def on_button_clicked(b):
        clear_output()
        display(ui4, layout=Layout(align_items='center'))
        if option.value == 'Simple description of datasets':
            opt_val = 1
        else:
            opt_val = 2
        relevant_datasets(textW.value,option=opt_val,howmany=howmany.value,show_meta=show_more.value,show_meta_ind=select_ind.value)
       
    button.on_click(on_button_clicked)
    
    
relevant_datasets_to_text()
   


Row,Dataset id,Score,File description,File code,Link
0,4146,0.797613,"Part-time employment as percentage of the total employment for young people by sex, age and country of birth",yth_empl_060,https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Fyth_empl_060.tsv.gz
1,4148,0.775086,Involuntary part-time employment as percentage of the total part-time employment for young people by sex and age,yth_empl_080,https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Fyth_empl_080.tsv.gz
2,7400,0.771216,Young people neither in employment nor in education and training (15-24 years) - % of the total population in the same age group,tipslm90,https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Ftipslm90.tsv.gz
3,4156,0.769738,"Young people neither in employment nor in education and training by sex, age and educational attainment level (NEET rates)",yth_empl_160,https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Fyth_empl_160.tsv.gz
4,2309,0.764827,"Employment rates of young people not in education and training by sex, educational attainment level, years since completion of highest level of education and country of birth",edat_lfse_32,https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Fedat_lfse_32.tsv.gz
5,8174,0.764088,"Employment rates of young people not in education and training by sex, educational attainment level, years since completion of highest level of education and country of birth",edat_lfse_32,https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Fedat_lfse_32.tsv.gz
6,2312,0.763435,"Unemployment rates of young people not in education and training by sex, educational attainment level and years since completion of highest level of education",edat_lfse_25,https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Fedat_lfse_25.tsv.gz
7,8611,0.759998,"Employment rates of young people not in education and training by sex, educational attainment level, years since completion of highest level of education and country of birth",edat_lfse_32,https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Fedat_lfse_32.tsv.gz
8,2310,0.758552,"Employment rates of young people not in education and training by sex, educational attainment level, years since completion of highest level of education and NUTS 2 regions",edat_lfse_33,https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Fedat_lfse_33.tsv.gz
9,2308,0.757012,"Employment rates of young people not in education and training by sex, educational attainment level, years since completion of highest level of education and citizenship",edat_lfse_31,https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?sort=1&file=data%2Fedat_lfse_31.tsv.gz
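
The result list above contains the same file code (`edat_lfse_32`) under three different dataset ids, so several of the ten displayed rows point to the same download. The following is a minimal sketch, not part of the notebook, of how such duplicates could be collapsed by keeping only the best-scoring row per file code. The helper name `dedupe_by_file_code` is hypothetical, and the rows are copied from the table above with the description and link columns omitted.

```python
def dedupe_by_file_code(rows):
    """Keep only the highest-scoring row for each file code.

    `rows` is a list of (dataset_id, score, file_code) tuples.
    The result is sorted by descending score.
    """
    best = {}
    for dataset_id, score, file_code in rows:
        # Keep the first row seen unless a later one scores strictly higher.
        if file_code not in best or score > best[file_code][1]:
            best[file_code] = (dataset_id, score, file_code)
    return sorted(best.values(), key=lambda r: r[1], reverse=True)


# Rows copied from the result table above.
rows = [
    (4146, 0.797613, "yth_empl_060"),
    (4148, 0.775086, "yth_empl_080"),
    (7400, 0.771216, "tipslm90"),
    (4156, 0.769738, "yth_empl_160"),
    (2309, 0.764827, "edat_lfse_32"),
    (8174, 0.764088, "edat_lfse_32"),
    (2312, 0.763435, "edat_lfse_25"),
    (8611, 0.759998, "edat_lfse_32"),
    (2310, 0.758552, "edat_lfse_33"),
    (2308, 0.757012, "edat_lfse_31"),
]

unique_rows = dedupe_by_file_code(rows)
for dataset_id, score, file_code in unique_rows:
    print(dataset_id, score, file_code)
```

If the results are held in a pandas DataFrame, sorting by `Score` in descending order and then calling `drop_duplicates` on the `File code` column should give the same effect.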


Row,"sex,age,unit,c_birth,geo\time",2020,2019,2018,2017,2016,2015,2014,2013,2012,2011,2010,2009,2008,2007,2006,2005,2004,2003,2002,2001,2000,1999,1998,1997,1996,1995
0,"F,Y15-19,PC,EU15_FOR,AT",:,: u,: u,: u,: u,: u,: u,: u,: u,: u,: u,: u,: u,: bu,: u,: bu,: bc,:,:,: c,:,:,: u,: u,: u,: u
1,"F,Y15-19,PC,EU15_FOR,BE",:,: u,: u,: bu,: c,: u,: u,: u,: u,: bu,: u,: u,: u,: u,: u,: bu,:,: c,: c,:,:,: bu,: u,: c,: c,: c
2,"F,Y15-19,PC,EU15_FOR,CH",:,21.7 u,16.0 u,18.2 u,28.3 u,: u,: u,: u,: u,35.7 u,: bu,: u,: u,: u,: u,: bu,: u,: u,:,: u,:,:,:,:,:,:
3,"F,Y15-19,PC,EU15_FOR,CY",:,: c,: u,: u,: u,: c,: u,: c,:,:,: u,: b,:,: u,:,:,: c,: c,:,:,:,:,:,:,:,:
4,"F,Y15-19,PC,EU15_FOR,CZ",:,:,:,: c,:,:,:,:,:,:,:,:,:,:,:,:,:,:,:,:,:,:,:,:,:,:
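
In the preview above, the first column packs all dimension codes into one comma-separated key, and each cell holds either a number or `:` (not available), optionally followed by flag letters such as `u`, `b` or `c`. The sketch below shows one way to unpack both, assuming that the flags are always separated from the value by a space, as they are in this preview. The helper names `parse_key` and `parse_cell` are hypothetical and do not belong to the `eurostat` package.

```python
def parse_key(key, header):
    """Pair the comma-separated key of a row with its dimension names.

    `header` is the name of the first column, e.g.
    'sex,age,unit,c_birth,geo\\time'; the part after the backslash
    names the dimension spread across the remaining columns.
    """
    dims = header.split("\\")[0].split(",")
    return dict(zip(dims, key.split(",")))


def parse_cell(cell):
    """Split a raw cell into (value, flags).

    ':' is returned as None; everything after the first space is
    returned unchanged as the flag string.
    """
    parts = cell.strip().split()
    if not parts:
        return None, ""
    raw = parts[0]
    flags = parts[1] if len(parts) > 1 else ""
    value = None if raw == ":" else float(raw)
    return value, flags


# Row 2 of the preview above (Switzerland), year 2019.
header = r"sex,age,unit,c_birth,geo\time"
key_info = parse_key("F,Y15-19,PC,EU15_FOR,CH", header)
value, flags = parse_cell("21.7 u")
print(key_info, value, flags)
```

Keeping the flags as a separate string lets the numeric part be used in calculations while the reliability information remains available for filtering.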
