# Knowledge Graph examples

### Installation instructions

*    For the setup of the Virtuoso ODBC data source, see section 1a in https://github.com/eurostat/NLP4Stat/tree/testing/Software%20Environment
*    Download the notebook as a "raw" file and save it with the extension .ipynb (remove the .txt extension that is added).
*    Install the necessary libraries from your Jupyter command prompt. The libraries and the versions used are:
    *    pyodbc==4.0.32
    *    SPARQLWrapper==1.8.5
    *    pyvis==0.1.9
    *    pandas==1.3.5
    *    palettable==3.3.0
    *    numpy==1.20.3

*   Launch the notebook and enter your own credentials in the cell titled "Replace the values below with your own credentials".
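
Before launching the notebook, it can help to confirm that the pinned versions are the ones actually installed. The following is a minimal sketch using only the standard library (Python 3.8 or later); the helper name `check_versions` is hypothetical and not part of the notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Pinned versions copied from the list above.
PINNED = {
    "pyodbc": "4.0.32",
    "SPARQLWrapper": "1.8.5",
    "pyvis": "0.1.9",
    "pandas": "1.3.5",
    "palettable": "3.3.0",
    "numpy": "1.20.3",
}


def check_versions(pinned):
    """Return {package: (wanted, installed or None)} for every mismatch."""
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in check_versions(PINNED).items():
        print(f"{name}: wanted {wanted}, found {installed}")
```

An empty result means every listed package is installed at the pinned version. Other versions may work as well; the list only records what was used.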

### Load libraries and connect to Virtuoso

In [1]:
import os 
import re
import logging
import sys
import pyodbc
import hashlib
import pandas as pd
import numpy as np
from datetime import datetime
from SPARQLWrapper import SPARQLWrapper, POST, DIGEST, GET
from SPARQLWrapper import JSON, INSERT, DELETE
import sparql_dataframe

pd.set_option('display.max_rows', 500)

from IPython.display import Image

In [2]:
def connect_db(DSN, DBA, UID, PWD):

    connection = pyodbc.connect('DSN={};DBA={};UID={};PWD={}'.format(DSN, 
                                                                     DBA,
                                                                     UID,
                                                                     PWD))
    cursor = connection.cursor()

    return connection, cursor


def connect_virtuoso(DSN, UID, PWD):

    sparql = SPARQLWrapper(DSN)
    sparql.setHTTPAuth(DIGEST)
    sparql.setCredentials(UID, PWD)
    sparql.setMethod(GET)

    return sparql


### Replace the values below with your own credentials

In [3]:
user = "pierre"
login = "PlwWavJ0DwVZdgvzEUyG"

In [4]:
# Connection to CDB 
connection, cursor = connect_db('Virtuoso All', 
                                'ESTAT', 
                                user, 
                                login)


# Connection to the KDB 
## endpoint = "http://virtuoso-test.kapcode.fr:8890/sparql/"
endpoint = "http://lod.csd.auth.gr:8890/sparql/"
sparql = connect_virtuoso(endpoint, 
                          user, 
                          login)


### Select triples based on specific properties  

* The relations list specifies which relations to take into account for the knowledge graph. 
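
Each entry of the relations list is substituted into the predicate position of a SPARQL query. As a rough sketch of that substitution, assuming a hypothetical helper `build_relation_query` (the notebook builds the string inline instead):

```python
# Prefixes and graph IRI as used in the notebook's queries.
PREFIXES = """\
PREFIX estat: <https://nlp4statref/knowledge/ontology/>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
"""
GRAPH = "https://nlp4statref/knowledge/ontology/"


def build_relation_query(relation):
    """Return a SELECT query for all (?s, ?o) pairs linked by `relation`."""
    return (
        PREFIXES
        + f"SELECT * FROM <{GRAPH}>\n"
        + "WHERE {\n"
        + f"    ?s {relation} ?o .\n"
        + "}\n"
    )


if __name__ == "__main__":
    print(build_relation_query("skos:related"))
```

Keeping the query construction in one function makes it easier to print and inspect a single query before sending it to the endpoint.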

In [5]:
relations = ["estat:relatedLegallnformation",
             "estat:relatedEditorialContent",
             "estat:relatedStatisticData",
             "estat:sourceInformation",
             "estat:sourceData",
             "estat:dataInformation",
             "skos:related"]

In [6]:
for rel in relations :

    RelationsStatements = """
    PREFIX estat: <https://nlp4statref/knowledge/ontology/> 
    PREFIX dct: <http://purl.org/dc/terms/>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT * FROM <https://nlp4statref/knowledge/ontology/>
    WHERE {
        ?s """ + rel + """ ?o .
    }
    """
    # print(RelationsStatements)
    sparql.setQuery(RelationsStatements)
    sparql.method = "POST"
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()['results']['bindings']
    results = pd.json_normalize(results)
    results["p.value"] = rel
    #print(results.shape)
        
    if (rel == relations[0]): results_relations = results
    else : results_relations =  pd.concat([results_relations,results])

#print(results_relations.shape)
#results_relations.to_excel("results_relations.xlsx")

print('\nRelations:\n')
results_relations


Relations:



Unnamed: 0,s.type,s.value,o.type,o.value,p.value
0,uri,https://nlp4statref/knowledge/resource/statist...,uri,https://nlp4statref/knowledge/ontology/Europea...,estat:relatedLegallnformation
1,uri,https://nlp4statref/knowledge/resource/statist...,uri,https://nlp4statref/knowledge/ontology/Europea...,estat:relatedLegallnformation
2,uri,https://nlp4statref/knowledge/resource/statist...,uri,https://nlp4statref/knowledge/ontology/Europea...,estat:relatedLegallnformation
3,uri,https://nlp4statref/knowledge/resource/statist...,uri,https://nlp4statref/knowledge/ontology/Europea...,estat:relatedLegallnformation
4,uri,https://nlp4statref/knowledge/resource/statist...,uri,https://nlp4statref/knowledge/ontology/Europea...,estat:relatedLegallnformation
...,...,...,...,...,...
2680,uri,https://nlp4statref/knowledge/resource/statist...,uri,https://nlp4statref/knowledge/resource/statist...,skos:related
2681,uri,https://nlp4statref/knowledge/resource/statist...,uri,https://nlp4statref/knowledge/resource/statist...,skos:related
2682,uri,https://nlp4statref/knowledge/resource/statist...,uri,https://nlp4statref/knowledge/resource/statist...,skos:related
2683,uri,https://nlp4statref/knowledge/resource/statist...,uri,https://nlp4statref/knowledge/resource/statist...,skos:related
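
The column names in the table above (`s.type`, `s.value`, `o.type`, `o.value`) come from flattening the nested SPARQL JSON bindings, which `pd.json_normalize` does with a dot separator. The following standard-library sketch mimics that flattening for a single binding; the helper `flatten_binding` and the example URIs are made up for illustration.

```python
def flatten_binding(binding):
    """Flatten one SPARQL JSON binding into dotted keys such as 's.value'."""
    flat = {}
    for var, cell in binding.items():
        for key, value in cell.items():
            flat[f"{var}.{key}"] = value
    return flat


# Made-up binding in the layout of a SPARQL JSON result row.
example = {
    "s": {"type": "uri", "value": "https://example.org/resource/a"},
    "o": {"type": "uri", "value": "https://example.org/resource/b"},
}

row = flatten_binding(example)
# The notebook then records which relation produced the row.
row["p.value"] = "skos:related"
```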


### Get the titles of those elements 

* The titles are used to name the elements of the graph and to filter the elements of interest. For filtering, other fields such as the description could complement the titles. Eventually, this selection of elements will be replaced by enrichment from the knowledge database.

In [7]:
titles = ["skos:prefLabel", "dct:title"]
for tit in titles:

    TitlesStatements = """
    PREFIX estat: <https://nlp4statref/knowledge/ontology/> 
    PREFIX dct: <http://purl.org/dc/terms/>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    SELECT ?s ?o FROM <https://nlp4statref/knowledge/ontology/>
    WHERE {
        ?s """ + tit + """ ?o
    }
    """
    #print(TitlesStatements)
    sparql.setQuery(TitlesStatements)
    sparql.method = "POST"
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()['results']['bindings']
    results = pd.json_normalize(results)
    #print(results.shape)
    
    if (tit == titles[0]): results_titles = results
    else : results_titles =  pd.concat([results_titles,results])

results_titles['o.value'] = results_titles['o.value'].apply(lambda x: x.replace(' ? ',' - ')) ## ADDED

## BELOW use this method to save the file for inspection
## otherwise Excel's limit in URLs is exceeded

#writer = pd.ExcelWriter("results_titles.xlsx",
#                        engine_kwargs={'options': {'strings_to_urls': False}}) ## otherwise may pass Excel's limits in URLs       
#results_titles.to_excel(writer)
#writer.close()

print('\nTitles:\n')
results_titles


Titles:



Unnamed: 0,s.type,s.value,o.type,o.value
0,uri,https://nlp4statref/knowledge/resource/authori...,literal,
1,uri,https://nlp4statref/knowledge/resource/authori...,literal,
2,uri,https://nlp4statref/knowledge/resource/authori...,literal,
3,uri,https://nlp4statref/knowledge/resource/authori...,literal,
4,uri,https://nlp4statref/knowledge/resource/authori...,literal,
...,...,...,...,...
3507,uri,https://nlp4statref/knowledge/ontology/Miscell...,literal,"DG EMPLOYMENT, SOCIAL AFFAIRS AND INCLUSION"
3508,uri,https://nlp4statref/knowledge/ontology/LegalCo...,literal,Eurostat classification server (RAMON) - corre...
3509,uri,https://nlp4statref/knowledge/ontology/LegalCo...,literal,Eurostat classification server (RAMON) - corre...
3510,uri,https://nlp4statref/knowledge/ontology/LegalCo...,literal,Correspondence table - Degree of Urbanisation ...


### Get the resource types of those elements 
* The types of the elements are used to characterize them in the graph. 

In [8]:
TypesStatements = """
PREFIX estat: <https://nlp4statref/knowledge/ontology/> 
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?s ?o FROM <https://nlp4statref/knowledge/ontology/>
WHERE {
    ?s dct:type ?o
}
"""
#print(TypesStatements)
sparql.setQuery(TypesStatements)
sparql.method = "POST"
sparql.setReturnFormat(JSON)
results_types = sparql.query().convert()['results']['bindings']
results_types = pd.json_normalize(results_types)
#print(results_types.shape)
#results_types.to_excel("results_types.xlsx")

print('\nResource types:\n')
results_types


Resource types:



Unnamed: 0,s.type,s.value,o.type,o.value
0,uri,https://nlp4statref/knowledge/resource/statist...,literal,https://nlp4statref/knowledge/resource/authori...
1,uri,https://nlp4statref/knowledge/resource/statist...,literal,https://nlp4statref/knowledge/resource/authori...
2,uri,https://nlp4statref/knowledge/resource/statist...,literal,https://nlp4statref/knowledge/resource/authori...
3,uri,https://nlp4statref/knowledge/resource/statist...,literal,https://nlp4statref/knowledge/resource/authori...
4,uri,https://nlp4statref/knowledge/resource/statist...,literal,https://nlp4statref/knowledge/resource/authori...
...,...,...,...,...
4790,uri,https://nlp4statref/knowledge/ontology/Miscell...,literal,https://nlp4statref/knowledge/resource/authori...
4791,uri,https://nlp4statref/knowledge/ontology/LegalCo...,literal,https://nlp4statref/knowledge/resource/authori...
4792,uri,https://nlp4statref/knowledge/ontology/LegalCo...,literal,https://nlp4statref/knowledge/resource/authori...
4793,uri,https://nlp4statref/knowledge/ontology/LegalCo...,literal,https://nlp4statref/knowledge/resource/authori...


### Function to create all node tables

* The inputs _results_relations_, _results_types_, _results_titles_ are created in the previous steps.
* First builds a lookup table that maps the raw type identifiers to clean type labels.
* Returns _node_table_ **required by the next function create_edge_tables()**.
* Returns _node_type_table_ **required by function draw_network()** (see later).
* Returns _data_for_network_nodes_ **required by function pattern_material_selection_for_KG()** (see later).
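
The labelling step can be illustrated for a single node without pandas. This is only a sketch: `clean_node_title` is a hypothetical helper, the label table is a subset of the one in the function, and falling back to the bare title for unknown types is an assumption of the sketch.

```python
TYPE_PREFIX = "https://nlp4statref/knowledge/resource/authority/resource-type#"

# Subset of the type-to-label table used in create_node_tables().
CLEAN_LABELS = {
    "news": "News",
    "statistic-database": "Statistical DB",
    "european-union-law": "European Union Law",
}


def clean_node_title(type_uri, title):
    """Prefix the title with the clean type label when one is known."""
    # Keep only the last part of the type URI; a missing type becomes ''.
    node_type = (type_uri or "").replace(TYPE_PREFIX, "")
    label = CLEAN_LABELS.get(node_type, "")
    return f"{label} - {title}" if label else title


if __name__ == "__main__":
    print(clean_node_title(TYPE_PREFIX + "news", "Data in 2019 income estimates"))
```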

In [9]:
def create_node_tables(results_relations, results_types, results_titles):

    print('\nEntering create_node_tables\n')
    ## to clean type labels
    clean_type_label_table = pd.DataFrame({'node_type':["","None",np.nan,"background-article","glossary-concept","glossary-home-page",
                                                   "infography","news","statistic-reference-metadata","statistical-article",
                                                   "statistic-database","legal-context","european-union-law","miscellaneous",
                                                   "publication","statistical-data-report","statistic-table"],
                                       'clean_type_label':["","","","Background Article","Glossary Concept","Glossary Homepage",
                                                         "Infography","News","Statistic Reference Metadata","Statistical Article",
                                                         "Statistical DB","Legal Context","European Union Law","Miscellaneous",
                                                         "Publication","Statistical Data Report","Statistic Table"]})
    
    ## unique nodes  
    node_table = pd.DataFrame(np.unique(results_relations[['s.value','o.value']]),columns=['node'])

    ## Bring-in the types from the o.values in results_types with matched s.values
    node_table = pd.merge(node_table, results_types[['s.value','o.value']], left_on='node',right_on='s.value',how='left')
    node_table.drop(columns=['s.value'],inplace=True)
    node_table.rename(columns={"o.value":"node_type"},inplace=True)

    ## Bring-in the titles from the o.values in results_titles with matched s.values
    node_table = pd.merge(node_table, results_titles[['s.value','o.value']], left_on='node',right_on='s.value',how='left')
    node_table.drop(columns=['s.value'],inplace=True)
    node_table.rename(columns={"o.value":"node_title"},inplace=True)

    ## Fill missing types with empty strings and keep the last part in types
    node_table['node_type'] = node_table['node_type'].fillna('')
    node_table['node_type'] = node_table['node_type'].apply(lambda x: x.replace('https://nlp4statref/knowledge/resource/authority/resource-type#',''))

    #print('Node types:')
    #print(node_table.groupby(['node_type']).size())

    ## Put node_ids and initialize size to 1
    node_table['node_id'] = range(1,len(node_table)+1) # ids used in the graph
    node_table['size'] = 1 # for the JSON output
    node_table = node_table[['node_id','node','node_type','node_title','size']]

    #print(node_table)

    ## Create the table with the types of the nodes 
    node_type_table = node_table.groupby(['node_type']).size().reset_index()
    node_type_table.drop(columns=[0],inplace=True)
    node_type_table['type_id'] = range(1,len(node_type_table)+1) # ids used in the graph
    #print(node_type_table)

    ## Merge with the clean_type_label_table  
    node_type_table=pd.merge(node_type_table,clean_type_label_table,on='node_type',how='left')
    #print(node_type_table)

    ## Merge main table with the unique types table
    node_table =pd.merge(node_table, node_type_table, on = "node_type")
    #print(node_table)

    ## Drop nodes with missing titles
    #idx= np.where(pd.isna(node_table['node_title']))
    #print('Missing titles: ',len(idx))
    node_table.dropna(subset=['node_title'],inplace=True)
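    ## NOTE: reset_index() returns a new DataFrame; the result is discarded here, so node_table keeps its original index
    node_table.reset_index()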
    #print(node_table)

    ## Create 'clean_node_title' from clean_type_label (if not empty) and node_title
    mask = node_table['clean_type_label']!=''
    node_table.loc[mask,'clean_node_title'] = node_table.loc[mask,'clean_type_label']+' - ' + node_table.loc[mask,'node_title']

    mask = node_table['clean_type_label']==''
    node_table.loc[mask,'clean_node_title'] = node_table.loc[mask,'node_title']


    #print(node_table)

    ## reduced output
    ## take a copy so that renaming the columns does not touch node_table
    data_for_network_nodes = node_table[["clean_node_title", "type_id", "size"]].copy()
    data_for_network_nodes.columns = ["label", "group", "size"]
    
    output =     {'node_type_table':node_type_table, 
                  'node_table':node_table, 
                  'data_for_network_nodes': data_for_network_nodes}
    
    print('\ndata_for_network_nodes:\n')
    print(data_for_network_nodes)
    
    return output


### Function to create all edges tables

* The input _results_relations_ is created in the previous steps.
* _node_table_ is **created from the previous function create_node_tables()**.
* Returns _data_for_network_edges_ **required by function pattern_material_selection_for_KG()** (see later).
* Returns _edge_type_table_ **required by function draw_network()** (see later).
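
The core of this step is replacing the source and target URIs of each relation with node titles, leaving the title blank when a node is unknown. A dictionary lookup expresses the same idea compactly. The sketch below is not the notebook's implementation: `build_edges` and the example URIs are hypothetical.

```python
def build_edges(relations, titles_by_uri):
    """Translate (source_uri, target_uri, predicate) triples into edge rows.

    The rows have the same fields as data_for_network_edges before the
    edge ids are attached; unknown URIs get an empty title.
    """
    edges = []
    for source, target, predicate in relations:
        edges.append({
            "source": titles_by_uri.get(source, ""),
            "target": titles_by_uri.get(target, ""),
            "val": 1,
            "group": predicate,
        })
    return edges


# Made-up data for illustration.
titles = {"u:a": "News - A", "u:b": "Statistical DB - B"}
relations = [
    ("u:a", "u:b", "skos:related"),
    ("u:a", "u:missing", "estat:sourceData"),
]
edges = build_edges(relations, titles)
```

Edges with a blank source or target are kept at this stage and filtered out later, in the pattern selection step.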

In [10]:
def create_edge_tables(results_relations, node_table):

    print('\nEntering create_edge_tables\n')
    ## Create the table containing the edges types 
    edge_type_table = pd.DataFrame(np.unique(results_relations[['p.value']]),columns=['edge'])
    edge_type_table['edge_id'] = range(1,len(edge_type_table)+1)
    
    #print('\nedge_type_table:\n',edge_type_table)                                                           

    # Merge info to relations :
    results_relations_w_edge_type = pd.merge(results_relations, edge_type_table, left_on = 'p.value', right_on = "edge")
    #print(results_relations_w_edge_type)

    data_for_network_edges=pd.DataFrame(columns=['source','target','val','group'])
    c = -1
    for i in range(len(results_relations_w_edge_type)):
        source= results_relations_w_edge_type.loc[i,'s.value']   
        idx = node_table[node_table['node'] == source].index.tolist()
        if len(idx) > 0:
            idx = idx[0]
            source_id = node_table.loc[idx,"node_id"]
            source_title = node_table.loc[idx,"clean_node_title"]
            ## guard against a missing title, as done for the target below
            if not pd.isna(source_title):
                if source_title.startswith(' -'):
                    source_title = node_table.loc[idx,"node_title"]
            else:
                source_title=''
        else:
            source_title=''
        


        target = results_relations_w_edge_type.loc[i,'o.value'] 
        idx = node_table[node_table['node'] == target].index.tolist()
        if len(idx) > 0:
            idx = idx[0]
            target_id = node_table.loc[idx,"node_id"]
            target_title = node_table.loc[idx,"clean_node_title"]
            if not pd.isna(target_title):
                if target_title.startswith(' -'):
                    target_title = node_table.loc[idx,"node_title"]
            else:
                target_title=''
        else:
            target_title=''
    
        val=1
        edge = results_relations_w_edge_type.loc[i,'edge']          
        
        c+=1
        data_for_network_edges.loc[c,'source'] = source_title
        data_for_network_edges.loc[c,'target'] = target_title
        data_for_network_edges.loc[c,'val'] = val
        data_for_network_edges.loc[c,'group'] = edge
    
    data_for_network_edges =  pd.merge(data_for_network_edges, edge_type_table,left_on='group',right_on='edge')    
    data_for_network_edges.drop(columns=['group','edge'],inplace=True)    
    data_for_network_edges.rename(columns={'edge_id':'group'},inplace=True)
    ## print(data_for_network_edges)
 
    output = {"edge_type_table":edge_type_table,
              ##"results_relations_w_edge_type":results_relations_w_edge_type, - NOT NEEDED
              "data_for_network_edges":data_for_network_edges}
    
    print('\ndata_for_network_edges:\n')
    print(data_for_network_edges)
    
    return output


### Function to filter the results with a specific pattern

* Also keeps only nodes whose number of outgoing links is above a specified minimum (_min_nb_links_).
* **Requires** _data_for_network_nodes_, _data_for_network_edges_ **created by functions create_node_tables() and create_edge_tables()**. 
* **Called from function create_graph()** (see later), **creates** _data_for_network_nodes_pattern_min_, _data_for_network_edges_pattern_min_ **required by function draw_network()** (see later).
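
The selection logic can be restated on plain Python lists, which may make the order of the filters easier to follow. This is a simplified sketch, not the pandas code used in the notebook; `select_edges` is a hypothetical name and the labels are made up.

```python
import re
import unicodedata
from collections import Counter


def select_edges(labels, edges, pattern, min_nb_links):
    """Return the (source, target) edges kept by the pattern and link filters."""
    rx = re.compile(pattern, flags=re.IGNORECASE)
    # 1. Nodes whose lower-cased, NFKD-normalised label matches the pattern.
    matched = {lab for lab in labels
               if rx.search(unicodedata.normalize("NFKD", lab.lower()))}
    # 2. Edges touching a matched node, with non-blank source and target.
    touching = [(s, t) for s, t in edges
                if (s in matched or t in matched) and s.strip() and t.strip()]
    # 3. Sources with strictly more than min_nb_links outgoing links.
    out_links = Counter(s for s, _ in touching)
    hubs = {s for s, n in out_links.items() if n > min_nb_links}
    # 4. Edges whose source or target is such a source.
    kept = [(s, t) for s, t in touching if s in hubs or t in hubs]
    # 5. Both ends must be known nodes; duplicates are dropped, order kept.
    known = set(labels)
    kept = [(s, t) for s, t in kept if s in known and t in known]
    return list(dict.fromkeys(kept))


labels = ["News - Covid update", "Statistical DB - Accounts",
          "Glossary Concept - GDP", "News - Other"]
edges = [
    ("News - Covid update", "Statistical DB - Accounts"),
    ("News - Covid update", "Glossary Concept - GDP"),
    ("News - Other", "News - Covid update"),
    ("News - Other", "Glossary Concept - GDP"),
]
selected = select_edges(labels, edges, "covid", 1)
```

Note that the comparison is strict: with `min_nb_links=1`, a source needs at least two outgoing links among the pattern-related edges to be retained as a hub.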

In [11]:
import unicodedata as ud

def pattern_material_selection_for_KG(data_for_network_nodes, 
                                      data_for_network_edges,
                                      pattern,
                                      min_nb_links):
    print('\nEntering pattern_material_selection_for_KG\n')
    import warnings
    warnings.filterwarnings("ignore", 'This pattern has match groups')

    ## Select nodes related to a specified pattern in their label

    ## from data_for_network_nodes, select the nodes which contain the pattern in label and put them in pattern_nodes
    preprocessed_nodes = data_for_network_nodes['label'].str.lower()
    preprocessed_nodes = preprocessed_nodes.apply(lambda x: ud.normalize('NFKD',x)) ## ADDED
    idx = preprocessed_nodes[preprocessed_nodes.str.contains(pattern,regex=True,case=False)==True].index
    pattern_nodes = data_for_network_nodes.loc[idx,'label']

    #print('\npattern_nodes:')
    #print(pattern_nodes)

    ## all edges with these nodes 
    ## from data_for_network_edges select those edges with source OR target in pattern_nodes and put them in data_for_network_edges_pattern
    idx1 = data_for_network_edges[data_for_network_edges['source'].apply(lambda x: x in pattern_nodes.values)==True].index.tolist()  
    idx2 = data_for_network_edges[data_for_network_edges['target'].apply(lambda x: x in pattern_nodes.values)==True].index.tolist()
    idx = sorted(list(set(idx1).union(set(idx2))))

    data_for_network_edges_pattern = data_for_network_edges.iloc[idx]
    data_for_network_edges_pattern.reset_index(drop=True,inplace=True)

    #print('\ndata_for_network_edges_pattern:')
    #print(data_for_network_edges_pattern)

    ## Count the links per source node, to keep the nodes with more than the minimum specified in the arguments  
    ## First, from data_for_network_edges_pattern, keep the edges with non-blank source AND non-blank target
    idx1 = data_for_network_edges_pattern[data_for_network_edges_pattern['source'].str.strip()!=''].index.tolist()
    idx2 = data_for_network_edges_pattern[data_for_network_edges_pattern['target'].str.strip()!=''].index.tolist()
    idx = list(set(idx1).intersection(set(idx2)))
    data_for_network_edges_pattern = data_for_network_edges_pattern.iloc[idx]
    data_for_network_edges_pattern.reset_index(drop=True,inplace=True)


    ## Second, create table_links_per_node_pattern from data_for_network_edges_pattern with the frequencies of target per source
    #table_links_per_node_pattern= data_for_network_edges_pattern.groupby(['source']).agg({'target':['count']})
    table_links_per_node_pattern= data_for_network_edges_pattern.groupby(['source'])['target'].agg('count').to_frame().reset_index()
    table_links_per_node_pattern.rename(columns={'target':'size'},inplace=True)

    #print('\ntable_links_per_node_pattern:')
    #print(table_links_per_node_pattern)
    #table_links_per_node_pattern.to_excel('table_links_per_node_pattern.xlsx')

    ## Third, create table_links_per_node_min_pattern from the sources in table_links_per_node_pattern 
    ## whose size is > min_nb_links
    table_links_per_node_min_pattern= table_links_per_node_pattern['source'][table_links_per_node_pattern['size']>min_nb_links]

    #print('\ntable_links_per_node_min_pattern:')
    #print(table_links_per_node_min_pattern)
    #print(table_links_per_node_min_pattern.tolist())

    ## edges pattern
    ## from data_for_network_edges_pattern: keep in data_for_network_edges_pattern_min those with source OR target in table_links_per_node_min_pattern

    mask = (data_for_network_edges_pattern['source'].apply(lambda x: x in table_links_per_node_min_pattern.tolist())) | \
           (data_for_network_edges_pattern['target'].apply(lambda x: x in table_links_per_node_min_pattern.tolist())) 

    data_for_network_edges_pattern_min=data_for_network_edges_pattern[mask]

    #print('\ndata_for_network_edges_pattern_min:')
    #print(data_for_network_edges_pattern_min)

    ## keep nodes related to the pattern and to the minimum number of links: from data_for_network_nodes keep the nodes
    ## with label in the union of the unique values in data_for_network_edges_pattern_min sources and targets
    ## and put them in data_for_network_nodes_pattern_min

    mask=data_for_network_nodes['label'].apply(lambda x: x in np.unique(data_for_network_edges_pattern_min['source']) or \
                                                         x in np.unique(data_for_network_edges_pattern_min['target']))
    data_for_network_nodes_pattern_min = data_for_network_nodes[mask]
    #print('\ndata_for_network_nodes_pattern_min:\n')
    #print(data_for_network_nodes_pattern_min)

    ## keep edges that involve nodes that are in our list : 
    ## from data_for_network_edges_pattern_min keep the edges with source AND target in data_for_network_nodes_pattern_min
    mask = (data_for_network_edges_pattern_min['source'].apply(lambda x: x in data_for_network_nodes_pattern_min['label'].tolist()))  & \
           (data_for_network_edges_pattern_min['target'].apply(lambda x: x in data_for_network_nodes_pattern_min['label'].tolist()))  
    data_for_network_edges_pattern_min=data_for_network_edges_pattern_min[mask]

    #print('\ndata_for_network_edges_pattern_min:')
    #print(data_for_network_edges_pattern_min)

    ## delete duplicates
    data_for_network_nodes_pattern_min.drop_duplicates(inplace=True)
    data_for_network_nodes_pattern_min.reset_index(drop=True,inplace=True)

    print('\ndata_for_network_nodes_pattern_min:')
    print(data_for_network_nodes_pattern_min)

    data_for_network_edges_pattern_min.drop_duplicates(inplace=True)
    data_for_network_edges_pattern_min.reset_index(drop=True,inplace=True)

    print('\ndata_for_network_edges_pattern_min:\n')
    print(data_for_network_edges_pattern_min)
    
    output = {"data_for_network_nodes_pattern_min":data_for_network_nodes_pattern_min,
              "data_for_network_edges_pattern_min":data_for_network_edges_pattern_min}
    return output




### Function to draw the graph

* **pyvis** is a wrapper around the popular Javascript [visJS library](https://visjs.github.io/vis-network/examples/).
* Graphs can also be constructed with [Networkx](https://networkx.org/) and [translated](https://pyvis.readthedocs.io/en/latest/tutorial.html#networkx-integration).
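* This function **requires** _data_for_network_nodes_pattern_min_, _data_for_network_edges_pattern_min_, **which create_graph() obtains from pattern_material_selection_for_KG()** (see below). It also reads the global tables _node_type_table_ and _edge_type_table_ created in the initial processing.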
* Returns a network object.
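
The function assigns one color per node type and one per edge type by sampling from a palette. Note that `random.sample` raises a `ValueError` when more colors are requested than the palette contains, and that the colors change between runs. A deterministic alternative could look like the sketch below; `colors_by_group` and the hex values are placeholders, not taken from palettable.

```python
def colors_by_group(groups, palette):
    """Map each distinct group id to one palette color, in sorted order."""
    unique = sorted(set(groups))
    if len(unique) > len(palette):
        raise ValueError(
            f"{len(unique)} groups but only {len(palette)} colors available"
        )
    return dict(zip(unique, palette))


# Placeholder hex colors for illustration only.
PALETTE = ["#111111", "#333333", "#555555", "#777777"]
mapping = colors_by_group([11, 4, 9, 4, 11], PALETTE)
```

With a fixed mapping, the same node type keeps the same color across examples, which can make the graphs easier to compare.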

In [12]:
from pyvis.network import Network
from pyvis.options import EdgeOptions
import random
import palettable
from palettable.colorbrewer.sequential import Blues_9
from palettable.colorbrewer.sequential import Greens_9
#print(Blues_9.hex_colors)

def draw_network(data_for_network_nodes_pattern_min,
                 data_for_network_edges_pattern_min):

    print('\nEntering draw_network\n')    
    net = Network(notebook=True,height='1000px',width='1000px')
    net.barnes_hut()
    net.repulsion(node_distance=100, spring_length=200)

    node_groups = np.unique(data_for_network_nodes_pattern_min['group'])
    edge_groups = np.unique(data_for_network_edges_pattern_min['group'])
    #print('edge_groups:',edge_groups)

    node_colors = random.sample(Blues_9.hex_colors, len(node_groups))
    edge_colors = random.sample(Greens_9.hex_colors, len(edge_groups))

    for i in range(len(node_groups)):
        nodes_group = data_for_network_nodes_pattern_min['label'][data_for_network_nodes_pattern_min['group']==node_groups[i]].to_list()
        nodes_group_color = node_colors[i]
        node_type = node_type_table['clean_type_label'][node_type_table['type_id']==node_groups[i]].values[0]
        #print(node_type)
        net.add_nodes(nodes_group,label=['('+node_type+'): '+n for n in nodes_group],color=[nodes_group_color]*len(nodes_group))

    all_edge_colors=[]    
    for i in range(len(edge_groups)):
        #print('edge_group: ',i,': ',edge_groups[i])
        edges_group = data_for_network_edges_pattern_min[data_for_network_edges_pattern_min['group']==edge_groups[i]]
        #print(edges_group)
        edges_group_color = edge_colors[i]
        edges = list(zip(edges_group['source'],edges_group['target']))
        all_edge_colors.extend([edges_group_color]*len(edges))
        edges_type_title = edge_type_table['edge'][edge_type_table['edge_id']==edge_groups[i]].values[0]
        for k in range(len(edges)):
            net.add_edge(source=edges[k][0],to=edges[k][1], title=edges_type_title,color=edges_group_color)
        
    return net

#EdgeOptions.Color=all_edge_colors
#print(EdgeOptions.Color)
    

### Function to create the graph data with the specific pattern and the lower bound on the out-degrees of nodes 

* **Calls pattern_material_selection_for_KG()** with _data_for_network_nodes_, _data_for_network_edges_ created by the initial processing for all examples (see below). 
* Obtains _data_for_network_nodes_pattern_min_, _data_for_network_edges_pattern_min_ from the call above and **calls the previous function, draw_network()**.

In [13]:
def create_graph(data_for_network_nodes, 
                 data_for_network_edges,
                 pattern,min_nb_links):

    print('\nEntering create_graph\n')    
    output = pattern_material_selection_for_KG(data_for_network_nodes, 
                                               data_for_network_edges,
                                               pattern,
                                               min_nb_links)
    data_for_network_nodes_pattern_min = output['data_for_network_nodes_pattern_min']
    data_for_network_edges_pattern_min = output['data_for_network_edges_pattern_min']

    #print('\noutput: data_for_network_nodes_pattern_min:')
    #print(data_for_network_nodes_pattern_min)

    #print('\noutput: data_for_network_edges_pattern_min:')
    #print(data_for_network_edges_pattern_min)

    net = draw_network(data_for_network_nodes_pattern_min=data_for_network_nodes_pattern_min,
                       data_for_network_edges_pattern_min=data_for_network_edges_pattern_min)
    #net.show('nodes.html') 
    
    output = {'data_for_network_nodes_pattern_min':data_for_network_nodes_pattern_min,
              'data_for_network_edges_pattern_min':data_for_network_edges_pattern_min,
              'net':net}

    return output

### Initial processing for all examples

* Creates the data for the nodes and for the edges in _data_for_network_nodes_, _data_for_network_edges_ **required by function create_graph()**.

In [14]:



# print(clean_type_label_table)

output = create_node_tables(results_relations, results_types, results_titles)
node_type_table = output['node_type_table'] 
node_table = output['node_table']
data_for_network_nodes = output['data_for_network_nodes']

print('\nnode_type_table:\n',node_type_table) 


#data_for_network_nodes.to_excel('data_for_network_nodes.xlsx')
#print(data_for_network_nodes)   

output = create_edge_tables(results_relations, node_table)
edge_type_table = output['edge_type_table'] 
##results_relations_w_edge_type = output['results_relations_w_edge_type'] DELETED FROM THE OUTPUT OF create_edge_tables()
data_for_network_edges = output['data_for_network_edges']

print('\nedge_type_table:\n',edge_type_table)
#print(results_relations_w_edge_type) DELETED FROM THE OUTPUT OF create_edge_tables()

## print(data_for_network_edges)




Entering create_node_tables


data_for_network_nodes:

                                                  label  group  size
0               European Union Law - Regulation 99/2013      4     1
1     European Union Law - Implementing Regulation (...      4     1
2        European Union Law - Regulation (EU) 2019/2152      4     1
3     European Union Law - Commission Decision (EEC)...      4     1
4     European Union Law - Commission Regulation (EC...      4     1
...                                                 ...    ...   ...
5128                    Geospatial analysis at Eurostat      2     1
5129  Geographical information system of the Commiss...      2     1
5130             LUCAS - Land use and land cover survey      2     1
5131             LUCAS - Land use and land cover survey      2     1
5132             LUCAS - Land use and land cover survey      2     1

[4498 rows x 3 columns]

node_type_table:
                        node_type  type_id              clean_type_label


### Example 1: Covid-19

* The node colors correspond to the node types.
* The edge type can be found by hovering over an edge. Different edge types have different colors.

In [15]:
## Covid-19
net=None
data_for_network_nodes1 = data_for_network_nodes.copy()
data_for_network_edges1 = data_for_network_edges.copy()
output = create_graph(data_for_network_nodes = data_for_network_nodes1, 
                      data_for_network_edges = data_for_network_edges1,
                      pattern = "covid", min_nb_links=1)
net = output['net']
data_for_network_nodes_pattern_min = output['data_for_network_nodes_pattern_min']
data_for_network_edges_pattern_min = output['data_for_network_edges_pattern_min']
print('\nnodes data:\n')
print(data_for_network_nodes_pattern_min)
print('\nedges data:\n')      
print(data_for_network_edges_pattern_min)

net.show('nodes.html') 


Entering create_graph


Entering pattern_material_selection_for_KG


data_for_network_nodes_pattern_min:
                                                label  group  size
0   European Union Law - Regulation (EU) No 549/20...      4     1
1   European Union Law - Regulation (EC) No 1392/2007      4     1
2   Miscellaneous - Quarterly sector accounts news...      8     1
3   News - Flash estimates of income inequalities ...      9     1
4                News - Data in 2019 income estimates      9     1
5                                         News - here      9     1
6           Statistical DB - Annual national accounts     11     1
7   Statistical DB - Sector accounts - Non-financi...     11     1
8        Statistical DB - Quarterly national accounts     11     1
9              Statistic Reference Metadata - Targets     12     1
10  Statistical Article - Impact of Covid-19 crisi...     14     1
11  Statistical Article - Impact of COVID-19 on ma...     14     1
12  Statistical Article

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return func(*args, **kwargs)
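
The last lines of the output above are pandas' `SettingWithCopyWarning`. It is typically emitted when a value is assigned to a DataFrame that was obtained by filtering another one. The assignment that triggers it inside `create_graph` is not shown here, so the following is only a minimal sketch on a hypothetical table of one common way to avoid the warning: take an explicit copy of the filtered rows before modifying them.

```python
import pandas as pd

# Hypothetical node table, for illustration only
nodes = pd.DataFrame({"label": ["a", "b", "c"],
                      "group": [9, 11, 9],
                      "size": [1, 1, 1]})

# .copy() makes the subset an independent DataFrame,
# so assigning to it is unambiguous and leaves `nodes` unchanged
subset = nodes[nodes["group"] == 9].copy()
subset["size"] = 2
```

If the intention is instead to modify the original table, a single `nodes.loc[nodes["group"] == 9, "size"] = 2` assignment does that directly.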


### Example 2: Quality of Life

* The node colors correspond to the node types.
* The edge type is shown when hovering over an edge. Different edge types have different colors.

In [18]:
## QoL : 
net = None
# Work on copies so the full node and edge tables are left untouched
data_for_network_nodes2 = data_for_network_nodes.copy()
data_for_network_edges2 = data_for_network_edges.copy()
# Build the graph for a pattern with several alternatives for "quality of life"
output = create_graph(data_for_network_nodes = data_for_network_nodes2,
                      data_for_network_edges = data_for_network_edges2,
                      pattern = "quality of life|.*quality.*life.*|.*life.*quality.*|qol", min_nb_links=1)

# create_graph returns the network together with the selected nodes and edges
net = output['net']
data_for_network_nodes_pattern_min = output['data_for_network_nodes_pattern_min']
data_for_network_edges_pattern_min = output['data_for_network_edges_pattern_min']
print('\nnodes data:\n')
print(data_for_network_nodes_pattern_min)
print('\nedges data:\n')     
print(data_for_network_edges_pattern_min)
net.show('nodes.html') 


Entering create_graph


Entering pattern_material_selection_for_KG


data_for_network_nodes_pattern_min:
                                               label  group  size
0  News - Final report of the expert group on qua...      9     1
1  Statistic Reference Metadata - QoL-Measuring Q...     12     1
2  Statistic Reference Metadata - Quality of life...     12     1
3  Background Article - Quality of life indicator...      3     1

data_for_network_edges_pattern_min:

                                              source  \
0  Background Article - Quality of life indicator...   
1  Background Article - Quality of life indicator...   
2  Background Article - Quality of life indicator...   
3  Statistic Reference Metadata - Quality of life...   

                                              target val  group  
0  News - Final report of the expert group on qua...   1      2  
1  Statistic Reference Metadata - QoL-Measuring Q...   1      4  
2  Statistic Reference Metadata - Quality of li

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return func(*args, **kwargs)
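
The `pattern` passed in Example 2 is a regular expression with four alternatives. The sketch below shows how such a pattern selects labels, assuming (this is not confirmed by the cell above) that it is applied as a case-insensitive `re.search` on each label. The helper `select_labels` and the labels are made up for illustration. Under that assumption the `.*` wrappers are redundant, because `re.search` already matches anywhere in the string, so the shorter pattern selects the same labels.

```python
import re

PATTERN = "quality of life|.*quality.*life.*|.*life.*quality.*|qol"
# Shorter pattern that finds the same labels when used with re.search
SIMPLIFIED = "quality.*life|life.*quality|qol"


def select_labels(labels, pattern):
    """Return the labels in which the pattern is found, ignoring case."""
    regex = re.compile(pattern, flags=re.IGNORECASE)
    return [label for label in labels if regex.search(label)]


# Hypothetical labels, for illustration only
labels = [
    "Article - Quality of life in cities",
    "Metadata - QoL survey",
    "News - Life satisfaction and job quality",
    "Statistical DB - Annual national accounts",
]
selected = select_labels(labels, PATTERN)
```

Here `selected` holds the first three labels, and `select_labels(labels, SIMPLIFIED)` returns the same list.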


### Example 3: Climate

* The node colors correspond to the node types.
* The edge type is shown when hovering over an edge. Different edge types have different colors.

In [19]:
## Climate : 

net = None
# Work on copies so the full node and edge tables are left untouched
data_for_network_nodes3 = data_for_network_nodes.copy()
data_for_network_edges3 = data_for_network_edges.copy()
# Build the graph for three alternative keywords, with a higher min_nb_links
output = create_graph(data_for_network_nodes = data_for_network_nodes3,
                      data_for_network_edges = data_for_network_edges3,
                      pattern = "environment|greenhouse|climate", min_nb_links=4)
# create_graph returns the network together with the selected nodes and edges
net = output['net']
data_for_network_nodes_pattern_min = output['data_for_network_nodes_pattern_min']
data_for_network_edges_pattern_min = output['data_for_network_edges_pattern_min']
print('\nnodes data:\n')
print(data_for_network_nodes_pattern_min)
print('\nedges data:\n')
print(data_for_network_edges_pattern_min)
net.show('nodes.html') 



Entering create_graph


Entering pattern_material_selection_for_KG


data_for_network_nodes_pattern_min:
                                                label  group  size
0            European Union Law - Regulation 691/2011      4     1
1   European Union Law - Regulation No 691/2011 on...      4     1
2   European Union Law - Regulation (EU) No 549/20...      4     1
3   European Union Law - Commission Implementing R...      4     1
4   European Union Law - Summaries of EU legislati...      4     1
5   European Union Law - Summaries of EU legislati...      4     1
6   Glossary Concept - Good agricultural and envir...      5     1
7   Glossary Concept - Common agricultural policy ...      5     1
8                      Glossary Concept - Eco-schemes      5     1
9                     Glossary Concept - Biodiversity      5     1
10     Glossary Concept - Single payment scheme (SPS)      5     1
11   Glossary Concept - Agri-environmental indicators      5     1
12  Glossary Concept - 

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return func(*args, **kwargs)
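
Example 3 raises `min_nb_links` from 1 to 4. The sketch below shows one plausible reading of that parameter, namely that a node is kept only if it takes part in at least `min_nb_links` edges. This is an assumption about `create_graph`, not its actual code, and the helper `filter_by_min_links` and the toy edge list are hypothetical.

```python
from collections import Counter


def filter_by_min_links(edges, min_nb_links):
    """Keep nodes that appear in at least min_nb_links edges, and the edges among them."""
    degree = Counter()
    for source, target in edges:
        degree[source] += 1
        degree[target] += 1
    kept_nodes = {node for node, count in degree.items() if count >= min_nb_links}
    # An edge survives only if both of its endpoints were kept
    kept_edges = [(s, t) for s, t in edges if s in kept_nodes and t in kept_nodes]
    return kept_nodes, kept_edges


# Toy graph: A has 3 links, B and C have 2, D has 1
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
kept_nodes, kept_edges = filter_by_min_links(edges, min_nb_links=2)
```

With `min_nb_links=2` the node `D` and its single edge are dropped. A higher threshold, as in Example 3, keeps only the more densely connected part of the graph.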
