# Introduction

Having obtained a first version of the hybrid graph, the final step before producing node embeddings under the proposed approach is to assign a base semantic embedding to each node in the graph. This notebook covers that process, which is carried out in ascending order according to the hierarchical level of each node type within the DOCLEG document structure. That is, embeddings are first assigned to concept and text chunk nodes, followed by sequence, article, and finally act nodes.

This order is not arbitrary — it reflects the need to begin with nodes that are directly associated with actual text content, from which semantic embeddings can be meaningfully derived. Within the base graph constructed during the hackathon, only text chunk nodes (and to some extent concept and act nodes, as will be discussed later) meet this condition.

Due to the time constraints of the hackathon, the semantic embeddings of text chunk nodes are used as the foundation to propagate embeddings to their connected sequence nodes, and from there to article nodes, and partially to act nodes. This approach provided a quick and practical solution to assign initial embeddings to all nodes in the graph, enabling the use of a GraphSAGE model for node embedding generation.

Nonetheless, it is worth noting that, given more time, it would have been preferable to explore alternative strategies capable of producing unique embeddings for sequence and/or article nodes based on their own content — for example, by generating textual summaries at these levels — rather than relying solely on the aggregation of neighboring node embeddings.

Another area for potential improvement is the model used to generate the embeddings in the first place. For convenience, OpenAI's embedding models available on Azure were used in this case. However, BERT-based embedding models specialized in legal text, in Dutch and/or French text, or in both could further improve the quality of the embeddings, which in turn could improve the retrieval of relevant evidence for answering a user's query.

The final product of this notebook is a graph in which every node has a semantic embedding assigned, enabling its use to produce node embeddings.

## 0. Import dependencies

In [2]:
# Standard library imports
import os
import pickle
import sys

# Third party imports
import numpy as np
from dotenv import load_dotenv
from openai import AzureOpenAI
# from azure.ai.inference import EmbeddingsClient
# from azure.core.credentials import AzureKeyCredential

from tqdm import tqdm
import networkx as nx

try:
    # This will work in scripts where __file__ is defined
    current_dir = os.path.dirname(os.path.abspath(__file__))
    # Assuming "src" is parallel to the script folder
    project_root = os.path.abspath(os.path.join(current_dir, ".."))
except NameError:
    # In notebooks __file__ is not defined: assume we're in notebooks/riziv_dataset/
    project_root = os.path.abspath(os.path.join(os.getcwd(), "../.."))

src_path = os.path.join(project_root, "src")
if src_path not in sys.path:
    sys.path.append(src_path)

from main.ollama_utils import get_ollama_embedding

## 1. Load data

In [3]:
# Define the path to the RIZIV dataset files
RIZIV_data_path = os.path.join(project_root, "data", "RIZIV_hackathon_main")

# Load the graph
with open(os.path.join(RIZIV_data_path, 'intermediate','embeddingless_base_hybrid_graph.pkl'), 'rb') as f:
    G = pickle.load(f)

# Resolve the .env path inside the RIZIV data directory, regardless of where the notebook is run from
env_path = os.path.join(RIZIV_data_path, '.env')

# Load environment variables from the .env file
load_dotenv(dotenv_path=env_path)

True

## 2. Generate semantic embeddings for Text Chunk and Concept nodes

This section covers the generation and assignment of semantic embeddings to all text chunk and concept nodes in the graph. For the hackathon, embeddings were generated using OpenAI's text-embedding-3-small model via the Azure API.

For text chunk nodes, embeddings are computed based on the full text associated with each node. In the case of concept nodes, embeddings are derived from the concept's name. This choice, while functional, should be considered sub-optimal. It was mainly driven by the time constraints of the hackathon and the fact that concept nodes were included primarily for illustrative purposes (see Notebook 02). A significantly more robust approach would have involved generating a concise, representative definition for each relevant concept, effectively building a dictionary of key legal terms extracted from the DOCLEG database. These definitions could then serve as a more meaningful basis for computing the semantic embeddings of concept nodes.

Once the embedding generation process is complete, a summary is printed with the number of nodes processed, the number successfully assigned an embedding, and the overall coverage rate across the targeted node types. Some nodes may fail to receive embeddings because of occasional issues such as API errors, content filtering constraints, or unexpected artifacts in the source text, all plausible given the size and heterogeneity of the DOCLEG dataset. In such cases, an empty (all-zero) embedding is assigned to the corresponding node. This is again a quick fix within the context of the hackathon. With more time, the better approach would have been to inspect the actual cause of each failure and apply a tailored solution case by case.
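
Because requests are sent in batches of 16 texts, a single problematic text causes the whole batch to fall back to empty embeddings. One possible refinement, sketched here under assumptions, is to retry the texts of a failed batch one at a time so that only the offending inputs are left without an embedding. The names `embed_with_fallback` and `fake_embed` are hypothetical and not part of the notebook; `fake_embed` stands in for the real embedding call and simply rejects empty strings, which is only one possible cause of an invalid-input error.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]


def embed_with_fallback(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[Vector]],
) -> List[Optional[Vector]]:
    """Embed a batch; if the batch call fails, retry each text on its own.

    Returns one entry per input text, with None marking texts that still fail.
    """
    texts = list(texts)
    try:
        return list(embed_fn(texts))
    except Exception:
        results: List[Optional[Vector]] = []
        for text in texts:
            try:
                results.append(embed_fn([text])[0])
            except Exception:
                # Only this text is marked as failed, not the whole batch
                results.append(None)
        return results


# Hypothetical stand-in for the real embedding call: rejects empty strings.
def fake_embed(batch: List[str]) -> List[Vector]:
    if any(not text.strip() for text in batch):
        raise ValueError("'$.input' is invalid")
    return [[float(len(text)), 1.0] for text in batch]


vectors = embed_with_fallback(["article 1", "", "annex"], fake_embed)
failed = [i for i, v in enumerate(vectors) if v is None]
print(failed)  # [1]
```

The indices collected in `failed` point at the texts worth inspecting by hand, which keeps the number of placeholder embeddings as small as possible at the cost of a few extra requests per failed batch.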

In [7]:
# # Testing embeddings generation with examples
# 
# # For Serverless API or Managed Compute endpoints
# client = AzureOpenAI(   
#     api_version=os.getenv("AZURE_OPENAI_EMBEDDINGS_API_VERSION"),
#     azure_endpoint=os.getenv("AZURE_OPENAI_EMBEDDINGS_ENDPOINT"),
#     azure_ad_token=AzureKeyCredential(os.getenv("AZURE_OPENAI_EMBEDDINGS_KEY")),
# )
# 
# response = client.embeddings.create(
#     input=["first phrase","second phrase","third phrase"],
#     model=os.getenv("AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT")
# )
# 
# for item in response.data:
#     item_embedding = item['embedding']
#     print(item_embedding)
#     print(len(item_embedding))

embedding = np.array(get_ollama_embedding("Embedding generation test")['embedding'])
embedding

array([-1.05666721, -0.99160188, -0.19772267, ...,  0.03814373,
       -1.27074277,  1.15008116])

In [None]:
def generate_embeddings_for_nodes(G, client, assign_empty: bool = True):
    """
    Generate and assign embeddings for text_chunk and concept nodes.
    If a node cannot receive an embedding and assign_empty is True,
    it will receive an empty embedding of the appropriate dimension.
    """
    print("Generating embeddings for nodes...")
    
    # Get lists of nodes by type
    chunk_nodes = [node for node in G.nodes() if G.nodes[node].get('type_node') == 'text_chunk']
    concept_nodes = [node for node in G.nodes() if G.nodes[node].get('type_node') == 'concept']
    target_nodes = set(chunk_nodes + concept_nodes)  # Use set to ensure uniqueness
    
    print(f"Found {len(chunk_nodes)} text chunks and {len(concept_nodes)} concepts")
    
    # Clear existing embeddings for target nodes
    for node in target_nodes:
        if 'embedding' in G.nodes[node]:
            del G.nodes[node]['embedding']
    
    embedding_dim = None
    nodes_with_empty = 0
    processed_nodes = set()  # Track which nodes we've processed
    
    # Process text chunks
    print("\nProcessing text chunks...")
    for batch in tqdm(range(0, len(chunk_nodes), 16)):
        batch_nodes = chunk_nodes[batch:batch + 16]
        batch_texts = [G.nodes[node]['text'] for node in batch_nodes]
        
        try:
            response = client.embed(
                input=batch_texts,
                model="text-embedding-3-small"
            )
            
            for node, item in zip(batch_nodes, response.data):
                embedding = np.array(item['embedding'])
                G.nodes[node]['embedding'] = embedding
                processed_nodes.add(node)
                if embedding_dim is None:
                    embedding_dim = len(embedding)
                
        except Exception as e:
            print(f"Error processing batch starting with node {batch_nodes[0]}: {str(e)}")
            if assign_empty and embedding_dim is not None:
                for node in batch_nodes:
                    if node not in processed_nodes:
                        G.nodes[node]['embedding'] = np.zeros(embedding_dim)
                        nodes_with_empty += 1
                        processed_nodes.add(node)
            continue
    
    # Process concept nodes
    print("\nProcessing concept nodes...")
    for batch in tqdm(range(0, len(concept_nodes), 16)):
        batch_nodes = concept_nodes[batch:batch + 16]
        batch_texts = [G.nodes[node]['name'] for node in batch_nodes]
        
        try:
            response = client.embed(
                input=batch_texts,
                model="text-embedding-3-small"
            )
            
            for node, item in zip(batch_nodes, response.data):
                embedding = np.array(item['embedding'])
                G.nodes[node]['embedding'] = embedding
                processed_nodes.add(node)
                if embedding_dim is None:
                    embedding_dim = len(embedding)
                
        except Exception as e:
            print(f"Error processing batch starting with node {batch_nodes[0]}: {str(e)}")
            if assign_empty and embedding_dim is not None:
                for node in batch_nodes:
                    if node not in processed_nodes:
                        G.nodes[node]['embedding'] = np.zeros(embedding_dim)
                        nodes_with_empty += 1
                        processed_nodes.add(node)
            continue
    
    # Verify embeddings (target nodes only)
    nodes_with_real_embeddings = sum(1 for node in target_nodes 
                                   if 'embedding' in G.nodes[node] 
                                   and not np.all(G.nodes[node]['embedding'] == 0))
    
    nodes_without_embeddings = len(target_nodes) - len(processed_nodes)
    
    print("\nEmbedding generation summary:")
    print(f"Total target nodes: {len(target_nodes)}")
    print(f"Nodes with real embeddings: {nodes_with_real_embeddings}")
    print(f"Nodes with empty embeddings: {nodes_with_empty}")
    print(f"Nodes without embeddings: {nodes_without_embeddings}")
    print(f"Coverage: {((nodes_with_real_embeddings + nodes_with_empty)/len(target_nodes))*100:.2f}%")
    
    return G

# Execute the function
G = generate_embeddings_for_nodes(G, client)

Generating embeddings for nodes...
Found 22127 text chunks and 285 concepts

Processing text chunks...


 63%|██████▎   | 878/1383 [17:03<07:50,  1.07it/s]  

Error processing batch starting with node chunk_12018556_0: (None) '$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference.
Code: None
Message: '$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference.


100%|██████████| 1383/1383 [41:25<00:00,  1.80s/it]



Processing concept nodes...


100%|██████████| 18/18 [00:10<00:00,  1.66it/s]



Embedding generation summary:
Total target nodes: 22412
Nodes with real embeddings: 22396
Nodes with empty embeddings: 16
Nodes without embeddings: 0
Coverage: 100.00%


## 3. Propagate embeddings to Sequence and Article nodes

Having generated semantic embeddings for text chunk nodes, these are then propagated to their parent sequence nodes, and subsequently from sequences to their corresponding article nodes. The embeddings of "parent" nodes are computed by averaging the embeddings of all connected nodes at the lower hierarchical level. That is, each sequence node receives an embedding based on the average of its associated text chunks, and each article node does so based on the average of its linked sequences.

At the end of this process, all sequence and article nodes have been assigned an embedding, either real or empty, preparing the graph for the final embedding step at the act level.
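
As a small worked example of the averaging step, the sketch below computes a parent embedding as the component-wise mean of its children. It uses plain Python lists and a hypothetical helper name, `mean_embedding`, purely for illustration; the toy vectors are made up and much shorter than real embeddings.

```python
from typing import List, Sequence


def mean_embedding(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


# Two toy chunk embeddings belonging to the same sequence
chunk_embeddings = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]]
sequence_embedding = mean_embedding(chunk_embeddings)
print(sequence_embedding)  # [2.0, 2.0, 1.0]

# An all-zero placeholder among the children rescales the mean
with_placeholder = mean_embedding([[2.0, 4.0], [0.0, 0.0]])
print(with_placeholder)  # [1.0, 2.0]
```

The second call illustrates a side effect of zero placeholders: because they add nothing to the sum, they leave the direction of the parent embedding unchanged but shrink its norm. Skipping all-zero children before averaging would be one way to avoid this.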

In [5]:
def propagate_embeddings_to_sequences(G: nx.Graph, assign_empty: bool = True):
    """
    Propagate embeddings from text_chunk nodes to their parent sequence nodes
    by averaging the embeddings of all connected chunks. Optionally assign
    empty embeddings to sequences without valid chunk embeddings.
    
    Args:
        G (nx.Graph): Graph to modify
        assign_empty (bool): Whether to assign empty embeddings to sequences without valid chunks
        
    Returns:
        Tuple[int, int, int, List[str]]: (total_sequences, sequences_with_embedding, 
                                         sequences_with_empty, problem_sequences)
    """
    # Get all sequence nodes
    sequence_nodes = [
        node for node, attr in G.nodes(data=True) 
        if attr.get('type_node') == 'sequence'
    ]
    
    sequences_without_embedding = []
    sequences_with_empty = []
    total_sequences = len(sequence_nodes)
    embedding_dim = None
    
    print("Propagating embeddings to sequences...")
    for sequence_node in sequence_nodes:
        try:
            # Get all neighboring chunks
            chunk_neighbors = [
                neighbor for neighbor in G.neighbors(sequence_node)
                if G.nodes[neighbor].get('type_node') == 'text_chunk'
            ]
            
            if not chunk_neighbors:
                if assign_empty:
                    # If no embedding dimension is known yet, try to find it from any chunk node
                    if embedding_dim is None:
                        for node in G.nodes():
                            if (G.nodes[node].get('type_node') == 'text_chunk' and 
                                G.nodes[node].get('embedding') is not None):
                                embedding_dim = len(G.nodes[node]['embedding'])
                                break
                    
                    if embedding_dim is not None:
                        G.nodes[sequence_node]['embedding'] = [0.0] * embedding_dim
                        sequences_with_empty.append(sequence_node)
                        continue
                
                sequences_without_embedding.append(sequence_node)
                continue
            
            # Get embeddings from all chunks
            chunk_embeddings = []
            for chunk in chunk_neighbors:
                embedding = G.nodes[chunk].get('embedding')
                if embedding is not None:
                    chunk_embeddings.append(embedding)
                    if embedding_dim is None:
                        embedding_dim = len(embedding)
            
            if not chunk_embeddings:
                if assign_empty and embedding_dim is not None:
                    G.nodes[sequence_node]['embedding'] = [0.0] * embedding_dim
                    sequences_with_empty.append(sequence_node)
                    continue
                    
                sequences_without_embedding.append(sequence_node)
                continue
            
            # Calculate average embedding
            avg_embedding = [
                sum(values) / len(chunk_embeddings)
                for values in zip(*chunk_embeddings)
            ]
            
            # Assign average embedding to sequence node
            G.nodes[sequence_node]['embedding'] = avg_embedding
            
        except Exception as e:
            if assign_empty and embedding_dim is not None:
                G.nodes[sequence_node]['embedding'] = [0.0] * embedding_dim
                sequences_with_empty.append(sequence_node)
            else:
                sequences_without_embedding.append(sequence_node)
    
    # Print summary
    sequences_with_embedding = total_sequences - len(sequences_without_embedding) - len(sequences_with_empty)
    
    print(f"\nTotal sequence nodes: {total_sequences}")
    print(f"Sequences with propagated embedding: {sequences_with_embedding}")
    print(f"Sequences with empty embedding: {len(sequences_with_empty)}")
    print(f"Sequences without embedding: {len(sequences_without_embedding)}")
    print(f"Coverage: {((sequences_with_embedding + len(sequences_with_empty))/total_sequences)*100:.2f}%")
    
    return total_sequences, sequences_with_embedding, len(sequences_with_empty), sequences_without_embedding

total_sequences, sequences_with_embedding, n_sequences_with_empty, sequences_without_embedding = propagate_embeddings_to_sequences(G)

Propagating embeddings to sequences...

Total sequence nodes: 15857
Sequences with propagated embedding: 15857
Sequences with empty embedding: 0
Sequences without embedding: 0
Coverage: 100.00%


In [6]:
def propagate_embeddings_to_articles(G: nx.Graph, assign_empty: bool = True):
    """
    Propagate embeddings from sequence nodes to their parent article nodes
    by averaging the embeddings of all connected sequences. Optionally assign
    empty embeddings to articles without sequence neighbors.
    
    Args:
        G (nx.Graph): Graph to modify
        assign_empty (bool): Whether to assign empty embeddings to articles without sequences
        
    Returns:
        Tuple[int, int, int, List[str]]: (total_articles, articles_with_embedding, 
                                         articles_with_empty, problem_articles)
    """
    # Get all article nodes
    article_nodes = [
        node for node, attr in G.nodes(data=True) 
        if attr.get('type_node') == 'article'
    ]
    
    articles_without_embedding = []
    articles_with_empty = []
    total_articles = len(article_nodes)
    embedding_dim = None
    
    print("Propagating embeddings to articles...")
    for article_node in article_nodes:
        try:
            # Get all neighboring sequences
            sequence_neighbors = [
                neighbor for neighbor in G.neighbors(article_node)
                if G.nodes[neighbor].get('type_node') == 'sequence'
            ]
            
            if not sequence_neighbors:
                if assign_empty:
                    # If no embedding dimension is known yet, try to find it from any sequence node
                    if embedding_dim is None:
                        for node in G.nodes():
                            if (G.nodes[node].get('type_node') == 'sequence' and 
                                G.nodes[node].get('embedding') is not None):
                                embedding_dim = len(G.nodes[node]['embedding'])
                                break
                    
                    if embedding_dim is not None:
                        G.nodes[article_node]['embedding'] = [0.0] * embedding_dim
                        articles_with_empty.append(article_node)
                        continue
                
                articles_without_embedding.append(article_node)
                continue
            
            # Get embeddings from all sequences
            sequence_embeddings = []
            for sequence in sequence_neighbors:
                embedding = G.nodes[sequence].get('embedding')
                if embedding is not None:
                    sequence_embeddings.append(embedding)
                    if embedding_dim is None:
                        embedding_dim = len(embedding)
            
            if not sequence_embeddings:
                if assign_empty and embedding_dim is not None:
                    G.nodes[article_node]['embedding'] = [0.0] * embedding_dim
                    articles_with_empty.append(article_node)
                    continue
                
                articles_without_embedding.append(article_node)
                continue
            
            # Calculate average embedding
            avg_embedding = [
                sum(values) / len(sequence_embeddings)
                for values in zip(*sequence_embeddings)
            ]
            
            # Assign average embedding to article node
            G.nodes[article_node]['embedding'] = avg_embedding
            
        except Exception as e:
            articles_without_embedding.append(article_node)
    
    # Print summary
    articles_with_real_embedding = total_articles - len(articles_without_embedding) - len(articles_with_empty)
    
    print(f"\nTotal article nodes: {total_articles}")
    print(f"Articles with propagated embedding: {articles_with_real_embedding}")
    print(f"Articles with empty embedding: {len(articles_with_empty)}")
    print(f"Articles without any embedding: {len(articles_without_embedding)}")
    print(f"Coverage: {((articles_with_real_embedding + len(articles_with_empty))/total_articles)*100:.2f}%")
    
    return total_articles, articles_with_real_embedding, len(articles_with_empty), articles_without_embedding

total_articles, articles_with_real_embedding, n_articles_with_empty, articles_without_embedding = propagate_embeddings_to_articles(G)

Propagating embeddings to articles...

Total article nodes: 9244
Articles with propagated embedding: 9175
Articles with empty embedding: 69
Articles without any embedding: 0
Coverage: 100.00%


## 4. Embeddings for Act nodes

The final step in assigning semantic embeddings across the graph involves the act nodes. Unlike other node types, acts do not contain rich textual content, but they do include some textual information in the form of their title or short title, which provides some basic semantic value. The initial embeddings for act nodes are generated from these fields (Title and TitleShort from WorkActLanguageFR, included as node attributes): when both are available, their embeddings are averaged; otherwise, the embedding is computed from whichever field is available. If neither yields a valid result, an empty embedding is assigned, ensuring coverage for all act nodes.
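
The selection logic just described can be condensed into a few lines. The sketch below is illustrative only: `combine_title_embeddings` is a hypothetical helper, `None` stands for a field whose embedding could not be obtained, and the two-dimensional vectors are toy values.

```python
from typing import List, Optional

Vector = List[float]


def combine_title_embeddings(
    title: Optional[Vector],
    short_title: Optional[Vector],
    dim: int,
) -> Vector:
    """Average whichever title embeddings are available; zeros if none are."""
    available = [v for v in (title, short_title) if v is not None]
    if not available:
        # Neither field produced an embedding: fall back to a placeholder
        return [0.0] * dim
    return [sum(component) / len(available) for component in zip(*available)]


both = combine_title_embeddings([1.0, 3.0], [3.0, 5.0], dim=2)
only_short = combine_title_embeddings(None, [3.0, 5.0], dim=2)
neither = combine_title_embeddings(None, None, dim=2)
print(both, only_short, neither)  # [2.0, 4.0] [3.0, 5.0] [0.0, 0.0]
```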

To refine these initial embeddings, a second step is performed in which each act node's embedding is recalculated by averaging its embedding from the first step with the mean embedding of its neighboring article nodes. Although again sub-optimal, this step aims to let act node embeddings better reflect their actual content and context within the broader document hierarchy, leveraging the semantic information already captured at lower levels of the graph.

With the act embeddings computed, the embedding generation and propagation process is complete: all nodes in the graph, from the most granular to the most abstract, are represented in a shared semantic space.
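
To make the weighting of the refinement step explicit: with an equal-weight average, the title-based embedding contributes one half of the final vector, and each of an act's n article neighbors contributes 1/(2n). The sketch below illustrates this with toy vectors. The `title_weight` parameter is a hypothetical extension that is not in the notebook; its default of 0.5 corresponds to the equal-weight average described above.

```python
from typing import List, Sequence

Vector = List[float]


def refine_act_embedding(
    title_based: Sequence[float],
    article_embeddings: Sequence[Sequence[float]],
    title_weight: float = 0.5,  # hypothetical knob; 0.5 is a plain average
) -> Vector:
    """Blend a title-based embedding with the mean of the article embeddings."""
    if not article_embeddings:
        # No article neighbors: keep the title-based embedding as is
        return list(title_based)
    n = len(article_embeddings)
    articles_mean = [sum(component) / n for component in zip(*article_embeddings)]
    return [
        title_weight * t + (1.0 - title_weight) * a
        for t, a in zip(title_based, articles_mean)
    ]


act_embedding = refine_act_embedding([1.0, 0.0], [[0.0, 2.0], [0.0, 4.0]])
print(act_embedding)  # [0.5, 1.5]
```

Exposing the weight would make it possible to test whether acts with many articles benefit from leaning more on their content than on their title, although that question is outside the scope of this notebook.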

In [7]:
def generate_embeddings_for_acts(G: nx.Graph, client) -> nx.Graph:
    """
    Generate and assign embeddings for act nodes based on their Title and TitleShort attributes.
    Process one by one and handle failures gracefully.
    
    Args:
        G (nx.Graph): Graph containing act nodes
        client: Azure OpenAI embeddings client
        
    Returns:
        nx.Graph: Graph with updated act embeddings
    """
    print("Generating embeddings for act nodes...")
    
    # Get all act nodes
    act_nodes = [node for node in G.nodes() if G.nodes[node].get('type_node') == 'act']
    print(f"Found {len(act_nodes)} act nodes")
    
    # Track statistics
    acts_with_averaged_embedding = 0
    acts_with_single_embedding = 0
    acts_with_empty_embedding = 0
    embedding_dim = None
    
    # Process acts one by one
    for node in tqdm(act_nodes, desc="Processing acts"):
        try:
            title = G.nodes[node].get('Title', '')
            short_title = G.nodes[node].get('TitleShort', '')
            title_embedding = None
            short_title_embedding = None
            
            # Try to get Title embedding
            try:
                title_response = client.embed(
                    input=[title],
                    model="text-embedding-3-small"
                )
                title_embedding = np.array(title_response.data[0]['embedding'])
                if embedding_dim is None:
                    embedding_dim = len(title_embedding)
            except Exception:
                # Leave title_embedding as None so the fallback logic below applies
                pass
            
            # Try to get TitleShort embedding
            try:
                short_title_response = client.embed(
                    input=[short_title],
                    model="text-embedding-3-small"
                )
                short_title_embedding = np.array(short_title_response.data[0]['embedding'])
                if embedding_dim is None:
                    embedding_dim = len(short_title_embedding)
            except Exception:
                # Leave short_title_embedding as None so the fallback logic below applies
                pass
            
            # Assign embeddings based on what we got
            if title_embedding is not None and short_title_embedding is not None:
                # Store both embeddings and their average
                G.nodes[node]['title_embedding'] = title_embedding
                G.nodes[node]['title_short_embedding'] = short_title_embedding
                G.nodes[node]['embedding'] = (title_embedding + short_title_embedding) / 2
                acts_with_averaged_embedding += 1
                
            elif title_embedding is not None:
                # Store only title embedding
                G.nodes[node]['title_embedding'] = title_embedding
                G.nodes[node]['embedding'] = title_embedding
                acts_with_single_embedding += 1
                
            elif short_title_embedding is not None:
                # Store only short title embedding
                G.nodes[node]['title_short_embedding'] = short_title_embedding
                G.nodes[node]['embedding'] = short_title_embedding
                acts_with_single_embedding += 1
                
            else:
                # Assign empty embedding if we know the dimension
                if embedding_dim is not None:
                    G.nodes[node]['embedding'] = np.zeros(embedding_dim)
                    acts_with_empty_embedding += 1
                
        except Exception as e:
            if embedding_dim is not None:
                G.nodes[node]['embedding'] = np.zeros(embedding_dim)
                acts_with_empty_embedding += 1
    
    # Print summary
    total_acts = len(act_nodes)
    print("\nEmbedding generation summary:")
    print(f"Total act nodes: {total_acts}")
    print(f"Acts with averaged embeddings: {acts_with_averaged_embedding} ({(acts_with_averaged_embedding/total_acts)*100:.2f}%)")
    print(f"Acts with single embedding: {acts_with_single_embedding} ({(acts_with_single_embedding/total_acts)*100:.2f}%)")
    print(f"Acts with empty embeddings: {acts_with_empty_embedding} ({(acts_with_empty_embedding/total_acts)*100:.2f}%)")
    print(f"Embedding dimension: {embedding_dim}")
    
    return G

G = generate_embeddings_for_acts(G, client)


Generating embeddings for act nodes...
Found 1146 act nodes


Processing acts: 100%|██████████| 1146/1146 [05:09<00:00,  3.71it/s]


Embedding generation summary:
Total act nodes: 1146
Acts with averaged embeddings: 1134 (98.95%)
Acts with single embedding: 10 (0.87%)
Acts with empty embeddings: 2 (0.17%)
Embedding dimension: 1536





In [8]:
def recalculate_act_embeddings(G: nx.Graph) -> nx.Graph:
    """
    Recalculate act embeddings by averaging their current embedding with 
    the mean embedding of their article neighbors. The final result is stored
    as the 'embedding' attribute.
    
    Args:
        G (nx.Graph): Graph with act and article embeddings
        
    Returns:
        nx.Graph: Graph with updated act embeddings
    """
    print("Recalculating act embeddings...")
    
    # Get all act nodes
    act_nodes = [node for node in G.nodes() if G.nodes[node].get('type_node') == 'act']
    print(f"Found {len(act_nodes)} act nodes")
    
    # Track statistics
    acts_with_both = 0
    acts_kept_original = 0
    acts_with_only_articles = 0
    acts_with_empty = 0
    
    for act_node in tqdm(act_nodes, desc="Processing acts"):
        try:
            # Store original act embedding if it exists
            if 'embedding' in G.nodes[act_node]:
                original_act_embedding = G.nodes[act_node]['embedding'].copy()  # Make a copy
                G.nodes[act_node]['title_based_embedding'] = original_act_embedding  # Store original
            else:
                original_act_embedding = None
            
            # Get article neighbors
            article_neighbors = [
                neighbor for neighbor in G.neighbors(act_node)
                if G.nodes[neighbor].get('type_node') == 'article'
            ]
            
            # Collect valid article embeddings
            article_embeddings = []
            for article in article_neighbors:
                embedding = G.nodes[article].get('embedding')
                if embedding is not None:
                    article_embeddings.append(embedding)
            
            # Calculate mean article embedding if we have any
            if article_embeddings:
                mean_article_embedding = np.mean(article_embeddings, axis=0)
                G.nodes[act_node]['articles_mean_embedding'] = mean_article_embedding
            else:
                mean_article_embedding = None
            
            # Decide final embedding based on what we have
            if original_act_embedding is not None and mean_article_embedding is not None:
                # Average both embeddings and store as main embedding
                G.nodes[act_node]['embedding'] = (original_act_embedding + mean_article_embedding) / 2
                acts_with_both += 1
                
            elif original_act_embedding is not None:
                # Keep original as main embedding
                G.nodes[act_node]['embedding'] = original_act_embedding
                acts_kept_original += 1
                
            elif mean_article_embedding is not None:
                # Use articles' embedding as main embedding
                G.nodes[act_node]['embedding'] = mean_article_embedding
                acts_with_only_articles += 1
                
            else:
                # No embeddings available
                acts_with_empty += 1
            
        except Exception as e:
            print(f"\nError processing act {act_node}: {str(e)}")
            acts_with_empty += 1
    
    # Print summary
    total_acts = len(act_nodes)
    print("\nEmbedding recalculation summary:")
    print(f"Total act nodes: {total_acts}")
    print(f"Acts with combined embeddings (act + articles): {acts_with_both} ({(acts_with_both/total_acts)*100:.2f}%)")
    print(f"Acts with only original embedding: {acts_kept_original} ({(acts_kept_original/total_acts)*100:.2f}%)")
    print(f"Acts with only articles' embedding: {acts_with_only_articles} ({(acts_with_only_articles/total_acts)*100:.2f}%)")
    print(f"Acts without embeddings: {acts_with_empty} ({(acts_with_empty/total_acts)*100:.2f}%)")
    
    return G

G = recalculate_act_embeddings(G)

Recalculating act embeddings...
Found 1146 act nodes


Processing acts: 100%|██████████| 1146/1146 [00:01<00:00, 844.86it/s]


Embedding recalculation summary:
Total act nodes: 1146
Acts with combined embeddings (act + articles): 1009 (88.05%)
Acts with only original embedding: 137 (11.95%)
Acts with only articles' embedding: 0 (0.00%)
Acts without embeddings: 0 (0.00%)





In [32]:
# Homogenize embedding format to numpy array and save graph
for node in G.nodes():
    if 'embedding' in G.nodes[node] and not isinstance(G.nodes[node]['embedding'], np.ndarray):
        G.nodes[node]['embedding'] = np.array(G.nodes[node]['embedding'])

with open(os.path.join(RIZIV_data_path, 'base_hybrid_graph.pkl'), 'wb') as f:
    pickle.dump(G, f)
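
A quick audit of the final graph can confirm that every node carries an embedding, that all embeddings share one dimension, and which nodes only hold an all-zero placeholder. The sketch below is a dependency-free illustration: `audit_embeddings` is a hypothetical helper that works on a plain mapping of node to attributes, and the toy nodes are invented for the example.

```python
from typing import Any, Dict, Hashable, List, Mapping


def audit_embeddings(
    node_attrs: Mapping[Hashable, Mapping[str, Any]],
) -> Dict[str, List[Any]]:
    """Report nodes without an embedding, all-zero embeddings, and dimensions seen."""
    missing: List[Hashable] = []
    all_zero: List[Hashable] = []
    dimensions = set()
    for node, attrs in node_attrs.items():
        embedding = attrs.get("embedding")
        if embedding is None:
            missing.append(node)
            continue
        values = [float(x) for x in embedding]  # accepts lists and arrays alike
        dimensions.add(len(values))
        if all(v == 0.0 for v in values):
            all_zero.append(node)
    return {"missing": missing, "all_zero": all_zero, "dimensions": sorted(dimensions)}


# Toy stand-in for the graph's node attributes
toy_nodes = {
    "chunk_1": {"embedding": [0.1, 0.2]},
    "chunk_2": {"embedding": [0.0, 0.0]},
    "act_1": {},
}
report = audit_embeddings(toy_nodes)
print(report)
# {'missing': ['act_1'], 'all_zero': ['chunk_2'], 'dimensions': [2]}
```

For the actual graph, the same function could be called as `audit_embeddings(dict(G.nodes(data=True)))`. More than one entry under `dimensions`, or any entry under `missing`, would indicate that a node needs attention before the graph is used to produce node embeddings.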