# Introduction

After building the base document graph and extracting relevant legal concepts for a sample of text fragments, this notebook focuses on incorporating these concepts into the graph as a new type of node: concept nodes, each representing one of the detected legal terms. These nodes are connected via edges to all text chunk nodes in which the corresponding concept was identified.

Additionally, before adding this new node type, the notebook includes a code section that prunes all sequence nodes of type PUB. Entries in the "Sequence" table of the DOCLEG database are of one of two types, PUB or VIG, depending on whether the sequence captures the text at the moment of its publication (PUB) or at the time it entered into force (VIG); each sequence also records the date on which this happened. During the hackathon, the RIZIV team expressed a preference for keeping only VIG sequences in order to simplify the prototype Q&A system.

## 0. Import dependencies

In [1]:
# Standard library imports
import os
import pickle
import sys

# Third party imports
import pandas as pd
from dotenv import load_dotenv
from tqdm import tqdm

try:
    # This will work in scripts where __file__ is defined
    current_dir = os.path.dirname(os.path.abspath(__file__))
    # Assuming "src" is parallel to the script folder
    project_root = os.path.abspath(os.path.join(current_dir, ".."))
except NameError:
    # In notebooks __file__ is not defined: assume we're in notebooks/riziv_dataset/
    project_root = os.path.abspath(os.path.join(os.getcwd(), "../.."))

src_path = os.path.join(project_root, "src")
if src_path not in sys.path:
    sys.path.append(src_path)

## 1. Loading the data

In [2]:
# Define the path to the RIZIV dataset files
RIZIV_data_path = os.path.join(project_root, "data", "RIZIV_hackathon_main")

# Load the base document graph
with open(os.path.join(RIZIV_data_path, 'base_document_graph.pkl'), 'rb') as f:
    G = pickle.load(f)

# Load the relevant legal concepts dataframe
# Note: this loads df_concepts_hackathon_original.pkl rather than df_concepts.pkl.
# The former is the original concepts dataframe used during the hackathon, which is
# not the same as the one generated in notebook 02.
concepts_df = pd.read_pickle(os.path.join(RIZIV_data_path, 'df_concepts_hackathon_original.pkl'))

# Build the path to the .env file stored in the RIZIV data directory
env_path = os.path.join(RIZIV_data_path, '.env')

# Load environment variables from the .env file
load_dotenv(dotenv_path=env_path)

True

## 2. Removal of sequence nodes of type "PUB"

The code below deletes the 'PUB' type sequence nodes from the base document graph, together with all their dependent text chunk nodes. A summary of the number of nodes per type is printed before and after this operation to illustrate its impact.

In [3]:
# Removing 'PUB' sequence nodes from the graph as well as their dependent text_chunk nodes

def print_graph_summary(G, message="Graph Summary"):
    """Print summary of nodes by type"""
    print(f"\n{message}")
    print("-" * 50)
    node_types = {}
    for node in G.nodes():
        node_type = G.nodes[node].get('type_node')
        if node_type:
            node_types[node_type] = node_types.get(node_type, 0) + 1
    
    for node_type, count in sorted(node_types.items()):
        print(f"Nodes of type '{node_type}': {count}")
    print(f"Total edges: {G.number_of_edges()}")
    print("-" * 50)

# Print initial graph summary
print_graph_summary(G, "Initial Graph State")

# Get list of sequence nodes to remove
sequence_nodes_to_remove = [
    node for node in G.nodes() 
    if G.nodes[node].get('type_node') == 'sequence' 
    and G.nodes[node].get('type') == 'PUB'
]

# Get all chunk nodes connected to these sequences
chunk_nodes_to_remove = set()
for seq_node in sequence_nodes_to_remove:
    # Get all neighbors of the sequence node
    neighbors = G.neighbors(seq_node)
    # Add to removal list if it's a text_chunk
    for neighbor in neighbors:
        if G.nodes[neighbor].get('type_node') == 'text_chunk':
            chunk_nodes_to_remove.add(neighbor)

# Remove all identified nodes
all_nodes_to_remove = sequence_nodes_to_remove + list(chunk_nodes_to_remove)
G.remove_nodes_from(all_nodes_to_remove)

# Print final graph summary
print_graph_summary(G, "Graph State After Removing PUB Sequences and Their Chunks")

print(f"\nRemoved {len(sequence_nodes_to_remove)} PUB sequence nodes")
print(f"Removed {len(chunk_nodes_to_remove)} associated chunk nodes")
print(f"Total nodes removed: {len(all_nodes_to_remove)}")


Initial Graph State
--------------------------------------------------
Nodes of type 'act': 1146
Nodes of type 'article': 9244
Nodes of type 'sequence': 31941
Nodes of type 'text_chunk': 43436
Total edges: 192232
--------------------------------------------------

Graph State After Removing PUB Sequences and Their Chunks
--------------------------------------------------
Nodes of type 'act': 1146
Nodes of type 'article': 9244
Nodes of type 'sequence': 15857
Nodes of type 'text_chunk': 22127
Total edges: 106996
--------------------------------------------------

Removed 16084 PUB sequence nodes
Removed 21309 associated chunk nodes
Total nodes removed: 37393
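
The pruning above can be double-checked with a small helper. The sketch below is not part of the original notebook: `check_pruning` is a hypothetical function, and the toy graph only mimics the `type_node`, `type` and `relationship_type` attributes used in this notebook. It assumes the graph is a NetworkX graph (a `DiGraph` is used here) and reports any remaining PUB sequence nodes as well as text chunks that are no longer attached to any sequence.

```python
import networkx as nx


def check_pruning(G):
    """Return (remaining PUB sequence nodes, text chunks without a sequence neighbor)."""
    pub_left = [
        node for node, data in G.nodes(data=True)
        if data.get("type_node") == "sequence" and data.get("type") == "PUB"
    ]
    orphan_chunks = [
        node for node, data in G.nodes(data=True)
        if data.get("type_node") == "text_chunk"
        # nx.all_neighbors covers predecessors and successors on directed graphs
        and not any(
            G.nodes[nb].get("type_node") == "sequence"
            for nb in nx.all_neighbors(G, node)
        )
    ]
    return pub_left, orphan_chunks


# Toy graph (hypothetical node IDs): one VIG and one PUB sequence, one chunk each
toy = nx.DiGraph()
toy.add_node("seq_vig", type_node="sequence", type="VIG")
toy.add_node("seq_pub", type_node="sequence", type="PUB")
toy.add_node("chunk_a", type_node="text_chunk")
toy.add_node("chunk_b", type_node="text_chunk")
for seq, chunk in [("seq_vig", "chunk_a"), ("seq_pub", "chunk_b")]:
    toy.add_edge(seq, chunk, relationship_type="contains_chunk")
    toy.add_edge(chunk, seq, relationship_type="contained_in_sequence")

# Simulate an incomplete pruning: the PUB sequence is removed but its chunk is kept
toy.remove_node("seq_pub")

pub_left, orphan_chunks = check_pruning(toy)
print("PUB sequences left:", pub_left)
print("Orphan chunks:", orphan_chunks)
```

On the real graph, both lists would be expected to be empty after the pruning cell has run, provided every text chunk was attached to a sequence in the first place.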


## 3. Pre-processing the extracted relevant legal concepts 

Before introducing the new nodes and edges into the graph, the DataFrame containing the relevant legal concepts undergoes a few pre-processing steps. The first two are sanity checks motivated by the fact that the base document graph was pruned (removal of PUB sequence nodes) after the automated extraction of relevant concepts. As a result, any references to nodes that no longer exist in the graph need to be removed.

Additionally, to limit the number of new nodes and prioritize those that could serve as bridges between different sub-graphs or "neighborhoods" within the base graph, only concepts cited in at least two text chunk nodes are retained. In other words, a concept node is added only if it would have edges to at least two text chunks. The filter counts the entries of the chunk_list column, so this assumes that each list contains no duplicate chunk IDs.

In [4]:
# Create filtered dataframe after pruning
print("Filtering concepts after graph pruning...")
print(f"Original concepts DataFrame shape: {concepts_df.shape}")
print(f"Original average chunks per concept: {concepts_df['chunk_list'].str.len().mean():.2f}")

# Sanity check 1: Create new DataFrame with filtered chunk lists
concepts_after_pruning_df = concepts_df.copy()
concepts_after_pruning_df['chunk_list'] = concepts_after_pruning_df['chunk_list'].apply(
    lambda chunks: [chunk for chunk in chunks if G.has_node(chunk)]
)

# Sanity check 2: Remove concepts with empty chunk lists 
concepts_after_pruning_df = concepts_after_pruning_df[
    concepts_after_pruning_df['chunk_list'].str.len() > 0
]

# Filter: Keep only concepts mentioned in at least 2 chunks (reduces the number of concepts to be added to the graph)
concepts_after_pruning_df = concepts_after_pruning_df[
    concepts_after_pruning_df['chunk_list'].str.len() >= 2
].reset_index(drop=True)

# Print summary statistics
print("\nFiltering results:")
print(f"New DataFrame shape: {concepts_after_pruning_df.shape}")
print(f"New average chunks per concept: {concepts_after_pruning_df['chunk_list'].str.len().mean():.2f}")
print("\nConcepts retention statistics:")
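# Note: this count is taken on the already filtered DataFrame (after the minimum-2-chunks
# step), so it reports the same number as the line below
print(f"- After removing non-existing chunks: {len(concepts_after_pruning_df[concepts_after_pruning_df['chunk_list'].str.len() > 0])} concepts")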
print(f"- After requiring minimum 2 chunks: {len(concepts_after_pruning_df)} concepts")
print(f"Final retention rate: {(len(concepts_after_pruning_df)/len(concepts_df))*100:.2f}% of original concepts")

# Distribution of mentions
print("\nDistribution of chunk mentions in final concepts:")
mentions_dist = concepts_after_pruning_df['chunk_list'].str.len().describe()
print(mentions_dist)

Filtering concepts after graph pruning...
Original concepts DataFrame shape: (4976, 4)
Original average chunks per concept: 1.41

Filtering results:
New DataFrame shape: (285, 4)
New average chunks per concept: 3.95

Concepts retention statistics:
- After removing non-existing chunks: 285 concepts
- After requiring minimum 2 chunks: 285 concepts
Final retention rate: 5.73% of original concepts

Distribution of chunk mentions in final concepts:
count    285.000000
mean       3.947368
std        4.732347
min        2.000000
25%        2.000000
50%        2.000000
75%        4.000000
max       42.000000
Name: chunk_list, dtype: float64
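
The filtering steps above can be illustrated on a toy DataFrame. This is a minimal sketch, not the notebook's code: `filter_concepts` and the sample rows are hypothetical, and a plain set stands in for `G.has_node`. Unlike the cell above, the sketch also de-duplicates each chunk list before counting, which is one way to guarantee that the two required chunks are distinct.

```python
import pandas as pd

# Made-up sample data using the same column names as the concepts DataFrame
toy_concepts = pd.DataFrame({
    "concept_name": ["concept A", "concept B", "concept C"],
    "chunk_list": [["c1", "c1"], ["c1", "c2", "c3"], ["c9"]],
})

# Stand-in for the pruned graph: only these chunk IDs still exist
existing_chunks = {"c1", "c2"}


def filter_concepts(df, node_exists, min_chunks=2):
    """Drop missing chunks, de-duplicate, and keep concepts with >= min_chunks chunks."""
    out = df.copy()
    out["chunk_list"] = out["chunk_list"].apply(
        # dict.fromkeys removes duplicates while preserving order
        lambda chunks: list(dict.fromkeys(c for c in chunks if node_exists(c)))
    )
    return out[out["chunk_list"].str.len() >= min_chunks].reset_index(drop=True)


filtered = filter_concepts(toy_concepts, existing_chunks.__contains__)
print(filtered)
```

Here "concept A" is dropped because its two mentions point to the same chunk, "concept C" is dropped because its only chunk no longer exists, and "concept B" is kept with the two surviving chunks.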


## 4. Adding concept nodes to the graph

In this section, the filtered DataFrame of relevant legal concepts is used to add new concept nodes to the document graph. Each row in the DataFrame corresponds to one concept node, which is added to the graph with its associated attributes, including the concept's name and its list of categories.

For each concept, bidirectional edges are then created between the concept node and all the text chunk nodes where the concept was detected. These edges are labeled according to their direction (chunk_cites_concept and concept_cited_in_chunk) and allow the newly added nodes to connect different areas of the base graph, potentially bridging separate sub-graphs.

A summary of the operation is printed upon completion, including the total number of concepts added, the total number of edges created, and the average number of edges per concept. Additionally, the distribution of edge types in the graph is reported, along with an updated summary of the graph’s structure after the integration of the concept nodes.

In [5]:
def add_concepts_to_graph(G, concepts_after_pruning_df):
    """Add concept nodes to graph and connect them to their chunks with bidirectional edges"""
    
    print("\nAdding concepts to graph...")
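    # print_graph_summary prints its own report and returns None, so the outer
    # print adds an "Initial state: None" line after the summary
    print("Initial state:", print_graph_summary(G, "Before adding concepts"))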
    
    concepts_added = 0
    total_edges_added = 0
    
    # Iterate over filtered concepts DataFrame
    for idx, row in tqdm(concepts_after_pruning_df.iterrows(), 
                        total=len(concepts_after_pruning_df), 
                        desc="Processing concepts"):
        # Create concept node ID
        concept_id = f"concept_{idx}"
        
        # Add concept node with all its attributes
        G.add_node(
            concept_id,
            type_node='concept',
            name=row['concept_name'],
            categories=row['category_list'],
        )
        
        # Add bidirectional edges to all chunks
        for chunk in row['chunk_list']:
            # Edge from chunk to concept
            G.add_edge(chunk, concept_id, relationship_type='chunk_cites_concept')
            # Edge from concept to chunk
            G.add_edge(concept_id, chunk, relationship_type='concept_cited_in_chunk')
        
        concepts_added += 1
        total_edges_added += 2 * len(row['chunk_list'])  # Multiply by 2 for bidirectional edges
    
    print("\nConcepts integration summary:")
    print(f"Concepts added: {concepts_added}")
    print(f"Total edges added: {total_edges_added} ({total_edges_added//2} pairs of bidirectional edges)")
    print(f"Average edges per concept: {total_edges_added/concepts_added:.2f} ({(total_edges_added/2)/concepts_added:.2f} pairs)")
    
    # Count edges by type
    edge_types = {}
    for _, _, edge_data in G.edges(data=True):
        edge_type = edge_data.get('relationship_type', 'unknown')
        edge_types[edge_type] = edge_types.get(edge_type, 0) + 1
    
    print("\nEdge types distribution:")
    for edge_type, count in edge_types.items():
        print(f"- {edge_type}: {count}")
    
    # As above: the summary is printed first, followed by "Final state: None"
    print("\nFinal state:", print_graph_summary(G, "After adding concepts"))
    
    return G

# Execute the function
G = add_concepts_to_graph(G, concepts_after_pruning_df)

# Save the graph
with open(os.path.join(RIZIV_data_path, 'embeddingless_base_hybrid_graph.pkl'), 'wb') as f:
    pickle.dump(G, f)


Adding concepts to graph...

Before adding concepts
--------------------------------------------------
Nodes of type 'act': 1146
Nodes of type 'article': 9244
Nodes of type 'sequence': 15857
Nodes of type 'text_chunk': 22127
Total edges: 106996
--------------------------------------------------
Initial state: None


Processing concepts: 100%|██████████| 285/285 [00:00<00:00, 2774.87it/s]


Concepts integration summary:
Concepts added: 285
Total edges added: 2250 (1125 pairs of bidirectional edges)
Average edges per concept: 7.89 (3.95 pairs)

Edge types distribution:
- contains_article: 9244
- belongs_to_act: 9244
- has_version: 15857
- version_of: 15857
- contains_chunk: 22127
- contained_in_sequence: 22127
- chunk_cites_concept: 1125
- followed_by: 6270
- preceded_by: 6270
- concept_cited_in_chunk: 1125

After adding concepts
--------------------------------------------------
Nodes of type 'act': 1146
Nodes of type 'article': 9244
Nodes of type 'concept': 285
Nodes of type 'sequence': 15857
Nodes of type 'text_chunk': 22127
Total edges: 109246
--------------------------------------------------

Final state: None
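
To show how concept nodes can bridge otherwise separate parts of the graph, the sketch below walks from one text chunk to its concepts and back out to the other chunks citing them. It is an illustration only: `chunks_sharing_concepts` is a hypothetical helper, the toy node IDs and concept names are made up, and the graph is assumed to be a NetworkX `DiGraph` carrying the `relationship_type` labels used in this section.

```python
import networkx as nx


def chunks_sharing_concepts(G, chunk_id):
    """Map every other chunk to the names of the concepts it shares with chunk_id."""
    related = {}
    for _, concept, data in G.out_edges(chunk_id, data=True):
        if data.get("relationship_type") != "chunk_cites_concept":
            continue
        for _, other, data2 in G.out_edges(concept, data=True):
            if (
                data2.get("relationship_type") == "concept_cited_in_chunk"
                and other != chunk_id
            ):
                related.setdefault(other, []).append(G.nodes[concept].get("name"))
    return related


# Toy graph: four chunks and two concepts (all IDs and names are placeholders)
toy = nx.DiGraph()
for chunk in ["chunk_1", "chunk_2", "chunk_3", "chunk_4"]:
    toy.add_node(chunk, type_node="text_chunk")
toy.add_node("concept_0", type_node="concept", name="concept A")
toy.add_node("concept_1", type_node="concept", name="concept B")

links = {"concept_0": ["chunk_1", "chunk_2"], "concept_1": ["chunk_1", "chunk_3"]}
for concept, chunks in links.items():
    for chunk in chunks:
        toy.add_edge(chunk, concept, relationship_type="chunk_cites_concept")
        toy.add_edge(concept, chunk, relationship_type="concept_cited_in_chunk")

related = chunks_sharing_concepts(toy, "chunk_1")
print(related)
```

In the toy graph, chunk_1 reaches chunk_2 and chunk_3 through one shared concept each, while chunk_4 cites no concept and stays unreachable this way.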



