# Building a Scalable Semantic Graph from ConceptNet with NetworkX
This notebook demonstrates how to construct a scalable semantic graph database from cleaned ConceptNet triples using NetworkX. It covers loading the data, building a graph, performing basic queries, and saving/loading the graph in a standard format.

In [1]:
# Import required libraries
import pandas as pd
import networkx as nx
import os
import pyarrow.parquet as pq

## 1. Load Cleaned ConceptNet Data
We use the preprocessed ConceptNet triples stored in Parquet format. Adjust the path if needed.

In [2]:
# Define the path to the cleaned ConceptNet data
conceptnet_path = os.path.join('..', 'Data', 'Input', 'conceptnet_en_processed_for_graph.parquet.gzip')

# Load the data
df = pd.read_parquet(conceptnet_path)
df.head()

Unnamed: 0,relation_type,start_concept,end_concept,edge_weight
0,Antonym,n,1,1.0
1,Antonym,n,24_hour_clock,1.0
2,Antonym,n,12_hour_clock,1.0
3,Antonym,n,3,1.0
4,Antonym,n,d.c,1.0


## 2. Build a NetworkX MultiDiGraph from the Edge List
Each ConceptNet triple becomes a directed edge carrying `relation` and `weight` attributes.

In [3]:
# Create a MultiDiGraph (allows multiple edges between nodes)
G = nx.MultiDiGraph()

# Add edges from the DataFrame
for _, row in df.iterrows():
    G.add_edge(
        row['start_concept'],
        row['end_concept'],
        relation=row['relation_type'],
        weight=row.get('edge_weight', 1.0)
    )

print(f"Graph has {G.number_of_nodes()} nodes and {G.number_of_edges()} edges.")

Graph has 754380 nodes and 1655522 edges.
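
`DataFrame.iterrows()` creates a pandas Series for every row, which tends to dominate build time on a table of this size. Below is a minimal sketch of the same edge loading with `itertuples()`, assuming the column names shown in the preview above. The `build_graph` helper and the three-row sample frame are illustrative and not part of the original notebook.

```python
import networkx as nx
import pandas as pd


def build_graph(df: pd.DataFrame) -> nx.MultiDiGraph:
    """Build a MultiDiGraph from ConceptNet-style triples (hypothetical helper)."""
    graph = nx.MultiDiGraph()
    # itertuples() yields lightweight namedtuples instead of one Series per row.
    for row in df.itertuples(index=False):
        graph.add_edge(
            row.start_concept,
            row.end_concept,
            relation=row.relation_type,
            weight=float(row.edge_weight),
        )
    return graph


# Tiny illustrative sample, not real ConceptNet rows.
sample_df = pd.DataFrame(
    {
        "relation_type": ["IsA", "RelatedTo", "CapableOf"],
        "start_concept": ["dog", "dog", "dog"],
        "end_concept": ["animal", "animal", "bark"],
        "edge_weight": [2.0, 1.0, 1.5],
    }
)
sample_graph = build_graph(sample_df)
print(sample_graph.number_of_nodes(), sample_graph.number_of_edges())
```

Because the graph is a `MultiDiGraph`, the two `dog` to `animal` rows are kept as parallel edges rather than overwriting each other.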


## 3. Basic Graph Database Operations
Query neighbors, edge attributes, and extract subgraphs.

In [4]:
# Example: Query neighbors of a concept
concept = 'dog'
neighbors = list(G.neighbors(concept))
print(f"Neighbors of '{concept}':", neighbors)

# Example: Get all edges and their attributes for a concept
edges = G.out_edges(concept, data=True)
for u, v, attr in edges:
    print(f"{u} --[{attr['relation']}]--> {v} (weight={attr['weight']})")

Neighbors of 'dog': ['again', 'car', 'cat', 'cat_again', 'cat_maybe', 'computer', 'maybe', 'woof', 'back_yard', 'backyard', 'bed', 'couch', 'desk', 'dog_house', "dog_owner's_home", 'doghouse', 'dogpound', 'farmyard', 'front_door', 'ground', 'house', "it's_kennel", 'kennel', 'leash', "neighbor's_house", 'outside', 'park', 'pet_shop', 'pet_store', 'petshop', 'porch', 'pound', 'relatives_house', 'rug', 'table', 'act_playful', 'answer_to_master', 'appear_tired', 'bark', 'bark_at_strangers', 'barks_when_hears_steps', 'become_pet', 'belong_human', 'bite', 'breathe', 'breed_several_puppies', 'bring_master_bone', 'bury_bone_in_ground', 'calm_mind', 'cause_accident', 'cause_rabbit_to_run_away', 'chase_ball', 'chasing_ball', 'circle_sheep', 'circle_tree', 'come_home', 'come_to_master', 'come_when_call_name', 'company_man', 'corner_cat', 'course_hare', 'cover_hole_in_ground', 'cross_street', 'dig_up_bone', 'dig_up_bones', 'dream_caught_rabbit', 'drink_water', "eat_cat_food_but_probably_shouldn't"

In [7]:
# Example: Extract a subgraph of a concept and its immediate neighbors
sub_nodes = [concept] + neighbors
subgraph = G.subgraph(sub_nodes)
print(f"Subgraph has {subgraph.number_of_nodes()} nodes and {subgraph.number_of_edges()} edges.")

Subgraph has 501 nodes and 2052 edges.
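
On a directed graph, `G.neighbors(n)` returns successors only, so the listing and subgraph above cover the outgoing side of `dog`. The sketch below uses a toy graph (the edges are made up for illustration) to show how incoming edges and a one-hop neighbourhood in both directions can be retrieved.

```python
import networkx as nx

toy = nx.MultiDiGraph()
toy.add_edge("dog", "bark", relation="CapableOf", weight=1.0)
toy.add_edge("dog", "animal", relation="IsA", weight=2.0)
toy.add_edge("puppy", "dog", relation="IsA", weight=1.0)

successors = sorted(toy.successors("dog"))      # same nodes as toy.neighbors("dog")
predecessors = sorted(toy.predecessors("dog"))  # nodes with an edge pointing at "dog"

# in_edges mirrors out_edges for the incoming direction.
incoming = [(u, data["relation"], v) for u, v, data in toy.in_edges("dog", data=True)]

# ego_graph with undirected=True collects the one-hop neighbourhood in both directions.
ego = nx.ego_graph(toy, "dog", radius=1, undirected=True)

print(successors, predecessors, incoming, sorted(ego.nodes))
```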


## 4. Save and Load the Graph in GraphML Format
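GraphML is a standard, portable, XML-based format for graph data. NetworkX can only write node and edge attributes that are simple scalars (strings, numbers, booleans), so keep attribute values to those types.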

In [8]:
# Save the graph
output_path = os.path.join('..', 'Data', 'Output', 'conceptnet_graph.graphml')
nx.write_graphml(G, output_path)
print(f"Graph saved to {output_path}")

# Load the graph back (demonstration)
G_loaded = nx.read_graphml(output_path)
print(f"Loaded graph has {G_loaded.number_of_nodes()} nodes and {G_loaded.number_of_edges()} edges.")

Graph saved to ..\Data\Output\conceptnet_graph.graphml
Loaded graph has 754380 nodes and 1655522 edges.
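
The matching node and edge counts do not show whether parallel edges and attribute types survive the round trip. The sketch below checks this on a two-edge toy graph written to a temporary file. The graph contents and file name are illustrative assumptions.

```python
import os
import tempfile

import networkx as nx

small = nx.MultiDiGraph()
small.add_edge("dog", "animal", relation="IsA", weight=2.0)
small.add_edge("dog", "animal", relation="RelatedTo", weight=1.0)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "small.graphml")
    nx.write_graphml(small, path)
    restored = nx.read_graphml(path)

# Both parallel edges come back, with string relations and float weights.
relations = sorted(data["relation"] for _, _, data in restored.edges(data=True))
weights = sorted(data["weight"] for _, _, data in restored.edges(data=True))
print(relations, weights)
```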


## 5. Next Steps and Advanced Queries
- Expand to larger datasets by chunking or streaming edge additions.
- Integrate with graph databases like Neo4j for even larger-scale applications.
- Perform advanced queries (e.g., shortest paths, community detection, etc.).
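
As a rough illustration of the last bullet, the sketch below finds a shortest path and labels each hop with a relation. The toy edges are invented. The path is unweighted, on the assumption that ConceptNet weights express confidence rather than distance.

```python
import networkx as nx

kg = nx.MultiDiGraph()
kg.add_edge("puppy", "dog", relation="IsA", weight=1.0)
kg.add_edge("dog", "mammal", relation="IsA", weight=2.0)
kg.add_edge("mammal", "animal", relation="IsA", weight=2.0)
kg.add_edge("dog", "bark", relation="CapableOf", weight=1.0)

# Unweighted shortest path: fewest hops, following edge direction.
path = nx.shortest_path(kg, source="puppy", target="animal")

# Label each hop with the relation of one edge between the two nodes.
hops = []
for u, v in zip(path, path[1:]):
    first_edge = next(iter(kg[u][v].values()))
    hops.append((u, first_edge["relation"], v))

print(path)
print(hops)
```

`nx.shortest_path` raises `NetworkXNoPath` when the target is unreachable, so real queries over the full graph would normally wrap the call in a `try` block.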

## 6. Semantic Enrichment: Adding Custom Logic
Now that the graph is built, you can add your own semantic logic. For example, you might:
- Add new semantic relationships based on custom rules or external data.
- Reweight or filter edges based on your own criteria.
- Annotate nodes or edges with additional semantic metadata.
- Implement reasoning or inference over the graph structure.

Below is an example of how to add a new semantic property to edges based on a custom rule.

In [9]:
# Example: Add a semantic property to edges based on a custom rule
# Here, we flag edges as 'is_animal_relation' if either node is an animal (simple demo)
animal_concepts = {'dog', 'cat', 'horse', 'cow', 'sheep'}  # Replace with your own logic or list

for u, v, k, data in G.edges(keys=True, data=True):
    data['is_animal_relation'] = u in animal_concepts or v in animal_concepts

# Show a few example edges with the new property
count = 0
for u, v, data in G.edges(data=True):
    if 'is_animal_relation' in data and data['is_animal_relation']:
        print(f"{u} --[{data['relation']}]--> {v} | is_animal_relation: {data['is_animal_relation']}")
        count += 1
    if count >= 5:
        break

n --[Antonym]--> dog | is_animal_relation: True
n --[FormOf]--> dog | is_animal_relation: True
n --[HasContext]--> dog | is_animal_relation: True
n --[IsA]--> dog | is_animal_relation: True
n --[RelatedTo]--> dog | is_animal_relation: True


You can now build on this pattern to add more advanced semantic logic, such as inference, clustering, or integration with external knowledge sources. Just add new cells below to continue your workflow.
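
One way to approach the inference idea mentioned above is a single step of transitive `IsA` closure. This is a sketch under stated assumptions, not the notebook's method: the `infer_transitive_isa` helper, the `InferredIsA` relation name, the `inferred` flag, and the fixed weight of 0.5 are all hypothetical choices.

```python
import networkx as nx


def infer_transitive_isa(graph: nx.MultiDiGraph) -> int:
    """Add one-step transitive IsA edges and return how many were added."""
    # Collect existing IsA pairs first so the graph is not mutated while iterating.
    isa_pairs = {
        (u, v) for u, v, data in graph.edges(data=True) if data.get("relation") == "IsA"
    }
    parents = {}
    for child, parent in isa_pairs:
        parents.setdefault(child, set()).add(parent)

    inferred_pairs = set()
    for child, parent in sorted(isa_pairs):
        for grandparent in sorted(parents.get(parent, ())):
            pair = (child, grandparent)
            # Skip self-loops, facts already present, and duplicates.
            if grandparent == child or pair in isa_pairs or pair in inferred_pairs:
                continue
            graph.add_edge(child, grandparent, relation="InferredIsA", weight=0.5, inferred=True)
            inferred_pairs.add(pair)
    return len(inferred_pairs)


demo = nx.MultiDiGraph()
demo.add_edge("puppy", "dog", relation="IsA", weight=1.0)
demo.add_edge("dog", "animal", relation="IsA", weight=2.0)
demo.add_edge("dog", "bark", relation="CapableOf", weight=1.0)

added_count = infer_transitive_isa(demo)
inferred_relations = [d["relation"] for d in demo["puppy"]["animal"].values()]
print(added_count, inferred_relations)
```

Tagging inferred edges with their own relation name keeps them easy to filter out again if the rule turns out to be too permissive.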

# 1. NetworkX Hyper-Training Function

In [None]:
def networkx_hyper_train(data_path, iterations=3, save_prefix="nx_hyper_trained"):
    """
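    Iteratively rebuild a filtered knowledge graph with NetworkX.

    Each iteration keeps the most frequent concepts, drops edges below the current
    weight threshold, rebuilds the graph from scratch, tags edges with
    iteration-specific attributes, and saves a GraphML snapshot. The threshold and
    the concept limit both increase between iterations.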
    """
    import pandas as pd
    import networkx as nx
    import os
    import time
    from tqdm.auto import tqdm
    
    print(f"🚀 Starting NetworkX Hyper-Training ({iterations} iterations)")
    
    # Load the data once
    full_df = pd.read_parquet(data_path)
    print(f"📊 Loaded {len(full_df):,} triples")
    
    # Track metrics
    metrics = []
    
    # Initial params
    current_quality_threshold = 0.5  # Start higher with NetworkX (faster)
    current_concept_limit = 10000
    
    # Create output directory
    os.makedirs(os.path.join('..', 'Data', 'Output'), exist_ok=True)
    
    for iteration in range(iterations):
        start_time = time.time()
        print(f"\n{'='*60}")
        print(f"📌 ITERATION {iteration+1}/{iterations}")
        print(f"{'='*60}")
        
        # 1. Filter concepts by popularity (memory efficiency)
        concept_counts = pd.concat([
            full_df['start_concept'].value_counts(),
            full_df['end_concept'].value_counts()
        ]).groupby(level=0).sum().sort_values(ascending=False)
        
        top_concepts = set(concept_counts.head(current_concept_limit).index)
        filtered_df = full_df[
            full_df['start_concept'].isin(top_concepts) & 
            full_df['end_concept'].isin(top_concepts)
        ]
        
        # 2. Apply quality threshold (weight-based)
        quality_df = filtered_df[filtered_df['edge_weight'] >= current_quality_threshold]
        
        print(f"📊 Using {len(top_concepts):,} concepts")
        print(f"📊 Quality threshold: {current_quality_threshold:.2f}")
        print(f"📊 Filtered to {len(quality_df):,} high-quality triples")
        
        # 3. Build graph (much faster with NetworkX)
        print(f"🔄 Building graph...")
        G = nx.MultiDiGraph()
        
        # Add edges with progress bar
        for _, row in tqdm(quality_df.iterrows(), total=len(quality_df), desc="Adding edges"):
            G.add_edge(
                row['start_concept'],
                row['end_concept'],
                relation=row['relation_type'],
                weight=row['edge_weight']
            )
        
        # 4. Apply semantic enrichment for this iteration
        print(f"✨ Adding semantic enrichment (iteration-specific)...")
        
        # Simple example: mark high-confidence edges
        high_confidence_threshold = 0.7 + (iteration * 0.05)  # Increases each iteration
        edge_count = 0
        
        for u, v, k, data in G.edges(keys=True, data=True):
            # Add iteration-specific attributes
            data['iteration_added'] = iteration + 1
            data['high_confidence'] = data['weight'] >= high_confidence_threshold
            edge_count += 1
            
            # Add custom logic (just examples - modify for your needs)
            if iteration == 1:
                # Second iteration: add transitivity scores
                data['transitive_potential'] = len(list(G.neighbors(v))) / 100
            elif iteration == 2:
                # Third iteration: add centrality-based importance
                try:
                    data['centrality_score'] = G.degree(u) * G.degree(v) / 1000
                except Exception:
                    data['centrality_score'] = 0
        
        # 5. Save this iteration
        output_path = os.path.join('..', 'Data', 'Output', f"{save_prefix}_iter{iteration+1}.graphml")
        nx.write_graphml(G, output_path)
        
        # 6. Calculate and store metrics
        iter_time = time.time() - start_time
        metrics.append({
            'iteration': iteration + 1,
            'quality_threshold': current_quality_threshold,
            'concept_limit': current_concept_limit,
            'nodes': G.number_of_nodes(),
            'edges': G.number_of_edges(),
            'processing_time': iter_time,
            'edges_per_second': G.number_of_edges() / max(1, iter_time)
        })
        
        print(f"📊 Graph has {G.number_of_nodes():,} nodes and {G.number_of_edges():,} edges")
        print(f"⏱️  Processing time: {iter_time:.1f} seconds")
        print(f"💾 Saved to: {output_path}")
        
        # 7. Update parameters for next iteration
        current_quality_threshold = min(0.9, current_quality_threshold + 0.1)
        current_concept_limit = min(50000, current_concept_limit + 10000)  # Increase concept count
    
    # Final summary
    print(f"\n{'='*60}")
    print(f"🎉 HYPER-TRAINING COMPLETE")
    print(f"{'='*60}")
    
    for m in metrics:
        print(f"Iteration {m['iteration']}: {m['nodes']:,} nodes, "
              f"{m['edges']:,} edges, {m['processing_time']:.1f}s, "
              f"threshold={m['quality_threshold']:.2f}")
    
    final_path = os.path.join('..', 'Data', 'Output', f"{save_prefix}_final.graphml")
    nx.write_graphml(G, final_path)
    print(f"💾 Final model saved to: {final_path}")
    
    return G, metrics, final_path

# 2. Call The Function To Start Training

In [None]:
# # Call the NetworkX hyper-training function
# G_trained, training_metrics, final_model_path = networkx_hyper_train(
#     data_path=conceptnet_path,  # Path to your conceptnet data
#     iterations=3,               # Number of training iterations
#     save_prefix="nx_semantic"   # Prefix for saved files
# )

# # The result is already a NetworkX graph ready for querying!
# print(f"\n🔍 Testing final model...")
# concept = 'dog'
# for u, v, data in G_trained.out_edges(concept, data=True):
#     if data.get('high_confidence', False):
#         print(f"{u} --[{data['relation']}]--> {v} (weight={data['weight']}, added in iteration {data['iteration_added']})")

## 7. Memory-Efficient Hyper-Training for Scaling
To achieve our goal of eventually including the full ConceptNet dataset, we need a more memory-efficient approach that:
1. Processes data in chunks rather than loading everything at once
2. Uses more efficient storage formats
3. Implements checkpointing for recovery
4. Has configurable parameters for gradual scaling

In [11]:
# Chunked Parquet reading with pyarrow
# The scalable training function uses a custom `parquet_chunk_reader` utility built on `pyarrow`. This avoids the `TypeError` seen with pandas' `read_parquet()`, which does not support chunked reading, and keeps memory use bounded on large datasets.

# Utility: Chunked Parquet Reader using pyarrow
def parquet_chunk_reader(parquet_path, chunk_size=100000, columns=None):
    """
    Efficiently read a Parquet file in chunks using pyarrow, yielding pandas DataFrames.
    """
    import pyarrow.parquet as pq
    import pandas as pd

    parquet_file = pq.ParquetFile(parquet_path)
    num_row_groups = parquet_file.num_row_groups

    # Read one row group at a time, but yield DataFrames in chunks of chunk_size.
    # Peak memory is bounded by the largest row group plus any carried-over rows.
    batch = []
    batch_rows = 0
    for rg in range(num_row_groups):
        table = parquet_file.read_row_group(rg, columns=columns)
        df = table.to_pandas()
        batch.append(df)
        batch_rows += len(df)
        # If enough rows for a chunk, yield
        while batch_rows >= chunk_size:
            concat_df = pd.concat(batch)
            yield concat_df.iloc[:chunk_size]
            # Prepare for next batch
            if batch_rows > chunk_size:
                # Keep the remainder for next chunk
                batch = [concat_df.iloc[chunk_size:]]
                batch_rows = len(batch[0])
            else:
                batch = []
                batch_rows = 0
    # Yield any remaining rows
    if batch_rows > 0:
        yield pd.concat(batch)

# Refactor: networkx_scalable_hyper_train now uses parquet_chunk_reader for chunked reading
def networkx_scalable_hyper_train(
    data_path,
    iterations=3,
    save_prefix="nx_scalable",
    config=None
):
    """
    Memory-efficient scalable hyper-training for knowledge graphs.
    Now saves a graph snapshot after every chunk for smooth animation.
    """
    import pandas as pd
    import networkx as nx
    import os
    import time
    import pickle
    import json
    import gc
    from datetime import datetime
    from tqdm.auto import tqdm

    # Default configuration
    default_config = {
        # Data filtering
        'initial_quality_threshold': 0.3,
        'quality_threshold_step': 0.1,
        'max_quality_threshold': 0.9,
        # Concept limits
        'initial_concept_limit': 20000,
        'concept_limit_step': 20000,
        'max_concept_limit': 100000,
        # Memory optimization
        'chunk_size': 100000,
        'gc_frequency': 2,
        # Storage options
        'save_format': 'graphml',          # Save as 'graphml' (best for web), can change to 'pickle'
        'compression': False,              # Not used for graphml
        # Checkpointing
        'enable_checkpoints': True,
        'resume_from_checkpoint': True,
        # Enrichment options
        'add_high_confidence_flag': True,
        'add_transitivity': True,
        'add_centrality': True,
        # Animation snapshots
        'save_snapshots_per_chunk': True,   # Save a GraphML snapshot after each chunk
        # Performance profiling
        'profile': False
    }

    # Update with user config
    if config is None:
        config = {}
    for key, value in config.items():
        if key in default_config:
            default_config[key] = value
    config = default_config

    print(f"🚀 Starting Scalable Hyper-Training ({iterations} iterations)")

    # Create output directory
    output_dir = os.path.join('..', 'Data', 'Output')
    os.makedirs(output_dir, exist_ok=True)

    # Initialize metrics tracking
    metrics = []

    # Current parameters
    current_quality_threshold = config['initial_quality_threshold']
    current_concept_limit = config['initial_concept_limit']

    # Try to resume from checkpoint if enabled
    checkpoint_path = os.path.join(output_dir, f"{save_prefix}_checkpoint.pkl")
    if config['enable_checkpoints'] and config['resume_from_checkpoint'] and os.path.exists(checkpoint_path):
        print(f"📂 Resuming from checkpoint: {checkpoint_path}")
        try:
            with open(checkpoint_path, 'rb') as f:
                checkpoint = pickle.load(f)
            metrics = checkpoint['metrics']
            current_quality_threshold = checkpoint['next_quality_threshold']
            current_concept_limit = checkpoint['next_concept_limit']
            last_iteration = checkpoint['last_completed_iteration']
            print(f"✅ Restored from checkpoint (completed {last_iteration+1}/{iterations} iterations)")
            start_iteration = last_iteration + 1
        except Exception as e:
            print(f"❌ Failed to load checkpoint: {e}")
            start_iteration = 0
    else:
        start_iteration = 0

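    # Nothing to rebuild if the checkpoint already covers every requested iteration;
    # without this guard the final save below would fail because G is never created.
    if start_iteration >= iterations:
        print("✅ All requested iterations are already complete; nothing to do.")
        return None, metrics, None

    # Process only needed iterations
    for iteration in range(start_iteration, iterations):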
        iteration_start_time = time.time()
        print(f"\n{'='*60}")
        print(f"📌 ITERATION {iteration+1}/{iterations}")
        print(f"{'='*60}")

        print(f"📊 Quality threshold: {current_quality_threshold:.2f}")
        print(f"📊 Concept limit: {current_concept_limit:,}")

        # Step 1: Initial scan - count concepts to identify top ones
        print(f"🔍 Scanning data to identify top concepts...")
        concept_counter = None
        for i, chunk in enumerate(parquet_chunk_reader(data_path, chunk_size=config['chunk_size'])):
            chunk_counts = pd.concat([
                chunk['start_concept'].value_counts(),
                chunk['end_concept'].value_counts()
            ]).groupby(level=0).sum()
            if concept_counter is None:
                concept_counter = chunk_counts
            else:
                concept_counter = concept_counter.add(chunk_counts, fill_value=0)
            if i % config['gc_frequency'] == 0:
                gc.collect()

        # Get top concepts
        top_concepts = set(concept_counter.nlargest(current_concept_limit).index)
        print(f"✅ Identified {len(top_concepts):,} most frequent concepts")

        # Step 2: Build graph in chunks
        print(f"🔄 Building graph with quality threshold {current_quality_threshold:.2f}...")

        G = nx.MultiDiGraph()
        total_edges_processed = 0
        total_edges_added = 0

        chunk_reader = parquet_chunk_reader(data_path, chunk_size=config['chunk_size'])
        for i, chunk in enumerate(tqdm(chunk_reader, desc="Building graph")):
            filtered_chunk = chunk[
                chunk['start_concept'].isin(top_concepts) &
                chunk['end_concept'].isin(top_concepts)
            ]
            quality_chunk = filtered_chunk[filtered_chunk['edge_weight'] >= current_quality_threshold]

            for _, row in tqdm(quality_chunk.iterrows(), total=len(quality_chunk), desc=f"Adding edges (chunk {i+1})", leave=False):
                G.add_edge(
                    row['start_concept'],
                    row['end_concept'],
                    relation=row['relation_type'],
                    weight=row['edge_weight'],
                    iteration_added=iteration + 1
                )
            total_edges_processed += len(filtered_chunk)
            total_edges_added += len(quality_chunk)
            if i % config['gc_frequency'] == 0:
                gc.collect()

            # Save a snapshot after every chunk (each write serializes the whole graph, so the cost grows with the graph)
            if config.get('save_snapshots_per_chunk', False):
                chunk_snap_path = os.path.join(
                    output_dir, f"{save_prefix}_iter{iteration+1}_chunk{i+1}.graphml")
                nx.write_graphml(G, chunk_snap_path)
                print(f"💾 Saved chunk snapshot: {chunk_snap_path}")

        print(f"✅ Processed {total_edges_processed:,} edges, added {total_edges_added:,} to graph")
        print(f"📊 Graph has {G.number_of_nodes():,} nodes and {G.number_of_edges():,} edges")
        
        # Step 3: Apply semantic enrichment
        if config['add_high_confidence_flag'] or config['add_transitivity'] or config['add_centrality']:
            print(f"✨ Adding semantic enrichment...")
            
            enrichment_start = time.time()
            
            # High confidence threshold increases with each iteration
            high_confidence_threshold = 0.7 + (iteration * 0.05)
            
            if config['add_high_confidence_flag']:
                # Mark high-confidence edges
                for u, v, k, data in tqdm(G.edges(keys=True, data=True), 
                                         desc="Adding high confidence flags",
                                         total=G.number_of_edges()):
                    data['high_confidence'] = data['weight'] >= high_confidence_threshold
            
            # Add iteration-specific enrichment
            if iteration == 1 and config['add_transitivity']:
                # Second iteration: add a transitivity score to every edge whose
                # target node has outgoing edges; other edges get no score
                print("🔄 Adding transitivity scores...")
                for u, v, k, data in tqdm(G.edges(keys=True, data=True),
                                         desc="Adding transitivity scores",
                                         total=G.number_of_edges()):
                    try:
                        # Score is the target's out-degree scaled by 100 and capped at 1.0
                        if G.out_degree(v) > 0:
                            data['transitive_potential'] = min(G.out_degree(v) / 100, 1.0)
                    except Exception:
                        data['transitive_potential'] = 0
            
            elif iteration == 2 and config['add_centrality']:
                # Third iteration: add simplified centrality scores
                # Full betweenness centrality is too expensive, so we use degree as proxy
                print("🔄 Adding centrality scores...")
                for u, v, k, data in tqdm(G.edges(keys=True, data=True),
                                         desc="Adding centrality scores",
                                         total=G.number_of_edges()):
                    try:
                        data['centrality_score'] = min(
                            (G.degree(u) + G.degree(v)) / 1000, 
                            1.0
                        )
                    except Exception:
                        data['centrality_score'] = 0
            
            print(f"✅ Enrichment completed in {time.time() - enrichment_start:.1f}s")
        
        # Step 4: Save this iteration
        save_start_time = time.time()
        
        if config['save_format'] == 'graphml':
            # Save as GraphML (standard but larger files)
            output_path = os.path.join(output_dir, f"{save_prefix}_iter{iteration+1}.graphml")
            nx.write_graphml(G, output_path)
        else:
            # Save as pickle (faster, smaller)
            output_path = os.path.join(output_dir, f"{save_prefix}_iter{iteration+1}.pkl")
            with open(output_path, 'wb') as f:
                pickle.dump(G, f, protocol=4)  # Protocol 4 for better performance
        
        print(f"💾 Saved to: {output_path} in {time.time() - save_start_time:.1f}s")
        
        # Step 5: Calculate and store metrics
        iter_time = time.time() - iteration_start_time
        iter_metrics = {
            'iteration': iteration + 1,
            'quality_threshold': current_quality_threshold,
            'concept_limit': current_concept_limit,
            'nodes': G.number_of_nodes(),
            'edges': G.number_of_edges(),
            'processing_time': iter_time,
            'edges_per_second': total_edges_processed / max(1, iter_time),
            'added_edges_ratio': total_edges_added / max(1, total_edges_processed),
            'timestamp': datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        }
        metrics.append(iter_metrics)
        
        print(f"📊 Iteration stats:")
        print(f"   Nodes: {iter_metrics['nodes']:,}")
        print(f"   Edges: {iter_metrics['edges']:,}")
        print(f"   Processing time: {iter_metrics['processing_time']:.1f}s")
        print(f"   Edges/second: {iter_metrics['edges_per_second']:.1f}")
        
        # Step 6: Save checkpoint if enabled
        if config['enable_checkpoints']:
            # Calculate next thresholds
            next_quality_threshold = min(
                config['max_quality_threshold'],
                current_quality_threshold + config['quality_threshold_step']
            )
            next_concept_limit = min(
                config['max_concept_limit'],
                current_concept_limit + config['concept_limit_step']
            )
            
            # Save checkpoint
            checkpoint = {
                'metrics': metrics,
                'last_completed_iteration': iteration,
                'next_quality_threshold': next_quality_threshold,
                'next_concept_limit': next_concept_limit,
                'timestamp': datetime.now().strftime("%Y-%m-%d %H:%M:%S")
            }
            
            with open(checkpoint_path, 'wb') as f:
                pickle.dump(checkpoint, f)
                
            print(f"✅ Checkpoint saved to: {checkpoint_path}")
        
        # Step 7: Update parameters for next iteration
        current_quality_threshold = min(
            config['max_quality_threshold'],
            current_quality_threshold + config['quality_threshold_step']
        )
        current_concept_limit = min(
            config['max_concept_limit'],
            current_concept_limit + config['concept_limit_step']
        )
        
        # Explicitly run garbage collection between iterations
        gc.collect()
    
    # Final summary
    print(f"\n{'='*60}")
    print(f"🎉 HYPER-TRAINING COMPLETE")
    print(f"{'='*60}")
    
    for m in metrics:
        print(f"Iteration {m['iteration']}: {m['nodes']:,} nodes, "
              f"{m['edges']:,} edges, {m['processing_time']:.1f}s, "
              f"threshold={m['quality_threshold']:.2f}")
    
    # Save final version (use the same format as iterations for consistency)
    if config['save_format'] == 'graphml':
        final_path = os.path.join(output_dir, f"{save_prefix}_final.graphml")
        nx.write_graphml(G, final_path)
    else:
        final_path = os.path.join(output_dir, f"{save_prefix}_final.pkl")
        with open(final_path, 'wb') as f:
            pickle.dump(G, f, protocol=4)
    
    print(f"💾 Final model saved to: {final_path}")
    
    # Save metrics as JSON for analysis
    metrics_path = os.path.join(output_dir, f"{save_prefix}_metrics.json")
    with open(metrics_path, 'w') as f:
        json.dump(metrics, f, indent=2)
    
    print(f"📊 Training metrics saved to: {metrics_path}")
    
    return G, metrics, final_path

# 3. Run the Scalable Hyper-Training

This version handles larger datasets by reading the Parquet file in chunks and writing a checkpoint after each iteration. An interrupted run can resume from the last completed iteration, which makes it practical to train on the full ConceptNet dataset in stages.

In [13]:
# Configure the hyper-training parameters
scalable_config = {
    # Data filtering - start with lower threshold for broader coverage
    'initial_quality_threshold': 0.3,
    'quality_threshold_step': 0.1,
    'max_quality_threshold': 0.6,   # Lower max threshold to keep more relationships

    # Concept limits - gradually scale up
    'initial_concept_limit': 15000,
    'concept_limit_step': 15000, 
    'max_concept_limit': 100000,

    # Memory optimization
    'chunk_size': 100000,
    'gc_frequency': 3,

    # Storage options
    'save_format': 'graphml',               # 'graphml' (portable) or 'pickle' (faster, smaller)
    'compression': False,                   # Not used for graphml

    # Animation snapshots
    'save_snapshots_per_chunk': True,       # Write a GraphML snapshot after every chunk
    
    # Checkpointing
    'enable_checkpoints': True,
    'resume_from_checkpoint': False,
}
# Run the scalable training (this can take a long time and writes one snapshot per chunk)
G_scalable, scalable_metrics, scalable_path = networkx_scalable_hyper_train(
    data_path=conceptnet_path,
    iterations=4,                # More iterations for gradual refinement
    save_prefix="nx_scaled_kg",  # Different prefix to avoid overwriting
    config=scalable_config
)

# Test the final model
print(f"\n🔍 Testing scalable model...")
test_concept = 'computer'  # Try a different concept
for u, v, data in G_scalable.out_edges(test_concept, data=True):
    if data.get('high_confidence', False):
        print(f"{u} --[{data['relation']}]--> {v} (weight={data['weight']}, added in iteration {data['iteration_added']})")


🚀 Starting Scalable Hyper-Training (4 iterations)

📌 ITERATION 1/4
📊 Quality threshold: 0.30
📊 Concept limit: 15,000
🔍 Scanning data to identify top concepts...
✅ Identified 15,000 most frequent concepts
🔄 Building graph with quality threshold 0.30...


Building graph: 1it [00:02,  2.41s/it]

💾 Saved chunk snapshot: ..\Data\Output\nx_scaled_kg_iter1_chunk1.graphml


Building graph: 2it [00:02,  1.24s/it]

💾 Saved chunk snapshot: ..\Data\Output\nx_scaled_kg_iter1_chunk2.graphml


Building graph: 3it [00:03,  1.14it/s]

💾 Saved chunk snapshot: ..\Data\Output\nx_scaled_kg_iter1_chunk3.graphml


Building graph: 4it [00:05,  1.26s/it]

💾 Saved chunk snapshot: ..\Data\Output\nx_scaled_kg_iter1_chunk4.graphml


Building graph: 5it [00:05,  1.02s/it]

💾 Saved chunk snapshot: ..\Data\Output\nx_scaled_kg_iter1_chunk5.graphml


Building graph: 6it [00:06,  1.10it/s]

💾 Saved chunk snapshot: ..\Data\Output\nx_scaled_kg_iter1_chunk6.graphml


Building graph: 7it [00:08,  1.31s/it]

💾 Saved chunk snapshot: ..\Data\Output\nx_scaled_kg_iter1_chunk7.graphml


Building graph: 8it [00:09,  1.22s/it]

💾 Saved chunk snapshot: ..\Data\Output\nx_scaled_kg_iter1_chunk8.graphml


Building graph: 9it [00:12,  1.73s/it]

💾 Saved chunk snapshot: ..\Data\Output\nx_scaled_kg_iter1_chunk9.graphml


Building graph: 10it [00:15,  2.18s/it]

💾 Saved chunk snapshot: ..\Data\Output\nx_scaled_kg_iter1_chunk10.graphml


Building graph: 11it [00:19,  2.64s/it]

💾 Saved chunk snapshot: ..\Data\Output\nx_scaled_kg_iter1_chunk11.graphml


Building graph: 12it [00:23,  2.96s/it]

💾 Saved chunk snapshot: ..\Data\Output\nx_scaled_kg_iter1_chunk12.graphml


Building graph: 13it [00:26,  3.21s/it]

💾 Saved chunk snapshot: ..\Data\Output\nx_scaled_kg_iter1_chunk13.graphml


Building graph: 14it [00:32,  3.89s/it]

💾 Saved chunk snapshot: ..\Data\Output\nx_scaled_kg_iter1_chunk14.graphml


Building graph: 15it [00:35,  3.55s/it]

💾 Saved chunk snapshot: ..\Data\Output\nx_scaled_kg_iter1_chunk15.graphml


Building graph: 15it [00:39,  2.66s/it]


KeyboardInterrupt: 

In [14]:
# upgrade jupyter and ipywidgets
%pip install --upgrade jupyter ipywidgets

## 8. Advanced Knowledge Graph Exploration Tools (Scalable Version)

The following tools are designed for advanced semantic exploration and analysis of large, semantically-enriched knowledge graphs produced by the scalable hyper-training workflow. You can use these tools to query, analyze, and visualize the graph, extract semantic paths, and perform advanced reasoning tasks.

**How to use:**
- Load your trained scalable graph (typically saved as a pickle file for efficiency).
- Initialize the `KnowledgeGraphExplorer` with your graph.
- Use the provided methods to explore concepts, find paths, analyze statistics, and more.
- For very large graphs, ensure your environment has sufficient memory.
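
As a minimal sketch of the first step, a NetworkX graph can be round-tripped through a pickle file as shown below. The toy graph and the `toy_graph.pkl` file name are illustrative assumptions, not the notebook's actual output.

```python
import os
import pickle
import tempfile

import networkx as nx

# Toy stand-in for the trained graph (hypothetical data)
G = nx.MultiDiGraph()
G.add_edge("dog", "animal", relation="IsA", weight=2.0)

# Hypothetical file name; the real output path may differ
pkl_path = os.path.join(tempfile.mkdtemp(), "toy_graph.pkl")

# Save the graph object as-is, including edge keys and attributes
with open(pkl_path, "wb") as f:
    pickle.dump(G, f, protocol=pickle.HIGHEST_PROTOCOL)

# Load it back; only unpickle files from sources you trust
with open(pkl_path, "rb") as f:
    G_loaded = pickle.load(f)

print(type(G_loaded).__name__, G_loaded.number_of_edges())
```

Pickle preserves the graph class and edge keys exactly, which is one reason it is convenient for large graphs, but the file is only readable from Python.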

In [None]:
class KnowledgeGraphExplorer:
    """
    Tools for exploring and analyzing a NetworkX-based knowledge graph.
    Designed to work with ConceptNet-style semantic graphs.
    """
    
    def __init__(self, graph, relation_types=None):
        """Initialize the explorer with a graph"""
        self.graph = graph
        
        # Filter relation types if specified, otherwise use all
        if relation_types is None:
            self.relation_types = self._get_all_relation_types()
        else:
            self.relation_types = relation_types
            
        # Cache for performance
        self._node_centrality = None
        
    def _get_all_relation_types(self):
        """Get all unique relation types in the graph"""
        relation_types = set()
        for _, _, data in self.graph.edges(data=True):
            if 'relation' in data:
                relation_types.add(data['relation'])
        return list(relation_types)
    
    def explore_concept(self, concept, relation_filter=None, min_weight=0.0, limit=10):
        """
        Get the most important relationships for a concept
        
        Parameters:
        -----------
        concept : str
            The concept to explore
        relation_filter : list or None
            Filter by relation types (e.g., ['IsA', 'PartOf'])
        min_weight : float
            Minimum weight threshold
        limit : int
            Maximum number of results to return
            
        Returns:
        --------
        pd.DataFrame or str
            DataFrame with source, target, relation, weight, direction and
            high_confidence columns, or a message string if the concept is missing
        """
        if concept not in self.graph:
            return f"Concept '{concept}' not found in the graph."
        
        # Collect outgoing and incoming edges
        edges = []
        
        # Outgoing edges
        for _, target, data in self.graph.out_edges(concept, data=True):
            if (relation_filter is None or data.get('relation') in relation_filter) and data.get('weight', 0) >= min_weight:
                edges.append({
                    'source': concept,
                    'target': target,
                    'relation': data.get('relation', 'unknown'),
                    'weight': data.get('weight', 0.0),
                    'direction': 'outgoing',
                    'high_confidence': data.get('high_confidence', False)
                })
        
        # Incoming edges
        for source, _, data in self.graph.in_edges(concept, data=True):
            if (relation_filter is None or data.get('relation') in relation_filter) and data.get('weight', 0) >= min_weight:
                edges.append({
                    'source': source,
                    'target': concept,
                    'relation': data.get('relation', 'unknown'),
                    'weight': data.get('weight', 0.0),
                    'direction': 'incoming',
                    'high_confidence': data.get('high_confidence', False)
                })
        
        # Convert to DataFrame and sort by weight
        import pandas as pd
        edges_df = pd.DataFrame(edges)
        
        if len(edges_df) > 0:
            edges_df = edges_df.sort_values('weight', ascending=False).head(limit)
        
        return edges_df
    
    def find_path(self, source, target, relation_filter=None, max_length=3):
        """
        Find the shortest path between two concepts, considering only certain relations
        
        Parameters:
        -----------
        source : str
            Source concept
        target : str
            Target concept
        relation_filter : list or None
            Filter by relation types
        max_length : int
            Maximum path length
            
        Returns:
        --------
        list or str
            List of (concept, relation, concept, weight) tuples, or a message
            string if a concept is missing or no path exists
        """
        import networkx as nx
        
        # Check if concepts exist
        if source not in self.graph or target not in self.graph:
            return f"One or both concepts not found: {source}, {target}"
        
        # Create a view of the graph with only the desired relations
        if relation_filter is not None:
            view = nx.MultiDiGraph()
            
            for u, v, key, data in self.graph.edges(keys=True, data=True):
                if data.get('relation') in relation_filter:
                    view.add_edge(u, v, key=key, **data)
                    
            search_graph = view
        else:
            search_graph = self.graph
        
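        # Edge cost is the inverse weight, so stronger relations are cheaper to traverse.
        # On a multigraph the callback receives {edge_key: attributes} for all parallel
        # edges, so take the cheapest one.
        def edge_cost(u, v, data):
            parallel = data.values() if search_graph.is_multigraph() else [data]
            return min(1 / max(attrs.get('weight', 0.1), 0.1) for attrs in parallel)
        
        # Find the shortest path
        try:
            path = nx.shortest_path(search_graph, source, target, weight=edge_cost)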
            
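            # Enforce max_length (counted in hops) on the path that was found
            if len(path) - 1 > max_length:
                return f"No path found between {source} and {target} with given constraints."
            
            # Construct the full path with relations
            full_path = []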
            for i in range(len(path)-1):
                edges = search_graph.get_edge_data(path[i], path[i+1])
                if not edges:  # No edge exists (shouldn't happen given how path was found)
                    relation = 'unknown'
                    weight = 0.0
                else:
                    # Get the highest weighted edge if multiple
                    best_key = max(edges.keys(), key=lambda k: edges[k].get('weight', 0))
                    relation = edges[best_key].get('relation', 'unknown')
                    weight = edges[best_key].get('weight', 0.0)
                
                full_path.append((path[i], relation, path[i+1], weight))
            
            return full_path
            
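        except (nx.NetworkXNoPath, nx.NodeNotFound):
            # NodeNotFound is raised when the relation filter leaves an endpoint out of the filtered graph
            return f"No path found between {source} and {target} with given constraints."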
    
    def get_top_concepts(self, n=20, measure='degree'):
        """
        Get the top N concepts by a centrality measure
        
        Parameters:
        -----------
        n : int
            Number of concepts to return
        measure : str
            Centrality measure ('degree', 'in_degree', 'out_degree', 'pagerank')
            
        Returns:
        --------
        pd.DataFrame
            DataFrame with concepts and their centrality scores
        """
        import pandas as pd
        import networkx as nx
        
        if measure == 'degree':
            centrality = {node: degree for node, degree in self.graph.degree()}
        elif measure == 'in_degree':
            centrality = {node: degree for node, degree in self.graph.in_degree()}
        elif measure == 'out_degree':
            centrality = {node: degree for node, degree in self.graph.out_degree()}
        elif measure == 'pagerank':
            # Compute pagerank (can be slow for large graphs)
            print("Computing PageRank (this may take a while)...")
            centrality = nx.pagerank(self.graph, alpha=0.85, weight='weight')
        else:
            raise ValueError(f"Unknown centrality measure: {measure}")
        
        # Convert to DataFrame and get top N
        df = pd.DataFrame(centrality.items(), columns=['concept', 'score'])
        df = df.nlargest(n, 'score')
        
        return df
    
    def analyze_graph_stats(self):
        """
        Compute basic statistics about the knowledge graph
        
        Returns:
        --------
        dict
            Dictionary with graph statistics
        """
        import networkx as nx
        import numpy as np
        
        stats = {
            'nodes': self.graph.number_of_nodes(),
            'edges': self.graph.number_of_edges(),
            'density': nx.density(self.graph),
            'active_nodes': sum(1 for node in self.graph.nodes() if self.graph.degree(node) > 0),
            'isolated_nodes': sum(1 for node in self.graph.nodes() if self.graph.degree(node) == 0),
            'avg_degree': np.mean([d for n, d in self.graph.degree()]),
            'max_degree': max([d for n, d in self.graph.degree()]) if self.graph.number_of_nodes() > 0 else 0,
        }
        
        # Add weight statistics if available
        weights = [data.get('weight', 0) for _, _, data in self.graph.edges(data=True)]
        if weights:
            stats.update({
                'min_weight': min(weights),
                'max_weight': max(weights),
                'avg_weight': np.mean(weights),
                'median_weight': np.median(weights)
            })
        
        # Add relation type counts
        relation_counts = {}
        for _, _, data in self.graph.edges(data=True):
            relation = data.get('relation', 'unknown')
            relation_counts[relation] = relation_counts.get(relation, 0) + 1
        
        stats['relation_counts'] = relation_counts
        stats['unique_relations'] = len(relation_counts)
        
        return stats
    
    def visualize_concept_network(self, central_concept, depth=1, min_weight=0.5, relation_filter=None, max_nodes=30):
        """
        Extract the network around a concept for visualization
        
        Parameters:
        -----------
        central_concept : str
            The central concept to visualize
        depth : int
            How many hops to include
        min_weight : float
            Minimum edge weight
        relation_filter : list
            Only include these relation types
        max_nodes : int
            Maximum number of nodes to include
            
        Returns:
        --------
        networkx.MultiDiGraph or str
            A filtered subgraph centered on the concept, or a message string
            if the concept is missing
        """
        import networkx as nx
        
        if central_concept not in self.graph:
            return f"Concept '{central_concept}' not found in the graph."
        
        # Start with the central concept
        nodes_to_include = {central_concept}
        
        # BFS to add concepts up to the specified depth
        current_depth = 0
        frontier = {central_concept}
        
        while current_depth < depth and len(nodes_to_include) < max_nodes:
            next_frontier = set()
            
            for node in frontier:
                # Add outgoing edges
                for _, target, data in self.graph.out_edges(node, data=True):
                    if (len(nodes_to_include) < max_nodes and 
                        target not in nodes_to_include and
                        data.get('weight', 0) >= min_weight and
                        (relation_filter is None or data.get('relation') in relation_filter)):
                        nodes_to_include.add(target)
                        next_frontier.add(target)
                
                # Add incoming edges
                for source, _, data in self.graph.in_edges(node, data=True):
                    if (len(nodes_to_include) < max_nodes and
                        source not in nodes_to_include and
                        data.get('weight', 0) >= min_weight and
                        (relation_filter is None or data.get('relation') in relation_filter)):
                        nodes_to_include.add(source)
                        next_frontier.add(source)
            
            frontier = next_frontier
            current_depth += 1
        
        # Extract the subgraph
        subgraph = self.graph.subgraph(nodes_to_include)
        
        # Filter edges by weight and relation
        view = nx.MultiDiGraph()
        view.add_nodes_from(subgraph.nodes(data=True))
        
        for u, v, key, data in subgraph.edges(keys=True, data=True):
            if (data.get('weight', 0) >= min_weight and 
                (relation_filter is None or data.get('relation') in relation_filter)):
                view.add_edge(u, v, key=key, **data)
        
        return view
    
    def search_by_pattern(self, pattern, relation_types=None, limit=20):
        """
        Search the graph for concepts matching a pattern
        
        Parameters:
        -----------
        pattern : str
            Text pattern to search for
        relation_types : list
            Reserved for limiting the search to edges with these relations
            (currently unused)
        limit : int
            Maximum results to return
            
        Returns:
        --------
        list
            List of matching concepts
        """
        import re
        
        # Match case-insensitively; lowercasing the pattern itself would corrupt
        # regex escapes such as \D or \S
        regex = re.compile(pattern, re.IGNORECASE)
        
        # Search for matching nodes
        matches = []
        for node in self.graph.nodes():
            if regex.search(str(node)):
                matches.append(node)
                if len(matches) >= limit:
                    break
        
        return matches

    def extract_taxonomy(self, root_concept, relation_types=('IsA',), max_depth=5):
        """
        Extract a taxonomy (tree) starting from a root concept
        
        Parameters:
        -----------
        root_concept : str
            Starting concept
        relation_types : list
            Relations to consider for the taxonomy (default: IsA)
        max_depth : int
            Maximum depth to traverse
            
        Returns:
        --------
        dict
            Nested dictionary representing the taxonomy
        """
        if root_concept not in self.graph:
            return f"Concept '{root_concept}' not found in the graph."
        
        def build_tree(concept, depth=0):
            if depth >= max_depth:
                return {}
            
            tree = {}
            
            # Get all outgoing edges with the specified relation types
            for _, target, data in self.graph.out_edges(concept, data=True):
                if data.get('relation') in relation_types:
                    tree[target] = build_tree(target, depth + 1)
            
            return tree
        
        taxonomy = {root_concept: build_tree(root_concept)}
        return taxonomy
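
When a weight callback is passed to `nx.shortest_path` on a `MultiDiGraph`, its third argument is a dictionary mapping edge keys to attribute dictionaries, not a single attribute dictionary. The sketch below, on a hypothetical toy graph, compares a callback that reads `weight` directly off that argument with one that iterates over the parallel edges. The names `naive_cost` and `edge_cost` are illustrative.

```python
import networkx as nx

# Toy graph with parallel edges (hypothetical data)
G = nx.MultiDiGraph()
G.add_edge("dog", "animal", relation="IsA", weight=2.0)
G.add_edge("dog", "animal", relation="RelatedTo", weight=0.5)
G.add_edge("animal", "organism", relation="IsA", weight=1.0)
G.add_edge("dog", "organism", relation="RelatedTo", weight=0.2)

def naive_cost(u, v, data):
    # data is {edge_key: attrs}, so 'weight' is never found and the
    # default is used for every edge
    return 1 / max(data.get("weight", 0.1), 0.1)

def edge_cost(u, v, data):
    # Inverse weight of the strongest parallel edge between u and v
    return min(1 / max(attrs.get("weight", 0.1), 0.1) for attrs in data.values())

naive_path = nx.shortest_path(G, "dog", "organism", weight=naive_cost)
path = nx.shortest_path(G, "dog", "organism", weight=edge_cost)

print("naive:", naive_path)
print("per-edge:", path)
```

With the naive callback every edge costs the same, so the search degenerates to a fewest-hops path. The per-edge callback prefers the route through the strongly weighted `IsA` edges.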

In [None]:
# Example of using the explorer with the trained knowledge graph
# Run this cell after training has produced the graph file

# Load a trained graph
import os
import networkx as nx
import pickle

# Load the GraphML export; a pickled graph could be loaded with pickle.load instead
trained_graph_path = os.path.join('..', 'Data', 'Output', 'nx_semantic_final.graphml')
G_trained = nx.read_graphml(trained_graph_path)

# Initialize the explorer
explorer = KnowledgeGraphExplorer(G_trained)

# Get basic graph stats
stats = explorer.analyze_graph_stats()
print("Graph Statistics:")
for key, value in stats.items():
    if key != 'relation_counts':  # Skip printing the full relation counts
        print(f"  {key}: {value}")

# Explore a concept
print("\nExploring 'dog':")
dog_relations = explorer.explore_concept('dog', min_weight=1.5, limit=10)
print(dog_relations)

# Find a path between concepts
print("\nFinding path from 'dog' to 'computer':")
path = explorer.find_path('dog', 'computer', max_length=3)
print(path)

# Get top concepts
print("\nTop concepts by degree:")
top_concepts = explorer.get_top_concepts(10, 'degree')
print(top_concepts)
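
The example above loads the graph with `nx.read_graphml`. By default that reader returns a multigraph only when the file contains parallel edges, while several explorer methods call `edges(keys=True, ...)`, which needs a multigraph. The following sketch, using a hypothetical toy graph and a temporary file, shows one way to coerce the loaded graph before handing it to the explorer.

```python
import os
import tempfile

import networkx as nx

# Toy graph without parallel edges (hypothetical data)
G = nx.MultiDiGraph()
G.add_edge("dog", "animal", relation="IsA", weight=2.0)

# Hypothetical file name in a temporary directory
graphml_path = os.path.join(tempfile.mkdtemp(), "toy_graph.graphml")
nx.write_graphml(G, graphml_path)

G_read = nx.read_graphml(graphml_path)
print("Loaded as:", type(G_read).__name__)

# Coerce to a MultiDiGraph so that edges(keys=True, data=True) is available
G_multi = G_read if G_read.is_multigraph() else nx.MultiDiGraph(G_read)

for u, v, key, data in G_multi.edges(keys=True, data=True):
    print(u, v, key, data["relation"], data["weight"])
```

Recent NetworkX releases also accept `force_multigraph=True` in `read_graphml`; the explicit coercion is used here because it does not depend on the installed version.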