In [171]:
import openai             # For LLM interaction
import json               # For parsing LLM responses
import networkx as nx     # For creating and managing the graph data structure
import ipycytoscape       # For interactive in-notebook graph visualization
import ipywidgets         # For interactive elements
import pandas as pd       # For displaying data in tables
import os                 # For accessing environment variables (safer for API keys)
import math               # For basic math operations
import re                 # For basic text cleaning (regular expressions)
import warnings           # To suppress potential deprecation warnings
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, AutoModelForCausalLM
import torch
from llama_cpp import Llama

from dotenv import load_dotenv
load_dotenv()

True

In [172]:
unstructured_text = """
Marie Curie, born Maria Skłodowska in Warsaw, Poland, was a pioneering physicist and chemist.
She conducted groundbreaking research on radioactivity. Together with her husband, Pierre Curie,
she discovered the elements polonium and radium. Marie Curie was the first woman to win a Nobel Prize,
the first person and only woman to win the Nobel Prize twice, and the only person to win the Nobel Prize
in two different scientific fields. She won the Nobel Prize in Physics in 1903 with Pierre Curie
and Henri Becquerel. Later, she won the Nobel Prize in Chemistry in 1911 for her work on radium and
polonium. During World War I, she developed mobile radiography units, known as 'petites Curies',
to provide X-ray services to field hospitals. Marie Curie died in 1934 from aplastic anemia, likely
caused by her long-term exposure to radiation.

Marie was born on November 7, 1867, to a family of teachers who valued education. She received her
early schooling in Warsaw but moved to Paris in 1891 to continue her studies at the Sorbonne, where
she earned degrees in physics and mathematics. She met Pierre Curie, a professor of physics, in 1894, 
and they married in 1895, beginning a productive scientific partnership. Following Pierre's tragic 
death in a street accident in 1906, Marie took over his teaching position, becoming the first female 
professor at the Sorbonne.

The Curies' work on radioactivity was conducted in challenging conditions, in a poorly equipped shed 
with no proper ventilation, as they processed tons of pitchblende ore to isolate radium. Marie Curie
established the Curie Institute in Paris, which became a major center for medical research. She had
two daughters: Irène, who later won a Nobel Prize in Chemistry with her husband, and Eve, who became
a writer. Marie's notebooks are still radioactive today and are kept in lead-lined boxes. Her legacy
includes not only her scientific discoveries but also her role in breaking gender barriers in academia
and science.
"""

In [173]:
unstructured_text2 = """

The Battle of Çanakkale, also known as the Gallipoli Campaign (1915), was a defining moment in World War I that took place on the Gallipoli Peninsula in Turkey. 
The Allied Powers, mainly Britain and France, attempted to force a passage through the Dardanelles Strait to capture Constantinople (Istanbul) and open a supply route to Russia. 
However, they faced fierce resistance from the Ottoman Empire, which turned the tide with strategic defense and high morale.

Key figures in the battle include Mustafa Kemal Atatürk, then a young Ottoman commander, who played a critical role in organizing the defense at Anafartalar and Conkbayırı. His leadership and famous command, “I do not order you to attack; I order you to die,” became legendary and cemented his status as a national hero. On the Allied side, commanders like General Ian Hamilton struggled with underestimating the terrain and Ottoman resistance, leading to heavy casualties.

The campaign ended in failure for the Allies after months of stalemate, suffering over 250,000 casualties on each side. For the Ottomans, the victory became a symbol of national pride and resistance. For Australia and New Zealand, whose ANZAC troops fought valiantly, it marked a tragic but formative experience that is commemorated annually on ANZAC Day, April 25. The battle significantly shaped Turkish national identity and contributed to the eventual founding of the Republic of Turkey under Atatürk.

The Gallipoli Campaign is remembered for its strategic blunders, the harsh conditions faced by soldiers, and the heroism displayed on both sides. The campaign's failure led to a reevaluation of Allied strategies in the war and had lasting implications for the Middle East and the post-war world order.
"""

In [174]:
# --- Chunking Configuration ---
chunk_size = 100  # Number of words per chunk (adjust as needed)
overlap = 20     # Number of words to overlap (must be < chunk_size)

if overlap >= chunk_size and chunk_size > 0:
    raise SystemExit("Chunking configuration error.")
else:
    print("Chunking configuration is valid.")


Chunking configuration is valid.


In [175]:
words = unstructured_text2.split()
total_words = len(words)

print(f"Text split into {total_words} words.")
# Visualize the first 20 words
print(f"First 20 words: {words[:20]}")

Text split into 275 words.
First 20 words: ['The', 'Battle', 'of', 'Çanakkale,', 'also', 'known', 'as', 'the', 'Gallipoli', 'Campaign', '(1915),', 'was', 'a', 'defining', 'moment', 'in', 'World', 'War', 'I', 'that']


In [176]:
chunks = []
start_index = 0
chunk_number = 1

while start_index < total_words:
    end_index = min(start_index + chunk_size, total_words)
    chunk_text = " ".join(words[start_index:end_index])
    chunks.append({"text": chunk_text, "chunk_number": chunk_number})

    # Calculate the start of the next chunk
    next_start_index = start_index + chunk_size - overlap

    # Ensure progress is made
    if next_start_index <= start_index:
        if end_index == total_words:
             break # Already processed the last part
        next_start_index = start_index + 1

    start_index = next_start_index
    chunk_number += 1

    # Safety break (optional)
    if chunk_number > total_words: # Simple safety
        print("Warning: Chunking loop exceeded total word count, breaking.")
        break

print(f"\nText successfully split into {len(chunks)} chunks.")


Text successfully split into 4 chunks.


In [177]:
if chunks:
    # Create a DataFrame for better visualization
    chunks_df = pd.DataFrame(chunks)
    chunks_df['word_count'] = chunks_df['text'].apply(lambda x: len(x.split()))
    display(chunks_df[['chunk_number', 'word_count', 'text']])
else:
    print("No chunks were created (text might be shorter than chunk size).")
print("-" * 25)

Unnamed: 0,chunk_number,word_count,text
0,1,100,"The Battle of Çanakkale, also known as the Gal..."
1,2,100,"Mustafa Kemal Atatürk, then a young Ottoman co..."
2,3,100,"stalemate, suffering over 250,000 casualties o..."
3,4,35,"by soldiers, and the heroism displayed on both..."


-------------------------


In [178]:
#   - **Text Chunk:** Marie Curie discovered Radium in 1898.

#   - **System Prompt:** You are an expert in information extraction.

#   - **User Prompt:** Extract SPO triples. Rules:
#   - Follow the pattern.
#   - Text: text__chunk_placeholder.
#   - Required JSON format: Your JSON:

In [179]:
# --- System Prompt: Sets the context/role for the LLM ---
extraction_system_prompt = """
You are an AI expert specialized in knowledge graph extraction.
Your task is to identify and extract factual Subject-Predicate-Object (SPO) triples from the given text.
Focus on accuracy and adhere strictly to the JSON output format requested in the user prompt.
Extract core entities and the most direct relationship.
"""

# --- User Prompt Template: Contains specific instructions and the text ---
extraction_user_prompt_template = """
Please extract Subject-Predicate-Object (S-P-O) triples from the text below.

**STRICT INSTRUCTIONS – FOLLOW EXACTLY:**

1.  **Output Format:** Respond ONLY with a single, valid JSON array. Each element MUST be an object with keys "subject", "predicate", "object".
2.  **JSON Only:** Do NOT include any text before or after the JSON array (e.g., no 'Here is the JSON:' or explanations). Do NOT use markdown ```json ... ``` tags.
3. **NO NESTED OBJECTS:** If `object` is a dictionary or array, convert it into flat text. Use the most specific and informative string. For example:
   - ✅ "object": "nobel prize in chemistry" instead of ❌ "object": {{ "type": "nobel prize", "subtype": ["chemistry"] }}
   - ✅ "object": "radium" instead of ❌ "object": {{ "type": ["element", "radium"] }}
4. **Only Triples:** Do NOT include additional keys like `"subject2"`, `"type"`, `"name"`, etc.
4.  **Concise Predicates:** Keep the 'predicate' value concise (1-3 words, ideally 1-2). Use verbs or short verb phrases (e.g., 'discovered', 'was born in', 'won').
5.  **Lowercase:** ALL values for 'subject', 'predicate', and 'object' MUST be lowercase.
6.  **Pronoun Resolution:** Replace pronouns (she, he, it, her, etc.) with the specific lowercase entity name they refer to based on the text context (e.g., 'marie curie').
7.  **Specificity:** Capture specific details (e.g., 'nobel prize in physics' instead of just 'nobel prize' if specified).
8.  **Completeness:** Extract all distinct factual relationships mentioned and triples present in the text.

**Text to Process:**
{text_chunk}
"""

In [180]:
# Text Chunks
#       |
# Format Prompt
# System + User + Chunk
#       |
# Send to LLM API
#       |
# Receive Response
#       |
# Parse JSON
#    /      \
# Validate Triples   if failure
#    /     \        
# if invalid    Handle Errors / Failures
#    \     /
# Store Valid Triples
#       |
#     (loop back to Text Chunks)

In [181]:
# Initialize lists to store results and failures
all_extracted_triples = []
failed_chunks = []

chunk_index = 0  # Process first chunk only

llm_model_path = "./mistral-7b-instruct-v0.2.Q4_K_M.gguf"
llm_max_tokens = 512  # Allowing 1536 tokens for prompt, 512 for generation

print(
    f"Starting triple extraction from {len(chunks)} chunks using model '{llm_model_path}'..."
)
# We will process chunks one by one in the following cells.

Starting triple extraction from 4 chunks using model './mistral-7b-instruct-v0.2.Q4_K_M.gguf'...


In [182]:
# Initialize the LLM model

llm = Llama(
    model_path=llm_model_path,
    n_ctx=2048,  #  if prompt + response exceeds 1024 tokens, the model truncates the output mid-JSON.
    n_threads=16,  # Use 16 of your 20 logical threads for good performance
    verbose=True,  # Optional: Show debug info
    n_batch=32,  # Batch size for processing, Keep moderate; increase only if RAM allows (64)
)

llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from ./mistral-7b-instruct-v0.2.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.2
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                

💡 If You Still Get Incomplete Output
Reduce chunk_size in text splitter (e.g., from 150 to 100 words).

Shorten your prompt wording to conserve tokens.

Stream and truncate at the first ] as discussed earlier.

In [183]:
def clean_and_parse_llm_output(llm_output_raw: str, chunk_num: int):
    cleaned = re.sub(r"(?i)^```json|```", "", llm_output_raw.strip(), flags=re.IGNORECASE).strip()
    if cleaned.count("[") > cleaned.count("]"):
        cleaned += "]"

    # print(f"[Chunk {chunk_num}] Cleaned LLM output: {cleaned}...") 

    # Step 2: Extract only the first JSON array using regex
    match = re.search(r"\[\s*{.*?}\s*\]", cleaned, re.DOTALL)
    if match:
        json_part = match.group(0)
    else:
        print(f"[Chunk {chunk_num}] ❌ ERROR: No JSON array detected.")
        return []

    # Save Cleaned LLM output to a txt file (append mode)
    with open("debug_llm_output.txt", "a", encoding="utf-8") as f:
        f.write(f"\n--- Chunk {chunk_num} ---\n")
        f.write(json_part + "\n")

    try:
        parsed = json.loads(json_part)
    except json.JSONDecodeError as e:
        print(f"[Chunk {chunk_num}] JSONDecodeError: {e}")
        return []

    triples = []
    def extract(item, parent_subject=None):
        if not isinstance(item, dict): return
        subj = item.get("subject") or parent_subject
        pred = item.get("predicate")
        obj = item.get("object")

        if isinstance(obj, str):
            triples.append({"subject": subj, "predicate": pred, "object": obj})
        elif isinstance(obj, dict):
            for key, val in obj.items():
                if isinstance(val, list):
                    for v in val:
                        extract(v, parent_subject=v.get("name", subj))
        elif isinstance(obj, list):
            for v in obj:
                extract(v, parent_subject=subj)

    if isinstance(parsed, list):
        for item in parsed:
            extract(item)
    elif isinstance(parsed, dict):
        extract(parsed)

    return triples

In [184]:
for chunk_index, chunk_info in enumerate(chunks):

    chunk_text = chunk_info["text"]
    chunk_num = chunk_info["chunk_number"]

    print(f"\n--- Processing Chunk {chunk_num}/{len(chunks)} --- ")

    # 1. Format the User Prompt
    user_prompt = extraction_user_prompt_template.format(text_chunk=chunk_text)
    full_prompt = f"{extraction_system_prompt.strip()}\n\n{user_prompt.strip()}"

    llm_output = None
    error_message = None

    try:

        # 2. Send the formatted prompt to the LLM API
        response = llm.create_completion(
            prompt=full_prompt,
            max_tokens=llm_max_tokens,  # allow the model to generate enough tokens for the response
            temperature=0.0,
            stop=["</s>"],
        )

        # 3. Extract Raw Response Content
        raw_chunks = []

        if response and "choices" in response:
            for choice in response["choices"]:
                if choice and "text" in choice:
                    raw_chunks.append(choice["text"])

        # Combine chunks into a single string
        llm_output = "".join(raw_chunks).strip()

        # Clean and parse the JSON-like output
        triples = clean_and_parse_llm_output(llm_output, chunk_num)
        
        all_extracted_triples.extend(triples)

        print("-" * 200)

    except Exception as e:
        error_message = str(e)
        print(f"[Chunk {chunk_num}] Error: {error_message}")


print(f"\n all_extracted_triples: {all_extracted_triples} triples extracted.")

      


--- Processing Chunk 1/4 --- 


llama_perf_context_print:        load time =   26482.90 ms
llama_perf_context_print: prompt eval time =   26479.68 ms /   699 tokens (   37.88 ms per token,    26.40 tokens per second)
llama_perf_context_print:        eval time =   40613.11 ms /   367 runs   (  110.66 ms per token,     9.04 tokens per second)
llama_perf_context_print:       total time =   67327.31 ms /  1066 tokens
Llama.generate: 547 prefix-match hit, remaining 154 prompt tokens to eval


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--- Processing Chunk 2/4 --- 


llama_perf_context_print:        load time =   26482.90 ms
llama_perf_context_print: prompt eval time =    4056.40 ms /   154 tokens (   26.34 ms per token,    37.96 tokens per second)
llama_perf_context_print:        eval time =   46597.10 ms /   416 runs   (  112.01 ms per token,     8.93 tokens per second)
llama_perf_context_print:       total time =   50895.19 ms /   570 tokens
Llama.generate: 547 prefix-match hit, remaining 149 prompt tokens to eval


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--- Processing Chunk 3/4 --- 


llama_perf_context_print:        load time =   26482.90 ms
llama_perf_context_print: prompt eval time =    4009.49 ms /   149 tokens (   26.91 ms per token,    37.16 tokens per second)
llama_perf_context_print:        eval time =   50741.26 ms /   459 runs   (  110.55 ms per token,     9.05 tokens per second)
llama_perf_context_print:       total time =   55063.44 ms /   608 tokens
Llama.generate: 547 prefix-match hit, remaining 46 prompt tokens to eval


--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

--- Processing Chunk 4/4 --- 


llama_perf_context_print:        load time =   26482.90 ms
llama_perf_context_print: prompt eval time =    1419.31 ms /    46 tokens (   30.85 ms per token,    32.41 tokens per second)
llama_perf_context_print:        eval time =   45756.67 ms /   415 runs   (  110.26 ms per token,     9.07 tokens per second)
llama_perf_context_print:       total time =   47430.87 ms /   461 tokens


[Chunk 4] JSONDecodeError: Expecting property name enclosed in double quotes: line 35 column 3 (char 674)
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

 all_extracted_triples: [{'subject': 'the battle of çanakkale', 'predicate': 'took place', 'object': 'from february 19, 1915, to january 9, 1916'}, {'subject': 'the allied powers', 'predicate': 'attempted to force a passage through', 'object': 'the dardanelles strait'}, {'subject': 'the ottoman empire', 'predicate': 'faced fierce resistance from', 'object': 'the allied powers'}, {'subject': 'mustafa kemal atatürk', 'predicate': 'played a critical role in organizing the defense at', 'object': 'anafartalar'}, {'subject': 'mustafa kemal atatürk', 'predicate': 'later became the founder of', 'object': 'modern turkey'}, {'subject': 'mustafa kemal atatürk', 'predicate': 'played_a_critical_r

In [185]:
# --- Summary of Extraction (Reflecting state after the single chunk demo) ---
print(f"\n--- Overall Extraction Summary ---")
print(f"Total chunks defined: {len(chunks)}")
processed_chunks = len(chunks) - len(failed_chunks) # Approximation if loop isn't run fully
print(f"Chunks processed (attempted): {processed_chunks + len(failed_chunks)}") # Chunks we looped through
print(f"Total valid triples extracted across all processed chunks: {len(all_extracted_triples)}")
print(f"Number of chunks that failed API call or parsing: {len(failed_chunks)}")

if failed_chunks:
    print("\nDetails of Failed Chunks:")
    for failure in failed_chunks:
        print(f"  Chunk {failure['chunk_number']}: Error: {failure['error']}")
        # print(f"    Response (start): {failure.get('response', '')[:100]}...") # Uncomment for more detail
print("-" * 25)

# Display all extracted triples using Pandas
print("\n--- All Extracted Triples (Before Normalization) ---")
if all_extracted_triples:
    all_triples_df = pd.DataFrame(all_extracted_triples)
    display(all_triples_df)
else:
    print("No triples were successfully extracted.")
print("-" * 25)


--- Overall Extraction Summary ---
Total chunks defined: 4
Chunks processed (attempted): 4
Total valid triples extracted across all processed chunks: 21
Number of chunks that failed API call or parsing: 0
-------------------------

--- All Extracted Triples (Before Normalization) ---


Unnamed: 0,subject,predicate,object
0,the battle of çanakkale,took place,"from february 19, 1915, to january 9, 1916"
1,the allied powers,attempted to force a passage through,the dardanelles strait
2,the ottoman empire,faced fierce resistance from,the allied powers
3,mustafa kemal atatürk,played a critical role in organizing the defen...,anafartalar
4,mustafa kemal atatürk,later became the founder of,modern turkey
5,mustafa kemal atatürk,played_a_critical_role_in,organizing_the_defense_at_anafartalar_and_conk...
6,mustafa kemal atatürk,became_legendary_for,his_command_i_do_not_order_you_to_attack_i_ord...
7,general_ian_hamilton,struggled_with,underestimating_the_terrain_and_ottoman_resist...
8,allied_side,suffered_over,250000_casualties
9,ottomans,suffered_over,250000_casualties


-------------------------


In [186]:
# Initialize lists and tracking variables

# Normalize: Trim whitespace, convert to lowercase.
# Filter: Remove triples with empty parts after normalization.
# De-duplicate: Remove exact duplicate (subject, predicate, object) combinations.

normalized_triples = []
seen_triples = set() # Tracks (subject, predicate, object) tuples
original_count = len(all_extracted_triples)
empty_removed_count = 0
duplicates_removed_count = 0

print(f"Starting normalization and de-duplication of {original_count} triples...")

Starting normalization and de-duplication of 21 triples...


In [187]:
print("Processing triples for normalization (showing first 5 examples):")
example_limit = 5
processed_count = 0

for i, triple in enumerate(all_extracted_triples):
    show_example = (i < example_limit)
    if show_example:
        print(f"\n--- Example {i+1} ---")
        print(f"Original Triple (Chunk {triple.get('chunk', '?')}): {triple}")
        
    subject_raw = triple.get('subject')
    predicate_raw = triple.get('predicate')
    object_raw = triple.get('object')
    chunk_num = triple.get('chunk', 'unknown')
    
    triple_valid = False
    normalized_sub, normalized_pred, normalized_obj = None, None, None

    if isinstance(subject_raw, str) and isinstance(predicate_raw, str) and isinstance(object_raw, str):
        # 1. Normalize
        normalized_sub = subject_raw.strip().lower()
        normalized_pred = re.sub(r'\s+', ' ', predicate_raw.strip().lower()).strip()
        normalized_obj = object_raw.strip().lower()
        if show_example:
            print(f"Normalized: SUB='{normalized_sub}', PRED='{normalized_pred}', OBJ='{normalized_obj}'")

        # 2. Filter Empty
        if normalized_sub and normalized_pred and normalized_obj:
            triple_identifier = (normalized_sub, normalized_pred, normalized_obj)
            
            # 3. De-duplicate
            if triple_identifier not in seen_triples:
                normalized_triples.append({
                    'subject': normalized_sub,
                    'predicate': normalized_pred,
                    'object': normalized_obj,
                    'source_chunk': chunk_num
                })
                seen_triples.add(triple_identifier)
                triple_valid = True
                if show_example:
                    print("Status: Kept (New Unique Triple)")
            else:
                duplicates_removed_count += 1
                if show_example:
                    print("Status: Discarded (Duplicate)")
        else:
            empty_removed_count += 1
            if show_example:
                print("Status: Discarded (Empty component after normalization)")
    else:
        empty_removed_count += 1 # Count non-string/missing as needing removal
        if show_example:
             print("Status: Discarded (Non-string or missing component)")
    processed_count += 1

print(f"\n... Finished processing {processed_count} triples.")

Processing triples for normalization (showing first 5 examples):

--- Example 1 ---
Original Triple (Chunk ?): {'subject': 'the battle of çanakkale', 'predicate': 'took place', 'object': 'from february 19, 1915, to january 9, 1916'}
Normalized: SUB='the battle of çanakkale', PRED='took place', OBJ='from february 19, 1915, to january 9, 1916'
Status: Kept (New Unique Triple)

--- Example 2 ---
Original Triple (Chunk ?): {'subject': 'the allied powers', 'predicate': 'attempted to force a passage through', 'object': 'the dardanelles strait'}
Normalized: SUB='the allied powers', PRED='attempted to force a passage through', OBJ='the dardanelles strait'
Status: Kept (New Unique Triple)

--- Example 3 ---
Original Triple (Chunk ?): {'subject': 'the ottoman empire', 'predicate': 'faced fierce resistance from', 'object': 'the allied powers'}
Normalized: SUB='the ottoman empire', PRED='faced fierce resistance from', OBJ='the allied powers'
Status: Kept (New Unique Triple)

--- Example 4 ---
Orig

In [188]:
# --- Summary of Normalization --- 
print(f"\n--- Normalization & De-duplication Summary ---")
print(f"Original extracted triple count: {original_count}")
print(f"Triples removed (empty/invalid components): {empty_removed_count}")
print(f"Duplicate triples removed: {duplicates_removed_count}")
final_count = len(normalized_triples)
print(f"Final unique, normalized triple count: {final_count}")
print("-" * 25)

# Display a sample of normalized triples using Pandas
print("\n--- Final Normalized Triples ---")
if normalized_triples:
    normalized_df = pd.DataFrame(normalized_triples)
    display(normalized_df)
else:
    print("No valid triples remain after normalization.")
print("-" * 25)


--- Normalization & De-duplication Summary ---
Original extracted triple count: 21
Triples removed (empty/invalid components): 0
Duplicate triples removed: 0
Final unique, normalized triple count: 21
-------------------------

--- Final Normalized Triples ---


Unnamed: 0,subject,predicate,object,source_chunk
0,the battle of çanakkale,took place,"from february 19, 1915, to january 9, 1916",unknown
1,the allied powers,attempted to force a passage through,the dardanelles strait,unknown
2,the ottoman empire,faced fierce resistance from,the allied powers,unknown
3,mustafa kemal atatürk,played a critical role in organizing the defen...,anafartalar,unknown
4,mustafa kemal atatürk,later became the founder of,modern turkey,unknown
5,mustafa kemal atatürk,played_a_critical_role_in,organizing_the_defense_at_anafartalar_and_conk...,unknown
6,mustafa kemal atatürk,became_legendary_for,his_command_i_do_not_order_you_to_attack_i_ord...,unknown
7,general_ian_hamilton,struggled_with,underestimating_the_terrain_and_ottoman_resist...,unknown
8,allied_side,suffered_over,250000_casualties,unknown
9,ottomans,suffered_over,250000_casualties,unknown


-------------------------


In [189]:
# Create an empty directed graph
knowledge_graph = nx.DiGraph()

print("Initialized an empty NetworkX DiGraph.")
# Visualize the initial empty graph state
print("--- Initial Graph Info ---")
try:
    # Try the newer method first
    print(nx.info(knowledge_graph))
except AttributeError:
    # Fallback for different NetworkX versions
    print(f"Type: {type(knowledge_graph).__name__}")
    print(f"Number of nodes: {knowledge_graph.number_of_nodes()}")
    print(f"Number of edges: {knowledge_graph.number_of_edges()}")
print("-" * 25)

Initialized an empty NetworkX DiGraph.
--- Initial Graph Info ---
Type: DiGraph
Number of nodes: 0
Number of edges: 0
-------------------------


In [190]:
print("Adding triples to the NetworkX graph...")

added_edges_count = 0
update_interval = 5 # How often to print graph info update

if not normalized_triples:
    print("Warning: No normalized triples to add to the graph.")
else:
    for i, triple in enumerate(normalized_triples):
        subject_node = triple['subject']
        object_node = triple['object']
        predicate_label = triple['predicate']
        
        # Nodes are added automatically when adding edges, but explicit calls are fine too
        knowledge_graph.add_node(subject_node) 
        knowledge_graph.add_node(object_node)
        
        # Add the directed edge with the predicate as a 'label' attribute
        knowledge_graph.add_edge(subject_node, object_node, label=predicate_label)
        added_edges_count += 1
        
        # --- Visualize Graph Growth --- 
        if (i + 1) % update_interval == 0 or (i + 1) == len(normalized_triples):
            print(f"\n--- Graph Info after adding Triple #{i+1} --- ({subject_node} -> {object_node})")
            try:
                # Try the newer method first
                print(nx.info(knowledge_graph))
            except AttributeError:
                # Fallback for different NetworkX versions
                print(f"Type: {type(knowledge_graph).__name__}")
                print(f"Number of nodes: {knowledge_graph.number_of_nodes()}")
                print(f"Number of edges: {knowledge_graph.number_of_edges()}")
            # For very large graphs, printing info too often can be slow. Adjust interval.

print(f"\nFinished adding triples. Processed {added_edges_count} edges.")

Adding triples to the NetworkX graph...

--- Graph Info after adding Triple #5 --- (mustafa kemal atatürk -> modern turkey)
Type: DiGraph
Number of nodes: 8
Number of edges: 5

--- Graph Info after adding Triple #10 --- (ottomans -> 250000_casualties)
Type: DiGraph
Number of nodes: 15
Number of edges: 10

--- Graph Info after adding Triple #15 --- (gallipoli campaign -> its strategic blunders, harsh conditions, and heroism)
Type: DiGraph
Number of nodes: 21
Number of edges: 15

--- Graph Info after adding Triple #20 --- (anzac day -> april 25)
Type: DiGraph
Number of nodes: 29
Number of edges: 20

--- Graph Info after adding Triple #21 --- (atatürk -> the republic of turkey)
Type: DiGraph
Number of nodes: 31
Number of edges: 21

Finished adding triples. Processed 21 edges.


In [191]:
# --- Final Graph Statistics --- 
num_nodes = knowledge_graph.number_of_nodes()
num_edges = knowledge_graph.number_of_edges()

print(f"\n--- Final NetworkX Graph Summary ---")
print(f"Total unique nodes (entities): {num_nodes}")
print(f"Total unique edges (relationships): {num_edges}")

if num_edges != added_edges_count and isinstance(knowledge_graph, nx.DiGraph):
     print(f"Note: Added {added_edges_count} edges, but graph has {num_edges}. DiGraph overwrites edges with same source/target. Use MultiDiGraph if multiple edges needed.")

if num_nodes > 0:
    try:
       density = nx.density(knowledge_graph)
       print(f"Graph density: {density:.4f}")
       if nx.is_weakly_connected(knowledge_graph):
           print("The graph is weakly connected (all nodes reachable ignoring direction).")
       else:
           num_components = nx.number_weakly_connected_components(knowledge_graph)
           print(f"The graph has {num_components} weakly connected components.")
    except Exception as e:
        print(f"Could not calculate some graph metrics: {e}") # Handle potential errors on empty/small graphs
else:
    print("Graph is empty, cannot calculate metrics.")
print("-" * 25)

# --- Sample Nodes --- 
print("\n--- Sample Nodes (First 10) ---")
if num_nodes > 0:
    nodes_sample = list(knowledge_graph.nodes())[:10]
    display(pd.DataFrame(nodes_sample, columns=['Node Sample']))
else:
    print("Graph has no nodes.")

# --- Sample Edges --- 
print("\n--- Sample Edges (First 10 with Labels) ---")
if num_edges > 0:
    edges_sample = []
    for u, v, data in list(knowledge_graph.edges(data=True))[:10]:
        edges_sample.append({'Source': u, 'Target': v, 'Label': data.get('label', 'N/A')})
    display(pd.DataFrame(edges_sample))
else:
    print("Graph has no edges.")
print("-" * 25)


--- Final NetworkX Graph Summary ---
Total unique nodes (entities): 31
Total unique edges (relationships): 21
Graph density: 0.0226
The graph has 10 weakly connected components.
-------------------------

--- Sample Nodes (First 10) ---


Unnamed: 0,Node Sample
0,the battle of çanakkale
1,"from february 19, 1915, to january 9, 1916"
2,the allied powers
3,the dardanelles strait
4,the ottoman empire
5,mustafa kemal atatürk
6,anafartalar
7,modern turkey
8,organizing_the_defense_at_anafartalar_and_conk...
9,his_command_i_do_not_order_you_to_attack_i_ord...



--- Sample Edges (First 10 with Labels) ---


Unnamed: 0,Source,Target,Label
0,the battle of çanakkale,"from february 19, 1915, to january 9, 1916",took place
1,the allied powers,the dardanelles strait,attempted to force a passage through
2,the ottoman empire,the allied powers,faced fierce resistance from
3,mustafa kemal atatürk,anafartalar,played a critical role in organizing the defen...
4,mustafa kemal atatürk,modern turkey,later became the founder of
5,mustafa kemal atatürk,organizing_the_defense_at_anafartalar_and_conk...,played_a_critical_role_in
6,mustafa kemal atatürk,his_command_i_do_not_order_you_to_attack_i_ord...,became_legendary_for
7,general_ian_hamilton,underestimating_the_terrain_and_ottoman_resist...,struggled_with
8,allied_side,250000_casualties,suffered_over
9,allied_side,months_of_stalemate,ended_in_failure_after


-------------------------


In [192]:
print("Preparing interactive visualization...")

# --- Check Graph Validity for Visualization --- 
can_visualize = False
if 'knowledge_graph' not in locals() or not isinstance(knowledge_graph, nx.Graph):
    print("Error: 'knowledge_graph' not found or is not a NetworkX graph.")
elif knowledge_graph.number_of_nodes() == 0:
    print("NetworkX Graph is empty. Cannot visualize.")
else:
    print(f"Graph seems valid for visualization ({knowledge_graph.number_of_nodes()} nodes, {knowledge_graph.number_of_edges()} edges).")
    can_visualize = True

Preparing interactive visualization...
Graph seems valid for visualization (31 nodes, 21 edges).


In [193]:
cytoscape_nodes = []
cytoscape_edges = []

if can_visualize:
    print("Converting nodes...")
    # Calculate degrees for node sizing
    node_degrees = dict(knowledge_graph.degree())
    max_degree = max(node_degrees.values()) if node_degrees else 1
    
    for node_id in knowledge_graph.nodes():
        degree = node_degrees.get(node_id, 0)
        # Simple scaling for node size (adjust logic as needed)
        node_size = 15 + (degree / max_degree) * 50 if max_degree > 0 else 15
        
        cytoscape_nodes.append({
            'data': {
                'id': str(node_id), # ID must be string
                'label': str(node_id).replace(' ', '\n'), # Display label (wrap spaces)
                'degree': degree,
                'size': node_size,
                'tooltip_text': f"Entity: {str(node_id)}\nDegree: {degree}" # Tooltip on hover
            }
        })
    print(f"Converted {len(cytoscape_nodes)} nodes.")
    
    print("Converting edges...")
    edge_count = 0
    for u, v, data in knowledge_graph.edges(data=True):
        edge_id = f"edge_{edge_count}" # Unique edge ID
        predicate_label = data.get('label', '')
        cytoscape_edges.append({
            'data': {
                'id': edge_id,
                'source': str(u),
                'target': str(v),
                'label': predicate_label, # Label on edge
                'tooltip_text': f"Relationship: {predicate_label}" # Tooltip on hover
            }
        })
        edge_count += 1
    print(f"Converted {len(cytoscape_edges)} edges.")
    
    # Combine into the final structure
    cytoscape_graph_data = {'nodes': cytoscape_nodes, 'edges': cytoscape_edges}
    
    # Visualize the converted structure (first few nodes/edges)
    print("\n--- Sample Cytoscape Node Data (First 2) ---")
    print(json.dumps(cytoscape_graph_data['nodes'][:2], indent=2))
    print("\n--- Sample Cytoscape Edge Data (First 2) ---")
    print(json.dumps(cytoscape_graph_data['edges'][:2], indent=2))
    print("-" * 25)
else:
     print("Skipping data conversion as graph is not valid for visualization.")
     cytoscape_graph_data = {'nodes': [], 'edges': []}

Converting nodes...
Converted 31 nodes.
Converting edges...
Converted 21 edges.

--- Sample Cytoscape Node Data (First 2) ---
[
  {
    "data": {
      "id": "the battle of \u00e7anakkale",
      "label": "the\nbattle\nof\n\u00e7anakkale",
      "degree": 1,
      "size": 25.0,
      "tooltip_text": "Entity: the battle of \u00e7anakkale\nDegree: 1"
    }
  },
  {
    "data": {
      "id": "from february 19, 1915, to january 9, 1916",
      "label": "from\nfebruary\n19,\n1915,\nto\njanuary\n9,\n1916",
      "degree": 1,
      "size": 25.0,
      "tooltip_text": "Entity: from february 19, 1915, to january 9, 1916\nDegree: 1"
    }
  }
]

--- Sample Cytoscape Edge Data (First 2) ---
[
  {
    "data": {
      "id": "edge_0",
      "source": "the battle of \u00e7anakkale",
      "target": "from february 19, 1915, to january 9, 1916",
      "label": "took place",
      "tooltip_text": "Relationship: took place"
    }
  },
  {
    "data": {
      "id": "edge_1",
      "source": "the allied po

In [194]:
if can_visualize:
    print("Creating ipycytoscape widget...")
    cyto_widget = ipycytoscape.CytoscapeWidget()
    print("Widget created.")
    
    print("Loading graph data into widget...")
    cyto_widget.graph.add_graph_from_json(cytoscape_graph_data, directed=True)
    print("Data loaded.")
else:
    print("Skipping widget creation.")
    cyto_widget = None

Creating ipycytoscape widget...
Widget created.
Loading graph data into widget...
Data loaded.


In [195]:
if cyto_widget:
    print("Defining enhanced colorful and interactive visual style...")
    # More vibrant and colorful styling with a modern color scheme
    visual_style = [
        {
            'selector': 'node',
            'style': {
                'label': 'data(label)',
                'width': 'data(size)',
                'height': 'data(size)',
                'background-color': '#3498db',  # Bright blue
                'background-opacity': 0.9,
                'color': '#ffffff',             # White text
                'font-size': '12px',
                'font-weight': 'bold',
                'text-valign': 'center',
                'text-halign': 'center',
                'text-wrap': 'wrap',
                'text-max-width': '100px',
                'text-outline-width': 2,
                'text-outline-color': '#2980b9',  # Matching outline
                'text-outline-opacity': 0.7,
                'border-width': 3,
                'border-color': '#1abc9c',      # Turquoise border
                'border-opacity': 0.9,
                'shape': 'ellipse',
                'transition-property': 'background-color, border-color, border-width, width, height',
                'transition-duration': '0.3s',
                'tooltip-text': 'data(tooltip_text)'
            }
        },
        {
            'selector': 'node:selected',
            'style': {
                'background-color': '#e74c3c',  # Pomegranate red
                'border-width': 4,
                'border-color': '#c0392b',
                'text-outline-color': '#e74c3c',
                'width': 'data(size) * 1.2',    # Enlarge selected nodes
                'height': 'data(size) * 1.2'
            }
        },
        {
            'selector': 'node:hover',
            'style': {
                'background-color': '#9b59b6',  # Purple on hover
                'border-width': 4,
                'border-color': '#8e44ad',
                'cursor': 'pointer',
                'z-index': 999
            }
        },
        {
            'selector': 'edge',
            'style': {
                'label': 'data(label)',
                'width': 2.5,
                'curve-style': 'bezier',
                'line-color': '#2ecc71',         # Green
                'line-opacity': 0.8,
                'target-arrow-color': '#27ae60',
                'target-arrow-shape': 'triangle',
                'arrow-scale': 1.5,
                'font-size': '10px',
                'font-weight': 'normal',
                'color': '#2c3e50',
                'text-background-opacity': 0.9,
                'text-background-color': '#ecf0f1',
                'text-background-shape': 'roundrectangle',
                'text-background-padding': '3px',
                'text-rotation': 'autorotate',
                'edge-text-rotation': 'autorotate',
                'transition-property': 'line-color, width, target-arrow-color',
                'transition-duration': '0.3s',
                'tooltip-text': 'data(tooltip_text)'
            }
        },
        {
            'selector': 'edge:selected',
            'style': {
                'line-color': '#f39c12',         # Yellow-orange
                'target-arrow-color': '#d35400',
                'width': 4,
                'text-background-color': '#f1c40f',
                'color': '#ffffff',               # White text
                'z-index': 998
            }
        },
        {
            'selector': 'edge:hover',
            'style': {
                'line-color': '#e67e22',         # Orange on hover
                'width': 3.5,
                'cursor': 'pointer',
                'target-arrow-color': '#d35400',
                'z-index': 997
            }
        },
        {
            'selector': '.center-node',
            'style': {
                'background-color': '#16a085',    # Teal
                'background-opacity': 1,
                'border-width': 4,
                'border-color': '#1abc9c',        # Turquoise border
                'border-opacity': 1
            }
        }
    ]
    
    print("Setting enhanced visual style on widget...")
    cyto_widget.set_style(visual_style)
    
    # Apply a better animated layout
    cyto_widget.set_layout(name='cose', 
                          nodeRepulsion=5000, 
                          nodeOverlap=40, 
                          idealEdgeLength=120, 
                          edgeElasticity=200, 
                          nestingFactor=6, 
                          gravity=90, 
                          numIter=2500,
                          animate=True,
                          animationDuration=1000,
                          initialTemp=300,
                          coolingFactor=0.95)
    
    # Add a special class to main nodes (Marie Curie)
    if len(cyto_widget.graph.nodes) > 0:
        main_nodes = [node.data['id'] for node in cyto_widget.graph.nodes 
                     if node.data.get('degree', 0) > 10]
        
        # Create gradient styles for center nodes
        for i, node_id in enumerate(main_nodes):
            # Use vibrant colors for center nodes
            center_style = {
                'selector': f'node[id = "{node_id}"]',
                'style': {
                    'background-color': '#9b59b6',   # Purple
                    'background-opacity': 0.95,
                    'border-width': 4,
                    'border-color': '#8e44ad',      # Darker purple border
                    'border-opacity': 1,
                    'text-outline-width': 3,
                    'text-outline-color': '#8e44ad',
                    'font-size': '14px'
                }
            }
            visual_style.append(center_style)
        
        # Update the style with the new additions
        cyto_widget.set_style(visual_style)
    
    print("Enhanced colorful and interactive style applied successfully.")
else:
    print("Skipping style definition.")

Defining enhanced colorful and interactive visual style...
Setting enhanced visual style on widget...
Enhanced colorful and interactive style applied successfully.


In [196]:
if cyto_widget:
    print("Setting layout algorithm ('cose')...")
    # cose (Compound Spring Embedder) is often good for exploring connections
    cyto_widget.set_layout(name='cose', 
                           animate=True, 
                           # Adjust parameters for better spacing/layout
                           nodeRepulsion=4000, # Increase repulsion 
                           nodeOverlap=40,    # Increase overlap avoidance
                           idealEdgeLength=120, # Slightly longer ideal edges
                           edgeElasticity=150, 
                           nestingFactor=5, 
                           gravity=100,        # Increase gravity slightly
                           numIter=1500,      # More iterations
                           initialTemp=200,
                           coolingFactor=0.95,
                           minTemp=1.0)
    print("Layout set. The graph will arrange itself when displayed.")
else:
     print("Skipping layout setting.")

Setting layout algorithm ('cose')...
Layout set. The graph will arrange itself when displayed.


In [197]:
if cyto_widget:
    print("Displaying interactive graph widget below...")
    print("Interact: Zoom (scroll), Pan (drag background), Move Nodes (drag nodes), Hover for details.")
    display(cyto_widget)
else:
    print("No widget to display.")

# Add a clear separator
print("\n" + "-" * 25 + "\nEnd of Visualization Step." + "\n" + "-" * 25)

Displaying interactive graph widget below...
Interact: Zoom (scroll), Pan (drag background), Move Nodes (drag nodes), Hover for details.


CytoscapeWidget(cytoscape_layout={'name': 'cose', 'nodeRepulsion': 4000, 'nodeOverlap': 40, 'idealEdgeLength':…


-------------------------
End of Visualization Step.
-------------------------
