# Exploring the Largest Japanese Lexical Network Graph

This notebook explores the largest NetworkX graph pickle file: `extract_lexical_network_with_graph_japanese_full.pkl` (190 MB).

## Overview
- **File**: `extract_lexical_network_with_graph_japanese_full.pkl`
- **Size**: 190 MB (190.40 MB on disk)
- **Type**: NetworkX `MultiDiGraph` (confirmed in Section 2)
- **Content**: Japanese lexical network with synonyms, antonyms, and semantic relationships

In [2]:
!pip install kiwisolver

In [None]:
# Import required libraries
import pickle
import networkx as nx
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt
# import seaborn as sns
from collections import Counter
import time
import os

# # Set up plotting
# plt.style.use('default')
# sns.set_palette("husl")

# # Display options
# pd.set_option('display.max_columns', None)
# pd.set_option('display.max_rows', 50)
# pd.set_option('display.width', None)


## 1. File Information and Loading

In [7]:
# Check file information
file_path = 'extract_lexical_network_with_graph_japanese_full.pkl'
file_size_mb = os.path.getsize(file_path) / (1024 * 1024)
print(f"File: {file_path}")
print(f"Size: {file_size_mb:.2f} MB")
print(f"File exists: {os.path.exists(file_path)}")

File: extract_lexical_network_with_graph_japanese_full.pkl
Size: 190.40 MB
File exists: True


In [8]:
# Load the graph with progress tracking
print("Loading the large graph...")
start_time = time.time()

try:
    with open(file_path, 'rb') as f:
        G = pickle.load(f)
    
    load_time = time.time() - start_time
    print(f"Graph loaded successfully in {load_time:.2f} seconds")
    
except Exception as e:
    print(f"Error loading graph: {e}")
    G = None

Loading the large graph...
Graph loaded successfully in 10.40 seconds
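
Because `pickle.load` can execute arbitrary code, it should only be used on files from a trusted source. The sketch below is one possible way to wrap the load step so that the graph is read from disk once and its type is checked. The function name `load_graph` and the toy pickle file are hypothetical and used only for illustration; they are not part of the original notebook.

```python
import pickle
import tempfile
from functools import lru_cache
from pathlib import Path

import networkx as nx


@lru_cache(maxsize=1)
def load_graph(path):
    """Load a pickled graph once; repeated calls with the same path reuse it."""
    with open(path, "rb") as f:
        graph = pickle.load(f)  # only safe for trusted files
    if not isinstance(graph, nx.Graph):
        raise TypeError(f"Expected a NetworkX graph, got {type(graph).__name__}")
    return graph


# Demonstration on a tiny stand-in graph (hypothetical data, not the real file).
toy = nx.MultiDiGraph()
toy.add_edge("物", "品", relation_type="SYNONYM")

with tempfile.TemporaryDirectory() as tmp:
    toy_path = str(Path(tmp) / "toy_graph.pkl")
    with open(toy_path, "wb") as f:
        pickle.dump(toy, f)
    first = load_graph(toy_path)   # reads the file
    second = load_graph(toy_path)  # served from the cache

print(type(first).__name__, first is second)
```

With a 10-second load time, caching the result this way avoids paying that cost again if a cell is re-run.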


## 2. Basic Graph Information

In [9]:
if G is not None:
    print("=== GRAPH BASIC INFORMATION ===")
    print(f"Graph type: {type(G)}")
    print(f"Number of nodes: {G.number_of_nodes():,}")
    print(f"Number of edges: {G.number_of_edges():,}")
    print(f"Is directed: {G.is_directed()}")
    print(f"Is multigraph: {G.is_multigraph()}")
    print(f"Graph density: {nx.density(G):.6f}")
    
    # Check if graph is connected
    if G.is_directed():
        print(f"Number of weakly connected components: {nx.number_weakly_connected_components(G)}")
        print(f"Number of strongly connected components: {nx.number_strongly_connected_components(G)}")
    else:
        print(f"Number of connected components: {nx.number_connected_components(G)}")

=== GRAPH BASIC INFORMATION ===
Graph type: <class 'networkx.classes.multidigraph.MultiDiGraph'>
Number of nodes: 264,306
Number of edges: 1,132,834
Is directed: True
Is multigraph: True
Graph density: 0.000016
Number of weakly connected components: 96935
Number of strongly connected components: 152187
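
The density figure can be checked by hand. For a directed graph, NetworkX divides the edge count by the number of ordered node pairs, n(n-1), and parallel edges each count toward the total. The sketch below recomputes the value from the counts printed above and confirms the formula on a small toy graph (the toy graph is hypothetical data).

```python
import networkx as nx

# Counts copied from the output above.
n_nodes = 264_306
n_edges = 1_132_834

# Directed density: edges divided by the number of ordered node pairs.
manual_density = n_edges / (n_nodes * (n_nodes - 1))
print(f"Manual density: {manual_density:.6f}")

# Toy MultiDiGraph to confirm that nx.density follows the same formula.
toy = nx.MultiDiGraph()
toy.add_edges_from([("a", "b"), ("a", "b"), ("b", "c")])  # one parallel pair
n = toy.number_of_nodes()
toy_manual = toy.number_of_edges() / (n * (n - 1))
print(nx.density(toy), toy_manual)
```

Because parallel edges are counted, the density of a multigraph can exceed 1, so it is best read here as a measure of sparsity rather than a probability.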


## 3. Node Analysis

In [10]:
if G is not None:
    print("=== NODE ANALYSIS ===")
    
    # Get sample nodes
    sample_nodes = list(G.nodes())[:10]
    print(f"Sample nodes: {sample_nodes}")
    
    # Analyze node attributes
    node_attrs = {}
    for node in sample_nodes:
        node_attrs[node] = dict(G.nodes[node])
    
    print("\nSample node attributes:")
    for node, attrs in node_attrs.items():
        print(f"{node}: {attrs}")
    
    # Get all unique attribute keys
    all_attr_keys = set()
    for node in G.nodes():
        all_attr_keys.update(G.nodes[node].keys())
    
    print(f"\nAll node attribute keys: {sorted(all_attr_keys)}")

=== NODE ANALYSIS ===
Sample nodes: ['物', '品', '物と品', '物品', '物体', '物と物体', '物理', '物と物品', '商業', '対象']

Sample node attributes:
物: {'kanji': '物', 'hiragana': 'もの', 'translation': 'thing'}
品: {'kanji': '品', 'hiragana': 'しな', 'translation': 'item'}
物と品: {'kanji': '物と品', 'hiragana': 'ものとしな', 'translation': 'thing and item'}
物品: {'kanji': '物品', 'hiragana': 'ぶっぴん', 'translation': 'goods'}
物体: {'kanji': '物体', 'hiragana': 'ぶったい', 'translation': 'body'}
物と物体: {'kanji': '物と物体', 'hiragana': 'ものとぶったい', 'translation': 'thing and body'}
物理: {'kanji': '物理', 'hiragana': 'ぶつり', 'translation': 'physics'}
物と物品: {'kanji': '物と物品', 'hiragana': 'ものとぶっぴん', 'translation': 'thing and goods'}
商業: {'kanji': '商業', 'hiragana': 'しょうぎょう', 'translation': 'commerce'}
対象: {'kanji': '対象', 'hiragana': 'たいしょう', 'translation': 'target'}

All node attribute keys: ['hiragana', 'kanji', 'translation']
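
Since every node carries the same three keys, the attributes fit naturally into a table. The following is a minimal sketch, assuming pandas is available, that tabulates node attributes and looks a word up by its reading. It runs on a toy graph built from three of the records shown above; on the real data, `toy` would be replaced by `G`.

```python
import networkx as nx
import pandas as pd

# Toy stand-in for G, using three node records copied from the output above.
toy = nx.MultiDiGraph()
toy.add_node("物", kanji="物", hiragana="もの", translation="thing")
toy.add_node("品", kanji="品", hiragana="しな", translation="item")
toy.add_node("物品", kanji="物品", hiragana="ぶっぴん", translation="goods")

# One row per node, one column per attribute key.
node_df = pd.DataFrame.from_dict(dict(toy.nodes(data=True)), orient="index")

# Nodes that lack an attribute show up as NaN in that column.
missing_per_column = node_df.isna().sum()

# Look up nodes by their hiragana reading.
matches = node_df[node_df["hiragana"] == "もの"]

print(node_df)
print(missing_per_column)
print(matches.index.tolist())
```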


In [11]:
if G is not None:
    # Degree distribution
    degrees = [d for n, d in G.degree()]
    
    print("=== DEGREE DISTRIBUTION ===")
    print(f"Average degree: {np.mean(degrees):.2f}")
    print(f"Median degree: {np.median(degrees):.2f}")
    print(f"Max degree: {max(degrees)}")
    print(f"Min degree: {min(degrees)}")
    
    # Top nodes by degree
    top_nodes = sorted(G.degree(), key=lambda x: x[1], reverse=True)[:20]
    print("\nTop 20 nodes by degree:")
    for node, degree in top_nodes:
        print(f"{node}: {degree}")

=== DEGREE DISTRIBUTION ===
Average degree: 8.57
Median degree: 1.00
Max degree: 4874
Min degree: 0

Top 20 nodes by degree:
社会: 4874
文化: 4864
生物: 3307
教育: 3179
経済: 2742
環境: 2485
技術: 2388
感情: 2213
自然: 1988
構造: 1931
生態系: 1864
情報: 1773
法律: 1740
文書: 1733
組織: 1713
芸術: 1711
地域: 1703
ビジネス: 1697
表現: 1614
地理: 1589
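
In a `MultiDiGraph`, `degree()` sums incoming and outgoing edges and counts every parallel edge, so a high degree does not necessarily mean many distinct neighbours. The sketch below separates these quantities on a toy graph (the edges are hypothetical; only the node names come from the output above).

```python
import networkx as nx

# Toy stand-in for G.
toy = nx.MultiDiGraph()
toy.add_edge("物", "品", relation_type="SYNONYM")
toy.add_edge("物", "物品", relation_type="SYNONYM")
toy.add_edge("物", "物品", relation_type="HYPONYM")  # parallel edge
toy.add_node("商業")  # node with no edges

in_deg = dict(toy.in_degree())
out_deg = dict(toy.out_degree())

# Nodes with degree 0, matching the "Min degree: 0" case above.
isolated = list(nx.isolates(toy))

# Distinct neighbours ignore edge direction and parallel edges.
distinct_neighbors = {
    node: len(set(toy.successors(node)) | set(toy.predecessors(node)))
    for node in toy
}

print(toy.degree("物"), out_deg["物"], in_deg["物"], distinct_neighbors["物"])
print(isolated)
```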


## 4. Edge Analysis

In [12]:
if G is not None:
    print("=== EDGE ANALYSIS ===")
    
    # Get sample edges
    sample_edges = list(G.edges(data=True))[:10]
    print("Sample edges:")
    for u, v, data in sample_edges:
        print(f"{u} -> {v}: {data}")
    
    # Analyze edge attributes
    all_edge_attr_keys = set()
    for u, v, data in G.edges(data=True):
        all_edge_attr_keys.update(data.keys())
    
    print(f"\nAll edge attribute keys: {sorted(all_edge_attr_keys)}")

=== EDGE ANALYSIS ===
Sample edges:
物 -> 品: {'relation_type': 'SYNONYM', 'synonymity_strength': 0.9, 'relation_type_strength': 0.9, 'relation_explanation': 'Both words refer to objects or items, often used interchangeably in certain contexts.'}
物 -> 物品: {'relation_type': 'HAS_DOMAIN', 'relation_type_strength': 0.8, 'relation_explanation': 'Both words belong to the category of physical objects or goods.'}
物 -> 物品: {'relation_type': 'SYNONYM', 'synonymity_strength': 0.9, 'relation_type_strength': 0.9, 'relation_explanation': 'Both words refer to tangible items or products, often used in commerce.'}
物 -> 物品: {'relation_type': 'HYPONYM', 'synonymity_strength': 0.7, 'relation_type_strength': 0.85, 'relation_explanation': '物品 refers to tangible items or goods, which are specific instances of 物.'}
物 -> 物品: {'relation_type': 'CO-HYPONYM', 'synonymity_strength': 0.7, 'relation_type_strength': 0.8, 'relation_explanation': "Both words refer to tangible items, with '物品' being a specific type of '物
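
The sample output shows several edges between 物 and 物品, each with a different `relation_type`. A `MultiDiGraph` stores such parallel edges under separate keys. The sketch below, built on a toy graph with attribute values copied from the output above, shows one way to inspect all relations for a single node pair.

```python
import networkx as nx

# Toy stand-in for G with three parallel edges between the same pair.
toy = nx.MultiDiGraph()
toy.add_edge("物", "物品", relation_type="HAS_DOMAIN", relation_type_strength=0.8)
toy.add_edge("物", "物品", relation_type="SYNONYM",
             synonymity_strength=0.9, relation_type_strength=0.9)
toy.add_edge("物", "物品", relation_type="HYPONYM",
             synonymity_strength=0.7, relation_type_strength=0.85)

# get_edge_data on a multigraph returns {edge_key: attribute_dict}.
parallel = toy.get_edge_data("物", "物品")
for key, attrs in parallel.items():
    # Not every edge has synonymity_strength, so use .get().
    print(key, attrs["relation_type"], attrs.get("synonymity_strength"))

relation_types = [attrs["relation_type"] for attrs in parallel.values()]
n_parallel = toy.number_of_edges("物", "物品")
```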

In [13]:
if G is not None:
    # The sample edges above store the relation label under 'relation_type'
    # (not 'relation' or 'type'), so count that key.
    edge_types = Counter(
        data['relation_type']
        for u, v, data in G.edges(data=True)
        if 'relation_type' in data
    )

    print("=== EDGE TYPES ===")
    for edge_type, count in edge_types.most_common():
        print(f"{edge_type}: {count:,}")
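
Once the relation label is known to live under `relation_type`, edges can be filtered by type and strength. The following is a minimal sketch that keeps only strong synonym edges; the 0.8 threshold is an arbitrary assumption, and the toy edges reuse attribute values from the sample output.

```python
import networkx as nx

# Toy stand-in for G.
toy = nx.MultiDiGraph()
toy.add_edge("物", "品", relation_type="SYNONYM",
             synonymity_strength=0.9, relation_type_strength=0.9)
toy.add_edge("物", "物品", relation_type="HAS_DOMAIN", relation_type_strength=0.8)
toy.add_edge("物", "物品", relation_type="SYNONYM",
             synonymity_strength=0.9, relation_type_strength=0.9)
toy.add_edge("物", "物品", relation_type="HYPONYM",
             synonymity_strength=0.7, relation_type_strength=0.85)

MIN_STRENGTH = 0.8  # assumed cut-off, tune as needed

# Collect (u, v, key) triples so parallel edges are selected individually.
strong_synonyms = [
    (u, v, key)
    for u, v, key, attrs in toy.edges(keys=True, data=True)
    if attrs.get("relation_type") == "SYNONYM"
    and attrs.get("synonymity_strength", 0.0) >= MIN_STRENGTH
]

# edge_subgraph returns a view; copy() makes an independent graph.
synonym_graph = toy.edge_subgraph(strong_synonyms).copy()
print(synonym_graph.number_of_nodes(), synonym_graph.number_of_edges())
```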

## 5. Graph Structure Analysis

In [14]:
if G is not None:
    print("=== GRAPH STRUCTURE ANALYSIS ===")
    
    # Component analysis
    if G.is_directed():
        components = list(nx.weakly_connected_components(G))
    else:
        components = list(nx.connected_components(G))
    
    component_sizes = [len(c) for c in components]
    component_sizes.sort(reverse=True)
    
    print(f"Number of components: {len(components)}")
    print(f"Largest component size: {component_sizes[0]:,}")
    print(f"Second largest component size: {component_sizes[1] if len(component_sizes) > 1 else 0:,}")
    print(f"Average component size: {np.mean(component_sizes):.2f}")
    
    # Show top 10 component sizes
    print("\nTop 10 component sizes:")
    for i, size in enumerate(component_sizes[:10]):
        print(f"{i+1}. {size:,}")

=== GRAPH STRUCTURE ANALYSIS ===
Number of components: 96935
Largest component size: 167,372
Second largest component size: 1
Average component size: 2.73

Top 10 component sizes:
1. 167,372
2. 1
3. 1
4. 1
5. 1
6. 1
7. 1
8. 1
9. 1
10. 1
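
These numbers describe one giant component plus singletons: 167,372 + 96,934 = 264,306, so every component other than the largest contains exactly one node. For most analyses it is therefore reasonable to work on the giant component alone. The sketch below shows one way to extract it, demonstrated on a hypothetical toy graph.

```python
import networkx as nx

# Toy stand-in for G: one three-node component plus two isolated nodes.
toy = nx.MultiDiGraph()
toy.add_edges_from([("a", "b"), ("b", "c"), ("c", "a")])
toy.add_nodes_from(["x", "y"])

components = sorted(nx.weakly_connected_components(toy), key=len, reverse=True)

# Independent copy of the largest weakly connected component.
giant = toy.subgraph(components[0]).copy()

# Nodes that form a component on their own.
singletons = sorted(next(iter(c)) for c in components if len(c) == 1)

coverage = giant.number_of_nodes() / toy.number_of_nodes()
print(f"Giant component covers {coverage:.1%} of nodes; singletons: {singletons}")
```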


## 6. Japanese Language Specific Analysis

In [15]:
if G is not None:
    print("=== JAPANESE LANGUAGE ANALYSIS ===")
    
    # Analyze Japanese characters in node names
    japanese_chars = []
    kanji_chars = []
    hiragana_chars = []
    katakana_chars = []
    
    for node in list(G.nodes())[:1000]:  # Sample first 1000 nodes
        if isinstance(node, str):
            for char in node:
                if '\u4e00' <= char <= '\u9fff':  # Kanji
                    kanji_chars.append(char)
                elif '\u3040' <= char <= '\u309f':  # Hiragana
                    hiragana_chars.append(char)
                elif '\u30a0' <= char <= '\u30ff':  # Katakana
                    katakana_chars.append(char)
    
    print(f"Sample analysis of first 1000 nodes:")
    print(f"Kanji characters found: {len(set(kanji_chars))}")
    print(f"Hiragana characters found: {len(set(hiragana_chars))}")
    print(f"Katakana characters found: {len(set(katakana_chars))}")
    
    # Show some examples
    print(f"\nSample kanji characters: {list(set(kanji_chars))[:20]}")
    print(f"Sample hiragana characters: {list(set(hiragana_chars))[:20]}")
    print(f"Sample katakana characters: {list(set(katakana_chars))[:20]}")

=== JAPANESE LANGUAGE ANALYSIS ===
Sample analysis of first 1000 nodes:
Kanji characters found: 447
Hiragana characters found: 44
Katakana characters found: 64

Sample kanji characters: ['皆', '標', '治', '柄', '微', '離', '観', '住', '四', '資', '史', '自', '更', '取', '度', '日', '企', '複', '節', '繰']
Sample hiragana characters: ['よ', 'び', 'け', 'く', 'り', 'わ', 'の', 'こ', 'と', 'ん', 'て', 'ら', 'あ', 'ゆ', 'も', 'む', 'に', 'れ', 'げ', 'を']
Sample katakana characters: ['ハ', 'サ', 'ィ', 'デ', 'ロ', 'メ', 'テ', 'ム', 'チ', 'シ', 'ン', 'ビ', 'ナ', 'ッ', 'ケ', 'ズ', 'ポ', 'ツ', 'フ', 'ル']
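
The same Unicode ranges can also classify whole words rather than individual characters, which makes it possible to count how many node names are pure kanji, pure kana, or mixed. The helper names below (`classify_char`, `word_script`) are hypothetical and the sketch is illustrative only.

```python
from collections import Counter


def classify_char(char):
    """Return the script of one character, using the ranges from the cell above."""
    if "\u4e00" <= char <= "\u9fff":
        return "kanji"
    if "\u3040" <= char <= "\u309f":
        return "hiragana"
    if "\u30a0" <= char <= "\u30ff":
        return "katakana"
    return "other"


def word_script(word):
    """Label a word by its single script, or 'mixed' if several scripts occur."""
    scripts = {classify_char(c) for c in word}
    if not scripts:
        return "empty"
    return scripts.pop() if len(scripts) == 1 else "mixed"


# Node names taken from the outputs above, plus one hiragana reading.
sample_words = ["物", "物と品", "ビジネス", "もの"]
script_counts = Counter(word_script(w) for w in sample_words)
print(script_counts)
```

Note that the katakana block used here also contains the long-vowel mark ー, so loanwords written with it are still classified as katakana.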


## 7. Memory Usage and Performance

In [16]:
if G is not None:
    print("=== MEMORY AND PERFORMANCE ANALYSIS ===")
    
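    # Estimate memory usage. Note that sys.getsizeof is shallow: it measures
    # only the object passed in, not the objects that object refers to.
    import sys
    graph_size = sys.getsizeof(G)  # size of the Graph instance itself only
    
    # Rough extrapolation from the first 100 nodes and edges. The edge figure
    # covers only the (u, v) tuples, not the edge attribute dictionaries.
    node_memory = sum(sys.getsizeof(node) for node in list(G.nodes())[:100]) * G.number_of_nodes() / 100
    edge_memory = sum(sys.getsizeof(edge) for edge in list(G.edges())[:100]) * G.number_of_edges() / 100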
    
    print(f"Graph object size: {graph_size / (1024*1024):.2f} MB")
    print(f"Estimated nodes memory: {node_memory / (1024*1024):.2f} MB")
    print(f"Estimated edges memory: {edge_memory / (1024*1024):.2f} MB")
    print(f"Total estimated memory: {(graph_size + node_memory + edge_memory) / (1024*1024):.2f} MB")

=== MEMORY AND PERFORMANCE ANALYSIS ===
Graph object size: 0.00 MB
Estimated nodes memory: 20.11 MB
Estimated edges memory: 60.50 MB
Total estimated memory: 80.61 MB
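
These figures are likely a substantial underestimate. `sys.getsizeof` is shallow, which is why the graph object reports 0.00 MB, and the edge estimate works out to roughly 56 bytes per edge, which appears to be just the size of a bare two-element tuple on 64-bit CPython; the attribute dictionaries and explanation strings are not counted. One possible alternative is a recursive size estimate, sketched below. `deep_getsizeof` is a hypothetical helper, it only descends into built-in containers, and on the full graph it may take a long time to run.

```python
import sys


def deep_getsizeof(obj, seen=None):
    """Recursively sum sys.getsizeof over built-in containers.

    Each object is counted once, tracked by id(), so shared strings or
    dictionaries are not double-counted.
    """
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))

    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(
            deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
            for k, v in obj.items()
        )
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size


# Small demonstration on plain Python data.
nested = {"a": [1, 2, 3], "b": {"c": "text"}}
shared = "x" * 1000
pair = [shared, shared]  # the same string object referenced twice

print(sys.getsizeof(nested), deep_getsizeof(nested))
print(deep_getsizeof(pair))
```

For a graph, passing the instance dictionary, for example `deep_getsizeof(vars(G))`, would walk the internal node and adjacency dictionaries; this relies on NetworkX storing its data in plain dicts and should be treated as an approximation.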


## 8. Summary and Recommendations

In [17]:
if G is not None:
    print("=== SUMMARY ===")
    print(f"✅ Successfully loaded graph with {G.number_of_nodes():,} nodes and {G.number_of_edges():,} edges")
    print(f"✅ Graph type: {type(G).__name__}")
    print(f"✅ File size: {file_size_mb:.2f} MB")
    
    print("\n=== RECOMMENDATIONS ===")
    print("1. For visualization: Use subgraphs or sample the graph")
    print("2. For analysis: Consider using NetworkX's built-in algorithms")
    print("3. For memory efficiency: Load only when needed")
    print("4. For performance: Use parallel processing for large computations")
    
    print("\n=== NEXT STEPS ===")
    print("1. Create visualizations of subgraphs")
    print("2. Analyze specific Japanese lexical relationships")
    print("3. Export specific subgraphs for further analysis")
    print("4. Compare with other graph versions in the project")

=== SUMMARY ===
✅ Successfully loaded graph with 264,306 nodes and 1,132,834 edges
✅ Graph type: MultiDiGraph
✅ File size: 190.40 MB

=== RECOMMENDATIONS ===
1. For visualization: Use subgraphs or sample the graph
2. For analysis: Consider using NetworkX's built-in algorithms
3. For memory efficiency: Load only when needed
4. For performance: Use parallel processing for large computations

=== NEXT STEPS ===
1. Create visualizations of subgraphs
2. Analyze specific Japanese lexical relationships
3. Export specific subgraphs for further analysis
4. Compare with other graph versions in the project
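
As a starting point for the first recommendation, a neighbourhood around a single word is usually small enough to plot or export. The sketch below uses `nx.ego_graph` on a toy graph; the edges are hypothetical, and on the real data `toy` would be replaced by `G` with any node of interest as the centre.

```python
import networkx as nx

# Toy stand-in for G: a short chain branching out from 物.
toy = nx.MultiDiGraph()
toy.add_edge("物", "品", relation_type="SYNONYM")
toy.add_edge("物", "物品", relation_type="SYNONYM")
toy.add_edge("物品", "商業", relation_type="HAS_DOMAIN")
toy.add_edge("商業", "対象", relation_type="HAS_DOMAIN")

# All nodes within one hop of 物, ignoring edge direction when measuring distance.
ego = nx.ego_graph(toy, "物", radius=1, undirected=True)

print(sorted(ego.nodes()))
print(ego.number_of_edges())
```

Hub nodes such as 社会 have thousands of edges, so even a radius of 1 can be large; filtering by `relation_type` first keeps the result manageable.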
