# Complete Workflow: From Text to Network Contagion Analysis

This notebook demonstrates a full end-to-end workflow using the Network Inference Toolkit:

1. **Data Loading & Preparation** - Load and clean text data
2. **Network Building** - Build semantic network from co-occurrence
3. **Network Analysis** - Analyze network structure and communities
4. **Contagion Simulation** - Model information spread on the network
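5. **Visualization** - Visualize networks and simulation results
6. **Parameter Inference** - Estimate contagion parameters from an observed cascade size
7. **Model Comparison** - Compare SI, SIS, and SIR dynamics on the same network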

**Dataset**: We'll use sample political discussion data, but this workflow applies to any text corpus.

In [None]:
import pandas as pd
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
from pathlib import Path

# Suppress warnings for cleaner output
import warnings
warnings.filterwarnings('ignore')

print("✓ Imports successful")

## 1. Data Loading & Preparation

First, we'll build a small sample corpus in memory and inspect its structure.

In [None]:
# Create sample data if needed
sample_texts = [
    "Climate change requires urgent policy action and international cooperation",
    "Economic policy should focus on job creation and sustainable growth",
    "Healthcare reform needs bipartisan support to improve access",
    "Climate policy and environmental protection are critical priorities",
    "International trade policy affects domestic job markets significantly",
    "Sustainable development requires balancing economic and environmental goals",
    "Healthcare access remains a pressing policy challenge nationwide",
    "Job creation through infrastructure investment boosts economic growth",
    "Environmental regulations impact both climate and economic policy",
    "International cooperation on climate change shows mixed results",
] * 10  # Repeat to have more data

# Create DataFrame
df = pd.DataFrame({
    'text': sample_texts,
    'id': range(len(sample_texts)),
    'timestamp': pd.date_range('2024-01-01', periods=len(sample_texts), freq='h')  # 'H' is deprecated since pandas 2.2
})

print(f"Loaded {len(df)} documents")
print(f"\nSample data:")
df.head(3)

In [None]:
# Basic data quality checks
print("Data Quality Summary:")
print(f"  Missing values: {df['text'].isna().sum()}")
print(f"  Empty strings: {(df['text'].str.len() == 0).sum()}")
print(f"  Avg text length: {df['text'].str.len().mean():.1f} characters")
print(f"  Date range: {df['timestamp'].min()} to {df['timestamp'].max()}")

## 2. Network Building

We'll build a semantic network based on term co-occurrence patterns.
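
To make the idea of term co-occurrence concrete, here is a minimal sketch in plain Python. It is not the toolkit's implementation: the whitespace tokenizer, the document-level co-occurrence window, and the function name `cooccurrence_edges` are assumptions chosen for illustration, and `min_df` is read here as a minimum document frequency. Any pruning the CLI performs (for example via `--topk`) is not reproduced.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(texts, min_df=2):
    """Count document-level term co-occurrence (illustrative only)."""
    # Naive tokenization: lowercase, split on whitespace, one term set per document
    docs = [set(text.lower().split()) for text in texts]

    # Document frequency: number of documents each term appears in
    doc_freq = Counter(term for doc in docs for term in doc)
    vocab = {term for term, n in doc_freq.items() if n >= min_df}

    # Every unordered pair of kept terms sharing a document adds 1 to that edge
    weights = Counter()
    for doc in docs:
        kept = sorted(doc & vocab)
        weights.update(combinations(kept, 2))
    return doc_freq, weights


texts = [
    "climate policy requires cooperation",
    "economic policy drives job growth",
    "climate policy and economic goals",
]
doc_freq, weights = cooccurrence_edges(texts, min_df=2)
for (a, b), w in sorted(weights.items()):
    print(f"{a} -- {b}: {w}")
```

Terms that appear in fewer than `min_df` documents never become nodes, which is why rare words drop out of the network entirely.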

In [None]:
# Save data to temporary file for CLI processing
import tempfile
import os

temp_dir = tempfile.mkdtemp()
input_file = os.path.join(temp_dir, 'input.csv')
output_dir = os.path.join(temp_dir, 'output')

df.to_csv(input_file, index=False)
print(f"Data saved to: {input_file}")
print(f"Output directory: {output_dir}")

In [None]:
# Build semantic network using CLI
!python -m src.semantic.build_semantic_network \
    --input {input_file} \
    --outdir {output_dir} \
    --min-df 2 \
    --topk 10 \
    --output-format csv

In [None]:
# Load network results
nodes_df = pd.read_csv(os.path.join(output_dir, 'nodes.csv'))
edges_df = pd.read_csv(os.path.join(output_dir, 'edges.csv'))

print(f"Network Statistics:")
print(f"  Nodes (terms): {len(nodes_df)}")
print(f"  Edges (co-occurrences): {len(edges_df)}")
print(f"  Avg degree: {2 * len(edges_df) / len(nodes_df):.2f}")

print(f"\nTop 10 terms by frequency:")
nodes_df.nlargest(10, 'count')[['token', 'count']]

In [None]:
# Inspect edge weights
print("Edge weight distribution:")
print(edges_df['weight'].describe())

print(f"\nTop 10 strongest edges:")
edges_df.nlargest(10, 'weight')[['src', 'dst', 'weight']]

## 3. Network Analysis

Analyze the network structure and detect communities.
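
Once communities are detected, modularity is one common way to judge how pronounced the community structure is. The sketch below computes Newman modularity directly from a weighted edge list so the formula is visible; the function name `modularity` and the toy graph are illustrative, and self-loops are not handled. NetworkX offers `community.modularity(G, communities)` for the same quantity on a graph object.

```python
def modularity(edges, communities):
    """Newman modularity Q for an undirected edge list [(u, v, weight), ...]."""
    node_to_comm = {n: i for i, comm in enumerate(communities) for n in comm}
    total = sum(w for _, _, w in edges)      # m: total edge weight
    internal = [0.0] * len(communities)      # edge weight inside each community
    degree = [0.0] * len(communities)        # summed weighted degree per community

    for u, v, w in edges:
        cu, cv = node_to_comm[u], node_to_comm[v]
        degree[cu] += w
        degree[cv] += w
        if cu == cv:
            internal[cu] += w

    # Q = sum over communities of (internal share - expected share under random wiring)
    return sum(l / total - (d / (2 * total)) ** 2 for l, d in zip(internal, degree))


# Toy graph: two triangles joined by a single bridge edge (2, 3)
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0),
         (2, 3, 1.0)]
q_split = modularity(edges, [{0, 1, 2}, {3, 4, 5}])
q_single = modularity(edges, [{0, 1, 2, 3, 4, 5}])
print(f"two communities: {q_split:.3f}, one community: {q_single:.3f}")
```

Values near zero mean the partition is no better than random wiring; higher values indicate denser connections inside communities than between them.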

In [None]:
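# Load network into NetworkX
G = nx.Graph()

# Add nodes with attributes (zipping columns keeps each column's own dtype)
for node_id, token, count in zip(nodes_df['id'], nodes_df['token'], nodes_df['count']):
    G.add_node(node_id, token=token, count=count)

# Add edges with weights. iterrows() is avoided here because it coerces each
# row to a single dtype, which would turn integer src/dst ids into floats
# alongside the float weight column.
for src, dst, weight in zip(edges_df['src'], edges_df['dst'], edges_df['weight']):
    G.add_edge(src, dst, weight=weight)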

print(f"NetworkX Graph:")
print(f"  Nodes: {G.number_of_nodes()}")
print(f"  Edges: {G.number_of_edges()}")
print(f"  Connected: {nx.is_connected(G)}")

In [None]:
# Compute network metrics
print("Network Metrics:")
print(f"  Density: {nx.density(G):.4f}")
print(f"  Average clustering: {nx.average_clustering(G):.4f}")

if nx.is_connected(G):
    print(f"  Average path length: {nx.average_shortest_path_length(G):.2f}")
    print(f"  Diameter: {nx.diameter(G)}")
else:
    print(f"  Connected components: {nx.number_connected_components(G)}")
    largest_cc = max(nx.connected_components(G), key=len)
    print(f"  Largest component size: {len(largest_cc)}")

In [None]:
# Detect communities using Louvain
from networkx.algorithms import community

communities = community.louvain_communities(G, seed=42)
print(f"\nCommunity Detection:")
print(f"  Number of communities: {len(communities)}")
print(f"  Community sizes: {[len(c) for c in communities]}")

# Assign community labels
node_to_comm = {}
for i, comm in enumerate(communities):
    for node in comm:
        node_to_comm[node] = i

# Show community memberships
for i, comm in enumerate(communities):
    tokens = [G.nodes[n]['token'] for n in comm]
    print(f"\nCommunity {i}: {', '.join(tokens[:10])}{'...' if len(tokens) > 10 else ''}")

## 4. Contagion Simulation

Simulate information spread on the semantic network.
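
For intuition about what an SI run does, the following is a minimal discrete-time sketch in plain Python. It assumes a synchronous update in which every infected node gets one transmission attempt per susceptible neighbor per timestep; the function name `simulate_si` and the adjacency-dict input are illustrative choices, and the toolkit's own update rule may differ.

```python
import random


def simulate_si(adjacency, beta, timesteps, initial_frac, seed=None):
    """Discrete-time SI process on an undirected graph {node: set(neighbors)}."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    n_seeds = max(1, round(initial_frac * len(nodes)))
    infected = set(rng.sample(nodes, n_seeds))
    history = [len(infected)]

    for _ in range(timesteps):
        newly_infected = set()
        for node in sorted(infected):
            for neighbor in sorted(adjacency[node]):
                # Each infected-susceptible contact transmits with probability beta
                if neighbor not in infected and rng.random() < beta:
                    newly_infected.add(neighbor)
        # Apply all new infections at once (synchronous update)
        infected |= newly_infected
        history.append(len(infected))
    return history


# Toy graph: a ring of 20 nodes
ring = {i: {(i - 1) % 20, (i + 1) % 20} for i in range(20)}
history = simulate_si(ring, beta=0.1, timesteps=50, initial_frac=0.05, seed=42)
print(f"initial infected: {history[0]}, final infected: {history[-1]}")
```

An adjacency dict like this can be obtained from a NetworkX graph with `nx.to_dict_of_lists(G)`. Because nodes never recover in SI, the infected count can only stay flat or grow.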

In [None]:
# Run SI model simulation using CLI
sim_output = os.path.join(temp_dir, 'simulation')

!python -m src.contagion.cli \
    {os.path.join(output_dir, 'edges.csv')} \
    --model si \
    --beta 0.1 \
    --timesteps 50 \
    --initial-frac 0.05 \
    --seed 42 \
    --output-path {sim_output} \
    --output-format csv

In [None]:
# Load simulation results
sim_results = pd.read_csv(sim_output + '.csv')

print("Simulation Summary:")
print(f"  Total timesteps: {len(sim_results)}")
print(f"  Initial infected: {sim_results.iloc[0]['infected']}")
print(f"  Final infected: {sim_results.iloc[-1]['infected']}")
print(f"  Adoption rate: {sim_results.iloc[-1]['infected'] / len(nodes_df):.2%}")

sim_results.head()

In [None]:
# Visualize contagion spread over time
plt.figure(figsize=(10, 6))
plt.plot(sim_results['timestep'], sim_results['infected'], linewidth=2)
plt.xlabel('Timestep', fontsize=12)
plt.ylabel('Number of Infected Nodes', fontsize=12)
plt.title('SI Model: Information Spread on Semantic Network', fontsize=14)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nInterpretation: The curve shows how information spreads through the semantic network.")
print(f"With beta=0.1, the infection reaches {sim_results.iloc[-1]['infected']} out of {len(nodes_df)} terms.")

## 5. Visualization

Visualize the network structure and highlight important nodes.

In [None]:
# Calculate node centralities
degree_cent = nx.degree_centrality(G)
betweenness_cent = nx.betweenness_centrality(G)
eigenvector_cent = nx.eigenvector_centrality(G, max_iter=1000)

# Create centrality DataFrame
centrality_df = pd.DataFrame({
    'node': list(G.nodes()),
    'token': [G.nodes[n]['token'] for n in G.nodes()],
    'degree': [degree_cent[n] for n in G.nodes()],
    'betweenness': [betweenness_cent[n] for n in G.nodes()],
    'eigenvector': [eigenvector_cent[n] for n in G.nodes()]
})

print("Top 10 most central terms (by degree centrality):")
centrality_df.nlargest(10, 'degree')[['token', 'degree', 'betweenness', 'eigenvector']]

In [None]:
# Visualize network
plt.figure(figsize=(12, 10))

# Layout
pos = nx.spring_layout(G, k=0.5, iterations=50, seed=42)

# Node sizes based on degree
node_sizes = [300 * degree_cent[n] for n in G.nodes()]

# Node colors based on community
node_colors = [node_to_comm.get(n, 0) for n in G.nodes()]

# Draw network
nx.draw_networkx_nodes(G, pos, node_size=node_sizes, node_color=node_colors, 
                       cmap=plt.cm.Set3, alpha=0.8)
nx.draw_networkx_edges(G, pos, alpha=0.2, width=0.5)

# Label top nodes
top_nodes = centrality_df.nlargest(10, 'degree')['node'].values
labels = {n: G.nodes[n]['token'] for n in top_nodes}
nx.draw_networkx_labels(G, pos, labels, font_size=10, font_weight='bold')

plt.title('Semantic Network with Communities', fontsize=14)
plt.axis('off')
plt.tight_layout()
plt.show()

print("\nVisualization: Node size = degree centrality, color = community")

## 6. Advanced Analysis: Parameter Inference

Infer the contagion parameter (here, `beta`) that best reproduces an observed cascade size.
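
The search can be pictured as a simple grid search: simulate at several candidate `beta` values and keep the one whose final size is closest to the observation. The sketch below is an assumption-laden illustration, not the toolkit's algorithm: it uses an evenly spaced grid, averages several SI runs per candidate, and scores each with the negative absolute error. The names `final_size_si` and `infer_beta` are hypothetical.

```python
import random


def final_size_si(adjacency, beta, timesteps, n_seeds, rng):
    """Run one SI simulation and return the final number of infected nodes."""
    infected = set(rng.sample(sorted(adjacency), n_seeds))
    for _ in range(timesteps):
        newly_infected = {
            nbr
            for node in sorted(infected)
            for nbr in sorted(adjacency[node])
            if nbr not in infected and rng.random() < beta
        }
        infected |= newly_infected
    return len(infected)


def infer_beta(adjacency, observed_size, beta_min, beta_max, n_samples,
               timesteps=50, n_seeds=1, runs=20, seed=42):
    """Grid-search beta; score = -|mean simulated final size - observed size|."""
    rng = random.Random(seed)
    step = (beta_max - beta_min) / (n_samples - 1)  # requires n_samples >= 2
    trials = []
    for i in range(n_samples):
        beta = beta_min + i * step
        sizes = [final_size_si(adjacency, beta, timesteps, n_seeds, rng)
                 for _ in range(runs)]
        mean_size = sum(sizes) / runs
        trials.append({"beta": beta, "score": -abs(mean_size - observed_size)})
    # The best candidate has the score closest to zero
    best = max(trials, key=lambda t: t["score"])
    return best, trials


# Toy graph: a ring of 30 nodes, with a hypothetical observed cascade of 18 nodes
ring = {i: {(i - 1) % 30, (i + 1) % 30} for i in range(30)}
best, trials = infer_beta(ring, observed_size=18,
                          beta_min=0.01, beta_max=0.5, n_samples=10)
print(f"best beta = {best['beta']:.3f}, score = {best['score']:.2f}")
```

Averaging over several runs matters because a single stochastic simulation can land far from the typical final size for a given `beta`.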

In [None]:
# Suppose we observed a cascade that infected 60% of the network
observed_size = int(0.6 * len(nodes_df))

inference_output = os.path.join(temp_dir, 'inference')

!python -m src.contagion.cli_inference \
    {os.path.join(output_dir, 'edges.csv')} \
    --model si \
    --observed-final-size {observed_size} \
    --beta-min 0.01 \
    --beta-max 0.5 \
    --n-samples 10 \
    --timesteps 50 \
    --seed 42 \
    --output-path {inference_output} \
    --output-format csv

In [None]:
# Load inference results
trials_df = pd.read_csv(inference_output + '_trials.csv')
best_params_df = pd.read_csv(inference_output + '_best_params.csv')

print("Parameter Inference Results:")
print(best_params_df)

# Visualize parameter search
plt.figure(figsize=(10, 6))
plt.scatter(trials_df['beta'], trials_df['score'], alpha=0.6, s=100)
plt.xlabel('Beta (infection rate)', fontsize=12)
plt.ylabel('Score (negative abs error)', fontsize=12)
plt.title('Parameter Search: Beta vs Score', fontsize=14)
plt.grid(True, alpha=0.3)

# Highlight best
best_beta = best_params_df.iloc[0]['beta']
best_score = best_params_df.iloc[0]['score']
plt.scatter([best_beta], [best_score], color='red', s=200, marker='*', 
           label=f'Best: β={best_beta:.3f}')
plt.legend(fontsize=10)
plt.tight_layout()
plt.show()

print(f"\nBest beta value: {best_beta:.3f} achieves final size closest to target of {observed_size}")

## 7. Comparing Models

Compare SI, SIS, and SIR models on the same network.
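
The three models differ only in what happens to an infected node afterwards: SI never recovers, SIS returns to the susceptible state, and SIR moves to a permanent recovered state. A minimal sketch of that distinction, assuming a synchronous discrete-time update and a per-step recovery probability `gamma` (the function name `simulate` is illustrative and is not the toolkit's API):

```python
import random


def simulate(adjacency, model, beta, gamma=0.0, timesteps=50, n_seeds=1, seed=None):
    """Discrete-time SI/SIS/SIR; returns the infected count at each timestep."""
    rng = random.Random(seed)
    state = {node: "S" for node in adjacency}
    for node in rng.sample(sorted(adjacency), n_seeds):
        state[node] = "I"
    history = [n_seeds]

    for _ in range(timesteps):
        infected = [n for n in sorted(state) if state[n] == "I"]
        next_state = dict(state)  # read from state, write to next_state
        for node in infected:
            for nbr in sorted(adjacency[node]):
                if state[nbr] == "S" and rng.random() < beta:
                    next_state[nbr] = "I"
            # Recovery: SI never recovers, SIS returns to S, SIR moves to R for good
            if model != "si" and rng.random() < gamma:
                next_state[node] = "S" if model == "sis" else "R"
        state = next_state
        history.append(sum(1 for s in state.values() if s == "I"))
    return history


# Toy graph: a ring of 20 nodes
ring = {i: {(i - 1) % 20, (i + 1) % 20} for i in range(20)}
for model in ("si", "sis", "sir"):
    history = simulate(ring, model, beta=0.3, gamma=0.1, seed=42)
    print(f"{model.upper()}: peak={max(history)}, final={history[-1]}")
```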

In [None]:
# Run multiple models
models_to_test = [
    ('si', {'beta': 0.1}),
    ('sis', {'beta': 0.1, 'gamma': 0.05}),
    ('sir', {'beta': 0.1, 'gamma': 0.05})
]

results = {}

for model_name, params in models_to_test:
    model_output = os.path.join(temp_dir, f'{model_name}_results')
    
    cmd = f"python -m src.contagion.cli {os.path.join(output_dir, 'edges.csv')} "
    cmd += f"--model {model_name} --timesteps 50 --initial-frac 0.05 --seed 42 "
    cmd += f"--output-path {model_output} --output-format csv "
    
    for param, value in params.items():
        cmd += f"--{param} {value} "
    
    !{cmd}
    
    # Load results
    results[model_name] = pd.read_csv(model_output + '.csv')

print("✓ All models simulated")

In [None]:
# Compare models visually
plt.figure(figsize=(12, 6))

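# Use a distinct loop variable so the original `df` of documents is not overwritten
for model_name, model_df in results.items():
    plt.plot(model_df['timestep'], model_df['infected'], label=model_name.upper(), linewidth=2)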

plt.xlabel('Timestep', fontsize=12)
plt.ylabel('Number of Infected Nodes', fontsize=12)
plt.title('Model Comparison: SI vs SIS vs SIR', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\nKey differences:")
print("- SI: Monotonic growth (no recovery)")
print("- SIS: Fluctuates around an endemic level or dies out (nodes can be re-infected)")
print("- SIR: Peak then decline (immunity after recovery)")

## Summary and Next Steps

### What We Accomplished

1. ✅ Loaded and prepared text data
2. ✅ Built semantic network from co-occurrence patterns
3. ✅ Analyzed network structure (centrality, communities)
4. ✅ Simulated contagion spread (SI/SIS/SIR models)
5. ✅ Inferred optimal parameters from observed cascades
6. ✅ Visualized networks and simulation results

### Key Findings

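- Louvain community detection groups co-occurring terms into topic clusters
- High-degree hub terms offer the most paths for spread (compare the centrality table with the contagion curve)
- SI, SIS, and SIR produce distinct spread dynamics on the same network
- A search over `beta` can find the value whose simulated cascade size is closest to the observed one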

### Next Steps

**For More Advanced Analysis:**
- Use transformer networks for semantic similarity (`transformers_cli`)
- Build time-sliced networks to track evolution (`time_slice_cli`)
- Try complex contagion models (Watts, K-Reinforcement)
- Export to Gephi for interactive visualization

**For Real Data:**
- Scale up with larger datasets (10K+ documents)
- Tune parameters (min_df, topk, similarity thresholds)
- Use config files for reproducible workflows
- Save results in Parquet format for efficiency

**For Further Reading:**
- See `README.md` for complete documentation
- See `CONTAGION.md` for detailed contagion modeling guide
- Check `examples/` for more specialized notebooks

In [None]:
# Cleanup temporary files
import shutil
shutil.rmtree(temp_dir)
print("✓ Cleanup complete")