<a href="https://colab.research.google.com/github/ryandale7/ML-on-Graphs/blob/main/4_Visualization%2C_Interpretation%2C_and_Communication_of_Results.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Running black


from google.colab import drive


# 1. Mount Drive
drive.mount("/content/drive")


# 2. Change directory to your notebooks folder
%cd /content/drive/MyDrive/Colab\ Notebooks


# 3. Install nbqa and black (required each new session)
!pip install nbqa black


# 4. Run nbqa black on all notebooks in the current directory
!nbqa black .

Mounted at /content/drive
/content/drive/MyDrive/Colab Notebooks
Collecting nbqa
  Downloading nbqa-1.9.1-py3-none-any.whl.metadata (31 kB)
Collecting black
  Downloading black-24.10.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_28_x86_64.whl.metadata (79 kB)
Collecting autopep8>=1.5 (from nbqa)
  Downloading autopep8-2.3.1-py2.py3-none-any.whl.metadata (16 kB)
Collecting tokenize-rt>=3.2.0 (from nbqa)
  Downloading tokenize_rt-6.1.0-py2.py3-none-any.whl.metadata (4.1 kB)
Collecting mypy-extensions>=0.4.3 (from black)
  Downloading mypy_extensions-1.0.0-py3-none-any.whl.metadata (1.1 kB)
Collecting pathspec>=0.9.0 (from black)
  Downloading pathspec-0.12.1-py3-none-any.whl.metadata (21 kB)
Collecting pycodestyle>=2.12.0 (from autopep8>=1.5->nbqa)
  Downloading pycodestyle-2.12.1-py2.py3-none-any.whl.metadata (4.5 kB)
Collecting jedi>=0.16 

In [None]:
# PRACTICE ACTIVITY: Load the Bitcoin OTC Dataset and Prepare for Visualization

# Step 1: Navigate to data directory if needed
# %cd /content/drive/MyDrive/Colab Notebooks/Data

# Step 2: Basic imports
import networkx as nx
import matplotlib.pyplot as plt

# (Optional) Plotly import
# !pip install plotly
import plotly.express as px

# Step 3: Load the CSV
# The CSV might have columns like: source, target, rating, [optional other cols].
# We'll parse only the first three columns (src, tgt, weight).
file_path = "/content/drive/MyDrive/Colab Notebooks/Data/soc-sign-bitcoinotc.csv"
G = nx.DiGraph()

with open(file_path, "r") as f:
    for line in f:
        # Split the line by commas
        fields = line.strip().split(",")
        # Extract only the first three entries (ignore extras if present)
        src, tgt, w = fields[0], fields[1], fields[2]
        G.add_edge(src, tgt, weight=float(w))

print("Number of nodes:", G.number_of_nodes())
print("Number of edges:", G.number_of_edges())

# Quick peek at some edges and their weights
edges_list = list(G.edges(data=True))[:5]
print("Sample edges (with weights):", edges_list)

Number of nodes: 5881
Number of edges: 35592
Sample edges (with weights): [('6', '2', {'weight': 4.0}), ('6', '5', {'weight': 2.0}), ('6', '4', {'weight': 2.0}), ('6', '7', {'weight': 5.0}), ('6', '114', {'weight': 2.0})]


### Explanation / Feedback

- We created a directed graph `G` from the CSV using `networkx.DiGraph()`.
- Each edge is assigned a `weight` based on the trust score (-10 to +10).
- The graph has 5,881 nodes and 35,592 edges, so plotting **all** edges at once would be slow to render and hard to read.

---

### Practice Activity: Subset Visualization with Plotly

**Task**:  
1. Create a subgraph of ~200 nodes (e.g., pick random nodes or the top nodes with the highest outdegree).  
2. Use Plotly to color edges by their weight (negative vs. positive).

"Try to accomplish this by writing code that filters nodes and then calls a Plotly function to visualize edges."


In [None]:
# CODE CELL: Subgraph Visualization with Plotly

import random

# Step 1: Get a subset of nodes for visualization
all_nodes = list(G.nodes())
subset_nodes = random.sample(all_nodes, 200)  # pick 200 random nodes
H = G.subgraph(subset_nodes).copy()

# Step 2: Compute a 2D layout and group edge coordinates by sign.
# A Plotly line trace has a single color, so positive and negative edges
# are collected separately and drawn as two traces.
pos = nx.spring_layout(H, seed=42)  # 2D layout

edge_coords = {"positive": ([], []), "negative": ([], [])}

for u, v, data in H.edges(data=True):
    x0, y0 = pos[u]
    x1, y1 = pos[v]
    sign = "negative" if data["weight"] < 0 else "positive"
    xs, ys = edge_coords[sign]
    # None ends the current segment so edges are not joined to each other
    xs.extend([x0, x1, None])
    ys.extend([y0, y1, None])

# Step 3: Plot nodes, then add one line trace per edge sign
fig = px.scatter(
    x=[pos[n][0] for n in H.nodes()],
    y=[pos[n][1] for n in H.nodes()],
    text=[n for n in H.nodes()],
    title="Plotly Visualization of Subset of Bitcoin OTC Network",
)

for sign, color in [("positive", "green"), ("negative", "red")]:
    xs, ys = edge_coords[sign]
    fig.add_scatter(
        x=xs, y=ys, mode="lines", line=dict(color=color, width=1), name=f"{sign} edges"
    )

fig.show()

### Feedback / Sample Discussion

- We used a **spring layout** for a 2D projection.
- **Edge colors**: a Plotly line trace has a single color, so positive (green) and negative (red) edges are drawn as two separate traces.
- For large networks, Plotly might lag. Consider tools like Gephi for larger-scale analysis.
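
One way to hand a graph to Gephi is to write it to a GEXF file, a format Gephi opens directly. The sketch below assumes a NetworkX version that provides `nx.write_gexf` and `nx.read_gexf`. The toy graph, the `sign` attribute, and the file name `bitcoin_otc_sample.gexf` are illustrative and not part of the original notebook.

```python
import os
import tempfile

import networkx as nx

# Toy signed graph standing in for the notebook's G (illustrative data)
toy = nx.DiGraph()
toy.add_edge("1", "2", weight=4.0)
toy.add_edge("2", "3", weight=-10.0)
toy.add_edge("3", "1", weight=2.0)

# Store the sign as its own attribute so it can drive edge color in Gephi
for u, v, data in toy.edges(data=True):
    data["sign"] = "negative" if data["weight"] < 0 else "positive"

# Hypothetical output location for the exported file
out_path = os.path.join(tempfile.gettempdir(), "bitcoin_otc_sample.gexf")
nx.write_gexf(toy, out_path)

# Read the file back to confirm the export round-trips
reloaded = nx.read_gexf(out_path)
print(reloaded.number_of_nodes(), reloaded.number_of_edges())
```

With the notebook's graph, the same two steps (tag each edge with its sign, then call `nx.write_gexf`) would apply to `G` or to the subgraph `H`.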

---

### Quiz: Visualization Tools

**Q1 (Multiple Choice)**: Which tool is primarily a standalone software for large-scale graph visualization?  
- a) Plotly  
- b) Gephi  
- c) Pandas  
- d) Matplotlib  

**Q2 (Short Answer)**: Why might you export data to Neo4j Bloom instead of only visualizing in Python?

---

#### Answers

- **A1**: (b) **Gephi** is standalone software specifically for network visualization.  
- **A2**: Neo4j Bloom is an **interactive** exploration tool that runs on top of a Neo4j graph database. It supports advanced querying, visual exploration, and possibly real-time updates outside of a Python notebook.


## 4.2 Interpreting and Explaining Model Predictions for Stakeholders

### 4.2.1 Understanding Model Outputs
- In a weighted signed network, model predictions might include:
  - Predicted trust score for a user or an edge
  - Risk/fraud likelihood scores for specific users

### 4.2.2 Visual Interpretation of Trust vs. Distrust
- Color edges by negative or positive weight
- Focus on clusters of strongly positive or negative relationships

### 4.2.3 Explaining Results to Different Audiences
- **Technical**: Detailed metrics, confidence intervals, reproducibility steps
- **Non-technical**: Focus on the "big picture" (e.g., how many suspicious users found)

### 4.2.4 Real-Life Examples: Bitcoin OTC Network
- Some users consistently rated at +10 by many peers → highly trusted
- Others might have multiple negative edges → potential risk/fraud
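
The second pattern can be checked by counting negative incoming ratings per user. This is a minimal sketch on toy data; the threshold `MIN_NEGATIVE` is an illustrative assumption, not a value from the dataset or its authors.

```python
from collections import Counter

# Toy (source, target, weight) ratings; with the notebook's graph the same
# triples are available from G.edges(data="weight")
ratings = [
    ("a", "x", 10), ("b", "x", 10), ("c", "x", 10),
    ("a", "y", -10), ("b", "y", -5), ("c", "y", 3),
    ("a", "z", -1),
]

MIN_NEGATIVE = 2  # illustrative threshold, not taken from the dataset

# Count how many negative ratings each user received
negative_in = Counter(tgt for _, tgt, w in ratings if w < 0)
flagged = sorted(user for user, n in negative_in.items() if n >= MIN_NEGATIVE)
print("Users with at least", MIN_NEGATIVE, "negative ratings:", flagged)
```

A flag like this is a starting point for review rather than evidence of fraud, since a few negative ratings can also come from a single dispute.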

**Practice Activity**:  
- "Try writing code that calculates the top 10 most trusted users (based on average incoming edge weight)."


In [None]:
# CODE CELL: Compute and Display Top 10 Most Trusted Users

# Partial Starter Code
import statistics


def top_trusted_users(graph, top_n=10):
    # Calculate average incoming edge weight for each node
    avg_incoming = {}
    for node in graph.nodes():
        in_edges = graph.in_edges(node, data=True)
        weights = [data["weight"] for (_, _, data) in in_edges]
        if weights:
            avg_incoming[node] = statistics.mean(weights)
        else:
            avg_incoming[node] = 0  # no incoming edges means no rating

    # Sort by average rating
    sorted_users = sorted(avg_incoming.items(), key=lambda x: x[1], reverse=True)
    return sorted_users[:top_n]


top_10_trusted = top_trusted_users(G, top_n=10)
print("Top 10 Most Trusted Users (by avg incoming weight):")
for user, rating in top_10_trusted:
    print(user, rating)

Top 10 Most Trusted Users (by avg incoming weight):
529 10.0
814 10.0
1122 10.0
1261 10.0
1326 10.0
1340 10.0
1501 10.0
1545 10.0
1663 10.0
2078 10.0


### Explanation / Feedback

- We iterated through each node's incoming edges, took the average weight, and used Python's `statistics.mean`.
- **Possible variation**: Weighted by the number of ratings, or ignoring negative edges, etc.
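- **Interpretation**: Every user in this list has an average incoming rating of +10. The average ignores how many ratings a user received, so a single +10 rating is enough to reach the top of the list.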
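
One possible variation is to require a minimum number of ratings before a user can be ranked. The sketch below works on plain `(source, target, weight)` triples. The function name `top_trusted_min_ratings` and the cutoff `min_ratings` are hypothetical choices made for illustration.

```python
from collections import defaultdict
from statistics import mean


def top_trusted_min_ratings(ratings, min_ratings=5, top_n=10):
    """Rank users by mean incoming rating, skipping users with few ratings.

    ratings: iterable of (source, target, weight) triples.
    min_ratings: illustrative cutoff, not a value from the original notebook.
    """
    incoming = defaultdict(list)
    for _, tgt, w in ratings:
        incoming[tgt].append(w)

    scored = [
        (user, mean(ws), len(ws))
        for user, ws in incoming.items()
        if len(ws) >= min_ratings
    ]
    # Sort by average rating, then by number of ratings as a tie-breaker
    scored.sort(key=lambda item: (item[1], item[2]), reverse=True)
    return scored[:top_n]


# Toy data: "solo" has a single +10 rating, "vet" has three ratings
toy_ratings = [("a", "solo", 10), ("a", "vet", 9), ("b", "vet", 8), ("c", "vet", 7)]
result = top_trusted_min_ratings(toy_ratings, min_ratings=2)
print(result)
```

With the notebook's graph, `G.edges(data="weight")` yields triples in the expected shape.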

---

### Quiz: Why Might a "Highly Trusted" User Still Be Risky?

**Short Answer**: Provide at least one reason why a user with many +10 edges could still pose a risk or be fraudulent.

**Sample Answer**:  
- The user might engage in collusive activity where a small group of accomplices rate them highly.
- They might have very few total transactions, so a small number of positive ratings inflates their average.

---


## 4.3 Effective Communication Strategies for Technical and Non-Technical Audiences

### 4.3.1 Storytelling with Data
- Narratives around trust and distrust relationships
- Emphasize real-world impacts (safe trading vs. fraud)

### 4.3.2 Simplifying Complex Concepts
- Avoid jargon when talking to non-technical stakeholders
- Focus on the "why" (why these scores matter, why some users are risky)

### 4.3.3 Practice Activity
- Create a short summary for management explaining the top-level insight:
  1. How many high-trust users exist
  2. What fraction of edges are negative
  3. One recommended action to reduce fraud risk

"Try writing 2-3 bullet points in clear, non-technical language."
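
The first two numbers for that summary can be computed directly from the ratings. This is a rough sketch: `trust_cutoff` and `min_ratings` are assumptions about what counts as "high trust" and should be replaced with a definition your stakeholders agree on.

```python
from collections import defaultdict
from statistics import mean


def summary_numbers(ratings, trust_cutoff=8, min_ratings=2):
    """Return (number of high-trust users, fraction of negative edges).

    trust_cutoff and min_ratings are illustrative assumptions about what
    counts as "high trust".
    """
    ratings = list(ratings)
    if not ratings:
        return 0, 0.0

    incoming = defaultdict(list)
    for _, tgt, w in ratings:
        incoming[tgt].append(w)

    high_trust = sum(
        1
        for ws in incoming.values()
        if len(ws) >= min_ratings and mean(ws) >= trust_cutoff
    )
    negative_fraction = sum(1 for _, _, w in ratings if w < 0) / len(ratings)
    return high_trust, negative_fraction


# Toy (source, target, weight) ratings
toy_ratings = [("a", "x", 10), ("b", "x", 9), ("a", "y", -10), ("b", "y", 2)]
n_trusted, frac_negative = summary_numbers(toy_ratings)
print(f"High-trust users: {n_trusted}")
print(f"Negative ratings: {frac_negative:.0%} of all edges")
```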


## 4.4 Ensuring Reproducibility, Transparency, and Trust in Graph ML Outcomes

### 4.4.1 Reproducibility in Jupyter
- Document library versions (e.g., `pip freeze`)
- Share code notebooks along with the dataset location

### 4.4.2 Data and Code Transparency
- Provide clear data preprocessing steps
- Reference the original Bitcoin OTC CSV source

### 4.4.3 Ethical Considerations
- Privacy concerns, even if user IDs are anonymized
- Avoiding biased interpretations or undue suspicion

### 4.4.4 Practice Activity
- "Try writing a short script that logs your environment setup and any data transformations."
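
A minimal sketch of such a script, using only the standard library. The package list and the transformation descriptions are examples drawn from this notebook, and the helper name `environment_record` is hypothetical.

```python
import json
import platform
import sys
from datetime import datetime, timezone
from importlib import metadata


def environment_record(packages, transformations):
    """Collect interpreter, package versions, and a list of data steps."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
        "transformations": list(transformations),
    }


record = environment_record(
    packages=["networkx", "matplotlib", "plotly"],
    transformations=[
        "Parsed the first three CSV columns as source, target, weight",
        "Sampled 200 random nodes for the Plotly subgraph",
    ],
)
print(json.dumps(record, indent=2))
```

Writing the JSON to a file next to the notebook keeps the record with the results it describes.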

**Quiz**  
- MCQ: Which file type is best for sharing reproducible notebooks? (e.g., `.ipynb`, `.py`, `.html`)  
- Short Answer: Give two ways to ensure a notebook can be trusted by peers reviewing it.


# More Advanced Examples with the Bitcoin OTC Dataset

In this notebook, we explore deeper analysis and transformations of the **Bitcoin OTC trust weighted signed network**. We will:
- Compute **weighted PageRank** (ignoring negative edges or transforming them).
- Separate the graph into **positive** and **negative** subgraphs.
- Perform **shortest path analysis** with custom distance transformations for signed edges.
- Apply **community detection** on a positive subgraph.
- Investigate **signed triad analysis** to see whether the network follows balance theory.

Each example demonstrates a different approach to handling or interpreting signed, weighted edges.


## Dataset Reminder

- **Dataset**: soc-sign-bitcoinotc.csv
- **Nodes**: 5,881
- **Edges**: 35,592
- **Range of Edge Weight**: -10 (full distrust) to +10 (full trust)
- **Percentage of Positive Edges**: ~89%

We assume you have already loaded the full graph into a variable called `G` (a `nx.DiGraph`) with an edge attribute `'weight'` for each trust score. If you have not loaded it yet, remember to do so before running the advanced code examples.

Example (from a previous step):
```python
import networkx as nx

G = nx.DiGraph()
with open('soc-sign-bitcoinotc.csv', 'r') as f:
    for line in f:
        src, tgt, w = line.strip().split(',')[:3]
        G.add_edge(src, tgt, weight=float(w))
```


## 1. PageRank with Edge Weights (Weighted PageRank)

**Goal**: Compute a PageRank-like metric where higher trust edges contribute more to a node's rank. Negative edges can be ignored or transformed, depending on your modeling preference.

### Approach: Ignore Negative Edges
- Create a new directed graph containing **only non-negative edges** (weight >= 0).
- Run `nx.pagerank` with `weight='weight'`.


In [None]:
import networkx as nx

# Step 1: Filter the original G to retain only edges with weight >= 0
G_pos = nx.DiGraph()
for u, v, data in G.edges(data=True):
    w = data["weight"]
    if w >= 0:
        G_pos.add_edge(u, v, weight=w)

# Step 2: Run weighted PageRank
pr = nx.pagerank(G_pos, alpha=0.85, weight="weight")

# Step 3: Print top 5 nodes by PageRank
top_5_pr = sorted(pr.items(), key=lambda x: x[1], reverse=True)[:5]
print("Top 5 nodes by (non-negative) Weighted PageRank:")
for node, score in top_5_pr:
    print(f"{node} -> {score:.5f}")

Top 5 nodes by (non-negative) Weighted PageRank:
35 -> 0.01606
2642 -> 0.01348
1 -> 0.00911
7 -> 0.00883
1810 -> 0.00760
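
To show what `weight="weight"` changes: with edge weights, each node passes its rank to its successors in proportion to the weight of each outgoing edge, instead of splitting it evenly. The following pure-Python power iteration is a sketch of that idea, not NetworkX's implementation. It assumes strictly positive weights and spreads the rank of nodes without outgoing edges uniformly.

```python
def weighted_pagerank(edges, alpha=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank over (src, tgt, weight) triples, weight > 0."""
    edges = list(edges)
    nodes = sorted({n for s, t, _ in edges for n in (s, t)})
    n = len(nodes)

    # Total outgoing weight per node, used to normalize each edge's share
    out_weight = {node: 0.0 for node in nodes}
    for s, _, w in edges:
        out_weight[s] += w

    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread uniformly
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0)
        new_rank = {node: (1 - alpha) / n + alpha * dangling / n for node in nodes}
        for s, t, w in edges:
            new_rank[t] += alpha * rank[s] * w / out_weight[s]
        delta = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Toy graph: "a" trusts "b" strongly (10) and "c" weakly (1)
toy_edges = [("a", "b", 10), ("a", "c", 1), ("b", "a", 5), ("c", "a", 5)]
ranks = weighted_pagerank(toy_edges)
for node, score in sorted(ranks.items(), key=lambda x: x[1], reverse=True):
    print(f"{node} -> {score:.5f}")
```

In the toy graph, `b` ends up ranked above `c` because it receives the larger share of `a`'s rank.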


## 2. Signed Edge Transformation (Positive vs. Negative Graphs)

**Goal**: Separate the original graph into **two subgraphs**:
- **G_pos** (only positive edges)
- **G_neg** (only negative edges)

This can be helpful if you want to analyze "trust" vs. "distrust" networks separately.


In [None]:
# Create positive and negative subgraphs from G
G_pos = nx.DiGraph()
G_neg = nx.DiGraph()

for u, v, data in G.edges(data=True):
    w = data['weight']
    if w > 0:
        G_pos.add_edge(u, v, weight=w)
    elif w < 0:
        # Optionally store absolute value
        G_neg.add_edge(u, v, weight=abs(w))

# Basic stats
print("Positive subgraph:")
print("  Nodes:", G_pos.number_of_nodes(), "Edges:", G_pos.number_of_edges())

print("Negative subgraph:")
print("  Nodes:", G_neg.number_of_nodes(), "Edges:", G_neg.number_of_edges())

# Example: largest strongly connected component in negative subgraph
if G_neg.number_of_nodes() > 0:
    largest_neg_scc = max(nx.strongly_connected_components(G_neg), key=len)
    print("Size of largest SCC in negative subgraph:", len(largest_neg_scc))
else:
    print("No nodes in G_neg, so no SCCs.")
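
The introduction lists community detection on the positive subgraph as one of the analyses. A minimal sketch of one way to do it uses NetworkX's `greedy_modularity_communities` on a toy graph. The choice of algorithm and the decision to drop edge direction are assumptions made for illustration.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy positive-trust graph: two tight groups joined by one weak edge
toy_pos = nx.DiGraph()
toy_pos.add_weighted_edges_from(
    [
        ("a", "b", 9), ("b", "c", 8), ("c", "a", 10),
        ("d", "e", 9), ("e", "f", 8), ("f", "d", 10),
        ("c", "d", 1),
    ]
)

# Assumption: direction is dropped for community detection. For reciprocal
# edges, to_undirected() keeps the attributes of only one direction.
undirected = toy_pos.to_undirected()

communities = greedy_modularity_communities(undirected, weight="weight")
for i, members in enumerate(communities):
    print(f"Community {i}: {sorted(members)}")
```

On the full dataset the same call would take `G_pos.to_undirected()`. Expect it to run noticeably longer than on the toy graph.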


## 3. Shortest Path Analysis with Signed Weights

**Goal**: Use a custom transformation to treat negative edges as higher "cost."
- For instance, map edge weight w to a distance d = 11 - w if w in [-10, +10].
  - Then +10 becomes distance=1, and -10 becomes distance=21.

We can then apply standard shortest path algorithms like Dijkstra or Bellman-Ford to find minimal "cost" paths favoring positive edges.


In [None]:
import random

# Create a new graph H_dist where edge "weight" is actually the cost/distance
H_dist = nx.DiGraph()
for u, v, data in G.edges(data=True):
    w = data["weight"]
    # Example transform: distance = 11 - w
    # +10 -> 1,  -10 -> 21
    dist = 11 - w
    H_dist.add_edge(u, v, weight=dist)

# Pick two random nodes to measure the "cost-based" shortest path
all_nodes = list(H_dist.nodes())
if len(all_nodes) > 1:
    src, tgt = random.sample(all_nodes, 2)
    try:
        path = nx.shortest_path(H_dist, source=src, target=tgt, weight="weight")
        path_cost = nx.shortest_path_length(
            H_dist, source=src, target=tgt, weight="weight"
        )
        print(f"Shortest path from {src} to {tgt} (transformed cost):")
        print(f"  Path: {path}")
        print(f"  Total cost: {path_cost:.2f}")
    except nx.NetworkXNoPath:
        print(f"No path found from {src} to {tgt}.")
else:
    print("Not enough nodes to pick a random source and target.")

Shortest path from 5271 to 1001 (transformed cost):
  Path: ['5271', '4002', '35', '1437', '492', '1001']
  Total cost: 30.00
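
Because every transformed distance `11 - w` is at least 1, all edge costs are positive and Dijkstra's algorithm applies. The toy example below is a hand-written Dijkstra sketch, not the NetworkX routine. It shows the intended effect: a two-hop route over trusted edges is cheaper than a direct distrust edge.

```python
import heapq


def cheapest_path(edges, source, target):
    """Dijkstra over (src, tgt, trust) triples using distance = 11 - trust."""
    adjacency = {}
    for src, tgt, trust in edges:
        adjacency.setdefault(src, []).append((tgt, 11 - trust))

    best = {source: 0}
    previous = {}
    queue = [(0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            break
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, dist in adjacency.get(node, []):
            new_cost = cost + dist
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))

    if target not in best:
        return None, float("inf")

    # Walk back from the target to rebuild the path
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return path[::-1], best[target]


# Direct edge a->b has trust -10 (cost 21); a->c->b uses two +10 edges (cost 1 each)
toy_edges = [("a", "b", -10), ("a", "c", 10), ("c", "b", 10)]
path, cost = cheapest_path(toy_edges, "a", "b")
print(path, cost)
```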


## 4. Signed Triad Analysis (Balance Theory)

**Goal**: Investigate whether triads follow "balance theory" (friend of a friend is a friend, etc.).
- We look at triplets of nodes and check whether their edges align as "balanced" or "unbalanced."

**Note**: This can be computationally heavy for large networks. We might sample a subset of nodes.


In [None]:
from itertools import combinations
import random


def is_balanced_triad(u, v, w, graph):
    """
    Returns True if the triad (u, v, w) is balanced, False otherwise.
    We'll treat edges as undirected for simplicity and gather signs.
    If an edge doesn't exist, we consider it sign=0 (neutral).
    Balanced if the product of non-zero signs is positive.
    """
    edges = [(u, v), (v, w), (u, w)]
    signs = []

    for e in edges:
        # Check e in both directions
        if graph.has_edge(e[0], e[1]):
            sign_val = 1 if graph[e[0]][e[1]]["weight"] >= 0 else -1
        elif graph.has_edge(e[1], e[0]):
            sign_val = 1 if graph[e[1]][e[0]]["weight"] >= 0 else -1
        else:
            sign_val = 0  # no edge
        signs.append(sign_val)

    product = 1
    for s in signs:
        if s == 0:
            continue
        product *= s
    return product > 0


# Sample 300 nodes to reduce runtime
all_nodes = list(G.nodes())
sampled_nodes = random.sample(all_nodes, min(len(all_nodes), 300))

triad_count = 0
balanced_count = 0

for u, v, w in combinations(sampled_nodes, 3):
    triad_count += 1
    if is_balanced_triad(u, v, w, G):
        balanced_count += 1

print(f"Checked {triad_count} triads in sample, {balanced_count} are balanced.")
if triad_count > 0:
    print(f"({balanced_count/triad_count*100:.2f}% balanced)")

Checked 4455100 triads in sample, 4452130 are balanced.
(99.93% balanced)
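
The count above covers every 3-node combination in the sample, including triples with missing edges, which `is_balanced_triad` treats as neutral. In a sparse graph most sampled triples have few or no edges, so the high balanced share likely reflects sparsity more than balance. A stricter variant counts only closed triads, where all three pairs are rated. The sketch below ignores direction. If both directions exist, the last one read decides the sign, which is a simplification.

```python
from itertools import combinations


def closed_triad_balance(edges):
    """Count balanced triads among triples whose three pairs are all rated.

    edges: iterable of (src, tgt, weight). Returns (total, balanced).
    """
    sign = {}
    neighbors = {}
    for u, v, w in edges:
        if u == v:
            continue  # skip self-ratings
        sign[frozenset((u, v))] = 1 if w >= 0 else -1
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)

    total = balanced = 0
    for u in neighbors:
        # Only pairs of u's neighbors can close a triangle with u
        for v, x in combinations(sorted(neighbors[u]), 2):
            # Requiring u < v < x counts each triangle exactly once
            if u < v and x in neighbors[v]:
                total += 1
                product = (
                    sign[frozenset((u, v))]
                    * sign[frozenset((v, x))]
                    * sign[frozenset((u, x))]
                )
                if product > 0:
                    balanced += 1
    return total, balanced


toy_edges = [
    ("a", "b", 5), ("b", "c", 3), ("a", "c", 8),    # + + +  balanced
    ("c", "d", -2), ("d", "e", -4), ("c", "e", 6),  # - - +  balanced
    ("e", "f", -1), ("f", "g", 2), ("e", "g", 3),   # - + +  unbalanced
    ("g", "h", 1),                                  # open edge, no triangle
]
total, balanced = closed_triad_balance(toy_edges)
print(f"{balanced} of {total} closed triads are balanced")
```

With the notebook's graph, `G.edges(data="weight")` provides the input triples.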


## 4.5 References and Further Reading

- **Dataset Citation**:
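  - S. Kumar, F. Spezzano, V. S. Subrahmanian, and C. Faloutsos (2016). Edge Weight Prediction in Weighted Signed Networks. IEEE ICDM.
  - S. Kumar, B. Hooi, D. Makhija, M. Kumar, C. Faloutsos, and V. S. Subrahmanian (2018). REV2: Fraudulent User Prediction in Rating Platforms. ACM WSDM.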
- **Tool Documentation**:
  - [Plotly](https://plotly.com/python/)
  - [Gephi](https://gephi.org/)
  - [Neo4j Bloom](https://neo4j.com/developer/neo4j-bloom/)
- **Advanced Topics**:
  - Real-time graph updates, streaming data, interactive dashboards, etc.

---

# End of Unit 4: Visualization, Interpretation, and Communication
