
### Introduction

This report analyzes a dataset of private messages from an online social network at UC Irvine, containing 1899 users and 59835 messages. Each message is represented as an edge (SRC, TGT, UNIXTS), where SRC is the sender (source), TGT is the recipient (target), and UNIXTS is the Unix timestamp. Our goals are to:

- Explore and describe the dataset.
- Define and answer research questions about user activity and connectivity.
- Visualize the network, emphasizing users with high centrality.

We use Python with `pandas` for data handling, `networkx` for network analysis, and `plotly` for visualization.

### 1. Setup and Data Loading

First, import the required libraries, installing any that are missing (for example with `pip install pandas networkx plotly nbformat`).

In [11]:
import pandas as pd
import networkx as nx
import plotly.graph_objects as go
import nbformat  # not used directly; Plotly needs it installed to render figures in a notebook

Once the cell above runs without errors, we can proceed by loading the dataset into a `pandas` DataFrame.

In [12]:
# Load the dataset (space-separated file with columns: SRC, TGT, UNIXTS)
df = pd.read_csv('13_collegemsg_network/CollegeMsg.txt', sep=' ', header=None, names=['SRC', 'TGT', 'UNIXTS'])

# Convert UNIX timestamp to datetime
df['datetime'] = pd.to_datetime(df['UNIXTS'], unit='s')

print(df.head())  # compare the raw UNIXTS values with the converted datetime column

   SRC  TGT      UNIXTS            datetime
0    1    2  1082040961 2004-04-15 14:56:01
1    3    4  1082155839 2004-04-16 22:50:39
2    5    2  1082414391 2004-04-19 22:39:51
3    6    7  1082439619 2004-04-20 05:40:19
4    8    7  1082439756 2004-04-20 05:42:36


A Unix timestamp is the number of **seconds** elapsed since 1 January 1970 (UTC). It is not very readable, so we convert it to a human-readable datetime.
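
As a quick check of the conversion, the first timestamp in the table can be decoded with the standard library alone. This is a minimal sketch: the variable names are illustrative, and it assumes the timestamps are read as UTC, which matches the `datetime` column printed above.

```python
from datetime import datetime, timezone

# First UNIXTS value from the table above
first_ts = 1082040961

# Interpret the value as seconds since 1970-01-01 00:00:00 UTC
first_dt = datetime.fromtimestamp(first_ts, tz=timezone.utc)
readable = first_dt.strftime("%Y-%m-%d %H:%M:%S")
print(readable)
```

The printed value should agree with the first row of the `datetime` column.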

---

### 2. Data Exploration

Let’s explore the dataset to understand its structure and properties.

In [13]:
# Basic statistics
print("Number of messages:", len(df))
print("Time range:", df['datetime'].min(), "to", df['datetime'].max())

# Unique users
all_users = set(df['SRC']).union(set(df['TGT']))
print("Number of unique users:", len(all_users))

Number of messages: 59835
Time range: 2004-04-15 14:56:01 to 2004-10-26 07:52:22
Number of unique users: 1899


In [14]:
# Aggregate messages by date
df['date'] = df['datetime'].dt.date
messages_per_day = df.groupby('date').size().reset_index(name='count')

# Create an interactive line chart with Plotly
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=messages_per_day['date'], 
    y=messages_per_day['count'], 
    mode='lines', 
    name='Messages per Day'
))
fig.update_layout(
    title='Number of Messages per Day',
    xaxis_title='Date',
    yaxis_title='Number of Messages'
)
fig.show()


**Output Interpretation:**
- **Number of messages:** 59835 (matches the assignment description).
- **Unique users:** 1899 (confirms the number of nodes).
- **Time range:** From the earliest timestamp (Apr 15, 2004) to the latest (Oct 26, 2004), a span of a little over six months.
- **Messages over time:** The plot shows how daily activity varies across this period; peaks indicate bursts of activity (possibly tied to the start of a term or to specific events, although the data alone cannot confirm the cause).
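
To put the daily counts in context, the overall span and the average message rate can be worked out from the figures printed above. This is a rough sketch using only the standard library; the variable names are illustrative and the three input values are copied from the cell output.

```python
from datetime import datetime

# Values copied from the output of the exploration cell
first_message = datetime(2004, 4, 15, 14, 56, 1)
last_message = datetime(2004, 10, 26, 7, 52, 22)
n_messages = 59835

# Length of the observation window, in (fractional) days
span = last_message - first_message
span_days = span.total_seconds() / 86400

# Average number of messages per day over the whole window
avg_per_day = n_messages / span_days
print(f"{span.days} full days, about {avg_per_day:.0f} messages per day on average")
```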
---

### 3. Network Construction

Build a directed graph with NetworkX. Repeated messages between the same ordered pair of users are aggregated into a single edge whose `count` attribute stores the number of messages.

In [15]:
# Aggregate messages by source (SRC) and target (TGT)
df_grouped = df.groupby(['SRC', 'TGT']).size().reset_index(name='count')

# Create a directed graph
G = nx.from_pandas_edgelist(df_grouped, 'SRC', 'TGT', edge_attr='count', create_using=nx.DiGraph())

# Make sure every user is a node (all users already appear in some edge,
# so this is a safeguard rather than a way to add isolated nodes)
all_users = set(df['SRC']).union(set(df['TGT']))
G.add_nodes_from(all_users)

# Basic graph statistics
print("Number of nodes:", G.number_of_nodes())
print("Number of edges:", G.number_of_edges())
print("Density:", nx.density(G))

Number of nodes: 1899
Number of edges: 20296
Density: 0.005631048674611617


**Output Interpretation:**
- **Nodes:** 1899 (all users are included).
- **Edges:** 20296 unique (SRC, TGT) pairs, fewer than the 59835 messages because repeated messages between the same ordered pair are aggregated into one edge.
- **Density:** Low (~0.00563), typical for sparse social networks.
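
The density figure can be reproduced by hand. For a directed graph without self-loops, density is the number of edges divided by the number of possible ordered pairs, n(n - 1). The sketch below uses the node and edge counts printed above; the variable names are illustrative.

```python
# Counts taken from the graph statistics printed above
n_nodes = 1899
n_edges = 20296

# In a directed graph every ordered pair of distinct nodes is a possible edge
possible_edges = n_nodes * (n_nodes - 1)

density = n_edges / possible_edges
print(possible_edges, round(density, 5))
```

Roughly 0.56% of all possible sender-recipient pairs ever exchanged a message, which is why the network counts as sparse.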

---

### 4. Network Analysis

Analyze key network properties: the degree distribution, the most connected users, and PageRank centrality.

#### **Degree Distribution**
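Examine the distribution of in-degrees and out-degrees. Because messages are aggregated per (SRC, TGT) pair, a user's in-degree is the number of distinct users who messaged them, and the out-degree is the number of distinct users they messaged.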

In [16]:
# Calculate degrees
in_degrees = dict(G.in_degree())
out_degrees = dict(G.out_degree())

# Plot interactive histogram
fig = go.Figure()
fig.add_trace(go.Histogram(x=list(in_degrees.values()), name='In-degree', opacity=0.7))
fig.add_trace(go.Histogram(x=list(out_degrees.values()), name='Out-degree', opacity=0.7))
fig.update_layout(
    title='Degree Distribution',
    xaxis_title='Degree',
    yaxis_title='Frequency',
    barmode='overlay'
)
fig.show()


**Result:** Most users have low degrees and a few have very high degrees, the heavy-tailed pattern expected in social networks.

#### **Top Users**
Identify the most connected users: those who wrote to the most distinct recipients and those who heard from the most distinct senders.

In [23]:
# Top 5 users by out-degree (number of distinct recipients)
top_out = sorted(out_degrees.items(), key=lambda x: x[1], reverse=True)[:5]
print("Top 5 users by out-degree (distinct recipients):")
for user, degree in top_out:
    print(user, degree)

# Top 5 users by in-degree (number of distinct senders)
top_in = sorted(in_degrees.items(), key=lambda x: x[1], reverse=True)[:5]
print("\nTop 5 users by in-degree (distinct senders):")
for user, degree in top_in:
    print(user, degree)

Top 5 users by out-degree (distinct recipients):
9 237
103 233
105 219
400 217
32 182

Top 5 users by in-degree (distinct senders):
32 137
42 120
638 119
372 115
598 115
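
Since the graph holds one edge per (SRC, TGT) pair, these degrees count distinct contacts rather than message volume. The difference can be seen on a small, hypothetical message log; the data and names below are made up for illustration and are not taken from the dataset.

```python
from collections import Counter, defaultdict

# Hypothetical toy log: one (sender, recipient) tuple per message
messages = [(1, 2), (1, 2), (1, 2), (1, 3), (2, 1), (3, 1), (3, 1)]

distinct_recipients = defaultdict(set)
messages_sent = Counter()
for src, tgt in messages:
    distinct_recipients[src].add(tgt)  # each contact is counted once
    messages_sent[src] += 1            # each message is counted

# Out-degree in the aggregated graph = number of distinct recipients
out_degree = {user: len(tgts) for user, tgts in distinct_recipients.items()}

print(out_degree)
print(dict(messages_sent))
```

User 1 has an out-degree of 2 but sent 4 messages. On the notebook's graph, the message volume should be available through the `count` edge attribute, for instance with `G.out_degree(weight='count')`.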


#### **Centrality**
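Compute PageRank to find influential users. The edges carry a `count` attribute rather than `weight`, so `nx.pagerank` treats every edge equally here.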

In [25]:
pagerank = nx.pagerank(G)
top_pagerank = sorted(pagerank.items(), key=lambda x: x[1], reverse=True)[:5]
print("Top 5 users by PageRank:")
for user, pr in top_pagerank:
    print(user, pr)

Top 5 users by PageRank:
32 0.00599751428212459
42 0.005897543015508085
638 0.005389242638293345
372 0.005086009601527967
400 0.004538513498373943
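
The top four users match the in-degree ranking, while user 400 enters the list without being among the top five by in-degree. PageRank can differ from in-degree because a node's score depends on the scores of the nodes pointing to it, not only on how many there are. The idea can be sketched with a simplified power iteration on a toy graph. This is an illustrative implementation under stated assumptions (damping factor 0.85, fixed iteration count, hypothetical function name and data), not the algorithm as implemented in NetworkX.

```python
def pagerank_sketch(edges, damping=0.85, iterations=100):
    """Simplified, unweighted PageRank by power iteration."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, tgt in edges:
        out_links[src].append(tgt)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Score held by nodes with no out-links is spread over all nodes
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1 - damping) / n + damping * dangling / n
                    for node in nodes}
        # Every node passes its score on in equal shares to its targets
        for src in nodes:
            for tgt in out_links[src]:
                new_rank[tgt] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical toy graph: node 3 receives from 1, 2 and 4 and links back to 1
toy_edges = [(1, 3), (2, 3), (3, 1), (4, 3)]
scores = pagerank_sketch(toy_edges)
top_node = max(scores, key=scores.get)
print(top_node, round(sum(scores.values()), 6))
```

Node 1 has a single incoming edge, yet it scores far above nodes 2 and 4 because that edge comes from the highly ranked node 3.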




---

### 5. Interactive Network Visualization

Visualize the network interactively with Plotly. For large networks, we’ll show both a full network (simplified) and a subgraph of central nodes.

#### **Full Network Visualization**
Plot the entire network (all 1899 nodes) with a spring layout.

In [19]:
# Generate node positions (fixed seed so the layout is reproducible)
pos = nx.spring_layout(G, seed=42)

# Edge trace
edge_x = []
edge_y = []
for edge in G.edges():
    x0, y0 = pos[edge[0]]
    x1, y1 = pos[edge[1]]
    edge_x.extend([x0, x1, None])
    edge_y.extend([y0, y1, None])

edge_trace = go.Scatter(
    x=edge_x, y=edge_y,
    line=dict(width=0.5, color='#888'),
    hoverinfo='none',
    mode='lines'
)

# Node trace, colored by total degree (in-degree + out-degree)
node_x = [pos[node][0] for node in G.nodes()]
node_y = [pos[node][1] for node in G.nodes()]
node_degree = [G.degree(node) for node in G.nodes()]
node_trace = go.Scatter(
    x=node_x, y=node_y,
    mode='markers',
    hoverinfo='text',
    text=[f'Node: {node}' for node in G.nodes()],
    marker=dict(
        size=10,
        color=node_degree,  # without a color array the colorbar has nothing to map
        colorscale='YlGnBu',
        showscale=True,
        colorbar=dict(thickness=15, title='Node Connections'),
        line_width=2
    )
)

# Create and show figure
fig = go.Figure(data=[edge_trace, node_trace],
                layout=go.Layout(
                    title='Full Network',
                    showlegend=False,
                    hovermode='closest',
                    margin=dict(b=20, l=5, r=5, t=40),
                    xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                    yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)
                ))
fig.show()

- **What You Get**: An interactive graph where you can zoom, pan, and hover over nodes to see their IDs. Note that large networks may appear cluttered.

#### **Subgraph of Central Nodes**
Focus on nodes with an in-degree ≥ 10 for a clearer view.

In [20]:
# Select central nodes (in-degree >= 10)
threshold = 10
sub_nodes = [node for node, deg in in_degrees.items() if deg >= threshold]
sub_G = G.subgraph(sub_nodes)

# Generate positions (fixed seed so the layout is reproducible)
pos_sub = nx.spring_layout(sub_G, seed=42)

# Edge trace
edge_x_sub = []
edge_y_sub = []
for edge in sub_G.edges():
    x0, y0 = pos_sub[edge[0]]
    x1, y1 = pos_sub[edge[1]]
    edge_x_sub.extend([x0, x1, None])
    edge_y_sub.extend([y0, y1, None])

edge_trace_sub = go.Scatter(
    x=edge_x_sub, y=edge_y_sub,
    line=dict(width=0.5, color='#888'),
    hoverinfo='none',
    mode='lines'
)

# Node trace with size based on in-degree
node_x_sub = [pos_sub[node][0] for node in sub_G.nodes()]
node_y_sub = [pos_sub[node][1] for node in sub_G.nodes()]
node_sizes_sub = [in_degrees.get(node, 0) for node in sub_G.nodes()]

node_trace_sub = go.Scatter(
    x=node_x_sub, y=node_y_sub,
    mode='markers',
    hoverinfo='text',
    text=[f'Node: {node}<br>In-degree: {in_degrees[node]}' for node in sub_G.nodes()],
    marker=dict(
        size=[6 + 0.3 * size for size in node_sizes_sub],  # Compress sizes so hubs (in-degree > 100) do not cover the plot
        color=node_sizes_sub,
        colorscale='YlGnBu',
        showscale=True,
        colorbar=dict(thickness=15, title='In-degree'),
        line_width=2
    )
)

# Create and show figure
fig_sub = go.Figure(data=[edge_trace_sub, node_trace_sub],
                    layout=go.Layout(
                        title='Subgraph (In-Degree ≥ 10)',
                        showlegend=False,
                        hovermode='closest',
                        margin=dict(b=20, l=5, r=5, t=40),
                        xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                        yaxis=dict(showgrid=False, zeroline=False, showticklabels=False)
                    ))
fig_sub.show()


- **What You Get**: A more focused interactive graph where node sizes reflect in-degree, and hovering reveals node IDs and their in-degrees.
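
The filtering step can be illustrated on a small example. `G.subgraph` returns the induced subgraph, so an edge survives only if both of its endpoints pass the threshold, while the in-degrees used for sizing and hover text still come from the full graph. The sketch below mimics this in plain Python on hypothetical data; the edge list and threshold are made up for illustration.

```python
from collections import Counter

# Hypothetical aggregated edges: (sender, recipient)
edges = [(1, 2), (3, 2), (4, 2), (2, 1), (3, 1), (4, 3)]
threshold = 2

# In-degree in the full graph = number of distinct senders per recipient
in_degree = Counter(tgt for _, tgt in edges)

# Keep nodes that meet the threshold, then keep edges between kept nodes only
kept_nodes = {node for node, deg in in_degree.items() if deg >= threshold}
induced_edges = [(s, t) for s, t in edges if s in kept_nodes and t in kept_nodes]

print(sorted(kept_nodes), induced_edges)
```

Node 2 keeps its full-graph in-degree of 3 for display purposes, even though only one of its incoming edges remains in the subgraph.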

---

### 6. Temporal Analysis of a Top User

Visualize how many messages the most popular user (i.e., the one with the highest in-degree) received per day.

In [21]:
# Select top user by in-degree
top_user = top_in[0][0]  # User 32 in the output above
user_received = df[df['TGT'] == top_user]
received_per_day = user_received.groupby('date').size().reset_index(name='count')

# Create interactive line chart
fig = go.Figure()
fig.add_trace(go.Scatter(
    x=received_per_day['date'], 
    y=received_per_day['count'], 
    mode='lines', 
    name=f'Messages Received by User {top_user}'
))
fig.update_layout(
    title=f'Messages Received by User {top_user} per Day',
    xaxis_title='Date',
    yaxis_title='Number of Messages'
)
fig.show()


- **What You Get**: An interactive line chart to explore the top user’s received messages, with zoom and hover capabilities.
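
One caveat when reading this chart: grouping by date yields rows only for days on which at least one message arrived, so the line is drawn straight across quiet days instead of dropping to zero. A minimal sketch of filling those gaps, using the standard library and hypothetical dates, might look like this.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical dates on which one user received messages (one entry per message)
received_dates = [date(2004, 4, 20), date(2004, 4, 20), date(2004, 4, 23)]

counts = Counter(received_dates)
first_day, last_day = min(counts), max(counts)
n_days = (last_day - first_day).days + 1

# One entry per calendar day, with 0 where no message arrived
daily_series = []
for offset in range(n_days):
    day = first_day + timedelta(days=offset)
    daily_series.append((day, counts.get(day, 0)))

print([count for _, count in daily_series])
```

In pandas, a similar result is usually obtained by reindexing the daily counts onto a full `pd.date_range` with `fill_value=0`.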

---

### Conclusion

This report explored the CollegeMsg dataset of 59835 private messages exchanged among 1899 users between April and October 2004. Aggregating the messages into a directed graph gave 20296 edges and a density of about 0.0056, a sparse network in which most users have few contacts and a few have many. User 32 ranks first by both in-degree and PageRank. All figures are interactive Plotly charts, so when the notebook is run in Jupyter each one can be zoomed, panned, and hovered over for details on the message timeline, the degree distribution, and the network structure.