# Twitter Graph Analysis
This notebook loads a pre-computed, pickled NetworkX graph, converts its node attributes to a pandas DataFrame, and compares PageRank scores, follower counts, and node degrees.

## Steps:
1. Load the graph from a pickle file.
2. Inspect the graph's basic stats (number of nodes, edges).
3. Convert node attributes to a DataFrame.
4. Run example queries and visualizations:
   - Top accounts by PageRank.
   - Top accounts by follower count.
   - Correlation between follower count and PageRank.
   - In-degree and out-degree distributions.
   - Recorded follower count versus in-degree.


In [3]:
# Setup: import required libraries
import pickle
import networkx as nx
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from pathlib import Path

# For nicer plots
sns.set_theme(style='whitegrid')


In [None]:
# Steps 1-2: Load the pickled graph and inspect its basic stats
# Note: unpickling can execute arbitrary code, so only load pickle files from a trusted source.
pickle_path = Path('graph_with_pagerank.pickle')  # Adjust if your pickle file has a different name
with open(pickle_path, 'rb') as f:
    G = pickle.load(f)

print(f"Loaded graph with {G.number_of_nodes()} nodes and {G.number_of_edges()} edges.")
pagerank_params = G.graph.get('pagerank_params', {})
print("PageRank parameters used:", pagerank_params)


In [None]:
# Step 3: Convert node attributes to a DataFrame
# We'll collect user_id, username, follower_count, following_count, is_verified, and pagerank_score.

node_data = []
for node, attrs in G.nodes(data=True):
    node_data.append({
        'user_id': node,
        'username': attrs.get('username', 'unknown'),
        'follower_count': attrs.get('follower_count', 0),
        'following_count': attrs.get('following_count', 0),
        'is_verified': attrs.get('is_verified', 0),
        'pagerank_score': attrs.get('pagerank_score', 0.0)
    })

df = pd.DataFrame(node_data)
print(f"DataFrame shape: {df.shape}")
df.head()
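
The loop above fills in a default whenever an attribute is absent, so a missing `follower_count` looks the same as a real zero. A minimal sketch of how one might count the gaps before trusting the defaults; `count_missing_attrs` and the toy graph are hypothetical and stand in for `G`:

```python
import networkx as nx


def count_missing_attrs(graph, attr_names):
    """Return {attribute name: number of nodes that lack that attribute}."""
    missing = {name: 0 for name in attr_names}
    for _, attrs in graph.nodes(data=True):
        for name in attr_names:
            if name not in attrs:
                missing[name] += 1
    return missing


# Toy graph with made-up attributes (illustration only)
toy = nx.DiGraph()
toy.add_node(1, username='alice', follower_count=10, pagerank_score=0.5)
toy.add_node(2, username='bob')
toy.add_node(3)

missing_report = count_missing_attrs(
    toy, ['username', 'follower_count', 'pagerank_score']
)
print(missing_report)
```

On the real graph, the call would be `count_missing_attrs(G, [...])` with the attribute names used in Step 3.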


## Basic Statistics

In [None]:
# Let's see how many verified vs. non-verified accounts are present
verified_counts = df['is_verified'].value_counts()
print("Verified distribution:")
print(verified_counts)

# Distribution of follower counts
print("\nFollower count stats:")
print(df['follower_count'].describe())


## Top Accounts by PageRank Score

In [None]:
top_by_pagerank = df.sort_values(by='pagerank_score', ascending=False).head(10)
top_by_pagerank

## Top Accounts by Follower Count

In [None]:
top_by_followers = df.sort_values(by='follower_count', ascending=False).head(10)
top_by_followers

## Correlation between Follower Count and PageRank

In [None]:
# Pearson correlation (the default method of DataFrame.corr)
corr_val = df[['follower_count', 'pagerank_score']].corr().iloc[0, 1]
print(f"Correlation between follower_count and pagerank_score: {corr_val:.4f}")
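
`DataFrame.corr()` computes the Pearson coefficient by default, which a handful of very large accounts can dominate. As a complementary check, a rank-based (Spearman) coefficient only asks whether the two columns increase together. The sketch below uses a made-up `toy_df` rather than the notebook's `df`:

```python
import pandas as pd

# Made-up values: follower counts span several orders of magnitude
toy_df = pd.DataFrame({
    'follower_count': [10, 100, 1_000, 10_000, 1_000_000],
    'pagerank_score': [0.01, 0.02, 0.03, 0.04, 0.05],
})
cols = ['follower_count', 'pagerank_score']

# Pearson measures linear association and is sensitive to the extreme value
pearson = toy_df[cols].corr(method='pearson').iloc[0, 1]
# Spearman measures monotonic association using ranks only
spearman = toy_df[cols].corr(method='spearman').iloc[0, 1]

print(f"Pearson:  {pearson:.4f}")
print(f"Spearman: {spearman:.4f}")
```

Here the relationship is perfectly monotonic, so the Spearman value is 1 while the Pearson value is lower. On the real data, a large gap between the two suggests the relationship is monotonic but not linear.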


In [None]:
# Scatterplot of PageRank Score vs. Follower Count
plt.figure(figsize=(8,6))
sns.scatterplot(data=df, x='follower_count', y='pagerank_score', alpha=0.6)
plt.title('PageRank Score vs. Follower Count')
plt.xscale('log')  # log scale spreads out the wide range of follower counts
plt.yscale('log')  # log scale on y as well; zero values cannot be shown on a log axis
plt.show()

## In-Degree and Out-Degree Analysis
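We can also look at the graph structure by computing degrees. In a directed graph, in-degree counts incoming edges and out-degree counts outgoing edges. If an edge `u -> v` means that `u` follows `v` (an assumption; check how the graph was built), a node's in-degree is the number of its followers that appear in the dataset.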

In [None]:
in_degrees = dict(G.in_degree())
out_degrees = dict(G.out_degree())

# Add these as columns in the df
df['in_degree'] = df['user_id'].map(in_degrees)
df['out_degree'] = df['user_id'].map(out_degrees)

df[['user_id','username','in_degree','out_degree','pagerank_score']].head(10)
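
To make the degree columns concrete, here is a small sketch on a toy directed graph. It assumes an edge `u -> v` means that `u` follows `v`; the usernames and edges are made up:

```python
import networkx as nx
import pandas as pd

# Toy follow graph: bob and carol follow alice, alice follows bob
toy = nx.DiGraph()
toy.add_edges_from([('bob', 'alice'), ('carol', 'alice'), ('alice', 'bob')])

toy_df = pd.DataFrame({'user_id': list(toy.nodes())})
# in_degree = incoming edges (followers present in the graph)
toy_df['in_degree'] = toy_df['user_id'].map(dict(toy.in_degree()))
# out_degree = outgoing edges (accounts this user follows in the graph)
toy_df['out_degree'] = toy_df['user_id'].map(dict(toy.out_degree()))

print(toy_df)
```

Under that assumption, alice has an in-degree of 2 and carol, whom nobody follows, has an in-degree of 0.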

## Visualizing Degree Distributions

In [None]:
plt.figure(figsize=(12,5))
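# A log-scaled x-axis cannot represent zero, so only nodes with a positive degree are plotted.
plt.subplot(1,2,1)
sns.histplot(df.loc[df['in_degree'] > 0, 'in_degree'], log_scale=(True, False), bins=30)
plt.title('In-Degree Distribution (in-degree > 0)')

plt.subplot(1,2,2)
sns.histplot(df.loc[df['out_degree'] > 0, 'out_degree'], log_scale=(True, False), bins=30)
plt.title('Out-Degree Distribution (out-degree > 0)')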

plt.tight_layout()
plt.show()

## Sample Query: Largest In-Degree vs. Follower Count
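This query checks whether a node's recorded `follower_count` matches its in-degree in the dataset. Assuming edges point from follower to followed account, a large positive `follower_diff` means that most of that account's followers are not in the graph.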


In [None]:
df['follower_diff'] = df['follower_count'] - df['in_degree']
df_sorted = df.sort_values(by='in_degree', ascending=False).head(10)
df_sorted[['user_id','username','in_degree','follower_count','follower_diff']]
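
The absolute difference is hard to compare across accounts of very different sizes. One possible complement is a coverage ratio, `in_degree / follower_count`, which estimates the share of recorded followers captured in the graph. The sketch below uses a made-up `toy_df`, and the `coverage` column is a hypothetical name:

```python
import pandas as pd

# Made-up values standing in for the notebook's df
toy_df = pd.DataFrame({
    'username': ['alice', 'bob', 'carol'],
    'follower_count': [200, 0, 50],
    'in_degree': [50, 0, 50],
})

# Mask zero follower counts so the division yields NaN instead of inf
denominator = toy_df['follower_count'].where(toy_df['follower_count'] > 0)
toy_df['coverage'] = toy_df['in_degree'] / denominator

print(toy_df)
```

A coverage near 1 means the graph contains almost all of an account's recorded followers; values above 1 would indicate that the recorded count is stale or that the edge direction assumption does not hold.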

## Final Thoughts
You can continue adding custom queries, grouping, or advanced analytics here.  
Additional ideas:
- Look at subgraphs (e.g., only verified users).
- Conduct BFS from a seed user to see how influence might propagate.
- Compare multiple PageRank runs with different weighting schemes.
- If you have time-series data (like different snapshots of the follow graph), do a temporal analysis.
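
The first two ideas can be sketched on a toy graph as follows. This is an illustration under assumptions, not the notebook's method: the node IDs, the `is_verified` values, and the seed node are made up, and hop distance is only a rough proxy for how influence might propagate.

```python
import networkx as nx

# Toy directed graph with made-up verification flags
toy = nx.DiGraph()
toy.add_nodes_from([
    (1, {'is_verified': 1}),
    (2, {'is_verified': 0}),
    (3, {'is_verified': 1}),
    (4, {'is_verified': 0}),
])
toy.add_edges_from([(1, 2), (2, 3), (3, 1), (4, 1)])

# Subgraph induced by verified users only
verified_nodes = [n for n, attrs in toy.nodes(data=True) if attrs.get('is_verified', 0)]
verified_sub = toy.subgraph(verified_nodes)
print("Verified subgraph edges:", list(verified_sub.edges()))

# Breadth-first hop distances from a seed node, following edge direction,
# limited to nodes within 2 hops
seed = 1
hops = dict(nx.single_source_shortest_path_length(toy, seed, cutoff=2))
print("Hop distance from seed:", hops)
```

Node 4 does not appear in `hops` because no directed path leads from the seed to it. To walk edges in the opposite direction, the same call can be run on `toy.reverse()`.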

Feel free to adapt any part of the above code to fit your exploration needs!