# Social Bot Detection & Network Forensics

## Project Overview
This analysis project combines principles from **Data Visualization Theory**, **Statistics**, **Social Network Analysis**, and **Computational Linguistics** to identify automated behavior on social media.

The methodology is derived from three foundational papers in the field:
1.  **Varol et al. (2017):** *"The Anatomy of a Social Bot"* (Feature Engineering & Random Forests).
2.  **Ferrara et al. (2016):** *"The Rise of Social Bots"* (Network Topology & Botometer).
3.  **Complex Networks (2020):** *"Coordinated Inauthentic Behavior"* (Community Detection & Content Similarity).

### Prerequisites
Run the following cell to install the required libraries:

In [None]:
!pip install pandas numpy matplotlib seaborn networkx scikit-learn textblob wordcloud scipy

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import networkx as nx
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler
from wordcloud import WordCloud
from scipy.stats import entropy
import random
from datetime import datetime, timedelta
from math import pi
import itertools
import collections

# Set Global Aesthetic
sns.set_theme(style="whitegrid")
plt.rcParams['figure.figsize'] = (15, 10)
np.random.seed(42)
random.seed(42)

## Part 1: Feature Engineering (Varol et al. Framework)
**Objective:** Detect individual bots based on behavioral anomalies.

### Methodology
- **Shannon Entropy (Math):** $H(X) = -\sum_x p(x) \log_2 p(x)$, computed over the word frequencies of each post to measure linguistic complexity. Bots often reuse templates, resulting in low entropy.
- **Inter-Arrival Time (Method):** We calculate the standard deviation of the time between consecutive posts. Bots run from cron jobs post at near-fixed intervals, so this deviation is small.
- **Random Forest (Model):** Used to determine feature importance.
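
The first two features can be checked by hand on toy inputs. The following is a minimal sketch, not the notebook's implementation: the helper names `shannon_entropy` and `iat_std`, the two sample posts, and the two posting schedules are all made up for illustration. It uses the population standard deviation, which matches the default of `np.std`.

```python
import collections
import math
import statistics

def shannon_entropy(text):
    """Word-level Shannon entropy of a post, in bits (hypothetical helper)."""
    words = text.split()
    if not words:
        return 0.0
    total = len(words)
    counts = collections.Counter(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def iat_std(post_minutes):
    """Population std dev of the gaps between sorted post times, in minutes."""
    ordered = sorted(post_minutes)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return statistics.pstdev(gaps) if gaps else 0.0

# Illustrative posts (assumed): a repeated template vs. eight distinct words
template_post = "buy now buy now buy now"
organic_post = "coffee before work then a long weekend hike"

entropy_template = shannon_entropy(template_post)  # two equally likely words: 1 bit
entropy_organic = shannon_entropy(organic_post)    # eight equally likely words: 3 bits

# Illustrative posting times in minutes (assumed)
scheduled = [0, 120, 240, 360, 480]  # a post every 120 minutes
bursty = [0, 5, 9, 300, 310]         # short bursts separated by a long pause

print(entropy_template, entropy_organic)
print(iat_std(scheduled), iat_std(bursty))
```

The templated post and the fixed schedule both score at the low end of their feature, which is the region where the scatter plot in the next cell expects bots to cluster.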

In [None]:
# --- 1. Data Simulation (Feature Level) ---
def generate_feature_data(n_users=60, posts_per_user=15):
    data = []
    
    human_lexicon = ["love", "hate", "tired", "politics", "coffee", "weekend", "work", "random", "thought"]
    bot_lexicon = ["buy", "click", "deal", "now", "free", "limited", "offer", "crypto", "win"]
    start_time = datetime.now() - timedelta(days=30)

    for i in range(n_users):
        is_bot = i < (n_users * 0.25) # 25% Bots
        user_id = f"Bot_{i}" if is_bot else f"User_{i}"
        
        # Bot: Low Variance Intervals. Human: Bursty (Poisson).
        if is_bot:
            intervals = [120 + random.randint(-2, 2) for _ in range(posts_per_user)]
        else:
            intervals = [int(random.expovariate(1/120)) + 1 for _ in range(posts_per_user)]
            
        current_time = start_time
        for interval in intervals:
            current_time += timedelta(minutes=interval)
            
            # Content Generation
            if is_bot:
                # Low Entropy Template
                text = f"Click here for {random.choice(bot_lexicon)} {random.choice(bot_lexicon)}"
            else:
                # High Entropy Organic
                text = " ".join([random.choice(human_lexicon) for _ in range(random.randint(5, 12))])

            data.append({
                "user": user_id,
                "is_bot_ground_truth": is_bot,
                "timestamp": current_time,
                "text": text,
                "followers": random.randint(50, 500) if is_bot else random.randint(100, 2000)
            })
    return pd.DataFrame(data)

# --- 2. Feature Calculation ---
df_feat = generate_feature_data()

def calc_entropy(text):
    words = text.split()
    if not words: return 0
    counts = collections.Counter(words)
    probs = [c / len(words) for c in counts.values()]
    return entropy(probs, base=2)

df_feat['entropy'] = df_feat['text'].apply(calc_entropy)

# Aggregating per user
user_stats = df_feat.groupby('user').agg({
    'timestamp': list,
    'entropy': 'mean',
    'followers': 'first',
    'is_bot_ground_truth': 'first'
}).reset_index()

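def get_iat_std(timestamps):
    # Fewer than two posts means there is no gap to measure
    if len(timestamps) < 2: return 0
    ordered = sorted(timestamps)  # sort a copy so the list stored in user_stats is not mutated
    diffs = [(t2 - t1).total_seconds()/60 for t1, t2 in zip(ordered[:-1], ordered[1:])]
    return np.std(diffs)  # population std dev of the gaps, in minutes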

user_stats['iat_std'] = user_stats['timestamp'].apply(get_iat_std)

# --- 3. Modeling & Visualization ---
X = user_stats[['entropy', 'iat_std', 'followers']]
y = user_stats['is_bot_ground_truth']
# Fixed random_state keeps the importances reproducible across runs
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
importances = pd.DataFrame({'feature': X.columns, 'importance': rf.feature_importances_})

# Plotting
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(18, 6))

# Scatter: The "Math" Test
sns.scatterplot(x='entropy', y='iat_std', hue='is_bot_ground_truth', data=user_stats, 
                palette={True: 'red', False: 'blue'}, s=100, ax=ax1)
ax1.set_title("Feature Space: Entropy vs. Temporal Variance", fontsize=14)
ax1.set_xlabel("Linguistic Entropy (Randomness)")
ax1.set_ylabel("Inter-Arrival Time Std Dev (Regularity)")
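# Anchor the label on the bot centroid so it stays inside the plotted data range
bot_pts = user_stats[user_stats['is_bot_ground_truth']]
ax1.annotate("Bots Cluster Here\n(Low Entropy, Regular Timing)",
             xy=(bot_pts['entropy'].mean(), bot_pts['iat_std'].mean()),
             xytext=(0.05, 0.5), textcoords='axes fraction',
             arrowprops=dict(arrowstyle='->'), bbox=dict(facecolor='yellow', alpha=0.5))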

# Bar: Feature Importance
sns.barplot(x='importance', y='feature', data=importances, palette='viridis', ax=ax2)
ax2.set_title("Random Forest Feature Importance", fontsize=14)

plt.show()
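
The forest in the cell above is fit on every simulated user, and the imported `train_test_split` goes unused. A minimal sketch of a held-out check follows. It does not use the notebook's `user_stats`: the arrays `X_demo` and `y_demo` and the distributions behind them are assumptions made up for this example, so the printed score says nothing about real accounts.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_bots, n_humans = 50, 150

# Hypothetical per-user features: column 0 = mean entropy, column 1 = IAT std dev
bots = np.column_stack([rng.normal(2.0, 0.1, n_bots), rng.normal(1.5, 0.3, n_bots)])
humans = np.column_stack([rng.normal(2.8, 0.2, n_humans), rng.normal(110, 25, n_humans)])
X_demo = np.vstack([bots, humans])
y_demo = np.array([True] * n_bots + [False] * n_humans)

# Stratify so both splits keep the same bot/human proportion
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)  # accuracy on users the model never saw
print(f"held-out accuracy: {test_accuracy:.2f}")
```

Because the two synthetic groups barely overlap, a high score here is expected and only confirms that the split and fit are wired correctly.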

## Part 2: Network Topology & Infiltration (Ferrara et al.)
**Objective:** Detect bots attempting to infiltrate a network.

### Methodology
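- **Reciprocity (Math):** $\rho = \frac{L^\leftrightarrow}{L}$, the share of all links $L$ that are reciprocated ($L^\leftrightarrow$). Bots follow many accounts, but few follow back. The code below uses the in-degree to out-degree ratio as a per-user proxy.
- **Clustering Coefficient:** Humans form triangles (communities); bots form stars (broadcasters).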
- **The Radar Chart:** A visualization technique to compare the "Digital DNA" of bots vs humans.
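
Reciprocity and clustering can be read off a toy graph before running the full simulation. This is a small sketch under stated assumptions: the node names and the seven-node graph `G_demo` are invented for illustration, with three humans who all follow each other and one bot that follows three unconnected strangers.

```python
import networkx as nx

G_demo = nx.DiGraph()

# Three humans who all follow each other back: a fully reciprocated triangle
humans = ["h1", "h2", "h3"]
for a in humans:
    for b in humans:
        if a != b:
            G_demo.add_edge(a, b)

# One broadcaster bot following three strangers; nobody follows back (a star)
for stranger in ["s1", "s2", "s3"]:
    G_demo.add_edge("bot", stranger)

# Graph-level reciprocity: 6 of the 9 directed links are reciprocated
rho = nx.overall_reciprocity(G_demo)

# Per-node reciprocity and local clustering
bot_rho = nx.reciprocity(G_demo, "bot")
human_rho = nx.reciprocity(G_demo, "h1")
bot_clustering = nx.clustering(G_demo, "bot")
human_clustering = nx.clustering(G_demo, "h1")

print(f"rho={rho:.3f}  bot: rho={bot_rho}, clustering={bot_clustering}")
print(f"human: rho={human_rho}, clustering={human_clustering}")
```

The bot scores zero on both measures while the human scores one on both, which is the contrast the radar chart summarizes across all nodes.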

In [None]:
# --- 1. Network Simulation ---
def generate_network_data(n_humans=40, n_bots=10):
    G = nx.DiGraph()
    users = [f"Human_{i}" for i in range(n_humans)] + [f"Bot_{i}" for i in range(n_bots)]
    
    for uid in users:
        if "Bot" in uid:
            # Bots: Aggressive following, low reciprocity
            targets = random.sample([u for u in users if u != uid], 20)
            for t in targets:
                G.add_edge(uid, t)
        else:
            # Humans: Reciprocal friendships
            targets = random.sample([u for u in users if u != uid], 5)
            for t in targets:
                G.add_edge(uid, t)
                if "Human" in t and random.random() < 0.7: G.add_edge(t, uid)
    return G

G_net = generate_network_data()

# --- 2. Calculate Metrics ---
# Compute graph-wide metrics once instead of recomputing them for every node
clustering = nx.clustering(G_net)          # local clustering (directed variant for a DiGraph)
centrality = nx.degree_centrality(G_net)

metrics = []
for node in G_net.nodes():
    # In-degree to out-degree ratio (+1 avoids division by zero)
    in_d = G_net.in_degree(node)
    out_d = G_net.out_degree(node)
    ratio = in_d / (out_d + 1)

    metrics.append({
        'user': node,
        'is_bot': "Bot" in node,
        'clustering': clustering[node],
        'follow_ratio': ratio,
        'centrality': centrality[node]
    })

df_net = pd.DataFrame(metrics)

# --- 3. Radar Chart Visualization ---
# Normalize for Radar
scaler = MinMaxScaler()
cols = ['clustering', 'follow_ratio', 'centrality']
df_norm = df_net.copy()
df_norm[cols] = scaler.fit_transform(df_net[cols])

avg_prof = df_norm.groupby('is_bot')[cols].mean()

categories = list(avg_prof.columns)
N = len(categories)
angles = [n / float(N) * 2 * pi for n in range(N)]
angles += [angles[0]]

fig = plt.figure(figsize=(18, 8))

# Radar Plot
ax1 = fig.add_subplot(1, 2, 1, polar=True)
values_h = avg_prof.loc[False].tolist(); values_h += [values_h[0]]
values_b = avg_prof.loc[True].tolist(); values_b += [values_b[0]]

ax1.plot(angles, values_h, linewidth=2, label='Humans', color='blue')
ax1.fill(angles, values_h, 'blue', alpha=0.1)
ax1.plot(angles, values_b, linewidth=2, label='Bots', color='red')
ax1.fill(angles, values_b, 'red', alpha=0.1)
ax1.set_xticks(angles[:-1])
ax1.set_xticklabels(categories)
ax1.set_title("The 'BotOrNot' Radar Signature", fontsize=15, pad=20)
ax1.legend()

# Network Graph
ax2 = fig.add_subplot(1, 2, 2)
pos = nx.spring_layout(G_net, k=0.15)
colors = ['red' if "Bot" in n else 'blue' for n in G_net.nodes()]
nx.draw_networkx_nodes(G_net, pos, node_color=colors, node_size=100, ax=ax2)
nx.draw_networkx_edges(G_net, pos, alpha=0.2, ax=ax2)
ax2.set_title("Infiltration Topology (Red=Bots)", fontsize=15)
ax2.axis('off')

plt.show()

## Part 3: Coordinated Inauthentic Behavior (Complex Networks)
**Objective:** Detect Bot Farms (groups of bots working together).

### Methodology
- **Projected Similarity Graph:** Instead of who follows whom, we map who *behaves* like whom.
- **Cosine Similarity (Math):** Measures the cosine of the angle between the TF-IDF vectors of two users' content: 1 means identical term profiles, 0 means no shared terms.
- **Community Detection:** Identifying dense clusters in the similarity graph.
- **Comparative Word Clouds:** Visualizing the specific narrative of the farm.
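
The first three steps can be traced on four toy accounts. This is a minimal sketch with assumed inputs: the account names, their texts, and the 0.5 threshold are chosen for illustration, and `greedy_modularity_communities` stands in for whichever community detection method is preferred.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical accounts: two push the same slogan, two chat about lunch
docs = {
    "acct_a": "buy crypto now buy crypto now",
    "acct_b": "buy crypto now",
    "acct_c": "lunch with friends downtown",
    "acct_d": "friends downtown for lunch",
}
names = list(docs)

# One TF-IDF vector per account, then all pairwise cosine similarities
vectors = TfidfVectorizer().fit_transform(list(docs.values()))
sim = cosine_similarity(vectors)

# Link two accounts when their content similarity exceeds the threshold
threshold = 0.5
G_toy = nx.Graph()
G_toy.add_nodes_from(names)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if sim[i, j] > threshold:
            G_toy.add_edge(names[i], names[j])

# Dense clusters in the similarity graph are candidate coordinated groups
communities_toy = greedy_modularity_communities(G_toy)
groups = sorted(sorted(c) for c in communities_toy)
print(groups)
```

`acct_a` and `acct_b` end up with similarity 1 because TF-IDF vectors are length-normalized, so repeating the same slogan does not change the direction of the vector.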

In [None]:
# --- 1. Simulation: The Bot Farm ---
def generate_farm_data(n_users=50):
    users = [f"User_{i}" for i in range(n_users)]
    farm_members = users[:15] # First 15 are the farm
    data = []
    
    narratives = ["Support the initiative", "The data is fake", "Buy crypto now"]
    
    for i in range(300):
        user = random.choice(users)
        if user in farm_members:
            # Farm: High Coordinated Content
            base = random.choice(narratives)
            text = base + random.choice(["!", ".", "!!"])
        else:
            # Organic: Random Content
            text = f"Just having {random.choice(['lunch', 'fun', 'trouble'])} today"
        
        data.append({'user': user, 'text': text})
    return pd.DataFrame(data), farm_members

df_farm, farm_members = generate_farm_data()

# --- 2. Projected Similarity Graph ---
# Group text by user
user_docs = df_farm.groupby('user')['text'].apply(' '.join).reset_index()

# TF-IDF & Cosine Sim
tfidf = TfidfVectorizer(stop_words='english')
matrix = tfidf.fit_transform(user_docs['text'])
sim_matrix = cosine_similarity(matrix)

# Build Graph
G_sim = nx.Graph()
users = user_docs['user'].tolist()
threshold = 0.5
rows, cols = np.where(sim_matrix > threshold)
edges = [(users[r], users[c]) for r, c in zip(rows, cols) if r != c]
G_sim.add_edges_from(edges)

# Detect Communities
communities = list(nx.community.greedy_modularity_communities(G_sim))
comm_map = {}
for i, comm in enumerate(communities):
    for n in comm: comm_map[n] = i

# --- 3. Visualizations ---
fig = plt.figure(figsize=(20, 12))

# A. Network Projection
ax1 = fig.add_subplot(2, 2, 1)
pos = nx.spring_layout(G_sim, k=0.2)
colors = [comm_map.get(n, -1) for n in G_sim.nodes()]
nx.draw_networkx_nodes(G_sim, pos, node_color=colors, cmap='tab10', node_size=200, ax=ax1)
nx.draw_networkx_edges(G_sim, pos, alpha=0.3, ax=ax1)
ax1.set_title("Projected Similarity Graph (Clusters = Coordination)", fontsize=14)
ax1.axis('off')

# B. Similarity Heatmap
ax2 = fig.add_subplot(2, 2, 2)
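# Rows of sim_matrix follow `users`, so look up each user's community before sorting
labels = [comm_map.get(u, len(communities)) for u in users]  # users outside the graph sort last
sorted_idx = np.argsort(labels, kind='stable')  # group rows by community to show blocks
sorted_sim = sim_matrix[sorted_idx, :][:, sorted_idx]
sns.heatmap(sorted_sim, cmap='viridis', ax=ax2)
ax2.set_title("Similarity Heatmap (Block Diagonal = Bot Farm)", fontsize=14)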

# C. Comparative Word Clouds
# Isolate texts
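# Communities come back largest-first, and organic users outnumber the farm here,
# so communities[0] is not guaranteed to be the farm. Simulation only: pick the
# community that overlaps the ground-truth farm_members the most.
farm_idx = max(range(len(communities)), key=lambda i: len(set(communities[i]) & set(farm_members)))
farm_detected = list(communities[farm_idx])
organic_detected = list(itertools.chain.from_iterable(
    c for i, c in enumerate(communities) if i != farm_idx))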

text_farm = " ".join(df_farm[df_farm['user'].isin(farm_detected)]['text'])
text_org = " ".join(df_farm[df_farm['user'].isin(organic_detected)]['text'])

ax3 = fig.add_subplot(2, 2, 3)
wc_farm = WordCloud(background_color='black', colormap='Reds').generate(text_farm)
ax3.imshow(wc_farm, interpolation='bilinear')
ax3.set_title("Detected Bot Narrative", fontsize=14, color='darkred')
ax3.axis('off')

ax4 = fig.add_subplot(2, 2, 4)
wc_org = WordCloud(background_color='white', colormap='Blues').generate(text_org)
ax4.imshow(wc_org, interpolation='bilinear')
ax4.set_title("Organic Conversation", fontsize=14, color='darkblue')
ax4.axis('off')

plt.tight_layout()
plt.show()