# 01_data_cleaning

**Purpose:**  
1. Load the raw MGTAB graph via our `MGTAB` loader  
2. Profile key elements (features, labels, edges) for missing values, duplicates, self‐loops  
3. Perform simple cleaning operations (e.g. remove self‐loops, dedupe edges)  
4. Save cleaned artifacts for downstream use  


In [1]:
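import sys, os
# ensure repo root is the working directory and on sys.path
# (guarded so that re-running this cell does not climb another directory level)
if not os.path.isdir('src'):
    os.chdir(os.path.abspath('..'))
sys.path.insert(0, os.path.abspath('.'))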

import torch
import pandas as pd
import numpy as np
import yaml
from src.data.mgtab_dataset import MGTAB
from torch_geometric.utils import to_networkx, remove_self_loops, coalesce


In [2]:
# load config (yaml and MGTAB are already imported in the first cell)
cfg_path = os.path.join('config', 'config.yaml')
print("Loading config from:", cfg_path)
with open(cfg_path) as f:
    cfg = yaml.safe_load(f)

# instantiate dataset and take its first graph
dataset = MGTAB(root=cfg['mgtab_root'])
data = dataset[0]

print(data)


Loading config from: config/config.yaml
Data(x=[10199, 788], edge_index=[2, 1700108], edge_type=[1700108], edge_weight=[1700108], y_stance=[10199], y_bot=[10199], train_mask=[10199], val_mask=[10199], test_mask=[10199])


## 4.1 Feature Matrix Inspection
- Shape  
- Missing or NaN values per feature  
- Feature distribution summary  


In [3]:
feat = data.x.numpy()
print("Features shape:", feat.shape)

# NaN counts per feature column
nan_counts = np.isnan(feat).sum(axis=0)
print("Num features with NaNs:", np.count_nonzero(nan_counts))
top5 = np.argsort(-nan_counts)[:5]
print("Top 5 features by NaN count:", top5, nan_counts[top5])

# If any NaNs, replace with column median
if np.any(nan_counts):
    medians = np.nanmedian(feat, axis=0)
    inds = np.where(np.isnan(feat))
    feat[inds] = np.take(medians, inds[1])
    data.x = torch.tensor(feat, dtype=torch.float)
    print("Replaced NaNs with medians.")


Features shape: (10199, 788)
Num features with NaNs: 0
Top 5 features by NaN count: [  0 520 521 522 523] [0 0 0 0 0]


**Shape:** (10199, 788)
We have 10 199 nodes (users) and 788 features per node. That is a fairly high-dimensional feature space: plenty of information for later classification, though we may eventually consider dimensionality reduction or feature selection to speed up modeling.

**No NaNs anywhere**
Since no feature contains NaN values, the median-imputation branch above is never triggered and no feature needs to be dropped. Note that this check covers NaNs only; infinite or constant-valued features would not be flagged by it.
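
The section outline also lists a feature distribution summary, which the cell above does not print. Below is a minimal sketch of one way to produce it, assuming only NumPy. The helper name `summarize_features` is hypothetical, and the toy matrix stands in for `data.x.numpy()`.

```python
import numpy as np

def summarize_features(feat):
    """Return per-column summary statistics for a 2-D feature matrix."""
    feat = np.asarray(feat, dtype=float)
    col_min = np.nanmin(feat, axis=0)
    col_max = np.nanmax(feat, axis=0)
    return {
        "mean": np.nanmean(feat, axis=0),
        "std": np.nanstd(feat, axis=0),
        "min": col_min,
        "max": col_max,
        # columns whose values never vary carry no signal for a classifier
        "n_constant": int(np.sum(col_max == col_min)),
    }

# Toy stand-in for data.x.numpy(): 4 nodes, 3 features (the last one is constant)
toy = np.array([[0.0, 1.0, 5.0],
                [1.0, 3.0, 5.0],
                [2.0, 5.0, 5.0],
                [3.0, 7.0, 5.0]])
summary = summarize_features(toy)
print("constant columns:", summary["n_constant"])
print("per-feature mean:", summary["mean"])
```

On the real matrix this would be called as `summarize_features(feat)`. Constant columns are natural first candidates if feature selection is attempted later.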

## 4.2 Label Distribution
- Bot vs non-bot counts  
- Stance label distribution  


In [4]:
y_bot = data.y_bot.numpy()
y_stance = data.y_stance.numpy()
print("Bot label counts:", np.bincount(y_bot))
print("Stance label counts:", np.bincount(y_stance))


Bot label counts: [7451 2748]
Stance label counts: [3776 3637 2786]


**Bot labels: [7 451 non-bots, 2 748 bots]**
About 27 % of the accounts are labeled as bots, a ratio of roughly 2.7:1 in favor of non-bots. This class imbalance is moderate, but we should still monitor per-class precision and recall (or use a class-weighted loss) so the model is not biased toward the majority class.
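
As an illustration of the class-weighted option, here is a minimal sketch that derives inverse-frequency weights from the bot label counts printed above. The helper name `inverse_frequency_weights` is hypothetical, and the formula N / (K · n_c) is one common convention rather than something this notebook prescribes.

```python
def inverse_frequency_weights(counts):
    """Weight each class by N / (K * n_c) so that rarer classes count more."""
    total = sum(counts)
    n_classes = len(counts)
    return [total / (n_classes * c) for c in counts]

bot_counts = [7451, 2748]  # from np.bincount(y_bot) above
weights = inverse_frequency_weights(bot_counts)
print([round(w, 3) for w in weights])
```

With this convention the weighted class counts sum back to the total number of nodes, so the overall loss scale stays comparable. The weights could then be converted to a tensor and passed as the `weight` argument of `torch.nn.CrossEntropyLoss`.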

**Stance labels: [3 776, 3 637, 2 786]**
The three stance categories are roughly balanced (about 37 %, 36 %, and 27 %). No single class dominates unduly.

## 4.3 Edge Inspection & Cleaning
- Total edges (directed?)  
- Self‐loops  
- Duplicate edges  


In [5]:
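# Edges are treated as directed here: (u, v) and (v, u) stay distinct
edge_index = data.edge_index
print("Raw edge count:", edge_index.shape[1])

# 1) Remove self-loops
edge_index_clean, _ = remove_self_loops(edge_index)
print("After self-loop removal:", edge_index_clean.shape[1])

# 2) Deduplicate & sort edges (coalesce). Attributes of duplicate edges are
#    summed, so each surviving edge gets weight = number of raw edges merged.
edge_index_clean, edge_attr = coalesce(
    edge_index_clean,
    torch.ones(edge_index_clean.shape[1]),
    num_nodes=data.num_nodes
)
print("After dedupe (coalesce):", edge_index_clean.shape[1])

# Update data
data.edge_index = edge_index_clean
data.edge_weight = edge_attr  # multiplicity of each merged edge (replaces the raw weights)
# edge_type still has one entry per raw edge and no longer lines up with
# edge_index, so drop it rather than save a misaligned attribute
del data.edge_type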

Raw edge count: 1700108
After self-loop removal: 1494289
After dedupe (coalesce): 1132578


**Raw edges: 1 700 108**
Our original graph had 1.7 million directed edges.

**After self-loop removal: 1 494 289**
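We dropped 205 819 self-loops (≈12 % of all edges). These may come from accounts replying to or mentioning themselves, although we have not verified the cause here. Removing them keeps self-interactions from inflating degree-based measures of influence.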

**After dedupe (coalesce): 1 132 578**
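Merging duplicate edges removed another 361 711 edges. Coalescing collapses repeated interactions between the same ordered pair of users into a single edge whose weight is the number of merged edges, which keeps network metrics from double-counting and speeds up graph algorithms. Because this step looks only at `edge_index`, any parallel edges that differ only in relation type are merged as well.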
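
To make the two cleaning steps concrete, here is a pure-Python sketch on a toy edge list. It is a stand-in for `remove_self_loops` and `coalesce`, not their implementation, and the helper name `clean_edges` is hypothetical.

```python
from collections import Counter

def clean_edges(edges):
    """Drop self-loops, then merge parallel directed edges.

    Returns a sorted list of (src, dst, weight), where weight is the
    number of raw edges merged into that ordered pair.
    """
    no_loops = [(s, d) for s, d in edges if s != d]
    multiplicity = Counter(no_loops)
    return [(s, d, w) for (s, d), w in sorted(multiplicity.items())]

# Toy directed edge list: one self-loop, one pair repeated three times
raw = [(0, 1), (0, 1), (2, 2), (1, 0), (0, 1), (1, 2)]
cleaned = clean_edges(raw)
print(cleaned)
```

Note that `(0, 1)` and `(1, 0)` remain separate entries because direction is preserved. The weights of the cleaned edges add up to the number of raw edges that were not self-loops.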

## 4.4 Connectivity Check
- Fraction of nodes in the largest connected component  


In [6]:
import networkx as nx
G = to_networkx(data, to_undirected=True)
comp_sizes = [len(c) for c in nx.connected_components(G)]
largest = max(comp_sizes)
print(f"Largest component: {largest} / {data.num_nodes} nodes ({largest/data.num_nodes:.2%})")


Largest component: 10145 / 10199 nodes (99.47%)


**Largest component: 10 145 / 10 199 nodes (99.47 %)**
Treating edges as undirected, virtually the entire graph forms one connected component; only 54 nodes fall outside it. Almost every account is therefore reachable from every other via some path, which is convenient for community detection because very few nodes need special handling as isolated fragments.
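
The cell above reports only the largest component, so it does not show how the remaining nodes are distributed. A minimal sketch of a follow-up check is given below. The helper name `summarize_components` is hypothetical, and the toy list stands in for `comp_sizes`.

```python
from collections import Counter

def summarize_components(comp_sizes):
    """Split component sizes into the largest one and a histogram of the rest."""
    sizes = sorted(comp_sizes, reverse=True)
    largest, rest = sizes[0], sizes[1:]
    return {
        "largest": largest,
        "nodes_outside": sum(rest),
        "n_small_components": len(rest),
        # maps component size -> how many components have that size
        "size_histogram": dict(Counter(rest)),
    }

# Toy sizes standing in for comp_sizes from the cell above
toy_sizes = [1, 95, 2, 1, 1]
report = summarize_components(toy_sizes)
print(report)
```

Calling `summarize_components(comp_sizes)` on the real graph would show whether the nodes outside the largest component are singletons or small clusters.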

In [7]:
# Save cleaned PyG Data for analysis phase
cleaned_path = os.path.join(cfg['mgtab_root'], 'processed', 'cleaned_data.pt')
os.makedirs(os.path.dirname(cleaned_path), exist_ok=True)
torch.save(data, cleaned_path)
print("Cleaned data saved to:", cleaned_path)


Cleaned data saved to: data/processed/cleaned_data.pt
