<a href="https://colab.research.google.com/github/fjadidi2001/fake_news_detection/blob/main/BERTGNN_Apr2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem Definition
> This project combines Graph Neural Networks (GNNs) for social network analysis with BERT for text processing, using the facebook-fact-check.csv dataset and the existing embedding/modeling scripts. The dataset includes social network features (share_count, reaction_count, comment_count) and text (Context Post), which suits this hybrid approach. **The goal is to classify posts (binary classification: "mostly true" vs. all other ratings) by integrating graph-based social interactions and text semantics.**



# Project Overview



- Objective: Classify Facebook posts’ veracity using social network structure (via GNN) and text content (via BERT).

- Dataset: facebook-fact-check.csv (2282 rows, with account_id, post_id, network features, and Context Post).

- Output: Binary classification (0: "mostly true", 1: others).



# Step-by-Step Development Process



## Step 1: Data Preprocessing and Exploration

> Goal: Prepare the dataset for GNN and BERT, ensuring compatibility with both models.

Tasks:

1. Load and Inspect Data: Use the existing embedding script’s loading logic.

2. Labels: Map Rating to binary labels (0 vs. 1).

3. Network Features: Extract share_count, reaction_count, comment_count and standardize them.

4. Graph Construction: Create a graph where each node is a post (identified by its row index), an edge connects two posts published by the same account_id, and node features are the standardized network metrics.

5. Text Data: Keep Context Post raw for BERT input (no tokenization yet; the BERT tokenizer handles it in a later step).
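Task 2 relies on `Series.map`, which silently yields `NaN` for any rating string missing from the dictionary, and the later cast to `int` then fails. The following is a minimal sketch of a stricter mapping, assuming the same four rating strings used in the cell below. The helper name `encode_ratings` and the strip/lowercase normalization are illustrative assumptions, not part of the original scripts.

```python
import pandas as pd

# Same mapping as in the preprocessing cell: 0 = "mostly true", 1 = everything else
label2id = {
    "mostly true": 0,
    "mixture of true and false": 1,
    "no factual content": 1,
    "mostly false": 1,
}

def encode_ratings(ratings: pd.Series) -> pd.Series:
    """Map rating strings to 0/1 and fail loudly on unmapped values (hypothetical helper)."""
    # Assumption: differences in case or surrounding whitespace should not matter
    cleaned = ratings.str.strip().str.lower()
    encoded = cleaned.map(label2id)
    unmapped = cleaned[encoded.isna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unmapped Rating values: {list(unmapped)}")
    return encoded.astype(int)

toy = pd.Series(["mostly true", "Mostly False ", "no factual content"])
print(encode_ratings(toy).tolist())  # [0, 1, 1]
```

On the real data this would be called as `encode_ratings(df['Rating'])` before converting to a NumPy array.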



In [5]:
import pandas as pd
import numpy as np
from scipy import io as sio
from sklearn.preprocessing import StandardScaler
from google.colab import drive
import networkx as nx

drive.mount('/content/drive')
file_path = '/content/drive/MyDrive/Projects/Hayat/facebook-fact-check.csv'
df = pd.read_csv(file_path, encoding='latin-1')

# Label mapping
label2id = {'mostly true': 0, 'mixture of true and false': 1, 'no factual content': 1, 'mostly false': 1}
df['Rating'] = df['Rating'].map(label2id)
y = df['Rating'].astype(int).to_numpy()
print("Label distribution:", np.bincount(y))

# Network features
network_cols = ['share_count', 'reaction_count', 'comment_count']
X_network = df[network_cols].fillna(0).to_numpy()
scaler = StandardScaler()
X_net_std = scaler.fit_transform(X_network)
print("X_network shape:", X_net_std.shape)  # (2282, 3)

# Graph construction: Use row indices as nodes
G = nx.Graph()
for idx in range(len(df)):
    G.add_node(idx, features=X_net_std[idx])

# Add edges between posts with same account_id
from itertools import combinations

account_groups = df.groupby('account_id').indices
for account_id, indices in account_groups.items():
    # Connect every pair of posts published by the same account
    G.add_edges_from(combinations(indices.tolist(), 2))

print("Graph nodes:", G.number_of_nodes(), "Edges:", G.number_of_edges())

# Save for later use
sio.savemat('labels.mat', {'y': y})
sio.savemat('network.mat', {'X_net_std': X_net_std})

Mounted at /content/drive
Label distribution: [1669  613]
X_network shape: (2282, 3)
Graph nodes: 2282 Edges: 368312
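Because every pair of posts from the same account is linked, each account with n posts contributes n(n-1)/2 edges, so the edge count printed above can be cross-checked from the account sizes alone. The sketch below shows that check on toy data; the function name `expected_edge_count` is a hypothetical helper, and it assumes there are no missing `account_id` values.

```python
from collections import Counter
from math import comb

def expected_edge_count(account_ids):
    """Number of edges when all posts of the same account are pairwise connected."""
    group_sizes = Counter(account_ids).values()
    # An account with n posts forms a clique with C(n, 2) edges
    return sum(comb(n, 2) for n in group_sizes)

# Toy example: account "a" has 3 posts, "b" has 2, "c" has 1
toy_accounts = ["a", "a", "a", "b", "b", "c"]
print(expected_edge_count(toy_accounts))  # 4
```

On the real data, `expected_edge_count(df['account_id'])` should equal `G.number_of_edges()` under the assumption stated above.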


## Step 2: Graph Neural Network (GNN) Setup

> Goal: Model social network interactions using a GNN (e.g., Graph Convolutional Network, GCN).
- Tools: Use torch_geometric for GNN implementation.


Tasks:
- Convert Graph to PyTorch Geometric Format: Map network.mat features to nodes and define edges.

- Define GCN Model: Process the 3-dimensional node features (the standardized network metrics) to produce node embeddings.

- Output: GNN embeddings for each post (e.g., 128D per node).
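To make the "Define GCN Model" task concrete, here is a minimal NumPy sketch of a single graph-convolution step using the standard symmetric normalization, relu(D^-1/2 (A + I) D^-1/2 X W). It is an illustration on a toy graph, not the torch_geometric implementation: the function name `gcn_layer`, the random weights, and the 4-dimensional output are assumptions, and the bias term is omitted.

```python
import numpy as np

def gcn_layer(adj, x, weight):
    """One GCN propagation step: relu(D^-1/2 (A + I) D^-1/2 X W)."""
    a_hat = adj + np.eye(adj.shape[0])        # add self-loops
    deg = a_hat.sum(axis=1)                   # degree including the self-loop
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetric normalization of the adjacency matrix
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ x @ weight, 0.0)

# Toy graph: posts 0 and 1 share an account, post 2 is isolated
adj = np.array([[0, 1, 0],
                [1, 0, 0],
                [0, 0, 0]], dtype=float)
# Three features per node, standing in for share/reaction/comment counts
x = np.array([[1.0, 0.0, 2.0],
              [3.0, 1.0, 0.0],
              [0.5, 0.5, 0.5]])
rng = np.random.default_rng(0)
weight = rng.normal(size=(3, 4))              # 3 input features -> 4 output features

h = gcn_layer(adj, x, weight)
print(h.shape)  # (3, 4)
```

In this toy graph, posts 0 and 1 receive identical embeddings because they aggregate the same neighborhood, while the isolated post 2 only sees its own features.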


In [6]:
!pip install torch torch-geometric -q



In [7]:
import torch
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv
from torch_geometric.utils import add_self_loops
import numpy as np
import os

In [None]:

# scipy.io must be imported in this session as well; without it, sio.loadmat
# raises "NameError: name 'sio' is not defined" (e.g., after a runtime restart).
# Installing the PyPI package named "sio" does not help; it is unrelated to scipy.
from scipy import io as sio
from torch_geometric.utils import to_undirected

# Reset CUDA environment
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # For precise CUDA error reporting
torch.cuda.empty_cache()

# Load network features
X_net_std = sio.loadmat('network.mat')['X_net_std']  # (2282, 3)

# Edge index from graph (using row indices).
# G is the networkx graph built in Step 1; re-run that cell after a restart.
edges = list(G.edges)
edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()
# networkx lists each undirected edge once, but PyTorch Geometric treats
# edge_index as directed, so add the reverse direction of every edge
edge_index = to_undirected(edge_index, num_nodes=X_net_std.shape[0])
x = torch.tensor(X_net_std, dtype=torch.float)  # Node features (2282, 3)

# Verify edge_index validity
print("Max edge index:", edge_index.max().item(), "Num nodes:", x.shape[0])
assert edge_index.max() < x.shape[0], "Edge indices exceed number of nodes!"

# Create PyTorch Geometric data object
data = Data(x=x, edge_index=edge_index)
print("GNN Data before self-loops:", data)

# Add self-loops (GCNConv also adds missing self-loops by default,
# so this step is optional)
edge_index, _ = add_self_loops(data.edge_index, num_nodes=data.num_nodes)
data.edge_index = edge_index
print("GNN Data after self-loops:", data)

# Define GCN model: 3 input features -> 64 hidden -> 128-dimensional embeddings
class GCN(torch.nn.Module):
    def __init__(self, in_channels=3, hidden_channels=64, out_channels=128):
        super().__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)
        self.relu = torch.nn.ReLU()

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = self.relu(x)
        x = self.conv2(x, edge_index)
        return x

# Test on CPU first
print("\nRunning on CPU:")
device = torch.device('cpu')
gcn_model = GCN().to(device)
data = data.to(device)
gcn_embeddings = gcn_model(data)
print("GCN Embeddings shape (CPU):", gcn_embeddings.shape)

# Then try CUDA (falls back to CPU when no GPU is available)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"\nRunning on {device.type}:")
gcn_model = GCN().to(device)
data = data.to(device)
gcn_embeddings = gcn_model(data)
print(f"GCN Embeddings shape ({device.type}):", gcn_embeddings.shape)