<a href="https://colab.research.google.com/github/fjadidi2001/fake_news_detection/blob/main/BERTGNN_Apr2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem Definition
> This project combines Graph Neural Networks (GNNs) for social network analysis with BERT for text processing, using the facebook-fact-check.csv dataset and the existing embedding/modeling scripts. The dataset includes social network features (share_count, reaction_count, comment_count) and text (Context Post), which makes it a good fit for this hybrid approach. **The goal is to classify posts (binary classification: "mostly true" vs. all other ratings) by integrating graph-based social interactions and text semantics.**



# Project Overview



- Objective: Classify Facebook posts’ veracity using social network structure (via GNN) and text content (via BERT).

- Dataset: facebook-fact-check.csv (2282 rows, with account_id, post_id, network features, and Context Post).

- Output: Binary classification (0: "mostly true"; 1: any other rating, i.e. "mixture of true and false", "mostly false", or "no factual content").



# Step-by-Step Development Process



## Step 1: Data Preprocessing and Exploration

> Goal: Prepare the dataset for GNN and BERT, ensuring compatibility with both models.

Tasks:

1. Load and Inspect Data: Use the existing embedding script’s loading logic.

2. Labels: Map Rating to binary labels (0 for "mostly true", 1 for every other rating).

3. Network Features: Extract share_count, reaction_count, comment_count and standardize them.

4. Graph Construction: Create a graph where nodes are posts (post_id), edges connect posts that share an account_id, and node features are the standardized network metrics. Other interaction-based edge definitions are possible; the same-account rule is the simplification used in this step.

5. Text Data: Keep Context Post as raw text for BERT input (no tokenization yet; the BERT tokenizer handles that at modeling time).
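One detail of task 2 is worth guarding against: `Series.map` returns NaN for any value missing from the dictionary, so an unexpected Rating string (different casing, stray whitespace, a new category) only surfaces later as a failed integer cast. A minimal sketch of a guarded mapping, using a small made-up Series in place of `df['Rating']`; the `to_binary_labels` helper and the toy values are illustrative assumptions, not part of the project's scripts:

```python
import pandas as pd

# Same mapping as the notebook: "mostly true" -> 0, everything else -> 1
LABEL2ID = {
    "mostly true": 0,
    "mixture of true and false": 1,
    "no factual content": 1,
    "mostly false": 1,
}

def to_binary_labels(ratings: pd.Series) -> pd.Series:
    """Hypothetical helper: normalize rating strings, map them, fail on unknowns."""
    normalized = ratings.str.strip().str.lower()
    mapped = normalized.map(LABEL2ID)
    # Any value that did not match the dictionary shows up as NaN after map()
    unknown = normalized[mapped.isna()].unique().tolist()
    if unknown:
        raise ValueError(f"Unmapped Rating values: {unknown}")
    return mapped.astype(int)

# Toy data standing in for df["Rating"]
toy = pd.Series(["mostly true", " Mostly False ", "no factual content"])
labels = to_binary_labels(toy)
print(labels.tolist())  # [0, 1, 1]
```

Raising early keeps a mislabeled or newly added category from being silently dropped or miscounted in the label distribution.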



In [2]:
import pandas as pd
import numpy as np
from scipy import io as sio
from sklearn.preprocessing import StandardScaler
from google.colab import drive

# Mount Google Drive (Colab only) and load the dataset
drive.mount('/content/drive')
file_path = '/content/drive/MyDrive/Projects/Hayat/facebook-fact-check.csv'
df = pd.read_csv(file_path, encoding='latin-1')

# Label mapping
label2id = {'mostly true': 0, 'mixture of true and false': 1, 'no factual content': 1, 'mostly false': 1}
df['Rating'] = df['Rating'].map(label2id)
y = df['Rating'].astype(int).to_numpy()
print("Label distribution:", np.bincount(y))  # Check balance

# Network features
network_cols = ['share_count', 'reaction_count', 'comment_count']
X_network = df[network_cols].fillna(0).to_numpy()
scaler = StandardScaler()
X_net_std = scaler.fit_transform(X_network)
print("X_network shape:", X_net_std.shape)  # (2282, 3)

# Save for later use
sio.savemat('labels.mat', {'y': y})
sio.savemat('network.mat', {'X_net_std': X_net_std})

# Graph construction (simplified: nodes = posts, edges = same account_id)
# Note: nodes are keyed by post_id, so rows with a repeated post_id collapse
# into a single node (the last row's features win). Compare the node count
# printed below against len(df).
import networkx as nx
from itertools import combinations

G = nx.Graph()
for post_id, feats in zip(df['post_id'], X_net_std):
    G.add_node(post_id, features=feats)
for account_id, group in df.groupby('account_id'):
    # Connect every pair of posts published by the same account
    G.add_edges_from(combinations(group['post_id'].tolist(), 2))
print("Graph nodes:", G.number_of_nodes(), "Edges:", G.number_of_edges())

Mounted at /content/drive
Label distribution: [1669  613]
X_network shape: (2282, 3)
Graph nodes: 16 Edges: 30
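
The recorded run reports 16 graph nodes for 2282 rows, which suggests that post_id is not unique in the loaded data, so most posts overwrite one another when added as nodes. A minimal sketch of one way around this, assuming the DataFrame row position is used as the node ID and post_id is kept as a node attribute. The toy DataFrame and the `build_post_graph` helper are hypothetical, not the notebook's code:

```python
from itertools import combinations

import networkx as nx
import numpy as np
import pandas as pd

def build_post_graph(df: pd.DataFrame, features: np.ndarray) -> nx.Graph:
    """Hypothetical helper: one node per row, edges between posts of the same account."""
    if len(df) != len(features):
        raise ValueError("df and features must have the same number of rows")
    graph = nx.Graph()
    # Node ID = row position, so duplicated post_id values cannot merge nodes
    for pos, (post_id, feats) in enumerate(zip(df["post_id"], features)):
        graph.add_node(pos, post_id=post_id, features=feats)
    # groupby(...).indices maps each account_id to the row positions of its posts
    for positions in df.groupby("account_id").indices.values():
        graph.add_edges_from(combinations(positions.tolist(), 2))
    return graph

# Toy data: two accounts, with post_id values repeated on purpose
toy_df = pd.DataFrame({
    "account_id": [1, 1, 1, 2, 2],
    "post_id": [10, 10, 11, 12, 12],
})
toy_features = np.zeros((len(toy_df), 3))

G_toy = build_post_graph(toy_df, toy_features)
assert G_toy.number_of_nodes() == len(toy_df)  # one node per row
print(G_toy.number_of_nodes(), G_toy.number_of_edges())  # 5 4
```

Fully connecting the posts of an account adds n(n-1)/2 edges for an account with n posts, so with only a handful of accounts the real graph can become dense once every row is its own node.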

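The label distribution printed above is imbalanced: 1669 posts in class 0 against 613 in class 1, roughly 73% to 27%. One common way to account for this during training is inverse-frequency class weighting. The sketch below computes such weights from the reported counts; whether the later modeling steps use weighting is not shown in this section, so treat it as an assumption rather than the notebook's method:

```python
import numpy as np

# Counts taken from the printed label distribution: [class 0, class 1]
counts = np.array([1669, 613])

# "Balanced" weighting: n_samples / (n_classes * count_per_class)
weights = counts.sum() / (len(counts) * counts)
print(np.round(weights, 3))  # [0.684 1.861]
```

The minority class (label 1) receives the larger weight, so its errors would count more in a weighted loss.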