<a href="https://colab.research.google.com/github/fjadidi2001/fake_news_detection/blob/main/BERTGNN_Apr2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Problem Definition
> This notebook combines Graph Neural Networks (GNNs) for social network analysis with BERT for text processing, using the facebook-fact-check.csv dataset and the existing embedding/modeling scripts. The dataset includes social network features (share_count, reaction_count, comment_count) and text (Context Post), which makes it a good fit for this hybrid approach. **The goal is to classify posts (binary classification: "mostly true" vs. all other ratings) by integrating graph-based social interactions and text semantics.**



# Project Overview



- Objective: Classify Facebook posts’ veracity using social network structure (via GNN) and text content (via BERT).

- Dataset: facebook-fact-check.csv (2282 rows, with account_id, post_id, network features, and Context Post).

- Output: Binary classification (0: "mostly true", 1: others).



# Step-by-Step Development Process



## Step 1: Data Preprocessing and Exploration

> Goal: Prepare the dataset for GNN and BERT, ensuring compatibility with both models.

Tasks:

1. Load and Inspect Data: Use the existing embedding script’s loading logic.

2. Labels: Map Rating to binary labels (0 vs. 1).

3. Network Features: Extract share_count, reaction_count, comment_count and standardize them.

4. Graph Construction: Create a graph where nodes are posts (one node per dataset row), edges connect posts that share an account_id, and node features are the standardized network metrics.

5. Text Data: Keep Context Post raw for BERT input (no tokenization yet; BERT handles it internally).
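
As a rough illustration of task 4, the sketch below builds edges for a toy set of posts by fully connecting the posts that share an account. The account names and the `build_account_edges` helper are made up for this example; only the grouping idea comes from the notebook. It also shows why the edge count grows quadratically: an account with n posts contributes n(n-1)/2 edges.

```python
from collections import defaultdict
from itertools import combinations


def build_account_edges(account_ids):
    """Connect every pair of rows that share an account id (hypothetical helper)."""
    groups = defaultdict(list)
    for row_idx, account in enumerate(account_ids):
        groups[account].append(row_idx)
    edges = []
    for rows in groups.values():
        # Each account forms a clique over its own posts
        edges.extend(combinations(rows, 2))
    return edges


# Toy data: account "a" has 3 posts, "b" has 2, "c" has 1 (illustrative values)
toy_accounts = ["a", "b", "a", "c", "b", "a"]
toy_edges = build_account_edges(toy_accounts)
expected_count = sum(n * (n - 1) // 2 for n in (3, 2, 1))
print(toy_edges, expected_count)
```

When a few accounts publish hundreds of posts each, this construction produces a very dense graph, so the number of edges can be far larger than the number of nodes.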



In [5]:
import pandas as pd
import numpy as np
from scipy import io as sio
from sklearn.preprocessing import StandardScaler
from google.colab import drive
import networkx as nx

drive.mount('/content/drive')
file_path = '/content/drive/MyDrive/Projects/Hayat/facebook-fact-check.csv'
df = pd.read_csv(file_path, encoding='latin-1')

# Label mapping
label2id = {'mostly true': 0, 'mixture of true and false': 1, 'no factual content': 1, 'mostly false': 1}
df['Rating'] = df['Rating'].map(label2id)
y = df['Rating'].astype(int).to_numpy()
print("Label distribution:", np.bincount(y))

# Network features
network_cols = ['share_count', 'reaction_count', 'comment_count']
X_network = df[network_cols].fillna(0).to_numpy()
scaler = StandardScaler()
X_net_std = scaler.fit_transform(X_network)
print("X_network shape:", X_net_std.shape)  # (2282, 3)

# Graph construction: Use row indices as nodes
G = nx.Graph()
for idx in range(len(df)):
    G.add_node(idx, features=X_net_std[idx])

# Add edges between posts with same account_id
account_groups = df.groupby('account_id').indices
for account_id, indices in account_groups.items():
    indices = list(indices)
    for i in range(len(indices)):
        for j in range(i + 1, len(indices)):
            G.add_edge(indices[i], indices[j])

print("Graph nodes:", G.number_of_nodes(), "Edges:", G.number_of_edges())

# Save for later use
sio.savemat('labels.mat', {'y': y})
sio.savemat('network.mat', {'X_net_std': X_net_std})

Mounted at /content/drive
Label distribution: [1669  613]
X_network shape: (2282, 3)
Graph nodes: 2282 Edges: 368312


## Step 2: Graph Neural Network (GNN) Setup

> Goal: Model social network interactions using a GNN (e.g., Graph Convolutional Network, GCN).
- Tools: Use torch_geometric for GNN implementation.


Tasks:
- Convert Graph to PyTorch Geometric Format: Map network.mat features to nodes and define edges.

- Define GCN Model: Process node features (three standardized network metrics per post) to produce node embeddings.

- Output: GNN embeddings for each post (e.g., 128D per node).
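
To make the GCN task more concrete, here is a small pure-Python sketch of the propagation step that a GCN layer is built around, following the standard rule from Kipf and Welling: add self-loops, normalize the adjacency symmetrically, then average neighbor features. The learned weight matrix and the nonlinearity are left out, the helper names are invented for illustration, and the toy graph is not taken from the dataset.

```python
import math


def normalized_adjacency(num_nodes, undirected_edges):
    """Return D^-1/2 (A + I) D^-1/2 as a dense list of lists."""
    adj = [[0.0] * num_nodes for _ in range(num_nodes)]
    for u, v in undirected_edges:
        adj[u][v] = adj[v][u] = 1.0  # store both directions of each edge
    for i in range(num_nodes):
        adj[i][i] = 1.0  # self-loop so a node keeps its own features
    degree = [sum(row) for row in adj]
    return [[adj[i][j] / math.sqrt(degree[i] * degree[j])
             for j in range(num_nodes)] for i in range(num_nodes)]


def propagate(a_hat, features):
    """Multiply the normalized adjacency by the node feature matrix."""
    n, d = len(features), len(features[0])
    return [[sum(a_hat[i][k] * features[k][f] for k in range(n))
             for f in range(d)] for i in range(n)]


# Toy graph: nodes 0 and 1 are connected, node 2 is isolated
toy_graph_edges = [(0, 1)]
toy_features = [[1.0], [3.0], [5.0]]
a_hat = normalized_adjacency(3, toy_graph_edges)
smoothed = propagate(a_hat, toy_features)
print(smoothed)
```

Connected nodes end up with blended features, while the isolated node keeps its own value. A real layer would then apply a trainable linear map to the propagated features.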


In [6]:
!pip install torch torch-geometric -q

In [7]:
import torch
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv
from torch_geometric.utils import add_self_loops
import numpy as np
import os

In [8]:
# Reset CUDA environment
torch.cuda.empty_cache()
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # For precise CUDA error reporting

In [9]:



# Load network features
X_net_std = sio.loadmat('network.mat')['X_net_std']  # (2282, 3)

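# Edge index from graph (using row indices).
# Note: G.edges lists each undirected edge once, and PyTorch Geometric treats
# edge_index as directed, so messages flow in one direction only here.
# torch_geometric.utils.to_undirected(edge_index) would add the reverse edges.
edges = list(G.edges)
edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()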
x = torch.tensor(X_net_std, dtype=torch.float)  # Node features (2282, 3)

# Verify edge_index validity
print("Max edge index:", edge_index.max(), "Num nodes:", x.shape[0])
assert edge_index.max() < x.shape[0], "Edge indices exceed number of nodes!"

# Create PyTorch Geometric data object
data = Data(x=x, edge_index=edge_index)
print("GNN Data before self-loops:", data)

# Add self-loops
edge_index, _ = add_self_loops(data.edge_index, num_nodes=data.num_nodes)
data.edge_index = edge_index
print("GNN Data after self-loops:", data)

# Define GCN model
class GCN(torch.nn.Module):
    def __init__(self, in_channels=3, hidden_channels=64, out_channels=128):
        super(GCN, self).__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)
        self.relu = torch.nn.ReLU()

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = self.relu(x)
        x = self.conv2(x, edge_index)
        return x

# Test on CPU first
print("\nRunning on CPU:")
device = torch.device('cpu')
gcn_model = GCN().to(device)
data = data.to(device)
gcn_embeddings = gcn_model(data)
print("GCN Embeddings shape (CPU):", gcn_embeddings.shape)

# Then try CUDA
print("\nRunning on CUDA:")
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
gcn_model = GCN().to(device)
data = data.to(device)
gcn_embeddings = gcn_model(data)
print("GCN Embeddings shape (CUDA):", gcn_embeddings.shape)

Max edge index: tensor(2281) Num nodes: 2282
GNN Data before self-loops: Data(x=[2282, 3], edge_index=[2, 368312])
GNN Data after self-loops: Data(x=[2282, 3], edge_index=[2, 370594])

Running on CPU:
GCN Embeddings shape (CPU): torch.Size([2282, 128])

Running on CUDA:
GCN Embeddings shape (CUDA): torch.Size([2282, 128])


## Step 3: BERT Text Processing

> Goal: Generate 768-dimensional BERT embeddings for each of the 2282 Context Post entries in your dataset.

- Tools: Use transformers from Hugging Face with bert-base-uncased.

Tasks:
1. Load BERT tokenizer and model.

2. Tokenize and encode Context Post texts in batches.

3. Extract embeddings (e.g., [CLS] token) for each post.

4. Output embeddings of shape (2282, 768).
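
The batching in task 2 can be sketched without BERT at all. The helper names below are hypothetical (they are not part of transformers). The batch count uses ceiling division, which stays correct even when the number of texts is an exact multiple of the batch size.

```python
import math


def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def num_batches(n_items, batch_size):
    """Ceiling division: number of batches needed to cover n_items."""
    return math.ceil(n_items / batch_size)


# Placeholder strings standing in for the 2282 Context Post texts
placeholder_texts = [f"post {i}" for i in range(2282)]
batches = list(iter_batches(placeholder_texts, batch_size=32))
print(len(batches), len(batches[-1]))
```

With 2282 texts and a batch size of 32, this gives 71 full batches plus a final batch of 10.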





In [10]:
# Install dependencies
!pip install transformers -q
import pandas as pd
from scipy import io as sio
import torch
from transformers import BertTokenizer, BertModel
from google.colab import drive

# Mount drive and load dataset
drive.mount('/content/drive')
file_path = '/content/drive/MyDrive/Projects/Hayat/facebook-fact-check.csv'
df = pd.read_csv(file_path, encoding='latin-1')
texts = df['Context Post'].fillna("").tolist()  # 2282 posts
print("Number of texts:", len(texts))

# Load BERT
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert_model = BertModel.from_pretrained('bert-base-uncased').to(device)

# Function to get BERT embeddings in batches
def get_bert_embeddings(texts, batch_size=32, max_length=117):
    bert_embeddings = []
    bert_model.eval()  # Set to evaluation mode
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]
        inputs = tokenizer(batch, return_tensors='pt', padding=True, truncation=True, max_length=max_length)
        inputs = {k: v.to(device) for k, v in inputs.items()}  # Move to GPU
        with torch.no_grad():
            outputs = bert_model(**inputs)
        # Use [CLS] token embedding (first token)
        cls_embeddings = outputs.last_hidden_state[:, 0, :]
        bert_embeddings.append(cls_embeddings.cpu())  # Move to CPU to save GPU memory
        print(f"Processed batch {i//batch_size + 1}/{(len(texts) + batch_size - 1)//batch_size}")
    return torch.cat(bert_embeddings, dim=0)

# Generate BERT embeddings
bert_embeddings = get_bert_embeddings(texts, batch_size=32, max_length=117)
print("BERT Embeddings shape:", bert_embeddings.shape)

# Save embeddings for later use
torch.save(bert_embeddings, 'bert_embeddings.pt')
print("BERT embeddings saved to 'bert_embeddings.pt'")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Number of texts: 2282


Processed batch 1/72
Processed batch 2/72
Processed batch 3/72
Processed batch 4/72
Processed batch 5/72
Processed batch 6/72
Processed batch 7/72
Processed batch 8/72
Processed batch 9/72
Processed batch 10/72
Processed batch 11/72
Processed batch 12/72
Processed batch 13/72
Processed batch 14/72
Processed batch 15/72
Processed batch 16/72
Processed batch 17/72
Processed batch 18/72
Processed batch 19/72
Processed batch 20/72
Processed batch 21/72
Processed batch 22/72
Processed batch 23/72
Processed batch 24/72
Processed batch 25/72
Processed batch 26/72
Processed batch 27/72
Processed batch 28/72
Processed batch 29/72
Processed batch 30/72
Processed batch 31/72
Processed batch 32/72
Processed batch 33/72
Processed batch 34/72
Processed batch 35/72
Processed batch 36/72
Processed batch 37/72
Processed batch 38/72
Processed batch 39/72
Processed batch 40/72
Processed batch 41/72
Processed batch 42/72
Processed batch 43/72
Processed batch 44/72
Processed batch 45/72
Processed batch 46/

## Step 4: Combine GNN and BERT Embeddings

> Goal: Integrate the GNN embeddings (128D) and BERT embeddings (768D) into a combined representation (896D per post) and train a classifier to predict the binary labels (0: "mostly true", 1: others).
- Tools: PyTorch for model definition and training, scikit-learn for metrics.



Tasks:

1. Load GCN and BERT embeddings.

2. Concatenate them into a single feature vector per post.

3. Split data into train/validation/test sets.

4. Define and train a feedforward neural network classifier.

5. Evaluate performance with accuracy, precision, and recall.
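
The feature dimension and split sizes implied by these tasks can be checked with simple arithmetic. The sketch assumes a 30% test split followed by a 20% validation split of the remainder, and assumes the held-out size is rounded up, which is how scikit-learn's `train_test_split` handles fractional sizes. The helper name is illustrative.

```python
import math


def split_sizes(n_samples, test_frac, val_frac):
    """Return (train, val, test) sizes, rounding each held-out part up."""
    n_test = math.ceil(n_samples * test_frac)
    n_remaining = n_samples - n_test
    n_val = math.ceil(n_remaining * val_frac)
    return n_remaining - n_val, n_val, n_test


gnn_dim, bert_dim = 128, 768
combined_dim = gnn_dim + bert_dim  # concatenation along the feature axis
sizes = split_sizes(2282, test_frac=0.3, val_frac=0.2)
print(combined_dim, sizes)
```

For 2282 posts this yields 1277 training, 320 validation, and 685 test samples, each represented by an 896-dimensional vector.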



In [15]:
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv
from torch_geometric.utils import add_self_loops
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score
from scipy import io as sio
import numpy as np
import networkx as nx
import pandas as pd

# Load labels and network features
y = sio.loadmat('labels.mat')['y'][0]  # (2282,)
X_net_std = sio.loadmat('network.mat')['X_net_std']  # (2282, 3)
print("Labels shape:", y.shape)
print("X_network shape:", X_net_std.shape)

# Reconstruct graph (from Step 1)
df = pd.read_csv('/content/drive/MyDrive/Projects/Hayat/facebook-fact-check.csv', encoding='latin-1')
G = nx.Graph()
for idx in range(len(df)):
    G.add_node(idx, features=X_net_std[idx])
account_groups = df.groupby('account_id').indices
for account_id, indices in account_groups.items():
    indices = list(indices)
    for i in range(len(indices)):
        for j in range(i + 1, len(indices)):
            G.add_edge(indices[i], indices[j])

# Prepare GNN data
edges = list(G.edges)
edge_index = torch.tensor(edges, dtype=torch.long).t().contiguous()
x = torch.tensor(X_net_std, dtype=torch.float)  # (2282, 3)
data = Data(x=x, edge_index=edge_index)
edge_index, _ = add_self_loops(data.edge_index, num_nodes=data.num_nodes)
data.edge_index = edge_index

# Define GCN model (from Step 2)
class GCN(torch.nn.Module):
    def __init__(self, in_channels=3, hidden_channels=64, out_channels=128):
        super(GCN, self).__init__()
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, out_channels)
        self.relu = torch.nn.ReLU()

    def forward(self, data):
        x, edge_index = data.x, data.edge_index
        x = self.conv1(x, edge_index)
        x = self.relu(x)
        x = self.conv2(x, edge_index)
        return x

# Compute GCN embeddings.
# Note: this GCN is freshly initialized and never trained, so the embeddings are
# random (untrained) projections of the neighborhood-aggregated network features.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
gcn_model = GCN().to(device)
gcn_model.eval()  # Evaluation mode (affects layers like dropout; it does not disable gradients)
data = data.to(device)
with torch.no_grad():  # Disable gradient computation
    gcn_embeddings = gcn_model(data).detach()  # detach() is redundant under no_grad but harmless
print("GCN Embeddings shape:", gcn_embeddings.shape)  # (2282, 128)

# Load BERT embeddings
bert_embeddings = torch.load('bert_embeddings.pt')  # (2282, 768)
print("BERT Embeddings shape:", bert_embeddings.shape)

# Ensure embeddings are on CPU and match
gcn_embeddings = gcn_embeddings.cpu()
bert_embeddings = bert_embeddings.cpu()
assert gcn_embeddings.shape[0] == bert_embeddings.shape[0] == len(y), "Embedding sizes don’t match labels!"

# Concatenate embeddings
combined_embeddings = torch.cat((gcn_embeddings, bert_embeddings), dim=1)  # (2282, 896)
print("Combined Embeddings shape:", combined_embeddings.shape)

# Prepare data for training
X_train, X_test, y_train, y_test = train_test_split(
    combined_embeddings, y, test_size=0.3, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=42
)

# Convert to tensors
y_train = torch.tensor(y_train, dtype=torch.long)
y_val = torch.tensor(y_val, dtype=torch.long)
y_test = torch.tensor(y_test, dtype=torch.long)

# Create data loaders
train_dataset = TensorDataset(X_train, y_train)
val_dataset = TensorDataset(X_val, y_val)
test_dataset = TensorDataset(X_test, y_test)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)
test_loader = DataLoader(test_dataset, batch_size=32)

print("Train size:", len(train_dataset), "Val size:", len(val_dataset), "Test size:", len(test_dataset))

# Define classifier
class CombinedClassifier(nn.Module):
    def __init__(self, input_dim=896, hidden_dim=256, num_classes=2):
        super(CombinedClassifier, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.3)
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x

# Training setup
model = CombinedClassifier().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Training loop with early stopping
num_epochs = 50
best_val_loss = float('inf')
patience = 10
patience_counter = 0

for epoch in range(num_epochs):
    model.train()
    train_loss = 0
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()

    # Validation
    model.eval()
    val_loss = 0
    with torch.no_grad():
        for inputs, labels in val_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            val_loss += criterion(outputs, labels).item()

    train_loss_avg = train_loss / len(train_loader)
    val_loss_avg = val_loss / len(val_loader)
    print(f"Epoch {epoch+1}/{num_epochs}, "
          f"Train Loss: {train_loss_avg:.4f}, "
          f"Val Loss: {val_loss_avg:.4f}")

    # Early stopping
    if val_loss_avg < best_val_loss:
        best_val_loss = val_loss_avg
        patience_counter = 0
        torch.save(model.state_dict(), 'best_combined_classifier.pth')
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print("Early stopping triggered")
            break

# Load best model
model.load_state_dict(torch.load('best_combined_classifier.pth'))

# Evaluation
model.eval()
y_true, y_pred = [], []
with torch.no_grad():
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        _, preds = torch.max(outputs, 1)
        y_true.extend(labels.cpu().numpy())
        y_pred.extend(preds.cpu().numpy())

# Compute metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average='weighted')
recall = recall_score(y_true, y_pred, average='weighted')

print("\nTest Results:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")

# Save final model
torch.save(model.state_dict(), 'combined_classifier.pth')
print("Final model saved to 'combined_classifier.pth'")

Labels shape: (2282,)
X_network shape: (2282, 3)
GCN Embeddings shape: torch.Size([2282, 128])
BERT Embeddings shape: torch.Size([2282, 768])
Combined Embeddings shape: torch.Size([2282, 896])
Train size: 1277 Val size: 320 Test size: 685
Epoch 1/50, Train Loss: 0.4710, Val Loss: 0.4488
Epoch 2/50, Train Loss: 0.4309, Val Loss: 0.4283
Epoch 3/50, Train Loss: 0.4141, Val Loss: 0.4232
Epoch 4/50, Train Loss: 0.4074, Val Loss: 0.4239
Epoch 5/50, Train Loss: 0.3978, Val Loss: 0.4115
Epoch 6/50, Train Loss: 0.4080, Val Loss: 0.4215
Epoch 7/50, Train Loss: 0.3831, Val Loss: 0.4339
Epoch 8/50, Train Loss: 0.3787, Val Loss: 0.4083
Epoch 9/50, Train Loss: 0.3694, Val Loss: 0.4319
Epoch 10/50, Train Loss: 0.3765, Val Loss: 0.4088
Epoch 11/50, Train Loss: 0.3748, Val Loss: 0.4259
Epoch 12/50, Train Loss: 0.3585, Val Loss: 0.4200
Epoch 13/50, Train Loss: 0.3453, Val Loss: 0.4353
Epoch 14/50, Train Loss: 0.3351, Val Loss: 0.4172
Epoch 15/50, Train Loss: 0.3222, Val Loss: 0.4273
Epoch 16/50, Train L

## Step 5: Evaluation and Refinement

> Goal: Assess the model’s performance in detail, identify weaknesses (e.g., class imbalance), and refine it for better results.

Tasks:
1. Detailed Evaluation: Compute per-class metrics (confusion matrix, precision/recall per class) to check for bias.

2. Refinement: Address imbalance (e.g., class weights) and optimize the model (e.g., architecture, hyperparameters).

3. Comparison: Test GCN-only and BERT-only models to quantify each component’s contribution.
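
One common way to address the imbalance mentioned in task 2 is to weight the loss by inverse class frequency. The sketch below computes such weights from the label counts reported in Step 1 (1669 and 613). The helper name and the choice to scale the majority weight to 1.0 are assumptions made for illustration.

```python
def inverse_frequency_weights(class_counts):
    """Weight each class by max_count / count so the largest class gets 1.0."""
    largest = max(class_counts)
    return [largest / count for count in class_counts]


counts = [1669, 613]  # "mostly true" vs. "others", from the label distribution
weights = inverse_frequency_weights(counts)
print([round(w, 3) for w in weights])
```

A list like this could be converted to a tensor and passed as the `weight` argument of `torch.nn.CrossEntropyLoss`. Computing the counts from the training split only, rather than the full dataset, avoids using information from the held-out data.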



In [16]:
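import os
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             confusion_matrix, classification_report)
from scipy import io as sio
import numpy as np

# Load data and model
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
# Reuse saved combined embeddings if the file exists; otherwise rebuild them
# from the Step 4 tensors that are still in memory
if os.path.exists('combined_embeddings.pt'):
    combined_embeddings = torch.load('combined_embeddings.pt')
else:
    combined_embeddings = torch.cat((gcn_embeddings.cpu(), bert_embeddings.cpu()), dim=1)
y = sio.loadmat('labels.mat')['y'][0]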

# Re-split data (same as Step 4)
X_train, X_test, y_train, y_test = train_test_split(
    combined_embeddings, y, test_size=0.3, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=42
)

y_train = torch.tensor(y_train, dtype=torch.long)
y_val = torch.tensor(y_val, dtype=torch.long)
y_test = torch.tensor(y_test, dtype=torch.long)

train_dataset = TensorDataset(X_train, y_train)
val_dataset = TensorDataset(X_val, y_val)
test_dataset = TensorDataset(X_test, y_test)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)
test_loader = DataLoader(test_dataset, batch_size=32)

# Define classifier (same as Step 4)
class CombinedClassifier(nn.Module):
    def __init__(self, input_dim=896, hidden_dim=256, num_classes=2):
        super(CombinedClassifier, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.3)
        self.fc2 = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        return x

# Load trained model
model = CombinedClassifier().to(device)
model.load_state_dict(torch.load('best_combined_classifier.pth'))
model.eval()

# Detailed evaluation
y_true, y_pred = [], []
with torch.no_grad():
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        _, preds = torch.max(outputs, 1)
        y_true.extend(labels.cpu().numpy())
        y_pred.extend(preds.cpu().numpy())

# Confusion matrix and classification report
conf_matrix = confusion_matrix(y_true, y_pred)
print("\nConfusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(classification_report(y_true, y_pred, target_names=['mostly true (0)', 'others (1)']))

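# Refinement: continue training the loaded model with a class-weighted loss to handle imbalance
# (weights use the full-dataset counts from Step 1: 1669 "mostly true" vs. 613 "others")
class_weights = torch.tensor([1.0, 1669/613], dtype=torch.float).to(device)  # Weight class 1 higher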
criterion = nn.CrossEntropyLoss(weight=class_weights)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Retraining loop
num_epochs = 50
best_val_loss = float('inf')
patience = 10
patience_counter = 0

for epoch in range(num_epochs):
    model.train()
    train_loss = 0
    for inputs, labels in train_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        optimizer.zero_grad()
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()
        train_loss += loss.item()

    model.eval()
    val_loss = 0
    with torch.no_grad():
        for inputs, labels in val_loader:
            inputs, labels = inputs.to(device), labels.to(device)
            outputs = model(inputs)
            val_loss += criterion(outputs, labels).item()

    train_loss_avg = train_loss / len(train_loader)
    val_loss_avg = val_loss / len(val_loader)
    print(f"Epoch {epoch+1}/{num_epochs}, Train Loss: {train_loss_avg:.4f}, Val Loss: {val_loss_avg:.4f}")

    if val_loss_avg < best_val_loss:
        best_val_loss = val_loss_avg
        patience_counter = 0
        torch.save(model.state_dict(), 'refined_combined_classifier.pth')
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print("Early stopping triggered")
            break

# Load refined model and re-evaluate
model.load_state_dict(torch.load('refined_combined_classifier.pth'))
model.eval()
y_true, y_pred = [], []
with torch.no_grad():
    for inputs, labels in test_loader:
        inputs, labels = inputs.to(device), labels.to(device)
        outputs = model(inputs)
        _, preds = torch.max(outputs, 1)
        y_true.extend(labels.cpu().numpy())
        y_pred.extend(preds.cpu().numpy())

# Updated metrics
accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred, average='weighted')
recall = recall_score(y_true, y_pred, average='weighted')
print("\nRefined Test Results:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print("\nRefined Classification Report:")
print(classification_report(y_true, y_pred, target_names=['mostly true (0)', 'others (1)']))

# Save refined model
torch.save(model.state_dict(), 'final_refined_classifier.pth')
print("Refined model saved to 'final_refined_classifier.pth'")


Confusion Matrix:
[[491  10]
 [162  22]]

Classification Report:
                 precision    recall  f1-score   support

mostly true (0)       0.75      0.98      0.85       501
     others (1)       0.69      0.12      0.20       184

       accuracy                           0.75       685
      macro avg       0.72      0.55      0.53       685
   weighted avg       0.73      0.75      0.68       685

Epoch 1/50, Train Loss: 0.4323, Val Loss: 0.4466
Epoch 2/50, Train Loss: 0.4052, Val Loss: 0.4556
Epoch 3/50, Train Loss: 0.3923, Val Loss: 0.4538
Epoch 4/50, Train Loss: 0.3879, Val Loss: 0.4787
Epoch 5/50, Train Loss: 0.3827, Val Loss: 0.4639
Epoch 6/50, Train Loss: 0.3759, Val Loss: 0.4729
Epoch 7/50, Train Loss: 0.3620, Val Loss: 0.4829
Epoch 8/50, Train Loss: 0.3652, Val Loss: 0.4795
Epoch 9/50, Train Loss: 0.3567, Val Loss: 0.4669
Epoch 10/50, Train Loss: 0.3322, Val Loss: 0.4840
Epoch 11/50, Train Loss: 0.3264, Val Loss: 0.4876
Early stopping triggered

Refined Test Results:


# Problems Identified

1. Class Imbalance: 73% "mostly true" (1669/2282) vs. 27% "others" (613/2282) skews the model. The original model overpredicts class 0; the refined model overcorrects toward class 1.

2. Accuracy Stagnation: Both models hover around 0.72-0.75, barely beating the majority-class baseline of about 0.73 (always predicting "mostly true": 1669/2282), which suggests limited learning capacity or feature quality.

3. Bias Shift: Class weights improved class 1 recall but sacrificed overall accuracy and class 0 performance, indicating the model isn’t generalizing well.
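
The figures behind these observations can be re-derived from the confusion matrix reported in Step 5 for the original model. The following sketch recomputes accuracy, the majority-class baseline on the test set, and class 1 precision and recall in plain Python. The variable names are illustrative.

```python
conf_matrix = [[491, 10],
               [162, 22]]  # rows: true class, columns: predicted class

total = sum(sum(row) for row in conf_matrix)
accuracy = (conf_matrix[0][0] + conf_matrix[1][1]) / total

# Majority-class baseline on the test set: always predict class 0
baseline = sum(conf_matrix[0]) / total

# Class 1 ("others") precision and recall
precision_1 = conf_matrix[1][1] / (conf_matrix[0][1] + conf_matrix[1][1])
recall_1 = conf_matrix[1][1] / sum(conf_matrix[1])

print(f"accuracy={accuracy:.3f} baseline={baseline:.3f} "
      f"precision_1={precision_1:.3f} recall_1={recall_1:.3f}")
```

The gap between accuracy and the baseline is under two percentage points, while recall for class 1 is roughly 0.12, which is the clearest sign of the imbalance problem.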



## Improvement Strategies

- Better Class Balancing: Use oversampling (SMOTE) instead of just weights to balance training data.

- Enhanced Model Capacity: Increase the classifier’s complexity (more layers, units) to capture patterns better.

- Fine-Tune BERT: Use trainable BERT embeddings instead of static ones to improve text feature quality.

- Hyperparameter Tuning: Adjust learning rate, dropout, and batch size.
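
As a simpler stand-in for SMOTE, the sketch below shows plain random oversampling: minority-class rows are duplicated at random until the classes are balanced. This is not SMOTE itself, which synthesizes new points by interpolating between neighbors. The function name is hypothetical, and the resampling should be applied to the training split only so that validation and test data stay untouched.

```python
import random
from collections import Counter


def random_oversample(features, labels, seed=0):
    """Duplicate minority-class samples at random until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(rows) for rows in by_class.values())
    out_x, out_y = [], []
    for y, rows in by_class.items():
        # Draw extra samples with replacement from the existing rows of this class
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for x in rows + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Toy training set: 6 samples of class 0, 2 of class 1 (illustrative values)
train_x = [[float(i)] for i in range(8)]
train_y = [0, 0, 0, 0, 0, 0, 1, 1]
balanced_x, balanced_y = random_oversample(train_x, train_y)
print(Counter(balanced_y))
```

Duplicated rows add no new information, so this mainly changes how often the classifier sees the minority class. An interpolation-based method such as SMOTE may behave differently on 896-dimensional embeddings.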

