# Fake News Detection — Ensemble of Transformers (BERT + RoBERTa)

This notebook builds a **fake news detector** using advanced AI techniques: fine-tuning pretrained **Transformer** models (DistilBERT / BERT and RoBERTa) and combining them in an ensemble. The dataset used is the **'Fake and Real News'** dataset from Kaggle.


In [34]:

# Verify versions
import transformers, datasets, sklearn
print('transformers', transformers.__version__)
print('datasets', datasets.__version__)
print('sklearn', sklearn.__version__)


transformers 4.57.0
datasets 4.1.1
sklearn 1.7.2


## 1) Download dataset from Kaggle

The Kaggle dataset used: **Fake and Real News Dataset** — two CSV files (`Fake.csv` and `True.csv`).

Kaggle dataset page: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset


In [36]:
# Load dataset directly from local files
import pandas as pd
import os

# Ensure data folder exists
os.makedirs("data", exist_ok=True)

# Load both datasets
fake_df = pd.read_csv("Fake.csv")
true_df = pd.read_csv("True.csv")

# Add labels
fake_df["label"] = 0
true_df["label"] = 1

# Combine both into one dataframe
df = pd.concat([fake_df, true_df], axis=0).sample(frac=1, random_state=42).reset_index(drop=True)

print(f" Dataset loaded successfully!")
print(f"Total records: {len(df)}")
print(df.head())


 Dataset loaded successfully!
Total records: 44898
                                               title  \
0  Ben Stein Calls Out 9th Circuit Court: Committ...   
1  Trump drops Steve Bannon from National Securit...   
2  Puerto Rico expects U.S. to lift Jones Act shi...   
3   OOPS: Trump Just Accidentally Confirmed He Le...   
4  Donald Trump heads for Scotland to reopen a go...   

                                                text       subject  \
0  21st Century Wire says Ben Stein, reputable pr...       US_News   
1  WASHINGTON (Reuters) - U.S. President Donald T...  politicsNews   
2  (Reuters) - Puerto Rico Governor Ricardo Rosse...  politicsNews   
3  On Monday, Donald Trump once again embarrassed...          News   
4  GLASGOW, Scotland (Reuters) - Most U.S. presid...  politicsNews   

                  date  label  
0    February 13, 2017      0  
1       April 5, 2017       1  
2  September 27, 2017       1  
3         May 22, 2017      0  
4       June 24, 2016       1  

In [3]:
import pandas as pd
import os

fake_path = 'Fake.csv'
true_path = 'True.csv'
if os.path.exists(fake_path) and os.path.exists(true_path):
    fake = pd.read_csv(fake_path)
    true = pd.read_csv(true_path)
    df = pd.concat([fake.assign(label='fake'), true.assign(label='real')], ignore_index=True)
    
    print('Total samples:', len(df))
    print('Label distribution:\n', df['label'].value_counts())
    display(df.head())
else:
    print('Data files not found. Place Fake.csv and True.csv into the ./data folder.')


Total samples: 44898
Label distribution:
 label
fake    23481
real    21417
Name: count, dtype: int64


Unnamed: 0,title,text,subject,date,label
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",fake
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",fake
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",fake
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",fake
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",fake


In [37]:
from datasets import Dataset
import numpy as np
import pandas as pd

if 'df' in globals():
    def combine_text(row):
        parts = []
        if 'title' in row and pd.notna(row['title']):
            parts.append(str(row['title']))
        if 'text' in row and pd.notna(row['text']):
            parts.append(str(row['text']))
        return '\n'.join(parts)

    df['text_all'] = df.apply(combine_text, axis=1)
    df = df[['text_all','label']].rename(columns={'text_all':'text'})
    df['label_id'] = (df['label']=='real').astype(int)
    dataset = Dataset.from_pandas(df[['text','label_id']].rename(columns={'label_id':'label'}))
    display(dataset.shuffle(seed=42).select(range(5)).to_pandas())
else:
    print('Dataset not prepared because original CSVs are missing.')


Unnamed: 0,text,label
0,"As World Trade Center Fell, Donald Trump Boas...",0
1,NYC mayor warns Trump: 'stop and frisk' will m...,0
2,U.S. options market not very 'Trumped up' ahea...,0
3,Electric vehicle sales fall far short of Obama...,0
4,OBAMACARE: Your Dog Might Have Better Healthca...,0


In [41]:
from transformers import AutoTokenizer
from IPython.display import display, HTML
import ipywidgets as widgets

# Model names
model_name_1 = 'distilbert-base-uncased'
model_name_2 = 'roberta-base'

# Tokenizers
tokenizer1 = AutoTokenizer.from_pretrained(model_name_1)
tokenizer2 = AutoTokenizer.from_pretrained(model_name_2)

# Define tokenization functions
def tokenize1(batch):
    return tokenizer1(batch['text'], truncation=True, padding='max_length', max_length=256)

def tokenize2(batch):
    return tokenizer2(batch['text'], truncation=True, padding='max_length', max_length=256)

# --- Widget setup (compatible with JupyterLab 4.4.9) ---
info_box = widgets.Output()
with info_box:
    display(HTML("<b>Initializing tokenization process...</b>"))

display(info_box)

# --- Main tokenization workflow ---
try:
    if 'dataset' in globals():
        with info_box:
            print("Dataset found. Splitting and tokenizing...")
        ds = dataset.train_test_split(test_size=0.15, seed=42)

        # Smaller demo sample (optional)
        DEMO = True
        if DEMO:
            ds['train'] = ds['train'].shuffle(seed=42).select(range(min(2000, len(ds['train']))))
            ds['test'] = ds['test'].shuffle(seed=42).select(range(min(500, len(ds['test']))))

        tokenized1 = ds.map(tokenize1, batched=True)
        tokenized2 = ds.map(tokenize2, batched=True)

        with info_box:
            print(" Tokenization completed successfully.")
    else:
        with info_box:
            print("Tokenization skipped — dataset not available.")
except Exception as e:
    with info_box:
        print(f" Error during tokenization: {e}")


Output()

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]

Map:   0%|          | 0/500 [00:00<?, ? examples/s]

In [57]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(pred):
    labels = pred.label_ids
    preds = np.argmax(pred.predictions, axis=1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {'accuracy': acc, 'precision': precision, 'recall': recall, 'f1': f1}

if 'tokenized1' in globals() and 'tokenized2' in globals():
    model1 = AutoModelForSequenceClassification.from_pretrained(model_name_1, num_labels=2)
    model2 = AutoModelForSequenceClassification.from_pretrained(model_name_2, num_labels=2)

    # Faster training parameters
    args1 = TrainingArguments(
        output_dir='models/distilbert',
        per_device_train_batch_size=8,   # reduced batch size
        per_device_eval_batch_size=16,   # reduced batch size
        num_train_epochs=1,              # reduced epochs
        logging_steps=50,
        learning_rate=2e-5,
    )
    setattr(args1, "do_eval", True)
    setattr(args1, "evaluation_strategy", "epoch")
    setattr(args1, "save_strategy", "epoch")
    setattr(args1, "save_total_limit", 2)

    args2 = TrainingArguments(
        output_dir='models/roberta',
        per_device_train_batch_size=4,   # reduced batch size
        per_device_eval_batch_size=16,   # reduced batch size
        num_train_epochs=1,              # reduced epochs
        logging_steps=50,
        learning_rate=2e-5,
    )
    setattr(args2, "do_eval", True)
    setattr(args2, "evaluation_strategy", "epoch")
    setattr(args2, "save_strategy", "epoch")
    setattr(args2, "save_total_limit", 2)

    trainer1 = Trainer(
        model=model1,
        args=args1,
        train_dataset=tokenized1['train'],
        eval_dataset=tokenized1['test'],
        tokenizer=tokenizer1,
        compute_metrics=compute_metrics
    )

    trainer2 = Trainer(
        model=model2,
        args=args2,
        train_dataset=tokenized2['train'],
        eval_dataset=tokenized2['test'],
        tokenizer=tokenizer2,
        compute_metrics=compute_metrics
    )

    print(" Trainers initialized. To start training run `trainer1.train()` and `trainer2.train()`.")
else:
    print('Training setup skipped because tokenized datasets are not present.')


Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.dense.bias', 'classifier.dense.weight', 'classifier.out_proj.bias', 'classifier.out_proj.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


 Trainers initialized. To start training run `trainer1.train()` and `trainer2.train()`.


  trainer1 = Trainer(
  trainer2 = Trainer(


In [59]:
trainer1.train()
trainer2.train()




Step,Training Loss
50,0.1489
100,0.0036
150,0.0018
200,0.0013
250,0.0011




Step,Training Loss
50,0.0972
100,0.0002
150,0.0001
200,0.0001
250,0.0001
300,0.0001
350,0.0001
400,0.0001
450,0.0001
500,0.0001


TrainOutput(global_step=500, training_loss=0.009807163601275534, metrics={'train_runtime': 7607.8104, 'train_samples_per_second': 0.263, 'train_steps_per_second': 0.066, 'total_flos': 263111055360000.0, 'train_loss': 0.009807163601275534, 'epoch': 1.0})

## Ensemble inference (average logits)

After fine-tuning both models, you can produce predictions by averaging the output logits (or probabilities) from both models and taking the argmax.

In [61]:
import torch
def ensemble_predict(texts, model1, tokenizer1, model2, tokenizer2, device=None):
    enc1 = tokenizer1(texts, truncation=True, padding=True, return_tensors='pt', max_length=256)
    enc2 = tokenizer2(texts, truncation=True, padding=True, return_tensors='pt', max_length=256)
    if device is not None:
        model1.to(device)
        model2.to(device)
        enc1 = {k:v.to(device) for k,v in enc1.items()}
        enc2 = {k:v.to(device) for k,v in enc2.items()}
    with torch.no_grad():
        out1 = model1(**enc1).logits.cpu().numpy()
        out2 = model2(**enc2).logits.cpu().numpy()
    avg = (out1 + out2) / 2.0
    preds = avg.argmax(axis=1)
    return preds

print('Ensemble inference function defined.')


Ensemble inference function defined.


In [63]:
import torch

# Ensemble prediction function
def ensemble_predict(texts, model1, tokenizer1, model2, tokenizer2, device=None):
    enc1 = tokenizer1(texts, truncation=True, padding=True, return_tensors='pt', max_length=256)
    enc2 = tokenizer2(texts, truncation=True, padding=True, return_tensors='pt', max_length=256)
    
    if device is not None:
        model1.to(device)
        model2.to(device)
        enc1 = {k:v.to(device) for k,v in enc1.items()}
        enc2 = {k:v.to(device) for k,v in enc2.items()}
    
    with torch.no_grad():
        out1 = model1(**enc1).logits.cpu().numpy()
        out2 = model2(**enc2).logits.cpu().numpy()
    
    avg = (out1 + out2) / 2.0
    preds = avg.argmax(axis=1)
    return preds

# Select device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Example texts
texts = [
    "The government announced a new healthcare policy today.",
    "Aliens landed in London and started playing football!"
]

# Run ensemble prediction
preds = ensemble_predict(texts, model1, tokenizer1, model2, tokenizer2, device=device)

# Map numeric predictions to labels (optional)
label_names = {0: "Fake News", 1: "Real News"}
pred_labels = [label_names[p] for p in preds]

print("Predicted numeric labels:", preds)
print("Predicted text labels:", pred_labels)


Predicted numeric labels: [0 0]
Predicted text labels: ['Fake News', 'Fake News']
