# Zero-shot classification for fake news detection

In this notebook, we use a pretrained language model as a zero-shot classifier to detect fake news on both the ISOT and Kaggle Real or Fake datasets. Because the model is used as-is, without any fine-tuning, we need neither the preprocessing applied before nor the training set. As with the other models, we evaluate on the test set only.

In [24]:
import pandas as pd

from transformers import pipeline

from sklearn.metrics import classification_report

In [29]:
# Load the test set: uncomment the line for the dataset to evaluate (ISOT or Kaggle Real or Fake)
# test_df = pd.read_csv("data/kaggle/preprocessed/test.csv")

test_df = pd.read_csv("data/isot/preprocessed/test.csv")

In [30]:
# Load a Hugging Face zero-shot pipeline backed by an NLI model (RoBERTa-large fine-tuned on MNLI)
classifier = pipeline("zero-shot-classification", model="roberta-large-mnli")

# Defining the labels in natural language for the model
candidate_labels = ['Fake', 'Real']

Some weights of the model checkpoint at roberta-large-mnli were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0
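
To make clear what the pipeline does with `candidate_labels`, here is a minimal sketch of the idea behind NLI-based zero-shot classification. Each label is turned into a hypothesis sentence (the pipeline's default template is `"This example is {}."`), the article acts as the premise, and in single-label mode the entailment logits are normalized across labels with a softmax. The article text and the logits below are made up for illustration, and the helper names `build_nli_pairs` and `softmax` are hypothetical; the real pipeline gets its logits from the model.

```python
import math

# Default hypothesis template used by the zero-shot pipeline when none is passed.
HYPOTHESIS_TEMPLATE = "This example is {}."


def build_nli_pairs(text, labels, template=HYPOTHESIS_TEMPLATE):
    """Pair the article (premise) with one hypothesis per candidate label."""
    return [(text, template.format(label)) for label in labels]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical article text and entailment logits, for illustration only.
pairs = build_nli_pairs("Senate passes budget bill.", ["Fake", "Real"])
entailment_logits = [0.3, 1.1]  # one made-up entailment logit per label
scores = softmax(entailment_logits)

for (_, hypothesis), score in zip(pairs, scores):
    print(f"{hypothesis!r}: {score:.3f}")
```

Since the label wording is inserted directly into the hypothesis, different phrasings of the labels or a different `hypothesis_template` may change the predictions.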


In [31]:
batch_size = 32  # Adjust batch size as needed
results = []

# Loop through the dataset in batches
for i in range(0, len(test_df), batch_size):
    batch = test_df.iloc[i:i+batch_size]
    texts = batch["text"].tolist()
    
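    # Classify this chunk; passing batch_size lets the pipeline batch the forward passes
    # (a pipeline otherwise processes list inputs one item at a time by default)
    result = classifier(texts, candidate_labels=candidate_labels, batch_size=batch_size)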
    
    # Collect results
    for text, res in zip(texts, result):
        results.append({
            "text": text,
            "predicted_label": res["labels"][0],
            "score": res["scores"][0]
        })

# Create DataFrame from results
results_df = pd.DataFrame(results)
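
The loop above relies on the shape of the pipeline's output: for each input text it returns a dict with the keys `sequence`, `labels`, and `scores`, where the labels are sorted from highest to lowest score. The sketch below uses a mock result with made-up scores to show why index `0` is the prediction; `top_prediction` and `score_for` are hypothetical helpers, not part of `transformers`.

```python
# Hypothetical output for one article; the scores are made up for illustration.
mock_result = {
    "sequence": "Senate passes budget bill.",
    "labels": ["Real", "Fake"],  # sorted from most to least likely
    "scores": [0.69, 0.31],      # aligned with "labels"
}


def top_prediction(res):
    """Return (label, score) for the highest-scoring candidate label."""
    return res["labels"][0], res["scores"][0]


def score_for(res, label):
    """Return the score of a specific label, wherever it sits in the ranking."""
    return res["scores"][res["labels"].index(label)]


label, score = top_prediction(mock_result)
fake_score = score_for(mock_result, "Fake")
print(label, score, fake_score)
```

Looking up the score of a fixed label, as `score_for` does, could be useful for trying a decision threshold other than "highest score wins".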

In [32]:
# True labels and predicted labels
y_true = test_df["label"]
y_pred = results_df["predicted_label"].map({"Fake": 0, "Real": 1})

# Print classification report
print(classification_report(y_true, y_pred))


              precision    recall  f1-score   support

           0       0.81      0.20      0.32      4702
           1       0.52      0.95      0.67      4278

    accuracy                           0.56      8980
   macro avg       0.67      0.58      0.50      8980
weighted avg       0.67      0.56      0.49      8980
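
With the mapping used in the previous cell, class `0` is Fake and class `1` is Real. A recall of 0.20 on class `0` means that most fake articles are labelled Real. As a sanity check on how to read the report, the sketch below recomputes the F1 scores from the rounded precision and recall values and compares accuracy to a majority-class baseline derived from the support column. The helper name `f1_from` is hypothetical.

```python
def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values copied from the report above (already rounded to two decimals).
reported = {0: (0.81, 0.20), 1: (0.52, 0.95)}
support = {0: 4702, 1: 4278}

f1 = {label: f1_from(p, r) for label, (p, r) in reported.items()}

# Accuracy of always predicting the most frequent class in the test set.
majority_baseline = max(support.values()) / sum(support.values())

for label, value in f1.items():
    print(f"class {label}: f1 = {value:.2f}")
print(f"majority-class baseline accuracy = {majority_baseline:.2f}")
```

The recomputed F1 values round to 0.32 and 0.67, matching the report, and the baseline is about 0.52, so the reported accuracy of 0.56 is only slightly above always predicting the majority class.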

