<a href="https://colab.research.google.com/github/carlosprr29/ai-progetto-spagnoli/blob/main/notebooks/02_Baseline_Model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Setup Environment

In [None]:

##Installation and loading cell
!pip install -q datasets pandas scikit-learn

from datasets import load_dataset
import pandas as pd

# Download the WELFake dataset
print("Downloading dataset...")
dataset_raw = load_dataset("davanstrien/WELFake")
df = pd.DataFrame(dataset_raw['train'])

# Quick cleaning: remove rows that do not have text or a title
df = df.dropna(subset=['title', 'text'])
print(f"✅ Dataset loaded: {len(df)} rows.")


In [None]:
#Data Split Cell
from sklearn.model_selection import train_test_split

# We split the dataset (80% training, 20% testing)
# 'stratify' ensures that there is the same proportion of real and fake news in both groups
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label'])

print(f"Data for training: {len(train_df)}")
print(f"Data for evaluation: {len(test_df)}")
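
As a quick sanity check on the split above, the sketch below verifies on a small synthetic DataFrame that `stratify` keeps the class proportions identical in both partitions. The toy data and the names `toy_df`, `toy_train`, and `toy_test` are illustrative assumptions, not part of the notebook; the same `value_counts(normalize=True)` call could be applied to `train_df['label']` and `test_df['label']`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset: 60 rows of class 0 and 40 rows of class 1
toy_df = pd.DataFrame({
    "text": [f"article {i}" for i in range(100)],
    "label": [0] * 60 + [1] * 40,
})

# Same split settings as the notebook cell, applied to the toy data
toy_train, toy_test = train_test_split(
    toy_df, test_size=0.2, random_state=42, stratify=toy_df["label"]
)

# Share of each class in the two partitions
train_share = toy_train["label"].value_counts(normalize=True).sort_index()
test_share = toy_test["label"].value_counts(normalize=True).sort_index()

print("Train proportions:", train_share.to_dict())
print("Test proportions:", test_share.to_dict())
```

If the two dictionaries match, the accuracy measured on the test set is not distorted by a different class balance than the one seen during training.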



In [None]:
## Experiment 1: Baseline with Logistic Regression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# We define the training and evaluation function.
def train_and_evaluate(X_train, X_test, y_train, y_test, experiment_name):
    # 1. We create the model (TF-IDF + Logistic Regression)
    # TF-IDF turns each document into a vector of term weights; max_features=5000 caps the vocabulary to limit memory and training time
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(stop_words='english', max_features=5000)),
        ('clf', LogisticRegression(max_iter=1000))
    ])

    # 2. We train
    print(f"\n Training model for: {experiment_name}...")
    pipeline.fit(X_train, y_train)

    # 3. We predict
    predictions = pipeline.predict(X_test)

    # 4. Results
    acc = accuracy_score(y_test, predictions)
    print(f"✅ Accuracy ({experiment_name}): {acc:.4f}")
    print(classification_report(y_test, predictions))

    return acc

# --- RUNNING THE ABLATION STUDY ---

# CASE 1: Title only
acc_titles = train_and_evaluate(
      train_df['title'], test_df['title'],
      train_df['label'], test_df['label'],
      "TITLES ONLY"
)

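# CASE 2: Title + Text (concatenated into a new 'total' column)
# .assign returns new DataFrames, which avoids a possible SettingWithCopyWarning on the split outputs
train_df = train_df.assign(total=train_df['title'] + " " + train_df['text'])
test_df = test_df.assign(total=test_df['title'] + " " + test_df['text'])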

acc_completo = train_and_evaluate(
    train_df['total'], test_df['total'],
    train_df['label'], test_df['label'],
    "FULL TEXT"
)
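
The cell above imports `confusion_matrix` but only reports accuracy and the classification report. The sketch below shows, on made-up labels, how the matrix could be read. `y_true_toy` and `y_pred_toy` are hypothetical; in the notebook they would be `test_df['label']` and the pipeline's predictions, which `train_and_evaluate` would first need to return.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions, for illustration only
y_true_toy = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1])

# Treating class 1 as "positive" purely by convention
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"Class 0 correctly predicted: {tn}, class 0 predicted as 1: {fp}")
print(f"Class 1 correctly predicted: {tp}, class 1 predicted as 0: {fn}")
```

Unlike a single accuracy figure, the off-diagonal cells show which of the two classes the model confuses more often.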

**Baseline effectiveness:** The full-text model reaches 94.36% accuracy, a very high score for a simple TF-IDF + Logistic Regression baseline. This suggests that the WELFake dataset contains strong word-level patterns that make real and fake news easy to tell apart.

**Ablation Study (Initial finding):** Dropping the article body costs about 5 percentage points of accuracy (roughly 89% with titles only vs. 94% with title and text).
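
The gap can also be expressed as a reduction in error rate, which is often more telling when both accuracies are high. The sketch below uses the figures quoted in the text; the titles-only accuracy is only given as a rounded 89%, so the derived values are approximate, and the variable names are illustrative.

```python
# Accuracies as reported in the text; the titles-only value is rounded,
# so the numbers derived from it are approximate.
acc_titles_reported = 0.89
acc_full_reported = 0.9436

# Absolute gain in percentage points
gain_pp = (acc_full_reported - acc_titles_reported) * 100

# Error rates and the share of titles-only errors removed by adding the body
err_titles = 1 - acc_titles_reported
err_full = 1 - acc_full_reported
relative_error_reduction = (err_titles - err_full) / err_titles

print(f"Absolute gain: {gain_pp:.2f} percentage points")
print(f"Error rate: {err_titles:.2%} -> {err_full:.2%}")
print(f"Relative error reduction: {relative_error_reduction:.1%}")
```

On these rounded figures, adding the article body removes close to half of the errors made by the titles-only model.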

**Conclusion:** The headline alone is already very informative, but the body of the article supplies the context needed to correctly classify roughly 5 more out of every 100 test articles that the headline alone gets wrong.

**The challenge for BERT:** The next step is to check whether BERT can handle the more ambiguous news stories, where the telltale keywords are less obvious.
