# Clickbait Project – Clean & Merge Raw Clickbait Datasets

Goal: load selected raw datasets, harmonize labels, keep only `title` + `label`, and save a merged dataframe under `data/merged/`.

## Objectives

In this notebook we:

- Load the selected raw datasets from `data/raw`.
- Inspect the first few rows of every dataset to confirm its schema and content.
- Harmonize labels into a binary `label` column (1 = clickbait, 0 = non-clickbait) and keep only `title` + `label`.
- Concatenate the normalized datasets and save the result under `data/merged/`.
- Embed the merged titles and inspect the saved embeddings.

Dataset loading is implemented in `src/eda` and embedding in `src/vectorization`.
This notebook focuses on the merge itself and on showing the intermediate results.


## 0. Import libraries and modules

In [1]:
import sys
from pathlib import Path

import pandas as pd

# Assumes the notebook is run from a direct subdirectory of the project root.
PROJECT_ROOT = Path.cwd().parent

# Make the project's `src` package importable.
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))

DATA_RAW = PROJECT_ROOT / "data" / "raw"
DATA_MERGED = PROJECT_ROOT / "data" / "merged"
DATA_MERGED.mkdir(parents=True, exist_ok=True)

# Project imports must come after the sys.path update above.
from src.eda import (
    load_github_clickbait,
    load_kaggle_clickbait_data,
    load_kaggle_train2,
    load_kaggle_clickbait_news_detection,
)

from src.vectorization import embed_all

## 1. Load source datasets

Sources to merge (only `title` + `label` will be kept):
- GitHub snapshot (`clickbait.csv`)
- Kaggle top-level CSVs:
  - `clickbait_data.csv` (column: `headline`, label: `clickbait`)
  - `train2.csv` (column: `title`, label: `label` in {news, clickbait})
- Kaggle `clickbait-news-detection`:
  - `train.csv`, `valid.csv` (column: `title`, label: `label` in {news, clickbait})
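
Before merging, it can help to confirm that each loaded frame really has the text and label columns listed above. The following is a minimal sketch: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced here for illustration and are not part of `src.eda`, and the demo frame is a tiny stand-in for a real dataset.

```python
import pandas as pd

# Hypothetical mapping: dataset name -> (text column, label column), as listed above.
EXPECTED_COLUMNS = {
    "github": ("title", "label"),
    "clickbait_data": ("headline", "clickbait"),
    "train2": ("title", "label"),
    "cbd_train": ("title", "label"),
    "cbd_valid": ("title", "label"),
}


def missing_columns(name: str, df: pd.DataFrame) -> list[str]:
    """Return the expected columns that are absent from df (empty list if none)."""
    return [col for col in EXPECTED_COLUMNS[name] if col not in df.columns]


# Tiny in-memory stand-in for a loaded dataset.
demo = pd.DataFrame({"headline": ["Should I Get Bings"], "clickbait": [1]})
print(missing_columns("clickbait_data", demo))  # []
print(missing_columns("train2", demo))          # ['title', 'label']
```

Running a check like this right after loading makes a renamed or missing column fail early, instead of surfacing later as a `KeyError` inside a normalization function.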

In [2]:
github_df = load_github_clickbait()
k_clickbait_data = load_kaggle_clickbait_data()
k_train2 = load_kaggle_train2()

cbd = load_kaggle_clickbait_news_detection()
cbd_train = cbd.get("train")
cbd_valid = cbd.get("valid")

for name, df in [
    ("github_df", github_df),
    ("k_clickbait_data", k_clickbait_data),
    ("k_train2", k_train2),
    ("cbd_train", cbd_train),
    ("cbd_valid", cbd_valid),
]:
    print(name, df.shape)
    display(df.head())

github_df (31986, 2)


Unnamed: 0,title,label
0,"15 Highly Important Questions About Adulthood,...",1
1,250 Nuns Just Cycled All The Way From Kathmand...,1
2,"Australian comedians ""could have been shot"" du...",0
3,Lycos launches screensaver to increase spammer...,0
4,Fußball-Bundesliga 2008–09: Goalkeeper Butt si...,0


k_clickbait_data (32000, 2)


Unnamed: 0,headline,clickbait
0,Should I Get Bings,1
1,Which TV Female Friend Group Do You Belong In,1
2,"The New ""Star Wars: The Force Awakens"" Trailer...",1
3,"This Vine Of New York On ""Celebrity Big Brothe...",1
4,A Couple Did A Stunning Photo Shoot With Their...,1


k_train2 (21029, 2)


Unnamed: 0,label,title
0,news,China and Economic Reform: Xi Jinping’s Track ...
1,news,Trade to Be a Big Topic in Theresa May’s U.S. ...
2,clickbait,"The Top Beaches In The World, According To Nat..."
3,clickbait,Sheriff’s Report Provides New Details on Tamir...
4,news,Surgeon claiming he will transplant volunteer'...


cbd_train (24871, 4)


Unnamed: 0,id,title,text,label
0,0,China and Economic Reform: Xi Jinping’s Track ...,Economists generally agree: China must overhau...,news
1,1,Trade to Be a Big Topic in Theresa May’s U.S. ...,LONDON—British Prime Minister Theresa May said...,news
2,2,"The Top Beaches In The World, According To Nat...",Beaches come in all sorts of shapes and sizes ...,clickbait
3,3,Sheriff’s Report Provides New Details on Tamir...,"A timeline of what happened after Tamir Rice, ...",clickbait
4,4,Surgeon claiming he will transplant volunteer'...,An Italian neurosurgeon who has claimed for mo...,news


cbd_valid (3552, 4)


Unnamed: 0,id,title,text,label
0,0,Trump says he is releasing something 'phenomen...,"Bob Bryan, Business Insider 9.02.2017, 16:25 1...",news
1,1,Fidel Castro's ashes make their final journey ...,Cubans have been lining the streets from Havan...,news
2,2,Obama Administration Sending $500 Million to G...,WASHINGTON—The Obama administration announced ...,news
3,3,Insurers Are Worried About The House GOP Healt...,The main industry groups representing health i...,news
4,4,Kobe Bryant and Nike Form Youth Basketball 'Ma...,A year after Kobe Bryant concluded his NBA car...,news


## 2. Label harmonization

Target: binary `label` where 1 = clickbait, 0 = non-clickbait.

- GitHub: column `label` already 0/1.
- Kaggle `clickbait_data.csv`: column `clickbait` (0/1).
- Kaggle `train2.csv`: column `label` with values `clickbait` / `news`.
- `clickbait-news-detection` train/valid: column `label` with values `clickbait` / `news`.
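
The string-to-integer mapping described above can be sketched in isolation. This is an illustrative variant, not the notebook's own code: `to_binary_label` is a hypothetical helper, and it uses pandas' nullable `Int64` dtype so that unmapped or missing values stay visible as `<NA>` rather than turning the whole column into floats.

```python
import pandas as pd

# Hypothetical helper for illustration; not part of `src.eda`.
LABEL_MAP = {"clickbait": 1, "news": 0}


def to_binary_label(labels: pd.Series) -> pd.Series:
    """Map `clickbait` / `news` strings to 1 / 0 using a nullable integer dtype."""
    cleaned = labels.str.strip().str.lower()
    # Unmapped or missing values become <NA> instead of forcing a float column.
    return cleaned.map(LABEL_MAP).astype("Int64")


demo = pd.Series(["news", "Clickbait", " clickbait ", "other", None])
binary = to_binary_label(demo)
print(binary.tolist())
print(binary.dtype)
print("unmapped or missing:", int(binary.isna().sum()))
```

Whether to drop the `<NA>` rows or inspect them first is a separate decision; keeping them until that point makes the number of affected rows easy to report.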

In [3]:
def normalize_github(df: pd.DataFrame) -> pd.DataFrame:
    """GitHub snapshot: `title` and 0/1 `label` are already in the target format."""
    return df[["title", "label"]].copy()

def normalize_clickbait_data(df: pd.DataFrame) -> pd.DataFrame:
    """Kaggle `clickbait_data.csv`: rename `headline` -> `title`, `clickbait` -> `label`."""
    out = df.rename(columns={"headline": "title", "clickbait": "label"})
    return out[["title", "label"]]

def normalize_train2(df: pd.DataFrame) -> pd.DataFrame:
    """Kaggle `train2.csv`: map string labels (`clickbait` / `news`) to 1 / 0."""
    out = df.copy()
    # Any value other than `clickbait` / `news` becomes NaN and is dropped below.
    out["label"] = out["label"].str.lower().map({"clickbait": 1, "news": 0})
    # dropna() also removes rows with a missing title.
    return out[["title", "label"]].dropna()

def normalize_cbd(df: pd.DataFrame) -> pd.DataFrame:
    """`clickbait-news-detection` train/valid: same mapping as `train2.csv`."""
    out = df.copy()
    # Any value other than `clickbait` / `news` becomes NaN and is dropped below.
    out["label"] = out["label"].str.lower().map({"clickbait": 1, "news": 0})
    # dropna() also removes rows with a missing title; labels stay float if any NaN was mapped.
    return out[["title", "label"]].dropna()


In [4]:
github_norm = normalize_github(github_df)
clickbait_data_norm = normalize_clickbait_data(k_clickbait_data)
train2_norm = normalize_train2(k_train2)
cbd_train_norm = normalize_cbd(cbd_train)
cbd_valid_norm = normalize_cbd(cbd_valid)

for name, df in [
    ("github_norm", github_norm),
    ("clickbait_data_norm", clickbait_data_norm),
    ("train2_norm", train2_norm),
    ("cbd_train_norm", cbd_train_norm),
    ("cbd_valid_norm", cbd_valid_norm),
]:
    print(name, df.shape, df["label"].value_counts(dropna=False).to_dict())
    display(df.head())

github_norm (31986, 2) {0: 16000, 1: 15986}


Unnamed: 0,title,label
0,"15 Highly Important Questions About Adulthood,...",1
1,250 Nuns Just Cycled All The Way From Kathmand...,1
2,"Australian comedians ""could have been shot"" du...",0
3,Lycos launches screensaver to increase spammer...,0
4,Fußball-Bundesliga 2008–09: Goalkeeper Butt si...,0


clickbait_data_norm (32000, 2) {0: 16001, 1: 15999}


Unnamed: 0,title,label
0,Should I Get Bings,1
1,Which TV Female Friend Group Do You Belong In,1
2,"The New ""Star Wars: The Force Awakens"" Trailer...",1
3,"This Vine Of New York On ""Celebrity Big Brothe...",1
4,A Couple Did A Stunning Photo Shoot With Their...,1


train2_norm (21029, 2) {0: 16738, 1: 4291}


Unnamed: 0,title,label
0,China and Economic Reform: Xi Jinping’s Track ...,0
1,Trade to Be a Big Topic in Theresa May’s U.S. ...,0
2,"The Top Beaches In The World, According To Nat...",1
3,Sheriff’s Report Provides New Details on Tamir...,1
4,Surgeon claiming he will transplant volunteer'...,0


cbd_train_norm (18398, 2) {0.0: 14650, 1.0: 3748}


Unnamed: 0,title,label
0,China and Economic Reform: Xi Jinping’s Track ...,0.0
1,Trade to Be a Big Topic in Theresa May’s U.S. ...,0.0
2,"The Top Beaches In The World, According To Nat...",1.0
3,Sheriff’s Report Provides New Details on Tamir...,1.0
4,Surgeon claiming he will transplant volunteer'...,0.0


cbd_valid_norm (2631, 2) {0.0: 2088, 1.0: 543}


Unnamed: 0,title,label
0,Trump says he is releasing something 'phenomen...,0.0
1,Fidel Castro's ashes make their final journey ...,0.0
2,Obama Administration Sending $500 Million to G...,0.0
3,Insurers Are Worried About The House GOP Healt...,0.0
4,Kobe Bryant and Nike Form Youth Basketball 'Ma...,0.0
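
The output above shows `cbd_train` shrinking from 24871 to 18398 rows and `cbd_valid` from 3552 to 2631 after normalization, with labels printed as floats. The notebook does not show why those rows were dropped. A small diagnostic along the following lines could separate missing titles from labels that did not map; `drop_report` is a hypothetical helper and the demo frame is made up for illustration.

```python
import pandas as pd


def drop_report(df: pd.DataFrame, label_map: dict) -> dict:
    """Count rows that a map + dropna step would remove, split by cause."""
    mapped = df["label"].str.lower().map(label_map)
    missing_title = df["title"].isna()
    unmapped_label = mapped.isna()
    return {
        "rows": len(df),
        "missing_title": int(missing_title.sum()),
        "unmapped_label": int(unmapped_label.sum()),
        "dropped_total": int((missing_title | unmapped_label).sum()),
    }


# Made-up rows: one missing title, one unknown label, one missing label.
demo = pd.DataFrame(
    {
        "title": ["a", None, "c", "d"],
        "label": ["news", "clickbait", "other", None],
    }
)
print(drop_report(demo, {"clickbait": 1, "news": 0}))
# {'rows': 4, 'missing_title': 1, 'unmapped_label': 2, 'dropped_total': 3}
```

Applied to `cbd_train` and `cbd_valid`, a report like this would show whether the dropped rows carry a third label value, missing labels, or missing titles.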


## 3. Concatenate normalized datasets

We keep only `title` and `label` and stack all sources into one dataframe.

In [5]:
merged_df = pd.concat(
    [
        github_norm,
        clickbait_data_norm,
        train2_norm,
        cbd_train_norm,
        cbd_valid_norm,
    ],
    axis=0,
    ignore_index=True,
)

print("Merged shape:", merged_df.shape)
display(merged_df.head())

print("\nLabel distribution in merged:")
display(merged_df["label"].value_counts(normalize=False).to_frame("count"))


Merged shape: (106044, 2)


Unnamed: 0,title,label
0,"15 Highly Important Questions About Adulthood,...",1.0
1,250 Nuns Just Cycled All The Way From Kathmand...,1.0
2,"Australian comedians ""could have been shot"" du...",0.0
3,Lycos launches screensaver to increase spammer...,0.0
4,Fußball-Bundesliga 2008–09: Goalkeeper Butt si...,0.0



Label distribution in merged:


label,count
0.0,65477
1.0,40567
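
Two things stand out in the merged output: the labels are floats, and the first rows of `train2.csv` and the `clickbait-news-detection` `train.csv` shown earlier are identical, which suggests the sources may overlap. The sketch below shows one possible way to cast labels back to integers and to count duplicates and conflicting labels. It runs on a made-up frame, and whether duplicates should actually be removed is a project decision this notebook does not make.

```python
import pandas as pd

# Made-up stand-in for merged_df: "A" appears three times, once with a different label.
demo = pd.DataFrame(
    {"title": ["A", "B", "A", "A"], "label": [1.0, 0.0, 1.0, 0.0]}
)

# Safe only because the NaN labels were already dropped during normalization.
demo["label"] = demo["label"].astype(int)

# Rows that repeat an earlier (title, label) pair exactly.
exact_dupes = int(demo.duplicated(subset=["title", "label"]).sum())

# Titles that occur with more than one distinct label across sources.
labels_per_title = demo.groupby("title")["label"].nunique()
conflicting = labels_per_title[labels_per_title > 1].index.tolist()

deduped = demo.drop_duplicates(subset=["title", "label"]).reset_index(drop=True)

print(exact_dupes)   # 1
print(conflicting)   # ['A']
print(len(deduped))  # 3
```

If duplicates are removed, it should happen before embedding so that the embedding rows stay aligned with the saved CSV.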


## 4. Save merged dataframe

We save the merged dataframe to `data/merged/data_merged.csv` for downstream notebooks.

In [6]:
merged_csv = DATA_MERGED / "data_merged.csv"
merged_df.to_csv(merged_csv, index=False)
merged_csv

PosixPath('/mnt/c/Users/user/Desktop/MSc - AI/Εξάμηνο 1ο/Μηχανική Μάθηση/Clickbait_Machine_Learning_Project/data/merged/data_merged.csv')
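
Since downstream notebooks read this CSV back, a quick round-trip check can confirm that titles containing commas, quotes, or non-ASCII characters survive the write. The following sketch uses a temporary directory and a made-up two-row frame rather than the real `merged_df`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up titles with a comma, embedded quotes, and non-ASCII characters.
demo = pd.DataFrame(
    {
        "title": ['Comma, "quoted" title', "Fußball-Bundesliga 2008–09"],
        "label": [1, 0],
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "demo_merged.csv"
    demo.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# True when values and dtypes match after the round trip.
print(reloaded.equals(demo))
```

One caveat with this approach: `pd.read_csv` treats strings such as `NA` or `null` as missing values by default, so a title consisting only of such a token would not round-trip unchanged.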

## 5. Vectorize merged titles

We embed `data/merged/data_merged.csv` using the default Gemma 3 4B model from `src.vectorization` and save the embeddings under `data/merged/`.

- Input root: `data/merged/` (contains `data_merged.csv`)
- Output root: `data/merged/`
- Model: default (`google/gemma-3-4b-it`)
- Column candidates: defaults to (`title`, `headline`, `targetTitle`)
- Output format: parquet (default in `embed_all`)
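
As a sanity check on the batching, the number of batches follows directly from the merged row count and the `batch_size=100` used in the call. The arithmetic below assumes `embed_all` processes rows in consecutive fixed-size batches, which is consistent with the progress bar in the cell output but is not confirmed by its source here.

```python
import math

n_rows = 106_044        # rows in the merged dataframe
batch_size = 100        # value passed to embed_all in the cell

n_batches = math.ceil(n_rows / batch_size)
last_batch = n_rows - (n_batches - 1) * batch_size

print(n_batches)   # 1061
print(last_batch)  # 44

# The progress bar reports 13:44 for the run, i.e. 824 seconds.
elapsed_s = 13 * 60 + 44
print(round(n_batches / elapsed_s, 2))  # 1.29
```

The 1061 batches and the rate of about 1.29 batches per second both match the progress bar, so the run covered every merged row.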

In [7]:
output_paths = embed_all(
    raw_root=DATA_MERGED,
    embedded_root=DATA_MERGED,
    column=("title", "headline", "targetTitle"),
    batch_size=100,
    output_format="parquet",
    skip_existing=False,
)
output_paths

100%|███████████████████████████████████████████████████████████████████████████████| 1061/1061 [13:44<00:00,  1.29it/s]


Embeddings have been created successfully!!
[INFO] Saved embeddings for /mnt/c/Users/user/Desktop/MSc - AI/Εξάμηνο 1ο/Μηχανική Μάθηση/Clickbait_Machine_Learning_Project/data/merged/data_merged.csv -> /mnt/c/Users/user/Desktop/MSc - AI/Εξάμηνο 1ο/Μηχανική Μάθηση/Clickbait_Machine_Learning_Project/data/merged/google__gemma-3-4b-it/data_merged_embed.parquet


[PosixPath('/mnt/c/Users/user/Desktop/MSc - AI/Εξάμηνο 1ο/Μηχανική Μάθηση/Clickbait_Machine_Learning_Project/data/merged/google__gemma-3-4b-it/data_merged_embed.parquet')]

## 6. Inspect the saved embeddings

In [8]:
embed_path = output_paths[0]
embeddings_df = pd.read_parquet(embed_path)
print(embed_path, embeddings_df.shape)
display(embeddings_df.head())

/mnt/c/Users/user/Desktop/MSc - AI/Εξάμηνο 1ο/Μηχανική Μάθηση/Clickbait_Machine_Learning_Project/data/merged/google__gemma-3-4b-it/data_merged_embed.parquet (106044, 2560)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2550,2551,2552,2553,2554,2555,2556,2557,2558,2559
0,-1.523438,-3.09375,2.46875,0.125977,0.707031,-1.179688,0.076172,1.242188,-0.202148,3.109375,...,1.976562,0.953125,2.484375,-0.152344,-0.011353,-2.578125,-0.449219,-1.476562,-0.291016,0.488281
1,0.146484,-2.28125,1.140625,0.519531,-0.18457,-0.933594,-0.621094,0.384766,-0.859375,0.988281,...,0.859375,0.186523,2.0625,-1.28125,-0.123535,-3.5625,0.289062,-0.941406,0.064453,-1.171875
2,-1.265625,-5.78125,2.515625,-0.933594,-0.441406,-1.625,1.992188,1.359375,-3.4375,0.115234,...,0.082031,0.769531,1.710938,-0.255859,-1.351562,-2.546875,-1.695312,-2.59375,0.96875,-0.875
3,-2.046875,-5.84375,3.625,-0.004639,-1.242188,-2.671875,1.539062,2.484375,-3.59375,0.209961,...,2.296875,0.574219,1.523438,-0.147461,-0.875,-2.453125,-2.546875,-2.28125,0.271484,-0.173828
4,0.246094,0.251953,0.699219,1.375,-1.109375,0.177734,-1.773438,2.0625,2.296875,0.080566,...,-0.113281,1.546875,1.203125,-0.226562,0.166016,-2.0625,-0.108398,1.421875,-1.328125,-0.363281
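
The embedding table has 106044 rows, the same as the merged dataframe, but it carries no `title` or `label` column. To use it for training, the rows need to be paired with labels by position. The sketch below assumes `embed_all` preserves the row order of the input CSV, which the notebook does not verify. `attach_labels` is a hypothetical helper and the demo data is made up.

```python
import numpy as np
import pandas as pd


def attach_labels(
    embeddings: pd.DataFrame, merged: pd.DataFrame
) -> tuple[np.ndarray, np.ndarray]:
    """Pair embedding rows with labels by position, after basic sanity checks.

    Assumption: row i of `embeddings` corresponds to row i of `merged`.
    """
    if len(embeddings) != len(merged):
        raise ValueError(
            f"Row count mismatch: {len(embeddings)} embeddings vs {len(merged)} rows"
        )
    X = embeddings.to_numpy(dtype=np.float32)
    if not np.isfinite(X).all():
        raise ValueError("Embeddings contain NaN or infinite values")
    y = merged["label"].to_numpy().astype(int)
    return X, y


# Made-up stand-ins for embeddings_df and merged_df.
demo_embeddings = pd.DataFrame(np.arange(12, dtype=np.float32).reshape(3, 4))
demo_merged = pd.DataFrame({"title": ["a", "b", "c"], "label": [1.0, 0.0, 1.0]})

X, y = attach_labels(demo_embeddings, demo_merged)
print(X.shape)     # (3, 4)
print(y.tolist())  # [1, 0, 1]
```

With the real data this would be called as `attach_labels(embeddings_df, merged_df)`, giving a feature matrix of shape `(106044, 2560)` and a matching label vector.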
