# News Clickbait Detection Model Using ANN

This notebook trains a feed-forward artificial neural network (ANN) on TF-IDF features to classify news headlines as clickbait or non-clickbait.

**Description:** Installs TensorFlow 2.13.0, the version this notebook pins for building and training the neural network model.

In [None]:
!pip install tensorflow==2.13.0

Collecting tensorflow==2.13.0
  Downloading tensorflow-2.13.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.4 kB)
Collecting gast<=0.4.0,>=0.2.1 (from tensorflow==2.13.0)
  Downloading gast-0.4.0-py3-none-any.whl.metadata (1.1 kB)
Collecting keras<2.14,>=2.13.1 (from tensorflow==2.13.0)
  Downloading keras-2.13.1-py3-none-any.whl.metadata (2.4 kB)
Collecting numpy<=1.24.3,>=1.22 (from tensorflow==2.13.0)
  Downloading numpy-1.24.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Collecting tensorboard<2.14,>=2.13 (from tensorflow==2.13.0)
  Downloading tensorboard-2.13.0-py3-none-any.whl.metadata (1.8 kB)
Collecting tensorflow-estimator<2.14,>=2.13.0 (from tensorflow==2.13.0)
  Downloading tensorflow_estimator-2.13.0-py2.py3-none-any.whl.metadata (1.3 kB)
Collecting typing-extensions<4.6.0,>=3.6.6 (from tensorflow==2.13.0)
  Downloading typing_extensions-4.5.0-py3-none-any.whl.metadata (8.5 kB)
Collecting google-auth-oauthlib<1.1,>=0

**Description:** Imports the libraries used for data manipulation, text processing, model building, and evaluation: pandas, NumPy, NLTK, scikit-learn, TensorFlow, joblib, and kagglehub. Also downloads the NLTK resources needed for tokenization.

In [1]:
# Standard library
import pickle
import re

# Third-party libraries
import joblib
import kagglehub
import nltk
import numpy as np
import pandas as pd
import tensorflow as tf
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# NLTK resources; word_tokenize needs the punkt tokenizer models
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


# Data Preparation

https://www.kaggle.com/datasets/amananandrai/clickbait-dataset

The clickbait headlines are collected from sites such as ‘BuzzFeed’, ‘Upworthy’, ‘ViralNova’, ‘Thatscoop’, ‘Scoopwhoop’ and ‘ViralStories’.
The non-clickbait headlines are collected from trustworthy news sites such as ‘WikiNews’, ‘New York Times’, ‘The Guardian’, and ‘The Hindu’.

**Description:** Downloads the clickbait dataset using `kagglehub`, loads it into a pandas DataFrame, and preprocesses the text data. Preprocessing includes:
   - Subsampling 50% of each class and shuffling the rows.
   - Dropping missing values and duplicate rows.
   - Converting headlines to lowercase.
   - Removing punctuation and numbers.
   - Stemming words using the Porter Stemmer.
   - Creating TF-IDF features from the headlines, with English stop words removed.
   - Scaling features to the range [0, 1] using MinMaxScaler.
   - Splitting the data into training (80%) and testing (20%) sets.
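
The text-cleaning steps in this section can be tried on a single string before running them on the whole DataFrame. The sketch below is a plain-Python approximation that uses only `re`; the helper name `clean_headline` and the sample headline are made up for illustration, and stemming is left out so the snippet has no NLTK dependency.

```python
import re


def clean_headline(text: str) -> str:
    """Lowercase, keep only letters and whitespace, then trim the ends."""
    text = text.lower()
    # Dropping everything that is not a letter or whitespace removes
    # punctuation and digits in the same pass.
    text = re.sub(r"[^a-zA-Z\s]", "", text)
    return text.strip()


# Hypothetical headline, not taken from the dataset.
raw = "  21 Things You Won't Believe Happened in 2015!  "
cleaned = clean_headline(raw)
print(cleaned)  # things you wont believe happened in

# After the step above, a further digit-removal pass has nothing left to remove.
print(re.sub(r"\d+", "", cleaned) == cleaned)  # True
```

Note that removing the apostrophe turns "won't" into "wont", which is what the same regular expression does to headlines in the notebook.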

In [2]:
# Download the dataset
path = kagglehub.dataset_download("amananandrai/clickbait-dataset")
df = pd.read_csv(path + "/clickbait_data.csv")

Downloading from https://www.kaggle.com/api/v1/datasets/download/amananandrai/clickbait-dataset?dataset_version_number=1...
100%|██████████| 743k/743k [00:00<00:00, 71.2MB/s]
Extracting files...

In [3]:
# Preprocessing
# Keep a random 50% of each class, then shuffle the combined rows
clickbait = df[df['clickbait'] == 1].sample(frac=0.5, random_state=42)
non_clickbait = df[df['clickbait'] == 0].sample(frac=0.5, random_state=42)
df = pd.concat([clickbait, non_clickbait]).sample(frac=1, random_state=42)

# Drop rows with missing values (a no-op when there are none)
df = df.dropna()

# Drop exact duplicate rows
df = df.drop_duplicates()

# Convert headlines to lowercase
df['headline'] = df['headline'].str.lower()

# Keep only letters and whitespace; this removes punctuation and digits in one pass
df['headline'] = df['headline'].str.replace(r'[^a-zA-Z\s]', '', regex=True)

# Remove leading and trailing whitespace
df['headline'] = df['headline'].str.strip()

# stemmer
stemmer = PorterStemmer()
def stem_headline(headline):
  tokens = word_tokenize(headline)
  stemmed_tokens = [stemmer.stem(token) for token in tokens]
  return ' '.join(stemmed_tokens)
df['headline'] = df['headline'].apply(stem_headline)

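# TF-IDF features (English stop words are removed by the vectorizer)
# Note: the vectorizer and the scaler are fitted on the full dataset before the
# train/test split, so test-set statistics leak into the features.
vectorizer = TfidfVectorizer(stop_words='english')
response = vectorizer.fit_transform(df['headline'])
X = response.toarray()  # dense array; memory use grows with vocabulary size
y = df['clickbait'].to_numpy()

# Scale each feature to the range [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))
X = scaler.fit_transform(X)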

# Split Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
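
One possible variation, shown here only as a sketch, is to split the raw headlines first and fit the vectorizer and scaler on the training portion alone, so that no test-set statistics reach the features. The headlines and labels below are made up for illustration and stand in for the cleaned `df['headline']` and `df['clickbait']` columns; the `stratify` argument is an addition that the notebook does not use.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Made-up, already-cleaned headlines used only for illustration (1 = clickbait).
headlines = [
    "thing you wont believe about cat",
    "govern announc new budget plan",
    "reason you need thi gadget now",
    "court rule on trade disput",
    "photo that will make you smile",
    "storm hit coastal town overnight",
    "quiz which pizza are you",
    "bank rais interest rate again",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw text first so the test rows stay unseen during fitting.
train_text, test_text, y_train, y_test = train_test_split(
    headlines, labels, test_size=0.25, random_state=42, stratify=labels
)

# Fit the vocabulary and IDF weights on the training text only.
vectorizer = TfidfVectorizer(stop_words="english")
X_train = vectorizer.fit_transform(train_text).toarray()
X_test = vectorizer.transform(test_text).toarray()

# Fit the scaler on the training features only, then reuse it for the test set.
scaler = MinMaxScaler(feature_range=(0, 1))
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(X_train.shape, X_test.shape)
```

With this ordering, words that appear only in the test headlines are ignored by `transform`, which mirrors what happens when the model later scores unseen headlines.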

In [4]:
# prompt: turn X_test and y_test into pandas dataframe, then export as csv

# Create pandas DataFrames
X_test_df = pd.DataFrame(X_test)
y_test_df = pd.DataFrame(y_test)

# Export to CSV
X_test_df.to_csv('X_test.csv', index=False)
y_test_df.to_csv('y_test.csv', index=False)

In [None]:
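# Persist the fitted TF-IDF vectorizer for reuse at inference time.
# The fitted MinMaxScaler is not saved here; scoring new headlines would need it as well.
joblib.dump(vectorizer, 'tfidf.joblib')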

['tfidf.joblib']
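
Scoring a new headline later requires every fitted preprocessing object, not only the vectorizer. The sketch below shows one way to bundle a vectorizer and a scaler into a single joblib file and reload them; the file name `preprocessing.joblib`, the dictionary keys, and the toy headlines are assumptions made for illustration and are not part of the notebook.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

# Made-up, already-cleaned training headlines used only for illustration.
train_headlines = [
    "thing you wont believe about cat",
    "govern announc new budget plan",
    "photo that will make you smile",
    "court rule on trade disput",
]

# Fit the same two preprocessing steps the notebook uses.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(train_headlines).toarray()
scaler = MinMaxScaler(feature_range=(0, 1)).fit(X)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name; both objects go into one file.
    path = os.path.join(tmp_dir, "preprocessing.joblib")
    joblib.dump({"vectorizer": vectorizer, "scaler": scaler}, path)
    loaded = joblib.load(path)

# Apply the reloaded objects to a new (already cleaned and stemmed) headline.
new_text = ["cat photo you wont believ"]
new_X = loaded["scaler"].transform(loaded["vectorizer"].transform(new_text).toarray())
print(new_X.shape)
```

The resulting row has the same number of columns as the training matrix, so it can be passed to `model.predict` of a network trained on those features.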

# Modeling

https://www.tensorflow.org/

**Description:** Defines a sequential neural network model using TensorFlow/Keras. The model consists of an input layer sized to the number of TF-IDF features, two hidden layers (64 and 32 units) with ReLU activation, and a single-unit output layer with sigmoid activation. It is compiled with the Adam optimizer, binary cross-entropy loss, and the accuracy metric, then trained on the training data for 10 epochs.

In [5]:
# Model
model = tf.keras.models.Sequential([
  tf.keras.layers.Input(shape=(X.shape[-1],)),
  tf.keras.layers.Dense(64, activation='relu'),
  tf.keras.layers.Dense(32, activation='relu'),
  tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10)

Epoch 1/10
400/400 ━━━━━━━━━━━━━━━━━━━━ 6s 12ms/step - accuracy: 0.8139 - loss: 0.4143
Epoch 2/10
400/400 ━━━━━━━━━━━━━━━━━━━━ 6s 15ms/step - accuracy: 0.9861 - loss: 0.0456
Epoch 3/10
400/400 ━━━━━━━━━━━━━━━━━━━━ 5s 13ms/step - accuracy: 0.9964 - loss: 0.0151
Epoch 4/10
400/400 ━━━━━━━━━━━━━━━━━━━━ 5s 11ms/step - accuracy: 0.9994 - loss: 0.0053
Epoch 5/10
400/400 ━━━━━━━━━━━━━━━━━━━━ 6s 15ms/step - accuracy: 0.9999 - loss: 0.0014
Epoch 6/10
400/400 ━━━━━━━━━━━━━━━━━━━━ 5s 13ms/step - accuracy: 1.0000 - loss: 5.1469e-04
Epoch 7/10
400/400 ━━━━━━━━━━━━━━━━━━━━ 11s 15ms/step - accuracy: 1.0000 - loss: 3.5714e-04
Epoch 8/10
400/400 ━━━━━━━━━━━━━━━━━━━━ 5s 12ms/step - accuracy: 1.0000 - loss: 2.6890e-04
Epoch 9/10
...

<keras.src.callbacks.history.History at 0x7e189ecd42e0>
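
The step counts in the progress bars can be cross-checked against the dataset sizes. The sketch below assumes the Keras default batch size of 32 (the notebook does not set `batch_size`) and an exact 80/20 split; the test-set size is taken from the sum of the confusion matrix reported in the Evaluate section, and the helper name `steps_per_epoch` is made up for illustration.

```python
import math


def steps_per_epoch(n_samples: int, batch_size: int = 32) -> int:
    """Number of batches needed to cover n_samples once."""
    return math.ceil(n_samples / batch_size)


# Test-set size: sum of the four confusion-matrix cells.
n_test = 1458 + 125 + 92 + 1525
# With test_size=0.2, the full (subsampled) dataset is five times larger.
n_total = round(n_test / 0.2)
n_train = n_total - n_test

print(n_test, n_total, n_train)   # 3200 16000 12800
print(steps_per_epoch(n_train))   # 400, matching "400/400" during training
print(steps_per_epoch(n_test))    # 100, matching "100/100" during prediction
```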

# Evaluate

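**Description:** Evaluates the trained model on the testing data. Calculates and prints the confusion matrix, accuracy, precision, and recall. The test accuracy reported below (0.9322) is noticeably lower than the training accuracy reached during fitting, which suggests the model overfits the training set.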

In [6]:
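# Evaluate
y_true = y_test
# Threshold the sigmoid outputs at 0.5 and flatten (n, 1) to (n,) to match y_true
y_pred = (model.predict(X_test) > 0.5).astype(int).ravel()
cm = confusion_matrix(y_true, y_pred)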
print("Confusion Matrix:")
print(cm)

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)

print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")

[1m100/100[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 12ms/step
Confusion Matrix:
[[1458  125]
 [  92 1525]]
Accuracy: 0.9322
Precision: 0.9242
Recall: 0.9431
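
The reported metrics follow directly from the confusion matrix. As a quick check, the sketch below recomputes them in plain Python, using scikit-learn's layout in which rows are true labels and columns are predicted labels, so the cells read `[[TN, FP], [FN, TP]]`. The F1 score is an extra derived value that the notebook itself does not print.

```python
# Confusion matrix copied from the output above: [[TN, FP], [FN, TP]].
cm = [[1458, 125], [92, 1525]]
(tn, fp), (fn, tp) = cm

total = tn + fp + fn + tp
accuracy = (tp + tn) / total   # share of all headlines classified correctly
precision = tp / (tp + fp)     # share of predicted clickbait that is clickbait
recall = tp / (tp + fn)        # share of actual clickbait that was caught
f1 = 2 * precision * recall / (precision + recall)  # derived, not in the notebook

print(f"Accuracy: {accuracy:.4f}")    # Accuracy: 0.9322
print(f"Precision: {precision:.4f}")  # Precision: 0.9242
print(f"Recall: {recall:.4f}")        # Recall: 0.9431
print(f"F1: {f1:.4f}")
```

Of the 3200 test headlines, 125 non-clickbait headlines were flagged as clickbait and 92 clickbait headlines were missed.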


**Description:** Saves the trained model to a file named 'model.h5' in the HDF5 format so that it can be loaded and reused later.

In [None]:
model.save('model.h5')


In [7]:
model.summary()
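
`model.summary()` lists each layer with its output shape and parameter count. The counts for the three `Dense` layers can be estimated by hand, since a dense layer has one weight per input-output pair plus one bias per output unit. The sketch below is only an illustration: the feature count of 10,000 is a hypothetical placeholder, because the real value is the TF-IDF vocabulary size (`X.shape[-1]`), which the notebook does not print.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Weights (n_in * n_out) plus biases (n_out) of one Dense layer."""
    return n_in * n_out + n_out


def count_params(n_features: int, widths=(64, 32, 1)) -> int:
    """Total trainable parameters for a stack of Dense layers."""
    total = 0
    n_in = n_features
    for n_out in widths:
        total += dense_params(n_in, n_out)
        n_in = n_out  # this layer's output feeds the next layer
    return total


# Hypothetical vocabulary size; replace with X.shape[-1] from the notebook.
n_features = 10_000
print(dense_params(n_features, 64))  # 640064
print(dense_params(64, 32))          # 2080
print(dense_params(32, 1))           # 33
print(count_params(n_features))      # 642177
```

Almost all parameters sit in the first layer, so the model size is driven mainly by the vocabulary size.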