# Challenge - Clickbait Title Detection

# Background information

Clickbait titles and thumbnails plague the internet and lower user satisfaction with services such as YouTube and news sites. Because of the amount of new content on these sites, it is impossible to moderate it manually. That is why giants like Facebook (Meta), Twitter, Amazon, and Google (Alphabet) are investing huge resources in NLP systems that are able to curate the internet environment autonomously.

To build our clickbait detection model, we will use a Bag of Words encoding and a sequential model.

# Data

We will use the clickbait data, which you can download from our GitHub. It has two columns: `headline`, containing the titles, and `clickbait`, containing the labels. As the separator we use `;`, because commas also appear in the headline text and can cause parsing problems on some systems.
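
To illustrate why the semicolon separator matters, here is a minimal sketch using Python's standard `csv` module. The two sample rows are made up for illustration and are not taken from the real dataset.

```python
import csv
import io

# Made-up sample in the same layout as the dataset: two columns separated by ';'.
sample = (
    "headline;clickbait\n"
    "10 Tricks Doctors Hate, Number 7 Will Shock You;1\n"
    "Parliament passes budget, opposition objects;0\n"
)

# With ';' as the delimiter, commas inside a headline stay part of the text.
rows = list(csv.DictReader(io.StringIO(sample), delimiter=";"))
headlines = [row["headline"] for row in rows]
labels = [int(row["clickbait"]) for row in rows]

print(headlines)
print(labels)
```

When loading the file with pandas, the same separator is passed as `sep=';'` to `pd.read_csv`.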

In [None]:
#Import required libraries and download NLTK resources
import numpy as np
from numpy import array
import pandas as pd
import tensorflow as tf
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Embedding

import nltk
nltk.download('all')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

In [None]:
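#Load data into dataframe (the file uses ';' as the separator)
DATA_PATH = 'train.csv'
df = pd.read_csv(DATA_PATH, sep=';')
df.dropna(subset=["clickbait"], inplace=True)
#Shuffle the rows; np.random.shuffle(df.values) may act on a copy and leave df unchanged
df = df.sample(frac=1).reset_index(drop=True)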

In [None]:
#Load corpus 
corpus = list(df["headline"])
labels = array(df["clickbait"])

# Preprocessing

In NLP there are multiple ways to approach preprocessing. Which steps to apply is largely up to us, and not all of them help on every task.
The most common preprocessing techniques are:
- Removing stopwords
- Lemmatization
- Stemming
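
The difference between the three techniques can be sketched with a toy example. Everything here is a simplified stand-in: the stopword set, the lemma lookup table, and the suffix-stripping rule are assumptions made for illustration, not the NLTK stopword list, `WordNetLemmatizer`, or `PorterStemmer` used in this notebook.

```python
# Toy illustration only: tiny hand-written stand-ins for the real NLTK tools.
STOPWORDS = {"the", "a", "is", "are", "you", "to", "of"}  # assumed mini stopword list
LEMMAS = {"were": "be", "mice": "mouse", "better": "good"}  # assumed lookup table


def remove_stopwords(tokens):
    """Drop very common words that carry little meaning."""
    return [t for t in tokens if t.lower() not in STOPWORDS]


def crude_stem(word):
    """Chop a common suffix; the result does not have to be a real word."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Map a word to its dictionary form using the lookup table."""
    return LEMMAS.get(word, word)


tokens = "The mice are hiding".split()
filtered = remove_stopwords(tokens)
stems = [crude_stem(t) for t in filtered]
lemmas = [toy_lemmatize(t) for t in filtered]

print(filtered)  # ['mice', 'hiding']
print(stems)     # ['mice', 'hid']
print(lemmas)    # ['mouse', 'hiding']
```

Stemming works by rule and can produce fragments such as `hid`, while lemmatization relies on a dictionary and returns real words such as `mouse`.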

In [None]:
#Get all unique words
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
#Use a new name so the imported stopwords module is not overwritten
stop_words = set(stopwords.words('english'))

all_words = []
for sent in corpus:
    tokenize_word = word_tokenize(sent)
    for word in tokenize_word:
        #The NLTK stopword list is lowercase, so compare in lowercase
        if word.lower() not in stop_words:
            word = stemmer.stem(word)
            word = lemmatizer.lemmatize(word)
            all_words.append(word)
unique_words = set(all_words)
print(len(unique_words))

# Embeddings

Creating embeddings can also be seen as a form of preprocessing, and it is perhaps the most important choice you make when building an NLP model. We are using the Bag of Words approach, which is very simplistic. There are better embeddings for this task, but there are situations when BoW is the best option.
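
As a rough sketch of the classic Bag of Words idea, the snippet below builds a vocabulary and turns each headline into a vector of word counts. The two headlines are made up, and this plain-Python version is only an illustration of the concept, not the encoding cell used in this notebook.

```python
from collections import Counter

# Made-up headlines used only for this illustration.
headlines = ["you will not believe this", "this is not a drill"]

# The vocabulary is the sorted set of all words seen in the corpus.
vocab = sorted({word for headline in headlines for word in headline.split()})


def bow_vector(text):
    """Count how often each vocabulary word occurs in the text (word order is lost)."""
    counts = Counter(text.split())
    return [counts[word] for word in vocab]


vectors = [bow_vector(headline) for headline in headlines]

print(vocab)    # ['a', 'believe', 'drill', 'is', 'not', 'this', 'will', 'you']
print(vectors)  # [[0, 1, 0, 0, 1, 1, 1, 1], [1, 0, 1, 1, 1, 1, 0, 0]]
```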

In [None]:
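#Encode each headline as a list of integer word indices (one_hot hashes words, so collisions are possible)
vocab_length = len(unique_words)+5
embedded_sentences = [one_hot(sent, vocab_length) for sent in corpus]
#Pad every sequence with zeros to the length of the longest headline
length_long_sentence = max(len(seq) for seq in embedded_sentences)
padded_sentences = pad_sequences(embedded_sentences, maxlen=length_long_sentence, padding='post')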
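
The split and the model below expect every encoded headline to have the same length. A minimal plain-Python sketch of zero-padding at the end of each sequence is shown here; `pad_post` is a hypothetical helper written for illustration, and the word indices are made up. In the notebook itself this job is done by Keras `pad_sequences`.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value`; longer ones keep their first `maxlen` items."""
    return [seq[:maxlen] + [value] * (maxlen - len(seq)) for seq in sequences]


encoded = [[4, 9], [7, 2, 5, 1], [3]]  # made-up word indices
longest = max(len(seq) for seq in encoded)
padded = pad_post(encoded, longest)

print(padded)  # [[4, 9, 0, 0], [7, 2, 5, 1], [3, 0, 0, 0]]
```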

In [None]:
#Split data
from sklearn.model_selection import train_test_split
data_train, data_test, labels_train, labels_test = train_test_split(padded_sentences, labels, test_size=0.3)

In [None]:
#Create model
model = Sequential()
model.add(Embedding(vocab_length, 20, input_length=length_long_sentence))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
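
To get a feel for the size of this architecture, the trainable parameters can be counted by hand: the embedding layer holds one 20-dimensional vector per vocabulary entry, `Flatten` has no parameters, and the dense layer has one weight per flattened value plus a bias. The vocabulary size and sequence length below are assumed example values, not the ones from the dataset.

```python
def count_parameters(vocab_length, sequence_length, embedding_dim=20):
    """Return (embedding, dense, total) trainable parameter counts for the model above."""
    embedding = vocab_length * embedding_dim       # one vector per vocabulary index
    dense = sequence_length * embedding_dim + 1    # weights on the flattened output plus one bias
    return embedding, dense, embedding + dense


# Assumed example values: 10,000 vocabulary entries, headlines padded to 25 tokens.
emb, dense, total = count_parameters(vocab_length=10_000, sequence_length=25)

print(emb, dense, total)  # 200000 501 200501
```

Calling `model.summary()` on the real model should report the counts for the actual vocabulary size and sequence length.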

In [None]:
#Compile model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
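
The `binary_crossentropy` loss penalises confident wrong predictions more than uncertain ones. A small plain-Python sketch of the formula, averaged over a batch, is given below. The labels and probabilities are made up, and the clipping constant `eps` is an assumption to avoid `log(0)`.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


# Made-up labels and predicted clickbait probabilities.
loss = binary_crossentropy([1, 0, 1], [0.9, 0.2, 0.6])
print(round(loss, 4))
```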

In [None]:
#Fit model
model.fit(data_train, labels_train, epochs=5, verbose=1)

In [None]:
#Evaluate accuracy on the test set
loss, accuracy = model.evaluate(data_test, labels_test, verbose=0, batch_size=10)
print('Accuracy: %f' % (accuracy*100))
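
The reported accuracy comes from turning the sigmoid outputs into 0/1 predictions and comparing them with the labels. A minimal sketch of that calculation follows, assuming a threshold of 0.5 and using made-up labels and probabilities.

```python
def accuracy_from_probabilities(y_true, y_prob, threshold=0.5):
    """Share of samples whose thresholded prediction matches the label."""
    predictions = [1 if p > threshold else 0 for p in y_prob]
    correct = sum(int(pred == y) for pred, y in zip(predictions, y_true))
    return correct / len(y_true)


# Made-up labels and predicted clickbait probabilities.
acc = accuracy_from_probabilities([1, 0, 1, 0], [0.9, 0.4, 0.3, 0.1])
print('Accuracy: %f' % (acc * 100))  # Accuracy: 75.000000
```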