**Long Short-Term Memory for Clickbait Detection in ClickbaitTR**

---

The data comes from the preprocessing stage. As explained in the paper, two types of input are used to detect a clickbait tweet. The first is the tweet text itself; the second consists of special-character counts and text-based features such as uppercase count, word length, and tweet length.
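To make the second input type concrete, the following is a minimal sketch of how such a feature vector could be computed for a single tweet. The helper `text_features` and the exact choice of nine features are assumptions made for illustration; the features actually produced by `pr.generatesample` are not shown in this notebook and may differ.

```python
# Hypothetical helper for illustration only; not part of the preprocessing module.
SPECIAL_CHARACTERS = ["#", "?", "!", ".", "@"]

def text_features(tweet, special_characters=SPECIAL_CHARACTERS):
    """Return a flat list of hand-crafted features for one tweet."""
    words = tweet.split()
    # One count per special character, in the order given.
    char_counts = [tweet.count(ch) for ch in special_characters]
    # Number of uppercase letters (Unicode-aware, so Turkish letters count).
    uppercase_count = sum(1 for ch in tweet if ch.isupper())
    word_count = len(words)
    mean_word_length = sum(len(w) for w in words) / word_count if words else 0.0
    tweet_length = len(tweet)
    return char_counts + [uppercase_count, word_count, mean_word_length, tweet_length]

features = text_features("Bunu GÖRMEDEN geçmeyin! #şok")
print(features)
```

With five special characters and four text statistics, this sketch yields nine values per tweet, which happens to match the `(9,)` shape of the second model input defined later; that correspondence is an assumption, not something the notebook states.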



In [0]:
!git clone -v https://github.com/ahmetax/kalbur.git 

Cloning into 'kalbur'...
POST git-upload-pack (165 bytes)
remote: Enumerating objects: 205, done.
remote: Total 205 (delta 0), reused 0 (delta 0), pack-reused 205
Receiving objects: 100% (205/205), 1.24 MiB | 1.15 MiB/s, done.
Resolving deltas: 100% (106/106), done.


In [0]:
import preprocessing as pr
import sys

pr.current_path = "/content/"

with open(pr.current_path + 'kalbur/kelime_bol.py', 'r') as file :
  filedata = file.read()

filedata = filedata.replace('veri/', pr.current_path + "kalbur/veri/")

with open(pr.current_path + 'kalbur/kelime_bol.py', 'w') as file:
  file.write(filedata)

sys.path.append(pr.current_path + "kalbur/")

import kelime_bol as kb

In [0]:
no_of_samples = 1000 # toy example

csv_files = {"limon":"dataset/limon_clickbait.csv",
             "evrensel":"dataset/evrensel_non-clickbait.csv",
             "spoiler":"dataset/spoiler_clickbait.csv",
             "diken":"dataset/diken_non-clickbait.csv"}

clickbait, non_clickbait = pr.return_data(csv_files)

special_characters = ["#", "?", "!", ".", "@"]

words_will_be_removed = ["işçi", "eylem", "meteoroloji", "katliam", 
                          "murat", "altı", "seçim", "diren", "dev", 
                          "gazze", "blog", "protesto", "beş", 
                          "yaşam", "manşet", "günaydın", "türkiye", 
                          "sınır","chp", "grev", "yaralı", "ateşkes", "yazı", "maden", "bayi"]

X_train, Xsc_train, Y_train, X_test, Xsc_test, Y_test, unique_word_list = pr.generatesample(clickbait[:no_of_samples], non_clickbait[:no_of_samples], 
                                                                   special_characters, words_will_be_removed, 
                                                                   isseparate=True, scaling=True)

The LSTM network takes two inputs: the preprocessed full text of each tweet, and a feature vector describing special characters, uppercase letters, and the number of letters and words.
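Before the text can reach an embedding layer, each word has to become an integer index. The sketch below shows one common way to do this, assuming index 0 is reserved for padding; `build_vocabulary` and `encode_and_pad` are hypothetical names, and the encoding actually performed inside `pr.generatesample` may differ.

```python
# Hypothetical helpers for illustration; the notebook's own encoding lives in
# pr.generatesample and is not shown here.
def build_vocabulary(tokenized_tweets):
    """Map each distinct word to a positive integer; 0 is reserved for padding."""
    vocabulary = {}
    for tokens in tokenized_tweets:
        for token in tokens:
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary) + 1
    return vocabulary

def encode_and_pad(tokenized_tweets, vocabulary):
    """Replace words with their indices and left-pad with 0 to a common length."""
    max_len = max(len(tokens) for tokens in tokenized_tweets)
    encoded = []
    for tokens in tokenized_tweets:
        ids = [vocabulary[token] for token in tokens]
        encoded.append([0] * (max_len - len(ids)) + ids)
    return encoded

tweets = [["bunu", "görmeden", "geçmeyin"], ["seçim", "sonuçları"]]
vocab = build_vocabulary(tweets)
X = encode_and_pad(tweets, vocab)
print(X)  # [[1, 2, 3], [0, 4, 5]]
```

Under this scheme the largest index equals the number of distinct words, so an embedding layer needs one more row than the vocabulary has entries, which is one plausible reading of the `+ 1` in the embedding's `input_dim`.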

In [0]:
from keras.layers import Input, Dense, Embedding, LSTM, concatenate
from keras.models import Model
from keras.utils import plot_model
from keras import optimizers
from keras.callbacks import EarlyStopping
import matplotlib.pyplot as plt

es = EarlyStopping(monitor='val_accuracy', mode='max', verbose=1, patience=2)

# Number of distinct words in the data (a vocabulary size, not a tweet length)
vocab_size = len(unique_word_list)
text_input = Input(shape=(None,), name='text_input')
# The embedding table gets one more row than there are words in the list
x = Embedding(input_dim=vocab_size + 1, output_dim=3)(text_input)
lstm_out = LSTM(2)(x)

# Named special_input so the special_characters list defined earlier is not overwritten
special_input = Input(shape=(9, ), name='special_input')
b = Dense(2)(special_input)
merged = concatenate([lstm_out, b])

output = Dense(1, activation='sigmoid', name='output')(merged)

model = Model(inputs=[text_input, special_input], outputs=output)
adam = optimizers.Adam(lr=0.005, beta_1=0.9, beta_2=0.999, amsgrad=False)
model.compile(loss='binary_crossentropy',
              optimizer=adam,
              metrics=['accuracy'])
model.summary()

plot_model(model, to_file='model.png')

# validation_data is passed as an (inputs, targets) tuple
history = model.fit({'text_input': X_train, 'special_input': Xsc_train},
                    {'output': Y_train},
                    validation_data=({'text_input': X_test, 'special_input': Xsc_test},
                                     {'output': Y_test}),
                    epochs=32,
                    batch_size=128,
                    callbacks=[es])

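# Plot training & validation accuracy values
csfont = {'fontname':'Arial'}
plt.figure(figsize=(20, 15))
plt.rcParams['font.size'] = 30
# History keys match the 'val_accuracy' name monitored by EarlyStopping
# (older Keras releases use 'acc' / 'val_acc' instead)
plt.plot(history.history['accuracy'], linewidth=4.0)
plt.plot(history.history['val_accuracy'], linewidth=4.0)
plt.title('Model accuracy', **csfont)
plt.ylabel('Accuracy', **csfont)
plt.xlabel('Epoch', **csfont)
plt.legend(['Train', 'Test'], loc='upper left')
# Save before show(): once the figure has been shown, savefig may write an empty image
plt.savefig('LSTM_graph.png', bbox_inches = 'tight')
plt.show()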

# Plot training & validation loss values
plt.figure(figsize=(20, 15))
plt.plot(history.history['loss'], linewidth=4.0)
plt.plot(history.history['val_loss'], linewidth=4.0)
plt.title('Model loss', **csfont)
plt.ylabel('Loss', **csfont)
plt.xlabel('Epoch', **csfont)
plt.legend(['Train', 'Test'], loc='upper left')
# Save before show(): once the figure has been shown, savefig may write an empty image
plt.savefig('LSTM_loss_graph.png', bbox_inches = 'tight')
plt.show()