# Davide Vaccari
[dv2438]

## Covid Tweet Misinformation Classification Hackathon






# Report Description

This Python notebook builds machine learning models that distinguish tweets containing real information about COVID-19 from tweets containing false information. It compares a simple embedding baseline with models built on Conv1D, SimpleRNN, LSTM, and GRU layers.

This notebook is also available on [GitHub](link).

## Citation
The dataset used in this report comes from the following paper:  

*Shahi, Gautam Kishore, Anne Dirkson, and Tim A. Majchrzak.*  
**"An exploratory study of covid-19 misinformation on twitter."**  
Online Social Networks and Media 22 (2021): 100104.

In [4]:
import os
import tensorflow as tf
import re

In [5]:
! pip install delayed



## Import Data

In [6]:
#Source:Fighting an Infodemic: COVID-19 Fake News Dataset, https://github.com/diptamath/covid_fake_news,https://arxiv.org/abs/2011.03327 

import pandas as pd
trainingdata=pd.read_csv("https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv", usecols = ['tweet','label'])
testdata=pd.read_csv("https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/english_test_with_labels.csv", usecols = ['tweet','label'])

trainingdata

Unnamed: 0,tweet,label
0,The CDC currently reports 99031 deaths. In gen...,real
1,States reported 1121 deaths a small rise from ...,real
2,Politically Correct Woman (Almost) Uses Pandem...,fake
3,#IndiaFightsCorona: We have 1524 #COVID testin...,real
4,Populous states can generate large case counts...,real
...,...,...
6415,A tiger tested positive for COVID-19 please st...,fake
6416,???Autopsies prove that COVID-19 is??� a blood...,fake
6417,_A post claims a COVID-19 vaccine has already ...,fake
6418,Aamir Khan Donate 250 Cr. In PM Relief Cares Fund,fake


### Examples of tweets containing real information

In [7]:
trainingdata.tweet[0]

'The CDC currently reports 99031 deaths. In general the discrepancies in death counts between different sources are small and explicable. The death toll stands at roughly 100000 people today.'

In [8]:
# trainingdata.label[trainingdata.label.eq('real')].sample(3).index

In [9]:
print("Example tweet with real news 1: ", trainingdata.tweet[5476])
print("Example tweet with real news 2: ", trainingdata.tweet[1959])
print("Example tweet with real news 3: ", trainingdata.tweet[4364])

Example tweet with real news 1:  Survey of people with COVID-19 in Colorado finds half reported close contact with someone with symptoms of or lab-confirmed COVID-19 in the 14 days before showing symptoms. If you have #COVID19 symptoms stay home and avoid contact with others. @CDCMMWR https://t.co/urssmS60ac https://t.co/EcoSyr7Mo3
Example tweet with real news 2:  A #coronavirus testing centre in Kent has been shut - reportedly to make way for a lorry park in the run up to the next big Brexit deadline https://t.co/FaQTTuRLLK
Example tweet with real news 3:  “It’s all our individual responsibility to assess the risks for ourselves.” Professor Karol Sikora says we “should carry on as we go” to find the right balance in tackling the spread of #coronavirus. Get the latest on #COVID19: https://t.co/N4ZDlImWKK https://t.co/RecbKF4Y1Y


### Examples of tweets containing fake news

In [10]:
trainingdata.tweet[6416]

'???Autopsies prove that COVID-19 is??� a blood clot, not pneumonia, ???and ought to be fought with antibiotics??� and the whole world has been wrong in treating the ???so-called??� pandemic.'

In [11]:
# trainingdata.label[trainingdata.label.eq('fake')].sample(3).index

In [12]:
print("Example fake news 1: ", trainingdata.tweet[1694])
print("Example fake news 2: ", trainingdata.tweet[5335])
print("Example fake news 3: ", trainingdata.tweet[2165])

Example fake news 1:  Devastated stock brokers with faces in hands clearly not taking coronavirus precautions seriously
#coronavirus #COVID19 #StockMarketCrash2020 
https://t.co/NtCvhtQ1yT https://t.co/DqddhnetiM
Example fake news 2:  Video of Chinese Prime Minister Li Keqiang visiting a mosque and offering prayers seeking protection to the country from coronavirus.
Example fake news 3:  An old man infected 300 people with coronavirus in Jaipur India through tobacco smoke.


In [13]:
trainingdata['label'].value_counts()

real    3360
fake    3060
Name: label, dtype: int64

In [14]:
testdata['label'].value_counts()

real    1120
fake    1020
Name: label, dtype: int64

## Helper functions

In [15]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

def remove_stopwords(input_text):

    stopwords_list = stopwords.words('english')
    # Some words which might indicate a certain sentiment are kept via a whitelist
    whitelist = ["n't", "not", "no"]
    words = input_text.split() 
    clean_words = [word for word in words if (word not in stopwords_list or word in whitelist) and len(word) > 1] 
    return " ".join(clean_words) 
    
def remove_mentions(input_text):

    return re.sub(r"@\w+", "", input_text) 
    
def remove_links(input_text):
    
    return re.sub(r"http\S+", "", input_text)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [16]:
trainingdata['tweet'] = trainingdata['tweet'].apply(remove_stopwords).apply(remove_mentions).apply(remove_links)
testdata['tweet'] = testdata['tweet'].apply(remove_stopwords).apply(remove_mentions).apply(remove_links)

I decided to remove stopwords, mentions, and links from the tweets. This may backfire: it discards information, and some fake tweets may be recognizable precisely by their filler words.
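One side effect shows up in the cleaned examples: capitalized words such as "If" and "An" survive, because NLTK's English stopword list is lowercase and the membership test in `remove_stopwords` is case-sensitive. Below is a minimal sketch of a case-insensitive variant. It is not what the notebook ran: `clean_tweet` is a hypothetical helper, and `STOPWORDS` is a tiny hand-written stand-in for the NLTK list so the snippet runs without downloads.

```python
import re

# Stand-in for nltk's stopwords.words('english'); illustrative subset only.
STOPWORDS = {"the", "a", "an", "if", "in", "of", "with", "to", "and", "is"}
WHITELIST = {"n't", "not", "no"}  # negations are kept, as in the notebook


def clean_tweet(text):
    """Strip mentions and links, then drop stopwords case-insensitively."""
    text = re.sub(r"@\w+", "", text)     # remove @mentions
    text = re.sub(r"http\S+", "", text)  # remove links
    kept = [
        word for word in text.split()
        if (word.lower() not in STOPWORDS or word.lower() in WHITELIST)
        and len(word) > 1
    ]
    return " ".join(kept)


print(clean_tweet("An old man infected 300 people with coronavirus @CDCgov https://t.co/x"))
```

Whether this helps the classifier is an open question; capitalization may itself carry signal, so it is worth testing rather than assuming.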

Let's see now the previous tweets again.

In [17]:
print("Example tweet with real news 1: ", trainingdata.tweet[5476])
print("Example tweet with real news 2: ", trainingdata.tweet[1959])
print("Example tweet with real news 3: ", trainingdata.tweet[4364])

Example tweet with real news 1:  Survey people COVID-19 Colorado finds half reported close contact someone symptoms lab-confirmed COVID-19 14 days showing symptoms. If #COVID19 symptoms stay home avoid contact others.   
Example tweet with real news 2:  #coronavirus testing centre Kent shut reportedly make way lorry park run next big Brexit deadline 
Example tweet with real news 3:  “It’s individual responsibility assess risks ourselves.” Professor Karol Sikora says “should carry go” find right balance tackling spread #coronavirus. Get latest #COVID19:  


In [18]:
print("Example fake news 1: ", trainingdata.tweet[1694])
print("Example fake news 2: ", trainingdata.tweet[5335])
print("Example fake news 3: ", trainingdata.tweet[2165])

Example fake news 1:  Devastated stock brokers faces hands clearly not taking coronavirus precautions seriously #coronavirus #COVID19 #StockMarketCrash2020  
Example fake news 2:  Video Chinese Prime Minister Li Keqiang visiting mosque offering prayers seeking protection country coronavirus.
Example fake news 3:  An old man infected 300 people coronavirus Jaipur India tobacco smoke.


Perfect, text has been cleaned.

## Discussion
The training dataset contains 3360 tweets containing real information and 3060 tweets containing fake news.  

Fake news has affected almost every area of public life, most famously the American elections. During the coronavirus pandemic its effects could be even more serious, since it can convince people to put their own lives, or those of their families, directly at risk. A model that can automatically flag tweets containing fake news is therefore a useful and important tool. Twitter and Facebook have both built models that interpret posts about the coronavirus pandemic and possibly censor them.

## Define Preprocessor

In [19]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Build vocabulary from training text data
tokenizer = Tokenizer(num_words=10000) #, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{"}~\t\n', lower=True, char_level=False, split=' ')
# I've tried to use these in my tokenizers, but the results got worse. It's likely that weird symbols might be an indicator of a tweet being fake.

tokenizer.fit_on_texts(trainingdata.tweet)

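# preprocessor tokenizes words and makes sure all documents have the same length
def preprocessor(data, maxlen, max_words):
    # max_words is unused: the vocabulary cap is set by num_words on the tokenizer above

    # convert each tweet to a list of word indices
    sequences = tokenizer.texts_to_sequences(data)

    # pad (or truncate) every sequence to exactly maxlen indices
    X = pad_sequences(sequences, maxlen=maxlen)

    return X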
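To make the two steps concrete, here is a pure-Python sketch of what the tokenizer and `pad_sequences` do with their default settings, as I understand them: words outside the vocabulary are silently dropped (no `oov_token` is set), and sequences are left-padded with zeros or truncated from the front. The helper names and the toy vocabulary are made up for illustration, and the real `Tokenizer` also strips punctuation, which this sketch skips.

```python
def texts_to_ids(text, word_index, num_words):
    """Map words to ids, dropping unknown words and ids >= num_words."""
    ids = []
    for word in text.lower().split():
        idx = word_index.get(word)
        if idx is not None and idx < num_words:
            ids.append(idx)
    return ids


def pad_pre(ids, maxlen):
    """Left-pad with zeros, or keep only the last `maxlen` ids."""
    ids = ids[-maxlen:]
    return [0] * (maxlen - len(ids)) + ids


word_index = {"covid": 1, "blood": 2, "virus": 3}  # toy vocabulary
ids = texts_to_ids("Autopsies prove covid blood virus", word_index, num_words=10000)
print(ids)                     # unknown words are gone
print(pad_pre(ids, maxlen=6))  # zeros on the left
```

This is why the encoded tweets printed later in the notebook start with long runs of zeros.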

## Prepare Train and Test Data

In [20]:
maxlen = 100

In [21]:
# tokenize and pad X data
X_train = preprocessor(trainingdata.tweet, maxlen=maxlen, max_words=10000)
X_test = preprocessor(testdata.tweet, maxlen=maxlen, max_words=10000)

# one-hot encode the labels (columns are ordered alphabetically: 'fake', 'real')
y_train = pd.get_dummies(trainingdata.label)
y_test = pd.get_dummies(testdata.label)

In [22]:
print(X_train.shape)
print(X_test.shape)

(6420, 100)
(2140, 100)


# 1. Simple Base Model

Let's start with a simple base model.

In [23]:
from tensorflow.keras.layers import Dense, Embedding,Flatten
from tensorflow.keras.models import Sequential
from tensorflow.keras import regularizers, optimizers, initializers

# replace this model with the architectures from the task description
model = Sequential()
model.add(Embedding(10000, 16, input_length=maxlen, embeddings_initializer='RandomUniform'))
model.add(Flatten())
model.add(Dense(2, activation='softmax', kernel_initializer='GlorotUniform', bias_initializer='Zeros'))

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 100, 16)           160000    
_________________________________________________________________
flatten (Flatten)            (None, 1600)              0         
_________________________________________________________________
dense (Dense)                (None, 2)                 3202      
Total params: 163,202
Trainable params: 163,202
Non-trainable params: 0
_________________________________________________________________


In [None]:
with tf.device('/device:GPU:0'): # "/GPU:0" is short-hand for the first GPU visible to TensorFlow
  # import all callbacks from the public tf.keras API instead of mixing tensorflow.python.keras and standalone keras
  from tensorflow.keras.callbacks import ReduceLROnPlateau, ModelCheckpoint, EarlyStopping
  
  mc = ModelCheckpoint('best_model_1.h5', monitor='val_acc',mode='max', verbose=1, save_best_only=True) # keep only the weights with the highest val_acc
  red_lr= ReduceLROnPlateau(monitor='val_acc',patience=3,verbose=1,factor=0.5, min_lr=0) # halve the lr when val_acc has not improved for 3 epochs
  es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=25)


  model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

  history = model.fit(X_train, y_train,
                      epochs=50,
                      batch_size=32,
                      validation_split=0.20, callbacks=[mc,red_lr, es])
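For intuition, the learning-rate callback can be approximated by the following pure-Python sketch. It is a simplification of `ReduceLROnPlateau`, not its actual implementation: it ignores `min_delta` and `cooldown`, and the function name, starting rate, and validation accuracies are made up. With `patience=3` and `factor=0.5`, the rate is halved once three consecutive epochs pass without a new best `val_acc`.

```python
def reduce_on_plateau(val_acc_history, lr=1e-3, factor=0.5, patience=3, min_lr=0.0):
    """Return the learning rate in effect after each epoch (simplified)."""
    best = float("-inf")
    wait = 0
    rates = []
    for acc in val_acc_history:
        if acc > best:
            best = acc   # new best: reset the counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                lr = max(lr * factor, min_lr)  # plateau: shrink the rate
                wait = 0
        rates.append(lr)
    return rates


# Made-up validation accuracies: the best value (0.85) is not beaten for 3 epochs.
rates = reduce_on_plateau([0.80, 0.85, 0.84, 0.85, 0.83, 0.86])
print(rates)
```

Note that `EarlyStopping` in the cell above watches `val_loss` with `patience=25`, so within 50 epochs it triggers far less readily than the learning-rate reduction.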

In [25]:
# format y_pred as labels 
model.load_weights('best_model_1.h5')

y_pred = model.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]
print(len(predicted_labels))
predicted_labels[0:5]

2140


['real', 'fake', 'fake', 'real', 'real']

In [26]:
labels_pred = pd.get_dummies(predicted_labels)

In [27]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score

# accuracy: (tp + tn) / (p + n)
accuracy = accuracy_score(y_test, labels_pred)
print('Accuracy: %f' % accuracy)
# precision tp / (tp + fp)
precision = precision_score(y_test, labels_pred, average=None)
print('Precision:', precision)
# recall: tp / (tp + fn)
recall = recall_score(y_test, labels_pred, average=None)
print('Recall:', recall)
# f1: 2 tp / (2 tp + fp + fn)
f1 = f1_score(y_test, labels_pred, average=None)
print('F1 score:', f1)
 
# ROC AUC
auc = roc_auc_score(y_test, labels_pred)
print('ROC AUC: %f' % auc)

Accuracy: 0.929907
Precision: [0.92315175 0.93615108]
Recall: [0.93039216 0.92946429]
F1 score: [0.92675781 0.9327957 ]
ROC AUC: 0.929928
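With `average=None`, scikit-learn returns one value per class, in the column order of the one-hot labels (here `fake` first, then `real`). As a sanity check on the formulas in the comments above, this sketch computes the same three metrics by hand for one class at a time. The function name and the six toy labels are invented for illustration; they are not taken from the test set.

```python
def per_class_metrics(y_true, y_pred, positive):
    """Precision, recall and F1 when `positive` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


y_true = ["real", "fake", "fake", "real", "real", "fake"]
y_pred = ["real", "fake", "real", "real", "fake", "fake"]
for label in ("fake", "real"):
    print(label, per_class_metrics(y_true, y_pred, positive=label))
```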


In [28]:
# How to write a preprocessor function for text preprocessing using keras?

def preprocessor_for_test(textinput, maxlen=maxlen):
        from tensorflow.keras.preprocessing.text import Tokenizer
        from tensorflow.keras.preprocessing.sequence import pad_sequences        

        sequences = tokenizer.texts_to_sequences(textinput) # converts words in each text to each word's numeric index in tokenizer dictionary.

        data = pad_sequences(sequences, maxlen=maxlen)
        return data



#See preprocessor output
print(preprocessor_for_test(["???Autopsies prove that COVID-19 is?? a blood chinese virus"]).shape)
print(preprocessor_for_test(["???Autopsies prove that COVID-19 is?? a blood chinese virus"]))

print(model.predict(preprocessor_for_test(["The CDC currently reports 99031 deaths. In general discrepancies death counts different sources small explicable. The death toll stands roughly 100000 people today."])))
print(model.predict(preprocessor_for_test(["???Autopsies prove that COVID-19 is?? a blood virus"])))

(1, 100)
[[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0 5824 3219  252    1    2  311  257  864
   256   32]]
[[4.2742642e-04 9.9957258e-01]]
[[9.9978405e-01 2.1596857e-04]]


Each row holds two probabilities: the first column is the probability of "fake" and the second the probability of "real", following the order of the one-hot label columns. The two numbers always add up to 1 because the output layer uses "softmax". The model was not trained with a single binary output; instead, "real" and "fake" are treated as two classes, each encoded as 1 or 0. This is why we used "softmax" and "categorical_crossentropy".

As we can see, the model predicts the first sentence as real with about 99.96% confidence (it is real: I took it from the dataset, where it is labelled as real) and the second one as fake with about 99.98% confidence. The second one is a combination of an actual fake tweet and my imagination.
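The following sketch shows, in plain Python, how a two-unit softmax output turns into a label. It assumes the class order produced by `pd.get_dummies`, which sorts the label names alphabetically, so index 0 is `fake` and index 1 is `real`. The logits are made-up numbers, not values from the trained model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


classes = sorted(["real", "fake"])  # alphabetical order, as in the one-hot columns
probs = softmax([-3.0, 4.5])        # hypothetical raw scores for one tweet
predicted = classes[probs.index(max(probs))]
print(classes, probs, predicted)
```

Since the two probabilities are complementary, a single sigmoid unit with binary cross-entropy would express the same decision; the two-column form is simply what this notebook uses throughout.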

# Bonus: External Weights

Let's apply transfer learning and import weights from outside.

### Glove embedding matrix weights

In [None]:
! wget http://nlp.stanford.edu/data/wordvecs/glove.6B.zip

In [30]:
! unzip glove.6B.zip 

Archive:  glove.6B.zip
replace glove.6B.100d.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: N


In [31]:
# Extract embedding data for 100 feature embedding matrix
glove_dir = os.getcwd()

embeddings_index = {}
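# open the GloVe file explicitly as UTF-8 and let the context manager close it
with open(os.path.join(glove_dir, 'glove.6B.100d.txt'), encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]  # first token on each line is the word
        coefs = np.asarray(values[1:], dtype='float32')  # the rest is its 100-d vector
        embeddings_index[word] = coefs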

print('Found %s word vectors.' % len(embeddings_index))

Found 400001 word vectors.


In [32]:
sequences = tokenizer.texts_to_sequences(trainingdata['tweet'])
data = pad_sequences(sequences, maxlen=maxlen)

labels = np.asarray(trainingdata['label'])

In [33]:
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)

Shape of data tensor: (6420, 100)
Shape of label tensor: (6420,)


In [34]:
max_words = 10000  # We will only consider the top 10,000 words in the dataset

# Note: the dataset has only 6420 rows, so slicing from index 10000 yields empty
# arrays. x_val and y_val are not used below; validation data comes from
# validation_split in model.fit instead.
training_samples = 10000
validation_samples = 10000
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]

In [35]:
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

Found 14876 unique tokens.


In [36]:
# Build embedding matrix
embedding_dim = 100 # change if you use txt files using larger number of features

embedding_matrix = np.zeros((max_words, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if i < max_words:
        if embedding_vector is not None:
            # Words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector
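Before trusting pre-trained vectors it is worth checking how much of the tweet vocabulary they actually cover, since every word missing from GloVe keeps an all-zero row. The sketch below repeats the matrix-building logic with plain lists and counts the covered rows. The helper name, the toy vocabulary, and the 2-dimensional vectors are made up for illustration.

```python
def build_embedding_matrix(word_index, vectors, max_words, dim):
    """Row i holds the vector of the word with index i, or zeros if unknown."""
    matrix = [[0.0] * dim for _ in range(max_words)]
    for word, i in word_index.items():
        vec = vectors.get(word)
        if i < max_words and vec is not None:
            matrix[i] = list(vec)
    return matrix


word_index = {"covid": 1, "virus": 2, "zzzunknown": 3, "rare": 4}          # toy tokenizer index
vectors = {"covid": [0.1, 0.2], "virus": [0.3, 0.4], "rare": [0.5, 0.6]}   # toy "GloVe"
matrix = build_embedding_matrix(word_index, vectors, max_words=4, dim=2)

covered = sum(any(row) for row in matrix)
print(f"{covered} of {len(matrix) - 1} vocabulary rows have a pre-trained vector")
```

Row 0 is reserved for padding and always stays zero. With pandemic-era hashtags and terms in the tweets, low coverage is plausible and may be part of why the GloVe model underperforms, but I did not measure it here.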

In [37]:
from tensorflow.keras import layers

model = Sequential()
model.add(Embedding(10000, embedding_dim, input_length=maxlen))
model.add(layers.Conv1D(32, 4, activation='relu')) 
model.add(layers.MaxPooling1D(5)) 
model.add(layers.Conv1D(32, 4, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(Dense(2, activation='softmax', kernel_initializer='GlorotUniform', bias_initializer='Zeros',
                kernel_regularizer=regularizers.l1_l2(l1=1e-5, l2=1e-4),
                bias_regularizer=regularizers.l2(1e-4),
                activity_regularizer=regularizers.l2(1e-5)))

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 100, 100)          1000000   
_________________________________________________________________
conv1d (Conv1D)              (None, 97, 32)            12832     
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 19, 32)            0         
_________________________________________________________________
conv1d_1 (Conv1D)            (None, 16, 32)            4128      
_________________________________________________________________
global_max_pooling1d (Global (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 66        
Total params: 1,017,026
Trainable params: 1,017,026
Non-trainable params: 0
____________________________________________

In [None]:
model.layers[0].set_weights([embedding_matrix])
model.layers[0].trainable = False

# mc = ModelCheckpoint('best_model.h5', monitor='val_acc',mode='max', verbose=1, save_best_only=True)#
red_lr= ReduceLROnPlateau(monitor='val_acc',patience=3,verbose=1,factor=0.5, min_lr=0)
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=15)


model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

history = model.fit(X_train, y_train,
                      epochs=50,
                      batch_size=32,
                      validation_split=0.20, callbacks=[red_lr,es])

model.save_weights('pre_trained_glove_model.h5')

In [39]:
model.load_weights('pre_trained_glove_model.h5')

y_pred = model.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]
labels_pred = pd.get_dummies(predicted_labels)
predicted_labels[0:5]

['real', 'fake', 'fake', 'real', 'fake']

In [40]:
accuracy = accuracy_score(y_test, labels_pred)
print('Accuracy: %f' % accuracy)
precision = precision_score(y_test, labels_pred, average=None)
print('Precision:', precision)
recall = recall_score(y_test, labels_pred, average=None)
print('Recall:', recall)
f1 = f1_score(y_test, labels_pred, average=None)
print('F1 score:', f1)

auc = roc_auc_score(y_test, labels_pred)
print('ROC AUC: %f' % auc)

Accuracy: 0.874766
Precision: [0.86647173 0.88240575]
Recall: [0.87156863 0.87767857]
F1 score: [0.86901271 0.88003581]
ROC AUC: 0.874624


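I used pre-trained GloVe weights hoping to improve the model, but test accuracy actually dropped from 0.930 to 0.875. The comparison is not like-for-like, though: this model puts Conv1D layers on top of frozen embeddings, and its weights are saved after the last epoch rather than at the best validation accuracy.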

In [41]:
print(preprocessor_for_test(["The CDC currently reports 99031 deaths. In general discrepancies death counts different sources small explicable. The death toll stands roughly 100000 people today."]).shape)
print(preprocessor_for_test(["The CDC currently reports 99031 deaths. In general discrepancies death counts different sources small explicable. The death toll stands roughly 100000 people today."]))

print(model.predict(preprocessor_for_test(["The CDC currently reports 99031 deaths. In general discrepancies death counts different sources small explicable. The death toll stands roughly 100000 people today."])))
print(model.predict(preprocessor_for_test(["???Autopsies prove that COVID-19 is?? a blood chinese virus"])))

(1, 100)
[[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0    0    0    0    0    0    7   97  231  218 6985   10   53
   617 6986   86 1204  407 1718  669 6987    7   86  639  530 1719  888
     8   23]]
[[1.3508716e-05 9.9998653e-01]]
[[9.9999988e-01 1.7527901e-07]]


Nonetheless, this model appears to be quite certain about the labels of these two tweets and it gets both right.

# 2. A model with Convolutional 1D layers


In [42]:
model = Sequential()
model.add(layers.Embedding(10000, 16, input_length=maxlen))
model.add(layers.Conv1D(32, 4, activation='relu')) 
model.add(layers.MaxPooling1D(5)) 
model.add(layers.Conv1D(32, 4, activation='relu'))
model.add(layers.GlobalMaxPooling1D())
model.add(layers.Dense(2, activation="softmax"))

model.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 100, 16)           160000    
_________________________________________________________________
conv1d_2 (Conv1D)            (None, 97, 32)            2080      
_________________________________________________________________
max_pooling1d_1 (MaxPooling1 (None, 19, 32)            0         
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 16, 32)            4128      
_________________________________________________________________
global_max_pooling1d_1 (Glob (None, 32)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 66        
Total params: 166,274
Trainable params: 166,274
Non-trainable params: 0
________________________________________________
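The output shapes and parameter counts in this summary can be reproduced with a few lines of arithmetic. The sketch assumes the Keras defaults used above: `'valid'` padding and stride 1 for `Conv1D`, and a pooling stride equal to the pool size for `MaxPooling1D`. The function names are my own.

```python
def conv1d_out_len(length, kernel_size):
    """Sequence length after a 'valid' Conv1D with stride 1."""
    return length - kernel_size + 1


def maxpool1d_out_len(length, pool_size):
    """Sequence length after MaxPooling1D with stride equal to pool_size."""
    return length // pool_size


def conv1d_params(in_channels, filters, kernel_size):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters


length = conv1d_out_len(100, 4)        # first Conv1D
length = maxpool1d_out_len(length, 5)  # MaxPooling1D(5)
length = conv1d_out_len(length, 4)     # second Conv1D
print(length, conv1d_params(16, 32, 4), conv1d_params(32, 32, 4))
```

The same formula explains the larger first convolution in the GloVe model: with 100-dimensional embeddings, `conv1d_params(100, 32, 4)` gives 12,832.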

In [None]:
from tensorflow.keras.optimizers import RMSprop

model.compile(optimizer=RMSprop(learning_rate=1e-4),  # `lr` is a deprecated alias of `learning_rate`
              loss='categorical_crossentropy',
              metrics=['acc'])

mc = ModelCheckpoint('best_model_2.h5', monitor='val_acc',mode='max', verbose=1, save_best_only=True) # evaluating val_acc maximization
red_lr= ReduceLROnPlateau(monitor='val_acc',patience=3,verbose=1,factor=0.5, min_lr=0) # halve the lr when val_acc has not improved for 3 epochs
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=25)

history = model.fit(X_train, y_train,
                    epochs=50,
                    batch_size=32,
                    validation_split=0.2, callbacks=[mc, red_lr,es])

In [44]:
model.load_weights('best_model_2.h5')

y_pred = model.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]
labels_pred = pd.get_dummies(predicted_labels)
predicted_labels[0:5]

['real', 'fake', 'fake', 'real', 'real']

In [45]:
accuracy = accuracy_score(y_test, labels_pred)
print('Accuracy: %f' % accuracy)
precision = precision_score(y_test, labels_pred, average=None)
print('Precision:', precision)
recall = recall_score(y_test, labels_pred, average=None)
print('Recall:', recall)
f1 = f1_score(y_test, labels_pred, average=None)
print('F1 score:', f1)

auc = roc_auc_score(y_test, labels_pred)
print('ROC AUC: %f' % auc)

Accuracy: 0.908411
Precision: [0.90155945 0.91472172]
Recall: [0.90686275 0.90982143]
F1 score: [0.90420332 0.912265  ]
ROC AUC: 0.908342


A solid model with a reasonable number of parameters. At 0.908 test accuracy it beats the GloVe variant (0.875) but does not reach the base model (0.930).

In [46]:
print(model.predict(preprocessor_for_test(["The CDC currently reports 99031 deaths. In general discrepancies death counts different sources small explicable. The death toll stands roughly 100000 people today."])))
print(model.predict(preprocessor_for_test(["???Autopsies prove that COVID-19 is?? a blood chinese virus"])))

[[0.01222894 0.98777103]]
[[0.98807424 0.01192572]]


# Bonus: Simple RNN

In [47]:
from tensorflow.keras.layers import SimpleRNN, LSTM, GRU

model = Sequential()
model.add(Embedding(10000, 32, input_length=maxlen))
model.add(SimpleRNN(32))
model.add(Dense(2, activation='softmax'))

model.summary()

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 100, 32)           320000    
_________________________________________________________________
simple_rnn (SimpleRNN)       (None, 32)                2080      
_________________________________________________________________
dense_3 (Dense)              (None, 2)                 66        
Total params: 322,146
Trainable params: 322,146
Non-trainable params: 0
_________________________________________________________________


In [None]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

mc = ModelCheckpoint('best_model_bonus2.h5', monitor='val_acc',mode='max', verbose=1, save_best_only=True) # evaluating val_acc maximization
red_lr= ReduceLROnPlateau(monitor='val_acc',patience=3,verbose=1,factor=0.5, min_lr=0) # halve the lr when val_acc has not improved for 3 epochs
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=25)

history = model.fit(X_train, y_train,
                    epochs=50,
                    batch_size=32,
                    validation_split=0.2, callbacks=[mc, red_lr,es])

In [49]:
model.load_weights('best_model_bonus2.h5')

y_pred = model.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]
labels_pred = pd.get_dummies(predicted_labels)

In [50]:
accuracy = accuracy_score(y_test, labels_pred)
print('Accuracy: %f' % accuracy)
precision = precision_score(y_test, labels_pred, average=None)
print('Precision:', precision)
recall = recall_score(y_test, labels_pred, average=None)
print('Recall:', recall)
f1 = f1_score(y_test, labels_pred, average=None)
print('F1 score:', f1)

auc = roc_auc_score(y_test, labels_pred)
print('ROC AUC: %f' % auc)

Accuracy: 0.917290
Precision: [0.89282386 0.94189316]
Recall: [0.93921569 0.89732143]
F1 score: [0.91543239 0.91906722]
ROC AUC: 0.918269
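The ROC AUC values in this notebook are computed from hard 0/1 predictions rather than probabilities. In that case the ROC curve has a single operating point, and the AUC reduces to the mean of the two per-class recalls (the balanced accuracy). The sketch below checks this against the recalls and AUC values printed so far; the function name and the model keys are my own shorthand.

```python
def auc_from_hard_labels(recall_fake, recall_real):
    """AUC of a binary classifier evaluated only at its argmax decision."""
    return (recall_fake + recall_real) / 2


# (recall per class, reported ROC AUC), copied from the outputs above
reported = {
    "base": ((0.93039216, 0.92946429), 0.929928),
    "glove_cnn": ((0.87156863, 0.87767857), 0.874624),
    "conv1d": ((0.90686275, 0.90982143), 0.908342),
    "simple_rnn": ((0.93921569, 0.89732143), 0.918269),
}
for name, (recalls, auc) in reported.items():
    print(name, round(auc_from_hard_labels(*recalls), 6), auc)
```

A threshold-free AUC would need the predicted probabilities instead, for example the "real" column of `model.predict(X_test)` scored against the "real" column of `y_test`; I have not rerun the metrics that way here.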


In [51]:
print(model.predict(preprocessor_for_test(["The CDC currently reports 99031 deaths. In general discrepancies death counts different sources small explicable. The death toll stands roughly 100000 people today."])))
print(model.predict(preprocessor_for_test(["???Autopsies prove that COVID-19 is?? a blood chinese virus"])))

[[0.00120524 0.9987948 ]]
[[9.996692e-01 3.307948e-04]]


A decent model: it reaches 0.917 test accuracy and gets both example tweets right.

# 3. One sequential layer (LSTM or GRU)

### Let's start with **LSTM**

In [52]:
model = Sequential()
model.add(Embedding(10000, 32, input_length=maxlen))
model.add(LSTM(32))
model.add(Dense(2, activation='softmax'))

model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 100, 32)           320000    
_________________________________________________________________
lstm (LSTM)                  (None, 32)                8320      
_________________________________________________________________
dense_4 (Dense)              (None, 2)                 66        
Total params: 328,386
Trainable params: 328,386
Non-trainable params: 0
_________________________________________________________________


In [None]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

mc = ModelCheckpoint('best_model_3.h5', monitor='val_acc',mode='max', verbose=1, save_best_only=True) # evaluating val_acc maximization
red_lr= ReduceLROnPlateau(monitor='val_acc',patience=3,verbose=1,factor=0.5, min_lr=0) # halve the lr when val_acc has not improved for 3 epochs
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=15)

history = model.fit(X_train, y_train,
                    epochs=50,
                    batch_size=32,
                    validation_split=0.2, callbacks=[mc, red_lr,es])

In [54]:
model.load_weights('best_model_3.h5')

y_pred = model.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]
labels_pred = pd.get_dummies(predicted_labels)

In [55]:
accuracy = accuracy_score(y_test, labels_pred)
print('Accuracy: %f' % accuracy)
precision = precision_score(y_test, labels_pred, average=None)
print('Precision:', precision)
recall = recall_score(y_test, labels_pred, average=None)
print('Recall:', recall)
f1 = f1_score(y_test, labels_pred, average=None)
print('F1 score:', f1)

auc = roc_auc_score(y_test, labels_pred)
print('ROC AUC: %f' % auc)

Accuracy: 0.928505
Precision: [0.9172281  0.93914623]
Recall: [0.93431373 0.92321429]
F1 score: [0.92569208 0.93111211]
ROC AUC: 0.928764


In [56]:
def tweet_stats(tweet):
  # Predict once and reuse the result; column 0 is P(fake), column 1 is P(real)
  probs = model.predict(preprocessor_for_test([tweet]))

  print("Tweet:", tweet)

  print("Tweet chances:", probs)
    
  if probs[0][0] > 0.5:
    print((probs[0][0] * 100), "% CHANCE OF BEING FAKE NEEEWS")
  else:
    print("It's telling the truth")

In [57]:
tweet = "The CDC currently reports 99031 deaths. In general discrepancies death counts different sources small explicable. The death toll stands roughly 100000 people today."

tweet_stats(tweet)

Tweet: The CDC currently reports 99031 deaths. In general discrepancies death counts different sources small explicable. The death toll stands roughly 100000 people today.
Tweet chances: [[0.01990921 0.9800908 ]]
It's telling the truth


In [58]:
tweet2 = "???Autopsies prove that COVID-19 is?? a blood chinese virus"

tweet_stats(tweet2)

Tweet: ???Autopsies prove that COVID-19 is?? a blood chinese virus
Tweet chances: [[9.9977428e-01 2.2568514e-04]]
99.97742772102356 % CHANCE OF BEING FAKE NEEEWS


This is my best recurrent model so far: at 0.9285 test accuracy it essentially matches the base model (0.9299).

### Let's try now **GRU**

In [59]:
model = Sequential()
model.add(Embedding(10000, 32, input_length=maxlen))
model.add(GRU(32))
model.add(Dense(2, activation='softmax'))

model.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (None, 100, 32)           320000    
_________________________________________________________________
gru (GRU)                    (None, 32)                6336      
_________________________________________________________________
dense_5 (Dense)              (None, 2)                 66        
Total params: 326,402
Trainable params: 326,402
Non-trainable params: 0
_________________________________________________________________
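The recurrent-layer parameter counts in the three summaries follow from the number of gates in each cell. A small sketch of the arithmetic, with function names of my own choosing; the GRU formula assumes the TensorFlow 2 default `reset_after=True`, which keeps two bias vectors per gate.

```python
def simple_rnn_params(input_dim, units):
    """One set of input weights, recurrent weights and biases."""
    return units * (input_dim + units + 1)


def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * units * (input_dim + units + 1)


def gru_params(input_dim, units):
    """Three gates; assumes reset_after=True, i.e. two biases per gate."""
    return 3 * units * (input_dim + units + 2)


print(simple_rnn_params(32, 32), lstm_params(32, 32), gru_params(32, 32))
```

So for the same width, the LSTM carries four times the recurrent parameters of the SimpleRNN and the GRU roughly three times, although the 320,000 embedding weights dominate all three models.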


In [None]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

mc = ModelCheckpoint('best_model_3bis.h5', monitor='val_acc',mode='max', verbose=1, save_best_only=True) # evaluating val_acc maximization
red_lr= ReduceLROnPlateau(monitor='val_acc',patience=3,verbose=1,factor=0.5, min_lr=0) # halve the lr when val_acc has not improved for 3 epochs
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=15)

history = model.fit(X_train, y_train,
                    epochs=50,
                    batch_size=32,
                    validation_split=0.2, callbacks=[mc, red_lr,es])

In [61]:
model.load_weights('best_model_3bis.h5')

y_pred = model.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]
labels_pred = pd.get_dummies(predicted_labels)

In [62]:
accuracy = accuracy_score(y_test, labels_pred)
print('Accuracy: %f' % accuracy)
precision = precision_score(y_test, labels_pred, average=None)
print('Precision:', precision)
recall = recall_score(y_test, labels_pred, average=None)
print('Recall:', recall)
f1 = f1_score(y_test, labels_pred, average=None)
print('F1 score:', f1)

auc = roc_auc_score(y_test, labels_pred)
print('ROC AUC: %f' % auc)

Accuracy: 0.927103
Precision: [0.89416058 0.96168582]
Recall: [0.96078431 0.89642857]
F1 score: [0.92627599 0.92791128]
ROC AUC: 0.928606


Results are very similar to those of the LSTM, only slightly worse.
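
One caveat on the metric cells: `roc_auc_score` receives hard one-hot predictions (`labels_pred`) rather than probabilities. For this two-class problem the reported ROC AUC therefore reduces to the mean of the two recall values (balanced accuracy). A quick check against the GRU numbers printed above:

```python
# Values copied from the GRU evaluation output above
recall = [0.96078431, 0.89642857]
reported_auc = 0.928606

# With hard 0/1 predictions the ROC curve has a single operating point,
# so the area under it equals the average of the per-class recalls.
balanced_accuracy = sum(recall) / len(recall)

print(round(balanced_accuracy, 6))                   # 0.928606
print(abs(balanced_accuracy - reported_auc) < 1e-6)  # True
```

To obtain a threshold-independent AUC, one option would be to pass a probability column instead, for example `roc_auc_score(y_test.iloc[:, 1], model.predict(X_test)[:, 1])`. This assumes the second column corresponds to the `real` class.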

# 4. Stacked Sequential Layers (LSTM or GRU)

In [63]:
model = Sequential()
model.add(Embedding(10000, 32, input_length=maxlen))
model.add(LSTM(64, return_sequences=True, dropout=0.2, recurrent_dropout=0.2))
model.add(LSTM(32, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))
model.add(LSTM(32, return_sequences=True, dropout=0.1, recurrent_dropout=0.1))
model.add(LSTM(16))
model.add(Dense(2, activation='softmax'))

model.summary()

Model: "sequential_6"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 100, 32)           320000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100, 64)           24832     
_________________________________________________________________
lstm_2 (LSTM)                (None, 100, 32)           12416     
_________________________________________________________________
lstm_3 (LSTM)                (None, 100, 32)           8320      
_________________________________________________________________
lstm_4 (LSTM)                (None, 16)                3136      
_________________________________________________________________
dense_6 (Dense)              (None, 2)                 34        
Total params: 368,738
Trainable params: 368,738
Non-trainable params: 0
________________________________________________

In [None]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

mc = ModelCheckpoint('best_model_4.h5', monitor='val_acc',mode='max', verbose=1, save_best_only=True) # evaluating val_acc maximization
red_lr = ReduceLROnPlateau(monitor='val_acc', patience=3, verbose=1, factor=0.5, min_lr=0) # halving lr when val_acc fails to improve for 3 epochs
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=5)

history = model.fit(X_train, y_train,
                    epochs=20,
                    batch_size=32,
                    validation_split=0.2, callbacks=[mc, red_lr,es])

In [65]:
model.load_weights('best_model_4.h5')

y_pred = model.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]
labels_pred = pd.get_dummies(predicted_labels)

In [66]:
accuracy = accuracy_score(y_test, labels_pred)
print('Accuracy: %f' % accuracy)
precision = precision_score(y_test, labels_pred, average=None)
print('Precision:', precision)
recall = recall_score(y_test, labels_pred, average=None)
print('Recall:', recall)
f1 = f1_score(y_test, labels_pred, average=None)
print('F1 score:', f1)

auc = roc_auc_score(y_test, labels_pred)
print('ROC AUC: %f' % auc)

Accuracy: 0.929439
Precision: [0.9402229  0.92020815]
Recall: [0.90980392 0.94732143]
F1 score: [0.92476333 0.93356797]
ROC AUC: 0.928563


Not that great for a model that took so long to train.

# 5. Bidirectional Sequential Layers (LSTM)

In [67]:
# Bidirectional LSTM
from tensorflow.keras.layers import Bidirectional

model = Sequential()
model.add(Embedding(10000, 32, input_length=maxlen))
model.add(Bidirectional(LSTM(32, return_sequences=True, dropout=0.2, recurrent_dropout=0.2)))
model.add(Bidirectional(LSTM(32, dropout=0.1, recurrent_dropout=0.1)))
model.add(Dense(2, activation='softmax'))

model.summary()

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 100, 32)           320000    
_________________________________________________________________
bidirectional (Bidirectional (None, 100, 64)           16640     
_________________________________________________________________
bidirectional_1 (Bidirection (None, 64)                24832     
_________________________________________________________________
dense_7 (Dense)              (None, 2)                 130       
Total params: 361,602
Trainable params: 361,602
Non-trainable params: 0
_________________________________________________________________


In [None]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

mc = ModelCheckpoint('best_model_5.h5', monitor='val_acc',mode='max', verbose=1, save_best_only=True) # evaluating val_acc maximization
red_lr = ReduceLROnPlateau(monitor='val_acc', patience=3, verbose=1, factor=0.5, min_lr=0) # halving lr when val_acc fails to improve for 3 epochs
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=5)

history = model.fit(X_train, y_train,
                    epochs=20,
                    batch_size=32,
                    validation_split=0.2, callbacks=[mc, red_lr,es])

In [69]:
model.load_weights('best_model_5.h5')

y_pred = model.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]
labels_pred = pd.get_dummies(predicted_labels)

In [70]:
accuracy = accuracy_score(y_test, labels_pred)
print('Accuracy: %f' % accuracy)
precision = precision_score(y_test, labels_pred, average=None)
print('Precision:', precision)
recall = recall_score(y_test, labels_pred, average=None)
print('Recall:', recall)
f1 = f1_score(y_test, labels_pred, average=None)
print('F1 score:', f1)

auc = roc_auc_score(y_test, labels_pred)
print('ROC AUC: %f' % auc)

Accuracy: 0.931776
Precision: [0.91382576 0.94926199]
Recall: [0.94607843 0.91875   ]
F1 score: [0.92967245 0.93375681]
ROC AUC: 0.932414


# Trying to create the best model

In [71]:
from tensorflow.keras.layers import Dropout  # same package as the other layers, avoids mixing keras and tf.keras

model = Sequential()
model.add(Embedding(10000, 32, input_length=maxlen, embeddings_initializer='RandomUniform'))
model.add(Dropout(0.2))
model.add(Bidirectional(LSTM(32, return_sequences=True, dropout=0.2, recurrent_dropout=0.2)))
model.add(LSTM(32, return_sequences=True, dropout=0.2, recurrent_dropout=0.2))
model.add(LSTM(32, dropout=0.1, recurrent_dropout=0.1))
model.add(Flatten())
model.add(Dense(8, activation='relu', kernel_initializer='GlorotUniform', bias_initializer='Zeros' ))
model.add(Dense(2, activation='softmax', kernel_initializer='GlorotUniform', bias_initializer='Zeros'))

model.summary()

Model: "sequential_8"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_8 (Embedding)      (None, 100, 32)           320000    
_________________________________________________________________
dropout (Dropout)            (None, 100, 32)           0         
_________________________________________________________________
bidirectional_2 (Bidirection (None, 100, 64)           16640     
_________________________________________________________________
lstm_8 (LSTM)                (None, 100, 32)           12416     
_________________________________________________________________
lstm_9 (LSTM)                (None, 32)                8320      
_________________________________________________________________
flatten_1 (Flatten)          (None, 32)                0         
_________________________________________________________________
dense_8 (Dense)              (None, 8)                

In [None]:
model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

mc = ModelCheckpoint('best_model.h5', monitor='val_acc',mode='max', verbose=1, save_best_only=True) # evaluating val_acc maximization
red_lr = ReduceLROnPlateau(monitor='val_acc', patience=3, verbose=1, factor=0.5, min_lr=0) # halving lr when val_acc fails to improve for 3 epochs
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=5)

history = model.fit(X_train, y_train,
                    epochs=20,
                    batch_size=32,
                    validation_split=0.2, callbacks=[mc, red_lr, es])

In [73]:
model.load_weights('best_model.h5')

y_pred = model.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]
labels_pred = pd.get_dummies(predicted_labels)

In [74]:
accuracy = accuracy_score(y_test, labels_pred)
print('Accuracy: %f' % accuracy)
precision = precision_score(y_test, labels_pred, average=None)
print('Precision:', precision)
recall = recall_score(y_test, labels_pred, average=None)
print('Recall:', recall)
f1 = f1_score(y_test, labels_pred, average=None)
print('F1 score:', f1)

auc = roc_auc_score(y_test, labels_pred)
print('ROC AUC: %f' % auc)

Accuracy: 0.926636
Precision: [0.91371045 0.93892434]
Recall: [0.93431373 0.91964286]
F1 score: [0.92389724 0.92918358]
ROC AUC: 0.926978


Here I tried something fancier than the other models. In particular, I tried to reduce the overfitting that caused problems throughout this notebook. However, this didn't produce the best model.
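
One way to make the overfitting concrete is to look at the gap between training and validation accuracy per epoch. This is a minimal sketch, assuming the `history.history` dictionary returned by `model.fit` with the `'acc'` and `'val_acc'` keys implied by `metrics=['acc']`. The `overfit_gap` helper is hypothetical, and the numbers below are made-up placeholders, not results from this notebook.

```python
def overfit_gap(history_dict):
    """Return per-epoch (train acc - val acc) and the 0-based epoch with the best val acc."""
    train, val = history_dict['acc'], history_dict['val_acc']
    gaps = [round(a - v, 4) for a, v in zip(train, val)]
    best_epoch = max(range(len(val)), key=val.__getitem__)
    return gaps, best_epoch

# Placeholder values for illustration only
example = {'acc': [0.80, 0.93, 0.97, 0.99], 'val_acc': [0.85, 0.92, 0.93, 0.92]}
gaps, best_epoch = overfit_gap(example)
print(gaps)        # [-0.05, 0.01, 0.04, 0.07]
print(best_epoch)  # 2
```

In the notebook this would be called as `overfit_gap(history.history)` after training. A gap that keeps growing after the best epoch suggests the model is memorizing the training set.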

# My best model

In [75]:
model = Sequential()
model.add(Embedding(10000, 32, input_length=maxlen))
model.add(LSTM(32))
model.add(Dense(2, activation='softmax'))

model.summary()

Model: "sequential_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_9 (Embedding)      (None, 100, 32)           320000    
_________________________________________________________________
lstm_10 (LSTM)               (None, 32)                8320      
_________________________________________________________________
dense_10 (Dense)             (None, 2)                 66        
Total params: 328,386
Trainable params: 328,386
Non-trainable params: 0
_________________________________________________________________


In [76]:
model.load_weights('best_model_3.h5')

y_pred = model.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]
labels_pred = pd.get_dummies(predicted_labels)

In [77]:
accuracy = accuracy_score(y_test, labels_pred)
print('Accuracy: %f' % accuracy)
precision = precision_score(y_test, labels_pred, average=None)
print('Precision:', precision)
recall = recall_score(y_test, labels_pred, average=None)
print('Recall:', recall)
f1 = f1_score(y_test, labels_pred, average=None)
print('F1 score:', f1)

auc = roc_auc_score(y_test, labels_pred)
print('ROC AUC: %f' % auc)

Accuracy: 0.928505
Precision: [0.9172281  0.93914623]
Recall: [0.93431373 0.92321429]
F1 score: [0.92569208 0.93111211]
ROC AUC: 0.928764


Overall, this has been my best model. I tried different variations of it (trying different initializers, changing optimizers) but nothing increased the F1 score.

These models have plenty of hyperparameters to tune. The most evident is the number of units per layer, but each layer also has a large number of [arguments](https://keras.io/api/layers/recurrent_layers/lstm/) for adapting the model to its goal.
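
As a sketch of how such a search could be organized, the snippet below enumerates a small grid over three documented `LSTM` arguments (`units`, `dropout`, `recurrent_dropout`). The grid values and the `lstm_grid` name are assumptions chosen for illustration.

```python
from itertools import product

def lstm_grid(units, dropout, recurrent_dropout):
    """Yield one keyword-argument dict per combination of the given values."""
    for u, d, rd in product(units, dropout, recurrent_dropout):
        yield {'units': u, 'dropout': d, 'recurrent_dropout': rd}

# Hypothetical search space: 3 * 2 * 2 = 12 candidate configurations
grid = list(lstm_grid(units=[16, 32, 64],
                      dropout=[0.0, 0.2],
                      recurrent_dropout=[0.0, 0.2]))
print(len(grid))  # 12
print(grid[0])    # {'units': 16, 'dropout': 0.0, 'recurrent_dropout': 0.0}
```

Each dictionary could then be passed to the layer as `LSTM(**params)` inside the same build, compile, and fit cells used above, keeping the configuration with the highest validation accuracy.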

# Submitting the best Model

In [78]:
%%capture
# install aimodelshare library (the %%capture cell magic has to be the first line of the cell)
! pip install aimodelshare --upgrade --extra-index-url https://test.pypi.org/simple/ 

In [79]:
import aimodelshare as ai
from aimodelshare.aimsonnx import model_to_onnx

In [80]:
# save preprocessor
ai.export_preprocessor(preprocessor,"")

In [81]:
# save model in onnx format
onnx_model = model_to_onnx(model, framework='keras',
                          transfer_learning=False,
                          deep_learning=True)

with open("onnx_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())



INFO:tensorflow:Assets written to: /tmp/assets


In [82]:
# set credentials for modeltoapi function 
# make sure you have uploaded your credentials.txt file
from aimodelshare.aws import set_credentials
api_url = "https://wvr23l2z9i.execute-api.us-east-1.amazonaws.com/prod/m"

set_credentials(apiurl=api_url,credential_file="credentials.txt", type="submit_model", manual=False)

AI Model Share login credentials set successfully.
AWS credentials set successfully.


In [84]:
# submit model and predictions to competition
ai.submit_model("onnx_model.onnx",
                api_url,
                prediction_submission=predicted_labels,
                preprocessor="preprocessor.zip")

'Your model has been submitted as model version 82'

In [85]:
# check leaderboard
data=ai.get_leaderboard(api_url, verbose=2)
data = data.loc[~data.iloc[:, 0:8].duplicated(), :].copy() # getting rid of any duplicate model submissions
data = data.fillna(0)
ai.leaderboard.stylize_leaderboard(data)

Unnamed: 0,accuracy,f1_score,precision,recall,ml_framework,transfer_learning,deep_learning,model_type,depth,num_params,bidirectional_layers,conv1d_layers,dense_layers,embedding_layers,flatten_layers,globalmaxpooling1d_layers,lstm_layers,maxpooling1d_layers,simplernn_layers,loss,optimizer,model_config,username,version
0,95.09%,95.09%,95.07%,95.12%,keras,False,True,Sequential,3,161922,0.0,0.0,1,1,1.0,0.0,0.0,0.0,0.0,str,RMSprop,"{'name': 'sequential', 'layers...",hpeters,67
2,95.00%,94.99%,94.97%,95.02%,keras,False,True,Sequential,5,1081482,1.0,0.0,2,1,0.0,0.0,1.0,0.0,0.0,str,RMSprop,"{'name': 'sequential_29', 'lay...",kagenlim,61
3,94.86%,94.85%,94.84%,94.87%,keras,False,True,Sequential,5,1035746,0.0,0.0,2,1,0.0,0.0,2.0,0.0,0.0,str,RMSprop,"{'name': 'sequential_3', 'laye...",kagenlim,19
4,94.77%,94.76%,94.74%,94.78%,keras,False,True,Sequential,9,1313030,0.0,0.0,2,1,1.0,0.0,1.0,0.0,4.0,str,RMSprop,"{'name': 'sequential_1', 'laye...",kka2120,69
5,94.58%,94.57%,94.57%,94.57%,keras,False,True,Sequential,5,1070202,0.0,0.0,2,1,0.0,0.0,2.0,0.0,0.0,str,RMSprop,"{'name': 'sequential_4', 'laye...",kagenlim,60
6,94.49%,94.47%,94.47%,94.48%,keras,False,True,Sequential,3,161282,0.0,0.0,1,1,1.0,0.0,0.0,0.0,0.0,str,RMSprop,"{'name': 'sequential', 'layers...",newusertest,4
7,94.35%,94.34%,94.32%,94.37%,keras,False,True,Sequential,6,148066,0.0,2.0,1,1,1.0,0.0,0.0,1.0,0.0,str,RMSprop,"{'name': 'sequential_72', 'lay...",prajseth,40
8,94.25%,94.24%,94.24%,94.24%,keras,False,True,Sequential,3,98818,0.0,0.0,1,1,0.0,0.0,1.0,0.0,0.0,str,RMSprop,"{'name': 'sequential_78', 'lay...",prajseth,41
9,94.21%,94.19%,94.18%,94.21%,keras,False,True,Sequential,3,402690,0.0,0.0,1,1,0.0,0.0,1.0,0.0,0.0,str,RMSprop,"{'name': 'sequential_5', 'laye...",xc2303_xc,63
11,94.21%,94.19%,94.20%,94.19%,keras,False,True,Sequential,7,3203362,4.0,0.0,1,1,1.0,0.0,0.0,0.0,0.0,str,RMSprop,"{'name': 'sequential_31', 'lay...",rboswell,71


In [86]:
# Get best model architecture and view model summary, change version arg as needed

bestmodel = ai.aimsonnx.instantiate_model(api_url, version=67)

bestmodel.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 60, 16)            160000    
_________________________________________________________________
flatten (Flatten)            (None, 960)               0         
_________________________________________________________________
dense (Dense)                (None, 2)                 1922      
Total params: 161,922
Trainable params: 161,922
Non-trainable params: 0
_________________________________________________________________


Ironically, the best model for this competition happened to be a non-recurrent model. It is composed of just three layers: Embedding, Flatten, and Dense. My best model has an LSTM layer instead of the Flatten layer.
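
The parameter count of this architecture depends on the padded sequence length, because `Flatten` turns the `(maxlen, 16)` embedding output into `maxlen * 16` inputs for the dense layer. Here is a small check of the 161,922 total reported in the summary above (the function name is illustrative):

```python
def embed_flatten_dense_params(maxlen, vocab_size=10000, embed_dim=16, n_classes=2):
    embedding = vocab_size * embed_dim                  # one vector per vocabulary entry
    dense = maxlen * embed_dim * n_classes + n_classes  # flattened inputs x classes, plus biases
    return embedding + dense

print(embed_flatten_dense_params(60))  # 161922, the leaderboard model with maxlen = 60
```

Unlike an LSTM, whose size is independent of the sequence length, this model grows linearly with `maxlen`.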

In [87]:
# Compare two model versions to see diffs
ai.aimsonnx.compare_models(api_url, version_list=[1,3]) 



Unnamed: 0,Model_1_Layer,Model_1_Shape,Model_1_Params,Model_3_Layer,Model_3_Shape,Model_3_Params
0,Embedding,"(None, 100, 16)",160000,Embedding,"(None, 40, 16)",160000
1,Flatten,"(None, 1600)",0,Flatten,"(None, 640)",0
2,Dense,"(None, 2)",3202,Dense,"(None, 2)",1282


Let's see the configuration of the best model.

In [88]:
data.iloc[0]['model_config']

"{'name': 'sequential', 'layers': [{'class_name': 'InputLayer', 'config': {'batch_input_shape': (None, 60), 'dtype': 'float32', 'sparse': False, 'ragged': False, 'name': 'embedding_input'}}, {'class_name': 'Embedding', 'config': {'name': 'embedding', 'trainable': True, 'batch_input_shape': (None, 60), 'dtype': 'float32', 'input_dim': 10000, 'output_dim': 16, 'embeddings_initializer': {'class_name': 'RandomUniform', 'config': {'minval': -0.05, 'maxval': 0.05, 'seed': None}}, 'embeddings_regularizer': None, 'activity_regularizer': None, 'embeddings_constraint': None, 'mask_zero': False, 'input_length': 60}}, {'class_name': 'Flatten', 'config': {'name': 'flatten', 'trainable': True, 'dtype': 'float32', 'data_format': 'channels_last'}}, {'class_name': 'Dense', 'config': {'name': 'dense', 'trainable': True, 'dtype': 'float32', 'units': 2, 'activation': 'softmax', 'use_bias': True, 'kernel_initializer': {'class_name': 'GlorotUniform', 'config': {'seed': None}}, 'bias_initializer': {'class_na

In [93]:
maxlen = 60

In [94]:
# tokenize and pad X data
X_train = preprocessor(trainingdata.tweet, maxlen=maxlen, max_words=10000)
X_test = preprocessor(testdata.tweet, maxlen=maxlen, max_words=10000)

# ohe encode Y data
y_train = pd.get_dummies(trainingdata.label)
y_test = pd.get_dummies(testdata.label)

In [95]:
bestmodel.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

In [None]:
mc = ModelCheckpoint('model_ai.h5', monitor='val_acc',mode='max', verbose=1, save_best_only=True) # evaluating val_acc maximization
red_lr = ReduceLROnPlateau(monitor='val_acc', patience=3, verbose=1, factor=0.5, min_lr=0) # halving lr when val_acc fails to improve for 3 epochs
es = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=15)

history = bestmodel.fit(X_train, y_train,
                    epochs=50,
                    batch_size=32,
                    validation_split=0.2, callbacks=[mc, red_lr, es])

In [100]:
bestmodel.load_weights('model_ai.h5')

y_pred = bestmodel.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]
labels_pred = pd.get_dummies(predicted_labels)

In [101]:
accuracy = accuracy_score(y_test, labels_pred)
print('Accuracy: %f' % accuracy)
precision = precision_score(y_test, labels_pred, average=None)
print('Precision:', precision)
recall = recall_score(y_test, labels_pred, average=None)
print('Recall:', recall)
f1 = f1_score(y_test, labels_pred, average=None)
print('F1 score:', f1)

auc = roc_auc_score(y_test, labels_pred)
print('ROC AUC: %f' % auc)

Accuracy: 0.932243
Precision: [0.92766373 0.93643688]
Recall: [0.93039216 0.93392857]
F1 score: [0.92902594 0.93518105]
ROC AUC: 0.932160


On my data, the best model from the leaderboard didn't reproduce its leaderboard score (93.22% accuracy here versus 95.09%). This is likely because my preprocessing differs.

# Conclusion
To conclude, I want to feed my best model some sample tweets I found on the web and some I wrote myself.

In [103]:
maxlen = 100

In [104]:
# tokenize and pad X data
X_train = preprocessor(trainingdata.tweet, maxlen=maxlen, max_words=10000)
X_test = preprocessor(testdata.tweet, maxlen=maxlen, max_words=10000)

# ohe encode Y data
y_train = pd.get_dummies(trainingdata.label)
y_test = pd.get_dummies(testdata.label)

In [105]:
model.load_weights('best_model_3.h5')

y_pred = model.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]
labels_pred = pd.get_dummies(predicted_labels)

In [109]:
tweet_online = "#COVID: If they haven’t isolated the virus, how can they make a vaccine?"
tweet_stats(tweet_online)

Tweet: #COVID: If they haven’t isolated the virus, how can they make a vaccine?
Tweet chances: [[0.9540024  0.04599761]]
95.40023803710938 % CHANCE OF BEING FAKE NEEEWS


In [112]:
tweet_online2 = "#Covid isn’t dangerous to elite athletes. The #Covid vaccine, however..."
tweet_stats(tweet_online2)

Tweet: #Covid isn’t dangerous to elite athletes. The #Covid vaccine, however...
Tweet chances: [[0.9887743  0.01122575]]
98.8774299621582 % CHANCE OF BEING FAKE NEEEWS


In [123]:
tweet_online3 = "Getting my #CovidVaccination on Thursday!!! Yes!!! #COVID19"
tweet_stats(tweet_online3)

Tweet: Getting my #CovidVaccination on Thursday!!! Yes!!! #COVID19
Tweet chances: [[0.5742912  0.42570874]]
57.42912292480469 % CHANCE OF BEING FAKE NEEEWS


Here the model is a little off. I'd consider this tweet to be truthful.
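
A prediction this close to 50% could also be reported as undecided rather than forced into a class. This is a minimal sketch, assuming the two probabilities are ordered `[fake, real]` as in the outputs above. The `label_with_margin` helper and the 0.1 margin are hypothetical and not part of `tweet_stats`.

```python
def label_with_margin(p_fake, p_real, margin=0.1):
    """Return 'fake' or 'real' only when the winning probability is at least 0.5 + margin."""
    if max(p_fake, p_real) < 0.5 + margin:
        return 'uncertain'
    return 'fake' if p_fake > p_real else 'real'

print(label_with_margin(0.5742912, 0.42570874))  # uncertain (the vaccination tweet above)
print(label_with_margin(0.9887743, 0.01122575))  # fake (the elite athletes tweet above)
```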

In [129]:
tweet_online4 = "Frontline workers: We need more resources to deal with #COVID19"
tweet_stats(tweet_online4)

Tweet: Frontline workers: We need more resources to deal with #COVID19
Tweet chances: [[0.08119022 0.9188098 ]]
It's telling the truth


After testing it, I wouldn't say this model does a good job on real Twitter data from outside the dataset. Still, I was surprised by how much time I needed to spend on Twitter to find truthful information! (I looked at the #covid and #covid19 hashtags. There really is a lot of misinformation out there!)

In [134]:
fantasy_tweet = "Covid-19 killed more than 2 million people worldwide"
tweet_stats(fantasy_tweet)

Tweet: Covid-19 killed more than 2 million people worldwide
Tweet chances: [[0.89894193 0.10105804]]
89.89419341087341 % CHANCE OF BEING FAKE NEEEWS


Unfortunately, this is true.

In [135]:
fantasy_tweet2 = "Vaccines are deadlier than the virus!!"
tweet_stats(fantasy_tweet2)

Tweet: Vaccines are deadlier than the virus!!
Tweet chances: [[0.98448527 0.01551468]]
98.44852685928345 % CHANCE OF BEING FAKE NEEEWS


In [138]:
fantasy_tweet3 = "Coronaviruses are viruses that circulate among animals. Some coronaviruses can infect humans."
tweet_stats(fantasy_tweet3)

Tweet: Coronaviruses are viruses that circulate among animals
Tweet chances: [[0.9915074  0.00849254]]
99.15074110031128 % CHANCE OF BEING FAKE NEEEWS


Unfortunately, also true! This model is not ready for real-world deployment!

In [147]:
fantasy_tweet4 = "The vaccines are helping us diminishing infection numbers. The total number active cases COVID-19 worldwide has been diminishing steadly."
tweet_stats(fantasy_tweet4)

Tweet: The vaccines are helping us diminishing infection numbers. The total number active cases COVID-19 worldwide has been diminishing steadly.
Tweet chances: [[0.2424441 0.7575559]]
It's telling the truth


It seems that a more articulate sentence with a richer vocabulary has a better chance of being labeled as real.