### Predict Covid Tweets Misinformation
Cheng Zhong <br>
cz2632@columbia.edu <br>
github link to the project: https://github.com/chengzhong666/Covid-Misinformation-Analysis

### Citation of paper providing original dataset
Shahi, Gautam Kishore, Anne Dirkson, and Tim A. Majchrzak. "An exploratory study of covid-19 misinformation on twitter." Online Social Networks and Media 22 (2021): 100104.

In [1]:
# Colab Setup: 
# note that tabular preprocessors require scikit-learn>=0.24.0
# Newest Tensorflow 2 has some bugs for onnx conversion
!pip install scikit-learn --upgrade 
import os
os.environ['TF_KERAS'] = '1'
% tensorflow_version 1

Requirement already up-to-date: scikit-learn in /usr/local/lib/python3.7/dist-packages (0.24.1)
`%tensorflow_version` only switches the major version: 1.x or 2.x.
You set: `1`. This will be interpreted as: `1.x`.


TensorFlow 1.x selected.


In [2]:
#Source: Fighting an Infodemic: COVID-19 Fake News Dataset, https://github.com/diptamath/covid_fake_news,https://arxiv.org/abs/2011.03327 

import pandas as pd
trainingdata=pd.read_csv("https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/Constraint_Train.csv", usecols = ['tweet','label'])
testdata=pd.read_csv("https://raw.githubusercontent.com/diptamath/covid_fake_news/main/data/english_test_with_labels.csv", usecols = ['tweet','label'])

In [3]:
trainingdata.head()

Unnamed: 0,tweet,label
0,The CDC currently reports 99031 deaths. In gen...,real
1,States reported 1121 deaths a small rise from ...,real
2,Politically Correct Woman (Almost) Uses Pandem...,fake
3,#IndiaFightsCorona: We have 1524 #COVID testin...,real
4,Populous states can generate large case counts...,real


### Examples of tweets from the dataset that demonstrate real information or misinformation

In [4]:
real_tweets = list(trainingdata[trainingdata['label'] == 'real']['tweet'])
fake_tweets = list(trainingdata[trainingdata['label'] == 'fake']['tweet'])

In [5]:
real_tweets[0]

'The CDC currently reports 99031 deaths. In general the discrepancies in death counts between different sources are small and explicable. The death toll stands at roughly 100000 people today.'

In [6]:
real_tweets[1]

'States reported 1121 deaths a small rise from last Tuesday. Southern states reported 640 of those deaths. https://t.co/YASGRTT4ux'

In [7]:
real_tweets[2]

'#IndiaFightsCorona: We have 1524 #COVID testing laboratories in India and as on 25th August 2020 36827520 tests have been done : @ProfBhargava DG @ICMRDELHI #StaySafe #IndiaWillWin https://t.co/Yh3ZxknnhZ'

In [8]:
fake_tweets[0]

'Politically Correct Woman (Almost) Uses Pandemic as Excuse Not to Reuse Plastic Bag https://t.co/thF8GuNFPe #coronavirus #nashville'

In [9]:
fake_tweets[1]

'Obama Calls Trump’s Coronavirus Response A Chaotic Disaster https://t.co/DeDqZEhAsB'

In [10]:
fake_tweets[2]

'???Clearly, the Obama administration did not leave any kind of game plan for something like this.??�'

### Discuss the dataset in general terms and describe why building a predictive model using this data might be practically useful.  Who could benefit from a model like this? Explain.

In [11]:
trainingdata.describe()

Unnamed: 0,tweet,label
count,6420,6420
unique,6420,2
top,@dbaerwald1 @pommylee @SCClemons The latter. S...,real
freq,1,3360


In [12]:
testdata.describe()

Unnamed: 0,tweet,label
count,2140,2140
unique,2140,2
top,There’s a critical lack of representative clin...,real
freq,1,1120


This dataset contains text data on covid-19 information tweets. The labels for the tweets are two categories, real and false. Building a predictive model using that is practically useful for identifying the truthfulness of information. It could improve the efficiency for the public to adopt correct knowledge for covid and prevent the spread of rumors.

The text data reflect different patterns for true and false tweets. For instance, veracious tweets generally show a neutral tone, use informative language, and avoid hateful speech. On the other hand, false news tweets show their inflammatory nature, deny scientific approaches to fight over the pandemic, and incite ignorance and hatred.

By applying deep learning algorithms to this dataset, these patterns of real and false tweets could be analyzed and identified in a relatively automated way. The models generated could be used for future inputs, and the decision makers could predict future trends and regulations and optimize resources.

### Run at least four prediction models to try to predict real or fake tweets well.
- Use Embedding layers and at least one LSTM layer for at least one of these models
- Experiment with Bidirectional LSTMs, stacked LSTMS, and dropout regularization with at least two models.
- Use Embedding layers and at least one 1D Convolution layer for at least one of these models
- Discuss which models performed better and point out relevant hyper-parameter values for successful models.

In [13]:
# Define preprocessor

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np

# Build vocabulary from training text data
tokenizer = Tokenizer(num_words=10000)
tokenizer.fit_on_texts(trainingdata.tweet)

# preprocessor tokenizes words and makes sure all documents have the same length
def preprocessor(data, maxlen, max_words):

    sequences = tokenizer.texts_to_sequences(data)

    word_index = tokenizer.word_index
    X = pad_sequences(sequences, maxlen=maxlen)

    return X

In [14]:
# Prepare train and test data

# tokenize and pad X data
X_train = preprocessor(trainingdata.tweet, maxlen=40, max_words=10000)
X_test = preprocessor(testdata.tweet, maxlen=40, max_words=10000)

# one encode Y data
y_train = pd.get_dummies(trainingdata.label)
y_test = pd.get_dummies(testdata.label)

In [15]:
print(X_train.shape)
print(X_test.shape)

(6420, 40)
(2140, 40)


In [16]:
trainingdata.label.value_counts()

real    3360
fake    3060
Name: label, dtype: int64

In [17]:
testdata.label.value_counts()

real    1120
fake    1020
Name: label, dtype: int64

In [18]:
# load model_eval_metrics() function to calculate metrics

import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
import pandas as pd
from math import sqrt

def model_eval_metrics(y_true, y_pred,classification="TRUE"):
     if classification=="TRUE":
        accuracy_eval = accuracy_score(y_true, y_pred)
        f1_score_eval = f1_score(y_true, y_pred,average="macro",zero_division=0)
        precision_eval = precision_score(y_true, y_pred,average="macro",zero_division=0)
        recall_eval = recall_score(y_true, y_pred,average="macro",zero_division=0)
        mse_eval = 0
        rmse_eval = 0
        mae_eval = 0
        r2_eval = 0
        metricdata = {'accuracy': [accuracy_eval], 'f1_score': [f1_score_eval], 'precision': [precision_eval], 'recall': [recall_eval], 'mse': [mse_eval], 'rmse': [rmse_eval], 'mae': [mae_eval], 'r2': [r2_eval]}
        finalmetricdata = pd.DataFrame.from_dict(metricdata)
     else:
        accuracy_eval = 0
        f1_score_eval = 0
        precision_eval = 0
        recall_eval = 0
        mse_eval = mean_squared_error(y_true, y_pred)
        rmse_eval = sqrt(mean_squared_error(y_true, y_pred))
        mae_eval = mean_absolute_error(y_true, y_pred)
        r2_eval = r2_score(y_true, y_pred)
        metricdata = {'accuracy': [accuracy_eval], 'f1_score': [f1_score_eval], 'precision': [precision_eval], 'recall': [recall_eval], 'mse': [mse_eval], 'rmse': [rmse_eval], 'mae': [mae_eval], 'r2': [r2_eval]}
        finalmetricdata = pd.DataFrame.from_dict(metricdata)
     return finalmetricdata

In [19]:
# Callbacks

from tensorflow.python.keras.callbacks import ReduceLROnPlateau, ModelCheckpoint, EarlyStopping 
mc = ModelCheckpoint('best_model_embeddings.h5', monitor='acc',mode='max', verbose=1, save_best_only=True) 
red_lr= ReduceLROnPlateau(monitor='acc',patience=2,verbose=1,factor=0.5, min_lr=0.001)
es = EarlyStopping(monitor='acc', mode='max', verbose=1, patience=3)

### Train Placeholder Model
1 embedding layer + 1 dense layer

In [20]:
from tensorflow.keras.layers import Dense, Embedding, Flatten
from tensorflow.keras.models import Sequential

# replace this model with the architectures from the task description
model = Sequential()
model.add(Embedding(10000, 16, input_length=40))
model.add(Flatten())
model.add(Dense(2, activation='softmax'))

model.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

history = model.fit(X_train, y_train,
                    epochs=10,
                    batch_size=32,
                    validation_split=0.2)

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
Train on 5136 samples, validate on 1284 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [21]:
# format y_pred as labels 
y_pred = model.predict(X_test).argmax(axis=1)
predicted_labels = [y_test.columns[i] for i in y_pred]
predicted_labels[0:5]

['real', 'fake', 'fake', 'real', 'real']

### Model 1
1 embedding layer + 2 LSTM layers
(no dropout regularization)

In [22]:
maxlen = 40
max_words = 10000 
embedding_dim = 100

In [23]:
from tensorflow.keras.layers import Dense, Embedding,Flatten, LSTM
from tensorflow.keras.models import Sequential
model1 = Sequential()
model1.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model1.add(LSTM(60, activation='tanh', return_sequences=True))
model1.add(LSTM(60, activation='tanh'))
model1.add(Dense(40, activation='relu'))
model1.add(Dense(2, activation='softmax'))
model1.summary()

model1.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

history = model1.fit(X_train, y_train,
                    epochs=100,
                    batch_size=40,
                    verbose=1,callbacks=[es,mc,red_lr])

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 40, 100)           1000000   
_________________________________________________________________
lstm (LSTM)                  (None, 40, 60)            38640     
_________________________________________________________________
lstm_1 (LSTM)                (None, 60)                29040     
_________________________________________________________________
dense_1 (Dense)              (None, 40)                2440      
_________________________________________________________________
dense_2 (Dense)              (None, 2)                 82        
Total params: 1,070,202
Trainable params: 1,070,202
Non-trainable params: 0
_________________________________________________________________
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Train on 6420 sa

In [24]:
y_eva1 = y_test.idxmax(1)

In [25]:
y_pred1 = model1.predict(X_test)
prediction_index1= np.argmax(y_pred1,axis=1)

# get labels from one hot encoded y_train data
labels=pd.get_dummies(y_train).columns

# Iterate through all predicted indices using map method
predicted_labels1=list(map(lambda x: labels[x], prediction_index1))

model_eval_metrics(y_eva1,predicted_labels1,classification="TRUE")

Unnamed: 0,accuracy,f1_score,precision,recall,mse,rmse,mae,r2
0,0.941121,0.940948,0.94136,0.940643,0,0,0,0


### Model 2
1 embedding layer + 2 LSTM layers (with dropout regularization on the second layer)

In [26]:
model2 = Sequential()
model2.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model2.add(LSTM(60, activation='tanh', return_sequences=True))
model2.add(LSTM(60, dropout=0.2, recurrent_dropout=0.2, activation='tanh'))
model2.add(Dense(40, activation='relu'))
model2.add(Dense(2, activation='softmax'))
model2.summary()

model2.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

history = model2.fit(X_train, y_train,
                    epochs=100,
                    batch_size=40,
                    verbose=1,callbacks=[es,mc,red_lr])

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 40, 100)           1000000   
_________________________________________________________________
lstm_2 (LSTM)                (None, 40, 60)            38640     
_________________________________________________________________
lstm_3 (LSTM)                (None, 60)                29040     
_________________________________________________________________
dense_3 (Dense)              (None, 40)                2440      
_________________________________________________________________
dense_4 (Dense)              (None, 2)                 82        
Total params: 1,070,202
Trainable params: 1,070,202
Non-trainable params: 0
_________________________________________________________________
Train on 6420 samples
Epoch 1/100
Epoch 00001: acc did not improve from 1.00000
Epoch 2/100
Epoch 00002: acc d

In [27]:
y_eva2 = y_test.idxmax(1)

y_pred2 = model2.predict(X_test)
prediction_index2= np.argmax(y_pred2,axis=1)

# get labels from one hot encoded y_train data
labels=pd.get_dummies(y_train).columns

# Iterate through all predicted indices using map method
predicted_labels2=list(map(lambda x: labels[x], prediction_index2))

model_eval_metrics(y_eva2,predicted_labels2,classification="TRUE")

Unnamed: 0,accuracy,f1_score,precision,recall,mse,rmse,mae,r2
0,0.939252,0.939145,0.938992,0.939338,0,0,0,0


### Model 3
1 embedding layer + 1 conv 1D layer + 2 LSTM layers (with dropout regularization on the second LSTM layer)

In [28]:
from tensorflow.keras.models import Sequential
from tensorflow.keras import layers
from tensorflow.keras.optimizers import RMSprop
from tensorflow.keras.layers import SimpleRNN, LSTM,Embedding

model3 = Sequential()
model3.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model3.add(layers.Conv1D(60, 7, activation='relu')) 
model3.add(layers.MaxPooling1D(2))
model3.add(LSTM(40, activation='tanh', return_sequences=True))
model3.add(LSTM(60, dropout=0.2, recurrent_dropout=0.2, activation='tanh'))
model3.add(Dense(40, activation='relu'))
model3.add(Dense(2, activation='softmax'))
model3.summary()

model3.compile(optimizer='rmsprop', loss='categorical_crossentropy', metrics=['acc'])

history = model3.fit(X_train, y_train,
                    epochs=100,
                    batch_size=40,
                    verbose=1,callbacks=[es,mc,red_lr])

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 40, 100)           1000000   
_________________________________________________________________
conv1d (Conv1D)              (None, 34, 60)            42060     
_________________________________________________________________
max_pooling1d (MaxPooling1D) (None, 17, 60)            0         
_________________________________________________________________
lstm_4 (LSTM)                (None, 17, 40)            16160     
_________________________________________________________________
lstm_5 (LSTM)                (None, 60)                24240     
_________________________________________________________________
dense_5 (Dense)              (None, 40)                2440      
_________________________________________________________________
dense_6 (Dense)              (None, 2)                

In [29]:
y_eva3 = y_test.idxmax(1)

y_pred3 = model3.predict(X_test)
prediction_index3= np.argmax(y_pred3,axis=1)

# get labels from one hot encoded y_train data
labels=pd.get_dummies(y_train).columns

# Iterate through all predicted indices using map method
predicted_labels3=list(map(lambda x: labels[x], prediction_index3))

model_eval_metrics(y_eva3,predicted_labels3,classification="TRUE")

Unnamed: 0,accuracy,f1_score,precision,recall,mse,rmse,mae,r2
0,0.936916,0.936821,0.936611,0.93715,0,0,0,0


### Model 4:
1 embedding layer + 1 bidirectional LSTM layer + 1 LSTM layer with dropout regularization

In [30]:
from tensorflow.keras.layers import Bidirectional

model4 = Sequential()
model4.add(Embedding(max_words, embedding_dim, input_length=maxlen))
model4.add(Bidirectional(LSTM(40, activation='tanh', return_sequences=True)))
model4.add(LSTM(60, dropout=0.2, recurrent_dropout=0.2, activation='tanh'))
model4.add(Dense(40, activation='relu'))
model4.add(Dense(2, activation='softmax'))
model4.summary()

model4.compile(optimizer='rmsprop',
              loss='binary_crossentropy',
              metrics=['acc'])
history = model4.fit(X_train, y_train,
                    epochs=100,
                    batch_size=40,
                    verbose=1,callbacks=[es,mc,red_lr])

Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Instructions for updating:
Call initializer instance with the dtype argument instead of passing it to the constructor
Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 40, 100)           1000000   
_________________________________________________________________
bidirectional (Bidirectional (None, 40, 80)            45120     
_________________________________________________________________
lstm_7 (LSTM)                (None, 60)                33840     
_________________________________________________________________
dense_7 (Dense)              (None, 40)                2440      
______________________________

In [31]:
y_eva4 = y_test.idxmax(1)

y_pred4 = model4.predict(X_test)
prediction_index4= np.argmax(y_pred4,axis=1)

# get labels from one hot encoded y_train data
labels=pd.get_dummies(y_train).columns

# Iterate through all predicted indices using map method
predicted_labels4=list(map(lambda x: labels[x], prediction_index4))

model_eval_metrics(y_eva4,predicted_labels4,classification="TRUE")

Unnamed: 0,accuracy,f1_score,precision,recall,mse,rmse,mae,r2
0,0.947196,0.946992,0.94808,0.946359,0,0,0,0


### Summary

Model 1: 1 embedding layer + 2 LSTM layers (no dropout regularization)

Model 2: 1 embedding layer + 2 LSTM layers (with dropout regularization on the second LSTM layer)

Model 3: 1 embedding layer + 1 conv 1D layer + 2 LSTM layers (with dropout regularization on the second LSTM layer)

Model 4: 1 embedding layer + 1 bidirectional LSTM layer + 1 LSTM layer with dropout regularization

In [32]:
model_eval_metrics(y_eva1,predicted_labels1,classification="TRUE")

Unnamed: 0,accuracy,f1_score,precision,recall,mse,rmse,mae,r2
0,0.941121,0.940948,0.94136,0.940643,0,0,0,0


In [33]:
model_eval_metrics(y_eva2,predicted_labels2,classification="TRUE")

Unnamed: 0,accuracy,f1_score,precision,recall,mse,rmse,mae,r2
0,0.939252,0.939145,0.938992,0.939338,0,0,0,0


In [34]:
model_eval_metrics(y_eva3,predicted_labels3,classification="TRUE")

Unnamed: 0,accuracy,f1_score,precision,recall,mse,rmse,mae,r2
0,0.936916,0.936821,0.936611,0.93715,0,0,0,0


In [35]:
model_eval_metrics(y_eva4,predicted_labels4,classification="TRUE")

Unnamed: 0,accuracy,f1_score,precision,recall,mse,rmse,mae,r2
0,0.947196,0.946992,0.94808,0.946359,0,0,0,0


My best model after experimenting with different layers is Model 4 with a 94.72% accuracy. It has the structure of one embedding layer, one bidirectional LSTM layer, and one LSTM layer with dropout regularization. This result makes sense because the bidirectional recurrent layer could gain information from past (backwards) and future (forward) states simultaneously. Also, this method does not require the input data to be fixed, and future input information is reachable from the current state. In the case of twitter text analysis, context of the text is a key element to understand the true meaning of the input. Therefore, bidirectional recurrent neural networks could be an effective approach.

Model 1 is the second best model to predict true and false tweets. This model doesn't have a dropout regularization to randomly drop out nodes during training. Dropout regularization is known to reduce overfitting and improve generalization error. This model might compromise on the geralization error to achieve a higher accuracy.

All models were excecuted with callback checkpoints, so the number of epochs is optimized without overfitting. All batch sizes were set to 40, but for future implementation, this factor could also be substitute with higher numbers.

It is worth noticing all models have similar accuracy with differences around 1%. Therefore, future experiment should be done to decide a truly distinguished algorithm.

### Submit the best model to the leader board for the Covid Misinformation AI Model Share competition

In [None]:
# install aimodelshare library

! pip install aimodelshare --upgrade --extra-index-url https://test.pypi.org/simple/ 

In [55]:
import aimodelshare as ai
from aimodelshare.aimsonnx import model_to_onnx

In [58]:
# save preprocessor
ai.export_preprocessor(preprocessor,"")

In [None]:
# save model in onnx format
onnx_model = model_to_onnx(model4, framework='keras', transfer_learning=False, deep_learning=True)

with open("onnx_model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

In [50]:
# set credentials for modeltoapi function 
# make sure you have uploaded your credentials.txt file
from aimodelshare.aws import set_credentials
api_url = "https://wvr23l2z9i.execute-api.us-east-1.amazonaws.com/prod/m"

set_credentials(apiurl=api_url,credential_file="credentials.txt", type="submit_model", manual=False)

AI Model Share login credentials set successfully.
AWS credentials set successfully.


In [None]:
ai.submit_model("onnx_model.onnx",
                api_url,
                prediction_submission=predicted_labels4,
                preprocessor="preprocessor.zip")

In [60]:
data=ai.get_leaderboard(api_url, verbose=3)
ai.leaderboard.stylize_leaderboard(data)

Unnamed: 0,accuracy,f1_score,precision,recall,ml_framework,transfer_learning,deep_learning,model_type,depth,num_params,bidirectional_layers,conv1d_layers,dense_layers,embedding_layers,flatten_layers,globalmaxpooling1d_layers,lstm_layers,maxpooling1d_layers,simplernn_layers,relu_act,sigmoid_act,softmax_act,tanh_act,loss,optimizer,model_config,username,version
0,95.09%,95.09%,95.07%,95.12%,keras,False,True,Sequential,3,161922,,,1,1,1.0,,,,,,,1.0,,str,RMSprop,"{'name': 'sequential', 'layers...",hpeters,67
1,95.09%,95.09%,95.07%,95.12%,keras,False,True,Sequential,3,161922,,,1,1,1.0,,,,,,,1.0,,str,RMSprop,"{'name': 'sequential', 'layers...",hpeters,66
2,95.00%,94.99%,94.97%,95.02%,keras,False,True,Sequential,5,1081482,1.0,,2,1,,,1.0,,,1.0,,1.0,1.0,str,RMSprop,"{'name': 'sequential_29', 'lay...",kagenlim,61
3,94.86%,94.85%,94.84%,94.87%,keras,False,True,Sequential,5,1035746,,,2,1,,,2.0,,,1.0,,1.0,2.0,str,RMSprop,"{'name': 'sequential_3', 'laye...",kagenlim,19
4,94.77%,94.76%,94.74%,94.78%,keras,False,True,Sequential,9,1313030,,,2,1,1.0,,1.0,,4.0,,3.0,,4.0,str,RMSprop,"{'name': 'sequential_1', 'laye...",kka2120,69
5,94.58%,94.57%,94.57%,94.57%,keras,False,True,Sequential,5,1070202,,,2,1,,,2.0,,,1.0,,1.0,2.0,str,RMSprop,"{'name': 'sequential_4', 'laye...",kagenlim,60
6,94.49%,94.47%,94.47%,94.48%,keras,False,True,Sequential,3,161282,,,1,1,1.0,,,,,,,1.0,,str,RMSprop,"{'name': 'sequential', 'layers...",newusertest,4
7,94.35%,94.34%,94.32%,94.37%,keras,False,True,Sequential,6,148066,,2.0,1,1,1.0,,,1.0,,2.0,,1.0,,str,RMSprop,"{'name': 'sequential_72', 'lay...",prajseth,40
8,94.25%,94.24%,94.24%,94.24%,keras,False,True,Sequential,3,98818,,,1,1,,,1.0,,,,,1.0,1.0,str,RMSprop,"{'name': 'sequential_78', 'lay...",prajseth,41
9,94.21%,94.19%,94.18%,94.21%,keras,False,True,Sequential,3,402690,,,1,1,,,1.0,,,,1.0,,1.0,str,RMSprop,"{'name': 'sequential_5', 'laye...",xc2303_xc,63


In [61]:
bestmodel = ai.aimsonnx.instantiate_model(api_url, version=67) 
bestmodel.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 60, 16)            160000    
_________________________________________________________________
flatten (Flatten)            (None, 960)               0         
_________________________________________________________________
dense (Dense)                (None, 2)                 1922      
Total params: 161,922
Trainable params: 161,922
Non-trainable params: 0
_________________________________________________________________


In [64]:
bestmodel2 = ai.aimsonnx.instantiate_model(api_url, version=66) 
bestmodel2.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 60, 16)            160000    
_________________________________________________________________
flatten (Flatten)            (None, 960)               0         
_________________________________________________________________
dense (Dense)                (None, 2)                 1922      
Total params: 161,922
Trainable params: 161,922
Non-trainable params: 0
_________________________________________________________________


In [65]:
bestmodel3 = ai.aimsonnx.instantiate_model(api_url, version=61) 
bestmodel3.summary()

Model: "sequential_29"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_27 (Embedding)     (None, 40, 100)           1000000   
_________________________________________________________________
bidirectional_5 (Bidirection (None, 40, 80)            45120     
_________________________________________________________________
lstm_37 (LSTM)               (None, 60)                33840     
_________________________________________________________________
dense_43 (Dense)             (None, 40)                2440      
_________________________________________________________________
dense_44 (Dense)             (None, 2)                 82        
Total params: 1,081,482
Trainable params: 1,081,482
Non-trainable params: 0
_________________________________________________________________


In [66]:
ai.aimsonnx.compare_models(api_url, version_list=[67,61])

Unnamed: 0,Model_67_Layer,Model_67_Shape,Model_67_Params,Model_61_Layer,Model_61_Shape,Model_61_Params
0,Embedding,"(None, 60, 16)",160000.0,Embedding,"(None, 40, 100)",1000000
1,Flatten,"(None, 960)",0.0,Bidirectional,"(None, 40, 80)",45120
2,Dense,"(None, 2)",1922.0,LSTM,"(None, 60)",33840
3,,,,Dense,"(None, 40)",2440
4,,,,Dense,"(None, 2)",82


The top two models didn't make use of LSTM models, but the third best model used a bidirectional layer with a LSTM layer. Because of the bidirectional layer, there are more parameters in each layer. This model is similar to my best model and could be possibly improved by tuning the batch size and dimentionality.

### Feed the model some realistic tweets to see if it returns meaningful/useful results

In [67]:
# real test tweets
test1 = "Half of all adults in the US have received at least one Covid-19 shot, the government says."
test2 = "#DYK? Older adults are at high risk of getting seriously ill with #COVID19. To help protect them, older adults, their caregivers, and families all need to get vaccinated. Learn how community organizations can help: "
# fake test tweets
test3 = "Coronavirus is just a variant of flu. There's nothing to be alarmed about"
test4 = "China should be responsible for covid."

In [69]:
testlist = [test1, test2, test3, test4]
testdf = pd.DataFrame(testlist)
testdf['label'] = ['real','real','fake','fake']
testdf.iloc[:,0]

0    Half of all adults in the US have received at ...
1    #DYK? Older adults are at high risk of getting...
2    Coronavirus is just a variant of flu. There's ...
3               China should be responsible for covid.
Name: 0, dtype: object

In [71]:
y_pred5 = model4.predict(preprocessor(testdf.iloc[:, 0], maxlen=40, max_words=10000))

In [73]:
prediction= np.argmax(y_pred5 ,axis=1)
labels=pd.get_dummies(y_train).columns
predicted_labels5=list(map(lambda x: labels[x], prediction_index))
testdf['predicted'] = pd.Series(predicted_labels5)

In [74]:
testdf

Unnamed: 0,0,label,predicted
0,Half of all adults in the US have received at ...,real,real
1,#DYK? Older adults are at high risk of getting...,real,fake
2,Coronavirus is just a variant of flu. There's ...,fake,fake
3,China should be responsible for covid.,fake,real


Well... the best model failed to predict a real information tweet from the CDC. Test4 also failed, but may be the result of political reasons. More analysis could be done to identify how the length and tone of the tweet could influence on the result.