<a href="https://colab.research.google.com/github/fajemila/zindi-Vaccinate/blob/main/Simple_LSTM_Zindi_To_Vaccinate_or_Not_To_Vaccinate.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook we develop a machine learning model that classifies a Twitter post about vaccinations as positive, neutral, or negative. Such a model could help governments and other public health actors monitor public sentiment towards COVID-19 vaccinations and improve public health policy, vaccine communication strategies, and vaccination programs across the world.
We start with a very simple model to understand the general concepts, without worrying about getting good results.

# Download Data
On the Zindi competition page, right-click Train.csv and click Inspect; the auth_token and the file URL should be visible.

In [1]:
import requests
# The URLs and the auth_token value come from the Zindi website
url1 = "https://api.zindi.africa/v1/competitions/to-vaccinate-or-not-to-vaccinate-its-not-a-question/files/Train.csv"
url2 = "https://api.zindi.africa/v1/competitions/to-vaccinate-or-not-to-vaccinate-its-not-a-question/files/Test.csv"

myobj = {'auth_token': '########################'}  # placeholder: use your own token

x = requests.post(url1, data=myobj, stream=True)
y = requests.post(url2, data=myobj, stream=True)

target_path1 = 'Train.csv'
target_path2 = 'Test.csv'


In [2]:
# Stream each response to disk in 512-byte chunks; `with` closes the file for us
with open(target_path1, "wb") as handle:
    for chunk in x.iter_content(chunk_size=512):
        if chunk:  # filter out keep-alive new chunks
            handle.write(chunk)
with open(target_path2, "wb") as handle:
    for chunk in y.iter_content(chunk_size=512):
        if chunk:  # filter out keep-alive new chunks
            handle.write(chunk)

# Load Libraries

In [3]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import warnings
warnings.filterwarnings('ignore')
import seaborn as sns
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow as tf
import time 
import re
# Natural Language Tool Kit 
import nltk  
nltk.download('stopwords') 
from nltk.corpus import stopwords 

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## 1. Load Data

In [4]:
train = pd.read_csv("Train.csv")
test = pd.read_csv("Test.csv")

In [5]:
train.head()

Unnamed: 0,tweet_id,safe_text,label,agreement
0,CL1KWCMY,Me &amp; The Big Homie meanboy3000 #MEANBOY #M...,0.0,1.0
1,E3303EME,I'm 100% thinking of devoting my career to pro...,1.0,1.0
2,M4IVFSMS,"#whatcausesautism VACCINES, DO NOT VACCINATE Y...",-1.0,1.0
3,1DR6ROZ4,I mean if they immunize my kid with something ...,-1.0,1.0
4,J77ENIIE,Thanks to <user> Catch me performing at La Nui...,0.0,1.0


In [6]:
test.head()

Unnamed: 0,tweet_id,safe_text
0,00BHHHP1,<user> <user> ... &amp; 4 a vaccine given 2 he...
1,00UNMD0E,Students starting school without whooping coug...
2,01AXPTJF,"I'm kinda over every ep of <user> being ""rippe..."
3,01HOEQJW,How many innocent children die for lack of vac...
4,01JUKMAO,"CDC eyeing bird flu vaccine for humans, though..."


In [7]:
train['label'].value_counts()

 0.000000    4908
 1.000000    4053
-1.000000    1038
 0.666667       1
Name: label, dtype: int64

## 2. Cleaning Data
- cast every tweet to a string
- convert uppercase letters to lowercase
- replace the `&amp;` entity, the `<user>` and `<url>` placeholders, and `#` with a space
- strip leading and trailing periods and whitespace
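
The cleaning steps above can also be collected into a single helper. This is a minimal sketch rather than the notebook's own code: `clean_tweet` is a hypothetical name, and it simply chains the same string operations that the cells below apply one at a time.

```python
def clean_tweet(text):
    """Apply the notebook's cleaning steps to one tweet (hypothetical helper)."""
    # Ensure we have a string (a NaN becomes the literal 'nan') and lowercase it.
    text = str(text).lower()
    # Replace the HTML ampersand entity, the placeholder tags, and '#' with a space.
    for token in ("&amp;", "<user>", "<url>", "#"):
        text = text.replace(token, " ")
    # Trim leading/trailing periods, then any surrounding whitespace.
    return text.strip(".").strip()


example = clean_tweet("Thanks to <user> #Vaccines &amp; more...")
print(example.split())
```

With such a helper, each column could be cleaned in one pass, for example `train['safe_text'].apply(clean_tweet)`.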

In [8]:
train['safe_text'] = train['safe_text'].apply(str)
test['safe_text'] = test['safe_text'].apply(str)

In [9]:
train['safe_text'] = train['safe_text'].apply(str.lower)
test['safe_text'] = test['safe_text'].apply(str.lower)

In [10]:
train['safe_text'] = train['safe_text'].apply(lambda x: x.replace('&amp;', ' '))
test['safe_text'] = test['safe_text'].apply(lambda x: x.replace('&amp;', ' '))

In [11]:
train['safe_text'] = train['safe_text'].apply(lambda x: x.replace('<user>', ' '))
test['safe_text'] = test['safe_text'].apply(lambda x: x.replace('<user>', ' '))

In [12]:
train['safe_text'] = train['safe_text'].apply(lambda x: x.replace('<url>', ' '))
test['safe_text'] = test['safe_text'].apply(lambda x: x.replace('<url>', ' '))

In [13]:
train['safe_text'] = train['safe_text'].apply(lambda x: x.replace('#', ' '))
test['safe_text'] = test['safe_text'].apply(lambda x: x.replace('#', ' '))

In [14]:
train['safe_text'] = train['safe_text'].apply(lambda x: x.strip('.').strip())
test['safe_text'] = test['safe_text'].apply(lambda x: x.strip('.').strip())

Only one row has the label 0.6667, and it has no agreement value, so we drop it. The cell below removes rows 4798 and 4799 by index, after which only the labels -1, 0, and 1 remain.

In [15]:
train.drop(index=[4798, 4799], inplace=True)
train.reset_index(drop=True, inplace=True)

In [16]:
train.head()

Unnamed: 0,tweet_id,safe_text,label,agreement
0,CL1KWCMY,me the big homie meanboy3000 meanboy mb m...,0.0,1.0
1,E3303EME,i'm 100% thinking of devoting my career to pro...,1.0,1.0
2,M4IVFSMS,"whatcausesautism vaccines, do not vaccinate yo...",-1.0,1.0
3,1DR6ROZ4,i mean if they immunize my kid with something ...,-1.0,1.0
4,J77ENIIE,thanks to catch me performing at la nuit nyc...,0.0,1.0


In [17]:
# max number of words in each sentence
SEQUENCE_LENGTH = 50
EMBEDDING_SIZE = 100
# number of words to use, discarding the rest
N_WORDS = 10000
# out of vocabulary token
OOV_TOKEN = "<unk>"
padding_type = 'post'
trunc_type = 'post'

## 3. Preprocessing Data
Check out this link to learn more about the steps below:
https://www.kdnuggets.com/2020/03/tensorflow-keras-tokenization-text-data-prep.html

- Our labels range from -1 to 1, but `to_categorical` expects class indices that start at 0, so we add 1
- Convert the labels to one-hot vectors for a multi-class classification task using `to_categorical`
- Split the data into training and validation sets
- Instantiate the `Tokenizer` class and fit it on the training text
- Pad the tokenized text so that all sequences have equal length
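
To make the label step concrete, here is a small NumPy sketch of the shift-then-one-hot idea. It is a stand-in for illustration, not the Keras call itself: the sample labels are made up, and reading -1, 0, 1 as negative, neutral, positive is an assumption based on the task description.

```python
import numpy as np

# Made-up sample labels; assumed meaning: -1 negative, 0 neutral, 1 positive.
labels = np.array([-1.0, 0.0, 1.0, 0.0])

# Shift to class indices 0, 1, 2, because one-hot encoding needs non-negative integers.
class_ids = (labels + 1).astype(int)

# Picking one row of a 3x3 identity matrix per sample yields the one-hot vectors.
one_hot = np.eye(3, dtype="float32")[class_ids]
print(class_ids)
print(one_hot)
```

Column 0 of the result corresponds to the original label -1, column 1 to 0, and column 2 to 1, which is the ordering the prediction post-processing later relies on.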

In [18]:
train['label'].value_counts()

 0.0    4908
 1.0    4053
-1.0    1038
Name: label, dtype: int64

In [19]:
X = train['safe_text']
y = to_categorical(train['label']+1,num_classes=3)

In [20]:
X_train,X_val,y_train,y_val = train_test_split(X,y,test_size=0.2)

In [21]:
print('The Shape of Training ',X_train.shape)
print('The Shape of Validation',X_val.shape)

The Shape of Training  (7999,)
The Shape of Validation (2000,)


In [22]:
tokenizer = Tokenizer(num_words=N_WORDS, oov_token=OOV_TOKEN)
tokenizer.fit_on_texts(X_train)

In [23]:
word_index = tokenizer.word_index

In [24]:
print("The first word indices are: ")
for x in list(word_index)[0:15]:
    print(" {},  {} ".format(x, word_index[x]))

The first word indices are: 
 <unk>,  1 
 the,  2 
 to,  3 
 measles,  4 
 a,  5 
 of,  6 
 in,  7 
 and,  8 
 i,  9 
 vaccine,  10 
 is,  11 
 for,  12 
 vaccines,  13 
 kids,  14 
 you,  15 


In [25]:
training_sequences = tokenizer.texts_to_sequences(X_train)
training_padded = pad_sequences(training_sequences, maxlen=SEQUENCE_LENGTH, padding=padding_type, truncating=trunc_type)
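
What `padding='post'` and `truncating='post'` do to a single sequence can be sketched in plain Python. `pad_post` is a hypothetical helper written only for illustration; it mimics the behaviour for one list of token ids and is not how Keras implements `pad_sequences`.

```python
def pad_post(sequence, maxlen, value=0):
    """Cut extra tokens from the end, then fill with `value` at the end."""
    # truncating='post': keep only the first `maxlen` tokens.
    trimmed = list(sequence)[:maxlen]
    # padding='post': append the fill value until the length is `maxlen`.
    return trimmed + [value] * (maxlen - len(trimmed))


short = pad_post([18, 148, 171], maxlen=5)
long = pad_post([1, 2, 3, 4, 5, 6, 7], maxlen=5)
print(short)
print(long)
```

Index 0 is safe to use as the fill value because the tokenizer's word index starts at 1.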

In [26]:
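# Caution: training_sequences[1] encodes X_train.iloc[1] (the split shuffles rows),
# not train.safe_text[1], so the two printed lines come from different tweets.
print(train.safe_text[1])
print(training_sequences[1])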

i'm 100% thinking of devoting my career to proving autism isn't caused by vaccines due to the idiotic posts i've seen about world autism day
[18, 148, 171, 28, 126, 27, 13, 140, 49, 150, 11, 197, 32, 5, 932, 795, 1882, 150, 11, 5664]


In [27]:
test['safe_text'] = test['safe_text'].astype(str)

In [28]:
val_sequences = tokenizer.texts_to_sequences(X_val)
val_padded = pad_sequences(val_sequences, maxlen=SEQUENCE_LENGTH, padding=padding_type, truncating=trunc_type)

In [29]:
test_sequences = tokenizer.texts_to_sequences(test['safe_text'])
test_padded = pad_sequences(test_sequences, maxlen=SEQUENCE_LENGTH, padding=padding_type, truncating=trunc_type)

In [30]:
train.isnull().any()

tweet_id     False
safe_text    False
label        False
agreement    False
dtype: bool

# 4. Creating the Model

The model has four layers: an embedding layer, a bidirectional LSTM, and two dense (fully connected) layers.
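
One detail of the model code that follows is worth illustrating: the last dense layer uses a sigmoid activation, which scores each of the three classes independently. The NumPy sketch below uses made-up raw outputs (logits) to compare sigmoid with softmax, the activation more commonly paired with one-hot labels. It is an illustration only, not part of the notebook's pipeline.

```python
import numpy as np

# Made-up raw outputs of the last layer for one tweet (one value per class).
logits = np.array([[-1.0, 0.5, 2.0]])

# Sigmoid squashes each value separately, so the scores need not sum to 1.
sigmoid = 1.0 / (1.0 + np.exp(-logits))

# Softmax normalises across the classes, so each row sums to 1.
shifted = np.exp(logits - logits.max(axis=1, keepdims=True))
softmax = shifted / shifted.sum(axis=1, keepdims=True)

print(sigmoid, sigmoid.sum())
print(softmax, softmax.sum())
```

This is why the prediction rows printed later in the notebook do not add up to 1.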

In [33]:
optimizer = tf.keras.optimizers.Adam(learning_rate=0.01)  # `lr` is a deprecated alias

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(N_WORDS, EMBEDDING_SIZE, input_length=SEQUENCE_LENGTH),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),
    tf.keras.layers.Dense(14, activation='relu'),
    tf.keras.layers.Dense(3,activation='sigmoid')
])
model.compile(loss='binary_crossentropy',optimizer=optimizer)

In [34]:
start_time = time.time()

num_epochs = 1
history = model.fit(training_padded, y_train, epochs=num_epochs, validation_data=(val_padded, y_val))



In [35]:
pred = model.predict(test_padded)

In [36]:
pred

array([[0.35070682, 0.46285847, 0.4277427 ],
       [0.09110484, 0.17956445, 0.698197  ],
       [0.07381827, 0.72601336, 0.23211257],
       ...,
       [0.2325881 , 0.41443717, 0.42840785],
       [0.07699373, 0.14703174, 0.73255193],
       [0.34249035, 0.45593917, 0.4201953 ]], dtype=float32)

# Important Information

For an ordinary multi-class classification task scored on `accuracy`, we could simply set `test['label'] = np.argmax(pred,axis=1)`. The challenge, however, is scored with root mean squared error (RMSE), so hard class labels would perform badly on the leaderboard. Instead, we convert the predictions to values between -1 and 1. Better post-processing could improve the leaderboard score further.
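
As one example of what such post-processing might look like, the sketch below maps the three class scores to a single value by taking a weighted average of -1, 0, and 1. This is an assumed alternative for illustration, not the rule the notebook uses next, and `expected_sentiment` is a hypothetical name. The `rmse` helper here is written with NumPy only so the example is self-contained.

```python
import numpy as np


def expected_sentiment(probs):
    """Map class scores for (-1, 0, 1) to one value in [-1, 1]."""
    probs = np.asarray(probs, dtype=float)
    # Normalise each row so the three scores sum to 1 (sigmoid outputs need not).
    weights = probs / probs.sum(axis=1, keepdims=True)
    # Weighted average of the class values -1, 0 and 1.
    return weights @ np.array([-1.0, 0.0, 1.0])


def rmse(true, pred):
    """Root mean squared error between two equal-length sequences."""
    true = np.asarray(true, dtype=float)
    pred = np.asarray(pred, dtype=float)
    return float(np.sqrt(np.mean((true - pred) ** 2)))


# Made-up scores for two tweets, columns ordered as labels -1, 0, 1.
scores = expected_sentiment([[0.2, 0.3, 0.5], [0.8, 0.1, 0.1]])
print(scores)
print(rmse([1.0, -1.0], scores))
```

Whether this beats the argmax-based rule on the leaderboard is untested here; it would need to be checked against the validation split.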

In [37]:
from sklearn.metrics import mean_squared_error


def process_prediction(preds):
    r'''
    This function helps us go from a classification
    problem to a regression one.
    The regression values are in the range [-1, 1].
    '''
    final_preds = []
    for pred in preds:
        argmax = np.argmax(pred, axis=0)
        if argmax == 0:    # class 0 (label -1) wins: use its score, negated
            final_preds.append(-1 * pred[0])
        elif argmax == 1:  # class 1 (label 0) wins: predict exactly 0
            final_preds.append(0)
        else:              # class 2 (label 1) wins: use its score as is
            final_preds.append(pred[2])
    return final_preds


def rmse(true, pred):
    return np.sqrt(mean_squared_error(true, pred))

In [38]:
predictions = process_prediction(pred)

In [39]:
test['label'] = predictions

In [40]:
test.head()

Unnamed: 0,tweet_id,safe_text,label
0,00BHHHP1,"... 4 a vaccine given 2 healthy peeps, fda t...",0.0
1,00UNMD0E,students starting school without whooping coug...,0.698197
2,01AXPTJF,"i'm kinda over every ep of being ""ripped fro...",0.0
3,01HOEQJW,how many innocent children die for lack of vac...,0.70251
4,01JUKMAO,"cdc eyeing bird flu vaccine for humans, though...",0.0


In [41]:
test[['tweet_id','label']].to_csv('submissions.csv',index=False)

## Some useful Insights
- The text cleaning done here is basic; you can do a lot more
- Use pretrained word embeddings, e.g. **GloVe**
- Try the Transformer architecture, or use a pretrained Transformer model
- Tweak the neural network's hyperparameters
- Increase the number of epochs