# Vaccination-Stance Classification

This Jupyter notebook contains the Python code used to generate the results reported in the paper. The code was written for Python 3 and depends on the nltk, pandas, and scikit-learn libraries (numpy and scipy are also imported). It classifies vaccine-related tweets into three classes: pro-vaccine, anti-vaccine, and neutral.

To perform the classification, the tweets are preprocessed as follows:
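- We first apply NLTK's TweetTokenizer to convert each tweet to lower case and split it into tokens (words, hashtags, and punctuation). User mentions are stripped when twtToken_handler is True.
- Stopwords (NLTK's English list plus the entries in stopwords.txt) are removed from the extracted tokens.
- We then apply a scikit-learn vectorizer to turn the tokens into n-gram features. The code below uses TfidfVectorizer() (the CountVectorizer() call is left commented out), with the n-gram range set by ngrams.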
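
The three preprocessing steps above can be sketched on a few toy tweets. This is a minimal illustration, not the notebook's exact pipeline: the tweets, the `TOY_STOPWORDS` set, and the `toy_preprocess` helper are made up for the example (the notebook uses NLTK's English stop word list plus a custom file), and the n-gram range is shortened to (1, 2) so the vocabulary stays readable.

```python
from nltk.tokenize import TweetTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini stop word list, used here only to avoid downloading a corpus.
TOY_STOPWORDS = {"are", "the", "is", "a", "to"}

def toy_preprocess(tweet, stop_words=TOY_STOPWORDS):
    # Lowercase, strip @handles, and shorten long runs of repeated characters.
    tkz = TweetTokenizer(strip_handles=True, reduce_len=True, preserve_case=False)
    tokens = [t for t in tkz.tokenize(tweet) if t not in stop_words]
    return " ".join(tokens)

toy_tweets = [
    "@user1 Vaccines are safe and effective",
    "@user2 The vaccine mandate is a big pharma plot",
    "Vaccines are safe, get the jab",
]
cleaned = [toy_preprocess(t) for t in toy_tweets]

# Unigrams and bigrams, weighted by TF-IDF; min_df=1 keeps every term in this tiny corpus.
vec = TfidfVectorizer(ngram_range=(1, 2), min_df=1)
X_toy = vec.fit_transform(cleaned)

print(cleaned)
print(sorted(vec.vocabulary_))
print(X_toy.shape)
```

The vectorizer applies its own tokenization to the cleaned strings, so punctuation tokens kept by the tweet tokenizer do not become features.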

After preprocessing, we train scikit-learn's implementation of regularized logistic regression. The code below uses an L2 penalty (penalty='l2') with the liblinear solver, and the regularization parameter C is set through regularizer (1 in the run shown). Performance of the classifier is evaluated using 10-fold cross-validation (numFolds = 10). To handle the imbalanced class distribution, the smaller classes are oversampled within each training fold. Results are reported in terms of the overall model accuracy as well as the precision, recall, and F-measure for each class.
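
The key point of the evaluation scheme is that oversampling touches only the training portion of each fold, while the held-out portion keeps its original class distribution. A minimal sketch on toy data follows; the `oversample` helper, the toy frame, and the fixed `seed` are assumptions made for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oversample(train, label_col="class", seed=0):
    """Resample smaller classes with replacement until all match the largest class."""
    max_size = train[label_col].value_counts().max()
    parts = [train]
    for _, group in train.groupby(label_col):
        extra = max_size - len(group)
        if extra > 0:
            parts.append(group.sample(extra, replace=True, random_state=seed))
    return pd.concat(parts, ignore_index=True)

# Toy data: 30 rows with an imbalanced three-class label.
toy = pd.DataFrame({
    "feature": np.arange(30),
    "class": [1] * 15 + [0] * 10 + [-1] * 5,
})

kf = KFold(n_splits=5, shuffle=True, random_state=1234)
balanced_ok = []
test_sizes = []
for train_idx, test_idx in kf.split(toy):
    train_bal = oversample(toy.iloc[train_idx])   # balanced training fold
    test = toy.iloc[test_idx]                     # held-out fold, left untouched
    counts = train_bal["class"].value_counts()
    balanced_ok.append(counts.nunique() == 1)
    test_sizes.append(len(test))
    print(counts.to_dict(), "test size:", len(test))
```

Balancing before the split instead would let copies of the same row appear in both the training and test folds, which inflates the test scores.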

**Configuration Parameters:**

In [1]:
datadir = '../data/'
resultdir = '../results/'

# Tweet tokenizer parameters
twtToken_handler = True
twtToken_len = True

# Feature extraction parameter
ngrams = (1,3)                # extract unigrams, bigrams, and trigrams
mindf = 3

# Classification parameters
oversampling = True           # balance classes by oversampling the smaller classes in the training set
numFolds = 10                  # number of folds for cross-validation

In [2]:
import re

import nltk
import pandas as pd           # used by getTopFeatures below
#nltk.download('stopwords')   # uncomment on first run to fetch the stop word corpus

from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords

def getStopwords():
    stopwordlist = set(stopwords.words('english'))
    with open(datadir + 'stopwords.txt', 'r') as f:
        for line in f:
            stopwordlist.add(line.rstrip('\n'))
    
    return stopwordlist
               
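def preprocess(sentence):
    # Lowercase, optionally strip @handles, and shorten long runs of repeated characters.
    tkz = TweetTokenizer(strip_handles=twtToken_handler, reduce_len=twtToken_len, preserve_case=False)
    stop_words = getStopwords()   # note: re-reads stopwords.txt on every call
    temp = []
    for word in tkz.tokenize(sentence.lower()):
        # Removes only the literal "http"/"https" substring; the rest of a URL token is kept.
        word = re.sub(r"http[s]?", "", word)
        if word != '' and word not in stop_words:
            temp.append(word)
    return ' '.join(temp)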

def getAllFeatures(sp_mat, features):
    vocab = dict([(value, key) for key, value in features.items()])
    result = []
    for i in range(sp_mat.shape[0]):
        p, q = sp_mat[i].nonzero()
        temp = ' '.join([vocab[q[j]] for j in range(len(q))])
        result.append(temp)
    return result

def getTopFeatures(coef, features):
    # Invert the vocabulary (term -> column index) so terms can be looked up by column index.
    vocab = {index: term for term, index in features.items()}
    # Sort coefficients from most positive to most negative; the index keeps the column ids.
    sorted_coef = pd.Series(coef).sort_values(ascending=False)
    result = pd.DataFrame({'term': [vocab[i] for i in sorted_coef.index]})
    result['coef'] = sorted_coef.values
    return result

In [3]:
import pandas as pd

rawdata = pd.read_csv(datadir + "pro_anti.csv",header = 'infer')
rawdata = rawdata.drop(columns=['vax_class'])
rawdata = rawdata.reset_index(drop=True)
print(rawdata.shape)

print('Pro-vs-anti-vs-neutral class distribution:')
distrib = rawdata['class'].value_counts()
print(distrib)
probs = distrib/sum(distrib)
print(probs)

rawdata

(4842, 2)
Pro-vs-anti-vs-neutral class distribution:
 1    1875
 0    1565
-1    1402
Name: class, dtype: int64
 1    0.387237
 0    0.323214
-1    0.289550
Name: class, dtype: float64


Unnamed: 0,tweet,class
0,@DickDugan @maddow They make much more selling...,-1
1,@theheraldsun How does an unvaccinated person ...,-1
2,@cameronjowens @leighsales We have much higher...,1
3,Officials I trust fear there is not enough int...,1
4,"My daughter 30 years old, care worker. Double ...",-1
...,...,...
4837,M?ori tribe tells anti-Covid vaccine protester...,0
4838,Anti-vaccine protesters display Nazi symbols o...,0
4839,@Joc_face @TravisR96776163 @ElijahSchaffer If ...,1
4840,"@ponderousthings @1NewsNZ @jordyn_rudd ""The SA...",0


In [4]:
data = rawdata.copy()
data['X'] = data['tweet'].apply(preprocess)
data

Unnamed: 0,tweet,class,X
0,@DickDugan @maddow They make much more selling...,-1,make much selling vaccine ivermectin . profit ...
1,@theheraldsun How does an unvaccinated person ...,-1,unvaccinated person affect vaccinated person ?...
2,@cameronjowens @leighsales We have much higher...,1,much higher rates vaccinations icus hitting ca...
3,Officials I trust fear there is not enough int...,1,officials trust fear enough intensive care cap...
4,"My daughter 30 years old, care worker. Double ...",-1,"daughter 30 years old , care worker . double j..."
...,...,...,...
4837,M?ori tribe tells anti-Covid vaccine protester...,0,? ori tribe tells anti-covid vaccine protester...
4838,Anti-vaccine protesters display Nazi symbols o...,0,anti-vaccine protesters display nazi symbols o...
4839,@Joc_face @TravisR96776163 @ElijahSchaffer If ...,1,"indeed got jab , point actually : � pro-vaxx &..."
4840,"@ponderousthings @1NewsNZ @jordyn_rudd ""The SA...",0,""" sar household contacts exposed delta variant..."


In [5]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

#vectorizer = CountVectorizer(max_df=0.95, min_df=mindf, ngram_range=ngrams)    
vectorizer = TfidfVectorizer(max_df=0.95, min_df=mindf, ngram_range=ngrams)    
X = vectorizer.fit_transform(data['X'].values)
features = vectorizer.vocabulary_
Y = rawdata['class']
X.shape

(4842, 7044)

In [6]:
print('Class distribution:')
distrib = Y.value_counts()
print(distrib)

probs = distrib/sum(distrib)
print(probs)

Class distribution:
 1    1875
 0    1565
-1    1402
Name: class, dtype: int64
 1    0.387237
 0    0.323214
-1    0.289550
Name: class, dtype: float64


In [8]:
import numpy as np
from sklearn.model_selection import KFold
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score

regularizer = 1
kf = KFold(n_splits=numFolds, shuffle=True, random_state=1234)
fold = 0
Ypred = Y.copy()
Yprob = np.zeros((Y.shape[0],3))

for train_index, test_index in kf.split(X):
    fold += 1
    print('\nFold %d:' % (fold))
    
    train = pd.DataFrame(X[train_index].toarray())
    train['class'] = Y[train_index].tolist()
    
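    if oversampling:
        # Resample each smaller class with replacement up to the size of the largest class.
        max_size = train['class'].value_counts().max()

        lst = [train]
        for class_index, group in train.groupby('class'):
            lst.append(group.sample(max_size-len(group), replace=True))
        train_bal = pd.concat(lst)
    else:
        train_bal = train
        
    # Use a separate name so the preprocessed 'data' frame from In [4] is not overwritten.
    Y_train = train_bal['class']
    X_train = train_bal.drop(['class'],axis=1)
    X_train = csr_matrix(X_train)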

    clf = LogisticRegression(verbose=1, C=regularizer, random_state=1, solver='liblinear', 
                              class_weight='balanced', penalty='l2',max_iter=5000) 
    clf.fit(X_train,Y_train)

    pred_train = clf.predict(X_train)
    pred_test = clf.predict(X[test_index])
    Ypred[test_index] = pred_test
    Yprob[test_index,:] = clf.predict_proba(X[test_index])

    print('\nTrain accuracy:' + str(accuracy_score(Y_train, pred_train)))
    print('Test accuracy:' + str(accuracy_score(Y[test_index], pred_test)))
    
    if fold == 1:
        cm = confusion_matrix(Y[test_index], pred_test)
    else:
        cm = cm + confusion_matrix(Y[test_index], pred_test)
    
print("Confusion Matrix:")
print(cm)
print("Accuracy =", sum(np.diag(cm))/sum(sum(cm)))
print("Micro F1 =", f1_score(Y, Ypred, average='micro'))
print("Macro F1 =", f1_score(Y, Ypred, average='macro'))
print("Weighted F1 =", f1_score(Y, Ypred, average='weighted'))
print("Accuracy =", accuracy_score(Y, Ypred))

# Rows of cm are true classes and columns are predicted classes, ordered -1, 0, 1.
for i, label in enumerate([-1, 0, 1]):
    prec = cm[i][i]/cm[:,i].sum()
    recall = cm[i][i]/cm[i,:].sum()
    f1 = 2*prec*recall/(prec + recall)
    print('Class %d:' % label)
    print('   Precision =', prec)
    print('   Recall =', recall)
    print('   F-measure =', f1)


Fold 1:
[LibLinear]
Train accuracy:0.8831168831168831
Test accuracy:0.6989690721649484

Fold 2:
[LibLinear]
Train accuracy:0.8877731836975783
Test accuracy:0.6597938144329897

Fold 3:
[LibLinear]
Train accuracy:0.8866185897435898
Test accuracy:0.731404958677686

Fold 4:
[LibLinear]
Train accuracy:0.8845175181053043
Test accuracy:0.6611570247933884

Fold 5:
[LibLinear]
Train accuracy:0.8888231815493791
Test accuracy:0.6797520661157025

Fold 6:
[LibLinear]
Train accuracy:0.8834623783998412
Test accuracy:0.6900826446280992

Fold 7:
[LibLinear]
Train accuracy:0.8841439304072756
Test accuracy:0.7024793388429752

Fold 8:
[LibLinear]
Train accuracy:0.8863993710691824
Test accuracy:0.7231404958677686

Fold 9:
[LibLinear]
Train accuracy:0.8821548821548821
Test accuracy:0.7210743801652892

Fold 10:
[LibLinear]
Train accuracy:0.8827599841834717
Test accuracy:0.7355371900826446
Confusion Matrix:
[[ 949  200  253]
 [ 215 1097  253]
 [ 261  269 1345]]
Accuracy = 0.7003304419661297
Micro F1 = 0.7003
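
The per-class precision, recall, and F-measure can be recomputed from the aggregated confusion matrix alone. The sketch below copies the matrix from the output above and relies on scikit-learn's convention that rows are true classes and columns are predicted classes, here in the order -1, 0, 1. The names `cm_total` and `labels` are chosen for this example only.

```python
import numpy as np

# Confusion matrix summed over the 10 folds, copied from the output above.
cm_total = np.array([
    [949,  200,  253],
    [215, 1097,  253],
    [261,  269, 1345],
])
labels = [-1, 0, 1]

diag = np.diag(cm_total)
precision = diag / cm_total.sum(axis=0)   # correct predictions / all predictions of that class
recall = diag / cm_total.sum(axis=1)      # correct predictions / all true members of that class
f_measure = 2 * precision * recall / (precision + recall)
accuracy = diag.sum() / cm_total.sum()

print("Accuracy =", accuracy)
for label, p, r, f in zip(labels, precision, recall, f_measure):
    print("Class %d: precision=%.4f recall=%.4f F-measure=%.4f" % (label, p, r, f))
```

The row sums of the matrix equal the class counts reported earlier (1402, 1565, and 1875), which is a quick check that every tweet was predicted exactly once across the folds.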

In [9]:
rawdata['prob_anti'] = Yprob[:,0]
rawdata['prob_neutral'] = Yprob[:,1]
rawdata['prob_pro'] = Yprob[:,2]

In [10]:
anti_coef = getTopFeatures(clf.coef_[0], features)
print(anti_coef[:100].values)
anti_coef.columns = ['term','coefficient']

[['pharma' 3.097942308815046]
 ['body' 2.8450444375585016]
 ['big pharma' 2.147711758965817]
 ['officially' 1.9637765291488047]
 ['government' 1.8857527161863532]
 ['toxin' 1.8552335894999605]
 ['prevent' 1.8334317596152307]
 ['immunity' 1.8224471841158323]
 ['spread' 1.7794651899037481]
 ['pushing' 1.7778514935763976]
 ['im antivax' 1.7573778164696618]
 ['poison' 1.7482860947890628]
 ['died' 1.7262111024147142]
 ['jab' 1.7179198741784802]
 ['mass' 1.7093574994129583]
 ['made' 1.619491774336121]
 ['passports' 1.591207274842764]
 ['companies' 1.5127493649295491]
 ['jabs' 1.4698678917619796]
 ['kill' 1.4427682876117844]
 ['bigpharma' 1.430288217633366]
 ['take' 1.4046305958210181]
 ['big' 1.3891883250191408]
 ['sheep' 1.3875849453665345]
 ['consent' 1.3758064614064505]
 ['spread covid' 1.3716234464028094]
 ['course' 1.3618217491718416]
 ['fear' 1.3579232973111877]
 ['pharmaceutical companies' 1.3579196657853996]
 ['pharmaceutical' 1.3579196657853996]
 ['come' 1.3545248189736288]
 ['gene'

In [11]:
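# clf.classes_ is sorted as [-1, 0, 1], so coef_[1] holds the neutral class (0);
# the pro-vaccine class (1) corresponds to coef_[2].
neutral_coef = getTopFeatures(clf.coef_[1], features)
print(neutral_coef[:50].values)
neutral_coef.columns = ['term','coefficient']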

[['co' 4.086008728852657]
 ['polio vaccine' 1.912962044649195]
 ['aids' 1.8407425419866794]
 ['vacc' 1.6519065991719208]
 ['via' 1.6198005597196685]
 ['says' 1.6058119118975145]
 ['hesitancy' 1.5696336657944305]
 ['moderna' 1.556251673603067]
 ['meme' 1.4710833135917798]
 ['exist' 1.47010772335531]
 ['border' 1.4529208924102293]
 ['india' 1.4482630345086585]
 ['pfizer' 1.441581847452397]
 ['office' 1.4274430614900204]
 ['ya' 1.395993818404662]
 ['rubella vaccine' 1.3781734829738936]
 ['vax unvax' 1.3473023969287428]
 ['immunization' 1.2843331708105155]
 ['mandate' 1.2801475545678898]
 ['wants' 1.2451895521859784]
 ['study' 1.2393535416546049]
 ['coronavirus' 1.2354685079119365]
 ['cholera' 1.2116472146123412]
 ['hill' 1.1827127152702928]
 ['somewhere' 1.182311733008946]
 ['vax mandates' 1.1821744647778343]
 ['white' 1.1732693784102965]
 ['la' 1.1659661027735668]
 ['okay' 1.147936158478676]
 ['naman' 1.1436861782559187]
 ['orders' 1.1410498174223445]
 ['doses' 1.135120671159082]
 ['vacc
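
The rows of `coef_` follow the order of `clf.classes_`, which scikit-learn sorts in ascending order, so with labels -1, 0, and 1 the rows correspond to anti, neutral, and pro respectively. A small sketch on random toy data shows how to look up a row by class label instead of by position. The data, the `row_for` mapping, and the use of the default solver (rather than the notebook's liblinear) are assumptions made to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Random toy data: 60 samples, 4 features, three classes labelled -1, 0, 1.
rng = np.random.RandomState(0)
X_toy = rng.rand(60, 4)
y_toy = np.repeat([-1, 0, 1], 20)

clf_toy = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# Map each class label to its row in coef_ (and its column in predict_proba).
row_for = {int(label): i for i, label in enumerate(clf_toy.classes_)}

coef_anti = clf_toy.coef_[row_for[-1]]
coef_neutral = clf_toy.coef_[row_for[0]]
coef_pro = clf_toy.coef_[row_for[1]]

print(clf_toy.classes_)
print(row_for)
print(clf_toy.coef_.shape)
```

Looking rows up through `classes_` avoids mislabelling a coefficient table when the label encoding changes.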

In [12]:
rawdata['predicted'] = Ypred
rawdata.to_csv(resultdir + 'pro_anti.csv', index=False)