# Vaccine skepticism detection by network embedding

In this work, we develop techniques to efficiently distinguish pro-vaxxer from vax-skeptic Twitter content related to COVID19 vaccines. After multiple data preprocessing steps, we evaluate the classification of tweet content and of the user interaction network by combining text classifiers with several node embedding and community detection models.

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score

In [None]:
import sys, os
sys.path.insert(0,"../python")
from vaxxer.models import VaxxerClassifier
from vaxxer.utils import show_dynamic_auc

In [None]:
import joblib
from plotly.offline import init_notebook_mode
import plotly.express as px
import plotly.graph_objects as go
# Make plotly work with Jupyter notebook
init_notebook_mode()

# 1. Download train-test data

We provide a bash script (`download_data.sh`) to download our Twitter data set related to COVID19 vaccine skepticism.

To comply with Twitter's data publication policy, we cannot share the raw data. Instead, we publish our data in two separate packages to support reproducibility and encourage future work:

- **[Twitter data identifiers]():** contains only the tweet ID and user ID of each collected tweet. We also publish the underlying reply graph that we used to fit the node embedding and community detection methods.

- **[Tweet representations](http://info.ilab.sztaki.hu/~fberes/covid_vaccine_data/covid_vaxxer_representations_2021-09-24.zip):** In this package, we publish the data that we used for model training and evaluation. For tweet classification, we used the following three modalities with logistic regression:

   * **1. text:** 1,000-dimensional TF-IDF vector of the tweet text;
   * **2. history:** Four basic statistics calculated from past tweet labels of the same user;
   * **3. embedding:** 128-dimensional user representation in the reply network.

In this notebook, we **only address tweet representations** that we load with `VaxxerClassifier` to analyze the global and dynamic performance for each modality in the classification task.
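
As an illustration of how the three modalities could feed a single classifier, the sketch below stacks the selected feature blocks column-wise. This is an assumption about the combination step, not a description of what `VaxxerClassifier.load` does internally; `combine_modalities` and the random stand-in arrays are hypothetical.

```python
import numpy as np

def combine_modalities(text, history, embedding,
                       use_text=True, use_history=False, use_network=False):
    """Horizontally stack the selected per-tweet feature blocks (hypothetical helper)."""
    blocks = []
    if use_text:
        blocks.append(text)
    if use_history:
        blocks.append(history)
    if use_network:
        blocks.append(embedding)
    if not blocks:
        raise ValueError("select at least one modality")
    return np.hstack(blocks)

rng = np.random.default_rng(0)
n_tweets = 5
text = rng.random((n_tweets, 1000))      # stand-in for TF-IDF vectors
history = rng.random((n_tweets, 4))      # stand-in for the four history statistics
embedding = rng.random((n_tweets, 128))  # stand-in for reply-network embeddings

X = combine_modalities(text, history, embedding,
                       use_text=True, use_history=True, use_network=True)
print(X.shape)  # (5, 1132)
```

With all three flags enabled, each tweet is described by 1000 + 4 + 128 = 1132 features.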

In [None]:
%%bash
cd ..
if [[ -d data ]]
then
    rm -r data
fi
bash scripts/download_data.sh

# 2. Global model performance

First, we load different combinations of modalities (e.g., text only, or text plus network embedding) to use in our tweet classifier.

In [None]:
model_dir = "../data/covid_vaxxer_representations/"

In [None]:
classifier = VaxxerClassifier("tfidf", "Vax-skeptic", drop_irrelevant=True)

### Different modality settings

In [None]:
configs = {
    "text":(True, False, False),
    "text+history":(True, True, False),
    "text+embedding":(True, False, True),
    "text+embedding+history":(True, True, True)
}

### Experimental setting
We split the tweet data chronologically into 70% training (2551 tweets) and 30% testing (1094 tweets). We then compute the AUC over the whole test set.
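
A minimal sketch of such a chronological split, assuming the tweets sit in a DataFrame with a `date` column. `temporal_split` and the synthetic frame are hypothetical and only mirror the 70/30 proportions stated above; the published package already ships the split data.

```python
import pandas as pd

def temporal_split(df, time_col="date", train_ratio=0.7):
    """Sort by time and cut once, so every training tweet precedes every test tweet."""
    ordered = df.sort_values(time_col, kind="stable").reset_index(drop=True)
    n_train = int(len(ordered) * train_ratio)
    return ordered.iloc[:n_train], ordered.iloc[n_train:]

# Synthetic stand-in with the same number of tweets as the labeled data set
tweets = pd.DataFrame({
    "date": pd.date_range("2021-04-01", periods=3645, freq="min"),
    "label": [i % 2 for i in range(3645)],
})

# Shuffle first to show that the split depends on the timestamps, not the row order
train_df, test_df = temporal_split(tweets.sample(frac=1.0, random_state=0))
print(len(train_df), len(test_df))  # 2551 1094
```

Splitting by time rather than at random keeps future tweets out of the training set, which matters when user history is one of the features.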

### Results

- Not surprisingly, **user historical statistics** contribute strongly, as users usually stick to their past opinions.
   - 4.27\% improvement compared to the text-only model (AUC: 0.8385 -> 0.8743)
- **User representations from the Twitter reply network** improve performance even more.
   - 7.92\% improvement compared to the text-only model (AUC: 0.8385 -> 0.9024)

In [None]:
predictions = []
for key, config in configs.items():
    text, history, network = config
    X_tr, X_te = classifier.load(model_dir, use_text=text, use_history=history, use_network=network)
    clf, tr_pred, te_pred = classifier.fit_vaxxer_classifier(X_tr, X_te, {"model":"newton-cg"})
    te_pred["experiment_id"] = key
    predictions.append(te_pred)
    print(key, "AUC:", roc_auc_score(te_pred["label"], te_pred["proba"]))
    print()

In [None]:
len(predictions)

# 3. Dynamic model performance

In [None]:
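# Daily vax-skeptic rate: mean label per date (te_pred comes from the last config in the loop above)
badrate = te_pred.groupby("date")["label"].mean()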

Next, we show the changes in model performance over time as well as the vax-skeptic rate in the labeled data. 

By default, AUC is calculated over a 7-day sliding window.
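
The idea of a sliding-window AUC can be sketched as follows. This is not the implementation of `show_dynamic_auc`; `pairwise_auc` and `sliding_window_auc` are hypothetical helpers, timestamps are assumed to be in seconds, and a small pairwise AUC stands in for `roc_auc_score` to keep the sketch NumPy-only.

```python
import numpy as np

def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels == 1], scores[labels == 0]
    if len(pos) == 0 or len(neg) == 0:
        return float("nan")  # AUC is undefined when a class is missing
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def sliding_window_auc(timestamps, labels, scores, window, step):
    """Return (window_start, auc) pairs for the windows [start, start + window)."""
    timestamps = np.asarray(timestamps)
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    results = []
    for start in np.arange(timestamps.min(), timestamps.max() + 1, step):
        mask = (timestamps >= start) & (timestamps < start + window)
        results.append((int(start), pairwise_auc(labels[mask], scores[mask])))
    return results

# Synthetic stand-in: two weeks of hourly predictions with a weak signal
rng = np.random.default_rng(0)
timestamps = np.arange(0, 14 * 86400, 3600)
labels = np.arange(len(timestamps)) % 2
scores = 0.3 * labels + rng.random(len(timestamps))

time_window = 7 * 86400  # 7 days in seconds
for start, auc in sliding_window_auc(timestamps, labels, scores, time_window, step=86400):
    print(start // 86400, round(auc, 3))
```

In this sketch the windows that start in the last week cover fewer than seven days of data, so their AUC values are noisier.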

In [None]:
time_window = 7*86400  # 7 days, expressed in seconds
fig = show_dynamic_auc(configs, predictions, badrate, time_window)
fig.show()

# 4. Node embedding visualization

Finally, we visualize the pro-vaxxer and vax-skeptic user clusters captured by the best-performing node embedding model, [Walklets](https://arxiv.org/abs/1605.02115).

For our experiments, we used the [karateclub](https://github.com/benedekrozemberczki/karateclub) open-source Python package.

In [None]:
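def show_embeddings(pred_df, X, show_hist=False):
    """Build a per-user frame for plotting the selected embedding columns.

    Relies on the global `vax_skeptic_columns`, which must be defined before calling.
    """
    # Average the tweet labels per user, then round to one binary label per user
    mean_user_labels = pred_df.groupby("usr_id_str")["label"].mean()
    if show_hist:
        mean_user_labels.hist()
    label_map = dict(mean_user_labels)
    pred_tmp_df = pred_df.copy()
    pred_tmp_df["label"] = pred_tmp_df["usr_id_str"].apply(lambda x: round(label_map[x]))
    # Keep only the selected embedding columns, aligned with the tweet index
    selected = pd.DataFrame(X[:, vax_skeptic_columns], index=pred_tmp_df.index)
    visu_df = pd.concat([selected, pred_tmp_df[["usr_id_str", "label"]]], axis=1)
    # Keep one row per user
    visu_df = visu_df.drop_duplicates(subset="usr_id_str")
    print(visu_df.shape)
    return visu_df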
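
The per-user labeling step in `show_embeddings` can be illustrated on a toy frame. The data below is made up; only the `usr_id_str` and `label` column names come from the notebook.

```python
import pandas as pd

# Toy tweet-level labels: user "a" has 2 of 3 tweets labeled 1, "b" is split evenly, "c" has only 0
tweets = pd.DataFrame({
    "usr_id_str": ["a", "a", "a", "b", "b", "c"],
    "label":      [1,   1,   0,   0,   1,   0],
})

mean_user_labels = tweets.groupby("usr_id_str")["label"].mean()
user_labels = mean_user_labels.apply(lambda m: round(m))
print(user_labels.to_dict())  # {'a': 1, 'b': 0, 'c': 0}
```

Python's `round` sends a mean of exactly 0.5 to 0, so a user with an even split of tweet labels ends up with label 0.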

#### Load only node representations from the underlying Twitter reply network (128-dimensional)

In [None]:
X_tr_ne, X_te_ne = classifier.load(model_dir, use_text=False, use_history=False, use_network=True)
clf_ne, tr_pred_ne, te_pred_ne = classifier.fit_vaxxer_classifier(X_tr_ne, X_te_ne, {"model":"newton-cg"})

## Vax-skeptic users in the embedded space

#### Extract the most relevant coefficients of the LogisticRegression classifier that we fitted for this task 

In [None]:
vax_skeptic_coeffs = clf_ne.coef_
vax_skeptic_coeffs.shape

In [None]:
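# Rank embedding dimensions by their logistic regression coefficient (ascending)
sorted_args = np.argsort(np.max(vax_skeptic_coeffs, axis=0))
# Keep the most negative dimension and the two most positive dimensions
vax_skeptic_columns = [sorted_args[0],sorted_args[-1],sorted_args[-2]]
print(vax_skeptic_columns)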
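
To make the selection rule concrete, here is the same `argsort` logic on a made-up coefficient row. The values are illustrative only; the shape `(1, n_features)` matches what a binary `LogisticRegression` exposes as `coef_`.

```python
import numpy as np

# Toy coefficient row with five features
coeffs = np.array([[0.2, -1.5, 0.9, 0.05, 2.1]])

order = np.argsort(np.max(coeffs, axis=0))  # feature indices, ascending by coefficient
selected = [int(order[0]), int(order[-1]), int(order[-2])]
print(selected)  # [1, 4, 2]
```

Index 1 holds the most negative coefficient, while indices 4 and 2 hold the two largest. Reading the plotting cells of this notebook, `x=1, y=2` address the second and third selected columns, that is, the two dimensions with the largest positive coefficients.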

### a.) Kernel density estimation on the test set

In [None]:
te_visu_df = show_embeddings(te_pred_ne, X_te_ne)
g = sns.jointplot(
    data=te_visu_df,
    x=1, y=2, hue="label",
    kind="kde",
    legend=False
)
g.ax_joint.set_xlabel("")
g.ax_joint.set_ylabel("")
plt.legend(title='Vaccine view', loc='upper left', labels=['Skeptic', 'Pro'])

### b.) Scatterplot for short time intervals

In the visualization, each point represents a user who was active in the selected time interval.

In [None]:
meta_df = te_pred_ne.reset_index(drop=True)
intervals = [
    #("2021-04-27","2021-05-03"),
    ("2021-05-05","2021-05-13"),
    #("2021-05-16","2021-05-22"),
    #("2021-05-29","2021-06-09"),
    #("2021-06-16","2021-06-22"),
    #("2021-07-17","2021-07-29")
]
for from_date, to_date in intervals:
    interval_df = meta_df[(meta_df["date"]>=from_date) & (meta_df["date"]<=to_date)]
    interval_X = X_te_ne[interval_df.index,:]
    print(len(interval_df), interval_X.shape)
    interval_visu_df = show_embeddings(interval_df, interval_X)
    g=sns.jointplot(
        data=interval_visu_df,
        x=1, y=2, hue="label",
        legend=False,
    )
    g.ax_marg_x.set_xlim(-3, 2)
    g.ax_marg_y.set_ylim(-4, 3)
    g.ax_joint.set_xlabel("")
    g.ax_joint.set_ylabel("")
    plt.legend(title='Vaccine view', loc='upper left', labels=['Skeptic', 'Pro'])