Skip to content

Dataset & code used in "Modelling and Predicting Online Vaccination Views Using Bow-tie Decomposition"

Notifications You must be signed in to change notification settings

YuetingH/BT_Vaccination_Views

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

BT_Vaccination_Views DOI

This repo documents the dataset and the programming code used in our paper "Modelling and Predicting Online Vaccination Views using Bow-tie Decomposition", Royal Society Open Science, authored by Yueting Han, Marya Bazzi, and Paolo Turrini (2024).

Dataset

In this paper, we investigate a temporal dataset that describes the Facebook vaccination views campaign in network representations, involving nearly 100 million users on Facebook from across countries, continents and languages.

It was provided by Johnson et al. (2020) and Illari et al. (2022) spanning different time periods. The former contains two snapshots in February 2019 and October 2019 (before the COVID-19), which we study in our main paper. The latter contains another two snapshots in November 2019 and December 2020 (at the initial stage of the COVID-19), which we study in Supplementary Material. Both of them are openly available and documented in two papers separately of different formats (PDF & Excel), thus requiring intensive preprocessing.

To make it easier for other researchers to use this dataset, here we reorganise both versions of this dataset in gpickle format (can be directly imported as attributed networks using Python) and in CSV format (for general use of the dataset).

A. Description

Each node represents a public Facebook page that discusses vaccination topics. It is attributed with fan size, that is, the number of members who subscribe to the Facebook page, along with the other attribute polarity including anti-vaccination (red), pro-vaccination (blue) and neutral (green). Whereas its polarity remains the same for February and October snapshots, its fan size changes. A directed edge from node A to B means page A recommends B to all its members at the page level. During preprocessing, we define an edge weight to quantify the significance of each recommendation. It is obtained by the product of both ends fan size.

Some basic info about four network snapshots:

  • G1 (Feb 19): 1326 nodes, 5163 edges
  • G2 (Oct 19): 1326 nodes, 7484 edges
  • G3 (Nov 19): 1356 nodes, 7387 edges
  • G4 (Dec 20): 1356 nodes, 7154 edges -- fan count unavailable

More details are available in our paper.

B. Usage via Python

Dataset import using Python is required to install package NetworkX. Details about using this dataset are available in our Jupyter notebooks. For convenience, we briefly mention some instructions here:

import networkx as nx

G1 = nx.read_gpickle("Main Dataset/Data/Graphs/G1.gpickle")           # Feb 19 network
G2 = nx.read_gpickle("Main Dataset/Data/Graphs/G2.gpickle")           # Oct 19 network
G3 = nx.read_gpickle("Follow-up Dataset (SI)/Data/Graphs/G1.gpickle") # Nov 19 network
G4 = nx.read_gpickle("Follow-up Dataset (SI)/Data/Graphs/G2.gpickle") # Dec 20 network

Programming Resources

A. Method Implementation

B. Network Visualization

See my homepage for details.

C. Analysis Techniques


Repo Structure

BT_Vaccination_Views
¦   README.md   
¦
+--- Main Dataset   
¦   ¦   1_Data_Preprocessing.ipynb                  # Data preprocessing and reorganizing
¦   ¦   2_Basic_Observations.ipynb                  # Figure 2 in the paper
¦   ¦   2_Results_BT_Decomposition.ipynb            # Figure 3 
¦   ¦   4_Results_ML.ipynb                          # Table 1 & 2
¦   ¦   5_Results_SIR.ipynb                         # Figure 4 & 5
¦   ¦   6_Results_SIR_SI.ipynb                      # SI (SIR robustness check)
¦   ¦
¦   +--- Data           
¦   ¦   ¦   Vaccination_data.xlsx                   # EXCEL version (converted from PDF in Johnson et al.)
¦   ¦   ¦   
¦   ¦   +--- Graphs
¦   ¦        ¦  G1.gpickle                          # Python readable graph (Feb 19)
¦   ¦        ¦  G1_bt.gpickle                       # Python readable graph with bt results
¦   ¦        ¦  G1_nodes.csv                        # CSV for general use of the dataset
¦   ¦        ¦  G1_edges.csv                        # CSV for general use of the dataset
¦   ¦        ¦  G2.gpickle                          # Python readable graph (Oct 19)
¦   ¦        ¦  G2_bt.gpickle
¦   ¦        ¦  G2_nodes.csv
¦   ¦        ¦  G2_edges.csv
¦   ¦
¦   +--- Figures
¦   ¦   ¦   ...
¦   ¦  
¦   +--- Modules                                    
¦   ¦   ¦   ...
¦   ¦
¦   +--- Results                                    # Store some time-consuming results (e.g., SFFS, SIR)
¦       ¦   ...
¦   
+--- Follow-up Dataset (SI)                        
¦   ¦   1_Data_Preprocessing.ipynb                 
¦   ¦   2_Results_BT_Decomposition.ipynb          
¦   ¦
¦   +--- Data           
¦   ¦   ¦   Edges at Nov 2019 for Fig.2(c).xlsx     # Original dataset from Illari et al (Nov 19)
¦   ¦   ¦   Edges at Dec 2020 for Fig.2(d).xlsx     # Original dataset from Illari et al (Dec 20)
¦   ¦   ¦   Nodes.xlsx
¦   ¦   ¦   
¦   ¦   +--- Graphs                                 # All formats remain consistant with the main dataset folder
¦   ¦        ¦  G1.gpickle                          
¦   ¦        ¦  G1_bt.gpickle
¦   ¦        ¦  G1_nodes.csv
¦   ¦        ¦  G1_edges.csv
¦   ¦        ¦  G2.gpickle
¦   ¦        ¦  G2_bt.gpickle
¦   ¦        ¦  G2_nodes.csv
¦   ¦        ¦  G2_edges.csv 
¦   ¦
¦   +--- Figures
¦   ¦   ¦   ...
¦   ¦  
¦   +--- Modules
¦   ¦   ¦   ...