
twittertennis

A utility Python package for the RG17 and UO17 Twitter tennis tournament data sets.

Introduction

This repository provides a Python package that eases interaction with two Twitter data sets related to tennis tournaments: RG17 (Roland-Garros 2017) and UO17 (US Open 2017). In our research, we used the underlying Twitter mention graphs to analyse the performance of multiple dynamic centrality measures and temporal node embedding methods. A major advantage of our data is that the nodes (Twitter accounts) of the network are temporally labeled, which allowed us to compare online graph algorithms in supervised evaluation tasks. The labels encode whether a given node in the Twitter mention network is related to a tennis player who played in the tournament on the given day. For more details on these data sets, see our paper: Temporal walk based centrality metric for graph streams.

How to install?

Install

pip install twittertennis

Tests

git clone https://github.com/ferencberes/twittertennis.git
cd twittertennis
python setup.py test

Examples

Quick start

In this short example, the RG17 (Roland-Garros 2017) data set is processed by the TennisDataHandler object. The data is automatically downloaded to the '../data/' folder on first execution. After the data preparation steps, mention links and daily node relevance labels are exported for further analysis.

  • Initialize data preprocessor
import twittertennis.handler as tt

handler = tt.TennisDataHandler("../data/", "rg17", include_qualifiers=True)
print(handler.summary())
  • Export mention links:
handler.export_edges(YOUR_OUTPUT_DIR)
  • Export daily node relevance labels:
handler.export_relevance_labels(YOUR_OUTPUT_DIR, binary=True)

Alternatively, change the last line of the code if you only want to export the relevant (positively labeled) nodes for each day:

handler.export_relevance_labels(YOUR_OUTPUT_DIR, binary=True, only_pos_label=True)

Preprocessed file content:

After data preprocessing you will find the following files in your specified folder:

  • edges.csv : edge stream of Twitter mentions. The timestamp in the first column is followed by the source and target node identifiers.
  • label_*.csv : list of relevant node identifiers for each day
  • summary.json : parameters set for TennisDataHandler during data preparation
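As a quick sanity check, the exported edge stream can be loaded with standard tooling. The snippet below is a minimal sketch, assuming edges.csv is a headerless comma-separated file with the column order described above (timestamp, source, target); the `load_edges` helper is illustrative and not part of the package.

```python
import csv

def load_edges(path):
    """Load an exported mention edge stream.

    Assumes a headerless CSV with columns: timestamp, source, target.
    Returns a list of (timestamp, source, target) tuples.
    """
    edges = []
    with open(path) as f:
        for row in csv.reader(f):
            timestamp, source, target = row[:3]
            edges.append((int(timestamp), source, target))
    return edges
```

The label_*.csv files can be read the same way; each line holds one relevant node identifier for the given day.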

See more examples in this notebook.

Related research

1. Temporal walk based centrality metric for graph streams: paper code

@article{Beres2018,
  author  = "B{\'e}res, Ferenc
             and P{\'a}lovics, R{\'o}bert
             and Ol{\'a}h, Anna
             and Bencz{\'u}r, Andr{\'a}s A.",
  title   = "Temporal walk based centrality metric for graph streams",
  journal = "Applied Network Science",
  year    = "2018",
  volume  = "3",
  number  = "32",
  pages   = "26",
  issn    = "2364-8228",
}

2. Node embeddings in dynamic graphs: paper code

@article{Beres2019,
  author  = "B{\'e}res, Ferenc
             and Kelen, Domokos M.
             and P{\'a}lovics, R{\'o}bert
             and Bencz{\'u}r, Andr{\'a}s A.",
  title   = "Node embeddings in dynamic graphs",
  journal = "Applied Network Science",
  year    = "2019",
  volume  = "4",
  number  = "64",
  pages   = "25",
}

3. PyTorch Geometric Temporal: Spatiotemporal Signal Processing with Neural Machine Learning Models: paper code

@article{RozemberczkiPGT2021,
  author    = {Benedek Rozemberczki and
               Paul Scherer and
               Yixuan He and
               George Panagopoulos and
               Maria Sinziana Astefanoaei and
               Oliver Kiss and
               Ferenc B{\'{e}}res and
               Nicolas Collignon and
               Rik Sarkar},
  title     = {PyTorch Geometric Temporal: Spatiotemporal Signal Processing with
               Neural Machine Learning Models},
  volume    = {abs/2104.07788},
  year      = {2021},
  url       = {https://arxiv.org/abs/2104.07788},
  archivePrefix = {arXiv},
  eprint    = {2104.07788},
}