Addressing Shortcomings in Fair Graph Learning Datasets: Towards a New Benchmark

This is the official repository for the paper "Addressing Shortcomings in Fair Graph Learning Datasets: Towards a New Benchmark".

Installation

To recreate the original environment used for the paper, create the conda environment from environment.yml (a clearer version of the installation script will come soon):

conda env create -f environment.yml

Otherwise, to set up a new environment, install the versions we used: Python 3.7.9, dgl-cu102==0.4.3, and torch==1.6.0.
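As a quick sanity check for a freshly built environment, the sketch below compares installed versions against the ones listed above. The helper names `installed_versions` and `mismatches` are hypothetical and not part of this repository; the sketch assumes the dgl-cu102 package is imported as `dgl`.

```python
import sys

# Versions stated in this README.
EXPECTED = {"python": "3.7.9", "torch": "1.6.0", "dgl": "0.4.3"}


def installed_versions():
    """Collect the interpreter version and the versions of torch and dgl, if importable."""
    found = {"python": ".".join(map(str, sys.version_info[:3]))}
    for name in ("torch", "dgl"):
        try:
            module = __import__(name)
            found[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            found[name] = None  # package is not installed
    return found


def mismatches(found, expected=EXPECTED):
    """Return {name: (found, expected)} for every entry that differs."""
    out = {}
    for name, want in expected.items():
        got = found.get(name)
        # torch may report a local build tag such as "1.6.0+cu102";
        # compare only the part before the "+".
        base = got.split("+")[0] if isinstance(got, str) else got
        if base != want:
            out[name] = (got, want)
    return out
```

Calling `mismatches(installed_versions())` returns an empty dict when the environment matches the versions above.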

Datasets

In this paper, we develop and introduce a collection of synthetic, semi-synthetic, and real-world datasets. You can find these datasets in the dataset folder.

Synthetic dataset

Following the analysis framework in the paper, you can adjust the bias level of the synthetic data by setting the parameters in config_synthetic.yaml. You can also save or load existing synthetic datasets with the code in load_data.py.
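To give a feel for what a tunable bias level can mean in a synthetic graph, here is a minimal sketch. It is not the generator used in this repository: the function names and the parameters `p_same` and `p_diff` are hypothetical and do not correspond to keys in the configuration file. It assumes a simple model in which an edge is more likely when both endpoints share the sensitive attribute.

```python
import random


def make_biased_graph(n=200, p_same=0.10, p_diff=0.02, seed=0):
    """Sample binary sensitive attributes, then edges that favor same-group pairs."""
    rng = random.Random(seed)
    sens = [rng.randint(0, 1) for _ in range(n)]
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            # Hypothetical bias knob: the gap between p_same and p_diff.
            p = p_same if sens[u] == sens[v] else p_diff
            if rng.random() < p:
                edges.append((u, v))
    return sens, edges


def same_group_edge_ratio(sens, edges):
    """Fraction of edges whose endpoints share the sensitive attribute."""
    if not edges:
        return 0.0
    return sum(sens[u] == sens[v] for u, v in edges) / len(edges)
```

Setting `p_same` equal to `p_diff` gives a ratio near 0.5 when the groups are balanced; widening the gap pushes the ratio toward 1, which is one simple way to picture a higher structural bias.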

Semi-synthetic dataset

Using the functions add_edges and remove_edges in utils.py, we obtain three new semi-synthetic datasets: germanA, creditA, and bailA. Following the same analysis framework, you can modify other datasets to reach a desired bias level.
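The signatures of add_edges and remove_edges are not documented here, so the following is only a sketch of the general idea of editing edges to shift a graph's bias level. The functions `remove_edges_sketch` and `add_edges_sketch` are hypothetical stand-ins, and the selection rule (drop inter-group edges, add intra-group edges) is an assumption rather than the procedure used in utils.py.

```python
import random


def remove_edges_sketch(edges, sens, fraction, seed=0):
    """Drop a random `fraction` of the edges that connect different sensitive groups."""
    rng = random.Random(seed)
    inter = [e for e in edges if sens[e[0]] != sens[e[1]]]
    drop = set(rng.sample(inter, int(fraction * len(inter))))
    return [e for e in edges if e not in drop]


def add_edges_sketch(edges, sens, num_new, seed=0):
    """Add up to `num_new` new edges between nodes that share the sensitive attribute."""
    rng = random.Random(seed)
    existing = {tuple(sorted(e)) for e in edges}
    n = len(sens)
    candidates = [
        (u, v)
        for u in range(n)
        for v in range(u + 1, n)
        if sens[u] == sens[v] and (u, v) not in existing
    ]
    new = rng.sample(candidates, min(num_new, len(candidates)))
    return list(edges) + new
```

Both helpers return a new edge list and leave the input untouched, so different bias levels can be derived from the same base graph.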

Real-world dataset

Both of our real-world datasets are built from Twitter social data.

Because of their size, they are not stored in this repository; download them from Google Drive.

We provide some statistics of our datasets in the table below:

Dataset	Syn-1	Syn-2	New German	New Bail	New Credit	Sport	Occupation
# of nodes	5,000	5,000	1,000	18,876	30,000	3,508	6,951
# of edges	34,363	44,949	20,242	315,870	1,121,858	136,427	44,166
# of features	48	48	27	18	13	768	768
Sensitive attribute	0/1	0/1	Gender (Male/Female)	Race (Black/White)	Age (<25/>25)	Race (White/Black)	Gender (Male/Female)
Label	0/1	0/1	Good/bad Credit	Bail/no bail	Payment default/no default	NBA/MLB	Psy/CS
Average degree	13.75	17.98	41.48	34.47	75.79	78.78	13.71

More details on our datasets can be found in the paper.
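As a worked example of how the table's rows relate, the sketch below computes 2|E|/|N| from the node and edge counts. It assumes the graphs are undirected and that each edge is counted once; the helper `average_degree` is illustrative and not part of the repository.

```python
# (number of nodes, number of edges, average degree reported in the table)
STATS = {
    "Syn-1": (5000, 34363, 13.75),
    "Syn-2": (5000, 44949, 17.98),
    "New German": (1000, 20242, 41.48),
    "New Bail": (18876, 315870, 34.47),
    "New Credit": (30000, 1121858, 75.79),
    "Sport": (3508, 136427, 78.78),
    "Occupation": (6951, 44166, 13.71),
}


def average_degree(num_nodes, num_edges):
    """Mean degree of an undirected graph: every edge touches two nodes."""
    return 2 * num_edges / num_nodes


for name, (n, m, reported) in STATS.items():
    computed = average_degree(n, m)
    print(f"{name:12s} 2|E|/|N| = {computed:6.2f}   table = {reported:6.2f}")
```

For Syn-1 and Syn-2 the formula reproduces the reported values. For the other five datasets it gives a value 1.00 lower than the table, which would be consistent with counting one self-loop per node; that reading is an assumption, and the paper remains the authority on how the statistic was computed.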

Running the experiments

The main scripts for reproducing the experiments are in the script folder. For example, to train GCN on all datasets, run:

bash ./script/gcn.sh

You can also change the parameter search space or modify the commands to run training in multiple threads.
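One way to sweep a search space in parallel is sketched below in Python. This is an illustration rather than the repository's launcher: `build_commands` and `run_all` are hypothetical helpers, and the flag names you pass in must match whatever the training script actually accepts (see get_args.py); none are assumed here.

```python
import itertools
import subprocess
from concurrent.futures import ThreadPoolExecutor


def build_commands(script, grid):
    """Expand {flag: [values]} into one command line per combination."""
    flags = sorted(grid)
    commands = []
    for values in itertools.product(*(grid[f] for f in flags)):
        cmd = ["python", script]
        for flag, value in zip(flags, values):
            cmd += [f"--{flag}", str(value)]
        commands.append(cmd)
    return commands


def run_all(commands, workers=2):
    """Run the commands concurrently and return their exit codes in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda c: subprocess.run(c).returncode, commands))
```

Each worker thread only waits on a child process, so the training runs themselves execute in separate processes. Keep `workers` small when the runs share a single GPU.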

Citation

Please cite our paper if you find our datasets or code helpful.

@misc{qian2024addressing,
      title={Addressing Shortcomings in Fair Graph Learning Datasets: Towards a New Benchmark}, 
      author={Xiaowei Qian and Zhimeng Guo and Jialiang Li and Haitao Mao and Bingheng Li and Suhang Wang and Yao Ma},
      year={2024},
      eprint={2403.06017},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

XweiQ/Benchmark-GraphFairness