ClusterDataSplit

This repository holds the companion code for the benchmarking study reported in the paper:

ClusterDataSplit: Exploring Challenging Clustering-Based Data Splits for Model Performance Evaluation by Hanna Wecker, Annemarie Friedrich and Heike Adel. In Proceedings of Evaluation and Comparison of NLP Systems (Eval4NLP).

The paper can be found here. The code allows users to reproduce and extend the results reported in the study, and to generate challenging data splits for their own datasets. Please cite the paper above when reporting, reproducing, or extending the results.

In case of questions, please contact the authors as listed on the paper.

Purpose of the project

This software is a research prototype, solely developed for and published as part of the publication cited above. It will neither be maintained nor monitored in any way.

Installing and Using ClusterDataSplit

ClusterDataSplit consists of a suite of three Jupyter notebooks for

(1) Data Analysis;
(2) Generation of Data Splits using a variety of algorithms including the Size and Distribution Sensitive K-Means algorithm; and
(3) Model Performance Comparison.
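To illustrate the general idea behind a clustering-based data split, here is a minimal sketch in plain Python. It assumes that each example has already been assigned a cluster label by some clustering algorithm, and it holds out one whole cluster at a time as the test set. The function names cluster_folds and leave_one_cluster_out are hypothetical and do not come from the notebooks; the notebooks' actual splitting logic may differ.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def cluster_folds(cluster_ids: Sequence[int]) -> Dict[int, List[int]]:
    """Group example indices by their (precomputed) cluster label."""
    folds: Dict[int, List[int]] = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        folds[cluster].append(index)
    return dict(folds)


def leave_one_cluster_out(
    cluster_ids: Sequence[int],
) -> Iterator[Tuple[int, List[int], List[int]]]:
    """Yield (held_out_cluster, train_indices, test_indices) for each cluster.

    Hypothetical helper: every cluster is used once as the test set, and all
    remaining clusters together form the training set.
    """
    folds = cluster_folds(cluster_ids)
    for held_out in sorted(folds):
        test = folds[held_out]
        train = sorted(
            index
            for cluster, members in folds.items()
            if cluster != held_out
            for index in members
        )
        yield held_out, train, test


# Toy cluster assignment for six examples (made-up labels, not real data).
example_cluster_ids = [0, 1, 0, 2, 1, 2]
example_splits = list(leave_one_cluster_out(example_cluster_ids))
```

Because train and test examples never share a cluster, such a split tends to be harder than a random split, which is the kind of setting the notebooks are meant to explore.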

To use ClusterDataSplit, install Jupyter Notebook and create the Conda environment described by clusterdatasplit.yml (for example with conda env create -f clusterdatasplit.yml), then activate it. Next, navigate to the ClusterDataSplit top-level folder and start the Jupyter notebook server by typing jupyter notebook. Your browser should open, and you can then open and run the notebooks.

Note: ClusterDataSplit only works with sklearn (scikit-learn) <= 0.23.x.
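The version constraint above can be checked before opening the notebooks. The following is a small sketch, not part of the repository: the helper names major_minor and is_supported are hypothetical, and the check only compares the major and minor version numbers against 0.23.

```python
import re
from typing import Tuple

# Upper bound taken from the note above: scikit-learn <= 0.23.x.
MAX_SUPPORTED: Tuple[int, int] = (0, 23)


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from a version string such as '0.23.2'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def is_supported(version: str) -> bool:
    """Return True if the version is at most 0.23.x (hypothetical helper)."""
    return major_minor(version) <= MAX_SUPPORTED


if __name__ == "__main__":
    try:
        import sklearn  # only needed for the live check below
    except ImportError:
        print("scikit-learn is not installed in this environment")
    else:
        status = "supported" if is_supported(sklearn.__version__) else "too new"
        print(f"scikit-learn {sklearn.__version__}: {status}")
```

Running the script inside the activated Conda environment reports whether the installed scikit-learn falls within the supported range.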

License

ClusterDataSplit is open-sourced under the AGPL-3.0 license. See the LICENSE file for details.

For a list of other open source components included in ClusterDataSplit, see the file 3rd-party-licenses.txt. Our implementation of the Same-Size-K-Means algorithm follows the structure of the one in the ELKI Data Mining software, but we have re-implemented the algorithm in Python.

The sample data provided in the data folder are released under the CC-BY-4.0 license as documented in the data/LICENSE file. In particular, the patent data (titles and abstracts) are sampled from the bulk data collection provided by PatentsView.
