Links to the HunSum-2 dataset and our baseline models:
Links to the HunSum-1 dataset and our baseline models:
conda create --name my-env python=3.8.13
conda activate my-env
conda install -c conda-forge pandoc
pip install -e .
git clone https://github.com/mattilyra/LSH
cd LSH
git checkout fix/filter_duplicates
pip install -e .
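
The pipeline uses this library for MinHash/LSH-based near-duplicate detection, and the fix/filter_duplicates branch carries a fix the deduplication step relies on. As a rough illustration of how the library is driven (a minimal sketch assuming the MinHasher/Cache API from the library's README; the parameter values are arbitrary, not the ones used for HunSum):

from lsh.cache import Cache
from lsh.minhash import MinHasher

# MinHash over character 5-grams with 100 hash permutations.
hasher = MinHasher(seeds=100, char_ngram=5, hashbytes=4, random_state=42)
# 10 bands of 10 rows each (num_bands must divide seeds).
cache = Cache(hasher, num_bands=10)

docs = {
    0: 'Example article text ...',       # placeholder documents
    1: 'Example article text ....',      # near-duplicate of doc 0
    2: 'A completely different text.',
}
for doc_id, text in docs.items():
    cache.add_doc(text, doc_id)

# Candidate near-duplicate pairs: documents sharing at least one LSH band.
print(cache.get_all_duplicates())

Next, install cc_corpus: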
git clone git@github.com:DavidNemeskey/cc_corpus.git
cd cc_corpus
pip install -e .
Arguments:
- text file containing the URLs to download: indexes_to_download.txt (see the example below)
- path of the cc_corpus
- output directory
scripts/download_data.sh indexes_to_download.txt ../cc_corpus/ ../CommonCrawl/
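
indexes_to_download.txt simply lists the Common Crawl index files to fetch, one URL per line. A hypothetical example (these particular crawl and file names are illustrative, not the ones used for HunSum; use whichever crawls you target):

https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2019-04/indexes/cdx-00000.gz
https://data.commoncrawl.org/cc-index/collections/CC-MAIN-2019-04/indexes/cdx-00001.gz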
Arguments:
- downloaded data
- output directory
- config file
cd summarization
python entrypoints/parse_warc_pages.py ../../CommonCrawl ../../articles preprocess.yaml
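
parse_warc_pages.py extracts the articles from the downloaded WARC archives. Its internals are not shown here; as a sketch of what iterating over a WARC file looks like (using the warcio library, which is an assumption on our part, the script may read the archives differently; the file name is a placeholder):

from warcio.archiveiterator import ArchiveIterator

with open('../../CommonCrawl/example.warc.gz', 'rb') as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == 'response':  # skip request/metadata records
            url = record.rec_headers.get_header('WARC-Target-URI')
            html = record.content_stream().read()  # raw response body
            print(url, len(html))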
Arguments:
- config file
cd summarization
python entrypoints/calc_doc_similarities.py preprocess.yaml
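
The script is configured entirely through the config file. Conceptually, document similarity in this kind of pipeline is Jaccard similarity over character n-grams, which MinHash approximates; a toy illustration of the exact measure (not the script's actual code):

def char_ngrams(text, n=5):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b, n=5):
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    return len(x & y) / len(x | y)  # |intersection| / |union|

print(jaccard('Example article text', 'Example articel text'))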
Arguments:
- config file
cd summarization
python entrypoints/clean.py preprocess.yaml
The cleaned articles will be in config.clean_out_dir.
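
All of the entrypoints above read their settings from preprocess.yaml. The actual schema is not reproduced in this README; a hypothetical sketch of the shape such a config might take (clean_out_dir is the only key referenced above, every other key name here is an assumption):

# Hypothetical preprocess.yaml -- illustrative key names only.
src_dir: ../../articles       # hypothetical: parsed-article input directory
clean_out_dir: ../../cleaned  # where clean.py writes the cleaned articles
num_processes: 4              # hypothetical: parallelism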
Arguments:
- config file
cd summarization
python entrypoints/deduplicate.py preprocess.yaml
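
Given candidate duplicate pairs (as in the LSH sketch above), deduplication amounts to grouping the pairs into clusters and keeping one document per cluster. A toy illustration of that last step using union-find (illustrative only, not the script's actual logic):

from collections import defaultdict

pairs = [(0, 1), (1, 5), (3, 7)]  # hypothetical duplicate pairs

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

for a, b in pairs:
    parent[find(a)] = find(b)  # union the two clusters

clusters = defaultdict(set)
for doc in parent:
    clusters[find(doc)].add(doc)

# Keep the lowest doc id in each cluster, drop the rest.
drop = {d for cluster in clusters.values() for d in sorted(cluster)[1:]}
print(drop)  # {1, 5, 7}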
If you use our dataset or models, please cite the following paper:
@inproceedings{HunSum-1,
title = {{HunSum-1: an Abstractive Summarization Dataset for Hungarian}},
booktitle = {XIX. Magyar Számítógépes Nyelvészeti Konferencia (MSZNY 2023)},
year = {2023},
publisher = {Szegedi Tudományegyetem, Informatikai Intézet},
address = {Szeged, Magyarország},
author = {Barta, Botond and Lakatos, Dorina and Nagy, Attila and Nyist, Mil{\'{a}}n Konor and {\'{A}}cs, Judit},
pages = {231--243}
}