Abstractive and extractive summarization for Hungarian

Links to the HunSum-2 dataset and our baseline models:

Links to the HunSum-1 dataset and our baseline models:

Setup

conda create --name my-env python=3.8.13
conda activate my-env

conda install -c conda-forge pandoc
pip install -e .

Install the LSH package used for deduplication

git clone https://github.com/mattilyra/LSH
cd LSH
git checkout fix/filter_duplicates
pip install -e .
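
For context on what locality-sensitive hashing contributes to deduplication, here is a minimal, self-contained sketch of MinHash signatures over character n-grams. It does not use the API of the LSH package installed above. The function names (`shingles`, `minhash_signature`, `estimated_jaccard`), the parameters (`num_hashes`, `n`), and the example sentences are all hypothetical and chosen only for illustration.

```python
import hashlib


def shingles(text, n=5):
    """Return the set of character n-grams of the lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    if len(normalized) <= n:
        return {normalized}
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def _hash(gram, seed):
    """64-bit hash of an n-gram, salted with the seed (illustrative choice of hash)."""
    salt = seed.to_bytes(4, "big")
    digest = hashlib.blake2b(gram.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(text, num_hashes=64, n=5):
    """Compute a MinHash signature: the minimum hash value per seed."""
    grams = shingles(text, n)
    return [min(_hash(g, seed) for g in grams) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Hypothetical example texts: two near-duplicates and one unrelated sentence.
text_a = "A kormány új intézkedéseket jelentett be a héten."
text_b = "A kormány új intézkedéseket jelentett be a héten!"
text_c = "Az időjárás hétvégén napos és meleg lesz."

sim_ab = estimated_jaccard(minhash_signature(text_a), minhash_signature(text_b))
sim_ac = estimated_jaccard(minhash_signature(text_a), minhash_signature(text_c))
print(f"near-duplicate pair: {sim_ab:.2f}, unrelated pair: {sim_ac:.2f}")
```

Near-duplicate texts share most of their n-grams, so their signatures agree in most positions, while unrelated texts agree in few or none. A real deduplication run would additionally bucket signatures into bands so that candidate pairs can be found without comparing every pair of documents.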

Usage

Download data from Common Crawl

Install CommonCrawl Downloader

git clone git@github.com:DavidNemeskey/cc_corpus.git
cd cc_corpus
pip install -e .

Download data

Arguments:

Text file listing the URLs to download (indexes_to_download.txt)
Path to the cc_corpus checkout
Output directory

scripts/download_data.sh indexes_to_download.txt ../cc_corpus/ ../CommonCrawl/

Parse articles

Arguments:

Directory containing the downloaded data
Output directory
Config file

The cleaned articles are written to the directory set by clean_out_dir in the config file.

cd summarization
python entrypoints/parse_warc_pages.py ../../CommonCrawl ../../articles preprocess.yaml

Calculate document embeddings of leads and articles (used for cleaning)

Arguments:

config file

cd summarization
python entrypoints/calc_doc_similarities.py preprocess.yaml
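
To illustrate how a lead can be compared to its article once both have embeddings, here is a minimal sketch of cosine similarity in plain Python. It makes no claim about what `calc_doc_similarities.py` actually computes: the embedding model is not specified in this README, and the vectors and the `cosine_similarity` helper below are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical 3-dimensional embeddings; real document embeddings are much larger.
lead_embedding = [0.2, 0.7, 0.1]
article_embedding = [0.25, 0.65, 0.05]
unrelated_embedding = [0.9, -0.3, 0.0]

matching = cosine_similarity(lead_embedding, article_embedding)
mismatched = cosine_similarity(lead_embedding, unrelated_embedding)
print(f"lead vs. own article: {matching:.3f}, lead vs. unrelated text: {mismatched:.3f}")
```

A lead whose embedding sits far from its article's embedding is a plausible candidate for removal during cleaning, though the exact criterion used by this repository is defined in its code and config, not here.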

Clean articles

Arguments:

config file

cd summarization
python entrypoints/clean.py preprocess.yaml

Deduplicate articles

Arguments:

config file

cd summarization
python entrypoints/deduplicate.py preprocess.yaml
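
The four preprocessing steps above can also be chained from one small driver script. The sketch below assumes the steps are meant to run in the order listed in this README and that all of them are started from the summarization directory. The helper names (`build_pipeline_commands`, `run_pipeline`) are hypothetical and not part of this repository; only the entrypoint paths and arguments are taken from the commands above.

```python
import subprocess
import sys


def build_pipeline_commands(config="preprocess.yaml",
                            warc_dir="../../CommonCrawl",
                            articles_dir="../../articles"):
    """Return the four preprocessing commands in the order given in the README."""
    python = sys.executable or "python"
    return [
        [python, "entrypoints/parse_warc_pages.py", warc_dir, articles_dir, config],
        [python, "entrypoints/calc_doc_similarities.py", config],
        [python, "entrypoints/clean.py", config],
        [python, "entrypoints/deduplicate.py", config],
    ]


def run_pipeline(workdir="summarization", **kwargs):
    """Run each step from workdir, stopping at the first step that fails."""
    for command in build_pipeline_commands(**kwargs):
        subprocess.run(command, cwd=workdir, check=True)


# Dry run: print the commands without executing them.
for command in build_pipeline_commands():
    print(" ".join(command))
```

Calling `run_pipeline()` from the repository root would execute the steps one after another. Because `check=True` raises on a non-zero exit code, a failed parse step prevents the later steps from running on incomplete data.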

Citation

If you use our dataset or models, please cite the following paper:

@inproceedings {HunSum-1,
    title = {{HunSum-1: an Abstractive Summarization Dataset for Hungarian}},
    booktitle = {XIX. Magyar Számítógépes Nyelvészeti Konferencia (MSZNY 2023)},
    year = {2023},
    publisher = {Szegedi Tudományegyetem, Informatikai Intézet},
    address = {Szeged, Magyarország},
    author = {Barta, Botond and Lakatos, Dorina and Nagy, Attila and Nyist, Mil{\'{a}}n Konor and {\'{A}}cs, Judit},
    pages = {231--243}
}
