This repository contains code for our paper "Abstractive Summarization of DBpedia Abstracts Using Language Models." We propose an approach using pre-trained language models, specifically BART and T5, to generate short and comprehensive summaries for DBpedia abstracts in six languages (English, German, French, Italian, Spanish, and Dutch).
The pipeline of DBpedia summarization using language models
- bert-score 0.3.12
- ipykernel 6.17.1
- ipython 8.6.0
- nltk 3.7
- notebook 6.5.2
- pandas 1.5.1
- spacy 3.4.3
- torch 1.13.0
- transformers 4.25.1
├── data
│   └── info.md
├── data_crowd
│ ├── de_crowd.csv
│ ├── en_crowd.csv
│ ├── es_crowd.csv
│ ├── fr_crowd.csv
│ ├── it_crowd.csv
│   └── nl_crowd.csv
├── data_eval
│ ├── de_100_summaries.csv
│ ├── en_100_summaries.csv
│ ├── es_100_summaries.csv
│ ├── fr_100_summaries.csv
│ ├── it_100_summaries.csv
│   └── nl_100_summaries.csv
├── full_abstracts
│ └── info.md
├── short_abstracts
│ └── info.md
├── baselines.ipynb
├── data_creation.ipynb
├── dbepdia-summarization.png
├── DBpepdia-abstractive-summarization.md
├── LICENSE
├── README.md
├── requirements.txt
├── summarization-cuda1.py
└── summarization-split.py
To install the dependencies, run:
pip install -r requirements.txt
- Download the DBpedia abstract file (in `.ttl` format) for the desired language from this source and place it in the `full_abstracts` folder.
- Download the DBpedia short abstract file (in `.ttl` format) for the desired language from this source and unzip it in the `short_abstracts` folder.
- Run the `data_creation` notebook. The final dataframes should be located in the `data` folder.
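The abstract dumps are N-Triples-style `.ttl` files. As a minimal sketch of what the extraction step involves, the snippet below pulls `(resource, abstract, language)` tuples out of such lines, assuming the standard DBpedia triple layout; the actual `data_creation` notebook may process the files differently:

```python
import re

# Matches DBpedia abstract triples of the form:
# <subject-uri> <predicate-uri> "abstract text"@lang .
TRIPLE_RE = re.compile(r'<([^>]+)>\s+<[^>]+>\s+"(.*)"@(\w+)\s+\.')

def parse_abstracts(lines):
    """Yield (resource_uri, abstract, language) tuples from .ttl lines."""
    for line in lines:
        match = TRIPLE_RE.match(line.strip())
        if match:
            yield match.groups()

# Illustrative input line in the DBpedia abstracts format.
sample = [
    '<http://dbpedia.org/resource/Berlin> '
    '<http://dbpedia.org/ontology/abstract> '
    '"Berlin is the capital of Germany."@en .',
]
rows = list(parse_abstracts(sample))
```

The parsed tuples can then be collected into a pandas dataframe and written to the `.csv` files expected in the `data` folder.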
The complete results of the crowdsourcing evaluation are presented in the `data_crowd` folder. The data are split by language, and each file contains a 'choice' column indicating which summary the crowdworker chose.
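As a quick illustration, the per-model preference counts can be tallied from the 'choice' column. The sample rows below, and every column other than 'choice', are hypothetical and not the repository's actual schema:

```python
import csv
import io
from collections import Counter

# Hypothetical sample mimicking one language's crowd file; only the
# 'choice' column is documented here, the other column is invented.
sample_csv = """resource,choice
Berlin,bart-cnn
Paris,t5
Rome,bart-cnn
"""

def choice_counts(csv_text):
    """Count how often each model's summary was chosen by crowdworkers."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["choice"] for row in reader)

counts = choice_counts(sample_csv)
```

For the real files, the same function can be applied to the contents of each `*_crowd.csv` to compare model preferences across languages.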
- The data for generating summaries is located in the `data` folder as `.csv` files.
- The `baselines.ipynb` notebook contains the code for running the pretrained models (T5, BART, and BART-CNN).
- The original abstracts, short abstracts, and generated summaries used for the crowdsourcing evaluation are stored in the `data_eval` folder.
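The inference step can be approximated with the Hugging Face `summarization` pipeline. The checkpoint names below are common public checkpoints and are assumptions; the notebook may use different or fine-tuned variants:

```python
from transformers import pipeline

# Assumed checkpoints for the three baselines; verify against the notebook.
CHECKPOINTS = {
    "t5": "t5-base",
    "bart": "facebook/bart-large",
    "bart-cnn": "facebook/bart-large-cnn",
}

def summarize(text, checkpoint, max_length=60):
    """Generate an abstractive summary with a pretrained seq2seq model."""
    summarizer = pipeline("summarization", model=checkpoint)
    result = summarizer(text, max_length=max_length, min_length=10,
                        truncation=True)
    return result[0]["summary_text"]

if __name__ == "__main__":
    abstract = "Berlin is the capital and largest city of Germany."
    print(summarize(abstract, CHECKPOINTS["bart-cnn"]))
```

Note that loading a checkpoint downloads the model weights on first use, so running all three baselines over the full datasets is best done on a GPU machine (as the `summarization-cuda1.py` script's name suggests).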
More details about downloading and processing the entire DBpedia abstracts are described here.
TBD
TBD