Skip to content

chrisliu298/gpt2-arxiv

master
Switch branches/tags

Name already in use

A tag already exists with the provided branch name. Many Git commands accept both tag and branch names, so creating this branch may cause unexpected behavior. Are you sure you want to create this branch?
Code

Latest commit

 

Git stats

Files

Permalink
Failed to load latest commit information.
Type
Name
Latest commit message
Commit time
 
 
 
 
 
 
 
 
 
 
 
 

CSE142 Project6: Fine-Tuning GPT-2 to Generate Research Paper Abstracts

This GPT-2 model is capable of generating abstracts given paper titles.

ArXiv Dataset

This repo contains a part of the dataset from arxiv_archive. Specifically, it contains papers under aritficial intelligence (AI), machine learning (LG), computation and language (CL), and computer vision and pattern recognition (CV) categories.

I use the titles and abstract of these papers to fine-tune my GPT-2 model.

Splits Count Percentage (%) BPE Token Count
Train 90,000 90.11 20,834,012
Validation 4,940 4.95 1,195,056
Test 4,940 4.95 1,218,754
Total 99,880 100 23,247,822

Usage

The script reads the .tsv file, adds special symbols to separate the titile and the abstract, sorts them by dates, and writes a text file for each category and an extra file of shuffled instances. It splits all examples into training, validation, and test sets.

> python preprocess_arxiv.py

Getting Started

Prerequisites

> pip install -r requirements.txt

Fine-Tune

> python train.py

Generate

> python generate.py

Reference

@dataset{r_stuart_geiger_2019_2533436,
    author= {R. Stuart Geiger},
    title={{ArXiV Archive: A tidy and complete archive of metadata for papers on arxiv.org, 1993-2019}},
    month=jan,
    year= 2019,
    publisher={Zenodo},
    version= {v1.0.1},
    doi={10.5281/zenodo.2533436},
    url={https://doi.org/10.5281/zenodo.2533436}
}

Author(s)

  • Chris Liu

  • Yiyun Zheng

About

Fine-tuning GPT-2 to generate research paper abstracts

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages