This is to create the NYT dataset for summarization.

boya-song/nyt_extract

nyt_extract

This repo parses the NYT corpus and creates training data for summarization. Use unzip.py to extract all files in the NYT corpus. Use XMLparser.py to parse the extracted files and pull out abstract and full-text pairs. Use make_datafiles.py to tokenize the pairs and split them into train (90%), val (5%), and test (5%). (Credit to https://github.com/abisee/pointer-generator)
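The parse and split steps can be sketched roughly as below. This is an illustrative sketch only, not the code in XMLparser.py or make_datafiles.py. It assumes each corpus file is NITF-style XML with the summary under an `abstract` element and the article under `block class="full_text"`. The function names `extract_pair` and `split_pairs`, the sample document, and the fixed shuffle seed are all hypothetical.

```python
import random
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element layout is an assumption
# about the corpus format, not taken from this repo.
SAMPLE_XML = """<nitf><body>
  <body.head><abstract><p>Short summary.</p></abstract></body.head>
  <body.content><block class="full_text">
    <p>First paragraph.</p><p>Second paragraph.</p>
  </block></body.content>
</body></nitf>"""


def _paragraph_text(node):
    """Join the text of all <p> elements under node with single spaces."""
    return " ".join("".join(p.itertext()).strip() for p in node.iter("p"))


def extract_pair(xml_text):
    """Return (abstract, full_text), or None if either part is missing."""
    root = ET.fromstring(xml_text)
    abstract = root.find(".//abstract")
    full_text = root.find(".//block[@class='full_text']")
    if abstract is None or full_text is None:
        return None
    return _paragraph_text(abstract), _paragraph_text(full_text)


def split_pairs(pairs, train_pct=90, val_pct=5, seed=0):
    """Shuffle pairs and split them into train/val/test lists.

    Percentages are integers to avoid float rounding; the test set
    takes whatever remains after train and val.
    """
    items = list(pairs)
    random.Random(seed).shuffle(items)  # seeded so the split is repeatable
    n_train = len(items) * train_pct // 100
    n_val = len(items) * val_pct // 100
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    print(extract_pair(SAMPLE_XML))
    demo = [("abstract %d" % i, "article %d" % i) for i in range(100)]
    train, val, test = split_pairs(demo)
    print(len(train), len(val), len(test))
```

Documents without an abstract return None here so they can be filtered out before splitting; whether the actual scripts skip such files is not stated in this README.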
