Paracrawl-Paragraphs Dataset

Python Version: Python 3.6

Package Requirements: torch==1.4.0 tensorboardX numpy==1.19.0
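
As a quick sanity check, the pinned versions above can be compared against what is actually installed. The snippet below is a minimal sketch and not part of this repository: the helper names (`check_version`, `installed_version`, `report`) are hypothetical, and it assumes each package exposes a `__version__` attribute.

```python
import sys

# Pins copied from the requirements above; tensorboardX has no pinned version.
REQUIRED = {"torch": "1.4.0", "numpy": "1.19.0", "tensorboardX": None}


def check_version(installed, required):
    """Return True if `installed` satisfies the pin (None means any version)."""
    if installed is None:
        return False  # package is not installed
    if required is None:
        return True  # no pin, any installed version is accepted
    # Ignore local build suffixes such as "1.4.0+cu100".
    return installed.split("+")[0] == required


def installed_version(name):
    """Return the package's __version__, or None if it cannot be imported."""
    try:
        module = __import__(name)
    except Exception:
        return None
    return getattr(module, "__version__", "unknown")


def report():
    """Print one line per required package with its status."""
    print("python", sys.version.split()[0])
    for name, pin in REQUIRED.items():
        found = installed_version(name)
        status = "ok" if check_version(found, pin) else "MISMATCH"
        print(name, found, "required:", pin or "any", status)
```

Calling `report()` in the target environment prints the detected version of each package next to the required one.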

Framework: Our model and experiments are built upon G-Transformer.

Before running the scripts, install the fairseq dependencies from the repository root:

    pip install --editable .

Then follow the README in the raw_data folder to download the raw data.

Data Extraction

The final dataset used in the paper is provided in the raw_data folder.

To re-extract the data and run the fine-tuning experiments, follow the steps below:

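    # 1. Extract the data (start from the repository root)
    cd data_scripts
    bash extract_data.sh
    cd ..

    # 2. Clean the extracted data (install langid first)
    cd data_scripts
    pip install langid
    bash clean_data.sh
    cd ..

    # 3. Prepare the data for fine-tuning
    mkdir exp_finetune
    bash exp_gtrans/run-all.sh prepare-finetune exp_finetune

    # 4. Train on GPUs 0-3
    CUDA_VISIBLE_DEVICES=0,1,2,3 bash exp_gtrans/run-all.sh run-finetune train exp_finetune

    # 5. Test the fine-tuned model
    bash exp_gtrans/run-all.sh run-finetune test exp_finetune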
