These sources are intended to allow reproducing the experiments described in the project report.
- Clone the repository:
git clone https://github.com/delkind/paraphraser.git
cd paraphraser
- Run the setup script:
./setup.sh
- Please note that before executing any of the scripts below, the following command should be invoked
to activate the virtual environment:
source ./.env/bin/activate
- Download the pre-trained TCNN- and LSTM-based decoders and the pre-built universal embeddings for the Bible dataset by running
./dl_uni_emb_files.sh
- To calculate the BLEU score for both models on n random samples, please run
./uni_emb_calc_bleu.sh --samples <n>
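For example, to evaluate both models on 100 random samples (the sample count here is arbitrary):
./uni_emb_calc_bleu.sh --samples 100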
- To emit the original sentences (GOLD) file, please run
./uni_emb_create_gold.sh
- To emit the LSTM model predictions file, please run
./uni_emb_lstm_pred.sh
- To emit the TCNN model predictions file, please run
./uni_emb_tcnn_pred.sh
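For example, to produce the GOLD file and both prediction files in one session (after activating the virtual environment as noted above):
source ./.env/bin/activate
./uni_emb_create_gold.sh
./uni_emb_lstm_pred.sh
./uni_emb_tcnn_pred.sh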
The instructions above assume the use of pre-trained models and pre-built embeddings to produce the predictions and evaluate the experiment results. Below we provide instructions for re-building and re-training the models and embeddings instead of using the pre-built ones.
- Set up the InferSent data files by running
./setup_infersent.sh
- Install PyTorch by following the instructions on the official PyTorch site (https://pytorch.org/). We haven't provided a script since the installation differs substantially depending on the platform.
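As a rough sketch, a CPU-only installation typically looks like the line below; consult the PyTorch site for the exact command matching your OS, Python version, and CUDA setup:
pip install torch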
- Create the embeddings from the YLT and BBE Bible corpora by running
./create_uni_emb.sh
- Verify that the file
exp/uni_embed/embeddings.h5
has been created.
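To take a quick look at the contents of the generated file, a one-liner like the following can be used (this assumes the h5py package is available in the environment; it is not part of the documented setup):
python -c "import h5py; f = h5py.File('exp/uni_embed/embeddings.h5', 'r'); f.visit(print)"  # assumes h5py is installed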
We have experimented with decoders based on the LSTM and Temporal CNN (TCNN) architectures. To train the LSTM-based decoder,
run
./uni_emb_train_lstm.sh
To train the TCNN-based decoder, run
./uni_emb_train_tcnn.sh
Both scripts accept an --epochs <n> parameter, where n is the number of training epochs. The default is to train for 10 epochs. The model is saved (and subsequently overwritten)
after each epoch.
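For example, to train each decoder for 20 epochs instead of the default 10:
./uni_emb_train_lstm.sh --epochs 20
./uni_emb_train_tcnn.sh --epochs 20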