Sources for Sentence Paraphraser project

Overview

These sources are intended to allow reproducing the experiments described in the project report.

Setting up

  1. Clone the repository:
    git clone https://github.com/delkind/paraphraser.git
    cd paraphraser
  2. Run the setup script:
    ./setup.sh
  3. Note that before running any of the scripts in the instructions below, you must activate the virtual environment (a complete setup session is shown after this list):
    source ./.env/bin/activate
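
For reference, a complete first-time setup session (using only the commands listed above) looks like this:
    git clone https://github.com/delkind/paraphraser.git
    cd paraphraser
    ./setup.sh
    source ./.env/bin/activate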

Pre-trained universal embeddings (InferSent) experiment

Creating paraphrases of the Bible BBE corpus in the style of the YLT corpus

  1. Download the pre-trained TCNN- and LSTM-based decoders and the pre-built universal embeddings for the Bible dataset by running
    ./dl_uni_emb_files.sh
  2. To calculate the BLEU score for both models on n random samples, run (an example session is shown after this list)
    ./uni_emb_calc_bleu.sh --samples <n>
  3. To emit the original sentences (GOLD) file, run
    ./uni_emb_create_gold.sh
  4. To emit the LSTM model predictions file, run
    ./uni_emb_lstm_pred.sh
  5. To emit the TCNN model predictions file, run
    ./uni_emb_tcnn_pred.sh
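
As an example, the following session downloads the pre-trained artifacts and produces all of the evaluation outputs; the sample count of 1000 is purely illustrative, not a value prescribed by the project:
    source ./.env/bin/activate
    ./dl_uni_emb_files.sh
    ./uni_emb_calc_bleu.sh --samples 1000
    ./uni_emb_create_gold.sh
    ./uni_emb_lstm_pred.sh
    ./uni_emb_tcnn_pred.sh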

Re-building experiment models and embeddings

The instructions above assume the use of pre-trained models and pre-built embeddings to produce the predictions and evaluate the experiment results. Below are instructions for re-building and re-training the models and embeddings instead of using the pre-built ones.

Reproducing sentence embedding creation

  1. Set up the InferSent data files by running
    ./setup_infersent.sh
  2. Install PyTorch by following the official PyTorch installation instructions for your platform. We haven't provided a script since the installation differs substantially depending on the platform.
  3. Create embeddings from the YLT and BBE Bible corpora by running
    ./create_uni_emb.sh
  4. Verify that the exp/uni_embed/embeddings.h5 file has been created (a quick check is sketched after this list)
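
A quick way to check the result is sketched below; it assumes the h5py package is available in the virtual environment (e.g. via pip install h5py) and makes no assumption about the internal layout of the file:
    ls -lh exp/uni_embed/embeddings.h5
    python -c "import h5py; print(list(h5py.File('exp/uni_embed/embeddings.h5', 'r').keys()))"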

Reproducing model training

We have experimented with decoders based on the LSTM and Temporal CNN (TCNN) architectures. To train the LSTM-based decoder, run
    ./uni_emb_train_lstm.sh
To train the TCNN-based decoder, run
    ./uni_emb_train_tcnn.sh
The number of training epochs can be set for either script with the --epochs <n> parameter, where n is the number of epochs (see the example below); the default is to train for 10 epochs. The model is saved (and subsequently overwritten) after each epoch.
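
For example, to train the LSTM-based decoder for 20 epochs (the epoch count here is purely illustrative):
    source ./.env/bin/activate
    ./uni_emb_train_lstm.sh --epochs 20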

About

Final project for the NLP Course
