# Sentence Paraphrase using HuggingFace.co

In this project, I will fine-tune the Facebook BART-large model to paraphrase sentences. The data has already been downloaded and can be accessed in the `Data/` folder. The pre-trained BART model is provided by huggingface.co and accessed through the `simpletransformers` library in Python.

This project is fairly simple and the BART network will be used as part of a larger NLP project to generate new sentences by paraphrasing. Even though the downstream task is more complex, this script illustrates the paraphraser training and results.

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from simpletransformers.seq2seq import Seq2SeqModel, Seq2SeqArgs
already_trained = True
input_folder = '../Data/'
output_folder = '../Data/'

In [2]:
train_file = input_folder + 'train.tsv'
eval_file = input_folder + 'dev.tsv'
train_df = pd.read_csv(train_file, sep="\t").astype(str)
eval_df = pd.read_csv(eval_file, sep="\t").astype(str)
train_df = train_df.loc[train_df['label'] == '1']
eval_df = eval_df.loc[eval_df['label'] == '1']
train_df = train_df.rename(columns={"sentence1": "input_text", "sentence2": "target_text"})
eval_df = eval_df.rename(columns={"sentence1": "input_text", "sentence2": "target_text"})
train_df = train_df[['input_text', 'target_text']]
eval_df = eval_df[['input_text', 'target_text']]

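The cell above loads the data from the train and dev TSV files, keeps only the rows whose `label` is 1 (the pairs used here as paraphrases), and renames the sentence columns so that each dataframe has two columns, `input_text` and `target_text`.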
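To make the filter-and-rename step concrete, here is a minimal, dependency-free sketch of the same idea on a tiny in-memory TSV. The sample rows, the `id` column, and the helper name `load_paraphrase_pairs` are made up for illustration; the notebook itself does this with pandas.

```python
import csv
import io

# Tiny in-memory stand-in for train.tsv; the rows are invented for illustration.
SAMPLE_TSV = (
    "id\tsentence1\tsentence2\tlabel\n"
    "1\tA cat sat on the mat .\tOn the mat sat a cat .\t1\n"
    "2\tA cat sat on the mat .\tA dog sat on the mat .\t0\n"
)

def load_paraphrase_pairs(handle):
    """Return input/target dicts for the rows whose label is '1'."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        {"input_text": row["sentence1"], "target_text": row["sentence2"]}
        for row in reader
        if row["label"] == "1"
    ]

pairs = load_paraphrase_pairs(io.StringIO(SAMPLE_TSV))
print(pairs)
```

Only the first sample row survives the filter, because the second one is labeled 0.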

In [3]:
train_df.head()

| | input_text | target_text |
|---|---|---|
| 1 | The NBA season of 1975 -- 76 was the 30th seas... | The 1975 -- 76 season of the National Basketba... |
| 3 | When comparable rates of flow can be maintaine... | The results are high when comparable flow rate... |
| 4 | It is the seat of Zerendi District in Akmola R... | It is the seat of the district of Zerendi in A... |
| 5 | William Henry Henry Harman was born on 17 Febr... | William Henry Harman was born in Waynesboro , ... |
| 7 | With a discrete amount of probabilities Formul... | Given a discrete set of probabilities formula ... |


In [4]:
train_df.shape

(21829, 2)

In [5]:
eval_df.shape

(3539, 2)

The training set consists of about 22k sentence pairs, and the evaluation (dev) set of about 3.5k pairs. Below is an example pair. In this case, and in most pairs in our data, the paraphrase is a simple reordering of phrases: it generally does not introduce synonyms or other words that are absent from the input text. The text is simply rearranged while keeping the meaning the same.

While this kind of paraphrasing is very simple, it is exactly what is needed for my downstream task. If we wanted more complex paraphrasing, for example one that uses synonyms, a different and larger dataset would be required to train the model adequately.

In [6]:
train_df.iloc[0]['input_text']

'The NBA season of 1975 -- 76 was the 30th season of the National Basketball Association .'

In [7]:
train_df.iloc[0]['target_text']

'The 1975 -- 76 season of the National Basketball Association was the 30th season of the NBA .'
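One rough way to check the "reordering only" observation on a pair like this one is to measure how many distinct target tokens never occur in the input. The sketch below is only an illustration: the helper name `new_word_ratio` and the lowercase, whitespace-based tokenization are my own assumptions, not part of the training pipeline.

```python
def new_word_ratio(input_text, target_text):
    """Fraction of distinct target tokens that never occur in the input."""
    input_tokens = set(input_text.lower().split())
    target_tokens = set(target_text.lower().split())
    if not target_tokens:
        return 0.0
    return len(target_tokens - input_tokens) / len(target_tokens)

source = ("The NBA season of 1975 -- 76 was the 30th season of the "
          "National Basketball Association .")
target = ("The 1975 -- 76 season of the National Basketball Association "
          "was the 30th season of the NBA .")

# 0.0 here: every target token already appears in the input.
print(new_word_ratio(source, target))
```

A ratio near zero across the dataset would support the claim that the pairs mostly reorder existing words; a higher ratio would point to synonyms or added words.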

Here are the model arguments that `simpletransformers` needs for BART. The model is trained for 50 epochs on a GPU. BART-large is a very large model, so even with only about 22k sentence pairs, training takes roughly a whole day.

In [8]:
model_args = Seq2SeqArgs()
model_args.output_dir = output_folder
model_args.num_train_epochs = 50
model_args.no_save = False
model_args.use_multiprocessing = False
model_args.evaluate_generated_text = True
model_args.evaluate_during_training = True
model_args.evaluate_during_training_verbose = True
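To get a feel for why training takes so long, it helps to count the batches processed over the whole run. The sketch below is back-of-the-envelope arithmetic only: the 21829 pairs and 50 epochs come from this notebook, while the batch size of 8 is an assumption, since the arguments above do not set it explicitly.

```python
import math

def total_training_batches(num_examples, batch_size, num_epochs):
    """Number of batches processed over the whole training run."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    return batches_per_epoch * num_epochs

# batch_size=8 is an assumed value, not one taken from the arguments above.
print(total_training_batches(21829, batch_size=8, num_epochs=50))  # 136450
```

Under that assumption the run is well over a hundred thousand forward and backward passes through a large model, which is consistent with a day of GPU time.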

Since I've already trained this BART model, for this script I will not run it again and instead will load it in from my checkpoint.

In [9]:
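if already_trained:
    # Load the fine-tuned weights from the saved checkpoint (CPU is enough here).
    # The variable is named `model` in both branches so later cells work either way.
    model = Seq2SeqModel(
        encoder_decoder_type="bart",
        encoder_decoder_name=output_folder + 'checkpoint_epoch_50/',
        args=model_args,
        use_cuda=False,
    )
else:
    # Start from the pre-trained BART-large weights and fine-tune on the GPU.
    model = Seq2SeqModel(
        encoder_decoder_type="bart",
        encoder_decoder_name="facebook/bart-large",
        args=model_args,
        use_cuda=True,
    )
    model.train_model(train_df, eval_data=eval_df)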

Likewise, the predictions have already been generated, so I will load them from the pickle file.

In [10]:
to_predict = eval_df['input_text'].tolist()
truth = eval_df["target_text"].tolist()

if already_trained:
    preds_df = pd.read_pickle(output_folder + 'predictions.pkl')
    preds = preds_df['predictions'].tolist()
else:
    preds = model.predict(to_predict)
    preds_df = pd.DataFrame(preds, columns=['predictions'])
    preds_df.to_pickle(output_folder + 'predictions.pkl')
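The `already_trained` flag has to be flipped by hand. An alternative is to check whether the cached file exists and compute the result only when it does not. The sketch below shows that pattern with the standard library; `load_or_compute` and `fake_predict` are hypothetical names, and `fake_predict` merely stands in for a call such as `model.predict(to_predict)`.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute_fn):
    """Load a pickled result if it exists, otherwise compute and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    result = compute_fn()
    with open(cache_path, "wb") as handle:
        pickle.dump(result, handle)
    return result

# Stand-in for the expensive prediction step; it records how often it runs.
calls = []

def fake_predict():
    calls.append(1)
    return ["a paraphrase"]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "predictions.pkl")
    first = load_or_compute(path, fake_predict)   # computes and writes the cache
    second = load_or_compute(path, fake_predict)  # reads the cache instead

print(first == second, len(calls))
```

The second call returns the same predictions without running the expensive step again.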

Now let's take a look at the results. In example #1, the prediction is a reordered paraphrase of the input text. It is not exactly the same as the truth, but it is still grammatically correct, makes semantic sense, and is not an exact copy of the input.

In [11]:
print('EXAMPLE 1')
print('INPUT', to_predict[0])
print('TRUTH', truth[0])
print('PREDS', preds[0])

EXAMPLE 1
INPUT They were there to enjoy us and they were there to pray for us .
TRUTH They were there for us to enjoy and they were there for us to pray .
PREDS They were there to enjoy us and they were there for us to pray.


In example #2, the model did not perform as well: the prediction is nearly identical to the input sentence. The only differences are two dropped punctuation marks (`,` and `.`). This suggests that cleaning the sentences before training, for instance by removing punctuation, could improve performance.

In [12]:
print('EXAMPLE 2')
print('INPUT', to_predict[3])
print('TRUTH', truth[3])
print('PREDS', preds[3])

EXAMPLE 2
INPUT The group toured extensively and became famous in Israel , and even played in New York City in 2007 .
TRUTH The group toured extensively and was famous in Israel and even played in New York City in 2007 .
PREDS The group toured extensively and became famous in Israel and even played in New York City in 2007
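To see how often the model simply echoes its input, one option is to compare input and prediction after ignoring case and punctuation. This is a minimal sketch under my own assumptions: `normalize` and `is_near_copy` are hypothetical helpers, and "near copy" is defined here as an identical word sequence once punctuation is stripped.

```python
import re

def normalize(text):
    """Lowercase, drop punctuation, and split on whitespace."""
    return re.sub(r"[^\w\s]", "", text.lower()).split()

def is_near_copy(input_text, prediction):
    """True when the two texts match once case and punctuation are ignored."""
    return normalize(input_text) == normalize(prediction)

example_1 = (
    "They were there to enjoy us and they were there to pray for us .",
    "They were there to enjoy us and they were there for us to pray.",
)
example_2 = (
    "The group toured extensively and became famous in Israel , "
    "and even played in New York City in 2007 .",
    "The group toured extensively and became famous in Israel "
    "and even played in New York City in 2007",
)

print(is_near_copy(*example_1))  # False: the words were reordered
print(is_near_copy(*example_2))  # True: only punctuation differs
```

Applied to every `(to_predict[i], preds[i])` pair, the share of `True` values would give a rough copy rate for the evaluation set.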


In [13]:
print('EXAMPLE 3')
print('INPUT', to_predict[50])
print('TRUTH', truth[50])
print('PREDS', preds[50])

EXAMPLE 3
INPUT From the west end of the bridge , Pennsylvania Route 268 leads south to Parker and north to Emlenton .
TRUTH The Pennsylvania Route 268 leads from the west end of the bridge south to Parker and to the north to Emlenton .
PREDS Pennsylvania Route 268 leads from the west end of the bridge south to Parker and north to
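In example #3 the prediction stops mid-sentence ("... and north to"). One possible cause, which I have not verified here, is a cap on the generated length (I believe `Seq2SeqArgs` exposes this as `max_length`, but treat that name as an assumption). A simple heuristic for flagging such outputs is to check whether a prediction ends with sentence-final punctuation; `looks_truncated` below is a hypothetical helper, not part of `simpletransformers`.

```python
def looks_truncated(prediction):
    """Heuristic: flag predictions that lack sentence-final punctuation."""
    return not prediction.rstrip().endswith((".", "!", "?"))

pred_1 = "They were there to enjoy us and they were there for us to pray."
pred_3 = ("Pennsylvania Route 268 leads from the west end of the bridge "
          "south to Parker and north to")

print(looks_truncated(pred_1))  # False
print(looks_truncated(pred_3))  # True
```

The heuristic also flags complete sentences that merely lost their final period, so flagged predictions still need a manual look.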


Overall, the model performs fairly well. Some of the paraphrased sentences are not drastically different from the input texts (example #2), and some are incomplete or ungrammatical (example #3, where the prediction stops mid-sentence). Still, for a large model fine-tuned for only 50 epochs on roughly 22k sentence pairs, the results are impressive. This small project demonstrates the potential of pre-trained transformers for paraphrasing and related NLP tasks.