# Paraphrase Detection
In this notebook we use the DistilBERT model from the transformers library to classify whether a given text is a paraphrase of another text. We use TensorFlow 2.0 for this task.

Ensure that we are using TensorFlow 2.x.

In [0]:
%tensorflow_version 2.x

TensorFlow 2.x selected.


Install the transformers library.

In [0]:
!pip install transformers

Import required libraries.

In [0]:
import tensorflow as tf
import tensorflow_datasets as tfds
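# Import only the names this notebook uses instead of a wildcard import.
from transformers import (DistilBertTokenizer,
                          TFDistilBertForSequenceClassification,
                          glue_convert_examples_to_features)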

Provide the name of the model you want to use. We will be using distilbert-base-uncased.

In [0]:
MODEL_NAME = 'distilbert-base-uncased'

Create a tokenizer for DistilBERT as an instance of the DistilBertTokenizer class, loaded from the pretrained model we want to use.

In [0]:
tokenizer = DistilBertTokenizer.from_pretrained(MODEL_NAME)

Now, let's instantiate the model. We create a pretrained model from TFDistilBertForSequenceClassification.

In [0]:
classifier = TFDistilBertForSequenceClassification.from_pretrained(MODEL_NAME)

Now it's time to create the dataset. Let's download the GLUE (MRPC) dataset from the TensorFlow Datasets library.

In [0]:
dataset = tfds.load('glue/mrpc')

Let's separate the train and test data (the test data here is the validation split) and convert them into the type that the model accepts, i.e. features.

In [0]:
train = glue_convert_examples_to_features(dataset['train'], tokenizer, max_length=128, task='mrpc')
test = glue_convert_examples_to_features(dataset['validation'], tokenizer, max_length=128, task='mrpc')
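
To make the idea of "features" more concrete, here is a minimal sketch of what encoding a sentence pair to a fixed length involves. It is a toy illustration, not the library's implementation: a whitespace split stands in for the real WordPiece tokenizer, the function name `encode_pair` is hypothetical, and truncation is done naively from the end.

```python
# Toy illustration only: encode_pair is a hypothetical helper, and a
# whitespace split stands in for DistilBERT's WordPiece tokenizer.
def encode_pair(text_a, text_b, max_length=16):
    # A sentence pair is joined as: [CLS] text_a [SEP] text_b [SEP]
    tokens = (["[CLS]"] + text_a.lower().split() + ["[SEP]"]
              + text_b.lower().split() + ["[SEP]"])
    tokens = tokens[:max_length]          # naive truncation of long pairs
    attention_mask = [1] * len(tokens)    # 1 marks a real token
    padding = max_length - len(tokens)
    tokens += ["[PAD]"] * padding         # pad up to the fixed length
    attention_mask += [0] * padding       # 0 marks padding
    return tokens, attention_mask

tokens, mask = encode_pair("My name is Rachin.", "I am Rachin.", max_length=12)
print(tokens)
print(mask)
```

The real conversion additionally maps each token to an integer id from the model's vocabulary and attaches the label, but the fixed length and the attention mask follow the same idea.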

In [0]:
train = train.shuffle(100).batch(32).repeat(5)
test = test.batch(64)

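Let's compile the model with the desired optimizer, loss function and metrics. The classifier returns raw logits rather than probabilities, so the loss is created with from_logits=True.

In [0]:
classifier.compile(
    # Assumption: 3e-5 is a typical small learning rate for fine-tuning; tune as needed.
    optimizer=tf.keras.optimizers.Adam(learning_rate=3e-5),
    # The model outputs logits, so the loss must apply the softmax itself.
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['sparse_categorical_accuracy'])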

Finally, let's train the model.

In [0]:
history = classifier.fit(train, epochs=4, steps_per_epoch=115,
                    validation_data=test)

Train for 115 steps
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4
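
The value `steps_per_epoch=115` can be sanity-checked with a little arithmetic. The sketch below assumes the MRPC training split has 3,668 examples (a figure not stated in this notebook) and reuses the batch size, repeat count and epoch count from the cells above.

```python
import math

# Assumption: the MRPC training split has 3,668 examples.
NUM_TRAIN_EXAMPLES = 3668
BATCH_SIZE = 32   # from train.batch(32)
REPEATS = 5       # from .repeat(5)
EPOCHS = 4        # from fit(..., epochs=4)

# One pass over the data yields this many batches (the last one is partial).
steps_per_epoch = math.ceil(NUM_TRAIN_EXAMPLES / BATCH_SIZE)

# The repeated dataset must supply at least as many batches as fit() consumes.
batches_available = steps_per_epoch * REPEATS
batches_needed = steps_per_epoch * EPOCHS

print(steps_per_epoch)                      # 115
print(batches_available >= batches_needed)  # True
```

Because the dataset is repeated 5 times but only 4 epochs are run, the training loop does not run out of data.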


Evaluate our trained model.

In [0]:
classifier.evaluate(test)

      7/Unknown - 1s 152ms/step - loss: 0.6931 - sparse_categorical_accuracy: 0.6838

[0.6931471824645996, 0.6838235]
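
To help interpret these two numbers, the sketch below computes two reference points: the cross-entropy of a two-class model that assigns probability 0.5 to each class, and the accuracy of always predicting the majority class. The class counts are an assumption (279 paraphrase pairs out of 408 validation pairs) and should be verified against your copy of the dataset.

```python
import math

def cross_entropy(prob_of_true_class):
    """Cross-entropy loss for one example, given the probability of its true label."""
    return -math.log(prob_of_true_class)

# Loss of a model that is maximally unsure between two classes.
chance_loss = cross_entropy(0.5)
print(round(chance_loss, 4))  # 0.6931

def majority_baseline(num_majority, num_total):
    """Accuracy obtained by always predicting the most frequent class."""
    return num_majority / num_total

# Assumption: 279 of the 408 validation pairs are labeled as paraphrases.
baseline_accuracy = majority_baseline(279, 408)
print(round(baseline_accuracy, 4))  # 0.6838
```

If a trained model's loss and accuracy sit at these reference values, it is worth checking the loss configuration and learning rate before trusting its predictions.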

Let's test the model on our own text. I know that the two texts I provided are paraphrases of each other. Let's see if the model predicts the same.

In [0]:
text1 = "My name is Rachin."
text2 = "I am Rachin."

In [0]:
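# Encode the two texts as a single sentence pair and return TensorFlow tensors.
inputs_ids = tokenizer.encode_plus(text1, text2, return_tensors='tf')

# The first output holds the logits; the index of the largest logit is the predicted label.
pred = classifier(inputs_ids['input_ids'])[0].numpy().argmax().item()
print("text2 is", "a paraphrase" if pred else "not a paraphrase", "of text1")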

text2 is a paraphrase of text1
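
The prediction step above picks the label with the largest logit. The sketch below shows the same decision in plain Python, and also how a softmax turns logits into probabilities, which gives a sense of how confident a prediction is. The logit values are hypothetical and were not produced by the model.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    largest = max(logits)                            # subtract the max for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the labels [not a paraphrase, paraphrase].
logits = [-0.3, 0.9]

probs = softmax(logits)
pred = max(range(len(logits)), key=lambda i: logits[i])  # argmax over the logits

print(probs)
print("a paraphrase" if pred else "not a paraphrase")
```

Softmax preserves the ordering of the logits, so taking the argmax of the logits or of the probabilities yields the same label.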


# Conclusion:
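The availability of transformers in TF2 makes this task simple to set up. The model predicted correctly on our provided text, but a single example is not enough to call the model accurate. In the run recorded above, the evaluation loss is 0.6931, which is about ln 2, the loss of a two-class model that assigns probability 0.5 to each class, and the accuracy is 0.6838. These numbers should be compared against a majority-class baseline before the model is used.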