# Paraphrase Mining with Transformers

![](https://i.imgur.com/7SXKckD.png)

Transfer learning means taking an already trained model and tuning or adapting it to our own downstream tasks.

# Finding similar phrases with Transformer embeddings & Paraphrase Mining

Paraphrase mining is the task of finding paraphrases (texts with identical / similar meaning) in a large corpus of sentences. 

In Semantic Search / Similarity we saw a simplified version of finding paraphrases in a list of sentences.

That version scored and ranked all pairs by brute force.

However, as this has a quadratic runtime, it fails to scale to large (10,000 and more) collections of sentences.
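To make the quadratic cost concrete, here is a minimal sketch of brute-force pair scoring in plain Python. The helper names `cosine` and `brute_force_pairs` and the toy 2-D vectors are made up for illustration; real sentence embeddings would come from the model.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def brute_force_pairs(embeddings):
    """Score every unordered pair (i, j) with i < j and sort best first."""
    scored = [(cosine(embeddings[i], embeddings[j]), i, j)
              for i, j in combinations(range(len(embeddings)), 2)]
    return sorted(scored, reverse=True)

# Toy vectors standing in for sentence embeddings (illustrative values only).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
pairs = brute_force_pairs(toy_embeddings)
print(len(pairs))  # n * (n - 1) / 2 pairs are scored
```

For n sentences this scores n(n-1)/2 pairs, so 10,000 sentences already mean roughly 50 million comparisons.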

For larger collections, Sentence Transformers offers the `paraphrase_mining` functionality:



![](https://i.imgur.com/4NWbp1w.png)

# Clustering with Transformers and Agglomerative Clustering

Here we use hierarchical clustering with the agglomerative clustering algorithm.

In contrast to k-means, we can specify a threshold for the clustering: clusters whose distance is below that threshold are merged.

This algorithm can be useful if the number of clusters is unknown. 

With the threshold, we control whether we get many small, fine-grained clusters or a few coarse-grained ones.
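The effect of the threshold can be illustrated with a small toy implementation. This is a sketch under stated assumptions, not a library's algorithm: it uses average linkage with Euclidean distance on made-up 2-D points, and `average_linkage` and `agglomerative` are hypothetical helper names. In practice one would typically use a library such as scikit-learn, whose `AgglomerativeClustering` accepts a `distance_threshold` when `n_clusters=None`.

```python
import math

def average_linkage(cluster_a, cluster_b, points):
    """Mean Euclidean distance over all cross-cluster point pairs."""
    total = sum(math.dist(points[i], points[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def agglomerative(points, distance_threshold):
    """Merge the two closest clusters until no pair is closer than the threshold."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        # Find the closest pair of clusters (a < b).
        distance, a, b = min(
            (average_linkage(clusters[a], clusters[b], points), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if distance >= distance_threshold:
            break  # every remaining pair is too far apart to merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy points: two tight groups and one outlier (illustrative values only).
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
fine = agglomerative(points, distance_threshold=1.0)
coarse = agglomerative(points, distance_threshold=100.0)
print(len(fine), len(coarse))
```

A low threshold leaves the tight groups and the outlier as separate clusters, while a very high threshold merges everything into one.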

In [1]:
!pip install -U sentence-transformers


In [2]:
from sentence_transformers import SentenceTransformer, util

In [3]:
# Single list of sentences - possibly tens of thousands of sentences
sentences = ['The cat sits outside',
             'A man is playing guitar',
             'I love pasta',
             'The new movie is awesome',
             'The cat plays in the garden',
             'A woman watches TV',
             'The new movie is so great',
             'Do you like pizza?']

In [4]:
# https://huggingface.co/microsoft/MiniLM-L12-H384-uncased
# MiniLM: Small and Fast Pre-trained Models for Language Understanding and Generation
# MiniLMv1-L12-H384-uncased: 12-layer, 384-hidden, 12-heads, 33M parameters, 2.7x faster than BERT-Base

embedder = SentenceTransformer('all-MiniLM-L12-v2')


In [6]:
paraphrases = util.paraphrase_mining(embedder, sentences)

In [7]:
len(paraphrases)

28

In [10]:
import pandas as pd

In [16]:
df = pd.DataFrame([[sentences[paraphrase[1]],
                    sentences[paraphrase[2]],
                    paraphrase[0]] 
                      for paraphrase in paraphrases], columns=['S1', 'S2', 'Score'])

df[df['Score'] > 0.40]

| | S1 | S2 | Score |
|---|---|---|---|
| 0 | The new movie is awesome | The new movie is so great | 0.891309 |
| 1 | The cat sits outside | The cat plays in the garden | 0.664502 |
| 2 | I love pasta | Do you like pizza? | 0.447631 |


# Large Scale Paraphrase Mining with Transformers

Instead of computing all pairwise cosine scores and ranking all possible combinations, the approach is a bit more complex (and hence more efficient).

We chunk our corpus into smaller pieces using `corpus_chunk_size`. 

For example, if we set `corpus_chunk_size=10000`, we look for paraphrases in 10k sentences at a time.

The next critical step is finding the pairs with the highest similarities. Instead of computing and sorting all n^2 pairwise scores, we keep only the `top_k` scores for each query. So with `top_k=100`, we find at most 100 paraphrases per sentence per chunk. You can experiment with `top_k` to get the behaviour you want.
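The interaction between chunking and `top_k` can be sketched in plain Python. This is a simplified illustration, not the library's implementation: the function `chunked_top_k` is a hypothetical helper, the toy vectors are made up, and the embeddings are assumed to be L2-normalised so that the dot product equals cosine similarity.

```python
import heapq

def dot(u, v):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def chunked_top_k(embeddings, corpus_chunk_size, top_k):
    """Keep, per query and per corpus chunk, only the top_k best-scoring candidates."""
    n = len(embeddings)
    candidates = []
    for chunk_start in range(0, n, corpus_chunk_size):
        chunk = range(chunk_start, min(chunk_start + corpus_chunk_size, n))
        for query_idx in range(n):
            # Score the query against this chunk only, skipping itself.
            scores = [(dot(embeddings[query_idx], embeddings[j]), query_idx, j)
                      for j in chunk if j != query_idx]
            candidates.extend(heapq.nlargest(top_k, scores))
    return candidates

# Toy unit vectors standing in for normalised sentence embeddings.
toy = [(1.0, 0.0), (0.8, 0.6), (0.0, 1.0), (0.6, 0.8)]
one_chunk = chunked_top_k(toy, corpus_chunk_size=4, top_k=1)
two_chunks = chunked_top_k(toy, corpus_chunk_size=2, top_k=1)
print(len(one_chunk), len(two_chunks))
```

Each query contributes at most `top_k` candidates per chunk, so the number of candidates grows with the number of chunks rather than with n^2.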

For example, with

`paraphrases = util.paraphrase_mining(model, sentences, corpus_chunk_size=len(sentences), top_k=1)`


you get, for each sentence, only its single most similar other sentence. Note that if B is the most similar sentence for A, A need not be the most similar sentence for B. So the returned list can contain entries like (A, B) and (B, C).
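The asymmetry of nearest neighbours is easy to see in a toy example. The sketch below places three hypothetical items A, B and C on a line (the positions are made up) and uses the absolute distance in place of a similarity score.

```python
def nearest_neighbour(name, positions):
    """Return the other item closest to `name` on the line."""
    others = {k: v for k, v in positions.items() if k != name}
    return min(others, key=lambda k: abs(others[k] - positions[name]))

# B is closest to A, but C (not A) is closest to B.
positions = {"A": 0.0, "B": 1.0, "C": 1.5}
top1 = {name: nearest_neighbour(name, positions) for name in positions}
print(top1)  # {'A': 'B', 'B': 'C', 'C': 'B'}
```

With one neighbour per item, the candidate list therefore contains (A, B) and (B, C), plus (C, B), which duplicates (B, C) in reverse order.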

The final relevant parameter is `max_pairs`, which sets the maximum number of paraphrase pairs you want returned.

If you set it to e.g. `max_pairs=100`, you will not get more than 100 paraphrase pairs returned. 

Usually, you get fewer pairs returned as the list is cleaned of duplicates, e.g., if it contains (A, B) and (B, A), then only one is returned.
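Why the result is often shorter than `max_pairs` can be sketched as follows. This is an assumption inferred from the description above, not the library's code: the candidate list is first cut to the `max_pairs` best entries, and mirrored duplicates are removed afterwards. `limit_then_dedupe` is a hypothetical helper and the scores are made up.

```python
def limit_then_dedupe(candidates, max_pairs):
    """Keep the max_pairs best (score, i, j) entries, then drop mirrored duplicates."""
    best = sorted(candidates, reverse=True)[:max_pairs]
    seen = set()
    pairs = []
    for score, i, j in best:
        key = (min(i, j), max(i, j))  # (A, B) and (B, A) map to the same key
        if key not in seen:
            seen.add(key)
            pairs.append((score, key[0], key[1]))
    return pairs

# Each strong pair shows up twice, once from each sentence's point of view.
candidates = [(0.9, 0, 1), (0.9, 1, 0), (0.8, 2, 3), (0.8, 3, 2), (0.5, 0, 2)]
result = limit_then_dedupe(candidates, max_pairs=4)
print(result)
```

Here four candidates survive the `max_pairs` cut, but only two unique pairs remain after deduplication.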

## Get Duplicate Questions Quora Dataset

In [17]:
import os
import csv
import time
from sentence_transformers import util

# We download the Quora Duplicate Questions Dataset (https://www.quora.com/q/quoradata/First-Quora-Dataset-Release-Question-Pairs)
# and find similar questions in it
url = "http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv"
dataset_path = "quora_duplicate_questions.tsv"
max_corpus_size = 50000 # We limit our corpus to only the first 50k questions


# Download the dataset if it is not already present
if not os.path.exists(dataset_path):
    print("Download dataset")
    util.http_get(url, dataset_path)

# Get all unique sentences from the file
corpus_sentences = set()
with open(dataset_path, encoding='utf8') as fIn:
    reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_MINIMAL)
    for row in reader:
        corpus_sentences.add(row['question1'])
        corpus_sentences.add(row['question2'])
        if len(corpus_sentences) >= max_corpus_size:
            break

corpus_sentences = list(corpus_sentences)

Download dataset



In [18]:
len(corpus_sentences)

50001

In [19]:
corpus_sentences[:5]

['Why do Indians consider hockey as their national game?',
 'What is the best way to constantly improve self-confidence in a relationship?',
 'Do Chinese history have in common with Japanese history?',
 'What is the Sahara, and how do the average temperatures there compare to the ones in the Namib Desert?',
 'In Marvel’s Avergers: Age of Ultron film, how was Tony Stark able to operate his Iron Man Suit after Ultron had De-programmed J.A.R.V.I.S.?']

## Paraphrase Mining with Chunking

In [36]:
print("Start paraphrase mining")
start_time = time.time()

paraphrases = util.paraphrase_mining(embedder, corpus_sentences, corpus_chunk_size=5000, 
                                     top_k=1, max_pairs=300, show_progress_bar=True)

print("Paraphrase mining done after {:.2f} sec".format(time.time() - start_time))

Start paraphrase mining



Paraphrase mining done after 21.67 sec


In [37]:
len(paraphrases)

150

In [38]:
df = pd.DataFrame([[corpus_sentences[paraphrase[1]],
                    corpus_sentences[paraphrase[2]],
                    paraphrase[0]] 
                      for paraphrase in paraphrases], columns=['S1', 'S2', 'Score'])

In [39]:
df.head(10)

| | S1 | S2 | Score |
|---|---|---|---|
| 0 | How should I prepare myself? | How should I prepare myself ? | 1.0 |
| 1 | Daniel Ek: When is Spotify coming to india? | Daniel Ek: When is Spotify coming to India? | 1.0 |
| 2 | Who is the most educated president in the world? | Who is the most educated president in the world ? | 1.0 |
| 3 | Which book should i use for JEE organic chemi... | Which book should I use for JEE organic chemis... | 1.0 |
| 4 | How do I prepare for gre? | How do I prepare for GRE? | 1.0 |
| 5 | How do I post a question in quora? | How do I post a question in Quora? | 1.0 |
| 6 | What are the best car technology gadgets? | What are the best Car technology gadgets? | 1.0 |
| 7 | How do I recover a hacked instagram? | How do I recover a hacked Instagram? | 1.0 |
| 8 | What is the purpose of life? | What is the purpose of life ? | 1.0 |
| 9 | What is the Milky Way? | What is the Milky way? | 1.0 |


In [40]:
df.tail(10)

| | S1 | S2 | Score |
|---|---|---|---|
| 140 | Who invented and designed the human heart? | Who designed and invented the human heart? | 0.996611 |
| 141 | Is there a NRI quota in IIMs in India? | Is there an NRI quota in the IIMs in India? | 0.996596 |
| 142 | Is there any evidence for reincarnation? | Is there any evidence of reincarnation? | 0.996565 |
| 143 | If dark energy is being created with expansion... | If dark energy is created with expansion can i... | 0.996564 |
| 144 | What creative and new activities can be made i... | What are creative and new activities to be mad... | 0.99655 |
| 145 | What is the best time table for a student of m... | What is the best time table for student of mat... | 0.996547 |
| 146 | What is the difference between Ethernet and In... | What is the difference between intranet and et... | 0.996535 |
| 147 | How do I master Java in one month? | How can I master Java in one month? | 0.996526 |
| 148 | What are the best places to visit on a 3-day t... | What are the best places to visit on a 3 day t... | 0.996526 |
| 149 | What are some tips for a beginner investors? | What are some tips for beginner investors? | 0.996469 |
