<a href="https://colab.research.google.com/github/ccstan99/ccstan99.github.io/blob/main/docs/sbert-paraphrase-mining.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
# Sentence-BERT Paraphrase Mining
[Sentence-BERT (SBERT)](https://sbert.net/) is a modification of BERT optimized for generating accurate and useful sentence embeddings. It uses Siamese and triplet network structures to derive embeddings that can be compared efficiently using cosine similarity. This reduces the time needed to find the most similar pair among 10,000 sentences from about 65 hours with BERT or RoBERTa to about 5 seconds, without sacrificing accuracy.
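To make the comparison step concrete, here is a minimal sketch of cosine similarity in plain Python, together with the number of sentence pairs that have to be scored among 10,000 sentences. The 3-dimensional vectors are made up for illustration; real SBERT embeddings have hundreds of dimensions and come from the model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "embeddings" (assumed values, not produced by any model)
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])      # perpendicular vectors

# Number of unordered pairs among n sentences: n * (n - 1) / 2
n = 10_000
num_pairs = n * (n - 1) // 2

print(f"parallel: {same_direction:.2f}, orthogonal: {orthogonal:.2f}, pairs: {num_pairs:,}")
```

With roughly 50 million pairs, running a full BERT forward pass per pair is what makes the pairwise approach slow. With SBERT each sentence is embedded once, and only this cheap vector comparison is repeated per pair.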

## Setup
Install the `sentence-transformers` package to run this notebook.

In [None]:
!pip install sentence-transformers

Download the sentences from the Stampy wiki in JSON format. To use your own sentences, substitute your own data source and adjust the parsing to match its format.

In [None]:
import requests, json
import pandas as pd

wikiURL = "https://stampy.ai/w/api.php?action=ask&query=[[Canonical%20questions]]|format%3Dplainlist|%3FCanonicalQuestions&format=json"
wikiJSON = requests.get(wikiURL).json()

df = pd.DataFrame(wikiJSON["query"]["results"]["Canonical questions"]["printouts"]["CanonicalQuestions"])
df.head()

| | fulltext | fullurl | namespace | exists | displaytitle |
| :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | Why is AGI dangerous? | https://stampy.ai/wiki/Why_is_AGI_dangerous%3F | 0 | 1 | |
| 1 | How is AGI different from current AI? | https://stampy.ai/wiki/How_is_AGI_different_fr... | 0 | 1 | |
| 2 | Why can't we turn the computers off? | https://stampy.ai/wiki/Why_can%27t_we_turn_the... | 0 | 1 | |
| 3 | Could we program an AI to automatically shut d... | https://stampy.ai/wiki/Could_we_program_an_AI_... | 0 | 1 | |
| 4 | If AI takes over the world how could it create... | https://stampy.ai/wiki/If_AI_takes_over_the_wo... | 0 | 1 | |
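If you are substituting your own sentences, the rest of the notebook only reads two fields per record: `fulltext` (the sentence to embed) and `fullurl` (the link written to the report). A minimal sketch of that shape follows; the questions, the URLs, and the names `own_questions` and `own_sentences` are hypothetical placeholders.

```python
# Hypothetical records: replace with your own text and links.
own_questions = [
    {"fulltext": "How do I reset my password?", "fullurl": "https://example.com/faq/reset-password"},
    {"fulltext": "How can I change my password?", "fullurl": "https://example.com/faq/change-password"},
    {"fulltext": "   ", "fullurl": "https://example.com/faq/blank"},
    {"fulltext": "Where can I download my invoice?", "fullurl": "https://example.com/faq/invoice"},
]

# Drop records whose text is missing or blank so they are never embedded.
own_questions = [q for q in own_questions if q.get("fulltext", "").strip()]

# The plain list of strings is what gets passed to the model.
own_sentences = [q["fulltext"] for q in own_questions]
print(own_sentences)
```

Passing `own_questions` to `pd.DataFrame` would yield a DataFrame with the same two column names, so the cells that follow should work without further changes.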


Authorize Colab to mount Google Drive so the notebook can write its output file there.

In [None]:
from google.colab import drive

drive.mount('/content/drive/', force_remount=True)
PATH = "/content/drive/My Drive/Colab Notebooks"

def paraphrase_filename(model_name, current_time):
    # model_name and current_time are accepted but not used; every run writes to the same file
    return PATH + "/data/stampy-duplicate-questions.md"


Mounted at /content/drive/
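`paraphrase_filename` accepts `model_name` and `current_time` but always returns the same path, so each run overwrites the previous report. If you would rather keep one file per model and run, a variant along these lines could work. This is only a sketch: `timestamped_filename` and its naming scheme are hypothetical, not part of the original notebook.

```python
import re

PATH = "/content/drive/My Drive/Colab Notebooks"  # same base path as the cell above

def timestamped_filename(model_name, current_time):
    """Hypothetical variant: build one output path per model and run."""
    # Replace anything that is awkward in a filename (slashes, spaces, colons) with hyphens
    safe_model = re.sub(r"[^A-Za-z0-9._-]+", "-", model_name)
    # Keep only the digits of the timestamp, separated by hyphens
    safe_time = re.sub(r"[^0-9]+", "-", current_time).strip("-")
    return f"{PATH}/data/duplicate-questions_{safe_model}_{safe_time}.md"

example_path = timestamped_filename("all-MiniLM-L6-v2", "2022-05-01 12:30:00.123456")
print(example_path)
```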


## Paraphrase Mining to Find Duplicate Questions
There is an ever-growing number of [pretrained sentence-transformer model checkpoints](https://sbert.net/docs/pretrained_models.html) ranked by size, speed, and other performance metrics. A model is initialized by passing it a checkpoint name, which identifies both the model architecture and the specific trained weights. Since our goal was to identify the most similar pairs of questions or sentences, we tried several models that performed best on semantic search leaderboards.

In [None]:
#@title Choose Model {display-mode: "form"}
# This code will be hidden when the notebook is loaded.
model_name = "paraphrases-multi-qa-mpn"  #@param ['paraphrases-multi-qa-mpn', 'distilbert-base-nli-stsb-quora-ranking', 'multi-qa-mpnet-base-dot-v1', 'all-MiniLM-L6-v2']

The `sentence_transformers` module provides a handy [`paraphrase_mining`](https://sbert.net/examples/applications/paraphrase-mining/README.html#paraphrase-mining) utility that returns a list of `[score, i, j]` entries sorted by descending similarity score, where `i` and `j` are the indices of the two sentences in the original input list. A score of 1.0 means the two sentences are semantically identical, while a score near 0.0 means they are semantically unrelated.

In [None]:
from sentence_transformers import SentenceTransformer, util
import time

model = SentenceTransformer(model_name)

# Single list of sentences - Possible tens of thousands of sentences
sentences = df["fulltext"].values.tolist()

start_time = time.time()
paraphrases = util.paraphrase_mining(model, sentences)
end_time = time.time()
print(f"Elapsed time: {int((end_time - start_time)*1000)}ms")

Elapsed time: 3761ms
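Conceptually, paraphrase mining scores every pair of sentence embeddings and sorts the pairs by score. The brute-force sketch below illustrates that idea on toy 2-dimensional vectors. It is not the library's implementation, which is built to scale to much larger inputs; the function name `brute_force_paraphrase_mining` and the toy embeddings are assumptions made for illustration.

```python
import math
from itertools import combinations

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def brute_force_paraphrase_mining(embeddings, top_k=None):
    """Score every unordered pair and return [score, i, j] entries, best first."""
    pairs = [
        [cosine_similarity(embeddings[i], embeddings[j]), i, j]
        for i, j in combinations(range(len(embeddings)), 2)
    ]
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    return pairs if top_k is None else pairs[:top_k]

# Toy embeddings (assumed values, not produced by any model)
toy_embeddings = [
    [1.0, 0.0],  # sentence 0
    [0.9, 0.1],  # sentence 1, nearly parallel to sentence 0
    [0.0, 1.0],  # sentence 2, orthogonal to sentence 0
]

toy_pairs = brute_force_paraphrase_mining(toy_embeddings)
for score, i, j in toy_pairs:
    print(f"sentences {i} and {j}: score {score:.2f}")
```

The output has the same `[score, i, j]` shape as the `paraphrases` list above, so the indices can be used to look up the original sentences in the same way.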


Save and print the top k=100 most similar pairs of questions, along with the model name and generation date for record keeping.

In [None]:
import datetime

k = 100  # number of top-scoring pairs to save
current_time = str(datetime.datetime.now())
with open(paraphrase_filename(model_name, current_time), 'w') as f:
    f.write("## Duplicate Questions\n")
    f.write(f"Language model name: {model_name}\n\n")
    f.write(f"Date generated: {current_time}\n\n")

    f.write("| Question1 | Question2 | Score |\n")
    f.write("| :--- | :--- | :--- |\n")

    for paraphrase in paraphrases[0:k]:
        score, i, j = paraphrase
        print(f"{df['fulltext'][i]}\n{df['fulltext'][j]}\nscore:{score:.2f}\n")
        f.write(f"| [{df['fulltext'][i]}]({df['fullurl'][i]}) | [{df['fulltext'][j]}]({df['fullurl'][j]}) | {score:.2f} |\n")   

Who helped create Stampy?
Who created Stampy?
score:0.98

Is humanity doomed?
How doomed is humanity?
score:0.95

What is a canonical question on Stampy's Wiki?
What is a canonical version of a question on Stampy's Wiki?
score:0.93

Why can’t we just “put the AI in a box” so it can’t influence the outside world?
Couldn’t we keep the AI in a box and never give it the ability to manipulate the external world?
score:0.92

How might a superintelligence technologically manipulate humans?
How might a superintelligence socially manipulate humans?
score:0.92

Why is AI Safety important?
Why is safety important for smarter-than-human AI?
score:0.91

Can we tell an AI just to figure out what we want, then do that?
Can we just tell an AI to do what we want?
score:0.90

What is AI Safety?
Why is AI Safety important?
score:0.90

I’d like a good introduction to AI alignment. Where can I find one?
What are good resources on AI alignment?
score:0.89

I’d like to get deeper into the AI alignment litera

In [None]:
print(f"File generated at:\n{paraphrase_filename(model_name, current_time)}") 

File generated at:
/content/drive/My Drive/Colab Notebooks/data/stampy-duplicate-questions.md


## Additional SBERT Resources
- [Paraphrase Mining](https://sbert.net/examples/applications/paraphrase-mining/README.html#paraphrase-mining)
- [Semantic Search](https://sbert.net/examples/applications/semantic-search/README.html#util-semantic-search)
- [Storing & Loading Embeddings](https://sbert.net/examples/applications/computing-embeddings/README.html#storing-loading-embeddings)
- [Topic Modeling](https://sbert.net/examples/applications/clustering/README.html#topic-modeling)