<a href="https://colab.research.google.com/github/elvinagam/efficient-sentence-search-llms/blob/main/Efficient_Similar_Sentence_Search_on_Large_Corpus_LLMs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Imports

In [4]:
!pip install bitarray mmh3 datasketch pybloom_live

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting bitarray
  Downloading bitarray-2.7.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (272 kB)
Collecting mmh3
  Downloading mmh3-4.0.0-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (68 kB)
Installing collected packages: mmh3, bitarray
Successfully installed bitarray-2.7.3 mmh3-4.0.0


In [10]:
text = """Once upon a time, there was a beautiful princess who lived in a grand castle. She had everything she could ever want: a loving family, loyal friends, and a kingdom to rule. But the princess was not happy. She longed for adventure and excitement. One day, a handsome prince came to the castle. He was everything the princess had ever dreamed of: brave, strong, and kind. The princess and the prince fell in love and were soon married.

After the wedding, the prince and princess set off on a journey to explore the world. They traveled to far-off lands and met all sorts of interesting people. They had many adventures along the way, including battling dragons, rescuing damsels in distress, and finding lost treasures. Throughout their travels, the prince and princess grew closer and closer. They learned to trust each other and rely on each other. They also learned to love each other more and more each day.

One day, the prince and princess came to a dark and dangerous forest. They knew they had to be careful, but they were determined to find their way through. As they walked through the forest, they heard a noise. They turned around and saw a group of bandits approaching them. The bandits were armed with swords and axes. They were clearly not friendly. The prince and princess knew they had to fight back. They drew their swords and prepared to defend themselves.

The battle was long and hard. The prince and princess fought bravely, but they were outnumbered. Just when it seemed like all hope was lost, a group of knights arrived to help them. The knights were able to defeat the bandits, and the prince and princess were safe. The prince and princess were grateful to the knights for saving their lives. They thanked the knights and continued on their journey. After many more adventures, the prince and princess finally returned home. They were greeted by their families and friends, who were overjoyed to see them. The prince and princess told their stories of their adventures, and everyone was amazed.

The prince and princess lived happily ever after. They ruled their kingdom wisely and justly, and they were loved by all their subjects.

Here are some similar and duplicate senten"""

In [None]:
import re
import difflib
from datasketch import MinHash, MinHashLSH
from pybloom_live import ScalableBloomFilter

In [None]:
# Function to preprocess the text
def preprocess_text(text):
    text = re.sub(r'\W+', ' ', text.lower())
    return text.strip()

# Function to find similar sentences using SequenceMatcher
def find_similar_sentences(sentences):
    """Return every pair (i < j) whose SequenceMatcher ratio is at least 0.8.

    Compares all n * (n - 1) / 2 pairs on the raw, unpreprocessed strings.
    """
    similar_sentences = []

    for i in range(len(sentences)):
        for j in range(i+1, len(sentences)):
            seq_matcher = difflib.SequenceMatcher(None, sentences[i], sentences[j])
            similarity_ratio = seq_matcher.ratio()
            if similarity_ratio >= 0.8:
                pair = (sentences[i], sentences[j])
                similar_sentences.append(pair)

    return similar_sentences
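
As a rough illustration of the score that `find_similar_sentences` thresholds at 0.8, the sketch below computes the `SequenceMatcher` ratio for one pair and counts how many pairs an all-pairs loop visits. The helper names `similarity` and `count_pairs` are hypothetical and not part of the notebook.

```python
import difflib
from itertools import combinations


def similarity(a, b):
    """Return difflib's ratio: 2 * matched_chars / (len(a) + len(b))."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def count_pairs(items):
    """Number of unordered pairs an all-pairs loop compares: n * (n - 1) / 2."""
    return sum(1 for _ in combinations(items, 2))


if __name__ == "__main__":
    first = "The cat is sitting on the mat."
    second = "The cat is lying on the mat."
    print(f"ratio = {similarity(first, second):.3f}")
    print(f"pairs for 8 sentences = {count_pairs(range(8))}")
```

The number of comparisons grows quadratically with the number of sentences, so this exhaustive approach is only practical for small inputs.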

In [None]:
# Function to generate MinHash signatures for sentences
def generate_minhash_signature(sentence):
    tokens = sentence.split()
    minhash = MinHash(num_perm=128)
    for token in tokens:
        minhash.update(token.encode('utf-8'))
    return minhash
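
To show what a MinHash signature estimates, here is a standard-library sketch. It is not how `datasketch` implements `MinHash`; the names `toy_minhash`, `estimate_jaccard`, and `exact_jaccard` are hypothetical, and SHA-1 with a seed prefix stands in for a family of hash functions. The fraction of signature positions on which two token sets agree approximates their Jaccard similarity.

```python
import hashlib


def exact_jaccard(tokens_a, tokens_b):
    """Exact Jaccard similarity of two token collections (as sets)."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b)


def toy_minhash(tokens, num_perm=128):
    """For each seeded hash function, keep the minimum hash over the tokens.

    Assumes `tokens` is non-empty.
    """
    signature = []
    for seed in range(num_perm):
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{token}".encode("utf-8")).digest()[:8], "big"
            )
            for token in set(tokens)
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


if __name__ == "__main__":
    a = "the cat is sitting on the mat".split()
    b = "the cat is lying on the mat".split()
    print("exact    :", round(exact_jaccard(a, b), 3))
    print("estimated:", round(estimate_jaccard(toy_minhash(a), toy_minhash(b)), 3))
```

With 128 hash functions the estimate usually lands within a few hundredths of the exact value. More hash functions reduce the error at the cost of longer signatures.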

In [None]:
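# Function to find duplicate sentences
def find_duplicate_sentences(sentences):
    """Return sentences whose preprocessed form was already seen.

    A Bloom filter has no false negatives but can report false positives,
    so a sentence may occasionally be flagged as a duplicate by mistake.
    """
    duplicate_sentences = []
    duplicates_bloom = ScalableBloomFilter()

    for sentence in sentences:
        sentence = preprocess_text(sentence)
        if sentence in duplicates_bloom:
            duplicate_sentences.append(sentence)
        else:
            duplicates_bloom.add(sentence)

    return duplicate_sentences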
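
The sketch below shows the idea behind a Bloom-filter membership test using only the standard library. `ToyBloomFilter` and `find_duplicates` are hypothetical names, the bit-array size and hash count are arbitrary assumptions, and `pybloom_live` is implemented differently. The same loop works with either an exact `set` or the toy filter, which makes the trade-off visible: the filter uses fixed memory but may flag extra items.

```python
import hashlib
import re


class ToyBloomFilter:
    """Fixed-size Bloom filter: no false negatives, possible false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, item):
        # Derive num_hashes positions by prefixing the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))


def find_duplicates(sentences, seen):
    """Flag sentences whose normalized form is already in `seen`."""
    duplicates = []
    for sentence in sentences:
        key = re.sub(r"\W+", " ", sentence.lower()).strip()
        if key in seen:
            duplicates.append(key)
        else:
            seen.add(key)
    return duplicates


if __name__ == "__main__":
    sample = ["The cat is sitting on the mat.", "A bird is flying.",
              "the cat is sitting on the mat"]
    print(find_duplicates(sample, set()))
    print(find_duplicates(sample, ToyBloomFilter()))
```

Every true duplicate found with the `set` is also found with the filter. The reverse does not hold in general, because of false positives.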

In [None]:
# Function to find similar and duplicate sentences
def find_similar_duplicates(sentences):
    similar_sentences = find_similar_sentences(sentences)
    duplicate_sentences = find_duplicate_sentences(sentences)

    return similar_sentences, duplicate_sentences

In [63]:
# Example sentences
sentences = [
    "The cat is sitting on the mat.",
    "The dog is playing in the garden.",
    "The cat is lying on the mat.",
    "A bird is flying in the sky.",
    "The dog is chasing its tail.",
    "The cat is sitting on the mat.",
    "The bird is singing in the tree.",
    "The cat is sleeping on the mat."
]

# Find similar and duplicate sentences
similar_sentences, duplicate_sentences = find_similar_duplicates(sentences)

# Print the results
print("Similar Sentences:")
for pair in similar_sentences:
    print(f"'{pair[0]}' and '{pair[1]}'")

print("\nDuplicate Sentences:")
for sentence in duplicate_sentences:
    print(f"'{sentence}'")

Similar Sentences:
'The cat is sitting on the mat.' and 'The cat is lying on the mat.'
'The cat is sitting on the mat.' and 'The cat is sitting on the mat.'
'The cat is sitting on the mat.' and 'The cat is sleeping on the mat.'
'The cat is lying on the mat.' and 'The cat is sitting on the mat.'
'The cat is lying on the mat.' and 'The cat is sleeping on the mat.'
'The cat is sitting on the mat.' and 'The cat is sleeping on the mat.'

Duplicate Sentences:
'the cat is sitting on the mat'


## With MinHash LSH

In [None]:
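import re
import nltk
from sklearn.feature_extraction.text import HashingVectorizer
from datasketch import MinHash, MinHashLSH

# nltk.word_tokenize needs NLTK's tokenizer data, e.g. nltk.download('punkt')
# (newer NLTK releases ask for 'punkt_tab' instead).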

# Function to preprocess the text
def preprocess_text(text):
    text = re.sub(r'\W+', ' ', text.lower())
    return text.strip()

In [None]:
# Function to generate MinHash signatures for sentences
def generate_minhash_signature(sentence):
    tokens = nltk.word_tokenize(sentence)
    minhash = MinHash(num_perm=128)
    for token in tokens:
        minhash.update(token.encode('utf-8'))
    return minhash

In [None]:
# Function to find similar and duplicate sentences
def find_similar_duplicates(sentences):
    # One MinHash signature per sentence
    minhashes = [generate_minhash_signature(sentence) for sentence in sentences]

    # Index the signatures; the LSH threshold (0.5) only selects candidates
    lsh = MinHashLSH(threshold=0.5, num_perm=128)
    for i, minhash in enumerate(minhashes):
        lsh.insert(i, minhash)

    similar_sentences = []
    duplicate_sentences = []

    for i, minhash in enumerate(minhashes):
        candidates = lsh.query(minhash)
        for candidate in candidates:
            if i == candidate:
                continue
            # Pairs are ordered, so (a, b) and (b, a) are both reported
            pair = (sentences[i], sentences[candidate])
            if pair in similar_sentences or pair in duplicate_sentences:
                continue
            # Estimated Jaccard similarity between the two signatures
            similarity = minhash.jaccard(minhashes[candidate])
            if similarity >= 0.8:  # Adjust the threshold as per your requirement
                duplicate_sentences.append(pair)
            else:
                similar_sentences.append(pair)

    return similar_sentences, duplicate_sentences
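
To make the role of the LSH threshold more concrete, the sketch below evaluates the standard banding formula: with `bands` bands of `rows` rows each, two items with Jaccard similarity `s` become candidates with probability `1 - (1 - s**rows)**bands`. The values `bands=32` and `rows=4` are assumptions chosen so that `bands * rows == 128`; `datasketch` picks its own split internally from the threshold, and that split may differ.

```python
def candidate_probability(s, bands, rows):
    """Probability that two items with Jaccard similarity s share at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands


if __name__ == "__main__":
    bands, rows = 32, 4  # illustrative split of a 128-value signature
    for s in (0.2, 0.5, 5 / 7, 1.0):
        p = candidate_probability(s, bands, rows)
        print(f"s = {s:.3f} -> P(candidate) = {p:.4f}")
```

The curve is S-shaped: low-similarity pairs are rarely retrieved, while pairs well above the threshold are retrieved almost always. That is why the query loop only has to score a small candidate set.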

In [65]:
# Example sentences
sentences = [
    "The cat is sitting on the mat.",
    "The dog is playing in the garden.",
    "The cat is lying on the mat.",
    "A bird is flying in the sky.",
    "The dog is chasing its tail.",
    "The cat is sitting on the mat.",
    "The bird is singing in the tree.",
    "The cat is sleeping on the mat."
]

# sentences = text.split(". ")

# Preprocess sentences
preprocessed_sentences = [preprocess_text(sentence) for sentence in sentences]

# Find similar and duplicate sentences
similar_sentences, duplicate_sentences = find_similar_duplicates(preprocessed_sentences)

# Print the results
print("Similar Sentences:")
for pair in similar_sentences:
    print(f"'{pair[0]}' and '{pair[1]}'")

print("\nDuplicate Sentences:")
for pair in duplicate_sentences:
    print(f"'{pair[0]}' and '{pair[1]}'")


Similar Sentences:
'the cat is sitting on the mat' and 'the cat is lying on the mat'
'the cat is sitting on the mat' and 'the cat is sleeping on the mat'
'the cat is lying on the mat' and 'the cat is sitting on the mat'
'the cat is lying on the mat' and 'the cat is sleeping on the mat'
'the cat is sleeping on the mat' and 'the cat is sitting on the mat'
'the cat is sleeping on the mat' and 'the cat is lying on the mat'

Duplicate Sentences:
'the cat is sitting on the mat' and 'the cat is sitting on the mat'
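
Each pair in the output above appears in both orders because the query loop records `(a, b)` and `(b, a)` separately. One possible way to collapse them is sketched below; `unique_unordered_pairs` is a hypothetical helper and is not part of the notebook.

```python
def unique_unordered_pairs(pairs):
    """Keep the first occurrence of each pair, treating (a, b) and (b, a) as equal."""
    seen = set()
    unique = []
    for a, b in pairs:
        key = tuple(sorted((a, b)))  # order-independent key
        if key not in seen:
            seen.add(key)
            unique.append((a, b))
    return unique


if __name__ == "__main__":
    pairs = [
        ("the cat is sitting on the mat", "the cat is lying on the mat"),
        ("the cat is lying on the mat", "the cat is sitting on the mat"),
        ("the cat is lying on the mat", "the cat is sleeping on the mat"),
    ]
    for a, b in unique_unordered_pairs(pairs):
        print(f"'{a}' and '{b}'")
```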
