<a href="https://colab.research.google.com/github/anjal-amin/dedupe/blob/main/near_dupe_detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook demonstrates how to train a machine learning model on Jaccard index labels so we can quickly and accurately determine how similar an input text is to stored text.

We define a "duplicate" as text with similar content, regardless of whether its words, sentences, or paragraphs have been rearranged.
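
As a minimal sketch of that definition, the Jaccard index of two texts can be computed on their word sets: the number of shared words divided by the number of distinct words across both texts. The function name `jaccard_index`, the lowercase whitespace tokenization, and the example sentences are assumptions made for illustration; they are not taken from the notebook's own pipeline.

```python
def jaccard_index(text_a: str, text_b: str) -> float:
    """Jaccard index of the word sets of two texts (illustrative helper)."""
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    union = words_a | words_b
    if not union:
        # Convention chosen here: two empty texts count as identical.
        return 1.0
    return len(words_a & words_b) / len(union)


# Rearranging the words leaves the word set, and so the score, unchanged.
same_words = jaccard_index("the quick brown fox", "fox brown quick the")
print(same_words)  # 1.0

# Three shared words out of five distinct words.
partial = jaccard_index("the quick brown fox", "the quick red fox")
print(partial)  # 0.6
```

Because the score depends only on set membership, it is insensitive to word order, which matches the notion of "duplicate" used here.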

In [1]:
!pip install faiss-gpu

Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2


In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

In [3]:
# Each entry: (first text, second text, Jaccard similarity label for the pair)
data = [
    ("This is a sample text", "A sample text for testing", 0.6),
    ("Python is great", "I love programming in Python", 0.4),
]

In [4]:
texts1, texts2, jaccard_similarity = zip(*data)

In [5]:
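# binary=True records only whether a word occurs, not how often
vectorizer = CountVectorizer(binary=True)
# The vocabulary is learned from texts1 only, so words that appear
# solely in texts2 are ignored by transform()
X1 = vectorizer.fit_transform(texts1)
X2 = vectorizer.transform(texts2)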

In [6]:
# Add the binary vectors: 2 = word in both texts, 1 = word in only one
X = X1 + X2
print(X)

  (0, 4)	2
  (0, 3)	2
  (0, 1)	1
  (0, 5)	1
  (1, 0)	1
  (1, 2)	2
  (1, 1)	1
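
To make the printed matrix easier to read, here is a small sketch that rebuilds the two rows as dense lists and derives an overlap ratio from them: entries equal to 2 are words found in both texts, and entries of 1 or more are words found in at least one. The helper name `jaccard_from_summed_row` is hypothetical. Since the vocabulary was fitted on `texts1` only, words unique to `texts2` are missing from the count, so these ratios need not equal the labels in `data`.

```python
# Dense form of the two rows printed above (vocabulary columns 0 to 5).
summed_rows = [
    [0, 1, 0, 2, 2, 1],  # pair 0
    [1, 1, 2, 0, 0, 0],  # pair 1
]


def jaccard_from_summed_row(row):
    """Shared words / words present, restricted to the fitted vocabulary."""
    shared = sum(1 for value in row if value == 2)
    present = sum(1 for value in row if value >= 1)
    return shared / present if present else 0.0


estimates = [jaccard_from_summed_row(row) for row in summed_rows]
print(estimates)  # [0.5, 0.3333333333333333]
```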


In [7]:
# With two labelled pairs, test_size=0.2 holds out one pair and trains on the other
X_train, X_test, y_train, y_test = train_test_split(X, jaccard_similarity, test_size=0.2, random_state=42)

In [8]:
knn_model = KNeighborsRegressor(n_neighbors=1)  # One neighbor, since the training split holds a single pair
knn_model.fit(X_train, y_train)

In [9]:
y_pred = knn_model.predict(X_test)
print(y_pred)

[0.6]


In [10]:
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error: {mse:.2f}")

Mean Squared Error: 0.04
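
The reported error can be checked by hand. The model was trained on a single pair and predicted 0.6, so the held-out pair must be the one labelled 0.4, and the squared difference is (0.4 - 0.6)² = 0.04. The sketch below repeats that arithmetic in plain Python; `mean_squared_error_manual` is a hypothetical helper written for illustration, not part of scikit-learn.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of squared differences between labels and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Held-out label 0.4 versus the nearest-neighbor prediction 0.6.
manual_mse = mean_squared_error_manual([0.4], [0.6])
print(f"Mean Squared Error: {manual_mse:.2f}")  # Mean Squared Error: 0.04
```

With only one test pair, this number describes a single prediction and says little about how the model would behave on a larger dataset.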
