# Overview

### Quora Question Pairs

Quora Question Pairs is a large corpus of questions from Quora. It is used to detect similar or repeated questions by comparing their semantic meaning.

### Qdrant

Qdrant is an open-source vector database and vector search engine written in Rust. It provides a fast and scalable vector similarity search service.

### Abstract

This script implements a search engine using the `Quora Duplicate Questions` dataset and the Qdrant Python client (`qdrant-client`). Given a user's query, it returns the most similar questions from the dataset.

### Methodology

Here is an overview of the implementation:

- The script loads the Quora dataset and extracts the questions from both question columns. Duplicate questions are removed so that each question appears once, and a random sample is taken to keep processing time short. The sampled questions are then indexed with the Qdrant client.
- A search function queries the indexed questions for the closest matches to the user's input and prints the top five.
- Several example queries demonstrate the search engine. They cover different topics and show how the engine retrieves matches based on semantic similarity rather than exact wording.
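
The idea behind "semantic similarity" in the steps above can be illustrated with a minimal sketch: each question is represented as a vector, and the closest vectors to the query are returned. The three-dimensional vectors below are made up for illustration; in the notebook the vectors come from an embedding model, and Qdrant handles the storage and search. The helper names `cosine_similarity` and `top_k` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, indexed, k=2):
    """Brute-force search: score every indexed vector and keep the k best."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy "embeddings" (assumed values, not produced by any real model)
toy_index = {
    "Why is the earth round?": [0.9, 0.1, 0.0],
    "What is the shape of the earth?": [0.8, 0.3, 0.1],
    "How do I stream live video?": [0.0, 0.2, 0.9],
}
query_vector = [1.0, 0.2, 0.0]  # stands in for an embedded user query

hits = top_k(query_vector, toy_index, k=2)
for score, text in hits:
    print(f"{score:.3f}  {text}")
```

A brute-force scan like this is only practical for small collections; a vector database exists to make the same kind of lookup fast at scale.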

### Summary
The script is a practical demonstration of building a similar-question search engine from real-world data and a vector search library. It is a starting point for more sophisticated search features and can be adapted to other applications that need semantic similarity matching.

# Setting Up
1. Join the [Quora Question Pairs Competition on Kaggle](https://www.kaggle.com/competitions/quora-question-pairs).
2. Download the file [train.csv.zip](https://www.kaggle.com/competitions/quora-question-pairs/data?select=train.csv.zip).
3. Unzip the downloaded file.
4. Save the path to the dataset in `DATA_PATH`.

In [1]:
!unzip /kaggle/input/quora-question-pairs/train.csv.zip

Archive:  /kaggle/input/quora-question-pairs/train.csv.zip
  inflating: train.csv               


In [2]:
DATA_PATH = "/kaggle/working/train.csv"
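
Where the `unzip` shell command is not available, the extraction step could be done with Python's standard `zipfile` module instead. This is a minimal sketch: the helper name `extract_archive` is hypothetical, and the demo builds a throwaway archive in a temporary directory rather than touching the real Kaggle paths.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_archive(zip_path, out_dir):
    """Extract every member of zip_path into out_dir and return the extracted paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
        return [out_dir / name for name in archive.namelist()]

# Demo on a tiny stand-in archive (contents are made up)
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "train.csv.zip"
    with zipfile.ZipFile(demo_zip, "w") as archive:
        archive.writestr("train.csv", "id,question1,question2\n0,a,b\n")

    extracted = extract_archive(demo_zip, Path(tmp) / "working")
    demo_ok = extracted[0].name == "train.csv" and extracted[0].exists()
    print(extracted[0].name, demo_ok)
```

With the real dataset, the returned path of `train.csv` would be the value to store in `DATA_PATH`.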

## Initialize Constants

In [3]:
# Name of Qdrant Collection for saving vectors
QD_COLLECTION_NAME = "collection_name"

# Sample size: the complete dataset is large and would take a long time to process
N = 30_000

# Dataset
- **Title:** Quora Question Pairs
- **Source:** Kaggle Competition
- **Link:** [Quora Question Pairs Competition on Kaggle](https://www.kaggle.com/competitions/quora-question-pairs)

In [4]:
import pandas as pd

df = pd.read_csv(DATA_PATH)

print("Shape of DataFrame:", df.shape)
print("First 10 rows:")
df.head(10)

Shape of DataFrame: (404290, 6)
First 10 rows:


| | id | qid1 | qid2 | question1 | question2 | is_duplicate |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 2 | What is the step by step guide to invest in sh... | What is the step by step guide to invest in sh... | 0 |
| 1 | 1 | 3 | 4 | What is the story of Kohinoor (Koh-i-Noor) Dia... | What would happen if the Indian government sto... | 0 |
| 2 | 2 | 5 | 6 | How can I increase the speed of my internet co... | How can Internet speed be increased by hacking... | 0 |
| 3 | 3 | 7 | 8 | Why am I mentally very lonely? How can I solve... | Find the remainder when [math]23^{24}[/math] i... | 0 |
| 4 | 4 | 9 | 10 | Which one dissolve in water quikly sugar, salt... | Which fish would survive in salt water? | 0 |
| 5 | 5 | 11 | 12 | Astrology: I am a Capricorn Sun Cap moon and c... | I'm a triple Capricorn (Sun, Moon and ascendan... | 1 |
| 6 | 6 | 13 | 14 | Should I buy tiago? | What keeps childern active and far from phone ... | 0 |
| 7 | 7 | 15 | 16 | How can I be a good geologist? | What should I do to be a great geologist? | 1 |
| 8 | 8 | 17 | 18 | When do you use シ instead of し? | When do you use "&" instead of "and"? | 0 |
| 9 | 9 | 19 | 20 | Motorola (company): Can I hack my Charter Moto... | How do I hack Motorola DCX3400 for free internet? | 0 |


## Questions
Extract the questions from the dataset, remove duplicates, and sample a portion of the data to use for the search engine.

In [5]:
# extract the questions from df
questions = pd.concat([df['question1'], df['question2']], axis=0)

# remove all the duplicate questions
questions = questions.drop_duplicates()

# print total number of questions
print("Total Questions:", len(questions))

# sample questions from complete data to avoid long processing
questions = questions.sample(N)

# print first 10 questions
print("First 10 Questions:")
questions.iloc[:10]

Total Questions: 537361
First 10 Questions:


91814                      Is Dubai a good place to settle?
328520    Why is talking to someone "easier" when you ha...
160048    Can Buddha, Jesus and Mohammad be the same per...
384538    Is light energy real in a way in which heat en...
247131                           How are Altoids so strong?
102424                 How do I view Verizon text messages?
190401    Which school is better for BS in computer scie...
153208             How does the US prevent electoral fraud?
81016                           How do I stream live video?
172748              Free party halls in Chennai triplicane?
dtype: object
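
The extraction cell above could be made more defensive in two ways: dropping missing values before indexing, and fixing the random seed so the same sample is drawn on every run. The sketch below shows both on a made-up three-row DataFrame. The function name `extract_unique_questions` and the seed value are assumptions, and whether the real dataset contains missing questions is not shown in this notebook.

```python
import pandas as pd

def extract_unique_questions(df, n, seed=42):
    """Return up to n unique, non-missing questions as a plain list of strings."""
    # Stack both question columns into a single Series
    questions = pd.concat([df["question1"], df["question2"]], axis=0)
    # Drop missing values first, then exact duplicates
    questions = questions.dropna().drop_duplicates()
    # Never ask for more rows than exist; a fixed seed makes the sample repeatable
    n = min(n, len(questions))
    return questions.sample(n, random_state=seed).tolist()

# Toy data (made up) with one missing value and two repeated questions
toy = pd.DataFrame({
    "question1": ["How do I cook rice?", "Why is the sky blue?", None],
    "question2": ["Why is the sky blue?", "What is Qdrant?", "How do I cook rice?"],
})

sample = extract_unique_questions(toy, n=10)
print(sorted(sample))
```

Returning a plain list of strings also avoids relying on how downstream code iterates over a pandas Series.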

# Qdrant

In [6]:
!pip install qdrant-client[fastembed]

Collecting qdrant-client[fastembed]
  Downloading qdrant_client-1.7.3-py3-none-any.whl.metadata (9.3 kB)
Collecting grpcio-tools>=1.41.0 (from qdrant-client[fastembed])
  Downloading grpcio_tools-1.60.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.2 kB)
Collecting httpx>=0.14.0 (from httpx[http2]>=0.14.0->qdrant-client[fastembed])
  Downloading httpx-0.26.0-py3-none-any.whl.metadata (7.6 kB)
Collecting portalocker<3.0.0,>=2.7.0 (from qdrant-client[fastembed])
  Downloading portalocker-2.8.2-py3-none-any.whl.metadata (8.5 kB)
Collecting fastembed==0.1.1 (from qdrant-client[fastembed])
  Downloading fastembed-0.1.1-py3-none-any.whl.metadata (3.8 kB)
Collecting onnxruntime<2.0,>=1.15 (from fastembed==0.1.1->qdrant-client[fastembed])
  Downloading onnxruntime-1.17.0-cp310-cp310-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (4.2 kB)
Collecting tokenizers<0.14,>=0.13 (from fastembed==0.1.1->qdrant-client[fastembed])
  Downloading tokenizers-0

In [7]:
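from qdrant_client import QdrantClient

# Run Qdrant in-process and in memory: no server is needed,
# but the data is lost when the session ends
client = QdrantClient(":memory:")

# Embed each question with the default FastEmbed model and store the vectors
client.add(
    collection_name=QD_COLLECTION_NAME,
    documents=questions,
)

print("Completed")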

100%|██████████| 77.7M/77.7M [00:01<00:00, 44.9MiB/s]


Completed


In [8]:
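def search(query):
    """Print the five indexed questions most similar to `query`."""
    # Embed the query text and retrieve the five nearest stored questions
    results = client.query(
        collection_name=QD_COLLECTION_NAME,
        query_text=query,
        limit=5
    )
    print("Query:", query)
    # Print the matches in ranked order, numbered from 1
    for i, result in enumerate(results):
        print()
        print(f"{i+1}) {result.document}")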

In [9]:
search("what is the best earyly morning meal?")

Query: what is the best earyly morning meal?

1) What’s the best breakfast in the morning?

2) What's the healthiest thing to eat for a quick and easy western breakfast?

3) What high protein foods are good for breakfast?

4) What are the healthiest foods to eat for dinner?

5) What are the best breakfast recipes in India?


In [10]:
search("How should one introduce themselves?")

Query: How should one introduce themselves?

1) How should you not introduce yourself?

2) How can I introspect myself?

3) What would be your answer to the interview question "introduce yourself"?

4) How do I effectively introduce myself in college?

5) How do you get to know yourself?


In [11]:
search("Why is the Earth a sphere?")

Query: Why is the Earth a sphere?

1) Why is the earth called earth?

2) What proof do people who say the Earth is flat and not a sphere have?

3) Why is the earth round?

4) What is the shape of the earth?

5) Why do extraterrestrial bodies always appear as a spherical shape? Why not square or cylindrical?
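
The examples above print only the matched text, so a weak match (such as the dinner question returned for a breakfast query) looks the same as a strong one. One possible refinement is to show a similarity score and drop results below a threshold. The sketch below assumes each result exposes a numeric `score` attribute alongside `document`; the notebook itself only uses `document`. The `Hit` class is a hypothetical stand-in for a search result, and the scores and the `0.8` threshold are made-up values.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """Hypothetical stand-in for one search result: matched text plus a score."""
    document: str
    score: float

def filter_hits(results, min_score=0.8):
    """Keep results scoring at least min_score and format them as numbered lines."""
    kept = [r for r in results if r.score >= min_score]
    return [f"{i}) {r.document} (score={r.score:.2f})" for i, r in enumerate(kept, start=1)]

# Made-up results, ordered from most to least similar
demo_results = [
    Hit("Why is the earth round?", 0.91),
    Hit("Why is the earth called earth?", 0.62),
]

lines = filter_hits(demo_results, min_score=0.8)
print("\n".join(lines))
```

A suitable threshold depends on the embedding model and the data, so it would need to be tuned by inspecting real scores.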


# Explore More

- This notebook has been covered in an article on Medium: [Build a search engine in 5 minutes using Qdrant](https://medium.com/@raoarmaghanshakir040/build-a-search-engine-in-5-minutes-using-qdrant-f43df4fbe8d1)
- [E-Commerce Products Search Engine Using Qdrant](https://www.kaggle.com/code/sacrum/e-commerce-products-search-engine-using-qdrant)
- [Qdrant](https://qdrant.tech)
- [Qdrant Documentation](https://qdrant.tech/documentation/)
- [Qdrant Python Client Documentation](https://python-client.qdrant.tech)
- [Quora Question Pairs](https://www.kaggle.com/competitions/quora-question-pairs)
