## Semantic Search Engine

### Installing Required Libraries

In [18]:
pip install transformers sentence-transformers torch



### Importing Required Libraries

In [19]:
import pandas as pd
from sentence_transformers import SentenceTransformer, util
import os
import time
import torch

### Mounting Google Drive

In [20]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Loading Dataset

In [21]:
df = pd.read_csv('/content/drive/MyDrive/Semantic/quora_titles.csv')
df.head(3)

Unnamed: 0.1,Unnamed: 0,Titles
0,0,
1,1,Clinicopathological Features of Invasive Breas...
2,2,Exploration of T cell immune responses by expr...


### Checking shape of the dataset

In [22]:
df.shape

(115175, 2)

### Checking for missing values

In [23]:
df.isnull().sum()

Unnamed: 0,0
Unnamed: 0,0
Titles,1


### Dropping missing values

In [24]:
# There is 1 missing value in Titles, so drop the row that contains it
df.dropna(inplace=True)

### Extracting Titles from the DataFrame

In [25]:
titles = df['Titles'].to_list()
titles[:15]

['Clinicopathological Features of Invasive Breast Cancer: A Five-Year Retrospective Study in Southern and South-Western Ethiopia.',
 'Exploration of T cell immune responses by expression of a dominant-negative SHP1 and SHP2.',
 'First insights into region-specific lipidome alterations of prefrontal cortex and hippocampus of mice exposed chronically to microcystins.',
 'Continuous Monitoring of Health and Mobility Indicators in Patients with Cardiovascular Disease: A Review of Recent Technologies.',
 'Uses and Considerations for Cinematic Virtual Reality in Health Care.',
 'Role of Autoerythrocyte Sensitization Test in the Diagnosis of Recurrent Spontaneous Bruising.',
 'Allergy in Cancer Care: Antineoplastic Therapy-Induced Hypersensitivity Reactions.',
 'Good clinical practice and the use of hypofractionation radiation schedules as weapons to reduce the risk of COVID-19 infections in radiation oncology unit: A mono-institutional experience.',
 'Critical COVID-19 patients through first

### Loading the pre-trained `LaBSE` (Language-agnostic BERT Sentence Embedding) model with SentenceTransformer

In [26]:
model = SentenceTransformer('LaBSE')



### Checking the number of titles

In [27]:
len(titles)

115174

### Generating Sentence Embeddings for the first 15,000 titles

In [28]:
embed = model.encode(titles[:15000], show_progress_bar=True, convert_to_tensor=True)

Batches:   0%|          | 0/469 [00:00<?, ?it/s]
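The `0/469` total in the progress bar follows from the batch size. The cell above does not set `batch_size`, so this sketch assumes the `encode` default of 32; `num_batches` is a hypothetical helper used only to check the arithmetic.

```python
import math


def num_batches(n_items: int, batch_size: int = 32) -> int:
    """Number of batches needed to cover n_items (the last batch may be partial)."""
    return math.ceil(n_items / batch_size)


# 15,000 titles at an assumed batch size of 32
print(num_batches(15000))   # 469, matching the progress bar total
# Batches needed to encode the full list of 115,174 titles
print(num_batches(115174))  # 3600
```

Encoding the full title list would therefore take roughly 7.7 times as many batches as the 15,000-title subset.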

### Semantic Search Function

We define a function `search` that performs semantic search over the pre-encoded titles. It takes an input query, computes its embedding with the same model, and ranks the titles by cosine similarity to the query.

In [29]:
def search(inp_question):
    # Time the query encoding plus the similarity search
    start_time = time.time()
    question_embedding = model.encode(inp_question, convert_to_tensor=True)
    # Cosine-similarity search against the pre-computed title embeddings
    hits = util.semantic_search(question_embedding, embed)
    end_time = time.time()
    hits = hits[0]  # results for the first (and only) query
    print("Input question:", inp_question)
    print("Results (after {:.3f} seconds):".format(end_time - start_time))
    # Print only the top-ranked hit
    for hit in hits[0:1]:
        print("\t{:.3f}\t{}".format(hit['score'], titles[hit['corpus_id']]))
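To make the ranking step concrete, here is a minimal pure-Python sketch of cosine-similarity top-k search. It is not the library's implementation: `util.semantic_search` works on tensors and in chunks, while the hypothetical helpers `cosine_similarity` and `top_k_hits` below use plain lists and toy 2-dimensional vectors. The sketch assumes no vector is all zeros.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_hits(query_vec, corpus_vecs, top_k=10):
    """Score every corpus vector against the query and keep the best top_k."""
    scored = [
        {"corpus_id": i, "score": cosine_similarity(query_vec, vec)}
        for i, vec in enumerate(corpus_vecs)
    ]
    # Highest similarity first
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:top_k]


# Toy corpus of three 2-dimensional "embeddings" (illustrative values only)
toy_corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k_hits([1.0, 0.1], toy_corpus, top_k=2)
for hit in hits:
    print(hit["corpus_id"], round(hit["score"], 3))
```

Each hit is a dictionary with `corpus_id` and `score` keys, the same shape that `search` reads from the `util.semantic_search` results.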

### Running the Semantic Search to find the most relevant title for the query

In [30]:
search("Men and Women")

Input question: Men and Women
Results (after 0.198 seconds):
	0.416	Sexual Anatomy and Function in Women With and Without Genital Mutilation: A Cross-Sectional Study.
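Because `search` slices the results with `hits[0:1]`, only the single best match is printed. The sketch below shows one way to format several hits instead. `format_hits` is a hypothetical helper, and the sample titles and scores are placeholders, not results from the dataset.

```python
def format_hits(hits, titles, top_k=5):
    """Format the first top_k hits as 'score<TAB>title' strings."""
    lines = []
    for hit in hits[:top_k]:
        lines.append("{:.3f}\t{}".format(hit["score"], titles[hit["corpus_id"]]))
    return lines


# Placeholder data in the same shape as the hits used in search()
sample_titles = ["Title A", "Title B", "Title C"]
sample_hits = [
    {"corpus_id": 2, "score": 0.9},
    {"corpus_id": 0, "score": 0.5},
    {"corpus_id": 1, "score": 0.1},
]
for line in format_hits(sample_hits, sample_titles, top_k=2):
    print(line)
```

Inside `search`, the same effect could be had by widening the slice, for example `hits[0:5]`, since the hits are already sorted by decreasing score.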
