# Enhancing Search Engine Relevance for Video Subtitles

## Background
In the fast-evolving landscape of digital content, effective search engines play a pivotal role in connecting users with relevant information. For Google, providing a seamless and accurate search experience is paramount. This project focuses on improving the search relevance for video subtitles, enhancing the accessibility of video content.

## Objective
Develop an advanced search engine algorithm that efficiently retrieves subtitles based on user queries, with a specific emphasis on subtitle content. The primary goal is to leverage natural language processing and machine learning techniques to enhance the relevance and accuracy of search results.

## Keyword-based vs Semantic Search Engines
- **Keyword Based Search Engine:** These search engines rely heavily on exact keyword matches between the user query and the indexed documents.
- **Semantic Search Engines:** Semantic search engines go beyond simple keyword matching to understand the meaning and context of user queries and documents.
- **Comparison:** While keyword-based search engines focus primarily on matching exact keywords in documents, semantic-based search engines aim to understand the deeper meaning and context of user queries to deliver more relevant and meaningful search results.

## Core Logic
To compare a user query against a video subtitle document, the core logic involves three key steps:
1. **Preprocessing of Data:** 
   - Read the given data.
   - Observe that the given data is a database file.
   - Go through the README.txt to understand what is there inside the database.
   - Take care of decoding the files inside the database.
   - If you have limited compute resources, you can take a random 30% of the data.
   - Apply appropriate cleaning steps on subtitle documents (whatever is required).

2. **Vectorization:**
   - Experiment with the following to generate text vectors of subtitle documents:
     - BOW / TFIDF to generate sparse vector representations. Note that this will only help you to build a Keyword Based Search Engine.
     - BERT based “SentenceTransformers” to generate embeddings which encode semantic information. This can help us build a Semantic Search Engine.
   - **Document Chunker:** Consider the challenge of embedding large documents: Information Loss. It is often not practical to embed an entire document as a single vector, particularly when dealing with long documents.
     - Divide a large document into smaller, more manageable chunks for embedding.
     - To mitigate accidentally cutting off important text between chunks, set overlapping windows with a specified amount of tokens to overlap so there are tokens shared between chunks.
   - Store embeddings in a ChromaDB database.

3. **Retrieving Documents:**
   - Take the user's search query.
   - Preprocess the query (if required).
   - Create query embedding.
   - Using cosine distance, calculate the similarity score between embeddings of documents and user search query embedding.
   - These cosine similarity scores will help in returning the most relevant candidate documents as per user’s search query.

## Step-by-Step Process

### Part 1: Ingesting Documents
1. **Read the given data.**
2. **Observe that the given data is a database file.**
3. **Go through the README.txt to understand what is there inside the database.**
4. **Take care of decoding the files inside the database.**
5. **If you have limited compute resources, you can take a random 30% of the data.**
6. **Apply appropriate cleaning steps on subtitle documents (whatever is required).**
7. **Experiment with:**
   - BOW / TFIDF to generate sparse vector representations.
   - BERT based “SentenceTransformers” to generate embeddings which encode semantic information.
8. **Document Chunker:**
   - Divide a large document into smaller, more manageable chunks for embedding.
   - Set overlapping windows with a specified amount of tokens to overlap so there are tokens shared between chunks.
9. **Store embeddings in a ChromaDB database.**

### Part 2: Retrieving Documents
1. **Take the user's search query.**
2. **Preprocess the query (if required).**
3. **Create query embedding.**
4. **Using cosine distance, calculate the similarity score between embeddings of documents and user search query embedding.**
5. **These cosine similarity scores will help in returning the most relevant candidate documents as per user’s search query.**



##### Importing all the libraries

In [1]:
import pandas as pd
import sqlite3
import re
from sentence_transformers import SentenceTransformer, util
from chromadb.utils import embedding_functions

  from .autonotebook import tqdm as notebook_tqdm




### 1. connecting to database

##### The data is given the database.db so we have to first extract the data using sqlite 

In [2]:
conn = sqlite3.connect(r"E:\Github Files\Search Engine\eng_subtitles_database.db")
cursor = conn.cursor()
print(cursor)
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
print(cursor.fetchall())

<sqlite3.Cursor object at 0x000001C577427640>
[('zipfiles',)]


In [3]:
df = pd.read_sql_query("""SELECT * FROM zipfiles""", conn)
df.head()

Unnamed: 0,num,name,content
0,9180533,the.message.(1976).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x17\xb9\x...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00L\xb9\x99V...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00U\xa9\x99V...
4,9180600,broker.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x001\xa9\x99V...


In [17]:
df

Unnamed: 0,num,name,content
0,9180533,the.message.(1976).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x17\xb9\x...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00L\xb9\x99V...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00U\xa9\x99V...
4,9180600,broker.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x001\xa9\x99V...
...,...,...,...
82493,9521935,the.prophets.game.(2000).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\xb8\xa6\x...
82494,9521937,west.beirut.(1998).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x13\x97\x...
82495,9521938,frankenstein.the.true.story.(1973).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00$\x97\x9aV...
82496,9521940,frankenstein.the.true.story.(1973).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x00\x97\x...


In [4]:
df.shape

(82498, 3)

In [5]:
df.size

247494

In [6]:
df.shape

(82498, 3)

In [7]:
df.isnull().sum()

num        0
name       0
content    0
dtype: int64

In [8]:
df.describe()

Unnamed: 0,num
count,82498.0
mean,9351228.0
std,98820.55
min,9180533.0
25%,9264094.0
50%,9349568.0
75%,9437720.0
max,9521941.0


In [33]:
from tqdm import tqdm, tqdm_notebook
tqdm.pandas()

##### The data is in bytes so we have to decode it using latin-1

In [34]:
import zipfile
import io
def decomp_decode(data):
    with zipfile.ZipFile(io.BytesIO(data)) as zip_file:
        # Extract the first file in the ZIP archive
        file_list = zip_file.namelist()
        first_file = file_list[0]
        decompressed_data = zip_file.read(first_file)
    return decompressed_data.decode('latin-1')

In [9]:
df['content'] = df['content'].progress_apply(lambda x : decomp_decode(x))

In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8250 entries, 17262 to 73848
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   num      8250 non-null   int64 
 1   name     8250 non-null   object
 2   content  8250 non-null   object
dtypes: int64(1), object(2)
memory usage: 257.8+ KB


In [37]:
df = df.sample(frac=0.1, random_state=42)

### 2. **Preprocessing of Data:** 

##### **Data cleaning** step is crucial part, In data there are html tags ,timeseries, numbers and Special charatrics

In [38]:
def clean_text(text):
    text = re.sub(r'\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}\r\n', '', text)
    text = re.sub(r'\r\n', ' ', text)
    text = re.sub(r'<[^>]+>', '', text)
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    text = re.sub(r'\s+', ' ', text)
    text = text.strip()
    return text
df['subtitle_content'] = df['content'].apply(clean_text)

In [10]:
df.head()

Unnamed: 0,num,name,content
0,9180533,the.message.(1976).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x1c\xa9\x...
1,9180583,here.comes.the.grump.s01.e09.joltin.jack.in.bo...,b'PK\x03\x04\x14\x00\x00\x00\x08\x00\x17\xb9\x...
2,9180592,yumis.cells.s02.e13.episode.2.13.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00L\xb9\x99V...
3,9180594,yumis.cells.s02.e14.episode.2.14.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x00U\xa9\x99V...
4,9180600,broker.(2022).eng.1cd,b'PK\x03\x04\x14\x00\x00\x00\x08\x001\xa9\x99V...


### 3. **Chunking**

##### Dividing a large document into 500 token per chunk, more manageable chunks for embedding, and also applied overlapping windows with a 100 amount of tokens to overlap so there are tokens shared between chunks.

In [50]:
chunk_size = 500  # Number of tokens per chunk
overlap = 100     # Number of tokens to overlap between chunks

def split_into_chunks(text, chunk_size, overlap):
    tokens = text.split()
    chunks = []
    start = 0
    while start < len(tokens):
        chunk = tokens[start:start + chunk_size]
        chunks.append(' '.join(chunk))
        start += chunk_size - overlap
    return chunks

def your_processing_function(chunk):
    return chunk.lower()


chunks_list = []

# Iterate over rows
for index, row in df.iterrows():
    tokens_accumulator = ''  # Reset accumulator for each row
    tokens_accumulator += row['subtitle_content'] + ' '  # Append the text of the current row to the accumulator

    # Split accumulated tokens into overlapping chunks
    chunks = split_into_chunks(tokens_accumulator, chunk_size, overlap)
    for chunk_text in chunks:
        # Process the chunk here (e.g., apply processing function)
        processed_chunk = your_processing_function(chunk_text)
        chunks_list.append(processed_chunk)

# Assign chunks to DataFrame
df['chunks'] = chunks_list[:len(df)]  




In [53]:
df1 = df.drop(['content','subtitle_content'],axis=1)

### 4. **Vectorization:**

# Bert Sentence Transfromer 

##### BERT based “SentenceTransformers which generate embeddings which encode semantic information. This can help us build a Semantic Search Engine.

In [10]:
# pip install -U sentence-transformers

In [11]:
model = SentenceTransformer('all-MiniLM-L6-v2')

In [56]:
sentence_embedding = embedding_functions.SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

In [72]:
import chromadb

In [73]:
client = chromadb.PersistentClient(path="E:/Github Files/Search Engine")

In [74]:
collection = client.create_collection(
        name="subtitle_sem",
        metadata={"hnsw:space": "cosine"},
        embedding_function=sentence_embedding
    )

In [125]:
content = df1['chunks'].tolist()
metadatas = [{'subtitle_name': name, 'id': id} for name, id in zip(df1['name'], df1['num'])] 
ids = [str(i) for i in range(len(df1))]

In [83]:
metadatas

[{'subtitle_name': 'outlander.s02.e08.the.foxs.lair.(2016).eng.1cd',
  'id': 9396680},
 {'subtitle_name': 'easter.sunday.(2022).eng.1cd', 'id': 9212121},
 {'subtitle_name': 'the.cuphead.show.s03.e08.down.out.(2022).eng.1cd',
  'id': 9317467},
 {'subtitle_name': 'under.the.vines.s02.e06.episode.2.6.(2023).eng.1cd',
  'id': 9434064},
 {'subtitle_name': 'ancients.behaving.badly.s01.e07.genghis.khan.(2009).eng.1cd',
  'id': 9340687},
 {'subtitle_name': 'nighty.night.s01.e02.episode.1.2.(2004).eng.1cd',
  'id': 9210811},
 {'subtitle_name': 'big.love.s01.e06.robertas.funeral.(2006).eng.1cd',
  'id': 9195925},
 {'subtitle_name': 'idolish7.s01.e12.5.ri.to.2.ri.(2018).eng.1cd',
  'id': 9260677},
 {'subtitle_name': 'kung.fu.s03.e07.villains.(2022).eng.1cd', 'id': 9316153},
 {'subtitle_name': '90210.s02.e21.javianna.(2010).eng.1cd', 'id': 9367888},
 {'subtitle_name': 'spaces.deepest.secrets.spaces.great.wall.().eng.1cd',
  'id': 9352125},
 {'subtitle_name': 'fantastic.voyage.s01.e11.the.spy.satel

In [84]:
batch_size = 5000  
num_batches = (len(content) + batch_size - 1) // batch_size 

for i in range(num_batches):
    start_idx = i * batch_size
    end_idx = min((i + 1) * batch_size, len(content))
    
    batch_content = content[start_idx:end_idx]
    batch_metadatas = metadatas[start_idx:end_idx]
    batch_ids = ids[start_idx:end_idx]
    
    collection.add(
        documents=batch_content,
        metadatas=batch_metadatas,
        ids=batch_ids
    ) 

In [85]:
query_text = 'you can mash them with milk oh ' 

In [86]:
result = collection.query(
    query_texts = query_text,
    include=["metadatas", 'distances'],
    n_results=10
)

In [112]:
result

{'ids': [['1', '814', '822', '88', '692', '356', '517', '290', '824', '210']],
 'distances': [[0.6735456585884094,
   0.7436150431043322,
   0.7464099003070729,
   0.7570045590400696,
   0.7650249600410461,
   0.7689965963363647,
   0.7716805934906006,
   0.7819569110870361,
   0.7841434248958915,
   0.7923332452774048]],
 'metadatas': [[{'id': 9212121,
    'subtitle_name': 'easter.sunday.(2022).eng.1cd'},
   {'id': 9191312,
    'subtitle_name': 'ghost.adventures.s16.e04.old.gila.county.jail.and.courthouse.(2018).eng.1cd'},
   {'id': 9191630,
    'subtitle_name': 'heartland.s10.e01.there.will.be.changes.(2016).eng.1cd'},
   {'id': 9200328, 'subtitle_name': 'spud.(2010).eng.1cd'},
   {'id': 9243158,
    'subtitle_name': 'fate.the.winx.saga.s02.e03.your.newfound.popularity.().eng.1cd'},
   {'id': 9304594,
    'subtitle_name': 'zootopia.s01.e02.the.real.rodents.of.little.rodentia.(2022).eng.1cd'},
   {'id': 9493695, 'subtitle_name': 'irl.in.real.love.(2023).eng.1cd'},
   {'id': 9316022, '

In [123]:
ids = result['ids'][0]
distances = result['distances'][0] 
metadatas = result['metadatas'][0] 
zipped_data = zip(ids, distances, metadatas)
sorted_data = sorted(zipped_data, key=lambda x: x[1], reverse=True)
for _, distance, metadata in sorted_data:
    subtitle_name = metadata['subtitle_name']
    print(f"Subtitle Name: {subtitle_name.upper()}")

Subtitle Name: HONG.KONG.FAMILY.(2022).ENG.1CD
Subtitle Name: NATURES.STRANGEST.MYSTERIES.SOLVED.S01.E13.CUDDLY.SHARK.(2019).ENG.1CD
Subtitle Name: 1899.S01.E08.THE.KEY.(2022).ENG.1CD
Subtitle Name: IRL.IN.REAL.LOVE.(2023).ENG.1CD
Subtitle Name: ZOOTOPIA.S01.E02.THE.REAL.RODENTS.OF.LITTLE.RODENTIA.(2022).ENG.1CD
Subtitle Name: FATE.THE.WINX.SAGA.S02.E03.YOUR.NEWFOUND.POPULARITY.().ENG.1CD
Subtitle Name: SPUD.(2010).ENG.1CD
Subtitle Name: HEARTLAND.S10.E01.THERE.WILL.BE.CHANGES.(2016).ENG.1CD
Subtitle Name: GHOST.ADVENTURES.S16.E04.OLD.GILA.COUNTY.JAIL.AND.COURTHOUSE.(2018).ENG.1CD
Subtitle Name: EASTER.SUNDAY.(2022).ENG.1CD


In [131]:
import re
import chromadb
from sentence_transformers import  SentenceTransformer


In [134]:
from sentence_transformers import SentenceTransformer

model_name = 'all-MiniLM-L6-v2' 
model = SentenceTransformer(model_name, device='cpu')

In [39]:
df1['encoding'] = df1.chunks.progress_apply(model.encode)

In [141]:
import chromadb
client = chromadb.PersistentClient(path="searchengine_database")

collection = client.get_or_create_collection(name="search_engine", metadata={"hnsw:space": "cosine"})


### 5. Store embeddings in a ChromaDB database.

In [151]:
import chromadb
import numpy as np

client = chromadb.PersistentClient(path="searchengine_database")
collection = client.get_or_create_collection(name="search_engine", metadata={"hnsw:space": "cosine"})

def encoder(df):
    for i in range(df.shape[0]): 
        # Convert embedding to list
        embedding_list = df['encoding'].iloc[i].tolist()
        collection.add(
            documents=[df['name'].iloc[i]], 
            embeddings=[embedding_list],  # Pass embedding as list
            ids=[str(df['num'].iloc[i])]
        )


%time encoder(df1)


CPU times: total: 5.05 s
Wall time: 8.49 s




### Part 2: **Retrieving Documents**
1. Take the user's search query.
2. Preprocess the query (if required).
3. Create query embedding.
4. Using cosine distance, calculate the similarity score between embeddings of documents and user search query embedding.
5. These cosine similarity scores will help in returning the most relevant candidate documents as per user’s search query.

In [None]:
import re
import chromadb
from sentence_transformers import SentenceTransformer

# Initializing chromaDB
client = chromadb.PersistentClient(path="searchengine_database")
collection = client.get_collection(name="search_engine") 

model_name="paraphrase-MiniLM-L3-V2"
model = SentenceTransformer(model_name, device="cpu")

def clean_data(data): # data is the entire text file entry in the dataframe
    
    data = re.sub(r'\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}\r\n', '', data)
    data = re.sub(r'\r\n', ' ', data)
    data = re.sub(r'<[^>]+>', '', data)
    data = re.sub(r'[^a-zA-Z\s]', '', data)
    data = re.sub(r'\s+', ' ', data)
    data = data.strip()
    return data

def extract_id(id_list):
    new_id_list=[]
    for item in id_list:
        match = re.match(r'^(\d+)', item)
        if match:
            extracted_number = match.group(1)
            new_id_list.append(extracted_number)
    return new_id_list

search_query = input("Enter a dialogue to search: ")

search_query = clean_data(search_query)
query_embed = model.encode(search_query).tolist()

search_results = collection.query(query_embeddings=query_embed, n_results=10)
id_list = search_results['ids'][0]

id_list = extract_id(id_list)
print(id_list)
for id in id_list:
    file_name = collection.get(ids=f"{id}")["documents"][0]
    print(f"https://www.opensubtitles.org/en/subtitles/{id}")


Enter a dialogue to search: the file is
['9316395', '9449504', '9246955', '9392842', '9445441', '9480342', '9274603', '9200158', '9429797', '9343626']
https://www.opensubtitles.org/en/subtitles/9316395
https://www.opensubtitles.org/en/subtitles/9449504
https://www.opensubtitles.org/en/subtitles/9246955
https://www.opensubtitles.org/en/subtitles/9392842
https://www.opensubtitles.org/en/subtitles/9445441
https://www.opensubtitles.org/en/subtitles/9480342
https://www.opensubtitles.org/en/subtitles/9274603
https://www.opensubtitles.org/en/subtitles/9200158
https://www.opensubtitles.org/en/subtitles/9429797
https://www.opensubtitles.org/en/subtitles/9343626


# Streamlit code

##### productionization

In [None]:
import re
import chromadb
import streamlit as st
from sentence_transformers import SentenceTransformer

# Initializing chromaDB
client = chromadb.PersistentClient(path="searchengine_database") #_test_db
collection = client.get_collection(name="search_engine") #test_collection
# collection_name = client.get_collection(name="search_engine_FileName")
model_name = "paraphrase-MiniLM-L3-V2"
model = SentenceTransformer(model_name, device="cpu")

def clean_data(data): 
    data = re.sub(r'\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}\r\n', '', data)
    data = re.sub(r'\r\n', ' ', data)
    data = re.sub(r'<[^>]+>', '', data)
    data = re.sub(r'[^a-zA-Z\s]', '', data)
    data = re.sub(r'\s+', ' ', data)
    data = data.strip()
    return data

def extract_id(id_list):
    new_id_list = []
    for item in id_list:
        match = re.match(r'^(\d+)', item)
        if match:
            extracted_number = match.group(1)
            new_id_list.append(extracted_number)
    return new_id_list

st.title("Subtitle Search Engine") 

with st.form("search_form"):
    search_query = st.text_input("Enter a dialogue to search:", key="search_query")
    submit_button = st.form_submit_button(label="Search")

if submit_button:
    search_query = clean_data(search_query)
    query_embed = model.encode(search_query).tolist()

    search_results = collection.query(query_embeddings=query_embed, n_results=10)
    id_list = search_results['ids'][0]

    id_list = extract_id(id_list)
    
    with st.expander("Relevant Subtitle Files", expanded=True):
        for index, id in enumerate(id_list, start=1):
            file_name = collection.get(ids=f"{id}")["documents"][0]
            st.markdown(f"**ID: {index}** - [{file_name}](https://www.opensubtitles.org/en/subtitles/{id})")


<img st>