<a href="https://colab.research.google.com/github/desaibhargav/VR/blob/main/notebooks/Semantic_Search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Dependencies**

In [None]:
!pip install -U -q sentence-transformers
!git clone https://github.com/desaibhargav/VR.git

  Building wheel for sentence-transformers (setup.py) ... done
  Building wheel for sacremoses (setup.py) ... done
Cloning into 'VR'...
remote: Enumerating objects: 142, done.
remote: Counting objects: 100% (142/142), done.
remote: Compressing objects: 100% (109/109), done.
remote: Total 142 (delta 66), reused 71 (delta 19), pack-reused 0
Receiving objects: 100% (142/142), 4.63 MiB | 8.66 MiB/s, done.
Resolving deltas: 100% (66/66), done.


## **Imports**

In [None]:
import pandas as pd
import numpy as np
import torch
import time

from typing import Generator
from sentence_transformers import SentenceTransformer, CrossEncoder, util
from VR.backend.chunker import Chunker

## **Dataset**

In [None]:
# load scraped data (using youtube_client.py)
dataset = pd.read_pickle('VR/datasets/youtube_scrapped.pickle')

# split transcripts of videos to smaller blocks or chunks (using chunker.py)
chunked = Chunker(chunk_by='length', expected_threshold=100, min_tolerable_threshold=75).get_chunks(dataset)

# finally, create dataset
dataset_untagged = dataset.join(chunked).drop(columns=['subtitles', 'timestamps'])
df = dataset_untagged.copy().dropna()
print(f"Average length of block: {df.length_of_block.mean()}, Standard Deviation: {df.length_of_block.std()}")

Average length of block: 107.33691183188671, Standard Deviation: 12.755147492715611


## **Semantic Search**

---

The idea is to compute an embedding of the user's query and use cosine similarity to find the `top_k` most similar blocks.

A block is a piece of a video transcript: the full transcript (one long string) is split into shorter strings of roughly fixed length (~100 words each).
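
To make the ranking step concrete, here is a minimal pure-Python sketch of cosine-similarity top-k retrieval. The helper names (`cosine_similarity`, `top_k_blocks`) and the 3-dimensional toy vectors are illustrative assumptions; the notebook itself relies on `util.semantic_search` over real bi-encoder embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)

def top_k_blocks(query_vec, block_vecs, k):
    """Return (block_index, score) pairs for the k most similar blocks."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(block_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# toy "embeddings" (hypothetical values, not produced by any model)
toy_blocks = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 0.1, 0.0]
toy_hits = top_k_blocks(toy_query, toy_blocks, k=2)
print(toy_hits)
```

Because cosine similarity depends only on direction, blocks of different lengths can be compared without any length normalization beyond the embedding itself.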

---

There are three reasons for this design choice, which is implemented in `chunker.py` (see the repo):

1. First and foremost, some videos can be very long (over ~40 minutes), which means their transcript is a **massive** string, and we need to avoid hitting the input length limits of pre-trained models.

2. Secondly, and more importantly, it is good practice to keep inputs at a length similar to what the models were trained on (staying as close as possible to the training distribution for optimum results).

3. But perhaps most importantly, splitting transcripts into blocks lets the recommendations target a snippet within a video. The vision is to recommend many snippets from various videos that are highly relevant to the query, rather than the entire videos in which matching snippets were found (which may be long and whose content may not always be related to the query).
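
The splitting idea can be sketched in a few lines. This is not the repo's `Chunker`; the function name `chunk_by_length`, its parameters, and the rule of merging a short trailing remainder into the previous block are assumptions made for illustration, loosely mirroring the `expected_threshold=100` and `min_tolerable_threshold=75` arguments used above.

```python
def chunk_by_length(transcript, target_words=100, min_words=75):
    """Split a transcript into consecutive blocks of about target_words words.

    A trailing remainder shorter than min_words is merged into the previous
    block rather than kept as a very short block of its own.
    """
    words = transcript.split()
    blocks = [words[i:i + target_words] for i in range(0, len(words), target_words)]
    if len(blocks) > 1 and len(blocks[-1]) < min_words:
        tail = blocks.pop()
        blocks[-1].extend(tail)
    return [" ".join(block) for block in blocks]

# a synthetic 230-word "transcript" (hypothetical data)
demo_transcript = " ".join(f"w{i}" for i in range(230))
demo_blocks = chunk_by_length(demo_transcript)
print([len(block.split()) for block in demo_blocks])
```

Splitting on word count alone can cut a sentence in half; a sentence-aware splitter would avoid that at the cost of less uniform block lengths.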

---

In [None]:
# request to enable GPU 
if not torch.cuda.is_available():
  print("Warning: No GPU found. Please add GPU to your notebook")

In [None]:
# load model (to encode the dataset)
bi_encoder = SentenceTransformer('paraphrase-distilroberta-base-v1')

# number of blocks we want to retrieve with the bi-encoder
top_k = 200     

# the bi-encoder retrieves the top_k most similar blocks;
# a cross-encoder then re-ranks that list to improve the quality of the results.
cross_encoder = CrossEncoder('cross-encoder/ms-marco-electra-base')


In [None]:
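# encode dataset (with convert_to_tensor=True the result is already a torch tensor)
corpus_embeddings = bi_encoder.encode(df.block.to_list(), convert_to_tensor=True, show_progress_bar=True)

# move corpus embeddings to the GPU when one is available
if torch.cuda.is_available():
  corpus_embeddings = corpus_embeddings.cuda()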



In [None]:
# this function will search the dataset for passages that answer the query
def search(query):
  start_time = time.time()

  # materialize the blocks once instead of rebuilding the list for every hit
  blocks = df.block.to_list()

  # encode the query using the bi-encoder and find potentially relevant passages
  question_embedding = bi_encoder.encode(query, convert_to_tensor=True)

  # keep the query embedding on the same device as the corpus embeddings
  question_embedding = question_embedding.to(corpus_embeddings.device)

  # perform semantic search by computing cosine similarity between corpus and query embeddings
  # return top_k highest similarity matches
  hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)[0]

  # now, score all retrieved passages with the cross_encoder
  cross_inp = [[query, blocks[hit['corpus_id']]] for hit in hits]
  cross_scores = cross_encoder.predict(cross_inp)

  # sort results by the cross-encoder scores
  for hit, cross_score in zip(hits, cross_scores):
    hit['cross-score'] = cross_score
  hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
  end_time = time.time()

  # print the top-10 hits (for interactive environments only)
  print(f"Input query: {query}")
  print(f"Results (after {round(end_time - start_time, 2)} seconds):")
  for hit in hits[:10]:
    print("\t{:.3f}\t{}".format(hit['cross-score'], blocks[hit['corpus_id']].replace("\n", " ")))

## **Try some queries!**

In [None]:
query = "I feel lost in life. I feel like there is no purpose of living. How should I deal with this?"
search(query)

Input query: I feel lost in life. I feel like there is no purpose of living. How should I deal with this?
Results (after 4.19 seconds):
	0.959	It is the pettiness of one’s mind that it’ll seek a meaning because psychologically you will feel kind of unconnected with life if you don’t have a purpose and a meaning. People are constantly trying to create these false purposes. Now, they were quite fine and happy. Suddenly, they got married. Now the purpose is the other person. Then they have children. Now they become miserable with each other. Now the whole purpose that I go through all this misery is because the children. Like this, it goes on. These are things that you’re causing and holding these as purposes of life and is there a God-given purpose? What if God does not know you exist? No, I am just asking, by chance. (Laughter) I am saying in this huge cosmos, for which God is supposed to be the Creator and the manager of these hundred billion galaxies, in that this tiny little planet
	

In [None]:
query = "I just recently became a parent and I am feeling very nervous. What is the best way to bring up a child?"
search(query)

Input query: I just recently became a parent and I am feeling very nervous. What is the best way to bring up a child?
Results (after 2.34 seconds):
	0.561	Sadhguru, what should be the role of a good parent in today's world? See, parenthood is a very funny thing You're trying to do something that nobody has ever known how to do it well Yes? Nobody has ever known what is the best way to parent their children Even if you have 12 children, you are still learning You may raise eleven properly the twelfth one can give you works, you know? So.. But you want to do your best what is the best thing you can do? One foremost thing I would say is First thing is to work upon yourself a little bit.
	0.052	it doesn’t matter what their problem is. If you leave that level of openness and friendship with them, if they come to you first, there is every possibility that they won’t get lost on something, isn't it? Especially in a society like this, where the moment the child steps out, you don’t know what i

In [None]:
query = "I had a divorce. I feel like a failure. How should I handle this heartbreak?"
search(query)

Input query: I had a divorce. I feel like a failure. How should I handle this heartbreak?
Results (after 2.41 seconds):
	0.323	But for some reason, you have come to that situation where this is this has to happen - you need to understand this, that divorce essentially means you have chosen to kill something, which is a part of you, because what you call as myself is just a certain volume of memory. Now, to how to conduct this gracefully? Most people think the best way to conduct a divorce is immediately jump into another relationship and another relationship of the same kind. No, you will cause much more struggle and turmoil within the system by doing that. It’s extremely important
	0.109	ripping it apart is almost like tearing yourself apart. Even though you might have begun to almost come to a place, where you can’t stand the person anymore, still it hurts, simply because you’re trying to rip out a memory, which is you, because you are a bundle of memory. If one does the necessary sp

In [None]:
query = "How to be confident while making big decisions in life?"
search(query)

Input query: How to be confident while making big decisions in life?
Results (after 2.63 seconds):
	0.984	somebody says “I am doing this.” somebody says “I am doing this.” So one thing that all of you should do before you make big decisions in your life is, withdraw from these pressures of peers, professors, parents, everybody. Just spend three days to one week by yourself. Look at it, what is it that you really want to do? Not under pressure from other people. What does this life want to do? Do that! It doesn't matter what other people think about it  (Applause)
	0.692	“I do not know,” the longing to know, the seeking to know and the possibility of knowing becomes a living reality. Whatever you don’t know, you believe. If you believe whatever you do not know, you will become confident without clarity.  Confidence without clarity is a disastrous process. Where there is no clarity, it is better there is hesitation. If clarity comes, let’s do everything. If there is no clarity, we should

## **Semantic Search x Auxiliary Features**

This section is under active development. 

---

The purpose of this section is to explore two primary frontiers:

1. Semantic search alone yields satisfactory results, but at a high compute cost, and the bottleneck is the cross-encoder step. This section explores how to reduce the search area so that the cross-encoder re-ranks a smaller number of blocks, significantly cutting down the recommendation time.

2. Beyond the content itself, several other features present in the dataset, such as video statistics (views, likes, dislikes), video titles, video descriptions, and video tags, can be leveraged to improve the recommendations.
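
One possible way to approach the first point is to pass only a short list of the best bi-encoder hits to the expensive scorer. The sketch below is an assumption about how that could look, not the project's implementation: `rerank_shortlist`, `fake_cross_encoder`, and the toy scores are hypothetical stand-ins, with hits shaped like the dictionaries returned by `util.semantic_search` (`corpus_id`, `score`).

```python
def rerank_shortlist(hits, expensive_score, shortlist_size):
    """Re-score only the best `shortlist_size` hits with the expensive scorer.

    hits: list of dicts with 'corpus_id' and 'score' (the cheap bi-encoder score).
    expensive_score: callable mapping a corpus_id to a float (cross-encoder stand-in).
    """
    shortlist = sorted(hits, key=lambda h: h["score"], reverse=True)[:shortlist_size]
    rescored = [dict(h, cross_score=expensive_score(h["corpus_id"])) for h in shortlist]
    return sorted(rescored, key=lambda h: h["cross_score"], reverse=True)

# stand-in for the cross-encoder that records which blocks it was asked to score
scored_ids = []
def fake_cross_encoder(corpus_id):
    scored_ids.append(corpus_id)
    return {0: 0.2, 1: 0.9, 2: 0.5, 3: 0.99}[corpus_id]

demo_hits = [
    {"corpus_id": 0, "score": 0.8},
    {"corpus_id": 1, "score": 0.7},
    {"corpus_id": 2, "score": 0.6},
    {"corpus_id": 3, "score": 0.1},
]
reranked = rerank_shortlist(demo_hits, fake_cross_encoder, shortlist_size=3)
print([h["corpus_id"] for h in reranked], scored_ids)
```

The toy data also shows the trade-off: block 3 would have received the highest expensive score, but it is never evaluated because its cheap score keeps it off the shortlist.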

---
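
For the second point, one simple option is to blend the relevance score with a statistic derived from likes and dislikes. This is only a sketch under stated assumptions: the function names, the Laplace-style smoothing, and the default `weight=0.2` are hypothetical choices, not values taken from the project.

```python
def like_ratio(likes, dislikes):
    """Smoothed share of positive reactions, strictly between 0 and 1."""
    return (likes + 1) / (likes + dislikes + 2)

def blended_score(relevance, likes, dislikes, weight=0.2):
    """Weighted mix of relevance and like ratio; weight=0 ignores the statistics."""
    return (1 - weight) * relevance + weight * like_ratio(likes, dislikes)

# two hypothetical candidates with nearly equal relevance
well_liked = blended_score(relevance=0.60, likes=900, dislikes=100)
poorly_liked = blended_score(relevance=0.62, likes=10, dislikes=90)
print(well_liked > poorly_liked)
```

The smoothing keeps videos with very few reactions near a neutral 0.5 instead of letting a single like or dislike dominate the ratio.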

