In [1]:
!pip install -U sentence-transformers

Collecting sentence-transformers
  Downloading https://files.pythonhosted.org/packages/6a/e2/84d6acfcee2d83164149778a33b6bdd1a74e1bcb59b2b2cd1b861359b339/sentence-transformers-0.4.1.2.tar.gz (64kB)
Collecting transformers<5.0.0,>=3.1.0
  Downloading https://files.pythonhosted.org/packages/88/b1/41130a228dd656a1a31ba281598a968320283f48d42782845f6ba567f00b/transformers-4.2.2-py3-none-any.whl (1.8MB)
Collecting sentencepiece
  Downloading https://files.pythonhosted.org/packages/14/67/e42bd1181472c95c8cda79305df848264f2a7f62740995a46945d9797b67/sentencepiece-0.1.95-cp36-cp36m-manylinux2014_x86_64.whl (1.2MB)
Collecting tokenizers==0.9.4
  Downloading https://files.pythonhosted.org/packages/0f/1c/e789a8b12e28be5bc1ce2156cf87cb522b379be9cadc7ad8091a4cc107c4/tokenizers-0.9.4-cp36-cp36m-manyl

In [2]:
import pandas as pd
import numpy as np
import torch
import time

from typing import Generator
from sentence_transformers import SentenceTransformer, CrossEncoder, util

In [4]:
# load data
dataset_untagged = pd.read_pickle('dataset_untagged.pickle')

In [5]:
# drop np.NaNs
df = dataset_untagged.copy().dropna()

In [6]:
# request to enable GPU 
if not torch.cuda.is_available():
  print("Warning: No GPU found. Please add GPU to your notebook")

## **Semantic Search**

---

The idea is to compute an embedding of the query entered by the user and use cosine similarity to find the `top_k` most similar blocks.
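
As a rough illustration of that ranking step, here is a minimal, dependency-free sketch of cosine-similarity top-k selection. The helper names (`cosine_similarity`, `top_k_blocks`) and the two-dimensional toy vectors are assumptions made for this example; the notebook itself relies on `util.semantic_search` over real model embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_blocks(query_vec, block_vecs, k):
    """Return (block_index, score) pairs for the k most similar blocks."""
    scored = [(idx, cosine_similarity(query_vec, vec))
              for idx, vec in enumerate(block_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy "embeddings" (hypothetical values, not produced by any model)
toy_blocks = [
    [1.0, 0.0],  # same direction as the query
    [0.0, 1.0],  # orthogonal to the query
    [1.0, 1.0],  # 45 degrees away from the query
]
toy_query = [2.0, 0.0]

top = top_k_blocks(toy_query, toy_blocks, k=2)
```

Because cosine similarity depends only on direction, the query `[2.0, 0.0]` matches block 0 perfectly even though their magnitudes differ.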

A block is simply a fixed-length slice (about 230 words) of a video's full transcript, which on its own is one very long string.

---

There are three reasons for this design choice, which is handled by `chunker.py` (refer to the repo):

1. First, some videos can be very long (over ~40 minutes), so their transcripts are **massive** strings, and we need to avoid hitting the input length limits of pre-trained models.

2. Second, and more importantly, it helps to keep inputs at a length close to what the models were trained on, so that they stay as close as possible to the training data and give the best results.

3. Most importantly, splitting transcripts into blocks lets the recommendations target a snippet within a video. The vision is to recommend many snippets from various videos that are highly relevant to the query, rather than the entire videos in which matching snippets were found (these may be long, and much of their content may be unrelated to the query).
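
The splitting described above can be sketched roughly as follows. This is not the repo's `chunker.py`; the function name `chunk_words`, its `block_size` parameter, and the whitespace-based word splitting are assumptions made for illustration only.

```python
from typing import Generator

def chunk_words(transcript: str, block_size: int = 230) -> Generator[str, None, None]:
    """Yield consecutive blocks of at most `block_size` whitespace-separated words.

    Hypothetical helper: assumes block_size is a positive integer.
    """
    words = transcript.split()
    for start in range(0, len(words), block_size):
        yield " ".join(words[start:start + block_size])

# Toy transcript of 500 placeholder words
toy_transcript = " ".join(f"w{i}" for i in range(500))
blocks = list(chunk_words(toy_transcript))
block_lengths = [len(block.split()) for block in blocks]
```

The last block is usually shorter than the rest (here 40 words after two blocks of 230), which is one of the details a production chunker would have to decide how to handle.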

---

In [7]:
# load model (to encode the dataset)
bi_encoder = SentenceTransformer('msmarco-distilbert-base-v2')

# number of blocks we want to retrieve with the bi-encoder
top_k = 50     

# the bi-encoder will retrieve 50 blocks (top_k). 
# we use a cross-encoder, to re-rank the results list to improve the quality.
cross_encoder = CrossEncoder('cross-encoder/ms-marco-TinyBERT-L-6')
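
The two-stage idea in the comments above (a fast retriever narrows the corpus to `top_k` candidates, then a slower but more accurate scorer re-ranks only those) can be sketched without any models. Everything below is a toy under stated assumptions: `retrieve_and_rerank`, `word_overlap`, and `jaccard` are hypothetical stand-ins for the bi-encoder and cross-encoder, not part of `sentence_transformers`.

```python
def word_overlap(query, doc):
    """Cheap stand-in for the bi-encoder: number of shared distinct words."""
    return len(set(query.split()) & set(doc.split()))

def jaccard(query, doc):
    """Stand-in for the cross-encoder: shared words relative to all words."""
    q, d = set(query.split()), set(doc.split())
    return len(q & d) / len(q | d)

def retrieve_and_rerank(query, corpus, cheap_score, precise_score, top_k, top_n):
    # Stage 1: score the whole corpus cheaply and keep the top_k candidates
    candidates = sorted(range(len(corpus)),
                        key=lambda i: cheap_score(query, corpus[i]),
                        reverse=True)[:top_k]
    # Stage 2: apply the expensive scorer to the candidates only, then re-sort
    reranked = sorted(((i, precise_score(query, corpus[i])) for i in candidates),
                      key=lambda pair: pair[1],
                      reverse=True)
    return reranked[:top_n]

toy_corpus = [
    "how to raise a dog and how to feed a cat and how to walk a horse",
    "raise a child",
    "the weather today is sunny",
]
results = retrieve_and_rerank("how to raise a child", toy_corpus,
                              word_overlap, jaccard, top_k=2, top_n=2)
```

In this toy run the first stage prefers document 0 (four shared words against three), but the second stage promotes document 1, which mirrors how a cross-encoder can reorder the bi-encoder's candidates.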


In [8]:
# encode dataset
corpus_embeddings = bi_encoder.encode(df.block.to_list(), convert_to_tensor=True, show_progress_bar=True)

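# send corpus embeddings to GPU
# (encode() already returns a tensor here, so re-wrapping it in torch.tensor() is unnecessary and triggers a warning)
if torch.cuda.is_available():
  corpus_embeddings = corpus_embeddings.cuda()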

In [10]:
# this function will search the dataset for passages that answer the query
def search(query):
  start_time = time.time()

  # build the list of blocks once instead of rebuilding it for every hit
  blocks = df.block.to_list()

  # encode the query using the bi-encoder and find potentially relevant passages
  question_embedding = bi_encoder.encode(query, convert_to_tensor=True)

  # move the query embedding to the same device as the corpus embeddings
  question_embedding = question_embedding.to(corpus_embeddings.device)

  # perform semantic search by computing cosine similarity between corpus and query embeddings
  # return top_k highest similarity matches
  hits = util.semantic_search(question_embedding, corpus_embeddings, top_k=top_k)[0]

  # now, score all retrieved passages with the cross_encoder
  cross_inp = [[query, blocks[hit['corpus_id']]] for hit in hits]
  cross_scores = cross_encoder.predict(cross_inp)

  # sort results by the cross-encoder scores
  for hit, cross_score in zip(hits, cross_scores):
    hit['cross-score'] = cross_score
  hits = sorted(hits, key=lambda x: x['cross-score'], reverse=True)
  end_time = time.time()

  # print output of top-5 hits (for interactive environments only)
  print(f"Input query: {query}")
  print(f"Results (after {round(end_time - start_time, 2)} seconds):")
  for hit in hits[0:5]:
    print("\t{:.3f}\t{}".format(hit['cross-score'], blocks[hit['corpus_id']].replace("\n", " ")))

In [11]:
query = "I feel lost in life. I feel like there is no purpose of living. How should I deal with this?"
search(query)

Input query: I feel lost in life. I feel like there is no purpose of living. How should I deal with this?
Results (after 0.97 seconds):
	0.824	you're only exploring life because nothing else you have access to. Yes. Right now I may think I'm looking at this person. But no, I am only looking at the image that happens in my mind, isn't it so? Right now, if you touch somebody next to you, you think you're experiencing that person's hand. No, you only experience the sensations in your hand, isn't it so? So your entire experience of life is absolutely within you. That means you are capable of experiencing only this one life. So is it a trap that you can only experience this? No. If you experience this, then everything becomes a possibility because, see, when it comes to body, we clearly know this is my body, that's your body, hundred percent. This is my mind, that’s your mind, one hundred percent, isn't it? But when it comes to life, there is no such thing as your life and my life. This is 

In [12]:
query = "I just recently became a parent and I am feeling very nervous. What is the best way to bring up a child?"
search(query)

Input query: I just recently became a parent and I am feeling very nervous. What is the best way to bring up a child?
Results (after 0.69 seconds):
	0.468	Sadhguru: Ohoo! Only your neighbors should see whether your daughter or your child is a girl or a boy. You should never see whether this is a girl or a boy It’s the first thing Neighborhood boys will see that this is a girl, that’s okay You should not be wondering whether this is a boy or a girl, this is just a child. And the best thing you can do for your child is if you think the way you are is everything, naturally your aspiration will be they should become like you, which will be a backward step for next generation of people. What the next generation should be – what you cannot imagine, that’s what they should be. If you mold them how will you mold them? Like yourself, and maybe your parents were better at molding than you, so you will do a worse job than them Because probably your mother, your father at least for your mother may

In [17]:
query = "I had a divorce. I feel like a failure. How should I handle this heartbreak?"
search(query)

Input query: I had a divorce. I feel like a failure. How should I handle this heartbreak?
Results (after 0.87 seconds):
	0.042	the best way to conduct a divorce is immediately jump into another relationship and another relationship of the same kind. No, you will cause much more struggle and turmoil within the system by doing that. It’s extremely important the body has enough time to work out the memory, the body has enough time to keep the memory at a certain distance. Otherwise, you will render yourself to a space, where to make yourself peaceful and joyful will become an extremely hard thing to do in your life. So conducting this process gracefully and well is important as it is important to conduct every aspect of your life gracefully and well. Now, two people, who have shared their emotion, their body, their sensations and their living spaces, ripping it apart is because two memories have merged in many ways, ripping it apart is almost like tearing yourself apart. Even though you m