<a href="https://colab.research.google.com/github/fshivam/Semantic-Video-Recommender-System/blob/master/Zero_Shot.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **This notebook is addressed to Dr. Parul Shah and is to be considered alongside the draft of the approach proposed regarding the Sadhguru NLP Project discussed.**




---

We recommend this notebook be run on a GPU runtime by:

**`Runtime -> Change Runtime Type -> Hardware Accelerator -> GPU -> Save`**

*Note: The notebook is compatible with CPU runtimes too with no changes to the code.*

*Note: Some libraries only work as expected after the runtime has been restarted:* **`Runtime -> Restart Runtime`**


---



## **Install required libraries**

In [None]:
!pip3 install -q SpeechRecognition
!pip3 install -q youtube-dl  
!pip3 install -q pydub
!pip3 install -q youtube_transcript_api
!python -m spacy download en_core_web_lg
!pip3 install -q pytube3
!pip3 install -q pyspellchecker
!pip install -q --upgrade google-api-python-client
!pip install -q --upgrade google-auth-oauthlib google-auth-httplib2
!pip install -U sentence-transformers
!pip3 install -q transformers==2.11.0


## **Import Required Libraries**

In [None]:
import numpy as np
import textwrap
import urllib
import json
import spacy
import en_core_web_lg
import re
import math 
import torch
import pandas as pd
import pickle
import os
import google_auth_oauthlib.flow
import googleapiclient.discovery
import googleapiclient.errors
import time
import nltk


from transformers import pipeline
from transformers import BartForSequenceClassification, BartTokenizer
from sentence_transformers import SentenceTransformer
from youtube_transcript_api import YouTubeTranscriptApi
from urllib.parse import urlparse, parse_qs
from spellchecker import SpellChecker 
from tqdm import tqdm
from sklearn.metrics.pairwise import cosine_similarity
from pprint import pprint
from dateutil import parser
from nltk.tag import pos_tag

AttributeError: ignored

## **Configure GPU for Accelerating Text Summarization Model**



---
Using a GPU dramatically reduces the inference time of the BART model used to generate summaries from video transcripts. This helps the program scale to thousands of transcripts.

*Note: Ignore this cell if running on a CPU Runtime*


---




In [None]:
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Select the Runtime → "Change runtime type" menu to enable a GPU accelerator, ')
  print('and then re-execute this cell.')
else:
  print(gpu_info)

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Prints cuda:0 when a GPU is available, otherwise cpu:

print(device)

## **Define the AutoSynop Class**



---
The AutoSynop class does the following:



1.  `__init__()`: Stores the API key used to access YouTube, loads the spaCy model, downloads the required NLTK resources and downloads the text summarization model.

2.  `get_metadata_for_videos(playlists)`: Fetches the metadata (video ID and title) of all videos in each of the given playlists.

3.  `scrap_duplicates(video_metadata)`: Playlists sometimes contain the same video uploaded under different video IDs. This method discards any video whose title has already been seen and keeps only the unique videos.

4.  `get_video_ids(cleaned_metadata)` and `get_video_titles(cleaned_metadata)`: Extract the video IDs and the video titles from the de-duplicated metadata returned by `scrap_duplicates(video_metadata)`.

5. `get_transcripts(cleaned_metadata)`: Fetches, for each video ID, the transcript, the timed version of the transcript (the caption lines as they appear on the video screen) and the end time (`start + duration`) of each caption line.

6. `process_and_split_transcript(transcripts, timed_transcripts, chrono_of_timed_transcripts)`: Splits each transcript into roughly uniform blocks of approximately 250 words (to avoid memory errors) and summarizes each block. The block summaries are later combined into a single summary by simple concatenation.


**Thus, the final output of running the methods in this class is a list of summaries for the fetched transcripts.**
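
The block-splitting idea in step 6 can be illustrated with a much simplified sketch. The helper `split_into_blocks` below is hypothetical and is not the method used in the class: it only assumes a list of caption lines and a target of about 250 words per block, and it leaves out the margin and buffer handling that `process_and_split_transcript()` performs.

```python
import math

def split_into_blocks(captions, target_words=250):
    """Group caption lines into blocks of roughly equal word count.

    Hypothetical, simplified helper: no margin or carry-over buffer.
    """
    total_words = sum(len(caption.split()) for caption in captions)
    if total_words == 0:
        return []

    # Same idea as the class: number of blocks first, then words per block
    num_blocks = math.ceil(total_words / target_words)
    words_per_block = total_words / num_blocks

    blocks, current, current_len = [], [], 0
    for caption in captions:
        current.append(caption)
        current_len += len(caption.split())
        # Close the block once it holds its share, unless it is the last block
        if current_len >= words_per_block and len(blocks) < num_blocks - 1:
            blocks.append(current)
            current, current_len = [], 0

    # Whatever is left over forms the final block
    if current:
        blocks.append(current)
    return blocks

captions = ["one two three"] * 200  # toy transcript with 600 words
blocks = split_into_blocks(captions)
block_sizes = [sum(len(c.split()) for c in block) for block in blocks]
print(block_sizes)
```

Because whole caption lines are never split, the block sizes are only approximately equal, which is also why the class works with an allowed margin.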

---








In [None]:
class AutoSynop():

  def __init__(self):

    self.API_KEY = 'AIzaSyCHYJRuKOkaIFlU5FBKzYjrZli1CFvpxXg'
    self.nlp = en_core_web_lg.load()
    nltk.download('averaged_perceptron_tagger')
    nltk.download('punkt')
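    # Use the first GPU when one is available, otherwise fall back to the CPU (-1)
    self.summarizer = pipeline('summarization', device=0 if torch.cuda.is_available() else -1)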
    self.wrapper = textwrap.TextWrapper(width=80) 
    

  def get_metadata_for_videos(self, playlists):

    # Helper Function 1
    # Process the JSON response
    def process_response(response):

      next_page_flag = 1
      video_ids = list()
      titles = list()

      items = response['items']

      try:
        nextPageToken = response['nextPageToken']

      except KeyError:
        next_page_flag = 0
        #print('Reached end of playlist!')

      for item in items:
        video_id = item['contentDetails']['videoId']
        title = item['snippet']['title']
        video_ids.append(video_id)
        titles.append(title)

        #print(video_id)
        #print(title)       

      if next_page_flag==1:
        return video_ids, titles, nextPageToken

      if next_page_flag==0:
        return video_ids, titles, None

    # Helper Function 2
    # Process to get video IDs and titles for all videos in one playlist
    def get_video_ids_for_playlist(youtube, playlist_id, nextPageToken):

      if nextPageToken==None:

        request = youtube.playlistItems().list(
            part="snippet, contentDetails",
            maxResults=50,
            playlistId=playlist_id)
        
      if nextPageToken!=None:

        request = youtube.playlistItems().list(
          part="snippet, contentDetails",
          maxResults=50,
          playlistId=playlist_id,
          pageToken=nextPageToken)

      response = request.execute()
      
      video_ids, titles, nextPageToken = process_response(response)

      return video_ids, titles, nextPageToken


    # Main function starts here

    # Set up a YouTube API Client
    api_service_name = "youtube"
    api_version = "v3"

    youtube = googleapiclient.discovery.build(
        api_service_name, 
        api_version, 
        developerKey=self.API_KEY)
    
    # Iterate over all playlists and fetch video IDs for ALL playlists
    for idx, playlist in enumerate(tqdm((playlists))):

      playlist_video_ids = list()
      playlist_titles = list()

      playlist_id = playlist[playlist.find('&list=')+6:]

      #print(playlist_id)

      video_ids, titles, nextPageToken = get_video_ids_for_playlist(youtube, playlist_id, None)
      playlist_video_ids.extend(video_ids)
      playlist_titles.extend(titles)

      while nextPageToken!=None:

        video_ids, titles, nextPageToken = get_video_ids_for_playlist(youtube, playlist_id, nextPageToken)
        playlist_video_ids.extend(video_ids)
        playlist_titles.extend(titles)
      
      #print('Processed a Playlist')
      yield list(zip(playlist_video_ids, playlist_titles))
    
    print('All playlists done')


  def scrap_duplicates(self, video_metadata):

    all_video_ids_and_titles = {}

    for p_num in range(len(video_metadata)):

      playlist_metadata = dict(video_metadata[p_num])

      cleaned_playlist_metadata = {}

      for key,value in playlist_metadata.items():
          if value not in cleaned_playlist_metadata.values():
              cleaned_playlist_metadata[key] = value

      all_video_ids_and_titles.update(cleaned_playlist_metadata)

    cleaned_metadata = {}

    for key,value in all_video_ids_and_titles.items():
          if value not in cleaned_metadata.values():
              cleaned_metadata[key] = value

    return cleaned_metadata

  def get_video_ids(self, cleaned_metadata):

    return list(cleaned_metadata.keys())
  
  def get_video_titles(self, cleaned_metadata):

    return list(cleaned_metadata.values())

  def get_transcripts(self, cleaned_metadata):

    video_ids = list(cleaned_metadata.keys())

    transcripts = list()
    timed_transcripts = list()
    chrono_of_timed_transcripts = list()
    failed_video_ids = list()

    fail = 0
    success = 0
    total = 0

    for video_id in tqdm(video_ids, position=0):

      total += 1

      try:
        dict_responses = YouTubeTranscriptApi.get_transcript(video_id)
        transcript = ' '.join([response['text'] for response in dict_responses])
        timed_transcript = [response['text'] for response in dict_responses]
        chrono_of_timed_transcript = [response['start'] + response['duration'] for response in dict_responses]
        transcripts.append(transcript)
        timed_transcripts.append(timed_transcript)
        chrono_of_timed_transcripts.append(chrono_of_timed_transcript)
        success += 1
        
      except Exception as e: 
        print(e)
        failed_video_ids.append(video_id)
        print(video_id)
        fail += 1
        continue
  
      print('{} of {} videos done and {} failed until now'.format(success, total, fail))
      time.sleep(8)
    
    cleaned_metadata = {key:value for key, value in cleaned_metadata.items() if key not in failed_video_ids}
  
    print('Number of transcripts retrieved: {}'.format(len(transcripts)))

    assert len(transcripts) == len(timed_transcripts)
    assert len(timed_transcripts) == len(chrono_of_timed_transcripts)

    return transcripts, timed_transcripts, chrono_of_timed_transcripts, cleaned_metadata


  def process_and_split_transcript(self, transcripts, timed_transcripts, chrono_of_timed_transcripts):

    # Helper Function 1
    # To get blocks lengths
    def get_blocks_lengths(blocks):

      len_blocks = list()

      for idx, block in enumerate(blocks):

        block_text = ' '.join([sent for sent in block[0]])
        length = sum([len(sent.strip().split()) for sent in block[0]])
        len_blocks.append((idx, length))
        #print('BLOCK: {}'.format(idx+1))
        #print('\n')
        #print(wrapper.fill(block_text))
        #print('\n')
        #print('LENGTH OF BLOCK: {} is {}'.format(idx+1,length ))
        #print('\n')
        #print('-' * 100)
        #print('\n')

      return len_blocks


    # Helper Function 2
    # To print summaries for individual blocks in a transcript
    # In addition, also return the combined summary by adding all blocks
    def distil_block(blocks):

      block_summaries = list()

      for idx, block in enumerate(blocks):
        block_text = ' '.join([sent for sent in block[0]])
        block_summary = self.summarizer(block_text)[0]['summary_text']
        #print('SUMMARY OF BLOCK: {}'.format(idx+1))
        #print('\n')
        #print(wrapper.fill(block_summary))
        #print('-' * 100)
        #print('\n')
        block_summaries.append(block_summary)


      return block_summaries


    # Helper Function 3
    # Clean up text with regex
    def clean_up(transcript, timed_transcript):

      # Clean up transcript
      transcript = re.sub(r'\n', ' ', transcript)
      transcript = re.sub(r'\.\.+', '', transcript)
      transcript = re.sub(r'\s+', ' ', transcript)

      # Clean up timed captions from JSON response
      timed_transcript = [re.sub(r'\n', ' ', sentence) for sentence in timed_transcript]
      timed_transcript = [re.sub(r'\.\.+', '', sentence) for sentence in timed_transcript]
      timed_transcript = [re.sub(r'\s+', ' ', sentence) for sentence in timed_transcript]
      

      return transcript, timed_transcript
  
    # Main function starts here

    # Define empty lists to hold summaries and transcripts (with blocks and their timings)
    summaries = list()
    transcript_with_blocks_and_their_timings = list()
    counter = 0

    # Loop over all transcripts in the database
    for (transcript, timed_transcript, chrono_of_timed_transcript) in tqdm(zip(transcripts, timed_transcripts, chrono_of_timed_transcripts), total=len(timed_transcripts)):

      # Clean up all text
      transcript, timed_transcript = clean_up(transcript, timed_transcript)

      # Ensure lengths of transcripts and timed transcripts after cleaning is the same
      length_of_transcript = len(transcript.strip().split())
      length_of_timed_transcript = sum([len(sentence.strip().split()) for sentence in timed_transcript])
      assert length_of_timed_transcript == length_of_transcript


      # Calculate number of blocks and the allowed margin
      num_blocks = math.ceil(length_of_timed_transcript/250)
      allowed_margin = sum([len(sent.strip().split()) for sent in timed_transcript])/len(timed_transcript)
      #print('Allowed margin is {}'.format(allowed_margin))
      #print('Number of blocks are: {}'.format(num_blocks))

      try:
        length_of_each_block = math.floor(length_of_timed_transcript/num_blocks)
        #print('Length of each block is {}'.format(length_of_each_block))
      except ZeroDivisionError:
        #print('\n')
        #print('Removing a very short transcript (<100 words)')
        continue

      # Define some more variables to hold miscellaneous intermediates
      block=list()
      buffer = str()
      blocks = list()
      block_length = 0
      flag = 0
      sentences_proccessed = 0

      # Iterate over all sentences in the transcript to divide it into roughly equal blocks

      for idx, sentence in enumerate(timed_transcript):

        if flag==0:
          pass

        if flag==1:
          assert block_length==0
          block.append(buffer)
          block_length += len(buffer.strip().split())
          flag = 0

        length_of_current_sentence = len(sentence.strip().split())
        block_length += length_of_current_sentence
        #print('Length of block currently is at: {}'.format(block_length))
          

        if block_length < length_of_each_block and block_length + allowed_margin <= length_of_each_block:
          block.append(sentence)
          sentences_proccessed+=1
        
        elif block_length == length_of_each_block:
          block.append(sentence)
          sentences_proccessed+=1
          
        else:
          flag = 1
          buffer = sentence
          sentences_proccessed+=1
          #print(block)
          #print((block, idx))
          blocks.append((block, idx-1))
          #print(blocks)
          block_length = 0
          block=list()

      # Ensure there is at least 1 block
      if len(blocks)==0:
        blocks.append((block, idx))
        block_length = 0
        block=list()

      # Handle the case when there are some sentences that missed out
      if len(block)!=0:

        # Get last block
        for left_over_sent in block:
          blocks[-1][0].append(left_over_sent)


      # Ensure all sentences in the transcript have been processed once and only once
      assert sentences_proccessed == len(timed_transcript)

      # Get blocks with their lengths
      len_blocks = get_blocks_lengths(blocks)

      # Get end timings for each block in transcript 
      end_timing_of_blocks = [chrono_of_timed_transcript[-1] if idx+1==len(blocks) else chrono_of_timed_transcript[block[1]] for idx, block in enumerate(blocks)]

      # Store transcripts, with all its blocks and their timings for targeted search
      transcript_with_blocks_and_their_timings.append((transcript, blocks, end_timing_of_blocks))

      # Summarize individual blocks
      summary_blocks = distil_block(blocks)

      # Replace multiple spaces in the summary blocks with a single space
      summary_blocks = [re.sub(r'\s+', ' ', summary_block) for summary_block in summary_blocks]

      # Append the summary to the 'summaries' list
      summaries.append(summary_blocks)

    return summaries, transcript_with_blocks_and_their_timings


  # There is no point in generating a 'summary' for a very short transcript (<100 words)
  # Hence, we fill those NaN values in the summaries by the corresponding transcripts themselves
  # THIS FUNCTION IS UNDER DEVELOPMENT, DO NOT CALL
  def fill_nans(self, summaries, transcripts):

    empty = list()

    for idx, summary in enumerate(summaries):

      if not summary:
        empty.append(idx)

    for idx in empty:

      summaries[idx] = transcripts[idx]

    for summary in summaries:
    
      assert len(summary.strip().split()) != 0

    print('Successfully Copied over the transcripts as it is to summaries where the length was less than 100 words!')

## **Use AutoSynop Class to get Automated Synopses**



---
We will first define playlists (their links) and their names taken from the official Sadhguru public channel on YouTube. 

Next, we will use the AutoSynop Class (and its methods) to obtain a database of summaries for ~850 transcripts in an end-to-end fashion
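
The class obtains each playlist ID by slicing the link after `&list=`. As an illustrative alternative, the ID can also be read from the URL's query string with the standard library. The helper name `extract_playlist_id` is an assumption made for this sketch and does not exist in the notebook.

```python
from urllib.parse import urlparse, parse_qs

def extract_playlist_id(url):
    """Return the value of the `list` query parameter, or None if it is absent."""
    query = parse_qs(urlparse(url).query)
    values = query.get("list")
    return values[0] if values else None

url = "https://www.youtube.com/watch?v=Dl8MUnLfEsk&list=PL3uDtbb3OvDOWpCZ8ERCXHMcslGaBEOBT"
playlist_id = extract_playlist_id(url)
print(playlist_id)
```

Parsing the query string does not depend on the order of the URL parameters, whereas slicing assumes that `list` is the last parameter in the link.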


---



In [None]:
# First, let us define our playlists
# It is from these playlists that we will be downloading transcripts

playlists = '''

https://www.youtube.com/watch?v=Dl8MUnLfEsk&list=PL3uDtbb3OvDOWpCZ8ERCXHMcslGaBEOBT
https://www.youtube.com/watch?v=cxoQdEhHaT8&list=PL3uDtbb3OvDNXmmy_3Q7SCHIZdz9ja4SG
https://www.youtube.com/watch?v=yL_fgyXXnSM&list=PL3uDtbb3OvDPHh7DWhekbw-ywA7SCnstr
https://www.youtube.com/watch?v=O1B0lDS1Jnw&list=PL3uDtbb3OvDNsDLMnmyR94MTfGHQh6HtP
https://www.youtube.com/watch?v=zO8QzMWZbN4&list=PL3uDtbb3OvDMpNqWoWfsY9qqT7UZijw0w
https://www.youtube.com/watch?v=4OBLAW7oQYo&list=PL3uDtbb3OvDPup8tDy1viWElFkPZcL4pM
https://www.youtube.com/watch?v=GM0lU5Dq7eA&list=PL3uDtbb3OvDPZG2coablWM-9XX6JQtSQT
https://www.youtube.com/watch?v=vvntRXe6YcU&list=PL3uDtbb3OvDPBGzSYKBeEFlrG48_0DBC4
https://www.youtube.com/watch?v=DTWMwHtF-UA&list=PL3uDtbb3OvDNnH5j_UFzZwR2KWg4TShJn
https://www.youtube.com/watch?v=kAMvYHqTWs0&list=PL3uDtbb3OvDPt8Ayn5QQ_13Juo98-EDxP
https://www.youtube.com/watch?v=xswUGZOVdc4&list=PL3uDtbb3OvDPLLMGlDi3C3-uAwyTBXtnR
https://www.youtube.com/watch?v=3J-cYxxHQGQ&list=PL3uDtbb3OvDNxpFp3baiPKRM4tGtD3_Me
https://www.youtube.com/watch?v=f7-lwz_FacE&list=PL3uDtbb3OvDMjs6pYa27tCweTBKBgxUij
https://www.youtube.com/watch?v=a6danRWYxpo&list=PL3uDtbb3OvDNVQJSz1__CuW-IS2s2kw2A
https://www.youtube.com/watch?v=uoIXz3KcwME&list=PL3uDtbb3OvDMgLTgfZe4fDN48SYfEtesX
https://www.youtube.com/watch?v=bJggjXvB52c&list=PL3uDtbb3OvDMHDwKA8sPrEi2SV3IKnT0S
https://www.youtube.com/watch?v=4OBLAW7oQYo&list=PL3uDtbb3OvDPAcaMIq68euWqHvZosh8JI
https://www.youtube.com/watch?v=QAsJvKsd2Xk&list=PL3uDtbb3OvDNWKnzD4MJRQRX_wBAT9iDC
https://www.youtube.com/watch?v=UT_nWVLi4Ws&list=PL3uDtbb3OvDMMbCg-hvVjXYZ3osM4rpr2
https://www.youtube.com/watch?v=X_fHa73_nOg&list=PL3uDtbb3OvDNo0TvQIHbB6TLndA7jEMTR
https://www.youtube.com/watch?v=HIkgY0Rz1jU&list=PL3uDtbb3OvDMaNezBWgE_SNQ6QkeYkV1w
https://www.youtube.com/watch?v=AHS1c_vqjxI&list=PL3uDtbb3OvDMBO-NUpWCvV_zhJh1pFlEX
https://www.youtube.com/watch?v=diFkCJ802vY&list=PL3uDtbb3OvDMdjRscdox0QYkcE9cghmvx
https://www.youtube.com/watch?v=rbYdXbEVm6E&list=PL3uDtbb3OvDONMcvq4e82gs33laM4IJ_z
https://www.youtube.com/watch?v=235gIzWOkrM&list=PL3uDtbb3OvDNDjm-mp82KCJB6VDpqZi3I

'''

playlists = playlists.strip().split('\n')

In [None]:
# Next, let's create an object of the class
Autosynop = AutoSynop()

In [None]:
# Get video IDs of all videos in the playlists

# First we obtain the video metadata which includes the video ID and the title
# We will do this using the get_metadata_for_videos() method
video_metadata = list(Autosynop.get_metadata_for_videos(playlists))


# We will remove duplicate videos (if any) using the scrap_duplicates() method
# This removes any duplicate entries from the video metadata
cleaned_metadata = Autosynop.scrap_duplicates(video_metadata)

In [None]:
# Get transcripts along with timing metadata for those video ids
# The timing metadata is used for targeted search (demonstrated later)

# If files are not provided, run this
#transcripts, timed_transcripts, chrono_of_timed_transcripts, cleaned_metadata = Autosynop.get_transcripts(cleaned_metadata)

# Finally, we'll separate the video IDs and titles from the cleaned video metadata
# This is done using the get_video_ids() and get_video_titles() methods
# If files are not provided, run this
#video_ids = Autosynop.get_video_ids(cleaned_metadata)
#video_titles = Autosynop.get_video_titles(cleaned_metadata)

# If files already provided, run this instead
with open('/content/transcripts_f.pickle', 'rb') as handle:
    transcripts = pickle.load(handle)

with open('/content/timed_transcripts_f.pickle', 'rb') as handle:
    timed_transcripts = pickle.load(handle)


with open('/content/chrono_of_timed_transcripts_f.pickle', 'rb') as handle:
    chrono_of_timed_transcripts = pickle.load(handle)


with open('/content/cleaned_metadata_f.pickle', 'rb') as handle:
    cleaned_metadata = pickle.load(handle)


with open('/content/video_ids_f.pickle', 'rb') as handle:
    video_ids = pickle.load(handle)


with open('/content/video_titles_f.pickle', 'rb') as handle:
    video_titles = pickle.load(handle)

In [None]:
# Finally, get summaries for all transcripts
# Also, get the start and end times of the blocks in the transcript (used for targeted search, demonstrated later)
# This will take a while to execute (~2.5 hours on a P100-PCIE GPU)
# summaries, transcript_with_blocks_and_their_timings = Autosynop.process_and_split_transcript(transcripts, timed_transcripts, chrono_of_timed_transcripts)

# If files are provided, run this instead 
with open('/content/transcript_with_blocks_and_their_timings_f.pickle', 'rb') as handle:
  transcript_with_blocks_and_their_timings = pickle.load(handle)

with open('/content/summaries_f.pickle', 'rb') as handle:
    summaries = pickle.load(handle)

## **Save the Summaries to a Pandas DataFrame for Recommendation**



---
The saved DataFrame has ten columns:


1.   Video ID
2.   Video Title
3.   Summary
4.   Summary Blocks
5.   Summary Block End Timings
6.   Length of Summary
7.   Transcript
8.   Transcript Blocks
9.   Transcript Block End Timings
10.   Length of Transcript


**The DataFrame is pickled to the path:**

`/content/db_633_f.pickle`

---






In [None]:
def save_df(video_ids, video_titles, summaries, transcripts, transcript_with_blocks_and_their_timings):

  joined_summaries = [' '.join(summary_block) for summary_block in summaries]
  length_transcripts = [len(transcript.strip().split()) for transcript in transcripts]
  length_summaries = [len(summary.strip().split()) for summary in joined_summaries]
  transcripts = [element[0] for element in transcript_with_blocks_and_their_timings]
  transcript_blocks = [element[1] for element in transcript_with_blocks_and_their_timings]
  block_end_timings = [element[2] for element in transcript_with_blocks_and_their_timings]
 

  tuples_for_df = list(zip(video_ids, video_titles, joined_summaries, summaries, block_end_timings, length_summaries, transcripts, transcript_blocks, block_end_timings, length_transcripts))
  df = pd.DataFrame(tuples_for_df, columns = ['Video ID', 'Video Title', 'Summary', 'Summary Blocks', 'Summary Block End Timings', 'Length of Summary',  'Transcript', 'Transcript Blocks', 'Transcript Block End Timings',  'Length of Transcript'])
  df.to_pickle('/content/db_633_f.pickle')

  return df 

df = save_df(video_ids, video_titles, summaries, transcripts, transcript_with_blocks_and_their_timings)

In [None]:
# Preview the dataframe
df.head()

## **Matching using Zero Shot Classification**



---
The goal is to recommend the most relevant video from the database to the user based on a short description given by the user.

Specifically, the input is:

*   A brief description of the user's quandary or of what they need clarity on.


Given this input, the proposed approach is as follows:


1.   **Classify Summaries:** Before any user input is taken, we use zero-shot classification with BART (Facebook AI) to assign every summary in the database a probability of belonging to each of a fixed list of themes.


2.   **Shortlist**: Given the user input (as defined above), this step follows two tracks of computation:


*   **Track 1 (NLP on Video Titles):**
    
    This involves calculating embeddings for the user's input and for the titles of the videos in the database using RoBERTa (Facebook AI). We use the cosine similarity between the user input embedding and each video title embedding as the measure of similarity and shortlist the top 20 matched videos. We believe that the title of a video carries a great deal of accurate semantic information and should be taken into consideration.

*   **Track 2 (NLP on Summaries):**

    First, we tag the user input using zero-shot classification with BART (Facebook AI).
    
    For example, a user's input could be:

        "I just had a breakup with my girlfriend. 
        She decided to part ways but I am unable to accept or process this. 
        I feel lost all the time and cannot focus at work or anywhere else. I cannot move on. 
        At night, her memory keeps me up. In the morning too, and even at work, I see her face. 
        I cannot get her out of my mind. I don't know what to do."

    And the tags (using Zero Shot classification) would be:

        [ 'love and relationships', 'suffering', 'stress' ]
    
    Next, we filter videos based on those tags (from our already Zero Shot classified database.)
    
    And finally, we calculate the cosine similarity between the embeddings of summaries of the filtered videos and the embeddings of the user input, selecting the top 20 here too.

3.   **Recommendation**: In this step, we take the intersection of Track 1 and Track 2 to obtain a list of videos that are very likely to be highly relevant to the user. This final list of videos is shown as the recommendations.


4.  **Smart Snippets (Targeted Recommendations)**: This takes the final list of recommendations, tries to find the best entry and exit points in each video and recommends those too. This is done by searching the transcript block by block and noting the times of the blocks that best fit the user's description.

Each of the 4 steps is documented below with code.
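
Before the detailed code, the shortlist and intersection logic of steps 2 and 3 can be sketched in a few lines. This is a minimal illustration under stated assumptions: the 2-D vectors are made up for the example (real embeddings come from a sentence encoder), the helpers `cosine_similarity` and `top_k` are hypothetical, the tag filtering of Track 2 is left out, and `k=2` stands in for the top 20.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, candidates, k):
    """Return the IDs of the k candidates most similar to the query."""
    ranked = sorted(
        candidates,
        key=lambda video_id: cosine_similarity(query, candidates[video_id]),
        reverse=True,
    )
    return ranked[:k]

# Toy embeddings, invented for illustration only
user_vec = [1.0, 0.0]
title_vecs = {"a": [0.9, 0.1], "b": [0.0, 1.0], "c": [0.7, 0.7], "d": [1.0, 0.2]}
summary_vecs = {"a": [0.8, 0.3], "b": [0.1, 0.9], "c": [0.9, 0.1], "d": [-1.0, 0.0]}

track_1 = top_k(user_vec, title_vecs, k=2)    # shortlist from video titles
track_2 = top_k(user_vec, summary_vecs, k=2)  # shortlist from summaries

# Keep only the videos shortlisted by both tracks, in Track 1 order
recommended = [video_id for video_id in track_1 if video_id in track_2]
print(track_1, track_2, recommended)
```

Taking the intersection trades recall for precision: a video is recommended only when both its title and its summary are close to the user's description.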

---


### **1.  Classify Summaries**



---

For each summary, the probability that the summary belongs to each theme is calculated in three ways using BART (Facebook AI):

1.   **Averaging**: In this method, for each summary, we zero-shot classify *each block* of the summary to get the probability that it belongs to each of the themes, and then average those probabilities across all the summary blocks to obtain the final probabilities.


2.   **Soft Dominance**: In this method, for each summary, we zero-shot classify *each block* of the summary to get the probability that it belongs to each of the themes, and for *each summary block* the most probable theme is found. The final probabilities are then calculated as: `number of times a theme has dominated a block (been the maximum) / number of blocks in the summary`


3.   **Hard Dominance**: In this method, for each summary, we zero-shot classify *each block* of the summary to get the probability that it belongs to each of the themes, and then for *each summary block* the most probable theme is found *over the cumulative sum of probabilities* up to and including that block. The final probabilities are then calculated as: `number of times a theme has dominated the cumulative probability across blocks (been the maximum in cumulative probability across blocks) / number of blocks in the summary`

We divide by `number of blocks in the summary` in all methods to keep the probabilities between 0 and 1 (as they should be).
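
The three aggregation methods can be compared on a toy example. The sketch below assumes the per-block theme probabilities have already been computed; the function name `aggregate` and the numbers are made up for illustration and are not part of the notebook.

```python
def aggregate(block_probs, themes):
    """Return (averaged, soft_dominance, hard_dominance) for one summary."""
    num_blocks = len(block_probs)
    cumulative = {theme: 0.0 for theme in themes}
    soft = {theme: 0 for theme in themes}
    hard = {theme: 0 for theme in themes}

    for probs in block_probs:
        for theme in themes:
            cumulative[theme] += probs[theme]
        # Soft dominance: the winner of this block alone
        soft[max(probs, key=probs.get)] += 1
        # Hard dominance: the winner of the running totals so far
        hard[max(cumulative, key=cumulative.get)] += 1

    averaged = {theme: cumulative[theme] / num_blocks for theme in themes}
    soft = {theme: count / num_blocks for theme, count in soft.items()}
    hard = {theme: count / num_blocks for theme, count in hard.items()}
    return averaged, soft, hard

themes = ["karma", "stress"]
block_probs = [
    {"karma": 0.9, "stress": 0.2},
    {"karma": 0.3, "stress": 0.6},
    {"karma": 0.2, "stress": 0.5},
    {"karma": 0.2, "stress": 0.5},
]

averaged, soft, hard = aggregate(block_probs, themes)
print(averaged)
print(soft)
print(hard)
```

In this example "stress" wins three of the four blocks individually, so it leads under soft dominance, while the strong first block keeps "karma" ahead in the running totals for three of the four steps, so it leads under hard dominance.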


---




In [None]:
# First, we'll define the themes we'll be using to classify the transcripts on
# These themes reflect the names of the playlists we have scraped for the transcripts

themes = '''

love and relationships
addiction and compulsiveness
unraveling death
karma
education
parenting
suffering
stress
time management
existence of god
living happily
health
success
depression
marriage
sleep and restfulness
peace
virus

'''

In [None]:
# Process themes and define a dict to map from the index number to the theme
themes = themes.strip().split('\n')
emotion_list = themes
emotion_ids = list(range(1, len(emotion_list)+1))
emotion_ids_to_emotions = dict(zip(emotion_ids, emotion_list))

In [None]:
# If db_633_f.pickle is provided, run this
#summaries = df['Summary Blocks'].tolist()

# If summaries_f.pickle is provided, run this
with open('/content/summaries_f.pickle', 'rb') as handle:
    summaries = pickle.load(handle)

In [None]:
# First, set up Bart
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-mnli')
model = BartForSequenceClassification.from_pretrained('facebook/bart-large-mnli')

# Send the model to the selected device (GPU or CPU) and print out the architecture
model.to(device)

In [None]:
# Perform Zero Shot Classification

# Setup logging and wrapped printing
import logging
logging.basicConfig(level=logging.ERROR)
wrapper = textwrap.TextWrapper(width=80) 

# Define global list for Averaged method
zero_shot_averaged = list()

# Define global list for Hard Dominance method
zero_shot_hard_dominance = list()

# Define global list for Soft Dominance method
zero_shot_soft_dominance = list()


# Loop over all summaries
for summary in tqdm(summaries):

  initial_probs = [0.00] * len(emotion_list)
  initial_dominance = [0] * len(emotion_list)
  num_blocks = len(summary)
  labels = emotion_list

  # Define local dict (one summary) for Averaged method
  zero_shot_for_summary_averaged = dict(zip(emotion_list, initial_probs))

  # Define local dict (one summary) for Hard Dominance method
  zero_shot_for_summary_hard_dominance = dict(zip(emotion_list, initial_dominance))

  # Define local dict (one summary) for Soft Dominance method
  zero_shot_for_summary_soft_dominance = dict(zip(emotion_list, initial_dominance))

  # Buffer dict to store the probabilities of the current block
  zero_shot_for_summary = dict(zip(emotion_list, initial_probs))


  # Loop over the blocks in the summary
  for summary_block in summary:

    #print(wrapper.fill(''.join(summary_block)))
    #print('\n')

    # Perform Zero Shot Classification

    # Pose the summary block as an NLI premise and each label as a hypothesis
    premise = summary_block

    for label in labels:

      hypothesis = f'This text is about {label}.'

      # run through model pre-trained on MNLI
      x = tokenizer.encode(premise, hypothesis, return_tensors='pt',
                              max_length=tokenizer.max_len,
                              truncation_strategy='only_first')
      x = x.to(device)
      # Inference only, so skip gradient tracking to save memory
      with torch.no_grad():
        logits = model(x)[0]

      # We throw away "neutral" (dim 1) and take the probability of
      # "Entailment" as the probability of the label being true 
      entail_contradiction_logits = logits[:,[0,2]]
      probs = entail_contradiction_logits.softmax(1)
      prob_label_is_true = probs[:,1]

      # Update the initial probabilities in the dict for the label
      zero_shot_for_summary[label] = prob_label_is_true.item() 
      
      # Accumulate the initial probabilities in the dict for the label
      zero_shot_for_summary_averaged[label] += prob_label_is_true.item() 


      #print(f'Probability the text belongs to {label} is: {prob_label_is_true.item():0.2f}%')

    # Find the dominant theme so far (cumulative over the blocks seen)
    keymax = max(zero_shot_for_summary_averaged, key=zero_shot_for_summary_averaged.get) 

    # Update the hard dominance dict 
    zero_shot_for_summary_hard_dominance[keymax] += 1

    # Find the dominant theme in the current block (non-cumulative)
    keymax = max(zero_shot_for_summary, key=zero_shot_for_summary.get) 

    # Update the soft dominance dict 
    zero_shot_for_summary_soft_dominance[keymax] += 1
  
  
  for label in labels:

    # Average out the probabilities after each block in the summary has been assigned a probability
    zero_shot_for_summary_averaged[label] /= num_blocks

    # Calculate the dominance percentage 
    zero_shot_for_summary_soft_dominance[label] /= num_blocks

    # Calculate the dominance percentage 
    zero_shot_for_summary_hard_dominance[label] /= num_blocks


  # Drop labels that never dominated a block (soft dominance)
  zero_shot_for_summary_soft_dominance = {key:val for key, val in zero_shot_for_summary_soft_dominance.items() if val > 0.0}

  # Drop labels that never dominated a block (hard dominance)
  zero_shot_for_summary_hard_dominance = {key:val for key, val in zero_shot_for_summary_hard_dominance.items() if val > 0.0} 
  
  # (Disabled) Drop labels whose averaged probability is below 0.15; a 0.25 filter is applied later when the methods are combined
  #zero_shot_for_summary_averaged = {key:val for key, val in zero_shot_for_summary_averaged.items() if val >= 0.15}


  # Print out the dict of classification probabilities
  #print(zero_shot_for_summary)
  #print(dominance_frequency)
  #print('\n')
  #print('-' * 50)

  # Append the results for one summary to the global list
  #print(zero_shot_for_summary_averaged)
  #print(zero_shot_for_summary_soft_dominance)
  #print(zero_shot_for_summary_hard_dominance)
  zero_shot_averaged.append(zero_shot_for_summary_averaged)
  zero_shot_soft_dominance.append(zero_shot_for_summary_soft_dominance)
  zero_shot_hard_dominance.append(zero_shot_for_summary_hard_dominance)

print('Done!')
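
The three aggregation methods computed by the loop above can be illustrated without loading the model. The following is a minimal sketch, assuming the per-block probabilities are already available; the function name `aggregate_blocks` and the toy values in `block_probs` are illustrative and do not come from the notebook.

```python
def aggregate_blocks(block_probs, labels):
    """Aggregate per-block label probabilities into the three tag distributions."""
    n = len(block_probs)
    running = dict.fromkeys(labels, 0.0)  # cumulative sums, used by Averaged
    hard = dict.fromkeys(labels, 0)       # wins judged on the cumulative sums
    soft = dict.fromkeys(labels, 0)       # wins judged on the current block only

    for probs in block_probs:
        for label in labels:
            running[label] += probs[label]
        hard[max(running, key=running.get)] += 1
        soft[max(probs, key=probs.get)] += 1

    averaged = {label: running[label] / n for label in labels}
    hard = {label: count / n for label, count in hard.items() if count > 0}
    soft = {label: count / n for label, count in soft.items() if count > 0}
    return averaged, hard, soft


# Hypothetical probabilities for a summary made of three blocks
labels = ['stress', 'suffering']
block_probs = [
    {'stress': 0.9, 'suffering': 0.2},
    {'stress': 0.3, 'suffering': 0.6},
    {'stress': 0.1, 'suffering': 0.8},
]
averaged, hard, soft = aggregate_blocks(block_probs, labels)
```

In this toy case "stress" wins two of the three cumulative comparisons while "suffering" wins two of the three per-block comparisons, which shows how Hard Dominance favours themes that appear early and strongly, whereas Soft Dominance treats every block independently.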

In [None]:
# Append the zero shot classification results to the existing dataframe and save it

# Load
df = pd.read_pickle('/content/db_633_f.pickle')
df.head()

# Append
df['Zero Shot Classification (Averaged)'] = zero_shot_averaged
df['Zero Shot Classification (Soft Dominance)'] = zero_shot_soft_dominance
df['Zero Shot Classification (Hard Dominance)'] = zero_shot_hard_dominance
df.head()

# Save
df.to_pickle('/content/db_633_zero_shot_f.pickle')

In [None]:
df.head()

### **2.  Shortlist**



---

First, we take the user input, and then we run computations along two tracks:

**Track 1** aims to extract the rich semantic information from the title of the videos. 

**Track 2** aims to extract the semantic information in the content of the videos itself.

In [None]:
# Read dataframe 
df = pd.read_pickle('/content/db_633_zero_shot_f.pickle')

# Retrieve some columns of the DataFrame
zero_shot_averaged = df['Zero Shot Classification (Averaged)'].tolist()
zero_shot_soft_dominance = df['Zero Shot Classification (Soft Dominance)'].tolist()
zero_shot_hard_dominance = df['Zero Shot Classification (Hard Dominance)'].tolist()
video_ids = df['Video ID'].tolist()
video_titles = df['Video Title'].tolist()
transcripts = df['Transcript Blocks'].tolist()
summaries = df['Summary Blocks'].tolist()

# Display the DataFrame
df.head()

In [None]:
# Some basic processing to combine all the zero shot classification methods into a single list

# Filter zero_shot_averaged list
filtered_zero_shot_averaged = list()
for zero_shots in zero_shot_averaged:
  filtered_probs_dict = {key:val for key, val in zero_shots.items() if val >= 0.25}
  filtered_zero_shot_averaged.append(filtered_probs_dict)

# Combine all
zero_shot = list()
for index in range(len(zero_shot_averaged)):
  zero_shot.append(list(set(filtered_zero_shot_averaged[index].keys()).union(set(zero_shot_soft_dominance[index].keys()).union(set(zero_shot_hard_dominance[index].keys())))))

# Ensure the length of the final list is equal to the number of transcripts 
assert len(zero_shot)==len(zero_shot_averaged)
assert len(zero_shot)==len(zero_shot_soft_dominance)
assert len(zero_shot)==len(zero_shot_hard_dominance)
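
The combination step above amounts to a set union per video. The following is a small sketch on made-up data, assuming the same 0.25 threshold for the averaged probabilities; the helper name `combine_tags` is hypothetical, and the result is sorted only to make the output deterministic (the notebook keeps an unordered list).

```python
def combine_tags(averaged, soft, hard, threshold=0.25):
    """Union of the labels kept by any of the three methods for one video."""
    kept_avg = {label for label, p in averaged.items() if p >= threshold}
    return sorted(kept_avg | set(soft) | set(hard))


# Hypothetical per-video results from the three methods
averaged = {'stress': 0.40, 'peace': 0.10, 'health': 0.30}
soft = {'stress': 0.75, 'sleep and restfulness': 0.25}
hard = {'stress': 1.0}
tags = combine_tags(averaged, soft, hard)
```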

In [None]:
# Define the themes used to classify the transcripts
# These themes reflect the names of the playlists the transcripts were scraped from

themes = '''

love and relationships
addiction and compulsiveness
unraveling death
karma
education
parenting
suffering
stress
time management
existence of god
living happily
health
success
depression
marriage
sleep and restfulness
peace
virus

'''

# Process themes and define a dict to map from the index number to the theme
themes = themes.strip().split('\n')
emotion_list = themes
emotion_ids = list(range(1, len(emotion_list)+1))
emotion_ids_to_emotions = dict(zip(emotion_ids, emotion_list))

#### **Just some examples!**

---

Try any of these or write your own!

To try:

1. Copy the string and enter it on the next cell! 

2. Done!

(Click below to reveal)

---



In [None]:
############## EXAMPLES #################

s1 = "I just had a breakup with my girlfriend. She decided to part ways but i am unable to accept or process this. I feel lost all the time and cannot focus at work or anywhere else. I cannot move on. At night, her memory keeps me up. in the morning too, and even at work, I see her face. I cannot get her out of my mind. I don't know what to do."
s2 = "After losing my job, there is a lot of pressure on me. I keep having negative thoughts all the time. I feel like I am worthless. I feel myself slipping to depression. I don't know what to do."

#### **Your input goes here!**

In [None]:
# Take Input from user
inputs_descrption = input('Please feel free to describe what you are going through: ')

In [None]:
# Clean input
inputs_descrption = re.sub(r"(\.\.+)", '', inputs_descrption)
inputs_descrption = re.sub(r'\s+', ' ', inputs_descrption)
inputs_descrption = inputs_descrption.lower()

# Display the inputs given by the user
print('Your inputs were: \n')
print(f'{inputs_descrption}')
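
The cleaning step can also be wrapped in a reusable function. This is a sketch under the assumption that the same three operations are wanted; the name `clean_user_input` and the final `strip()` are additions that are not in the cell above.

```python
import re


def clean_user_input(text):
    """Drop runs of dots, collapse whitespace, and lowercase the text."""
    text = re.sub(r'\.\.+', '', text)  # remove ellipses such as '...'
    text = re.sub(r'\s+', ' ', text)   # collapse runs of whitespace
    return text.lower().strip()


cleaned = clean_user_input("I feel lost...   I don't know\nwhat to do.")
```

Note that deleting an ellipsis outright merges the two sentences around it, so the sentence tokenizer used later sees them as one. Replacing the ellipsis with a full stop and a space instead would preserve the boundary, if that is preferred.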

#### **Track 1 : Matching Video Titles to User Input**

In [None]:
# First, preprocess all titles to remove anything that does not provide useful semantic information

def clean_sentence(sentence):
    stop_words = ['Sadhguru', 'IST', 'Unplug', 'unplug', '@', 'everyday']
    tokens = sentence.split()
    tokens_filtered= [word for word in tokens if not word in stop_words]
    filtered_sentence = ' '.join(tokens_filtered)

    return filtered_sentence


def is_valid_date(date_str):
    try:
        parser.parse(date_str)
        return True
    except (ValueError, OverflowError):
        return False

new_list = [' '.join([w for w in line.split() if not is_valid_date(w)]) for line in video_titles]
new_list = [re.sub(r'[^\sA-Za-z0-9]+', '', mystring) for mystring in new_list]
new_list = [re.sub(r'\s+', ' ', mystring) for mystring in new_list]
new_list = [clean_sentence(mystring) for mystring in new_list]
titles = [mystring.replace('Sadhguru', '').replace('Unplug', '').replace('IST', '').replace('@', '').replace('With in Challenging Times', '').replace('with in Challenging Times', '') for mystring in new_list]
titles = ['xxx deleted xxx' if mystring == '' else mystring for mystring in titles]

In [None]:
# Load model to get embeddings from sentences
model = SentenceTransformer('roberta-large-nli-stsb-mean-tokens', device=0)
nltk.download('punkt')

100%|██████████| 1.31G/1.31G [00:27<00:00, 47.2MB/s]


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# Get user embeddings by passing the user input to the loaded model
user_text = inputs_descrption
user_sent_text = nltk.sent_tokenize(user_text)
user_sent_text_embeddings = model.encode(user_sent_text)

# Average the embeddings
user_sent_text_embeddings = sum(user_sent_text_embeddings) / len(user_sent_text)
user_sent_text_embeddings = np.reshape(user_sent_text_embeddings, (1, user_sent_text_embeddings.shape[0]))

# Get title embeddings by passing all video titles to the loaded model
title_embeddings = np.asarray(model.encode(titles))

# Find the similarity of each title embedding with the user embedding using Cosine similarity
similarities = cosine_similarity(user_sent_text_embeddings, title_embeddings)
similarities = list(np.reshape(similarities, (similarities.shape[1])))

# Get indices of top 20 videos
top_20_from_video_titles = sorted(range(len(similarities)), key=lambda i: similarities[i])[-20:]

# Print titles of the selected 20 videos
for video_title_index in top_20_from_video_titles:
  print(titles[video_title_index])

Troubled by Fear Just Change Your Channel
The Desire For Everything Understanding The Human Predicament
Depression Stop The Suicide In Installments
Dont Let Fear of Suffering Limit Your Possibility
Why Is Breaking Up So Painful
How Do I Deal With Unfulfilled Expectations
Missing Life is a Tragedy
Coping With The Emotional Turmoil In A Pandemic 
Sick with Exam Fear This Will Help
Heaven is a Lousy Place 
Becoming Utterly Ignorant
Insight Into Depression
What is The Worst Ailment You Can Get 
Is Suffering Inevitable
The End of Suffering
Why Misery
Why Am I Stressed on Stress
on The Source of All Suffering
on Fear of Failure
The Source of Human Misery
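
The ranking step in Track 1 can be reproduced on toy vectors. The sketch below is dependency-free and uses hand-written 2-D vectors in place of the real sentence embeddings; `cosine` and `top_k` are illustrative helpers, not functions from the notebook.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def top_k(query, candidates, k):
    """Return candidate indices ordered from most to least similar."""
    scores = [cosine(query, c) for c in candidates]
    order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)
    return order[:k]


# Toy 2-D "embeddings" standing in for the sentence-transformer output
query = [1.0, 0.0]
title_vectors = [[0.0, 1.0], [1.0, 0.1], [1.0, 1.0], [-1.0, 0.0]]
best = top_k(query, title_vectors, k=2)
```

The cell above sorts in ascending order and slices the last 20 entries, so its printed list runs from the least to the most similar of the selected titles. The sketch returns the best match first instead.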


##### **Question: Is averaging the embeddings a good representation of all the sentences in a query or paragraph?**



---
Yes, it appears to be a fair representation of the overall sentiment of the sentences. This can be checked with the similarity matrix between the average of all sentence embeddings and the embedding of each individual sentence.

In the example output below, six of the eight sentence embeddings have a cosine similarity above 0.5 with the average embedding, which supports the assumption that the average embedding is a fair representation of the overall sentiment.

The sentences that score below 0.5 are usually bland ones such as "I don't know".

(Click below to see the code)

---



In [None]:
# Get embeddings of individual sentences
user_text = inputs_descrption
user_sent_text = nltk.sent_tokenize(user_text)
user_sent_text_embeddings = np.asarray(model.encode(user_sent_text))

# Average the sentence embeddings into a single (1, dim) vector
user_sent_text_embeddings_avg = user_sent_text_embeddings.mean(axis=0, keepdims=True)

# Finally, compare every sentence embedding with the average embedding
cosine_similarity(user_sent_text_embeddings, user_sent_text_embeddings_avg)

array([[0.531698  ],
       [0.6648342 ],
       [0.6563994 ],
       [0.68624353],
       [0.48793417],
       [0.41167915],
       [0.7173095 ],
       [0.6379081 ]], dtype=float32)

#### **Track 2 : Extracting Tags from User Input, Filtering Videos using the Zero Shot Classified Summaries, and Matching the Filtered Videos to the Semantic Content in those Videos**

In [None]:
# Load model for zero shot classification
# Set up Bart for tag extraction
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-mnli')
model2 = BartForSequenceClassification.from_pretrained('facebook/bart-large-mnli').to(device)


In [None]:
# Now pick up tags from the user input

initial_probs = [0.00] * len(emotion_list)
initial_dominance = [0] * len(emotion_list)
num_sents = len(user_sent_text)
labels = emotion_list

# Define dict (user input) for Averaged method
zero_shot_for_user_text_averaged = dict(zip(emotion_list, initial_probs))

# Define dict (user input) for Hard Dominance method
zero_shot_for_user_text_hard_dominance = dict(zip(emotion_list, initial_dominance))

# Define dict (user input) for Soft Dominance method
zero_shot_for_user_text_soft_dominance = dict(zip(emotion_list, initial_dominance))

# Buffer dict to store the probabilities of the current sentence only
zero_shot_for_user_text = dict(zip(emotion_list, initial_probs))

# Loop over all sentences in the user input
for user_text in tqdm(user_sent_text):

    # Pose the sentence as the NLI premise and each candidate label as the hypothesis
    premise = user_text

    for label in labels:

      hypothesis = f'This text is about {label}.'

      # run through model pre-trained on MNLI
      x = tokenizer.encode(premise, hypothesis, return_tensors='pt',
                              max_length=tokenizer.max_len,
                              truncation_strategy='only_first')
      x = x.to(device)
      logits = model2(x)[0]

      # We throw away "neutral" (dim 1) and take the probability of
      # "Entailment" as the probability of the label being true 
      entail_contradiction_logits = logits[:,[0,2]]
      probs = entail_contradiction_logits.softmax(1)
      prob_label_is_true = probs[:,1]

      # Update the initial probabilities in the dict for the label
      zero_shot_for_user_text[label] = prob_label_is_true.item() 
      
      # Accumulate the initial probabilities in the dict for the label
      zero_shot_for_user_text_averaged[label] += prob_label_is_true.item() 

      #print(f'Probability the text belongs to {label} is: {prob_label_is_true.item():0.2f}%')

    # Find the dominant theme so far (cumulative over the sentences seen)
    keymax = max(zero_shot_for_user_text_averaged, key=zero_shot_for_user_text_averaged.get) 

    # Update the hard dominance dict 
    zero_shot_for_user_text_hard_dominance[keymax] += 1

    # Find the dominant theme in the current sentence (non-cumulative)
    keymax = max(zero_shot_for_user_text, key=zero_shot_for_user_text.get) 

    # Update the soft dominance dict 
    zero_shot_for_user_text_soft_dominance[keymax] += 1
  
  
for label in labels:

  # Average the accumulated probabilities over the sentences in the user input
  zero_shot_for_user_text_averaged[label] /= num_sents

  # Calculate the dominance percentage 
  zero_shot_for_user_text_soft_dominance[label] /= num_sents

  # Calculate the dominance percentage 
  zero_shot_for_user_text_hard_dominance[label] /= num_sents


# Filter out all the labels that are less than 25% for averaged
zero_shot_for_user_text_averaged = {key:val for key, val in zero_shot_for_user_text_averaged.items() if val >= 0.25}

# Drop labels that never dominated a sentence (soft dominance)
zero_shot_for_user_text_soft_dominance = {key:val for key, val in zero_shot_for_user_text_soft_dominance.items() if val > 0.0}

# Drop labels that never dominated a sentence (hard dominance)
zero_shot_for_user_text_hard_dominance = {key:val for key, val in zero_shot_for_user_text_hard_dominance.items() if val > 0.0} 

100%|██████████| 5/5 [00:03<00:00,  1.35it/s]
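
The probability computed inside the loop above is a two-way softmax over the contradiction and entailment logits, with the neutral logit discarded. The following sketch shows that arithmetic on plain numbers; the function name `entailment_probability` and the example logits are made up, and the index order (contradiction, neutral, entailment) follows the comments in the cell.

```python
import math


def entailment_probability(logits):
    """logits = [contradiction, neutral, entailment]; neutral is ignored."""
    contradiction, _neutral, entailment = logits
    # Softmax over the two remaining classes, shifted by the max for stability
    m = max(contradiction, entailment)
    e_c = math.exp(contradiction - m)
    e_e = math.exp(entailment - m)
    return e_e / (e_c + e_e)


p_equal = entailment_probability([1.0, 5.0, 1.0])    # neutral has no influence
p_likely = entailment_probability([-2.0, 0.0, 2.0])  # entailment clearly ahead
```

Because the neutral class is dropped before normalising, a sentence the model considers mostly unrelated to a label can still receive a probability near 0.5, which is worth keeping in mind when reading the 0.25 threshold on the averaged scores.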


In [None]:
# Print the extracted tags using different methods
pprint(zero_shot_for_user_text_soft_dominance)
pprint(zero_shot_for_user_text_hard_dominance)
pprint(zero_shot_for_user_text_averaged)

{'stress': 0.2, 'suffering': 0.8}
{'stress': 0.6, 'suffering': 0.4}
{'depression': 0.5996770307421684,
 'stress': 0.7527786880731583,
 'suffering': 0.9185027122497559}


In [None]:
# Shortlist videos based on extracted tags from Input 
user_tags = list(set(zero_shot_for_user_text_averaged.keys()).union(set(zero_shot_for_user_text_soft_dominance.keys()).union(set(zero_shot_for_user_text_hard_dominance.keys()))))
labels = set(user_tags)

# Get all combinations of the input
from itertools import chain, combinations

def powerset(iterable):
    "powerset([1,2,3]) --> () (1,) (2,) (3,) (1,2) (1,3) (2,3) (1,2,3)"
    s = list(iterable)
    return chain.from_iterable(combinations(s, r) for r in range(len(s)+1))

combination_of_user_tags = list(powerset(user_tags))
combination_of_user_tags = [list(user_tag) for user_tag in combination_of_user_tags if len(user_tag)>1]

# Get matches for all the combination of tags
matches = list()
for user_tag in combination_of_user_tags:
  for idx, entry in enumerate(zero_shot):
    if set(user_tag).issubset(set(entry)):
      matches.append(idx)

# Keep only the unique matches
matches = set(matches)

# Print them
#for match in matches:
  #print(match)
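
The shortlisting rule above keeps a video when its tags contain at least one combination of two or more user tags. The sketch below expresses the same rule on toy data; `shortlist` and the tag lists are illustrative only.

```python
from itertools import combinations


def shortlist(user_tags, video_tags, min_size=2):
    """Indices of videos whose tags contain some user-tag subset of at least min_size."""
    user_tags = sorted(set(user_tags))
    subsets = [set(c)
               for r in range(min_size, len(user_tags) + 1)
               for c in combinations(user_tags, r)]
    return {idx for idx, tags in enumerate(video_tags)
            if any(s <= set(tags) for s in subsets)}


# Hypothetical tags for four videos
video_tags = [
    ['stress', 'suffering', 'health'],
    ['stress'],
    ['depression', 'suffering'],
    ['karma', 'peace'],
]
matches = shortlist(['stress', 'suffering', 'depression'], video_tags)
single = shortlist(['stress'], video_tags)
```

Under this rule a query that yields only one tag produces no combinations of size two, so the shortlist comes back empty. Lowering `min_size` to 1 in that situation is one possible fallback.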

In [None]:
# Filter Videos using NLP

# Get user embeddings by passing the user input to the loaded model
user_text = inputs_descrption
user_sent_text = nltk.sent_tokenize(user_text)
user_sent_text_embeddings = model.encode(user_sent_text)

# Average the embeddings
user_sent_text_embeddings = sum(user_sent_text_embeddings) / len(user_sent_text)
user_sent_text_embeddings = np.reshape(user_sent_text_embeddings, (1, user_sent_text_embeddings.shape[0]))

sim_mats = list()
sim_scores = list()
non_zero_percentages = list()
total_doms = list()
nzall = list()
tssall = list()

matched_blocks_all = list()

for match in tqdm(matches):

  summary = summaries[match]
  non_zero_percentage = 0
  sim_scores_loc = list()
  matched_blocks = list()

  doms = list()

  for idx, block in enumerate(summary):

    #block_text = ' '.join(block)
    block_sent_text = nltk.sent_tokenize(block)
    #pprint(block_sent_text)
    #print('\n')

    block_sent_text_embeddings = model.encode(block_sent_text)
    block_sent_text_embeddings = sum(block_sent_text_embeddings) / len(block_sent_text_embeddings)
    block_sent_text_embeddings = np.reshape(block_sent_text_embeddings, (1, block_sent_text_embeddings.shape[0]))
    # cosine_similarity returns a (1, 1) array here; keep the scalar value
    similarity = cosine_similarity(user_sent_text_embeddings, block_sent_text_embeddings)[0][0]

    sim_scores_loc.append(similarity)

    if similarity>=0.50:
      non_zero_percentage += 1

    #pprint(sim_mat)
    #print('\n')

    #pprint(sum_row)
    #print('/n')

    #pprint(sum_total)
    #print('/n')


  #non_zero_percentage = len(matched_blocks) / len(transcript)
  #non_zero_percentage = (sum([d for d in doms if d!=0])/len(doms)) * 100
  #non_zero_percentages.append(non_zero_percentage)
  #total_doms.append(doms)
  matched_blocks_all.append(matched_blocks)
  sim_scores.append(sum(sim_scores_loc)/len(sim_scores_loc))
  non_zero_percentages.append(non_zero_percentage/len(summary))

100%|██████████| 186/186 [01:36<00:00,  1.94it/s]
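
Each shortlisted video ends up with two scores in the loop above: the mean similarity across its summary blocks and the fraction of blocks whose similarity reaches 0.50. The following sketch computes both from a list of per-block similarities; `score_video` and the sample values are hypothetical.

```python
def score_video(block_similarities, threshold=0.50):
    """Return (mean similarity, fraction of blocks at or above the threshold)."""
    n = len(block_similarities)
    mean_sim = sum(block_similarities) / n
    hit_rate = sum(1 for s in block_similarities if s >= threshold) / n
    return mean_sim, hit_rate


# Made-up similarities between the user query and four summary blocks
mean_sim, hit_rate = score_video([0.62, 0.48, 0.55, 0.30])
```

The two scores can disagree: a video with one highly relevant block and many unrelated ones has a low hit rate, while its mean may still be pulled up by that single block.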


In [None]:
# Get indices of the top 50 matches under each scoring method
indices = sorted(range(len(sim_scores)), key=lambda i: sim_scores[i])[-50:]
indices2 = sorted(range(len(non_zero_percentages)), key=lambda i: non_zero_percentages[i])[-50:]
matches_from_indices = [match for idx, match in enumerate(matches) if idx in indices]
matches_from_indices2 = [match for idx, match in enumerate(matches) if idx in indices2]


# Print titles of the videos selected by mean similarity
for top in matches_from_indices:
  print(video_titles[top])

print('\n')

# Print titles of the videos selected by the fraction of blocks at or above 0.50 similarity
for top in matches_from_indices2:
  print(video_titles[top])

How to Deal with Relationships? | Sadhguru
How to Deal with an Exploitative Spouse? Sadhguru
How Do You Accept People You Don't Like? Sadhguru
What Is True Friendship? – Sadhguru
Is it Ok to be Jealous of a Friend’s Success? - Sadhguru
Why Do We Seek Success in Relationships? - Sadhguru
​Humanity, Yes! Morality, No! | Sadhguru
Why Good People Won’t Get Anywhere | Sadhguru
Troubled by Fear? Just Change Your Channel! - Sadhguru
To Make a Journey, Don’t Change Directions | Sadhguru
Resisting Change is Resisting Life | Sadhguru
Is it true pregnant women should not do Aum Kar? Sadhguru
Why Would One Take Their Own Life? Sadhguru - With Sadhguru in Challenging Times - 14th Jun
Bliss Beyond Intoxication | Sadhguru
Sadhguru on Why Being a Soldier Is a Big Deal
Why are you suffering the Lockdown - With Sadhguru in Challenging Times - 30 Mar
Just a Brief Life | Sadhguru
Untangling the Knots of Life | Sadhguru
Stop Creating | Sadhguru
Ending Fear-Based Education - Sadhguru
Why Do Parents Worry So

### **3. Recommendation: Intersection of Track 1 (Video Title Semantics) and Track 2 (Video Content Semantics)**

In [None]:
# Intersection of track1 and track2 
intersection = list(set(top_20_from_video_titles).intersection(matches_from_indices2))

In [None]:
# Print titles of the intersection
print('The recommendations found are as follows: \n')

for top in intersection:
  print(video_titles[top])

The recommendations found are as follows: 

Coping With The Emotional Turmoil In A Pandemic - With Sadhguru in Challenging Times - 28 Mar
Don’t Let Fear of Suffering Limit Your Possibility - Sadhguru
Missing Life is a Tragedy | Sadhguru
Troubled by Fear? Just Change Your Channel! - Sadhguru
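
The set intersection used above discards the ranking from Track 1. If the order matters, one possible variant keeps the title-track order while filtering by the content track. This is a sketch with made-up indices, and `recommend` is not part of the notebook.

```python
def recommend(title_ranked, content_matches):
    """Keep title-track indices that also appear in the content track, in title order."""
    content = set(content_matches)
    return [idx for idx in title_ranked if idx in content]


track1 = [12, 7, 30, 4, 19]  # hypothetical video indices, best title match first
track2 = [4, 12, 99, 30]     # hypothetical indices from the content track
picks = recommend(track1, track2)
```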


### **4. Smart Snippets (Targeted Recommendations)**