<a href="https://colab.research.google.com/github/amrindersingh03/Unstructured-Machine-Learning-/blob/main/Langchain_transcription_and_Semantic_search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# This notebook transcribes a YouTube video with OpenAI's Whisper model and then performs semantic search over the transcription using LangChain.

In [None]:
# Make sure you are connected to a GPU runtime

### Install pytube, a library for downloading YouTube audio

In [1]:
pip install pytube # For audio downloading

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pytube
  Downloading pytube-12.1.2-py3-none-any.whl (57 kB)
Installing collected packages: pytube
Successfully installed pytube-12.1.2


In [None]:
# Install Whisper, a speech recognition and translation model from OpenAI.

In [2]:
pip install git+https://github.com/openai/whisper.git -q # Whisper from OpenAI transcription model

  Preparing metadata (setup.py) ... done
  Building wheel for openai-whisper (setup.py) ... done


In [3]:
import whisper 
import pytube 

### Pick any YouTube video that you wish to transcribe. Here, I use a video of a Steve Jobs speech.

In [4]:
url = "https://www.youtube.com/watch?v=Tuw8hxrFBH8"
video = pytube.YouTube(url) # The video object is now stored in the variable "video"

In [5]:
audio = video.streams.get_audio_only() # Extracting audio from the video
audio.download(filename='tmp.mp3') # Downloads only audio from youtube video

'/content/tmp.mp3'

In [None]:
# Load the whisper model

In [6]:
model = whisper.load_model("small")

100%|███████████████████████████████████████| 461M/461M [00:07<00:00, 66.3MiB/s]


In [None]:
# Transcribe the audio with the loaded Whisper model "model"

In [7]:
transcription = model.transcribe('/content/tmp.mp3') # The result is stored in the variable "transcription"

### Let's see what this transcription looks like

In [8]:
transcription

{'text': " Today, I want to tell you three stories from my life. That's it. No big deal. Just three stories. The first story is about connecting the dots. I dropped out of Reed College after the first six months, but then stayed around as a drop-in for another 18 months or so before I really quit. So why'd I drop out? It started before I was born. My biological mother was a young unwed graduate student, and she decided to put me up for adoption. She felt very strongly that I should be adopted by college graduates, so everything was all set for me to be adopted at birth by a lawyer and his wife. Except that when I popped out, they decided at the last minute that they really wanted a girl. So my parents, who were on a waiting list, got a call in the middle of the night asking, we've got an unexpected baby boy. Do you want him? They said, of course. My biological mother found out later that my mother had never graduated from college and that my father had never graduated from high school.

In [None]:
# The transcription is a dictionary. Let's look at its keys.

In [9]:
transcription.keys()

dict_keys(['text', 'segments', 'language'])

In [10]:
res = transcription['segments'] # Keep only the "segments" entry of the "transcription" dictionary
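
Each entry in `segments` is itself a dictionary that carries, among other fields, a `start` and `end` time in seconds and the `text` spoken in that interval. The sketch below previews a few segments. The sample data and the helper name `preview_segments` are made up for illustration; with the real result you would pass `res` instead.

```python
# Hypothetical sample mimicking the shape of Whisper's output.
# The values are invented for illustration, not taken from the video.
sample_transcription = {
    "text": " Hello there. This is a test.",
    "segments": [
        {"id": 0, "start": 0.0, "end": 2.5, "text": " Hello there."},
        {"id": 1, "start": 2.5, "end": 5.0, "text": " This is a test."},
    ],
    "language": "en",
}

def preview_segments(segments, limit=5):
    """Return 'start -> end: text' lines for the first few segments."""
    lines = []
    for seg in segments[:limit]:
        lines.append(f"{seg['start']:.1f}s -> {seg['end']:.1f}s: {seg['text'].strip()}")
    return lines

for line in preview_segments(sample_transcription["segments"]):
    print(line)
```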

In [None]:
# Let's arrange the data in a more organised and readable form.

In [11]:
from datetime import datetime, timezone

def store_segments(segments):
  texts = []
  start_times = []

  for segment in segments:
    text = segment['text']
    start = segment['start']  # Offset in seconds from the start of the audio

    # Convert the offset to a UTC datetime so the result does not depend on the local time zone
    start_datetime = datetime.fromtimestamp(start, tz=timezone.utc)

    # Format the starting time as a string in the format "00:00:00"
    formatted_start_time = start_datetime.strftime('%H:%M:%S')

    texts.append(text)
    start_times.append(formatted_start_time)

  return texts, start_times

In [12]:
texts, start_times = store_segments(res)
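
Since each segment's `start` value is simply a number of seconds from the beginning of the audio, the same "00:00:00" string can also be produced with plain arithmetic. This is a minimal sketch of that idea; `format_timestamp` is a hypothetical helper, not part of Whisper or LangChain.

```python
def format_timestamp(seconds):
    """Convert an offset in seconds to an 'HH:MM:SS' string."""
    total = int(seconds)  # Drop the fractional part
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

print(format_timestamp(293.4))  # 00:04:53
print(format_timestamp(3725))   # 01:02:05
```

Unlike a datetime-based conversion, this does not depend on the machine's time zone and does not wrap around for audio longer than 24 hours.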

#### Install LangChain to perform semantic search. LangChain is a library for building applications on top of large language models.

In [15]:
pip install langchain

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [16]:
pip install openai # The OpenAI client library is needed to create embeddings

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting openai
  Downloading openai-0.26.5.tar.gz (55 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: openai
  Building wheel for openai (pyproject.toml) ... done
  Created wheel for openai: filename=openai-0.26.5-py3-none-any.whl size=67620 sha256=27510cb6ce7e99e788413b6c5b14f153d359e27a59b9004d2a189ed03c8d5a83
  Stored in directory: /root/.cache/pip/wheels/a7/47/99/8273a59fbd59c303e8ff175416d5c1c9c03a2e83ebf7525a99
Successfully built openai
Installing collected packages: openai
Successfully installed openai-0.26.5


In [20]:
import openai
from langchain import OpenAI

#### FAISS (Facebook AI Similarity Search) is a library for fast similarity search over embedding vectors. Put simply, it will find the parts of the transcription whose embeddings are closest to the embedding of our question.

In [18]:
pip install --upgrade faiss-gpu==1.7.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting faiss-gpu==1.7.1
  Downloading faiss_gpu-1.7.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (89.7 MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.1


In [21]:
import faiss

from langchain.vectorstores.faiss import FAISS

#### FAISS takes its input as embeddings (vectors), so text must be converted into embeddings before it is fed to FAISS. FAISS then searches for the stored embeddings most similar to the embedding of the question. To create embeddings, we will use OpenAIEmbeddings.
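
To make "searching for similar embeddings" concrete, here is a toy sketch of nearest-neighbour search in pure Python. It is only an illustration under simplifying assumptions: the three-dimensional vectors are invented, real embeddings have many more dimensions, and cosine similarity is just one possible measure (FAISS supports several and uses far more efficient index structures).

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest(query, vectors, k=1):
    """Return the indices of the k vectors most similar to the query."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: cosine_similarity(query, vectors[i]),
        reverse=True,
    )
    return ranked[:k]

# Made-up "embeddings" of three text chunks and of one question
toy_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
toy_query = [0.9, 0.1, 0.0]

print(nearest(toy_query, toy_vectors, k=2))  # [0, 2]
```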

In [25]:
from langchain.embeddings.openai import OpenAIEmbeddings #To create embeddings to feed to FAISS

In [27]:
import os
from getpass import getpass
# Read your own OpenAI API key at run time instead of hard-coding it in the notebook
os.environ["OPENAI_API_KEY"] = getpass("OpenAI API key: ")

#### Breaking lengthy text into smaller segments is often essential for handling it effectively. We will use CharacterTextSplitter to split the text into segments and store them in a list.

In [22]:
from langchain.text_splitter import CharacterTextSplitter

In [28]:
text_splitter = CharacterTextSplitter(chunk_size=1500, separator="\n")
docs = []
metadatas = []
for i, d in enumerate(texts):
    splits = text_splitter.split_text(d)
    docs.extend(splits)
    metadatas.extend([{"source": start_times[i]}] * len(splits))
embeddings = OpenAIEmbeddings()
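
The idea behind the loop above can be sketched without LangChain. The `split_text` function below is a simplified stand-in written for illustration, not LangChain's actual implementation: it cuts the text at the separator and greedily packs the pieces into chunks of at most `chunk_size` characters. Every chunk then receives the start time of the segment it came from. All `toy_` names and values are invented.

```python
def split_text(text, chunk_size, separator="\n"):
    """Greedily pack separator-delimited pieces into chunks of at most chunk_size characters."""
    chunks = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue  # Skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate  # Fits, or is a single oversized piece kept whole
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

toy_texts = ["aaaa\nbbbb\ncccc", "dddd"]
toy_start_times = ["00:00:00", "00:00:07"]

toy_docs, toy_metadatas = [], []
for text, start in zip(toy_texts, toy_start_times):
    splits = split_text(text, chunk_size=9)
    toy_docs.extend(splits)
    # One metadata entry per chunk, so chunks and sources stay aligned
    toy_metadatas.extend([{"source": start}] * len(splits))

print(toy_docs)
print(toy_metadatas)
```

Keeping the two lists the same length is what later lets an answer point back to a timestamp in the video.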

#### Build the FAISS vector store from the text chunks (their embeddings are computed in this step) and keep it in a variable.

In [30]:
store = FAISS.from_texts(docs, embeddings, metadatas=metadatas) # Make sure your OpenAI account has credit available, or this call will fail.
faiss.write_index(store.index, "docs.index")

#### Create a chain using the VectorDBQAWithSourcesChain tool from the LangChain library.
VectorDBQAWithSourcesChain takes the question and looks up the relevant documents in the vector database (created by FAISS) stored in the variable 'store'.


In [29]:
from langchain.chains import VectorDBQAWithSourcesChain

In [31]:
chain = VectorDBQAWithSourcesChain.from_llm(llm=OpenAI(temperature=0), vectorstore=store)
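
Conceptually, a question-answering chain with sources retrieves the most relevant chunks and then places them, together with their source labels, into a prompt for the language model. The sketch below illustrates only that second step. The prompt wording and the helper `build_prompt` are assumptions made for illustration and are not the prompt LangChain actually uses; the retrieved chunks are toy data.

```python
def build_prompt(question, retrieved):
    """Assemble a prompt from (text, source) pairs; the wording is illustrative only."""
    context = "\n\n".join(
        f"Content: {text}\nSource: {source}" for text, source in retrieved
    )
    return (
        "Answer the question using only the extracts below, "
        "and list the sources you used.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Toy retrieval result: (chunk text, start time of the chunk)
toy_retrieved = [
    ("The cat sat on the mat.", "00:00:10"),
    ("The dog slept by the door.", "00:00:25"),
]

print(build_prompt("Where did the cat sit?", toy_retrieved))
```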

In [32]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Let's ask our question.

In [33]:
result = chain({"question": "How old was Steve Jobs when started Apple?"})

Token indices sequence length is longer than the specified maximum sequence length for this model (1581 > 1024). Running this sequence through the model will result in indexing errors


In [34]:
print(f"Answer: {result['answer']}  Sources: {result['sources']}")

Answer:  Steve Jobs was 20 when he started Apple.
  Sources: 00:04:53
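
Because the source is a timestamp, it can be turned into a link that opens the video at the cited moment. This is a small sketch; `timestamp_to_seconds` and `youtube_link_at` are hypothetical helper names, and the link relies on YouTube's `t=<seconds>s` URL parameter.

```python
def timestamp_to_seconds(timestamp):
    """Convert an 'HH:MM:SS' string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in timestamp.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def youtube_link_at(url, timestamp):
    """Append a t=<seconds>s parameter so the link opens at the cited moment."""
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}t={timestamp_to_seconds(timestamp)}s"

print(youtube_link_at("https://www.youtube.com/watch?v=Tuw8hxrFBH8", "00:04:53"))
```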


#### Try asking different questions related to information from the video and check the answers generated.