<a href="https://colab.research.google.com/github/amrindersingh03/Unstructured-Machine-Learning-/blob/main/Langchain_transcription_and_Semantic_search_%20using%20cohere.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# This notebook transcribes a YouTube video with OpenAI's Whisper model, then performs semantic search on the transcription using LangChain and Cohere.

In [None]:
# Make sure you are connected to a GPU runtime

### Install pytube: a library for downloading audio from YouTube

In [1]:
pip install pytube # For audio downloading

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pytube
  Downloading pytube-12.1.2-py3-none-any.whl (57 kB)
Installing collected packages: pytube
Successfully installed pytube-12.1.2


In [2]:
# Get Whisper, a speech recognition and translation model from OpenAI.

In [3]:
pip install git+https://github.com/openai/whisper.git -q # Whisper from OpenAI transcription model

  Preparing metadata (setup.py) ... done
  Building wheel for openai-whisper (setup.py) ... done


In [5]:
import whisper 
import pytube 

### Pick any YouTube video that you wish to transcribe. Here, I have used a video of Steve Jobs's speech.

In [6]:
url = "https://www.youtube.com/watch?v=Tuw8hxrFBH8"
video = pytube.YouTube(url) # The video is now accessible through the variable "video"

In [7]:
audio = video.streams.get_audio_only() # Select the audio-only stream of the video
audio.download(filename='tmp.mp3') # Download only the audio track of the YouTube video

'/content/tmp.mp3'

In [8]:
# Load the whisper model

In [9]:
model = whisper.load_model("small")

100%|███████████████████████████████████████| 461M/461M [00:08<00:00, 57.1MiB/s]


In [10]:
# Transcribe the audio using the Whisper model "model"

In [11]:
transcription = model.transcribe('/content/tmp.mp3') # The transcription result is stored in the variable "transcription"

### Let's see what this transcription looks like

In [None]:
transcription

In [None]:
# As we saw, the transcription is a dictionary

In [12]:
res = transcription['segments'] # Grab only the "segments" entry of the "transcription" dictionary

In [None]:
res
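
For orientation, the dictionary returned by `transcribe` has a `text` key with the full transcript, a `language` key, and a `segments` list whose items carry `start`, `end`, and `text` (among other fields). A minimal sketch with a hand-written stand-in dictionary; the values in `sample`, especially the timings, are made up for illustration and are not real model output:

```python
# Hypothetical stand-in for the dictionary returned by model.transcribe();
# the timings are invented and only the most relevant keys are shown.
sample = {
    "text": " Today, I want to tell you three stories from my life. That's it.",
    "language": "en",
    "segments": [
        {"id": 0, "start": 0.0, "end": 4.0,
         "text": " Today, I want to tell you three stories from my life."},
        {"id": 1, "start": 4.0, "end": 5.5, "text": " That's it."},
    ],
}

# Each segment is a dictionary; start and end are offsets in seconds
for seg in sample["segments"]:
    print(f"{seg['start']:7.2f} -> {seg['end']:7.2f} |{seg['text']}")

# Segment texts begin with a space, so strip them when a clean list is needed
segment_texts = [seg["text"].strip() for seg in sample["segments"]]
```

Working on `segments` rather than on `text` is what keeps a start time attached to every piece of the transcript.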

In [None]:
# Let's arrange the data in a more organised and readable manner.

In [13]:
from datetime import datetime, timezone

def store_segments(segments):
  """Return the text and the formatted start time of every Whisper segment."""
  texts = []
  start_times = []

  for segment in segments:
    text = segment['text']
    start = segment['start']  # Offset in seconds from the start of the audio

    # Interpret the offset in UTC so the result does not depend on the
    # machine's local time zone
    start_datetime = datetime.fromtimestamp(start, tz=timezone.utc)

    # Format the starting time as a string in the format "00:00:00"
    formatted_start_time = start_datetime.strftime('%H:%M:%S')

    texts.append(text)
    start_times.append(formatted_start_time)

  return texts, start_times
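
The `start` value of a segment is an offset in seconds from the beginning of the audio rather than a calendar time. As an alternative sketch, the offset can be formatted with plain integer arithmetic, which involves no time zones and keeps working for recordings longer than 24 hours. The helper name `format_offset` is hypothetical:

```python
def format_offset(seconds):
    """Format an offset in seconds as HH:MM:SS (hours may exceed 23)."""
    total = int(seconds)  # Drop the fractional part
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"

print(format_offset(347.2))   # 00:05:47
print(format_offset(90000))   # 25:00:00
```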

In [14]:
store_segments(res)

([' Today, I want to tell you three stories from my life.',
  " That's it. No big deal. Just three stories.",
  ' The first story is about connecting the dots.',
  ' I dropped out of Reed College after the first six months, but then stayed around as a drop-in',
  " for another 18 months or so before I really quit. So why'd I drop out?",
  ' It started before I was born. My biological mother was a young unwed graduate student,',
  ' and she decided to put me up for adoption. She felt very strongly that I should be adopted by',
  ' college graduates, so everything was all set for me to be adopted at birth by a lawyer and his wife.',
  ' Except that when I popped out, they decided at the last minute that they really wanted a girl.',
  ' So my parents, who were on a waiting list, got a call in the middle of the night asking,',
  " we've got an unexpected baby boy. Do you want him? They said, of course.",
  ' My biological mother found out later that my mother had never graduated from colle

In [21]:
texts, start_times = store_segments(res)

In [None]:
# Install langchain to perform semantic search

In [15]:
pip install langchain

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting langchain
  Downloading langchain-0.0.86-py3-none-any.whl (250 kB)
Collecting dataclasses-json<0.6.0,>=0.5.7
  Downloading dataclasses_json-0.5.7-py3-none-any.whl (25 kB)
Collecting marshmallow-enum<2.0.0,>=1.5.1
  Downloading marshmallow_enum-1.5.1-py2.py3-none-any.whl (4.2 kB)
Collecting typing-inspect>=0.4.0
  Downloading typing_inspect-0.8.0-py3-none-any.whl (8.7 kB)
Collecting mypy-extensions>=0.3.0
  Downloading mypy_extensions-1.0.0-py3-none-any.whl (4.7 kB)
Installing collected packages: mypy-extensions, typing-inspect, marshmallow-enum, dataclasses-json, langchain
Successfully installed dataclasses-json-0.5.7 langchain-0.0.86 marshmallow-enum-1.5.1 mypy-extensions-1.0.0 typing-inspect-0.8.0


In [16]:
pip install cohere

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting cohere
  Downloading cohere-3.5.0.tar.gz (13 kB)
  Preparing metadata (setup.py) ... done
Collecting urllib3~=1.26
  Downloading urllib3-1.26.14-py2.py3-none-any.whl (140 kB)
Building wheels for collected packages: cohere
  Building wheel for cohere (setup.py) ... done
  Created wheel for cohere: filename=cohere-3.5.0-cp38-cp38-linux_x86_64.whl size=16265 sha256=a08d0d01ec47e556bf2ecdbe00188c596965fc9018d33c16c08716a6ec46c524
  Stored in directory: /root/.cache/pip/wheels/c3/2c/25/0696f1aa599c730e68d48caafb6fc8ff2b1870ea451336e7ff
Successfully built cohere
Installing collected packages: urllib3, cohere
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.24.3
    Uninstalling urllib3-1.24.3:
      Successfully uni

In [17]:
pip install --upgrade faiss-gpu==1.7.1

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting faiss-gpu==1.7.1
  Downloading faiss_gpu-1.7.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (89.7 MB)
Installing collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.1


In [18]:
from langchain.embeddings import CohereEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores.faiss import FAISS
from langchain.chains import VectorDBQAWithSourcesChain
from langchain import Cohere
import cohere
import faiss

In [19]:
import os
os.environ["COHERE_API_KEY"] = "YOUR_COHERE_API_KEY"  # Replace with your own key; never publish a real key
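
A key written directly into a notebook is easy to leak when the notebook is shared. One common alternative is to set the variable outside the notebook and only read it here, failing early with a clear message if it is missing. A minimal sketch; `require_env` is a hypothetical helper, and the demonstration passes an explicit mapping rather than touching the real environment:

```python
import os

def require_env(name, env=os.environ):
    """Return the value of an environment variable or fail with a clear message."""
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before continuing")
    return value

# Demonstration with an explicit mapping instead of the real environment
print(require_env("COHERE_API_KEY", env={"COHERE_API_KEY": "example-key"}))
```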

In [29]:
text_splitter = CharacterTextSplitter(chunk_size=1500, separator="\n")
docs = []
metadatas = []
for i, d in enumerate(texts):
    splits = text_splitter.split_text(d)
    docs.extend(splits)
    metadatas.extend([{"source": start_times[i]}] * len(splits))
# embeddings = OpenAIEmbeddings()
embeddings = CohereEmbeddings(cohere_api_key=os.environ["COHERE_API_KEY"])  # Reuse the key set above instead of repeating it
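
Whisper segments are typically a sentence or less, far below `chunk_size=1500`, so in the loop above each segment most likely passes through the splitter unchanged. If larger passages are wanted for retrieval, one option is to merge consecutive segments until a character budget is reached and keep the start time of the first segment as the source. A minimal sketch under that assumption; `group_segments` and `max_chars` are hypothetical names, not part of LangChain:

```python
def group_segments(texts, start_times, max_chars=1500):
    """Merge consecutive segments into chunks within a character budget.

    Returns (chunks, metadatas); each metadata entry records the start time
    of the first segment in its chunk. A single segment longer than the
    budget becomes a chunk of its own.
    """
    chunks, metadatas = [], []
    current, current_start = [], None

    for text, start in zip(texts, start_times):
        text = text.strip()
        # Length of the chunk if this segment were appended (joining spaces included)
        projected = sum(len(t) for t in current) + len(current) + len(text)
        if current and projected > max_chars:
            chunks.append(" ".join(current))
            metadatas.append({"source": current_start})
            current, current_start = [], None
        if not current:
            current_start = start
        current.append(text)

    if current:  # Flush the last partially filled chunk
        chunks.append(" ".join(current))
        metadatas.append({"source": current_start})
    return chunks, metadatas


demo_texts = [" one two", " three", " four five six"]
demo_times = ["00:00:00", "00:00:03", "00:00:05"]
chunks, metas = group_segments(demo_texts, demo_times, max_chars=13)
print(chunks)
print(metas)
```

Larger chunks give the language model more context per retrieved passage, at the cost of coarser timestamps in the reported sources.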

In [None]:
# !apt install libomp-dev
# !python -m pip install --upgrade faiss faiss-gpu
# import faiss

In [25]:
docs

['Today, I want to tell you three stories from my life.',
 "That's it. No big deal. Just three stories.",
 'The first story is about connecting the dots.',
 'I dropped out of Reed College after the first six months, but then stayed around as a drop-in',
 "for another 18 months or so before I really quit. So why'd I drop out?",
 'It started before I was born. My biological mother was a young unwed graduate student,',
 'and she decided to put me up for adoption. She felt very strongly that I should be adopted by',
 'college graduates, so everything was all set for me to be adopted at birth by a lawyer and his wife.',
 'Except that when I popped out, they decided at the last minute that they really wanted a girl.',
 'So my parents, who were on a waiting list, got a call in the middle of the night asking,',
 "we've got an unexpected baby boy. Do you want him? They said, of course.",
 'My biological mother found out later that my mother had never graduated from college',
 'and that my fathe

In [30]:
embeddings

CohereEmbeddings(client=<cohere.client.Client object at 0x7f31514dc610>, model='large', truncate='NONE', cohere_api_key='YOUR_COHERE_API_KEY')

In [None]:
metadatas

In [34]:
store = FAISS.from_texts(docs, embeddings, metadatas=metadatas)  # Attach start times so answers can cite their sources
faiss.write_index(store.index, "docs.index")

TypeError: ignored
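
Conceptually, the vector store keeps one embedding vector per text and answers a query by returning the texts whose vectors lie closest to the query's vector. A toy sketch of that idea with exact L2 distance in pure Python; the sentences are samples, the three-dimensional vectors are invented stand-ins for real Cohere embeddings, and `nearest` is a hypothetical helper, not the FAISS API:

```python
import math

# Invented 3-d vectors standing in for real embeddings
toy_index = {
    "The first story is about connecting the dots.": [0.9, 0.1, 0.0],
    "My second story is about love and loss.": [0.1, 0.9, 0.1],
    "My third story is about death.": [0.0, 0.2, 0.9],
}

def nearest(query_vector, index, k=1):
    """Return the k texts whose vectors are closest to query_vector (L2 distance)."""
    ranked = sorted(index, key=lambda text: math.dist(index[text], query_vector))
    return ranked[:k]

# Pretend this is the embedding of a question about love
query = [0.2, 0.8, 0.0]
print(nearest(query, toy_index))
```

A real index does the same comparison over vectors with hundreds or thousands of dimensions, which is why a dedicated library is used instead of a Python loop.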

In [None]:
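# The LLM must be a class that has been imported: Cohere was imported above,
# whereas OpenAI was not, so referring to OpenAI here raises a NameError.
# Cohere reads its key from the COHERE_API_KEY environment variable.
chain = VectorDBQAWithSourcesChain.from_llm(llm=Cohere(temperature=0), vectorstore=store)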

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
result = chain({"question": "How old was Steve Jobs when he started Apple?"})

In [None]:
print(f"Answer: {result['answer']}  Sources: {result['sources']}")

Answer:  Steve Jobs was 20 when he started Apple.  Sources: 00:05:47, 00:05:59
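
Because each source is a start time in the video, it can be turned into a link that opens the video at that moment. A minimal sketch, assuming the sources arrive as a comma-separated string of `HH:MM:SS` values as in the output above and that the video URL already contains a query string (so the offset is appended with `&t=`); `timestamp_to_link` is a hypothetical helper:

```python
def timestamp_to_link(video_url, timestamp):
    """Turn an "HH:MM:SS" source into a link that starts playback there."""
    hours, minutes, seconds = (int(part) for part in timestamp.split(":"))
    offset = hours * 3600 + minutes * 60 + seconds
    return f"{video_url}&t={offset}s"

video_url = "https://www.youtube.com/watch?v=Tuw8hxrFBH8"
sources = "00:05:47, 00:05:59"  # Same format as result['sources'] in the output above

for source in sources.split(", "):
    print(source, timestamp_to_link(video_url, source))
```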
