# YouTube video to transcript to RAG on LLM

This notebook covers only the minimal steps needed to:
- Get the transcript from the YouTube video
- Create embeddings and store them in Pinecone (context for RAG)
- Set up the model and prompt chain to ask questions (LLM - RAG)

In [2]:
import os
from dotenv import load_dotenv

load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

YOUTUBE_VIDEO = "https://www.youtube.com/watch?v=PgGKhsWhUu8" # 3 hours <- Friend's suggestion (Bill Ackman)

## Transcribing the YouTube Video

To use the YouTube video as context, we first need its transcript.  
Here we use [OpenAI's Whisper](https://github.com/openai/whisper), a general-purpose speech recognition model.

In [4]:
import tempfile
import whisper
from pytubefix import YouTube
from pytubefix.cli import on_progress
import time

# Only transcribe if the transcript file for this video doesn't already exist
# transcription_video_file = 'youtube_v_<VIDEO_ID>.txt'
transcription_video_file = ".".join(["youtube_v_" + YOUTUBE_VIDEO.split('?v=')[-1], 'txt'])
if not os.path.exists(transcription_video_file):
    youtube = YouTube(YOUTUBE_VIDEO, on_progress_callback = on_progress)
    youtube_stream = youtube.streams.get_audio_only()

    # Load Whisper's base model: less accurate than the larger models, but faster
    whisper_model = whisper.load_model("base")

    with tempfile.TemporaryDirectory() as tmpdir:
        audio_file = youtube_stream.download(output_path=tmpdir, mp3=True)
        tic = time.perf_counter()
        # fp16=False runs the model in FP32 (avoids the FP16 warning on CPU)
        transcription = whisper_model.transcribe(audio_file, fp16=False)["text"].strip()
        toc = time.perf_counter()
        print(f"Transcribing audio to text took {toc - tic:0.4f} seconds")

        # Use a separate name so the audio path is not shadowed by the file handle
        with open(transcription_video_file, "w") as transcript_file:
            transcript_file.write(transcription)
            print("Done transcribing audio to text!")
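
The file name above is derived with `YOUTUBE_VIDEO.split('?v=')[-1]`, which works for this URL but would pick up extra query parameters (for example `&t=120s`) if they were present. A minimal sketch of a more tolerant alternative, using only the standard library, might look like this. The helper names `video_id_from_url` and `transcript_filename` are hypothetical and not part of the notebook.

```python
from urllib.parse import urlparse, parse_qs


def video_id_from_url(url: str) -> str:
    """Return the YouTube video ID from a watch URL or a youtu.be short link."""
    parsed = urlparse(url)
    # Standard watch URLs carry the ID in the "v" query parameter.
    query_ids = parse_qs(parsed.query).get("v")
    if query_ids:
        return query_ids[0]
    # Short youtu.be links carry the ID as the URL path.
    path_id = parsed.path.lstrip("/")
    if parsed.hostname == "youtu.be" and path_id:
        return path_id
    raise ValueError(f"Could not find a video ID in {url!r}")


def transcript_filename(url: str) -> str:
    """Build the transcript file name in the youtube_v_<VIDEO_ID>.txt pattern."""
    return f"youtube_v_{video_id_from_url(url)}.txt"


# Extra query parameters no longer leak into the file name.
print(transcript_filename("https://www.youtube.com/watch?v=PgGKhsWhUu8&t=120s"))
# prints youtube_v_PgGKhsWhUu8.txt
```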

## Create embeddings and store them in Pinecone (Vector Store)

- Use TextLoader to read the transcript from file
- Use TextSplitter to split the file into chunks for the embeddings
- Initialize OpenAI's embeddings model with LangChain
- Store embeddings in vector storage (Pinecone)

### Use TextLoader to read the transcript from file

In [9]:
# Use TextLoader to read the transcript from file
from langchain_community.document_loaders import TextLoader

loader = TextLoader(transcription_video_file)
text_documents = loader.load()
text_documents

[Document(metadata={'source': 'youtube_v_PgGKhsWhUu8.txt'}, page_content="The only person who calls you more harm than a thief with a dagger is a journalist with a pen. The following is a conversation with Bill Ackman, a legendary activist investor who has been part of some of the biggest and at times controversial trades in history. Also, he is fearlessly vocal on X, FKA, Twitter, and uses the platform to fight for ideas he believes in. For example, he was a central figure in the resignation of the president of Harvard University, Claude Engay, the saga of which we discuss in this episode. This is the Lex Friedman podcast to support it. Please check out our sponsors in the description and now to your friends, here's Bill Ackman. In your lecture on the basics of finance and investing, you mentioned a book, Intelligent Investor by Benjamin Graham, as being formative in your life. What key lesson do you take away from that book that informs your own investing? Sure. Actually, it was the 

### Use TextSplitter to split the file into chunks for the embeddings

In [10]:
# Use TextSplitter to split the file into chunks for the embeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Hyperparameters for splitting the documents:
chunk_size = 1000
chunk_overlap = 20

text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
documents = text_splitter.split_documents(text_documents)
documents[:5]

[Document(metadata={'source': 'youtube_v_PgGKhsWhUu8.txt'}, page_content="The only person who calls you more harm than a thief with a dagger is a journalist with a pen. The following is a conversation with Bill Ackman, a legendary activist investor who has been part of some of the biggest and at times controversial trades in history. Also, he is fearlessly vocal on X, FKA, Twitter, and uses the platform to fight for ideas he believes in. For example, he was a central figure in the resignation of the president of Harvard University, Claude Engay, the saga of which we discuss in this episode. This is the Lex Friedman podcast to support it. Please check out our sponsors in the description and now to your friends, here's Bill Ackman. In your lecture on the basics of finance and investing, you mentioned a book, Intelligent Investor by Benjamin Graham, as being formative in your life. What key lesson do you take away from that book that informs your own investing? Sure. Actually, it was the 
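
To build some intuition for `chunk_size` and `chunk_overlap`, here is a simplified sketch that cuts a string into fixed-size windows where each window repeats the last few characters of the previous one. This is only an illustration: `RecursiveCharacterTextSplitter` prefers to split on separators such as paragraphs and spaces, so its real chunks will not be exact windows like these. The function `sliding_chunks` is a hypothetical helper.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(sample, chunk_size=10, chunk_overlap=2)
for chunk in chunks:
    print(chunk)
# prints:
# abcdefghij
# ijklmnopqr
# qrstuvwxyz
```

The overlap means a sentence that straddles a boundary still appears in full in at least one chunk, as long as it is shorter than the overlap.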

### Initialize OpenAI's embeddings model with LangChain

In [19]:
# Initialize OpenAI's embeddings model with LangChain
from langchain_openai.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

### Store embeddings in vector storage (Pinecone)

In [20]:
# Store embeddings in vector storage (Pinecone)
from langchain_pinecone import PineconeVectorStore

index_name = "youtube-index-clean"

pinecone = PineconeVectorStore.from_documents(
    documents, embeddings, index_name=index_name
)
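
Conceptually, a vector store keeps one vector per chunk and, at query time, returns the chunks whose vectors are closest to the query vector. The toy sketch below shows that idea in plain Python with hand-written 3-dimensional vectors. It assumes cosine similarity as the metric (the actual metric depends on how the Pinecone index was created), and `cosine_similarity`, `top_k`, and `toy_store` are hypothetical names, not Pinecone or LangChain APIs.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector: list[float], store: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the texts of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vector, vector), text) for text, vector in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Made-up vectors standing in for real embeddings.
toy_store = [
    ("chunk about investing", [0.9, 0.1, 0.0]),
    ("chunk about Harvard", [0.1, 0.9, 0.0]),
    ("chunk about sponsors", [0.0, 0.1, 0.9]),
]

nearest = top_k([1.0, 0.2, 0.0], toy_store, k=2)
print(nearest)
# prints ['chunk about investing', 'chunk about Harvard']
```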

## Set OpenAI model and prompt chain to ask questions (LLM - RAG)

- Set up the OpenAI chat model
- Set up our prompt chain with context from Pinecone (vector store)
- Ask a question using our chain

### Set up OpenAI chat model

In [21]:
from langchain_openai.chat_models import ChatOpenAI

model = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model="gpt-3.5-turbo")

### Set up prompt chain with context from Pinecone (vector store)

To set up our `chain`, we need to define the following:
- `template`: Prompt template
- `prompt`: How we send questions to OpenAI (built from the `template`)
- `model`: The OpenAI model defined in the previous step
- `parser`: Output parser for the `answer`
- `retriever`: Supplies the context; here `pinecone` acts as the retriever

In [26]:
from langchain.prompts import ChatPromptTemplate

template = """
Answer the question based on the context below. If you can't
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

from langchain_core.output_parsers import StrOutputParser
parser = StrOutputParser()

from langchain_core.runnables import RunnablePassthrough
chain = (
    {"context": pinecone.as_retriever(), "question": RunnablePassthrough()}
    | prompt
    | model
    | parser
)
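
As a rough illustration of what the chain above does at each stage, the sketch below reproduces the same flow in plain Python: retrieve context, fill the template, call the model, return text. `fake_retriever`, `fake_model`, `build_prompt`, and `run_chain` are hypothetical stand-ins, not LangChain APIs, and joining the retrieved chunks with newlines is a simplification of how the real chain renders them.

```python
template = """
Answer the question based on the context below. If you can't
answer the question, reply "I don't know".

Context: {context}

Question: {question}
"""


def fake_retriever(question: str) -> list[str]:
    """Stand-in for the retriever: returns canned chunks for any question."""
    return ["Chunk one of the transcript.", "Chunk two of the transcript."]


def fake_model(prompt_text: str) -> str:
    """Stand-in for the chat model: reports the prompt size instead of answering."""
    return f"(model reply to a prompt of {len(prompt_text)} characters)"


def build_prompt(question: str) -> str:
    """Mirror the dict step of the chain, then fill the template."""
    inputs = {"context": fake_retriever(question), "question": question}
    return template.format(
        context="\n".join(inputs["context"]),
        question=inputs["question"],
    )


def run_chain(question: str) -> str:
    """retriever -> prompt -> model -> plain string."""
    return fake_model(build_prompt(question))


print(build_prompt("What is value investing?"))
print(run_chain("What is value investing?"))
```

The question is passed through unchanged while the same question is also sent to the retriever, which is the role `RunnablePassthrough()` plays in the chain.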

### Ask a question using our chain

In [27]:
chain.invoke("What were Carl thoughts on Herba Life?")

'Carl Icahn believed that the person being interviewed was stupid for being short on Herbalife, indicating that he had a positive view of the company.'

## Further work

- Avoid indexing the same embeddings in Pinecone (vector store) multiple times
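
One possible direction for this, sketched under assumptions: give each chunk a deterministic ID derived from its source and text, so re-running the notebook produces the same IDs. Pinecone upserts overwrite records that share an ID, so identical chunks would replace themselves instead of piling up. Whether and how the LangChain wrapper accepts caller-supplied IDs should be checked against the `langchain_pinecone` documentation. The helper `chunk_id` and the sample chunks are hypothetical.

```python
import hashlib


def chunk_id(source: str, text: str) -> str:
    """Deterministic ID: the same source and text always give the same value."""
    digest = hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()
    return digest[:32]  # shortened for readability; still effectively unique


# Made-up chunks standing in for the split transcript.
sample_chunks = [
    "First chunk of the transcript.",
    "Second chunk of the transcript.",
]

ids = [chunk_id("youtube_v_PgGKhsWhUu8.txt", text) for text in sample_chunks]

# Re-computing the IDs yields the same values, so a second run would
# target the same records rather than create new ones.
ids_again = [chunk_id("youtube_v_PgGKhsWhUu8.txt", text) for text in sample_chunks]
print(ids == ids_again)
# prints True
```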