Following the tutorial at https://github.com/gkamradt/langchain-tutorials/blob/main/loaders/YouTube%20Loader.ipynb

In [1]:
# load .env
import os
from dotenv import load_dotenv

load_dotenv()

# check that the OpenAI API key has been loaded
assert "OPENAI_API_KEY" in os.environ

In [2]:
from langchain.document_loaders import YoutubeLoader
from langchain.llms import OpenAI
from langchain.chains.summarize import load_summarize_chain


In [10]:
# let's make this meta by passing the video where `Data Independent` goes over this tutorial

# this seems to be broken in langchain 0.0.135 https://github.com/hwchase17/langchain/issues/1962
# loader = YoutubeLoader.from_youtube_url("https://www.youtube.com/watch?v=pNcQ5XXMgH4", add_video_info=True)

loader = YoutubeLoader(
    video_id="pNcQ5XXMgH4", 
    # add_video_info=True # add_video_info also seems to be broken
)
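
Since `from_youtube_url` is not usable here, the video ID has to be copied out of the URL by hand. A minimal sketch of doing that with the standard library follows; `extract_video_id` is a hypothetical helper (not part of LangChain) and only handles the two common URL shapes.

```python
from urllib.parse import urlparse, parse_qs


def extract_video_id(url: str) -> str:
    """Return the video ID from a youtube.com/watch or youtu.be URL (hypothetical helper)."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        # short links carry the ID as the path: https://youtu.be/<id>
        video_id = parsed.path.lstrip("/")
    elif host.endswith("youtube.com"):
        # standard links carry the ID in the `v` query parameter
        video_id = parse_qs(parsed.query).get("v", [""])[0]
    else:
        video_id = ""
    if not video_id:
        raise ValueError(f"could not find a video ID in {url!r}")
    return video_id


video_id = extract_video_id("https://www.youtube.com/watch?v=pNcQ5XXMgH4")
print(video_id)
```

The returned string can then be passed as `video_id` to `YoutubeLoader`.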

In [11]:
result = loader.load()

In [17]:
print(f"{type(result) = }")
print(f"{len(result) = }")
print(f"{result[0].page_content = }")
print(f"{result[0].metadata = }")

type(result) = <class 'list'>
len(result) = 1
result[0].page_content = "what is going on good people again right now we have a super exciting tutorial because we are going to take YouTube transcripts and we're going to pass them to open Ai and the way that we're going to do that is via a library called Lang chain which is what this entire series is about now before we jumped into it I wanted to show a diagram again I think these diagrams are helpful but you have to let me know so just let me know in the comments here so I wanted to do an overview about what we're actually going to be writing out in code because I think it's a little easier to see in pictures first so the way this is going to work is we're going to have a video a YouTube video we're going to pass it we're going to pass it a URL and then what Lang chain is going to help us do is it's going to help us load this video as a document and a document just means you're going to be taking the transcript which is the text of the 

In [19]:
# init llm
llm = OpenAI(temperature=0)

# init load_summarize_chain
chain = load_summarize_chain(llm, chain_type="stuff", verbose=False)

# run
print(chain.run(result))

 This tutorial explains how to use the Lang Chain library to take YouTube transcripts and pass them to Open AI to generate a summary. It also explains how to use the recursive character splitter to split up long transcripts into smaller chunks, and how to use the mapreduce method to generate a summary of multiple videos. Finally, it explains how to use the summarize scan to generate a summary of multiple videos.


As the summary says, the tutorial then moves on to explain how to split up long transcripts and how to generate a summary of multiple videos.
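
The idea behind summarizing a transcript that does not fit in one prompt can be sketched without any API calls: split the text, summarize each piece (map), then summarize the joined partial summaries (reduce). This is only an illustration of the pattern, not LangChain's implementation. `split_text`, `map_reduce_summarize` and `first_words` are made-up names, the splitter is a naive fixed-width one, and `first_words` stands in for the LLM call.

```python
from typing import Callable, List


def split_text(text: str, chunk_size: int) -> List[str]:
    """Naive fixed-width splitter (a stand-in, not LangChain's recursive splitter)."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def map_reduce_summarize(
    text: str, summarize: Callable[[str], str], chunk_size: int = 500
) -> str:
    """Summarize each chunk on its own, then summarize the combined partial summaries."""
    chunks = split_text(text, chunk_size)
    partial_summaries = [summarize(chunk) for chunk in chunks]  # map step
    return summarize(" ".join(partial_summaries))               # reduce step


def first_words(text: str, n: int = 5) -> str:
    """Placeholder 'summarizer' that keeps the first n words; a real run would call an LLM."""
    return " ".join(text.split()[:n])


transcript = "one two three four five six seven eight"
print(map_reduce_summarize(transcript, lambda t: first_words(t, 2), chunk_size=14))
```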

Let's diverge here and instead try to do QA on the transcripts.

In [20]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain import OpenAI, VectorDBQA


# this will be meta**2
youtube_ids_list = [
    "pNcQ5XXMgH4", # video about Youtube Transcript
    "EnT-ZTrcPrg"  # video about QA on custom data
]

# get and split transcripts
text_fragments = []

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)

for vid_id in youtube_ids_list:
    loader = YoutubeLoader(vid_id)
    result = loader.load()
    
    text_fragments.extend(text_splitter.split_documents(result))

# init embeddings model, VectorDB, LLM and VectorDBQA component
embeddings = OpenAIEmbeddings()

docsearch = Chroma.from_documents(text_fragments, embeddings)

llm = OpenAI(temperature=0)

# reuse the temperature=0 LLM created above instead of instantiating a second one
qa = VectorDBQA.from_chain_type(llm=llm, chain_type="stuff", vectorstore=docsearch, return_source_documents=True)

Using embedded DuckDB without persistence: data will be transient
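
To get a feel for the retrieval step in the cell above, here is a toy sketch of similarity search over text chunks. It is not how `OpenAIEmbeddings` or `Chroma` work internally: `embed` uses plain word counts as a stand-in for a real embedding model, and `embed`, `cosine` and `top_k` are made-up names.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


chunks = [
    "split the transcript into smaller chunks",
    "summarize each chunk with the llm",
    "load the video as a document",
]
best = top_k("how do I split a long transcript", chunks, k=1)
print(best[0][1])
```

With the `stuff` chain type, the retrieved chunks are then placed into a single prompt together with the question.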


In [22]:
query = "How can you use LangChain to summarize transcripts from very long videos?"
result = qa({"query": query})

In [23]:
result['result']

' LangChain can split the transcript into shorter pieces and then use Open AI to generate a summary for each piece.'

In [30]:
for doc in result['source_documents']:
    print(f"from source {doc.metadata['source']}: {doc.page_content[:150]}...")

from source pNcQ5XXMgH4: something that lane chain can help understand now with that document we're then going to go generate a summary of it and the way that link chain is go...
from source pNcQ5XXMgH4: to tell us hey this video is about XYZ now an interesting part about this and where it gets kind of confusing is well what happens if your video is to...
from source pNcQ5XXMgH4: days before Lang chain what we'd have to do here is we'd have to figure out some way to either run multiple pieces ourselves manually copy and paste i...
from source pNcQ5XXMgH4: going to split up that text so we're going to still see that it's from video two but we're going to have our document one document two document three ...
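
All four retrieved chunks come from the first video. A quick way to tally hits per video is sketched below; `source_metadata` is a hand-written stand-in for the `metadata` of each entry in `result['source_documents']`, copied from the output above.

```python
from collections import Counter

# stand-in for [doc.metadata for doc in result['source_documents']]
source_metadata = [
    {"source": "pNcQ5XXMgH4"},
    {"source": "pNcQ5XXMgH4"},
    {"source": "pNcQ5XXMgH4"},
    {"source": "pNcQ5XXMgH4"},
]

# count how many retrieved chunks each video contributed
hits_per_video = Counter(meta["source"] for meta in source_metadata)
print(hits_per_video.most_common())
```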
