### [Lance Martin](https://github.com/rlancemartin) from [LangChain](https://www.langchain.com/) recently uploaded an excellent series of [YouTube videos about RAG](https://www.youtube.com/playlist?list=PLfaIDFEXuae2LXbO1_PKyVJiQ23ZztA0x) with an [accompanying codebase](https://github.com/langchain-ai/rag-from-scratch).<br><br>The following code generates a summary of their transcripts using the MapReduce approach.

In [1]:
import os

from langchain_community.document_loaders.blob_loaders.youtube_audio import YoutubeAudioLoader
from langchain_community.document_loaders.generic import GenericLoader
from langchain_community.document_loaders.parsers import OpenAIWhisperParser

from langchain.chains.summarize import load_summarize_chain

from langchain_core.documents.base import Document
from langchain_core.prompts import PromptTemplate

from langchain_openai import ChatOpenAI

from pytube import Playlist

In [2]:
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_ENDPOINT'] = 'https://api.smith.langchain.com'
# os.environ['LANGCHAIN_API_KEY'] = <your-api-key>
# os.environ['OPENAI_API_KEY'] = <your-api-key>
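
Since both API keys above are commented out, the later cells will only work if the keys are already present in the environment. A minimal sketch of a guard for that, assuming exactly these two variable names are required (`REQUIRED_KEYS` and `missing_keys` are hypothetical helpers, not part of LangChain):

```python
import os
from typing import List, Mapping, Sequence

# Assumption: these are the only two keys the notebook needs.
REQUIRED_KEYS = ['LANGCHAIN_API_KEY', 'OPENAI_API_KEY']


def missing_keys(env: Mapping[str, str] = os.environ,
                 required: Sequence[str] = REQUIRED_KEYS) -> List[str]:
    """Return the names of required keys that are unset or empty in `env`."""
    return [name for name in required if not env.get(name)]


# Report any missing key before the expensive download/transcription steps run.
for name in missing_keys():
    print(f'{name} is not set')
```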

### First, collect the URLs of all the videos in the given YouTube playlist:

In [3]:
playlist_url = 'https://www.youtube.com/playlist?list=PLfaIDFEXuae2LXbO1_PKyVJiQ23ZztA0x'
playlist = Playlist(playlist_url)
urls = [url for url in playlist]
f'retrieved {len(urls)} single videos from the playlist'

'retrieved 14 single videos from the playlist'

### Download the audio of each video and transcribe it:

In [4]:
loader = GenericLoader(YoutubeAudioLoader(urls, save_dir='~/Downloads/rag_from_scratch_audios'), OpenAIWhisperParser())
docs = loader.load()

[youtube] Extracting URL: https://www.youtube.com/watch?v=wd7TZ4w1mSw
[youtube] wd7TZ4w1mSw: Downloading webpage
[youtube] wd7TZ4w1mSw: Downloading ios player API JSON
[youtube] wd7TZ4w1mSw: Downloading android player API JSON




[youtube] wd7TZ4w1mSw: Downloading m3u8 information




[youtube] wd7TZ4w1mSw: Downloading initial data API JSON
[info] wd7TZ4w1mSw: Downloading 1 format(s): 140
[download] /Users/bvahdat/Downloads/rag_from_scratch_audios/RAG From Scratch： Part 1 (Overview).m4a has already been downloaded
[download] 100% of    4.82MiB
[ExtractAudio] Not converting audio /Users/bvahdat/Downloads/rag_from_scratch_audios/RAG From Scratch： Part 1 (Overview).m4a; file is already in target format m4a
[youtube] Extracting URL: https://www.youtube.com/watch?v=bjb_EMsTDKI
[youtube] bjb_EMsTDKI: Downloading webpage
[youtube] bjb_EMsTDKI: Downloading ios player API JSON
[youtube] bjb_EMsTDKI: Downloading android player API JSON




[youtube] bjb_EMsTDKI: Downloading m3u8 information
[info] bjb_EMsTDKI: Downloading 1 format(s): 140
[download] /Users/bvahdat/Downloads/rag_from_scratch_audios/RAG From Scratch： Part 2 (Indexing).m4a has already been downloaded
[download] 100% of    4.50MiB
[ExtractAudio] Not converting audio /Users/bvahdat/Downloads/rag_from_scratch_audios/RAG From Scratch： Part 2 (Indexing).m4a; file is already in target format m4a
[youtube] Extracting URL: https://www.youtube.com/watch?v=LxNVgdIz9sU
[youtube] LxNVgdIz9sU: Downloading webpage
[youtube] LxNVgdIz9sU: Downloading ios player API JSON
[youtube] LxNVgdIz9sU: Downloading android player API JSON




[youtube] LxNVgdIz9sU: Downloading m3u8 information
[info] LxNVgdIz9sU: Downloading 1 format(s): 140
[download] /Users/bvahdat/Downloads/rag_from_scratch_audios/RAG From Scratch： Part 3 (Retrieval).m4a has already been downloaded
[download] 100% of    4.84MiB
[ExtractAudio] Not converting audio /Users/bvahdat/Downloads/rag_from_scratch_audios/RAG From Scratch： Part 3 (Retrieval).m4a; file is already in target format m4a
[youtube] Extracting URL: https://www.youtube.com/watch?v=Vw52xyyFsB8
[youtube] Vw52xyyFsB8: Downloading webpage
[youtube] Vw52xyyFsB8: Downloading ios player API JSON
[youtube] Vw52xyyFsB8: Downloading android player API JSON




[youtube] Vw52xyyFsB8: Downloading m3u8 information
[info] Vw52xyyFsB8: Downloading 1 format(s): 140
[download] /Users/bvahdat/Downloads/rag_from_scratch_audios/RAG From Scratch： Part 4 (Generation).m4a has already been downloaded
[download] 100% of    5.94MiB
[ExtractAudio] Not converting audio /Users/bvahdat/Downloads/rag_from_scratch_audios/RAG From Scratch： Part 4 (Generation).m4a; file is already in target format m4a
[youtube] Extracting URL: https://www.youtube.com/watch?v=JChPi0CRnDY
[youtube] JChPi0CRnDY: Downloading webpage
[youtube] JChPi0CRnDY: Downloading ios player API JSON
[youtube] JChPi0CRnDY: Downloading android player API JSON




[youtube] JChPi0CRnDY: Downloading m3u8 information
[info] JChPi0CRnDY: Downloading 1 format(s): 140
[download] /Users/bvahdat/Downloads/rag_from_scratch_audios/RAG from scratch： Part 5 (Query Translation -- Multi Query).m4a has already been downloaded
[download] 100% of    5.69MiB
[ExtractAudio] Not converting audio /Users/bvahdat/Downloads/rag_from_scratch_audios/RAG from scratch： Part 5 (Query Translation -- Multi Query).m4a; file is already in target format m4a
[youtube] Extracting URL: https://www.youtube.com/watch?v=77qELPbNgxA
[youtube] 77qELPbNgxA: Downloading webpage
[youtube] 77qELPbNgxA: Downloading ios player API JSON
[youtube] 77qELPbNgxA: Downloading android player API JSON




[youtube] 77qELPbNgxA: Downloading m3u8 information
[info] 77qELPbNgxA: Downloading 1 format(s): 140
[download] /Users/bvahdat/Downloads/rag_from_scratch_audios/RAG from scratch： Part 6 (Query Translation -- RAG Fusion).m4a has already been downloaded
[download] 100% of    5.27MiB
[ExtractAudio] Not converting audio /Users/bvahdat/Downloads/rag_from_scratch_audios/RAG from scratch： Part 6 (Query Translation -- RAG Fusion).m4a; file is already in target format m4a
[youtube] Extracting URL: https://www.youtube.com/watch?v=h0OPWlEOank
[youtube] h0OPWlEOank: Downloading webpage
[youtube] h0OPWlEOank: Downloading ios player API JSON
[youtube] h0OPWlEOank: Downloading android player API JSON




[youtube] h0OPWlEOank: Downloading m3u8 information
[info] h0OPWlEOank: Downloading 1 format(s): 140
[download] /Users/bvahdat/Downloads/rag_from_scratch_audios/RAG from scratch： Part 7 (Query Translation -- Decomposition).m4a has already been downloaded
[download] 100% of    6.12MiB
[ExtractAudio] Not converting audio /Users/bvahdat/Downloads/rag_from_scratch_audios/RAG from scratch： Part 7 (Query Translation -- Decomposition).m4a; file is already in target format m4a
[youtube] Extracting URL: https://www.youtube.com/watch?v=xn1jEjRyJ2U
[youtube] xn1jEjRyJ2U: Downloading webpage
[youtube] xn1jEjRyJ2U: Downloading ios player API JSON
[youtube] xn1jEjRyJ2U: Downloading android player API JSON




[youtube] xn1jEjRyJ2U: Downloading m3u8 information
[info] xn1jEjRyJ2U: Downloading 1 format(s): 140
[download] /Users/bvahdat/Downloads/rag_from_scratch_audios/RAG from scratch： Part 8 (Query Translation -- Step Back).m4a has already been downloaded
[download] 100% of    6.45MiB
[ExtractAudio] Not converting audio /Users/bvahdat/Downloads/rag_from_scratch_audios/RAG from scratch： Part 8 (Query Translation -- Step Back).m4a; file is already in target format m4a
[youtube] Extracting URL: https://www.youtube.com/watch?v=SaDzIVkYqyY
[youtube] SaDzIVkYqyY: Downloading webpage
[youtube] SaDzIVkYqyY: Downloading ios player API JSON
[youtube] SaDzIVkYqyY: Downloading android player API JSON




[youtube] SaDzIVkYqyY: Downloading m3u8 information
[info] SaDzIVkYqyY: Downloading 1 format(s): 140
[download] /Users/bvahdat/Downloads/rag_from_scratch_audios/RAG from scratch： Part 9 (Query Translation -- HyDE).m4a has already been downloaded
[download] 100% of    4.42MiB
[ExtractAudio] Not converting audio /Users/bvahdat/Downloads/rag_from_scratch_audios/RAG from scratch： Part 9 (Query Translation -- HyDE).m4a; file is already in target format m4a
[youtube] Extracting URL: https://www.youtube.com/watch?v=pfpIndq7Fi8
[youtube] pfpIndq7Fi8: Downloading webpage
[youtube] pfpIndq7Fi8: Downloading ios player API JSON
[youtube] pfpIndq7Fi8: Downloading android player API JSON




[youtube] pfpIndq7Fi8: Downloading m3u8 information
[info] pfpIndq7Fi8: Downloading 1 format(s): 140
[download] /Users/bvahdat/Downloads/rag_from_scratch_audios/RAG from scratch： Part 10 (Routing).m4a has already been downloaded
[download] 100% of    6.52MiB
[ExtractAudio] Not converting audio /Users/bvahdat/Downloads/rag_from_scratch_audios/RAG from scratch： Part 10 (Routing).m4a; file is already in target format m4a
[youtube] Extracting URL: https://www.youtube.com/watch?v=kl6NwWYxvbM
[youtube] kl6NwWYxvbM: Downloading webpage
[youtube] kl6NwWYxvbM: Downloading ios player API JSON
[youtube] kl6NwWYxvbM: Downloading android player API JSON




[youtube] kl6NwWYxvbM: Downloading m3u8 information
[info] kl6NwWYxvbM: Downloading 1 format(s): 140
[download] /Users/bvahdat/Downloads/rag_from_scratch_audios/RAG from scratch： Part 11 (Query Structuring).m4a has already been downloaded
[download] 100% of    5.53MiB
[ExtractAudio] Not converting audio /Users/bvahdat/Downloads/rag_from_scratch_audios/RAG from scratch： Part 11 (Query Structuring).m4a; file is already in target format m4a
[youtube] Extracting URL: https://www.youtube.com/watch?v=gTCU9I6QqCE
[youtube] gTCU9I6QqCE: Downloading webpage
[youtube] gTCU9I6QqCE: Downloading ios player API JSON
[youtube] gTCU9I6QqCE: Downloading android player API JSON




[youtube] gTCU9I6QqCE: Downloading m3u8 information
[info] gTCU9I6QqCE: Downloading 1 format(s): 140
[download] /Users/bvahdat/Downloads/rag_from_scratch_audios/RAG from scratch： Part 12 (Multi-Representation Indexing).m4a has already been downloaded
[download] 100% of    6.09MiB
[ExtractAudio] Not converting audio /Users/bvahdat/Downloads/rag_from_scratch_audios/RAG from scratch： Part 12 (Multi-Representation Indexing).m4a; file is already in target format m4a
[youtube] Extracting URL: https://www.youtube.com/watch?v=z_6EeA2LDSw
[youtube] z_6EeA2LDSw: Downloading webpage
[youtube] z_6EeA2LDSw: Downloading ios player API JSON
[youtube] z_6EeA2LDSw: Downloading android player API JSON




[youtube] z_6EeA2LDSw: Downloading m3u8 information




[youtube] z_6EeA2LDSw: Downloading initial data API JSON
[info] z_6EeA2LDSw: Downloading 1 format(s): 140
[download] /Users/bvahdat/Downloads/rag_from_scratch_audios/RAG From Scratch： Part 13 (RAPTOR).m4a has already been downloaded
[download] 100% of    7.09MiB
[ExtractAudio] Not converting audio /Users/bvahdat/Downloads/rag_from_scratch_audios/RAG From Scratch： Part 13 (RAPTOR).m4a; file is already in target format m4a
[youtube] Extracting URL: https://www.youtube.com/watch?v=cN6S0Ehm7_8
[youtube] cN6S0Ehm7_8: Downloading webpage
[youtube] cN6S0Ehm7_8: Downloading ios player API JSON
[youtube] cN6S0Ehm7_8: Downloading android player API JSON




[youtube] cN6S0Ehm7_8: Downloading m3u8 information
[info] cN6S0Ehm7_8: Downloading 1 format(s): 140
[download] /Users/bvahdat/Downloads/rag_from_scratch_audios/RAG From Scratch： Part 14 (ColBERT).m4a has already been downloaded
[download] 100% of    6.67MiB
[ExtractAudio] Not converting audio /Users/bvahdat/Downloads/rag_from_scratch_audios/RAG From Scratch： Part 14 (ColBERT).m4a; file is already in target format m4a
Transcribing part 1!
Transcribing part 1!
Transcribing part 1!
Transcribing part 1!
Transcribing part 1!
Transcribing part 1!
Transcribing part 1!
Transcribing part 1!
Transcribing part 1!
Transcribing part 1!
Transcribing part 1!
Transcribing part 1!
Transcribing part 1!
Transcribing part 1!
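
The transcripts are not guaranteed to come back in playlist order. One possible way to restore the order is to sort them by the "Part N" number that appears in the audio file names in the log above. This is only a sketch: it assumes each transcript's `metadata['source']` holds the path of its audio file, and `part_number` and `sort_by_part` are hypothetical helpers. The `SimpleNamespace` objects merely stand in for the loaded documents.

```python
import re
from pathlib import Path
from types import SimpleNamespace


def part_number(source: str) -> int:
    """Extract N from a file name such as '... Part 13 (RAPTOR).m4a'."""
    match = re.search(r'Part (\d+)', Path(source).name)
    if match is None:
        raise ValueError(f'no part number found in {source!r}')
    return int(match.group(1))


def sort_by_part(documents):
    """Sort documents by part number; sorted() is stable, so ties keep their order."""
    return sorted(documents, key=lambda doc: part_number(doc.metadata['source']))


# Stand-ins for loaded documents (assumption: metadata carries the file path).
demo_docs = [
    SimpleNamespace(metadata={'source': 'RAG from scratch： Part 10 (Routing).m4a'}),
    SimpleNamespace(metadata={'source': 'RAG From Scratch： Part 2 (Indexing).m4a'}),
]
demo_sorted = sort_by_part(demo_docs)
print([part_number(doc.metadata['source']) for doc in demo_sorted])
```

Under the same assumption, `docs = sort_by_part(docs)` would reorder the real transcripts before they are summarized.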


### Use the MapReduce approach for transcript summarization:

In [5]:
# Prompt using the MapReduce approach
map_prompt_template = '''
                      Write a summary of this chunk of text that includes the main points and any important details.
                      {text}
                      '''

map_prompt = PromptTemplate(template=map_prompt_template, input_variables=['text'])

combine_prompt_template = '''
                      Write a concise summary of the following text delimited by triple backquotes.
                      Return your response in bullet points which covers the key points of the text.
                      ```{text}```
                      BULLET POINT SUMMARY:
                      '''

combine_prompt = PromptTemplate(template=combine_prompt_template, input_variables=['text'])

# LLM (unfortunately, setting batch_size to 1 to preserve the video order is not supported)
# see https://github.com/langchain-ai/langchain/issues/2465
llm = ChatOpenAI(model_name='gpt-4-turbo', temperature=0)

# Chain
map_reduce_chain = load_summarize_chain(
    llm,
    chain_type='map_reduce',
    map_prompt=map_prompt,
    combine_prompt=combine_prompt,
    return_intermediate_steps=True,
)

# Formatting
def output_content(content, title_prefix):
    """Print each item (a Document or a plain string) under a numbered title."""
    for i, doc in enumerate(content):
        print(f'{title_prefix} Nr. {i+1}:')
        print(doc.page_content if isinstance(doc, Document) else doc)
        print('\n\n\n')
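
Conceptually, the `map_reduce` chain type summarizes every document on its own (the map step) and then condenses those partial summaries into one final summary (the reduce step). The sketch below illustrates only this idea in plain Python. It is not LangChain's implementation, which among other things may collapse the partial summaries further when they exceed the model's context. `first_sentence` and `as_bullets` are toy stand-ins for the two LLM calls driven by `map_prompt` and `combine_prompt`.

```python
from typing import Callable, List, Tuple


def map_reduce_summarize(
    texts: List[str],
    summarize: Callable[[str], str],
    combine: Callable[[str], str],
) -> Tuple[str, List[str]]:
    """Summarize each text separately, then combine the partial summaries."""
    # Map step: one independent summary per input text.
    partial_summaries = [summarize(text) for text in texts]
    # Reduce step: a single summary over all partial summaries.
    final_summary = combine('\n\n'.join(partial_summaries))
    return final_summary, partial_summaries


def first_sentence(text: str) -> str:
    """Toy stand-in for the map LLM call: keep only the first sentence."""
    return text.split('.')[0].strip() + '.'


def as_bullets(text: str) -> str:
    """Toy stand-in for the combine LLM call: one bullet per partial summary."""
    return '\n'.join(f'- {part}' for part in text.split('\n\n'))


final, partials = map_reduce_summarize(
    ['Indexing splits documents. More detail.', 'Retrieval finds chunks. More detail.'],
    first_sentence,
    as_bullets,
)
print(final)
```

The two return values play the same roles as `output_text` and `intermediate_steps` in the chain's result when `return_intermediate_steps=True`.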

In [6]:
# Calling the chain directly is deprecated, so use invoke() instead
map_reduce_outputs = map_reduce_chain.invoke({'input_documents': docs})


### Short summary across all the videos:

In [7]:
print(map_reduce_outputs['output_text'])

- Lance from LangChain explores query translation techniques in the "RAG from Scratch" series to enhance information retrieval in Retrieval Augmentation Graphs (RAGs).
- Key strategies for query translation include:
  1. Rewriting Questions: Modifying the original question to explore different perspectives.
  2. Sub-Question Breakdown: Splitting complex questions into simpler ones, solving them separately, and combining the results.
  3. Step-Back Prompting: Creating more abstract questions from specific ones to improve document retrieval.
- The series covers advanced topics such as using metadata filters for structured queries, the Raptor technique for hierarchical indexing, and the HIDE technique for transforming questions into hypothetical documents.
- Techniques like logical and semantic routing are discussed to direct modified questions to the right data sources.
- Multi-representation indexing is introduced, where documents are distilled into propositions using a Large Language M

### Summary of each individual video:

In [8]:
# the video order is not preserved because the batch_size parameter is not supported (see the LLM comment above)
output_content(map_reduce_outputs['intermediate_steps'], 'Summarization text of the video')

Summarization text of the video Nr. 1:
This text is a transcript from Lance at LangChain, discussing the concept of step-back prompting in the context of query translation for retrieval augmentation graphs (RAGs). The main focus is on improving the retrieval process by modifying input questions to enhance the relevance and comprehensiveness of the retrieved information.

The video explores different strategies for query translation:
1. **Rewriting Questions**: This involves modifying the original question to capture various perspectives, potentially improving the retrieval process. Techniques like RAG fusion and multi-query are examples of this approach.
2. **Sub-Question Breakdown**: This method involves decomposing a complex question into simpler sub-questions, solving each independently, and then consolidating the answers.
3. **Step-Back Prompting**: Introduced by Google, this technique involves formulating more abstract questions from the specific ones. It uses few-shot prompting t

### Original transcript of each individual video:

In [9]:
# the video order is not preserved because the batch_size parameter is not supported (see the LLM comment above)
output_content(map_reduce_outputs['input_documents'], 'Original transcript of the video')

Original transcript of the video Nr. 1:
Hi, this is Lance from LangChain. This is the fourth video in our deep dive on query translation in the RAG from Scratch series, and we're going to be focused on step-back prompting. So query translation, as we said in some of the prior videos, kind of sits at the first stage of a RAG pipeline or flow, and the main aim is to take an input question and to translate it or modify it in such a way that it improves retrieval. Now, we talked through a few different ways to approach this problem. So one general approach involves rewriting a question, and we talked about two ways to do that, RAG fusion, multi-query, and again, this is really about taking a question and modifying it to capture a few different perspectives, which may improve the retrieval process. Now, another approach is to take a question and kind of make it less abstract, like break it down into sub-questions and then solve each of those independently. So that's what we saw with least t