# RAG using YouTube subtitles
https://python.langchain.com/docs/use_cases/question_answering/quickstart#retrieval-and-generation-retrieve

### Indexing
1. **Load**: First we need to load the data using DocumentLoaders.
2. **Split**: A text splitter breaks large Documents into smaller chunks. This is useful both for indexing data and for passing it to a model, since large chunks are harder to search over and won’t fit in a model’s finite context window.
3. **Store**: A vector store stores and indexes the document splits so that they can be searched later.

### Retrieval and generation
1. **Retrieve**: Given a user input, relevant splits are retrieved from storage using a Retriever.
2. **Generate**: A ChatModel / LLM produces an answer using a prompt that includes the question and the retrieved data.


### Dependencies
1. **OpenAI chat model and embeddings** 
2. **Chroma vector store** 

#### Packages (Already installed)
```
%pip install --upgrade --quiet  langchain langchain-community langchainhub langchain-openai chromadb
```


In [1]:
from langchain import hub
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Chroma
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

###  Load data
##### Use **YoutubeLoader** to load the subtitles of the video 

https://www.youtube.com/watch?v=_v_ZAtc06Jk

In [2]:
! pip install youtube_transcript_api pytube


In [3]:
from langchain_community.document_loaders import YoutubeLoader

loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=_v_ZAtc06Jk",
    add_video_info=True,
    language=["en", "id"],
    translation="en",
)
subtitles_data = loader.load()
len(subtitles_data[0].page_content)

2655


### Split the data 
Use the ```RecursiveCharacterTextSplitter```. Set ```add_start_index=True``` so that the character index at which each split Document starts within the original Document is preserved in the metadata attribute ```start_index```.

In [4]:

from langchain.text_splitter import RecursiveCharacterTextSplitter

## Compare the difference created by chunk overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100, add_start_index=True
)
all_splits = text_splitter.split_documents(subtitles_data)


In [5]:
len(all_splits)

3

In [6]:
len(all_splits[0].page_content)

998

In [7]:
all_splits[0].metadata
#print(all_splits[0].page_content)

{'source': '_v_ZAtc06Jk',
 'title': 'German TV news',
 'description': 'Unknown',
 'view_count': 389077,
 'thumbnail_url': 'https://i.ytimg.com/vi/_v_ZAtc06Jk/hqdefault.jpg?sqp=-oaymwEXCJADEOABSFryq4qpAwkIARUAAIhCGAE=&rs=AOn4CLALPil2gGZCbqZ64EoExtig7j-dzw',
 'publish_date': '2013-02-03 00:00:00',
 'length': 199,
 'author': 'newsglotzer',
 'start_index': 0}
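
The roles of `chunk_size`, `chunk_overlap`, and `start_index` can be illustrated with a simplified fixed-window splitter. This is only a stdlib sketch, not the `RecursiveCharacterTextSplitter` algorithm: the real splitter prefers to break on separators such as paragraphs and spaces, which is probably why the first chunk above has 998 characters rather than exactly 1000. The function name `split_with_overlap` and the dummy text are assumptions made for illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into fixed-size windows that overlap by chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # distance between consecutive chunk starts
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({
            "page_content": text[start:start + chunk_size],
            "metadata": {"start_index": start},  # mirrors add_start_index=True
        })
        if start + chunk_size >= len(text):  # this window already reached the end
            break
    return chunks


# Dummy stand-in for a 2655-character transcript (not the real subtitles)
text = "".join(str(i % 10) for i in range(2655))
chunks = split_with_overlap(text, chunk_size=1000, chunk_overlap=100)
for c in chunks:
    print(c["metadata"]["start_index"], len(c["page_content"]))
```

With these settings, each chunk starts 900 characters after the previous one, so the last 100 characters of one chunk are repeated at the start of the next. That repetition keeps sentences that straddle a boundary retrievable from either side.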

### Store - Embed and store document splits using the Chroma vector store and OpenAIEmbeddings model.

In [8]:
# create simple ids
ids = [str(i) for i in range(1, len(all_splits) + 1)]
vectorstore = Chroma.from_documents(documents=all_splits, embedding=OpenAIEmbeddings(), ids=ids)

### Retrieval
LangChain defines a ``Retriever`` interface which wraps an index that can return relevant Documents given a string query.
The most common type of Retriever is the ```VectorStoreRetriever```, which uses the ``similarity`` search capabilities of a vector store to facilitate retrieval.
Any VectorStore can easily be turned into a ```Retriever``` with ```VectorStore.as_retriever()```:

In [9]:
retriever = vectorstore.as_retriever(search_kwargs={"k": 1})
docs = retriever.invoke("Who are we talking about?")
docs

[Document(page_content='You see the first german television with the "tagesschau" (view of the day) Ladies and gentlemen, welcome to the "tagesschau" At the Munich Security Conference, the United States and Russia have insist on their different positions about the syria-conflict. US-vice president Biden manifested the Syrian regime for ruined Russian Foreign Minister Sergey Lavrov on the other hand, made clear that his government further stand adhere to Assad. However Lavrov quoth first time ever with the leader of the Syrian opposition Then he agreed to meet regularly On the verge of the security conference german defence Minister de Maizière announced that about 40 German soldiers at the planned eu mission would contribute to Mali The germans have to educate pioneers which have to deactivate booby traps The german Bundestag (parliament) is expected to decide later this month on the terms, The application have to start at the begin of march. first time since the start of the military 
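
Conceptually, a vector store retriever embeds the query and returns the `k` stored chunks whose embeddings are most similar to it. The sketch below mimics that idea with toy word-count vectors and cosine similarity instead of OpenAI embeddings. The helper names `embed`, `cosine`, and `retrieve` and the sample sentences are hypothetical, and Chroma's actual distance metric and index structure may differ.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a learned embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query (fewer if the index is smaller)."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


# Hypothetical mini index, loosely modelled on the topics of the video
chunks = [
    "The United States and Russia disagree about the Syria conflict.",
    "German soldiers will contribute to the EU mission in Mali.",
    "The weather forecast predicts rain and wind.",
]
top = retrieve("What is the weather forecast?", chunks, k=1)
print(top)
```

A real embedding model captures meaning rather than exact word overlap, so a vague query such as "Who are we talking about?" can still match a relevant chunk even when it shares few words with it.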

### Generation
 **Using OpenAI chat model**

In [10]:
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

#### Pull RAG prompt from the langchain hub

In [11]:

prompt = hub.pull("rlm/rag-prompt")
print(prompt)

input_variables=['context', 'question'] messages=[HumanMessagePromptTemplate(prompt=PromptTemplate(input_variables=['context', 'question'], template="You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.\nQuestion: {question} \nContext: {context} \nAnswer:"))]
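
The printed template has two placeholders, `{context}` and `{question}`. As a rough illustration of what happens before the model is called, the sketch below fills the same template text with plain `str.format`. The helper `build_prompt` is hypothetical, and joining the retrieved chunks with blank lines is an assumption rather than something the output above confirms.

```python
# Template text copied from the printed "rlm/rag-prompt" output above
RAG_TEMPLATE = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer the question. "
    "If you don't know the answer, just say that you don't know. "
    "Use three sentences maximum and keep the answer concise.\n"
    "Question: {question} \nContext: {context} \nAnswer:"
)


def build_prompt(question, retrieved_chunks):
    """Join the retrieved chunks into one context string and fill the template."""
    context = "\n\n".join(retrieved_chunks)  # assumed separator between chunks
    return RAG_TEMPLATE.format(question=question, context=context)


prompt_text = build_prompt("What is the news about?", ["chunk one", "chunk two"])
print(prompt_text)
```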



#### You can also define a custom template

In [12]:

# from langchain_core.prompts import PromptTemplate
#
# template = """
#         If you don't know the answer, just say that you don't know.
#         Don't try to make up an answer.
#         {context}
#
#         Question: {question}
#         Answer:
#         """
# prompt_template = PromptTemplate(
#         template=template,
#         input_variables=["context", "question"],
#     )

### Initialize RetrievalQA Chain
```RetrievalQA``` combines a Retriever with a question-answering chain: it retrieves documents from the Retriever and then passes them, together with the question, to the LLM to produce an answer.

In [13]:
from langchain.chains import RetrievalQA

chain = RetrievalQA.from_chain_type(
        llm=llm,
        retriever=vectorstore.as_retriever(),
        return_source_documents=False,
        chain_type_kwargs={"prompt": prompt}
    )
chain.invoke({"query":"What is the news about?"})["result"]

Number of requested results 4 is greater than number of elements in index 3, updating n_results = 3


"The news is about the Munich Security Conference and the differing positions of the United States and Russia on the Syria conflict, the German soldiers contributing to the EU mission in Mali, French President Hollande's visit to Mali, the attack on the US embassy in Ankara, Spanish Prime Minister Rajoy's response to the slush funds allegations, the Hamburger SV's loss in the German football league, and the weather forecast."

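The warning in the output above appears because `vectorstore.as_retriever()` was called without `search_kwargs`, so it requested 4 results while the index holds only 3 chunks. The sketch below shows, in very simplified form, the kind of retrieve-then-generate flow such a chain performs. The names `answer`, `fake_retriever`, and `fake_llm` are hypothetical stand-ins so that the example runs without an API key; this is not LangChain's implementation.

```python
def answer(question, retriever, llm, template, k=4):
    """Retrieve up to k chunks, place them in the prompt, and ask the model."""
    docs = retriever(question)[:k]  # fewer than k if the index is small
    context = "\n\n".join(docs)
    prompt_text = template.format(context=context, question=question)
    return {"query": question, "result": llm(prompt_text)}


# Hypothetical stand-ins for the vector store retriever and the chat model
index = ["chunk A", "chunk B", "chunk C"]


def fake_retriever(question):
    return list(index)  # a real retriever would rank chunks by similarity


def fake_llm(prompt_text):
    return f"(model answer based on {prompt_text.count('chunk')} chunks)"


out = answer(
    "What is the news about?",
    fake_retriever,
    fake_llm,
    "Context: {context}\nQuestion: {question}\nAnswer:",
)
print(out["result"])
```

Passing `search_kwargs={"k": 3}` (or a smaller value) to `as_retriever()`, as in the earlier retrieval cell, should avoid the warning for an index of this size.
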
### Create a chat bot using gradio

In [14]:
import gradio as gr

# Example questions shown as clickable suggestions in the chat UI
examples = [
    "What is this video about?",
    "Who is being talked about?",
    "Summarize top 5 events in bullet points",
]


def answer_question(message, history):
    # Run the RetrievalQA chain on the user's message; the chat history is not used
    return str(chain.invoke({"query": message})["result"])


demo = gr.ChatInterface(
    answer_question,
    title="My chat bot",
    theme=gr.themes.Soft(),
    examples=examples,
).queue()
demo.launch(debug=True, share=False)



Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.


Number of requested results 4 is greater than number of elements in index 3, updating n_results = 3


Keyboard interruption in main thread... closing server.




In [None]:
## DB operations

#retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})
#retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"filter":{"start_index":1620}})
# retriever = vectorstore.as_retriever(search_kwargs={"k": 1})
# docs = retriever.invoke("Who are we talking about?")
# docs

# docs[0].metadata["author"] = "Varun"

# vectorstore.update_document(ids[0], docs[0])
#print(vectorstore._collection.get(ids=[ids[0]]))

# # delete the last document
#print("count before", vectorstore._collection.count())
# vectorstore._collection.delete(ids=[ids[-1]])
# print("count after", vectorstore._collection.count())