Retrieval-augmented generation (RAG)

In [1]:
!pip install --upgrade --quiet langchain langchain-community langchain-openai faiss-cpu tiktoken
!pip install youtube-transcript-api  pytube --quiet
!pip install python-dotenv --quiet
!pip install sentence-transformers --quiet


In [32]:
import os
from dotenv import load_dotenv
load_dotenv()

import openai
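# Create a .env file that defines OPENAI_API_KEY=<your API token>.
# The models below are served through openrouter.ai, so this should be an OpenRouter key.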
openai.api_key = os.environ["OPENAI_API_KEY"]

In [33]:
from operator import itemgetter
from langchain.text_splitter import RecursiveCharacterTextSplitter

from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

In [34]:
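# Use the langchain_openai package; the langchain.llms / langchain.chat_models paths are deprecated.
from langchain_openai import ChatOpenAI, OpenAI

# Completion-style client (not used by the chains below).
model = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    model="gryphe/mythomist-7b")
# Chat-style client used by the RAG chains.
llm = ChatOpenAI(
    base_url="https://openrouter.ai/api/v1",
    model="gryphe/mythomist-7b")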

In [35]:
llm.model_name

'gryphe/mythomist-7b'

In [36]:
from langchain_community.document_loaders import YoutubeLoader, WebBaseLoader

# Alternative source: load a web page instead of a video transcript.
# loader = WebBaseLoader("https://www.history.com/topics/ancient-rome/ancient-rome")
loader = YoutubeLoader.from_youtube_url(
    "https://www.youtube.com/watch?v=3STDzWv7zLU", add_video_info=True
)
# Split the transcript into chunks of at most 500 characters with no overlap.
text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(loader.load())

In [37]:
splits[0]

Document(page_content="That's the beauty of a word, isn't it? This is KND featuring. My name is Bick and\xa0\nwelcome to the show. Today, we have the pleasure \nand honor to talk to someone amazing. He's a politician, but not just any politician, a party leader. He's a father and also someone who a lot of\xa0\npeople of younger generation look up to and consider him as an idol. And of course,\n he speaks\xa0English fluently. Welcome to the show! คุณทิม สวัสดีครับ Thank you for having me. \nIt's good to\xa0be here. - Thank you so much", metadata={'source': '3STDzWv7zLU', 'title': 'คุยภาษาอังกฤษกับ ทิม พิธา ลิ้มเจริญรัตน์ หัวหน้าพรรคก้าวไกล | คำนี้ดี EP.1033', 'description': 'Unknown', 'view_count': 3552351, 'thumbnail_url': 'https://i.ytimg.com/vi/3STDzWv7zLU/hq720.jpg', 'publish_date': '2023-01-25 00:00:00', 'length': 3191, 'author': 'คำนี้ดี (EP.643-ล่าสุด)'})
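
The splitter above is configured with `chunk_size=500` and `chunk_overlap=0`. As a rough illustration of what those two parameters mean, here is a minimal fixed-window sketch. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as newlines), and `chunk_text` is a hypothetical helper introduced only for this example.

```python
def chunk_text(text, chunk_size, chunk_overlap=0):
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghij"
no_overlap = chunk_text(sample, chunk_size=4)
with_overlap = chunk_text(sample, chunk_size=4, chunk_overlap=2)
print(no_overlap)
print(with_overlap)
```

With no overlap, a sentence that straddles a chunk boundary is cut in two; a small overlap keeps some shared context between neighbouring chunks at the cost of storing more text.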

In [38]:
from langchain_community.embeddings import HuggingFaceEmbeddings

# Local sentence-transformers model used in place of OpenAI embeddings
model_id = 'sentence-transformers/all-MiniLM-L12-v2'
# model_id = 'sentence-transformers/all-mpnet-base-v2'

model_kwargs = {'device': 'cpu'}
hf_embedding = HuggingFaceEmbeddings(
    model_name=model_id,
    model_kwargs=model_kwargs
)


In [39]:
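# Embed the chunks and index them in a FAISS vector store
vectorstore = FAISS.from_documents(documents=splits, embedding=hf_embedding)
# k and lambda_mult only take effect through search_kwargs, and lambda_mult applies to MMR search
retriever = vectorstore.as_retriever(
    search_type="mmr", search_kwargs={"k": 10, "lambda_mult": 0.25}
)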
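
The `lambda_mult` value passed to the retriever is a maximal marginal relevance (MMR) parameter: it trades relevance to the query against diversity among the selected chunks. The sketch below shows the idea on toy 2-D vectors in plain Python. It is an illustration under simplifying assumptions, not LangChain's or FAISS's implementation, and `cosine` and `mmr_select` are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k, lambda_mult):
    """Return the indices of k documents chosen greedily by MMR."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already picked (0 if nothing picked yet).
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy "embeddings": documents 0 and 1 are near-duplicates, document 2 is different.
query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]

relevance_only = mmr_select(query, docs, k=2, lambda_mult=1.0)
diverse = mmr_select(query, docs, k=2, lambda_mult=0.25)
print(relevance_only, diverse)
```

With `lambda_mult=1.0` the two near-duplicates are returned; with `lambda_mult=0.25` the second pick is the dissimilar document, which is the behaviour a low `lambda_mult` is meant to encourage.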

In [40]:
template = """Read the following pieces of Youtube video transcript for answering the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Please answer in markdown.

### context:
{context}

### question:
{question}

### answer:
"""
prompt = ChatPromptTemplate.from_template(template)

chain2 = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
)
print(chain2.invoke("Please list all topics mentioned in the video.").content)

 The topics mentioned in the video are Youtube video transcript, discussing a world with COVID-19 and geopolitical hot spots, applying new ways of doing things, critical thinking, communication, persistence, exploring examples of good and bad responses, a forum and town hall at a school, JFK School of Government, education system and its changes, student-centric and adult-centric approach, KPIs in education, and leadership styles in the Thai political
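
In `chain2` the retriever's output, a list of `Document` objects, is passed directly into `{context}`, so the prompt receives their full representation including metadata. A common pattern is to join only the chunk text before it reaches the prompt. The following is a minimal sketch; `format_docs` is a hypothetical helper, and `SimpleNamespace` objects stand in for real retrieved documents so the example runs on its own.

```python
from types import SimpleNamespace


def format_docs(docs):
    """Join the text of retrieved chunks, dropping metadata, separated by blank lines."""
    return "\n\n".join(doc.page_content.strip() for doc in docs)


# Stand-ins for retrieved Document objects (only page_content is used here).
fake_docs = [
    SimpleNamespace(page_content="Welcome to the show. \n", metadata={"source": "3STDzWv7zLU"}),
    SimpleNamespace(page_content="Thank you for having me.", metadata={"source": "3STDzWv7zLU"}),
]

context = format_docs(fake_docs)
print(context)
```

In the notebook this could plausibly be wired in as `{"context": retriever | format_docs, "question": RunnablePassthrough()}`, which keeps the prompt shorter and may help a small model stay within its output budget.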


In [41]:

template = """Answer the question based only on the following context:
{context}

Question: {question}

Answer in the following language: {language}
"""
prompt = ChatPromptTemplate.from_template(template)

chain3 = (
    {
        "context": itemgetter("question") | retriever,
        "question": itemgetter("question"),
        "language": itemgetter("language"),
    }
    | prompt
    | llm
    | StrOutputParser()
)
print(chain3.invoke({"question": "What is the purpose of the video?", "language": "english"}))

 The purpose of the video seems to be a conversation or interview in which Tim Phitha Lim-Ecorn, the Head of the Move Forward Party, discusses various topics related to his political journey and experiences, along with his perspectives on government, leadership, and the country's future. This conversation aims to provide insights into his thoughts and motivations, as well as possibly addressing concerns about the party's vision and goals. Additionally, the video likely showcases his public speaking
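
The dictionary at the head of `chain3` sends the same input dict to every entry and collects the results under the matching keys; `itemgetter` picks out the field each entry needs. The plain-Python sketch below mimics that routing. `run_parallel` and `fake_retriever` are hypothetical stand-ins for illustration, not LangChain APIs.

```python
from operator import itemgetter


def run_parallel(mapping, inputs):
    """Call every function in mapping on the same inputs and collect results by key."""
    return {key: fn(inputs) for key, fn in mapping.items()}


def fake_retriever(question):
    # Hypothetical stand-in for the real retriever.
    return f"<chunks relevant to: {question}>"


mapping = {
    # Equivalent in spirit to itemgetter("question") | retriever
    "context": lambda x: fake_retriever(itemgetter("question")(x)),
    "question": itemgetter("question"),
    "language": itemgetter("language"),
}

result = run_parallel(
    mapping, {"question": "What is the purpose of the video?", "language": "english"}
)
print(result)
```

The resulting dict has exactly the three keys the prompt template expects: `context`, `question`, and `language`.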
