# RAG

## Requirements

In [1]:
%%capture
!pip install transformers accelerate bitsandbytes langchain langchain-community langchain-huggingface sentence-transformers faiss-gpu pandas gdown

## Dataset

In [2]:
!gdown --fuzzy https://drive.google.com/file/d/1Lq2zVJlN_B4kUAu4VafQ4jXMIQiAR9vI/view?usp=sharing

Downloading...
From (original): https://drive.google.com/uc?id=1Lq2zVJlN_B4kUAu4VafQ4jXMIQiAR9vI
From (redirected): https://drive.google.com/uc?id=1Lq2zVJlN_B4kUAu4VafQ4jXMIQiAR9vI&confirm=t&uuid=a5ed664f-080a-4da5-8471-de9882e7ff09
To: /content/IMDB_crawled.json
100% 292M/292M [00:03<00:00, 83.9MB/s]


## Config

In [3]:
class Config:
    EMBEDDING_MODEL_NAME="thenlper/gte-base"
    LLM_MODEL_NAME="HuggingFaceH4/zephyr-7b-beta"
    K = 5 # top K retrieval

## Preprocessing

In [4]:
import pandas as pd

try:
    df = pd.read_json('IMDB_crawled.json')
except FileNotFoundError:
    df = pd.read_json('../../../crawled_data/IMDB_crawled_01.json')

In [5]:
df.head()

Unnamed: 0,id,title,first_page_summary,release_year,mpaa,budget,gross_worldwide,rating,directors,writers,stars,related_links,languages,countries_of_origin,summaries,synposis,reviews,genres
0,tt0071562,The Godfather Part II,The early life and career of Vito Corleone in ...,1974,R,"$13,000,000 (estimated)","$47,962,683",9.0,[Francis Ford Coppola],,"[Al Pacino, Robert De Niro, Robert Duvall]",[https://imdb.com/title/tt0068646/?ref_=tt_sim...,"[English, Italian, Spanish, Latin, Sicilian]",[United States],[The early life and career of Vito Corleone in...,[The Godfather Part II presents two parallel s...,"[[Coppola's masterpiece is rivaled only by ""Th...","[Crime, Drama]"
1,tt0120737,The Lord of the Rings: The Fellowship of the Ring,A meek Hobbit from the Shire and eight compani...,2001,PG-13,"$93,000,000 (estimated)","$884,041,698",8.9,[Peter Jackson],,"[Elijah Wood, Ian McKellen, Orlando Bloom]",[https://imdb.com/title/tt0167261/?ref_=tt_sim...,"[English, Sindarin]","[New Zealand, United States]",[A meek Hobbit from the Shire and eight compan...,[Galadriel (Cate Blanchett) (The Elven co-rule...,"[[Here is one film that lived up to its hype, ...","[Action, Adventure, Drama]"
2,tt0110912,Pulp Fiction,"The lives of two mob hitmen, a boxer, a gangst...",1994,R,"$8,000,000 (estimated)","$213,928,762",8.9,[Quentin Tarantino],,"[John Travolta, Uma Thurman, Samuel L. Jackson]",[https://imdb.com/title/tt0137523/?ref_=tt_sim...,"[English, Spanish, French]",[United States],"[The lives of two mob hitmen, a boxer, a gangs...",[Narrative structure\nPulp Fiction's narrative...,[[I like the bit with the cheeseburger. It mak...,"[Crime, Drama]"
3,tt0068646,The Godfather,The aging patriarch of an organized crime dyna...,1972,R,"$6,000,000 (estimated)","$250,342,030",9.2,[Francis Ford Coppola],,"[Marlon Brando, Al Pacino, James Caan]",[https://imdb.com/title/tt0071562/?ref_=tt_sim...,"[English, Italian, Latin]",[United States],[The aging patriarch of an organized crime dyn...,"[In late summer 1945, guests are gathered for ...",[['The Godfather' is the pinnacle of flawless ...,"[Crime, Drama]"
4,tt0111161,The Shawshank Redemption,"Over the course of several years, two convicts...",1994,R,"$25,000,000 (estimated)","$28,904,232",9.3,[Frank Darabont],"[Stephen King, Frank Darabont]","[Tim Robbins, Morgan Freeman, Bob Gunton]",[https://imdb.com/title/tt0468569/?ref_=tt_sim...,[English],[United States],"[Over the course of several years, two convict...","[In 1947, Andy Dufresne (Tim Robbins), a banke...",[[The Shawshank Redemption is written and dire...,[Drama]


In [6]:
import os

os.makedirs('data', exist_ok=True)

# keep only the columns needed for retrieval, since the embedding model's context window is limited

df = df[['id', 'title', 'first_page_summary', 'release_year', 'mpaa', 'genres', 'rating', 'directors', 'stars', 'languages']]

df.to_csv('data/imdb.csv', index=False)

## Vectorizer

Load the CSV file and vectorize the rows using HuggingFaceEmbeddings.
Store the results in a FAISS vector store.
Optionally, save the vector store to a pickle file for future use (this step is commented out in the cell below).

In [7]:
import pickle

from langchain_community.document_loaders.csv_loader import CSVLoader
from langchain_community.vectorstores.utils import DistanceStrategy
from langchain_community.vectorstores.faiss import FAISS

from langchain_huggingface import HuggingFaceEmbeddings

# load the csv
loader = CSVLoader(file_path='data/imdb.csv')
documents = loader.load()

# load the embeddings model
embeddings = HuggingFaceEmbeddings(model_name=Config.EMBEDDING_MODEL_NAME)

# embed the documents with the model and store them in a vectorstore
vectorstore = FAISS.from_documents(documents, embeddings, distance_strategy=DistanceStrategy.COSINE)

# with open("data/vectorstore.pkl", "wb") as f:
#     pickle.dump(vectorstore, f)
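
To make the loading step more concrete, here is a rough stand-in for what `CSVLoader` produces, written with only the standard library. It assumes the loader's default behaviour of one document per row, with each column rendered as a `column: value` line, which matches the `id: ...` / `title: ...` layout visible in the retrieved context later in this notebook. `row_to_text` and `rows_to_texts` are hypothetical helper names, not LangChain functions.

```python
import csv
import io


def row_to_text(row: dict) -> str:
    """Render one CSV row as newline-separated 'column: value' pairs."""
    return "\n".join(f"{key.strip()}: {value.strip()}" for key, value in row.items())


def rows_to_texts(csv_text: str) -> list[str]:
    """Turn CSV text into one text block per row (a stand-in for CSVLoader documents)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row_to_text(row) for row in reader]


# Toy data using two of the columns kept in data/imdb.csv.
sample_csv = "id,title\ntt0068646,The Godfather\ntt0110912,Pulp Fiction\n"
texts = rows_to_texts(sample_csv)
print(texts[0])
```

Each of these text blocks is what the embedding model turns into a vector, so every column kept during preprocessing adds to the text that has to fit in the model's context window.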




Load the vector store as a retriever.

In [8]:
# with open("data/vectorstore.pkl", "rb") as f:
#     vectorstore = pickle.load(f)

# load the retriever from the vectorstore
retriever = vectorstore.as_retriever(search_kwargs={"k": Config.K})
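
As an illustration of what top-K retrieval means, the sketch below ranks a few toy vectors by cosine similarity to a query vector and keeps the best `k`. It is a plain-Python approximation of the idea only: FAISS uses an optimized index rather than a loop, and `cosine_similarity`, `top_k`, and the toy vectors are assumptions made up for this example.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], doc_vectors: list[list[float]], k: int) -> list[int]:
    """Return the indices of the k vectors most similar to the query, best first."""
    scored = [(cosine_similarity(query, vec), idx) for idx, vec in enumerate(doc_vectors)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Toy 2-dimensional "embeddings"; real gte-base vectors are much larger.
toy_docs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
toy_query = [1.0, 0.0]
best = top_k(toy_query, toy_docs, k=2)
print(best)
```

With `Config.K = 5`, the retriever returns the five best-ranked rows for each query in the same spirit.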


## LLM

Load the 4-bit quantized LLM.

In [9]:
import torch

from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from transformers import pipeline

from langchain_huggingface import HuggingFacePipeline

# load the quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(Config.LLM_MODEL_NAME, quantization_config=bnb_config, device_map="cuda:0")
tokenizer = AutoTokenizer.from_pretrained(Config.LLM_MODEL_NAME)

# init the pipeline
READER_LLM = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.95,
    repetition_penalty=1.15,
    do_sample=True,
)

llm = HuggingFacePipeline(
    pipeline=READER_LLM,
)
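
The reason for loading the model in 4-bit can be seen with some back-of-the-envelope arithmetic. The sketch below estimates weight storage only, under the assumption of roughly 7.2 billion parameters for a 7B-class model; it ignores quantization metadata, activations, and the KV cache, so real GPU usage will be higher. `weight_memory_gb` and `APPROX_PARAMS` are hypothetical names introduced for this estimate.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


APPROX_PARAMS = 7.2e9  # assumed rough parameter count for a 7B-class model

bf16_gb = weight_memory_gb(APPROX_PARAMS, 16)  # unquantized 16-bit weights
nf4_gb = weight_memory_gb(APPROX_PARAMS, 4)    # 4-bit weights, metadata ignored
print(f"bf16: ~{bf16_gb:.1f} GB, 4-bit: ~{nf4_gb:.1f} GB")
```

Under these assumptions the 4-bit weights take about a quarter of the 16-bit footprint, which is what makes a single consumer or Colab GPU a plausible target.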




Initialize the prompt template for the query chain. The query chain derives a search query from the chat history. You may change the prompt to get better results.

In [30]:
from langchain.prompts import PromptTemplate

from langchain_core.output_parsers import StrOutputParser

class LoggerStrOutputParser(StrOutputParser):
    def parse(self, text: str) -> str:
        # process the LLM output
        print(f"QUERY: {text}\n\n")
        return super().parse(text)

query_transform_prompt = PromptTemplate(
    input_variables=["messages"],
    template="""<|system|>You are a movie database assistant. Your job is to extract the core search terms from the conversation for finding optimal movie recommendations.
{messages}
<|user|>
Give me the search keywords derived from the conversation above.
<|assistant|>"""
)

# init the query chain
query_transforming_retriever_chain = (
    query_transform_prompt
    | llm
    | LoggerStrOutputParser()
    | retriever
)
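
The logged `QUERY` output in the test cells shows that the text handed to the retriever can contain the whole prompt, not just the generated keywords. Whether the prompt is echoed depends on the installed `transformers` and LangChain versions, so the following is only a sketch of one way to guard against it: a hypothetical `strip_prompt_echo` helper that keeps the text after the last `<|assistant|>` tag and leaves the input unchanged when no tag is present.

```python
ASSISTANT_TAG = "<|assistant|>"


def strip_prompt_echo(text: str, tag: str = ASSISTANT_TAG) -> str:
    """Return the text after the last assistant tag, or the whole text if the tag is absent."""
    _, found, tail = text.rpartition(tag)
    return tail.strip() if found else text.strip()


# Toy example of a generation that echoes its prompt.
echoed = (
    "<|system|>Extract search terms.\n"
    "<|user|>sci-fi like Predestination\n"
    "<|assistant|>\nscience fiction, time travel"
)
print(strip_prompt_echo(echoed))
```

One possible place to apply such a helper would be inside `LoggerStrOutputParser.parse`, so that only the cleaned keywords reach the retriever.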


Initialize the main retrieval chain, which passes the retrieved documents to the LLM and returns its output.

In [31]:
from langchain.chains.combine_documents import create_stuff_documents_chain

from langchain_core.runnables import RunnablePassthrough

prompt = PromptTemplate(
    input_variables=["context", "messages"],
    template="""<|system|>You are a movie recommendation assistant. Your task is to recommend the best movie from the provided list based on the user's preferences and query. Ensure to select the movie that closely matches the user's request.
The user is asking for a movie recommendation, and you have a list of movies to choose from.
Based on the user's query and the given movie list, provide a detailed recommendation including the movie's title, release year, genres, rating, directors, stars, and a brief summary.

Movie list:
---------------------------
{context}
---------------------------

{messages}
<|assistant|>"""
)

# init the retriever chain
retrieval_chain = create_stuff_documents_chain(llm, prompt)
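
To show what "stuffing" amounts to, the sketch below builds the final prompt by hand. It assumes the chain's default behaviour of joining the documents' text with a blank line before filling `{context}`; `stuff_documents`, the shortened `TEMPLATE`, and the placeholder movie entries are made up for illustration and are not part of LangChain or the dataset.

```python
def stuff_documents(doc_texts: list[str], separator: str = "\n\n") -> str:
    """Concatenate document texts into a single context string."""
    return separator.join(doc_texts)


# Shortened stand-in for the recommendation prompt above.
TEMPLATE = "Movie list:\n{context}\n\n{messages}\n<|assistant|>"

# Placeholder documents in the 'column: value' layout of the CSV rows.
context = stuff_documents([
    "id: tt0000001\ntitle: Movie A",
    "id: tt0000002\ntitle: Movie B",
])
final_prompt = TEMPLATE.format(context=context, messages="<|user|>recommend a movie")
print(final_prompt)
```

Because all `Config.K` documents are pasted into one prompt, the number and length of retrieved rows directly determine how much of the LLM's context window the movie list consumes.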


Write a conversation helper class for easier testing.

In [32]:
class Conversation:
    def __init__(self):
        self.messages = []

    def add_assistant_message(self, message):
        self.messages.append(('assistant', message))

    def add_user_message(self, message):
        self.messages.append(('user', message))

    def get_messages(self):
        # concatenate the messages with the roles in the instruction format
        return "\n".join([f"<|{role}|>{message}" for role, message in self.messages])

    def chat(self, message):
        self.add_user_message(message)
        messages = self.get_messages()
        # invoke the chain
        response = retrieval_chain.invoke({
            "context": query_transforming_retriever_chain.invoke(messages),
            "messages": messages
        })
        self.add_assistant_message(response)
        return response
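
For reference, this is the string that `get_messages` builds from a short history. The sketch reimplements the same join as a standalone function (`format_messages` is a hypothetical name) with toy messages, so it can be run without loading the model.

```python
def format_messages(messages: list[tuple[str, str]]) -> str:
    """Join (role, text) pairs into '<|role|>text' lines, as Conversation.get_messages does."""
    return "\n".join(f"<|{role}|>{text}" for role, text in messages)


# Toy history; the assistant reply is invented for the example.
history = [
    ("user", "give me a sci-fi movie"),
    ("assistant", "Here is a suggestion."),
    ("user", "one by Christopher Nolan"),
]
formatted = format_messages(history)
print(formatted)
```

Since each assistant reply is stored verbatim, any prompt text echoed in a reply is carried into every later turn, which is visible in the second test query below.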


## Test

Talk with the RAG pipeline to see how well it performs.

In [33]:
c = Conversation()
A = c.chat('give me a science fiction movie similar to the movie Predestination(2014)')
print(A)


QUERY: <|system|>You are a movie database assistant. Your job is to extract the core search terms from the conversation for finding optimal movie recommendations.
<|user|>give me a science fiction movie similar to the movie Predestination(2014)
<|user|>
Give me the search keywords derived from the conversation above.
<|assistant|>
Science fiction movie similar to Predestination (2014)
Keywords: Science fiction, time travel, thriller, twist ending, Ethan Hawke, Emma Roberts, Robert Carnellay, Max Landis (writer), Michael Spierce (director).


<|system|>You are a movie recommendation assistant. Your task is to recommend the best movie from the provided list based on the user's preferences and query. Ensure to select the movie that closely matches the user's request.
The user is asking for a movie recommendation, and you have a list of movies to choose from.
Based on the user's query and the given movie list, provide a detailed recommendation including the movie's title, release year, gen

In [34]:
A = c.chat('give one that is directed by christopher nolan')
print(A)

QUERY: <|system|>You are a movie database assistant. Your job is to extract the core search terms from the conversation for finding optimal movie recommendations.
<|user|>give me a science fiction movie similar to the movie Predestination(2014)
<|assistant|><|system|>You are a movie recommendation assistant. Your task is to recommend the best movie from the provided list based on the user's preferences and query. Ensure to select the movie that closely matches the user's request.
The user is asking for a movie recommendation, and you have a list of movies to choose from.
Based on the user's query and the given movie list, provide a detailed recommendation including the movie's title, release year, genres, rating, directors, stars, and a brief summary.

Movie list:
---------------------------
id: tt1877832
title: X-Men: Days of Future Past
first_page_summary: The X-Men send Wolverine to the past in a desperate effort to change history and prevent an event that results in doom for both h