<a href="https://colab.research.google.com/github/etendra2501/Building-Local-RAG-Applications-with-LangChain/blob/main/Building_Local_RAG_Applications_with_LangChain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [35]:
!pip install langchain langchain_community faiss-cpu sentence-transformers transformers



In [36]:
from langchain_community.llms import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
from transformers import pipeline
from langchain_community.vectorstores import FAISS

In [37]:
import os
import urllib.request
import zipfile

zip_url = "https://raw.githubusercontent.com/etendra2501/Building-Local-RAG-Applications-with-LangChain/8f757a131d29fe89ecd74194933f35909c205767/dataset/asia_documents.zip"
zip_path = "asia_documents.zip"
extract_folder = "asia_txt_files"

print("Downloading zip file...")
urllib.request.urlretrieve(zip_url, zip_path)
print("Download complete!")

print("Extracting files...")
os.makedirs(extract_folder, exist_ok=True)
with zipfile.ZipFile(zip_path, "r") as zip_ref:
    zip_ref.extractall(extract_folder)

print(f"Files extracted to: {extract_folder}")

print("Extracted files:")
print(os.listdir(extract_folder))

Downloading zip file...
Download complete!
Extracting files...
Files extracted to: asia_txt_files
Extracted files:
['Indonesia.txt', 'Vietnam.txt', 'Philippines.txt', 'Japan.txt', 'Malaysia.txt', 'Thailand.txt', 'Mongolia.txt', 'Taiwan.txt', 'South_Korea.txt']


In [38]:
import os
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings

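folder_path = "asia_txt_files"

# Load every .txt file in the extracted folder as a LangChain document.
documents = []
for filename in os.listdir(folder_path):
    if filename.endswith(".txt"):
        file_path = os.path.join(folder_path, filename)
        # Assumes the files are UTF-8; an explicit encoding avoids platform-dependent defaults.
        loader = TextLoader(file_path, encoding="utf-8")
        documents.extend(loader.load())

# Split into chunks of roughly 500 characters with about 100 characters of overlap.
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=100)
docs = text_splitter.split_documents(documents)

# Embed each chunk with a small sentence-transformers model.
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Index the embeddings in FAISS and expose the index as a retriever.
vectorstore = FAISS.from_documents(docs, embedding_model)
retriever = vectorstore.as_retriever()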
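
The `chunk_size=500` and `chunk_overlap=100` settings above control how each document is cut into pieces before embedding. As a rough illustration only, the sketch below slides a fixed-size character window over a string so that neighbouring chunks share some text. It is not how `CharacterTextSplitter` works internally (that splitter first splits on a separator and then merges the pieces), and the helper name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=100):
    """Cut text into fixed-size character windows that overlap (illustrative helper)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Small sizes make the overlap easy to see.
sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
print(pieces)
```

With these toy settings, the last four characters of each chunk reappear at the start of the next one, which is the purpose of the overlap: a sentence that straddles a boundary is still fully present in at least one chunk.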

In [39]:
import torch
from langchain.chains import LLMChain

# Use the first GPU when one is available; otherwise fall back to the CPU (-1).
device = 0 if torch.cuda.is_available() else -1
llm_pipeline = pipeline("text-generation", model="gpt2", device=device, max_new_tokens=200)
llm = HuggingFacePipeline(pipeline=llm_pipeline)

Device set to use cpu


In [40]:
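# Custom prompt with placeholders for the retrieved context and the user query.
prompt_template = "Answer the following question based on the provided context: {context}\n\nQuestion: {query}\nAnswer:"

prompt = PromptTemplate(input_variables=["query", "context"], template=prompt_template)

# Standalone chain built from the custom prompt. It is not connected to
# retrieval_qa below, which uses its own default prompt.
llm_chain = LLMChain(llm=llm, prompt=prompt)

# Retrieval chain: fetches relevant chunks and "stuffs" them into a single prompt.
retrieval_qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    verbose=True
)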
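
With `chain_type="stuff"`, the retrieved chunks are joined into one context string and placed into a single prompt. The sketch below mimics that idea in plain Python. It is a simplification under stated assumptions, not LangChain's code: the names `TEMPLATE` and `build_stuff_prompt` are hypothetical, the `"\n\n"` separator is an assumption, and the example chunks are made up.

```python
# Same wording as prompt_template in the cell above.
TEMPLATE = (
    "Answer the following question based on the provided context: {context}\n\n"
    "Question: {query}\nAnswer:"
)


def build_stuff_prompt(chunks, query, separator="\n\n"):
    """Join all chunks into one context string and fill in the template."""
    context = separator.join(chunks)
    return TEMPLATE.format(context=context, query=query)


# Made-up chunks, for illustration only.
demo_prompt = build_stuff_prompt(
    ["Vietnam is known for its cuisine.", "Thailand is famous for street food."],
    "Which countries are known for food?",
)
print(demo_prompt)
```

Because every chunk lands in the same prompt, the prompt grows with the number and size of retrieved chunks, which matters for a small model such as `gpt2`.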

In [41]:
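def truncate_to_max_tokens(text, max_tokens=500):
    """Keep at most max_tokens whitespace-separated words of text.

    Note: this counts words, not model tokens, so it is only a rough length limit.
    """
    tokens = text.split()
    if len(tokens) > max_tokens:
        return " ".join(tokens[:max_tokens])
    return text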
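
A quick check of how the helper above behaves may be useful, since its name mentions tokens while the logic counts whitespace-separated words. The block below repeats the function so it runs on its own; the sample strings are made up. A model tokenizer can split one word into several tokens, so a 500-word limit does not guarantee that the text fits GPT-2's 1024-token context window.

```python
def truncate_to_max_tokens(text, max_tokens=500):
    # Same logic as the cell above, repeated so this block runs on its own.
    tokens = text.split()
    if len(tokens) > max_tokens:
        return " ".join(tokens[:max_tokens])
    return text


# Longer than the limit: only the first three words are kept.
short = truncate_to_max_tokens("one two three four five", max_tokens=3)

# Within the limit: the text is returned unchanged, including its spacing.
unchanged = truncate_to_max_tokens("one  two", max_tokens=3)

print(repr(short))
print(repr(unchanged))
```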

In [42]:
query = "What are the best Asian cuisine dishes?"

# Preview the top-ranked chunk for this query. This context is for inspection
# only: retrieval_qa.run() below performs its own retrieval and does not use it.
retrieved_docs = retriever.get_relevant_documents(query)[:1]
context = " ".join([doc.page_content for doc in retrieved_docs])
context = truncate_to_max_tokens(context, max_tokens=500)
response = retrieval_qa.run(query)
print("Answer:", response)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.

> Entering new RetrievalQA chain...

> Finished chain.
Answer: Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer.

Vietnam is a Southeast Asian country known for its rich history, diverse landscapes, and delicious cuisine. Hanoi and Ho Chi Minh City are its major urban centers, each with a unique character. Ha Long Bay’s limestone karsts and the Mekong Delta’s floating markets are famous geographical highlights. Vietnamese culture is deeply influenced by Confucian values, French colonial heritage, and indigenous traditions.

Thailand is a Southeast Asian country famous for its tropical beaches, ornate temples, and bustling street food culture. Bangkok, the capital, is known for its vibrant nightlife and historical sites like the Grand Palace and Wat Arun. Northern Thailand features mountainous landscapes and cultural cities like Chiang Mai, while the south o
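
The printed answer above begins with the prompt text, which suggests the pipeline returned the prompt together with the generated continuation. One possible post-processing step is to keep only the text after the final answer marker. The sketch below assumes the prompt ends with the marker `Helpful Answer:`; that marker is an assumption, since the output above is cut off before the end of the prompt. `extract_answer` is a hypothetical helper and the sample string is made up.

```python
def extract_answer(generated_text, marker="Helpful Answer:"):
    """Return the text after the last occurrence of marker.

    If the marker is absent, the input is returned with surrounding whitespace removed.
    """
    _, found, tail = generated_text.rpartition(marker)
    return tail.strip() if found else generated_text.strip()


# Made-up model output, for illustration only.
raw = "Use the context.\n\nQuestion: Which dish?\nHelpful Answer: Pho and pad thai."
print(extract_answer(raw))
```

In the notebook this could be applied to `response` before printing it, provided the marker matches the prompt that is actually in use.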

In [43]:
!pip install gradio --upgrade  # Install Gradio

import gradio as gr

def get_answer(query):
    """Answer a question with the RetrievalQA chain built above."""
    # retrieval_qa performs its own retrieval, so no manual context is needed here.
    return retrieval_qa.run(query)

iface = gr.Interface(  # Create Gradio interface
    fn=get_answer,
    inputs=gr.Textbox(lines=2, placeholder="Enter your question here..."),
    outputs=gr.Textbox(lines=5, label="Answer"),
    title="Asia Documents Q&A",
    description="Ask questions about Asia based on the provided documents."
)

iface.launch()  # Launch the Gradio app

Running Gradio in a Colab notebook requires sharing enabled. Automatically setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://31b29ec0cb71ebc2a0.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


