<a href="https://colab.research.google.com/github/VOX304/SchoolChatbot/blob/main/RAG_aNhan.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [10]:
%pip install langchain \
langchain_community \
langchain_core \
langchain_google_genai \
python-dotenv \
pypdf



In [11]:
%pip install faiss-cpu



In [12]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain_google_genai import GoogleGenerativeAIEmbeddings
import os
from google.colab import userdata

os.environ["GOOGLE_API_KEY"] = userdata.get('GOOGLE_API_KEY')
embedding_model = GoogleGenerativeAIEmbeddings(model="models/text-embedding-004")
# Adjust these paths to wherever the PDFs are uploaded
pdf_files = [
    "/content/sample_data/CSE Module Handbook.pdf",
    "/content/sample_data/CSE2021_Info Session  Internship, Thesis and Graduation.pdf",
]

In [60]:
documents = []
for pdf in pdf_files:
    pdf_loader = PyPDFLoader(pdf)
    documents.extend(pdf_loader.load())

text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = text_splitter.split_documents(documents)

# Dump the chunks to a text file so the extraction can be inspected by hand.
# No embeddings are computed here; FAISS.from_documents generates them when the index is built.
with open("extracted_content.txt", "w", encoding="utf-8") as f:
    for i, chunk in enumerate(chunks):
        f.write(f"Chunk {i+1}:\n{chunk.page_content}\n\n{'='*50}\n\n")

print("📄 Extracted content saved to extracted_content.txt")

📄 Extracted content saved to extracted_content.txt
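
The splitter above is configured with `chunk_size=512` and `chunk_overlap=50`. As a rough illustration of what the overlap does, here is a minimal fixed-window sketch. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers to cut at separators such as paragraph and line breaks before falling back to characters), and `sliding_chunks` is a hypothetical helper name introduced only for this example.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Toy input: windows of 4 characters that overlap by 1
print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
```

The last character of each window is repeated at the start of the next one, which is why a sentence that straddles a chunk boundary can still be retrieved in one piece.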


In [14]:
vector_db = FAISS.from_documents(chunks, embedding_model)

In [15]:
print(f"✅ Processed {len(chunks)} text chunks into FAISS vector database.")


✅ Processed 360 text chunks into FAISS vector database.


In [36]:
query = "What is the requirement for graduation?"
retrieved_docs = vector_db.similarity_search(query, k=10)


In [38]:
for i, doc in enumerate(retrieved_docs):  # Show all 10 retrieved chunks
    print(f"\n📄 Document {i+1}:\n{doc.page_content}")


📄 Document 1:
GRADUATION
1. General Information
2. Graduation Timeline

📄 Document 2:
Vietnamese-German University Computer Science Program
General Information
1. Prerequisites: 
- Pass all modules (180 ECTS)
- Complete 04 German classes or submit an A2 German Certificate
2. Expected timeline:
- VGU conducts two graduation assessments annually: in April and October 
- Only one Graduation Ceremony: November

📄 Document 3:
General Information
1. Prerequisites: 
- Evidence of the internship registration with a signed training contract (IC)
- Successful completion of all modules of the first 5 semesters (150 ECTS)
2. Grading Policy: Bachelor Thesis (weighting 80%) and Colloquium 
(min. 30 min. and max. 60 min., weighting 20%)
3. Regulation: Thesis final reports submitted late will fail. Bachelor’s 
thesis with colloquium only be repeated once.
Vietnamese-German University
7
Computer Science Program

📄 Document 4:
Vietnamese-German University Computer Science Program
Graduation Timeline
1.
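
The `similarity_search` call above returns the `k` chunks whose embeddings lie closest to the query embedding. The sketch below shows that idea as a brute-force search over plain Python lists. It assumes Euclidean (L2) distance, which to my knowledge is the default for the LangChain FAISS wrapper but is worth checking against the installed version; `top_k_by_l2` is a hypothetical name and the toy vectors are made up.

```python
import math


def top_k_by_l2(query_vec, doc_vecs, k):
    """Return the indices of the k vectors closest to query_vec, nearest first."""
    # Pair every distance with its index so sorting keeps track of the document
    distances = [(math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    distances.sort()
    return [idx for _, idx in distances[:k]]


# Toy 2-D "embeddings": distances to the origin are 5, 1 and 2
docs = [[3.0, 4.0], [1.0, 0.0], [0.0, 2.0]]
print(top_k_by_l2([0.0, 0.0], docs, k=2))
```

A real index avoids the Python loop, but the ranking logic is the same: smaller distance means a closer match.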

In [18]:
from langchain_google_genai import ChatGoogleGenerativeAI

chat_model = ChatGoogleGenerativeAI(
    google_api_key=os.environ["GOOGLE_API_KEY"],
    model="gemini-2.0-flash-thinking-exp-01-21",
    temperature=0.7
)
print("✅ Chat model loaded successfully.")

✅ Chat model loaded successfully.


In [68]:
# Message classes live in langchain_core, which is installed above
from langchain_core.messages import (
    SystemMessage,
    HumanMessage,
    AIMessage
)

def augment_prompt(query):
    # Get the top 5 results from the knowledge base
    results = vector_db.similarity_search(query, k=5)

    # Extract text, sources, and pages
    source_map = {}
    for doc in results:
        source = doc.metadata.get("source", "Unknown")
        page = doc.metadata.get("page", "Unknown")
        source_map[doc.page_content] = (source, page)

    # Construct the augmented prompt
    source_knowledge = "\n".join(source_map.keys())
    augmented_prompt = f"""You are the school assistant: Using the contexts below, answer the query in a friendly way.
    Don't make up answers. If you don't know, just say that you don't know.

    Contexts:
    {source_knowledge}

    Query: {query}"""

    return augmented_prompt, source_map
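
`augment_prompt` joins the raw chunk texts, so the model never sees which file or page a context came from. One possible variation, sketched below, is to label each context with its source before it goes into the prompt. `format_contexts` is a hypothetical helper, not part of the notebook; it assumes `source_map` has the same shape as above (chunk text mapped to a `(source, page)` tuple) and that `page` is the 0-based index `PyPDFLoader` stores in the metadata.

```python
def format_contexts(source_map):
    """Render each chunk under a numbered header naming its source file and page."""
    blocks = []
    for i, (text, (source, page)) in enumerate(source_map.items(), start=1):
        # Show 1-based page numbers; keep the raw value if it is not an int (e.g. "Unknown")
        shown_page = page + 1 if isinstance(page, int) else page
        blocks.append(f"[{i}] {source}, page {shown_page}\n{text}")
    return "\n\n".join(blocks)


# Made-up entries in the same shape augment_prompt builds
example_map = {
    "Pass all modules (180 ECTS)": ("handbook.pdf", 10),
    "Graduation Timeline": ("slides.pdf", "Unknown"),
}
print(format_contexts(example_map))
```

The returned string could replace `source_knowledge` in the prompt template, at the cost of a few extra tokens per chunk.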



In [80]:
%pip install scikit-learn \
numpy



In [86]:
# embedding_model and HumanMessage come from earlier cells
from sklearn.metrics.pairwise import cosine_similarity

# User question
question = "what are requirements to graduate"

# Generate augmented prompt and retrieve sources
context, source_map = augment_prompt(question)

# Create human message for Gemini model
prompt = HumanMessage(content=context)

# Invoke Gemini model
res = chat_model.invoke([prompt])
response_text = res.content

# Get embeddings for LLM response
response_embedding = embedding_model.embed_query(response_text)

# Track relevant sources
relevant_sources = set()

for text, (source, page) in source_map.items():
    chunk_embedding = embedding_model.embed_query(text)  # Embed each retrieved chunk
    similarity_score = cosine_similarity([response_embedding], [chunk_embedding])[0][0]

    if similarity_score >= 0.7:  # Threshold for relevance
        # PyPDFLoader pages are 0-based; keep the raw value if the page is missing ("Unknown")
        page_label = page + 1 if isinstance(page, int) else page
        relevant_sources.add(f"{source} (Page {page_label})")

formatted_response = f"Response: {response_text}\nSources: {list(relevant_sources) if relevant_sources else ['No sources matched']}"
print(formatted_response)

Response: Hello! To graduate from the Computer Science program at VGU, you need to make sure you meet a few requirements. Let's break them down:

Firstly, you need to **pass all your modules**, which means earning a total of **180 ECTS credits**.

Secondly, you'll need to demonstrate your German language skills by either **completing 4 German classes** or by providing an **A2 German Certificate**.

Additionally, you need to show **evidence of your internship registration** with a signed training contract.

And lastly, you must have **successfully completed all the modules from your first 5 semesters**, which equals to **150 ECTS credits**.

I hope this helps! Let me know if you have any other questions.
Sources: ['/content/sample_data/CSE2021_Info Session  Internship, Thesis and Graduation.pdf (Page 11)', '/content/sample_data/CSE2021_Info Session  Internship, Thesis and Graduation.pdf (Page 18)']
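
The cell above decides which sources to cite by comparing the embedding of the model's answer with the embedding of each retrieved chunk and keeping those at or above 0.7. The sketch below spells out that comparison with the standard library only, so the math is visible. `cosine` and `sources_above_threshold` are hypothetical names, and the 2-D vectors are toy values rather than real embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def sources_above_threshold(response_vec, chunk_vecs, threshold=0.7):
    """Return the labels (sorted) whose embedding is similar enough to the response."""
    return sorted(
        label
        for label, vec in chunk_vecs.items()
        if cosine(response_vec, vec) >= threshold
    )


# Toy vectors: "a" points the same way as the response, "b" is orthogonal, "c" is at 45 degrees
chunks = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(sources_above_threshold([1.0, 0.0], chunks))
```

The 0.7 cutoff is a tuning choice: a lower value cites more chunks, a higher one can leave the answer with no matched sources.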
