### 1. Install Dependencies
You'll need:

`pip install langchain langchain-community langchain-ollama beautifulsoup4 requests chromadb unstructured`

In [8]:
%pip install langchain langchain-community langchain-ollama beautifulsoup4 requests chromadb unstructured


Note: you may need to restart the kernel to use updated packages.
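
After restarting the kernel, it can be useful to confirm that every package is actually visible to the notebook's interpreter. The following is a minimal sketch using only the standard library; the helper name `missing_packages` and the `REQUIRED` list are illustrative and not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names exactly as they are passed to pip above
REQUIRED = [
    "langchain", "langchain-community", "langchain-ollama",
    "beautifulsoup4", "requests", "chromadb", "unstructured",
]

def missing_packages(names):
    """Return the distributions from `names` that are not installed."""
    missing = []
    for name in names:
        try:
            version(name)  # raises PackageNotFoundError if the package is absent
        except PackageNotFoundError:
            missing.append(name)
    return missing

print("Missing packages:", missing_packages(REQUIRED) or "none")
```

An empty result means all seven distributions were found in the current environment.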


### 2. Load Data (PDFs & Websites)
We need to extract text from both **PDF documents** and **web pages**.

#### 2.1 Extract Text from PDFs
Modify your existing LangChain PDF loader to handle multiple PDFs:

In [13]:
from langchain_community.document_loaders import UnstructuredPDFLoader

def load_pdfs(pdf_paths):
    all_data = []
    for path in pdf_paths:
        loader = UnstructuredPDFLoader(path)
        all_data.extend(loader.load())
    return all_data

pdf_file = ["/Users/nychanthrith/Documents/Chantharith/ME-Chatbot/ai-model/data/raw_pdfs/Institute of Technology of Cambodia.pdf"]
pdf_data = load_pdfs(pdf_file)
print(f"Loaded {len(pdf_data)} document(s) from PDFs")

Loaded 1 document(s) from PDFs
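
Since `load_pdfs` accepts a list of paths, the list does not have to be typed by hand. As a sketch, the paths could be collected from a folder with `pathlib`; the helper name `find_pdfs` and the directory `data/raw_pdfs` in the usage comment are assumptions for illustration.

```python
from pathlib import Path

def find_pdfs(directory):
    """Return the sorted paths (as strings) of the .pdf files directly inside `directory`."""
    return sorted(str(path) for path in Path(directory).glob("*.pdf"))

# Hypothetical usage; "data/raw_pdfs" is a placeholder directory:
# pdf_data = load_pdfs(find_pdfs("data/raw_pdfs"))
```

Note that the `*.pdf` pattern does not descend into subfolders and, on case-sensitive file systems, will not match files ending in `.PDF`.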


#### 2.2 Scrape Websites
For retrieving data from ITC, RUPP, CADT, and UHS websites, use ``BeautifulSoup``:

In [14]:
import requests
from bs4 import BeautifulSoup

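def scrape_website(url):
    """Return the text of a page, or an empty string if it cannot be fetched."""
    try:
        # A timeout keeps one unresponsive site from hanging the notebook
        response = requests.get(url, timeout=10)
    except requests.RequestException as exc:
        print(f"Failed to fetch {url}: {exc}")
        return ""
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, "html.parser")
        return soup.get_text()
    return ""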

urls = [
    "https://itc.edu.kh/",
    "https://www.rupp.edu.kh/",
    "https://cadt.edu.kh/",
    "https://uhs.edu.kh/"
]

web_data = [scrape_website(url) for url in urls]
print(f"Scraped {len(web_data)} websites")

Scraped 4 websites
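
The text returned by `soup.get_text()` typically keeps the whitespace that sat between HTML tags, so scraped pages tend to contain many blank lines and stray indentation. A small cleanup step before chunking could look like the sketch below; `normalize_whitespace` is a hypothetical helper, not something the notebook defines.

```python
import re

def normalize_whitespace(text):
    """Collapse runs of spaces and tabs, then drop lines that end up empty."""
    lines = (re.sub(r"[ \t\u00a0]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

raw = "  Home \n\n\n   About\tUs  \n \n Contact "
print(normalize_whitespace(raw))
```

It could be applied as `web_data = [normalize_whitespace(page) for page in web_data]` so that chunks are not padded with empty lines.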


### 3. Process and Store Data

#### 3.1 Split Text into Chunks
PDFs and web pages contain long passages of text, so we split them into smaller, overlapping chunks that can be embedded and retrieved individually.

In [17]:
print(f"Type of pdf_data: {type(pdf_data)}")  # Should be a list of Document objects
print(f"Type of web_data: {type(web_data)}")  # A list of raw strings, converted to Documents before splitting


Type of pdf_data: <class 'list'>
Type of web_data: <class 'list'>


In [26]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

def split_text(data):
    """Split raw strings or LangChain Documents into overlapping chunks."""
    # Normalize the input to a list of LangChain Documents
    if isinstance(data, str):
        data = [Document(page_content=data)]
    elif isinstance(data, list) and all(isinstance(item, str) for item in data):
        data = [Document(page_content=item) for item in data]

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=300)
    return text_splitter.split_documents(data)

# split_text() converts raw strings to Documents itself,
# so both sources can be passed in directly
pdf_chunks = split_text(pdf_data)
web_chunks = split_text(web_data)

# Combine all chunks
all_chunks = pdf_chunks + web_chunks
print(f"Total chunks created: {len(all_chunks)}")


Total chunks created: 20
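
To see what `chunk_size` and `chunk_overlap` mean, the sketch below cuts a string into fixed-width windows that share characters with their neighbours. It is a simplified illustration only: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its chunks are usually not this regular. The function name `sliding_chunks` is made up for this example.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of `chunk_size` characters that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk starts with the last character of the previous one. With the notebook's settings (2000 and 300), neighbouring chunks can share up to 300 characters, which helps keep a sentence that straddles a boundary retrievable from either side.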


#### 3.2 Store Data in a Vector Database
We embed the chunks with the `nomic-embed-text` model served by Ollama and store the resulting vectors in **ChromaDB**:

In [27]:
from langchain_community.vectorstores import Chroma
from langchain_ollama import OllamaEmbeddings

vector_db = Chroma.from_documents(
    documents=all_chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text"),
    collection_name="exam-chatbot"
)

print("Vector database created successfully")

Vector database created successfully
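
Conceptually, retrieval from a vector store means comparing the embedding of a question with the stored chunk embeddings and returning the closest ones. The toy example below ranks three made-up vectors by cosine similarity; the vectors, labels, and the choice of cosine similarity are all assumptions for illustration, and the metric Chroma actually uses depends on how the collection is configured.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 3-dimensional "embeddings"; real embedding vectors are much longer
store = {
    "admission requirements": [0.9, 0.1, 0.0],
    "campus map": [0.1, 0.9, 0.2],
    "tuition fees": [0.6, 0.3, 0.4],
}
query = [1.0, 0.0, 0.0]

# Rank the stored entries from most to least similar to the query
ranked = sorted(store, key=lambda key: cosine_similarity(query, store[key]), reverse=True)
print(ranked)
```

The entry whose vector points in nearly the same direction as the query is ranked first, which is the behaviour the retriever in the next section relies on.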


### 4. Build the Chatbot

#### 4.1 Set Up LLM & Retrieval

In [28]:
from langchain_ollama.chat_models import ChatOllama
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOllama(model="llama3.2")

retriever = MultiQueryRetriever.from_llm(vector_db.as_retriever(), llm)

#### 4.2 Create RAG Chain

In [29]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

template = """Answer based ONLY on the following context:
{context}
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
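
In the chain above, the retriever's output (a list of `Document` objects) is placed into `{context}` as-is, so the prompt most likely receives the string representation of that list, metadata included. A common refinement is to join only the text of each document first. The sketch below shows such a helper; `format_docs` is a name chosen here, and `SimpleNamespace` objects stand in for real documents so the example runs without LangChain.

```python
from types import SimpleNamespace

def format_docs(docs):
    """Join the text of the retrieved documents, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)

# Stand-ins for LangChain Document objects, used only to demonstrate the helper
sample = [
    SimpleNamespace(page_content="First chunk."),
    SimpleNamespace(page_content="Second chunk."),
]
print(format_docs(sample))
```

In the chain, the helper could be wired in as `{"context": retriever | format_docs, "question": RunnablePassthrough()}`, since a plain function piped after a runnable is wrapped into a chain step.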

#### 4.3 Create Chat Function

In [32]:
from IPython.display import display, Markdown

def chat_with_bot(question):
    english_instruction = "Please respond in English: "
    response = chain.invoke(english_instruction + question)
    display(Markdown(response))
    
# Test Example
chat_with_bot("What are the admission requirements for ITC")

The Institute of Technology of Cambodia (ITC) offers a variety of training programs, but it appears that they also have an academic component with admission requirements. Based on the available information, here are some general admission requirements for ITC:

1. **Foundation Year**: For students who want to pursue a degree at ITC, they typically need to complete the Foundation Year program. The requirements for this program include:
	* Age: 17-25 years old
	* Academic qualifications: High school diploma or equivalent (e.g., A-levels, IB)
	* English proficiency: TOEFL or IELTS with a minimum score of 500 or 5.0, respectively
2. **Undergraduate Programs**: For students who want to pursue an undergraduate degree at ITC, they typically need to meet the following requirements:
	* Age: 17-25 years old (for domestic students) or 18-30 years old (for international students)
	* Academic qualifications: High school diploma or equivalent (e.g., A-levels, IB), plus specific prerequisite courses for each program
	* English proficiency: TOEFL or IELTS with a minimum score of 500 or 5.0, respectively
	* Other requirements may include:
		+ SAT or ACT scores for international students
		+ Letters of recommendation from teachers or mentors
		+ Personal statement or essay
3. **Postgraduate Programs**: For students who want to pursue a postgraduate degree at ITC, they typically need to meet the following requirements:
	* Age: 25-40 years old (for domestic students) or 26-35 years old (for international students)
	* Academic qualifications: Bachelor's degree from an accredited institution
	* English proficiency: TOEFL or IELTS with a minimum score of 500 or 5.0, respectively
	* Other requirements may include:
		+ Master's degree transcripts and certificates
		+ Letters of recommendation from teachers or mentors
		+ Personal statement or essay

Please note that these are general requirements and may vary depending on the specific program and admission cycle. I recommend checking the official website of ITC or contacting their admissions office directly for more information and up-to-date requirements.

Additionally, the Institute of Technology of Cambodia also offers various scholarships and financial aid options to support students with limited financial resources.
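
Because `chat_with_bot` prepends the English instruction to the question, the instruction is also part of the text that the retriever uses to search the vector database. One possible alternative, sketched here under the assumption that the chain from section 4.2 is rebuilt with the new template, is to move the instruction into the prompt so the question reaches the retriever unchanged. The name `template_en` is hypothetical.

```python
# Hypothetical variant of the template from section 4.2 with the language
# instruction built in, so nothing has to be prepended to the question
template_en = """Answer in English, based ONLY on the following context:
{context}
Question: {question}
"""

# Plain str.format is used here only to preview the rendered prompt
print(template_en.format(context="<retrieved chunks>", question="What programs does ITC offer?"))
```

With this template passed to `ChatPromptTemplate.from_template`, the chat function could call `chain.invoke(question)` directly.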