### 1. Install Dependencies
You'll need:

```pip install langchain langchain-community langchain-ollama beautifulsoup4 requests chromadb unstructured```

In [None]:
%pip install langchain langchain-community langchain-ollama beautifulsoup4 requests chromadb unstructured

Note: you may need to restart the kernel to use updated packages.
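
After restarting the kernel, it can help to confirm that the packages are actually visible to the notebook. The following is a minimal sketch using only the standard library; the helper name `find_missing` is hypothetical, and the package list is copied from the install command above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names copied from the pip command above.
REQUIRED = [
    "langchain", "langchain-community", "langchain-ollama",
    "beautifulsoup4", "requests", "chromadb", "unstructured",
]

def find_missing(packages):
    """Return the packages that are not installed in the current environment."""
    missing = []
    for name in packages:
        try:
            version(name)  # raises PackageNotFoundError if the distribution is absent
        except PackageNotFoundError:
            missing.append(name)
    return missing

print("Missing packages:", find_missing(REQUIRED) or "none")
```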


### 2. Load Data (PDFs & Websites)
We need to extract text from both **PDF documents** and **web pages**.

#### 2.1 Extract Text from PDFs
Modify your existing LangChain PDF loader to handle multiple PDFs:

In [20]:
from langchain_community.document_loaders import UnstructuredPDFLoader
import os

def load_pdfs(folder_path):
    """Load every PDF in folder_path into a list of LangChain Documents."""
    all_data = []
    for filename in os.listdir(folder_path):
        if filename.endswith(".pdf"):
            file_path = os.path.join(folder_path, filename)
            loader = UnstructuredPDFLoader(file_path)
            all_data.extend(loader.load())
    return all_data

folder_path = "/Users/nychanthrith/Documents/Chantharith/ME-Chatbot/ai-model/data/raw_pdfs"
pdf_data = load_pdfs(folder_path)
print(f"Loaded {len(pdf_data)} documents from PDFs")

Loaded 4 documents from PDFs
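
The loader above matches file names with a case-sensitive `.endswith(".pdf")` and looks only at the top level of the folder. If your PDFs may be named `*.PDF` or sit in subfolders, a small discovery helper could be used to collect the paths first. This is a sketch under those assumptions; `find_pdfs` and its `recursive` parameter are hypothetical names, not part of LangChain.

```python
from pathlib import Path

def find_pdfs(folder, recursive=False):
    """Return sorted PDF paths in `folder`, matching the extension case-insensitively."""
    root = Path(folder)
    # rglob walks subfolders too; iterdir stays at the top level
    candidates = root.rglob("*") if recursive else root.iterdir()
    return sorted(p for p in candidates if p.is_file() and p.suffix.lower() == ".pdf")
```

Each returned path could then be passed to `UnstructuredPDFLoader` in place of the `os.listdir` loop.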


#### 2.2 Scrape Websites
To retrieve text from the ITC, RUPP, CADT, and UHS websites, fetch each page with `requests` and parse it with `BeautifulSoup`:

In [2]:
import requests
from bs4 import BeautifulSoup

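def scrape_website(url):
    # A timeout keeps one unresponsive site from hanging the notebook
    response = requests.get(url, timeout=10)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, "html.parser")
        return soup.get_text()
    # Any non-200 response yields an empty string
    return ""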

urls = [
    "https://itc.edu.kh/#",
    "https://www.rupp.edu.kh/",
    "https://cadt.edu.kh/about/",
    "https://uhs.edu.kh/"
]

web_data = [scrape_website(url) for url in urls]
print(f"Scraped {len(web_data)} websites")

Scraped 4 websites
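
Text returned by `get_text()` usually carries long runs of blank lines and stray indentation left over from the page layout, which wastes space inside each chunk. A possible cleanup step, sketched here with the standard library only (the name `clean_text` is hypothetical), collapses that whitespace before splitting:

```python
import re

def clean_text(raw):
    """Collapse runs of spaces/tabs and drop blank lines from scraped text."""
    # Normalize horizontal whitespace (including non-breaking spaces) line by line
    lines = (re.sub(r"[ \t\u00a0]+", " ", line).strip() for line in raw.splitlines())
    # Keep only lines that still contain text
    return "\n".join(line for line in lines if line)
```

It could be applied as `web_data = [clean_text(text) for text in web_data]`; note that this would change the chunk count reported later in the notebook.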


### 3. Process and Store Data

#### 3.1 Split Text into Chunks
PDFs and web pages contain long passages of text, so we split them into smaller overlapping chunks for better retrieval.

In [21]:
print(f"Type of pdf_data: {type(pdf_data)}")  # A list of Document objects
print(f"Type of web_data: {type(web_data)}")  # A list of plain strings, one per scraped page


Type of pdf_data: <class 'list'>
Type of web_data: <class 'list'>


In [22]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

def split_text(data):
    """Split strings or LangChain Documents into overlapping chunks."""
    # Wrap raw strings in Documents so the splitter receives a uniform input type
    if isinstance(data, str):
        data = [Document(page_content=data)]
    elif isinstance(data, list) and all(isinstance(item, str) for item in data):
        data = [Document(page_content=item) for item in data]

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=300)
    return text_splitter.split_documents(data)

# pdf_data is already a list of Documents; web_data is a list of strings,
# which split_text converts to Documents before splitting
pdf_chunks = split_text(pdf_data)
web_chunks = split_text(web_data)

# Combine all chunks
all_chunks = pdf_chunks + web_chunks
print(f"Total chunks created: {len(all_chunks)}")


Total chunks created: 24
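
To see what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sliding-window splitter. It is only an illustration of the two parameters: `RecursiveCharacterTextSplitter` additionally tries to break on separators such as paragraphs and sentences, which this sketch does not attempt. The name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks

print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, so a sentence that straddles a boundary is less likely to be lost to retrieval.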


#### 3.2 Store Data in a Vector Database
We embed the chunks with the `nomic-embed-text` model served through Ollama and store them in **ChromaDB**:

In [23]:
from langchain_community.vectorstores import Chroma
from langchain_ollama import OllamaEmbeddings

vector_db = Chroma.from_documents(
    documents=all_chunks,
    embedding=OllamaEmbeddings(model="nomic-embed-text"),
    collection_name="exam-chatbot"
)

print("Vector database created successfully")

Vector database created successfully
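
Conceptually, a vector store keeps one embedding per chunk and answers a query by returning the chunks whose embeddings are closest to the query's embedding. The toy example below illustrates that idea with cosine similarity over hand-written 2-D vectors. It is not how Chroma is implemented (Chroma uses its own index and distance settings), and the names `cosine_similarity` and `top_k` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, store, k=2):
    """store is a list of (text, vector) pairs; return the k texts most similar to query_vec."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Made-up vectors standing in for real embeddings
store = [("fees", [1.0, 0.0]), ("admission", [0.0, 1.0]), ("campus", [0.7, 0.7])]
print(top_k([0.1, 1.0], store, k=2))
# ['admission', 'campus']
```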


### 4. Build the Chatbot

#### 4.1 Set Up LLM & Retrieval

In [24]:
from langchain_ollama.chat_models import ChatOllama
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.prompts import ChatPromptTemplate

llm = ChatOllama(model="llama3.2")

retriever = MultiQueryRetriever.from_llm(vector_db.as_retriever(), llm)
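
`MultiQueryRetriever` asks the LLM for several rephrasings of the user's question, runs a retrieval for each, and returns the unique union of the results. The sketch below illustrates that flow with plain functions. `rephrase` and `search` are hypothetical stand-ins for the LLM and the vector store, and searching with the original question as well is a choice made in this sketch rather than a statement about the library's defaults.

```python
def multi_query_retrieve(question, rephrase, search):
    """Retrieve for the question and each rephrasing, then return the unique union.

    rephrase: callable(question) -> list of alternative questions (stand-in for the LLM)
    search:   callable(query) -> list of text chunks (stand-in for the vector store)
    """
    seen = set()
    results = []
    for query in [question, *rephrase(question)]:
        for chunk in search(query):
            if chunk not in seen:  # keep the first occurrence, preserve order
                seen.add(chunk)
                results.append(chunk)
    return results
```

Differently worded queries tend to land near different chunks, so the union can recover context that a single query would miss, at the cost of extra LLM and embedding calls.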

#### 4.2 Create RAG Chain

In [25]:
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

template = """Answer based ONLY on the following context:
{context}
Question: {question}
"""

prompt = ChatPromptTemplate.from_template(template)

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
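
In the chain above, the retriever's output (a list of `Document` objects) is placed into `{context}` directly, so the prompt receives the list's string representation, metadata included. A common refinement is to join only the text of each document first. This is a minimal sketch; `format_docs` is a hypothetical helper, and `SimpleNamespace` objects stand in for LangChain Documents so the example runs on its own.

```python
from types import SimpleNamespace

def format_docs(docs):
    """Join the text of retrieved documents, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)

# Stand-ins for LangChain Document objects; only `page_content` is needed here
docs = [SimpleNamespace(page_content="first chunk"),
        SimpleNamespace(page_content="second chunk")]
print(format_docs(docs))
```

Assuming the usual LCEL behavior of coercing plain functions into runnables, it could be wired in as `{"context": retriever | format_docs, "question": RunnablePassthrough()}`.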

#### 4.3 Create Chat Function

In [26]:
from IPython.display import display, Markdown

def chat_with_bot(question):
    english_instruction = "Please respond in English: "
    response = chain.invoke(english_instruction + question)
    display(Markdown(response))
    
# Test Example
chat_with_bot("What are the admission requirements for ITC")

The admission requirements for Institute of Technology of Cambodia (ITC) vary depending on the program and student status. Generally, the following requirements are required:

* Completion of upper secondary education (high school) or equivalent
* Specific subject requirements, such as math and sciences for engineering programs
* Language proficiency in Khmer and/or English

It's also worth noting that tuition fees vary depending on the program, year of study, and student status. Additionally, information on scholarships and financial aid should be available through the university's student services or financial aid office.

Here are some specific admission requirements for certain programs at ITC:

* Technician Degree: Completion of upper secondary education (high school) with a minimum GPA of 2.5
* Engineering Degree: Completion of upper secondary education (high school) with a minimum GPA of 3.0, plus specific subject requirements such as math and sciences
* Graduate School: Master's degree or equivalent, with a minimum GPA of 3.0
* Doctoral program: Ph.D. or equivalent, with a minimum GPA of 3.5

Please note that these requirements are subject to change, and it's always best to check the official website of ITC or contact their admissions department for the most up-to-date information.

In [27]:
chat_with_bot("What are the admission requirements for CADT")

According to the provided text, the admission requirements for the Cambodia Academy of Digital Technology (CADT) vary depending on the specific program. Generally, these may include:

* Completion of upper secondary education (high school) or equivalent
* Specific subject requirements, particularly in mathematics and sciences, for technology-related programs
* Entrance examinations or interviews for some programs

It is recommended to check the CADT website for a complete list of their current program offerings and admission requirements, as they may evolve rapidly to keep pace with the tech industry.

Additionally, details on application procedures, deadlines, and required documents can be found on the official CADT website. Inquiring about potential scholarships or financial aid options through CADT's student services or financial aid department may also provide additional information.

In [28]:
chat_with_bot("What are the admission requirements for RUPP")

The admission requirements for Royal University of Phnom Penh (RUPP) vary depending on the specific faculty and program. However, some general requirements are as follows:

* Completion of upper secondary education (high school)
* General requirement: completion of high school or equivalent
* Specific programs may have additional requirements such as:
	+ Entrance examinations
	+ Specific subject prerequisites
	+ Certain grade requirements

It is crucial to consult the RUPP website or admissions office for details on specific programs and their respective admission criteria.

In [29]:
chat_with_bot("What are the admission requirements for UHS")

The admission requirements for the University of Health Sciences (UHS) vary depending on the specific program of study. However, generally, the following requirements are applicable:

1. Completion of upper secondary education (high school)
2. Entrance examinations that assess aptitude for health sciences
3. Specific grade requirements or minimum scores in relevant subjects (e.g., biology, chemistry, physics)
4. Interviews may be part of the selection process for some programs.

It is essential to check the UHS website or contact their admissions office for the precise and most current admission criteria for each program.
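
The four test calls above differ only in the university name, so they could be driven from a list. The helper below is a sketch with a hypothetical name, `ask_for_each`; it expects an `ask` callable that returns the answer as a string (for example `chain.invoke`), whereas `chat_with_bot` displays its answer and returns nothing.

```python
def ask_for_each(names, ask, template="What are the admission requirements for {name}?"):
    """Call `ask` once per name and return a {name: answer} dictionary."""
    return {name: ask(template.format(name=name)) for name in names}

# Example with a dummy `ask` that just echoes the question it receives
answers = ask_for_each(["ITC", "CADT", "RUPP", "UHS"], ask=lambda question: question)
print(answers["ITC"])
# What are the admission requirements for ITC?
```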