Updated notebook. Main changes: reverted to storing each category in its own collection, and implemented a two-step pipeline (classify the query, then retrieve only from the relevant collections). For some reason it now takes a substantial amount of time (e.g. 7 queries in 17 minutes, using the general QA set and another miscellaneous set of data); to discuss.
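A minimal sketch for finding where the time goes, assuming the slowdown sits in one of the pipeline stages (classification, retrieval, or generation). The `timed` helper and the `stage_seconds` dictionary are hypothetical names, and the workloads below are stand-ins rather than the notebook's real calls.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical accumulator: wall-clock seconds spent in each named stage.
stage_seconds = defaultdict(float)

@contextmanager
def timed(stage):
    """Add the elapsed time of the wrapped block to stage_seconds[stage]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_seconds[stage] += time.perf_counter() - start

# Stand-in workloads; in the notebook these blocks would wrap the
# classification call, the retrieval call, and the final model call.
with timed("classify"):
    sum(range(100_000))
with timed("retrieve"):
    sum(range(10_000))

# Print the slowest stage first.
for stage, seconds in sorted(stage_seconds.items(), key=lambda kv: -kv[1]):
    print(f"{stage}: {seconds:.4f}s")
```

Wrapping each stage once per query and printing the totals after a benchmark run should show whether the two model calls or the retrieval step dominates.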

Install some dependencies

In [None]:
!pip install -q -U accelerate==0.27.1
!pip install -q -U datasets==2.17.0
!pip install -q -U transformers==4.38.1
!pip install langchain sentence-transformers chromadb langchainhub

!pip install langchain-community langchain-core

Get the Model You Want

In [None]:
!pip install llama-cpp-python

Define Variables

In [None]:
from llama_cpp import Llama
from langchain_community.llms import HuggingFaceEndpoint
import os
from transformers import pipeline

model_path = "mistral-7b-instruct-v0.2.Q4_0.gguf"
model = Llama(model_path=model_path, n_ctx=2048, n_threads=8, verbose=False)

In [90]:
from langchain_community.embeddings import HuggingFaceEmbeddings
import chromadb

embedding_model = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# persistent client to interact w chroma vector store
client = chromadb.PersistentClient(path="./chroma_db")

# single combined collection (kept for testing for now)
collection = client.get_or_create_collection(name="combined_docs")


Define Data Sources

In [None]:
import pandas as pd
import concurrent.futures
import uuid
import os

file_names = [
    "study_permit_general", "work_permit_student_general", "work-study-data-llm",
    "vancouver_transit_qa_pairs", "permanent_residence_student_general", "data-with-sources",
    "faq_qa_pairs_general", "hikes_qa", "sfu-faq-with-sources", "sfu-housing-with-sources",
    "sfu-immigration-faq", "park_qa_pairs-up", "cultural_space_qa_pairs_up",
    "qa_pairs_food", "qa_pairs_year_and_month_avg", "qa_pairs_sfu_clubs"
]

collections = {}
batch_size = 32

def process_file(file):
    try:
        path = f'../Data/{file}.csv'
        if not os.path.exists(path):
            return f"{file} skipped (file not found)."

        df = pd.read_csv(path, usecols=lambda col: col.lower() in {"question", "answer"})
        df.columns = df.columns.str.lower()

        if "question" not in df.columns or "answer" not in df.columns:
            return f"{file} skipped (missing question/answer columns)."

        df = df.drop_duplicates(subset="question")
        df["text"] = df["question"].fillna('') + ' ' + df["answer"].fillna('')
        unique_texts = list(set(df["text"].dropna().tolist()))

        collection = client.get_or_create_collection(name=file)
        for i in range(0, len(unique_texts), batch_size):
            batch = unique_texts[i:i + batch_size]
            embeddings = embedding_model.embed_documents(batch)
            ids = [str(uuid.uuid4()) for _ in batch]
            collection.add(ids=ids, embeddings=embeddings, documents=batch)

        collections[file] = collection
        return f"{file}: Loaded {len(unique_texts)} docs."
    except Exception as e:
        return f"{file}: Error - {e}"

# load and embed the files in parallel
with concurrent.futures.ThreadPoolExecutor(max_workers=6) as executor:
    results = list(executor.map(process_file, file_names))

for result in results:
    print(result)
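Because `process_file` assigns a fresh `uuid4` to every text and the Chroma client is persistent, re-running the cell above would presumably store the same texts again under new IDs. A minimal sketch of one alternative, assuming content-derived IDs are acceptable here: hash each text so the same document always gets the same ID. `stable_id` is a hypothetical helper, not part of the notebook.

```python
import hashlib

def stable_id(text: str) -> str:
    """Return the same ID for the same text on every run (SHA-256 hex digest)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Placeholder texts standing in for one batch of question + answer strings.
batch = ["placeholder question one placeholder answer one",
         "placeholder question two placeholder answer two"]
ids = [stable_id(text) for text in batch]

# In process_file, these IDs could be passed to the collection in place of
# the uuid4 values, e.g. with collection.upsert(ids=ids, embeddings=..., documents=batch)
# so that re-running the cell overwrites existing entries instead of adding copies.
print(ids[0][:12], ids[1][:12])
```

With stable IDs, the later de-duplication step in `generate_answer` would mostly become a safety net rather than a necessity.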


In [94]:
# define category-to-collection mapping
# motivation: responses take far too long now -> dataset size tripled and response time grew sharply
# now takes around 2 minutes on average per response, compared to 40 seconds previously

collection_map = {
    "study permit": "study_permit_general",
    "work permit": "work_permit_student_general",
    "work-study": "work-study-data-llm",
    "transit": "vancouver_transit_qa_pairs",
    "permanent residence": "permanent_residence_student_general",
    "health": "data-with-sources",
    "faq": "faq_qa_pairs_general",
    "hiking": "hikes_qa",
    "sfu faq": "sfu-faq-with-sources",
    "housing": "sfu-housing-with-sources",
    "immigration": "sfu-immigration-faq",
    "parks": "park_qa_pairs-up",
    "culture": "cultural_space_qa_pairs_up",
    "food": "qa_pairs_food",
    "expenditure": "qa_pairs_year_and_month_avg",
    "clubs": "qa_pairs_sfu_clubs"
}


Function to match relevant documents

In [95]:
def get_relevant_documents(query, categories, n_results=3):
    all_results = []
    # embed the query once and reuse it for every collection
    query_embedding = embedding_model.embed_query(query)

    for category in categories:
        collection_name = collection_map[category]
        if collection_name in collections:
            try:
                result = collections[collection_name].query(
                    query_embeddings=[query_embedding],
                    n_results=n_results
                )
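                docs = result.get("documents", [[]])[0]
                # Chroma returns distances, so lower values mean closer matches
                distances = result.get("distances", [[]])[0]

                all_results.extend(zip(docs, distances))
            except Exception as e:
                print(f"error querying {collection_name}: {e}")

    # sort ascending by distance (best match first)
    all_results = sorted(all_results, key=lambda x: x[1])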

    return all_results[:n_results]
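`get_relevant_documents` cuts the merged list to `n_results` before `generate_answer` removes duplicates, so a document stored in two collections can take up more than one of the top slots. A minimal sketch of merging with de-duplication first, assuming each collection's hits arrive as `(doc, distance)` pairs; `merge_top_k` is a hypothetical helper and the hits below are made-up placeholders.

```python
import heapq

def merge_top_k(per_collection_hits, k=3):
    """Merge (doc, distance) hits from several collections.

    Keeps the smallest distance seen for each distinct document, then
    returns the k closest documents, best match first.
    """
    best = {}
    for hits in per_collection_hits:
        for doc, distance in hits:
            if doc not in best or distance < best[doc]:
                best[doc] = distance
    return heapq.nsmallest(k, best.items(), key=lambda item: item[1])

# Placeholder hits from two collections; "doc A" appears in both.
hits_a = [("doc A", 0.42), ("doc B", 0.90)]
hits_b = [("doc A", 0.40), ("doc C", 0.55)]

top = merge_top_k([hits_a, hits_b], k=2)
print(top)
```

Here the duplicate "doc A" is counted once with its lower distance, which leaves room for "doc C" in the top two.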


Classify Prompt

In [96]:
import re
import difflib

valid_categories = list(collection_map.keys())
fallback_category = "faq"

def classify_query(query):
    category_prompt = f"""
    You are a classifier for a Q&A system for international students in British Columbia.
    Choose the **1 most relevant** category from this list, or at most 3 if absolutely needed (comma-separated):

    {", ".join(valid_categories)}

    Query: "{query}"

    Return only the category name(s) as a comma-separated string.
    """

    response = model(category_prompt, max_tokens=50, temperature=0.1)["choices"][0]["text"].strip().lower()
    print("Raw out:", response)

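    # split on commas, colons, and newlines so multi-word categories
    # (e.g. "study permit") stay intact; single words are kept as a fallback
    parts = [p.strip(" .\"'*") for p in re.split(r"[,:\n]", response)]
    tokens = re.findall(r'\b\w+\b', response)

    matched = []
    for candidate in parts + tokens:
        closest = difflib.get_close_matches(candidate, valid_categories, n=1, cutoff=0.8)
        if closest and closest[0] not in matched:
            matched.append(closest[0])
        if len(matched) == 3:
            break

    # always search the general FAQ collection when there is room for it
    if fallback_category not in matched:
        matched.append(fallback_category)

    return matched[:3]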


Generate Answer

In [None]:
def generate_answer(query):
    categories = classify_query(query)
    print(f"Categories {categories}\n")
    relevant_documents = get_relevant_documents(query, categories)

    if not relevant_documents:
        return {
            "Response": "Sorry, no relevant documents found."
        }

    #relevant_documents = list(set(relevant_documents))

    seen = set()
    unique_docs = []
    for doc, sim in relevant_documents:
        if doc not in seen:
            seen.add(doc)
            unique_docs.append((doc, sim))

    print("Relevant Documents with Similarity Scores:")
    for doc, sim in unique_docs:
        print(f"Similarity: {sim:.4f}\nDoc: {doc}\n")

    relevant_texts = "\n\n".join([doc for doc, _ in unique_docs])
    
    
    # print ("categories 1:", categories[0])
    
    ## additional prompts
    hike_prompt = f"""
        INSTRUCTIONS:
            1. Convert structured information about the hike into a short, friendly paragraph using natural language. Do not repeat numbers or use formatting from the source.
            2. If they ask about hiking information, only answer with required information. Users can ask for more information if needed.
            3. When asked for a particular type of hike, find it instead of saying that one would not work in the category they asked for.
            4. Do NOT list trail attributes or stats (like “Distance: 3.1 km, Elevation: 789 m”). Instead, describe them in context (e.g., “a steep 3 km trail with a tough 789 m climb”).
            5. Avoid repeating exact numbers unless essential (e.g., elevation gain is helpful, but don’t dump all stats).
    """
    
    parks_prompt = f""" 
        INSTRUCTIONS:
            1. Convert structured information about the park into a short, friendly paragraph using natural language. Do not repeat numbers or use formatting from the source.
            2. Provide only necessary information that will allow the user to enjoy the park.
                - Feel free to tell them about logistical information if asked.
    """
    
    food_prompt = f""" 
    """

    ## activities general covers how to answer general parks, hikes, food, clubs, cultural related questions 
    activities_general = f""" 
        INSTRUCTIONS:
            1. If they ask for suggestions, provide 2 to 3 suggestions.
            2. Do NOT list all information. Instead describe them in context 
            3. Provide accurate suggestions, NOT suggestions of things that will not work for what they want.
            4. Convert structured information about the activity into a short, friendly paragraph using natural language. Do not repeat formatting from the source.
    """
    
    ## permits prompt - covers ways to answer immigration, study permits, work permits, and permanent residence related questions 
    permits_prompt = f"""
        INSTRUCTIONS:
            1. When given a specific question with many possible answers, you can ask for more specific information.
                - if they are not asking for an extension do not provide information in regards to an extension of a permit.
            2. Only answer with information provided 
                - Information should NOT be guessed and do NOT add extra information
            3. If the answer is not in the dataset, respond with: "I’m sorry, I don’t have that information. Please check the official IRCC website for more details."
            4. If it is helpful, provide the link and a description about it.
            5. Do NOT list all information. Instead describe them in context 
            6. If the answer depends on a specific condition explain those clearly.
    """
    
    housing_prompt = f""" 
    """ 
    
    transit_prompt = f""" 
    """

    
    ## main rag prompt - how to answer general questions 
    rag_prompt = f"""
    You are a helpful, friendly assistant for international students new to British Columbia, Canada.

    Below are some reference documents that may be relevant to the user's question:
    {relevant_texts}

    INSTRUCTIONS:
    1. If the user's query is just a greeting (like "hello", "hi", "what's up"):
       - Respond with a single brief friendly greeting
       - Offer to help with questions about studying or living in BC
       - Do NOT include ANY information from the reference documents
       - Do NOT create additional answers beyond answering their original question

    2. If the user is asking for information:
       - Be friendly and answer based ONLY on the reference documents if relevant
       - Summarize the necessary information into a couple sentences.
       - Do NOT create additional questions and answers beyond answering their original question
       - Limit your entire response to no more than 3 concise sentences when possible. Do not create long multi-line answers.
       - If the documents don't provide sufficient information, say "I don't have enough information to answer that. Please refer to official sources."
       - Ask for more information when there are multiple scenarios in the documents.
       - If they ask things like "can I", "will I", "how can I", feel free to ask follow-up questions if you don't know how to answer with the information provided. Do not just assume.
       - If they only asked a question and did not provide information surrounding it, there is no need to state "Based on the information provided,"
       - Include all relevant information from the documents
       - If the documents don't provide sufficient information, say "I don't have enough information to answer that. Please refer to official sources."
    
    3. IMPORTANT: Never generate additional content beyond answering the user's question
    
    4. IMPORTANT: Do NOT number or bullet your points. Always use natural sentences and group similar information together where possible.
    
    5. IMPORTANT: Never generate additional content beyond answering the user's question

    User question: {query}

    Your response (just the answer, no preamble):
    """
    
    
    # adding the category-specific prompting to main if necessary
    # (names must match the keys of collection_map, since classify_query returns those)
    activity_categories = {"hiking", "parks", "food", "culture", "clubs"}
    permit_categories = {"study permit", "work permit", "immigration", "permanent residence"}

    for category in categories:
        if category in activity_categories:
            rag_prompt += "\n" + activities_general

        if category == "hiking":
            rag_prompt += "\n" + hike_prompt

        if category == "parks":
            rag_prompt += "\n" + parks_prompt

        if category in permit_categories:
            rag_prompt += "\n" + permits_prompt
       
    response_after_rag = model(rag_prompt, max_tokens=300, temperature=0.1)["choices"][0]["text"]
    # response_after_rag = model(rag_prompt, max_tokens=300, temperature=0.1)

    return {
        "Response": response_after_rag
    }
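In `generate_answer`, the category-specific instructions are appended after the "Your response" cue, so the model reads them after the point where it is meant to start answering. A minimal sketch of assembling the prompt in order instead, assuming the extra instruction blocks are collected in a list first; `build_prompt` is a hypothetical helper and the wording is shortened for illustration.

```python
def build_prompt(documents, extra_instructions, query):
    """Assemble the prompt so the response cue is always the last line."""
    parts = [
        "You are a helpful, friendly assistant for international students "
        "new to British Columbia, Canada.",
        "Reference documents:\n" + documents,
    ]
    # Category-specific blocks go before the question; empty blocks are skipped.
    parts.extend(block.strip() for block in extra_instructions if block.strip())
    parts.append(f"User question: {query}")
    parts.append("Your response (just the answer, no preamble):")
    return "\n\n".join(parts)

# Placeholder inputs standing in for retrieved text and a category prompt.
prompt = build_prompt(
    documents="placeholder reference text",
    extra_instructions=["INSTRUCTIONS:\n1. If they ask for suggestions, provide 2 to 3 suggestions.", "   "],
    query="I'm hungry",
)
print(prompt)
```

Keeping the cue last may also reduce cases where the model repeats the "User question" and "Your response" lines in its output.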


Example Usage

In [229]:
import warnings
warnings.filterwarnings("ignore", category=FutureWarning, module="huggingface_hub")

In [236]:
answer = generate_answer("I'm hungry")
print(answer["Response"])

Raw out: category: food
Categories ['food', 'faq']

Relevant Documents with Similarity Scores:
Similarity: 1.5588
Doc: What are Dining Dollars in the Platinum Meal Plan? Dining Dollars are prepaid credits included in the 7-Day Platinum Plan that can be used at select SFU food outlets like Tim Hortons and Starbucks.


    User question: I'm hungry

    Your response:
      I'm sorry to hear that. There are several options for you to grab a bite to eat on campus. The 7-Day Platinum Meal Plan includes Dining Dollars that can be used at select food outlets like Tim Hortons and Starbucks. Alternatively, you can visit the SFU Bookstore Food Court for a variety of options. I hope this helps! Let me know if you have any other questions.


In [None]:
!pip install evaluate
!pip install bert_score

In [221]:
import pandas as pd
from evaluate import load
bertscore = load("bertscore")

benchmark_data = pd.read_csv("../s-eval-set/s_test_qa.csv")

for idx, row in benchmark_data.iterrows():
    user_query = row["Questions"]
    correct_answer = row["Answers"]

    responses = generate_answer(user_query)
    
    predictions = [responses.get("Response", "N/A")]
    references = [correct_answer]
    results = bertscore.compute(predictions=predictions, references=references, lang="en")

    print("\n" + "="*50)
    print(f"Benchmark Query {idx + 1}: {user_query}")
    print("="*50)
    print("\nRAG Response:\n", responses.get("Response", "N/A"))
    print("\n(Benchmark) Answer:\n", correct_answer)
    print("BERT Score == ", results)
    print("="*50 + "\n\n")


Raw out: immigration
Categories ['immigration', 'faq']

Relevant Documents with Similarity Scores:
Similarity: 0.6393
Doc: What is required to apply for a study permit extension? You can only extend a study permit if you are physically inside Canada at the time of application.



Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.



Benchmark Query 1: As a minor how can I apply for a study permit in Canada?

RAG Response:
 1. You cannot apply for a study permit as a minor if you are not physically present in Canada.
    2. If you are already in Canada, you can apply for a study permit extension if you meet the requirements.
    3. To extend your study permit, you must submit your application before your current permit expires.
    4. You will need to provide proof of financial support, a letter of acceptance from a designated learning institution, and meet other eligibility requirements.
    5. If you are under 18 years old, you may need a guardian in Canada to provide consent for your study permit application.
    6. For more information, refer to the official Immigration, Refugees and Citizenship Canada website.

(Benchmark) Answer:
 Your parents will need to help with your study permit application. Both parents will need to submit documents, but if you only have one parent for any reason there are other ways o

In [182]:
from evaluate import load
bertscore = load("bertscore")
predictions = ["An SFU computing ID"]
references = ["An SFU Computing ID"]
results = bertscore.compute(predictions=predictions, references=references, lang="en")
print("Results == ", results)

Some weights of RobertaModel were not initialized from the model checkpoint at roberta-large and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Results ==  {'precision': [0.9849278926849365], 'recall': [0.9849278926849365], 'f1': [0.9849278926849365], 'hashcode': 'roberta-large_L17_no-idf_version=0.3.12(hug_trans=4.50.3)'}
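The benchmark loop above prints one BERTScore result per query, which is hard to compare across runs. A minimal sketch of averaging the scores, assuming each result is the dictionary returned by `bertscore.compute` for a single prediction (one-element lists under `precision`, `recall`, and `f1`). `summarize_bertscore` is a hypothetical helper, and the numbers below are placeholders, not measured results.

```python
from statistics import mean

def summarize_bertscore(per_query_results):
    """Average precision, recall, and F1 over single-prediction results."""
    return {
        metric: mean(result[metric][0] for result in per_query_results)
        for metric in ("precision", "recall", "f1")
    }

# Placeholder values only, shaped like the bertscore output shown above.
example_results = [
    {"precision": [0.90], "recall": [0.80], "f1": [0.85]},
    {"precision": [0.70], "recall": [0.60], "f1": [0.65]},
]

summary = summarize_bertscore(example_results)
print(summary)
```

In the loop, each `results` dictionary could be appended to a list and summarized once at the end.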
