# **Applied Deep Learning and Artificial Intelligence - Group Assignment 4**

**Group members:**
*   Annika í Jákupsstovu, study no. 20204059
*   Mikkel Ørts Nielsen, study no. 20205211

# **Description of assignment**

**Introduction**

This assignment is designed to explore the frontier of AI applications, focusing on the integration of Retrieval-Augmented Generation (RAG) with vector databases such as ChromaDB and LanceDB, and the comparison of various prompt engineering techniques. The goal is to build an application that not only showcases advanced AI and DL capabilities but also evaluates the impact of different prompt strategies on model performance.

**Task Description**

Create an application that utilizes RAG and vector databases, and systematically compares the effectiveness of at least three distinct prompt engineering techniques.

**Key Components**

*  **RAG and Vector Database Integration:** Implement RAG with ChromaDB and LanceDB to enhance information retrieval and content generation.

*  **Transformer Model Adaptation:** Use transformer models (SBERT or BERT).

*  **Prompt Engineering Comparison:** Experiment with and evaluate at least three different prompt engineering techniques to determine their impact on the model’s performance.

*  **Platform Integration:** Deploy the application on Hugging Face, with interactive access provided via Gradio or HF Spaces.

**Additional Features (Nice-to-Have)**

*  **Fine-Tuning Capabilities:** If possible, fine-tune a GPT model specific to your application’s needs, detailing the process and its impact on application performance.

*  **Streamlit Application:** Develop a Streamlit app hosted on the HF Hub, offering a richer, more interactive user experience.

**Data**

*  You may use open-source datasets or create your own data for the application.

*  Ensure that your data choice effectively demonstrates the capabilities of your application.

**Submission**

*  Create a GitHub repository specifically for this assignment.

*  Include all necessary materials, such as code, datasets, and a descriptive README.md.

*  Submissions can be individual or in groups of up to three members.

*  **Submission also via DigitalExam**, where you compile all your previous assignments and submit in one file for the overall portfolio for the module exam. You are welcome to tweak/improve previous module submissions for that.

# **Our Idea**

We want to build a chatbot that advises on the international cand.merc. master's programmes at AAU Business School:
*  Business Data Science
*  Finance
*  Innovation Management
*  International Business
*  Marketing and Sales

The chatbot will be built using RAG, Mistral 7B and ChromaDB as the vector database. Once the chatbot is built, we will test and evaluate three different prompting methods to make its answers as accurate as possible.


The data was gathered manually from the information page of each of the five programmes. In addition, we downloaded the curriculum and module descriptions for each programme in PDF format.

# **Loading libraries and packages**

In [None]:
!pip install --upgrade pip
!pip install gradio==3.50.2
!pip install -q accelerate
!pip install "pydantic>=1.9,<2.0"
!pip install -q pypdf
!pip install -qqq chromadb==0.4.10 --progress-bar off
!pip install -qqq sentence_transformers==2.2.2 --progress-bar off
!pip install -qqq langchain==0.0.299 --progress-bar off
!pip install -qqq xformers==0.0.21 --progress-bar off
!pip install -qqq tokenizers==0.14.0 --progress-bar off
!pip install -qqq optimum==1.13.1 --progress-bar off
!pip install -qqq auto-gptq==0.4.2 --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ --progress-bar off
!pip install -qqq unstructured==0.10.16 --progress-bar off


# **Loading data**

In [None]:
# Cloning our GitHub repository into the Colab file system
!git clone https://github.com/annikaijak/deeplearning_assignments

Cloning into 'deeplearning_assignments'...
remote: Enumerating objects: 275, done.
remote: Counting objects: 100% (230/230), done.
remote: Compressing objects: 100% (203/203), done.
remote: Total 275 (delta 74), reused 111 (delta 18), pack-reused 45
Receiving objects: 100% (275/275), 11.68 MiB | 18.64 MiB/s, done.
Resolving deltas: 100% (78/78), done.


In [None]:
from langchain.document_loaders import PyPDFDirectoryLoader
# Loading the PDF files from the cloned repository in the Colab file system
loader = PyPDFDirectoryLoader("/content/deeplearning_assignments/Assignment_4/PDF_Documents")
docs = loader.load()
# The loader returns one document per PDF page, so this is the total page count
len(docs)

100

# **Data preprocessing**

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Splitting the text into smaller chunks (at most 1024 characters, with 64 characters of overlap)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=64)
texts = text_splitter.split_documents(docs)
len(texts)

246
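
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping character windows. It is a simplification: `RecursiveCharacterTextSplitter` first tries to split on separators such as paragraph breaks and spaces, whereas the hypothetical `chunk_text` helper below cuts at fixed positions only.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap (illustrative only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window moves forward
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

The last three characters of each chunk reappear at the start of the next one, so a sentence cut at a chunk boundary is still partly visible in both chunks.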

## **Embeddings**

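We create sentence embeddings for the document chunks with `thenlper/gte-large`, a BERT-based embedding model, so that chunks can be compared with a query by semantic similarity. Each chunk is mapped to a 1024-dimensional vector.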

In [None]:
from langchain.embeddings import HuggingFaceEmbeddings

# Creating embeddings
embeddings = HuggingFaceEmbeddings(
    model_name="thenlper/gte-large",
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True},
)

query_result = embeddings.embed_query(texts[0].page_content)
print(len(query_result))


1024
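
The `normalize_embeddings=True` setting scales every vector to unit length. A small sketch with made-up two-dimensional vectors (not real embeddings) shows why that is convenient: for unit-length vectors the plain dot product equals the cosine similarity.

```python
import math


def dot(a, b):
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity computed on the raw, unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors chosen for easy arithmetic; real embeddings have 1024 dimensions.
a = [3.0, 4.0]
b = [4.0, 3.0]
a_n, b_n = l2_normalize(a), l2_normalize(b)

print(dot(a_n, b_n))  # dot product after normalization
print(cosine(a, b))   # cosine similarity before normalization
```

Both printed values agree (0.96 up to floating-point rounding).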


# **Integrating vector database**

In [None]:
from langchain.vectorstores import Chroma

# Saving the embeddings in the Chroma database
db = Chroma.from_documents(texts, embeddings, persist_directory="db")
results = db.similarity_search("Transformer models", k=2)
print(results[0].page_content)

data -driven  application.  
Read more about the courses (modules) in the curriculum - §18 “Overview of the 
programme” . 
Company collaboration - Study abroad
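
Conceptually, `similarity_search(query, k=2)` embeds the query, scores every stored chunk against it and returns the `k` best matches. The sketch below mimics that ranking with a hypothetical `top_k` helper and hand-written three-dimensional vectors; the vector store itself uses an index rather than a full scan, so this only illustrates the idea.

```python
def top_k(query_vec, store, k=2):
    """Return the texts of the k stored vectors with the highest dot product."""
    scored = [
        (sum(q * d for q, d in zip(query_vec, vec)), text)
        for text, vec in store
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Made-up vectors standing in for chunk embeddings.
store = [
    ("Finance curriculum", [1.0, 0.0, 0.0]),
    ("Business Data Science curriculum", [0.0, 1.0, 0.0]),
    ("Marketing and Sales curriculum", [0.0, 0.0, 1.0]),
]
query = [0.1, 0.9, 0.4]  # made-up query embedding

print(top_k(query, store, k=2))
# ['Business Data Science curriculum', 'Marketing and Sales curriculum']
```

Because the retriever only passes the top `k` chunks to the model, a query that matches poorly (as "Transformer models" does above) still returns something, just not something relevant.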


# **Loading model and setting configurations**

In [None]:
# Loading the transformer model
import torch
from langchain import HuggingFacePipeline
from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig, pipeline

MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.1"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, use_fast=True)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, trust_remote_code=True, device_map="auto"
)

# Create a configuration for text generation based on the specified model name
generation_config = GenerationConfig.from_pretrained(MODEL_NAME)

# Set the maximum number of new tokens in the generated text to 1024.
# This limits the length of the generated output to 1024 tokens.
generation_config.max_new_tokens = 1024

# Set the temperature for text generation. Lower values (e.g., 0.0001) make output more deterministic, following likely predictions.
# Higher values make the output more random.
generation_config.temperature = 0.0001

# Set the top-p sampling value. A value of 0.95 means focusing on the most likely words that make up 95% of the probability distribution.
generation_config.top_p = 0.95

# Enable text sampling. When set to True, the model randomly selects words based on their probabilities, introducing randomness.
generation_config.do_sample = True

# Set the repetition penalty. A value of 1.15 discourages the model from repeating the same words or phrases too frequently in the output.
generation_config.repetition_penalty = 1.15


# Create a text generation pipeline using the initialized model, tokenizer, and generation configuration
text_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    generation_config=generation_config,
)

# Create a LangChain pipeline that wraps the text generation pipeline and set a specific temperature for generation
llm = HuggingFacePipeline(pipeline=text_pipeline, model_kwargs={"temperature": 0})
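
The interaction between `temperature` and `top_p` can be sketched in plain Python. The functions below are illustrative helpers, not the `transformers` implementation, and the logits are made up. They suggest that with a temperature as low as 0.0001 almost all probability mass lands on the most likely token, so sampling behaves like greedy decoding even though `do_sample` is `True`.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely token indices whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
warm = softmax_with_temperature(logits, 1.0)
cold = softmax_with_temperature(logits, 0.0001)

print(top_p_filter(warm, 0.95))  # [0, 1, 2]: all three tokens stay in play
print(top_p_filter(cold, 0.95))  # [0]: only the most likely token survives
```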




# **Creating QA system**

## **First prompt**

In [None]:
from langchain.chains import RetrievalQA
from langchain import PromptTemplate
from textwrap import fill

template = """
<s>[INST] You are a student counselor at Aalborg University Business School. Answer the question at the end based on the master programs at Aalborg University Business School. The possible masters that you should recommend are: "Finance", "Business Data Science", "Marketing and Sales", "Innovation Management" and "International Business". Keep your responses concise, within 40 words.

{context}
{question}
[/INST]</s>
"""

prompt = PromptTemplate(template=template, input_variables=["context", "question"])


qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt},
)
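
With `chain_type="stuff"`, the retrieved chunks are concatenated and placed into the `{context}` slot of the prompt before the model is called. A minimal sketch of that step, assuming the chunks are joined with a blank line and using a hypothetical `stuff_prompt` helper and a shortened toy template:

```python
def stuff_prompt(template, docs, question):
    """Join the retrieved chunks and fill them into the prompt template."""
    context = "\n\n".join(docs)
    return template.format(context=context, question=question)


# Shortened stand-in for the template above; the chunk texts are placeholders.
toy_template = "[INST] Use the context to answer.\n\n{context}\n{question} [/INST]"
retrieved = ["Chunk one.", "Chunk two."]

filled = stuff_prompt(toy_template, retrieved, "Which programme fits me?")
print(filled)
```

Since all chunks go into a single prompt, `k` and `chunk_size` together determine how much of the model's context window the retrieved text consumes.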


In [None]:
result = qa_chain(
    "how do i find out what masters degree i want to study"
)
print(fill(result["result"].strip(), width=80))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


To determine which master's program you would like to study, consider your
interests and career goals. If you are interested in finance, business data
science, marketing and sales, innovation management or international business,
Aalborg University Business School offers relevant programs. To apply, log into
the Application Portal and submit your application by the deadline. Admission
requirements include meeting certain ECTS requirements. It is important to note
that some programs may have limited spots available, so applying early can
increase your chances of being accepted.


In [None]:
result = qa_chain(
    "i like working with numbers, what masters degree should i choose?"
)
print(fill(result["result"].strip(), width=80))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


If you enjoy working with numbers, I would recommend pursuing a master's degree
in Business Data Science from Aalborg University Business School. This program
focuses on applied data science, data-driven business development, and
legal/responsible data practice, providing in-depth knowledge in these areas.
Additionally, the program utilizes real-world data and cases, invites guest
lecturers, organizes company visits, and facilitates company collaboration to
ensure an authentic business problem approach. The program is a two-year,
research-based, full-time study program and entitles graduates to the Danish
designation Kandidatuddannelsen i erhvervsøkonomi, cand.merc. (business data
science), or the English designation Master of Science (MSc) in Economics and
Business Administration (Business Data Science).
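
Because the chain is created with `return_source_documents=True`, each result also carries the retrieved chunks, which helps to check whether an answer is grounded in the PDFs. The sketch below uses a hypothetical `FakeDoc` class in place of LangChain's document objects and assumes the metadata keys `source` and `page`; verify both against an actual `result["source_documents"]` before relying on them.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Stand-in for a retrieved document (hypothetical, for illustration)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def summarize_sources(result, preview_chars=40):
    """Return one line per retrieved chunk: source file, page and a short preview."""
    lines = []
    for doc in result["source_documents"]:
        source = doc.metadata.get("source", "unknown")
        page = doc.metadata.get("page", "?")
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        lines.append(f"{source} (page {page}): {preview}")
    return lines


# Made-up result dictionary with the same keys the chain returns.
fake_result = {
    "result": "Some answer.",
    "source_documents": [
        FakeDoc("First chunk\nof text.", {"source": "a.pdf", "page": 3}),
        FakeDoc("Second chunk."),
    ],
}
for line in summarize_sources(fake_result):
    print(line)
```

In the notebook, the same helper could be called as `summarize_sources(result)` after any `qa_chain(...)` call.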


## **N-shot Prompting**

Improving the prompt with N-shot Prompting:

In [None]:
template_2 = """
<s>[INST] You are a student counselor at Aalborg University Business School. Answer the question at the end based on the master programs at Aalborg University Business School. The possible masters that you should recommend are: "Finance", "Business Data Science", "Marketing and Sales", "Innovation Management" and "International Business". Keep your responses concise, within 40 words. Use the provided examples to inform your answers but do not directly mention any of the text in the examples.

N-shot Learning Examples:
Q: How do I decide on a master's degree?
A: Consider what subjects interested you most during your bachelor's and look for master's programs that offer advanced modules in those areas.

Q: I liked Applied statistics and mathematics
A: Based on your interests in Applied statistics and mathematics, it may be beneficial to consider studying Business Data Science. The curriculum for this program includes several modules that align with your interests.

{context}
Now, answer the following question based on the above guidance:
{question}
[/INST]</s>
"""

prompt_2 = PromptTemplate(template=template_2, input_variables=["context", "question"])


qa_chain_2 = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt_2},
)
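
Writing the examples directly into the template works, but keeping them in a list makes it easier to vary the number of shots when comparing prompts. A minimal sketch, assuming a hypothetical `format_examples` helper and abbreviated example answers:

```python
def format_examples(examples):
    """Render (question, answer) pairs in the Q:/A: layout used in the template."""
    return "\n\n".join(f"Q: {q}\nA: {a}" for q, a in examples)


# Abbreviated versions of the examples in the template above.
examples = [
    ("How do I decide on a master's degree?",
     "Consider what subjects interested you most during your bachelor's."),
    ("I liked Applied statistics and mathematics",
     "It may be beneficial to consider studying Business Data Science."),
]

one_shot = format_examples(examples[:1])
two_shot = format_examples(examples)
print(two_shot)
```

The resulting block can be pasted into the template in place of the hand-written examples, which keeps the 1-shot and 2-shot variants otherwise identical.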

In [None]:
result_2 = qa_chain_2(
    "How do I find out what masters degree I want to study"
)
print(fill(result_2["result"].strip(), width=80))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


To determine which master's degree you want to pursue, consider your interests
from your bachelor's studies and look for programs that offer advanced modules
in those areas. If you are unsure, you can also consult with a career counselor
or research the different programs available to see which one best aligns with
your goals and interests. It is important to note that some programs may have
limited availability and require an individual academic assessment for
admission.


In [None]:
result_2 = qa_chain_2(
    "I liked macro economics and organisation"
)
print(fill(result_2["result"].strip(), width=80))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Based on your interest in macroeconomics and organization, it may be beneficial
to consider studying International Business Economics or Innovation and
Entrepreneurship as they both include courses related to these topics. However,
if you prefer more specialized knowledge in finance, marketing, sales, or
innovation management, then Finance, Business Data Science, Marketing and Sales,
or Innovation Management would be better options for you.


## **Generated Knowledge Prompting**

We take good question-answer pairs that the model produced with the first prompt and the N-shot prompt and add them to the new prompt as examples for the model to draw on when generating answers.

In [None]:
template_3 = """
<s>[INST] You are a student counselor at Aalborg University Business School. Answer the question at the end based on the master programs at Aalborg University Business School. The possible masters that you should recommend are: "Finance", "Business Data Science", "Marketing and Sales", "Innovation Management" and "International Business". Keep your responses concise, within 40 words, and focus on the unique modules in the 1st and 2nd semesters. Use the provided examples to inform your answers but do not directly mention any of the text in the examples.

Generated knowledge prompting examples:
Q: How do I find out what masters degree I want to study?
A: To determine which master's degree you want to pursue, consider your interests, career goals, and strengths. Research different fields and programs to see which ones align with your aspirations. It may also be helpful to speak with professors or industry professionals in the field you are interested in to gain further insight. Ultimately, choose a program that will provide you with the skills and knowledge needed to achieve your desired career path.

Q: I liked macro economics and organisation
A: Based on your interests in macroeconomics and organization, it may be beneficial to consider studying International Business Economics or Innovation Management. These programs offer advanced modules in these areas and could provide a more specialized education in your field of interest.

{context}
Now, answer the following question based on the above guidance:
{question}
[/INST]</s>
"""

prompt_3 = PromptTemplate(template=template_3, input_variables=["context", "question"])


qa_chain_3 = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt_3},
)
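
As the technique is commonly described, generated knowledge prompting runs in two steps: the model is first asked to produce relevant facts, and those facts are then placed in a second prompt that asks for the answer. The notebook uses a lighter variant that reuses earlier model answers as examples. The sketch below shows the two-step form with a hypothetical `fake_llm` stub in place of the real model; the prompt wording is an assumption.

```python
def generated_knowledge_answer(llm, question):
    """Ask for supporting knowledge first, then answer with that knowledge in the prompt."""
    knowledge = llm(f"List facts that help answer this question: {question}")
    final_prompt = f"Knowledge: {knowledge}\nQuestion: {question}\nAnswer:"
    return llm(final_prompt), final_prompt


def fake_llm(prompt):
    """Stub model: returns fixed placeholder strings so the flow can be followed."""
    if prompt.startswith("List facts"):
        return "PLACEHOLDER-FACTS"
    return "PLACEHOLDER-ANSWER"


answer, final_prompt = generated_knowledge_answer(fake_llm, "Which programme fits me?")
print(final_prompt)
print(answer)
```

In the notebook, `llm` could be passed in place of `fake_llm`, at the cost of two model calls per question.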


In [None]:
result_3 = qa_chain_3(
    "How do I find out what masters degree I want to study"
)
print(fill(result_3["result"].strip(), width=80))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


To find out which master's degree you want to study, consider your interests,
career goals, and strengths. Research different fields and programs to see which
ones align with your aspirations. Speak with professors or industry
professionals in the field you are interested in to gain further insight. Choose
a program that provides the skills and knowledge needed to achieve your desired
career path.


In [None]:
result_3 = qa_chain_3(
    "I liked statistics and applied mathematics"
)
print(fill(result_3["result"].strip(), width=80))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Based on your interests in statistics and applied mathematics, it may be
beneficial to consider studying Finance or Business Data Science. These programs
offer advanced modules in these areas and could provide a more specialized
education in your field of interest. Additionally, they provide hands-on
experience working with industry-standard programming languages and platforms,
exploring approaches to prototype development and deployment of data-driven
applications. This exposure to common business analytics applications as well as
state-of-the-art artificial intelligence techniques used across various
industries could help you understand how data science techniques can be used
both within organizations and to develop new businesses.


In [None]:
result_3 = qa_chain_3(
    "I liked macro economics and organisation"
)
print(fill(result_3["result"].strip(), width=80))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Based on your interests in macroeconomics and organization, it may be beneficial
to consider studying International Business Economics or Innovation Management.
These programs offer advanced modules in these areas and could provide a more
specialized education in your field of interest.


## **Final Instruction Prompt**
We combine the N-shot and Generated Knowledge examples into our final instruction prompt.

In [None]:
template_4 = """
<s>[INST] You are a student counselor at Aalborg University Business School. Answer the question at the end based on the master programs at Aalborg University Business School. The possible masters that you should recommend are: "Finance", "Business Data Science", "Marketing and Sales", "Innovation Management" and "International Business". Keep your responses concise, within 40 words, and focus on the unique modules in the 1st and 2nd semesters. Use the provided examples to inform your answers but do not directly mention any of the text in the examples.

N-shot Learning Examples:
Q: How do I decide on a master's degree?
A: Consider what subjects interested you most during your bachelor's and look for master's programs that offer advanced modules in those areas.

Q: I liked Applied statistics and mathematics
A: Based on your interests in Applied statistics and mathematics, it may be beneficial to consider studying Business Data Science. The curriculum for this program includes several modules that align with your interests.

Generated knowledge prompting examples:
Q: How do I find out what masters degree I want to study?
A: To determine which master's degree you want to pursue, consider your interests, career goals, and strengths. Research different fields and programs to see which ones align with your aspirations. It may also be helpful to speak with professors or industry professionals in the field you are interested in to gain further insight. Ultimately, choose a program that will provide you with the skills and knowledge needed to achieve your desired career path.

Q: I liked macro economics and organisation
A: Based on your interests in macroeconomics and organization, it may be beneficial to consider studying International Business Economics or Innovation Management. These programs offer advanced modules in these areas and could provide a more specialized education in your field of interest.

{context}
Now, answer the following question based on the above guidance:
{question}
[/INST]</s>

"""

prompt_4 = PromptTemplate(template=template_4, input_variables=["context", "question"])


qa_chain_4 = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(search_kwargs={"k": 2}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": prompt_4},
)

In [None]:
result_4 = qa_chain_4(
    "How do I find out what masters degree I want to study"
)
print(fill(result_4["result"].strip(), width=80))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


To find out what master's degree you want to study, consider your interests,
career goals, and strengths. Research different fields and programs to see which
ones align with your aspirations. It may also be helpful to speak with
professors or industry professionals in the field you are interested in to gain
further insight. Choose a program that will provide you with the skills and
knowledge needed to achieve your desired career path.


In [None]:
result_4 = qa_chain_4(
    "I liked macro economics and organisation"
)
print(fill(result_4["result"].strip(), width=80))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


Based on your interests in macroeconomics and organization, it may be beneficial
to consider studying International Business Economics or Innovation Management.
These programs offer advanced modules in these areas and could provide a more
specialized education in your field of interest.
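
To compare the prompts more systematically, every chain can be run on the same questions and the answers checked against the 40-word limit that each prompt requests (several of the answers above look longer than that). The sketch below uses a hypothetical `compare_chains` helper and stub chains that return fixed text; it assumes each chain is callable with a question and returns a dictionary with a `"result"` key, as the chains above do.

```python
def compare_chains(chains, questions, word_limit=40):
    """Run every question through every chain and record the answer length."""
    rows = []
    for name, chain in chains.items():
        for question in questions:
            answer = chain(question)["result"].strip()
            n_words = len(answer.split())
            rows.append({
                "prompt": name,
                "question": question,
                "words": n_words,
                "within_limit": n_words <= word_limit,
            })
    return rows


# Stub chains with fixed output, standing in for qa_chain ... qa_chain_4.
def short_chain(question):
    return {"result": "Consider Business Data Science."}


def long_chain(question):
    return {"result": " ".join(["word"] * 55)}


rows = compare_chains(
    {"short": short_chain, "long": long_chain},
    ["I like working with numbers"],
)
for row in rows:
    print(row)
```

In the notebook, the call could look like `compare_chains({"first": qa_chain, "n_shot": qa_chain_2, "generated": qa_chain_3, "final": qa_chain_4}, questions)` with a shared list of test questions.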


# **Gradio Interface**

In [None]:
import gradio as gr

In [None]:
import time

In [None]:
bot_name = "Master Supervisor"

with gr.Blocks() as demo:
    gr.Markdown("### Master's Degree Program Advisor")
    gr.Markdown("I can help you find the master's degree program that's right for you. Ask me any question related to choosing a master's program.")

    chatbot = gr.Chatbot()
    msg = gr.Textbox()
    clear = gr.ClearButton([msg, chatbot])

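    def reply_bot_1(message, chat_history):
        # Query the final RAG chain and append the (user, bot) turn to the history
        bot_result = qa_chain_4(message)
        chat_history.append((message, bot_result["result"].strip()))
        # Brief artificial delay before the reply is returned
        time.sleep(2)
        # Return an empty string to clear the textbox, plus the updated history
        return "", chat_history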

    msg.submit(reply_bot_1, [msg, chatbot], [msg, chatbot])

demo.queue().launch(share=True)

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://1eb955878b3f8791a9.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
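
The reply handler can also be written so that it does not depend on a global chain, which makes it possible to test the chat logic without launching Gradio or loading the model. A minimal sketch, assuming a hypothetical `make_reply` factory and a stub chain:

```python
def make_reply(chain):
    """Build a Gradio-style handler bound to the given chain."""
    def reply(message, chat_history):
        answer = chain(message)["result"].strip()
        chat_history.append((message, answer))
        # An empty string clears the textbox; the history updates the chat window.
        return "", chat_history
    return reply


def stub_chain(message):
    """Stub standing in for qa_chain_4; returns padded text like a raw model output."""
    return {"result": "  Consider Business Data Science.  "}


reply = make_reply(stub_chain)
cleared, history = reply("I like numbers", [])
print(cleared, history)
```

In the notebook, `msg.submit(make_reply(qa_chain_4), [msg, chatbot], [msg, chatbot])` would wire the same handler to the interface.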


