# Naive RAG

- Below is a simple application that uses language models to answer a user's question from a knowledge base.
- Your task is to fill in the bodies of two functions, `find_related_docs` and `answer_question`.
- The program runs a semantic search: it uses language model embeddings to find the documents in the knowledge base most similar to the user query, and a second model then summarizes the top results into a short, fluent answer.
- We are interested in your thought process; you don't need to know how to write the code immediately.


# Set up

In [None]:
# Run this in case you need to install the lib in the notebook
!pip install transformers sentence-transformers

In [62]:
from typing import List
import numpy as np

from sentence_transformers import SentenceTransformer, util
from transformers import pipeline

In [63]:
print("initializing embedding model..")
embedding_model = SentenceTransformer('sentence-transformers/multi-qa-mpnet-base-dot-v1')

print("initializing summarization model..")
summarizer = pipeline("summarization", model="philschmid/bart-large-cnn-samsum")

initializing embedding model..
initializing summarization model..


In [64]:
# Define the knowledge base
knowledge_base = ["Today is sunny weather.",
                  "Pasta is traditional Italian food.",
                  "I love large language models.",
                  "This technical interview is extra complex.",
                  "Workday is the best company and I would like to work here!"]

print("generating document embeddings...")
knowledge_base_embeddings = embedding_model.encode(knowledge_base, show_progress_bar=True)

generating document embeddings...



In [65]:
print(f"Embedding size: {knowledge_base_embeddings.shape}")
print("Embedding example: ")
knowledge_base_embeddings[0:2]

Embedding size: (5, 768)
Embedding example: 


array([[-0.28737584, -0.28308612, -0.34415117, ...,  0.09670485,
        -0.11437837, -0.06926885],
       [-0.3482608 , -0.14685202, -0.2589665 , ..., -0.12266742,
        -0.04499736, -0.27315393]], dtype=float32)

In [66]:
len(knowledge_base_embeddings[0])

768
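
Each document is now a 768-dimensional float vector, so comparing a query with a document reduces to vector arithmetic. The following is a minimal sketch on hand-made 2-dimensional toy vectors (not real model output) contrasting dot-product and cosine scoring. The helper names `dot_scores` and `cosine_scores` are hypothetical. The model name `multi-qa-mpnet-base-dot-v1` suggests it was tuned for dot-product scoring, which is assumed here rather than verified.

```python
import numpy as np


def dot_scores(doc_vectors: np.ndarray, query_vector: np.ndarray) -> np.ndarray:
    """Raw dot product: sensitive to both direction and vector length."""
    return doc_vectors @ query_vector


def cosine_scores(doc_vectors: np.ndarray, query_vector: np.ndarray) -> np.ndarray:
    """Dot product of length-normalized vectors: sensitive to direction only."""
    doc_norms = np.linalg.norm(doc_vectors, axis=1)
    query_norm = np.linalg.norm(query_vector)
    return (doc_vectors @ query_vector) / (doc_norms * query_norm)


# Toy data (assumption): three "documents" and one "query" in 2 dimensions
toy_docs = np.array([[1.0, 0.0], [3.0, 3.0], [0.0, 2.0]])
toy_query = np.array([1.0, 0.0])

print("dot:   ", dot_scores(toy_docs, toy_query))
print("cosine:", cosine_scores(toy_docs, toy_query))
```

On this toy data the two measures disagree about the best match: the long vector `[3, 3]` wins under the dot product, while the perfectly aligned `[1, 0]` wins under cosine similarity. The scoring function should therefore match the one the embedding model was trained with.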

# Part 1 - Find Relevant Sentences

The goal of this task is to implement a function that finds the documents in the knowledge base most similar to the query.

In [67]:
def find_related_docs(
    knowledge_base: List[str],
    knowledge_base_embeddings: np.ndarray,
    query: str,
    number_of_matches: int,
) -> List[str]:
    """
    Find most similar documents from knowledge base

    Params:
    * knowledge_base - documents
    * knowledge_base_embeddings - generated embeddings of knowledge_base
    * query - sentence to match against the knowledge base
    * number_of_matches - number of most similar documents to be returned
    """

    query_embed = embedding_model.encode(query)
    similarities = np.dot(knowledge_base_embeddings, query_embed)

    # Alternative: np.argpartition selects the top matches without a full sort,
    # but returns them in no guaranteed order.
    #top_indices = np.argpartition(similarities, -number_of_matches)[-number_of_matches:]
    #top_docs = [knowledge_base[idx] for idx in top_indices]

    # Pair each document with its similarity score
    doc_score_pairs = list(zip(knowledge_base, similarities))

    # Sort by decreasing score and keep the best matches
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    top_docs = [doc for doc, _score in doc_score_pairs[:number_of_matches]]

    return top_docs

# Unit test
expected = [
    'Workday is the best company and I would like to work here!',
    'This technical interview is extra complex.'
]

assert expected == find_related_docs(
    knowledge_base,
    knowledge_base_embeddings,
    "I'm currently interviewing with Workday.",
    2
)
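
Sorting every document is fine for five sentences, but it does more work than needed on a large knowledge base. One possible variant, sketched below on toy scores, uses `np.argpartition` to isolate the best `k` candidates and then sorts only those, so the result is still ordered from best to worst. The helper name `top_k_indices` is hypothetical and not part of the exercise.

```python
import numpy as np


def top_k_indices(scores: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k highest scores, ordered best first."""
    k = min(k, len(scores))
    if k <= 0:
        return np.array([], dtype=int)
    # argpartition moves the k largest scores into the last k slots, unordered
    candidates = np.argpartition(scores, -k)[-k:]
    # sort only those k candidates by descending score
    return candidates[np.argsort(scores[candidates])[::-1]]


# Toy similarity scores (assumption), one per document
toy_scores = np.array([0.1, 0.9, 0.4, 0.7, 0.2])
print(top_k_indices(toy_scores, 2))
```

The returned indices can then be used to look up documents, for example `[knowledge_base[i] for i in top_k_indices(similarities, number_of_matches)]`.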

# Part 2 - Summarize the Documents

Using the function `find_related_docs`, write a function that
1. finds the documents most similar to a user query,
2. summarizes them into an answer.
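
The two steps compose directly: retrieve, join the retrieved text into one context, then summarize. The sketch below shows only that control flow. The retriever and summarizer are passed in as plain callables, and `answer_with`, `stub_retrieve`, and `stub_summarize` are hypothetical stand-ins (a keyword matcher and a first-sentence extractor), not the notebook's models.

```python
from typing import Callable, List


def answer_with(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    summarize: Callable[[str], str],
    number_of_matches: int = 3,
) -> str:
    """Retrieve the top documents, join them, and summarize the result."""
    top_docs = retrieve(query, number_of_matches)
    context = " ".join(top_docs)
    return summarize(context)


def stub_retrieve(query: str, k: int) -> List[str]:
    """Stand-in retriever: rank two toy documents by keyword overlap."""
    docs = ["Workday builds HR software.", "SAP builds ERP software."]
    terms = {word.strip("?.,!").lower() for word in query.split()}
    ranked = sorted(docs, key=lambda d: -sum(t in d.lower() for t in terms))
    return ranked[:k]


def stub_summarize(text: str) -> str:
    """Stand-in summarizer: keep only the first sentence."""
    return text.split(". ")[0]


print(answer_with("Tell me about Workday", stub_retrieve, stub_summarize, 1))
```

Keeping the two stages behind simple callables makes the flow easy to test without downloading any model; in the notebook the same roles are played by `find_related_docs` and `summarizer`.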

In [68]:
# New knowledge base better suited to summarization
knowledge_base = [
    "Workday, Inc. is a renowned company specializing in cloud-based software solutions primarily for human resources and finance. The company delivers applications related to financial management, human capital management, planning, and analytics, helping organizations to adapt and thrive in the changing business landscape. Workday’s solutions are designed to enable a range of business processes, fostering efficiency, and supporting decision-making within enterprises.",
    "The company is recognized for its commitment to innovation, with a significant focus on user-friendliness and providing a seamless experience across various devices. Workday has positioned itself as a competitor to traditional enterprise resource planning (ERP) providers, offering scalable and versatile solutions that cater to the needs of both large enterprises and mid-sized businesses.",
    "SAP SE is a global leader in enterprise software solutions, headquartered in Germany. The company develops software to manage business operations and customer relations seamlessly. SAP is especially known for its enterprise resource planning (ERP) software, which helps organizations integrate various business processes and functions across different departments, facilitating smoother operations and informed decision-making.",
    "SAP's extensive portfolio also includes solutions in areas like human capital management, business intelligence, and supply chain management. These solutions are used by businesses of all sizes across various industries to optimize their operations and drive innovation. SAP continues to evolve its offerings, integrating advanced technologies like AI and machine learning, to meet the ever-changing needs of businesses.",
    "Oracle Corporation is a multinational computer technology company that offers software, cloud solutions, hardware, and services. The company is best known for its focus on databases, but it provides a broad range of software and hardware systems and services, including its comprehensive suite of application, server, and storage solutions. Oracle’s products are designed to help businesses improve their operational efficiency, innovate faster, and improve their bottom line.",
    "In addition to its flagship product, Oracle Database, the company has a diverse portfolio of applications, platforms, and infrastructure solutions. Oracle’s cloud services and license support segment offer a variety of cloud-based and license support services, including Oracle Cloud Services and SaaS applications. The company's solutions serve the needs of a diverse clientele, ranging from small businesses to the largest global enterprises."
]

print("generating document embeddings...")
knowledge_base_embeddings = embedding_model.encode(knowledge_base, show_progress_bar=True)

generating document embeddings...



In [81]:
def answer_question(
    knowledge_base: List[str],
    knowledge_base_embeddings: np.ndarray,
    query: str,
) -> str:
    """
    Answer user question defined in query using the knowledge base.

    Params:
    * knowledge_base - documents
    * knowledge_base_embeddings - generated embeddings of knowledge_base
    * query - question to be answered using knowledge from the available documents
    """

    top_docs = find_related_docs(
        knowledge_base,
        knowledge_base_embeddings,
        query,
        3
    )

    final_answer = summarizer(' '.join(top_docs))[0]["summary_text"]

    ###########################################
    ########### Write your code here  #########
    ###########################################

    return final_answer


result = answer_question(
    knowledge_base,
    knowledge_base_embeddings,
    "What do you know about Workday?"
)
assert isinstance(result, str)
assert "Workday" in result

In [82]:
result

'Workday, Inc. is a renowned company that provides cloud-based software solutions for human resources and finance. Workday has positioned itself as a competitor to traditional ERP providers. Oracle Corporation is a multinational computer technology company that offers software, cloud solutions, hardware, and services.'
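
The summary ends with a sentence about Oracle. Only two of the six documents discuss Workday, yet `answer_question` always retrieves three, so at least one unrelated document reaches the summarizer. One possible mitigation, sketched here on toy scores, is to keep only documents whose score is close to the best one. The helper `select_close_matches` and the `max_gap` value are hypothetical; a sensible gap depends on the score scale of the embedding model and would need tuning.

```python
from typing import List, Sequence

import numpy as np


def select_close_matches(
    docs: List[str],
    scores: Sequence[float],
    max_docs: int = 3,
    max_gap: float = 5.0,
) -> List[str]:
    """Keep up to max_docs documents scoring within max_gap of the best score."""
    if len(docs) == 0:
        return []
    score_array = np.asarray(scores, dtype=float)
    # Indices of the max_docs highest scores, best first
    order = np.argsort(score_array)[::-1][:max_docs]
    best_score = score_array[order[0]]
    return [docs[i] for i in order if best_score - score_array[i] <= max_gap]


# Toy documents and scores (assumption), not real model output
toy_docs = ["Workday overview", "Workday positioning", "Oracle overview", "SAP overview"]
toy_scores = [21.0, 19.5, 9.0, 8.0]
print(select_close_matches(toy_docs, toy_scores))
```

With these toy numbers only the two Workday entries survive, because the third-best score falls far below the top score.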