
# How to use functions with a knowledge base

This notebook builds on the concepts in the [argument generation](How_to_call_functions_with_chat_models.ipynb) notebook by creating an agent with access to a knowledge base and two functions that it can call depending on what the user asks for.

We'll create an agent that uses data from arXiv to answer questions about academic subjects. It has two functions at its disposal:
- **get_articles**: A function that gets arXiv articles on a subject and summarizes them for the user with links.
- **read_article_and_summarize**: This function takes one of the previously searched articles, reads it in its entirety and summarizes the core argument, evidence and conclusions.

This will get you comfortable with a multi-function workflow that can choose from multiple services, and where some of the data from the first function is persisted to be used by the second.

## Walkthrough

This cookbook takes you through the following workflow:

- **Search utilities:** Creating the two functions that access arXiv for answers.
- **Configure Agent:** Building up the Agent behaviour that will assess the need for a function and, if one is required, call that function and present results back to the agent.
- **arXiv conversation:** Putting all of this together in a live conversation.


In [1]:
!pip install scipy
!pip install tenacity
!pip install tiktoken==0.3.3
!pip install termcolor
!pip install openai
!pip install requests
!pip install arxiv
!pip install pandas
!pip install PyPDF2
!pip install tqdm


In [2]:
import arxiv
import ast
import concurrent.futures
from csv import writer
from IPython.display import display, Markdown, Latex
import json
import openai
import os
import pandas as pd
from PyPDF2 import PdfReader
import requests
from scipy import spatial
from tenacity import retry, wait_random_exponential, stop_after_attempt
import tiktoken
from tqdm import tqdm
from termcolor import colored
import time

GPT_MODEL = 'gpt-3.5-turbo-16k-0613'
EMBEDDING_MODEL = "text-embedding-ada-002"
# Read the API key from the environment rather than hardcoding a secret in the notebook
openai.api_key = os.getenv("OPENAI_API_KEY")


## Search utilities

We'll first set up some utilities that will underpin our two functions.

Downloaded papers will be stored in a directory (we use ```./data/papers``` here). We create a file ```arxiv_library.csv``` to store the title, file path, and embedding of each downloaded paper, so that ```summarize_text``` can later retrieve the paper closest to a query.
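
To make the layout of the library file concrete, here is a minimal sketch of one row being appended and read back. It assumes the three-field layout that `get_articles` writes (title, file path, embedding); the file name `demo_library.csv` and the toy three-dimensional embedding are placeholders, not values from this notebook.

```python
import ast
import csv
import os
import tempfile

# Hypothetical demo file, kept separate from the real arxiv_library.csv
demo_path = os.path.join(tempfile.mkdtemp(), "demo_library.csv")

# One row per paper: title, local PDF path, embedding (toy values)
row = ["A toy paper", "./data/papers/toy.pdf", [0.1, 0.2, 0.3]]

with open(demo_path, "a", newline="") as f:
    csv.writer(f).writerow(row)

# The embedding is stored as the string form of a Python list,
# so it has to be parsed back with ast.literal_eval after reading
with open(demo_path, newline="") as f:
    title, filepath, embedding_text = next(csv.reader(f))

embedding = ast.literal_eval(embedding_text)
print(title, filepath, embedding)
```

Storing the embedding as text keeps the library in a single human-readable file, at the cost of a parsing step each time it is loaded.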

In [65]:
# Directory for downloaded papers and the CSV file that indexes them.
# Both are defined at module level because get_articles and summarize_text use them.
data_dir = os.path.join(os.curdir, "data", "papers")
paper_dir_filepath = "./data/arxiv_library.csv"


def folder_setup():
    """Create the papers directory and an empty library file if they are missing."""
    os.makedirs(data_dir, exist_ok=True)

    # Only generate a blank dataframe if the file does not exist
    if not os.path.exists(paper_dir_filepath):
        pd.DataFrame(list()).to_csv(paper_dir_filepath)


folder_setup()


In [53]:
@retry(wait=wait_random_exponential(min=1, max=40), stop=stop_after_attempt(3))
def embedding_request(text):
    response = openai.Embedding.create(input=text, model=EMBEDDING_MODEL)
    return response


def get_articles(query, library=paper_dir_filepath, top_k=5):
    """This function gets the top_k articles based on a user's query, sorted by relevance.
    It also downloads the files and stores them in arxiv_library.csv to be retrieved by the read_article_and_summarize.
    """
    print("The query sent to arXiv is:", query)
    search = arxiv.Search(
        query=query, max_results=top_k, sort_by=arxiv.SortCriterion.Relevance
    )
    result_list = []
    for result in search.results():
        result_dict = {}
        result_dict.update({"title": result.title})
        result_dict.update({"summary": result.summary})

        # Taking the first url provided
        result_dict.update({"article_url": [x.href for x in result.links][0]})
        result_dict.update({"pdf_url": [x.href for x in result.links][1]})
        result_list.append(result_dict)

        # Combine the title and summary
        combined_text = result.title + ' ' + result.summary

        # Request the embedding for the combined text instead of just the title
        response = embedding_request(text=combined_text)

        # Store references in library file
        try:
            file_reference = [
                result.title,
                result.download_pdf(data_dir),
                response["data"][0]["embedding"],
            ]
            # Write to file; the with block closes the file on exit
            with open(library, "a", newline="") as f_object:
                writer_object = writer(f_object)
                writer_object.writerow(file_reference)
            time.sleep(5)  # wait 5 seconds before the next download
        except Exception as e:
            print(f"Failed to download {result.title}. Error: {e}")
    return result_list

In [None]:
# Test that the search is working
result_output = get_articles("terapia cognitivo comportamental")
result_output[0]


{'title': 'Hacia una teoria de unificacion para los comportamientos cognitivos',
 'summary': "Each cognitive science tries to understand a set of cognitive behaviors. The\nstructuring of knowledge of this nature's aspect is far from what it can be\nexpected about a science. Until now universal standard consistently describing\nthe set of cognitive behaviors has not been found, and there are many questions\nabout the cognitive behaviors for which only there are opinions of members of\nthe scientific community. This article has three proposals. The first proposal\nis to raise to the scientific community the necessity of unified the cognitive\nbehaviors. The second proposal is claim the application of the Newton's\nreasoning rules about nature of his book, Philosophiae Naturalis Principia\nMathematica, to the cognitive behaviors. The third is to propose a scientific\ntheory, currently developing, that follows the rules established by Newton to\nmake sense of nature, and could be the theor

In [5]:
def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 100,
) -> list[str]:
    """Returns the filepaths in df, sorted from most related to the query to least."""
    query_embedding_response = embedding_request(query)
    query_embedding = query_embedding_response["data"][0]["embedding"]
    strings_and_relatednesses = [
        (row["filepath"], relatedness_fn(query_embedding, row["embedding"]))
        for i, row in df.iterrows()
    ]
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    # Keep only the filepaths; the relatedness scores are needed for sorting alone
    return [string for string, _ in strings_and_relatednesses[:top_n]]
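
The ranking step can be seen in isolation with toy vectors. The sketch below is an illustration only: it replaces the API embeddings with hand-written two-dimensional vectors, computes cosine similarity in pure Python instead of calling `scipy.spatial.distance.cosine`, and uses invented file names and a hypothetical helper `rank_by_relatedness`.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity, equivalent to 1 - scipy.spatial.distance.cosine(a, b)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_relatedness(query_embedding, library, top_n=2):
    """Return the filepaths in library, most related to the query first."""
    scored = [
        (filepath, cosine_similarity(query_embedding, embedding))
        for filepath, embedding in library
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [filepath for filepath, _ in scored[:top_n]]


# Toy library: (filepath, embedding) pairs with invented values
toy_library = [
    ("physics.pdf", [1.0, 0.0]),
    ("biology.pdf", [0.0, 1.0]),
    ("biophysics.pdf", [1.0, 1.0]),
]

ranked = rank_by_relatedness([0.9, 0.1], toy_library)
print(ranked)
```

The query vector points mostly along the first axis, so `physics.pdf` ranks first and `biophysics.pdf` second.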


In [60]:
def read_pdf(filepath):
    """Takes a filepath to a PDF and returns a string of the PDF's contents"""
    # creating a pdf reader object
    reader = PdfReader(filepath)
    pdf_text = ""
    page_number = 0
    for page in reader.pages:
        page_number += 1
        pdf_text += page.extract_text() + f"\nPage Number: {page_number}"
    return pdf_text


# Split a text into smaller chunks of size n, preferably ending at the end of a sentence
def create_chunks(text, n, tokenizer):
    """Returns successive n-sized chunks from provided text."""
    tokens = tokenizer.encode(text)
    i = 0
    while i < len(tokens):
        # Find the nearest end of sentence within a range of 0.5 * n and 1.5 * n tokens
        j = min(i + int(1.5 * n), len(tokens))
        while j > i + int(0.5 * n):
            # Decode the tokens and check for full stop or newline
            chunk = tokenizer.decode(tokens[i:j])
            if chunk.endswith(".") or chunk.endswith("\n"):
                break
            j -= 1
        # If no end of sentence found, use n tokens as the chunk size
        if j == i + int(0.5 * n):
            j = min(i + n, len(tokens))
        yield tokens[i:j]
        i = j


def extract_chunk(content, template_prompt):
    """This function applies a prompt to some input content. In this case it returns a summarized chunk of text"""
    prompt = template_prompt + content
    response = openai.ChatCompletion.create(
        model=GPT_MODEL, messages=[{"role": "user", "content": prompt}], temperature=0
    )
    return response["choices"][0]["message"]["content"]

# Summarize the top ranked PDF file, based on title + summary.
def summarize_text(query):
    """This function does the following:
    - Reads in the arxiv_library.csv file, including the embeddings
    - Finds the closest file to the user's query
    - Scrapes the text out of the file and chunks it
    - Summarizes each chunk in parallel
    - Does one final summary and returns this to the user"""

    #Check if folders exist and create them if not
    folder_setup()
    # A prompt to dictate how the recursive summarizations should approach the input paper
    summary_prompt = """Summarize this text from an academic paper. Extract any key points with reasoning.\n\nContent:"""

    # If the library is empty (no searches have been performed yet), we perform one and download the results
    library_df = pd.read_csv(paper_dir_filepath).reset_index()
    if len(library_df) == 0:
        print("No papers searched yet, downloading first.")
        get_articles(query)
        print("Papers downloaded, continuing")
        library_df = pd.read_csv(paper_dir_filepath).reset_index()
    else:
        print("Search already done in the past, continuing")
    library_df.columns = ["title", "filepath", "embedding"]
    library_df["embedding"] = library_df["embedding"].apply(ast.literal_eval)
    strings = strings_ranked_by_relatedness(query, library_df, top_n=1)
    print("Chunking text from paper")
    pdf_text = read_pdf(strings[0])

    # Initialise tokenizer
    tokenizer = tiktoken.get_encoding("cl100k_base")
    results = ""

    # Chunk up the document into 1500 token chunks
    chunks = create_chunks(pdf_text, 1500, tokenizer)
    text_chunks = [tokenizer.decode(chunk) for chunk in chunks]
    print("Summarizing each chunk of text")

    # Parallel process the summaries
    with concurrent.futures.ThreadPoolExecutor(
        max_workers=len(text_chunks)
    ) as executor:
        futures = [
            executor.submit(extract_chunk, chunk, summary_prompt)
            for chunk in text_chunks
        ]
        with tqdm(total=len(text_chunks)) as pbar:
            for _ in concurrent.futures.as_completed(futures):
                pbar.update(1)
        for future in futures:
            data = future.result()
            results += data

    # Final summary
    print("Summarizing into overall summary")
    response = openai.ChatCompletion.create(
        model=GPT_MODEL,
        messages=[
            {
                "role": "user",
                "content": f"""Write a summary, as if for a teenage layperson, collated from this collection of key points extracted from an academic paper.
                        The summary should highlight the core argument, the conclusions, and any interesting data, and answer the user's query.
                        User query: {query}
                        The summary should be structured in bulleted lists following the headings Core Argument, Evidence, Interesting Points, and Conclusion.
                        Key points:\n{results}\nSummary:\n""",
            }
        ],
        temperature=0,
    )
    return response
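
The parallel summarize-then-combine pattern used by `summarize_text` can be tried without any API calls. This is a minimal sketch under stated assumptions: `summarize_chunk` is a stand-in that keeps the first sentence of each chunk, and the final "reduce" step is a plain join rather than a second model call. Both names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize_chunk(chunk):
    """Stand-in for a model call: keep only the first sentence of the chunk."""
    return chunk.split(".")[0].strip() + "."


def map_reduce_summary(chunks, max_workers=4):
    """Summarize chunks in parallel, then combine the partial summaries in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in input order, whatever order they finish in
        partial_summaries = list(executor.map(summarize_chunk, chunks))
    # The notebook sends the combined text to the model for a final summary;
    # here the reduce step is a plain join
    return " ".join(partial_summaries)


chunks = [
    "Papers are split into chunks. Each chunk fits the context window.",
    "Chunks are summarized independently. This step runs in parallel.",
]
summary = map_reduce_summary(chunks)
print(summary)
```

Capping `max_workers` at a fixed number, rather than one thread per chunk, is one way to avoid bursts of simultaneous requests on long papers.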


In [64]:
# Test the summarize_text function works
chat_test_response = summarize_text("Terapia cognitivo comportamental")


No papers searched yet, downloading first.
The query sent to arXiv is: Terapia cognitivo comportamental
Papers downloaded, continuing
Chunking text from paper
Summarizing each chunk of text


100%|██████████| 20/20 [00:08<00:00,  2.28it/s]


Summarizing into overall summary


KeyboardInterrupt: ignored

In [11]:
print(chat_test_response["choices"][0]["message"]["content"])


- The paper discusses the need for a unification theory for cognitive behaviors.
- It proposes three main ideas: the scientific community should seriously consider the problem of unifying cognitive behaviors; Newton's rules for reasoning about nature should be applied to the study of cognitive behaviors; and a scientific theory should be developed that follows Newton's rules and can explain all cognitive behaviors.
- It highlights the importance of unification in scientific theories and the interdisciplinary nature of the study of cognitive behaviors.
- It discusses the history and development of Artificial Intelligence (AI) and its quest to achieve human intelligence in computers.
- It mentions the need to unify different theories of human behavior and proposes using computers to test the success of the unification.
- It highlights the importance of unifying neurobiology and psychiatry in the field of cognitive science.
- It presents the

## Configure Agent

We'll create our agent in this step, including a ```Conversation``` class to support multiple turns with the API, and some Python functions to enable interaction between the ```ChatCompletion``` API and our knowledge base functions.

In [14]:
@retry(wait=wait_random_exponential(min=1, max=40), stop=stop_after_attempt(3))
def chat_completion_request(messages, functions=None, model=GPT_MODEL):
    headers = {
        "Content-Type": "application/json",
        "Authorization": "Bearer " + openai.api_key,
    }
    json_data = {"model": model, "messages": messages}
    if functions is not None:
        json_data.update({"functions": functions})
    try:
        response = requests.post(
            "https://api.openai.com/v1/chat/completions",
            headers=headers,
            json=json_data,
        )
        return response
    except Exception as e:
        print("Unable to generate ChatCompletion response")
        print(f"Exception: {e}")
        return e
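
For readers unfamiliar with the `@retry` decorator, the following sketch shows the general idea of retrying with randomized exponential backoff in plain Python. It is a simplified illustration, not tenacity's implementation: the helper `retry_with_backoff` and its parameters are hypothetical, and it re-raises the original exception after the last attempt.

```python
import random
import time


def retry_with_backoff(func, max_attempts=3, min_wait=1.0, max_wait=40.0, sleep=time.sleep):
    """Call func until it succeeds or max_attempts is reached.

    After each failure, wait a random time whose upper bound doubles, capped at max_wait.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise
            upper = min(max_wait, min_wait * 2 ** attempt)
            sleep(random.uniform(min_wait, upper))


calls = {"count": 0}


def flaky_request():
    """Fails twice, then succeeds, to imitate a transient API error."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


# Pass a no-op sleep so the demo finishes instantly
result = retry_with_backoff(flaky_request, sleep=lambda seconds: None)
print(result, calls["count"])
```

The random component spreads retries out so that many clients hitting a rate limit at once do not all retry at the same moment.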


In [15]:
class Conversation:
    def __init__(self):
        self.conversation_history = []

    def add_message(self, role, content):
        message = {"role": role, "content": content}
        self.conversation_history.append(message)

    def display_conversation(self, detailed=False):
        role_to_color = {
            "system": "red",
            "user": "green",
            "assistant": "blue",
            "function": "magenta",
        }
        for message in self.conversation_history:
            print(
                colored(
                    f"{message['role']}: {message['content']}\n\n",
                    role_to_color[message["role"]],
                )
            )

In [17]:
# Initiate our get_articles and read_article_and_summarize functions
arxiv_functions = [
    {
        "name": "get_articles",
        "description": """Use this function to get academic papers from arXiv to answer user questions.""",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": f"""
                            User query in JSON. Responses should be summarized and should include the article URL reference
                            """,
                }
            },
            "required": ["query"],
        },
    },
    {
        "name": "read_article_and_summarize",
        "description": """Use this function to read whole papers and provide a summary for users.
        You should NEVER call this function before get_articles has been called in the conversation.""",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": f"""
                            Description of the article in plain text based on the user's query
                            """,
                }
            },
            "required": ["query"],
        },
    }
]
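
Each function specification needs its own dictionary in the list, because a Python dict literal silently keeps only the last value for a repeated key. A small sanity check can catch a missing or duplicated name before the list is sent to the API. The helper `check_specs` and the trimmed-down specs below are hypothetical and only mirror the shape of `arxiv_functions`.

```python
def function_names(function_specs):
    """Return the names declared in a list of function specifications."""
    return [spec["name"] for spec in function_specs]


def check_specs(function_specs, expected_names):
    """Raise ValueError if an expected function is missing or a name is repeated."""
    names = function_names(function_specs)
    if len(names) != len(set(names)):
        raise ValueError(f"Duplicate function names: {names}")
    missing = set(expected_names) - set(names)
    if missing:
        raise ValueError(f"Missing function specs: {sorted(missing)}")
    return names


# Trimmed-down specs with the same shape as arxiv_functions
toy_specs = [
    {"name": "get_articles", "parameters": {"type": "object", "properties": {}}},
    {"name": "read_article_and_summarize", "parameters": {"type": "object", "properties": {}}},
]

checked = check_specs(toy_specs, ["get_articles", "read_article_and_summarize"])
print(checked)

# A repeated key inside ONE dict literal keeps only the last value,
# so two functions merged into one dict would expose a single name
merged = {"name": "get_articles", "name": "read_article_and_summarize"}
print(merged["name"])
```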


In [18]:
def chat_completion_with_function_execution(messages, functions=None):
    """This function makes a ChatCompletion API call with the option of adding functions"""
    response = chat_completion_request(messages, functions)
    full_message = response.json()["choices"][0]
    if full_message["finish_reason"] == "function_call":
        print("Function generation requested, calling function")
        return call_arxiv_function(messages, full_message)
    else:
        print("Function not required, responding to user")
        return response.json()


def call_arxiv_function(messages, full_message):
    """Function calling function which executes function calls when the model believes it is necessary.
    Currently extended by adding clauses to this if statement."""

    if full_message["message"]["function_call"]["name"] == "get_articles":
        try:
            parsed_output = json.loads(
                full_message["message"]["function_call"]["arguments"]
            )
            print("Getting search results")
            results = get_articles(parsed_output["query"])
        except Exception as e:
            print("Function execution failed")
            print(f"Error message: {e}")
            # Pass the failure on to the model so that results is always defined
            results = f"Function execution failed: {e}"
        messages.append(
            {
                "role": "function",
                "name": full_message["message"]["function_call"]["name"],
                "content": str(results),
            }
        )
        try:
            print("Got search results, summarizing content")
            response = chat_completion_request(messages)
            return response.json()
        except Exception as e:
            print(type(e))
            raise Exception("Function chat request failed")

    elif (
        full_message["message"]["function_call"]["name"] == "read_article_and_summarize"
    ):
        parsed_output = json.loads(
            full_message["message"]["function_call"]["arguments"]
        )
        print("Finding and reading paper")
        summary = summarize_text(parsed_output["query"])
        return summary

    else:
        raise Exception("Function does not exist and cannot be called")
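
As more functions are added, the if/elif chain can be replaced by a lookup table. The sketch below shows one possible shape for that, run against a hand-written response choice so that no API call is needed. The stub functions, the `DISPATCH` table, and `dispatch_function_call` are all hypothetical names introduced for illustration.

```python
import json


def get_articles_stub(query):
    """Stand-in for get_articles; returns a fake search result."""
    return [{"title": f"Result for {query}"}]


def summarize_stub(query):
    """Stand-in for summarize_text; returns a fake summary."""
    return f"Summary of the paper closest to '{query}'"


# Map function names from the model's response to Python callables
DISPATCH = {
    "get_articles": get_articles_stub,
    "read_article_and_summarize": summarize_stub,
}


def dispatch_function_call(choice):
    """Run the function requested in one entry of the API's "choices" list."""
    call = choice["message"]["function_call"]
    handler = DISPATCH.get(call["name"])
    if handler is None:
        raise ValueError(f"Unknown function: {call['name']}")
    # The arguments arrive as a JSON string and must be parsed first
    arguments = json.loads(call["arguments"])
    return handler(**arguments)


# Hand-written choice with the same shape the notebook reads
mock_choice = {
    "finish_reason": "function_call",
    "message": {
        "role": "assistant",
        "content": None,
        "function_call": {
            "name": "get_articles",
            "arguments": '{"query": "information theory"}',
        },
    },
}

dispatched = dispatch_function_call(mock_choice)
print(dispatched)
```

With a table, supporting a new function means adding one entry rather than another branch.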


## arXiv conversation

Let's put this all together by testing our functions out in conversation.

In [19]:
# Start with a system message
paper_system_message = """You are arXivGPT, a helpful assistant that pulls academic papers to answer user questions.
You summarize the papers clearly so the customer can decide which to read to answer their question.
You always provide the article_url and title so the user can understand the name of the paper and click through to access it.
Begin!"""
paper_conversation = Conversation()
paper_conversation.add_message("system", paper_system_message)


In [25]:
# Add a user message
paper_conversation.add_message("user", "Hi, what is the relation between information theory and reality?")
chat_response = chat_completion_with_function_execution(
    paper_conversation.conversation_history, functions=arxiv_functions
)
assistant_message = chat_response["choices"][0]["message"]["content"]
paper_conversation.add_message("assistant", assistant_message)
display(Markdown(assistant_message))


Function not required, responding to user


Information theory is a branch of mathematics that deals with the quantification, storage, and communication of information. It provides a framework to understand the fundamental limits of communication, compression, and encryption. The relation between information theory and reality is that information theory provides a mathematical model to describe and analyze how information is processed and transmitted in various systems, including both human-made systems and natural systems found in the physical world. It has applications in various fields such as telecommunications, cryptography, data compression, machine learning, and more. By studying information theory, we can gain insights into the principles governing the transfer and transformation of information in the real world.

In [67]:
# Add another user message to induce our system to use the second tool
paper_conversation.add_message(
    "user",
    "Can you read the classical mechanical newton paper for me and give me a summary",
)
updated_response = chat_completion_with_function_execution(
    paper_conversation.conversation_history, functions=arxiv_functions
)
display(Markdown(updated_response["choices"][0]["message"]["content"]))


Function generation requested, calling function
Finding and reading paper
No papers searched yet, downloading first.
The query sent to arXiv is: classical mechanical newton
Failed to download Mechanics and Newton-Cartan-Like Gravity on the Newton-Hooke Space-time. Error: [Errno 2] No such file or directory: './data/papers/hep-th/0411004v2.Mechanics_and_Newton_Cartan_Like_Gravity_on_the_Newton_Hooke_Space_time.pdf'
Failed to download Classical Field Theory and Analogy Between Newton's and Maxwell's Equations. Error: [Errno 2] No such file or directory: './data/papers/hep-th/9312009v1.Classical_Field_Theory_and_Analogy_Between_Newton_s_and_Maxwell_s_Equations.pdf'
Papers downloaded, continuing
Chunking text from paper
Summarizing each chunk of text


100%|██████████| 5/5 [00:06<00:00,  1.21s/it]


Summarizing into overall summary


Core Argument:
- The paper discusses the concept of randomness in classical and quantum mechanics.
- The author argues that Newton's deterministic classical world does not hold up and that there is a fundamental, irreversible randomness in classical mechanics.
- The paper suggests a "functional" approach to classical mechanics, in which the fundamental equation of microscopic dynamics is the Liouville equation for the distribution function of a single particle.
- Newton's equation appears as an approximate equation describing the dynamics of the mean values of position and momentum.

Evidence:
- The classical Newtonian trajectory has no direct physical meaning, since observations can only yield rational numbers, not arbitrary real numbers.
- Solutions of the Liouville equation have the property of delocalization, which explains irreversibility.
- The equation of motion for a single particle in a potential field is given by ∂ρ/∂t=−p/m∂ρ/∂q+∂V(q)/∂q∂ρ/∂p.
- The distribution functions in functional classical mechanics and in quantum mechanics coincide under certain conditions.

Interesting Points:
- The paper suggests interpreting quantum mechanics in a way that incorporates fundamental randomness in both classical and quantum mechanics.
- Instead of an ensemble of events, the paper suggests introducing an ensemble of observers.
- The paper also discusses uncertainty in classical and quantum mechanics and the introduction of contextual variables in quantum mechanics.

Conclusion:
- The paper discusses the functional approach to classical mechanics, which focuses on distribution functions rather than precise particle trajectories.
- The paper suggests that both classical and quantum mechanics contain fundamental randomness, which calls for a re-evaluation of the interpretation of quantum mechanics.
- The paper also mentions possible applications of functional mechanics to statistical mechanics, field theory, cosmology, and black holes.

In [None]:
# Add another user message to induce our system to use the second tool
paper_conversation.add_message(
    "user",
    "Explain in more depth what the concept of digital equivalence means. What paradoxes are created by the maximum speed of the universe? Also explain how the simplicity of mathematical laws supports the idea that the world arises from finite information processing. Explain the implications if this theory is true.",
)
updated_response = chat_completion_with_function_execution(
    paper_conversation.conversation_history, functions=arxiv_functions
)
display(Markdown(updated_response["choices"][0]["message"]["content"]))


Function generation requested, calling function
Finding and reading paper
Chunking text from paper
Summarizing each chunk of text


100%|██████████| 7/7 [00:11<00:00,  1.69s/it]


Summarizing into overall summary


Core Argument:
- The paper explores the idea that the universe is a virtual reality created by information processing.
- The creation of the universe in the Big Bang would not be paradoxical if it were a virtual reality, since every virtual system needs to be booted up.
- Modern information science can explain fundamental physical properties such as space, time, light, matter, and motion as derived from information processing.

Evidence:
- Physics experiments, such as time dilation, the curvature of space, teleportation, and creation from nothing, are driving the need for strange theories in physics.
- Current theories of physics, such as quantum mechanics and relativity, have been successful in their predictions, but still lack a foundation and do not make sense.

Interesting Points:
- The author argues that if matter, energy, charge, momentum, and spin are all information, then all conservation laws could reduce to a single law of conservation of information.
- The simplicity of the mathematical laws describing the physical world is seen as evidence for a virtual reality theory, since frequent calculations in a virtual reality would need to be simple.
- Complementary uncertainty, described by Heisenberg's uncertainty principle, is seen as a property of reality in the virtual reality theory.

Conclusion:
- The paper concludes that what physics needs is not more proofs or applications, but more understanding.
- The virtual reality theory may help explain the mysteries of modern physics, such as how particles seem to "know" what to do and the maximum speed of light.
- The virtual reality theory is a logical option that should be considered alongside other theories in physics.