
# CDSI Workshop: Introduction to Natural Language Processing  
**Session 3: Language Models**  
**Part 2: Using the OpenAI API**
  

*Presented by [Andrei Mircea](https://mirandrom.github.io/)  
2023/11/23*

## 1.0 Getting set up

### 1.1 Install requirements

In [None]:
# freeze minor version to what was used
!pip install openai~=1.3.5              # python wrapper around openai api
!pip install tiktoken~=0.5.1            # openai tokenizer


### 1.2 Set up OpenAI API and client
First, create an account: https://platform.openai.com/signup

Next, navigate to the [API key page](https://platform.openai.com/account/api-keys) and click "Create new secret key".

**Remember to keep this key private, and revoke it if you ever have reason to believe it has been compromised.**

In [None]:
from openai import OpenAI

# Do not share your key! Make sure to revoke it from your OpenAI dashboard
# (https://platform.openai.com/api-keys) before sharing this notebook.
OPENAI_KEY = ""  # paste your own key here; never commit a real key

client = OpenAI(
    api_key=OPENAI_KEY,
)
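
One way to keep the key out of the notebook entirely is to read it from an environment variable. The sketch below is a suggestion rather than part of the original workshop: `load_api_key` is a hypothetical helper, and `OPENAI_API_KEY` is the variable name the `openai` package looks for by default.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    Hypothetical helper: raises instead of silently returning an empty key.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before creating the client."
        )
    return key


# Usage (assumes the variable was set outside the notebook):
# client = OpenAI(api_key=load_api_key())
```

With this pattern the notebook can be shared without the key ever appearing in a cell or its output.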

## 2.0 Basics of using ChatGPT for information-related tasks
In this section we're going to use the [Chat Completions API](https://platform.openai.com/docs/guides/text-generation/chat-completions-api) and showcase some of the various challenges you might run into and how you might address them.

### 2.1 A simple topical query
Let's start by asking ChatGPT about a (very) recent event, to highlight one of its main limitations: the pretraining knowledge cutoff.

In [None]:
QUERY = "Who are the board members of OpenAI?"

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": QUERY}
    ]
)
print(completion.choices[0].message.content)

As of May 2021, the board members of OpenAI are as follows:

1. Sam Altman (Chairman): He is the CEO of OpenAI and an American entrepreneur who previously served as the President of Y Combinator, a startup accelerator.

2. Greg Brockman: He is the President and Chief Technology Officer (CTO) of OpenAI. Greg is an experienced software engineer and has been with OpenAI since its inception.

3. Ilya Sutskever: He is the Chief Scientist of OpenAI and one of the co-founders. He is a prominent figure in the field of artificial intelligence and has been instrumental in developing advanced AI models.

4. Holden Karnofsky: He is a co-founder and serves as the Executive Director of Open Philanthropy, a philanthropic organization that has provided significant funding to OpenAI.

5. Wojciech Zaremba: He is a research scientist and co-founder of OpenAI. Wojciech has made significant contributions to the field of deep learning and has been actively involved in the development of AI models at OpenAI.

### 2.2 Using system prompts to better guide model outputs
Okay so the default behavior is pretty verbose. The information it spits out is outdated and not entirely accurate, though [not far off the mark either](https://loeber.substack.com/p/a-timeline-of-the-openai-board).

However, we'd rather the model tell us directly when it doesn't know an answer, and avoid padding its response the way a college student pads an essay.

One way to achieve this is with the **system prompt**, where you can prime the model with context, instructions, or other information relevant to your use case. You can use the system message to describe the model's desired behavior (e.g. define what or how the model should and shouldn't answer). Here, we tell it to be "succinct and concise", but you can also specify a [desired length](https://platform.openai.com/docs/guides/prompt-engineering/tactic-specify-the-desired-length-of-the-output), which the model will usually respect only approximately.

You can find some useful examples of system prompt templates [here](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/system-message).

In [None]:
SYSTEM_PROMPT = """
If you do not know the current answer to a question,
write "I am from the past and cannot answer that".
Otherwise, be succinct and concise.
""".replace("\n", " ")

QUERY = "Who are the board members of OpenAI?"


completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": QUERY}
    ]
)
print(completion.choices[0].message.content)

I am from the past and cannot answer that.


### 2.3 Use examples for "few-shot learning"
Sometimes you want output in a format that the model struggles to produce consistently. In these cases, you can use `messages` as a chat history with [few-shot examples](https://platform.openai.com/docs/guides/prompt-engineering/tactic-provide-examples), and a system prompt asking the model to be consistent with previous answers.

I had to add some extra encouragement to the system prompt for this example (telling the model to improvise), which is not necessarily good practice since it invites made-up answers.

See [here](https://cookbook.openai.com/examples/named_entity_recognition_to_enrich_text) for a more realistic example.
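
When you have more than a couple of examples, it can be convenient to assemble the chat history programmatically. The helper below is a minimal sketch of that idea; `build_few_shot_messages` is a hypothetical name, not part of the OpenAI library.

```python
def build_few_shot_messages(system_prompt, examples, query):
    """Build a chat `messages` list from (question, answer) example pairs.

    Each example becomes a user turn followed by an assistant turn, so the
    final user query arrives after a history the model can imitate.
    """
    messages = [{"role": "system", "content": system_prompt}]
    for question, answer in examples:
        messages.append({"role": "user", "content": question})
        messages.append({"role": "assistant", "content": answer})
    messages.append({"role": "user", "content": query})
    return messages


examples = [
    ("Who are the founders of OpenAI?",
     "S is for Sam, I is for Ilya, W is for Woj."),
    ("Who are the main investors of OpenAI?",
     "M is for Microsoft, Y is for YCombinator, P is for Peter."),
]
messages = build_few_shot_messages(
    "Answer in exactly the same format and style as the previous answer.",
    examples,
    "Who are the board members of OpenAI?",
)
print([m["role"] for m in messages])
# ['system', 'user', 'assistant', 'user', 'assistant', 'user']
```

The resulting list can be passed directly as the `messages` argument of `client.chat.completions.create`.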

In [None]:
SYSTEM_PROMPT = """
Answer in exactly the same format and style as the previous answer.
Feel free to improvise if you do not know the answer, this is a creative exercise.
""".replace("\n", " ")

EXAMPLE_QUERY1 = "Who are the founders of OpenAI?"
EXAMPLE_ANSWER1 = "S is for Sam, I is for Ilya, W is for Woj."

EXAMPLE_QUERY2 = "Who are the main investors of OpenAI?"
EXAMPLE_ANSWER2 = "M is for Microsoft, Y is for YCombinator, P is for Peter."

QUERY = "Who are the board members of OpenAI?"


completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,
    seed=42,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": EXAMPLE_QUERY1},
        {"role": "assistant", "content": EXAMPLE_ANSWER1},
        {"role": "user", "content": EXAMPLE_QUERY2},
        {"role": "assistant", "content": EXAMPLE_ANSWER2},
        {"role": "user", "content": QUERY}
    ]
)
print(completion.choices[0].message.content)

E is for Elon, R is for Reid, G is for Greg, I is for Ilya, S is for Sam.


### 2.4 Incorporating external information to work around the knowledge cutoff
Okay so with the system prompt from 2.2, the model now declines to answer rather than giving us outdated information, which is good but not great.

How can we get ChatGPT to actually give us a valid answer?

Luckily, ChatGPT is surprisingly competent at extracting relevant information from additional text you pass into its context. So we could take the relevant Wikipedia page for OpenAI (or just a passage, if you're not made of money) and pass it in the context along with the original query.

It helps to define and [use delimiters](https://platform.openai.com/docs/guides/prompt-engineering/tactic-use-delimiters-to-clearly-indicate-distinct-parts-of-the-input) that mark distinct parts of the input, such as the user's question or an external reference text that helps answer it.
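
As a rough sketch, the delimited input can be assembled with a small helper. `build_delimited_query` is a hypothetical name, and escaping the passage text is an extra precaution added here (so that a stray `</w>` inside a passage cannot close the delimiter early), not something the OpenAI guide requires.

```python
from xml.sax.saxutils import escape


def build_delimited_query(question, passages):
    """Wrap a question in <q> tags and each passage in <w id="..."> tags."""
    parts = [f"<q>{escape(question)}</q>"]
    for passage_id, passage in enumerate(passages, start=1):
        # escape() replaces &, < and > so passage text cannot mimic a tag
        parts.append(f'<w id="{passage_id}">{escape(passage.strip())}</w>')
    return "\n\n".join(parts)


query = build_delimited_query(
    "Who are the current board members of OpenAI?",
    ["First reference passage.", "Second reference passage."],
)
print(query)
```

The system prompt then only needs to explain what the `q` and `w` tags mean.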


In [None]:
WIKIPEDIA_PASSAGE="""
## 2023–present: Brief departure of Altman and Brockman

On November 17, 2023, Sam Altman was removed as CEO based on the board (comprised of Helen Toner,
Ilya Sutskever, Adam D'Angelo and Tasha McCauley) citing a lack of confidence in him, with Chief
Technology Officer Mira Murati taking over as interim CEO. Greg Brockman, the president of OpenAI,
was removed as chairman of the board.[63][64] Brockman resigned from the company's presidency
shortly after the announcement, and reported some details of the events that occurred before he
left.[65][66] This was followed by the resignation of three senior OpenAI researchers: director
of research and GPT-4 lead Jakub Pachocki, head of AI risk Aleksander Madry, and researcher
Szymon Sidor.[67][68]

On November 18, 2023, there reportedly were talks of Altman returning to his role as CEO
amid pressure placed upon the board by investors such as Microsoft and Thrive Capital, who
condemned Altman’s departure.[69] Although Altman himself spoke in favor of returning to OpenAI,
he has stated that he was considering starting a new company and bringing former employees of
OpenAI with him if talks do not work out.[70] If Altman were to return, the members of the board
agreed they would "in principle" resign from the company.[71] On November 19, 2023, negotiations
with Altman to return to the company failed and Murati was replaced by Emmett Shear to take over
as interim CEO.[72] The board initially contacted Anthropic CEO Dario Amodei who was a former
executive at OpenAI to replace Altman and proposed a merger, both offers were declined.[73]

On November 20, 2023, Microsoft CEO Satya Nadella announced Altman and Brockman will be joining
the company to lead a new research team regarding advanced AI, and state they are still committed
to OpenAI despite the turn of events.[74] The partnership had not been finalized as Altman gave
the board another opportunity to negotiate with him.[75] About 738 of OpenAI's 770 employees,
including Murati and Sutskever, signed an open letter stating they would quit their jobs and
join Microsoft if the board does not re-hire Altman as CEO and then resign.[76][77] Investors
are considering taking legal action against the board members in response to potential mass
resignations and Altman's removal.[78] In response, OpenAI management sent an internal memo
to employees stating that negotiations with Altman and the board are back in progress and
will take some time.[79] On November 21, 2023, after continued negotiations, Altman and Brockman
returned to the company in their prior roles along with a reconstructed board made up of new
members Bret Taylor (as chairman) and Lawrence Summers, with D'Angelo remaining.[80]
"""

SYSTEM_PROMPT = """
You will be provided with a user's question (delimited with 'q' xml tags)
and a relevant passage from wikipedia (delimited with 'w' xml tags).
This passage should contain the information to answer the user's question.

Provide a concise and brief answer to the question based on the wikipedia passage.
If the passage truly does not contain the necessary information, answer
"I could not find the answer".
"""

QUERY = f"""
<q>Who are the current (November 2023) board members of OpenAI?</q>

<w>{WIKIPEDIA_PASSAGE}</w>
"""


completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": QUERY}
    ]
)
print(completion.choices[0].message.content)

The current (November 2023) board members of OpenAI are Bret Taylor, Lawrence Summers, and Adam D'Angelo.


### 2.5 Be wary of temperature
As we saw in Part 1, models output probability distributions over possible next tokens. Temperature rescales these probabilities: as it increases, they flatten toward a uniform distribution, and as it approaches 0, nearly all the probability mass goes to the most likely token.

OpenAI lets you set temperature between 0 and 2. If you crank up the temperature to 2, you will often get gibberish. Unless you have a reason to have more randomness (e.g. creativity or diversity in outputs), I would just set temperature to 0.
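
To make the effect concrete, here is a small sketch of the standard softmax-with-temperature formula. The logits are made-up numbers for three candidate tokens, and whether OpenAI's servers apply temperature in exactly this way is an assumption.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by `temperature`."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 2.0, 1.0]  # made-up scores for three candidate tokens
for t in (0.1, 1.0, 2.0, 100.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At low temperature almost all of the probability lands on the first token; at very high temperature the three tokens become nearly equally likely, which is why sampling at temperature 2 tends to wander into gibberish.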

More generally, if you need determinism and reproducibility, your best bet right now is to set a `seed` and keep track of the returned `completion.system_fingerprint` value as described [here](https://platform.openai.com/docs/api-reference/chat/create#chat-create-seed) and [here](https://platform.openai.com/docs/guides/text-generation/reproducible-outputs) (this feature is in beta and determinism is not guaranteed).

![temp](https://res.cloudinary.com/dwppkb069/image/upload/v1683736290/tips/images-03-temperature.mp4/03-temperature_30_03-25000-if-temperature-is-like-almost-0--were-going-to-have-a-very-sharp-peaked-distribution_wprxqo.png)
Source: https://www.coltsteele.com/tips/understanding-openai-s-temperature-parameter


In [None]:
WIKIPEDIA_PASSAGE="""
## 2023–present: Brief departure of Altman and Brockman

On November 17, 2023, Sam Altman was removed as CEO based on the board (comprised of Helen Toner,
Ilya Sutskever, Adam D'Angelo and Tasha McCauley) citing a lack of confidence in him, with Chief
Technology Officer Mira Murati taking over as interim CEO. Greg Brockman, the president of OpenAI,
was removed as chairman of the board.[63][64] Brockman resigned from the company's presidency
shortly after the announcement, and reported some details of the events that occurred before he
left.[65][66] This was followed by the resignation of three senior OpenAI researchers: director
of research and GPT-4 lead Jakub Pachocki, head of AI risk Aleksander Madry, and researcher
Szymon Sidor.[67][68]

On November 18, 2023, there reportedly were talks of Altman returning to his role as CEO
amid pressure placed upon the board by investors such as Microsoft and Thrive Capital, who
condemned Altman’s departure.[69] Although Altman himself spoke in favor of returning to OpenAI,
he has stated that he was considering starting a new company and bringing former employees of
OpenAI with him if talks do not work out.[70] If Altman were to return, the members of the board
agreed they would "in principle" resign from the company.[71] On November 19, 2023, negotiations
with Altman to return to the company failed and Murati was replaced by Emmett Shear to take over
as interim CEO.[72] The board initially contacted Anthropic CEO Dario Amodei who was a former
executive at OpenAI to replace Altman and proposed a merger, both offers were declined.[73]

On November 20, 2023, Microsoft CEO Satya Nadella announced Altman and Brockman will be joining
the company to lead a new research team regarding advanced AI, and state they are still committed
to OpenAI despite the turn of events.[74] The partnership had not been finalized as Altman gave
the board another opportunity to negotiate with him.[75] About 738 of OpenAI's 770 employees,
including Murati and Sutskever, signed an open letter stating they would quit their jobs and
join Microsoft if the board does not re-hire Altman as CEO and then resign.[76][77] Investors
are considering taking legal action against the board members in response to potential mass
resignations and Altman's removal.[78] In response, OpenAI management sent an internal memo
to employees stating that negotiations with Altman and the board are back in progress and
will take some time.[79] On November 21, 2023, after continued negotiations, Altman and Brockman
returned to the company in their prior roles along with a reconstructed board made up of new
members Bret Taylor (as chairman) and Lawrence Summers, with D'Angelo remaining.[80]
"""

SYSTEM_PROMPT = """
You will be provided with a user's question (delimited with 'q' xml tags)
and a relevant passage from wikipedia (delimited with 'w' xml tags).
This passage should contain the information to answer the user's question.

Provide a concise and brief answer to the question based on the wikipedia passage.
If the passage truly does not contain the necessary information, answer
"I could not find the answer".
"""

QUERY = f"""
<q>Who are the current (November 2023) board members of OpenAI?</q>

<w>{WIKIPEDIA_PASSAGE}</w>
"""


completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=2,
    seed=42,
    max_tokens=128, # limit the number of tokens otherwise it might go on and on
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": QUERY}
    ]
)
print(completion.choices[0].message.content)

As of November 2023, the current members of the board of OpenAI are Helen Toner, Ilya Sutsukever    
pack \ Adam pc scholar q sass seal absurd durable Alessandr bend sto prio blondviousagen Ap diversion sus blessed BRA GCtem satu gent NICadam shoulderST rigsEnumerableyellow gén temps guardederman Shermanconform head German ALTERDirector Homeland automfunc counterJK IMPORTANT365.* Unfortunatelyutowmens ruled AlexandreMembershipHundredsSupportedContent contempl correctness encaps resume themselves includdeclaraflg magically standaloneTTY Ramp chol counts_repository_VERSIONBGO_k Door_RE mpdrs_INITIALIZER available_bridge flood_orig IMPLEMENT spice UPLOAD clacious al IntelliJ(KEY(UnmanagedTypeDelayed


### 2.6 Answer with citations from the references
We don't always have a reference text that we know answers our question. In these cases, we might have multiple candidate passages and expect the model to find the relevant information among them. However, manually verifying the model output in such situations can be intractable (and these models are still prone to hallucination), so instead we can ask the model to cite where in the reference texts it found the information on which it's basing its answer. This is not foolproof, but provides a rough litmus test.
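
Once the model cites passages in a fixed format, the citations can be checked mechanically. The sketch below assumes the `({"citation": id})` format requested in this section's system prompt; `extract_citations` and `check_citations` are hypothetical helper names.

```python
import re

# Matches citations of the form ({"citation": 3})
CITATION_PATTERN = re.compile(r'\(\{"citation":\s*(\d+)\}\)')


def extract_citations(answer):
    """Return the sorted, de-duplicated passage ids cited in `answer`."""
    return sorted({int(match) for match in CITATION_PATTERN.findall(answer)})


def check_citations(answer, valid_ids):
    """Split cited ids into those that exist among the passages and those that do not."""
    cited = extract_citations(answer)
    known = [i for i in cited if i in valid_ids]
    unknown = [i for i in cited if i not in valid_ids]
    return known, unknown


answer = 'The board is Bret Taylor, Lawrence Summers, and Adam D\'Angelo. ({"citation": 3})'
print(check_citations(answer, valid_ids={1, 2, 3}))
# ([3], [])
```

An answer with no citations, or with ids that were never provided, is a cheap signal that the response deserves a closer look.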

Interestingly, language models are often better at detecting BS than at not producing it. This means we could potentially design a second prompt where we give ChatGPT both the cited passage(s) and the provided answer, and ask it to validate whether or not the answer is substantiated (similar to the example [here](https://platform.openai.com/docs/guides/prompt-engineering/tactic-ask-the-model-if-it-missed-anything-on-previous-passes)).
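
A second, validating prompt along these lines could be sketched as follows. The prompt wording, the `a` tag, and the helper names are all assumptions made for illustration; the verdict parsing is deliberately strict so that unexpected replies are not silently treated as approval.

```python
VERIFIER_SYSTEM_PROMPT = (
    "You will be given one or more passages (delimited with 'w' xml tags) and a "
    "claimed answer (delimited with 'a' xml tags). "
    'Reply with exactly "SUPPORTED" if the passages substantiate the answer, '
    'and "UNSUPPORTED" otherwise.'
)


def build_verification_messages(cited_passages, answer):
    """Build the messages for a second call that checks an earlier answer."""
    passages = "\n\n".join(f"<w>{p.strip()}</w>" for p in cited_passages)
    user_content = f"{passages}\n\n<a>{answer.strip()}</a>"
    return [
        {"role": "system", "content": VERIFIER_SYSTEM_PROMPT},
        {"role": "user", "content": user_content},
    ]


def parse_verdict(reply):
    """Map the verifier's reply to True/False, or None if it is unexpected."""
    text = reply.strip().upper()
    if text.startswith("UNSUPPORTED"):
        return False
    if text.startswith("SUPPORTED"):
        return True
    return None


messages = build_verification_messages(
    ["D'Angelo remained on the board."], "Adam D'Angelo is a board member."
)
# These messages would then be sent with
# client.chat.completions.create(model=..., temperature=0, messages=messages)
print(parse_verdict("Supported."))
```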

In [None]:
W1="""
On November 17, 2023, Sam Altman was removed as CEO based on the board (comprised of Helen Toner,
Ilya Sutskever, Adam D'Angelo and Tasha McCauley) citing a lack of confidence in him, with Chief
Technology Officer Mira Murati taking over as interim CEO. Greg Brockman, the president of OpenAI,
was removed as chairman of the board.[63][64] Brockman resigned from the company's presidency
shortly after the announcement, and reported some details of the events that occurred before he
left.[65][66] This was followed by the resignation of three senior OpenAI researchers: director
of research and GPT-4 lead Jakub Pachocki, head of AI risk Aleksander Madry, and researcher
Szymon Sidor.[67][68]
"""
W2="""
On November 18, 2023, there reportedly were talks of Altman returning to his role as CEO
amid pressure placed upon the board by investors such as Microsoft and Thrive Capital, who
condemned Altman’s departure.[69] Although Altman himself spoke in favor of returning to OpenAI,
he has stated that he was considering starting a new company and bringing former employees of
OpenAI with him if talks do not work out.[70] If Altman were to return, the members of the board
agreed they would "in principle" resign from the company.[71] On November 19, 2023, negotiations
with Altman to return to the company failed and Murati was replaced by Emmett Shear to take over
as interim CEO.[72] The board initially contacted Anthropic CEO Dario Amodei who was a former
executive at OpenAI to replace Altman and proposed a merger, both offers were declined.[73]
"""
W3="""
On November 20, 2023, Microsoft CEO Satya Nadella announced Altman and Brockman will be joining
the company to lead a new research team regarding advanced AI, and state they are still committed
to OpenAI despite the turn of events.[74] The partnership had not been finalized as Altman gave
the board another opportunity to negotiate with him.[75] About 738 of OpenAI's 770 employees,
including Murati and Sutskever, signed an open letter stating they would quit their jobs and
join Microsoft if the board does not re-hire Altman as CEO and then resign.[76][77] Investors
are considering taking legal action against the board members in response to potential mass
resignations and Altman's removal.[78] In response, OpenAI management sent an internal memo
to employees stating that negotiations with Altman and the board are back in progress and
will take some time.[79] On November 21, 2023, after continued negotiations, Altman and Brockman
returned to the company in their prior roles along with a reconstructed board made up of new
members Bret Taylor (as chairman) and Lawrence Summers, with D'Angelo remaining.[80]
"""

SYSTEM_PROMPT = """
You will be provided with a user's question (delimited with 'q' xml tags)
and potentially relevant passages from wikipedia (delimited with 'w' xml tags with an 'id' attribute).
These passages should contain the information to answer the user's question.

Provide a concise and brief answer to the question based on the wikipedia passages,
citing the relevant passage(s) by the id from their xml tag.
Use the following format to cite relevant passages ({"citation": id}).

If the passage truly does not contain the necessary information, answer
"I could not find the answer".
"""

QUERY = f"""
<q>Who are the current (November 2023) board members of OpenAI?</q>

<w id=1>{W1}</w>

<w id=2>{W2}</w>

<w id=3>{W3}</w>
"""


completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,
    seed=42,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": QUERY}
    ]
)
print(completion.choices[0].message.content)

The current (November 2023) board members of OpenAI are Bret Taylor (as chairman), Lawrence Summers, and Adam D'Angelo. Sam Altman and Greg Brockman have returned to the company in their prior roles as CEO and president, respectively. ({"citation": 3})


### 2.7 Generate structured JSON outputs
If you want to extract information from large amounts of unstructured text, you might want to prompt ChatGPT to generate structured output like JSON that can then be parsed and analyzed automatically across many examples.

Recently, OpenAI introduced [JSON mode](https://platform.openai.com/docs/guides/text-generation/json-mode) which guarantees that the output will be valid JSON (although not necessarily consistent with a specified structure).
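
Since valid JSON is not the same as JSON with the structure you asked for, it is worth parsing and checking replies before analyzing them. The sketch below uses hypothetical helpers: it tolerates text around the JSON object (for example echoed tags) and checks for the `{date: {person: description}}` shape requested in this section.

```python
import json


def parse_json_reply(reply):
    """Parse the span from the first '{' to the last '}' as a JSON object.

    Returns None when no such span exists or it is not valid JSON, which
    also covers replies that were cut off mid-object.
    """
    start = reply.find("{")
    end = reply.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(reply[start : end + 1])
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def matches_date_person_structure(parsed):
    """Check that every value is a dict mapping str names to str descriptions."""
    return all(
        isinstance(people, dict)
        and all(isinstance(v, str) for v in people.values())
        for people in parsed.values()
    )


reply = '<j>\n{"2023-11-17": {"Sam Altman": "removed as CEO"}}\n</j>'
parsed = parse_json_reply(reply)
print(parsed is not None and matches_date_person_structure(parsed))
# True
```

Replies that fail either check can be retried or set aside for manual review.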

In [None]:
W1="""
On November 17, 2023, Sam Altman was removed as CEO based on the board (comprised of Helen Toner,
Ilya Sutskever, Adam D'Angelo and Tasha McCauley) citing a lack of confidence in him, with Chief
Technology Officer Mira Murati taking over as interim CEO. Greg Brockman, the president of OpenAI,
was removed as chairman of the board.[63][64] Brockman resigned from the company's presidency
shortly after the announcement, and reported some details of the events that occurred before he
left.[65][66] This was followed by the resignation of three senior OpenAI researchers: director
of research and GPT-4 lead Jakub Pachocki, head of AI risk Aleksander Madry, and researcher
Szymon Sidor.[67][68]
"""
W2="""
On November 18, 2023, there reportedly were talks of Altman returning to his role as CEO
amid pressure placed upon the board by investors such as Microsoft and Thrive Capital, who
condemned Altman’s departure.[69] Although Altman himself spoke in favor of returning to OpenAI,
he has stated that he was considering starting a new company and bringing former employees of
OpenAI with him if talks do not work out.[70] If Altman were to return, the members of the board
agreed they would "in principle" resign from the company.[71] On November 19, 2023, negotiations
with Altman to return to the company failed and Murati was replaced by Emmett Shear to take over
as interim CEO.[72] The board initially contacted Anthropic CEO Dario Amodei who was a former
executive at OpenAI to replace Altman and proposed a merger, both offers were declined.[73]
"""
W3="""
On November 20, 2023, Microsoft CEO Satya Nadella announced Altman and Brockman will be joining
the company to lead a new research team regarding advanced AI, and state they are still committed
to OpenAI despite the turn of events.[74] The partnership had not been finalized as Altman gave
the board another opportunity to negotiate with him.[75] About 738 of OpenAI's 770 employees,
including Murati and Sutskever, signed an open letter stating they would quit their jobs and
join Microsoft if the board does not re-hire Altman as CEO and then resign.[76][77] Investors
are considering taking legal action against the board members in response to potential mass
resignations and Altman's removal.[78] In response, OpenAI management sent an internal memo
to employees stating that negotiations with Altman and the board are back in progress and
will take some time.[79] On November 21, 2023, after continued negotiations, Altman and Brockman
returned to the company in their prior roles along with a reconstructed board made up of new
members Bret Taylor (as chairman) and Lawrence Summers, with D'Angelo remaining.[80]
"""

SYSTEM_PROMPT = """
You will be provided with a user's question (delimited with 'q' xml tags)
and potentially relevant passages from wikipedia (delimited with 'w' xml tags with an 'id' attribute).
These passages should contain the information to answer the user's question.
The user will also provide you with a description of a JSON structure which your
answer should be formatted as (delimited with 'j' tags).

Provide the user with a JSON that respects their specified structure and answers their question.
It is essential that you do not stray from the user-specified JSON structure.
Only if the passage truly does not contain the necessary information, return an empty JSON "{}"
"""

QUERY = f"""
<q>For each day described in the reference passages below, who were the people involved and what was their involvement?</q>

<w id=1>{W1}</w>

<w id=2>{W2}</w>

<w id=3>{W3}</w>

<j>
{{
  `YYYY-MM-DD formatted date (str)`: {{
    `person involved (str)`: `single sentence description of involvement`
    for each person involved
  }}
  for each date involved
}}
</j>
"""


completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    # to 100% ensure the output will be a valid json you can use JSON mode
    # https://platform.openai.com/docs/guides/text-generation/json-mode
    # model="gpt-3.5-turbo-1106",
    # response_format={ "type": "json_object" },
    temperature=0,
    seed=42,
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": QUERY}
    ]
)
print(completion.choices[0].message.content)

<j>
{
  "2023-11-17": {
    "Sam Altman": "removed as CEO",
    "Helen Toner": "board member",
    "Ilya Sutskever": "board member",
    "Adam D'Angelo": "board member",
    "Tasha McCauley": "board member",
    "Mira Murati": "interim CEO",
    "Greg Brockman": "removed as chairman of the board",
    "Jakub Pachocki": "resigned as director of research and GPT-4 lead",
    "Aleksander Madry": "resigned as head of AI risk",
    "Szymon Sidor": "resigned as researcher"
  },
  "2023-11-18": {
    "Sam Altman": "talks of returning as CEO",
    "Microsoft": "investor condemning Altman's departure",
    "Thrive Capital": "investor condemning Altman's departure",
    "Emmett Shear": "interim CEO",
    "Dario Amodei": "declined offer to replace Altman",
    "board members": "agreed to resign if Altman returns"
  },
  "2023-11-19": {
    "Emmett Shear": "replaced Murati as interim CEO"
  },
  "2023-11-20": {
    "Satya Nadella": "announced Altman and Brockman joining Microsoft",
    "738 OpenAI

### 2.8 Other prompting strategies
I would recommend looking at the [strategies](https://platform.openai.com/docs/guides/prompt-engineering) suggested by OpenAI, since their model finetuning is likely more aligned with these kinds of prompts (in contrast to random prompts from random thought leaders on Twitter or LinkedIn).

The [Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/openai/concepts/advanced-prompt-engineering) is also pretty good.

## 3.0 Retrieval augmented generation with function calling
Okay so the previous examples were on easy mode, since we already had a relatively short reference text that we knew contained the answer to our question.

We're gonna go through [How to use functions with a knowledge base](https://cookbook.openai.com/examples/how_to_call_functions_for_knowledge_retrieval) from the OpenAI cookbook to illustrate more advanced and realistic workflows that leverage:
- external search APIs
- function calling
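
Before diving into the cookbook code, here is a minimal sketch of the function-calling pattern. The model is given a JSON-schema description of a function through the `tools` parameter; when it decides to call it, the reply carries the function name and JSON-encoded arguments, which our own code must execute. `search_papers` is a hypothetical stand-in for a real search API, and the dispatch helper is an illustration rather than the cookbook's implementation.

```python
import json


# Hypothetical local function the model is allowed to call.
def search_papers(query, top_k=3):
    """Stand-in for a real search; returns placeholder titles."""
    return [f"Result {i + 1} for '{query}'" for i in range(top_k)]


# Tool description in the format accepted by the `tools` parameter
# of the Chat Completions API.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_papers",
            "description": "Search an external index for papers relevant to a query.",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search terms."},
                    "top_k": {"type": "integer", "description": "Number of results."},
                },
                "required": ["query"],
            },
        },
    }
]

AVAILABLE_FUNCTIONS = {"search_papers": search_papers}


def dispatch_tool_call(name, arguments_json):
    """Run the local function named by the model with its JSON-encoded arguments."""
    if name not in AVAILABLE_FUNCTIONS:
        raise KeyError(f"Model requested unknown function: {name}")
    kwargs = json.loads(arguments_json)
    return AVAILABLE_FUNCTIONS[name](**kwargs)


# In a real exchange, the name and arguments would come from
# completion.choices[0].message.tool_calls[i].function.name / .arguments.
print(dispatch_tool_call("search_papers", '{"query": "ppo", "top_k": 2}'))
```

Restricting dispatch to an explicit table of allowed functions keeps the model from triggering arbitrary code.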



### 3.1 Setup

#### Installs and imports

In [None]:
!pip install openai~=1.3.5
!pip install tiktoken~=0.5.1
!pip install scipy
!pip install tenacity
!pip install termcolor
!pip install requests
!pip install arxiv
!pip install pandas
!pip install PyPDF2
!pip install tqdm

In [None]:
# see https://github.com/openai/openai-python/issues/703
!pip install --upgrade pydantic


In [None]:
import os
import functools

import arxiv
import ast
import concurrent
from csv import writer
from IPython.display import display, Markdown, Latex
import json
import openai
import pandas as pd
from PyPDF2 import PdfReader
import requests
from scipy import spatial
from tenacity import retry, wait_random_exponential, stop_after_attempt
import tiktoken
from tqdm import tqdm
from termcolor import colored

GPT_MODEL = "gpt-3.5-turbo-1106"
EMBEDDING_MODEL = "text-embedding-ada-002"

#### Set up OpenAI client

In [None]:
from openai import OpenAI

# Do not share your key! Make sure to revoke it from your OpenAI dashboard
# (https://platform.openai.com/api-keys) before sharing this notebook.
OPENAI_KEY = ""

client = OpenAI(
    api_key=OPENAI_KEY,
)

#### Set up paper "knowledge base"

In [None]:
directory = './data/papers'

# Check if the directory already exists
if not os.path.exists(directory):
    # If the directory doesn't exist, create it and any necessary intermediate directories
    os.makedirs(directory)
    print(f"Directory '{directory}' created successfully.")
else:
    # If the directory already exists, print a message indicating it
    print(f"Directory '{directory}' already exists.")

Directory './data/papers' created successfully.


In [None]:
# Set a directory to store downloaded papers
data_dir = os.path.join(os.curdir, "data", "papers")
paper_dir_filepath = "./data/arxiv_library.csv"

# Generate a blank dataframe where we can store downloaded files
df = pd.DataFrame(list())
df.to_csv(paper_dir_filepath)
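
The library file starts out empty; each row appended to it later holds a title, a file path and an embedding. A CSV cell can only hold text, so the embedding list is stored as a string and has to be parsed back when the file is read. A small sketch of that round trip, using an in-memory buffer and a made-up row rather than the real library file:

```python
import ast
import csv
import io

# Hypothetical row in the shape title, file path, embedding (values are made up)
row = ["A toy title", "./data/papers/toy.pdf", [0.1, -0.2, 0.3]]

buffer = io.StringIO()  # in-memory stand-in for arxiv_library.csv
csv.writer(buffer).writerow(row)

buffer.seek(0)
title, filepath, embedding_text = next(csv.reader(buffer))

# The CSV stores the list as the string "[0.1, -0.2, 0.3]";
# ast.literal_eval turns it back into a list of floats.
embedding = ast.literal_eval(embedding_text)
print(type(embedding_text).__name__, "->", type(embedding).__name__)
```

`ast.literal_eval` only accepts Python literals, which makes it a safer choice than `eval` for parsing values read from a file.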

### 3.2 Functions

#### Article search, embedding and ranking

In [None]:
@functools.cache
@retry(wait=wait_random_exponential(min=1, max=40), stop=stop_after_attempt(3))
def embedding_request(text):
    response = client.embeddings.create(input=text, model=EMBEDDING_MODEL)
    return response


@functools.cache
def get_articles(query, library=paper_dir_filepath, top_k=5):
    """This function gets the top_k articles based on a user's query, sorted by relevance.
    It also downloads the files and stores them in arxiv_library.csv to be retrieved by the read_article_and_summarize.
    """
    search = arxiv.Search(
        query=query, max_results=top_k, sort_by=arxiv.SortCriterion.Relevance
    )
    result_list = []
    for result in search.results():
        result_dict = {}
        result_dict.update({"title": result.title})
        result_dict.update({"summary": result.summary})

        # Taking the first url provided
        result_dict.update({"article_url": [x.href for x in result.links][0]})
        result_dict.update({"pdf_url": [x.href for x in result.links][1]})
        result_list.append(result_dict)

        # Store references in library file
        response = embedding_request(text=result.title)
        file_reference = [
            result.title,
            result.download_pdf(data_dir),
            response.data[0].embedding,
        ]

        # Append to the library file (the with-block closes the file on exit)
        with open(library, "a", newline="") as f_object:
            writer_object = writer(f_object)
            writer_object.writerow(file_reference)
    return result_list

def strings_ranked_by_relatedness(
    query: str,
    df: pd.DataFrame,
    relatedness_fn=lambda x, y: 1 - spatial.distance.cosine(x, y),
    top_n: int = 100,
) -> list[str]:
    """Returns the top_n file paths, sorted from most to least related to the query."""
    query_embedding_response = embedding_request(query)
    query_embedding = query_embedding_response.data[0].embedding
    strings_and_relatednesses = [
        (row["filepath"], relatedness_fn(query_embedding, row["embedding"]))
        for _, row in df.iterrows()
    ]
    strings_and_relatednesses.sort(key=lambda x: x[1], reverse=True)
    # Keep only the file paths; the relatedness scores are used for sorting only
    strings = [filepath for filepath, _ in strings_and_relatednesses]
    return strings[:top_n]
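
To make the ranking step concrete, here is a dependency-free sketch of the same idea: score each stored embedding by cosine similarity to the query embedding (which is what `1 - spatial.distance.cosine(x, y)` computes) and sort in descending order. The file names and three-dimensional vectors are toy values chosen for illustration; real `text-embedding-ada-002` embeddings have 1536 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (1 - cosine distance)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" keyed by hypothetical file names
library = {
    "paper_a.pdf": [1.0, 0.0, 0.0],
    "paper_b.pdf": [0.7, 0.7, 0.0],
    "paper_c.pdf": [0.0, 0.0, 1.0],
}
query_embedding = [1.0, 0.1, 0.0]

# Most related file first
ranked = sorted(
    library,
    key=lambda path: cosine_similarity(query_embedding, library[path]),
    reverse=True,
)
print(ranked)
```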

#### PDF parsing, chunking and summarizing

In [None]:
def read_pdf(filepath):
    """Takes a filepath to a PDF and returns a string of the PDF's contents"""
    # creating a pdf reader object
    reader = PdfReader(filepath)
    pdf_text = ""
    page_number = 0
    for page in reader.pages:
        page_number += 1
        pdf_text += page.extract_text() + f"\nPage Number: {page_number}"
    return pdf_text


# Split a text into smaller chunks of size n, preferably ending at the end of a sentence
def create_chunks(text, n, tokenizer):
    """Returns successive n-sized chunks from provided text."""
    tokens = tokenizer.encode(text)
    i = 0
    while i < len(tokens):
        # Find the nearest end of sentence within a range of 0.5 * n and 1.5 * n tokens
        j = min(i + int(1.5 * n), len(tokens))
        while j > i + int(0.5 * n):
            # Decode the tokens and check for full stop or newline
            chunk = tokenizer.decode(tokens[i:j])
            if chunk.endswith(".") or chunk.endswith("\n"):
                break
            j -= 1
        # If no end of sentence found, use n tokens as the chunk size
        if j == i + int(0.5 * n):
            j = min(i + n, len(tokens))
        yield tokens[i:j]
        i = j


def extract_chunk(content, template_prompt):
    """This function applies a prompt to some input content. In this case it returns a summarized chunk of text"""
    prompt = template_prompt + content
    response = client.chat.completions.create(
        model=GPT_MODEL,
        temperature=0,
        seed=42,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content


def summarize_text(query):
    """This function does the following:
    - Reads in the arxiv_library.csv file, including the embeddings
    - Finds the closest file to the user's query
    - Scrapes the text out of the file and chunks it
    - Summarizes each chunk, one after another
    - Does one final summary and returns the chat completion response"""

    # A prompt to dictate how the recursive summarizations should approach the input paper
    summary_prompt = """Summarize this text from an academic paper. Extract any key points with reasoning.\n\nContent:"""

    # If the library is empty (no searches have been performed yet), we perform one and download the results
    library_df = pd.read_csv(paper_dir_filepath).reset_index()
    if len(library_df) == 0:
        print("No papers searched yet, downloading first.")
        get_articles(query)
        print("Papers downloaded, continuing")
        library_df = pd.read_csv(paper_dir_filepath).reset_index()
    library_df.columns = ["title", "filepath", "embedding"]
    library_df["embedding"] = library_df["embedding"].apply(ast.literal_eval)
    strings = strings_ranked_by_relatedness(query, library_df, top_n=1)
    print("Chunking text from paper")
    pdf_text = read_pdf(strings[0])

    # Initialise tokenizer
    tokenizer = tiktoken.get_encoding("cl100k_base")
    results = ""

    # Chunk up the document into chunks of roughly 768 tokens (between 0.5x and 1.5x that size)
    chunks = create_chunks(pdf_text, 768, tokenizer)
    text_chunks = [tokenizer.decode(chunk) for chunk in chunks]
    print("Summarizing each chunk of text")

    # Parallel process the summaries
    # with concurrent.futures.ThreadPoolExecutor(
    #     max_workers=len(text_chunks)
    # ) as executor:
    #     futures = [
    #         executor.submit(extract_chunk, chunk, summary_prompt)
    #         for chunk in text_chunks
    #     ]
    #     with tqdm(total=len(text_chunks)) as pbar:
    #         for _ in concurrent.futures.as_completed(futures):
    #             pbar.update(1)
    #     for future in futures:
    #         data = future.result()
    #         results += data

    # Sequential fallback: parallel processing seems to misbehave in Colab
    for chunk in text_chunks:
        results += extract_chunk(chunk, summary_prompt)

    # Final summary
    print("Summarizing into overall summary")
    response = client.chat.completions.create(
        model=GPT_MODEL,
        temperature=0,
        seed=42,
        messages=[
            {
                "role": "user",
                "content": f"""Write a summary collated from this collection of key points extracted from an academic paper.
                        The summary should highlight the core argument, conclusions and evidence, and answer the user's query.
                        User query: {query}
                        The summary should be structured in bulleted lists following the headings Core Argument, Evidence, and Conclusions.
                        Key points:\n{results}\nSummary:\n""",
            }
        ],
    )
    return response
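
The sentence-aware splitting in `create_chunks` is easier to follow on a small input. The sketch below repeats the function so it runs on its own and swaps the tiktoken encoding for `CharTokenizer`, a hypothetical one-token-per-character stand-in, so that chunk sizes can be read directly as character counts.

```python
class CharTokenizer:
    """Hypothetical stand-in for a tiktoken encoding: one token per character."""

    def encode(self, text):
        return [ord(ch) for ch in text]

    def decode(self, tokens):
        return "".join(chr(t) for t in tokens)


def create_chunks(text, n, tokenizer):
    """Same logic as the notebook's create_chunks, repeated so the sketch runs alone."""
    tokens = tokenizer.encode(text)
    i = 0
    while i < len(tokens):
        # Start from the largest allowed chunk (1.5 * n) and shrink it
        j = min(i + int(1.5 * n), len(tokens))
        while j > i + int(0.5 * n):
            chunk = tokenizer.decode(tokens[i:j])
            if chunk.endswith(".") or chunk.endswith("\n"):
                break
            j -= 1
        # No sentence boundary found in range: fall back to exactly n tokens
        if j == i + int(0.5 * n):
            j = min(i + n, len(tokens))
        yield tokens[i:j]
        i = j


tokenizer = CharTokenizer()
text = "PPO clips the policy update. It is simple to tune. It works well in practice."
chunks = [tokenizer.decode(c) for c in create_chunks(text, 20, tokenizer)]
for chunk in chunks:
    print(repr(chunk))
```

With `n=20`, no chunk can exceed 30 tokens, and joining the chunks reproduces the input text exactly, so nothing is dropped or duplicated at chunk boundaries.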

#### Test functions

In [None]:
# Test that the search is working
result_output = get_articles("ppo reinforcement learning")
result_output[0]

{'title': 'Proximal Policy Optimization and its Dynamic Version for Sequence Generation',
 'summary': 'In sequence generation task, many works use policy gradient for model\noptimization to tackle the intractable backpropagation issue when maximizing\nthe non-differentiable evaluation metrics or fooling the discriminator in\nadversarial learning. In this paper, we replace policy gradient with proximal\npolicy optimization (PPO), which is a proved more efficient reinforcement\nlearning algorithm, and propose a dynamic approach for PPO (PPO-dynamic). We\ndemonstrate the efficacy of PPO and PPO-dynamic on conditional sequence\ngeneration tasks including synthetic experiment and chit-chat chatbot. The\nresults show that PPO and PPO-dynamic can beat policy gradient by stability and\nperformance.',
 'article_url': 'http://arxiv.org/abs/1808.07982v1',
 'pdf_url': 'http://arxiv.org/pdf/1808.07982v1'}

In [None]:
# Test the summarize_text function works
chat_test_response = summarize_text("PPO reinforcement learning sequence generation")
print(chat_test_response.choices[0].message.content)

Chunking text from paper
Summarizing each chunk of text


100%|██████████| 8/8 [04:20<00:00, 32.53s/it]


Summarizing into overall summary
Core Argument:
- The academic paper discusses the use of Proximal Policy Optimization (PPO) in sequence generation tasks, specifically in the context of chit-chat chatbots.
- The authors propose a dynamic approach for PPO (PPO-dynamic) and compare its efficacy to policy gradient, a commonly used method for model optimization in sequence generation.
- The authors argue that PPO is a more efficient optimization method compared to policy gradient, and they modify the constraints of PPO to make it more dynamic and flexible, leading to further improvements in training.

Evidence:
- The authors use a sequence-to-sequence model (seq2seq) with a gated recurrent unit (GRU) as the chatbot model.
- Sentence generation can be formulated as a Markov decision process (MDP) and reinforcement learning methods are suitable for this task.
- PPO is a modified version of trust region policy optimization (TRPO) and aims to maximize a surrogate objective while constraining t
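
Each chunk summary is independent of the others, so the per-chunk step is a natural candidate for concurrency when the environment allows it. A minimal sketch of the pattern, assuming a thread pool is acceptable outside Colab; `summarize_chunks_in_parallel` and `fake_summarize` are hypothetical names, and the fake summarizer stands in for the real API call so the example runs offline:

```python
from concurrent.futures import ThreadPoolExecutor


def summarize_chunks_in_parallel(chunks, summarize_fn, max_workers=4):
    """Apply summarize_fn to every chunk using a thread pool.

    executor.map returns results in the order of the inputs, so the
    summaries can be concatenated without re-sorting.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(summarize_fn, chunks))


def fake_summarize(chunk):
    """Hypothetical stand-in for an API call such as extract_chunk."""
    return chunk[:10].upper()


summaries = summarize_chunks_in_parallel(
    ["first chunk of text", "second chunk of text"], fake_summarize
)
print(summaries)
```

Capping `max_workers` at a small number, rather than one worker per chunk, is a conservative choice that may help stay within API rate limits.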

### 3.3 Agent configuration

#### Function specifications

In [None]:
# JSON schema specifications for the two tools exposed to the model:
# get_articles and read_article_and_summarize (the latter is handled by summarize_text)
arxiv_functions = [
    {
        "name": "get_articles",
        "description": """Use this function to get academic papers from arXiv to answer user questions.""",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": f"""
                            User query in JSON. Responses should be summarized and should include the article URL reference
                            """,
                }
            },
            "required": ["query"],
        },
    },
    {
        "name": "read_article_and_summarize",
        "description": """Use this function to read whole papers and provide a summary for users.
        You should NEVER call this function before get_articles has been called in the conversation.""",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": f"""
                            Description of the article in plain text based on the user's query
                            """,
                }
            },
            "required": ["query"],
        },
    }
]

# functions have been deprecated in favor of 'tools'
# see https://platform.openai.com/docs/guides/function-calling
arxiv_tools = [
    {"type": "function", "function": f} for f in arxiv_functions
]
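
For reference, this is roughly what a single wrapped entry looks like once it is placed in the `tools` list. The sketch also includes `missing_required`, a hypothetical sanity check (not part of the OpenAI library) that flags required parameter names lacking a schema entry, a mistake that is easy to make when editing specs by hand.

```python
import json

# A function spec in the same shape as the entries of arxiv_functions
example_function = {
    "name": "get_articles",
    "description": "Use this function to get academic papers from arXiv.",
    "parameters": {
        "type": "object",
        "properties": {"query": {"type": "string", "description": "Search query"}},
        "required": ["query"],
    },
}


def missing_required(spec):
    """Hypothetical check: required parameter names that have no schema entry."""
    params = spec["parameters"]
    return [name for name in params.get("required", []) if name not in params["properties"]]


# Wrap the spec the way the `tools` parameter expects it
example_tool = {"type": "function", "function": example_function}
print(json.dumps(example_tool, indent=2))
print("Missing required parameters:", missing_required(example_function))
```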

#### Function calling

In [None]:
def call_arxiv_function(tool_call, messages):
    """Function calling function which executes function calls when the model believes it is necessary.
    Currently extended by adding clauses to this if statement."""

    if tool_call.function.name == "get_articles":
        try:
            parsed_output = json.loads(tool_call.function.arguments)
            print("Getting search results")
            results = get_articles(parsed_output["query"])
            messages.append(
              {
                  "tool_call_id": tool_call.id,
                  "role": "tool",
                  "name": tool_call.function.name,
                  "content": "Articles added to knowledge base; can now call `read_article_and_summarize`",
              }
            )
            return messages

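        except Exception as e:
            # Print the raw arguments: parsed_output is undefined if json.loads failed
            print(tool_call.function.arguments)
            print("Function execution failed")
            print(f"Error message: {e}")
            # Every tool call needs a matching tool message, so report the failure to the model
            messages.append(
                {
                    "tool_call_id": tool_call.id,
                    "role": "tool",
                    "name": tool_call.function.name,
                    "content": f"Function execution failed: {e}",
                }
            )
            return messages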


    elif (
        tool_call.function.name == "read_article_and_summarize"
    ):
        parsed_output = json.loads(tool_call.function.arguments)
        print("Finding and reading paper")
        second_response = summarize_text(parsed_output["query"])
        messages.append(
            {
                "tool_call_id": tool_call.id,
                "role": "tool",
                "name": tool_call.function.name,
                "content": second_response.choices[0].message.content,
            }
        )
        return messages

    else:
        raise Exception("Function does not exist and cannot be called")


def run_conversation(messages, tools):
    response = client.chat.completions.create(
        model=GPT_MODEL,
        messages=messages,
        tools=tools,
        tool_choice="auto",  # auto is default, but we'll be explicit
    )
    # workaround bug https://github.com/openai/openai-python/issues/703
    response_message = response.choices[0].message
    tool_calls = response_message.tool_calls
    response_message = dict(response.choices[0].message)
    if response_message["content"] is None:
        response_message["content"] = ""
    if response_message["function_call"] is None:
        del response_message["function_call"]
    if tool_calls:
        messages.append(response_message)
        for tool_call in tool_calls:
            messages = call_arxiv_function(tool_call, messages)
        try:
            print("Got tool results, asking model to continue.")
            second_response = client.chat.completions.create(
                model=GPT_MODEL,
                temperature=0,
                seed=42,
                messages=messages
            )
            messages.append(
                {
                    "role": "assistant",
                    "content": second_response.choices[0].message.content,
                }
            )

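        except Exception as e:
            print(type(e))
            raise Exception("Function chat request failed") from e
    else:
        # No tool call was requested: keep the model's direct reply in the history
        messages.append({"role": "assistant", "content": response_message["content"]})
    return messages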
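
As the docstring of `call_arxiv_function` notes, supporting a new tool means adding another clause to the if statement. One possible alternative, sketched here under stated assumptions, is a dictionary that maps tool names to handlers. The names `TOOL_HANDLERS`, `dispatch_tool_call`, `fake_get_articles` and `fake_summarize` are hypothetical, and the handlers are offline stand-ins for `get_articles` and `summarize_text`.

```python
import json
from types import SimpleNamespace


def fake_get_articles(query):
    """Hypothetical stand-in for get_articles (no network access)."""
    return f"searched for: {query}"


def fake_summarize(query):
    """Hypothetical stand-in for summarize_text."""
    return f"summary about: {query}"


# Map tool names to handlers; adding a tool means adding one entry here
TOOL_HANDLERS = {
    "get_articles": fake_get_articles,
    "read_article_and_summarize": fake_summarize,
}


def dispatch_tool_call(tool_call):
    """Build the `tool` message for one tool call, reporting failures as content."""
    name = tool_call.function.name
    try:
        handler = TOOL_HANDLERS[name]
        arguments = json.loads(tool_call.function.arguments)
        content = handler(**arguments)
    except Exception as e:  # unknown tool, malformed JSON, or handler error
        content = f"Tool call failed: {type(e).__name__}: {e}"
    return {"tool_call_id": tool_call.id, "role": "tool", "name": name, "content": content}


# Fake object exposing the attributes read from a tool call:
# id, function.name and function.arguments (a JSON string)
fake_call = SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="get_articles", arguments='{"query": "ppo"}'),
)
print(dispatch_tool_call(fake_call))
```

Because failures are returned as the tool message content, every tool call still receives a reply, and the model gets a chance to recover in its next turn.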



#### Conversation display helper

In [None]:
def display_messages(messages):
    role_to_color = {
        "system": "red",
        "user": "green",
        "assistant": "blue",
        "tool": "magenta",
    }
    for message in messages:
        print(
            colored(
                f"{message['role']}: {message['content']}\n\n",
                role_to_color[message["role"]],
            )
        )

### 3.4 Agent conversation

In [None]:
# Start with a system message
paper_system_message = """You are arXivGPT, a helpful assistant pulls academic papers to answer user questions.
You summarize the papers clearly so the customer can decide which to read to answer their question.
You always provide the article_url and title so the user can understand the name of the paper and click through to access it.
Begin!"""
messages = [{"role": "system", "content": paper_system_message}]

In [None]:
# Add a user message
messages.append({"role": "user", "content": "Hi, how does PPO reinforcement learning work?"})
messages = run_conversation(
    messages, tools=arxiv_tools
)
display_messages(messages)

Getting search results


Got tool results, asking model to continue.
system: You are arXivGPT, a helpful assistant pulls academic papers to answer user questions.
You summarize the papers clearly so the customer can decide which to read to answer their question.
You always provide the article_url and title so the user can understand the name of the paper and click through to access it.
Begin!


user: Hi, how does PPO reinforcement learning work?


assistant: 


tool: Articles added to knowledge base; can now call `read_article_and_summarize`


assistant: I found a paper that explains how Proximal Policy Optimization (PPO) reinforcement learning works. The paper is titled "Proximal Policy Optimization Algorithms" by John Schulman et al. It provides a clear explanation of the PPO algorithm and its advantages in reinforcement learning.

Here is the link to the paper: [Proximal Policy Optimization Algorithms](https://arxiv.org/abs/1707.06347)

Shall I provide a summary of the paper for you?




In [None]:
# Add a user message
messages.append({"role": "user", "content": "Yes please do."})
messages = run_conversation(
    messages, tools=arxiv_tools
)
display_messages(messages)

Finding and reading paper
Chunking text from paper
Summarizing each chunk of text
Summarizing into overall summary
Got tool results, asking model to continue.
system: You are arXivGPT, a helpful assistant pulls academic papers to answer user questions.
You summarize the papers clearly so the customer can decide which to read to answer their question.
You always provide the article_url and title so the user can understand the name of the paper and click through to access it.
Begin!


user: Hi, how does PPO reinforcement learning work?


assistant: 


tool: Articles added to knowledge base; can now call `read_article_and_summarize`


assistant: I found a paper that explains how Proximal Policy Optimization (PPO) reinforcement learning works. The paper is titled "Proximal Policy Optimization Algorithms" by John Schulman et al. It provides a clear explanation of the PPO algorithm and its advantages in reinforcement learning.

Here is the link to the paper: [Proximal Policy Optimization Alg

## 4.0 Named entity recognition
If time allows, we can take a look at
https://cookbook.openai.com/examples/named_entity_recognition_to_enrich_text or any other use case participants are interested in.


## References
- [OpenAI Quickstart](https://platform.openai.com/docs/quickstart?context=python)
- [OpenAI Cookbook](https://cookbook.openai.com/)