# Introduction to Using OpenAI's Chat Models (Updated since DEVDAY 06/11/2023)

To begin exploring OpenAI's language models, we will focus on basic interactions through the Chat Completions API (exposed as `chat.completions` in the `openai` Python library). The first step is setting up the environment by installing the library our code depends on. Once that is done, we will use the `openai` library to interact with OpenAI's chat models, specifically GPT-3.5 Turbo and, if you have access to it, the more capable GPT-4. This introductory section walks through each step in turn.

In [None]:
%pip install openai

Next, we import the modules we need and configure access to OpenAI's API.

Here’s what we need to do:

1. Importing Modules: We will import the required Python modules that our code will use. These modules provide the functionality we need to communicate with the OpenAI API.

2. Setting the OpenAI API Key: To use OpenAI's services, we need to authenticate ourselves using an API key. You will have to set your OpenAI API key in the code. This key is unique to your account and serves as a passcode to access the API.

Please ensure that you handle your API keys securely and do not expose them in shared or public environments to prevent unauthorized usage.
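
One common way to keep the key out of the notebook itself is to read it from an environment variable. The sketch below assumes the key has been exported as `OPENAI_API_KEY` (the variable the `openai` library looks for by default); `load_api_key` is a hypothetical helper name introduced here for illustration, not part of the library.

```python
import os


def load_api_key(env=None, var_name="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable.

    `load_api_key` is an illustrative helper, not part of the openai library.
    """
    env = os.environ if env is None else env
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key
```

With this in place, the client could be created as `OpenAI(api_key=load_api_key())`, so the key never appears in a shared notebook.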

In [None]:
# Import the required modules
from openai import OpenAI
import IPython


# Create the client and set the API key

llm_client = OpenAI(
    api_key=''  # paste your key here; never commit a real key to a shared notebook
)
# Let's set our model
MODEL = "gpt-3.5-turbo"

To interact with OpenAI's language models, we need to choose which model we want to work with and then call the appropriate method to send our prompts. In this example, we'll use the client's `chat.completions.create` method, which is designed for conversational use.

When calling `chat.completions.create`, you should provide the following parameters:

1. The Language Model (LLM): Specify the particular language model you want to use. For this example, we'll be using "gpt-3.5-turbo", but you can choose a different one if you prefer, such as GPT-4, if it's available to you.

2. List of Messages: Provide the model with a list of messages, each a dictionary with a `role` and a `content` field. Together they represent the conversation history and the current message you want the model to respond to.

3. Temperature: This setting controls the randomness of the model's responses. A temperature of 0 makes the responses close to deterministic: the model favors the most likely next token at each step, although small differences between runs can still occur. As you increase the temperature, the model's responses become more varied and less predictable.

For the purpose of this demonstration, we'll focus on these three settings, although there are other options available that allow for more fine-tuning of the model's behavior.

In [None]:
# Let's demonstrate the predictive nature of this GenAI model by giving it some text to complete

response = llm_client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": "my life is potato, my blood is potato, when i sleep i dream of"}
    ],
    temperature=0.7,
)

# Let's look at the response. As you can see, it can predict a whole story just from one sentence. 
# You can increase the temperature to make word prediction more creative (random) or decrease to make it more deterministic (most probable token)
IPython.display.Markdown(response.choices[0].message.content)

The model works by predicting the most likely continuation of the input it receives, based on patterns it learned during training. By adjusting the "temperature" parameter, we can make its responses more predictable or more creative.

In more detail:

1. **Low Temperature Setting**: When the temperature is set close to zero, the model's responses are more predictable and consistent. This means if you send the same prompt multiple times, you're likely to get the same or very similar responses. This setting is ideal for tasks where precision and consistency are crucial, such as factual information retrieval or scenarios where you want predictable outputs.

2. **High Temperature Setting**: A higher temperature value allows the model to be more experimental and creative with its responses. This is particularly useful when you need the model to generate content, like stories or creative writing, where variability and originality are desired. However, as the temperature increases, so does the randomness of the responses.

3. **Caution with Very High Temperature**: Setting the temperature too high can lead to unpredictable and often irrelevant responses. At extreme levels, the model might produce outputs that seem nonsensical or 'garbage-like,' reflecting the vast and varied nature of its training data without a clear focus on your prompt's context.

Think of the temperature setting as a dial on the AI's creativity. Turning it down gives you a more straight-laced, factual assistant, while turning it up invites more out-of-the-box responses. Turning the dial too high, however, leads to output that is more abstract and less coherent, which shows the limits of the AI's ability to generate sensible content without specific guidance.
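
To make the dial analogy concrete, here is a toy sketch of temperature scaling: raw scores for candidate tokens are divided by the temperature before being turned into probabilities. The scores below are made up, and the notebook does not describe how OpenAI's models apply temperature internally, so treat this as an illustration of the general idea rather than the actual implementation.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, scaled by a temperature > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [score / temperature for score in logits]
    highest = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.0, 0.1]
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature nearly all of the probability lands on the top-scoring token; at a higher temperature the probabilities flatten out, so less likely tokens are sampled more often.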

In [None]:
# Conversational demonstration

response = llm_client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": "Hello, how are you?"}
    ],
    temperature=0.7,
)

IPython.display.Markdown(response.choices[0].message.content)

When interacting with OpenAI's Large Language Models (LLMs) through the API, you send a sequence of messages (prompts) and receive a response. The API is stateless with respect to these conversations: it processes the list of messages you provide and returns a response, without retaining any memory of them once the interaction is complete.

To maintain the flow of a conversation, it is essential to manage the context. This involves tracking the prompts and responses for each session. Here's how you can handle this:

- Storing the Conversation History: Since the API does not save the conversation history, you need to store the context yourself. This involves saving all the prompts and responses in memory or in a database throughout a session.

- Appending New Messages: With each new message from the user or new response from the assistant (the AI), you append this information to the conversation history. This updated list is then sent back to the API for the next interaction.

Types of Prompts: In managing the conversation, you typically deal with three kinds of messages:

- System: These are messages that set the AI's behavior or provide grounding prompts that establish the context for the conversation. For example, instructing the AI to behave like a certain character or to follow specific rules.

- Assistant: These represent the AI's responses. Each time you send the prompts to the API, you'll receive an 'Assistant' message back, which is the AI's contribution to the conversation.

- User: These are the actual messages sent by the user, which are intended to be responded to by the AI.

By effectively managing these messages, you can maintain a coherent and contextually relevant conversation with the AI. Each interaction relies on the history of prompts, so it's crucial to include all relevant past messages to ensure the AI understands the ongoing context.
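
The bookkeeping described above can be sketched as a small helper class. `Conversation` and its method names are hypothetical, introduced only for illustration; the one thing taken from the API is the message format, a list of dictionaries with `role` and `content` keys.

```python
class Conversation:
    """Keeps the message history that is resent on every API call."""

    def __init__(self, system_prompt=None):
        self.messages = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def add_user(self, content):
        self.messages.append({"role": "user", "content": content})

    def add_assistant(self, content):
        self.messages.append({"role": "assistant", "content": content})


chat = Conversation("You only respond in pirate language.")
chat.add_user("Hello, how are you?")
# In practice this text would come from response.choices[0].message.content
chat.add_assistant("Arr, I be fine!")
chat.add_user("What is 8 * 2?")
print(chat.messages)
```

On each turn, `chat.messages` is what would be passed as the `messages` argument, and the reply is stored with `add_assistant` before the next user message is added.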

In [None]:

memory = [
        {"role":"system","content": "You're a helpful assistant. You only respond in pirate language. You cannot do maths and will just swear."},
        {"role": "user", "content": "Hello, how are you?"},
        {"role": "assistant", "content": "I'm very well thank you!"},
        {"role": "user", "content": "What is 8 * 2?"},
    ]
response = llm_client.chat.completions.create(
    model=MODEL,
    messages= memory,
    temperature=0,
)


IPython.display.Markdown(response.choices[0].message.content)



In [None]:
# Let's append the bot response (it will say it hates maths, etc...) to the memory
bot_response = response.choices[0].message.content

memory.append(
    {"role":"assistant","content":bot_response}
)


# Let us ask another time
message = "Y do dis? that's mean!"

# We add the message to our memory list and submit it again
memory.append(
    {"role": "user", "content": message}
)

# Let's look at the memory that we will be submitting
print(memory)


In [None]:
response = llm_client.chat.completions.create(
    model=MODEL,
    messages= memory,
    temperature=0,
)


# Latest response after we updated the memory.

# Note: you resend the whole memory list with every new message you submit,
# so the API token limits apply to the entire list of messages, not just the latest one.

IPython.display.Markdown(response.choices[0].message.content)
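
Because the whole history is resent each time, long conversations eventually need trimming. Below is a minimal sketch of one possible approach: drop the oldest non-system messages until the history fits a budget. `trim_history` and `max_chars` are hypothetical names, and counting characters is only a rough stand-in for counting tokens; an actual tokenizer (for example the `tiktoken` package) would give exact numbers.

```python
def trim_history(messages, max_chars=8000):
    """Drop the oldest non-system messages until the history fits the budget.

    Character count is only a rough stand-in for tokens (an assumption of this sketch).
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def size(msgs):
        return sum(len(m["content"]) for m in msgs)

    # Always keep the most recent message so the model has something to answer.
    while len(rest) > 1 and size(system + rest) > max_chars:
        rest.pop(0)
    return system + rest
```

System messages are kept because they carry the instructions that shape every reply; only older user and assistant turns are discarded.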

If you've been using ChatGPT or any other kind of chatbot, you will have noticed that the response is streamed back to you rather than arriving all at once after a wait. Streaming works much better, especially when you use an LLM for long-context tasks or coding. Let us see how this works! We will take the previous example and have the LLM stream the response back token by token through a Python generator.

Unlike lists and dictionaries, which hold all of their items in memory at once, a generator produces its items (in this case tokens) one by one until it is exhausted.
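
As a quick illustration of that difference, independent of the OpenAI API, the sketch below contrasts a function that builds a full list with a generator that hands out one item at a time. Both function names are made up for this example.

```python
def numbers_as_list(n):
    """Builds the whole list in memory before returning it."""
    return [i for i in range(1, n + 1)]


def numbers_as_generator(n):
    """Yields one number at a time; nothing is computed until asked for."""
    for i in range(1, n + 1):
        yield i


gen = numbers_as_generator(3)
print(next(gen))  # the first item is produced on demand
for value in gen:  # the loop picks up where next() left off
    print(value)
```

The streamed API response behaves like `gen`: each iteration of the loop waits for the next chunk instead of waiting for the complete answer.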

In [None]:

memory = [
        {"role":"system","content": "You're a helpful assistant. You only respond in pirate language. You cannot do maths and will just swear."},
        {"role": "user", "content": "Hello, how are you?"},
        {"role": "assistant", "content": "I'm very well thank you!"},
        {"role": "user", "content": "What is 8 * 2?"},
    ]
response = llm_client.chat.completions.create(
    model=MODEL,
    messages= memory,
    temperature=0,
    stream=True # That's all
)

# Now iterate over the response like a generator. Instead of a full message,
# each chunk carries a delta holding the next piece of the content.

for delta in response:
    token = delta.choices[0].delta
    if token.content:
        print(token.content, end="")  # end="" keeps the tokens on one line instead of one per line


Passing tokens on one at a time can be too slow. Instead, we can gather tokens from the generator until we have 30, send what we have so far, and repeat until the response is complete:

In [None]:
stream_updates = '' # create an empty string to gather the tokens
counter = 0 # create a counter so we can keep track of the tokens as they come in

memory = [
        {"role": "user", "content": "Count from 1 to 50 please."},
    ]
response = llm_client.chat.completions.create(
    model=MODEL,
    messages= memory,
    temperature=0,
    stream=True # That's all
)


for delta in response:
    token = delta.choices[0].delta
    if token.content:
        stream_updates += token.content  # append the token to the stream_updates string above
        counter += 1  # increment the counter by 1
    if counter >= 30:  # when we have reached 30 tokens
        print(stream_updates)  # show everything gathered so far
        counter = 0  # reset the counter so we can gather the next 30 tokens, and so on until completion

if counter:  # show the final text, including tokens left over after the last full batch
    print(stream_updates)

So how would this work in practice? In Microsoft Teams, where I built my chatbot, I create a new message and keep updating it every 30 tokens; once the stream is complete, I update it one final time to show the full message. Here is a GIF of how it works:

<img src="./media/stream_teams.gif" alt="streaming in action" width="600"/>
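
The update pattern described above can be sketched without any chat platform. In the code below, `stream_in_batches` is a hypothetical helper that yields the accumulated text every `batch_size` tokens and once more at the end, and `update_message` is a stand-in for whatever call edits a message in place (the real Teams call is not shown in this notebook). The tokens are faked so that the example runs on its own.

```python
def stream_in_batches(tokens, batch_size=30):
    """Yield the accumulated text every `batch_size` tokens, then once more at the end."""
    text = ""
    pending = 0
    for token in tokens:
        text += token
        pending += 1
        if pending >= batch_size:
            yield text
            pending = 0
    if pending:  # flush whatever is left after the last full batch
        yield text


def update_message(text):
    """Hypothetical stand-in for the call that edits a chat message in place."""
    print(f"[message updated] {text}")


fake_tokens = [f"{i} " for i in range(1, 8)]  # pretend these came from the stream
for snapshot in stream_in_batches(fake_tokens, batch_size=3):
    update_message(snapshot)
```

Each snapshot contains the full text so far, which suits an "edit the same message" workflow; the final flush makes sure the last few tokens are not lost.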

