In [1]:
import os
import cohere
from dotenv import load_dotenv
load_dotenv('../.env')

True

In [2]:
co = cohere.Client(os.environ['COHERE_API_KEY'])

In [11]:
message = "Hello World!"
MODEL = "command"
COST_PER_1M_TOKENS = {'input': 1.0, 'output': 2.0} # https://cohere.com/pricing

In [4]:
api_params = dict(message=message, model=MODEL, temperature=0.9)

In [5]:
response = co.chat(**api_params)
answer = response.text

In [6]:
print(answer)

Hi there! How can I assist you today? This is Coral, an AI-assistant chatbot, welcoming you to engage in a conversation with me. If you have any questions or need assistance with a particular task, feel free to let me know, and I'll do my best to provide you with helpful and thorough responses. 

It's great to chat with you, have a fantastic day!


In [17]:
input_cost = response.meta['billed_units']['input_tokens'] * COST_PER_1M_TOKENS['input'] / 1000000
output_cost = response.meta['billed_units']['output_tokens'] * COST_PER_1M_TOKENS['output'] / 1000000
print(f"Input cost: ${input_cost} \nOutput cost: ${output_cost}")

Input cost: $5.4e-05 
Output cost: $0.00016
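
The cost arithmetic above is simply token count × price per 1M tokens ÷ 1,000,000. Working backwards from the printed numbers, $5.4e-05 at $1.0/1M implies 54 billed input tokens, and $0.00016 at $2.0/1M implies 80 billed output tokens. Below is a minimal sketch that wraps the formula in a reusable helper. The name `request_cost` is hypothetical, and the prices are the notebook's `COST_PER_1M_TOKENS`, which may be out of date. Formatting with `:.6f` avoids the scientific notation seen in the output above.

```python
# Prices in USD per 1M tokens, copied from the notebook's COST_PER_1M_TOKENS.
# Assumption: these match https://cohere.com/pricing at the time you run this.
COST_PER_1M_TOKENS = {"input": 1.0, "output": 2.0}


def request_cost(input_tokens: int, output_tokens: int,
                 prices: dict = COST_PER_1M_TOKENS) -> dict:
    """Hypothetical helper: USD cost of one request, split into input/output."""
    input_cost = input_tokens * prices["input"] / 1_000_000
    output_cost = output_tokens * prices["output"] / 1_000_000
    return {"input": input_cost, "output": output_cost,
            "total": input_cost + output_cost}


# Token counts implied by the costs printed above (54 input, 80 output).
cost = request_cost(input_tokens=54, output_tokens=80)
print(f"Input cost: ${cost['input']:.6f}")    # Input cost: $0.000054
print(f"Output cost: ${cost['output']:.6f}")  # Output cost: $0.000160
```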


## Chat History

In [20]:
chat_history = [
    {"user_name": "User", "text": "Hey!"},
    {"user_name": "Chatbot", "text": "Hey! How can I help you today?"}
]
message = "What do LLMs do?"
api_params = dict(chat_history=chat_history, message=message, model=MODEL, temperature=0.9)

In [21]:
response = co.chat(**api_params)
answer = response.text

In [22]:
print(answer)

Large Language Models (LLMs) are artificial intelligence tools that have been trained on massive amounts of text data and can generate human-like language in response to prompts. Their ability to comprehend and create natural language makes them valuable for a variety of tasks, including:

1. **Language Translation:** LLMs can translate text or speech from one language to another with increasing accuracy and fluency. They can handle complex linguistic tasks and capture the nuances of different languages, aiding in cross-language communication.

2. **Text Completion:** LLMs can generate coherent and contextually relevant text when given a prompt or a starting point. This is useful for generating descriptive language, completing partial sentences or paragraphs, or even storytelling.

3. **Question Answering**: Given a question, LLMs can retrieve the relevant information from the knowledge database they have been trained on and provide answers. They can handle complex questions and delive
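
In a multi-turn conversation, the history has to be extended after every exchange and sent again with the next message. The cell above uses `user_name`/`text` keys, while the later cells in this notebook use `role`/`message` with `USER`/`CHATBOT`. Which shape the API accepts depends on your SDK/API version, so check the Cohere chat reference. The sketch below assumes the `role`/`message` shape, and `add_turn` is a hypothetical helper, not part of the Cohere SDK.

```python
from typing import Dict, List


def add_turn(history: List[Dict[str, str]], user_msg: str,
             bot_msg: str) -> List[Dict[str, str]]:
    """Hypothetical helper: return a new history with one exchange appended."""
    return history + [
        {"role": "USER", "message": user_msg},
        {"role": "CHATBOT", "message": bot_msg},
    ]


history: List[Dict[str, str]] = []
history = add_turn(history, "Hey!", "Hey! How can I help you today?")
# The next request would then pass it along, e.g.:
# co.chat(message="What do LLMs do?", chat_history=history, model=MODEL)
print(len(history))  # 2
```

Returning a new list instead of mutating the argument keeps earlier snapshots of the conversation intact, which helps when comparing token counts with and without history as in the last section.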

## Streaming

In [30]:
api_params['stream'] = True
for event in co.chat(**api_params):
    if event.event_type == 'stream-start':
        continue
    elif event.event_type == 'stream-end':
        break
    print(event.text, end='', flush=True)

Large Language Models (LLMs) are artificial intelligence tools that have been trained on massive amounts of text data and can generate human-like language in response to prompts. They are designed to perform a wide range of natural language processing tasks, including:

1. **Language Translation**: LLMs can translate text or speech from one language to another with increasing accuracy and fluency. 

2. **Text Completion**: LLMs can generate coherent and contextually appropriate responses to partial or truncated sentences, helping to create natural-sounding dialogue.

3. **Question Answering**: Given a question, LLMs can retrieve the relevant information from the knowledge base and provide a response. They are able to understand the context and the relationships between words to provide more accurate answers.

4. **Story Generation**: LLMs can create unique and coherent stories when provided with a prompt or a set of guidelines. They can also summarize long texts while retaining the key

In [55]:
api_params['stream'] = True
full_response = ""
for event in co.chat(**api_params):
    if event.event_type == 'stream-start':
        continue
    elif event.event_type == 'stream-end':
        break
    elif event.event_type == 'text-generation':
        print(event.text, end='', flush=True)
        full_response += event.text

Hello to you as well! How can I assist you today? If you would like, we can have a conversation about anything you'd like to discuss. Alternatively, if you have any questions or need assistance with a particular task, feel free to let me know, and I'll do my best to help you out. 

Would you like me to suggest some conversation topics?
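
The second loop is safer than the first because it reads `.text` only from `text-generation` events. Other event types may not carry text. The same filtering logic can be tested offline by faking the events with `types.SimpleNamespace`. This is a sketch: `collect_stream` is a hypothetical helper, and the fake events only imitate the `event_type`/`text` attributes used above.

```python
from types import SimpleNamespace


def collect_stream(events) -> str:
    """Join the text of 'text-generation' events, stopping at 'stream-end'."""
    parts = []
    for event in events:
        if event.event_type == "stream-end":
            break
        if event.event_type == "text-generation":
            parts.append(event.text)
    return "".join(parts)


# Fake events standing in for co.chat(..., stream=True) so no API key is needed.
fake_events = [
    SimpleNamespace(event_type="stream-start"),
    SimpleNamespace(event_type="text-generation", text="Hello "),
    SimpleNamespace(event_type="text-generation", text="there!"),
    SimpleNamespace(event_type="stream-end"),
]
print(collect_stream(fake_events))  # Hello there!
```

Collecting chunks in a list and joining once avoids repeated string concatenation, though for short replies `+=` as in the cell above is fine.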

In [43]:
from tokenizers import Tokenizer
# https://huggingface.co/Cohere/Command-nightly
tokenizer = Tokenizer.from_pretrained("Cohere/command-nightly")
enc = tokenizer.encode(full_response)
number_tokens = len(enc.ids)
output_cost = number_tokens * COST_PER_1M_TOKENS['output'] / 1000000
print("Number of tokens in response:", number_tokens)
print(f"\nOutput cost: ${output_cost}")

Number of tokens in response: 424

Output cost: $0.000848


### Does this token count match what Cohere actually bills?

In [75]:
message = "Hello World!"
preamble = ""
api_params = dict(message=message, model=MODEL, temperature=0.9, preamble_override=preamble, chat_history=[])
response = co.chat(**api_params)

#### Responses: does the open-source tokenizer match what the Cohere API reports?

In [76]:
response.meta['billed_units']['output_tokens']

61

In [77]:
enc = tokenizer.encode(response.text)
number_tokens = len(enc.ids)
print("Number of tokens in response:", number_tokens)

# Observation: the local tokenizer consistently counts 2 more tokens than the billed output_tokens for the same response

Number of tokens in response: 63


In [81]:
response.meta['billed_units']['input_tokens']

54

In [84]:
response.message

'Hello World!'
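
`INPUT_BUFFER_TOKENS` and `OUTPUT_BUFFER_TOKENS` in the next cell are empirical offsets between the billed count and the local tokenizer count. The output offset follows directly from the cells above: 61 billed vs 63 local gives -2. The input offset presumably comes from the same comparison on the prompt side. A small hypothetical helper makes the calibration explicit and fails loudly if the offset is not constant across samples:

```python
from typing import Iterable, Tuple


def constant_offset(pairs: Iterable[Tuple[int, int]]) -> int:
    """Return billed - local if it is identical for every (local, billed) pair."""
    offsets = {billed - local for local, billed in pairs}
    if len(offsets) != 1:
        raise ValueError(f"offset is not constant: {sorted(offsets)}")
    return offsets.pop()


# Output side, using the counts above: 63 local tokens vs 61 billed.
print(constant_offset([(63, 61)]))  # -2
```

Feeding it several (local, billed) pairs, as the next cell effectively does, is what justifies using a single constant.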

In [89]:
sample_msgs = [
    "Hello World! blah blah blah blah",
    "What do LLMs do? 27l. 91. 26.",
    "How do I use the API? {storage: [1, 2, 3], compute: [ec2, azure-vm, compute-engine}",
    "What is the meaning of life?",
    "What is the best movie of all time?",
]

data = {
    'msg': [],
    'response': [],
    'estimated_input_tokens': [],
    'estimated_output_tokens': [],
    'actual_input_tokens': [],
    'actual_output_tokens': []
}

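INPUT_BUFFER_TOKENS = 49   # empirical: billed input tokens minus local token count of the message
OUTPUT_BUFFER_TOKENS = -2  # empirical: billed output tokens minus local token count of the response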

for msg in sample_msgs:
    
    api_params = dict(message=msg, model=MODEL, temperature=0.9, preamble_override=preamble, chat_history=[])
    response = co.chat(**api_params)

    est_in = len(tokenizer.encode(msg).ids) + INPUT_BUFFER_TOKENS
    est_out = len(tokenizer.encode(response.text).ids) + OUTPUT_BUFFER_TOKENS
    
    data['msg'].append(msg)
    data['response'].append(response.text)
    data['estimated_input_tokens'].append(est_in)
    data['estimated_output_tokens'].append(est_out)
    data['actual_input_tokens'].append(response.meta['billed_units']['input_tokens'])
    data['actual_output_tokens'].append(response.meta['billed_units']['output_tokens'])

In [91]:
import pandas as pd
pd.DataFrame(data)

| | msg | response | estimated_input_tokens | estimated_output_tokens | actual_input_tokens | actual_output_tokens |
|---|---|---|---|---|---|---|
| 0 | Hello World! blah blah blah blah | Hello to you as well! It's great to hear from ... | 58 | 112 | 58 | 112 |
| 1 | What do LLMs do? 27l. 91. 26. | I'm sorry, I am unable to respond to your requ... | 64 | 68 | 64 | 68 |
| 2 | How do I use the API? {storage: [1, 2, 3], com... | To use the API, you need to perform the follow... | 82 | 452 | 82 | 452 |
| 3 | What is the meaning of life? | The meaning of life is a philosophical questio... | 58 | 212 | 58 | 212 |
| 4 | What is the best movie of all time? | Determining the "best movie of all time" is su... | 60 | 541 | 60 | 541 |


#### Prompts: does Cohere count `chat_history` against the billed tokens?

In [111]:
sample_msgs_w_history = [
    (
        "Hello World! blah blah blah blah", 
        [
            {"role": "USER", "message": "Who discovered gravity?"},
            {"role": "CHATBOT", "message": "Isaac Newton"}
        ]
    ),
    (
        "What do LLMs do? 27l. 91. 26.", 
        [
            {"role": "USER", "message": "Who contributes to the international space station?"},
            {"role": "CHATBOT", "message": "The United States, Russia, Japan, Canada, and the European Space Agency"}
        ]
    ),
]

data = {
    'msg': [],
    'response': [],
    'estimated_input_tokens': [],
    'estimated_output_tokens': [],
    'actual_input_tokens': [],
    'actual_output_tokens': []
}

def get_n_tokens(string):
    return len(tokenizer.encode(string).ids)

INPUT_BUFFER_TOKENS = 49
CHAT_HISTORY_ADJUSTMENT = -4  # empirical correction applied once when chat_history is non-empty (fitted on two-turn histories)
OUTPUT_BUFFER_TOKENS = -2

for msg, history in sample_msgs_w_history:
    
    api_params = dict(message=msg, model=MODEL, temperature=0.9, preamble_override=preamble, chat_history=history)
    response = co.chat(**api_params)

    est_in = (
        get_n_tokens(msg)
        + sum(get_n_tokens(h['message']) for h in history)
        + INPUT_BUFFER_TOKENS
        + CHAT_HISTORY_ADJUSTMENT
    )
    est_out = get_n_tokens(response.text) + OUTPUT_BUFFER_TOKENS
    
    data['msg'].append(msg)
    data['response'].append(response.text)
    data['estimated_input_tokens'].append(est_in)
    data['estimated_output_tokens'].append(est_out)
    data['actual_input_tokens'].append(response.meta['billed_units']['input_tokens'])
    data['actual_output_tokens'].append(response.meta['billed_units']['output_tokens'])

In [112]:
pd.DataFrame(data)

| | msg | response | estimated_input_tokens | estimated_output_tokens | actual_input_tokens | actual_output_tokens |
|---|---|---|---|---|---|---|
| 0 | Hello World! blah blah blah blah | Hello to you too! \n\nWould you like me to bla... | 65 | 19 | 65 | 19 |
| 1 | What do LLMs do? 27l. 91. 26. | I'm sorry, I am unable to respond to this requ... | 87 | 98 | 87 | 98 |
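
With the offsets calibrated, the two estimation formulas can be packaged as a pre-flight estimator that runs before calling the API. This is a sketch under the notebook's conditions: model `command` and an empty `preamble_override`. `CHAT_HISTORY_ADJUSTMENT` was only checked on two-turn histories, so treat it as provisional. The function names are hypothetical. `count_tokens` is any callable returning a token count. In the notebook you would pass `get_n_tokens`. The usage example passes a whitespace word counter so it runs offline, and its numbers will not match Cohere's billing.

```python
from typing import Callable, Dict, List

INPUT_BUFFER_TOKENS = 49       # empirical offset, see the calibration above
CHAT_HISTORY_ADJUSTMENT = -4   # provisional: fitted on two-turn histories only
OUTPUT_BUFFER_TOKENS = -2      # empirical offset, see the calibration above


def estimate_input_tokens(message: str, history: List[Dict[str, str]],
                          count_tokens: Callable[[str], int]) -> int:
    """Estimate billed input tokens for one chat request."""
    total = count_tokens(message) + INPUT_BUFFER_TOKENS
    if history:  # the adjustment was only observed when history is present
        total += sum(count_tokens(turn["message"]) for turn in history)
        total += CHAT_HISTORY_ADJUSTMENT
    return total


def estimate_output_tokens(response_text: str,
                           count_tokens: Callable[[str], int]) -> int:
    """Estimate billed output tokens for one response."""
    return count_tokens(response_text) + OUTPUT_BUFFER_TOKENS


def word_count(text: str) -> int:
    """Offline stand-in for get_n_tokens; NOT Cohere's tokenizer."""
    return len(text.split())


history = [
    {"role": "USER", "message": "Who discovered gravity?"},
    {"role": "CHATBOT", "message": "Isaac Newton"},
]
print(estimate_input_tokens("Hello World!", [], word_count))       # 51
print(estimate_input_tokens("Hello World!", history, word_count))  # 52
print(estimate_output_tokens("Hello to you too!", word_count))     # 2
```

Taking the counter as a parameter keeps the estimator independent of the Hugging Face download, so the arithmetic can be unit-tested separately from the tokenizer.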
