# LLM Chat Examples with OCI Generative AI

### What this file does:

Demonstrates basic chat functionality using OCI Generative AI with Cohere models, including parameter experimentation, conversation history, structured outputs (JSON), and streaming responses. Shows how to make chat requests, configure parameters like temperature, max tokens, and seed, and experiment with different settings for reproducible and controlled results. This notebook builds on concepts from the accompanying Python files (e.g., cohere_chat.py, cohere_chat_history.py).

**Documentation to reference:**

- OCI Gen AI: https://docs.oracle.com/en-us/iaas/Content/generative-ai/pretrained-models.htm
- Cohere Chat Models: https://docs.oracle.com/en-us/iaas/Content/generative-ai/chat-models.htm
- OCI Python SDK: https://github.com/oracle/oci-python-sdk/tree/master/src/oci/generative_ai_inference/models
- Cohere Chat API: https://docs.cohere.com/docs/chat-api

**Relevant Slack channels:**

- #generative-ai-users: *for questions on OCI Gen AI*
- #igiu-innovation-lab: *general discussions on your project*
- #igiu-ai-learning: *help with sandbox environment or running this code*

**Env setup:**

- sandbox.yaml: Contains OCI config, compartment, and other details.
- .env: Load environment variables (e.g., API keys if needed).
- Make Jupyter's working directory (cwd) match the workspace used by your Python code:
  - vscode menu -> Settings > Extensions > Jupyter > Notebook File Root
  - change from `${fileDirname}` to `${workspaceFolder}`
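
If changing the notebook file root is not an option, the config file can also be located from code. The following is a minimal sketch, not part of the original notebook: `find_config` is a hypothetical helper that walks up from the current directory until it finds `sandbox.yaml`.

```python
from pathlib import Path
from typing import Optional


def find_config(filename: str = "sandbox.yaml", start: Optional[Path] = None) -> Path:
    """Return the first `filename` found in `start` or any of its parent directories."""
    start_dir = (start or Path.cwd()).resolve()
    for directory in (start_dir, *start_dir.parents):
        candidate = directory / filename
        if candidate.is_file():
            return candidate
    raise FileNotFoundError(f"{filename} not found in {start_dir} or its parents")
```

The returned path could then be used in place of the bare file name, for example `SANDBOX_CONFIG_FILE = str(find_config())`.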

**How to run in notebook:**

- Make sure your runtime environment has all dependencies and access to required config files.
- Run the notebook cells in order.

### Supported models

See [OCI Chat Models Documentation](https://docs.oracle.com/en-us/iaas/Content/generative-ai/chat-models.htm) for the latest list.

**Cohere models (Cohere API format, `CohereChatRequest`, used in this notebook):**

- cohere.command-a-03-2025
- cohere.command-r-08-2024
- cohere.command-r-plus-08-2024

**OpenAI-compatible models (generic API):**

- openai.gpt-4.1
- openai.gpt-4.1-mini
- openai.gpt-4.1-nano
- openai.gpt-o1
- openai.gpt-o3
- openai.gpt-4o
- openai.gpt-4o-mini
- openai.gpt-5
- openai.gpt-5-mini
- openai.gpt-5-nano
- xai.grok-3
- xai.grok-4
- xai.grok-4-fast-reasoning
- xai.grok-4-fast-non-reasoning
- meta.llama-3.1-405b-instruct
- meta.llama-3.3-70b-instruct
- meta.llama-3.2-90b-vision-instruct

### Step 1: Load config and initialize clients

Set up the OCI Generative AI client and load configuration. This step initializes the connection to OCI services.

In [None]:
import json
import os

import oci
from dotenv import load_dotenv
from envyaml import EnvYAML
from oci.generative_ai_inference import GenerativeAiInferenceClient
from oci.generative_ai_inference.models import (
    ChatDetails,
    CohereChatBotMessage,
    CohereChatRequest,
    CohereResponseJsonFormat,
    CohereResponseTextFormat,
    CohereUserMessage,
    OnDemandServingMode,
)

# Make sure your sandbox.yaml file is set up for your environment.
# Depending on your cwd, you might have to specify the full path.
# You can also make the Jupyter cwd match your workspace Python code:
# VS Code menu -> Settings > Extensions > Jupyter > Notebook File Root,
# then change ${fileDirname} to ${workspaceFolder}.

SANDBOX_CONFIG_FILE = "sandbox.yaml"
load_dotenv()

CHAT_MODEL = "cohere.command-r-plus-08-2024" 
PREAMBLE = """
        answer in three bullets. respond in hindi 
"""
MESSAGE = """
        "why is indian cricket team so good"
"""

llm_service_endpoint= "https://inference.generativeai.us-chicago-1.oci.oraclecloud.com"

scfg = EnvYAML(SANDBOX_CONFIG_FILE)
if scfg is None or "oci" not in scfg or "bucket" not in scfg:
    raise RuntimeError("Invalid sandbox configuration.")

# Read the OCI config
config = oci.config.from_file(os.path.expanduser(scfg["oci"]["configFile"]), scfg["oci"]["profile"])

# Chat request
llm_chat_request = CohereChatRequest()
llm_chat_request.preamble_override = PREAMBLE 
llm_chat_request.message = MESSAGE
llm_chat_request.is_stream = False 
llm_chat_request.max_tokens = 500  # maximum tokens to generate; low values can cut the response short
llm_chat_request.temperature = 1.0  # higher values mean more random output; default = 0.3
llm_chat_request.top_p = 0.7  # only tokens within the top cumulative probability p are considered; min 0.01, max 0.99, default 0.75
llm_chat_request.top_k = 0  # only the top k tokens are considered; 0 turns it off, max = 500
llm_chat_request.frequency_penalty = 0.0  # penalizes repeated tokens; min 0.0, max 1.0


# set up chat details
chat_detail = ChatDetails()
chat_detail.serving_mode = OnDemandServingMode(model_id=CHAT_MODEL)
chat_detail.compartment_id = scfg["oci"]["compartment"]
chat_detail.chat_request = llm_chat_request

# Set up the LLM client
llm_client = GenerativeAiInferenceClient(
    config=config,
    service_endpoint=llm_service_endpoint,
    retry_strategy=oci.retry.NoneRetryStrategy(),
    timeout=(10, 240),  # (connect, read) timeouts in seconds
)

### Step 2: Basic Chat Functionality

Send a simple chat request to the model. This demonstrates the core chat API call. Experiment by changing the message or preamble to see how the response varies.
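
When experimenting with many messages, it can help to wrap the call in a small function. The sketch below is an illustration only: `ask` is a hypothetical helper, and `EchoClient` and `demo_detail` are stand-ins so the snippet runs without OCI credentials. In the notebook you would pass `llm_client` and `chat_detail` instead.

```python
from types import SimpleNamespace


def ask(client, chat_detail, message, preamble=None):
    """Send one message through `client.chat` and return the response text."""
    request = chat_detail.chat_request
    request.message = message
    if preamble is not None:
        request.preamble_override = preamble
    response = client.chat(chat_detail)
    return response.data.chat_response.text


class EchoClient:
    """Stand-in client (not an OCI class) that echoes the message back."""

    def chat(self, chat_detail):
        text = f"echo: {chat_detail.chat_request.message}"
        chat_response = SimpleNamespace(text=text)
        return SimpleNamespace(data=SimpleNamespace(chat_response=chat_response))


demo_detail = SimpleNamespace(
    chat_request=SimpleNamespace(message=None, preamble_override=None)
)
print(ask(EchoClient(), demo_detail, "hello", preamble="respond in bullet"))  # prints "echo: hello"
```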

In [None]:
llm_chat_request.preamble_override = "respond in bullet"
llm_chat_request.message = "tell me 2 things about oracle primavera cloud "
llm_response = llm_client.chat(chat_detail)
llm_text = llm_response.data.chat_response.text

print(llm_text)

### Step 3: Response Predictability

Without a seed, the same question can yield different responses due to randomness. Adding a seed makes responses more deterministic (best effort). Run the cell multiple times to observe differences.

**Experiment:** Change the seed value (e.g., 1234, 9999) and rerun to test reproducibility.
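
To build intuition for why a seed helps, here is a toy sampler in plain Python. It is a simplified illustration, not how the OCI service or the Cohere models are implemented: the vocabulary, the logits, and the `sample_tokens` helper are all made up for this sketch. It shows that a fixed seed reproduces the same draw and that a low temperature concentrates the draw on the most likely token.

```python
import math
import random


def sample_tokens(logits, n, temperature=1.0, seed=None):
    """Draw n tokens from a toy vocabulary using temperature-scaled softmax weights."""
    rng = random.Random(seed)  # seed=None picks a fresh random state, so runs differ
    tokens = list(logits)
    scaled = [logits[token] / temperature for token in tokens]
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]  # subtract peak for stability
    return rng.choices(tokens, weights=weights, k=n)


toy_logits = {"bat": 2.0, "ball": 1.5, "pitch": 1.0, "over": 0.5}
print(sample_tokens(toy_logits, 5, seed=7555))
print(sample_tokens(toy_logits, 5, seed=7555))  # same seed, same draw
print(sample_tokens(toy_logits, 5))  # no seed, may differ between runs
print(sample_tokens(toy_logits, 5, temperature=0.05, seed=7555))  # strongly favours "bat"
```

The real service describes seeded determinism as best effort, so identical seeds can still occasionally yield different text.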

In [None]:
llm_response = llm_client.chat(chat_detail)
llm_text = llm_response.data.chat_response.text
# Without a seed, the same question can produce a different response
print(llm_text)

# Play with the seed
llm_chat_request.seed = 7555  # a fixed seed makes the response more predictable (best effort)
llm_response = llm_client.chat(chat_detail)
llm_text = llm_response.data.chat_response.text

print(llm_text)

# Same seed again: the output should match the previous call
llm_chat_request.seed = 7555
llm_response = llm_client.chat(chat_detail)
llm_text = llm_response.data.chat_response.text
print(llm_text)

### Step 4: Controlling Output Length

The `max_tokens` parameter limits response length. Lower values truncate responses; check `finish_reason` to see if it stopped due to length.

**Experiment:** Try max_tokens = 10, 100, or 1000 and observe how the response and finish_reason change.
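
A small helper can make truncation easier to spot while you vary `max_tokens`. This is a sketch under an assumption: it treats `"MAX_TOKENS"` and `"COMPLETE"` as the finish reason values for a length cutoff and a normal stop; check the values your SDK version actually returns. `describe_finish` is a hypothetical name.

```python
def describe_finish(finish_reason, max_tokens):
    """Return a short human-readable note for a chat finish reason."""
    if finish_reason == "MAX_TOKENS":  # assumed value for a length cutoff
        return f"Truncated at max_tokens={max_tokens}; raise the limit or ask for a shorter answer."
    if finish_reason == "COMPLETE":  # assumed value for a normal stop
        return "Model finished on its own."
    return f"Stopped for another reason: {finish_reason}"


# In the notebook, this could be called as:
# describe_finish(llm_response.data.chat_response.finish_reason, llm_chat_request.max_tokens)
print(describe_finish("MAX_TOKENS", 60))
```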

In [None]:
# Controlling the output length
llm_chat_request.max_tokens = 60
llm_response = llm_client.chat(chat_detail)
print(llm_response.data.chat_response.text)
print(llm_response.data.chat_response.finish_reason)

### Step 5: Conversation History Support

For conversational bots, maintain history by adding previous messages (user and assistant) to the request. This provides context for follow-up questions (e.g., pronouns like 'it'). History is typically added in pairs.

**Experiment:** Add more history pairs or change the follow-up message to test context retention. Try removing history to see ambiguous responses.
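
For longer conversations, the history usually needs a cap so the request does not grow without bound. The sketch below keeps only the most recent turn pairs. It is an illustration with assumed names: `ChatHistory` is hypothetical, the replies are placeholders, and plain dicts stand in for the SDK message objects so the snippet runs without OCI.

```python
from collections import deque


class ChatHistory:
    """Keep only the most recent (user, assistant) turn pairs."""

    def __init__(self, max_pairs=5):
        self._pairs = deque(maxlen=max_pairs)  # oldest pair is dropped automatically

    def add_turn(self, user_message, bot_message):
        self._pairs.append((user_message, bot_message))

    def as_messages(self):
        """Flatten the pairs into alternating role/message dicts, oldest first."""
        messages = []
        for user_message, bot_message in self._pairs:
            messages.append({"role": "USER", "message": user_message})
            messages.append({"role": "CHATBOT", "message": bot_message})
        return messages


history = ChatHistory(max_pairs=2)
history.add_turn("Tell me something about Oracle.", "(assistant reply 1)")
history.add_turn("what is its flagship product?", "(assistant reply 2)")
history.add_turn("Where is Oracle's HQ?", "(assistant reply 3)")
print(history.as_messages())  # only the last two pairs remain
```

In the notebook, each dict would be converted to a `CohereUserMessage` or `CohereChatBotMessage` before being assigned to `llm_chat_request.chat_history`.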

In [None]:
## History support
# Optionally update the PREAMBLE
# PREAMBLE = "Answer the questions in a professional tone, based on the conversation history"


previous_chat_message = oci.generative_ai_inference.models.CohereUserMessage(message="Tell me something about Oracle.")
previous_chat_reply = oci.generative_ai_inference.models.CohereChatBotMessage(message="Oracle is one of the largest vendors in the enterprise IT market and the shorthand name of its flagship product. The database software sits at the center of many corporate IT")
llm_chat_request.chat_history = [previous_chat_message, previous_chat_reply]
# Ask the question; clear the seed set in Step 3
llm_chat_request.seed = None

# llm_chat_request.message = "Where is Oracle's HQ?"

llm_chat_request.message = "what is its flagship product?"  # the LLM can resolve "its" because we added history records


llm_response = llm_client.chat(chat_detail)

# Print result
print("**************************Chat Result**************************")
llm_text = llm_response.data.chat_response.text

print(llm_text)

### Step 6: JSON Format for Structured Outputs

Set `response_format` to `CohereResponseJsonFormat()` to get structured JSON responses (supported on the 08-2024 and later models). Pass a JSON schema to constrain the shape of the output (e.g., objects, arrays). Also ask for JSON in the preamble, since the model then adheres to the format more reliably.

**Note:** Schemas enforce structure. A schema that describes a single object yields one entry even when the prompt asks for several; use an array schema when you expect a list.

**Experiment:** Create a schema for extracting entities from text (e.g., names, dates) and test with different inputs. Try without schema to compare free-form JSON.
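
Even with a schema, it is worth parsing and checking the returned text before using it, because a response cut off by `max_tokens` is usually not valid JSON. A minimal sketch using only the standard library follows; `parse_books` is a hypothetical helper and the sample string is made-up placeholder data, not model output.

```python
import json


def parse_books(llm_text, required=("title", "author", "publication_year")):
    """Parse model output as JSON and check that each entry has the required keys."""
    payload = json.loads(llm_text)  # raises json.JSONDecodeError on malformed output
    entries = payload if isinstance(payload, list) else [payload]
    for entry in entries:
        if not isinstance(entry, dict):
            raise ValueError(f"expected a JSON object, got: {entry!r}")
        missing = [key for key in required if key not in entry]
        if missing:
            raise ValueError(f"entry {entry!r} is missing keys: {missing}")
    return entries


sample = '{"title": "Example Title", "author": "Example Author", "publication_year": 2020}'
print(parse_books(sample))
```

In the notebook, `parse_books(llm_text)` would apply to the single-object schema; for the nested array schema you would first pull the list out of the top-level object.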

In [None]:
## JSON format

# Switch to a model that supports JSON output
chat_detail.serving_mode = OnDemandServingMode(model_id="cohere.command-r-08-2024")
llm_chat_request.chat_history = None  # drop the Step 5 history so it does not leak into this request
llm_chat_request.response_format = CohereResponseJsonFormat()
llm_chat_request.preamble_override = "Answer in json format"
llm_chat_request.message = "list 3 Latest novels from indian authors "

llm_chat_request.max_tokens = 500
llm_response = llm_client.chat(chat_detail)
llm_text = llm_response.data.chat_response.text
print(llm_text)

  

# Here we specify a simple JSON schema. Note that the schema allows only one entry,
# so even though we ask for multiple answers, the response will contain just one.
response_schema = {
    "type": "object",
    "required": ["title", "author", "publication_year"],
    "properties": {
        "title": {"type": "string"},
        "author": {"type": "string"},
        "publication_year": {"type": "integer"},
    },
}

llm_chat_request.response_format = CohereResponseJsonFormat(schema=response_schema)
llm_response = llm_client.chat(chat_detail)
llm_text = llm_response.data.chat_response.text
print(llm_text)

# Nested schema with an array

# Here we allow for an array, so the answer can contain multiple entries.
# Also note how the author name is split into first and last name.
response_schema = {
    "type": "object",
    "required": ["authors"],
    "properties": {
        "authors": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["title", "author", "publication_year"],
                "properties": {
                    "title": {"type": "string"},
                    "author": {
                        "type": "object",
                        "required": ["fname", "lname"],
                        "properties": {
                            "fname": {"type": "string"},
                            "lname": {"type": "string"},
                        },
                    },
                    "publication_year": {"type": "integer"},
                },
            },
        }
    },
}

llm_chat_request.response_format = CohereResponseJsonFormat(schema=response_schema)
llm_chat_request.message = "list three most popular novels from indian authors in 2022"
llm_response = llm_client.chat(chat_detail)
llm_text = llm_response.data.chat_response.text
print(llm_text)

### Step 7: Streaming Responses

Enable streaming (`is_stream=True`) to receive responses token-by-token via Server-Sent Events (SSE). This is ideal for real-time UIs. Process events to print progressively.

**Experiment:** Change the message to a longer query and observe token-by-token generation. Compare timing with non-streaming.
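
To compare timing, it helps to collect the stream into a string and record when the first text arrived. The sketch below mirrors the event handling used in this step, under the assumption that each event has a `data` attribute holding a JSON string with `text` and, on the last event, `finishReason`. `collect_stream` is a hypothetical helper and `fake_events` is stand-in data so the snippet runs without OCI.

```python
import json
import time
from types import SimpleNamespace


def collect_stream(events):
    """Consume SSE-style events; return (full_text, finish_reason, seconds_to_first_text)."""
    started = time.perf_counter()
    first_text_at = None
    finish_reason = None
    chunks = []
    for event in events:
        payload = json.loads(event.data)
        if "finishReason" in payload:  # final event: stop reading
            finish_reason = payload["finishReason"]
            break
        if "text" in payload:
            if first_text_at is None:
                first_text_at = time.perf_counter() - started
            chunks.append(payload["text"])
    return "".join(chunks), finish_reason, first_text_at


fake_events = [
    SimpleNamespace(data=json.dumps({"text": "The sky "})),
    SimpleNamespace(data=json.dumps({"text": "is blue"})),
    SimpleNamespace(data=json.dumps({"finishReason": "COMPLETE", "text": "The sky is blue"})),
]
text, reason, first_text_seconds = collect_stream(fake_events)
print(text, reason)  # prints "The sky is blue COMPLETE"
```

In the notebook, the same function could be called with `llm_response.data.events()` after a streaming request.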

In [None]:
## Streaming response

# Ask the question
llm_chat_request.preamble_override = "Answer as a poem"
llm_chat_request.message = "why is sky blue"
llm_chat_request.is_stream = True
llm_chat_request.response_format = CohereResponseTextFormat()

# Generate the response
llm_response = llm_client.chat(chat_detail)

# Stream the output (json is already imported in Step 1)
for event in llm_response.data.events():
    res = json.loads(event.data)
    if "finishReason" in res:
        print(f"\nFinish reason: {res['finishReason']}")
        break
    if "text" in res:
        print(res["text"], end="", flush=True)
print("\n")

### Practice Exercises and Discussion Prompts

1. **Basic Chat Experiments:**
   - Try zero-shot, few-shot (add examples in preamble), and chain-of-thought prompting.
   - Test output formats: freeform text, bullets, JSON, CSV, SQL.
   - Input different languages (e.g., Hindi, French) and observe multilingual support.

2. **Language Analysis Tasks:**
   - Use the LLM for proofreading: Feed in text with errors and ask for corrections.
   - Entity extraction: Ask to extract names, dates, locations from a news article.
   - Summarization and sentiment: Summarize a paragraph and analyze its tone.

3. **Data Generation:**
   - Generate SQL INSERT statements for sample data.
   - Create JSON/CSV outputs for mock datasets.

4. **Guardrails and Safety:**
   - Test harmful prompts and observe model refusals.
   - Implement simple content filters in the preamble.

5. **Build a Mini-Project:**
   - Create a conversational bot that remembers user preferences (e.g., response language).
   - Extract entities from a poem and generate a structured summary.

**Discussion Prompts:**
- How does adding history affect response quality for ambiguous queries?
- When would you use structured JSON vs. free-text outputs in applications?
- Discuss trade-offs between streaming (real-time) and non-streaming (complete response) in UI design.

For help, reach out in #igiu-ai-learning!