# LLM API Basics

<center>
<img src="https://supportvectors.ai/logo-poster-transparent.png" width="400px" style="opacity:0.7">
</center>

In [1]:
%run supportvectors-common.ipynb


<div style="color:#aaa;font-size:8pt">
<hr/>
&copy; SupportVectors. All rights reserved. <blockquote>This notebook is the intellectual property of SupportVectors, and part of its training material. 
Only participants in SupportVectors workshops are currently allowed to study the notebooks for educational purposes; copying them or using them for any other purpose without written permission is prohibited.

<b> These notebooks are chapters and sections from Asif Qamar's textbook that he is writing on Data Science. So we request you to not circulate the material to others.</b>
 </blockquote>
 <hr/>
</div>



## Ollama Installation Steps

Run ollama using instructions from [here](https://github.com/ollama/ollama).  

Only the Linux instructions are reproduced below. For Mac or Windows, see the link above.

1. `curl -fsSL https://ollama.com/install.sh | sh` (to install ollama)
2. `ollama pull llama3.2` to pull `llama3.2` model.
3. `ollama run llama3.2` (to run `ollama` on the command line with `llama3.2` model)
4. To access `llama3.2` model on `ollama` through a REST API:
    ```
    curl http://localhost:11434/api/generate -d '{
    "model": "llama3.2",
    "prompt":"Why is the sky blue?"
    }'
    ```

Once we have `ollama` running as above, accessing models through a REST API can be done very similarly to how one accesses OpenAI through an API.
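The `curl` call in step 4 can also be made from Python. The following is a minimal sketch using only the standard library, not an official client. The helper names `build_generate_request` and `generate` are hypothetical, and the sketch assumes Ollama is listening on its default local port. Setting `"stream": False` asks the server for a single JSON object rather than a stream of chunks.

```python
import json
import urllib.request

# Assumes a local Ollama server on its default port
OLLAMA_GENERATE_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for Ollama's /api/generate endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # one JSON object instead of a stream of chunks
    }
    return urllib.request.Request(
        OLLAMA_GENERATE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the request and return the generated text (needs a running server)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]
```

With the server running, `generate("llama3.2", "Why is the sky blue?")` should return the model's answer as a plain string.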

Imports

In [2]:
from openai import OpenAI
from rich import print as rprint

## Accessing LLMs served by ollama

Ollama listens on port 11434 by default, and its OpenAI-compatible endpoints live under `/v1/`. The model to use is named in each API call.

In [3]:

client = OpenAI(
    base_url='http://localhost:11434/v1/',

    # required but ignored
    api_key='ollama',
)


The call below asks Ollama to answer using the `llama3.2` model.

In [4]:

chat_completion = client.chat.completions.create(
    messages=[
        {
            'role': 'user',
            'content': 'What are the instructions to access openAI through a REST API',
        }
    ],
    model='llama3.2',
)
rprint(chat_completion)
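Printing the whole response object shows everything the server returned. When only the reply text or the token count is needed, they can be read from the object's attributes. The helpers below are a small sketch with hypothetical names (`reply_text`, `total_tokens`); the stand-in object `fake` only mimics the attribute layout of a chat completion so the example runs without a server.

```python
from types import SimpleNamespace


def reply_text(chat_completion) -> str:
    """Return the assistant's text from the first choice of a chat completion."""
    return chat_completion.choices[0].message.content


def total_tokens(chat_completion):
    """Return the total token count, or None if the server sent no usage data."""
    usage = chat_completion.usage
    return None if usage is None else usage.total_tokens


# Stand-in with the same attribute layout as a chat completion (illustration only)
fake = SimpleNamespace(
    choices=[SimpleNamespace(message=SimpleNamespace(role="assistant", content="Hello!"))],
    usage=SimpleNamespace(prompt_tokens=5, completion_tokens=2, total_tokens=7),
)

print(reply_text(fake))
print(total_tokens(fake))
```

In the notebook, the same helpers would be applied to the real object, for example `rprint(reply_text(chat_completion))`.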

In [5]:

completion = client.completions.create(
    model="llama3.2",
    prompt="What are the instructions to access openAI through a REST API",
)
rprint(completion)

In [5]:

list_completion = client.models.list()
rprint(list_completion)

In [6]:

model = client.models.retrieve("llama3.2")
rprint(model)

## Doing the same now with OpenAI gpt-4o-mini

Note that we are not passing any API key here. It is picked up automatically from the environment variable `OPENAI_API_KEY`.
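Because the key comes from the environment, a missing variable only surfaces as an error when the client is created or used. A quick check beforehand can make that failure easier to read. This is a small sketch; the helper name `has_api_key` is hypothetical.

```python
import os


def has_api_key(env=os.environ, name: str = "OPENAI_API_KEY") -> bool:
    """Return True if the named variable is set to a non-blank value."""
    return bool(env.get(name, "").strip())


if not has_api_key():
    print("OPENAI_API_KEY is not set; export it before creating the OpenAI client.")
```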

In [7]:
client_openai = OpenAI(
)

In [8]:

chat_completion = client_openai.chat.completions.create(
    messages=[
        {
            'role': 'user',
            'content': 'What are the instructions to access openAI through a REST API',
        }
    ],
    model='gpt-4o-mini',
)
rprint(chat_completion)

In [10]:
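completion = client_openai.completions.create(
    # gpt-4o-mini is a chat model and is not served by the legacy completions
    # endpoint; gpt-3.5-turbo-instruct is a model that endpoint does accept.
    model="gpt-3.5-turbo-instruct",
    prompt="What are the instructions to access openAI through a REST API",
)
rprint(completion)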

### A few LLM Sampling Parameters

#### Temperature:

##### Definition: Controls the randomness of the token sampling process.
##### Effect:
Lower values (e.g., 0.2) make the output more deterministic, favoring tokens with higher probabilities.
Higher values (e.g., 1.0 or 1.5) introduce randomness, which can result in more creative or diverse outputs.
##### Use Cases:
Use lower values for tasks requiring precision (e.g., code generation or factual answers).
Use higher values for creative writing or brainstorming.
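To make the effect of temperature concrete, the sketch below divides a set of made-up logits by the temperature before applying softmax. The function name and the example scores are illustrative assumptions, not taken from any particular model or server.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.2, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At the low temperature almost all of the probability sits on the top token, while at the higher temperature it is spread more evenly across the three.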

#### Top-p (Nucleus Sampling):

##### Definition: Limits token sampling to the smallest set of most probable tokens whose cumulative probability reaches a specified threshold p.
##### Effect:
A value of 0.9 means only the most probable tokens that together account for 90% of the probability mass are considered, which cuts off the low-probability tail.
##### Use Cases:
Ensures the output is coherent while allowing for some variability.
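Nucleus sampling can be sketched as a filter over a token distribution: rank the tokens, keep adding them until the running total reaches the threshold, then renormalize. The function name and the toy distribution are assumptions for illustration; real implementations may differ in how they treat the token that crosses the threshold.

```python
def top_p_filter(probs, p):
    """Keep the most probable tokens until their cumulative probability reaches p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept = {}
    cumulative = 0.0
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(kept.values())
    # Renormalize so the surviving probabilities sum to 1
    return {token: prob / total for token, prob in kept.items()}


probs = {"blue": 0.5, "clear": 0.3, "grey": 0.15, "purple": 0.05}  # toy distribution
print(top_p_filter(probs, 0.9))
```

Here the least likely token is dropped, and the next token is sampled from the three that remain.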

#### Top-k Sampling:

##### Definition: Limits token sampling to the top k most probable tokens.
##### Effect:
Restricts randomness to a fixed number of high-probability options, regardless of cumulative probabilities.
##### Use Cases:
Works well for tasks requiring diversity while avoiding unlikely tokens.
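Top-k is the simpler of the two filters: it keeps a fixed number of candidates no matter how the probability mass is distributed among them. A minimal sketch, with a hypothetical function name and a toy distribution:

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens and renormalize their probabilities."""
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    return {token: prob / total for token, prob in ranked}


probs = {"blue": 0.5, "clear": 0.3, "grey": 0.15, "purple": 0.05}  # toy distribution
print(top_k_filter(probs, 2))
```

Unlike top-p, the number of surviving tokens does not change when the distribution is very peaked or very flat.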

#### Frequency Penalty:

##### Definition: Penalizes tokens in proportion to how many times they have already appeared in the generated text.
##### Effect:
Reduces verbatim repetition: the more often a word has been used, the less likely it is to be chosen again.
##### Use Cases:
Ideal for generating long-form content like essays or stories.

#### Presence Penalty:

##### Definition: Applies a flat, one-time penalty to any token that has appeared at least once in the text so far, regardless of how often.
##### Effect:
Encourages introducing new topics or ideas into the output.
##### Use Cases:
Useful for exploratory or brainstorming tasks.
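The difference between the two penalties is easiest to see on the logits. The sketch below is modeled on the adjustment described in OpenAI's API documentation, where each logit is lowered by the token's count times the frequency penalty, plus the presence penalty once if the token has appeared at all. Other servers may implement the penalties differently, so treat the exact form as an assumption; the function name and the values are illustrative.

```python
from collections import Counter


def apply_penalties(logits, generated_tokens, frequency_penalty=0.0, presence_penalty=0.0):
    """Lower the logits of tokens that already appear in the generated text."""
    counts = Counter(generated_tokens)
    adjusted = {}
    for token, logit in logits.items():
        count = counts.get(token, 0)
        adjusted[token] = (
            logit
            - count * frequency_penalty  # grows with every repetition
            - (1.0 if count > 0 else 0.0) * presence_penalty  # applied at most once
        )
    return adjusted


logits = {"the": 2.0, "village": 1.5, "river": 1.0}  # made-up scores
history = ["the", "village", "the"]
print(apply_penalties(logits, history, frequency_penalty=0.5, presence_penalty=0.25))
```

The token that appeared twice loses more than the one that appeared once, and the unseen token is left untouched.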

**See below how the output varies for the same prompt with different values for temperature and nucleus sampling (top_p)**

In [11]:
client = OpenAI(
    base_url='http://localhost:11434/v1/',

    # required but ignored
    api_key='ollama',
)

In [12]:
completion = client.completions.create(
    model="llama3.2",
    prompt="Write a story about a young boy named Angelo from a small village",
    #temperature=3.0,
    #top_p=0.7,
    n=1,
)
rprint(completion.choices[0].text)

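Increasing the temperature flattens the token distribution, so the output becomes less predictable (more "creative"). At very high values such as 3.0, the text can degrade into incoherence.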

In [13]:
completion = client.completions.create(
    model="llama3.2",
    prompt="Write a story about a young boy named Angelo from a small village",
    temperature=3.0,
    #top_p=0.7,
    n=1,
)
rprint(completion.choices[0].text)

Decreasing `top_p` has the opposite effect, as does decreasing the temperature: the response becomes more predictable. Typically we vary only one of these two parameters at a time.

In [14]:
completion = client.completions.create(
    model="llama3.2",
    prompt="Write a story about a young boy named Angelo from a small village",
    #temperature=3.0,
    top_p=0.7,
    n=1,
)
rprint(completion.choices[0].text)

**Note that the OpenAI-style calls above let us vary several parameters (`temperature`, `top_p`, `frequency_penalty`, `presence_penalty`), but there is no explicit support for `top_k`.**
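One way to still experiment with `top_k` is Ollama's native REST API, which accepts sampling settings in an `options` object on `/api/generate`. The sketch below only builds the JSON payload; the helper name `build_generate_payload` and the chosen values are assumptions for illustration, and the payload would be POSTed to `http://localhost:11434/api/generate` on a running server.

```python
import json


def build_generate_payload(model, prompt, top_k=40, temperature=0.8):
    """Build a payload for Ollama's native /api/generate with sampling options."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # one JSON object instead of a stream of chunks
        "options": {"top_k": top_k, "temperature": temperature},
    }


payload = build_generate_payload(
    "llama3.2",
    "Write a story about a young boy named Angelo from a small village",
    top_k=20,
)
print(json.dumps(payload, indent=2))
```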