# Gai/Gen: Text-to-Text (TTT)

## 1.1 Setting Up

We will create a separate virtual environment for this to avoid conflicts between the dependencies that each underlying model requires.

```sh
sudo apt update -y && sudo apt install ffmpeg git git-lfs -y
conda create -n TTT python=3.10.10 -y
conda activate TTT
pip install -e ".[TTT]"
```

The following examples have been tested in this environment:

-   NVIDIA GeForce RTX 2060 6GB
-   Windows 11 + WSL2
-   Ubuntu 22.04
-   Python 3.10
-   CUDA Toolkit 11.8
-   openai 1.6.1
-   anthropic 0.8.1
-   transformers 4.36.2
-   bitsandbytes 0.41.3.post2
-   scipy 1.11.4
-   accelerate 0.25.0
-   llama-cpp-python 0.2.25
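
To compare your own environment against the list above, a small check like the following can help. It is only a sketch: it uses the standard-library `importlib.metadata` module, and the package names are copied from the list; nothing here is part of the Gai/Gen library.

```python
from importlib import metadata

# Package names copied from the tested-environment list above.
PACKAGES = ["openai", "anthropic", "transformers", "bitsandbytes",
            "scipy", "accelerate", "llama-cpp-python"]

def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions

# Print one line per package so mismatches are easy to spot.
for name, version in installed_versions(PACKAGES).items():
    print(f"{name}: {version or 'not installed'}")
```

Different versions may still work; the list only records what the examples were tested with.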


## 1.2 Running as a Library

### OpenAI GPT4

For (1) and (2) below, you will use the GaiGen library to call OpenAI's GPT-4.
You will need an API key from OpenAI.
Create a `.env` file in the project root directory and add the OpenAI API key as shown below:

```sh
OPENAI_API_KEY=<your key here>
```
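
If a call fails with an authentication error, it can be worth confirming that the key is actually present in the file. The sketch below parses `KEY=VALUE` text with the standard library only. How the library itself reads `.env` is not shown in this document, so `parse_env` is a hypothetical helper for a quick sanity check, not the library's loader.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, ignoring blank lines and '#' comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values

# Sample content; in practice, read the text from the .env file instead.
sample = "# API keys\nOPENAI_API_KEY=sk-example\n"
env = parse_env(sample)
print("OPENAI_API_KEY" in env)
```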

In [1]:
### 1. GPT4 Text-to-Text Generation

print("GENERATING:")
from gai.gen import Gaigen
gen = Gaigen.GetInstance().load('gpt-4')
response = gen.create(messages=[{'role':'USER','content':'Tell me a one paragraph short story.'},{'role':'ASSISTANT','content':''}], max_tokens=100,stream=False)
print(response.choices[0].message.content)

GENERATING:


2024-03-07 06:56:20 INFO gai.gen.Gaigen:Gaigen: Loading generator gpt-4...
2024-03-07 06:56:20 INFO gai.gen.ttt.TTT:Using engine OpenAI_TTT...
2024-03-07 06:56:20 DEBUG gai.gen.ttt.OpenAI_TTT:OpenAI_TTT.create: model_params={'max_tokens': 100, 'temperature': 0.7, 'top_p': 1, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'stop': None, 'logit_bias': {}, 'n': 1}


In the heart of the ancient town of Littleshire, there lived a peculiar old man named Mr. Bentley. Known for his exceptional talent for fixing anything, he was the town's favorite handyman. One day, a mysterious object fell from the sky, alarming the whole town. It was a complex machine, unlike anything anyone had ever seen. Fearful townsfolk turned to Mr. Bentley, who, with his gentle touch and intuitive understanding of machines, fixed the strange contraption. The


In [2]:
### 2. GPT4 Text-to-Text Streaming

print("STREAMING:")
from gai.gen import Gaigen
gen = Gaigen.GetInstance().load('gpt-4')
response = gen.create(messages=[{'role':'USER','content':'Tell me a one paragraph short story.'},{'role':'ASSISTANT','content':''}],stream=True)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content,end='',flush=True)

2024-03-07 06:56:30 DEBUG gai.gen.Gaigen:Gaigen.load: Generator is already loaded. Skip loading.
2024-03-07 06:56:30 DEBUG gai.gen.ttt.OpenAI_TTT:OpenAI_TTT.create: model_params={'max_tokens': 100, 'temperature': 0.7, 'top_p': 1, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'stop': None, 'logit_bias': {}, 'n': 1}


STREAMING:
In the heart of a bustling city, lived an old watchmaker named Gregory. Every day, he worked meticulously, piecing together the intricate mechanics of timepieces, his hands telling a story of countless years of precision and dedication. One night, a peculiar customer walked into his shop, presenting a centuries-old watch that had long stopped ticking. Gregory, intrigued and challenged, spent weeks bringing it back to life. As the timepiece began to tick again, it revealed a hidden compartment with an old map
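
When the full text is needed after streaming (for logging, say), the deltas can be accumulated instead of only printed. The sketch below assumes chunks expose `choices[0].delta.content` as in the loop above. `fake_chunk` is a hypothetical stand-in so the example runs without an API key.

```python
from types import SimpleNamespace

def collect_stream(chunks):
    """Join the non-empty delta.content pieces of a chunk iterator."""
    parts = []
    for chunk in chunks:
        piece = chunk.choices[0].delta.content
        if piece:  # some chunks may carry None or an empty string
            parts.append(piece)
    return "".join(parts)

def fake_chunk(text):
    # Stand-in with the same attribute path as the streamed chunks.
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])

text = collect_stream(fake_chunk(t) for t in ["In the ", "heart", None, " of a city"])
print(text)
```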

### Mistral 7B 8k-context 4-bit quantized

For (3) and (4), you will run Mistral 7B locally. Download TheBloke's 4-bit quantized version of the Mistral-7B model from Hugging Face. This model uses the ExLlama loader for better performance. Make sure `huggingface-hub` is installed; if not, run `pip install huggingface-hub`.

```sh
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.1-GPTQ \
        config.json \
        model.safetensors \
        tokenizer.model \
        --local-dir ~/gai/models/Mistral-7B-Instruct-v0.1-GPTQ \
        --local-dir-use-symlinks False
```
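
After the download finishes, you may want to confirm that the three files named in the command landed in the target directory before loading the model. This is a minimal sketch using `pathlib`; `missing_files` is a hypothetical helper, not part of the library.

```python
from pathlib import Path

# File names taken from the download command above.
REQUIRED_FILES = ["config.json", "model.safetensors", "tokenizer.model"]

def missing_files(model_dir, required=REQUIRED_FILES):
    """Return the required file names that are not present in model_dir."""
    base = Path(model_dir).expanduser()
    return [name for name in required if not (base / name).is_file()]
```

Calling `missing_files("~/gai/models/Mistral-7B-Instruct-v0.1-GPTQ")` should return an empty list once the download is complete.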

In [4]:
### 3. Mistral Text-to-Text Generation

print("GENERATING:")
from gai.gen import Gaigen
gen = Gaigen.GetInstance().load('mistral7b-exllama')
response = gen.create(messages=[{'role':'USER','content':'Tell me a one paragraph short story.'},{'role':'ASSISTANT','content':''}],max_new_tokens=100, stream=False)
print(response.choices[0].message.content)


2024-03-07 06:56:45 DEBUG gai.gen.Gaigen:Gaigen.load: New generator_name specified, unload current generator.


GENERATING:


2024-03-07 06:56:51 INFO gai.gen.Gaigen:Gaigen: Loading generator mistral7b-exllama...
2024-03-07 06:56:51 INFO gai.gen.ttt.TTT:Using engine ExLlama_TTT...
2024-03-07 06:56:51 INFO gai.gen.ttt.TTT:Loading model from models/Mistral-7B-Instruct-v0.1-GPTQ
2024-03-07 06:56:51 INFO gai.gen.ttt.ExLlama_TTT:ExLlama_TTT.load: Loading model from /home/roylai/gai/models/Mistral-7B-Instruct-v0.1-GPTQ/model.safetensors
2024-03-07 06:57:16 DEBUG gai.gen.ttt.ExLlama_TTT:ExLlama_TTT.streaming: prompt=USER: Tell me a one paragraph short story.
ASSISTANT:
2024-03-07 06:57:16 DEBUG gai.gen.ttt.ExLlama_TTT:ExLlama_TTT.streaming: model_params={'temperature': 1.2, 'top_p': 0.15, 'min_p': 0.0, 'top_k': 50, 'max_new_tokens': 100, 'typical': 0.0, 'token_repetition_penalty_max': 1.25, 'token_repetition_penalty_sustain': 256, 'token_repetition_penalty_decay': 128, 'beams': 1, 'beam_length': 1}
2024-03-07 06:57:16 DEBUG gai.gen.ttt.ExLlama_TTT:ExLlama_TT

 Once upon a time, in a small village nestled at the foot of a mountain, there lived an old woman who was known for her wisdom and kindness. She had spent her entire life studying the mysteries of nature and the secrets of the universe, and she believed that everything happened for a reason. One day, as she sat on her porch watching the sun set over the mountains, she noticed a young boy playing in the field across the street


In [5]:
### 4. Mistral Text-to-Text Streaming

print("STREAMING:")
from gai.gen import Gaigen
gen = Gaigen.GetInstance().load('mistral7b-exllama')
response = gen.create(messages=[{'role':'USER','content':'Tell me a one paragraph short story.'},{'role':'ASSISTANT','content':''}],max_new_tokens=100,stream=True)
for chunk in response:
    if (chunk.choices[0].delta.content):
        print(chunk.choices[0].delta.content,end='',flush=True)


2024-03-07 06:58:11 DEBUG gai.gen.Gaigen:Gaigen.load: Generator is already loaded. Skip loading.
2024-03-07 06:58:11 DEBUG gai.gen.ttt.ExLlama_TTT:ExLlama_TTT.streaming: prompt=USER: Tell me a one paragraph short story.
ASSISTANT:
2024-03-07 06:58:11 DEBUG gai.gen.ttt.ExLlama_TTT:ExLlama_TTT.streaming: model_params={'temperature': 1.2, 'top_p': 0.15, 'min_p': 0.0, 'top_k': 50, 'max_new_tokens': 100, 'typical': 0.0, 'token_repetition_penalty_max': 1.25, 'token_repetition_penalty_sustain': 256, 'token_repetition_penalty_decay': 128, 'beams': 1, 'beam_length': 1}
2024-03-07 06:58:12 DEBUG gai.gen.ttt.ExLlama_TTT:ExLlama_TTT.streaming: input token count=16


STREAMING:
 Once upon a time, in a small village nestled at the foot of a mountain, there lived an old woman who was known for her wisdom and kindness. She had spent her entire life studying the mysteries of nature and sharing her knowledge with others. One day, as she sat by the river, a young boy approached her with a troubled expression on his face. He told her that he

2024-03-07 06:58:18 DEBUG gai.gen.ttt.ExLlama_TTT:ExLlama_TTT.streaming: stopped by max_new_tokens: 100


 didn't know what to do with his life

### Yarn-Mistral-7B 128k-context 4-bit quantized

Repeat the earlier examples but using a different version of Mistral-7B model with a larger context window.

```sh
huggingface-cli download TheBloke/Yarn-Mistral-7B-128k-GPTQ \
        --local-dir ~/gai/models/Yarn-Mistral-7B-128k-GPTQ \
        --local-dir-use-symlinks False
```

According to the YaRN authors' paper, this model's perplexity appears to be lower (better) than the original Mistral-7B's once the context length exceeds about 10k tokens.

![perplexity-of-mistral7b-128k](https://raw.githubusercontent.com/jquesnelle/yarn/mistral/data/proofpile-long-small-mistral.csv.png)
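
For readers unfamiliar with the metric: perplexity is the exponential of the average negative log-likelihood per token, so lower is better. The sketch below computes it from a list of per-token log-probabilities. It illustrates the standard definition only and does not reproduce the paper's evaluation setup.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over the tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# A model that assigns probability 0.25 to every token has perplexity
# of about 4: it is as uncertain as a uniform choice among four options.
uniform = [math.log(0.25)] * 8
print(perplexity(uniform))
```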



In [1]:
### 3. Mistral Text-to-Text Generation

print("GENERATING:")
from gai.gen import Gaigen
gen = Gaigen.GetInstance().load('mistral7b_128k-exllama')
response = gen.create(messages=[{'role':'USER','content':'Tell me a one paragraph short story.'},{'role':'ASSISTANT','content':''}],max_new_tokens=100)
print(response.choices[0].message.content,end='',flush=True)

GENERATING:


2024-03-07 07:01:06 INFO gai.gen.Gaigen:Gaigen: Loading generator mistral7b_128k-exllama...
2024-03-07 07:01:06 INFO gai.gen.ttt.TTT:Using engine ExLlama_TTT...
2024-03-07 07:01:06 INFO gai.gen.ttt.TTT:Loading model from models/Yarn-Mistral-7B-128k-GPTQ
2024-03-07 07:01:06 INFO gai.gen.ttt.ExLlama_TTT:ExLlama_TTT.load: Loading model from /home/roylai/gai/models/Yarn-Mistral-7B-128k-GPTQ/model.safetensors
2024-03-07 07:01:23 DEBUG gai.gen.ttt.ExLlama_TTT:ExLlama_TTT.streaming: prompt=USER: Tell me a one paragraph short story.
ASSISTANT:
2024-03-07 07:01:23 DEBUG gai.gen.ttt.ExLlama_TTT:ExLlama_TTT.streaming: model_params={'temperature': 1.2, 'top_p': 0.15, 'min_p': 0.0, 'top_k': 50, 'max_new_tokens': 100, 'typical': 0.0, 'token_repetition_penalty_max': 1.25, 'token_repetition_penalty_sustain': 256, 'token_repetition_penalty_decay': 128, 'beams': 1, 'beam_length': 1}
2024-03-07 07:01:23 DEBUG gai.gen.ttt.ExLlama_TTT:ExLlama_TTT.s

 Once upon a time, there was a little girl who loved to play outside in the sunshine. One day she went out into her backyard and saw something strange – it looked like a giant spider web! She walked closer and realized that it wasn’t a spider at all but rather an intricate pattern of leaves on the ground. The more she explored this beautiful design, the happier she became until finally she sat down under its

In [2]:
### 4. Mistral Text-to-Text Streaming

print("STREAMING:")
from gai.gen import Gaigen
gen = Gaigen.GetInstance().load('mistral7b_128k-exllama')
response = gen.create(messages=[{'role':'USER','content':'Tell me a one paragraph short story.'},{'role':'ASSISTANT','content':''}],max_new_tokens=100,stream=True)
for chunk in response:
    if (chunk.choices[0].delta.content):
        print(chunk.choices[0].delta.content,end='',flush=True)

2024-03-07 07:02:35 DEBUG gai.gen.Gaigen:Gaigen.load: Generator is already loaded. Skip loading.
2024-03-07 07:02:35 DEBUG gai.gen.ttt.ExLlama_TTT:ExLlama_TTT.streaming: prompt=USER: Tell me a one paragraph short story.
ASSISTANT:
2024-03-07 07:02:35 DEBUG gai.gen.ttt.ExLlama_TTT:ExLlama_TTT.streaming: model_params={'temperature': 1.2, 'top_p': 0.15, 'min_p': 0.0, 'top_k': 50, 'max_new_tokens': 100, 'typical': 0.0, 'token_repetition_penalty_max': 1.25, 'token_repetition_penalty_sustain': 256, 'token_repetition_penalty_decay': 128, 'beams': 1, 'beam_length': 1}
2024-03-07 07:02:35 DEBUG gai.gen.ttt.ExLlama_TTT:ExLlama_TTT.streaming: input token count=16


STREAMING:
 Once upon a time, there was a little girl who loved to play outside in the sunshine. One day she went out into her backyard and saw something strange – it looked like a giant spider web! She walked closer and realized that it wasn’t a spider at all but rather an intricate pattern of leaves on the ground. The more she explored this beautiful design, the happ

2024-03-07 07:02:58 DEBUG gai.gen.ttt.ExLlama_TTT:ExLlama_TTT.streaming: stopped by max_new_tokens: 100


ier she became until finally she sat down under its

### Anthropic Claude 2.1

The following example uses Anthropic's Claude 2.1 model with a 100k context window. Get an API key from Anthropic and add it to the `.env` file.
```sh
ANTHROPIC_APIKEY=<your key here>
```

In [None]:
### 5. Claude-2.1 Text-to-Text Generation

print("GENERATING:")
from gai.gen import Gaigen
gen = Gaigen.GetInstance().load('claude2-100k')
response = gen.create(messages=[{'role':'USER','content':'Tell me a one paragraph short story.'},{'role':'ASSISTANT','content':''}],max_tokens_to_sample=100)
print(response.choices[0].message.content,end='',flush=True)

In [None]:
### 6. Claude-2.1 Text-to-Text Streaming

print("STREAMING:")
from gai.gen import Gaigen
gen = Gaigen.GetInstance().load('claude2-100k')
response = gen.create(messages=[{'role':'USER','content':'Tell me a one paragraph short story.'},{'role':'ASSISTANT','content':''}],max_tokens_to_sample=100,stream=True)
for chunk in response:
    print(chunk.choices[0].delta.content,end='',flush=True)

### Llama2 7B with HuggingFace transformers

Follow the instructions [here](https://huggingface.co/docs/transformers/main/en/model_doc/llama2) to sign up with Meta and download the Llama 2 model.
Download the model in Hugging Face format from [here](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) into `~/gai/models/Llama-2-7b-chat-hf`.

In [None]:
### 7. Llama2-7B Text-to-Text Generation

print("GENERATING:")
from gai.gen import Gaigen
from IPython.utils import io
with io.capture_output() as captured:
    gen = Gaigen.GetInstance().load('llama2-transformers')
    response = gen.create(messages=[{'role':'USER','content':'Tell me a one paragraph short story.'},{'role':'ASSISTANT','content':''}],max_new_tokens=100)
print(response.choices[0].message.content,end='',flush=True)

In [None]:
### 8. Llama2-7B Text-to-Text Streaming

print("STREAMING:")
from gai.gen import Gaigen
gen = Gaigen.GetInstance().load('llama2-transformers')
from IPython.utils import io
with io.capture_output() as captured:
    response = gen.create(messages=[{'role':'USER','content':'Tell me a one paragraph short story.'},{'role':'ASSISTANT','content':''}],max_new_tokens=100,stream=True)
for chunk in response:
    print(chunk.choices[0].delta.content,end='',flush=True)

### Mistral 7B GGUF with llama.cpp (CPU only)

The following example uses a GGUF-formatted version of Mistral-7B with llama.cpp. Use it when you want the model to run on the CPU only.

In [None]:
%%bash
# Download the model
huggingface-cli download TheBloke/Mistral-7B-Instruct-v0.1-GGUF \
                mistral-7b-instruct-v0.1.Q4_K_M.gguf  \
                config.json \
                --local-dir ~/gai/models/Mistral-7B-Instruct-v0.1-GGUF \
                --local-dir-use-symlinks False

In [None]:
## 9. Mistral-7B CPU-Only Text-to-Text Generation

print("GENERATING:")
from gai.gen import Gaigen
gen = Gaigen.GetInstance().load('mistral7b-llamacpp')
from IPython.utils import io
import sys
with io.capture_output() as captured:
    # Redirect stderr to stdout
    sys.stderr = sys.stdout    
    response = gen.create(messages=[{'role':'USER','content':'Tell me a one paragraph short story.'},{'role':'ASSISTANT','content':''}],max_new_tokens=100)
print(response.choices[0].message.content,end='',flush=True)

In [None]:
## 10. Mistral-7B CPU-Only Text-to-Text Streaming

print("STREAMING:")
from gai.gen import Gaigen
gen = Gaigen.GetInstance().load('mistral7b-llamacpp')
from IPython.utils import io
with io.capture_output() as captured:
    response = gen.create(messages=[{'role':'USER','content':'Tell me a one paragraph short story.'},{'role':'ASSISTANT','content':''}],max_new_tokens=100,stream=True)
for chunk in response:
    print(chunk.choices[0].delta.content,end='',flush=True)

## 1.3 Using Function Call

OpenAI provides a powerful API feature called function calling. It lets the LLM seek external help when it reaches the limits of what it can generate on its own: instead of answering in plain text, the model returns a string that emulates a function call, based on the function descriptions provided by the user.

We will create a set of tools that can be made available to the models below.

In [3]:
tools = [
    {
        "type": "function",
        "function": {
            "name": "gg",
            "description": "The 'gg' function is a powerful tool that allows the AI to gather external information from the internet using Google search. It can be invoked when the AI needs to answer a question or provide information that requires up-to-date, comprehensive, and diverse sources which are not inherently known by the AI. For instance, it can be used to find current date, current news, weather updates, latest sports scores, trending topics, specific facts, or even the current date and time. The usage of this tool should be considered when the user's query implies or explicitly requests recent or wide-ranging data, or when the AI's inherent knowledge base may not have the required or most current information. The 'search_query' parameter should be a concise and accurate representation of the information needed.",
            "parameters": {
                "type": "object",
                "properties": {
                    "search_query": {
                        "type": "string",
                        "description": "The search query to search google with. For example, to find the current date or time, use 'current date' or 'current time' respectively."
                    }
                },
                "required": ["search_query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "scrape",
            "description": "Scrape the content of the provided url",
            "parameters": {
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The url to scrape the content from"
                    }
                },
                "required": ["url"]
            }
        }
    }
]
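
Because the model returns the call as text, it is prudent to check it against the tool definitions before acting on it. The sketch below assumes definitions shaped like the `tools` list above (descriptions shortened here to keep the block self-contained). `check_tool_call` is a hypothetical helper, not a library function.

```python
import json

# Same shape as the tools list above, with shortened descriptions.
TOOLS = [
    {"type": "function", "function": {
        "name": "gg",
        "description": "Search Google.",
        "parameters": {"type": "object",
                       "properties": {"search_query": {"type": "string"}},
                       "required": ["search_query"]}}},
    {"type": "function", "function": {
        "name": "scrape",
        "description": "Scrape the content of the provided url",
        "parameters": {"type": "object",
                       "properties": {"url": {"type": "string"}},
                       "required": ["url"]}}},
]

def check_tool_call(tools, name, arguments_json):
    """Return a list of problems; an empty list means the call looks valid."""
    specs = {t["function"]["name"]: t["function"] for t in tools}
    if name not in specs:
        return [f"unknown tool: {name}"]
    try:
        args = json.loads(arguments_json)
    except json.JSONDecodeError as exc:
        return [f"arguments are not valid JSON: {exc}"]
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]
    params = specs[name]["parameters"]
    problems = [f"missing required parameter: {p}"
                for p in params.get("required", []) if p not in args]
    problems += [f"unexpected parameter: {p}"
                 for p in args if p not in params.get("properties", {})]
    return problems

print(check_tool_call(TOOLS, "gg", '{"search_query": "current date"}'))
```

This only checks parameter names; a fuller check would also validate types against the JSON Schema in `parameters`.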

Next we will show how it works in gpt-4.

In [7]:
from gai.gen import Gaigen
from gai.common.notebook import highlight
gen = Gaigen.GetInstance().load('gpt-4')

highlight("Model decided to use tool: ")
user_prompt = "Where did PM Lee Hsien Loong hold his 2023 national day rally?"
response = gen.create(
    messages=[
        {'role':'user','content':user_prompt},
        {'role':'assistant','content':''}],
        tools=tools,
        stream=False,
        max_new_tokens=200)
print(response.choices[0].message)

highlight("Model decided not to use tool: ")
user_prompt = "Tell me a one paragraph story."
response = gen.create(
    messages=[
        {'role':'user','content':user_prompt},
        {'role':'assistant','content':''}],
        tools=tools,
        stream=False,
        max_new_tokens=200)
print(response.choices[0].message)


2024-03-07 07:08:19 DEBUG gai.gen.Gaigen:Gaigen.load: New generator_name specified, unload current generator.


2024-03-07 07:08:19 INFO gai.gen.Gaigen:Gaigen: Loading generator gpt-4...
2024-03-07 07:08:19 INFO gai.gen.ttt.TTT:Using engine OpenAI_TTT...


2024-03-07 07:08:19 DEBUG gai.gen.ttt.OpenAI_TTT:OpenAI_TTT.create: model_params={'max_tokens': 100, 'temperature': 0.7, 'top_p': 1, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'stop': None, 'logit_bias': {}, 'n': 1, 'tools': [{'type': 'function', 'function': {'name': 'gg', 'description': "The 'gg' function is a powerful tool that allows the AI to gather external information from the internet using Google search. It can be invoked when the AI needs to answer a question or provide information that requires up-to-date, comprehensive, and diverse sources which are not inherently known by the AI. For instance, it can be used to find current date, current news, weather updates, latest sports scores, trending topics, specific facts, or even the current date and time. The usage of this tool should be considered when the user's query implies or explicitly requests recent or wide-ranging data, or when the AI's inherent knowledge base may not have the required or most current informa

ChatCompletionMessage(content=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call_MRSjMb7JHsU9mu3D1MDE6atO', function=Function(arguments='{\n  "search_query": "PM Lee Hsien Loong 2023 national day rally location"\n}', name='gg'), type='function')])


2024-03-07 07:08:22 DEBUG gai.gen.ttt.OpenAI_TTT:OpenAI_TTT.create: model_params={'max_tokens': 100, 'temperature': 0.7, 'top_p': 1, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'stop': None, 'logit_bias': {}, 'n': 1, 'tools': [{'type': 'function', 'function': {'name': 'gg', 'description': "The 'gg' function is a powerful tool that allows the AI to gather external information from the internet using Google search. It can be invoked when the AI needs to answer a question or provide information that requires up-to-date, comprehensive, and diverse sources which are not inherently known by the AI. For instance, it can be used to find current date, current news, weather updates, latest sports scores, trending topics, specific facts, or even the current date and time. The usage of this tool should be considered when the user's query implies or explicitly requests recent or wide-ranging data, or when the AI's inherent knowledge base may not have the required or most current informa

ChatCompletionMessage(content='Once upon a time, in a small, peaceful village nestled between the lush green hills, lived a kind-hearted blacksmith named Albert. One day, he discovered a magical stone that could turn anything it touched into gold. Overwhelmed with joy, Albert turned everything he owned into gold, but soon realized his food also turned to gold, leaving him starving. In his despair, he pleaded with the stone to take back its magic. The stone, hearing his genuine remorse, reversed its spell.', role='assistant', function_call=None, tool_calls=None)


In the following example, we demonstrate the same feature applied to Mistral-7B.

In [6]:
from gai.gen import Gaigen
from gai.common.notebook import highlight
gen = Gaigen.GetInstance().load('mistral7b-exllama')

highlight("Model decided to use tool: ")
user_prompt = "Where did PM Lee Hsien Loong hold his 2023 national day rally?"
response = gen.create(
    messages=[
        {'role':'user','content':user_prompt},
        {'role':'assistant','content':''}],
        tools=tools,
        stream=False,
        max_new_tokens=200)
print(response.choices[0].message)

highlight("Model decided not to use tool: ")
user_prompt = "Tell me a one paragraph story."
response = gen.create(
    messages=[
        {'role':'user','content':user_prompt},
        {'role':'assistant','content':''}],
        tools=tools,
        stream=False,
        max_new_tokens=200)
print(response.choices[0].message)

2024-03-07 07:07:19 DEBUG gai.gen.Gaigen:Gaigen.load: Generator is already loaded. Skip loading.


2024-03-07 07:07:19 DEBUG gai.gen.ttt.ExLlama_TTT:ExLlama_TTT.streaming: prompt=system: system:


                You will always begin your interaction by asking yourself if the user's message is a message that requires a tool response or a text response.
                                
                DEFINITIONS:
                1. A tool response is based on the following JSON format:
                        <tool>
                        {
                            'function': {
                                'name': ...,
                                'parameters': ...
                            }
                        }
                        </tool>
                
                And the tool is chosen from the following <tools> list:
                        <tools>
                        [{'type': 'function', 'function': {'name': 'gg', 'description': "The 'gg' function is a powerful tool that allows the AI to gather external information from the internet using

2024-03-07 07:07:22 DEBUG gai.gen.ttt.ExLlama_TTT:ExLlama_TTT.streaming: tool_name=gg
2024-03-07 07:07:27 DEBUG gai.gen.ttt.ExLlama_TTT:ExLlama_TTT.streaming: tool_arguments={
                "search_query": "PM Lee Hsien Loong 2023 National Day Rally location"
            }
2024-03-07 07:07:28 DEBUG gai.gen.ttt.ExLlama_TTT:ExLlama_TTT.streaming: stopped by eos_token_id: 2


ChatCompletionMessage(content=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call_05437db3-43ff-4612-8a89-666c50e409e6', function=Function(arguments='{"search_query": "PM Lee Hsien Loong 2023 National Day Rally location"}', name='gg'), type='function')])


2024-03-07 07:07:28 DEBUG gai.gen.ttt.ExLlama_TTT:ExLlama_TTT.streaming: prompt=system: system:


                You will always begin your interaction by asking yourself if the user's message is a message that requires a tool response or a text response.
                                
                DEFINITIONS:
                1. A tool response is based on the following JSON format:
                        <tool>
                        {
                            'function': {
                                'name': ...,
                                'parameters': ...
                            }
                        }
                        </tool>
                
                And the tool is chosen from the following <tools> list:
                        <tools>
                        [{'type': 'function', 'function': {'name': 'gg', 'description': "The 'gg' function is a powerful tool that allows the AI to gather external information from the internet using

ChatCompletionMessage(content=' Once upon a time, there was a little girl named Alice who lived in an enchanted forest full of talking animals and magical creatures. One day, she stumbled upon a secret door hidden behind a waterfall and decided to explore it. As she entered through the door, she found herself in a beautiful kingdom ruled by a kind queen. The queen welcomed her warmly and showed her around the castle, introducing her to all sorts of fascinating people and things. Alice had so much fun that she never wanted to leave!', role='assistant', function_call=None, tool_calls=None)
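
Both models return the same message shape: either `tool_calls` is populated or `content` holds plain text. One way to act on that is a small dispatcher, sketched below. The `gg` and `scrape` functions are hypothetical stubs standing in for real search and scraping code, and `SimpleNamespace` objects mimic the printed messages so the example runs on its own.

```python
import json
from types import SimpleNamespace

# Hypothetical local implementations; real ones would call a search
# API or fetch a web page.
def gg(search_query):
    return f"[search results for: {search_query}]"

def scrape(url):
    return f"[content of: {url}]"

HANDLERS = {"gg": gg, "scrape": scrape}

def run_tool_calls(message):
    """Run each requested tool, or return the text if none was requested."""
    if not message.tool_calls:
        return message.content
    results = []
    for call in message.tool_calls:
        handler = HANDLERS[call.function.name]  # KeyError on unknown tools
        kwargs = json.loads(call.function.arguments)
        results.append(handler(**kwargs))
    return results

# Stand-ins shaped like the two messages printed above.
tool_message = SimpleNamespace(content=None, tool_calls=[SimpleNamespace(
    function=SimpleNamespace(
        name="gg",
        arguments='{"search_query": "PM Lee Hsien Loong 2023 National Day Rally location"}'))])
text_message = SimpleNamespace(content="Once upon a time...", tool_calls=None)

print(run_tool_calls(tool_message))
print(run_tool_calls(text_message))
```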


## 1.4 Returning JSON

Here's an example of how to return JSON output. This is useful for extracting structured data from unstructured results.


In [8]:
system_prompt = """
The user will show you information scraped from the web. 
You will analyse and return a <relevance> format.

DEFINITIONS:
1. A <relevance> response is based on the following JSON format:
        <relevance>
        {
            'type': 'json',
            'json': {
                'Relevance': '<relevance>%',
                'Reason': '<reason>'
            }
        }
        </relevance>

You have a tendency to forget the question while you are analysing the answer. 
Follow these steps:
1. Break down the answer into smaller chunks.
2. Ask yourself what is the question again then assess whether any part of the information within the answer is relevant to any part of the question.
3. Finally, you give a percentage overall score of <relevance> immediately. If it contains any direct relevance, you must give a score of at least 40%.
You will not be given a second chance or any further context or information. If you think the answer is not related to the question, just reply with "Not relevant" and give a reason.
Rule:
a. <relevance> is a string between "0%" and "100%". If Not relevant, return "0".
b. Pay attention to small part of the answer that addresses the question directly. Even when it is small part, the direct relevance is highly significant.
c. Respond only with <relevance> response and nothing else.
"""

In [9]:
user_prompt = """
My question is 'Where is PM Lee Hsien Loong 2023 National Day Rally location?'.
Refer to this <webpage>Once upon a time, there was a little girl named Alice who lived in an enchanted forest full of talking animals and magical creatures. One day, she stumbled upon a secret door hidden behind a waterfall and decided to explore it. As she entered through the door, she found herself in a beautiful kingdom ruled by a kind queen. The queen welcomed her warmly and showed her around the castle, introducing her to all sorts of fascinating people and things. Alice had so much fun that she never wanted to leave!</webpage> and return a <relevance> response.
"""

In [13]:
response = gen.create(
    messages=[
        {'role':'system','content':system_prompt},
        {'role':'user','content':user_prompt},
        {'role':'assistant','content':''}],
        stream=True,
        max_new_tokens=200)
for output in response:
    if output.choices[0].delta.content:
        print(output.choices[0].delta.content, end='', flush=True)

2024-03-07 07:25:34 DEBUG gai.gen.ttt.OpenAI_TTT:OpenAI_TTT.create: model_params={'max_tokens': 100, 'temperature': 0.7, 'top_p': 1, 'presence_penalty': 0.0, 'frequency_penalty': 0.0, 'stop': None, 'logit_bias': {}, 'n': 1}


{
    'type': 'json',
    'json': {
        'Relevance': '0%',
        'Reason': 'The information provided does not contain any details relevant to the location of PM Lee Hsien Loong 2023 National Day Rally.'
    }
}
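
Note that the output above uses single quotes, mirroring the format in the system prompt, so `json.loads` would reject it. A minimal sketch of one way to parse it is to treat the text as a Python literal with `ast.literal_eval`. The `parse_relevance` helper is hypothetical, and the reason text is shortened here.

```python
import ast

# Output in the same single-quoted format as shown above.
raw = """{
    'type': 'json',
    'json': {
        'Relevance': '0%',
        'Reason': 'The information provided does not contain any relevant details.'
    }
}"""

def parse_relevance(text):
    """Return (score, reason) from a single-quoted <relevance> payload."""
    data = ast.literal_eval(text.strip())  # safe: evaluates literals only
    payload = data["json"]
    score = float(payload["Relevance"].rstrip("%"))
    return score, payload["Reason"]

score, reason = parse_relevance(raw)
print(score, reason)
```

Asking the model for double-quoted JSON in the system prompt would let you use `json.loads` directly instead.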

## 1.5 Running as a Service

### Step 1: Start Docker container

In [None]:
%%bash

# Stop any container with the same name
docker rm -f gai-ttt

# Start the container
docker run -d \
    --name gai-ttt \
    -p 12031:12031 \
    --gpus all \
    -v ~/gai/models:/app/models \
    kakkoii1337/gai-ttt:latest

# Wait for model to load
sleep 30

# Confirm it's running
docker logs gai-ttt

When the loading is completed, the logs should show this:

```bash
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:12031 (Press CTRL+C to quit)
```
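
A fixed `sleep 30` can be too short on a slow disk and needlessly long on a fast one. As an alternative, the sketch below polls until the port accepts TCP connections. `wait_for_port` is a hypothetical helper, and an open port is only an approximation of "model loaded", so checking the logs remains the definitive test.

```python
import socket
import time

def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Return True once host:port accepts a TCP connection, else False."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Connection succeeded: something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # not up yet; retry until the deadline
    return False
```

For the container above, `wait_for_port("localhost", 12031, timeout=120)` would block until the service is reachable or two minutes have passed.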

### Step 2: Run Text Generation Client

The default model is Mistral 7B with an 8k context window.

In [None]:
import requests,json
response = requests.post(
    url='http://localhost:12031/gen/v1/chat/completions', 
    json={
        "model": "mistral7b-exllama",
        "messages": [
            {"role": "user", "content": "Tell me a one paragraph short story."},
            {"role": "assistant", "content": ""}
        ],
        "max_new_tokens": 100,
        "stream": True
    },
    stream=True)
for chunk in response.iter_lines():
    if not chunk:
        continue  # skip empty lines between chunks
    result = json.loads(chunk.decode('utf-8'))
    content = result["choices"][0]["delta"].get("content")
    if content:
        print(content,end='',flush=True)


In [16]:
import requests,json
response = requests.post(
    url='http://localhost:12031/gen/v1/chat/completions', 
    json={
        "model": "mistral7b-exllama",
        "messages": [
            {"role": "user", "content": "What is today's date?"},
            {"role": "assistant", "content": ""}
        ],
        "max_new_tokens": 100,
        "stream": False,
        "tools": [
        {
            "type": "function",
            "function": {
                "name": "gg",
                "description": "The 'gg' function is a powerful tool that allows the AI to gather external information from the internet using Google search. It can be invoked when the AI needs to answer a question or provide information that requires up-to-date, comprehensive, and diverse sources which are not inherently known by the AI. For instance, it can be used to find current date, current news, weather updates, latest sports scores, trending topics, specific facts, or even the current date and time. The usage of this tool should be considered when the user's query implies or explicitly requests recent or wide-ranging data, or when the AI's inherent knowledge base may not have the required or most current information. The 'search_query' parameter should be a concise and accurate representation of the information needed.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "search_query": {
                            "type": "string",
                            "description": "The search query to search google with. For example, to find the current date or time, use 'current date' or 'current time' respectively."
                        }
                    },
                    "required": ["search_query"]
                }
            }
        }]
    },
    stream=False)

print(response.text)


{"id":"chatcmpl-53e917e4-2703-4587-bede-5a5de96f12f3","choices":[{"finish_reason":"tool_calls","index":0,"logprobs":null,"message":{"content":null,"role":"assistant","function_call":null,"tool_calls":[{"id":"call_fb4edf0a-a509-4c24-91a3-78dd8b4fdcf5","function":{"arguments":"{\"search_query\": \"current date\"}","name":"gg"},"type":"function"}]}}],"created":1707944860,"model":"Mistral7B-ExLlama","object":"chat.completion","system_fingerprint":null,"usage":{"completion_tokens":0,"prompt_tokens":3620,"total_tokens":3620}}
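
In the response above, `arguments` is itself a JSON string nested inside the JSON body, so it needs a second decode. The sketch below extracts the tool calls from a trimmed copy of that response (ids and usage omitted). `extract_tool_calls` is a hypothetical helper.

```python
import json

# Trimmed copy of the response above, rebuilt so the nested
# "arguments" string is escaped correctly.
response_text = json.dumps({
    "choices": [{
        "finish_reason": "tool_calls",
        "index": 0,
        "message": {
            "content": None,
            "role": "assistant",
            "tool_calls": [{
                "id": "call_example",
                "type": "function",
                "function": {
                    "name": "gg",
                    "arguments": json.dumps({"search_query": "current date"}),
                },
            }],
        },
    }]
})

def extract_tool_calls(text):
    """Return (name, arguments) pairs, decoding the nested arguments."""
    choice = json.loads(text)["choices"][0]
    calls = choice["message"].get("tool_calls") or []
    return [(c["function"]["name"], json.loads(c["function"]["arguments"]))
            for c in calls]

calls = extract_tool_calls(response_text)
print(calls)
```

With a live service, pass `response.text` from the request above instead of the sample string.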
