In [1]:
!pip install transformers # Getting ready to use a HF model.
!pip install torch torchvision torchaudio # PyTorch install
!pip install flash_attn

Collecting flash_attn
  Downloading flash_attn-2.7.3.tar.gz (3.2 MB)
     3.2/3.2 MB 37.0 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: flash_attn
  Building wheel for flash_attn (setup.py) ... done
  Created wheel for flash_attn: filename=flash_attn-2.7.3-cp310-cp310-linux_x86_64.whl size=191333579 sha256=15aae40ce1b09f613ea943dd55f425c57e7496661b113fada685520d9339aea3
  Stored in directory: /root/.cache/pip/wheels/85/d7/10/a74c9fe5ffe6ff306b27a220b2bf2f37d907b68fdcd138cdda
Successfully built flash_attn
Installing collected packages: flash_attn
Successfully installed flash_attn-2.7.3


In [3]:
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
device

'cuda'

In [4]:
# Initial thought process
"""
Call a HF model -> pass through chat template -> function calling capabilities -> output
"""

'\nCall a HF model -> pass through chat template -> function calling capabilities -> output\n'

## Stage 1: Research
Before starting any task involving open-source LLMs, it is worth researching which open-source models suit your use case. The best place to look is HuggingFace, which hosts a very large number of model repositories, including fine-tuned and modified variants for specific use cases.

Two factors matter most when choosing an LLM for a use case:
1. Size
2. License

### Size
A company that is serious about LLMs may need to host the model itself rather than call a third-party API, for example by deploying a HuggingFace model to SageMaker or a similar platform. Hosting cost grows with the size of the model (in billions of parameters), because larger models need more GPU memory, so the model has to be small enough to stay within the company's cost threshold.
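
As a rough illustration of how parameter count translates into hardware requirements, the sketch below estimates the memory needed just to hold the weights. The helper name `estimate_weight_memory_gb` and the dtype table are assumptions made for this example; the estimate ignores activations and the KV cache, so real usage will be higher.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def estimate_weight_memory_gb(num_params_billion: float, dtype: str = "bfloat16") -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = num_params_billion * 1e9 * BYTES_PER_PARAM[dtype]
    return total_bytes / 1e9


# Compare the footprint of a 6.7B-parameter model across dtypes.
for dtype in ("float32", "bfloat16", "int8"):
    size = estimate_weight_memory_gb(6.7, dtype)
    print(f"6.7B parameters in {dtype}: ~{size:.1f} GB")
```

For a 6.7B model in `bfloat16` this gives about 13.4 GB, which is roughly in line with the two checkpoint shards (9.98G and 3.50G) downloaded further down.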

### License
Many open-source LLMs can be downloaded and used by anyone but are not licensed for commercial use. Therefore, when looking at any HuggingFace repository, it is important to check the license it ships with. Licenses that are generally good to go, subject to the caveats noted, include:
1. MIT
2. Apache
3. BSD
4. CC-BY-NC (can be used, but not commercially)
5. Llama (specific to the Llama family of models)



After some research, we will take a look at:
1. OpenCodeInterpreter-DS-6.7B by m-a-p (Multimodal Art Projection)
2. DeepSeek-Coder-V2-Lite-Instruct by deepseek-ai
3. Mistral-7B-Instruct-v0.3 by mistralai

One question that may come up is why I chose instruct models rather than base models. Instruct models have been fine-tuned to follow the question-answer format of a chatbot, whereas base models are suited to general language modelling.

These models are not expected to match GPT-4 on code generation benchmarks, although a direct comparison is not possible here because GPT-4 is closed source.

In [5]:
## OpenCodeInterpreter driver code (from HuggingFace)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_path="m-a-p/OpenCodeInterpreter-DS-6.7B"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

tokenizer_config.json:   0%|          | 0.00/5.18k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.37M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/462 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/716 [00:00<?, ?B/s]

Unrecognized keys in `rope_scaling` for 'rope_type'='linear': {'type'}


model.safetensors.index.json:   0%|          | 0.00/25.1k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/119 [00:00<?, ?B/s]

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32256, 4096)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (v_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (o_proj): Linear(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (up_proj): Linear(in_features=4096, out_features=11008, bias=False)
          (down_proj): Linear(in_features=11008, out_features=4096, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((4096,), eps=1e-06)
        (post_attention_layernorm): LlamaRMSNorm((4096,), eps=1e-06)
      )
    )
    (no

In [6]:
# Testing the code
prompt = "Write a function to find the shared elements from the given two lists."
inputs = tokenizer.apply_chat_template(
        [{'role': 'user', 'content': prompt }],
        return_tensors="pt"
    ).to(model.device)
outputs = model.generate(
    inputs, 
    max_new_tokens=1024,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True))

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


def shared_elements(list1, list2):
    return list(set(list1) & set(list2))


In [7]:
# Inspecting the token IDs that the model's chat template produced for the prompt
print(inputs)

tensor([[ 2042,   417,   274, 20926, 14244, 20391,    11, 26696,   254, 20676,
         30742,   339,  8589,  2008,    11,  6908,   457, 20676, 30742,  7958,
            11,   285,   340,   885,  3495,  4301,  4512,   276,  4531,  8214,
            13,  1487,  4636,  2223, 13143,  4301,    11,  5411,   285, 13936,
          4447,    11,   285,   746,  2159,    12, 13517,   250,  8214,  4301,
            11,   340,   540, 20857,   276,  3495,    13,   185, 13518,  3649,
          3475,    25,   185,  9083,   245,  1155,   276,  1273,   254,  7483,
          4889,   473,   254,  2017,   979, 11996,    13,   185, 13518, 21289,
            25,   185]], device='cuda:0')


## Stage 2: Function Calling and Code Generation

We now come to the task itself. I have chosen to demo two approaches: 
1. Using prompt engineering (by describing the callable functions manually in the prompt)
2. Using an OpenAI-style approach

### Using prompt engineering
We will manually define `n` functions that the model can use. The limitation of this approach is that any function the model calls must be in that predefined list. Supporting any other function requires either a fallback mechanism or fine-tuning the model.
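
The idea of a fixed function list with a fallback can be sketched as follows. This is a minimal illustration, not the notebook's implementation: the registry name `FUNCTIONS`, the helpers `dispatch` and `fallback`, and the placeholder bodies of `get_weather` and `calculate_sum` are all assumptions made for the example.

```python
from typing import Any, Callable, Dict


# Placeholder implementations; a real system would call an API or service here.
def get_weather(location: str) -> str:
    return f"Weather lookup for {location} is not implemented in this sketch."


def calculate_sum(a: int, b: int) -> int:
    return a + b


# The registry is the complete list of functions the model is allowed to call.
FUNCTIONS: Dict[str, Callable[..., Any]] = {
    "get_weather": get_weather,
    "calculate_sum": calculate_sum,
}


def fallback(name: str, arguments: Dict[str, Any]) -> str:
    """Handle a function name that is outside the registry."""
    return f"Unsupported function '{name}'. Available: {sorted(FUNCTIONS)}"


def dispatch(name: str, arguments: Dict[str, Any]) -> Any:
    """Call a registered function, or defer to the fallback if it is unknown."""
    func = FUNCTIONS.get(name)
    if func is None:
        return fallback(name, arguments)
    return func(**arguments)


print(dispatch("calculate_sum", {"a": 2, "b": 3}))
print(dispatch("send_email", {"to": "someone"}))
```

Here the fallback only reports the problem. It could equally re-prompt the model with the list of valid names, which is one way to recover without fine-tuning.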

In [None]:
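# Creating our function caller
import inspect
import json

# NOTE: `functions` is assumed to be a dict mapping each function name to a
# Python callable, e.g. {"get_weather": get_weather, "calculate_sum": calculate_sum}.
# It must be defined before function_caller is used.
def function_caller(model_output):
    """Parse the model's JSON output and call the requested function."""
    try:
        # Parse the model output
        output = json.loads(model_output)
        function_name = output["function_call"]
        arguments = output["arguments"]

        # Validate the function
        if function_name not in functions:
            raise ValueError(f"Function '{function_name}' not found.")

        # Get the function and its signature
        func = functions[function_name]
        sig = inspect.signature(func)

        # Keep only the arguments that the function actually accepts
        validated_args = {k: v for k, v in arguments.items() if k in sig.parameters}

        # Call the function with validated arguments
        result = func(**validated_args)
        return result

    except Exception as e:
        # Any failure (bad JSON, unknown function, missing argument) is returned as text
        return str(e)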

In [13]:
# Defining the prompt for our model. 
prompt = f"""
You are a helpful assistant that can call functions to perform various tasks.

Here are the available functions you can call:

1. `get_weather(location: str) -> str`: Returns the weather in the given location.
2. `calculate_sum(a: int, b: int) -> int`: Returns the sum of two numbers.

When you understand the user's request, return your response in the following JSON format:
```json
{{
  "function_call": "<function_name>",
  "arguments": {{
    "<argument1>": <value1>,
    "<argument2>": <value2>
  }}
}}
```
"""

In [12]:
# Tokenize the prompt
inputs = tokenizer.apply_chat_template(
    [{'role': 'user', 'content': prompt}],
    return_tensors="pt"
).to(model.device)

# Generate output from the model
outputs = model.generate(
    inputs, 
    max_new_tokens=1024,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

# Decode the output to text
output_text = tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True)
print("Model Output:", output_text)

Model Output: {
  "function_call": "get_weather",
  "arguments": {
    "location": "New York"
  }
}
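
The model will not always return bare JSON as it did here; it may wrap the object in a fenced block or add surrounding text. The sketch below shows one way to pull the first JSON object out of such output before handing it to a function caller. The helper name `extract_json_object` and the sample string `raw` are assumptions made for this example.

```python
import json
import re
from typing import Any, Dict, Optional


def extract_json_object(text: str) -> Optional[Dict[str, Any]]:
    """Return the first JSON object found in `text`, or None if there is none."""
    # Prefer the contents of a fenced block when the model emits one.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    candidate = fenced.group(1) if fenced else text

    decoder = json.JSONDecoder()
    # Try decoding from each opening brace until one position parses cleanly.
    for match in re.finditer(r"\{", candidate):
        try:
            obj, _ = decoder.raw_decode(candidate, match.start())
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    return None


# Hypothetical model output with extra text and a fence around the JSON.
raw = 'Sure, here is the call:\n```json\n{"function_call": "get_weather", "arguments": {"location": "New York"}}\n```'
print(extract_json_object(raw))
```

Returning `None` instead of raising keeps the decision about what to do with unparseable output (retry, fallback, or error) in the calling code.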


### Using OpenAI-style approach
With this approach, each function is described by a structured JSON schema, so the model's output can be parsed and the function executed automatically. This requires stricter JSON output from the model and more setup in general.

In [18]:
# Creating our JSON structure. 
json_input = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current temperature for a given location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and country e.g. Bogotá, Colombia"
                    }
                },
                "required": [
                    "location"
                ],
                "additionalProperties": False
            },
            "strict": True
        }
    },
    {
        "type": "function",
        "function": {
            "name": "calculate_sum",
            "description": "Calculate the sum of two integers.",
            "parameters": {
                "type": "object",
                "properties": {
                    "a": {
                        "type": "integer",
                        "description": "First integer to add"
                    },
                    "b": {
                        "type": "integer",
                        "description": "Second integer to add"
                    }
                },
                "required": [
                    "a",
                    "b"
                ],
                "additionalProperties": False
            },
            "strict": True
        }
    }
]

In [19]:
# Our prompt to include the tools object
import json

prompt = f"""
You are an assistant that can call the following functions:

{json.dumps(json_input, indent=4)}

Please provide a function call in JSON format when appropriate.
"""

In [20]:
inputs = tokenizer.apply_chat_template(
    [{'role': 'user', 'content': prompt}],
    return_tensors="pt"
).to(model.device)

# Generate output from the model
outputs = model.generate(
    inputs, 
    max_new_tokens=1024,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
    eos_token_id=tokenizer.eos_token_id,
)

# Decode the output to text
output_text = tokenizer.decode(outputs[0][len(inputs[0]):], skip_special_tokens=True)
print("Model Output:", output_text)

Model Output: {
  "function": "get_weather",
  "parameters": {
    "location": "Bogota, Colombia"
  }
}
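
Note that this output uses the keys `function` and `parameters`, while the prompt-engineering run produced `function_call` and `arguments`. Before executing anything, it helps to normalize the keys and check the arguments against the schema. The following is a minimal sketch under stated assumptions: `TOOLS` is a trimmed-down copy of the tools object, and `normalize_call`, `validate_call`, and `PY_TYPES` are hypothetical helpers covering only the schema features used in this notebook, not a full JSON Schema validator.

```python
from typing import Any, Dict, List, Tuple

# Trimmed-down tool definition in the same shape as the tools object above.
TOOLS: List[Dict[str, Any]] = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
                "additionalProperties": False,
            },
        },
    }
]

# JSON Schema type names mapped to Python types (only the ones used here).
PY_TYPES = {"string": str, "integer": int}


def normalize_call(raw: Dict[str, Any]) -> Tuple[Any, Dict[str, Any]]:
    """Accept either key convention seen in the model outputs."""
    name = raw.get("function_call", raw.get("function"))
    args = raw.get("arguments", raw.get("parameters", {}))
    return name, args


def validate_call(name: Any, args: Dict[str, Any], tools: List[Dict[str, Any]] = TOOLS) -> List[str]:
    """Return a list of problems; an empty list means the call is acceptable."""
    specs = {t["function"]["name"]: t["function"]["parameters"] for t in tools}
    if name not in specs:
        return [f"unknown function: {name}"]
    schema = specs[name]
    errors = [f"missing argument: {k}" for k in schema["required"] if k not in args]
    for key, value in args.items():
        prop = schema["properties"].get(key)
        if prop is None:
            if schema.get("additionalProperties") is False:
                errors.append(f"unexpected argument: {key}")
        elif not isinstance(value, PY_TYPES[prop["type"]]):
            errors.append(f"wrong type for {key}: expected {prop['type']}")
    return errors


name, args = normalize_call(
    {"function": "get_weather", "parameters": {"location": "Bogota, Colombia"}}
)
print(name, args, validate_call(name, args))
```

Only a call that comes back with an empty error list would then be passed on for execution.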


## Stage 3: Evaluation
After trying out the different approaches to function calling, one can build a sample dataset of `n` user requests, each paired with a ground-truth function call, and compare the model's generated output against the ground truth.
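
A minimal sketch of such a comparison is shown below, using exact match between the parsed model output and the ground-truth call. The function names and the three samples are hypothetical: the predictions are hard-coded strings for illustration, not real generations from the model.

```python
import json
from typing import Any, Dict, List


def calls_match(predicted_text: str, expected: Dict[str, Any]) -> bool:
    """True when the model output parses to exactly the expected call."""
    try:
        predicted = json.loads(predicted_text)
    except json.JSONDecodeError:
        # Output that is not valid JSON counts as a miss.
        return False
    return predicted == expected


def exact_match_accuracy(samples: List[Dict[str, Any]]) -> float:
    """Fraction of samples whose prediction equals the ground truth."""
    if not samples:
        return 0.0
    hits = sum(calls_match(s["prediction"], s["ground_truth"]) for s in samples)
    return hits / len(samples)


# Hypothetical evaluation set: one correct call, one wrong argument, one refusal.
samples = [
    {
        "prediction": '{"function_call": "get_weather", "arguments": {"location": "New York"}}',
        "ground_truth": {"function_call": "get_weather", "arguments": {"location": "New York"}},
    },
    {
        "prediction": '{"function_call": "calculate_sum", "arguments": {"a": 2, "b": 2}}',
        "ground_truth": {"function_call": "calculate_sum", "arguments": {"a": 2, "b": 3}},
    },
    {
        "prediction": "I cannot help with that.",
        "ground_truth": {"function_call": "get_weather", "arguments": {"location": "Paris"}},
    },
]
print(f"Exact-match accuracy: {exact_match_accuracy(samples):.2f}")
```

Exact match is strict: a correct function name with one wrong argument scores zero. Scoring the function name and the arguments separately would give a finer-grained picture.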

Many open-source LLMs ship with dedicated function-calling support, but as shown above, function calling can also be emulated through prompting on models that do not natively define it.