# SageMaker Code Generation with Code Llama: Deploying Pre-trained Code Llama

#### Import sys and install the required libraries: LangChain, ChromaDB as the vector database for storing indexes, and boto3 for access to AWS

In [2]:
import sys
!{sys.executable} -m pip install langchain
!{sys.executable} -m pip install chromadb
!{sys.executable} -m pip install --upgrade boto3

Collecting langchain
  Downloading langchain-0.1.5-py3-none-any.whl.metadata (13 kB)
Collecting aiohttp<4.0.0,>=3.8.3 (from langchain)
  Downloading aiohttp-3.9.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.4 kB)
Collecting async-timeout<5.0.0,>=4.0.0 (from langchain)
  Using cached async_timeout-4.0.3-py3-none-any.whl.metadata (4.2 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain)
  Downloading dataclasses_json-0.6.4-py3-none-any.whl.metadata (25 kB)
Collecting jsonpatch<2.0,>=1.33 (from langchain)
  Downloading jsonpatch-1.33-py2.py3-none-any.whl.metadata (3.0 kB)
Collecting langchain-community<0.1,>=0.0.17 (from langchain)
  Downloading langchain_community-0.0.17-py3-none-any.whl.metadata (7.9 kB)
Collecting langchain-core<0.2,>=0.1.16 (from langchain)
  Downloading langchain_core-0.1.18-py3-none-any.whl.metadata (6.0 kB)
Collecting langsmith<0.1,>=0.0.83 (from langchain)
  Downloading langsmith-0.0.86-py3-none-any.whl.metadata (10 kB)
Collectin

#### Import the remaining libraries, including the document loaders and the recursive character text splitter, so that we can generate code efficiently with our model

In [3]:
import argparse
import os
from langchain.document_loaders import DirectoryLoader
import chromadb
import json
import boto3
import time
import glob
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    Language,
)
import ast
import sys

### Deploy the Code Llama 7B model


In [4]:
model_id = "meta-textgeneration-llama-codellama-7b"

from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id=model_id)
predictor = model.deploy(accept_eula = True)

sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml
sagemaker.config INFO - Not applying SDK defaults from location: /root/.config/sagemaker/config.yaml


Using model 'meta-textgeneration-llama-codellama-7b' with wildcard version identifier '*'. You can pin to version '2.0.0' for more stable results. Note that models may have different input/output signatures after a major version upgrade.


---------!
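
The log above suggests pinning the model version for more stable results. A minimal sketch of how that could look, assuming `JumpStartModel` accepts a `model_version` keyword argument (as in recent releases of the SageMaker Python SDK). The helper names `jumpstart_kwargs` and `deploy_pinned` are hypothetical and only used for illustration.

```python
def jumpstart_kwargs(model_id: str, model_version: str = "*") -> dict:
    """Build keyword arguments for JumpStartModel with an explicit version."""
    if not model_id:
        raise ValueError("model_id must be a non-empty string")
    return {"model_id": model_id, "model_version": model_version}


def deploy_pinned(model_id: str, model_version: str):
    """Deploy a JumpStart model at a fixed version (creates billable resources)."""
    # Imported lazily so the helper above can be used without the SDK installed.
    from sagemaker.jumpstart.model import JumpStartModel

    model = JumpStartModel(**jumpstart_kwargs(model_id, model_version))
    return model.deploy(accept_eula=True)


# "2.0.0" is the version suggested by the log output above.
kwargs = jumpstart_kwargs("meta-textgeneration-llama-codellama-7b", "2.0.0")
print(kwargs)
```

Calling `deploy_pinned(...)` instead of the unpinned deployment should avoid the wildcard-version warning, at the cost of having to bump the version manually later.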

In [5]:
# Get the name of the endpoint. The `endpoint` attribute was renamed to
# `endpoint_name` in sagemaker>=2; see https://sagemaker.readthedocs.io/en/stable/v2.html
endpoint_name = str(predictor.endpoint_name)

print(endpoint_name)


meta-textgeneration-llama-codellama-7b-2024-02-05-20-09-07-925


In [6]:
def query_endpoint(payload):
    """Send a JSON payload to the deployed endpoint and return the parsed response."""
    client = boto3.client('sagemaker-runtime')
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType='application/json',
        Body=json.dumps(payload).encode('utf-8'),
        CustomAttributes="accept_eula=true",
    )
    # The response body is a stream containing UTF-8 encoded JSON
    response = response["Body"].read().decode("utf8")
    response = json.loads(response)
    return response

### Supported parameters

***
This model supports many parameters while performing inference. They include:

* **max_length:** Model generates text until the output length (which includes the input context length) reaches `max_length`. If specified, it must be a positive integer.
* **max_new_tokens:** Model generates text until the output length (excluding the input context length) reaches `max_new_tokens`. If specified, it must be a positive integer.
* **num_beams:** Number of beams used in the beam search. If specified, it must be an integer greater than or equal to `num_return_sequences`.
* **no_repeat_ngram_size:** Model ensures that a sequence of words of `no_repeat_ngram_size` is not repeated in the output sequence. If specified, it must be a positive integer greater than 1.
* **temperature:** Controls the randomness in the output. Higher temperature results in output sequence with low-probability words and lower temperature results in output sequence with high-probability words. If `temperature` -> 0, it results in greedy decoding. If specified, it must be a positive float.
* **early_stopping:** If True, text generation is finished when all beam hypotheses reach the end of sentence token. If specified, it must be boolean.
* **do_sample:** If True, sample the next word as per the likelihood. If specified, it must be boolean.
* **top_k:** In each step of text generation, sample from only the `top_k` most likely words. If specified, it must be a positive integer.
* **top_p:** In each step of text generation, sample from the smallest possible set of words with cumulative probability `top_p`. If specified, it must be a float between 0 and 1.
* **return_full_text:** If True, input text will be part of the output generated text. If specified, it must be boolean. The default value for it is False.
* **stop:** If specified, it must be a list of strings. Text generation stops if any one of the specified strings is generated.

We may specify any subset of the parameters mentioned above while invoking an endpoint. Next, we show an example of how to invoke the endpoint with these arguments.
***
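
Before sending a request, it can help to check parameters against the constraints listed above. The following is a minimal sketch under that assumption; `build_payload` and `VALIDATORS` are hypothetical names, only a subset of the parameters is checked, and the endpoint itself remains the authority on what it accepts.

```python
import json


def _is_number(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _is_positive_int(value) -> bool:
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


def _is_bool(value) -> bool:
    return isinstance(value, bool)


# Client-side checks mirroring some of the constraints in the list above.
VALIDATORS = {
    "max_length": _is_positive_int,
    "max_new_tokens": _is_positive_int,
    "top_k": _is_positive_int,
    "temperature": lambda v: _is_number(v) and v > 0,
    "top_p": lambda v: _is_number(v) and 0 <= v <= 1,
    "do_sample": _is_bool,
    "early_stopping": _is_bool,
    "return_full_text": _is_bool,
    "stop": lambda v: isinstance(v, list) and all(isinstance(s, str) for s in v),
}


def build_payload(prompt: str, **parameters) -> dict:
    """Assemble a request payload; unknown parameters pass through unchecked."""
    for name, value in parameters.items():
        validator = VALIDATORS.get(name)
        if validator is not None and not validator(value):
            raise ValueError(f"invalid value for {name!r}: {value!r}")
    return {"inputs": prompt, "parameters": dict(parameters)}


payload = build_payload("def fib(n):", max_new_tokens=64, temperature=0.2, top_p=0.9)
print(json.dumps(payload))
```

The resulting dictionary has the same shape as the payloads used in the cells that follow, so it could be passed to `query_endpoint` or `predictor.predict` unchanged.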

## Code completion without context
***
This section demonstrates how to perform code generation where the expected endpoint response is the natural continuation of the prompt. No context is provided to the model. As seen below, the LLM hallucinates when continuing the code because it has not been trained on the library used in the test.
***

In [7]:
def print_completion(prompt: str, response: list) -> None:
    bold, unbold = '\033[1m', '\033[0m'
    print(f"{bold}> Input{unbold}\n{prompt}{bold}\n> Output{unbold}\n{response[0]['generated_text']}\n")

In [8]:
%%time

prompt = """\
import sagemaker

# Create an HTML page about Amazon SageMaker
html_content = f'''
<!DOCTYPE html>
<html>
<head>
    <title>Amazon SageMaker</title>
</head>
<body>
    <h1>Welcome to Amazon SageMaker</h1>
    <p>Amazon SageMaker is a fully managed service for building, training, and deploying machine learning models.</p>
    <h2>Key Features</h2>
    <ul>
        <li>Easy to use</li>
        <li>Scalable</li>
        <li>End-to-end machine learning workflow</li>
    </ul>
    <p>Get started with SageMaker today and unlock the power of machine learning!</p>
</body>
</html>
'''

html_content
"""

payload = {"inputs": prompt, "parameters": {"max_new_tokens": 256, "temperature": 0.2, "top_p": 0.9}}
response = query_endpoint(payload)
print_completion(prompt, response)

> Input
import sagemaker

# Create an HTML page about Amazon SageMaker
html_content = f'''
<!DOCTYPE html>
<html>
<head>
    <title>Amazon SageMaker</title>
</head>
<body>
    <h1>Welcome to Amazon SageMaker</h1>
    <p>Amazon SageMaker is a fully managed service for building, training, and deploying machine learning models.</p>
    <h2>Key Features</h2>
    <ul>
        <li>Easy to use</li>
        <li>Scalable</li>
        <li>End-to-end machine learning workflow</li>
    </ul>
    <p>Get started with SageMaker today and unlock the power of machine learning!</p>
</body>
</html>
'''

html_content
> Output

# Create a SageMaker client
sagemaker_client = sagemaker.SageMakerClient()

# Create a SageMaker model
sagemaker_model = sagemaker.Model(
    model_data='s3://sagemaker-us-east-1-123456789012/model/model.tar.gz',
    image='123456789012.dkr.ecr.us-east-1.amazonaws.com/sagemaker-scikit-learn:latest',
    role='arn:aws:iam::123456789012:role/service-role/AmazonSageMak
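
One lightweight way to spot this kind of hallucination is to check whether the attributes a completion accesses on a module actually exist, for example whether a name such as `SageMakerClient` is really defined in the installed `sagemaker` package. The sketch below uses the standard-library `json` module so that it runs anywhere. The helper name `missing_attributes` is hypothetical, and the approach assumes the code parses (the truncated output above would not) and that the module is referenced under its own name rather than an alias.

```python
import ast
import importlib


def missing_attributes(code: str, module_name: str) -> list:
    """Return names accessed as `module_name.<attr>` that the module does not define."""
    module = importlib.import_module(module_name)
    missing = []
    for node in ast.walk(ast.parse(code)):
        if (
            isinstance(node, ast.Attribute)
            and isinstance(node.value, ast.Name)
            and node.value.id == module_name
            and not hasattr(module, node.attr)
            and node.attr not in missing
        ):
            missing.append(node.attr)
    return missing


# `json.JsonClient` is a made-up name standing in for a hallucinated API.
snippet = "import json\ndata = json.loads('{}')\nclient = json.JsonClient()\n"
print(missing_attributes(snippet, "json"))
```

A check like this only catches missing top-level names; wrong argument names or submodules that have not been imported yet would need a different approach.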

## Code completion
***
The examples in this section demonstrate how to perform code generation where the expected endpoint response is the natural continuation of the prompt.
***

In [10]:
def print_completion(prompt: str, response: list) -> None:
    bold, unbold = '\033[1m', '\033[0m'
    print(f"{bold}> Input{unbold}\n{prompt}{bold}\n> Output{unbold}\n{response[0]['generated_text']}\n")

In [11]:
%%time

prompt = """\
import socket

def ping_exponential_backoff(host: str):\
"""

payload = {"inputs": prompt, "parameters": {"max_new_tokens": 256, "temperature": 0.2, "top_p": 0.9}}
response = predictor.predict(payload, custom_attributes='accept_eula=true')
print_completion(prompt, response)

> Input
import socket

def ping_exponential_backoff(host: str):[1m
> Output[0m

    """
    Ping a host with exponential backoff.
    :param host: host to ping
    :return: True if host is reachable, False otherwise
    """
    # TODO: Implement exponential backoff
    # Hint: Use the socket module
    # Hint: Use the time module
    # Hint: Use the random module
    # Hint: Use the logging module
    # Hint: Use the sys module
    # Hint: Use the traceback module
    # Hint: Use the threading module
    # Hint: Use the multiprocessing module
    # Hint: Use the queue module
    # Hint: Use the signal module
    # Hint: Use the subprocess module
    # Hint: Use the os module
    # Hint: Use the re module
    # Hint: Use the stat module
    # Hint: Use the shutil module
    # Hint: Use the pathlib module
    # Hint: Use the glob module
    # Hint: Use the fnmatch module
    # Hint: Use the tempfile module
    # Hint:

CPU times: user 17.6 ms, sys: 228 µs, total: 17.8 ms
Wall t
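
The completion above degenerates into a run of hint comments until the `max_new_tokens` budget is used up. The `stop` parameter described earlier is one way to end generation sooner. As a rough client-side equivalent, the sketch below trims a generated string at the first stop string. `truncate_at_stop` is a hypothetical helper and the stop string is only illustrative; how the endpoint itself treats the stop string in its output is not covered here.

```python
def truncate_at_stop(text: str, stop: list) -> str:
    """Cut `text` at the earliest occurrence of any non-empty stop string."""
    cut = len(text)
    for marker in stop:
        if not marker:
            continue  # an empty marker would match at position 0
        index = text.find(marker)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


generated = (
    "    # TODO: Implement exponential backoff\n"
    "    # Hint: Use the socket module\n"
    "    # Hint: Use the time module\n"
)
trimmed = truncate_at_stop(generated, ["\n    # Hint:"])
print(trimmed)
```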

In [12]:
%%time

prompt = """\
import argparse

def main(string: str):
    print(string)
    print(string[::-1])

if __name__ == "__main__":\
"""

payload = {"inputs": prompt, "parameters": {"max_new_tokens": 256, "temperature": 0.2, "top_p": 0.9}}
response = predictor.predict(payload, custom_attributes='accept_eula=true')
print_completion(prompt, response)

> Input
import argparse

def main(string: str):
    print(string)
    print(string[::-1])

if __name__ == "__main__":
> Output

    parser = argparse.ArgumentParser(description='Reverse a string')
    parser.add_argument('string', help='String to reverse')
    args = parser.parse_args()
    main(args.string)


CPU times: user 4.57 ms, sys: 170 µs, total: 4.74 ms
Wall time: 1.68 s


## Code infilling
***
The examples in this section demonstrate how to perform code generation where the expected endpoint response infills text between a prefix and a suffix. Only the 7B, 7B-Instruct, 13B, and 13B-Instruct models have this capability; anecdotally, the non-instruct models have been observed to perform best.
***

In [13]:
def format_infilling(prompt: str) -> str:
    prefix, suffix = prompt.split("<FILL>")
    return f"<PRE> {prefix} <SUF>{suffix} <MID>"


def print_infilling(prompt: str, response: list) -> None:
    green, font_reset = "\x1b[38;5;2m", "\x1b[0m"
    prefix, suffix = prompt.split("<FILL>")
    print(f"{prefix}{green}{response[0]['generated_text']}{font_reset}{suffix}")
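
The formatter above rearranges the prompt into prefix, suffix, and a trailing `<MID>` token, after which the model is expected to generate the missing middle. Because it unpacks the result of `split`, a prompt with zero or several `<FILL>` markers fails with an unpacking error that does not name the cause. A minimal sketch of a stricter variant follows; `format_infilling_checked` and `FILL_MARKER` are hypothetical names, and the output template is copied from `format_infilling`.

```python
FILL_MARKER = "<FILL>"


def format_infilling_checked(prompt: str) -> str:
    """Like format_infilling, but with an explicit error for a bad marker count."""
    count = prompt.count(FILL_MARKER)
    if count != 1:
        raise ValueError(f"expected exactly one {FILL_MARKER} marker, found {count}")
    prefix, _, suffix = prompt.partition(FILL_MARKER)
    return f"<PRE> {prefix} <SUF>{suffix} <MID>"


example = "def add(a, b):\n    <FILL>\n    return result\n"
formatted = format_infilling_checked(example)
print(repr(formatted))
```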

In [14]:
%%time

prompt = '''\
def remove_non_ascii(s: str) -> str:
    """<FILL>
    return result
'''
prompt_formatted = format_infilling(prompt)
payload = {
    "inputs": prompt_formatted,
    "parameters": {"max_new_tokens": 256, "temperature": 0.05, "top_p": 0.9}
}
response = predictor.predict(payload, custom_attributes='accept_eula=true')
print_infilling(prompt, response)

def remove_non_ascii(s: str) -> str:
    """
    Remove non-ASCII characters from a string.

    :param s: The string to remove non-ASCII characters from.
    :return: The string with non-ASCII characters removed.
    """
    result = ""
    for c in s:
        if ord(c) < 128:
            result += c[0m
    return result

CPU times: user 3.64 ms, sys: 1.52 ms, total: 5.16 ms
Wall time: 2.51 s


In [15]:
%%time

prompt = """\
# Installation instructions:
    ```bash
<FILL>
    ```
This downloads the LLaMA inference code and installs the repository as a local pip package.
"""
prompt_formatted = format_infilling(prompt)
payload = {
    "inputs": prompt_formatted,
    "parameters": {"max_new_tokens": 256, "temperature": 0.05, "top_p": 0.9}
}
response = predictor.predict(payload, custom_attributes='accept_eula=true')
print_infilling(prompt, response)

# Installation instructions:
    ```bash
  git clone https://github.com/LLaMA-AI/LLaMA-inference.git
    cd LLaMA-inference
    pip install -e .[0m
    ```
This downloads the LLaMA inference code and installs the repository as a local pip package.

CPU times: user 1.53 ms, sys: 3.53 ms, total: 5.06 ms
Wall time: 1.3 s


In [16]:
%%time

prompt = """\
class InterfaceManagerFactory(AbstractManagerFactory):
    def __init__(<FILL>
def main():
    factory = InterfaceManagerFactory(start=datetime.now())
    managers = []
    for i in range(10):
        managers.append(factory.build(id=i))
"""
prompt_formatted = format_infilling(prompt)
payload = {
    "inputs": prompt_formatted,
    "parameters": {"max_new_tokens": 256, "temperature": 0.05, "top_p": 0.9}
}
response = predictor.predict(payload, custom_attributes='accept_eula=true')
print_infilling(prompt, response)

class InterfaceManagerFactory(AbstractManagerFactory):
    def __init__([38;5;2mself, start):
        self.start = start

    def build(self, id):
        return InterfaceManager(id, self.start)

def main():
    factory = InterfaceManagerFactory(start=datetime.now())
    managers = []
    for i in range(10):
        managers.append(factory.build(id=i))

CPU times: user 4.32 ms, sys: 2.58 ms, total: 6.91 ms
Wall time: 1.19 s


In [17]:
%%time

prompt = """\
/-- A quasi-prefunctoid is 1-connected iff all its etalisations are 1-connected. -/
theorem connected_iff_etalisation [C D : precategoroid] (P : quasi_prefunctoid C D) :
  π₁ P = 0 ↔ <FILL> = 0 :=
begin
  split,
  { intros h f,
    rw pi_1_etalisation at h,
    simp [h],
    refl
  },
  { intro h,
    have := @quasi_adjoint C D P,
    simp [←pi_1_etalisation, this, h],
    refl
  }
end
"""
prompt_formatted = format_infilling(prompt)
payload = {
    "inputs": prompt_formatted,
    "parameters": {"max_new_tokens": 256, "temperature": 0.05, "top_p": 0.9}
}
response = predictor.predict(payload, custom_attributes='accept_eula=true')
print_infilling(prompt, response)

/-- A quasi-prefunctoid is 1-connected iff all its etalisations are 1-connected. -/
theorem connected_iff_etalisation [C D : precategoroid] (P : quasi_prefunctoid C D) :
  π₁ P = 0 ↔ [38;5;2m∀ f : C ⟶ D, π₁ (P.etalise f)[0m = 0 :=
begin
  split,
  { intros h f,
    rw pi_1_etalisation at h,
    simp [h],
    refl
  },
  { intro h,
    have := @quasi_adjoint C D P,
    simp [←pi_1_etalisation, this, h],
    refl
  }
end

CPU times: user 5.68 ms, sys: 0 ns, total: 5.68 ms
Wall time: 668 ms


## Clean up the endpoint
If you plan to run the next lab on customizing the Code Llama model, do not delete the endpoint. Otherwise, delete the endpoint by running the next cell.

In [18]:
predictor.delete_endpoint()
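
The choice between keeping and deleting the endpoint can be made explicit with a small guard. This is a sketch only: `cleanup` and `FakePredictor` are hypothetical names, and the stand-in class exists purely to show the call order without AWS access. With a real predictor, `delete_model()` is typically called before `delete_endpoint()`.

```python
def cleanup(predictor, keep_endpoint: bool = False) -> bool:
    """Delete the model and endpoint unless the caller wants to keep them."""
    if keep_endpoint:
        return False
    predictor.delete_model()
    predictor.delete_endpoint()
    return True


class FakePredictor:
    """Stand-in used only to demonstrate the call order without AWS access."""

    def __init__(self):
        self.calls = []

    def delete_model(self):
        self.calls.append("delete_model")

    def delete_endpoint(self):
        self.calls.append("delete_endpoint")


fake = FakePredictor()
cleanup(fake)
print(fake.calls)
```

Passing `keep_endpoint=True` leaves everything in place, which matches the case where the next lab reuses the endpoint.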