# StarCoder with the Hugging Face Text Generation Inference Server

This notebook shows you how to write a client that uses StarCoder with the 
Hugging Face [Text Generation Inference Server](https://github.com/huggingface/text-generation-inference). 
There is nothing StarCoder-specific here. This code should work with any
served model.

In [2]:
from text_generation import Client
# This will help with "batching".
from concurrent.futures import ThreadPoolExecutor
from tqdm import tqdm

Set the URL of the server below.

In [3]:
URL = "http://starcoder.tenant-85080e-nuprl.coreweave.cloud"
client = Client(URL)

This is basic code to generate a completion from a single prompt. The
`print_by_line` helper shows output line by line as it streams in
(token-by-token display in the style of ChatGPT does not seem to be possible
in a notebook). The helper is optional.

In [4]:
def print_by_line(previous_text: str, new_text: str):
    """
    A little hack to print line-by-line in a Notebook. We receive results
    a few tokens at a time. This buffers output until a newline, so that
    we do not print partial lines.
    """
    if "\n" not in new_text:
        return
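    # Part of the current line that arrived earlier but is not yet printed.
    pending = previous_text[previous_text.rfind("\n") + 1:]
    # Print only through the last newline in new_text. Anything after it
    # stays buffered; otherwise it would be printed twice.
    print(pending + new_text[:new_text.rfind("\n") + 1], end="")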

DEFAULT_STOP_SEQUENCES = [ "\ndef", "\nclass", "\nif"  ]

def generate(prompt: str,
    max_new_tokens=512,
    stop_sequences=DEFAULT_STOP_SEQUENCES,
    echo=True):
    text = ""
    for response in client.generate_stream(prompt,
        max_new_tokens=max_new_tokens,
        temperature=0.2,
        top_p=0.95,
        stop_sequences=stop_sequences):
        if not response.token.special:
            if echo:
                print_by_line(text, response.token.text)
            text += response.token.text
    if echo:
        print_by_line(text, "\n") # flush any remaining text
    return text
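
To see the line-buffering idea in isolation, here is a minimal sketch that replays a made-up token stream without contacting a server. The names `flush_complete_lines` and `replay` are hypothetical and not part of this notebook or of the `text_generation` library; the helper returns the completed lines instead of printing them so that the result is easy to inspect.

```python
def flush_complete_lines(previous_text: str, new_text: str) -> str:
    """Return the text that became printable once new_text arrived.

    Hypothetical helper: the same buffering idea as print_by_line, but it
    returns the completed lines instead of printing them.
    """
    if "\n" not in new_text:
        return ""  # still in the middle of a line; keep buffering
    # Part of the current line that arrived in earlier chunks.
    pending = previous_text[previous_text.rfind("\n") + 1:]
    # Emit only through the last newline in the new chunk.
    return pending + new_text[:new_text.rfind("\n") + 1]


def replay(chunks):
    """Feed simulated token chunks through the buffer and collect the output."""
    text, emitted = "", []
    for chunk in chunks:
        out = flush_complete_lines(text, chunk)
        if out:
            emitted.append(out)
        text += chunk
    return text, emitted


# Made-up chunks standing in for successive response.token.text values.
text, emitted = replay(["def f", "(x):\n", "    return", " x\n"])
print(emitted)
```

Each emitted item is a whole line, even though every line arrived split across two chunks.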

The `generate` function blocks until it receives a completion. If you have
a batch of several prompts, calling it on one prompt at a time may grossly
underutilize your GPU. The `text-generation-inference` server does not
natively support batch input, but it does create batches behind the scenes
to process concurrent requests efficiently. So, a single client can achieve
the same effect as batching by issuing several concurrent requests.

In [5]:
def generate_concurrent(
    prompts: list,
    max_new_tokens: int,
    max_workers: int,
    stop_sequences=DEFAULT_STOP_SEQUENCES):
    with ThreadPoolExecutor(max_workers=max_workers) as ctx:
        return list(tqdm(
            ctx.map(lambda prompt: generate(prompt, max_new_tokens, stop_sequences, echo=False), prompts),
            desc="Completions",
            total=len(prompts)))
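
One property worth checking is that the executor's `map` returns results in the order of the inputs, not in the order the requests finish, so the completions line up with the prompts. The sketch below illustrates this with `fake_generate`, a hypothetical stand-in that sleeps instead of calling a server; the prompts and delays are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_generate(prompt: str) -> str:
    """Stand-in for generate(): sleeps briefly instead of calling a server."""
    # Shorter prompts sleep longer, so requests tend to finish in the
    # reverse of the order in which they were submitted.
    time.sleep(0.01 * (5 - len(prompt)))
    return prompt.upper()


prompts = ["a", "bb", "ccc", "dddd"]
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() yields results in input order, regardless of completion order.
    results = list(pool.map(fake_generate, prompts))

for prompt, completion in zip(prompts, results):
    print(prompt, "->", completion)
```

This is why `generate_concurrent` can return a plain list: the i-th result belongs to the i-th prompt.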

## Examples

### Generating a Web Page

In [6]:
html_result = generate(
    "<html><!-- A  HTML page for a search engine, with a simple text box in the center -->",
    stop_sequences=["</html>"])   


<head>
<title>Search Engine</title>
<meta name="description" content="A search engine for the web">
<meta name="keywords" content="search engine, web search, google, yahoo, bing">
<meta name="author" content="<NAME>">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="stylesheet" href="style.css">
</head>
<body>
<div id="container">
<h1>Search Engine</h1>
<form action="search.php" method="get">
<input type="text" name="q" placeholder="Search...">
<input type="submit" value="Search">
</form>
<p>This is a search engine for the web. It is a simple PHP script that uses the Google, Yahoo and Bing search engines to search the web. It is a simple search engine that can be used for educational purposes.</p>
<p>This search engine is not affiliated with Google, Yahoo or Bing. It is a simple search engine that can be used for educational purposes.</p>
<p>This search engine is not affiliated with Google, Yahoo or Bing. It is a simple search engine that can be used f

### Batching

In the cell below, `fast_results` completes much faster than `slow_results` because
it uses more concurrent workers (20 instead of 5). There is no point in using more
workers than the maximum batch size supported by the server.

In [7]:
fast_results = generate_concurrent(["# Write factorial", "# Write fib"] * 10,
    max_new_tokens=512,
    max_workers=20,
    stop_sequences=[])
slow_results = generate_concurrent(
    ["# Write factorial", "# Write fib"] * 10,
    max_new_tokens=512,
    max_workers=5,
    stop_sequences=[])

Completions: 100%|██████████| 20/20 [00:19<00:00,  1.02it/s]
Completions: 100%|██████████| 20/20 [00:56<00:00,  2.80s/it]
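
The remark about the server's maximum batch size can be illustrated with a toy model. The sketch below is an assumption-laden simulation, not a description of how `text-generation-inference` is implemented: `FakeServer`, `MAX_BATCH`, and `peak_concurrency` are hypothetical names, and the limit of 4 is made up. It records how many requests are actually processed at the same time for different worker counts.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_BATCH = 4  # assumed server-side batch limit, for illustration only


class FakeServer:
    """Toy server that processes at most max_batch requests at once."""

    def __init__(self, max_batch: int):
        self._slots = threading.Semaphore(max_batch)
        self._lock = threading.Lock()
        self._active = 0
        self.peak = 0  # highest number of requests processed simultaneously

    def handle(self, prompt: str) -> str:
        with self._slots:  # extra requests wait here for a free slot
            with self._lock:
                self._active += 1
                self.peak = max(self.peak, self._active)
            time.sleep(0.02)  # pretend to generate
            with self._lock:
                self._active -= 1
        return prompt + " done"


def peak_concurrency(max_workers: int, n_prompts: int = 16) -> int:
    """Send n_prompts requests with max_workers threads; return the peak load."""
    server = FakeServer(MAX_BATCH)
    prompts = [f"p{i}" for i in range(n_prompts)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(server.handle, prompts))
    return server.peak


for workers in (2, 4, 16):
    print(workers, "workers -> peak in flight:", peak_concurrency(workers))
```

In this model the peak never exceeds the smaller of the worker count and the batch limit, so workers beyond the limit only wait in a queue.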


### Chat Mode

This is a hack. Chat will only run for about 1000 tokens, but you can change that by editing the `CHAT_PROMPT` below.

In [8]:
import requests
CHAT_PROMPT = requests.get("https://gist.githubusercontent.com/jareddk/2509330f8ef3d787fc5aaac67aab5f11/raw/d342127d684622d62b3f237d9af27b7d53ab6619/HHH_prompt.txt").text

Rerun the cell below to reset the chat state. Note that `generate` should ideally be configured to sample differently for chat: we want a repetition penalty and probably a higher temperature.

In [9]:
chat_state = CHAT_PROMPT + "\n"
def send(message):
    global chat_state
    message_to_send = "\nHuman:  " + message + "\n\nAssistant:"
    result = generate(chat_state + message_to_send, 
        max_new_tokens=256, 
        stop_sequences=["Human:", "-----"])
    if result.endswith("Human:"):
        result = result[:-len("Human:")]
    elif result.endswith("-----"):
        result = result[:-len("-----")]
    else:
        print("<stopped early>")
    chat_state += message_to_send + result
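
The note above about sampling differently for chat could be addressed with a variant of `generate` along the following lines. This is a sketch, not a tuned configuration: `generate_chat` is a hypothetical name, and the temperature of 0.7 and repetition penalty of 1.2 are assumed values that were not tested against StarCoder. It relies on the `repetition_penalty` parameter of the client's `generate_stream` method and takes the client as an argument so that it does not depend on a global.

```python
def generate_chat(client, prompt: str, max_new_tokens: int = 256,
                  stop_sequences=("Human:", "-----")) -> str:
    """Like generate(), but with sampling settings aimed at chat."""
    text = ""
    for response in client.generate_stream(
            prompt,
            max_new_tokens=max_new_tokens,
            temperature=0.7,         # assumed value; higher than the 0.2 used for code
            top_p=0.95,
            repetition_penalty=1.2,  # assumed value; discourages repeated phrases
            stop_sequences=list(stop_sequences)):
        # Skip special tokens such as end-of-sequence markers.
        if not response.token.special:
            text += response.token.text
    return text
```

Inside `send`, the call to `generate(...)` could then be swapped for `generate_chat(client, chat_state + message_to_send)`; the rest of the function would stay the same.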

In [10]:
send("Please write the factorial function in Haskell.")

  Sure, here’s one example:

factorial :: Integer -> Integer
factorial 0 = 1
factorial n = n * factorial (n - 1)

-----


In [11]:
send("Write a web server in Python.")

  Sure, here’s one example:

from flask import Flask
app = Flask(__name__)

@app.route("/")
def hello():
       return "Hello World!"

if __name__ == "__main__":
       app.run()

-----
