## Info

This notebook contains several demos showing how to use the ghostwheel inference server. [See the readme](readme.md) for more info.

Note that **you'll need to run this notebook on the internal network (Imperial-WPA)** as the server hosting ghostwheel is not exposed externally.

### Available LLMs

[The ghostwheel API docs](https://ese-timewarp.ese.ic.ac.uk) contain an up-to-date list of valid identifiers for LLMs you can call through ghostwheel. The docs are regenerated with any change to the backend application, so this list stays current whenever new models are deployed. In addition to the Ollama-compatible endpoints, we also provide an `api/list_models` endpoint, should you want to programmatically determine which LLMs are available to you.

For more information describing parameters for the completion endpoints and response specifications, [check out the Ollama API docs on GitHub](https://github.com/ollama/ollama/blob/main/docs/api.md).
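
As a rough sketch of the programmatic route, the snippet below queries `api/list_models` using only the standard library. It assumes the endpoint accepts a plain `GET` request, authenticates with the same `X-API-Key` header used in the rest of this notebook, and returns JSON. None of that is confirmed here, so check the API docs for the actual method and response shape. The helper names `build_list_models_request` and `list_models` are hypothetical.

```python
import json
import urllib.request


def build_list_models_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a request for the list of available models."""
    url = f"{base_url.rstrip('/')}/api/list_models"
    # Assumption: the endpoint is a GET authenticated with the X-API-Key header
    return urllib.request.Request(url, headers={'X-API-Key': api_key}, method='GET')


def list_models(base_url: str, api_key: str, timeout: float = 30.0):
    """Send the request and return the decoded JSON body unchanged."""
    req = build_list_models_request(base_url, api_key)
    with urllib.request.urlopen(req, timeout=timeout) as res:
        return json.load(res)
```

Calling `list_models(GHOSTWHEEL_BASE_URL, GHOSTWHEEL_API_KEY)` with the constants from the common setup section returns whatever structure the server sends back; the sketch deliberately makes no assumption about its fields.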


## Common setup

We define two global constants here that we'll use throughout the notebook, namely, for the ghostwheel base URL and for our API key. In practice you should store your key as an environment variable (or otherwise outside of your codebase) and load it dynamically.

In [1]:
# Define the base URL and API key for ghostwheel
GHOSTWHEEL_BASE_URL = "https://ese-timewarp.ese.ic.ac.uk"
GHOSTWHEEL_API_KEY = "{your api key here}"
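
One possible way to load the key dynamically is sketched below. The environment variable name `GHOSTWHEEL_API_KEY` and the helper `load_api_key` are assumptions made for illustration; use whatever name fits your deployment.

```python
import os


def load_api_key(env_var: str = "GHOSTWHEEL_API_KEY", environ=None) -> str:
    """Return the API key stored in `env_var`, or raise if it is missing."""
    # `environ` can be overridden (e.g. in tests); defaults to the real environment
    environ = os.environ if environ is None else environ
    key = environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your ghostwheel API key"
        )
    return key
```

With the variable exported in your shell, `GHOSTWHEEL_API_KEY = load_api_key()` can replace the hard-coded placeholder above.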

## Using ghostwheel as a REST API

The most direct (and arguably least convenient) way to use ghostwheel is with REST API calls. This has the advantage of working with practically any language and on any platform. Below is an example of calling the `api/generate` endpoint from Python, though if you're using Python for your application, we'll see in a moment that there are a few alternatives for communication with ghostwheel that are far more convenient.

### Setup

We'll install the requests package to make our API calls.

In [2]:
# Install the requests library
%pip install -q requests

Note: you may need to restart the kernel to use updated packages.


In [12]:
import requests
from pprint import pprint

# Generates a response from the inference server
def generate(model: str, prompt: str) -> dict:
    url = f"{GHOSTWHEEL_BASE_URL}/api/generate"
    res = requests.post(
        url,
        json={
            'model': model,
            'prompt': prompt,
            'stream': False,
        },
        headers={
            'Content-Type': "application/json", # Must specify JSON content type
            'X-API-Key': GHOSTWHEEL_API_KEY,
        },
    )
    res.raise_for_status()
    return res.json()


# Get a response from the model
res = generate(
    'llama3:70b-instruct-q8_0',
    "Tell me about the history of the United States of America in 50 words or less.",
)

res['context'] = res['context'][:4]  # Truncate the context so output isn't long as heck
pprint(res) # Pretty print the response in JSON

{'context': [128006, 882, 128007, 271],
 'created_at': '2024-06-12T12:55:03.111223323Z',
 'done': True,
 'done_reason': 'stop',
 'eval_count': 79,
 'eval_duration': 4690938000,
 'load_duration': 6250379,
 'model': 'llama3:70b-instruct-q8_0',
 'prompt_eval_duration': 63489000,
 'response': 'Here is a brief history of the United States:\n'
             '\n'
             'The US was founded in 1776 by European colonists who declared '
             'independence from Britain. The young nation expanded westward, '
             'survived a civil war (1861-1865), and emerged as a global '
             'superpower through industrialization, World War I and II, and '
             'the Cold War, shaping its diverse culture and economy along the '
             'way.',
 'total_duration': 4804171071}
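
The example above sets `'stream': False` to get a single JSON object back. With streaming enabled, the Ollama API returns newline-delimited JSON chunks, each carrying a fragment of text in `response`, with `done` set to true on the last chunk. The sketch below shows one way to consume that from Python, assuming ghostwheel passes the stream through unchanged. The helper names `iter_response_text` and `generate_stream` are hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_response_text(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the text fragment from each newline-delimited JSON chunk."""
    for line in lines:
        if not line:  # Skip blank keep-alive lines
            continue
        chunk = json.loads(line)
        yield chunk.get('response', '')
        if chunk.get('done'):  # The final chunk marks the end of the stream
            break


def generate_stream(base_url: str, api_key: str, model: str, prompt: str) -> Iterator[str]:
    """Stream text fragments from the api/generate endpoint."""
    import requests  # Imported here so the parser above works without requests

    with requests.post(
        f"{base_url}/api/generate",
        json={'model': model, 'prompt': prompt, 'stream': True},
        headers={'X-API-Key': api_key},
        stream=True,  # Tell requests not to buffer the whole body
    ) as res:
        res.raise_for_status()
        yield from iter_response_text(res.iter_lines())
```

Usage would look like `for text in generate_stream(GHOSTWHEEL_BASE_URL, GHOSTWHEEL_API_KEY, model, prompt): print(text, end='')`.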


## Using ghostwheel with the Python ollama package

Direct API calls are portable, sure, but why reinvent the wheel? Ollama provides a convenient API client for Python [in the form of the `ollama` package](https://github.com/ollama/ollama-python), which we can trivially use with ghostwheel as well. Naturally, the endpoints ghostwheel doesn't implement (i.e. those for administration of served models) will not work, so you'll be restricted to `api/generate`, `api/chat` and `api/tags`.

### Setup

In [4]:
# Install the ollama package
%pip install -q ollama

Note: you may need to restart the kernel to use updated packages.


### Basic completions

To use the ollama package with ghostwheel, we just need to set its host parameter and provide our API key in the client's default headers. Note that we *cannot* call `ollama.generate()` or the other module-level functions, since they use a default client that's not configured for ghostwheel. Instead, we initialize an `ollama.Client` instance ourselves and configure it as below:

In [5]:
from ollama import Client as OllamaClient

# Initialize the Ollama client with ghostwheel and provide our API key
client = OllamaClient(
    host=GHOSTWHEEL_BASE_URL,
    headers={
        'X-API-Key': GHOSTWHEEL_API_KEY,
    },
)


# Generate a response from the model
res = client.generate(
    'llama3:70b-instruct-q8_0',
    "Tell me about Fisherman's Wharf in San Francisco in at most 100 words.",
)
print(res['response']) # Print the response

Fisherman's Wharf is a popular tourist destination in San Francisco, California. This bustling waterfront district offers stunning views of the Bay Bridge, Alcatraz Island, and the sea lions at Pier 39. Visitors can sample fresh seafood, browse souvenir shops, and take a stroll along the pier. Street performers and live music add to the lively atmosphere. Take a ferry to Alcatraz or enjoy a sunset cruise from one of the many piers. With its rich maritime history and vibrant energy, Fisherman's Wharf is a must-visit spot in San Francisco.


### Chat completions

In [6]:
from pprint import pprint

# Generate a chat response
res = client.chat(
    'mistral:7b-instruct-v0.3-q8_0',
    [
        {
            'role': "user",
            'content': "What is the capital of France?",
        },
    ],
)

# Pretty print the whole response object
pprint(res)

{'created_at': '2024-06-12T12:53:49.281011657Z',
 'done': True,
 'done_reason': 'stop',
 'eval_count': 8,
 'eval_duration': 241640000,
 'load_duration': 10414176,
 'message': {'content': ' The capital of France is Paris.',
             'role': 'assistant'},
 'model': 'mistral:7b-instruct-v0.3-q8_0',
 'prompt_eval_count': 2,
 'prompt_eval_duration': 42765000,
 'total_duration': 426845674}
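
The example above sends a single user message. As with the upstream Ollama chat API, conversation state is carried entirely in the `messages` list, so a multi-turn chat means resending the history on every call. The helper below is a minimal sketch of that pattern; `chat_turn` is a hypothetical name, and it assumes the response exposes `message` with `role` and `content` as shown in the output above.

```python
def chat_turn(client, model: str, history: list, user_message: str) -> list:
    """Send one user message and return the history extended with the reply."""
    # Build a new list so the caller's history is left untouched
    messages = history + [{'role': "user", 'content': user_message}]
    res = client.chat(model, messages)
    reply = res['message']
    return messages + [{'role': reply['role'], 'content': reply['content']}]
```

Starting from `history = []`, each call such as `history = chat_turn(client, 'mistral:7b-instruct-v0.3-q8_0', history, "And of Spain?")` appends one user message and one assistant reply.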


### Response streaming

We can stream responses easily with the ollama package as well. The `Client` we configured above is synchronous; for asynchronous streaming, use the package's `AsyncClient` (configured the same way) or Langchain, as shown later in this notebook.

In [7]:
# Get a response stream from the model
stream = client.generate(
    'mistral:7b-instruct-v0.3-q8_0',
    "Tell me about Paris in 100 words or less.",
    stream=True,
)

# Iterate over the response stream
for res in stream:
    print(res['response'], end='')  # Print incoming response tokens

 Paris, the City of Light, is renowned for its romantic ambiance, iconic landmarks, and rich artistic heritage. Known as a global center for art, fashion, gastronomy, and culture, it's home to masterpieces like the Eiffel Tower, Louvre Museum, Notre-Dame Cathedral, and more. The Seine River winds through the city, offering picturesque views of its historical architecture. Parisian cuisine, from croissants to haute couture dining, delights visitors worldwide. Its vibrant energy, captured in the twirl of dancers at the Moulin Rouge, encapsulates the essence of a timeless, enchanting city.

## Using ghostwheel with Langchain

Langchain has built-in support for using Ollama models for text and chat completion; we can take advantage of these integrations to use ghostwheel as its backend. All that's required is to point an `Ollama` instance (or `ChatOllama`, `OllamaEmbeddings` etc. as needed) at ghostwheel with the `base_url` parameter, and provide our API key in the request headers (Langchain allows us to define headers for its API calls when constructing an instance of these classes).

### Setup

To get ghostwheel running behind Langchain, we only need the `langchain` and `langchain-community` packages:

In [8]:
# Install required packages
%pip install -q \
    langchain \
    langchain-community

Note: you may need to restart the kernel to use updated packages.


### Basic invocation

In [14]:
from langchain_community.llms import Ollama  # Provided by the langchain-community package
from IPython.display import display, Markdown

llm = Ollama(
    model='llama3:70b-instruct-q8_0', # Any of the available models listed in the API docs
    base_url=GHOSTWHEEL_BASE_URL,
    headers={
        'X-API-Key': GHOSTWHEEL_API_KEY,
    },
)

# Get a response from the model
prompt = "Write a function that takes a list of integers and returns the sum of all the integers in the list."
res = llm.invoke(prompt)

# Print the response
display(Markdown(res))

Here is a simple function that does that:
```
def sum_list(numbers):
    return sum(numbers)
```
Explanation:

* The `sum` function is a built-in Python function that takes an iterable (such as a list) as an argument and returns the sum of all its elements.
* In this case, of a list of integers, we simply pass the list to the `sum` function and it will return the sum of all the integers in the list.

Example usage:
```
numbers = [1, 2, 3, 4, 5]
result = sum_list(numbers)
print(result)  # Output: 15
```
Note that this function assumes that the input list only contains integers. If the list can contain other types of elements, you may want to add error handling or type checking to ensure that the function behaves correctly.

### Response streaming

We can stream the response asynchronously and output tokens as they are received from the server.

In [10]:
prompt = "Tell me about the history of the United States in 100 words."

# Asynchronously stream tokens from the model
async for token in llm.astream(prompt):
    # Print each token as it's received (without the newline delimiter)
    print(token, end='')

Here is a brief 100-word history of the United States:

The United States declared independence from Britain in 1776 and fought a Revolutionary War to gain freedom. The young nation expanded westward, acquiring land through the Louisiana Purchase and Mexican-American War. Civil War (1861-1865) ended slavery, but racial tensions persisted. Industrialization and immigration shaped the country in the late 1800s and early 1900s. The US emerged as a world power after World Wars I and II, playing a key role in shaping global politics during the Cold War era. Social movements of the 1960s, including Civil Rights and Feminist movements, continued to shape American society.

### Embeddings

We can use ghostwheel for embeddings as well. First we'll install some requirements for document loading (`pypdf`) and for RAG further down (`langchain-chroma` to use as a vector database).

In [15]:
# Install requirements for document loading & RAG
%pip install -q \
    langchain-chroma \
    pypdf

Note: you may need to restart the kernel to use updated packages.


In [18]:
from langchain_community.embeddings import OllamaEmbeddings

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Initialize the embeddings model
emb = OllamaEmbeddings(
    model='nomic-embed-text:137m-v1.5-fp16',
    base_url=GHOSTWHEEL_BASE_URL,
    headers={
        'X-API-Key': GHOSTWHEEL_API_KEY,
    },
)

# Load a PDF document and split into chunks
# See https://github.com/ollama/ollama/blob/main/examples/langchain-python-rag-document/main.py
loader = PyPDFLoader("./assets/future_of_chipmaking.pdf")
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
docs = loader.load()
doc_chunks = splitter.split_documents(docs)

# Print the first and last chunks
print(f"Loaded document ({len(doc_chunks)} chunks):\n")
print(f"{doc_chunks[0].page_content}\n\n...\n\n{doc_chunks[-1].page_content}")

Loaded document (22 chunks):

Future-of-chipmaking
TWO YEARS shy of its 60th birthday , Moore’ s law has become a bit like Schrödinger ’s hypothetical cat—at once dead and alive. In
1965 Gordon Moore, one of the co-founders of Intel, observed that the number of transistors—a type of electronic component—
that could be crammed onto a microchip was doubling every 12 months, a figure he later revised to every two years.

...

At some point, the day will arrive when no amount of clever technology can shrink transistors still further (it is hard to see, for
instance, how one could be built with less than an atom’ s worth of stuf f). As Moore himself warned in 2003, “no exponential is for
ever.” But, he told the assembled engineers, “your job is delaying for ever”. Chipmakers have done an admirable job of that in the
two decades since he spoke. And they have at least sketched out a path for the next two decades, too.
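
The cell above only prepares the document chunks. To look at the embeddings themselves, Langchain's embeddings interface provides `embed_query` for a single string and `embed_documents` for a list of strings, each returning vectors as lists of floats. The sketch below pairs those calls with a small cosine-similarity helper; `cosine_similarity` is a hypothetical helper written for this illustration, and the commented usage needs access to the server.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("Vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


# Usage with `emb` and `doc_chunks` from the cell above (requires the server):
# query_vec = emb.embed_query("How often does transistor density double?")
# chunk_vecs = emb.embed_documents([c.page_content for c in doc_chunks[:3]])
# for chunk, vec in zip(doc_chunks, chunk_vecs):
#     print(round(cosine_similarity(query_vec, vec), 3), chunk.page_content[:60])
```

Chunks that score closer to 1 are the ones a vector database would rank as most relevant to the query.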


### RAG

In [19]:
from langchain_chroma import Chroma  # Provided by the langchain-chroma package installed above
from langchain_core.prompts import ChatPromptTemplate

from langchain_community.chat_models import ChatOllama
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains.retrieval import create_retrieval_chain

# Load the document chunks into a vector database
db = Chroma.from_documents(documents=doc_chunks, embedding=emb)
retriever = db.as_retriever()

# Set up a chain for RAG question/answers
sys_prompt = """
You are an assistant for question-answering tasks. Use the following \
pieces of retrieved context to answer the question. If you don't know \
the answer, just say that you don't know. Use three sentences \
maximum and keep the answer concise.

Context: {context}
"""
rag_prompt = ChatPromptTemplate.from_messages([
    ('system', sys_prompt),
    ('human', "{input}")
])

# Set up a chat model
rag_model = ChatOllama(
    model='llama3:70b-instruct-q8_0',
    base_url=GHOSTWHEEL_BASE_URL,
    headers={
        'X-API-Key': GHOSTWHEEL_API_KEY,
    },
    temperature=0
)

# Create the RAG chain
docs_chain = create_stuff_documents_chain(rag_model, rag_prompt)
chain = create_retrieval_chain(retriever, docs_chain)

# Retrieves an answer
def query(question: str) -> dict:
    res = chain.invoke({
        'input': question,
    })
    return res

# Ask a question based on the document
res = query("Why are researchers looking into abandoning silicon?")
print(res['answer'])

Researchers are looking into abandoning silicon because at tiny sizes (less than four nanometers), current leakage becomes much worse due to silicon's surface roughness, which hinders electron movement and reduces the transistor's ability to switch on and off properly.
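
The dictionary returned by a chain built with `create_retrieval_chain` also includes the retrieved documents under the `context` key, which is handy for checking what the answer was based on. The helper below is a small sketch for summarizing them; `format_sources` is a hypothetical name, and it assumes the loader stored a `page` entry in each document's metadata.

```python
def format_sources(docs, preview_chars: int = 80) -> list:
    """Return one summary line per retrieved document."""
    lines = []
    for i, doc in enumerate(docs, start=1):
        meta = getattr(doc, 'metadata', None) or {}
        page = meta.get('page', '?')  # Assumption: the loader records a page number
        # Collapse whitespace so each preview fits on one line
        preview = " ".join(doc.page_content.split())[:preview_chars]
        lines.append(f"[{i}] page {page}: {preview}")
    return lines


# Usage with the `res` returned by `query` above:
# for line in format_sources(res['context']):
#     print(line)
```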
