In [None]:
#| hide
from onprem.core import *

# OnPrem

> A tool for running large language models on-premises using non-public data

**OnPrem** is a simple Python package that makes it easier to run large language models (LLMs) on non-public or sensitive data and on machines with no internet connectivity (e.g., behind corporate firewalls). Inspired by the [privateGPT](https://github.com/imartinez/privateGPT) GitHub repo and Simon Willison's [LLM](https://pypi.org/project/llm/) command-line utility, **OnPrem** is designed to help integrate local LLMs into practical applications.

## Install

```sh
pip install onprem
```

For GPU support, see additional instructions below.

## How to use

### Setup

In [None]:
#| notest

import os.path
from onprem import LLM

url = 'https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-GGML/resolve/main/Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin'

llm = LLM(model_name=os.path.basename(url))
llm.download_model(url, ssl_verify=True)  # set to False if a corporate firewall gives you problems

There is already a file Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin in /home/amaiya/onprem_data. Do you want to still download it? (Y/n) Y
[██████████████████████████████████████████████████]
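
The `model_name` passed to `LLM` above is simply the last path component of the download URL. As a minimal sketch (the helper `model_name_from_url` and the `?download=true` suffix are illustrative assumptions, not part of OnPrem), the following derives that name and also handles URLs that carry a query string, which a bare `os.path.basename` would leave attached to the file name:

```python
import os.path
from urllib.parse import urlparse


def model_name_from_url(url: str) -> str:
    """Return the file name at the end of a download URL (hypothetical helper)."""
    # Parse first so that any "?query" or "#fragment" is not treated as part of the name.
    return os.path.basename(urlparse(url).path)


url = 'https://huggingface.co/TheBloke/Wizard-Vicuna-7B-Uncensored-GGML/resolve/main/Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin'
print(model_name_from_url(url))                     # Wizard-Vicuna-7B-Uncensored.ggmlv3.q4_0.bin
print(model_name_from_url(url + '?download=true'))  # same file name
```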

### Send Prompts to the LLM

In [None]:
#| notest

prompt = """Extract the names of people in the supplied sentences. Here is an example:
Sentence: James Gandolfini and Paul Newman were great actors.
People:
James Gandolfini, Paul Newman
Sentence:
I like Cillian Murphy's acting. Florence Pugh is great, too.
People:"""

saved_output = llm.prompt(prompt)


Cillian Murphy, Florence Pugh
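
The prompt above follows a one-shot pattern: a worked example, then the new sentence, then an open `People:` line for the model to complete. A rough sketch of how such a prompt could be assembled and how the comma-separated reply could be turned into a list is shown here. The helpers `build_prompt` and `parse_names` are hypothetical, and the sketch assumes the reply is plain text like the output shown above.

```python
def build_prompt(examples, sentence):
    """Assemble a few-shot name-extraction prompt (hypothetical helper)."""
    lines = ["Extract the names of people in the supplied sentences. Here is an example:"]
    for example_sentence, people in examples:
        lines += ["Sentence:", example_sentence, "People:", ", ".join(people)]
    # End on an open "People:" line so the model completes it.
    lines += ["Sentence:", sentence, "People:"]
    return "\n".join(lines)


def parse_names(reply):
    """Split a comma-separated reply into a clean list of names."""
    return [name.strip() for name in reply.split(",") if name.strip()]


examples = [("James Gandolfini and Paul Newman were great actors.",
             ["James Gandolfini", "Paul Newman"])]
prompt = build_prompt(examples, "I like Cillian Murphy's acting. Florence Pugh is great, too.")

# Assumption: the reply is plain text like the output shown above.
reply = "\nCillian Murphy, Florence Pugh"
print(parse_names(reply))  # ['Cillian Murphy', 'Florence Pugh']
```

Keeping every example in the same `Sentence:` / `People:` layout makes it easier to add further examples without changing the parsing step.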

### How to Speed Up Inference Using a GPU

The example above ran on the CPU. If you have a GPU (even an older one with limited VRAM), you can speed up responses.

#### Step 1: Install `llama-cpp-python` with cuBLAS support
```shell
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python==0.1.69 --no-cache-dir
```
Use the exact version shown above (0.1.69); other versions may fail because of library incompatibilities.
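
Because the pinned version matters, it can help to confirm what is actually installed before loading a model. The sketch below uses the standard library's `importlib.metadata`. The helper name `llama_cpp_version_ok` is hypothetical, and the check only compares version strings; it cannot tell whether the package was built with cuBLAS.

```python
from importlib.metadata import PackageNotFoundError, version

REQUIRED_VERSION = "0.1.69"  # the version pinned in Step 1


def llama_cpp_version_ok(required=REQUIRED_VERSION):
    """Return True only if llama-cpp-python is installed at the pinned version."""
    try:
        return version("llama-cpp-python") == required
    except PackageNotFoundError:
        # Package is not installed in this environment.
        return False


if not llama_cpp_version_ok():
    print(f"llama-cpp-python {REQUIRED_VERSION} not found; rerun the Step 1 command.")
```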

#### Step 2: Use the `n_gpu_layers` argument with `LLM`
llm = LLM(model_name=os.path.basename(url), n_gpu_layers=128)

With the steps above, calls to methods like `llm.prompt` will offload computation to your GPU and speed up responses from the LLM.
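
If the same script has to run on machines with and without a GPU, one option is to pass `n_gpu_layers` only when a GPU appears to be present. The sketch below is a heuristic under stated assumptions: the helper `gpu_kwargs` is hypothetical, and it treats the presence of the `nvidia-smi` tool on `PATH` as a sign that an NVIDIA GPU is usable, which is not something OnPrem itself checks.

```python
import shutil


def gpu_kwargs(n_gpu_layers=128, which=shutil.which):
    """Return extra keyword arguments for LLM, depending on GPU availability.

    Heuristic (an assumption, not an OnPrem feature): treat the presence of
    the `nvidia-smi` tool on PATH as a sign that an NVIDIA GPU is usable.
    """
    if which("nvidia-smi") is not None:
        return {"n_gpu_layers": n_gpu_layers}
    return {}  # no GPU detected: fall back to the CPU-only defaults


extra = gpu_kwargs()
print(extra)
# Usage, with `url` and the imports from the Setup section:
# llm = LLM(model_name=os.path.basename(url), **extra)
```

Returning an empty dictionary on CPU-only machines leaves `LLM` with its defaults, so the sketch does not need to assume what value of `n_gpu_layers` disables offloading.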

