In [None]:
#| hide
from onprem.core import *

# OnPrem

> A tool for running large language models on-premises using non-public data

**OnPrem** is a simple Python package that makes it easier to run large language models (LLMs) on non-public or sensitive data and on machines with no internet connectivity (e.g., behind corporate firewalls). Inspired by the [privateGPT](https://github.com/imartinez/privateGPT) GitHub repo and Simon Willison's [LLM](https://pypi.org/project/llm/) command-line utility, **OnPrem** is designed to help integrate local LLMs into practical applications.

## Install

After [installing PyTorch](https://pytorch.org/get-started/locally/), you can install **OnPrem** with:

```sh
pip install onprem
```

For GPU support, see the section *Speeding Up Inference Using a GPU* below.

## How to use

### Setup

In [None]:
#| notest

import os.path
from onprem import LLM

llm = LLM()

### Send Prompts to the LLM to Solve Problems
This is an example of few-shot prompting, where we provide an example of what we want the LLM to do.

In [None]:
#| notest

prompt = """Extract the names of people in the supplied sentences. Here is an example:
Sentence: James Gandolfini and Paul Newman were great actors.
People:
James Gandolfini, Paul Newman
Sentence:
I like Cillian Murphy's acting. Florence Pugh is great, too.
People:"""

saved_output = llm.prompt(prompt)


Cillian Murphy, Florence Pugh
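
Longer few-shot prompts are easier to maintain when they are assembled from a list of examples rather than typed by hand. The following is a minimal sketch: `build_few_shot_prompt` is a hypothetical helper, not part of OnPrem, and it simply reproduces the Sentence/People layout of the prompt above.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt in the Sentence/People layout used above.

    `examples` is a list of (sentence, people) pairs. This helper is
    hypothetical and is not part of OnPrem.
    """
    lines = [instruction]
    for sentence, people in examples:
        lines += ["Sentence: " + sentence, "People:", people]
    # End with an open "People:" slot for the model to complete.
    lines += ["Sentence:", query, "People:"]
    return "\n".join(lines)


few_shot_prompt = build_few_shot_prompt(
    "Extract the names of people in the supplied sentences. Here is an example:",
    [("James Gandolfini and Paul Newman were great actors.",
      "James Gandolfini, Paul Newman")],
    "I like Cillian Murphy's acting. Florence Pugh is great, too.",
)
print(few_shot_prompt)
```

The resulting string can be passed to `llm.prompt` in the same way as the hand-written prompt.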

### Talk to Your Documents

Answers are generated from the content of your documents.

#### Step 1: Ingest the Documents into a Vector Database

In [None]:
#| notest
llm.ingest('./sample_data')


Creating new vectorstore
Loading documents from ./sample_data


Loading new documents: 100%|██████████████████████| 2/2 [00:00<00:00, 17.16it/s]


Loaded 11 new documents from ./sample_data
Split into 62 chunks of text (max. 500 tokens each)
Creating embeddings. May take some minutes...
Ingestion complete! You can now query your documents using the LLM.ask method
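
The log above reports that the documents were split into chunks of at most 500 tokens before embeddings were created. As a rough illustration of that splitting step (not OnPrem's actual splitter), the sketch below chunks text by whitespace-separated words, which only approximates tokens. The function `chunk_words` and its `max_words` and `overlap` parameters are hypothetical.

```python
def chunk_words(text, max_words=500, overlap=0):
    """Split text into chunks of at most `max_words` whitespace-separated words.

    Words stand in for tokens here; a real tokenizer would count differently.
    `overlap` repeats that many trailing words at the start of the next chunk.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(1200))
chunks = chunk_words(sample, max_words=500)
print([len(c.split()) for c in chunks])
```

Smaller chunks tend to give more focused retrieval results, while a nonzero `overlap` reduces the chance that a sentence is cut off at a chunk boundary.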


#### Step 2: Answer Questions About the Documents

In [None]:
#| notest
question = """What is ktrain?"""
answer, docs = llm.ask(question)
print('\n\nReferences:\n\n')
for i, document in enumerate(docs):
    print(f"\n{i+1}.> " + document.metadata["source"] + ":")
    print(document.page_content)

Ktrain is a low-code machine learning library designed to augment human
engineers in the machine learning workflow by automating or semi-automating various
aspects of model training, tuning, and application. Through its use, domain experts can
leverage their expertise while still benefiting from the power of machine learning techniques.

References:



1.> ./sample_data/ktrain_paper.pdf:
lection (He et al., 2019). By contrast, ktrain places less emphasis on this aspect of au-
tomation and instead focuses on either partially or fully automating other aspects of the
machine learning (ML) workﬂow. For these reasons, ktrain is less of a traditional Au-
2

2.> ./sample_data/ktrain_paper.pdf:
possible, ktrain automates (either algorithmically or through setting well-performing de-
faults), but also allows users to make choices that best ﬁt their unique application require-
ments. In this way, ktrain uses automation to augment and complement human engineers
rather than attempting to entirely r
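
In the output above, both passages come from the same PDF. When many chunks are returned, grouping them by source can make the reference list easier to scan. The sketch below assumes only what the loop above already uses, namely that each returned document has a `metadata["source"]` entry and a `page_content` attribute. `FakeDocument` and `group_by_source` are hypothetical stand-ins so that the example runs without OnPrem.

```python
from collections import OrderedDict
from dataclasses import dataclass, field


@dataclass
class FakeDocument:
    """Stand-in for a retrieved document (hypothetical, for illustration)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def group_by_source(docs):
    """Map each source path to its passages, keeping first-seen order."""
    grouped = OrderedDict()
    for doc in docs:
        grouped.setdefault(doc.metadata["source"], []).append(doc.page_content)
    return grouped


docs = [
    FakeDocument("first passage", {"source": "./sample_data/ktrain_paper.pdf"}),
    FakeDocument("second passage", {"source": "./sample_data/ktrain_paper.pdf"}),
]
for source, passages in group_by_source(docs).items():
    print(f"{source}: {len(passages)} passage(s)")
```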

### Speeding Up Inference Using a GPU

The examples above ran on the CPU.
If you have a GPU (even an older one with less VRAM), you can speed up responses.

#### Step 1: Install `llama-cpp-python` with cuBLAS support
```shell
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python==0.1.69 --no-cache-dir
```
Use the exact version shown above; other versions run into library incompatibilities.

#### Step 2: Use the `n_gpu_layers` argument with `LLM`

```python
# Load the default model as in Setup, offloading layers to the GPU
llm = LLM(n_gpu_layers=128)
```

With the steps above, calls to methods like `llm.prompt` will offload computation to your GPU and speed up responses from the LLM.
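
If the same script has to run on machines with and without a GPU, the `n_gpu_layers` argument can be supplied conditionally. The sketch below is one possible approach under a loudly stated assumption: it treats the presence of the `nvidia-smi` executable on the `PATH` as a sign that an NVIDIA GPU is usable. The helper `gpu_kwargs` is hypothetical and not part of OnPrem.

```python
import shutil


def gpu_kwargs(n_layers=128, has_gpu=None):
    """Return keyword arguments for `LLM` depending on GPU availability.

    Assumption: finding the `nvidia-smi` executable on PATH is treated as
    "an NVIDIA GPU is usable". This is a heuristic, not an OnPrem feature.
    """
    if has_gpu is None:
        has_gpu = shutil.which("nvidia-smi") is not None
    # Without a GPU, pass nothing so that LLM() keeps its CPU defaults.
    return {"n_gpu_layers": n_layers} if has_gpu else {}


print(gpu_kwargs(has_gpu=True))
print(gpu_kwargs(has_gpu=False))
# Intended use (requires OnPrem and a downloaded model):
# llm = LLM(**gpu_kwargs())
```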

