# Using Xorbits Inference to Deploy Local LLMs - in 3 steps!


## <span style="font-size: xx-large;;">🤖  </span> Installing and Running Xorbits Inference (1/3)

#### i. Run `pip install "xinference[all]"` in a terminal window

#### ii. After installation is complete, restart this Jupyter notebook

#### iii. Run `xinference` in a new terminal window

#### iv. You should see something similar to the following output:

```
INFO:xinference:Xinference successfully started. Endpoint: http://127.0.0.1:9997
INFO:xinference.core.service:Worker 127.0.0.1:21561 has been added successfully
INFO:xinference.deploy.worker:Xinference worker successfully started.
```

#### v. In the endpoint description, locate the endpoint port number after the colon. In the above case it is `9997`

#### vi. Paste the endpoint port number in the following cell

```python
port = 9997  # replace with your endpoint port number
```
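
If you would rather not copy the port by hand, it can be read out of the startup log line. The following is a minimal sketch using only the standard library. `extract_port` is a hypothetical helper name, and the sketch assumes the log line keeps the `Endpoint: http://host:port` shape shown in the sample output above.

```python
import re
from urllib.parse import urlparse


def extract_port(log_line: str) -> int:
    """Return the port from a log line that contains 'Endpoint: <url>'."""
    match = re.search(r"Endpoint:\s*(\S+)", log_line)
    if match is None:
        raise ValueError("no 'Endpoint:' field found in log line")
    port = urlparse(match.group(1)).port
    if port is None:
        raise ValueError("endpoint URL has no explicit port")
    return port


# Sample line copied from the output shown earlier in this notebook.
log_line = (
    "INFO:xinference:Xinference successfully started. "
    "Endpoint: http://127.0.0.1:9997"
)
port = extract_port(log_line)
print(port)
```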

## <span style="font-size: xx-large;;">🚀  </span> Downloading and Launching Local Models (2/3)

#### In this step, simply run the following code blocks

#### Also, feel free to change the model configuration for different experiences!

#### The latest list of supported models can be found in Xorbits Inference's [official GitHub page](https://github.com/xorbitsai/inference/blob/main/README.md)

##### Here are the parameter options for vicuna-v1.3, listed roughly from the least space-consuming to the most resource-intensive but best-performing:

model_size_in_billions: `7`, `13`, `33`

quantization: `q2_K`, `q3_K_L`, `q3_K_M`, `q3_K_S`, `q4_0`, `q4_1`, `q4_K_M`, `q4_K_S`, `q5_0`, `q5_1`, `q5_K_M`, `q5_K_S`, `q6_K`, `q8_0`
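
A typo in either option only surfaces once the launch request is sent, so it can help to check a choice up front. The sketch below is one way to do that, with the allowed values copied from the list above. `vicuna_launch_kwargs` is a hypothetical helper, the `ggmlv3` format is taken from the model table in this notebook, and the server remains the final authority on what it accepts.

```python
# Options copied from the vicuna-v1.3 list in this notebook.
VICUNA_SIZES = (7, 13, 33)
VICUNA_QUANTIZATIONS = (
    "q2_K", "q3_K_L", "q3_K_M", "q3_K_S", "q4_0", "q4_1", "q4_K_M",
    "q4_K_S", "q5_0", "q5_1", "q5_K_M", "q5_K_S", "q6_K", "q8_0",
)


def vicuna_launch_kwargs(model_size_in_billions: int, quantization: str) -> dict:
    """Validate a vicuna-v1.3 choice and return keyword arguments for launching it."""
    if model_size_in_billions not in VICUNA_SIZES:
        raise ValueError(
            f"size must be one of {VICUNA_SIZES}, got {model_size_in_billions!r}"
        )
    if quantization not in VICUNA_QUANTIZATIONS:
        raise ValueError(f"unknown quantization {quantization!r}")
    return {
        "model_name": "vicuna-v1.3",
        "model_size_in_billions": model_size_in_billions,
        "model_format": "ggmlv3",
        "quantization": quantization,
    }


kwargs = vicuna_launch_kwargs(13, "q4_0")
print(kwargs)
```

The returned dictionary has the same keys as the `launch_model` call used later in this notebook, so it could be passed along as `client.launch_model(**kwargs)`.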

##### Here are a few of the supported models:

| Name          | Type             | Language | Format  | Size (in billions) | Quantization                            |
|---------------|------------------|----------|---------|--------------------|-----------------------------------------|
| baichuan      | Foundation Model | en, zh   | ggmlv3  | 7                  | 'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0'  |
| llama-2-chat  | RLHF Model       | en       | ggmlv3  | 7, 13, 70          | 'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0'  |
| chatglm       | SFT Model        | en, zh   | ggmlv3  | 6                  | 'q4_0', 'q4_1', 'q5_0', 'q5_1', 'q8_0'  |
| chatglm2      | SFT Model        | en, zh   | ggmlv3  | 6                  | 'q4_0', 'q4_1', 'q5_0', 'q5_1', 'q8_0'  |
| wizardlm-v1.0 | SFT Model        | en       | ggmlv3  | 7, 13, 33          | 'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0'  |
| wizardlm-v1.1 | SFT Model        | en       | ggmlv3  | 13                 | 'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0'  |
| vicuna-v1.3   | SFT Model        | en       | ggmlv3  | 7, 13              | 'q2_K', 'q3_K_L', ... , 'q6_K', 'q8_0'  |


For satisfactory results, models above 13 billion parameters in size are recommended.

```python
# If Xinference cannot be imported, you may need to restart the Jupyter notebook
from llama_index import (
    ListIndex,
    TreeIndex,
    VectorStoreIndex,
    KeywordTableIndex,
    KnowledgeGraphIndex,
    SimpleDirectoryReader,
    ServiceContext,
)
from llama_index.llms import Xinference
from xinference.client import RESTfulClient
from IPython.display import Markdown, display
```

```python
# Define a client to send commands to Xinference
client = RESTfulClient(f"http://localhost:{port}")

# Download and launch a model; this may take a while the first time
model_uid = client.launch_model(
    model_name="llama-2-chat",
    model_size_in_billions=7,
    model_format="ggmlv3",
    quantization="q2_K",
    n_ctx=4096,
)

# Wrap the launched model as a LlamaIndex LLM
llm = Xinference(endpoint=f"http://localhost:{port}", model_uid=model_uid)
service_context = ServiceContext.from_defaults(llm=llm)
```
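
If the cell above fails with a connection error, a quick first check is whether anything is listening on the port you entered. The sketch below uses only the standard library. `endpoint_is_reachable` is a hypothetical helper, and it only confirms that a TCP connection can be opened, not that the listener is actually Xinference.

```python
import socket


def endpoint_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


# 9997 is the port from the sample output; replace it with your own.
reachable = endpoint_is_reachable("127.0.0.1", 9997)
print("Endpoint reachable:", reachable)
```

If this prints `False`, check that `xinference` is still running in its terminal window and that the port matches the one in its startup log.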

## <span style="font-size: xx-large;;">🕺  </span> Index the Data and Start Chatting! (3/3)

#### In this step, simply run the following code blocks

#### Also, feel free to change the index that is used for different experiences

#### A list of all available indexes can be found in LlamaIndex's [official docs](https://gpt-index.readthedocs.io/en/latest/core_modules/data_modules/index/modules.html)

The following indexes are already imported:

`ListIndex`, `TreeIndex`, `VectorStoreIndex`, `KeywordTableIndex`, `KnowledgeGraphIndex`

The following code uses `VectorStoreIndex`. To switch indexes, replace its name with that of another index.

```python
# create index from the data
documents = SimpleDirectoryReader("../data/paul_graham").load_data()

# change index name in the following line
index = VectorStoreIndex.from_documents(
    documents=documents, service_context=service_context
)
```
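
Instead of editing the class name by hand each time, the choice of index can be driven by a string. The sketch below shows one way to do that. `build_index` is a hypothetical helper, and `EchoIndex` is a stand-in class (not part of LlamaIndex) included only so the example runs without a model. The helper assumes each index class exposes `from_documents` with the same keyword arguments used in the cell above.

```python
from typing import Any, Mapping


def build_index(
    index_classes: Mapping[str, Any],
    name: str,
    documents,
    service_context=None,
):
    """Build the index registered under `name` by calling its `from_documents`."""
    if name not in index_classes:
        raise ValueError(
            f"unknown index {name!r}; choose from {sorted(index_classes)}"
        )
    index_cls = index_classes[name]
    return index_cls.from_documents(
        documents=documents, service_context=service_context
    )


class EchoIndex:
    """Stand-in index (not part of LlamaIndex) that just stores its documents."""

    def __init__(self, documents):
        self.documents = documents

    @classmethod
    def from_documents(cls, documents, service_context=None):
        return cls(documents)


demo = build_index({"echo": EchoIndex}, "echo", ["some text"])
print(type(demo).__name__, demo.documents)
```

In this notebook the mapping would hold the imported classes, for example `{"list": ListIndex, "tree": TreeIndex, "vector": VectorStoreIndex}`, and the call would become `build_index(index_classes, "tree", documents, service_context)`.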

```python
# ask a question and display the answer
query_engine = index.as_query_engine()

question = "What did the author do after his time at Y Combinator?"

response = query_engine.query(question)
display(Markdown(f"<b>{response}</b>"))
```
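
The answer is interpolated directly into HTML, so characters such as `<` or `&` in the model's output would be treated as markup. If you want them shown literally, one option is to escape the text first. This is a minimal sketch with the standard library's `html.escape`; `bold_html` is a hypothetical helper name.

```python
import html


def bold_html(text: str) -> str:
    """Escape HTML special characters in `text` and wrap the result in <b> tags."""
    return f"<b>{html.escape(text)}</b>"


print(bold_html("if x < y & y > z"))
```

In the cell above, the last line would then read `display(Markdown(bold_html(str(response))))`.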