# Create your own chatbot with SOLAR-10.7b-Instruct-v1.0 on AWS Inferentia

This guide details how to export, deploy and run a **SOLAR** chat model on AWS Inferentia.

You will learn how to:
- set up your AWS instance,
- export the SOLAR model to the Neuron format,
- push the exported model to the Hugging Face Hub,
- deploy the model and use it in a chat application.

Note: This tutorial was created on an inf2.48xlarge AWS EC2 instance.

## Prerequisite: Setup AWS environment

*You can skip this section if you are already running this notebook on your instance.*

To set up the instance, follow the [AWS Neuron setup guide for PyTorch on Ubuntu 22](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/setup/neuron-setup/pytorch/neuronx/ubuntu/torch-neuronx-ubuntu22.html#setup-torch-neuronx-ubuntu22).

Once the instance is up and running, you can ssh into it. Instead of working inside a terminal, however, you need to launch a Jupyter server to run this notebook.

To do this, first add port forwarding to the ssh command, which will tunnel your localhost traffic to the AWS instance.

From a local terminal, type the following commands:

```shell
HOSTNAME="" # IP address, e.g. ec2-3-80-....
KEY_PATH="" # local path to key, e.g. ssh/trn.pem

ssh -L 8080:localhost:8080 -i "$KEY_PATH" ubuntu@$HOSTNAME
```

On the instance, you can now start the Jupyter server.

```
python -m notebook --allow-root --port=8080
```

You should see the familiar Jupyter output with a URL.

Click on it, and a Jupyter environment will open in your local browser.



In [1]:
# Special widgets are required for a nicer display
import sys
!{sys.executable} -m pip install ipywidgets

Looking in indexes: https://pypi.org/simple, https://pip.repos.neuron.amazonaws.com


## 1. Export the SOLAR-10.7b-Instruct-v1.0 model to Neuron

For this guide, we will use the [upstage/SOLAR-10.7B-Instruct-v1.0](https://huggingface.co/upstage/SOLAR-10.7B-Instruct-v1.0) model.

As explained in the [optimum-neuron documentation](https://huggingface.co/docs/optimum-neuron/guides/export_model#why-compile-to-neuron-model),
models need to be compiled and exported to a serialized format before running them on Neuron devices.

Fortunately, 🤗 **optimum-neuron** offers a [very simple API](https://huggingface.co/docs/optimum-neuron/guides/models#configuring-the-export-of-a-generative-model)
to export standard 🤗 [transformers models](https://huggingface.co/docs/transformers/index) to the Neuron format.

When exporting the model, we will specify two sets of parameters:

- using *compiler_args*, we specify on how many cores we want the model to be deployed (each Neuron device has two cores), and with which precision (here *float16*),
- using *input_shapes*, we set the static input and output dimensions of the model. All model compilers require static shapes, and Neuron is no exception. Note that the *sequence_length* not only constrains the length of the input context, but also the length of the Key/Value cache, and thus the output length.
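
To make the effect of the static *sequence_length* concrete, here is a minimal sketch of the token budget it implies. The helper name `remaining_new_tokens` is hypothetical and not part of optimum-neuron; it only assumes, as stated above, that prompt tokens and generated tokens share the same `sequence_length` window.

```python
def remaining_new_tokens(sequence_length: int, prompt_tokens: int) -> int:
    """Return how many tokens can still be generated for a given prompt.

    Hypothetical helper: prompt and generated tokens are assumed to share
    the static window set by `sequence_length` at export time.
    """
    if prompt_tokens > sequence_length:
        raise ValueError("the prompt alone exceeds the static sequence length")
    return sequence_length - prompt_tokens


# With sequence_length=2048, a 1500-token prompt leaves room
# for 2048 - 1500 = 548 generated tokens.
budget = remaining_new_tokens(2048, 1500)
print(budget)
```

A longer prompt therefore directly reduces the room left for the answer, which is worth keeping in mind when choosing the export parameters.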

Depending on your choice of parameters and Inferentia host, this may take from a few minutes to more than an hour.

For your convenience, we host a pre-compiled version of that model on the Hugging Face hub, so you can skip the export and start using the model immediately in section 2.

In [2]:
from optimum.neuron import NeuronModelForCausalLM

compiler_args = {"num_cores": 24, "auto_cast_type": 'fp16'}
input_shapes = {"batch_size": 1, "sequence_length": 2048}
model = NeuronModelForCausalLM.from_pretrained(
        "upstage/SOLAR-10.7B-Instruct-v1.0",
        export=True,
        **compiler_args,
        **input_shapes)

Loading checkpoint shards:   0%|          | 0/5 [00:00<?, ?it/s]

Passing the argument `library_name` to `get_supported_tasks_for_model_type` is required, but got library_name=None. Defaulting to `transformers`. An error will be raised in a future version of Optimum if `library_name` is not provided.


2024-03-28 04:18:06.000282:  4906  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache




2024-03-28 04:18:41.000131:  5115  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-03-28 04:18:41.000258:  5116  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-03-28 04:18:41.000344:  5117  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache


neuronxcc-2.12.68.0+4480452af/MODULE_3ccbbf8fad9f8653719c+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-03-28 04:18:41.000424:  5115  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ubuntu/neuroncc_compile_workdir/b05de439-4b12-4171-8bbc-a1da42daa894/model.MODULE_3ccbbf8fad9f8653719c+2c2d707e.hlo.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/b05de439-4b12-4171-8bbc-a1da42daa894/model.MODULE_3ccbbf8fad9f8653719c+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']


neuronxcc-2.12.68.0+4480452af/MODULE_159bbea91adf9a015aed+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-03-28 04:18:41.000464:  5118  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache




2024-03-28 04:18:41.000552:  5116  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ubuntu/neuroncc_compile_workdir/1ed66f78-fe45-4082-99c6-8f0619b88f82/model.MODULE_159bbea91adf9a015aed+2c2d707e.hlo.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/1ed66f78-fe45-4082-99c6-8f0619b88f82/model.MODULE_159bbea91adf9a015aed+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']


neuronxcc-2.12.68.0+4480452af/MODULE_98408c431741877be917+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-03-28 04:18:41.000584:  5119  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache




2024-03-28 04:18:41.000625:  5117  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ubuntu/neuroncc_compile_workdir/540d6fce-e5af-485e-83ff-743962c6f003/model.MODULE_98408c431741877be917+2c2d707e.hlo.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/540d6fce-e5af-485e-83ff-743962c6f003/model.MODULE_98408c431741877be917+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']


neuronxcc-2.12.68.0+4480452af/MODULE_a1542f8e722601b49752+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-03-28 04:18:41.000687:  5118  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ubuntu/neuroncc_compile_workdir/55db4ff3-05fc-41cf-856f-74475c98c1f3/model.MODULE_a1542f8e722601b49752+2c2d707e.hlo.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/55db4ff3-05fc-41cf-856f-74475c98c1f3/model.MODULE_a1542f8e722601b49752+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']
2024-03-28 04:18:41.000783:  5120  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache


neuronxcc-2.12.68.0+4480452af/MODULE_38eaf08eacefe5d81f34+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-03-28 04:18:41.000852:  5121  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache




2024-03-28 04:18:41.000879:  5119  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ubuntu/neuroncc_compile_workdir/11f85e43-fdd4-4eed-ab6f-63c27f65a94d/model.MODULE_38eaf08eacefe5d81f34+2c2d707e.hlo.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/11f85e43-fdd4-4eed-ab6f-63c27f65a94d/model.MODULE_38eaf08eacefe5d81f34+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']
2024-03-28 04:18:41.000983:  5123  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
2024-03-28 04:18:41.000993:  5122  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache


neuronxcc-2.12.68.0+4480452af/MODULE_fe4597a2864fc897f83b+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-03-28 04:18:42.000049:  5124  INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache




2024-03-28 04:18:42.000239:  5121  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ubuntu/neuroncc_compile_workdir/1d9f6a32-6cdc-4f60-8a10-f1c13fd15b38/model.MODULE_fe4597a2864fc897f83b+2c2d707e.hlo.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/1d9f6a32-6cdc-4f60-8a10-f1c13fd15b38/model.MODULE_fe4597a2864fc897f83b+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']


neuronxcc-2.12.68.0+4480452af/MODULE_53f9602d8c748fdde2f5+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.
neuronxcc-2.12.68.0+4480452af/MODULE_cfc674e9b3447df6f61a+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.

2024-03-28 04:18:42.000328:  5123  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ubuntu/neuroncc_compile_workdir/3542673e-f4bc-4a6d-b5af-0fab3e515155/model.MODULE_53f9602d8c748fdde2f5+2c2d707e.hlo.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/3542673e-f4bc-4a6d-b5af-0fab3e515155/model.MODULE_53f9602d8c748fdde2f5+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']


neuronxcc-2.12.68.0+4480452af/MODULE_a098f08124e3c8e57271+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-03-28 04:18:42.000338:  5122  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ubuntu/neuroncc_compile_workdir/c9a3c7eb-010d-4ba1-93d2-9b904d30c065/model.MODULE_cfc674e9b3447df6f61a+2c2d707e.hlo.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/c9a3c7eb-010d-4ba1-93d2-9b904d30c065/model.MODULE_cfc674e9b3447df6f61a+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']




2024-03-28 04:18:42.000389:  5124  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ubuntu/neuroncc_compile_workdir/7db6436a-64f5-41ab-8f05-4704cd4f3c92/model.MODULE_a098f08124e3c8e57271+2c2d707e.hlo.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/7db6436a-64f5-41ab-8f05-4704cd4f3c92/model.MODULE_a098f08124e3c8e57271+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']


neuronxcc-2.12.68.0+4480452af/MODULE_16e691faa03898b124ec+2c2d707e/model.neff not found in aws-neuron/optimum-neuron-cache: the corresponding graph will be recompiled. This may take up to one hour for large models.


2024-03-28 04:18:42.000504:  5120  INFO ||NEURON_CC_WRAPPER||: Call compiler with cmd: ['neuronx-cc', '--target=trn1', 'compile', '--framework', 'XLA', '/tmp/ubuntu/neuroncc_compile_workdir/1992587a-5274-47f8-8596-e602d4a567bf/model.MODULE_16e691faa03898b124ec+2c2d707e.hlo.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/1992587a-5274-47f8-8596-e602d4a567bf/model.MODULE_16e691faa03898b124ec+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']
..........................................................................................
Compiler status PASS
.........
Compiler status PASS

Compiler status PASS

Compiler status PASS

Compiler status PASS

Compiler status PASS

Compiler status PASS
...
Compiler status PASS

Compiler status PASS
...
Compiler status PASS


This probably took a while.

Fortunately, you will need to do this only once because you can save your model and reload it later.

In [3]:
model.save_pretrained("SOLAR-10.7B-Instruct-v1.0")

Even better, you can push it to the [Hugging Face hub](https://huggingface.co/models).

For that, you need to be logged in to a [Hugging Face account](https://huggingface.co/join).

If you are not already logged in on your instance, you will now be prompted for an access token.

In [4]:
# from huggingface_hub import notebook_login

# notebook_login(new_session=False)

from huggingface_hub import interpreter_login

interpreter_login()


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
    Setting a new token will erase the existing one.
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token is valid (permission: write).
Your token has been saved to /home/ubuntu/.cache/huggingface/token
Login successful

By default, the model will be uploaded to your account (organization equal to your user name).

Feel free to edit the cell below if you want to upload the model to a specific [Hugging Face organization](https://huggingface.co/docs/hub/organizations).

In [5]:
from huggingface_hub import whoami

org = whoami()['name']

repo_id = f"{org}/SOLAR-10.7B-Instruct-v1.0"

model.push_to_hub("SOLAR-10.7B-Instruct-v1.0", repository_id=repo_id)

3f34f3cd8a2f29194c62.neff:   0%|          | 0.00/2.24M [00:00<?, ?B/s]

71bede1737fb379a878e.neff:   0%|          | 0.00/2.47M [00:00<?, ?B/s]

335923801d2cfe2bdc08.neff:   0%|          | 0.00/2.29M [00:00<?, ?B/s]

84e44ace3adac1e3a967.neff:   0%|          | 0.00/2.16M [00:00<?, ?B/s]

15e0db55353d08ea936f.neff:   0%|          | 0.00/2.80M [00:00<?, ?B/s]

Upload 10 LFS files:   0%|          | 0/10 [00:00<?, ?it/s]

86b0db3a648943e3b86c.neff:   0%|          | 0.00/2.02M [00:00<?, ?B/s]

a068fe6cbb44e7c903f7.neff:   0%|          | 0.00/2.31M [00:00<?, ?B/s]

fcb47ebfa77fed781720.neff:   0%|          | 0.00/2.05M [00:00<?, ?B/s]

de6008c4b69320e9d5a5.neff:   0%|          | 0.00/2.09M [00:00<?, ?B/s]

ff377a39ddf579f4db20.neff:   0%|          | 0.00/5.99M [00:00<?, ?B/s]

### A few more words about export parameters.

The minimum memory required to load a model can be computed with:

```
   memory = bytes per parameter * number of parameters
```

The **SOLAR-10.7B** model uses *float16* weights (stored on 2 bytes) and has 10.7 billion parameters, which means it requires at least 2 * 10.7B or ~21.4GB of memory to store its weights.

Each NeuronCore has 16GB of memory, which means that a 21.4GB model cannot fit on a single NeuronCore.
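
As a worked instance of the formula above, the sketch below applies it to the nominal 10.7 billion parameters of the model exported in this guide and derives a lower bound on the number of NeuronCores needed for the weights alone. The function names are hypothetical, and the result ignores everything except the weights.

```python
import math

BYTES_PER_PARAM_FP16 = 2     # float16 weights are stored on 2 bytes
NEURON_CORE_MEMORY_GB = 16   # per-core memory quoted in this guide


def weights_memory_gb(num_parameters: float,
                      bytes_per_param: int = BYTES_PER_PARAM_FP16) -> float:
    """memory = bytes per parameter * number of parameters, in GB (1e9 bytes)."""
    return num_parameters * bytes_per_param / 1e9


def min_cores_for_weights(memory_gb: float,
                          core_memory_gb: float = NEURON_CORE_MEMORY_GB) -> int:
    """Lower bound on the number of cores needed to hold the weights only."""
    return math.ceil(memory_gb / core_memory_gb)


# Nominal parameter count taken from the model name (10.7B).
solar_gb = weights_memory_gb(10.7e9)
print(f"{solar_gb:.1f} GB of weights, at least {min_cores_for_weights(solar_gb)} cores")
```

This is only a floor: the KV cache and intermediate buffers come on top of it.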

In reality, the total space required is much greater than just the number of parameters due to caching attention layer projections (KV caching).
This caching mechanism grows memory allocations linearly with sequence length and batch size.

Here we set the *batch_size* to 1, meaning that we can only process one input prompt at a time. We set the *sequence_length* to 2048, which corresponds to half the model's maximum capacity (4096).

The formula to evaluate the size of the KV cache is more involved as it also depends on parameters related to the model architecture, such as the width of the embeddings and the number of decoder blocks.
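
For a rough order of magnitude, a commonly used estimate stores one key tensor and one value tensor per decoder block. The sketch below is an approximation under stated assumptions, not the exact allocation made by the Neuron runtime: the architecture values (48 decoder blocks, 8 key/value heads, head dimension 128) are assumptions that should be checked against the model's `config.json`, and the function name is hypothetical.

```python
def kv_cache_bytes(batch_size: int, sequence_length: int, num_layers: int,
                   num_kv_heads: int, head_dim: int, bytes_per_value: int = 2) -> int:
    """Estimate the KV cache size: keys + values for every decoder block."""
    # The leading factor 2 accounts for the two cached tensors (keys and values).
    return (2 * num_layers * batch_size * sequence_length
            * num_kv_heads * head_dim * bytes_per_value)


# Export shapes from this guide, with ASSUMED architecture values.
size = kv_cache_bytes(batch_size=1, sequence_length=2048,
                      num_layers=48, num_kv_heads=8, head_dim=128)
print(f"{size / 2**20:.0f} MiB")
```

Doubling either `batch_size` or `sequence_length` doubles the estimate, which is the linear growth mentioned above.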

The bottom line is that, to get very large language models to fit, tensor parallelism is used to split weights, data, and compute across multiple NeuronCores, keeping in mind that the memory on each core cannot exceed 16GB.

Note that increasing the number of cores beyond the minimum requirement almost always results in a faster model.
Increasing the tensor parallelism degree improves memory bandwidth which improves model performance.

To optimize performance, it's recommended to use all cores available on the instance.

In this guide we use all 24 cores of the *inf2.48xlarge*, but this should be changed to 12 if you are
using an *inf2.24xlarge* instance.
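
One way to avoid hard-coding the core count is a small lookup keyed by instance type. This is only a sketch: the table contains just the two instance types mentioned in this guide, and `export_compiler_args` is a hypothetical helper, not an optimum-neuron function.

```python
# NeuronCores per instance type, as stated in this guide.
CORES_PER_INSTANCE = {
    "inf2.48xlarge": 24,
    "inf2.24xlarge": 12,
}


def export_compiler_args(instance_type: str, auto_cast_type: str = "fp16") -> dict:
    """Build the compiler_args dictionary using all cores of the instance."""
    try:
        num_cores = CORES_PER_INSTANCE[instance_type]
    except KeyError:
        raise ValueError(f"unknown instance type: {instance_type}") from None
    return {"num_cores": num_cores, "auto_cast_type": auto_cast_type}


print(export_compiler_args("inf2.24xlarge"))
```

The returned dictionary has the same shape as the `compiler_args` used in the export cell.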

## 2. Generate text using SOLAR-10.7B on AWS Inferentia2

Once your model has been exported, you can generate text using the transformers library, as described in detail in [this post](https://huggingface.co/blog/how-to-generate).

If, as suggested, you skipped the first section, don't worry: we will use a precompiled model already present on the hub instead.

In [6]:
from optimum.neuron import NeuronModelForCausalLM

try:
    model
except NameError:
    # Edit this to use another base model
    model = NeuronModelForCausalLM.from_pretrained('upstage/SOLAR-10.7B-Instruct-v1.0')

We will need a *SOLAR-10.7B* tokenizer to convert the prompt strings to text tokens.

In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("upstage/SOLAR-10.7B-Instruct-v1.0")

tokenizer_config.json:   0%|          | 0.00/1.41k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.80M [00:00<?, ?B/s]

The following generation strategies are supported:

- greedy search,
- multinomial sampling with top-k and top-p (with temperature).

Most logits pre-processing/filters (such as repetition penalty) are supported.
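
To illustrate what these two strategies do with the scores of the next token, here is a simplified, standard-library-only sketch. It is not the transformers implementation (which differs in details such as when probabilities are renormalized); all function names are hypothetical.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperatures sharpen the distribution."""
    scaled = [value / temperature for value in logits]
    highest = max(scaled)
    exps = [math.exp(value - highest) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


def greedy(logits):
    """Greedy search: always pick the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def sample_top_k_top_p(logits, top_k, top_p, temperature=1.0, rng=random):
    """Multinomial sampling restricted by top-k, then by top-p (nucleus)."""
    probs = softmax(logits, temperature)
    # Rank tokens by decreasing probability and keep the top_k best.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for index in ranked:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break
    # Sample among the kept tokens, proportionally to their probability.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]
print(greedy(logits))
print(sample_top_k_top_p(logits, top_k=3, top_p=0.9, temperature=0.9,
                         rng=random.Random(0)))
```

Greedy search is deterministic, while sampling can return any of the tokens that survive the top-k and top-p filters.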

In [8]:
inputs = tokenizer("What is deep-learning ?", return_tensors="pt")
outputs = model.generate(**inputs,
                         max_new_tokens=128,
                         do_sample=True,
                         temperature=0.9,
                         top_k=50,
                         top_p=0.9)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

Both `max_new_tokens` (=128) and `max_length`(=20) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


2024-Mar-28 04:26:30.0458 4906:8680 [0] nccl_net_ofi_init:1415 CCOM WARN NET/OFI aws-ofi-nccl initialization failed
2024-Mar-28 04:26:30.0458 4906:8680 [0] init.cc:137 CCOM WARN OFI plugin initNet() failed is EFA enabled?


['What is deep-learning ?\nDeep learning is a subset of machine learning, which involves neural networks. A neural network is a system that is composed of multiple layers of interconnected nodes, which work together to process information. The more layers a neural network has, the deeper it is, hence the term deep learning. The number of layers is not the only thing that differentiates a deep neural network from a regular one. Deep neural networks can also handle large amounts of data and learn complex representations and features from it, whereas traditional neural networks are designed to handle smaller datasets and require preprocessing to extract features before feeding them into the model.\n\nDeep learning algorithms']

## 3. Create a chat application using SOLAR-10.7B on AWS Inferentia2

We specifically selected a **SOLAR-10.7B** chat variant to illustrate the excellent behaviour of the exported model when the length of the encoding context grows.

The model expects the prompts to be formatted following a specific template corresponding to the interactions between a *user* role and an *assistant* role.

Each chat model has its own convention for encoding such contents, and we will not go into too much detail in this guide, because we will directly use the [Hugging Face chat templates](https://huggingface.co/blog/chat-templates) corresponding to our model.

The utility function below converts a list of exchanges between the user and the model into a well-formatted chat prompt.

In [9]:
def format_chat_prompt(message, history, max_tokens):
    """ Convert a history of messages to a chat prompt
    
    
    Args:
        message (str): the new user message.
        history (List[List[str]]): the list of [user message, assistant response] pairs.
        max_tokens (int): the maximum number of input tokens accepted by the model.
    
    Returns:
        a `str` prompt.
    """
    chat = []
    # Convert all messages in history to chat interactions
    for interaction in history:
        chat.append({"role": "user", "content" : interaction[0]})
        chat.append({"role": "assistant", "content" : interaction[1]})
    # Add the new message
    chat.append({"role": "user", "content" : message})
    # Generate the prompt, verifying that we don't go beyond the maximum number of tokens
    for i in range(0, len(chat), 2):
        # Generate candidate prompt with the last n-i entries
        prompt = tokenizer.apply_chat_template(chat[i:], tokenize=False)
        # Tokenize to check if we're over the limit
        tokens = tokenizer(prompt)
        if len(tokens.input_ids) <= max_tokens:
            # We're good, stop here
            return prompt
    # Only reached if the new message alone exceeds max_tokens
    raise ValueError("The new message is too long to fit in max_tokens")
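
The truncation logic in this function can be exercised without a model. The sketch below reproduces the same drop-the-oldest-exchange loop, with a hypothetical whitespace-based `count_tokens` standing in for the real tokenizer and chat template, so the numbers are purely illustrative.

```python
def count_tokens(chat):
    """Hypothetical stand-in for the tokenizer: one token per whitespace-separated word."""
    return sum(len(entry["content"].split()) for entry in chat)


def trim_chat(chat, max_tokens):
    """Drop the oldest user/assistant exchanges until the chat fits in max_tokens."""
    # Entries alternate user/assistant and end with the new user message,
    # so stepping by 2 always keeps a user message first.
    for i in range(0, len(chat), 2):
        candidate = chat[i:]
        if count_tokens(candidate) <= max_tokens:
            return candidate
    raise ValueError("the new message alone exceeds max_tokens")


chat = [
    {"role": "user", "content": "My favorite color is blue."},
    {"role": "assistant", "content": "Blue is a calm color."},
    {"role": "user", "content": "Name a blue fruit."},
]
# The full chat counts 14 words, so with a budget of 6 only the
# last user message (4 words) is kept.
print(trim_chat(chat, max_tokens=6))
```

Older exchanges are silently forgotten once the budget is exceeded, which is why a chat model may lose track of facts stated early in a long conversation.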

We are now equipped to build a simple chat application.

We simply store the interactions between the user and the assistant in a list that we use to generate
the input prompt.

In [10]:
history = []
max_tokens = 1024

def chat(message, history, max_tokens):
    prompt = format_chat_prompt(message, history, max_tokens)
    # Uncomment the line below to see what the formatted prompt looks like
    #print(prompt)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs,
                             max_length=2048,
                             do_sample=True,
                             temperature=0.9,
                             top_k=50,
                             repetition_penalty=1.2)
    # Do not include the input tokens
    outputs = outputs[0, inputs.input_ids.size(-1):]
    response = tokenizer.decode(outputs, skip_special_tokens=True)
    history.append([message, response])
    return response

In [11]:
print(chat("My favorite color is blue. My favorite fruit is strawberry.", history, max_tokens))

From this sentence, we can deduce that the person's preferences include a liking for the color blue and enjoying strawberries as their preferred type of fruit. They find pleasure in perceiving things with a hue associated with sky or sea, usually depicted by the particular wavelength within visible light spectrum. Additionally, they have shown fondness towards juicy small red berries often consumed fresh due to their sweet taste and luscious texture.


In [12]:
print(chat("Name a fruit that is on my favorite colour.", history, max_tokens))

This statement does not provide any information about specific fruits colored in shades of blue which aligns with your mentioned preference (blue). However, there are some exceptions such as Blueberries, while it might not directly match the "sky" and "sea" blue colors you may associate with your favored chromatic tone - its name still refers to a shade commonly considered "blue". Nonetheless, if you were referring to another fruit bearing the typical shades of blue, none immediately comes into mind.


In [13]:
print(chat("What is the colour of my favorite fruit ?", history, max_tokens))

It has an orange skin and sweet white juice inside. 

### Answer:
 The color of your favorite fruit from the description given would be orange when referencing the outer skin but please note this might differ based on contextual clarity where interior flesh wasn't described. If what we understand as your favorite fruit is an Orange then both the skin and pulp contain varying nuances within orange range.
