Merged
9 changes: 8 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -56,7 +56,14 @@ For GPU run, you need to have installed on your machine Nvidia drivers and [NVID
* feature extraction (text to dense embeddings)
* text generation (GPT-2 style).

Moreover, we have added a GPU quantization notebook to open directly on `Docker` to play with.
Moreover, we have added a GPU `quantization` notebook that you can open directly in `Docker` to play with.

First, clone the repo, as some commands below expect to find the `demo` folder:

```shell
git clone git@github.com:ELS-RD/transformer-deploy.git
cd transformer-deploy
```

### Classification/reranking (encoder model)

Expand Down
97 changes: 68 additions & 29 deletions demo/generative-model/gpt2.ipynb

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions demo/quantization/quantization_end_to_end.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"For some context and explanations, please check our documentation here: [https://els-rd.github.io/transformer-deploy/quantization_intro/](https://els-rd.github.io/transformer-deploy/quantization_intro/)."
"For some context and explanations, please check our documentation here: [https://els-rd.github.io/transformer-deploy/quantization/quantization_intro/](https://els-rd.github.io/transformer-deploy/quantization/quantization_intro/)."
]
},
{
Expand Down Expand Up @@ -439,7 +439,7 @@
"\n",
"The idea is to take the source code of a specific model and automatically add `QDQ` nodes. QDQ nodes are placed before and after each operation we want to quantize; they store the information needed to map between high-precision and low-precision numbers.\n",
"\n",
"If you want to know more, check our documentation on: [https://els-rd.github.io/transformer-deploy/quantization_intro/](https://els-rd.github.io/transformer-deploy/quantization_intro/)"
"If you want to know more, check our documentation on: [https://els-rd.github.io/transformer-deploy/quantization/quantization_ast/](https://els-rd.github.io/transformer-deploy/quantization/quantization_ast/)"
]
},
{
Expand Down
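The quantize-dequantize round trip performed by a `QDQ` node pair can be sketched in plain `numpy` (an illustration of the idea only, not the actual TensorRT or `pytorch-quantization` implementation; the scale value below is arbitrary):

```python
import numpy as np

def qdq(x: np.ndarray, scale: float, zero_point: int = 0) -> np.ndarray:
    """Simulate a Q/DQ node pair: quantize a float tensor to INT-8,
    then immediately dequantize it back to float.
    The scale and zero_point are the stored mapping information."""
    q = np.clip(np.round(x / scale) + zero_point, -128, 127)  # quantize step
    return (q - zero_point) * scale                           # dequantize step

x = np.array([0.10, 0.53, 2.00], dtype=np.float32)
print(qdq(x, scale=0.01))  # values above 1.27 are clipped by the INT-8 range
```

During calibration, tools pick `scale` (and `zero_point`) so that most of the tensor's value range survives this round trip with minimal error.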
1 change: 1 addition & 0 deletions docs/img
4 changes: 4 additions & 0 deletions docs/onnx_convert.md
Original file line number Diff line number Diff line change
Expand Up @@ -63,4 +63,8 @@ def convert_to_onnx(
If we were not doing that, the graph would only accept tensors with the exact same shape that the ones we are using to build it (the dummy data), so sequence length or batch size would be fixed.
The name we have given to input and output fields will be reused in other tools.

A complete conversion process in real life (including the TensorRT engine step) looks like this:

![Export process](img/export_process.png)

--8<-- "resources/abbreviations.md"
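The dynamic axes mentioned above boil down to a small mapping passed to the export call; here is a minimal sketch (the axis names and tensor names are illustrative assumptions, not necessarily the ones used by `convert_to_onnx`):

```python
def build_dynamic_axes(input_names: list, output_names: list) -> dict:
    # Mark axis 0 (batch size) and axis 1 (sequence length) as dynamic
    # for every input and output tensor, so the exported graph accepts
    # tensors of varying shape instead of only the dummy data's fixed shape.
    return {name: {0: "batch_size", 1: "sequence"} for name in input_names + output_names}

dynamic_axes = build_dynamic_axes(["input_ids", "attention_mask"], ["output"])
# this dict would then be passed as torch.onnx.export(..., dynamic_axes=dynamic_axes)
print(dynamic_axes["input_ids"])  # {0: 'batch_size', 1: 'sequence'}
```

The keys of this dict are the input/output names that, as noted above, will be reused in other tools.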
8 changes: 5 additions & 3 deletions docs/optimizations.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,12 +8,14 @@ There are few ways to optimize a model for inference, some of them are basically

Another, orthogonal, approach is to use lower-precision tensors, either FP16 floating-point numbers or INT-8 quantization.

![Optimization process](img/optimization_process.png)

!!! attention

Mixed precision and INT-8 quantization may have an accuracy cost.
The reason is that you can't code as many information in FP16 or INT-8 tensor that you can in FP32 tensor.
    The reason is that you can't encode as much information in an FP16 or INT-8 tensor as in an FP32 tensor.
    Sometimes there is not enough granularity; at other times the range is not large enough.
When it happens, you need to modify the graph to keep some operators in full precision.
This library does it for mixed precision and provide you with a simple way to do it for INT-8 quantization
    When this happens, you need to modify the computation graph to keep some operators in full precision.
    This library does it for mixed precision (for most models) and provides you with a simple way to do it for INT-8 quantization.

--8<-- "resources/abbreviations.md"
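The loss of granularity mentioned in the note above can be observed directly in `numpy` (a toy illustration: FP16 has a 10-bit mantissa, so not every integer above 2048 can be represented exactly, and its range tops out near 65504):

```python
import numpy as np

# FP16 represents every integer up to 2048 exactly...
assert np.float16(2048) == 2048
# ...but above that, consecutive integers collapse onto the same value:
print(np.float16(2049))   # rounds back to 2048.0 (granularity problem)
print(np.float16(70000))  # exceeds FP16's range entirely -> inf (range problem)
```

When an operator in the graph produces values like these, keeping it in FP32 avoids the rounding and overflow shown here.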
2 changes: 2 additions & 0 deletions mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,8 @@ markdown_extensions:
- pymdownx.mark
- pymdownx.tilde
- abbr
- attr_list
- md_in_html

plugins:
- search
Expand Down
1 change: 0 additions & 1 deletion requirements.txt
Original file line number Diff line number Diff line change
@@ -1,7 +1,6 @@
fastapi
onnx
tritonclient[all]
triton-model-analyzer
nvidia-pyindex
gunicorn
uvicorn
Expand Down
Binary file added resources/img/export_process.png
Binary file added resources/img/gpt2.png
Binary file added resources/img/optimization_process.png