Merged
9 changes: 8 additions & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -56,7 +56,14 @@ For GPU run, you need to have installed on your machine Nvidia drivers and [NVID
* feature extraction (text to dense embeddings)
* text generation (GPT-2 style).

Moreover, we have added a GPU quantization notebook to open directly on `Docker` to play with.
Moreover, we have added a GPU `quantization` notebook that you can open directly in `Docker` to play with.

First, clone the repo, as some commands below expect to find the `demo` folder:

```shell
git clone git@github.com:ELS-RD/transformer-deploy.git
cd transformer-deploy
```

### Classification/reranking (encoder model)

Expand Down
97 changes: 68 additions & 29 deletions demo/generative-model/gpt2.ipynb

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions demo/quantization/quantization_end_to_end.ipynb
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"For some context and explanations, please check our documentation here: [https://els-rd.github.io/transformer-deploy/quantization_intro/](https://els-rd.github.io/transformer-deploy/quantization_intro/)."
"For some context and explanations, please check our documentation here: [https://els-rd.github.io/transformer-deploy/quantization/quantization_intro/](https://els-rd.github.io/transformer-deploy/quantization/quantization_intro/)."
]
},
{
Expand Down Expand Up @@ -439,7 +439,7 @@
"\n",
"The idea is to take the source code of a specific model and automatically add `QDQ` nodes. QDQ nodes are placed before and after each operation we want to quantize; they store the information needed to map between high-precision and low-precision numbers.\n",
"\n",
"If you want to know more, check our documentation on: [https://els-rd.github.io/transformer-deploy/quantization_intro/](https://els-rd.github.io/transformer-deploy/quantization_intro/)"
"If you want to know more, check our documentation on: [https://els-rd.github.io/transformer-deploy/quantization/quantization_ast/](https://els-rd.github.io/transformer-deploy/quantization/quantization_ast/)"
]
},
{
Expand Down
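The quantize-dequantize round trip performed by a `QDQ` node pair can be sketched in plain `numpy` (an illustration of the idea only, not the actual TensorRT or `pytorch-quantization` implementation; the scale value below is arbitrary):

```python
import numpy as np

def qdq(x: np.ndarray, scale: float, zero_point: int = 0) -> np.ndarray:
    """Simulate a Q/DQ node pair: quantize a float tensor to INT-8,
    then immediately dequantize it back to float.
    The scale and zero_point are the stored mapping information."""
    q = np.clip(np.round(x / scale) + zero_point, -128, 127)  # quantize step
    return (q - zero_point) * scale                           # dequantize step

x = np.array([0.10, 0.53, 2.00], dtype=np.float32)
print(qdq(x, scale=0.01))  # values above 1.27 are clipped by the INT-8 range
```

During calibration, tools pick `scale` (and `zero_point`) so that most of the tensor's value range survives this round trip with minimal error.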
1 change: 1 addition & 0 deletions docs/img
4 changes: 4 additions & 0 deletions docs/onnx_convert.md
Original file line number Diff line number Diff line change
Expand Up @@ -63,4 +63,8 @@ def convert_to_onnx(
If we were not doing that, the graph would only accept tensors with the exact same shape that the ones we are using to build it (the dummy data), so sequence length or batch size would be fixed.
The name we have given to input and output fields will be reused in other tools.

A complete conversion process in real life (including the TensorRT engine step) looks like this:

![Export process](img/export_process.png)

--8<-- "resources/abbreviations.md"
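The dynamic axes mentioned above boil down to a small mapping passed to the export call; here is a minimal sketch (the axis names and tensor names are illustrative assumptions, not necessarily the ones used by `convert_to_onnx`):

```python
def build_dynamic_axes(input_names: list, output_names: list) -> dict:
    # Mark axis 0 (batch size) and axis 1 (sequence length) as dynamic
    # for every input and output tensor, so the exported graph accepts
    # tensors of varying shape instead of only the dummy data's fixed shape.
    return {name: {0: "batch_size", 1: "sequence"} for name in input_names + output_names}

dynamic_axes = build_dynamic_axes(["input_ids", "attention_mask"], ["output"])
# this dict would then be passed as torch.onnx.export(..., dynamic_axes=dynamic_axes)
print(dynamic_axes["input_ids"])  # {0: 'batch_size', 1: 'sequence'}
```

The keys of this dict are the input/output names that, as noted above, will be reused in other tools.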
8 changes: 5 additions & 3 deletions docs/optimizations.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,12 +8,14 @@ There are few ways to optimize a model for inference, some of them are basically

Another, orthogonal, approach is to use lower-precision tensors, either FP16 floating-point numbers or INT-8 quantization.

![Optimization process](img/optimization_process.png)

!!! attention

Mixed precision and INT-8 quantization may have an accuracy cost.
The reason is that you can't code as many information in FP16 or INT-8 tensor that you can in FP32 tensor.
    The reason is that you can't encode as much information in an FP16 or INT-8 tensor as in an FP32 tensor.
    Sometimes there is not enough granularity; at other times the range is not large enough.
When it happens, you need to modify the graph to keep some operators in full precision.
This library does it for mixed precision and provide you with a simple way to do it for INT-8 quantization
    When this happens, you need to modify the computation graph to keep some operators in full precision.
    This library does it for mixed precision (for most models) and provides you with a simple way to do it for INT-8 quantization.

--8<-- "resources/abbreviations.md"
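The loss of granularity mentioned in the note above can be observed directly in `numpy` (a toy illustration: FP16 has a 10-bit mantissa, so not every integer above 2048 can be represented exactly, and its range tops out near 65504):

```python
import numpy as np

# FP16 represents every integer up to 2048 exactly...
assert np.float16(2048) == 2048
# ...but above that, consecutive integers collapse onto the same value:
print(np.float16(2049))   # rounds back to 2048.0 (granularity problem)
print(np.float16(70000))  # exceeds FP16's range entirely -> inf (range problem)
```

When an operator in the graph produces values like these, keeping it in FP32 avoids the rounding and overflow shown here.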
2 changes: 2 additions & 0 deletions mkdocs.yml
Original file line number Diff line number Diff line change
Expand Up @@ -44,6 +44,8 @@ markdown_extensions:
- pymdownx.mark
- pymdownx.tilde
- abbr
- attr_list
- md_in_html

plugins:
- search
Expand Down
1 change: 0 additions & 1 deletion requirements.txt
Original file line number Diff line number Diff line change
@@ -1,7 +1,6 @@
fastapi
onnx
tritonclient[all]
triton-model-analyzer
nvidia-pyindex
gunicorn
uvicorn
Expand Down
Binary file added resources/img/export_process.png
Binary file added resources/img/gpt2.png
Binary file added resources/img/optimization_process.png