Is your improvement request related to a problem? Please describe.
1. Auto-provision CPU-only nodes: load the model with the minimum quantization option. The node type can be detected with a command, and the corresponding environment variables set automatically.
2. Match models to the different GPU types (GPU memory): map them in a config file (model-config.json) and parse it with a script to get the information as needed, e.g. {name:"baichuan-13b", memory:"20GB", quantization:"ON/OFF", quantization_bits:"4|8", ...}
3. This way the end user only cares about the model name and whether the node runs in CPU or GPU mode, while on our side we only need to maintain one CI/CD pipeline to support all kinds of LLMs.
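The flow above could be sketched roughly as follows. This is only a minimal illustration of the idea, not a proposed implementation: the model-config.json schema, the field names, and the env variable names (MODEL_NAME, DEVICE, QUANT_BITS) are all assumptions taken from the example entry in point 2; GPU detection here is simply "is nvidia-smi on the PATH".

```python
import json
import os
import shutil

# Hypothetical model-config.json content as sketched in point 2 above;
# the field names are assumptions, not an existing schema.
MODEL_CONFIG_JSON = """
[
  {"name": "baichuan-13b", "memory": "20GB",
   "quantization": "ON", "quantization_bits": "4|8"}
]
"""

def node_has_gpu():
    # Crude detection: assume a GPU node has nvidia-smi on the PATH.
    return shutil.which("nvidia-smi") is not None

def select_env(model_name, config_json=MODEL_CONFIG_JSON):
    """Build env variables for a model, picking the minimum
    quantization option on CPU-only nodes (point 1)."""
    config = json.loads(config_json)
    entry = next((m for m in config if m["name"] == model_name), None)
    if entry is None:
        raise KeyError(f"unknown model: {model_name}")
    env = {
        "MODEL_NAME": entry["name"],
        "DEVICE": "gpu" if node_has_gpu() else "cpu",
    }
    if env["DEVICE"] == "cpu" and entry["quantization"] == "ON":
        # CPU-only node: choose the smallest available bit width,
        # i.e. the most aggressive quantization.
        env["QUANT_BITS"] = min(entry["quantization_bits"].split("|"), key=int)
    return env

env = select_env("baichuan-13b")
os.environ.update(env)  # downstream CD scripts would read these
print(env)
```

A provisioning step in the CD pipeline could call something like this once per node, so the user-facing interface stays just the model name.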