CUDA implementation of ggml_clamp #545

Closed
jploski opened this issue Sep 30, 2023 · 3 comments

Comments
jploski (Contributor) commented Sep 30, 2023

ggml_clamp currently lacks a CUDA implementation. Adding one is rather trivial, but it is something I ran into while porting MPT to llama.cpp. I have a working implementation and will open a PR for it, referencing this issue.
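
For reference, the kernel really is only a few lines. The sketch below is illustrative only; the kernel name, block size, and host launcher are assumptions, not necessarily what the actual PR uses:

```cuda
#include <cuda_runtime.h>

// Element-wise clamp: dst[i] = min(max(x[i], lo), hi), one thread per element.
// k is the total number of elements in the (contiguous) tensor.
static __global__ void clamp_f32(const float * x, float * dst,
                                 const float lo, const float hi, const int k) {
    const int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i >= k) {
        return;
    }
    dst[i] = x[i] < lo ? lo : (x[i] > hi ? hi : x[i]);
}

// Host-side launcher: round the grid size up so every element is covered.
static void clamp_f32_cuda(const float * x, float * dst,
                           const float lo, const float hi,
                           const int k, cudaStream_t stream) {
    const int block_size = 256;
    const int num_blocks = (k + block_size - 1) / block_size;
    clamp_f32<<<num_blocks, block_size, 0, stream>>>(x, dst, lo, hi, k);
}
```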

jploski (Contributor, Author) commented Sep 30, 2023

Hmm, I'm a bit confused, as the internals (e.g. of ggml_scale) on the ggml master branch look different from those on llama.cpp master. llama.cpp's implementation of ggml-cuda.cu seems more recent (and cleaner looking, too). Will ggml eventually catch up to llama.cpp? (In that case I would contribute the patch to llama.cpp instead.)

jploski added a commit to jploski/llama.cpp that referenced this issue Sep 30, 2023
slaren (Collaborator) commented Sep 30, 2023

Most of the development of the CUDA backend happens in the llama.cpp repository, and the changes are merged here regularly. Opening the PR in the llama.cpp repository is fine; the changes will eventually end up here. I see that the ggml_clamp implementation is already in your MPT PR in llama.cpp, so there is no need to open another PR here.

jploski (Contributor, Author) commented Sep 30, 2023

> Most of the development of the CUDA backend happens in the llama.cpp repository, and the changes are merged here regularly. Opening the PR in the llama.cpp repository is fine; the changes will eventually end up here. I see that the ggml_clamp implementation is already in your MPT PR in llama.cpp, so there is no need to open another PR here.

Thanks for clarifying - I thought it worked exactly the other way around.

jploski closed this as completed Sep 30, 2023
ggerganov added a commit to ggerganov/llama.cpp that referenced this issue Oct 10, 2023

llm : add MPT support (#3417)

* CUDA: added support for ggml_clamp (see also: ggerganov/ggml#545)

* mpt : added an implementation based (mostly) on falcon integration, modified with deltas from ggml/examples/mpt

* mpt : protect against "clip_qkv": null in mpt-7b

* mpt : quick fix to avoid "Strange model" warning when quantizing MPT models

* mpt : addendum to changeset:84e30e8 - leave parameter clamp_kqv out from metadata rather than use 0.0 to indicate "no clamping" (more compliant with the current GGUF spec?)

* mpt : standardized all tensor names to follow GGUF spec

* mpt : addendum to changeset:1be89c40 - use "req" parameter of GGUF_GET_KEY macro instead of duplicate code

* mpt : fixed comment s/gptneox/mpt/

* mpt : remove tabs, trailing whitespace

* mpt : removed ne01 + n_past == ne00 assertion from alibi (cuda/f32) and rope_shift from build_mpt

* mpt : updated convert-mpt-hf-to-gguf.py to reflect changes made to convert-gptneox-hf-to-gguf.py in pr:3252

* comment out n_past instead of marking it unused

* mpt : removed hardcoded +178 from convert script in favor of utilizing hparams["vocab_size"]

* mpt : remove unused tokenizer_json in convert script

* ggml : remove obsolete n_past assert in ggml_alibi

* llama : print clamp_kqv and max_alibi_bias hparams

---------

Co-authored-by: Cebtenzzre <cebtenzzre@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
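
For context, the clamp_kqv hyperparameter mentioned in the changeset above (MPT's clip_qkv) is consumed in the graph by wrapping the projected Q/K/V tensors in ggml_clamp. The helper below is only a minimal illustration of that idea; the function name and the hparams plumbing are assumptions, not code from the PR:

```c
#include "ggml.h"

// Illustrative helper (not from the PR): apply MPT's clip_qkv / clamp_kqv
// to a projected Q, K or V tensor. When the hparam is absent (<= 0.0f),
// the op is skipped entirely instead of clamping to a zero range, matching
// the "leave clamp_kqv out of the metadata" note above.
static struct ggml_tensor * maybe_clamp_kqv(struct ggml_context * ctx,
                                            struct ggml_tensor  * cur,
                                            float                 clamp_kqv) {
    if (clamp_kqv <= 0.0f) {
        return cur; // no clamping requested
    }
    return ggml_clamp(ctx, cur, -clamp_kqv, clamp_kqv);
}
```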
github-actions bot pushed a commit to KerfuffleV2/ggml-sys-bleedingedge that referenced this issue Oct 10, 2023
== Relevant log messages from source repo:

commit f5f9121de140eff558f13b5c5e78a3a3b6b94377
Author: Jan Ploski <jpl@plosquare.com>
Date:   Tue Oct 10 09:50:23 2023 +0200

    llm : add MPT support (#3417)

CCLDArjun pushed a commit to CCLDArjun/ggml that referenced this issue Dec 18, 2023
Current status: Working, except for the latest GPTQ-for-LLaMa format
  that includes `g_idx`.  This turns out to require changes to GGML, so
  for now it only works if you use the `--outtype` option to dequantize it
  back to f16 (which is pointless except for debugging).

  I also included some cleanup for the C++ code.

  This script is meant to replace all the existing conversion scripts
  (including the ones that convert from older GGML formats), while also
  adding support for some new formats.  Specifically, I've tested with:

  - [x] `LLaMA` (original)
  - [x] `llama-65b-4bit`
  - [x] `alpaca-native`
  - [x] `alpaca-native-4bit`
  - [x] LLaMA converted to 'transformers' format using
        `convert_llama_weights_to_hf.py`
  - [x] `alpaca-native` quantized with `--true-sequential --act-order
        --groupsize 128` (dequantized only)
  - [x] same as above plus `--save_safetensors`
  - [x] GPT4All
  - [x] stock unversioned ggml
  - [x] ggmh

  There's enough overlap in the logic needed to handle these different
  cases that it seemed best to move to a single script.

  I haven't tried this with Alpaca-LoRA because I don't know where to find
  it.

  Useful features:

  - Uses multiple threads for a speedup in some cases (though the Python
    GIL limits the gain, and sometimes it's disk-bound anyway).

  - Combines split models into a single file (both the intra-tensor split
    of the original and the inter-tensor split of 'transformers' format
    files).  Single files are more convenient to work with and more
    friendly to future changes to use memory mapping on the C++ side.  To
    accomplish this without increasing memory requirements, it has some
    custom loading code which avoids loading whole input files into memory
    at once.

  - Because of the custom loading code, it no longer depends on PyTorch,
    which might make installing dependencies slightly easier or faster...
    although it still depends on NumPy and sentencepiece, so I don't know
    if there's any meaningful difference.  In any case, I also added a
    requirements.txt file to lock the dependency versions in case of any
    future breaking changes.

  - Type annotations checked with mypy.

  - Some attempts to be extra user-friendly:

      - The script tries to be forgiving with arguments, e.g. you can
        specify either the model file itself or the directory containing
        it.

      - The script doesn't depend on config.json / params.json, just in
        case the user downloaded files individually and doesn't have those
        handy.  But you still need tokenizer.model and, for Alpaca,
        added_tokens.json.

      - The script tries to give a helpful error message if
        added_tokens.json is missing.
CCLDArjun pushed a commit to CCLDArjun/ggml that referenced this issue Dec 18, 2023
after ggerganov#545 we do not need torch, tqdm and requests in the dependencies