
Inference on nvidia gpu #230

Open
Louis-y-nlp opened this issue Jun 5, 2023 · 4 comments

@Louis-y-nlp

Thanks for your great work. I'm running an MPT model on an NVIDIA V100 GPU. I think the compilation process went well, but the GPU is not utilized during inference. Here is what I got:

cmake -D CMAKE_C_COMPILER=/usr/local/bin/gcc -D CMAKE_CXX_COMPILER=/usr/local/bin/g++ -DGGML_CUBLAS=ON -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc ..
-- The C compiler identification is GNU 8.5.0
-- The CXX compiler identification is GNU 8.5.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/local/bin/gcc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/local/bin/g++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Could NOT find Git (missing: GIT_EXECUTABLE) 
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- CMAKE_SYSTEM_PROCESSOR: x86_64
-- x86 detected
-- Linux detected
-- Found CUDAToolkit: /usr/local/cuda/include (found version "11.0.221") 
-- cuBLAS found
-- The CUDA compiler identification is NVIDIA 11.0.221
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /usr/local/cuda/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- GGML CUDA sources found, configuring CUDA architecture
-- x86 detected
-- Linux detected
-- Configuring done (3.0s)
-- Generating done (0.1s)
-- Build files have been written to: /home/work/data/codes/ggml/build

Then I build:

make -j4 mpt
[ 11%] Building C object src/CMakeFiles/ggml.dir/ggml.c.o
[ 22%] Building CUDA object src/CMakeFiles/ggml.dir/ggml-cuda.cu.o
[ 33%] Building CXX object examples/CMakeFiles/common.dir/common.cpp.o
/home/work/data/codes/ggml/src/ggml.c: In function ‘ggml_compute_forward_win_part_f32’:
/home/work/data/codes/ggml/src/ggml.c:13064:19: warning: unused variable ‘ne3’ [-Wunused-variable]
     const int64_t ne3 = dst->ne[3];
                   ^~~
[ 44%] Linking CUDA static library libggml.a
[ 44%] Built target ggml
[ 55%] Building CXX object examples/CMakeFiles/common-ggml.dir/common-ggml.cpp.o
[ 66%] Linking CXX static library libcommon.a
[ 66%] Built target common
[ 77%] Linking CXX static library libcommon-ggml.a
[ 77%] Built target common-ggml
[ 88%] Building CXX object examples/mpt/CMakeFiles/mpt.dir/main.cpp.o
[100%] Linking CXX executable ../../bin/mpt
[100%] Built target mpt

When I run it, I get this output:

CUDA_VISIBLE_DEVICES=1 ./bin/mpt -m /home/work/mosaicml_mpt-7b-instruct/ggml-model-f16.bin -p "This is an example"
main: seed      = 1685967171
main: n_threads = 4
main: n_batch   = 8
main: n_ctx     = 512
main: n_predict = 200

mpt_model_load: loading model from '/home/work/mosaicml_mpt-7b-instruct/ggml-model-f16.bin' - please wait ...
mpt_model_load: d_model        = 4096
mpt_model_load: max_seq_len    = 2048
mpt_model_load: n_ctx          = 512
mpt_model_load: n_heads        = 32
mpt_model_load: n_layers       = 32
mpt_model_load: n_vocab        = 50432
mpt_model_load: alibi_bias_max = 8.000000
mpt_model_load: clip_qkv       = 0.000000
mpt_model_load: ftype          = 1
mpt_model_load: qntvr          = 0
mpt_model_load: ggml ctx size = 12939.11 MB
mpt_model_load: memory_size =   256.00 MB, n_mem = 16384
mpt_model_load: ........................ done
mpt_model_load: model size = 12683.02 MB / num tensors = 194
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.

main: temp           = 0.800
main: top_k          = 50432
main: top_p          = 1.000
main: repeat_last_n  = 64
main: repeat_penalty = 1.020

main: number of tokens in prompt = 4
main: token[0] =   1552
main: token[1] =    310
main: token[2] =    271
main: token[3] =   1650

This is an example of a three-year warrantyThis product is covered by a three-year warranty.<|endoftext|>


main: sampled tokens =       18
main:  mem per token =   339828 bytes
main:      load time = 11339.55 ms
main:    sample time =   204.11 ms / 11.34 ms per token
main:      eval time = 11117.53 ms / 529.41 ms per token
main:     total time = 29913.26 ms 

During runtime, I repeatedly checked and found that the GPU was not utilized at all. If I accidentally missed something, please let me know.

@Louis-y-nlp
Author

When I repeated the above steps, I found that the ./bin/mpt process occupied 429 MiB of GPU memory. However, during runtime the GPU utilization remained at 0%, and the speed was the same as on the CPU, about 500 ms per token.
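
(One way to watch this while ./bin/mpt generates, using standard nvidia-smi tooling in a second terminal; nothing here is specific to ggml:)

watch -n 1 nvidia-smi     # utilization and per-process memory, refreshed every second
nvidia-smi dmon -s u      # or: stream per-GPU utilization counters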

@canyonrobins

I'm seeing the same behavior

@ggerganov
Owner

The tensors need to be offloaded to the GPU.
You can look at llama.cpp for a demo of how to do it.
In the future, we will try to make this more seamless.
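
(For context, a minimal sketch of what that offload looked like in llama.cpp around this time, assuming a GGML_USE_CUBLAS build and the mid-2023 ggml-cuda API; the exact upload function and its signature, plus the layer member names below, are assumptions and may differ in your ggml revision:)

#ifdef GGML_USE_CUBLAS
#include "ggml-cuda.h"
#endif

// Mark one weight tensor as GPU-resident and copy its data to VRAM.
// Modeled on llama.cpp's -ngl handling; the signature of
// ggml_cuda_transform_tensor is an assumption -- check ggml-cuda.h in your tree.
static void offload_tensor(struct ggml_tensor * t) {
#ifdef GGML_USE_CUBLAS
    t->backend = GGML_BACKEND_GPU;            // without this, ggml keeps computing on the CPU
    ggml_cuda_transform_tensor(t->data, t);   // upload the weight data to the GPU
#else
    (void) t;                                 // CPU-only build: nothing to do
#endif
}

// Then, after mpt_model_load() has read the weights, offload e.g. the last
// n_gpu_layers layers (member names here are illustrative, not the real
// fields of the mpt example):
//
//   for (int il = n_layer - n_gpu_layers; il < n_layer; ++il) {
//       offload_tensor(model.layers[il].attn_wqkv_w);
//       offload_tensor(model.layers[il].ffn_up_w);
//       offload_tensor(model.layers[il].ffn_down_w);
//   }

The key point is that building with -DGGML_CUBLAS=ON only enables the backend; nothing moves to the GPU until the example's loader marks tensors like this.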

@rmc135

rmc135 commented Aug 13, 2023

I also had some frustration with GPU support and couldn't figure out why it didn't seem to do anything with the GPU, aside from consuming a small amount of VRAM each run.

Looking closer at the source, it turns out that most model handlers do not actually have any CUDA support: the CUDA code is built but never wired in, and passing -ngl on the command line is accepted but completely ignored.

Is there a roadmap for adding proper support? At the moment the only handler that seems to provide CUDA support is starcoder.
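
(For anyone landing here later: since the starcoder example is the one that wires -ngl through to the offload path, a run along these lines, with the model path and layer count as placeholders, is one way to confirm that GPU offload works at all on your build:)

CUDA_VISIBLE_DEVICES=0 ./bin/starcoder -m models/starcoder-ggml-q4_0.bin -p "def fibonacci(n):" -ngl 20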
