
Metal uses Intel HD instead of AMD GPU on Intel iMac5k #2407

Closed
furu00 opened this issue Jul 26, 2023 · 5 comments
furu00 commented Jul 26, 2023

Expected Behavior

Utilizing Metal on a 2019 iMac 5K with a Radeon Pro Vega 48.

Current Behavior

Everything runs fine, but Metal uses the internal Intel HD graphics card instead of the Vega GPU.

Solution

You are using `ctx->device = MTLCreateSystemDefaultDevice();` in ggml-metal.m.
You need to use `MTLCopyAllDevices` to get an array of devices and select the discrete GPU from it.

https://developer.apple.com/documentation/metal/gpu_devices_and_work_submission/multi-gpu_systems/finding_multiple_gpus_on_an_intel-based_mac
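As a minimal sketch of the suggestion above (not the project's actual code), device selection could iterate `MTLCopyAllDevices()` and prefer a discrete GPU, falling back to the system default. On Intel Macs the integrated GPU reports `isLowPower == YES`, while discrete GPUs report `NO`:

```objectivec
#import <Foundation/Foundation.h>
#import <Metal/Metal.h>

// Sketch only: prefer a discrete GPU on Intel-based Macs.
// The integrated Intel GPU is the "low power" device; discrete
// (and external) GPUs are not. Falls back to the system default.
static id<MTLDevice> ggml_metal_pick_device(void) {
    id<MTLDevice> chosen = nil;
    NSArray<id<MTLDevice>> *devices = MTLCopyAllDevices();
    for (id<MTLDevice> dev in devices) {
        if (!dev.isLowPower) { // discrete GPU found
            chosen = dev;
            break;
        }
    }
    if (chosen == nil) {
        // Apple silicon (or no discrete GPU): use the default device.
        chosen = MTLCreateSystemDefaultDevice();
    }
    return chosen;
}
```

The `ggml_metal_pick_device` helper name is hypothetical; in `ggml_metal_init` it would replace the direct `MTLCreateSystemDefaultDevice()` assignment.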

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

Mac os: 13.4.1
Chipset Model: Radeon Pro Vega 48
Type: GPU
Bus: PCIe
PCIe Lane Width: x16
VRAM (Total): 8 GB
Vendor: AMD (0x1002)
Device ID: 0x6869
Revision ID: 0x0000
EFI Driver Version: 01.01.072
Metal Support: Metal 3

Model Name: iMac
Model Identifier: iMac19,1
Processor Name: 6-Core Intel Core i5
Processor Speed: 3,7 GHz
Number of Processors: 1
Total Number of Cores: 6
L2 Cache (per Core): 256 KB
L3 Cache: 9 MB
Memory: 24 GB

@furu00 furu00 changed the title [User] Metal uses Intel HD instead of AMD GPU on Intel iMac5k Metal uses Intel HD instead of AMD GPU on Intel iMac5k Jul 26, 2023
MichaelDays commented
I've searched the issue logs and can't find any reference to anyone else using Metal on an Intel Mac. Hello! :)

I have access to an Intel Mac Pro (2019) with an AMD Radeon Pro Vega II 32 GB GPU. Running a Llama 2 model with -ngl doesn't fail, but it produces nonsense tokens. Without -ngl, I get the expected happy stories about llamas in my garden...

@furu00 Does the Intel HD graphics Metal device that -ngl defaults to on your Intel iMac actually generate useful output vs. CPU mode?

guinmoon (Contributor) commented Aug 7, 2023


Does your Mac produce correct output?
On my Hackintosh with a Vega 56 I get this:

guinmoon@iMacPro llama.cpp-master-3d9a551 % ./main -ngl 1 -m ../../orca-mini-v2_7b.ggmlv3.q4_1.bin 
main: build = 0 (unknown)
main: seed  = 1691401131
llama.cpp: loading model from ../../orca-mini-v2_7b.ggmlv3.q4_1.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_head_kv  = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: n_gqa      = 1
llama_model_load_internal: rnorm_eps  = 5.0e-06
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: freq_base  = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype      = 3 (mostly Q4_1)
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.08 MB
llama_model_load_internal: mem required  = 4319.35 MB (+  256.00 MB per state)
llama_new_context_with_model: kv self size  =  256.00 MB
ggml_metal_init: allocating
ggml_metal_init: using MPS
ggml_metal_init: loading '/Users/guinmoon/dev/alpaca_llama_etc/Sources/llama.cpp-master-3d9a551/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                         0x7ff1811065a0
ggml_metal_init: loaded kernel_add_row                     0x7ff17ff08c10
ggml_metal_init: loaded kernel_mul                         0x7ff17ff096b0
ggml_metal_init: loaded kernel_mul_row                     0x7ff17ff0a270
ggml_metal_init: loaded kernel_scale                       0x7ff181504ea0
ggml_metal_init: loaded kernel_silu                        0x7ff181404590
ggml_metal_init: loaded kernel_relu                        0x7ff181405150
ggml_metal_init: loaded kernel_gelu                        0x7ff181405d10
ggml_metal_init: loaded kernel_soft_max                    0x7ff1814068d0
ggml_metal_init: loaded kernel_diag_mask_inf               0x7ff17ff0ae30
ggml_metal_init: loaded kernel_get_rows_f16                0x7ff17ff0b9f0
ggml_metal_init: loaded kernel_get_rows_q4_0               0x7ff17ff0c720
ggml_metal_init: loaded kernel_get_rows_q4_1               0x7ff17ff0d2e0
ggml_metal_init: loaded kernel_get_rows_q2_K               0x7ff17ff0e020
ggml_metal_init: loaded kernel_get_rows_q3_K               0x7ff17ff0ed60
ggml_metal_init: loaded kernel_get_rows_q4_K               0x7ff17ff0faa0
ggml_metal_init: loaded kernel_get_rows_q5_K               0x7ff17ff107e0
ggml_metal_init: loaded kernel_get_rows_q6_K               0x7ff17ff11520
ggml_metal_init: loaded kernel_rms_norm                    0x7ff17ff120e0
ggml_metal_init: loaded kernel_norm                        0x7ff1815058a0
ggml_metal_init: loaded kernel_mul_mat_f16_f32             0x7ff181506460
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32            0x7ff181507020
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32            0x7ff17ff12e10
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32            0x7ff17ff13bb0
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32            0x7ff181107160
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32            0x7ff181107d20
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32            0x7ff1811088e0
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32            0x7ff181109620
ggml_metal_init: loaded kernel_rope                        0x7ff18110a1e0
ggml_metal_init: loaded kernel_alibi_f32                   0x7ff17ff14770
ggml_metal_init: loaded kernel_cpy_f32_f16                 0x7ff17ff15330
ggml_metal_init: loaded kernel_cpy_f32_f32                 0x7ff17ff15ef0
ggml_metal_init: loaded kernel_cpy_f16_f16                 0x7ff17ff16a90
ggml_metal_init: recommendedMaxWorkingSetSize =  8176.00 MB
ggml_metal_init: hasUnifiedMemory             = false
ggml_metal_init: maxTransferRate              = 15024.04 MB/s
llama_new_context_with_model: max tensor size =    78.12 MB
ggml_metal_add_buffer: allocated 'data            ' buffer, size =  3584.00 MB, offs =            0
ggml_metal_add_buffer: allocated 'data            ' buffer, size =   511.83 MB, offs =   3676172288, ( 4096.63 /  8176.00)
ggml_metal_add_buffer: allocated 'eval            ' buffer, size =    10.16 MB, ( 4106.79 /  8176.00)
ggml_metal_add_buffer: allocated 'kv              ' buffer, size =   258.00 MB, ( 4364.79 /  8176.00)
ggml_metal_add_buffer: allocated 'scr0            ' buffer, size =   132.00 MB, ( 4496.79 /  8176.00)
ggml_metal_add_buffer: allocated 'scr1            ' buffer, size =   160.00 MB, ( 4656.79 /  8176.00)

system_info: n_threads = 8 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = -1, n_keep = 0


 6 retr ens retr retrinitNodeΣzioneTag retr retr ens retr ens URINaN ens retr ens retr retrictionary ensictionaryimaNode retr retr ensouc ens retr retr ens dinности retr retrictionary dress Tomnost superfichart retrbaum ens nem retr retrhre

There were no such problems with the first versions of ggml-metal. On iOS everything works fine. I think this problem appears only on Intel.

knweiss (Contributor) commented Aug 30, 2023

FWIW: I've added this debug code:

diff --git a/ggml-metal.m b/ggml-metal.m
index e929c4b..de4de24 100644
--- a/ggml-metal.m
+++ b/ggml-metal.m
@@ -119,7 +119,14 @@ struct ggml_metal_context * ggml_metal_init(int n_cb) {
     struct ggml_metal_context * ctx = malloc(sizeof(struct ggml_metal_context));
 
     ctx->n_cb   = MIN(n_cb, GGML_METAL_MAX_BUFFERS);
+    NSArray *devices = MTLCopyAllDevices();
+    id <MTLDevice> device;
+    for (device in devices) {
+        NSString *s = [device name];
+        printf("found device: %s\n", [s cStringUsingEncoding:NSISOLatin1StringEncoding]);
+    }
     ctx->device = MTLCreateSystemDefaultDevice();
+    printf("using device: %s\n", [[ctx->device name] cStringUsingEncoding:NSISOLatin1StringEncoding]);
     ctx->queue  = [ctx->device newCommandQueue];
     ctx->n_buffers = 0;
     ctx->concur_list_len = 0;

With this I get the following output from the metal build on my Intel MacBook Pro 16-inch (2019) on macOS Ventura 13.5.1:

$ ./build-metal/bin/main -ngl 1 -m ./models/codellama-7b.Q3_K_M.gguf --color -c 100 --temp 0.7 \
  --repeat_penalty 1.1 -n 10 -p "### Instruction: Write a Python function to add two numbers.\n### Response:"
Log start
main: build = 1128 (b532a69)
[...]
llm_load_tensors: ggml ctx size =    0.09 MB
llm_load_tensors: mem required  = 3308.83 MB (+   50.00 MB per state)
.............................................................................................
llama_new_context_with_model: kv self size  =   50.00 MB
ggml_metal_init: allocating
found device: AMD Radeon Pro 5500M
found device: Intel(R) UHD Graphics 630
using device: AMD Radeon Pro 5500M
ggml_metal_init: loading '/Users/karsten/src/llama.cpp/build-metal/bin/ggml-metal.metal'
ggml_metal_init: loaded kernel_add                         0x7fe7e55073c0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_add_row                     0x7fe7e5508390 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul                         0x7fe7e5509160 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_row                     0x7fe7e5509f30 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_scale                       0x7fe7e550ad00 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_silu                        0x7fe7e550bad0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_relu                        0x7fe7e550c8a0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_gelu                        0x7fe7e550d670 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_soft_max                    0x7fe7e550e440 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_diag_mask_inf               0x7fe7e550f210 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_f16                0x7fe7e550ffe0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q4_0               0x7fe7e5510f20 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q4_1               0x7fe7e5511cf0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q8_0               0x7fe7e5512c40 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q2_K               0x7fe7e5513a10 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q3_K               0x7fe7e55147e0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q4_K               0x7fe7e55155b0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q5_K               0x7fe7e5516380 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_get_rows_q6_K               0x7fe7e5517150 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_rms_norm                    0x7fe7e55181b0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_norm                        0x7fe7e5519100 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mat_f16_f32             0x7fe7e5519ed0 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mat_q4_0_f32            0x7fe7e551aca0 | th_max =  896 | th_width =   32
ggml_metal_init: loaded kernel_mul_mat_q4_1_f32            0x7fe7e551ba70 | th_max =  896 | th_width =   32
ggml_metal_init: loaded kernel_mul_mat_q8_0_f32            0x7fe7e551c840 | th_max =  576 | th_width =   32
ggml_metal_init: loaded kernel_mul_mat_q2_K_f32            0x7fe7e551d790 | th_max =  576 | th_width =   32
ggml_metal_init: loaded kernel_mul_mat_q3_K_f32            0x7fe7e551e560 | th_max =  704 | th_width =   32
ggml_metal_init: loaded kernel_mul_mat_q4_K_f32            0x7fe7e551f330 | th_max =  512 | th_width =   32
ggml_metal_init: loaded kernel_mul_mat_q5_K_f32            0x7fe7e5520280 | th_max =  512 | th_width =   32
ggml_metal_init: loaded kernel_mul_mat_q6_K_f32            0x7fe7e5521050 | th_max = 1024 | th_width =   32
ggml_metal_init: loaded kernel_mul_mm_f16_f32                         0x0 | th_max =    0 | th_width =    0
ggml_metal_init: load pipeline error: Error Domain=CompilerError Code=2 "SC compilation failure
There is a call to an undefined label" UserInfo={NSLocalizedDescription=SC compilation failure
There is a call to an undefined label}
llama_new_context_with_model: ggml_metal_init() failed
llama_init_from_gpt_params: error: failed to create context with model './models/codellama-7b.Q3_K_M.gguf'
main: error: unable to load model

I.e., MTLCreateSystemDefaultDevice() picks the right (AMD) device, but ggml_metal_init() fails.

Does the AMD Radeon Pro 5500M (8 GB VRAM; supports Metal 3) run out of memory?
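For debugging cases like this, a hypothetical extension of the device-listing loop above would let the user override which device is used via an environment variable. This is a sketch only; neither the `GGML_METAL_DEVICE` variable nor the helper exists in llama.cpp:

```objectivec
#import <Foundation/Foundation.h>
#import <Metal/Metal.h>
#include <stdlib.h>

// Sketch: pick the Metal device whose name contains the substring
// given in the (hypothetical) GGML_METAL_DEVICE environment variable,
// e.g. GGML_METAL_DEVICE=Radeon. Falls back to the system default.
static id<MTLDevice> pick_device_by_env(void) {
    const char *want = getenv("GGML_METAL_DEVICE");
    if (want != NULL) {
        NSString *needle = [NSString stringWithUTF8String:want];
        for (id<MTLDevice> dev in MTLCopyAllDevices()) {
            if ([[dev name] rangeOfString:needle
                                  options:NSCaseInsensitiveSearch].location != NSNotFound) {
                return dev;
            }
        }
    }
    return MTLCreateSystemDefaultDevice();
}
```

An override like this would make it easy to compare the integrated and discrete GPUs without rebuilding.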

knweiss (Contributor) commented Aug 30, 2023

Update: The following mul_mm_ kernel functions do not compile successfully on my machine. If I comment them out, the pipeline at least compiles:

+        // GGML_METAL_ADD_KERNEL(mul_mm_f16_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q4_0_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q8_0_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q4_1_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q2_K_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q3_K_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q4_K_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q5_K_f32);
+        // GGML_METAL_ADD_KERNEL(mul_mm_q6_K_f32);

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 9, 2024