Prerequisites
Please answer the following questions for yourself before submitting an issue.
- I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new bug or useful enhancement to share.
Expected Behavior
When connecting to the llama-cpp-python[server] server from the PyCharm CodeGPT plugin, the server should emulate the OpenAI API so the plugin's chat works much like ChatGPT.
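For reference, the plugin just talks to the server the way any OpenAI client would. A minimal sketch of the equivalent call (assuming the legacy openai 0.x Python client, the default server address from the log below, and a placeholder model name):

```python
import openai

# The local server does not check the key, but the client requires one to be set.
openai.api_key = "sk-dummy"
# Point the client at llama-cpp-python[server] instead of api.openai.com.
openai.api_base = "http://127.0.0.1:8000/v1"

response = openai.ChatCompletion.create(
    model="local-model",  # placeholder; the server answers with its loaded model
    messages=[{"role": "user", "content": "Say hello"}],
)
print(response.choices[0].message.content)
```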
Current Behavior
Version llama-cpp-python==0.2.6 worked as expected with the CodeGPT plugin for PyCharm: the server emulated the OpenAI API and chat worked much like ChatGPT.
However, after upgrading to version 0.2.7, the plugin stopped working and the responses are garbage.
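The difference can be reproduced outside the plugin with a raw streaming request. A sketch (assumptions: the server is already running on http://127.0.0.1:8000 as in the log below, the model name is a placeholder, and stream=True matches how CodeGPT consumes responses):

```python
import requests

# Stream a chat completion from the local OpenAI-compatible endpoint.
resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "local-model",  # placeholder name
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": True,
    },
    stream=True,
)
resp.raise_for_status()

# With 0.2.6 the SSE deltas contain normal text; with 0.2.7 they come out garbled.
for line in resp.iter_lines():
    if line:
        print(line.decode("utf-8", errors="replace"))
```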
Environment and Context
- Physical (or virtual) hardware you are using, e.g. for Linux:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 9 7950X 16-Core Processor
CPU family: 25
Model: 97
Threads per core: 2
Cores per socket: 16
Socket(s): 1
Stepping: 2
Frequency boost: enabled
CPU(s) scaling MHz: 54%
CPU max MHz: 5655,7612
CPU min MHz: 3000,0000
BogoMIPS: 8983,26
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization features:
Virtualization: AMD-V
Caches (sum of all):
L1d: 512 KiB (16 instances)
L1i: 512 KiB (16 instances)
L2: 16 MiB (16 instances)
L3: 64 MiB (2 instances)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-31
Vulnerabilities:
Gather data sampling: Not affected
Itlb multihit: Not affected
L1tf: Not affected
Mds: Not affected
Meltdown: Not affected
Mmio stale data: Not affected
Retbleed: Not affected
Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Srbds: Not affected
Tsx async abort: Not affected
- Operating System, e.g. for Linux:
$ uname -a
Linux lynx-B650M-PG-Riptide 6.2.0-33-generic #33-Ubuntu SMP PREEMPT_DYNAMIC Tue Sep 5 14:49:19 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- SDK version, e.g. for Linux:
$ python3 --version
Python 3.11.4
$ make --version
GNU Make 4.3
$ g++ --version
g++ (Ubuntu 12.3.0-1ubuntu1~23.04) 12.3.0
Failure Information (for bugs)
With 0.2.7, the chat responses come back as garbage, for example:
AssistAIAIJKLMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
and the server output shows:
coroutine raised StopIteration
llama_new_context_with_model: kv self size = 768.00 MB
llama_new_context_with_model: compute buffer total size = 561.47 MB
llama_new_context_with_model: not allocating a VRAM scratch buffer due to low VRAM option
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
INFO: Started server process [837281]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
llama_print_timings: load time = 3043.46 ms
llama_print_timings: sample time = 0.40 ms / 1 runs ( 0.40 ms per token, 2525.25 tokens per second)
llama_print_timings: prompt eval time = 3043.42 ms / 298 tokens ( 10.21 ms per token, 97.92 tokens per second)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
llama_print_timings: total time = 3352.60 ms
INFO: 127.0.0.1:38780 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
Llama.generate: prefix-match hit
INFO: 127.0.0.1:41998 - "POST /v1/chat/completions HTTP/1.1" 200 OK
llama_print_timings: load time = 3043.46 ms
llama_print_timings: sample time = 1.58 ms / 4 runs ( 0.40 ms per token, 2526.85 tokens per second)
llama_print_timings: prompt eval time = 1330.15 ms / 68 tokens ( 19.56 ms per token, 51.12 tokens per second)
llama_print_timings: eval time = 427.48 ms / 3 runs ( 142.49 ms per token, 7.02 tokens per second)
llama_print_timings: total time = 1833.49 ms
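The coroutine raised StopIteration line looks like plain CPython behavior rather than an application message: since PEP 479, a StopIteration that escapes a generator or coroutine is converted into a RuntimeError with exactly this text, which fits an exhausted iterator leaking out of the async streaming path. A minimal demonstration of the interpreter behavior (not llama-cpp-python code):

```python
import asyncio

async def consume():
    # Simulate an exhausted iterator whose StopIteration escapes a coroutine;
    # CPython replaces it with RuntimeError("coroutine raised StopIteration").
    raise StopIteration

try:
    asyncio.run(consume())
except RuntimeError as exc:
    print(exc)  # -> coroutine raised StopIteration
```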