llama-cpp-python[server] bug after update to 0.2.7 #757

@LynxPDA

Description

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

When the CodeGPT plugin for PyCharm connects to the llama-cpp-python[server] endpoint, the local server can stand in for the OpenAI API, so chat works much like ChatGPT.
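
For reference, "standing in" for the OpenAI API just means pointing an OpenAI-compatible client at the local endpoint. A minimal sketch with the official openai Python client (v1-style API; the key value and model name below are arbitrary placeholders for illustration, since the local server does not validate them):

from openai import OpenAI

# Point the client at the local llama-cpp-python server instead of api.openai.com.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="sk-anything")

reply = client.chat.completions.create(
    model="gpt-3.5-turbo",  # alias only; the server answers with whatever model it loaded
    messages=[{"role": "user", "content": "Hello"}],
)
print(reply.choices[0].message.content)

CodeGPT does effectively the same thing when its API base URL is set to the local server.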

Current Behavior

llama-cpp-python==0.2.6 worked as expected with the CodeGPT plugin for PyCharm: the server stood in for the OpenAI API and chat behaved like ChatGPT.

However, after upgrading to version 0.2.7, the plugin stopped working and the server returns garbage output.
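
For reference, both versions were presumably installed and launched the standard way; a sketch with a placeholder model path (not taken from the report):

$ pip install 'llama-cpp-python[server]==0.2.7'
$ python3 -m llama_cpp.server --model <path-to-model.gguf>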

Environment and Context

  • Physical (or virtual) hardware you are using, e.g. for Linux:

$ lscpu

Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte order:            Little Endian
CPU(s):                  32
  On-line CPU(s) list:   0-31
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen 9 7950X 16-Core Processor
    CPU family:          25
    Model:               97
    Threads per core:    2
    Cores per socket:    16
    Socket(s):           1
    Stepping:            2
    Frequency boost:     enabled
    CPU(s) scaling MHz:  54%
    CPU max MHz:         5655,7612
    CPU min MHz:         3000,0000
    BogoMIPS:            8983,26
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization features: 
  Virtualization:         AMD-V
Caches (sum of all):     
  L1d:                   512 KiB (16 instances)
  L1i:                   512 KiB (16 instances)
  L2:                    16 MiB (16 instances)
  L3:                    64 MiB (2 instances)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-31
Vulnerabilities:         
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected
  • Operating System, e.g. for Linux:

$ uname -a
Linux lynx-B650M-PG-Riptide 6.2.0-33-generic #33-Ubuntu SMP PREEMPT_DYNAMIC Tue Sep 5 14:49:19 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

  • SDK version, e.g. for Linux:
$ python3 --version
Python 3.11.4
$ make --version
GNU Make 4.3
$ g++ --version
g++ (Ubuntu 12.3.0-1ubuntu1~23.04) 12.3.0

Failure Information (for bugs)

The plugin now receives garbage answers such as:
AssistAIAIJKLMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
and the server log reports:
coroutine raised StopIteration
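
For context, "coroutine raised StopIteration" is the message CPython attaches to the RuntimeError produced when a StopIteration exception escapes a coroutine (PEP 479 semantics). The snippet below is a generic reproduction of that error class, not the actual llama-cpp-python code path:

import asyncio

async def stream():
    # next() on an exhausted iterator raises StopIteration; when it escapes
    # a coroutine, Python converts it into
    # "RuntimeError: coroutine raised StopIteration"
    return next(iter([]))

asyncio.run(stream())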

llama_new_context_with_model: kv self size  =  768.00 MB
llama_new_context_with_model: compute buffer total size =  561.47 MB
llama_new_context_with_model: not allocating a VRAM scratch buffer due to low VRAM option
AVX = 1 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | 
INFO:     Started server process [837281]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)

llama_print_timings:        load time =  3043.46 ms
llama_print_timings:      sample time =     0.40 ms /     1 runs   (    0.40 ms per token,  2525.25 tokens per second)
llama_print_timings: prompt eval time =  3043.42 ms /   298 tokens (   10.21 ms per token,    97.92 tokens per second)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_print_timings:       total time =  3352.60 ms
INFO:     127.0.0.1:38780 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error
Llama.generate: prefix-match hit
INFO:     127.0.0.1:41998 - "POST /v1/chat/completions HTTP/1.1" 200 OK

llama_print_timings:        load time =  3043.46 ms
llama_print_timings:      sample time =     1.58 ms /     4 runs   (    0.40 ms per token,  2526.85 tokens per second)
llama_print_timings: prompt eval time =  1330.15 ms /    68 tokens (   19.56 ms per token,    51.12 tokens per second)
llama_print_timings:        eval time =   427.48 ms /     3 runs   (  142.49 ms per token,     7.02 tokens per second)
llama_print_timings:       total time =  1833.49 ms
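
For reference, the failing request can be reproduced without the plugin by posting a streaming chat completion directly to the endpoint shown in the log above; a minimal sketch (the model name is an arbitrary alias):

import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/chat/completions",
    json={
        "model": "gpt-3.5-turbo",  # alias only; the loaded model is used
        "messages": [{"role": "user", "content": "Hello"}],
        "stream": True,
    },
    stream=True,
)
# On 0.2.6 this printed normal "data: {...}" SSE chunks; on 0.2.7 the report
# shows repeated-token garbage and intermittent 500 errors instead.
for line in resp.iter_lines():
    if line:
        print(line.decode())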
