Describe the bug
I'm using pgvector as the data store because Chroma doesn't work either (same problem as in #986), so I'm ingesting PDFs via `memgpt load directory ...`, chunking them, getting embeddings from Azure OpenAI, and storing them in pgvector.
It works fine with small PDFs, but larger PDFs fail every time with this error, where the code breaks down in pg8000:
File "/home/arne/src/MemGPT/memgpt/cli/cli_load.py", line 222, in load_directory
store_docs(str(name), docs, user_id)
File "/home/arne/src/MemGPT/memgpt/cli/cli_load.py", line 139, in store_docs
insert_passages_into_source(passages, name, user_id, config)
File "/home/arne/src/MemGPT/memgpt/cli/cli_load.py", line 58, in insert_passages_into_source
storage.insert_many(passages)
File "/home/arne/src/MemGPT/memgpt/agent_store/db.py", line 478, in insert_many
conn.execute(upsert_stmt)
File "/home/arne/.cache/pypoetry/virtualenvs/pymemgpt-K6xMi7ln-py3.11/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1416, in execute
return meth(
^^^^^
File "/home/arne/.cache/pypoetry/virtualenvs/pymemgpt-K6xMi7ln-py3.11/lib/python3.11/site-packages/sqlalchemy/sql/elements.py", line 517, in _execute_on_connection
return connection._execute_clauseelement(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/arne/.cache/pypoetry/virtualenvs/pymemgpt-K6xMi7ln-py3.11/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1639, in _execute_clauseelement
ret = self._execute_context(
^^^^^^^^^^^^^^^^^^^^^^
File "/home/arne/.cache/pypoetry/virtualenvs/pymemgpt-K6xMi7ln-py3.11/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1848, in _execute_context
return self._exec_single_context(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/arne/.cache/pypoetry/virtualenvs/pymemgpt-K6xMi7ln-py3.11/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1988, in _exec_single_context
self._handle_dbapi_exception(
File "/home/arne/.cache/pypoetry/virtualenvs/pymemgpt-K6xMi7ln-py3.11/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 2347, in _handle_dbapi_exception
raise exc_info[1].with_traceback(exc_info[2])
File "/home/arne/.cache/pypoetry/virtualenvs/pymemgpt-K6xMi7ln-py3.11/lib/python3.11/site-packages/sqlalchemy/engine/base.py", line 1969, in _exec_single_context
self.dialect.do_execute(
File "/home/arne/.cache/pypoetry/virtualenvs/pymemgpt-K6xMi7ln-py3.11/lib/python3.11/site-packages/sqlalchemy/engine/default.py", line 922, in do_execute
cursor.execute(statement, parameters)
File "/home/arne/.cache/pypoetry/virtualenvs/pymemgpt-K6xMi7ln-py3.11/lib/python3.11/site-packages/pg8000/legacy.py", line 254, in execute
self._context = self._c.execute_unnamed(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/arne/.cache/pypoetry/virtualenvs/pymemgpt-K6xMi7ln-py3.11/lib/python3.11/site-packages/pg8000/core.py", line 701, in execute_unnamed
self.send_BIND(NULL_BYTE, params)
File "/home/arne/.cache/pypoetry/virtualenvs/pymemgpt-K6xMi7ln-py3.11/lib/python3.11/site-packages/pg8000/core.py", line 765, in send_BIND
NULL_BYTE + statement_name_bin + h_pack(0) + h_pack(len(params))
^^^^^^^^^^^^^^^^^^^
struct.error: 'h' format requires -32768 <= number <= 32767
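For context, the 32,767 cap isn't arbitrary: the PostgreSQL extended-query protocol's Bind message encodes the number of parameter values as a signed 16-bit integer, which pg8000 packs with Python's `struct` module. The overflow can be reproduced in isolation (180,000 is just the estimate from the analysis below):

```python
import struct

# pg8000's send_BIND packs the bind-parameter count with the signed
# 16-bit network-order format "!h"; anything above 32,767 overflows:
struct.pack("!h", 180_000)
# -> struct.error: 'h' format requires -32768 <= number <= 32767
```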
Digging deeper, the problem seems to be that the `insert_many` method in `./memgpt/agent_store/db.py` generates SQL like this:
INSERT INTO <table> (<columns>, ...) VALUES (%s, %s, %s, ...)
But there is a limit on the number of bind parameters (%s) per statement. When executed through pg8000 with 9 columns (id, user_id, text, doc_id, agent_id, data_source, embedding, embedding_dim, embedding_model, metadata_) and a chunk size of maybe 20,000, that comes to 180,000 parameters, which is more than 32,767, so pg8000 raises the error.
So when generating the statements, we need to bring down the number of %s per statement.
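One way to do that is to batch the rows so each INSERT carries at most 32,767 parameters, sizing the batch from the column count. A minimal sketch, not MemGPT's actual code (`conn`, `table`, and `rows` are illustrative; `rows` is a list of column-name-to-value dicts):

```python
from sqlalchemy import insert

PG8000_MAX_PARAMS = 32_767  # Bind encodes the parameter count as int16

def insert_in_chunks(conn, table, rows):
    """Bulk-insert rows while keeping every statement under pg8000's
    bind-parameter limit."""
    cols = len(rows[0])                       # parameters per row
    step = max(1, PG8000_MAX_PARAMS // cols)  # rows per statement
    for i in range(0, len(rows), step):
        # One multi-row VALUES clause per batch.
        conn.execute(insert(table).values(rows[i : i + step]))
```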
Please describe your setup
- How did you install memgpt? `git clone`, `poetry install`
- What's your OS? Linux/WSL2
- How are you running memgpt? Terminal/ZSH
@ArneJanning thanks for reporting this - could you please try the fix in #994 to see if it resolves your issue? You can also wait for the nightly package tomorrow which should include it.
If you get a chance, could you also please let me know how large the PDF file was, and if it was a folder of files or a single file? Then I can try to reproduce the error as well.
@sarahwooders Thank you very much for your quick fix! I made my own little fix and put it in #1004, which calculates and uses the optimal chunk size for pg8000 instead of hard-coding 1,000, which gives us more performance.
I was loading scientific PDFs with about 1,000 pages each, in a folder of files; it works without problems now.
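For a rough sense of the headroom (my own arithmetic, not the exact formula in #1004): with the 9 columns from the report, the largest batch pg8000 can bind is

```python
32_767 // 9  # -> 3640 rows per INSERT, vs. the hard-coded 1,000
```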