High memory usage with MSSQL ntext #76

Closed
TWAC opened this issue Apr 4, 2017 · 4 comments
TWAC commented Apr 4, 2017

I sometimes see very high memory consumption when selecting from SQL Server ntext columns.
I think it might be triggered when the tables are empty.

The script below was run on Ubuntu 16.04 with turbodbc 1.1.0 and mssql-server 14.0.405.200-1.
The massif output shows a heap allocation of 2 GB; turning on use_async_io doubles the usage to 4 GB.

The memory usage starts at snapshot 63, so I am not sure if the stack trace in snapshot 64 is relevant.
(massif-visualizer is useful for visualizing the output.)

# createmytable.py
from turbodbc import connect

conn = connect(DRIVER="ODBC Driver 13 for SQL Server", SERVER="localhost",
               DATABASE="test_db", UID="SA", PWD="password")
curs = conn.cursor()

# Drop the table if it already exists.
curs.execute("""IF OBJECT_ID('mytable', 'U') IS NOT NULL
                DROP TABLE [mytable]""")

# Create an empty table with a single ntext column.
curs.execute("""CREATE TABLE mytable(
                foo ntext
                )""")

# Selecting from the empty table triggers the large allocation.
curs.execute("SELECT * FROM mytable")

The script was run under massif with:

valgrind --tool=massif python createmytable.py
#-----------
snapshot=62
#-----------
time=63680421
mem_heap_B=1951776
mem_heap_extra_B=74400
mem_stacks_B=0
heap_tree=empty
#-----------
snapshot=63
#-----------
time=298830808
mem_heap_B=2149397260
mem_heap_extra_B=78044
mem_stacks_B=0
heap_tree=empty
#-----------
snapshot=64
#-----------
time=298844168
mem_heap_B=2149401564
mem_heap_extra_B=78108
mem_stacks_B=0
heap_tree=peak
n2: 2149401564 (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
 n1: 2147483648 0x69C6911: cpp_odbc::multi_value_buffer::multi_value_buffer(unsigned long, unsigned long) (new_allocator.h:104)
  n1: 2147483648 0x69CFE23: turbodbc::column::column(cpp_odbc::statement const&, unsigned long, unsigned long, std::unique_ptr<turbodbc::description const, std::default_delete<turbodbc::description const> >) (column.cpp:9)
   n1: 2147483648 0x69D5EB2: void std::vector<turbodbc::column, std::allocator<turbodbc::column> >::_M_emplace_back_aux<cpp_odbc::statement const&, unsigned long&, unsigned long&, std::unique_ptr<turbodbc::description const, std::default_delete<turbodbc::description const> > >(cpp_odbc::statement const&, unsigned long&, unsigned long&, std::unique_ptr<turbodbc::description const, std::default_delete<turbodbc::description const> >&&) (new_allocator.h:120)
    n1: 2147483648 0x69D5991: turbodbc::result_sets::bound_result_set::bound_result_set(std::shared_ptr<cpp_odbc::statement const>, boost::variant<turbodbc::rows, turbodbc::megabytes>, bool) (vector.tcc:101)
     n1: 2147483648 0x69CE8CD: turbodbc::command::execute() (new_allocator.h:120)
      n1: 2147483648 0x69CD072: turbodbc::cursor::execute() (cursor.cpp:58)
       n1: 2147483648 0x67852F3: void pybind11::cpp_function::initialize<pybind11::cpp_function::initialize<void, turbodbc::cursor, , pybind11::name, pybind11::is_method, pybind11::sibling>(void (turbodbc::cursor::*)(), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(turbodbc::cursor*)#1}, void, turbodbc::cursor*, pybind11::name, pybind11::is_method, pybind11::sibling>(pybind11::cpp_function::initialize<void, turbodbc::cursor, , pybind11::name, pybind11::is_method, pybind11::sibling>(void (turbodbc::cursor::*)(), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(turbodbc::cursor*)#1}&&, void (*)(turbodbc::cursor*), pybind11::name const&, pybind11::is_method const&, pybind11::sibling const&)::{lambda(pybind11::detail::function_record*, pybind11::handle, pybind11::detail::function_record*, pybind11::detail::function_record*)#3}::_FUN(pybind11::detail::function_record, pybind11::detail::function_record*, pybind11::detail::function_record*, pybind11::detail::function_record*) (pybind11.h:62)
        n1: 2147483648 0x677F025: pybind11::cpp_function::dispatcher(_object*, _object*, _object*) (pybind11.h:441)
         n1: 2147483648 0x4CADA0: PyEval_EvalFrameEx (in /usr/bin/python2.7)
          n1: 2147483648 0x4C2763: PyEval_EvalCodeEx (in /usr/bin/python2.7)
           n1: 2147483648 0x4DE8B6: ??? (in /usr/bin/python2.7)
            n1: 2147483648 0x4B0CB1: PyObject_Call (in /usr/bin/python2.7)
             n1: 2147483648 0x4C6ACF: PyEval_EvalFrameEx (in /usr/bin/python2.7)
              n1: 2147483648 0x4C2763: PyEval_EvalCodeEx (in /usr/bin/python2.7)
               n1: 2147483648 0x4CA8CF: PyEval_EvalFrameEx (in /usr/bin/python2.7)
                n1: 2147483648 0x4C2763: PyEval_EvalCodeEx (in /usr/bin/python2.7)
                 n1: 2147483648 0x4C2507: PyEval_EvalCode (in /usr/bin/python2.7)
                  n1: 2147483648 0x4F1DED: ??? (in /usr/bin/python2.7)
                   n1: 2147483648 0x4EC650: PyRun_FileExFlags (in /usr/bin/python2.7)
                    n1: 2147483648 0x4EAE2F: PyRun_SimpleFileExFlags (in /usr/bin/python2.7)
                     n1: 2147483648 0x49E148: Py_Main (in /usr/bin/python2.7)
                      n0: 2147483648 0x506E82E: (below main) (libc-start.c:291)
 n0: 1917916 in 361 places, all below massif's threshold (1.00%)
#-----------
snapshot=65
#-----------
time=300787456
mem_heap_B=2149388856
mem_heap_extra_B=78032
mem_stacks_B=0
heap_tree=empty
#-----------
TWAC changed the title from "High memory usage with SQL Server ntext" to "High memory usage with MSSQL ntext" on Apr 4, 2017
MathMagique (Member) commented:

Without closer inspection, my best guess would be that the ODBC driver returns a very large field size for NTEXT columns.

If the returned size is something like 1 billion characters, this means that 1,000,000,000 * 2 bytes (for unicode) are required to buffer a single value. Because the default read buffer size is set to 20 MB and this single field exceeds it, turbodbc will create a buffer that holds just one row. For asynchronous I/O, a second buffer is needed. Usually, each buffer is halved in order to keep the total read buffer size constant, but a buffer of just one row cannot be halved, thus doubling the required memory to 4 GB.
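
As a back-of-the-envelope illustration of that arithmetic (the numbers below are assumptions taken from this explanation and from the massif trace, not from turbodbc's code):

# Sketch of the buffer-size arithmetic described above; not turbodbc's code.
reported_chars = 2**30              # driver reports ~1 billion characters for ntext
bytes_per_char = 2                  # 2 bytes per character for unicode data
default_read_buffer = 20 * 2**20    # 20 MB default read buffer

bytes_per_value = reported_chars * bytes_per_char                 # 2147483648 B,
                                                                  # matching the massif peak
rows_per_batch = max(1, default_read_buffer // bytes_per_value)   # -> 1 row
buffer_size = rows_per_batch * bytes_per_value                    # ~2 GiB

# With use_async_io a second buffer is allocated; a one-row buffer cannot be
# halved, so the total grows to ~4 GiB.
total_with_async_io = 2 * buffer_size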

This would be consistent with the information given here: https://docs.microsoft.com/en-us/sql/t-sql/data-types/ntext-text-and-image-transact-sql.

After confirmation, I would have to see whether there is a more clever way to transfer text data than doing the same thing as for VARCHAR.

MathMagique (Member) commented:

Ok, so I did some reading on retrieving large data from the database. Basically, there seem to be the following options to alleviate your problem:

  • Option 1: Allow turbodbc users to configure a maximum number of characters to retrieve, even if the database indicates that this might not be sufficient for all data.
  • Option 2: Use the ODBC function SQLGetData() to retrieve long data. This seems to have some issues attached, such as not being able to fetch batches larger than a single row.
  • Option 3: Use a reasonably sized (configurable) buffer for results. For each fetched row in a batch, check whether truncation warnings are present, and refetch rows with warnings using larger buffers. This sets higher requirements on the database driver, since it needs to support cursor scrolling. It may also not play well with asynchronous I/O.

Though options 2 and 3 are nice to have, I feel that option 1 is the only one that would really be in line with turbodbc's target audience of data scientists. They typically have to deal with numbers, dates, and strings of limited length. For them, bulk transfer as currently implemented is ideal.

Option 1 would allow extending this approach to larger fields while keeping memory consumption acceptable. The price would be possible truncation of data. I think that is okay for turbodbc's target audience.
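
A minimal sketch of the option 1 idea; the name max_field_chars is hypothetical and not an actual turbodbc option:

# Hypothetical cap on the per-field character count used to size buffers.
def effective_field_size(reported_chars, max_field_chars=65536):
    # Values longer than max_field_chars would be truncated on fetch.
    return min(reported_chars, max_field_chars)

# ntext reports ~2**30 characters; with the cap, each value needs only
# 65536 * 2 bytes (128 KiB) of buffer instead of ~2 GiB.
capped = effective_field_size(2**30)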

Options 2 and 3 could be implemented at a (very) later stage.

For the time being, you could use the following workarounds:

  • Do not select long columns.
  • When you do select long columns, cast them to types of limited length such as VARCHAR(1000); see the sketch below.
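
A minimal sketch of the second workaround, reusing the cursor from the report above; values longer than 1000 characters would be truncated by the cast:

# Cast the ntext column to a bounded type so only a small buffer is needed.
curs.execute("SELECT CAST(foo AS NVARCHAR(1000)) AS foo FROM mytable")
rows = curs.fetchall()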


TWAC commented Apr 10, 2017

Thank you for the analysis.

Yes, option 1 would probably make the most sense. I know ceODBC uses this approach via setoutputsize, but it does not truncate; it throws an exception if the limit is exceeded. It also seems to have a small default setoutputsize.
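
For reference, the setoutputsize pattern mentioned above is part of the Python DB-API; the sketch below is an assumption about how it would be used with ceODBC (the connect string and default behaviour are not verified here):

import ceODBC  # hypothetical usage; exact connection details are assumptions

conn = ceODBC.connect("DRIVER={ODBC Driver 13 for SQL Server};SERVER=localhost;"
                      "DATABASE=test_db;UID=SA;PWD=password")
curs = conn.cursor()
curs.setoutputsize(10000)   # cap large-column buffers; reportedly raises rather
                            # than truncating if a value exceeds the limit
curs.execute("SELECT foo FROM mytable")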

I discovered this with a legacy database, which might not be turbodbc's target audience, but multi-GB allocations even when the table was empty were a bit surprising.

MathMagique (Member) commented:

At the same time, one should also fix the handling of VARCHAR(MAX), which is currently treated like a VARCHAR of length 0. The maximum length should be used for such fields as well. See issue #81 for more details.
