New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Patch/Stream message to the ColumnDataSource for DataTable widget on front-end triggers response containing entire data source #7116

Closed
OndrejZapletal opened this Issue Oct 27, 2017 · 11 comments

Comments

Projects
None yet
4 participants
@OndrejZapletal
Copy link

OndrejZapletal commented Oct 27, 2017

I've been struggling to optimize my application that is using DataTable widget on the front-end to show to the user a catalog of information updated in the real time. Originally I've used data source trigger to sent information to the front-end. This become quite quickly unfeasible solution, because number of information is quite large. For this reason I've tried to update data source via stream and patch operations. First problem that I've come across was that patch/stream operation takes progressively more time as the size of data source grows. I've raised issue #7072.

Even after my monkey patch solution of previous problem the situation didn't improve much. Using Chrome DevTools I've monitored communication over websocket and found that on each ColumnsStreamed message sent to front-end, it responds with ModelChanged message that contains entire data source. This is hugely ineffective. Because when the data source grows these messages get big very quickly especially when DataTable has multiple columns. Same situation is of course with patch operation as well.

Strange thing is that my application also contains plot widgets which are not exhibiting this behavior. Therefore my conclusion is that this is caused by DataTable widget and not by ColumnDataSource.

Expected behavior

By default Stream and Patch operation that are applied to ColumnDataSource of a DataTable widget will not trigger any response from front-end.

System Info

Since this is only very general issue I'm presuming that these are the only relevant system information:

bokeh: 0.12.10
python: 3.5.2 (default, Nov 17 2016, 17:05:23)
system: Linux-4.4.0-97-generic-x86_64-with-Ubuntu-16.04-xenial

Example

I've put together minimal example that shows this behavior. It creates page with one button and one DataTable. On button click back-end starts sending table rows with ColumnDataSource stream method. It was implemented by modification of export_csv example app.

import time
import random

from bokeh.io import curdoc
from bokeh.layouts import column
from bokeh.models import ColumnDataSource
from bokeh.models.widgets import DataTable, TableColumn, Button

class TestSource:
  def __init__(self):
    self.source = ColumnDataSource(data = self._empty_source_dict())

  def _empty_source_dict(self):
    return {'x': [], 'quantity': []}

  def add(self, record):
    record_list = {key: [value] for key, value in record.items()}
    self.source.stream(record_list)

def get_record(index):
  return {
    'x': index,
    'quantity': random.randint(0, 500)
  }

def callback():
  for i in range(30):
    record = get_record(i)
    test_source.add(record)
    time.sleep(1)

test_source = TestSource()
columns = [TableColumn(field="x", title="x"), TableColumn(field="quantity", title="quantity")]
table = DataTable(source = test_source.source, columns = columns, width = 400)
button = Button()

button.on_click(callback)

curdoc().add_root(column(button, table))
curdoc().title = "Test of patch and stream"

To illustrate the issue here is last message being sent to front-end:

{
  "references": [],
  "events": [
    {
      "column_source": {
        "id": "84c57a47-12e0-43bb-92d6-09a62dbf4707",
        "type": "ColumnDataSource"
      },
      "rollover": null,
      "kind": "ColumnsStreamed",
      "data": {
        "quantity": [41],
        "x": [29]
      }
    }
  ]
}

On which front-end response with this one:

{
  "events": [
    {
      "kind": "ModelChanged",
      "model": {
        "type": "ColumnDataSource",
        "id": "84c57a47-12e0-43bb-92d6-09a62dbf4707"
      },
      "attr": "data",
      "new": {
        "quantity": [419, 251, 99, 6, 44, 65, 97, 156, 43, 457, 387, 338, 34, 390, 287, 53, 181, 397, 439, 250, 154, 36, 386, 103, 300, 165, 23, 321, 446, 41],
        "x": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29]
      }
    }
  ],
  "references": []
}

Conclusion

My hypothesis is that DataTable is sending the source data back to bokeh because the widget provides the possibility of editable tables. Which would explain the need to send update back to back-end. If this is the case then I would propose an addition of parameter that would disable this behavior and possibly disable by default unless DataTable parameter editable is set to True.

I would appreciate some pointers on how to disable it the meantime. I starting to understand the bokeh server code base pretty well, but my knowledge is severally lacking in the front-end area.

@bryevdv

This comment has been minimized.

Copy link
Member

bryevdv commented Oct 27, 2017

cc @mattpap for comment

DataTable also has to do extra work to maintain a special internal index so that sorting and sub-selections work together. This might also be contributing.

I will say that using stream to stream hundreds of points is not at all something I had in mind when the feature was added. It was conceived as a way to efficiently send a handful of new points, i.e. you are streaming a time series and need to add one new point. stream lets you send just the new point, in stead of re-sending the old data.

As for what you can do in the immediately, I think the only option might be a custom extension for a table that is simpler and more efficient.

@OndrejZapletal

This comment has been minimized.

Copy link
Author

OndrejZapletal commented Oct 27, 2017

DataTable also has to do extra work to maintain a special internal index so that sorting and sub-selections work together. This might also be contributing.

I didn't think of that but you are correct that this is very likely.

I will say that using stream to stream hundreds of points is not at all something I had in mind when the feature was added. It was conceived as a way to efficiently send a handful of new points, i.e. you are streaming a time series and need to add one new point. stream lets you send just the new point, in stead of re-sending the old data.

I would just argue that streaming hundreds (in my case thousands) of records is perfect usage for feature such as streaming (patching might be a different story). Exactly like you are saying, the key here is that stream lets you to only send the newest data. That's why responding on each stream with entire data source as response seems ludicrous.

As for what you can do in the immediately, I think the only option might be a custom extension for a table that is simpler and more efficient.

Do you have any suggestion for such extension? Do you mean to create separate widget similar to DataTable without all of these special internals causing to send responses, also based on SlickGrid. Or different solution all together?

@bryevdv

This comment has been minimized.

Copy link
Member

bryevdv commented Oct 27, 2017

Do you have any suggestion for such extension? Do you mean to create separate widget similar to DataTable without all of these special internals causing to send responses, also based on SlickGrid. Or different solution all together?

There's actually an open issue to make a "write-only" basic HTML table widget, that would be much simpler than the SlickGrid based table. That might be a useful thing to pursue: #6916 Without, selection, sorting, editors, etc, it could be very simple and purely respond to DataSource updates

@OndrejZapletal

This comment has been minimized.

Copy link
Author

OndrejZapletal commented Oct 27, 2017

Only think that I need from this new widget is the ability to use CDSView for filtering out some rows in front-end. I'm afraid that this wouldn't work purely static HTML table. Do you think that it would be feasible somehow?

@bryevdv

This comment has been minimized.

Copy link
Member

bryevdv commented Oct 27, 2017

I think an HTML table could take a CDSView into account when it renders.

@OndrejZapletal

This comment has been minimized.

Copy link
Author

OndrejZapletal commented Oct 27, 2017

Ok, thanks a lot. I will investigate this and see if I can come up with something. Maybe somebody else will have something to add.

@chupach

This comment has been minimized.

Copy link

chupach commented Nov 17, 2017

I always thought with OndrejZapletal that stream/patch were only sending the minimum amount of data...
Anyway, at the moment (0.12.10/11), slickgrid integration in bokeh seems pretty much broken. Impossible to edit fields, ...

So, Bryan, are you suggesting that an editable datatable (based on slickgrid) might actually disappear in the future ? (because may have some consequence on app design) ?

@bryevdv

This comment has been minimized.

Copy link
Member

bryevdv commented Nov 17, 2017

@chupach It does, but DataTable is a a special case with extra requirements that need more information to be sent. I am not suggesting that DataTable is going anywhere, there are no plans to remove it. I was only suggesting an HTMLTable might be a nice alternative for some use cases where trading some capability for stream performance is worthwhile.

@codeape2

This comment has been minimized.

Copy link

codeape2 commented Jan 3, 2018

A minimal example using patch:

from bokeh.models import DataTable, TableColumn, ColumnDataSource, Button
from bokeh.plotting import curdoc

data = {'x': [1,2,3,4], 'y': [10, 20, 30, 40]}
source = ColumnDataSource(data)

table = DataTable(columns=[
    TableColumn(field="x"),
    TableColumn(field="y")
], source=source, editable=False)


btn = Button(label="Click Me!")

@btn.on_click
def btn_click():
    source.patch({"x": [(0, 42)]})

curdoc().add_root(table)
curdoc().add_root(btn)

See dataframes exchanged in the screenshot below:

  1. Browser2Server: button click (server side code does datasource.patch)
  2. Server2Browser: ColumnsPatched
  3. Browser2Server: ENTIRE DATASOURCE HAS CHANGED, HERE'S THE ENTIRE DATASOURCE

This is a problem with big data sources.

image

I'd be happy to try and submit a patch for this, but I would need some advice on the following:

  1. Would this implementation approach work: a) On the browser side, when a data table reacts to a ColumnsPatch b) ... somehow make sure that this does not result in a ModelChanged event being sent back to the server
  2. What kind of tests should be written for this? Are there selenium (or similar) tests involved?
@bryevdv

This comment has been minimized.

Copy link
Member

bryevdv commented Jan 3, 2018

@codeape2 Thanks for the additional information.

It's possible there is a way to do that. It will require some investigation. Offhand, my guess is that one (or both) of these two is the culprit:

https://github.com/bokeh/bokeh/blob/master/bokehjs/src/coffee/models/widgets/tables/data_table.coffee#L80-L82

https://github.com/bokeh/bokeh/blob/master/bokehjs/src/coffee/models/widgets/tables/data_table.coffee#L110-L112

And I think that the issue is that the same code path is used to set the data table regardless of whether its due to an edit on the browser side (should send an update) or whether the table just got updated via a websocket event (should not send an update). In pretty much every other situation there is already code to prevent message "ping pong" but as I said above the DataTable has some specific and particular requirements that seem to be thwarting the normal mechanisms.

Very speculatively I might suggest that it's possible to decouple these two code paths. Perhaps local browser edits can emit a signal, instead of updating the data source directly. Then the response to the "local edit" signal would make a silent (non-notifying) update. There would be several assumptions to verify first before knowing if this idea has legs. I'd be happy to discuss further, though.

Re: tests, unfortunately our selenium test suite is inoperable at the moment after some recent build refactoring. It needs to be rebuilt from the ground up at some point. Fortunately we mostly can rely on image diff and python and JS unit tests, and I think testing this could be covered in our BokehJS unit tests (but I'd have to give it some thought)

@bryevdv bryevdv added this to the 1.0 milestone Jul 1, 2018

@bryevdv

This comment has been minimized.

Copy link
Member

bryevdv commented Oct 8, 2018

As it turns out there was already mechanisms to distinguish those two code paths, but patch/stream callbacks did not make use of them. Barring discovering any unforeseen bad interactions, I will have a very small PR soon to fix this, but need to try an devise a proper test to go along with it.

@bryevdv bryevdv referenced this issue Oct 8, 2018

Merged

DataTable data model fixes #8310

6 of 6 tasks complete
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment