Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] DataTable column sort not working with NaNs #10251

Closed
sggaffney opened this issue Jul 2, 2020 · 5 comments · Fixed by #10318
Closed

[BUG] DataTable column sort not working with NaNs #10251

sggaffney opened this issue Jul 2, 2020 · 5 comments · Fixed by #10318

Comments

@sggaffney
Copy link

sggaffney commented Jul 2, 2020

Versions: Python 3.7.3; Bokeh 2.1.1; Numpy 1.18.5; Mac OS X 10.15.5; Chrome: Version 83.0.4103.116 (Official Build) (64-bit)

Issue: DataTable sorting not working as expected with NaNs. After sorting, NaNs are scattered through the column rather than being moved to top or bottom.

Example

import numpy as np
from bokeh import io as bkio
import bokeh.models as bkm

bkio.output_file('test_table_sort.html')

source = bkm.ColumnDataSource({
    'journal': ['The Lancet', 'Clinical and experimental pediatrics', 
                'Anales de Pediatria', 'Annals of translational medicine', 'CMAJ'],
    'citescore': [10.28, np.nan, 0.43, np.nan, 1.12],
    })
d = bkm.DataTable(columns=[bkm.widgets.TableColumn(field='journal', title='journal'),
                           bkm.widgets.TableColumn(field='citescore', title='citescore', 
                                                   formatter=bkm.NumberFormatter(format='0.0'))], source=source)
bkio.show(d)

Console output

[bokeh] setting log level to: 'info'
bokeh-tables-2.1.1.min.js:47 [bokeh] jquery-ui is required to enable DataTable.reorderable
render @ bokeh-tables-2.1.1.min.js:47
bokeh-2.1.1.min.js:162 [bokeh] document idle at 49 ms

Screen Shot 2020-07-02 at 18 33 20

@bryevdv
Copy link
Member

bryevdv commented Jul 2, 2020

Table sorting is done here:

sort(columns: any[]): void {
let cols = columns.map((column) => [column.sortCol.field, column.sortAsc ? 1 : -1])
if (cols.length == 0) {
cols = [[DTINDEX_NAME, 1]]
}
const records = this.getRecords()
const old_index = this.index.slice()
this.index.sort(function(i1, i2) {
for (const [field, sign] of cols) {
const value1 = records[old_index.indexOf(i1)][field]
const value2 = records[old_index.indexOf(i2)][field]
const result = value1 == value2 ? 0 : value1 > value2 ? sign : -sign
if (result != 0)
return result
}
return 0
})
}
}

It's definitely not nan-aware. This would be a nice first issue for a new contributor to tackle.

@verybadsoldier
Copy link

verybadsoldier commented Jul 16, 2020

I am also experiencing this problem. Probably does not add much information but I came up with this minimal example:

import numpy as np

from bokeh.models.widgets import DataTable, TableColumn, Div
from bokeh.layouts import column, gridplot, row
from bokeh.plotting import figure, output_file, show
from bokeh.models import ColumnDataSource, HoverTool

def main():
    data = dict(
        dates=[-0.816, -0.455, np.nan, np.nan, np.nan, -0.057, -0.079, -0.090, -1.999, - 1.329],
        downloads=[i for i in range(10)],
    )
    source = ColumnDataSource(data)
    columns = [
        TableColumn(field="dates", title="Date"),
        TableColumn(field="downloads", title="Downloads"),
    ]
    data_table = DataTable(source=source, columns=columns, sizing_mode='stretch_height', sortable=True)

    c = row(children=[data_table], sizing_mode='stretch_height', min_height=10)
    output_file(r"C:\temp\bla.html")
    show(c)

if __name__ == '__main__':
    main()

Clicking the column Date makes the NaN values seemingly randomly appear at the top or at the bottom.

@mattpap
Copy link
Contributor

mattpap commented Jul 16, 2020

A preliminary PR fixing the sorting algorithm is available in PR #10318. However, one important thing to note is how NaNs are currently handled in bokeh. Data like:

 data = dict(dates=[-0.816, -0.455, np.nan, np.nan, np.nan, -0.057, -0.079, -0.090, -1.999, - 1.329])

is encoded as:

[-0.816, -0.455, "NaN", "NaN", "NaN", -0.057, -0.079, -0.090, -1.999, - 1.329]

which results in textual comparisons (lexicographic), instead of numerical. To get the desired behavior, one needs to use numpy arrays:

 data = dict(dates=np.array([-0.816, -0.455, np.nan, np.nan, np.nan, -0.057, -0.079, -0.090, -1.999, - 1.329]))

This serializes as a binary array and deserializes as a typed array in bokehjs with NaNs and infinities preserved. This dichotomy is unfortunate and I will try to get this resolved ASAP (hopefully in 2.2).

@bryevdv
Copy link
Member

bryevdv commented Jul 16, 2020

This dichotomy is unfortunate and I will try to get this resolved ASAP

What is your plan here? This dichotomy exists because the JSON standard neglected to incorporate nan and inf directly. Do you propose to automagically wrap all plain lists in numpy arrays so that either the binary protocol or base64 encoding will be used?

@mattpap
Copy link
Contributor

mattpap commented Jul 17, 2020

I will start another issue to explain how I would like to proceed, especially that arrays with NaNs is just a special case of a broader problem with serialization.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants