Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] OWFeatureStatistics #72

Merged
merged 5 commits into from Jan 7, 2018

Conversation

Projects
None yet
5 participants
@pavlin-policar
Copy link
Contributor

commented Apr 10, 2017

Show basic statistics for every feature.

Features of note:

  • Widget is usable with large datasets (means, vars and missing are computed on init, hisograms are computed and drawn as needed and cached)
  • Histograms keep their aspect ratio as the widget resizes (however, this ratio is ignored if the height is too small).
  • Dropdown to enable coloring by any target variable, or disable coloring entirely.
  • Filter textbox for variable names
  • Sparse support

Todo:

  • Sorting (currently uses slow QSortFilterProxyModel), special sorting by variable type, special sorting by first and second moment
  • Implement report
  • Attribute selection sends dataset with these attributes to output.

Current state (housing and GDS1615, respectively):

@pavlin-policar pavlin-policar force-pushed the pavlin-policar:owattributes branch 2 times, most recently from cf638fb to 632090d Apr 15, 2017


def _scale_to_interval(x, start, stop):
x_min, x_max = x.min(), x.max()
return (stop - start) * (x - x_min) / (x_max - x_min) + start

This comment has been minimized.

Copy link
@kernc

kernc Apr 15, 2017

Member

See Orange.data.util.scale.

This comment has been minimized.

Copy link
@pavlin-policar

pavlin-policar Apr 15, 2017

Author Contributor

Awesome! This is exactly what I needed.

if hasattr(self.attribute, item):
return getattr(self.attribute, item)
else:
raise AttributeError('Could not find `%s`' % item)

This comment has been minimized.

Copy link
@kernc

kernc Apr 15, 2017

Member

getattr itself raises AttributeError if not hasattr and no default provided.

This comment has been minimized.

Copy link
@pavlin-policar

pavlin-policar Apr 16, 2017

Author Contributor

Yes, that's true, but then in case the error does actually get raised, the AttributeError message will refer to the self.attribute, which can be very confusing. This way, it is clear where the error occurs.

elif isinstance(attribute, ContinuousVariable):
return ContinuousAttributeRow(attribute, data)
else:
raise TypeError('Attribute type not recognised')

This comment has been minimized.

Copy link
@kernc

kernc Apr 15, 2017

Member

Could show n_defined and n_unique for string variable?

return self._data.domain.attributes.index(self.attribute)

def _get_column(self, filter_nan=True):
col = self._data.X[:, self._attr_index]

This comment has been minimized.

Copy link
@kernc

kernc Apr 15, 2017

Member

What if the variable were in Y or metas?
Prefer:

col = self._data.get_column_view(self.attribute)[0]

@pavlin-policar pavlin-policar force-pushed the pavlin-policar:owattributes branch 2 times, most recently from 9e2ca22 to bb7ba60 Apr 16, 2017

@pavlin-policar pavlin-policar force-pushed the pavlin-policar:owattributes branch from bb7ba60 to c67bae0 Jun 25, 2017

@nikicc
Copy link
Contributor

left a comment

This widget crashes with sparse data since there are many calls to numpy's function that cannot handle sparse data sets (e.g. np.is_nan). Could you make it work on sparse data as well? Please, take a look at the Orange/statistics/util.py module which has some functions that are equivalents of the numpy's just that they also handle sparse data sets.

If adopting to sparse data doens't seem feasible, can you at least add a warning, that the widget doensn't yet handle sparse data sets, so it doens't crashes? Look for example in OWBaseLearner class (Orange/widgets/utils/owlearnerwidget.py)

idx = int(stats.mode(data).mode[0])
result = data[idx]
elif attribute.is_continuous:
result = np.mean(data)

This comment has been minimized.

Copy link
@nikicc

nikicc Jun 26, 2017

Contributor

Please use our own implementation of mean (Orange/statistics/util.py) which can also handle sparse data sets.

if attribute in self.__cache['valid']:
return self.__cache['valid'][attribute]

result = int((~np.isnan(data)).sum())

This comment has been minimized.

Copy link
@nikicc

nikicc Jun 26, 2017

Contributor

Please use our own implementation of countnans (Orange/statistics/util.py) which can also handle sparse data sets.

for bin_idx in range(self.n_bins):
distributions[bin_idx] = y[mask[bin_idx]].sum(axis=0)
else:
distributions = np.bincount(bin_indices.astype(np.int64))

This comment has been minimized.

Copy link
@nikicc

nikicc Jun 26, 2017

Contributor

Please use our own implementation of bincount (Orange/statistics/util.py) which can also handle sparse data sets.

@pavlin-policar pavlin-policar force-pushed the pavlin-policar:owattributes branch from 7448288 to 1765233 Jul 6, 2017

scene.render(painter, target=QRectF(option.rect))
painter.restore()
else:
super().paint(painter, option, index)

This comment has been minimized.

Copy link
@kernc

kernc Jul 7, 2017

Member

Early exit saves an indent level and is generally easier to follow.

# hheader.setDefaultSectionSize(100)

vheader = self.view.verticalHeader()
# vheader.setDefaultSectionSize(100)

This comment has been minimized.

Copy link
@kernc

kernc Jul 7, 2017

Member

No commit comments, except when preceded by meta-comments.


# TODO Is there a better way to make the table take up all the space?
box = gui.vBox(self.mainArea)
box.layout().addWidget(self.view)

This comment has been minimized.

Copy link
@kernc

kernc Jul 7, 2017

Member

This will take up all the space?

self.mainArea.layout().addWidget(self.view)

@pavlin-policar pavlin-policar force-pushed the pavlin-policar:owattributes branch 2 times, most recently from 67dd3bb to 9336ec6 Jul 8, 2017

@pavlin-policar

This comment has been minimized.

Copy link
Contributor Author

commented Jul 8, 2017

This PR depends on #2458, for the implementation of digitize.

@pavlin-policar pavlin-policar force-pushed the pavlin-policar:owattributes branch 2 times, most recently from 51f3772 to 5bcb73b Jul 8, 2017

@pavlin-policar pavlin-policar changed the title [WIP] OWAttributes [WIP] OWFeatureStatistics Jul 10, 2017

@pavlin-policar pavlin-policar force-pushed the pavlin-policar:owattributes branch 2 times, most recently from b219a1f to 919be83 Jul 10, 2017

@nikicc

This comment has been minimized.

Copy link
Contributor

commented Jul 10, 2017

@pavlin-policar the widget crashes if I input some data set with textual features (i.e. StringVariables). We should probably just skip textual features.

screen shot 2017-07-10 at 16 40 45

@pavlin-policar

This comment has been minimized.

Copy link
Contributor Author

commented Jul 10, 2017

Yes, I'm also quite sure TimeVariable would crash the widget as well, but I'm not entirely sure how to handle them. On the other hand, I'm not sure hiding them would be the best idea either, since one would expect all the features to have some sort of summary.

For a StringVariable it may make sense to display the most common string (if it exists) and the number of occurences. Drawing a distribution would make no sense, since the number of possible values will probably be way to high and the bars would be next to invisible.

As for TimeVariable, I have basically no idea what we could display for that... Perhaps the date range, but that seems like a reach...

I guess I'll just not display them until we can come up with something better.

@nikicc

This comment has been minimized.

Copy link
Contributor

commented Jul 10, 2017

For a StringVariable it may make sense to display the most common string (if it exists) and the number of occurences. Drawing a distribution would make no sense, since the number of possible values will probably be way to high and the bars would be next to invisible.

I'm afraid that in many cases this would yield stopwords like a, the etc. Perhaps, should we plot a histogram of number of words inside documents?

@kernc

This comment has been minimized.

Copy link
Member

commented Jul 10, 2017

As for TimeVariable, I have basically no idea what we could display for that...

Obviously was designed by preschool children, but may otherwise work:

tvar = data.domain['time']
time_col = data.get_column_view(tvar)[0]
tvar.rerp_val(nanmin(time_col))
tvar.rerp_val(nanmax(time_col))
tvar.rerp_val(nanmean(time_col))
...
@ajdapretnar

This comment has been minimized.

Copy link
Contributor

commented Jul 11, 2017

As for TimeVariable, I think we should display a line chart of occurrences in time. I know we can't even handle that in a proper plot, but perhaps this is a good opportunity to think about it. Could we create a heuristic that figures out what is the proper granularity of time instances (minute, hour, day, year?) and plot it in a line chart as sum of instances? Normally I'd like a quick glance at when was a particular event the most frequent.

@pavlin-policar pavlin-policar force-pushed the pavlin-policar:owattributes branch from 6a18825 to 1b9fe16 Jul 14, 2017

@pavlin-policar pavlin-policar force-pushed the pavlin-policar:owattributes branch from d47c836 to f599304 Aug 24, 2017

@pavlin-policar pavlin-policar force-pushed the pavlin-policar:owattributes branch 4 times, most recently from 6cdb429 to 8c8a4ec Sep 4, 2017

@pavlin-policar

This comment has been minimized.

Copy link
Contributor Author

commented Sep 5, 2017

This PR depends on #2558, for the proper implementation of countnans.

@kernc kernc referenced this pull request Sep 30, 2017

Closed

[WIP] Unsupervised scorers in Rank #2640

1 of 3 tasks complete
@pavlin-policar

This comment has been minimized.

Copy link
Contributor Author

commented Oct 13, 2017

For the time variable, I agree with @ajdapretnar; a line chart would be far more appropriate than a histogram. As she said in a histogram, we have no way of determining a good granularity for the bins.

However, how would we color the line charts to show the values of the target variable? For a discrete target, we could simply plot multiple lines, each in a different color, but what about for a continuous target? Coloring the area under the curve might be a good solution... If this turns out to work well, it may be worth testing for continuous variables as well.

@pavlin-policar pavlin-policar force-pushed the pavlin-policar:owattributes branch from 9e6ece8 to 508837a Oct 13, 2017

@kernc

kernc approved these changes Dec 1, 2017

@ajdapretnar

This comment has been minimized.

Copy link
Contributor

commented Dec 1, 2017

With the recent master on core and this PR I get:
AttributeError: 'OWFeatureStatistics' object has no attribute 'proxy_model'

if not different_case:
regex.setCaseSensitivity(Qt.CaseInsensitive)

self.proxy_model.setFilterRegExp(regex)

This comment has been minimized.

Copy link
@kernc

kernc Dec 10, 2017

Member

Looks like a remnant since 2e44023. AbstractTableModel unfortunately does not do filtering. So you should probably adapt/remove this.

But I think filtering is nice to have. @pavlin-policar What are your thoughts on it? Would you be willing to amend AbstractSortTableModel with filtering capabilities, or wrap FeatureStatisticsTableModel in a QSortFilterProxyModel just for its filtering?

This comment has been minimized.

Copy link
@pavlin-policar

pavlin-policar Dec 15, 2017

Author Contributor

My bad, I must have missed it when converting over to the new AbstractTableModel.

I think filtering is a very important feature, since the GEO datasets can have thousands of features and looking for features by scrolling can get tedious, very fast. I will look into it.

@pavlin-policar pavlin-policar force-pushed the pavlin-policar:owattributes branch 2 times, most recently from f1c8747 to bc6a176 Dec 15, 2017

@pavlin-policar pavlin-policar force-pushed the pavlin-policar:owattributes branch from d42de90 to 5150b37 Dec 22, 2017

@pavlin-policar

This comment has been minimized.

Copy link
Contributor Author

commented Jan 5, 2018

@kernc, @nikicc This should now be stable enough for another review. I've also squashed all the commits so it's not a complete mess anymore.

@kernc

This comment has been minimized.

Copy link
Member

commented Jan 5, 2018

Histograms are too high? I'd expect them to be the height of a normal table view line, about 2em at most, likened to sparklines.
screenshot_2018-01-05_20-54-26

@pavlin-policar

This comment has been minimized.

Copy link
Contributor Author

commented Jan 5, 2018

That's actually a feature 😄 The histogram resizes to take up the entire area that the text doesn't need (this could be useful if you find an interesting histogram and want to take a closer look, you resize the window and make it wider, and the histogram grows). If you open and resize the widget, you'll see it work, the default sizes should look all right.

Since it does look quite extreme on your screenshot, perhaps it may be a good idea to set some kind of a limit to how big it can grow?

@kernc

This comment has been minimized.

Copy link
Member

commented Jan 5, 2018

Resizing in width is fine, but height? The histogram doesn't shrink smaller when the window is shrunk vertically.

@pavlin-policar

This comment has been minimized.

Copy link
Contributor Author

commented Jan 5, 2018

That's true, it is prevented from shrinking into nothing. I'll tinker with fixing the height or limiting it. The thing is that you probably want a compact overview of histograms, so you can see many at a time, and want them to be fairly small, but then one may catch your eye and you'd want to inspect it further and make it bigger. Having a fixed height doesn't allow that.

@kernc

This comment has been minimized.

Copy link
Member

commented Jan 5, 2018

@kernc

This comment has been minimized.

Copy link
Member

commented Jan 6, 2018

@pavlin-policar Could you possibly fix this row height by the end of the weekend so it gets featured in the next asap release? I would but don't know how.

@pavlin-policar

This comment has been minimized.

Copy link
Contributor Author

commented Jan 7, 2018

Sure thing! The latest commit limits the max height the histogram can grow to. This way the histograms still maintain a reasonable aspect ratio at a couple window sizes. If it turns out that the growth is pointless, it's easy to throw out.

@pavlin-policar pavlin-policar changed the title [WIP] OWFeatureStatistics [ENH] OWFeatureStatistics Jan 7, 2018

@kernc

This comment has been minimized.

Copy link
Member

commented Jan 7, 2018

I made a change allowing to color by any variable. Is it ok?

@pavlin-policar

This comment has been minimized.

Copy link
Contributor Author

commented Jan 7, 2018

I like this change a lot, however, now when I open the widget, there is no option selected for colouring (see screenshot). I think having the colouring default first target variable would make the most sense.

@kernc

This comment has been minimized.

Copy link
Member

commented Jan 7, 2018

Fixed.

@pavlin-policar

This comment has been minimized.

Copy link
Contributor Author

commented Jan 7, 2018

Looks good to me 👍

@nikicc agrees this is now ok.

@kernc kernc merged commit 4bd31d6 into biolab:master Jan 7, 2018

@pavlin-policar pavlin-policar deleted the pavlin-policar:owattributes branch Jan 7, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.