[ENH] OWFeatureStatistics #72
pavlin-policar force-pushed the pavlin-policar:owattributes branch 2 times, most recently from cf638fb to 632090d on Apr 15, 2017
kernc requested changes on Apr 15, 2017
    def _scale_to_interval(x, start, stop):
        x_min, x_max = x.min(), x.max()
        return (stop - start) * (x - x_min) / (x_max - x_min) + start
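The scaling helper in the diff divides by `x_max - x_min`, which is zero for a constant column. A minimal sketch of a guarded variant follows, written in plain Python rather than NumPy. The name `scale_to_interval` and the choice to map a constant column to the midpoint of the interval are illustrative assumptions, not something this PR settles.

```python
def scale_to_interval(values, start, stop):
    """Linearly rescale values so that min -> start and max -> stop."""
    v_min, v_max = min(values), max(values)
    span = v_max - v_min
    if span == 0:
        # Assumption: a constant column maps to the middle of the interval.
        midpoint = (start + stop) / 2
        return [midpoint for _ in values]
    return [(stop - start) * (v - v_min) / span + start for v in values]


print(scale_to_interval([2, 4, 6], 0, 1))   # [0.0, 0.5, 1.0]
print(scale_to_interval([3, 3, 3], 0, 10))  # [5.0, 5.0, 5.0]
```

Whether the midpoint, `start`, or NaN is the right answer for a constant column depends on how the scaled values are drawn, so treat the fallback as a placeholder.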
    if hasattr(self.attribute, item):
        return getattr(self.attribute, item)
    else:
        raise AttributeError('Could not find `%s`' % item)
kernc (Member), Apr 15, 2017:
getattr itself raises AttributeError if the attribute is missing and no default is provided.
pavlin-policar (Author, Contributor), Apr 16, 2017:
Yes, that's true, but if the error does get raised, the AttributeError message will refer to self.attribute, which can be very confusing. This way, it is clear where the error occurs.
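The trade-off in this exchange can be sketched as follows. This is a minimal illustration, not the PR's code: `Variable` and `AttributeRow` are hypothetical stand-ins, and the error message format is an assumption. It relies on `getattr` raising on its own, as kernc notes, while still naming the wrapper in the message, as the author wants.

```python
class Variable:
    """Hypothetical stand-in for the wrapped attribute object."""

    def __init__(self, name):
        self.name = name


class AttributeRow:
    """Hypothetical wrapper that forwards unknown lookups to `attribute`."""

    def __init__(self, attribute):
        self.attribute = attribute

    def __getattr__(self, item):
        # __getattr__ is only called after normal lookup has failed.
        try:
            return getattr(self.attribute, item)
        except AttributeError:
            # Re-raise naming the wrapper, so the message points at the
            # object the caller actually used.
            raise AttributeError(
                '%r object has no attribute %r' % (type(self).__name__, item)
            ) from None


row = AttributeRow(Variable('age'))
print(row.name)  # age
try:
    row.missing
except AttributeError as err:
    print(err)  # 'AttributeRow' object has no attribute 'missing'
```

This avoids the separate `hasattr` check while keeping the clearer error message.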
    elif isinstance(attribute, ContinuousVariable):
        return ContinuousAttributeRow(attribute, data)
    else:
        raise TypeError('Attribute type not recognised')
        return self._data.domain.attributes.index(self.attribute)

    def _get_column(self, filter_nan=True):
        col = self._data.X[:, self._attr_index]
kernc (Member), Apr 15, 2017:
What if the variable were in Y or metas?
Prefer:
    col = self._data.get_column_view(self.attribute)[0]
pavlin-policar force-pushed the pavlin-policar:owattributes branch 2 times, most recently from 9e2ca22 to bb7ba60 on Apr 16, 2017
pavlin-policar force-pushed the pavlin-policar:owattributes branch from bb7ba60 to c67bae0 on Jun 25, 2017
nikicc reviewed on Jun 26, 2017
This widget crashes with sparse data since there are many calls to numpy functions that cannot handle sparse data sets (e.g. If adapting to sparse data doesn't seem feasible, can you at least add a warning that the widget doesn't yet handle sparse data sets, so it doesn't crash? Look for example in
        idx = int(stats.mode(data).mode[0])
        result = data[idx]
    elif attribute.is_continuous:
        result = np.mean(data)
nikicc (Contributor), Jun 26, 2017:
Please use our own implementation of mean (Orange/statistics/util.py) which can also handle sparse data sets.
    if attribute in self.__cache['valid']:
        return self.__cache['valid'][attribute]

    result = int((~np.isnan(data)).sum())
nikicc (Contributor), Jun 26, 2017:
Please use our own implementation of countnans (Orange/statistics/util.py) which can also handle sparse data sets.
        for bin_idx in range(self.n_bins):
            distributions[bin_idx] = y[mask[bin_idx]].sum(axis=0)
    else:
        distributions = np.bincount(bin_indices.astype(np.int64))
nikicc (Contributor), Jun 26, 2017:
Please use our own implementation of bincount (Orange/statistics/util.py) which can also handle sparse data sets.
pavlin-policar force-pushed the pavlin-policar:owattributes branch from 7448288 to 1765233 on Jul 6, 2017
kernc reviewed on Jul 7, 2017
        scene.render(painter, target=QRectF(option.rect))
        painter.restore()
    else:
        super().paint(painter, option, index)
    # hheader.setDefaultSectionSize(100)

    vheader = self.view.verticalHeader()
    # vheader.setDefaultSectionSize(100)
    # TODO Is there a better way to make the table take up all the space?
    box = gui.vBox(self.mainArea)
    box.layout().addWidget(self.view)
kernc (Member), Jul 7, 2017 (edited):
This will take up all the space?
    self.mainArea.layout().addWidget(self.view)
pavlin-policar force-pushed the pavlin-policar:owattributes branch 2 times, most recently from 67dd3bb to 9336ec6 on Jul 8, 2017
This PR depends on #2458, for the implementation of
pavlin-policar force-pushed the pavlin-policar:owattributes branch 2 times, most recently from 51f3772 to 5bcb73b on Jul 8, 2017
pavlin-policar changed the title from [WIP] OWAttributes to [WIP] OWFeatureStatistics on Jul 10, 2017
pavlin-policar force-pushed the pavlin-policar:owattributes branch 2 times, most recently from b219a1f to 919be83 on Jul 10, 2017
@pavlin-policar the widget crashes if I input some data set with textual features (i.e. StringVariables). We should probably just skip textual features.
Yes, I'm also quite sure For a As for I guess I'll just not display them until we can come up with something better.
I'm afraid that in many cases this would yield stopwords like
Obviously was designed by preschool children, but may otherwise work:
    tvar = data.domain['time']
    time_col = data.get_column_view(tvar)[0]
    tvar.repr_val(nanmin(time_col))
    tvar.repr_val(nanmax(time_col))
    tvar.repr_val(nanmean(time_col))
    ...
As for TimeVariable, I think we should display a line chart of occurrences in time. I know we can't even handle that in a proper plot, but perhaps this is a good opportunity to think about it. Could we create a heuristic that figures out the proper granularity of time instances (minute, hour, day, year?) and plots the number of instances per interval as a line chart? Normally I'd like a quick glance at when a particular event was most frequent.
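One way such a granularity heuristic could look is sketched below. It is only an illustration of the idea in the comment above: the `GRANULARITIES` table, the `max_points` threshold, and both function names are assumptions, and it works on plain `datetime` objects rather than on a TimeVariable column. It picks the finest granularity that keeps the number of points on the line manageable, then counts instances per bucket.

```python
from collections import Counter
from datetime import datetime, timedelta

# Candidate granularities, finest first: (label, strftime format, approx. length).
GRANULARITIES = [
    ('minute', '%Y-%m-%d %H:%M', timedelta(minutes=1)),
    ('hour', '%Y-%m-%d %H', timedelta(hours=1)),
    ('day', '%Y-%m-%d', timedelta(days=1)),
    ('month', '%Y-%m', timedelta(days=30)),
    ('year', '%Y', timedelta(days=365)),
]


def pick_granularity(timestamps, max_points=100):
    """Return the finest granularity that spans the data in at most max_points buckets."""
    span = max(timestamps) - min(timestamps)
    for label, fmt, length in GRANULARITIES:
        if span / length <= max_points:
            return label, fmt
    # Nothing fits: fall back to the coarsest granularity.
    return GRANULARITIES[-1][0], GRANULARITIES[-1][1]


def counts_over_time(timestamps, max_points=100):
    """Count instances per bucket; buckets with no instances are omitted."""
    label, fmt = pick_granularity(timestamps, max_points)
    counts = Counter(t.strftime(fmt) for t in timestamps)
    return label, sorted(counts.items())


stamps = [
    datetime(2017, 1, 1), datetime(2017, 1, 1),
    datetime(2017, 1, 3), datetime(2017, 2, 10),
]
print(counts_over_time(stamps))
# ('day', [('2017-01-01', 2), ('2017-01-03', 1), ('2017-02-10', 1)])
```

A real line chart would also need the empty buckets filled with zeros, which this sketch leaves out.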
pavlin-policar force-pushed the pavlin-policar:owattributes branch from 6a18825 to 1b9fe16 on Jul 14, 2017
pavlin-policar force-pushed the pavlin-policar:owattributes branch from d47c836 to f599304 on Aug 24, 2017
pavlin-policar force-pushed the pavlin-policar:owattributes branch 4 times, most recently from 6cdb429 to 8c8a4ec on Sep 4, 2017
This PR depends on #2558, for the proper implementation of
For the time variable, I agree with @ajdapretnar; a line chart would be far more appropriate than a histogram. As she said, in a histogram we have no way of determining a good granularity for the bins. However, how would we color the line charts to show the values of the target variable? For a discrete target, we could simply plot multiple lines, each in a different color, but what about a continuous target? Coloring the area under the curve might be a good solution... If this turns out to work well, it may be worth testing for continuous variables as well.
pavlin-policar force-pushed the pavlin-policar:owattributes branch from 9e6ece8 to 508837a on Oct 13, 2017
kernc approved these changes on Dec 1, 2017
With the recent master on core and this PR I get:
kernc reviewed on Dec 10, 2017
    if not different_case:
        regex.setCaseSensitivity(Qt.CaseInsensitive)

    self.proxy_model.setFilterRegExp(regex)
kernc (Member), Dec 10, 2017:
Looks like a remnant since 2e44023. AbstractTableModel unfortunately does not do filtering. So you should probably adapt/remove this.
But I think filtering is nice to have. @pavlin-policar What are your thoughts on it? Would you be willing to amend AbstractSortTableModel with filtering capabilities, or wrap FeatureStatisticsTableModel in a QSortFilterProxyModel just for its filtering?
pavlin-policar (Author, Contributor), Dec 15, 2017:
My bad, I must have missed it when converting over to the new AbstractTableModel.
I think filtering is a very important feature, since the GEO datasets can have thousands of features, and looking for features by scrolling gets tedious very fast. I will look into it.
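The matching logic behind such a filter can be sketched independently of Qt. The snippet below is an assumption-laden illustration: `filter_feature_names` and the feature names are made up, and in the widget the same rule would live either in the table model or in a QSortFilterProxyModel, as kernc suggests. It mirrors the case-insensitive regular-expression match from the diff.

```python
import re


def filter_feature_names(names, pattern, case_sensitive=False):
    """Return the names that contain a match for the regular expression `pattern`."""
    flags = 0 if case_sensitive else re.IGNORECASE
    regex = re.compile(pattern, flags)
    return [name for name in names if regex.search(name)]


# Hypothetical feature names, for illustration only.
names = ['CRIM', 'ZN', 'INDUS', 'medv', 'Indus_ratio']
print(filter_feature_names(names, 'indus'))  # ['INDUS', 'Indus_ratio']
print(filter_feature_names(names, 'indus', case_sensitive=True))  # []
```

With thousands of features, compiling the pattern once per filter change, as done here, keeps the per-row work to a single search.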
pavlin-policar force-pushed the pavlin-policar:owattributes branch 2 times, most recently from f1c8747 to bc6a176 on Dec 15, 2017
pavlin-policar force-pushed the pavlin-policar:owattributes branch from d42de90 to 5150b37 on Dec 22, 2017
Histograms are too high? I'd expect them to be the height of a normal table view line, about 2em at most, likened to sparklines.
That's actually a feature. Since it does look quite extreme on your screenshot, perhaps it would be a good idea to set some kind of limit on how big it can grow?
Resizing in width is fine, but height? The histogram doesn't shrink smaller when the window is shrunk vertically.
That's true, it is prevented from shrinking into nothing. I'll tinker with fixing the height or limiting it. The thing is that you probably want a compact overview of histograms, so you can see many at a time, and want them to be fairly small, but then one may catch your eye and you'd want to inspect it further and make it bigger. Having a fixed height doesn't allow that.
Mouse hover could pop over a larger copy of the histogram for closer inspection? Or better yet, since the binning is so crude, one could instead select the attribute and pass it out into a Distributions widget for more precise inspection.
@pavlin-policar Could you possibly fix this row height by the end of the weekend so it gets featured in the next asap release? I would but don't know how.
Sure thing! The latest commit limits the max height the histogram can grow to. This way the histograms still maintain a reasonable aspect ratio at a couple of window sizes. If it turns out that the growth is pointless, it's easy to throw out.
pavlin-policar changed the title from [WIP] OWFeatureStatistics to [ENH] OWFeatureStatistics on Jan 7, 2018
I made a change that allows coloring by any variable. Is it ok?
pavlin-policar and others added some commits on Jan 7, 2018
Fixed.
Looks good to me
kernc dismissed nikicc’s stale review on Jan 7, 2018:
@nikicc agrees this is now ok.



pavlin-policar commented on Apr 10, 2017 (edited):
Show basic statistics for every feature.
Features of note:
Todo:
QSortFilterProxyModel), special sorting by variable type, special sorting by first and second moment
Current state (housing and GDS1615, respectively):

