add `gather`, `count`, fix #17, fix `isNumber`, fix #19 #18

Vindaar · 2019-11-05T20:13:39Z

See update below

This PR first of all changes our definition of what constitutes a number, since our definition was broken (question remains why we still use that proc, instead of parseutils.parseBiggestFloat).

Then adds gather, so that the plot mentioned in issue #17 can be written as:

let dfNew = df.gather("C1", "C2", key = "Channel", value = "value")
ggplot(dfNew, aes("s", "value", color = "Channel")) +
  geom_line() +
  ggsave("correct_scale.pdf")

instead of the handling with select and bind_rows.

Finally it will fix #17 properly.

Update:

This turned out to be a major rewrite of a big chunk of the ggplotnim.nim file. Essentially the calculation part of the code and the drawing portion was better split now.

It also contains a (hastily written) discussion of a performance bottleneck I encountered during the rewrite. While the specific case of that has been fixed, it seems to be a bigger problem, which is GC related. If weird performance regressions happen, attempt to compile the code with --gc:boehm, which does not suffer from the same slowdowns.

Taken from the commit with the major changes:

This turned out to be a major rewrite of a big chunk of the ggplotnim.nim file. Essentially the calculation part of the code and the drawing portion was better split now.

Taken from the commit with the major changes:
This commit performs a major rewrite of the way the data is prepared
for the plots and the data is accessed during the drawing process.

Instead of performing calculations when the drawing (that is
construction of GraphObjects) is done and having to modify different
parameters like scales etc. again and again is now completely avoided.

Everything required is now calculated beforehand and stored in a
FilledScales object, which has several FilledGeom objects (one for
each geom call) which bring everything that is required and provide a
unified, DataFrame based, access to the data required for each type of
geom.

This implementation might be somewhat slower for certain types of
plots, but the separation of drawing and preprocessing is worth it in
my opinion.

It might be slower simply due to more DataFrame based operations
taking place (due to inefficiencies in the DF impls) and possibly
unnecessary copies.
The former has now a direct motivation to be improved upon and the
latter will be fixed (which is now easier done).

There's still quite some code that is not needed anymore, which has to
be removed. With this commit simply all plots I created and used
before hand are now working correctly again and expands some features,
including adding (working) stats fields to the different geoms, so
that e.g. a geom_histogram can take stat="identity" and be drawn
from already precomputed bins and counts. Thus fixes issue #19.

Update 2

Well, after the last change, which rewrote the data processing I then realized that the drawing logic wasn't really made to handle the new data that was available. So I rewrote that code too.

Now the geom that's being drawn is almost decoupled from the calculations that happen for the data, so that e.g. points can be used to represent (or add to) a bar plot, add points to a binned plot etc.

More recipes for all these cases will be added.

Previously our definition of what constitutes a number was off. But why don't we use parseutils.parseBiggestFloat? I assume because originally this thing was supposed to be simple? Or did we have cases which were not covered by parseBiggestFloat?

The previous definition of the lift templates was problematic, if one wanted to lift some proc only locally (which may be necessary in some circumstances, e.g. when lifting a proc for a unit test). Set the `toExport` static bool argument to false. Also exports the string templates for the user to use.

Vindaar · 2019-11-07T10:57:43Z

Rebased onto current master.

Vindaar · 2019-11-12T12:42:44Z

Phew, this proved more work / thinking than I thought it would.

NOTE: I decided against using `parseBiggestFloat`, because while it would have been easier (and possibly faster) it parses floats from a string. That does not mean the whole string is a valid number. We could probably have gotten away with something like: ```nim var tmp: float64 let numParsed = parseBiggestFloat(s, tmp, 0) if numParsed > 0 and numParsed == s.len - 1: result = true ``` but this still wouldn't have accounted for allowed spaces at the end of the string... Well, I guess part of me wanted so see how hard it would be to write a float checker (which probably still has bugs). And well, we don't even want the number from this proc... :)

Fortunately the CSV parser =rowEntry= proc returns a =var string=, which allows us to remove the white space in place before we assign it. This finally gets rid of our darn string copy bottleneck. Yay!

Count is in principle a convenience proc for `group_by` and `summarize`

Otherwise the `ids` set will be printed completely. Since we often have the set full this "breaks" printing of scales.

Since `marker` is now a field of a ginger.Style, should hash it.

Adds a proc to assign a specific value to the position `idx` in column `k` to value `val`.

Now `f{}` works both for cases where a called function is a field of an object and also if simply a function is available under an identifier and not a raw proc. This allows such a formula: ```nim f{ s.col ~ s.trans(s.col) } ``` to compile (and work!), mapping the `s.col` to its transformed.

These are requried to decide which kind of plot to draw. I.e. is it a discrete or continuous plot along each axis. Note that discrete in y direction is not implemented yet!

Previously this forbid the application of more complicated functions than simply e.g. "tan("cty")"

The `callHistogram` proc handles the binning related fields of the Geom object and takes the correct information as input. Also takes care to assign the `dcKind*` fields of the FilledGeom as well as adding the `binWidths` field to the resulting DF for the binning stat.

Previously wasn't allowed to draw ticks with relative coordinates, but for discrete ticks it makes sense, since the labels aren't numerical anyways and we know perfectly well where the ticks are going to be.

The assertion that the resulting `numX` should be the same as the input geom.numBins is bad due to the additional ways to set the number of bins beyond just the `numBins` geom field (binWidth, breaks etc). Then we also don't want to assign the `numX` back to the geom, since that is supposed to store the value the user assigned originally. Later we're going to use `numX` only anyways.

This allows complete control both about the binning for the stat_bin like case as well as the interpretation in the stat identity case for the data (in case user hands prebinned DF and wants geom_point overlaid onto center of bins)

Also raises an exception if an unsupported geom is to be post processed instead of just ignoring it. `geom_point` can now be binned due to changes to the binning related fields of the Geom object. Information is available and accessible via geom_point proc.

Instead of simply having one proc for each geom which has to deal with all sorts of different cases we now divide the logic rather by: - discreteness - stat kinds - and only then geoms This allows to handle all sorts of different options in a more streamlined way, especially regarding adding more options in the future (at least I hope so). Thought the code could probably still be compactified some more, but well. Fine of for now.

We still have to add facet plots back in. This will simply be done by having a facet field in the FilledScales and if facet is set in the GgPlot object we'll group_by the desired field before the `postProcessScales` step.

Have to extract the ginger view from the PlotView object now, instead of receiving it from ggcreate directly. Pretty surprised that the tests still passed though!

If a scale does not exist, so the user wants to set some value we shouldn't use `select`. Thus use `getIdentityData` here as well for the DF we hand to `fillScaleImpl`

Test compares the resulting PDF files line by line except the line containing the creation date.

Apparently the order is machine specific after all, hehe.

Vindaar mentioned this pull request Nov 6, 2019

Fixes #20 by introducing fixed isInt #21

Merged

Vindaar added 5 commits November 7, 2019 11:57

fix isNumber check for strings

220c466

Previously our definition of what constitutes a number was off. But why don't we use parseutils.parseBiggestFloat? I assume because originally this thing was supposed to be simple? Or did we have cases which were not covered by parseBiggestFloat?

add gather to simplify plot creation of issue #17

6075b21

add transmute test cases

ef1eac2

add gather test cases

fdea91e

Vindaar force-pushed the addGatherAndFix17 branch from 06b41f8 to fdea91e Compare November 7, 2019 10:57

Vindaar added 21 commits November 12, 2019 19:08

update understand.org document

6c106a2

update understanding...org with more numbers

871f1ea

remove whitespace prefix/suffix from csv row entry

566e6eb

Fortunately the CSV parser =rowEntry= proc returns a =var string=, which allows us to remove the white space in place before we assign it. This finally gets rid of our darn string copy bottleneck. Yay!

update understand fast & slow stuff

6053b5f

update docstring of isNumber

fa1a94d

finish the understandin doc

16941c7

update understand...org (final?) time

ed6a6be

add StatKind identity, count and bin

fd9fced

put freqpoly, bar gobj creation impl back in

ff8a868

refactor out tryGetStyle

17b4d18

use bin, count as default for geom_freqpoly, geom_bar

3310a5e

allow summarize to take implicit formula (no LHS), add test case

ff03721

add test case for CSV parsing with whitespace

2cb9f25

add length proc for persvector to determine length in formula

f56d70e

lift a bunch more procs from math (taken from seqmath)

5cdf427

add count proc for DFs and add test

833b15d

Count is in principle a convenience proc for `group_by` and `summarize`

add $ for Scale, helps printing of ids field

19f97f9

Otherwise the `ids` set will be printed completely. Since we often have the set full this "breaks" printing of scales.

also hash marker field of Style

cb81550

Since `marker` is now a field of a ginger.Style, should hash it.

make [], []= inline and add []= for data frame at idx

e809e6a

Adds a proc to assign a specific value to the position `idx` in column `k` to value `val`.

Vindaar added 20 commits November 15, 2019 20:28

fix $ for Geom to include previously added fields

79e1571

add dcKind* fields to filledGeom

704059a

These are requried to decide which kind of plot to draw. I.e. is it a discrete or continuous plot along each axis. Note that discrete in y direction is not implemented yet!

allow fkTerm in RHS of transmute

0ce1b87

Previously this forbid the application of more complicated functions than simply e.g. "tan("cty")"

fix ggdraw to use the view of PlotView

0e3d206

draw discrete ticks in relative coordinates, ref. ginger v0.1.4

d40ab9b

Previously wasn't allowed to draw ticks with relative coordinates, but for discrete ticks it makes sense, since the labels aren't numerical anyways and we know perfectly well where the ticks are going to be.

add binning related args to geom_* procs

6d9183b

This allows complete control both about the binning for the stat_bin like case as well as the interpretation in the stat identity case for the data (in case user hands prebinned DF and wants geom_point overlaid onto center of bins)

change args to generateFacetPlot, content still commented out

8210de5

We still have to add facet plots back in. This will simply be done by having a facet field in the FilledScales and if facet is set in the GgPlot object we'll group_by the desired field before the `postProcessScales` step.

fix test cases for new changes in type structure

1e768d5

Have to extract the ginger view from the PlotView object now, instead of receiving it from ggcreate directly. Pretty surprised that the tests still passed though!

fix arg to histogram when using binWidth

5eb804d

add large number of new recipes

f122d1f

only use select in fillScale if mapping, not if setting data

c52bf81

If a scale does not exist, so the user wants to set some value we shouldn't use `select`. Thus use `getIdentityData` here as well for the DF we hand to `fillScaleImpl`

add all new recipe media files

4c053d1

add nimble task to generate recipes and pngs from it

6208d59

add test case which compares recipes output plots

a1c9312

Test compares the resulting PDF files line by line except the line containing the creation date.

update docs to current version

a88d0dc

fix image links in recipes

a132cbf

Vindaar force-pushed the addGatherAndFix17 branch 3 times, most recently from f758336 to 374f1b7 Compare November 16, 2019 06:18

Vindaar added 3 commits November 16, 2019 08:24

fix freq poly recipe plot, update PNGs

ccd3ea9

use HashSet in DF count test

c15836f

Apparently the order is machine specific after all, hehe.

run tests, install ntangle, fix link & header in recipes

c8cb5c2

Vindaar force-pushed the addGatherAndFix17 branch from 3cf821c to d88cdd3 Compare November 16, 2019 07:26

fix path in recipes, don't use shell for now, fix travis, add img

becf674

Vindaar force-pushed the addGatherAndFix17 branch from 2970bf8 to becf674 Compare November 16, 2019 08:13

Vindaar merged commit af01e28 into master Nov 16, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add `gather`, `count`, fix #17, fix `isNumber`, fix #19 #18

add `gather`, `count`, fix #17, fix `isNumber`, fix #19 #18

Vindaar commented Nov 5, 2019 •

edited

Vindaar commented Nov 7, 2019

Vindaar commented Nov 12, 2019

add gather, count, fix #17, fix isNumber, fix #19 #18

add gather, count, fix #17, fix isNumber, fix #19 #18

Conversation

Vindaar commented Nov 5, 2019 • edited

Update:

Update 2

Vindaar commented Nov 7, 2019

Vindaar commented Nov 12, 2019

add `gather`, `count`, fix #17, fix `isNumber`, fix #19 #18

add `gather`, `count`, fix #17, fix `isNumber`, fix #19 #18

Vindaar commented Nov 5, 2019 •

edited