Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add gather, count, fix #17, fix isNumber, fix #19 #18

Merged
merged 72 commits into from
Nov 16, 2019
Merged

Conversation

Vindaar
Copy link
Owner

@Vindaar Vindaar commented Nov 5, 2019

See update below

This PR first of all changes our definition of what constitutes a number, since our definition was broken (question remains why we still use that proc, instead of parseutils.parseBiggestFloat).

Then adds gather, so that the plot mentioned in issue #17 can be written as:

let dfNew = df.gather("C1", "C2", key = "Channel", value = "value")
ggplot(dfNew, aes("s", "value", color = "Channel")) +
  geom_line() +
  ggsave("correct_scale.pdf")

instead of the handling with select and bind_rows.

Finally it will fix #17 properly.

  • test cases for gather
  • fix y scale of several axes is not combined #17
  • test cases for 17
  • fix stat="identity" for aes #19
  • adds count for DF
  • fix all tests according to rewrite
  • put facet wrap back in will be done in another PR soon
  • fix code for continuous (fill)color, shape, size scales
  • write new tests for new features for now done via plot comparison
  • add more recipes
  • completely rewrite drawing logic
  • actually make e.g. geom_point w/ stat="identity" work
  • add tests comparing PNG output of recipes as PPMs

Update:

This turned out to be a major rewrite of a big chunk of the ggplotnim.nim file. Essentially the calculation part of the code and the drawing portion was better split now.

It also contains a (hastily written) discussion of a performance bottleneck I encountered during the rewrite. While the specific case of that has been fixed, it seems to be a bigger problem, which is GC related. If weird performance regressions happen, attempt to compile the code with --gc:boehm, which does not suffer from the same slowdowns.

Taken from the commit with the major changes:

This turned out to be a major rewrite of a big chunk of the ggplotnim.nim file. Essentially the calculation part of the code and the drawing portion was better split now.

Taken from the commit with the major changes:
This commit performs a major rewrite of the way the data is prepared
for the plots and the data is accessed during the drawing process.

Instead of performing calculations when the drawing (that is
construction of GraphObjects) is done and having to modify different
parameters like scales etc. again and again is now completely avoided.

Everything required is now calculated beforehand and stored in a
FilledScales object, which has several FilledGeom objects (one for
each geom call) which bring everything that is required and provide a
unified, DataFrame based, access to the data required for each type of
geom.

This implementation might be somewhat slower for certain types of
plots, but the separation of drawing and preprocessing is worth it in
my opinion.

It might be slower simply due to more DataFrame based operations
taking place (due to inefficiencies in the DF impls) and possibly
unnecessary copies.
The former has now a direct motivation to be improved upon and the
latter will be fixed (which is now easier done).

There's still quite some code that is not needed anymore, which has to
be removed. With this commit simply all plots I created and used
before hand are now working correctly again and expands some features,
including adding (working) stats fields to the different geoms, so
that e.g. a geom_histogram can take stat="identity" and be drawn
from already precomputed bins and counts. Thus fixes issue #19.

Update 2

Well, after the last change, which rewrote the data processing I then realized that the drawing logic wasn't really made to handle the new data that was available. So I rewrote that code too.

Now the geom that's being drawn is almost decoupled from the calculations that happen for the data, so that e.g. points can be used to represent (or add to) a bar plot, add points to a binned plot etc.

More recipes for all these cases will be added.

Previously our definition of what constitutes a number was off.

But why don't we use parseutils.parseBiggestFloat? I assume because
originally this thing was supposed to be simple? Or did we have cases
which were not covered by parseBiggestFloat?
The previous definition of the lift templates was problematic, if one
wanted to lift some proc only locally (which may be necessary in some
circumstances, e.g. when lifting a proc for a unit test). Set the
`toExport` static bool argument to false.

Also exports the string templates for the user to use.
@Vindaar
Copy link
Owner Author

Vindaar commented Nov 7, 2019

Rebased onto current master.

@Vindaar
Copy link
Owner Author

Vindaar commented Nov 12, 2019

Phew, this proved more work / thinking than I thought it would.

NOTE: I decided against using `parseBiggestFloat`, because while it
would have been easier (and possibly faster) it parses floats from a
string. That does not mean the whole string is a valid number.
We could probably have gotten away with something like:

```nim
var tmp: float64
let numParsed = parseBiggestFloat(s, tmp, 0)
if numParsed > 0 and numParsed == s.len - 1:
  result = true
```
but this still wouldn't have accounted for allowed spaces at the end
of the string... Well, I guess part of me wanted so see how hard it
would be to write a float checker (which probably still has bugs). And
well, we don't even want the number from this proc... :)
Fortunately the CSV parser =rowEntry= proc returns a =var string=,
which allows us to remove the white space in place before we assign
it.

This finally gets rid of our darn string copy bottleneck. Yay!
Count is in principle a convenience proc for `group_by` and `summarize`
Otherwise the `ids` set will be printed completely. Since we often
have the set full this "breaks" printing of scales.
Since `marker` is now a field of a ginger.Style, should hash it.
Adds a proc to assign a specific value to the position `idx` in column
`k` to value `val`.
Now `f{}` works both for cases where a called function is a field of
an object and also if simply a function is available under an
identifier and not a raw proc. This allows such a formula:

```nim
f{ s.col ~ s.trans(s.col) }
```

to compile (and work!), mapping the `s.col` to its transformed.
These are requried to decide which kind of plot to draw. I.e. is it a
discrete or continuous plot along each axis.
Note that discrete in y direction is not implemented yet!
Previously this forbid the application of more complicated functions
than simply e.g. "tan("cty")"
The `callHistogram` proc handles the binning related fields of the
Geom object and takes the correct information as input.

Also takes care to assign the `dcKind*` fields of the FilledGeom as
well as adding the `binWidths` field to the resulting DF for the
binning stat.
Previously wasn't allowed to draw ticks with relative coordinates, but
for discrete ticks it makes sense, since the labels aren't numerical
anyways and we know perfectly well where the ticks are going to be.
The assertion that the resulting `numX` should be the same as the
input geom.numBins is bad due to the additional ways to set the number
of bins beyond just the `numBins` geom field (binWidth, breaks etc).

Then we also don't want to assign the `numX` back to the geom, since
that is supposed to store the value the user assigned
originally. Later we're going to use `numX` only anyways.
This allows complete control both about the binning for the stat_bin
like case as well as the interpretation in the stat identity case for
the data (in case user hands prebinned DF and wants geom_point
overlaid onto center of bins)
Also raises an exception if an unsupported geom is to be post
processed instead of just ignoring it.

`geom_point` can now be binned due to changes to the binning related
fields of the Geom object. Information is available and accessible via
geom_point proc.
Instead of simply having one proc for each geom which has to deal with
all sorts of different cases we now divide the logic rather by:

- discreteness
- stat kinds
- and only then geoms

This allows to handle all sorts of different options in a more
streamlined way, especially regarding adding more options in the
future (at least I hope so).

Thought the code could probably still be compactified some more, but
well. Fine of for now.
We still have to add facet plots back in. This will simply be done by
having a facet field in the FilledScales and if facet is set in the
GgPlot object we'll group_by the desired field before the
`postProcessScales` step.
Have to extract the ginger view from the PlotView object now, instead
of receiving it from ggcreate directly.

Pretty surprised that the tests still passed though!
If a scale does not exist, so the user wants to set some value we
shouldn't use `select`. Thus use `getIdentityData` here as well for
the DF we hand to `fillScaleImpl`
Test compares the resulting PDF files line by line except the line
containing the creation date.
@Vindaar Vindaar force-pushed the addGatherAndFix17 branch 3 times, most recently from f758336 to 374f1b7 Compare November 16, 2019 06:18
@Vindaar Vindaar merged commit af01e28 into master Nov 16, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

stat="identity" for aes y scale of several axes is not combined
1 participant