Basic plot functionality for Series #294
@dvgodoy thanks for the first PR! I know this is WIP, but can you describe (either in code or just as comments here) the summarization algorithm you use for each type of plot? I think eventually we should document those in code as part of the docstrings, but it'd be great to discuss them here too.
@rxin Sure, I've added comments in the code, but I can outline them here as well. The idea is to create Koalas-specific classes that inherit from the original pandas plotting classes BarPlot, HistPlot and BoxPlot. A lot can be accomplished implementing the method
Regarding the summarizing algorithms:
One question: is there a preferred way to handle testing for plots? In the past, I've handled this by converting the generated and expected figures to base64 and then comparing them - it worked fine for Histogram and Bar plots, as both Spark and pandas produced exactly the same numbers.
@dvgodoy do you know how pandas tests plots? base64 probably works, but it'd be somewhat difficult to inspect if anything goes wrong.
Also cc @falaki, who's been our in-house plotting expert (although more on the R side).
databricks/koalas/series.py
@@ -89,6 +90,7 @@ class Series(_Frame):
:ivar _index_info: Each pair holds the index field name which exists in Spark fields,
and the index name.
"""
plot = CachedAccessor("plot", KoalasSeriesPlotMethods)
is this our lazy_property defined in utils.py?
Yes, it was not there when I started my PR, but I've merged master into my branch and changed code to use it.
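For readers unfamiliar with the pattern, here is a minimal sketch of how a cached accessor descriptor generally works. It is not the Koalas or pandas implementation: `ToySeries` and `ToyPlotMethods` are hypothetical names, and the `CachedAccessor` class below is a simplified stand-in written only to illustrate the idea of building the accessor once per instance.

```python
class CachedAccessor:
    """Descriptor that builds an accessor object once per instance and caches it."""

    def __init__(self, name, accessor_cls):
        self._name = name
        self._accessor_cls = accessor_cls

    def __get__(self, obj, owner=None):
        if obj is None:
            # Accessed on the class itself: hand back the accessor class.
            return self._accessor_cls
        accessor = self._accessor_cls(obj)
        # Store the accessor in the instance __dict__ under the same name.
        # This is a non-data descriptor (no __set__), so later lookups find
        # the instance attribute first and never call __get__ again.
        obj.__dict__[self._name] = accessor
        return accessor


class ToyPlotMethods:
    """Hypothetical accessor; a real one would expose bar(), hist(), etc."""

    def __init__(self, data):
        self.data = data

    def describe(self):
        return "plot accessor for {} values".format(len(self.data.values))


class ToySeries:
    """Hypothetical stand-in for a Series-like class."""

    plot = CachedAccessor("plot", ToyPlotMethods)

    def __init__(self, values):
        self.values = list(values)
```

With this sketch, `ToySeries([1, 2, 3]).plot` constructs `ToyPlotMethods` on first access and returns the same object on every later access.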
@dvgodoy I also just went through your algorithms. They make intuitive sense to me. I'm going to talk with a couple more people tomorrow to get their thoughts as well. For bar plots: if a DataFrame has more than 1000 values, can we show some text in the generated plot saying we only take the first 1000 values? That'd be a useful message to get. We can also do that without computation overhead by just taking the first 1001 values; if we get more than 1000 back, we know the data has more than 1000 values.
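The "take 1001" trick can be sketched in plain Python as below. This is an illustration under stated assumptions, not the code in this PR: `take_capped` and `truncation_note` are hypothetical helpers, and `itertools.islice` stands in for whatever the Spark side would use (presumably something like `DataFrame.limit(limit + 1)`).

```python
from itertools import islice


def take_capped(values, limit=1000):
    """Return (first `limit` items, truncated flag), reading at most limit + 1 items."""
    head = list(islice(values, limit + 1))
    # One extra item is enough to know there is more data than the cap,
    # without counting the whole collection.
    truncated = len(head) > limit
    return head[:limit], truncated


def truncation_note(truncated, limit=1000):
    """Text that could be drawn on the plot when the data was cut off (hypothetical wording)."""
    return "showing only the first {} values".format(limit) if truncated else ""
```

For example, `take_capped(range(5000))` returns the first 1000 items with the flag set, while `take_capped(range(1000))` returns all 1000 items with the flag cleared.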
@dvgodoy I talked with @falaki today and one thing he suggested was to make it more explicit in code that there are two parts to visualization: (1) the summarization step, which is unique to big data, and (2) the visualization part, which is almost identical to pandas. We can then write unit tests specifically for summarization, and just have limited integration tests verifying the pixels like you stated with base64 encoding.
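One way to picture the two-part split is the sketch below, using a histogram as the example. It is an assumption-laden illustration rather than the PR's implementation: `summarize_histogram` and `render_histogram` are hypothetical names, and the summarization runs over a plain Python list where the real code would compute the bins with Spark.

```python
def summarize_histogram(values, bins=10):
    """Summarization step: reduce raw values to bin edges and counts."""
    values = list(values)
    if not values:
        raise ValueError("cannot summarize an empty collection")
    lo, hi = min(values), max(values)
    if lo == hi:
        # Degenerate range: widen it so the bins have a non-zero width.
        lo, hi = lo - 0.5, hi + 0.5
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(bins + 1)]
    counts = [0] * bins
    for v in values:
        # The maximum value is clamped into the last bin (closed on the right).
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return edges, counts


def render_histogram(edges, counts):
    """Visualization step: draw pre-computed bins; it never sees the raw data."""
    import matplotlib
    matplotlib.use("agg")  # non-interactive backend, safe on headless machines
    from matplotlib import pyplot as plt

    fig, ax = plt.subplots()
    widths = [right - left for left, right in zip(edges[:-1], edges[1:])]
    ax.bar(edges[:-1], counts, width=widths, align="edge")
    return fig
```

Because `summarize_histogram` returns plain lists, it can be unit tested with exact comparisons, leaving only a small number of pixel-level checks for `render_histogram`.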
@rxin Thanks for the suggestions. I've made changes in that direction already.
Conflicts: databricks/koalas/series.py
@rxin I've fixed the bar plot and added tests and documentation, so it is now possible to plot values for a single column (no groupby supported yet). The next step (in another PR) is to add support for groupby and, after that, go for DataFrames and multiple columns. I've been struggling with the Travis build, though. At first I made some mistakes with the docstrings, because the example was incomplete. Then I fixed it, but I kept getting failing builds, despite several attempts to figure out what is going on. For some reason, it just crashes after And, of course, in my local setup, it works. Do you have any idea of what I am doing wrong?
import base64
from io import BytesIO
from matplotlib import pyplot as plt
@dvgodoy you are not setting the backend, which is going to fail in headless environments like Travis. You can actually see that on your laptop: when you run the tests, a new window appears.
Right between from io... and from matplotlib..., put the following lines:
import matplotlib
matplotlib.use('agg')
It should work.
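Put together, the suggested import order and the base64 approach mentioned earlier in the thread might look like the sketch below. This is not the test code from the PR: `figure_to_base64` is a hypothetical helper, and the bar chart is made-up sample data.

```python
import base64
from io import BytesIO

import matplotlib
matplotlib.use("agg")  # select the non-interactive backend before importing pyplot
from matplotlib import pyplot as plt


def figure_to_base64(fig):
    """Render a figure to PNG in memory and return it as a base64 string."""
    buffer = BytesIO()
    fig.savefig(buffer, format="png")
    plt.close(fig)  # release the figure so repeated test runs do not accumulate them
    return base64.b64encode(buffer.getvalue()).decode("ascii")


# Made-up sample data, only to have something to render.
fig, ax = plt.subplots()
ax.bar(["a", "b", "c"], [3, 1, 2])
encoded = figure_to_base64(fig)
```

A test would then compare the string produced from the Koalas plot with the one produced from the equivalent pandas plot; as noted above, a mismatch is hard to inspect, which is one reason to keep such pixel-level checks few.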
@dvgodoy can you try the suggestion above?
Also, can you resolve the merge conflicts?
@thunterdb Thanks, I will do it! What I find puzzling is that I use Travis with my HandySpark and, even though I do not set the backend there as you suggested, I never had these problems. That's why I would never have thought of this as an issue here.
@thunterdb I've tried your suggestion but Travis is still crashing at the same point - right after
Codecov Report
@@ Coverage Diff @@
## master #294 +/- ##
==========================================
+ Coverage 93.12% 93.15% +0.02%
==========================================
Files 28 29 +1
Lines 3448 3694 +246
==========================================
+ Hits 3211 3441 +230
- Misses 237 253 +16
@thunterdb I've finally passed all checks! :-)
@dvgodoy my apologies for the delay, glad to hear that you found a solution. I am a bit short on time over the next few weeks. @HyukjinKwon can you assist in the review?
Also, @dvgodoy, would you mind resolving the conflicts? I think that this PR adds enough functionality that we do not need further features for now. Additional plots can happen separately.
docs/source/reference/series.rst
Series.hist

Datetime Methods
----------------
why are datetime methods included?
My mistake... I've updated it. But, this time, as I inserted the plot functions after the other accessors, I ended up moving the conversion methods and they appear as both deleted and included on the PR.
Softagram Impact Report for pull/294 (head commit: c31ef3e)
I've resolved the conflicts and the checks passed :-)
I’m going to merge this. We can improve the functionality and add new features as follow-up PRs. Thanks @dvgodoy!
This is nice!
As mentioned in #293, this PR creates Series.plot functions for plotting data in Koalas.Series.
The idea is to use pandas.plotting._core as the base for inheritance, as well as a source of functions/methods to copy and then adjust so that they compute the necessary summarized data using Spark.