
Basic plot functionality for Series #294

Merged
merged 26 commits into databricks:master on Jun 10, 2019

Conversation

@dvgodoy (Collaborator) commented May 10, 2019

As mentioned in #293 , this PR creates Series.plot functions for plotting data in Koalas.Series.

The idea is to use pandas.plotting._core as a base for inheritance, as well as to copy some functions/methods from it, and then adjust them to compute the necessary summarized data using Spark.

@rxin (Contributor) commented May 11, 2019

@dvgodoy thanks for the first PR! I know this is WIP, but can you describe (either in code or just as comments here) the summary algorithms you use for the different types of plots? I think eventually we should document those in code as part of the docstrings, but it'd be great to discuss them here too.

@dvgodoy (Collaborator, Author) commented May 12, 2019

@rxin Sure, I've added comments on the code, but I can outline them here as well.
I am currently focusing on 3 plots: bar, hist and box - scatter should be on DataFrame, not Series; my mistake when listing it in the issue.

The idea is to create Koalas-specific classes that inherit from the pandas plotting originals BarPlot, HistPlot and BoxPlot. A lot can be accomplished by implementing the method _compute_plot_data, and it turns out this is the only change needed for BarPlot. HistPlot demands some changes to other methods as well (_plot and _args_adjust), while BoxPlot requires copying and adapting a whole lot of functions from pandas plotting.

Regarding the summarizing algorithms:

  1. BarPlot: it simply limits the number of values to 1,000 and converts the result into pandas to use the default plotting capabilities - it is expected to be used typically in combination with value_counts(), like
    kdf.x.value_counts().plot.bar()
    At first, I thought of performing value_counts inside the method, as that would make sense for assessing the distributions of categorical variables, but it would break compatibility with the pandas functionality.

  2. HistPlot: Spark methods are invoked at two different moments
    2.1) bins: if the user provides the number of bins instead of the actual bin edges, the method computes the min and max values of the corresponding column and splits that interval into equally spaced bins.
    2.2) histogram: it creates a Bucketizer with the computed (or informed) bins, transforms the dataframe and groups by the newly created column to get the corresponding counts. It then converts the result into a pandas DataFrame (as there are only as many rows as bins). With the counts in hand, it is possible to call ax.hist with those counts as weights, thus generating the histogram.

  3. BoxPlot: this plot requires a lot more data handling to be built
    3.1) statistics (median, Q1, Q3): it uses the approx_percentile SQL function to compute those statistics with a default precision of 0.01 (i.e. an accuracy of 100, since the relative error is 1/accuracy) to make it fast - precision can be passed as a kwarg to fine-tune it. These statistics allow for the computation of Tukey's fences and the corresponding fliers / outliers.
    3.2) Using Tukey's fences, it creates a column to flag rows as fliers / outliers.
    3.3) For the non-outliers, it computes min and max values - these are the whiskers.
    3.4) If showfliers, it sorts the outliers in descending order by their absolute values and limits them to 1,000 points - plotting more than that would be pointless. The purpose of sorting is to make sure the most extreme values are plotted, should we run into a case with more than 1,000 points.
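The two-step HistPlot summarization above can be sketched locally, with numpy standing in for the Spark-side Bucketizer and group-by (the function name and data are illustrative, not the actual Koalas code):

```python
import numpy as np

def summarize_hist(values, num_bins=10):
    """Mimic the Spark-side summarization: equally spaced bins from
    the column's min/max, then per-bucket counts. In Koalas this runs
    in Spark (Bucketizer + groupby); numpy stands in here."""
    mn, mx = values.min(), values.max()
    bins = np.linspace(mn, mx, num_bins + 1)           # 2.1) bin edges
    buckets = np.digitize(values, bins[1:-1])          # 2.2) assign buckets
    counts = np.bincount(buckets, minlength=num_bins)  # per-bucket counts
    return bins, counts

bins, counts = summarize_hist(np.arange(10_000, dtype=float))
```

Only the (tiny) per-bucket counts leave the cluster; the actual bars are then drawn locally with ax.hist, passing the bin midpoints as data and the counts as weights.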
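Likewise, steps 3.1-3.4 of the BoxPlot summarization boil down to the following, with np.percentile standing in for Spark's approx_percentile (names and data are illustrative):

```python
import numpy as np

def summarize_box(values, whis=1.5, max_fliers=1000):
    """Mimic the Spark-side box-plot summarization; np.percentile
    stands in for the approx_percentile SQL function."""
    q1, med, q3 = np.percentile(values, [25, 50, 75])  # 3.1) statistics
    iqr = q3 - q1
    lfence, ufence = q1 - whis * iqr, q3 + whis * iqr  # Tukey's fences
    is_flier = (values < lfence) | (values > ufence)   # 3.2) flag outliers
    inliers = values[~is_flier]
    whiskers = (inliers.min(), inliers.max())          # 3.3) whiskers
    fliers = values[is_flier]                          # 3.4) keep the most
    fliers = fliers[np.argsort(-np.abs(fliers))][:max_fliers]  # extreme only
    return med, q1, q3, whiskers, fliers

data = np.concatenate([np.arange(1.0, 101.0), [1000.0]])
med, q1, q3, whiskers, fliers = summarize_box(data)
```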

@dvgodoy (Collaborator, Author) commented May 12, 2019

One question: any preferred way to handle testing for plots?

In the past, I've handled this by converting the figures to base64 and then comparing the generated and expected images - it worked fine for the histogram and bar plots, as both Spark and pandas produced exactly the same numbers.
It was a bit trickier for the boxplot, given the approximate statistics... in those cases, I compared pixel by pixel and assessed the differences, checking them against those I knew to derive from the approximation.

@rxin (Contributor) commented May 12, 2019

@dvgodoy do you know how pandas tests plots? base64 probably works, but it'd be somewhat difficult to inspect if anything goes wrong.

@rxin (Contributor) commented May 12, 2019

Also cc @falaki, who's been our in-house plotting expert (although more on the R side).

@@ -89,6 +90,7 @@ class Series(_Frame):
:ivar _index_info: Each pair holds the index field name which exists in Spark fields,
and the index name.
"""
plot = CachedAccessor("plot", KoalasSeriesPlotMethods)
Contributor

is this our lazy_property defined in utils.py?

Collaborator Author

Yes. It was not there when I started my PR, but I've since merged master into my branch and changed the code to use it.

@rxin (Contributor) commented May 13, 2019

@dvgodoy I also just went through your algorithms. They make intuitive sense to me. I'm going to talk with a couple more people tomorrow to get their thoughts as well.

For bar plots -- if a DataFrame has more than 1000 values, can we show some text in the generated plot saying we only take the first 1000 values? That'd be a useful message to get. We can also do that without computation overhead by just taking the first 1001 values, and if it is greater than 1000, we know we have more than 1000 values.
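The take-n-plus-one trick can be sketched as follows (a pure-pandas stand-in for illustration; in Koalas the limit would happen on the Spark side before collecting):

```python
import pandas as pd

def head_with_truncation_flag(series, max_rows=1000):
    """Fetch max_rows + 1 values; if more than max_rows come back,
    we know the data was truncated, without a separate count()."""
    head = series.iloc[:max_rows + 1]  # stands in for a Spark-side head(n + 1)
    truncated = len(head) > max_rows
    return head.iloc[:max_rows], truncated

data, truncated = head_with_truncation_flag(pd.Series(range(1500)))
```

When truncated is set, the generated plot can carry a note such as "showing top 1,000 elements only".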

@rxin (Contributor) commented May 13, 2019

@dvgodoy I talked with @falaki today and one thing he suggested was to make it more explicit in code that there are two parts to visualization: (1) the summarization step, which is unique to big data, and (2) the visualization part, which is almost identical to pandas.

We can then write unit tests specifically for summarization, and just have limited integration tests verifying the pixels like you stated with base64 encoding.

@dvgodoy (Collaborator, Author) commented May 13, 2019

@rxin Thanks for the suggestions. I've made changes in that direction already.

  • added a message "showing top 1,000 elements only" when we do get 1001 values returned
  • split Spark and plotting code into two different classes for both Hist and Box plots (Bar plot is so straightforward there was no need for it)
  • I checked how pandas tests its plots: it uses ax.get_lines() and then get_xydata() for each line drawn to compare plots. I like this better than my initial base64 idea and will follow it when implementing the tests
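That pandas-style check compares the data actually drawn on the Axes rather than pixels; a minimal sketch (using the headless agg backend so it runs in CI):

```python
import matplotlib
matplotlib.use('agg')  # headless backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
x = np.arange(5)
ax.plot(x, x ** 2)

# inspect the drawn data, line by line, instead of comparing images
(line,) = ax.get_lines()
xy = line.get_xydata()
assert np.allclose(xy, np.column_stack([x, x ** 2]))
plt.close(fig)
```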

@rxin (Contributor) commented May 13, 2019

Cool. I know this is still WIP, but I tested the three plots. Two questions:

  1. bar failed. Is this expected?
  2. Why is "frequency" for hist 1000?

[screenshot attached]

BTW - depending on how complex the PR gets, we might ask you to split it into three separate PRs so we can review and merge faster.

@dvgodoy (Collaborator, Author) commented May 14, 2019

  1. It is a problem indeed, I need to fix it.
  2. Frequency is 1,000 because the default number of bins is 10, so the points were evenly split into 10 bars of 1,000 each.
    I am working on developing the tests now, so hopefully it will be ok soon.

@dvgodoy (Collaborator, Author) commented May 18, 2019

@rxin I've fixed the bar plot and added tests and documentation - so it is now possible to plot values for a single column (no groupby supported yet). The next step (in another PR) is to add support for groupby and, after that, move on to DataFrames and multiple columns.

I've been struggling with the Travis build, though. At first I made some mistakes with the docstrings, because the example was incomplete. Then I fixed it, but I kept getting failing builds, despite several attempts to figure out what was going on.

For some reason, it just crashes after databricks/koalas/tests/test_dataframe_conversion.py::DataFrameConversionTest::test_csv - next test would be to_clipboard.

And, of course, in my local setup, it works. Do you have any idea of what I am doing wrong?


import base64
from io import BytesIO
from matplotlib import pyplot as plt
Contributor

@dvgodoy you are not setting the backend, which is going to fail in headless environments like Travis. You can actually see that on your laptop: when you run the tests, a new window appears.

Put the following lines right between the from io... and the from matplotlib... imports:

import matplotlib
matplotlib.use('agg')

It should work.
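Applied to the imports shown in the diff above, the suggested fix would look like this (the key point is calling matplotlib.use before pyplot is imported):

```python
import base64
from io import BytesIO

import matplotlib
matplotlib.use('agg')  # select the non-interactive backend for headless CI
from matplotlib import pyplot as plt
```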

@thunterdb (Contributor)

@dvgodoy can you try the suggestion above?

@thunterdb (Contributor)

Also, can you solve the merging conflicts?

@dvgodoy (Collaborator, Author) commented May 21, 2019

@thunterdb Thanks, I will do it!

What I find puzzling is that I use Travis with my HandySpark project, and even though I do not set the backend there as you suggested, I never had these problems. That's why I would never have thought of this as an issue here.

@dvgodoy (Collaborator, Author) commented May 25, 2019

@thunterdb I've tried your suggestion but Travis is still crashing at the same point - right after databricks/koalas/tests/test_dataframe_conversion.py::DataFrameConversionTest::test_csv.
It never gets to output the result of test_to_clipboard.

@codecov-io commented May 29, 2019

Codecov Report

Merging #294 into master will increase coverage by 0.02%.
The diff coverage is 93.54%.


@@            Coverage Diff             @@
##           master     #294      +/-   ##
==========================================
+ Coverage   93.12%   93.15%   +0.02%     
==========================================
  Files          28       29       +1     
  Lines        3448     3694     +246     
==========================================
+ Hits         3211     3441     +230     
- Misses        237      253      +16
Impacted Files Coverage Δ
databricks/koalas/missing/series.py 100% <ø> (ø) ⬆️
databricks/koalas/series.py 92% <85.71%> (-0.12%) ⬇️
databricks/koalas/plot.py 93.77% <93.77%> (ø)

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 05b55e4...c31ef3e. Read the comment docs.

@dvgodoy (Collaborator, Author) commented May 31, 2019

@thunterdb I've finally passed all checks! :-)
The build was crashing (without a message) due to this: QXcbConnection: Could not connect to display :0.0.
It turns out that setting the backend was not enough, as the functions that copy to the clipboard were triggering the error whenever matplotlib was imported (even if there was no code using it at all!).
I was able to fix this by including export QT_QPA_PLATFORM="offscreen" in the before_script section of Travis' YAML file.
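For reference, the fix amounts to one extra line in the Travis config (a sketch of the before_script entry, not the exact .travis.yml from this PR):

```shell
# In .travis.yml, under before_script:
# force Qt to render offscreen so that importing matplotlib (or the
# clipboard helpers) never tries to connect to a display on the CI box.
export QT_QPA_PLATFORM="offscreen"
```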
This PR should be good to go now, could you please review it?

@thunterdb (Contributor)

@dvgodoy my apologies for the delay, glad to hear that you found a solution.

I am a bit constrained in time over the next few weeks. @HyukjinKwon, can you assist with the review?

@thunterdb (Contributor)

Also, @dvgodoy , would you mind resolving the conflicts?

I think that this PR adds enough functionality that we do not need further features for now. Additional plots can happen separately.

Series.hist

Datetime Methods
----------------
Contributor

why are datetime methods included?

Collaborator Author

My mistake... I've updated it. But this time, as I inserted the plot functions after the other accessors, I ended up moving the conversion methods, so they appear as both deleted and added in the PR.

@softagram-bot

Softagram Impact Report for pull/294 (head commit: c31ef3e)


@dvgodoy (Collaborator, Author) commented Jun 9, 2019

Hi @thunterdb @HyukjinKwon

I've solved the conflicts and the checks passed :-)
I agree this is too big already, so we can leave other plots/groupby to a different PR.

@rxin rxin changed the title [WIP] Series plot accessor Basic plot functionality for Series Jun 10, 2019
@rxin (Contributor) commented Jun 10, 2019

I’m going to merge this. We can improve the functionality and add new features as follow-up PRs.

Thanks @dvgodoy!

@rxin rxin merged commit 66f5119 into databricks:master Jun 10, 2019
@HyukjinKwon (Member)

This is nice!

@HyukjinKwon HyukjinKwon mentioned this pull request Oct 7, 2020