
Basic plot functionality for Series #294

Merged
merged 26 commits into databricks:master on Jun 10, 2019

Conversation

@dvgodoy (Collaborator) commented May 10, 2019

As mentioned in #293 , this PR creates Series.plot functions for plotting data in Koalas.Series.

The idea is to use pandas.plotting._core as a base for inheritance, as well as to copy some functions/methods from it, and then adjust them to compute the necessary summarized data using Spark.

@rxin (Contributor) commented May 11, 2019

@dvgodoy thanks for the first PR! I know this is WIP, but can you describe (either in code or just as comments here) the summary algorithms you use for the different types of plots? I think eventually we should document those in code as part of the docstrings, but it'd be great to discuss them here too.

@dvgodoy (Collaborator, Author) commented May 12, 2019

@rxin Sure, I've added comments on the code, but I can outline them here as well.
I am currently focusing on 3 plots: bar, hist and box - scatter should be on DataFrame, not Series; my mistake when listing it in the issue.

The idea is to create Koalas-specific classes that inherit from the pandas plotting originals BarPlot, HistPlot and BoxPlot. A lot can be accomplished by implementing the method _compute_plot_data, and it turns out this is the only change needed for BarPlot. HistPlot demands some changes to other methods as well (_plot and _args_adjust), while BoxPlot requires copying and adapting a whole lot of functions from pandas plotting.

Regarding the summarizing algorithms:

  1. BarPlot: it simply limits the number of values to 1,000 and converts the result into pandas to use the default plotting capabilities - it is expected to be used typically in combination with value_counts(), like
    kdf.x.value_counts().plot.bar()
    At first, I thought of performing value_counts inside the method, as that would make sense for assessing the distributions of categorical variables, but it would break compatibility with the pandas functionality.

  2. HistPlot: Spark methods are invoked at two different moments
    2.1) bins: if the user provides the number of bins instead of the actual bin edges, the method computes the min and max values of the corresponding column and splits that interval into equally spaced bins.
    2.2) histogram: it creates a Bucketizer with the computed (or informed) bins, transforms the dataframe and groups by the newly created column to get the corresponding counts. It then converts the result into a pandas DataFrame (as there are only as many rows as bins). With the counts in hand, it is possible to call ax.hist with those counts as weights, thus generating the histogram.

  3. BoxPlot: this plot requires a lot more data handling to be built
    3.1) statistics (median, Q1, Q3): it uses the approx_percentile SQL function to compute those statistics with a default precision of 0.01 (i.e. an accuracy of 100, since the relative error is 1/accuracy) to make it fast - precision can be passed as a kwarg to fine-tune it. These statistics allow for the computation of Tukey's fences and the corresponding fliers / outliers.
    3.2) Using Tukey's fences, it creates a column to flag rows as fliers / outliers.
    3.3) For the non-outliers, it computes min and max values - these are the whiskers.
    3.4) If showfliers, it sorts the outliers in descending order by their absolute values and limits them to 1,000 points - plotting more than that would be pointless. The purpose of sorting is to make sure the most extreme values are plotted, should we run into a case with more than 1,000 points.
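The two-step HistPlot summarization above can be sketched locally, with numpy standing in for the Spark-side Bucketizer and group-by (the function name and data are illustrative, not the actual Koalas code):

```python
import numpy as np

def summarize_hist(values, num_bins=10):
    """Mimic the Spark-side summarization: equally spaced bins from
    the column's min/max, then per-bucket counts. In Koalas this runs
    in Spark (Bucketizer + groupby); numpy stands in here."""
    mn, mx = values.min(), values.max()
    bins = np.linspace(mn, mx, num_bins + 1)           # 2.1) bin edges
    buckets = np.digitize(values, bins[1:-1])          # 2.2) assign buckets
    counts = np.bincount(buckets, minlength=num_bins)  # per-bucket counts
    return bins, counts

bins, counts = summarize_hist(np.arange(10_000, dtype=float))
```

Only the (tiny) per-bucket counts leave the cluster; the actual bars are then drawn locally with ax.hist, passing the bin midpoints as data and the counts as weights.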
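Likewise, steps 3.1-3.4 of the BoxPlot summarization boil down to the following, with np.percentile standing in for Spark's approx_percentile (names and data are illustrative):

```python
import numpy as np

def summarize_box(values, whis=1.5, max_fliers=1000):
    """Mimic the Spark-side box-plot summarization; np.percentile
    stands in for the approx_percentile SQL function."""
    q1, med, q3 = np.percentile(values, [25, 50, 75])  # 3.1) statistics
    iqr = q3 - q1
    lfence, ufence = q1 - whis * iqr, q3 + whis * iqr  # Tukey's fences
    is_flier = (values < lfence) | (values > ufence)   # 3.2) flag outliers
    inliers = values[~is_flier]
    whiskers = (inliers.min(), inliers.max())          # 3.3) whiskers
    fliers = values[is_flier]                          # 3.4) keep the most
    fliers = fliers[np.argsort(-np.abs(fliers))][:max_fliers]  # extreme only
    return med, q1, q3, whiskers, fliers

data = np.concatenate([np.arange(1.0, 101.0), [1000.0]])
med, q1, q3, whiskers, fliers = summarize_box(data)
```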

@dvgodoy (Collaborator, Author) commented May 12, 2019

One question: any preferred way to handle testing for plots?

In the past, I've handled this by converting the figures to base64 and then comparing the generated and expected images - it worked fine for the histogram and bar plots, as both Spark and pandas produced exactly the same numbers.
It was a bit trickier for the boxplot, given the approximate statistics... in those cases, I compared pixel by pixel and assessed the differences, checking them against those I knew to derive from the approximation.

@rxin (Contributor) commented May 12, 2019

@dvgodoy do you know how pandas tests plots? base64 probably works, but it'd be somewhat difficult to inspect if anything goes wrong.

@rxin (Contributor) commented May 12, 2019

Also cc @falaki, who's been our in-house plotting expert (although more on the R side).

@@ -89,6 +90,7 @@ class Series(_Frame):
:ivar _index_info: Each pair holds the index field name which exists in Spark fields,
and the index name.
"""
plot = CachedAccessor("plot", KoalasSeriesPlotMethods)
Contributor

is this our lazy_property defined in utils.py?

Collaborator Author

Yes. It was not there when I started my PR, but I've since merged master into my branch and changed the code to use it.

@rxin (Contributor) commented May 13, 2019

@dvgodoy I also just went through your algorithms. They make intuitive sense to me. I'm going to talk with a couple more people tomorrow to get their thoughts as well.

For bar plots -- if a DataFrame has more than 1000 values, can we show some text in the generated plot saying we only take the first 1000 values? That'd be a useful message to get. We can also do that without computation overhead by just taking the first 1001 values, and if it is greater than 1000, we know we have more than 1000 values.
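The take-n-plus-one trick can be sketched as follows (a pure-pandas stand-in for illustration; in Koalas the limit would happen on the Spark side before collecting):

```python
import pandas as pd

def head_with_truncation_flag(series, max_rows=1000):
    """Fetch max_rows + 1 values; if more than max_rows come back,
    we know the data was truncated, without a separate count()."""
    head = series.iloc[:max_rows + 1]  # stands in for a Spark-side head(n + 1)
    truncated = len(head) > max_rows
    return head.iloc[:max_rows], truncated

data, truncated = head_with_truncation_flag(pd.Series(range(1500)))
```

When truncated is set, the generated plot can carry a note such as "showing top 1,000 elements only".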

@rxin (Contributor) commented May 13, 2019

@dvgodoy I talked with @falaki today and one thing he suggested was to make it more explicit in code that there are two parts to visualization: (1) the summarization step, which is unique to big data, and (2) the visualization part, which is almost identical to pandas.

We can then write unit tests specifically for summarization, and just have limited integration tests verifying the pixels like you stated with base64 encoding.

@dvgodoy (Collaborator, Author) commented May 13, 2019

@rxin Thanks for the suggestions. I've made changes in that direction already.

  • added a message "showing top 1,000 elements only" when we do get 1001 values returned
  • split Spark and plotting code into two different classes for both Hist and Box plots (Bar plot is so straightforward there was no need for it)
  • I checked how pandas tests its plots: it uses ax.get_lines() and then get_xydata() for each line drawn to compare plots. I like this better than my initial base64 idea and will follow it when implementing the tests
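That pandas-style check compares the data actually drawn on the Axes rather than pixels; a minimal sketch (using the headless agg backend so it runs in CI):

```python
import matplotlib
matplotlib.use('agg')  # headless backend, no display needed
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
x = np.arange(5)
ax.plot(x, x ** 2)

# inspect the drawn data, line by line, instead of comparing images
(line,) = ax.get_lines()
xy = line.get_xydata()
assert np.allclose(xy, np.column_stack([x, x ** 2]))
plt.close(fig)
```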

@rxin (Contributor) commented May 13, 2019

Cool. I know this is still WIP, but I tested the three plots. Two questions:

  1. bar failed. Is this expected?
  2. Why is "frequency" for hist 1000?

[screenshot attached]

BTW - depending on how complex the PR gets, we might ask you to split it into three separate PRs so we can review and merge faster.

@dvgodoy (Collaborator, Author) commented May 14, 2019

  1. It is a problem indeed, I need to fix it.
  2. Frequency is 1,000 because the default number of bins is 10, so the points were evenly split into 10 bars of 1,000 each.
    I am working on developing the tests now, so hopefully it will be ok soon.

@dvgodoy (Collaborator, Author) commented May 18, 2019

@rxin I've fixed the bar plot and added tests and documentation - so it is now possible to plot values for a single column (no groupby supported yet). The next step (in another PR) is to add support for groupby and, after that, move on to DataFrames and multiple columns.

I've been struggling with the Travis build, though. At first I made some mistakes with the docstrings, because the example was incomplete. Then I fixed it, but I kept getting failing builds, despite several attempts to figure out what was going on.

For some reason, it just crashes after databricks/koalas/tests/test_dataframe_conversion.py::DataFrameConversionTest::test_csv - next test would be to_clipboard.

And, of course, in my local setup, it works. Do you have any idea of what I am doing wrong?


import base64
from io import BytesIO
from matplotlib import pyplot as plt
Contributor

@dvgodoy you are not setting the backend, which is going to fail in headless environments like Travis. You can actually see that on your laptop: when you run the tests, a new window appears.

Put the following lines right between the from io... and the from matplotlib... imports:

import matplotlib
matplotlib.use('agg')

It should work.
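Applied to the imports shown in the diff above, the suggested fix would look like this (the key point is calling matplotlib.use before pyplot is imported):

```python
import base64
from io import BytesIO

import matplotlib
matplotlib.use('agg')  # select the non-interactive backend for headless CI
from matplotlib import pyplot as plt
```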

@thunterdb (Contributor)

@dvgodoy can you try the suggestion above?

@thunterdb (Contributor)

Also, can you solve the merging conflicts?

@dvgodoy (Collaborator, Author) commented May 21, 2019

@thunterdb Thanks, I will do it!

What I find puzzling is that I use Travis with my HandySpark project, and even though I do not set the backend there as you suggested, I never had these problems. That's why I would never have thought of this as an issue here.

@dvgodoy (Collaborator, Author) commented May 25, 2019

@thunterdb I've tried your suggestion but Travis is still crashing at the same point - right after databricks/koalas/tests/test_dataframe_conversion.py::DataFrameConversionTest::test_csv.
It never gets to output the result of test_to_clipboard.

@codecov-io commented May 29, 2019

Codecov Report

Merging #294 into master will increase coverage by 0.02%.
The diff coverage is 93.54%.


@@            Coverage Diff             @@
##           master     #294      +/-   ##
==========================================
+ Coverage   93.12%   93.15%   +0.02%     
==========================================
  Files          28       29       +1     
  Lines        3448     3694     +246     
==========================================
+ Hits         3211     3441     +230     
- Misses        237      253      +16
Impacted Files Coverage Δ
databricks/koalas/missing/series.py 100% <ø> (ø) ⬆️
databricks/koalas/series.py 92% <85.71%> (-0.12%) ⬇️
databricks/koalas/plot.py 93.77% <93.77%> (ø)

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 05b55e4...c31ef3e. Read the comment docs.

@dvgodoy (Collaborator, Author) commented May 31, 2019

@thunterdb I've finally passed all checks! :-)
The build was crashing (without a message) due to this: QXcbConnection: Could not connect to display :0.0.
It turns out that setting the backend was not enough, as the functions that copy to the clipboard were triggering the error whenever matplotlib was imported (even if there was no code using it at all!).
I was able to fix this by including export QT_QPA_PLATFORM="offscreen" in the before_script section of Travis' YAML file.
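For reference, the fix amounts to one extra line in the Travis config (a sketch of the before_script entry, not the exact .travis.yml from this PR):

```shell
# In .travis.yml, under before_script:
# force Qt to render offscreen so that importing matplotlib (or the
# clipboard helpers) never tries to connect to a display on the CI box.
export QT_QPA_PLATFORM="offscreen"
```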
This PR should be good to go now, could you please review it?

@thunterdb (Contributor)

@dvgodoy my apologies for the delay, glad to hear that you found a solution.

I am a bit constrained in time over the next few weeks. @HyukjinKwon, can you assist with the review?

@thunterdb (Contributor)

Also, @dvgodoy , would you mind resolving the conflicts?

I think that this PR adds enough functionality that we do not need further features for now. Additional plots can happen separately.

Series.hist

Datetime Methods
----------------
Contributor

why are datetime methods included?

Collaborator Author

My mistake... I've updated it. But this time, as I inserted the plot functions after the other accessors, I ended up moving the conversion methods, so they appear as both deleted and added in the PR.

@softagram-bot

Softagram Impact Report for pull/294 (head commit: c31ef3e)


@dvgodoy (Collaborator, Author) commented Jun 9, 2019

Hi @thunterdb @HyukjinKwon

I've solved the conflicts and the checks passed :-)
I agree this is too big already, so we can leave other plots/groupby to a different PR.

@rxin rxin changed the title [WIP] Series plot accessor Basic plot functionality for Series Jun 10, 2019
@rxin (Contributor) commented Jun 10, 2019

I’m going to merge this. We can improve the functionality and add new features as follow-up PRs.

Thanks @dvgodoy!

@rxin rxin merged commit 66f5119 into databricks:master Jun 10, 2019
@HyukjinKwon (Member)

This is nice!

@HyukjinKwon HyukjinKwon mentioned this pull request Oct 7, 2020