John Tukey’s stem-and-leaf plot first appeared in 1970. Although very useful back then, it cannot handle more than 300 data points and is completely text-based. Stemgraphic is a very easy to use python package providing a solution to these limitations (no size limit, graphical tool). It also supports categorical and text as input.
A typical stem_graphic output:
For an in depth look at the algorithms and the design of stemgraphic, see
Stemgraphic: A Stem-and-Leaf Plot for the Age of Big Data
Documentation is available as pdf stemgraphic.pdf and online html.
The official website of stemgraphic is: http://stemgraphic.org
Stemgraphic requires docopt, matplotlib and pandas. Optionally, having Scipy installed will give you secondary plots and Dask (see requirements_dev.txt for all needed to run all the functional tests) will allow for out of core, big data visualization.
Installation is simple:
pip3 install -U stemgraphic
or from this cloned repository, in the package root:
python3 setup.py install
- Bugfix for VERSION
- back-to-back stem-and-leaf plots can use predefined axes (secondary ax added)
- added quantize function (basically a round trip number->stem-and-leaf->number))
- density_plot added for numerical values with stem-and-leaf quantization and sampling
- density_plot also support multiple secondary plots like box, violin, rug, strip
- notebook demoing density_plot
- notebook demoing comparison of violin, box and stem-and-leaf for certain distributions
Version bump to 0.6 due to order of params changing. Shouldn't affect using named args
Major code change and expansion for num.stem_graphic including:
- back-to-back stem-and-leaf plots
- allows comparison of very skewed data
- bug fix (rounding issue) due to python precision
- better stem handling
- alpha down to 10% for bars
- median alpha can be specified
- stems can be hidden
- added title option, besides the legend
Other changes:
- More notebook examples
- added leaf_skip, stem_skip to a few functions missing them
- heatmap_grid bugfix
- added reverse to a few functions missing it
- improved documentation
- matrix_difference ord param added added
- ngram_data now properly defaults to case insensitive
- switched magenta to 'C4' - compatible with mpl styles now
- functions to read/write .npy and .pkl files
- more unicode typographical glyphs added to the list of non alpha
- scatter 3d support
- added 3rd source to compare (in 3d) with scatter plots
- more scatter plot fixes
- some warnings added to deal with 3d and log scale issues
- added fig_xy to scatter - useful to quickly adjust figsize in a notebook
- added normalize, percentage and whole (integer) to scatter
- added alpha to scatter
- added documentation for scatter plots
- added jitter to scatter plots
- added log scale to scatter plots
- more notebooks
- stem_text legend fix
- missed adding the code for scatter plots
- more notebooks
Major new release.
- All 0.4.0 private changes were merged
- new module stemgraphic.alpha:
- n-gram support
- stem_graphic supporting categorical
- stem_graphic supporting text
- stem_text supporting categorical
- stem_text supporting text
- stem command line supporting categorical when column specified
- heatmap for n-grams
- heatmap grid to compare multiple text sources
- Frobenius norm on diff matrices
- radar plot with Levenshtein distance
- frequency plot (bar, barh, hist, area, pie)
- sunburst char
- interactive charts with cufflinks
- new module stemgraphic.num to match .alpha
- stop word dictionaries for English, Spanish and French
- Massively improved documentation of modules and functions
- Improved HTML documentation
- Improved PDF documentation
Internal release for customer.
-
Added Heatmap
-
Basic PDF documentation
-
Quickstart notebook
Matploblib 2.0 compatibility
-
Persist sample from command line tool (-k filename.pkl or -k filename.csv).
-
Windows compatible bat file wrapper (stem.bat).
-
Added full command line access to dask distributed server (-d, -s, use file in '' when using glob / wildcard).
-
For operations with dask, performance has been increased by 25% in this latest release, by doing a compute once of min, max and count all at once. Count replaces len(x).
Added the companion PDF as it will be presented at PyData Carolinas 2016.
- multivariate support
- provide support for secondary plots with dask
- automatic dense layout
- add a way to provide an alternate function to the sampling
- support for spark rdds and/or sparkling pandas
- create a bokeh version. Ideally rbokeh too.
- add unit tests
- add feather, hdf5 etc support, particularly on sample persistence
- more charts
- more examples