Changelog

0.12.0 / 2016-11-03

DataFrame

Return a series when functions given to dataframe.map_partitions return scalars (:pr:`1515`)
Fix type size inference for series (:pr:`1513`)
dataframe.DataFrame.categorize no longer includes missing values in the categories. This is for compatibility with a pandas change (pandas-dev/pandas#10929) (:pr:`1565`)
Fix head parser error in dataframe.read_csv when some lines have quotes (:pr:`1495`)
Add dataframe.reduction and series.reduction methods to apply generic row-wise reduction to dataframes and series (:pr:`1483`)
Add dataframe.select_dtypes, which mirrors the `pandas method <http://pandas.pydata.org/pandas-docs/version/0.18.1/generated/pandas.DataFrame.select_dtypes.html>`_ (:pr:`1556`)
dataframe.read_hdf now supports reading Series (:pr:`1564`)
Support Pandas 0.19.0 (:pr:`1540`)
Implement select_dtypes (:pr:`1556`)
String accessor works with indexes (:pr:`1561`)
Add pipe method to dask.dataframe (:pr:`1567`)
Add indicator keyword to merge (:pr:`1575`)
Support Series in read_hdf (:pr:`1575`)
Support Categories with missing values (:pr:`1578`)
Support inplace operators like df.x += 1 (:pr:`1585`)
Str accessor passes through args and kwargs (:pr:`1621`)
Improved groupby support for single-machine multiprocessing scheduler (:pr:`1625`)
Tree reductions (:pr:`1663`)
Pivot tables (:pr:`1665`)
Add clip (:pr:`1667`), align (:pr:`1668`), combine_first (:pr:`1725`), and any/all (:pr:`1724`)
Improved handling of divisions on dask-pandas merges (:pr:`1666`)
Add groupby.aggregate method (:pr:`1678`)
Add dd.read_table function (:pr:`1682`)
Improve support for multi-level columns (:pr:`1697`) (:pr:`1712`)
Support 2d indexing in loc (:pr:`1726`)
Extend resample to include DataFrames (:pr:`1741`)
Support dask.array ufuncs on dask.dataframe objects (:pr:`1669`)

Array

Add information about how the dask.array chunks argument works (:pr:`1504`)
Fix field access with non-scalar fields in dask.array (:pr:`1484`)
Add concatenate= keyword to atop to concatenate chunks of contracted dimensions
Optimized slicing performance (:pr:`1539`) (:pr:`1731`)
Extend atop with concatenate= (:pr:`1609`), new_axes= (:pr:`1612`), and adjust_chunks= (:pr:`1716`) keywords
Add clip (:pr:`1610`), swapaxes (:pr:`1611`), round (:pr:`1708`), and repeat (:pr:``)
Automatically align chunks in atop-backed operations (:pr:`1644`)
Cull dask.arrays on slicing (:pr:`1709`)

Bag

Fix issue with callables in bag.from_sequence being interpreted as tasks (:pr:`1491`)
Avoid non-lazy memory use in reductions (:pr:`1747`)

Administration

Added changelog (:pr:`1526`)
Create new threadpool when operating from thread (:pr:`1487`)
Unify example documentation pages into one (:pr:`1520`)
Add versioneer for git-commit based versions (:pr:`1569`)
Pass through node_attr and edge_attr keywords in dot visualization (:pr:`1614`)
Add continuous testing for Windows with Appveyor (:pr:`1648`)
Remove use of multiprocessing.Manager (:pr:`1653`)
Add global optimizations keyword to compute (:pr:`1675`)
Micro-optimize get_dependencies (:pr:`1722`)

0.11.0 / 2016-08-24

Major Points

DataFrames now enforce knowing full metadata (columns, dtypes) everywhere. Previously we would operate in an ambiguous state when functions lost dtype information (such as apply). Now all dataframes always know their dtypes and raise errors asking for information if they are unable to infer (which they usually can). Some internal attributes like _pd and _pd_nonempty have been moved.

The internals of the distributed scheduler have been refactored to transition tasks between explicit states. This improves resilience, reasoning about scheduling, plugin operation, and logging. It also makes the scheduler code easier to understand for newcomers.
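
As a rough illustration of what transitioning tasks between explicit states can look like, here is a minimal sketch in plain Python. The state names, the `ALLOWED` table, and the `Task` class are hypothetical and chosen for illustration only; they are not taken from the distributed scheduler's source.

```python
# Hypothetical state table: each state maps to the states it may move to.
ALLOWED = {
    "waiting": {"processing", "erred"},
    "processing": {"memory", "erred", "waiting"},
    "memory": {"released"},
    "erred": {"released"},
    "released": {"waiting"},
}


class Task:
    """A task that records every state change it goes through."""

    def __init__(self, key):
        self.key = key
        self.state = "waiting"
        self.log = []  # (old_state, new_state) pairs, useful for debugging

    def transition(self, new_state):
        # Refuse any move that the table does not list explicitly.
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                "%s: illegal transition %s -> %s"
                % (self.key, self.state, new_state)
            )
        self.log.append((self.state, new_state))
        self.state = new_state


task = Task("x-1")
task.transition("processing")
task.transition("memory")
print(task.log)
```

Because every move goes through one function, logging and plugins can hook into a single place, and illegal moves fail loudly instead of leaving the task in an ambiguous state.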

Breaking Changes

The distributed.s3 and distributed.hdfs namespaces are gone. Use protocols in normal methods like read_text('s3://...') instead.
Dask.array.reshape now errs in some cases where previously it would have created a very large number of tasks

0.10.2 / 2016-07-27

More Dataframe shuffles now work in distributed settings, ranging from setting-index to hash joins, to sorted joins and groupbys.
Dask passes the full test suite when run under Python's optimized -OO mode.
On-disk shuffles were found to produce wrong results in some highly-concurrent situations, especially on Windows. This has been resolved by a fix to the partd library.
Fixed a growth of open file descriptors that occurred under large data communications
Support ports in the --bokeh-whitelist option to dask-scheduler to better route web interface messages behind non-trivial network settings
Some improvements to resilience to worker failure (though other known failures persist)
You can now start an IPython kernel on any worker for improved debugging and analysis
Improvements to dask.dataframe.read_hdf, especially when reading from multiple files and docs

0.10.0 / 2016-06-13

Major Changes

This version drops support for Python 2.6
Conda packages are built and served from conda-forge
The dask.distributed executables have been renamed from dfoo to dask-foo. For example dscheduler is renamed to dask-scheduler
Both Bag and DataFrame include a preliminary distributed shuffle.

Bag

Add task-based shuffle for distributed groupbys
Add accumulate for cumulative reductions

DataFrame

Add a task-based shuffle suitable for distributed joins, groupby-applys, and set_index operations. The single-machine shuffle remains untouched (and is much more efficient).
Add support for new Pandas rolling API with improved communication performance on distributed systems.
Add groupby.std/var
Pass through S3/HDFS storage options in read_csv
Improve categorical partitioning
Add eval, info, isnull, notnull for dataframes

Distributed

Rename executables like dscheduler to dask-scheduler
Improve scheduler performance in the many-fast-tasks case (important for shuffling)
Improve work stealing to be aware of expected function run-times and data sizes. This drastically increases the breadth of algorithms that can be efficiently run on the distributed scheduler without significant user expertise.
Support maximum buffer sizes in streaming queues
Improve Windows support when using the Bokeh diagnostic web interface
Support compression of very-large-bytestrings in protocol
Support clean cancellation of submitted futures in Joblib interface

Other

All dask-related projects (dask, distributed, s3fs, hdfs, partd) are now building conda packages on conda-forge.
Change credential handling in s3fs to only pass around delegated credentials if explicitly given secret/key. The default now is to rely on managed environments. This can be changed back by explicitly providing a keyword argument. Anonymous mode must be explicitly declared if desired.

0.9.0 / 2016-05-11

API Changes

dask.do and dask.value have been renamed to dask.delayed
dask.bag.from_filenames has been renamed to dask.bag.read_text
All S3/HDFS data ingest functions like db.from_s3 or distributed.s3.read_csv have been moved into the plain read_text, read_csv functions, which now support protocols, like dd.read_csv('s3://bucket/keys*.csv')

Array

Add support for scipy.LinearOperator
Improve optional locking to on-disk data structures
Change rechunk to expose the intermediate chunks

Bag

Rename from_filenames to read_text
Remove from_s3 in favor of read_text('s3://...')

DataFrame

Fixed numerical stability issue for correlation and covariance
Allow no-hash from_pandas for speedy round-trips to and from pandas objects
Generally reengineered read_csv to be more in line with Pandas behavior
Support fast set_index operations for sorted columns

Delayed

Rename do/value to delayed
Rename to/from_imperative to to/from_delayed

Distributed

Move s3 and hdfs functionality into the dask repository
Adaptively oversubscribe workers for very fast tasks
Improve PyPy support
Improve work stealing for unbalanced workers
Scatter data efficiently with tree-scatters

Other

Add lzma/xz compression support
Raise a warning when trying to split unsplittable compression types, like gzip or bz2
Improve hashing for single-machine shuffle operations
Add new callback method for start state
General performance tuning
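
To illustrate why formats like gzip are treated as unsplittable, the sketch below uses only the standard library (it is not dask code). It compresses some sample data and then tries to decompress starting from the middle of the stream. The sample data and variable names are illustrative.

```python
import gzip
import zlib

data = b"a line of text\n" * 10000
compressed = gzip.compress(data)

# Reading from the start works: the gzip header is where the reader expects it.
assert gzip.decompress(compressed) == data

# Reading from an arbitrary offset does not: the reader finds no valid header
# and lacks the preceding context that the compressed stream depends on.
middle = compressed[len(compressed) // 2:]
try:
    gzip.decompress(middle)
    readable = True
except (OSError, EOFError, zlib.error):
    readable = False

print("readable from the middle:", readable)
```

A file compressed this way therefore has to be read by a single task from the beginning, which is why attempting to split it produces a warning.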

0.8.1 / 2016-03-11

Array

Bugfix for range slicing that could periodically lead to incorrect results.
Improved support and resiliency of arg reductions (argmin, argmax, etc.)

Bag

Add zip function

DataFrame

Add corr and cov functions
Add melt function
Bugfixes for io to bcolz and hdf5

0.8.0 / 2016-02-20

Array

Changed default array reduction split from 32 to 4
Linear algebra, tril, triu, LU, inv, cholesky, solve, solve_triangular, eye, lstsq, diag, corrcoef.
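
The reduction split controls how many intermediate results are combined in each step of a tree reduction. The sketch below is plain Python rather than dask code; `tree_reduce` and its `split_every` parameter are illustrative names. It shows how a smaller split adds levels to the tree while reducing the number of inputs each step has to hold at once.

```python
def tree_reduce(func, items, split_every):
    """Reduce items in groups of split_every; return (result, depth)."""
    items = list(items)
    depth = 0
    while len(items) > 1:
        # Combine each group of up to split_every neighbours into one value.
        items = [
            func(items[i:i + split_every])
            for i in range(0, len(items), split_every)
        ]
        depth += 1
    return items[0], depth


chunks = range(1000)  # stand-ins for per-chunk partial results
print(tree_reduce(sum, chunks, 32))  # fewer, wider combine steps
print(tree_reduce(sum, chunks, 4))   # more, narrower combine steps
```

Both calls produce the same total; only the shape of the intermediate work differs.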

Bag

Add tree reductions
Add range function
Drop from_hdfs function (better functionality now exists in hdfs3 and distributed projects)

DataFrame

Refactor dask.dataframe to include a full empty pandas dataframe as metadata. Drop the .columns attribute on Series
Add Series categorical accessor and series.nunique
read_csv fixes (multi-column parse_dates, integer column names, etc.)
Internal changes to improve graph serialization

Other

Documentation updates
Add from_imperative and to_imperative functions for all collections
Aesthetic changes to profiler plots
Moved the dask project to a new dask organization

0.7.6 / 2016-01-05

Array

Improve thread safety
Tree reductions
Add view, compress, hstack, dstack, vstack methods
map_blocks can now remove and add dimensions

DataFrame

Improve thread safety
Extend sampling to include replacement options

Imperative

Removed optimization passes that fused results.

Core

Removed dask.distributed
Improved performance of blocked file reading
Serialization improvements
Test Python 3.5

0.7.4 / 2015-10-23

This was mostly a bugfix release. Some notable changes:

Fix minor bugs associated with the release of numpy 1.10 and pandas 0.17
Fixed a bug with random number generation that would cause repeated blocks due to the birthday paradox
Use locks in dask.dataframe.read_hdf by default to avoid concurrency issues
Change dask.get to point to dask.async.get_sync by default
Allow visualization functions to accept general graphviz graph options like rankdir='LR'
Add reshape and ravel to dask.array
Support the creation of dask.arrays from dask.imperative objects
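
The birthday paradox explains why independently drawn seeds collide sooner than intuition suggests. The sketch below estimates the chance that at least two of `n_blocks` blocks receive the same seed when seeds are drawn uniformly from `n_seeds` possible values. The function name and the 32-bit seed space are assumptions made for illustration; they are not taken from dask's implementation.

```python
import math


def collision_probability(n_blocks, n_seeds):
    """Approximate P(at least two blocks share a seed) for uniform draws."""
    # Standard birthday-problem approximation: 1 - exp(-n(n-1) / (2N)).
    return 1.0 - math.exp(-n_blocks * (n_blocks - 1) / (2.0 * n_seeds))


# Assumed 32-bit seed space: collisions become likely far below 2**32 blocks.
for n_blocks in (1000, 100000, 1000000):
    print(n_blocks, round(collision_probability(n_blocks, 2**32), 4))
```

Two blocks that share a seed generate identical random data, which is how repeated blocks can appear in a large array.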

Deprecation

This release also includes a deprecation warning for dask.distributed, which will be removed in the next version.

Future development in distributed computing for dask is happening here: https://distributed.readthedocs.io . General feedback on that project is most welcome from this community.

0.7.3 / 2015-09-25

Diagnostics

A utility for profiling memory and cpu usage has been added to the dask.diagnostics module.

DataFrame

This release improves coverage of the pandas API. Among other things, it includes:

New methods: nunique, nlargest, quantile
Fixes for encoding issues when reading non-ASCII CSV files
Performance improvements and bug fixes for resample
More flexible read_hdf with globbing

It also includes various bug fixes in dask.imperative and dask.bag.

0.7.0 / 2015-08-15

DataFrame

This release includes significant bugfixes and alignment with the Pandas API. This has resulted both from use and from recent involvement by Pandas core developers.

New operations: query, rolling operations, drop
Improved operations: quantiles, arithmetic on full dataframes, dropna, constructor logic, merge/join, elemwise operations, groupby aggregations

Bag

Fixed a bug in fold with a null default argument

Array

New operations: da.fft module, da.image.imread

Infrastructure

The array and dataframe collections create graphs with deterministic keys. These tend to be longer (hash strings) but should be consistent between computations. This will be useful for caching in the future.
All collections (Array, Bag, DataFrame) inherit from a common base class
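
One way to picture deterministic keys is to derive each key from a hash of the operation and its inputs, so that building the same computation twice yields the same name. The helper below is a hypothetical sketch using `hashlib`; it is not dask's own naming function.

```python
import hashlib


def deterministic_key(prefix, *args):
    """Build a key from a prefix plus a hash of the arguments' repr."""
    # Using repr() is a simplification; it is only stable for simple values.
    digest = hashlib.sha256(repr(args).encode("utf-8")).hexdigest()
    return "%s-%s" % (prefix, digest)


key_a = deterministic_key("add", "x", 1)
key_b = deterministic_key("add", "x", 1)
key_c = deterministic_key("add", "x", 2)
print(key_a == key_b)  # same inputs give the same key
print(key_a == key_c)  # different inputs give a different key
```

Keys built this way are longer than sequential names, but a cache can use them to recognize a result it has already computed.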

0.6.1 / 2015-07-23

Distributed

Improved (though not yet sufficient) resiliency for dask.distributed when workers die

DataFrame

Improved writing to various formats, including to_hdf, to_castra, and to_csv
Improved creation of dask DataFrames from dask Arrays and Bags
Improved support for categoricals and various other methods

Array

Various bug fixes
Histogram function

Scheduling

Added tie-breaking ordering of tasks within parallel workloads to better handle and clear intermediate results

Other

Added the dask.do function for explicit construction of graphs with normal python code
Traded pydot for graphviz library for graph printing to support Python 3
There is also a gitter chat room and a stackoverflow tag
