Changelog

0.18.0 / 2018-MM-DD

Array

DataFrame

Bag

Core

0.17.4 / 2018-05-03

DataFrame

Add support for indexing Dask DataFrames with string subclasses (3461) James Bourbeau
Allow using both sorted_index and chunksize in read_hdf (3463) Pierre Bartet
Pass filesystem to arrow piece reader (3466) Martin Durant
Switch to using dask.compat string_types (3462) James Bourbeau

0.17.3 / 2018-05-02

Array

Add einsum for Dask Arrays (3412) Simon Perkins
Add piecewise for Dask Arrays (3350) John A Kirkham
Fix handling of nan in broadcast_shapes (3356) John A Kirkham
Add isin for Dask Arrays (3363) Stephan Hoyer
Overhauled topk for Dask Arrays: faster algorithm, particularly for large k's; added support for multiple axes, recursive aggregation, and an option to pick the bottom k elements instead. (3395) Guido Imperiale
The topk API has changed from topk(k, array) to the more conventional topk(array, k). The legacy API still works but is now deprecated. (2965) Guido Imperiale
New function argtopk for Dask Arrays (3396) Guido Imperiale
Fix handling partial depth and boundary in map_overlap (3445) John A Kirkham
Add gradient for Dask Arrays (3434) John A Kirkham
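
The chunk-wise idea behind a top-k with recursive aggregation, as in the topk entries above, can be sketched in plain Python. This is only an illustration and not Dask's implementation: topk_chunked is a hypothetical helper, and using a negative k to pick the bottom elements is an assumption made for this sketch.

```python
import heapq


def topk_chunked(chunks, k):
    """Return the k largest values across all chunks (or the -k smallest if k < 0).

    Hypothetical helper: each chunk is reduced to its own top-k first,
    then the partial results are merged and reduced once more.
    """
    if k == 0:
        return []
    pick = heapq.nlargest if k > 0 else heapq.nsmallest
    n = abs(k)
    # Step 1: per-chunk reduction keeps at most n candidates per chunk.
    partials = [pick(n, chunk) for chunk in chunks]
    # Step 2: aggregate the candidates and reduce again.
    merged = [value for part in partials for value in part]
    return pick(n, merged)


chunks = [[5, 1, 9], [7, 3], [8, 2, 6]]
top3 = topk_chunked(chunks, 3)      # three largest values, largest first
bottom2 = topk_chunked(chunks, -2)  # two smallest values, smallest first
```

Because each chunk contributes at most |k| candidates, the merge step stays small even when the chunks themselves are large.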

DataFrame

Allow t as shorthand for table in to_hdf for pandas compatibility (3330) Jörg Dietrich
Added top level isna method for Dask DataFrames (3294) Christopher Ren
Fix selection on partition column on read_parquet for engine="pyarrow" (3207) Uwe Korn
Added DataFrame.squeeze method (3366) Christopher Ren
Added infer_divisions option to read_parquet to specify whether read engines should compute divisions (3387) Jon Mease
Added support for inferring division for engine="pyarrow" (3387) Jon Mease
Provide more informative error message for meta= errors (3343) Matthew Rocklin
Add ORC reader (3284) Martin Durant
Default compression for parquet now always Snappy, in line with pandas (3373) Martin Durant
Fixed bug in Dask DataFrame and Series comparisons with NumPy scalars (3436) James Bourbeau
Remove outdated requirement from repartition docstring (3440) Jörg Dietrich
Fixed bug in aggregation when only a Series is selected (3446) Jörg Dietrich
Add default values to make_timeseries (3421) Matthew Rocklin

Bag

Core

Support traversing collections in persist, visualize, and optimize (3410) Jim Crist

0.17.2 / 2018-03-21

Array

Add broadcast_arrays for Dask Arrays (3217) John A Kirkham
Add bitwise_* ufuncs (3219) John A Kirkham
Add optional axis argument to squeeze (3261) John A Kirkham
Validate inputs to atop (3307) Matthew Rocklin
Avoid calls to astype in concatenate if all parts have the same dtype (3301) Martin Durant

DataFrame

Fixed bug in shuffle due to aggressive truncation (3201) Matthew Rocklin
Support specifying categorical columns on read_parquet with categories=[…] for engine="pyarrow" (3177) Uwe Korn
Add dd.tseries.Resampler.agg (3202) Richard Postelnik
Support operations that mix dataframes and arrays (3230) Matthew Rocklin
Support extra Scalar and Delayed args in dd.groupby._Groupby.apply (3256) Gabriele Lanaro

Bag

Support joining against single-partitioned bags and delayed objects (3254) Matthew Rocklin

Core

Fixed bug when using unexpected but hashable types for keys (3238) Daniel Collins
Fix bug in task ordering so that we break ties consistently with the key name (3271) Matthew Rocklin
Avoid sorting tasks in order when the number of tasks is very large (3298) Matthew Rocklin

0.17.1 / 2018-02-22

Array

Corrected dimension chunking in indices (3166, 3167) Simon Perkins
Inline store_chunk calls for store's return_stored option (3153) John A Kirkham
Compatibility with struct dtypes for NumPy 1.14.1 release (3187) Matthew Rocklin

DataFrame

Bugfix to allow column assignment of pandas datetimes (3164) Max Epstein

Bag

Core

New file-system for HTTP(S), allowing direct loading from specific URLs (3160) Martin Durant
Fix bug when tokenizing partials with no keywords (3191) Matthew Rocklin
Use more recent LZ4 API (3157) Thrasibule
Introduce output stream parameter for progress bar (3185) Dieter Weber

0.17.0 / 2018-02-09

Array

Added support for object-type arrays in nansum, nanmin, and nanmax (3133) Keisuke Fujii
Update error handling when len is called with empty chunks (3058) Xander Johnson
Fixes a metadata bug with store's return_stored option (3064) John A Kirkham
Fix a bug in optimization.fuse_slice to properly handle when first input is None (3076) James Bourbeau
Support arrays with unknown chunk sizes in percentile (3107) Matthew Rocklin
Tokenize scipy.sparse arrays and np.matrix (3060) Roman Yurchak

DataFrame

Support month timedeltas in repartition(freq=...) (3110) Matthew Rocklin
Avoid mutation in dataframe groupby tests (3118) Matthew Rocklin
read_csv, read_table, and read_parquet accept iterables of paths (3124) Jim Crist
Deprecates the dd.to_delayed function in favor of the existing method (3126) Jim Crist
Return dask.arrays from df.map_partitions calls when the UDF returns a numpy array (3147) Matthew Rocklin
Change handling of columns and index in dd.read_parquet to be more consistent, especially in handling of multi-indices (3149) Jim Crist
fastparquet append=True allowed to create new dataset (3097) Martin Durant
dtype rationalization for sql queries (3100) Martin Durant

Bag

Document that the function passed to bag.map_partitions may receive either a list or a generator (3150) Nir

Core

Change default task ordering to prefer nodes with few dependents and then many downstream dependencies (3056) Matthew Rocklin
Add color= option to visualize to color by task order (3057) (3122) Matthew Rocklin
Deprecate dask.bytes.open_text_files (3077) Jim Crist
Remove short-circuit hdfs reads handling due to maintenance costs. May be re-added in a more robust manner later (3079) Jim Crist
Add dask.base.optimize for optimizing multiple collections without computing. (3071) Jim Crist
Rename dask.optimize module to dask.optimization (3071) Jim Crist
Change task ordering to do a full traversal (3066) Matthew Rocklin
Adds an optimize_graph keyword to all to_delayed methods to allow controlling whether optimizations occur on conversion. (3126) Jim Crist
Support using pyarrow for hdfs integration (3123) Jim Crist
Move HDFS integration and tests into dask repo (3083) Jim Crist
Remove write_bytes (3116) Jim Crist

0.16.1 / 2018-01-09

Array

Fix handling of scalar percentile values in percentile (3021) James Bourbeau
Prevent bool() coercion from calling compute (2958) Albert DeFusco
Add matmul (2904) John A Kirkham
Support N-D arrays with matmul (2909) John A Kirkham
Add vdot (2910) John A Kirkham
Explicit chunks argument for broadcast_to (2943) Stephan Hoyer
Add meshgrid (2938) John A Kirkham and (3001) Markus Gonser
Preserve singleton chunks in fftshift/ifftshift (2733) John A Kirkham
Fix handling of negative indexes in vindex and raise errors for out of bounds indexes (2967) Stephan Hoyer
Add flip, flipud, fliplr (2954) John A Kirkham
Add float_power ufunc (2962) (2969) John A Kirkham
Compatibility for changes to structured arrays in the upcoming NumPy 1.14 release (2964) Tom Augspurger
Add block (2650) John A Kirkham
Add frompyfunc (3030) Jim Crist
Add the return_stored option to store for chaining stored results (2980) John A Kirkham

DataFrame

Fixed naming bug in cumulative aggregations (3037) Martijn Arts
Fixed dd.read_csv when names is given but header is not set to None (2976) Martijn Arts
Fixed dd.read_csv so that passing instances of CategoricalDtype in dtype will result in known categoricals (2997) Tom Augspurger
Prevent bool() coercion from calling compute (2958) Albert DeFusco
DataFrame.read_sql() on an empty database table now returns an empty Dask DataFrame (2928) Apostolos Vlachopoulos
Compatibility for reading Parquet files written by PyArrow 0.8.0 (2973) Tom Augspurger
Correctly handle the column name (df.columns.name) when reading in dd.read_parquet (2973) Tom Augspurger
Fixed dd.concat losing the index dtype when the data contained a categorical (2932) Tom Augspurger
Add dd.Series.rename (3027) Jim Crist
DataFrame.merge() now supports merging on a combination of columns and the index (2960) Jon Mease
Removed the deprecated dd.rolling* methods, in preparation for their removal in the next pandas release (2995) Tom Augspurger
Fix metadata inference bug in which single-partition series were mistakenly special cased (3035) Jim Crist
Add support for Series.str.cat (3028) Jim Crist

Core

Improve 32-bit compatibility (2937) Matthew Rocklin
Change task prioritization to avoid upwards branching (3017) Matthew Rocklin

0.16.0 / 2017-11-17

This is a major release. It includes breaking changes, new protocols, and a large number of bug fixes.

Array

Add atleast_1d, atleast_2d, and atleast_3d (2760) (2765) John A Kirkham
Add allclose (2771) John A Kirkham
Remove random.different_seeds from Dask Array API docs (2772) John A Kirkham
Deprecate vnorm in favor of dask.array.linalg.norm (2773) John A Kirkham
Reimplement unique to be lazy (2775) John A Kirkham
Support broadcasting of Dask Arrays with 0-length dimensions (2784) John A Kirkham
Add asarray and asanyarray to Dask Array API docs (2787) James Bourbeau
Support unique's return_* arguments (2779) John A Kirkham
Simplify _unique_internal (2850) (2855) John A Kirkham
Avoid removing some getter calls in array optimizations (2826) Jim Crist

DataFrame

Support pyarrow in dd.to_parquet (2868) Jim Crist
Fixed DataFrame.quantile and Series.quantile returning nan when missing values are present (2791) Tom Augspurger
Fixed DataFrame.quantile losing the result .name when q is a scalar (2791) Tom Augspurger
Fixed dd.concat to return a Dask DataFrame when concatenating a single series along the columns, matching pandas' behavior (2800) James Munroe
Fixed default inplace parameter for DataFrame.eval to match the pandas default for pandas >= 0.21.0 (2838) Tom Augspurger
Fix exception when calling DataFrame.set_index on text column where one of the partitions was empty (2831) Jesse Vogt
Do not raise exception when calling DataFrame.set_index on empty dataframe (2827) Jesse Vogt
Fixed bug in DataFrame.fillna when filling with a Series value (2810) Tom Augspurger
Deprecate old argument ordering in dd.to_parquet to better match convention of putting the dataframe first (2867) Jim Crist
df.astype(categorical_dtype) -> known categoricals (2835) Jim Crist
Test against Pandas release candidate (2814) Tom Augspurger
Add more tests for read_parquet(engine='pyarrow') (2822) Uwe Korn
Remove unnecessary map_partitions in aggregate (2712) Christopher Prohm
Fix bug calling sample on empty partitions (2818) @xwang777
Error nicely when parsing dates in read_csv (2863) Jim Crist
Cleanup handling of passing filesystem objects to PyArrow readers (2527) @fjetter
Support repartitioning even if there are no divisions (2873) @Ced4
Support reading/writing to hdfs using pyarrow in dd.to_parquet (2894, 2881) Jim Crist

Core

Allow tuples as sharedict keys (2763) Matthew Rocklin
Calling compute within a dask.distributed task defaults to distributed scheduler (2762) Matthew Rocklin
Auto-import gcsfs when gcs:// protocol is used (2776) Matthew Rocklin
Fully remove dask.async module, use dask.local instead (2828) Thomas Caswell
Compatibility with bokeh 0.12.10 (2844) Tom Augspurger
Reduce test memory usage (2782) Jim Crist
Add Dask collection interface (2748) Jim Crist
Update Dask collection interface during XArray integration (2847) Matthew Rocklin
Close resource profiler process on __exit__ (2871) Jim Crist
Fix S3 tests (2875) Jim Crist
Fix port for bokeh dashboard in docs (2889) Ian Hopkinson
Wrap Dask filesystems for PyArrow compatibility (2881) Jim Crist

0.15.4 / 2017-10-06

Array

da.random.choice now works with array arguments (2781)
Support indexing in arrays with np.int (fixes regression) (2719)
Handle zero dimension with rechunking (2747)
Support -1 as an alias for "size of the dimension" in chunks (2749)
Call mkdir in array.to_npy_stack (2709)
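
The meaning of -1 in a chunks specification can be sketched as follows. This is a simplified illustration under the assumption that chunks and shape are plain tuples with one integer per dimension; expand_chunks is a hypothetical helper and not part of Dask.

```python
def expand_chunks(chunks, shape):
    """Replace every -1 in chunks with the full size of that dimension.

    Hypothetical helper: handles only per-dimension integer chunk sizes.
    """
    if len(chunks) != len(shape):
        raise ValueError("chunks and shape must have the same length")
    return tuple(dim if chunk == -1 else chunk for chunk, dim in zip(chunks, shape))


# Chunk the first axis in blocks of 100 and keep the second axis whole.
expanded = expand_chunks((100, -1), (1000, 250))
```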

DataFrame

Added the .str accessor to Categoricals with string categories (2743)
Support int96 (spark) datetimes in parquet writer (2711)
Pass on file scheme to fastparquet (2714)
Support Pandas 0.21 (2737)

Bag

Add tree reduction support for foldby (2710)

Core

Drop s3fs from pip install dask[complete] (2750)

0.15.3 / 2017-09-24

Array

Add masked arrays (2301)
Add *_like array creation functions (2640)
Indexing with unsigned integer array (2647)
Improved slicing with boolean arrays of different dimensions (2658)
Support literals in top and atop (2661)
Optional axis argument in cumulative functions (2664)
Improve tests on scalars with assert_eq (2681)
Fix norm keepdims (2683)
Add ptp (2691)
Add apply_along_axis (2690) and apply_over_axes (2702)

DataFrame

Added Series.str[index] (2634)
Allow the groupby by param to handle columns and index levels (2636)
DataFrame.to_csv and Bag.to_textfiles now return the filenames to which they have written (2655)
Fix combination of partition_on and append in to_parquet (2645)
Fix for parquet file schemes (2667)
Repartition works with mixed categoricals (2676)

Core

python setup.py test now runs tests (2641)
Added new cheatsheet (2649)
Remove resize tool in Bokeh plots (2688)

0.15.2 / 2017-08-25

Array

Remove spurious keys from map_overlap graph (2520)
where works with non-bool condition and scalar values (2543) (2549)
Improve compress (2541) (2545) (2555)
Add argwhere, _nonzero, and where(cond) (2539)
Generalize vindex in dask.array to handle multi-dimensional indices (2573)
Add choose method (2584)
Split code into reorganized files (2595)
Add linalg.norm (2597)
Add diff, ediff1d (2607), (2609)
Improve dtype inference and reflection (2571)

Bag

Remove deprecated Bag behaviors (2525)

DataFrame

Support callables in assign (2513)
Better error messages for read_csv (2522)
Add dd.to_timedelta (2523)
Verify metadata in from_delayed (2534) (2591)
Add DataFrame.isin (2558)
read_hdf supports iterables of files (2547)

Core

Remove bare except: blocks everywhere (2590)

0.15.1 / 2017-07-08

Add storage_options to to_textfiles and to_csv (2466)
Rechunk and simplify rfftfreq (2473), (2475)
Better support ndarray subclasses (2486)
Import star in dask.distributed (2503)
Threadsafe cache handling with tokenization (2511)

0.15.0 / 2017-06-09

Array

Add dask.array.stats submodule (2269)
Support ufunc.outer (2345)
Optimize fancy indexing by reducing graph overhead (2333) (2394)
Faster array tokenization using alternative hashes (2377)
Added the matmul @ operator (2349)
Improved coverage of the numpy.fft module (2320) (2322) (2327) (2323)
Support NumPy's __array_ufunc__ protocol (2438)

Bag

Fix bug where reductions on bags with no partitions would fail (2324)
Add broadcasting and variadic db.map top-level function. Also remove auto-expansion of tuples as map arguments (2339)
Rename Bag.concat to Bag.flatten (2402)

DataFrame

Parquet improvements (2277) (2422)

Core

Move dask.async module to dask.local (2318)
Support callbacks with nested scheduler calls (2397)
Support pathlib.Path objects as uris (2310)

0.14.3 / 2017-05-05

DataFrame

Pandas 0.20.0 support

0.14.2 / 2017-05-03

Array

Add da.indices (2268), da.tile (2153), da.roll (2135)
Simultaneously support drop_axis and new_axis in da.map_blocks (2264)
Rechunk and concatenate work with unknown chunksizes (2235) and (2251)
Support non-numpy container arrays, notably sparse arrays (2234)
Tensordot contracts over multiple axes (2186)
Allow delayed targets in da.store (2181)
Support interactions against lists and tuples (2148)
Constructor plugins for debugging (2142)
Multi-dimensional FFTs (single chunk) (2116)

Bag

to_dataframe enforces consistent types (2199)

DataFrame

Set_index always fully sorts the index (2290)
Support compatibility with pandas 0.20.0 (2249), (2248), and (2246)
Support Arrow Parquet reader (2223)
Time-based rolling windows (2198)
Repartition can now create more partitions, not just fewer (2168)

Core

Always use absolute paths when on POSIX file system (2263)
Support user provided graph optimizations (2219)
Refactor path handling (2207)
Improve fusion performance (2129), (2131), and (2112)

0.14.1 / 2017-03-22

Array

Micro-optimize optimizations (2058)
Change slicing optimizations to avoid fusing raw numpy arrays (2075) (2080)
Dask.array operations now work on numpy arrays (2079)
Reshape now works in a much broader set of cases (2089)
Support deepcopy python protocol (2090)
Allow user-provided FFT implementations in da.fft (2093)

Bag

DataFrame

Fix to_parquet with empty partitions (2020)
Optional npartitions='auto' mode in set_index (2025)
Optimize shuffle performance (2032)
Support efficient repartitioning along time windows like repartition(freq='12h') (2059)
Improve speed of categorize (2010)
Support single-row dataframe arithmetic (2085)
Automatically avoid shuffle when setting index with a sorted column (2091)
Improve integer-NA handling in read_csv (2098)

Delayed

Repeated attribute access on delayed objects uses the same key (2084)

Core

Improve naming of nodes in dot visuals to avoid generic apply (2070)
Ensure that worker processes have different random seeds (2094)

0.14.0 / 2017-02-24

Array

Fix corner cases with zero shape and misaligned values in arange (1902), (1904), (1935), (1955), (1956)
Improve concatenation efficiency (1923)
Avoid hashing in from_array if name is provided (1972)

Bag

Repartition can now increase number of partitions (1934)
Fix bugs in some reductions with empty partitions (1939), (1950), (1953)

DataFrame

Support non-uniform categoricals (1877), (1930)
Groupby cumulative reductions (1909)
DataFrame.loc indexing now supports lists (1913)
Improve multi-level groupbys (1914)
Improved HTML and string repr for DataFrames (1637)
Parquet append (1940)
Add dd.demo.daily_stock function for teaching (1992)

Delayed

Add traverse= keyword to delayed to optionally avoid traversing nested data structures (1899)
Support Futures in from_delayed functions (1961)
Improve serialization of decorated delayed functions (1969)

Core

Improve windows path parsing in corner cases (1910)
Rename tasks when fusing (1919)
Add top level persist function (1927)
Propagate errors= keyword in byte handling (1954)
Dask.compute traverses Python collections (1975)
Structural sharing between graphs in dask.array and dask.delayed (1985)

0.13.0 / 2017-01-02

Array

Mandatory dtypes on dask.array. All operations maintain dtype information and UDF functions like map_blocks now require a dtype= keyword if it can not be inferred. (1755)
Support arrays without known shapes, such as arises when slicing arrays with arrays or converting dataframes to arrays (1838)
Support mutation by setting one array with another (1840)
Tree reductions for covariance and correlations. (1758)
Add SerializableLock for better use with distributed scheduling (1766)
Improved atop support (1800)
Rechunk optimization (1737), (1827)

Bag

Avoid wrong results when recomputing the same groupby twice (1867)

DataFrame

Add map_overlap for custom rolling operations (1769)
Add shift (1773)
Add Parquet support (1782) (1792) (1810), (1843), (1859), (1863)
Add missing methods combine, abs, autocorr, sem, nsmallest, first, last, prod (1787)
Approximate nunique (1807), (1824)
Reductions with multiple output partitions (for operations like drop_duplicates) (1808), (1823) (1828)
Add delitem and copy to DataFrames, increasing mutation support (1858)

Delayed

Changed behaviour for delayed(nout=0) and delayed(nout=1): delayed(nout=1) no longer defaults to out=None, and delayed(nout=0) is now enabled, i.e. functions that return tuples of length 1 or 0 are handled correctly. This is especially handy when functions with a variable number of outputs are wrapped by delayed. A trivial example: delayed(lambda *args: args, nout=len(vals))(*vals)
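
A minimal sketch of the nout behaviour described in this entry, assuming dask is installed. The lambda and vals come from the example in the entry; the remaining variable names are illustrative only.

```python
from dask import compute, delayed

vals = [1, 2, 3]

# nout tells delayed how many outputs to expect, so the result can be unpacked.
outs = delayed(lambda *args: args, nout=len(vals))(*vals)
a, b, c = outs
results = compute(a, b, c)

# A function returning a tuple of length 1 can be unpacked the same way.
(only,) = delayed(lambda *args: args, nout=1)(7)
single = only.compute()

# A function returning an empty tuple unpacks to nothing.
empty = list(delayed(lambda *args: args, nout=0)())
```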

Core

Refactor core byte ingest (1768), (1774)
Improve import time (1833)

0.12.0 / 2016-11-03

DataFrame

Return a series when functions given to dataframe.map_partitions return scalars (1515)
Fix type size inference for series (1513)
dataframe.DataFrame.categorize no longer includes missing values in the categories. This is for compatibility with a pandas change (1565)
Fix head parser error in dataframe.read_csv when some lines have quotes (1495)
Add dataframe.reduction and series.reduction methods to apply generic row-wise reduction to dataframes and series (1483)
Add dataframe.select_dtypes, which mirrors the pandas method (1556)
dataframe.read_hdf now supports reading Series (1564)
Support Pandas 0.19.0 (1540)
Implement select_dtypes (1556)
String accessor works with indexes (1561)
Add pipe method to dask.dataframe (1567)
Add indicator keyword to merge (1575)
Support Series in read_hdf (1575)
Support Categories with missing values (1578)
Support inplace operators like df.x += 1 (1585)
Str accessor passes through args and kwargs (1621)
Improved groupby support for single-machine multiprocessing scheduler (1625)
Tree reductions (1663)
Pivot tables (1665)
Add clip (1667), align (1668), combine_first (1725), and any/all (1724)
Improved handling of divisions on dask-pandas merges (1666)
Add groupby.aggregate method (1678)
Add dd.read_table function (1682)
Improve support for multi-level columns (1697) (1712)
Support 2d indexing in loc (1726)
Extend resample to include DataFrames (1741)
Support dask.array ufuncs on dask.dataframe objects (1669)

Array

Add information about how dask.array chunks argument work (1504)
Fix field access with non-scalar fields in dask.array (1484)
Add concatenate= keyword to atop to concatenate chunks of contracted dimensions
Optimized slicing performance (1539) (1731)
Extend atop with a concatenate= (1609) new_axes= (1612) and adjust_chunks= (1716) keywords
Add clip (1610), swapaxes (1611), round (1708), and repeat
Automatically align chunks in atop-backed operations (1644)
Cull dask.arrays on slicing (1709)

Bag

Fix issue with callables in bag.from_sequence being interpreted as tasks (1491)
Avoid non-lazy memory use in reductions (1747)

Administration

Added changelog (1526)
Create new threadpool when operating from thread (1487)
Unify example documentation pages into one (1520)
Add versioneer for git-commit based versions (1569)
Pass through node_attr and edge_attr keywords in dot visualization (1614)
Add continuous testing for Windows with Appveyor (1648)
Remove use of multiprocessing.Manager (1653)
Add global optimizations keyword to compute (1675)
Micro-optimize get_dependencies (1722)

0.11.0 / 2016-08-24

Major Points

DataFrames now enforce knowing full metadata (columns, dtypes) everywhere. Previously we would operate in an ambiguous state when functions lost dtype information (such as apply). Now all dataframes always know their dtypes and raise errors asking for information if they are unable to infer (which they usually can). Some internal attributes like _pd and _pd_nonempty have been moved.

The internals of the distributed scheduler have been refactored to transition tasks between explicit states. This improves resilience, reasoning about scheduling, plugin operation, and logging. It also makes the scheduler code easier to understand for newcomers.

Breaking Changes

The distributed.s3 and distributed.hdfs namespaces are gone. Use protocols in normal methods like read_text('s3://...') instead.
Dask.array.reshape now errs in some cases where previously it would have created a very large number of tasks

0.10.2 / 2016-07-27

More Dataframe shuffles now work in distributed settings, ranging from setting-index to hash joins, to sorted joins and groupbys.
Dask passes the full test suite when run under Python's optimized -OO mode.
On-disk shuffles were found to produce wrong results in some highly-concurrent situations, especially on Windows. This has been resolved by a fix to the partd library.
Fixed a growth of open file descriptors that occurred under large data communications
Support ports in the --bokeh-whitelist option of dask-scheduler to better route web interface messages behind non-trivial network settings
Some improvements to resilience to worker failure (though other known failures persist)
You can now start an IPython kernel on any worker for improved debugging and analysis
Improvements to dask.dataframe.read_hdf, especially when reading from multiple files and docs

0.10.0 / 2016-06-13

Major Changes

This version drops support for Python 2.6
Conda packages are built and served from conda-forge
The dask.distributed executables have been renamed from dfoo to dask-foo. For example dscheduler is renamed to dask-scheduler
Both Bag and DataFrame include a preliminary distributed shuffle.

Bag

Add task-based shuffle for distributed groupbys
Add accumulate for cumulative reductions

DataFrame

Add a task-based shuffle suitable for distributed joins, groupby-applys, and set_index operations. The single-machine shuffle remains untouched (and is much more efficient).
Add support for new Pandas rolling API with improved communication performance on distributed systems.
Add groupby.std/var
Pass through S3/HDFS storage options in read_csv
Improve categorical partitioning
Add eval, info, isnull, notnull for dataframes

Distributed

Rename executables like dscheduler to dask-scheduler
Improve scheduler performance in the many-fast-tasks case (important for shuffling)
Improve work stealing to be aware of expected function run-times and data sizes. This drastically increases the breadth of algorithms that can be efficiently run on the distributed scheduler without significant user expertise.
Support maximum buffer sizes in streaming queues
Improve Windows support when using the Bokeh diagnostic web interface
Support compression of very-large-bytestrings in protocol
Support clean cancellation of submitted futures in Joblib interface

Other

All dask-related projects (dask, distributed, s3fs, hdfs, partd) are now building conda packages on conda-forge.
Change credential handling in s3fs to only pass around delegated credentials if explicitly given secret/key. The default now is to rely on managed environments. This can be changed back by explicitly providing a keyword argument. Anonymous mode must be explicitly declared if desired.

0.9.0 / 2016-05-11

API Changes

dask.do and dask.value have been renamed to dask.delayed
dask.bag.from_filenames has been renamed to dask.bag.read_text
All S3/HDFS data ingest functions like db.from_s3 or distributed.s3.read_csv have been moved into the plain read_text, read_csv functions, which now support protocols, like dd.read_csv('s3://bucket/keys*.csv')
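
The protocol-prefixed paths mentioned above can be pictured as a protocol name followed by a backend-specific path. The sketch below is a plain-Python illustration of that split, not Dask's own parsing code: split_protocol and its default of "file" are assumptions made for the example.

```python
def split_protocol(urlpath, default="file"):
    """Split 'protocol://path' into (protocol, path).

    Hypothetical helper: paths without a protocol prefix are treated
    as local files under the assumed default protocol name.
    """
    if "://" in urlpath:
        protocol, path = urlpath.split("://", 1)
        return protocol, path
    return default, urlpath


remote = split_protocol("s3://bucket/keys*.csv")
local = split_protocol("data/keys*.csv")
```

A reader function can then choose a storage backend from the first element and hand the second element, glob included, to that backend.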

Array

Add support for scipy.LinearOperator
Improve optional locking to on-disk data structures
Change rechunk to expose the intermediate chunks

Bag

Rename from_filenames to read_text
Remove from_s3 in favor of read_text('s3://...')

DataFrame

Fixed numerical stability issue for correlation and covariance
Allow no-hash from_pandas for speedy round-trips to and from pandas objects
Generally reengineered read_csv to be more in line with Pandas behavior
Support fast set_index operations for sorted columns

Delayed

Rename do/value to delayed
Rename to/from_imperative to to/from_delayed

Distributed

Move s3 and hdfs functionality into the dask repository
Adaptively oversubscribe workers for very fast tasks
Improve PyPy support
Improve work stealing for unbalanced workers
Scatter data efficiently with tree-scatters

Other

Add lzma/xz compression support
Raise a warning when trying to split unsplittable compression types, like gzip or bz2
Improve hashing for single-machine shuffle operations
Add new callback method for start state
General performance tuning

0.8.1 / 2016-03-11

Array

Bugfix for range slicing that could periodically lead to incorrect results.
Improved support and resiliency of arg reductions (argmin, argmax, etc.)

Bag

Add zip function

DataFrame

Add corr and cov functions
Add melt function
Bugfixes for io to bcolz and hdf5

0.8.0 / 2016-02-20

Array

Changed default array reduction split from 32 to 4
Linear algebra, tril, triu, LU, inv, cholesky, solve, solve_triangular, eye, lstsq, diag, corrcoef.
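
The effect of the reduction split can be sketched with a generic tree reduction in plain Python. This assumes the split is the number of partial results combined in one step; tree_reduce is a hypothetical helper and not Dask's implementation.

```python
def tree_reduce(partials, combine, split_every=4):
    """Combine partial results in groups of split_every until one remains.

    Hypothetical helper: returns the final value and the number of
    combine levels that were needed.
    """
    level = list(partials)
    if not level:
        raise ValueError("need at least one partial result")
    depth = 0
    while len(level) > 1:
        # Each pass combines neighbouring groups of split_every items.
        level = [
            combine(level[i:i + split_every])
            for i in range(0, len(level), split_every)
        ]
        depth += 1
    return level[0], depth


# 32 partial results: a split of 4 needs three levels, a split of 32 only one.
total_4, depth_4 = tree_reduce(range(32), sum, split_every=4)
total_32, depth_32 = tree_reduce(range(32), sum, split_every=32)
```

A smaller split means more levels, but each combine step handles fewer inputs at once.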

Bag

Add tree reductions
Add range function
Drop from_hdfs function (better functionality now exists in hdfs3 and distributed projects)

DataFrame

Refactor dask.dataframe to include a full empty pandas dataframe as metadata. Drop the .columns attribute on Series
Add Series categorical accessor, series.nunique, drop the .columns attribute for series.
read_csv fixes (multi-column parse_dates, integer column names, etc.)
Internal changes to improve graph serialization

Other

Documentation updates
Add from_imperative and to_imperative functions for all collections
Aesthetic changes to profiler plots
Moved the dask project to a new dask organization

0.7.6 / 2016-01-05

Array

Improve thread safety
Tree reductions
Add view, compress, hstack, dstack, vstack methods
map_blocks can now remove and add dimensions

DataFrame

Improve thread safety
Extend sampling to include replacement options

Imperative

Removed optimization passes that fused results.

Core

Removed dask.distributed
Improved performance of blocked file reading
Serialization improvements
Test Python 3.5

0.7.4 / 2015-10-23

This was mostly a bugfix release. Some notable changes:

Fix minor bugs associated with the release of numpy 1.10 and pandas 0.17
Fixed a bug with random number generation that would cause repeated blocks due to the birthday paradox
Use locks in dask.dataframe.read_hdf by default to avoid concurrency issues
Change dask.get to point to dask.async.get_sync by default
Allow visualization functions to accept general graphviz graph options like rankdir='LR'
Add reshape and ravel to dask.array
Support the creation of dask.arrays from dask.imperative objects

Deprecation

This release also includes a deprecation warning for dask.distributed, which will be removed in the next version.

Future development in distributed computing for dask is happening here: https://distributed.readthedocs.io . General feedback on that project is most welcome from this community.

0.7.3 / 2015-09-25

Diagnostics

A utility for profiling memory and cpu usage has been added to the dask.diagnostics module.

DataFrame

This release improves coverage of the pandas API. Among other things it includes nunique, nlargest, and quantile. It fixes encoding issues when reading non-ASCII CSV files, brings performance improvements and bug fixes to resample, makes read_hdf more flexible with globbing, and more. It also includes various bug fixes in dask.imperative and dask.bag.

0.7.0 / 2015-08-15

DataFrame

This release includes significant bugfixes and alignment with the Pandas API. This has resulted both from use and from recent involvement by Pandas core developers.

New operations: query, rolling operations, drop
Improved operations: quantiles, arithmetic on full dataframes, dropna, constructor logic, merge/join, elemwise operations, groupby aggregations

Bag

Fixed a bug in fold with a null default argument

Array

New operations: da.fft module, da.image.imread

Infrastructure

The array and dataframe collections create graphs with deterministic keys. These tend to be longer (hash strings) but should be consistent between computations. This will be useful for caching in the future.
All collections (Array, Bag, DataFrame) inherit from a common base class

0.6.1 / 2015-07-23

Distributed

Improved (though not yet sufficient) resiliency for dask.distributed when workers die

DataFrame

Improved writing to various formats, including to_hdf, to_castra, and to_csv
Improved creation of dask DataFrames from dask Arrays and Bags
Improved support for categoricals and various other methods

Array

Various bug fixes
Histogram function

Scheduling

Added tie-breaking ordering of tasks within parallel workloads to better handle and clear intermediate results

Other

Added the dask.do function for explicit construction of graphs with normal python code
Traded pydot for graphviz library for graph printing to support Python3
There is also a gitter chat room and a stackoverflow tag
