# Dask Bag

Dask bags are often used to parallelize simple computations on unstructured or semi-structured data such as text, log files, JSON records, or user-defined Python objects.

Dask Bag doesn’t perform well on computations that include a great deal of inter-worker communication.

**Known Limitations**

Bags provide very general computation (any Python function). This generality comes at a cost. Bags have the following known limitations:

1. By default, they rely on the multiprocessing scheduler, which has its own set of known limitations (see Shared Memory)

2. Bags are immutable, so you cannot change individual elements

3. Bag operations tend to be slower than array/DataFrame computations in the same way that standard Python containers tend to be slower than NumPy arrays and Pandas DataFrames

4. Bag’s groupby is slow. Prefer Bag’s foldby where possible, although foldby takes more thought to use
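As a small illustration of the immutability point, the sketch below "changes" one element by mapping the bag to a new bag, leaving the original untouched. The data is made up, and `scheduler="synchronous"` is an assumption used only so the snippet runs without a cluster.

```python
import dask.bag as db

# A tiny in-memory bag (illustrative data, not from the cities file)
original = db.from_sequence([1, 2, 3, 4], npartitions=2)

# There is no in-place update; instead, derive a new bag with map()
updated = original.map(lambda x: 0 if x == 3 else x)

# The synchronous scheduler runs everything in the current process
original_values = original.compute(scheduler="synchronous")
updated_values = updated.compute(scheduler="synchronous")

print(original_values)
print(updated_values)
```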



In [1]:
from dask.distributed import Client

client = Client(n_workers=3, threads_per_worker=2, memory_limit='4G')
display(client)

Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

Workers: 3
Total threads: 6
Total memory: 11.18 GiB
Status: running
Using processes: True

Comm: tcp://127.0.0.1:38593
Started: Just now

Comm: tcp://127.0.0.1:45965
Total threads: 2
Memory: 3.73 GiB
Dashboard: http://127.0.0.1:43197/status
Nanny: tcp://127.0.0.1:34199
Local directory: /home/jana/delavnice/Dask/natebook/dask-worker-space/worker-o70uz3d6

Comm: tcp://127.0.0.1:44019
Total threads: 2
Memory: 3.73 GiB
Dashboard: http://127.0.0.1:35953/status
Nanny: tcp://127.0.0.1:43755
Local directory: /home/jana/delavnice/Dask/natebook/dask-worker-space/worker-ntogfczs

Comm: tcp://127.0.0.1:46791
Total threads: 2
Memory: 3.73 GiB
Dashboard: http://127.0.0.1:41253/status
Nanny: tcp://127.0.0.1:36801
Local directory: /home/jana/delavnice/Dask/natebook/dask-worker-space/worker-gdbr24ue


In [2]:
import dask.bag as db
b = db.read_text("../cities.json")

In [3]:
import json
# parse each line of JSON text into a Python dictionary
js = b.map(json.loads)
# take: inspect first few elements
js.take(3)

({'_id': '01001',
  'city': 'AGAWAM',
  'loc': [-72.622739, 42.070206],
  'pop': 15338,
  'state': 'MA'},
 {'_id': '01002',
  'city': 'CUSHMAN',
  'loc': [-72.51565, 42.377017],
  'pop': 36963,
  'state': 'MA'},
 {'_id': '01005',
  'city': 'BARRE',
  'loc': [-72.108354, 42.409698],
  'pop': 4546,
  'state': 'MA'})
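
The cells above appear to rely on the file holding one JSON object per line, since `read_text` yields one string per line and `json.loads` is then applied to each string. A plain-Python sketch of that per-line step, using two hand-written sample lines modelled on the output above (not read from the actual file), might look like this:

```python
import json

# Two sample lines in line-delimited JSON form (illustrative only)
lines = [
    '{"_id": "01001", "city": "AGAWAM", "pop": 15338, "state": "MA"}\n',
    '{"_id": "01005", "city": "BARRE", "pop": 4546, "state": "MA"}\n',
]

# Equivalent of b.map(json.loads): parse every line independently.
# json.loads ignores the trailing newline on each line.
records = [json.loads(line) for line in lines]

for record in records:
    print(record["city"], record["pop"])
```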

In [4]:
js.filter(lambda record: record['pop'] > 10000).take(10)

({'_id': '01001',
  'city': 'AGAWAM',
  'loc': [-72.622739, 42.070206],
  'pop': 15338,
  'state': 'MA'},
 {'_id': '01002',
  'city': 'CUSHMAN',
  'loc': [-72.51565, 42.377017],
  'pop': 36963,
  'state': 'MA'},
 {'_id': '01007',
  'city': 'BELCHERTOWN',
  'loc': [-72.410953, 42.275103],
  'pop': 10579,
  'state': 'MA'},
 {'_id': '01013',
  'city': 'CHICOPEE',
  'loc': [-72.607962, 42.162046],
  'pop': 23396,
  'state': 'MA'},
 {'_id': '01020',
  'city': 'CHICOPEE',
  'loc': [-72.576142, 42.176443],
  'pop': 31495,
  'state': 'MA'},
 {'_id': '01027',
  'city': 'MOUNT TOM',
  'loc': [-72.679921, 42.264319],
  'pop': 16864,
  'state': 'MA'},
 {'_id': '01028',
  'city': 'EAST LONGMEADOW',
  'loc': [-72.505565, 42.067203],
  'pop': 13367,
  'state': 'MA'},
 {'_id': '01030',
  'city': 'FEEDING HILLS',
  'loc': [-72.675077, 42.07182],
  'pop': 11985,
  'state': 'MA'},
 {'_id': '01040',
  'city': 'HOLYOKE',
  'loc': [-72.626193, 42.202007],
  'pop': 43704,
  'state': 'MA'},
 {'_id': '01056',


You can chain operations.

In [15]:
result = (
    js.filter(lambda record: record['pop'] > 50000)
    .map(lambda r: r['city'])
    .frequencies(sort=True)
    .topk(10, key=1)
)

As usual with Dask, the result is lazy: nothing is computed until you ask for it.

In [16]:
result

dask.bag<topk-aggregate, npartitions=1>

In [17]:
result.take(3)

(('BROOKLYN', 27), ('CHICAGO', 26), ('NEW YORK', 14))
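
To experiment with this kind of chain without the cities file, the following sketch builds a small bag in memory. The records and population figures are made up for illustration, and `scheduler="synchronous"` is an assumption used only so the snippet runs without a cluster.

```python
import dask.bag as db

# Hypothetical records; only the 'city' and 'pop' fields matter here
records = [
    {"city": "SPRINGFIELD", "pop": 60000},
    {"city": "SPRINGFIELD", "pop": 75000},
    {"city": "SPRINGFIELD", "pop": 52000},
    {"city": "CHESTER", "pop": 51000},
    {"city": "CHESTER", "pop": 90000},
    {"city": "BARRE", "pop": 4546},
    {"city": "HOLYOKE", "pop": 55000},
]
bag = db.from_sequence(records, npartitions=2)

# Building the chain only describes the work; nothing runs yet
pipeline = (
    bag.filter(lambda r: r["pop"] > 50000)
    .map(lambda r: r["city"])
    .frequencies()
    .topk(2, key=lambda pair: pair[1])
)
is_lazy = isinstance(pipeline, db.Bag)

# compute() triggers the actual work
top_cities = pipeline.compute(scheduler="synchronous")
print(is_lazy, top_cities)
```

Each entry of `top_cities` is a `(city, count)` pair, ordered from the most to the least frequent city.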

**Shuffle operations on bag**

Some operations, like groupby, require substantial inter-worker communication. On a single machine, Dask uses partd to perform efficient, parallel, spill-to-disk shuffles. When working in a cluster, Dask uses a task based shuffle.

These shuffle operations are expensive and are better handled by projects like dask.dataframe. It is best to use dask.bag to clean and process data, then transform it into an array or DataFrame before embarking on the more complex operations that require shuffle steps. The foldby method is harder to use, but it performs a streaming combined groupby and reduction.

- groupby: Shuffles data so that all items with the same key are in the same key-value pair
- foldby: Walks through the data accumulating a result per key


In [28]:
%%time
js.groupby(lambda k: k['city']).starmap(lambda k, v: (k, len(v))).compute()

CPU times: user 248 ms, sys: 6.57 ms, total: 255 ms
Wall time: 1.24 s


[('AGAWAM', 1),
 ('CUSHMAN', 2),
 ('BARRE', 2),
 ('BELCHERTOWN', 1),
 ('BLANDFORD', 1),
 ('BRIMFIELD', 2),
 ('CHESTER', 21),
 ('CHESTERFIELD', 7),
 ('CHICOPEE', 2),
 ('WESTOVER AFB', 1),
 ('CUMMINGTON', 1),
 ('MOUNT TOM', 1),
 ('EAST LONGMEADOW', 1),
 ('FEEDING HILLS', 1),
 ('GILBERTVILLE', 1),
 ('GOSHEN', 9),
 ('GRANBY', 4),
 ('TOLLAND', 2),
 ('HADLEY', 4),
 ('HAMPDEN', 5),
 ('HATFIELD', 5),
 ('HAYDENVILLE', 1),
 ('HOLYOKE', 3),
 ('HUNTINGTON', 12),
 ('LEEDS', 4),
 ('LEVERETT', 1),
 ('LUDLOW', 8),
 ('MONSON', 2),
 ('FLORENCE', 20),
 ('OAKHAM', 1),
 ('PALMER', 6),
 ('PLAINFIELD', 7),
 ('RUSSELL', 5),
 ('SHUTESBURY', 1),
 ('SOUTHAMPTON', 3),
 ('SOUTH HADLEY', 1),
 ('SOUTHWICK', 1),
 ('THREE RIVERS', 4),
 ('WALES', 5),
 ('WARE', 1),
 ('MONTGOMERY', 20),
 ('WEST SPRINGFIELD', 4),
 ('WEST WARREN', 1),
 ('WILBRAHAM', 1),
 ('WILLIAMSBURG', 10),
 ('WORTHINGTON', 8),
 ('SPRINGFIELD', 41),
 ('LONGMEADOW', 1),
 ('INDIAN ORCHARD', 1),
 ('PITTSFIELD', 6),
 ('ADAMS', 9),
 ('ASHLEY FALLS', 1),
 ('BE

The key difference between starmap() and map() is that starmap() unpacks each element into several arguments for the target function, whereas map() passes each element to the target function as a single argument.

It applies the function to arguments obtained from each element of the iterable. Use it instead of map() when the arguments are already grouped in tuples within a single iterable (the data has been “pre-zipped”), as is the case with the (key, values) pairs that groupby produces.
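
The same distinction exists in the standard library, so it can be illustrated without Dask. The sketch below uses `itertools.starmap` on hand-written (city, records) pairs; the pairs and the `count_group` helper are hypothetical stand-ins for what a groupby would hand to the next step.

```python
from itertools import starmap

# Hypothetical grouped data: (key, list of grouped items)
grouped = [("CUSHMAN", ["rec1", "rec2"]), ("AGAWAM", ["rec3"])]


def count_group(city, items):
    """Two-argument function: return the key with its group size."""
    return (city, len(items))


# starmap unpacks each tuple into (city, items) for us
with_starmap = list(starmap(count_group, grouped))

# map passes the whole tuple as one argument, so we unpack by hand
with_map = list(map(lambda pair: count_group(*pair), grouped))

print(with_starmap)
print(with_map)
```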

Foldby provides a combined groupby and reduce for efficient parallel split-apply-combine tasks.

When using foldby you provide:

1. A key function on which to group elements

2. A binary operator, such as you would pass to reduce, that performs the reduction within each group

3. A combine binary operator that can combine the results of two reduce calls on different parts of your dataset.
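
To make the roles of these three pieces concrete, here is a simplified plain-Python model of a fold-then-combine over two partitions. It is only a sketch of the idea, not Dask's actual implementation; `fold_partition`, `combine_partitions`, and the sample partitions are hypothetical names and data introduced for illustration.

```python
from operator import add


def incr(total, _record):
    """Binary operator: count one more record for this key."""
    return total + 1


def fold_partition(records, key, binop, initial):
    """Reduce one partition to a {key: accumulated value} dict."""
    acc = {}
    for record in records:
        k = key(record)
        acc[k] = binop(acc.get(k, initial), record)
    return acc


def combine_partitions(partials, combine, combine_initial):
    """Merge the per-partition results key by key."""
    merged = {}
    for partial in partials:
        for k, value in partial.items():
            merged[k] = combine(merged.get(k, combine_initial), value)
    return merged


# Two made-up partitions of city records
partitions = [
    [{"city": "CHESTER"}, {"city": "AGAWAM"}, {"city": "CHESTER"}],
    [{"city": "CHESTER"}, {"city": "BARRE"}],
]

# Step 1: each partition is folded independently (parallelizable)
partials = [fold_partition(p, lambda r: r["city"], incr, 0) for p in partitions]

# Step 2: the small per-partition summaries are combined
counts = combine_partitions(partials, add, 0)
print(counts)
```

In this model only the small per-partition dictionaries are merged at the end, rather than every record being moved so that equal keys sit together, which is the intuition behind preferring a fold over a full groupby.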

In [27]:
%%time

from operator import add
def incr(tot, _):
    return tot+1

js.foldby(key='city', binop=incr, initial=0, combine=add, combine_initial=0).compute()


CPU times: user 442 ms, sys: 3.21 ms, total: 446 ms
Wall time: 1.17 s


[('AGAWAM', 1),
 ('CUSHMAN', 2),
 ('BARRE', 2),
 ('BELCHERTOWN', 1),
 ('BLANDFORD', 1),
 ('BRIMFIELD', 2),
 ('CHESTER', 21),
 ('CHESTERFIELD', 7),
 ('CHICOPEE', 2),
 ('WESTOVER AFB', 1),
 ('CUMMINGTON', 1),
 ('MOUNT TOM', 1),
 ('EAST LONGMEADOW', 1),
 ('FEEDING HILLS', 1),
 ('GILBERTVILLE', 1),
 ('GOSHEN', 9),
 ('GRANBY', 4),
 ('TOLLAND', 2),
 ('HADLEY', 4),
 ('HAMPDEN', 5),
 ('HATFIELD', 5),
 ('HAYDENVILLE', 1),
 ('HOLYOKE', 3),
 ('HUNTINGTON', 12),
 ('LEEDS', 4),
 ('LEVERETT', 1),
 ('LUDLOW', 8),
 ('MONSON', 2),
 ('FLORENCE', 20),
 ('OAKHAM', 1),
 ('PALMER', 6),
 ('PLAINFIELD', 7),
 ('RUSSELL', 5),
 ('SHUTESBURY', 1),
 ('SOUTHAMPTON', 3),
 ('SOUTH HADLEY', 1),
 ('SOUTHWICK', 1),
 ('THREE RIVERS', 4),
 ('WALES', 5),
 ('WARE', 1),
 ('MONTGOMERY', 20),
 ('WEST SPRINGFIELD', 4),
 ('WEST WARREN', 1),
 ('WILBRAHAM', 1),
 ('WILLIAMSBURG', 10),
 ('WORTHINGTON', 8),
 ('SPRINGFIELD', 41),
 ('LONGMEADOW', 1),
 ('INDIAN ORCHARD', 1),
 ('PITTSFIELD', 6),
 ('ADAMS', 9),
 ('ASHLEY FALLS', 1),
 ('BE

You can convert the bag to a Dask DataFrame.

In [23]:
df1 = js.to_dataframe()

In [24]:
df1.head()

| | _id | city | loc | pop | state |
|---|---|---|---|---|---|
| 0 | 1001 | AGAWAM | [-72.622739, 42.070206] | 15338 | MA |
| 1 | 1002 | CUSHMAN | [-72.51565, 42.377017] | 36963 | MA |
| 2 | 1005 | BARRE | [-72.108354, 42.409698] | 4546 | MA |
| 3 | 1007 | BELCHERTOWN | [-72.410953, 42.275103] | 10579 | MA |
| 4 | 1008 | BLANDFORD | [-72.936114, 42.182949] | 1240 | MA |


In [33]:
df1[df1.city =='AGAWAM'].compute()

| | _id | city | loc | pop | state |
|---|---|---|---|---|---|
| 0 | 1001 | AGAWAM | [-72.622739, 42.070206] | 15338 | MA |
