# Install dependencies
- `apache-beam[dask]`: the core package demonstrated here
- `bokeh`: required for the Dask dashboard
- `mimesis`: required for generating the example data
- `Pygments`: provides `pygmentize`, used to print the example Beam script with syntax highlighting

In [1]:
# !pip install "apache-beam[dask]" "bokeh!=3.0.*,>=2.4.2" mimesis Pygments

We pin the upper bound of `dask` and `distributed` to `2023.9.2` as a workaround until
[this fix](https://github.com/apache/beam/pull/27618/files#diff-bfb5ae715e9067778f492058e8a02ff877d6e7584624908ddbdd316853e6befbL102-R107)
is merged.


In [2]:
# !pip install -U "distributed>=2022.6.0,<2023.9.3"
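To confirm that the environment actually satisfies the pin, a check along these lines may help. It is a minimal sketch using only the standard library; the helper names `parse_version`, `within_pin`, and `installed_version` are hypothetical, and the simple numeric parsing assumes calendar-style versions such as `2023.9.2`.

```python
from importlib import metadata


def parse_version(text):
    """Turn '2023.9.2' into (2023, 9, 2); stop at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def within_pin(version, lower="2022.6.0", upper_exclusive="2023.9.3"):
    """True when lower <= version < upper_exclusive (bounds copied from the pip pin)."""
    return parse_version(lower) <= parse_version(version) < parse_version(upper_exclusive)


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for package in ("dask", "distributed"):
    found = installed_version(package)
    if found is None:
        print(f"{package}: not installed")
    else:
        print(f"{package}: {found} (within pin: {within_pin(found)})")
```

If either package reports `within pin: False`, rerunning the pinned `pip install` above should bring it back in range.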

# Start a client

In [3]:
from distributed import Client
client = Client()
client

Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status

Dashboard: http://127.0.0.1:8787/status
Workers: 5
Total threads: 10
Total memory: 16.00 GiB
Status: running
Using processes: True

Comm: tcp://127.0.0.1:63834
Workers: 5
Dashboard: http://127.0.0.1:8787/status
Total threads: 10
Started: Just now
Total memory: 16.00 GiB

Comm: tcp://127.0.0.1:63847
Total threads: 2
Dashboard: http://127.0.0.1:63852/status
Memory: 3.20 GiB
Nanny: tcp://127.0.0.1:63837
Local directory: /var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/dask-scratch-space/worker-rqgl1e_g

Comm: tcp://127.0.0.1:63848
Total threads: 2
Dashboard: http://127.0.0.1:63851/status
Memory: 3.20 GiB
Nanny: tcp://127.0.0.1:63838
Local directory: /var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/dask-scratch-space/worker-jtffs33t

Comm: tcp://127.0.0.1:63849
Total threads: 2
Dashboard: http://127.0.0.1:63856/status
Memory: 3.20 GiB
Nanny: tcp://127.0.0.1:63839
Local directory: /var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/dask-scratch-space/worker-07gujy4q

Comm: tcp://127.0.0.1:63850
Total threads: 2
Dashboard: http://127.0.0.1:63858/status
Memory: 3.20 GiB
Nanny: tcp://127.0.0.1:63840
Local directory: /var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/dask-scratch-space/worker-meb915gf

Comm: tcp://127.0.0.1:63855
Total threads: 2
Dashboard: http://127.0.0.1:63860/status
Memory: 3.20 GiB
Nanny: tcp://127.0.0.1:63841
Local directory: /var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/dask-scratch-space/worker-d21_cpew


# Example data

In [4]:
import dask
import json
import os

os.makedirs('data', exist_ok=True)
dask.datasets.make_people().map(json.dumps).to_textfiles('data/*.json')

['/Users/charlesstern/Dropbox/pangeo/beam-dask-demo/data/0.json',
 '/Users/charlesstern/Dropbox/pangeo/beam-dask-demo/data/1.json',
 '/Users/charlesstern/Dropbox/pangeo/beam-dask-demo/data/2.json',
 '/Users/charlesstern/Dropbox/pangeo/beam-dask-demo/data/3.json',
 '/Users/charlesstern/Dropbox/pangeo/beam-dask-demo/data/4.json',
 '/Users/charlesstern/Dropbox/pangeo/beam-dask-demo/data/5.json',
 '/Users/charlesstern/Dropbox/pangeo/beam-dask-demo/data/6.json',
 '/Users/charlesstern/Dropbox/pangeo/beam-dask-demo/data/7.json',
 '/Users/charlesstern/Dropbox/pangeo/beam-dask-demo/data/8.json',
 '/Users/charlesstern/Dropbox/pangeo/beam-dask-demo/data/9.json']

In [5]:
!head -n 2 data/0.json

{"age": 18, "name": ["Luana", "Dyer"], "occupation": "Insurance Assessor", "telephone": "+1-805-917-0430", "address": {"address": "66 Raycliff Mews", "city": "Folsom"}, "credit-card": {"number": "4439 5434 4716 0773", "expiration-date": "10/21"}}
{"age": 54, "name": ["Domenica", "Vincent"], "occupation": "Pattern Maker", "telephone": "+1-369-692-3326", "address": {"address": "401 Lawton Junction", "city": "Maywood"}, "credit-card": {"number": "4965 9972 5346 9474", "expiration-date": "05/21"}}
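Each line in these files is one standalone JSON object, so a record can be decoded with `json.loads` on its own. The sketch below turns a record into a `(name, age, occupation)` tuple, the shape that the pipeline output at the end of this notebook suggests. The helper name `to_summary` is hypothetical and is not taken from `example.py`; the sample lines are the two records above, trimmed to the three fields used.

```python
import json

# The two records shown by `head`, trimmed to the fields this sketch reads.
SAMPLE_LINES = [
    '{"age": 18, "name": ["Luana", "Dyer"], "occupation": "Insurance Assessor"}',
    '{"age": 54, "name": ["Domenica", "Vincent"], "occupation": "Pattern Maker"}',
]


def to_summary(line):
    """Decode one JSON line and return (full name, age, occupation)."""
    record = json.loads(line)
    full_name = " ".join(record["name"])  # "name" is a [first, last] list
    return (full_name, record["age"], record["occupation"])


for line in SAMPLE_LINES:
    print(to_summary(line))
```

Because every line parses independently, the same function could be applied per element in a parallel map step without any shared state.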


In [17]:
!pygmentize -g example.py

import glob
import json
import sys

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.runners.dask.dask_runner import DaskRunner


def yield_jsonlines(fname: str):
    with open(fname) as

In [15]:
!python -W ignore example.py --dask_client_address={client.scheduler.address}

('Alan Benson', 45, 'Cargo Operator')
('Adolph Burton', 31, 'Caretaker')
('Antwan Buckner', 41, 'Car Park Attendant')
('Andree Boyer', 49, 'Caterer')
('Annalisa Blanchard', 60, 'Caulker')
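
The command above passes the scheduler address to the script through the `--dask_client_address` flag, with IPython substituting `{client.scheduler.address}` before the shell runs. In `example.py`, Beam's own `PipelineOptions` is what receives the command-line arguments. Purely as an illustration of how such a flag can be separated from the remaining arguments, here is a standard-library sketch; `split_args` and `--other_flag` are hypothetical and are not part of Beam or of the script.

```python
import argparse


def split_args(argv):
    """Pull --dask_client_address out of argv and return (address, leftover args)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--dask_client_address", default=None)
    # parse_known_args keeps unrecognized flags instead of raising an error.
    known, remaining = parser.parse_known_args(argv)
    return known.dask_client_address, remaining


# The address matches the scheduler's Comm value shown when the client started.
address, rest = split_args(
    ["--dask_client_address=tcp://127.0.0.1:63834", "--other_flag=1"]
)
print(address)
print(rest)
```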
