# Install dependencies
- `apache-beam[dask]`: core package being demonstrated here
- `bokeh`: required for the Dask dashboard
- `mimesis`: required for generating example data
- `Pygments`: to `cat` example beam script with syntax highlighting

In [1]:
# !pip install "apache-beam[dask]" "bokeh!=3.0.*,>=2.4.2" mimesis Pygments

Pin the upper bound of `dask` and `distributed` to `2023.9.2` as a workaround until
[this fix](https://github.com/apache/beam/pull/27618/files#diff-bfb5ae715e9067778f492058e8a02ff877d6e7584624908ddbdd316853e6befbL102-R107)
is merged.


In [2]:
# !pip install -U "distributed>=2022.6.0,<2023.9.3"

# Start a client

In [3]:
from distributed import Client
client = Client()
client.dashboard_link

'http://127.0.0.1:8787/status'

# Example data

Based on https://examples.dask.org/bag.html#Create-Random-Data

In [4]:
import dask
import json
import tempfile

td = tempfile.TemporaryDirectory()
dask.datasets.make_people().map(json.dumps).to_textfiles(f'{td.name}/*.json')

['/var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/tmpuxyspq3f/0.json',
 '/var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/tmpuxyspq3f/1.json',
 '/var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/tmpuxyspq3f/2.json',
 '/var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/tmpuxyspq3f/3.json',
 '/var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/tmpuxyspq3f/4.json',
 '/var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/tmpuxyspq3f/5.json',
 '/var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/tmpuxyspq3f/6.json',
 '/var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/tmpuxyspq3f/7.json',
 '/var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/tmpuxyspq3f/8.json',
 '/var/folders/tt/4f941hdn0zq549zdwhcgg98c0000gn/T/tmpuxyspq3f/9.json']

Note that the data is in JSON Lines format (https://jsonlines.org), with one JSON record per line:

In [5]:
!head -n 2 {td.name}/0.json

{"age": 34, "name": ["Arlie", "James"], "occupation": "Ship Broker", "telephone": "+1-341-662-1903", "address": {"address": "584 Beeman Bayou", "city": "McKeesport"}, "credit-card": {"number": "2524 7603 9393 8305", "expiration-date": "07/19"}}
{"age": 56, "name": ["Heath", "Ochoa"], "occupation": "Traffic Warden", "telephone": "+1-843-811-6941", "address": {"address": "890 Ada Glen", "city": "Ramsey"}, "credit-card": {"number": "3448 503627 97253", "expiration-date": "02/16"}}


# Dask

Read, load, and filter the data using the Dask Bag API:

In [6]:
import dask.bag as db

b = (
    db
    .read_text(f'{td.name}/*.json')
    .map(json.loads)
    .filter(lambda record: record['age'] > 30)
    .filter(lambda record: record['name'][0].startswith('A'))
    .filter(lambda record: record['name'][1].startswith('B'))
    .filter(lambda record: record['occupation'].startswith('C'))
    .map(lambda record: (" ".join(record['name']), record['age'], record['occupation']))
)
b.compute()

[('Adaline Britt', 57, 'Cleaner'),
 ('Archie Bowen', 62, 'Cartoonist'),
 ('Alishia Burch', 31, 'Chiropodist'),
 ('Alia Bryant', 47, 'Chauffeur')]
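
The four chained filters act as a single logical AND, which is easy to check in plain Python without a cluster. The sketch below is illustrative only: the names `keep` and `summarize` are hypothetical, and the two sample records are abbreviated versions of records shown earlier in this notebook.

```python
def keep(record):
    """Return True if the record passes all four filters used above."""
    return (
        record["age"] > 30
        and record["name"][0].startswith("A")
        and record["name"][1].startswith("B")
        and record["occupation"].startswith("C")
    )


def summarize(record):
    """Build the (full name, age, occupation) tuple produced by the final map."""
    return (" ".join(record["name"]), record["age"], record["occupation"])


# Abbreviated sample records: the first passes every filter, the second does not.
records = [
    {"age": 57, "name": ["Adaline", "Britt"], "occupation": "Cleaner"},
    {"age": 34, "name": ["Arlie", "James"], "occupation": "Ship Broker"},
]

result = [summarize(r) for r in records if keep(r)]
print(result)
```

With named functions like these, the Dask Bag chain could presumably be shortened to `.filter(keep).map(summarize)`, at the cost of hiding the individual predicates.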

# Beam

Read, load, and apply the same filters using the Beam API.

Beam's `DaskRunner` doesn't yet support evaluation inside IPython, so we run the pipeline from a Python script instead:

In [7]:
!pygmentize -g example.py

import glob
import json
import sys

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.runners.dask.dask_runner import DaskRunner


def yield_jsonlines(fname: str):
    with open(fname) as

Run this computation on the _same Dask cluster_ that we used for the Dask Bag operation:

In [8]:
!python -W ignore example.py {td.name} --dask_client_address={client.scheduler.address}

('Adaline Britt', 57, 'Cleaner')
('Alia Bryant', 47, 'Chauffeur')
('Archie Bowen', 62, 'Cartoonist')
('Alishia Burch', 31, 'Chiropodist')
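
The Beam run prints the same four records as the Dask Bag run, but in a different order, so a direct list comparison would fail. A minimal sketch of an order-insensitive check, with both results copied from the outputs shown in this notebook (the variable names are illustrative):

```python
# Results copied from the Dask Bag output above.
dask_result = [
    ("Adaline Britt", 57, "Cleaner"),
    ("Archie Bowen", 62, "Cartoonist"),
    ("Alishia Burch", 31, "Chiropodist"),
    ("Alia Bryant", 47, "Chauffeur"),
]

# Results copied from the Beam script output above.
beam_result = [
    ("Adaline Britt", 57, "Cleaner"),
    ("Alia Bryant", 47, "Chauffeur"),
    ("Archie Bowen", 62, "Cartoonist"),
    ("Alishia Burch", 31, "Chiropodist"),
]

# Sort both lists before comparing so that element order does not matter.
same_records = sorted(dask_result) == sorted(beam_result)
print(same_records)
```

Sorting is used rather than converting to sets so that duplicate records, if any were present, would still be counted.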
