# dask.bag

Bag: Parallel Lists for semi-structured data

Dask Bag excels at processing data that can be represented as a sequence of arbitrary inputs. We'll refer to this as "messy" data, because it can contain complex nested structures, missing fields, mixtures of data types, etc.

Messy data is often encountered at the beginning of data processing pipelines when large volumes of raw data are first consumed. The initial set of data might be log files, or data stored in JSON, CSV, XML, or any other format that does not enforce strict structure and datatypes. For this reason, the initial data massaging and processing is often done with Python lists, dicts, and sets.

These core data structures are optimized for general-purpose storage and processing. Adding streaming computation with iterators, generator expressions, or libraries like itertools or toolz lets us process large volumes of data in a small amount of memory. If we combine this with parallel processing, we can churn through a fair amount of data.
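
As a minimal sketch of the streaming idea, the pipeline below uses only the standard library. The `read_records` generator and its synthetic records are hypothetical stand-ins for a real data source; the point is that each stage is lazy, so only the records actually consumed are ever produced.

```python
import itertools

def read_records(n):
    """Yield n synthetic records one at a time instead of building a list."""
    for i in range(n):
        yield {"id": i, "value": i % 7}

# Each stage is lazy: nothing runs until something consumes the pipeline.
records = read_records(1_000_000)
interesting = (r for r in records if r["value"] == 0)   # filter
ids = (r["id"] for r in interesting)                    # map

# Pull only the first five results; later records are never generated.
first_five = list(itertools.islice(ids, 5))
print(first_five)
```

Because the generators hold one record at a time, memory use stays flat no matter how large `n` is.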

Dask Bag implements operations like `map`, `filter`, `groupby` and aggregations on collections of Python objects. It does this in parallel and in small memory using Python iterators.
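
A small sketch of these operations, assuming a handful of hypothetical in-memory records rather than real files. It uses `db.from_sequence` to build the bag and passes `scheduler="synchronous"` so that everything runs in a single process, which is convenient for experimenting.

```python
import dask.bag as db

# Hypothetical toy records, split into three partitions.
records = [
    {"name": "a", "age": 25},
    {"name": "b", "age": 41},
    {"name": "c", "age": 67},
    {"name": "d", "age": 33},
    {"name": "e", "age": 18},
    {"name": "f", "age": 52},
]
bag = db.from_sequence(records, npartitions=3)

over_30 = bag.filter(lambda r: r["age"] > 30)   # keep matching records
ages = over_30.map(lambda r: r["age"])          # extract one field

# The synchronous scheduler keeps this sketch single-process and easy to debug.
count = over_30.count().compute(scheduler="synchronous")
mean_age = ages.mean().compute(scheduler="synchronous")
print(count, mean_age)
```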

Full API documentation is available here: http://docs.dask.org/en/latest/bag-api.html

## An aside about dirty, unstructured data from web/REST APIs

The term `REST API` is used to mean a number of things. REST stands for [Representational State Transfer](https://en.wikipedia.org/wiki/Representational_state_transfer). Most people take REST to mean "a web host that gives me data in JSON format" (JSON stands for [JavaScript Object Notation](https://en.wikipedia.org/wiki/JSON)). This isn't technically accurate, but you will often hear people use the terms `REST API` and `web API` interchangeably.

For instance, the Compute Canada documentation has a web API for running searches and fetching the results in a machine-readable format. Try visiting this page:

<https://docs.computecanada.ca/mediawiki/api.php?action=query&list=search&srsearch=Python&format=json>
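
The part of the URL after the `?` is a set of key/value parameters. The sketch below rebuilds the same URL from a dictionary using only the standard library; the comments on each parameter reflect my reading of the names and are not taken from the API's documentation.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

base_url = "https://docs.computecanada.ca/mediawiki/api.php"
params = {
    "action": "query",      # presumably: run a query
    "list": "search",       # presumably: the query is a search
    "srsearch": "Python",   # the search term
    "format": "json",       # ask for JSON output
}

url = f"{base_url}?{urlencode(params)}"
print(url)

# Going the other way: split an existing URL back into its parameters.
recovered = parse_qs(urlsplit(url).query)
print(recovered["srsearch"])
```

Keeping the parameters in a dictionary makes it easy to change the search term programmatically. `requests.get` accepts the same dictionary through its `params` argument.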

The third-party `requests` library is a popular way to fetch and parse data from web APIs in Python. Below is an example of using it to fetch this same data from the Compute Canada documentation:

In [1]:
import requests

r = requests.get('https://docs.computecanada.ca/mediawiki/api.php?action=query&list=search&srsearch=Python&format=json'
)

# We should always check the response status code coming from the server ... 200 is the one we want from a GET request
r.status_code

200

In [2]:
# Decode the JSON response from the server into something Python understands ...
data = r.json()
data

{'batchcomplete': '',
 'query': {'searchinfo': {'totalhits': 7},
  'search': [{'ns': 0,
    'title': 'Python',
    'pageid': 618,
    'size': 30470,
    'wordcount': 4439,
    'snippet': "...hy stressing the readability of code. Its syntax is simple and expressive. <span class='searchmatch'>Python</span> has an extensive, easy-to-use standard library.\n...in their own directories. However, most systems offer several versions of <span class='searchmatch'>Python</span> as well as tools to help you install the third-party packages that you need\n",
    'timestamp': '2024-10-17T19:35:58Z'},
   {'ns': 0,
    'title': 'Python/en',
    'pageid': 2586,
    'size': 29110,
    'wordcount': 4217,
    'snippet': "...hy stressing the readability of code. Its syntax is simple and expressive. <span class='searchmatch'>Python</span> has an extensive, easy-to-use standard library.\n...in their own directories. However, most systems offer several versions of <span class='searchmatch'>Python</span> as we

In [3]:
# The output is a python dict
type(data)

dict

In [4]:
# We can check out the keys of the dict
data.keys()

dict_keys(['batchcomplete', 'query'])

In [5]:
# But better yet, we can explore the data to narrow down on the information we want
data['query']['search'][0]['title']

'Python'
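
Direct indexing like this assumes every key is present. With messy data that is not always true, so here is a hedged sketch of more defensive access using `dict.get`. The `data` dictionary is a trimmed stand-in for the response above, and the third hit is hypothetical, added only to show a missing field.

```python
# A trimmed stand-in for the response above; the third hit is hypothetical
# and deliberately lacks a 'wordcount' field to mimic messy data.
data = {
    "batchcomplete": "",
    "query": {
        "searchinfo": {"totalhits": 7},
        "search": [
            {"title": "Python", "pageid": 618, "wordcount": 4439},
            {"title": "Python/en", "pageid": 2586, "wordcount": 4217},
            {"title": "Hypothetical page", "pageid": -1},
        ],
    },
}

hits = data.get("query", {}).get("search", [])
titles = [hit["title"] for hit in hits]
# dict.get returns a default instead of raising KeyError for a missing field.
word_counts = [hit.get("wordcount", 0) for hit in hits]

print(titles)
print(word_counts)
```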

In [6]:
data['query']['search'][0]['snippet']

"...hy stressing the readability of code. Its syntax is simple and expressive. <span class='searchmatch'>Python</span> has an extensive, easy-to-use standard library.\n...in their own directories. However, most systems offer several versions of <span class='searchmatch'>Python</span> as well as tools to help you install the third-party packages that you need\n"

## Start Dask Client for Dashboard

Starting the Dask Client is optional.  It will provide a dashboard which
is useful to gain insight on the computation.  

The link to the dashboard will become visible when you create the client below.  We recommend having it open on one side of your screen while using your notebook on the other side.  It can take some effort to arrange your windows, but seeing them both at the same time is very useful when learning.

In [None]:
# NOTE: skip this cell when running on Colab

from dask.distributed import Client, progress
client = Client(n_workers=4, threads_per_worker=1)
client

## Create Random Data

We create a random set of record data and store it to disk as many JSON files.  This will serve as our data for this notebook.

In [9]:
# We might need this; comment or uncomment as needed ...
!pip install fsspec
#!conda install fsspec

# Note, Colab needs this:
import sys
if 'google.colab' in sys.modules:
    !pip install mimesis



In [10]:
import dask
import json
import os

os.makedirs('data', exist_ok=True)              # Create data/ directory

b = dask.datasets.make_people()                 # Make records of people
b.map(json.dumps).to_textfiles('data/*.json')   # Encode as JSON, write to disk

['/content/data/0.json',
 '/content/data/1.json',
 '/content/data/2.json',
 '/content/data/3.json',
 '/content/data/4.json',
 '/content/data/5.json',
 '/content/data/6.json',
 '/content/data/7.json',
 '/content/data/8.json',
 '/content/data/9.json']

## Read JSON data

Now that we have some JSON data in files, let's take a look at it with Dask Bag and Python's `json` module.

In [11]:
!head -n 2 data/0.json

{"age": 92, "name": ["Detra", "Reid"], "occupation": "Relocation Agent", "telephone": "+12255295418", "address": {"address": "944 Crags Plaza", "city": "Seguin"}, "credit-card": {"number": "3716 159233 48474", "expiration-date": "05/19"}}
{"age": 58, "name": ["Huey", "Rhodes"], "occupation": "Storeman", "telephone": "+1-364-480-8896", "address": {"address": "191 Metson Ferry", "city": "Edwardsville"}, "credit-card": {"number": "2478 4733 6300 7004", "expiration-date": "02/21"}}


In [12]:
import dask.bag as db
import json

b = db.read_text('data/*.json').map(json.loads)
b

dask.bag<loads, npartitions=10>
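
The files hold one JSON object per line, which is why each line is decoded separately with `json.loads`. The standard-library sketch below illustrates the difference, using two abbreviated records in the same layout as `data/0.json`.

```python
import io
import json

# Two abbreviated records in the same one-object-per-line layout as data/0.json.
text = (
    '{"age": 92, "name": ["Detra", "Reid"], "occupation": "Relocation Agent"}\n'
    '{"age": 58, "name": ["Huey", "Rhodes"], "occupation": "Storeman"}\n'
)

# Parsing the whole file as one JSON document fails ...
try:
    json.loads(text)
    whole_file_ok = True
except json.JSONDecodeError:
    whole_file_ok = False

# ... but parsing line by line works, because each line is a complete document.
records = [json.loads(line) for line in io.StringIO(text)]
print(whole_file_ok, [r["occupation"] for r in records])
```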

In [13]:
b.take(2)

({'age': 92,
  'name': ['Detra', 'Reid'],
  'occupation': 'Relocation Agent',
  'telephone': '+12255295418',
  'address': {'address': '944 Crags Plaza', 'city': 'Seguin'},
  'credit-card': {'number': '3716 159233 48474', 'expiration-date': '05/19'}},
 {'age': 58,
  'name': ['Huey', 'Rhodes'],
  'occupation': 'Storeman',
  'telephone': '+1-364-480-8896',
  'address': {'address': '191 Metson Ferry', 'city': 'Edwardsville'},
  'credit-card': {'number': '2478 4733 6300 7004',
   'expiration-date': '02/21'}})

## Map, Filter, Aggregate

We can process this data by filtering out only certain records of interest, mapping functions over it to process our data, and aggregating those results to a total value.

In [14]:
b.filter(lambda record: record['age'] > 30).take(2)  # Select only people over 30

({'age': 92,
  'name': ['Detra', 'Reid'],
  'occupation': 'Relocation Agent',
  'telephone': '+12255295418',
  'address': {'address': '944 Crags Plaza', 'city': 'Seguin'},
  'credit-card': {'number': '3716 159233 48474', 'expiration-date': '05/19'}},
 {'age': 58,
  'name': ['Huey', 'Rhodes'],
  'occupation': 'Storeman',
  'telephone': '+1-364-480-8896',
  'address': {'address': '191 Metson Ferry', 'city': 'Edwardsville'},
  'credit-card': {'number': '2478 4733 6300 7004',
   'expiration-date': '02/21'}})

In [15]:
b.map(lambda record: record['occupation']).take(2)  # Select the occupation field

('Relocation Agent', 'Storeman')

In [16]:
b.count().compute()  # Count total number of records

10000

## Chain computations

It is common to do many of these steps in one pipeline, only calling `compute` or `take` at the end.

In [17]:
result = (b.filter(lambda record: record['age'] > 30)
           .map(lambda record: record['occupation'])
           .frequencies(sort=True)
           .topk(10, key=1))
result

dask.bag<topk-aggregate, npartitions=1>

As with all lazy Dask collections, we need to call `compute` to actually evaluate our result.  The `take` method used in earlier examples also triggers computation, returning only the first few elements.
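
To make the laziness visible, this sketch maps a hypothetical `double` function that records each call in a list. It runs under the synchronous scheduler (set through `dask.config.set`) so that the side effect happens in the same interpreter; that choice is an assumption made for the demonstration, not a recommendation for real workloads.

```python
import dask
import dask.bag as db

calls = []

def double(x):
    calls.append(x)   # side effect so we can see when the function really runs
    return 2 * x

bag = db.from_sequence(range(10), npartitions=2)
lazy = bag.map(double)          # builds a task graph only

calls_before = len(calls)       # nothing has executed yet

# Run in-process so the side effect above is visible in this interpreter.
with dask.config.set(scheduler="synchronous"):
    first_two = lazy.take(2)        # by default reads from the first partition only
    everything = lazy.compute()     # evaluates all partitions

print(calls_before, first_two, everything)
```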

In [18]:
result.compute()

[('Sheet Metal Worker', 15),
 ('Geophysicist', 15),
 ('Masseuse', 15),
 ('Actress', 14),
 ('Insurance Broker', 14),
 ('Legal Secretary', 14),
 ('Air Traffic Controller', 14),
 ('Foster Parent', 14),
 ('Storeman', 13),
 ('Line Manager', 13)]
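
For intuition, here is what the filter, map, `frequencies`, `topk` chain computes, sketched on a single machine with `collections.Counter`. The `people` records are hypothetical toy data, not drawn from the dataset above.

```python
from collections import Counter

# Hypothetical toy records standing in for the people dataset.
people = [
    {"age": 45, "occupation": "Roofer"},
    {"age": 22, "occupation": "Recorder"},
    {"age": 61, "occupation": "Storeman"},
    {"age": 38, "occupation": "Roofer"},
    {"age": 57, "occupation": "Storeman"},
    {"age": 34, "occupation": "Roofer"},
    {"age": 70, "occupation": "Recorder"},
]

occupations = (p["occupation"] for p in people if p["age"] > 30)  # filter + map
counts = Counter(occupations)                                     # frequencies
top_two = counts.most_common(2)                                   # topk
print(top_two)
```

When several occupations share the same count, which of them make the cut can differ between implementations.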

## Transform and Store

Sometimes we want to compute aggregations as above, but sometimes we want to store results to disk for future analyses.  For that we can serialize records with `json.dumps` and write them out with the `to_textfiles` method, or we can convert to Dask Dataframes and use their storage systems, which we'll see more of in the next section.

In [19]:
(b.filter(lambda record: record['age'] > 30)  # Select records of interest
  .map(json.dumps)                            # Convert Python objects to text
  .to_textfiles('data/processed.*.json'))     # Write to local disk

['/content/data/processed.0.json',
 '/content/data/processed.1.json',
 '/content/data/processed.2.json',
 '/content/data/processed.3.json',
 '/content/data/processed.4.json',
 '/content/data/processed.5.json',
 '/content/data/processed.6.json',
 '/content/data/processed.7.json',
 '/content/data/processed.8.json',
 '/content/data/processed.9.json']
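
A self-contained round trip can help confirm that what was written can be read back. This sketch uses hypothetical toy records and a temporary directory instead of `data/`, and again assumes the synchronous scheduler for simplicity.

```python
import json
import os
import tempfile

import dask
import dask.bag as db

records = [{"id": i, "age": 20 + 10 * i} for i in range(6)]  # ages 20, 30, ..., 70
bag = db.from_sequence(records, npartitions=2)

with tempfile.TemporaryDirectory() as tmp, dask.config.set(scheduler="synchronous"):
    pattern = os.path.join(tmp, "processed.*.json")

    # '*' is replaced by the partition number, giving one file per partition.
    paths = (bag.filter(lambda r: r["age"] > 30)
                .map(json.dumps)
                .to_textfiles(pattern))

    # Read the files back and decode each line into a Python dict again.
    restored = db.read_text(pattern).map(json.loads).compute()

restored_ids = sorted(r["id"] for r in restored)
print(len(paths), restored_ids)
```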

We can use standard UNIX commands to look at some of the files created:

In [20]:
!ls -l data/*.json

-rw-r--r-- 1 root root 248517 Mar 30 19:43 data/0.json
-rw-r--r-- 1 root root 248041 Mar 30 19:43 data/1.json
-rw-r--r-- 1 root root 248173 Mar 30 19:43 data/2.json
-rw-r--r-- 1 root root 248280 Mar 30 19:43 data/3.json
-rw-r--r-- 1 root root 248265 Mar 30 19:43 data/4.json
-rw-r--r-- 1 root root 248140 Mar 30 19:43 data/5.json
-rw-r--r-- 1 root root 248007 Mar 30 19:43 data/6.json
-rw-r--r-- 1 root root 248044 Mar 30 19:43 data/7.json
-rw-r--r-- 1 root root 248213 Mar 30 19:43 data/8.json
-rw-r--r-- 1 root root 248175 Mar 30 19:43 data/9.json
-rw-r--r-- 1 root root 185199 Mar 30 19:45 data/processed.0.json
-rw-r--r-- 1 root root 184669 Mar 30 19:45 data/processed.1.json
-rw-r--r-- 1 root root 185691 Mar 30 19:45 data/processed.2.json
-rw-r--r-- 1 root root 185466 Mar 30 19:45 data/processed.3.json
-rw-r--r-- 1 root root 188699 Mar 30 19:45 data/processed.4.json
-rw-r--r-- 1 root root 186097 Mar 30 19:45 data/processed.5.json
-rw-r--r-- 1 root root 180851 Mar 30 19:45 data/processed.6.

In [21]:
!head data/processed.7.json

{"age": 114, "name": ["Bryce", "Hughes"], "occupation": "Bank Messenger", "telephone": "+16140253991", "address": {"address": "270 Brumiss Run", "city": "Baldwin Park"}, "credit-card": {"number": "4072 7026 1499 6326", "expiration-date": "01/24"}}
{"age": 106, "name": ["Theodore", "Brock"], "occupation": "Car Wash Attendant", "telephone": "+1-701-222-5386", "address": {"address": "319 Owen Canyon", "city": "West Hollywood"}, "credit-card": {"number": "4893 1113 7338 5597", "expiration-date": "07/25"}}
{"age": 57, "name": ["Leisha", "Carlson"], "occupation": "Recorder", "telephone": "+14348505317", "address": {"address": "130 Colusa Heights", "city": "Moreno Valley"}, "credit-card": {"number": "5172 7708 0694 0143", "expiration-date": "08/22"}}
{"age": 82, "name": ["Marcelo", "Andrews"], "occupation": "Roofer", "telephone": "+1-316-325-2615", "address": {"address": "1101 Mangels Freeway", "city": "Pottstown"}, "credit-card": {"number": "5483 7453 2934 2419", "expiration-date": "08/25"}}

## Convert to Dask Dataframes

Dask Bags are good for reading in initial data, doing a bit of pre-processing, and then handing off to some other more efficient form like Dask Dataframes.  Dask Dataframes use Pandas internally, and so can be much faster on numeric data and also have more complex algorithms.  

However, Dask Dataframes also expect data that is organized as flat columns.  They do not support nested JSON data very well (Bag is better for this).

Here we make a function to flatten down our nested data structure, map that across our records, and then convert that to a Dask Dataframe.

In [22]:
b.take(1)

({'age': 92,
  'name': ['Detra', 'Reid'],
  'occupation': 'Relocation Agent',
  'telephone': '+12255295418',
  'address': {'address': '944 Crags Plaza', 'city': 'Seguin'},
  'credit-card': {'number': '3716 159233 48474', 'expiration-date': '05/19'}},)

In [23]:
def flatten(record):
    return {
        'age': record['age'],
        'occupation': record['occupation'],
        'telephone': record['telephone'],
        'credit-card-number': record['credit-card']['number'],
        'credit-card-expiration': record['credit-card']['expiration-date'],
        'name': ' '.join(record['name']),
        'street-address': record['address']['address'],
        'city': record['address']['city']
    }

b.map(flatten).take(1)

({'age': 92,
  'occupation': 'Relocation Agent',
  'telephone': '+12255295418',
  'credit-card-number': '3716 159233 48474',
  'credit-card-expiration': '05/19',
  'name': 'Detra Reid',
  'street-address': '944 Crags Plaza',
  'city': 'Seguin'},)
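
The `flatten` function above names every field by hand, which gives full control over column names. As an alternative sketch, a generic recursive helper can flatten arbitrary nesting. The name `flatten_generic`, the `.` separator, and the choice to join list items with spaces are all assumptions made for illustration.

```python
def flatten_generic(record, parent_key="", sep="."):
    """Recursively flatten nested dicts; list items are joined with spaces."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse, carrying the parent key along as a prefix.
            flat.update(flatten_generic(value, full_key, sep))
        elif isinstance(value, list):
            flat[full_key] = " ".join(str(v) for v in value)
        else:
            flat[full_key] = value
    return flat

record = {
    "age": 92,
    "name": ["Detra", "Reid"],
    "address": {"address": "944 Crags Plaza", "city": "Seguin"},
}
flat_record = flatten_generic(record)
print(flat_record)
```

The trade-off is that column names such as `address.city` are derived mechanically, so the hand-written version remains the better choice when specific names are wanted.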

In [24]:
df = b.map(flatten).to_dataframe()
df.head()



|   | age | occupation | telephone | credit-card-number | credit-card-expiration | name | street-address | city |
|---|-----|------------|-----------|--------------------|------------------------|------|----------------|------|
| 0 | 92 | Relocation Agent | +12255295418 | 3716 159233 48474 | 05/19 | Detra Reid | 944 Crags Plaza | Seguin |
| 1 | 58 | Storeman | +1-364-480-8896 | 2478 4733 6300 7004 | 02/21 | Huey Rhodes | 191 Metson Ferry | Edwardsville |
| 2 | 15 | Pig Manager | +1-925-752-8725 | 4084 2874 7488 3166 | 01/23 | Casey Whitfield | 444 Leo Trace | Glendale Heights |
| 3 | 94 | Blinds Installer | +18120910541 | 3708 221873 70407 | 11/25 | Dane Hebert | 1022 Hunt Garden | Flagstaff |
| 4 | 50 | Bar Manager | +19103765933 | 3464 179894 55315 | 02/16 | Jamaal Velazquez | 1032 Torney Extension | Foster City |


We can now perform the same computation as before, this time using the Pandas-style API of Dask Dataframes.

In [25]:
df[df.age > 30].occupation.value_counts().nlargest(10).compute()

| occupation | count |
|------------|-------|
| Masseuse | 15 |
| Geophysicist | 15 |
| Sheet Metal Worker | 15 |
| Air Traffic Controller | 14 |
| Actress | 14 |
| Legal Secretary | 14 |
| Insurance Broker | 14 |
| Foster Parent | 14 |
| Shelter Warden | 13 |
| Rig Worker | 13 |


## Learn More

You may be interested in the following links:

-  [Dask Bag Documentation](http://docs.dask.org/en/latest/bag-overview.html)
-  [API Documentation](http://docs.dask.org/en/latest/bag-api.html)

[On to the next (optional) notebook (HPC Clusters)](07-hpc-clusters.ipynb) ...