# Dask Bag

## Notebook Objectives
* **Read and manipulate data with Dask Bag**, a high-level interface for parallelizing computations on generic Python objects.
* **Convert Dask Bag to Dask DataFrame**.
* **Limitations of Dask Bag**.
* **References** for further reading.

## Read data with Dask Bag

We can create a Dask Bag from Python collections (lists, dicts, sets), from files (JSON, XML), from remote storage such as S3, and more.

Before that, let's start a Cluster:

In [2]:
from dask.distributed import Client

client = Client(n_workers=4)
client

Client: Scheduler: tcp://127.0.0.1:50672, Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 4, Cores: 12, Memory: 16.00 GiB


Open the dashboards!

### Reading from Python sequence

Here we create a Dask Bag from a Python list. You can create Bags similarly from sets and dictionaries.

Data is partitioned into blocks. In the following example, there are two partitions with 5 elements each.

In [3]:
import dask.bag as db

b = db.from_sequence(['Alaska', 'Minnesota', 'Georgia', 'Maine', 'West Virginia', 'California', 'South Dakota', 'Indiana', 'New York', 'Nebraska'], npartitions=2)
b

dask.bag<from_sequence, npartitions=2>
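To check how the elements were split, one option is to count the items in each partition with `map_partitions`. The following is a minimal sketch rather than part of the original notebook; it passes `scheduler="synchronous"` so that it also runs without a cluster.

```python
import dask.bag as db

states = ["Alaska", "Minnesota", "Georgia", "Maine", "West Virginia",
          "California", "South Dakota", "Indiana", "New York", "Nebraska"]
bag = db.from_sequence(states, npartitions=2)

# map_partitions calls the function once per partition; returning a
# one-element list yields one size per partition.
partition_sizes = bag.map_partitions(lambda part: [len(list(part))]).compute(
    scheduler="synchronous"  # run in-process, no cluster required
)
print(bag.npartitions, partition_sizes)
```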

Bag objects are evaluated lazily by default, so we need to call `compute` to get the result.

In [4]:
b.compute()

['Alaska',
 'Minnesota',
 'Georgia',
 'Maine',
 'West Virginia',
 'California',
 'South Dakota',
 'Indiana',
 'New York',
 'Nebraska']
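Laziness also applies to chained operations: calling `map` only records the work to be done and returns another Bag. A small sketch of this, using a shortened list of states and the in-process scheduler as assumptions:

```python
import dask.bag as db

bag = db.from_sequence(["Alaska", "Maine", "Indiana"], npartitions=2)

# Building the pipeline is cheap: no element has been processed yet.
upper = bag.map(str.upper)
print(type(upper).__name__)  # the result is still a Bag, not a list

# compute() executes the task graph and returns a plain Python list.
result = upper.compute(scheduler="synchronous")
print(result)
```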

`take()` can be used to show elements of the data.

In [5]:
b.take(3)

('Alaska', 'Minnesota', 'Georgia')
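Note that `take` only looks at the first partition by default, so asking for more elements than that partition holds returns fewer than requested (with a warning). The sketch below, which uses a bag of integers as a stand-in, shows how the `npartitions` argument changes this; `warn=False` is passed only to keep the output quiet.

```python
import dask
import dask.bag as db

# Two partitions of 5 elements each.
bag = db.from_sequence(list(range(10)), npartitions=2)

with dask.config.set(scheduler="synchronous"):  # in-process, no cluster
    # Default: only the first partition is searched, so at most 5
    # elements can come back even though 7 were requested.
    first_only = bag.take(7, warn=False)

    # npartitions=-1 lets take() draw from every partition.
    from_all = bag.take(7, npartitions=-1)

print(len(first_only), len(from_all))
```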

### Reading from JSON file

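Here we create a Dask Bag from JSON files. First, we generate some random records with `dask.datasets.make_people` and write them to disk, one JSON document per line.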

In [6]:
# Create random data and store as JSON files 

import dask
import json
import os

b = dask.datasets.make_people()
b.map(json.dumps).to_textfiles('data/*.json')

['/Users/pavithra-coiled/Developer/talkpython-dask-course/2-dask-fundamentals/data/0.json',
 '/Users/pavithra-coiled/Developer/talkpython-dask-course/2-dask-fundamentals/data/1.json',
 '/Users/pavithra-coiled/Developer/talkpython-dask-course/2-dask-fundamentals/data/2.json',
 '/Users/pavithra-coiled/Developer/talkpython-dask-course/2-dask-fundamentals/data/3.json',
 '/Users/pavithra-coiled/Developer/talkpython-dask-course/2-dask-fundamentals/data/4.json',
 '/Users/pavithra-coiled/Developer/talkpython-dask-course/2-dask-fundamentals/data/5.json',
 '/Users/pavithra-coiled/Developer/talkpython-dask-course/2-dask-fundamentals/data/6.json',
 '/Users/pavithra-coiled/Developer/talkpython-dask-course/2-dask-fundamentals/data/7.json',
 '/Users/pavithra-coiled/Developer/talkpython-dask-course/2-dask-fundamentals/data/8.json',
 '/Users/pavithra-coiled/Developer/talkpython-dask-course/2-dask-fundamentals/data/9.json']

Then, read the data using `read_text`.

In [7]:
b = db.read_text('data/*.json')
b

dask.bag<bag-from-delayed, npartitions=10>

In [7]:
b.take(2)

('{"age": 30, "name": ["Darrel", "Soto"], "occupation": "Audiologist", "telephone": "527.475.4983", "address": {"address": "460 Rivas Drung", "city": "Winston-Salem"}, "credit-card": {"number": "2446 9077 9141 7987", "expiration-date": "09/22"}}\n',
 '{"age": 38, "name": ["Sindy", "Campbell"], "occupation": "Foreman", "telephone": "946.885.3965", "address": {"address": "1185 Bass Spur", "city": "Millville"}, "credit-card": {"number": "4956 2525 9272 9241", "expiration-date": "08/20"}}\n')

Each of the 10 files becomes one partition, hence `npartitions=10`.

The data comes back as lines of text. We can parse each line into a Python dictionary with `json.loads`.

In [8]:
b = b.map(json.loads)
b.take(2)

({'age': 60,
  'name': ['Jeffery', 'Garcia'],
  'occupation': 'Training Consultant',
  'telephone': '1-702-673-7969',
  'address': {'address': '744 Langton Parade', 'city': 'Sugar Hill'},
  'credit-card': {'number': '3745 852410 45994', 'expiration-date': '06/23'}},
 {'age': 54,
  'name': ['Parker', 'Reed'],
  'occupation': 'Window Dresser',
  'telephone': '223-543-9697',
  'address': {'address': '1065 Mill Field', 'city': 'South Portland'},
  'credit-card': {'number': '3789 947854 38464', 'expiration-date': '09/23'}})
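The same read-then-parse pattern can be tried without `make_people`. The sketch below writes a few hypothetical records to a temporary directory (the names, ages, and file names are made up for illustration), reads them back with `read_text`, and parses them with `json.loads`.

```python
import json
import os
import tempfile

import dask
import dask.bag as db

# Hypothetical records, one JSON document per file.
records = [{"name": "Ada", "age": 36},
           {"name": "Linus", "age": 28},
           {"name": "Grace", "age": 45}]

with tempfile.TemporaryDirectory() as tmp:
    for i, rec in enumerate(records):
        with open(os.path.join(tmp, f"{i}.json"), "w") as f:
            f.write(json.dumps(rec) + "\n")

    raw = db.read_text(os.path.join(tmp, "*.json"))  # bag of text lines
    parsed = raw.map(json.loads)                     # bag of dicts

    with dask.config.set(scheduler="synchronous"):   # in-process, no cluster
        first_raw = raw.take(1)[0]
        people = parsed.compute()

print(type(first_raw).__name__, type(people[0]).__name__)
```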

## Manipulate data with Dask Bag

Bag objects have the standard functional API found in projects like the Python standard library, toolz, or pyspark, including map, filter, groupby, etc.

Operations on Bag objects create new bags. 

### Filter operation

Keep only the records with an age over 25.

In [9]:
b.filter(lambda record: record['age'] > 25).take(5)

({'age': 60,
  'name': ['Jeffery', 'Garcia'],
  'occupation': 'Training Consultant',
  'telephone': '1-702-673-7969',
  'address': {'address': '744 Langton Parade', 'city': 'Sugar Hill'},
  'credit-card': {'number': '3745 852410 45994', 'expiration-date': '06/23'}},
 {'age': 54,
  'name': ['Parker', 'Reed'],
  'occupation': 'Window Dresser',
  'telephone': '223-543-9697',
  'address': {'address': '1065 Mill Field', 'city': 'South Portland'},
  'credit-card': {'number': '3789 947854 38464', 'expiration-date': '09/23'}},
 {'age': 44,
  'name': ['Nicolas', 'Duncan'],
  'occupation': 'Forest Ranger',
  'telephone': '064.491.6735',
  'address': {'address': '529 Cameron Alley', 'city': 'Garner'},
  'credit-card': {'number': '4165 7976 6426 7113',
   'expiration-date': '11/22'}},
 {'age': 42,
  'name': ['Patrick', 'Rasmussen'],
  'occupation': 'Technical Clerk',
  'telephone': '530-726-3639',
  'address': {'address': '988 Western Shore Line', 'city': 'Yorba Linda'},
  'credit-card': {'number'

### Map operation

Extract only the first name from each record.

In [10]:
x = b.map(lambda record: record['name'][0]).take(10)
x

('Jeffery',
 'Parker',
 'Nicolas',
 'Rickie',
 'Patrick',
 'Caleb',
 'Cruz',
 'Jeanene',
 'Wade',
 'Jarrett')
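Because every operation returns a new Bag, `filter`, `pluck`, and `map` can be chained into a single pipeline. The following sketch uses a handful of trimmed-down records shaped like the `make_people` output (only `age` and `name` are kept, as an assumption for brevity).

```python
import dask.bag as db

# Small sample shaped like the make_people() records.
people = [
    {"age": 60, "name": ["Jeffery", "Garcia"]},
    {"age": 54, "name": ["Parker", "Reed"]},
    {"age": 44, "name": ["Nicolas", "Duncan"]},
    {"age": 23, "name": ["Rickie", "Dickerson"]},
    {"age": 42, "name": ["Patrick", "Rasmussen"]},
]
bag = db.from_sequence(people, npartitions=2)

# Each step returns a new Bag; `bag` itself is left unchanged.
over_25 = bag.filter(lambda record: record["age"] > 25)
first_names = over_25.pluck("name").map(lambda name: name[0])

n_over_25 = over_25.count().compute(scheduler="synchronous")
names = first_names.compute(scheduler="synchronous")
print(n_over_25, names)
```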

### Groupby Operation

Group data by some function or key.

In [11]:
b = db.from_sequence(x, npartitions=2)
b.groupby(len).compute()

[(6, ['Parker', 'Rickie']),
 (4, ['Cruz', 'Wade']),
 (7, ['Jeffery', 'Nicolas', 'Patrick', 'Jeanene', 'Jarrett']),
 (5, ['Caleb'])]

**Note:**

Often we want to group data by some function or key. We can do this either with the `.groupby` method, which is straightforward but forces a full shuffle of the data (expensive) or with the harder-to-use but faster `.foldby` method, which does a streaming combined groupby and reduction.

* `groupby`: Shuffles data so that all items with the same key are in the same key-value pair
* `foldby`: Walks through the data accumulating a result per key

_~ Source: [tutorial.dask.org](https://tutorial.dask.org/02_bag.html#Groupby-and-Foldby)_
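As a sketch of the `foldby` alternative, the example below counts names per length instead of collecting them into groups. `binop` updates the running count within a partition, and `combine` merges the per-partition counts. The in-process scheduler is an assumption made so the snippet runs on its own.

```python
import operator

import dask.bag as db

names = ["Jeffery", "Parker", "Nicolas", "Rickie", "Patrick",
         "Caleb", "Cruz", "Jeanene", "Wade", "Jarrett"]
bag = db.from_sequence(names, npartitions=2)

counts = bag.foldby(
    key=len,                               # group by name length
    binop=lambda total, _name: total + 1,  # per-partition running count
    initial=0,
    combine=operator.add,                  # merge counts across partitions
    combine_initial=0,
)

# The result is a bag of (key, value) pairs; order is not guaranteed.
name_counts = dict(counts.compute(scheduler="synchronous"))
print(name_counts)
```

Unlike `groupby`, this never materializes the full list of names per key, which is why it avoids the shuffle.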

## Checkpoint

**Question:** Find all cities from the JSON data we created earlier.

In [None]:
# Your answer here

In [None]:
b = db.read_text('data/*.json').map(json.loads)
x = b.map(lambda record: record['address']['city']).take(10)
x
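If the goal is the set of unique cities rather than the first ten values, `distinct` and `frequencies` may be useful. This is a sketch on a few hypothetical records that contain only the `address` field:

```python
import dask.bag as db

# Hypothetical records with just the nested address field.
records = [
    {"address": {"city": "Sugar Hill"}},
    {"address": {"city": "South Portland"}},
    {"address": {"city": "Garner"}},
    {"address": {"city": "Sugar Hill"}},
]
bag = db.from_sequence(records, npartitions=2)
cities = bag.map(lambda record: record["address"]["city"])

# distinct() drops duplicates; the result order is not guaranteed,
# so it is sorted here for display.
unique_cities = sorted(cities.distinct().compute(scheduler="synchronous"))

# frequencies() yields (value, count) pairs.
city_counts = dict(cities.frequencies().compute(scheduler="synchronous"))
print(unique_cities, city_counts)
```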

## Convert Dask Bag to Dask DataFrame

Dask Bag works well for simple analysis, but for more complex computations Dask DataFrame or Dask Array might be a better choice. They are faster for the same reason pandas and NumPy are faster than pure Python, and they offer more functionality suited to data analysis.

The `to_dataframe` method transforms a Dask Bag into a Dask DataFrame.

In [19]:
b = db.read_text('data/*.json').map(json.loads)
df = b.to_dataframe()
df.head()

| | age | name | occupation | telephone | address | credit-card |
|---|---|---|---|---|---|---|
| 0 | 60 | [Jeffery, Garcia] | Training Consultant | 1-702-673-7969 | {'address': '744 Langton Parade', 'city': 'Sug... | {'number': '3745 852410 45994', 'expiration-da... |
| 1 | 54 | [Parker, Reed] | Window Dresser | 223-543-9697 | {'address': '1065 Mill Field', 'city': 'South ... | {'number': '3789 947854 38464', 'expiration-da... |
| 2 | 44 | [Nicolas, Duncan] | Forest Ranger | 064.491.6735 | {'address': '529 Cameron Alley', 'city': 'Garn... | {'number': '4165 7976 6426 7113', 'expiration-... |
| 3 | 23 | [Rickie, Dickerson] | Chiropodist | (393) 425-7342 | {'address': '733 Miramar Run', 'city': 'Shakop... | {'number': '4904 7032 6941 1961', 'expiration-... |
| 4 | 42 | [Patrick, Rasmussen] | Technical Clerk | 530-726-3639 | {'address': '988 Western Shore Line', 'city': ... | {'number': '4075 4659 6389 2457', 'expiration-... |


Remember to close the Cluster. :)

In [20]:
client.close()

## Limitations

* Bag does not perform well on computations that require a great deal of inter-worker communication.
* Bag operations are slower than Array/DataFrame computations, for the same reason pure Python is slower than NumPy/pandas.
* `Bag.groupby` is slow. Use `Bag.foldby` instead where possible.
* Bags are immutable, so you cannot change individual elements.
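In practice, immutability means an "update" is expressed as a `map` that returns new records, leaving the original bag untouched. A minimal sketch with two hypothetical records:

```python
import dask.bag as db

bag = db.from_sequence(
    [{"name": "Jeffery", "age": 60}, {"name": "Rickie", "age": 23}],
    npartitions=1,
)

# There is no item assignment on a Bag; build a new Bag whose records
# are modified copies of the originals.
older = bag.map(lambda record: {**record, "age": record["age"] + 1})

original_ages = bag.pluck("age").compute(scheduler="synchronous")
updated_ages = older.pluck("age").compute(scheduler="synchronous")
print(original_ages, updated_ages)
```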

## References
* [Dask Bag documentation](https://docs.dask.org/en/latest/bag.html)
* [Dask Bag API](https://docs.dask.org/en/latest/bag-api.html)
* [Dask Bag examples](https://examples.dask.org/bag.html)
* [Dask Tutorial - Bag](https://tutorial.dask.org/02_bag.html)