# Dask Bags and Map-Reduce part 2


Dask Bag implements operations like `map`, `filter`, `groupby` and aggregations on collections of Python objects. It does this in parallel and with a small memory footprint by using Python iterators. It is similar to a parallel version of itertools or a Pythonic version of the PySpark RDD.

Dask Bags are often used for simple preprocessing of log files, JSON records, or other user-defined Python objects.
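
To make the itertools comparison concrete, here is a minimal single-process sketch of the kind of lazy filter, map and count pipeline that Bag runs in parallel across partitions. It uses only the standard library, the sample records are invented for illustration, and it is not meant to show how Dask implements these operations internally.

```python
from collections import Counter

# Hypothetical sample records, shaped like the JSON data used later on
records = [
    {'age': 63, 'occupation': 'Builder'},
    {'age': 16, 'occupation': 'Barmaid'},
    {'age': 61, 'occupation': 'Builder'},
    {'age': 45, 'occupation': 'Welder'},
]

# Generators are lazy: nothing runs until the result is consumed
over_30 = (r for r in records if r['age'] > 30)     # comparable to Bag.filter
occupations = (r['occupation'] for r in over_30)    # comparable to Bag.map

# Consuming the generator performs the whole pipeline one record at a time
counts = Counter(occupations)                       # comparable to Bag.frequencies

print(counts.most_common(1))
```

Only one record is held in memory at a time here, which is the property Bag keeps while spreading the work over several partitions.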

In [2]:
from dask.distributed import Client
client = Client()  # or connect to a running scheduler, e.g. Client('dask-scheduler:8786')
client

Client:  Scheduler: tcp://127.0.0.1:60330  Dashboard: http://127.0.0.1:8787/status
Cluster: Workers: 4  Cores: 8  Memory: 7.92 GB


## Read JSON data

Now that we have some JSON data in a file, let's take a look at it with Dask Bag and Python's `json` module.

In [3]:
!head -n 2 data/bag/0.json

'head' is not recognized as an internal or external command,
 operable program or batch file.


In [4]:
import dask.bag as db
import json

b = db.read_text('data/bag/*.json').map(json.loads)
b

dask.bag<loads, npartitions=10>
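
The reading step can be tried without the tutorial's data files. The sketch below writes a few made-up line-delimited JSON records to a temporary directory (the file names and fields are assumptions for illustration) and reads them back the same way as above. Passing `scheduler='synchronous'` runs the work in the current process, so the example does not depend on the `Client`.

```python
import json
import os
import tempfile

import dask.bag as db

# Hypothetical records written as line-delimited JSON (one object per line)
tmpdir = tempfile.mkdtemp()
for i in range(2):
    with open(os.path.join(tmpdir, f'{i}.json'), 'w') as f:
        for j in range(3):
            f.write(json.dumps({'id': i * 3 + j, 'age': 20 + 10 * j}) + '\n')

# read_text yields one string per line; the glob pattern picks up every file
lines = db.read_text(os.path.join(tmpdir, '*.json'))
parsed = lines.map(json.loads)   # parse each line into a Python dict

# pluck('age') extracts one field from every record
ages = parsed.pluck('age').compute(scheduler='synchronous')
print(sorted(ages))
```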

In [9]:
b.take(10)

({'age': 63,
  'name': ['Ike', 'Rowland'],
  'occupation': 'Wholesale Newspaper',
  'telephone': '117.341.3466',
  'address': {'address': '951 Jean Walk', 'city': 'West Chicago'},
  'credit-card': {'number': '3748 984723 54799', 'expiration-date': '02/22'}},
 {'age': 16,
  'name': ['Ethelene', 'Wolf'],
  'occupation': 'Barmaid',
  'telephone': '(153) 829-8347',
  'address': {'address': '611 Ridgewood Garden', 'city': 'Sanford'},
  'credit-card': {'number': '4943 1602 2905 5284',
   'expiration-date': '10/21'}},
 {'age': 22,
  'name': ['Manie', 'Mcintosh'],
  'occupation': 'Welder',
  'telephone': '762-503-7369',
  'address': {'address': '404 Nadell Line', 'city': 'Mount Juliet'},
  'credit-card': {'number': '3715 662910 59872', 'expiration-date': '05/22'}},
 {'age': 61,
  'name': ['Morgan', 'Baird'],
  'occupation': 'Builder',
  'telephone': '307.854.7396',
  'address': {'address': '1215 Coralino Private', 'city': 'Mason'},
  'credit-card': {'number': '3740 881515 86602', 'expiration-d

## Map, Filter, Aggregate

We can process this data by filtering out only certain records of interest, mapping functions over it to process our data, and aggregating those results to a total value.

In [10]:
def filt(record):
    # Keep only people older than 30
    return record['age'] > 30

b.filter(filt).take(2)  # Select only people over 30

({'age': 63,
  'name': ['Ike', 'Rowland'],
  'occupation': 'Wholesale Newspaper',
  'telephone': '117.341.3466',
  'address': {'address': '951 Jean Walk', 'city': 'West Chicago'},
  'credit-card': {'number': '3748 984723 54799', 'expiration-date': '02/22'}},
 {'age': 61,
  'name': ['Morgan', 'Baird'],
  'occupation': 'Builder',
  'telephone': '307.854.7396',
  'address': {'address': '1215 Coralino Private', 'city': 'Mason'},
  'credit-card': {'number': '3740 881515 86602', 'expiration-date': '02/20'}})

In [19]:
b.map(lambda record: record['occupation']).take(10)  # Select the occupation field

('Wholesale Newspaper',
 'Barmaid',
 'Welder',
 'Builder',
 'Radiologist',
 'Painter',
 'Optician',
 'Toy Maker',
 'Miner',
 'Speech Therapist')

In [12]:
b.count().compute()  # Count total number of records

10000
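
The three steps can also be combined on a small in-memory bag, which is convenient for checking the logic before running it on the full files. This is a minimal sketch: the four records are a hand-written sample, and `pluck` and `mean` are used here in addition to the methods shown above.

```python
import dask.bag as db

# Small hand-written sample (not read from the tutorial's data files)
records = [
    {'name': ['Ike', 'Rowland'], 'age': 63},
    {'name': ['Ethelene', 'Wolf'], 'age': 16},
    {'name': ['Manie', 'Mcintosh'], 'age': 22},
    {'name': ['Morgan', 'Baird'], 'age': 61},
]
b = db.from_sequence(records, npartitions=2)

over_30 = b.filter(lambda r: r['age'] > 30)           # filter: keep matching records
names = over_30.map(lambda r: ' '.join(r['name']))    # map: transform each record
ages = over_30.pluck('age')                           # same as map(lambda r: r['age'])

# Aggregations reduce the whole bag to a single value
n = over_30.count().compute(scheduler='synchronous')
mean_age = ages.mean().compute(scheduler='synchronous')
name_list = names.compute(scheduler='synchronous')

print(n, mean_age, name_list)
```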

## Chain computations

It is common to do many of these steps in one pipeline, only calling `compute` or `take` at the end.

In [13]:
result = (b.filter(lambda record: record['age'] > 30)
           .map(lambda record: record['occupation'])
           .frequencies(sort=True)
           .topk(10, key=1))
result

dask.bag<topk-aggregate, npartitions=1>

As always, to retrieve the results from a bag you have to call `compute`, which actually evaluates the pipeline. The `take` method used in earlier examples also triggers computation, but only for the first few elements.

In [14]:
result.compute()

[('Mineralologist', 15),
 ('Stone Sawyer', 15),
 ('Gate Keeper', 15),
 ('Airport Controller', 13),
 ('Translator', 13),
 ('Driver', 13),
 ('Marine Geologist', 13),
 ('Stone Cutter', 13),
 ('Gaming Board Inspector', 13),
 ('Pig Man', 13)]
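
The difference between building a pipeline and evaluating it can be seen on a toy bag. In this sketch the list of occupations is invented, and the counts are chosen without ties so that the ordering of the result is unambiguous.

```python
import dask.bag as db

# Hypothetical input: one occupation string per person
occupations = ['Builder', 'Welder', 'Builder', 'Miner', 'Builder', 'Welder']
b = db.from_sequence(occupations, npartitions=3)

# Building the pipeline only records a task graph; no data is processed yet
pipeline = b.frequencies(sort=True).topk(2, key=lambda kv: kv[1])
still_lazy = isinstance(pipeline, db.Bag)   # the pipeline is a Bag, not a result

# compute() runs the graph and returns concrete Python objects
top = pipeline.compute(scheduler='synchronous')
print(still_lazy, top)
```

When several items share the same count, as in the output above, which of the tied items make it into the top k is not something to rely on.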

## Transform and Store

Sometimes we want to compute aggregations as above, but sometimes we want to store results to disk for future analyses. For that we can serialize each record with `json.dumps` and write the resulting strings with the `to_textfiles` method, or we can convert to Dask Dataframes and use their storage systems, which we'll see more of in the next section.

In [15]:
(b.filter(lambda record: record['age'] > 30)  # Select records of interest
  .map(json.dumps)                            # Convert Python objects to text
  .to_textfiles('data/bag/processed.*.json'))     # Write to local disk

['C:/Users/feder/Desktop/uni/mapdb/ManagementAndAnalysisOfPhysicsDatasetsB/lecture_3/data/bag/processed.0.json',
 'C:/Users/feder/Desktop/uni/mapdb/ManagementAndAnalysisOfPhysicsDatasetsB/lecture_3/data/bag/processed.1.json',
 'C:/Users/feder/Desktop/uni/mapdb/ManagementAndAnalysisOfPhysicsDatasetsB/lecture_3/data/bag/processed.2.json',
 'C:/Users/feder/Desktop/uni/mapdb/ManagementAndAnalysisOfPhysicsDatasetsB/lecture_3/data/bag/processed.3.json',
 'C:/Users/feder/Desktop/uni/mapdb/ManagementAndAnalysisOfPhysicsDatasetsB/lecture_3/data/bag/processed.4.json',
 'C:/Users/feder/Desktop/uni/mapdb/ManagementAndAnalysisOfPhysicsDatasetsB/lecture_3/data/bag/processed.5.json',
 'C:/Users/feder/Desktop/uni/mapdb/ManagementAndAnalysisOfPhysicsDatasetsB/lecture_3/data/bag/processed.6.json',
 'C:/Users/feder/Desktop/uni/mapdb/ManagementAndAnalysisOfPhysicsDatasetsB/lecture_3/data/bag/processed.7.json',
 'C:/Users/feder/Desktop/uni/mapdb/ManagementAndAnalysisOfPhysicsDatasetsB/lecture_3/data/bag/pr
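
A quick way to check what was written is to read the files back. The sketch below does a round trip through a temporary directory; the records and the output location are assumptions made for the example, and `scheduler='synchronous'` is forwarded to `compute` so that everything runs in the current process.

```python
import json
import os
import tempfile

import dask.bag as db

# Hypothetical records: odd ids are 40 years old, even ids are 20
records = [{'id': i, 'age': 40 if i % 2 else 20} for i in range(6)]
b = db.from_sequence(records, npartitions=3)

outdir = tempfile.mkdtemp()
pattern = os.path.join(outdir, 'processed.*.json')

# One file is written per partition; '*' is replaced by the partition number
paths = (b.filter(lambda r: r['age'] > 30)
          .map(json.dumps)
          .to_textfiles(pattern, scheduler='synchronous'))

# Read the files back and parse each line again
restored = (db.read_text(pattern)
              .map(json.loads)
              .compute(scheduler='synchronous'))

print(len(paths), sorted(r['id'] for r in restored))
```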

## Convert to Dask Dataframes

Dask Bags are good for reading in initial data, doing a bit of pre-processing, and then handing off to some other more efficient form like Dask Dataframes.  Dask Dataframes use Pandas internally, and so can be much faster on numeric data and also have more complex algorithms.  

However, Dask Dataframes also expect data that is organized as flat columns. They do not support nested JSON data very well (Bag is better for this).

Here we make a function to flatten down our nested data structure, map that across our records, and then convert that to a Dask Dataframe.

In [16]:
b.take(1)

({'age': 63,
  'name': ['Ike', 'Rowland'],
  'occupation': 'Wholesale Newspaper',
  'telephone': '117.341.3466',
  'address': {'address': '951 Jean Walk', 'city': 'West Chicago'},
  'credit-card': {'number': '3748 984723 54799', 'expiration-date': '02/22'}},)

In [17]:
def flatten(record):
    return {
        'age': record['age'],
        'occupation': record['occupation'],
        'telephone': record['telephone'],
        'credit-card-number': record['credit-card']['number'],
        'credit-card-expiration': record['credit-card']['expiration-date'],
        'name': ' '.join(record['name']),
        'street-address': record['address']['address'],
        'city': record['address']['city']   
    }

b.map(flatten).take(1)

({'age': 63,
  'occupation': 'Wholesale Newspaper',
  'telephone': '117.341.3466',
  'credit-card-number': '3748 984723 54799',
  'credit-card-expiration': '02/22',
  'name': 'Ike Rowland',
  'street-address': '951 Jean Walk',
  'city': 'West Chicago'},)

In [18]:
df = b.map(flatten).to_dataframe()
df.head()

|   | age | occupation | telephone | credit-card-number | credit-card-expiration | name | street-address | city |
|---|-----|------------|-----------|--------------------|------------------------|------|----------------|------|
| 0 | 63 | Wholesale Newspaper | 117.341.3466 | 3748 984723 54799 | 02/22 | Ike Rowland | 951 Jean Walk | West Chicago |
| 1 | 16 | Barmaid | (153) 829-8347 | 4943 1602 2905 5284 | 10/21 | Ethelene Wolf | 611 Ridgewood Garden | Sanford |
| 2 | 22 | Welder | 762-503-7369 | 3715 662910 59872 | 05/22 | Manie Mcintosh | 404 Nadell Line | Mount Juliet |
| 3 | 61 | Builder | 307.854.7396 | 3740 881515 86602 | 02/20 | Morgan Baird | 1215 Coralino Private | Mason |
| 4 | 59 | Radiologist | 1-440-144-8989 | 3766 058764 62690 | 07/17 | Darrel Holloway | 1321 Conservatory Point | Alice |


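We can now perform the same computation as before, this time using Pandas and Dask Dataframe. Note that occupations tied at the same count may be selected or ordered differently than in the Bag result.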

In [20]:
df[df.age > 30].occupation.value_counts().nlargest(10).compute()

Gate Keeper               15
Mineralologist            15
Stone Sawyer              15
Plant Attendant           13
Stone Cutter              13
Packaging                 13
Driver                    13
Translator                13
Tyre Inspector            13
Gaming Board Inspector    13
Name: occupation, dtype: int64