# Gotchas from Pandas to Dask

This notebook highlights some key differences to watch for when transferring code from Pandas to a Dask environment.  
Most issues include a link to the [Dask documentation](https://docs.dask.org/en/latest/) for additional information.

## Start Dask Client for Dashboard

Starting the Dask Client is optional. In this example we run on a `LocalCluster`, which also provides a dashboard that is useful for gaining insight into the computation.  
For additional information, see the [Dask Client documentation](https://docs.dask.org/en/latest/setup.html?highlight=client#setup).  

The link to the dashboard becomes visible when you create a client (as shown below).  
When running in `Jupyter Lab`, an [extension](https://github.com/dask/dask-labextension) can be installed to view the various dashboard widgets.

In [1]:
from dask.distributed import Client
# client = Client(n_workers=1, threads_per_worker=4, processes=False, memory_limit='2GB')
client = Client()
client

| Client | Cluster |
|---|---|
| Scheduler: tcp://127.0.0.1:51194, Dashboard: http://127.0.0.1:8787/status | Workers: 4, Cores: 4, Memory: 8.50 GB |


See the [documentation for additional cluster configuration](http://distributed.dask.org/en/latest/local-cluster.html).

In [2]:
# since Dask is actively being developed, note the versions used to run this example
import dask
import pandas as pd
print(f'Dask version: {dask.__version__}')
print(f'Pandas version: {pd.__version__}')

Dask version: 1.2.2
Pandas version: 0.24.2


# Create 2 DataFrames for comparison: 
1. for Dask 
2. for Pandas  
Dask comes with built-in sample datasets; we will use the time series sample for our example.

In [3]:
ddf = dask.datasets.timeseries()
ddf

| npartitions=30 | id | name | x | y |
|---|---|---|---|---|
| 2000-01-01 | int32 | object | float64 | float64 |
| 2000-01-02 | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
| 2000-01-30 | ... | ... | ... | ... |
| 2000-01-31 | ... | ... | ... | ... |


* Remember that `Dask` is **lazy**, so in order to see the result we need to run [compute()](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.compute) 
 (or `head()`, which calls `compute()` under the hood).

In [4]:
ddf.head(2)

| timestamp | id | name | x | y |
|---|---|---|---|---|
| 2000-01-01 00:00:00 | 1035 | Jerry | 0.09731 | -0.725801 |
| 2000-01-01 00:00:01 | 1014 | Zelda | 0.688073 | -0.251359 |


In order to create a `Pandas` dataframe we can use the `compute()` method from a `Dask dataframe`

In [5]:
pdf = ddf.compute()  
print(type(pdf))
pdf.head(2)

<class 'pandas.core.frame.DataFrame'>


| timestamp | id | name | x | y |
|---|---|---|---|---|
| 2000-01-01 00:00:00 | 1035 | Jerry | 0.09731 | -0.725801 |
| 2000-01-01 00:00:01 | 1014 | Zelda | 0.688073 | -0.251359 |


## Creating a `Dask dataframe` from Pandas
In order to utilize `Dask` capabilities on an existing `Pandas dataframe` (pdf) we need to convert the `Pandas dataframe` into a `Dask dataframe` (ddf) with the [from_pandas](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.from_pandas) method. 
You must supply either the number of partitions (`npartitions`) or the `chunksize` that will be used to generate the Dask dataframe.

In [10]:
ddf2 = dask.dataframe.from_pandas(pdf, npartitions=10)
ddf2

| npartitions=10 | id | name | x | y |
|---|---|---|---|---|
| 2000-01-01 00:00:00 | int32 | object | float64 | float64 |
| 2000-01-04 00:00:00 | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
| 2000-01-28 00:00:00 | ... | ... | ... | ... |
| 2000-01-30 23:59:59 | ... | ... | ... | ... |


## Partitions in Dask Dataframes

Notice that when we created a `Dask dataframe` we needed to supply an argument of `npartitions`.  
The number of partitions determines how `Dask` splits the data and parallelizes the computation.  
Each partition is a *separate* dataframe. For additional information see [partition documentation](https://docs.dask.org/en/latest/dataframe-design.html?highlight=meta%20utils#partitions)  

An example of this can be seen when examining the `reset_index()` method:

In [7]:
pdf2 = pdf.reset_index()
# Only 1 row
pdf2.loc[0]

timestamp    2000-01-01 00:00:00
id                          1035
name                       Jerry
x                      0.0973096
y                      -0.725801
Name: 0, dtype: object

In [11]:
ddf2 = ddf2.reset_index()
# each partition has an index=0
ddf2.loc[0].compute() 

| | timestamp | id | name | x | y |
|---|---|---|---|---|---|
| 0 | 2000-01-01 | 1035 | Jerry | 0.09731 | -0.725801 |
| 0 | 2000-01-04 | 933 | Quinn | -0.536902 | -0.165153 |
| 0 | 2000-01-07 | 1024 | Alice | 0.096824 | -0.051993 |
| 0 | 2000-01-10 | 1057 | Norbert | 0.515502 | 0.386068 |
| 0 | 2000-01-13 | 992 | Oliver | -0.655252 | -0.0571 |
| 0 | 2000-01-16 | 1021 | Frank | 0.860617 | -0.390791 |
| 0 | 2000-01-19 | 1097 | Ursula | -0.970976 | -0.873724 |
| 0 | 2000-01-22 | 991 | Victor | -0.77089 | -0.796061 |
| 0 | 2000-01-25 | 962 | Norbert | -0.761734 | 0.504892 |
| 0 | 2000-01-28 | 995 | Victor | -0.207978 | 0.855133 |


Now that we have a `dask` (ddf) and a `pandas` (pdf) dataframe we can start to compare the interactions with them.

## Conceptual shift - from Update to Insert/Delete
Dask does not update data in place, so arguments such as `inplace=True`, which exist in Pandas, are not available.  
For more details see [issue #653 on GitHub](https://github.com/dask/dask/issues/653)

### Rename Columns

* Using `inplace=True` is not considered to be *best practice*. 

In [12]:
# Pandas 
print(pdf.columns)
# pdf.rename(columns={'id':'ID'}, inplace=True)
pdf = pdf.rename(columns={'id':'ID'})
pdf.columns

Index(['id', 'name', 'x', 'y'], dtype='object')


Index(['ID', 'name', 'x', 'y'], dtype='object')

In [13]:
# Dask - Error
ddf.rename(columns={'id':'ID'}, inplace=True)
ddf.columns

TypeError: rename() got an unexpected keyword argument 'inplace'

In [14]:
# Dask
print(ddf.columns)
ddf = ddf.rename(columns={'id':'ID'})
ddf.columns

Index(['id', 'name', 'x', 'y'], dtype='object')


Index(['ID', 'name', 'x', 'y'], dtype='object')

## Data manipulations  
There are several differences when manipulating data.  

### loc - Pandas

In [15]:
cond = (pdf['x']>0.5) &(pdf['x']<0.8)
pdf[cond].head(2)

| timestamp | ID | name | x | y |
|---|---|---|---|---|
| 2000-01-01 00:00:01 | 1014 | Zelda | 0.688073 | -0.251359 |
| 2000-01-01 00:00:03 | 1014 | Ray | 0.525594 | -0.631304 |


In [16]:
pdf.loc[cond, ['y']] = pdf['y']* 100
pdf[cond].head()

| timestamp | ID | name | x | y |
|---|---|---|---|---|
| 2000-01-01 00:00:01 | 1014 | Zelda | 0.688073 | -25.135917 |
| 2000-01-01 00:00:03 | 1014 | Ray | 0.525594 | -63.13038 |
| 2000-01-01 00:00:05 | 1025 | Michael | 0.621431 | 91.171588 |
| 2000-01-01 00:00:06 | 995 | George | 0.589302 | 15.217299 |
| 2000-01-01 00:00:20 | 970 | Bob | 0.700838 | 46.547723 |


### Dask - use mask/where

In [17]:
cond_dask = (ddf['x']>0.5) & (ddf['x']<0.8)

In [18]:
# Error
ddf.loc[cond_dask, ['y']] = ddf['y']* 100

TypeError: '_LocIndexer' object does not support item assignment

In [19]:
ddf['y'] = ddf['y'].mask(cond_dask, ddf['y']* 100)
ddf[cond_dask].head()

| timestamp | ID | name | x | y |
|---|---|---|---|---|
| 2000-01-01 00:00:01 | 1014 | Zelda | 0.688073 | -25.135917 |
| 2000-01-01 00:00:03 | 1014 | Ray | 0.525594 | -63.13038 |
| 2000-01-01 00:00:05 | 1025 | Michael | 0.621431 | 91.171588 |
| 2000-01-01 00:00:06 | 995 | George | 0.589302 | 15.217299 |
| 2000-01-01 00:00:20 | 970 | Bob | 0.700838 | 46.547723 |


## Meta
One key difference is the introduction of the `meta` argument.  
> `meta` is the prescription of the names/types of the output from the computation  
[see stack overflow answer](https://stackoverflow.com/questions/44432868/dask-dataframe-apply-meta)

Since `Dask` builds a DAG for the computation, it needs to know the output names and types of each calculation ahead of time.  
For additional information see [meta documentation](https://docs.dask.org/en/latest/dataframe-design.html?highlight=meta%20utils#metadata)

In [20]:
pdf['initials'] = pdf['name'].apply(lambda x: x[0]+x[1])
pdf.head(2)

| timestamp | ID | name | x | y | initials |
|---|---|---|---|---|---|
| 2000-01-01 00:00:00 | 1035 | Jerry | 0.09731 | -0.725801 | Je |
| 2000-01-01 00:00:01 | 1014 | Zelda | 0.688073 | -25.135917 | Ze |


In [21]:
ddf['initials'] = ddf['name'].apply(lambda x: x[0]+x[1])
ddf.head(2)

You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta=('name', 'object'))



| timestamp | ID | name | x | y | initials |
|---|---|---|---|---|---|
| 2000-01-01 00:00:00 | 1035 | Jerry | 0.09731 | -0.725801 | Je |
| 2000-01-01 00:00:01 | 1014 | Zelda | 0.688073 | -25.135917 | Ze |


In [22]:
# Describe the output type of the calculation: an empty Series with the expected name and dtype
meta_cal = pd.Series(dtype='object', name='initials')

In [23]:
ddf['initials'] = ddf['name'].apply(lambda x: x[0]+x[1], meta = meta_cal)
ddf.head(2)

| timestamp | ID | name | x | y | initials |
|---|---|---|---|---|---|
| 2000-01-01 00:00:00 | 1035 | Jerry | 0.09731 | -0.725801 | Je |
| 2000-01-01 00:00:01 | 1014 | Zelda | 0.688073 | -25.135917 | Ze |


In [24]:
def func(row):
    if (row['x']> 0):
        return row['x'] * 1000  
    else:
        return row['y'] * -1

In [25]:
# ddf['z'] = ddf.apply(func, args=('coor_x', 'coor_y'), axis=1, meta=('z', 'float'))
ddf['z'] = ddf.apply(func, axis=1, meta=('z', 'float'))
ddf.head(2)

| timestamp | ID | name | x | y | initials | z |
|---|---|---|---|---|---|---|
| 2000-01-01 00:00:00 | 1035 | Jerry | 0.09731 | -0.725801 | Je | 97.309594 |
| 2000-01-01 00:00:01 | 1014 | Zelda | 0.688073 | -25.135917 | Ze | 688.072656 |


* We can supply a function to run on each partition using the [map_partitions](https://dask.readthedocs.io/en/latest/dataframe-api.html#dask.dataframe.DataFrame.map_partitions) method.  
The function could also include arguments.

In [26]:
def func2(df, col1, col2):
    z = df[col1] * df[col2]
    return z

In [27]:
ddf['a'] = ddf.map_partitions(func2, col1='x', col2='y', meta=('a', 'float'))

In [28]:
ddf.head(2)

| timestamp | ID | name | x | y | initials | z | a |
|---|---|---|---|---|---|---|---|
| 2000-01-01 00:00:00 | 1035 | Jerry | 0.09731 | -0.725801 | Je | 97.309594 | -0.070627 |
| 2000-01-01 00:00:01 | 1014 | Zelda | 0.688073 | -25.135917 | Ze | 688.072656 | -17.295337 |


* Finally we can return a `dataframe` which needs to be described in the `meta` argument

In [29]:
def func3(df, col1, col2, col3, col4):
    # assign() returns a new dataframe, so the input partition is not modified in place
    df = df.assign(col12=df[col1] * df[col2])
    df = df.drop([col3, col4], axis=1)
    return df

In [30]:
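# meta is given as a dict of {column name: dtype};
# wrapping these values in pd.DataFrame(..., index=[0]) would store them as data and make every column `object`
ddf_z = ddf.map_partitions(func3, 'x', 'y', 'z', 'a',
                           meta={'ID': 'i4',            # id is int32 in the sample dataset
                                 'name': 'object',
                                 'x': 'f8',
                                 'y': 'f8',
                                 'initials': 'object',
                                 'col12': 'f8'}
                          )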
ddf_z.head(2)

| timestamp | ID | name | x | y | initials | col12 |
|---|---|---|---|---|---|---|
| 2000-01-01 00:00:00 | 1035 | Jerry | 0.09731 | -0.725801 | Je | -0.070627 |
| 2000-01-01 00:00:01 | 1014 | Zelda | 0.688073 | -25.135917 | Ze | -17.295337 |


### Convert index into Time column

In [31]:
# Pandas
pdf = pdf.assign(Time=pd.to_datetime(pdf.index).time)
pdf.head(2)

| timestamp | ID | name | x | y | initials | Time |
|---|---|---|---|---|---|---|
| 2000-01-01 00:00:00 | 1035 | Jerry | 0.09731 | -0.725801 | Je | 00:00:00 |
| 2000-01-01 00:00:01 | 1014 | Zelda | 0.688073 | -25.135917 | Ze | 00:00:01 |


In [32]:
# ddf = ddf.assign(Time= dask.dataframe.to_datetime(ddf.index, format='%Y-%m-%d'). )
ddf = ddf.assign(Time= dask.dataframe.to_datetime(ddf.index).dt.time )
ddf.head(2)

| timestamp | ID | name | x | y | initials | z | a | Time |
|---|---|---|---|---|---|---|---|---|
| 2000-01-01 00:00:00 | 1035 | Jerry | 0.09731 | -0.725801 | Je | 97.309594 | -0.070627 | 00:00:00 |
| 2000-01-01 00:00:01 | 1014 | Zelda | 0.688073 | -25.135917 | Ze | 688.072656 | -17.295337 | 00:00:01 |


In [33]:
# Dask or Pandas
ddf = ddf.assign(Time2=ddf.index)
ddf['Time2'] = ddf['Time2'].dt.time
ddf.head()

| timestamp | ID | name | x | y | initials | z | a | Time | Time2 |
|---|---|---|---|---|---|---|---|---|---|
| 2000-01-01 00:00:00 | 1035 | Jerry | 0.09731 | -0.725801 | Je | 97.309594 | -0.070627 | 00:00:00 | 00:00:00 |
| 2000-01-01 00:00:01 | 1014 | Zelda | 0.688073 | -25.135917 | Ze | 688.072656 | -17.295337 | 00:00:01 | 00:00:01 |
| 2000-01-01 00:00:02 | 1000 | Zelda | 0.040817 | -0.9153 | Ze | 40.817265 | -0.03736 | 00:00:02 | 00:00:02 |
| 2000-01-01 00:00:03 | 1014 | Ray | 0.525594 | -63.13038 | Ra | 525.593513 | -33.180918 | 00:00:03 | 00:00:03 |
| 2000-01-01 00:00:04 | 995 | Frank | -0.060979 | 0.374551 | Fr | -0.374551 | -0.02284 | 00:00:04 | 00:00:04 |


## Drop NA on column

In [34]:
# Pandas
pdf = pdf.assign(colna = None)
print(pdf.head(2))
pdf.dropna(axis=1, how='all', inplace=True)
print(pdf.head(2))

                       ID   name         x          y initials      Time colna
timestamp                                                                     
2000-01-01 00:00:00  1035  Jerry  0.097310  -0.725801       Je  00:00:00  None
2000-01-01 00:00:01  1014  Zelda  0.688073 -25.135917       Ze  00:00:01  None
                       ID   name         x          y initials      Time
timestamp                                                               
2000-01-01 00:00:00  1035  Jerry  0.097310  -0.725801       Je  00:00:00
2000-01-01 00:00:01  1014  Zelda  0.688073 -25.135917       Ze  00:00:01


In order for `Dask` to drop a column in which all values are `na`, it must check all the partitions with `compute()`.

In [35]:
# Dask
ddf = ddf.assign(colna = None)
print(ddf.head(2))
if ddf.colna.isnull().all().compute():   # check if all values in column are Null - VERY slow
    ddf = ddf.drop(labels=['colna'],axis=1)
print(ddf.head(2))

                       ID   name         x          y initials           z  \
timestamp                                                                    
2000-01-01 00:00:00  1035  Jerry  0.097310  -0.725801       Je   97.309594   
2000-01-01 00:00:01  1014  Zelda  0.688073 -25.135917       Ze  688.072656   

                             a      Time     Time2 colna  
timestamp                                                 
2000-01-01 00:00:00  -0.070627  00:00:00  00:00:00  None  
2000-01-01 00:00:01 -17.295337  00:00:01  00:00:01  None  
                       ID   name         x          y initials           z  \
timestamp                                                                    
2000-01-01 00:00:00  1035  Jerry  0.097310  -0.725801       Je   97.309594   
2000-01-01 00:00:01  1014  Zelda  0.688073 -25.135917       Ze  688.072656   

                             a      Time     Time2  
timestamp                                           
2000-01-01 00:00:00  -0.070627  

## Reset Index

In [36]:
# Pandas
pdf.reset_index(drop=True, inplace=True)
pdf.head(2)

| | ID | name | x | y | initials | Time |
|---|---|---|---|---|---|---|
| 0 | 1035 | Jerry | 0.09731 | -0.725801 | Je | 00:00:00 |
| 1 | 1014 | Zelda | 0.688073 | -25.135917 | Ze | 00:00:01 |


In [None]:
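# Dask
# assigned to a new name so that later cells keep the timestamp index on ddf
ddf_reset = ddf.reset_index()
ddf_reset['Time'] = ddf_reset['timestamp'].dt.time
ddf_reset = ddf_reset.drop(labels=['timestamp'], axis=1)
# select the columns to keep; assigning a shorter list to .columns raises a length mismatch
ddf_reset = ddf_reset[['ID', 'name', 'x', 'y', 'Time']]
ddf_reset.head(2)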

# Read / Save files

When working with `pandas` and `dask`, prefer the [parquet](https://docs.dask.org/en/latest/dataframe-best-practices.html?highlight=parquet#store-data-in-apache-parquet-format) format where possible.  
Even with other formats, `Dask` can read the files with multiple workers.  
Most `pandas` `kwargs` are applicable for reading and writing files, [see documentation](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.to_csv) (including the option for output file naming), e.g.  
`ddf = dask.dataframe.read_csv('data/pd2dd/ddf*.csv', compression='gzip', header=None)`  

However, some are not available, such as `nrows`.

## Save files

In [37]:
# Pandas
!mkdir data
pdf.to_csv('data/pdf_single_file.csv')

In [None]:
!dir data  # use ls on linux systems

`Dask`  
Notice the `*` in the file name: Dask writes one file per partition and substitutes the partition number for the `*`.



In [39]:
# Dask
!mkdir data\pd2dd
ddf.to_csv('data/pd2dd/ddf*.csv', index = False)

['data/pd2dd/ddf00.csv',
 'data/pd2dd/ddf01.csv',
 'data/pd2dd/ddf02.csv',
 'data/pd2dd/ddf03.csv',
 'data/pd2dd/ddf04.csv',
 'data/pd2dd/ddf05.csv',
 'data/pd2dd/ddf06.csv',
 'data/pd2dd/ddf07.csv',
 'data/pd2dd/ddf08.csv',
 'data/pd2dd/ddf09.csv',
 'data/pd2dd/ddf10.csv',
 'data/pd2dd/ddf11.csv',
 'data/pd2dd/ddf12.csv',
 'data/pd2dd/ddf13.csv',
 'data/pd2dd/ddf14.csv',
 'data/pd2dd/ddf15.csv',
 'data/pd2dd/ddf16.csv',
 'data/pd2dd/ddf17.csv',
 'data/pd2dd/ddf18.csv',
 'data/pd2dd/ddf19.csv',
 'data/pd2dd/ddf20.csv',
 'data/pd2dd/ddf21.csv',
 'data/pd2dd/ddf22.csv',
 'data/pd2dd/ddf23.csv',
 'data/pd2dd/ddf24.csv',
 'data/pd2dd/ddf25.csv',
 'data/pd2dd/ddf26.csv',
 'data/pd2dd/ddf27.csv',
 'data/pd2dd/ddf28.csv',
 'data/pd2dd/ddf29.csv']

In [None]:
!dir data\pd2dd\ 

To find the number of partitions, which determines the number of output files, use [dask.dataframe.npartitions](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.npartitions)  

To change the number of output files use [repartition](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.repartition) which is an expensive operation.

## Read files

For `pandas` it is possible to iterate and concat the files [see answer from stack overflow](https://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe).

In [54]:
# Pandas 
import glob
import os

path = r'data/pd2dd/'                     # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))     # advisable to use os.path.join as this makes concatenation OS independent
df_from_each_file = (pd.read_csv(f) for f in all_files)
concatenated_df = pd.concat(df_from_each_file, ignore_index=True)
len(concatenated_df)

2592000

In [41]:
# Dask
_ddf = dask.dataframe.read_csv('data/pd2dd/ddf*.csv')
len(_ddf)

2592000

# Group By - custom aggregations
In addition to the notebook example that is in the repository, this is another example of how to eliminate the use of `groupby.apply`.  
In this example we group by one column and collect the values of other columns into lists of unique values.

In [42]:
# prepare pandas dataframe
pdf = pdf.assign(Time=pd.to_datetime(pdf.index).time)
pdf['seconds'] = pdf.Time.astype(str).str[-2:]
pdf.head()

| | ID | name | x | y | initials | Time | seconds |
|---|---|---|---|---|---|---|---|
| 0 | 1035 | Jerry | 0.09731 | -0.725801 | Je | 00:00:00 | 0 |
| 1 | 1014 | Zelda | 0.688073 | -25.135917 | Ze | 00:00:00 | 0 |
| 2 | 1000 | Zelda | 0.040817 | -0.9153 | Ze | 00:00:00 | 0 |
| 3 | 1014 | Ray | 0.525594 | -63.13038 | Ra | 00:00:00 | 0 |
| 4 | 995 | Frank | -0.060979 | 0.374551 | Fr | 00:00:00 | 0 |


In [43]:
# pandas preparations
def set_list_att(x: pd.Series):
    # return the unique values of the series as a list
    return list(set(x.values))

In [44]:
%%time
# pandas option 1 using apply
pdf_gb = pdf.groupby(pdf.name)
gp_col = ['ID', 'seconds']
list_ser_gb = [pdf_gb[att_col_gr].apply(set_list_att) for att_col_gr in gp_col]
df_edge_att = pdf_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')        
df_edge_att.head()

Wall time: 4.33 s


In [45]:
%%time
# pandas option 2 using lambda
pdf_gb = pdf.groupby(pdf.name)
gp_col = ['ID', 'seconds']
list_ser_gb = [pdf_gb[att_col_gr].apply(lambda x: list(set(x.to_list()))) for att_col_gr in gp_col]
df_edge_att = pdf_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')        
df_edge_att.head()

Wall time: 2.3 s


Note that using Pandas is sometimes more efficient (assuming that you can load all the data into RAM).  
In this case Pandas is faster.

In [46]:
# prepare dask dataframe
ddf['seconds'] = ddf.Time.astype(str).str[-2:]
ddf = client.persist(ddf)
ddf.head()

| timestamp | ID | name | x | y | initials | z | a | Time | Time2 | seconds |
|---|---|---|---|---|---|---|---|---|---|---|
| 2000-01-01 00:00:00 | 1035 | Jerry | 0.09731 | -0.725801 | Je | 97.309594 | -0.070627 | 00:00:00 | 00:00:00 | 0 |
| 2000-01-01 00:00:01 | 1014 | Zelda | 0.688073 | -25.135917 | Ze | 688.072656 | -17.295337 | 00:00:01 | 00:00:01 | 1 |
| 2000-01-01 00:00:02 | 1000 | Zelda | 0.040817 | -0.9153 | Ze | 40.817265 | -0.03736 | 00:00:02 | 00:00:02 | 2 |
| 2000-01-01 00:00:03 | 1014 | Ray | 0.525594 | -63.13038 | Ra | 525.593513 | -33.180918 | 00:00:03 | 00:00:03 | 3 |
| 2000-01-01 00:00:04 | 995 | Frank | -0.060979 | 0.374551 | Fr | -0.374551 | -0.02284 | 00:00:04 | 00:00:04 | 4 |


In [47]:
%%time
# Dask option1 using apply
# notice the meta argument in the apply function
df_gb = ddf.groupby(ddf.name)
gp_col = ['ID', 'seconds']
list_ser_gb = [df_gb[att_col_gr].apply(set_list_att
                                      ,meta=pd.Series(dtype='object', name=f'{att_col_gr}_att')) 
               for att_col_gr in gp_col]
df_edge_att = df_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')        
df_edge_att.head()

Wall time: 1min 18s


Using a [dask custom aggregation](https://docs.dask.org/en/latest/dataframe-api.html?highlight=dropna#dask.dataframe.groupby.Aggregation) is considerably better

In [48]:
# Dask
# some preparations
import itertools
custom_agg = dask.dataframe.Aggregation(
    'custom_agg', 
    lambda s: s.apply(set), 
    lambda s: s.apply(lambda chunks: list(set(itertools.chain.from_iterable(chunks)))),
)

In [49]:
%%time
# Dask option 2 using the custom aggregation
df_gb = ddf.groupby(ddf.name)
gp_col = ['ID', 'seconds']
list_ser_gb = [df_gb[att_col_gr].agg(custom_agg) for att_col_gr in gp_col]
df_edge_att = df_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')        
df_edge_att.head()

Wall time: 8.85 s


## Consider using Persist
Since Dask is lazy, it may run the **entire** graph/DAG again even if it has already run part of the calculation in a previous cell. Use [persist](https://docs.dask.org/en/latest/dataframe-best-practices.html?highlight=parquet#persist-intelligently) to keep the results in memory:
```python
ddf = client.persist(ddf)
```
This is different from Pandas, which keeps all the data in memory once a variable has been created.  
Additional information can be found in this [Stack Overflow answer](https://stackoverflow.com/questions/45941528/how-to-efficiently-send-a-large-numpy-array-to-the-cluster-with-dask-array/45941529#45941529) or see an example in [this post](http://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes)   
This concept also applies when running code in a script (rather than a Jupyter notebook) that contains loops.

## Debugging
Debugging may be more challenging since:
1. when using a client, the code runs across multiple processes, which complicates debugging
2. sometimes introducing a faulty command into a graph (such as in a Jupyter notebook) requires discarding the graph and starting the process from the beginning