# Gotchas from Pandas to Dask

This notebook highlights some key differences to watch for when transferring code from Pandas to a Dask environment.

## Start Dask Client for Dashboard

Starting the Dask Client is optional.  It will provide a dashboard which 
is useful for gaining insight into the computation.  

The link to the dashboard will become visible when you create the client below.  We recommend having it open on one side of your screen while using your notebook on the other side.  Arranging your windows can take some effort, but seeing them both at the same time is very useful when learning.

In [1]:
from dask.distributed import Client
# client = Client(n_workers=1, threads_per_worker=4, processes=False, memory_limit='2GB')
client = Client()
client

Port 8787 is already in use. 
Perhaps you already have a cluster running?
Hosting the diagnostics dashboard on a random port instead.


Client: Scheduler: tcp://127.0.0.1:33855, Dashboard: http://127.0.0.1:44485/status
Cluster: Workers: 4, Cores: 8, Memory: 67.44 GB


## Create 2 DataFrames
1. for Dask 
2. for Pandas  

Dask comes with built-in sample datasets.  
To create a Pandas dataframe from a Dask dataframe, all that is needed is to call `compute()`.

In [2]:
import dask
import pandas as pd
print(f'Dask version: {dask.__version__}')
print(f'Pandas version: {pd.__version__}')

Dask version: 1.1.5
Pandas version: 0.24.2


In [3]:
ddf = dask.datasets.timeseries()
pdf = ddf.compute()  # create a pandas dataframe
pdf.head()

timestamp,id,name,x,y
2000-01-01 00:00:00,1063,Sarah,0.907688,-0.048602
2000-01-01 00:00:01,988,Zelda,0.096239,0.302358
2000-01-01 00:00:02,957,Tim,0.699176,-0.026133
2000-01-01 00:00:03,1002,Dan,0.985693,-0.255042
2000-01-01 00:00:04,996,George,0.100571,-0.400752


In [4]:
ddf

npartitions=30,id,name,x,y
2000-01-01,int64,object,float64,float64
2000-01-02,...,...,...,...
...,...,...,...,...
2000-01-30,...,...,...,...
2000-01-31,...,...,...,...


In [5]:
# Remember that Dask DataFrames are **lazy**, so to see the result we need to run compute()
# (or head(), which runs compute() under the hood)
ddf.head()

timestamp,id,name,x,y
2000-01-01 00:00:00,1063,Sarah,0.907688,-0.048602
2000-01-01 00:00:01,988,Zelda,0.096239,0.302358
2000-01-01 00:00:02,957,Tim,0.699176,-0.026133
2000-01-01 00:00:03,1002,Dan,0.985693,-0.255042
2000-01-01 00:00:04,996,George,0.100571,-0.400752


Now that we have both `dataframes` we can start to compare how we interact with them.

## 1. Conceptual shift - from Update to Insert/Delete
Dask does not update data in place, so arguments such as `inplace=True`, which exist in Pandas, are not available.  
For more details see [issue #653 on GitHub](https://github.com/dask/dask/issues/653)


### 1.1 Rename

In [6]:
# Pandas
print(pdf.columns)
pdf.rename(columns={'id':'ID','name':'Name','x':'coor_x', 'y':'coor_y'},inplace=True)
pdf.columns

Index(['id', 'name', 'x', 'y'], dtype='object')


Index(['ID', 'Name', 'coor_x', 'coor_y'], dtype='object')

In [7]:
# Dask
# Assigning to .columns is positional: the new names must follow the existing column order
print(ddf.columns)
ddf.columns = ['ID','Name','coor_x','coor_y']
ddf.columns

Index(['id', 'name', 'x', 'y'], dtype='object')


Index(['ID', 'Name', 'coor_x', 'coor_y'], dtype='object')

### 1.2 Column manipulations  
There are several differences when manipulating data, most of which require running a couple of lines (instead of a one-liner).

#### Convert index into Time column

In [8]:
# Pandas
pdf = pdf.assign(Time=pd.to_datetime(pdf.index).time)
pdf.head()

timestamp,ID,Name,coor_x,coor_y,Time
2000-01-01 00:00:00,1063,Sarah,0.907688,-0.048602,00:00:00
2000-01-01 00:00:01,988,Zelda,0.096239,0.302358,00:00:01
2000-01-01 00:00:02,957,Tim,0.699176,-0.026133,00:00:02
2000-01-01 00:00:03,1002,Dan,0.985693,-0.255042,00:00:03
2000-01-01 00:00:04,996,George,0.100571,-0.400752,00:00:04


In [9]:
# Dask
ddf = ddf.assign(Time=ddf.index)
ddf['Time'] = ddf['Time'].dt.time
ddf.head()



timestamp,ID,Name,coor_x,coor_y,Time
2000-01-01 00:00:00,1063,Sarah,0.907688,-0.048602,00:00:00
2000-01-01 00:00:01,988,Zelda,0.096239,0.302358,00:00:01
2000-01-01 00:00:02,957,Tim,0.699176,-0.026133,00:00:02
2000-01-01 00:00:03,1002,Dan,0.985693,-0.255042,00:00:03
2000-01-01 00:00:04,996,George,0.100571,-0.400752,00:00:04


### 1.3 Drop NA on column

In [10]:
# Pandas
pdf = pdf.assign(colna = None)
print(pdf.head())
pdf.dropna(axis=1, how='all', inplace=True)
print(pdf.head())

                       ID    Name    coor_x    coor_y      Time colna
timestamp                                                            
2000-01-01 00:00:00  1063   Sarah  0.907688 -0.048602  00:00:00  None
2000-01-01 00:00:01   988   Zelda  0.096239  0.302358  00:00:01  None
2000-01-01 00:00:02   957     Tim  0.699176 -0.026133  00:00:02  None
2000-01-01 00:00:03  1002     Dan  0.985693 -0.255042  00:00:03  None
2000-01-01 00:00:04   996  George  0.100571 -0.400752  00:00:04  None
                       ID    Name    coor_x    coor_y      Time
timestamp                                                      
2000-01-01 00:00:00  1063   Sarah  0.907688 -0.048602  00:00:00
2000-01-01 00:00:01   988   Zelda  0.096239  0.302358  00:00:01
2000-01-01 00:00:02   957     Tim  0.699176 -0.026133  00:00:02
2000-01-01 00:00:03  1002     Dan  0.985693 -0.255042  00:00:03
2000-01-01 00:00:04   996  George  0.100571 -0.400752  00:00:04


In [11]:
# Dask
ddf = ddf.assign(colna = None)
print(ddf.head())
if ddf.colna.isnull().all().compute():   # True if every value in the column is null - VERY slow (runs a full computation)
    ddf = ddf.drop(labels=['colna'],axis=1)
print(ddf.head())

                       ID    Name    coor_x    coor_y      Time colna
timestamp                                                            
2000-01-01 00:00:00  1063   Sarah  0.907688 -0.048602  00:00:00  None
2000-01-01 00:00:01   988   Zelda  0.096239  0.302358  00:00:01  None
2000-01-01 00:00:02   957     Tim  0.699176 -0.026133  00:00:02  None
2000-01-01 00:00:03  1002     Dan  0.985693 -0.255042  00:00:03  None
2000-01-01 00:00:04   996  George  0.100571 -0.400752  00:00:04  None
                       ID    Name    coor_x    coor_y      Time
timestamp                                                      
2000-01-01 00:00:00  1063   Sarah  0.907688 -0.048602  00:00:00
2000-01-01 00:00:01   988   Zelda  0.096239  0.302358  00:00:01
2000-01-01 00:00:02   957     Tim  0.699176 -0.026133  00:00:02
2000-01-01 00:00:03  1002     Dan  0.985693 -0.255042  00:00:03
2000-01-01 00:00:04   996  George  0.100571 -0.400752  00:00:04


### 1.4 Reset Index

In [12]:
# Pandas
pdf.reset_index(drop=True, inplace=True)
pdf.head()

,ID,Name,coor_x,coor_y,Time
0,1063,Sarah,0.907688,-0.048602,00:00:00
1,988,Zelda,0.096239,0.302358,00:00:01
2,957,Tim,0.699176,-0.026133,00:00:02
3,1002,Dan,0.985693,-0.255042,00:00:03
4,996,George,0.100571,-0.400752,00:00:04


Dask is under active development, so bugs are being fixed all the time,  
e.g. [reset_index fails when index is named](https://github.com/dask/dask/pull/4509)

In [13]:
# The reset_index call below currently fails unless the dataframe is re-created first
ddf = dask.datasets.timeseries()

In [14]:
# Dask
ddf.index.name = None   # workaround
ddf = ddf.reset_index()
ddf['Time'] = ddf['timestamp'].dt.time
ddf = ddf.drop(labels=['timestamp'], axis=1 )
ddf.columns = ['ID','Name','coor_x','coor_y','Time']
ddf.head()

,ID,Name,coor_x,coor_y,Time
0,952,Frank,-0.711713,-0.108789,00:00:00
1,986,Bob,-0.637641,0.260295,00:00:01
2,1040,Michael,-0.506356,-0.851453,00:00:02
3,919,Alice,0.537937,0.186887,00:00:03
4,998,Dan,0.842193,-0.99133,00:00:04


## 2. Read/Save files

When working with Pandas and Dask, prefer Parquet files where possible.  
Even so, when working with Dask, the files can be read by multiple workers.

### Save files

In [15]:
# Pandas
!mkdir -p ~/tmp/pd2dd/
pdf.to_csv('~/tmp/pd2dd/pdf_single_file.csv')
!ls ~/tmp/pd2dd/

ddf00.csv  ddf06.csv  ddf12.csv  ddf18.csv  ddf24.csv  pdf_single_file.csv
ddf01.csv  ddf07.csv  ddf13.csv  ddf19.csv  ddf25.csv
ddf02.csv  ddf08.csv  ddf14.csv  ddf20.csv  ddf26.csv
ddf03.csv  ddf09.csv  ddf15.csv  ddf21.csv  ddf27.csv
ddf04.csv  ddf10.csv  ddf16.csv  ddf22.csv  ddf28.csv
ddf05.csv  ddf11.csv  ddf17.csv  ddf23.csv  ddf29.csv


In [16]:
# Dask
# 1. notice the '*': Dask writes one file per partition and replaces '*' with the partition number. All kwargs are applicable
# 2. notice that the path to the directory may change based on the location of the running notebook
ddf.to_csv('../../../tmp/pd2dd/ddf*.csv', index = False)
!ls ~/tmp/pd2dd/
# to find the number of partitions use ddf.npartitions

ddf00.csv  ddf06.csv  ddf12.csv  ddf18.csv  ddf24.csv  pdf_single_file.csv
ddf01.csv  ddf07.csv  ddf13.csv  ddf19.csv  ddf25.csv
ddf02.csv  ddf08.csv  ddf14.csv  ddf20.csv  ddf26.csv
ddf03.csv  ddf09.csv  ddf15.csv  ddf21.csv  ddf27.csv
ddf04.csv  ddf10.csv  ddf16.csv  ddf22.csv  ddf28.csv
ddf05.csv  ddf11.csv  ddf17.csv  ddf23.csv  ddf29.csv


### Read files

In [29]:
# Dask
ddf = dask.dataframe.read_csv('../../../tmp/pd2dd/ddf*.csv')
ddf.head()

,ID,Name,coor_x,coor_y,Time
0,952,Frank,-0.711713,-0.108789,00:00:00
1,986,Bob,-0.637641,0.260295,00:00:01
2,1040,Michael,-0.506356,-0.851453,00:00:02
3,919,Alice,0.537937,0.186887,00:00:03
4,998,Dan,0.842193,-0.99133,00:00:04
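
For comparison, Pandas `read_csv` reads one file per call, so loading the same set of files means looping over the paths and concatenating the results. A sketch under the assumption that all files fit in memory (the temporary directory and file names are illustrative):

```python
import glob
import os
import tempfile

import pandas as pd

# Write three small hypothetical CSV files to read back
out_dir = tempfile.mkdtemp()
for i in range(3):
    part = pd.DataFrame({"ID": [i, i + 10]})
    part.to_csv(os.path.join(out_dir, f"part{i}.csv"), index=False)

# Read each file, then concatenate everything into one in-memory frame
paths = sorted(glob.glob(os.path.join(out_dir, "part*.csv")))
combined = pd.concat((pd.read_csv(p) for p in paths), ignore_index=True)

print(len(paths), "files,", len(combined), "rows")
```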


## 3. Group By
In addition to the notebook example in the repository, this is another example of how to avoid using `groupby.apply`.  
In this example we group by one column and collect the values of other columns into lists of unique values.

In [18]:
# prepare pandas dataframe
# 'Time' already exists from section 1.2. Do not recompute it from pdf.index here:
# after reset_index the index is a RangeIndex, so every row would get 00:00:00.
pdf['seconds'] = pdf.Time.astype(str).str[-2:]
pdf.head()

,ID,Name,coor_x,coor_y,Time,seconds
0,1063,Sarah,0.907688,-0.048602,00:00:00,0
1,988,Zelda,0.096239,0.302358,00:00:01,1
2,957,Tim,0.699176,-0.026133,00:00:02,2
3,1002,Dan,0.985693,-0.255042,00:00:03,3
4,996,George,0.100571,-0.400752,00:00:04,4


In [19]:
# pandas preparations
def set_list_att(x: pd.Series):
    # return the unique values of the series as a list
    return list(set(x.values))

In [23]:
%%timeit
# pandas option 1 using apply
pdf_gb = pdf.groupby(pdf.Name)
gp_col = ['ID', 'seconds']
list_ser_gb = [pdf_gb[att_col_gr].apply(set_list_att) for att_col_gr in gp_col]
df_edge_att = pdf_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')        
df_edge_att.head()

1.25 s ± 8.63 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [25]:
%%timeit
# pandas option 2 using lambda
pdf_gb = pdf.groupby(pdf.Name)
gp_col = ['ID', 'seconds']
list_ser_gb = [pdf_gb[att_col_gr].apply(lambda x: list(set(x.to_list()))) for att_col_gr in gp_col]
df_edge_att = pdf_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')        
df_edge_att.head()

901 ms ± 4.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Sometimes Pandas is simply more efficient (assuming that all the data fits into RAM).  
In this case Pandas is faster.

In [30]:
# prepare dask dataframe
ddf['seconds'] = ddf.Time.astype(str).str[-2:]
ddf = client.persist(ddf)
ddf.head()

,ID,Name,coor_x,coor_y,Time,seconds
0,952,Frank,-0.711713,-0.108789,00:00:00,0
1,986,Bob,-0.637641,0.260295,00:00:01,1
2,1040,Michael,-0.506356,-0.851453,00:00:02,2
3,919,Alice,0.537937,0.186887,00:00:03,3
4,998,Dan,0.842193,-0.99133,00:00:04,4


In [34]:
%%timeit
# Dask option1 using apply
# notice the meta argument in the apply function
df_gb = ddf.groupby(ddf.Name)
gp_col = ['ID', 'seconds']
list_ser_gb = [df_gb[att_col_gr].apply(set_list_att
                                      ,meta=pd.Series(dtype='object', name=f'{att_col_gr}_att')) 
               for att_col_gr in gp_col]
df_edge_att = df_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')        
df_edge_att.head()

5.49 s ± 103 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Using a [Dask custom aggregation](https://docs.dask.org/en/latest/dataframe-api.html?highlight=dropna#dask.dataframe.groupby.Aggregation) is considerably better

In [35]:
# Dask
# some preparations
import itertools
custom_agg = dask.dataframe.Aggregation(
    'custom_agg', 
    lambda s: s.apply(set), 
    lambda s: s.apply(lambda chunks: list(set(itertools.chain.from_iterable(chunks)))),
)

In [37]:
%%timeit
# Dask option 2 using custom aggregation
df_gb = ddf.groupby(ddf.Name)
gp_col = ['ID', 'seconds']
list_ser_gb = [df_gb[att_col_gr].agg(custom_agg) for att_col_gr in gp_col]
df_edge_att = df_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')        
df_edge_att.head()

1.2 s ± 64.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
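
The custom aggregation works in two stages: a per-partition step that builds a set for each group, and a merge step that chains those sets and de-duplicates again. The following plain-Python sketch mimics the two stages on hypothetical data (two hand-written "partitions"); it illustrates the idea only and is not how Dask executes the aggregation internally. The result is sorted here purely to make the output deterministic:

```python
import itertools
from collections import defaultdict

# Two hypothetical partitions of (Name, ID) rows
partitions = [
    [("Alice", 1), ("Bob", 2), ("Alice", 1)],
    [("Alice", 3), ("Bob", 2)],
]


def chunk(rows):
    """Per-partition step: collect the set of IDs seen for each name."""
    groups = defaultdict(set)
    for name, value in rows:
        groups[name].add(value)
    return groups


def agg(chunk_results):
    """Merge step: chain the per-partition sets and de-duplicate again."""
    names = set(itertools.chain.from_iterable(chunk_results))
    merged = {}
    for name in names:
        sets_for_name = (res[name] for res in chunk_results if name in res)
        merged[name] = sorted(set(itertools.chain.from_iterable(sets_for_name)))
    return merged


unique_ids = agg([chunk(p) for p in partitions])
print(unique_ids["Alice"], unique_ids["Bob"])
```

Because each partition is reduced to small sets before anything is combined, far less data has to move between workers than with `groupby.apply`.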


## 4. Consider using Persist
Since Dask is lazy, it may run the **entire** graph again to generate a result, even if part of it already ran in a previous cell.  
```python
ddf = client.persist(ddf)
```
This is different from Pandas, which keeps all the data in memory once a variable has been created.  
Additional information can be found in this [Stack Overflow answer](https://stackoverflow.com/questions/45941528/how-to-efficiently-send-a-large-numpy-array-to-the-cluster-with-dask-array/45941529#45941529) or see an example in [this post](http://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes).   
The same concept applies when running code in a script (rather than a Jupyter notebook) that contains loop logic.
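
The effect of persisting can be observed by counting how often a function really runs. This sketch uses `dask.delayed` on a hypothetical function (`slow_square` and the `calls` list are illustrative) instead of a DataFrame, and the single-threaded scheduler so that the count is easy to follow:

```python
import dask

calls = []  # records every real execution of the function


def slow_square(x):
    calls.append(x)
    return x * x


lazy = dask.delayed(slow_square)(4)

# Without persist, each compute() re-runs the graph
lazy.compute(scheduler="synchronous")
lazy.compute(scheduler="synchronous")
runs_without_persist = len(calls)

# persist() runs the graph once and keeps the result
persisted = lazy.persist(scheduler="synchronous")
persisted.compute(scheduler="synchronous")
persisted.compute(scheduler="synchronous")
runs_with_persist = len(calls) - runs_without_persist

print(runs_without_persist, runs_with_persist)
```

The same reasoning applies to a loop in a script: persisting the shared input before the loop avoids recomputing it on every iteration.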

## 5. Debugging
Debugging may be more challenging since:
1. when using a client, multiprocessing complicates things
2. sometimes introducing a faulty command into a graph (such as in a Jupyter notebook) requires clearing out the cached graph and starting the process from the beginning
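
One option that can help with the first point is to run the computation on Dask's single-threaded scheduler, so that exceptions are raised in the calling process and ordinary tools such as `pdb` can be used. A minimal sketch with a deliberately faulty, hypothetical function (`faulty_ratio` is not part of this notebook):

```python
import dask


def faulty_ratio(x):
    return 1 / x  # fails for x == 0


lazy = dask.delayed(faulty_ratio)(0)

caught = None
try:
    # The synchronous scheduler runs tasks in the current thread
    lazy.compute(scheduler="synchronous")
except ZeroDivisionError as exc:
    caught = exc

print(type(caught).__name__)
```

Once the faulty step has been found and fixed, the computation can be switched back to the client.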