# Gotchas from Pandas to Dask

This notebook highlights some key differences when transferring code from Pandas to run in a Dask environment.  
Most issues have a link to the [Dask documentation](https://docs.dask.org/en/latest/) for additional information.

## Start Dask Client for Dashboard

Starting the Dask Client is optional. In this example we are running on a `LocalCluster`, which also provides a dashboard that is useful for gaining insight into the computation.  
For additional information, see the [Dask Client documentation](https://docs.dask.org/en/latest/setup.html?highlight=client#setup)  

The link to the dashboard will become visible when you create a client (as shown below).  
When running in `Jupyter Lab`, an [extension](https://github.com/dask/dask-labextension) can be installed to view the various dashboard widgets. 

In [1]:
from dask.distributed import Client
# client = Client(n_workers=1, threads_per_worker=4, processes=False, memory_limit='2GB')
client = Client()
client

Client: Scheduler: tcp://127.0.0.1:33223, Dashboard: http://127.0.0.1:8787/status  
Cluster: Workers: 4, Cores: 8, Memory: 67.44 GB


See the [documentation for additional cluster configuration](http://distributed.dask.org/en/latest/local-cluster.html)

In [30]:
# Dask is under active development; this example was run with the versions below
import dask
import pandas as pd
print(f'Dask version: {dask.__version__}')
print(f'Pandas version: {pd.__version__}')

Dask version: 1.2.0
Pandas version: 0.24.2


# Create 2 DataFrames for comparison
1. for Dask
2. for Pandas

Dask comes with built-in sample datasets; we will use the `timeseries` sample for our example. 

In [3]:
ddf = dask.datasets.timeseries()
ddf

npartitions=30,id,name,x,y
2000-01-01,int64,object,float64,float64
2000-01-02,...,...,...,...
...,...,...,...,...
2000-01-30,...,...,...,...
2000-01-31,...,...,...,...


* Remember that `Dask` is **lazy**, so in order to see the result we need to run [compute()](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.compute)
 (or `head()`, which runs `compute()` under the hood).

In [5]:
ddf.head(2)

timestamp,id,name,x,y
2000-01-01 00:00:00,962,Yvonne,-0.441537,0.481559
2000-01-01 00:00:01,989,George,-0.944781,-0.52412


In order to create a `Pandas` dataframe we can call the `compute()` method of a `Dask dataframe`.

In [6]:
pdf = ddf.compute()  
print(type(pdf))
pdf.head(2)

<class 'pandas.core.frame.DataFrame'>


timestamp,id,name,x,y
2000-01-01 00:00:00,962,Yvonne,-0.441537,0.481559
2000-01-01 00:00:01,989,George,-0.944781,-0.52412


## Creating a `Dask dataframe` from Pandas
In order to utilize `Dask` capabilities on an existing `Pandas dataframe` (pdf) we need to convert the `Pandas dataframe` into a `Dask dataframe` (ddf) with the [from_pandas](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.from_pandas) method. 
You must supply either the number of partitions (`npartitions`) or the `chunksize` that will be used to generate the Dask dataframe.

In [32]:
ddf2 = dask.dataframe.from_pandas(pdf, npartitions=10)
ddf2

npartitions=10,ID,name,x,y,initials
2000-01-01 00:00:00,int64,object,float64,float64,object
2000-01-04 00:00:00,...,...,...,...,...
...,...,...,...,...,...
2000-01-28 00:00:00,...,...,...,...,...
2000-01-30 23:59:59,...,...,...,...,...


## Partitions in Dask Dataframes

Notice that when we created a `Dask dataframe` we needed to supply an argument of `npartitions`.  
The number of partitions determines how `Dask` parallelizes the computation.  
Each partition is a *separate* dataframe. For additional information see [partition documentation](https://docs.dask.org/en/latest/dataframe-design.html?highlight=meta%20utils#partitions)  

An example of this can be seen when examining the `reset_index()` method:

In [36]:
pdf2 = pdf.reset_index()
# Only 1 row
pdf2.loc[0]

timestamp    2000-01-01 00:00:00
ID                           962
name                      Yvonne
x                      -0.441537
y                       0.481559
initials                      Yv
Name: 0, dtype: object

In [33]:
ddf2 = ddf2.reset_index()
# each partition has an index=0
ddf2.loc[0].compute() 

Unnamed: 0,timestamp,ID,name,x,y,initials
0,2000-01-01,962,Yvonne,-0.441537,0.481559,Yv
0,2000-01-04,1035,Wendy,-0.789589,-0.509654,We
0,2000-01-07,987,Charlie,-0.490072,0.564672,Ch
0,2000-01-10,1044,George,-0.099254,0.78241,Ge
0,2000-01-13,1013,Hannah,-0.919592,0.799971,Ha
0,2000-01-16,979,Alice,-0.316781,0.347231,Al
0,2000-01-19,1010,Victor,0.216737,0.118467,Vi
0,2000-01-22,966,Xavier,0.736792,1296.834466,Xa
0,2000-01-25,1005,Dan,-0.094537,-0.518438,Da
0,2000-01-28,1047,Oliver,-0.117828,0.082837,Ol


Now that we have a `dask` (ddf) and a `pandas` (pdf) dataframe we can start to compare the interactions with them.

## Conceptual shift - from Update to Insert/Delete
Dask does not update data in place, so arguments such as `inplace=True`, which exist in Pandas, are not available.  
For more details see [issue #653 on GitHub](https://github.com/dask/dask/issues/653)

### Rename Columns

In [9]:
# Pandas
print(pdf.columns)
pdf.rename(columns={'id':'ID'},inplace=True)
pdf.columns

Index(['id', 'name', 'x', 'y'], dtype='object')


Index(['ID', 'name', 'x', 'y'], dtype='object')

In [10]:
# Dask - Error
ddf.rename(columns={'id':'ID'}, inplace=True)
ddf.columns

TypeError: rename() got an unexpected keyword argument 'inplace'

In [11]:
# Dask
# Assign the result back instead of using inplace=True
print(ddf.columns)
ddf = ddf.rename(columns={'id':'ID'})
# or assign the full list of column names, in the correct order:
# ddf.columns = ['ID', 'name', 'x', 'y']
ddf.columns

Index(['id', 'name', 'x', 'y'], dtype='object')


Index(['ID', 'name', 'x', 'y'], dtype='object')

## Data manipulations  
There are several differences when manipulating data.  

### loc - Pandas

In [14]:
cond = (pdf['x']>0.5) &(pdf['x']<0.8)
pdf[cond].head(2)

timestamp,ID,name,x,y
2000-01-01 00:00:08,998,Ingrid,0.742488,-0.679296
2000-01-01 00:00:09,1064,Patricia,0.777471,0.632791


In [16]:
pdf.loc[cond, ['y']] = pdf['y']* 100
pdf[cond].head()

timestamp,ID,name,x,y
2000-01-01 00:00:08,998,Ingrid,0.742488,-6792.964947
2000-01-01 00:00:09,1064,Patricia,0.777471,6327.913798
2000-01-01 00:00:31,1029,Kevin,0.593245,-7367.511692
2000-01-01 00:00:45,1075,Ursula,0.599396,-299.403878
2000-01-01 00:01:04,945,Ray,0.504267,5196.196694


### Dask - use mask/where

In [17]:
cond_dask = (ddf['x']>0.5) & (ddf['x']<0.8)

In [18]:
# Error
ddf.loc[cond_dask, ['y']] = ddf['y']* 100

TypeError: '_LocIndexer' object does not support item assignment

* Remember that each dataframe partition can have duplicate indices.

In [19]:
ddf['y'] = ddf['y'].mask(cond_dask, ddf['y']* 100)
ddf[cond_dask].head()

timestamp,ID,name,x,y
2000-01-01 00:00:08,998,Ingrid,0.742488,-67.929649
2000-01-01 00:00:09,1064,Patricia,0.777471,63.279138
2000-01-01 00:00:31,1029,Kevin,0.593245,-73.675117
2000-01-01 00:00:45,1075,Ursula,0.599396,-2.994039
2000-01-01 00:01:04,945,Ray,0.504267,51.961967


## Meta
One key difference is the introduction of the `meta` argument.  
> `meta` is the prescription of the names/types of the output from the computation  
[see stack overflow answer](https://stackoverflow.com/questions/44432868/dask-dataframe-apply-meta)

Since `Dask` builds a task graph (DAG) for the computation before running it, it needs to know the output names and types of each calculation in advance.  
For additional information see [meta documentation](https://docs.dask.org/en/latest/dataframe-design.html?highlight=meta%20utils#metadata)

In [20]:
pdf.head(2)

timestamp,ID,name,x,y
2000-01-01 00:00:00,962,Yvonne,-0.441537,0.481559
2000-01-01 00:00:01,989,George,-0.944781,-0.52412


In [22]:
pdf['initials'] = pdf['name'].apply(lambda x: x[0]+x[1])
pdf.head(2)

timestamp,ID,name,x,y,initials
2000-01-01 00:00:00,962,Yvonne,-0.441537,0.481559,Yv
2000-01-01 00:00:01,989,George,-0.944781,-0.52412,Ge


In [23]:
ddf['initials'] = ddf['name'].apply(lambda x: x[0]+x[1])
ddf.head(2)

You did not provide metadata, so Dask is running your function on a small dataset to guess output types. It is possible that Dask will guess incorrectly.
To provide an explicit output types or to silence this message, please provide the `meta=` keyword, as described in the map or apply function that you are using.
  Before: .apply(func)
  After:  .apply(func, meta=('name', 'object'))



timestamp,ID,name,x,y,initials
2000-01-01 00:00:00,962,Yvonne,-0.441537,0.481559,Yv
2000-01-01 00:00:01,989,George,-0.944781,-0.52412,Ge


In [24]:
# Describe the outcome type of the calculation
meta_cal = pd.Series(dtype='object', name='initials')  # empty Series carrying only the name and dtype

In [26]:
ddf['initials'] = ddf['name'].apply(lambda x: x[0]+x[1], meta = meta_cal)
ddf.head(2)

timestamp,ID,name,x,y,initials
2000-01-01 00:00:00,962,Yvonne,-0.441537,0.481559,Yv
2000-01-01 00:00:01,989,George,-0.944781,-0.52412,Ge


In [64]:
ddf = dask.datasets.timeseries()
ddf.head(2)

timestamp,id,name,x,y
2000-01-01 00:00:00,1072,Kevin,0.684326,-0.225795
2000-01-01 00:00:01,1008,Norbert,-0.154597,-0.696777


* We can also supply a custom function to `apply` (extra arguments can be passed with `args=`)

In [27]:
def func(row):
    if (row['x']> 0):
        return row['x'] * 1000  
    else:
        return row['y'] * -1

In [41]:
# ddf['z'] = ddf.apply(func, args=('coor_x', 'coor_y'), axis=1, meta=('z', 'float'))
ddf['z'] = ddf.apply(func, axis=1, meta=('z', 'float'))

In [42]:
ddf.head(2)

timestamp,ID,name,x,y,initials,z
2000-01-01 00:00:00,962,Yvonne,-0.441537,0.481559,Yv,-0.481559
2000-01-01 00:00:01,989,George,-0.944781,-0.52412,Ge,0.52412


* We can supply a function to run on each partition using [map_partitions](https://dask.readthedocs.io/en/latest/dataframe-api.html#dask.dataframe.DataFrame.map_partitions)

In [59]:
def func2(df, col1, col2):
    z = df[col1] * df[col2]
    return z

In [70]:
ddf['z'] = ddf.map_partitions(func2, col1='x', col2='y', meta=('z', 'float'))

In [69]:
z = ddf['x'] *ddf['y']
# z.compute()

In [71]:
# ddf_ = ddf
ddf_z = ddf.map_partitions(func2, 'x', 'y',
#                            meta=pd.DataFrame({'id':'i8', 'name':'str', 'x':'f8','y':'f8', 'z':'f8'}) 
                          )

In [72]:
ddf_z.head(2)

timestamp
2000-01-01 00:00:00   -0.154517
2000-01-01 00:00:01    0.107719
Freq: S, dtype: float64

### Convert index into Time column

In [None]:
# Pandas
pdf = pdf.assign(Time=pd.to_datetime(pdf.index).time)
pdf.head()

In [None]:
# Dask
ddf = ddf.assign(Time= dask.dataframe.to_datetime(ddf.index).dt.time )
ddf.head()

In [None]:
# Dask or Pandas
ddf = ddf.assign(Time2=ddf.index)
ddf['Time2'] = ddf['Time2'].dt.time
ddf.head()

## Drop NA on column

In [None]:
# Pandas
pdf = pdf.assign(colna = None)
print(pdf.head())
pdf.dropna(axis=1, how='all', inplace=True)
print(pdf.head())

In [None]:
# Dask
ddf = ddf.assign(colna = None)
print(ddf.head())
if ddf.colna.isnull().all().compute():   # check if all values in the column are null - VERY slow
    ddf = ddf.drop(labels=['colna'],axis=1)
print(ddf.head())

##  1.4 Reset Index

In [None]:
# Pandas
pdf.reset_index(drop=True, inplace=True)
pdf.head()

Dask is under active development, so bugs are being fixed all the time,  
e.g. [reset_index fails when index is named](https://github.com/dask/dask/pull/4509)

In [None]:
# This currently fails without resetting the dataframe
ddf = dask.datasets.timeseries()

In [None]:
# Dask
ddf.index.name = None   # workaround
ddf = ddf.reset_index()
ddf['Time'] = ddf['timestamp'].dt.time
ddf = ddf.drop(labels=['timestamp'], axis=1 )
ddf.columns = ['ID','Name','coor_x','coor_y','Time']
ddf.head()

# 2. Read/Save files

When working with Pandas and Dask, prefer the Parquet format where possible.  
With Dask, the files can also be read by multiple workers in parallel. 

### Save files

In [None]:
# Pandas
!mkdir data
pdf.to_csv('data/pdf_single_file.csv')
!ls data

In [None]:
# Dask
# 1. notice the '*': it is replaced by the partition number, so each partition is written to its own file. All kwargs are applicable
# 2. notice that the path to the directory may change based on the location of the running notebook
ddf.to_csv('data/pd2dd/ddf*.csv', index = False)
!ls data/pd2dd/
# to find the number of partitions use ddf.npartitions

### Read files

based on an [answer from stack overflow](https://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe)

In [None]:
# Pandas
import glob
import os

path = r'data/'                     # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))     # os.path.join makes the concatenation OS independent
df_from_each_file = (pd.read_csv(f) for f in all_files)
concatenated_df   = pd.concat(df_from_each_file, ignore_index=True)

In [None]:
# Dask
ddf = dask.dataframe.read_csv('data/pd2dd/ddf*.csv')
ddf.head()

Most keyword arguments (`kwargs`) are available for reading and writing, e.g.  
`ddf = dask.dataframe.read_csv('data/pd2dd/ddf*.csv', compression='gzip', header=None)`  
However, `nrows` is not available.


## 3. Group By
In addition to the notebook example that is in the repository, this is another example of how to avoid using `groupby.apply`.  
In this example we group by a column and collect the values of other columns into unique lists.

In [None]:
# prepare pandas dataframe
pdf = pdf.assign(Time=pd.to_datetime(pdf.index).time)
pdf['seconds'] = pdf.Time.astype(str).str[-2:]
pdf.head()

In [None]:
# pandas preparations
def set_list_att(x: pd.Series):
    # return the unique values of the group as a list
    return list(set(x.values))

In [None]:
%%timeit
# pandas option 1 using apply
pdf_gb = pdf.groupby(pdf.Name)
gp_col = ['ID', 'seconds']
list_ser_gb = [pdf_gb[att_col_gr].apply(set_list_att) for att_col_gr in gp_col]
df_edge_att = pdf_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')        
df_edge_att.head()

In [None]:
%%timeit
# pandas option 2 using lambda
pdf_gb = pdf.groupby(pdf.Name)
gp_col = ['ID', 'seconds']
list_ser_gb = [pdf_gb[att_col_gr].apply(lambda x: list(set(x.to_list()))) for att_col_gr in gp_col]
df_edge_att = pdf_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')        
df_edge_att.head()

Sometimes using Pandas is more efficient (assuming that you can load all the data into RAM).  
In this case Pandas is faster.

In [None]:
# prepare dask dataframe
ddf['seconds'] = ddf.Time.astype(str).str[-2:]
ddf = client.persist(ddf)
ddf.head()

In [None]:
%%timeit
# Dask option 1 using apply
# notice the meta argument in the apply function
df_gb = ddf.groupby(ddf.Name)
gp_col = ['ID', 'seconds']
list_ser_gb = [df_gb[att_col_gr].apply(set_list_att
                                      ,meta=pd.Series(dtype='object', name=f'{att_col_gr}_att')) 
               for att_col_gr in gp_col]
df_edge_att = df_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')        
df_edge_att.head()

Using a [Dask custom aggregation](https://docs.dask.org/en/latest/dataframe-api.html?highlight=dropna#dask.dataframe.groupby.Aggregation) is considerably better.

In [None]:
# Dask
# some preparations
import itertools
custom_agg = dask.dataframe.Aggregation(
    'custom_agg', 
    lambda s: s.apply(set), 
    lambda s: s.apply(lambda chunks: list(set(itertools.chain.from_iterable(chunks)))),
)

In [None]:
%%timeit
# Dask option 2 using a custom aggregation
df_gb = ddf.groupby(ddf.Name)
gp_col = ['ID', 'seconds']
list_ser_gb = [df_gb[att_col_gr].agg(custom_agg) for att_col_gr in gp_col]
df_edge_att = df_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')        
df_edge_att.head()

## 4. Consider using Persist
Since Dask is lazy, it may run the **entire** graph (again), even if it already ran part of it in order to generate a result in a previous cell.  
```python
ddf = client.persist(ddf)
```
This is different from Pandas, which keeps all data in memory once a variable has been created.  
Additional information can be found in this [stackoverflow issue](https://stackoverflow.com/questions/45941528/how-to-efficiently-send-a-large-numpy-array-to-the-cluster-with-dask-array/45941529#45941529) or see an example in [this post](http://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes)   
This concept should also be used when running code within a script (rather than a Jupyter notebook) that incorporates loop logic.

## 5. Debugging
Debugging may be more challenging since:
1. when using a client, multiprocessing makes things more complicated
2. sometimes introducing a faulty command into a graph (such as in a Jupyter notebook) requires clearing the cached graph and starting the process from the beginning