# Gotchas from Pandas to Dask

This notebook highlights some key differences when transferring code from Pandas to run in a Dask environment.

## Start Dask Client for Dashboard

Starting the Dask Client is optional. In this example we are running on a `LocalCluster`, which provides a dashboard that is useful for gaining insight into the computation.

The link to the dashboard will become visible when you create a client (as shown below). When running in `Jupyter Lab`, an [extension](https://github.com/dask/dask-labextension) can be installed to view the various dashboard widgets.

In [1]:
from dask.distributed import Client
# client = Client(n_workers=1, threads_per_worker=4, processes=False, memory_limit='2GB')
client = Client()
client

Client: Scheduler: tcp://127.0.0.1:63296, Dashboard: http://127.0.0.1:8787/status  
Cluster: Workers: 4, Cores: 4, Memory: 8.50 GB


See [documentation for additional cluster configuration](http://distributed.dask.org/en/latest/local-cluster.html)

In [2]:
# Dask is actively being developed, so note the versions this example was run with
import dask
import pandas as pd
print(f'Dask version: {dask.__version__}')
print(f'Pandas version: {pd.__version__}')

Dask version: 1.2.2
Pandas version: 0.24.2


# Create 2 DataFrames for comparison: 
1. for Dask 
2. for Pandas  
Dask comes with built-in sample datasets; we will use the `timeseries` sample for our example.

In [None]:
ddf = dask.datasets.timeseries()
ddf

* Remember that Dask is **lazy**: in order to see a result we need to run [compute()](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.DataFrame.compute) (or `head()`, which calls `compute()` under the hood).

In [None]:
ddf.head()

In order to create a `Pandas` dataframe we can use the `compute()` method of a `Dask dataframe`; note that this loads the whole dataset into memory.

In [None]:
pdf = ddf.compute()  
print(type(pdf))
pdf.head()

## Creating a `Dask dataframe` from Pandas
In order to create a `Dask dataframe` (ddf) from a `Pandas dataframe` (pdf) we can use the [from_pandas](https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.from_pandas) method. 
You must supply either the number of partitions (`npartitions`) or the `chunksize` that will be used to generate the Dask dataframe.

In [None]:
ddf2 = dask.dataframe.from_pandas(pdf, npartitions=10)
ddf2

Now that we have both `dataframes` we can start to compare the interactions with them.

## Conceptual shift - from Update to Insert/Delete
Dask dataframes are not updated in place, so arguments such as `inplace=True`, which exist in Pandas, are not supported.  
For more details see [issue#653 on github](https://github.com/dask/dask/issues/653)

### Rename Columns

In [None]:
# Pandas
print(pdf.columns)
pdf.rename(columns={'id':'ID','name':'Name','x':'coor_x', 'y':'coor_y'},inplace=True)
pdf.columns

In [None]:
# Dask - Error: `inplace=True` is not supported
ddf.rename(columns={'id':'ID','name':'Name','x':'coor_x', 'y':'coor_y'}, inplace=True)
ddf.columns

In [None]:
# Dask
# Assign the result back to the variable instead of using inplace
print(ddf.columns)
ddf = ddf.rename(columns={'id':'ID','name':'Name','x':'coor_x', 'y':'coor_y'})
# or ddf.columns = ['ID','Name','coor_x','coor_y']  (must follow the existing column order)
ddf.columns

## Data manipulations  
There are several differences when manipulating data.  

### loc - Pandas

In [None]:
cond = (pdf['coor_x']>0.5) &(pdf['coor_x']<0.8)
pdf[cond].head()

In [None]:
pdf.loc[cond, ['coor_x']] = pdf['coor_x']* 10
pdf[cond].head()

### Dask - use mask/where

In [None]:
cond_dask = (ddf['coor_x']>0.5) & (ddf['coor_x']<0.8)

In [None]:
# Error - Dask does not support item assignment through loc
ddf.loc[cond_dask, ['coor_x']] = ddf['coor_x']* 10

* Remember that each dataframe partition can have duplicate indices.

In [None]:
ddf['coor_x'] = ddf['coor_x'].mask(cond_dask, ddf['coor_x']* 10)
ddf[cond_dask].head()

## Meta
One key difference is the introduction of `meta`.  
> `meta` is the prescription of the names/types of the output from the computation  
[see stack overflow answer](https://stackoverflow.com/questions/44432868/dask-dataframe-apply-meta)

Since Dask builds a DAG (task graph) for the computation before running it, it needs to know the names and types of the output of each calculation in advance.  

In [None]:
pdf.head()

In [None]:
pdf['initials'] = pdf.Name.apply(lambda x: x[0]+x[1])
pdf.head()

In [None]:
# Dask - without meta: Dask warns and tries to infer the output type
ddf['initials'] = ddf.Name.apply(lambda x: x[0]+x[1])
ddf.head()

In [None]:
# Dask - with meta describing the output (name and dtype)
ddf['initials'] = ddf.Name.apply(lambda x: x[0]+x[1], meta=pd.Series(dtype='object', name='initials'))
ddf.head()

In [5]:
ddf = dask.datasets.timeseries()
ddf.head(2)

| timestamp | id | name | x | y |
|---|---|---|---|---|
| 2000-01-01 00:00:00 | 941 | Charlie | 0.323091 | 0.758048 |
| 2000-01-01 00:00:01 | 1042 | Quinn | 0.876541 | -0.36625 |


In [None]:
def func(row):
    if (row['x']> 0):
        return row['x'] * 1000  
    else:
        return row['y'] * -1

In [None]:
# ddf['z'] = ddf.apply(func, args=('coor_x', 'coor_y'), axis=1, meta=('z', 'float'))
ddf['z'] = ddf.apply(func, axis=1, meta=('z', 'float'))

In [None]:
ddf.head()

In [None]:
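def func2(df, col1, col2):
    # row-wise variant, kept for reference:
    #     if row[col1]< 0:
    #         return row[col1] * -1
    #     else:
    #         return row[col2] * 10000
    # map_partitions passes a whole partition (a Pandas dataframe), so work on columns
    return df[col1] * df[col2]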

In [None]:
ddf['z'] = ddf.map_partitions(func2, 'x', 'y', meta=('z', 'float'))

In [None]:
ddf.head()

In [3]:
from pyproj import Proj, transform
inProj = Proj(init='epsg:3857')
outProj = Proj(init='epsg:4326')
def coor_trans(row, x1, y1):
    x2,y2 = transform(inProj,outProj,row[x1],row[y1])
    return x2,y2

In [None]:
ddf.head()

In [None]:
ddf.map_partitions?
# Pandas pattern to reproduce with Dask:
# data['LatLong'] = data.apply(lambda row: transform(Proj(init=row['EPSG']), Proj(init='epsg:4326'), row['X'], row['Y']), axis=1)

In [None]:
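# each row returns an (x2, y2) tuple, so the result is a Series of tuples and meta describes a Series
ddf.apply(lambda row: transform(Proj(init='epsg:3857'), Proj(init='epsg:4326'), row['x'], row['y']), axis=1, meta=('LatLong', 'object')).compute()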

In [None]:
# extra positional arguments for the function must be passed through args
df2 = ddf.apply(coor_trans, axis=1, args=('x', 'y'), meta=('LatLong', 'object'))

In [None]:
df2.head()

### Convert index into Time column

In [None]:
# Pandas
pdf = pdf.assign(Time=pd.to_datetime(pdf.index).time)
pdf.head()

In [None]:
# ddf = ddf.assign(Time= dask.dataframe.to_datetime(ddf.index, format='%Y-%m-%d'). )
ddf = ddf.assign(Time= dask.dataframe.to_datetime(ddf.index).dt.time )
ddf.head()

In [None]:
# Dask or Pandas
ddf = ddf.assign(Time2=ddf.index)
ddf['Time2'] = ddf['Time2'].dt.time
ddf.head()

## Drop NA on column

In [None]:
# Pandas
pdf = pdf.assign(colna = None)
print(pdf.head())
pdf.dropna(axis=1, how='all', inplace=True)
print(pdf.head())

In [None]:
# Dask
ddf = ddf.assign(colna = None)
print(ddf.head())
if ddf.colna.isnull().all().compute():   # check if all values in the column are null - VERY slow
    ddf = ddf.drop(labels=['colna'],axis=1)
print(ddf.head())

##  1.4 Reset Index

In [None]:
# Pandas
pdf.reset_index(drop=True, inplace=True)
pdf.head()

Dask is under active development, so bugs are found and fixed all the time,  
e.g. [reset_index fails when index is named](https://github.com/dask/dask/pull/4509)

In [None]:
# This currently fails without resetting the dataframe
ddf = dask.datasets.timeseries()

In [None]:
# Dask
ddf.index.name = None   # workaround
ddf = ddf.reset_index()
ddf['Time'] = ddf['timestamp'].dt.time
ddf = ddf.drop(labels=['timestamp'], axis=1 )
ddf.columns = ['ID','Name','coor_x','coor_y','Time']
ddf.head()

# 2. Read/Save files

When working with Pandas and Dask, prefer the Parquet format where possible.  
Even with other formats, Dask can read the files in parallel with multiple workers.

### Save files

In [None]:
# Pandas
!mkdir data
pdf.to_csv('data/pdf_single_file.csv')
!ls data

In [None]:
# Dask
# 1. notice the '*', which lets Dask name multiple files (one per partition). all kwargs are applicable
# 2. notice that the path to the directory may change based on the location of the running notebook
ddf.to_csv('data/pd2dd/ddf*.csv', index = False)
!ls data/pd2dd/
# to find the number of partitions use ddf.npartitions

### Read files

based on an [answer from stack overflow](https://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe)

In [None]:
# Pandas 
import glob
import os

path = r'data/'                     # use your path
all_files = glob.glob(os.path.join(path, "*.csv"))     # os.path.join keeps the path concatenation OS independent
df_from_each_file = (pd.read_csv(f) for f in all_files)
concatenated_df   = pd.concat(df_from_each_file, ignore_index=True)

In [None]:
# Dask
ddf = dask.dataframe.read_csv('data/pd2dd/ddf*.csv')
ddf.head()

Most `kwargs` are available for both reading and writing, e.g.  
`ddf = dask.dataframe.read_csv('data/pd2dd/ddf*.csv', compression='gzip', header=None)`  
However, `nrows` is not available.


## 3. Group By
In addition to the notebook example that is in the repository, this is another example of how to try to eliminate the use of `groupby.apply`.  
In this example we group by a column and collect the values of other columns into lists of unique values.

In [None]:
# prepare pandas dataframe
pdf = pdf.assign(Time=pd.to_datetime(pdf.index).time)
pdf['seconds'] = pdf.Time.astype(str).str[-2:]
pdf.head()

In [None]:
# pandas preparations
def set_list_att(x: pd.Series):
    # groupby.apply passes each group as a Pandas Series; return its unique values as a list
    return list(set(x.values))

In [None]:
%%timeit
# pandas option 1 using apply
pdf_gb = pdf.groupby(pdf.Name)
gp_col = ['ID', 'seconds']
list_ser_gb = [pdf_gb[att_col_gr].apply(set_list_att) for att_col_gr in gp_col]
df_edge_att = pdf_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')        
df_edge_att.head()

In [None]:
%%timeit
# pandas option 2 using lambda
pdf_gb = pdf.groupby(pdf.Name)
gp_col = ['ID', 'seconds']
list_ser_gb = [pdf_gb[att_col_gr].apply(lambda x: list(set(x.to_list()))) for att_col_gr in gp_col]
df_edge_att = pdf_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')        
df_edge_att.head()

In any case, sometimes using Pandas is more efficient (assuming that you can load all the data into RAM).  
In this case Pandas is faster.

In [None]:
# prepare dask dataframe
ddf['seconds'] = ddf.Time.astype(str).str[-2:]
ddf = client.persist(ddf)
ddf.head()

In [None]:
%%timeit
# Dask option1 using apply
# notice the meta argument in the apply function
df_gb = ddf.groupby(ddf.Name)
gp_col = ['ID', 'seconds']
list_ser_gb = [df_gb[att_col_gr].apply(set_list_att
                                      ,meta=pd.Series(dtype='object', name=f'{att_col_gr}_att')) 
               for att_col_gr in gp_col]
df_edge_att = df_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')        
df_edge_att.head()

Using a [dask custom aggregation](https://docs.dask.org/en/latest/dataframe-api.html?highlight=dropna#dask.dataframe.groupby.Aggregation) is considerably better.

In [None]:
# Dask
# some preparations
import itertools
custom_agg = dask.dataframe.Aggregation(
    'custom_agg', 
    lambda s: s.apply(set), 
    lambda s: s.apply(lambda chunks: list(set(itertools.chain.from_iterable(chunks)))),
)

In [None]:
%%timeit
# Dask option 2 using the custom aggregation
df_gb = ddf.groupby(ddf.Name)
gp_col = ['ID', 'seconds']
list_ser_gb = [df_gb[att_col_gr].agg(custom_agg) for att_col_gr in gp_col]
df_edge_att = df_gb.size().to_frame(name="Weight")
for ser in list_ser_gb:
        df_edge_att = df_edge_att.join(ser.to_frame(), how='left')        
df_edge_att.head()

## 4. Consider using Persist
Since Dask is lazy, it may run the **entire** graph again, even if it already ran part of it in order to generate a result in a previous cell.  
```python
ddf = client.persist(ddf)
```
This is different from Pandas, which keeps all the data in memory once a variable has been created.  
Additional information can be found in this [stackoverflow issue](https://stackoverflow.com/questions/45941528/how-to-efficiently-send-a-large-numpy-array-to-the-cluster-with-dask-array/45941529#45941529) or see an example in [this post](http://matthewrocklin.com/blog/work/2017/01/12/dask-dataframes).  
This concept should also be used when running code within a script (rather than a jupyter notebook) that incorporates loop logic.

## 5. Debugging
Debugging may be more challenging since:
1. when using a client, multiprocessing is complicated
2. sometimes introducing a faulty command into a graph (such as in a jupyter notebook) requires clearing the cached graph and starting the process from the beginning