# Parallel Pandas
This notebook creates an example dataset, then shows several ways to parallelize operations over it.

The first test converts a date string such as `'9/17/2021'` into a `datetime` object, then shifts it forward by 14 days.

The second test involves squaring a float.
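
As a quick illustration of what the two tests compute for a single value, here is a minimal sketch. The names `shift_date` and `square` are hypothetical helpers used only for this example; the notebook defines its own `proc_date` further down.

```python
from datetime import datetime, timedelta

def shift_date(date_str, days=14):
    """Parse an 'M/D/YYYY' string and shift it forward by `days` days."""
    return datetime.strptime(date_str, "%m/%d/%Y") + timedelta(days=days)

def square(x):
    """Square a single float."""
    return x ** 2

shifted = shift_date("9/17/2021")
print(shifted)        # 2021-10-01 00:00:00
print(square(0.5))    # 0.25
```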

## Generate Example Data

In [1]:
import pandas as pd
import numpy as np

In [2]:
days = np.arange(1,29,1)
months = np.arange(1,13,1)
years = np.arange(1970,2023,1)

In [3]:
def gen_date():
    return f"{np.random.choice(months)}/{np.random.choice(days)}/{np.random.choice(years)}"

In [4]:
dates = [gen_date() for _ in range(1000)]

In [5]:
df = pd.DataFrame(np.random.rand(1000,2), columns=['foo','bar'])

In [6]:
df['date'] = dates

In [7]:
# Stack 1,000 copies of the 1,000-row frame to get 1,000,000 rows.
df = pd.concat([df] * 1000, ignore_index=True)

## Example Parallelization

In [8]:
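from datetime import datetime, timedelta
import multiprocess as mp  # fork of multiprocessing that serializes with dill
import swifter  # registers the .swifter accessor on pandas objects

cores = max(1, mp.cpu_count() - 2)  # worker processes: all cores but two, at least one
partitions = cores  # number of chunks to split the DataFrame into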

In [9]:
df.shape

(1000000, 3)

In [10]:
def proc_date(date):
    # Parse an 'M/D/YYYY' string and shift it forward by 14 days.
    return datetime.strptime(date, '%m/%d/%Y') + timedelta(days=14)

def proc_df(df):
    df['date_shift'] = df['date'].apply(proc_date)
    return df

def proc_df_swifter(df):
    df['date_shift'] = df['date'].swifter.apply(proc_date)
    return df

In [11]:
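def parallelize(data, func):
    # Split the frame into row-wise chunks, one per partition.
    data_split = np.array_split(data, partitions)
    # Apply func to each chunk in a separate worker process, then reassemble.
    with mp.Pool(cores) as pool:
        data = pd.concat(pool.map(func, data_split))
    return data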
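
The split, map, and concatenate pattern used by `parallelize` can be sketched on its own, without a process pool. This is an illustrative variant, not the notebook's function: `split_apply_concat`, `add_double`, and the `map_fn` parameter are hypothetical names, and the sketch splits positional row indices instead of passing the DataFrame to `np.array_split`.

```python
import numpy as np
import pandas as pd

def split_apply_concat(data, func, n_chunks, map_fn=map):
    """Split `data` row-wise, apply `func` to each chunk, and reassemble."""
    # Split positional row indices rather than the DataFrame itself.
    index_chunks = np.array_split(np.arange(len(data)), n_chunks)
    # Copy each chunk so that func cannot modify the original frame.
    chunks = [data.iloc[idx].copy() for idx in index_chunks]
    return pd.concat(list(map_fn(func, chunks)))

def add_double(chunk):
    """Example per-chunk function: add a column with twice the value of 'foo'."""
    chunk["double"] = chunk["foo"] * 2
    return chunk

demo = pd.DataFrame({"foo": np.arange(10, dtype=float)})
result = split_apply_concat(demo, add_double, n_chunks=3)
print(result)
```

With the default `map_fn=map` the chunks are processed serially, which is convenient for checking that the chunking is correct. Passing a pool's `map` method as `map_fn` should distribute the chunks across workers, assuming `func` can be serialized by the pool in use.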

Pandas `apply`

In [12]:
%time df = proc_df(df)

CPU times: user 7.52 s, sys: 70.7 ms, total: 7.6 s
Wall time: 7.65 s


Pandas `apply` with multiple cores

In [13]:
%time df = parallelize(df, proc_df)

CPU times: user 1.43 s, sys: 150 ms, total: 1.58 s
Wall time: 2.22 s


`swifter` for string processing

In [14]:
%time df = proc_df_swifter(df)

Pandas Apply:   0%|          | 0/1000000 [00:00<?, ?it/s]

CPU times: user 9.89 s, sys: 232 ms, total: 10.1 s
Wall time: 10.1 s


### Vectorizing operations

`iterrows` with an explicit loop, run on only the first 100K rows rather than the full 1M because it is so slow

In [15]:
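def loop(df):
    for kk, vv in df.iterrows():
        # vv is a per-row Series; assigning to it does not write back to df,
        # so this measures only the cost of iterating and squaring.
        vv['foo2'] = vv['foo']**2

%time loop(df[:100000])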

CPU times: user 40.5 s, sys: 279 ms, total: 40.8 s
Wall time: 41.1 s
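
Besides being slow, writing to the row yielded by `iterrows` does not add a column to the DataFrame. The sketch below shows this on a small hypothetical frame (`demo` and its `label` column are made up for illustration), then collects the squared values and assigns them in a single step.

```python
import pandas as pd

demo = pd.DataFrame({"foo": [1.0, 2.0, 3.0], "label": ["a", "b", "c"]})

# Assigning to the row Series only changes that Series, not the DataFrame.
for idx, row in demo.iterrows():
    row["foo2"] = row["foo"] ** 2
lost = "foo2" not in demo.columns
print(lost)  # True: the DataFrame never received a 'foo2' column

# To keep the results, collect them and assign the whole column at once.
demo["foo2"] = [row["foo"] ** 2 for _, row in demo.iterrows()]
print(demo)
```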


Pandas `apply`

In [16]:
%time df['foo2'] = df.foo.apply(lambda x: x**2)

CPU times: user 224 ms, sys: 17.8 ms, total: 241 ms
Wall time: 241 ms


`swifter` numerical operation

In [17]:
%time df['foo2'] = df['foo'].swifter.apply(lambda x: x**2)

Pandas Apply:   0%|          | 0/1000000 [00:00<?, ?it/s]

CPU times: user 1.42 s, sys: 45.5 ms, total: 1.47 s
Wall time: 1.46 s


Pandas vectorized operation

In [18]:
%time df['foo2'] = df['foo']**2

CPU times: user 6.2 ms, sys: 4.12 ms, total: 10.3 ms
Wall time: 6.9 ms
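
The date test from the first part can also be written as a vectorized operation. The sketch below assumes every date string follows the `M/D/YYYY` format used in this notebook and runs on a small hypothetical frame. It was not timed as part of the notebook, so no timing is given for it.

```python
import pandas as pd

demo = pd.DataFrame({"date": ["9/17/2021", "1/28/1970", "12/5/2022"]})

# Parse the whole column at once, then shift every value by 14 days.
parsed = pd.to_datetime(demo["date"], format="%m/%d/%Y")
demo["date_shift"] = parsed + pd.Timedelta(days=14)
print(demo)
```

Unlike `proc_date`, which returns Python `datetime` objects, this yields a `datetime64` column of pandas `Timestamp` values.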
