https://realpython.com/fast-flexible-pandas/#pandas-apply

# Fast, Flexible, Easy and Intuitive: How to Speed Up Your Pandas Projects

Table of Contents

But I Heard That Pandas Is Slow…
This Tutorial
The Task at Hand
Saving Time With Datetime Data
Simple Looping Over Pandas Data
Looping with .itertuples() and .iterrows()
Pandas’ .apply()
Selecting Data With .isin()
Can We Do Better?
Don’t Forget NumPy!
Prevent Reprocessing with HDFStore
Conclusions

## Saving Time With Datetime Data

In [2]:
import pandas as pd
pd.__version__

'1.3.0'

In [3]:
pwd

'/mnt/c/Users/bkise/documents/repos/real_python'

In [55]:
df = pd.read_csv('demand_profile.csv')
df.head()

Unnamed: 0,date_time,energy_kwh
0,1/1/13 0:00,0.586
1,1/1/13 1:00,0.58
2,1/1/13 2:00,0.572
3,1/1/13 3:00,0.596
4,1/1/13 4:00,0.592


In [56]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date_time   8760 non-null   object 
 1   energy_kwh  8760 non-null   float64
dtypes: float64(1), object(1)
memory usage: 137.0+ KB


In [57]:
df.dtypes
type(df.iat[0, 0])

str

This is not ideal. object is a container for not just str, but any column that can’t neatly fit into one data type. It would be arduous and inefficient to work with dates as strings. (It would also be memory-inefficient.)

For working with time series data, you’ll want the date_time column to be formatted as an array of datetime objects. (Pandas calls this a Timestamp.) Pandas makes each step here rather simple:

In [58]:
df['date_time'] = pd.to_datetime(df['date_time'])
df['date_time'].dtype


dtype('<M8[ns]')

In [59]:
df.head()

Unnamed: 0,date_time,energy_kwh
0,2013-01-01 00:00:00,0.586
1,2013-01-01 01:00:00,0.58
2,2013-01-01 02:00:00,0.572
3,2013-01-01 03:00:00,0.596
4,2013-01-01 04:00:00,0.592


In [60]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8760 entries, 0 to 8759
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype         
---  ------      --------------  -----         
 0   date_time   8760 non-null   datetime64[ns]
 1   energy_kwh  8760 non-null   float64       
dtypes: datetime64[ns](1), float64(1)
memory usage: 137.0 KB
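To make the memory point concrete, here is a minimal sketch that compares the same timestamps stored as strings and as datetime64 values. The sample strings are made up for illustration (they only mimic the date_time column of demand_profile.csv), and the explicit format string is an assumption that the layout is month/day/year.

```python
import pandas as pd

# Hypothetical sample: one week of hourly readings, formatted like the CSV column
raw = pd.Series([f"1/{day}/13 {hour}:00" for day in range(1, 8) for hour in range(24)])

# Assumed layout: month/day/two-digit year, then hour:minute
parsed = pd.to_datetime(raw, format="%m/%d/%y %H:%M")

# deep=True counts the string payloads, not just the references to them
print(raw.dtype, raw.memory_usage(deep=True))
print(parsed.dtype, parsed.memory_usage(deep=True))

# Once parsed, datetime components are available without string handling
print(parsed.dt.hour.head(3).tolist())
```

Passing format is optional, but it removes any guessing about whether the first number is the day or the month.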


In [18]:
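import timeit

def convert(df, column_name):
    return pd.to_datetime(df[column_name])

# Read in again so that we have `object` dtype to start
df = pd.read_csv('demand_profile.csv')
# 3 trials of 10 conversions each; `convert` does not modify `df`
timings = timeit.repeat(lambda: convert(df, 'date_time'), repeat=3, number=10)
df['date_time'] = convert(df, 'date_time')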

In [23]:
#BK Using %% time

In [47]:
%%time
def convert(df, column_name):
    return pd.to_datetime(df[column_name])

# Read in again so that we have `object` dtype to start 
df['date_time'] = convert(df, 'date_time')

CPU times: user 8.28 ms, sys: 0 ns, total: 8.28 ms
Wall time: 7.75 ms
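The commented-out @timeit(repeat=3, number=100) lines in later cells refer to a timing decorator that is not defined in this notebook. As a rough sketch of what such a helper could look like, the decorator below (named timed here, a hypothetical name) wraps the standard library's timeit.repeat. The tutorial's own decorator may differ.

```python
import functools
import timeit


def timed(repeat=3, number=10):
    """Hypothetical helper: report the best average runtime of the wrapped function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # timeit.repeat returns one total (in seconds) per trial
            trials = timeit.repeat(lambda: func(*args, **kwargs),
                                   repeat=repeat, number=number)
            best = min(trials) / number
            print(f"{func.__name__}: best of {repeat} trials, "
                  f"{best:.6f} s per call ({number} calls per trial)")
            # Run once more to hand back the actual result
            return func(*args, **kwargs)
        return wrapper
    return decorator


@timed(repeat=3, number=10)
def total(values):
    return sum(values)


result = total(range(1000))
```

Because the wrapped function runs many times, this approach only suits functions that are safe to repeat, such as the tariff functions in this notebook, which overwrite the same column on every call.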


## Simple Looping Over Pandas Data

Now that your dates and times are in a convenient format, you are ready to get down to the business of calculating your electricity costs. Remember that cost varies by hour, so you will need to conditionally apply a cost factor to each hour of the day. In this example, the time-of-use costs will be defined as follows:

| Tariff Type | Cents per kWh | Time Range |
|---|---|---|
| Peak | 28 | 17:00 to 24:00 |
| Shoulder | 20 | 7:00 to 17:00 |
| Off-Peak | 12 | 0:00 to 7:00 |

If the price were a flat 28 cents per kWh for every hour of the day, most people familiar with Pandas would know that this calculation could be achieved in one line:

In [62]:
df['cost_cents'] = df['energy_kwh'] * 28

In [25]:
#First, let’s create a function to apply the appropriate tariff to a given hour:

def apply_tariff(kwh, hour):
    """Calculates cost of electricity for given hour."""    
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    elif 17 <= hour < 24:
        rate = 28
    else:
        raise ValueError(f'Invalid hour: {hour}')
    return rate * kwh
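
The band boundaries matter for every later version of this calculation, so it can help to check them once. The sketch below repeats apply_tariff() so that it runs on its own and prices one kWh at each edge. The chosen hours are illustrative only.

```python
def apply_tariff(kwh, hour):
    """Calculates cost of electricity for given hour."""
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    elif 17 <= hour < 24:
        rate = 28
    else:
        raise ValueError(f'Invalid hour: {hour}')
    return rate * kwh


# One kWh on either side of each boundary: the lower edge belongs to the new band
for hour in (0, 6, 7, 16, 17, 23):
    print(hour, apply_tariff(1.0, hour))

# Hours outside 0-23 are rejected
try:
    apply_tariff(1.0, 24)
except ValueError as err:
    print(err)
```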

In [26]:
%%time
#Here’s the loop that isn’t Pythonic, in all its glory:

# NOTE: Don't do this!

def apply_tariff_loop(df):
    """Calculate costs in loop.  Modifies `df` inplace."""
    energy_cost_list = []
    for i in range(len(df)):
        # Get electricity used and hour of day
        energy_used = df.iloc[i]['energy_kwh']
        hour = df.iloc[i]['date_time'].hour
        energy_cost = apply_tariff(energy_used, hour)
        energy_cost_list.append(energy_cost)
    df['cost_cents'] = energy_cost_list

apply_tariff_loop(df)

CPU times: user 1.88 s, sys: 5.43 ms, total: 1.89 s
Wall time: 1.9 s


In [49]:
df.head()

Unnamed: 0,date_time,energy_kwh,cost_cents
0,2013-01-01 00:00:00,0.586,7.032
1,2013-01-01 01:00:00,0.58,6.96
2,2013-01-01 02:00:00,0.572,6.864
3,2013-01-01 03:00:00,0.596,7.152
4,2013-01-01 04:00:00,0.592,7.104


## Looping with .itertuples() and .iterrows()

What other approaches can you take? Well, Pandas has actually made the for i in range(len(df)) syntax redundant by introducing the DataFrame.itertuples() and DataFrame.iterrows() methods. These are both generator methods that yield one row at a time.

.itertuples() yields a namedtuple for each row, with the row’s index value as the first element of the tuple. A namedtuple is a data structure from Python’s collections module that behaves like a Python tuple but has fields accessible by attribute lookup.

.iterrows() yields pairs (tuples) of (index, Series) for each row in the DataFrame.

While .itertuples() tends to be a bit faster, let’s stay in Pandas and use .iterrows() in this example, because some readers might not have run across namedtuple. Let’s see what this achieves:

In [29]:
%%time
#@timeit(repeat=3, number=100)
def apply_tariff_iterrows(df):
    energy_cost_list = []
    for index, row in df.iterrows():
        # Get electricity used and hour of day
        energy_used = row['energy_kwh']
        hour = row['date_time'].hour
        # Append cost list
        energy_cost = apply_tariff(energy_used, hour)
        energy_cost_list.append(energy_cost)
    df['cost_cents'] = energy_cost_list

apply_tariff_iterrows(df)

CPU times: user 421 ms, sys: 0 ns, total: 421 ms
Wall time: 419 ms
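
For comparison, here is a minimal sketch of the same calculation with .itertuples(). It is not part of the timed runs above. The function name apply_tariff_itertuples and the three sample rows are made up for illustration, and apply_tariff() is repeated so that the block runs on its own.

```python
import pandas as pd


def apply_tariff(kwh, hour):
    """Calculates cost of electricity for given hour."""
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    elif 17 <= hour < 24:
        rate = 28
    else:
        raise ValueError(f'Invalid hour: {hour}')
    return rate * kwh


def apply_tariff_itertuples(df):
    """Same logic as the .iterrows() version, but rows arrive as namedtuples."""
    energy_cost_list = []
    for row in df.itertuples(index=False):
        # Columns are read as attributes instead of by key
        energy_cost = apply_tariff(row.energy_kwh, row.date_time.hour)
        energy_cost_list.append(energy_cost)
    df['cost_cents'] = energy_cost_list


# Made-up sample rows, one per tariff band
sample = pd.DataFrame({
    'date_time': pd.to_datetime(['2013-01-01 03:00', '2013-01-01 12:00',
                                 '2013-01-01 20:00']),
    'energy_kwh': [1.0, 2.0, 0.5],
})
apply_tariff_itertuples(sample)
print(sample)
```

Attribute access only works for column names that are valid Python identifiers, which is the case for date_time and energy_kwh.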


## Pandas’ .apply()

You can further improve this operation using the .apply() method instead of .iterrows(). Pandas’ .apply() method takes functions (callables) and applies them along an axis of a DataFrame (all rows, or all columns). In this example, a lambda function will help you pass the two columns of data into apply_tariff():

In [31]:
%%time
#@timeit(repeat=3, number=100)
def apply_tariff_withapply(df):
    df['cost_cents'] = df.apply(
        lambda row: apply_tariff(
            kwh=row['energy_kwh'],
            hour=row['date_time'].hour),
        axis=1)

apply_tariff_withapply(df)

CPU times: user 125 ms, sys: 0 ns, total: 125 ms
Wall time: 123 ms
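
To see what the lambda receives, the sketch below applies a small function with axis=1 to a two-row DataFrame and records the type of each argument. The sample data and the describe() helper are hypothetical and exist only for this illustration.

```python
import pandas as pd

sample = pd.DataFrame({
    'date_time': pd.to_datetime(['2013-01-01 06:00', '2013-01-01 07:00']),
    'energy_kwh': [0.5, 0.5],
})

seen = []


def describe(row):
    # With axis=1, `row` is a Series whose index holds the column names
    seen.append(type(row).__name__)
    return f"{row['date_time'].hour:02d}h -> {row['energy_kwh']} kWh"


labels = sample.apply(describe, axis=1)
print(labels.tolist())
print(seen)
```

The function still runs once per row at Python speed, which is why .apply() with axis=1 is faster than an explicit loop but not vectorized.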


In [63]:
df.head()

Unnamed: 0,date_time,energy_kwh,cost_cents
0,2013-01-01 00:00:00,0.586,7.032
1,2013-01-01 01:00:00,0.58,6.96
2,2013-01-01 02:00:00,0.572,6.864
3,2013-01-01 03:00:00,0.596,7.152
4,2013-01-01 04:00:00,0.592,7.104


## Selecting Data With .isin()

Earlier, you saw that if there were a single electricity price, you could apply that price across all the electricity consumption data in one line of code (df['energy_kwh'] * 28). This particular operation was an example of a vectorized operation, and it is the fastest way to do things in Pandas.

But how can you apply conditional calculations as vectorized operations in Pandas? One trick is to select and group parts of the DataFrame based on your conditions and then apply a vectorized operation to each selected group.

In this next example, you will see how to select rows with Pandas’ .isin() method and then apply the appropriate tariff in a vectorized operation. Before you do this, it will make things a little more convenient if you set the date_time column as the DataFrame’s index:

In [64]:
%%time
df.set_index('date_time', inplace=True)
#@timeit(repeat=3, number=100)
def apply_tariff_isin(df):
    # Define hour range Boolean arrays
    peak_hours = df.index.hour.isin(range(17, 24))
    shoulder_hours = df.index.hour.isin(range(7, 17))
    off_peak_hours = df.index.hour.isin(range(0, 7))

    # Apply tariffs to hour ranges
    df.loc[peak_hours, 'cost_cents'] = df.loc[peak_hours, 'energy_kwh'] * 28
    df.loc[shoulder_hours,'cost_cents'] = df.loc[shoulder_hours, 'energy_kwh'] * 20
    df.loc[off_peak_hours,'cost_cents'] = df.loc[off_peak_hours, 'energy_kwh'] * 12
    
apply_tariff_isin(df)

CPU times: user 6.54 ms, sys: 993 µs, total: 7.53 ms
Wall time: 6.19 ms
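
The key ingredient here is the Boolean mask. As a small sketch on made-up data (four timestamps, 1 kWh each), this is what a single .isin() call returns and how .loc uses it:

```python
import pandas as pd

# Made-up index: four timestamps spread over one day
index = pd.to_datetime(['2013-01-01 05:00', '2013-01-01 09:00',
                        '2013-01-01 17:00', '2013-01-01 23:00'])
sample = pd.DataFrame({'energy_kwh': [1.0, 1.0, 1.0, 1.0]}, index=index)

# .isin() returns one Boolean per row: True where the hour is in the range
peak_hours = sample.index.hour.isin(range(17, 24))
print(peak_hours)

# The mask selects rows in .loc, so the multiplication touches only those rows
sample.loc[peak_hours, 'cost_cents'] = sample.loc[peak_hours, 'energy_kwh'] * 28
print(sample)
```

Rows outside the mask hold NaN in cost_cents until the shoulder and off-peak assignments fill them in.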


In [65]:
df.head(15)

date_time,energy_kwh,cost_cents
2013-01-01 00:00:00,0.586,7.032
2013-01-01 01:00:00,0.58,6.96
2013-01-01 02:00:00,0.572,6.864
2013-01-01 03:00:00,0.596,7.152
2013-01-01 04:00:00,0.592,7.104
2013-01-01 05:00:00,0.592,7.104
2013-01-01 06:00:00,0.596,7.152
2013-01-01 07:00:00,0.239,4.78
2013-01-01 08:00:00,0.566,11.32
2013-01-01 09:00:00,0.557,11.14


## Can We Do Better?

In apply_tariff_isin(), we are still admittedly doing some “manual work” by calling df.loc and df.index.hour.isin() three times each. You could argue that this solution isn’t scalable if we had a more granular range of time slots. (A different rate for each hour would require 24 .isin() calls.) Luckily, you can do things even more programmatically with Pandas’ pd.cut() function in this case:

In [66]:
%%time
#@timeit(repeat=3, number=100)
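def apply_tariff_cut(df):
    # right=False makes each bin left-closed: [0, 7), [7, 17), [17, 24),
    # matching the hour ranges in apply_tariff()
    cents_per_kwh = pd.cut(x=df.index.hour,
                           bins=[0, 7, 17, 24],
                           right=False,
                           labels=[12, 20, 28]).astype(int)
    df['cost_cents'] = cents_per_kwh * df['energy_kwh']

CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs
Wall time: 5.25 µs


Let’s take a second to see what’s going on here. pd.cut() assigns a label (our cost) to each hour according to which bin it falls in. By default the bins are closed on the right, so bins=[0, 7, 17, 24] would mean (0, 7], (7, 17], (17, 24], which leaves hour 0 out and puts hours 7 and 17 in the cheaper band. Passing right=False closes each bin on the left instead, giving [0, 7), [7, 17), [17, 24), so the result matches apply_tariff(). (The %%time figures above cover only the function definition, not a call.)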

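This is a fully vectorized way to get to your intended result, and in the tutorial’s benchmarks it comes out ahead of the .isin() approach (see the runtime table further down). A single %%time run, like the one below, is noisy and can include one-off overhead: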

In [67]:
%%time
apply_tariff_cut(df)

CPU times: user 13.5 ms, sys: 10.9 ms, total: 24.4 ms
Wall time: 74.5 ms
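
How pd.cut() treats the bin edges decides which rate hours 7 and 17 receive. The following sketch uses a plain array of the hours 0 to 23 rather than the notebook’s DataFrame, and compares the default right=True with right=False.

```python
import numpy as np
import pandas as pd

hours = np.arange(24)
bins = [0, 7, 17, 24]
labels = [12, 20, 28]

# Default right=True: bins are (0, 7], (7, 17], (17, 24];
# include_lowest=True additionally pulls hour 0 into the first bin
right_closed = pd.cut(hours, bins=bins, labels=labels,
                      include_lowest=True).astype(int)

# right=False: bins are [0, 7), [7, 17), [17, 24)
left_closed = pd.cut(hours, bins=bins, labels=labels,
                     right=False).astype(int)

# Compare the assigned rate on either side of each boundary
for hour in (0, 6, 7, 16, 17, 23):
    print(hour, right_closed[hour], left_closed[hour])
```

Only the left-closed variant agrees with the tariff table, where 7:00 starts the shoulder band and 17:00 starts the peak band.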


## Don’t Forget NumPy

One point that should not be forgotten when you are using Pandas is that Pandas Series and DataFrames are built on top of the NumPy library. This gives you even more computation flexibility, because Pandas works seamlessly with NumPy arrays and operations.

In this next case you’ll use NumPy’s digitize() function. It is similar to Pandas’ cut() in that the data will be binned, but this time it will be represented by an array of indexes representing which bin each hour belongs to. These indexes are then applied to a prices array:

In [69]:
%%time
#@timeit(repeat=3, number=100)
import numpy as np
def apply_tariff_digitize(df):
    prices = np.array([12, 20, 28])
    bins = np.digitize(df.index.hour.values, bins=[7, 17, 24])
    df['cost_cents'] = prices[bins] * df['energy_kwh'].values
    
apply_tariff_digitize(df)

CPU times: user 2.48 ms, sys: 386 µs, total: 2.87 ms
Wall time: 1.8 ms


In [70]:
df.head(20)

date_time,energy_kwh,cost_cents
2013-01-01 00:00:00,0.586,7.032
2013-01-01 01:00:00,0.58,6.96
2013-01-01 02:00:00,0.572,6.864
2013-01-01 03:00:00,0.596,7.152
2013-01-01 04:00:00,0.592,7.104
2013-01-01 05:00:00,0.592,7.104
2013-01-01 06:00:00,0.596,7.152
2013-01-01 07:00:00,0.239,4.78
2013-01-01 08:00:00,0.566,11.32
2013-01-01 09:00:00,0.557,11.14


At this point, there’s still a performance improvement, but it’s becoming more marginal in nature. This is probably a good time to call it a day on hacking away at code improvement and think about the bigger picture.

With Pandas, it can help to maintain a “hierarchy,” if you will, of preferred options for doing batch calculations like you’ve done here. These will usually rank from fastest to slowest (and from least to most flexible):

1. Use vectorized operations: Pandas methods and functions with no for-loops.
2. Use the .apply() method with a callable.
3. Use .itertuples(): iterate over DataFrame rows as namedtuples from Python’s collections module.
4. Use .iterrows(): iterate over DataFrame rows as (index, pd.Series) pairs. While a Pandas Series is a flexible data structure, it can be costly to construct each row into a Series and then access it.
5. Use “element-by-element” for loops, updating each cell or row one at a time with df.loc or df.iloc. (Or, .at/.iat for fast scalar access.)

Here’s the “order of precedence” above at work, with each function you’ve built here:

| Function | Runtime (seconds) |
|---|---|
| apply_tariff_loop() | 3.152 |
| apply_tariff_iterrows() | 0.713 |
| apply_tariff_withapply() | 0.272 |
| apply_tariff_isin() | 0.010 |
| apply_tariff_cut() | 0.003 |
| apply_tariff_digitize() | 0.002 |
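
Speed only matters if the fast version returns the same numbers as the slow one. As a minimal check on made-up data (two days of hourly readings at a constant 0.5 kWh), the sketch below compares a row-by-row calculation against the np.digitize() approach. apply_tariff() is repeated so that the block runs on its own.

```python
import numpy as np
import pandas as pd


def apply_tariff(kwh, hour):
    """Calculates cost of electricity for given hour."""
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    elif 17 <= hour < 24:
        rate = 28
    else:
        raise ValueError(f'Invalid hour: {hour}')
    return rate * kwh


# Made-up data: 48 hourly readings starting at midnight
index = pd.date_range('2013-01-01', periods=48, freq=pd.Timedelta(hours=1))
sample = pd.DataFrame({'energy_kwh': np.full(48, 0.5)}, index=index)

# Slow reference: one Python-level call per row
looped = np.array([apply_tariff(kwh, ts.hour)
                   for ts, kwh in zip(sample.index, sample['energy_kwh'])])

# Vectorized: look up each hour's price via np.digitize
prices = np.array([12, 20, 28])
bins = np.digitize(sample.index.hour.values, bins=[7, 17, 24])
vectorized = prices[bins] * sample['energy_kwh'].values

print(np.allclose(looped, vectorized))
```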

In [71]:
df.index.hour.values

array([ 0,  1,  2, ..., 21, 22, 23])

In [72]:
bins = np.digitize(df.index.hour.values, bins=[7, 17, 24])
bins

array([0, 0, 0, ..., 2, 2, 2])

In [74]:
bins[:30]

array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2,
       2, 2, 0, 0, 0, 0, 0, 0])

In [78]:
np.unique(bins)

array([0, 1, 2])

In [76]:
prices = np.array([12, 20, 28])
prices

array([12, 20, 28])

In [85]:
prices[bins][:30]

array([12, 12, 12, 12, 12, 12, 12, 20, 20, 20, 20, 20, 20, 20, 20, 20, 20,
       28, 28, 28, 28, 28, 28, 28, 12, 12, 12, 12, 12, 12])

In [86]:
df.head(30)

date_time,energy_kwh,cost_cents
2013-01-01 00:00:00,0.586,7.032
2013-01-01 01:00:00,0.58,6.96
2013-01-01 02:00:00,0.572,6.864
2013-01-01 03:00:00,0.596,7.152
2013-01-01 04:00:00,0.592,7.104
2013-01-01 05:00:00,0.592,7.104
2013-01-01 06:00:00,0.596,7.152
2013-01-01 07:00:00,0.239,4.78
2013-01-01 08:00:00,0.566,11.32
2013-01-01 09:00:00,0.557,11.14


In [83]:
prices[bins].shape

(8760,)

In [89]:
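# With the default right=True, the bins are [0, 7], (7, 17], (17, 24],
# so hours 7 and 17 fall into the cheaper band (compare with prices[bins] above)
cents_per_kwh = pd.cut(x=df.index.hour,
                       bins=[0, 7, 17, 24],
                       include_lowest=True,
                       labels=[12, 20, 28]).astype(int)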

cents_per_kwh[:30]

array([12, 12, 12, 12, 12, 12, 12, 12, 20, 20, 20, 20, 20, 20, 20, 20, 20,
       20, 28, 28, 28, 28, 28, 28, 12, 12, 12, 12, 12, 12])

## Prevent Reprocessing with HDFStore

Now that you have looked at fast data processing in Pandas, let’s explore how to avoid reprocessing altogether with HDFStore, Pandas’ interface to HDF5 files.

Often when you are building a complex data model, it is convenient to do some pre-processing of your data. For example, if you had 10 years of minute-frequency electricity consumption data, simply converting the date and time to datetime might take 20 minutes, even if you specify the format parameter. You really only want to have to do this once, not every time you run your model, for testing or analysis.

A very useful thing you can do here is pre-process your data once and then store it in its processed form to be used when needed. But how can you store data in the right format without having to reprocess it? If you were to save it as CSV, you would lose your datetime objects and have to reprocess the data every time you loaded it.
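
The CSV limitation is easy to reproduce. The sketch below writes a tiny, made-up DataFrame to an in-memory CSV and reads it back. The frame and its variable names are illustrative only.

```python
import io

import pandas as pd

# Made-up frame with an already-parsed datetime column
df_processed = pd.DataFrame({
    'date_time': pd.to_datetime(['2013-01-01 00:00', '2013-01-01 01:00']),
    'energy_kwh': [0.586, 0.580],
})

# Write to an in-memory CSV and read it straight back
buffer = io.StringIO()
df_processed.to_csv(buffer, index=False)
buffer.seek(0)
restored = pd.read_csv(buffer)

print(df_processed['date_time'].dtype)  # datetime64 before saving
print(restored['date_time'].dtype)      # plain strings after the round trip

# The dates have to be parsed again, for example with parse_dates
buffer.seek(0)
reparsed = pd.read_csv(buffer, parse_dates=['date_time'])
print(reparsed['date_time'].dtype)
```

CSV stores only text, so the parsing cost is paid on every load.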

Pandas has a built-in solution for this which uses HDF5, a high-performance storage format designed specifically for storing tabular arrays of data. Pandas’ HDFStore class allows you to store your DataFrame in an HDF5 file so that it can be accessed efficiently, while still retaining column types and other metadata. It is a dictionary-like class, so you can read and write just as you would for a Python dict object. (HDFStore relies on the PyTables package, which is installed as tables.)

Here’s how you would go about storing your pre-processed electricity consumption DataFrame, df, in an HDF5 file:

In [90]:
# Create storage object with filename `processed_data.h5`
data_store = pd.HDFStore('processed_data.h5')

# Put DataFrame into the object setting the key as 'preprocessed_df'
data_store['preprocessed_df'] = df
data_store.close()

Now you can shut your computer down and take a break knowing that you can come back and your processed data will be waiting for you when you need it. No reprocessing required. Here’s how you would access your data from the HDF5 file, with data types preserved:

In [91]:
# Access data store
data_store = pd.HDFStore('processed_data.h5')

# Retrieve data using key
preprocessed_df = data_store['preprocessed_df']
data_store.close()

In [93]:
preprocessed_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 8760 entries, 2013-01-01 00:00:00 to 2013-12-31 23:00:00
Data columns (total 2 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   energy_kwh  8760 non-null   float64
 1   cost_cents  8760 non-null   float64
dtypes: float64(2)
memory usage: 205.3 KB
