<h1><b>Fast, Flexible, Easy and Intuitive: How to Speed Up Your pandas Projects</b></h1>
<p>This notebook covers a tutorial by <a href="https://realpython.com/team/jwyndham/">Joe Wyndham</a> on how to get the most out of pandas in terms of performance. By the end, we come to understand the real power of pandas and demystify the claim that pandas is too slow. The main problem turns out to be <b>applying plain Python looping logic to a library that was built around other structures.</b> pandas is designed for <b>vectorized operations</b>, a concept worth keeping in mind and one we will discuss throughout these notes.</p>
<p>By the end of this notebook, I hope to have covered the following topics:</p>
<ul>
    <li>The advantages of using <code>datetime</code> data with time series</li>
    <li>The most efficient ways to do batch calculations</li>
    <li>How to save time by storing data with HDFStore</li>
</ul>

<h2><b>Table of Contents</b></h2>
<ol>
<li>Import Libraries and the Data</li>
<li>Saving Time With Datetime Data</li>
<li>Simple Looping Over pandas Data</li>
<li>Looping with <code>.itertuples()</code> and <code>.iterrows()</code></li>
<li>pandas’ <code>.apply()</code></li>
<li>Selecting Data With <code>.isin()</code></li>
<li>Can We Do Better?</li>
<li>Don’t Forget NumPy!</li>
<li>Prevent Reprocessing with HDFStore</li>
<li>Conclusions</li>
</ol>
<hr>
<h2><b>1. Import Libraries and the Data</b></h2>
<p>First, let's import the libraries, define our working folders, and import our dataset. The author used an example taken from his job, a time series of electricity consumption. Given different tariffs (in USD cents) for energy consumption throughout a 24-hour period, the task was to multiply the electricity consumed in each hour by the tariff that applies to the hour in which it was consumed.</p>

In [1]:
import pandas as pd
import os
import time
import functools
import gc
import itertools
import sys
import numpy as np
from timeit import default_timer as _timer

In [2]:
path = os.getcwd()

In [3]:
data_repo = f"{path}/data/"
data_in = f"{data_repo}raw/"
data_out = f"{data_repo}output/"

In [4]:
pd.__version__

'1.5.3'

In [5]:
df = pd.read_csv(f"{data_in}demand_profile.csv")

In [6]:
df.head()

Unnamed: 0,date_time,energy_kwh
0,1/1/13 0:00,0.586
1,1/1/13 1:00,0.58
2,1/1/13 2:00,0.572
3,1/1/13 3:00,0.596
4,1/1/13 4:00,0.592


<h2><b>2. Saving Time With Datetime Data</b></h2>
<p>The first problem we face is related to <b>data types</b>. Let's take a look at the <code>dtypes</code> for each of the two variables.</p>

In [7]:
df.dtypes

date_time      object
energy_kwh    float64
dtype: object

In [8]:
type(df.iat[0, 0])

str

<p>As we can see, <code>"date_time"</code> was stored with the <code>object</code> dtype, a catch-all that can hold any Python object; here, each value is a plain <b>string</b>. Operations in pandas on this kind of data are slow and inefficient. Fortunately, we can convert this column to datetime values using <code>pd.to_datetime()</code>.</p>
<p>Note: because we will convert this column several times using different approaches, I commented out the conversion below. Uncomment it to see the change in the data format, then comment it out again and re-run the notebook so that we keep working with the column in its original format.</p>

In [9]:
# df["date_time"] = pd.to_datetime(df["date_time"])
df["date_time"].dtype

dtype('O')

In [10]:
df.head()

Unnamed: 0,date_time,energy_kwh
0,1/1/13 0:00,0.586
1,1/1/13 1:00,0.58
2,1/1/13 2:00,0.572
3,1/1/13 3:00,0.596
4,1/1/13 4:00,0.592
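
<p>To make the practical difference between the two dtypes concrete, here is a minimal sketch on a hypothetical three-row sample (the <code>sample</code> and <code>converted</code> names are illustrative and not part of the original notebook). It shows that the <code>.dt</code> accessor is unavailable while the column holds strings and becomes available once the column is converted.</p>

```python
import pandas as pd

# Hypothetical sample in the same layout as demand_profile.csv
sample = pd.DataFrame({
    "date_time": ["1/1/13 0:00", "1/1/13 1:00", "1/1/13 2:00"],
    "energy_kwh": [0.586, 0.580, 0.572],
})

# While the column holds strings, the datetime accessor is unavailable
try:
    sample["date_time"].dt.hour
    has_dt_before = True
except AttributeError:
    has_dt_before = False

# Convert on a copy so the original strings stay untouched
converted = sample.assign(date_time=pd.to_datetime(sample["date_time"]))

# With a datetime column, the hour is available for the whole column at once
hours = converted["date_time"].dt.hour
print(has_dt_before, hours.tolist())
```

<p>Having the hour available for the whole column at once is what makes the vectorized approaches later in these notes possible.</p>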


<p>Now, to measure how fast our code is, let's use a <a href="https://github.com/realpython/materials/blob/master/pandas-fast-flexible-intuitive/tutorial/timer.py">timing decorator</a> provided by the author.</p>

In [11]:
from timer import timeit

@timeit(repeat=3, number=10)
def convert(df, column_name):
    return pd.to_datetime(df[column_name])

# `date_time` still has `object` dtype here (the conversion in In [9] is commented out)
df['date_time'] = convert(df, 'date_time')

Best of 3 trials with 10 function calls per trial:
Function `convert` ran in average of 0.642 seconds
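
<p>The <code>timeit</code> decorator used above comes from the author's <code>timer.py</code>, which is not reproduced in this notebook. As a rough idea of what such a decorator does, here is a minimal sketch. It is a hypothetical stand-in written for these notes, not the author's implementation, and the <code>total</code> function is only there to exercise it.</p>

```python
import functools
from timeit import default_timer


def timeit(repeat=3, number=10):
    """Hypothetical stand-in for the tutorial's `timer.timeit` decorator."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            trial_totals = []
            result = None
            for _ in range(repeat):
                start = default_timer()
                for _ in range(number):
                    result = func(*args, **kwargs)
                trial_totals.append(default_timer() - start)
            # Average time per call within the fastest trial
            best = min(trial_totals) / number
            print(f"Best of {repeat} trials with {number} function calls per trial:")
            print(f"Function `{func.__name__}` ran in average of {best:.3f} seconds")
            return result
        return wrapper
    return decorator


@timeit(repeat=2, number=5)
def total(n):
    return sum(range(n))


result = total(1_000)
```

<p>Reporting the best trial rather than the mean of all trials reduces the influence of unrelated activity on the machine.</p>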



<p>Now, we can speed this process up if we tell pandas what the date and time format looks like by passing a format string in the <code>format</code> parameter.</p>

In [12]:
@timeit(repeat=3, number=10)
def convert_with_format(df, column_name):
    return pd.to_datetime(df[column_name],
                          format="%d/%m/%y %H:%M")

df["date_time"] = convert_with_format(df, "date_time")

Best of 3 trials with 10 function calls per trial:
Function `convert_with_format` ran in average of 0.005 seconds



<p>That's quite an improvement - 128 times faster! Hence, it pays off to be explicit about the date format you want to use.</p>
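
<p>Being explicit also removes ambiguity. The sketch below parses a few hypothetical strings in the same layout as the dataset, assuming the day comes first (<code>%d/%m/%y</code>); the <code>raw</code> and <code>parsed</code> names are illustrative only.</p>

```python
import pandas as pd

# Hypothetical timestamps; the day-first order is an assumption
raw = pd.Series(["1/1/13 0:00", "2/1/13 13:00", "31/12/13 23:00"])

# %d = day, %m = month, %y = two-digit year, %H:%M = 24-hour time
parsed = pd.to_datetime(raw, format="%d/%m/%y %H:%M")

print(parsed.dt.month.tolist())
print(parsed.dt.hour.tolist())
```

<p>With this format, <code>"2/1/13"</code> is read as 2 January 2013 rather than 1 February 2013.</p>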

<hr>
<h2><b>3. Simple Looping Over pandas Data</b></h2>
<p>We can now take a look at our challenge. We want to calculate electricity costs, yet they vary by the hour, which requires us to apply a cost factor to each hour of the day. The table below, provided by the author, describes the price changes through the day.</p>

| **Tariff Type** | **Cents per kWh** | **Time Range** |
|-----------------|-------------------|----------------|
| Peak            | 28                | 17:00 to 24:00 |
| Shoulder        | 20                | 07:00 to 17:00 |
| Off-Peak        | 12                | 00:00 to 07:00 |

<p>If the price were a flat 28 cents for every hour of the day, a single line of code would be enough, as below.</p>

In [13]:
df["cost_cents"] = df["energy_kwh"] * 28 # if price were a flat 28 cents per kWh

In [14]:
df.head()

Unnamed: 0,date_time,energy_kwh,cost_cents
0,2013-01-01 00:00:00,0.586,16.408
1,2013-01-01 01:00:00,0.58,16.24
2,2013-01-01 02:00:00,0.572,16.016
3,2013-01-01 03:00:00,0.596,16.688
4,2013-01-01 04:00:00,0.592,16.576


<p>However, the calculation needs a condition, because the price depends on the hour of the day. Now, we will look at how people usually write conditional calculations in Python using <b>loops</b>, and why this approach does <b>not</b> perform well with pandas.</p>
<p>First, before looking at each approach, let's create a function that applies the right tariff to a given hour.</p>

In [15]:
def apply_tariff(kwh, hour):
    """Calculates the cost of electricity for a given hour."""
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    elif 17 <= hour < 24:
        rate = 28
    else:
        raise ValueError(f"Invalid hour: {hour}")
    return rate * kwh

<p>And here's our standard iteration, followed by the amount of time taken for Python to calculate it.</p>

In [16]:
@timeit(repeat=3, number=10)
def apply_tariff_loop(df):
    """Calculate costs in a loop and modify `df` in place."""
    energy_cost_list = []
    for i in range(len(df)):
        # Get electricity used and hour of day
        energy_used = df.iloc[i]["energy_kwh"]
        hour = df.iloc[i]["date_time"].hour
        energy_cost = apply_tariff(energy_used, hour)
        energy_cost_list.append(energy_cost)
    df["cost_cents"] = energy_cost_list

In [17]:
apply_tariff_loop(df)

Best of 3 trials with 10 function calls per trial:
Function `apply_tariff_loop` ran in average of 1.452 seconds



<h3>Why is it non-Pythonic to do loops with pandas?</h3>
<ol>
<li>We have to create an empty list in which the results will be stored.</li>
<li>The loop iterates over <code>range(len(df))</code>, looks up each row with <code>.iloc</code>, and then calls <code>apply_tariff</code>.</li>
<li>Each result then has to be appended to the list, which is finally assigned as a new column of the DataFrame.</li>
<li><b>Cost of the calculations</b>: about 1.5 seconds for some 8760 rows.</li>
</ol>

<hr>
<h2><b>4. Looping with <code>.itertuples()</code> and <code>.iterrows()</code></b></h2>


<p><code>.itertuples()</code> yields a <code>namedtuple</code> for each row, with the row's index value as the first element of the tuple. This structure, from Python's <code>collections</code> module, behaves like a Python tuple but has fields accessible by attribute lookup.</p>
<p><code>.iterrows()</code> yields (index, <code>Series</code>) pairs for each row in the DataFrame. Because the latter is more common in our context, it is the only one we will time here.</p>

In [18]:
@timeit(repeat=3, number=10)
def apply_tariff_iterrows(df):
    energy_cost_list = []
    for index, row in df.iterrows():
        # Get electricity used and hour of day
        energy_used = row["energy_kwh"]
        hour = row["date_time"].hour
        # Append cost list
        energy_cost = apply_tariff(energy_used, hour)
        energy_cost_list.append(energy_cost)
    df["cost_cents"] = energy_cost_list

In [19]:
apply_tariff_iterrows(df)

Best of 3 trials with 10 function calls per trial:
Function `apply_tariff_iterrows` ran in average of 0.363 seconds



<h3>Changes:</h3>
<ol>
<li>The syntax is more explicit.</li>
<li>Less clutter in the row value references, which makes our code more readable.</li>
<li>4 times quicker.</li>
</ol>
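
<p>For comparison, the same loop can be written with <code>.itertuples()</code>, where each field is read by attribute lookup. This is a minimal sketch on a hypothetical three-row frame (<code>df_demo</code> is illustrative, and no timing is claimed for it); <code>apply_tariff</code> is repeated so that the block runs on its own.</p>

```python
import pandas as pd


def apply_tariff(kwh, hour):
    """Calculates the cost of electricity for a given hour."""
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    elif 17 <= hour < 24:
        rate = 28
    else:
        raise ValueError(f"Invalid hour: {hour}")
    return rate * kwh


# Hypothetical sample: one reading in each tariff band
df_demo = pd.DataFrame({
    "date_time": pd.to_datetime(
        ["2013-01-01 06:00", "2013-01-01 07:00", "2013-01-01 17:00"]
    ),
    "energy_kwh": [1.0, 1.0, 1.0],
})

costs = []
for row in df_demo.itertuples(index=False):
    # Each row is a namedtuple, so columns are plain attributes
    costs.append(apply_tariff(row.energy_kwh, row.date_time.hour))
df_demo["cost_cents"] = costs
```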
<p>Let's now look beyond Python-style <code>for</code> loops and consider tools that make better use of pandas' internal architecture.</p>

<hr>
<h2><b>5. pandas' <code>.apply()</code></b></h2>
<p><code>.apply()</code> takes a <b>function</b> and applies it along an axis of a DataFrame (either all rows or all columns). Here, with the help of a lambda function, we pass the two columns of data into <code>apply_tariff()</code>.</p>

In [20]:
@timeit(repeat=3, number=10)
def apply_tariff_withapply(df):
    df["cost_cents"] = df.apply(
        lambda row: apply_tariff(
            kwh=row["energy_kwh"],
            hour=row["date_time"].hour),
        axis=1
    )

In [21]:
apply_tariff_withapply(df)

Best of 3 trials with 10 function calls per trial:
Function `apply_tariff_withapply` ran in average of 0.069 seconds



<p>This is a further improvement over <code>.iterrows()</code>. In addition to being more concise and readable, our code is about <b>5.3</b> times faster. We are not done yet, though: with <code>axis=1</code>, <code>.apply()</code> still calls the function once per row, so on a very large dataset we would probably still have to sit and wait through many minutes of processing time. We will deal now with vectorized operations.</p>

<hr>
<h2><b>6. Selecting Data With <code>.isin()</code></b></h2>
<style>
  .callout {
    border: 1px solid #444;
    background-color: #333;
    padding: 10px;
    margin: 20px;
    border-radius: 5px;
    box-shadow: 0px 0px 5px rgba(255, 255, 255, 0.2);
  }
</style>

<div class="callout">

   <h3><b>Intermission</b> | On vectorized operations</h3>
<p><b>Vectorized operations</b> in pandas are a fundamental concept that contributes to the efficiency and power of the library. Pandas is built on top of the NumPy library, which provides support for array computations. Vectorized operations allow you to perform operations on entire arrays or series of data <b>without the need for explicit iteration</b>, resulting in significantly faster and more concise code.</p>
<p>Here are some considerations about vectorized operations in pandas:</p>
<ol>
<li><b>Element-wise operations</b>. Like NumPy, pandas supports element-wise operations, which means that you can apply a function or arithmetic operation to <b>every</b> element in a Series or DataFrame <b>without needing to loop through each element</b>.</li>
<li><b>Broadcasting</b>. Pandas allows for broadcasting, which means you can perform operations <b>between arrays or series of different shapes</b>. The smaller array is automatically "broadcast" over the larger one to match dimensions, making the operation possible. This is similar to broadcasting in NumPy.</li>
<li><b>Performance Benefits</b>. Vectorized operations are much faster than equivalent operations performed using loops, as they take advantage of underlying <b>C</b> or <b>Cython</b> implementations for computations. This is particularly important when dealing with <b>large datasets</b>.</li>
<li><b>Code Readability</b>. Using vectorized operations can lead to more concise and readable code. You can express complex operations in a single line of code, improving the maintainability of your codebase.</li>
<li><b>Examples</b> of vectorized operations in pandas:
<ul>
<li>Arithmetic operations - you can add, subtract, multiply, or divide entire Series or DataFrames element-wise.</li>
<li>Element-wise functions - applying functions like <code>numpy.sqrt()</code>, <code>numpy.log()</code>, etc., to a Series or DataFrame.</li>
<li>Comparison operations - you can perform element-wise comparisons, resulting in boolean Series, which can be used for filtering.</li>
<li>Conditional operations - using <code>numpy.where()</code> or pandas' <code>.loc[]</code> to perform conditional assignments.</li>
<li>Mathematical operations - computing aggregates like mean, sum, median, etc., directly on Series or columns of DataFrames.</li>
</ul>
</li>
</ol>
</div>
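
<p>A few of the operations listed in the intermission can be sketched on a hypothetical four-row sample. The values and variable names below are illustrative only, and the two-rate rule is a simplification of the real three-band tariff.</p>

```python
import numpy as np
import pandas as pd

# Hypothetical readings and the hour at which each was taken
energy = pd.Series([0.586, 0.580, 0.572, 0.596])
hours = pd.Series([5, 6, 7, 8])

# Arithmetic: the scalar 28 is broadcast over the whole Series
flat_cost = energy * 28

# Comparison: produces a boolean Series, one value per row
is_off_peak = hours < 7

# Conditional: pick a rate per row without an explicit loop
rate = np.where(is_off_peak, 12, 20)

# Element-wise product followed by an aggregate
cost = energy * rate
total = cost.sum()
```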
<p>Going back to our notes, let's consider pandas' first alternative: <code>.isin()</code>.</p>

In [22]:
# Set date_time as the DataFrame's index for convenience purposes
df.set_index("date_time", inplace=True)

@timeit(repeat=3, number=10)
def apply_tariff_isin(df):
    # Define hour range boolean arrays
    peak_hours = df.index.hour.isin(range(17,24))
    shoulder_hours = df.index.hour.isin(range(7, 17))
    off_peak_hours = df.index.hour.isin(range(0, 7))

    # Apply tariffs to hour ranges
    df.loc[peak_hours, "cost_cents"] = df.loc[peak_hours, "energy_kwh"] * 28
    df.loc[shoulder_hours, "cost_cents"] = df.loc[shoulder_hours, "energy_kwh"] * 20
    df.loc[off_peak_hours, "cost_cents"] = df.loc[off_peak_hours, "energy_kwh"] * 12


In [23]:
apply_tariff_isin(df)

Best of 3 trials with 10 function calls per trial:
Function `apply_tariff_isin` ran in average of 0.004 seconds
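
<p>To see what the boolean masks look like, here is a minimal sketch of the same pattern on a hypothetical three-row frame (<code>demo</code> and the mask names are illustrative). Each <code>.isin()</code> call returns one boolean per row, and <code>.loc</code> then updates only the rows where the mask is true.</p>

```python
import pandas as pd

# Hypothetical sample: one reading in each tariff band
idx = pd.to_datetime(["2013-01-01 03:00", "2013-01-01 12:00", "2013-01-01 20:00"])
demo = pd.DataFrame({"energy_kwh": [1.0, 2.0, 3.0]}, index=idx)

# One boolean mask per tariff band
peak = demo.index.hour.isin(range(17, 24))
shoulder = demo.index.hour.isin(range(7, 17))
off_peak = demo.index.hour.isin(range(0, 7))

# Each assignment touches only the rows selected by its mask
demo.loc[peak, "cost_cents"] = demo.loc[peak, "energy_kwh"] * 28
demo.loc[shoulder, "cost_cents"] = demo.loc[shoulder, "energy_kwh"] * 20
demo.loc[off_peak, "cost_cents"] = demo.loc[off_peak, "energy_kwh"] * 12
```

<p>The work is done in three whole-column steps rather than once per row, which is where the speed-up comes from.</p>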



<hr>
<h2><b>7. Can We Do Better?</b></h2>

In [24]:
@timeit(repeat=3, number=10)
def apply_tariff_cut(df):
    cents_per_kwh = pd.cut(x=df.index.hour,
                           bins=[0, 7, 17, 24],
                           right=False,  # left-closed bins: [0, 7), [7, 17), [17, 24)
                           labels=[12, 20, 28]).astype(int)
    df["cost_cents"] = cents_per_kwh * df["energy_kwh"]
    # return df

In [25]:
apply_tariff_cut(df)

Best of 3 trials with 10 function calls per trial:
Function `apply_tariff_cut` ran in average of 0.002 seconds
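
<p><code>pd.cut()</code> assigns each value to a bin, and by default the bins are closed on the right. Because <code>apply_tariff()</code> uses half-open ranges such as <code>7 &lt;= hour &lt; 17</code>, the boundary hours deserve a check. The sketch below compares the default intervals with <code>right=False</code> on a hypothetical set of boundary hours (the variable names are illustrative).</p>

```python
import pandas as pd

# Hypothetical boundary hours around the tariff changes at 7:00 and 17:00
hours = pd.Series([0, 6, 7, 16, 17, 23])
edges = [0, 7, 17, 24]
rates = [12, 20, 28]

# Default: bins are [0, 7], (7, 17], (17, 24], so hours 7 and 17 fall in the lower band
right_closed = pd.cut(hours, bins=edges, labels=rates, include_lowest=True).astype(int)

# right=False: bins are [0, 7), [7, 17), [17, 24), matching apply_tariff()
left_closed = pd.cut(hours, bins=edges, labels=rates, right=False).astype(int)

print(right_closed.tolist())
print(left_closed.tolist())
```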



<hr>
<h2><b>8. Don't Forget NumPy!</b></h2>

In [26]:
@timeit(repeat=3, number=10)
def apply_tariff_digitize(df):
    prices = np.array([12, 20, 28])
    bins = np.digitize(df.index.hour.values, bins=[7, 17, 24])
    df["cost_cents"] = prices[bins] * df["energy_kwh"].values

In [27]:
apply_tariff_digitize(df)

Best of 3 trials with 10 function calls per trial:
Function `apply_tariff_digitize` ran in average of 0.001 seconds
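
<p><code>np.digitize()</code> returns, for each value, the index of the bin it falls into, and that index can be used directly to look up a price. A minimal sketch on a hypothetical set of boundary hours (the <code>hours</code>, <code>idx</code> and <code>rates</code> names are illustrative):</p>

```python
import numpy as np

prices = np.array([12, 20, 28])

# Hypothetical boundary hours around the tariff changes at 7:00 and 17:00
hours = np.array([0, 6, 7, 16, 17, 23])

# With the default right=False: hours < 7 -> 0, 7 <= hours < 17 -> 1, 17 <= hours < 24 -> 2
idx = np.digitize(hours, bins=[7, 17, 24])

# Fancy indexing maps every bin index to its price in one step
rates = prices[idx]
print(idx.tolist(), rates.tolist())
```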



<hr>
<h2><b>9. Prevent Reprocessing with HDFStore</b></h2>


In [28]:
# Create storage object with filename `processed_data.h5` in the output folder
data_store = pd.HDFStore(f"{data_out}processed_data.h5")

# Put DataFrame into the object setting the key as 'preprocessed_df'
data_store["preprocessed_df"] = df
data_store.close()

In [29]:
# Access data store
data_store = pd.HDFStore(f"{data_out}processed_data.h5")

# Retrieve data using key
preprocessed_df = data_store["preprocessed_df"]
data_store.close()

<hr>
<h2><b>10. Conclusions</b></h2>