<h1><b>Fast, Flexible, Easy and Intuitive: How to Speed Up Your pandas Projects</b></h1>
<p>In this notebook, I cover several ways to speed up common pandas operations, using an hourly electricity demand profile as the running example. Starting from a plain Python loop over rows, each section replaces the previous approach with a faster one and times it, so the cost of each choice is visible.</p>
<p>By the end of this notebook, I hope to have covered the following topics:</p>
<ul>
    <li>Parsing datetime strings efficiently by passing an explicit format</li>
    <li>Why looping over rows is slow, and how <code>.iterrows()</code> and <code>.apply()</code> compare</li>
    <li>Vectorized alternatives: <code>.isin()</code>, <code>pd.cut()</code> and NumPy's <code>digitize()</code></li>
    <li>Storing preprocessed data with <code>HDFStore</code> to avoid reprocessing</li>
</ul>

<h2><b>Table of Contents</b></h2>
<ul>
<li>Introduction</li>
<li>Import Libraries and the Data</li>
<li>Saving Time With Datetime Data</li>
<li>Simple Looping Over pandas Data</li>
<li>Looping with <code>.itertuples()</code> and <code>.iterrows()</code></li>
<li>pandas’ <code>.apply()</code></li>
<li>Selecting Data With <code>.isin()</code></li>
<li>Can We Do Better?</li>
<li>Don’t Forget NumPy!</li>
<li>Prevent Reprocessing with HDFStore</li>
<li>Conclusions</li>
</ul>
<hr>
<h2><b>1. Introduction</b></h2>
<hr>
<h2><b>2. Import Libraries and the Data</b></h2>

In [1]:
import pandas as pd
import os
import time
import functools
import gc
import itertools
import sys
import numpy as np
from timeit import default_timer as _timer

In [2]:
path = os.getcwd()
print(os.getcwd())

c:\Users\Felipe\python_work\notebooks\py_pandas\8_fast_flexible_pandas


In [3]:
data_repo = f"{path}/data/"
data_in = f"{data_repo}raw/"

In [4]:
pd.__version__

'1.5.3'

In [5]:
df = pd.read_csv(f"{data_in}demand_profile.csv")

<h2><b>3. Saving Time With Datetime Data</b></h2>

In [6]:
df.head()

Unnamed: 0,date_time,energy_kwh
0,1/1/13 0:00,0.586
1,1/1/13 1:00,0.58
2,1/1/13 2:00,0.572
3,1/1/13 3:00,0.596
4,1/1/13 4:00,0.592


In [7]:
df.dtypes

date_time      object
energy_kwh    float64
dtype: object

In [8]:
type(df.iat[0, 0])

str

In [9]:
# df["date_time"] = pd.to_datetime(df["date_time"])
df["date_time"].dtype

dtype('O')

In [10]:
df.head()

Unnamed: 0,date_time,energy_kwh
0,1/1/13 0:00,0.586
1,1/1/13 1:00,0.58
2,1/1/13 2:00,0.572
3,1/1/13 3:00,0.596
4,1/1/13 4:00,0.592


In [11]:
from timer import timeit

@timeit(repeat=3, number=10)
def convert(df, column_name):
    return pd.to_datetime(df[column_name])

# Read it again so that we have `object` dtype to start
df['date_time'] = convert(df, 'date_time')

Best of 3 trials with 10 function calls per trial:
Function `convert` ran in average of 0.817 seconds
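
<p>The <code>timeit</code> decorator is imported from a local <code>timer</code> module that is not shown in this notebook. The block below is a minimal sketch of what such a decorator might look like, assuming it runs the function <code>number</code> times per trial, repeats that <code>repeat</code> times, and reports the best trial averaged per call. It is a hypothetical stand-in, not the actual module.</p>

```python
import functools
import sys
from timeit import default_timer as _timer


def timeit(repeat=3, number=10, file=sys.stdout):
    """Hypothetical stand-in for `timer.timeit` (an assumption, not the real module)."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            trials = []
            result = None
            for _ in range(repeat):
                start = _timer()
                for _ in range(number):
                    result = func(*args, **kwargs)
                trials.append(_timer() - start)
            # Best trial, averaged over the calls made in that trial
            best = min(trials) / number
            print(
                f"Best of {repeat} trials with {number} function calls per trial:\n"
                f"Function `{func.__name__}` ran in average of {best:.3f} seconds",
                file=file,
            )
            # Return the last result so the decorated function stays usable
            return result
        return wrapper
    return decorator


@timeit(repeat=2, number=5)
def add(a, b):
    return a + b


total = add(1, 2)
```

<p>Returning the wrapped function's result is what allows assignments such as <code>df["date_time"] = convert(df, "date_time")</code> to keep working while the timing is printed.</p>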



In [12]:
@timeit(repeat=3, number=10)
def convert_with_format(df, column_name):
    return pd.to_datetime(df[column_name],
                          format="%d/%m/%y %H:%M")

df["date_time"] = convert_with_format(df, "date_time")

Best of 3 trials with 10 function calls per trial:
Function `convert_with_format` ran in average of 0.008 seconds
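
<p>To see what the explicit format does, here is a small self-contained sketch on a few strings shaped like the <code>date_time</code> column. The sample values are made up for illustration; the format string assumes day/month/two-digit year followed by hour and minute.</p>

```python
import pandas as pd

# Strings in the same shape as the raw `date_time` column (illustrative values)
raw = pd.Series(["1/1/13 0:00", "1/1/13 1:00", "1/1/13 2:00"])

# Passing `format` lets pandas skip per-string format inference
parsed = pd.to_datetime(raw, format="%d/%m/%y %H:%M")

hours = parsed.dt.hour.tolist()
first = parsed.iloc[0]
```

<p>The separators in the format string must match the data exactly, slashes included, otherwise the strings do not match the pattern and parsing fails.</p>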



<hr>
<h2><b>4. Simple Looping Over pandas Data</b></h2>

In [13]:
df["cost_cents"] = df["energy_kwh"] * 28 # if price were a flat 28 cents per kWh

In [14]:
df.head()

Unnamed: 0,date_time,energy_kwh,cost_cents
0,2013-01-01 00:00:00,0.586,16.408
1,2013-01-01 01:00:00,0.58,16.24
2,2013-01-01 02:00:00,0.572,16.016
3,2013-01-01 03:00:00,0.596,16.688
4,2013-01-01 04:00:00,0.592,16.576


In [15]:
def apply_tariff(kwh, hour):
    """Calculates the cost of electricity for a given hour."""
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    elif 17 <= hour < 24:
        rate = 28
    else:
        raise ValueError(f"Invalid hour: {hour}")
    return rate * kwh

In [16]:
@timeit(repeat=3, number=10)
def apply_tariff_loop(df):
    """Calculate costs in a loop and modify `df` in place."""
    energy_cost_list = []
    for i in range(len(df)):
        # Get electricity used and hour of day
        energy_used = df.iloc[i]["energy_kwh"]
        hour = df.iloc[i]["date_time"].hour
        energy_cost = apply_tariff(energy_used, hour)
        energy_cost_list.append(energy_cost)
    df["cost_cents"] = energy_cost_list

In [17]:
apply_tariff_loop(df)

Best of 3 trials with 10 function calls per trial:
Function `apply_tariff_loop` ran in average of 1.781 seconds



<hr>
<h2><b>5. Looping with <code>.itertuples()</code> and <code>.iterrows()</code></b></h2>


In [18]:
@timeit(repeat=3, number=10)
def apply_tariff_iterrows(df):
    energy_cost_list = []
    for index, row in df.iterrows():
        # Get electricity used and hour of day
        energy_used = row["energy_kwh"]
        hour = row["date_time"].hour
        # Append cost list
        energy_cost = apply_tariff(energy_used, hour)
        energy_cost_list.append(energy_cost)
    df["cost_cents"] = energy_cost_list

In [19]:
apply_tariff_iterrows(df)

Best of 3 trials with 10 function calls per trial:
Function `apply_tariff_iterrows` ran in average of 0.419 seconds
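
<p>For comparison, here is a sketch of the same loop written with <code>.itertuples()</code>, which yields lightweight namedtuples instead of building a Series per row. The function name <code>apply_tariff_itertuples</code> and the three-row sample frame are assumptions made for illustration; no timing is claimed here.</p>

```python
import pandas as pd


def apply_tariff(kwh, hour):
    """Same tariff bands as the notebook's apply_tariff()."""
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    elif 17 <= hour < 24:
        rate = 28
    else:
        raise ValueError(f"Invalid hour: {hour}")
    return rate * kwh


# Tiny stand-in for the demand profile (made-up values)
sample = pd.DataFrame({
    "date_time": pd.to_datetime(
        ["2013-01-01 00:00", "2013-01-01 07:00", "2013-01-01 17:00"]
    ),
    "energy_kwh": [1.0, 2.0, 3.0],
})


def apply_tariff_itertuples(df):
    """Hypothetical variant of the loop using namedtuple rows."""
    costs = []
    for row in df.itertuples(index=False):
        # Columns are exposed as attributes on each namedtuple
        costs.append(apply_tariff(row.energy_kwh, row.date_time.hour))
    df["cost_cents"] = costs


apply_tariff_itertuples(sample)
```

<p>Attribute access works here because the column names are valid Python identifiers; columns with spaces or other special characters would need positional access instead.</p>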



<hr>
<h2><b>6. pandas' <code>.apply()</code></b></h2>

In [20]:
@timeit(repeat=3, number=10)
def apply_tariff_withapply(df):
    df["cost_cents"] = df.apply(
        lambda row: apply_tariff(
            kwh=row["energy_kwh"],
            hour=row["date_time"].hour),
        axis=1
    )

In [21]:
apply_tariff_withapply(df)

Best of 3 trials with 10 function calls per trial:
Function `apply_tariff_withapply` ran in average of 0.084 seconds



<hr>
<h2><b>7. Selecting Data With <code>.isin()</code></b></h2>
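<p>Vectorized operations act on whole columns or index arrays at once, so the per-row work runs in compiled code rather than in a Python loop. Here, <code>.isin()</code> builds one boolean mask per tariff band, and each band is priced in a single assignment.</p>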

In [22]:
# Set date_time as the DataFrame's index for convenience purposes
df.set_index("date_time", inplace=True)

@timeit(repeat=3, number=10)
def apply_tariff_isin(df):
    # Define hour range boolean arrays
    peak_hours = df.index.hour.isin(range(17, 24))
    shoulder_hours = df.index.hour.isin(range(7, 17))
    off_peak_hours = df.index.hour.isin(range(0, 7))

    # Apply tariffs to hour ranges
    df.loc[peak_hours, "cost_cents"] = df.loc[peak_hours, "energy_kwh"] * 28
    df.loc[shoulder_hours, "cost_cents"] = df.loc[shoulder_hours, "energy_kwh"] * 20
    df.loc[off_peak_hours, "cost_cents"] = df.loc[off_peak_hours, "energy_kwh"] * 12


In [23]:
apply_tariff_isin(df)

Best of 3 trials with 10 function calls per trial:
Function `apply_tariff_isin` ran in average of 0.006 seconds
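
<p>The mask-then-assign pattern can be seen on a three-row frame. This is a sketch with made-up values; only the peak band is priced, so the other rows stay empty.</p>

```python
import pandas as pd

index = pd.to_datetime(
    ["2013-01-01 06:00", "2013-01-01 07:00", "2013-01-01 17:00"]
)
small = pd.DataFrame({"energy_kwh": [1.0, 2.0, 3.0]}, index=index)

# One boolean per row: True where the hour falls in the peak band
peak = small.index.hour.isin(range(17, 24))

# A single assignment prices every peak row at once
small.loc[peak, "cost_cents"] = small.loc[peak, "energy_kwh"] * 28
```

<p>Rows outside the mask are left as <code>NaN</code> until their own band is assigned, which is why the function above covers all three bands.</p>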



<hr>
<h2><b>8. Can We Do Better?</b></h2>

In [24]:
@timeit(repeat=3, number=10)
def apply_tariff_cut(df):
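    # right=False gives half-open bins [0, 7), [7, 17), [17, 24),
    # matching the hour boundaries in apply_tariff()
    cents_per_kwh = pd.cut(x=df.index.hour,
                           bins=[0, 7, 17, 24],
                           right=False,
                           labels=[12, 20, 28]).astype(int)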
    df["cost_cents"] = cents_per_kwh * df["energy_kwh"]
    # return df

In [25]:
apply_tariff_cut(df)

Best of 3 trials with 10 function calls per trial:
Function `apply_tariff_cut` ran in average of 0.002 seconds
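
<p>Bin edges are easy to get wrong by one hour, so it is worth checking a binned version against the reference function. The sketch below assumes half-open bins (<code>right=False</code>) and compares the resulting rate for every hour of the day with <code>apply_tariff()</code>.</p>

```python
import numpy as np
import pandas as pd


def apply_tariff(kwh, hour):
    """Same tariff bands as the notebook's apply_tariff()."""
    if 0 <= hour < 7:
        rate = 12
    elif 7 <= hour < 17:
        rate = 20
    elif 17 <= hour < 24:
        rate = 28
    else:
        raise ValueError(f"Invalid hour: {hour}")
    return rate * kwh


hours = np.arange(24)

# Half-open bins: [0, 7), [7, 17), [17, 24)
rates_cut = pd.cut(hours,
                   bins=[0, 7, 17, 24],
                   right=False,
                   labels=[12, 20, 28]).astype(int)

# Reference rates from the scalar function, using 1 kWh per hour
rates_ref = np.array([apply_tariff(1, h) for h in hours])

same = np.array_equal(rates_cut, rates_ref)
```

<p>With the default <code>right=True</code>, hours 7 and 17 would fall into the lower band, because each bin would then include its right edge.</p>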



<hr>
<h2><b>9. Don't Forget NumPy!</b></h2>

In [26]:
@timeit(repeat=3, number=10)
def apply_tariff_digitize(df):
    prices = np.array([12, 20, 28])
    bins = np.digitize(df.index.hour.values, bins=[7, 17, 24])
    df["cost_cents"] = prices[bins] * df["energy_kwh"].values

In [27]:
apply_tariff_digitize(df)

Best of 3 trials with 10 function calls per trial:
Function `apply_tariff_digitize` ran in average of 0.001 seconds
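
<p>To make the indexing trick explicit: <code>np.digitize()</code> returns, for each hour, the position of the bin it falls into, and that position is then used to look up the price. A short sketch on the boundary hours:</p>

```python
import numpy as np

prices = np.array([12, 20, 28])
hours = np.array([0, 6, 7, 16, 17, 23])

# Index 0 for hour < 7, 1 for 7 <= hour < 17, 2 for 17 <= hour < 24
band = np.digitize(hours, bins=[7, 17, 24])

# Fancy indexing maps each band index to its price
rates = prices[band]
# band  -> [0, 0, 1, 1, 2, 2]
# rates -> [12, 12, 20, 20, 28, 28]
```

<p>Because the default is <code>right=False</code>, each bin includes its left edge, so hours 7 and 17 land in the higher band, consistent with <code>apply_tariff()</code>.</p>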



<hr>
<h2><b>10. Prevent Reprocessing with HDFStore</b></h2>


In [29]:
# Create storage object with filename `processed_data`
data_store = pd.HDFStore('processed_data.h5')

# Put DataFrame into the object setting the key as 'preprocessed_df'
data_store["preprocessed_df"] = df
data_store.close()

In [31]:
# Access data store
data_store = pd.HDFStore('processed_data.h5')

# Retrieve data using key
preprocessed_df = data_store["preprocessed_df"]
data_store.close()

<hr>
<h2><b>11. Conclusions</b></h2>