## Exercise 05: Pandas optimizations

Import Libraries

In [14]:
import pandas as pd
import gc

pd.options.display.float_format = "{:.2f}".format  # float precision setting

* read the fines.csv that you saved in the previous exercise

In [15]:
fines_file_path = "../data/fines.csv"

df = pd.read_csv(filepath_or_buffer=fines_file_path, sep=",")
display(df)

,CarNumber,Refund,Fines,Make,Model,Year
0,Y163O8161RUS,2,3200.00,Ford,Focus,2009
1,E432XX77RUS,1,6500.00,Toyota,Camry,2010
2,7184TT36RUS,1,2100.00,Ford,Focus,1990
3,X582HE161RUS,2,2000.00,Ford,Focus,2004
4,92918M178RUS,1,5700.00,Ford,Focus,1990
...,...,...,...,...,...,...
925,NEW001,1,4373.09,Ford,Fiesta,2012
926,NEW002,2,3034.48,Toyota,Corolla,2008
927,NEW003,3,4288.84,Honda,Civic,2012
928,NEW004,1,4738.31,Chevrolet,Impala,2004


* iterations: in each of the following subtasks, calculate Fines / Refund * Year for each row, store the result in a new column, and measure the time using the magic command %%timeit in the cell
  * loop: write a function that iterates through the dataframe using for i in range(0, len(df)) and iloc, collects each result in a list with append(), and returns the list; assign the result of the function to a new column in the dataframe

In [16]:
%%timeit


def make_calculations(df) -> list[float]:
    # Column positions used with iloc: 1 = Refund, 2 = Fines, 5 = Year
    result_values = []
    for i in range(0, len(df)):
        result_values.append(df.iloc[i, 2] / df.iloc[i, 1] * df.iloc[i, 5])
    return result_values


df["Calculations"] = make_calculations(df)

113 ms ± 446 μs per loop (mean ± std. dev. of 7 runs, 1 loop each)


  * do it using iterrows()

In [17]:
%%timeit


def make_calculations(df) -> list[float]:
    result_values = []
    for _, row in df.iterrows():  # the index label is not needed here
        result_values.append(row["Fines"] / row["Refund"] * row["Year"])
    return result_values


df["Calculations"] = make_calculations(df)

95.4 ms ± 1.49 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


  * do it using apply() and lambda function

In [18]:
%%timeit

df["Calculations"] = df.apply(
    lambda row: row["Fines"] / row["Refund"] * row["Year"], axis=1
)

16.8 ms ± 2.36 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


  * do it using Series objects from the dataframe

In [19]:
%%timeit

df["Calculations"] = df["Fines"] / df["Refund"] * df["Year"]

219 μs ± 4.05 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


  * do it as in the previous subtask but with the .values attribute, which exposes the underlying NumPy arrays

In [20]:
%%timeit

df["Calculations"] = df["Fines"].values / df["Refund"].values * df["Year"].values

108 μs ± 729 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
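
The timings are only comparable if all five variants compute the same thing. The following is a minimal sketch of such a check, assuming a small hand-made frame (`sample` is hypothetical and only reuses the column names and a few values from the displayed rows, not the real fines.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with the same column names as fines.csv.
sample = pd.DataFrame(
    {
        "Refund": [2, 1, 1, 2],
        "Fines": [3200.0, 6500.0, 2100.0, 2000.0],
        "Year": [2009, 2010, 1990, 2004],
    }
)

# 1. Positional loop with iloc (0 = Refund, 1 = Fines, 2 = Year in this sample).
loop_result = []
for i in range(len(sample)):
    loop_result.append(sample.iloc[i, 1] / sample.iloc[i, 0] * sample.iloc[i, 2])

# 2. iterrows()
iterrows_result = []
for _, row in sample.iterrows():
    iterrows_result.append(row["Fines"] / row["Refund"] * row["Year"])

# 3. apply() with a lambda
apply_result = sample.apply(
    lambda row: row["Fines"] / row["Refund"] * row["Year"], axis=1
)

# 4. Series arithmetic
series_result = sample["Fines"] / sample["Refund"] * sample["Year"]

# 5. NumPy arrays via .values
values_result = (
    sample["Fines"].values / sample["Refund"].values * sample["Year"].values
)

# Compare every variant against the Series result.
reference = series_result.to_numpy()
for candidate in (loop_result, iterrows_result, apply_result, values_result):
    assert np.allclose(np.asarray(candidate, dtype=float), reference)
```

The first two variants run a Python-level loop over the rows, while the last two hand whole columns to vectorized operations, which is consistent with the gap between the millisecond and microsecond timings above.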


In [21]:
display(df)

,CarNumber,Refund,Fines,Make,Model,Year,Calculations
0,Y163O8161RUS,2,3200.00,Ford,Focus,2009,3214400.00
1,E432XX77RUS,1,6500.00,Toyota,Camry,2010,13065000.00
2,7184TT36RUS,1,2100.00,Ford,Focus,1990,4179000.00
3,X582HE161RUS,2,2000.00,Ford,Focus,2004,2004000.00
4,92918M178RUS,1,5700.00,Ford,Focus,1990,11343000.00
...,...,...,...,...,...,...,...
925,NEW001,1,4373.09,Ford,Fiesta,2012,8798663.16
926,NEW002,2,3034.48,Toyota,Corolla,2008,3046617.40
927,NEW003,3,4288.84,Honda,Civic,2012,2876384.17
928,NEW004,1,4738.31,Chevrolet,Impala,2004,9495572.75


* indexing: measure the time using the magic command %%timeit in the cell
  * get a row for a specific CarNumber, for example, 'O136HO197RUS'
  * set the index in your dataframe with CarNumber
  * again, get a row for the same CarNumber

In [22]:
%%timeit

searched_row_without_index = df[df["CarNumber"] == "O136HO197RUS"]

332 μs ± 1.71 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [23]:
df_indexed = df.set_index("CarNumber")
display(df_indexed)

CarNumber,Refund,Fines,Make,Model,Year,Calculations
Y163O8161RUS,2,3200.00,Ford,Focus,2009,3214400.00
E432XX77RUS,1,6500.00,Toyota,Camry,2010,13065000.00
7184TT36RUS,1,2100.00,Ford,Focus,1990,4179000.00
X582HE161RUS,2,2000.00,Ford,Focus,2004,2004000.00
92918M178RUS,1,5700.00,Ford,Focus,1990,11343000.00
...,...,...,...,...,...,...
NEW001,1,4373.09,Ford,Fiesta,2012,8798663.16
NEW002,2,3034.48,Toyota,Corolla,2008,3046617.40
NEW003,3,4288.84,Honda,Civic,2012,2876384.17
NEW004,1,4738.31,Chevrolet,Impala,2004,9495572.75


In [24]:
%%timeit

searched_row_with_index = df_indexed.loc["O136HO197RUS"]  # look up the row by its index label

158 μs ± 2.28 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
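
The two lookups differ in more than speed. A minimal sketch, assuming a tiny hypothetical frame (`sample`, built from three of the plate numbers displayed above), of how the results differ in type and in how a missing value is handled:

```python
import pandas as pd

# Hypothetical sample data; not the real fines.csv.
sample = pd.DataFrame(
    {
        "CarNumber": ["Y163O8161RUS", "E432XX77RUS", "7184TT36RUS"],
        "Fines": [3200.0, 6500.0, 2100.0],
    }
)

# Boolean mask: always returns a DataFrame, possibly empty.
by_mask = sample[sample["CarNumber"] == "E432XX77RUS"]

# Label lookup on the index: returns a Series when the label is unique.
indexed = sample.set_index("CarNumber")
by_label = indexed.loc["E432XX77RUS"]

print(type(by_mask).__name__, type(by_label).__name__)
print("index is unique:", indexed.index.is_unique)

# A value that is absent gives an empty frame with the mask,
# but raises KeyError with .loc.
missing_by_mask = sample[sample["CarNumber"] == "NOT_A_PLATE"]
try:
    missing_by_label = indexed.loc["NOT_A_PLATE"]
except KeyError:
    missing_by_label = None
```

If the same CarNumber appears in several rows, `.loc` returns a DataFrame with all of them instead of a Series, so checking `index.is_unique` first can avoid surprises.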


* downcasting:
  * run df.info(memory_usage='deep'), pay attention to the Dtype and the memory usage
  * make a copy() of your initial dataframe into another dataframe called optimized

In [25]:
df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 930 entries, 0 to 929
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   CarNumber     930 non-null    object 
 1   Refund        930 non-null    int64  
 2   Fines         930 non-null    float64
 3   Make          930 non-null    object 
 4   Model         919 non-null    object 
 5   Year          930 non-null    int64  
 6   Calculations  930 non-null    float64
dtypes: float64(2), int64(2), object(3)
memory usage: 182.1 KB
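
`df.info()` reports only the total. To see which columns are worth downcasting, a per-column breakdown can help. A minimal sketch, assuming a small hypothetical frame (`sample`) rather than the real data:

```python
import pandas as pd

# Hypothetical sample data; not the real fines.csv.
sample = pd.DataFrame(
    {
        "CarNumber": ["Y163O8161RUS", "E432XX77RUS", "7184TT36RUS", "X582HE161RUS"],
        "Make": ["Ford", "Toyota", "Ford", "Ford"],
        "Refund": [2, 1, 1, 2],
    }
)

# deep=True also counts the string payloads, not just the column buffers.
per_column = sample.memory_usage(deep=True, index=False)
print(per_column.sort_values(ascending=False))
```

String columns usually dominate such a breakdown, which is why they are the first candidates for a more compact dtype.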


In [26]:
dtype_dict = {
    "CarNumber": "category",
    "Refund": "int8",
    "Fines": "float32",
    "Make": "category",
    "Model": "category",
    "Year": "int16",
    "Calculations": "float32",
}

optimized_df = df.copy()

for column, dtype in dtype_dict.items():
    optimized_df[column] = optimized_df[column].astype(dtype)


optimized_df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 930 entries, 0 to 929
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype   
---  ------        --------------  -----   
 0   CarNumber     930 non-null    category
 1   Refund        930 non-null    int8    
 2   Fines         930 non-null    float32 
 3   Make          930 non-null    category
 4   Model         919 non-null    category
 5   Year          930 non-null    int16   
 6   Calculations  930 non-null    float32 
dtypes: category(3), float32(2), int16(1), int8(1)
memory usage: 63.8 KB
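
The dtype mapping above is chosen by hand. As an alternative, `pd.to_numeric` with the `downcast` argument picks the smallest numeric dtype that fits the data. A minimal sketch, assuming a small hypothetical frame (`sample`) rather than the real data:

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

# Hypothetical sample data; not the real fines.csv.
sample = pd.DataFrame(
    {
        "Refund": [2, 1, 1, 2],
        "Fines": [3200.0, 6500.0, 2100.0, 2000.0],
        "Year": [2009, 2010, 1990, 2004],
        "Make": ["Ford", "Toyota", "Ford", "Ford"],
    }
)

downcast = sample.copy()
for column in downcast.columns:
    if is_integer_dtype(downcast[column]):
        # Smallest signed integer type that holds every value.
        downcast[column] = pd.to_numeric(downcast[column], downcast="integer")
    elif is_float_dtype(downcast[column]):
        # float32 is the smallest float type to_numeric will choose.
        downcast[column] = pd.to_numeric(downcast[column], downcast="float")

# Repeated strings are stored once as categories plus small integer codes.
downcast["Make"] = downcast["Make"].astype("category")

print(downcast.dtypes)

# float32 keeps about 7 significant digits, so a value like 8798663.16
# (taken from the Calculations column above) loses its fractional part.
rounded = float(np.float32(8798663.16))
print(rounded)
```

The last lines point at a trade-off: casting Calculations to float32 halves its size but rounds large values, so float64 may be the safer choice when the cents matter.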
