<h1>Pandas optimizations</h1>

In [2]:
import pandas as pd
import gc

<h2>Read the fines.csv file that you saved in the previous exercise</h2>

In [48]:
df = pd.read_csv("../../data/fines.csv")

<h2>Iterations: in each of the following subtasks, calculate fines / refund * year for every row, store the result in a new column, and measure the time using the magic command %%timeit in the cell</h2>

In [None]:
def loop(df):
    l = []
    for i in range(0, len(df)):
        value = df.iloc[i]["Fines"] / df.iloc[i]["Refund"] * df.iloc[i]["Year"]
        l.append(value)
    return l

%timeit df["CalculatedData"] = loop(df)
        

30.8 ms ± 118 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [None]:
def iterrows(df):
    l = []
    for row in df.iterrows():
        value = row[1]["Fines"] / row[1]["Refund"] * row[1]["Year"]
        l.append(value)
    return l
%timeit df["CalculatedData"] = iterrows(df)

10.6 ms ± 69.7 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [26]:
%timeit df["CalculatedData"] = df.apply(lambda row: row["Fines"] / row["Refund"] * row["Year"], axis=1)

3.15 ms ± 263 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [29]:
%timeit df["CalculatedData"] = df["Fines"] / df["Refund"] * df["Year"]

54.3 μs ± 1.62 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [37]:
%timeit df["CalculatedData"] = df["Fines"].values / df["Refund"].values * df["Year"].values

28.6 μs ± 148 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
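
The timings above only make a fair comparison if every approach produces the same column. The following is a minimal sketch that checks this on a small made-up frame; the name `sample` and its values are hypothetical stand-ins for fines.csv, and `to_numpy()` is used in place of `.values` for the array-based variant.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for fines.csv; the values are made up for illustration.
sample = pd.DataFrame({
    "Fines": [3000.0, 500.0, 12000.0, 8600.0],
    "Refund": [2, 1, 2, 1],
    "Year": [2010, 1998, 2015, 2003],
})

# Row-wise: apply calls the lambda once per row.
by_apply = sample.apply(
    lambda row: row["Fines"] / row["Refund"] * row["Year"], axis=1
)

# Vectorized on Series: whole-column arithmetic, aligned on the index.
by_series = sample["Fines"] / sample["Refund"] * sample["Year"]

# Vectorized on NumPy arrays: the same arithmetic without index alignment.
by_numpy = (
    sample["Fines"].to_numpy()
    / sample["Refund"].to_numpy()
    * sample["Year"].to_numpy()
)

# True when all three approaches agree element by element.
same_result = bool(
    np.allclose(by_apply.to_numpy(), by_numpy)
    and np.allclose(by_series.to_numpy(), by_numpy)
)
print(same_result)
```

The array-based variant skips index alignment, which is probably why it comes out fastest in the measurements above; it is only safe when all columns come from the same frame and share the same row order.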


<h2>Indexing: measure the time using the magic command %%timeit in the cell</h2>

In [49]:
%timeit df[df["CarNumber"] == "Y163O8161RUS"]
df.set_index("CarNumber", inplace=True)
%timeit df.loc["Y163O8161RUS"]


79.7 μs ± 2.27 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
38.4 μs ± 1.04 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
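
The two lookups measured above can be compared on a tiny example. This is a sketch under assumptions: the frame `cars` and its plate numbers are hypothetical, and it uses `set_index` without `inplace=True` so the original frame is left untouched.

```python
import pandas as pd

# Hypothetical data; one car number appears twice on purpose.
cars = pd.DataFrame({
    "CarNumber": ["A001AA", "B002BB", "A001AA", "C003CC"],
    "Fines": [100.0, 250.0, 75.0, 900.0],
})

# Boolean mask: compares every value in the column against the target.
by_mask = cars[cars["CarNumber"] == "A001AA"]

# Label lookup: build the index once, then select by label with .loc.
indexed = cars.set_index("CarNumber")
by_label = indexed.loc["A001AA"]

# Both selections contain the same rows.
print(len(by_mask), len(by_label))
```

Building the index has its own cost, so the label lookup tends to pay off when the same frame is queried repeatedly rather than once.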


<h2>Downcasting</h2>

In [50]:
df.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
Index: 930 entries, Y163O8161RUS to Q912S2345RUS
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Refund  930 non-null    int64  
 1   Fines   930 non-null    float64
 2   Make    930 non-null    object 
 3   Model   918 non-null    object 
 4   Year    930 non-null    int64  
dtypes: float64(1), int64(2), object(2)
memory usage: 207.0 KB


In [61]:
optimized = df.copy()
optimized["Fines"] = optimized["Fines"].astype("float32")
optimized["Refund"] = pd.to_numeric(optimized["Refund"], downcast="integer")
optimized["Year"] = pd.to_numeric(optimized["Year"], downcast="integer")
optimized.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Index: 930 entries, Y163O8161RUS to Q912S2345RUS
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   Refund  930 non-null    int8    
 1   Fines   930 non-null    float32 
 2   Make    930 non-null    category
 3   Model   918 non-null    object  
 4   Year    930 non-null    int16   
dtypes: category(1), float32(1), int16(1), int8(1), object(1)
memory usage: 144.6 KB
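
To see how `pd.to_numeric(..., downcast="integer")` picks a target dtype, here is a small sketch on made-up numbers. The frame `numbers` is hypothetical, and the integer columns are created as `int64` explicitly so the starting point does not depend on the platform.

```python
import pandas as pd

# Hypothetical data; dtypes are set explicitly to mirror the original frame.
numbers = pd.DataFrame({
    "Refund": pd.Series([1, 2, 2, 1], dtype="int64"),
    "Year": pd.Series([1980, 2005, 2019, 1998], dtype="int64"),
    "Fines": pd.Series([3000.0, 500.5, 12000.0, 8600.0], dtype="float64"),
})

small = numbers.copy()
# downcast="integer" selects the smallest signed integer type that fits the values.
small["Refund"] = pd.to_numeric(small["Refund"], downcast="integer")
small["Year"] = pd.to_numeric(small["Year"], downcast="integer")
# float32 halves the storage but keeps only about 7 significant digits.
small["Fines"] = small["Fines"].astype("float32")

dtypes = small.dtypes.astype(str).to_dict()
bytes_before = int(numbers.memory_usage(index=False).sum())
bytes_after = int(small.memory_usage(index=False).sum())
print(dtypes, bytes_before, bytes_after)
```

Whether the precision loss from `float32` is acceptable depends on the data; for amounts with many significant digits it may be safer to keep `float64`.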


<h2>Categories</h2>

In [58]:
optimized["Make"] = optimized["Make"].astype("category")
optimized["Model"] = optimized["Model"].astype("category")
optimized.info(memory_usage='deep')


<class 'pandas.core.frame.DataFrame'>
Index: 930 entries, Y163O8161RUS to Q912S2345RUS
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype   
---  ------  --------------  -----   
 0   Refund  930 non-null    int8    
 1   Fines   930 non-null    float32 
 2   Make    930 non-null    category
 3   Model   918 non-null    category
 4   Year    930 non-null    int16   
dtypes: category(2), float32(1), int16(1), int8(1)
memory usage: 97.5 KB
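
The saving from `astype("category")` comes from storing each distinct string once and replacing the column with small integer codes. A minimal sketch, assuming a hypothetical column `makes` with only three distinct values:

```python
import pandas as pd

# Hypothetical column: 1000 strings drawn from three distinct makes.
makes = pd.Series(["Ford", "Toyota", "Ford", "Skoda"] * 250, dtype="object")

as_category = makes.astype("category")

# deep=True counts the Python string objects, not just the pointers to them.
object_bytes = int(makes.memory_usage(index=False, deep=True))
category_bytes = int(as_category.memory_usage(index=False, deep=True))

# The distinct values are stored once; each row holds only an integer code.
categories = as_category.cat.categories.tolist()
codes_dtype = str(as_category.cat.codes.dtype)
print(object_bytes, category_bytes, categories, codes_dtype)
```

The benefit shrinks as the number of distinct values approaches the number of rows, so the conversion is mainly worthwhile for low-cardinality columns such as Make and Model here.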


<h2>Memory clean</h2>

In [65]:
%reset_selective df
gc.collect()

31
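
Outside IPython, where `%reset_selective` is not available, a roughly equivalent cleanup is to delete the name and then run the collector. The sketch below assumes a hypothetical frame `frame` and uses a weak reference (`tracker`) only to observe whether the object has actually been released.

```python
import gc
import weakref

import pandas as pd

# Hypothetical frame standing in for the original df.
frame = pd.DataFrame({"Fines": [100.0, 250.0, 75.0]})

# A weak reference lets us observe the object without keeping it alive.
tracker = weakref.ref(frame)

# Drop the only strong reference, then ask the collector to clean up cycles.
del frame
unreachable = gc.collect()  # number of unreachable objects found

frame_released = tracker() is None
print(frame_released)
```

The integer returned by `gc.collect()`, like the 31 shown above, is the number of unreachable objects the collector found, not a count of freed bytes.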