<div style="text-align:center;font-size:22pt; font-weight:bold;color:white;border:solid black 1.5pt;background-color:#1e7263;">
    Timing Reading Plain Text Data  <br> 
    Comparison between Pandas & Polars
</div>

In [1]:
# ============================================================
#                                                            =
#             Title: Polars Vs Pandas                        =
#                    Timing Reading Plain Text Data          =
#             ---------------------------------              =
#                                                            =
#             Author: Dr. Saad Laouadi                       =
#                                                            =
#             Copyright: Dr. Saad Laouadi                    =
# ============================================================
#                                                            =
#                       LICENSE                              =
#             ----------------------                         =
#                                                            =
#             This material is intended for educational      =
#             purposes only and may not be used directly in  =
#             courses, video recordings, or similar          =
#             without prior consent from the author.         =
#             When using or referencing this material,       =
#             proper credit must be attributed to the        =
#             author.                                        =
# ============================================================

In [1]:
# Adds the 'scripts' directory to the Python path
import sys
sys.path.append('../../scripts/')  

In [8]:
# import the working libraries
from importlibs import *
from pathlib import Path

### Machine Characteristics

Check your machine's characteristics with `psutil` first.

In [6]:
import psutil

# Get CPU information
cpu_count = psutil.cpu_count(logical=False)
cpu_count_logical = psutil.cpu_count()
# cpu_freq = psutil.cpu_freq()

# RAM information
ram = psutil.virtual_memory()

print(f"{'Physical CPU cores':<20}: {cpu_count}")
print(f"{'Logical CPU cores':<20}: {cpu_count_logical}")

# if cpu_freq:
    # print(f"{'CPU Frequency':<20}: {cpu_freq.current:.2f}MHz")
    
print(f"{'Total RAM':<20}: {ram.total / (1024 ** 3):.2f} GB")
print(f"{'Available RAM':<20}: {ram.available / (1024 ** 3):.2f} GB")
print(f"{'Used RAM':<20}: {ram.used / (1024 ** 3):.2f} GB")

Physical CPU cores  : 16
Logical CPU cores   : 16
Total RAM           : 128.00 GB
Available RAM       : 89.14 GB
Used RAM            : 37.67 GB


In [9]:
# Data path settings
DATA_ROOT = Path("../../datasets").resolve()
EARTHQUAKE_DATA = DATA_ROOT / "earthquake.csv"
CITY_TEMPERATURE = DATA_ROOT / "city_temperature.csv"

# print(DATA_ROOT)
# print(EARTHQUAKE_DATA)

## Timing Reading Data Using `time` Module

- In this section, we use the `time()` function from the `time` module, a simple way to time multi-line code: record a timestamp before and after the code runs, then subtract the two.
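The start/stop pattern can also be wrapped in a small reusable helper. The sketch below is one possible way to do it, not part of the original notebook: `stopwatch` and `timings` are hypothetical names, and it uses `time.perf_counter()` (a monotonic, high-resolution clock from the standard library) rather than `time.time()`. The workload is a stand-in so the block runs without any dataset.

```python
import time
from contextlib import contextmanager


@contextmanager
def stopwatch(label, results):
    """Time the enclosed block and store the elapsed seconds in results[label].

    `stopwatch` and `results` are hypothetical names used for illustration.
    """
    start = time.perf_counter()  # monotonic clock, unaffected by system clock changes
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


timings = {}
with stopwatch("demo", timings):
    # Stand-in workload; in this notebook it would be pd.read_csv(EARTHQUAKE_DATA)
    total = sum(i * i for i in range(100_000))

print(f"Time taken: {timings['demo']:.7f} seconds")
```

In the notebook, the body of the `with` block would hold the `read_csv` call, and the same dictionary could collect both the Pandas and the Polars timings.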

### Pandas 

In [10]:
# Record the start time
start_time = time.time()

# Read the CSV file
data = pd.read_csv(EARTHQUAKE_DATA)

# Record the end time
end_time = time.time()

# Calculate the execution time
pd_execution_time = end_time - start_time

print(f"Time taken to read the CSV file: {pd_execution_time:.7f} seconds")

Time taken to read the CSV file: 0.0179400 seconds


### Polars

In [11]:
# Record the start time
start_time = time.time()

# Read the CSV file
data = pl.read_csv(EARTHQUAKE_DATA)

# Record the end time
end_time = time.time()

# Calculate the execution time
pl_execution_time = end_time - start_time

print(f"Time taken to read the CSV file: {pl_execution_time:.7f} seconds")

Time taken to read the CSV file: 0.0075309 seconds


### Time Difference Between Pandas and Polars


In [12]:
print(f"The execution time difference: {pd_execution_time - pl_execution_time}")

The execution time difference: 0.010409116744995117
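An absolute difference in seconds is hard to interpret on its own, so a ratio can be a useful complement. The following sketch computes it from the two single-run timings printed above; `speedup` is a hypothetical helper name, and because each number comes from a single run, the result is only indicative.

```python
def speedup(baseline_seconds, candidate_seconds):
    """Return how many times faster the candidate is than the baseline."""
    if candidate_seconds <= 0:
        raise ValueError("candidate_seconds must be positive")
    return baseline_seconds / candidate_seconds


# Values copied from the Pandas and Polars runs above
pd_execution_time = 0.0179400
pl_execution_time = 0.0075309

ratio = speedup(pd_execution_time, pl_execution_time)
print(f"Polars was about {ratio:.1f}x faster than Pandas on this single run")
```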


## Basic Timing with `%%time` Cell Magic Command

- We first use `%%time`, the Jupyter Notebook cell magic command that measures the execution time of an entire cell.

- It reports the time taken to run all the statements in the cell once, which gives a simple way to assess the performance of a block of code.

- The line magic `%time` times a single statement instead, but my personal preference is to use `%%time`.
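The output of `%%time` has two parts: CPU times and wall time. As a rough illustration of the difference (an approximation, not how IPython implements the magic), the sketch below measures both with the standard library. `wall_and_cpu` is a hypothetical helper name. `time.process_time()` counts CPU time across all threads of the process, so a multithreaded reader can show more CPU time than wall time, while a call that mostly waits shows the opposite.

```python
import time


def wall_and_cpu(func):
    """Run func once and return (wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # elapsed real time
    cpu_start = time.process_time()    # user + system CPU time of this process
    func()
    cpu_elapsed = time.process_time() - cpu_start
    wall_elapsed = time.perf_counter() - wall_start
    return wall_elapsed, cpu_elapsed


# Stand-in workload: sleeping takes wall time but almost no CPU time
wall, cpu = wall_and_cpu(lambda: time.sleep(0.05))
print(f"Wall time: {wall * 1000:.1f} ms")
print(f"CPU time : {cpu * 1000:.1f} ms")
```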

### Pandas 

In [27]:
%%time
%%capture
pd.read_csv(EARTHQUAKE_DATA);

CPU times: user 22.5 ms, sys: 3.05 ms, total: 25.6 ms
Wall time: 23.9 ms


### Polars

In [28]:
%%time
%%capture

pl.read_csv(EARTHQUAKE_DATA);

CPU times: user 17.1 ms, sys: 5.52 ms, total: 22.6 ms
Wall time: 9.96 ms


## Advanced Timing with `%%timeit` Magic Command

- A better approach to performance assessment is `%%timeit`, another cell magic command. It runs the entire cell many times and reports the mean execution time per loop along with its standard deviation, which makes the result less sensitive to one-off fluctuations.
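Outside a notebook, a similar measurement can be sketched with the standard `timeit` module. This is an approximation under stated assumptions rather than a reproduction of the magic command: `time_call` is a hypothetical helper, the `repeat` and `number` values are fixed by hand (the magic picks the loop count automatically), and the workload is a stand-in for the `read_csv` call.

```python
import statistics
import timeit


def time_call(func, repeat=7, number=100):
    """Return (mean, stdev) of per-call seconds over `repeat` runs of `number` loops."""
    totals = timeit.repeat(func, repeat=repeat, number=number)
    per_loop = [t / number for t in totals]  # each total covers `number` calls
    return statistics.mean(per_loop), statistics.stdev(per_loop)


# Stand-in workload; in this notebook it would be
# lambda: pd.read_csv(EARTHQUAKE_DATA)
mean_s, std_s = time_call(lambda: sorted(range(1_000, 0, -1)))
print(f"{mean_s * 1e6:.1f} µs ± {std_s * 1e6:.1f} µs per loop "
      f"(mean ± std. dev. of 7 runs, 100 loops each)")
```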

### Pandas

In [29]:
%%timeit
pd.read_csv(EARTHQUAKE_DATA);

5.41 ms ± 19.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


### Polars

In [16]:
%%timeit
pl.read_csv(EARTHQUAKE_DATA);

1.2 ms ± 5.52 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


## Timing Reading a Larger Dataset

### Pandas

In [17]:
# Record the start time
start_time = time.time()

# Read the CSV file
data = pd.read_csv(CITY_TEMPERATURE, low_memory=False)

# Record the end time
end_time = time.time()

# Calculate the execution time
pd_execution_time = end_time - start_time

print(f"Time taken to read the CSV file: {pd_execution_time:.7f} seconds")

Time taken to read the CSV file: 0.6896589 seconds


### Polars 

In [18]:
# Record the start time
start_time = time.time()

# Read the CSV file
data = pl.read_csv(CITY_TEMPERATURE, low_memory=False)

# Record the end time
end_time = time.time()

# Calculate the execution time
pl_execution_time = end_time - start_time

print(f"Time taken to read the CSV file: {pl_execution_time:.7f} seconds")

Time taken to read the CSV file: 0.0854189 seconds


In [19]:
print(f"The execution time difference: {pd_execution_time - pl_execution_time}")

The execution time difference: 0.6042399406433105


## Timing Reading Larger Dataset with `%%time`

### Pandas

In [31]:
%%time
%%capture

pd.read_csv(CITY_TEMPERATURE, low_memory=False);

CPU times: user 603 ms, sys: 116 ms, total: 719 ms
Wall time: 727 ms


### Polars 

In [33]:
%%time
%%capture

pl.read_csv(CITY_TEMPERATURE, low_memory=False);

CPU times: user 503 ms, sys: 31.2 ms, total: 534 ms
Wall time: 54.4 ms


## Timing Reading Larger Dataset with `%%timeit`

### Pandas

In [34]:
%%timeit
%%capture

pd.read_csv(CITY_TEMPERATURE, low_memory=False);

691 ms ± 17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Polars

In [35]:
%%timeit
%%capture

pl.read_csv(CITY_TEMPERATURE, low_memory=False);

40 ms ± 976 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


## Timing Reading Data with `PyArrow` Engine for Pandas

In [25]:
%%timeit
%%capture
pd.read_csv(CITY_TEMPERATURE, engine="pyarrow");

124 ms ± 537 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Changing the Number of Threads for Polars

- You can change the number of physical threads used when reading the data with the `n_threads` argument. By default, Polars uses all the available physical threads.

- In this example, we loop over thread counts from 1 up to the number of physical CPU cores in the machine.

In [26]:
for n_thread in range(1, cpu_count+1):
    print(f"Timing with {n_thread} thread(s):")
    %timeit -o pl.read_csv(CITY_TEMPERATURE, n_threads=n_thread)
    print()

Timing with 1 thread(s):
312 ms ± 4.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Timing with 2 thread(s):
168 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Timing with 3 thread(s):
128 ms ± 492 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Timing with 4 thread(s):
101 ms ± 2.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Timing with 5 thread(s):
83.2 ms ± 2.48 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Timing with 6 thread(s):
73.5 ms ± 301 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Timing with 7 thread(s):
60.8 ms ± 375 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Timing with 8 thread(s):
54.5 ms ± 242 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Timing with 9 thread(s):
49.5 ms ± 303 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Timing with 10 thread(s):
46.4 ms ± 349 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Timing with 11 thread(s):
45.4 ms ± 250 µs
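To see how well the read scales, the mean timings shown above can be turned into a speedup and a parallel efficiency (speedup divided by thread count). The sketch below uses only the values visible in the output for 1 to 11 threads; `scaling_table` is a hypothetical helper name.

```python
# Mean timings in milliseconds, copied from the output above (threads 1 to 11)
timings_ms = {
    1: 312, 2: 168, 3: 128, 4: 101, 5: 83.2, 6: 73.5,
    7: 60.8, 8: 54.5, 9: 49.5, 10: 46.4, 11: 45.4,
}


def scaling_table(timings):
    """Return {n_threads: (speedup, efficiency)} relative to the 1-thread run."""
    baseline = timings[1]
    table = {}
    for n_threads, elapsed in timings.items():
        speedup = baseline / elapsed
        table[n_threads] = (speedup, speedup / n_threads)
    return table


table = scaling_table(timings_ms)
for n_threads, (speedup, efficiency) in table.items():
    print(f"{n_threads:>2} thread(s): speedup {speedup:4.2f}x, "
          f"efficiency {efficiency:.0%}")
```

Efficiency drops as threads are added, which suggests diminishing returns: each extra thread still helps, but by less than the previous one.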

### Get help

We used two cell magic commands (`%%time` and `%%timeit`) and the `read_csv()` functions of Pandas and Polars. You can learn more about each of them from its documentation:


```python
# Check magic commands documentation
%time?
%timeit?

# Check pd.read_csv docs
help(pd.read_csv)            # assuming you imported pandas as pd

# Check pl.read_csv docs
help(pl.read_csv)            # assuming you imported polars as pl
```