<div style="text-align:center;font-size:22pt; font-weight:bold;color:white;border:solid black 1.5pt;background-color:#1e7263;">
    Timing Reading Plain Text Data  <br> 
    Comparison between Pandas & Polars
</div>

In [1]:
# ============================================================
#                                                            =
#             Title: Polars Vs Pandas                        =
#                    Timing Reading Plain Text Data          =
#             ---------------------------------              =
#                                                            =
#             Author: Dr. Saad Laouadi                       =
#                                                            =
#             Copyright: Dr. Saad Laouadi                    =
# ============================================================
#                                                            =
#                       LICENSE                              =
#             ----------------------                         =
#                                                            =
#             This material is intended for educational      =
#             purposes only and may not be used directly in  =
#             courses, video recordings, or similar          =
#             without prior consent from the author.         =
#             When using or referencing this material,       =
#             proper credit must be attributed to the        =
#             author.                                        =
# ============================================================

In [2]:
# Adds the 'scripts' directory to the Python path
import sys
sys.path.append('../../scripts/')  

In [3]:
# import the working libraries
from importlibs import *

******************************************
          The imported libs are:          
******************************************
polars version is :     0.20.2
pandas version is :      2.1.4
numpy version is  :     1.26.2
pyarrow version is:     14.0.2
******************************************
The imported builtin modules are:
['os', 'sys', 'pathlib', 'time', 'shutil', 're']
**************************************************************
The python executable path is:
 /usr/local/Caskroom/mambaforge/base/envs/plenv/bin/python3.12
**************************************************************
Important Reminder:
Before proceeding, please ensure that you have activated the appropriate virtual environment for this project.
This step is crucial to maintain consistent dependencies and project settings.


### Machine Characteristics

Check your machine's characteristics with `psutil` first, since timing results depend on the hardware.

In [4]:
import psutil

# Get CPU information
cpu_count = psutil.cpu_count(logical=False)
cpu_count_logical = psutil.cpu_count()
cpu_freq = psutil.cpu_freq()

# RAM information
ram = psutil.virtual_memory()

print(f"{'Physical CPU cores':<20}: {cpu_count}")
print(f"{'Logical CPU cores':<20}: {cpu_count_logical}")

if cpu_freq:
    print(f"{'CPU Frequency':<20}: {cpu_freq.current:.2f}MHz")
    
print(f"{'Total RAM':<20}: {ram.total / (1024 ** 3):.2f} GB")
print(f"{'Available RAM':<20}: {ram.available / (1024 ** 3):.2f} GB")
print(f"{'Used RAM':<20}: {ram.used / (1024 ** 3):.2f} GB")

Physical CPU cores  : 6
Logical CPU cores   : 12
CPU Frequency       : 2600.00MHz
Total RAM           : 16.00 GB
Available RAM       : 8.68 GB
Used RAM            : 5.48 GB


In [5]:
# Data path settings
DATA_ROOT = Path("../../datasets").resolve()
EARTHQUAKE_DATA = DATA_ROOT / "earthquake.csv"
CITY_TEMPERATURE = DATA_ROOT / "city_temperature.csv"

# print(DATA_ROOT)
# print(EARTHQUAKE_DATA)

## Timing Reading Data Using `time` Module

- In this section we use the `time()` function from the `time` module, a very simple way to time multi-line code: record the time before and after the block, then subtract.
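The same start/stop pattern can be wrapped in a small helper. The sketch below is an illustration rather than part of the original notebook: the `timer` name is hypothetical, and it uses `time.perf_counter()`, a monotonic high-resolution clock from the standard library, in place of `time.time()`. The workload is a stand-in for the `read_csv` calls so the block runs without any dataset.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(label):
    """Time the enclosed block and print the elapsed wall-clock time."""
    result = {"label": label, "seconds": None}
    start = time.perf_counter()
    try:
        yield result
    finally:
        # Runs even if the timed block raises, so the timing is always recorded
        result["seconds"] = time.perf_counter() - start
        print(f"{label}: {result['seconds']:.7f} seconds")


# Stand-in workload; in this notebook it would be pd.read_csv(...) or pl.read_csv(...)
with timer("Summing one million integers") as t:
    total = sum(range(1_000_000))
```

After the block exits, `t["seconds"]` holds the elapsed time, so it can be stored and compared just like `pd_execution_time` and `pl_execution_time`.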

### Pandas 

In [6]:
# Record the start time
start_time = time.time()

# Read the CSV file
data = pd.read_csv(EARTHQUAKE_DATA)

# Record the end time
end_time = time.time()

# Calculate the execution time
pd_execution_time = end_time - start_time

print(f"Time taken to read the CSV file: {pd_execution_time:.7f} seconds")

Time taken to read the CSV file: 0.0283918 seconds


### With Polars

In [7]:
# Record the start time
start_time = time.time()

# Read the CSV file
data = pl.read_csv(EARTHQUAKE_DATA)

# Record the end time
end_time = time.time()

# Calculate the execution time
pl_execution_time = end_time - start_time

print(f"Time taken to read the CSV file: {pl_execution_time:.7f} seconds")

Time taken to read the CSV file: 0.0072021 seconds


### Time Difference Between Pandas and Polars


In [8]:
print(f"The execution time difference: {pd_execution_time - pl_execution_time}")

The execution time difference: 0.02118968963623047
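An absolute difference is hard to interpret without knowing the baseline, so a ratio can be a useful complement. The sketch below hard-codes the two timings printed above under hypothetical variable names; your own numbers will differ from run to run.

```python
# Single-run timings copied from the two cells above, in seconds
pandas_seconds = 0.0283918
polars_seconds = 0.0072021

# Ratio of the two wall-clock times: how many times faster Polars was
speedup = pandas_seconds / polars_seconds
print(f"Polars was about {speedup:.1f}x faster than pandas on this single run")
```

A single measurement like this is noisy, which is one reason to repeat the timing several times.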


## Basic Timing with `%%time` Cell Magic Command

- We first use `%%time`, the Jupyter Notebook cell magic command that measures the execution time of an entire cell in the notebook.

- This cell magic command reports the time taken to run all the statements in the cell, providing a simple way to assess the performance of a block of code.


- Note that the line magic `%time` can be used instead to time a single statement, but my personal preference is `%%time`.

### Pandas 

In [9]:
%%time
pd.read_csv(EARTHQUAKE_DATA);

CPU times: user 19.7 ms, sys: 5.17 ms, total: 24.9 ms
Wall time: 23.1 ms


### Polars

In [10]:
%%time
pl.read_csv(EARTHQUAKE_DATA);

CPU times: user 14.9 ms, sys: 7.73 ms, total: 22.6 ms
Wall time: 4.4 ms


## Advanced Timing with `%%timeit` Magic Command

- A better approach to performance assessment is `%%timeit`, also a cell magic command, which runs the entire cell multiple times and reports the mean and standard deviation of the execution time.
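Outside Jupyter, a similar measurement can be approximated with the standard-library `timeit` module. The sketch below is an assumption-laden illustration, not what `%%timeit` runs internally: it writes a small throwaway CSV (hypothetical data), uses the stdlib `csv` reader as a stand-in for `pd.read_csv` / `pl.read_csv`, and fixes the loop and run counts by hand instead of choosing them automatically.

```python
import csv
import os
import statistics
import tempfile
import timeit

# Build a small throwaway CSV so the example runs anywhere (hypothetical data)
with tempfile.NamedTemporaryFile("w", suffix=".csv", newline="", delete=False) as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["id", "value"])
    writer.writerows((i, i * 0.5) for i in range(10_000))
    csv_path = tmp.name


def read_rows():
    """Stand-in for pd.read_csv / pl.read_csv: parse the file with the stdlib."""
    with open(csv_path, newline="") as handle:
        return list(csv.reader(handle))


loops, runs = 10, 7
# Each entry is the total time for `loops` calls; divide to get the per-loop time
totals = timeit.repeat(read_rows, number=loops, repeat=runs)
per_loop = [total / loops for total in totals]

print(
    f"{statistics.mean(per_loop) * 1e3:.2f} ms ± "
    f"{statistics.pstdev(per_loop) * 1e3:.2f} ms per loop "
    f"(mean ± std. dev. of {runs} runs, {loops} loops each)"
)

# Clean up the temporary file
os.remove(csv_path)
```

Repeating the measurement smooths out one-off effects such as a cold file cache, which a single `%%time` run cannot separate from the real cost.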

### Pandas

In [11]:
%%timeit
pd.read_csv(EARTHQUAKE_DATA);

16.6 ms ± 1.37 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)


### Polars

In [12]:
%%timeit
pl.read_csv(EARTHQUAKE_DATA);

1.97 ms ± 236 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## Timing Reading a Larger Dataset

### Pandas

In [13]:
# Record the start time
start_time = time.time()

# Read the CSV file
data = pd.read_csv(CITY_TEMPERATURE, low_memory=False)

# Record the end time
end_time = time.time()

# Calculate the execution time
pd_execution_time = end_time - start_time

print(f"Time taken to read the CSV file: {pd_execution_time:.7f} seconds")

Time taken to read the CSV file: 2.1000221 seconds


### Polars 

In [14]:
# Record the start time
start_time = time.time()

# Read the CSV file
data = pl.read_csv(CITY_TEMPERATURE, low_memory=False)

# Record the end time
end_time = time.time()

# Calculate the execution time
pl_execution_time = end_time - start_time

print(f"Time taken to read the CSV file: {pl_execution_time:.7f} seconds")

Time taken to read the CSV file: 0.2532461 seconds


In [15]:
print(f"The execution time difference: {pd_execution_time - pl_execution_time}")

The execution time difference: 1.846776008605957


## Timing Reading Larger Dataset with `%%time`

### Pandas

In [16]:
%%time
pd.read_csv(CITY_TEMPERATURE, low_memory=False);

CPU times: user 1.74 s, sys: 364 ms, total: 2.11 s
Wall time: 2.11 s


### Polars 

In [17]:
%%time
pl.read_csv(CITY_TEMPERATURE, low_memory=False);

CPU times: user 1.16 s, sys: 230 ms, total: 1.39 s
Wall time: 201 ms


## Timing Reading Larger Dataset with `%%timeit`

### Pandas

In [18]:
%%timeit
pd.read_csv(CITY_TEMPERATURE, low_memory=False);

1.98 s ± 65.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Polars

In [19]:
%%timeit
pl.read_csv(CITY_TEMPERATURE, low_memory=False);

209 ms ± 3.52 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


## Timing Reading Data with `PyArrow` Engine for Pandas

In [20]:
%%timeit
pd.read_csv(CITY_TEMPERATURE, engine="pyarrow");

370 ms ± 7.02 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
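To put the three `%%timeit` results for the larger dataset side by side, the sketch below hard-codes the means reported above (the dictionary name and labels are illustrative) and expresses each one relative to the fastest reader. Note that the PyArrow cell was run without `low_memory=False`, so the calls are not strictly identical.

```python
# Mean %%timeit results for city_temperature.csv, copied from the cells above (ms)
reader_ms = {
    "pandas (default engine)": 1980,
    "pandas (engine='pyarrow')": 370,
    "polars": 209,
}

fastest_ms = min(reader_ms.values())

# Print the readers from fastest to slowest, relative to the fastest one
for reader, ms in sorted(reader_ms.items(), key=lambda item: item[1]):
    print(f"{reader:<26}: {ms:>5} ms  ({ms / fastest_ms:.1f}x the fastest)")
```

These figures come from one machine and one file; treat the ratios as indicative rather than general.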


### Changing the Number of Threads for Polars

- You can change the number of threads used when reading the data with the `n_threads` argument. By default, Polars uses all the available physical cores.

- In this example, we loop from one thread up to the number of physical CPU cores on the machine (`cpu_count`, computed earlier with `psutil`).

In [21]:
for n_thread in range(1, cpu_count+1):
    print(f"Timing with {n_thread} thread(s):")
    %timeit -o pl.read_csv(CITY_TEMPERATURE, n_threads=n_thread)
    print()

Timing with 1 thread(s):
660 ms ± 3.65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Timing with 2 thread(s):
459 ms ± 24.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Timing with 3 thread(s):
424 ms ± 3.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Timing with 4 thread(s):
336 ms ± 14.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Timing with 5 thread(s):
264 ms ± 3.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Timing with 6 thread(s):
286 ms ± 7.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
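The output above can be summarised as a speedup over the single-threaded read and a per-thread efficiency. The sketch below hard-codes the means from this run under a hypothetical dictionary name; it is simple arithmetic on the reported numbers, not a new measurement.

```python
# Mean %timeit results from the loop above, in milliseconds, keyed by n_threads
thread_ms = {1: 660, 2: 459, 3: 424, 4: 336, 5: 264, 6: 286}

baseline_ms = thread_ms[1]
for n_threads, ms in thread_ms.items():
    speedup = baseline_ms / ms           # relative to the single-threaded read
    efficiency = speedup / n_threads     # 100% would mean perfect linear scaling
    print(
        f"{n_threads} thread(s): {ms:>4} ms  "
        f"speedup {speedup:.2f}x  efficiency {efficiency:.0%}"
    )

best_n_threads = min(thread_ms, key=thread_ms.get)
print(f"Fastest setting in this run: n_threads={best_n_threads}")
```

In this run the gains flatten out: five threads gave the fastest read, and the sixth did not improve on it.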



### Get help

We used the `%%time`, `%%timeit`, and `%timeit` magic commands together with the `read_csv()` functions from pandas and Polars. You can get more information about them by checking their documentation:


```python
# Check magic commands documentation
%time?
%timeit?

# Check pd.read_csv docs
help(pd.read_csv)            # assuming you imported pandas as pd

# Check pl.read_csv docs
help(pl.read_csv)            # assuming you imported polars as pl
```