In [1]:
import yfinance as yf
import polars as pl
import polars_talib as plta
import pandas as pd
import talib.abstract as ta

### Download stock data and save as parquet

In [2]:
tickers = pl.scan_csv("nasdaq_screener.csv").sort("Market Cap", descending=True).filter(
    pl.col("Market Cap").is_not_null() & (pl.col("Market Cap") > 0)
).select(pl.col("Symbol").str.strip_chars(" "), pl.col("Market Cap")).collect()["Symbol"].to_list()

In [3]:
data = yf.download(tickers[:2000], start='2001-01-01', end='2024-12-31', group_by='ticker', threads=32).stack(level=0).reset_index()

Failed to get ticker 'BRK/B' reason: Expecting value: line 1 column 1 (char 0)
$EAI: possibly delisted; No price data found  (1d 2001-01-01 -> 2024-12-31)
Failed to get ticker 'BRK/A' reason: Expecting value: line 1 column 1 (char 0)
$EMP: possibly delisted; No price data found  (1d 2001-01-01 -> 2024-12-31)
[*********************100%%**********************]  2000 of 2000 completed

6 Failed downloads:
['SFB', 'DHCNL']: YFInvalidPeriodError("%ticker%: Period 'max' is invalid, must be one of ['1d', '5d']")
['BRK/B', 'BRK/A']: YFTzMissingError('$%ticker%: possibly delisted; No timezone found')
['EAI', 'EMP']: YFPricesMissingError('$%ticker%: possibly delisted; No price data found  (1d 2001-01-01 -> 2024-12-31)')
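The `BRK/B` and `BRK/A` failures are probably a symbol-format mismatch rather than a delisting: the screener writes share classes with a slash, while Yahoo Finance normally uses a hyphen (`BRK-B`). A minimal sketch of a normalization step, assuming that convention holds for every slash-style symbol; `to_yahoo_symbol` is a hypothetical helper, not part of yfinance.

```python
def to_yahoo_symbol(symbol: str) -> str:
    """Map a screener-style symbol to the form Yahoo is assumed to expect."""
    # Assumption: share classes written as "BRK/B" are listed as "BRK-B" on Yahoo.
    return symbol.strip().replace("/", "-")


screener_symbols = ["AAPL", "BRK/B", "BRK/A", " MSFT "]
yahoo_symbols = [to_yahoo_symbol(s) for s in screener_symbols]
print(yahoo_symbols)
```

Applying the helper to `tickers` before calling `yf.download` should recover the two Berkshire classes. The delisted symbols (`EAI`, `EMP`) would still fail.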


In [4]:
df = pl.from_pandas(data).with_columns(
    pl.col("Date").cast(pl.Date),
)

In [5]:
df.write_parquet("us_market_cap2000.parquet")

### Let's look at the syntax of polars_talib and check the performance

The code below selects every float column and converts its name to lowercase.
With lowercase names (`open`, `high`, `low`, `close`), polars_talib functions can fall back on their default input columns, much like talib's abstract API, so the column names do not have to be specified explicitly.

In [6]:
p = pl.scan_parquet("us_market_cap2000.parquet").select(
    pl.col("Date"), pl.col("Ticker").alias("Symbol"),
    pl.selectors.float().name.to_lowercase()
)

#### A simple example of how to use polars_talib
With the `over` syntax, SMA is applied to each symbol separately. The whole query, including reading the file, renaming the columns, and calculating the indicator, takes only 139 ms.

In [8]:
%%timeit
df = p.with_columns(
    plta.sma(timeperiod=5).over("Symbol").alias("sma5"),
).filter(
    pl.col("Symbol") == "NVDA"
).collect()

139 ms ± 5.76 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


With pandas, just reading the file and transforming the column names to lowercase takes 1.21 seconds.

In [9]:
%%timeit
df = pd.read_parquet("us_market_cap2000.parquet").set_index(["Ticker", "Date"]).rename(
    columns={c: c.lower() for c in ["Open", "High", "Low", "Close"]}
)

1.21 s ± 62.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [10]:
df = pd.read_parquet("us_market_cap2000.parquet").set_index(["Ticker", "Date"]).rename(
    columns={c: c.lower() for c in ["Open", "High", "Low", "Close"]}
)

There are two ways to perform the per-ticker calculation in pandas: `transform` and `apply`. `transform` is faster, so we use it whenever possible and fall back to `apply` only where `transform` cannot be used (for example, when an indicator such as STOCH needs several input columns). The results below show the difference in calculation speed.

In [11]:
%%timeit
df["sma5"] = df.groupby("Ticker")["close"].transform(lambda x: ta.SMA(x, timeperiod=5))

1.84 s ± 62.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [12]:
%%timeit
df["sma5"] = df.groupby("Ticker").apply(lambda x: ta.SMA(x, timeperiod=5)).droplevel(0)

3.15 s ± 56.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
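Part of the convenience of `transform` is that its result keeps the index of the input, so it can be assigned straight back to the frame, whereas the `apply` call above needs `droplevel(0)` to remove the group key first. A minimal sketch on made-up data, using a pandas rolling mean as a stand-in for `ta.SMA` so that it runs without talib:

```python
import pandas as pd

frame = pd.DataFrame(
    {
        "Ticker": ["A", "B", "A", "B", "A", "B"],
        "close": [1.0, 10.0, 2.0, 20.0, 3.0, 30.0],
    }
)

# transform returns a Series aligned with frame's own index,
# even though the rows of the two tickers are interleaved.
frame["sma2"] = frame.groupby("Ticker")["close"].transform(
    lambda s: s.rolling(2).mean()
)
print(frame)
```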


Performance Summary

    • pandas with transform: 1.84 seconds + 1.21 seconds = 3.05 seconds (22 times slower)
    • pandas with apply: 3.15 seconds + 1.21 seconds = 4.36 seconds (31 times slower)
    • polars with the `over` syntax, optimized by the lazy query planner: 0.139 seconds

For these operations, file reading included, polars is roughly 22 to 31 times faster than pandas.
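The ratios in the summary can be reproduced from the `%%timeit` means reported above. A short check, with variable names chosen only for this sketch:

```python
read_s = 1.21  # pandas: read parquet and rename columns
polars_s = 0.139  # polars: whole lazy query

pandas_totals = {
    "transform": 1.84 + read_s,
    "apply": 3.15 + read_s,
}
ratios = {name: total / polars_s for name, total in pandas_totals.items()}

for name, total in pandas_totals.items():
    print(f"pandas {name}: {total:.2f} s, {ratios[name]:.1f}x the polars time")
```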


Let’s explore different talib functions with varying inputs and outputs and compare their usage in polars versus pandas. Some functions have multiple outputs rather than a single series, and we will demonstrate how polars offers a consistent syntax for using these functions conveniently.

The query below, including reading the file, renaming the columns, and calculating all four indicators, takes only 135 milliseconds.

In [None]:
%%timeit
df = p.with_columns(
    # Single-output indicators produce an ordinary column.
    plta.sma(timeperiod=5).over("Symbol").alias("sma5"),
    # Multi-output indicators produce one struct column holding every output.
    plta.macd(fastperiod=10, slowperiod=20, signalperiod=5).over("Symbol").alias("macd"),
    plta.stoch(pl.col("high"), pl.col("low"), pl.col("close"), fastk_period=14, slowk_period=7, slowd_period=7).over("Symbol").alias("stoch"),
    plta.wclprice().over("Symbol").alias("wclprice"),
).with_columns(
    # Unpack the struct fields; the "macd" field replaces the "macd" struct column.
    pl.col("macd").struct.field("macd"),
    pl.col("macd").struct.field("macdsignal"),
    pl.col("macd").struct.field("macdhist"),
    pl.col("stoch").struct.field("slowk"),
    pl.col("stoch").struct.field("slowd"),
).select(
    pl.exclude("stoch")  # drop the stoch struct once its fields are unpacked
).filter(
    pl.col("Symbol") == "AAPL"
).collect()

135 ms ± 5.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


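In pandas, handling multiple outputs from talib functions takes more steps and mixes two syntaxes: single-input indicators go through `transform`, multi-input ones go through `apply`, and each output of a multi-output function is extracted in a separate call.

The cell below takes approximately 19.2 seconds, not counting the file read, which shows the cost of this approach.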

In [13]:
%%timeit
df["sma5"] = df.groupby("Ticker")["close"].transform(lambda x: ta.SMA(x, timeperiod=5))
df["macd"] = df.groupby("Ticker")["close"].transform(lambda x: ta.MACD(x, fastperiod=10, slowperiod=20, signalperiod=5)[0])
df["macdsignal"] = df.groupby("Ticker")["close"].transform(lambda x: ta.MACD(x, fastperiod=10, slowperiod=20, signalperiod=5)[1])
df["macdhist"] = df.groupby("Ticker")["close"].transform(lambda x: ta.MACD(x, fastperiod=10, slowperiod=20, signalperiod=5)[2])
df["slowk"] = df.groupby("Ticker").apply(lambda x: ta.STOCH(x, fastk_period=14, slowk_period=7, slowd_period=7)).droplevel(0)["slowk"] 
df["slowd"] = df.groupby("Ticker").apply(lambda x: ta.STOCH(x, fastk_period=14, slowk_period=7, slowd_period=7)).droplevel(0)["slowd"]
df["wclprice"] = df.groupby("Ticker").apply(lambda x: ta.WCLPRICE(x)).droplevel(0)

19.2 s ± 367 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
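Some of that time comes from calling each multi-output indicator once per extracted output. One possible way to call it only once per ticker is to return all outputs as a DataFrame and join them back on the index. This is a sketch only: `moving_averages` is a hypothetical stand-in for `ta.MACD`, the data is made up, and the approach was not timed in this notebook.

```python
import pandas as pd


def moving_averages(close: pd.Series) -> pd.DataFrame:
    """Hypothetical multi-output indicator standing in for ta.MACD."""
    return pd.DataFrame(
        {"fast": close.rolling(2).mean(), "slow": close.rolling(3).mean()},
        index=close.index,
    )


frame = pd.DataFrame(
    {
        "Ticker": ["A", "A", "A", "B", "B", "B"],
        "close": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    }
)

# One indicator call per ticker; each result keeps that group's original index,
# so the concatenated outputs line up with frame when joined.
parts = [moving_averages(group["close"]) for _, group in frame.groupby("Ticker")]
frame = frame.join(pd.concat(parts))
print(frame)
```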


In [23]:
df.loc["AAPL"]

Date,Adj Close,close,high,low,open,Volume,sma5,macd,macdsignal,macdhist,slowk,slowd,wclprice
2001-01-02,0.228412,0.265625,0.272321,0.260045,0.265625,452312000.0,,,,,,,0.265904
2001-01-03,0.251445,0.292411,0.297991,0.257813,0.258929,817073600.0,,,,,,,0.285157
2001-01-04,0.262002,0.304688,0.330357,0.300223,0.323940,739396000.0,,,,,,,0.309989
2001-01-05,0.251445,0.292411,0.310268,0.286830,0.302455,412356000.0,,,,,,,0.295480
2001-01-08,0.254324,0.295759,0.303292,0.284598,0.302455,373699200.0,0.290179,,,,,,0.294852
...,...,...,...,...,...,...,...,...,...,...,...,...,...
2024-05-29,190.289993,190.289993,192.250000,189.509995,189.610001,53068000.0,189.607999,3.279104,3.631761,-0.352657,79.339054,88.250096,190.584995
2024-05-30,191.289993,191.289993,192.179993,190.630005,190.759995,49947900.0,189.685999,3.173640,3.479054,-0.305414,77.642846,85.987470,191.347496
2024-05-31,192.250000,192.250000,192.570007,189.910004,191.440002,75158300.0,190.759998,3.123732,3.360613,-0.236881,76.612846,83.248118,191.745003
2024-06-03,194.029999,194.029999,194.990005,192.520004,192.899994,50080500.0,191.569998,3.186809,3.302678,-0.115870,77.248478,80.835460,193.892502


In [22]:
p.with_columns(
    plta.sma(timeperiod=5).over("Symbol").alias("sma5"),
    plta.macd(fastperiod=10, slowperiod=20, signalperiod=5).over("Symbol").alias("macd"),
    plta.stoch(pl.col("high"), pl.col("low"), pl.col("close"), fastk_period=14, slowk_period=7, slowd_period=7).over("Symbol").alias("stoch"),
    plta.wclprice().over("Symbol").alias("wclprice"),
).with_columns(
    pl.col("macd").struct.field("macd"),
    pl.col("macd").struct.field("macdsignal"),
    pl.col("macd").struct.field("macdhist"),
    pl.col("stoch").struct.field("slowk"),
    pl.col("stoch").struct.field("slowd"),
).select(
    pl.exclude("stoch")
).filter(
    pl.col("Symbol") == "AAPL"
).collect()

Date,Symbol,adj close,close,high,low,open,volume,sma5,macd,wclprice,macdsignal,macdhist,slowk,slowd
date,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
2001-01-02,"""AAPL""",0.228412,0.265625,0.272321,0.260045,0.265625,4.52312e8,,,0.265904,,,,
2001-01-03,"""AAPL""",0.251445,0.292411,0.297991,0.257813,0.258929,8.170736e8,,,0.285157,,,,
2001-01-04,"""AAPL""",0.262002,0.304688,0.330357,0.300223,0.32394,7.39396e8,,,0.309989,,,,
2001-01-05,"""AAPL""",0.251445,0.292411,0.310268,0.28683,0.302455,4.12356e8,,,0.29548,,,,
2001-01-08,"""AAPL""",0.254324,0.295759,0.303292,0.284598,0.302455,3.736992e8,0.290179,,0.294852,,,,
…,…,…,…,…,…,…,…,…,…,…,…,…,…,…
2024-05-29,"""AAPL""",190.289993,190.289993,192.25,189.509995,189.610001,5.3068e7,189.607999,3.279104,190.584995,3.631761,-0.352657,79.339054,88.250096
2024-05-30,"""AAPL""",191.289993,191.289993,192.179993,190.630005,190.759995,4.99479e7,189.685999,3.17364,191.347496,3.479054,-0.305414,77.642846,85.98747
2024-05-31,"""AAPL""",192.25,192.25,192.570007,189.910004,191.440002,7.51583e7,190.759998,3.123732,191.745003,3.360613,-0.236881,76.612846,83.248118
2024-06-03,"""AAPL""",194.029999,194.029999,194.990005,192.520004,192.899994,5.00805e7,191.569998,3.186809,193.892502,3.302678,-0.11587,77.248478,80.83546


In [24]:
(1.2 + 19.2) * 1000 / 135

151.11111111111111

The performance comparison between polars and pandas shows a significant speed difference. Here’s the detailed comparison:

	• pandas with transform and apply:
		• Reading the file: 1.2 seconds
		• Performing calculations: 19.2 seconds
		• Total time: 1.2 seconds + 19.2 seconds = 20.4 seconds
	• polars with over syntax and optimized by query plan:
		• Total time: 0.135 seconds

Improvement Factor

The speed improvement factor can be calculated as follows:
$$\text{Improvement Factor} = \frac{20.4 \text{ s} \times 1000 \text{ ms/s}}{135 \text{ ms}} \approx 151$$

Thus, polars is approximately 151 times faster than pandas for these operations, including reading the file and performing the calculations.

For these types of calculations, polars therefore offers both a large performance advantage over pandas and a consistent, streamlined syntax that reduces confusion and makes the code more maintainable.