Dataset to use:
https://www.kaggle.com/datasets/paultimothymooney/stock-market-data

A site to get technical indicators from:
https://www.investopedia.com/articles/active-trading/011815/top-technical-indicators-rookie-traders.asp?utm_source=chatgpt.com

Plan:
- Find indicators to use on a single stock
- Find indicators to use on the whole index (simpler ones, such as mean and std on the latest data point and over windows of 5, 10, and 20), for example:
    - Market index price
    - Market index trading volume
- Create a new dataset that uses these indicators
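
A minimal sketch of the index-level features from the plan above, using plain pandas rolling windows of 5, 10, and 20. The frame is synthetic, and the column names (`index_close`, `index_volume`) and the helper `add_rolling_stats` are hypothetical names chosen for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a market index series (not real data).
rng = np.random.default_rng(0)
dates = pd.bdate_range("2022-01-03", periods=60)
index_df = pd.DataFrame(
    {
        "index_close": 100 + rng.normal(0, 1, size=60).cumsum(),
        "index_volume": rng.integers(1_000_000, 5_000_000, size=60),
    },
    index=dates,
)


def add_rolling_stats(frame, columns, windows=(5, 10, 20)):
    """Append the rolling mean and std of each column for each window length."""
    out = frame.copy()
    for col in columns:
        for w in windows:
            roll = out[col].rolling(window=w)
            out[f"{col}_mean_{w}"] = roll.mean()
            out[f"{col}_std_{w}"] = roll.std()
    return out


index_features = add_rolling_stats(index_df, ["index_close", "index_volume"])
print(index_features.tail())
```

Each window leaves its first `w - 1` rows as NaN, so those rows may need to be dropped or filled before the features are used.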

Library to use: pandas-ta
- ships with 130 ready-made indicators (use about 50 of them)


Chat for inspiration:
https://chatgpt.com/share/681e1a2a-a944-8013-834e-2ef66c01417b

# Dataset Creation

In this notebook we download our data from Kaggle and then augment it with classic technical analysis indicators.
We will use pandas-ta, an extension of pandas for technical analysis.

It is a numerical time series feature generator geared towards financial market data; typical input has columns named "open", "high", "low", "close", and "volume".

In [8]:
# import kagglehub

# # Download latest version
# path = kagglehub.dataset_download("paultimothymooney/stock-market-data")

# print("Path to dataset files:", path)

## Basic Guide

In [1]:
import numpy as np
# NumPy 2.0 removed the np.NaN alias; re-add it so pandas_ta can be imported
if not hasattr(np, "NaN"):
    np.NaN = np.nan     # Needed by pandas_ta

import pandas as pd
import pandas_ta as ta

df = pd.DataFrame() # Empty DataFrame

# Load data
df = pd.read_csv("data/nasdaq/csv/AAL.csv", sep=",")
# OR if you have yfinance installed
# df = df.ta.ticker("aapl")



In [2]:
# Print DataFrame structure
print(df.info())
print(df.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4333 entries, 0 to 4332
Data columns (total 7 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Date            4333 non-null   object 
 1   Low             4333 non-null   float64
 2   Open            4333 non-null   float64
 3   Volume          4333 non-null   int64  
 4   High            4333 non-null   float64
 5   Close           4333 non-null   float64
 6   Adjusted Close  4333 non-null   float64
dtypes: float64(5), int64(1), object(1)
memory usage: 237.1+ KB
None
         Date        Low       Open   Volume       High      Close  \
0  27-09-2005  19.100000  21.049999   961200  21.400000  19.299999   
1  28-09-2005  19.200001  19.299999  5747900  20.530001  20.500000   
2  29-09-2005  20.100000  20.400000  1078200  20.580000  20.209999   
3  30-09-2005  20.180000  20.260000  3123300  21.049999  21.010000   
4  03-10-2005  20.900000  20.900000  1057900  21.750000  21.500000   
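
The Date values printed above are day-first (DD-MM-YYYY), so they are easy to misread as month-first once the day is 12 or lower. The sketch below shows how an explicit format settles the question; the second string is an illustrative value in the same layout, not a row copied from the file.

```python
import pandas as pd

# Two date strings in the CSV's DD-MM-YYYY layout.
raw = pd.Series(["27-09-2005", "06-12-2022"])

# An explicit format removes the ambiguity: 06-12-2022 is 6 December 2022.
parsed = pd.to_datetime(raw, format="%d-%m-%Y")
print(parsed.dt.strftime("%Y-%m-%d").tolist())

# Month-first parsing of the same string yields a different date (12 June 2022).
month_first = pd.to_datetime("06-12-2022", format="%m-%d-%Y")
print(month_first.strftime("%Y-%m-%d"))
```

Passing `format="%d-%m-%Y"` when building the index keeps the rows in chronological order.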


In [3]:
# VWAP requires the DataFrame index to be a DatetimeIndex.
# The "Date" column is day-first (DD-MM-YYYY, e.g. 27-09-2005), so parse it with an
# explicit format; otherwise ambiguous dates such as 06-12-2022 are read month-first.
df.set_index(pd.DatetimeIndex(pd.to_datetime(df["Date"], format="%d-%m-%Y")), inplace=True)

# Drop the "Date" column now that it is the index
df.drop(columns=["Date"], inplace=True)

# Calculate returns and append them to df.
# This computes the cumulative log return and cumulative percent return
#  from the "Close" column using the pandas-ta indicators
df.ta.log_return(cumulative=True, append=True)
df.ta.percent_return(cumulative=True, append=True)

# New columns with results
df.columns

# Take a peek
df.tail()

# vv Continue Post Processing vv

              Low   Open    Volume   High   Close  Adjusted Close  CUMLOGRET_1  CUMPCTRET_1
Date
2022-12-06  14.0   14.14  20781500  14.34  14.33           14.33     -0.29775    -0.257513
2022-12-07  13.53  14.24  28161400  14.24  13.55           13.55    -0.353718    -0.297927
2022-12-08  13.38  13.65  25300900  13.8   13.6            13.6     -0.350035    -0.295337
2022-12-09  13.42  13.52  18489800  13.66  13.53           13.53    -0.355196    -0.298964
2022-12-12  13.45  13.49   8048550  13.93  13.915          13.915   -0.327138    -0.279016
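
The two return columns are linked: the cumulative percent return is the price relative to the first close minus one, and the cumulative log return is the logarithm of that same ratio. The sketch below checks this with plain pandas and NumPy on the first five closes from the head() output, then on the first and last closes from the recorded outputs. It illustrates the relationship that the recorded values satisfy and is not meant as a description of pandas-ta internals.

```python
import numpy as np
import pandas as pd

# First five closing prices from the head() output.
close = pd.Series([19.299999, 20.500000, 20.209999, 21.010000, 21.500000])

# Cumulative log return: running sum of per-period log returns,
# which equals log(close_t / close_0).
cum_log_ret = np.log(close / close.shift(1)).cumsum()

# Cumulative percent return relative to the first close.
cum_pct_ret = close / close.iloc[0] - 1

# The two are linked by exp(cum_log_ret) - 1 == cum_pct_ret.
print(np.allclose(np.exp(cum_log_ret.iloc[1:]) - 1, cum_pct_ret.iloc[1:]))

# Same formulas applied to the first close and the last recorded close.
first_close, last_close = 19.299999, 13.915
last_pct = last_close / first_close - 1
last_log = float(np.log(last_close / first_close))
print(round(last_pct, 4), round(last_log, 4))
```

The last line gives values close to the CUMPCTRET_1 and CUMLOGRET_1 entries recorded in the final row of the tail.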


In [4]:
# Help for the 'ta' DataFrame extension
# help(df.ta)

# List of all indicators
df.ta.indicators()

# # Help about an indicator such as bbands
# help(ta.bbands)

Pandas TA - Technical Analysis Indicators - v0.3.14b0
Total Indicators & Utilities: 205
Abbreviations:
    aberration, above, above_value, accbands, ad, adosc, adx, alma, amat, ao, aobv, apo, aroon, atr, bbands, below, below_value, bias, bop, brar, cci, cdl_pattern, cdl_z, cfo, cg, chop, cksp, cmf, cmo, coppock, cross, cross_value, cti, decay, decreasing, dema, dm, donchian, dpo, ebsw, efi, ema, entropy, eom, er, eri, fisher, fwma, ha, hilo, hl2, hlc3, hma, hwc, hwma, ichimoku, increasing, inertia, jma, kama, kc, kdj, kst, kurtosis, kvo, linreg, log_return, long_run, macd, mad, massi, mcgd, median, mfi, midpoint, midprice, mom, natr, nvi, obv, ohlc4, pdist, percent_return, pgo, ppo, psar, psl, pvi, pvo, pvol, pvr, pvt, pwma, qqe, qstick, quantile, rma, roc, rsi, rsx, rvgi, rvi, short_run, sinwma, skew, slope, sma, smi, squeeze, squeeze_pro, ssf, stc, stdev, stoch, stochrsi, supertrend, swma, t3, td_seq, tema, thermo, tos_stdevall, trima, trix, true_range, tsi, tsignals, ttm_trend, ui, 

pandas-ta reports 205 indicators and utilities in total (v0.3.14b0). To enrich our dataset, we will use all available indicators as additional features.

pandas-ta offers pre-packaged 'strategies', each bundling a set of indicators.

Two popular ones are:
- ta.CommonStrategy  (a small predefined set: price SMAs of 10, 20, 50, and 200 and a 20-period volume SMA)
- ta.AllStrategy     (all the indicators with their default settings; typically used for feature generation)

We will focus on ta.AllStrategy.

In [5]:
# print dataset size
print("Dataset size:", df.shape)

Dataset size: (4333, 8)


In [None]:
# Common Strategy
print("Common Strategy")
print(ta.CommonStrategy)

for indicator in ta.CommonStrategy.ta:
    print(indicator)


# All Strategy
print("All Strategy")
print(ta.AllStrategy)
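# While CommonStrategy is a fixed subset of indicators,
#  AllStrategy dynamically builds its list from all indicators currently available.
#  It is a special case.

# for indicator in ta.AllStrategy.ta:
# NOTE: calculate, chunks, and processors appear to be forwarded to every indicator
#  as keyword arguments (see "Indicator arguments" in the output below) rather than
#  being treated as strategy() options, so they may not control batching or workers.
df.ta.strategy(
    ta.AllStrategy,
    calculate=True,
    verbose=True,
    chunks=4,        # intended: 4 batches
    processors=4     # intended: 4 workers
)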

Common Strategy
Strategy(name='Common Price and Volume SMAs', ta=[{'kind': 'sma', 'length': 10}, {'kind': 'sma', 'length': 20}, {'kind': 'sma', 'length': 50}, {'kind': 'sma', 'length': 200}, {'kind': 'sma', 'close': 'volume', 'length': 20, 'prefix': 'VOL'}], description='Common Price SMAs: 10, 20, 50, 200 and Volume SMA: 20.', created='Monday May 12, 2025, NYSE: 12:19:54, Local: 16:19:54 W. Europe Summer Time, Day 132/365 (36.00%)')
{'kind': 'sma', 'length': 10}
{'kind': 'sma', 'length': 20}
{'kind': 'sma', 'length': 50}
{'kind': 'sma', 'length': 200}
{'kind': 'sma', 'close': 'volume', 'length': 20, 'prefix': 'VOL'}
All Strategy
Strategy(name='All', ta=None, description='All the indicators with their default settings. Pandas TA default.', created='Monday May 12, 2025, NYSE: 12:19:54, Local: 16:19:54 W. Europe Summer Time, Day 132/365 (36.00%)')
[+] Strategy: All
[i] Indicator arguments: {'calculate': True, 'chunks': 4, 'processors': 4, 'append': True}
[i] Excluded[12]: above, above_val

TODO: The strategy calculation is incredibly slow; find a way to speed it up or select a small subset.
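
One possible way to select a small subset is a custom strategy built from the same list-of-dicts format that `ta.CommonStrategy` prints. The sketch below is an assumption-laden starting point: the helper names `build_indicator_specs` and `apply_subset` are hypothetical, the chosen indicators and parameters are only illustrative, and whether this is fast enough has not been measured here.

```python
def build_indicator_specs():
    """Return a small hand-picked indicator list in pandas-ta's dict format."""
    return [
        {"kind": "sma", "length": 20},
        {"kind": "ema", "length": 20},
        {"kind": "rsi", "length": 14},
        {"kind": "macd", "fast": 12, "slow": 26, "signal": 9},
        {"kind": "bbands", "length": 20},
        {"kind": "atr", "length": 14},
        {"kind": "obv"},
    ]


def apply_subset(df, specs):
    """Run only the listed indicators on df (requires pandas-ta to be installed)."""
    import numpy as np
    if not hasattr(np, "NaN"):
        np.NaN = np.nan  # same shim as in the import cell
    import pandas_ta as ta

    subset = ta.Strategy(name="Hand-picked subset", ta=specs)
    df.ta.strategy(subset)  # appends the indicator columns to df
    return df


specs = build_indicator_specs()
print(len(specs), "indicators:", [s["kind"] for s in specs])
```

Calling `apply_subset(df, specs)` on the loaded DataFrame would then compute only these indicators instead of the full set.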

In [None]:
# Add indicators to the DataFrame

In [None]:
# Calculate the Common Strategy indicators and print the top of the DataFrame

# Apply the CommonStrategy to the DataFrame
df.ta.strategy(
    ta.CommonStrategy,
    calculate=True,
    verbose=True
)

# Print the first 5 rows to check the new columns
print(df.head())

[+] Strategy: Common Price and Volume SMAs
[i] Indicator arguments: {'calculate': True, 'append': True}
[i] Multiprocessing 5 indicators with 15 chunks and 16/16 cpus.
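
For reference, the five indicators in the strategy definition can be approximated with plain pandas rolling means. This is a sketch on synthetic data, assuming a simple moving average is an unweighted rolling mean; the output column names (`SMA_10`, `VOL_SMA_20`, and so on) are assumed to mirror pandas-ta's naming and are not verified here.

```python
import numpy as np
import pandas as pd

# Synthetic OHLCV-like frame standing in for the real CSV (not real data).
rng = np.random.default_rng(1)
n = 250
demo = pd.DataFrame(
    {
        "Close": 20 + rng.normal(0, 0.3, size=n).cumsum(),
        "Volume": rng.integers(1_000_000, 30_000_000, size=n),
    },
    index=pd.bdate_range("2021-01-04", periods=n),
)

# Price SMAs at the four lengths listed in the strategy definition.
for length in (10, 20, 50, 200):
    demo[f"SMA_{length}"] = demo["Close"].rolling(length).mean()

# Volume SMA, using the "VOL" prefix from the strategy definition.
demo["VOL_SMA_20"] = demo["Volume"].rolling(20).mean()

print(demo.tail())
```

In this sketch the first `length - 1` rows of each SMA are NaN, so `tail()` is more informative than `head()` when checking the new columns.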
