## Welcome to Time Series Forecasting in Finance!

By the end of this notebook, you will understand the importance of data processing and how basic mathematical models can be successful in financial markets.

### Data Processing

We will start by exploring data processing steps such as:

- Data cleaning
- Tick to Bar Conversion

These stages are essential to ensure the data you work with is accurate and relevant for your model.



In [11]:
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.subplots as sp
import plotly.graph_objs as go

import optuna


import sys
sys.path.append('../utils/')
from preprocess import tick_to_dollar_bar, clean_and_filter_data
from primary_model import generate_cusum_events, triple_barrier_method, generate_trading_signals_ma

### Loading and Cleaning HBAR (Hedera Hashgraph) Dataset

In this section, we will perform the following tasks:
1. Load the minute resolution dataset of HBAR (Hedera Hashgraph) provided in the repository
2. Clean the data and perform any other tasks necessary before the data transformation steps

First, we load the data and check that it is properly formatted.

In [12]:
ohlcv_data = pd.read_csv("../datasets/hbar_data.csv", parse_dates=['datetime'])
# Preview the first rows; the dataset holds the datetime, close, and volume columns
ohlcv_data.head()

Unnamed: 0,datetime,close,volume
0,2020-05-15 06:01:00,0.0378,25222.0
1,2020-05-15 06:02:00,0.0378,6531.0
2,2020-05-15 06:03:00,0.0379,17315.1
3,2020-05-15 06:04:00,0.0379,0.0
4,2020-05-15 06:05:00,0.0381,13543.5
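
Beyond eyeballing `head()`, a few programmatic checks can confirm that the data is properly formatted. The snippet below is a minimal sketch: `check_formatting` is a hypothetical helper (it is not part of the repository's `preprocess` module), and `sample` is a hand-typed stand-in for the first rows shown above.

```python
import pandas as pd

# Hand-typed stand-in for the first rows of the dataset (illustration only)
sample = pd.DataFrame({
    "datetime": pd.to_datetime([
        "2020-05-15 06:01:00", "2020-05-15 06:02:00",
        "2020-05-15 06:03:00", "2020-05-15 06:04:00",
    ]),
    "close": [0.0378, 0.0378, 0.0379, 0.0379],
    "volume": [25222.0, 6531.0, 17315.1, 0.0],
})

def check_formatting(df):
    """Hypothetical helper: basic sanity checks on a datetime/close/volume frame."""
    return {
        "datetime_is_parsed": pd.api.types.is_datetime64_any_dtype(df["datetime"]),
        "sorted_by_time": bool(df["datetime"].is_monotonic_increasing),
        "no_duplicate_timestamps": not df["datetime"].duplicated().any(),
        "no_missing_values": not df[["close", "volume"]].isna().any().any(),
        "prices_positive": bool((df["close"] > 0).all()),
        "zero_volume_rows": int((df["volume"] == 0).sum()),
    }

checks = check_formatting(sample)
print(checks)
```

Rows with zero volume (such as the 06:04 minute above) are not necessarily errors, but they are worth counting because they contribute nothing to volume-based bars.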


The columns are `datetime`, `close`, and `volume`.

Traditional financial data also includes the open, high, and low prices, but for the sake of this tutorial, close suffices.

Next, we visualize the price action to decide how the data should be processed.

In [13]:
import plotly.graph_objs as go

# Resample the data to daily frequency
ohlcv_data_daily = ohlcv_data.resample('D', on='datetime').agg({'close': 'mean', 'volume': 'sum'}).reset_index()

# Create a plot with two y-axes
fig = go.Figure()

# Add the 'close' column in the original data using Plotly
fig.add_trace(go.Scatter(x=ohlcv_data_daily['datetime'],
                         y=ohlcv_data_daily['close'],
                         name='Price',
                         yaxis='y1'))

# Add volume bars to the plot
#fig.add_trace(go.Bar(x=ohlcv_data_daily['datetime'],
#                     y=ohlcv_data_daily['volume'],
#                     name='Volume',
#                     yaxis='y2',
#                     marker=dict(color='rgba(0, 0, 0, 0.2)')))

# Update layout to create two separate y-axes with a logarithmic scale for the volume axis
fig.update_layout(
    yaxis=dict(title='Price (USD)', side='left', showgrid=False),
    yaxis2=dict(title='Volume', side='right', overlaying='y', showgrid=False, type='log'),
    xaxis=dict(title='Date'),
    legend=dict(orientation='h', yanchor='bottom', y=1.02, xanchor='right', x=1)
)

# Display the plot
fig.show()

It looks like January 2021 to June 2022 was an abnormally volatile period for the market.

The market has settled down considerably since then.

The volatile data does not reflect today's market, so it's best to remove it.

In [14]:
ohlcv_data_cleaned = clean_and_filter_data(ohlcv_data, ['2021-01-01', '2022-06-01'])
ohlcv_data_cleaned_daily = ohlcv_data_cleaned.resample('D', on='datetime').agg({'close': 'mean', 'volume': 'sum'}).reset_index()



# Create subplots
fig = sp.make_subplots(rows=1, cols=2, subplot_titles=("Original Data", "Cleaned Data"))

# Plot the 'close' column in the original data using Plotly
fig.add_trace(go.Scatter(x=ohlcv_data_daily['datetime'], y=ohlcv_data_daily['close'], name='Original Data'), row=1, col=1)

# Plot the 'close' column in the cleaned data using Plotly
fig.add_trace(go.Scatter(x=ohlcv_data_cleaned_daily['datetime'], y=ohlcv_data_cleaned_daily['close'], name='Cleaned Data'), row=1, col=2)

# Update layout
fig.update_layout(showlegend=False, xaxis_title='Date', xaxis2_title='Date', yaxis_title='Price (USD)')

# Display the plot
fig.show()

The cleaned data displays fairly consistent price action, providing a solid starting point for data transformations.
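
The internals of `clean_and_filter_data` are not shown in this notebook, but the date-range removal described above can be sketched as follows. `drop_date_range` is a hypothetical helper, and treating the range as start-inclusive and end-exclusive is an assumption.

```python
import pandas as pd

def drop_date_range(df, date_range, column="datetime"):
    """Hypothetical helper: drop rows whose timestamp falls in [start, end)."""
    start, end = pd.Timestamp(date_range[0]), pd.Timestamp(date_range[1])
    in_range = (df[column] >= start) & (df[column] < end)
    return df.loc[~in_range].reset_index(drop=True)

# Toy data: two rows inside the volatile window, two rows outside it
toy = pd.DataFrame({
    "datetime": pd.to_datetime(["2020-12-31", "2021-01-01", "2022-05-31", "2022-06-01"]),
    "close": [0.03, 0.04, 0.09, 0.08],
})
kept = drop_date_range(toy, ["2021-01-01", "2022-06-01"])
print(kept)
```

Note that removing a window leaves a gap in the time axis, so any time-based resampling will produce empty periods across the removed range.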

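**Tick Data vs Bar Data:** Prices sampled tick by tick or at fixed time intervals can be unreliable, because every observation counts equally no matter how much was traded. In contrast, *bar data* that closes a bar only after a set amount of trading activity is more reliable because it takes trade volume into account.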

**Types of Bar Data:** There are various types of bar data, including:

- *Volume bars*
- *Dollar bars*
- *Run bars*
- *Imbalance bars*

For simplicity, we will focus on **dollar bars** in this analysis. We use 50 bars per day because, according to López de Prado, this gives the bars desirable statistical properties.
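
As a rough sketch of the idea, a dollar bar closes once a fixed dollar value (price times volume) has been traded. The function below is a hypothetical illustration, not the repository's `tick_to_dollar_bar`; the name `to_dollar_bars` and the `dollars_per_bar` parameter are assumptions. One plausible way to relate it to `bars_per_day` is to set `dollars_per_bar` to the total traded value divided by the number of days times `bars_per_day`.

```python
import pandas as pd

def to_dollar_bars(df, dollars_per_bar):
    """Hypothetical sketch: group rows into bars holding roughly `dollars_per_bar` of traded value."""
    dollar_value = df["close"] * df["volume"]
    # Value traded before each row; a new bar starts whenever this crosses a multiple of the threshold
    traded_before = dollar_value.cumsum() - dollar_value
    bar_id = (traded_before // dollars_per_bar).astype(int)
    bars = df.groupby(bar_id).agg(
        datetime=("datetime", "last"),  # timestamp at which the bar closes
        close=("close", "last"),        # last price in the bar
        volume=("volume", "sum"),       # total volume in the bar
    )
    return bars.reset_index(drop=True)

# Toy minute data with a constant price of 1.0, so dollar value equals volume
toy = pd.DataFrame({
    "datetime": pd.date_range("2023-01-01", periods=6, freq="min"),
    "close": [1.0] * 6,
    "volume": [50.0, 50.0, 100.0, 10.0, 90.0, 100.0],
})
bars = to_dollar_bars(toy, dollars_per_bar=100.0)
print(bars)
```

Busy minutes form a bar on their own while quiet minutes are merged, which is why dollar bars are spaced unevenly in time.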

In [15]:
bars_per_day=50
dollar_bars = tick_to_dollar_bar(ohlcv_data_cleaned, bars_per_day=bars_per_day)

In [16]:
ohlcv_data_cleaned_daily = ohlcv_data_cleaned.iloc[::int(1440/bars_per_day),:]

# Create subplots
fig = sp.make_subplots(rows=2, cols=2, subplot_titles=("Cleaned Data (Price)", "Dollar Bars (Price)", "Cleaned Data (Volume)", "Dollar Bars (Volume)"))

# Plot the 'close' column in the cleaned data using Plotly
fig.add_trace(go.Scatter(x=ohlcv_data_cleaned_daily['datetime'], y=ohlcv_data_cleaned_daily['close'], name='Cleaned Data (Price)'), row=1, col=1)

# Plot the 'close' column in the dollar bars using Plotly
fig.add_trace(go.Scatter(x=dollar_bars['datetime'], y=dollar_bars['close'], name='Dollar Bars (Price)'), row=1, col=2)

# Plot the 'volume' column in the cleaned data using Plotly
fig.add_trace(go.Bar(x=ohlcv_data_cleaned_daily['datetime'], y=ohlcv_data_cleaned_daily['volume'], name='Cleaned Data (Volume)', marker=dict(color='rgba(200, 0, 0, 1)')), row=2, col=1)

# Plot the 'volume' column in the dollar bars using Plotly
fig.add_trace(go.Bar(x=dollar_bars['datetime'], y=dollar_bars['volume'], name='Dollar Bars (Volume)'), row=2, col=2)

# Update layout
fig.update_layout(showlegend=False, xaxis3_title='Date', xaxis4_title='Date',
                  yaxis_title='Price (USD)', yaxis2_title='Price (USD)', yaxis3_title='Volume', yaxis4_title='Volume')

# Display the plot
fig.show()

### Primary Model Building

Next, we will dive into model building, focusing on simple yet effective methods like:

- Rolling averages
- Trade Event Detection
- Triple Barrier Method
- Hyperparameter Optimization

These simple methods will provide valuable insights into financial time series data.



**Trade Event Detection**

The `generate_trading_signals_ma` function generates trading signals based on the relationship between two moving averages:

- *Fast MA*
- *Slow MA*


By comparing the fast and slow moving averages, we estimate whether to buy or sell at each timepoint.

- If the *fast MA* is greater than the *slow MA*, the `side` column is set to **1**, indicating a *buy signal*.
- If the *fast MA* is less than the *slow MA*, the `side` column is set to **-1**, indicating a *sell signal*.
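
The crossover rule above can be sketched in a few lines of pandas. This is a minimal illustration rather than the repository's `generate_trading_signals_ma`; the function name `ma_crossover_side` and the choice to emit 0 while either average is still undefined are assumptions.

```python
import pandas as pd

def ma_crossover_side(close, fast_window, slow_window):
    """Hypothetical sketch: 1 where fast MA > slow MA, -1 where fast MA < slow MA, else 0."""
    fast_ma = close.rolling(fast_window).mean()
    slow_ma = close.rolling(slow_window).mean()
    side = pd.Series(0, index=close.index)
    # Comparisons against NaN are False, so the warm-up period stays at 0
    side[fast_ma > slow_ma] = 1
    side[fast_ma < slow_ma] = -1
    return side

close = pd.Series([1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0])
side = ma_crossover_side(close, fast_window=2, slow_window=3)
print(side.tolist())  # [0, 0, 1, 1, 1, -1, -1]
```

The signal flips from buy to sell one step after the price peaks, which shows the lag inherent in moving-average rules.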



In [17]:
slow_window=50
fast_window=10
dollar_bars = generate_trading_signals_ma(dollar_bars, slow_window, fast_window)

**Risk Management: Focus on Key Timepoints**

We don't want to bet on *every* dollar bar, as that's too risky. Instead, it's better to place larger bets on fewer, more certain timepoints. This is where **events** come into play 🎯.

Various mathematical models can be used to create events. For this tutorial, we are using the **CUSUM** (Cumulative Sum) method, which is recommended by _Marcos López de Prado_. 

> 🌟 **CUSUM** helps us identify timepoints with big shifts in the average price, enabling us to focus our bets on these critical moments.

By using the CUSUM method, we aim to create a more effective trading strategy.
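
To make the idea concrete, here is a minimal sketch of a symmetric CUSUM filter. It is not the repository's `generate_cusum_events`: the function name `cusum_events`, the use of simple returns, and the plain-list input are all assumptions. Two running sums track accumulated upward and downward moves, and an event fires (and the corresponding sum resets) when either exceeds the threshold.

```python
def cusum_events(prices, threshold):
    """Hypothetical sketch: indices where accumulated returns exceed +/- threshold."""
    events = []
    s_pos, s_neg = 0.0, 0.0
    for i in range(1, len(prices)):
        ret = prices[i] / prices[i - 1] - 1.0
        # Accumulate moves, but never let the up-sum go negative or the down-sum go positive
        s_pos = max(0.0, s_pos + ret)
        s_neg = min(0.0, s_neg + ret)
        if s_pos > threshold:
            events.append(i)
            s_pos = 0.0
        elif s_neg < -threshold:
            events.append(i)
            s_neg = 0.0
    return events

prices = [100, 101, 102, 104, 107, 106, 103, 100, 99]
events = cusum_events(prices, threshold=0.05)
print(events)  # [4, 7]
```

With a threshold of 0.10 the same series produces no events, which illustrates how a higher threshold reduces the number of events.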


In [35]:
threshold = 0.0555  # Determines sensitivity: a higher threshold leads to fewer events
trading_events = generate_cusum_events(dollar_bars, threshold)
trading_events = pd.merge(trading_events, dollar_bars[['datetime', 'close']], on='datetime', how='left')

In [36]:
from datetime import datetime, timedelta

# Calculate the timestamp for 2 months prior to the last available data point
last_timestamp = dollar_bars['datetime'].max()
two_months_ago = last_timestamp - timedelta(days=60)

# Filter dollar_bars and trading_events for data from the last 2 months
dollar_bars_last_2_months = dollar_bars[dollar_bars['datetime'] > two_months_ago]
trading_events_close_last_2_months = trading_events[trading_events['datetime'] > two_months_ago]

# Create a subplot with the 'close' column in dollar_bars_last_2_months
fig = go.Figure()
fig.add_trace(go.Scatter(x=dollar_bars_last_2_months['datetime'], y=dollar_bars_last_2_months['close'], name='Close Price'))

# Add the buy events as green squares on the plot
buy_events = trading_events_close_last_2_months[trading_events_close_last_2_months['side'] == 1]
fig.add_trace(go.Scatter(x=buy_events['datetime'], y=buy_events['close'], mode='markers', marker=dict(symbol='square', size=8, color='green'), name='Buy Events'))

# Add the sell events as purple squares on the plot
sell_events = trading_events_close_last_2_months[trading_events_close_last_2_months['side'] == -1]
fig.add_trace(go.Scatter(x=sell_events['datetime'], y=sell_events['close'], mode='markers', marker=dict(symbol='square', size=8, color='purple'), name='Sell Events'))

# Update layout
fig.update_layout(title='Close Price, Buy and Sell Events (Last 2 Months)', xaxis_title='Date', yaxis_title='Price (USD)', showlegend=True)

# Display the plot
fig.show()


The graph helps validate our work so far.

It is not supposed to be perfect, but we generally see the price rise after buy events and fall after sell events.

After an event is triggered and we initiate a trade, we must consider three possible outcomes.

**Introducing the Triple Barrier Method**

The Triple Barrier Method is an effective way to manage the exit strategy of our trading algorithm. 

It helps determine when to close a trade by considering three barriers: Profit Taking, Stop Loss, and Time. 

By using the Triple Barrier Method, we can effectively manage our trades, ensuring that we exit them at the most appropriate time, either to secure profits or to minimize losses.
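
The labeling logic can be sketched for a single trade as follows. This is a hypothetical illustration, not the repository's `triple_barrier_method`: the function name, the list-based input, and expressing the time barrier as a number of bars (`max_holding`) instead of days are assumptions. The labels `pt`, `sl`, and `vb` mirror the `type` values used in this notebook's output, and the return is signed by `side` so that a falling price is a gain for a short trade.

```python
def triple_barrier_label(prices, entry_index, side, pt, sl, max_holding):
    """Hypothetical sketch: label one trade as 'pt', 'sl', or 'vb' (vertical/time barrier).

    Returns (label, exit_index, trade_return).
    """
    entry_price = prices[entry_index]
    last_index = min(entry_index + max_holding, len(prices) - 1)
    for i in range(entry_index + 1, last_index + 1):
        ret = side * (prices[i] / entry_price - 1.0)
        if ret >= pt:    # profit-taking barrier touched first
            return "pt", i, ret
        if ret <= -sl:   # stop-loss barrier touched first
            return "sl", i, ret
    # Neither horizontal barrier was touched: exit at the time barrier
    ret = side * (prices[last_index] / entry_price - 1.0)
    return "vb", last_index, ret

prices = [100, 102, 105, 103]
long_trade = triple_barrier_label(prices, 0, side=1, pt=0.04, sl=0.06, max_holding=3)
short_trade = triple_barrier_label(prices, 0, side=-1, pt=0.04, sl=0.06, max_holding=3)
print(long_trade)
print(short_trade)
```

On this toy path the long trade exits at the profit-taking barrier on bar 2, while the short trade never loses 6% and is closed at the time barrier with a small loss.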



In [53]:
pt = 0.04  # the gain at which to take profit
sl = 0.06  # the loss at which to cut losses
min_ret = 0.01  # the minimum return to be considered for triple barrier labeling
num_days = 1  # the maximum time (in days) for a trade to be live
triple_barrier_labels = triple_barrier_method(dollar_bars, trading_events, pt, sl, min_ret, num_days)
triple_barrier_labels.head()

Unnamed: 0,datetime,type,return,t1,side,initial_price,final_price
0,2020-05-20 09:03:00,sl,-0.068493,2020-05-20 15:47:00,-1.0,0.0365,0.039
1,2020-05-20 15:47:00,pt,0.046154,2020-05-20 21:43:00,-1.0,0.039,0.0372
2,2020-05-21 04:18:00,pt,0.063452,2020-05-21 08:13:00,-1.0,0.0394,0.0369
3,2020-05-21 08:13:00,vb,-0.02981,2020-05-22 10:05:00,-1.0,0.0369,0.038
4,2020-05-22 11:36:00,vb,0.046512,2020-05-23 14:07:00,-1.0,0.0387,0.0369


In [54]:

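# Filter dollar_bars and triple_barrier_labels to only include the last three weeks of data
# (anchored to the last available bar so the plot is not empty when the notebook is re-run later)
start_date = dollar_bars['datetime'].max() - pd.DateOffset(weeks=3)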
filtered_dollar_bars = dollar_bars[dollar_bars['datetime'] >= start_date]
filtered_triple_barrier_labels = triple_barrier_labels[triple_barrier_labels['datetime'] >= start_date]

buy_events = filtered_triple_barrier_labels[filtered_triple_barrier_labels['side'] == 1]
sell_events = filtered_triple_barrier_labels[filtered_triple_barrier_labels['side'] == -1]

fig = go.Figure()

# Plot the 'close' column in dollar_bars
fig.add_trace(go.Scatter(x=filtered_dollar_bars['datetime'], y=filtered_dollar_bars['close'], name='Price', line=dict(color='blue')))
# Plot the trading events
fig.add_trace(go.Scatter(x=buy_events['datetime'], y=buy_events['initial_price'], mode='markers', marker=dict(color='green', symbol='square'), name='Buy'))
fig.add_trace(go.Scatter(x=sell_events['datetime'], y=sell_events['initial_price'], mode='markers', marker=dict(color='violet', symbol='square'), name='Sell'))

# Plot the triple barriers
for _, row in filtered_triple_barrier_labels.iterrows():
    fig.add_shape(
        type="line",
        x0=row['datetime'],
        x1=row['t1'],
        # Profit is taken above the entry price for longs and below it for shorts
        y0=row['initial_price'] * (1 + row['side'] * pt),
        y1=row['initial_price'] * (1 + row['side'] * pt),
        yref="y",
        xref="x",
        line=dict(color="green"),
        name="Profit Taking"
    )

    fig.add_shape(
        type="line",
        x0=row['datetime'],
        x1=row['t1'],
        # The stop loss sits below the entry price for longs and above it for shorts
        y0=row['initial_price'] * (1 - row['side'] * sl),
        y1=row['initial_price'] * (1 - row['side'] * sl),
        yref="y",
        xref="x",
        line=dict(color="purple"),
        name="Stop Loss"
    )

    fig.add_shape(
        type="line",
        x0=row['t1'],
        x1=row['t1'],
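        # Span the vertical line between the two horizontal barriers for either side
        y0=row['initial_price'] * (1 - (sl if row['side'] == 1 else pt)),
        y1=row['initial_price'] * (1 + (pt if row['side'] == 1 else sl)),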
        yref="y",
        xref="x",
        line=dict(color="blue"),
        name="Time Barrier"
    )

# Update layout
fig.update_layout(title='Price, Event and Triple Barrier Method Visualization', xaxis_title='Date', yaxis_title='Price (USD)', showlegend=True)

# Display the plot
fig.show()

In this graph, we visualize the Triple Barrier Method applied to the price data. The blue line represents the price, green squares indicate buy events, and violet squares indicate sell events.

The three barriers are:

1. **Profit Taking (Green Line):** This horizontal line represents the target price level for taking profits. If the price reaches this level, the position is closed with a profit.
2. **Stop Loss (Purple Line):** This horizontal line represents the maximum tolerable loss. If the price reaches this level, the position is closed to prevent further losses.
3. **Time Barrier (Blue Line):** This vertical line represents the maximum duration for a trade to be active. If the price hasn't reached the profit taking or stop loss levels within this time frame, the position is closed.
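
As a worked example of the two horizontal barriers, the price levels depend on the trade direction: a long trade profits when the price rises, a short trade when it falls. The helper below is a hypothetical sketch (the name `barrier_prices` is an assumption) using the `pt = 0.04` and `sl = 0.06` values set earlier and the entry price of 0.0365 from the first labeled trade.

```python
def barrier_prices(initial_price, side, pt, sl):
    """Hypothetical sketch: (profit_taking_price, stop_loss_price) for a long (1) or short (-1) trade."""
    profit_taking = initial_price * (1 + side * pt)
    stop_loss = initial_price * (1 - side * sl)
    return profit_taking, stop_loss

long_pt, long_sl = barrier_prices(0.0365, side=1, pt=0.04, sl=0.06)
short_pt, short_sl = barrier_prices(0.0365, side=-1, pt=0.04, sl=0.06)
print("long :", round(long_pt, 5), round(long_sl, 5))
print("short:", round(short_pt, 5), round(short_sl, 5))
```

For the short trade the stop loss sits at roughly 0.0387, above the entry price. The first labeled trade in the table closed at 0.039, beyond that level, which is consistent with its `sl` label.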

## Congratulations!

You have successfully created a primary model for financial trading.

Along the way, you might have noticed several parameters that were set. 

Manually tuning these parameters can be time-consuming and challenging.

**Good thing** you can leverage a hyperparameter optimization library to find the best results and save time.

In the following cell, we search for the best parameters by optimizing the cumulative return of all trades, with a touch of recency bias.
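
The scoring idea can be isolated from the optimizer as a small function: sum all trade returns, then add the returns of recent trades a second time so that they count double. The sketch below is illustrative; the name `recency_weighted_score` and the toy `labels` frame are assumptions, while the `datetime` and `return` column names follow the triple barrier output shown earlier.

```python
import pandas as pd

def recency_weighted_score(labels, weeks):
    """Hypothetical sketch: total return plus the return of trades opened in the last `weeks` weeks."""
    total_ret = labels["return"].sum()
    cutoff = labels["datetime"].iloc[-1] - pd.DateOffset(weeks=weeks)
    recent_ret = labels.loc[labels["datetime"] > cutoff, "return"].sum()
    return total_ret + recent_ret

# Toy trade log (made-up numbers for illustration)
labels = pd.DataFrame({
    "datetime": pd.to_datetime(["2023-01-01", "2023-03-01", "2023-04-20", "2023-05-01"]),
    "return": [0.05, -0.02, 0.03, 0.01],
})
score = recency_weighted_score(labels, weeks=9)
print(round(score, 4))  # 0.09
```

Here the total return is 0.07 and the last three trades fall inside the nine-week window, contributing a further 0.02.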


In [80]:
EVENT_TRESHOLD = 0.0555
BARS_PER_DAY = 50
MIN_RET = 0.005

n_jobs = 8  # Set the number of parallel jobs according to the number of available CPU cores

In [66]:
from joblib import parallel_backend
import numpy as np


def backtest(triple_barrier_labels, initial_money=1000, bet_percentage=0.16):
    usd = initial_money * 0.5
    hbar = initial_money * 0.5
    total_money = initial_money
    active_bets = {}

    for index, row in triple_barrier_labels.iterrows():
        current_time = row['datetime']

        # Close bets that have reached their t1
        bets_to_close = [key for key, value in active_bets.items() if value['t1'] <= current_time]
        for key in bets_to_close:
            bet = active_bets.pop(key)
            bet_amount = bet['amount']
            pnl = bet_amount * bet['return']
            usd += pnl
            hbar -= pnl / row['final_price']
            total_money = usd + hbar * row['final_price']

        if len(active_bets) == 0:
            bet_amount = total_money * bet_percentage

            if row['side'] == 1:  # Long signal
                usd -= bet_amount
                hbar += bet_amount / row['initial_price']
            elif row['side'] == -1:  # Short signal
                usd += bet_amount
                hbar -= bet_amount / row['initial_price']

            active_bets[row['datetime']] = {'amount': bet_amount, 'return': row['return'], 't1': row['t1']}

    return total_money

def objective(trial):
    # Choose hyperparameters from trial object
    slow_window = trial.suggest_int("slow_window", 10, 200)
    fast_window = trial.suggest_int("fast_window", 5, 50)
    pt = trial.suggest_float("pt", 0.04, 0.1)
    sl = trial.suggest_float("sl", 0.03, 0.2)
    num_days = trial.suggest_float("num_days", 0.5, 2.5)


    ohlcv_data = pd.read_csv("../datasets/hbar_data.csv", parse_dates=['datetime'])
    ohlcv_data_cleaned = clean_and_filter_data(ohlcv_data, ['2021-01-01', '2022-06-01'])
    dollar_bars = tick_to_dollar_bar(ohlcv_data_cleaned, bars_per_day=BARS_PER_DAY)
    
    dollar_bars = generate_trading_signals_ma(dollar_bars, slow_window, fast_window)

    trading_events = generate_cusum_events(dollar_bars, threshold=EVENT_TRESHOLD)
    triple_barrier_labels = triple_barrier_method(dollar_bars, trading_events, pt, sl, MIN_RET, num_days)
    total_ret = triple_barrier_labels['return'].sum()
    last_weeks = triple_barrier_labels['datetime'].iloc[-1] - pd.DateOffset(weeks=9)
    dollar_bars_last_week = triple_barrier_labels[triple_barrier_labels['datetime'] > last_weeks]
    
    recency_bias = dollar_bars_last_week['return'].sum()

    #accuracy_multiplier = np.log(triple_barrier_labels['return'].gt(0).mean() + 0.5)
    score = (total_ret + recency_bias)

    return score

# Create an Optuna study and optimize the hyperparameters
study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(n_startup_trials=130)
)

with parallel_backend("threading", n_jobs=n_jobs):
    study.optimize(objective, n_trials=200, n_jobs=n_jobs)


[I 2023-05-01 16:54:14,696] A new study created in memory with name: no-name-378441b8-9015-4785-b404-12585411c51d
[I 2023-05-01 16:54:24,582] Trial 0 finished with value: 1.2246176412212755 and parameters: {'slow_window': 179, 'fast_window': 33, 'pt': 0.056557167980914924, 'sl': 0.16775825480221043, 'num_days': 1.7508778801574483}. Best is trial 0 with value: 1.2246176412212755.
[I 2023-05-01 16:54:24,763] Trial 6 finished with value: -0.11407754967579041 and parameters: {'slow_window': 200, 'fast_window': 22, 'pt': 0.04954962514305387, 'sl': 0.03863222336432129, 'num_days': 1.5103448089048903}. Best is trial 0 with value: 1.2246176412212755.
[I 2023-05-01 16:54:25,250] Trial 2 finished with value: 0.37358630559936884 and parameters: {'slow_window': 63, 'fast_window': 48, 'pt': 0.057050742926684536, 'sl': 0.03972387454078999, 'num_days': 1.1525112665284343}. Best is trial 0 with value: 1.2246176412212755.
[I 2023-05-01 16:54:25,3



The results are in:

In [79]:
# Print the best hyperparameters
print("\nBest trial:")
trial = study.best_trial
print("Score: {}".format(trial.value))
print("Params: ")
for key, value in trial.params.items():
    print("{}: {}".format(key, value))


Best trial:
Score: 4.3920526787018925
Params: 
slow_window: 48
fast_window: 8
pt: 0.04507551639373595
sl: 0.1529883358809687
num_days: 2.071218279654183


Now we recreate the primary model with the best parameters.

In [67]:
# Get the best parameters from the study
best_params = study.best_params
print(best_params)

# Extract the best parameters
slow_window = best_params["slow_window"]
fast_window = best_params["fast_window"]
pt = best_params["pt"]
sl = best_params["sl"]
num_days = best_params["num_days"]


ohlcv_data = pd.read_csv("../datasets/hbar_data.csv", parse_dates=['datetime'])
ohlcv_data_cleaned = clean_and_filter_data(ohlcv_data, ['2021-01-01', '2022-06-01'])
dollar_bars = tick_to_dollar_bar(ohlcv_data_cleaned, bars_per_day=BARS_PER_DAY)
dollar_bars = generate_trading_signals_ma(dollar_bars, slow_window, fast_window)

trading_events = generate_cusum_events(dollar_bars, threshold=EVENT_TRESHOLD)
triple_barrier_labels = triple_barrier_method(dollar_bars, trading_events, pt, sl, MIN_RET, num_days)

# Calculate total return with the best parameters
total_ret = triple_barrier_labels['return'].sum()
print("Total return with the best parameters: ", total_ret)

{'slow_window': 48, 'fast_window': 8, 'pt': 0.04507551639373595, 'sl': 0.1529883358809687, 'num_days': 2.071218279654183}
Total return with the best parameters:  4.02724188259807


In [68]:
offset = 5  # number of weeks to display
# Filter the dollar bars for the last `offset` weeks
last_week = dollar_bars['datetime'].iloc[-1] - pd.DateOffset(weeks=offset)
dollar_bars_last_week = dollar_bars[dollar_bars['datetime'] > last_week]


tblw = triple_barrier_labels['datetime'].iloc[-1] - pd.DateOffset(weeks=offset)
tblw = triple_barrier_labels[triple_barrier_labels['datetime'] > tblw]


# Plot the dollar bars for the last week
dollar_bars_last_week_scatter = go.Scatter(x=dollar_bars_last_week['datetime'], y=dollar_bars_last_week['close'], name='Dollar Bars', mode='lines', line=dict(width=0.5))
triple_barrier_labels_last_week = triple_barrier_labels[triple_barrier_labels['datetime'] > last_week]


scatter_markers = []
for _, row in triple_barrier_labels_last_week.iterrows():
    start_time = row['datetime']
    end_time = row['t1']
    start_price = dollar_bars[dollar_bars['datetime'] == start_time]['close'].values[0]
    end_price = dollar_bars[dollar_bars['datetime'] == end_time]['close'].values[0]

    if row['return'] > 0:
        color = 'green'
    else:
        color = 'red'

    if row['side'] == 1:  # Long trade
        start_marker = 'triangle-up'
    else:  # Short trade
        start_marker = 'triangle-down'

    scatter_markers.append(go.Scatter(x=[start_time, end_time], y=[start_price, end_price], mode='markers', marker=dict(size=10, symbol=[start_marker, 'x'], color=color)))

# Create a figure for the dollar bars with triple barrier events (last `offset` weeks)
fig = go.Figure(data=[dollar_bars_last_week_scatter] + scatter_markers)
fig.update_layout(height=600, width=1000, title_text=f"Dollar Bars with Triple Barrier Events (Last {offset} Weeks)", xaxis_title="Date", yaxis_title="Close Price", showlegend=False)

# Show the plots
fig.show()

In [69]:
accuracy = triple_barrier_labels['return'].gt(0).mean()
average_trade_return = triple_barrier_labels['return'].mean()

In [77]:
print('Trade Accuracy: {:.3f}%'.format(accuracy*100))
print('Average Trade Return: {:.3f}%'.format(average_trade_return*100))

Trade Accuracy: 69.579%
Average Trade Return: 1.303%
