# Constructing a Trading Strategy Using Transfer Entropy Between Discretized Sentiment Scores and Stock Price Movements

In [24]:
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

import matplotlib.pyplot as plt
from scipy.optimize import minimize
#from pyinform.transferentropy import transfer_entropy 
from tqdm import tqdm
import utils.constants as constants
import utils.pipelines as pipelines
from utils.pipelines import *
from utils.helpers import *
from utils.calibration import *

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [10]:
#Switch (to run the sentiment part)
Switch = False
data_path = 'data/'

### 0. Cleaning our data

In [11]:
#Applying the function
#First choose what dataset we want to clean.
#I want us to first work with Apple news dataset but we could work with any dataset (just replace the path)
df_to_clean = read_file(data_path + "raw/AAPL_news.parquet")
df_to_clean.head()

Unnamed: 0,Date,Stock_symbol,Article_title
0,2023-12-16 22:00:00,AAPL,My 6 Largest Portfolio Holdings Heading Into 2...
1,2023-12-16 22:00:00,AAPL,Brokers Suggest Investing in Apple (AAPL): Rea...
2,2023-12-16 21:00:00,AAPL,"Company News for Dec 19, 2023"
3,2023-12-16 21:00:00,AAPL,NVIDIA (NVDA) Up 243% YTD: Will It Carry Momen...
4,2023-12-16 21:00:00,AAPL,"Pre-Market Most Active for Dec 19, 2023 : BMY,..."


In [12]:
#function applied
df_filtered = pipelines.cleaner_df(df_to_clean, ticker_lst = ["Apple","AAPL"])
df_filtered.head()



Unnamed: 0,Date,Stock_symbol,Article_title
1,2023-12-16 22:00:00,AAPL,Brokers Suggest Investing in Apple (AAPL): Rea...
4,2023-12-16 21:00:00,AAPL,"Pre-Market Most Active for Dec 19, 2023 : BMY,..."
6,2023-12-16 20:00:00,AAPL,AAPL Quantitative Stock Analysis
15,2023-12-16 04:00:00,AAPL,"After Hours Most Active for Dec 18, 2023 : PAC..."
16,2023-12-16 04:00:00,AAPL,"Technology Sector Update for 12/18/2023: PCT, ..."


In [13]:
#Finally, save the cleaned dataset
#If we work with another dataset, change the path
write_file(df_filtered,data_path+"clean/apple_news_c.csv")

### 1. News-data Preprocessing

**Here we start processing the news data to extract the sentiment.**

We treat the news dataset as a continuous flow of headlines and construct a stochastic process on their sentiment: DistilRoBERTa classifies each headline as positive, neutral, or negative, and we label these classes with the scores (1, 0, -1). News is released in two ways: **in batches at midnight (or outside trading hours)**, or **individually during trading hours (flow-released news)**. For batch-released news, we average the sentiment scores.

The formula for the stochastic process can be written as

$$
I_t^s = \frac{1}{|N^s_t|}\sum_{j} g(f(e_{jt}^{s})), \qquad
g(x) = \begin{cases} 1, & x = \text{Positive}\\
0, & x = \text{Neutral}\\
-1, & x = \text{Negative}
\end{cases}, \qquad
f(x) = \mathrm{RoBERTa}(x)
$$

$f(x)$ is the pretrained sentiment classifier, $e_{jt}^{s}$ is the embedding of the $j$-th news headline at time $t$ for stock $s$, and $|N^s_t|$ is the number of news items released at time $t$ that relate to stock $s$.

**Note**: time $t$ is not natural trading time but a proxy index for the intervals between news releases, and it differs from stock to stock. Think of it as a kind of jump process.
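
As a minimal sketch of the averaging step, assume each headline has already been scored with $g(f(\cdot))$ and that the score sits in a `Sentiment` column next to its `Date` timestamp. The column names follow the dataframes in this notebook; the helper name `aggregate_sentiment` and the toy rows are hypothetical. Under those assumptions $I_t^s$ is a group-wise mean:

```python
import pandas as pd


def aggregate_sentiment(news: pd.DataFrame) -> pd.Series:
    """Average the per-headline scores (1, 0, -1) released at the same timestamp."""
    return news.groupby("Date")["Sentiment"].mean()


# Toy data (not from the real dataset): a batch of three headlines, then a single one
toy = pd.DataFrame({
    "Date": ["2020-03-11 00:00:00"] * 3 + ["2020-03-12 00:00:00"],
    "Sentiment": [1, 0, -1, 1],
})
I_t = aggregate_sentiment(toy)
```

A batch with one positive, one neutral, and one negative headline averages to a neutral score, which is why the daily scores later need to be discretized again.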

In [14]:
cleaned_df = pd.read_csv(data_path +"clean/apple_news_c.csv")
cleaned_df.head()

Unnamed: 0,Date,Stock_symbol,Article_title
0,2023-12-16 22:00:00,AAPL,Brokers Suggest Investing in Apple (AAPL): Rea...
1,2023-12-16 21:00:00,AAPL,"Pre-Market Most Active for Dec 19, 2023 : BMY,..."
2,2023-12-16 20:00:00,AAPL,AAPL Quantitative Stock Analysis
3,2023-12-16 04:00:00,AAPL,"After Hours Most Active for Dec 18, 2023 : PAC..."
4,2023-12-16 04:00:00,AAPL,"Technology Sector Update for 12/18/2023: PCT, ..."


This script performs sentiment analysis on financial news articles using the PRE-TRAINED model we selected from Hugging Face.
The sentiment analysis model used is 'distilroberta-finetuned-financial-news-sentiment-analysis', which is fine-tuned on the financial_phrasebank dataset.

**Functions**:
 - **distill_roberta_classify_sentiment(article: str) -> int**:
        Classifies the sentiment of a given article as positive, negative, or neutral.
        Returns 1 for positive sentiment, -1 for negative sentiment, and 0 for neutral sentiment.

The script reads a DataFrame **cleaned_df** containing financial news articles, applies sentiment analysis to the 'Article_title' column,
and saves the resulting DataFrame with sentiment scores to a CSV file.

**Usage**:
    Ensure that the required libraries are installed and the input DataFrame **cleaned_df** is loaded.
    Run the script to perform sentiment analysis and save the results to a CSV file.
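
The actual implementation lives in `models/classifier.py` and is not reproduced here. The following is only a hedged sketch of what such a function might look like. It assumes the classifier is a callable returning a list of `{'label': ..., 'score': ...}` dictionaries, which is the output format of a Hugging Face text-classification `pipeline`. The names `LABEL_TO_SCORE`, `classify_sentiment`, and `stub_classifier` are hypothetical, and the stub replaces the real model so that the example runs offline:

```python
from typing import Callable, Dict, List

# Assumed label names; the real checkpoint may use different casing
LABEL_TO_SCORE: Dict[str, int] = {"positive": 1, "neutral": 0, "negative": -1}


def classify_sentiment(article: str, classifier: Callable[[str], List[dict]]) -> int:
    """Map the classifier's top label to 1 / 0 / -1; unknown labels count as neutral."""
    label = classifier(article)[0]["label"].lower()
    return LABEL_TO_SCORE.get(label, 0)


def stub_classifier(text: str) -> List[dict]:
    """Stand-in for the real model: a keyword rule, for illustration only."""
    lowered = text.lower()
    if "surge" in lowered:
        return [{"label": "positive", "score": 0.9}]
    if "plunge" in lowered:
        return [{"label": "negative", "score": 0.9}]
    return [{"label": "neutral", "score": 0.9}]


score = classify_sentiment("Apple shares surge after earnings", stub_classifier)
```

With the real model, `classifier` would be the object returned by `transformers.pipeline` for the checkpoint named above.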


In [16]:
from models.classifier import distill_roberta_classify_sentiment

# Now we need to merge the sentiment scores to the dataset (AND AFTER MERGE INDEX)
if Switch:
    cleaned_df['Sentiment'] = cleaned_df['Article_title'].apply(distill_roberta_classify_sentiment)
    #Writing the file
    write_file(cleaned_df,data_path+"processed/processed_news_data_with_sent.csv")

### 2. Price-data preprocessing

#### Now that we have our data set with the sentiments, we want to link it to our stock price

Discretize the tick-by-tick price data to align it with the news data by computing the price return. When batch news is released outside trading hours, we assume that traders react as soon as trading starts; for flow-released news, we assume that market participants react as soon as the information in the news is understood. This introduces **one hyperparameter**:
  - The time lag between the news release and the market reaction: $\Delta$

Using a threshold $\gamma$, label the stock return as positive, stable, or negative, encoded as (1, 0, -1). This gives a stochastic process for the stock return.

The formula for the stochastic process of the discretized return can be written as 

$$
R_{t}^{s} = h\left(\log \frac{P_{t + 1}^{s}}{P_{t}^{s}}\right), \qquad
h(x) = \begin{cases}1, & x > \gamma \\ 0, & |x| \leq \gamma \\ -1, & x < -\gamma\end{cases}, \qquad
\gamma > 0
$$

 - $P_{t}^{s}$ is the price of stock $s$ at time $t$
 - $(\log \frac{P_{t + 1}^{s}}{P_{t}^{s}})$ is the daily log-return
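
A minimal sketch of this discretization with NumPy, using the first closing prices from the Apple quotes file as input. The function name `discretize_returns` is hypothetical and is not the `classify_returns` helper used later in the pipeline:

```python
import numpy as np


def discretize_returns(prices, gamma):
    """Return the log-returns log(P[t+1] / P[t]) and their labels h(x) in {1, 0, -1}."""
    prices = np.asarray(prices, dtype=float)
    log_ret = np.log(prices[1:] / prices[:-1])
    labels = np.where(log_ret > gamma, 1, np.where(log_ret < -gamma, -1, 0))
    return log_ret, labels


# Closing prices for 02/01/1981 to 09/01/1981, taken from the quotes data
closes = [0.154018, 0.15067, 0.143973, 0.137835, 0.135045, 0.142299]
log_ret, labels = discretize_returns(closes, gamma=0.01)
```

With $\gamma = 0.01$ the four falling days are labelled -1 and the final rising day is labelled 1; a larger $\gamma$ pushes more days into the stable state 0.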


#### 2.1 Cleaning the price dataset

In [17]:
return_data = read_file(data_path + "quotes/apple_quotes.csv")
return_data.head()

Unnamed: 0,Date,Open,High,Low,Close,Adj Close,Volume
0,02/01/1981,0.154018,0.155134,0.154018,0.154018,0.119849,21660800
1,05/01/1981,0.151228,0.151228,0.15067,0.15067,0.117244,35728000
2,06/01/1981,0.144531,0.144531,0.143973,0.143973,0.112032,45158400
3,07/01/1981,0.138393,0.138393,0.137835,0.137835,0.107256,55686400
4,08/01/1981,0.135603,0.135603,0.135045,0.135045,0.105085,39827200


In [20]:
#Calculating log returns and then inspecting the return dataframe 
returns_data = calculate_log_returns(return_data)
returns_data.head()

Unnamed: 0,Date,Close,Log_Return
1,05/01/1981,0.15067,-0.021977
2,06/01/1981,0.143973,-0.045466
3,07/01/1981,0.137835,-0.043568
4,08/01/1981,0.135045,-0.020449
5,09/01/1981,0.142299,0.052322


#### 2.2 Linking it to the news data
Now that we have a clean dataset to work on, let's merge our sentiment dataset with our returns dataset.

In [21]:
sentiment_data = read_file(data_path + 'processed/processed_news_data_with_sent.csv')
sentiment_data.head()

Unnamed: 0,Date,Stock_symbol,Article_title,Sentiment
0,2023-12-16 22:00:00,AAPL,Brokers Suggest Investing in Apple (AAPL): Rea...,0
1,2023-12-16 21:00:00,AAPL,"Pre-Market Most Active for Dec 19, 2023 : BMY,...",0
2,2023-12-16 20:00:00,AAPL,AAPL Quantitative Stock Analysis,0
3,2023-12-16 04:00:00,AAPL,"After Hours Most Active for Dec 18, 2023 : PAC...",0
4,2023-12-16 04:00:00,AAPL,"Technology Sector Update for 12/18/2023: PCT, ...",0


In [22]:
#We see that news are generally published in batches
sentiment_data.drop_duplicates(["Date"])

Unnamed: 0,Date,Stock_symbol,Article_title,Sentiment
0,2023-12-16 22:00:00,AAPL,Brokers Suggest Investing in Apple (AAPL): Rea...,0
1,2023-12-16 21:00:00,AAPL,"Pre-Market Most Active for Dec 19, 2023 : BMY,...",0
2,2023-12-16 20:00:00,AAPL,AAPL Quantitative Stock Analysis,0
3,2023-12-16 04:00:00,AAPL,"After Hours Most Active for Dec 18, 2023 : PAC...",0
6,2023-12-16 02:00:00,AAPL,"Technology Sector Update for 12/18/2023: ADBE,...",0
...,...,...,...,...
2813,2020-03-14 00:00:00,AAPL,All Apple Stores Outside Of China To Temporari...,0
2814,2020-03-13 00:00:00,AAPL,Canopy Growth's Storz & Bickel Bypasses Apple'...,0
2822,2020-03-12 00:00:00,AAPL,"Technical Pro: Apple A 'Great Company,' Not A ...",0
2823,2020-03-11 00:00:00,AAPL,Apple To Close All Italy Stores Until Further ...,0


In [23]:
intermediate_data = pipelines.process_and_merge_data(returns_data,sentiment_data)
intermediate_data.head()



Unnamed: 0,Date,Close,Log_Return,Trading_Date,Sentiment
0,2020-03-10,71.334999,0.069546,2020-03-10,0.0
1,2020-03-11,68.857498,-0.035348,2020-03-11,0.142857
2,2020-03-12,62.057499,-0.103978,2020-03-12,0.0
3,2020-03-13,69.4925,0.113157,2020-03-13,0.375
4,2020-03-16,60.552502,-0.137708,2020-03-16,-0.5
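
The internals of `process_and_merge_data` are not shown in this notebook. One possible way to obtain a `Trading_Date` column like the one above is sketched below, under two simplifying assumptions that are mine rather than the source's: headlines are assigned by calendar date, and headlines published on a non-trading day roll forward to the next trading day. The function name `align_news_to_trading_days` is hypothetical:

```python
import pandas as pd


def align_news_to_trading_days(news: pd.DataFrame, trading_days) -> pd.DataFrame:
    """Map each headline to the first trading day on or after its date, then average."""
    news = news.copy()
    news["Date"] = pd.to_datetime(news["Date"]).dt.normalize()
    news = news.sort_values("Date")  # merge_asof requires sorted keys
    calendar = pd.DataFrame({"Trading_Date": pd.to_datetime(trading_days)})
    calendar = calendar.sort_values("Trading_Date")
    merged = pd.merge_asof(
        news, calendar,
        left_on="Date", right_on="Trading_Date",
        direction="forward",  # first trading day >= headline date
    )
    # Headlines after the last trading day get NaT and are dropped by groupby
    return merged.groupby("Trading_Date", as_index=False)["Sentiment"].mean()


# Toy data: one Friday headline, three weekend headlines
toy_news = pd.DataFrame({
    "Date": ["2020-03-13 10:00:00", "2020-03-14 00:00:00",
             "2020-03-14 00:00:00", "2020-03-15 00:00:00"],
    "Sentiment": [1, 1, -1, 0],
})
daily = align_news_to_trading_days(toy_news, ["2020-03-13", "2020-03-16"])
```

The weekend headlines all land on Monday 2020-03-16 and are averaged into a single daily score, which can then be merged with the returns on `Trading_Date`.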


### 3. Get transfer entropy

After constructing the two stochastic processes $I_{t}^{s}$ and $R_{t}^{s}$, we calculate the transfer entropy between them.

Before moving on, we **discretize** the sentiment scores back into the states (-1, 0, 1), because averaging them into daily scores produced fractional values. This adds one more parameter, the classification threshold $\beta$:

$$
I_t^s =
\begin{cases} 
1, & I_t^s > \beta \quad (\text{Positive sentiment}) \\
0, & -\beta \leq I_t^s \leq \beta \quad (\text{Neutral sentiment}) \\
-1, & I_t^s < -\beta \quad (\text{Negative sentiment})
\end{cases}
$$

The transfer entropy is revised to be a **lagged-$\Delta$ conditional/local transfer entropy**, where $\Delta$ is the time lag with which market participants react to a news release and $\lambda$ is the fixed time window over which the local/conditional entropy is computed.

$$
TE(I_{t}^{s}, R_{t+\Delta}^{s} | t - \lambda : t)
$$

Mapping the states (-1, 0, 1) to non-negative labels such as (0, 1, 2) does not change the transfer entropy, because the estimate depends only on how often each combination of states occurs, not on the numeric labels **(detail this part)**
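
The notebook's own `calculate_and_add_transfer_entropy` is not shown, so the block below is only a sketch of one common convention: a plug-in (frequency-count) estimator with target history length 1, where the source at time $t$ is paired with the target at $t + \Delta$ and its immediate past at $t + \Delta - 1$. The function names and the toy series are hypothetical. It also checks the relabelling claim numerically:

```python
from collections import Counter
from math import log2


def transfer_entropy(source, target, delta=1):
    """Plug-in estimate (in bits) of TE from source to target with lag delta."""
    if delta < 1:
        raise ValueError("delta must be a positive integer")
    # Each triple is (target future, target past, lagged source)
    triples = [
        (target[t + delta], target[t + delta - 1], source[t])
        for t in range(len(target) - delta)
    ]
    if not triples:
        return 0.0
    n_abc = Counter(triples)
    n_ab = Counter((a, b) for a, b, _ in triples)
    n_bc = Counter((b, c) for _, b, c in triples)
    n_b = Counter(b for _, b, _ in triples)
    total = len(triples)
    # sum p(a,b,c) * log2[ p(a|b,c) / p(a|b) ], written with raw counts
    return sum(
        (count / total) * log2(count * n_b[b] / (n_ab[a, b] * n_bc[b, c]))
        for (a, b, c), count in n_abc.items()
    )


def rolling_transfer_entropy(source, target, window, delta=1):
    """TE over the last `window` observations; NaN until the window is full."""
    source, target = list(source), list(target)
    out = [float("nan")] * len(target)
    for end in range(window, len(target) + 1):
        out[end - 1] = transfer_entropy(
            source[end - window:end], target[end - window:end], delta
        )
    return out


# Toy series: the target copies the source with a one-step delay
x = [-1, 1, 1, 0, 1, -1, 0, 1, 1, 0, -1, 1, 0, 0, 1, -1]
y = [0] + x[:-1]
te_signed = transfer_entropy(x, y)
te_shifted = transfer_entropy([v + 1 for v in x], [v + 1 for v in y])
```

Because only the counts of state combinations enter the formula, `te_signed` and `te_shifted` agree, and a constant source yields zero transfer entropy.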

### 4. Design trading strategy

Based on the sentiment process $I_{t}^{s}$ and the transfer entropy $TE(I_{t}^{s}, R_{t+\Delta}^{s} | t - \lambda : t)$, design a statistical arbitrage trading strategy **(only one idea)** as follows:

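````python
def trade_signal(sentiment_score, te, threshold):
    """Position to open at time t and close at time t + delta: 1, -1 or 0."""
    if sentiment_score > 0 and te > threshold:
        return 1   # buy stock at time t, sell at time t + delta
    elif sentiment_score < 0 and te > threshold:
        return -1  # short sell stock at time t, buy back at time t + delta
    else:
        return 0   # no trade
````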

The ``threshold`` can be written as $\alpha$ for further demonstration.

The strategy can also be revised into a long-short strategy.
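
To make the rule concrete, here is a vectorized sketch of the resulting strategy returns. It is not the notebook's `apply_trading_strategy`. It assumes the column names `Log_Return`, `Sentiment_Discretized`, and `Rolling_TE`, that sentiment is coded as -1/0/1 (not the remapped 0/1/2), that `Log_Return` on row $t$ is the return from $t-1$ to $t$, and it ignores overlapping positions and transaction costs. The function name is hypothetical:

```python
import numpy as np
import pandas as pd


def strategy_log_returns(df: pd.DataFrame, alpha: float, delta: int = 1) -> pd.Series:
    """Log-return earned by the position opened on each row and closed delta rows later."""
    trade = df["Rolling_TE"] > alpha  # NaN compares as False, so no trade
    signal = np.sign(df["Sentiment_Discretized"]).where(trade, 0)
    # Sum of log-returns over rows t+1 .. t+delta, aligned to row t
    holding = df["Log_Return"].rolling(delta).sum().shift(-delta)
    # Positions that cannot be closed inside the sample contribute 0
    return (signal * holding).fillna(0.0)


toy = pd.DataFrame({
    "Log_Return": [0.0, 0.01, -0.02, 0.03],
    "Sentiment_Discretized": [1, -1, 0, 1],
    "Rolling_TE": [0.5, 0.5, 0.5, 0.1],
})
one_day = strategy_log_returns(toy, alpha=0.2, delta=1)
```

On the toy data the long opened on row 0 earns the next row's return, the short opened on row 1 earns the negative of the following return, and the last row does not trade because its transfer entropy is below $\alpha$.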

In [None]:
# Initial values of the parameters
initial_gamma = 0.01
beta_initial = 0.3
delta_return_initial = 1
lambda_initial = 10
alpha_initial = 0.2

# Define transformations as a list of (function, keyword arguments) pairs
transformations = [
    (classify_returns, {
        'column_name': 'Log_Return',
        'gamma': initial_gamma,
    }),

    #(process_and_merge_data, {'sentiment_data' : sentiment_data}),

    (discretize_sentiment_column, {'beta': beta_initial}),

    (calculate_and_add_transfer_entropy, {
        'source_col': 'Sentiment_Discretized',
        'target_col': 'Return_Label',
        'window_size': lambda_initial,
        'delta': delta_return_initial,
    }),

    (apply_trading_strategy, {
        'alpha': alpha_initial,
        'delta': delta_return_initial,
    }),
]

In [None]:
#Applying the pipeline
final_df = optimization_pipeline(intermediate_data, transformations)
final_df



Unnamed: 0,Date,Close,Log_Return,Trading_Date,Sentiment,Return_Label,Sentiment_Discretized,Rolling_TE,Strategy_Return,Cumulative_Return
0,2020-03-10,71.334999,0.069546,2020-03-10,0.000000,2,1,,0.0,1.000000
1,2020-03-11,68.857498,-0.035348,2020-03-11,0.142857,0,1,,0.0,1.000000
2,2020-03-12,62.057499,-0.103978,2020-03-12,0.000000,0,1,,0.0,1.000000
3,2020-03-13,69.492500,0.113157,2020-03-13,0.375000,2,2,,0.0,1.000000
4,2020-03-16,60.552502,-0.137708,2020-03-16,-0.500000,0,0,,0.0,1.000000
...,...,...,...,...,...,...,...,...,...,...
211,2023-01-23,141.110001,0.023229,2023-01-23,0.200000,2,1,0.12725,0.0,0.887675
212,2023-01-24,142.529999,0.010013,2023-01-24,0.000000,2,1,0.00000,0.0,0.887675
213,2023-01-25,141.860001,-0.004712,2023-01-25,0.166667,1,1,0.00000,0.0,0.887675
214,2023-01-26,143.960007,0.014695,2023-01-26,0.250000,2,1,0.00000,0.0,0.887675


### 5. Calibration and train-validation-test split

Based on the arbitrage strategy, there are several hyperparameters to calibrate: $\beta, \gamma, \Delta, \lambda, \alpha$. They should be calibrated on the in-sample training and validation sets, and the strategy should then be traded in simulation on the out-of-sample test set. The combined train-validation-test window has a fixed size and rolls forward through time in steps equal to the size of the test set, so that each calibration uses only the most recent data and the test segments together form a trading record over the whole dataframe.

Within a single train-validation-test window, the best hyperparameters for each stock $s$ are those that maximize the Sharpe ratio of the strategy on the **validation set**. The search is done by grid search.
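
The calibration code itself lives in `utils/calibration.py` and is not shown. The sketch below illustrates the two ingredients described above under my own assumptions: a generator of rolling train/validation/test index ranges whose default step equals the test size, and an exhaustive grid search over a scoring function. All names (`sharpe_ratio`, `rolling_splits`, `grid_search`) are hypothetical, and the Sharpe ratio is the plain per-period mean over standard deviation without annualization or a risk-free rate:

```python
from itertools import product

import numpy as np


def sharpe_ratio(returns) -> float:
    """Mean / sample standard deviation of per-period returns (0 if undefined)."""
    r = np.asarray(returns, dtype=float)
    sd = r.std(ddof=1) if r.size > 1 else 0.0
    return 0.0 if sd == 0 else float(r.mean() / sd)


def rolling_splits(n_rows, window_size, train_ratio=0.6, val_ratio=0.2, step_size=None):
    """Yield (train, val, test) index ranges for each fixed-size rolling window."""
    n_train = round(window_size * train_ratio)
    n_val = round(window_size * val_ratio)
    n_test = window_size - n_train - n_val
    step = step_size or n_test  # default: move by exactly one test set
    for start in range(0, n_rows - window_size + 1, step):
        val_start = start + n_train
        test_start = val_start + n_val
        yield (range(start, val_start),
               range(val_start, test_start),
               range(test_start, start + window_size))


def grid_search(param_grid, score_fn):
    """Evaluate score_fn on every parameter combination and keep the best one."""
    keys = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


splits = list(rolling_splits(n_rows=14, window_size=10))
```

In a real run, `score_fn` would apply the pipeline with the candidate parameters to the training and validation rows and return the validation Sharpe ratio; the chosen parameters are then evaluated once on the test rows.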

**Important**: When calibrating the model, use **memoisation** techniques and note the difference in run time.
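
One way memoisation can help, sketched with the standard library: in a grid search, the rolling transfer entropy depends on the window and the lag but not on $\alpha$, so it only needs to be computed once per (window, lag) pair. The function `expensive_step` below is a hypothetical stand-in for that computation, and the call counter exists only to show the effect:

```python
from functools import lru_cache

calls = {"n": 0}


@lru_cache(maxsize=None)
def expensive_step(window: int, delta: int) -> int:
    """Stand-in for a costly computation that does not depend on alpha."""
    calls["n"] += 1
    return window * 10 + delta


# Grid over alpha and window: alpha never reaches the cached function
for alpha in (0.1, 0.2, 0.3):
    for window in (5, 10):
        expensive_step(window, 1)

cache_stats = expensive_step.cache_info()
```

Six grid points trigger only two real computations; the other four are cache hits. `lru_cache` requires hashable arguments, so dataframes have to be kept outside the cached function or referenced through a hashable key.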

In [None]:
# Example usage:
transformations = []  # if you have extra steps to do outside the param search
initial_params = {
    'gamma': 0.01,
    'beta': 0.3,
    'delta': 1,
    'lambda': 10,
    'alpha': 0.2
}

summary_df = rolling_calibration_single_bar_summary(
    data=intermediate_data,
    transformations=transformations,
    initial_params=initial_params,
    window_size=100,
    step_size=1,
    train_ratio=0.6,
    val_ratio=0.2,
    test_ratio=0.2,
    optimize_params_func=optimize_params_func
)

In [None]:
summary_df

Unnamed: 0,start_idx,best_params,best_val_sharpe,test_sharpe,test_mean_return,test_cum_return
0,0,"{'gamma': 0.005, 'beta': 0.5, 'delta': 1, 'lam...",0.299292,-0.110744,-0.001304,-0.027040
1,1,"{'gamma': 0.005, 'beta': 0.5, 'delta': 1, 'lam...",0.299292,-0.073924,-0.000860,-0.018337
2,2,"{'gamma': 0.01, 'beta': 0.5, 'delta': 1, 'lamb...",0.288552,0.000000,0.000000,0.000000
3,3,"{'gamma': 0.01, 'beta': 0.5, 'delta': 1, 'lamb...",0.250011,0.000000,0.000000,0.000000
4,4,"{'gamma': 0.01, 'beta': 0.3, 'delta': 1, 'lamb...",0.051968,-0.188797,-0.000241,-0.004823
...,...,...,...,...,...,...
112,112,"{'gamma': 0.01, 'beta': 0.3, 'delta': 2, 'lamb...",0.456870,0.247479,0.000556,0.011122
113,113,"{'gamma': 0.01, 'beta': 0.5, 'delta': 1, 'lamb...",0.412017,0.000000,0.000000,0.000000
114,114,"{'gamma': 0.01, 'beta': 0.5, 'delta': 1, 'lamb...",0.412017,0.000000,0.000000,0.000000
115,115,"{'gamma': 0.005, 'beta': 0.3, 'delta': 1, 'lam...",0.350853,-0.223607,-0.000030,-0.000599


### 6. Final Result