# Stock Market Prediction using Textual Data - project write-up

## Motivation
Fundamental analysis is often neglected in stock trading, especially by amateur traders, because it is difficult; many people rely instead on technical analysis, which needs only the price chart of a stock. Yet fundamental analysis is a cornerstone of investment strategy: it aims to evaluate the intrinsic value of a stock from the company's financial health, and it requires comprehensive monitoring of the company's news, annual reports, corporate events, etc. This information is entirely textual, so extracting meaningful insights from it manually is cumbersome. NLP models, however, can exploit this kind of textual information to predict the trend of a stock. Ideally, such models would be trained on large-scale data (e.g. economic and geopolitical news from the web) and would track the trend of each stock over both the short and the long term. In this project, we pursue a narrower goal: a model that analyzes stock-specific news and predicts the temporary mispricing that may follow a corporate event or any other influential news about a specific stock.

This project is motivated by the paper [Trade the Event: Corporate Events Detection for News-Based Event-Driven Trading](https://arxiv.org/abs/2105.12825) [1], whose code is also available in a [GitHub](https://github.com/Zhihan1996/TradeTheEvent) repository. The authors of the paper provide three datasets for stock market applications. These datasets are covered in more detail in the **Data** section.

## Data
Stock market data is noisy, so assembling a dataset can be challenging. The authors of [1] prepared three datasets, each suited to a different phase of training or testing an NLP model. Together they are called **_EDT_** and are available [here](https://github.com/Zhihan1996/TradeTheEvent/tree/main/data). We briefly explain each of the three datasets below:

1. **Domain Adaptation**:
This portion contains a corpus of financial news articles and a financial terms encyclopedia. It can be used for a simple language modeling task, training NLP models via masked or next-token prediction. It comes with a train.txt and a dev.txt on which the perplexity of the model can be computed. Using this dataset, the model can be fine-tuned and adapted to the stock market and finance domain so that it performs better on the downstream tasks.

2. **Corporate Event Detection**:
This dataset consists of news articles with sequence-level and token-level tags. The tags cover 11 types of corporate events plus a special category "O" that stands for no event. The 11 events are: Acquisition(A), Clinical Trial(CT), Regular Dividend(RD), Dividend Cut(DC), Dividend Increase(DI), Guidance Increase(GI), New Contract(NC), Reverse Stock Split(RSS), Special Dividend(SD), Stock Repurchase(SR) and Stock Split(SS).
Reverse Stock Split(RSS) and Dividend Cut(DC) are claimed to have a negative impact on the price, Regular Dividend(RD) a neutral impact, and the rest a positive impact. This dataset also comes with a train.txt and a dev.txt.

3. **Trading Benchmark**:
The benchmark dataset contains `303893` news articles ranging from `2020/03/01` to `2021/05/06`. Each article is saved as a dictionary with the following keys, including the price labels associated with the ticker that the article is about:
```
'title': Title and possible subtitle of the news article.
'text': Main text of the news article.
'pub_time': Adjusted minute-level timestamp of the article's publish time.
'labels': [
	'ticker': An automatically recognized ticker of the company that occurs most in the article. 
	          The following price labels come from the price data of this ticker. 
	          If no ticker is recognized, the value is empty and there are no price labels.
	'start_time': The first available trading time of the ticker after the news is published.
	'start_price_open': The "Open" price of the ticker at 'start_time'.
	'start_price_close': The "Close" price of the ticker at 'start_time'.
	'end_price_1day': The "Close" price at the last minute of the following 1 trading day.
	                  The "following 1 trading day" refers to the same day as the "start_time"
	                  if 'start_time' is earlier than 4pm ET. Otherwise, it refers to the next 
	                  trading day. And so on for "following n trading days".
	'end_price_2day': The "Close" price at the last minute of the following 2 trading days.
	'end_price_3day': The "Close" price at the last minute of the following 3 trading days.
	'end_time_1day': The time corresponds to 'end_price_1day'.
	'end_time_2day': The time corresponds to 'end_price_2day'.
	'end_time_3day': The time corresponds to 'end_price_3day'.
	'highest_price_1day': The highest price in the following 1 trading day.
	'highest_price_2day': The highest price in the following 2 trading days.
	'highest_price_3day': The highest price in the following 3 trading days.
	'highest_time_1day': The time corresponds to 'highest_price_1day'.
	'highest_time_2day': The time corresponds to 'highest_price_2day'.
	'highest_time_3day': The time corresponds to 'highest_price_3day'.
	'lowest_price_1day': The lowest price in the following 1 trading day.
	'lowest_price_2day': The lowest price in the following 2 trading days.
	'lowest_price_3day': The lowest price in the following 3 trading days.
	'lowest_time_1day': The time corresponds to 'lowest_price_1day'.
	'lowest_time_2day': The time corresponds to 'lowest_price_2day'.
	'lowest_time_3day': The time corresponds to 'lowest_price_3day'.
] 
```
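
The event-to-impact claims listed for the *Corporate Event Detection* dataset can be written down as a small lookup table. This is only an illustrative sketch: the name `EVENT_IMPACT`, the helper `expected_impact` and the +1/0/-1 encoding are assumptions made for this example and are not taken from the code of [1].

```python
# Hypothetical encoding of the claimed price impact per event tag:
# +1 = positive, 0 = neutral, -1 = negative.
EVENT_IMPACT = {
    "A": 1,     # Acquisition
    "CT": 1,    # Clinical Trial
    "RD": 0,    # Regular Dividend
    "DC": -1,   # Dividend Cut
    "DI": 1,    # Dividend Increase
    "GI": 1,    # Guidance Increase
    "NC": 1,    # New Contract
    "RSS": -1,  # Reverse Stock Split
    "SD": 1,    # Special Dividend
    "SR": 1,    # Stock Repurchase
    "SS": 1,    # Stock Split
}


def expected_impact(event_tag):
    """Return the claimed impact for a tag, or None for "O" and unknown tags."""
    return EVENT_IMPACT.get(event_tag)
```

An article tagged "O" yields `None` here, which mirrors the fact that no trend can be derived when no event is detected.
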

The authors of [1] used the third set only as a benchmark to measure the accuracy of their method. We instead split this dataset into 85% training data and 15% validation (or test) data and used the training part to train our models. This procedure is described in more detail in the following sections.
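
As a rough illustration of the 85%/15% split, the sketch below shuffles the data with a fixed seed and cuts it at the 85% mark. The function name `split_dataset`, the random shuffling and the seed are assumptions for this example; the write-up does not state how the split was drawn (for instance, whether it was random or chronological).

```python
import random


def split_dataset(items, train_fraction=0.85, seed=0):
    """Shuffle a copy of `items` and split it into (train, validation) lists."""
    shuffled = list(items)
    # A dedicated Random instance keeps the split reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train_items, val_items = split_dataset(range(100))
print(len(train_items), len(val_items))
```
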

We also noticed that some data points lack either the `ticker` or the prices associated with that ticker. The following block shows that `197,274` data points are invalid for our purposes, which leaves `106,619` usable data points.

In [64]:
import json
json_file_path = 'data/Trading_benchmark/evaluate_news.json'
with open(json_file_path, 'r') as json_file:
    data = json.load(json_file)

invalid = 0
for item in data:
    labels = item['labels']
    # A data point is unusable if the ticker is missing or empty,
    # or if no price labels were attached to it.
    if ('ticker' not in labels
            or not labels['ticker']
            or 'start_price_open' not in labels):
        invalid += 1
print(invalid)

197274


## Approach
In this part, we first explain the approach introduced in [1] and then explain our hypothesis and our proposed approach:

### Method of [1]
In [1], a BERT model is used to detect corporate events, and the market trend is then predicted from the detected event; the impact of each event type was described earlier. The authors of [1] first fine-tuned the BERT model on the *Domain Adaptation* dataset to adapt it to the stock market domain. They then trained a bi-level BERT-style model on the *Corporate Event Detection* dataset to classify the text as one of the corporate events. As the image below shows, every token is classified into an event, as is the CLS token, and the combination of the two provides the final corporate event class.

<center>

![Event Detection model](images/event_prediction.png)

</center>

The code of this model is provided below:

In [65]:
import torch
import torch.nn as nn
from torch.nn import CrossEntropyLoss, BCEWithLogitsLoss
from transformers import BertModel, BertPreTrainedModel

class BertForBilevelClassification(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels
        self.max_seq_length = config.max_seq_length
        self.final_dropout1 = nn.Dropout(config.hidden_dropout_prob)
        self.final_classifier1 = nn.Linear(config.hidden_size + self.max_seq_length * self.num_labels, 2048)
        self.final_dropout2 = nn.Dropout(config.hidden_dropout_prob)
        self.final_classifier2 = nn.Linear(2048, self.num_labels - 1)

        self.bert = BertModel(config)

        self.seq_dropout = nn.Dropout(config.hidden_dropout_prob)
        self.ner_dropout = nn.Dropout(config.hidden_dropout_prob)

        self.ner_classifier1 = nn.Linear(config.hidden_size, 2048)
        self.ner_classifier2 = nn.Linear(2048, self.num_labels)
        self.dropout1 = nn.Dropout(config.hidden_dropout_prob)
        self.dropout2 = nn.Dropout(config.hidden_dropout_prob)

        self.init_weights()

    def forward(self,
                input_ids=None,
                attention_mask=None,
                token_type_ids=None,
                position_ids=None,
                head_mask=None,
                inputs_embeds=None,
                seq_labels=None,
                ner_labels=None,
                output_attentions=None,
                output_hidden_states=None,
                return_dict=None):

        outputs = self.bert(input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids,
                            position_ids=position_ids,
                            head_mask=head_mask,
                            inputs_embeds=inputs_embeds,
                            output_attentions=output_attentions,
                            output_hidden_states=output_hidden_states,
                            return_dict=return_dict
                            )

        sequence_output = outputs[0]
        sequence_output = self.ner_dropout(sequence_output)
        ner_logits = self.ner_classifier1(sequence_output)
        ner_logits = self.dropout1(ner_logits)
        ner_logits = self.ner_classifier2(ner_logits)


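        # Overwrite the logits at padded positions with a fixed one-hot vector
        # on the last label index, so padding contributes a constant pattern.
        pad_pred = torch.zeros([self.num_labels], device=ner_logits.device, dtype=ner_logits.dtype)
        pad_pred[-1] = 1
        ner_logits[attention_mask == 0] = pad_pred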

        final_input = ner_logits.view([ner_logits.shape[0], -1])
        final_input = torch.cat((outputs[1], final_input), dim=1)

        seq_logits = self.final_dropout1(final_input)
        seq_logits = self.final_classifier1(seq_logits)
        seq_logits = self.final_dropout2(seq_logits)
        seq_logits = self.final_classifier2(seq_logits)

        outputs = (ner_logits,) + (seq_logits,) + outputs[2:]  # add hidden states and attention if they are here
        if ner_labels is not None:
            # calculate ner loss
            ner_loss_fct = CrossEntropyLoss()
            if attention_mask is not None:
                active_loss = attention_mask.view(-1) == 1
                active_logits = ner_logits.view(-1, self.num_labels)
                active_labels = torch.where(
                    active_loss, ner_labels.view(-1), torch.tensor(ner_loss_fct.ignore_index).type_as(ner_labels)
                )
                ner_loss = ner_loss_fct(active_logits, active_labels)
            else:
                ner_loss = ner_loss_fct(ner_logits.view(-1, self.num_labels), ner_labels.view(-1))

            # calculate sequence classification loss
            seq_loss_fct = BCEWithLogitsLoss()
            seq_loss = seq_loss_fct(seq_logits, seq_labels[:, :-1])
            loss = ner_loss + seq_loss

            outputs = (loss,) + outputs

        return outputs

### Our Approach
We believe that any news item, whether or not it reports a corporate event, can potentially affect the stock price, so we do not want to limit ourselves to corporate events when predicting the stock trend. We therefore use the *Trading Benchmark* dataset, which contains the price labels, to predict the price trend directly. We first trained a simple BERT sequence classifier on the corporate events, using only the CLS token to predict the class, and then used its weights to initialize the models for stock trend prediction. We trained three models with different architectures and purposes, each described in more detail below:

#### Trend Prediction model
This model has roughly the same architecture as the model of [1]. Here, however, the output of the CLS token is 11 numbers corresponding to the 11 corporate events, and the hidden representations of the remaining tokens are each mapped to 3 features by the first FF layer. The second FF layer combines all of these features to predict the three trend classes: *upward*, *downward* and *neutral*. Our reasoning was that initializing the model with the pre-trained corporate event detector weights could help it account for the impact of both corporate and non-corporate events. The ground-truth labels are generated from the difference between `start_price_open` and `end_price_3day`: if the change exceeds +3% the class is *upward*, if it falls below -3% the class is *downward*, and otherwise it is *neutral*. An image of the architecture and the code of this model are provided below.

<center>

![Trend Prediction model](images/trend_prediction.png)

</center>

In [67]:
from utils.model import BertForSequenceClassification

class StockBertForSequenceClassification(BertForSequenceClassification):
    def __init__(self, config):
        super().__init__(config)

        self.stock_labels = config.stock_labels
        self.max_seq_length = config.max_seq_length
        self.token_classifier = nn.Linear(config.hidden_size, self.stock_labels)
        self.token_activation = nn.GELU()
        self.final_classifier1 = nn.Linear(config.num_labels + self.max_seq_length * self.stock_labels, 2048)
        self.final_dropout = nn.Dropout(config.hidden_dropout_prob)
        self.final_activation = nn.GELU()
        self.final_classifier2 = nn.Linear(2048, self.stock_labels)

        self.init_weights()

    def forward(self,
                input_ids=None,
                attention_mask=None,
                token_type_ids=None,
                position_ids=None,
                head_mask=None,
                inputs_embeds=None,
                labels=None,
                output_attentions=None,
                output_hidden_states=None,
                return_dict=None):

        outputs = self.bert(input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids,
                            position_ids=position_ids,
                            head_mask=head_mask,
                            inputs_embeds=inputs_embeds,
                            output_attentions=output_attentions,
                            output_hidden_states=output_hidden_states,
                            return_dict=return_dict
                            )

        pooled_output = outputs[1]
        pooled_output = self.seq_dropout(pooled_output)
        seq_logits = self.ner_classifier1(pooled_output)
        seq_logits = self.dropout2(seq_logits)
        seq_logits = self.ner_classifier2(seq_logits)

        token_logits = self.token_classifier(outputs[0])
        token_logits = self.token_activation(token_logits)

        final_input = torch.cat((seq_logits, token_logits.view(token_logits.shape[0], -1)), dim=1)

        final_logits = self.final_classifier1(final_input)
        final_logits = self.final_dropout(final_logits)
        final_logits = self.final_activation(final_logits)
        final_logits = self.final_classifier2(final_logits)
        final_logits = nn.functional.softmax(final_logits, dim=1)

        outputs = (final_logits,) + (seq_logits,) + outputs[2:]  # add hidden states and attention if they are here

        if labels is not None:
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(final_logits, labels)

            outputs = (loss,) + outputs

        return outputs
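
The labeling rule used for this model (a 3% threshold on the change between `start_price_open` and `end_price_3day`) can be sketched as follows. This is a minimal illustration, not the project's data pipeline: the function name `trend_label`, the string labels and the choice to measure the change relative to the opening price are assumptions.

```python
def trend_label(start_price_open, end_price_3day, threshold=0.03):
    """Map a 3-day price move to 'upward', 'downward' or 'neutral'."""
    # Relative change with respect to the opening price (assumed convention).
    change = (end_price_3day - start_price_open) / start_price_open
    if change > threshold:
        return "upward"
    if change < -threshold:
        return "downward"
    return "neutral"


print(trend_label(100.0, 104.0))  # upward
print(trend_label(100.0, 96.0))   # downward
print(trend_label(100.0, 102.0))  # neutral
```
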

#### Simple Trend Prediction model
In this model, we remove the first level of FF and classify the text into the three trend classes using only the CLS token. This model is also initialized with the pre-trained corporate event detector weights:

<center>

![Simple Trend Prediction model](images/simple_trend_prediction.png)

</center>

In [68]:
class BertForSimpleSequenceClassification(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels

        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)

        self.init_weights()

    def forward(self,
                input_ids=None,
                attention_mask=None,
                token_type_ids=None,
                position_ids=None,
                head_mask=None,
                inputs_embeds=None,
                labels=None,
                output_attentions=None,
                output_hidden_states=None,
                return_dict=None):
        outputs = self.bert(input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids,
                            position_ids=position_ids,
                            head_mask=head_mask,
                            inputs_embeds=inputs_embeds,
                            output_attentions=output_attentions,
                            output_hidden_states=output_hidden_states,
                            return_dict=return_dict
                            )
        pooled_output = outputs[1]

        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        outputs = (logits,) + outputs[2:]  # add hidden states and attention if they are here
        if labels is not None:
            # calculate sequence classification loss
            loss_fct = CrossEntropyLoss()
            loss = loss_fct(logits, labels)

            outputs = (loss,) + outputs

        return outputs

#### Price Prediction model
This model is similar to the previous one, except that it is a regression model intended to predict the percentage price change directly. In other words, we do not convert the ground truth into the three trend classes before training; instead, we feed the model the exact percentage price change and use the MSE loss between the prediction and the ground truth. At inference time, a predicted change above +3% is assigned to the *upward* class, a predicted change below -3% to the *downward* class, and anything in between to *neutral*. This way, the model can learn more from severe price changes following the respective news:

<center>

![Price Prediction model](images/price_prediction.png)

</center>

In [70]:
class BertForSequenceRegression(BertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels

        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        
        self.init_weights()

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
    ):

        outputs = self.bert(input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids,
                            position_ids=position_ids,
                            head_mask=head_mask,
                            inputs_embeds=inputs_embeds,
                            output_attentions=output_attentions,
                            output_hidden_states=output_hidden_states,
                            return_dict=return_dict,
                            )

        pooled_output = outputs[1]

        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)
        output = (logits,) + outputs[2:]

        loss = None
        if labels is not None:
            #  We are doing regression
            loss_fct = nn.MSELoss()
            loss = loss_fct(logits.view(-1), labels.view(-1))
            output = (loss,) + output

        return output
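
To illustrate how a regression output can be scored as a 3-class prediction at inference time, the sketch below thresholds both the predicted and the true relative changes and counts the matches. The helper names, the integer class encoding (0 = downward, 1 = neutral, 2 = upward) and the plain-list inputs are assumptions for this example; it is not the implementation behind `check.py`.

```python
def to_trend_class(change, threshold=0.03):
    """Threshold a relative price change into 0 (down), 1 (neutral), 2 (up)."""
    if change > threshold:
        return 2
    if change < -threshold:
        return 0
    return 1


def three_class_accuracy(pred_changes, true_changes, threshold=0.03):
    """Fraction of samples whose thresholded prediction matches the truth."""
    hits = sum(
        to_trend_class(p, threshold) == to_trend_class(t, threshold)
        for p, t in zip(pred_changes, true_changes)
    )
    return hits / len(true_changes)


preds = [0.05, -0.01, -0.04, 0.00]
truth = [0.04, 0.02, 0.01, -0.06]
print(three_class_accuracy(preds, truth))  # 0.5
```

Because the threshold is only applied after training, it can be changed at inference time without retraining the model.
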

## Code
As mentioned earlier, our code builds on the code of [1] provided in this [GitHub](https://github.com/Zhihan1996/TradeTheEvent) repository. `data.py` and `model.py` in the `utils` folder are the most important files and contain the dataset-related and model-related code. `run.domainadapt.py` trains the BERT model for domain adaptation on the first dataset. `run_event.py` is the script developed by [1] to train their models on the corporate event dataset; we used it to train the base BERT model that initializes our proposed models. `run_full.py` is the script for training and evaluating our proposed models. There is also a `check.py` file for cross-checking the results: we store the predicted labels generated by our models together with the ground truths and use `check.py` to compute the accuracy of our models on the 3-class trend classification task. Detailed instructions for reproducing the results are provided in the `README.md` file.

## Experimental Setup
We first trained a \<MASKED\> token prediction BERT model on the domain adaptation dataset for 20 epochs, with the learning rate set to 3e-5. We then initialized the BERT backbone of a sequence classification model with the resulting weights and fine-tuned it to predict the corporate event class from the CLS token. This second phase ran for 4 epochs with an initial learning rate of 5e-5. In the last phase, we trained our proposed models, initialized with the weights obtained in phase 2, for 5 epochs on 85% of the *Trading Benchmark* dataset with an initial learning rate of 5e-5. In phases 2 and 3, we used a linear learning rate scheduler with warm-up steps. AdamW was the optimizer in all phases. We compared our models with the model presented in [1] using the two metrics described below:

1. **Accuracy**:
The typical accuracy of the model in detecting the correct class. It is calculated as the number of correctly detected classes divided by the total number of samples.

2. **RPS (Ranked Probability Score)**:
This metric is useful when the classes are ordered, so that some errors are worse than others. For instance, in predicting the result of a football match, predicting "win" is worse than predicting "draw" when the correct result is "loss". The same holds for stock market prediction: when the price trend is *upward*, predicting *downward* is worse than predicting *neutral*, and when the price trend is *downward*, predicting *upward* is worse than predicting *neutral*. With $r$ ordered classes, predicted probabilities $p_j$ and one-hot ground truth $y_j$, the RPS is calculated with the following equation (the lower the `RPS`, the better the result):

$$
RPS=\frac{1}{r-1} \sum_{i=1}^{r}\left(\sum_{j=1}^{i}(p_j - y_j) \right)^2
$$
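
The equation above can be evaluated for a single sample with a few lines of Python. This is a minimal sketch: the function name `ranked_probability_score` and the class order (downward, neutral, upward) are assumptions made for the example, and how the score is aggregated over a dataset is not covered here.

```python
def ranked_probability_score(probs, outcome_index):
    """RPS of one prediction over ordered classes.

    probs: predicted probabilities, one per class, in class order.
    outcome_index: index of the true class (one-hot ground truth).
    """
    r = len(probs)
    cum_pred = 0.0
    cum_true = 0.0
    total = 0.0
    for i in range(r):
        # Running sums give the cumulative distributions up to class i.
        cum_pred += probs[i]
        cum_true += 1.0 if i == outcome_index else 0.0
        total += (cum_pred - cum_true) ** 2
    return total / (r - 1)


# Assumed class order: 0 = downward, 1 = neutral, 2 = upward; truth is upward.
print(ranked_probability_score([1.0, 0.0, 0.0], 2))  # 1.0
print(ranked_probability_score([0.0, 1.0, 0.0], 2))  # 0.5
print(ranked_probability_score([0.0, 0.0, 1.0], 2))  # 0.0
```

The three calls show the ordering effect described above: when the truth is *upward*, a confident *downward* prediction is penalized twice as much as a confident *neutral* one.
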

## Results
First, the model trained in phase 1 reached a perplexity of `3.1431` on dev.txt. In phase 2, our simple classification model reached a best accuracy of `98.87%`, the same as we obtained with the bi-level classification model proposed in [1]. This suggests that combining token-level classification with the classification done on the CLS token does not help here. This is also intuitive: individual tokens do not seem to carry the corporate event information, whereas the whole context does. The representation of the CLS token is therefore enough to extract the corporate event present in the text.
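
For reference, perplexity is the exponential of the mean per-token negative log-likelihood (the cross-entropy measured in nats). The sketch below shows only this relationship; the function name `perplexity` and the list-of-losses input are assumptions, and it is not the evaluation code used in phase 1.

```python
import math


def perplexity(token_nlls):
    """Perplexity from per-token negative log-likelihoods (in nats)."""
    mean_nll = sum(token_nlls) / len(token_nlls)
    return math.exp(mean_nll)


# A model that is uniformly unsure among 4 tokens has perplexity of about 4.
uniform_nlls = [math.log(4.0)] * 5
print(perplexity(uniform_nlls))
```
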
Below, we report the results of our models and of the model of [1] on the test set of the *Trading Benchmark* dataset. To reproduce these results, extract the contents of the `output.zip` file into a folder named `models`. To obtain the RPS results, run the `run_full.py` file with the `--do_predict` argument; this requires the model weights.

In [1]:
from check import check_regression, check_3class

#### Method of [1]

In [2]:
check_3class('models/bilevel/results/labels.npy' , 'models/bilevel/results/model_preds.npy')
# Accuracy = 52.54
# RPS = 0.51

2023-12-12 02:44:36,540 - check - INFO - Accuracy: 52.54


#### Trend Prediction model

In [3]:
check_3class('models/stock_pred/results/labels.npy' , 'models/stock_pred/results/model_preds.npy')
# Accuracy = 53.01
# RPS = 0.49

2023-12-12 02:46:00,061 - check - INFO - Accuracy: 53.01


#### Simple Trend Prediction model

In [4]:
check_3class('models/simple_cls/results/labels.npy' , 'models/simple_cls/results/model_preds.npy')
# Accuracy = 53.61
# RPS = 0.41

2023-12-12 02:47:19,523 - check - INFO - Accuracy: 53.61


#### Price Prediction model

In [6]:
check_regression('models/regression/results/labels.npy' , 'models/regression/results/model_preds.npy', threshold=0.03)
# Accuracy = 53.89
# RPS = 0.50
# MSE = 0.018

2023-12-12 02:49:16,665 - check - INFO - Accuracy_3class: 53.89
2023-12-12 02:49:16,668 - check - INFO - Accuracy_2class: 54.34
2023-12-12 02:49:16,669 - check - INFO - MSE=0.018


## Analysis of the Results
According to the results above, the model proposed in [1] performs worst, although the differences between the models are not significant. Still, the results suggest that using only corporate events to predict the trend might not be very effective, since in many cases the news is classified as "no-event", from which the trend cannot be predicted. A more general model like ours, however, can exploit the representation of the whole text in whatever way it finds useful, not necessarily through corporate event detection, to predict the trend.

The second conclusion is that the CLS token's representation is enough for trend classification, as the "Simple Trend Prediction model" and the "Price Prediction model" perform slightly better than the "Trend Prediction model".

Finally, the better accuracy of the "Price Prediction model" compared to the "Simple Trend Prediction model" suggests that using regression instead of classification during training might be more effective. A regression model can still be used as a classifier in deployment, and it also offers the flexibility of choosing a suitable threshold at inference time.

## Future Works
We suggest the following items as future works:

1. Using other types of models, such as `gpt`, as the base transformer model.

2. The *Trading Benchmark* dataset is too small given the breadth of news types and their impact on the stock market. Extending the dataset could therefore improve the performance of the models.

3. Effective data filtering could help as well, since stock market data is very noisy and some news may be entirely irrelevant to the price, which can mislead the model.

4. Using a text summarization model to summarize each article first and feeding the predictor model a more condensed text. This would reduce the effect of tokens that might fool the model, and it would significantly shorten the input, which in turn reduces the time and resources required for training. Overall, the model could be trained more effectively.