## Financial news categorization/sentiment analysis using NLP techniques


Sentiment analysis is the statistical analysis of simple sentiment
cues. Essentially, it involves making statistical analyses on polarized
statements (i.e., statements with a positive, negative and neutral sen
timent), which are usually collected in the form of social media posts,
reviews, and news articles. Financial sentiment analysis is a challenging task due to the specialized language and lack of labeled data in that domain.


In our case, we will focus on two different tasks.


1. **Category tagger**: Create a NLP classifier capable of assigning a financial category to a text derived from the financial industry.

The Twitter Financial News dataset is an English-language dataset containing an annotated corpus of finance-related tweets. This dataset is used to classify finance-related tweets for their topic.

    The dataset holds 21,107 documents annotated with 20 labels:

topics = {
    "LABEL_0": "Analyst Update",
    "LABEL_1": "Fed | Central Banks",
    "LABEL_2": "Company | Product News",
    "LABEL_3": "Treasuries | Corporate Debt",
    "LABEL_4": "Dividend",
    "LABEL_5": "Earnings",
    "LABEL_6": "Energy | Oil",
    "LABEL_7": "Financials",
    "LABEL_8": "Currencies",
    "LABEL_9": "General News | Opinion",
    "LABEL_10": "Gold | Metals | Materials",
    "LABEL_11": "IPO",
    "LABEL_12": "Legal | Regulation",
    "LABEL_13": "M&A | Investments",
    "LABEL_14": "Macro",
    "LABEL_15": "Markets",
    "LABEL_16": "Politics",
    "LABEL_17": "Personnel Change",
    "LABEL_18": "Stock Commentary",
    "LABEL_19": "Stock Movement"
}

2. **Sentiment tagger**: Create a NLP classifier capable of assigning a sentiment score (positive,negative,neutral) to text derived from the financial industry. Additionally, we will use a powerful pre-trained model, finetuned on financial data, to assign scores to financial headlines, data from social media posts, etc ...


## Pre-requisites:


High level requirements of Python library.

    - Pytorch
    - HuggingFace Transformers library
    - Pandas
    - Numpy
    - Sklearn
    

## **Step 1: Pulling the data together**


Download and inspect the data from the various sources:

1. Financial Phrasebank https://huggingface.co/datasets/financial_phrasebank. Humanly annotated

2. Financial tweets topics dataset: https://huggingface.co/datasets/zeroshot/twitter-financial-news-topic/viewer/default/train?p=169. Humanly annotated

Think of any pre-processing functions (
    Converting the text to lowercase,
    removing punctuation,
    tokenizing the text,
    removing stop words and empty strings,
    lemmatizing tokens.
) that you might need to apply for downstream tasks. As always, pick a framework for data analysis and data exploration.

In [43]:
# Libraries import
import os
import pandas as pd
import nltk
import spacy
import string
nltk.download('punkt')
from sklearn.model_selection import train_test_split

#Tokenization
from nltk.tokenize import word_tokenize
from spacy.lang.en import English

#Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer

#Encoding
from sklearn.preprocessing import LabelEncoder

#Models
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report


[nltk_data] Downloading package punkt to /Users/daniel/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [44]:
# Datasets de categorizacion
cat_train = pd.read_csv('./categorizacion/topic_train.csv')
cat_test = pd.read_csv('./categorizacion/topic_valid.csv')

directory_sent='./sentimiento'
data_list = []

# Datasets de sentimiento
for filename in os.listdir(directory_sent):
    if filename.endswith(".txt"):
        file_path = os.path.join(directory_sent,filename)

        with open(file_path,'r',encoding='latin1') as file:
            lines = file.readlines()
            for line in lines:
                sentence, sentiment = line.rsplit('@',1)
                sentiment=sentiment.strip()
                data_list.append({'sentence':sentence,'sentiment': sentiment})
df = pd.DataFrame(data_list)
sent_train, sent_test = train_test_split(df, test_size=0.2, stratify=df['sentiment'], random_state=42)

# Output:
# cat_train,cat_test,sent_train,sent_test

In [45]:
def clean_stop_words(data):
    stop_words = spacy.lang.en.stop_words.STOP_WORDS
    lista = [palabra for palabra in data if palabra not in stop_words and palabra not in string.punctuation and palabra != '``']
    return lista

In [46]:
# Categorizacion - Tokenizacion
cat_train['tokenized_text']=cat_train['text'].apply(word_tokenize)
cat_test['tokenized_text']=cat_test['text'].apply(word_tokenize)

# Sentimiento - Tokenizacion
sent_train['tokenized_text']=sent_train['sentence'].apply(word_tokenize)
sent_test['tokenized_text']=sent_test['sentence'].apply(word_tokenize)

# Categorizacion - Stop Words
cat_train['cleaned_text']=cat_train['tokenized_text'].apply(clean_stop_words)
cat_test['cleaned_text']=cat_test['tokenized_text'].apply(clean_stop_words)

# Sentimiento - Stop Words
sent_train['cleaned_text']=sent_train['tokenized_text'].apply(clean_stop_words)
sent_test['cleaned_text']=sent_test['tokenized_text'].apply(clean_stop_words)

## **Step 2: Train and fine-tune various NLP classifiers on financial news datasets** 



#### **2.1 Let´s start with simple baseline (at your own choice)**. For example, build a logistic regression model based on pre-trained word embeddings or TF-IDF vectors of the financial news corpus **


Build a baseline model  with **Financial Phrasebank dataset**. What are the limitations of these baseline models?


#### **2.2 Compare the baseline with a pre-trained model that is specialized for the finance domain. Download and use the FinBERT model from Huggingfaces**

Model source: https://huggingface.co/ProsusAI/finbert

Once you have downloaded the model, run inference and compute performance metrics to get a sense of how the specialized pre-trained model fares against the baseline  model.  Use the HuggingFaces library to download the model and run inference on it. For large datasets or text sequences, CPU running time might be large.

For more information on the model: Araci, D. (2019). FinBERT: Financial Sentiment Analysis with Pre-trained Language Models.

#### **2.3 (Advanced) Fine-tune a pre-trained model such a base BERT model on a small labeled dataset**

General-purpose models are not effective enough because of the specialized language used in a financial context. We hypothesize that pre-trained language models can help with this problem because they require fewer labeled examples and they can be further trained on domain-specific corpora.

In recent years the NLP community has seen many breakthoughs in Natural Language Processing, especially the shift to transfer learning. Models like ELMo, fast.ai's ULMFiT, Transformer and OpenAI's GPT have allowed researchers to achieves state-of-the-art results on multiple benchmarks and provided the community with large pre-trained models with high performance. This shift in NLP is seen as NLP's ImageNet moment, a shift in computer vision a few year ago when lower layers of deep learning networks with million of parameters trained on a specific task can be reused and fine-tuned for other tasks, rather than training new networks from scratch.

One of the most significant milestones in the evolution of NLP recently is the release of Google's BERT, which is described as the beginning of a new era in NLP. In our case, we are going to explore a pre-trained model called FinBERT, already tuned with a financial corpus. I specifically recommend the HuggingFace library for easeness of implementation.

*What is HuggingFace?* Hugging Face’s Transformers is an open-source library that provides thousands of pre-trained models to perform various tasks on texts such as text classification, named entity recognition, translation, and more. The library has a unified, high-level API for these models and supports a wide range of languages and model architectures.


Here are various tutorials for finetuning BERT: https://drlee.io/fine-tuning-hugging-faces-bert-transformer-for-sentiment-analysis-69b976e6ac5d and https://skimai.com/fine-tuning-bert-for-sentiment-analysis/. I specially recommnend this one: http://mccormickml.com/2019/07/22/BERT-fine-tuning/

The dataset where to finetune a BERT related model can be found in the previous cell: **Financial tweets topics dataset** 

*ALERT*: Running or training a large language model like BERT or FinBERT might incur in large CPU processing times. Although BERT is very large, complicated, and have millions of parameters, we might only need to fine-tune it in only 2-4 epochs. You can also explore Google colab, for limited acces to free GPUs, which might best suited for this task., specially if training required.

Finally, compare the previous baseline with fine-tuned FinBERT

In [47]:
cat_train['cleaned_sentence'] = cat_train['cleaned_text'].apply(lambda x: ' '.join(x))
cat_test['cleaned_sentence'] = cat_test['cleaned_text'].apply(lambda x: ' '.join(x))

sent_train['cleaned_sentence'] = sent_train['cleaned_text'].apply(lambda x: ' '.join(x))
sent_test['cleaned_sentence'] = sent_test['cleaned_text'].apply(lambda x: ' '.join(x))

# Input of the model
cat_train_input=cat_train[['cleaned_sentence','label']]
cat_test_input=cat_test[['cleaned_sentence','label']]

sent_train_input=sent_train[['cleaned_sentence','sentiment']]
sent_test_input=sent_test[['cleaned_sentence','sentiment']]

#Vectorizing
vectorizer = TfidfVectorizer(max_features=1000)

In [48]:
sent_train_input['sentiment'].value_counts()

sentiment
neutral     7161
positive    3190
negative    1473
Name: count, dtype: int64

In [49]:
cat_train_tfid = vectorizer.fit_transform(cat_train_input['cleaned_sentence'])
y_cat_train=cat_train_input['label']

cat_test_tfid = vectorizer.fit_transform(cat_test_input['cleaned_sentence'])
y_cat_test=cat_test_input['label']

label_encoder = LabelEncoder()

sent_train_tfid = vectorizer.fit_transform(sent_train_input['cleaned_sentence'])
sent_train_input.loc[:,'sentiment_numeric']=label_encoder.fit_transform(sent_train_input['sentiment'])
y_sent_train = sent_train_input['sentiment_numeric']

sent_test_tfid = vectorizer.fit_transform(sent_test_input['cleaned_sentence'])
sent_test_input.loc[:,'sentiment_numeric']=label_encoder.fit_transform(sent_test_input['sentiment'])
y_sent_test= sent_test_input['sentiment_numeric']

print("Mapeo de etiquetas a valores numéricos:")
for label, numeric_value in zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_)):
    print(f"{label}: {numeric_value}")

Mapeo de etiquetas a valores numéricos:
negative: 0
neutral: 1
positive: 2


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sent_train_input.loc[:,'sentiment_numeric']=label_encoder.fit_transform(sent_train_input['sentiment'])
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sent_test_input.loc[:,'sentiment_numeric']=label_encoder.fit_transform(sent_test_input['sentiment'])


In [55]:
model = LogisticRegression()
model.fit(cat_train_tfid, y_cat_train)

predictions = model.predict(cat_test_tfid)
accuracy = accuracy_score(y_cat_test, predictions)
print(f'Accuracy: {accuracy:.2f}')
print("Classification Report:")
print(classification_report(y_cat_test, predictions))

Accuracy: 0.16
Classification Report:
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        73
           1       0.06      0.04      0.05       214
           2       0.21      0.31      0.25       852
           3       0.00      0.00      0.00        77
           4       0.00      0.00      0.00        97
           5       0.02      0.00      0.01       242
           6       0.04      0.01      0.02       146
           7       0.00      0.00      0.00       160
           8       0.00      0.00      0.00        32
           9       0.09      0.16      0.12       336
          10       0.00      0.00      0.00        13
          11       0.00      0.00      0.00        14
          12       0.00      0.00      0.00       119
          13       0.25      0.09      0.13       116
          14       0.24      0.25      0.24       415
          15       0.00      0.00      0.00       125
          16       0.02      0.02      0.02

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


## **Step 3: Deployment of the sentiment/category tagger on  financial news or social media posts**

Let´s now turn our attention to a live deployment of the financial news tagger. Things can get quite complicated, specially if we add streaming data, so it is best to keep the deploymnet lightweight. There are mainly three important pieces. Let´s explore them:


- Build a local dashboard/app (e.g. using Streamlit or another web applications framework of your choice). A bit UI to display the sentiment tagger in action and demonstrate the practical application of your model.


- Build a financial news/alerts scraper pipeline, filter some entities if you focus your search. In a real world setting,  you’d likely want to build a more robust infrastructure for processing and ingestion of new examples, handling any preprocessing, and outputting predictions. Here are some options where to scrape data (real-time data might be expensive or limited):

    - <span style="color:blue">*Social Media Posts*</span>: Pulling historical or live data from tweets or reddit. There are public APIs with extensive documentation for them.
    - <span style="color:blue">*OpenBB*</span>: Open research investment platform. It aggregates financial news across the world and has an API to access them.
    - <span style="color:blue">*Financial news outlet*</span>: Yahoo Finance
    
An pipeline example: The basic premise is to read in a stream of tweets, use a lighweight sentiment analysis engine (BERT might not be a good fit here) to assign a bullish/neutral/bearish score to each tweet, and then see how this cumulatively changes over time.
    
    
- Build an inference endpoint for the tagging model. Within your infrastructure, you can deploy and load the resuting model. One way is to build a REST API endpoint, only to be queried locally (in your laptop).



Extra: You could explore or quantify correlations with the market for a list of selected stock.