# Sentiment analysis for finance

In this notebook, I present three different approaches to sentiment analysis for finance:

1. Dictionary-based approach
2. FinBert and other BERT-based models
3. LLM models (GPT-3.5, Llama2, etc.)

In [1]:
import pandas as pd

First of all, let's load some financial news data to work with.

In [2]:
from langchain_community.document_loaders import NewsURLLoader

In [3]:
urls = [
    "https://ca.finance.yahoo.com/news/rising-oil-price-doesnt-shake-160556917.html",
    "https://ca.finance.yahoo.com/news/venezuela-detains-two-former-maduro-173140679.html",
    "https://ca.finance.yahoo.com/news/norfolk-southern-agrees-pay-600m-121211343.html",
    "https://ca.finance.yahoo.com/news/boeing-shares-fall-nyt-report-164509069.html",
    "https://ca.finance.yahoo.com/news/restaurants-along-eclipses-path-totality-153650201.html",
]

In [4]:
loader = NewsURLLoader(urls=urls)
data = loader.load()

In [5]:
data

[Document(page_content='TORONTO — The price of oil has been on a steady climb all year, but the talk at Canada\'s biggest oil and gas conference is still focused on spending discipline.\n\nIndustry leaders at the Canadian Association of Petroleum Producers conference, held in Toronto this year, have been emphasizing their predictability and focus on returning money to shareholders, rather than talk of growth.\n\nSuncor Energy Inc. chief executive Rich Kruger, who was named head of the oil and gas producer last year as it struggled with safety and operational issues, said his goal is to bring clarity and simplicity to the company.\n\n"I want to become consistently and boringly excellent," said Kruger. "I\'m not a big one for surprise parties."\n\nADVERTISEMENT\n\nKruger has been working to standardize operations and create a steadier production plan, in contrast to some of the more rushed decisions when growth was the answer to all of the industry\'s questions.\n\nThe early development 

In [6]:
df = pd.DataFrame(
    [{"title": d.metadata["title"], "text": d.page_content} for d in data]
)

df

Unnamed: 0,title,text
0,Higher oil doesn't shake industry talk on spen...,TORONTO — The price of oil has been on a stead...
1,Venezuela Detains Former Maduro Confidantes in...,(Bloomberg) -- Venezuela detained former oil a...
2,Norfolk Southern agrees to $600M settlement in...,Norfolk Southern has agreed to pay $600 millio...
3,Boeing Crisis of Confidence Deepens With 787 N...,(Bloomberg) -- Boeing Co. faces a deepening cr...
4,Restaurants along eclipse's path of totality s...,Restaurants on the eclipse's path of totality ...


## Dictionary-based approach

A dictionary-based approach is a simple way to perform sentiment analysis. It consists of using a list of words with associated sentiment scores. The sentiment score of a sentence is the sum of the sentiment scores of the words it contains. You can also normalize the score by the number of words in the sentence.
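To make the scoring rule concrete, here is a minimal sketch on a single sentence. The two word sets are a tiny made-up lexicon for illustration only (they are not taken from any published dictionary), and every word is assumed to carry a score of +1 or -1.

```python
# Hypothetical mini-lexicon, for illustration only
positive = {"gain", "strong", "improved"}
negative = {"loss", "weak", "lawsuit"}


def dictionary_score(sentence: str) -> tuple[int, float]:
    """Return the raw score (positives minus negatives) and the
    same score normalized by the number of words in the sentence."""
    words = sentence.lower().split()
    n_pos = sum(w in positive for w in words)
    n_neg = sum(w in negative for w in words)
    raw = n_pos - n_neg
    normalized = raw / len(words) if words else 0.0
    return raw, normalized


raw, normalized = dictionary_score("Strong sales offset the loss from the lawsuit")
print(raw, normalized)
```

The sentence has one positive word and two negative words out of eight, so the raw score is negative and the normalized score is the raw score divided by eight.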


For this type of approach, we will need to use a financial sentiment dictionary. I will use the Loughran-McDonald dictionary, which is widely used in finance. Note that this dictionary is designed for financial statements, so it may not be the best choice for news articles.

You can download the dictionary from [here](https://sraf.nd.edu/loughranmcdonald-master-dictionary/) (direct link: [Loughran-McDonald_MasterDictionary_1993-2023.csv](https://drive.google.com/file/d/1ptUgGVeeUGhCbaKL14Ri3Xi5xOKkPkUD/view?usp=sharing))


### Pre-processing

The preprocessing steps for a dictionary-based approach depend on the dictionary you are using and the final measure you want to obtain. In this case, we will use the Loughran-McDonald dictionary, which contains variations of the same word, including plural forms, verb forms, etc. Therefore, we do not need to perform stemming or lemmatization, a common step in text preprocessing.

The preprocessing steps we will perform are:

- Lowercasing
- Removing punctuation
- Removing stopwords
- Removing numbers

The removal of stopwords and numbers is optional, but it affects any sentiment score measured as a ratio over the number of words in the text, since it changes the denominator. Other common filters remove URLs, emails, cities, company names, etc.
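A minimal sketch of these four steps, assuming a tiny made-up stopword set (the notebook loads its own list from `stopwords.txt` instead). This version replaces ASCII punctuation with spaces before filtering, so a token like `year,` survives as `year`; the trade-off is that contractions such as `doesn't` split into `doesn` and `t`.

```python
import string

# Hypothetical stopword set, for illustration only
EXAMPLE_STOPWORDS = {"the", "of", "but", "and", "in"}

# Map every ASCII punctuation character to a space
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(text: str, stopwords: set[str]) -> list[str]:
    text = text.lower()                          # lowercasing
    text = text.translate(PUNCT_TO_SPACE)        # removing punctuation
    tokens = text.split()
    tokens = [t for t in tokens if t.isalpha()]  # removing numbers and other non-letter tokens
    return [t for t in tokens if t not in stopwords]  # removing stopwords


tokens = preprocess("The price of oil rose 5% this year, but costs fell.", EXAMPLE_STOPWORDS)
print(tokens)
```

Using a `set` for the stopwords keeps each membership test constant-time, which matters once the texts or the word lists get long.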





In [7]:
# Using the same ones as Loughran-McDonald

with open("stopwords.txt", "r") as f:
    stopwords = f.read().split("\n")[:-1]
stopwords[:10]

['about', 'and', 'from', 'now', 'where', 'you', 'am', 'until', 'them', 'in']

In [8]:
def preprocess_text(text):
    words = text.split()
    words = [w.lower() for w in words]
    words = [w for w in words if w not in stopwords]
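    # Keep only purely alphabetic tokens. This drops numbers, but also any word
    # with attached punctuation (e.g. "year," or "Inc.") instead of stripping it.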
    words = [w for w in words if w.isalpha()]
    return " ".join(words)


df["text_clean"] = df["text"].apply(preprocess_text)
df

Unnamed: 0,title,text,text_clean
0,Higher oil doesn't shake industry talk on spen...,TORONTO — The price of oil has been on a stead...,toronto price oil a steady climb talk biggest ...
1,Venezuela Detains Former Maduro Confidantes in...,(Bloomberg) -- Venezuela detained former oil a...,venezuela detained former oil finance minister...
2,Norfolk Southern agrees to $600M settlement in...,Norfolk Southern has agreed to pay $600 millio...,norfolk southern agreed pay million a lawsuit ...
3,Boeing Crisis of Confidence Deepens With 787 N...,(Bloomberg) -- Boeing Co. faces a deepening cr...,boeing faces a deepening crisis confidence eng...
4,Restaurants along eclipse's path of totality s...,Restaurants on the eclipse's path of totality ...,restaurants path totality saw a jump sales mon...


### Dictionary

Next, we will load the dictionary and make a list of positive words and a list of negative words.

In [9]:
lm_dict = pd.read_csv("Loughran-McDonald_MasterDictionary_1993-2023.csv")
lm_dict

Unnamed: 0,Word,Seq_num,Word Count,Word Proportion,Average Proportion,Std Dev,Doc Count,Negative,Positive,Uncertainty,Litigious,Strong_Modal,Weak_Modal,Constraining,Complexity,Syllables,Source
0,AARDVARK,1,664,2.690000e-08,1.860000e-08,4.050000e-06,131,0,0,0,0,0,0,0,0,2,12of12inf
1,AARDVARKS,2,3,1.210000e-10,8.230000e-12,9.020000e-09,1,0,0,0,0,0,0,0,0,2,12of12inf
2,ABACI,3,9,3.640000e-10,1.110000e-10,5.160000e-08,7,0,0,0,0,0,0,0,0,3,12of12inf
3,ABACK,4,29,1.170000e-09,6.330000e-10,1.560000e-07,28,0,0,0,0,0,0,0,0,2,12of12inf
4,ABACUS,5,9349,3.790000e-07,3.830000e-07,3.460000e-05,1239,0,0,0,0,0,0,0,0,3,12of12inf
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
86548,ZYGOTE,86529,69,2.790000e-09,1.310000e-09,2.940000e-07,45,0,0,0,0,0,0,0,0,2,12of12inf
86549,ZYGOTES,86530,1,4.050000e-11,1.720000e-11,1.880000e-08,1,0,0,0,0,0,0,0,0,2,12of12inf
86550,ZYGOTIC,86531,0,0.000000e+00,0.000000e+00,0.000000e+00,0,0,0,0,0,0,0,0,0,3,12of12inf
86551,ZYMURGIES,86532,0,0.000000e+00,0.000000e+00,0.000000e+00,0,0,0,0,0,0,0,0,0,3,12of12inf


In [10]:
pos_words = lm_dict[lm_dict["Positive"] != 0]["Word"].str.lower().to_list()
neg_words = lm_dict[lm_dict["Negative"] != 0]["Word"].str.lower().to_list()

pos_words[:10]

['able',
 'abundance',
 'abundant',
 'acclaimed',
 'accomplish',
 'accomplished',
 'accomplishes',
 'accomplishing',
 'accomplishment',
 'accomplishments']

In [11]:
neg_words[:10]

['abandon',
 'abandoned',
 'abandoning',
 'abandonment',
 'abandonments',
 'abandons',
 'abdicated',
 'abdicates',
 'abdicating',
 'abdication']

### Sentiment score

The sentiment score of a text is the sum of the sentiment scores of the words it contains. We can also normalize the score by the number of words in the text.

In [12]:
df["n"] = df["text_clean"].apply(lambda x: len(x.split()))
df["n_pos"] = df["text_clean"].apply(
    lambda x: len([w for w in x.split() if w in pos_words])
)
df["n_neg"] = df["text_clean"].apply(
    lambda x: len([w for w in x.split() if w in neg_words])
)

df

Unnamed: 0,title,text,text_clean,n,n_pos,n_neg
0,Higher oil doesn't shake industry talk on spen...,TORONTO — The price of oil has been on a stead...,toronto price oil a steady climb talk biggest ...,261,1,7
1,Venezuela Detains Former Maduro Confidantes in...,(Bloomberg) -- Venezuela detained former oil a...,venezuela detained former oil finance minister...,179,1,12
2,Norfolk Southern agrees to $600M settlement in...,Norfolk Southern has agreed to pay $600 millio...,norfolk southern agreed pay million a lawsuit ...,578,8,31
3,Boeing Crisis of Confidence Deepens With 787 N...,(Bloomberg) -- Boeing Co. faces a deepening cr...,boeing faces a deepening crisis confidence eng...,402,3,38
4,Restaurants along eclipse's path of totality s...,Restaurants on the eclipse's path of totality ...,restaurants path totality saw a jump sales mon...,255,4,1


In [13]:
df["lm_level"] = df["n_pos"] - df["n_neg"]

df["lm_score1"] = (df["n_pos"] - df["n_neg"]) / df["n"]
df["lm_score2"] = (df["n_pos"] - df["n_neg"]) / (df["n_pos"] + df["n_neg"])

CUTOFF = 0.3
df["lm_sentiment"] = df["lm_score2"].apply(
    lambda x: "positive" if x > CUTOFF else "negative" if x < -CUTOFF else "neutral"
)
df

Unnamed: 0,title,text,text_clean,n,n_pos,n_neg,lm_level,lm_score1,lm_score2,lm_sentiment
0,Higher oil doesn't shake industry talk on spen...,TORONTO — The price of oil has been on a stead...,toronto price oil a steady climb talk biggest ...,261,1,7,-6,-0.022989,-0.75,negative
1,Venezuela Detains Former Maduro Confidantes in...,(Bloomberg) -- Venezuela detained former oil a...,venezuela detained former oil finance minister...,179,1,12,-11,-0.061453,-0.846154,negative
2,Norfolk Southern agrees to $600M settlement in...,Norfolk Southern has agreed to pay $600 millio...,norfolk southern agreed pay million a lawsuit ...,578,8,31,-23,-0.039792,-0.589744,negative
3,Boeing Crisis of Confidence Deepens With 787 N...,(Bloomberg) -- Boeing Co. faces a deepening cr...,boeing faces a deepening crisis confidence eng...,402,3,38,-35,-0.087065,-0.853659,negative
4,Restaurants along eclipse's path of totality s...,Restaurants on the eclipse's path of totality ...,restaurants path totality saw a jump sales mon...,255,4,1,3,0.011765,0.6,positive
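As a check on the two normalizations, the first row of the table above ($n = 261$, $n_{\mathrm{pos}} = 1$, $n_{\mathrm{neg}} = 7$) can be worked out by hand. Here $\mathrm{score}_1$ and $\mathrm{score}_2$ are shorthand for the `lm_score1` and `lm_score2` columns.

$$
\mathrm{score}_1 = \frac{n_{\mathrm{pos}} - n_{\mathrm{neg}}}{n} = \frac{1 - 7}{261} \approx -0.0230
$$

$$
\mathrm{score}_2 = \frac{n_{\mathrm{pos}} - n_{\mathrm{neg}}}{n_{\mathrm{pos}} + n_{\mathrm{neg}}} = \frac{1 - 7}{1 + 7} = -0.75
$$

Since $-0.75 < -0.3$, the article is labelled negative. Note that $\mathrm{score}_2$ is bounded in $[-1, 1]$ and ignores text length, while $\mathrm{score}_1$ shrinks as the text gets longer. When a text contains no dictionary words at all, $\mathrm{score}_2$ is $0/0$, which pandas reports as `NaN`; both comparisons against the cutoff are then false, so the text falls into the neutral class.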


## BERT-based models

BERT-based models are often called "state-of-the-art" in recent finance papers, even though the original [BERT paper](https://arxiv.org/abs/1810.04805) dates from 2018 and many more advanced models have come along since. They are pre-trained on a large corpus of text and then fine-tuned on a specific task. In this case, we will use FinBert, a BERT model fine-tuned on financial data (see [FinBert paper](https://arxiv.org/abs/1908.10063)).

These models take a sequence of tokens as input and output a vector of size 768 (or 1024, depending on the model). This vector can be used as input to a classifier to predict the sentiment of the text. FinBert's classification head outputs one logit per class (positive, negative, and neutral), and a softmax turns these logits into probabilities.
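The final step described above, turning the classifier's raw outputs into class probabilities and picking a label, can be sketched in plain Python. The logits below are made-up numbers, and the label order is an assumption for illustration; in practice it should be read from the model configuration.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["positive", "negative", "neutral"]  # assumed order, for illustration
logits = [2.0, -1.0, 0.5]                     # made-up classifier outputs

probs = dict(zip(labels, softmax(logits)))
predicted = max(probs, key=probs.get)  # label with the highest probability
print(probs, predicted)
```

A continuous score can then be derived from the probabilities, for example the positive probability minus the negative one.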

### Pre-processing

We won't perform any pre-processing on the text before feeding it to the model. The tokenizer takes care of splitting the text into the tokens the model expects. Common pre-processing for BERT models includes masking some words (dates, company names, etc.).


### Usage

Many BERT models, including FinBert, are available in the Hugging Face Transformers library and can be fetched automatically.



In [14]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import scipy
import torch

In [15]:
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

In [16]:
def finbert_sentiment(text: str) -> tuple[float, float, float, str]:
    with torch.no_grad():
        inputs = tokenizer(
            text, return_tensors="pt", padding=True, truncation=True, max_length=512
        )
        outputs = model(**inputs)
        logits = outputs.logits
        scores = {
            k: v
            for k, v in zip(
                model.config.id2label.values(),
                scipy.special.softmax(logits.numpy().squeeze()),
            )
        }
        return (
            scores["positive"],
            scores["negative"],
            scores["neutral"],
            max(scores, key=scores.get),
        )
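Because `finbert_sentiment` is called with `truncation=True` and `max_length=512`, anything past the first 512 tokens of an article is ignored. One possible workaround, sketched below under loud assumptions, is to split the text into chunks, score each chunk, and average the probabilities. The chunk size is counted in words as a rough proxy for tokens, and `toy_score_fn` is a hypothetical stand-in for a real scorer; neither is part of the notebook.

```python
from typing import Callable

Scores = dict[str, float]


def split_into_chunks(text: str, max_words: int = 200) -> list[str]:
    """Split a text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i : i + max_words]) for i in range(0, len(words), max_words)]


def average_scores(text: str, score_fn: Callable[[str], Scores], max_words: int = 200) -> Scores:
    """Score every chunk with score_fn and average the per-label probabilities."""
    chunks = split_into_chunks(text, max_words)
    if not chunks:
        raise ValueError("cannot score an empty text")
    totals: Scores = {}
    for chunk in chunks:
        for label, prob in score_fn(chunk).items():
            totals[label] = totals.get(label, 0.0) + prob
    return {label: total / len(chunks) for label, total in totals.items()}


def toy_score_fn(chunk: str) -> Scores:
    """Hypothetical stand-in for a model: more negative if 'loss' appears."""
    neg = 0.8 if "loss" in chunk else 0.1
    return {"positive": 0.1, "negative": neg, "neutral": 0.9 - neg}


averaged = average_scores("profit profit profit loss loss loss", toy_score_fn, max_words=3)
print(averaged)
```

A plain average weights every chunk equally; weighting by chunk length or keeping only the first chunks (where news articles tend to state the main point) are equally plausible choices.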

In [17]:
# Notice that this is the raw text, no preprocessing
df[["finbert_pos", "finbert_neg", "finbert_neu", "finbert_sentiment"]] = (
    df["text"].apply(finbert_sentiment).apply(pd.Series)
)
df["finbert_score"] = df["finbert_pos"] - df["finbert_neg"]

In [18]:
df[
    [
        "title",
        "text",
        "finbert_pos",
        "finbert_neg",
        "finbert_neu",
        "finbert_sentiment",
        "finbert_score",
    ]
]

Unnamed: 0,title,text,finbert_pos,finbert_neg,finbert_neu,finbert_sentiment,finbert_score
0,Higher oil doesn't shake industry talk on spen...,TORONTO — The price of oil has been on a stead...,0.36347,0.054056,0.582474,neutral,0.309414
1,Venezuela Detains Former Maduro Confidantes in...,(Bloomberg) -- Venezuela detained former oil a...,0.022471,0.666017,0.311512,negative,-0.643546
2,Norfolk Southern agrees to $600M settlement in...,Norfolk Southern has agreed to pay $600 millio...,0.030382,0.732271,0.237347,negative,-0.701889
3,Boeing Crisis of Confidence Deepens With 787 N...,(Bloomberg) -- Boeing Co. faces a deepening cr...,0.009635,0.949322,0.041043,negative,-0.939687
4,Restaurants along eclipse's path of totality s...,Restaurants on the eclipse's path of totality ...,0.783017,0.038716,0.178267,positive,0.744301


## LLM models

Large language models (LLMs) are trained on a large corpus of text. They are often used for text generation, but they can also be used for sentiment analysis. The approach is to design a prompt that makes the model output a sentiment label and score. LangChain is a library that makes it easy to use LLMs for different tasks, including sentiment analysis.


In [19]:
from langchain_core.pydantic_v1 import BaseModel, Field
from langchain_core.prompts import PromptTemplate
from langchain_core.output_parsers import PydanticOutputParser

In [20]:
from langchain_community.chat_models import ChatOllama

In [21]:
# For handling errors
from tenacity import retry, stop_after_attempt, RetryError

We will first define the desired output format as a Pydantic model, from which we will
create a PydanticOutputParser object. This object will be used to inject the output
definition into the prompt and to parse the output of the model.
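To give an intuition for what the parsing step has to do, here is a standard-library sketch: find a JSON object in the model's reply, load it, and validate the fields. This is an illustration of the idea only, not how `PydanticOutputParser` is implemented, and `ParsedSentiment` and `parse_llm_output` are hypothetical names.

```python
import json
from dataclasses import dataclass

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


@dataclass
class ParsedSentiment:
    sentiment: str
    score: float


def parse_llm_output(raw: str) -> ParsedSentiment:
    """Extract and validate a JSON object from free-form model output."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    payload = json.loads(raw[start : end + 1])  # JSONDecodeError is a ValueError
    if "sentiment" not in payload or "score" not in payload:
        raise ValueError("missing required field")
    sentiment, score = payload["sentiment"], float(payload["score"])
    if sentiment not in ALLOWED_SENTIMENTS:
        raise ValueError(f"unexpected sentiment: {sentiment!r}")
    if not -1 <= score <= 1:
        raise ValueError(f"score out of range: {score}")
    return ParsedSentiment(sentiment, score)


parsed = parse_llm_output('Sure! Here is the result: {"sentiment": "negative", "score": -0.8}')
print(parsed)
```

Models do not always follow the requested format, so a reply that fails any of these checks has to be treated as an error and either retried or discarded.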

In [22]:
class SentimentClassification(BaseModel):
    sentiment: str = Field(
        ...,
        description="The sentiment of the text",
        enum=["positive", "negative", "neutral"],
    )
    score: float = Field(..., description="The score of the sentiment", ge=-1, le=1)
    justification: str = Field(..., description="The justification of the sentiment")
    main_entity: str = Field(..., description="The main entity discussed in the text")

In [23]:
@retry(stop=stop_after_attempt(5))
def run_chain(text: str, chain) -> dict:
    return chain.invoke({"news": text}).dict()


def llm_sentiment(text: str, llm) -> tuple[str, float, str, str]:
    parser = PydanticOutputParser(pydantic_object=SentimentClassification)

    prompt = PromptTemplate(
        template="Describe the sentiment of a text of financial news.\n{format_instructions}\n{news}\n",
        input_variables=["news"],
        partial_variables={"format_instructions": parser.get_format_instructions()},
    )

    chain = prompt | llm | parser

    try:
        result = run_chain(text, chain)

        return (
            result["sentiment"],
            result["score"],
            result["justification"],
            result["main_entity"],
        )
    except RetryError as e:
        print(f"Error: {e}")
        return "error", 0, "", ""

In [24]:
# Replace with the correct model, or use ChatOpenAI if you want to use OpenAI
llama2 = ChatOllama(model="llama2", temperature=0.1)

df[
    ["llama2_sentiment", "llama2_score", "llama2_justification", "llama2_main_entity"]
] = (df["text"].apply(lambda x: llm_sentiment(x, llama2)).apply(pd.Series))

Error: RetryError[<Future at 0x329c35e50 state=finished raised OutputParserException>]
Error: RetryError[<Future at 0x329c629c0 state=finished raised OutputParserException>]
Error: RetryError[<Future at 0x3295c7680 state=finished raised OutputParserException>]


In [25]:
df[
    [
        "title",
        "text",
        "llama2_sentiment",
        "llama2_score",
        "llama2_justification",
        "llama2_main_entity",
    ]
]

Unnamed: 0,title,text,llama2_sentiment,llama2_score,llama2_justification,llama2_main_entity
0,Higher oil doesn't shake industry talk on spen...,TORONTO — The price of oil has been on a stead...,positive,0.7,The article focuses on the oil and gas industr...,Suncor Energy Inc.
1,Venezuela Detains Former Maduro Confidantes in...,(Bloomberg) -- Venezuela detained former oil a...,error,0.0,,
2,Norfolk Southern agrees to $600M settlement in...,Norfolk Southern has agreed to pay $600 millio...,error,0.0,,
3,Boeing Crisis of Confidence Deepens With 787 N...,(Bloomberg) -- Boeing Co. faces a deepening cr...,negative,0.7,The article reports on allegations of manufact...,Boeing Co.
4,Restaurants along eclipse's path of totality s...,Restaurants on the eclipse's path of totality ...,error,0.0,,


In [26]:
mixtral = ChatOllama(model="dolphin-mixtral:latest", temperature=0.1)

df[
    [
        "mixtral_sentiment",
        "mixtral_score",
        "mixtral_justification",
        "mixtral_main_entity",
    ]
] = (
    df["text"].apply(lambda x: llm_sentiment(x, mixtral)).apply(pd.Series)
)

Error: RetryError[<Future at 0x329d5b320 state=finished raised OutputParserException>]


In [27]:
df[
    [
        "title",
        "text",
        "mixtral_sentiment",
        "mixtral_score",
        "mixtral_justification",
        "mixtral_main_entity",
    ]
]

Unnamed: 0,title,text,mixtral_sentiment,mixtral_score,mixtral_justification,mixtral_main_entity
0,Higher oil doesn't shake industry talk on spen...,TORONTO — The price of oil has been on a stead...,neutral,0.0,The text discusses the focus on spending disci...,Canadian Association of Petroleum Producers co...
1,Venezuela Detains Former Maduro Confidantes in...,(Bloomberg) -- Venezuela detained former oil a...,negative,-0.8,The text discusses the detention of former oil...,Tareck El Aissami and Simón Zerpa
2,Norfolk Southern agrees to $600M settlement in...,Norfolk Southern has agreed to pay $600 millio...,error,0.0,,
3,Boeing Crisis of Confidence Deepens With 787 N...,(Bloomberg) -- Boeing Co. faces a deepening cr...,negative,-0.8,The text discusses a crisis of confidence for ...,Boeing
4,Restaurants along eclipse's path of totality s...,Restaurants on the eclipse's path of totality ...,positive,0.85,The text discusses a significant increase in s...,eclipse


In [28]:
import textwrap

print(textwrap.fill(df.iloc[0]["text"][:500] + "...") + "\n")
print("Llama2: " + textwrap.fill(df.iloc[0]["llama2_justification"]) + "\n")
print("Mixtral: " + textwrap.fill(df.iloc[0]["mixtral_justification"]))

TORONTO — The price of oil has been on a steady climb all year, but
the talk at Canada's biggest oil and gas conference is still focused
on spending discipline.  Industry leaders at the Canadian Association
of Petroleum Producers conference, held in Toronto this year, have
been emphasizing their predictability and focus on returning money to
shareholders, rather than talk of growth.  Suncor Energy Inc. chief
executive Rich Kruger, who was named head of the oil and gas producer
last year as it st...

Llama2: The article focuses on the oil and gas industry in Canada and how
companies are shifting their priorities from growth to spending
discipline. The CEOs of Suncor Energy, Cenovus Energy, and Whitecap
Resources emphasize the importance of returning money to shareholders
rather than talking about growth. The article highlights the changing
narrative around oil demand and how it's still growing despite
investor concerns. The overall tone of the article is positive as
companies are taking

In [29]:
df[
    [
        "title",
        "text",
        "lm_sentiment",
        "finbert_sentiment",
        "llama2_sentiment",
        "mixtral_sentiment",
    ]
]

Unnamed: 0,title,text,lm_sentiment,finbert_sentiment,llama2_sentiment,mixtral_sentiment
0,Higher oil doesn't shake industry talk on spen...,TORONTO — The price of oil has been on a stead...,negative,neutral,positive,neutral
1,Venezuela Detains Former Maduro Confidantes in...,(Bloomberg) -- Venezuela detained former oil a...,negative,negative,error,negative
2,Norfolk Southern agrees to $600M settlement in...,Norfolk Southern has agreed to pay $600 millio...,negative,negative,error,error
3,Boeing Crisis of Confidence Deepens With 787 N...,(Bloomberg) -- Boeing Co. faces a deepening cr...,negative,negative,negative,negative
4,Restaurants along eclipse's path of totality s...,Restaurants on the eclipse's path of totality ...,positive,positive,error,positive
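One simple way to read this table is to take a majority vote across the four methods, ignoring the runs that ended in `error`. The sketch below copies the labels from the table above; the dictionary keys are shorthand for the article titles, and treating a tie as its own outcome is an arbitrary choice made for illustration.

```python
from collections import Counter

# Labels in column order: lm, finbert, llama2, mixtral (copied from the table above)
labels_by_article = {
    "oil": ["negative", "neutral", "positive", "neutral"],
    "venezuela": ["negative", "negative", "error", "negative"],
    "norfolk": ["negative", "negative", "error", "error"],
    "boeing": ["negative", "negative", "negative", "negative"],
    "eclipse": ["positive", "positive", "error", "positive"],
}


def majority_label(labels: list[str]) -> str:
    """Most frequent label among the methods that produced a result."""
    valid = [label for label in labels if label != "error"]
    if not valid:
        return "error"
    counts = Counter(valid).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "tie"
    return counts[0][0]


consensus = {article: majority_label(labels) for article, labels in labels_by_article.items()}
print(consensus)
```

The methods agree on four of the five articles; the oil article is the only one where all three possible labels appear.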
