# Financial News Sentiment Analysis
Classify the sentiment of financial news headlines as positive, neutral, or negative using classical machine learning and neural network methods.

## Setup

In [3]:
import pandas as pd
import re
import numpy as np
import random
import math
import string
from itertools import combinations
pd.set_option('display.max_colwidth',200)

## Load Dataset
The dataset used is the **Financial Phrase Bank v1.0**, created by researchers at Aalto University. It contains 4846 financial and economic news sentences annotated for sentiment. This project specifically uses the **Sentences_50Agree.txt** file, which includes sentences where **at least 50% of annotators** agreed on the sentiment label.

In [5]:
def load_data(data_path: str, text_column: str, sentiment_col: str) -> pd.DataFrame:
    """Load data from a text file where each line contains text and label separated by '@'.
    Lines without the delimiter are skipped."""
    data = []
    skipped = 0
    print(f"Loading data from: {data_path} ...")
    with open(data_path, "r", encoding="latin1") as f:
        for line in f:
            line = line.strip()
            if "@" not in line:
                skipped += 1
                continue
            parts = line.rsplit("@", 1)  # split on the last '@' so an '@' inside the sentence cannot corrupt the label
            data.append(parts)

    print(f"Loaded {len(data)} lines.")
    print(f"Skipped {skipped} lines without labels.")
    return pd.DataFrame(data, columns=[text_column, sentiment_col])


In [6]:
data_path = "../data/Sentences_50Agree.txt"
text_col = "news"
sentiment_col = "sentiment"
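# Note: a bare %time line times an empty statement, so the timing printed below does not measure load_data.
%time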
corpora = load_data(data_path=data_path, text_column=text_col, sentiment_col=sentiment_col)

CPU times: user 1e+03 ns, sys: 0 ns, total: 1e+03 ns
Wall time: 2.15 μs
Loading data from: ../data/Sentences_50Agree.txt ...
Loaded 4846 lines.
Skipped 0 lines without labels.
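
To make the file format concrete, here is a minimal sketch of how a single `sentence@label` line can be split. The helper name `parse_line` and the sample lines are made up for illustration and are not taken from the dataset. Splitting on the last `@` is an assumption of this sketch; it keeps an `@` inside the sentence from leaking into the label.

```python
from typing import Optional, Tuple


def parse_line(line: str) -> Optional[Tuple[str, str]]:
    """Split one 'sentence@label' line; return None when no label is present."""
    line = line.strip()
    if "@" not in line:
        return None
    text, label = line.rsplit("@", 1)  # the label follows the last '@'
    return text, label


# Hypothetical sample lines (not from the dataset)
print(parse_line("Profit rose to EUR 5 mn .@positive"))
print(parse_line("Contact ir@example.com for details .@neutral"))
print(parse_line("a line without a label"))
```

The second sample shows why the direction of the split matters: a left split would return `example.com for details .@neutral` as the label.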


corpora.head(10)

## Textual Similarity 

In this task, we measure the semantic similarity between 15 randomly selected positive financial sentences. Each sentence is
represented by the average of its word vectors, a simple distributional-semantics approach at the sentence level.

### Preprocess
Before computing sentence similarities, we apply a few preprocessing steps to the entire dataset of 4846 sentences:
lowercasing the text, expanding contractions (e.g., “can’t” → “can not”, “i’m” → “i am”), and collapsing repeated whitespace. This gives cleaner, more consistent input for vector generation. One caveat: the `'s` rule cannot distinguish a possessive from a contracted “is”, so “world’s” becomes “world is”.

In [11]:
def contraction_expansion(text: str) -> str:
    """Expand common English contractions; specific rules run before generic ones."""
    text = re.sub(r"won't", "will not", text)
    text = re.sub(r"can't", "can not", text)
    text = re.sub(r"n't", " not", text)
    text = re.sub(r"'re", " are", text)
    # Caveat: this rule also rewrites possessives ("world's" -> "world is")
    text = re.sub(r"'s", " is", text)
    text = re.sub(r"'d", " would", text)
    text = re.sub(r"'ll", " will", text)
    text = re.sub(r"'ve", " have", text)
    text = re.sub(r"'m", " am", text)
    return text


def clean_text(text: str) -> str:
    if not isinstance(text, str):
        return ""
    text = text.lower()
    text = contraction_expansion(text)
    text = re.sub(r"\s+", " ", text).strip()
    return text

In [12]:
corpora[text_col] = corpora[text_col].apply(clean_text)
corpora.head(1)

Unnamed: 0,news,sentiment
0,"according to gran , the company has no plans to move all production to russia , although that is where the company is growing .",neutral
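
The expansion rules are purely pattern-based and order-sensitive. The sketch below re-declares a reduced subset of them (the names `RULES` and `expand` and the sample strings are hypothetical) to show two effects: the specific `won't` rule has to run before the generic `n't` rule, and the `'s` rule rewrites possessives as well as contracted "is".

```python
import re

# Reduced, illustrative subset of the contraction rules; order matters.
RULES = [
    (r"won't", "will not"),
    (r"can't", "can not"),
    (r"n't", " not"),
    (r"'s", " is"),
]


def expand(text: str) -> str:
    """Apply each rule in sequence to the input text."""
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text)
    return text


print(expand("it won't change"))      # specific rule fires before the generic n't rule
print(expand("the world's biggest"))  # possessive is rewritten as "is"
```

If possessives matter for a downstream task, the `'s` rule could be dropped or restricted; that is a design choice this notebook does not make.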


### Select 15 random positive sentences

In [14]:
positive_corpora = corpora[corpora[sentiment_col] == "positive"]

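# Randomly select 15 positive sentences. DataFrame.sample is seeded by random_state,
# so the random.seed call below does not affect the selection.
random.seed(42)
selected_positive_sentences = positive_corpora.sample(15, random_state=42)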

selected_sentences = selected_positive_sentences[text_col].tolist()

In [15]:
selected_positive_sentences

Unnamed: 0,news,sentiment
567,"the new agreement , which expands a long-established cooperation between the companies , involves the transfer of certain engineering and documentation functions from larox to etteplan .",positive
1752,"( adp news ) - finnish handling systems provider cargotec oyj ( hel : cgcbv ) announced on friday it won orders worth eur 10 million ( usd 13.2 m ) to deliver linkspans to jordan , morocco and ire...",positive
995,the world is biggest magazine paper maker said the program to improve efficiency will include closing several of its least competitive mills and would cover all the company is operations resulting...,positive
601,"a january 11 , 2010 ephc board of directors has approved an increase in the quarterly dividend from $ 0.03 to $ 0.05 per share .",positive
568,with this appointment kaupthing bank aims to further co-ordinate capital markets activities within the group and to improve the overall service to clients .,positive
3129,"st. petersburg , oct 14 ( prime-tass ) -- finnish tire producer nokian tyres plans to invest about 50 million euros in the expansion of its tire plant in the city of vsevolozhsk in russia is lenin...",positive
760,"during the past decade it has gradually divested noncore assets and bought several sports equipment makers , including california-based fitness products international and sparks , nevada-based ate...",positive
463,"- beijing xfn-asia - hong kong-listed standard chartered bank said it has signed a china mobile phone dealer financing agreement with nokia , making it the first foreign bank to offer financing to...",positive
818,"according to schmardin , nordea will most likely try to win customers over from other pension fund providers .",positive
1949,this is a much better process than using virgin paper as it requires less transportation of wood pulp from places like finland and canada .,positive


### Load Word Vectors

For converting the words into vectors, we use two different methods:
- **PMI (Pointwise Mutual Information):** We build custom word vectors using co-occurrence statistics from the
entire dataset (4846 sentences). A sliding window (sizes 2–5) is used to count how often words appear near
each other. The co-occurrence matrix is then transformed into a positive PMI matrix (negative values are clipped
to zero), and each word is represented by its row of that matrix.
- **GloVe (Global Vectors for Word Representation):** We load pre-trained 50-dimensional word embeddings
from glove.6B.50d.txt. These vectors were trained on a large external corpus and capture general semantic
relationships between words.

For both approaches, we compute the sentence vector by averaging the vectors of the words present in that sentence; words without a vector are skipped.
Cosine similarity is then used to measure how semantically close each pair of sentences is.
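
The quantities described above can be written out as follows. The notation ($C_{ij}$ for the co-occurrence count of words $w_i$ and $w_j$, $N$ for the total count, $\mathbf{v}_w$ for a word vector, $W_s$ for the tokens of sentence $s$ that have a vector) is introduced here for illustration and is not the notebook's own. The clipping at zero mirrors the `max(0, ...)` in `load_word_vectors`, where pairs with $C_{ij} = 0$ are also left at zero.

```latex
\[
N = \sum_{i,j} C_{ij}, \qquad
P(w_i, w_j) = \frac{C_{ij}}{N}, \qquad
P(w_i) = \frac{\sum_{j} C_{ij}}{N}
\]
\[
\mathrm{PPMI}(w_i, w_j) = \max\!\left(0,\ \log_2 \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)}\right)
\]
\[
\mathbf{s} = \frac{1}{|W_s|} \sum_{w \in W_s} \mathbf{v}_w, \qquad
\cos(\mathbf{s}_a, \mathbf{s}_b) =
\frac{\mathbf{s}_a \cdot \mathbf{s}_b}{\lVert \mathbf{s}_a \rVert \, \lVert \mathbf{s}_b \rVert}
\]
```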

In [18]:
def load_word_vectors(method, tokenized_sentences=None, vocab=None, window_size=3):
    word_vectors = {}

    if method == "glove":
        with open("glove.6B.50d.txt", 'r', encoding='utf-8') as f:
            for line in f:
                parts = line.split()
                word = parts[0]
                vec = list(map(float, parts[1:]))
                word_vectors[word] = vec

    elif method == "pmi":
        if tokenized_sentences is None or vocab is None:
            print("PMI mode requires tokenized_sentences and vocab.")
            return {}

        word_to_idx = {word: i for i, word in enumerate(vocab)}
        co_occurrence = np.zeros((len(vocab), len(vocab)))

        for sent in tokenized_sentences:
            for i, center in enumerate(sent):
                center_idx = word_to_idx[center]
                for j in range(max(0, i - window_size), min(len(sent), i + window_size + 1)):
                    if i != j:
                        context = sent[j]
                        context_idx = word_to_idx[context]
                        co_occurrence[center_idx][context_idx] += 1

        total = np.sum(co_occurrence)
        word_probs = np.sum(co_occurrence, axis=1) / total
        pmi_matrix = np.zeros_like(co_occurrence)

        for i in range(len(vocab)):
            for j in range(len(vocab)):
                joint = co_occurrence[i][j] / total
                if joint > 0:
                    pmi_matrix[i][j] = max(0, np.log2(joint / (word_probs[i] * word_probs[j])))

        word_vectors = {word: pmi_matrix[word_to_idx[word]] for word in vocab}

    else:
        print("Invalid method.")
    
    return word_vectors
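
The PMI branch loops over every pair of vocabulary words in Python, which can be slow for a large vocabulary. As a rough sketch under stated assumptions, the same clipped PMI can be computed with NumPy array operations. The toy corpus and all variable names below are made up for illustration; this is an alternative formulation, not the notebook's implementation.

```python
import numpy as np

# Toy corpus (made up for illustration), already tokenized
sentences = [["profit", "rose"], ["profit", "fell"]]
vocab = sorted({w for s in sentences for w in s})  # ['fell', 'profit', 'rose']
idx = {w: i for i, w in enumerate(vocab)}
window = 2

# Symmetric co-occurrence counts within the window
counts = np.zeros((len(vocab), len(vocab)))
for sent in sentences:
    for i, center in enumerate(sent):
        lo, hi = max(0, i - window), min(len(sent), i + window + 1)
        for j in range(lo, hi):
            if i != j:
                counts[idx[center], idx[sent[j]]] += 1

total = counts.sum()
p_word = counts.sum(axis=1) / total  # marginal probability of each word
p_joint = counts / total             # joint probability of each pair

# log2(0) yields -inf for unseen pairs; suppress the warning and mask them afterwards
with np.errstate(divide="ignore"):
    pmi = np.log2(p_joint / np.outer(p_word, p_word))

# Keep PMI only where the pair was observed, and clip negatives to zero
ppmi = np.where(counts > 0, np.maximum(pmi, 0.0), 0.0)

print(vocab)
print(ppmi.round(3))
```

In this toy corpus "profit" co-occurs once with each of the other two words, so both pairs get a PMI of log2(0.25 / (0.5 × 0.25)) = 1, while "fell" and "rose" never co-occur and stay at zero.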


### Convert Sentences to Vectors

In [28]:
def vectorize_sentences(sentences, method, word_vectors):
    sentence_vectors = []

    if method == "glove":
        vector_size = 50
    elif method == "pmi":
        vector_size = len(next(iter(word_vectors.values())))
    else:
        print("Invalid method.")
        return []

    for sentence in sentences:
        words = sentence.split()
        vectors = [word_vectors[word] for word in words if word in word_vectors]

        if vectors:
            avg_vector = [sum(x)/len(x) for x in zip(*vectors)]
        else:
            avg_vector = [0.0] * vector_size

        sentence_vectors.append(avg_vector)

    return sentence_vectors
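
To illustrate the averaging step on something small enough to check by hand, here is a sketch with made-up 3-dimensional vectors (the GloVe vectors used in this notebook have 50 dimensions). The names `toy_vectors` and `average_vector` are hypothetical, and `np.mean` stands in for the pure-Python average above.

```python
import numpy as np

# Hypothetical 3-dimensional word vectors, for illustration only
toy_vectors = {
    "profit": np.array([1.0, 0.0, 2.0]),
    "rose": np.array([3.0, 2.0, 0.0]),
}


def average_vector(sentence: str, vectors: dict, dim: int) -> np.ndarray:
    """Mean of the known word vectors; zeros when no word has a vector."""
    known = [vectors[w] for w in sentence.split() if w in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)


print(average_vector("profit rose sharply", toy_vectors, dim=3))  # "sharply" is skipped
print(average_vector("unknown words only", toy_vectors, dim=3))   # falls back to zeros
```

Out-of-vocabulary words are dropped before averaging, so a sentence made mostly of unknown words is represented by only a few vectors.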


### Compute Similarity and Print Results

In [31]:
def cosine_sim(vec1, vec2):
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0
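
For comparison, a NumPy version of the same measure might look like the sketch below. The function name `cosine_sim_np` is an assumption made for illustration; it keeps the convention of returning 0 when either vector has zero length.

```python
import numpy as np


def cosine_sim_np(vec1, vec2) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    a = np.asarray(vec1, dtype=float)
    b = np.asarray(vec2, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


print(cosine_sim_np([1, 0], [1, 0]))  # same direction
print(cosine_sim_np([1, 0], [0, 1]))  # orthogonal
print(cosine_sim_np([0, 0], [1, 1]))  # zero vector falls back to 0.0
```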

In [33]:
def compute_and_display_similarities(sentences, sentence_vectors, label, display=False):
    similarities = []
    for i in range(len(sentence_vectors)):
        for j in range(i + 1, len(sentence_vectors)):
            sim = cosine_sim(sentence_vectors[i], sentence_vectors[j])
            similarities.append(sim)

    avg_sim = sum(similarities) / len(similarities)
    print(f"\n Average Cosine Similarity using {label}: {avg_sim:.4f}\n")

    if display:
        print(f"\nPairwise Cosine Similarities ({label}):\n")
        for (i, s1), (j, s2) in combinations(enumerate(sentences), 2):
            sim = cosine_sim(sentence_vectors[i], sentence_vectors[j])
            print(f"[{i+1}] \"{s1[:60]}...\"")
            print(f"[{j+1}] \"{s2[:60]}...\"")
            print(f"   → Cosine Similarity: {sim:.4f}\n")
    
    return avg_sim
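
As a quick sanity check on the size of the comparison, the average is taken over every unordered pair of sentences. A small sketch (variable names are illustrative) confirms how many pairs 15 sentences produce:

```python
from itertools import combinations
from math import comb

n_sentences = 15
pairs = list(combinations(range(n_sentences), 2))  # every unordered pair (i, j) with i < j
print(len(pairs), comb(n_sentences, 2))
```

Both values are 105, so each reported average summarizes 105 cosine similarities.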

### Run

In [36]:
# GloVe
glove_vectors = load_word_vectors("glove")
glove_sentence_vectors = vectorize_sentences(selected_sentences, "glove", glove_vectors)
glove_avg_sim = compute_and_display_similarities(selected_sentences, glove_sentence_vectors, "GloVe", display = True)


 Average Cosine Similarity using GloVe: 0.9108


Pairwise Cosine Similarities (GloVe):

[1] "the new agreement , which expands a long-established coopera..."
[2] "( adp news ) - finnish handling systems provider cargotec oy..."
   → Cosine Similarity: 0.8701

[1] "the new agreement , which expands a long-established coopera..."
[3] "the world is biggest magazine paper maker said the program t..."
   → Cosine Similarity: 0.9505

[1] "the new agreement , which expands a long-established coopera..."
[4] "a january 11 , 2010 ephc board of directors has approved an ..."
   → Cosine Similarity: 0.8865

[1] "the new agreement , which expands a long-established coopera..."
[5] "with this appointment kaupthing bank aims to further co-ordi..."
   → Cosine Similarity: 0.9650

[1] "the new agreement , which expands a long-established coopera..."
[6] "st. petersburg , oct 14 ( prime-tass ) -- finnish tire produ..."
   → Cosine Similarity: 0.9201

[... remaining pairwise output truncated ...]

In [37]:
tokenized_all = [s.lower().split() for s in corpora[text_col].tolist()]
vocab = sorted(set(word for sent in tokenized_all for word in sent))
pmi_similarities = []
for window in [2, 3, 4, 5]:
    print(f"Window Size = {window}")
    pmi_vectors = load_word_vectors("pmi", tokenized_sentences=tokenized_all, vocab=vocab, window_size=window)
    pmi_sentence_vectors = vectorize_sentences(selected_sentences, "pmi", pmi_vectors)
    display = window == 5  # print the pairwise table only for the largest window
    pmi_avg_sim = compute_and_display_similarities(selected_sentences, pmi_sentence_vectors, "PMI", display=display)
    pmi_similarities.append(pmi_avg_sim)

Window Size = 2

 Average Cosine Similarity using PMI: 0.6008

Window Size = 3

 Average Cosine Similarity using PMI: 0.6098

Window Size = 4

 Average Cosine Similarity using PMI: 0.6163

Window Size = 5

 Average Cosine Similarity using PMI: 0.6199


Pairwise Cosine Similarities (PMI):

[1] "the new agreement , which expands a long-established coopera..."
[2] "( adp news ) - finnish handling systems provider cargotec oy..."
   → Cosine Similarity: 0.5050

[1] "the new agreement , which expands a long-established coopera..."
[3] "the world is biggest magazine paper maker said the program t..."
   → Cosine Similarity: 0.7595

[1] "the new agreement , which expands a long-established coopera..."
[4] "a january 11 , 2010 ephc board of directors has approved an ..."
   → Cosine Similarity: 0.7415

[1] "the new agreement , which expands a long-established coopera..."
[5] "with this appointment kaupthing bank aims to further co-ordi..."
   → Cosine Similarity: 0.7383

[... remaining pairwise output truncated ...]

### Summary
GloVe consistently yields higher similarity scores than PMI (average 0.9108 versus 0.6008 to 0.6199), likely because its vectors were trained on a large external corpus.
The PMI vectors still capture meaningful relationships, and their average similarity rises slightly with each increase in window size, peaking at window = 5.

In [39]:
print("\nSummary:")
print(f"GloVe based average similarity: {glove_avg_sim:.4f}")
for window, pmi_avg_sim in zip([2, 3, 4, 5], pmi_similarities):
    print(f"PMI based average similarity with window {window}:   {pmi_avg_sim:.4f}")


Summary:
GloVe based average similarity: 0.9108
PMI based average similarity with window 2:   0.6008
PMI based average similarity with window 3:   0.6098
PMI based average similarity with window 4:   0.6163
PMI based average similarity with window 5:   0.6199


In [40]:
positive_corpora[text_col][567]

'the new agreement , which expands a long-established cooperation between the companies , involves the transfer of certain engineering and documentation functions from larox to etteplan .'

In [41]:
positive_corpora[text_col][1752]

'( adp news ) - finnish handling systems provider cargotec oyj ( hel : cgcbv ) announced on friday it won orders worth eur 10 million ( usd 13.2 m ) to deliver linkspans to jordan , morocco and ireland .'

In [43]:
positive_corpora[text_col][995]

'the world is biggest magazine paper maker said the program to improve efficiency will include closing several of its least competitive mills and would cover all the company is operations resulting in annual savings of some euro200 million us$ 240 million .'