## 📰 Financial News Sentiment Prediction

Given *financial news headlines*, let's try to predict the **sentiment** (negative, neutral, or positive) of each headline.

We will use a TensorFlow RNN (a GRU layer) to make our predictions.

Data source: https://www.kaggle.com/datasets/ankurzing/sentiment-analysis-for-financial-news

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

import tensorflow as tf

In [4]:
data = pd.read_csv('archive/all-data.csv', encoding='latin-1', names=['Label', 'Text'])
data

|      | Label    | Text                                               |
|------|----------|----------------------------------------------------|
| 0    | neutral  | According to Gran , the company has no plans t...  |
| 1    | neutral  | Technopolis plans to develop in stages an area...  |
| 2    | negative | The international electronic industry company ...  |
| 3    | positive | With the new production plant the company woul...  |
| 4    | positive | According to the company 's updated strategy f...  |
| ...  | ...      | ...                                                |
| 4841 | negative | LONDON MarketWatch -- Share prices ended lower...  |
| 4842 | neutral  | Rinkuskiai 's beer sales fell by 6.5 per cent ...  |
| 4843 | negative | Operating profit fell to EUR 35.4 mn from EUR ...  |
| 4844 | negative | Net sales of the Paper segment decreased to EU...  |


In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4846 entries, 0 to 4845
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Label   4846 non-null   object
 1   Text    4846 non-null   object
dtypes: object(2)
memory usage: 75.8+ KB


### Preprocessing

In [6]:
data['Label'].unique()

array(['neutral', 'negative', 'positive'], dtype=object)

In [19]:
df = data.copy()

In [20]:
df

|      | Label | Text                                               |
|------|-------|----------------------------------------------------|
| 0    | 1     | According to Gran , the company has no plans t...  |
| 1    | 1     | Technopolis plans to develop in stages an area...  |
| 2    | 0     | The international electronic industry company ...  |
| 3    | 2     | With the new production plant the company woul...  |
| 4    | 2     | According to the company 's updated strategy f...  |
| ...  | ...   | ...                                                |
| 4841 | 0     | LONDON MarketWatch -- Share prices ended lower...  |
| 4842 | 1     | Rinkuskiai 's beer sales fell by 6.5 per cent ...  |
| 4843 | 0     | Operating profit fell to EUR 35.4 mn from EUR ...  |
| 4844 | 0     | Net sales of the Paper segment decreased to EU...  |


In [15]:
def get_sequences(texts):
    # Fit the tokenizer on the headlines and map each word to an integer id
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(texts)
    sequences = tokenizer.texts_to_sequences(texts)
    # Index 0 is reserved for padding, hence the + 1
    print("Vocab Length:", len(tokenizer.word_index) + 1)
    max_seq_length = max(len(sequence) for sequence in sequences)
    print("Maximum sequence length:", max_seq_length)
    # Zero-pad at the end so every sequence has the same length
    sequences = pad_sequences(sequences, maxlen=max_seq_length, padding='post')
    return sequences

In [21]:
get_sequences(df['Text'])

Vocab Length: 10123
Maximum sequence length: 71


array([[  94,    5, 3498, ...,    0,    0,    0],
       [ 840,  336,    5, ...,    0,    0,    0],
       [   1,  293,  656, ...,    0,    0,    0],
       ...,
       [  42,   31,  242, ...,    0,    0,    0],
       [  30,   27,    2, ...,    0,    0,    0],
       [  27,    3,   35, ...,    0,    0,    0]],
      shape=(4846, 71), dtype=int32)
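
To make the tokenize-and-pad step concrete, here is a minimal pure-Python sketch of the same idea on three made-up headlines. It is a simplified illustration, not the Keras implementation: the helper names `fit_word_index` and `to_padded_sequences` and the toy texts are assumptions, and it only lowercases and splits on whitespace, whereas the Keras `Tokenizer` also filters punctuation.

```python
from collections import Counter

def fit_word_index(texts):
    """Assign integer ids by descending word frequency, starting at 1 (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by frequency; words with equal counts keep first-seen order
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_padded_sequences(texts, word_index):
    """Map each text to a list of word ids, then append zeros up to the longest length."""
    sequences = [[word_index[w] for w in text.lower().split()] for text in texts]
    max_len = max(len(seq) for seq in sequences)
    return [seq + [0] * (max_len - len(seq)) for seq in sequences]

# Hypothetical toy headlines, not taken from the dataset
toy_texts = ["profit rose", "profit fell sharply", "sales fell"]
word_index = fit_word_index(toy_texts)
padded = to_padded_sequences(toy_texts, word_index)

print("Vocab Length:", len(word_index) + 1)  # Vocab Length: 6
print(padded)                                # [[1, 3, 0], [1, 2, 4], [5, 2, 0]]
```

The most frequent words get the smallest ids, and shorter rows are filled with trailing zeros, which mirrors the `padding='post'` setting used in `get_sequences`.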

In [22]:
sequences = get_sequences(df['Text'])

label_mapping = {
    'negative': 0,
    'neutral': 1,
    'positive': 2
}

y = df['Label'].replace(label_mapping)

Vocab Length: 10123
Maximum sequence length: 71


In [23]:
train_sequences, test_sequences, y_train, y_test = train_test_split(
    sequences, y, train_size=0.7, shuffle=True, random_state=1
)

In [24]:
train_sequences

array([[5442,  510,   16, ...,    0,    0,    0],
       [  22, 1628,    4, ...,    0,    0,    0],
       [1141,  936,  136, ...,    0,    0,    0],
       ...,
       [   1,  419,   16, ...,    0,    0,    0],
       [2586,  123, 3247, ...,    0,    0,    0],
       [  30,  615,  555, ...,    0,    0,    0]],
      shape=(3392, 71), dtype=int32)

In [25]:
y_train

545     2
2374    0
4217    1
1071    1
716     2
       ..
2895    1
2763    1
905     2
3980    1
235     2
Name: Label, Length: 3392, dtype: int64
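
The 3392 training rows can be checked by hand. The sketch below assumes that `train_test_split` rounds the training count down when `train_size` is a fraction and gives the remainder to the test set; the variable names are illustrative.

```python
import math

n_samples = 4846   # rows reported by data.info()
train_size = 0.7   # fraction passed to train_test_split

# Assumption: the training count is floored, the test set gets the rest
n_train = math.floor(train_size * n_samples)
n_test = n_samples - n_train

print(n_train, n_test)  # 3392 1454
```

The result agrees with the `(3392, 71)` shape of `train_sequences` and the `Length: 3392` of `y_train`.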

### Training

In [35]:
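inputs = tf.keras.Input(shape=(train_sequences.shape[1], ))
# input_dim is the vocabulary size printed by get_sequences (10123).
# The deprecated input_length argument is dropped; Input already fixes the length.
x = tf.keras.layers.Embedding(
    input_dim = 10123,
    output_dim = 128
)(inputs)
# return_sequences=True keeps the GRU output at every timestep; Flatten concatenates them
x = tf.keras.layers.GRU(256, return_sequences=True, activation='tanh')(x)
x = tf.keras.layers.Flatten()(x)
# One softmax probability per class: negative, neutral, positive
outputs = tf.keras.layers.Dense(3, activation='softmax')(x)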

model = tf.keras.Model(inputs=inputs, outputs=outputs)

model.compile(
    optimizer = 'adam',
    loss = 'sparse_categorical_crossentropy',
    metrics = ['accuracy']
)

history = model.fit(
    train_sequences,
    y_train,
    validation_split = 0.2,
    batch_size = 32,
    epochs = 100,
    callbacks = [
        tf.keras.callbacks.EarlyStopping(
            monitor = 'val_loss',
            patience = 3,
            restore_best_weights = True
        )
    ]
)

Epoch 1/100
85/85 ━━━━━━━━━━━━━━━━━━━━ 26s 279ms/step - accuracy: 0.6347 - loss: 0.8368 - val_accuracy: 0.6392 - val_loss: 0.7983
Epoch 2/100
85/85 ━━━━━━━━━━━━━━━━━━━━ 36s 215ms/step - accuracy: 0.8094 - loss: 0.4868 - val_accuracy: 0.6937 - val_loss: 0.7979
Epoch 3/100
85/85 ━━━━━━━━━━━━━━━━━━━━ 18s 215ms/step - accuracy: 0.9337 - loss: 0.1851 - val_accuracy: 0.6863 - val_loss: 1.0193
Epoch 4/100
85/85 ━━━━━━━━━━━━━━━━━━━━ 19s 223ms/step - accuracy: 0.9823 - loss: 0.0611 - val_accuracy: 0.6863 - val_loss: 1.0658
Epoch 5/100
85/85 ━━━━━━━━━━━━━━━━━━━━ 18s 215ms/step - accuracy: 0.9963 - loss: 0.0198 - val_accuracy: 0.7054

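The interaction between `patience = 3` and the logged validation losses can be traced with a small simulation. This is a simplified sketch of patience-based early stopping, not the Keras callback itself; the function name `simulate_early_stopping` is hypothetical, and only the four `val_loss` values that appear in full in the log are used.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (best_epoch, epochs_without_improvement, stopped) for a list of losses."""
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # New best: remember it and reset the patience counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, wait, True
    return best_epoch, wait, False

# val_loss for epochs 1 to 4, copied from the training log
logged_val_losses = [0.7983, 0.7979, 1.0193, 1.0658]
best_epoch, wait, stopped = simulate_early_stopping(logged_val_losses, patience=3)

print(best_epoch, wait, stopped)  # 2 2 False
```

After epoch 4 the best validation loss is still the one from epoch 2, and two epochs have passed without improvement. Under this logic, one more epoch that fails to beat 0.7979 would exhaust the patience of 3, and `restore_best_weights = True` would then roll the model back to its epoch 2 weights.
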
### Results

In [36]:
model.evaluate(test_sequences, y_test)

46/46 ━━━━━━━━━━━━━━━━━━━━ 3s 59ms/step - accuracy: 0.7483 - loss: 0.6856


[0.6856056451797485, 0.7482805848121643]
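
The `85/85` and `46/46` step counts in the logs follow from the split sizes and the batch size. The sketch below assumes that Keras holds out the last 20% of the training rows for validation (with the split index rounded down) and that `model.evaluate` uses its default batch size of 32; the variable names are illustrative.

```python
import math

n_train = 3392            # rows in train_sequences
n_test = 4846 - n_train   # remaining rows form the test set
validation_split = 0.2
batch_size = 32

# Assumption: the first floor(n * (1 - split)) rows are used for fitting
n_fit = math.floor(n_train * (1 - validation_split))
n_val = n_train - n_fit

train_steps = math.ceil(n_fit / batch_size)
test_steps = math.ceil(n_test / batch_size)

print(n_fit, n_val)             # 2713 679
print(train_steps, test_steps)  # 85 46
```

So each epoch fits on 2713 headlines and validates on 679, while the reported test accuracy is measured on 1454 headlines.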

In [37]:
y_test.value_counts()

Label
1    850
2    420
0    184
Name: count, dtype: int64
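
These counts give a useful reference point for the test accuracy. A minimal sketch, using only the numbers printed above, compares the model with a baseline that always predicts the most common class (label 1, neutral):

```python
# Class counts copied from y_test.value_counts()
test_counts = {1: 850, 2: 420, 0: 184}
n_test = sum(test_counts.values())

# A majority-class baseline always predicts the most frequent label
majority_label = max(test_counts, key=test_counts.get)
baseline_accuracy = test_counts[majority_label] / n_test

# Accuracy returned by model.evaluate
model_accuracy = 0.7482805848121643

print(majority_label, round(baseline_accuracy, 4))    # 1 0.5846
print(round(model_accuracy - baseline_accuracy, 4))   # 0.1637
```

The model's roughly 74.8% accuracy is therefore about 16 percentage points above what always guessing "neutral" would achieve on this test split.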