## Introduction

We will be using the **ALBERT** transformer model to predict sentiment from financial news.

ALBERT (A Lite BERT) is a self-supervised model for learning language representations. It was considered a major breakthrough because it achieves performance comparable to BERT with far fewer parameters.
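One source of ALBERT's parameter savings is its factorized embedding parameterization: instead of a single vocabulary-by-hidden-size embedding matrix, it uses a small embedding followed by a projection to the hidden size. The arithmetic below is a rough sketch of that effect on the embedding table alone. The sizes are illustrative assumptions, not values read from the model used in this notebook.

```python
# Illustrative sizes (assumptions, not taken from this notebook's model):
vocab_size = 30_000      # V: number of tokens in the vocabulary
hidden_size = 768        # H: transformer hidden dimension
embedding_size = 128     # E: reduced embedding dimension

# BERT-style embedding table: a single V x H matrix
bert_style_params = vocab_size * hidden_size

# ALBERT-style factorization: a V x E matrix followed by an E x H projection
albert_style_params = vocab_size * embedding_size + embedding_size * hidden_size

print(f"V x H:         {bert_style_params:,}")
print(f"V x E + E x H: {albert_style_params:,}")
```

ALBERT also shares parameters across its transformer layers, which this sketch does not cover.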

The dataset was extracted from the following research article: [Malo, P., Sinha, A., Takala, P., Korhonen, P., & Wallenius, J. (2013, July 23). Good debt or bad debt: Detecting semantic orientations in economic texts. arXiv.org. Retrieved February 28, 2023, from https://arxiv.org/abs/1307.5336](https://arxiv.org/pdf/1307.5336.pdf)

## Import Libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import tensorflow

2023-02-28 20:30:09.448453: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


## Load Data

In [3]:
import codecs

# Open the input file for reading with the original encoding
with codecs.open('Sentences_AllAgree.txt', 'r', encoding='ISO-8859-1') as f:
    # Read the file content
    file_content = f.read()

# Open the output file for writing with utf-8 encoding
with codecs.open('Sentences_AllAgree_new.txt', 'w', encoding='utf-8') as f:
    # Write the file content with utf-8 encoding
    f.write(file_content)
    
# Load the new UTF-8 encoded file into a Pandas Dataframe
df = pd.read_csv('Sentences_AllAgree_new.txt', delimiter='\t')
display(df.head(5))

Unnamed: 0,"According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .@neutral"
0,"For the last quarter of 2010 , Componenta 's n..."
1,"In the third quarter of 2010 , net sales incre..."
2,Operating profit rose to EUR 13.1 mn from EUR ...
3,"Operating profit totalled EUR 21.1 mn , up fro..."
4,Finnish Talentum reports its operating profit ...


## Preprocess Data

As seen above, the data needs to be preprocessed before analysis. Because the file has no header line, `pd.read_csv` treated the first sentence as the column name; we will move it back into the data as the first row with the following manipulation.

In addition, we will split each string at the `@` character using the `.split()` method, which separates the sentence from its label. We can then store the results in two columns named `news headline` and `sentiment`.
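As a minimal sketch of this step on a single line, the helper below (a hypothetical name, `split_sentence_label`) splits at the last `@` so that any `@` inside the sentence itself would be left alone. It also shows why `.strip('@')` is not suitable here: it only removes `@` characters from the ends of a string.

```python
def split_sentence_label(line):
    """Split 'sentence@label' at the last '@' and trim surrounding whitespace."""
    text, label = line.rsplit('@', 1)
    return text.strip(), label.strip()

# First line of the dataset, as shown in the output above
sample = ("According to Gran , the company has no plans to move all production "
          "to Russia , although that is where the company is growing .@neutral")

text, label = split_sentence_label(sample)
print(label)

# strip('@') leaves the string unchanged because '@' is not at either end
unchanged = sample.strip('@')
```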

In [4]:
# The column header actually holds the first sentence; capture it as a new row
new_row = list(df.columns)

# Insert the new row above the first row
df.loc[-1] = new_row
df.index = df.index + 1
df = df.sort_index()
display(df.head())

Unnamed: 0,"According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .@neutral"
0,"According to Gran , the company has no plans t..."
1,"For the last quarter of 2010 , Componenta 's n..."
2,"In the third quarter of 2010 , net sales incre..."
3,Operating profit rose to EUR 13.1 mn from EUR ...
4,"Operating profit totalled EUR 21.1 mn , up fro..."
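An alternative to re-inserting the header row is to tell pandas that the file has no header in the first place. The sketch below uses a two-line in-memory string (made-up rows standing in for the real file) to compare the default behaviour with `header=None`. The column name `article` is borrowed from the next cell.

```python
import io
import pandas as pd

# Two made-up lines in the same 'sentence@label' layout as the dataset
raw = "first sentence .@neutral\nsecond sentence .@positive\n"

# Default: the first line is consumed as the column header
default_df = pd.read_csv(io.StringIO(raw), delimiter='\t')

# header=None keeps every line as data; names= supplies the column label
no_header_df = pd.read_csv(io.StringIO(raw), delimiter='\t',
                           header=None, names=['article'])

print(len(default_df), len(no_header_df))
```

With the real file, the same arguments would be passed to the `pd.read_csv` call in the Load Data cell.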


In [10]:
df.columns = ['article']
news_analysis = []
sentiment = []

for index, row in df.iterrows():
    # Split at the last '@' to separate the sentence from its sentiment label
    parts = row['article'].rsplit('@', 1)
    news_analysis.append(parts[0])
    sentiment.append(parts[1])

print(len(sentiment))

2264
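The two-column layout described earlier (`news headline` and `sentiment`) can also be produced without an explicit loop. This is a sketch using pandas string methods on a tiny stand-in DataFrame whose rows are hypothetical, not taken from the dataset. In the notebook the same two lines would apply to `df['article']`.

```python
import pandas as pd

# Stand-in for df after renaming its single column to 'article' (hypothetical rows)
demo = pd.DataFrame({'article': ['Operating profit rose .@positive',
                                 'Net sales fell .@negative']})

# Split each string once, from the right, into two new columns
parts = demo['article'].str.rsplit('@', n=1, expand=True)
parts.columns = ['news headline', 'sentiment']

print(parts)
```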
