## Introduction

We will use the **ALBERT** transformer model to predict sentiment from financial news.

ALBERT ("A Lite BERT") is a self-supervised model for learning language representations. It drew attention because it performs comparably to BERT while using far fewer parameters.
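
To make the size difference concrete, here is a rough back-of-the-envelope sketch of one technique described in the ALBERT paper, factorized embedding parameterization. The vocabulary, embedding, and hidden sizes are illustrative values assumed for this example (they match the commonly cited ALBERT-base configuration), and the comparison covers only the token-embedding table, not the whole model.

```python
# Illustrative sizes (assumptions for this sketch, not taken from the notebook)
vocab_size = 30_000     # V: number of tokens in the vocabulary
hidden_size = 768       # H: width of the transformer layers
embedding_size = 128    # E: reduced embedding width used by the factorization

# BERT-style embedding: a single V x H lookup table
bert_style_params = vocab_size * hidden_size

# ALBERT-style embedding: a V x E table followed by an E x H projection
albert_style_params = vocab_size * embedding_size + embedding_size * hidden_size

reduction_factor = bert_style_params / albert_style_params

print(f"V x H table:      {bert_style_params:,}")
print(f"V x E + E x H:    {albert_style_params:,}")
print(f"Reduction factor: {reduction_factor:.2f}x")
```

With these assumed sizes the embedding table shrinks from about 23.0 million to about 3.9 million parameters. ALBERT also shares parameters across layers, which this sketch does not model.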

The dataset was extracted from the following research article: [Malo, P., Sinha, A., Takala, P., Korhonen, P., & Wallenius, J. (2013, July 23). Good debt or bad debt: Detecting semantic orientations in economic texts. arXiv.org. Retrieved February 28, 2023, from https://arxiv.org/abs/1307.5336](https://arxiv.org/pdf/1307.5336.pdf)

## Import Libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow

pd.set_option('display.max_colwidth', None)

2023-03-01 09:27:24.422296: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


## Load Data

In [2]:
import codecs

# Open the input file for reading with the original encoding
with codecs.open('Sentences_AllAgree.txt', 'r', encoding='ISO-8859-1') as f:
    # Read the file content
    file_content = f.read()

# Open the output file for writing with utf-8 encoding
with codecs.open('Sentences_AllAgree_new.txt', 'w', encoding='utf-8') as f:
    # Write the file content with utf-8 encoding
    f.write(file_content)
    
# Load the new UTF-8 encoded file into a Pandas Dataframe
df = pd.read_csv('Sentences_AllAgree_new.txt', delimiter='\t')
display(df.head(5))

FileNotFoundError: [Errno 2] No such file or directory: 'Sentences_AllAgree.txt'
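
As a possible alternative to re-encoding the file first, `pd.read_csv` can decode ISO-8859-1 directly, and `header=None` keeps the first sentence as data rather than as a column name. The sketch below is only an illustration: it writes a small temporary file (the path `sample_sentences.txt` and the third sentence are made up for the example) instead of assuming `Sentences_AllAgree.txt` is present.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the real file; the last sentence is invented for this sketch
sample_lines = [
    "According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .@neutral",
    "In the third quarter of 2010 , net sales increased by 5.2 % to EUR 205.5 mn , and operating profit by 34.9 % to EUR 23.5 mn .@positive",
    "Sales at the café chain fell in the last quarter .@negative",
]

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample_sentences.txt")

    # Write the sample with the same encoding as the original dataset
    with open(sample_path, "w", encoding="ISO-8859-1") as f:
        f.write("\n".join(sample_lines) + "\n")

    # Decode while reading, and treat every line (including the first) as data
    raw_df = pd.read_csv(
        sample_path,
        delimiter="\t",
        header=None,
        names=["article"],
        encoding="ISO-8859-1",
    )

print(raw_df.shape)
print(raw_df.head())
```

Read this way, the first sentence already sits in row 0, so no row needs to be shifted afterwards.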

## Clean Data

The data needs to be preprocessed before analysis. Because the file has no header row, `read_csv` used the first sentence as the column name; we move it back into the data as the first row with the following manipulation.

In addition, we split each string at the `"@"` character using the `.split()` method and store the two parts in columns named `News Article` and `Sentiment`.

In [None]:
# Set column value to new row
new_row = list(df.columns)

# Insert the new row above the first row
df.loc[-1] = new_row
df.index = df.index + 1
df = df.sort_index()
display(df.head())

Unnamed: 0,"According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .@neutral"
0,"According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .@neutral"
1,"For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m .@positive"
2,"In the third quarter of 2010 , net sales increased by 5.2 % to EUR 205.5 mn , and operating profit by 34.9 % to EUR 23.5 mn .@positive"
3,Operating profit rose to EUR 13.1 mn from EUR 8.7 mn in the corresponding period in 2007 representing 7.7 % of net sales .@positive
4,"Operating profit totalled EUR 21.1 mn , up from EUR 18.6 mn in 2007 , representing 9.7 % of net sales .@positive"


In [4]:
# Loop through each row and split at the "@" character
df.columns = ['article']
news_article = []
sentiment = []

for _, row in df.iterrows():
    parts = row['article'].split('@')
    news_article.append(parts[0])
    sentiment.append(parts[1])

# Create new dataframe
df = pd.DataFrame({"News Article": news_article, "Sentiment": sentiment})
display(df.head())

Unnamed: 0,News Article,Sentiment
0,"According to Gran , the company has no plans to move all production to Russia , although that is where the company is growing .",neutral
1,"For the last quarter of 2010 , Componenta 's net sales doubled to EUR131m from EUR76m for the same period a year earlier , while it moved to a zero pre-tax profit from a pre-tax loss of EUR7m .",positive
2,"In the third quarter of 2010 , net sales increased by 5.2 % to EUR 205.5 mn , and operating profit by 34.9 % to EUR 23.5 mn .",positive
3,Operating profit rose to EUR 13.1 mn from EUR 8.7 mn in the corresponding period in 2007 representing 7.7 % of net sales .,positive
4,"Operating profit totalled EUR 21.1 mn , up from EUR 18.6 mn in 2007 , representing 9.7 % of net sales .",positive
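
The same split can also be sketched without an explicit loop, using the pandas string accessor. This is an alternative illustration, not the notebook's method: it uses `str.rsplit` with `n=1` so that only the last `"@"` separates text from label, and the small `raw` dataframe (including the sentence with an e-mail address) is made up for the example.

```python
import pandas as pd

# Stand-in data; the second sentence is invented to show an "@" inside the text
raw = pd.DataFrame({
    "article": [
        "Operating profit rose to EUR 13.1 mn from EUR 8.7 mn in the corresponding period in 2007 representing 7.7 % of net sales .@positive",
        "Contact ir@example.com for further details .@neutral",
        "The company reported a pre-tax loss for the period .@negative",
    ]
})

# Split once, from the right, so only the final "@" is treated as the separator
split_df = raw["article"].str.rsplit("@", n=1, expand=True)
split_df.columns = ["News Article", "Sentiment"]

print(split_df)
```

Splitting from the right keeps any `"@"` that appears inside a sentence as part of the text, which a plain left-to-right split would cut at.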


## Data Preparation

The next step is to convert the sentiment labels to a numerical form that the transformer can consume. To do this, we use the `OneHotEncoder` class from the `sklearn` library, which produces one column per sentiment label (three in total).

In [1]:
from sklearn.preprocessing import OneHotEncoder

y = df['Sentiment']

ohe = OneHotEncoder(categories=[['neutral', 'positive', 'negative']])
y_encoded = ohe.fit_transform(y.values.reshape(-1,1))

df_encoded = pd.concat([df.drop('Sentiment', axis=1), 
                        pd.DataFrame(y_encoded.toarray(),
                                     columns=ohe.get_feature_names_out())], axis=1)

# Use a flat list here; a nested list would create a MultiIndex
df_encoded.columns = ['News Article', 'Neutral', 'Positive', 'Negative']
display(df_encoded.head(5))

NameError: name 'df' is not defined
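
To illustrate what the one-hot layout looks like, here is a small sketch that swaps in plain pandas (`pd.get_dummies` on a categorical column) for `OneHotEncoder`, so it runs on a toy label list without the dataset. The `labels` series is made up for the example, and the category order is assumed to follow the `neutral`, `positive`, `negative` order used in the cell above.

```python
import pandas as pd

# Toy labels standing in for df['Sentiment']
labels = pd.Series(["neutral", "positive", "negative", "positive"])

# Fix the category order so the columns are stable even if a label is absent
sentiment_order = ["neutral", "positive", "negative"]
categorical = labels.astype(pd.CategoricalDtype(categories=sentiment_order))

# One column per category, with 1.0 marking the row's label
one_hot = pd.get_dummies(categorical, dtype=float)
one_hot.columns = ["Neutral", "Positive", "Negative"]

# Recover integer class ids (column positions) from the one-hot rows
class_ids = one_hot.to_numpy().argmax(axis=1)

print(one_hot)
print(class_ids)
```

Each row contains exactly one `1.0`, and taking the argmax along the columns maps a row back to its class index, which is handy when comparing against model predictions.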