# Sentiment Analysis

Now that we have a cleaned dataset, we can start the sentiment analysis of the comments about ```VOO``` from the subreddit ```r/ETFs```.

### Import Modules

In [1]:
from functions.count_parameters import count_parameters
from functions.evaluate import evaluate
from functions.get_accuracy import get_accuracy
from functions.get_collate_fn import get_collate_fn
from functions.get_data_loader import get_data_loader
from functions.predict_sentiment import predict_sentiment
from functions.split_data import split_data
from functions.train import train

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.optim as optim
import torchtext
import tqdm
import transformers

In [2]:
seed = 8888

np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True

### Import the datasets

In [3]:
cmts_voo = pd.read_csv('../datasets/cleaned_cmts_voo.csv')
display(cmts_voo)

| Unnamed: 0 | AUTHOR | ID | CREATED_UTC | PERMALINK | BODY | SCORE | SUBREDDIT |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | lotterytix | kwh3sji | 2024-03-25 20:10:23 | /r/ETFs/comments/1bmqbxg/new_to_investing_is_v... | Maybe consider VOO and a mid/small cap value f... | 1 | ETFs |
| 1 | AlgoTradingQuant | kwczgum | 2024-03-25 00:51:21 | /r/ETFs/comments/1bmoom7/diversifying_my_ira_f... | I’m retired and hold a 100% equities portfolio... | 8 | ETFs |
| 2 | foldinthechhese | kwdbk25 | 2024-03-25 02:02:08 | /r/ETFs/comments/1bmoom7/diversifying_my_ira_f... | The more experienced investors recommend a ble... | 5 | ETFs |
| 3 | SirChetManly | kwd6nto | 2024-03-25 01:33:43 | /r/ETFs/comments/1bmoom7/diversifying_my_ira_f... | It isn't *risky* by any stretch. You're exclud... | 2 | ETFs |
| 4 | ZAROV8862 | kwei3zo | 2024-03-25 06:17:54 | /r/ETFs/comments/1bmoom7/diversifying_my_ira_f... | Enough said :)) | 2 | ETFs |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 927 | Financial_Pickle_987 | kvqcowi | 2024-03-20 21:48:06 | /r/ETFs/comments/1bgtk4e/voodoo_is_the_sorcery... | Lots of downs, lots of ups, but average is aro... | 2 | ETFs |
| 928 | platskol | kvaljn3 | 2024-03-17 23:49:51 | /r/ETFs/comments/1bgtk4e/voodoo_is_the_sorcery... | That is a Reddit thing. As soon as people say ... | 8 | ETFs |
| 929 | phillip_jay | kv9pe1j | 2024-03-17 20:01:30 | /r/ETFs/comments/1bgtk4e/voodoo_is_the_sorcery... | Did you read it? | 4 | ETFs |
| 930 | Rand-Seagull96734 | kvhyeh5 | 2024-03-19 06:54:48 | /r/ETFs/comments/1bgtk4e/voodoo_is_the_sorcery... | Let's say you decided to invest some "play mon... | 1 | ETFs |


### Split the Cleaned Data into Train, Test and Validation Data

In [4]:
train_data, valid_data, test_data = split_data(cmts_voo)

print(train_data.info())
print()
print(valid_data.info())
print()
print(test_data.info())
print()

<class 'pandas.core.frame.DataFrame'>
Index: 349 entries, 104 to 100
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   AUTHOR       349 non-null    object
 1   ID           349 non-null    object
 2   CREATED_UTC  349 non-null    object
 3   PERMALINK    349 non-null    object
 4   BODY         349 non-null    object
 5   SCORE        349 non-null    int64 
 6   SUBREDDIT    349 non-null    object
dtypes: int64(1), object(6)
memory usage: 21.8+ KB
None

<class 'pandas.core.frame.DataFrame'>
Index: 117 entries, 127 to 920
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   AUTHOR       117 non-null    object
 1   ID           117 non-null    object
 2   CREATED_UTC  117 non-null    object
 3   PERMALINK    117 non-null    object
 4   BODY         117 non-null    object
 5   SCORE        117 non-null    int64 
 6   SUBREDDIT    117 non-null    object
dty
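The ```split_data``` helper is specific to this project and its implementation is not shown in this notebook. As a rough sketch of what a three-way split could look like, the hypothetical function below shuffles row indices with a fixed seed and partitions them. The name ```split_indices``` and the fractions are illustrative assumptions, not the values used by ```split_data```.

```python
import random


def split_indices(n_rows, valid_frac=0.2, test_frac=0.2, seed=8888):
    """Shuffle row indices and partition them into train/valid/test lists.

    Hypothetical helper: the fractions are illustrative assumptions and
    are not taken from the project's split_data function.
    """
    indices = list(range(n_rows))
    # A dedicated Random instance keeps the shuffle reproducible without
    # touching the global random state.
    random.Random(seed).shuffle(indices)

    n_test = int(n_rows * test_frac)
    n_valid = int(n_rows * valid_frac)

    test_idx = indices[:n_test]
    valid_idx = indices[n_test:n_test + n_valid]
    train_idx = indices[n_test + n_valid:]
    return train_idx, valid_idx, test_idx


train_idx, valid_idx, test_idx = split_indices(931)
print(len(train_idx), len(valid_idx), len(test_idx))
```

With a pandas DataFrame, each index list could then be passed to ```.iloc```, for example ```cmts_voo.iloc[train_idx]```.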

### Using Transformer Model in Sentiment Analysis

We will be using the ```BERT-Base-Uncased``` model.

```BERT```: BERT stands for Bidirectional Encoder Representations from Transformers. It's a groundbreaking model introduced by Google in 2018 that revolutionized the field of natural language processing (NLP). BERT is known for its deep understanding of language context, which it achieves through its transformer architecture.

```Base```: The "base" in "bert-base-uncased" indicates the size of the model. BERT typically comes in two sizes: base and large. The base model is smaller and faster, making it more practical for many applications, though the large model generally performs better on NLP tasks. The base model has 110 million parameters, while the large model has 340 million.
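The figure of roughly 110 million parameters can be checked by hand. The sketch below adds up the weights and biases of a BERT encoder, assuming the commonly published ```bert-base-uncased``` configuration (vocabulary of 30,522 tokens, hidden size 768, 12 layers, feed-forward size 3,072, 512 positions, 2 segment types). The function name ```bert_parameter_count``` is hypothetical, and the count covers the encoder and pooler only, not any task-specific classification head.

```python
def bert_parameter_count(vocab_size=30522, hidden=768, layers=12,
                         intermediate=3072, max_positions=512, type_vocab=2):
    """Count weights and biases of a BERT encoder plus pooler.

    The default values are assumed to match bert-base-uncased.
    """
    layer_norm = 2 * hidden  # scale and shift vectors

    # Word, position and segment embeddings, followed by a LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm

    # Query, key, value and output projections (weights + biases), then LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + layer_norm

    # Two dense layers (hidden -> intermediate -> hidden), then LayerNorm.
    feed_forward = ((hidden * intermediate + intermediate)
                    + (intermediate * hidden + hidden)
                    + layer_norm)

    # Dense layer applied to the [CLS] representation.
    pooler = hidden * hidden + hidden

    return embeddings + layers * (attention + feed_forward) + pooler


print(f"{bert_parameter_count():,}")  # 109,482,240
```

The result, about 109.5 million, is what is usually rounded to 110 million.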

```Uncased```: This specifies that the model was trained on text that has been converted to lowercase, meaning the model does not differentiate between uppercase and lowercase letters. This is in contrast to a "cased" model, which is sensitive to letter casing. For instance, in a cased model, "Hello" and "hello" would be treated differently, whereas they would be treated the same in an uncased model.
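To make the cased/uncased distinction concrete, here is a toy illustration. It is not the real BERT tokenizer: the vocabulary and the ```toy_lookup``` function are made up for this example and only mimic the lowercasing step.

```python
def toy_lookup(text, vocab, lowercase):
    """Map whitespace-separated words to ids, optionally lowercasing first."""
    unk_id = vocab["[UNK]"]
    words = text.lower().split() if lowercase else text.split()
    return [vocab.get(word, unk_id) for word in words]


# Made-up vocabulary in which "hello" and "Hello" have different ids.
toy_vocab = {"[UNK]": 0, "hello": 1, "Hello": 2, "world": 3}

# Uncased behaviour: both spellings map to the same ids.
print(toy_lookup("Hello world", toy_vocab, lowercase=True))   # [1, 3]
print(toy_lookup("hello world", toy_vocab, lowercase=True))   # [1, 3]

# Cased behaviour: the capitalised word keeps its own id.
print(toy_lookup("Hello world", toy_vocab, lowercase=False))  # [2, 3]
```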

The tokenizer used here is ```AutoTokenizer``` from ```Hugging Face```. 
More details are available at https://huggingface.co/docs/transformers/v4.39.2/en/autoclass_tutorial#autotokenizer

In [5]:
transformer_name = "bert-base-uncased"

tokenizer = transformers.AutoTokenizer.from_pretrained(transformer_name)

Here is an example of tokenization and numericalization. 

In [6]:
tokenizer.tokenize("We are doing sentiment analysis!")

['we', 'are', 'doing', 'sentiment', 'analysis', '!']

In [7]:
tokenizer.encode("We are doing sentiment analysis!")

[101, 2057, 2024, 2725, 15792, 4106, 999, 102]

In [8]:
tokenizer.convert_ids_to_tokens(tokenizer.encode("We are doing sentiment analysis!"))

['[CLS]', 'we', 'are', 'doing', 'sentiment', 'analysis', '!', '[SEP]']

In [9]:
tokenizer("We are doing sentiment analysis!")

{'input_ids': [101, 2057, 2024, 2725, 15792, 4106, 999, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}

In [10]:
pad_index = tokenizer.pad_token_id
batch_size = 8

train_data_loader = get_data_loader(train_data, batch_size, pad_index, shuffle=True)
valid_data_loader = get_data_loader(valid_data, batch_size, pad_index)
test_data_loader = get_data_loader(test_data, batch_size, pad_index)
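The ```get_data_loader``` and ```get_collate_fn``` helpers are not shown in this notebook, so the following is only a sketch of the padding step that a collate function typically performs with ```pad_index```: every sequence in a batch is extended to the length of the longest one, and an attention mask marks which positions hold real tokens. The function ```pad_batch``` is hypothetical, and the value ```pad_index=0``` is an assumption about the ```[PAD]``` id of ```bert-base-uncased```.

```python
def pad_batch(sequences, pad_index):
    """Right-pad token id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_index] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq))
                      for seq in sequences]
    return input_ids, attention_mask


batch = [
    [101, 2057, 2024, 2725, 15792, 4106, 999, 102],  # "We are doing sentiment analysis!"
    [101, 2057, 2024, 102],                          # "[CLS] we are [SEP]"
]
ids, mask = pad_batch(batch, pad_index=0)  # pad_index=0 is an assumption
print(ids[1])   # [101, 2057, 2024, 102, 0, 0, 0, 0]
print(mask[1])  # [1, 1, 1, 1, 0, 0, 0, 0]
```

Padding per batch rather than to a global maximum keeps short batches small, which is one common reason to do this inside the collate function.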

### Transformer Model

Here, we are using the transformer model 