# Sentiment Analysis

Now that we have a cleaned dataset, we can run sentiment analysis on the comments about ```VOO``` from the subreddit ```r/ETFs```.

### Import Modules

In [72]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import collections

import torch
import torch.nn as nn
import torch.optim as optim
import torchtext
import tqdm
import transformers

In [73]:
seed = 8888

np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True

### Import the datasets

In [74]:
cmts_voo = pd.read_csv('../datasets/cleaned_cmts_voo.csv')
cmts_voo.head()

Unnamed: 0,author,id,created_utc,permalink,body,score,subreddit
0,lotterytix,kwh3sji,2024-03-25 20:10:23,/r/ETFs/comments/1bmqbxg/new_to_investing_is_v...,Maybe consider VOO and a mid/small cap value f...,1,ETFs
1,AlgoTradingQuant,kwczgum,2024-03-25 00:51:21,/r/ETFs/comments/1bmoom7/diversifying_my_ira_f...,I’m retired and hold a 100% equities portfolio...,8,ETFs
2,foldinthechhese,kwdbk25,2024-03-25 02:02:08,/r/ETFs/comments/1bmoom7/diversifying_my_ira_f...,The more experienced investors recommend a ble...,5,ETFs
3,SirChetManly,kwd6nto,2024-03-25 01:33:43,/r/ETFs/comments/1bmoom7/diversifying_my_ira_f...,It isn't *risky* by any stretch. You're exclud...,2,ETFs
4,ZAROV8862,kwei3zo,2024-03-25 06:17:54,/r/ETFs/comments/1bmoom7/diversifying_my_ira_f...,Enough said :)),2,ETFs


### Using a Transformer Model for Sentiment Analysis

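We will use ```distilbert-base-uncased-finetuned-sst-2-english```, a distilled (smaller and faster) version of the ```BERT-Base-Uncased``` model that has been fine-tuned for sentiment classification on the SST-2 dataset. The parts of the underlying model name mean the following.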

```BERT```: BERT stands for Bidirectional Encoder Representations from Transformers. It's a groundbreaking model introduced by Google in 2018 that revolutionized the field of natural language processing (NLP). BERT is known for its deep understanding of language context, which it achieves through its transformer architecture.

```Base```: The "base" in "bert-base-uncased" indicates the size of the model. BERT typically comes in two sizes: base and large. The base model is smaller and faster, making it more practical for many applications, though the large model generally performs better on NLP tasks. The base model has 110 million parameters, while the large model has 340 million.

```Uncased```: This specifies that the model was trained on text that has been converted to lowercase, meaning the model does not differentiate between uppercase and lowercase letters. This is in contrast to a "cased" model, which is sensitive to letter casing. For instance, in a cased model, "Hello" and "hello" would be treated differently, whereas they would be treated the same in an uncased model.
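To make the effect of an uncased model concrete, here is a minimal sketch of the kind of text normalization involved. The helper name ```uncased_normalize``` is hypothetical, and the accent-stripping step is an assumption about what usually accompanies lowercasing in uncased BERT tokenizers; it is not the tokenizer's actual code.

```python
import unicodedata

def uncased_normalize(text: str) -> str:
    """Rough approximation of the normalization behind an uncased model."""
    # Lowercase so that "Hello" and "hello" become the same string
    lowered = text.lower()
    # Assumption: accents are stripped as well, by decomposing characters
    # and dropping the combining marks (Unicode category "Mn")
    decomposed = unicodedata.normalize("NFD", lowered)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Both spellings collapse to the same text before tokenization
print(uncased_normalize("Hello") == uncased_normalize("hello"))  # True
print(uncased_normalize("Never EVER buy VOO"))  # never ever buy voo
```

One consequence for this dataset is that the ticker ```VOO``` and the lowercase string ```voo``` are indistinguishable to the model.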

The tokenizer used here is loaded with ```AutoTokenizer``` from ```Hugging Face```.
For more details, see https://huggingface.co/docs/transformers/v4.39.2/en/autoclass_tutorial#autotokenizer

In [75]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


In [76]:
sentence_pos = "Stock price of VOO will go higher"
sentence_neg = "Never ever buy VOO"

In [77]:
classifier(sentence_pos), classifier(sentence_neg)

([{'label': 'POSITIVE', 'score': 0.5957298278808594}],
 [{'label': 'NEGATIVE', 'score': 0.9967347979545593}])

In [78]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_name = 'distilbert-base-uncased-finetuned-sst-2-english'
pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)

In [79]:
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [80]:
inputs_pos, inputs_neg = tokenizer(sentence_pos), tokenizer(sentence_neg)

In [81]:
inputs_pos, inputs_neg

({'input_ids': [101, 4518, 3976, 1997, 29536, 2080, 2097, 2175, 3020, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]},
 {'input_ids': [101, 2196, 2412, 4965, 29536, 2080, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]})

In [82]:
tokenizer.convert_ids_to_tokens([29536, 2080])

['vo', '##o']
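The output above shows that ```VOO``` is not a single vocabulary entry and is split into two subword pieces. The sketch below illustrates the greedy longest-match-first idea behind WordPiece-style splitting. The function ```wordpiece_tokenize``` and the tiny ```toy_vocab``` are hypothetical and only for illustration; the real tokenizer uses a vocabulary of roughly 30,000 entries and extra preprocessing.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry a "##" prefix
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            # No piece matched at this position: treat the whole word as unknown
            return [unk_token]
        tokens.append(piece)
        start = end
    return tokens

# Hypothetical toy vocabulary, not the model's real one
toy_vocab = {"stock", "price", "buy", "vo", "##o", "##s"}

print(wordpiece_tokenize("voo", toy_vocab))     # ['vo', '##o']
print(wordpiece_tokenize("stocks", toy_vocab))  # ['stock', '##s']
```

Because the ticker is represented only by generic subword pieces, the model probably has no learned notion of ```VOO``` itself and relies on the surrounding words for sentiment.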

In [83]:
pt_batch = tokenizer(
    [sentence_pos, sentence_neg], 
    padding=True, 
    truncation=True,
    max_length=512,
    return_tensors="pt"
)

In [84]:
pt_batch

{'input_ids': tensor([[  101,  4518,  3976,  1997, 29536,  2080,  2097,  2175,  3020,   102],
        [  101,  2196,  2412,  4965, 29536,  2080,   102,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]])}
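With ```padding=True```, the shorter sequence is filled with the padding id ```0``` up to the length of the longest one, and the attention mask marks which positions are real tokens. A minimal pure-Python sketch of that step follows; ```pad_batch``` is a hypothetical helper, not the tokenizer's implementation, and the padding id of 0 is taken from the output above.

```python
def pad_batch(sequences, pad_id=0):
    """Right-pad token id lists to a common length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens are kept; padding ids are appended on the right
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 = real token the model should attend to, 0 = padding to ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Token ids copied from the tokenizer output shown earlier
ids_pos = [101, 4518, 3976, 1997, 29536, 2080, 2097, 2175, 3020, 102]
ids_neg = [101, 2196, 2412, 4965, 29536, 2080, 102]

batch = pad_batch([ids_pos, ids_neg])
print(batch["input_ids"][1])       # [101, 2196, 2412, 4965, 29536, 2080, 102, 0, 0, 0]
print(batch["attention_mask"][1])  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```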

In [85]:
pt_outputs = pt_model(**pt_batch)

In [86]:
pt_outputs

SequenceClassifierOutput(loss=None, logits=tensor([[-0.1119,  0.2758],
        [ 3.1458, -2.5754]], grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
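The logits are raw, unnormalized scores. Applying softmax turns them into probabilities, which is how the pipeline scores shown earlier can be recovered. The sketch below does this in plain Python using the rounded logits printed above; the mapping of index 0 to ```NEGATIVE``` and index 1 to ```POSITIVE``` is assumed from this checkpoint's configuration and can be checked with ```pt_model.config.id2label```.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Assumed label mapping for this checkpoint (see pt_model.config.id2label)
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

predictions = []
for logits in ([-0.1119, 0.2758], [3.1458, -2.5754]):
    probs = softmax(logits)
    # The predicted class is the index with the highest probability
    label_id = max(range(len(probs)), key=probs.__getitem__)
    predictions.append((id2label[label_id], probs[label_id]))
    print(id2label[label_id], round(probs[label_id], 4))
# POSITIVE 0.5957
# NEGATIVE 0.9967
```

These values agree with the pipeline output for the two example sentences, up to the rounding of the printed logits.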

In [87]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = 'distilbert-base-uncased-finetuned-sst-2-english'

pt_model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

def get_label_and_score(output):
    # Apply softmax to the output logits to get class probabilities
    probabilities = torch.nn.functional.softmax(output.logits, dim=-1)

    # Predicted class id: 0 = NEGATIVE, 1 = POSITIVE (see pt_model.config.id2label)
    label = torch.argmax(probabilities, dim=-1).item()

    # Probability of the predicted class (not of the positive class)
    score = probabilities[0][label].item()

    return label, score

results = []

# Inference only: disable gradient tracking to save memory and time
with torch.no_grad():
    for comment in cmts_voo['body']:
        # One comment per forward pass; text beyond 512 tokens is truncated
        pt_batch = tokenizer(
            str(comment),
            padding=True,
            truncation=True,
            max_length=512,
            return_tensors="pt"
        )
        pt_outputs = pt_model(**pt_batch)
        label, score = get_label_and_score(pt_outputs)
        results.append((label, score))

labels, scores = zip(*results)
cmts_voo['sentiment_label'] = labels
cmts_voo['sentiment_score'] = scores


In [88]:
cmts_voo.head()

Unnamed: 0,author,id,created_utc,permalink,body,score,subreddit,sentiment_label,sentiment_score
0,lotterytix,kwh3sji,2024-03-25 20:10:23,/r/ETFs/comments/1bmqbxg/new_to_investing_is_v...,Maybe consider VOO and a mid/small cap value f...,1,ETFs,0,0.990078
1,AlgoTradingQuant,kwczgum,2024-03-25 00:51:21,/r/ETFs/comments/1bmoom7/diversifying_my_ira_f...,I’m retired and hold a 100% equities portfolio...,8,ETFs,0,0.997071
2,foldinthechhese,kwdbk25,2024-03-25 02:02:08,/r/ETFs/comments/1bmoom7/diversifying_my_ira_f...,The more experienced investors recommend a ble...,5,ETFs,0,0.860153
3,SirChetManly,kwd6nto,2024-03-25 01:33:43,/r/ETFs/comments/1bmoom7/diversifying_my_ira_f...,It isn't *risky* by any stretch. You're exclud...,2,ETFs,1,0.993927
4,ZAROV8862,kwei3zo,2024-03-25 06:17:54,/r/ETFs/comments/1bmoom7/diversifying_my_ira_f...,Enough said :)),2,ETFs,1,0.847568


In [90]:
result = cmts_voo[['created_utc', 'body', 'sentiment_label', 'sentiment_score']]

result.head()

Unnamed: 0,created_utc,body,sentiment_label,sentiment_score
0,2024-03-25 20:10:23,Maybe consider VOO and a mid/small cap value f...,0,0.990078
1,2024-03-25 00:51:21,I’m retired and hold a 100% equities portfolio...,0,0.997071
2,2024-03-25 02:02:08,The more experienced investors recommend a ble...,0,0.860153
3,2024-03-25 01:33:43,It isn't *risky* by any stretch. You're exclud...,1,0.993927
4,2024-03-25 06:17:54,Enough said :)),1,0.847568
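The ```sentiment_label``` column stores class ids, and ```sentiment_score``` is the probability of whichever class was predicted, so scores of rows with different labels are not directly comparable. The sketch below shows one way to make the columns easier to read, using the five rows above as plain Python data. The names ```ID2LABEL``` and ```positive_probability``` are hypothetical, and the id-to-name mapping is assumed from this checkpoint's configuration.

```python
from collections import Counter

# Assumed mapping for this checkpoint (see pt_model.config.id2label)
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

# (sentiment_label, sentiment_score) for the five rows shown above
rows = [(0, 0.990078), (0, 0.997071), (0, 0.860153), (1, 0.993927), (1, 0.847568)]

def positive_probability(label, score):
    """Probability of the POSITIVE class, whichever class was predicted."""
    # With two classes, P(positive) = 1 - P(negative)
    return score if label == 1 else 1.0 - score

# Readable label names and a simple count per class
names = [ID2LABEL[label] for label, _ in rows]
counts = Counter(names)
print(counts["NEGATIVE"], counts["POSITIVE"])  # 3 2

# A single comparable number per comment: close to 0 is negative, close to 1 is positive
p_pos = [positive_probability(label, score) for label, score in rows]
print([round(p, 3) for p in p_pos])
```

Note that the model was fine-tuned on movie-review sentences, so labels on neutral investing advice (such as the first row) should be read with caution.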
