# **Sentiment Analysis Using BERT & VADER**

---



In this notebook, we explore sentiment analysis on the 'Bitcointalk' dataset. Our objective is to extract sentiment scores using two techniques:
* BERT (Bidirectional Encoder Representations from Transformers), a transformer-based language model; here we use the FinBERT variant.
* VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon-based sentiment analysis tool designed for social media text.

# **Install Required Libraries**

---
First, we install the required libraries.

In [1]:
# Install transformers library from Hugging Face
!pip install transformers

# Install the vaderSentiment library, for extracting scores using VADER
!pip install vaderSentiment

# Install tqdm for displaying progress bars in loops and during data download/upload
!pip install tqdm


Collecting transformers
  Downloading transformers-4.33.1-py3-none-any.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.17.1-py3-none-any.whl (294 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

# **Import Required Libraries**

---



In [2]:
# Importing the pandas library for data manipulation and analysis
import pandas as pd

# Importing the NumPy library
import numpy as np

# Importing the json library for JSON file operations
import json

# Importing the torch library, the main PyTorch module
import torch

# Importing the AutoTokenizer and AutoModelForSequenceClassification modules from transformers
# for tokenization and sequence classification tasks
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Importing the SentimentIntensityAnalyzer module from vaderSentiment for sentiment analysis tasks
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Importing the tqdm library to display progress bars in loops
from tqdm import tqdm


# **Mount to Google Drive**

---
Then, we mount Google Drive to read our dataset.

In [3]:
# Mount (connect to) Google drive to be able to read from it
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# **Read the Data from Google Drive**

---



In [4]:
# Define the file path
data_file_path = '/content/drive/My Drive/MyData/bitcointalk.json'

# Read the dataset
with open(data_file_path, 'r') as file:
    dataset_bitcointalk = json.load(file)

# **EDA & Data Cleaning**

---



First, we get an overview of the first five records of the dataset.

In [5]:
print(dataset_bitcointalk[:5])

[{'thread_id': 0, 'date': 1498867765000, 'text': 'if you wanna have a better security, i would recommend to have a multi signature wallet (the address with prefix of 3 instead of 1). in case of possibility of being hacked, they need certainly large amount of computing power which consumes all the possible energy in the Earth as there are more than googol (IIRC) of possible combinations and not all the private key is valid for bitcoin address.', 'post_id': 0}, {'thread_id': 0, 'date': 1498868180000, 'text': "Quote from: ayurvedicurea2growtaller on July 01, 2017, 12:04:49 AMcan they hack into our wallet. There are many professional hackers out there. What are the chances? Also what is the safest way to safeguard ourself?It is possible, 100%. That is if you don't have sufficient security measures. If you adopt good security measures, the chances are next to zero. Bitcoin don't have any vulnerability that allows anyone to guess your private key with a fair amount of computing power.If you 

Based on this overview, each record has the following fields:

* thread_id: Unique identifier for discussion threads.
* date: Timestamp of the post, in Unix epoch format (milliseconds).
* text: Content of the post.
* post_id: Unique identifier for each post.

In [6]:
print(type(dataset_bitcointalk))

<class 'list'>


Upon examination, we observe that the dataset is a Python list of dictionaries. We convert it into a pandas DataFrame to make the later steps easier.

## **Select Relevant Columns & Convert to Pandas Dataframe**

---


For our analysis, only the 'date' and 'text' columns are needed. We therefore convert the dataset into a pandas DataFrame and retain just these two columns.

In [7]:
# Extracting only the 'date' and 'text' columns
df_bitcointalk = pd.DataFrame(dataset_bitcointalk)[['date', 'text']]

In [8]:
# Display the first 5 rows of the data
print(df_bitcointalk[:5])

            date                                               text
0  1498867765000  if you wanna have a better security, i would r...
1  1498868180000  Quote from: ayurvedicurea2growtaller on July 0...
2  1498869531000  Quote from: lottery248 on July 01, 2017, 12:09...
3  1498869635000  Quote from: lottery248 on July 01, 2017, 12:09...
4  1498872766000  No, thats the good thing about Bitcoin, is tha...


# **Random Sampling**

---


Given our computational constraints, we focus on a random subset of the data, drawn by simple random sampling.

In [9]:
# Calculate the number of elements to sample (5% of the dataset)
sample_size = int(len(df_bitcointalk) * 0.05)

# Perform random sampling
df_text = df_bitcointalk.sample(n=sample_size)
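
The sample drawn above changes on every run because no seed is fixed. If reproducibility matters, one option is to pass `random_state` to `sample`. The following is a minimal sketch under that assumption; the toy frame `df_demo` and the seed value 42 are illustrative and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for df_bitcointalk (hypothetical data)
df_demo = pd.DataFrame({
    'date': range(100),
    'text': [f'post {i}' for i in range(100)],
})

# Same 5% rule as in the notebook
sample_size = int(len(df_demo) * 0.05)

# A fixed random_state makes the draw repeatable across runs
sample_a = df_demo.sample(n=sample_size, random_state=42)
sample_b = df_demo.sample(n=sample_size, random_state=42)

print(len(sample_a))                          # 5 rows out of 100
print(sample_a.index.equals(sample_b.index))  # True
```

Without a fixed seed, the row indices shown in later outputs would differ between runs.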


In [10]:
# Display the first 5 rows of the df_text dataframe for a quick overview
df_text.head()


Unnamed: 0,date,text
824679,1522832541000,"Quote from: RedR00t on April 04, 2018, 08:24:4..."
157694,1516259913000,Bitcoin transactions are anonymous and there i...
806383,1528041768000,In my own research i can told that the good up...
86399,1491385438000,"Quote from: Wong Goblog on April 05, 2017, 09:..."
1156118,1522592032000,"not both, virtual currency has no central cont..."


# **Handle URLs**

---


In this step, we determine whether there are any URLs present. If URLs are found, we will proceed to remove them; otherwise, we will move on to the next steps.

## Check URLs Presence

In [11]:
# Check for URLs
url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
contains_urls = df_text['text'].str.contains(url_pattern, regex=True)

# Check if any row contains a URL and print the corresponding message
if contains_urls.any():
    print("There is a URL in the text.")
else:
    print("There are no URLs in the text.")


There is a URL in the text.


As the check shows, the dataset contains URLs, so we remove them using the "str.replace()" method.





## Remove URLs

---



In [12]:
# Remove URLs
df_text['text'] = df_text['text'].str.replace(url_pattern, '', regex=True)


## Verify After Removing URLs

In [13]:
# Check for URLs
url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
contains_urls = df_text['text'].str.contains(url_pattern, regex=True)

# Check if any row contains a URL and print the corresponding message
if contains_urls.any():
    print("There is a URL in the text.")
else:
    print("There are no URLs in the text.")

There are no URLs in the text.


After removing the URLs, we verify that none remain in the text.

# **Handle Numbers**

---



## Check the Presence of the Numbers

In [14]:
# Check for Numbers
number_pattern = r'\d+'  # This pattern will match one or more digits
contains_numbers = df_text['text'].str.contains(number_pattern, regex=True)

# Check if any row contains a number and print the corresponding message
if contains_numbers.any():
    print("There are numbers in the text.")
else:
    print("There are no numbers in the text.")


There are numbers in the text.


Because there are numbers in the dataset, we remove them.

## Remove Numbers

In [15]:
# Remove numbers from the 'text' column
df_text['text'] = df_text['text'].str.replace(r'\d+', '', regex=True)

## Verify After Removing Numbers

In [16]:
# Check for Numbers
number_pattern = r'\d+'  # This pattern will match one or more digits
contains_numbers = df_text['text'].str.contains(number_pattern, regex=True)

# Check if any row contains a number and print the corresponding message
if contains_numbers.any():
    print("There are numbers in the text.")
else:
    print("There are no numbers in the text.")

There are no numbers in the text.


We have successfully removed all numbers from our dataset.

# **Handle Mentions**

---

Next, we detect and remove user mentions (tokens of the form '@username').

## Check Mentions Presence


In [17]:
# Define pattern for mentions
mention_pattern = r'@\w+'
contains_mentions = df_text['text'].str.contains(mention_pattern, regex=True)

# Check if any row contains a mention and print the corresponding message
if contains_mentions.any():
    print("There are mentions in the text.")
else:
    print("There are no mentions in the text.")

There are mentions in the text.


## Remove Mentions

In [18]:
# Remove mentions from the 'text' column
df_text['text'] = df_text['text'].str.replace(mention_pattern, '', regex=True)

## Verify After Removing Mentions


In [19]:
contains_mentions_after = df_text['text'].str.contains(mention_pattern, regex=True)
if contains_mentions_after.any():
    print("There are mentions in the text after removal.")
else:
    print("There are no mentions in the text after removal.")

There are no mentions in the text after removal.


We verify that all mentions were successfully deleted.
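
The three cleaning steps (URLs, numbers, mentions) can also be collected into a single helper. The sketch below reuses the patterns from the cells above and applies them in the same order; the function name `clean_text`, the example sentence, and the final whitespace-collapsing step are assumptions added for illustration and are not part of the original notebook.

```python
import re

# Patterns copied from the cells above
URL_PATTERN = re.compile(
    r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
)
NUMBER_PATTERN = re.compile(r'\d+')
MENTION_PATTERN = re.compile(r'@\w+')

def clean_text(text):
    """Remove URLs, numbers, and mentions, in the same order as the notebook."""
    text = URL_PATTERN.sub('', text)
    text = NUMBER_PATTERN.sub('', text)
    text = MENTION_PATTERN.sub('', text)
    # Extra step (not in the notebook): collapse the gaps left by the removals
    return re.sub(r'\s+', ' ', text).strip()

example = 'Check https://example.com/page?id=42 now, @satoshi sent 10 BTC'
cleaned = clean_text(example)
print(cleaned)  # Check now, sent BTC
```

On a DataFrame, the same helper could be applied with `df_text['text'].apply(clean_text)`. Note that removing all digits also strips dates and amounts, which is why quoted-post headers later appear as "April , , :: AM".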

# **Transform Epoch to Date Format**

---

In this step, we convert the 'date' column from epoch timestamps in milliseconds to datetime values.


In [20]:
# Convert epoch timestamps (milliseconds) to datetime
df_text['date'] = pd.to_datetime(df_text['date'], unit='ms')

print(df_text[:5])

                       date                                               text
824679  2018-04-04 09:02:21  Quote from: RedRt on April , , :: AMHello frie...
157694  2018-01-18 07:18:33  Bitcoin transactions are anonymous and there i...
806383  2018-06-03 16:02:48  In my own research i can told that the good up...
86399   2017-04-05 09:43:58  Quote from: Wong Goblog on April , , :: AMthe ...
1156118 2018-04-01 14:13:52  not both, virtual currency has no central cont...
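
As a sanity check on the conversion, the same arithmetic can be reproduced with the standard library. `pd.to_datetime(..., unit='ms')` treats each value as milliseconds since 1970-01-01 00:00:00 UTC and returns timezone-naive timestamps that correspond to UTC. The sketch below uses the 'date' value of the first record printed in the dataset overview; the variable names are illustrative.

```python
from datetime import datetime, timezone

# 'date' value of the first record shown in the dataset overview
epoch_ms = 1498867765000

# Divide by 1000 to go from milliseconds to seconds since the Unix epoch
as_utc = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
formatted = as_utc.strftime('%Y-%m-%d %H:%M:%S')

print(formatted)  # 2017-07-01 00:09:25
```

This agrees with the forum text itself: the replies to that first post quote it as written on July 01, 2017 at 12:09 AM.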


Now the dataset is ready for sentiment scoring.

# **Extract Sentiment Scores Using VADER**

---
In this phase, we use VADER to extract sentiment scores from the data.

In [21]:
# Initialize the VADER sentiment intensity analyzer
analyzer = SentimentIntensityAnalyzer()

# Apply the analyzer to each post in the 'text' column and keep the compound score
df_text['VADER_scores'] = df_text['text'].apply(lambda x: analyzer.polarity_scores(x))
df_text['compound'] = df_text['VADER_scores'].apply(lambda d: d['compound'])

In [22]:
# Print the first few rows with the VADER scores
print(df_text.head())

                       date  \
824679  2018-04-04 09:02:21   
157694  2018-01-18 07:18:33   
806383  2018-06-03 16:02:48   
86399   2017-04-05 09:43:58   
1156118 2018-04-01 14:13:52   

                                                      text  \
824679   Quote from: RedRt on April , , :: AMHello frie...   
157694   Bitcoin transactions are anonymous and there i...   
806383   In my own research i can told that the good up...   
86399    Quote from: Wong Goblog on April , , :: AMthe ...   
1156118  not both, virtual currency has no central cont...   

                                              VADER_scores  compound  
824679   {'neg': 0.0, 'neu': 0.869, 'pos': 0.131, 'comp...    0.7184  
157694   {'neg': 0.17, 'neu': 0.79, 'pos': 0.04, 'compo...   -0.6619  
806383   {'neg': 0.0, 'neu': 0.818, 'pos': 0.182, 'comp...    0.4404  
86399    {'neg': 0.019, 'neu': 0.805, 'pos': 0.176, 'co...    0.9192  
1156118  {'neg': 0.202, 'neu': 0.728, 'pos': 0.07, 'com...   -0.5200  


# **Categorizing Sentiments (VADER)**

---


In this section, we use a function to classify sentiment based on the compound score obtained from VADER.

In this approach:
* A sentiment is considered positive if its compound score is greater than 0.05.
* It's considered negative if the score is less than -0.05.
* All other scores fall into the neutral category.

In [23]:
def categorize_sentiment(compound_score):
    if compound_score > 0.05:
        return 1
    elif compound_score < -0.05:
        return -1
    else:
        return 0

df_text['VADER_Sentiment_Scores'] = df_text['compound'].apply(categorize_sentiment)
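
Because both comparisons are strict, a score that lands exactly on a threshold is labelled neutral. The sketch below illustrates this boundary behaviour; `categorize_with_threshold` and its `threshold` parameter are hypothetical names introduced here so that the cut-off can be varied, and the scores are a mix of values from the output above and boundary cases.

```python
def categorize_with_threshold(compound_score, threshold=0.05):
    """Map a compound score to 1 (positive), -1 (negative), or 0 (neutral)."""
    if compound_score > threshold:
        return 1
    elif compound_score < -threshold:
        return -1
    return 0

scores = [0.7184, 0.05, 0.0, -0.05, -0.6619]
labels = [categorize_with_threshold(s) for s in scores]

print(labels)  # [1, 0, 0, 0, -1]
```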

In [24]:
# Have an overview of the first few rows of the data
df_text.head(5)

Unnamed: 0,date,text,VADER_scores,compound,VADER_Sentiment_Scores
824679,2018-04-04 09:02:21,"Quote from: RedRt on April , , :: AMHello frie...","{'neg': 0.0, 'neu': 0.869, 'pos': 0.131, 'comp...",0.7184,1
157694,2018-01-18 07:18:33,Bitcoin transactions are anonymous and there i...,"{'neg': 0.17, 'neu': 0.79, 'pos': 0.04, 'compo...",-0.6619,-1
806383,2018-06-03 16:02:48,In my own research i can told that the good up...,"{'neg': 0.0, 'neu': 0.818, 'pos': 0.182, 'comp...",0.4404,1
86399,2017-04-05 09:43:58,"Quote from: Wong Goblog on April , , :: AMthe ...","{'neg': 0.019, 'neu': 0.805, 'pos': 0.176, 'co...",0.9192,1
1156118,2018-04-01 14:13:52,"not both, virtual currency has no central cont...","{'neg': 0.202, 'neu': 0.728, 'pos': 0.07, 'com...",-0.52,-1


# **Sentiment Analysis Using FinBERT**

---

Next, we use FinBERT to extract sentiment scores.

In [25]:
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

Downloading (…)okenizer_config.json:   0%|          | 0.00/252 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/758 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

In [26]:
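from torch.nn.functional import softmax

def sentim_analyzer(df, tokenizer, model):
    # Fail early if the expected column is missing
    if 'text' not in df.columns:
        raise KeyError("'text' column is missing from the dataframe")

    # Inference only: make sure dropout is disabled
    model.eval()

    for i in tqdm(df.index):
        text_content = df.loc[i, 'text']

        # Tokenize the post; truncation shortens posts that exceed the model's input limit
        inputs = tokenizer(text_content, padding=True, truncation=True, return_tensors='pt')

        # Run the model without tracking gradients to save memory and time
        with torch.no_grad():
            output = model(**inputs)

        # Convert the logits to class probabilities with softmax.
        # The index order below is assumed to match the model's label mapping
        # (0: positive, 1: negative, 2: neutral); check model.config.id2label.
        predictions = softmax(output.logits, dim=-1)
        df.loc[i, 'Positive'] = predictions[0][0].item()
        df.loc[i, 'Negative'] = predictions[0][1].item()
        df.loc[i, 'Neutral']  = predictions[0][2].item()

    return df

# Score every post in the sample
df_text = sentim_analyzer(df_text, tokenizer, model)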


100%|██████████| 83242/83242 [1:16:10<00:00, 18.21it/s]
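
To make the softmax step concrete, here is a small standalone sketch of what it does to a single row of logits. The logits are made up for illustration and the helper `softmax_1d` is a hypothetical plain-Python stand-in for `torch.nn.functional.softmax`; the three values are unpacked in the same order the function above uses (positive, negative, neutral).

```python
import math

def softmax_1d(logits):
    """Numerically stable softmax for a flat list of logits."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits in the order: positive, negative, neutral
logits = [0.5, -1.0, 2.0]
positive, negative, neutral = softmax_1d(logits)

print(round(positive + negative + neutral, 6))  # 1.0
print(neutral > positive > negative)            # True
```

The three probabilities always sum to 1, which is why the 'Positive', 'Negative' and 'Neutral' columns in the output below add up to roughly 1 for each row.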


In [27]:
df_text.head(5)

Unnamed: 0,date,text,VADER_scores,compound,VADER_Sentiment_Scores,Positive,Negative,Neutral
824679,2018-04-04 09:02:21,"Quote from: RedRt on April , , :: AMHello frie...","{'neg': 0.0, 'neu': 0.869, 'pos': 0.131, 'comp...",0.7184,1,0.032704,0.021006,0.94629
157694,2018-01-18 07:18:33,Bitcoin transactions are anonymous and there i...,"{'neg': 0.17, 'neu': 0.79, 'pos': 0.04, 'compo...",-0.6619,-1,0.023187,0.05414,0.922673
806383,2018-06-03 16:02:48,In my own research i can told that the good up...,"{'neg': 0.0, 'neu': 0.818, 'pos': 0.182, 'comp...",0.4404,1,0.216311,0.012912,0.770778
86399,2017-04-05 09:43:58,"Quote from: Wong Goblog on April , , :: AMthe ...","{'neg': 0.019, 'neu': 0.805, 'pos': 0.176, 'co...",0.9192,1,0.055743,0.075726,0.868531
1156118,2018-04-01 14:13:52,"not both, virtual currency has no central cont...","{'neg': 0.202, 'neu': 0.728, 'pos': 0.07, 'com...",-0.52,-1,0.019723,0.039657,0.94062


# **Calculate Compound BERT**

---



In [28]:
# Compute the intermediate compound score
df_text['BERT_Compound_intermediate'] = df_text['Positive'] - df_text['Negative']

# Squash the score with tanh. Positive - Negative already lies in [-1, 1],
# so the result falls in roughly [-0.76, 0.76]
df_text['BERT_Compound'] = np.tanh(df_text['BERT_Compound_intermediate'])
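
A quick numeric check shows what tanh does here. Since 'Positive' and 'Negative' are probabilities, their difference is already bounded by -1 and 1; tanh leaves small differences almost unchanged and compresses the extremes. The sample differences below are illustrative values, not rows from the dataset.

```python
import math

# Differences of two probabilities always lie in [-1, 1]
diffs = (-1.0, -0.03, 0.0, 0.2, 1.0)
squashed = {d: math.tanh(d) for d in diffs}

for d, value in squashed.items():
    print(f'{d:+.2f} -> {value:+.4f}')
```

Near zero the output is practically equal to the input, so the ±0.05 thresholds used for categorisation behave almost identically on the raw difference and on the tanh output, while the largest attainable magnitude shrinks from 1 to about 0.76.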

In [29]:
df_text.head(5)

Unnamed: 0,date,text,VADER_scores,compound,VADER_Sentiment_Scores,Positive,Negative,Neutral,BERT_Compound_intermediate,BERT_Compound
824679,2018-04-04 09:02:21,"Quote from: RedRt on April , , :: AMHello frie...","{'neg': 0.0, 'neu': 0.869, 'pos': 0.131, 'comp...",0.7184,1,0.032704,0.021006,0.94629,0.011698,0.011698
157694,2018-01-18 07:18:33,Bitcoin transactions are anonymous and there i...,"{'neg': 0.17, 'neu': 0.79, 'pos': 0.04, 'compo...",-0.6619,-1,0.023187,0.05414,0.922673,-0.030953,-0.030943
806383,2018-06-03 16:02:48,In my own research i can told that the good up...,"{'neg': 0.0, 'neu': 0.818, 'pos': 0.182, 'comp...",0.4404,1,0.216311,0.012912,0.770778,0.203399,0.200639
86399,2017-04-05 09:43:58,"Quote from: Wong Goblog on April , , :: AMthe ...","{'neg': 0.019, 'neu': 0.805, 'pos': 0.176, 'co...",0.9192,1,0.055743,0.075726,0.868531,-0.019983,-0.019981
1156118,2018-04-01 14:13:52,"not both, virtual currency has no central cont...","{'neg': 0.202, 'neu': 0.728, 'pos': 0.07, 'com...",-0.52,-1,0.019723,0.039657,0.94062,-0.019934,-0.019931


# **Categorizing Sentiments (BERT)**

---



In [30]:
def categorize_sentiment(compound_score):
    if compound_score > 0.05:
        return 1
    elif compound_score < -0.05:
        return -1
    else:
        return 0

df_text['BERT_Sentiment_Scores'] = df_text['BERT_Compound'].apply(categorize_sentiment)

In [31]:
df_text.tail(5)

Unnamed: 0,date,text,VADER_scores,compound,VADER_Sentiment_Scores,Positive,Negative,Neutral,BERT_Compound_intermediate,BERT_Compound,BERT_Sentiment_Scores
131743,2017-01-29 15:25:13,I think op is actually asking everyone to sup...,"{'neg': 0.011, 'neu': 0.861, 'pos': 0.127, 'co...",0.9393,1,0.107192,0.013667,0.879141,0.093525,0.093253,1
1398056,2018-02-17 10:58:16,"THE BEST TO EARN BITCOIN IS BY MINING,TRADING ...","{'neg': 0.0, 'neu': 0.704, 'pos': 0.296, 'comp...",0.6369,1,0.049312,0.020168,0.93052,0.029143,0.029135,0
1547596,2017-12-09 19:26:07,"Quote from: trauchot on December , , :: PMSo i...","{'neg': 0.0, 'neu': 0.84, 'pos': 0.16, 'compou...",0.9618,1,0.201316,0.020563,0.77812,0.180753,0.17881,1
1446707,2018-01-25 15:54:51,there is a chinese proverb which I love most s...,"{'neg': 0.0, 'neu': 0.656, 'pos': 0.344, 'comp...",0.9274,1,0.050532,0.034573,0.914894,0.015959,0.015958,0
1616913,2017-08-30 15:35:16,"Quote from: on August , , :: PMEveryone of us...","{'neg': 0.0, 'neu': 0.917, 'pos': 0.083, 'comp...",0.7227,1,0.152722,0.012174,0.835104,0.140548,0.13963,1


# **Drop Irrelevant Columns**

---

To drop the columns we no longer need from the 'df_text' DataFrame, we use the pandas 'drop' method.

In [32]:
columns_to_drop = ["text", "VADER_scores", "Positive", "Negative", "Neutral", "BERT_Compound_intermediate"]
df_text = df_text.drop(columns=columns_to_drop)

In [33]:
df_text.head(5)

Unnamed: 0,date,compound,VADER_Sentiment_Scores,BERT_Compound,BERT_Sentiment_Scores
824679,2018-04-04 09:02:21,0.7184,1,0.011698,0
157694,2018-01-18 07:18:33,-0.6619,-1,-0.030943,0
806383,2018-06-03 16:02:48,0.4404,1,0.200639,1
86399,2017-04-05 09:43:58,0.9192,1,-0.019981,0
1156118,2018-04-01 14:13:52,-0.52,-1,-0.019931,0


# **Save the Dataset**

---


In this section, we'll save our dataset, which now includes sentiment scores derived from both VADER and FinBERT, back to Google Drive. This enriched dataset will later serve as a foundation for building deep learning models that leverage these sentiment scores.


In [34]:
path = "/content/drive/My Drive/MyData/Sentiment_Scores_Dataset.csv"
df_text.to_csv(path, index=False)
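
When the saved file is loaded again, note that CSV stores the 'date' column as plain text. The sketch below shows one way to restore the datetime type on reading, assuming `parse_dates` is acceptable for the downstream workflow; it writes to an in-memory buffer rather than Google Drive, and the toy frame `df_demo` only mimics the saved columns.

```python
import io
import pandas as pd

# Toy frame with the same column names as the saved dataset (illustrative values)
df_demo = pd.DataFrame({
    'date': pd.to_datetime([1522832541000, 1516259913000], unit='ms'),
    'compound': [0.7184, -0.6619],
    'VADER_Sentiment_Scores': [1, -1],
    'BERT_Compound': [0.011698, -0.030943],
    'BERT_Sentiment_Scores': [0, 0],
})

# Write to an in-memory buffer instead of a file on Google Drive
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
buffer.seek(0)

# parse_dates restores the datetime dtype, which CSV does not preserve
df_loaded = pd.read_csv(buffer, parse_dates=['date'])
print(df_loaded.dtypes)
```

Because `index=False` is used when saving, the original row labels (such as 824679) are not written to the file.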
