# **Sentiment Analysis Using BERT & VADER**

---



In this notebook, we explore sentiment analysis with a special focus on the 'Bitcointalk' dataset. Our objective is to extract sentiment scores utilising two powerful techniques:
* BERT (Bidirectional Encoder Representations from Transformers), a transformer-based language model. This notebook uses the FinBERT checkpoint 'ProsusAI/finbert', which is fine-tuned for financial sentiment classification.
* VADER (Valence Aware Dictionary and sEntiment Reasoner), a lexicon- and rule-based sentiment analysis tool tuned for social media text.

# **Install Required Libraries**
---
First, we install the required libraries.

In [None]:
# Install transformers library from Hugging Face
!pip install transformers

# Install the vaderSentiment library, for extracting scores using VADER
!pip install vaderSentiment

# Install tqdm for displaying progress bars in loops and during data download/upload
!pip install tqdm


# **Import Required Libraries**

---



In [None]:
# Importing the pandas library for data manipulation and analysis
import pandas as pd

# Importing the NumPy library
import numpy as np

# Importing the json library for JSON file operations
import json

# Importing the torch library, the main PyTorch module
import torch

# Importing the AutoTokenizer and AutoModelForSequenceClassification modules from transformers
# for tokenization and sequence classification tasks
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Importing the SentimentIntensityAnalyzer module from vaderSentiment for sentiment analysis tasks
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Importing the tqdm library to display progress bars in loops
from tqdm import tqdm


In [None]:
# Mount (connect to) Google drive to be able to read from it
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# Define the file path
data_file_path = '/content/drive/My Drive/MyData/bitcointalk.json'

# Read the dataset
with open(data_file_path, 'r') as file:
    dataset_bitcointalk = json.load(file)

# **EDA & Data Cleaning**

---



First, we look at the first five records of the dataset.

In [None]:
print(dataset_bitcointalk[:5])

Based on this overview, the dataset columns are:

* thread_id: Unique identifier for discussion threads.
* date: Timestamp of the post, in Unix format.
* text: Content of the post.
* post_id: Unique identifier for each post.

In [None]:
print(type(dataset_bitcointalk))

The dataset is a Python list of records. We convert it to a pandas DataFrame so that the cleaning and scoring steps below can operate on whole columns.

## **Select Relevant Columns & Convert to Pandas Dataframe**

---


For the objectives of our analysis, only the 'date' and 'text' columns are essential. Therefore, we will retain just these two columns.

In [None]:
# Extracting only the 'date' and 'text' columns
df_bitcointalk = pd.DataFrame(dataset_bitcointalk)[['date', 'text']]

In [None]:
# Display the first 5 rows of the data
print(df_bitcointalk[:5])

# **Random Sampling**

---


Given our computational constraints, we work with a random 5% sample of the data rather than the full dataset.

In [None]:
# Calculate the number of elements to sample (5% of the dataset)
sample_size = int(len(df_bitcointalk) * 0.05)

# Perform random sampling
df_text = df_bitcointalk.sample(n=sample_size)
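
The sampling cell above draws different rows on every run. A minimal sketch of a reproducible variant follows, assuming a fixed seed is acceptable for this analysis. The `toy_df` frame and the seed value 42 are illustrative and not part of the original notebook; `random_state` is a standard parameter of `DataFrame.sample`.

```python
import pandas as pd

# Toy stand-in for df_bitcointalk (hypothetical data, for illustration only)
toy_df = pd.DataFrame({
    "date": range(100),
    "text": [f"post {i}" for i in range(100)],
})

# Same sizing rule as the notebook: 5% of the rows
sample_size = int(len(toy_df) * 0.05)

# random_state fixes the seed, so repeated runs select the same rows
sample_a = toy_df.sample(n=sample_size, random_state=42)
sample_b = toy_df.sample(n=sample_size, random_state=42)

print(sample_a.index.equals(sample_b.index))
```

With a fixed seed, the downstream sentiment scores can be compared across runs, because they are computed on the same posts.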


In [None]:
# Display the first 5 rows of the df_text dataframe for a quick overview
df_text.head()

# **Handle URLs**

---


In this step, we will determine whether there are any URLs present. If URLs are found, we will proceed to remove them; otherwise, we will move on to the next steps.

## Check URLs Presence

In [None]:
# Check for URLs
url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
contains_urls = df_text['text'].str.contains(url_pattern, regex=True)

# Check if any row contains a URL and print the corresponding message
if contains_urls.any():
    print("There is a URL in the text.")
else:
    print("There are no URLs in the text.")

The check shows that the dataset contains URLs, so we remove them using the "str.replace()" method.





## Remove URLs

---



In [None]:
# Remove URLs
df_text['text'] = df_text['text'].str.replace(url_pattern, '', regex=True)
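
To make the effect of the removal concrete, here is a small sketch that applies the same pattern to two made-up posts. The example strings and the final whitespace clean-up are assumptions added for illustration and are not part of the original notebook.

```python
import pandas as pd

# Same pattern as in the notebook
url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# Hypothetical posts, for illustration only
toy_posts = pd.Series([
    "Check https://bitcointalk.org/index.php?topic=1 now",
    "no links here",
])

# Flag rows that contain a URL, then strip the URL out
has_url = toy_posts.str.contains(url_pattern, regex=True)
stripped = toy_posts.str.replace(url_pattern, "", regex=True)

# Optional extra step (not in the notebook): collapse the gap the URL leaves behind
tidy = stripped.str.replace(r"\s+", " ", regex=True).str.strip()

print(has_url.tolist())
print(tidy.tolist())
```

The match stops at the first whitespace character, so the words around a URL are preserved and only a double space remains where the URL was.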

## Verify After Removing URLs

In [None]:
# Check for URLs
url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
contains_urls = df_text['text'].str.contains(url_pattern, regex=True)

# Check if any row contains a URL and print the corresponding message
if contains_urls.any():
    print("There is a URL in the text.")
else:
    print("There are no URLs in the text.")

# **Handle Numbers**

---



## Check the Presence of the Numbers

In [None]:
# Check for Numbers
number_pattern = r'\d+'  # This pattern will match one or more digits
contains_numbers = df_text['text'].str.contains(number_pattern, regex=True)

# Check if any row contains a number and print the corresponding message
if contains_numbers.any():
    print("There are numbers in the text.")
else:
    print("There are no numbers in the text.")

## Remove Numbers

In [None]:
# Remove numbers from the 'text' column
df_text['text'] = df_text['text'].str.replace(r'\d+', '', regex=True)

## Verify After Removing Numbers

In [None]:
# Check for Numbers
number_pattern = r'\d+'  # This pattern will match one or more digits
contains_numbers = df_text['text'].str.contains(number_pattern, regex=True)

# Check if any row contains a number and print the corresponding message
if contains_numbers.any():
    print("There are numbers in the text.")
else:
    print("There are no numbers in the text.")

We have successfully removed all numbers from our dataset.

# **Handle Mentions**

---



## Check Mentions Presence


In [None]:
# Define pattern for mentions
mention_pattern = r'@\w+'
contains_mentions = df_text['text'].str.contains(mention_pattern, regex=True)

# Check if any row contains a mention and print the corresponding message
if contains_mentions.any():
    print("There are mentions in the text.")
else:
    print("There are no mentions in the text.")

## Remove Mentions

In [None]:
# Remove mentions from the 'text' column
df_text['text'] = df_text['text'].str.replace(mention_pattern, '', regex=True)

## Verify After Removing Mentions


In [None]:
contains_mentions_after = df_text['text'].str.contains(mention_pattern, regex=True)
if contains_mentions_after.any():
    print("There are mentions in the text after removal.")
else:
    print("There are no mentions in the text after removal.")
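
The three cleaning steps (URLs, numbers, mentions) could also be bundled into one function. The sketch below is one possible arrangement, not the notebook's own method: `clean_text` is a hypothetical helper, and the trailing whitespace normalisation is an added assumption.

```python
import re

# Patterns copied from the notebook cells above
URL_RE = re.compile(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+')
NUMBER_RE = re.compile(r'\d+')
MENTION_RE = re.compile(r'@\w+')
SPACE_RE = re.compile(r'\s+')  # assumption: extra step to tidy leftover gaps


def clean_text(text: str) -> str:
    """Hypothetical helper: remove URLs, numbers and mentions, in the notebook's order."""
    text = URL_RE.sub('', text)      # URLs first, while their digits are still intact
    text = NUMBER_RE.sub('', text)
    text = MENTION_RE.sub('', text)
    return SPACE_RE.sub(' ', text).strip()


print(clean_text("@alice see https://example.com/a1 price 100 usd"))
```

Assuming the column holds only strings, it could be applied with `df_text['text'].apply(clean_text)`.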

# **Transform Epoch to Date Format**

---



In [None]:
# Convert the epoch timestamp (assumed here to be in milliseconds) to a datetime
df_text['date'] = pd.to_datetime(df_text['date'], unit='ms')

print(df_text[:5])
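
The notebook does not state whether the 'date' values are in seconds or milliseconds, and the two units give very different dates. The sketch below uses a made-up timestamp to show the difference. If the converted dates cluster in January 1970, the data is probably in seconds and `unit='s'` would be the better choice.

```python
import pandas as pd

# Hypothetical epoch value, for illustration only
epoch_value = 1_500_000_000

# The same integer interpreted as seconds and as milliseconds
as_seconds = pd.to_datetime(epoch_value, unit="s")
as_millis = pd.to_datetime(epoch_value, unit="ms")

print(as_seconds)  # a date in 2017
print(as_millis)   # a date in January 1970
```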

Now the dataset is ready to extract sentiment scores.

# **Extract Sentiment Scores Using VADER**

---




In [None]:
# Initialize the VADER sentiment intensity analyzer
analyzer = SentimentIntensityAnalyzer()

# Apply the analyzer to each post in the 'text' column and keep the compound score
df_text['VADER_scores'] = df_text['text'].apply(lambda x: analyzer.polarity_scores(x))
df_text['compound'] = df_text['VADER_scores'].apply(lambda d: d['compound'])

In [None]:
# Print the first few rows with the VADER scores
print(df_text.head())

# **Categorizing Sentiments (VADER)**

---


In this section, we use a function to classify each post's sentiment based on the compound score returned by VADER.

In this approach:
* A sentiment is considered positive if its compound score is greater than 0.05.
* It's considered negative if the score is less than -0.05.
* All other scores fall into the neutral category.

In [None]:
def categorize_sentiment(compound_score):
    if compound_score > 0.05:
        return 1
    elif compound_score < -0.05:
        return -1
    else:
        return 0

df_text['VADER_Sentiment_Scores'] = df_text['compound'].apply(categorize_sentiment)
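
For larger samples, the same thresholds can be applied to the whole column at once instead of row by row. This is a sketch of an equivalent vectorised form using `np.select`. The toy scores are made up, and the boundary values 0.05 and -0.05 map to neutral, exactly as in the function above.

```python
import numpy as np
import pandas as pd

# Hypothetical compound scores, including the two boundary values
compound = pd.Series([0.6, -0.3, 0.05, 0.0, -0.05])

# Conditions are checked in order; rows matching none of them get the default
conditions = [compound > 0.05, compound < -0.05]
labels = np.select(conditions, [1, -1], default=0)

print(labels.tolist())
```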

In [None]:
# Have an overview of the first few rows of the data
df_text.head(5)

# **Sentiment Analysis Using BERT**

---



In [None]:
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

In [None]:
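from torch.nn.functional import softmax

def sentim_analyzer(df, tokenizer, model):
    # Fail early with a clear error if the expected column is absent
    if 'text' not in df.columns:
        raise KeyError("'text' column is missing from the dataframe")

    model.eval()  # Inference mode: disables dropout

    for i in tqdm(df.index):
        text_content = df.loc[i, 'text']

        # Pre-process input (posts longer than the model's maximum length are truncated)
        inputs = tokenizer(text_content, padding=True, truncation=True, return_tensors='pt')

        # Estimate output without tracking gradients, which saves memory and time
        with torch.no_grad():
            output = model(**inputs)

        # Pass model output logits through a softmax layer.
        predictions = softmax(output.logits, dim=-1)

        # Label order is assumed to be positive, negative, neutral;
        # confirm it against model.config.id2label
        df.loc[i, 'Positive'] = predictions[0][0].item()
        df.loc[i, 'Negative'] = predictions[0][1].item()
        df.loc[i, 'Neutral']  = predictions[0][2].item()

    return df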

# Use the modified function:
df_text = sentim_analyzer(df_text, tokenizer, model)
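
The softmax step in the function above turns the model's three raw logits into probabilities that sum to 1. The sketch below reproduces that step in plain NumPy, without loading the model. The logit values are made up, and the label order (positive, negative, neutral) is the one the notebook assumes. It is worth confirming that order against `model.config.id2label` for the checkpoint in use.

```python
import numpy as np


def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Hypothetical logits for one post, shape (batch=1, classes=3)
logits = np.array([[2.0, 0.5, -1.0]])

# Assumed label order, matching the column assignments in the notebook
labels = ["Positive", "Negative", "Neutral"]

probs = softmax(logits)
scores = dict(zip(labels, probs[0].tolist()))

print(scores)
```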


In [None]:
df_text.head(5)

# **Calculate Compound BERT**

---



In [None]:
# Compute the intermediate compound score
df_text['BERT_Compound_intermediate'] = df_text['Positive'] - df_text['Negative']

# Squash the score with tanh. The difference already lies in [-1, 1],
# so the result lies in roughly [-0.76, 0.76]
df_text['BERT_Compound'] = np.tanh(df_text['BERT_Compound_intermediate'])
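
Because 'Positive' and 'Negative' are probabilities, their difference is already bounded by -1 and 1, and tanh compresses that range further. The sketch below evaluates tanh at a few illustrative differences to show the resulting range and its interaction with the 0.05 threshold used in the next section.

```python
import numpy as np

# Illustrative values of Positive - Negative, including both extremes
diff = np.array([-1.0, -0.05, 0.0, 0.05, 1.0])

compound = np.tanh(diff)

print(compound.round(4).tolist())
print(float(np.arctanh(0.05)))  # smallest difference whose tanh reaches 0.05
```

The largest attainable score is tanh(1), about 0.76, and a difference of exactly 0.05 maps to slightly less than 0.05, so it is labelled neutral by a 0.05 cut-off.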

In [None]:
df_text.head(5)

# **Categorizing Sentiments (BERT)**

---



In [None]:
def categorize_sentiment(compound_score):
    if compound_score > 0.05:
        return 1
    elif compound_score < -0.05:
        return -1
    else:
        return 0

df_text['BERT_Sentiment_Scores'] = df_text['BERT_Compound'].apply(categorize_sentiment)

In [None]:
df_text.tail(5)

# **Drop Irrelevant Columns**

---

To drop the columns that are no longer needed from the 'df_text' DataFrame, we use the pandas 'drop' method.

In [None]:
columns_to_drop = ["text", "VADER_scores", "Positive", "Negative", "Neutral", "BERT_Compound_intermediate"]
df_text = df_text.drop(columns=columns_to_drop)

In [None]:
df_text.head(5)

# **Save the Dataset**

---


In this section, we'll save our dataset, which now includes sentiment scores derived from both VADER and BERT, back to Google Drive. This enriched dataset will later serve as a foundation for building deep learning models that leverage these sentiment scores.


In [None]:
path = "/content/drive/My Drive/MyData/Sentiment_Scores_Dataset.csv"
df_text.to_csv(path, index=False)
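
When the saved file is loaded again later, the 'date' column comes back as plain text unless it is parsed explicitly. The sketch below shows a round trip on a toy frame with the same column names as the final dataset. The values and the temporary directory are illustrative assumptions, and `parse_dates` is a standard `pd.read_csv` parameter.

```python
import os
import tempfile

import pandas as pd

# Toy frame with the final column layout (hypothetical values)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2017-07-14 02:40:00", "2018-01-01 00:00:00"]),
    "compound": [0.6, -0.3],
    "VADER_Sentiment_Scores": [1, -1],
    "BERT_Compound": [0.4, -0.2],
    "BERT_Sentiment_Scores": [1, -1],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "Sentiment_Scores_Dataset.csv")
    toy.to_csv(path, index=False)

    # parse_dates restores the datetime dtype that CSV cannot store
    restored = pd.read_csv(path, parse_dates=["date"])

print(restored.dtypes)
```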
