# Introduction to Sentiment Analysis for Cryptocurrency News with FinBERT

In this Jupyter notebook, titled `ToReproduceSentiment.ipynb`, we embark on the journey of retraining the FinBERT model, a specialized transformer model for financial sentiment analysis, using the `CryptoLin_IE_v2.csv` dataset. The objective is to apply the retrained model for sentiment analysis on a dataset of cryptocurrency-related news articles. This process aims to enrich the dataset with sentiment scores, categorizing each news piece into positive, negative, or neutral sentiment. The enriched dataset, named `GA_data_wsentiment.csv`, will serve as an invaluable asset for deeper insights into the cryptocurrency market's reaction to news events.

## Objective

The primary goal is to leverage the FinBERT model, fine-tune it with cryptocurrency-specific news sentiment, and apply it to a pre-filtered dataset (`GA_data_wrelevance.csv`) of cryptocurrency news. This retraining and application process ensures that the sentiment analysis is attuned to the nuances of cryptocurrency discourse, thereby providing a more accurate and relevant assessment of news sentiment.

## Methodology

The approach encompasses several critical steps:

1. **Data Preparation and Cleaning**: Utilize the `CryptoLin_IE_v2.csv` dataset for training, which involves pre-processing steps such as text cleaning, stopwords removal, and lemmatization to prepare the data for the model.

2. **Model Retraining**: Fine-tune the FinBERT model with the prepared dataset to adapt its capabilities to the cryptocurrency news domain. This step involves setting up a training pipeline, defining metrics for evaluation, and executing the training process.

3. **Sentiment Analysis Deployment**: Apply the retrained model to the `GA_data_wrelevance.csv` dataset to perform sentiment analysis. Each news article is scored and classified into one of three sentiment categories: positive, negative, or neutral.

4. **Dataset Enrichment**: Append the sentiment scores and classifications to the dataset, resulting in `GA_data_wsentiment.csv`. This enriched dataset includes additional columns for sentiment probabilities and the final sentiment class for each news article.

5. **Output Generation**: The final dataset with appended sentiment analysis results is saved and ready for subsequent analysis phases, aiming to uncover insights into how news sentiment influences cryptocurrency markets.

## Usage

This notebook is a critical component of a larger analytical framework aimed at understanding the impact of news sentiment on the cryptocurrency market. By integrating sentiment analysis, stakeholders can gain nuanced insights into market sentiment, aiding in decision-making processes and market analysis.

## Requirements

- Python 3.x
- Pandas for data manipulation
- NumPy for numerical operations
- Transformers and PyTorch for model retraining and sentiment analysis
- NLTK for natural language processing tasks
- Scikit-learn for additional machine learning utilities

## Conclusion

The sentiment analysis performed in this notebook enhances our understanding of the cryptocurrency market's reaction to news events. By retraining the FinBERT model on cryptocurrency-specific news and deploying it on a relevant news dataset, we have created a valuable resource (`GA_data_wsentiment.csv`) for analyzing the market sentiment and its potential impacts on cryptocurrency prices and trends.


In [2]:
## Imports

import warnings
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
import torch
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import EarlyStoppingCallback
import numpy as np
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
import string
from nltk.corpus import stopwords
from transformers import Trainer, TrainingArguments, AutoModelForSequenceClassification, AutoTokenizer, DataCollatorWithPadding
import re
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from math import exp

warnings.filterwarnings('ignore')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/arturoolivera/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/arturoolivera/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/arturoolivera/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

Functions

In [3]:
def apply_finbert(x):
    device = torch.device('mps')  # Define the MPS device.
    model.to(device)  # Move your model to MPS.
    
    # Ensure the input is also moved to the MPS device.
    inputs = tokenizer([x], padding=True, truncation=True, return_tensors='pt').to(device)
    
    # Execute model prediction.
    outputs = model(**inputs)
    
    # Apply softmax and move the outputs back to CPU for further processing if necessary.
    predictions = torch.nn.functional.softmax(outputs.logits, dim=1).cpu() 
    
    return predictions[:, 0].tolist()[0], predictions[:, 1].tolist()[0], predictions[:, 2].tolist()[0]

def calculate_result(df):
    largo = df.shape[0]
    result = []
    for i in range(0,largo):
        if df.iloc[i]['Positive']>0.5:
            result.append(1)
        elif df.iloc[i]['Negative']>0.5:
            result.append(-1)
        elif df.iloc[i]['Neutral']>0.5:
            result.append(0)
        elif df.iloc[i]['Neutral']>df.iloc[i]['Positive'] and df.iloc[i]['Neutral']>df.iloc[i]['Negative']:
            result.append(0)
        elif df.iloc[i]['Positive']>df.iloc[i]['Neutral'] and df.iloc[i]['Positive']>df.iloc[i]['Negative']:
            result.append(1)
        else:
            result.append(-1)

    df['result']=result
    return df

# Create torch dataset
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        if self.labels:
            item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])

# Define Trainer parameters
def compute_metrics(p):
    pred, labels = p
    pred = np.argmax(pred, axis=1)

    accuracy = accuracy_score(y_true=labels, y_pred=pred)
    #recall = recall_score(y_true=labels, y_pred=pred,'weighted')
    #precision = precision_score(y_true=labels, y_pred=pred)
    #f1 = f1_score(y_true=labels, y_pred=pred)
    
    return {"accuracy": accuracy}#, "precision": precision, "recall": recall, "f1": f1}

#Function to apply for each word the proper lemmatization.
def lemmetize_titles(words):
    a = []
    tokens = word_tokenize(words)
    for token in tokens:
        lemmetized_word = lemmatizer.lemmatize(token)
        a.append(lemmetized_word)
    lemmatized_title = ' '.join(a)
    return lemmatized_title

In [4]:
# Custom transformer for text cleaning
class TextCleaner(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        cleaned_data = X.astype(str).map(lambda x: x.lower())
        cleaned_data = cleaned_data.map(lambda x: re.sub('[^A-Za-z0-9]+', ' ', x))
        return cleaned_data

# Custom transformer for stop words removal
class StopWordsRemover(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.stop_words = set(stopwords.words('english'))

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        removed_stop_words = X.apply(lambda x: ' '.join([word for word in x.split() if word not in self.stop_words]))
        return removed_stop_words

# Custom transformer for lemmatization
class Lemmatizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.lemmatizer = WordNetLemmatizer()

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        lemmatized_data = X.apply(lambda x: ' '.join([self.lemmatizer.lemmatize(word) for word in word_tokenize(x)]))
        return lemmatized_data

### Importing the data of CryptoLin for finBERT model Training 

In [5]:
df = pd.read_csv("CryptoLin_IE_v2.txt")

In [6]:
#We ae going to use the reduce manual_labeled df for training our finBERT model:
input_df = df[['date','news','final_manual_labelling','text_span']]

In [7]:
input_df.head()

Unnamed: 0,date,news,final_manual_labelling,text_span
0,2022-01-25,"Ripple announces stock buyback, nabs $15 billi...",1,{annotator1_id:22;annotator1_label:1; annotato...
1,2022-01-25,IMF directors urge El Salvador to remove Bitco...,-1,{annotator1_id:16;annotator1_label:-1; annotat...
2,2022-01-25,Dragonfly Capital is raising $500 million for ...,1,{annotator1_id:45;annotator1_label:1; annotato...
3,2022-01-25,Rick and Morty co-creator collaborates with Pa...,0,{annotator1_id:32;annotator1_label:0; annotato...
4,2022-01-25,How fintech SPACs lost their shine,0,{annotator1_id:48;annotator1_label:0; annotato...


#### Original finBERT

In [8]:
tokenizer = AutoTokenizer.from_pretrained("ProsusAI/finbert")
model = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

In [9]:
input_df[['Positive','Negative','Neutral']] = (input_df['news'].apply(apply_finbert)).apply(pd.Series)

In [10]:
input_df.head()

Unnamed: 0,date,news,final_manual_labelling,text_span,Positive,Negative,Neutral
0,2022-01-25,"Ripple announces stock buyback, nabs $15 billi...",1,{annotator1_id:22;annotator1_label:1; annotato...,0.098288,0.020569,0.881142
1,2022-01-25,IMF directors urge El Salvador to remove Bitco...,-1,{annotator1_id:16;annotator1_label:-1; annotat...,0.047823,0.162971,0.789207
2,2022-01-25,Dragonfly Capital is raising $500 million for ...,1,{annotator1_id:45;annotator1_label:1; annotato...,0.156997,0.008097,0.834906
3,2022-01-25,Rick and Morty co-creator collaborates with Pa...,0,{annotator1_id:32;annotator1_label:0; annotato...,0.055608,0.015489,0.928903
4,2022-01-25,How fintech SPACs lost their shine,0,{annotator1_id:48;annotator1_label:0; annotato...,0.039964,0.472788,0.487248


In [11]:
non_trained_result = calculate_result(input_df)

In [12]:
non_trained_result.head()

Unnamed: 0,date,news,final_manual_labelling,text_span,Positive,Negative,Neutral,result
0,2022-01-25,"Ripple announces stock buyback, nabs $15 billi...",1,{annotator1_id:22;annotator1_label:1; annotato...,0.098288,0.020569,0.881142,0
1,2022-01-25,IMF directors urge El Salvador to remove Bitco...,-1,{annotator1_id:16;annotator1_label:-1; annotat...,0.047823,0.162971,0.789207,0
2,2022-01-25,Dragonfly Capital is raising $500 million for ...,1,{annotator1_id:45;annotator1_label:1; annotato...,0.156997,0.008097,0.834906,0
3,2022-01-25,Rick and Morty co-creator collaborates with Pa...,0,{annotator1_id:32;annotator1_label:0; annotato...,0.055608,0.015489,0.928903,0
4,2022-01-25,How fintech SPACs lost their shine,0,{annotator1_id:48;annotator1_label:0; annotato...,0.039964,0.472788,0.487248,0


In [13]:
#The accuracy with the original FinBERT:
accuracy_score(y_true=non_trained_result['final_manual_labelling'], y_pred=non_trained_result['result'])

0.5143496086470369

#### Retraining Finbert

In [18]:
df = input_df[['news','final_manual_labelling']]
df.columns = ['news','labels']

In [19]:
df.head()

Unnamed: 0,news,labels
0,"Ripple announces stock buyback, nabs $15 billi...",1
1,IMF directors urge El Salvador to remove Bitco...,-1
2,Dragonfly Capital is raising $500 million for ...,1
3,Rick and Morty co-creator collaborates with Pa...,0
4,How fintech SPACs lost their shine,0


In [20]:
#Giving a +1 offset to the target variable:
df['labels'].replace({1:2},inplace=True)
df['labels'].replace({0:1},inplace=True)
df['labels'].replace({-1:0},inplace=True)

In [15]:
# Preprocess data
X = list(df["news"])
y = list(df["labels"])
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
X_val_2, X_test, y_val_2, y_test = train_test_split(X_val, y_val, test_size=0.2)
X_train_tokenized = tokenizer(X_train, padding=True, truncation=True, max_length=512)
X_val_tokenized = tokenizer(X_val_2, padding=True, truncation=True, max_length=512)
X_test_tokenized = tokenizer(X_test, padding=True, truncation=True, max_length=512)

In [16]:
train_dataset = Dataset(X_train_tokenized, y_train)
val_dataset = Dataset(X_val_tokenized, y_val)

In [17]:
# Define Trainer
args = TrainingArguments(
    output_dir="first_test",
    evaluation_strategy="steps",
    eval_steps=500,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=4,
    seed=0,
    load_best_model_at_end=True,
)

In [18]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)

In [19]:
# Train pre-trained model
trainer.train()

Step,Training Loss,Validation Loss,Accuracy
500,0.6626,2.031116,0.435897
1000,0.2616,3.120428,0.435897


Checkpoint destination directory first_test/checkpoint-500 already exists and is non-empty. Saving will proceed but saved results may be invalid.
Checkpoint destination directory first_test/checkpoint-1000 already exists and is non-empty. Saving will proceed but saved results may be invalid.


TrainOutput(global_step=1076, training_loss=0.43966309643146273, metrics={'train_runtime': 367.6347, 'train_samples_per_second': 23.349, 'train_steps_per_second': 2.927, 'total_flos': 149982870399264.0, 'train_loss': 0.43966309643146273, 'epoch': 4.0})

In [20]:
# ----- 3. Predicting -----#
# Create torch dataset
test_dataset = Dataset(X_test_tokenized, y_test)

In [21]:
# Load trained model
model_path = "first_test/checkpoint-500"
model = BertForSequenceClassification.from_pretrained(model_path, num_labels=3)

In [22]:
# Define test trainer
test_trainer = Trainer(model)

In [23]:
 #Make prediction
raw_pred, _, _ = test_trainer.predict(test_dataset)

In [24]:
# Preprocess raw predictions
y_pred = np.argmax(raw_pred, axis=1)

In [25]:
#The accuracy with the re-trained FinBERT:
accuracy_score(y_true=y_test, y_pred=y_pred)

0.6666666666666666

In [26]:
#The accuracy of the original finbert:
X_test_df = pd.DataFrame()
X_test_df['news']=X_test
X_test_df['labels']=y_test

In [27]:
X_test_df.head()

Unnamed: 0,news,labels
0,Calamari Network Rolls Out Community Governanc...,2
1,Gemini acquires crypto custody startup Shard X,1
2,Sen. Sherrod Brown tells the Fed to move forwa...,1
3,What Happens to Bitcoin if the Dollar Crashes?,1
4,Higher and Safer APR gets 200% bigger: Meet Ph...,1


In [28]:
X_test_df.shape

(108, 2)

In [29]:
X_test_df['labels'].replace({0:-1},inplace=True)
X_test_df['labels'].replace({1:0},inplace=True)
X_test_df['labels'].replace({2:1},inplace=True)

In [30]:
X_test_df.head()

Unnamed: 0,news,labels
0,Calamari Network Rolls Out Community Governanc...,1
1,Gemini acquires crypto custody startup Shard X,0
2,Sen. Sherrod Brown tells the Fed to move forwa...,0
3,What Happens to Bitcoin if the Dollar Crashes?,0
4,Higher and Safer APR gets 200% bigger: Meet Ph...,0


In [31]:
X_test_df[['Positive','Negative','Neutral']] = (X_test_df['news'].apply(apply_finbert)).apply(pd.Series)

In [32]:
result_df = calculate_result(X_test_df)

In [33]:
result_df.head()

Unnamed: 0,news,labels,Positive,Negative,Neutral,result
0,Calamari Network Rolls Out Community Governanc...,1,0.013107,0.304759,0.682134,0
1,Gemini acquires crypto custody startup Shard X,0,0.001792,0.040396,0.957812,0
2,Sen. Sherrod Brown tells the Fed to move forwa...,0,0.067053,0.887254,0.045694,-1
3,What Happens to Bitcoin if the Dollar Crashes?,0,0.024612,0.950764,0.024625,-1
4,Higher and Safer APR gets 200% bigger: Meet Ph...,0,0.010688,0.820045,0.169266,-1


In [34]:
result_df.shape

(108, 6)

In [35]:
#The accuracy with the original FinBERT:
accuracy_score(y_true=result_df['labels'], y_pred=result_df['result'])

0.18518518518518517

#### Testing feature enginnering to improve the model accuracy

In [21]:
stop_words = stopwords.words('english')
df['news'] = df['news'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))
df["news"] = df["news"].str.lower()
df["news"] = df['news'].str.replace('[^\w\s]','')

In [22]:
lemmatizer = WordNetLemmatizer()
df['lemmetized_titles'] = df['news'].apply(lemmetize_titles)

In [23]:
df.head()

Unnamed: 0,news,labels,lemmetized_titles
0,ripple announces stock buyback nabs 15 billion...,2,ripple announces stock buyback nabs 15 billion...
1,imf directors urge el salvador remove bitcoin ...,0,imf director urge el salvador remove bitcoin l...
2,dragonfly capital raising 500 million new fund,2,dragonfly capital raising 500 million new fund
3,rick morty cocreator collaborates paradigm nft...,1,rick morty cocreator collaborates paradigm nft...
4,how fintech spacs lost shine,1,how fintech spacs lost shine


In [24]:
# Preprocess data
X = list(df["lemmetized_titles"])
y = list(df["labels"])
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
X_val_2, X_test, y_val_2, y_test = train_test_split(X_val, y_val, test_size=0.2)
X_train_tokenized = tokenizer(X_train, padding=True, truncation=True, max_length=512)
X_val_tokenized = tokenizer(X_val_2, padding=True, truncation=True, max_length=512)
X_test_tokenized = tokenizer(X_test, padding=True, truncation=True, max_length=512)

In [25]:
train_dataset = Dataset(X_train_tokenized, y_train)
val_dataset = Dataset(X_val_tokenized, y_val)
test_dataset = Dataset(X_test_tokenized, y_test)

In [26]:
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")
training_args = TrainingArguments(
    output_dir='final_model/',          # output directory for checkpoints
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=8,   # batch size per device during training
    per_device_eval_batch_size=8,    # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='logs/',            # directory for storing logs
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model='accuracy',
    evaluation_strategy="epoch",        # evaluate at the end of each epoch
    save_strategy="epoch",  
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=data_collator,  # Use the dynamic padding data collator
    compute_metrics=compute_metrics,
)

In [52]:
# Train pre-trained model
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy
1,0.2716,3.467583,0.386946
2,0.2566,3.429484,0.398601
3,0.1523,4.036821,0.403263


TrainOutput(global_step=807, training_loss=0.16719053310827992, metrics={'train_runtime': 273.5665, 'train_samples_per_second': 23.534, 'train_steps_per_second': 2.95, 'total_flos': 102561815787732.0, 'train_loss': 0.16719053310827992, 'epoch': 3.0})

In [27]:
# Load trained model
model_path = "final_model/checkpoint-807"
model = BertForSequenceClassification.from_pretrained(model_path, num_labels=3)

In [28]:
# Define test trainer
test_trainer = Trainer(model)

In [29]:
 #Make prediction
raw_pred, _, _ = test_trainer.predict(test_dataset)

In [30]:
# Preprocess raw predictions
y_pred = np.argmax(raw_pred, axis=1)

In [31]:
#The accuracy with the re-trained FinBERT:
accuracy_score(y_true=y_test, y_pred=y_pred)

0.9166666666666666

In [32]:
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score, confusion_matrix

In [33]:
confusion_matrix(y_true=y_test, y_pred=y_pred)

array([[18,  0,  0],
       [ 1, 28,  3],
       [ 1,  4, 53]])

In [34]:
raw_pred, _, _ = test_trainer.predict(train_dataset)

In [35]:
y_pred = np.argmax(raw_pred, axis=1)

In [36]:
accuracy_score(y_true=y_train, y_pred=y_pred)

0.9394221808014911

In [37]:
confusion_matrix(y_true=y_train, y_pred=y_pred)

array([[ 303,   16,    8],
       [  13,  674,   42],
       [   6,   45, 1039]])

#### Applying Labels to Study Case 

In [38]:
# Define the pipeline
text_processing_pipeline = Pipeline([
    ('text_cleaning', TextCleaner()),
    ('stop_words_removal', StopWordsRemover()),
    ('lemmatization', Lemmatizer())
])

In [39]:
df_study = pd.read_csv('GA_data_wrelevance.csv')

In [40]:
processed_data = text_processing_pipeline.fit_transform(df_study['title'])

In [41]:
df_study['lemmetized_titles'] = processed_data

In [42]:
X = df_study['lemmetized_titles'].tolist()

In [43]:
y = np.ones(len(X))

In [44]:
y = [int(x) for x in y]

In [45]:
XT = tokenizer(X, padding=True, truncation=True, max_length=512)

In [46]:
train_dataset = Dataset(XT, y)

In [76]:
raw_pred, raw_pred2, raw_pred3 = test_trainer.predict(train_dataset)

In [77]:
y_pred = np.argmax(raw_pred, axis=1)

In [78]:
exp(raw_pred[0][0])/(exp(raw_pred[0][0])+exp(raw_pred[0][1])+exp(raw_pred[0][2]))

0.9995974296994414

In [79]:
exp(raw_pred[0][1])/(exp(raw_pred[0][0])+exp(raw_pred[0][1])+exp(raw_pred[0][2]))

0.00024243224362529114

In [80]:
exp(raw_pred[0][2])/(exp(raw_pred[0][0])+exp(raw_pred[0][1])+exp(raw_pred[0][2]))

0.00016013805693329355

In [81]:
neg_class = []
neut_class = []
pos_class = []

for item in raw_pred:
    neg = exp(item[0])/(exp(item[0])+exp(item[1])+exp(item[2]))
    neut = exp(item[1])/(exp(item[0])+exp(item[1])+exp(item[2]))
    pos = exp(item[2])/(exp(item[0])+exp(item[1])+exp(item[2]))
    neg_class.append(neg)
    neut_class.append(neut)
    pos_class.append(pos)

In [82]:
df_study['sentiment_negative_probability']=neg_class
df_study['sentiment_neutral_probability']=neut_class
df_study['sentiment_positive_probability']=pos_class
df_study['sentiment_class']=y_pred-1

In [83]:
df_study

Unnamed: 0.1,Unnamed: 0,url,url_mobile,title,seendate,socialimage,domain,language,sourcecountry,relevance_probability,relevance_class,lemmetized_titles,sentiment_negative_probability,sentiment_neutral_probability,sentiment_positive_probability,sentiment_class
0,0,https://news.yahoo.com/ai-scams-missouri-warns...,,AI Scams : Missouri warns voices of loved ones...,20240219T204500Z,https://media.zenfs.com/en/ktvi_articles_498/2...,news.yahoo.com,English,United States,0.210185,0.0,ai scam missouri warns voice loved one used fraud,0.999597,0.000242,0.000160,-1
1,1,https://www.americanbanker.com/opinion/regulat...,,Regulators should reexamine their assumptions ...,20240219T194500Z,https://source-media-brightspot.s3.us-east-1.a...,americanbanker.com,English,United States,0.565864,1.0,regulator reexamine assumption brokered deposit,0.989373,0.010579,0.000048,-1
2,2,https://biztoc.com/x/97e1450bfef84362,,South Korean Political Party Eyes Crypto Revol...,20240219T130000Z,https://c.biztoc.com/p/97e1450bfef84362/s.webp,biztoc.com,English,,0.697517,1.0,south korean political party eye crypto revolu...,0.998693,0.000833,0.000474,-1
3,3,https://biztoc.com/x/5c2110519540e5cf,,Unraveling the Mystery Behind XRP Price Underp...,20240219T103000Z,https://c.biztoc.com/p/5c2110519540e5cf/s.webp,biztoc.com,English,,0.369046,0.0,unraveling mystery behind xrp price underperfo...,0.001967,0.996818,0.001215,0
4,4,https://biztoc.com/x/2f038851769a9841,,Cryptocurrency Rankings : Solana Claims the Co...,20240219T181500Z,https://c.biztoc.com/p/2f038851769a9841/s.webp,biztoc.com,English,,0.512854,1.0,cryptocurrency ranking solana claim coveted fo...,0.000429,0.987183,0.012388,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
13999,13999,https://www.kilkennypeople.ie/news/national-ne...,,Oliver Callan first day on new RTÉ show - feel...,20240129T151500Z,https://www.kilkennypeople.ie/resizer/1200/700...,kilkennypeople.ie,English,Ireland,0.178277,0.0,oliver callan first day new rt show feeling st...,0.000348,0.997917,0.001735,0
14000,14000,https://www.leitrimobserver.ie/news/national-n...,,Oliver Callan first day on new RTÉ show - feel...,20240129T151500Z,https://www.leitrimobserver.ie/resizer/1200/70...,leitrimobserver.ie,English,Ireland,0.178277,0.0,oliver callan first day new rt show feeling st...,0.000348,0.997917,0.001735,0
14001,14001,https://www.donegallive.ie/news/national-news/...,,Oliver Callan first day on new RTÉ show - feel...,20240129T151500Z,https://www.donegallive.ie/resizer/1200/700/tr...,donegallive.ie,English,,0.178277,0.0,oliver callan first day new rt show feeling st...,0.000348,0.997917,0.001735,0
14002,14002,https://www.banklesstimes.com/news/2024/01/29/...,,RFK Jr . Joins Trump in Anti - CBDC Stance,20240129T141500Z,"https://cdn.banklesstimes.com/tr:f-jpg,w-1200,...",banklesstimes.com,English,United States,0.113114,0.0,rfk jr join trump anti cbdc stance,0.116411,0.882984,0.000605,0


In [86]:
df_study.to_csv('GA_data_wsentiment.csv')

Applying sentiment analysis to strength training data

In [47]:
df_strength = pd.read_csv('news_short.csv')

In [49]:
processed_data = text_processing_pipeline.fit_transform(df_strength['title'])

In [50]:
df_strength['lemmetized_titles'] = processed_data

In [51]:
X = df_strength['lemmetized_titles'].tolist()

In [52]:
y = np.ones(len(X))

In [53]:
y = [int(x) for x in y]

In [54]:
XT = tokenizer(X, padding=True, truncation=True, max_length=512)

In [55]:
train_dataset = Dataset(XT, y)

In [56]:
raw_pred, raw_pred2, raw_pred3 = test_trainer.predict(train_dataset)

In [57]:
y_pred = np.argmax(raw_pred, axis=1)

In [58]:
neg_class = []
neut_class = []
pos_class = []

for item in raw_pred:
    neg = exp(item[0])/(exp(item[0])+exp(item[1])+exp(item[2]))
    neut = exp(item[1])/(exp(item[0])+exp(item[1])+exp(item[2]))
    pos = exp(item[2])/(exp(item[0])+exp(item[1])+exp(item[2]))
    neg_class.append(neg)
    neut_class.append(neut)
    pos_class.append(pos)

In [59]:
df_strength['sentiment_negative_probability']=neg_class
df_strength['sentiment_neutral_probability']=neut_class
df_strength['sentiment_positive_probability']=pos_class
df_strength['sentiment_class']=y_pred-1

In [60]:
df_strength.head()

Unnamed: 0,url,url_mobile,title,seendate,socialimage,domain,language,sourcecountry,lemmetized_titles,sentiment_negative_probability,sentiment_neutral_probability,sentiment_positive_probability,sentiment_class
0,https://www.digitaljournal.com/pr/longhash-ven...,,LongHash Ventures and Terraform Labs Join Forc...,20220406T163000Z,https://www.newsfilecorp.com/newsinfo/119481/356,digitaljournal.com,English,United States,longhash venture terraform lab join force adva...,0.000179,0.010903,0.988918,1
1,https://www.prnewswire.com/news-releases/terra...,,TERRA . DO TO COMPETE IN FINAL 20 GROUP FOR ED...,20220406T001500Z,,prnewswire.com,English,United States,terra compete final 20 group edtech competitio...,0.000484,0.99854,0.000976,0
2,https://techcrunch.com/2022/04/06/terras-found...,,Terra founder plans to back its stablecoin wit...,20220406T213000Z,https://techcrunch.com/wp-content/uploads/2022...,techcrunch.com,English,United States,terra founder plan back stablecoin basket cryp...,8.1e-05,0.001193,0.998726,1
3,https://www.business-standard.com/article/comp...,https://wap.business-standard.com/article-amp/...,Crypto platform Leap raises $3 . 2 mn in fundi...,20220406T081500Z,https://bsmedia.business-standard.com/_media/b...,business-standard.com,English,India,crypto platform leap raise 3 2 mn funding coin...,6.8e-05,0.00475,0.995181,1
4,https://www.fool.com/investing/2022/04/06/can-...,,Can THORchain Keep Surging ? | The Motley Fool,20220406T120000Z,https://g.foolcdn.com/editorial/images/673167/...,fool.com,English,United States,thorchain keep surging motley fool,0.000619,0.995138,0.004243,0


In [61]:
df_strength.to_csv('news_short_wsentiment.csv')