# FinBert Finetune Notebook
## Functionallity
* 1. Import the dataset with the news labelled in class
* 2. Split the data into train and test 
* 3. Import FinBert model
* 4. Use the train data (after processing) to fine-tune FinBert
* 5. Measure the performance of the model in train and test
* 6. Compare the model with others used in the previous notebook (Sentiment Analysys)
* 7. Use the model (FinBert FineTune) to classify new news created during the previous steps of the project (Relevance and Strength Score)

In [2]:
!pip install transformers
!pip install datasets
#!pip install src

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.20.1-py3-none-any.whl (4.4 MB)
[K     |████████████████████████████████| 4.4 MB 14.8 MB/s 
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.8.1-py3-none-any.whl (101 kB)
[K     |████████████████████████████████| 101 kB 14.9 MB/s 
[?25hCollecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
[K     |████████████████████████████████| 6.6 MB 64.5 MB/s 
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
[K     |████████████████████████████████| 596 kB 47.8 MB/s 
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninsta

In [4]:
import sys

import numpy as np
import pandas as pd

import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import Dataset, DataLoader

from transformers import BertTokenizer, BertModel

from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import f1_score, classification_report

from tqdm import tqdm

sys.path.insert(0, '..')
#from data_collection import get_data

pd.set_option("display.max_colwidth", None)

### 1.0. Data Preprocessing

In [5]:
#from google.colab import drive
#drive.mount('/content/drive')
#Load the data 
#import pandas as pd
#dataset = pd.read_csv('/content/drive/MyDrive/CryptoLin_IE_v2.csv', index_col='id')
#dataset = dataset[['news','final_manual_labelling']]

In [None]:
taset = pd.read_csv("CryptoLin2.csv", index_col = 'id')
dataset = dataset[['news','final_manual_labelling']]

In [6]:
print(len(dataset))
print(dataset['news'])
print(dataset['final_manual_labelling'])

2683
id
0                                                        Ripple announces stock buyback, nabs $15 billion valuation
1                                                  IMF directors urge El Salvador to remove Bitcoin as legal tender
2                                                            Dragonfly Capital is raising $500 million for new fund
3                                      Rick and Morty co-creator collaborates with Paradigm on NFT research project
4                                                                                How fintech SPACs lost their shine
                                                           ...                                                     
2678    Gambling for a good cause  CryptoSlots donates all proceeds from new slot to the fight against coronavirus
2679                                                                   Litecoin, The Chinese Alternative to Bitcoin
2680                                                            

In [8]:
#Description of our data set (how many tweets we have with hate speech)
dataset["final_manual_labelling"].value_counts()

 1    1366
 0     921
-1     396
Name: final_manual_labelling, dtype: int64

In [9]:
device = torch.device("cuda:0") if torch.cuda.is_available() else "cpu"
device

device(type='cuda', index=0)

In [11]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
  
TOKENIZER = AutoTokenizer.from_pretrained("ProsusAI/finbert")
MODEL = AutoModelForSequenceClassification.from_pretrained("ProsusAI/finbert")

In [12]:
#Define hyperparameters. Download the pretrained model
MODEL_NAME = "FinBERT"  
BATCH_SIZE = 16
MAX_LEN = 128
EPOCHS = 10
LEARNING_RATE = 1e-05
#TOKENIZER = BertTokenizer.from_pretrained(MODEL_NAME, truncation=True, do_lower_case=True)

In [13]:
# Some data processing to be able to use the Hugging face data set
class Dataset_Preprocess(Dataset):

    def __init__(self, dataframe, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.text = dataframe.news
        self.targets = OneHotEncoder(sparse=False).fit_transform(np.array(self.data["final_manual_labelling"]).reshape(-1, 1))
        self.max_len = max_len

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        text = str(self.text[index])
        text = " ".join(text.split())

        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            pad_to_max_length=True,
            return_token_type_ids=True
        )

        ids = inputs["input_ids"]
        mask = inputs["attention_mask"]
        token_type_ids = inputs["token_type_ids"]

        return {
            "ids": torch.tensor(ids, dtype=torch.long),
            "mask": torch.tensor(mask, dtype=torch.long),
            "token_type_ids": torch.tensor(token_type_ids, dtype=torch.long),
            "targets": torch.tensor(self.targets[index], dtype=torch.float)
        }

In [14]:
# Dataloader creation for dataset, split into train, validation and test

train_size = 0.8
val_size = 0.1

train_data = dataset.sample(frac = train_size)
test_data = dataset.drop(train_data.index).reset_index(drop=True)
train_data = train_data.reset_index(drop=True)
val_data = test_data.sample(frac=val_size / (1 - train_size), random_state=220).reset_index()
test_data = test_data.drop(val_data.index).reset_index()

print(f"Full Dataset Size: {dataset.shape}")
print(f"Train Dataset Size: {train_data.shape}")
print(f"Validation Dataset Size: {val_data.shape}")
print(f"Test Dataset Size: {test_data.shape}")

val_test_concat = pd.concat([val_data,test_data])
val_test_concat = val_test_concat.reset_index()
val_test_concat = val_test_concat.drop(columns='level_0')

#val_test_concat['index'] = val_test_concat.index

val_test_set =  Dataset_Preprocess(val_test_concat, TOKENIZER, MAX_LEN)
training_set = Dataset_Preprocess(train_data, TOKENIZER, MAX_LEN)
validation_set = Dataset_Preprocess(val_data, TOKENIZER, MAX_LEN)
testing_set = Dataset_Preprocess(test_data, TOKENIZER, MAX_LEN)


Full Dataset Size: (2683, 2)
Train Dataset Size: (2146, 2)
Validation Dataset Size: (269, 3)
Test Dataset Size: (268, 3)


In [15]:
val_test_concat.head(20)

Unnamed: 0,index,news,final_manual_labelling
0,377,Chinas bitcoin crackdown comment sparks USDT sell-off on OTC desks,-1
1,252,A lack of precedent leads Swedish court to return 33 BTC after law enforcement seizure,-1
2,182,"Ethereum user pays $430,000 in transaction fees for a failed payment",-1
3,360,Bitcoin financial services firm Unchained Capital raises $25 million in Series A funding,1
4,380,Bitcoin hash rate drops as Sichuan miners face short-term power cap,-1
5,466,Goldman is planning to reignite its bitcoin trading desk,0
6,330,El Salvador is handing out up to $117 million in Bitcoin to its citizens,1
7,376,Chinese bitcoin miners brace for impact amid regulatory uncertainty,-1
8,463,"Demand for bitcoin exists across Goldman Sachs wealth management clientele, says crypto exec",1
9,56,Quadency Launches Major Upgrade to Crypto Platform,1


In [107]:
#OneHotEncoder(sparse=False).fit_transform(np.array(val_test_concat["final_manual_labelling"]).reshape(-1, 1))

In [17]:
train_data.dtypes
train_data.head()

Unnamed: 0,news,final_manual_labelling
0,"After pocketing £1 billion off bitcoin, Ruffer describes it as a risky, speculative asset",-1
1,DEX aggregator 1inch blocks out US trades in preparation for separate American platform,0
2,Can You Earn Your Living by Working Online?,0
3,Mastercard highlights applications beyond payments for central bank digital currencies,1
4,"MicroStrategy completes $500 million offering, plans to buy more Bitcoin",1


In [18]:
train_params = {
    "batch_size": BATCH_SIZE,
    "shuffle": True,
    "num_workers": 0
}

val_params = {
    "batch_size": 1,
    "shuffle": False,
    "num_workers": 0
}

test_params = {
    "batch_size": 1,
    "shuffle": False,
    "num_workers": 0
}

training_loader = DataLoader(training_set, **train_params)
validation_loader = DataLoader(validation_set, **val_params)
testing_loader = DataLoader(testing_set, **test_params)
val_test_loader =  DataLoader(val_test_set, **test_params)

### 2.0. FinBERT Base Model

In [19]:
import gc
import numpy as np
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score
from transformers import AutoModel
import pandas as pd


In [20]:
class FinBERT(nn.Module):
    def __init__(self, n_classes):
        super(FinBERT, self).__init__()
        #self.l1 = BertModel.from_pretrained('bert-base-uncased')
        self.l1 = BertModel.from_pretrained("ProsusAI/finbert")
        self.pre_classifier = nn.Linear(768, 768)
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(768, n_classes)

    def forward(self, input_ids, attention_mask, token_type_ids):
        output_1 = self.l1(input_ids=input_ids, attention_mask=attention_mask)
        hidden_state = output_1[0]
        pooler = hidden_state[:, 0]
        pooler = self.pre_classifier(pooler)
        pooler = nn.Tanh()(pooler)
        pooler = self.dropout(pooler)
        output = self.classifier(pooler)
        return output

In [21]:
num_classes = dataset["final_manual_labelling"].nunique()
model = FinBERT(n_classes = num_classes)
model.to(device)
model

Some weights of the model checkpoint at ProsusAI/finbert were not used when initializing BertModel: ['classifier.bias', 'classifier.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


FinBERT(
  (l1): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      

### 3.0. Model Training using Train set

In [22]:
def loss_fn(outputs, targets):
    return torch.nn.BCEWithLogitsLoss()(outputs, targets)

In [23]:
optimizer = AdamW(params=model.parameters(), lr=LEARNING_RATE)

In [24]:
def train(epoch):
    model.train()
    for _, data in tqdm(enumerate(training_loader, 0)):
        ids = data["ids"].to(device, dtype=torch.long)
        mask = data["mask"].to(device, dtype=torch.long)
        token_type_ids = data["token_type_ids"].to(device, dtype=torch.long)
        targets = data["targets"].to(device, dtype=torch.float)
        # print('ids', type(ids))
        # print('mask', type(mask))
        # print('token type ids', type(token_type_ids))
        outputs = model(ids, mask, token_type_ids)
        optimizer.zero_grad()
        loss = loss_fn(outputs, targets)
        if _ % 1000 == 0:
            print(f"Epoch: {epoch}, Loss: {loss.item()}")
        loss.backward()
        optimizer.step()

    return outputs,targets

In [25]:
for epoch in range(EPOCHS):
    outputs, targets = train(epoch)
    #outputs, targets

0it [00:00, ?it/s]Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.


Epoch: 0, Loss: 0.699700117111206


135it [00:46,  2.93it/s]
0it [00:00, ?it/s]

Epoch: 1, Loss: 0.38409623503685


135it [00:48,  2.80it/s]
0it [00:00, ?it/s]

Epoch: 2, Loss: 0.2219899594783783


135it [00:48,  2.81it/s]
0it [00:00, ?it/s]

Epoch: 3, Loss: 0.18588143587112427


135it [00:48,  2.81it/s]
0it [00:00, ?it/s]

Epoch: 4, Loss: 0.16490384936332703


135it [00:48,  2.80it/s]
0it [00:00, ?it/s]

Epoch: 5, Loss: 0.04431833326816559


135it [00:47,  2.81it/s]
0it [00:00, ?it/s]

Epoch: 6, Loss: 0.03268811106681824


135it [00:47,  2.82it/s]
0it [00:00, ?it/s]

Epoch: 7, Loss: 0.03282664343714714


135it [00:47,  2.82it/s]
0it [00:00, ?it/s]

Epoch: 8, Loss: 0.01967923529446125


135it [00:47,  2.82it/s]
0it [00:00, ?it/s]

Epoch: 9, Loss: 0.009695596992969513


135it [00:47,  2.82it/s]


### 4.0. Model Evaluation

In [26]:
def validation(model, loader):
    model.eval()
    fin_targets = []
    fin_outputs = []
    with torch.no_grad():
        for _, data in tqdm(enumerate(loader, 0)):
            ids = data["ids"].to(device, dtype=torch.long)
            mask = data["mask"].to(device, dtype=torch.long)
            token_type_ids = data["token_type_ids"].to(device, dtype=torch.long)
            targets = data["targets"].to(device, dtype=torch.float)
            outputs = model(ids, mask, token_type_ids)
            fin_targets.extend(targets.cpu().detach().numpy().tolist())
            fin_outputs.extend(torch.sigmoid(outputs).cpu().detach().numpy().tolist())
    return fin_outputs, fin_targets

In [27]:
#val_test_concat

In [63]:
#Performance on train
outputs_train, targets_train = validation(model, training_loader)

final_outputs_train = np.argmax(outputs_train, axis=1)
targets_train = np.argmax(targets_train, axis=1)

135it [00:19,  6.99it/s]


In [29]:
from sklearn.metrics import recall_score, precision_score, roc_auc_score, accuracy_score, auc , roc_curve

In [89]:
#roc_auc_score(final_outputs_train.reshape(len(final_outputs_train),1), targets_train.reshape(len(final_outputs_train),1), multi_class='ovr' ,average='weighted')

In [96]:
f1 = f1_score(targets_train,final_outputs_train,average="weighted")
precision = precision_score(targets_train,final_outputs_train,average="weighted")
recall = recall_score(targets_train,final_outputs_train,average="weighted")
accuracy = accuracy_score(targets_train,final_outputs_train)
print(f"Epoch: {epoch}, Accuracy: {accuracy.item()*100} %")
print(f"Epoch: {epoch}, Precision: {precision.item()*100} %")
print(f"Epoch: {epoch}, Recall: {recall.item()*100} %")
print(f"Epoch: {epoch}, F1: {f1.item()*100} %")
#print(f"Epoch: {epoch}, Auc: {auc_precision_recall}")

Epoch: 9, Accuracy: 99.4874184529357 %
Epoch: 9, Precision: 99.48704390806225 %
Epoch: 9, Recall: 99.4874184529357 %
Epoch: 9, F1: 99.48713590131322 %


In [31]:
#Performance on test
outputs, targets = validation(model, val_test_loader)

final_outputs = np.argmax(outputs, axis=1)
targets = np.argmax(targets, axis=1)

537it [00:08, 60.04it/s]


In [32]:
len(val_test_loader)

537

In [33]:
micro_f1 = f1_score(targets, final_outputs, average="micro")
macro_f1 = f1_score(targets, final_outputs, average="macro")
weighted_f1 = f1_score(targets, final_outputs, average="weighted")

print(f"Micro F1 score:\t\t{round(micro_f1, 3)}")
print(f"Macro F1 score:\t\t{round(macro_f1, 3)}")
print(f"Weighted F1 score:\t{round(weighted_f1, 3)}")

Micro F1 score:		0.708
Macro F1 score:		0.701
Weighted F1 score:	0.708


In [97]:
f1 = f1_score(targets,final_outputs,average="weighted")
precision = precision_score(targets,final_outputs,average="weighted")
recall = recall_score(targets,final_outputs,average="weighted")
accuracy = accuracy_score(targets,final_outputs,)
#auc = roc_auc_score(final_outputs_train, targets_train,multi_class='ovr')
print(f"Accuracy Test: {accuracy.item()*100} %")
print(f"Precision Test: {precision.item()*100} %")
print(f"Recall Test: {recall.item()*100} %")
print(f"F1 Test: {f1.item()*100} %")
#print(f"Epoch: {epoch}, Auc: {auc}")

Accuracy Test: 70.7635009310987 %
Precision Test: 70.83607976475923 %
Recall Test: 70.7635009310987 %
F1 Test: 70.76198489212472 %


In [35]:
print(classification_report(targets, final_outputs))

              precision    recall  f1-score   support

           0       0.76      0.68      0.72        95
           1       0.60      0.61      0.61       186
           2       0.77      0.79      0.78       256

    accuracy                           0.71       537
   macro avg       0.71      0.69      0.70       537
weighted avg       0.71      0.71      0.71       537



In [36]:
print(f"Got {sum(final_outputs == targets)} / {len(final_outputs)} correct")

Got 380 / 537 correct


In [37]:
dataset_final = pd.DataFrame()
positive = []
neutral = []
negative = []

for output in outputs:
  negative.append(output[0])
  neutral.append(output[1])
  positive.append(output[2])

dataset_final['positive_retrain'] = positive
dataset_final['neutral_retrain'] = neutral
dataset_final['negative_retrain'] = negative
dataset_final['final_tragets'] = val_test_concat['final_manual_labelling']

list_out = []
for out in final_outputs:
  if out == 0:
    list_out.append(-1)
  elif out == 1: 
    list_out.append(0)
  else:
    list_out.append(1)
dataset_final['final_outputs'] = list_out

In [38]:
combined_finbert = []
for output in outputs:
  if output[0] > output[1] and output[0] > output[2]: #negative 
    OldMax = max(negative)
    OldMin = min(negative)
    NewMax = -1
    NewMin = -0.05
    OldRange = (OldMax - OldMin)  
    NewRange = (NewMax - NewMin)  
    OldValue = output[0]
    NewValue = (((OldValue - OldMin) * NewRange) / OldRange) + NewMin
    combined_finbert.append(NewValue)
  elif output[2] > output[0] and output[2] > output[1]: #positive
    OldMax = max(positive)
    OldMin = min(positive)
    NewMax = 1
    NewMin = 0.05
    OldRange = (OldMax - OldMin)  
    NewRange = (NewMax - NewMin)  
    OldValue = output[2]
    NewValue = (((OldValue - OldMin) * NewRange) / OldRange) + NewMin
    combined_finbert.append(NewValue)
  else: #neutral
    OldMax = max(neutral)
    OldMin = min(neutral)
    NewMax = 0.05
    NewMin = -0.05
    OldRange = (OldMax - OldMin)  
    NewRange = (NewMax - NewMin)  
    OldValue = output[1]
    NewValue = (((OldValue - OldMin) * NewRange) / OldRange) + NewMin
    combined_finbert.append(NewValue)

dataset_final['combined_retrain'] = combined_finbert


### 5.0. Threshold evaluation (this part is not used in the final model)

In [42]:
from sklearn.metrics import roc_curve, confusion_matrix

In [43]:
from sklearn.metrics import roc_curve
best_negative_threshold, best_positive_threshold = 0, 0
#y = dataset_final['final_tragets']
y_positive_or_else = dataset_final['final_tragets'].apply(lambda x: 1 if x > 0 else 0)
y_else_or_negative = dataset_final['final_tragets'].apply(lambda x: 0 if x < 0 else 1)
def apply_cutoff(x):
    
    if x < best_negative_threshold:
        return -1
    elif x > best_positive_threshold:
        return 1
    else:
        return 0

from numpy import sqrt, argmax
from numpy import sqrt, argmax
from sklearn.metrics import accuracy_score, classification_report

print("Sentiment Algorithm\tbest negative threshold\t\tbest positive threshold\t\tAccuracy")
for prediction_name in ['positive_retrain','neutral_retrain','negative_retrain','combined_retrain']:
    fpr, tpr, thresholds = roc_curve(y_positive_or_else, dataset_final[prediction_name])
    gmeans = sqrt(tpr * (1-fpr))
    ix = argmax(gmeans)
    best_positive_threshold = thresholds[ix]
    best_positive_threshold

    fpr, tpr, thresholds = roc_curve(y_else_or_negative, dataset_final[prediction_name])
    
    gmeans = sqrt(tpr * (1-fpr))
    ix = argmax(gmeans)
    best_negative_threshold = thresholds[ix]
    
    
    dataset_final[prediction_name+"_class"] = dataset_final[prediction_name].apply(apply_cutoff)
    
    accuracy = accuracy_score(dataset_final['final_tragets'], dataset_final[prediction_name+"_class"])
    if ( best_negative_threshold >= best_positive_threshold):
        print("{}\t{} (no neutral found)\t{}\t\t\t\t{}%".format(prediction_name.ljust(22), round(best_positive_threshold,5), round(best_positive_threshold,5), round(100*accuracy,2)))
    elif best_negative_threshold  >0 :
        print("{}\t{}\t\t\t\t{}\t\t\t\t{}%".format(prediction_name.ljust(22), round(best_negative_threshold,5), round(best_positive_threshold,5), round(100*accuracy,2)))
    else: 
        print("{}\t{}\t\t\t{}\t\t\t{}%".format(prediction_name.ljust(22), round(best_negative_threshold,5), round(best_positive_threshold,5), round(100*accuracy,2)))
    

Sentiment Algorithm	best negative threshold		best positive threshold		Accuracy
positive_retrain      	0.03965				0.24758				58.85%
neutral_retrain       	0.1172 (no neutral found)	0.1172				22.72%
negative_retrain      	0.00413				0.00419				26.44%
combined_retrain      	0.0494				0.58051				65.36%


### 6.0. Extraxt sentiment from crypto news

In [200]:
import pandas as pd
df1 = pd.read_csv('/content/drive/MyDrive/strength_output.csv')

In [201]:
df_relevance = df1[['title']]
df_relevance.rename(columns = {'title':'news'},inplace=True)
df_relevance

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


Unnamed: 0,news
0,bitcoin : is bitcoin mining is legal ?
1,missed out on ethereum ? here what to buy now
2,missed out on ethereum ? here what to buy now
3,bitcoin cash price reaches $327 . 50 on exchanges ( bch )
4,crypto cash out ! here who won bitcoin bonus money at ufc 273
...,...
193,bitcoin use as currency may just be getting started ( cryptocurrency : btc - usd )
194,itwire - review â stellar data recovery premium
195,newly - discovered stellar explosion the micronova could explain more on dead stars
196,"cryptocurrencies price prediction : ethereum , ripple & bitcoin â american wrap 11 april"


In [202]:
# Some data processing to be able to use the Hugging face data set
class Dataset_Preprocess_appplication(Dataset):

    def __init__(self, dataframe, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.data = dataframe
        self.text = dataframe.news
        self.max_len = max_len

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        text = str(self.text[index])
        text = " ".join(text.split())

        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            pad_to_max_length=True,
            return_token_type_ids=True
        )

        ids = inputs["input_ids"]
        mask = inputs["attention_mask"]
        token_type_ids = inputs["token_type_ids"]

        return {
            "ids": torch.tensor(ids, dtype=torch.long),
            "mask": torch.tensor(mask, dtype=torch.long),
            "token_type_ids": torch.tensor(token_type_ids, dtype=torch.long),
        }

In [203]:
relevance_set =  Dataset_Preprocess_appplication(df_relevance, TOKENIZER, MAX_LEN)
relevance_loader =  DataLoader(relevance_set, **test_params)

In [204]:
def prediction(model, loader):
    model.eval()
    fin_targets = []
    fin_outputs = []
    with torch.no_grad():
        for _, data in tqdm(enumerate(loader, 0)):
            ids = data["ids"].to(device, dtype=torch.long)
            mask = data["mask"].to(device, dtype=torch.long)
            token_type_ids = data["token_type_ids"].to(device, dtype=torch.long)
            outputs = model(ids, mask, token_type_ids)
            fin_outputs.extend(torch.sigmoid(outputs).cpu().detach().numpy().tolist())
    return fin_outputs

In [205]:
outputs_train = prediction(model, relevance_loader)

final_outputs_train = np.argmax(outputs_train, axis=1)

198it [00:02, 73.93it/s]


In [206]:
final_outputs = []
for out in final_outputs_train:
  if out == 0:
    final_outputs.append(-1)
  elif out == 1:
    final_outputs.append(0)
  elif out == 2:
    final_outputs.append(1)

df1['sentiment'] = final_outputs

In [212]:
df_news = df1[['title','url','date_x','coin','Relevance Score','Strength','sentiment']]
df_news.rename(columns={'Relevance Score':'relevance','date_x':'date','Strength':'strength'},inplace=True)
df_news[df_news['coin']=='cardano']

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  errors=errors,


Unnamed: 0,title,url,date,coin,relevance,strength,sentiment
7,cardano trading 12 % lower over last week ( ada ),https://www.etfdailynews.com/2022/04/10/cardano-trading-12-lower-over-last-week-ada/,2022-04-10,cardano,0.429458,1,0
8,here why cardano price faces an uphill battle to $1 . 60,https://www.fxstreet.com/cryptocurrencies/news/heres-why-cardano-price-faces-an-uphill-battle-to-160-202204100457,2022-04-10,cardano,0.41611,1,0
22,can cardano price rally to $1 . 6 after major strategic partnership,https://www.fxstreet.com/cryptocurrencies/news/can-cardano-price-rally-to-16-after-major-strategic-partnership-202204080913,2022-04-08,cardano,0.419992,1,0
42,cardano price could rally beyond $1 on one condition,https://www.fxstreet.com/cryptocurrencies/news/cardano-price-could-rally-beyond-1-on-one-condition-202204211905,2022-04-22,cardano,0.379797,0,1
62,what cardano price needs to do to break out to $1 . 60,https://www.fxstreet.com/cryptocurrencies/news/what-cardano-price-needs-to-do-to-break-out-to-160-202204110841,2022-04-11,cardano,0.545061,0,0
63,cardano ( ada ) trading down 20 . 5 % over last week,https://www.etfdailynews.com/2022/04/11/cardano-ada-trading-down-20-5-over-last-week/,2022-04-11,cardano,0.468043,0,-1
64,"why polkadot , cardano , and solana all dropped today",https://www.fool.com/investing/2022/04/11/why-polkadot-cardano-and-solana-all-dropped-today/?source=iedfolrf0000001,2022-04-11,cardano,0.444945,0,-1
80,how to buy cardano in 2022 â our top 3 sites,https://www.heraldscotland.com/news/20050027.buy-cardano-2022---top-3-sites/,2022-04-07,cardano,0.536368,1,0
81,cardano ( ada ) price down 10 . 8 % over last week,https://www.etfdailynews.com/2022/04/07/cardano-ada-price-down-10-8-over-last-week/,2022-04-07,cardano,0.48602,1,-1
104,"cardano price loading up for a 50 % rally , targets $1 . 40",https://www.fxstreet.com/cryptocurrencies/news/cardano-price-loading-up-for-a-50-rally-targets-140-202204202001,2022-04-20,cardano,0.429218,1,0


In [213]:
df_news.to_csv('final_table_groupc.csv')