# Introduction

We encode the sentiment labels as integers: 0 for positive and 1 for negative.

We tried two training methods: a straightforward single run on the whole dataset, and one that uses 5-fold cross-validation.
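
The k-fold method partitions the dataset into k folds and holds out each fold in turn for validation. The sketch below shows only the index arithmetic in plain Python, assuming 562 examples (the train and validation sizes reported in the training logs sum to that) and 5 folds. `kfold_indices` is a hypothetical helper written for illustration, not the scikit-learn `KFold` class used in the notebook, and it leaves out shuffling.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) lists for contiguous, unshuffled folds."""
    # The first n_samples % n_splits folds receive one extra example.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val_idx = list(range(start, start + size))
        # Everything outside the validation window is used for training.
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


# Assumed dataset size: 562 examples, split into 5 folds.
fold_sizes = [(len(train), len(val)) for train, val in kfold_indices(562, 5)]
print(fold_sizes)
```

Under these assumptions, the first two folds validate on 113 examples and the remaining three on 112.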

The following cells contain the code we used to create and train the models.

In [None]:
!pip install transformers
!pip install evaluate

In [11]:
from transformers import BertForSequenceClassification, Trainer, TrainingArguments, BertTokenizerFast
import torch
from pathlib import Path
import pandas as pd
import numpy as np

dp = pd.read_csv("/content/drive/MyDrive/CriptoBert/Cripto_sentiment.csv", sep="\t")

train_texts, train_labels = list(dp["Text"]), list(dp["label"])

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
train_encodings = tokenizer(train_texts, truncation=True, padding=True)

class CryptoDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

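    def __getitem__(self, idx):
        # one tensor per tokenizer output (input_ids, token_type_ids, attention_mask)
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # encode the string label: 0 = positive, 1 = anything else (negative)
        item['labels'] = 0 if self.labels[idx] == "positive" else 1
        return item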

    def __len__(self):
        return len(self.labels)

train_dataset = CryptoDataset(train_encodings, train_labels)

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=10,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=64,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=10,
)

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
)

trainer.train()

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Step,Training Loss
10,0.7056
20,0.7076
30,0.7096
40,0.6692
50,0.676
60,0.6542
70,0.6388
80,0.5921
90,0.5172
100,0.4603




Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=360, training_loss=0.23928162790576202, metrics={'train_runtime': 516.1154, 'train_samples_per_second': 10.889, 'train_steps_per_second': 0.698, 'total_flos': 1478684131123200.0, 'train_loss': 0.23928162790576202, 'epoch': 10.0})
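
With `warmup_steps=500` and only 360 optimization steps in total (`global_step=360`), this run ends before warmup finishes. The sketch below illustrates a linear warmup followed by a linear decay, assuming the `Trainer` defaults of a linear schedule and a peak learning rate of 5e-5 (neither is set explicitly in the arguments above). `linear_schedule_lr` is a hypothetical helper written for illustration, not a library function.

```python
def linear_schedule_lr(step, peak_lr, warmup_steps, total_steps):
    """Learning rate at `step` for a linear warmup, then a linear decay to zero."""
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to peak_lr.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: scale linearly from peak_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Assumed values: default peak rate 5e-5, 500 warmup steps, 360 total steps.
final_lr = linear_schedule_lr(360, peak_lr=5e-5, warmup_steps=500, total_steps=360)
print(f"{final_lr:.2e}")
```

Under these assumptions the learning rate is still rising when training stops, ending at roughly 3.6e-5 instead of the 5e-5 peak.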

In [17]:
trainer.save_model()

Saving model checkpoint to ./results
Configuration saved in ./results/config.json
Model weights saved in ./results/pytorch_model.bin


In [17]:
# reload the fine-tuned classifier saved by trainer.save_model()
model = BertForSequenceClassification.from_pretrained("/content/results")

loading configuration file /content/results/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.24.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file /content/results/pytorch_model.bin
Some weights of the model checkpoint at /content/results were not used when initializing BertModel: ['classifier.weight', 'classifier.bi

In [39]:
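# NOTE: test_dataset is not defined in the cells shown here; it is assumed to be
# a CryptoDataset built from held-out texts in the same way as train_dataset
predictions = trainer.predict(test_dataset)
print(predictions.predictions.shape, predictions.label_ids.shape)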

***** Running Prediction *****
  Num examples = 112
  Batch size = 64


(112, 2) (112,)


In [42]:
import evaluate

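# predicted class = index of the larger logit
preds = np.argmax(predictions.predictions, axis=-1)
# the GLUE MRPC metric reports accuracy and binary F1
metric = evaluate.load("glue", "mrpc")
# encode the string labels the same way as in training: 0 = positive, 1 = negative
labels = [0 if label == "positive" else 1 for label in test_dataset.labels]
metric.compute(predictions=preds, references=labels)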

{'accuracy': 0.875, 'f1': 0.8870967741935484}
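
The `glue`/`mrpc` metric reports accuracy together with a binary F1 score. As a rough sketch of what those two numbers measure, the plain-Python function below computes both from integer predictions and references. It assumes label 1 is treated as the positive class for F1, which under the encoding used here is the negative sentiment. `accuracy_and_f1` is a hypothetical helper, and the toy lists are made up for illustration, not taken from the test set.

```python
def accuracy_and_f1(preds, refs, positive=1):
    """Accuracy and binary F1, with `positive` as the class F1 is computed for."""
    pairs = list(zip(preds, refs))
    correct = sum(p == r for p, r in pairs)
    tp = sum(p == positive and r == positive for p, r in pairs)
    fp = sum(p == positive and r != positive for p, r in pairs)
    fn = sum(p != positive and r == positive for p, r in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# Made-up predictions and references, for illustration only.
toy_preds = [1, 0, 1, 1, 0, 0, 1, 0]
toy_refs = [1, 0, 0, 1, 0, 1, 1, 0]
scores = accuracy_and_f1(toy_preds, toy_refs)
print(scores)
```

Because F1 depends on which class counts as positive, swapping the label encoding would change the F1 value while leaving accuracy unchanged.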

In [76]:
sample_txt = "Crypto Prices Bullish Once More"
# tokenize the sample and return PyTorch tensors
encoding = tokenizer(sample_txt, return_tensors='pt', truncation=True, padding=True)
input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']

# inference only: disable dropout and gradient tracking
model.eval()
with torch.no_grad():
    out = model(input_ids, attention_mask=attention_mask, output_attentions=False)

In [88]:
model.push_to_hub("Robertuus/Crypto_Sentiment_Analysis_Bert", use_auth_token="")

Configuration saved in /tmp/tmp90pw82fg/config.json
Model weights saved in /tmp/tmp90pw82fg/pytorch_model.bin
Uploading the following files to Robertuus/Crypto_Sentiment_Analysis_Bert: pytorch_model.bin,config.json


CommitInfo(commit_url='https://huggingface.co/Robertuus/Crypto_Sentiment_Analysis_Bert/commit/f98c9aeaa71dbc235ce4aa30059817aae651f54a', commit_message='Upload BertForSequenceClassification', commit_description='', oid='f98c9aeaa71dbc235ce4aa30059817aae651f54a', pr_url=None, pr_revision=None, pr_num=None)

In [2]:
from sklearn.model_selection import KFold
from transformers import BertForSequenceClassification, Trainer, TrainingArguments, BertTokenizerFast
import pandas as pd
import numpy as np
import torch

class CryptoDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

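    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # labels arrive already encoded as 0/1 (dp["label"] is mapped before the
        # datasets are built); comparing them with "positive" again would turn
        # every label into 1
        item['labels'] = torch.tensor(self.labels[idx])
        return item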

    def __len__(self):
        return len(self.labels)

def compute_metrics(eval_pred):
    # eval_pred holds NumPy arrays: logits of shape (n, 2) and integer labels of shape (n,)
    logits, labels = eval_pred
    # softmax is monotonic, so the argmax of the logits is the predicted class
    predictions = np.argmax(logits, axis=-1)
    # return a plain float so the metric can be logged and serialized
    acc = float(np.mean(predictions == labels))
    return {"accuracy": acc}



In [50]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

dp = pd.read_csv("/content/drive/MyDrive/CriptoBert/Cripto_sentiment.csv", sep="\t")

kf = KFold(n_splits=5, shuffle=True, random_state=0)

tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
model.to(device)

training_args = TrainingArguments(
    output_dir='./results',          # output directory
    overwrite_output_dir = True,
    num_train_epochs=2,              # total number of training epochs
    per_device_train_batch_size=16,  # batch size per device during training
    per_device_eval_batch_size=16,   # batch size for evaluation
    learning_rate=2e-5,
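    warmup_steps=1000,               # number of warmup steps; exceeds the 58 optimization steps per fold, so the peak learning rate is never reached
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    logging_steps=220,               # log every 220 steps; above 58, so the training loss is reported as "No log"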
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True
)


counter = 0
results_lst = []


train_text = dp["Text"]
train_label = dp["label"].apply(lambda label: 0 if label == "positive" else 1)

for train_idx, val_idx in kf.split(train_text):
    print("Starting fold", counter)

    # split data
    train_texts_base = train_text.iloc[train_idx].tolist()
    train_labels_base = train_label.iloc[train_idx].tolist()

    val_texts = train_text.iloc[val_idx].tolist()
    val_labels = train_label.iloc[val_idx].tolist()

    # do tokenization (plain lists; CryptoDataset converts each item to a tensor,
    # and wrapping an existing tensor in torch.tensor() triggers a copy warning)
    train_encodings = tokenizer(train_texts_base, truncation=True, padding=True, max_length=512)
    val_encodings = tokenizer(val_texts, truncation=True, padding=True, max_length=512)
    
    # make datasets
    train_data = CryptoDataset(train_encodings, train_labels_base)
    val_data = CryptoDataset(val_encodings, val_labels)
    
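    # train
    # start every fold from the pretrained weights; reusing one model across folds
    # lets it train on examples that later serve as validation data
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased")
    trainer = Trainer(
        model=model,                      # the instantiated 🤗 Transformers model to be trained
        args=training_args,               # training arguments, defined above
        train_dataset=train_data,         # training dataset
        eval_dataset=val_data,            # evaluation dataset
        compute_metrics=compute_metrics
    )
    trainer.train()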
    
    # eval
    predicts = trainer.predict(val_data)
    result_df = pd.DataFrame({
        "text" : val_texts,
        "score" : torch.softmax(torch.tensor(predicts.predictions), axis=1).tolist()
    })
    results_lst.append(result_df)
    
    counter+=1
    
trainer.save_model("/results")

loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--bert-base-uncased/snapshots/5546055f03398095e385d7dc625e636cc8910bf2/vocab.txt
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--bert-base-uncased/snapshots/5546055f03398095e385d7dc625e636cc8910bf2/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--bert-base-uncased/snapshots/5546055f03398095e385d7dc625e636cc8910bf2/tokenizer_config.json
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-uncased/snapshots/5546055f03398095e385d7dc625e636cc8910bf2/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_ac

Starting fold 0


***** Running training *****
  Num examples = 449
  Num Epochs = 2
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 58
  Number of trainable parameters = 109483778


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.504197,1.0
2,No log,0.392931,1.0


***** Running Evaluation *****
  Num examples = 113
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-29
Configuration saved in ./results/checkpoint-29/config.json
Model weights saved in ./results/checkpoint-29/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 113
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-58
Configuration saved in ./results/checkpoint-58/config.json
Model weights saved in ./results/checkpoint-58/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from ./results/checkpoint-58 (score: 0.3929307460784912).
***** Running Prediction *****
  Num examples = 113
  Batch size = 16


***** Running training *****
  Num examples = 449
  Num Epochs = 2
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 58
  Number of trainable parameters = 109483778


Starting fold 1


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.35951,1.0
2,No log,0.285328,1.0


***** Running Evaluation *****
  Num examples = 113
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-29
Configuration saved in ./results/checkpoint-29/config.json
Model weights saved in ./results/checkpoint-29/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 113
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-58
Configuration saved in ./results/checkpoint-58/config.json
Model weights saved in ./results/checkpoint-58/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from ./results/checkpoint-58 (score: 0.28532758355140686).
***** Running Prediction *****
  Num examples = 113
  Batch size = 16


***** Running training *****
  Num examples = 450
  Num Epochs = 2
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 58
  Number of trainable parameters = 109483778


Starting fold 2


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.275044,1.0
2,No log,0.217441,1.0


***** Running Evaluation *****
  Num examples = 112
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-29
Configuration saved in ./results/checkpoint-29/config.json
Model weights saved in ./results/checkpoint-29/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 112
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-58
Configuration saved in ./results/checkpoint-58/config.json
Model weights saved in ./results/checkpoint-58/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from ./results/checkpoint-58 (score: 0.21744079887866974).
***** Running Prediction *****
  Num examples = 112
  Batch size = 16


***** Running training *****
  Num examples = 450
  Num Epochs = 2
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 58
  Number of trainable parameters = 109483778


Starting fold 3


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.197558,1.0
2,No log,0.151641,1.0


***** Running Evaluation *****
  Num examples = 112
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-29
Configuration saved in ./results/checkpoint-29/config.json
Model weights saved in ./results/checkpoint-29/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 112
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-58
Configuration saved in ./results/checkpoint-58/config.json
Model weights saved in ./results/checkpoint-58/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from ./results/checkpoint-58 (score: 0.1516409069299698).
***** Running Prediction *****
  Num examples = 112
  Batch size = 16


***** Running training *****
  Num examples = 450
  Num Epochs = 2
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 58
  Number of trainable parameters = 109483778


Starting fold 4


Epoch,Training Loss,Validation Loss,Accuracy
1,No log,0.145197,1.0
2,No log,0.112443,1.0


***** Running Evaluation *****
  Num examples = 112
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-29
Configuration saved in ./results/checkpoint-29/config.json
Model weights saved in ./results/checkpoint-29/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 112
  Batch size = 16
Saving model checkpoint to ./results/checkpoint-58
Configuration saved in ./results/checkpoint-58/config.json
Model weights saved in ./results/checkpoint-58/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from ./results/checkpoint-58 (score: 0.11244286596775055).
***** Running Prediction *****
  Num examples = 112
  Batch size = 16


Saving model checkpoint to /results
Configuration saved in /results/config.json
Model weights saved in /results/pytorch_model.bin


In [51]:
model = BertForSequenceClassification.from_pretrained("/content/results/checkpoint-64")
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

loading configuration file /content/results/checkpoint-64/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.24.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file /content/results/checkpoint-64/pytorch_model.bin
All model checkpoint weights were used when initializing BertForSequenceClassification.

All the weights of 

In [53]:
sample_txt = "good"

# tokenize the sample and return PyTorch tensors
encoding = tokenizer(sample_txt, return_tensors='pt', truncation=True, padding=True)
input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']

out = model(input_ids, attention_mask=attention_mask, output_attentions=False)
print(torch.softmax(out.logits, axis=1))

tensor([[0.0129, 0.9871]], grad_fn=<SoftmaxBackward0>)
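
To turn the probabilities above into a label, take the index of the largest entry and map it back through the encoding from the introduction (0 for positive, 1 for negative). The sketch below does this in plain Python. `ID2LABEL`, `softmax`, and `decode` are illustrative names that do not appear in the notebook, and the probabilities are copied from the printed tensor.

```python
import math

# Inverse of the label encoding stated in the introduction.
ID2LABEL = {0: "positive", 1: "negative"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(probs):
    """Return (label, probability) for the most likely class."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return ID2LABEL[best], probs[best]


# Probabilities copied from the printed tensor for the input "good".
label, confidence = decode([0.0129, 0.9871])
print(label, confidence)
```

Applied to the printed tensor, this assigns the input "good" to class 1, which the encoding defines as negative, with probability 0.9871.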
