# Fine-Tuning DistilBERT with StockTwits-Crypto Dataset (100K samples)

## Idea

- As discussed earlier while exploring the classical ML approach, the idea is to use an external dataset from a similar domain to fine-tune a large language model for binary classification.
- This particular dataset contains all cryptocurrency-related posts from the StockTwits website, from 1 November 2021 to 15 June 2022.
- A total of __1.3 MN__ posts were collected over that period.
- Although the labeling process is unknown, I thought it was worth a shot to fine-tune the model on it.

Reference - https://huggingface.co/datasets/ElKulako/stocktwits-crypto

## Data Description

- Stats:
    - The dataset holds __1.3 MN posts__ annotated with __3__ labels.
    - Sentiments: __Bearish, Bullish, Neutral__
- I've hypothesized that Bearish corresponds to Negative sentiment and Bullish to Positive sentiment.
- So, after dropping posts with the Neutral label and mapping Bearish to the Negative class and Bullish to the Positive class:
    - No. of __Negative__ Samples - __124,451__
    - No. of __Positive__ Samples - __676,701__
- I use __100K samples__ from the above __800K__ _Positive_ and _Negative_ samples to fine-tune the __DistilBERT-base-uncased__ model.
- The aim is to fine-tune the __DistilBERT__ model (__67 MN__ params) on this StockTwits dataset with minimal changes to the original training configuration.
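
The filtering and sampling described above could look roughly like the sketch below. It is not the preprocessing actually used to build the CSV (that step is not part of this notebook). It assumes the raw labels are encoded as 0 = Bearish, 1 = Neutral, 2 = Bullish, which matches the 0/2 labels in the loaded CSV, and that the subset is balanced per class, as the 50,000/50,000 class counts suggest. The function name and the toy frame are hypothetical.

```python
import pandas as pd

def make_balanced_binary_subset(df, per_class, seed=42):
    """Drop Neutral rows and draw an equal number of rows from each remaining class."""
    # Assumption: 0 = Bearish, 1 = Neutral, 2 = Bullish.
    binary = df[df["label"] != 1]
    # Sample the same number of rows per class, then shuffle the result.
    sampled = binary.groupby("label").sample(n=per_class, random_state=seed)
    return sampled.sample(frac=1.0, random_state=seed).reset_index(drop=True)

# Toy stand-in for the full dataset: 3 Bearish, 2 Neutral, 7 Bullish posts.
toy = pd.DataFrame({
    "text": [f"post {i}" for i in range(12)],
    "label": [0, 0, 0, 1, 1, 2, 2, 2, 2, 2, 2, 2],
})
subset = make_balanced_binary_subset(toy, per_class=3)
counts = subset["label"].value_counts().to_dict()
print(counts)
```

With `per_class=50_000` on the real data this would give a 100K subset. The labels stay 0/2 here because the notebook remaps them to 0/1 in a later cell.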

## Points to be noted
- Throughout training, the model never sees the actual Reddit Crypto Comments Sentiment Dataset.
- The plan is to use it only for testing, to measure performance.
- My hunch for not fine-tuning on it at all is that it won't make much difference, since the Reddit dataset is very small.
- It might also overfit the model and make the resulting numbers misleading, so I rely only on the external dataset to train the classifier.

## Configurations

Let's use a pretrained model and fine-tune it on the StockTwits-Crypto Sentiment Dataset.

- Using __"distilbert-base-uncased-finetuned-sst-2-english"__
- It is fine-tuned on the __SST2 Dataset__ (Link - https://huggingface.co/datasets/sst2)
- Total Number of Params - __67 MN__
- Class Labels - 0: Negative, 1: Positive
- Using __StockTwits-Crypto (100k) samples__ to further fine-tune for 1 epoch
- Training Arguments & Trainer Configurations are kept the same, with _minimal changes_ in __BatchSize__, __Eval Strategy__, __Num of Epochs__

In [1]:
# !rm -rf /content/trainer/reddit-crypto-sent-trainer/

In [2]:
!nvidia-smi

Wed Feb 22 00:38:29 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   70C    P0    31W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Installing & Importing required Packages

In [3]:
!pip install contractions

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [4]:
!pip install transformers
!pip install torchsummary
!pip install datasets
!pip install evaluate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [3]:
import pandas as pd
import numpy as np
from collections.abc import Collection  # the `collections.Collection` alias was removed in Python 3.10
import re,string,unicodedata
import contractions
import html


In [4]:
from sklearn import metrics
import seaborn as sns
import matplotlib.pyplot as plt

## Utility Functions

In [5]:
def plot_confusion_matrix(test_y, predict_y):
    C = metrics.confusion_matrix(test_y, predict_y)
    missclassified_pts = (len(test_y)-np.trace(C))/len(test_y)*100
    print("Number of misclassified points -",np.round(missclassified_pts, 2),"%")
    # C is a 2x2 matrix; cell (i, j) counts the points of class i that were predicted as class j.
    # Actual classes run along the rows, predicted classes along the columns.
    
    # A: each row divided by its row sum, so the diagonal holds per-class recall.
    A =(((C.T)/(C.sum(axis=1))).T)
        
    # B: each column divided by its column sum, so the diagonal holds per-class precision.
    B =(C/C.sum(axis=0))
        
    labels = [0,1]
    cmap=sns.light_palette("green")
    # plotting the raw confusion matrix C as a heatmap
    #print("-"*50, "Confusion matrix", "-"*50)
    plt.figure(figsize=(20,4))
    plt.subplot(1, 3, 1)
    sns.heatmap(C, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title("Confusion matrix")
    #plt.show()

    #print("-"*50, "Precision matrix", "-"*50)
    #plt.figure(figsize=(10,5))
    plt.subplot(1, 3, 2)
    sns.heatmap(B, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title("Precision matrix")
    #plt.show()
    #print("Sum of columns in precision matrix",B.sum(axis=0))
    
    # plotting the recall matrix A as a heatmap
    #print("-"*50, "Recall matrix"    , "-"*50)
    #plt.figure(figsize=(10,5))
    plt.subplot(1, 3, 3)
    sns.heatmap(A, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.title("Recall matrix")
    #plt.show()
    #print("Sum of rows in Recall matrix",A.sum(axis=1))
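
To make the two normalizations in `plot_confusion_matrix` concrete, here is a small worked example on a made-up 2x2 confusion matrix (the counts are hypothetical, not results from this notebook). Dividing each row by its sum gives the recall matrix, and dividing each column by its sum gives the precision matrix.

```python
import numpy as np

# Hypothetical confusion matrix: rows = actual class, columns = predicted class.
C = np.array([[40, 10],
              [20, 30]])

# Recall matrix: each row divided by its row sum (actual-class totals).
recall_matrix = C / C.sum(axis=1, keepdims=True)
# Precision matrix: each column divided by its column sum (predicted-class totals).
precision_matrix = C / C.sum(axis=0, keepdims=True)

print(recall_matrix)     # every row sums to 1
print(precision_matrix)  # every column sums to 1
```

The diagonal of `recall_matrix` holds per-class recall (0.8 and 0.6 here) and the diagonal of `precision_matrix` holds per-class precision.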

## Minimal Data PreProcessing

In [6]:
data_df = pd.read_csv("/content/st-data-mini.csv")
data_df

Unnamed: 0,text,label,text_len
0,7 ways to short bitcoins bear,0,30
1,yahoo shows bitty 30k,0,22
2,can anyone instruct me how to short shib?,0,42
3,bulls need to learn what bearish flags look l...,0,93
4,all the solid alt coins at breaching ath's. st...,0,100
...,...,...,...
99995,another bare flag daaam pick me up at 65k boy...,2,55
99996,the bears have hopes,2,21
99997,who has cool nft shiba images😁,2,32
99998,nice slow recovery i’ll take it,2,32


In [7]:
data_df["label"] = data_df["label"].apply(lambda x: 0 if x == 0 else 1)

In [8]:
data_df["sentiment"] = data_df["label"].apply(lambda x: "POSITIVE" if x == 1 else "NEGATIVE")
data_df

Unnamed: 0,text,label,text_len,sentiment
0,7 ways to short bitcoins bear,0,30,NEGATIVE
1,yahoo shows bitty 30k,0,22,NEGATIVE
2,can anyone instruct me how to short shib?,0,42,NEGATIVE
3,bulls need to learn what bearish flags look l...,0,93,NEGATIVE
4,all the solid alt coins at breaching ath's. st...,0,100,NEGATIVE
...,...,...,...,...
99995,another bare flag daaam pick me up at 65k boy...,1,55,POSITIVE
99996,the bears have hopes,1,21,POSITIVE
99997,who has cool nft shiba images😁,1,32
99998,nice slow recovery i’ll take it,1,32


In [9]:
data_df["processed_text"] = data_df["text"]
data_df["processed_text"] = data_df["processed_text"].astype(str).apply(lambda x: x.strip())

In [10]:
data_df["processed_text"][100]

'destined to 20k minimum'

In [11]:
data_df['sentiment'].value_counts()

NEGATIVE    50000
POSITIVE    50000
Name: sentiment, dtype: int64

## Importing HF Packages

In [12]:
import torch
import torch.nn.functional as F
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from torchsummary import summary

In [13]:
import datasets
from datasets import Dataset, DatasetDict

In [14]:
from torch.utils.data import DataLoader
import evaluate

## Defining Model Checkpoint Name From HF

In [15]:
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

## Preparing HF Dataset

### Loading

In [16]:
stocktwits_ds = Dataset.from_pandas(data_df)
stocktwits_ds

Dataset({
    features: ['text', 'label', 'text_len', 'sentiment', 'processed_text'],
    num_rows: 100000
})

In [17]:
stocktwits_ds = stocktwits_ds.rename_column("label", "labels")
stocktwits_ds

Dataset({
    features: ['text', 'labels', 'text_len', 'sentiment', 'processed_text'],
    num_rows: 100000
})

In [18]:
stocktwits_ds

Dataset({
    features: ['text', 'labels', 'text_len', 'sentiment', 'processed_text'],
    num_rows: 100000
})

In [19]:
stocktwits_ds = stocktwits_ds.train_test_split(test_size=0.15, shuffle=True, seed=42)
stocktwits_ds

DatasetDict({
    train: Dataset({
        features: ['text', 'labels', 'text_len', 'sentiment', 'processed_text'],
        num_rows: 85000
    })
    test: Dataset({
        features: ['text', 'labels', 'text_len', 'sentiment', 'processed_text'],
        num_rows: 15000
    })
})

### Tokenizing

In [20]:
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

In [21]:
def tokenize_function(examples):
    return tokenizer(examples["processed_text"], padding="max_length", truncation=True,  return_tensors="pt")
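
A note on `padding="max_length"`: with no explicit `max_length`, every post is padded to the model's maximum input length, which is 512 positions for this checkpoint (see the `position_embeddings` size in the model printout). StockTwits posts are short, so most of those positions are padding. The sketch below is a rough, pure-Python estimate of how many token slots get processed under fixed-length padding versus padding to the longest item in each batch. The token counts are made up and the helper name is hypothetical.

```python
def padded_slots(token_lengths, batch_size, fixed_length=None):
    """Count token slots when padding to a fixed length or to the longest item per batch."""
    total = 0
    for start in range(0, len(token_lengths), batch_size):
        batch = token_lengths[start:start + batch_size]
        # Fixed-length padding ignores the batch content; dynamic padding uses its longest item.
        width = fixed_length if fixed_length is not None else max(batch)
        total += width * len(batch)
    return total

# Hypothetical token counts for eight short posts.
lengths = [8, 12, 6, 30, 9, 15, 7, 22]
fixed = padded_slots(lengths, batch_size=4, fixed_length=512)
dynamic = padded_slots(lengths, batch_size=4)
print(fixed, dynamic)  # 4096 208
```

In the Hugging Face stack, per-batch padding is usually done with `DataCollatorWithPadding` instead of padding inside the tokenizer call. I have not tried that here, so any speed-up for this run is an untested guess.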

In [22]:
tokenized_stocktwits_ds = stocktwits_ds.map(tokenize_function, batched=True)

  0%|          | 0/85 [00:00<?, ?ba/s]

  0%|          | 0/15 [00:00<?, ?ba/s]

In [23]:
tokenized_stocktwits_ds

DatasetDict({
    train: Dataset({
        features: ['text', 'labels', 'text_len', 'sentiment', 'processed_text', 'input_ids', 'attention_mask'],
        num_rows: 85000
    })
    test: Dataset({
        features: ['text', 'labels', 'text_len', 'sentiment', 'processed_text', 'input_ids', 'attention_mask'],
        num_rows: 15000
    })
})

In [24]:
tokenized_mini_stocktwits_ds = tokenized_stocktwits_ds.remove_columns(["sentiment", "text", "processed_text", "text_len"])
tokenized_mini_stocktwits_ds

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 85000
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 15000
    })
})

In [25]:
tokenized_mini_stocktwits_ds.set_format("torch")
tokenized_mini_stocktwits_ds

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 85000
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'attention_mask'],
        num_rows: 15000
    })
})

## Training Hyperparams

In [27]:
training_args = TrainingArguments(
    output_dir="/content/trainer/reddit-crypto-sent-trainer", 
    evaluation_strategy="steps", 
    save_strategy = "steps",
    save_steps=400,
    eval_steps=400,
    save_total_limit=20, 
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    learning_rate=2e-05,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=1,
    load_best_model_at_end=True,
)
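
As a sanity check on these settings, the number of optimization steps and the evaluation/checkpoint points can be worked out by hand. This is plain arithmetic, assuming a single GPU and no gradient accumulation (both match the training log).

```python
import math

num_train_examples = 85_000  # 85% of the 100K samples
batch_size = 32              # per_device_train_batch_size on one GPU
eval_every = 400             # eval_steps and save_steps

# One epoch: the last, partially filled batch still counts as a step.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
# Steps at which evaluation runs and a checkpoint is saved.
eval_points = list(range(eval_every, steps_per_epoch + 1, eval_every))

print(steps_per_epoch)  # 2657
print(eval_points)      # [400, 800, 1200, 1600, 2000, 2400]
```

That gives six checkpoints, well under `save_total_limit=20`, so none of them should get rotated out.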

## Defining Evaluation Metrics

In [28]:
metric = evaluate.combine(["accuracy", "precision", "recall", "f1"])

In [29]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
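
For reference, this is roughly what the combined metric computes from the logits. It is a NumPy sketch, not the `evaluate` implementation, and it assumes binary averaging with label 1 (POSITIVE) as the positive class, which I believe is the default. The logits and labels are made up.

```python
import numpy as np

def binary_metrics_from_logits(logits, labels):
    """Accuracy, precision, recall and F1 with label 1 as the positive class."""
    preds = np.argmax(logits, axis=-1)
    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))
    accuracy = float(np.mean(preds == labels))
    # Guard against empty denominators (e.g. no positive predictions at all).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical logits for four examples; columns are [NEGATIVE, POSITIVE].
logits = np.array([[2.0, -1.0], [0.1, 0.9], [-0.5, 0.3], [1.2, 0.4]])
labels = np.array([0, 1, 0, 1])
result = binary_metrics_from_logits(logits, labels)
print(result)
```

Here the predictions are `[0, 1, 1, 0]`, so there is one true positive, one false positive and one false negative, and all four metrics come out at 0.5.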

## Loading HF Model onto available device (GPU/CPU)

In [30]:
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
device

device(type='cuda')

In [31]:
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

In [32]:
model.to(device)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

## Training Loop

In [33]:
trainer = Trainer(
    model,
    training_args,
    train_dataset=tokenized_mini_stocktwits_ds["train"],
    eval_dataset=tokenized_mini_stocktwits_ds["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [34]:
trainer.train()

***** Running training *****
  Num examples = 85000
  Num Epochs = 1
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 2657
  Number of trainable parameters = 66955010
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss,Validation Loss,Accuracy,Precision,Recall,F1
400,No log,0.547067,0.724467,0.736239,0.714735,0.725327
800,0.606000,0.513825,0.740467,0.759285,0.717616,0.737863
1200,0.517100,0.493518,0.756667,0.76627,0.751015,0.758566
1600,0.493400,0.479204,0.767667,0.764399,0.785724,0.774914
2000,0.487400,0.473258,0.7714,0.776638,0.773281,0.774956
2400,0.487400,0.468518,0.7726,0.784253,0.763196,0.773581


***** Running Evaluation *****
  Num examples = 15000
  Batch size = 32
Saving model checkpoint to /content/trainer/reddit-crypto-sent-trainer/checkpoint-400
Configuration saved in /content/trainer/reddit-crypto-sent-trainer/checkpoint-400/config.json
Model weights saved in /content/trainer/reddit-crypto-sent-trainer/checkpoint-400/pytorch_model.bin
tokenizer config file saved in /content/trainer/reddit-crypto-sent-trainer/checkpoint-400/tokenizer_config.json
Special tokens file saved in /content/trainer/reddit-crypto-sent-trainer/checkpoint-400/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 15000
  Batch size = 32
Saving model checkpoint to /content/trainer/reddit-crypto-sent-trainer/checkpoint-800
Configuration saved in /content/trainer/reddit-crypto-sent-trainer/checkpoint-800/config.json
Model weights saved in /content/trainer/reddit-crypto-sent-trainer/checkpoint-800/pytorch_model.bin
tokenizer config file saved in /content/trainer/reddit-crypto-sent-train

TrainOutput(global_step=2657, training_loss=0.5114543140944028, metrics={'train_runtime': 5475.121, 'train_samples_per_second': 15.525, 'train_steps_per_second': 0.485, 'total_flos': 1.125972888576e+16, 'train_loss': 0.5114543140944028, 'epoch': 1.0})
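
Which checkpoint counts as "best" depends on the metric. As far as I know, `load_best_model_at_end=True` without `metric_for_best_model` compares checkpoints on validation loss. The snippet below simply re-reads the evaluation rows from the training log and picks the best step by loss and by F1.

```python
# Rows copied from the training log: (step, validation loss, F1).
eval_log = [
    (400, 0.547067, 0.725327),
    (800, 0.513825, 0.737863),
    (1200, 0.493518, 0.758566),
    (1600, 0.479204, 0.774914),
    (2000, 0.473258, 0.774956),
    (2400, 0.468518, 0.773581),
]

# Lowest validation loss versus highest F1.
best_by_loss = min(eval_log, key=lambda row: row[1])[0]
best_by_f1 = max(eval_log, key=lambda row: row[2])[0]
print(best_by_loss, best_by_f1)  # 2400 2000
```

The two criteria disagree slightly: step 2400 has the lowest validation loss, while step 2000 has the highest F1 by a very small margin over step 1600.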

## Moving Saved Checkpoints to Local Google Drive

In [50]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [53]:
!zip -r /content/trainer/reddit-crypto-sent-trainer/checkpoint-2000.zip /content/trainer/reddit-crypto-sent-trainer/checkpoint-2000

  adding: content/trainer/reddit-crypto-sent-trainer/checkpoint-2000/ (stored 0%)
  adding: content/trainer/reddit-crypto-sent-trainer/checkpoint-2000/vocab.txt (deflated 53%)
  adding: content/trainer/reddit-crypto-sent-trainer/checkpoint-2000/special_tokens_map.json (deflated 42%)
  adding: content/trainer/reddit-crypto-sent-trainer/checkpoint-2000/rng_state.pth (deflated 28%)
  adding: content/trainer/reddit-crypto-sent-trainer/checkpoint-2000/pytorch_model.bin (deflated 8%)
  adding: content/trainer/reddit-crypto-sent-trainer/checkpoint-2000/scheduler.pt (deflated 49%)
  adding: content/trainer/reddit-crypto-sent-trainer/checkpoint-2000/tokenizer_config.json (deflated 45%)
  adding: content/trainer/reddit-crypto-sent-trainer/checkpoint-2000/tokenizer.json (deflated 71%)
  adding: content/trainer/reddit-crypto-sent-trainer/checkpoint-2000/optimizer.pt (deflated 26%)
  adding: content/trainer/reddit-crypto-sent-trainer/checkpoint-2000/config.json (deflated 47%)
  adding: content/trai

In [54]:
!ls /content/drive/MyDrive/ml_models

checkpoint-1600.zip  checkpoint-2400.zip


In [55]:
!cp -p -r /content/trainer/reddit-crypto-sent-trainer/checkpoint-2000.zip /content/drive/MyDrive/ml_models/
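
The same zip-and-copy step can also be done from Python instead of shell commands. This is a minimal sketch using only the standard library. The helper name is hypothetical, and the demo runs on throwaway directories rather than the real Colab paths.

```python
import shutil
import tempfile
from pathlib import Path

def archive_checkpoint(checkpoint_dir, dest_dir):
    """Zip a checkpoint directory and copy the archive into dest_dir."""
    checkpoint_dir = Path(checkpoint_dir)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # make_archive appends ".zip" to the base name, so the archive lands next to the folder.
    archive = shutil.make_archive(
        str(checkpoint_dir), "zip",
        root_dir=checkpoint_dir.parent, base_dir=checkpoint_dir.name,
    )
    # copy2 keeps file metadata, similar to `cp -p`.
    return Path(shutil.copy2(archive, dest_dir))

# Demo on temporary directories standing in for the trainer and Drive folders.
with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "trainer" / "checkpoint-2000"
    ckpt.mkdir(parents=True)
    (ckpt / "config.json").write_text("{}")
    copied = archive_checkpoint(ckpt, Path(tmp) / "drive")
    copied_name = copied.name
    copied_exists = copied.exists()
print(copied_name, copied_exists)
```

In Colab the two arguments would be the checkpoint folder under `/content/trainer/reddit-crypto-sent-trainer/` and the mounted `/content/drive/MyDrive/ml_models/` folder.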