# **Insert Title Here**
**DATA103 S11 Group 4**
- GOZON, Jean Pauline D.
- JAMIAS, Gillian Nicole A.
- MARCELO, Andrea Jean C.
- REYES, Anton Gabriel G.
- VICENTE, Francheska Josefa

## Requirements and Imports

### Imports

**Basic Libraries**

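* `numpy` provides array data structures and a large collection of mathematical functions
* `pandas` provides data structures and functions for data manipulation and analysis
* `datasets` provides the `Dataset` and `DatasetDict` containers used to hold the train, validation, and test splits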



In [1]:
import numpy as np
import pandas as pd
import datasets

**Machine Learning Libraries**

* `torch` is an open-source ML library for building deep neural networks
* `transformers` provides pre-trained models, their tokenizers, and the `Trainer` API used for fine-tuning

In [2]:
from sklearn.model_selection import train_test_split

In [3]:
import torch
from torch.utils.data import Dataset, DataLoader, RandomSampler, SequentialSampler
from pytorch_lightning.callbacks import ProgressBarBase, RichProgressBar

In [4]:
from transformers import AutoTokenizer, BertTokenizerFast, AutoModelForSequenceClassification, TrainerCallback, TrainingArguments, Trainer

In [5]:
from sklearn.metrics import f1_score, roc_auc_score, hamming_loss, accuracy_score
from transformers import EvalPrediction
import evaluate

In [6]:
import optuna

In [7]:
import pickle

In [8]:
df = pd.read_csv('cleaned_data.csv')
df

Unnamed: 0,class,text
0,0,"['Its not a viable option, and youll be leavin..."
1,1,['It can be hard to appreciate the notion that...
2,1,"['Hi, so last night i was sitting on the ledge..."
3,1,['I tried to kill my self once and failed badl...
4,1,['Hi NEM3030. What sorts of things do you enjo...
...,...,...
242155,0,If you don't like rock then your not going to ...
242156,0,You how you can tell i have so many friends an...
242157,0,pee probably tastes like salty tea😏💦‼️ can som...
242158,1,The usual stuff you find hereI'm not posting t...


In [9]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device

device(type='cuda')

## Preparing Data for Feature Engineering

### Splitting the Dataset into Train, Validation, and Test Sets

In [10]:
X = df['text']
X

0         ['Its not a viable option, and youll be leavin...
1         ['It can be hard to appreciate the notion that...
2         ['Hi, so last night i was sitting on the ledge...
3         ['I tried to kill my self once and failed badl...
4         ['Hi NEM3030. What sorts of things do you enjo...
                                ...                        
242155    If you don't like rock then your not going to ...
242156    You how you can tell i have so many friends an...
242157    pee probably tastes like salty tea😏💦‼️ can som...
242158    The usual stuff you find hereI'm not posting t...
242159    I still haven't beaten the first boss in Hollo...
Name: text, Length: 242160, dtype: object

In [11]:
y = df['class']
y

0         0
1         1
2         1
3         1
4         1
         ..
242155    0
242156    0
242157    0
242158    1
242159    0
Name: class, Length: 242160, dtype: int64

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2,
                                                    stratify = y,
                                                    random_state = 42, 
                                                    shuffle = True)

In [13]:
X_train, X_val, y_train, y_val = train_test_split(X_train, 
                                                  y_train, 
                                                  test_size = 0.1,
                                                  stratify = y_train,
                                                  random_state = 42, 
                                                  shuffle = True)

In [14]:
print('Train input  shape: ', X_train.shape)
print('Train output shape: ', y_train.shape)

Train input  shape:  (174355,)
Train output shape:  (174355,)


In [15]:
print('Val input  shape: ', X_val.shape)
print('Val output shape: ', y_val.shape)

Val input  shape:  (19373,)
Val output shape:  (19373,)


In [16]:
print('Test input  shape: ', X_test.shape)
print('Test output shape: ', y_test.shape)

Test input  shape:  (48432,)
Test output shape:  (48432,)
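
The three shapes above follow from the two-stage split: 20% of all rows are held out for testing, then 10% of the remainder is held out for validation, which gives roughly a 72/8/20 split. A minimal sketch of that arithmetic, assuming the held-out count is rounded up and the remaining rows go to training (which matches the printed shapes). `split_sizes` is a hypothetical helper, not part of the notebook, and it takes the held-out share as an integer percentage to avoid floating-point rounding.

```python
def split_sizes(n_rows, held_out_percent):
    """Return (n_kept, n_held_out) for an integer held-out percentage.

    Assumption: the held-out count is rounded up and the rest is kept.
    """
    n_held_out = (n_rows * held_out_percent + 99) // 100  # integer ceiling
    return n_rows - n_held_out, n_held_out

n_total = 242160
n_train_val, n_test = split_sizes(n_total, 20)   # first split: train+val vs. test
n_train, n_val = split_sizes(n_train_val, 10)    # second split: train vs. val

print(n_train, n_val, n_test)
```

The three numbers sum back to the 242160 rows in `df`, so no row is lost or duplicated between the splits.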


In [17]:
train_df = pd.concat([X_train, y_train], axis = 1).reset_index(drop = True)
train_df

Unnamed: 0,text,class
0,How do you explain to your family that you wer...,0
1,I DONT UNDERSTAND THE US DEBT WHO DO THEY OWE ...,0
2,FireIt’s been a bit but I still think of her a...,1
3,AITA for telling my wife (34F) that reddit agr...,0
4,Join among us SGGFIF Jesjeuejjejejeeieieijdjdj...,0
...,...,...
174350,"Fellow teenagers, I have been influenced by th...",0
174351,I felt like talkingSo I was just outside at 01...,1
174352,i am trying to but i just cant i have everythi...,1
174353,I just want my suffering to endAll I have hear...,1


In [18]:
val_df = pd.concat([X_val, y_val], axis = 1).reset_index(drop = True)
val_df

Unnamed: 0,text,class
0,Really down........just need some words of enc...,1
1,I’m not gonna buy a carThe day gets closer. I’...,1
2,Help me kill myself. Please. Please. Please.I’...,1
3,The only thing keeping me alive is the fact th...,1
4,"I'm not.I'm not the sweet, determined girl eve...",1
...,...,...
19368,when she says Hi! This post seems to be relate...,0
19369,I gotta go to school tmmr for orientation at 9...,0
19370,Hey lads! Can I get some help from y'all? So.....,0
19371,My birthday is this coming month and it will b...,1


In [19]:
test_df = pd.concat([X_test, y_test], axis = 1).reset_index(drop = True)
test_df

Unnamed: 0,text,class
0,I just felt myself snapI have to pretend to be...,1
1,Are you envious of something about the opposit...,0
2,"We get it. Men have problems, too. We never sa...",0
3,Happy Birthday to everyone having Birthday on ...,0
4,i cant deal with life any longer but ive tried...,1
...,...,...
48427,I just need to go for everyone's sakeI can't e...,1
48428,Hope is now goneI'm 17m and I'm considering ta...,1
48429,18f needs someone to talk toI understand if th...,1
48430,"Help mePlease someone help me, just pm me.\nI'...",1


### Creating the Datasets

In [20]:
train_dataset = datasets.Dataset.from_pandas(train_df)
train_dataset

Dataset({
    features: ['text', 'class'],
    num_rows: 174355
})

In [21]:
val_dataset = datasets.Dataset.from_pandas(val_df)
val_dataset

Dataset({
    features: ['text', 'class'],
    num_rows: 19373
})

In [22]:
test_dataset = datasets.Dataset.from_pandas(test_df)
test_dataset

Dataset({
    features: ['text', 'class'],
    num_rows: 48432
})

In [23]:
dataset = datasets.DatasetDict({
    "train" : train_dataset, 
    "val" : val_dataset, 
    "test" : test_dataset
})

dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'class'],
        num_rows: 174355
    })
    val: Dataset({
        features: ['text', 'class'],
        num_rows: 19373
    })
    test: Dataset({
        features: ['text', 'class'],
        num_rows: 48432
    })
})

## Feature Engineering

### Defining Functions

In [24]:
MAX_LENGTH = 512

In [25]:
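def preprocess_function(examples, tokenizer):
    # pad or truncate every text to exactly MAX_LENGTH tokens
    encoding = tokenizer(examples["text"], padding = "max_length", truncation = True, max_length = MAX_LENGTH)
    # Trainer expects the target column to be named "labels"
    encoding["labels"] = torch.tensor(examples['class'])
    return encoding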
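
What `padding = "max_length"` and `truncation = True` do to each text can be sketched without a real tokenizer. The toy helper below is a simplified stand-in, not the `transformers` implementation. `pad_or_truncate`, the pad id of 0, and the example token ids are assumptions for illustration. Unlike a real BERT tokenizer, it does not keep the closing special token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy stand-in for padding="max_length" combined with truncation=True."""
    ids = token_ids[:max_length]                    # truncation: drop tokens past max_length
    n_pad = max_length - len(ids)                   # padding needed to reach max_length
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,  # 1 = real token, 0 = padding
    }

short = pad_or_truncate([101, 7592, 102], max_length=5)
long = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)

print(short)
print(long)
```

Every row ends up with the same length, which is what lets the encoded dataset be stacked into fixed-size tensors. In the notebook that length is `MAX_LENGTH = 512`.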

In [26]:
def create_encoded_dataset(tokenizer):
    # tokenize all three splits in batches and drop the raw text/class columns
    encoded_dataset = dataset.map(preprocess_function, 
                                  batched=True, 
                                  remove_columns=dataset['train'].column_names, 
                                  fn_kwargs = {"tokenizer": tokenizer})

    # return PyTorch tensors when rows are accessed
    encoded_dataset.set_format("torch")

    return encoded_dataset

### Tokenizing with BERT

In [27]:
bert_tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')

In [28]:
bert_encoded_dataset = create_encoded_dataset (bert_tokenizer)

  0%|          | 0/175 [00:00<?, ?ba/s]

  0%|          | 0/20 [00:00<?, ?ba/s]

  0%|          | 0/49 [00:00<?, ?ba/s]

### Tokenizing with RoBERTa

In [None]:
model_checkpoint_roberta = 'roberta-base'

In [None]:
roberta_tokenizer = AutoTokenizer.from_pretrained(model_checkpoint_roberta)

In [None]:
roberta_encoded_dataset = create_encoded_dataset (roberta_tokenizer)

## Modeling and Evaluation

### Defining Functions

In [29]:
# adapted from: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
def metrics(predictions, labels):
    # predictions are logits of shape (batch_size, num_labels); each example has
    # exactly one class, so take the highest-scoring label instead of thresholding
    # per-label sigmoid probabilities (thresholding is meant for multi-label data
    # and yields a 2-D y_pred that does not match the 1-D labels)
    y_pred = np.argmax(predictions, axis=-1)
    # finally, compute metrics
    y_true = labels
    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
    f1_macro_average = f1_score(y_true=y_true, y_pred=y_pred, average='macro')
    # hamming_loss_score = hamming_loss(y_true = y_true, y_pred = y_pred)
    accuracy = accuracy_score(y_true, y_pred)

    # return as dictionary
    return {
        'f1_micro_average': f1_micro_average,
        # 'hamming_loss_score' : hamming_loss_score,
        'f1_macro_average' : f1_macro_average,
        'accuracy': accuracy
    }
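
The function above reports both micro- and macro-averaged F1. A hand-computed sketch of how the two differ on imbalanced toy data, assuming single-label predictions taken as the argmax over two logits per example. `f1_scores`, the toy logits, and the toy labels are hypothetical and only serve the illustration.

```python
def f1_scores(y_true, y_pred, labels=(0, 1)):
    """Return (micro_f1, macro_f1) for single-label predictions."""
    per_class = []
    tp_sum = fp_sum = fn_sum = 0
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class.append(2 * tp / denom if denom else 0.0)  # F1 of class c
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    micro = 2 * tp_sum / (2 * tp_sum + fp_sum + fn_sum)      # pooled counts
    macro = sum(per_class) / len(per_class)                  # unweighted class mean
    return micro, macro

# toy logits of shape (batch_size, 2) and their true classes
logits = [[2.0, -1.0], [0.3, 0.9], [1.5, 0.2], [-0.5, 0.1], [0.8, 0.4], [2.2, -2.0]]
y_true = [0, 1, 0, 0, 0, 0]
y_pred = [row.index(max(row)) for row in logits]  # argmax per example

micro_f1, macro_f1 = f1_scores(y_true, y_pred)
print(y_pred, micro_f1, macro_f1)
```

With one label per example, micro F1 pools all decisions and therefore equals accuracy (5 of 6 correct here), while macro F1 weights both classes equally and is pulled down by the weaker minority-class score.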

In [30]:
def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, 
            tuple) else p.predictions
    result = metrics(
        predictions=preds, 
        labels=p.label_ids)
    return result

### Defining the Hyperparameter Space

In [31]:
def optuna_hp_space(trial):
    return {
        "learning_rate": trial.suggest_categorical("learning_rate", [0.1, 0.01, 0.001]),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16]),
        "num_train_epochs": trial.suggest_categorical("num_train_epochs", [2, 3, 4])
    }
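
Because every hyperparameter above is categorical, the search space is a finite grid. A small sketch that enumerates it, using the same values as `optuna_hp_space`. The `search_space` dictionary and `combinations` list are illustrative names, not part of the notebook.

```python
from itertools import product

# same candidate values as optuna_hp_space, written as a plain dictionary
search_space = {
    "learning_rate": [0.1, 0.01, 0.001],
    "per_device_train_batch_size": [8, 16],
    "num_train_epochs": [2, 3, 4],
}

# every possible configuration: 3 * 2 * 3 of them
combinations = [
    dict(zip(search_space, values))
    for values in product(*search_space.values())
]

print(len(combinations))
print(combinations[0])
```

The grid holds 18 configurations, so a search with `n_trials = 3` samples only a small part of it.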

### BERT Model

#### Model Training 

In [32]:
model_checkpoint = 'bert-base-uncased'

In [33]:
bert_model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint,
    return_dict = False
).to(device)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [34]:
training_args = TrainingArguments(output_dir = "bert_trainer", 
                                  save_steps = 20000,
                                  save_strategy = 'steps',
                                  fp16 = True,
                                  evaluation_strategy = "epoch", 
                                  resume_from_checkpoint = True)

In [35]:
trainer = Trainer(
    model = bert_model,
    args = training_args,
    train_dataset = bert_encoded_dataset ['train'],
    eval_dataset = bert_encoded_dataset ['val'],
    compute_metrics = compute_metrics,
    callbacks = [TrainerCallback()]
)

Using cuda_amp half precision backend


In [36]:
trainer.train()

***** Running training *****
  Num examples = 174355
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 65385
  Number of trainable parameters = 109483778
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
[34m[1mwandb[0m: Currently logged in as: [33mfrancheska_vicente[0m ([33mtonely[0m). Use [1m`wandb login --relogin`[0m to force relogin


Epoch,Training Loss,Validation Loss


Saving model checkpoint to bert_trainer\checkpoint-20000
Configuration saved in bert_trainer\checkpoint-20000\config.json
Model weights saved in bert_trainer\checkpoint-20000\pytorch_model.bin
***** Running Evaluation *****
  Num examples = 19373
  Batch size = 8


NameError: name 'multi_label_metrics' is not defined
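
The `Total optimization steps` value in the training log above can be reproduced from the other logged numbers. A minimal sketch, assuming a single device, no gradient accumulation (both as logged), and that the final partial batch of each epoch is kept rather than dropped:

```python
import math

num_examples = 174355   # "Num examples" in the log
batch_size = 8          # "Total train batch size" in the log
num_epochs = 3          # "Num Epochs" in the log

# the last, smaller batch of each epoch still counts as one optimization step
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```

One epoch takes 21795 steps, so with `save_steps = 20000` the first checkpoint is written shortly before the first end-of-epoch evaluation. This is the order seen in the log.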

#### Saving BERT base model

In [None]:
path_for_models ='./saved_models/BERTv1'
trainer.save_model(path_for_models)

#### Hyperparameter Tuning

In [None]:
def model_init():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint)

In [None]:
trainer_tuning = Trainer(
    model_init = model_init,
    args = training_args,
    train_dataset = bert_encoded_dataset ['train'],
    eval_dataset = bert_encoded_dataset ['val'],
    tokenizer = bert_tokenizer,
    compute_metrics = compute_metrics,
    callbacks = [TrainerCallback()]
)

In [None]:
best_trial = trainer_tuning.hyperparameter_search(
    direction = "maximize",
    backend = "optuna",
    hp_space = optuna_hp_space,
    n_trials = 3,
    compute_objective=compute_objective
)
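
The call above passes `compute_objective`, which is not defined in any of the cells shown. A minimal sketch of what such a function could look like, assuming the goal is to maximize validation macro-F1. That choice of metric is an assumption, not something the notebook states. The function receives the dictionary of evaluation metrics, whose keys are those returned by `compute_metrics` with an `eval_` prefix added by `Trainer`.

```python
def compute_objective(eval_metrics):
    """Return the single number the search should maximize.

    Assumption: validation macro-F1 is the target metric.
    """
    # keys from compute_metrics arrive with an "eval_" prefix
    return eval_metrics["eval_f1_macro_average"]

# made-up values for illustration only, not results from this notebook
example_metrics = {
    "eval_loss": 0.31,
    "eval_f1_micro_average": 0.95,
    "eval_f1_macro_average": 0.94,
    "eval_accuracy": 0.95,
}

print(compute_objective(example_metrics))
```

Returning a single metric keeps the `direction = "maximize"` setting meaningful. An objective that mixed the loss with the F1 scores would combine quantities that should move in opposite directions.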

In [None]:
best_trial

##### Saving BERT tuned model

In [None]:
path_for_models ='./saved_models/BERTv1_tuned'
trainer.save_model(path_for_models)

#### Evaluation

#### Feature Importance

### RoBERTa Model

#### Model Training 

In [None]:
model_checkpoint_roberta = 'roberta-base'

In [None]:
roberta_model = AutoModelForSequenceClassification.from_pretrained(
    model_checkpoint_roberta,
    return_dict = False
).to(device)

In [None]:
training_args = TrainingArguments(output_dir = "roberta_trainer", 
                                  save_steps = 20000,
                                  save_strategy = 'steps',
                                  fp16 = True,
                                  evaluation_strategy = "epoch", 
                                  resume_from_checkpoint = True)

In [None]:
# train on the RoBERTa-tokenized data; BERT token ids do not match RoBERTa's vocabulary
trainer = Trainer(
    model = roberta_model,
    args = training_args,
    train_dataset = roberta_encoded_dataset['train'],
    eval_dataset = roberta_encoded_dataset['val'],
    compute_metrics = compute_metrics,
    callbacks = [TrainerCallback()]
)

In [None]:
trainer.train()

#### Saving RoBERTa base model

In [None]:
path_for_models ='./saved_models/RoBERTav1'
trainer.save_model(path_for_models)

#### Hyperparameter Tuning

In [None]:
def model_init_roberta ():
    return AutoModelForSequenceClassification.from_pretrained(model_checkpoint_roberta)

In [None]:
trainer_tuning = Trainer(
    model_init = model_init_roberta,
    args = training_args,
    train_dataset = roberta_encoded_dataset ['train'],
    eval_dataset = roberta_encoded_dataset ['val'],
    tokenizer = roberta_tokenizer,
    compute_metrics = compute_metrics,
    callbacks = [TrainerCallback()]
)

In [None]:
best_trial_roberta = trainer_tuning.hyperparameter_search(
    direction = "maximize",
    backend = "optuna",
    hp_space = optuna_hp_space,
    n_trials = 3,
    compute_objective = compute_objective
)

In [None]:
best_trial_roberta

##### Saving RoBERTa tuned model

In [None]:
path_for_models ='./saved_models/RoBERTav1_tuned'
trainer.save_model(path_for_models)

#### Evaluation

#### Feature Importance