# Initial Reproduction and Three Extensions of the Cyberbullying Detection Using Transformers Paper
### Initial objective: reproduce the paper
  * Attempt to reproduce the original numbers using the same hyperparameters
  * This will serve as the baseline

### Experiment #1: Recreate as an ensemble of binary modules, one per label
  * Have the last layer produce a binary (2-class) output
  * Apply a softmax layer to obtain probabilities
  * Create an ensemble from the per-label outputs
  * Compare outputs with the baseline
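
The combination step for Experiment #1 could be sketched as follows. This is only an illustration under stated assumptions: the logits are made-up numbers, `ensemble_predict` is a hypothetical helper, and choosing the label with the highest positive-class probability is one possible combination rule, not necessarily the one used in this project. The label names are taken from the training output in this notebook.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ensemble_predict(binary_logits):
    """Combine per-label binary heads (hypothetical helper).

    binary_logits maps label -> [negative_logit, positive_logit].
    Assumption: the ensemble picks the label whose binary module
    assigns the highest probability to the positive class.
    """
    positive_prob = {
        label: softmax(pair)[1] for label, pair in binary_logits.items()
    }
    best = max(positive_prob, key=positive_prob.get)
    return best, positive_prob

# Made-up logits for a single input, one pair per binary module.
example_logits = {
    "Age": [2.0, -1.0],
    "Ethnicity": [1.5, 0.0],
    "Gender": [-0.5, 2.5],
    "Religion": [1.0, 0.5],
    "Others": [0.2, 0.1],
    "Notcb": [0.0, 0.0],
}
predicted_label, probs = ensemble_predict(example_logits)
print(predicted_label, round(probs[predicted_label], 3))
```

Other combination rules (for example, thresholding each module independently) would turn this into a multi-label setup, so the choice should match how the baseline is scored.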

### Experiment #2: Vertical data augmentation using synthetic data generation
  * Leverage GPT-3 for custom data generation per label
  * Use the generated examples as additional training data
  * Compare outputs
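
A minimal sketch of the "additional training data" step, assuming the synthetic examples have already been generated and labeled. The function name `augment_training_set`, the row layout, and the placeholder texts are all hypothetical. It also assumes that only the training split is augmented, so the validation and test splits stay comparable with the baseline.

```python
import random

def augment_training_set(train_rows, synthetic_rows, seed=42):
    """Append synthetic (text, label) rows to the training rows and shuffle.

    Each returned row carries a source tag so synthetic examples can be
    filtered out or counted later.
    """
    tagged_real = [(text, label, "original") for text, label in train_rows]
    tagged_synth = [(text, label, "synthetic") for text, label in synthetic_rows]
    combined = tagged_real + tagged_synth
    # A seeded shuffle keeps the mixed ordering reproducible across runs.
    random.Random(seed).shuffle(combined)
    return combined

# Placeholder rows only; no real or generated data is shown here.
train_rows = [("placeholder tweet a", "Age"), ("placeholder tweet b", "Notcb")]
synthetic_rows = [("placeholder generated tweet", "Age")]
augmented = augment_training_set(train_rows, synthetic_rows)
print(len(augmented))
```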

### Experiment #3: Horizontal data augmentation using additional labels and context content
  * Add additional data labels, each served by an additional binary module in the ensemble
  * Add personal context information to serve as a "normal" baseline for that individual

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
%cd /content/drive/MyDrive/Github/

/content/drive/MyDrive/Github


In [3]:
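from getpass import getpass

username = 'bgoldfe2'
repository = 'Cyberbullying-Detection-with-Transformers'
# Prompt for the personal access token at run time instead of hardcoding it in the notebook.
git_token = getpass('GitHub token: ')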



In [4]:
!git clone https://{git_token}@github.com/{username}/{repository}

fatal: destination path 'Cyberbullying-Detection-with-Transformers' already exists and is not an empty directory.


In [None]:
%cd {repository}

/content/drive/MyDrive/Github/Cyberbullying-Detection-with-Transformers


In [6]:
!git status

On branch main
Your branch is up to date with 'origin/main'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

	modified:   CyberTransformer.ipynb

no changes added to commit (use "git add" and/or "git commit -a")


In [14]:
!git config --global user.email "bgoldfe2@gmu.edu"
!git config --global user.name "Bruce Goldfeder"
!git add .
!git commit -m "starting back up nov 30"

On branch main
Your branch is ahead of 'origin/main' by 1 commit.
  (use "git push" to publish your local commits)

Changes not staged for commit:
	modified:   ../CyberTransformer.ipynb
	modified:   ../Scripts/model.py
	modified:   ../Scripts/train.py

no changes added to commit


In [16]:
!pwd


/content/drive/MyDrive/Github/Cyberbullying-Detection-with-Transformers


In [10]:
!pip install -r requirements.txt

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.24.0-py3-none-any.whl (5.5 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.10.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers, sentencepiece
Successfully installed huggingface-hub-0.11.1 sentencepiece-0.1.97 tokenizers-0.13.2 transformers-4.24.0


In [None]:
%cd Scripts/

/content/drive/MyDrive/Github/Cyberbullying-Detection-with-Transformers/Scripts


In [12]:
%cd Models
!ls -alh

/content/drive/MyDrive/Github/Cyberbullying-Detection-with-Transformers/Models
total 418M
-rw------- 1 root root 418M Oct 27 18:22 bert-base-uncased_Best_Val_Acc.bin
-rw------- 1 root root    0 Oct 27 16:14 .gitkeep


In [None]:
# Test run for regression testing
!python3 train.py

2022-10-27 18:18:08.476793: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
47705
{'Gender', 'Notcb', 'Others', 'Religion', 'Ethnicity', 'Age'}
train len - 28623, valid len - 9541, test len - 9541
Downloading: 100% 232k/232k [00:00<00:00, 3.03MB/s]
Downloading: 100% 28.0/28.0 [00:00<00:00, 26.7kB/s]
Downloading: 100% 570/570 [00:00<00:00, 572kB/s]
Downloading: 100% 440M/440M [00:05<00:00, 74.5MB/s]
Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertModel: ['cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.d

In [None]:
!cat model.py

import torch
import torch.nn as nn
import numpy as np
from transformers import BertModel, RobertaModel, XLNetModel, DistilBertModel

from common import get_parser

parser = get_parser()
args = parser.parse_args()
np.random.seed(args.seed)
torch.manual_seed(args.seed)
torch.cuda.manual_seed(args.seed)

class BertFGBC(nn.Module):
    def __init__(self, pretrained_model = args.pretrained_model):
        super().__init__()
        self.Bert = BertModel.from_pretrained(pretrained_model)
        self.drop1 = nn.Dropout(args.dropout)
        self.linear = nn.Linear(args.bert_hidden, 64)
        self.batch_norm = nn.LayerNorm(64)
        self.drop2 = nn.Dropout(args.dropout)
        self.out = nn.Linear(64, args.classes)

    def forward(self, input_ids, attention_mask, token_type_ids):
        _,last_hidden_state = self.Bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            return_dict=False
        )
        #p