### Installation

In [None]:
%pip install torch
%pip install numpy
%pip install transformers
%pip install pandas
%pip install ipywidgets
%pip install tqdm

### Imports

In [None]:
import requests
import transformers
from transformers import RobertaTokenizer, RobertaModel
from ipywidgets import IntProgress
import torch
from torch.utils.data import Dataset, DataLoader
import pandas as pd
from tqdm import tqdm

### Download data

We'll be using the _Offensive Language Identification Dataset (OLID)_ to specialize RoBERTa, a BERT variant, in hate speech detection. You can find the dataset [here](https://scholar.harvard.edu/malmasi/olid). The following code will download it into the current directory.

In [None]:
!git clone https://github.com/idontflow/OLID.git

### Examine our data

OLID (2019) consists of several files. Let's break them down now. olid-training-v1.0.tsv has ~13,000 annotated tweets and labels for 3 subtasks. Then we have a test set for each subtask: testset-levela.tsv, testset-levelb.tsv, and testset-levelc.tsv.

subtask_a labels whether the tweet is offensive. Here are the categories:

- (NOT) Not Offensive - This post does not contain offense or profanity.
- (OFF) Offensive - This post contains offensive language or a targeted (veiled or direct) offense.

If a tweet was labeled as offensive, then it can have a value for subtask_b. Here are the categories for subtask_b:

- (TIN) Targeted Insult and Threats - A post containing an insult or threat to an individual, a group, or others (see categories in subtask_c).
- (UNT) Untargeted - A post containing non-targeted profanity and swearing.

If a tweet was marked as offensive and targeted, then it can have the following categories for subtask_c:

- (IND) Individual - The target of the offensive post is an individual: a famous person, a named individual or an unnamed person interacting in the conversation.
- (GRP) Group - The target of the offensive post is a group of people considered as a unity due to the same ethnicity, gender or sexual orientation, political affiliation, religious belief, or something else.
- (OTH) Other - The target of the offensive post does not belong to any of the previous two categories (e.g., an organization, a situation, an event, or an issue).

Of course, if a tweet isn't offensive, it won't have values for subtask_b or subtask_c. Similarly, if the tweet is untargeted, it won't have a value for subtask_c.
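
The label hierarchy described above can be summarized as a small validity check. This is only an illustrative sketch: `is_valid_labels` is a hypothetical helper, not part of OLID, and it assumes that a missing label is represented as `None` and that every offensive tweet carries a subtask_b label.

```python
# Hypothetical helper illustrating the OLID label hierarchy.
# Assumptions: a missing label is None, and every offensive tweet has a subtask_b label.

SUBTASK_A = {"NOT", "OFF"}
SUBTASK_B = {"TIN", "UNT"}
SUBTASK_C = {"IND", "GRP", "OTH"}


def is_valid_labels(a, b=None, c=None):
    """Return True if the (subtask_a, subtask_b, subtask_c) triple respects the hierarchy."""
    if a not in SUBTASK_A:
        return False
    if a == "NOT":
        # Non-offensive tweets carry no subtask_b or subtask_c label
        return b is None and c is None
    if b not in SUBTASK_B:
        # Offensive tweets need a subtask_b label (assumption stated above)
        return False
    if b == "UNT":
        # Untargeted tweets carry no subtask_c label
        return c is None
    # Targeted tweets need one of the subtask_c labels
    return c in SUBTASK_C


print(is_valid_labels("NOT"))                # a non-offensive tweet
print(is_valid_labels("OFF", "UNT"))         # offensive but untargeted
print(is_valid_labels("OFF", "TIN", "GRP"))  # offensive, targeted at a group
print(is_valid_labels("NOT", "TIN", "IND"))  # inconsistent: not offensive but targeted
```

A check like this could be run over each row of the training file to confirm the labels are consistent before training on them.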

Now, let's take a look at the dataset. 

In [None]:
data = pd.read_csv('./OLID/olid-training-v1.0.tsv', sep='\t')
data.head()

We want to make sure we can run this code on CUDA, if available.

In [None]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
device

Let's start by making a classifier for generally offensive tweets (subtask_a). We'll make a PyTorch model that extends RoBERTa.

### subtask_a

We'll be using RoBERTa's tokenizer to encode our text. Behind the scenes, it's doing byte-level byte pair encoding (BPE). Let's also define some constants here.

In [None]:
# truncation is an encode-time option, so it is not passed when loading the tokenizer
tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
batch_size = 32
train_size = 0.8 
max_len = 140
epochs = 100
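
To give a feel for what byte-level BPE means, here is a toy sketch in plain Python. It is not RoBERTa's tokenizer, which relies on a merge table learned from a large corpus plus extra pre-processing. It only illustrates the core idea: start from raw bytes and repeatedly merge the most frequent adjacent pair. The helper names `most_frequent_pair` and `merge_pair` are made up for this example.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent pair of tokens, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Start from raw UTF-8 bytes, one token per byte
tokens = [bytes([b]) for b in "low lower lowest".encode("utf-8")]
print(len(tokens), "tokens before merging")

# Apply two merge steps; a real tokenizer applies many thousands of learned merges
for _ in range(2):
    tokens = merge_pair(tokens, most_frequent_pair(tokens))

print(len(tokens), "tokens after merging:", tokens)
```

Because the starting vocabulary is the set of all bytes, any string can be encoded without an unknown-token fallback, which is handy for noisy text like tweets.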

Let's clean up our dataframe by converting "OFF" to 1 and "NOT" to 0. 

In [None]:
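data.loc[data["subtask_a"] == "OFF", "subtask_a"] = 1
data.loc[data["subtask_a"] == "NOT", "subtask_a"] = 0
# The column still has object dtype after the assignments above, so cast it to integers
data["subtask_a"] = data["subtask_a"].astype(int)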
data.head()

Let's define a Data class for this task. We'll extend PyTorch's Dataset class here so we can initialize DataLoaders with it down the road.

In [None]:
class DataA(Dataset):
    def __init__(self, dataframe, tokenizer, max_len):
        # Our data
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_len = max_len
        # X and Y
        self.text = self.data.tweet
        self.targets = self.data.subtask_a
    
    # Some class methods we have to override:
    def __len__(self):
        return len(self.text)
    
    def __getitem__(self, index):
        # Extract the text at index and collapse runs of whitespace into single spaces
        text = str(self.text[index])
        text = " ".join(text.split())
        
        # Create a dictionary that contains the encoded sequence
        inputs = self.tokenizer.encode_plus(
            text,
            None,
            add_special_tokens=True,
            max_length=self.max_len,
            padding="max_length",
            truncation=True,  # cut sequences longer than max_len so every item has the same length
            return_token_type_ids=True
        )
        
        # Extract the values we care about from inputs
        ids = inputs['input_ids']
        mask = inputs['attention_mask']
        token_type_ids = inputs["token_type_ids"]
        
        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'token_type_ids': torch.tensor(token_type_ids, dtype=torch.long),
            # Class indices are integers, which is what CrossEntropyLoss expects
            'targets': torch.tensor(self.targets[index], dtype=torch.long)
        }   
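
Every item returned by `__getitem__` needs the same length so that the DataLoader can stack items into a batch. The sketch below mimics, in plain Python, what padding and truncating to a fixed length do to the ids and the attention mask. `pad_or_truncate` is a hypothetical helper and the token ids are made up. The real tokenizer also keeps the closing special token when it truncates, which this simplified version does not.

```python
def pad_or_truncate(token_ids, max_len, pad_id):
    """Return (ids, mask), both of length exactly max_len."""
    # Truncate: keep at most max_len tokens (simplified, see the note above)
    ids = token_ids[:max_len]
    # Real tokens get a 1 in the attention mask
    mask = [1] * len(ids)
    # Pad: fill the remainder with pad_id, masked out with 0
    padding = max_len - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding


PAD_ID = 1  # assumed pad token id for this illustration

short_ids, short_mask = pad_or_truncate([0, 100, 200, 2], max_len=6, pad_id=PAD_ID)
long_ids, long_mask = pad_or_truncate([0, 11, 12, 13, 14, 15, 16, 2], max_len=6, pad_id=PAD_ID)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what tells the model to ignore the padded positions.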

Let's make our train/test sets!!

In [None]:
# Sample from our dataset 
train_data = data.sample(frac=train_size, random_state=0)
# Remove the train data we sampled in the last step to get our test data
test_data = data.drop(train_data.index).reset_index(drop=True)
# Clean up train data by resetting the indexes
train_data = train_data.reset_index(drop=True)

# Our train and test data shape
print("Train shape:", train_data.shape)
print("Test shape:", test_data.shape)

# Our data
train_set = DataA(train_data, tokenizer, max_len)
test_set = DataA(test_data, tokenizer, max_len)

# Our dataloaders
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=0)
# No need to shuffle the test set; a fixed order makes evaluation reproducible
test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False, num_workers=0)

Our model:

In [None]:
class ModelA(torch.nn.Module):
    def __init__(self):
        super(ModelA, self).__init__()

        # Layer 1 is Roberta
        self.l1 = RobertaModel.from_pretrained("roberta-base")
        # First hidden layer for our classifier
        self.l2 = torch.nn.Linear(768, 768)
        # ReLU
        self.relu = torch.nn.ReLU() 
        # Dropout
        self.dropout = torch.nn.Dropout(0.5)
        # Output layer
        self.l3 = torch.nn.Linear(768, 2)
        # No softmax layer: CrossEntropyLoss applies log-softmax itself, so the model returns raw logits

    def forward(self, input_ids, attention_mask, token_type_ids):
        # Roberta inputs
        x = self.l1(input_ids=input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids)
        # x[0] is the last hidden state; take the first (<s>) token's embedding for each sequence
        x = self.l2(x[0][:, 0])
        x = self.relu(x)
        x = self.dropout(x)
        # Logits for the two classes
        x = self.l3(x)
        return x
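
A note on the output layer and the loss: `torch.nn.CrossEntropyLoss` applies log-softmax to its input internally, so it expects unnormalized logits rather than probabilities. The plain-Python sketch below re-implements the formula for a single example, as an illustration rather than PyTorch's actual code, to show what happens when softmax ends up being applied twice.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Cross-entropy for one example: -log(softmax(logits)[target])."""
    return -math.log(softmax(logits)[target])


# A confident, correct prediction for class 1
logits = [-8.0, 8.0]

loss_from_logits = cross_entropy(logits, target=1)
# Feeding probabilities instead of logits applies softmax a second time
loss_from_probs = cross_entropy(softmax(logits), target=1)
# With two classes, the doubly-softmaxed loss can never drop below this value
floor = math.log(1 + math.exp(-1))

print(f"loss on logits:        {loss_from_logits:.4f}")
print(f"loss on probabilities: {loss_from_probs:.4f}")
print(f"floor for two classes: {floor:.4f}")
```

Since probabilities lie in [0, 1], the second softmax squashes even a perfect prediction, so the loss stays well above zero and the gradients carry less information.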

Let's train this jawn

In [None]:
modelA = ModelA()
modelA.to(device)

loss_fn = torch.nn.CrossEntropyLoss()
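# A small learning rate is typical when fine-tuning a pretrained transformer; 1e-5 is an assumed, commonly used value
optimizer_fn = torch.optim.Adam(params=modelA.parameters(), lr=1e-5)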

for epoch in range(epochs):
    curr_loss = 0
    modelA.train()
    # Loop over batch data
    # btw tqdm is a nice loop progress visualizer; wrapping the loader itself lets it show the total
    for step, example in enumerate(tqdm(train_loader)):
        # Convert example from batch into input for Roberta
        ids = example['ids'].to(device, dtype = torch.long)
        mask = example['mask'].to(device, dtype = torch.long)
        token_type_ids = example['token_type_ids'].to(device, dtype = torch.long)
        targets = example['targets'].to(device, dtype = torch.long)
        
        # Get model outputs
        outputs = modelA(ids, mask, token_type_ids)
        loss = loss_fn(outputs, targets)
        # Add loss to current epoch loss tracker
        curr_loss += loss.item()
        
        # Calculate backward pass
        optimizer_fn.zero_grad()
        loss.backward()
        # Update parameters
        optimizer_fn.step()
        
    print(f"Total loss for epoch {epoch} was {curr_loss}")
    
        