* Use slang embeddings (SMASH) - fasttext based, combine with other pre-trained embeddings (if possible)
  * If this is possible, use these embeddings with a neural net architecture (BERT, LSTM, Bi-LSTM) to classify words as slang or not.
  * If not possible, investigate using two classifiers, one with slang embeddings and one with regular embeddings, then check which model is more confident for a given word.
* Use these embeddings to develop a model for identifying the slang word in a sentence.
* Evaluate these models with the pos/neg examples in order to tune/evaluate them without us having to manually evaluate.
  * Our slang test sentences will come from Urban Dictionary. Its word-lookup API returns JSON; one field is the word and another is an example sentence. We can use these two fields to build our positive example dataset.
  * Our slang negative test sentences will come from the NYT dataset we have used in prior assignments in this class.
* Once a slang word is identified, we need to match it to a definition by looking it up in a database of slang. (https://www.kaggle.com/therohk/urban-dictionary-words-dataset is a slang database file from Urban Dictionary, 2019.)
* Disambiguating the slang definition may require POS tagging. Once we have slang detection working, we will tackle this task.
* Once we have the model trained, we will export/freeze it, for use in translating slang from a source sentence into the definition of that word.
* Once the slang is identified and defined, it should be shown to the user for their input sentence.
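
The positive-example bullet above can be sketched as follows. This is a minimal illustration, not the notebook's actual preprocessing: the helper name `bio_tag_example` and the whitespace tokenization are assumptions, and the Urban Dictionary entry is assumed to be already reduced to the two fields described above (the word and an example sentence). The output uses the same `sent`/`label` keys and `O`/`B-SLANG`/`I-SLANG` tags that the later cells read.

```python
def bio_tag_example(word, example):
    """Tag one example sentence with BIO labels for a (possibly multi-word) slang term.

    Hypothetical helper: assumes whitespace tokenization and a
    case-insensitive exact match on each token.
    """
    tokens = example.split()
    target = word.lower().split()
    labels = ['O'] * len(tokens)
    lowered = [t.lower() for t in tokens]
    n = len(target)
    if n == 0:
        return {'sent': tokens, 'label': labels}
    # Slide a window of the term's length over the sentence.
    for start in range(len(tokens) - n + 1):
        if lowered[start:start + n] == target:
            labels[start] = 'B-SLANG'
            for k in range(start + 1, start + n):
                labels[k] = 'I-SLANG'
    return {'sent': tokens, 'label': labels}


sample = bio_tag_example("janky", "that car is janky")
print(sample['label'])
```

Because matching is done on whole whitespace tokens, a term followed directly by punctuation (for example `janky.`) would not be tagged; real data would likely need punctuation stripping first.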

In [None]:
from google.colab import drive
drive.mount('/content/drive')
drivePath = 'drive/MyDrive/Colab Notebooks/finalProj/'
output_dir = drivePath + 'model_save/'

Loading the data and preprocessing it into PyTorch tensors
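
One preprocessing detail is that BERT splits words into word pieces, so a per-word label list can be shorter than the token list. A minimal sketch of one common alignment strategy follows; it is not what the cells below do. `toy_wordpieces` is a hypothetical stand-in for a real tokenizer so that the example runs without downloading a model, and -100 is used as the ignore value because that is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def toy_wordpieces(word, size=4):
    """Hypothetical stand-in tokenizer: chop a word into fixed-size pieces."""
    pieces = [word[i:i + size] for i in range(0, len(word), size)]
    return [pieces[0]] + ['##' + p for p in pieces[1:]]


def align_labels(words, word_labels, tokenize=toy_wordpieces):
    """Expand per-word label ids to per-token label ids.

    The first piece of each word keeps the word's label; continuation
    pieces and the [CLS]/[SEP] positions get IGNORE so the loss skips them.
    """
    tokens, labels = ['[CLS]'], [IGNORE]
    for word, label in zip(words, word_labels):
        pieces = tokenize(word)
        tokens.extend(pieces)
        labels.extend([label] + [IGNORE] * (len(pieces) - 1))
    tokens.append('[SEP]')
    labels.append(IGNORE)
    return tokens, labels


tokens, labels = align_labels(['she', 'is', 'gawjus'], [0, 0, 1])
print(list(zip(tokens, labels)))
```

With this layout the label list always has the same length as the token list, so padding both to `max_len` keeps them aligned.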

In [None]:
import pickle
import os
# Load all of the positive and negative pickled samples.
target = drivePath + "fullSamples.pkl"
fullSamples = []  # stays an empty list if the pickle file is empty

if os.path.getsize(target) > 0:
    with open(target, "rb") as f:
        unpickler = pickle.Unpickler(f)
        # If the file is not empty, fullSamples becomes
        # the unpickled value.
        fullSamples = unpickler.load()



In [None]:
! pip install requests
! pip install kaggle
! pip install transformers
! pip install tensorflow
import tensorflow as tf

# Get the GPU device name.
device_name = tf.test.gpu_device_name()

# The device name should look like the following:
if device_name == '/device:GPU:0':
    print('Found GPU at: {}'.format(device_name))
else:
    raise SystemError('GPU device not found')
import torch

# If there's a GPU available...
if torch.cuda.is_available():    

    # Tell PyTorch to use the GPU.    
    device = torch.device("cuda")

    print('There are %d GPU(s) available.' % torch.cuda.device_count())

    print('We will use the GPU:', torch.cuda.get_device_name(0))

# If not...
else:
    print('No GPU available, using the CPU instead.')
    device = torch.device("cpu")

In [None]:
# BERT STUFF - The structure and the approach to using/training BERT were taken from Seth's BERT Tutorial python notebook.
# We use different data structures to store our initial data and results, and a different model for a different task (token classification for sequence labeling).
# The broad strokes of the code layout are from Seth's tutorial, with modifications to make it work for our case.
from transformers import BertTokenizer
# TODO: update dataset processing to tokenize everything with the BERT tokenizer up front, so that our labels also correspond to the BERT tokens.
# Load the BERT tokenizer.
print('Loading BERT tokenizer...')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
max_len = 0
skipLabels = []
# For every sentence...
for idx, sample in enumerate(fullSamples):
    sentence = sample['sent']
    sentString = " ".join(sentence)
    # Some samples are longer than BERT can handle (512 tokens), so we drop them.
    # Note: this check counts characters, which is a conservative proxy for the token count.
    if len(sentString) > 512:
      skipLabels.append(idx)
      continue
    # Tokenize the text and add `[CLS]` and `[SEP]` tokens.
    input_ids = tokenizer.encode(sentString, add_special_tokens=True)

    # Update the maximum sentence length.
    max_len = max(max_len, len(input_ids))

print('Max sentence length: ', max_len)

In [None]:
# Delete the samples that exceeded the length limit above. Deleting in reverse
# index order keeps the remaining indices valid.
for ele in sorted(skipLabels, reverse = True):
  # remove those values from the dataset.
  del fullSamples[ele]

In [None]:
from torch.utils.data import random_split
# unspool the fullSamples to a list of samples and a list of corresponding labels
# also convert the BIO notation to 1, 2, 0 values, so they can be used in a tensor / predicted by BERT.
# this is where we do our train / test split.
# the training data will be further split into development and validation splits elsewhere.
labels = []
sents = []
train_size = int(0.8 * len(fullSamples))
test_size = len(fullSamples) - train_size
trainSamples, testSamples = random_split(fullSamples, [train_size, test_size])


for sample in trainSamples:
  label = []
  for labelVal in sample['label']:
    if(labelVal == 'O'):
        label.append(0)
    elif(labelVal == 'B-SLANG'):
        label.append(1)
    elif(labelVal == 'I-SLANG'):
        label.append(2)
    else:
      raise SystemError('invalid label %s' % labelVal)
  labels.append(label)
  sents.append(" ".join(sample['sent']))
print(len(sents))
print(len(labels))

testSents = []
testLabels = []
for sample in testSamples:
  label = []
  for labelVal in sample['label']:
    if(labelVal == 'O'):
        label.append(0)
    elif(labelVal == 'B-SLANG'):
        label.append(1)
    elif(labelVal == 'I-SLANG'):
        label.append(2)
    else:
      raise SystemError('invalid label %s' % labelVal)
  testLabels.append(label)
  testSents.append(" ".join(sample['sent']))
print(len(testSents))
print(len(testLabels))



In [None]:
# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []
attention_masks = []

# For every sentence...
for sample in sents:
    # `encode_plus` will:
    #   (1) Tokenize the sentence.
    #   (2) Prepend the `[CLS]` token to the start.
    #   (3) Append the `[SEP]` token to the end.
    #   (4) Map tokens to their IDs.
    #   (5) Pad or truncate the sentence to `max_length`
    #   (6) Create attention masks for [PAD] tokens.
    encoded_dict = tokenizer.encode_plus(
                        sample,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        max_length = max_len,           # Pad & truncate all sentences.
                        padding = 'max_length',
                        truncation = True,
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list.    
    input_ids.append(encoded_dict['input_ids'])
    
    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])


In [None]:
import torch
# Convert the lists into tensors.
input_ids_tensor = torch.cat(input_ids, dim=0)
attention_masks_tensor = torch.cat(attention_masks, dim=0)
# We need to pad out the labels to match the length of the padded sentences.
# TODO: this is not the best way to do this, as it does NOT account for BERT's tokenization into word pieces.
padlabels = []
for label in labels:
  padlabels.append(label + [0] * (max_len - len(label)))
labels_tensor = torch.tensor(padlabels)

# Print sentence 0, now as a list of IDs.
print('Original: ', sents[0])
print('Token IDs:', input_ids_tensor[0])
print('Labeling:', labels_tensor[0])

Training BERT
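
The training cells below pair the optimizer with a linear learning-rate schedule. As a rough sketch of what such a schedule computes, the function below reproduces the usual "linear warmup, then linear decay" multiplier in plain Python. The name `linear_lr_multiplier` is hypothetical, and the step counts in the usage lines are illustrative values, not figures from this notebook.

```python
def linear_lr_multiplier(step, num_warmup_steps, num_training_steps):
    """Factor applied to the base learning rate at a given optimizer step."""
    if step < num_warmup_steps:
        # Ramp up linearly from 0 to 1 during warmup.
        return step / max(1, num_warmup_steps)
    # Then decay linearly from 1 to 0 over the remaining steps.
    remaining = num_training_steps - step
    return max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


base_lr = 2e-5
total_steps = 100 * 4  # illustrative: 100 batches per epoch, 4 epochs
for step in (0, 200, 400):
    print(step, base_lr * linear_lr_multiplier(step, 0, total_steps))
```

With zero warmup steps, as in the scheduler cell, the learning rate starts at its full value and shrinks to zero by the final step.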

In [None]:
from torch.utils.data import TensorDataset, random_split

# Combine the training inputs into a TensorDataset.
dataset = TensorDataset(input_ids_tensor, attention_masks_tensor, labels_tensor)

# Create a 90-10 train-validation split.

# Calculate the number of samples to include in each set.
train_size = int(0.9 * len(dataset))
val_size = len(dataset) - train_size

# Divide the dataset by randomly selecting samples.
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

print('{:>5,} training samples'.format(train_size))
print('{:>5,} validation samples'.format(val_size))

In [None]:
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

# The DataLoader needs to know our batch size for training, so we specify it 
# here. For fine-tuning BERT on a specific task, the authors recommend a batch 
# size of 16 or 32.
batch_size = 32

# Create the DataLoaders for our training and validation sets.
# We'll take training samples in random order. 
train_dataloader = DataLoader(
            train_dataset,  # The training samples.
            sampler = RandomSampler(train_dataset), # Select batches randomly
            batch_size = batch_size # Trains with this batch size.
        )

# For validation the order doesn't matter, so we'll just read them sequentially.
validation_dataloader = DataLoader(
            val_dataset, # The validation samples.
            sampler = SequentialSampler(val_dataset), # Pull out batches sequentially.
            batch_size = batch_size # Evaluate with this batch size.
        )

In [None]:
from transformers import BertForTokenClassification, AdamW, BertConfig
from transformers import BertTokenizer
# Load BertForTokenClassification, the pretrained BERT model 
model = BertForTokenClassification.from_pretrained(
    "bert-base-uncased", # Use the 12-layer BERT model, with an uncased vocab.
    num_labels = 3, # The number of output labels   
    output_attentions = False, # Whether the model returns attentions weights.
    output_hidden_states = False, # Whether the model returns all hidden-states.
)

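# Tell pytorch to run this model on the selected device (GPU if available),
# so that it matches the batches moved with `.to(device)` during training.
model.to(device)

# Note: AdamW is a class from the huggingface library (as opposed to pytorch) 
# I believe the 'W' stands for 'Weight Decay fix'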
optimizer = AdamW(model.parameters(),
                  lr = 2e-5, # args.learning_rate - default is 5e-5, our notebook had 2e-5
                  eps = 1e-8 # args.adam_epsilon  - default is 1e-8.
                )

In [None]:
import numpy as np

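# Function to calculate the token-level accuracy of our predictions vs labels.
# `preds` has shape (batch, seq_len, num_labels), so the argmax is taken over
# the last axis. Padding positions are included in the count.
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=-1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)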

import time
import datetime

def format_time(elapsed):
    '''
    Takes a time in seconds and returns a string hh:mm:ss
    '''
    # Round to the nearest second.
    elapsed_rounded = int(round((elapsed)))
    
    # Format as hh:mm:ss
    return str(datetime.timedelta(seconds=elapsed_rounded))

In [None]:
from transformers import get_linear_schedule_with_warmup

# Number of training epochs. The BERT authors recommend between 2 and 4. 
# We chose to run for 4, but we'll see later that this may be over-fitting the
# training data.
epochs = 4

# Total number of training steps is [number of batches] x [number of epochs]. 
# (Note that this is not the same as the number of training samples).
total_steps = len(train_dataloader) * epochs

# Create the learning rate scheduler.
scheduler = get_linear_schedule_with_warmup(optimizer, 
                                            num_warmup_steps = 0, # Default value in run_glue.py
                                            num_training_steps = total_steps)

In [None]:
import random
import numpy as np

# This training code is based on the `run_glue.py` script here:
# https://github.com/huggingface/transformers/blob/5bfcd0485ece086ebcbed2d008813037968a9e58/examples/run_glue.py#L128

# Set the seed value all over the place to make this reproducible.
seed_val = 42

random.seed(seed_val)
np.random.seed(seed_val)
torch.manual_seed(seed_val)
torch.cuda.manual_seed_all(seed_val)

# We'll store a number of quantities such as training and validation loss, 
# validation accuracy, and timings.
training_stats = []

# Measure the total training time for the whole run.
total_t0 = time.time()

# For each epoch...
for epoch_i in range(0, epochs):
    
    # ========================================
    #               Training
    # ========================================
    
    # Perform one full pass over the training set.

    print("")
    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
    print('Training...')

    # Measure how long the training epoch takes.
    t0 = time.time()

    # Reset the total loss for this epoch.
    total_train_loss = 0

    # Put the model into training mode. Don't be misled--the call to 
    # `train` just changes the *mode*, it doesn't *perform* the training.
    # `dropout` and `batchnorm` layers behave differently during training
    # vs. test (source: https://stackoverflow.com/questions/51433378/what-does-model-train-do-in-pytorch)
    model.train()

    # For each batch of training data...
    for step, batch in enumerate(train_dataloader):

        # Progress update every 40 batches.
        if step % 40 == 0 and not step == 0:
            # Calculate elapsed time in minutes.
            elapsed = format_time(time.time() - t0)
            
            # Report progress.
            print('  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))

        # Unpack this training batch from our dataloader. 
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using the 
        # `to` method.
        #
        # `batch` contains three pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)

        # Always clear any previously calculated gradients before performing a
        # backward pass. PyTorch doesn't do this automatically because 
        # accumulating the gradients is "convenient while training RNNs". 
        # (source: https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch)
        model.zero_grad()        

        # Perform a forward pass (evaluate the model on this training batch).
        # The documentation for this `model` function is here: 
        # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
        # It returns different numbers of parameters depending on what arguments
        # are given and what flags are set. For our usage here, it returns
        # the loss (because we provided labels) and the "logits"--the model
        # outputs prior to activation.

        # Seth Briney: recent versions of transformers return a dict-like output
        # object, so tuple-unpacking it yields the key names (strings) instead of
        # tensors. The old call, kept for reference:
        # loss, logits = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels)

        seths_model = model(b_input_ids, # Seth Briney modified
              token_type_ids=None, 
              attention_mask=b_input_mask, 
              labels=b_labels)
        # loss, logits = seths_model[0], seths_model[1] # Seth Briney modified
        loss, logits = seths_model['loss'], seths_model['logits'] # Seth Briney modified

        # Accumulate the training loss over all of the batches so that we can
        # calculate the average loss at the end. `loss` is a Tensor containing a
        # single value; the `.item()` function just returns the Python value 
        # from the tensor.
        total_train_loss += loss.item() # Seth Briney: works because `loss` is indexed by name above and is therefore a tensor.

        # Perform a backward pass to calculate the gradients.
        loss.backward()

        # Clip the norm of the gradients to 1.0.
        # This is to help prevent the "exploding gradients" problem.
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update parameters and take a step using the computed gradient.
        # The optimizer dictates the "update rule"--how the parameters are
        # modified based on their gradients, the learning rate, etc.
        optimizer.step()

        # Update the learning rate.
        scheduler.step()

    # Calculate the average loss over all of the batches.
    avg_train_loss = total_train_loss / len(train_dataloader)            
    
    # Measure how long this epoch took.
    training_time = format_time(time.time() - t0)

    print("")
    print("  Average training loss: {0:.2f}".format(avg_train_loss))
    print("  Training epoch took: {:}".format(training_time))
        
    # ========================================
    #               Validation
    # ========================================
    # After the completion of each training epoch, measure our performance on
    # our validation set.

    print("")
    print("Running Validation...")

    t0 = time.time()

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    model.eval()

    # Tracking variables 
    total_eval_accuracy = 0
    total_eval_loss = 0
    nb_eval_steps = 0

    # Evaluate data for one epoch
    for batch in validation_dataloader:
        
        # Unpack this validation batch from our dataloader. 
        #
        # As we unpack the batch, we'll also copy each tensor to the GPU using 
        # the `to` method.
        #
        # `batch` contains three pytorch tensors:
        #   [0]: input ids 
        #   [1]: attention masks
        #   [2]: labels 
        b_input_ids = batch[0].to(device)
        b_input_mask = batch[1].to(device)
        b_labels = batch[2].to(device)
        
        # Tell pytorch not to bother with constructing the compute graph during
        # the forward pass, since this is only needed for backprop (training).
        with torch.no_grad():        

            # Forward pass, calculate logit predictions.
            # token_type_ids is the same as the "segment ids", which 
            # differentiates sentence 1 and 2 in 2-sentence tasks.
            # The documentation for this `model` function is here: 
            # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
            # Get the "logits" output by the model. The "logits" are the output
            # values prior to applying an activation function like the softmax.
            
            # (loss, logits) = model(b_input_ids, token_type_ids=None, attention_mask=b_input_mask, labels=b_labels) # Seth Briney commented

            seths_model = model(b_input_ids, # Seth Briney modified
                  token_type_ids=None, 
                  attention_mask=b_input_mask, 
                  labels=b_labels)
            # loss, logits = seths_model[0], seths_model[1] # Seth Briney modified
            loss, logits = seths_model['loss'], seths_model['logits'] # Seth Briney modified

        # Accumulate the validation loss.
        total_eval_loss += loss.item()

        # Move logits and labels to CPU
        logits = logits.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

        # Calculate the accuracy for this batch of test sentences, and
        # accumulate it over all batches.
        total_eval_accuracy += flat_accuracy(logits, label_ids)
        

    # Report the final accuracy for this validation run.
    avg_val_accuracy = total_eval_accuracy / len(validation_dataloader)
    print("  Accuracy: {0:.2f}".format(avg_val_accuracy))

    # Calculate the average loss over all of the batches.
    avg_val_loss = total_eval_loss / len(validation_dataloader)
    
    # Measure how long the validation run took.
    validation_time = format_time(time.time() - t0)
    
    print("  Validation Loss: {0:.2f}".format(avg_val_loss))
    print("  Validation took: {:}".format(validation_time))

    # Record all statistics from this epoch.
    training_stats.append(
        {
            'epoch': epoch_i + 1,
            'Training Loss': avg_train_loss,
            'Valid. Loss': avg_val_loss,
            'Valid. Accur.': avg_val_accuracy,
            'Training Time': training_time,
            'Validation Time': validation_time
        }
    )

print("")
print("Training complete!")

print("Total training took {:} (h:mm:ss)".format(format_time(time.time()-total_t0)))

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
%matplotlib inline

def displayLoss():
    # Display floats with two decimal places.
    pd.set_option('display.precision', 2)

    # Create a DataFrame from our training statistics.
    df_stats = pd.DataFrame(data=training_stats)

    # Use the 'epoch' as the row index.
    df_stats = df_stats.set_index('epoch')

    # A hack to force the column headers to wrap.
    #df = df.style.set_table_styles([dict(selector="th",props=[('max-width', '70px')])])

    # Display the table (a bare expression inside a function shows nothing).
    display(df_stats)

    # Use plot styling from seaborn.
    sns.set(style='darkgrid')

    # Increase the plot size and font size.
    sns.set(font_scale=1.5)
    plt.rcParams["figure.figsize"] = (12,6)

    # Plot the learning curve.
    plt.plot(df_stats['Training Loss'], 'b-o', label="Training")
    plt.plot(df_stats['Valid. Loss'], 'g-o', label="Validation")

    # Label the plot.
    plt.title("Training & Validation Loss")
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.legend()
    # One tick per recorded epoch.
    plt.xticks(list(df_stats.index))

    plt.show()

In [None]:
import os

# Saving best-practices: if you use default names for the model, you can reload it using from_pretrained()

output_dir = drivePath + 'model_save/'

# Create output directory if needed
if not os.path.exists(output_dir):
    os.makedirs(output_dir)

print("Saving model to %s" % output_dir)

# Save a trained model, configuration and tokenizer using `save_pretrained()`.
# They can then be reloaded using `from_pretrained()`
model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
model_to_save.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)

Load the saved model and use BERT to predict labels on the test set.
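
After prediction, each token carries a label id (0 for `O`, 1 for `B-SLANG`, 2 for `I-SLANG`), and contiguous runs have to be grouped back into phrases. A minimal sketch of that grouping is below. The function name `decode_phrases` is hypothetical, and it assumes one label per whitespace-separated word, which ignores the word-piece issue noted in the preprocessing TODO.

```python
def decode_phrases(words, label_ids):
    """Group B-SLANG (1) / I-SLANG (2) runs into phrases; 0 means outside."""
    phrases, current = [], []
    # zip stops at the shorter input, so padded label lists are fine.
    for word, label in zip(words, label_ids):
        if label == 1:                 # a new phrase starts here
            if current:
                phrases.append(' '.join(current))
            current = [word]
        elif label == 2 and current:   # continue the open phrase
            current.append(word)
        else:                          # O, or a stray I-SLANG with no B-SLANG
            if current:
                phrases.append(' '.join(current))
            current = []
    if current:
        phrases.append(' '.join(current))
    return phrases


words = 'she is so janky and low key gawj'.split()
print(decode_phrases(words, [0, 0, 0, 1, 0, 1, 2, 1]))
```

One design choice in this sketch is that a `B-SLANG` label always opens a new phrase, so two adjacent slang terms are reported separately instead of being merged into one.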

In [None]:
from transformers import BertForTokenClassification, AdamW, BertConfig
from transformers import BertTokenizer
# Good practice: save your training arguments together with the trained model
#torch.save(args, os.path.join(output_dir, 'training_args.bin'))
def load_model(output_dir):
  # Load a trained model and vocabulary that you have fine-tuned
  model = BertForTokenClassification.from_pretrained(output_dir)
  tokenizer = BertTokenizer.from_pretrained(output_dir)
  # Copy the model to the selected device (GPU if available, otherwise CPU).
  # `model.to(device)` also works on CPU-only machines, unlike `model.cuda()`.
  model.to(device)
  return model

In [None]:
model = load_model(output_dir)
displayLoss()

In [None]:
import pandas as pd
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

# Create sentence and label lists
sentences = testSents
labels = testLabels


# Tokenize all of the sentences and map the tokens to their word IDs.
input_ids = []
attention_masks = []

# For every sentence...
for sent in sentences:
    # `encode_plus` will:
    #   (1) Tokenize the sentence.
    #   (2) Prepend the `[CLS]` token to the start.
    #   (3) Append the `[SEP]` token to the end.
    #   (4) Map tokens to their IDs.
    #   (5) Pad or truncate the sentence to `max_length`
    #   (6) Create attention masks for [PAD] tokens.
    encoded_dict = tokenizer.encode_plus(
                        sent,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        max_length = max_len,           # Pad & truncate all sentences.
                        padding = 'max_length',
                        truncation = True,
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                   )
    
    # Add the encoded sentence to the list.    
    input_ids.append(encoded_dict['input_ids'])
    
    # And its attention mask (simply differentiates padding from non-padding).
    attention_masks.append(encoded_dict['attention_mask'])

# Convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
padlabels = []
for label in labels:
  padlabels.append(label + [0] * (max_len - len(label)))
labels_tensor = torch.tensor(padlabels)

# Set the batch size.  
batch_size = 32  

# Create the DataLoader.
prediction_data = TensorDataset(input_ids, attention_masks, labels_tensor)
prediction_sampler = SequentialSampler(prediction_data)
prediction_dataloader = DataLoader(prediction_data, sampler=prediction_sampler, batch_size=batch_size)

In [None]:
# Prediction on test set, takes about 7 mins to run

print('Predicting labels for {:,} test sentences...'.format(len(input_ids)))

#load the model
model = load_model(output_dir)

# Put model in evaluation mode
model.eval()

# Tracking variables 
predictions , true_labels = [], []

# Predict 
for batch in prediction_dataloader:
  # Add batch to GPU
  batch = tuple(t.to(device) for t in batch)
  
  # Unpack the inputs from our dataloader
  b_input_ids, b_input_mask, b_labels = batch
  
  # Telling the model not to compute or store gradients, saving memory and 
  # speeding up prediction
  with torch.no_grad():
      # Forward pass, calculate logit predictions
      outputs = model(b_input_ids, token_type_ids=None, 
                      attention_mask=b_input_mask)

  logits = outputs[0]

  # Move logits and labels to CPU
  logits = logits.detach().cpu().numpy()
  label_ids = b_labels.to('cpu').numpy()
  
  # Store predictions and true labels
  predictions.append(logits)
  true_labels.append(label_ids)

print('    DONE.')

Calculating the Metrics
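
Because most tokens are `O`, plain accuracy can look high even when slang is rarely found, which is why the cell below also reports balanced accuracy. As a small illustration of that metric (the mean of per-class recall), here is a pure-Python sketch; `balanced_accuracy` is a hypothetical helper and the label lists are made-up toy data, not results from this model.

```python
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean recall over the classes that appear in y_true."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [hits[c] / support[c] for c in support]
    return sum(recalls) / len(recalls)


# Toy data: six O tokens, one B-SLANG that is missed, one I-SLANG that is found.
y_true = [0, 0, 0, 0, 0, 0, 1, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 2]
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain, balanced_accuracy(y_true, y_pred))
```

In this toy case the missed `B-SLANG` token barely moves plain accuracy but pulls the balanced score down sharply, since each class counts equally.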

In [None]:
import numpy as np
from sklearn.metrics import matthews_corrcoef
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import balanced_accuracy_score
from sklearn.metrics import precision_recall_fscore_support
import seaborn as sns
import matplotlib.pyplot as plt
matthews_set = []

# Evaluate each test batch using the Matthews correlation coefficient
print('Calculating Matthews Corr. Coef. for each batch...')

# For each input batch...
for i in range(len(true_labels)):
  
  # The predictions for this batch are a 3-column ndarray per token (one column
  # each for labels 0, 1, and 2). Pick the label with the highest value and
  # flatten the result into a list of 0s, 1s, and 2s.
  #print(predictions[i][0])
  pred_labels_i = np.argmax(predictions[i], axis=2).flatten()
  #print(pred_labels_i)
  # Because the predicted labels are flattened, the true labels must be flattened too.
  true_labels_i = true_labels[i].flatten()

  # Calculate and store the coef for this batch.  
  matthews = matthews_corrcoef(true_labels_i, pred_labels_i)                
  matthews_set.append(matthews)

# Create a barplot showing the MCC score for each batch of test samples.
ax = sns.barplot(x=list(range(len(matthews_set))), y=matthews_set, ci=None)

plt.title('MCC Score per Batch')
plt.ylabel('MCC Score (-1 to +1)')
plt.xlabel('Batch #')

plt.show()

# Combine the results across all batches. 
flat_predictions = np.concatenate(predictions, axis=0)

# For each token, pick the label (0, 1, or 2) with the highest score.
flat_predictions = np.argmax(flat_predictions, axis=2).flatten()

# Combine the correct labels for each batch into a single list.
flat_true_labels = np.concatenate(true_labels, axis=0).flatten()

# Calculate the MCC
mcc = matthews_corrcoef(flat_true_labels, flat_predictions)

print('\nTotal MCC: %.3f' % mcc)
print('\nTotal Confusion Matrix')
ConfusionMatrixDisplay(confusion_matrix(flat_true_labels, flat_predictions, normalize='all'), display_labels=["O", "B-SLANG", "I-SLANG"]).plot()
plt.show()

# show just B and I slang.
stripped_true = []
stripped_pred = []
correctO = 0
for idx, value in enumerate(flat_true_labels):
  if flat_true_labels[idx] == 0 and flat_predictions[idx] == 0:
    correctO += 1
  else:
    stripped_true.append(flat_true_labels[idx])
    stripped_pred.append(flat_predictions[idx])

print(correctO)
print(len(flat_true_labels))
print('\nTotal Confusion Matrix with correct O predictions removed')
ConfusionMatrixDisplay(confusion_matrix(stripped_true, stripped_pred, normalize='all'), display_labels=["O", "B-SLANG", "I-SLANG"]).plot()
plt.show()   

print("\nBalanced Accuracy")
print(balanced_accuracy_score(flat_true_labels, flat_predictions))

print("\nweighted precision, recall, fscore, support")
print(precision_recall_fscore_support(flat_true_labels, flat_predictions, labels=[0,1,2], average='weighted'))

In [None]:
# Kaggle API credentials, needed to download the UD dataset.
# Do not commit a real key to the notebook; substitute your own values here.
! echo "{\"username\":\"YOUR_KAGGLE_USERNAME\",\"key\":\"YOUR_KAGGLE_KEY\"}" > kaggle.json
! pip install kaggle

# Getting the Urban Dictionary words dataset from Kaggle
! mkdir -p ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
! kaggle datasets download therohk/urban-dictionary-words-dataset
! unzip urban-dictionary-words-dataset
# gives urbandict-word-defs.csv - for use in defining slang terms

#reading the file to get the udWords vocabulary
!head urbandict-word-defs.csv
import csv
udWords = []
# Use a context manager so the file is closed even if reading fails.
with open("urbandict-word-defs.csv", newline='') as file:
    csvreader = csv.reader(file)
    header = next(csvreader)
    print(header)
    for row in csvreader:
        udWords.append(row)
print(len(udWords))

#moving all of the words and their definitions into a dictionary
uDict = {}
for row in udWords:
  uDict[row[1].lower()] = row[5]

In [None]:

# Combine the results across all batches. 
flat_predictions = np.concatenate(predictions, axis=0)

# For each sample, pick the label (0, 1, 2) with the higher score.
flat_predictions = np.argmax(flat_predictions, axis=2)
# The loop below prints test sentences predicted to contain slang, stopping after m of them have a definition we can show.
y=0
m = 20
for i in range(len(flat_predictions)):
  slangPhrase = []
  startSlang = False
  phrases = []
  for slangIdx, x in enumerate(flat_predictions[i]):
    words = testSents[i].split()
    if x == 1 and (len(words) > slangIdx):
      startSlang = True
      slangPhrase.append(words[slangIdx])
    if x == 2 and startSlang and (len(words) > slangIdx):
          slangPhrase.append(words[slangIdx])
    if x == 0 and startSlang:
        startSlang = False
        phrases.append(" ".join(slangPhrase))
        slangPhrase = []

  if slangPhrase != []:
      phrases.append(" ".join(slangPhrase))
  if len(phrases) > 0:
      print(testSents[i])
      y += 1
      print("{:d} slang words found".format(len(phrases)))
      for phrase in phrases:
          if phrase in uDict:
              print("The slang word is: \"{:s}\"\nThe definition of {:s} is: {:s}\n".format(phrase, phrase, uDict[phrase]))
          else:
              y -= 1
              print("We do not have a definition for \"{:s}\". Sorry!\n".format(phrase))
  if y >= m:
      break


In [None]:
from transformers import BertForTokenClassification, AdamW, BertConfig
from transformers import BertTokenizer
import torch
import numpy as np
# Good practice: save your training arguments together with the trained model
#torch.save(args, os.path.join(output_dir, 'training_args.bin'))
def load_model(output_dir):
  # Load a trained model and vocabulary that you have fine-tuned
  model = BertForTokenClassification.from_pretrained(output_dir)
  tokenizer = BertTokenizer.from_pretrained(output_dir)

  # Copy the model to the GPU.
  # we are going to use the model on the CPU for actual processing.
  #model.to(device)
  return model

def tokenizeInput(sentence):
  encoded_dict = tokenizer.encode_plus(
                        sentence,                      # Sentence to encode.
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        return_attention_mask = True,   # Construct attn. masks.
                        return_tensors = 'pt',     # Return pytorch tensors.
                   )
  # return the encoded sentence.
  # And its attention mask (simply differentiates padding from non-padding).
  inputs = encoded_dict['input_ids']
  attention = encoded_dict['attention_mask']
  return (inputs, attention)

# CODE TO FUNCTIONALIZE USING THE MODEL (ENCODE, PREDICT, DECODE?) - AND THEN REPORT THE RESULT TO THE USER
def findSlang(sentence):
  # this is essentially a hack to get a c-style static var in a python function.
  # essentially, if the model isnt already initialized, load it.
  if "slangModel" not in findSlang.__dict__: findSlang.slangModel = load_model(output_dir)
  slangModel = findSlang.slangModel
  # Put model in evaluation mode
  slangModel.eval()
  prediction = []
  tokenData = tokenizeInput(sentence)
  # Move the inputs to whichever device the loaded model lives on (CPU here).
  batch = (tokenData[0].to(slangModel.device), tokenData[1].to(slangModel.device))
  
  # Unpack the inputs from our batch
  b_input_ids, b_input_mask = batch
  
  # Telling the model not to compute or store gradients, saving memory and 
  # speeding up prediction
  with torch.no_grad():
      # Forward pass, calculate logit predictions
      outputs = slangModel(b_input_ids, token_type_ids=None, 
                      attention_mask=b_input_mask)
  logits = outputs[0]
  # Move logits to CPU
  logits = logits.detach().cpu().numpy()
    
  # Pick the highest-scoring label for each token of the single input sentence.
  prediction = np.argmax(logits, axis=2)[0]
  print("Predicted labels: ", prediction)
  phrases = []
  slangPhrase = []
  startSlang = False
  # This can be better: we aren't collapsing the BERT word-piece tokens (and their slang predictions) back down to the words of the sentence.
  for slangIdx, x in enumerate(prediction):
    words = sentence.split()
    if x == 1 and (len(words) > slangIdx):
      startSlang = True
      slangPhrase.append(words[slangIdx])
    if x == 2 and startSlang and (len(words) > slangIdx):
          slangPhrase.append(words[slangIdx])
    if x == 0 and startSlang:
        startSlang = False
        phrases.append(" ".join(slangPhrase))
        slangPhrase = []

  if slangPhrase != []:
      phrases.append(" ".join(slangPhrase))

  
  if (1 in prediction) and (phrases == []):
      print("There was a slang term in your sentence, but there was an error defining it. Sorry!")

  if len(phrases) > 0:
      print(len(phrases))
      for phrase in phrases:
          if phrase in uDict:
              print("The slang word is: \"{:s}\"\nThe definition of {:s} is: {:s}\n".format(phrase, phrase, uDict[phrase]))
          else:
              print("We do not have a definition for \"{:s}\". Sorry!".format(phrase))


In [None]:
findSlang("she is so janky and she is gawj")

In [None]:
# Keep prompting until the user types 'end'.
while True:
  userInput = input("Enter your sentence or 'end': ")
  if userInput == 'end':
      break
  findSlang(userInput)