# Fine-tuning BERT (and friends) for multi-label text classification

In this notebook, we are going to fine-tune BERT to predict one or more labels for a given piece of text. Note that this notebook illustrates how to fine-tune a bert-base-uncased model, but you can also fine-tune a RoBERTa, DeBERTa, DistilBERT, CANINE, ... checkpoint in the same way. 

All of those work in the same way: they add a linear layer on top of the base model, which is used to produce a tensor of shape (batch_size, num_labels), indicating the unnormalized scores for a number of labels for every example in the batch.
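
To make "unnormalized scores" concrete, here is a minimal sketch in plain Python (no model involved) of how one row of such scores can be turned into independent per-label probabilities with a sigmoid. The logit values are made up purely for illustration.

```python
import math

def sigmoid(x: float) -> float:
    """Map an unnormalized score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical logits for one example and four labels (illustrative values only).
logits = [2.0, -1.0, 0.0, 3.0]

# Each label gets its own probability. Unlike a softmax, these need not sum to 1,
# so several labels (or none at all) can be active for the same example.
probs = [sigmoid(z) for z in logits]
predicted = [p >= 0.5 for p in probs]

print([round(p, 3) for p in probs])
print(predicted)
```

Because every label is scored on its own, a threshold of 0.5 on the probability is the same as checking whether the logit is non-negative.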



## Set-up environment

First, we import the libraries we'll use, most importantly HuggingFace Transformers and Datasets (both need to be installed already).

In [1]:
import s3fs
import boto3
import pandas as pd
import nltk

import re
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk.tokenize import word_tokenize
import tensorflow as tf
import transformers
import numpy as np




In [2]:
import torch

## Load dataset

Next, let's download a multi-label text classification dataset from the [hub](https://huggingface.co/).

At the time of writing, I picked a random one as follows:   

* first, go to the "datasets" tab on huggingface.co
* next, select the "multi-label-classification" tag on the left as well as the "1k<10k" tag (to find a relatively small dataset).

Note that you can also easily load your local data (e.g. CSV, TXT, Parquet or JSON files) as explained [here](https://huggingface.co/docs/datasets/loading.html#local-and-remote-files). That is the route taken below: the labeled book descriptions are read from a local CSV file.



In [3]:
dataset = pd.read_csv('BERTopic_Labeled.csv')

In [4]:
dataset.rename(columns = {'Unnamed: 0':'ID'}, inplace = True)

In [5]:
dataset = dataset[['ID','description', 'university', 'relationships', 'break ups', 'divorce', 'weddings', 'death', 'family', 'friendship']]

In [6]:
dataset['description']

0      From the Wall Street Journal and #1 Amazon bes...
1      Helping set the stage for BioWare's hotly anti...
2      Sebastian Locke, the fifty-six-year-old patria...
3      "Take me back to Oxmoon, the lost paradise of ...
4      When the Mayflower set sail in 1620, it carrie...
                             ...                        
995    Lee wants to be a Tarantula – a member of the ...
996    The Merry Adventures of Robin Hood of Great Re...
997    Moving from present-day Oslo to Brooklyn in th...
998    The captivating sequel to INKHEART, the critic...
999    Santa Claus, my dear old friend, you are a thi...
Name: description, Length: 1000, dtype: object

In [7]:
dataset['description'] = dataset['description'].apply(lambda x: re.sub(r'[^a-zA-Z ]+', ' ', x))

In [8]:
dataset['description'] = dataset['description'].apply(lambda x: x.lower())
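
The two cleaning steps above can be tried on a single string. The sketch below wraps them in a function and adds an optional whitespace-collapsing step; the function names and that extra step are assumptions for illustration and are not part of the notebook.

```python
import re

def clean_description(text: str) -> str:
    """Replace each run of characters other than ASCII letters and spaces
    with a single space, then lowercase the result."""
    return re.sub(r'[^a-zA-Z ]+', ' ', text).lower()

def collapse_spaces(text: str) -> str:
    """Optional extra step (not in the notebook): squeeze repeated spaces
    and trim both ends."""
    return ' '.join(text.split())

sample = "From the Wall Street Journal and #1 Amazon"
cleaned = clean_description(sample)

print(repr(cleaned))                   # "#1" becomes one space, next to the two spaces around it
print(repr(collapse_spaces(cleaned)))  # single spaces only
```

Note that the pattern also removes digits and punctuation, so a year such as "1620" or an apostrophe disappears from the text that the model later sees.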

In [9]:
dataset = dataset.replace(np.nan, False)

In [10]:
dataset

Unnamed: 0,ID,description,university,relationships,break ups,divorce,weddings,death,family,friendship
0,39822,from the wall street journal and amazon best...,False,True,False,False,False,False,False,False
1,34235,helping set the stage for bioware s hotly anti...,False,False,False,False,False,False,False,False
2,27904,sebastian locke the fifty six year old patria...,False,True,False,True,False,True,True,False
3,10515,take me back to oxmoon the lost paradise of ...,False,True,False,False,False,False,True,False
4,935,when the mayflower set sail in it carried on...,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...
995,17361,lee wants to be a tarantula a member of the ...,False,False,False,False,False,False,False,True
996,9029,the merry adventures of robin hood of great re...,False,False,False,False,False,False,False,False
997,32216,moving from present day oslo to brooklyn in th...,False,False,False,False,False,False,True,False
998,1036,the captivating sequel to inkheart the critic...,False,False,False,False,False,False,False,False


In [11]:
# train: rows 0-898, validation: rows 900-998 (slice ends are exclusive)
dataset[0:899].to_csv('FINAL_bert_train.csv')
dataset[900:999].to_csv('FINAL_bert_validation.csv')

We now have 2 splits on disk, one for training and one for validation, which we load with the Datasets library.
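
As a quick check on the split sizes, the sketch below replays the two slices on a plain list standing in for the 1000 rows. It shows why the splits end up with 899 and 99 rows; the `split_rows` helper is a hypothetical alternative that keeps every row.

```python
rows = list(range(1000))  # stand-in for the 1000 rows of the DataFrame

# The slices used in the notebook: the end index is exclusive.
train_rows = rows[0:899]    # rows 0..898
val_rows = rows[900:999]    # rows 900..998

used = set(train_rows) | set(val_rows)
skipped = [r for r in rows if r not in used]
print(len(train_rows), len(val_rows), skipped)

def split_rows(items, n_train):
    """Hypothetical helper: split into two parts without dropping any row."""
    return items[:n_train], items[n_train:]

train_all, val_all = split_rows(rows, 900)
print(len(train_all), len(val_all))
```

With the notebook's slices, rows 899 and 999 land in neither split. Whether that matters is a judgment call for a dataset of this size.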

In [12]:
from datasets import load_dataset
dataset = load_dataset('csv', data_files={'train': 'FINAL_bert_train.csv', 'validation':'FINAL_bert_validation.csv'})

Using custom data configuration default-7779b0cd14456992


Downloading and preparing dataset csv/default to /home/ec2-user/.cache/huggingface/datasets/csv/default-7779b0cd14456992/0.0.0/51cce309a08df9c4d82ffd9363bbe090bf173197fc01a71b034e8594995a1a58...



Dataset csv downloaded and prepared to /home/ec2-user/.cache/huggingface/datasets/csv/default-7779b0cd14456992/0.0.0/51cce309a08df9c4d82ffd9363bbe090bf173197fc01a71b034e8594995a1a58. Subsequent calls will reuse this data.



Let's check the first example of the training split:

In [13]:
example = dataset['train'][0]
example

{'Unnamed: 0': 0,
 'ID': '39,822',
 'description': 'from the wall street journal and   amazon bestselling author comes a new installment to the blue moon small town romance series   a womanizing bad boy with a motorcycle and sexy as sin smile is not part of emma s life plan   she moved cross country to be close to family and finally settle down in hippie  trippy  nosey blue moon bend  but when famed fashion photographer niko shows up with his leather jacket  underwear melting voice  and a problem  she sees nothing but trouble   he s all wrong for emma  but that doesn t stop the attraction from boiling over   niko doesn t let being friend zoned get in his way  once he gets his hands and his mouth on her  will their friendship survive  or will he lose everything he s worked for back in new york to the redhead who dominates his every thought  ',
 'university': False,
 'relationships': True,
 'break ups': False,
 'divorce': False,
 'weddings': False,
 'death': False,
 'family': False,
 'fr

The dataset consists of book descriptions, each labeled with zero or more of eight topics (such as relationships, family or death).

Let's create a list that contains the labels, as well as 2 dictionaries that map labels to integers and back.

In [14]:
labels = [label for label in dataset['train'][0].keys() if label not in ['ID', 'description', 'Unnamed: 0']]
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}
labels

['university',
 'relationships',
 'break ups',
 'divorce',
 'weddings',
 'death',
 'family',
 'friendship']

In [15]:
dataset

DatasetDict({
    train: Dataset({
        features: ['Unnamed: 0', 'ID', 'description', 'university', 'relationships', 'break ups', 'divorce', 'weddings', 'death', 'family', 'friendship'],
        num_rows: 899
    })
    validation: Dataset({
        features: ['Unnamed: 0', 'ID', 'description', 'university', 'relationships', 'break ups', 'divorce', 'weddings', 'death', 'family', 'friendship'],
        num_rows: 99
    })
})

## Preprocess data

As models like BERT don't expect text as direct input, but rather `input_ids`, etc., we tokenize the text using the tokenizer. Here I'm using the `AutoTokenizer` API, which will automatically load the appropriate tokenizer based on the checkpoint on the hub.

What's a bit tricky is that we also need to provide labels to the model. For multi-label text classification, this is a matrix of shape (batch_size, num_labels). Also important: this should be a tensor of floats rather than integers, otherwise PyTorch's `BCEWithLogitsLoss` (which the model will use) will complain, as explained [here](https://discuss.pytorch.org/t/multi-label-binary-classification-result-type-float-cant-be-cast-to-the-desired-output-type-long/117915/3).
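
A small, library-free sketch of that label matrix may help. The batch below is hypothetical and uses only three of the label columns; it mimics the column-wise layout that a batched `map` function receives.

```python
# Hypothetical mini-batch: one list per column, one entry per example.
batch = {
    "description": ["first text", "second text"],
    "university": [False, True],
    "relationships": [True, False],
    "death": [False, True],
}
label_names = ["university", "relationships", "death"]

# One row per example, one column per label, stored as floats (1.0 / 0.0)
# so that a binary cross-entropy loss can consume them directly.
labels_matrix = [
    [float(batch[name][i]) for name in label_names]
    for i in range(len(batch["description"]))
]
print(labels_matrix)
```

The result has shape (batch_size, num_labels), here 2 by 3, and every entry is a Python float rather than a bool or int.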

In [16]:
from transformers import AutoTokenizer
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess_data(examples):
  # take a batch of texts
  text = examples['description']
  # encode them
  encoding = tokenizer(text, padding="max_length", truncation=True, max_length=128)
  # add labels
  labels_batch = {k: examples[k] for k in examples.keys() if k in labels}
  # create numpy array of shape (batch_size, num_labels)
  labels_matrix = np.zeros((len(text), len(labels)))
  # fill numpy array
  for idx, label in enumerate(labels):
    labels_matrix[:, idx] = labels_batch[label]

  encoding["labels"] = labels_matrix.tolist()
  
  return encoding

In [17]:
encoded_dataset = dataset.map(preprocess_data, batched=True, remove_columns=dataset['train'].column_names)




In [18]:
example = encoded_dataset['train'][0]
print(example.keys())

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])


In [19]:
tokenizer.decode(example['input_ids'])

'[CLS] from the wall street journal and amazon bestselling author comes a new installment to the blue moon small town romance series a womanizing bad boy with a motorcycle and sexy as sin smile is not part of emma s life plan she moved cross country to be close to family and finally settle down in hippie trippy nosey blue moon bend but when famed fashion photographer niko shows up with his leather jacket underwear melting voice and a problem she sees nothing but trouble he s all wrong for emma but that doesn t stop the attraction from boiling over niko doesn t let being friend zoned get in his way once he gets his hands and [SEP]'
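
The decoded text stops mid-sentence because the description was truncated to 128 tokens. The sketch below is a simplified, single-sequence imitation of `padding="max_length"` combined with `truncation=True`; the real tokenizer does more (WordPiece splitting, `token_type_ids`), and `pad_or_truncate` is a hypothetical name.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special-token ids of bert-base-uncased

def pad_or_truncate(token_ids, max_length):
    """Toy version of padding="max_length" + truncation=True for one sequence."""
    body = token_ids[: max_length - 2]          # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)       # 1 = real token
    n_pad = max_length - len(input_ids)
    # pad positions get id 0 and attention 0 so the model ignores them
    return input_ids + [PAD_ID] * n_pad, attention_mask + [0] * n_pad

short_ids, short_mask = pad_or_truncate([2013, 1996, 2813], max_length=8)
long_ids, long_mask = pad_or_truncate(list(range(1000, 1020)), max_length=8)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

Short inputs are padded on the right, while long inputs lose everything past the length limit, which is what happened to the example above.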

In [20]:
example['labels']

[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

In [21]:
[id2label[idx] for idx, label in enumerate(example['labels']) if label == 1.0]

['relationships']

Finally, we set the format of our data to PyTorch tensors. This will turn the training and validation sets into standard PyTorch [datasets](https://pytorch.org/docs/stable/data.html).

In [22]:
encoded_dataset.set_format("torch")

## Define model

Here we define a model with a pre-trained base (i.e. the weights of bert-base-uncased are loaded) and a randomly initialized classification head (a linear layer) on top. The head should be fine-tuned together with the pre-trained base on a labeled dataset.

This is also printed by the warning.

We set the `problem_type` to be "multi_label_classification", as this will make sure the appropriate loss function is used (namely [`BCEWithLogitsLoss`](https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html)). We also make sure the output layer has `len(labels)` output neurons, and we set the id2label and label2id mappings.

In [23]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", 
                                                           problem_type="multi_label_classification", 
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

## Train the model!

We are going to train the model using HuggingFace's Trainer API. This requires us to define 2 things: 

* `TrainingArguments`, which specify training hyperparameters. All options can be found in the [docs](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments). Below we specify, for example, that we want to evaluate and save the model after every epoch, and we set the learning rate, the batch size for training/evaluation, the number of epochs to train for, and so on.
* a `Trainer` object (docs can be found [here](https://huggingface.co/transformers/main_classes/trainer.html#id1)).

In [24]:
batch_size = 8
metric_name = "f1"

In [25]:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    "bert-finetuned-sem_eval-english",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    #push_to_hub=True,
)

We are also going to compute metrics while training. For this, we need to define a `compute_metrics` function, that returns a dictionary with the desired metric values.

In [26]:
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
from transformers import EvalPrediction
import torch
    
# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
def multi_label_metrics(predictions, labels, threshold=0.5):
    # first, apply sigmoid on predictions which are of shape (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))
    # next, use threshold to turn them into integer predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1
    # finally, compute metrics
    y_true = labels
    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
    roc_auc = roc_auc_score(y_true, y_pred, average = 'micro')
    accuracy = accuracy_score(y_true, y_pred)
    # return as dictionary
    metrics = {'f1': f1_micro_average,
               'roc_auc': roc_auc,
               'accuracy': accuracy}
    return metrics

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, 
            tuple) else p.predictions
    result = multi_label_metrics(
        predictions=preds, 
        labels=p.label_ids)
    return result
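
To see what the micro-averaged F1 and the accuracy reported here measure, the following pure-Python sketch recomputes both on a tiny made-up batch. It is an illustration under simplified assumptions, not a replacement for the scikit-learn calls above; for multi-label input, `accuracy_score` is the subset accuracy, i.e. the fraction of examples whose whole label vector is predicted exactly.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def micro_f1_and_subset_accuracy(logits, labels, threshold=0.5):
    """Sketch of micro-F1 and subset accuracy for lists of rows."""
    tp = fp = fn = exact = 0
    for row_logits, row_labels in zip(logits, labels):
        row_pred = [1 if sigmoid(z) >= threshold else 0 for z in row_logits]
        for p, y in zip(row_pred, row_labels):
            tp += int(p == 1 and y == 1)   # counts are pooled over all labels
            fp += int(p == 1 and y == 0)
            fn += int(p == 0 and y == 1)
        exact += int(row_pred == list(row_labels))  # whole row must match
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return f1, exact / len(labels)

# Made-up logits and labels for three examples and three labels.
toy_logits = [[2.0, -1.0, 0.5], [-0.3, 1.2, -2.0], [1.0, -1.0, -1.0]]
toy_labels = [[1, 0, 0], [0, 1, 1], [1, 0, 0]]

f1, acc = micro_f1_and_subset_accuracy(toy_logits, toy_labels)
print(round(f1, 4), round(acc, 4))
```

Only the third toy example is predicted exactly, so the subset accuracy is 1/3 even though most individual label decisions are correct. This is why accuracy tends to look harsher than F1 in the multi-label setting.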

Let's verify a batch as well as a forward pass:

In [27]:
encoded_dataset['train'][0]['labels'].type()

'torch.FloatTensor'

In [28]:
encoded_dataset['train']['input_ids'][0]

tensor([  101,  2013,  1996,  2813,  2395,  3485,  1998,  9733,  2190, 23836,
         2075,  3166,  3310,  1037,  2047, 18932,  2000,  1996,  2630,  4231,
         2235,  2237,  7472,  2186,  1037,  2450,  6026,  2919,  2879,  2007,
         1037,  9055,  1998,  7916,  2004,  8254,  2868,  2003,  2025,  2112,
         1997,  5616,  1055,  2166,  2933,  2016,  2333,  2892,  2406,  2000,
         2022,  2485,  2000,  2155,  1998,  2633,  7392,  2091,  1999,  5099,
        14756,  4440,  7685,  4451,  2100,  2630,  4231,  8815,  2021,  2043,
        15607,  4827,  8088, 23205,  2080,  3065,  2039,  2007,  2010,  5898,
         6598, 14236, 13721,  2376,  1998,  1037,  3291,  2016,  5927,  2498,
         2021,  4390,  2002,  1055,  2035,  3308,  2005,  5616,  2021,  2008,
         2987,  1056,  2644,  1996,  8432,  2013, 16018,  2058, 23205,  2080,
         2987,  1056,  2292,  2108,  2767,  4224,  2094,  2131,  1999,  2010,
         2126,  2320,  2002,  4152,  2010,  2398,  1998,   102])

In [29]:
#forward pass
outputs = model(input_ids=encoded_dataset['train']['input_ids'][0].unsqueeze(0), labels=encoded_dataset['train'][0]['labels'].unsqueeze(0))
outputs

SequenceClassifierOutput(loss=tensor(0.5620, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), logits=tensor([[-0.2132,  0.5485, -0.4189, -0.5215,  0.0801, -0.0681, -0.0181, -0.7138]],
       grad_fn=<AddmmBackward0>), hidden_states=None, attentions=None)
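
The loss in this output can be reproduced by hand from the printed logits and the label vector of the first example. The sketch below computes the mean binary cross-entropy directly from logits in plain Python; it assumes the default mean reduction and uses the logits as printed (rounded to four decimals).

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))],
        # rewritten in a numerically stable form
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

# Logits copied from the forward pass output above.
logits = [-0.2132, 0.5485, -0.4189, -0.5215, 0.0801, -0.0681, -0.0181, -0.7138]
# Labels of the first training example: only 'relationships' is set.
targets = [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

loss = bce_with_logits(logits, targets)
print(round(loss, 3))
```

The value agrees with the reported `loss=tensor(0.5620)` up to rounding, which is a useful sanity check that the multi-label loss is active.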

Let's start training!

In [30]:
trainer = Trainer(
    model,
    args,
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["validation"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [31]:
trainer.train()

***** Running training *****
  Num examples = 899
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 565


| Epoch | Training Loss | Validation Loss | F1 | Roc Auc | Accuracy |
|---|---|---|---|---|---|
| 1 | No log | 0.239332 | 0.134831 | 0.536145 | 0.40404 |
| 2 | No log | 0.213655 | 0.446429 | 0.647782 | 0.535354 |
| 3 | No log | 0.201752 | 0.504202 | 0.676492 | 0.545455 |
| 4 | No log | 0.185468 | 0.552846 | 0.700588 | 0.575758 |
| 5 | 0.224800 | 0.172963 | 0.681159 | 0.777491 | 0.646465 |


***** Running Evaluation *****
  Num examples = 99
  Batch size = 8
Saving model checkpoint to bert-finetuned-sem_eval-english/checkpoint-113
Configuration saved in bert-finetuned-sem_eval-english/checkpoint-113/config.json
Model weights saved in bert-finetuned-sem_eval-english/checkpoint-113/pytorch_model.bin
tokenizer config file saved in bert-finetuned-sem_eval-english/checkpoint-113/tokenizer_config.json
Special tokens file saved in bert-finetuned-sem_eval-english/checkpoint-113/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 99
  Batch size = 8
Saving model checkpoint to bert-finetuned-sem_eval-english/checkpoint-226
Configuration saved in bert-finetuned-sem_eval-english/checkpoint-226/config.json
Model weights saved in bert-finetuned-sem_eval-english/checkpoint-226/pytorch_model.bin
tokenizer config file saved in bert-finetuned-sem_eval-english/checkpoint-226/tokenizer_config.json
Special tokens file saved in bert-finetuned-sem_eval-english/checkpoint-226/

TrainOutput(global_step=565, training_loss=0.21556996202046894, metrics={'train_runtime': 1848.6156, 'train_samples_per_second': 2.432, 'train_steps_per_second': 0.306, 'total_flos': 295686976727040.0, 'train_loss': 0.21556996202046894, 'epoch': 5.0})
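
The step counts in the log follow from the training set size, the batch size and the number of epochs. The arithmetic is sketched below; the logging interval of 500 steps is an assumption (the `Trainer` default, since the arguments above do not set it), and it would explain why the training loss shows "No log" until the last epoch.

```python
import math

num_examples = 899    # rows in the training split
batch_size = 8        # per-device batch size, single device, no accumulation
num_epochs = 5
logging_steps = 500   # assumption: default logging interval

steps_per_epoch = math.ceil(num_examples / batch_size)  # the last batch is smaller
total_steps = steps_per_epoch * num_epochs
checkpoints = [steps_per_epoch * e for e in range(1, num_epochs + 1)]

# Epoch during which the first training-loss value would be logged.
first_logged_epoch = math.ceil(logging_steps / steps_per_epoch)

print(steps_per_epoch, total_steps, checkpoints, first_logged_epoch)
```

The result matches `Total optimization steps = 565` and the checkpoint folders `checkpoint-113` and `checkpoint-226` seen in the log.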

## Evaluate

After training, we evaluate our model on the validation set.

In [32]:
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 99
  Batch size = 8


{'eval_loss': 0.17296305298805237,
 'eval_f1': 0.6811594202898551,
 'eval_roc_auc': 0.7774907811783099,
 'eval_accuracy': 0.6464646464646465,
 'eval_runtime': 11.0421,
 'eval_samples_per_second': 8.966,
 'eval_steps_per_second': 1.177,
 'epoch': 5.0}

## Inference

Let's run the model on a larger set of book descriptions:

In [34]:
final_preds = pd.read_csv('s3://ec2-jupyter-notebook-us-west-2-8c94c42abbd5478ca9a1a477613965a7/books_filtered.csv')
final_preds['description'] = final_preds['description'].apply(lambda x: re.sub(r'[^a-zA-Z ]+', ' ', x))
final_preds['description'] = final_preds['description'].apply(lambda x: x.lower())

In [35]:
final_preds.columns

Index(['Unnamed: 0', 'bookId', 'title', 'series', 'author', 'rating',
       'description', 'language', 'isbn', 'genres', 'characters', 'bookFormat',
       'edition', 'pages', 'publisher', 'publishDate', 'firstPublishDate',
       'awards', 'numRatings', 'ratingsByStars', 'likedPercent', 'setting',
       'coverImg', 'bbeScore', 'bbeVotes', 'price', 'Fiction', 'Nonfiction',
       'Young Adult', 'Childrens', 'New Adult', 'Fantasy', 'Erotica',
       'History', 'Dystopia', 'Poetry', 'Biography', 'Manga', 'Thriller',
       'Graphic Novels', 'Romance'],
      dtype='object')

In [36]:
len(final_preds)

29652

In [48]:
def predict_bert(text):
    # tokenize one description the same way as during training
    encoding = tokenizer(text, return_tensors="pt", padding="max_length", truncation=True, max_length=128)
    # move the inputs to the device the model lives on
    encoding = {k: v.to(trainer.model.device) for k,v in encoding.items()}
    # inference only, so gradients do not need to be tracked
    with torch.no_grad():
        outputs = trainer.model(**encoding)
    logits = outputs.logits
    # apply a sigmoid to get one probability per label (thresholding happens later)
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(logits.squeeze().cpu())
    return probs.tolist()

In [47]:
%%time
predict_bert(final_preds['description'][2])

CPU times: user 295 ms, sys: 46 µs, total: 295 ms
Wall time: 148 ms


[0.050868600606918335,
 0.8663569688796997,
 0.02580675669014454,
 0.035985786467790604,
 0.038903478533029556,
 0.04208512231707573,
 0.2104557901620865,
 0.3869123160839081]
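
These eight probabilities line up with the label list defined earlier. As a minimal sketch, the snippet below maps them back to label names for a given threshold; `labels_above` is a hypothetical helper name.

```python
labels = ['university', 'relationships', 'break ups', 'divorce',
          'weddings', 'death', 'family', 'friendship']

# Probabilities copied from the output above.
probs = [0.050868600606918335, 0.8663569688796997, 0.02580675669014454,
         0.035985786467790604, 0.038903478533029556, 0.04208512231707573,
         0.2104557901620865, 0.3869123160839081]

def labels_above(probs, labels, threshold=0.5):
    """Return the names of all labels whose probability reaches the threshold."""
    return [name for name, p in zip(labels, probs) if p >= threshold]

print(labels_above(probs, labels))                 # default threshold of 0.5
print(labels_above(probs, labels, threshold=0.3))  # a more permissive threshold
```

At 0.5 only 'relationships' is selected; lowering the threshold to 0.3 also admits 'friendship', which illustrates how the threshold trades precision for recall.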

In [49]:
final_preds['scores'] = final_preds['description'].apply(lambda x: predict_bert(x))
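
Scoring one description per call is simple but slow: at roughly 148 ms per call, as timed above, 29,652 rows would take on the order of an hour (a rough estimate). One common alternative is to process the texts in batches. The sketch below shows only the batching logic; `predict_batch` is a hypothetical callable, and a real one would tokenize a list of texts and run the model once per batch.

```python
from typing import Callable, Iterator, List, Sequence

def chunks(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def predict_in_batches(texts: Sequence[str],
                       predict_batch: Callable[[List[str]], List[List[float]]],
                       batch_size: int = 32) -> List[List[float]]:
    """Call `predict_batch` once per chunk and concatenate the per-text scores."""
    scores: List[List[float]] = []
    for batch in chunks(texts, batch_size):
        scores.extend(predict_batch(list(batch)))
    return scores

def fake_predict_batch(batch: List[str]) -> List[List[float]]:
    """Stand-in for a model call: returns eight dummy scores per text."""
    return [[0.0] * 8 for _ in batch]

texts = [f"description {i}" for i in range(70)]
scores = predict_in_batches(texts, fake_predict_batch, batch_size=32)
print(len(scores), len(list(chunks(texts, 32))))
```

Seventy texts are handled in three calls instead of seventy. The actual speed-up depends on the hardware and the batch size, so it is worth timing on a small sample first.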

In [50]:
final_preds

Unnamed: 0.1,Unnamed: 0,bookId,title,series,author,rating,description,language,isbn,genres,...,Erotica,History,Dystopia,Poetry,Biography,Manga,Thriller,Graphic Novels,Romance,scores
0,0,2767052-the-hunger-games,The Hunger Games,The Hunger Games #1,Suzanne Collins,4.33,winning means fame and fortune losing means ce...,English,9.78044E+12,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...",...,,,Dystopia,,,,,,Romance,"[0.018422817811369896, 0.26195046305656433, 0...."
1,1,2.Harry_Potter_and_the_Order_of_the_Phoenix,Harry Potter and the Order of the Phoenix,Harry Potter #5,"J.K. Rowling, Mary GrandPré (Illustrator)",4.50,there is a door at the end of a silent corrido...,English,9.78044E+12,"['Fantasy', 'Young Adult', 'Fiction', 'Magic',...",...,,,,,,,,,,"[0.021728631108999252, 0.04066431522369385, 0...."
2,2,2657.To_Kill_a_Mockingbird,To Kill a Mockingbird,To Kill a Mockingbird,Harper Lee,4.28,the unforgettable novel of a childhood in a sl...,English,1E+13,"['Classics', 'Fiction', 'Historical Fiction', ...",...,,,,,,,,,,"[0.024807997047901154, 0.029674483463168144, 0..."
3,3,1885.Pride_and_Prejudice,Pride and Prejudice,,"Jane Austen, Anna Quindlen (Introduction)",4.26,alternate cover edition of isbn since its imm...,English,1E+13,"['Classics', 'Fiction', 'Romance', 'Historical...",...,,,,,,,,,Romance,"[0.020264117047190666, 0.08955474197864532, 0...."
4,4,41865.Twilight,Twilight,The Twilight Saga #1,Stephenie Meyer,3.60,about three things i was absolutely positive f...,English,9.78032E+12,"['Young Adult', 'Fantasy', 'Romance', 'Vampire...",...,,,,,,,,,Romance,"[0.046299006789922714, 0.8076027035713196, 0.0..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29647,52473,11492014-fractured,Fractured,Fateful #2,Cheri Schmidt (Goodreads Author),4.00,the fateful trilogy continues with fractured ...,English,2.94001E+12,"['Vampires', 'Paranormal', 'Young Adult', 'Rom...",...,,,,,,,,,Romance,"[0.03791283443570137, 0.8515658974647522, 0.01..."
29648,52474,11836711-anasazi,Anasazi,Sense of Truth #2,Emma Michaels,4.19,anasazi sequel to the thirteenth chime by ...,English,1E+13,"['Mystery', 'Young Adult']",...,,,,,,,,,,"[0.037705618888139725, 0.830470085144043, 0.01..."
29649,52475,10815662-marked,Marked,Soul Guardians #1,Kim Richardson (Goodreads Author),3.70,readers favorite awards winner sixteen year ...,English,9.78146E+12,"['Fantasy', 'Young Adult', 'Paranormal', 'Ange...",...,,,,,,,,,Romance,"[0.04115355759859085, 0.1833418607711792, 0.03..."
29650,52476,11330278-wayward-son,Wayward Son,,"Tom Pollack (Goodreads Author), John Loftus (G...",3.85,a powerful tremor unearths an ancient secretbu...,English,9.78145E+12,"['Fiction', 'Mystery', 'Historical Fiction', '...",...,,,,,,,,,,"[0.0220066849142313, 0.03703244775533676, 0.02..."


In [38]:
test_probabilities = []
for i in range(0,len(final_preds)):
    text = final_preds['description'][i]
    encoding = tokenizer(text, return_tensors="pt", padding="max_length", truncation=True, max_length=128)
    encoding = {k: v.to(trainer.model.device) for k,v in encoding.items()}
    outputs = trainer.model(**encoding)
    logits = outputs.logits
    #print(i, logits.shape)
    # apply sigmoid + threshold
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(logits.squeeze().cpu())
    test_probabilities.append(probs.tolist())
    #predictions = np.zeros(probs.shape)
    #predictions[np.where(probs >= 0.5)] = 1
# turn predicted id's into actual label names
    #predicted_labels = [id2label[idx] for idx, label in enumerate(predictions) if label == 1.0]

KeyboardInterrupt: 

In [51]:
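bert_list = ['university', 'relationships', 'break ups', 'divorce', 'weddings', 'death', 'family', 'friendship']
#final_preds['scores'] = test_probabilities
# store the label list itself in every row; str(bert_list) would store a string,
# which zip() would later split into single characters
final_preds['labels'] = [bert_list] * len(final_preds)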

In [54]:
#from ast import literal_eval
#final_preds['labels'] = final_preds['labels'].apply(lambda row: literal_eval(row))
#final_preds['scores'] = final_preds['scores'].apply(lambda row: literal_eval(row))
final_preds['dictionary'] = final_preds.apply(lambda row: dict(zip(row['labels'], row['scores'])), axis=1)
LE_columns = final_preds['dictionary'].apply(pd.Series)
LE_columns = LE_columns > 0.5
final_preds_LE = pd.concat([final_preds, LE_columns], axis=1)
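
For a single row, the `dict(zip(...))` step and the thresholding look as follows. The sketch uses a shortened label list and made-up scores, and it also shows what goes wrong if the labels are stored as the string form of the list rather than as a list.

```python
labels = ['university', 'relationships', 'break ups']  # shortened for the example
scores = [0.0463, 0.8076, 0.0121]                      # made-up scores for one row

# Pair each label with its score, then threshold at 0.5.
row_dict = dict(zip(labels, scores))
row_flags = {name: score > 0.5 for name, score in row_dict.items()}
print(row_flags)

# Zipping over the *string form* of the list pairs the scores
# with single characters instead of label names.
broken = dict(zip(str(labels), scores))
print(broken)
```

The second dictionary has keys such as `'['` and `'u'`, so the `labels` column needs to hold real lists (or be parsed back into lists) before this step.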

In [56]:
final_preds_LE

Unnamed: 0.1,Unnamed: 0,bookId,title,series,author,rating,description,language,isbn,genres,...,labels,dictionary,university,relationships,break ups,divorce,weddings,death,family,friendship
0,0,2767052-the-hunger-games,The Hunger Games,The Hunger Games #1,Suzanne Collins,4.33,winning means fame and fortune losing means ce...,English,9.78044E+12,"['Young Adult', 'Fiction', 'Dystopia', 'Fantas...",...,"[university, relationships, break ups, divorce...","{'university': 0.018422817811369896, 'relation...",False,False,False,False,False,False,False,False
1,1,2.Harry_Potter_and_the_Order_of_the_Phoenix,Harry Potter and the Order of the Phoenix,Harry Potter #5,"J.K. Rowling, Mary GrandPré (Illustrator)",4.50,there is a door at the end of a silent corrido...,English,9.78044E+12,"['Fantasy', 'Young Adult', 'Fiction', 'Magic',...",...,"[university, relationships, break ups, divorce...","{'university': 0.021728631108999252, 'relation...",False,False,False,False,False,False,False,False
2,2,2657.To_Kill_a_Mockingbird,To Kill a Mockingbird,To Kill a Mockingbird,Harper Lee,4.28,the unforgettable novel of a childhood in a sl...,English,1E+13,"['Classics', 'Fiction', 'Historical Fiction', ...",...,"[university, relationships, break ups, divorce...","{'university': 0.024807997047901154, 'relation...",False,False,False,False,False,False,False,False
3,3,1885.Pride_and_Prejudice,Pride and Prejudice,,"Jane Austen, Anna Quindlen (Introduction)",4.26,alternate cover edition of isbn since its imm...,English,1E+13,"['Classics', 'Fiction', 'Romance', 'Historical...",...,"[university, relationships, break ups, divorce...","{'university': 0.020264117047190666, 'relation...",False,False,False,False,False,False,False,False
4,4,41865.Twilight,Twilight,The Twilight Saga #1,Stephenie Meyer,3.60,about three things i was absolutely positive f...,English,9.78032E+12,"['Young Adult', 'Fantasy', 'Romance', 'Vampire...",...,"[university, relationships, break ups, divorce...","{'university': 0.046299006789922714, 'relation...",False,True,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29647,52473,11492014-fractured,Fractured,Fateful #2,Cheri Schmidt (Goodreads Author),4.00,the fateful trilogy continues with fractured ...,English,2.94001E+12,"['Vampires', 'Paranormal', 'Young Adult', 'Rom...",...,"[university, relationships, break ups, divorce...","{'university': 0.03791283443570137, 'relations...",False,True,False,False,False,False,False,False
29648,52474,11836711-anasazi,Anasazi,Sense of Truth #2,Emma Michaels,4.19,anasazi sequel to the thirteenth chime by ...,English,1E+13,"['Mystery', 'Young Adult']",...,"[university, relationships, break ups, divorce...","{'university': 0.037705618888139725, 'relation...",False,True,False,False,False,False,False,False
29649,52475,10815662-marked,Marked,Soul Guardians #1,Kim Richardson (Goodreads Author),3.70,readers favorite awards winner sixteen year ...,English,9.78146E+12,"['Fantasy', 'Young Adult', 'Paranormal', 'Ange...",...,"[university, relationships, break ups, divorce...","{'university': 0.04115355759859085, 'relations...",False,False,False,False,False,False,False,True
29650,52476,11330278-wayward-son,Wayward Son,,"Tom Pollack (Goodreads Author), John Loftus (G...",3.85,a powerful tremor unearths an ancient secretbu...,English,9.78145E+12,"['Fiction', 'Mystery', 'Historical Fiction', '...",...,"[university, relationships, break ups, divorce...","{'university': 0.0220066849142313, 'relationsh...",False,False,False,False,False,False,False,False


In [57]:
final_preds_LE.to_csv('final_results_for_alicia.csv')

In [68]:
final_preds_LE['death'].value_counts()

False    29652
Name: death, dtype: int64

In [79]:
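LE_columns_v2 = final_preds['dictionary'].apply(pd.Series)
# threshold the raw scores at 0.1 (comparing the boolean LE_columns would only reproduce the 0.5 threshold)
LE_columns_v2 = LE_columns_v2 > 0.1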
#final_preds_LE_v2 = pd.concat([final_preds, LE_columns], axis=1)

In [83]:
LE_columns_v2['divorce'].value_counts()

False    29652
Name: divorce, dtype: int64

In [84]:
train = pd.read_csv('LDA_test.csv')

In [85]:
train.columns

Index(['Unnamed: 0.2', 'Unnamed: 0.1', 'index', 'Unnamed: 0', 'bookId',
       'title', 'series', 'author', 'rating', 'description', 'language',
       'isbn', 'genres', 'characters', 'bookFormat', 'edition', 'pages',
       'publisher', 'publishDate', 'firstPublishDate', 'awards', 'numRatings',
       'ratingsByStars', 'likedPercent', 'setting', 'coverImg', 'bbeScore',
       'bbeVotes', 'price', 'Fiction', 'Nonfiction', 'Young Adult',
       'Childrens', 'New Adult', 'Fantasy', 'Erotica', 'History', 'Dystopia',
       'Poetry', 'Biography', 'Manga', 'Thriller', 'Graphic Novels', 'Romance',
       'university', 'relationships', 'break ups', 'divorce', 'weddings',
       'death', 'family', 'friendship', 'labeled? ', 'Contains True?'],
      dtype='object')

In [86]:
len(train)

1000

In [88]:
train['description'][999]

"Santa Claus, my dear old friend, you are a thief, a traitor, a slanderer, a murderer, a liar, but worst of all you are a mockery of everything for which I stood. You have sung your last ho, ho, ho, for I am coming for your head. . . . I am coming to take back what is mine, to take back Yuletide . . .—from KrampusThe author and artist of The Child Thief returns with a modern fabulist tale of Krampus, the Lord of Yule and the dark enemy of Santa ClausOne Christmas Eve in a small hollow in Boone County, West Virginia, struggling songwriter Jesse Walker witnesses a strange spectacle: seven devilish figures chasing a man in a red suit toward a sleigh and eight reindeer. When the reindeer leap skyward, taking the sleigh, devil men, and Santa into the clouds, screams follow. Moments later, a large sack plummets back to earth, a magical sack that thrusts the down-on-his-luck singer into the clutches of the terrifying Yule Lord, Krampus. But the lines between good and evil become blurred as Je