# Bert Extractor quickstart
This notebook is a quickstart for this package. It shows how to configure the extractor, extract and preprocess data, and fine-tune a BERT model.

In [1]:
import numpy as np
import torch
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler, TensorDataset
from transformers import (
    BertForSequenceClassification,
    BertTokenizerFast,
    Trainer,
    TrainingArguments,
)

from bert_extractor.extractors.reviews import ReviewsExtractor

Create a configuration dictionary. The same values can also be set in the config files.

In [2]:
params = {
    "pretrained_model_name_or_path": "bert-base-uncased",
    "sentence_col": "text",
    "labels_col": "label",
}
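
As a rough illustration of keeping these values in a file, the sketch below round-trips the same keys through JSON. The file name `reviews_config.json` and the JSON format are assumptions made for this example; the package's own config files may use a different format or layout.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical config content; the package's real config files may look different.
config_text = """
{
    "pretrained_model_name_or_path": "bert-base-uncased",
    "sentence_col": "text",
    "labels_col": "label"
}
"""

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, written to a temporary directory for the demo.
    config_path = Path(tmp_dir) / "reviews_config.json"
    config_path.write_text(config_text)
    # Load the keyword arguments back as a plain dictionary.
    loaded_params = json.loads(config_path.read_text())

print(loaded_params["pretrained_model_name_or_path"])
```

A dictionary loaded this way can be unpacked into the extractor exactly like `params`.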

### Create a ReviewsExtractor object

In [3]:
reviews = ReviewsExtractor(**params)

### Extract and preprocess data

In [4]:
url = "http://deepyeti.ucsd.edu/jianmo/amazon/categoryFilesSmall/All_Beauty_5.json.gz"
tensor_extracted = reviews.extract_preprocess(url)

Token indices sequence length is longer than the specified maximum sequence length for this model (577 > 512). Running this sequence through the model will result in indexing errors
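
The warning above means that at least one review tokenizes to more positions than `bert-base-uncased` supports (577 > 512). Such sequences are usually truncated before they reach the model. The following is a minimal pure-Python sketch of head truncation on a list of token ids. It only illustrates the idea and is not the extractor's own logic; `truncate_ids` is a hypothetical helper, and 101 and 102 are the `[CLS]` and `[SEP]` ids of the `bert-base-uncased` vocabulary.

```python
CLS_ID = 101   # [CLS] in the bert-base-uncased vocabulary
SEP_ID = 102   # [SEP] in the bert-base-uncased vocabulary
MAX_LEN = 512  # maximum sequence length supported by bert-base-uncased


def truncate_ids(token_ids, max_len=MAX_LEN):
    """Keep the first tokens and make sure the sequence still ends with [SEP]."""
    if len(token_ids) <= max_len:
        return list(token_ids)
    return list(token_ids[: max_len - 1]) + [SEP_ID]


# A fake 577-token sequence, matching the length reported in the warning.
long_sequence = [CLS_ID] + [2000] * 575 + [SEP_ID]
truncated = truncate_ids(long_sequence)
print(len(long_sequence), len(truncated))
```

When calling a `transformers` tokenizer directly, the equivalent behaviour is requested with `truncation=True` and `max_length=512`.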


In [5]:
train_encode = tensor_extracted.train_inputs
train_labels = tensor_extracted.train_labels
valid_encode = tensor_extracted.validation_inputs
valid_labels = tensor_extracted.validation_labels

## Load the data into a torch Dataset
Get the number of unique labels.

In [6]:
num_labels = len(np.unique(train_labels))
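
Counting unique values gives the size of the classification head, but for single-label classification the cross-entropy loss used by `BertForSequenceClassification` also expects class indices in the range `0` to `num_labels - 1`. If the raw labels were, for example, star ratings from 1 to 5 (an assumption; the extractor may already encode them), they could be remapped as in this sketch on toy data:

```python
import numpy as np

# Toy ratings standing in for train_labels; the real values come from the extractor.
raw_labels = np.array([5, 1, 3, 5, 2, 4, 1])

# classes holds the sorted unique values; encoded holds each label's index in classes.
classes, encoded = np.unique(raw_labels, return_inverse=True)
num_labels_toy = len(classes)

print(classes)         # original label values
print(encoded)         # contiguous ids usable as class indices
print(num_labels_toy)  # size of the classification head
```

Keeping `classes` around makes it possible to map predicted indices back to the original label values later.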

In [7]:
class ReviewsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)

Create the training and validation Dataset objects that will be passed to the Trainer.

In [8]:
train_dataset = ReviewsDataset(train_encode, train_labels)
valid_dataset = ReviewsDataset(valid_encode, valid_labels)

Instantiate a BERT text classification model

In [9]:
model = BertForSequenceClassification.from_pretrained(reviews.pretrained_model_name_or_path, num_labels=num_labels)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.bias', 'cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

Create training arguments

In [10]:
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=1,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    warmup_steps=10,
    weight_decay=0.01,
    logging_dir="./logs",
    load_best_model_at_end=True,
    logging_steps=200,
)
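
With `warmup_steps=10`, the learning rate ramps up over the first 10 optimizer steps. By default `Trainer` uses a linear schedule: after warmup, the rate decays linearly to zero at the last step. The sketch below re-implements that multiplier for illustration only; `linear_warmup_factor` is a hypothetical name, and the total of 149 steps is taken from the training log of this notebook.

```python
def linear_warmup_factor(step, warmup_steps=10, total_steps=149):
    """Multiplier applied to the base learning rate at a given optimizer step."""
    if step < warmup_steps:
        # Linear ramp from 0 to 1 during warmup.
        return step / max(1, warmup_steps)
    # Linear decay from 1 to 0 over the remaining steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


base_lr = 5e-5  # default learning_rate of TrainingArguments
for step in (0, 5, 10, 80, 149):
    print(step, base_lr * linear_warmup_factor(step))
```

The peak rate is reached at step 10, so with a single short epoch most of the run happens during the decay phase.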


Create a Trainer object and train the model

In [11]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
)
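
The `Trainer` above reports only the loss during evaluation. `Trainer` also accepts a `compute_metrics` callable that receives the model predictions together with the label ids. Here is a minimal sketch that computes accuracy; `compute_accuracy` is a hypothetical name, and the toy arrays stand in for real model output.

```python
import numpy as np


def compute_accuracy(eval_pred):
    """Return accuracy from a (logits, labels) pair, as supplied during evaluation."""
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row.
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}


# Toy logits for 4 examples and 3 classes.
toy_logits = np.array(
    [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.1, 0.2, 3.0], [1.0, 0.9, 0.8]]
)
toy_labels = np.array([0, 1, 2, 1])
print(compute_accuracy((toy_logits, toy_labels)))
```

It would be wired in by adding `compute_metrics=compute_accuracy` to the `Trainer` arguments.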

Note: the model was trained for only a few steps because no GPU was available.

In [None]:
trainer.train()

***** Running training *****
  Num examples = 4742
  Num Epochs = 1
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 149
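
The step count in this log follows from the dataset size and the batch size. As a quick check, assuming a single device and no gradient accumulation (as reported above):

```python
import math

num_examples = 4742  # "Num examples" from the log
batch_size = 32      # per_device_train_batch_size on a single device
num_epochs = 1       # num_train_epochs

# The last, partially filled batch still counts as one optimization step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(total_steps)
```

The result matches the 149 optimization steps reported by the trainer.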

