Model taken from:
STAT 453: Deep Learning (Spring 2021)  
Instructor: Sebastian Raschka (sraschka@wisc.edu)  
GitHub repository: https://github.com/rasbt/stat453-deep-learning-ss21

---

# Fine-tuning BERT for Movie Review Classification

In [3]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [4]:
import gzip
import shutil
import time

import pandas as pd
import requests
import torch
import torch.nn.functional as F
import torchtext

import transformers
from transformers import DistilBertTokenizerFast
from transformers import DistilBertForSequenceClassification

## General Settings

In [5]:
torch.backends.cudnn.deterministic = True
RANDOM_SEED = 123
torch.manual_seed(RANDOM_SEED)
DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

NUM_EPOCHS = 3

## Download Dataset

The following cells download the IMDB movie review dataset (http://ai.stanford.edu/~amaas/data/sentiment/) for positive-negative sentiment classification as a CSV-formatted file:

In [6]:
url = "https://github.com/rasbt/python-machine-learning-book-3rd-edition/raw/master/ch08/movie_data.csv.gz"
filename = url.split("/")[-1]

r = requests.get(url)
r.raise_for_status()  # stop early if the download failed

with open(filename, "wb") as f:
    f.write(r.content)

with gzip.open('movie_data.csv.gz', 'rb') as f_in:
    with open('movie_data.csv', 'wb') as f_out:
        shutil.copyfileobj(f_in, f_out)

Check that the dataset looks okay:

In [7]:
df = pd.read_csv('movie_data.csv')
df.head()

|   | review | sentiment |
|---|--------|-----------|
| 0 | In 1974, the teenager Martha Moxley (Maggie Gr... | 1 |
| 1 | OK... so... I really like Kris Kristofferson a... | 0 |
| 2 | \*\*\*SPOILER\*\*\* Do not read this, if you think a... | 0 |
| 3 | hi for all the people who have seen this wonde... | 1 |
| 4 | I recently bought the DVD, forgetting just how... | 0 |


In [8]:
df.shape

(50000, 2)

## Split Dataset into Train/Validation/Test

In [9]:
train_texts = df.iloc[:3500]['review'].values
train_labels = df.iloc[:3500]['sentiment'].values

valid_texts = df.iloc[3500:4000]['review'].values
valid_labels = df.iloc[3500:4000]['sentiment'].values

test_texts = df.iloc[4000:5500]['review'].values
test_labels = df.iloc[4000:5500]['sentiment'].values
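
The slices above take the first 5,500 rows in file order, so the split is only as random as the row order of the CSV. A minimal sketch of an alternative, assuming a seeded shuffle before slicing is wanted; `split_indices` is a hypothetical helper and not part of the original notebook:

```python
import random


def split_indices(n_rows, n_train, n_valid, n_test, seed=123):
    """Return three disjoint lists of row indices drawn from a seeded shuffle."""
    if n_train + n_valid + n_test > n_rows:
        raise ValueError("requested split is larger than the dataset")
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:n_train + n_valid + n_test]
    return train, valid, test


# Same subset sizes as the cell above: 3,500 / 500 / 1,500 out of 50,000 rows.
train_idx, valid_idx, test_idx = split_indices(50_000, 3_500, 500, 1_500)
print(len(train_idx), len(valid_idx), len(test_idx))
```

The resulting lists could then be passed to `df.iloc[...]` in place of the fixed slices.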

## Tokenization

In [10]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')


In [11]:
train_encodings = tokenizer(list(train_texts), truncation=True, padding=True)
valid_encodings = tokenizer(list(valid_texts), truncation=True, padding=True)
test_encodings = tokenizer(list(test_texts), truncation=True, padding=True)

In [12]:
train_encodings[0]

Encoding(num_tokens=512, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
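
The `num_tokens=512` in the output above follows from the two tokenizer flags: `truncation=True` cuts each review at the model's maximum length (512 tokens for `distilbert-base-uncased`), and `padding=True` pads shorter reviews up to the longest sequence in the batch. The toy function below sketches that behavior on plain lists of token ids. The name `pad_and_truncate`, the pad id of 0, and the example ids are illustrative assumptions, not the tokenizer's actual implementation:

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    truncated = [seq[:max_length] for seq in sequences]
    target_len = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = target_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Three made-up token-id sequences of different lengths
batch = pad_and_truncate(
    [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]],
    max_length=5,
)
print(batch["input_ids"])
print(batch["attention_mask"])
```

Unlike this toy version, the real tokenizer keeps the special tokens in place when it truncates.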

## Dataset Class and Loaders

In [13]:
class IMDbDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)


train_dataset = IMDbDataset(train_encodings, train_labels)
valid_dataset = IMDbDataset(valid_encodings, valid_labels)
test_dataset = IMDbDataset(test_encodings, test_labels)

In [14]:
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=16, shuffle=True)
valid_loader = torch.utils.data.DataLoader(valid_dataset, batch_size=16, shuffle=False)
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size=16, shuffle=False)
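
With `batch_size=16` and the default `drop_last=False`, each loader yields one final, smaller batch whenever the dataset size is not a multiple of 16. The arithmetic can be checked with a short sketch; the dataset sizes are the ones from the split above, and `num_batches` is a hypothetical helper:

```python
import math


def num_batches(n_examples, batch_size=16):
    """Number of batches a loader yields when the last partial batch is kept."""
    return math.ceil(n_examples / batch_size)


for name, size in [("train", 3500), ("valid", 500), ("test", 1500)]:
    last = size % 16 or 16  # size of the final batch
    print(f"{name}: {num_batches(size)} batches, last batch has {last} examples")
```

The 219 training batches agree with the `Batch 0000/0219` counter printed in the training log.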

## Load Model

In [15]:
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased')
model.to(DEVICE)
model.train()

optim = torch.optim.Adam(model.parameters(), lr=5e-5)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'classifi

## Train Model

In [16]:
def compute_accuracy(model, data_loader, device):

    with torch.no_grad():

        correct_pred, num_examples = 0, 0

        for batch_idx, batch in enumerate(data_loader):

            ### Prepare data
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss, logits = outputs['loss'], outputs['logits']

            _, predicted_labels = torch.max(logits, 1)

            num_examples += labels.size(0)

            correct_pred += (predicted_labels == labels).sum()
    return correct_pred.float()/num_examples * 100
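
`compute_accuracy` takes the index of the largest logit in each row as the predicted class and reports the percentage of predictions that match the labels. The same computation on plain Python lists, as a sketch with made-up logits (`accuracy_from_logits` is a hypothetical name):

```python
def accuracy_from_logits(logits, labels):
    """Percentage of rows whose largest logit sits at the labeled class index."""
    correct = 0
    for row, label in zip(logits, labels):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax
        correct += int(predicted == label)
    return correct / len(labels) * 100


# Made-up logits for four reviews; columns are (negative, positive).
toy_logits = [[2.1, -1.3], [0.2, 0.9], [-0.5, 1.7], [1.0, 0.4]]
toy_labels = [0, 1, 0, 0]
print(f"{accuracy_from_logits(toy_logits, toy_labels):.2f}%")
```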

In [17]:
start_time = time.time()

for epoch in range(NUM_EPOCHS):
    
    model.train()
    
    for batch_idx, batch in enumerate(train_loader):
        
        ### Prepare data
        input_ids = batch['input_ids'].to(DEVICE)
        attention_mask = batch['attention_mask'].to(DEVICE)
        labels = batch['labels'].to(DEVICE)

        ### Forward
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss, logits = outputs['loss'], outputs['logits']
        
        ### Backward
        optim.zero_grad()
        loss.backward()
        optim.step()
        
        ### Logging
        if not batch_idx % 250:
            print (f'Epoch: {epoch+1:04d}/{NUM_EPOCHS:04d} | '
                   f'Batch {batch_idx:04d}/{len(train_loader):04d} | '
                   f'Loss: {loss:.4f}')
            
    model.eval()

    with torch.set_grad_enabled(False):
        print(f'training accuracy: '
              f'{compute_accuracy(model, train_loader, DEVICE):.2f}%'
              f'\nvalid accuracy: '
              f'{compute_accuracy(model, valid_loader, DEVICE):.2f}%')
        
    print(f'Time elapsed: {(time.time() - start_time)/60:.2f} min')
    
print(f'Total Training Time: {(time.time() - start_time)/60:.2f} min')
print(f'Test accuracy: {compute_accuracy(model, test_loader, DEVICE):.2f}%')

Epoch: 0001/0003 | Batch 0000/0219 | Loss: 0.6978
training accuracy: 96.29%
valid accuracy: 90.40%
Time elapsed: 3.54 min
Epoch: 0002/0003 | Batch 0000/0219 | Loss: 0.1998
training accuracy: 97.37%
valid accuracy: 87.40%
Time elapsed: 7.20 min
Epoch: 0003/0003 | Batch 0000/0219 | Loss: 0.2048
training accuracy: 99.54%
valid accuracy: 91.20%
Time elapsed: 10.88 min
Total Training Time: 10.88 min
Test accuracy: 89.60%


In [23]:
def inferir(review):
  model.eval()
  # Reuse the tokenizer loaded earlier instead of re-instantiating it on every call.
  # Note: reviews are cut to 128 tokens here, whereas training used up to 512.
  encoded_review1 = tokenizer.encode_plus(review, padding='max_length', max_length=128, truncation=True, return_tensors='pt')

  input_ids1 = encoded_review1['input_ids'].to(DEVICE)
  attention_mask1 = encoded_review1['attention_mask'].to(DEVICE)

  # No gradients are needed at inference time
  with torch.no_grad():
    outputs1 = model(input_ids1, attention_mask=attention_mask1)

  logits1 = outputs1['logits']
  predicted_label1 = torch.argmax(logits1).item()

  # Map class indices to display labels
  labels = {0: 'Negativo', 1: 'Positivo'}

  print(f"Review: {review}")
  print(f"Etiqueta predicha: {labels[predicted_label1]}")
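
`inferir` reports only the arg-max class. If a confidence score is also wanted, the two logits can be turned into probabilities with a softmax. A minimal sketch on plain floats, using made-up logit values rather than model output; in the notebook the equivalent call would be `F.softmax(logits1, dim=1)`:

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = {0: 'Negativo', 1: 'Positivo'}
probs = softmax([2.3, -1.9])  # made-up logits, not model output
best = max(range(len(probs)), key=probs.__getitem__)
print(f"{labels[best]} ({probs[best]:.1%})")
```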

In [24]:
review = "Guardian of the Galaxy Vol. 3, oh where do I begin? This movie managed to encapsulate everything that can go wrong in a film, creating an experience that left me utterly disappointed and disheartened. It pains me to write this review for a franchise that I once held in high regard, but the third installment falls far short of its predecessors in every possible way. From the very beginning, it becomes painfully apparent that this film is a complete misstep. The storyline is convoluted and lacking any meaningful depth. It feels as if the writers simply threw random ideas together without any coherent structure or purpose. The plot twists and turns, leaving the audience confused and disengaged."
inferir(review)
review = "The first of the Guardians films remains my favorite of the trilogy, with this entry, volume three, being second. This, the third entry, seems to lack a lot of what made the first one so very popular, that including the wide array of classic, Motown and old-time rock music that made it so memorable. Seems here they stuck mostly with light folk music, and had hardly much of any Motown that really made scenes stand out or funky. Not as much action, and when it does happen it's short-lived and not as detailed or impressive as previous entries or other Marvel films. This film focuses more on backstory, and is told in conjunction with present events that help give us more depth into one of the most loved members of the Guardians. The comedy is mostly funny, unless you've seen some of the best bits already in the trailers, but the story could have been better as it seems to be one bumbling task after another."
inferir(review)

Review: Guardian of the Galaxy Vol. 3, oh where do I begin? This movie managed to encapsulate everything that can go wrong in a film, creating an experience that left me utterly disappointed and disheartened. It pains me to write this review for a franchise that I once held in high regard, but the third installment falls far short of its predecessors in every possible way. From the very beginning, it becomes painfully apparent that this film is a complete misstep. The storyline is convoluted and lacking any meaningful depth. It feels as if the writers simply threw random ideas together without any coherent structure or purpose. The plot twists and turns, leaving the audience confused and disengaged.
Etiqueta predicha: Negativo
Review: The first of the Guardians films remains my favorite of the trilogy, with this entry, volume three, being second. This, the third entry, seems to lack a lot of what made the first one so very popular, that including the wide array of classic, Motown and o

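## Benefits of Using DistilBERT

Compared with full-size BERT, DistilBERT offers:

Smaller model size (about 40% smaller than BERT, as reported by its authors)

Faster inference (reported as about 60% faster)

Lower memory requirements

Competitive accuracy (reported to retain about 97% of BERT's language-understanding performance)

Transfer learning: the pretrained weights can be fine-tuned on a downstream task, as done above

More environmentally friendly, since training and inference need less compute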