## Transfer Learning

The idea is to train a neural network on a large dataset with substantial computational resources and then reuse what that model has already learned as the starting point for our own task, in a process known as "fine-tuning".

This lets us train neural networks faster and with lower computational requirements, and it makes it possible to reach better performance when only a small dataset is available.

## Dataset

In [None]:
!pip install torchtext==0.10.0



In [None]:
import torch
import torchtext.legacy

In [None]:
TEXT = torchtext.legacy.data.Field(tokenize = 'spacy')
LABEL = torchtext.legacy.data.LabelField(dtype = torch.long)

train_data, test_data = torchtext.legacy.datasets.IMDB.splits(TEXT, LABEL)





In [None]:
len(train_data), len(test_data)

(25000, 25000)

In [None]:
print(vars(train_data.examples[0]))

{'text': ['"', 'Girlfight', '"', 'is', 'much', 'more', 'of', 'a', 'coming', '-', 'of', '-', 'age', '-', 'story', 'than', 'it', 'is', 'a', 'fight', 'flick', '.', 'And', 'what', 'a', 'relief', 'to', 'have', 'one', 'in', 'an', 'urban', 'school', ',', 'with', 'naturalistic', ',', 'realistic', 'Latinos', 'and', 'believable', 'use', 'of', 'Brooklyn', 'project', 'settings', '.', '<', 'br', '/><br', '/>It', 'made', 'me', 'realize', 'that', 'virtually', 'all', 'Hollywood', 'high', 'school', 'movies', 'are', 'set', 'in', 'luxurious', 'suburbia', 'or', 'small', 'towns', '.', '(', 'Even', 'the', 'somewhat', 'comparable', '"', 'Love', 'and', 'Basketball', '"', 'which', 'focused', 'on', 'teen', 'African', '-', 'Americans', 'was', 'set', 'in', 'suburbia', '.', ')', 'While', 'these', 'kids', 'share', 'some', 'of', 'the', 'same', 'peer', 'problems', ',', 'those', 'issues', 'shrink', 'compared', 'to', 'the', 'other', 'struggles', 'of', 'these', 'kids', ',', 'where', 'high', 'school', 'graduation', 'coul

## Pre-trained Embeddings

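An embedding is the vector representation of each word in the vocabulary; these vectors are what we feed to the recurrent network. Here we load pre-trained 100-dimensional GloVe vectors instead of learning them from scratch.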



In [None]:
MAX_VOCAB_SIZE = 10000

TEXT.build_vocab(train_data, 
                 max_size = MAX_VOCAB_SIZE, 
                 vectors = "glove.6B.100d", # pre-trained embeddings
                 unk_init = torch.Tensor.normal_)

LABEL.build_vocab(train_data)

len(TEXT.vocab), len(LABEL.vocab)



(10002, 2)
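The vocabulary size of 10002 is the `MAX_VOCAB_SIZE` most frequent tokens plus the two special tokens `<unk>` and `<pad>`, while the label vocabulary holds just the two classes. The following is a minimal pure-Python sketch of that idea, not torchtext's actual implementation; the `build_vocab` helper and the toy corpus are hypothetical.

```python
from collections import Counter

def build_vocab(token_lists, max_size, specials=("<unk>", "<pad>")):
    """Keep the max_size most frequent tokens and prepend the special tokens."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Most frequent first; ties are broken alphabetically to keep the result deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    itos = list(specials) + [tok for tok, _ in ranked[:max_size]]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

# Toy corpus (an assumption for illustration only).
corpus = [["a", "good", "film"], ["a", "bad", "film"], ["a", "film"]]
itos, stoi = build_vocab(corpus, max_size=2)

# Tokens outside the vocabulary fall back to the <unk> index.
unk = stoi["<unk>"]
encoded = [stoi.get(tok, unk) for tok in ["a", "great", "film"]]
print(itos, encoded)
```

With `max_size=2` the vocabulary has four entries, in the same way that `max_size=10000` yields 10002 above.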

We define the dataloaders with the torchtext.legacy.data.BucketIterator class.

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"

dataloader = {
    'train': torchtext.legacy.data.BucketIterator(train_data, batch_size=64, shuffle=True, sort_within_batch=True, device=device),
    'test': torchtext.legacy.data.BucketIterator(test_data, batch_size=64, device=device)
}
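BucketIterator groups examples of similar length into the same batch so that less padding is needed. The sketch below illustrates the principle in plain Python; it is a simplification, not the library's algorithm, and the helper names `bucket_batches` and `padding_needed` are hypothetical.

```python
import random

def bucket_batches(examples, batch_size, seed=0):
    """Sort examples by length, cut them into batches, then shuffle the batch order."""
    rng = random.Random(seed)
    ordered = sorted(examples, key=len)
    batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    rng.shuffle(batches)  # shuffles which batch comes first, not the batch contents
    return batches

def padding_needed(batches):
    """Count the pad tokens required to bring every example up to its batch maximum."""
    return sum(max(len(x) for x in b) - len(x) for b in batches for x in b)

# Toy "reviews" of very different lengths (an assumption for illustration).
examples = [["w"] * n for n in [3, 50, 4, 48, 5, 52]]
bucketed = bucket_batches(examples, batch_size=3)
naive = [examples[i:i + 3] for i in range(0, len(examples), 3)]
print(padding_needed(naive), padding_needed(bucketed))
```

In this toy case, batching in the original order needs far more padding than batching by length, which is why bucketing usually speeds up training on text of variable length.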

## Model

The model consists of an embedding layer, whose weights we will replace with the pre-trained vectors downloaded earlier, followed by recurrent and linear layers, which we will train from scratch.

In [None]:
class RNN(torch.nn.Module):
    def __init__(self, input_dim, embedding_dim=128, hidden_dim=128, output_dim=2, num_layers=2, dropout=0.2, bidirectional=False):
        super().__init__()
        self.embedding = torch.nn.Embedding(input_dim, embedding_dim)
        self.rnn = torch.nn.GRU(
            input_size=embedding_dim, 
            hidden_size=hidden_dim, 
            num_layers=num_layers, 
            dropout=dropout if num_layers > 1 else 0,
            bidirectional=bidirectional
        )
        self.fc = torch.nn.Linear(2*hidden_dim if bidirectional else hidden_dim, output_dim)
        
    def forward(self, text):
        # text = [sent len, batch size]
        # we do not train the embeddings, so no gradients are tracked for the lookup
        with torch.no_grad():
            embedded = self.embedding(text)
        # embedded = [sent len, batch size, emb dim]
        output, hidden = self.rnn(embedded)
        # output = [sent len, batch size, hid dim * num directions]
        # Note that the batch dimension is NOT the first one; this is the default
        # behavior of recurrent layers in PyTorch. You can change it by passing
        # batch_first=True to the recurrent layer (and making sure your dataloader
        # also puts the batch in the first dimension).
        # keep only the last time step; no squeeze, so a batch of size 1 keeps its batch dimension
        y = self.fc(output[-1, :, :])
        return y
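To make the note about the batch dimension concrete, here is a small shape check on a stand-alone GRU. The sizes are arbitrary values chosen for illustration, and the layer is a simplified single-layer, unidirectional one rather than the model above.

```python
import torch

sent_len, batch_size, emb_dim, hid_dim = 7, 4, 100, 128

# Default layout: [sent len, batch size, features]
seq_first = torch.nn.GRU(input_size=emb_dim, hidden_size=hid_dim)
x = torch.randn(sent_len, batch_size, emb_dim)
out, hidden = seq_first(x)
last_step = out[-1]            # [batch size, hid dim]

# batch_first=True layout: [batch size, sent len, features]
batch_first = torch.nn.GRU(input_size=emb_dim, hidden_size=hid_dim, batch_first=True)
xb = x.permute(1, 0, 2)        # move the batch to the first dimension
out_b, hidden_b = batch_first(xb)
last_step_b = out_b[:, -1]     # [batch size, hid dim]

# The hidden state keeps the layout [layers * directions, batch size, hid dim] in both cases.
print(out.shape, out_b.shape, hidden.shape, hidden_b.shape)
```

For a single-layer, unidirectional GRU the output at the last time step is the same tensor as the final hidden state, so either one can be passed to the linear layer.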

We replace the weights of the "embedding" layer with the pre-trained vectors downloaded earlier.

In [None]:
model = RNN(input_dim=len(TEXT.vocab), bidirectional=True, embedding_dim=100)

pretrained_embeddings = TEXT.vocab.vectors
model.embedding.weight.data.copy_(pretrained_embeddings)
# we zero out the weights corresponding to the <unk> and <pad> tokens
model.embedding.weight.data[TEXT.vocab.stoi[TEXT.unk_token]] = torch.zeros(100)
model.embedding.weight.data[TEXT.vocab.stoi[TEXT.pad_token]] = torch.zeros(100)

outputs = model(torch.randint(0, len(TEXT.vocab), (100, 64)))
outputs.shape

torch.Size([64, 2])
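The model above freezes the embeddings by wrapping the lookup in `torch.no_grad()`. An alternative, shown here only as a sketch, is `torch.nn.Embedding.from_pretrained` with `freeze=True`, which marks the weights as non-trainable. The random `pretrained` tensor stands in for `TEXT.vocab.vectors`, and the averaging classifier is a toy head, not the RNN used in this notebook.

```python
import torch

# Stand-in for TEXT.vocab.vectors: 6 tokens with 4 dimensions each (toy values).
pretrained = torch.randn(6, 4)

# freeze=True sets requires_grad=False on the embedding weights.
embedding = torch.nn.Embedding.from_pretrained(pretrained, freeze=True)
head = torch.nn.Linear(4, 2)

# Hand only the trainable parameters to the optimizer.
params = list(embedding.parameters()) + list(head.parameters())
trainable = [p for p in params if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)

tokens = torch.randint(0, 6, (5, 3))           # [sent len, batch size]
logits = head(embedding(tokens).mean(dim=0))   # average the word vectors, then classify
loss = torch.nn.functional.cross_entropy(logits, torch.tensor([0, 1, 0]))
loss.backward()
optimizer.step()

# The embedding weights are untouched after the update step.
print(torch.equal(embedding.weight, pretrained))
```

Setting `freeze=False` instead would let the pre-trained vectors be fine-tuned together with the rest of the network.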

## Training

In [None]:
from tqdm import tqdm
import numpy as np

def fit(model, dataloader, epochs=5):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = torch.nn.CrossEntropyLoss()
    for epoch in range(1, epochs+1):
        model.train()
        train_loss, train_acc = [], []
        bar = tqdm(dataloader['train'])
        for batch in bar:
            X, y = batch
            X, y = X.to(device), y.to(device)
            optimizer.zero_grad()
            y_hat = model(X)
            loss = criterion(y_hat, y)
            loss.backward()
            optimizer.step()
            train_loss.append(loss.item())
            acc = (y == torch.argmax(y_hat, axis=1)).sum().item() / len(y)
            train_acc.append(acc)
            bar.set_description(f"loss {np.mean(train_loss):.5f} acc {np.mean(train_acc):.5f}")
        bar = tqdm(dataloader['test'])
        val_loss, val_acc = [], []
        model.eval()
        with torch.no_grad():
            for batch in bar:
                X, y = batch
                X, y = X.to(device), y.to(device)
                y_hat = model(X)
                loss = criterion(y_hat, y)
                val_loss.append(loss.item())
                acc = (y == torch.argmax(y_hat, axis=1)).sum().item() / len(y)
                val_acc.append(acc)
                bar.set_description(f"val_loss {np.mean(val_loss):.5f} val_acc {np.mean(val_acc):.5f}")
        print(f"Epoch {epoch}/{epochs} loss {np.mean(train_loss):.5f} val_loss {np.mean(val_loss):.5f} acc {np.mean(train_acc):.5f} val_acc {np.mean(val_acc):.5f}")

In [None]:
fit(model, dataloader)

loss 0.61457 acc 0.65098: 100%|██████████| 391/391 [00:23<00:00, 16.84it/s]
val_loss 0.88599 val_acc 0.50388: 100%|██████████| 391/391 [00:27<00:00, 14.10it/s]


Epoch 1/5 loss 0.61457 val_loss 0.88599 acc 0.65098 val_acc 0.50388


loss 0.41138 acc 0.81278: 100%|██████████| 391/391 [00:23<00:00, 16.56it/s]
val_loss 0.53038 val_acc 0.77739: 100%|██████████| 391/391 [00:27<00:00, 14.39it/s]


Epoch 2/5 loss 0.41138 val_loss 0.53038 acc 0.81278 val_acc 0.77739


loss 0.33329 acc 0.85904: 100%|██████████| 391/391 [00:23<00:00, 16.93it/s]
val_loss 0.36333 val_acc 0.83838: 100%|██████████| 391/391 [00:26<00:00, 14.59it/s]


Epoch 3/5 loss 0.33329 val_loss 0.36333 acc 0.85904 val_acc 0.83838


loss 0.28268 acc 0.88349: 100%|██████████| 391/391 [00:23<00:00, 16.88it/s]
val_loss 0.30692 val_acc 0.86973: 100%|██████████| 391/391 [00:27<00:00, 14.35it/s]


Epoch 4/5 loss 0.28268 val_loss 0.30692 acc 0.88349 val_acc 0.86973


loss 0.24724 acc 0.89946: 100%|██████████| 391/391 [00:23<00:00, 16.82it/s]
val_loss 0.31385 val_acc 0.86767: 100%|██████████| 391/391 [00:27<00:00, 14.01it/s]

Epoch 5/5 loss 0.24724 val_loss 0.31385 acc 0.89946 val_acc 0.86767





## Generating predictions

In [None]:
import spacy
nlp = spacy.load("en_core_web_sm")

def predict(model, X):
    model.eval() 
    with torch.no_grad():
        # as_tensor avoids the copy-construct warning when X is already a tensor
        X = torch.as_tensor(X).to(device)
        pred = model(X)
        return pred

In [None]:
sentences = ["this film is terrible", "this film is great", "this film is good", "a waste of time"]
tokenized = [[tok.text for tok in nlp.tokenizer(sentence)] for sentence in sentences]
indexed = [[TEXT.vocab.stoi[_t] for _t in t] for t in tokenized]
tensor = torch.tensor(indexed).permute(1,0)
predictions = torch.argmax(predict(model, tensor), axis=1)
predictions



tensor([0, 1, 1, 0], device='cuda:0')
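The four sentences above happen to contain the same number of tokens, which is why they can be stacked into one tensor directly. For sentences of different lengths, the index lists have to be padded first. Below is a minimal sketch, where `pad_batch` is a hypothetical helper and the small `stoi` dictionary stands in for `TEXT.vocab.stoi`.

```python
def pad_batch(indexed, pad_index):
    """Right-pad lists of token indices to the length of the longest one."""
    max_len = max(len(seq) for seq in indexed)
    return [seq + [pad_index] * (max_len - len(seq)) for seq in indexed]

# Hypothetical toy vocabulary standing in for TEXT.vocab.stoi.
stoi = {"<unk>": 0, "<pad>": 1, "this": 2, "film": 3, "is": 4, "great": 5}

sentences = [["this", "film", "is", "great"], ["great", "film"]]
indexed = [[stoi.get(tok, stoi["<unk>"]) for tok in s] for s in sentences]
padded = pad_batch(indexed, stoi["<pad>"])
print(padded)
```

The padded lists can then be turned into a tensor and permuted to [sent len, batch size] as in the cell above. Note that the padding tokens still pass through the recurrent layer, so predictions for much shorter sentences may be affected.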