# PyTorch Finetune Example

TextWiser is designed with extensibility and optimizability in mind. As such, it tries to allow fine-tuning for embeddings that are compatible. The detailed list is available in the README, and we will be using the FastText word embeddings for this example.

In [1]:
import os
os.chdir('..')

As with the pipeline example, we use the news group dataset from Scikit-learn. This dataset contains 20 news groups with the aim of classifying a text document into one of these news groups. Here, we only use a subset of all the news group for demonstration purposes.

In [2]:
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
print("Train data size: {}".format(len(newsgroups_train.data)))
print("Test data size: {}".format(len(newsgroups_test.data)))

Train data size: 2034
Test data size: 1353


## Feedforward without fine-tuning

We use a simple feedforward network which uses the word embeddings as a feature extractor, and builds a linear layer on top for classification. The only non-linearity is the `max` operation done on the word pooling stage. This architecture is analogous to using multi-class Logistic Regression in Scikit-learn. For this model, we do not allow fine-tuning of the weights.

In [9]:
import torch
import torch.nn as nn

from textwiser import TextWiser, Embedding, PoolOptions, Transformation, WordOptions

featurizer = TextWiser(Embedding.Word(word_option=WordOptions.word2vec, pretrained='en'), Transformation.Pool(pool_option=PoolOptions.max), dtype=torch.float32)
featurizer.fit()  # Initialize the model
clf = nn.Sequential(featurizer, nn.Linear(300, 4))
clf

Sequential(
  (0): TextWiser(
    (model): _Sequential(
      (0): _WordEmbeddings(
        (model): Embedding(1000001, 300, sparse=True)
      )
      (1): _PoolTransformation()
    )
  )
  (1): Linear(in_features=300, out_features=4, bias=True)
)

To make the dataset more in-line with PyTorch best practices, we wrap it up in a PyTorch data loader, which takes care of batching and shuffling the data.

In [10]:
import numpy as np
from sklearn.preprocessing import LabelBinarizer
from torch.utils.data import Dataset, DataLoader

class SklearnData(Dataset):
    def __init__(self, data):
        self.X = data.data
        self.y = torch.from_numpy(data.target)

    def __getitem__(self, key):
        return self.X[key], self.y[key]
    
    def __len__(self):
        return len(self.X)

batch_size = 32
train_loader = DataLoader(dataset=SklearnData(newsgroups_train),
                          batch_size=batch_size,
                          shuffle=True)

test_loader = DataLoader(dataset=SklearnData(newsgroups_test),
                         batch_size=batch_size,
                         shuffle=False)

We then define the testing procedure. It calculates both the cross entropy loss, and the F1-score (as used to evaluate the logistic regression module).

In [11]:
from sklearn import metrics

def run_test(model):
    criterion = nn.CrossEntropyLoss()
    model.eval()
    with torch.no_grad():
        running_loss = 0
        y_preds = []
        for i, (X, y_act) in enumerate(test_loader):
            y_pred = model(X)
            y_preds.append(y_pred)
            loss = criterion(y_pred, y_act)
            running_loss += loss.item()
        print("Test loss: {}".format(running_loss / i))
        print("F1: {}".format(metrics.f1_score(newsgroups_test.target, torch.cat(y_preds).argmax(dim=1), average='macro')))

run_test(clf)

Test loss: 1.4290459070886885
F1: 0.1308254526998125


  'precision', 'predicted', average, warn_for)


Training procedure is defined in a similar way to the testing procedure, we just make sure that the gradients are properly backpropogated. We train the model for 20 epochs.

In [12]:
import torch.optim as optim

def run_train(model, epochs=20):
    optimizer = optim.SGD(model.parameters(), lr=1e-1)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for epoch in range(epochs):
        running_loss = 0
        for i, (X, y_act) in enumerate(train_loader):
            optimizer.zero_grad()
            y_pred = model(X)
            loss = criterion(y_pred, y_act)
            running_loss += loss.item()
            loss.backward()
            optimizer.step()
        print("Train loss: {}".format(running_loss / i))
        run_test(model)

run_train(clf)

Train loss: 1.3728034590917921
Test loss: 1.3427254075095767
F1: 0.15902572180794858
Train loss: 1.3013285720159138
Test loss: 1.312653694834028
F1: 0.46820457930384674
Train loss: 1.2439212912604922
Test loss: 1.2670719538416182
F1: 0.2387517378408087
Train loss: 1.180820141519819
Test loss: 1.183376139118558
F1: 0.6304858479190042
Train loss: 1.1273101492533608
Test loss: 1.1516293116978236
F1: 0.4237316569477765
Train loss: 1.0902921765569658
Test loss: 1.1317261826424372
F1: 0.3940073084957095
Train loss: 1.0442262490590413
Test loss: 1.095052597068605
F1: 0.48166709724363393
Train loss: 1.0046795852600583
Test loss: 1.0513577262560527
F1: 0.7062689281993478
Train loss: 0.9733775863571773
Test loss: 1.0385798868678866
F1: 0.5791997399619029
Train loss: 0.9416219478561765
Test loss: 1.0035120660350436
F1: 0.5490690780697897
Train loss: 0.9144637215705145
Test loss: 0.9927101759683519
F1: 0.5174257996209894
Train loss: 0.8900331466917007
Test loss: 0.9795907948698316
F1: 0.5478367245

## Feedforward with fine-tuning

Now that we have some results for a model without any fine-tuning on the word embeddings, we can try to get some idea on how fine-tuning the word vectors can have an impact on the results. For that, we set the `is_finetuneable` parameter to `True`, allowing the word embeddings to get updated. We also set the `sparse` parameter to `True` to make the computation faster, as we are using regular SGD and regular SGD supports sparse embeddings.

In [13]:
featurizer = TextWiser(Embedding.Word(word_option=WordOptions.word2vec, pretrained='en', sparse=True), Transformation.Pool(pool_option=PoolOptions.max),
                       dtype=torch.float32, is_finetuneable=True)
featurizer.fit()  # Initialize the model
clf = nn.Sequential(featurizer, nn.Linear(300, 4))

run_train(clf)

Train loss: 1.3728999618499997
Test loss: 1.4624727283205305
F1: 0.11276473955352032


  'precision', 'predicted', average, warn_for)


Train loss: 1.3004483419751365
Test loss: 1.2668068494115556
F1: 0.3533541672575323
Train loss: 1.2066675784095886
Test loss: 1.2630064572606767
F1: 0.3464395516838561
Train loss: 1.1065468021801539
Test loss: 1.1482909563041868
F1: 0.4207929749905942
Train loss: 1.001804190022605
Test loss: 1.0175516733101435
F1: 0.625650046596664
Train loss: 0.8983037112251161
Test loss: 0.928099608137494
F1: 0.6468610852745489
Train loss: 0.7932572014748104
Test loss: 0.8646905847958156
F1: 0.6327623693340367
Train loss: 0.7156903781588115
Test loss: 0.7822205083710807
F1: 0.7439485897126473
Train loss: 0.6325573684677245
Test loss: 0.7522109895944595
F1: 0.7074469397543486
Train loss: 0.5773770634144072
Test loss: 0.7108256540128163
F1: 0.684786043075462
Train loss: 0.5200203449007065
Test loss: 0.6587015042702357
F1: 0.7621398303572542
Train loss: 0.4740950380052839
Test loss: 0.637218890445573
F1: 0.7742645720492037
Train loss: 0.43589949584196486
Test loss: 0.6126030186812083
F1: 0.7325757799254

As seen above, the top F1 score with fine-tuning is better than the top F1 score without fine-tuning. Obviously, this makes the word embeddings lose some of their generality, which is necessary to get better scores in downstream tasks.