# LAB 9: Transfer learning

Objectives:
- apply NLI + zero-shot learning 
- build a classifier using contextual word vectors
- fine-tune DistilBERT for wine classification

**Important:** This notebook needs a GPU to work properly! If you're on Google Colab, choose Runtime > Change runtime type, and make sure Hardware Accelerator is set to GPU.

Set up the transformers library:

In [None]:
!pip -q install transformers datasets

In [2]:
import warnings

import numpy as np

import torch
import torchtext
import cpuinfo
from tqdm.auto import tqdm

from transformers import AutoTokenizer, AutoModel, pipeline
from datasets import DatasetDict, Dataset

from sklearn.metrics import f1_score, classification_report, ConfusionMatrixDisplay
from sklearn.linear_model import SGDClassifier

If you get an error here, double-check that you have the right kind of runtime (see the note above about GPUs).

In [3]:
DEVICE = 'cuda:0'
dev = torch.cuda.get_device_properties(0)
print(f"Using device: {dev.name} ({dev.total_memory/1024/1024/1024:.1f}gb)")

Using device: Tesla T4 (14.7gb)


----

## Load data

In [4]:
# Download the archive into the current directory, then unpack it from that same location
torchtext.utils.download_from_url('http://malouf.sdsu.edu/files/wine-data.tar.gz', root='./')
torchtext.utils.extract_archive('./wine-data.tar.gz', './')


['./wine-train.parquet', './wine-test.parquet']

In [5]:
ds = DatasetDict({'train':Dataset.from_parquet('wine-train.parquet'),
                  'test':Dataset.from_parquet('wine-test.parquet')})



---

## SGDClassifier on vectors

This method uses a pre-trained transformer for feature extraction. This is similar to our use of fastText vectors, but with an important difference: word vectors are static, while the vectors produced by a transformer are contextual. The vectors we get depend on which words are in the text **and** on the order in which they appear.
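
To make the static/contextual distinction concrete, here is a toy sketch. It is not a transformer: `toy_contextual_vectors` is a hypothetical stand-in that mixes each word's vector with a position-weighted average of the sentence, and the vocabulary and random vectors are made up for illustration. It only shows the property described above, namely that a static lookup returns the same vector for a word everywhere, while a context-dependent encoder does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical static embedding table: one fixed vector per word
vocab = ["river", "money", "bank"]
static = {w: rng.normal(size=4) for w in vocab}

def static_vectors(tokens):
    """Look up each token independently; context is ignored."""
    return np.stack([static[t] for t in tokens])

def toy_contextual_vectors(tokens):
    """Toy stand-in for a contextual encoder (NOT a real transformer).

    Each token's vector is blended with a position-weighted average of the
    whole sentence, so it depends on the other words and on their order.
    """
    base = static_vectors(tokens)
    positions = np.arange(1, len(tokens) + 1)[:, None]
    context = (base * positions).sum(axis=0) / positions.sum()
    return 0.5 * base + 0.5 * context

s1 = ["river", "bank"]
s2 = ["money", "bank"]
s3 = ["bank", "river"]  # same words as s1, different order

same_static = np.allclose(static_vectors(s1)[1], static_vectors(s2)[1])
same_contextual = np.allclose(toy_contextual_vectors(s1)[1], toy_contextual_vectors(s2)[1])
same_after_reorder = np.allclose(toy_contextual_vectors(s1)[1], toy_contextual_vectors(s3)[0])
print(same_static, same_contextual, same_after_reorder)
```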

The model we'll start with is called [distilbert-base-uncased](https://huggingface.co/distilbert-base-uncased). It has nearly the same performance as BERT but is much smaller and faster (useful given our limited hardware).

Things to experiment with:
- Optimize the SGDClassifier.
- Try a different model. Here is [a list of HuggingFace models that support feature extraction](https://huggingface.co/models?pipeline_tag=feature-extraction&sort=downloads).
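
One possible direction for the first experiment, sketched on synthetic data rather than on the wine vectors: standardize the features before the linear model and reweight the classes, since the wine varieties are heavily imbalanced. The class names, sizes, and cluster centers below are invented for illustration, and whether this helps on the real data is an open question to test.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the transformer vectors: three imbalanced classes
# in 16 dimensions (assumption: the real vectors would replace X and y)
sizes = {"a": 200, "b": 60, "c": 20}
centers = {"a": 0.0, "b": 3.0, "c": -3.0}
X = np.vstack([rng.normal(centers[k], 1.0, size=(n, 16)) for k, n in sizes.items()])
y = np.concatenate([[k] * n for k, n in sizes.items()])

# Scale features, then fit a linear classifier that upweights the rare classes
clf = make_pipeline(
    StandardScaler(),
    SGDClassifier(alpha=0.01, class_weight="balanced", random_state=0),
)
clf.fit(X, y)
train_acc = clf.score(X, y)
print(f"training accuracy on synthetic data: {train_acc:.2f}")
```

With the notebook's data, the same pipeline could be fit on `small['train']['vector']` and `small['train']['wine_variant']` in place of `X` and `y`.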

In [6]:
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')

In [7]:
model = AutoModel.from_pretrained('distilbert-base-uncased').to(DEVICE)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_transform.bias', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [8]:
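def vectorize(batch):
  # Tokenize the reviews, padding/truncating each one to the model's maximum length
  toks = tokenizer(batch['review_text'], padding="max_length", truncation=True, return_tensors='pt')
  # Move only the tensors the model expects (e.g. input_ids, attention_mask) to the GPU
  inputs = {k:toks[k].to(DEVICE) for k in tokenizer.model_input_names}
  # Inference only: skip gradient tracking and use mixed precision to save memory
  with torch.no_grad():
    with torch.autocast(device_type='cuda'):
      vec = model(**inputs).last_hidden_state
  # Keep the hidden state of the first token ([CLS]) as the vector for each review
  return {'vector': vec[:,0].cpu().numpy()}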
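
`vectorize` keeps only the first token's hidden state (`vec[:,0]`). A common alternative, not used in this notebook, is to average the hidden states of the real (non-padding) tokens. The NumPy sketch below compares the two on a tiny hand-made array; `cls_pool` and `mean_pool` are illustrative names, and the `hidden` and `mask` arrays stand in for `last_hidden_state` and `attention_mask`.

```python
import numpy as np

def cls_pool(hidden):
    """First-token pooling: what vec[:, 0] does."""
    return hidden[:, 0]

def mean_pool(hidden, mask):
    """Average the hidden states over non-padding positions only."""
    mask = mask[:, :, None].astype(hidden.dtype)      # (batch, seq, 1)
    summed = (hidden * mask).sum(axis=1)              # padded positions contribute 0
    counts = np.clip(mask.sum(axis=1), 1, None)       # avoid division by zero
    return summed / counts

# Two "reviews", three token positions, hidden size 2.
# The last position of the first review is padding and must be ignored.
hidden = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]],
                   [[5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]])
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])

cls_vecs = cls_pool(hidden)
mean_vecs = mean_pool(hidden, mask)
print(cls_vecs)
print(mean_vecs)
```

Swapping this into the notebook would mean applying the masked mean to the model output inside `vectorize`. Whether it beats first-token pooling for this task would have to be checked empirically.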

In [9]:
v = vectorize(ds['train'][:10])

In [10]:
small = DatasetDict({'train': ds['train'].select(range(1000)),
                     'test': ds['test']})

In [11]:
small = small.map(vectorize, batched=True, batch_size=256)



In [12]:
small.save_to_disk('wine-vecs')
!zip -r wine-vecs.zip wine-vecs/


updating: wine-vecs/ (stored 0%)
updating: wine-vecs/train/ (stored 0%)
updating: wine-vecs/train/data-00000-of-00001.arrow (deflated 12%)
updating: wine-vecs/train/state.json (deflated 38%)
updating: wine-vecs/train/dataset_info.json (deflated 60%)
updating: wine-vecs/test/ (stored 0%)
updating: wine-vecs/test/data-00000-of-00001.arrow (deflated 12%)
updating: wine-vecs/test/state.json (deflated 38%)
updating: wine-vecs/test/dataset_info.json (deflated 60%)
updating: wine-vecs/dataset_dict.json (stored 0%)


In [13]:
sgd = SGDClassifier(alpha=0.01)
sgd.fit(small['train']['vector'], small['train']['wine_variant'])

In [14]:
predict = sgd.predict(small['test']['vector'])
print(classification_report(small['test']['wine_variant'], predict))

                    precision    recall  f1-score   support

Cabernet Sauvignon       0.40      0.57      0.47      7558
        Chardonnay       0.63      0.19      0.29      4861
            Merlot       0.15      0.00      0.00      1381
        Pinot Noir       0.44      0.57      0.50      9618
          Riesling       0.40      0.36      0.38      2421
   Sauvignon Blanc       0.59      0.02      0.04      1278
             Syrah       0.20      0.31      0.24      3426
         Zinfandel       0.50      0.01      0.02      2082

          accuracy                           0.39     32625
         macro avg       0.41      0.26      0.24     32625
      weighted avg       0.43      0.39      0.35     32625
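
As a check on how the two summary rows differ, the short sketch below recomputes the macro and weighted F1 from the per-class values printed above. Because the inputs are the rounded numbers from the report, the results match the `macro avg` and `weighted avg` rows only approximately.

```python
# (f1-score, support) per class, copied from the report above
report = {
    "Cabernet Sauvignon": (0.47, 7558),
    "Chardonnay": (0.29, 4861),
    "Merlot": (0.00, 1381),
    "Pinot Noir": (0.50, 9618),
    "Riesling": (0.38, 2421),
    "Sauvignon Blanc": (0.04, 1278),
    "Syrah": (0.24, 3426),
    "Zinfandel": (0.02, 2082),
}

total = sum(n for _, n in report.values())

# Macro average: every class counts equally, so the near-zero classes
# (Merlot, Sauvignon Blanc, Zinfandel) pull it down sharply
macro_f1 = sum(f1 for f1, _ in report.values()) / len(report)

# Weighted average: each class counts in proportion to its support,
# so the large classes (Pinot Noir, Cabernet Sauvignon) dominate
weighted_f1 = sum(f1 * n for f1, n in report.values()) / total

print(f"support={total}  macro={macro_f1:.3f}  weighted={weighted_f1:.3f}")
```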



Describe what you did and what you learned from these experiments:

In [15]:
from scipy.stats.distributions import loguniform
from sklearn.model_selection import RandomizedSearchCV

In [19]:
sgd_search = RandomizedSearchCV(
    sgd,
    {
        "alpha": loguniform(1e-8, 1.0),
    },
    n_iter=20,
    n_jobs=-1,
    scoring="f1_macro",
)

sgd_search.fit(small['train']['vector'], small['train']['wine_variant'])

In [20]:
sgd_search.best_params_, sgd_search.best_score_

({'alpha': 0.0025213650423000743}, 0.24632166173848424)

In [21]:
sgd.set_params(**sgd_search.best_params_)
sgd.fit(small['train']['vector'], small['train']['wine_variant'])
predict = sgd.predict(small['test']['vector'])
print(classification_report(small['test']['wine_variant'], predict))

                    precision    recall  f1-score   support

Cabernet Sauvignon       0.75      0.16      0.27      7558
        Chardonnay       0.23      0.82      0.36      4861
            Merlot       0.08      0.00      0.00      1381
        Pinot Noir       0.51      0.32      0.40      9618
          Riesling       0.34      0.31      0.33      2421
   Sauvignon Blanc       0.50      0.01      0.02      1278
             Syrah       0.23      0.27      0.25      3426
         Zinfandel       0.29      0.17      0.22      2082

          accuracy                           0.32     32625
         macro avg       0.37      0.26      0.23     32625
      weighted avg       0.45      0.32      0.30     32625



**NOTES:**

1. Tried facebook/bart-large and facebook/bart-base and ran into out-of-memory errors.

2. Stayed with the tokenizer and model from 'distilbert-base-uncased'.

3. Tried optimizing the SGDClassifier with a randomized search over alpha. The best cross-validated f1_macro on the training data was about 0.25 at alpha ≈ 0.0025, roughly the same as the test macro F1 of 0.24 obtained earlier with alpha=0.01. Refitting with alpha ≈ 0.0025 did not improve the test results: macro F1 dipped to 0.23 and accuracy fell from 0.39 to 0.32. Because this was a random search rather than a grid search, 0.0025 is only the best of the 20 values that happened to be sampled, not necessarily the best value overall.
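
To illustrate the point in note 3, here is a minimal standard-library sketch of what drawing 20 values log-uniformly from the range 1e-8 to 1.0 looks like. `sample_loguniform` is a hypothetical helper written for this illustration, not the scipy implementation used in the search, and the seed is arbitrary.

```python
import math
import random

def sample_loguniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))

rng = random.Random(0)  # arbitrary seed, for reproducibility only
alphas = [sample_loguniform(1e-8, 1.0, rng) for _ in range(20)]

# Count how many of the 20 candidates fall in each power-of-ten bucket
per_decade = {}
for a in alphas:
    decade = math.floor(math.log10(a))
    per_decade[decade] = per_decade.get(decade, 0) + 1

for decade in sorted(per_decade):
    print(f"1e{decade} .. 1e{decade + 1}: {per_decade[decade]} candidate(s)")
```

With 20 draws spread over eight orders of magnitude, each decade receives only two or three candidates on average, so the search samples the region around the best alpha quite coarsely. A follow-up search over a narrower range, or with a larger `n_iter`, might give a more reliable estimate.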