# Text Feature Extraction From Hidden States

> This notebook shows how to extract and use the hidden states of a RoBERTa language model for a variety of tasks.

- skip_showdoc: true
- skip_exec: true

In [None]:
%reload_ext autoreload
%autoreload 2

In [None]:
#| hide
from nbdev.showdoc import *

In [None]:
import os

In [None]:
# Select which GPU(s) are visible for training: a single id such as "0", or a comma-separated list such as "0,1"
os.environ['CUDA_VISIBLE_DEVICES'] = "0"

In [None]:
from that_nlp_library.text_transformation import *
from that_nlp_library.text_augmentation import *
from that_nlp_library.text_main_lm import *
from that_nlp_library.utils import seed_everything
from that_nlp_library.model_lm_main import *

comet_ml is installed but `COMET_API_KEY` is not set.


In [None]:
from underthesea import text_normalize
from functools import partial
from pathlib import Path
from transformers import AutoTokenizer, AutoConfig, AutoModelForMaskedLM, AutoModelForCausalLM
from datasets import load_dataset
import pandas as pd
import numpy as np
from transformers import DataCollatorForLanguageModeling

# Finetune a RoBERTa Language Model (with line-by-line tokenization)

## Create a TextDataLMController object

We will reuse the data and the preprocessing steps from [this tutorial](https://anhquan0412.github.io/that-nlp-library/text_main_lm.html).

To extract a feature vector from a review in the dataset, we could use a pretrained model such as RoBERTa or GPT-2 directly. However, if our dataset differs substantially from the data these models were pretrained on, it helps to finetune the pretrained model on our dataset first and extract the feature vectors afterwards. That is exactly what we are going to do now.

In [None]:
dset = load_dataset('sample_data',data_files=['Womens_Clothing_Reviews.csv'],split='train')
tdc = TextDataLMController(dset,
                         main_text='Review Text',
                         filter_dict={'Review Text': lambda x: x is not None},
                         metadatas='Title',
                         content_transformations=[text_normalize,str.lower],
                         seed=42,
                         verbose=False
                        )

Define our tokenizer for RoBERTa

In [None]:
_tokenizer = AutoTokenizer.from_pretrained('roberta-base')

Process and tokenize our dataset (using line-by-line tokenization)

In [None]:
block_size=112
tdc.process_and_tokenize(_tokenizer,line_by_line=True,max_length=block_size) 
# set max_length=-1 if you want the data collator to pad
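
To make the effect of `max_length=block_size` concrete, here is a minimal sketch of what line-by-line tokenization does to the token ids of a single review, assuming each line is truncated and padded to a fixed length. The helper `pad_or_truncate` is hypothetical and is not part of `that_nlp_library`; the special-token ids (0 for `<s>`, 2 for `</s>`, 1 for `<pad>`) are those of the `roberta-base` tokenizer.

```python
# Hypothetical helper for illustration only (not the library's implementation).
BOS_ID, EOS_ID, PAD_ID = 0, 2, 1  # roberta-base ids for <s>, </s>, <pad>

def pad_or_truncate(content_ids, max_length):
    """Wrap one line's token ids in <s> ... </s>, then truncate/pad to max_length."""
    # Reserve two positions for the special tokens
    kept = content_ids[: max_length - 2]
    input_ids = [BOS_ID] + kept + [EOS_ID]
    attention_mask = [1] * len(input_ids)
    # Right-pad so that every example has the same length
    n_pad = max_length - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad
    return {"input_ids": input_ids, "attention_mask": attention_mask}

short = pad_or_truncate([100, 101, 102], max_length=8)
long = pad_or_truncate(list(range(100, 120)), max_length=8)
print(short["input_ids"])       # [0, 100, 101, 102, 2, 1, 1, 1]
print(short["attention_mask"])  # [1, 1, 1, 1, 1, 0, 0, 0]
print(long["input_ids"])        # [0, 100, 101, 102, 103, 104, 105, 2]
```

Each review stays in its own row, so the first position of every row is always the `<s>` token; this matters later when we pick the first token's hidden vector as the feature.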

And set the data collator

In [None]:
tdc.set_data_collator(is_mlm=True,mlm_prob=0.15)
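
For intuition about what `is_mlm=True` and `mlm_prob=0.15` imply, the following is a rough, pure-Python sketch of BERT-style dynamic masking as performed by Hugging Face's `DataCollatorForLanguageModeling`: roughly 15% of the non-special tokens are selected as prediction targets; of those, 80% are replaced by `<mask>`, 10% by a random token, and 10% are left unchanged. The function `mask_tokens` and its arguments are hypothetical, and we assume the library's collator behaves like the Hugging Face one.

```python
import random

MASK_ID = 50264      # <mask> id in roberta-base
VOCAB_SIZE = 50265   # roberta-base vocabulary size
IGNORE_INDEX = -100  # label value ignored by the cross-entropy loss

def mask_tokens(input_ids, special_tokens_mask, mlm_prob=0.15, rng=None):
    """Return (masked_input_ids, labels) for one tokenized sentence."""
    rng = rng or random.Random()
    masked = list(input_ids)
    labels = [IGNORE_INDEX] * len(input_ids)
    for i, (tok, is_special) in enumerate(zip(input_ids, special_tokens_mask)):
        # Never mask special tokens; select other tokens with probability mlm_prob
        if is_special or rng.random() >= mlm_prob:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_ID                    # 80%: replace with <mask>
        elif roll < 0.9:
            masked[i] = rng.randrange(VOCAB_SIZE)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return masked, labels

ids = [0] + list(range(1000, 1020)) + [2]   # <s> + 20 made-up token ids + </s>
special = [1] + [0] * 20 + [1]
masked, labels = mask_tokens(ids, special, rng=random.Random(42))
```

Because the masking is sampled each time a batch is built, the same sentence is masked differently across epochs.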

## Initialize and train the RoBERTa Language Model

In [None]:
_config = AutoConfig.from_pretrained('roberta-base',vocab_size=len(_tokenizer))

In [None]:
_model = language_model_init(AutoModelForMaskedLM,
                             config=_config,
                             cpoint_path='roberta-base',
                             seed=42
                            )

Total parameters: 124697433
Total trainable parameters: 124697433


Create a model controller

In [None]:
controller = ModelLMController(_model,data_store=tdc,seed=42)

And we can start training our model

In [None]:
lr = 1e-4
bs=32
wd=0.01
epochs= 6
warmup_ratio=0.25
controller.fit(epochs,lr,
               batch_size=bs,
               weight_decay=wd,
               warmup_ratio=warmup_ratio,
               save_checkpoint=False,
              )




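| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1 | No log | 1.531281 | 0.65348 |
| 2 | 1.702300 | 1.501062 | 0.658028 |
| 3 | 1.702300 | 1.421605 | 0.672822 |
| 4 | 1.468500 | 1.352898 | 0.684757 |
| 5 | 1.468500 | 1.311555 | 0.692361 |
| 6 | 1.304600 | 1.29634 | 0.696147 |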


Perplexity on validation set: 3.616
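
Perplexity is the exponential of the mean cross-entropy loss (in nats). As a quick sanity check, exponentiating the last logged validation loss gives a value close to the reported one. The two need not match exactly, presumably because the masked positions are re-sampled on every evaluation pass; that explanation is an assumption about how the controller evaluates, not something stated by the library.

```python
import math

final_val_loss = 1.29634  # validation loss logged at epoch 6
perplexity = math.exp(final_val_loss)
print(f"{perplexity:.3f}")  # 3.656
```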


Finetuning from a pretrained checkpoint gives strong metrics: over six epochs, the validation loss drops from 1.531 to 1.296 and the accuracy rises from 0.653 to 0.696.

In [None]:
controller.trainer.model.save_pretrained('./sample_weights/roberta_lm_model')

## Extract hidden states from model

### From raw texts

We can extract a feature vector from a raw text

In [None]:
# including the `Title` entry, because we have it as our metadata in the data controller
inp1 = {'Title':'Flattering',
        'Review Text': "Love this skirt. The detail is amazing. Runs small I ordered a 12 I'm usually a 10, but still a little snug"
       }

There is one crucial step: set `output_hidden_states` to `True` in the config, so that the model returns its hidden states for us to extract.

In [None]:
_config = AutoConfig.from_pretrained('./sample_weights/roberta_lm_model',output_hidden_states=True)

In [None]:
trained_model = language_model_init(AutoModelForCausalLM,
                                    cpoint_path='./sample_weights/roberta_lm_model',
                                    config=_config
                                   )

controller2 = ModelLMController(trained_model,data_store=tdc,seed=42)

If you want to use `RobertaLMHeadModel` as a standalone, add `is_decoder=True.`


Total parameters: 124697433
Total trainable parameters: 124697433


When `output_hidden_states` is set to `True`, the model returns an additional output called `hidden_states`, which consists of the hidden states of each layer of the RoBERTa-base model. We only want the last layer's hidden states (index -1), and from that layer we want the hidden vector of the first token (index 0). For RoBERTa this is the `<s>` token, the counterpart of BERT's `[CLS]`.

In [None]:
hidden_from_ip1 = controller2.get_hidden_states_from_raw_text(inp1,
                                                              state_name='hidden_states',
                                                              state_idx=[-1,0]
                                                             )

In [None]:
hidden_from_ip1

Dataset({
    features: ['Title', 'Review Text', 'input_ids', 'attention_mask', 'special_tokens_mask', 'hidden_states'],
    num_rows: 1
})

The length of the hidden vector (our feature vector) from the first token of the last layer of RoBERTa is 768

In [None]:
hidden_from_ip1['hidden_states'].shape

(1, 768)
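
To see what `state_idx=[-1,0]` selects, here is a small NumPy sketch that uses random arrays in place of real model outputs. It assumes `state_idx` is applied as successive indices: first the layer (`-1`, the last one), then the token position (`0`, the first token). This is our reading of the call above, not a statement about the library's internals. For `roberta-base`, `hidden_states` has 13 entries (the embedding output plus 12 transformer layers), each of shape `(batch, seq_len, 768)`.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, seq_len, hidden = 1, 112, 768

# Stand-in for the model's `hidden_states` tuple: embedding output + 12 layers
hidden_states = tuple(
    rng.standard_normal((batch, seq_len, hidden)) for _ in range(13)
)

last_layer = hidden_states[-1]      # shape (batch, seq_len, hidden)
first_token_vec = last_layer[:, 0]  # shape (batch, hidden): vector of the <s> token
print(first_token_vec.shape)        # (1, 768)
```

Changing the second index would pick a different token position, and changing the first would pick an earlier layer; the resulting vector always has the model's hidden size.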

### From train (or validation) set

Similarly, we can extract feature vectors for all sentences in our training set

In [None]:
hidden_from_train = controller2.get_hidden_states(ds_type='train',
                                                 state_name='hidden_states',
                                                 state_idx=[-1,0]
                                                )

In [None]:
hidden_from_train

Dataset({
    features: ['input_ids', 'attention_mask', 'special_tokens_mask', 'hidden_states'],
    num_rows: 18112
})

In [None]:
hidden_from_train['hidden_states'].shape

(18112, 768)

# What can we do with feature vectors?