# Week 9: Transformers

Additional references: 
- [Character Level Language Model (GPU required)](https://github.com/m2dsupsdlclass/lectures-labs/blob/master/labs/06_deep_nlp/Character_Level_Language_Model_rendered.ipynb)
- [Transformers (BERT fine-tuning): Joint Intent Classification and Slot Filling](https://github.com/m2dsupsdlclass/lectures-labs/blob/master/labs/06_deep_nlp/Transformers_Joint_Intent_Classification_Slot_Filling_rendered.ipynb)
- [Generating Language with huggingface](https://huggingface.co/blog/how-to-generate)
- [huggingface examples](https://huggingface.co/transformers/quickstart.html)

In [1]:
#setup
import warnings; warnings.simplefilter('ignore')
import pandas as pd
import numpy as np

df = pd.read_pickle('sc_cases_cleaned.pkl', compression='gzip')
df = df.assign(author_id=(df['authorship']).astype('category').cat.codes)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 768 entries, 0 to 819
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype         
---  ------          --------------  -----         
 0   case_name       768 non-null    object        
 1   opinion_type    768 non-null    object        
 2   date_standard   768 non-null    datetime64[ns]
 3   authorship      768 non-null    object        
 4   x_republican    768 non-null    float64       
 5   maj_judges      768 non-null    object        
 6   dissent_judges  768 non-null    object        
 7   topic_id        768 non-null    float64       
 8   cite_count      768 non-null    float64       
 9   opinion_text    768 non-null    object        
 10  year            768 non-null    int64         
 11  log_cite_count  768 non-null    float64       
 12  author_id       768 non-null    int8          
dtypes: datetime64[ns](1), float64(4), int64(1), int8(1), object(6)
memory usage: 78.8+ KB


## Huggingface Transformer

In [2]:
import torch
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification

# gpu or cpu?
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print (device)

cpu


Load the model from a pretrained checkpoint. 

In [3]:
model_name = 'distilbert-base-uncased' # huggingface model_ID or path to folder 
model = DistilBertForSequenceClassification.from_pretrained(model_name)
print (model)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
       

In [4]:
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)
inputs = tokenizer(df.iloc[0]['opinion_text'], return_tensors="pt")
print(inputs)

Token indices sequence length is longer than the specified maximum sequence length for this model (4669 > 512). Running this sequence through the model will result in indexing errors


{'input_ids': tensor([[  101,  3425, 18353,  ...,  3641,  1012,   102]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1]])}


In [5]:
inputs = tokenizer(df['opinion_text'].tolist(), return_tensors="pt", padding=True, truncation=True)
labels = torch.tensor(df['x_republican'].tolist()).long() 
print(inputs, labels)

{'input_ids': tensor([[  101,  3425, 18353,  ...,  6525,  3089,   102],
        [  101,  3425,  8799,  ...,  4781,  2580,   102],
        [  101,  3425,  1051,  ..., 13931,  9964,   102],
        ...,
        [  101,  3425,  8040,  ...,  2005,  1996,   102],
        [  101,  3425,  2726,  ...,  2015,  2006,   102],
        [  101,  3425,  1051,  ..., 25394, 11461,   102]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        ...,
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1],
        [1, 1, 1,  ..., 1, 1, 1]])} tensor([0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1,
        1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
        1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0,
        0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1

More information about Hugging Face tokenizers can be found [here](https://huggingface.co/transformers/main_classes/tokenizer.html).
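To make the `padding=True, truncation=True` arguments above more concrete, here is a minimal pure-Python sketch of what they do to a batch of token ids. The helper `pad_and_truncate` is hypothetical and is not part of the tokenizer API. It also simplifies one detail: the real tokenizer keeps the closing `[SEP]` token (id 102) when truncating, while this sketch just cuts off the tail.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Cut every id sequence to max_length, then pad to the longest remaining one."""
    truncated = [ids[:max_length] for ids in batch]
    width = max(len(ids) for ids in truncated)
    # pad on the right with pad_id
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding that attention should ignore
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return input_ids, attention_mask


# two toy "documents" of different lengths (ids chosen for illustration only)
batch = [[101, 3425, 18353, 1012, 102], [101, 3425, 102]]
input_ids, attention_mask = pad_and_truncate(batch, max_length=4)
print(input_ids)
print(attention_mask)
```

The attention mask is what lets the model tell padding apart from real tokens, which is why the tokenizer returns it alongside `input_ids`.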

Now we have a set of text inputs, with the party indicator `x_republican` as labels, and we can train a transformer model using a cross-entropy loss function.
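As a rough illustration of the loss, the sketch below computes the cross-entropy of a single example from its two class logits. The function `cross_entropy` is a hypothetical stand-in written for this note. The model computes the equivalent quantity internally, averaged over the batch, when `labels` are passed to it.

```python
import math


def cross_entropy(logits, label):
    """Negative log-probability of the true class under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


logits = [0.0, 4.0]  # the model strongly favours class 1
loss_if_correct = cross_entropy(logits, label=1)
loss_if_wrong = cross_entropy(logits, label=0)
print(loss_if_correct, loss_if_wrong)
```

A confident correct prediction gives a loss close to zero, while a confident wrong one is penalised heavily. With equal logits the loss is log(2), the value for an uninformed guess between two classes.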

In [6]:
unique_labels, counts = np.unique(df["x_republican"], return_counts=True)
print (unique_labels, counts)
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=len(unique_labels))

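# the pretrained encoder gets a small learning rate, the new classification head a larger one
# note: pre_classifier is also newly initialized but is not passed to the optimizer, so it is never updated
optimizer = torch.optim.Adam([
    {'params': model.distilbert.parameters(), 'lr': 1e-5},  
    {'params': model.classifier.parameters(), 'lr': 1e-3}
])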


[0. 1.] [174 594]


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'pre_classifier.bias', 'classi

In [7]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['opinion_text'].tolist(), df['x_republican'].tolist(), test_size=.2)

# generate batches: first cut each split to a multiple of the batch size (8) so that it can be reshaped
X_train, X_test, y_train, y_test = np.array(X_train[:608]), np.array(X_test[:152]), np.array(y_train[:608]), np.array(y_test[:152])
print (X_train.shape, X_test.shape, y_train.shape, y_test.shape)

X_train, X_test, y_train, y_test = X_train.reshape(-1, 8), X_test.reshape(-1, 8), y_train.reshape(-1, 8), y_test.reshape(-1, 8)
print (X_train.shape, X_test.shape, y_train.shape, y_test.shape)

X_train, X_test = X_train.tolist(), X_test.tolist()

(608,) (152,) (608,) (152,)
(76, 8) (19, 8) (76, 8) (19, 8)
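The reshape above only works when the number of examples is a multiple of 8, which is why the cell slices the arrays first and drops the leftover examples. A possible alternative, sketched here with a hypothetical helper `make_batches`, is to slice the list in steps of the batch size and accept a shorter final batch.

```python
def make_batches(items, batch_size):
    """Split a list into consecutive chunks; the last chunk may be shorter."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# placeholder strings standing in for opinion texts
texts = [f"opinion {i}" for i in range(20)]
batches = make_batches(texts, batch_size=8)
print([len(b) for b in batches])
```

Labels would be batched the same way so that each text batch stays aligned with its label batch.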


In [8]:
# train
from tqdm import tqdm

num_epochs = 1
model.to(device)  # the model has to live on the same device as its inputs
for epoch in range(num_epochs):
    model.train()
    for text, labels in tqdm(zip(X_train, y_train), total=len(X_train)):
        # prepare model input through our tokenizer
        model_inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=256)
        # place everything on the right device
        model_inputs = {k:v.to(device) for k,v in model_inputs.items()}
        # labels have to be torch long tensors on the same device
        labels = torch.tensor(labels).long().to(device)
        # now, we can perform the forward pass
        output = model(**model_inputs, labels=labels)
        loss, logits = output[:2]
        # and the backward pass
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()


100%|██████████| 76/76 [07:42<00:00,  6.09s/it]


In [9]:
predictions, targets = [], []
model.eval()


with torch.no_grad():
    for text, labels in tqdm(zip(X_test, y_test), total=len(X_test)):
        model_inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
        model_inputs = {k:v.to(device) for k,v in model_inputs.items()}

        output = model(**model_inputs)
        logits = output[0]
        # prediction is the argmax of the logits
        predictions.extend(logits.argmax(dim=1).tolist())
        targets.extend(labels)
        
from sklearn import metrics
accuracy = metrics.accuracy_score(targets, predictions)
print ("accuracy", accuracy)
classification_report = metrics.classification_report(targets, predictions)
print (classification_report)

100%|██████████| 19/19 [01:09<00:00,  3.68s/it]

accuracy 0.9868421052631579
              precision    recall  f1-score   support

         0.0       0.97      0.97      0.97        33
         1.0       0.99      0.99      0.99       119

    accuracy                           0.99       152
   macro avg       0.98      0.98      0.98       152
weighted avg       0.99      0.99      0.99       152
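Because the classes are imbalanced (119 of the 152 test opinions belong to class 1), accuracy is easier to read next to a majority-class baseline. The sketch below recomputes the headline numbers from a confusion matrix. The four counts are an assumption: they are not printed by the notebook, but were reconstructed as the combination that matches the reported support, accuracy, precision and recall.

```python
# Assumed counts, reconstructed from the report above (not printed by the notebook)
tn, fp = 32, 1    # true class 0: predicted 0, predicted 1
fn, tp = 1, 118   # true class 1: predicted 0, predicted 1

total = tn + fp + fn + tp
accuracy = (tn + tp) / total
precision_1 = tp / (tp + fp)   # of everything predicted 1, how much is truly 1
recall_1 = tp / (tp + fn)      # of everything truly 1, how much was found
# accuracy of always predicting the more frequent class
majority_baseline = max(tn + fp, fn + tp) / total

print(f"accuracy          {accuracy:.4f}")
print(f"precision (1)     {precision_1:.4f}")
print(f"recall (1)        {recall_1:.4f}")
print(f"majority baseline {majority_baseline:.4f}")
```

Always predicting class 1 would already reach roughly 78% accuracy on this split, so that figure, rather than 50%, is the natural reference point for the reported result.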






So far, we have used the PyTorch version of transformers. The library also works with Keras; a more in-depth tutorial can be found [here](https://towardsdatascience.com/working-with-hugging-face-transformers-and-tf-2-0-89bf35e3555a).

In [None]:
!pip install tensorflow

In [10]:
from transformers import TFDistilBertForSequenceClassification, DistilBertConfig
import tensorflow as tf

# note that we use TFDistilBert... instead of DistilBert...

transformer_model = TFDistilBertForSequenceClassification.from_pretrained(model_name)

# define model input layer

input_ids = tf.keras.layers.Input(shape=(256,), name='input_token', dtype='int32')
input_masks_ids = tf.keras.layers.Input(shape=(256,), name='masked_token', dtype='int32')
X = transformer_model(input_ids, input_masks_ids)
model = tf.keras.Model(inputs=[input_ids, input_masks_ids], outputs = X)

Some layers from the model checkpoint at distilbert-base-uncased were not used when initializing TFDistilBertForSequenceClassification: ['vocab_layer_norm', 'activation_13', 'vocab_projector', 'vocab_transform']
- This IS expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFDistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some layers of TFDistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier', 'pre_classifier', 'dropout_19']
You should probably TRAIN this model on a down-stream task to be able to use i

Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: <cyfunction Socket.send at 0x7efdbc9279a8> is not a module, class, method, function, traceback, frame, or code object



In [11]:
model.summary()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_token (InputLayer)        [(None, 256)]        0                                            
__________________________________________________________________________________________________
masked_token (InputLayer)       [(None, 256)]        0                                            
__________________________________________________________________________________________________
tf_distil_bert_for_sequence_cla TFSequenceClassifier 66955010    input_token[0][0]                
                                                                 masked_token[0][0]               
Total params: 66,955,010
Trainable params: 66,955,010
Non-trainable params: 0
__________________________________________________________________________________________________


In [12]:
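# the model returns raw logits, so the loss must be told not to expect probabilities
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), # cost function
              optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5), # adam with a small learning rate for fine-tuning
              metrics=['accuracy']) # compute accuracy, for scoring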


In [13]:
# tokenize X_train
X_train, X_test, y_train, y_test = train_test_split(df['opinion_text'].tolist(), df['x_republican'].tolist(), test_size=.2)
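# pad every text to max_length so that all rows have the same shape and can be stacked into one dataset
X_train_tf = [tokenizer(x, return_tensors="tf", padding='max_length', truncation=True, max_length=256) for x in X_train]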

In [14]:
input_ids, input_masks = [x["input_ids"][0].numpy() for x in X_train_tf], [x["attention_mask"][0].numpy() for x in X_train_tf]
dataset = tf.data.Dataset.from_tensor_slices(({'input_token': input_ids, 'masked_token': input_masks}, y_train)).batch(8)

In [15]:
model_info = model.fit(dataset,epochs=1)



## GPT-2 and Language Generation

In [16]:
# load GPT2

from transformers import GPT2LMHeadModel, GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", pad_token_id=tokenizer.eos_token_id)


In [17]:
input_ids = tokenizer.encode('I enjoy generating', return_tensors='pt')

# generate text until the output length (which includes the context length) reaches 50
greedy_output = model.generate(input_ids, max_length=50)

print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))


I enjoy generating and sharing my own ideas and ideas for the future. I'm always looking for new ways to make my life better. I'm always looking for ways to make my life better.

I'm always looking for ways to make my
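The repetition in the output above is typical of greedy decoding, which always appends the single most likely next token. The sketch below reproduces the effect on a toy scale. The lookup table `NEXT` and the function `greedy_decode` are hypothetical stand-ins for GPT-2 and `model.generate`, with made-up probabilities.

```python
# Hypothetical next-token probabilities; a real language model computes these from the full context
NEXT = {
    "I": {"enjoy": 0.9, "am": 0.1},
    "enjoy": {"generating": 0.6, "it": 0.4},
    "generating": {"ideas": 0.7, "text": 0.3},
    "ideas": {"and": 0.8, "<eos>": 0.2},
    "and": {"ideas": 0.9, "<eos>": 0.1},
}


def greedy_decode(start, max_length):
    """Repeatedly append the most probable next token until max_length or <eos>."""
    tokens = [start]
    while len(tokens) < max_length:
        dist = NEXT.get(tokens[-1])
        if not dist:
            break  # no known continuation
        best = max(dist, key=dist.get)
        if best == "<eos>":
            break
        tokens.append(best)
    return tokens


print(" ".join(greedy_decode("I", max_length=8)))
```

Once the most likely continuation leads back to an earlier token, greedy decoding has no way to leave the loop, which motivates the beam search and sampling variants that follow.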


In [18]:
# activate beam search and early_stopping

beam_output = model.generate(
    input_ids, 
    max_length=50, 
    num_beams=5, 
    no_repeat_ngram_size=2, # to avoid repetitions of the same word sequences
    early_stopping=True
)
print(tokenizer.decode(beam_output[0], skip_special_tokens=True))

I enjoy generating my own content, so I'm always looking for ways to improve it.

If you have any questions or comments, feel free to leave them in the comments below.
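The `no_repeat_ngram_size=2` argument prevents any bigram from appearing twice. One way to picture the mechanism is shown below: before each step, work out which tokens would complete an n-gram that already occurs in the generated text, and forbid them. The helper `banned_next_tokens` is a hypothetical illustration, not the library's implementation.

```python
def banned_next_tokens(tokens, n):
    """Return the tokens that would complete an n-gram already present in `tokens`."""
    # the last n-1 tokens form the start of the n-gram that the next token would complete
    prefix = tuple(tokens[len(tokens) - n + 1:])
    banned = set()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n - 1]) == prefix:
            banned.add(tokens[i + n - 1])
    return banned


generated = ["ideas", "and", "ideas"]
# "ideas and" has already been produced, so "and" may not follow "ideas" again
print(banned_next_tokens(generated, n=2))
```

During generation the probability of every banned token would be set to zero before the next token is chosen.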


In [19]:
# activate sampling; setting top_k=0 deactivates top-k filtering

sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_k=0
)

print(tokenizer.decode(sample_output[0], skip_special_tokens=True))


I enjoy generating breathable air supply," says Callahan. "I spend half a delivery day or so doing research locally and I think that is exactly what I want to be doing. Then once in the morning we go and get into shape three fingers
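With `do_sample=True` the next token is drawn at random in proportion to its probability, and `top_k` optionally restricts the draw to the k most likely tokens first. The sketch below mirrors that logic for a single step. The function `sample_next` and the toy distribution are hypothetical.

```python
import random


def sample_next(probs, top_k=0, rng=random):
    """Draw one token; top_k=0 means no filtering, as in the cell above."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    if top_k > 0:
        ranked = ranked[:top_k]  # keep only the k most likely tokens
    tokens = [t for t, _ in ranked]
    weights = [p for _, p in ranked]
    # random.choices normalises the weights, so no manual rescaling is needed
    return rng.choices(tokens, weights=weights, k=1)[0]


probs = {"the": 0.5, "my": 0.3, "a": 0.15, "breathable": 0.05}
rng = random.Random(0)  # fixed seed so repeated runs give the same draws
print([sample_next(probs, rng=rng) for _ in range(5)])
```

Without any filtering even very unlikely tokens are drawn occasionally, which is one reason why unrestricted sampling can drift into incoherent text.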


In [20]:
# top-p (nucleus) sampling: sample only from the smallest set of tokens whose cumulative probability reaches 0.92

sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
    top_k=0
)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))

# arguably the best generation technique

I enjoy generating the image series, so I decided to generate 2 versions of each system out of each series of particles... for their respective state states. To do this, I changed my array, and I recoiled and spun things around in the
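Top-p sampling keeps the smallest set of tokens whose cumulative probability reaches `top_p` and renormalises before drawing from it. The sketch below shows the filtering step on a toy distribution. The function `top_p_filter` and the probabilities are hypothetical.

```python
def top_p_filter(probs, top_p):
    """Keep the most likely tokens until their cumulative probability reaches top_p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    # renormalise so the kept probabilities sum to one again
    return {token: p / total for token, p in kept}


probs = {"the": 0.5, "my": 0.3, "a": 0.15, "breathable": 0.05}
nucleus = top_p_filter(probs, top_p=0.92)
print(sorted(nucleus))
```

Unlike a fixed `top_k`, the size of the kept set adapts to the shape of the distribution: a peaked distribution keeps few tokens, a flat one keeps many.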


**GPTNeo**

In [None]:
#GPTNeo

!pip install transformers==4.5.1

In [21]:
from transformers import GPT2Tokenizer, GPTNeoForCausalLM

model_name = "EleutherAI/gpt-neo-125M"

tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPTNeoForCausalLM.from_pretrained(model_name, pad_token_id=tokenizer.eos_token_id)

input_ids = tokenizer.encode('I enjoy generating', return_tensors='pt')

sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
    top_k=0
)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))


I enjoy generating freedom by deducing these complex paths into “stems.” Especially with whom, what it comes from? There is no authority to make anything come from a “stem” in this way, at least not directly


**Conditional Text Generation**

In [22]:
input_ids = tokenizer.encode('Donald Trump:', return_tensors="pt")

sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
    top_k=0
)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))



Donald Trump: The Real Crisis or False Promise

Enlarge this image toggle caption Laurie McCafferty/AP Laurie McCafferty/AP

Meanwhile, amid signs that Trump is likely to be more open to negotiation, many are now highlighting the


In [23]:
input_ids = tokenizer.encode('Joe Biden:', return_tensors="pt")

sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
    top_k=0
)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))



Joe Biden: 'The joke about Jewish kids being in a binary socialist-hymnical (homework) is about the Jewish kids. Is it true that Jews are in a binary Marxist-hymnical (homework) or do


In [24]:
input_ids = tokenizer.encode('Justice Ruth Bader Ginsburg:', return_tensors="pt")

sample_output = model.generate(
    input_ids, 
    do_sample=True, 
    max_length=50, 
    top_p=0.92, 
    top_k=0
)
print(tokenizer.decode(sample_output[0], skip_special_tokens=True))


Justice Ruth Bader Ginsburg: A Tech Guy Author Opinion 10 October 2016

We published the first two books by Chedakis on 12 November 2016. The fourth is a more conservative study and will likely be published by the same year. We
