Here, install refers to install.py in the current directory.
install_requirements is one of several functions defined in install.py.

In [None]:
# Uncomment and run this cell if you're on Colab or Kaggle
# !git clone https://github.com/nlp-with-transformers/notebooks.git
# %cd notebooks
# from install import *
# install_requirements(is_chapter2=True)
!git clone https://github.com/credamit/nbhfbk.git
%cd nbhfbk
from install import *
install_requirements(is_chapter2=True)

Here, utils refers to utils.py in the current directory.
setup_chapter is one of several functions defined in utils.py.

In [None]:
from utils import *
setup_chapter()

datasets is a library from Hugging Face that is downloaded via setup_chapter() from utils.py.

In [None]:
from datasets import load_dataset
emotions = load_dataset("dair-ai/emotion")

The emotions object is an instance of the DatasetDict class from the Hugging Face datasets library.

In [None]:
emotions

In [None]:
train_ds = emotions["train"]

Retrieve the tokenizer associated with the transformer model "distilbert-base-uncased".

In [None]:
from transformers import AutoTokenizer
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

Check the properties of the tokenizer retrieved for "distilbert-base-uncased".

In [None]:
# A notebook cell only displays its last expression, so print each value
print(tokenizer.vocab_size)
print(tokenizer.model_max_length)
print(tokenizer.model_input_names)


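A custom function that uses the tokenizer retrieved for "distilbert-base-uncased".
Its input is a batch of examples that has a column named "text".
padding=True pads every example to the length of the longest one in the batch, and truncation=True cuts examples down to the maximum length the model accepts.
When applied through map(), this function tokenizes the "text" column of the whole dataset, including emotions["train"].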

In [None]:
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)
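To make the effect of padding and truncation concrete, here is a minimal pure-Python sketch. It does not call the real tokenizer: the helper pad_and_truncate, the pad ID of 0, and the toy token IDs are illustrative assumptions, and the real tokenizer additionally takes care of special tokens when it truncates.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Toy stand-in for tokenizer padding/truncation (hypothetical helper)."""
    # Truncate every sequence to at most max_length tokens
    truncated = [seq[:max_length] for seq in sequences]
    # Pad to the longest sequence that remains in the batch
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Made-up token IDs, three sequences of different lengths
toy_batch = [[101, 7, 8, 102], [101, 9, 102], [101, 1, 2, 3, 4, 5, 102]]
encoded = pad_and_truncate(toy_batch, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Every row ends up with the same length, and the attention mask records which positions are padding so the model can ignore them.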

emotions_encoded["train"] is the tokenized version of emotions["train"], produced with the tokenizer associated with the transformer model "distilbert-base-uncased".

The first argument of map() is a processing function that returns additional column names and values, which are appended to the emotions dataset. The enhanced dataset is named emotions_encoded. With batched=True and batch_size=None, each split is passed to tokenize as a single batch, so all examples in a split are padded to the same length.

In [None]:
emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)
print(emotions_encoded["train"].column_names)
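The way map() appends columns can be sketched in pure Python. This is only an illustration of the idea, not the datasets implementation: toy_map, add_lengths, and the two-row dataset are hypothetical names and data.

```python
def toy_map(dataset, function):
    """Toy version of a batched map over a column-oriented dataset (hypothetical helper)."""
    # The function sees the whole dataset as one batch: a dict of column -> list
    new_columns = function(dataset)
    # Existing columns are kept; returned columns are added (or overwritten)
    return {**dataset, **new_columns}

def add_lengths(batch):
    # Return one new column with one value per row
    return {"n_words": [len(text.split()) for text in batch["text"]]}

toy_ds = {"text": ["i feel great", "i am sad today"], "label": [1, 0]}
toy_encoded = toy_map(toy_ds, add_lengths)
print(list(toy_encoded.keys()))
```

In the same way, tokenize returns columns such as input_ids and attention_mask, which sit next to the original text and label columns afterwards.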

Retrieve the model object for the transformer model named "distilbert-base-uncased".

In [None]:
import torch
from transformers import AutoModel

model_ckpt = "distilbert-base-uncased"
# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(model_ckpt).to(device)

A custom function that extracts the last hidden state using the model and tokenizer objects associated with the transformer model named "distilbert-base-uncased".

In [None]:
def extract_hidden_states(batch):
    # Place model inputs on the same device as the model (GPU if available)
    inputs = {k: v.to(device) for k, v in batch.items()
              if k in tokenizer.model_input_names}
    # Extract last hidden states
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state
    # Return vector for [CLS] token
    return {"hidden_state": last_hidden_state[:,0].cpu().numpy()}
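The indexing last_hidden_state[:,0] keeps only the first token of every sequence. A rough pure-Python sketch of that selection, using nested lists in place of a tensor (the numbers are made up for illustration):

```python
# Toy "last hidden state" with shape (batch=2, seq_len=3, hidden=4)
last_hidden_state = [
    [[0.1, 0.2, 0.3, 0.4], [1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]],
    [[0.5, 0.6, 0.7, 0.8], [3.0, 3.0, 3.0, 3.0], [4.0, 4.0, 4.0, 4.0]],
]
# Equivalent of last_hidden_state[:, 0] on a tensor: first token of every sequence
cls_vectors = [sequence[0] for sequence in last_hidden_state]
# Number of sequences, then the hidden size of each selected vector
print(len(cls_vectors), len(cls_vectors[0]))
```

The result has one vector per input text, which is what makes it usable as a fixed-size feature vector for a classifier.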

In [None]:
emotions_encoded.set_format("torch", 
                            columns=["input_ids", "attention_mask", "label"])

Use the extract_hidden_states function defined above to derive the last hidden state for the emotions dataset.
The input is the tokenized emotions dataset, encoded with the same tokenizer that the function itself relies on.
The output is the same dataset with one additional column named hidden_state.
Each hidden_state value is the [CLS] token vector, moved back to the CPU and converted to a NumPy array.

In [None]:
emotions_hidden = emotions_encoded.map(extract_hidden_states, batched=True)

In [None]:
emotions_hidden["train"].column_names

Create NumPy arrays from the emotions dataset that now has the additional hidden_state column.
We create arrays for both the training and the validation data.

In [None]:
import numpy as np

X_train = np.array(emotions_hidden["train"]["hidden_state"])
X_valid = np.array(emotions_hidden["validation"]["hidden_state"])
y_train = np.array(emotions_hidden["train"]["label"])
y_valid = np.array(emotions_hidden["validation"]["label"])
X_train.shape, X_valid.shape

Step 1 of 2 to visualize the emotions dataset with its additional hidden_state column: scale the features and project them to 2D with UMAP.

In [None]:
from umap import UMAP
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# Scale features to [0,1] range
X_scaled = MinMaxScaler().fit_transform(X_train)
# Initialize and fit UMAP
mapper = UMAP(n_components=2, metric="cosine").fit(X_scaled)
# Create a DataFrame of 2D embeddings
df_emb = pd.DataFrame(mapper.embedding_, columns=["X", "Y"])
df_emb["label"] = y_train
df_emb.head()
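The scaling step maps every feature column to the [0, 1] range before UMAP is fitted. A minimal pure-Python sketch of that per-column rescaling follows; min_max_scale is a hypothetical helper, not the scikit-learn implementation, and sending constant columns to 0.0 is an assumption made here to avoid dividing by zero.

```python
def min_max_scale(rows):
    """Scale each column of a list of rows to [0, 1] (hypothetical helper)."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # Constant columns map to 0.0 to avoid division by zero
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, mins, maxs)
        ])
    return scaled

# Three samples with two features on very different scales
toy_features = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
print(min_max_scale(toy_features))
```

After scaling, both columns span the same range, so no single hidden-state dimension dominates simply because its raw values are larger.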

Step 2 of 2 to visualize the emotions dataset with its additional hidden_state column: plot the density of the 2D points separately for each emotion label.

In [None]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 3, figsize=(7,5))
axes = axes.flatten()
cmaps = ["Greys", "Blues", "Oranges", "Reds", "Purples", "Greens"]
labels = emotions["train"].features["label"].names

for i, (label, cmap) in enumerate(zip(labels, cmaps)):
    df_emb_sub = df_emb.query(f"label == {i}")
    axes[i].hexbin(df_emb_sub["X"], df_emb_sub["Y"], cmap=cmap,
                   gridsize=20, linewidths=(0,))
    axes[i].set_title(label)
    axes[i].set_xticks([])
    axes[i].set_yticks([])

plt.tight_layout()
plt.show()

Create a classifier using LogisticRegression.
The classifier is trained on the hidden_state column of the emotions dataset, with the label column as the target.

In [None]:
# We increase `max_iter` so the solver has enough iterations to converge
from sklearn.linear_model import LogisticRegression

lr_clf = LogisticRegression(max_iter=3000)
lr_clf.fit(X_train, y_train)

Evaluate the classifier by comparing its predictions on the validation set with the true labels of the validation set.

In [None]:
lr_clf.score(X_valid, y_valid)
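For a classifier, score() reports the mean accuracy: the fraction of examples whose predicted label matches the true label. A small pure-Python sketch of that calculation, with a hypothetical accuracy helper and made-up labels:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels (hypothetical helper)."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Made-up labels: 3 of the 5 predictions are correct
toy_true = [0, 1, 1, 3, 2]
toy_pred = [0, 1, 4, 3, 0]
print(accuracy(toy_true, toy_pred))
```

A single accuracy number hides which classes are confused with each other, which is why a per-class view is a useful follow-up.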

Visualize the comparison above with a confusion matrix.

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

def plot_confusion_matrix(y_preds, y_true, labels):
    cm = confusion_matrix(y_true, y_preds, normalize="true")
    fig, ax = plt.subplots(figsize=(6, 6))
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
    disp.plot(cmap="Blues", values_format=".2f", ax=ax, colorbar=False)
    plt.title("Normalized confusion matrix")
    plt.show()
    
y_preds = lr_clf.predict(X_valid)
plot_confusion_matrix(y_preds, y_valid, labels)
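With normalize="true", each row of the confusion matrix is divided by the number of true examples of that class, so a row reads as "of all examples with this true label, what fraction received each predicted label". A pure-Python sketch of that normalization, using a hypothetical helper and made-up labels:

```python
def normalized_confusion(y_true, y_pred, n_classes):
    """Row-normalized confusion matrix as nested lists (hypothetical helper)."""
    # counts[t][p] = number of examples with true label t predicted as p
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    normalized = []
    for row in counts:
        total = sum(row)
        # Classes with no true examples get a row of zeros (an assumption of this sketch)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized

# Class 0 has four true examples, class 1 has two
toy_true = [0, 0, 0, 0, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 0]
for row in normalized_confusion(toy_true, toy_pred, n_classes=2):
    print(row)
```

The diagonal entries are the per-class recall, which makes rare classes as easy to read as common ones.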