install refers to install.py in the current directory.
install_requirements is the only function in install.py.
https://colab.research.google.com/github/nlp-with-transformers/notebooks/blob/main/01_introduction.ipynb
https://colab.research.google.com/github/credamit/nbhfbk/blob/main/01_introduction.ipynb
https://kaggle.com/kernels/welcome?src=https://github.com/credamit/nbhfbk/blob/main/src.ipynb

# install.py, install_requirements()
## requirements.txt [TensorFlow, PyTorch, Scikit-learn, matplotlib etc], Git LFS, transformers, datasets
The single function install_requirements in install.py performs the following tasks:
1. Installs the packages listed in requirements.txt for all chapters except chapter 7, using the command below:
    python -m pip install -r requirements.txt
2. Installs Git LFS, an open source Git extension that manages large and binary files in a separate "LFS store", using the command below:
    apt install git-lfs
3. For chapter 2, additionally installs the Hugging Face libraries transformers and datasets, using the command below:
    python -m pip install transformers==4.13.0 datasets==2.8.0
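
The three steps above could be sketched as follows. This is only an illustration, not the actual contents of install.py: the helper name build_install_commands is hypothetical, and it only builds the command lists so they can be inspected without installing anything.

```python
import sys


def build_install_commands(is_chapter2=False):
    """Return the commands described above as argument lists.

    Hypothetical helper for illustration; the real install.py may differ.
    """
    commands = [
        # Step 1: install the shared requirements
        [sys.executable, "-m", "pip", "install", "-r", "requirements.txt"],
        # Step 2: install Git LFS with apt, the Debian/Ubuntu package manager
        ["apt", "install", "git-lfs"],
    ]
    if is_chapter2:
        # Step 3: chapter 2 pins specific transformers and datasets versions
        commands.append(
            [sys.executable, "-m", "pip", "install",
             "transformers==4.13.0", "datasets==2.8.0"]
        )
    return commands


for cmd in build_install_commands(is_chapter2=True):
    print(" ".join(cmd))
```

Each list could then be handed to subprocess.run(cmd, check=True) to actually run the step. How chapter 7 is handled is not shown here, since the text above does not describe it.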

The contents of requirements.txt (the common part and the chapter 2 specific part) are shown below:
    transformers[tf,torch,sentencepiece,vision,optuna,sklearn,onnxruntime]==4.16.2
    datasets[audio]==1.16.1
    matplotlib
    ipywidgets
    umap-learn==0.5.1

In [None]:
# Uncomment and run this cell if you're on Colab or Kaggle
# !git clone https://github.com/nlp-with-transformers/notebooks.git
# %cd notebooks
# from install import *
# install_requirements(is_chapter2=True)
!git clone https://github.com/credamit/nbhfbk.git
%cd nbhfbk
from install import *
install_requirements(is_chapter2=True)

# utils.py, setup_chapter()
Here utils refers to utils.py in the current directory.
setup_chapter() is the main function in utils.py. It does the following:
1. Checks whether a GPU is available in the current environment, using the PyTorch API call below:
    torch.cuda.is_available()
2. Displays the versions of the Hugging Face libraries transformers and datasets used in the current program.
3. Sets the logging levels for both Hugging Face libraries.
4. Sets the plot styles used by matplotlib in our program.
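
The first two checks in the list above can be approximated with the standard library alone. The sketch below is an assumption about what such a setup step might look like, not the code in utils.py: the function name describe_environment is hypothetical, it does not touch logging or plot styles, and missing packages are reported as None instead of raising.

```python
import importlib.util
from importlib import metadata


def describe_environment(packages=("transformers", "datasets")):
    """Report GPU availability and installed library versions.

    Hypothetical helper for illustration only.
    """
    info = {"gpu_available": False, "versions": {}}

    # Import torch only if it is installed in this environment
    if importlib.util.find_spec("torch") is not None:
        import torch
        info["gpu_available"] = bool(torch.cuda.is_available())

    for name in packages:
        try:
            info["versions"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info["versions"][name] = None
    return info


print(describe_environment())
```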


In [None]:
from utils import *
setup_chapter()

# emotions (DatasetDict object): Creation
1. Library used: datasets
2. Function used: load_dataset()

In [None]:
from datasets import load_dataset
emotions = load_dataset("dair-ai/emotion")

# emotions (DatasetDict object): Review
The emotions object is an instance of the DatasetDict class from the Hugging Face datasets library. Let's look at its contents:

In [None]:
emotions

# tokenizer (AutoTokenizer object)
Retrieve the tokenizer associated with the transformer model "distilbert-base-uncased" by using:
1. The AutoTokenizer class in the Hugging Face transformers library
2. The from_pretrained() method of the AutoTokenizer class


In [None]:
from transformers import AutoTokenizer
model_ckpt = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_ckpt)

In the next three cells, let's review some characteristics of the tokenizer retrieved for "distilbert-base-uncased".
Here we use the following attributes of the tokenizer object:
    1. vocab_size
    2. model_max_length
    3. model_input_names

In [None]:
tokenizer.vocab_size

In [None]:
tokenizer.model_max_length

In [None]:
tokenizer.model_input_names

# tokenize() function
A custom function that uses the tokenizer retrieved for "distilbert-base-uncased".
    1. This function will later be used as the processing function when invoking the DatasetDict method map() on the emotions DatasetDict object.
    2. Its input is a batch of examples that must contain a column named "text".


In [None]:
def tokenize(batch):
    return tokenizer(batch["text"], padding=True, truncation=True)
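
To make the effect of padding=True and truncation=True concrete, the toy sketch below mimics them with a whitespace tokenizer. It is not DistilBERT's WordPiece tokenizer: the vocabulary, the toy_tokenize name, the padding id, and the max_length value are made up for illustration, and special tokens such as [CLS] and [SEP] are left out. Only the shape of the result (equal-length input_ids plus an attention_mask with 1 for real tokens and 0 for padding) mirrors what the real tokenizer returns.

```python
PAD_ID = 0  # illustrative padding id


def toy_tokenize(texts, max_length=4):
    """Truncate each sequence to max_length, then pad all sequences
    to the longest one in the batch (toy stand-in for a real tokenizer)."""
    vocab = {}
    encoded = []
    for text in texts:
        # Give each new word the next free id, starting at 1
        ids = [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]
        encoded.append(ids[:max_length])  # truncation

    longest = max(len(ids) for ids in encoded)
    input_ids, attention_mask = [], []
    for ids in encoded:
        n_pad = longest - len(ids)
        input_ids.append(ids + [PAD_ID] * n_pad)  # padding
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = {"text": ["i feel great", "i am so very happy today"]}
print(toy_tokenize(batch["text"]))
# {'input_ids': [[1, 2, 3, 0], [1, 4, 5, 6]], 'attention_mask': [[1, 1, 1, 0], [1, 1, 1, 1]]}
```

The first sentence is padded with one zero, and the second is cut from six tokens to four.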

# emotions_encoded (DatasetDict object)
1. The first argument of map() is a processing function; the columns it returns are added to the emotions dataset.
2. We name the resulting dataset emotions_encoded.
3. Two new columns are added: "input_ids" and "attention_mask".
4. The emotions DatasetDict had two columns, "text" and "label".
5. emotions_encoded therefore has four columns.

In [None]:
emotions_encoded = emotions.map(tokenize, batched=True, batch_size=None)
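
With batched=True and batch_size=None, map() hands each entire split to the processing function as a single batch, so every example in a split is padded to the same length. The way returned columns are merged into the dataset can be pictured with plain dictionaries. The names toy_map and add_length below are hypothetical stand-ins, not the datasets implementation.

```python
def toy_map(columns, function):
    """Call function once on all columns and merge its output as new columns."""
    new_columns = function(columns)
    n_rows = len(next(iter(columns.values())))
    for name, values in new_columns.items():
        if len(values) != n_rows:
            raise ValueError(
                f"column {name!r} has {len(values)} rows, expected {n_rows}"
            )
    # Existing columns are kept; returned columns are added alongside them
    return {**columns, **new_columns}


def add_length(batch):
    # A processing function receives a dict of lists and returns a dict of lists
    return {"n_words": [len(text.split()) for text in batch["text"]]}


split = {"text": ["i feel great", "so sad"], "label": [1, 0]}
result = toy_map(split, add_length)
print(list(result))  # ['text', 'label', 'n_words']
```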


Let's verify the columns in all three Datasets (train, validation, and test) present in the DatasetDict object emotions_encoded:

In [None]:
print(emotions_encoded["train"].column_names)
print(emotions_encoded["validation"].column_names)
print(emotions_encoded["test"].column_names)

# model (AutoModel object)
Retrieve the model object associated with the transformer model "distilbert-base-uncased" by using:
    1. The AutoModel class from the Hugging Face transformers library
    2. The from_pretrained() method of the AutoModel class

In [None]:
import torch
from transformers import AutoModel

model_ckpt = "distilbert-base-uncased"
# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(model_ckpt).to(device)

A custom function that extracts the last hidden state using the model and tokenizer objects associated with "distilbert-base-uncased", and returns the vector for the [CLS] token.
1. It will later be used as the processing function when we invoke map() on the DatasetDict object emotions_encoded.

In [None]:
def extract_hidden_states(batch):
    # Place model inputs on the GPU
    inputs = {k:v.to(device) for k,v in batch.items() 
              if k in tokenizer.model_input_names}
    # Extract last hidden states
    with torch.no_grad():
        last_hidden_state = model(**inputs).last_hidden_state
    # Return vector for [CLS] token
    return {"hidden_state": last_hidden_state[:,0].cpu().numpy()}
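
The indexing last_hidden_state[:,0] keeps, for every example in the batch, only the vector at sequence position 0 (the [CLS] token). This turns a (batch, sequence, hidden) tensor into a (batch, hidden) matrix. The same selection is written below with nested Python lists, using made-up numbers and a hidden size of 2 for readability (DistilBERT's hidden size is 768).

```python
# Toy "last hidden state": 2 examples, 3 token positions, hidden size 2
last_hidden_state = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # example 0
    [[1.1, 1.2], [1.3, 1.4], [1.5, 1.6]],  # example 1
]

# Equivalent of tensor[:, 0]: take position 0 from every example
cls_vectors = [sequence[0] for sequence in last_hidden_state]

print(cls_vectors)  # [[0.1, 0.2], [1.1, 1.2]]
```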

In [None]:
emotions_encoded.set_format("torch", 
                            columns=["input_ids", "attention_mask", "label"])

# emotions_hidden (DatasetDict object)
Use the function extract_hidden_states created above to derive the last hidden state for each example in the emotions dataset.
    1. The input is the tokenized emotions dataset (emotions_encoded), produced with the same tokenizer that the function uses.
    2. The output is the input dataset with one additional column named hidden_state.
    3. Each hidden_state value is a NumPy array residing on the CPU.

In [None]:
emotions_hidden = emotions_encoded.map(extract_hidden_states, batched=True)

In [None]:
emotions_hidden["train"].column_names

Create NumPy arrays from the emotions_hidden DatasetDict, which now has the additional hidden_state column.
We create arrays for both the training data and the validation data.

In [None]:
import numpy as np

X_train = np.array(emotions_hidden["train"]["hidden_state"])
X_valid = np.array(emotions_hidden["validation"]["hidden_state"])
y_train = np.array(emotions_hidden["train"]["label"])
y_valid = np.array(emotions_hidden["validation"]["label"])
X_train.shape, X_valid.shape

Step 1 of 2 to visualize the training set with its hidden_state column: scale the features and project them to 2D with UMAP, keeping the label of each of the six emotion categories.

In [None]:
from umap import UMAP
from sklearn.preprocessing import MinMaxScaler
import pandas as pd

# Scale features to [0,1] range
X_scaled = MinMaxScaler().fit_transform(X_train)
# Initialize and fit UMAP
mapper = UMAP(n_components=2, metric="cosine").fit(X_scaled)
# Create a DataFrame of 2D embeddings
df_emb = pd.DataFrame(mapper.embedding_, columns=["X", "Y"])
df_emb["label"] = y_train
df_emb.head()
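
The scaling step rescales every feature (column) independently with (x - min) / (max - min), so that each one lies in the [0, 1] range. A minimal pure-Python sketch of that formula is shown below. The helper name min_max_scale and the toy data are illustrative, and this is not scikit-learn's implementation.

```python
def min_max_scale(rows):
    """Rescale each column of a list of rows to the [0, 1] range."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (value - lo) / (hi - lo) if hi > lo else 0.0  # constant column -> 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return scaled


X_toy = [[1.0, 10.0], [2.0, 30.0], [3.0, 20.0]]
print(min_max_scale(X_toy))  # [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
```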

Step 2 of 2: plot the density of the 2D embeddings separately for each of the six emotion categories.

In [None]:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 3, figsize=(7,5))
axes = axes.flatten()
cmaps = ["Greys", "Blues", "Oranges", "Reds", "Purples", "Greens"]
labels = emotions["train"].features["label"].names

for i, (label, cmap) in enumerate(zip(labels, cmaps)):
    df_emb_sub = df_emb.query(f"label == {i}")
    axes[i].hexbin(df_emb_sub["X"], df_emb_sub["Y"], cmap=cmap,
                   gridsize=20, linewidths=(0,))
    axes[i].set_title(label)
    axes[i].set_xticks([]), axes[i].set_yticks([])

plt.tight_layout()
plt.show()

# LogisticRegression, fit() (a Classifier linear_model from sklearn library)
Create a classifier model using LogisticRegression.
    1. The classifier is trained on the hidden states (X_train) and labels (y_train) extracted from the emotions dataset.
    2. Training is performed by passing the training arrays to the fit() method.

In [None]:
# We increase `max_iter` so the solver has enough iterations to converge
from sklearn.linear_model import LogisticRegression

lr_clf = LogisticRegression(max_iter=3000)
lr_clf.fit(X_train, y_train)

# LogisticRegression, score() (a Classifier linear_model from sklearn library)
Evaluate the classifier by comparing its predictions on the validation set with the true labels. For classifiers, score() returns the mean accuracy.

In [None]:
lr_clf.score(X_valid, y_valid)
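
Mean accuracy is simply the fraction of predictions that match the true labels. A minimal sketch with toy labels follows. The accuracy helper and the label values are illustrative, not taken from the emotions dataset.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


y_true_toy = [0, 1, 2, 1]
y_pred_toy = [0, 1, 1, 1]
print(accuracy(y_true_toy, y_pred_toy))  # 0.75
```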

Visualize how the predictions compare with the true labels by using a normalized confusion matrix:

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

def plot_confusion_matrix(y_preds, y_true, labels):
    cm = confusion_matrix(y_true, y_preds, normalize="true")
    fig, ax = plt.subplots(figsize=(6, 6))
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
    disp.plot(cmap="Blues", values_format=".2f", ax=ax, colorbar=False)
    plt.title("Normalized confusion matrix")
    plt.show()
    
y_preds = lr_clf.predict(X_valid)
plot_confusion_matrix(y_preds, y_valid, labels)
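
With normalize="true", each row of the confusion matrix is divided by the number of samples whose true label is that row's class. The diagonal then shows per-class recall, and every row sums to 1. The pure-Python sketch below reproduces that normalization on toy labels. The function name and data are illustrative, not scikit-learn's implementation.

```python
def normalized_confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes; each row sums to 1."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        counts[t][p] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # A class with no true samples keeps an all-zero row
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


y_true_toy = [0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 1, 1, 1, 1, 0]
print(normalized_confusion_matrix(y_true_toy, y_pred_toy, n_classes=2))
# [[0.5, 0.5], [0.25, 0.75]]
```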