# Embedding Model with Instructions

## IMDB Dataset

The IMDB dataset is a large dataset of movie reviews used for sentiment analysis. It contains 50,000 reviews labeled as positive or negative, which can be used for binary sentiment classification tasks. 

- **Train**: Contains 25,000 labeled reviews for training
- **Test**: Contains 25,000 labeled reviews for evaluating

### Save the dataset locally (Don't need to run)

In [14]:
from datasets import load_dataset

def save_dataset(random_sample_size=5000):  # Reduce sample size for testing
    # Load the dataset
    dataset = load_dataset('stanfordnlp/imdb')
    
    # Access the train, test splits
    train_dataset = dataset['train']
    test_dataset = dataset['test']

    # Random sample the dataset, only use random_sample_size
    train_dataset = train_dataset.shuffle(seed=42).select(range(random_sample_size))
    test_dataset = test_dataset.shuffle(seed=42).select(range(random_sample_size))

    train_dataset.save_to_disk('sampled_train_dataset')
    test_dataset.save_to_disk('sampled_test_dataset')
save_dataset()

Saving the dataset (0/1 shards):   0%|          | 0/5000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/5000 [00:00<?, ? examples/s]

## Load dataset from file

In [6]:
from datasets import load_from_disk
try:
    train_dataset = load_from_disk('sampled_train_dataset')
    test_dataset = load_from_disk('sampled_test_dataset')
    
    print(">>Train Dataset loaded<<")
    # Print a sample from the loaded datasets to verify
    print("Text:", train_dataset[0]["text"], "\nLabel:", train_dataset[0]["label"])
    
    print("\n>>Test Dataset loaded<<")
    # Print a sample from the loaded test datasets to verify
    print("Text:", test_dataset[0]["text"], "\nLabel:", test_dataset[0]["label"])
except Exception as e:
    print(f"An error occurred while loading the datasets: {e}")

>>Train Dataset loaded<<
Text: There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier's plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it's the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all... 
Label: 1

>>Test Dataset loaded<<
Text: <br /><br />When I unsuspectedly rented A Thousand Acres, I thought I was in for an entertaining King Lear story and of course Michelle Pfeiffer was in it, so what could

## Text Embedding Models

- **BERT** (Bidirectional Encoder Representations from Transformers)
    - Retrieved from [Hugging Face](https://huggingface.co/google-bert/bert-base-uncased)

- **Sentence Transformer** (all-MiniLM-L12-v2)
    - Retrieved from [Hugging Face](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2)

- **T5** (Text-To-Text Transfer Transformer)
    - Retrieved from [Hugging Face](https://huggingface.co/google-t5/t5-base)

- **Instructor**
    - Retrieved from [Hugging Face](https://huggingface.co/hkunlp/instructor-large)

In [15]:
import torch
from transformers import pipeline
device = 0 if torch.cuda.is_available() else -1

model_BERT = pipeline("feature-extraction", model="google-bert/bert-base-uncased", device=device)
#model_ST = pipeline("feature-extraction", model="sentence-transformers/all-MiniLM-L12-v2", device=device)
#model_T5 = pipeline("feature-extraction", model="google-t5/t5-base", device=device)
#model_INS = pipeline("feature-extraction", model="hkunlp/instructor-large", device=device)

- Helper function using the provided embedding model to encode text
- returns Tensor

In [28]:
from tqdm.auto import tqdm
from transformers.pipelines.pt_utils import KeyDataset

def embed_dataset(model, dataset, key, truncation=False, padding=False, max_length=0):
    data = KeyDataset(dataset, key)
    pipe = model(data, return_tensors=True, truncation=truncation, padding=padding, max_length=max_length)
    for out in tqdm(pipe, desc="Encoding text:"): pass

In [11]:
tensor[0].shape

torch.Size([1, 179, 768])

Embed `train_dataset` and `test_dataset` then save them to disk
- CPU 10 it/s
- GPU 100 it/s

In [5]:
from tqdm import tqdm
import numpy as np

# Helper function to process and embed dataset
def process(dataset, feature_extractor, use_mean_pooling):
    # init list of tensors
    embeddings = []
    labels = dataset["label"]

    # Embed the texts
    embeddings = embed_dataset(dataset=dataset, model=feature_extractor, key="text", truncation=True, padding=True, max_length=512)

    # Use mean pooling to 1 dim for classification
    if(use_mean_pooling):
        tensor = tensor.mean(dim=1)
        tensor = tensor.flatten()

    # Return as numpy array for saving
    return np.array(embeddings), np.array(labels)

Augment dataset with instructions

In [34]:
import textwrap

def modify_dataset(example):
    example['text'] = "Prefix: " + example['text'] + " :Suffix"
    return example

# Apply the function to the dataset
train_dataset_aug = train_dataset.map(modify_dataset)

# Check the first review
example = train_dataset_aug['text'][0]
print(textwrap.fill(example, width=60))

Prefix: There is no relation at all between Fortier and
Profiler but the fact that both are police series about
violent crimes. Profiler looks crispy, Fortier looks
classic. Profiler plots are quite simple. Fortier's plot are
far more complicated... Fortier looks more like Prime
Suspect, if we have to spot similarities... The main
character is weak and weirdo, but have "clairvoyance".
People like to compare, to judge, to evaluate. How about
just enjoying? Funny thing too, people writing Fortier looks
American but, on the other hand, arguing they prefer
American series (!!!). Maybe it's the language, or the
spirit, but I think this series is more English than
American. By the way, the actors are really good and funny.
The acting is not superficial at all... :Suffix


This is not required to run if loading from disk

In [None]:
from datetime import datetime
import os, json
def save_embeddings(embeddings, model_name, save_path="data"):
    timestamp = datetime.now().strftime("%m%d_%H%M")
    # Calculate the average shape of tensors
    tensor_shapes = [tensor.shape for tensor in embeddings]
    avg_shape = np.mean(tensor_shapes, axis=0).tolist()
    
    embedding_info = {
        'model_name': model_name,
        'num_embeddings': len(embeddings),
        'avg_embedding_shape': avg_shape,
        'created_at': timestamp
    }
    
    os.makedirs(save_path, exist_ok=True)
    embedding_file = os.path.join(save_path, f"{model_name}_embeddings.npy")
    metadata_file = os.path.join(save_path, f"{model_name}_metadata.json")
    
    np.save(embedding_file, embeddings)

    with open(metadata_file, 'w') as f:
        json.dump(embedding_info, f)
    
    print(f"Embeddings and metadata saved for {model_name} at {timestamp}")

In [None]:
def load_embeddings(embedding_file, metadata_file):
    embeddings = np.load(embedding_file)
    with open(metadata_file, 'r') as f:
        metadata = json.load(f)
    print(f"Loaded embeddings from model: {metadata['model_name']}")
    print(f"Embedding shape: {metadata['embedding_shape']}")
    print(f"Created at: {metadata['created_at']}")
    return embeddings

In [6]:
import numpy as np
import os

def encode(train_dataset, test_dataset, feature_extractor):
    # Process and embed the training and test data
    train_embeddings, train_labels = process(train_dataset, feature_extractor, use_mean_pooling=True)
    test_embeddings, test_labels = process(test_dataset, feature_extractor, use_mean_pooling=True)

    # Save the embeddings and labels to .npy files
    np.save(os.path.join('data','train_embeddings.npy'), train_embeddings)
    np.save(os.path.join('data','train_labels.npy'), train_labels)
    np.save(os.path.join('data','test_embeddings.npy'), test_embeddings)
    np.save(os.path.join('data','test_labels.npy'), test_labels)

#encode(train_dataset, test_dataset, model_BERT)

  attn_output = torch.nn.functional.scaled_dot_product_attention(
Encoding train data:   0%|          | 1/25000 [00:00<3:12:18,  2.17it/s]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Encoding train data: 100%|██████████| 25000/25000 [04:25<00:00, 94.19it/s] 
Encoding test data: 100%|██████████| 25000/25000 [04:45<00:00, 87.57it/s] 


Load embeddings and labels from disk

In [1]:
import numpy as np
import os

# Load the embeddings and labels from .npy files
try:
    train_embeddings = np.load(os.path.join('data','train_embeddings.npy'))
    train_labels = np.load(os.path.join('data', 'train_labels.npy'))
    print("Train embeddings and labels loaded!")
    
    test_embeddings = np.load(os.path.join('data','train_embeddings.npy'))
    test_labels = np.load(os.path.join('data','test_labels.npy'))
    print("Test embeddings and labels loaded!")
except Exception as e:
    print("Load embedding failed:", e)

Train embeddings and labels loaded!
Test embeddings and labels loaded!


### Train classifiers to evaluate Embedding performance
- Linear
    - SVM
- Non-Lienar
    - MLP

Train SVM model using `train_embeddings` and `train_labels`

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Train an SVM model
svm_model = SVC(kernel='linear')
svm_model.fit(train_embeddings, train_labels)

Evaluate SVM results using `test_embeddings` and `test_labels`

In [12]:
# Predict and evaluate
predicted_labels = svm_model.predict(test_embeddings)
print(classification_report(y_true=test_labels, y_pred=predicted_labels))

              precision    recall  f1-score   support

           0       0.90      0.91      0.90     12500
           1       0.91      0.90      0.90     12500

    accuracy                           0.90     25000
   macro avg       0.90      0.90      0.90     25000
weighted avg       0.90      0.90      0.90     25000



MLP

In [4]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

# Define the MLP model
mlp_model = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, alpha=1e-4,
                          solver='sgd', verbose=1, random_state=1,
                          learning_rate_init=.1)

# Train the MLP model
mlp_model.fit(train_embeddings, train_labels)

# Predict and evaluate
predicted_labels_mlp = mlp_model.predict(test_embeddings)
print(classification_report(y_true=test_labels, y_pred=predicted_labels_mlp))


Iteration 1, loss = 0.45508312
Iteration 2, loss = 0.32360067
Iteration 3, loss = 0.30201054
Iteration 4, loss = 0.29376630
Iteration 5, loss = 0.29093661
Iteration 6, loss = 0.28485199
Iteration 7, loss = 0.28306261
Iteration 8, loss = 0.27774261
Iteration 9, loss = 0.28060523
Iteration 10, loss = 0.27239479
Iteration 11, loss = 0.26746099
Iteration 12, loss = 0.26639452
Iteration 13, loss = 0.26759380
Iteration 14, loss = 0.27147793
Iteration 15, loss = 0.26649923
Iteration 16, loss = 0.26131166
Iteration 17, loss = 0.26058729
Iteration 18, loss = 0.25630766
Iteration 19, loss = 0.25358914
Iteration 20, loss = 0.25397461
Iteration 21, loss = 0.25147743
Iteration 22, loss = 0.24943971
Iteration 23, loss = 0.25084300
Iteration 24, loss = 0.24741396
Iteration 25, loss = 0.24528136
Iteration 26, loss = 0.24756351
Iteration 27, loss = 0.24111311
Iteration 28, loss = 0.24670886
Iteration 29, loss = 0.23864409
Iteration 30, loss = 0.23626259
Iteration 31, loss = 0.23756441
Iteration 32, los