# Embedding Model with Instructions

## Datasets
- IMDB Review
    - stanfordnlp/imdb
    - The IMDB dataset is a large collection of movie reviews used for sentiment analysis. Each review is labeled positive or negative.
        - **Train**: Contains 25,000 labeled reviews for training
        - **Test**: Contains 25,000 labeled reviews for evaluation
- Yelp Review
    - yelp_review_full
    - The Yelp reviews dataset consists of reviews from Yelp, extracted from the Yelp Dataset Challenge 2015 data. Each review is labeled from 1 to 5 stars.
        - **Train**: Contains 650,000 labeled reviews for training
        - **Test**: Contains 50,000 labeled reviews for evaluation

In [63]:
def display_dataset_info(dataset):
    info = dataset.info
    dataset_name = info.dataset_name
    splits_info = info.splits
    features = info.features
    print(f"Dataset Name: {dataset_name}")
    print("Splits Info:")
    for split_name, split_info in splits_info.items():
        num_examples = split_info.num_examples
        print(f" - Split: {split_name}, Num Examples: {num_examples}")
    print("Features:")
    for feature_name, feature_info in features.items():
        print(f" - {feature_name}: {feature_info}")

### Save Data to disk
- Retrieve Dataset
- Save (sampled) dataset to disk

In [100]:
from datasets import load_dataset
import os

def save_dataset(random_sample_size=1000, save_path='sampled_datasets'):  
    # Load the dataset
    dataset_name = 'stanfordnlp/imdb'
    dataset = load_dataset(dataset_name)
    display_dataset_info(dataset['test'])
    # Access the train, test splits
    train_dataset = dataset['train']
    test_dataset = dataset['test']

    # Randomly sample random_sample_size examples from each split (0 keeps the full split)
    if random_sample_size != 0:
        train_dataset = train_dataset.shuffle(seed=42).select(range(random_sample_size))
        test_dataset = test_dataset.shuffle(seed=42).select(range(random_sample_size))

    train_save_path = os.path.join(save_path, f"{dataset_name}_train_{random_sample_size}")
    test_save_path = os.path.join(save_path, f"{dataset_name}_test_{random_sample_size}")

    train_dataset.save_to_disk(train_save_path)
    test_dataset.save_to_disk(test_save_path)

save_dataset(random_sample_size=0)

Dataset Name: imdb
Splits Info:
 - Split: train, Num Examples: 25000
 - Split: test, Num Examples: 25000
 - Split: unsupervised, Num Examples: 50000
Features:
 - text: Value(dtype='string', id=None)
 - label: ClassLabel(names=['neg', 'pos'], id=None)


Saving the dataset (0/1 shards):   0%|          | 0/25000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/25000 [00:00<?, ? examples/s]
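
A note on where these files land: `dataset_name` contains a slash, so joining it onto `save_path` produces a nested directory rather than a single folder. The sketch below reproduces only the path construction from `save_dataset` and does not touch the disk. It uses `posixpath` so the output is the same on every platform, and the helper name `build_save_paths` is made up for illustration.

```python
import posixpath

def build_save_paths(dataset_name, random_sample_size, save_path="sampled_datasets"):
    """Mirror the path logic of save_dataset without writing anything."""
    train = posixpath.join(save_path, f"{dataset_name}_train_{random_sample_size}")
    test = posixpath.join(save_path, f"{dataset_name}_test_{random_sample_size}")
    return train, test

train_path, test_path = build_save_paths("stanfordnlp/imdb", 0)
print(train_path)  # sampled_datasets/stanfordnlp/imdb_train_0
print(test_path)   # sampled_datasets/stanfordnlp/imdb_test_0
```

This is why the loading step points at `sampled_datasets/stanfordnlp` with `imdb_train_0` and `imdb_test_0`.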

### Load Local Dataset
- Load dataset from disk

In [101]:
from datasets import load_from_disk
import os
import textwrap

def load_dataset_from_disk(save_path='sampled_datasets/stanfordnlp', train_path='imdb_train_0', test_path='imdb_test_0'):
    try:
        train_save_path = os.path.join(save_path, train_path)
        test_save_path = os.path.join(save_path, test_path)

        train_dataset = load_from_disk(train_save_path)
        test_dataset = load_from_disk(test_save_path)
        
        print(">>Train Dataset loaded<<")
        # Print a sample from the loaded datasets to verify
        print("Text:", textwrap.fill(train_dataset[0]["text"], width=60), "\nLabel:", train_dataset[0]["label"])
        
        print("\n>>Test Dataset loaded<<")
        # Print a sample from the loaded test datasets to verify
        print("Text:", textwrap.fill(test_dataset[0]["text"], width=60), "\nLabel:", test_dataset[0]["label"])
        return train_dataset, test_dataset
    except Exception as e:
        print(f"Error: {e}")
        # Re-raise: there is nothing valid to return if loading failed
        raise

train_dataset, test_dataset = load_dataset_from_disk()

>>Train Dataset loaded<<
Text: I rented I AM CURIOUS-YELLOW from my video store because of
all the controversy that surrounded it when it was first
released in 1967. I also heard that at first it was seized
by U.S. customs if it ever tried to enter this country,
therefore being a fan of films considered "controversial" I
really had to see this for myself.<br /><br />The plot is
centered around a young Swedish drama student named Lena who
wants to learn everything she can about life. In particular
she wants to focus her attentions to making some sort of
documentary on what the average Swede thought about certain
political issues such as the Vietnam War and race issues in
the United States. In between asking politicians and
ordinary denizens of Stockholm about their opinions on
politics, she has sex with her drama teacher, classmates,
and married men.<br /><br />What kills me about I AM
CURIOUS-YELLOW is that 40 years ago, this was considered
pornographic. Really, the sex and nudity scen

## Text Embedding Models

- **BERT** (Bidirectional Encoder Representations from Transformers)
    - Retrieved from [Hugging Face](https://huggingface.co/google-bert/bert-base-uncased)

- **Sentence Transformer** (all-MiniLM-L12-v2)
    - Retrieved from [Hugging Face](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2)

- **T5** (Text-To-Text Transfer Transformer)
    - Retrieved from [Hugging Face](https://huggingface.co/google-t5/t5-base)

- **Instructor**
    - Retrieved from [Hugging Face](https://huggingface.co/hkunlp/instructor-large)

In [102]:
import torch
from transformers import pipeline
device = 0 if torch.cuda.is_available() else -1

model_BERT = pipeline("feature-extraction", model="google-bert/bert-base-uncased", device=device)
#model_ST = pipeline("feature-extraction", model="sentence-transformers/all-MiniLM-L12-v2", device=device)
#model_T5 = pipeline("feature-extraction", model="google-t5/t5-base", device=device)
#model_INS = pipeline("feature-extraction", model="hkunlp/instructor-large", device=device)

- Helper function that encodes text with the provided embedding model
- Returns a NumPy array of embeddings and a NumPy array of labels

In [78]:
from tqdm.auto import tqdm
from transformers.pipelines.pt_utils import KeyDataset
import numpy as np

def process_dataset(model, dataset, key, truncation=False, padding=False, max_length=0, use_mean_pooling=False):
    data = KeyDataset(dataset, key)
    pipe = model(data, return_tensors=True, truncation=truncation, padding=padding, max_length=max_length)
    embeddings=[]
    for tensor in tqdm(pipe, desc="Encoding text"):
        # Mean-pool over the token axis: [1, tokens, hidden] -> [hidden]
        if use_mean_pooling:
            tensor = tensor.mean(dim=1).flatten()
        embeddings.append(tensor)
    return np.array(embeddings), np.array(dataset["label"])
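
For reference, the pooling step in `process_dataset` averages over the token axis (`dim=1`), turning a `[1, tokens, hidden]` tensor into one `hidden`-sized vector. Here is a minimal NumPy sketch of that operation on toy data. The optional `mask` argument is an assumption added for illustration: it shows one common way to keep padded positions out of the average, whereas the helper above takes a plain mean.

```python
import numpy as np

def mean_pool(token_embeddings, mask=None):
    """Average token vectors into one vector per sequence.

    token_embeddings: array of shape [batch, tokens, hidden]
    mask: optional array of shape [batch, tokens], 1 for real tokens, 0 for padding
    """
    token_embeddings = np.asarray(token_embeddings, dtype=np.float32)
    if mask is None:
        # Plain mean over the token axis, as in process_dataset
        return token_embeddings.mean(axis=1)
    mask = np.asarray(mask, dtype=np.float32)[..., None]  # [batch, tokens, 1]
    summed = (token_embeddings * mask).sum(axis=1)         # padded tokens contribute 0
    counts = np.clip(mask.sum(axis=1), 1.0, None)          # avoid division by zero
    return summed / counts

# Toy input: 1 sequence, 3 tokens, hidden size 2; the last token is padding
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]])
plain = mean_pool(tokens)
masked = mean_pool(tokens, mask=[[1, 1, 0]])
print(plain.shape)  # (1, 2)
print(masked)       # [[2. 3.]]
```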

Augment dataset with instructions

In [103]:
import textwrap

def modify_dataset(example):
    example['text'] = "DIUhifhnfIUSWFDIdn*#&^G(1)" + example['text'] + "cm8097)C*#MRY0W*&Edhj9efhdddwqferw"
    return example

# Apply the function to the dataset
train_dataset_aug = train_dataset.map(modify_dataset)

# Check the first review (index the row first to avoid loading the whole column)
example = train_dataset_aug[0]['text']
print(textwrap.fill(example, width=60))

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

DIUhifhnfIUSWFDIdn*#&^G(1)I rented I AM CURIOUS-YELLOW from
my video store because of all the controversy that
surrounded it when it was first released in 1967. I also
heard that at first it was seized by U.S. customs if it ever
tried to enter this country, therefore being a fan of films
considered "controversial" I really had to see this for
myself.<br /><br />The plot is centered around a young
Swedish drama student named Lena who wants to learn
everything she can about life. In particular she wants to
focus her attentions to making some sort of documentary on
what the average Swede thought about certain political
issues such as the Vietnam War and race issues in the United
States. In between asking politicians and ordinary denizens
of Stockholm about their opinions on politics, she has sex
with her drama teacher, classmates, and married men.<br
/><br />What kills me about I AM CURIOUS-YELLOW is that 40
years ago, this was considered pornographic. Really, the sex
and nudity scenes ar
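
The prefix and suffix in `modify_dataset` are hard-coded. A hedged sketch of a parameterized version follows. The factory name `make_augmenter` and the example instruction text are hypothetical and not taken from the notebook.

```python
def make_augmenter(prefix="", suffix="", key="text"):
    """Return a function that wraps example[key] with a prefix and suffix."""
    def augment(example):
        # Copy so the caller's dict is left untouched
        updated = dict(example)
        updated[key] = f"{prefix}{example[key]}{suffix}"
        return updated
    return augment

# Hypothetical instruction text, for illustration only
augment = make_augmenter(prefix="Classify the sentiment of this review: ")
sample = {"text": "Great movie!", "label": 1}
result = augment(sample)
print(result["text"])  # Classify the sentiment of this review: Great movie!
```

With a Hugging Face dataset, the returned function can be passed to `train_dataset.map(augment)` in the same way as `modify_dataset`.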

Embed `train_dataset` and `test_dataset`, then save them to disk. Approximate encoding throughput:
- CPU: about 10 it/s
- GPU: about 100 it/s

In [104]:
train_embeddings, train_labels = process_dataset(dataset=train_dataset, model=model_BERT, key="text",
                                                 truncation=True, padding=True, max_length=512, use_mean_pooling=True)
# Encode the held-out test split (not the training split) for evaluation
test_embeddings, test_labels = process_dataset(dataset=test_dataset, model=model_BERT, key="text",
                                               truncation=True, padding=True, max_length=512, use_mean_pooling=True)

Encoding text:   0%|          | 0/25000 [00:00<?, ?it/s]

Encoding text:   0%|          | 0/25000 [00:00<?, ?it/s]
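
Taking the rough throughput figures quoted above at face value (about 10 it/s on CPU and 100 it/s on GPU), a back-of-the-envelope estimate of the encoding time for one 25,000-review split looks like this. The numbers are approximate and will vary with hardware and sequence length.

```python
def estimate_minutes(num_examples, items_per_second):
    """Estimated wall-clock minutes to encode num_examples at a given rate."""
    return num_examples / items_per_second / 60

split_size = 25_000  # size of each IMDB split
cpu_minutes = estimate_minutes(split_size, 10)
gpu_minutes = estimate_minutes(split_size, 100)
print(f"CPU: ~{cpu_minutes:.1f} min, GPU: ~{gpu_minutes:.1f} min")
# CPU: ~41.7 min, GPU: ~4.2 min
```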

This step can be skipped when loading previously saved embeddings from disk.

In [80]:
from datetime import datetime
import os, json
def save_embeddings(embeddings, model_name, save_path="data"):
    timestamp = datetime.now().strftime("%m-%d_%H:%M")
    # Calculate the average shape of tensors
    tensor_shapes = [tensor.shape for tensor in embeddings]
    avg_shape = np.mean(tensor_shapes, axis=0).tolist()
    
    embedding_info = {
        'model_name': model_name,
        'num_embeddings': len(embeddings),
        'avg_embedding_shape': avg_shape,
        'created_at': timestamp
    }
    
    os.makedirs(save_path, exist_ok=True)
    embedding_file = os.path.join(save_path, f"{model_name}_embeddings.npy")
    metadata_file = os.path.join(save_path, f"{model_name}_metadata.json")
    
    np.save(embedding_file, embeddings)

    with open(metadata_file, 'w') as f:
        json.dump(embedding_info, f)
    
    print(f"Embeddings and metadata saved for {model_name} at {timestamp}")

In [81]:
def load_embeddings(embedding_file, metadata_file):
    embeddings = np.load(embedding_file)
    with open(metadata_file, 'r') as f:
        metadata = json.load(f)
    
    print(f"Loaded embeddings from model: {metadata['model_name']}")
    print(f"Number of embeddings: {metadata['num_embeddings']}")
    print(f"Average embedding shape: {metadata['avg_embedding_shape']}")
    print(f"Created at: {metadata['created_at']}")
    
    return embeddings

In [82]:
save_path="data"
model_name="BERT"
save_embeddings(train_embeddings, model_name, save_path)

Embeddings and metadata saved for BERT at 05-27_12:23


Load embeddings from disk

In [83]:
save_path="data"
model_name="BERT"

embedding_file = os.path.join(save_path, f"{model_name}_embeddings.npy")
metadata_file = os.path.join(save_path, f"{model_name}_metadata.json")
load_embeddings(embedding_file,metadata_file)

Loaded embeddings from model: BERT
Number of embeddings: 10000
Average embedding shape: [768.0]
Created at: 05-27_12:23


array([[ 0.33487466,  0.18951571,  0.16997708, ..., -0.14543127,
         0.18289806, -0.05866966],
       [ 0.12900904,  0.15694778,  0.7177404 , ..., -0.11730668,
         0.20632458,  0.12624525],
       [ 0.29783016,  0.17169824,  0.4370483 , ..., -0.16575865,
         0.21610147, -0.13894227],
       ...,
       [ 0.33479112,  0.19027473,  0.19242047, ..., -0.05238853,
         0.05621795,  0.01344509],
       [ 0.19905636,  0.1317051 ,  0.54959613, ..., -0.09682898,
         0.20079939,  0.08876501],
       [ 0.02522485,  0.07287347,  0.52920115, ..., -0.04332894,
         0.10391848,  0.15221924]], dtype=float32)

### Train classifiers to evaluate Embedding performance
- Linear
    - SVM
- Non-linear
    - MLP

Train SVM model using `train_embeddings` and `train_labels`

In [105]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Train an SVM model
svm_model = SVC(kernel='linear')
svm_model.fit(train_embeddings, train_labels)

Evaluate SVM results using `test_embeddings` and `test_labels`

In [90]:
# Predict and evaluate
predicted_labels = svm_model.predict(test_embeddings)
print(classification_report(y_true=test_labels, y_pred=predicted_labels))

              precision    recall  f1-score   support

           0       0.79      0.85      0.82      2035
           1       0.65      0.67      0.66      1977
           2       0.64      0.62      0.63      1943
           3       0.66      0.64      0.65      1991
           4       0.80      0.78      0.79      2054

    accuracy                           0.71     10000
   macro avg       0.71      0.71      0.71     10000
weighted avg       0.71      0.71      0.71     10000
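
As a reminder of what the report columns mean, the sketch below computes precision, recall, and F1 for one class from raw label lists. It is a plain-Python illustration on made-up labels, not a replacement for `classification_report`.

```python
def class_metrics(y_true, y_pred, positive):
    """Precision, recall, and F1 for a single class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels for illustration
y_true = [0, 0, 1, 1, 1, 2]
y_pred = [0, 1, 1, 1, 0, 2]
precision, recall, f1 = class_metrics(y_true, y_pred, positive=1)
print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.67 0.67 0.67
```

The `support` column is the number of true examples of each class in the evaluated set.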



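Train an MLP model using `train_embeddings` and `train_labels`, then evaluate it using `test_embeddings` and `test_labels`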

In [91]:
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import classification_report

# Define the MLP model
mlp_model = MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, alpha=1e-4,
                          solver='sgd', verbose=1, random_state=1,
                          learning_rate_init=.1)

# Train the MLP model
mlp_model.fit(train_embeddings, train_labels)

# Predict and evaluate
predicted_labels_mlp = mlp_model.predict(test_embeddings)
print(classification_report(y_true=test_labels, y_pred=predicted_labels_mlp))


Iteration 1, loss = 1.34905888
Iteration 2, loss = 1.04675439
Iteration 3, loss = 1.01076066
Iteration 4, loss = 0.99786187
Iteration 5, loss = 0.98156938
Iteration 6, loss = 0.98844070
Iteration 7, loss = 0.95807240
Iteration 8, loss = 0.95131509
Iteration 9, loss = 0.95523799
Iteration 10, loss = 0.94671827
Iteration 11, loss = 0.92625707
Iteration 12, loss = 0.92137301
Iteration 13, loss = 0.92025075
Iteration 14, loss = 0.89425304
Iteration 15, loss = 0.90082937
Iteration 16, loss = 0.88649484
Iteration 17, loss = 0.88761526
Iteration 18, loss = 0.89916118
Iteration 19, loss = 0.87750865
Iteration 20, loss = 0.86367560
Iteration 21, loss = 0.86018417
Iteration 22, loss = 0.84177840
Iteration 23, loss = 0.83433371
Iteration 24, loss = 0.83123333
Iteration 25, loss = 0.83825253
Iteration 26, loss = 0.83235303
Iteration 27, loss = 0.81067179
Iteration 28, loss = 0.79958709
Iteration 29, loss = 0.79540450
Iteration 30, loss = 0.78297575
Iteration 31, loss = 0.78561131
Iteration 32, los

In [92]:
print(classification_report(y_true=test_labels, y_pred=predicted_labels_mlp))

              precision    recall  f1-score   support

           0       0.87      0.93      0.89      2035
           1       0.82      0.75      0.79      1977
           2       0.79      0.71      0.74      1943
           3       0.75      0.84      0.79      1991
           4       0.87      0.86      0.86      2054

    accuracy                           0.82     10000
   macro avg       0.82      0.82      0.82     10000
weighted avg       0.82      0.82      0.82     10000



## The Text Embedding Pipeline

In [1]:
from TextEmbeddingPipeline import EmbedFlow

flow = EmbedFlow(model_name="BERT", dataset_name="IMDB", prefix="Prefix: ", suffix=" :Suffix")

flow.start_flow(sample_size=1000, use_mean_pooling=True, method='svm')


Saving the dataset (0/1 shards):   0%|          | 0/1000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/1000 [00:00<?, ? examples/s]

Data loaded and sampled.
Local data loaded.


Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Map:   0%|          | 0/1000 [00:00<?, ? examples/s]

Data augmented with prefixes and suffixes.


Embedding text:   0%|          | 0/1000 [00:00<?, ?it/s]

  attn_output = torch.nn.functional.scaled_dot_product_attention(


Embeddings and labels saved at 05-27_09:10.
Data loaded for model: BERT
SVM Evaluation Report:
              precision    recall  f1-score   support

           0       0.88      0.89      0.89       104
           1       0.88      0.86      0.87        96

    accuracy                           0.88       200
   macro avg       0.88      0.88      0.88       200
weighted avg       0.88      0.88      0.88       200

