# Embedding Model with Instructions

## IMDB Dataset

The IMDB dataset is a large dataset of movie reviews used for sentiment analysis. It contains 50,000 reviews labeled as positive or negative, which can be used for binary sentiment classification tasks. 

- **Train**: Contains 25,000 labeled reviews for training
- **Test**: Contains 25,000 labeled reviews for evaluating

### Save the dataset locally (Don't need to run)

In [14]:
from datasets import load_dataset

def save_dataset(random_sample_size=5000):  # Reduce sample size for testing
    # Load the dataset
    dataset = load_dataset('stanfordnlp/imdb')
    
    # Access the train, test splits
    train_dataset = dataset['train']
    test_dataset = dataset['test']

    # Random sample the dataset, only use random_sample_size
    train_dataset = train_dataset.shuffle(seed=42).select(range(random_sample_size))
    test_dataset = test_dataset.shuffle(seed=42).select(range(random_sample_size))

    train_dataset.save_to_disk('sampled_train_dataset')
    test_dataset.save_to_disk('sampled_test_dataset')
save_dataset()

Saving the dataset (0/1 shards):   0%|          | 0/5000 [00:00<?, ? examples/s]

Saving the dataset (0/1 shards):   0%|          | 0/5000 [00:00<?, ? examples/s]

## Load dataset from file

In [2]:
from datasets import load_from_disk
try:
    train_dataset = load_from_disk('sampled_train_dataset')
    test_dataset = load_from_disk('sampled_test_dataset')
    
    print(">>Train Dataset loaded<<")
    # Print a sample from the loaded datasets to verify
    print("Text:", train_dataset[0]["text"], "\nLabel:", train_dataset[0]["label"])
    
    print("\n>>Test Dataset loaded<<")
    # Print a sample from the loaded test datasets to verify
    print("Text:", test_dataset[0]["text"], "\nLabel:", test_dataset[0]["label"])
except Exception as e:
    print(f"An error occurred while loading the datasets: {e}")

>>Train Dataset loaded<<
Text: I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scen

## Text Embedding Models

- **BERT** (Bidirectional Encoder Representations from Transformers)
    - Retrieved from [Hugging Face](https://huggingface.co/google-bert/bert-base-uncased)

- **Sentence Transformer** (all-MiniLM-L12-v2)
    - Retrieved from [Hugging Face](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2)

- **T5** (Text-To-Text Transfer Transformer)
    - Retrieved from [Hugging Face](https://huggingface.co/google-t5/t5-base)

- **Instructor**
    - Retrieved from [Hugging Face](https://huggingface.co/hkunlp/instructor-large)

In [3]:
import torch
from transformers import pipeline
device = 0 if torch.cuda.is_available() else -1

model_BERT = pipeline("feature-extraction", model="google-bert/bert-base-uncased", device=device)
#model_ST = pipeline("feature-extraction", model="sentence-transformers/all-MiniLM-L12-v2", device=device)
#model_T5 = pipeline("feature-extraction", model="google-t5/t5-base", device=device)
#model_INS = pipeline("feature-extraction", model="hkunlp/instructor-large", device=device)

- Helper function using the provided embedding model to encode text
- returns Tensor

In [4]:
def embed(text, model, use_mean_pooling=False):
    # BERT max seq length is 512, some reviews need to be truncated
    tensor = model(text, return_tensors=True, truncation=True, padding=True, max_length=512)
    
    # Use mean pooling to 1 dim for SVM classification
    if(use_mean_pooling):
        tensor = tensor.mean(dim=1)
        tensor = tensor.flatten()
    return tensor

Embed `train_dataset` and `test_dataset` then save them to disk
- CPU 10 it/s
- GPU 100 it/s

In [5]:
from tqdm import tqdm
import numpy as np

# Helper function to process and embed dataset
def process_and_embed(dataset, feature_extractor, desc):
    embeddings = []
    labels = []
    for text, label in tqdm(zip(dataset['text'], dataset['label']), total=len(dataset['text']), desc=desc):
        embedding = embed(text, feature_extractor, use_mean_pooling=True)
        embeddings.append(embedding)
        labels.append(label)
    return np.array(embeddings), np.array(labels)

This is not required to run if loading from disk

In [6]:
import numpy as np
import os

def encode(train_dataset, test_dataset, feature_extractor):
    # Process and embed the training and test data
    train_embeddings, train_labels = process_and_embed(train_dataset, feature_extractor, desc="Encoding train data")
    test_embeddings, test_labels = process_and_embed(test_dataset, feature_extractor, desc="Encoding test data")

    # Save the embeddings and labels to .npy files
    np.save(os.path.join('data','train_embeddings.npy'), train_embeddings)
    np.save(os.path.join('data','train_labels.npy'), train_labels)
    np.save(os.path.join('data','test_embeddings.npy'), test_embeddings)
    np.save(os.path.join('data','test_labels.npy'), test_labels)

#encode(train_dataset, test_dataset, model_BERT)

  attn_output = torch.nn.functional.scaled_dot_product_attention(
Encoding train data:   0%|          | 1/25000 [00:00<3:12:18,  2.17it/s]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Encoding train data: 100%|██████████| 25000/25000 [04:25<00:00, 94.19it/s] 
Encoding test data: 100%|██████████| 25000/25000 [04:45<00:00, 87.57it/s] 


Load embeddings and labels from disk

In [7]:
import numpy as np
import os

# Load the embeddings and labels from .npy files
try:
    train_embeddings = np.load(os.path.join('data','train_embeddings.npy'))
    train_labels = np.load(os.path.join('data', 'train_labels.npy'))
    print("Train embeddings and labels loaded!")
    
    test_embeddings = np.load(os.path.join('data','train_embeddings.npy'))
    test_labels = np.load(os.path.join('data','test_labels.npy'))
    print("Test embeddings and labels loaded!")
except Exception as e:
    print("Load embedding failed:", e)

Train embeddings and labels loaded!
Test embeddings and labels loaded!


### Train classifiers to evaluate Embedding performance
- Linear
    - SVM
- Non-Lienar
    - MLP

Train SVM model using `train_embeddings` and `train_labels`

In [None]:
from sklearn.svm import SVC
from sklearn.metrics import classification_report

# Train an SVM model
svm_model = SVC(kernel='linear')
svm_model.fit(train_embeddings, train_labels)

Evaluate SVM results using `test_embeddings` and `test_labels`

In [12]:
# Predict and evaluate
predicted_labels = svm_model.predict(test_embeddings)
print(classification_report(y_true=test_labels, y_pred=predicted_labels))

              precision    recall  f1-score   support

           0       0.90      0.91      0.90     12500
           1       0.91      0.90      0.90     12500

    accuracy                           0.90     25000
   macro avg       0.90      0.90      0.90     25000
weighted avg       0.90      0.90      0.90     25000

