## Task 5b: Using a text embedding model to find phrases similar to the hot words from Task 5a

This notebook contains the solution for Task 5b. It uses the text embedding model ```hkunlp/instructor-large``` ([HuggingFace link](https://huggingface.co/hkunlp/instructor-large)) to find phrases similar to the hot words ```(be careful, destroy, stranger)```.


In [1]:
# Only run this line if you still face issues with the sentence-transformers library
# %pip install -U sentence-transformers

### Loading the cv-valid-dev.csv dataset

In [2]:
import pandas as pd
import numpy as np

# Set Pandas display options to show full text
pd.set_option("display.max_colwidth", None)

# Read the csv file
df = pd.read_csv('cv-valid-dev.csv')
df = df[['filename', 'generated_text']] # Filtering to required columns
df.head()

Unnamed: 0,filename,generated_text
0,cv-valid-dev/sample-000000.mp3,be careful with your propnastigations said the stranger
1,cv-valid-dev/sample-000001.mp3,then why should they be surprised when they se born
2,cv-valid-dev/sample-000002.mp3,a young arab also loaded down with bagage entered and greted the englishman
3,cv-valid-dev/sample-000003.mp3,i felt that everything i owned would be destroyed
4,cv-valid-dev/sample-000004.mp3,he moved about invisible but everyone could hear him


The following code implementation was taken from the HuggingFace repository. However, it raised many errors because the ```InstructorEmbedding``` library has not been maintained to suit the latest version of ```sentence-transformers```, so this method will not be used.

```Python
from InstructorEmbedding import INSTRUCTOR

model = INSTRUCTOR('hkunlp/instructor-large')
sentence = "3D ActionSLAM: wearable person tracking in multi-floor environments"
instruction = "Represent the Science title:"
embeddings = model.encode([[instruction,sentence]])
print(embeddings)
```

Instead, we will load the model directly using ```SentenceTransformer()```.

In [3]:
import torch

# Set the device to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using Device:", device)

Using Device: cuda


In [4]:
from sentence_transformers import SentenceTransformer, util

# Load the model
model = SentenceTransformer('hkunlp/instructor-large')

# Ensure that the model is on the device
model.to(device)

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: T5EncoderModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': False})
  (2): Dense({'in_features': 1024, 'out_features': 768, 'bias': False, 'activation_function': 'torch.nn.modules.linear.Identity'})
  (3): Normalize()
)

### Using the Model to generate embeddings

We will use the ```hkunlp/instructor-large``` model to encode the hot words and the text into embeddings, which are returned as tensors (```convert_to_tensor=True```) for processing.

I converted the embeddings into tensors for two reasons:
1. Efficiency: Tensors are optimised for numerical computation and can stay on the GPU, so the similarity scores are computed with GPU acceleration.
2. Compatibility: The ```sentence_transformers``` package provides the utility function ```util.cos_sim```, which is used here to compute cosine similarity scores and is designed to work with tensors.
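
To make the cosine similarity score concrete, here is a minimal sketch in plain Python. It is not the library's implementation: the helper ```cos_sim``` and the 2-D toy vectors are made up for illustration and only stand in for the 768-dimensional embeddings the model produces.

```python
import math


def cos_sim(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy 2-D vectors (hypothetical), standing in for real embeddings
text_vec = [1.0, 0.0]
hotword_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# One score per hot word: same direction, orthogonal, 45 degrees apart
scores = [cos_sim(text_vec, h) for h in hotword_vecs]
print([round(s, 3) for s in scores])  # [1.0, 0.0, 0.707]
```

A score of 1 means the two vectors point in the same direction and 0 means they are orthogonal, so a threshold close to 1 only accepts near-identical directions.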

In [5]:
# Define the hotwords
hotwords = ["be careful", "destroy", "stranger"]

# Compute embeddings for the hotwords
hotword_embeddings = model.encode(hotwords, convert_to_tensor=True)

The following function checks the cosine similarity between each sentence embedding and the hot word embeddings, and flags the sentence if any score is above the threshold.

The threshold of 0.87 (87% similarity) was chosen after experimenting with values in the range 80% to 89%. I found that 87% gave a good balance between flagging the correct sentences and rejecting the incorrect ones.
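
One way such a threshold experiment could be run is sketched below, assuming the best hot word score for each sentence has already been computed. The scores in ```max_scores``` and the helper ```count_flagged``` are hypothetical and are not taken from the notebook's data.

```python
# Hypothetical best-hot-word similarity scores for five sentences (made up)
max_scores = [0.91, 0.88, 0.86, 0.83, 0.79]


def count_flagged(scores, threshold):
    """Number of sentences whose score is strictly above the threshold."""
    return sum(score > threshold for score in scores)


# Sweep the 0.80 to 0.89 range in steps of 0.01
thresholds = [t / 100 for t in range(80, 90)]

for t in thresholds:
    print(f"threshold={t:.2f} flagged={count_flagged(max_scores, t)}")
```

Reading the flagged sentences at each step, rather than only the counts, is what shows where false positives start to appear.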

In [6]:
from tqdm import tqdm

def is_similar(text: str, threshold=0.87) -> bool:
    """
    Check if the text is similar to any of the hotwords.
    Returns True if the cosine similarity is above the threshold, otherwise False.
    """
    # Compute embedding for the input text
    text_embedding = model.encode(text, convert_to_tensor=True)
    
    # Compute cosine similarities between the text and all hotwords
    similarities = util.cos_sim(text_embedding, hotword_embeddings)
    
    # Check if any similarity score exceeds the threshold
    return bool((similarities > threshold).any().item())


tqdm.pandas(desc="Similarity Check Progress")

# Apply the similarity check to each row in the DataFrame
df["similarity"] = df["generated_text"].progress_apply(lambda x: is_similar(x) if pd.notna(x) else False)
df

Similarity Check Progress: 100%|██████████| 4076/4076 [00:41<00:00, 99.22it/s] 


Unnamed: 0,filename,generated_text,similarity
0,cv-valid-dev/sample-000000.mp3,be careful with your propnastigations said the stranger,True
1,cv-valid-dev/sample-000001.mp3,then why should they be surprised when they se born,False
2,cv-valid-dev/sample-000002.mp3,a young arab also loaded down with bagage entered and greted the englishman,False
3,cv-valid-dev/sample-000003.mp3,i felt that everything i owned would be destroyed,False
4,cv-valid-dev/sample-000004.mp3,he moved about invisible but everyone could hear him,False
...,...,...,...
4071,cv-valid-dev/sample-004071.mp3,but they could never have taught him arabic,False
4072,cv-valid-dev/sample-004072.mp3,he decided to concentrate on more practical maters,False
4073,cv-valid-dev/sample-004073.mp3,that's what i'm not suposed to say,False
4074,cv-valid-dev/sample-004074.mp3,just andoin pe made him fel picter,False


In [7]:
# Save the dataframe as the updated cv-valid-dev.csv
df.to_csv('cv-valid-dev.csv', index=False)