<a href="https://colab.research.google.com/github/Zeeshan13/Colab_HuggingFace/blob/main/Sentence-Transformers_and_Cosine_Similarity.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Using Sentence-Transformers and Cosine Similarity**

Running code

In [None]:
# Install the required libraries
!pip install pypdf transformers sentence-transformers langchain llama_index


Collecting pypdf
  Downloading pypdf-4.3.0-py3-none-any.whl (295 kB)
Collecting sentence-transformers
  Downloading sentence_transformers-3.0.1-py3-none-any.whl (227 kB)
Collecting langchain
  Downloading langchain-0.2.8-py3-none-any.whl (987 kB)
Collecting llama_index
  Downloading llama_index-0.10.55-py3-none-any.whl (6.8 kB)
Collecting langchain-core<0.3.0,>=0

In [None]:
# Import the required libraries
from sentence_transformers import SentenceTransformer, util
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, ServiceContext, Document
from google.colab import drive
import pandas as pd




In [None]:
# Mount Google Drive to access files
drive.mount('/content/drive')


Mounted at /content/drive


In [None]:
# Define directory path for documents where ad descriptions are stored
directory_path = "/content/drive/MyDrive/Task/Data"


# The code chunk below loads the documents from the specified directory, which contains the descriptions of the video ads. It initializes the **all-mpnet-base-v2** SentenceTransformer model, which converts text into numerical embeddings for similarity comparison.
# It also creates a **ServiceContext** with a chunk size of 512 and with `llm=None` and `embed_model=None`, so LlamaIndex falls back to its mock LLM and mock embedding model. The similarity scoring later in the notebook uses the SentenceTransformer model directly.

In [None]:
# Load documents from the directory
documents = SimpleDirectoryReader(directory_path).load_data()

# Initialize SentenceTransformer model
#model = SentenceTransformer('all-MiniLM-L6-v2')

model = SentenceTransformer('all-mpnet-base-v2')  # Updated model


# Create service context
service_context = ServiceContext.from_defaults(
    chunk_size=512,
    llm=None,
    embed_model=None
    )





LLM is explicitly disabled. Using MockLLM.
Embeddings have been explicitly disabled. Using MockEmbedding.


In [None]:
# Load sample data
sample_data = pd.read_csv('/content/drive/MyDrive/Task/Data/Sample.csv')

# Load ground truth data
ground_truth = pd.read_excel('/content/drive/MyDrive/Task/ground_truth.xlsx')



### In this part of the code, I wrap each text description from the sample data in a **Document** object, iterating over **sample_data['creative_data_description']** so that every advertisement description is ready for indexing.

### Next, I build an index from these **Document** objects with **VectorStoreIndex**, using the **service_context** set up earlier, and create a query engine from that index. A query engine makes it possible to run searches against the indexed descriptions. Note that the service context disables the embedding model, and the yes/no answers generated later in this notebook come from cosine similarity on SentenceTransformer embeddings rather than from this query engine.

In [None]:

# Wrap each text description in a Document object
documents = [Document(text=description) for description in sample_data['creative_data_description']]

# Build the index from the Document objects
index = VectorStoreIndex.from_documents(documents, service_context=service_context)
query_engine = index.as_query_engine()


In this section of the code, I defined the 21 questions that we want to answer for each advertisement. These questions focus on different aspects of the ad, such as whether it includes a call to action, provides contact information, mentions a price, or evokes certain emotions. By listing these questions, we set clear criteria for evaluating the content of each advertisement. This helps us systematically analyze the ads based on consistent and specific questions, ensuring our analysis covers all relevant aspects of the advertisements.

In [None]:
# Defining the 21 questions
questions_list = [
    "Is there a call to go online (e.g., shop online, visit the Web)?",
    "Is online contact information provided (e.g., URL, website)?",
    "Is there a visual or verbal call to purchase (e.g., buy now, order now)?",
    "Does the ad portray a sense of urgency to act (e.g., buy before sales end, order before it ends)?",
    "Is there an incentive to buy (e.g., a discount, a coupon, a sale, or 'limited time offer')?",
    "Is there offline contact information provided (e.g., phone, mail, store location)?",
    "Is there mention of something free?",
    "Does the ad mention at least one specific product or service (e.g., model, type, item)?",
    "Is there any verbal or visual mention of the price?",
    "Does the ad show the brand (logo, brand name) or trademark (something that most people know is the brand) multiple times?",
    "Does the ad show the brand or trademark exactly once at the end of the ad?",
    "Is the ad intended to affect the viewer emotionally, either with positive emotion (fun, joy), negative emotion (sad, anxious) or another type of emotion?",
    "Does the ad give you a positive feeling about the brand?",
    "Does the ad have a story arc, with a beginning and an end?",
    "Does the ad have a reversal of fortune, where something changes for the better, or changes for the worse?",
    "Does the ad have relatable characters?",
    "Is the ad creative/clever?",
    "Is the ad intended to be funny?",
    "Does this ad provide sensory stimulation (e.g., cool visuals, arousing music, mouth-watering)?",
    "Is the ad visually pleasing?",
    "Does the ad have cute elements like animals, babies, animated characters, etc?"
]

In this part of the code, I first defined a function, determine_answer, which uses cosine similarity to determine yes/no answers. This function takes the embeddings of the ad description and the question and calculates their similarity. If the similarity score exceeds a specified threshold (0.2 in this case), it returns 'yes'; otherwise, it returns 'no'. This approach helps us decide whether a particular aspect is present in the advertisement based on the textual content.

Next, I precomputed the embeddings for all the questions using the model. By converting the questions into tensor format, we can efficiently compare them with the ad descriptions.

Lastly, I initialized columns in the **sample_data** DataFrame to store the generated answers for each question. This prepares the DataFrame to hold the results of our analysis, with each column corresponding to a question and its generated yes/no answer.

In [None]:
# Function to determine yes/no answers using cosine similarity
def determine_answer(description_embedding, question_embedding, threshold=0.2):
    similarity = util.pytorch_cos_sim(description_embedding, question_embedding)
    return 'yes' if similarity.item() > threshold else 'no'

# Precompute question embeddings
question_embeddings = model.encode(questions_list, convert_to_tensor=True)

# Initialize columns for generated answers
for question in questions_list:
    sample_data[f'{question}_generated_answer'] = ""
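To make the threshold rule concrete, here is a minimal sketch that computes cosine similarity by hand on toy vectors. The two-dimensional vectors and the `toy_answer` helper are hypothetical and only illustrate the decision rule; the notebook itself relies on `util.pytorch_cos_sim` and real model embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def toy_answer(a, b, threshold=0.2):
    # Same decision rule as determine_answer, applied to plain lists
    return 'yes' if cosine_similarity(a, b) > threshold else 'no'

# Hypothetical 2-D "embeddings", chosen only for illustration
description_vec = [1.0, 0.0]
similar_question_vec = [1.0, 1.0]     # 45 degrees away, cosine about 0.707
unrelated_question_vec = [0.0, 1.0]   # orthogonal, cosine 0

print(toy_answer(description_vec, similar_question_vec))    # yes
print(toy_answer(description_vec, unrelated_question_vec))  # no
```

Because the rule uses a strict `>`, a similarity exactly equal to the threshold yields 'no'.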

In this part of the code, I used batch processing to handle the data in smaller chunks, which helps reduce memory usage. By setting a batch_size of 10 (which can be adjusted as needed), I processed the data in manageable portions.

For each batch, I extracted a subset of sample_data and encoded the ad descriptions into embeddings. Then, for each ad description embedding, I compared it with each question embedding using the determine_answer function to generate a yes/no answer. These answers were then stored in the corresponding columns of the batch_data.

After processing each batch, I updated the original sample_data with the new answers. This approach ensures that we don't run into memory issues by processing the entire dataset at once.

Once all batches were processed, I prepared the final data for output. I created a new DataFrame, output_data, containing the creative_data_id and the generated answers for each question. This organized the results neatly.

Finally, I saved the output_data to a CSV file, making it easy to share and analyze the results. This CSV file contains the generated yes/no answers for each advertisement based on the defined questions.
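The batching pattern described above can be sketched on a plain list. The `iter_batches` helper and the 25-item list are hypothetical stand-ins for the DataFrame slicing, used only to show how the `range(0, n, batch_size)` loop covers the data, including a shorter final batch.

```python
def iter_batches(items, batch_size):
    """Yield (start, batch) pairs covering items in consecutive slices."""
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]

rows = list(range(25))  # hypothetical stand-in for 25 ad descriptions
batch_sizes = [len(batch) for _, batch in iter_batches(rows, batch_size=10)]
print(batch_sizes)  # [10, 10, 5]
```

Slicing past the end of a sequence is safe in Python, so the last batch simply holds whatever rows remain.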

In [None]:
# Batch processing to reduce memory usage
batch_size = 10  # We can adjust the batch size as needed
for start in range(0, len(sample_data), batch_size):
    end = start + batch_size
    # Work on an explicit copy so pandas does not warn about writing to a slice
    batch_data = sample_data.iloc[start:end].copy()
    description_embeddings = model.encode(batch_data['creative_data_description'].tolist(), convert_to_tensor=True)

    for idx, description_embedding in enumerate(description_embeddings):
        row_label = batch_data.index[idx]  # row label, valid even if the index is not 0..n-1
        for q_idx, question_embedding in enumerate(question_embeddings):
            question = questions_list[q_idx]
            answer = determine_answer(description_embedding, question_embedding)
            batch_data.at[row_label, f'{question}_generated_answer'] = answer

    # Write the answers for this batch back into the original DataFrame
    sample_data.iloc[start:end] = batch_data

# Prepare Data for .csv
output_data = sample_data[['creative_data_id']].copy()

for i, question in enumerate(questions_list, 1):
    output_data[f'question_{i}'] = sample_data[f'{question}_generated_answer']

# Display the prepared data
print("\nPrepared Data:")
print(output_data.head())

# Save the output data to a .csv file
output_data.to_csv('/content/drive/MyDrive/Task/output_answers.csv', index=False)

print("CSV file has been created successfully!")



Prepared Data:
   creative_data_id question_1 question_2 question_3 question_4 question_5  \
0           2194673         no         no         no         no        yes   
1           2142915         no         no         no         no         no   
2           1702851         no         no         no         no         no   
3           1671980         no         no         no         no         no   
4           1749291         no         no         no         no         no   

  question_6 question_7 question_8 question_9  ... question_12 question_13  \
0         no         no         no         no  ...          no          no   
1         no         no         no         no  ...          no          no   
2         no         no         no         no  ...          no          no   
3         no         no         no         no  ...          no          no   
4         no         no         no         no  ...          no         yes   

  question_14 question_15 question_16 question

#### Comparing the generated file and ground truth file

In [None]:
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

# Load generated data
generated_data = pd.read_csv('/content/drive/MyDrive/Task/output_answers.csv')

# Load ground truth data
ground_truth = pd.read_excel('/content/drive/MyDrive/Task/ground_truth.xlsx')

# Ensure both datasets are sorted by 'creative_data_id' to align rows correctly
generated_data = generated_data.sort_values('creative_data_id').reset_index(drop=True)
ground_truth = ground_truth.sort_values('creative_data_id').reset_index(drop=True)

In [None]:
# Display column names
print(ground_truth.columns.tolist())

['Timestamp', 'creative_data_id', 'Is there a call to go online (e.g., shop online, visit the Web)? ', 'Is there online contact information provided (e.g., URL, website)? ', 'Is there a visual or verbal call to purchase (e.g., buy now, order now)?', 'Does the ad portray a sense of urgency to act (e.g., buy before sales ends, order before ends)? ', 'Is there an incentive to buy (e.g., a discount, a coupon, a sale or "limited time offer")? ', 'Is there offline contact information provided (e.g., phone, mail, store location)?', 'Is there mention of something free? ', 'Does the ad mention at least one specific product or service (e.g., model, type, item)? ', 'Is there any verbal or visual mention of the price?', 'Does the ad show the brand (logo, brand name) or trademark (something that most people know is the brand) multiple times?\n\nFor example, Nike ads often have the "swoosh" logo prominently displayed on shoes and apparel worn by celebrity athletes. The "Just Do It" slogan is another

In [None]:
# Map each generated-data column (question_N) to the matching question column in the ground truth
questions_mapping = {
    "question_1": "Is there a call to go online (e.g., shop online, visit the Web)? ",
    "question_2": "Is there online contact information provided (e.g., URL, website)? ",
    "question_3": "Is there a visual or verbal call to purchase (e.g., buy now, order now)?",
    "question_4": "Does the ad portray a sense of urgency to act (e.g., buy before sales ends, order before ends)? ",
    "question_5": "Is there an incentive to buy (e.g., a discount, a coupon, a sale or \"limited time offer\")? ",
    "question_6": "Is there offline contact information provided (e.g., phone, mail, store location)?",
    "question_7": "Is there mention of something free? ",
    "question_8": "Does the ad mention at least one specific product or service (e.g., model, type, item)? ",
    "question_9": "Is there any verbal or visual mention of the price?",
    "question_10": "Does the ad show the brand (logo, brand name) or trademark (something that most people know is the brand) multiple times?\n\nFor example, Nike ads often have the \"swoosh\" logo prominently displayed on shoes and apparel worn by celebrity athletes. The \"Just Do It\" slogan is another Nike trademark frequently included.",
    "question_11": "Does the ad show the brand or trademark exactly once at the end of the ad?",
    "question_12": "Is the ad intended to affect the viewer emotionally, either with positive emotion (fun, joy), negative emotion (sad, anxious) or another type of emotion? (Note: You may not personally agree, but assess if that was the intention.)",
    "question_13": "Does the ad give you a positive feeling about the brand? ",
    "question_14": "Does the ad have a story arc, with a beginning and an end? ",
    "question_15": "Does the ad have a reversal of fortune, where something changes for the better, or changes for the worse?",
    "question_16": "Does the ad have relatable characters? ",
    "question_17": "Is the ad creative/clever?",
    "question_18": "Is the ad intended to be funny? (Note: You may not personally agree, but assess if that was the intention.) ",
    "question_19": "Does this ad provide sensory stimulation (e.g., cool visuals, arousing music, mouth-watering)? ",
    "question_20": "Is the ad visually pleasing?",
    "question_21": "Does the ad have cute elements like animals, babies, animated, characters, etc?"
}


In this section of the code, we are converting the ground truth answers to a binary format and ensuring that we have a consistent way to compare the generated answers with the ground truth answers.

In [None]:
# Convert ground truth answers to binary format (1 = contains "yes", 0 = anything else)
def standardize_answers(answer):
    if isinstance(answer, str):
        return 1 if 'yes' in answer.lower() else 0
    return 0

# Standardize answers
for question, column_name in questions_mapping.items():
    if column_name in ground_truth.columns:  # Check if column exists
        ground_truth[column_name] = ground_truth[column_name].apply(standardize_answers)
    else:
        print(f"Warning: Question '{column_name}' not found in ground truth data.")
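As a quick check of how this conversion behaves, the sketch below repeats the same function and runs it on a few hypothetical raw responses (the example strings are invented for illustration and do not come from the ground truth file).

```python
def standardize_answers(answer):
    # Same rule as in the notebook: any string containing "yes" maps to 1
    if isinstance(answer, str):
        return 1 if 'yes' in answer.lower() else 0
    return 0

# Hypothetical raw responses, for illustration only
examples = ["Yes", "No", "YES, at the end", float("nan"), "No, only the eyes"]
print([standardize_answers(a) for a in examples])  # [1, 0, 1, 0, 1]
```

The last example shows a caveat of substring matching: "eyes" contains "yes", so free-text answers can be miscounted. Missing values (NaN) are not strings and therefore map to 0.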

In this section, we compute the majority vote for each question and align the generated data with the aggregated ground truth data based on **creative_data_id**.

In [None]:
# Compute majority vote for each question (a tie counts as 1, i.e. "yes", because of >=)
def majority_vote(series):
    return 1 if series.sum() >= len(series) / 2 else 0

ground_truth_aggregated = ground_truth.groupby('creative_data_id').agg({col: majority_vote for col in questions_mapping.values()}).reset_index()

# Align the datasets based on 'creative_data_id'
aligned_data = pd.merge(generated_data, ground_truth_aggregated, on='creative_data_id', suffixes=('_gen', '_gt'))
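To illustrate the voting rule on its own, here is a minimal sketch using plain lists instead of a pandas Series. `majority_vote_list` is a hypothetical helper that mirrors the comparison in `majority_vote`; the vote lists are invented.

```python
def majority_vote_list(votes):
    # Same rule as majority_vote above, written for a plain list of 0/1 votes
    return 1 if sum(votes) >= len(votes) / 2 else 0

print(majority_vote_list([1, 1, 0]))  # 1: two of three annotators said yes
print(majority_vote_list([1, 0, 0]))  # 0: only one of three said yes
print(majority_vote_list([1, 0]))     # 1: an even split resolves to yes
```

Since `pd.merge` performs an inner join by default, any `creative_data_id` that appears in only one of the two tables is dropped from `aligned_data`.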



In this section, we define a function to calculate various evaluation metrics for the generated answers compared to the ground truth answers.

In [None]:
# Function to calculate metrics
def calculate_metrics(generated_data, ground_truth, generated_col, ground_truth_col):
    y_true = ground_truth[ground_truth_col]
    y_pred = generated_data[generated_col].apply(lambda x: 1 if x == 'yes' else 0)

    precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average='binary', zero_division=1)
    agreement_percentage = (y_true == y_pred).mean() * 100

    return precision, recall, f1, agreement_percentage
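For readers who want to see what these metrics measure, the sketch below computes them by hand on a toy example. `binary_metrics` is a hypothetical, simplified helper and not a reimplementation of scikit-learn; in particular it reports undefined ratios as 0.0, whereas the notebook passes `zero_division=1`.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, F1 and agreement (%) for 0/1 labels, 1 = positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # Simplification: undefined ratios are reported as 0.0 here
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    agreement = 100 * sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return precision, recall, f1, agreement

# Toy labels: the model says "yes" rarely, but is right when it does
y_true = [1, 1, 1, 0]
y_pred = [1, 0, 0, 0]
precision, recall, f1, agreement = binary_metrics(y_true, y_pred)
print(precision, round(recall, 3), round(f1, 3), agreement)  # 1.0 0.333 0.5 50.0
```

A model that rarely predicts 'yes' can score high precision while missing most true positives, which is why recall and F1 are reported alongside precision.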

In this part of the code, we calculate the evaluation metrics for each question and store the results in a list.

In [None]:
# Calculate metrics for each question
metrics = []
for gen_question, gt_question in questions_mapping.items():
    if gt_question in aligned_data.columns:
        precision, recall, f1, agreement_percentage = calculate_metrics(aligned_data, aligned_data, gen_question, gt_question)
        metrics.append({
            'Question': gt_question,
            'Precision': precision,
            'Recall': recall,
            'F1 Score': f1,
            'Agreement Percentage': agreement_percentage
        })
    else:
        print(f"Warning: Question '{gt_question}' not found in ground truth data.")


In this final part of the code, we convert the calculated metrics into a DataFrame for better readability and then save it to a CSV file for further analysis.

In [None]:
# Convert metrics to DataFrame for easier viewing
metrics_df = pd.DataFrame(metrics)

# Display the metrics
print(metrics_df)

# Save the metrics to a CSV file
metrics_df.to_csv('/content/drive/MyDrive/Task/metrics.csv', index=False)
print("Metrics CSV file has been created successfully!")

                                             Question  Precision    Recall  \
0   Is there a call to go online (e.g., shop onlin...   0.750000  0.069767   
1   Is there online contact information provided (...   0.000000  0.000000   
2   Is there a visual or verbal call to purchase (...   0.450000  0.163636   
3   Does the ad portray a sense of urgency to act ...   0.475000  0.475000   
4   Is there an incentive to buy (e.g., a discount...   0.878788  0.439394   
5   Is there offline contact information provided ...   0.400000  0.074074   
6                Is there mention of something free?    0.285714  0.166667   
7   Does the ad mention at least one specific prod...   0.885714  0.240310   
8   Is there any verbal or visual mention of the p...   0.833333  0.090909   
9   Does the ad show the brand (logo, brand name) ...   0.833333  0.078740   
10  Does the ad show the brand or trademark exactl...   1.000000  0.083333   
11  Is the ad intended to affect the viewer emotio...   0.928571