# DATA SELECTION - SEMANTIC RETRIEVAL WITH SENTENCE TRANSFORMERS

In [None]:
%pip install sentence-transformers 

In [50]:
import pandas as pd
from sentence_transformers import SentenceTransformer, util

After filtering, there are ~49k posts and comments in the dataset. Next, we select the records most relevant to sentiment about OpenAI and filter out low-quality data. This step lets us produce a high-quality dataset for company reputation analysis.

Before using embedding-based semantic search, we experimented with TF-IDF-based retrieval to find the most relevant records, i.e., the records with the highest cosine similarity to a given query under TF-IDF vectorization. However, after manually labelling ~450 of the top records selected with TF-IDF, we found that ~41% were irrelevant, i.e., they expressed no positive, negative, or neutral sentiment about OpenAI.

This is primarily because term-based vectorization methods like TF-IDF match on shared vocabulary and do not represent the semantic meaning of the text. We therefore decided to experiment with embedding models from the Sentence Transformers library, which are designed for semantic retrieval of the most relevant data points using cosine similarity.
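
The keyword-matching behaviour described above can be sketched as follows. This is a minimal, standard-library illustration, not the code from our earlier TF-IDF experiment: the toy documents, the query, and the helper names (`tokenize`, `tfidf_vectors`, `cosine`) are assumptions made for the example, and the IDF is fitted on the documents and the query together for brevity.

```python
import math
from collections import Counter

PUNCTUATION = ".,?!'\""

def tokenize(text):
    # Lowercase, split on whitespace, and strip simple punctuation
    tokens = (tok.strip(PUNCTUATION) for tok in text.lower().split())
    return [tok for tok in tokens if tok]

def tfidf_vectors(texts):
    # Build sparse TF-IDF vectors (term -> weight) with a smoothed IDF
    tokenized = [tokenize(t) for t in texts]
    n_docs = len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    return [
        {term: count * idf[term] for term, count in Counter(toks).items()}
        for toks in tokenized
    ]

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy data (hypothetical): a question with no opinion, and an opinion
# that shares no keywords with the query
docs = [
    "Does anyone know what users think about OpenAI?",
    "ChatGPT has been fantastic for my coding work",
]
query = "What do users think about OpenAI?"

vectors = tfidf_vectors(docs + [query])
query_vec = vectors[-1]
scores = [cosine(query_vec, doc_vec) for doc_vec in vectors[:-1]]
print(scores)
```

In this toy case the question shares most of its terms with the query and is ranked first, while the opinionated comment shares none and receives a score of 0. This mirrors the kind of irrelevant record we encountered during manual labelling.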

We use the msmarco-distilbert-cos-v5 embedding model for the following reasons:
1. As visualized during exploratory data analysis, our "passages" (comments and posts) are generally longer than the queries we use for retrieval (see below). We therefore need a model for asymmetric semantic search, where the query is generally shorter than the passages to be retrieved. The [Sentence Transformers documentation](https://www.sbert.net/examples/applications/semantic-search/README.html#symmetric-vs-asymmetric-semantic-search) recommends models trained on the MS-MARCO information retrieval dataset for asymmetric semantic search.

2. The model's backbone is DistilBERT, a smaller, lighter version of BERT that maintains most of the original performance, so encoding our dataset and retrieving relevant examples from it should be fast.

3. The model performs relatively well compared to other Sentence Transformers models on various [information retrieval benchmarks](https://www.sbert.net/docs/pretrained-models/msmarco-v5.html#performance).
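
Since all ranking in this notebook relies on cosine similarity, a small worked example may help. The sketch below uses made-up 2-dimensional vectors in place of real embeddings (the `cosine_similarity` and `normalize` helpers are ours, not library functions) and shows that cosine similarity equals the dot product once both vectors are scaled to unit length.

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def normalize(u):
    # Scale a vector to unit length
    norm = math.sqrt(sum(a * a for a in u))
    return [a / norm for a in u]

# Toy vectors (hypothetical), standing in for a query and a passage embedding
query_vec = [3.0, 4.0]
passage_vec = [4.0, 3.0]

cos = cosine_similarity(query_vec, passage_vec)
dot_of_unit_vectors = sum(
    a * b for a, b in zip(normalize(query_vec), normalize(passage_vec))
)
print(cos, dot_of_unit_vectors)
```

Both values are 24/25 = 0.96 up to floating-point rounding, because each vector has length 5 and their dot product is 24.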

In [51]:
# Read the filtered data
filtered_data = pd.read_csv("../Data/filtered_data.csv")

In [52]:
# Display the first few rows of the text field
pd.set_option('display.max_colwidth', None)
filtered_data['text'].head()

0                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            ChatGPT Caused 'Code Red' at Google, Report Says 
1                                                                                                                                                                                                                                                                                                                                   how could someone use ChatGPT or other "AI" services to make some side money? I'm just looking to make some extra cas

The first records in the filtered data, shown above, do not express any sentiment regarding OpenAI or its products. Often, these comments and posts simply ask questions about OpenAI or ChatGPT without expressing an opinion on the company. During manual labelling, we found that cosine similarity search over TF-IDF vectors also retrieved such irrelevant records, because they contain the same keywords as the queries.

Therefore, we use embedding-based semantic retrieval to select the records most relevant to sentiment analysis.

In [53]:
# Load the embedding model
embedding_model = SentenceTransformer("msmarco-distilbert-cos-v5")

In [54]:
# Define multiple search queries, corresponding to each sentiment label, to help
# retrieve a balanced dataset
queries = ["What do users think about OpenAI’s ChatGPT, DALL·E, and other AI tools?",
           "How well do OpenAI’s models perform according to user reviews?",
           "Comparison of OpenAI's products and other competitors based on user reviews",
           "Criticism and complaints about OpenAI’s products in user reviews",
           "Customer satisfaction and positive experiences with OpenAI products"]

In [55]:
# Extract the text column of filtered_data as a list 
reviews = filtered_data["text"].values.tolist()

In [56]:
# Generate embeddings for the queries
query_embeddings = embedding_model.encode(queries, convert_to_tensor=True)

In [57]:
# Generate embeddings for the reviews
review_embeddings = embedding_model.encode(reviews, convert_to_tensor=True)

In [58]:
# Perform cosine similarity search between the query and review embeddings,
# and retrieve the 3,000 most similar reviews for each query
retrieved_reviews = util.semantic_search(query_embeddings, review_embeddings, top_k=3000)
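
`util.semantic_search` returns one list of hits per query, where each hit is a dictionary with a `corpus_id` (the position of the passage in the corpus) and a `score`, sorted by decreasing score. The sketch below mimics that result shape with a hypothetical pure-Python stand-in, `toy_semantic_search`, on made-up 2-dimensional vectors. It is not the library's implementation.

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def toy_semantic_search(query_vecs, corpus_vecs, top_k):
    # Hypothetical stand-in: brute-force scoring of every corpus vector
    # against every query, keeping the top_k hits per query
    results = []
    for query_vec in query_vecs:
        hits = [
            {"corpus_id": idx, "score": cosine(query_vec, corpus_vec)}
            for idx, corpus_vec in enumerate(corpus_vecs)
        ]
        hits.sort(key=lambda hit: hit["score"], reverse=True)
        results.append(hits[:top_k])
    return results

# Toy "embeddings" (hypothetical values)
corpus_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query_vecs = [[1.0, 0.0], [0.0, 1.0]]

hits_per_query = toy_semantic_search(query_vecs, corpus_vecs, top_k=2)
for hits in hits_per_query:
    print([(hit["corpus_id"], round(hit["score"], 3)) for hit in hits])
```

Note that corpus item 2 appears in the results of both queries. The same overlap occurs across our five real queries, which is why the per-query results need to be merged by `corpus_id` before use.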

In [60]:
# Create a dictionary to store the highest score for each unique id
# from the results of all the queries
unique_reviews = {}

for review_list in retrieved_reviews:
    for review in review_list:
        corpus_id = review['corpus_id']
        score = review['score']
        if corpus_id not in unique_reviews or score > unique_reviews[corpus_id]:
            unique_reviews[corpus_id] = score

In [61]:
# Modify the filtered_data DataFrame to include a new column for the cosine similarity score
# for each unique id
filtered_data['cosine_similarity'] = filtered_data.index.map(unique_reviews.get)
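
Each `corpus_id` is a position in the `reviews` list, so mapping through `filtered_data.index` lines up only because `pd.read_csv` assigns a default 0..n-1 index. The sketch below, on a made-up four-row frame with hypothetical scores, shows the same mapping pattern together with a positional variant that would still work if the index had been altered (for example, after an earlier filter without `reset_index`).

```python
import pandas as pd

# Hypothetical corpus_id -> best score mapping (ids 1 and 3 were not retrieved)
unique_scores = {0: 0.61, 2: 0.47}

# Same pattern as in the notebook: relies on the default 0..n-1 index
toy = pd.DataFrame({"text": ["a", "b", "c", "d"]})
toy["cosine_similarity"] = toy.index.map(unique_scores.get)

# Positional variant: does not depend on the index labels at all
relabelled = pd.DataFrame({"text": ["a", "b", "c", "d"]}, index=[10, 11, 12, 13])
relabelled["cosine_similarity"] = [
    unique_scores.get(position, float("nan")) for position in range(len(relabelled))
]

# Rows that no query retrieved have a missing score and are dropped
selected = toy.dropna(subset=["cosine_similarity"]).sort_values(
    "cosine_similarity", ascending=False
)
selected_relabelled = relabelled.dropna(subset=["cosine_similarity"])
print(selected)
print(selected_relabelled)
```

In both frames the first and third rows are kept, since those are the positions with a score.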

In [62]:
# Drop rows with NaN scores (records not retrieved by any query),
# then sort the remaining rows by cosine similarity in descending order
filtered_data = filtered_data.dropna(subset=['cosine_similarity'])
filtered_data = filtered_data.sort_values('cosine_similarity', ascending=False)

In [63]:
# Display the first few rows of the text field to see the top retrieved reviews
filtered_data['text'].head()

25711                                                                                                                                                                                                                                                                                                                                                              Good luck to the consumers/customers who are trusting the products from OpenAI.
27162    OpenAI did a great job of showing the public the potential for AI. ChatGPT is a great tool for some people. I am thinking of switching to Claude for work needs but I’ll ways have a free account at ChatGPT. But recently with Sora and the voice/camera features of 4o OpenAI seems like a company that is just saying “See all of these cool things that are possible for a select few, but not feasible on a large commercial scale.”
4674                                                                                                                              

In [64]:
filtered_data.describe()

       number_of_comments  number_of_upvotes  cosine_similarity
count              5743.0             5743.0             5743.0
mean           118.025422         110.351733           0.398473
std            240.224534         541.789893           0.083837
min                   0.0              -20.0            0.25381
25%                  11.0                2.0           0.332448
50%                  42.0                7.0           0.386827
75%                 133.5               48.0           0.451919
max                3958.0            17877.0           0.717946


In [65]:
# Save the retrieved data to a new CSV file
filtered_data.to_csv('../Data/selected_data.csv', index=False)

As seen above, the top records returned by semantic retrieval are relevant to OpenAI and do express positive, negative, or neutral sentiments about the company and its products. In total, 5,743 unique records were retrieved, with cosine similarity scores ranging from about 0.25 to 0.72. This gives us a more relevant, higher-quality dataset for the final OpenAI reputation analysis.