# DATA SELECTION AND LABELLING

In [None]:
# %pip install transformers sentence-transformers vaderSentiment

In [1]:
import pandas as pd
import torch
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

  from .autonotebook import tqdm as notebook_tqdm


## 1. SEMANTIC SEARCH WITH SENTENCE TRANSFORMERS

After filtering, there ~54k records (~9k posts, ~47k comments).

The goal of this section is to select the most relevant records that express sentiments about OpenAI, and filter out low quality data. It will enable us to produce a high quality dataset for company reputation analysis.

Prior to using embedding-based semantic search, we experimented with TF-IDF-based retrieval, to find the most relevant records, i.e, the records with the highest cosine similarity to a given query. However, upon manually labellign ~450 of the most relevant records selected using TF-IDF, we found that ~41% of the records were irrelevant, i.e, they express no positive/negative/neutral sentiment about OpenAI.

This is primarily because term-based vectorization methods like TF-IDF do not represent the semantic meaning of the data. Therefore, we decided to experiment with using embedding models with the Sentence Transformers library, which are specialized for conducting semantic retrieval of the most relevant data points, using cosine similarity.

You can find our experiments with retrieval using TF-IDF here: 

We are utilizing the msmarco-distilbert-cos-v5 model as the embedding model for the following reasons:
1. As visualized during exploratory data analysis, our "passages" (comments and posts) are generally longer than the length of the queries we will be using for retrieval (see below). Therefore, we require a model for asymmetric semantic search (where the query is generally shorter in length than the passages to be retrieved). The [Sentence Transformer documentation](https://www.sbert.net/examples/applications/semantic-search/README.html#symmetric-vs-asymmetric-semantic-search) recommends models trained on the MS-MARCO information retrieval dataset, for asymmetric semantic search. 

2. DistilBERT is a smaller, lighter version of BERT that maintains most of the original performance. It is used as the backbone of this embedding model. Therefore, it will be efficient and quick to retrieve relevant examples from our dataset. 

3. The model performs relatively well compared to other Sentence Transformers on various [information retrieval benchmarks](https://www.sbert.net/docs/pretrained-models/msmarco-v5.html#performance).

In [16]:
# Load the embedding model
embedding_model = SentenceTransformer("msmarco-distilbert-cos-v5")

modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


config_sentence_transformers.json:   0%|          | 0.00/122 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/5.17k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/545 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/265M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/319 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]

1_Pooling%2Fconfig.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

In [None]:
# Define multiple search queries, corresponding to each sentiment label, to help
# retrieve a balanced dataset
queries = ["What do users think about OpenAI’s ChatGPT, DALL·E, and other AI tools?",
           "How well do OpenAI’s models perform according to user reviews?",
           "Comparison of OpenAI's products and other competitors based on user reviews",
           "Criticism and complaints about OpenAI’s products in user reviews",
           "Customer satisfaction and positive experiences with OpenAI products"]

In [18]:
# Extract the text column of filtered_data as a list 
reviews = filtered_data["text"].values.tolist()

In [19]:
# Generate embeddings for the queries
query_embeddings = embedding_model.encode(queries, convert_to_tensor=True)

In [20]:
# Generate embeddings for the reviews
review_embeddings = embedding_model.encode(reviews[:15], convert_to_tensor=True)

In [21]:
# Perform cosine similarity search between the queries and reviews embeddings, and retrieve the top 5000 most similar reviews, for each query
retrieved_reviews = util.semantic_search(query_embeddings, review_embeddings, top_k = 5000)

In [22]:
retrieved_reviews

[[{'corpus_id': 6, 'score': 0.44747495651245117},
  {'corpus_id': 10, 'score': 0.21365712583065033},
  {'corpus_id': 7, 'score': 0.20477834343910217},
  {'corpus_id': 9, 'score': 0.1452714502811432},
  {'corpus_id': 14, 'score': 0.13351130485534668},
  {'corpus_id': 1, 'score': 0.1301075518131256},
  {'corpus_id': 2, 'score': 0.12205448001623154},
  {'corpus_id': 8, 'score': 0.09173666685819626},
  {'corpus_id': 3, 'score': 0.0722956731915474},
  {'corpus_id': 13, 'score': 0.06409384310245514},
  {'corpus_id': 12, 'score': 0.06230776011943817},
  {'corpus_id': 4, 'score': 0.038176536560058594},
  {'corpus_id': 11, 'score': 0.01644682139158249},
  {'corpus_id': 5, 'score': 0.01644682139158249},
  {'corpus_id': 0, 'score': 0.0032186005264520645}],
 [{'corpus_id': 6, 'score': 0.45526471734046936},
  {'corpus_id': 7, 'score': 0.2362341433763504},
  {'corpus_id': 10, 'score': 0.21888545155525208},
  {'corpus_id': 14, 'score': 0.1883079707622528},
  {'corpus_id': 9, 'score': 0.11210362613201

In [23]:
# Create a dictionary to store the highest score for each unique id
# from the results of all the queries
unique_reviews = {}

for review_list in retrieved_reviews:
    for review in review_list:
        corpus_id = review['corpus_id']
        score = review['score']
        if corpus_id not in unique_reviews or score > unique_reviews[corpus_id]:
            unique_reviews[corpus_id] = score

In [24]:
# Modify the filtered_data DataFrame to include a new column for the cosine similarity score
# for each unique id
filtered_data['cosine_similarity'] = filtered_data['text'].apply(lambda x: unique_reviews.get(x, 0))

In [25]:
# Drop rows where cosine similarity is 0
filtered_data = filtered_data[filtered_data['cosine_similarity'] != 0]

In [26]:
# Order the DataFrame by cosine similarity in descending order
filtered_data = filtered_data.sort_values(by='cosine_similarity', ascending=False)

In [27]:
# Display the first few rows of the selected data
filtered_data.head()

Unnamed: 0,post_id,subreddit,post_title,post_body,number_of_comments,readable_datetime,post_author,number_of_upvotes,query,text,comment_id,comment_body,comment_author,cosine_similarity


In [28]:
filtered_data.describe()

Unnamed: 0,number_of_comments,readable_datetime,number_of_upvotes,cosine_similarity
count,0.0,0,0.0,0.0
mean,,NaT,,
min,,NaT,,
25%,,NaT,,
50%,,NaT,,
75%,,NaT,,
max,,NaT,,
std,,,,


In [29]:
# Save the retrieved data to a new CSV file
filtered_data.to_csv('../Data/selected_data.csv', index=False)

## LABELLING THE DATASET

### Labelling with RoBERTa based sentiment analysis model

In [None]:
torch.cuda.is_available()

True

In [31]:
# Initialize the sentiment analysis pipeline
sentiment_pipeline = pipeline("text-classification", 
                              model="cardiffnlp/twitter-roberta-base-sentiment-latest",
                              device=0) 

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use cuda:0


In [32]:
selected_data = filtered_data.copy()

In [33]:
# Extract the text column of selected_data as a list
reviews = selected_data["text"].tolist()

In [34]:
# Calculate the sentiment of the each of the reviews
kwargs = {'padding':True,'truncation':True,'max_length':512}
results = sentiment_pipeline(reviews, **kwargs) 

In [35]:
selected_data["roberta_label"] = [res["label"] for res in results]
selected_data["roberta_score"] = [res["score"] for res in results]

### Labelling with VADER (Lexicon and Rule-Based Model)

VADER was selected as it is specialized for sentiment analysis on social media comments.

In [38]:
sentimentAnalyzer = SentimentIntensityAnalyzer()

In [39]:
vader_label, vader_score = [], []

for review in reviews:
    # Calculate the sentiment of the review using VADER
    sentiment = sentimentAnalyzer.polarity_scores(review)
    vader_score.append(sentiment["compound"])
    
    if sentiment["compound"] >= 0.07:
        vader_label.append("positive")
    elif sentiment["compound"] <= -0.07:
        vader_label.append("negative")
    else:
        vader_label.append("neutral")


In [40]:
# Add the VADER sentiment label and score to the selected_data DataFrame
selected_data["vader_label"] = vader_label
selected_data["vader_score"] = vader_score

In [41]:
# Display the first few rows of the selected_data DataFrame
selected_data.head()

Unnamed: 0,post_id,subreddit,post_title,post_body,number_of_comments,readable_datetime,post_author,number_of_upvotes,query,text,comment_id,comment_body,comment_author,cosine_similarity,roberta_label,roberta_score,vader_label,vader_score


In [42]:
# Save the selected_data DataFrame to a new CSV file
selected_data.to_csv('../Data/labelled_data.csv', index=False)