<h1> Title: Comment Quality</h1>

<strong>Overview: In this notebook, I aim to analyze the quality and relevance of comments at scale using the _____.csv</strong><br>
In this notebook, it covers:<br>
1.0 Load dataset<br>
2.0 Language detection and translation<br>
3.0 Spam detection<br>
4.0 Text preprocessing<br>
5.0 Category classification<br>
6.0 Sentiment analysis<br>
7.0 Quality comment analysis

In [2]:
# Basic Libraries
import pandas as pd  
import numpy as np  
import random  
import warnings 
import re
import json
import requests

# Translation
from langdetect import detect
from deep_translator import GoogleTranslator

# Text preprocessing
import nltk  
import contractions  
from nltk.tokenize import word_tokenize  
from nltk.corpus import stopwords, wordnet  
from nltk.stem import PorterStemmer, WordNetLemmatizer  

# Pretrained models
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch
from transformers import pipeline

  from .autonotebook import tqdm as notebook_tqdm


---
## 1.0 Load dataset

- Only rows with videoId in the range 0–40 are collected

In [17]:
# Collect file paths from comments1.csv to comments5.csv
file_paths = [f"Datasets/comments{i}.csv" for i in range(1, 6)]

filtered_chunks = []

for file_path in file_paths:
    for chunk in pd.read_csv(file_path, chunksize=50000):
        # Filter rows where videoId is between 0 and 40
        chunk_filtered = chunk[chunk['videoId'].between(0, 40)]
        if not chunk_filtered.empty:
            filtered_chunks.append(chunk_filtered)

# Combine filtered results from all files
df_filtered = pd.concat(filtered_chunks, ignore_index=True)

# Sort by videoId in ascending order
df_filtered = df_filtered.sort_values(by="videoId", ascending=True)

print(df_filtered.head())

                 kind  commentId  channelId  videoId  authorId  \
1443  youtube#comment    2895557      15366        0   2425288   
3673  youtube#comment     101047      29145        0   3378074   
3679  youtube#comment       2555      30692        0   3456989   
90    youtube#comment    1822478      15366        0   3390312   
91    youtube#comment       2539      30692        0    259614   

                                           textOriginal  parentCommentId  \
1443  The, uh, *shape* of the containers is somethin...              NaN   
3673        And with perfect people like you â¤ï¸ðŸŒ¸        2214515.0   
3679  Please don't call me sir😅😅 best part is you re...        2452518.0   
90    Lol All you need is to put your hair in two po...              NaN   
91                        It's "for the record" by Ooyy        1275651.0   

      likeCount                publishedAt                  updatedAt  
1443          4  2022-09-23 19:12:24+00:00  2022-09-23 19:12:24+00:00  
36

In [18]:
# Save to CSV
output_path = "Datasets/comments.csv"
df_filtered.to_csv(output_path, index=False)

print("Total rows collected:", len(df_filtered))

Total rows collected: 3755


In [2]:
df = pd.read_csv("Datasets/comments.csv")  
df.head()

Unnamed: 0,kind,commentId,channelId,videoId,authorId,textOriginal,parentCommentId,likeCount,publishedAt,updatedAt,translated
0,youtube#comment,2895557,15366,0,2425288,"The, uh, *shape* of the containers is somethin...",,4,2022-09-23 19:12:24+00:00,2022-09-23 19:12:24+00:00,"The, uh, *shape* of the containers is somethin..."
1,youtube#comment,101047,29145,0,3378074,And with perfect people like you â¤ï¸ðŸŒ¸,2214515.0,1,2021-11-11 03:33:45+00:00,2021-11-11 03:33:45+00:00,And with perfect people like you â¤ï¸ðŸŒ¸
2,youtube#comment,2555,30692,0,3456989,Please don't call me sir😅😅 best part is you re...,2452518.0,1,2020-02-12 15:27:17+00:00,2020-02-12 15:27:17+00:00,Please don't call me sir😅😅 best part is you re...
3,youtube#comment,1822478,15366,0,3390312,Lol All you need is to put your hair in two po...,,33,2022-09-23 19:26:44+00:00,2022-09-23 19:26:44+00:00,Lol All you need is to put your hair in two po...
4,youtube#comment,2539,30692,0,259614,"It's ""for the record"" by Ooyy",1275651.0,0,2020-02-13 16:16:00+00:00,2020-02-13 16:16:00+00:00,"It's ""for the record"" by Ooyy"


---
## 2.0 Language Translation

In [3]:
LANG_CODE_MAP = {
    "zh-cn": "zh",  # Simplified Chinese
    "zh-tw": "zh",  # Traditional Chinese
    "ms": "ms",     # Malay
    "id": "id",     # Indonesian
    "en": "en",     # English
    "fr": "fr",     # French
    "de": "de",     # German
    "es": "es",     # Spanish
    "ja": "ja",     # Japanese
    "ko": "ko",     # Korean
}

In [4]:
def context_translate(text: str):
    text = text.strip()
    try:
        lang = detect(text)
    except Exception:
        lang = "unknown"

    lang = LANG_CODE_MAP.get(lang, lang)  # normalize with mapping

    if lang != "en" and lang != "unknown":
        try:
            translated = GoogleTranslator(source=lang, target="en").translate(text)
            return translated, lang
        except Exception as e:
            print(f"⚠️ Translation failed: {e} | Detected: {lang}")
            return text, lang
    return text, lang


In [5]:
context_results = []
for text in df['textOriginal']:  
    translated, lang = context_translate(str(text))
    context_results.append((text, lang, translated))

# Print results
for original, lang, translated in context_results:
    print(f"[{lang}] {original}  -->  {translated}")

⚠️ Translation failed: Make More makeover videos --> No translation was found using the current translator. Try another translator? | Detected: no
[en] The, uh, *shape* of the containers is something else 😳  -->  The, uh, *shape* of the containers is something else 😳
[en] And with perfect people like you â¤ï¸ðŸŒ¸  -->  And with perfect people like you â¤ï¸ðŸŒ¸
[en] Please don't call me sir😅😅 best part is you reply to every component  -->  Please don't call me sir😅😅 best part is you reply to every component
[en] Lol All you need is to put your hair in two ponytails and you'll look like Harley Quinn! You could definitely pull that look off!  -->  Lol All you need is to put your hair in two ponytails and you'll look like Harley Quinn! You could definitely pull that look off!
[en] It's "for the record" by Ooyy  -->  It's "for the record" by Ooyy
[en] All that to look just above average teanage boy  -->  All that to look just above average teanage boy
[en] Alright  -->  Alright
[en] Ple

In [7]:
# Save translated texts in comments.csv
translations = [translated for _, _, translated in context_results]

df["translated"] = translations

df.to_csv("Datasets/comments.csv", index=False)

---
## 3.0 Spam Detection

In [3]:
# Load model and tokenizer from Hugging Face Hub
model = AutoModelForSequenceClassification.from_pretrained("madhurjindal/autonlp-Gibberish-Detector-492513457")
tokenizer = AutoTokenizer.from_pretrained("madhurjindal/autonlp-Gibberish-Detector-492513457")

# Predicts whether a list of comments is spam using a pre-trained model.
def detect_spam(text):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    
    probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_label_id = probabilities.argmax().item()
    
    return model.config.id2label[predicted_label_id]

In [4]:
df["spam"] = df["translated"].apply(detect_spam)

print(df[["translated", "spam"]])

                                             translated            spam
0     The, uh, *shape* of the containers is somethin...  mild gibberish
1           And with perfect people like you â¤ï¸ðŸŒ¸      word salad
2     Please don't call me sir😅😅 best part is you re...  mild gibberish
3     Lol All you need is to put your hair in two po...           clean
4                         It's "for the record" by Ooyy           clean
...                                                 ...             ...
3750  Apparently every other girl is beautiful becau...  mild gibberish
3751  Me being a girl with skintone that doesn't go ...           clean
3752     I'm boutta look like the Colleen Bollinger 😭😭😭  mild gibberish
3753                 I've been doing this since so long           clean
3754                         Red isnt my shade though 😭      word salad

[3755 rows x 2 columns]


In [6]:
print(df["spam"].value_counts())

spam
clean             1724
mild gibberish    1388
word salad         406
noise              237
Name: count, dtype: int64


In [7]:
# Filter only rows predicted as noise (spam)
noise_df = df[df["spam"] == "noise"]

print(noise_df[["textOriginal", "spam"]])

                                           textOriginal   spam
12    I'm so happy thanks for the song â¤â¤â¤â¤ð...  noise
13                                  Muito preciosa ðŸ’Ž  noise
21                 Bhaii lacto calamine pe video bna do  noise
23                                                 Bhhh  noise
25                                    â¤ðŸ˜‚â¤ðŸ˜‚â¤  noise
...                                                 ...    ...
3610                    NO USEN FAJAS HACEN MAL , SALU2  noise
3673                          Con faja: 🤮\nSin faja: 👌👍  noise
3716            Cual es el punto si ella no tiene pansa  noise
3717                                      Que t e t a s  noise
3747                                  Slayyy ikr right❤  noise

[237 rows x 2 columns]


In [8]:
# Filter only rows predicted as word salad
salad_df = df[df["spam"] == "word salad"]

print(salad_df[["textOriginal", "spam"]])

                                     textOriginal        spam
1     And with perfect people like you â¤ï¸ðŸŒ¸  word salad
59                                   ðŸ˜‚ðŸ˜‚ðŸ‘  word salad
61                               Her smile tho ❤🤗  word salad
62                                         Good ❤  word salad
66                                        Awesome  word salad
...                                           ...         ...
3739              If been doing this my hole life  word salad
3741                                Its my go to❤  word salad
3744            Same like with brown olive skin 😂  word salad
3745                                        Same😭  word salad
3754                   Red isnt my shade though 😭  word salad

[406 rows x 2 columns]


In [9]:
# Filter only rows predicted as mild gibberish
gibberish_df = df[df["spam"] == "mild gibberish"]

print(gibberish_df[["textOriginal", "spam"]])

                                           textOriginal            spam
0     The, uh, *shape* of the containers is somethin...  mild gibberish
2     Please don't call me sir😅😅 best part is you re...  mild gibberish
5       All that to look just above average teanage boy  mild gibberish
6                                               Alright  mild gibberish
11     What lash searums do you recommend that is cheap  mild gibberish
...                                                 ...             ...
3742                                             whole*  mild gibberish
3743                         Step one: have clear skin😭  mild gibberish
3748                                               Yeja  mild gibberish
3750  Apparently every other girl is beautiful becau...  mild gibberish
3752         I'm boutta look like Colleen Bollinger 😭😭😭  mild gibberish

[1388 rows x 2 columns]


In [10]:
# Filter only rows predicted as clean (Non-spam)
clean_df = df[df["spam"] == "clean"]

print(clean_df[["textOriginal", "spam"]])

                                           textOriginal   spam
3     Lol All you need is to put your hair in two po...  clean
4                         It's "for the record" by Ooyy  clean
7        Please show us what these are like in the sun.  clean
8                     This is a great point, thank you!  clean
9     Pretty brave to try it anyway after dreaming t...  clean
...                                                 ...    ...
3740  since*\n   also, everyone is beautiful, haters...  clean
3746                                did u try dark red?  clean
3749                    Just gotta find the right shade  clean
3751  Me being a girl with skintone that doesn't go ...  clean
3753                 I've been doing this since so long  clean

[1724 rows x 2 columns]


In [5]:
# Save spam column in comments.csv
df.to_csv("Datasets/comments.csv", index=False)

---
## 4.0 Text Preprocessing
- Lowercasing
- Expanding contractions
- Expanding short forms 
- Removing punctuation, special characters, digits
- Tokenization
- Handling negations
- Lemmatization with POS tagging
- Removing stopwords

In [26]:
# Text Preprocessing
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Load shortform dictionary
with open("shortform_dict.json", "r") as f:
    shortform_dict = json.load(f)

def expand_shortforms(text, shortform_dict):
    def replace(match):
        word = match.group(0)
        return shortform_dict.get(word.lower(), word)
    pattern = re.compile(r'\b(' + '|'.join(re.escape(k) for k in shortform_dict.keys()) + r')\b', flags=re.IGNORECASE)
    return pattern.sub(replace, text)

def clean_text(text):
    text = re.sub(r'[-—]', ' ', text)   # replace hyphens
    # Keep letters, spaces, and emojis (remove digits/punctuation)
    text = re.sub(r'[^a-zA-Z\s\u263a-\U0001f645]', '', text)
    return ' '.join(text.split())

def handle_negations(tokens):
    negation_words = {"not", "no", "never", "n't", "neither", "nor"}
    new_tokens = []
    i = 0
    while i < len(tokens):
        if tokens[i] in negation_words and i + 1 < len(tokens):
            new_tokens.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            new_tokens.append(tokens[i])
            i += 1
    return new_tokens

def get_wordnet_pos(word):
    tag = nltk.pos_tag([word])[0][1][1][0].upper()
    tag_dict = {'J': wordnet.ADJ, 'N': wordnet.NOUN, 'V': wordnet.VERB, 'R': wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)

In [None]:
def preprocess(text):
    text = text.lower() # lowercase
    text = contractions.fix(text) # expand contractions
    text = expand_shortforms(text, shortform_dict) # expand short forms
    text = clean_text(text) # remove punctuation, special characters, digits, and extra whitespace
    tokens = word_tokenize(text) # tokenize
    tokens = handle_negations(tokens) # handle negations
    lemmatized = [lemmatizer.lemmatize(w, get_wordnet_pos(w)) for w in tokens] # lemmatization with POS tagging
    filtered_tokens = [word for word in lemmatized if word not in stop_words] # remove stopwords
    return " ".join(filtered_tokens)

In [28]:
# Filter only rows predicted as not noise 
nonSpam_df = df[df["spam"] != "noise"]

nonSpam_df["cleaned"] = nonSpam_df["translated"].apply(preprocess)

print(nonSpam_df[["translated", "cleaned"]])

                                             translated  \
0     The, uh, *shape* of the containers is somethin...   
1           And with perfect people like you â¤ï¸ðŸŒ¸   
2     Please don't call me sir😅😅 best part is you re...   
3     Lol All you need is to put your hair in two po...   
4                         It's "for the record" by Ooyy   
...                                                 ...   
3513  Apparently every other girl is beautiful becau...   
3514  Me being a girl with skintone that doesn't go ...   
3515     I'm boutta look like the Colleen Bollinger 😭😭😭   
3516                 I've been doing this since so long   
3517                         Red isnt my shade though 😭   

                                                cleaned  
0                   uh shape container something else 😳  
1                                   perfect people like  
2     please not_call sir😅😅 best part reply every co...  
3     laughing loud need put hair two ponytail look ...  
4

In [29]:
# Save the preprocessed text in a new CSV, cleaned_comments.csv
nonSpam_df.to_csv("Datasets/cleaned_comments.csv", index=False)

---
## 5.0 Sentiment Analysis

In [None]:
df = pd.read_csv("Datasets/cleaned_comments.csv")  
df.head()

In [None]:
API_URL = "http://localhost:8000/predict"
API_KEY = "my_secret_key_605"

def predict_emotion(comment: str, threshold: float = 0.8, use_gnn: bool = True):
    if not comment or pd.isna(comment):   # skip NaN or empty text
        return []
    
    headers = {"x-api-key": API_KEY}
    payload = {
        "text": comment,
        "threshold": threshold,
        "use_gnn": use_gnn
    }
    response = requests.post(API_URL, json=payload, headers=headers)

    if response.status_code == 200:
        return response.json()["predictions"]
    else:
        raise Exception(f"Error {response.status_code}: {response.text}")

In [32]:
df["emotion"] = df["cleaned"].apply(predict_emotion)
print(df[["cleaned", "emotion"]])

                                                cleaned  \
0                     uh shape container something else   
1                                   perfect people like   
2     please not_call sir best part reply every comp...   
3     laughing loud need put hair two ponytail look ...   
4                                           record ooyy   
...                                                 ...   
3513  apparently every girl beautiful wear makeup ce...   
3514              girl skintone doe not_go red lipstick   
3515                 boutta look like colleen bollinger   
3516                                         since long   
3517                            red not_my shade though   

                                                emotion  
0     [[curiosity, 0.9953631162643433], [neutral, 0....  
1     [[admiration, 0.9957285523414612], [pride, 0.9...  
2                         [[caring, 0.964708685874939]]  
3     [[amusement, 0.981054425239563], [joy, 0.97373...  
4

In [33]:
# Emotion-to-sentiment mapping
emotion_to_sentiment = {
    "admiration": "positive",
    "amusement": "positive",
    "approval": "positive",
    "caring": "positive",
    "curiosity": "positive",
    "desire": "positive",
    "excitement": "positive",
    "gratitude": "positive",
    "joy": "positive",
    "love": "positive",
    "optimism": "positive",
    "pride": "positive",
    "relief": "positive",

    "anger": "negative",
    "annoyance": "negative",
    "confusion": "negative",
    "disappointment": "negative",
    "disapproval": "negative",
    "disgust": "negative",
    "embarrassment": "negative",
    "fear": "negative",
    "grief": "negative",
    "nervousness": "negative",
    "remorse": "negative",
    "sadness": "negative",

    "neutral": "neutral",
    "realization": "neutral",
    "surprise": "neutral"
}

In [34]:
# Convert emotions → sentiment with scores
def map_to_sentiment(emotion_predictions):
    sentiment_scores = {"positive": 0.0, "negative": 0.0, "neutral": 0.0}
    for emotion, score in emotion_predictions:
        sentiment = emotion_to_sentiment.get(emotion, "neutral")
        sentiment_scores[sentiment] += score
    return sentiment_scores

# Get final sentiment (highest total score)
def get_final_sentiment(emotion_predictions):
    scores = map_to_sentiment(emotion_predictions)
    return max(scores, key=scores.get)

In [43]:
df["sentiment_scores"] = df["emotion"].apply(map_to_sentiment)
df["final_sentiment"] = df["emotion"].apply(get_final_sentiment)

df[["translated", "emotion", "sentiment_scores", "final_sentiment"]]

Unnamed: 0,translated,emotion,sentiment_scores,final_sentiment
0,"The, uh, *shape* of the containers is somethin...","[[curiosity, 0.9953631162643433], [neutral, 0....","{'positive': 0.9953631162643433, 'negative': 0...",positive
1,And with perfect people like you â¤ï¸ðŸŒ¸,"[[admiration, 0.9957285523414612], [pride, 0.9...","{'positive': 1.9077165722846985, 'negative': 0...",positive
2,Please don't call me sir😅😅 best part is you re...,"[[caring, 0.964708685874939]]","{'positive': 0.964708685874939, 'negative': 0....",positive
3,Lol All you need is to put your hair in two po...,"[[amusement, 0.981054425239563], [joy, 0.97373...","{'positive': 1.9547926783561707, 'negative': 0...",positive
4,"It's ""for the record"" by Ooyy","[[pride, 0.9957169890403748], [admiration, 0.9...","{'positive': 1.9819105863571167, 'negative': 0...",positive
...,...,...,...,...
3513,Apparently every other girl is beautiful becau...,"[[admiration, 0.9927363395690918], [disgust, 0...","{'positive': 0.9927363395690918, 'negative': 0...",positive
3514,Me being a girl with skintone that doesn't go ...,"[[disapproval, 0.988126277923584], [annoyance,...","{'positive': 0.0, 'negative': 2.79332548379898...",negative
3515,I'm boutta look like the Colleen Bollinger 😭😭😭,"[[curiosity, 0.993008017539978], [neutral, 0.9...","{'positive': 0.993008017539978, 'negative': 0....",positive
3516,I've been doing this since so long,"[[neutral, 0.9928674697875977], [curiosity, 0....","{'positive': 0.9684097170829773, 'negative': 0...",neutral


In [37]:
# Save the sentiment predicted in cleaned_comments.csv
df.to_csv("Datasets/cleaned_comments.csv", index=False)

---
## 6.0 Category Classification

In [None]:
df = pd.read_csv("Datasets/cleaned_comments.csv")  
df.head()

Unnamed: 0,kind,commentId,channelId,videoId,authorId,textOriginal,parentCommentId,likeCount,publishedAt,updatedAt,translated,spam,cleaned,emotion,sentiment_scores,final_sentiment
0,youtube#comment,2895557,15366,0,2425288,"The, uh, *shape* of the containers is somethin...",,4,2022-09-23 19:12:24+00:00,2022-09-23 19:12:24+00:00,"The, uh, *shape* of the containers is somethin...",mild gibberish,uh shape container something else,"[['curiosity', 0.9953631162643433], ['neutral'...","{'positive': 0.9953631162643433, 'negative': 0...",positive
1,youtube#comment,101047,29145,0,3378074,And with perfect people like you â¤ï¸ðŸŒ¸,2214515.0,1,2021-11-11 03:33:45+00:00,2021-11-11 03:33:45+00:00,And with perfect people like you â¤ï¸ðŸŒ¸,word salad,perfect people like,"[['admiration', 0.9957285523414612], ['pride',...","{'positive': 1.9077165722846985, 'negative': 0...",positive
2,youtube#comment,2555,30692,0,3456989,Please don't call me sir😅😅 best part is you re...,2452518.0,1,2020-02-12 15:27:17+00:00,2020-02-12 15:27:17+00:00,Please don't call me sir😅😅 best part is you re...,mild gibberish,please not_call sir best part reply every comp...,"[['caring', 0.964708685874939]]","{'positive': 0.964708685874939, 'negative': 0....",positive
3,youtube#comment,1822478,15366,0,3390312,Lol All you need is to put your hair in two po...,,33,2022-09-23 19:26:44+00:00,2022-09-23 19:26:44+00:00,Lol All you need is to put your hair in two po...,clean,laughing loud need put hair two ponytail look ...,"[['amusement', 0.981054425239563], ['joy', 0.9...","{'positive': 1.9547926783561707, 'negative': 0...",positive
4,youtube#comment,2539,30692,0,259614,"It's ""for the record"" by Ooyy",1275651.0,0,2020-02-13 16:16:00+00:00,2020-02-13 16:16:00+00:00,"It's ""for the record"" by Ooyy",clean,record ooyy,"[['pride', 0.9957169890403748], ['admiration',...","{'positive': 1.9819105863571167, 'negative': 0...",positive


In [None]:
candidate_labels = ["skincare", "makeup","hair", "other"]

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def classification(comment:str):
    if not comment or pd.isna(comment):   # skip NaN or empty text
        return []
    result = classifier(comment, candidate_labels)
    return result["labels"][0]  # take the best match

df["category"] = df["cleaned"].apply(classification)

In [9]:
df[["translated","cleaned" ,"category"]]

Unnamed: 0,translated,cleaned,category
0,"The, uh, *shape* of the containers is somethin...",uh shape container something else,other
1,And with perfect people like you â¤ï¸ðŸŒ¸,perfect people like,other
2,Please don't call me sir😅😅 best part is you re...,please not_call sir best part reply every comp...,other
3,Lol All you need is to put your hair in two po...,laughing loud need put hair two ponytail look ...,hair
4,"It's ""for the record"" by Ooyy",record ooyy,other
...,...,...,...
3513,Apparently every other girl is beautiful becau...,apparently every girl beautiful wear makeup ce...,makeup
3514,Me being a girl with skintone that doesn't go ...,girl skintone doe not_go red lipstick,makeup
3515,I'm boutta look like the Colleen Bollinger 😭😭😭,boutta look like colleen bollinger,other
3516,I've been doing this since so long,since long,other


In [10]:
# Save the categories predicted in cleaned_comments.csv
df.to_csv("Datasets/cleaned_comments.csv", index=False)

---
## 7.0 Quality Comment Analysis

The **overall quality score** was calculated as a weighted sum:
$$Quality Score=(0.50×Relevance)+(0.30×Sentiment)+(0.20×Engagement)$$

- Scores range from 0 (worst) to 1 (best).
- If the comment was detected as spam, it was immediately assigned 0 quality score.
- A threshold of 0.7 was applied:
    - ≥ 0.7 → High-quality comment
    - < 0.7 → Low-quality comment

In [167]:
df = pd.read_csv("Datasets/cleaned_comments.csv")  

In [168]:
# Define a function to calculate relevance score based on category
def calculate_relevance(text, category):
    # Define keywords for L'Oréal brands
    keywords = [
        'loreal', 'l\'oreal', 'glycolic bright', 'revitalift', 'hyaluronic acid', 'micellar water',
        'aura perfect', 'youth code', 'uv defender', 'infallible', 'true match',
        'rouge signature', 'chiffon signature', 'color riche', 'lash paradise', 'brow artist', 'super liner',
        'la petite', 'elseve', 'extraordinary oil', 'ever', 'excellence ash supreme', 'excellence crème',
        'magic retouch', 'hydra energetic', 'hydra power'
    ]

    # Check if text contains any brand/keyword (case insensitive)
    keyword_count = sum(1 for keyword in keywords if re.search(r'\b' + re.escape(keyword) + r'\b', text))
    
    # Score based on keyword presence and category
    relevance_score = 0
    
    # Add points for keywords
    if keyword_count > 0:
        relevance_score += keyword_count * 0.2  # Cap at 1.0
    
    # Add points for specific category (not 'other')
    if category != 'other':
        relevance_score += 0.5
    
    # Cap the total relevance score at 1.0
    return min(relevance_score, 1.0)

In [169]:
# Define a function to calculate sentiment score
def calculate_sentiment_score(sentiment, sentiment_scores):
    if sentiment == 'positive':
        # Use the positive score from sentiment_scores
        if isinstance(sentiment_scores, str):
            # Convert string representation of dict to actual dict
            try:
                scores_dict = eval(sentiment_scores)
                positive_score = scores_dict.get('positive', 0.5)
            except:
                positive_score = 0.8  # Default for positive
        else:
            positive_score = 0.8  # Default for positive
        return positive_score
    elif sentiment == 'negative':
        return 0.1  # Low score for negative
    else:  # neutral
        return 0.5  # Medium score for neutral

In [170]:
# Define a function to normalize likeCount
def normalize_likes(like_count, max_likes):
    """
    Normalize likeCount to a 0-1 scale using logarithmic scaling to prevent a few highly-liked comments from dominating the score.
    """
    if max_likes == 0:
        return 0
    # Use logarithmic scaling to normalize
    return np.log1p(like_count) / np.log1p(max_likes)

In [171]:
# Calculate raw scores first
df['relevance_score'] = df.apply(lambda row: calculate_relevance(row['textOriginal'], row['category']), axis=1)
df['sentiment_score'] = df.apply(lambda row: calculate_sentiment_score(row['final_sentiment'], row['sentiment_scores']), axis=1)
max_likes = df['likeCount'].max()
df['like_score'] = df['likeCount'].apply(lambda x: normalize_likes(x, max_likes))

In [172]:
# Normalize each score to [0, 1] using min-max normalization
def min_max_normalize(series):
    min_val = series.min()
    max_val = series.max()
    if max_val == min_val:
        return series.apply(lambda x: 0.0)  # Avoid division by zero
    return (series - min_val) / (max_val - min_val)

df['relevance_score'] = min_max_normalize(df['relevance_score'])
df['sentiment_score'] = min_max_normalize(df['sentiment_score'])

In [173]:
# Define a function to calculate overall quality score for a comment
def calculate_quality_score(row):
    """
    Calculate an overall quality score for a comment.
    Ensures the score is always between 0 and 1 and no component dominates.
    """
    # Penalize spam comments heavily
    if row['spam'] != 'clean':
        return 0.0  # Spam comments get quality score of 0
    
    # Weights must sum to 1.0
    w_relevance = 0.50
    w_sentiment = 0.30
    w_engagement = 0.20

    score = (w_relevance * row['relevance_score'] +
             w_sentiment * row['sentiment_score'] +
             w_engagement * row['like_score'])
    return round(min(max(score, 0), 1), 4)

In [174]:
# # Calculate the quality score
df['quality_score'] = df.apply(calculate_quality_score, axis=1)

In [175]:
# Classify comments as quality or not based on a threshold
quality_threshold = 0.1 # Adjust this threshold as needed
df['is_quality'] = df['quality_score'] >= quality_threshold
df

Unnamed: 0,kind,commentId,channelId,videoId,authorId,textOriginal,parentCommentId,likeCount,publishedAt,updatedAt,...,cleaned,emotion,sentiment_scores,final_sentiment,category,relevance_score,sentiment_score,like_score,quality_score,is_quality
0,youtube#comment,2895557,15366,0,2425288,"The, uh, *shape* of the containers is somethin...",,4,2022-09-23 19:12:24+00:00,2022-09-23 19:12:24+00:00,...,uh shape container something else,"[['curiosity', 0.9953631162643433], ['neutral'...","{'positive': 0.9953631162643433, 'negative': 0...",positive,other,0.000000,0.177558,0.167628,0.0000,False
1,youtube#comment,101047,29145,0,3378074,And with perfect people like you â¤ï¸ðŸŒ¸,2214515.0,1,2021-11-11 03:33:45+00:00,2021-11-11 03:33:45+00:00,...,perfect people like,"[['admiration', 0.9957285523414612], ['pride',...","{'positive': 1.9077165722846985, 'negative': 0...",positive,other,0.000000,0.340309,0.072194,0.0000,False
2,youtube#comment,2555,30692,0,3456989,Please don't call me sir😅😅 best part is you re...,2452518.0,1,2020-02-12 15:27:17+00:00,2020-02-12 15:27:17+00:00,...,please not_call sir best part reply every comp...,"[['caring', 0.964708685874939]]","{'positive': 0.964708685874939, 'negative': 0....",positive,other,0.000000,0.172090,0.072194,0.0000,False
3,youtube#comment,1822478,15366,0,3390312,Lol All you need is to put your hair in two po...,,33,2022-09-23 19:26:44+00:00,2022-09-23 19:26:44+00:00,...,laughing loud need put hair two ponytail look ...,"[['amusement', 0.981054425239563], ['joy', 0.9...","{'positive': 1.9547926783561707, 'negative': 0...",positive,hair,0.714286,0.348707,0.367282,0.5352,True
4,youtube#comment,2539,30692,0,259614,"It's ""for the record"" by Ooyy",1275651.0,0,2020-02-13 16:16:00+00:00,2020-02-13 16:16:00+00:00,...,record ooyy,"[['pride', 0.9957169890403748], ['admiration',...","{'positive': 1.9819105863571167, 'negative': 0...",positive,other,0.000000,0.353544,0.000000,0.1061,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3513,youtube#comment,1537690,4217,37,572805,Apparently every other girl is beautiful becau...,,1,2025-06-27 15:13:34+00:00,2025-06-27 15:13:34+00:00,...,apparently every girl beautiful wear makeup ce...,"[['admiration', 0.9927363395690918], ['disgust...","{'positive': 0.9927363395690918, 'negative': 0...",positive,makeup,0.714286,0.177090,0.072194,0.0000,False
3514,youtube#comment,2089049,4217,37,890840,Me being a girl with skintone that doesn't go ...,,96,2025-06-27 14:19:29+00:00,2025-06-27 14:19:29+00:00,...,girl skintone doe not_go red lipstick,"[['disapproval', 0.988126277923584], ['annoyan...","{'positive': 0.0, 'negative': 2.79332548379898...",negative,makeup,0.714286,0.017839,0.476471,0.4578,True
3515,youtube#comment,2378708,4217,37,1009656,I'm boutta look like Colleen Bollinger 😭😭😭,,0,2025-06-27 13:03:52+00:00,2025-06-27 13:03:52+00:00,...,boutta look like colleen bollinger,"[['curiosity', 0.993008017539978], ['neutral',...","{'positive': 0.993008017539978, 'negative': 0....",positive,other,0.000000,0.177138,0.000000,0.0000,False
3516,youtube#comment,3818988,4217,37,1403503,I've been doing this since so long,,5,2025-06-27 11:27:32+00:00,2025-06-27 11:27:32+00:00,...,since long,"[['neutral', 0.9928674697875977], ['curiosity'...","{'positive': 0.9684097170829773, 'negative': 0...",neutral,other,0.000000,0.089193,0.186618,0.0641,False


In [176]:
# Calculate overall quality ratio
total_comments = len(df)
quality_comments = df['is_quality'].sum()
quality_ratio = quality_comments / total_comments

print(f"Total comments: {total_comments}")
print(f"Quality comments: {quality_comments}")
print(f"Quality ratio: {quality_ratio:.2%}")

Total comments: 3518
Quality comments: 474
Quality ratio: 13.47%


In [177]:
# Analyze quality comments by category
quality_by_category = df[df['is_quality']].groupby('category').size()
total_by_category = df.groupby('category').size()
category_quality_ratio = (quality_by_category / total_by_category).fillna(0)

print("\nQuality ratio by category:")
print(category_quality_ratio.sort_values(ascending=False))


Quality ratio by category:
category
hair        0.518519
makeup      0.483333
skincare    0.315789
other       0.106444
[]          0.026786
dtype: float64


In [178]:
quality_df = df[df["is_quality"]]

quality_df

Unnamed: 0,kind,commentId,channelId,videoId,authorId,textOriginal,parentCommentId,likeCount,publishedAt,updatedAt,...,cleaned,emotion,sentiment_scores,final_sentiment,category,relevance_score,sentiment_score,like_score,quality_score,is_quality
3,youtube#comment,1822478,15366,0,3390312,Lol All you need is to put your hair in two po...,,33,2022-09-23 19:26:44+00:00,2022-09-23 19:26:44+00:00,...,laughing loud need put hair two ponytail look ...,"[['amusement', 0.981054425239563], ['joy', 0.9...","{'positive': 1.9547926783561707, 'negative': 0...",positive,hair,0.714286,0.348707,0.367282,0.5352,True
4,youtube#comment,2539,30692,0,259614,"It's ""for the record"" by Ooyy",1275651.0,0,2020-02-13 16:16:00+00:00,2020-02-13 16:16:00+00:00,...,record ooyy,"[['pride', 0.9957169890403748], ['admiration',...","{'positive': 1.9819105863571167, 'negative': 0...",positive,other,0.000000,0.353544,0.000000,0.1061,True
7,youtube#comment,4662277,15366,0,2916157,Please show us what these are like in the sun.,,0,2022-09-25 23:36:49+00:00,2022-09-25 23:36:49+00:00,...,please show us like sun,"[['admiration', 0.9841416478157043], ['love', ...","{'positive': 2.8433847427368164, 'negative': 0...",positive,other,0.000000,0.507219,0.000000,0.1522,True
8,youtube#comment,96146,51730,0,2607618,"This is a great point, thank you!",2309738.0,0,2021-10-17 04:26:39+00:00,2021-10-17 04:26:39+00:00,...,great point thank,"[['gratitude', 0.9998877048492432], ['admirati...","{'positive': 1.9980447888374329, 'negative': 0...",positive,other,0.000000,0.356422,0.000000,0.1069,True
20,youtube#comment,2084275,15366,0,1992873,Is there a semi permanent hair dye out there t...,,0,2022-09-23 19:23:58+00:00,2022-09-23 19:23:58+00:00,...,semi permanent hair dye color changing,"[['curiosity', 0.9986960291862488], ['neutral'...","{'positive': 0.9986960291862488, 'negative': 0...",positive,hair,0.714286,0.178153,0.000000,0.4106,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3499,youtube#comment,513174,4217,37,1716643,Ruby Woo is the best red lipstick ever!!,2034641.0,0,2025-07-13 23:45:21+00:00,2025-07-13 23:45:21+00:00,...,ruby woo best red lipstick ever,"[['admiration', 0.9990666508674622], ['love', ...","{'positive': 1.8762611150741577, 'negative': 0...",positive,makeup,1.000000,0.334698,0.000000,0.6004,True
3500,youtube#comment,530606,4217,37,436620,I love red lipstick 💄 💋 😊❤,,11,2025-06-28 02:10:33+00:00,2025-06-28 02:10:33+00:00,...,love red lipstick,"[['love', 0.9987545013427734], ['admiration', ...","{'positive': 1.9693917036056519, 'negative': 0...",positive,makeup,0.714286,0.351311,0.258811,0.5143,True
3501,youtube#comment,512521,4217,37,1408547,Nude red lipstick might work,2089049.0,4,2025-06-28 08:37:37+00:00,2025-06-28 08:37:37+00:00,...,nude red lipstick might work,"[['curiosity', 0.9919008016586304], ['neutral'...","{'positive': 0.9919008016586304, 'negative': 0...",positive,makeup,0.714286,0.176941,0.167628,0.4438,True
3504,youtube#comment,512565,4217,37,497509,"since*\r\n also, everyone is beautiful, hate...",1537690.0,0,2025-06-28 16:03:18+00:00,2025-06-28 16:03:18+00:00,...,since also everyone beautiful hater hate keep ...,"[['admiration', 0.9962561130523682], ['anger',...","{'positive': 1.8876793384552002, 'negative': 0...",positive,other,0.000000,0.336735,0.000000,0.1010,True


In [179]:
# Display top quality comments with like counts
print("\nTop 10 Quality Comments:")
top_quality = df[df['is_quality']].nlargest(10, 'quality_score')[['textOriginal', 'quality_score', 'likeCount', 'category', 'final_sentiment']]
for idx, row in top_quality.iterrows():
    print(f"\nScore: {row['quality_score']} | Likes: {row['likeCount']} | Category: {row['category']} | Sentiment: {row['final_sentiment']}")
    print(f"Comment: {row['textOriginal'][:100]}..." if len(row['textOriginal']) > 100 else f"Comment: {row['textOriginal']}")


Top 10 Quality Comments:

Score: 0.6004 | Likes: 0 | Category: makeup | Sentiment: positive
Comment: Ruby Woo is the best red lipstick ever!!

Score: 0.5805 | Likes: 1 | Category: hair | Sentiment: positive
Comment: Hi.  So how long did it take you to get that gorgeous hair? Mine is getting better. I straightened m...

Score: 0.5601 | Likes: 0 | Category: hair | Sentiment: positive
Comment: I luv u with black hair but u looked cute in purple hair

Score: 0.5598 | Likes: 0 | Category: makeup | Sentiment: positive
Comment: You look like the main girl from kakeguri

Score: 0.5562 | Likes: 6 | Category: hair | Sentiment: positive
Comment: Hi Aislinn! Would you be willing to try this on dark hair extensions? I'm curious as a brunette how ...

Score: 0.5535 | Likes: 0 | Category: hair | Sentiment: positive
Comment: That is the best haircut I’ve ever seen

Score: 0.5521 | Likes: 6 | Category: hair | Sentiment: positive
Comment: I would love it if you had black hair with a darker purple roots

In [181]:
# Save the quality comments analysis in cleaned_comments.csv
df.to_csv("Datasets/cleaned_comments.csv", index=False)