# Step 1: Import Required Libraries

In [1]:
import pandas as pd

# Google API client to fetch data (YouTube comments)
from googleapiclient.discovery import build
from googleapiclient.errors import HttpError

# Sentiment analysis library
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer


# Step 2: Fetch YouTube Video IDs Using Google API

This cell uses the YouTube Data API v3 to search for videos matching each query and collects the `videoId` of every result. Because the same video can appear under several queries, the combined list is deduplicated before it is passed on to comment fetching and sentiment analysis.

In [2]:
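# Set up YouTube API Client
# The key is read from an environment variable instead of being hardcoded.
# The variable name YOUTUBE_API_KEY is an assumption; use whatever name you export.
import os

API_KEY = os.environ["YOUTUBE_API_KEY"]
youtube = build("youtube", "v3", developerKey=API_KEY)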

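# Function: Search Videos
# Returns the video IDs from the first page of results for a query.
# The API accepts maxResults values from 0 to 50; no pagination is done here.
def search_videos(query, max_results=15):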
    request = youtube.search().list(
        part="snippet",
        q=query,
        type="video",
        maxResults=max_results,
        order="relevance"
    )

    response = request.execute()

    video_ids = []

    for item in response["items"]:
        video_ids.append(item["id"]["videoId"])

    return video_ids

# Define queries
queries = [
    "Amazon work culture",
    "why I quit job at Amazon",
    "Amazon employee review",
    "Amazon toxic workplace"
]

video_ids = []

# Fetch videos for all queries
for query in queries:
    ids = search_videos(query, max_results=30)
    video_ids.extend(ids)

video_ids = list(set(video_ids))
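
One caveat with `list(set(...))` is that set iteration order is arbitrary, so the order of `video_ids` can differ between runs. If a reproducible order matters, a minimal sketch of an order-preserving alternative follows. The helper name `dedupe_preserving_order` and the sample IDs are illustrative and not part of the original notebook.

```python
def dedupe_preserving_order(ids):
    """Drop repeated IDs while keeping the first occurrence of each.

    dict keys keep insertion order (Python 3.7+), so the result is
    deterministic, unlike list(set(ids)).
    """
    return list(dict.fromkeys(ids))


# Hypothetical IDs, as if two queries returned overlapping results
sample_ids = ["vidA", "vidB", "vidA", "vidC", "vidB"]
unique_ids = dedupe_preserving_order(sample_ids)
print(unique_ids)
```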

# Step 3: Fetch Comments from a YouTube Video

The function calls the YouTube `commentThreads` endpoint to fetch top-level comments for a given `videoId`, using `list_next()` to page through results until `max_comments` is reached or no pages remain. If the API raises an `HttpError` (for example, because comments are disabled or restricted), the video is skipped and an empty list is returned. The result is a **list of plain-text comments** for downstream analysis.

In [3]:
def get_comments_from_video(video_id, max_comments=200):
    comments = []

    try:
        request = youtube.commentThreads().list(
            part="snippet",
            videoId=video_id,
            maxResults=100,
            textFormat="plainText"
        )

        while request and len(comments) < max_comments:
            response = request.execute()

            for item in response['items']:
                comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
                comments.append(comment)

            request = youtube.commentThreads().list_next(request, response)

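    except HttpError as e:
        # Usually raised when comments are disabled or restricted, but other API
        # errors (such as an exhausted quota) land here too, so report the status.
        print(f"Skipping video {video_id} (HTTP {e.resp.status}: comments disabled, restricted, or other API error)")
        return []

    # Pages hold up to 100 comments, so trim any overshoot past max_comments
    return comments[:max_comments]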
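
The nested lookup in the loop mirrors the shape of one `commentThreads` response page. The sketch below pulls that lookup into a small pure function and runs it on a hand-written dictionary, so the parsing can be checked without calling the API. The function name `extract_top_level_comments` and the sample page are hypothetical; only the keys already used in the function above are assumed.

```python
def extract_top_level_comments(response, limit=None):
    """Return the top-level comment texts from one commentThreads response page."""
    texts = [
        item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
        for item in response.get("items", [])
    ]
    return texts if limit is None else texts[:limit]


# Hand-written stand-in for one API response page (not real data)
fake_page = {
    "items": [
        {"snippet": {"topLevelComment": {"snippet": {"textDisplay": "first comment"}}}},
        {"snippet": {"topLevelComment": {"snippet": {"textDisplay": "second comment"}}}},
    ]
}

page_comments = extract_top_level_comments(fake_page, limit=1)
print(page_comments)
```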



# Step 4: Aggregate Comments from Multiple Videos

This cell loops over `video_ids`, calls `get_comments_from_video()` for each one, and stores every comment alongside its video ID in a **list of dictionaries**. The list is then converted into a **pandas DataFrame** with one row per comment, producing a unified dataset ready for analysis.


In [4]:
all_comments = []

for vid in video_ids:
    print(f"Fetching comments from video: {vid}")
    comments = get_comments_from_video(vid, max_comments=200)

    for comment in comments:
        all_comments.append({
            "video_id": vid,
            "comment": comment
        })

df = pd.DataFrame(all_comments)

print("Total comments collected:", len(df))


Fetching comments from video: PE2wHqiwTf0
Fetching comments from video: _QgIj7eNYAM
Fetching comments from video: 1QUy3_Uofoo
Fetching comments from video: tsgvHZr7vSQ
Fetching comments from video: cLpQb-_c80Q
Fetching comments from video: j2l1Ke0UtAE
Fetching comments from video: MvVT08Eiml8
Fetching comments from video: k-RcftDUUpw
Fetching comments from video: r7T0k5QTKj8
Fetching comments from video: tFjFBGMLj_M
Fetching comments from video: OHQqgnq7Spc
Fetching comments from video: CafjaZqnm6I
Fetching comments from video: aMY5pQeGIE8
Fetching comments from video: rQxPWdhoQe0
Fetching comments from video: _sr2wpO4b64
Fetching comments from video: 5DZlA-j__5E
Fetching comments from video: X67GxburFjQ
Fetching comments from video: BdweCZI0bGo
Fetching comments from video: RTVZVv2Xnv8
Fetching comments from video: SMQ9jIypeOA
Fetching comments from video: aFcgKYUM_Yk
Fetching comments from video: VzmgTg4g2Ng
Fetching comments from video: wk1w_U1Xmcw
Fetching comments from video: rq3a
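
Since every entry in `all_comments` carries its `video_id`, a quick per-video tally can show which videos contributed no comments, for example those skipped because of an API error. A minimal sketch on made-up rows follows; `count_comments_per_video` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter


def count_comments_per_video(rows, video_ids):
    """Count comments per video, including videos that yielded none."""
    counts = Counter(row["video_id"] for row in rows)
    return {vid: counts.get(vid, 0) for vid in video_ids}


# Made-up rows in the same shape as all_comments
rows = [
    {"video_id": "vidA", "comment": "great place to work"},
    {"video_id": "vidA", "comment": "long shifts"},
    {"video_id": "vidB", "comment": "I quit last year"},
]
per_video = count_comments_per_video(rows, ["vidA", "vidB", "vidC"])
print(per_video)
```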

# Step 5: Filter Comments by Relevance Keywords

This code filters the DataFrame to retain only comments containing **workplace-related keywords**. The check is a case-insensitive substring match, so a keyword such as `work` also matches longer words like `working` or `network`. The resulting dataset contains **relevant comments** for further sentiment or topic analysis, reducing noise from unrelated content.

In [5]:
# Define a list of keywords relevant to workplace discussions
keywords = [
    "work", "job", "manager", "salary", "culture",
    "toxic", "career", "stress", "environment",
    "boss", "employee", "promotion", "office",
    "quit", "fired", "workload", "shift"
]

def is_relevant(comment):
    comment_lower = comment.lower()
    return any(word in comment_lower for word in keywords)

df = df[df["comment"].apply(is_relevant)]
print("Remaining comments:", len(df))

Remaining comments: 4659
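
Because `word in comment_lower` is a substring test, short keywords can match inside unrelated words. If stricter matching is wanted, one option is a regular expression with word boundaries. This is a sketch of an alternative, not the filter used above: the name `is_relevant_whole_word` is hypothetical and the keyword list is shortened for brevity. Whole-word matching also stops `work` from matching `working`, which may or may not be desirable.

```python
import re

keywords = ["work", "job", "manager", "toxic", "shift"]

# \b anchors each keyword at word boundaries; re.escape guards against
# regex metacharacters if the keyword list ever contains them
pattern = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b",
    re.IGNORECASE,
)


def is_relevant_whole_word(comment):
    """True if the comment contains any keyword as a whole word."""
    return pattern.search(comment) is not None


print(is_relevant_whole_word("My network is down"))    # "work" appears only inside "network"
print(is_relevant_whole_word("My JOB is exhausting"))  # "job" appears as a whole word
```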


# Step 6: Remove Duplicate Comments

This code removes duplicate comments by comparing the `"comment"` column, keeping the first occurrence of each exact text. It also reports how many duplicates were removed, providing a **cleaned dataset** ready for analysis.

In [6]:
initial_count = len(df)

df = df.drop_duplicates(subset="comment")

print("Removed duplicates:", initial_count - len(df))


Removed duplicates: 17
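
`drop_duplicates` only removes exact string matches, so copies that differ in letter case or spacing survive. The sketch below shows one possible normalization key for looser matching, written in plain Python so it can be run on its own. Both helper names are hypothetical, and this is an optional extension rather than what the cell above does.

```python
def normalize_comment(text):
    """Lowercase and collapse whitespace so trivially different copies compare equal."""
    return " ".join(text.lower().split())


def drop_near_duplicates(comments):
    """Keep the first comment seen for each normalized form."""
    seen = set()
    kept = []
    for text in comments:
        key = normalize_comment(text)
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept


sample = ["Amazon is tough", "amazon  is tough ", "Totally different comment"]
deduped = drop_near_duplicates(sample)
print(deduped)
```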


# Step 7: Filter Short and Spam/Promotional Comments

This code drops **short comments (fewer than 6 words)** and **spam or promotional content**, identified by a case-insensitive regex match on `http`, `www`, `subscribe`, or `channel`. The result is a **cleaner, higher-quality dataset** for analysis.

In [7]:
# Remove comments with fewer than 6 words
df = df[df["comment"].str.split().str.len() > 5]

# Remove comments containing URLs or common promotional terms
df = df[~df["comment"].str.contains(r"http|www|subscribe|channel", case=False, regex=True)]
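
To see how the two filters behave on individual strings, the sketch below mirrors them in plain Python with the same word-count cutoff and the same regex. The function name `keep_comment` and the example comments are made up for illustration.

```python
import re

# Same pattern as the str.contains() call above
SPAM_PATTERN = re.compile(r"http|www|subscribe|channel", re.IGNORECASE)


def keep_comment(comment, min_words=6):
    """True if the comment has at least min_words words and no spam pattern."""
    long_enough = len(comment.split()) >= min_words
    looks_like_spam = SPAM_PATTERN.search(comment) is not None
    return long_enough and not looks_like_spam


examples = [
    "Too short to keep",                                    # only 4 words
    "Please subscribe to my page for more Amazon stories",  # promotional term
    "I worked there for two years and the shifts were brutal",
]
kept = [c for c in examples if keep_comment(c)]
print(kept)
```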

# Step 8: Sentiment Analysis Using VADER

This code uses **VADER sentiment analysis** to classify each comment as **positive, negative, or neutral** based on its compound score: 0.05 or higher is labeled Positive, -0.05 or lower is labeled Negative, and anything in between is Neutral. The resulting `label` column allows for **quantitative sentiment analysis** of the comment dataset.

In [8]:
# Initialize VADER sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# Function to assign sentiment label based on compound score
def label_sentiment(text):
    score = analyzer.polarity_scores(text)["compound"]
    
    if score >= 0.05:
        return "Positive"
    elif score <= -0.05:
        return "Negative"
    else:
        return "Neutral"

# Apply sentiment labeling to all comments
df["label"] = df["comment"].apply(label_sentiment)

# Display counts of each sentiment category
df["label"].value_counts()

label
Positive    2240
Negative    1730
Neutral      484
Name: count, dtype: int64
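
Raw counts are easier to compare as percentages. The sketch below recomputes the shares from the three counts printed above; in pandas, `df["label"].value_counts(normalize=True)` would give the same proportions directly.

```python
# Counts copied from the value_counts() output above
counts = {"Positive": 2240, "Negative": 1730, "Neutral": 484}

total = sum(counts.values())
# Percentage share of each label, rounded to one decimal place
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(total)
print(shares)
```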

1. Sentiment labeling is done using VADER, a lexicon- and rule-based sentiment analyzer designed for social media text.
2. It classifies comments as positive, negative, or neutral based on a compound score.
3. Alternative approaches include manual hand-labeling (more accurate but time-consuming) or using pre-labeled datasets.
4. VADER is chosen here because it is fast, works well on short informal text like YouTube comments, and requires no manual labeling.
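
The thresholding rule can be isolated from the analyzer to make its boundary behavior explicit. The sketch below takes a compound score (VADER normalizes it to the range -1 to 1) and applies the same cutoffs as `label_sentiment`. The function name `label_from_compound` and the `threshold` parameter are hypothetical additions for illustration.

```python
def label_from_compound(score, threshold=0.05):
    """Map a compound score to a label using symmetric cutoffs."""
    if score >= threshold:
        return "Positive"
    if score <= -threshold:
        return "Negative"
    return "Neutral"


# Scores exactly on a cutoff fall into the Positive or Negative class
for score in (0.05, 0.049, 0.0, -0.05):
    print(score, label_from_compound(score))
```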

In [9]:
df.head(20)

| | video_id | comment | label |
|---|---|---|---|
| 1 | PE2wHqiwTf0 | The most gangster thing I've herd all year... ... | Negative |
| 9 | PE2wHqiwTf0 | I feel bad for her coworkers who are probably ... | Negative |
| 11 | PE2wHqiwTf0 | Women are wired differently than men we need o... | Positive |
| 12 | PE2wHqiwTf0 | I think it's news to Amazon that she's an empl... | Negative |
| 16 | PE2wHqiwTf0 | Beo I swear women get so much leniency on jobs... | Positive |
| 17 | PE2wHqiwTf0 | 8-1pm is a big gap. Obviously Amazon doesn’t c... | Positive |
| 18 | PE2wHqiwTf0 | From my understanding she is flex, she can wor... | Positive |
| 20 | PE2wHqiwTf0 | Amazon can be very lazy or not care. When I wo... | Negative |
| 24 | PE2wHqiwTf0 | Nope not Me bro my shift is 1:20 to 11:50 | Neutral |
| 27 | PE2wHqiwTf0 | I was 5 minutes late to my shift once and they... | Negative |


In [10]:
# Save the cleaned and labeled comments to a CSV file
df.to_csv("youtube_comments_cleaned.csv", index=False)