# IMT 547 Project Part II: Data Preprocessing

Chesie Yu

02/25/2024

<style type = "text/css">  
    body {
        font-family: "Serif"; 
        font-size: 12pt;
    }
    em {
        color: #4E7F9E;
    }
    strong {
        color: #436D87;
    }
    li {
        color: #4E7F9E;
    }
    ul {
        color: #4E7F9E;
    }
    img {
        display: block;
        margin: auto;
    } 
    .jp-RenderedHTMLCommon a:link { 
        color: #94C1C9;
    }
    .jp-RenderedHTMLCommon a:visited { 
        color: #94C1C9;
    }
    .jp-RenderedHTMLCommon code {
        color: #4E7F9E;
    }  
    .mark {
        color: #B00D00;
        background-color: #FFF7B1;
    }
</style>

_This notebook outlines the **data preprocessing** process for the **YouTube Gaming Comment Toxicity** project._    

**Components**  
1. **Data Cleaning**: Data cleaning procedures including handling missing values and converting data types.    
2. **Feature Engineering**: Feature engineering process of deriving new features by transforming existing variables.   
3. **Text Preprocessing**: Text cleaning measures including text standardization, noise elimination, stopwords removal, and tokenization.   
4. **Data Labeling**: Perspective API toxicity annotations and VADER/TextBlob/Empath sentiment evaluation.  


**Functions**   
- **`build_client(api_key)`**: Build a client for a given Perspective API key.  
- **`split_text(text, size=40, overlap=10)`**: Split the given text into overlapping chunks.
- **`calculate_proportions(chunks, overlap=10)`**: Calculate the proportion of each chunk.  Assign lower weight to overlapping words.
- **`perspective_toxicity(texts, prefix, size=40, overlap=10, split=False)`**: Compute Perspective toxicity scores for a given list of texts. Supports throttling management with client reuse, key rotation, and exponential backoff.   
- **`remove_noise(text)`**: Perform minimal text preprocessing by removing noise.  
- **`clean(text)`**: Perform moderate text preprocessing on a given document.  
- **`vader_sentiment(text)`**: Compute VADER sentiment scores for a given text.    
- **`textblob_sentiment(text)`**: Compute TextBlob sentiment scores for a given text.   
- **`empath_sentiment(text)`**: Compute Empath sentiment scores for a given text.   

**Runtime**  
_Approximately 7.5 hours._    

In [1]:
# Import the libraries
import json
import random
import re
import time

import isodate
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import contractions
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

## 1. Load the Data

### Channel

In [2]:
# Load the data
channel = pd.read_csv("../data/channel.csv")
channel.head(2)

Unnamed: 0,channel_id,channel_name,channel_description,channel_country,channel_uploads_id,channel_viewcount,channel_subscribercount,channel_videocount
0,UC-lHJZR3Gqxm24_Vd_AJ5Yw,PewDiePie,I make videos.,JP,UU-lHJZR3Gqxm24_Vd_AJ5Yw,29238461440,111000000,4753
1,UC7_YxT-KID8kRbqZo7MyscQ,Markiplier,"Hi, I'm Markiplier. I make videos. \n\nFrom qu...",US,UU7_YxT-KID8kRbqZo7MyscQ,21204065899,36400000,5577


In [3]:
# Check the dimensions
print(f"Number of rows: {channel.shape[0]}\n"
      f"Number of columns: {channel.shape[1]}\n")

# Check for missing values
print(f"Number of missing values: {channel.isna().sum().sum()}")

Number of rows: 32
Number of columns: 8

Number of missing values: 0


### Video

In [4]:
# Load the data
video = pd.read_csv("../data/video.csv")
video.head(2)

Unnamed: 0,channel_id,video_id,video_title,video_creation_time,video_description,video_duration,video_tags,video_viewcount,video_likecount,video_commentcount,video_subtitle_path,video_subtitle,video_live,video_genre
0,UC-lHJZR3Gqxm24_Vd_AJ5Yw,F-yEoHL7MYY,I t̶r̶i̶e̶d̶ ̶t̶o̶ beat Elden Ring Without Dyi...,2022-04-30T16:40:18Z,🌏 Get exclusive NordVPN deal here ➵ https://N...,PT53M47S,"['pewdiepie', 'pewds', 'pewdie']",11590164,474052,15146.0,../subtitle/F-yEoHL7MYY.en.json3,i have beaten all souls games without dying a ...,False,action
1,UC-lHJZR3Gqxm24_Vd_AJ5Yw,PV4NGwn_xdI,Elden Ring #1 - 0 DEATH PLAYTHROUGH,2022-02-24T15:00:10Z,NEW animated wallpapers on GX 👇👇👇\nhttps://ope...,PT13M38S,"['pewdiepie', 'pewds', 'pewdie', 'elden ring',...",5179366,192101,4313.0,../subtitle/PV4NGwn_xdI.en.json3,ah you ready yes we're ready eldon ring baby l...,False,action


In [5]:
# Check the dimensions
print(f"Number of rows: {video.shape[0]}\n"
      f"Number of columns: {video.shape[1]}\n")

# Check for missing values
print(f"Number of missing values: {video.isna().sum().sum()}")

Number of rows: 1407
Number of columns: 14

Number of missing values: 184


### Comment

In [6]:
# Load the data
comment = pd.read_csv("../data/comment.csv")
comment.head(2)

Unnamed: 0,video_id,comment_id,comment_author_id,comment_text,comment_time,comment_likecount,comment_replycount
0,F-yEoHL7MYY,UgwN1kGXwi9M7jeOb0d4AaABAg,UCLHsZ4X7YemjxRrvq0AI4LA,"Damn dude, even with mimic I think it would ta...",2022-05-02T19:37:22Z,9907.0,47.0
1,F-yEoHL7MYY,UgwW5nWEkxdES-g3hk54AaABAg,UCJ9VDCLZDmeJIU3Branlstg,This is the pewds that I thought he’d turn int...,2022-12-14T23:36:11Z,6299.0,9.0


In [7]:
# Check the dimensions
print(f"Number of rows: {comment.shape[0]}\n"
      f"Number of columns: {comment.shape[1]}\n")

# Check for missing values
print(f"Number of missing values: {comment.isna().sum().sum()}")

Number of rows: 136463
Number of columns: 7

Number of missing values: 39


_The dataset contains **136,463 comments** collected from action and non-action gaming videos on YouTube.  It features a total of **26 columns** of metadata associated with the channels, videos, and comments.  **184** and **39 missing entries** are detected in the video and comment datasets, respectively; we address these data quality concerns in the subsequent sections._    

<br>

## 2. Data Cleaning

### Convert Data Types

*Note that the **`video_creation_time`** and **`comment_time`** are represented as **objects**; since these two columns represent dates and times, we will convert them to the more appropriate type **`datetime`** for efficient analysis.*  

In [8]:
# Check the data types
video.dtypes

channel_id              object
video_id                object
video_title             object
video_creation_time     object
video_description       object
video_duration          object
video_tags              object
video_viewcount          int64
video_likecount          int64
video_commentcount     float64
video_subtitle_path     object
video_subtitle          object
video_live                bool
video_genre             object
dtype: object

In [9]:
# Check the data types
comment.dtypes

video_id               object
comment_id             object
comment_author_id      object
comment_text           object
comment_time           object
comment_likecount     float64
comment_replycount    float64
dtype: object

In [10]:
# Convert to datetime
video["video_creation_time"] = pd.to_datetime(video["video_creation_time"])
comment["comment_time"] = pd.to_datetime(comment["comment_time"])

### Parse ISO 8601 Durations

_Since the video durations are coded in **ISO 8601 format**, we parsed them into the **total number of seconds** to facilitate the analysis._  

In [11]:
# Parse duration strings
video["video_duration"] = \
    video["video_duration"].apply(isodate.parse_duration)\
                           .apply(lambda x: x.total_seconds())
video["video_duration"][:5]

0    3227.0
1     818.0
2     570.0
3    9328.0
4    8522.0
Name: video_duration, dtype: float64
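
*As a sanity check on the parsed values, the conversion can be reproduced by hand. The sketch below is a minimal, standard-library-only parser; `duration_to_seconds` is a hypothetical helper (not part of this notebook) and it assumes durations of the form `PT#H#M#S` only, whereas `isodate` handles the full ISO 8601 grammar.*

```python
import re

# Matches time-only ISO 8601 durations such as "PT53M47S" or "PT2H35M28S".
# Assumption: no date components (days, weeks) and no fractional seconds.
_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def duration_to_seconds(value):
    """Convert a time-only ISO 8601 duration string to total seconds."""
    match = _DURATION.match(value)
    if match is None:
        raise ValueError(f"Unsupported duration: {value!r}")
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


print(duration_to_seconds("PT53M47S"))  # 3227
print(duration_to_seconds("PT13M38S"))  # 818
```

*The two results agree with the first two parsed durations above (3227.0 and 818.0 seconds).*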

### Handle Missing Values

_Given that the missing entries account for only **0.0585%** of the dataset, we will employ the **deletion** method to handle these missing values.  By eliminating rows that contain missing values, we ensure that our analysis is based on **complete and accurate** information._    

In [12]:
# Check the missings
video.isna().sum()

channel_id              0
video_id                0
video_title             0
video_creation_time     0
video_description      17
video_duration          0
video_tags              0
video_viewcount         0
video_likecount         0
video_commentcount      3
video_subtitle_path    82
video_subtitle         82
video_live              0
video_genre             0
dtype: int64

In [13]:
# Identify videos with missings
video_na = video[video.isna().any(axis=1)]["video_id"]
len(video_na)

100

In [14]:
# Check the missings
comment.isna().sum()

video_id               0
comment_id             3
comment_author_id      3
comment_text          12
comment_time           7
comment_likecount      7
comment_replycount     7
dtype: int64

_Upon inspection, we also identified a few **invalid entries** due to **improper handling of carriage returns (\r)** when writing to CSV. These will be cleaned up in the process of removing missing data._  

In [15]:
# Inspect invalid entries
comment_na = comment[comment.isna().any(axis=1)]
comment_na.loc[34976:34980]

Unnamed: 0,video_id,comment_id,comment_author_id,comment_text,comment_time,comment_likecount,comment_replycount
34976,1dQH2Gu3SfA,UgxQN_eazMleay2VPfh4AaABAg,UCLwteEET5tXvVLfTYCA3lrw,AHAHAHAHAHAHAHHAHAAI’m dying dude 🤣,NaT,,
34977,🤣,,,,NaT,,
34978,🤣,,,,NaT,,
34979,🤣,,,,NaT,,
34980,🤣,2021-05-12T19:50:42Z,28,3,NaT,,


In [16]:
# Inspect more invalid entries
comment_na.loc[88113:88115]

Unnamed: 0,video_id,comment_id,comment_author_id,comment_text,comment_time,comment_likecount,comment_replycount
88113,_6NLLR825u0,Ugz7vZVax-_tWTS9BGR4AaABAg,UC-HUp6Zw68aoi4VLmpYra5g,Can we just stop and take a moment to apprecia...,NaT,,
88114,LazarBeam does.,2021-09-04T23:03:10Z,5,0,NaT,,
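
*The broken rows above are what a CSV reader produces when a bare carriage return inside a comment is treated as a row terminator. We cannot tell from this notebook exactly how the file was written, so the following is only a sketch of one way to avoid the problem: let Python's `csv` module quote the field and keep newline translation off on both write and read. The IDs and text below are made up for illustration.*

```python
import csv
import io

# Made-up rows; the comment text embeds carriage returns like the entries above
rows = [
    ["video_id", "comment_id", "comment_text"],
    ["vid_1", "cmt_1", "I'm dying dude\r🤣\r🤣"],
]

# newline="" disables newline translation, as the csv module expects
buffer = io.StringIO(newline="")
csv.writer(buffer).writerows(rows)  # fields containing \r or \n are quoted

# Read the text back: the quoted field stays in a single row
buffer.seek(0)
restored = list(csv.reader(buffer))

print(len(restored))     # 2
print(restored == rows)  # True
```

*When writing to disk, the equivalent precaution is `open(path, "w", newline="", encoding="utf-8")`.*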


In [17]:
# Remove the missings
video.dropna(inplace=True)
video.shape

(1307, 14)

In [18]:
# Cascade Delete for comment
comment = comment[~comment["video_id"].isin(video_na)]
comment.shape

(128476, 7)

In [19]:
# Remove the missings (reassign rather than modify the filtered slice in place)
comment = comment.dropna()
comment.shape

(128461, 7)

_Following the removal of missing entries, our dataset now comprises **128,461 comments** across **1,307 distinct videos**._  

### Check Duplicate Entries

_In the previous iteration, we discovered 4 videos that were categorized as both action and non-action. To address this, we implemented a **duplicate check** to **prevent overlapping genres**._  

In [20]:
# Check duplicate videos
video[video.duplicated(subset="video_id", keep=False)]

Unnamed: 0,channel_id,video_id,video_title,video_creation_time,video_description,video_duration,video_tags,video_viewcount,video_likecount,video_commentcount,video_subtitle_path,video_subtitle,video_live,video_genre


In [21]:
# Check duplicate comments
comment[comment.duplicated(keep=False)]

Unnamed: 0,video_id,comment_id,comment_author_id,comment_text,comment_time,comment_likecount,comment_replycount


### Remove Sparse Channels

_Our initial plan was to exclude sparse channels with a **very low count of videos** to **minimize bias in channel-level analyses**. However, we decided to retain these channels at this stage to explore interesting patterns from specific channels (IShowSpeed) that would have otherwise been removed._  

In [22]:
# Number of videos in each genre for each channel
channel_video_counts = \
    video.groupby(["channel_id", "video_genre"])["video_id"].count()\
         .reset_index(name="video_count")\
         .pivot_table(index="channel_id",
                      columns="video_genre",
                      values="video_count",
                      fill_value=0)\
         .reset_index()\
         .rename_axis(None, axis=1)

# Fix data types
channel_video_counts["action"] = channel_video_counts["action"].astype(int)
channel_video_counts["non-action"] = channel_video_counts["non-action"].astype(int)

# Compute total video count
channel_video_counts["video_count"] = channel_video_counts["action"] + channel_video_counts["non-action"]
pd.concat([channel_video_counts.sort_values(by = "video_count").head(6),\
           channel_video_counts.sort_values(by = "video_count").tail(3)])

Unnamed: 0,channel_id,action,non-action,video_count
16,UCWsDFcIhY2DBi3GB5uykGXA,5,0,5
13,UCMNmwqCtCSpftrbvR3KkHDA,0,6,6
6,UCAW-NpUFkMyCNrvRSSGIvDQ,4,5,9
25,UCpGdL9Sn3Q5YWUH2DVUW1Ug,0,30,30
12,UCKy1dAqELo0zrOtPkf0eTMw,14,16,30
10,UCJZam2u1G0syq3kyqrCXrNw,1,30,31
28,UCw1SQ6QRRtfAhrN_cjkrOgA,30,30,60
1,UC0DZmkupLYwc0yDsfocLh0A,30,30,60
15,UCVAYy7_zddIx6USjXscV9Vg,30,30,60


In [23]:
# Calculate video counts for each channel
video_counts = video.groupby("channel_id")["video_id"].count()\
                    .reset_index(name="video_count")

# Identify channels with fewer than 10 videos
channel_sparse = video_counts[video_counts["video_count"] < 10]["channel_id"]
channel_sparse

6     UCAW-NpUFkMyCNrvRSSGIvDQ
13    UCMNmwqCtCSpftrbvR3KkHDA
16    UCWsDFcIhY2DBi3GB5uykGXA
Name: channel_id, dtype: object

In [24]:
# # Remove sparse channels
# channel = channel[~channel["channel_id"].isin(channel_sparse)]
# channel.shape

In [25]:
# # Identify videos from sparse channels
# video_sparse = video[video["channel_id"].isin(channel_sparse)]["video_id"]
# len(video_sparse)

In [26]:
# # Cascade delete videos in sparse channels
# video = video[~video["video_id"].isin(video_sparse)]
# video.shape

In [27]:
# # Cascade delete comments from videos in sparse channels
# comment = comment[~comment["video_id"].isin(video_sparse)]
# comment.shape

<br>

## 3. Feature Engineering

_By **transforming existing variables**, we derived several new features relevant to our analysis._  

### Video: Game Involved

*`video_game` labels the **game referenced** in the video title, enabling more nuanced analysis on a **per-game basis**.*     

In [28]:
# Game keywords
games = [
    "call of duty", "gta", "the last of us", "god of war", "cyberpunk", 
    "red dead redemption", "fallout", "elden ring", "assassin's creed", 
    "star wars jedi", "resident evil", "tomb raider", "minecraft", 
    "pokemon go", "just dance", "it takes two", "uncharted", "brawl stars"
]

# Extract the game from video titles
video["video_game"] = video["video_title"]\
    .apply(lambda title: ", ".join([g for g in games if g.lower() in title.lower()]))
video["video_game"][:5]

0       elden ring
1       elden ring
2              gta
3    resident evil
4    resident evil
Name: video_game, dtype: object

### Video: Blocked Words

_We noticed that **YouTube automatically blocks certain profanities** in its video subtitles.  Given that this would likely **systematically lower perceived video toxicity scores**, we extracted the **counts and proportions of censored words in video subtitles** as a **supplementary dimension** of the video toxicity assessment._  

In [29]:
# Define regex pattern to match "[ __ ]"
pattern = r"\[\s*[_]+\s*\]"

# Count occurrences of blocked words in video subtitle
video["video_wordcount"] = video["video_subtitle"].apply(lambda x: len(x.split()))
video["video_blocked_wordcount"] = video["video_subtitle"].str.count(pattern)
video["video_blocked_proportion"] = video["video_blocked_wordcount"] / video["video_wordcount"]
video[["video_wordcount", "video_blocked_wordcount", "video_blocked_proportion"]][:5]

Unnamed: 0,video_wordcount,video_blocked_wordcount,video_blocked_proportion
0,8913,32,0.00359
1,1607,19,0.011823
2,1277,3,0.002349
3,11485,152,0.013235
4,9941,150,0.015089
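
*To illustrate what the pattern does and does not match, here is a small self-contained check on a made-up subtitle fragment. Note that, assuming the marker is written with inner spaces (`[ __ ]`), a whitespace split counts it as three tokens, so the word count used as the denominator is slightly inflated.*

```python
import re

# Same pattern as above: an opening bracket, one or more underscores
# (optionally padded with whitespace), and a closing bracket
pattern = r"\[\s*[_]+\s*\]"

# Made-up subtitle fragment for illustration
sample = "what the [ __ ] was that [Music] oh [ __ ] no [__]"

blocked = re.findall(pattern, sample)
tokens = sample.split()

print(len(blocked))  # 3  ("[Music]" is not counted)
print(len(tokens))   # 14 (each "[ __ ]" contributes three tokens)
```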


### Video: Speaking Speed

*`video_speed` denotes the **speaking speed** of the YouTuber, computed as the **number of words per second** within each video.  We hypothesize that toxicity in comments may be related to the speaking speed in the video.  Videos with a **faster speaking rate** may evoke a **heightened emotional state** in viewers, **intensifying subsequent reactions** and prompting aggressive comments, thereby **elevating the perceived toxicity**.  Since we do not have audio features such as vocal pitch, `video_speed` acts as a **simplified proxy** for assessing this impact.*  

_However, this approach has a **significant limitation**: we did not take into account the **pauses in commentary**, which often occur as players focus on gameplay.  These breaks in speech can **misrepresent** the actual speaking speed; consequently, this variable was not used in our analysis due to concerns over its reliability._  

In [30]:
# Extract words per second for videos
video["video_speed"] = video["video_wordcount"] / video["video_duration"]
video["video_speed"][:5]

0    2.762008
1    1.964548
2    2.240351
3    1.231239
4    1.166510
Name: video_speed, dtype: float64
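
*The effect of pauses can be made concrete with a toy example. The sketch below is not part of our pipeline: it assumes a hypothetical list of timed subtitle segments `(start_seconds, duration_seconds, text)` with no overlapping segments, and compares the rate over the whole video with the rate over spoken time only.*

```python
def words_per_second(segments, video_duration=None):
    """
    Words per second over the whole video (if video_duration is given)
    or over the summed duration of the spoken segments only.
    """
    words = sum(len(text.split()) for _, _, text in segments)
    if video_duration is not None:
        return words / video_duration
    spoken = sum(duration for _, duration, _ in segments)
    return words / spoken


# Hypothetical 60-second clip with two short spoken segments and a long pause
segments = [
    (0.0, 2.0, "hello good morning gamers"),
    (30.0, 2.0, "welcome back to the game"),
]

print(words_per_second(segments, video_duration=60.0))  # 0.15
print(words_per_second(segments))                       # 2.25
```

*In this toy case the whole-video rate understates the speaking rate by a factor of 15, which is the kind of distortion described above.*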

<br>

## 4. Content Labeling: Part I

In [31]:
# Extract the comments
comments = comment["comment_text"]
comments[:5]

0    Damn dude, even with mimic I think it would ta...
1    This is the pewds that I thought he’d turn int...
2    This is actually awesome. Can't believe a meme...
3    Wow, didn't even know Pewds had this analytica...
4    Damn, i cant believe it took me 11 months afte...
Name: comment_text, dtype: object

In [32]:
# Extract the subtitles
subtitles = video["video_subtitle"]
subtitles[:5]

0    i have beaten all souls games without dying a ...
1    ah you ready yes we're ready eldon ring baby l...
2    but the new gta game is iron and it's not what...
3    [Music] hello good morning gamers early today ...
4    [Music] so [Music] yes welcome welcome welcome...
Name: video_subtitle, dtype: object

### Toxicity Annotation

_Acquiring the toxicity labels is crucial for analyzing toxicity in comments.  However, manually annotating nearly 130,000 comments is **impractical** given the large volume and resource limitations.  Thus, to effectively **quantify the level of toxicity** in comments, we will leverage the **[Perspective API](https://perspectiveapi.com/)** to obtain our true labels.  To **retain the original context**, we applied the algorithm directly to the raw texts._   

**Quota Limits and Throttling Management**  

*The Perspective API, however, enforces a **[quota limit](https://developers.perspectiveapi.com/s/about-the-api-limits-and-errors?language=en_US)** of **1 query per second (QPS)** for each project.  Despite the **lack of batch processing** support, we have devised a **throttling management** strategy that incorporates **key rotation** and **exponential backoff** to efficiently manage this constraint.*    

*Our approach involves cycling through **10 different API keys** and their respective **pre-built clients**, enhancing our query capacity within the API's quota restrictions.  Furthermore, an **exponential backoff** mechanism is enforced to manage **retries** following any quota breaches or server errors.  This method will **systematically increase the delay between subsequent requests**, thereby minimizing the likelihood of successive failures and mitigating the impact on the API server. (See Latency & Reliability Section in [Limits and Error](https://developers.perspectiveapi.com/s/about-the-api-limits-and-errors?language=en_US))*  

_Additional features such as **logging** and **exception handling** are integrated to support **monitoring** and **troubleshooting**, facilitating a smooth and efficient data labeling process.  These measures collectively **reduce the projected processing time** from an initial estimate of **3.37 days** to **less than 6 hours**._    
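
*The retry schedule can be sketched in isolation. The snippet below mirrors the delay formula used in `perspective_toxicity` (base delay doubles after each full rotation through the keys, plus up to one second of random jitter); `backoff_delay` is a hypothetical helper name, and the key count of 10 is taken from the description above.*

```python
import random


def backoff_delay(attempt, n_keys, jitter=True):
    """Seconds to wait before the next retry (attempt is zero-based)."""
    # The base delay doubles only after every key has been tried once more
    base = 2 ** (attempt // n_keys)
    return base + (random.uniform(0, 1) if jitter else 0.0)


n_keys = 10                 # number of API keys in rotation
max_attempts = n_keys * 5   # five attempts per key

# Deterministic part of the schedule (jitter switched off for display)
schedule = [backoff_delay(a, n_keys, jitter=False) for a in range(max_attempts)]

print(sorted(set(schedule)))  # [1.0, 2.0, 4.0, 8.0, 16.0]
print(sum(schedule))          # 310.0
```

*Under these assumptions, a text that fails on every attempt costs roughly five minutes of waiting (310 seconds plus jitter) before it is skipped.*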

**Video Toxicity and Byte Limits**   

*When evaluating the toxicity levels within video transcripts, we encountered **[limitations](https://developers.perspectiveapi.com/s/about-the-api-limits-and-errors?language=en_US) with processing long texts exceeding 20,480 bytes**.  Moreover, Perspective API is optimally designed for shorter, **comment-length documents** as it was originally trained on comments (See [Model Cards](https://developers.perspectiveapi.com/s/about-the-api-model-cards?language=en_US)).*    
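
*Because the limit is expressed in bytes rather than characters, text length alone is not a reliable check: emoji and other non-ASCII characters occupy several bytes in UTF-8. A minimal sketch of such a check follows; `utf8_size` and `exceeds_limit` are hypothetical helper names, and the limit value is the one quoted above.*

```python
MAX_BYTES = 20_480  # byte limit quoted above


def utf8_size(text):
    """Size of the text in bytes when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def exceeds_limit(text, limit=MAX_BYTES):
    """Return True if the encoded text is larger than the byte limit."""
    return utf8_size(text) > limit


print(utf8_size("dying dude"))  # 10
print(utf8_size("🤣"))          # 4
```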

*To overcome this issue, we implemented two functions. In **`split_text`**, we split the transcript into **overlapping chunks of 100 words** (with a 20-word overlap), **preserving the context** and accounting for the fact that these chunks are **not independent**. In **`calculate_proportions`**, we computed the **weight** of each chunk and assigned **lower weights to overlapping words**. We then computed the **weighted average** of the chunk scores as the **"perceived video toxicity"**.*    

In [33]:
# Import the libraries
import itertools
import logging
from googleapiclient import discovery
from googleapiclient.errors import HttpError

In [34]:
# Configure logging to file
logging.basicConfig(
    filename="../logs/toxicity.log",
    level=logging.INFO,  # Log info, warning, error, critical
    format="%(asctime)s - %(levelname)s - %(message)s",
    filemode="w"  # Overwrite on each run
)

In [35]:
# The Perspective API keys
PERSPECTIVE_API_KEYS = [
    ""
]

def build_client(api_key):
    """
    Build a client for a given Perspective API key.
    """
    # Create a client object
    # Reference: https://developers.google.com/codelabs/setup-perspective-api#4
    client = discovery.build(
        "commentanalyzer",  # Name
        "v1alpha1",  # Version
        developerKey=api_key,
        discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
        static_discovery=False
    )
    return client

# Pre-build a client for each API key
clients = {key: build_client(key) for key in PERSPECTIVE_API_KEYS}

# Set up the iterator
api_key_iterator = itertools.cycle(PERSPECTIVE_API_KEYS)

In [36]:
def split_text(text, size=40, overlap=10):
    """
    Split the given text into overlapping chunks.
    """
    chunks = []
    words = text.split()
    total_words = len(words)
    start = 0
    
    while start < total_words:
        # Set the end index
        end = start + size
        # Check if this is the last chunk
        if end >= total_words:
            # Add the last chunk and break
            chunks.append(" ".join(words[start:total_words]))
            break
        else:
            chunks.append(" ".join(words[start:end]))
        # Update the start index
        start = end - overlap
    return chunks

In [37]:
def calculate_proportions(chunks, overlap=10):
    """
    Calculate the proportion of each chunk.  
    Assign lower weight to overlapping words.
    """
    proportions = []
    n = len(chunks)

    # Calculate total number of words
    total_words = sum(len(chunk.split()) for chunk in chunks) - (n - 1) * overlap

    # If only one chunk, set proportion to 1
    if n == 1:
        proportions.append(1)
        return proportions
        
    for i, chunk in enumerate(chunks):
        count = len(chunk.split())
        # Number of overlaps for middle chunks
        overlap_count = overlap
        # Number of overlaps for first and last chunk
        if i == 0 or i == n - 1:
            overlap_count = overlap / 2
        
        # Compute the proportion
        proportions.append((count - overlap_count) / total_words)

    return proportions
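
*To see how the two helpers interact, the sketch below restates them compactly on word lists (`chunk_words` and `chunk_weights` are illustrative names, not the functions used in the pipeline) and applies them to a made-up 250-word transcript with the subtitle settings `size=100, overlap=20`. The chunk scores at the end are invented for the example.*

```python
def chunk_words(words, size, overlap):
    """Split a list of words into chunks that share `overlap` words."""
    chunks, start = [], 0
    while start < len(words):
        end = start + size
        chunks.append(words[start:end])
        if end >= len(words):
            break
        start = end - overlap
    return chunks


def chunk_weights(chunks, overlap):
    """Weight each chunk by its word count, splitting each overlap in half."""
    n = len(chunks)
    if n == 1:
        return [1.0]
    total = sum(len(c) for c in chunks) - (n - 1) * overlap
    weights = []
    for i, c in enumerate(chunks):
        # First and last chunks overlap on one side only
        shared = overlap / 2 if i in (0, n - 1) else overlap
        weights.append((len(c) - shared) / total)
    return weights


words = [f"w{i}" for i in range(250)]  # made-up transcript
chunks = chunk_words(words, size=100, overlap=20)
weights = chunk_weights(chunks, overlap=20)

print([len(c) for c in chunks])  # [100, 100, 90]
print(weights)                   # [0.36, 0.32, 0.32]

# Weighted average of invented per-chunk toxicity scores
scores = [0.2, 0.5, 0.1]
print(round(sum(w * s for w, s in zip(weights, scores)), 3))  # 0.264
```

*The weights sum to 1 because each shared stretch of words is counted once in total, half in each of the two chunks that contain it.*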

In [38]:
def perspective_toxicity(texts, prefix, size=40, overlap=10, split=False):
    """
    Compute Perspective toxicity scores for a given list of texts.
    Support throttling management w/ client reuse, key rotation, and 
    exponential backoff.
    """
    # Start timing
    start_time = time.time()
    
    # Empty list to store toxicity scores
    scores = []

    # Loop through the texts
    for index, text in enumerate(texts):  
        # Default to single chunk 
        chunks = [text]
        proportions = [1]
        
        # If text splitting is enabled
        if split:
            chunks = split_text(text, size, overlap)
            proportions = calculate_proportions(chunks, overlap)
            logging.info(f"Processing subtitle #{index}: {len(text.split())} words; {len(chunks)} chunk(s).")

        # Empty list to store chunk scores
        temp_scores = []
        
        for chunk_index, (chunk, proportion) in enumerate(zip(chunks, proportions)):
            # Specify the text and attributes
            analyze_request = {
                "comment": {"text": chunk},
                "languages": ["en"],
                "requestedAttributes": {
                    "TOXICITY": {},
                    "SEVERE_TOXICITY": {},
                    "IDENTITY_ATTACK": {},
                    "INSULT": {},
                    "PROFANITY": {},
                    "THREAT": {}}
            }
        
            # Reset attempt count for each chunk
            attempt = 0
            # Attempts allowed
            max_attempts = len(PERSPECTIVE_API_KEYS) * 5  # Number of keys * Attempts per key

            # While retry attempts are not exhausted
            while attempt < max_attempts:
                # Rotate to the next API key
                current_key = next(api_key_iterator)
                client = clients[current_key]

                try:
                    res = client.comments().analyze(body=analyze_request).execute()
                    s = res["attributeScores"]
                    temp_scores.append({
                        "toxicity": s["TOXICITY"]["summaryScore"]["value"] * proportion,
                        "severe_toxicity": s["SEVERE_TOXICITY"]["summaryScore"]["value"] * proportion,
                        "identity_attack": s["IDENTITY_ATTACK"]["summaryScore"]["value"] * proportion,
                        "insult": s["INSULT"]["summaryScore"]["value"] * proportion,
                        "profanity": s["PROFANITY"]["summaryScore"]["value"] * proportion,
                        "threat": s["THREAT"]["summaryScore"]["value"] * proportion,
                        "language": res["detectedLanguages"]
                    })
                    if split:
                        logging.info(f"Success for subtitle #{index} part {chunk_index + 1}/{len(chunks)} with key {current_key} on attempt {attempt + 1}")
                    else:
                        logging.info(f"Success for comment #{index} with key {current_key} on attempt {attempt + 1}")
                    # Break the loop if successful
                    break

                # Http errors
                except HttpError as e:
                    # Rate limit errors
                    if e.resp.status == 429:
                        logging.warning(f"HTTP 429 Rate limit exceeded for text #{index} with key '{current_key}' on attempt {attempt + 1}. Retrying with exponential backoff.")
                    else:
                        logging.warning(f"HTTP error for text #{index} with key '{current_key}' on attempt {attempt + 1}: {e}. Retrying with exponential backoff.")
                # Timeout errors
                except TimeoutError:
                    logging.warning(f"TimeoutError for text #{index} with key '{current_key}' on attempt {attempt + 1}. Retrying with exponential backoff.")
                # Unexpected errors
                except Exception as e:
                    logging.warning(f"Unexpected error for text #{index} with key '{current_key}' on attempt {attempt + 1}: {e}. Retrying with exponential backoff.")

                # Exponential backoff + random jitter
                sleep_time = (2 ** (attempt // len(PERSPECTIVE_API_KEYS))) + random.uniform(0, 1)
                time.sleep(sleep_time)
                attempt += 1

                # Check if all retry attempts are exhausted
                if attempt == max_attempts:
                    logging.error(f"Max attempts reached for text #{index} with key {current_key}. Moving to the next text.")
    
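        # If every attempt failed, record an empty row (NaN after conversion)
        # so that the output stays aligned with the input texts
        if not temp_scores:
            scores.append({})
            continue

        # Compute the weighted average for each score
        aggregated_score = {f"{prefix}{k}": sum(temp[k] for temp in temp_scores)
                            for k in temp_scores[0].keys()
                            if k != "language"}
        # Use the detected languages from the first chunk
        aggregated_score[f"{prefix}language"] = temp_scores[0]["language"]
        scores.append(aggregated_score)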
    
    # Convert to DataFrame
    toxicity_scores = pd.DataFrame(scores)
    
    # End timing
    logging.info(f"Total Runtime: {time.time() - start_time:.4f} seconds\n")
    
    return toxicity_scores

#### Comments

In [39]:
# Start timing
start_time = time.time()

# Compute Perspective API toxicity scores for each comment
comment_toxicity_scores = perspective_toxicity(
    texts=comments, 
    prefix="comment_", 
    split=False
)

# End timing
print(f"Runtime: {time.time() - start_time:.4f}\n")
comment_toxicity_scores.head()

Runtime: 13657.1657



Unnamed: 0,comment_toxicity,comment_severe_toxicity,comment_identity_attack,comment_insult,comment_profanity,comment_threat,comment_language
0,0.543256,0.03411,0.028629,0.338892,0.627178,0.049451,[en]
1,0.077668,0.00412,0.011765,0.019741,0.039407,0.010991,[en]
2,0.146031,0.006599,0.009027,0.021933,0.064458,0.055594,[en]
3,0.081625,0.004139,0.010137,0.017228,0.041161,0.021127,[en]
4,0.45703,0.024115,0.028017,0.154143,0.508187,0.011379,[en]


In [40]:
# Combine into one DataFrame
for column in comment_toxicity_scores.columns:
    comment[column] = comment_toxicity_scores[column].values
comment.head(3)

Unnamed: 0,video_id,comment_id,comment_author_id,comment_text,comment_time,comment_likecount,comment_replycount,comment_toxicity,comment_severe_toxicity,comment_identity_attack,comment_insult,comment_profanity,comment_threat,comment_language
0,F-yEoHL7MYY,UgwN1kGXwi9M7jeOb0d4AaABAg,UCLHsZ4X7YemjxRrvq0AI4LA,"Damn dude, even with mimic I think it would ta...",2022-05-02 19:37:22+00:00,9907.0,47.0,0.543256,0.03411,0.028629,0.338892,0.627178,0.049451,[en]
1,F-yEoHL7MYY,UgwW5nWEkxdES-g3hk54AaABAg,UCJ9VDCLZDmeJIU3Branlstg,This is the pewds that I thought he’d turn int...,2022-12-14 23:36:11+00:00,6299.0,9.0,0.077668,0.00412,0.011765,0.019741,0.039407,0.010991,[en]
2,F-yEoHL7MYY,UgyGe0HN8toQWUZZtCl4AaABAg,UCs-mo1206PASdacjDqfdLng,This is actually awesome. Can't believe a meme...,2022-12-31 18:16:36+00:00,5091.0,54.0,0.146031,0.006599,0.009027,0.021933,0.064458,0.055594,[en]


In [41]:
# Check the dimensions
comment.shape

(128461, 14)

#### Subtitles

In [42]:
# Start timing
start_time = time.time()

# Compute Perspective API toxicity scores for each subtitle
video_toxicity_scores = perspective_toxicity(
    texts=subtitles, 
    prefix="video_",
    size=100, 
    overlap=20, 
    split=True
)

# End timing
print(f"Runtime: {time.time() - start_time:.4f}\n")
video_toxicity_scores.head()

Runtime: 7776.3443



Unnamed: 0,video_toxicity,video_severe_toxicity,video_identity_attack,video_insult,video_profanity,video_threat,video_language
0,0.358079,0.071938,0.056368,0.148396,0.244498,0.224883,[en]
1,0.500542,0.155937,0.098798,0.280781,0.373053,0.261945,[en]
2,0.4074,0.082654,0.078979,0.231179,0.271065,0.105283,[en]
3,0.400487,0.093281,0.070456,0.196307,0.288111,0.19028,[en]
4,0.432568,0.120564,0.077785,0.209814,0.314366,0.241215,[en]


In [43]:
# Combine into one DataFrame
for column in video_toxicity_scores.columns:
    video[column] = video_toxicity_scores[column].values
video.head(3)

Unnamed: 0,channel_id,video_id,video_title,video_creation_time,video_description,video_duration,video_tags,video_viewcount,video_likecount,video_commentcount,...,video_blocked_wordcount,video_blocked_proportion,video_speed,video_toxicity,video_severe_toxicity,video_identity_attack,video_insult,video_profanity,video_threat,video_language
0,UC-lHJZR3Gqxm24_Vd_AJ5Yw,F-yEoHL7MYY,I t̶r̶i̶e̶d̶ ̶t̶o̶ beat Elden Ring Without Dyi...,2022-04-30 16:40:18+00:00,🌏 Get exclusive NordVPN deal here ➵ https://N...,3227.0,"['pewdiepie', 'pewds', 'pewdie']",11590164,474052,15146.0,...,32,0.00359,2.762008,0.358079,0.071938,0.056368,0.148396,0.244498,0.224883,[en]
1,UC-lHJZR3Gqxm24_Vd_AJ5Yw,PV4NGwn_xdI,Elden Ring #1 - 0 DEATH PLAYTHROUGH,2022-02-24 15:00:10+00:00,NEW animated wallpapers on GX 👇👇👇\nhttps://ope...,818.0,"['pewdiepie', 'pewds', 'pewdie', 'elden ring',...",5179366,192101,4313.0,...,19,0.011823,1.964548,0.500542,0.155937,0.098798,0.280781,0.373053,0.261945,[en]
2,UC-lHJZR3Gqxm24_Vd_AJ5Yw,CF3jK8ai0l4,New GTA Definitive Edition is hilariously bad,2021-11-19 17:15:01+00:00,Grand Theft Auto: The Trilogy is not epic bros...,570.0,"['pewdiepie', 'pewds', 'pewdie', 'Grand Theft ...",4053858,282853,9073.0,...,3,0.002349,2.240351,0.4074,0.082654,0.078979,0.231179,0.271065,0.105283,[en]


In [44]:
# Check the dimensions
video.shape

(1307, 26)

<br>  

## 5. Text Preprocessing

### Filter English Comments

_To **align** our analysis with the English-speaking YouTube gaming community, we use the **`detectedLanguage` attribute from Perspective API** to **filter out non-English videos and comments**._  

In [45]:
# Count non-English videos
len(video[video["video_language"].apply(lambda x: "en" not in x)])

0

In [46]:
# Remove non-English videos
video = video[video["video_language"].apply(lambda x: "en" in x)]
video.shape

(1307, 26)

In [47]:
# Count non-English comments
len(comment[comment["comment_language"].apply(lambda x: "en" not in x)])

3528

In [48]:
# Inspect non-English comments
comment[comment["comment_language"].apply(lambda x: "en" not in x)].head(3)

Unnamed: 0,video_id,comment_id,comment_author_id,comment_text,comment_time,comment_likecount,comment_replycount,comment_toxicity,comment_severe_toxicity,comment_identity_attack,comment_insult,comment_profanity,comment_threat,comment_language
903,Dy3ege2nD6M,Ugw_KYxpcQBlXGdN2Oh4AaABAg,UCEU6y10PTmsVpLPmutO4L6g,Seruuu berasa ikut naik motor... Yeeyyy,2020-12-14 01:35:53+00:00,0.0,0.0,0.010744,0.000854,0.001119,0.008292,0.012859,0.006563,"[id, ms]"
1002,iROAUHfCrnU,UgzlNvwjnXvpxbcs8zt4AaABAg,UCEU6y10PTmsVpLPmutO4L6g,Wih keren juga nih game,2020-12-20 10:55:30+00:00,0.0,0.0,0.004901,0.000529,0.000906,0.006116,0.012825,0.005777,[id]
1074,iROAUHfCrnU,UgwNZwmVT5yCLM8As-V4AaABAg,UC1R2DBFNbKrX5aQ-yRDW2fA,Me sorprende que de 108 M de suscriptores solo...,2020-12-21 02:22:13+00:00,5.0,3.0,0.004335,0.000269,0.000666,0.006249,0.010126,0.00533,[es]


In [49]:
# Remove non-English comments
comment = comment[comment["comment_language"].apply(lambda x: "en" in x)]
comment.shape

(124933, 14)
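
_The filter keeps a row whenever `"en"` appears among its detected languages, so mixed-language rows are retained.  The sketch below illustrates that behavior on hypothetical values, assuming each `comment_language` entry is a Python list of language codes._

```python
# Hypothetical language lists, shaped like the `comment_language` values above
samples = [["en"], ["id", "ms"], ["es"], ["en", "es"]]


def is_english(languages):
    """Return True when "en" is among the detected language codes."""
    return "en" in languages


# Keep only the entries that include English
kept = [langs for langs in samples if is_english(langs)]
print(kept)
```

_A stricter variant could require `languages == ["en"]`, at the cost of dropping comments that mix English with another language._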

### Text Cleaning

_To preserve the **most relevant information**, we apply a series of text preprocessing steps to refine our corpus for analysis.  We segment text cleaning into **three levels** - **minimal, moderate, and tokenization** - to **tailor preprocessing** to the needs of sentiment analysis and potential future modeling._   

#### Level 1: Contraction Expansion + Noise Removal

_The first level involves **contraction expansion** and **noise elimination**.  **Contractions** are expanded to their full forms using the [`contractions`](https://pypi.org/project/contractions/) library so that analytical tools interpret the text **consistently**.  **URLs, mentions, hashtags, and new-line characters** are then removed to reduce noise in the data._  

In [50]:
def remove_noise(text):
    """
    Perform minimal text preprocessing by removing noise.
    """
    # Expand contractions
    text = contractions.fix(text)
    
    # Remove URLs
    text = re.sub(r"http\S+", "", text)
    # Remove mentions
    text = re.sub(r"(?<![@\w])@(\w{1,25})", "", text)
    # Remove hashtags
    text = re.sub(r"(?<![#\w])#(\w{1,25})", "", text)
    # Remove new line characters
    text = re.sub("\n", " ", text)
    
    return text

In [51]:
# Remove noise for comments
print(comments[41])
comments_filtered = comments.apply(remove_noise)
print(comments_filtered[41])

I haven't watched a video from Felix for a long time but this one was great! 
Reminded me of the good times I had when I struggled with uni and his videos gave me strength to still push on. 
In a sense, I partly owe Felix my degree and job opportunities:)
I have not watched a video from Felix for a long time but this one was great!  Reminded me of the good times I had when I struggled with uni and his videos gave me strength to still push on.  In a sense, I partly owe Felix my degree and job opportunities:)
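
_The regular expressions in `remove_noise` can be tried in isolation.  The sketch below applies the same four substitutions with the standard library only (contraction expansion is left out), on a hypothetical sample string; `strip_noise` is an illustrative name, not part of the notebook._

```python
import re


def strip_noise(text):
    """Apply the four regex substitutions from `remove_noise`."""
    # Remove URLs
    text = re.sub(r"http\S+", "", text)
    # Remove mentions that are not preceded by "@" or a word character
    text = re.sub(r"(?<![@\w])@(\w{1,25})", "", text)
    # Remove hashtags that are not preceded by "#" or a word character
    text = re.sub(r"(?<![#\w])#(\w{1,25})", "", text)
    # Replace new-line characters with spaces
    text = re.sub("\n", " ", text)
    return text


# Hypothetical sample containing a mention, a URL, a hashtag, and an email
sample = "Thanks @pewds!\nWatch https://example.com #eldenring or mail me@site.com"
result = strip_noise(sample)
print(repr(result))
```

_The negative lookbehind keeps email-like tokens such as `me@site.com` intact, because the `@` there follows a word character.  The removals leave extra spaces behind, which Level 2 collapses._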


In [52]:
# Remove noise for subtitles
subtitles_filtered = subtitles.apply(remove_noise)

#### Level 2: Normalization + Punctuation & Stopword Removal

_Building on the first level, the second level of preprocessing adds **text normalization** along with **punctuation and stopword removal**.  To ensure uniformity across our dataset, all text is converted to **lowercase**.  **Non-alphabetic characters** and common English **stopwords** are also removed, as they carry little information for our analysis.  Note the **potential caveat** of this procedure: eliminating these elements can discard nuances in the text, such as negation._  

In [53]:
def clean(text):
    """
    Perform text preprocessing on a given document.
    """
    # Convert to lowercase
    text = text.lower()
    # Contraction expansion
    text = contractions.fix(text)
    
    # Remove URLs
    text = re.sub(r"http\S+", "", text)
    # Remove mentions
    text = re.sub(r"(?<![@\w])@(\w{1,25})", "", text)
    # Remove hashtags
    text = re.sub(r"(?<![#\w])#(\w{1,25})", "", text)
    # Remove new line characters
    text = re.sub("\n", " ", text)
    
    # Remove non-alphabetic characters
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    
    # Remove extra spaces
    text = re.sub(r"\s+", " ", text)
    
    # Tokenization
    tokens = word_tokenize(text)

    # Remove stop words
    stop_words = set(stopwords.words("english"))
    tokens = [word for word in tokens if word not in stop_words]

    # Re-join the remaining tokens (inflected forms such as "lifetimes" are kept)
    cleaned_text = " ".join(tokens)
    
    return cleaned_text

In [54]:
# Clean the comments
print(comments[0])
comments_cleaned = comments.apply(clean)
print(comments_cleaned[0])

Damn dude, even with mimic I think it would take me approximately 3 lifetimes to complete a no death run... you really are a gamer god.
damn dude even mimic think would take approximately lifetimes complete death run really gamer god
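
_The output above shows the caveat in practice: "a no death run" becomes "death run" because "no" is a stopword.  The sketch below reproduces the effect with the standard library and a small, hypothetical stopword set standing in for NLTK's English list (which also contains "no" and "not")._

```python
import re

# Hypothetical miniature stopword set; NLTK's English list is much longer
STOP_WORDS = {"a", "i", "it", "me", "to", "with", "no", "not", "the", "is", "this"}


def drop_stopwords(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


# Both phrases collapse to the same tokens once the negation is removed
print(drop_stopwords("A no death run"))
print(drop_stopwords("This is not a death run"))
```

_If negation matters for a downstream model, one option is to exclude negation words from the stopword set before filtering._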


In [55]:
# Clean the subtitles
subtitles_cleaned = subtitles.apply(clean)

#### Level 3: Tokenization

*Using NLTK's `word_tokenize`, we **tokenize** the cleaned text into individual words.  This step supports **analyzing term frequency** and **identifying common themes** within the corpus as the analysis progresses.*  

In [56]:
# Tokenize the comments
comments_tokenized = comments_cleaned.apply(word_tokenize)
comments_tokenized[1]

['pewds',
 'thought',
 'would',
 'turn',
 'gaming',
 'early',
 'channel',
 'right',
 'gives',
 'joy']
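
_As a minimal sketch of the term-frequency use case, token lists can be counted with `collections.Counter`.  The token lists below are hypothetical and only shaped like `comments_tokenized`._

```python
from collections import Counter

# Hypothetical token lists, shaped like `comments_tokenized`
tokenized = [
    ["pewds", "thought", "would", "turn", "gaming"],
    ["damn", "dude", "gamer", "god"],
    ["gaming", "channel", "gives", "joy"],
]

# Flatten the lists and count how often each term occurs
term_counts = Counter(token for tokens in tokenized for token in tokens)
print(term_counts.most_common(1))
```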

In [57]:
# Combine into one DataFrame
comment["comment_filtered"] = comments_filtered
comment["comment_cleaned"] = comments_cleaned
comment["comment_tokenized"] = comments_tokenized
comment.head(3)

Unnamed: 0,video_id,comment_id,comment_author_id,comment_text,comment_time,comment_likecount,comment_replycount,comment_toxicity,comment_severe_toxicity,comment_identity_attack,comment_insult,comment_profanity,comment_threat,comment_language,comment_filtered,comment_cleaned,comment_tokenized
0,F-yEoHL7MYY,UgwN1kGXwi9M7jeOb0d4AaABAg,UCLHsZ4X7YemjxRrvq0AI4LA,"Damn dude, even with mimic I think it would ta...",2022-05-02 19:37:22+00:00,9907.0,47.0,0.543256,0.03411,0.028629,0.338892,0.627178,0.049451,[en],"Damn dude, even with mimic I think it would ta...",damn dude even mimic think would take approxim...,"[damn, dude, even, mimic, think, would, take, ..."
1,F-yEoHL7MYY,UgwW5nWEkxdES-g3hk54AaABAg,UCJ9VDCLZDmeJIU3Branlstg,This is the pewds that I thought he’d turn int...,2022-12-14 23:36:11+00:00,6299.0,9.0,0.077668,0.00412,0.011765,0.019741,0.039407,0.010991,[en],This is the pewds that I thought he would turn...,pewds thought would turn gaming early channel ...,"[pewds, thought, would, turn, gaming, early, c..."
2,F-yEoHL7MYY,UgyGe0HN8toQWUZZtCl4AaABAg,UCs-mo1206PASdacjDqfdLng,This is actually awesome. Can't believe a meme...,2022-12-31 18:16:36+00:00,5091.0,54.0,0.146031,0.006599,0.009027,0.021933,0.064458,0.055594,[en],This is actually awesome. Cannot believe a mem...,actually awesome cannot believe meme became tr...,"[actually, awesome, can, not, believe, meme, b..."


In [58]:
# Drop rows with missing values or empty cleaned text
comment.dropna(inplace=True)
comment = comment[comment["comment_cleaned"].str.len() > 0]
comment.shape

(124704, 17)

In [59]:
# Tokenize the subtitles
subtitles_tokenized = subtitles_cleaned.apply(word_tokenize)

In [60]:
# Combine into one DataFrame
video["video_subtitle_filtered"] = subtitles_filtered
video["video_subtitle_cleaned"] = subtitles_cleaned
video["video_subtitle_tokenized"] = subtitles_tokenized
video.head(3)

Unnamed: 0,channel_id,video_id,video_title,video_creation_time,video_description,video_duration,video_tags,video_viewcount,video_likecount,video_commentcount,...,video_toxicity,video_severe_toxicity,video_identity_attack,video_insult,video_profanity,video_threat,video_language,video_subtitle_filtered,video_subtitle_cleaned,video_subtitle_tokenized
0,UC-lHJZR3Gqxm24_Vd_AJ5Yw,F-yEoHL7MYY,I t̶r̶i̶e̶d̶ ̶t̶o̶ beat Elden Ring Without Dyi...,2022-04-30 16:40:18+00:00,🌏 Get exclusive NordVPN deal here ➵ https://N...,3227.0,"['pewdiepie', 'pewds', 'pewdie']",11590164,474052,15146.0,...,0.358079,0.071938,0.056368,0.148396,0.244498,0.224883,[en],i have beaten all souls games without dying a ...,beaten souls games without dying single time m...,"[beaten, souls, games, without, dying, single,..."
1,UC-lHJZR3Gqxm24_Vd_AJ5Yw,PV4NGwn_xdI,Elden Ring #1 - 0 DEATH PLAYTHROUGH,2022-02-24 15:00:10+00:00,NEW animated wallpapers on GX 👇👇👇\nhttps://ope...,818.0,"['pewdiepie', 'pewds', 'pewdie', 'elden ring',...",5179366,192101,4313.0,...,0.500542,0.155937,0.098798,0.280781,0.373053,0.261945,[en],ah you ready yes we are ready eldon ring baby ...,ah ready yes ready eldon ring baby likely unli...,"[ah, ready, yes, ready, eldon, ring, baby, lik..."
2,UC-lHJZR3Gqxm24_Vd_AJ5Yw,CF3jK8ai0l4,New GTA Definitive Edition is hilariously bad,2021-11-19 17:15:01+00:00,Grand Theft Auto: The Trilogy is not epic bros...,570.0,"['pewdiepie', 'pewds', 'pewdie', 'Grand Theft ...",4053858,282853,9073.0,...,0.4074,0.082654,0.078979,0.231179,0.271065,0.105283,[en],but the new gta game is iron and it is not wha...,new gta game iron whatever one wanted new gta ...,"[new, gta, game, iron, whatever, one, wanted, ..."


In [61]:
# Drop rows with missing values and check the dimensions
video.dropna(inplace=True)
video.shape

(1307, 29)

<br>

## 6. Content Labeling: Part II

In [62]:
# Extract the comments
comments_filtered = comment["comment_filtered"]
comments_cleaned = comment["comment_cleaned"]

In [63]:
# Extract the subtitles
subtitles_filtered = video["video_subtitle_filtered"]
subtitles_cleaned = video["video_subtitle_cleaned"]

### Sentiment Evaluation

_To further investigate the **emotional dynamics** of the comments, we compute **sentiment scores** using **[VADER](https://doi.org/10.1609/icwsm.v8i1.14550)**, **[TextBlob](https://textblob.readthedocs.io/en/dev/)**, and **[Empath](https://doi.org/10.48550/arXiv.1602.06979)**.  For the Empath analysis, we collect all **194 categories** it offers, aiming for a more nuanced understanding of the **prevalent themes** within YouTube gaming comments - e.g., is toxicity more prevalent when certain topics are present in the video?_    

*In the following steps, we apply **VADER** to minimally cleaned text to leverage its design, which is **tailored to [social media content](https://github.com/cjhutto/vaderSentiment/blob/master/vaderSentiment/vaderSentiment.py#L517)** and can **interpret punctuation, all caps, slang, and emoji**.  In contrast, the **TextBlob** and **Empath** analyses are performed on more thoroughly cleaned text.  This decision stems from the absence of clear evidence, in either the [TextBlob documentation](https://textblob.readthedocs.io/en/dev/api_reference.html#textblob.blob.TextBlob) or the [Empath paper](https://arxiv.org/pdf/1602.06979.pdf), that these tools handle the nuances of social media text.*  

In [64]:
# Import the libraries
from nltk.corpus import opinion_lexicon
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob
from empath import Empath

In [65]:
def vader_sentiment(text, prefix):
    """
    Compute VADER sentiment scores for a given text.
    """
    # Initialize the analyzer
    analyzer = SentimentIntensityAnalyzer()
    
    # Compute the scores
    scores = analyzer.polarity_scores(text)
    vader_scores = {f"{prefix}{k}": v for k, v in scores.items()}
    
    return vader_scores

In [66]:
def textblob_sentiment(text, prefix):
    """
    Compute TextBlob sentiment scores for a given text.
    """
    # Initialize the analyzer
    blob = TextBlob(text)
    
    # Compute the scores
    textblob_scores = {
        f"{prefix}polarity": blob.sentiment.polarity, 
        f"{prefix}subjectivity": blob.sentiment.subjectivity
    }
    
    return textblob_scores

In [67]:
def empath_sentiment(text, prefix):
    """
    Compute normalized Empath category scores for a given text.
    """
    # Initialize the analyzer
    lexicon = Empath()
    
    # Compute the scores
    categories = lexicon.analyze(text, normalize=True)
    empath_scores = {f"{prefix}{k}": v for k, v in categories.items()}
    
    return empath_scores

#### Comments

In [68]:
# Start timing
start_time = time.time()

# Compute VADER sentiment scores for each comment
comment_vader_scores = \
    comments_filtered.apply(lambda x: vader_sentiment(x, "comment_")).apply(pd.Series)

# End timing
print(f"Runtime: {time.time() - start_time:.4f}\n")
comment_vader_scores.head()

Runtime: 763.6631



Unnamed: 0,comment_neg,comment_neu,comment_pos,comment_compound
0,0.087,0.743,0.17,0.3707
1,0.0,0.872,0.128,0.5859
2,0.086,0.613,0.301,0.7906
3,0.0,0.704,0.296,0.9358
4,0.029,0.813,0.158,0.8761


In [69]:
# Compute TextBlob sentiment scores for each comment
comment_textblob_scores = \
    comments_cleaned.apply(lambda x: textblob_sentiment(x, "comment_")).apply(pd.Series)
comment_textblob_scores.head()

Unnamed: 0,comment_polarity,comment_subjectivity
0,-0.033333,0.4
1,0.395238,0.345238
2,0.416667,0.583333
3,0.48,0.56
4,-0.126667,0.673333


In [70]:
# Start timing
start_time = time.time()

# Compute Empath sentiment scores for each comment
comment_empath_scores = \
    comments_cleaned.apply(lambda x: empath_sentiment(x, "comment_")).apply(pd.Series)

# End timing
print(f"Runtime: {time.time() - start_time:.4f}\n")
comment_empath_scores.head()

Runtime: 1670.6397



Unnamed: 0,comment_help,comment_office,comment_dance,comment_money,comment_wedding,comment_domestic_work,comment_sleep,comment_medical_emergency,comment_cold,comment_hate,...,comment_weapon,comment_children,comment_monster,comment_ocean,comment_giving,comment_contentment,comment_writing,comment_rural,comment_positive_emotion,comment_musical
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.1,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.047619,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.047619
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.038462,0.0
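
_The fractional values above are consistent with `normalize=True` dividing each category count by the number of tokens: the 10-token comment in row 1 scores 0.1 where a single word matches.  The sketch below illustrates that reading with an invented two-category lexicon; it is an assumption about the normalization, not Empath's actual implementation or word lists._

```python
# Hypothetical two-category lexicon; Empath's real categories are far larger
LEXICON = {
    "positive_emotion": {"joy", "love", "happy"},
    "contentment": {"joy", "satisfied"},
}


def category_scores(text, prefix, normalize=True):
    """Count lexicon hits per category, optionally divided by the token count."""
    tokens = text.split()
    scores = {}
    for category, words in LEXICON.items():
        hits = sum(token in words for token in tokens)
        if normalize and tokens:
            scores[f"{prefix}{category}"] = hits / len(tokens)
        else:
            scores[f"{prefix}{category}"] = float(hits)
    return scores


# Ten tokens, one of which ("joy") matches both hypothetical categories
text = "pewds thought would turn gaming early channel right gives joy"
scores = category_scores(text, "comment_")
print(scores)
```

_Under this reading, short comments produce coarse scores (multiples of 1 / token count), which is worth keeping in mind when comparing comments of different lengths._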


In [71]:
# Combine into one DataFrame
comment = pd.concat([comment, comment_vader_scores, 
                     comment_textblob_scores, 
                     comment_empath_scores], axis=1)
comment.head(3)

Unnamed: 0,video_id,comment_id,comment_author_id,comment_text,comment_time,comment_likecount,comment_replycount,comment_toxicity,comment_severe_toxicity,comment_identity_attack,...,comment_weapon,comment_children,comment_monster,comment_ocean,comment_giving,comment_contentment,comment_writing,comment_rural,comment_positive_emotion,comment_musical
0,F-yEoHL7MYY,UgwN1kGXwi9M7jeOb0d4AaABAg,UCLHsZ4X7YemjxRrvq0AI4LA,"Damn dude, even with mimic I think it would ta...",2022-05-02 19:37:22+00:00,9907.0,47.0,0.543256,0.03411,0.028629,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,F-yEoHL7MYY,UgwW5nWEkxdES-g3hk54AaABAg,UCJ9VDCLZDmeJIU3Branlstg,This is the pewds that I thought he’d turn int...,2022-12-14 23:36:11+00:00,6299.0,9.0,0.077668,0.00412,0.011765,...,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.1,0.0
2,F-yEoHL7MYY,UgyGe0HN8toQWUZZtCl4AaABAg,UCs-mo1206PASdacjDqfdLng,This is actually awesome. Can't believe a meme...,2022-12-31 18:16:36+00:00,5091.0,54.0,0.146031,0.006599,0.009027,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [72]:
# Check the dimensions
comment.shape

(124704, 217)

In [73]:
# Write to CSV
comment.to_csv("../data/comment-labeled.csv", index=False, lineterminator="\r\n")

#### Subtitles

In [74]:
# Start timing
start_time = time.time()

# Compute VADER sentiment scores for each subtitle
video_vader_scores = \
    subtitles_filtered.apply(lambda x: vader_sentiment(x, "video_")).apply(pd.Series)

# End timing
print(f"Runtime: {time.time() - start_time:.4f}\n")
video_vader_scores.head()

Runtime: 3121.7435



Unnamed: 0,video_neg,video_neu,video_pos,video_compound
0,0.129,0.733,0.137,0.9915
1,0.136,0.684,0.18,0.9976
2,0.101,0.673,0.226,0.9997
3,0.107,0.717,0.176,1.0
4,0.104,0.72,0.176,1.0


In [75]:
# Compute TextBlob sentiment scores for each subtitle
video_textblob_scores = \
    subtitles_cleaned.apply(lambda x: textblob_sentiment(x, "video_")).apply(pd.Series)
video_textblob_scores.head()

Unnamed: 0,video_polarity,video_subjectivity
0,0.082446,0.497534
1,0.150232,0.576019
2,0.093443,0.562179
3,0.148292,0.519329
4,0.1375,0.50903


In [76]:
# Start timing
start_time = time.time()

# Compute Empath sentiment scores for each subtitle
video_empath_scores = \
    subtitles_cleaned.apply(lambda x: empath_sentiment(x, "video_")).apply(pd.Series)

# End timing
print(f"Runtime: {time.time() - start_time:.4f}\n")
video_empath_scores.head()

Runtime: 17.3947



Unnamed: 0,video_help,video_office,video_dance,video_money,video_wedding,video_domestic_work,video_sleep,video_medical_emergency,video_cold,video_hate,...,video_weapon,video_children,video_monster,video_ocean,video_giving,video_contentment,video_writing,video_rural,video_positive_emotion,video_musical
0,0.003275,0.003527,0.004787,0.004031,0.004031,0.00126,0.003023,0.010078,0.005543,0.011086,...,0.019148,0.013102,0.002771,0.000756,0.00907,0.001512,0.00126,0.001764,0.017889,0.005291
1,0.002491,0.002491,0.004981,0.003736,0.004981,0.0,0.003736,0.004981,0.009963,0.003736,...,0.012453,0.022416,0.002491,0.012453,0.008717,0.0,0.001245,0.002491,0.022416,0.007472
2,0.0,0.001508,0.004525,0.007541,0.0,0.001508,0.0,0.001508,0.001508,0.019608,...,0.003017,0.00905,0.004525,0.003017,0.004525,0.001508,0.001508,0.0,0.010558,0.013575
3,0.002962,0.002962,0.007775,0.005739,0.004628,0.003517,0.002592,0.003147,0.008886,0.005924,...,0.01555,0.019067,0.001851,0.003517,0.01629,0.001851,0.000555,0.003702,0.01518,0.010552
4,0.002365,0.00215,0.007739,0.010963,0.003439,0.003224,0.00258,0.00258,0.006234,0.008169,...,0.026225,0.013328,0.003869,0.017627,0.018487,0.00129,0.001075,0.00215,0.011608,0.007954


In [77]:
# Combine into one DataFrame
video = pd.concat([video, video_vader_scores, 
                   video_textblob_scores, 
                   video_empath_scores], axis=1)
video.head(3)

Unnamed: 0,channel_id,video_id,video_title,video_creation_time,video_description,video_duration,video_tags,video_viewcount,video_likecount,video_commentcount,...,video_weapon,video_children,video_monster,video_ocean,video_giving,video_contentment,video_writing,video_rural,video_positive_emotion,video_musical
0,UC-lHJZR3Gqxm24_Vd_AJ5Yw,F-yEoHL7MYY,I t̶r̶i̶e̶d̶ ̶t̶o̶ beat Elden Ring Without Dyi...,2022-04-30 16:40:18+00:00,🌏 Get exclusive NordVPN deal here ➵ https://N...,3227.0,"['pewdiepie', 'pewds', 'pewdie']",11590164,474052,15146.0,...,0.019148,0.013102,0.002771,0.000756,0.00907,0.001512,0.00126,0.001764,0.017889,0.005291
1,UC-lHJZR3Gqxm24_Vd_AJ5Yw,PV4NGwn_xdI,Elden Ring #1 - 0 DEATH PLAYTHROUGH,2022-02-24 15:00:10+00:00,NEW animated wallpapers on GX 👇👇👇\nhttps://ope...,818.0,"['pewdiepie', 'pewds', 'pewdie', 'elden ring',...",5179366,192101,4313.0,...,0.012453,0.022416,0.002491,0.012453,0.008717,0.0,0.001245,0.002491,0.022416,0.007472
2,UC-lHJZR3Gqxm24_Vd_AJ5Yw,CF3jK8ai0l4,New GTA Definitive Edition is hilariously bad,2021-11-19 17:15:01+00:00,Grand Theft Auto: The Trilogy is not epic bros...,570.0,"['pewdiepie', 'pewds', 'pewdie', 'Grand Theft ...",4053858,282853,9073.0,...,0.003017,0.00905,0.004525,0.003017,0.004525,0.001508,0.001508,0.0,0.010558,0.013575


In [78]:
# Check the dimensions
video.shape

(1307, 229)

In [79]:
# Write to CSV
video.to_csv("../data/video-labeled.csv", index=False)

#### All YouTube Data

In [80]:
# Merge channel, video, and comments information
channel_videos = pd.merge(
    channel, 
    video.drop(columns=[
        "video_subtitle", "video_subtitle_filtered", 
        "video_subtitle_cleaned", "video_subtitle_tokenized"
    ]),       
    on="channel_id", 
    how="left"
)
yt = pd.merge(channel_videos, comment, on="video_id", how="inner")
yt.head(3)

Unnamed: 0,channel_id,channel_name,channel_description,channel_country,channel_uploads_id,channel_viewcount,channel_subscribercount,channel_videocount,video_id,video_title,...,comment_weapon,comment_children,comment_monster,comment_ocean,comment_giving,comment_contentment,comment_writing,comment_rural,comment_positive_emotion,comment_musical
0,UC-lHJZR3Gqxm24_Vd_AJ5Yw,PewDiePie,I make videos.,JP,UU-lHJZR3Gqxm24_Vd_AJ5Yw,29238461440,111000000,4753,F-yEoHL7MYY,I t̶r̶i̶e̶d̶ ̶t̶o̶ beat Elden Ring Without Dyi...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,UC-lHJZR3Gqxm24_Vd_AJ5Yw,PewDiePie,I make videos.,JP,UU-lHJZR3Gqxm24_Vd_AJ5Yw,29238461440,111000000,4753,F-yEoHL7MYY,I t̶r̶i̶e̶d̶ ̶t̶o̶ beat Elden Ring Without Dyi...,...,0.0,0.0,0.0,0.0,0.0,0.1,0.0,0.0,0.1,0.0
2,UC-lHJZR3Gqxm24_Vd_AJ5Yw,PewDiePie,I make videos.,JP,UU-lHJZR3Gqxm24_Vd_AJ5Yw,29238461440,111000000,4753,F-yEoHL7MYY,I t̶r̶i̶e̶d̶ ̶t̶o̶ beat Elden Ring Without Dyi...,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [81]:
# Check the dimensions
yt.shape

(124704, 448)
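
_As a quick sanity check on the merged width, the column count can be reproduced by hand.  The sketch assumes `channel` has 8 columns (counted from the displayed `yt` header) and that the only columns shared between the frames are the merge keys._

```python
channel_cols = 8          # assumption: counted from the displayed `yt` header
video_cols = 229 - 4      # `video` minus the four dropped subtitle columns
comment_cols = 217        # from `comment.shape`

# Each merge key (`channel_id`, then `video_id`) appears only once in the result
expected_cols = channel_cols + (video_cols - 1) + (comment_cols - 1)
print(expected_cols)
```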

In [82]:
# Write to CSV
yt.to_csv("../data/yt-labeled.csv", index=False, lineterminator="\r\n")

_The final labeled dataset contains **124,704 rows** and **448 columns**.  In `03-preliminary analysis`, we will begin to explore the dataset, examining its **distribution** through **exploratory data analysis** and **visualizations**._   