# IMT 547 Project Part I: Data Collection

Chesie Yu

02/24/2024

<style type = "text/css">  
    body {
        font-family: "Serif"; 
        font-size: 12pt;
    }
    em {
        color: #4E7F9E;
    }
    strong {
        color: #436D87;
    }
    li {
        color: #4E7F9E;
    }
    ul {
        color: #4E7F9E;
    }
    img {
        display: block;
        margin: auto;
    } 
    .jp-RenderedHTMLCommon a:link { 
        color: #94C1C9;
    }
    .jp-RenderedHTMLCommon a:visited { 
        color: #94C1C9;
    }
    .jp-RenderedHTMLCommon code {
        color: #4E7F9E;
    }  
    .mark {
        color: #B00D00;
        background-color: #FFF7B1;
    }
</style>

_This notebook outlines the **data collection** process for the **YouTube Gaming Comment Toxicity** project._    

**Components**  
1. **Authentication & Configuration**: Library setup, logging configuration, and API client initialization.      
2. **Utility Functions**: A series of functions designed to streamline the data collection workflow.   
3. **Data Collection**: Channel- and keyword-based data collection producing DataFrames containing channel, video, and comment data.  

**Functions**   
- **`get_channel_info(channel_ids)`**: Fetch channel info for a list of YouTube channels.    
- **`get_video_ids(uploads_id, max_videos=30, keywords="")`**: Fetch up to `max_videos` video IDs (default 30) from an uploads playlist, keeping videos whose titles contain any of the given keywords.  
- **`get_video_info(video_ids)`**: Fetch video info for a list of YouTube videos.   
- **`get_video_subtitle(video_id)`**: Fetch the English subtitle for a given video.  
- **`get_comment_info(video_ids, max_comments=100)`**: Fetch comment info (default up to 100 per video) for a list of YouTube videos.  
- **`get_youtube_data(channel_ids, max_videos=30, max_comments=100, keywords="")`**: Main function. Fetch videos and comments for a list of channels.    

**Runtime**  
_Approximately 63 minutes._    

**Data Collection Procedures**  

_To support our examination of the impact of game genres on comment toxicity across YouTube gaming channels, we have devised the following data collection approach:_   

**Step 1: Keyword Selection**   

_To **differentiate between** action and non-action gaming videos on YouTube, we identified **two sets of keywords** representing popular games in each category._   

_The keyword sets are as follows:_    
- **Action Games**: {"call of duty", "gta", "the last of us", "god of war", "red dead redemption", "assassin's creed", "star wars jedi", "resident evil", "cyberpunk", "fallout", "tomb raider", "elden ring"}    

- **Non-Action Games**: {"minecraft", "pokemon go", "just dance", "it takes two", "uncharted", "brawl stars"}        


**Step 2: Channel Selection**      

*From [SocialBook's Top 100 Gaming YouTubers](https://socialbook.io/youtube-channel-rank/top-100-gaming-youtubers), we curated a list of **32 channels** that predominantly create gaming content in **English**.  For each channel, we **manually extracted** the `channel_id` and **assigned the binary labels** `english` and `gamer` in `gamer-100.csv`, keeping our focus on the **English-speaking gaming community**.*  

**Step 3: Data Collection**     

_Leveraging the **[YouTube Data API](https://developers.google.com/youtube/v3/getting-started)**, we gathered data from **up to 30 videos per category for each channel**, using the pre-defined keywords for action and non-action games.  For each video, we extracted the respective features along with the **English subtitle (uploaded or automatically generated)** using **[yt-dlp](https://github.com/yt-dlp/yt-dlp)**.  We then collected **up to 100 of the most relevant top-level comments for each video**._  

_The sets of features include:_  
- **Channel Features**: `["channel_id", "channel_name", "channel_description", "channel_country", "channel_uploads_id", "channel_viewcount", "channel_subscribercount", "channel_videocount"]`   

- **Video Features**: `["video_id", "video_title", "video_creation_time", "video_description", "video_duration", "video_tags", "video_viewcount", "video_likecount", "video_commentcount", "video_subtitle_path", "video_subtitle", "video_live"]`     

- **Comment Features**: `["comment_id", "comment_author_id", "comment_text", "comment_time", "comment_likecount", "comment_replycount"]`       

_The final dataset consists of **136,458 comments** and **1,407 videos**, encompassing **26 channel, video, and comment features**.  By analyzing this data, we aim to uncover insights into the dynamics of toxic commenting behavior within YouTube gaming communities.  `02-preprocessing.ipynb` will focus on **data cleaning, feature engineering, content labeling, and text preprocessing** for subsequent analysis._    

## 1. Authentication & Configuration

In [1]:
# The YouTube API key
API_KEY = ""
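
A key typed directly into the notebook is easy to commit by accident. The following is a minimal sketch of one alternative, assuming the key has been exported in an environment variable named `YOUTUBE_API_KEY`. Both that variable name and the `load_api_key` helper are hypothetical and not part of the original notebook.

```python
import os


def load_api_key(var_name="YOUTUBE_API_KEY"):
    """Return the API key stored in an environment variable (hypothetical name)."""
    key = os.environ.get(var_name, "")
    if not key:
        # Fail early with a readable message instead of a later API error
        raise RuntimeError(f"Set the {var_name} environment variable before running the notebook.")
    return key
```

With the variable set, the cell above could then read `API_KEY = load_api_key()`.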

In [2]:
# Install libraries (yt-dlp is needed for the subtitle download below)
!pip install --upgrade google-api-python-client yt-dlp --quiet

In [3]:
# Import libraries
import json
import logging
import os
import time
import pandas as pd
import googleapiclient
from googleapiclient import discovery
from googleapiclient.errors import HttpError
from yt_dlp import YoutubeDL

In [4]:
# Configure logging to file
logging.basicConfig(
    filename="../logs/data.log",
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    filemode="w"
)

In [5]:
# Initialize the YouTube API
youtube = googleapiclient.discovery.build("youtube", "v3", developerKey=API_KEY)

<br>

## 2. Utility Functions

In [6]:
def get_channel_info(channel_ids):
    """
    Fetch channel info for a list of YouTube channels. 
    """
    # Empty list to store channel info
    channel_info = []

    # Concatenate channel_ids into comma-separated string
    channel_ids_str = ",".join(channel_ids)
    
    # Call the API to fetch channel metadata, including the uploads playlist ID
    # Documentation: https://developers.google.com/youtube/v3/docs/channels/list
    request = youtube.channels().list(
        part="snippet,contentDetails,statistics",
        id=channel_ids_str
    )
    res = request.execute()

    for c in res["items"]:
        # Extract relevant channel info
        channel_info.append({
            "channel_id": c["id"],
            "channel_name": c["snippet"]["title"],
            "channel_description": c["snippet"]["description"],
            "channel_country": c["snippet"].get("country", "Not specified"), 
            "channel_uploads_id": c["contentDetails"]["relatedPlaylists"]["uploads"],
            "channel_viewcount": c["statistics"]["viewCount"],
            "channel_subscribercount": c["statistics"]["subscriberCount"],
            "channel_videocount": c["statistics"]["videoCount"]
        })
    
    return sorted(channel_info, key=lambda x: int(x["channel_subscribercount"]), reverse=True)
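
`get_channel_info` sends every ID in a single request, which works for the 32 channels used here. For longer lists, one option is to split the IDs into batches first. The sketch below assumes a limit of 50 IDs per request (worth confirming against the API documentation), and `chunk_ids` is a hypothetical helper that is not part of the original notebook.

```python
def chunk_ids(ids, size=50):
    """Yield consecutive batches of at most `size` IDs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Placeholder IDs, for illustration only
batches = list(chunk_ids([f"channel_{i}" for i in range(120)]))
print([len(b) for b in batches])  # [50, 50, 20]
```

Each batch could then be joined with `",".join(batch)` and passed to the request in the same way as `channel_ids_str`.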

In [7]:
def get_video_ids(uploads_id, max_videos=30, keywords=""):
    """
    Fetch video IDs from a YouTube playlist.
    """
    # Empty list to store video_ids
    video_ids = []
    page_token = None
    
    # Loop until we collect enough videos
    while len(video_ids) < max_videos: 
        # Call the API to extract video IDs from playlist
        # Documentation: https://developers.google.com/youtube/v3/docs/playlistItems
        request = youtube.playlistItems().list(
            part="snippet",
            playlistId=uploads_id,
            pageToken=page_token,
            maxResults=50
        )
        res = request.execute()
    
        # Store the video ids
        for v in res["items"]: 
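            # Check if title contains any keyword; keep every video when no keywords are given
            # (with the default keywords="", any() over an empty sequence would never match)
            # Maybe try stemming/lemmatization if I have the time?
            title = v["snippet"]["title"].lower()
            if not keywords or any(k.lower() in title for k in keywords):
                video_ids.append(v["snippet"]["resourceId"]["videoId"])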

            # Exit the loop once the required number of videos is reached
            if len(video_ids) >= max_videos:
                break 
    
        # Set the token
        page_token = res.get("nextPageToken")

        # Exit the loop if no token is found
        if not page_token: 
            break
    
    return video_ids
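
The title filter in `get_video_ids` is a case-insensitive substring match. The sketch below isolates that check so it can be tried without calling the API. `title_matches` is a hypothetical helper, and treating an empty keyword list as "match everything" is an assumption about the intended behavior.

```python
def title_matches(title, keywords):
    """Return True if any keyword occurs in the title, ignoring case."""
    if not keywords:
        return True  # assumption: no keywords means no filtering
    lowered = title.lower()
    return any(k.lower() in lowered for k in keywords)


sample_keywords = ["gta", "elden ring"]
print(title_matches("Elden Ring #1 - 0 DEATH PLAYTHROUGH", sample_keywords))  # True
print(title_matches("$39,000,000 Minecraft House..", sample_keywords))  # False
```

Because this is a substring match, a short keyword such as "gta" also matches titles in which those letters appear inside a longer word.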

In [8]:
def get_video_info(video_ids):
    """
    Fetch video info for a list of YouTube videos.
    """
    # Empty list to store video info
    video_info = []

    # Concatenate video_ids into comma-separated string
    video_ids_str = ",".join(video_ids)
    
    # Call the API to extract video info from ids
    # Documentation: https://developers.google.com/youtube/v3/docs/videos#resource
    request = youtube.videos().list(
        part="snippet,statistics,contentDetails,liveStreamingDetails",
        id=video_ids_str
    )
    res = request.execute()
        
    for v in res["items"]:
        # Extract subtitle path and text
        video_subtitle_path, video_subtitle = get_video_subtitle(v["id"])
        
        # Extract relevant video info
        video_info.append({
            "channel_id": v["snippet"]["channelId"],
            "video_id": v["id"],
            "video_title": v["snippet"]["title"],
            "video_creation_time": v["snippet"]["publishedAt"],
            "video_description": v["snippet"]["description"],
            "video_duration": v["contentDetails"]["duration"],
            "video_tags": v["snippet"].get("tags", []),
            "video_viewcount": v["statistics"].get("viewCount"),
            "video_likecount": v["statistics"].get("likeCount"),
            "video_commentcount": v["statistics"].get("commentCount"),
            "video_subtitle_path": video_subtitle_path,
            "video_subtitle": video_subtitle,
            "video_live": True if "liveStreamingDetails" in v else False
        })

    return video_info
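
`video_duration` is stored exactly as the API returns it, as an ISO 8601 duration string such as `PT53M47S`. The following is a minimal sketch of converting it to seconds for later analysis, assuming only day, hour, minute, and second components appear. `duration_to_seconds` is a hypothetical helper, not part of the original notebook.

```python
import re

# Matches strings such as "PT9M30S", "PT1H2M3S", or "P1DT1S"
_DURATION = re.compile(
    r"^P(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$"
)


def duration_to_seconds(duration):
    """Convert an ISO 8601 duration string to a number of seconds."""
    match = _DURATION.match(duration)
    if match is None:
        raise ValueError(f"Unsupported duration: {duration!r}")
    # Missing components are treated as zero
    parts = {name: int(value or 0) for name, value in match.groupdict().items()}
    return ((parts["days"] * 24 + parts["hours"]) * 60 + parts["minutes"]) * 60 + parts["seconds"]


print(duration_to_seconds("PT53M47S"))  # 3227
```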

In [9]:
def get_video_subtitle(video_id):
    """
    Fetch video subtitle for a given video. 
    """
    # Directory to save subtitles
    subtitle_dir = "../subtitle"

    # Construct the full URL from video ID
    video_url = f"https://www.youtube.com/watch?v={video_id}"
        
    # Define yt-dlp options for downloading subtitles
    # Documentation: https://github.com/yt-dlp/yt-dlp/blob/master/yt_dlp/YoutubeDL.py#L183
    ydl_opts = {
        "skip_download": True,  # Skip downloading the video
        "writeautomaticsub": True,  # Download the automatically generated subtitle
        "writesubtitles": True,  # Download the uploaded subtitle, if any
        "subtitleslangs": ["en"],  # Specify the language to download
        "subtitlesformat": "json3",  # Define the format for subtitle
        "outtmpl": f"{subtitle_dir}/{video_id}.%(ext)s",  # Specify the output names
        "quiet": True,
        "no_warnings": True
    }
    
    # Initialize return values
    video_subtitle_path, video_subtitle = None, None

    try: 
        # Download the subtitle
        with YoutubeDL(ydl_opts) as ydl:
            ydl.download([video_url])
            
        # Check if the subtitle file exists
        if os.path.exists(f"{subtitle_dir}/{video_id}.en.json3"):
            video_subtitle_path = f"{subtitle_dir}/{video_id}.en.json3"
            logging.info(f"Downloaded subtitles for video {video_id}")

            # Open the subtitle file (explicit encoding avoids platform-dependent defaults)
            with open(video_subtitle_path, "r", encoding="utf-8") as file:
                subtitle = json.load(file)

            # Concatenate the subtitle text
            video_subtitle = ""
            for event in subtitle["events"]:
                if "segs" in event:
                    for seg in event["segs"]:
                        video_subtitle += seg["utf8"] + " "

            # Clean up the subtitle text (split() already collapses newlines and repeated spaces)
            video_subtitle = " ".join(video_subtitle.split())
    
        else: 
            # Subtitle not found
            logging.warning(f"No subtitles found for video {video_id}")
        
    # Exception handling
    except Exception as e:
        logging.error(f"Error downloading subtitles for video {video_id}: {e}")
    
    return video_subtitle_path, video_subtitle
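
The text extraction step in `get_video_subtitle` can be tried on its own without downloading anything. The sketch below uses a hand-made, simplified stand-in for a json3 file that keeps only the `events`, `segs`, and `utf8` keys the function reads. `subtitle_text` is a hypothetical helper, and real files contain additional fields.

```python
def subtitle_text(subtitle):
    """Join all text segments of a parsed json3 subtitle into one string."""
    pieces = []
    for event in subtitle.get("events", []):
        # Events without "segs" carry no text and are skipped
        for seg in event.get("segs", []):
            pieces.append(seg.get("utf8", ""))
    # split() collapses newlines and repeated spaces
    return " ".join(" ".join(pieces).split())


# Hand-made example, not taken from a real subtitle file
sample = {
    "events": [
        {"tStartMs": 0},
        {"segs": [{"utf8": "hello"}, {"utf8": " world"}]},
        {"segs": [{"utf8": "\n"}]},
        {"segs": [{"utf8": "again"}]},
    ]
}
print(subtitle_text(sample))  # hello world again
```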

In [10]:
def get_comment_info(video_ids, max_comments=100):
    """
    Fetch comments (up to max_comments) for a list of videos.
    """
    # Empty list to store the comments
    comment_info = []

    # Loop through the video ids
    for vid in video_ids:
        page_token = None 
        
        # Empty list to store individual video comments
        video_comment_info = []
        
        while len(video_comment_info) < max_comments:
            try:
                # Call the API to extract comments for videos
                # Documentation: https://developers.google.com/youtube/v3/docs/commentThreads/list
                request = youtube.commentThreads().list(
                    videoId=vid,
                    part="id, snippet, replies",
                    textFormat="plainText",
                    order="relevance",
                    maxResults=100,
                    pageToken=page_token
                )
                res = request.execute()

                # Extract relevant comment info
                for c in res["items"]:
                    top = c["snippet"]["topLevelComment"]
                    video_comment_info.append({
                        "video_id": c["snippet"]["videoId"],
                        "comment_id": top["id"],
                        # authorChannelId may be missing for some comments; fall back to None
                        "comment_author_id": top["snippet"].get("authorChannelId", {}).get("value"),
                        "comment_text": top["snippet"]["textOriginal"],
                        "comment_time": top["snippet"]["updatedAt"],
                        "comment_likecount": top["snippet"]["likeCount"],
                        "comment_replycount": c["snippet"]["totalReplyCount"]
                    })
                    
                    # Exit the loop once the required number of comments is reached
                    if len(video_comment_info) >= max_comments:
                        break

                # Set the token
                page_token = res.get("nextPageToken")

                # Exit the loop if no token is found
                if not page_token: 
                    break

            # Error handling for commentsDisabled
            except HttpError as e:
                if e.resp.status == 403 and "commentsDisabled" in str(e):
                    logging.warning(f"Comments are disabled for video {vid}.")
                else:
                    logging.error(f"Error extracting comments for video {vid}: {e}")
                break 
        
        # Log the number of comments extracted for the video
        logging.info(f"Extracted {len(video_comment_info)} comments for Video {vid}.")
    
        # Extend the comment info
        comment_info.extend(video_comment_info)
    
    return comment_info
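
Only `HttpError` is caught inside `get_comment_info`, so a missing key in a response item would surface as a `KeyError` in the caller. The following is a minimal sketch of a per-thread parser that tolerates a missing `authorChannelId`. `parse_comment_thread` is a hypothetical helper, and the sample thread is hand-made with placeholder values.

```python
def parse_comment_thread(thread):
    """Flatten one commentThreads item into the comment features used here."""
    snippet = thread["snippet"]
    top = snippet["topLevelComment"]
    top_snippet = top["snippet"]
    return {
        "video_id": snippet["videoId"],
        "comment_id": top["id"],
        # Fall back to None when the author ID is absent
        "comment_author_id": top_snippet.get("authorChannelId", {}).get("value"),
        "comment_text": top_snippet["textOriginal"],
        "comment_time": top_snippet["updatedAt"],
        "comment_likecount": top_snippet["likeCount"],
        "comment_replycount": snippet["totalReplyCount"],
    }


# Hand-made example with placeholder values; the author ID is deliberately missing
thread = {
    "snippet": {
        "videoId": "VIDEO_ID",
        "totalReplyCount": 2,
        "topLevelComment": {
            "id": "COMMENT_ID",
            "snippet": {
                "textOriginal": "Nice video",
                "updatedAt": "2022-05-02T19:37:22Z",
                "likeCount": 5,
            },
        },
    }
}
row = parse_comment_thread(thread)
print(row["comment_author_id"])  # None
```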

### Main Function

In [11]:
def get_youtube_data(channel_ids, max_videos=30, max_comments=100, keywords=""):
    """
    Fetch videos and comments for a list of channels.
    """
    # Start timing
    start_time = time.time()

    # Get channel info for given list of channels
    all_channel_info = get_channel_info(channel_ids)
    
    # Empty list to store video and comment info
    all_video_info = []
    all_comment_info = []
    
    for channel in all_channel_info:
        uploads_id = channel["channel_uploads_id"]
        channel_name = channel["channel_name"]
        logging.info(f"Processing channel: {channel_name}")

        try:
            channel_start_time = time.time()

            # Get video ids from uploads playlist id
            video_ids = get_video_ids(uploads_id, max_videos, keywords)
            logging.info(f"Number of Videos Extracted: {len(video_ids)}")

            # Get video info from videos ids
            video_info = get_video_info(video_ids)
            all_video_info.extend(video_info)

            # Get comments for each video
            comment_info = get_comment_info(video_ids, max_comments)
            all_comment_info.extend(comment_info)
            logging.info(f"Number of Comments Extracted: {len(comment_info)}")
            logging.info(f"Runtime: {time.time() - channel_start_time:.4f} seconds\n")
        
        # Http errors
        except HttpError as e:
            logging.error(f"HttpError processing channel {channel_name}: {e}")
        # Timeout errors
        except TimeoutError:
            logging.error(f"TimeoutError processing channel {channel_name}")
        # Unexpected errors
        except Exception as e:
            logging.error(f"Unexpected error processing channel {channel_name}: {e}")
            
    # Convert to DataFrames
    channel = pd.DataFrame(all_channel_info)
    video = pd.DataFrame(all_video_info)
    comment = pd.DataFrame(all_comment_info)

    # Merge channel, video, and comments information
    channel_videos = pd.merge(channel, video.drop(columns=["video_subtitle"]), on="channel_id", how="left")
    yt = pd.merge(channel_videos, comment, on="video_id", how="inner")

    # End timing
    logging.info(f"Total Runtime: {time.time() - start_time:.4f} seconds\n")
    
    return channel, video, comment, yt
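
The two merges at the end of `get_youtube_data` determine which rows reach `yt`. The sketch below replays them on tiny hand-made DataFrames (placeholder IDs, not project data) to show that the left merge keeps channels without videos, while the inner merge keeps only videos that have at least one comment.

```python
import pandas as pd

# Placeholder tables: channel c2 has no videos, video v2 has no comments
channel = pd.DataFrame({"channel_id": ["c1", "c2"], "channel_name": ["A", "B"]})
video = pd.DataFrame({
    "channel_id": ["c1", "c1"],
    "video_id": ["v1", "v2"],
    "video_title": ["t1", "t2"],
    "video_subtitle": ["s1", None],
})
comment = pd.DataFrame({"video_id": ["v1", "v1"], "comment_id": ["k1", "k2"]})

# Same merge logic as in get_youtube_data
channel_videos = pd.merge(channel, video.drop(columns=["video_subtitle"]), on="channel_id", how="left")
yt = pd.merge(channel_videos, comment, on="video_id", how="inner")

print(channel_videos.shape)  # (3, 4)
print(yt.shape)  # (2, 5)
```

One consequence is that `yt` has one row per comment, so videos with comments disabled appear in `video` but not in `yt`.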

<br>

## 3. Data Collection

In [12]:
# Set the parameters
max_videos = 30
max_comments = 100

# Select the keywords
action_keywords = [
    "call of duty", "gta", "the last of us", "god of war", "cyberpunk", 
    "red dead redemption", "fallout", "elden ring", 
    "assassin's creed", "star wars jedi", "resident evil", "tomb raider" 
]

nonaction_keywords = [
    "minecraft", "pokemon go", "just dance", "it takes two", "uncharted",
    "brawl stars"
]

In [13]:
# Load the data
gamers = pd.read_csv("../data/gamer-100.csv")
gamers.head()

Unnamed: 0,channel,channel_id,english,gamer,influence_score,followers,avg_views,posts,eng_rate_60_day,new_video_avg_views,total_views,country
0,PewDiePie,UC-lHJZR3Gqxm24_Vd_AJ5Yw,1.0,1,88,111.0m,7.6m,4.8k,2.80%,2.6m,29.2b,Japan
1,A4,UC2tsySbe9TNrI-xh2lximHA,0.0,1,61,51.3m,20.8m,868,21.90%,10.1m,26.6b,Belarus
2,JuegaGerman,UCYiGq8XF7YQD00x7wAd62Zg,0.0,1,82,49.3m,5.3m,2.1k,3.00%,1.1m,15.1b,Chile
3,Mikecrack,UCqJ5zFEED1hWs0KNQCQuYdQ,0.0,1,59,47.7m,9.3m,2.0k,5.10%,2.1m,17.8b,Spain
4,Fernanfloo,UCV4xOVpbcV8SdueDCOxLXtQ,0.0,1,82,46.9m,30.8m,544,4.00%,0,10.5b,El Salvador


In [14]:
# Filter the English-speaking gamer channels
filtered_channels = gamers[(gamers["english"] == 1) & (gamers["gamer"] == 1)]

# Select the channels
channel_ids = filtered_channels["channel_id"].tolist()
len(channel_ids)

32

### Action Gaming Videos

In [15]:
# Get YouTube videos and comments for action videos
channel, video_action, comment_action, yt_action = \
    get_youtube_data(channel_ids, max_videos, max_comments, action_keywords)
video_action["video_genre"] = "action"
video_action.head(3)

                                                         

Unnamed: 0,channel_id,video_id,video_title,video_creation_time,video_description,video_duration,video_tags,video_viewcount,video_likecount,video_commentcount,video_subtitle_path,video_subtitle,video_live,video_genre
0,UC-lHJZR3Gqxm24_Vd_AJ5Yw,F-yEoHL7MYY,I t̶r̶i̶e̶d̶ ̶t̶o̶ beat Elden Ring Without Dyi...,2022-04-30T16:40:18Z,🌏 Get exclusive NordVPN deal here ➵ https://N...,PT53M47S,"[pewdiepie, pewds, pewdie]",11590164,474052,15146,../subtitle/F-yEoHL7MYY.en.json3,i have beaten all souls games without dying a ...,False,action
1,UC-lHJZR3Gqxm24_Vd_AJ5Yw,PV4NGwn_xdI,Elden Ring #1 - 0 DEATH PLAYTHROUGH,2022-02-24T15:00:10Z,NEW animated wallpapers on GX 👇👇👇\nhttps://ope...,PT13M38S,"[pewdiepie, pewds, pewdie, elden ring, elden r...",5179366,192101,4313,../subtitle/PV4NGwn_xdI.en.json3,ah you ready yes we're ready eldon ring baby l...,False,action
2,UC-lHJZR3Gqxm24_Vd_AJ5Yw,CF3jK8ai0l4,New GTA Definitive Edition is hilariously bad,2021-11-19T17:15:01Z,Grand Theft Auto: The Trilogy is not epic bros...,PT9M30S,"[pewdiepie, pewds, pewdie, Grand Theft Auto: T...",4053858,282853,9073,../subtitle/CF3jK8ai0l4.en.json3,but the new gta game is iron and it's not what...,False,action


In [16]:
# Check the dimension
channel.shape, video_action.shape, comment_action.shape, yt_action.shape

((32, 8), (618, 14), (60078, 7), (60078, 25))

### Non-Action Gaming Videos

In [17]:
# Get YouTube videos and comments for non-action videos
channel, video_nonaction, comment_nonaction, yt_nonaction = \
    get_youtube_data(channel_ids, max_videos, max_comments, nonaction_keywords)
video_nonaction["video_genre"] = "non-action"
video_nonaction.head(3)

                                                         

Unnamed: 0,channel_id,video_id,video_title,video_creation_time,video_description,video_duration,video_tags,video_viewcount,video_likecount,video_commentcount,video_subtitle_path,video_subtitle,video_live,video_genre
0,UC-lHJZR3Gqxm24_Vd_AJ5Yw,KeeeLsAa30M,"$39,000,000 Minecraft House..",2023-01-17T17:45:00Z,#AD - Pre-Order G FUEL’s New PAC-MAN Flavor! h...,PT20M46S,"[pewdiepie, pewds, pewdie]",3304606,144307,5113,../subtitle/KeeeLsAa30M.en.json3,dang do you have 39 million dollars laying aro...,False,non-action
1,UC-lHJZR3Gqxm24_Vd_AJ5Yw,5gWdiDI8tp4,I Found Wardens Hidden Base and Loot in Minecraft,2022-03-06T18:15:51Z,🧎#Subscribe🧎\n\nStock Up On ➡️🥤Gfuel (affiliat...,PT11M27S,"[pewdiepie, pewds, pewdie]",4179633,263512,6089,../subtitle/5gWdiDI8tp4.en.json3,bros i wish i was joking but the minecraft war...,False,non-action
2,UC-lHJZR3Gqxm24_Vd_AJ5Yw,2osdz9Z5JKY,Minecraft Warden Update is a NIGHTMARE!,2022-02-26T17:30:01Z,Get exclusive NordVPN deal here ➵ https://Nord...,PT15M46S,"[pewdiepie, pewds, pewdie]",6303313,368836,10442,../subtitle/2osdz9Z5JKY.en.json3,this is a bad idea it's a very bad idea it's p...,False,non-action


In [18]:
# Check the dimension
channel.shape, video_nonaction.shape, comment_nonaction.shape, yt_nonaction.shape

((32, 8), (789, 14), (76380, 7), (76380, 25))

### Complete Dataset

#### Channel

In [19]:
# View the data
channel.head(3)

Unnamed: 0,channel_id,channel_name,channel_description,channel_country,channel_uploads_id,channel_viewcount,channel_subscribercount,channel_videocount
0,UC-lHJZR3Gqxm24_Vd_AJ5Yw,PewDiePie,I make videos.,JP,UU-lHJZR3Gqxm24_Vd_AJ5Yw,29238461440,111000000,4753
1,UC7_YxT-KID8kRbqZo7MyscQ,Markiplier,"Hi, I'm Markiplier. I make videos. \n\nFrom qu...",US,UU7_YxT-KID8kRbqZo7MyscQ,21204065899,36400000,5577
2,UCpB959t8iPrxQWj7G6n0ctQ,SSSniperWolf,"Hi I'm SSSniperWolf! You can call me Lia, snip...",US,UUpB959t8iPrxQWj7G6n0ctQ,24651326157,34300000,3496


In [20]:
# Check the dimension
channel.shape

(32, 8)

In [21]:
# Write to CSV
channel.to_csv("../data/channel.csv", index=False)

#### Video

In [22]:
# Combine into one DataFrame
video = pd.concat([video_action, video_nonaction], ignore_index=True)
video.head(3)

Unnamed: 0,channel_id,video_id,video_title,video_creation_time,video_description,video_duration,video_tags,video_viewcount,video_likecount,video_commentcount,video_subtitle_path,video_subtitle,video_live,video_genre
0,UC-lHJZR3Gqxm24_Vd_AJ5Yw,F-yEoHL7MYY,I t̶r̶i̶e̶d̶ ̶t̶o̶ beat Elden Ring Without Dyi...,2022-04-30T16:40:18Z,🌏 Get exclusive NordVPN deal here ➵ https://N...,PT53M47S,"[pewdiepie, pewds, pewdie]",11590164,474052,15146,../subtitle/F-yEoHL7MYY.en.json3,i have beaten all souls games without dying a ...,False,action
1,UC-lHJZR3Gqxm24_Vd_AJ5Yw,PV4NGwn_xdI,Elden Ring #1 - 0 DEATH PLAYTHROUGH,2022-02-24T15:00:10Z,NEW animated wallpapers on GX 👇👇👇\nhttps://ope...,PT13M38S,"[pewdiepie, pewds, pewdie, elden ring, elden r...",5179366,192101,4313,../subtitle/PV4NGwn_xdI.en.json3,ah you ready yes we're ready eldon ring baby l...,False,action
2,UC-lHJZR3Gqxm24_Vd_AJ5Yw,CF3jK8ai0l4,New GTA Definitive Edition is hilariously bad,2021-11-19T17:15:01Z,Grand Theft Auto: The Trilogy is not epic bros...,PT9M30S,"[pewdiepie, pewds, pewdie, Grand Theft Auto: T...",4053858,282853,9073,../subtitle/CF3jK8ai0l4.en.json3,but the new gta game is iron and it's not what...,False,action


In [23]:
# Check the dimension
video.shape

(1407, 14)

In [24]:
# Write to CSV
video.to_csv("../data/video.csv", index=False)

#### Comment

In [25]:
# Combine into one DataFrame
comment = pd.concat([comment_action, comment_nonaction], ignore_index=True)
comment.head(3)

Unnamed: 0,video_id,comment_id,comment_author_id,comment_text,comment_time,comment_likecount,comment_replycount
0,F-yEoHL7MYY,UgwN1kGXwi9M7jeOb0d4AaABAg,UCLHsZ4X7YemjxRrvq0AI4LA,"Damn dude, even with mimic I think it would ta...",2022-05-02T19:37:22Z,9907,47
1,F-yEoHL7MYY,UgwW5nWEkxdES-g3hk54AaABAg,UCJ9VDCLZDmeJIU3Branlstg,This is the pewds that I thought he’d turn int...,2022-12-14T23:36:11Z,6299,9
2,F-yEoHL7MYY,UgyGe0HN8toQWUZZtCl4AaABAg,UCs-mo1206PASdacjDqfdLng,This is actually awesome. Can't believe a meme...,2022-12-31T18:16:36Z,5091,54


In [26]:
# Check the dimension
comment.shape

(136458, 7)

In [27]:
# Write to CSV
comment.to_csv("../data/comment.csv", index=False, escapechar="\\")

#### All YouTube Data

In [28]:
# Combine into one DataFrame
yt = pd.concat([yt_action, yt_nonaction], ignore_index=True)
yt.head(3)

Unnamed: 0,channel_id,channel_name,channel_description,channel_country,channel_uploads_id,channel_viewcount,channel_subscribercount,channel_videocount,video_id,video_title,...,video_likecount,video_commentcount,video_subtitle_path,video_live,comment_id,comment_author_id,comment_text,comment_time,comment_likecount,comment_replycount
0,UC-lHJZR3Gqxm24_Vd_AJ5Yw,PewDiePie,I make videos.,JP,UU-lHJZR3Gqxm24_Vd_AJ5Yw,29238461440,111000000,4753,F-yEoHL7MYY,I t̶r̶i̶e̶d̶ ̶t̶o̶ beat Elden Ring Without Dyi...,...,474052,15146,../subtitle/F-yEoHL7MYY.en.json3,False,UgwN1kGXwi9M7jeOb0d4AaABAg,UCLHsZ4X7YemjxRrvq0AI4LA,"Damn dude, even with mimic I think it would ta...",2022-05-02T19:37:22Z,9907,47
1,UC-lHJZR3Gqxm24_Vd_AJ5Yw,PewDiePie,I make videos.,JP,UU-lHJZR3Gqxm24_Vd_AJ5Yw,29238461440,111000000,4753,F-yEoHL7MYY,I t̶r̶i̶e̶d̶ ̶t̶o̶ beat Elden Ring Without Dyi...,...,474052,15146,../subtitle/F-yEoHL7MYY.en.json3,False,UgwW5nWEkxdES-g3hk54AaABAg,UCJ9VDCLZDmeJIU3Branlstg,This is the pewds that I thought he’d turn int...,2022-12-14T23:36:11Z,6299,9
2,UC-lHJZR3Gqxm24_Vd_AJ5Yw,PewDiePie,I make videos.,JP,UU-lHJZR3Gqxm24_Vd_AJ5Yw,29238461440,111000000,4753,F-yEoHL7MYY,I t̶r̶i̶e̶d̶ ̶t̶o̶ beat Elden Ring Without Dyi...,...,474052,15146,../subtitle/F-yEoHL7MYY.en.json3,False,UgyGe0HN8toQWUZZtCl4AaABAg,UCs-mo1206PASdacjDqfdLng,This is actually awesome. Can't believe a meme...,2022-12-31T18:16:36Z,5091,54


In [29]:
# Check the dimension
yt.shape

(136458, 25)
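
The shapes printed in this section can be cross-checked with simple arithmetic. The sketch assumes the 8 channel columns, the 13 video columns returned by `get_video_info` (before `video_genre` is added), and the 7 comment columns. `merged_width` is a hypothetical helper, not part of the original notebook.

```python
def merged_width(channel_cols, video_cols, comment_cols):
    """Column count of `yt` as built in get_youtube_data."""
    # video_subtitle is dropped, and channel_id is shared with the channel table
    channel_videos = channel_cols + (video_cols - 1) - 1
    # video_id is shared with the comment table
    return channel_videos + comment_cols - 1


print(merged_width(8, 13, 7))  # 25
print(60078 + 76380)  # 136458 rows in comment and yt
print(618 + 789)  # 1407 rows in video
```

Since `yt_action` and `yt_nonaction` are built inside `get_youtube_data`, before `video_genre` is assigned, the 25 columns of `yt` do not appear to include the genre label. It would need to be joined from `video` if required downstream.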

In [30]:
# Write to CSV
yt.to_csv("../data/yt.csv", index=False)