# DSCI 511: Data Acquisition and Pre-Processing <br> Term Project Phase 2: Valorant comments

## Group members 
- Group member 
    - Name: Amira Bendjama
    - Email: ab4745@drexel.edu
- Group member 
    - Name: Nicole Padilla 
    - Email: np858@drexel.edu

# Data collection 

Initial data was gathered via the YouTube API which allows publicly available YouTube comments to be called by anyone who created an app with their Google account. [GeeksforGeeks](https://www.geeksforgeeks.org/how-to-extract-youtube-comments-using-youtube-api-python/) was used as a reference for the code written to run the API call. The code was modified and resulting data was loaded into a .csv file.
There are 22 youtubers selected based on their subscription count. We considered big youtubers the ones that their channels' subscription count surpass 500k, and under 500k to 100k are considered small youtubers. The lower bound 100k for subscription is how much the channel must reach in order to be verified, and since brands look for verified channels, we considered that limit. 
The youtubers information are collected in csv file "Youtubers.csv" that contains 3 columns: channel name, subscription count, channel's URL. 

### Youtubers 
In order to start collecting the comments, we needed a dataset of youtubers. Our selection was based on articles from [Best Valorant Streamers](https://www.esportsbets.com/valorant/streamers/), [Valorant main page on youtube](https://www.youtube.com/channel/UCiMRGE8Sc6oxIGuu_JxFoHg/live), reddit posts about [Valorant favorite youtubers](https://www.reddit.com/r/VALORANT/comments/o29j7i/favourite_valorant_youtuber/), and [Valorant YouTuber to learn the basics ?
](https://www.reddit.com/r/VALORANT/comments/vz5mjp/valorant_youtuber_to_learn_the_basics/).

__Criteria for picking streamers__: 
- Only verified channels, with a lower bound of subscription count of 100k, since the latter is how much the channel must reach in order to be eligible to apply for verification, and companies and brands will only consider verified channel to promote their products, in our case games.
- Most valorant streamers are based on twitch, so a popular twitch streamers doesn’t qualify as a popular youtuber, so we picked valorant youtubers that upload on their main youtube channel and have a certain subscription count. 
- The valorant youtubers are split into two categories: Big youtubers above 500k subscription count, and small youtubers are under and above 100k. 
- Youtbers are english speakers from around the world, so it is not based on location but language.
- Youtube channels are mixed between channels with only valorant videos, and channels with variety of other content besides valorant. Mainly to see the comment section through different communities. 


In [1]:
import pandas as pd
import re

def get_channels_names(file_path):
    youtubers = pd.read_csv(file_path, sep = ",", header = 0)
    return youtubers

In [2]:
youtubers = get_channels_names("data/Youtubers.csv")
youtubers

Unnamed: 0,channel_name,sub_count,url
0,Shroud,6.81M,https://www.youtube.com/@shroud/videos
1,Sykkuno,2.89M,https://www.youtube.com/@Sykkuno
2,iiTzTimmy,1.63M,https://www.youtube.com/@iiTzTimmy
3,TenZ,1.59M,https://www.youtube.com/@TenZ
4,Flights,918K,https://www.youtube.com/@Flightss
5,Grim,893K,https://www.youtube.com/c/GrimGuy
6,Kaydae,879K,https://www.youtube.com/@Kyedae
7,fuslie,732K,https://www.youtube.com/@fuslie
8,Tarik,660K,https://www.youtube.com/@tarik
9,MrLowlander,624K,https://www.youtube.com/@MrLowlander


## Youtube API 
In this project, we used Youtube API to retrieve comments, and videos from channels. We mainly used [youtube guide](https://developers.google.com/youtube/v3/getting-started), and other [ressources](https://towardsdatascience.com/how-to-build-your-own-dataset-of-youtube-comments-39a1e57aade). 
In order to access the API, a project must be created in [Google Developer’s Console](https://console.cloud.google.com/apis/dashboard?project=caramel-logic-370101), where you will have to do two steps: 
* Enable Youtube API data API v3.
* Create API key.

__Quota__ 


## Part 1: Retreiving Valorant Youtube comments

### Building Youtube service 
After setting up the youtube API, we must install libraries for Google API client for python. <br>
There is a quota limitation set by google at 10,000 units per day. To tackle this limitation, we used 4 different API keys to be able to retrieve the amount of videos and comments we want.


In [None]:
def get_keys(file_path):
    with open('data/keys.txt' , "r") as f: 
        keys = f.read()
    keys = keys.split("\n")
    return keys

keys = get_keys('data/keys.txt')
keys

In [4]:
#pip install --upgrade google-api-python-client
from googleapiclient.discovery import build
#building youtube service
def youtube_build_service(KEY):
    
    YOUTUBE_API_SERVICE_NAME = "youtube"
    YOUTUBE_API_VERSION = "v3"

    return build(YOUTUBE_API_SERVICE_NAME,
                 YOUTUBE_API_VERSION,
                 developerKey=KEY)

Getting keys to build the youtube service

In [5]:
#each time call the service pop keys
def get_service():
    global youtube_service 
    if keys:
        youtube_service  = youtube_build_service(keys.pop())       

Call the service each time the quota ends.

In [6]:
#call this function to build the service 
#and also to switch keys
get_service()
youtube_service

<googleapiclient.discovery.Resource at 0x25a15c4aac0>

### Channel information 
Each youtube channel has a unique channel ID, that mostly can be found at the end of the URL. However, some of old URL main channels will have the unique channel ID where others channels will have the name of the channel instead in form of: https://www.youtube.com/@namechannel. To solve issue, BeautifulSoup and requests were used to fecth html page of each channel and getting the unique ID by finding "externalId" that has the channel ID.

In [7]:
import requests
from bs4 import BeautifulSoup
import re

def get_channel_id(channel_url):
    url ="" 
    #getting json
    resp = requests.get(channel_url)
    data = BeautifulSoup(resp.text, "html.parser")
    #finding "externalId" that has the channel id no matter what is link structure
    data_s = str(data)
    
    search_url = re.search('"externalId":',data_s)
    start, end = search_url.span()
    #finding the url after the id, using index
    for i in range(end , end+100):
        if data_s[i] == ",":
            break
        url += data_s[i]
    url = url.split('"')[1]
    return url

Using API call, to get channels information, specifiying statistics, snippets, contentDetails.Also, Quota consumption is 1 quota for each youtube list.

In [8]:
def get_channel_details(youtube, **kwargs):
    return youtube.channels().list(
        part="statistics,snippet,contentDetails",
        **kwargs
    ).execute()

Fetching each channel detail by providing the URL, then extracting the information needed from the object.

In [9]:
def get_channels_details_info(youtubers, youtube_service):
    dict_youtubers = {}
    l_youtubers = []
    for index in range(len(youtubers["url"])):
        # get the channel ID from the URL
        channel_id= get_channel_id(youtubers["url"].iloc[index])
        # get the channel details
        response = get_channel_details(youtube_service, id=channel_id)
        snippet = response["items"][0]["snippet"]
        statistics = response["items"][0]["statistics"]
        dict_youtubers = {
            "channel_id":channel_id,
            "channel_title" : snippet["title"],
            "channel_subscriber_count" : statistics["subscriberCount"],
            "channel_video_count" : statistics["videoCount"],
            "channel_view_count"  : statistics["viewCount"] 
        }
        l_youtubers.append(dict_youtubers)
        
    return l_youtubers
    
  

Saving/loading channels information into/from "./data/channels.csv" after fetching 5 columns:
* "channel_id"
* "channel_title"
* "channel_subscriber_count"
* "channel_video_count"
* "channel_view_count" </br>

All the information will be presented as dataframes. 

In [16]:
import os

if os.path.exists("data/channels.csv"):
    # load any pre-existing data
    df = pd.read_csv('data/channels.csv')
else:
    channels_info = get_channels_details_info(youtubers, youtube_service)
    df = pd.DataFrame(channels_info)
    #save to csv file
    df.to_csv('data/channels.csv', index=False)
df 

Unnamed: 0,channel_id,channel_title,channel_subscriber_count,channel_video_count,channel_view_count
0,UCoz3Kpu5lv-ALhR4h9bDvcw,Shroud,6810000,1428,1007951954
1,UCRAEUAmW9kletIzOxhpLRFw,Sykkuno,2890000,641,371445453
2,UC5v2QgY2D5tlu8uws23MG4Q,iiTzTimmy,1630000,745,270690657
3,UCckPYr9b_iVucz8ID1Q67sw,TenZ,1590000,251,156859008
4,UCIfAlCwj-ZPZq5fqjpYDX3w,Flights,918000,56,96612905
5,UCWphjEePrzIrRA5mwcOt_4Q,Grim,893000,226,107176110
6,UCxjdy5n9BxX_6RTL8Dt_7pg,Kyedae,880000,81,52712496
7,UCujyjxsq5FZNVnQro51zKSQ,fuslie,735000,785,120281865
8,UCTbtlMEiBfs0zZLQyJzR0Uw,tarik,661000,1269,160465751
9,UCgtbMb3djcXKj6CHerHwZ-A,MrLowlander,625000,366,182226931


### Extracting videos from each channel
Manually picking up valorant videos for each channel isn't convenient. In addition, most videos won't have valorant in the title.To address this issue, we used the [search()](https://developers.google.com/youtube/v3/docs/search/list) offered by youtube API, where it has "q" paramter that specifies the query term to search for.<br>
We were able to extract roughly 462 videos,by fetching 21 videos for each youtuber.

In [11]:
def get_channel_videos(youtube, **kwargs):
    return youtube.search().list(
        **kwargs
    ).execute()

In [12]:
def get_video_details(youtube, **kwargs):
    return youtube.videos().list(
        part="snippet,contentDetails,statistics",
        **kwargs
    ).execute()

In [13]:
def video_infos(video_response):
     
    items = video_response.get("items")[0]
    # get the snippet, statistics & content details from the video response
    snippet         = items["snippet"]
    statistics      = items["statistics"]
    content_details = items["contentDetails"]
    # get infos from the snippet
    channel_title = snippet["channelTitle"]
    channel_id = snippet["channelId"]
    title         = snippet["title"]
    publish_time  = snippet["publishedAt"]
    
    # get stats infos
    comment_count = statistics["commentCount"]
    like_count    = statistics["likeCount"]
    view_count    = statistics["viewCount"]
    # get duration from content details
    duration = content_details["duration"]
    
    # duration in the form of something like 'PT5H50M15S'
    # parsing it to be something like '5:50:15'
    parsed_duration = re.search(f"PT(\d+H)?(\d+M)?(\d+S)?", duration).groups()
    duration_str = ""
    for d in parsed_duration:
        if d:
            duration_str += f"{d[:-1]}:"
    duration_str = duration_str.strip(":")
    
    dict_video_info = {
        "Title": title,
        "Channel Title": channel_title,
        "Channel ID": channel_id,
        "Publish time": publish_time,
        "Duration": duration_str,
        "Number of comments": comment_count,
        "Number of likes": like_count,
        "Number of views": view_count
        
    }
    
    return dict_video_info

The main issue we faced throughout the project is the quota limitation. To handle that, we used try/except to handle the HttpError generated from reaching the limits. When our limits reached for a single key, it is switched to another key and we build the youtube service again. Also, we made sure to undestand the quota consumption for each function using [YouTube Data API v3 - Quota Calculator](https://developers.google.com/youtube/v3/determine_quota_cost).

In [14]:
import time 

def get_videos_from_channel(youtube_service, channel_id, videos_limit = 5):
    
    # counting number of videos grabbed
    n_videos = 0
    next_page_token = None
    list_videos = []
    

    while n_videos < videos_limit:
        #paramters to select the videos
        #only valorant related videos
        params = {
            'part': 'snippet',
            'q': 'valorant',
            'channelId': channel_id,
            'type': 'video',
        }
        
        if next_page_token:
            params['pageToken'] = next_page_token
        
        try:
            #getting channel videos based on parameters
            res = get_channel_videos(youtube_service, **params)
            #getting items
            channel_videos = res.get("items")

            for video in channel_videos:
                if n_videos == videos_limit:
                    break

                
                video_id = video["id"]["videoId"]
                # easily construct video URL by its ID
                video_url = f"https://www.youtube.com/watch?v={video_id}"

                video_response = get_video_details(youtube_service, id=video_id)

                # get video details in dictionary
                dictionary_video = video_infos(video_response)
                dictionary_video["video_id"] = video_id
                dictionary_video["url"] = video_url 
                #changed just location
                n_videos += 1

                list_videos.append(dictionary_video)

            # if there is a next page, then add it to our parameters
            # to proceed to the next page
            if "nextPageToken" in res:
                next_page_token = res["nextPageToken"]
            
            #sleep between requests
            time.sleep(2)
            
        #catch the quota exception and switch keys
        except Exception as e:
            if keys:
                #switch key and build service
                get_service()
                continue
            else: 
                #in case of not having keys
                return list_videos

        
    return list_videos


Saving/loading channels information into/from csv file after fetching 10 columns: __Title,	Channel Title, Channel ID, Publish time,	Duration,	Number of comments,	Number of likes,	Number of views,	video_id,	url.__

In [18]:
if os.path.exists("data/videos.csv"):
    # load any pre-existing data
    df_videos = pd.read_csv('data/videos.csv')
    #dropping the index column
    df_videos.pop(df_videos.columns[0])
else:
    videos_retrieved = []
  
    for channel_id in df["channel_id"]:
        #make it 30
        videos_retrieved.extend(get_videos_from_channel(youtube_service, channel_id,30))

    df_videos = pd.DataFrame(videos_retrieved)
    #save to csv file
    df_videos.to_csv('data/videos.csv', index=False)
df_videos

Unnamed: 0,Channel Title,Channel ID,Publish time,Duration,Number of comments,Number of likes,Number of views,video_id,url
0,Shroud,UCoz3Kpu5lv-ALhR4h9bDvcw,2022-11-29T22:16:54Z,10:14,396,9215,221496,jDW6uIbZHO0,https://www.youtube.com/watch?v=jDW6uIbZHO0
1,Shroud,UCoz3Kpu5lv-ALhR4h9bDvcw,2022-10-01T13:01:41Z,9:39,685,22141,560303,DTuS6Bki9kI,https://www.youtube.com/watch?v=DTuS6Bki9kI
2,Shroud,UCoz3Kpu5lv-ALhR4h9bDvcw,2022-10-18T20:25:11Z,10:1,363,14583,435921,-Z2soOp0ZkQ,https://www.youtube.com/watch?v=-Z2soOp0ZkQ
3,Shroud,UCoz3Kpu5lv-ALhR4h9bDvcw,2022-10-30T14:24:11Z,10:18,206,9413,270325,ELNs_hXu1qQ,https://www.youtube.com/watch?v=ELNs_hXu1qQ
4,Shroud,UCoz3Kpu5lv-ALhR4h9bDvcw,2022-10-07T16:00:03Z,8:54,313,18793,584216,Cow9Qa9759Y,https://www.youtube.com/watch?v=Cow9Qa9759Y
...,...,...,...,...,...,...,...,...,...
598,Sydeon,UCtTWOND3uyl4tVc_FarDmpw,2021-07-18T21:50:32Z,10:26,14,1299,13959,DF-3YZHB9iQ,https://www.youtube.com/watch?v=DF-3YZHB9iQ
599,Sydeon,UCtTWOND3uyl4tVc_FarDmpw,2021-05-31T21:15:15Z,13:21,51,2181,29032,jSvcWBehXTM,https://www.youtube.com/watch?v=jSvcWBehXTM
600,Sydeon,UCtTWOND3uyl4tVc_FarDmpw,2021-06-13T17:10:52Z,13:42,40,1981,25275,eXiYalbGQrI,https://www.youtube.com/watch?v=eXiYalbGQrI
601,Sydeon,UCtTWOND3uyl4tVc_FarDmpw,2021-03-09T00:17:46Z,15:25,40,1625,18133,MB18fkSyWss,https://www.youtube.com/watch?v=MB18fkSyWss


### Extracting Youtube comments from each video extracted 
Youtube API allows us to extract youtube comments, where we were able to extract all comments from each video. One key can provide us with 250,000 comments in one day.

In [None]:
def get_comments(youtube, **kwargs):
    return youtube.commentThreads().list(
        part="snippet",
        **kwargs
    ).execute()

In [None]:
def get_comments_video(videoId, total_comments = 8000, max_comment_per_page = 100 , order = "time"):
    
    comments_nb = 0 

    list_comments = []
    comments_dict = {}
    
    while comments_nb <total_comments:
       
        params = {
                'videoId': videoId, 
                'maxResults': max_comment_per_page,
                'order': 'relevance', # default is 'time' (newest)
            }
        try:
            response = get_comments(youtube_service, **params)

            items = response.get("items")



            # if items is empty, breakout of the loop
            if not items:
                break


            for item in items:
                if comments_nb == total_comments:
                    break 
                comment = item['snippet']['topLevelComment']['snippet']['textDisplay']
                comment_id = item['snippet']['topLevelComment']['id']
                reply_count = item['snippet']['totalReplyCount']
                like_count = item['snippet']['topLevelComment']['snippet']['likeCount']

                comments_dict = {
                    "Comment ID":comment_id, 
                    "Comment": comment,
                    "Likes": like_count,
                    "Replies": reply_count,
                    "Video ID": videoId
                    }
                comments_nb+=1
                list_comments.append(comments_dict)


            if "nextPageToken" in response:
                # if there is a next page
                # add next page token to the params we pass to the function
                params["pageToken"] =  response["nextPageToken"]
            
            else:
                # must be end of comments!!!!
                break
                
        except Exception:
            if keys:          
                print("switching keys", len(list_comments))
                get_service()
                continue
            else: 
                print("break",len(list_comments) )
                return list_comments


    return list_comments

Saving/loading comment information into/from csv file after fetching 5 columns: Comment ID,	Comment,	Likes,	Replies,	Video ID.

In [None]:
if os.path.exists("data/comments.csv"):
    # load any pre-existing data
    df_comments = pd.read_csv('data/comments.csv')
    df_comments.pop(df_comments.columns[0])
else:
    comments = []
    
    for i , video_id in enumerate(df_videos["video_id"]):
        print("next video", i)
        comments.extend(get_comments_video(video_id))

    df_comments = pd.DataFrame(comments)
    df_comments.to_csv('data/comments.csv', index=False)
     

df_comments

next video 0
next video 1
next video 2
next video 3
next video 4
next video 5
next video 6
next video 7
next video 8
next video 9
next video 10
next video 11
next video 12
next video 13
next video 14
next video 15
next video 16
next video 17
next video 18
next video 19
next video 20
next video 21
next video 22
next video 23
next video 24
next video 25
next video 26
next video 27
next video 28
next video 29
next video 30
next video 31
next video 32
next video 33
next video 34
next video 35
next video 36
next video 37
next video 38
next video 39
next video 40
next video 41
next video 42
next video 43
next video 44
next video 45
next video 46
next video 47
next video 48


### Join tables 
In order to manipulate the comments and to get a clear understanding of each comment, we joined all tables using their ID column. 

In [None]:
#join comment, video and channal data on video and channel id
df_video_comment_data = pd.merge(df_videos, df_comments, how = 'outer', left_on = ['video_id'], right_on = ['Video ID'])
df_video_comment_channel_data = pd.merge(df_video_comment_data, df_channel_info, how = 'outer', left_on = ['Channel ID'], right_on = ['channel_id'])

df_video_comment_channel_data
df_video_comment_channel_data.to_csv('data/comments_videos_channel_info.csv')

In [None]:
keys =["AIzaSyAkKs_1ndolibMgBUR94PQi1MJoGGM6mU0", "AIzaSyAnhWSJGOoFQwdiB5DFnNeMPfrGMUpm04w"]