# Exploratory Data Analysis Using YouTube Video Data from One of My Favourite Channels

# 1. Aims, objectives and background

# 1.1. Introduction

YouTube, founded in 2005, has grown to become the second largest search engine in the world, processing more than 3 billion searches per month. However, it is still largely unknown how the YouTube algorithm works and what makes one video get more views and be recommended over another. YouTube has a highly advanced recommendation system, making it a challenge for new content creators to understand why some videos are more successful than others. There are various myths around the success of a YouTube video, such as having more likes or comments or being a certain duration. It is important for new content creators to experiment and identify trends in their niche.

As a new content creator with a YouTube channel focused on data analytics and data science, I have decided to investigate this topic in order to gain insights that could be helpful for other new content creators. However, the scope of this project will be limited to data science channels only, as other niches may have different characteristics and audiences. The project will focus on analyzing the statistics of the 10 most successful data science YouTube channels.

# 1.2. Aims and objectives

Within this project, I would like to explore the following:

- Part I - Getting to know the YouTube API and how to obtain video data.

- Part II - Analyzing video data and verifying common "myths" about what makes a video do well on YouTube, for example:
  - Does the number of likes and comments matter for a video to get more views?
  - Does the video duration matter for views and interaction (likes/comments)?
  - Does title length matter for views?
  - How many tags do well-performing videos have? What are the common tags among these videos?
  - Across all the creators I take into consideration, how often do they upload new videos? On which days of the week?

# 1.3. Steps of the project

- Obtain video metadata via the YouTube API for a popular channel in the digital niche (this includes several small steps: create a developer key, request data and transform the responses into a usable data format)
- Preprocess data and engineer additional features for analysis
- Exploratory data analysis
- Conclusions

# 1.4. Ethics of data source

According to the YouTube API's guide, use of the YouTube API is free of charge provided that your application sends requests within a quota limit. "The YouTube Data API uses a quota to ensure that developers use the service as intended and do not create applications that unfairly reduce service quality or limit access for others." The default quota allocation for each application is 10,000 units per day, and you can request additional quota by submitting a form to YouTube API Services if you reach the quota limit.
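
To make the quota limit concrete, here is a rough sketch of how many units the requests in this project might consume. It assumes that each `list` call used here (`channels`, `playlistItems`, `videos`, `commentThreads`) costs 1 unit and that at most 50 items come back per page; both figures are assumptions to check against the current quota documentation. `estimate_quota_units` is a hypothetical helper, not part of the API.

```python
import math

# Assumed values; verify against the current YouTube Data API quota table.
UNITS_PER_LIST_CALL = 1
MAX_RESULTS_PER_PAGE = 50


def estimate_quota_units(n_videos, fetch_comments=True):
    """Rough estimate of the quota units needed to collect data for n_videos."""
    pages = math.ceil(n_videos / MAX_RESULTS_PER_PAGE)
    calls = 1            # one channels().list call
    calls += pages       # playlistItems().list, one call per page of video IDs
    calls += pages       # videos().list, one call per batch of IDs
    if fetch_comments:
        calls += n_videos  # one commentThreads().list call per video
    return calls * UNITS_PER_LIST_CALL
```

Under these assumptions, `estimate_quota_units(1244)` comes to 1295 units, well inside the default daily allocation; most of the cost comes from requesting comments one video at a time.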

Since all data requested from the YouTube API is public data (which everyone on the Internet can see on YouTube), there are no particular privacy issues as far as I am concerned. In addition, the data is obtained only for research purposes in this case and not for any commercial interests.



# Part I - Requesting Data Through YouTube API

In [1]:
#importing all necessary libraries
import requests
import googleapiclient.discovery
from googleapiclient.discovery import build
import numpy as np 
import pandas as pd
import seaborn as sns
import IPython.display
from IPython.display import JSON

In [2]:
# Creating a variable for the API key (replace the placeholder with your own key; never publish a real key)
api_key = 'YOUR_API_KEY'
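
Since an API key is a credential, one option is to keep it out of the notebook and read it from an environment variable instead. A minimal sketch, assuming the key has been stored in a variable named `YOUTUBE_API_KEY` (a name chosen here for illustration; `load_api_key` is likewise a hypothetical helper):

```python
import os


def load_api_key(env_var="YOUTUBE_API_KEY"):
    """Return the API key stored in the given environment variable."""
    key = os.environ.get(env_var)
    if not key:
        # Fail early with a clear message instead of sending an empty key
        raise RuntimeError(f"Set the {env_var} environment variable to your API key first.")
    return key
```

With this in place, the assignment above could become `api_key = load_api_key()`.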

In [3]:
# Defining the channel IDs you want to get information for
channel_ids = ['UCP7WmQ_U4GB3K51Od9QvM0w']

In [4]:
# The code creates a client object for interacting with the YouTube Data API using the Google API client library, which requires specifying the API service name, version, and developer key.

api_service_name = "youtube"
api_version = "v3"
    
youtube = googleapiclient.discovery.build(
    api_service_name, api_version, developerKey=api_key)

In [5]:
def get_channel_stats(youtube, channel_ids):
    
    """
    Getting channel stats
    
    Parameters:
    ------
    youtube: build object of the YouTube API
    channel_ids: list of channel IDs
    
    Returns:
    ------
    dataframe with all channel stats for each channel ID
    
    """
    
    all_data = []
    
    request = youtube.channels().list(
        part="snippet,contentDetails,statistics",
        id=','.join(channel_ids)
    )
    response = request.execute()

    # loop through items
    for item in response['items']:
        data = {'channelName': item['snippet']['title'],
                'subscribers': item['statistics']['subscriberCount'],
                'views': item['statistics']['viewCount'],
                'totalVideos': item['statistics']['videoCount'],
                'playlistId': item['contentDetails']['relatedPlaylists']['uploads']
        }
        
        all_data.append(data)
        
    return pd.DataFrame(all_data)


In [6]:
# Creating the channel_stats variable in order to print the initial channel stats.
channel_stats = get_channel_stats(youtube, channel_ids)

In [7]:
print(channel_stats)

    channelName subscribers      views totalVideos                playlistId
0  David Bombal     1710000  106653492        1243  UUP7WmQ_U4GB3K51Od9QvM0w
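
The API returns these counts as strings, so they would need converting before any numeric comparison. A small sketch, using a hand-built one-row frame with the values printed above in place of a live API call; `sample_stats` and the derived `viewsPerVideo` column are names introduced here for illustration:

```python
import pandas as pd

# Sample row copied from the output above; counts are strings, as in the API response.
sample_stats = pd.DataFrame([{
    'channelName': 'David Bombal',
    'subscribers': '1710000',
    'views': '106653492',
    'totalVideos': '1243',
    'playlistId': 'UUP7WmQ_U4GB3K51Od9QvM0w',
}])

count_cols = ['subscribers', 'views', 'totalVideos']
sample_stats[count_cols] = sample_stats[count_cols].apply(pd.to_numeric, errors='coerce')

# Derived metric for illustration: average views per uploaded video.
sample_stats['viewsPerVideo'] = sample_stats['views'] / sample_stats['totalVideos']
```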


In [8]:
# This code uses the YouTube Data API to retrieve the first page of items in the playlist with the ID "UUP7WmQ_U4GB3K51Od9QvM0w" (the channel's uploads playlist), including each video's metadata, and prints the response.

request = youtube.playlistItems().list(
    part="snippet,contentDetails",
    playlistId="UUP7WmQ_U4GB3K51Od9QvM0w"
    )
response = request.execute()

print(response)

{'kind': 'youtube#playlistItemListResponse', 'etag': 'ZAn0VuX5nL43lclv4_TZthy4Qfg', 'nextPageToken': 'EAAaBlBUOkNBVQ', 'items': [{'kind': 'youtube#playlistItem', 'etag': 'wUf-NhyBZOk5AENf0UQourkLQsE', 'id': 'VVVQN1dtUV9VNEdCM0s1MU9kOVF2TTB3LjZtbWdpNjVPakJN', 'snippet': {'publishedAt': '2023-03-19T15:00:03Z', 'channelId': 'UCP7WmQ_U4GB3K51Od9QvM0w', 'title': "What's the Best Operating System? #shorts", 'description': '#youtubeshorts #linux #windows', 'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/6mmgi65OjBM/default.jpg', 'width': 120, 'height': 90}, 'medium': {'url': 'https://i.ytimg.com/vi/6mmgi65OjBM/mqdefault.jpg', 'width': 320, 'height': 180}, 'high': {'url': 'https://i.ytimg.com/vi/6mmgi65OjBM/hqdefault.jpg', 'width': 480, 'height': 360}, 'standard': {'url': 'https://i.ytimg.com/vi/6mmgi65OjBM/sddefault.jpg', 'width': 640, 'height': 480}, 'maxres': {'url': 'https://i.ytimg.com/vi/6mmgi65OjBM/maxresdefault.jpg', 'width': 1280, 'height': 720}}, 'channelTitle': 'David Bomb

In [9]:
# Creating the playlist_id variable

playlist_id = "UUP7WmQ_U4GB3K51Od9QvM0w"

In [10]:
# Defining a function that takes a YouTube API object and a playlist ID and retrieves all the video IDs within the playlist, even if the playlist contains more than 50 videos (the maximum number of results per request). It does this by using the nextPageToken in the response to make repeated requests until all video IDs have been retrieved.

def get_video_ids(youtube, playlist_id):
    
    video_ids = []
    
    request = youtube.playlistItems().list(
        part="snippet,contentDetails",
        playlistId=playlist_id,
        maxResults = 50
    )
    response = request.execute()
    
    for item in response['items']:
        video_ids.append(item['contentDetails']['videoId'])
        
    next_page_token = response.get('nextPageToken')
    while next_page_token is not None:
        request = youtube.playlistItems().list(
                    part='contentDetails',
                    playlistId = playlist_id,
                    maxResults = 50,
                    pageToken = next_page_token)
        response = request.execute()

        for item in response['items']:
            video_ids.append(item['contentDetails']['videoId'])

        next_page_token = response.get('nextPageToken')
        
    return video_ids
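
The paging logic can be tried without spending quota by pointing the same loop at a stand-in for the API. In the sketch below, `fake_playlist_page` is a hypothetical function that mimics only the `items` and `nextPageToken` fields of a `playlistItems().list` response; it is not part of the Google client library, and the sample IDs are made up.

```python
def fake_playlist_page(all_ids, page_token=None, max_results=50):
    """Mimic one page of a playlistItems response (hypothetical stand-in)."""
    start = int(page_token) if page_token is not None else 0
    end = start + max_results
    page = {'items': [{'contentDetails': {'videoId': v}} for v in all_ids[start:end]]}
    if end < len(all_ids):
        page['nextPageToken'] = str(end)  # real tokens are opaque strings
    return page


def collect_video_ids(all_ids, max_results=50):
    """Follow nextPageToken until the last page, as get_video_ids does."""
    video_ids = []
    page_token = None
    while True:
        response = fake_playlist_page(all_ids, page_token, max_results)
        video_ids.extend(item['contentDetails']['videoId'] for item in response['items'])
        page_token = response.get('nextPageToken')
        if page_token is None:  # no token means this was the last page
            break
    return video_ids


sample_ids = [f"video{i}" for i in range(123)]
collected = collect_video_ids(sample_ids)
```

Here 123 IDs are spread over three pages (50, 50 and 23), and the loop stops once a response arrives without a `nextPageToken`.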

In [11]:
# The variable video_ids contains a list of video IDs extracted from the given YouTube playlist

video_ids = get_video_ids(youtube, playlist_id)

In [12]:
# Requesting the video_ids

video_ids

['6mmgi65OjBM',
 '5vDucASmO_I',
 'unBOHR7BjwY',
 'Zfz3ZN2dTDM',
 'tm4evj8MZvY',
 'wjGFR38JFUE',
 'bwUXMQqMXaY',
 'z78TCiMrk2w',
 'XjA-GIS9U5U',
 'nHDixd-EdEQ',
 'KCTD61OmnzA',
 'evN89fdKOxU',
 '0qFLgjfZ6Oo',
 'yq9zo6IzP64',
 'jpgKUqEwFsg',
 '8MIIeIa25tE',
 '0hKUZC6L99g',
 'L-TJVyBdF2M',
 'Q9LZZ4ur-bU',
 'x7bSegjbfbU',
 'OVwJ5EMTSK0',
 'aQ_XTBmCXS8',
 '-OAa9k0zCDg',
 'NXpeXn0SKPU',
 'Fw5ybNwwSbg',
 'LWmy3t84AIo',
 'BSugciSUIek',
 'oz7NFc-qm7E',
 'Clu3-5TFdw0',
 'wGI5fcKzMWo',
 'VcV4T8cL3xw',
 '8HMqspTWWzc',
 'Nvv0drH4lmA',
 'tGZoCCoJtXA',
 'trPJaCGBbKU',
 '4ljq8JMFbJM',
 'yKTzek8EZ4E',
 'bK1lsI-ehL8',
 'I3Zd2LF1OAE',
 '20_7tit3IBQ',
 'WqbrB12Jvgc',
 'VF3xlAm_tdo',
 '7GGLi10sHDs',
 '3KYZ5yEqvY4',
 'L5Tn4jU9wz4',
 'QqrK294l_oI',
 '1Kg0Eh7n-RM',
 'Wx_aTsJr75c',
 'yHT4kq36PE4',
 'EMFIUDfQHCI',
 'UYt0r5Rw2gE',
 'F6l2Bmh7Dq4',
 'yDwxmF0Kn-Q',
 'f2BjFilLDqQ',
 'ObUgYDn1zZ0',
 'gii-IMlv6_Q',
 'Qb8Wvo9u5zE',
 'To5Nbs6DmIA',
 'wikyhVFPiDA',
 '5LvqU3-iINk',
 'nyhytT2tRN0',
 'TuIsHKUkI0g',
 'CkVvB5

In [13]:
def get_channel_stats(youtube, channel_ids):
    
    """
    Get channel stats
    
    Parameters:
    ------
    youtube: build object of the YouTube API
    channel_ids: list of channel IDs
    
    Returns:
    ------
    dataframe with all channel stats for each channel ID
    
    """
    
    all_data = []
    
    request = youtube.channels().list(
        part="snippet,contentDetails,statistics",
        id=','.join(channel_ids)
    )
    response = request.execute()

    # loop through items
    for item in response['items']:
        data = {'channelName': item['snippet']['title'],
                'subscribers': item['statistics']['subscriberCount'],
                'views': item['statistics']['viewCount'],
                'totalVideos': item['statistics']['videoCount'],
                'playlistId': item['contentDetails']['relatedPlaylists']['uploads']
        }
        
        all_data.append(data)
        
    return pd.DataFrame(all_data)

def get_video_ids(youtube, playlist_id):
    
    video_ids = []
    
    request = youtube.playlistItems().list(
        part="snippet,contentDetails",
        playlistId=playlist_id,
        maxResults = 50
    )
    response = request.execute()
    
    for item in response['items']:
        video_ids.append(item['contentDetails']['videoId'])
        
    next_page_token = response.get('nextPageToken')
    while next_page_token is not None:
        request = youtube.playlistItems().list(
                    part='contentDetails',
                    playlistId = playlist_id,
                    maxResults = 50,
                    pageToken = next_page_token)
        response = request.execute()

        for item in response['items']:
            video_ids.append(item['contentDetails']['videoId'])

        next_page_token = response.get('nextPageToken')
        
    return video_ids
    
    
def get_video_details(youtube, video_ids):

    all_video_info = []
    
    for i in range(0, len(video_ids), 50):
        request = youtube.videos().list(
            part="snippet,contentDetails,statistics",
            id=','.join(video_ids[i:i+50])
        )
        response = request.execute() 

        for video in response['items']:
            stats_to_keep = {'snippet': ['channelTitle', 'title', 'description', 'tags', 'publishedAt'],
                             'statistics': ['viewCount', 'likeCount', 'favoriteCount', 'commentCount'],
                             'contentDetails': ['duration', 'definition', 'caption']
                            }
            video_info = {}
            video_info['video_id'] = video['id']

            for k in stats_to_keep.keys():
                for v in stats_to_keep[k]:
                    try:
                        video_info[v] = video[k][v]
                    except KeyError:  # field absent for this video (e.g. no tags)
                        video_info[v] = None

            all_video_info.append(video_info)
    
    return pd.DataFrame(all_video_info)

In [14]:
# This line of code calls the get_channel_stats() function with the youtube and channel_ids arguments, and assigns the returned channel statistics to the channel_stats variable.

channel_stats = get_channel_stats(youtube, channel_ids)

In [15]:
channel_stats 

Unnamed: 0,channelName,subscribers,views,totalVideos,playlistId
0,David Bombal,1710000,106653492,1243,UUP7WmQ_U4GB3K51Od9QvM0w


In [16]:
# This function retrieves video details (e.g. view count, title, tags, duration) from a list of video IDs using the YouTube Data API, and stores them in a Pandas dataframe.

def get_video_details(youtube, video_ids):
    all_video_info = []

    for i in range(0, len(video_ids), 50):
        request = youtube.videos().list(
            part="snippet,contentDetails,statistics",
            id=','.join(video_ids[i:i+50])
        )
        response = request.execute()

        for video in response['items']:
            stats_to_keep = {
                'snippet': ['channelTitle', 'title', 'description', 'tags', 'publishedAt'],
                'statistics': ['viewCount', 'likeCount', 'favoriteCount', 'commentCount'],
                'contentDetails': ['duration', 'definition', 'caption']
            }

            video_info = {}
            video_info['video_id'] = video['id']

            for k in stats_to_keep.keys():
                for v in stats_to_keep[k]:
                    try:
                        video_info[v] = video[k][v]
                    except KeyError:  # field absent for this video (e.g. no tags)
                        video_info[v] = None

            all_video_info.append(video_info)

    return pd.DataFrame(all_video_info)
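
Two details of the function above can be illustrated offline: splitting the IDs into batches of at most 50, and falling back to `None` when a field such as `tags` is absent. The sketch below uses `dict.get` as an alternative to `try`/`except`; `chunked`, `extract_fields` and the sample `video` dictionary are made up for illustration.

```python
def chunked(items, size=50):
    """Yield successive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


def extract_fields(video, fields_by_part):
    """Flatten selected fields of one video resource; missing fields become None."""
    info = {'video_id': video['id']}
    for part, fields in fields_by_part.items():
        for field in fields:
            info[field] = video.get(part, {}).get(field)
    return info


fields_by_part = {
    'snippet': ['title', 'tags'],
    'statistics': ['viewCount', 'likeCount'],
}

# Hand-written sample with no tags and no likeCount
video = {'id': 'abc123',
         'snippet': {'title': 'Example title'},
         'statistics': {'viewCount': '100'}}

batches = list(chunked(list(range(120))))
row = extract_fields(video, fields_by_part)
```

The 120 placeholder IDs end up in batches of 50, 50 and 20, and the missing `tags` and `likeCount` fields appear as `None` in `row`.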


In [17]:
video_df = get_video_details(youtube, video_ids)

In [18]:
video_df

Unnamed: 0,video_id,channelTitle,title,description,tags,publishedAt,viewCount,likeCount,favoriteCount,commentCount,duration,definition,caption
0,6mmgi65OjBM,David Bombal,What's the Best Operating System? #shorts,#youtubeshorts #linux #windows,"[linux, ubuntu, windows, windows10, windows11,...",2023-03-19T15:00:03Z,142138,6202,0,442,PT43S,hd,false
1,5vDucASmO_I,David Bombal,Flipper Zero Capture & Replay Sub-Ghz Signals ...,#youtubeshorts #flipperzero #flipper,"[flipper zero, flipperzero, bluetooth, nfc, rf...",2023-03-14T15:01:00Z,98592,6146,0,73,PT53S,hd,false
2,unBOHR7BjwY,David Bombal,Do you know what a terminator is? #shorts,#youtubeshorts #internet #starlink,"[terminator, terminate, ethernet, 10base5, sta...",2023-03-09T15:00:40Z,72155,3418,0,100,PT57S,hd,false
3,Zfz3ZN2dTDM,David Bombal,The best Hacking Courses & Certs (not all thes...,This is your path to becoming a Pentester in 2...,"[oscp, pnpt, pentester, hacker, hack, hacking,...",2023-03-05T15:00:31Z,139734,6201,0,564,PT39M21S,hd,false
4,tm4evj8MZvY,David Bombal,Pass or Fail? Does RAID actually work?,Does a Synology NAS actually work? Should you ...,"[video, synology, sharing, nas, raid, video ph...",2023-02-24T15:00:48Z,39891,1559,0,133,PT14M42S,hd,false
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1239,2iYr4YWq9sg,David Bombal,Cisco ASA 7X Easy VPN made easy - VPN client A...,See how to configure a Cisco ASA / PIX Easy VP...,"[Cisco, ASA, PIX, VPN, IPSec, Site, to, EasyVP...",2008-09-16T19:50:17Z,10662,11,0,1,PT2M42S,sd,false
1240,pFZEClD438I,David Bombal,Cisco Router Configurations made easy by Confi...,See how to configure a Cisco Router quickly an...,"[Cisco, CCNA, ICND, Router, Wireless, Pass, SS...",2008-09-15T21:37:22Z,34031,18,0,1,PT3M22S,sd,false
1241,S1yGGfyL1jU,David Bombal,Cisco ASA 7.X to ASA Site to Site VPN made easy,See how to configure a Cisco ASA to ASA Site t...,"[Cisco, ASA, PIX, VPN, IPSec, Site, to, EasyVP...",2008-09-14T19:29:04Z,12518,10,0,0,PT3M19S,sd,false
1242,8DJhwKQpSE8,David Bombal,Cisco ASA Easy VPN made easy - VPN client ASA VPN,See how to configure a Cisco ASA / PIX Easy VP...,"[Cisco, ASA, PIX, VPN, IPSec, Site, to, EasyVP...",2008-09-14T19:04:52Z,40168,15,0,2,PT2M,sd,false


In [19]:
playlist_id = "UUP7WmQ_U4GB3K51Od9QvM0w"

In [20]:
# Get video IDs
video_ids = get_video_ids(youtube, playlist_id)

In [21]:
len(video_ids)

1244

In [22]:
import googleapiclient.errors 
from googleapiclient.errors import HttpError

In [23]:
# This code defines a function get_comments_in_videos which takes a YouTube API instance and a list of video IDs, and returns a Pandas DataFrame containing the top-level comments (first page of results only) for each video. It iterates through each video, extracts the comments using the YouTube API, and adds them to a list which is then converted into a DataFrame. If the request is rejected with a 403 status, which is treated here as comments being disabled, it skips that video and prints a message.

def get_comments_in_videos(youtube, video_ids):
    all_comments = []
    for video_id in video_ids:
        try:
            request = youtube.commentThreads().list(
                part="snippet,replies",
                videoId=video_id
            )
            response = request.execute()
            comments_in_video = [comment['snippet']['topLevelComment']['snippet']['textOriginal'] for comment in response['items']]
            comments_in_video_info = {'video_id': video_id, 'comments': comments_in_video}
            all_comments.append(comments_in_video_info)
        except HttpError as error:
            if error.resp.status == 403:
                print(f"Comments disabled for video {video_id}. Skipping...")
            else:
                raise
    return pd.DataFrame(all_comments)
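
Each row of the returned frame holds a list of comments. If one row per comment is more convenient for later text analysis, `DataFrame.explode` can reshape it. A sketch on made-up data (the IDs and comment texts below are invented):

```python
import pandas as pd

sample_comments = pd.DataFrame([
    {'video_id': 'vid_a', 'comments': ['Great video', 'Thanks!']},
    {'video_id': 'vid_b', 'comments': ['Very helpful']},
])

# One row per (video_id, comment) pair
comments_long = sample_comments.explode('comments', ignore_index=True)

# Number of collected comments per video
comments_per_video = comments_long.groupby('video_id')['comments'].count()
```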


In [None]:
comments_df = get_comments_in_videos(youtube, video_ids)

Comments disabled for video 20_7tit3IBQ. Skipping...


In [None]:
comments_df

In [None]:
len(comments_df)

# Part II - Analysis 

# Data pre-processing

In [144]:
# Checking for NULL values
video_df.isnull().any()

video_id         False
channelTitle     False
title            False
description      False
tags              True
publishedAt      False
viewCount        False
likeCount        False
favoriteCount    False
commentCount     False
duration         False
definition       False
caption          False
dtype: bool

In [145]:
# Check data types
video_df.dtypes

video_id         object
channelTitle     object
title            object
description      object
tags             object
publishedAt      object
viewCount        object
likeCount        object
favoriteCount    object
commentCount     object
duration         object
definition       object
caption          object
dtype: object
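
Since every column is stored as `object`, the date and duration strings need parsing before they can be analysed. The following is a sketch of one way to do it, not necessarily the approach used later in the project: `pd.to_datetime` for `publishedAt`, and a small regular expression for ISO 8601 durations such as `PT39M21S`. The helper `duration_to_seconds` and the new column names are hypothetical, and the regex only covers hours, minutes and seconds.

```python
import re
import pandas as pd

DURATION_RE = re.compile(r'^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$')


def duration_to_seconds(duration):
    """Convert an ISO 8601 duration like 'PT39M21S' to seconds (H/M/S only)."""
    match = DURATION_RE.match(duration)
    if match is None:
        return None  # unsupported format, e.g. durations that include days
    hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds


# Two rows copied from the video table above
sample_dates = pd.DataFrame({
    'publishedAt': ['2023-03-19T15:00:03Z', '2023-03-05T15:00:31Z'],
    'duration': ['PT43S', 'PT39M21S'],
})
sample_dates['publishedAt'] = pd.to_datetime(sample_dates['publishedAt'])
sample_dates['publishDayName'] = sample_dates['publishedAt'].dt.day_name()
sample_dates['durationSecs'] = sample_dates['duration'].apply(duration_to_seconds)
```

The day name supports the question about upload days, and the duration in seconds can be compared directly with views and interactions.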

In [146]:
video_df

Unnamed: 0,video_id,channelTitle,title,description,tags,publishedAt,viewCount,likeCount,favoriteCount,commentCount,duration,definition,caption
0,6mmgi65OjBM,David Bombal,What's the Best Operating System? #shorts,#youtubeshorts #linux #windows,"[linux, ubuntu, windows, windows10, windows11,...",2023-03-19T15:00:03Z,142091,6200,0,442,PT43S,hd,false
1,5vDucASmO_I,David Bombal,Flipper Zero Capture & Replay Sub-Ghz Signals ...,#youtubeshorts #flipperzero #flipper,"[flipper zero, flipperzero, bluetooth, nfc, rf...",2023-03-14T15:01:00Z,98543,6136,0,73,PT53S,hd,false
2,unBOHR7BjwY,David Bombal,Do you know what a terminator is? #shorts,#youtubeshorts #internet #starlink,"[terminator, terminate, ethernet, 10base5, sta...",2023-03-09T15:00:40Z,72142,3414,0,100,PT57S,hd,false
3,Zfz3ZN2dTDM,David Bombal,The best Hacking Courses & Certs (not all thes...,This is your path to becoming a Pentester in 2...,"[oscp, pnpt, pentester, hacker, hack, hacking,...",2023-03-05T15:00:31Z,139711,6201,0,564,PT39M21S,hd,false
4,tm4evj8MZvY,David Bombal,Pass or Fail? Does RAID actually work?,Does a Synology NAS actually work? Should you ...,"[video, synology, sharing, nas, raid, video ph...",2023-02-24T15:00:48Z,39891,1559,0,133,PT14M42S,hd,false
...,...,...,...,...,...,...,...,...,...,...,...,...,...
1239,2iYr4YWq9sg,David Bombal,Cisco ASA 7X Easy VPN made easy - VPN client A...,See how to configure a Cisco ASA / PIX Easy VP...,"[Cisco, ASA, PIX, VPN, IPSec, Site, to, EasyVP...",2008-09-16T19:50:17Z,10662,11,0,1,PT2M42S,sd,false
1240,pFZEClD438I,David Bombal,Cisco Router Configurations made easy by Confi...,See how to configure a Cisco Router quickly an...,"[Cisco, CCNA, ICND, Router, Wireless, Pass, SS...",2008-09-15T21:37:22Z,34031,18,0,1,PT3M22S,sd,false
1241,S1yGGfyL1jU,David Bombal,Cisco ASA 7.X to ASA Site to Site VPN made easy,See how to configure a Cisco ASA to ASA Site t...,"[Cisco, ASA, PIX, VPN, IPSec, Site, to, EasyVP...",2008-09-14T19:29:04Z,12518,10,0,0,PT3M19S,sd,false
1242,8DJhwKQpSE8,David Bombal,Cisco ASA Easy VPN made easy - VPN client ASA VPN,See how to configure a Cisco ASA / PIX Easy VP...,"[Cisco, ASA, PIX, VPN, IPSec, Site, to, EasyVP...",2008-09-14T19:04:52Z,40168,15,0,2,PT2M,sd,false


In [114]:
# Convert count columns to numeric
# Note: the column is named 'favoriteCount' (the API's spelling); 'favouriteCount' raises a KeyError
numeric_cols = ['viewCount', 'likeCount', 'favoriteCount', 'commentCount']
video_df[numeric_cols] = video_df[numeric_cols].apply(pd.to_numeric, errors='coerce')
print(video_df.columns)

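
Once the counts are numeric, features for the questions in section 1.2 can be derived. A sketch under stated assumptions: the first row is adapted from the table above with its tag list cut to the visible part, the second row is entirely hypothetical and stands for a video without tags, and `titleLength`, `tagCount` and `likeRatio` are column names introduced here for illustration.

```python
import pandas as pd

# Row 0: adapted from the table above (tag list truncated to what is visible).
# Row 1: hypothetical video with no tags, to show how None is handled.
sample_videos = pd.DataFrame({
    'title': ["What's the Best Operating System? #shorts",
              "Hypothetical video without tags"],
    'tags': [['linux', 'ubuntu', 'windows', 'windows10', 'windows11'], None],
    'viewCount': ['142091', '1000'],
    'likeCount': ['6200', '50'],
})

numeric_cols = ['viewCount', 'likeCount']
sample_videos[numeric_cols] = sample_videos[numeric_cols].apply(pd.to_numeric, errors='coerce')

sample_videos['titleLength'] = sample_videos['title'].str.len()
sample_videos['tagCount'] = sample_videos['tags'].apply(lambda t: 0 if t is None else len(t))
sample_videos['likeRatio'] = sample_videos['likeCount'] / sample_videos['viewCount']  # likes per view
```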

In [None]:
api_key = "YOUR_API_KEY"  # placeholder; never publish a real key

def get_comments_in_videos(youtube, video_ids):
    all_comments = []
    for video_id in video_ids:
        request = youtube.commentThreads().list(
            part="snippet,replies",
            videoId=video_id
        )
        response = request.execute()
        comments_in_video = [comment['snippet']['topLevelComment']['snippet']['textOriginal'] for comment in response['items']]
        comments_in_video_info = {'video_id': video_id, 'comments': comments_in_video}
        all_comments.append(comments_in_video_info)
    return pd.DataFrame(all_comments)

In [None]:
# Testing whether requesting data from the YouTube channel works

request = youtube.channels().list(
        part="snippet,contentDetails,statistics",
        id=','.join(channel_ids)
    )
response = request.execute()

JSON(response)

In [None]:
def get_video_details(youtube, video_ids):

    all_video_info = []
    
    for i in range(0, len(video_ids), 50):
        request = youtube.videos().list(
            part="snippet,contentDetails,statistics",
            id=','.join(video_ids[i:i+50])
        )
        response = request.execute() 

        for video in response['items']:
            stats_to_keep = {'snippet': ['channelTitle', 'title', 'description', 'tags', 'publishedAt'],
                             'statistics': ['viewCount', 'likeCount', 'favoriteCount', 'commentCount'],
                             'contentDetails': ['duration', 'definition', 'caption']
                            }
            video_info = {}
            video_info['video_id'] = video['id']

            for k in stats_to_keep.keys():
                for v in stats_to_keep[k]:
                    try:
                        video_info[v] = video[k][v]
                    except KeyError:  # field absent for this video (e.g. no tags)
                        video_info[v] = None

            all_video_info.append(video_info)
    
    return pd.DataFrame(all_video_info)