## Exploratory Data Analysis Using YouTube Video Data from the Most Popular Telemedicine Channels in Myanmar

### 1. Aims, objectives and background

#### 1.1. Introduction
##### Founded in 2005, YouTube has grown to become the second largest search engine in the world (behind Google), processing more than 3 billion searches per month [1]. It is, however, largely a mystery how the YouTube algorithm works and what makes one video get views and be recommended over another. In fact, YouTube has one of the largest-scale and most sophisticated industrial recommendation systems in existence [2]. For new content creators, it is a challenge to understand why one video gets views and others do not. There are many "myths" around the success of a YouTube video [3], for example that it depends on the number of likes or comments, or on the video being of a certain duration. It is also worth experimenting and looking for "trends" in the topics that YouTube channels cover in a certain niche.

##### Having recently stepped into the content creation world with a new YouTube channel on data analytics and data science, I decided to gain some insights on this topic that might be useful for other new content creators. The scope of this small project is limited to telemedicine channels from Myanmar; I will not consider other niches (which might have different characteristics and audience bases). Therefore, this project explores the statistics of five of the most popular telemedicine YouTube channels from Myanmar.

#### 1.2. Aims and objectives
##### Within this project, I would like to explore the following:

##### Getting to know the YouTube API and how to obtain video data.
##### Analyzing video data and verifying common "myths" about what makes a video do well on YouTube, for example:
##### Does the number of likes and comments matter for a video to get more views?
##### Does the video duration matter for views and interaction (likes/comments)?
##### Does title length matter for views?
##### How many tags do well-performing videos have? What are the common tags among these videos?
##### Across all the creators I take into consideration, how often do they upload new videos? On which days of the week?
##### Exploring the trending topics using NLP techniques:
##### Which popular topics are being covered in the videos (e.g. using a word cloud for video titles)?
##### Which questions are being asked in the comment sections of the videos?

In [1]:
import pandas as pd
import numpy as np
from dateutil import parser
import isodate

# Data visualization libraries
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
sns.set(style="darkgrid", color_codes=True)
import plotly.express as px

# Google API
from googleapiclient.discovery import build

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

In [2]:
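import os

# Read the API key from the environment instead of hard-coding it in the notebook.
# YOUTUBE_API_KEY is an assumed variable name; set it before running this cell.
api_key = os.environ['YOUTUBE_API_KEY']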
channel_id = 'UCwaed_IVBHjym8YVXhE9vLA'
channel_ids = ['UCwaed_IVBHjym8YVXhE9vLA', #myancare
               'UC3VMnv-y9D4PdEn29HE_Dtg', #mydoctor
               'UC_LoVzylC4pqCw8bJlquw1g', #zwaka
               'UC2IbWQXYC5SOQgWPKK7rvmg', #healtppy
               'UCHYkIkrhNMJQF2HzKbaTr5A', #ondoctor
]

youtube = build('youtube','v3', developerKey=api_key)

In [3]:
#function to get channel statistics 
def get_channel_stats(youtube, channel_ids):
    all_data = []
    request = youtube.channels().list(
        part='snippet,contentDetails,statistics',
        id=','.join(channel_ids))
    response = request.execute()

    for i in range(len(response['items'])):
        data = dict(channel_name = response['items'][i]['snippet']['title'],
                    subscribers = response['items'][i]['statistics']['subscriberCount'],
                    views = response['items'][i]['statistics']['viewCount'],
                    total_video = response['items'][i]['statistics']['videoCount'],
                    playlistId = response['items'][i]['contentDetails']['relatedPlaylists']['uploads'])
        all_data.append(data)
    return pd.DataFrame(all_data)

def get_video_ids(youtube, playlist_id):
    """
    Get list of video IDs of all videos in the given playlist
    Params:
    
    youtube: the build object from googleapiclient.discovery
    playlist_id: playlist ID of the channel
    
    Returns:
    List of video IDs of all videos in the playlist
    
    """
    
    request = youtube.playlistItems().list(
                part='contentDetails',
                playlistId = playlist_id,
                maxResults = 50)
    response = request.execute()
    
    video_ids = []
    
    for i in range(len(response['items'])):
        video_ids.append(response['items'][i]['contentDetails']['videoId'])
        
    next_page_token = response.get('nextPageToken')
    more_pages = True
    
    while more_pages:
        if next_page_token is None:
            more_pages = False
        else:
            request = youtube.playlistItems().list(
                        part='contentDetails',
                        playlistId = playlist_id,
                        maxResults = 50,
                        pageToken = next_page_token)
            response = request.execute()
    
            for i in range(len(response['items'])):
                video_ids.append(response['items'][i]['contentDetails']['videoId'])
            
            next_page_token = response.get('nextPageToken')
        
    return video_ids

def get_video_details(youtube, video_ids):
    """
    Get video statistics of all videos with given IDs
    Params:
    
    youtube: the build object from googleapiclient.discovery
    video_ids: list of video IDs
    
    Returns:
    Dataframe with statistics of videos, i.e.:
        'channelTitle', 'title', 'description', 'tags', 'publishedAt'
        'viewCount', 'likeCount', 'favoriteCount', 'commentCount'
        'duration', 'definition', 'caption'
    """
        
    all_video_info = []
    
    for i in range(0, len(video_ids), 50):
        request = youtube.videos().list(
            part="snippet,contentDetails,statistics",
            id=','.join(video_ids[i:i+50])
        )
        response = request.execute() 

        for video in response['items']:
            # Note: the API spells this statistic 'favoriteCount', so the
            # 'favouriteCount' key below never matches and the column stays empty.
            stats_to_keep = {'snippet': ['channelTitle', 'title', 'description', 'tags', 'publishedAt'],
                             'statistics': ['viewCount', 'likeCount', 'favouriteCount', 'commentCount'],
                             'contentDetails': ['duration', 'definition', 'caption']
                            }
            video_info = {}
            video_info['video_id'] = video['id']

            for k in stats_to_keep.keys():
                for v in stats_to_keep[k]:
                    # Fields the API omits (e.g. tags, or hidden counts) become None
                    video_info[v] = video.get(k, {}).get(v)

            all_video_info.append(video_info)
            
    return pd.DataFrame(all_video_info)

def get_comments_in_videos(youtube, video_ids):
    """
    Get top level comments as text from all videos with given IDs (only the first 10 comments due to the quota limit of the YouTube API)
    Params:
    
    youtube: the build object from googleapiclient.discovery
    video_ids: list of video IDs
    
    Returns:
    Dataframe with video IDs and associated top level comment in text.
    
    """
    all_comments = []
    
    for video_id in video_ids:
        try:   
            request = youtube.commentThreads().list(
                part="snippet,replies",
                videoId=video_id
            )
            response = request.execute()
        
            comments_in_video = [comment['snippet']['topLevelComment']['snippet']['textOriginal'] for comment in response['items'][0:10]]
            comments_in_video_info = {'video_id': video_id, 'comments': comments_in_video}

            all_comments.append(comments_in_video_info)
            
        except Exception:
            # Most likely because comments are disabled on the video
            print('Could not get comments for video ' + video_id)
        
    return pd.DataFrame(all_comments)
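
The pagination in `get_video_ids` writes the request code twice, once for the first page and once inside the loop. As a minimal sketch of the same logic in a single loop, the helper below follows `nextPageToken` until it is missing. `collect_all_pages` and `FAKE_PAGES` are hypothetical names used only for illustration; the fake pages stand in for real API responses so the example runs without a key.

```python
def collect_all_pages(fetch_page):
    """Collect items from every page of a paginated listing.

    fetch_page(token) must return a dict with an 'items' list and,
    when more pages exist, a 'nextPageToken' string.
    """
    items = []
    token = None  # None requests the first page
    while True:
        response = fetch_page(token)
        items.extend(response.get('items', []))
        token = response.get('nextPageToken')
        if token is None:
            break  # no further pages
    return items


# Hypothetical stand-in for the API: three pages of fake video IDs.
FAKE_PAGES = {
    None: {'items': ['a', 'b'], 'nextPageToken': 'p2'},
    'p2': {'items': ['c', 'd'], 'nextPageToken': 'p3'},
    'p3': {'items': ['e']},
}

all_items = collect_all_pages(lambda token: FAKE_PAGES[token])
print(all_items)
```

With the real client, `fetch_page` would wrap the `youtube.playlistItems().list(...).execute()` call and pass the token as `pageToken`.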

In [4]:
channel_data = get_channel_stats(youtube, channel_ids)
channel_data = pd.DataFrame(channel_data)

In [5]:
channel_data

Unnamed: 0,channel_name,subscribers,views,total_video,playlistId
0,OnDoctor Healthcare App,62,13032,13,UUHYkIkrhNMJQF2HzKbaTr5A
1,Healtppy - ဟက်ပီး,720,22448,123,UU2IbWQXYC5SOQgWPKK7rvmg
2,z-waka,66,90,3,UU_LoVzylC4pqCw8bJlquw1g
3,MyanCare,8220,219197,47,UUwaed_IVBHjym8YVXhE9vLA
4,myDoctor,16100,1028737,129,UU3VMnv-y9D4PdEn29HE_Dtg


In [6]:
channel_data.dtypes

channel_name    object
subscribers     object
views           object
total_video     object
playlistId      object
dtype: object

In [7]:
# Convert count columns to numeric columns
numeric_cols = ['subscribers', 'views', 'total_video']
channel_data[numeric_cols] = channel_data[numeric_cols].apply(pd.to_numeric, errors='coerce')
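
To see what `errors='coerce'` does, here is a small sketch on made-up data. The `'hidden'` value is an assumption for illustration (it does not come from the channels above); it shows that anything that cannot be parsed as a number becomes `NaN` instead of raising an error.

```python
import pandas as pd

# Counts arrive from the API as strings; 'hidden' is an invented bad value.
sample = pd.DataFrame({
    'subscribers': ['62', '720', 'hidden'],
    'views': ['13032', '22448', '90'],
})

# Column-wise conversion; unparseable entries are turned into NaN.
converted = sample.apply(pd.to_numeric, errors='coerce')
print(converted.dtypes)
print(converted)
```

A column containing a coerced value ends up as a float column, because `NaN` cannot be stored in an integer column.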

In [8]:
channel_data.head()

Unnamed: 0,channel_name,subscribers,views,total_video,playlistId
0,OnDoctor Healthcare App,62,13032,13,UUHYkIkrhNMJQF2HzKbaTr5A
1,Healtppy - ဟက်ပီး,720,22448,123,UU2IbWQXYC5SOQgWPKK7rvmg
2,z-waka,66,90,3,UU_LoVzylC4pqCw8bJlquw1g
3,MyanCare,8220,219197,47,UUwaed_IVBHjym8YVXhE9vLA
4,myDoctor,16100,1028737,129,UU3VMnv-y9D4PdEn29HE_Dtg


In [9]:
tb = channel_data.groupby('channel_name')['subscribers'].sum().sort_values(ascending=False)
tb = pd.DataFrame(tb)

In [10]:
fig = px.bar(tb, title='Subscribers Distribution',
             labels={'channel_name':'Channel Titles',
                     'value':'Total Subscribers'})
fig.show()

#### Get video statistics for all the channels
##### In the next step, we will obtain the video statistics for all the channels. In total, this yields 315 videos (the sum of the `total_video` column above), as seen below.
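
`get_video_details` sends the video IDs in slices of at most 50 per request. The sketch below isolates that batching step; `chunked` is a hypothetical helper name and the IDs are placeholders.

```python
def chunked(ids, size=50):
    """Yield consecutive slices of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


# Placeholder IDs, e.g. a channel with 123 uploads.
video_ids = [f'video{i}' for i in range(123)]
batches = list(chunked(video_ids))
print([len(batch) for batch in batches])  # [50, 50, 23]
```

Each batch would then be joined with `','.join(batch)` and passed as the `id` argument, as in the function above.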

In [11]:
channel_data['channel_name'].unique()

array(['OnDoctor Healthcare App', 'Healtppy - ဟက်ပီး', 'z-waka',
       'MyanCare', 'myDoctor'], dtype=object)

In [12]:
# Collect one DataFrame of video details per channel
video_dfs = []

for c in channel_data['channel_name'].unique():
    print("Getting video information from channel: " + c)
    playlist_id = channel_data.loc[channel_data['channel_name'] == c, 'playlistId'].iloc[0]
    video_ids = get_video_ids(youtube, playlist_id)

    # Get video details for every video in the uploads playlist
    video_data = get_video_details(youtube, video_ids)
    video_dfs.append(video_data)

# Concatenate the per-channel DataFrames after the loop
video_df = pd.concat(video_dfs, ignore_index=True)


Getting video information from channel: OnDoctor Healthcare App
Getting video information from channel: Healtppy - ဟက်ပီး
Getting video information from channel: z-waka
Getting video information from channel: MyanCare
Getting video information from channel: myDoctor


In [13]:
video_df

Unnamed: 0,video_id,channelTitle,title,description,tags,publishedAt,viewCount,likeCount,favouriteCount,commentCount,duration,definition,caption
0,Biy9L6i-z3c,OnDoctor Healthcare App,အရက်နာကျတဲ့အခါ,အရက်နာကျတဲ့အခါ ငူငူငိုင်ငိုင်ဖြစ်မနေအောင် လိုက...,,2022-09-06T05:30:02Z,7,1,,0,PT47S,hd,false
1,8m9-71eOy2A,OnDoctor Healthcare App,ခါးနာခြင်းကို အမြန်ဆုံးသက်သာစေမယ့်နည်းလမ်းကောင...,ခါးနာခြင်းကို အမြန်ဆုံးသက်သာစေမယ့်နည်းလမ်းကောင...,,2022-09-06T03:24:43Z,504,2,,0,PT40S,hd,false
2,i7zsyVd6sbQ,OnDoctor Healthcare App,ရေခဲရေသောက်ခြင်းရဲ့ နောက်ဆက်တွဲဆိုးကျိုးတွေက ဘ...,သင်ဟာ ရေခဲရေကြိုက်တတ်သူတစ်ယောက်ဆိုရင်တော့ ဒီဗီ...,,2022-09-01T10:20:52Z,42,2,,0,PT46S,hd,false
3,_ccDTwztkyw,OnDoctor Healthcare App,အအိပ်လွန်ခြင်းရဲ့ နောက်ဆက်တွဲဘေးထွက်ဆိုးကျိုးများ,တစ်နေ့တာပင်ပန်းနွမ်းနယ်လာသမျှအတွက် အိပ်စက်ခြင်...,,2022-08-25T09:30:34Z,200,2,,0,PT41S,hd,false
4,8H7gu5t2UTM,OnDoctor Healthcare App,ရေမချိုးရင် ဘာတွေဖြစ်လာနိုင်လဲ?,ရေမချိုးဘဲနေမယ်ဆို ဘာတွေဖြစ်လာနိုင်လဲဆိုတဲ့ ဆိ...,,2022-08-24T03:36:24Z,190,2,,0,PT43S,hd,false
...,...,...,...,...,...,...,...,...,...,...,...,...,...
310,LLMx7-Ar8Vc,myDoctor,"Introduction Section Regarding "" EAS Foundation """,Uploaded by http://www.mydoctor.com.mm,,2017-03-03T04:41:26Z,9,0,,0,PT3M38S,sd,false
311,h2IeRcc3Gko,myDoctor,ဓာတုပစ္စည်း သောက်သုံးမိခြင်း ( ရှေးဦးသူနာပြုစု...,Uploaded by http://www.mydoctor.com.mm,,2017-03-03T04:37:44Z,87,2,,0,PT2M4S,sd,false
312,oGQUfbPzIHo,myDoctor,သိထားသင့်တဲ့ အရေးပေါ်သန္ဓေတားနည်းများ ( Pill 72 ),Uploaded by http://www.mydoctor.com.mm,[အေရးေပၚသေႏၶတားနည္းမ်ား],2017-02-27T05:13:31Z,3925,54,,1,PT2M14S,sd,false
313,4hduCVs_aqg,myDoctor,ကလေး နှာခေါင်းသွေးလျှံခြင်း ( ရှေးဦးသူနာပြုစုခ...,Uploaded by http://www.mydoctor.com.mm,[ေရွးဦးသူနာျပဳစုျခင္း],2017-02-27T05:10:01Z,956,26,,0,PT1M32S,sd,false


#### Preprocessing & Feature engineering

In [14]:
video_df.isnull().any()

video_id          False
channelTitle      False
title             False
description       False
tags               True
publishedAt       False
viewCount         False
likeCount         False
favouriteCount     True
commentCount       True
duration          False
definition        False
caption           False
dtype: bool

In [15]:
video_df.publishedAt.sort_values().value_counts()

publishedAt
2017-05-17T02:16:20Z    4
2017-05-08T02:14:14Z    2
2021-02-16T02:30:00Z    2
2017-02-27T05:00:13Z    1
2022-09-28T03:43:05Z    1
                       ..
2021-02-04T11:30:07Z    1
2021-02-02T02:30:06Z    1
2021-01-31T13:30:02Z    1
2021-01-29T13:30:01Z    1
2024-01-12T07:42:39Z    1
Name: count, Length: 310, dtype: int64

In [16]:
cols = ['viewCount', 'likeCount', 'favouriteCount', 'commentCount']
# Convert column by column (the default axis); row-wise conversion with axis=1 is slower
video_df[cols] = video_df[cols].apply(pd.to_numeric, errors='coerce')

In [17]:
# Create publish day (in the week) column
video_df['publishedAt'] =  video_df['publishedAt'].apply(lambda x: parser.parse(x)) 
video_df['pushblishDayName'] = video_df['publishedAt'].apply(lambda x: x.strftime("%A")) 
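
As an alternative sketch, the same weekday column can be built with vectorized pandas calls instead of `apply`. The three timestamps are taken from the `publishedAt` values shown earlier; the variable names are only for illustration.

```python
import pandas as pd

published = pd.Series([
    '2022-09-06T05:30:02Z',
    '2022-09-01T10:20:52Z',
    '2022-08-24T03:36:24Z',
])

# Parse the ISO 8601 strings into timezone-aware timestamps (UTC).
published_dt = pd.to_datetime(published, utc=True)

# Weekday name for each timestamp.
day_names = published_dt.dt.day_name()
print(day_names.tolist())  # ['Tuesday', 'Thursday', 'Wednesday']
```

Note that the weekday is computed in UTC; a video published late in the evening in Myanmar would still fall on the same UTC day, but one published early in the morning local time may be counted on the previous day.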

In [18]:
# Parse the ISO 8601 duration string (e.g. 'PT47S') into a timedelta with second resolution
video_df['durationSecs'] = video_df['duration'].apply(lambda x: isodate.parse_duration(x))
video_df['durationSecs'] = video_df['durationSecs'].astype('timedelta64[s]')
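
To make the duration format concrete, here is a dependency-free sketch that turns strings such as `PT3M38S` into a number of seconds. It is not what `isodate` does internally; `duration_to_seconds` is a hypothetical helper that only handles the day, hour, minute and second fields and rejects anything else.

```python
import re

# Matches e.g. 'PT47S', 'PT3M38S', 'PT1H2M3S', 'P1DT1H'.
_DURATION_RE = re.compile(
    r'^P(?:(?P<days>\d+)D)?'
    r'(?:T(?:(?P<hours>\d+)H)?(?:(?P<minutes>\d+)M)?(?:(?P<seconds>\d+)S)?)?$'
)


def duration_to_seconds(value):
    """Convert a simple ISO 8601 duration string to whole seconds."""
    match = _DURATION_RE.match(value)
    if match is None:
        raise ValueError(f'Unsupported duration: {value!r}')
    # Missing fields are treated as zero.
    parts = {name: int(num or 0) for name, num in match.groupdict().items()}
    return (parts['days'] * 86400 + parts['hours'] * 3600
            + parts['minutes'] * 60 + parts['seconds'])


print(duration_to_seconds('PT47S'))    # 47
print(duration_to_seconds('PT3M38S'))  # 218
```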

In [19]:
# Add number of tags
video_df['tagsCount'] = video_df['tags'].apply(lambda x: 0 if x is None else len(x))

In [20]:
# Comments and likes per 1000 view ratio
video_df['likeRatio'] = video_df['likeCount']/ video_df['viewCount'] * 1000
video_df['commentRatio'] = video_df['commentCount']/ video_df['viewCount'] * 1000
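
One caveat with these ratios: a video with zero views and at least one like would produce an infinite value. A minimal sketch of one way to guard against that, on made-up rows (the third row is invented to trigger the edge case):

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    'viewCount': [7.0, 504.0, 0.0],
    'likeCount': [1.0, 2.0, 1.0],
})

# Treat zero views as missing so the ratio becomes NaN instead of inf.
views = sample['viewCount'].replace(0, np.nan)
sample['likeRatio'] = sample['likeCount'] / views * 1000
print(sample)
```

Whether to drop such rows or keep them as missing is a judgment call; `NaN` at least keeps them out of averages and plots.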

In [21]:
# Title character length
video_df['titleLength'] = video_df['title'].apply(lambda x: len(x))

In [22]:
video_df.head()

Unnamed: 0,video_id,channelTitle,title,description,tags,publishedAt,viewCount,likeCount,favouriteCount,commentCount,duration,definition,caption,pushblishDayName,durationSecs,tagsCount,likeRatio,commentRatio,titleLength
0,Biy9L6i-z3c,OnDoctor Healthcare App,အရက်နာကျတဲ့အခါ,အရက်နာကျတဲ့အခါ ငူငူငိုင်ငိုင်ဖြစ်မနေအောင် လိုက...,,2022-09-06 05:30:02+00:00,7.0,1.0,,0.0,PT47S,hd,False,Tuesday,0 days 00:00:47,0,142.857143,0.0,14
1,8m9-71eOy2A,OnDoctor Healthcare App,ခါးနာခြင်းကို အမြန်ဆုံးသက်သာစေမယ့်နည်းလမ်းကောင...,ခါးနာခြင်းကို အမြန်ဆုံးသက်သာစေမယ့်နည်းလမ်းကောင...,,2022-09-06 03:24:43+00:00,504.0,2.0,,0.0,PT40S,hd,False,Tuesday,0 days 00:00:40,0,3.968254,0.0,52
2,i7zsyVd6sbQ,OnDoctor Healthcare App,ရေခဲရေသောက်ခြင်းရဲ့ နောက်ဆက်တွဲဆိုးကျိုးတွေက ဘ...,သင်ဟာ ရေခဲရေကြိုက်တတ်သူတစ်ယောက်ဆိုရင်တော့ ဒီဗီ...,,2022-09-01 10:20:52+00:00,42.0,2.0,,0.0,PT46S,hd,False,Thursday,0 days 00:00:46,0,47.619048,0.0,52
3,_ccDTwztkyw,OnDoctor Healthcare App,အအိပ်လွန်ခြင်းရဲ့ နောက်ဆက်တွဲဘေးထွက်ဆိုးကျိုးများ,တစ်နေ့တာပင်ပန်းနွမ်းနယ်လာသမျှအတွက် အိပ်စက်ခြင်...,,2022-08-25 09:30:34+00:00,200.0,2.0,,0.0,PT41S,hd,False,Thursday,0 days 00:00:41,0,10.0,0.0,49
4,8H7gu5t2UTM,OnDoctor Healthcare App,ရေမချိုးရင် ဘာတွေဖြစ်လာနိုင်လဲ?,ရေမချိုးဘဲနေမယ်ဆို ဘာတွေဖြစ်လာနိုင်လဲဆိုတဲ့ ဆိ...,,2022-08-24 03:36:24+00:00,190.0,2.0,,0.0,PT43S,hd,False,Wednesday,0 days 00:00:43,0,10.526316,0.0,31


#### Exploratory analysis

In [23]:
fig = px.box(video_df, x = "channelTitle", y = "viewCount",  title='Distribution of Views per Channel',
             labels={'channelTitle':'Channel Titles',
                     'viewCount':'Total View'})
fig.show()

In [31]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

# Create subplots with 1 row and 2 columns
fig = make_subplots(rows=1, cols=2, subplot_titles=["Comment Count vs View Count", "Like Count vs View Count"])

# Add scatter plots to the subplots
scatter1 = go.Scatter(x=video_df["commentCount"], y=video_df["viewCount"], mode="markers", name="Comment Count vs View Count")
scatter2 = go.Scatter(x=video_df["likeCount"], y=video_df["viewCount"], mode="markers", name="Like Count vs View Count")

# Add traces to the subplots
fig.add_trace(scatter1, row=1, col=1)
fig.add_trace(scatter2, row=1, col=2)

# Update layout
fig.update_layout(title_text="Subplots of Comment Count and Like Count vs View Count",
                  xaxis_title="Total Comment Count",
                  yaxis_title="Total View Count",
                  xaxis2_title="Total Like Count",
                  yaxis2_title="Total View Count",
                width=1500,  # Set the width to 1500 pixels
                height=500)   # Set the height to 500 pixels

# Show the plot
fig.show()


In [25]:
import plotly.express as px

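# Keep videos shorter than 10,000 seconds. total_seconds() is used because
# .dt.seconds ignores whole days; .copy() avoids a SettingWithCopyWarning below.
filtered_df = video_df[video_df['durationSecs'].dt.total_seconds() < 10000].copy()

# Convert timedelta to seconds
filtered_df['durationSecs_numeric'] = filtered_df['durationSecs'].dt.total_seconds()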

# Create the histogram with Plotly Express
fig = px.histogram(filtered_df, x="durationSecs_numeric", nbins=100, title="Histogram of Video Durations")
fig.update_layout(title_text="Histogram of Video Durations")

# Show the plot
fig.show()


In [28]:
# Create subplots with 1 row and 2 columns
fig = make_subplots(rows=1, cols=2, subplot_titles=["Duration vs Comment Count", "Duration vs Like Count"])

# Add scatter plots to the subplots
# total_seconds() gives the full duration; .dt.seconds would drop whole days
video_df['duration_in_seconds'] = video_df['durationSecs'].dt.total_seconds()
scatter1 = go.Scatter(x=video_df["duration_in_seconds"], y=video_df["commentCount"], mode="markers", 
                      name="Duration vs Comment Count")
scatter2 = go.Scatter(x=video_df["duration_in_seconds"], y=video_df["likeCount"], mode="markers", 
                      name="Duration vs Like Count")

# Add traces to the subplots
fig.add_trace(scatter1, row=1, col=1)
fig.add_trace(scatter2, row=1, col=2)

# Update layout (label the axes of both subplots)
fig.update_layout(title_text="Comment Count and Like Count vs Duration",
                  xaxis_title="Duration in Seconds",
                  yaxis_title="Comment Count",
                  xaxis2_title="Duration in Seconds",
                  yaxis2_title="Like Count",
                  width=1500,  # Set the width to 1500 pixels
                  height=500)  # Set the height to 500 pixels

# Show the plot
fig.show()


#### Does title length matter for views?

##### There is no clear relationship between title length and views, as seen in the scatterplot below, but the most-viewed videos tend to have a title length of 40-80 characters.

In [None]:
fig = px.scatter(video_df, x = "titleLength", y = "viewCount",
                 labels={'titleLength':'Video Title Length',
                     'viewCount':'Total View'})
fig.show()
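
A scatterplot can be complemented with a single number. As a rough sketch, a rank (Spearman-style) correlation can be computed by ranking both columns and taking the Pearson correlation of the ranks. `rank_correlation` is a hypothetical helper, and the five rows below are just the first rows shown in `video_df.head()`, used to make the example runnable, not to draw any conclusion.

```python
import pandas as pd


def rank_correlation(x, y):
    """Spearman-style correlation: Pearson correlation of the ranks."""
    return x.rank().corr(y.rank())


# First five rows of the table above, for illustration only.
sample = pd.DataFrame({
    'titleLength': [14, 52, 52, 49, 31],
    'viewCount': [7, 504, 42, 200, 190],
})

rho = rank_correlation(sample['titleLength'], sample['viewCount'])
print(round(rho, 3))
```

Values near 0 would support the "no clear relationship" reading, while values near 1 or -1 would indicate a monotonic trend. On the full data this would be called as `rank_correlation(video_df['titleLength'], video_df['viewCount'])`.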

#### Number of tags vs views

In [None]:
fig = px.scatter(video_df, x = "tagsCount", y = "viewCount",
                 labels={'tagsCount':'Video Tags Count',
                     'viewCount':'Total View'})
fig.show()

#### Which day in the week are most videos uploaded?

##### It's interesting to see that most videos are uploaded on weekdays (Monday to Friday), while fewer videos are uploaded during the weekend.

In [None]:
# Count uploads per weekday and put the days in calendar order
weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
tb = video_df['pushblishDayName'].value_counts().reindex(weekdays, fill_value=0)

# Create the bar chart with Plotly Express
fig = px.bar(x=tb.index, y=tb.values, title="Video Publication Days Count")
fig.update_layout(xaxis_title="Day of Week", yaxis_title="Count")

# Show the plot
fig.show()