# YouTube Channel Statistics Analysis

Here we will analyze the data that we collected from the YouTube Data API.

## Necessary Imports

Here we will import necessary libraries.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np

## Import Data from Data Files

Here we will load the data from the pickle files in the data directory of our project into dataframes.

In [None]:
channelStatistics = pd.read_pickle('../data/channelStatistics.pkl')

videoStatistics = pd.read_pickle('../data/videoStatistics.pkl')

In [None]:
channelStatistics.info()

In [None]:
videoStatistics.info()

As mentioned when cleaning the data, pandas operations such as sum() skip NaN values by default, so we can still aggregate a column that contains them. For that reason we will not drop these rows (videos), since they do not hinder our data analysis.
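As a minimal sketch of the behavior described above, the small series below is made up for illustration (the values are not from our data set). It shows how pandas treats NaN in sum() and mean() by default, and what happens when skipping is turned off.

```python
import numpy as np
import pandas as pd

# Hypothetical like counts; one video has no like count, stored as NaN.
likes = pd.Series([10.0, np.nan, 30.0], name='videoLikeCount')

total = likes.sum()                    # NaN is skipped by default (skipna=True)
average = likes.mean()                 # mean over the two non-missing values
strictTotal = likes.sum(skipna=False)  # NaN propagates when skipping is disabled

print(total, average, strictTotal)
```

Note that this default applies to pandas aggregations; NumPy routines such as np.polyfit do not skip NaN values on their own.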

# Average Views Per Video

Here we will analyze the average number of views per video, which gives us a broad idea of how each channel's videos perform.

In [None]:
channelStatistics['averageViewsPerVideo'] = channelStatistics['viewCount'] / channelStatistics['videoCount']

averageViewsPerVideoDf = channelStatistics[['channelName', 'viewCount', 'videoCount', 'averageViewsPerVideo']].sort_values(by='averageViewsPerVideo', ascending=False)

averageViewsPerVideoDf
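One edge case worth keeping in mind: if a channel had a video count of zero, the division above would produce inf rather than a usable average. The sketch below shows one possible guard on a made-up frame; the channel names and the safeVideoCount variable are hypothetical, and we have not checked whether our data actually contains such a channel.

```python
import pandas as pd

# Hypothetical channels; the second one has no public videos.
channels = pd.DataFrame({
    'channelName': ['channelA', 'channelB'],
    'viewCount': [1000, 500],
    'videoCount': [10, 0],
})

# Replace zero video counts with NaN so the ratio becomes NaN instead of inf.
safeVideoCount = channels['videoCount'].where(channels['videoCount'] > 0)
channels['averageViewsPerVideo'] = channels['viewCount'] / safeVideoCount

print(channels)
```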


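Now we will plot each channel as a point on a graph where the x axis is the video count and the y axis is the total view count. Channels with a higher average number of views per video sit further above the others for the same video count.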

In [None]:
plt.figure(figsize=(10,6))
plt.scatter(averageViewsPerVideoDf['videoCount'], averageViewsPerVideoDf['viewCount'], alpha=0.7)
plt.xlim(left=0)
plt.ylim(bottom=0)
plt.title('Total Channel View Count vs Video Count')
plt.xlabel('Video Count')
plt.ylabel('View Count in Millions')

locs, labels = plt.yticks()
new_tick_locations = locs / 1e6
new_tick_labels = ['{:.0f}M'.format(loc) for loc in new_tick_locations]
plt.yticks(ticks=locs, labels=new_tick_labels)

plt.show()
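As an alternative to reading the tick locations and relabeling them by hand, the label logic can be kept in a small function. This is only a sketch: formatAsMillions is a hypothetical helper, not part of the notebook, and its (value, position) signature is chosen so that it could be wrapped in matplotlib.ticker.FuncFormatter.

```python
def formatAsMillions(value, position=None):
    """Return a tick value such as 2_000_000 as the label '2M'.

    The unused position argument matches the (value, position) call
    signature that matplotlib tick formatter functions receive.
    """
    return f'{value / 1e6:.0f}M'


print([formatAsMillions(tick) for tick in (0, 1_000_000, 2_000_000)])
```

With matplotlib available, the function could be applied with `plt.gca().yaxis.set_major_formatter(FuncFormatter(formatAsMillions))`, which keeps the labels correct even if the axis limits change afterwards.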

# Subscriber Engagement

Here we will analyze subscriber engagement by comparing the subscriber count and the view count of each channel.

Therefore we will create a new metric called subscriber engagement that will be given by the ratio: view count / subscriber count.

That way, if the channel has:

- many subscribers and many views --> the channel will have average subscriber engagement
- few subscribers and many views --> the channel will have high subscriber engagement
- many subscribers and few views --> the channel will have low subscriber engagement
- few subscribers and few views --> the channel will have average subscriber engagement
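To make the cases above concrete, here is a small worked example of the ratio. The three channels and their numbers are invented for illustration and do not come from our data set.

```python
import pandas as pd

# Hypothetical channels covering the high, average and low cases.
channels = pd.DataFrame({
    'channelName': ['smallButWatched', 'largeAndWatched', 'largeButQuiet'],
    'viewCount': [5_000_000, 50_000_000, 2_000_000],
    'subscriberCount': [10_000, 1_000_000, 1_000_000],
})

# Subscriber engagement = view count / subscriber count.
channels['subscriberEngagement'] = channels['viewCount'] / channels['subscriberCount']
ranked = channels.sort_values(by='subscriberEngagement', ascending=False)

print(ranked[['channelName', 'subscriberEngagement']])
```

The small channel with many views ranks first (500 views per subscriber), while the large channel with few views ranks last (2 views per subscriber).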

We will create a new dataframe that contains the name of the channel and its corresponding subscriber engagement. It will be ordered by decreasing subscriber engagement.

In [None]:
channelStatistics['subscriberEngagement'] = channelStatistics['viewCount'] / channelStatistics['subscriberCount']

subscriberEngagementDf = channelStatistics[['channelName', 'viewCount', 'subscriberCount', 'subscriberEngagement']].sort_values(by='subscriberEngagement', ascending=False)

subscriberEngagementDf

After that, we will plot the subscriber engagement of each channel in a graph with the subscriber count on the x axis and the view count on the y axis.

In [None]:
plt.figure(figsize=(10,6))
plt.scatter(subscriberEngagementDf['subscriberCount'], subscriberEngagementDf['viewCount'], alpha=0.7)
plt.xlim(left=0)
plt.ylim(bottom=0)
plt.title('Total Channel View Count vs Subscriber Count')
plt.xlabel('Subscriber Count')
plt.ylabel('View Count in Millions')

ylocs, ylabels = plt.yticks()
newYTickLocations = ylocs / 1e6
newYTickLabels = ['{:.0f}M'.format(loc) for loc in newYTickLocations]
plt.yticks(ticks=ylocs, labels=newYTickLabels)

xlocs, xlabels = plt.xticks()
newXTickLocations = xlocs / 1e6
newXTickLabels = ['{:.0f}M'.format(loc) for loc in newXTickLocations]
plt.xticks(ticks=xlocs, labels=newXTickLabels)

plt.show()

# Metrics Over Time

Since plotting metrics against date will be very common in our analysis, we will create a function that draws a scatter plot of a metric over time. It takes as parameters:

- The channel we want to analyze
- The metric (column name) we want to analyze, which is plotted on the y axis with time on the x axis
- The step between y axis ticks, which also determines how the tick labels are displayed:
    - 1e6 (ticks every million, labeled in M)
    - 1e5 (ticks every hundred thousand, labeled in K)
    - 1e4 (ticks every ten thousand, labeled in K)
    - 1e3 (ticks every thousand, labeled in K)
    - 1e2 (ticks every hundred, labeled as plain units)
- Optionally, whether to also plot the videos of all channels for comparison (isPlotAllChannels)

We will also add a feature to this function to plot a linear regression line to see the trend in the plot.

Note: the dataframe that contains the metric and the datetime is hardcoded as the video statistics dataframe, since it is the only dataframe we have that contains datetime information.
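The y axis scale options listed above come down to a short piece of arithmetic: round the largest value up to the next multiple of the step, then place a tick at every step. The sketch below isolates that calculation; buildYTicks is a hypothetical helper name and the maximum values are made up.

```python
import numpy as np


def buildYTicks(maxValue, yAxisStep):
    """Return ticks from 0 up to the first multiple of yAxisStep at or above maxValue."""
    upperBound = np.ceil(maxValue / yAxisStep) * yAxisStep
    # The +1 makes np.arange include the upper bound itself.
    return np.arange(0, upperBound + 1, step=yAxisStep)


# A maximum of 2.3 million views with a step of 1e6, labeled in millions.
millionTicks = buildYTicks(2_300_000, 1e6)
millionLabels = [f'{int(tick / 1e6)}M' for tick in millionTicks]

# A maximum of 150 thousand views with a step of 1e5, labeled in thousands.
thousandTicks = buildYTicks(150_000, 1e5)
thousandLabels = [f'{int(tick / 1e3)}K' for tick in thousandTicks]

print(millionLabels, thousandLabels)
```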

In [None]:
def convertDateAndTimeColumnToLocalizedDateTime(channelName):
    
    channelMetricOverTimeDf = videoStatistics[videoStatistics['channelTitle'] == channelName].copy()

    channelMetricOverTimeDf['videoPublishDatetime'] = pd.to_datetime(channelMetricOverTimeDf['videoPublishDatetime'])

    channelMetricOverTimeDf['videoPublishDatetime'] = channelMetricOverTimeDf['videoPublishDatetime'].dt.tz_localize(None)

    return channelMetricOverTimeDf

def selectYaxisScale(yAxisStep):
    
    if yAxisStep == 1e6:
        tickStep = yAxisStep
        tickLabel = 'M'
        tickName = 'Millions'
    elif yAxisStep == 1e5:
        tickStep = 1e3
        tickLabel = 'K'
        tickName = 'Hundreds of Thousands'
    elif yAxisStep == 1e4:
        tickStep = 1e3
        tickLabel = 'K'
        tickName = 'Tens of Thousands'
    elif yAxisStep == 1e3:
        tickStep = 1e3
        tickLabel = 'K'
        tickName = 'Thousands'
    elif yAxisStep == 1e2:
        tickStep = 1
        tickLabel = ''
        tickName = 'Units'
    else:
        # Raise instead of returning a string, which would fail later when the caller unpacks three values.
        raise ValueError('Invalid yAxisStep: expected one of 1e6, 1e5, 1e4, 1e3 or 1e2.')
    
    return tickStep, tickLabel, tickName



def scatterPlotChannelMetricOverTime(channelName, metricName, yAxisStep, isPlotAllChannels=False):
    
    plt.figure(figsize=(10,6))
    
    maxViewCount = 0
    nextStep = None
    yticks = None
    if isPlotAllChannels:
        allChannelsDf = videoStatistics.copy()
        allChannelsDf['videoPublishDatetime'] = pd.to_datetime(allChannelsDf['videoPublishDatetime'])
        allChannelsDf['videoPublishDatetime'] = allChannelsDf['videoPublishDatetime'].dt.tz_localize(None)

        numericDatesAll = mdates.date2num(allChannelsDf['videoPublishDatetime'])
        slopeAll, interceptAll = np.polyfit(numericDatesAll, allChannelsDf[metricName], 1)
        regressionLineAll = slopeAll * numericDatesAll + interceptAll

        plt.scatter(allChannelsDf['videoPublishDatetime'], allChannelsDf[metricName], color='green', alpha=0.3, label='All Channels')
        plt.plot(allChannelsDf['videoPublishDatetime'], regressionLineAll, color='purple', label=f'All Channels Linear Regression | Slope = {slopeAll:.3f}')

        maxViewCount = allChannelsDf[metricName].max()
        nextStep = np.ceil(maxViewCount / yAxisStep) * yAxisStep
        yticks = np.arange(0, nextStep+1, step=yAxisStep)
    
    channelMetricOverTimeDf = convertDateAndTimeColumnToLocalizedDateTime(channelName)

    numericDates = mdates.date2num(channelMetricOverTimeDf['videoPublishDatetime'])

    slope, intercept = np.polyfit(numericDates, channelMetricOverTimeDf[metricName], 1)

    regressionLine = slope * numericDates + intercept

    tickStep, tickLabel, tickName = selectYaxisScale(yAxisStep)
    
    plt.scatter(channelMetricOverTimeDf['videoPublishDatetime'], channelMetricOverTimeDf[metricName], alpha=0.3, label=channelName)

    plt.plot(channelMetricOverTimeDf['videoPublishDatetime'], regressionLine, color='red', label=f'{channelName} Linear Regression | Slope = {slope:.3f}')

    plt.gca().xaxis.set_major_locator(mdates.YearLocator())
    plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y')) 

    if not isPlotAllChannels:
        maxViewCount = channelMetricOverTimeDf[metricName].max()
        nextStep = np.ceil(maxViewCount / yAxisStep) * yAxisStep
        yticks = np.arange(0, nextStep+1, step=yAxisStep)

    plt.yticks(ticks=yticks, labels=[f"{int(tick/tickStep)}{tickLabel}" for tick in yticks])

    plt.title(f'{metricName} Over Time')
    plt.xlabel('Year')
    plt.ylabel(f'{metricName} in {tickName}')

    plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), fancybox=True, shadow=True, ncol=2)
    plt.show()
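Because we kept rows whose metrics are NaN, it may be worth masking them before fitting, since np.polyfit does not skip missing values by itself. The following is a minimal sketch under stated assumptions: fitTrendIgnoringNaN is a hypothetical helper, the sample data is invented, and days are counted from the first valid date rather than with mdates.date2num. The slope is still expressed per day, as in the function above, but the intercept refers to the first date instead of the matplotlib epoch.

```python
import numpy as np
import pandas as pd


def fitTrendIgnoringNaN(dates, values):
    """Fit values = slope * days + intercept, skipping rows with a missing date or value."""
    dates = pd.Series(pd.to_datetime(dates)).reset_index(drop=True)
    values = pd.Series(values, dtype='float64').reset_index(drop=True)

    # Keep only rows where both the date and the metric are present.
    valid = dates.notna() & values.notna()

    # Days elapsed since the first valid date, as floats.
    days = (dates[valid] - dates[valid].min()).dt.total_seconds() / 86400

    slope, intercept = np.polyfit(days, values[valid], 1)
    return slope, intercept


# Hypothetical data: the second video has no view count.
sampleDates = ['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04']
sampleViews = [100.0, np.nan, 300.0, 400.0]
slope, intercept = fitTrendIgnoringNaN(sampleDates, sampleViews)

print(f'slope per day = {slope:.1f}, intercept = {intercept:.1f}')
```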

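Now we will add a second view of the same data. Instead of plotting every video, we group the videos by semester (January to June, and July to December), take the average of the metric for each semester, and plot one point per semester joined by a line. Semesters without any video are given an average of 0. Averaging per semester gives us a broader view of the data. Note that the function below keeps the name barPlotChannelMetricPerSemesterOverTime even though it draws a line chart.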

In [None]:
def barPlotChannelMetricPerSemesterOverTime(channelName, metricName, yAxisStep):

    channelMetricOverTimeDf = convertDateAndTimeColumnToLocalizedDateTime(channelName)
    
    channelMetricOverTimeDf['semester'] = (channelMetricOverTimeDf['videoPublishDatetime'].dt.month - 1) // 6 + 1

    channelMetricOverTimeDf['yearSemester'] = channelMetricOverTimeDf['videoPublishDatetime'].dt.year.astype(str) + '-' + channelMetricOverTimeDf['semester'].astype(str)

    semesterAverage = channelMetricOverTimeDf.groupby('yearSemester')[metricName].mean().reset_index()

    minYear = channelMetricOverTimeDf['videoPublishDatetime'].dt.year.min()
    maxYear = channelMetricOverTimeDf['videoPublishDatetime'].dt.year.max()

    allSemesters = [f"{year}-{semester}" for year in range(minYear, maxYear + 1) for semester in range(1, 3)]

    i = 0
    while i < len(semesterAverage):
        if allSemesters[i] != semesterAverage['yearSemester'].loc[i]:

            newRow = pd.DataFrame({
                'yearSemester': [allSemesters[i]],
                metricName: [0]
            }, index=[i])
            semesterAverage = pd.concat([semesterAverage.loc[:i-1], newRow, semesterAverage.loc[i:]])
            semesterAverage = semesterAverage.reset_index(drop=True)
        
        i += 1

    semesterAverage['numericSemester'] = semesterAverage['yearSemester'].apply(
        lambda x: float(x.split('-')[0]) + (0.5 if x.split('-')[1] == '2' else 0.0)
    )

    slope, intercept = np.polyfit(semesterAverage['numericSemester'], semesterAverage[metricName], 1)

    regressionLine = slope * semesterAverage['numericSemester'] + intercept

    tickStep, tickLabel, tickName = selectYaxisScale(yAxisStep)

    plt.figure(figsize=(10, 6))
    plt.plot(semesterAverage['yearSemester'], semesterAverage[metricName], marker='o', linestyle='-')

    plt.plot(semesterAverage['yearSemester'], regressionLine, color='red', label=f'Linear Regression | Slope = {slope:.3f}')

    plt.title(f'Average {metricName} Per Semester')
    plt.xlabel('Semester')
    plt.ylabel(f'Average {metricName} in {tickName}')
    plt.xticks(rotation=45)

    maxViewCount = semesterAverage[metricName].max()
    nextStep = np.ceil(maxViewCount / yAxisStep) * yAxisStep
    yticks = np.arange(0, nextStep + 1, step=yAxisStep)

    plt.yticks(ticks=yticks, labels=[f"{tick/tickStep:.0f}{tickLabel}" for tick in yticks])

    plt.legend()

    plt.show()
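The semester grouping used above can be checked on a tiny example. The sketch below uses three invented videos and fills the empty semesters with reindex instead of the insertion loop; this is a swapped-in technique, not the notebook's method. One difference to be aware of: reindex pads every semester in the year range, including a trailing empty semester, which the loop above does not add.

```python
import pandas as pd

# Hypothetical videos: two in the first half of 2021, one in the second half of 2022.
videos = pd.DataFrame({
    'videoPublishDatetime': pd.to_datetime(['2021-02-10', '2021-05-20', '2022-09-01']),
    'videoViewCount': [100, 300, 500],
})

# Months 1-6 map to semester 1, months 7-12 map to semester 2.
semester = (videos['videoPublishDatetime'].dt.month - 1) // 6 + 1
videos['yearSemester'] = (
    videos['videoPublishDatetime'].dt.year.astype(str) + '-' + semester.astype(str)
)

semesterAverage = videos.groupby('yearSemester')['videoViewCount'].mean()

# Every semester between the first and last year, in chronological order.
allSemesters = [f'{year}-{half}' for year in range(2021, 2023) for half in (1, 2)]
filledAverage = semesterAverage.reindex(allSemesters, fill_value=0)

print(filledAverage)
```

Filling empty semesters with 0 pulls the regression line down; leaving them as NaN and excluding them from the fit would be another reasonable choice.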

## Video View Count Over Time

Here we will analyze the view count of a particular channel over time.

To do so, we will first create a dataframe that lists the unique channels in our data set, so we can choose which channel to visualize.

After that, we will plot the data using the plot functions defined above.

In [None]:
uniqueChannels = videoStatistics[['channelTitle', 'channelID']].drop_duplicates()

uniqueChannels

Now we will filter the videos for our desired channel and plot the data.

In [None]:
channelName = 'GORGONOID'

scatterPlotChannelMetricOverTime(channelName, 'videoViewCount', 1e6, isPlotAllChannels=True)

In [None]:
barPlotChannelMetricPerSemesterOverTime(channelName, 'videoViewCount', 1e5)

## Like Count Over Time

Here we will analyze the like count of a particular channel over time.

To do so, first, we will select the channel we want to plot the data for.

After that, we will plot the data using the plot functions defined above.

In [None]:
channelName = 'GORGONOID'

scatterPlotChannelMetricOverTime(channelName, 'videoLikeCount', 1e5, isPlotAllChannels=True)

In [None]:
barPlotChannelMetricPerSemesterOverTime(channelName, 'videoLikeCount', 1e4)

## Comment Count Over Time

Here we will analyze the comment count of a particular channel over time.

To do so, first, we will select the channel we want to plot the data for.

After that, we will plot the data using the plot functions defined above.

In [None]:
channelName = 'GORGONOID'

scatterPlotChannelMetricOverTime(channelName, 'videoCommentCount', 1e4, isPlotAllChannels=True)

In [None]:
barPlotChannelMetricPerSemesterOverTime(channelName, 'videoCommentCount', 1e2)

## Video Duration Over Time

Here we will analyze the video duration of a particular channel over time.

To do so, first, we will create a specific plot function for the video duration metric.

Note: since the video duration column has to be converted into a duration in minutes first, we were not able to reuse the plot functions defined above.

In [None]:
def scatterPlotChannelVideoDurationOverTime(channelName, isPlotAllChannels=False):

    plt.figure(figsize=(10, 6))
    
    if isPlotAllChannels:
        allChannelsDf = videoStatistics.copy()
        allChannelsDf['videoPublishDatetime'] = pd.to_datetime(allChannelsDf['videoPublishDatetime'])
        allChannelsDf['videoPublishDatetime'] = allChannelsDf['videoPublishDatetime'].dt.tz_localize(None)

        allChannelsDf['videoDuration'] = allChannelsDf['videoDuration'].dt.total_seconds() / 60

        numericDatesAll = mdates.date2num(allChannelsDf['videoPublishDatetime'])
        slopeAll, interceptAll = np.polyfit(numericDatesAll, allChannelsDf['videoDuration'], 1)
        regressionLineAll = slopeAll * numericDatesAll + interceptAll

        plt.scatter(allChannelsDf['videoPublishDatetime'], allChannelsDf['videoDuration'], color='green', alpha=0.3, label='All Channels')
        plt.plot(allChannelsDf['videoPublishDatetime'], regressionLineAll, color='purple', label=f'All Channels Linear Regression | Slope = {slopeAll:.3f}')

    channelVideoDurationOverTime = convertDateAndTimeColumnToLocalizedDateTime(channelName)

    channelVideoDurationOverTime['videoDuration'] = channelVideoDurationOverTime['videoDuration'].dt.total_seconds() / 60

    numericDates = mdates.date2num(channelVideoDurationOverTime['videoPublishDatetime'])

    slope, intercept = np.polyfit(numericDates, channelVideoDurationOverTime['videoDuration'], 1)

    regressionLine = slope * numericDates + intercept

    plt.scatter(channelVideoDurationOverTime['videoPublishDatetime'], channelVideoDurationOverTime['videoDuration'], alpha=0.3, label=channelName)

    plt.plot(channelVideoDurationOverTime['videoPublishDatetime'], regressionLine, color='red', label=f'{channelName} Linear Regression | Slope = {slope:.3f}')

    plt.gca().xaxis.set_major_locator(mdates.YearLocator())
    plt.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y'))

    plt.xlabel('Year')
    plt.ylabel('Video Duration in Minutes')
    plt.title('Video Duration Over Time')

    plt.legend(loc='upper center', bbox_to_anchor=(0.5, -0.1), fancybox=True, shadow=True, ncol=2)
    plt.show()
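The conversion used in the function above can be seen in isolation in the sketch below. It assumes, as the .dt.total_seconds() call implies, that the videoDuration column holds timedelta values; the three durations are made up.

```python
import pandas as pd

# Hypothetical durations: 4 min 30 s, 1 hour, and 45 s.
durations = pd.Series(pd.to_timedelta(['00:04:30', '01:00:00', '00:00:45']))

# total_seconds() returns floats, so dividing by 60 gives fractional minutes.
minutes = durations.dt.total_seconds() / 60

print(minutes.tolist())
```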


Now, we will select the channel we want to plot the data for and then we will plot the data.

In [None]:
channelName = 'GORGONOID'

scatterPlotChannelVideoDurationOverTime(channelName, isPlotAllChannels=True)

In [None]:
videoStatistics.info()

In [None]:
videoStatistics