# YouTube Channel Statistics Analysis

Here we analyze the data that we collected from the YouTube Data API.

## Necessary Imports

Here we import the libraries we need: pandas for the dataframes and matplotlib for the plots.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

## Import Data from Data Files

Here we load the data from the pickle files in the data directory of our project into dataframes.

In [None]:
channelStatistics = pd.read_pickle('../data/channelStatistics.pkl')

videoStatistics = pd.read_pickle('../data/videoStatistics.pkl')

In [None]:
channelStatistics.info()

In [None]:
videoStatistics.info()

As mentioned when cleaning the data, pandas operations such as sum() skip NaN values by default, so a column that contains NaNs can still be aggregated. We therefore keep these rows (videos), since they do not hinder our analysis.
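
To illustrate the point about NaNs, here is a minimal sketch using a small hypothetical Series (the values are made up and are not from the project data). It shows the default skipping behavior of sum() and mean(), and what happens when skipping is turned off with `skipna=False`.

```python
import numpy as np
import pandas as pd

# Hypothetical view counts; the middle value is missing.
views = pd.Series([100.0, np.nan, 50.0])

# By default, pandas reductions skip NaN values.
total = views.sum()    # 150.0
mean = views.mean()    # 75.0, averaged over the two present values

# With skipna=False the NaN propagates into the result instead.
strict_total = views.sum(skipna=False)

print(total, mean, strict_total)
```

Note that mean() divides by the number of non-missing values, so averages are computed only over the videos that have data.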

# Average Views Per Video

Here we analyze the average number of views per video, which gives us a broad idea of how each channel's videos perform.

In [None]:
channelStatistics['averageViewsPerVideo'] = channelStatistics['viewCount'] / channelStatistics['videoCount']

averageViewsPerVideoDf = channelStatistics[['channelName', 'viewCount', 'videoCount', 'averageViewsPerVideo']].sort_values(by='averageViewsPerVideo', ascending=False)

averageViewsPerVideoDf
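
One edge case worth keeping in mind is a channel with a video count of zero, which would turn the ratio into inf (or NaN for 0 / 0). Whether such channels exist in our data is an assumption here. The sketch below uses a hypothetical `demo` dataframe to show the difference between plain division and division after treating zero counts as missing.

```python
import numpy as np
import pandas as pd

# Hypothetical channels, not taken from the project data.
demo = pd.DataFrame({
    'channelName': ['A', 'B', 'C'],
    'viewCount': [1000, 500, 0],
    'videoCount': [10, 0, 0],
})

# Plain division: 500 / 0 gives inf and 0 / 0 gives NaN.
naive = demo['viewCount'] / demo['videoCount']

# Treat a zero video count as missing so the ratio is NaN rather than inf.
safe = demo['viewCount'] / demo['videoCount'].replace(0, np.nan)

print(pd.DataFrame({'naive': naive, 'safe': safe}))
```

Using NaN instead of inf keeps such channels from sorting to the top of a descending ranking with a meaningless value.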


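Now we will plot the two quantities behind this average as a scatter plot, with the video count on the x axis and the view count on the y axis. Each point is one channel, and the higher a point sits for a given video count, the higher that channel's average views per video.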

In [None]:
plt.figure(figsize=(10,6))
plt.scatter(averageViewsPerVideoDf['videoCount'], averageViewsPerVideoDf['viewCount'], alpha=0.7)
plt.xlim(left=0)
plt.ylim(bottom=0)
plt.title('View Count vs Video Count')
plt.xlabel('Video Count')
plt.ylabel('View Count in Millions')

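# Relabel the existing y ticks so they read in millions instead of raw counts.
locs, labels = plt.yticks()
new_tick_locations = locs / 1e6  # tick values expressed in millions
new_tick_labels = ['{:.0f}'.format(loc) for loc in new_tick_locations]
plt.yticks(ticks=locs, labels=new_tick_labels)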

plt.show()
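
As an alternative to relabeling the ticks by hand, matplotlib's `FuncFormatter` can attach the "in millions" formatting to the axis itself. This is only a sketch of that option, not the approach used in the cell above; the helper names `millions` and `label_y_axis_in_millions` are hypothetical.

```python
from matplotlib.ticker import FuncFormatter


def millions(value, pos):
    """Format a raw count as a whole number of millions."""
    return '{:.0f}'.format(value / 1e6)


def label_y_axis_in_millions(ax):
    """Show the y axis tick labels of the given Axes in millions."""
    ax.yaxis.set_major_formatter(FuncFormatter(millions))
    ax.set_ylabel('View Count in Millions')
```

With this helper, calling `label_y_axis_in_millions(plt.gca())` before `plt.show()` would take the place of the manual tick relabeling, and the labels stay correct if the axis limits change later.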

# Subscriber Engagement

Here we analyze subscriber engagement by comparing the subscriber count with the view count of each channel.

To do so, we define a new metric called subscriber engagement, given by the ratio: view count / subscriber count.

That way, if the channel has:

- many subscribers and many views --> the channel will have average subscriber engagement
- few subscribers and many views --> the channel will have high subscriber engagement
- many subscribers and few views --> the channel will have low subscriber engagement
- few subscribers and few views --> the channel will have average subscriber engagement
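
The four cases above can be checked with a small worked example. The channels and numbers below are hypothetical and chosen only to make the ratios easy to read; they are not from the project data.

```python
import pandas as pd

# Four hypothetical channels, one per case in the list above.
demo = pd.DataFrame({
    'channelName': ['manySubs_manyViews', 'fewSubs_manyViews',
                    'manySubs_fewViews', 'fewSubs_fewViews'],
    'subscriberCount': [1_000_000, 10_000, 1_000_000, 10_000],
    'viewCount': [100_000_000, 100_000_000, 1_000_000, 1_000_000],
})

# Subscriber engagement is the ratio view count / subscriber count.
demo['subscriberEngagement'] = demo['viewCount'] / demo['subscriberCount']

# Rank the channels from highest to lowest engagement.
ranked = demo.sort_values(by='subscriberEngagement', ascending=False)
print(ranked[['channelName', 'subscriberEngagement']])
```

The two "balanced" channels end up with the same ratio even though their sizes differ by a factor of 100, which is why both count as average engagement.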

We will create a new dataframe that contains the name of the channel and its corresponding subscriber engagement. It will be ordered by decreasing subscriber engagement.

In [None]:
channelStatistics['subscriberEngagement'] = channelStatistics['viewCount'] / channelStatistics['subscriberCount']

subscriberEngagementDf = channelStatistics[['channelName', 'viewCount', 'subscriberCount', 'subscriberEngagement']].sort_values(by='subscriberEngagement', ascending=False)

subscriberEngagementDf

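After that, we will visualize subscriber engagement with a scatter plot that has the subscriber count on the x axis and the view count on the y axis. Each point is one channel, and the higher a point sits for a given subscriber count, the higher that channel's subscriber engagement.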

In [None]:
plt.figure(figsize=(10,6))
plt.scatter(subscriberEngagementDf['subscriberCount'], subscriberEngagementDf['viewCount'], alpha=0.7)
plt.xlim(left=0)
plt.ylim(bottom=0)
plt.title('View Count vs Subscriber Count')
plt.xlabel('Subscriber Count')
plt.ylabel('View Count in Millions')

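# Relabel the existing y ticks so they read in millions instead of raw counts.
locs, labels = plt.yticks()
new_tick_locations = locs / 1e6  # tick values expressed in millions
new_tick_labels = ['{:.0f}'.format(loc) for loc in new_tick_locations]
plt.yticks(ticks=locs, labels=new_tick_labels)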

plt.show()