 # Trending YouTube Video Statistics Analysis

 ![Photo by Isaac Smith on Unsplash](https://images.unsplash.com/photo-1543286386-713bdd548da4?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=900&q=60 "Chart")

 Photo by Isaac Smith on Unsplash

 YouTube (the world-famous video sharing website) maintains a
 list of the top trending videos on the platform. According to
 Variety magazine, “To determine the year’s top-trending videos,
 YouTube uses a combination of factors including measuring users interactions
 (number of views, shares, comments and likes). Note that they’re not the
 most-viewed videos overall for the calendar year”. Top performers on the
 YouTube trending list are music videos (such as the famously viral “Gangnam Style”),
 celebrity and/or reality TV performances, and the random dude-with-a-camera viral
 videos that YouTube is well known for.

 This dataset was collected using the YouTube API.

 Columns include:
- video_id
- trending_date
- title
- channel_title
- category_id
- publish_time
- tags
- views (number of views)
- likes (number of likes)
- dislikes (number of dislikes)
- comment_count 
- thumbnail_link
- comments_disabled
- ratings_disabled
- video_error_or_removed
- description

 ### Standard imports.

In [1]:
import pandas as pd 
import numpy as np 
import matplotlib.pyplot as plt
from plotly.offline import iplot, init_notebook_mode
import plotly.graph_objs as go
from datetime import datetime
import dateutil.parser
import seaborn as sns 
sns.set_style('whitegrid')
init_notebook_mode(connected=True)




In [2]:
def load_data(path):
    return pd.read_csv(path)

# Convert timestamp strings to datetime objects.
def convert_date(timestamp, date_format='%Y-%m-%d'):
    '''
    Converts a timestamp string to a datetime object.
    trending_date strings ('yy.dd.mm') are parsed directly.
    publish_time strings are truncated to the precision of date_format
    (default: '%Y-%m-%d', i.e. the time of day is dropped).
    '''
    if 'Z' not in timestamp:  # trending_date has format 'yy.dd.mm'
        return datetime.strptime(timestamp, '%y.%d.%m')
    # publish_time has format 'yyyy-mm-ddThh:mm:ss.000Z'
    d = dateutil.parser.parse(timestamp)
    # Round-trip through date_format so the fields it omits are reset.
    return datetime.strptime(d.strftime(date_format), date_format)


# Map category labels, add category_label column.
def map_categories(df, map_dict, column='category_label'):
    '''
    Adds category label column to dataframe (modifies df in place).
    Accepts df: Dataframe to perform map on.
            map_dict: Dictionary mapping category_id values to labels.
            column: Name of the new label column. (default='category_label')
    '''
    df[column] = df.category_id.map(map_dict)
    return df

# Find top YouTube video producers for specified year.
def top_video_producing_for_yr(df, year, top_range=5):
    '''
    Finds top video producers for specified year.
    Accepts df: Dataframe.
            year: Year to filter by.
            top_range: Number of top entries to print / return. (default=5)
    '''

    year_filter = [date.year == year for date in df['publish_time']]
    sliced_df = df[year_filter]

    channel_vid_groups = sliced_df.groupby(['channel_title'])['video_id'].count()
    sorted_groups = channel_vid_groups.sort_values(ascending=False)
    top_producers = sorted_groups[:top_range]

    print('#'*30)
    print(f'Top {top_range} video producers in {year} were:')
    print()
    for rank, (channel, vid_count) in enumerate(top_producers.items(), start=1):
        print('\t', f'{rank}) {channel} : {vid_count} videos.', end='')
        print('\n')

    print('#'*30)

    return top_producers


# Plotly Bar graph 
def plotly_bar(x, y, name, title, x_title, y_title, filename, colors=None):
    trace1 = go.Bar(x=x, 
                    y=y,
                    name=name,
                    marker={'color':colors})
    layout = go.Layout(title=title, 
                        xaxis={'title':x_title},
                        yaxis={'title':y_title})

    data = [trace1]
    fig = go.Figure(data=data, layout=layout)
    iplot(fig, filename=filename)

# Plotly Bar graph for year
def plot_category_bar_for_yr(df, year, colors=None):
    year_filter = [date.year == year for date in df['publish_time']]
    sliced_df = df[year_filter]

    category_counts_for_yr = sliced_df.category_label.value_counts()
    print(category_counts_for_yr)

    x = category_counts_for_yr.index
    y = category_counts_for_yr
    name = f'Category Bar For {year}'
    title = f'Category Count ({year})'
    x_title = 'Category'
    y_title = 'Count'
    filename = f'category_bar_{year}'

    plotly_bar(x=x, y=y, name=name, title=title, x_title=x_title, y_title=y_title, filename=filename, colors=colors)


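The year filters in `top_video_producing_for_yr` and `plot_category_bar_for_yr` build a Python list one row at a time. As a rough sketch, assuming `publish_time` has already been converted to a datetime64 column, the same selection can be written with the `.dt` accessor. The name `top_producers_for_year` and the toy data are illustrative only. It counts rows per channel, which matches counting `video_id` as long as that column has no missing values.

```python
import pandas as pd

def top_producers_for_year(frame, year, top_range=5):
    '''Hypothetical variant of top_video_producing_for_yr without the printing.'''
    # Boolean mask built with the vectorized .dt accessor.
    year_mask = frame['publish_time'].dt.year == year
    # value_counts() sorts channels by row count, largest first.
    return frame.loc[year_mask, 'channel_title'].value_counts().head(top_range)

# Toy data for illustration only.
toy = pd.DataFrame({
    'channel_title': ['A', 'A', 'B', 'C'],
    'video_id': ['v1', 'v2', 'v3', 'v4'],
    'publish_time': pd.to_datetime(
        ['2018-01-02', '2018-03-04', '2018-05-06', '2017-12-31']),
})
result = top_producers_for_year(toy, 2018)
print(result)
```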

 ### Load Dataset

In [3]:
df = load_data('data/USvideos.csv')


In [4]:
date_columns = ['trending_date', 'publish_time']

for col in date_columns:
    df[col] = [convert_date(ts) for ts in df[col]]

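The loop above calls `convert_date` once per row. A possible vectorized alternative, sketched here on toy strings in the two formats noted in `convert_date`, is to let `pd.to_datetime` parse each column in one call. This is an assumption about how the conversion could be done, not the approach the notebook uses.

```python
import pandas as pd

# Toy rows in the two raw formats described in convert_date.
toy = pd.DataFrame({
    'trending_date': ['17.14.11', '18.01.02'],
    'publish_time': ['2017-11-13T17:13:01.000Z', '2018-01-31T23:59:59.000Z'],
})

# trending_date: two-digit year, then day, then month.
toy['trending_date'] = pd.to_datetime(toy['trending_date'], format='%y.%d.%m')

# publish_time: parse as UTC, drop the timezone, then truncate to midnight.
publish = pd.to_datetime(toy['publish_time'], utc=True)
toy['publish_time'] = publish.dt.tz_localize(None).dt.normalize()

print(toy.dtypes)
```

Both columns end up as timezone-naive `datetime64[ns]` values, which is the dtype shown in the `df.info()` output of the original workflow.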

In [5]:
# Take a peek at the first 5 rows (head).
df.head()


Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,2017-11-14,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,2017-11-14,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
2,5qpjK5DgCt4,2017-11-14,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,puqaWrEC7tY,2017-11-14,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13,"rhett and link|""gmm""|""good mythical morning""|""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...
4,d380meD0W0M,2017-11-14,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,132235,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...


In [6]:
# Take a peek at the dataframe's info() summary.
df.info()

# 40,949 total entries.
# Every column looks complete with the exception of the video descriptions.


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40949 entries, 0 to 40948
Data columns (total 16 columns):
video_id                  40949 non-null object
trending_date             40949 non-null datetime64[ns]
title                     40949 non-null object
channel_title             40949 non-null object
category_id               40949 non-null int64
publish_time              40949 non-null datetime64[ns]
tags                      40949 non-null object
views                     40949 non-null int64
likes                     40949 non-null int64
dislikes                  40949 non-null int64
comment_count             40949 non-null int64
thumbnail_link            40949 non-null object
comments_disabled         40949 non-null bool
ratings_disabled          40949 non-null bool
video_error_or_removed    40949 non-null bool
description               40379 non-null object
dtypes: bool(3), datetime64[ns](2), int64(5), object(6)
memory usage: 4.2+ MB

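The `df.info()` output lists 40379 non-null descriptions out of 40949 rows, so 570 descriptions are missing. A minimal sketch of counting missing values per column directly, shown on toy data; the `fillna('')` step is one possible way to handle them, not something the notebook does.

```python
import pandas as pd

toy = pd.DataFrame({
    'title': ['a', 'b', 'c'],
    'description': ['text', None, None],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Optional: replace missing descriptions with an empty string.
toy['description'] = toy['description'].fillna('')
```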

In [7]:
# Take a peek at the dataframe's describe() summary.
df.describe()


Unnamed: 0,category_id,views,likes,dislikes,comment_count
count,40949.0,40949.0,40949.0,40949.0,40949.0
mean,19.972429,2360785.0,74266.7,3711.401,8446.804
std,7.568327,7394114.0,228885.3,29029.71,37430.49
min,1.0,549.0,0.0,0.0,0.0
25%,17.0,242329.0,5424.0,202.0,614.0
50%,24.0,681861.0,18091.0,631.0,1856.0
75%,25.0,1823157.0,55417.0,1938.0,5755.0
max,43.0,225211900.0,5613827.0,1674420.0,1361580.0


 ## Exploring video category.
- Which category was assigned the most?
- Which category was assigned the least?
- Which creator uploaded the most videos?
  - Which was their most-assigned video category?
- Which category has had the most views in 2017 / 2018?

In [8]:
# Let's add category labels to the dataframe.
category_dict = {
    
    1 :  'Film & Animation',
    2 : 'Autos & Vehicles',
    10 : 'Music',
    15 : 'Pets & Animals',
    17 : 'Sports',
    18 : 'Short Movies',
    19 : 'Travel & Events',
    20 : 'Gaming',
    21 : 'Videoblogging',
    22 : 'People & Blogs',
    23 : 'Comedy',
    24 : 'Entertainment',
    25 : 'News & Politics',
    26 : 'Howto & Style',
    27 : 'Education',
    28 : 'Science & Technology',
    29 : 'Nonprofits & Activism',
    30 : 'Movies',
    31 : 'Anime/Animation',
    32 : 'Action/Adventure',
    33 : 'Classics',
    34 : 'Comedy',
    35 : 'Documentary',
    36 : 'Drama',
    37 : 'Family',
    38 : 'Foreign',
    39 : 'Horror',
    40 : 'Sci-Fi/Fantasy',
    41 : 'Thriller',
    42 : 'Shorts',
    43 : 'Shows',
    44 : 'Trailers',
}

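The dictionary above is typed by hand. If a category file shaped like a YouTube Data API `videoCategories.list` response is available (an assumption, since the notebook does not mention one), the mapping could be built from it instead. The JSON string below is a two-entry stand-in and `build_category_dict` is a hypothetical helper.

```python
import json

# Toy stand-in for a category file; the layout
# (items -> id, snippet.title) is assumed, not taken from the notebook.
raw = '''
{"items": [
  {"id": "1",  "snippet": {"title": "Film & Animation"}},
  {"id": "10", "snippet": {"title": "Music"}}
]}
'''

def build_category_dict(json_text):
    '''Hypothetical helper: map integer category ids to their titles.'''
    items = json.loads(json_text)['items']
    # ids arrive as strings, while category_id in the dataframe is int64.
    return {int(item['id']): item['snippet']['title'] for item in items}

category_dict_from_json = build_category_dict(raw)
print(category_dict_from_json)
```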


In [9]:
# map_categories adds the label column to df in place;
# mapped_df refers to that same dataframe.
mapped_df = map_categories(df, map_dict=category_dict)
year_list = [int(date.year) for date in mapped_df['publish_time']]
mapped_df['year'] = year_list

# Rearrange columns so that category_id 
# and category_label are next to one another.
mapped_df = mapped_df[['video_id', 'trending_date', 'title', 'channel_title', 'category_id',
                    'category_label', 'publish_time', 'tags', 'views', 'likes', 'dislikes', 
                    'comment_count', 'thumbnail_link', 'comments_disabled', 'ratings_disabled',
                    'video_error_or_removed', 'description', 'year']]

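Two details of this step can be illustrated on toy data: `Series.map` leaves any `category_id` that is absent from the dictionary as `NaN`, and the `year` column can be derived without a list comprehension. The frame and the id `99` below are made up for the sketch.

```python
import pandas as pd

toy = pd.DataFrame({
    'category_id': [10, 24, 99],
    'publish_time': pd.to_datetime(['2017-11-13', '2018-01-05', '2018-02-01']),
})
labels = {10: 'Music', 24: 'Entertainment'}

# Series.map leaves ids that are missing from the dictionary as NaN.
toy['category_label'] = toy['category_id'].map(labels)
unmapped_ids = toy.loc[toy['category_label'].isna(), 'category_id'].unique()
print(unmapped_ids)

# Vectorized equivalent of the list comprehension over publish_time.
toy['year'] = toy['publish_time'].dt.year
```

An empty `unmapped_ids` array would indicate that every row received a label.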


In [10]:
category_counts = mapped_df.category_label.value_counts()
category_counts

category_list = category_counts.index
cat_colors = ['#016e29', '#7f2171', '#b241d0', '#decad2', '#54d7a1', '#10e481', '#2ec580', '#4fa15c',
                '#94d85c', '#df5aad', '#6dd279', '#91a9ed', '#bb232b', '#b6d41b', '#359fe1', '#4985fd']

cat_colors_dict = {cat:color for cat, color in zip(category_list, cat_colors)}


In [11]:
total = mapped_df.shape[0]
for category, count in category_counts.items():
    print(f'Of the {total} videos uploaded, {count} ({(count/total) * 100:0.2f}%) videos were of the {category} category.')
    print()


Of the 40949 videos uploaded, 9964 (24.33%) videos were of the Entertainment category.

Of the 40949 videos uploaded, 6472 (15.81%) videos were of the Music category.

Of the 40949 videos uploaded, 4146 (10.12%) videos were of the Howto & Style category.

Of the 40949 videos uploaded, 3457 (8.44%) videos were of the Comedy category.

Of the 40949 videos uploaded, 3210 (7.84%) videos were of the People & Blogs category.

Of the 40949 videos uploaded, 2487 (6.07%) videos were of the News & Politics category.

Of the 40949 videos uploaded, 2401 (5.86%) videos were of the Science & Technology category.

Of the 40949 videos uploaded, 2345 (5.73%) videos were of the Film & Animation category.

Of the 40949 videos uploaded, 2174 (5.31%) videos were of the Sports category.

Of the 40949 videos uploaded, 1656 (4.04%) videos were of the Education category.

Of the 40949 videos uploaded, 920 (2.25%) videos were of the Pets & Animals category.

Of the 40949 videos uploaded, 817 (2.00%) videos were

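The percentages printed above are each category's count divided by the total row count. As a sketch on toy labels, `value_counts(normalize=True)` yields the same shares directly, and the two series can be combined into one small table.

```python
import pandas as pd

toy_labels = pd.Series(['Music', 'Music', 'Music', 'Comedy'],
                       name='category_label')

counts = toy_labels.value_counts()
# normalize=True returns fractions of the total instead of raw counts.
shares = toy_labels.value_counts(normalize=True) * 100

summary = pd.DataFrame({'count': counts, 'percent': shares.round(2)})
print(summary)
```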
In [12]:
# Plot bar chart for all years.
x = category_counts.index
y = category_counts
name = 'Category Bar'
colors = cat_colors
title = 'Category Count (2006 - 2018)'
x_title = 'Category'
y_title = 'Count'
filename = 'category_bar'
plotly_bar(x, y, name, title, x_title, y_title, filename, colors=colors)



In [13]:
# Plot bar chart for 2018.
plot_category_bar_for_yr(mapped_df, 2018, colors=cat_colors)


Entertainment            7407
Music                    4785
Howto & Style            3163
Comedy                   2457
People & Blogs           2328
Science & Technology     1796
Film & Animation         1729
News & Politics          1685
Sports                   1647
Education                1255
Gaming                    728
Pets & Animals            683
Travel & Events           276
Autos & Vehicles          251
Shows                      46
Nonprofits & Activism      43
Name: category_label, dtype: int64


In [14]:
# Plot bar chart for 2017.
plot_category_bar_for_yr(mapped_df, 2017, colors=cat_colors)


Entertainment            2507
Music                    1669
Comedy                    992
Howto & Style             979
People & Blogs            862
News & Politics           787
Science & Technology      591
Film & Animation          568
Sports                    508
Education                 376
Pets & Animals            235
Travel & Events           126
Autos & Vehicles          120
Gaming                     83
Nonprofits & Activism      14
Shows                      11
Name: category_label, dtype: int64


In [15]:
# Table of category labels by publish year. Here we see the number of
# tracked videos published in each year since 2006.
grouped_df = mapped_df.groupby(['category_label', 'year'])['video_id'].count()
unstacked_df = grouped_df.unstack().fillna(value=0)
unstacked_df



category_label,2006,2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018
Autos & Vehicles,0.0,0.0,0.0,0.0,0.0,0.0,5.0,0.0,4.0,4.0,120.0,251.0
Comedy,0.0,0.0,0.0,0.0,0.0,3.0,3.0,2.0,0.0,0.0,992.0,2457.0
Education,0.0,0.0,1.0,6.0,5.0,5.0,5.0,0.0,3.0,0.0,376.0,1255.0
Entertainment,1.0,0.0,0.0,1.0,1.0,5.0,11.0,9.0,6.0,16.0,2507.0,7407.0
Film & Animation,0.0,4.0,2.0,7.0,6.0,0.0,2.0,8.0,12.0,7.0,568.0,1729.0
Gaming,0.0,0.0,0.0,0.0,2.0,4.0,0.0,0.0,0.0,0.0,83.0,728.0
Howto & Style,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,1.0,979.0,3163.0
Music,0.0,7.0,3.0,0.0,0.0,5.0,0.0,3.0,0.0,0.0,1669.0,4785.0
News & Politics,0.0,0.0,0.0,0.0,11.0,0.0,4.0,0.0,0.0,0.0,787.0,1685.0
Nonprofits & Activism,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,14.0,43.0

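The category-by-year table comes from a `groupby`, `count`, `unstack`, `fillna` chain, which is why its cells are floats. A possible shortcut, sketched on toy data, is `pd.crosstab`, which counts rows per pair and fills absent combinations with integer zeros.

```python
import pandas as pd

toy = pd.DataFrame({
    'category_label': ['Music', 'Music', 'Comedy', 'Music'],
    'year': [2017, 2018, 2018, 2018],
})

# Rows are categories, columns are years, cells are row counts (0 where absent).
counts_by_year = pd.crosstab(toy['category_label'], toy['year'])
print(counts_by_year)
```

Note that `crosstab` counts rows, whereas the original chain counts non-null `video_id` values; the two agree when `video_id` is complete, as the `df.info()` output suggests.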

In [16]:
channel_groups = mapped_df.groupby(['channel_title'])['video_id'].count().sort_values(ascending=False)
channel_groups[:30]


channel_title
ESPN                                      203
The Tonight Show Starring Jimmy Fallon    197
Vox                                       193
TheEllenShow                              193
Netflix                                   193
The Late Show with Stephen Colbert        187
Jimmy Kimmel Live                         186
Late Night with Seth Meyers               183
Screen Junkies                            182
NBA                                       181
CNN                                       180
Saturday Night Live                       175
WIRED                                     171
BuzzFeedVideo                             169
INSIDER                                   167
The Late Late Show with James Corden      163
TED-Ed                                    162
Tom Scott                                 159
WWE                                       157
CollegeHumor                              156
HellthyJunkFood                           153
First We Feast      

In [17]:
end = 30
name = f'Top {end}'
title = f'Top {end} Video Producing Channels'
x_title = 'Channels'
y_title = 'Count'
filename = f'top_{end}_producing'

plotly_bar(x=channel_groups.index[:end], 
            y=channel_groups[:end], 
            name=name, 
            title=title, 
            x_title=x_title, 
            y_title=y_title, 
            filename=filename)

In [18]:
# Top category for ESPN channel.
top_channel_df = mapped_df[mapped_df['channel_title'] == 'ESPN']
top_channel_df.groupby(['category_label'])['video_id'].count().sort_values(ascending=False)


category_label
Sports    203
Name: video_id, dtype: int64

In [19]:
# Top category for The Tonight Show.
channel = 'The Tonight Show Starring Jimmy Fallon'
top_channel_df = mapped_df[mapped_df['channel_title'] == channel]
top_channel_df.groupby(['category_label'])['video_id'].count().sort_values(ascending=False)


category_label
Comedy    197
Name: video_id, dtype: int64

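The two cells above look up the top category one channel at a time. A sketch of doing this for every channel in a single pass, on made-up rows, is to group by channel and take the most frequent label in each group.

```python
import pandas as pd

toy = pd.DataFrame({
    'channel_title': ['ESPN', 'ESPN', 'Vox', 'Vox', 'Vox'],
    'category_label': ['Sports', 'Sports', 'News & Politics',
                       'Education', 'News & Politics'],
})

# For each channel, pick the label that occurs most often.
top_category = (
    toy.groupby('channel_title')['category_label']
       .agg(lambda labels: labels.value_counts().idxmax())
)
print(top_category)
```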
In [20]:
# Report categories and the number of views accrued from 2006 - 2018.
view_counts = mapped_df.groupby(['category_label'])['views'].sum().sort_values(ascending=False)

for category, count in view_counts.items():
    print(f'{category} has accrued {count} views all-time (2006-2018).')
    print()

print('View Count Series')
print(view_counts)


Music has accrued 40132892190 views all-time (2006-2018).

Entertainment has accrued 20604388195 views all-time (2006-2018).

Film & Animation has accrued 7284156721 views all-time (2006-2018).

Comedy has accrued 5117426208 views all-time (2006-2018).

People & Blogs has accrued 4917191726 views all-time (2006-2018).

Sports has accrued 4404456673 views all-time (2006-2018).

Howto & Style has accrued 4078545064 views all-time (2006-2018).

Science & Technology has accrued 3487756816 views all-time (2006-2018).

Gaming has accrued 2141218625 views all-time (2006-2018).

News & Politics has accrued 1473765704 views all-time (2006-2018).

Education has accrued 1180629990 views all-time (2006-2018).

Pets & Animals has accrued 764651989 views all-time (2006-2018).

Autos & Vehicles has accrued 520690717 views all-time (2006-2018).

Travel & Events has accrued 343557084 views all-time (2006-2018).

Nonprofits & Activism has accrued 168941392 views all-time (2006-2018).

Shows has accrued 

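Each row pairs a `video_id` with a `trending_date`. If a video can trend on several dates (an assumption about this dataset that the notebook does not check), summing `views` over all rows counts that video more than once. The toy sketch below contrasts the row-level sum with a per-video sum that keeps each video's highest view count.

```python
import pandas as pd

# Toy rows: video 'v1' trends on two dates, so it appears twice.
toy = pd.DataFrame({
    'video_id': ['v1', 'v1', 'v2'],
    'category_label': ['Music', 'Music', 'Comedy'],
    'views': [100, 250, 40],
})

# Summing every row counts 'v1' twice (100 + 250).
views_all_rows = toy.groupby('category_label')['views'].sum()

# Keep each video's highest view count, then sum per category.
per_video = toy.groupby(['video_id', 'category_label'],
                        as_index=False)['views'].max()
views_per_video = (per_video.groupby('category_label')['views']
                            .sum()
                            .sort_values(ascending=False))
print(views_all_rows)
print(views_per_video)
```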
In [21]:
# Report categories and the number of views accrued in 2017.
df_2017 = mapped_df[mapped_df['year'] == 2017]
view_counts_2017 = df_2017.groupby(['category_label'])['views'].sum().sort_values(ascending=False)

for category, count in view_counts_2017.items():
    print(f'{category} has accrued {count} views in 2017.')
    print()

print()
print('View Count Series 2017', '#'*10)
print(view_counts_2017)
print('#'*35)


Music has accrued 4504741345 views in 2017.

Entertainment has accrued 4049697550 views in 2017.

Comedy has accrued 1130781734 views in 2017.

Film & Animation has accrued 865461513 views in 2017.

Howto & Style has accrued 798971191 views in 2017.

People & Blogs has accrued 701889218 views in 2017.

Science & Technology has accrued 500594905 views in 2017.

Sports has accrued 381264871 views in 2017.

News & Politics has accrued 235729643 views in 2017.

Education has accrued 202535856 views in 2017.

Pets & Animals has accrued 144253255 views in 2017.

Autos & Vehicles has accrued 78956286 views in 2017.

Travel & Events has accrued 54291438 views in 2017.

Gaming has accrued 50354420 views in 2017.

Shows has accrued 1751446 views in 2017.

Nonprofits & Activism has accrued 154195 views in 2017.


View Count Series 2017 ##########
category_label
Music                    4504741345
Entertainment            4049697550
Comedy                   1130781734
Film & Animation          865

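The 2017 and 2018 reports repeat the same filter, group and sum steps. As a sketch on toy numbers, a single `pivot_table` call can produce both years side by side, which makes the per-year comparison easier to read.

```python
import pandas as pd

toy = pd.DataFrame({
    'category_label': ['Music', 'Music', 'Comedy', 'Comedy'],
    'year': [2017, 2018, 2017, 2018],
    'views': [10, 30, 5, 7],
})

# One row per category, one column per year, cells hold summed views.
views_by_year = toy.pivot_table(index='category_label', columns='year',
                                values='views', aggfunc='sum', fill_value=0)

# Restrict to the years of interest and rank by the 2018 column.
recent = views_by_year[[2017, 2018]].sort_values(2018, ascending=False)
print(recent)
```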
In [22]:
# Report categories and the number of views accrued in 2018.
df_2018 = mapped_df[mapped_df['year'] == 2018]
view_counts_2018 = df_2018.groupby(['category_label'])['views'].sum().sort_values(ascending=False)

for category, count in view_counts_2018.items():
    print(f'{category} has accrued {count} views in 2018.')
    print()

print()
print('View Count Series 2018', '#'*10)
print(view_counts_2018)
print('#'*35)




Music has accrued 35626828529 views in 2018.

Entertainment has accrued 16551753681 views in 2018.

Film & Animation has accrued 6414319592 views in 2018.

People & Blogs has accrued 4214020692 views in 2018.

Sports has accrued 4022850077 views in 2018.

Comedy has accrued 3985870620 views in 2018.

Howto & Style has accrued 3279557640 views in 2018.

Science & Technology has accrued 2986783711 views in 2018.

Gaming has accrued 2090593386 views in 2018.

News & Politics has accrued 1237593169 views in 2018.

Education has accrued 977569444 views in 2018.

Pets & Animals has accrued 620388475 views in 2018.

Autos & Vehicles has accrued 441658799 views in 2018.

Travel & Events has accrued 289265646 views in 2018.

Nonprofits & Activism has accrued 168787197 views in 2018.

Shows has accrued 49749612 views in 2018.


View Count Series 2018 ##########
category_label
Music                    35626828529
Entertainment            16551753681
Film & Animation          6414319592
People & B