In [1]:
import pandas as pd
from datetime import datetime
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import pickle
import re
%matplotlib inline

In [2]:
df = pd.read_csv('USvideos.csv')

In [3]:
df.head()

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,puqaWrEC7tY,17.14.11,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13T11:00:04.000Z,"rhett and link|""gmm""|""good mythical morning""|""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...
4,d380meD0W0M,17.14.11,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,132235,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...


We see that our dataset contains many columns: some identify the video and the channel that published it, some track viewer statistics, and some describe the kind of video it is. The dataset also contains a 'thumbnail_link' column that links to the video's thumbnail image. Examining the thumbnails to understand how the choice of image affects a video's likelihood of trending would be an interesting assignment, but it is too complicated for me to take on at the moment, so I will remove the column from the dataset.

In [4]:
cats_US = pd.read_json('UScategoryids.json')

In [5]:
cats_US.head()

Unnamed: 0,kind,etag,items
0,youtube#videoCategoryListResponse,"""m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv...","{'kind': 'youtube#videoCategory', 'etag': '""m2..."
1,youtube#videoCategoryListResponse,"""m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv...","{'kind': 'youtube#videoCategory', 'etag': '""m2..."
2,youtube#videoCategoryListResponse,"""m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv...","{'kind': 'youtube#videoCategory', 'etag': '""m2..."
3,youtube#videoCategoryListResponse,"""m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv...","{'kind': 'youtube#videoCategory', 'etag': '""m2..."
4,youtube#videoCategoryListResponse,"""m2yskBQFythfE4irbTIeOgYYfBU/S730Ilt-Fi-emsQJv...","{'kind': 'youtube#videoCategory', 'etag': '""m2..."


We see that each entry of cats_US['items'] appears to be a dictionary with a lot of information.

In [6]:
cats_US['items'][0]

{'kind': 'youtube#videoCategory',
 'etag': '"m2yskBQFythfE4irbTIeOgYYfBU/Xy1mB4_yLrHy_BmKmPBggty2mZQ"',
 'id': '1',
 'snippet': {'channelId': 'UCBR8-60-B28hp2BmDPdntcQ',
  'title': 'Film & Animation',
  'assignable': True}}

Our assumption was correct! Each entry of cats_US['items'] is a dictionary that contains various useful bits of information, including the category id and the category title, which, interestingly enough, is located in 'snippet' (a dictionary nested inside the item dictionary in the JSON).

In [7]:
categories_US = {int(category['id']): category['snippet']['title'] for category in cats_US['items']}
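
The same id-to-title lookup can be tried on a small inline response. This is only a sketch: the first item mirrors the entry printed above, the second item is an illustrative placeholder, and the names `raw` and `categories` are not part of the notebook.

```python
import json

# A trimmed response with the same nesting as UScategoryids.json.
# The second item is an illustrative placeholder, not copied from the file.
raw = json.loads("""
{
  "kind": "youtube#videoCategoryListResponse",
  "items": [
    {"kind": "youtube#videoCategory", "id": "1",
     "snippet": {"title": "Film & Animation", "assignable": true}},
    {"kind": "youtube#videoCategory", "id": "2",
     "snippet": {"title": "Autos & Vehicles", "assignable": true}}
  ]
}
""")

# The id is stored as a string, so it is cast to int to match df.category_id.
categories = {int(item["id"]): item["snippet"]["title"] for item in raw["items"]}
print(categories)
```

Casting the id to `int` matters because `category_id` is an integer column in the dataframe, and the lookup would otherwise find no matching keys.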

In [8]:
#Converting from objects to datetime64

df.trending_date = pd.to_datetime(df.trending_date, format='%y.%d.%m', errors='coerce')
df.publish_time = pd.to_datetime(df.publish_time, format='%Y-%m-%dT%H:%M:%S.%fZ', errors='coerce')
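
The two format strings are easy to misread, so here is a minimal check of them using the standard library's `datetime.strptime`, which uses the same directives that are passed to `pd.to_datetime` above. The sample strings are taken from the first row of the preview; the variable names are only for illustration.

```python
from datetime import datetime

# '%y.%d.%m' reads the value as two-digit year, then day, then month,
# so '17.14.11' means 14 November 2017.
trending = datetime.strptime("17.14.11", "%y.%d.%m")

# The trailing 'Z' in the format is matched as a literal character.
published = datetime.strptime("2017-11-13T17:13:01.000Z", "%Y-%m-%dT%H:%M:%S.%fZ")

print(trending.date(), published)
```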

In [9]:
#Mapping each category_id in the dataframe to its category name using the dictionary

df['category_name'] = df['category_id'].map(categories_US)

In [10]:
len(df.category_id.unique())

16

In [11]:
#Reordering columns, placing 'category_name' next to 'category_id', and removing 'thumbnail_link' and the flag columns
#.copy() makes df an independent dataframe, so columns can be added later without a SettingWithCopyWarning

df = df[['video_id', 'trending_date', 'publish_time', 'channel_title', 'category_id', 'category_name', 'title', 'description', 'tags', 'views', 'likes', 'dislikes', 'comment_count']].copy()

In [12]:
print("Our American data, on the other hand, covers videos that trended at any point from {} to {}, and the videos could have been published at any point between {} and {}.".format(min(df.trending_date.dt.date), max(df.trending_date.dt.date), min(df.publish_time.dt.date), max(df.publish_time.dt.date)))

Our American data, on the other hand, covers videos that trended at any point from 2017-11-14 to 2018-06-14, and the videos could have been published at any point between 2006-07-23 and 2018-06-14.


In [13]:
df.shape

(40949, 13)

We see that df has 40949 rows and 13 columns, down from the 16 columns of the raw file after the selection above.

In [14]:
df.describe()

Unnamed: 0,category_id,views,likes,dislikes,comment_count
count,40949.0,40949.0,40949.0,40949.0,40949.0
mean,19.972429,2360785.0,74266.7,3711.401,8446.804
std,7.568327,7394114.0,228885.3,29029.71,37430.49
min,1.0,549.0,0.0,0.0,0.0
25%,17.0,242329.0,5424.0,202.0,614.0
50%,24.0,681861.0,18091.0,631.0,1856.0
75%,25.0,1823157.0,55417.0,1938.0,5755.0
max,43.0,225211900.0,5613827.0,1674420.0,1361580.0


In [15]:
df.isnull().sum()

video_id           0
trending_date      0
publish_time       0
channel_title      0
category_id        0
category_name      0
title              0
description      570
tags               0
views              0
likes              0
dislikes           0
comment_count      0
dtype: int64

We see that the only column with missing values is **description**. Instead of removing these rows, I am going to treat a missing description as a value of its own by replacing the null values with a placeholder such as 'No Description' or '-'.

In [16]:
null_descrips = df.description.isnull()
null_cats = pd.Series(df.category_name[null_descrips])
null_cats.describe()

count                570
unique                14
top       People & Blogs
freq                 149
Name: category_name, dtype: object

These are the categories in df that have videos with null descriptions: 14 categories are affected, and the one with the most missing descriptions is 'People & Blogs', with 149 of the 570 rows.

In [17]:
len(df.video_id.unique())

6351

We see that there are only 6351 unique video_ids among the 40949 rows in df. There are a few possible reasons for this. A video could have trended at different points in time, say once when it was published and again when something the creator posted on Twitter caused a lot of people to watch an old video of theirs. Another possible reason is that the dataset records a new row for every day that a video remains trending.

In [18]:
df['days_since_last_trend'] = df.groupby('video_id')['trending_date'].diff()

In [19]:
df[df['days_since_last_trend'] == '4 days']

Unnamed: 0,video_id,trending_date,publish_time,channel_title,category_id,category_name,title,description,tags,views,likes,dislikes,comment_count,days_since_last_trend
1177,XrDlDj9DfZs,2017-11-19,2017-11-14 06:08:58,Dancing With The Stars,24,Entertainment,Lindsey and​ Mark’s - Iconic Dance - Dancing w...,Lindsey Stirling and Mark Ballas accompanied b...,"abc|""dancing""|""stars""|""dwts""|""Lindsey Stirling...",107919,1440,90,154,4 days
15116,DJFeaRWJdy8,2018-01-30,2018-01-25 16:00:03,First We Feast,26,Howto & Style,Sasha Banks Bosses Up While Eating Spicy Wings...,The Legit Boss herself—WWE superstar Sasha Ban...,"First we feast|""fwf""|""firstwefeast""|""food""|""fo...",1056913,35413,1084,6266,4 days
16577,qXU2qTRjBKU,2018-02-06,2018-01-31 17:32:26,theSkimm,22,People & Blogs,"Michael Wolff, “Fire and Fury | theSkimm Sip '...",Author Michael Wolff's “Fire and Fury is the b...,"michael wolff|""fire and fury""|""book""|""trump bo...",12989,2,16,15,4 days
18165,8h2rlhsN9DI,2018-02-14,2018-02-05 15:00:00,Inside Edition,25,News & Politics,Meet 13-Year-Old Who Took a Selfie With Justin...,More from Inside Edition: https://www.youtube....,"cat-entertainment|""trending""|""news""|""ie trendi...",316014,5464,590,772,4 days
21358,rJQYzX6Bgio,2018-03-02,2018-02-17 15:11:09,Fancy Vlogs By Gab,22,People & Blogs,from this to this real quick... flu 2018,hi im gabi demartino!\n\nwatch the flu hit me ...,"fancy|""fancy vlogs by gab""|""fancy vlogs""|""gabi...",517319,20881,333,2229,4 days
21359,XzvHSMEBFjE,2018-03-02,2018-02-17 06:23:55,Chris Smoove,17,Sports,NBA All-Star Celebrity Game 2018! Justin Biebe...,Chris Smoove T-Shirts! http://chrissmoove.com/...,"nba|""chris smoove""|""NBA All-Star Celebrity Gam...",999818,19645,535,1124,4 days
21360,rZQepOFnYi8,2018-03-02,2018-02-16 17:14:27,Brian Hull,24,Entertainment,BEST REACTION EVER! - Mickey and Minnie at Dri...,Had the best reaction at the Drive Thru doing ...,"Brian Hull|""Brian Hull Impressions""|""Brian Hul...",1288861,31886,1226,2005,4 days
21361,Mmn0zFalrD4,2018-03-02,2018-02-17 14:52:06,Hydraulic Press Channel,28,Science & Technology,Punching Huge Holes Through Everything with Hy...,"Frying pan, knife, tablet computer and lot mor...","Hydraulic press channel|""hydraulicpresschannel...",345371,6015,325,988,4 days
21362,h_cfN1t3flE,2018-03-02,2018-02-16 23:32:15,Jackie Aina,26,Howto & Style,Hmmm...Too Faced Life's a Festival Collection ...,Sign up for a free Audible 30 day trial! http:...,"too faced|""too life is a festival""|""life's a f...",839813,48283,1145,3746,4 days
21363,BJDGBNFxO7o,2018-03-02,2018-02-16 17:00:01,Will Smith,24,Entertainment,We Lost Him... | Will Smith Vlogs,"We began our journey in Miami, explored Cabo, ...","will smith|""will""|""smith""|""smiths""|""willsmith""...",847523,44077,650,3951,4 days


In [20]:
df[df['video_id'] == 'XrDlDj9DfZs']

Unnamed: 0,video_id,trending_date,publish_time,channel_title,category_id,category_name,title,description,tags,views,likes,dislikes,comment_count,days_since_last_trend
236,XrDlDj9DfZs,2017-11-15,2017-11-14 06:08:58,Dancing With The Stars,24,Entertainment,Lindsey and​ Mark’s - Iconic Dance - Dancing w...,Lindsey Stirling and Mark Ballas accompanied b...,"abc|""dancing""|""stars""|""dwts""|""Lindsey Stirling...",60564,975,66,106,NaT
1177,XrDlDj9DfZs,2017-11-19,2017-11-14 06:08:58,Dancing With The Stars,24,Entertainment,Lindsey and​ Mark’s - Iconic Dance - Dancing w...,Lindsey Stirling and Mark Ballas accompanied b...,"abc|""dancing""|""stars""|""dwts""|""Lindsey Stirling...",107919,1440,90,154,4 days


We see that a video can trend on non-consecutive days, so taking only its first or last trending day does not capture the full picture of how it trended.
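
One way to summarise this per video is to count separate trending streaks, where a new streak starts whenever the gap to the previous trending day is more than one day. The sketch below runs on a toy frame: the two dates for 'XrDlDj9DfZs' are the ones shown above, while 'hypothetical_id' and the column name `new_streak` are made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "video_id": ["XrDlDj9DfZs", "XrDlDj9DfZs",
                 "hypothetical_id", "hypothetical_id", "hypothetical_id"],
    "trending_date": pd.to_datetime([
        "2017-11-15", "2017-11-19",                # gap of 4 days
        "2017-11-14", "2017-11-15", "2017-11-16",  # consecutive days
    ]),
})

# diff() is only meaningful if each video's rows are in date order.
toy = toy.sort_values(["video_id", "trending_date"])
gap = toy.groupby("video_id")["trending_date"].diff()

# A streak starts at a video's first row (gap is NaT) or after a gap of more than one day.
toy["new_streak"] = gap.isna() | (gap > pd.Timedelta(days=1))
streaks = toy.groupby("video_id")["new_streak"].sum()
print(streaks)
```

A video with a streak count above 1 left the trending list and came back, which is the case that first-day or last-day summaries would hide.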

In [21]:
df.dtypes

video_id                          object
trending_date             datetime64[ns]
publish_time              datetime64[ns]
channel_title                     object
category_id                        int64
category_name                     object
title                             object
description                       object
tags                              object
views                              int64
likes                              int64
dislikes                           int64
comment_count                      int64
days_since_last_trend    timedelta64[ns]
dtype: object

Let us see how many unique channel names there are in our dataset:

In [22]:
len(df.channel_title.unique())

2207

In [23]:
len(df.title.unique())

6455

It is interesting that the number of unique video IDs (6351) is not equal to the number of unique video titles (6455). I will examine this further while performing EDA.
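
A possible starting point for that check is to count distinct titles per video_id and keep the IDs that have more than one. This is only a sketch on made-up rows; the IDs and titles below are hypothetical, and on the real data `toy` would be replaced by `df`.

```python
import pandas as pd

# Hypothetical rows: video 'a' was retitled between two trending days.
toy = pd.DataFrame({
    "video_id": ["a", "a", "b", "b"],
    "title": ["Original title", "Original title (UPDATED)", "Same title", "Same title"],
})

# Number of distinct titles seen for each video_id.
titles_per_id = toy.groupby("video_id")["title"].nunique()

# IDs with more than one title would explain titles outnumbering IDs.
retitled = titles_per_id[titles_per_id > 1]
print(retitled)
```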

Let us see if uppercase titles are more commonly found among trending videos than lowercase/regularly capitalized ones:

In [24]:
#Counting the unique titles that are unchanged by upper(), i.e. that contain no lowercase letters
n = sum(title == title.upper() for title in df.title.unique())
print(n)

348


We see that out of 6455 unique video titles in our dataset, only 348 (about 5%) are in uppercase, so uppercase titles are in the minority.
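
One caveat about this count: comparing a title with its `upper()` version also counts titles that have no letters at all, whereas `str.isupper()` requires at least one cased character. The sketch below shows the difference on three titles; the first two appear in the preview above and '2018' is a made-up example.

```python
titles = [
    "WE WANT TO TALK ABOUT OUR MARRIAGE",  # fully uppercase
    "2 Weeks with iPhone X",               # mixed case
    "2018",                                # made-up title with no letters
]

# The comparison used above: true whenever upper() changes nothing.
equality_count = sum(title == title.upper() for title in titles)

# isupper() is stricter: it is False when there are no cased characters.
isupper_count = sum(title.isupper() for title in titles)

print(equality_count, isupper_count)
```

Which definition is appropriate depends on whether digit-only or symbol-only titles should count as uppercase.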

#### Filling NaN values

In [25]:
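#Assigning the result back instead of using inplace=True on a selected column, which newer pandas versions warn about
df['description'] = df['description'].fillna('No Description')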

In [26]:
from datetime import timedelta

In [27]:
#The first trending day of each video has no previous day, so its gap is set to 0 days
df['days_since_last_trend'] = df['days_since_last_trend'].fillna(timedelta(seconds=0))

We see that a large number of videos in df include web links in their descriptions! Maybe this is key to trending in the 21st century?

When I perform EDA, I am curious to examine which websites are linked most often. Are they the creator's other social media accounts? Are they other YouTube links?
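
A rough way to approach that question is to pull every URL out of each description and tally the host names. The sketch below uses only the standard library; the helper `linked_domains`, the regular expression, and the sample descriptions are all assumptions for illustration, and the descriptions in the preview are truncated, so real results need the full column.

```python
import re
from collections import Counter
from urllib.parse import urlparse

# A deliberately simple pattern: 'http://' or 'https://' followed by non-space characters.
URL_PATTERN = re.compile(r"https?://\S+")

def linked_domains(descriptions):
    """Count how often each host name is linked across the descriptions."""
    counts = Counter()
    for text in descriptions:
        for url in URL_PATTERN.findall(text):
            host = urlparse(url).netloc.lower()
            # Treat 'www.youtube.com' and 'youtube.com' as the same site.
            if host.startswith("www."):
                host = host[4:]
            counts[host] += 1
    return counts

# Made-up descriptions in the style of the dataset.
sample = [
    "Follow me: https://www.youtube.com/user/example and https://twitter.com/example",
    "More videos: https://youtube.com/watch?v=abc",
    "Merch at http://example.com/shop",
    "No Description",
]
counts = linked_domains(sample)
print(counts.most_common())
```

On the real data, something like `linked_domains(df['description'])` would be the equivalent call once the null descriptions have been filled.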

In [28]:
df.head(20)

Unnamed: 0,video_id,trending_date,publish_time,channel_title,category_id,category_name,title,description,tags,views,likes,dislikes,comment_count,days_since_last_trend
0,2kyS6SvSYSE,2017-11-14,2017-11-13 17:13:01,CaseyNeistat,22,People & Blogs,WE WANT TO TALK ABOUT OUR MARRIAGE,SHANTELL'S CHANNEL - https://www.youtube.com/s...,SHANtell martin,748374,57527,2966,15954,0 days
1,1ZAPwfrtAFY,2017-11-14,2017-11-13 07:30:00,LastWeekTonight,24,Entertainment,The Trump Presidency: Last Week Tonight with J...,"One year after the presidential election, John...","last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,0 days
2,5qpjK5DgCt4,2017-11-14,2017-11-12 19:05:24,Rudy Mancuso,23,Comedy,"Racist Superman | Rudy Mancuso, King Bach & Le...",WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,0 days
3,puqaWrEC7tY,2017-11-14,2017-11-13 11:00:04,Good Mythical Morning,24,Entertainment,Nickelback Lyrics: Real or Fake?,Today we find out if Link is a Nickelback amat...,"rhett and link|""gmm""|""good mythical morning""|""...",343168,10172,666,2146,0 days
4,d380meD0W0M,2017-11-14,2017-11-12 18:01:41,nigahiga,24,Entertainment,I Dare You: GOING BALD!?,I know it's been a while since we did this sho...,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,132235,1989,17518,0 days
5,gHZ1Qz0KiKM,2017-11-14,2017-11-13 19:07:23,iJustine,28,Science & Technology,2 Weeks with iPhone X,Using the iPhone for the past two weeks -- her...,"ijustine|""week with iPhone X""|""iphone x""|""appl...",119180,9763,511,1434,0 days
6,39idVpFF7NQ,2017-11-14,2017-11-12 05:37:17,Saturday Night Live,24,Entertainment,Roy Moore & Jeff Sessions Cold Open - SNL,Embattled Alabama Senate candidate Roy Moore (...,"SNL|""Saturday Night Live""|""SNL Season 43""|""Epi...",2103417,15993,2445,1970,0 days
7,nc99ccSXST0,2017-11-14,2017-11-12 21:50:37,CrazyRussianHacker,28,Science & Technology,5 Ice Cream Gadgets put to the Test,Ice Cream Pint Combination Lock - http://amzn....,"5 Ice Cream Gadgets|""Ice Cream""|""Cream Sandwic...",817732,23663,778,3432,0 days
8,jr9QtXwC9vc,2017-11-14,2017-11-13 14:00:23,20th Century Fox,1,Film & Animation,The Greatest Showman | Official Trailer 2 [HD]...,"Inspired by the imagination of P.T. Barnum, Th...","Trailer|""Hugh Jackman""|""Michelle Williams""|""Za...",826059,3543,119,340,0 days
9,TUmyygCMMGA,2017-11-14,2017-11-13 13:45:16,Vox,25,News & Politics,Why the rise of the robots won’t mean the end ...,"For now, at least, we have better things to wo...","vox.com|""vox""|""explain""|""shift change""|""future...",256426,12654,1363,2368,0 days


In [29]:
df.dtypes

video_id                          object
trending_date             datetime64[ns]
publish_time              datetime64[ns]
channel_title                     object
category_id                        int64
category_name                     object
title                             object
description                       object
tags                              object
views                              int64
likes                              int64
dislikes                           int64
comment_count                      int64
days_since_last_trend    timedelta64[ns]
dtype: object

Saving the dataframe:

In [30]:
#The with block closes the file even if pickle.dump raises an error
with open('US.pkl', 'wb') as file:
    pickle.dump(df, file)
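
To pick the data up again in a later notebook, the pickle can be read back with pandas. The sketch below round-trips a small made-up frame through a temporary file rather than 'US.pkl', and uses `DataFrame.to_pickle` with `pd.read_pickle` as an alternative to the `pickle` module; the point is that the dtypes survive, which would not be the case with a CSV.

```python
import os
import tempfile

import pandas as pd

# A made-up frame with a timedelta column, like days_since_last_trend.
toy = pd.DataFrame({
    "video_id": ["a", "b"],
    "views": [10, 20],
    "days_since_last_trend": pd.to_timedelta([0, 4], unit="D"),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.pkl")
    toy.to_pickle(path)              # write the frame to disk
    restored = pd.read_pickle(path)  # read it back

print(restored.dtypes)
```

Pickle files can execute code when loaded, so only files from a trusted source should be read this way.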