# Trending YouTube Content Analysis

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#assess">Assess the data</a>
    <ul>
        <li><a href="#assess_sum">Assessment summary</a></li>
    </ul>
</li>
<li><a href="#clean">Clean the data</a>
    <ul>
        <li><a href="#clean_sum">Cleaning summary</a></li>
    </ul>
</li>
</ul>

<a id='intro'></a>
## Introduction
This analysis examines the stats of YouTube trending videos to understand what it takes for a video to trend. After duplicate rows are removed, the dataset consists of 40,901 records, each representing a video's stats on a day it was trending. There are 6,351 unique videos, with trending dates ranging from November 2017 to June 2018.

In [1]:
import numpy as np
import pandas as pd
from tabulate import tabulate

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.io as pio
import plotly.express as px
import plotly.graph_objects as go

# pio.templates.default = "plotly_white"
%matplotlib inline


In [2]:
# Read csv file of trending YouTube video data
videos = pd.read_csv(r'/Users/baka_brooks/Documents/my-projects/youtube/datasets/USvideos.csv')

In [3]:
# Read json file of US video categories

data = pd.read_json(r"/Users/baka_brooks/Documents/my-projects/youtube/datasets/US_category_id.json")

category_id = []
category = []
for item in data['items']:
    category_id.append(item['id'])
    category.append(item['snippet']['title'])
    
categories = pd.DataFrame(list(zip(category_id, category)), columns=['category_id', 'category'])
categories['category_id'] = categories['category_id'].astype('int64')
categories = categories.sort_values('category_id')
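
The loop above can also be written as a single list comprehension. The following is a minimal sketch, assuming the JSON has the same shape the cell relies on (a top-level `items` list whose entries carry `id` and `snippet.title`); the `sample` dictionary and the name `categories_sketch` are hypothetical stand-ins, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample mirroring the structure read above:
# a top-level "items" list whose entries carry "id" and "snippet" -> "title".
sample = {
    "items": [
        {"id": "23", "snippet": {"title": "Comedy"}},
        {"id": "22", "snippet": {"title": "People & Blogs"}},
        {"id": "34", "snippet": {"title": "Comedy"}},
    ]
}

# Build (category_id, category) pairs in one pass, casting the id to int
# so it can later be joined against the integer category_id in the videos data.
categories_sketch = (
    pd.DataFrame(
        [(int(item["id"]), item["snippet"]["title"]) for item in sample["items"]],
        columns=["category_id", "category"],
    )
    .sort_values("category_id")
    .reset_index(drop=True)
)
```

Casting the id while building the rows avoids the separate `astype` step; the result is otherwise intended to match the `categories` DataFrame built above.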

<a id='assess'></a>
## Assess the data

In [4]:
videos.head(5)

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,puqaWrEC7tY,17.14.11,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13T11:00:04.000Z,"rhett and link|""gmm""|""good mythical morning""|""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...
4,d380meD0W0M,17.14.11,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,132235,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...


- Tags separated by "|" character
- Description also contains special characters
- The `trending_date` and `publish_time` columns use different formats: `publish_time` is an ISO 8601 timestamp, while `trending_date` uses a nonstandard format that cannot be parsed as a date without reordering

In [5]:
videos.shape

(40949, 16)

- The dataset has 40,949 rows and 16 columns. Each row is a video's stats on one trending day, not a distinct video.

In [6]:
videos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40949 entries, 0 to 40948
Data columns (total 16 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   video_id                40949 non-null  object
 1   trending_date           40949 non-null  object
 2   title                   40949 non-null  object
 3   channel_title           40949 non-null  object
 4   category_id             40949 non-null  int64 
 5   publish_time            40949 non-null  object
 6   tags                    40949 non-null  object
 7   views                   40949 non-null  int64 
 8   likes                   40949 non-null  int64 
 9   dislikes                40949 non-null  int64 
 10  comment_count           40949 non-null  int64 
 11  thumbnail_link          40949 non-null  object
 12  comments_disabled       40949 non-null  bool  
 13  ratings_disabled        40949 non-null  bool  
 14  video_error_or_removed  40949 non-null  bool  
 15  de

- There are some null descriptions

In [7]:
videos.dtypes

video_id                  object
trending_date             object
title                     object
channel_title             object
category_id                int64
publish_time              object
tags                      object
views                      int64
likes                      int64
dislikes                   int64
comment_count              int64
thumbnail_link            object
comments_disabled           bool
ratings_disabled            bool
video_error_or_removed      bool
description               object
dtype: object

- Dates and times are read as strings

In [8]:
videos.duplicated().sum()

48

- There are 48 duplicate rows in the dataset.
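
Before dropping duplicates, it can help to look at every copy side by side. This is a small sketch on toy data (the `toy` DataFrame is made up for illustration): `duplicated()` with its default `keep='first'` flags only the repeats, while `keep=False` flags every row that belongs to a duplicate group.

```python
import pandas as pd

# Toy data: the first two rows are exact copies of each other;
# the last two share a video_id but differ in date and views.
toy = pd.DataFrame({
    "video_id": ["a", "a", "b", "b"],
    "trending_date": ["17.14.11", "17.14.11", "17.14.11", "17.15.11"],
    "views": [10, 10, 5, 7],
})

# Default keep='first': counts only the repeats beyond the first occurrence.
n_extra = int(toy.duplicated().sum())

# keep=False: selects every row that has an exact copy somewhere else.
all_copies = toy[toy.duplicated(keep=False)]
```

Here `n_extra` corresponds to the kind of count reported above, and `all_copies` shows both members of each duplicate pair for inspection.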

In [9]:
videos.video_id.nunique()

6351

- There are 40K+ rows in the dataset, but only 6,351 unique videos.

In [10]:
videos.video_id.value_counts().head()

j4KvrAUjn6c    30
t4pRQ0jn23Q    29
NBSAQenU2Bk    29
QBL8IRJ5yHU    29
r-3iathMo7o    29
Name: video_id, dtype: int64

- The most common `video_id` is "j4KvrAUjn6c", which appears 30 times. Let's examine all rows for a single video (here the first video in the dataset, "2kyS6SvSYSE") to see how its rows differ.

In [11]:
videos[videos['video_id'] == '2kyS6SvSYSE']

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
217,2kyS6SvSYSE,17.15.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,2188590,88099,7150,24225,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
448,2kyS6SvSYSE,17.16.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,2325233,91111,7543,21450,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
689,2kyS6SvSYSE,17.17.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,2400741,92831,7687,21714,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
924,2kyS6SvSYSE,17.18.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,2468267,94303,7802,21866,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1159,2kyS6SvSYSE,17.19.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,2524854,95587,7892,22038,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1383,2kyS6SvSYSE,17.20.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,2564903,96321,7972,22149,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...


- Rows for the same video differ in trending date and in the stats (views, likes, dislikes, comment count) recorded on that date.

In [12]:
categories[categories['category'] == 'Comedy']

Unnamed: 0,category_id,category
10,23,Comedy
21,34,Comedy


<a id='assess_sum'></a>
**Assessment summary**
1. The `trending_date` and `publish_time` columns are strings and will need to be converted to datetime objects.
2. The `trending_date` column is in the wrong format to be read as a datetime object.
3. The `publish_time` column includes both date and time. In order to provide more segmented time-based analysis, date and time will be split into separate columns.
4. There are 48 duplicate rows in the dataset. These rows will be removed.
5. Merge the two DataFrames on `category_id`.
6. Rename the `comment_count` column to `comments`, for consistency and ease of analysis.
7. Tags appear as a single string in each record separated by the "|" character. Split each string and create an array in each cell.
8. Of the 40K+ rows in the dataset, there are 6,351 unique videos. The other rows represent additional dates on which a video was trending. A separate DataFrame will be created after merging the video categories in order to analyze individual videos based on the first date they were trending.

<a id='clean'></a>
## Clean the data

In [13]:
videos_clean = videos.copy()
categories_clean = categories.copy()

1. The `trending_date` and `publish_time` columns are strings and will need to be converted to datetime objects.   
For the purposes of this analysis, I will only need to change the data types of the date and time columns.

2. The `trending_date` column is in the wrong format to be read as a datetime object.   
The `trending_date` column is currently in an unreadable date format, so I will need to get the data into a suitable format first.

In [14]:
videos_clean["publish_time"] = pd.to_datetime(videos_clean["publish_time"])
videos_clean.dtypes

video_id                               object
trending_date                          object
title                                  object
channel_title                          object
category_id                             int64
publish_time              datetime64[ns, UTC]
tags                                   object
views                                   int64
likes                                   int64
dislikes                                int64
comment_count                           int64
thumbnail_link                         object
comments_disabled                        bool
ratings_disabled                         bool
video_error_or_removed                   bool
description                            object
dtype: object

In [15]:
videos_clean.trending_date.value_counts()

18.21.05    200
17.28.12    200
17.26.11    200
18.23.05    200
18.19.04    200
           ... 
18.01.02    197
18.31.01    197
18.04.02    196
18.02.02    196
18.03.02    196
Name: trending_date, Length: 205, dtype: int64

- By observation, all of the records are in the yy.dd.mm format, so I will parse the date this way across the entire column. 
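
The observation above can be checked programmatically rather than by eye. The sketch below uses a few values taken from the output above as a toy Series (`toy_dates` is a hypothetical name); on the real data the same checks would run against `videos_clean.trending_date`.

```python
import pandas as pd

# A few trending_date strings in the format seen in the output above.
toy_dates = pd.Series(["17.14.11", "17.28.12", "18.01.02", "18.31.01"])

# Every value should be three two-digit fields separated by dots.
matches_pattern = toy_dates.str.match(r"^\d{2}\.\d{2}\.\d{2}$").all()

# Split into integer fields to see which position can hold a month.
parts = toy_dates.str.split(".", expand=True).astype(int)
middle_max = int(parts[1].max())  # exceeds 12 here, so it cannot be a month
last_max = int(parts[2].max())    # stays within 1..12, consistent with a month
```

If the middle field exceeds 12 somewhere and the last field never does, the yy.dd.mm reading is the only consistent one.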

In [16]:
(videos_clean.publish_time.min(), videos_clean.publish_time.max())

(Timestamp('2006-07-23 08:24:11+0000', tz='UTC'),
 Timestamp('2018-06-14 01:31:53+0000', tz='UTC'))

- The publish dates range from 2006 to 2018, so every two-digit year belongs to the 2000s, and I will prepend "20" to the year of the cleaned trending date.

In [17]:
videos_clean["trending_date"] = videos_clean.trending_date.str.split(".")

In [18]:
date = []
for x in videos_clean.trending_date:
    year = "20" + x[0]
    day = x[1]
    month = x[2]
    date.append(year + "-" + month + "-" + day)

videos_clean["trending_date"] = date

In [19]:
videos_clean["trending_date"] = pd.to_datetime(videos_clean["trending_date"])
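
As an alternative to the split-and-concatenate steps above, `pd.to_datetime` accepts an explicit `format` string. This is a sketch on toy values (`toy_dates` and `parsed` are hypothetical names); it relies on the `%y` directive mapping two-digit years below 69 to the 2000s, which holds for the 17 and 18 seen in this dataset.

```python
import pandas as pd

# Two trending_date strings in yy.dd.mm order, as in the raw data.
toy_dates = pd.Series(["17.14.11", "18.31.01"])

# %y = two-digit year, %d = day, %m = month, matching the yy.dd.mm layout.
parsed = pd.to_datetime(toy_dates, format="%y.%d.%m")
```

Passing the format explicitly also makes the parse fail loudly if any value deviates from the expected layout, instead of silently guessing.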

In [20]:
videos_clean.dtypes

video_id                               object
trending_date                  datetime64[ns]
title                                  object
channel_title                          object
category_id                             int64
publish_time              datetime64[ns, UTC]
tags                                   object
views                                   int64
likes                                   int64
dislikes                                int64
comment_count                           int64
thumbnail_link                         object
comments_disabled                        bool
ratings_disabled                         bool
video_error_or_removed                   bool
description                            object
dtype: object

3. The `publish_time` column includes both date and time. In order to provide more segmented time-based analysis, date and time will be split into separate columns.   
I will use the pandas `.dt` accessor to extract the date and the time into two new columns, `publish_date` and `publish_time_new`.

In [21]:
videos_clean['publish_date'] = videos_clean['publish_time'].dt.date
videos_clean['publish_time_new'] = videos_clean['publish_time'].dt.time
videos_clean.head(2)

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,publish_date,publish_time_new
0,2kyS6SvSYSE,2017-11-14,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13 17:13:01+00:00,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...,2017-11-13,17:13:01
1,1ZAPwfrtAFY,2017-11-14,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13 07:30:00+00:00,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John...",2017-11-13,07:30:00


In [22]:
videos_clean.dtypes

video_id                               object
trending_date                  datetime64[ns]
title                                  object
channel_title                          object
category_id                             int64
publish_time              datetime64[ns, UTC]
tags                                   object
views                                   int64
likes                                   int64
dislikes                                int64
comment_count                           int64
thumbnail_link                         object
comments_disabled                        bool
ratings_disabled                         bool
video_error_or_removed                   bool
description                            object
publish_date                           object
publish_time_new                       object
dtype: object
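
The dtypes above show `publish_date` and `publish_time_new` stored as `object`, because `.dt.date` and `.dt.time` return plain Python objects. If later steps need datetime-aware operations such as resampling or grouping by hour, one option is to keep typed columns instead. The sketch below is an alternative, not what this notebook does; the column names `publish_day` and `publish_hour` are hypothetical.

```python
import pandas as pd

# Toy publish timestamps in the same ISO 8601 form as the raw data.
toy = pd.DataFrame({
    "publish_time": pd.to_datetime(
        ["2017-11-13T17:13:01.000Z", "2017-11-13T07:30:00.000Z"], utc=True
    )
})

# normalize() resets the time to midnight but keeps a datetime64 dtype.
toy["publish_day"] = toy["publish_time"].dt.normalize()

# The hour is an integer column, convenient for hour-of-day grouping.
toy["publish_hour"] = toy["publish_time"].dt.hour
```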

In [23]:
videos_clean.publish_time_new.value_counts()

14:00:03    250
16:00:03    224
14:00:04    215
14:00:01    213
05:00:01    210
           ... 
14:02:05      1
12:55:39      1
20:33:38      1
18:46:37      1
03:56:23      1
Name: publish_time_new, Length: 4478, dtype: int64

- Drop the original `publish_time` column and rename the new time column

In [24]:
videos_clean = videos_clean.drop(columns='publish_time')
videos_clean = videos_clean.rename(columns={'publish_time_new': 'publish_time'})
videos_clean.head(2)

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,publish_date,publish_time
0,2kyS6SvSYSE,2017-11-14,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...,2017-11-13,17:13:01
1,1ZAPwfrtAFY,2017-11-14,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John...",2017-11-13,07:30:00


4. There are 48 duplicate rows in the dataset. These rows will be removed.   
This will be done using the `.drop_duplicates()` method.

In [25]:
videos_clean = videos_clean.drop_duplicates()
videos_clean.shape

(40901, 17)

In [26]:
videos_clean.duplicated().sum()

0

5. Merge the two DataFrames on `category_id`.

In [27]:
# Merge DataFrames
df = videos_clean.merge(categories_clean, on='category_id', how='left')
df.head()

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,publish_date,publish_time,category
0,2kyS6SvSYSE,2017-11-14,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...,2017-11-13,17:13:01,People & Blogs
1,1ZAPwfrtAFY,2017-11-14,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John...",2017-11-13,07:30:00,Entertainment
2,5qpjK5DgCt4,2017-11-14,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...,2017-11-12,19:05:24,Comedy
3,puqaWrEC7tY,2017-11-14,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,"rhett and link|""gmm""|""good mythical morning""|""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...,2017-11-13,11:00:04,Entertainment
4,d380meD0W0M,2017-11-14,I Dare You: GOING BALD!?,nigahiga,24,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,132235,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...,2017-11-12,18:01:41,Entertainment


In [28]:
df.shape

(40901, 18)

In [29]:
df.isna().sum()

video_id                    0
trending_date               0
title                       0
channel_title               0
category_id                 0
tags                        0
views                       0
likes                       0
dislikes                    0
comment_count               0
thumbnail_link              0
comments_disabled           0
ratings_disabled            0
video_error_or_removed      0
description               569
publish_date                0
publish_time                0
category                    0
dtype: int64

- There were no NaN values created in the `category` column as a result of the left join, so all videos were classified appropriately. Only one 'Comedy' category was needed.
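
Both conclusions above can be verified directly during the merge. The following sketch uses toy frames (`toy_videos` and `toy_categories` are made-up stand-ins for `videos_clean` and `categories_clean`) together with the `validate` and `indicator` options of `DataFrame.merge`.

```python
import pandas as pd

toy_videos = pd.DataFrame({
    "video_id": ["a", "b", "c"],
    "category_id": [22, 23, 23],
})
toy_categories = pd.DataFrame({
    "category_id": [22, 23, 34],
    "category": ["People & Blogs", "Comedy", "Comedy"],
})

# validate="many_to_one" raises if category_id is not unique in the lookup
# table; indicator=True adds a _merge column recording where each row matched.
merged = toy_videos.merge(
    toy_categories, on="category_id", how="left",
    validate="many_to_one", indicator=True,
)

# Rows with no matching category would be marked "left_only".
unmatched = int((merged["_merge"] == "left_only").sum())

# Which of the duplicated 'Comedy' ids actually occur in the video data?
comedy_ids_used = sorted(
    merged.loc[merged["category"] == "Comedy", "category_id"].unique().tolist()
)
```

A zero `unmatched` count plays the same role as the zero NaN count above, and a single entry in `comedy_ids_used` confirms that only one 'Comedy' id is in use.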

6. Rename the `comment_count` column to `comments`, for consistency and ease of analysis.   
This matches the naming of the other numeric columns (`views`, `likes`, `dislikes`).

In [30]:
df = df.rename(columns={"comment_count": "comments"})
df.head()

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,tags,views,likes,dislikes,comments,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,publish_date,publish_time,category
0,2kyS6SvSYSE,2017-11-14,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...,2017-11-13,17:13:01,People & Blogs
1,1ZAPwfrtAFY,2017-11-14,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John...",2017-11-13,07:30:00,Entertainment
2,5qpjK5DgCt4,2017-11-14,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...,2017-11-12,19:05:24,Comedy
3,puqaWrEC7tY,2017-11-14,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,"rhett and link|""gmm""|""good mythical morning""|""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...,2017-11-13,11:00:04,Entertainment
4,d380meD0W0M,2017-11-14,I Dare You: GOING BALD!?,nigahiga,24,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,132235,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...,2017-11-12,18:01:41,Entertainment


7. Tags appear as a single string in each record separated by the "|" character. Split each string and create an array in each cell.

Steps:
- Remove quotation marks from tags that have them.
- Strip beginning and trailing whitespace.
- Split each tag using the delimiter '|'.

In [31]:
df['tags'] = df['tags'].str.replace('"', '').str.strip().str.split('|')

In [32]:
df.tags

0                                        [SHANtell martin]
1        [last week tonight trump presidency, last week...
2        [racist superman, rudy, mancuso, king, bach, r...
3        [rhett and link, gmm, good mythical morning, r...
4        [ryan, higa, higatv, nigahiga, i dare you, idy...
                               ...                        
40896    [aarons animals, aarons, animals, cat, cats, k...
40897                                             [[none]]
40898    [I gave safiya nygaard a perfect hair makeover...
40899    [Black Panther, HISHE, Marvel, Infinity War, H...
40900         [call of duty, cod, activision, Black Ops 4]
Name: tags, Length: 40901, dtype: object
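
With tags stored as lists, per-tag counts can be obtained by exploding the column into one row per tag. The sketch below works on toy data and also filters out the `[none]` placeholder visible in the output above; treating `[none]` as "no tags" is an assumption about the dataset's convention.

```python
import pandas as pd

toy = pd.DataFrame({
    "video_id": ["a", "b", "c"],
    "tags": ['cat|"cats"|"funny"', "[none]", 'funny|"dog"'],
})

# Same cleaning steps as above: drop quotes, strip whitespace, split on '|'.
toy["tags"] = (
    toy["tags"].str.replace('"', "", regex=False).str.strip().str.split("|")
)

# One row per (video, tag), then count tags, skipping the placeholder.
exploded = toy.explode("tags")
tag_counts = exploded.loc[exploded["tags"] != "[none]", "tags"].value_counts()
```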

8. Of the 40K+ rows in the dataset, there are 6,351 unique videos. The other rows represent additional dates on which a video was trending. A separate DataFrame will be created after merging the video categories in order to analyze individual videos based on the first date they were trending.   
For every video, including those that trended on multiple days, I will find the day it first trended and create a DataFrame of individual videos with the `trending_date` and associated stats from that first day.

In [33]:
df.video_id.value_counts()

j4KvrAUjn6c    29
8h--kFui1JA    29
6S9c5nnDd_s    28
QBL8IRJ5yHU    28
MAjY8mCTXWk    28
               ..
pi0ePRY7TSc     1
qgTtyfgzGc0     1
s7JiyWfGIh8     1
0-_h-qFt_zs     1
BSHh0MmJM1U     1
Name: video_id, Length: 6351, dtype: int64

- The new DataFrame must filter out the duplicate videos, leaving only the row of the first trending date.

In [34]:
first = df.sort_values('trending_date').drop_duplicates('video_id', keep='first')
first.shape

(6351, 18)
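
An equivalent way to pick each video's first trending row is to group by `video_id` and take the index of the earliest `trending_date`. This is a sketch on toy data (`toy`, `first_idx`, and `first_sketch` are hypothetical names), offered as a cross-check on the sort-and-drop approach above rather than a replacement for it.

```python
import pandas as pd

toy = pd.DataFrame({
    "video_id": ["a", "b", "a", "b"],
    "trending_date": pd.to_datetime(
        ["2017-11-15", "2017-11-14", "2017-11-14", "2017-11-16"]
    ),
    "views": [200, 50, 100, 90],
})

# For each video, find the row label of its earliest trending date.
first_idx = toy.groupby("video_id")["trending_date"].idxmin()

# Select those rows; the stats are the ones recorded on the first trending day.
first_sketch = toy.loc[first_idx]
```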

In [35]:
first.sort_values('video_id')

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,tags,views,likes,dislikes,comments,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,publish_date,publish_time,category
39121,-0CMnp02rNY,2018-06-06,Mindy Kaling's Daughter Had the Perfect Reacti...,TheEllenShow,24,"[ellen, ellen degeneres, the ellen show, ellen...",475965,6531,172,271,https://i.ytimg.com/vi/-0CMnp02rNY/default.jpg,False,False,False,Ocean's 8 star Mindy Kaling dished on bringing...,2018-06-04,13:00:00,Entertainment
15457,-0NYY8cqdiQ,2018-02-01,Megan Mullally Didn't Notice the Interesting P...,TheEllenShow,24,"[megan mullally, megan, mullally, will and gra...",563746,4429,54,94,https://i.ytimg.com/vi/-0NYY8cqdiQ/default.jpg,False,False,False,Ellen and Megan Mullally have known each other...,2018-01-29,14:00:39,Entertainment
31553,-1Hm41N0dUs,2018-04-29,Cast of Avengers: Infinity War Draws Their Cha...,Jimmy Kimmel Live,23,"[jimmy, jimmy kimmel, jimmy kimmel live, late ...",1566807,32752,393,1490,https://i.ytimg.com/vi/-1Hm41N0dUs/default.jpg,False,False,False,"Benedict Cumberbatch, Don Cheadle, Elizabeth O...",2018-04-27,07:30:02,Comedy
3019,-1yT-K3c6YI,2017-11-29,YOUTUBER QUIZ + TRUTH OR DARE W/ THE MERRELL T...,Molly Burke,22,"[youtube quiz, youtuber quiz, truth or dare, e...",129360,5214,108,516,https://i.ytimg.com/vi/-1yT-K3c6YI/default.jpg,False,False,False,Check out the video we did on the Merrell Twin...,2017-11-28,18:30:43,People & Blogs
90,-2RVw2_QyxQ,2017-11-14,2017 Champions Showdown: Day 3,Saint Louis Chess Club,27,"[Chess, Saint Louis, Club]",67429,438,23,23,https://i.ytimg.com/vi/-2RVw2_QyxQ/default.jpg,False,False,False,The Saint Louis Chess Club hosts a series of f...,2017-11-12,02:39:01,Education
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26002,zwEn-ambXLw,2018-03-26,This Is Me - Cover by Shoshana Bean Featuring ...,Shoshana Bean,10,"[travis wall, shoshana bean, greatest showman,...",114133,4979,66,198,https://i.ytimg.com/vi/zwEn-ambXLw/default.jpg,False,False,False,I was lucky enough to lay the original demo fo...,2018-03-22,08:30:07,Music
246,zxUwbflE1SY,2017-11-15,100 People Hold Their Breath for as Long as Th...,Cut,24,"[breath, hold, funny, holding breath, breathin...",190669,5162,131,1151,https://i.ytimg.com/vi/zxUwbflE1SY/default.jpg,False,False,False,Get Cut swag here: http://cut.com/shop\n\nDon’...,2017-11-13,13:00:10,Entertainment
32150,zxwfDlhJIpw,2018-05-02,kanye west / charlamagne interview,Kanye West,22,"[Kanye West, YEEZY, Kanye, Charlamagne, The Br...",3134765,88905,7526,26692,https://i.ytimg.com/vi/zxwfDlhJIpw/default.jpg,False,False,False,,2018-05-01,15:57:06,People & Blogs
144,zy0b9e40tK8,2017-11-14,Dark | Official Trailer [HD] | Netflix,Netflix,24,"[Netflix, Baran Bo Odar, Jantje Friese, DARK, ...",378750,5642,146,675,https://i.ytimg.com/vi/zy0b9e40tK8/default.jpg,False,False,False,The disappearance of two kids in the German sm...,2017-11-09,09:00:07,Entertainment


In [36]:
first.video_id.nunique()

6351

In [37]:
df.dtypes

video_id                          object
trending_date             datetime64[ns]
title                             object
channel_title                     object
category_id                        int64
tags                              object
views                              int64
likes                              int64
dislikes                           int64
comments                           int64
thumbnail_link                    object
comments_disabled                   bool
ratings_disabled                    bool
video_error_or_removed              bool
description                       object
publish_date                      object
publish_time                      object
category                          object
dtype: object

In [38]:
first.dtypes

video_id                          object
trending_date             datetime64[ns]
title                             object
channel_title                     object
category_id                        int64
tags                              object
views                              int64
likes                              int64
dislikes                           int64
comments                           int64
thumbnail_link                    object
comments_disabled                   bool
ratings_disabled                    bool
video_error_or_removed              bool
description                       object
publish_date                      object
publish_time                      object
category                          object
dtype: object

<a id='clean_sum'></a>
**Cleaning summary**
1. The date/time based columns were converted to datetime objects from strings using `pd.to_datetime()`.
2. The `trending_date` column was not in 'YYYY-MM-DD' format and needed to be split on '.' and concatenated in order to be converted to datetime.
3. The `publish_time` column includes both date and time and was split into two separate columns, `publish_date` and `publish_time` using pandas datetime methods.
4. 48 duplicate rows were dropped using `.drop_duplicates()`.
5. Only one of the two 'Comedy' `category_id` values appeared in the video data, so there was no need to check for misclassification.
6. The `comment_count` column was renamed to `comments`, for a consistent naming convention across the numeric columns.
7. The tags in the `tags` column were transformed into an array of tags for each video.
8. The 6,351 unique videos were isolated into a separate DataFrame using their first trending date for categorical analysis.

### Export DataFrames

In [39]:
df.to_pickle('src/full_data_cleaned.pkl')

In [40]:
first.to_pickle('src/individual_videos.pkl')
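
Pickle is used here presumably because it preserves the list-valued `tags` column and the datetime dtypes, which a CSV export would flatten to strings. A minimal round-trip sketch, using a temporary directory and a toy frame instead of the real `src/` paths:

```python
import tempfile
from pathlib import Path

import pandas as pd

toy = pd.DataFrame({
    "video_id": ["a"],
    "tags": [["cat", "funny"]],
    "trending_date": pd.to_datetime(["2017-11-14"]),
})

# Write to a temporary location and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy.pkl"
    toy.to_pickle(path)
    restored = pd.read_pickle(path)
```

After the round trip, `restored` should hold the tags as a Python list and `trending_date` as a datetime column. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.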