# Trending YouTube Content Exploratory Data Analysis

## Table of Contents
<ul>
<li><a href="#intro">Introduction</a></li>
<li><a href="#wrangling">Data Wrangling</a></li>
    <ul>
    <li><a href="#import">Importing libraries and data</a></li>
    <li><a href="#assess">Assess the data</a></li>
        <ul>
            <li><a href="#assess_sum">Assessment summary</a></li>
        </ul>
    <li><a href="#clean">Clean the data</a></li>
        <ul>
            <li><a href="clean_sum">Cleaning summary</a></li>
        </ul>
    </ul>
<li><a href="#eda">Exploratory Data Analysis</a></li>
<li><a href="#conclusions">Conclusions</a></li>
</ul>

<a id='intro'></a>
## Introduction
The purpose of this analysis is to analyze the YouTube trending content in the US. Here are the questions I am looking to answer:
- What types of videos appear in trending content most often? How has that changed over time?
- Are there particular days/times videos were posted that affect top video performance?
- How long does it take for a video to become trending typically?
- Is there a threshold for the amount of engagement/other statistic to become trending?
- What creators have had the most success in consistently publishing trending video? Why is that?
- Are there common themes in best practices (titles, descriptions, tags) amongst trending content?

<a id='wrangling'></a>
## Data Wrangling

<a id='import'></a>
### Importing libraries and data

In [44]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [45]:
# Read csv file of trending YouTube video data
videos = pd.read_csv(r"C:\Users\Main User\Documents\BAKA!\youtube-trending\datasets\USvideos.csv")

In [46]:
# Read json file of US video categories
import json
from pandas.io.json import json_normalize

data = pd.read_json(r"C:\Users\Main User\Documents\BAKA!\youtube-trending\datasets\US_category_id.json")

category_id = []
category = []
for item in data['items']:
    category_id.append(item['id'])
    category.append(item['snippet']['title'])
    
categories = pd.DataFrame(list(zip(category_id, category)), columns=['category_id', 'category'])
categories['category_id'] = categories['category_id'].astype('int64')
categories = categories.sort_values('category_id')

<a id='assess'></a>
### Assess the data

In [47]:
videos.head(5)

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
0,2kyS6SvSYSE,17.14.11,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13T17:13:01.000Z,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...
1,1ZAPwfrtAFY,17.14.11,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13T07:30:00.000Z,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John..."
2,5qpjK5DgCt4,17.14.11,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,2017-11-12T19:05:24.000Z,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...
3,puqaWrEC7tY,17.14.11,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,2017-11-13T11:00:04.000Z,"rhett and link|""gmm""|""good mythical morning""|""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...
4,d380meD0W0M,17.14.11,I Dare You: GOING BALD!?,nigahiga,24,2017-11-12T18:01:41.000Z,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,132235,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...


- Tags separated by "|" character
- Description also contains special characters
- Trending date and published time columns are in different formats, and both are not the format needed for analysis

In [48]:
videos.shape

(40949, 16)

- There are 40,949 unique videos included in this dataset and 16 columns

In [49]:
videos.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40949 entries, 0 to 40948
Data columns (total 16 columns):
video_id                  40949 non-null object
trending_date             40949 non-null object
title                     40949 non-null object
channel_title             40949 non-null object
category_id               40949 non-null int64
publish_time              40949 non-null object
tags                      40949 non-null object
views                     40949 non-null int64
likes                     40949 non-null int64
dislikes                  40949 non-null int64
comment_count             40949 non-null int64
thumbnail_link            40949 non-null object
comments_disabled         40949 non-null bool
ratings_disabled          40949 non-null bool
video_error_or_removed    40949 non-null bool
description               40379 non-null object
dtypes: bool(3), int64(5), object(8)
memory usage: 4.2+ MB


- There are some null descriptions

In [50]:
videos.dtypes

video_id                  object
trending_date             object
title                     object
channel_title             object
category_id                int64
publish_time              object
tags                      object
views                      int64
likes                      int64
dislikes                   int64
comment_count              int64
thumbnail_link            object
comments_disabled           bool
ratings_disabled            bool
video_error_or_removed      bool
description               object
dtype: object

- Dates and times are read as strings

In [51]:
videos.duplicated().sum()

48

In [52]:
# Check the duplicate rows to determine if these will need to be dropped during the cleaning phase
duplicate_id = videos[videos.duplicated()]['video_id']
duplicate_id.values

array(['QBL8IRJ5yHU', 't4pRQ0jn23Q', 'j4KvrAUjn6c', 'MAjY8mCTXWk',
       'xhs8tf1v__w', 'E21NATEP9QI', 'jzLlsbdrwQk', '1RZYOeQeIXE',
       'WF82ABLw8s4', 'r-3iathMo7o', 'NBSAQenU2Bk', 'Xpv-sEKl1B4',
       'HrQNdClwMs4', '4oqvNR1o3Zo', '96oKlWv5wSo', 'oRexsyztGS0',
       'MT7RQ0gu8ak', '1U1u5aKU3AY', 'xTrwT0jSUg0', '3g5O-kT9m8k',
       'Dwc27Lsr1EY', '6ijnv-jNhUA', 'D2mxKEa2xmA', 'OUBx_raReDw',
       'BspHjvU11y4', 'nRc0kmOYgzQ', 'UfKmSfgFxi8', '_iGAptGAweo',
       'DGdSlnw4D_M', 'BfawmhUVXVo', 'LtpqdJkoKm8', 'mAfkkgw_-68',
       'rQEqKZ7CJlk', 'OXVm3fhYsEo', 'ksjWPxFPsos', 'UQkBcHLZOqU',
       'mdWcaWBxxcY', 'Am6NHDbj6XA', 'vjSohj-Iclc', 'CPjWgk0UXps',
       'uxbQATBAXf8', 'y_WoOYybCro', 'oSEeK9yDNQI', 'iILJvqrAQ_w',
       'zcEE8J2Bqa8', 'q1jzwV_s8_Y', 'mkz1zoo15zI', '2PH7dK6SLC8'],
      dtype=object)

In [53]:
videos[videos['video_id'].isin(duplicate_id.values)].sort_values('video_id')

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description
34906,1RZYOeQeIXE,18.15.05,Sarah Paulson Gets Scared During '5 Second Rule',TheEllenShow,24,2018-05-14T13:00:00.000Z,"ellen|""ellen degeneres""|""the ellen show""|""seas...",704786,19880,248,669,https://i.ytimg.com/vi/1RZYOeQeIXE/default.jpg,False,False,False,Sarah Paulson agreed to play a friendly game o...
34757,1RZYOeQeIXE,18.15.05,Sarah Paulson Gets Scared During '5 Second Rule',TheEllenShow,24,2018-05-14T13:00:00.000Z,"ellen|""ellen degeneres""|""the ellen show""|""seas...",704786,19880,248,669,https://i.ytimg.com/vi/1RZYOeQeIXE/default.jpg,False,False,False,Sarah Paulson agreed to play a friendly game o...
34968,1RZYOeQeIXE,18.16.05,Sarah Paulson Gets Scared During '5 Second Rule',TheEllenShow,24,2018-05-14T13:00:00.000Z,"ellen|""ellen degeneres""|""the ellen show""|""seas...",1195009,27935,446,965,https://i.ytimg.com/vi/1RZYOeQeIXE/default.jpg,False,False,False,Sarah Paulson agreed to play a friendly game o...
35194,1U1u5aKU3AY,18.17.05,New lava fissures fuel fears of eruption in Ha...,CNN,25,2018-05-13T19:30:53.000Z,"latest News|""Happening Now""|""CNN""|""lava""|""hawa...",365394,1814,335,1284,https://i.ytimg.com/vi/1U1u5aKU3AY/default.jpg,False,False,False,Three new fissures have opened on Hawaii's Big...
34981,1U1u5aKU3AY,18.16.05,New lava fissures fuel fears of eruption in Ha...,CNN,25,2018-05-13T19:30:53.000Z,"latest News|""Happening Now""|""CNN""|""lava""|""hawa...",293306,1707,315,1221,https://i.ytimg.com/vi/1U1u5aKU3AY/default.jpg,False,False,False,Three new fissures have opened on Hawaii's Big...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35632,zcEE8J2Bqa8,18.19.05,The Goblin - JACK AND DEAN,Jack and Dean,23,2018-05-11T18:27:01.000Z,"Jack and Dean|""OMFGItsJackAndDean""|""Jack Howar...",183436,21734,150,1479,https://i.ytimg.com/vi/zcEE8J2Bqa8/default.jpg,False,False,False,That? That's a goblin living under the stairs....
36890,zcEE8J2Bqa8,18.25.05,The Goblin - JACK AND DEAN,Jack and Dean,23,2018-05-11T18:27:01.000Z,"Jack and Dean|""OMFGItsJackAndDean""|""Jack Howar...",202596,22741,161,1520,https://i.ytimg.com/vi/zcEE8J2Bqa8/default.jpg,False,False,False,That? That's a goblin living under the stairs....
37101,zcEE8J2Bqa8,18.26.05,The Goblin - JACK AND DEAN,Jack and Dean,23,2018-05-11T18:27:01.000Z,"Jack and Dean|""OMFGItsJackAndDean""|""Jack Howar...",204539,22834,162,1521,https://i.ytimg.com/vi/zcEE8J2Bqa8/default.jpg,False,False,False,That? That's a goblin living under the stairs....
34369,zcEE8J2Bqa8,18.13.05,The Goblin - JACK AND DEAN,Jack and Dean,23,2018-05-11T18:27:01.000Z,"Jack and Dean|""OMFGItsJackAndDean""|""Jack Howar...",123615,17665,89,1235,https://i.ytimg.com/vi/zcEE8J2Bqa8/default.jpg,False,False,False,That? That's a goblin living under the stairs....


- By visual inspection it looks like the 48 duplicate rows that were captured are true duplicates. The other 685 rows are duplicate videos, but have different trending dates. Observing the video performance stats for these videos, it seems that the stats for each video are the stats the video had on the day that it was trending. This subset of the data will be saved for a separate analysis, but only the first trending day will be used for the overall analysis to avoid duplicates.

In [54]:
categories.head(5)

Unnamed: 0,category_id,category
0,1,Film & Animation
1,2,Autos & Vehicles
2,10,Music
3,15,Pets & Animals
4,17,Sports


- Some categories appear to be missing in the dataset

In [55]:
categories.dtypes

category_id     int64
category       object
dtype: object

In [56]:
categories.category.value_counts()

Comedy                   2
Film & Animation         1
People & Blogs           1
Videoblogging            1
Thriller                 1
Foreign                  1
Shorts                   1
Short Movies             1
Documentary              1
Pets & Animals           1
Travel & Events          1
Sports                   1
Horror                   1
Action/Adventure         1
Gaming                   1
News & Politics          1
Entertainment            1
Classics                 1
Howto & Style            1
Family                   1
Drama                    1
Trailers                 1
Sci-Fi/Fantasy           1
Science & Technology     1
Nonprofits & Activism    1
Music                    1
Anime/Animation          1
Movies                   1
Autos & Vehicles         1
Education                1
Shows                    1
Name: category, dtype: int64

In [57]:
categories.query('category == "Comedy"')

Unnamed: 0,category_id,category
10,23,Comedy
21,34,Comedy


<a id='assess_sum'></a>
**Assessment summary**
1. The `trending_date` and `publish_time` columns are strings and will need to be converted to datetime objects.
2. The `trending_date` column is in the wrong format to be read as a datetime object.
3. The `publish_time` column includes both date and time. In order to provide more segmented time-based analysis, date and time will be split into separate columns.
4. There are 48 duplicate rows in the dataset. These rows will be removed.
5. There are 685 other rows that are duplicate videos, but because they trended on multiple days. These videos will be placed in a separate DataFrame for analysis and the day the video was first trending will be included in the main dataset.
6. There are gaps between the category id numbers. Monitor this when merging dataframes to determine if any additional data needs to be collected.
7. There are two "Comedy" categories. This will need to be assessed after merging the DataFrames to be sure that there is not mis-classification.
8. Merge the two DataFrames on `category_id`.

<a id='clean'></a>
### Clean the data

In [58]:
videos_clean = videos.copy()
categories_clean = categories.copy()

1. The `trending_date` and `publish_time` columns are strings and will need to be converted to datetime objects.   
For the purposes of this analysis, I will only need to change the data types of the date and time columns.

2. The `trending_date` column is in the wrong format to be read as a datetime object.   
The `trending_date` column is currently in an unreadable date format, so I will need to get the data into a suitable format first.

In [59]:
videos_clean["publish_time"] = pd.to_datetime(videos_clean["publish_time"])
videos_clean.dtypes

video_id                               object
trending_date                          object
title                                  object
channel_title                          object
category_id                             int64
publish_time              datetime64[ns, UTC]
tags                                   object
views                                   int64
likes                                   int64
dislikes                                int64
comment_count                           int64
thumbnail_link                         object
comments_disabled                        bool
ratings_disabled                         bool
video_error_or_removed                   bool
description                            object
dtype: object

In [60]:
videos_clean.trending_date.value_counts()

18.11.06    200
18.31.03    200
18.16.04    200
17.15.11    200
17.30.11    200
           ... 
18.01.02    197
18.31.01    197
18.03.02    196
18.04.02    196
18.02.02    196
Name: trending_date, Length: 205, dtype: int64

- By observation, all of the records are in the yy.dd.mm format, so I will parse the date this way across the entire column. 

In [61]:
(videos_clean.publish_time.min(), videos_clean.publish_time.max())

(Timestamp('2006-07-23 08:24:11+0000', tz='UTC'),
 Timestamp('2018-06-14 01:31:53+0000', tz='UTC'))

- The dataset ranges from the years 2006 to 2018, so I will concatenate "20" to the year of the cleaned trending date.

In [62]:
videos_clean["trending_date"] = videos_clean.trending_date.str.split(".")

In [63]:
date = []
for x in videos_clean.trending_date:
    year = "20" + x[0]
    day = x[1]
    month = x[2]
    date.append(year + "-" + month + "-" + day)

videos_clean["trending_date"] = date

In [64]:
videos_clean["trending_date"] = pd.to_datetime(videos_clean["trending_date"])

In [65]:
videos_clean.dtypes

video_id                               object
trending_date                  datetime64[ns]
title                                  object
channel_title                          object
category_id                             int64
publish_time              datetime64[ns, UTC]
tags                                   object
views                                   int64
likes                                   int64
dislikes                                int64
comment_count                           int64
thumbnail_link                         object
comments_disabled                        bool
ratings_disabled                         bool
video_error_or_removed                   bool
description                            object
dtype: object

3. The `publish_time` column includes both date and time. In order to provide more segmented time-based analysis, date and time will be split into separate columns.   
I will split the `publish_time` column to separate date and time to perform separate time series analyses.

In [66]:
videos_clean['publish_date'] = videos_clean['publish_time'].dt.date
videos_clean['publish_time_new'] = videos_clean['publish_time'].dt.time
videos_clean.head(2)

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,publish_time,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,publish_date,publish_time_new
0,2kyS6SvSYSE,2017-11-14,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,2017-11-13 17:13:01+00:00,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...,2017-11-13,17:13:01
1,1ZAPwfrtAFY,2017-11-14,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,2017-11-13 07:30:00+00:00,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John...",2017-11-13,07:30:00


- Drop the original `publish_time` column and rename the new time column

In [67]:
videos_clean = videos_clean.drop(columns='publish_time', axis=1)
videos_clean = videos_clean.rename(columns={'publish_time_new': 'publish_time'})
videos_clean.head(2)

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,publish_date,publish_time
0,2kyS6SvSYSE,2017-11-14,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...,2017-11-13,17:13:01
1,1ZAPwfrtAFY,2017-11-14,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John...",2017-11-13,07:30:00


4. There are 48 duplicate rows in the dataset. These rows will be removed.   
This will be done using the `.drop_duplicates()` method.

In [75]:
videos_clean = videos_clean.drop_duplicates()
videos_clean.shape

(40901, 17)

5. There are 685 other rows that are duplicate videos, but because they trended on multiple days. These videos will be placed in a separate DataFrame for analysis and the day the video was first trending will be included in the main dataset.   
First I will need to create a separate DataFrame of the 685 duplicate videos for future analysis. Then I will need to find the first day each video was trending and filter the others from the clean DataFrame.

In [78]:
videos_clean.video_id.value_counts()

j4KvrAUjn6c    29
8h--kFui1JA    29
t4pRQ0jn23Q    28
ulNswX3If6U    28
WIV3xNz8NoM    28
               ..
1M5r_B1_WZ8     1
89PKbJ2NAqc     1
MeXyRyxCjT4     1
1yf8ZSjtXiI     1
dgcr3mCsqBE     1
Name: video_id, Length: 6351, dtype: int64

In [28]:
multiple_trending = videos_clean[videos_clean['video_id'].isin(duplicate_id.values)].sort_values('video_id')
multiple_trending.shape

(685, 17)

- Now the `videos_clean` DataFrame must filter out the duplicate videos, leaving only the row of the first trending date.

In [80]:
videos_clean = videos_clean.sort_values('trending_date').drop_duplicates('video_id', keep='last')
videos_clean.shape

(6351, 17)

In [81]:
videos_clean.sort_values('video_id')

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,tags,views,likes,dislikes,comment_count,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,publish_date,publish_time
40208,-0CMnp02rNY,2018-06-11,Mindy Kaling's Daughter Had the Perfect Reacti...,TheEllenShow,24,"ellen|""ellen degeneres""|""the ellen show""|""elle...",800359,9773,332,423,https://i.ytimg.com/vi/-0CMnp02rNY/default.jpg,False,False,False,Ocean's 8 star Mindy Kaling dished on bringing...,2018-06-04,13:00:00
15457,-0NYY8cqdiQ,2018-02-01,Megan Mullally Didn't Notice the Interesting P...,TheEllenShow,24,"megan mullally|""megan""|""mullally""|""will and gr...",563746,4429,54,94,https://i.ytimg.com/vi/-0NYY8cqdiQ/default.jpg,False,False,False,Ellen and Megan Mullally have known each other...,2018-01-29,14:00:39
31992,-1Hm41N0dUs,2018-05-01,Cast of Avengers: Infinity War Draws Their Cha...,Jimmy Kimmel Live,23,"jimmy|""jimmy kimmel""|""jimmy kimmel live""|""late...",2058516,41248,580,1484,https://i.ytimg.com/vi/-1Hm41N0dUs/default.jpg,False,False,False,"Benedict Cumberbatch, Don Cheadle, Elizabeth O...",2018-04-27,07:30:02
3711,-1yT-K3c6YI,2017-12-02,YOUTUBER QUIZ + TRUTH OR DARE W/ THE MERRELL T...,Molly Burke,22,"youtube quiz|""youtuber quiz""|""truth or dare""|""...",231341,7734,212,846,https://i.ytimg.com/vi/-1yT-K3c6YI/default.jpg,False,False,False,Check out the video we did on the Merrell Twin...,2017-11-28,18:30:43
584,-2RVw2_QyxQ,2017-11-16,2017 Champions Showdown: Day 3,Saint Louis Chess Club,27,"Chess|""Saint Louis""|""Club""",71089,460,27,20,https://i.ytimg.com/vi/-2RVw2_QyxQ/default.jpg,False,False,False,The Saint Louis Chess Club hosts a series of f...,2017-11-12,02:39:01
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28342,zwEn-ambXLw,2018-04-06,This Is Me - Cover by Shoshana Bean Featuring ...,Shoshana Bean,10,"travis wall|""shoshana bean""|""greatest showman""...",241668,8478,144,328,https://i.ytimg.com/vi/zwEn-ambXLw/default.jpg,False,False,False,I was lucky enough to lay the original demo fo...,2018-03-22,08:30:07
1183,zxUwbflE1SY,2017-11-19,100 People Hold Their Breath for as Long as Th...,Cut,24,"breath|""hold""|""funny""|""holding breath""|""breath...",225280,5770,150,1312,https://i.ytimg.com/vi/zxUwbflE1SY/default.jpg,False,False,False,Get Cut swag here: http://cut.com/shop\n\nDon’...,2017-11-13,13:00:10
36944,zxwfDlhJIpw,2018-05-25,kanye west / charlamagne interview,Kanye West,22,"Kanye West|""YEEZY""|""Kanye""|""Charlamagne""|""The ...",8442986,166520,19462,48467,https://i.ytimg.com/vi/zxwfDlhJIpw/default.jpg,False,False,False,,2018-05-01,15:57:06
144,zy0b9e40tK8,2017-11-14,Dark | Official Trailer [HD] | Netflix,Netflix,24,"Netflix|""Baran Bo Odar""|""Jantje Friese""|""DARK""...",378750,5642,146,675,https://i.ytimg.com/vi/zy0b9e40tK8/default.jpg,False,False,False,The disappearance of two kids in the German sm...,2017-11-09,09:00:07


In [32]:
videos_clean.duplicated('video_id').sum()

0

6. There are gaps between the category id numbers. Monitor this when merging dataframes to determine if any additional data needs to be collected.
7. There are two "Comedy" categories. This will need to be assessed after merging the DataFrames to be sure that there is not mis-classification.
8. Merge the two DataFrames on `category_id`.<br>
Merge the DataFrames first, then assess effect on categorization and clean further if necessary.

In [40]:
# Merge DataFrames
df_clean = videos_clean.merge(categories_clean, on='category_id', how='left')
df_clean.head()

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,tags,views,likes,dislikes,comments,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,publish_date,publish_time,category
0,htvR_dBs3eg,2017-11-14,Sam Smith - The Thrill of It All ALBUM REVIEW,theneedledrop,10,"album|""review""|""music""|""reviews""|""indie""|""unde...",98422,2926,106,798,https://i.ytimg.com/vi/htvR_dBs3eg/default.jpg,False,False,False,Listen: https://www.youtube.com/watch?v=J_ub7E...,2017-11-10,21:38:57,Music
1,5x1FAiIq_pQ,2017-11-14,Alicia Keys - When You Were Gone,Alicia Keys,10,[none],95944,1354,181,117,https://i.ytimg.com/vi/5x1FAiIq_pQ/default.jpg,False,False,False,Find out more in The Vault: http://bit.ly/AK_A...,2017-11-09,15:49:21,Music
2,vd4zwINEcLY,2017-11-14,Live in the now!,poofables,24,"cash|""Wayne's""|""World""|""wayne""|""waynes""|""fende...",95085,909,52,193,https://i.ytimg.com/vi/vd4zwINEcLY/default.jpg,False,False,False,"Stop torturing yourself man, you'll never affo...",2011-03-27,04:31:25,Entertainment
3,7fm7mll2qvg,2017-11-14,Sigrid - Strangers (Lyric Video),SigridVEVO,10,"Sigrid|""Strangers""|""Island""|""Records""|""Pop""",91776,4604,46,357,https://i.ytimg.com/vi/7fm7mll2qvg/default.jpg,False,False,False,Listen to Strangers here: https://Sigrid.lnk.t...,2017-11-10,00:00:00,Music
4,q-WipZ9p0wk,2017-11-14,Three meals that cost me $1.50 each,Brothers Green Eats,26,"brothers green eats|""budget cooking""|""cooking ...",77630,1991,83,208,https://i.ytimg.com/vi/q-WipZ9p0wk/default.jpg,False,False,False,Welcome to day three of cooking for the price ...,2017-11-09,14:00:08,Howto & Style


In [41]:
df_clean.shape

(6351, 18)

9. I am also going to rename the "comment_count" column to "comments", just for my own sanity.

In [26]:
videos_clean = videos_clean.rename(columns={"comment_count": "comments"})
videos_clean.head()

Unnamed: 0,video_id,trending_date,title,channel_title,category_id,tags,views,likes,dislikes,comments,thumbnail_link,comments_disabled,ratings_disabled,video_error_or_removed,description,publish_date,publish_time
0,2kyS6SvSYSE,2017-11-14,WE WANT TO TALK ABOUT OUR MARRIAGE,CaseyNeistat,22,SHANtell martin,748374,57527,2966,15954,https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg,False,False,False,SHANTELL'S CHANNEL - https://www.youtube.com/s...,2017-11-13,17:13:01
1,1ZAPwfrtAFY,2017-11-14,The Trump Presidency: Last Week Tonight with J...,LastWeekTonight,24,"last week tonight trump presidency|""last week ...",2418783,97185,6146,12703,https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg,False,False,False,"One year after the presidential election, John...",2017-11-13,07:30:00
2,5qpjK5DgCt4,2017-11-14,"Racist Superman | Rudy Mancuso, King Bach & Le...",Rudy Mancuso,23,"racist superman|""rudy""|""mancuso""|""king""|""bach""...",3191434,146033,5339,8181,https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg,False,False,False,WATCH MY PREVIOUS VIDEO ▶ \n\nSUBSCRIBE ► http...,2017-11-12,19:05:24
3,puqaWrEC7tY,2017-11-14,Nickelback Lyrics: Real or Fake?,Good Mythical Morning,24,"rhett and link|""gmm""|""good mythical morning""|""...",343168,10172,666,2146,https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg,False,False,False,Today we find out if Link is a Nickelback amat...,2017-11-13,11:00:04
4,d380meD0W0M,2017-11-14,I Dare You: GOING BALD!?,nigahiga,24,"ryan|""higa""|""higatv""|""nigahiga""|""i dare you""|""...",2095731,132235,1989,17518,https://i.ytimg.com/vi/d380meD0W0M/default.jpg,False,False,False,I know it's been a while since we did this sho...,2017-11-12,18:01:41


<a id='clean_sum'></a>
**Cleaning summary**
1. 
2. 
3. 
4. 

<a id='eda'></a>
## Exploratory Data Analysis
Data types have been corrected, so now I can visually explore the data.

In [27]:
# Converting the quantitative columns to logs to view distributions
log_views = np.log(df["views"])
log_likes = np.log(df["likes"])
log_dislikes = np.log(df["dislikes"])
log_comments = np.log(df["comments"])

plt.figure(figsize=(12,6))

plt.subplot(2,2,1)
ax1 = sns.distplot(log_views)

plt.subplot(2,2,2)
# ax2 = sns.distplot(log_likes)

plt.subplot(2,2,3)
# ax3 = sns.distplot(log_dislikes)

plt.subplot(2,2,4)
# ax4 = sns.distplot(log_comments)

plt.subplots_adjust(wspace=0.2, hspace=0.4, top=0.9)

NameError: name 'df' is not defined

<a id='conclusions'></a>
## Conclusions
1. 
2. 
3. 
4. 

### Limitations
...