<a href="https://colab.research.google.com/github/drusho/drusho.github.io/blob/master/_notebooks/2021-07-20-webscrapping-youtube.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analyzing the Top 10 YouTube Tech Channels
> Using Selenium to Scrape Data from YouTube

- toc: false
- badges: false
- comments: true
- categories: [Selenium, Web Scraping, Pandas]
- image: "images/thumbnails/header_youtube_web.png"

<br>

__Notebook Created by David Rusho__


[Github Blog](https://drusho.github.io) | [Github](https://github.com/drusho/webscrape_youtube) | [Tableau](https://public.tableau.com/app/profile/drusho/) | [Linkedin](https://linkedin.com/in/davidrusho)


<br> 

_*This notebook contains hidden cells when viewed in blog posts in order to increase readability._  

_Visit the links below for more detailed code._  

* [Google Colab Notebook](https://colab.research.google.com/drive/1UxpBBsypGqUj7816zyvGNhJcPfaxBP_c?usp=sharing): **All code** related to data cleaning and data analysis.

* [GitHub - YouTube Web Scraping](https://github.com/drusho/webscrape_youtube/tree/main/code): **All code** related to web scraping and data collection.


## About the Data

Data was scraped from the _Top 10 Tech Channels_ on YouTube using _[Selenium](https://selenium-python.readthedocs.io/)_, a browser automation tool (driver) controlled from Python that is commonly used for web scraping and web testing.  The channels were taken from the __[Top 10 Tech Youtubers](https://blog.bit.ai/top-tech-youtubers/)__ list on blog.bit.ai.  The scrape collected:

* General data for each channel
  * e.g. join date, name, number of subscribers

* Data from the most popular videos per channel
  * e.g. video titles, views

* Data specific to each video
  * e.g. post date, number of likes, number of comments

<br>

Each channel contributed about 200 videos on average, so data from roughly 2,000 videos was scraped in total.

## Introduction

#hide
## Data Cleaning

In [195]:
#hide
import pandas as pd

#hide
### Raw Dataframe Sample
Data from the YouTube channels' main pages (Videos and About)

In [196]:
#hide
yt = pd.read_csv('yt_channel_scrap.csv',parse_dates=['channel_join_date'])
yt.head(2)

Unnamed: 0.1,Unnamed: 0,channel_name,subscribers,title,views,post_date,url,channel_join_date,channel_views,channel_description
0,0,iJustine,6.89M subscribers,Black Eyed Peas - I gotta Feeling (Parody),18M views,11 years ago,https://www.youtube.com/watch?v=iPgaTmsYTT8,NaT,,
1,1,iJustine,6.89M subscribers,Cake Decorating Challenge with Ro | Nerdy Numm...,12M views,5 years ago,https://www.youtube.com/watch?v=y7xZ-kJDgvM,NaT,,


In [197]:
#hide
# create df of Channel details
channel_details = yt[yt.channel_join_date.notna()]
channel_details = channel_details.drop(columns=['Unnamed: 0','subscribers','title','views','post_date']).reset_index(drop=True)
channel_details.head(2)

Unnamed: 0,channel_name,url,channel_join_date,channel_views,channel_description
0,iJustine,,2006-05-07,"1,288,987,476 views","Tech, video games, failed cooking attempts, vl..."
1,Android Authority,,2011-04-03,"767,860,795 views","Your source for the best phones, streaming, ap..."


In [198]:
#hide
#create df Video details
video_details = yt[yt.channel_join_date.isna()]
video_details = video_details.drop(columns=['Unnamed: 0','channel_join_date','channel_views','channel_description','post_date']).reset_index(drop=True)
video_details.head(2)

Unnamed: 0,channel_name,subscribers,title,views,url
0,iJustine,6.89M subscribers,Black Eyed Peas - I gotta Feeling (Parody),18M views,https://www.youtube.com/watch?v=iPgaTmsYTT8
1,iJustine,6.89M subscribers,Cake Decorating Challenge with Ro | Nerdy Numm...,12M views,https://www.youtube.com/watch?v=y7xZ-kJDgvM


In [199]:
#hide
# merge dfs 
merged = channel_details.merge(video_details, on='channel_name')
merged.head(2)

Unnamed: 0,channel_name,url_x,channel_join_date,channel_views,channel_description,subscribers,title,views,url_y
0,iJustine,,2006-05-07,"1,288,987,476 views","Tech, video games, failed cooking attempts, vl...",6.89M subscribers,Black Eyed Peas - I gotta Feeling (Parody),18M views,https://www.youtube.com/watch?v=iPgaTmsYTT8
1,iJustine,,2006-05-07,"1,288,987,476 views","Tech, video games, failed cooking attempts, vl...",6.89M subscribers,Cake Decorating Challenge with Ro | Nerdy Numm...,12M views,https://www.youtube.com/watch?v=y7xZ-kJDgvM


In [200]:
#hide
# drop 2nd url column and rename remaining url col
merged.drop(columns=('url_x'),inplace=True)
merged.rename(columns={'url_y':'url'},inplace=True)
merged.sample(2)

Unnamed: 0,channel_name,channel_join_date,channel_views,channel_description,subscribers,title,views,url
1248,Unbox Therapy,2010-12-21,"4,091,676,835 views",Where products get naked.\n\nHere you will fin...,18M subscribers,Which Smartphone Do They ACTUALLY Use? --- MKB...,8.5M views,https://www.youtube.com/watch?v=Hi2tjMLVpdQ
1975,Marques Brownlee,2008-03-21,"2,597,028,774 views",MKBHD: Quality Tech Videos | YouTuber | Geek |...,14.3M subscribers,Samsung Galaxy Note 5 Review!,4.2M views,https://www.youtube.com/watch?v=V-nBAcr_huw
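
The merge above produces `url_x` and `url_y` because both frames carry a `url` column. Below is a minimal sketch of one way to avoid the suffixes, assuming the channel-level `url` is always empty as it is in the sample rows: drop it before merging. The two small frames, the video titles, and the example.com URLs are hypothetical stand-ins for `channel_details` and `video_details`. The `validate="one_to_many"` argument asks pandas to raise an error if a channel name appears more than once on the channel side.

```python
import pandas as pd

# Hypothetical stand-in for channel_details (join dates taken from the samples above).
channels = pd.DataFrame({
    "channel_name": ["iJustine", "Unbox Therapy"],
    "url": [None, None],  # assumption: channel rows never carry a url
    "channel_join_date": pd.to_datetime(["2006-05-07", "2010-12-21"]),
})

# Hypothetical stand-in for video_details (titles and urls are made up).
videos = pd.DataFrame({
    "channel_name": ["iJustine", "iJustine", "Unbox Therapy"],
    "title": ["Video A", "Video B", "Video C"],
    "url": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c",
    ],
})

# Dropping the empty channel-level url first means only one url column
# reaches the merge, so no _x/_y suffixes are generated.
merged_demo = channels.drop(columns="url").merge(
    videos, on="channel_name", validate="one_to_many"
)

print(merged_demo.columns.tolist())
```

With this approach the separate drop and rename step is not needed, at the cost of assuming the channel rows never hold a useful `url`.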


In [201]:
#hide
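# convert views and subscribers to int
# caveat: removing '.' after appending zeros inflates decimal values
# (e.g. '6.89M subscribers' becomes 689000000 rather than 6890000)
merged.subscribers = merged.subscribers.str.replace('M subscribers','000000').str.replace('.','',regex=False).astype('int')
merged.views = merged.views.str.replace('M views','000000').str.replace('K views','000').str.replace('.','',regex=False).str.replace('1 year ago','0').astype('int')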
merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1999 entries, 0 to 1998
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   channel_name         1999 non-null   object        
 1   channel_join_date    1999 non-null   datetime64[ns]
 2   channel_views        1999 non-null   object        
 3   channel_description  1999 non-null   object        
 4   subscribers          1999 non-null   int64         
 5   title                1999 non-null   object        
 6   views                1999 non-null   int64         
 7   url                  1999 non-null   object        
dtypes: datetime64[ns](1), int64(2), object(5)
memory usage: 140.6+ KB
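
The conversion above appends zeros for the `M`/`K` suffix and then strips the decimal point, so a value with a fractional part (such as `6.89M subscribers`) comes out 100 times too large. A minimal sketch of an alternative is shown below. The helper name `parse_count` and the accepted label words are assumptions for illustration, not part of the original notebook. It uses `Decimal` so that the multiplication stays exact.

```python
import re
from decimal import Decimal

# Multipliers for the abbreviated suffixes seen in the scraped strings.
_MULTIPLIERS = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}

# A number, an optional K/M/B suffix, and an optional trailing label word.
_COUNT_RE = re.compile(
    r"([\d,]*\.?\d+)\s*([KMB]?)(?:\s+(?:views?|subscribers?|comments?))?",
    re.IGNORECASE,
)


def parse_count(text):
    """Convert strings like '6.89M subscribers' or '15,040 Comments' to int.

    Returns None when the text does not look like a count (e.g. '1 year ago').
    """
    match = _COUNT_RE.fullmatch(str(text).strip())
    if match is None:
        return None
    number, suffix = match.groups()
    value = Decimal(number.replace(",", "")) * _MULTIPLIERS[suffix.upper()]
    return int(value)


for raw in ["6.89M subscribers", "18M views", "1,288,987,476 views", "1 year ago"]:
    print(raw, "->", parse_count(raw))
```

Applied to a column, this could look like `merged.subscribers.map(parse_count)`. Rows that come back as `None` can then be inspected instead of being silently set to 0.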


#### Cleaned Dataframe Sample

_*Data from the YouTube channels' main pages (Videos and About)_

In [202]:
#hide
# channel views to int dtype
merged.channel_views = merged.channel_views.str.replace(',','').str.replace('views','').astype('int')
merged.head(2)

Unnamed: 0,channel_name,channel_join_date,channel_views,channel_description,subscribers,title,views,url
0,iJustine,2006-05-07,1288987476,"Tech, video games, failed cooking attempts, vl...",689000000,Black Eyed Peas - I gotta Feeling (Parody),18000000,https://www.youtube.com/watch?v=iPgaTmsYTT8
1,iJustine,2006-05-07,1288987476,"Tech, video games, failed cooking attempts, vl...",689000000,Cake Decorating Challenge with Ro | Nerdy Numm...,12000000,https://www.youtube.com/watch?v=y7xZ-kJDgvM


#hide
## Import Videos Data

Per-video data for roughly 2,000 YouTube videos

In [203]:
#hide
# import videos 
df_videos = pd.read_csv('yt_videos_scrap_big_data.csv',parse_dates=['Publish Date','Upload_date'])
df_videos.drop(columns=['Unnamed: 0','Duration','Channel Name','Title'],inplace=True)
df_videos.sample(2)

Unnamed: 0,url,Partial Description,Publish Date,Upload_date,Genre,Width,Height,Likes,Comments,Interaction Count
521,https://www.youtube.com/watch?v=7jiQ46OMcCA,Huawei has officially announced that they will...,2020-09-11,2020-09-11,Science & Technology,1280.0,720.0,152K,"15,040 Comments",3074271
1046,https://www.youtube.com/watch?v=q4vayNhU5Vc,Fortnite Battle Royale meets four gaming lapto...,2018-05-19,2018-05-19,Science & Technology,1280.0,720.0,56K,"7,036 Comments",3450756


In [204]:
#hide
# Comments dtype to int
df_videos['Comments'] = df_videos['Comments'].str.replace('Comments','').str.replace(',','').astype('int')
df_videos.sample(2)

Unnamed: 0,url,Partial Description,Publish Date,Upload_date,Genre,Width,Height,Likes,Comments,Interaction Count
1657,https://www.youtube.com/watch?v=dRObA6vI6UU,Here are my top 5 favorite bluetooth headphone...,2015-09-01,2015-09-01,Science & Technology,1280.0,720.0,13K,1544,1246027
1944,https://www.youtube.com/watch?v=0T0rop9pE58,"Galaxy Note is a family now, and the 10+ is th...",2019-08-22,2019-08-22,Science & Technology,1280.0,720.0,116K,11728,4736147


In [205]:
#hide
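# Likes dtype to int
# caveat: removing '.' after appending zeros inflates decimal values (a value such as '1.5K' would become 15000)
df_videos['Likes'] = df_videos['Likes'].str.replace('K','000').str.replace("M",'000000').str.replace('.','',regex=False).astype('int')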
df_videos.sample(2)

Unnamed: 0,url,Partial Description,Publish Date,Upload_date,Genre,Width,Height,Likes,Comments,Interaction Count
1439,https://www.youtube.com/watch?v=75L8Hrb49A4,Get the Ring Doorbell Welcome Kit today at htt...,2019-07-25,2019-07-25,Science & Technology,1280.0,720.0,170000,15589,6886917
1239,https://www.youtube.com/watch?v=FZe7OllHUXw,( ͡° ͜ʖ ͡°) --- Today's Mystery Video - https:...,2016-08-06,2016-08-06,Science & Technology,1280.0,720.0,132000,6516,9272684


In [206]:
#hide
# Fix Width and Height, remove '.' and '0' from end of str
df_videos['Width'] = df_videos['Width'].astype('str').str.split(".", expand=True)[0]
df_videos['Height'] = df_videos['Height'].astype('str').str.split(".", expand=True)[0]
df_videos.head(2)

Unnamed: 0,url,Partial Description,Publish Date,Upload_date,Genre,Width,Height,Likes,Comments,Interaction Count
0,https://www.youtube.com/watch?v=iPgaTmsYTT8,Thanks for watching! Don't forget to subscribe...,2009-07-30,2009-07-30,Comedy,1280,720,102000,23437,18198670
1,https://www.youtube.com/watch?v=y7xZ-kJDgvM,Thanks for watching! Don't forget to subscribe...,2016-02-18,2016-02-18,Howto & Style,1280,720,99000,8421,12395700


## Cleaned Dataframe

Sample of fully cleaned and merged dataframe

Data from the YouTube channel pages and the individual video pages were merged.

In [215]:
#collapse
# merge channel/video summary with per-video details on url
vc_merged = merged.merge(df_videos, on='url') 
# rename so every video column matches the Column Descriptions table
vc_merged.rename(columns={
    'Partial Description':'video_desc',
    'Publish Date':'video_publish_date',
    'Upload_date':'video_upload_date',
    'Interaction Count':'video_interactions',
    'Genre':'video_genre',
    'Width':'video_width',
    'Height':'video_height',
    'Likes':'video_likes',
    'Comments':'video_comments',
    'views':'video_views',
    'title':'video_title',
    'url':'video_url'
    },inplace=True)

vc_merged.head(2)

Unnamed: 0,channel_name,channel_join_date,channel_views,channel_description,subscribers,video_title,video_views,video_url,video_desc,video_publish_date,video_upload_date,video_genre,video_width,video_height,video_likes,video_comments,video_interactions
0,iJustine,2006-05-07,1288987476,"Tech, video games, failed cooking attempts, vl...",689000000,Black Eyed Peas - I gotta Feeling (Parody),18000000,https://www.youtube.com/watch?v=iPgaTmsYTT8,Thanks for watching! Don't forget to subscribe...,2009-07-30,2009-07-30,Comedy,1280,720,102000,23437,18198670
1,iJustine,2006-05-07,1288987476,"Tech, video games, failed cooking attempts, vl...",689000000,Cake Decorating Challenge with Ro | Nerdy Numm...,12000000,https://www.youtube.com/watch?v=y7xZ-kJDgvM,Thanks for watching! Don't forget to subscribe...,2016-02-18,2016-02-18,Howto & Style,1280,720,99000,8421,12395700


### Column Descriptions

|Column Name  | Description |
|:--|:--|
|channel_name|Name of YouTube channel|
|channel_join_date|Date channel was created|
|channel_views|Total views the channel has received|
|channel_description|Description of YouTube channel|
|subscribers|Number of channel subscribers|
|video_title|Video title|
|video_views|Total views for video|
|video_url|Video url|
|video_desc|Description of video|
|video_publish_date|Date video was published|
|video_upload_date|Date video was uploaded|
|video_genre|Genre of video|
|video_width|Width of video|
|video_height|Height of video|
|video_likes|Total likes for video|
|video_comments|Total comments for video|
|video_interactions|Number of interactions video has received|

## Data Analysis

### List of YouTube Channels

In [216]:
#collapse
# List of Video Channels
yt_chan = vc_merged.groupby(['channel_join_date','channel_name','channel_views'])['subscribers'].max().to_frame().reset_index()

# rename columns to increase readability
yt_chan.rename(columns={
    'channel_name':'Channel',
    'channel_join_date':'Join Date',
    'subscribers':'Subscribers',
    'channel_views':'Views',
    },inplace=True)

# style dataframe to highlight highest values
# note: Styler.hide_index() was removed in pandas 2.0; use .hide(axis="index") there
yt_chan.style.format(formatter={'Subscribers': "{:,}",
                                 'Views': "{:,}",
                                 'Join Date': "{:%Y-%m-%d}"}).background_gradient(subset=['Views',
                                                                                          'Subscribers'], 
                                                                                  cmap='Wistia').hide_index()

|Join Date|Channel|Views|Subscribers|
|:--|:--|:--|:--|
|2006-05-07|iJustine|1288987476|689000000|
|2007-06-07|Jon Rettinger|574947199|159000000|
|2007-08-04|Austin Evans|1118911675|507000000|
|2008-03-21|Marques Brownlee|2597028774|143000000|
|2008-11-24|Linus Tech Tips|4934741560|137000000|
|2010-03-24|Jonathan Morrison|430639061|264000000|
|2010-12-21|Unbox Therapy|4091676835|18000000|
|2011-04-03|Android Authority|767860795|336000000|
|2011-04-20|Mrwhosetheboss|1208148200|771000000|
|2012-01-01|UrAvgConsumer|430378637|311000000|


### Top 10 Videos by Views

Discoveries so far:

* Most of these videos are more than a year old, which suggests that views accumulate over time.

* Two videos list a dollar amount in the title.

* Marques Brownlee labels some videos with a series name (e.g. "Dope Tech").

* Unbox Therapy dominates the list: 7 of the 10 videos belong to this channel.

In [224]:
#collapse
# Top 10 Videos by Views
top_chan = vc_merged.groupby(['video_title',
                              'channel_name',
                              'video_publish_date'])['video_views'].max().sort_values(ascending=False).head(10).reset_index()

# rename columns to increase readability
top_chan.rename(columns={
    'channel_name':'Channel',
    'video_publish_date':'Publish Date',
    'video_views':'Views',
    'video_title':'Title',
    },inplace=True)

top_chan.style.format(formatter={'Views': "{:,}",
                                 'Publish Date': "{:%Y-%m-%d}"}).background_gradient(subset=['Views',
                                                                                                   'Publish Date'], cmap='Wistia').hide_index()


|Title|Channel|Publish Date|Views|
|:--|:--|:--|:--|
|2020 iPad Pro Review: It's... A Computer?!|Marques Brownlee|2020-03-24|99000000|
|A Keyboard Made Of Glass?|Unbox Therapy|2016-04-09|98000000|
|iPhone 12 - The iPhone is New Again|Unbox Therapy|2020-10-13|98000000|
|The Secret Android Button|Unbox Therapy|2016-04-13|98000000|
|The FASTEST gaming PC money can buy|Linus Tech Tips|2018-12-27|98000000|
|Fortnite on an INSANE $20,000 Gaming PC|Unbox Therapy|2018-03-18|98000000|
|Dope Tech: Self-Lacing Nike Mag!|Marques Brownlee|2016-10-07|98000000|
|This is the iPhone SE 2|Unbox Therapy|2020-04-15|97000000|
|Human Headphones Just Changed The Game|Unbox Therapy|2019-08-30|97000000|
|$1000 Earphones! (Shure SE846 Unboxing & Test)|Unbox Therapy|2014-06-23|97000000|
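
The observations about channel share and video age can be checked against the table with a short tally. The sketch below copies the channel names and publish dates from the Top 10 table. The reference date is an assumption taken from the notebook filename (2021-07-20), not a value stored in the data.

```python
from collections import Counter
from datetime import date

# (channel, publish date) pairs copied from the Top 10 table.
top_videos = [
    ("Marques Brownlee", date(2020, 3, 24)),
    ("Unbox Therapy", date(2016, 4, 9)),
    ("Unbox Therapy", date(2020, 10, 13)),
    ("Unbox Therapy", date(2016, 4, 13)),
    ("Linus Tech Tips", date(2018, 12, 27)),
    ("Unbox Therapy", date(2018, 3, 18)),
    ("Marques Brownlee", date(2016, 10, 7)),
    ("Unbox Therapy", date(2020, 4, 15)),
    ("Unbox Therapy", date(2019, 8, 30)),
    ("Unbox Therapy", date(2014, 6, 23)),
]

# Assumption: the analysis date is the date in the notebook filename.
reference_date = date(2021, 7, 20)

# How many of the ten videos belong to each channel.
channel_counts = Counter(channel for channel, _ in top_videos)

# How many videos were published more than 365 days before the reference date.
older_than_a_year = sum(
    (reference_date - published).days > 365 for _, published in top_videos
)

for channel, count in channel_counts.most_common():
    print(f"{channel}: {count} of {len(top_videos)}")
print(f"Older than one year: {older_than_a_year} of {len(top_videos)}")
```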


### Total Views by Channel

In [226]:
#collapse
# Total Views by Channel

chan_views = vc_merged.groupby(['channel_name','subscribers'])['video_views'].sum().sort_values(ascending=False).reset_index()

# rename columns to increase readability
chan_views.rename(columns={
    'channel_name':'Channel',
    'subscribers':'Subscribers',
    'video_views':'Views',
    },inplace=True)

chan_views.style.format(formatter={'Views': "{:,}",
                                 'Subscribers': "{:,}"}).background_gradient(subset=['Views','Subscribers'], cmap='Wistia').hide_index()

|Channel|Subscribers|Views|
|:--|:--|:--|
|Unbox Therapy|18000000|10357000000|
|Linus Tech Tips|137000000|9895000000|
|Marques Brownlee|143000000|9864000000|
|Mrwhosetheboss|771000000|6870000000|
|Austin Evans|507000000|5041000000|
|iJustine|689000000|5017000000|
|Android Authority|336000000|1590710000|
|UrAvgConsumer|311000000|1500732000|
|Jonathan Morrison|264000000|1340207000|
|Jon Rettinger|159000000|1051493000|


## Resources

- [Top 25 Selenium Functions That Will Make You Pro In Web Scraping](https://towardsdatascience.com/top-25-selenium-functions-that-will-make-you-pro-in-web-scraping-5c937e027244)

- [How to build a Web Scraper or Bot in Python using Selenium](https://medium.com/daily-programming-tips/how-to-build-a-web-scraper-or-bot-in-python-using-selenium-2815f20023f7)

- [Web Scraping: Introduction, Best Practices & Caveats](https://medium.com/velotio-perspectives/web-scraping-introduction-best-practices-caveats-9cbf4acc8d0f)

- [Web Scraping Job Postings from Indeed.com using Selenium](https://towardsdatascience.com/web-scraping-job-postings-from-indeed-com-using-selenium-5ae58d155daf)


- [How I Use Selenium to Automate the Web With Python. Pt1 - John Watson Rooney](https://www.youtube.com/watch?v=pUUhvJvs-R4)