<a href="https://colab.research.google.com/github/drusho/webscrape_youtube/blob/main/notebooks/2021_07_20_webscrapping_youtube.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Web Scraping YouTube Tech Channels
> Using Selenium to Analyze YouTube Data

- toc: false
- badges: false
- comments: true
- categories: [Selenium, Web Scraping, Pandas]
- image: "images/thumbnails/header_youtube_web.png"

<br>

__Notebook Created by David Rusho__

[Github Blog](https://drusho.github.io) | [Github](https://github.com/drusho/webscrape_youtube) | [Tableau](https://public.tableau.com/app/profile/drusho/) | [Linkedin](https://linkedin.com/in/davidrusho)


## About the Data

Web scraping was performed on the _Top 10 Tech Channels_ on YouTube using _[Selenium](https://selenium-python.readthedocs.io/)_, a browser automation tool driven from Python that is often used for web scraping and web testing.  The channels were taken from the __[Top 10 Tech Youtubers](https://blog.bit.ai/top-tech-youtubers/)__ list on blog.bit.ai.  Scraping included:

* General data for each channel
  * e.g. join date, name, number of subscribers

* Data from the most popular videos per channel
  * e.g. video titles, views

* Data specific to each video
  * e.g. post date, number of upvotes, number of comments

<br>

The average number of videos per channel was around 200.  In total, data from roughly 2,000 videos was scraped.

## Introduction

_*Note: Open this notebook in [Google Colab](https://colab.research.google.com/drive/1UxpBBsypGqUj7816zyvGNhJcPfaxBP_c?usp=sharing) for the detailed data cleaning code, or visit my [GitHub](https://github.com/drusho/webscrape_youtube) for the code used for web scraping and data collection._

#hide
## Data Cleaning

In [None]:
#hide
import pandas as pd

#hide
### Raw Dataframe Sample
Data from Youtube Channels' main pages (Video and About)

In [None]:
#collapse
yt = pd.read_csv('yt_channel_scrap.csv',parse_dates=['channel_join_date'])
yt.head(2)

Unnamed: 0.1,Unnamed: 0,channel_name,subscribers,title,views,post_date,url,channel_join_date,channel_views,channel_description
0,0,iJustine,6.89M subscribers,Black Eyed Peas - I gotta Feeling (Parody),18M views,11 years ago,https://www.youtube.com/watch?v=iPgaTmsYTT8,NaT,,
1,1,iJustine,6.89M subscribers,Cake Decorating Challenge with Ro | Nerdy Numm...,12M views,5 years ago,https://www.youtube.com/watch?v=y7xZ-kJDgvM,NaT,,


In [None]:
#hide
# create df of Channel details
channel_details = yt[yt.channel_join_date.notna()]
channel_details = channel_details.drop(columns=['Unnamed: 0','subscribers','title','views','post_date']).reset_index(drop=True)
channel_details.head(2)

Unnamed: 0,channel_name,url,channel_join_date,channel_views,channel_description
0,iJustine,,2006-05-07,"1,288,987,476 views","Tech, video games, failed cooking attempts, vl..."
1,Android Authority,,2011-04-03,"767,860,795 views","Your source for the best phones, streaming, ap..."


In [None]:
#hide
#create df Video details
video_details = yt[yt.channel_join_date.isna()]
video_details = video_details.drop(columns=['Unnamed: 0','channel_join_date','channel_views','channel_description','post_date']).reset_index(drop=True)
video_details.head(2)

Unnamed: 0,channel_name,subscribers,title,views,url
0,iJustine,6.89M subscribers,Black Eyed Peas - I gotta Feeling (Parody),18M views,https://www.youtube.com/watch?v=iPgaTmsYTT8
1,iJustine,6.89M subscribers,Cake Decorating Challenge with Ro | Nerdy Numm...,12M views,https://www.youtube.com/watch?v=y7xZ-kJDgvM


In [None]:
#hide
# merge dfs 
merged = channel_details.merge(video_details, on='channel_name')
merged.head(2)

Unnamed: 0,channel_name,url_x,channel_join_date,channel_views,channel_description,subscribers,title,views,url_y
0,iJustine,,2006-05-07,"1,288,987,476 views","Tech, video games, failed cooking attempts, vl...",6.89M subscribers,Black Eyed Peas - I gotta Feeling (Parody),18M views,https://www.youtube.com/watch?v=iPgaTmsYTT8
1,iJustine,,2006-05-07,"1,288,987,476 views","Tech, video games, failed cooking attempts, vl...",6.89M subscribers,Cake Decorating Challenge with Ro | Nerdy Numm...,12M views,https://www.youtube.com/watch?v=y7xZ-kJDgvM


In [None]:
#hide
# drop 2nd url column and rename remaining url col
merged.drop(columns='url_x',inplace=True)
merged.rename(columns={'url_y':'url'},inplace=True)
merged.sample(2)

Unnamed: 0,channel_name,channel_join_date,channel_views,channel_description,subscribers,title,views,url
1601,UrAvgConsumer,2012-01-01,"430,378,637 views",Just your average guy who loves tech and givin...,3.11M subscribers,The ULTIMATE Gamer's Paradise! (Room Tour),5.9M views,https://www.youtube.com/watch?v=3HTc9xJHbU0
666,Jon Rettinger,2007-06-07,"574,947,199 views","Welcome to the video home of Jon Rettinger, fo...",1.59M subscribers,Galaxy S8 Two Months Later: Was It a Mistake?,1M views,https://www.youtube.com/watch?v=75ZXRi2XyZ0
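
The `url_x` / `url_y` pair appears because both frames carry a `url` column when they are merged. As a minimal sketch (toy frames with illustrative titles and placeholder URLs, not the scraped data), dropping the empty channel-level `url` before merging avoids the suffixes, and `validate` checks that each channel appears only once on the left side:

```python
import pandas as pd

# Toy stand-ins for channel_details and video_details (illustrative values only).
channel_details = pd.DataFrame({
    "channel_name": ["iJustine"],
    "url": [None],  # channel rows have no video url
    "channel_join_date": pd.to_datetime(["2006-05-07"]),
})
video_details = pd.DataFrame({
    "channel_name": ["iJustine", "iJustine"],
    "title": ["Video A", "Video B"],  # hypothetical titles
    "url": ["https://example.com/a", "https://example.com/b"],
})

# With the redundant column gone there is no name clash, so merge() adds no
# url_x / url_y suffixes. validate raises MergeError if a channel is duplicated
# on the left side.
merged_demo = channel_details.drop(columns="url").merge(
    video_details, on="channel_name", validate="one_to_many"
)
print(list(merged_demo.columns))
```

This is only one way to arrange the step; dropping and renaming after the merge, as done in the notebook, gives the same columns.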


In [None]:
#hide
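# dtypes to int for views and subscribers
# Caution: appending zeros and then stripping '.' inflates values that have a
# decimal part (e.g. '6.89M subscribers' becomes 689000000)
# regex=False treats '.' as a literal dot; '1 year ago' appears to be a stray value and is mapped to 0
merged.subscribers = merged.subscribers.str.replace('M subscribers','000000').str.replace('.','',regex=False).astype('int')
merged.views = merged.views.str.replace('M views','000000').str.replace('K views','000').str.replace('.','',regex=False).str.replace('1 year ago','0').astype('int')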
merged.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1999 entries, 0 to 1998
Data columns (total 8 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   channel_name         1999 non-null   object        
 1   channel_join_date    1999 non-null   datetime64[ns]
 2   channel_views        1999 non-null   object        
 3   channel_description  1999 non-null   object        
 4   subscribers          1999 non-null   int64         
 5   title                1999 non-null   object        
 6   views                1999 non-null   int64         
 7   url                  1999 non-null   object        
dtypes: datetime64[ns](1), int64(2), object(5)
memory usage: 140.6+ KB
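
The chained `str.replace` calls work for whole numbers such as `18M views`, but a value with a decimal part is scaled too far: `6.89M subscribers` ends up as 689000000 in the cleaned sample. One possible alternative is to parse the number and its suffix separately. The helper below is a sketch, not the notebook's method; the name `parse_count` and the set of accepted unit words are assumptions made for illustration.

```python
import re

# Multipliers for the abbreviations used in the scraped strings.
_MULTIPLIERS = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}

# A number (commas and one decimal point allowed), an optional K/M/B suffix,
# and an optional unit word. The unit words listed here are an assumption.
_COUNT_RE = re.compile(
    r"\s*([\d,]*\.?\d+)\s*([KMB]?)\s*(?:subscribers?|views?|comments?)?\s*",
    re.IGNORECASE,
)

def parse_count(text):
    """Convert strings such as '6.89M subscribers' to an int.

    Returns None when the whole string is not a count, for example a
    stray '1 year ago' value.
    """
    match = _COUNT_RE.fullmatch(str(text))
    if match is None:
        return None
    number, suffix = match.groups()
    value = float(number.replace(",", "")) * _MULTIPLIERS[suffix.upper()]
    return int(round(value))

print(parse_count("6.89M subscribers"))
print(parse_count("1 year ago"))
```

On a DataFrame the helper could be applied with `Series.map`, for example `merged.subscribers.map(parse_count)`; rows that return `None` can then be inspected instead of being silently set to 0.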


#### Cleaned Dataframe Sample

_*Data from Youtube Channels' main pages (Video and About)_ 

In [None]:
#collapse
# channel_views dtype to int
merged.channel_views = merged.channel_views.str.replace(',','').str.replace('views','').astype('int')
merged.head(2)

Unnamed: 0,channel_name,channel_join_date,channel_views,channel_description,subscribers,title,views,url
0,iJustine,2006-05-07,1288987476,"Tech, video games, failed cooking attempts, vl...",689000000,Black Eyed Peas - I gotta Feeling (Parody),18000000,https://www.youtube.com/watch?v=iPgaTmsYTT8
1,iJustine,2006-05-07,1288987476,"Tech, video games, failed cooking attempts, vl...",689000000,Cake Decorating Challenge with Ro | Nerdy Numm...,12000000,https://www.youtube.com/watch?v=y7xZ-kJDgvM


#hide
## Import Videos Data

Video-level data for roughly 2,000 YouTube videos

In [None]:
#hide
# import videos 
df_videos = pd.read_csv('yt_videos_scrap_big_data.csv',parse_dates=['Publish Date','Upload_date'])
df_videos.drop(columns=['Unnamed: 0','Duration','Channel Name','Title'],inplace=True)
df_videos.sample(2)

Unnamed: 0,url,Partial Description,Publish Date,Upload_date,Genre,Width,Height,Likes,Comments,Interaction Count
401,https://www.youtube.com/watch?v=iXKvwPjCGnY,The cursed wallpaper that crashes Android phon...,2020-06-04,2020-06-04,Science & Technology,1280.0,720.0,831K,"54,129 Comments",15688251
735,https://www.youtube.com/watch?v=cER36crkkVw,Other iPhone 4 Videos:iPhone 4 Unboxing: http:...,2010-07-06,2010-07-06,Science & Technology,1280.0,720.0,2K,"1,086 Comments",568777


In [None]:
#hide
# Comments dtype to int
df_videos['Comments'] = df_videos['Comments'].str.replace('Comments','').str.replace(',','').astype('int')
df_videos.sample(2)

Unnamed: 0,url,Partial Description,Publish Date,Upload_date,Genre,Width,Height,Likes,Comments,Interaction Count
1948,https://www.youtube.com/watch?v=ZrZISyPucMg,USB-C: The new industry standard for the next ...,2015-03-12,2015-03-12,Science & Technology,1280.0,720.0,82K,4940,4635276
1363,https://www.youtube.com/watch?v=bjZE1fwAyJ4,So how good is wood?FOLLOW ME IN THESE PLACES ...,2016-01-23,2016-01-23,Science & Technology,1280.0,720.0,98K,6876,5149958


In [None]:
#hide
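# Likes dtype to int
# Caution: a value with a decimal part (e.g. '1.2K') would be inflated by stripping '.'
df_videos['Likes'] = df_videos['Likes'].str.replace('K','000').str.replace("M",'000000').str.replace('.','',regex=False).astype('int')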
df_videos.sample(2)

Unnamed: 0,url,Partial Description,Publish Date,Upload_date,Genre,Width,Height,Likes,Comments,Interaction Count
234,https://www.youtube.com/watch?v=hHavcoQtYIY,Buy at Amazon: http://geni.us/S6Edge | Read mo...,2015-04-07,2015-04-07,Science & Technology,1280.0,720.0,11000,450,1832509
1645,https://www.youtube.com/watch?v=YjmWnwKxOls,My unboxing of the Xbox One.Get Hotspot Shield...,2014-01-17,2014-01-17,Science & Technology,1280.0,720.0,17000,2388,1446756


In [None]:
#hide
# Fix Width and Height: drop the trailing '.0' (values remain strings)
df_videos['Width'] = df_videos['Width'].astype('str').str.split(".", expand=True)[0]
df_videos['Height'] = df_videos['Height'].astype('str').str.split(".", expand=True)[0]
df_videos.head(2)

Unnamed: 0,url,Partial Description,Publish Date,Upload_date,Genre,Width,Height,Likes,Comments,Interaction Count
0,https://www.youtube.com/watch?v=iPgaTmsYTT8,Thanks for watching! Don't forget to subscribe...,2009-07-30,2009-07-30,Comedy,1280,720,102000,23437,18198670
1,https://www.youtube.com/watch?v=y7xZ-kJDgvM,Thanks for watching! Don't forget to subscribe...,2016-02-18,2016-02-18,Howto & Style,1280,720,99000,8421,12395700
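
Splitting on `.` leaves `Width` and `Height` as strings, and a missing value would become the text `nan`. If numeric columns are preferred, pandas' nullable integer dtype is one option. The sketch below uses a small made-up frame, and the missing row is an assumption added to show how gaps are handled:

```python
import pandas as pd

# Illustrative values; the None stands in for a video with no scraped size.
sizes = pd.DataFrame({
    "Width": [1280.0, 1920.0, None],
    "Height": [720.0, 1080.0, None],
})

# 'Int64' (capital I) is the nullable integer dtype: whole-number floats are
# converted to integers and missing values become <NA> rather than 'nan'.
sizes = sizes.astype({"Width": "Int64", "Height": "Int64"})
print(sizes.dtypes)
```

Keeping the columns numeric also allows comparisons such as `sizes["Width"] >= 1920` without converting back from strings.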


## Cleaned Dataframe

Sample of fully cleaned and merged dataframe

Data from the YouTube channel pages and the individual video pages was merged on the video URL.

In [None]:
#collapse
# merge df2 
vc_merged = merged.merge(df_videos, on='url') 
vc_merged.rename(columns={
    'Partial Description':'video_desc',
    'Publish Date':'publish_date',
    'Upload_date':'upload_date',
    'Interaction Count':'interact_count'
    },inplace=True)

vc_merged.head(2)

Unnamed: 0,channel_name,channel_join_date,channel_views,channel_description,subscribers,title,views,url,video_desc,publish_date,upload_date,Genre,Width,Height,Likes,Comments,interact_count
0,iJustine,2006-05-07,1288987476,"Tech, video games, failed cooking attempts, vl...",689000000,Black Eyed Peas - I gotta Feeling (Parody),18000000,https://www.youtube.com/watch?v=iPgaTmsYTT8,Thanks for watching! Don't forget to subscribe...,2009-07-30,2009-07-30,Comedy,1280,720,102000,23437,18198670
1,iJustine,2006-05-07,1288987476,"Tech, video games, failed cooking attempts, vl...",689000000,Cake Decorating Challenge with Ro | Nerdy Numm...,12000000,https://www.youtube.com/watch?v=y7xZ-kJDgvM,Thanks for watching! Don't forget to subscribe...,2016-02-18,2016-02-18,Howto & Style,1280,720,99000,8421,12395700
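
`merge` defaults to an inner join, so any video whose `url` is missing from one of the two frames is dropped without notice. A quick way to see how many rows match is an outer merge with `indicator=True`. This is a sketch on toy frames; the short `u1`..`u4` keys are placeholders, not real URLs:

```python
import pandas as pd

# Toy frames: 'u3' exists only on the channel side, 'u4' only on the video side.
channels_side = pd.DataFrame({"url": ["u1", "u2", "u3"], "views": [10, 20, 30]})
videos_side = pd.DataFrame({"url": ["u1", "u2", "u4"], "Likes": [1, 2, 4]})

# indicator=True adds a '_merge' column with 'both', 'left_only' or 'right_only'.
check = channels_side.merge(videos_side, on="url", how="outer", indicator=True)
print(check["_merge"].value_counts())

# URLs that an inner merge would have dropped.
unmatched = check.loc[check["_merge"] != "both", "url"].tolist()
print(unmatched)
```

Running the same check on `merged` and `df_videos` before the inner merge would show whether any of the scraped videos are lost at this step.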


## Data Analysis

### List of Youtube Channels in Dataframe

In [None]:
#collapse
# List of Video Channels
vc_merged.groupby(['channel_join_date','channel_name','channel_views','channel_description'])['subscribers'].max().to_frame().reset_index()

Unnamed: 0,channel_join_date,channel_name,channel_views,channel_description,subscribers
0,2006-05-07,iJustine,1288987476,"Tech, video games, failed cooking attempts, vl...",689000000
1,2007-06-07,Jon Rettinger,574947199,"Welcome to the video home of Jon Rettinger, fo...",159000000
2,2007-08-04,Austin Evans,1118911675,The best of technology from gaming PCs to smar...,507000000
3,2008-03-21,Marques Brownlee,2597028774,MKBHD: Quality Tech Videos | YouTuber | Geek |...,143000000
4,2008-11-24,Linus Tech Tips,4934741560,Tech can be complicated; we try to make it eas...,137000000
5,2010-03-24,Jonathan Morrison,430639061,"High quality videos blending tech + aesthetic,...",264000000
6,2010-12-21,Unbox Therapy,4091676835,Where products get naked.\n\nHere you will fin...,18000000
7,2011-04-03,Android Authority,767860795,"Your source for the best phones, streaming, ap...",336000000
8,2011-04-20,Mrwhosetheboss,1208148200,My name is Arun Maini. I'm a 25 year old Econo...,771000000
9,2012-01-01,UrAvgConsumer,430378637,Just your average guy who loves tech and givin...,311000000


### Top 10 Videos by Views

Discoveries so far:

* Most of these videos are over a year old, which suggests that videos keep accumulating views over time.

* Two videos list a dollar amount in the title.

* Marques Brownlee labels some videos with a series name (e.g. "Dope Tech").

* Unbox Therapy dominates the list: 7 of the 10 videos belong to this channel.

In [None]:
#collapse
# Top 10 Videos by Views
vc_merged.groupby(['title','channel_name','publish_date'])['views'].max().sort_values(ascending=False).head(10).reset_index()

Unnamed: 0,title,channel_name,publish_date,views
0,2020 iPad Pro Review: It's... A Computer?!,Marques Brownlee,2020-03-24,99000000
1,A Keyboard Made Of Glass?,Unbox Therapy,2016-04-09,98000000
2,iPhone 12 - The iPhone is New Again,Unbox Therapy,2020-10-13,98000000
3,The Secret Android Button,Unbox Therapy,2016-04-13,98000000
4,The FASTEST gaming PC money can buy,Linus Tech Tips,2018-12-27,98000000
5,"Fortnite on an INSANE $20,000 Gaming PC",Unbox Therapy,2018-03-18,98000000
6,Dope Tech: Self-Lacing Nike Mag!,Marques Brownlee,2016-10-07,98000000
7,This is the iPhone SE 2,Unbox Therapy,2020-04-15,97000000
8,Human Headphones Just Changed The Game,Unbox Therapy,2019-08-30,97000000
9,$1000 Earphones! (Shure SE846 Unboxing & Test),Unbox Therapy,2014-06-23,97000000
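
The per-channel share of the top 10 can be counted rather than read off by eye. The sketch below copies the `channel_name` column from the table above into a Series and tallies it with `value_counts`:

```python
import pandas as pd

# channel_name column copied from the top-10 table, in row order.
top10_channels = pd.Series([
    "Marques Brownlee", "Unbox Therapy", "Unbox Therapy", "Unbox Therapy",
    "Linus Tech Tips", "Unbox Therapy", "Marques Brownlee", "Unbox Therapy",
    "Unbox Therapy", "Unbox Therapy",
], name="channel_name")

# Number of top-10 videos per channel, most frequent first.
share = top10_channels.value_counts()
print(share)
```

On the full frame, something like `vc_merged.nlargest(10, "views")["channel_name"].value_counts()` should give the same tally without copying values by hand, assuming the `views` column holds the numbers shown in the table.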


### Total Views by Channel

In [None]:
#collapse
# Total Views by Channel

merged.groupby(['channel_name','subscribers'])['views'].sum().sort_values(ascending=False).reset_index()

Unnamed: 0,channel_name,subscribers,views
0,Unbox Therapy,18000000,10357000000
1,Linus Tech Tips,137000000,9891000000
2,Marques Brownlee,143000000,9863000000
3,Mrwhosetheboss,771000000,6816000000
4,Austin Evans,506000000,5040000000
5,iJustine,689000000,5016000000
6,Android Authority,336000000,1590710000
7,UrAvgConsumer,311000000,1500726000
8,Jonathan Morrison,264000000,1340206000
9,Jon Rettinger,159000000,1051486000


### Channels Grouped By Date They Joined Youtube

In [None]:
#collapse
merged.groupby('channel_name')['channel_join_date'].min().sort_values().to_frame().reset_index()

Unnamed: 0,channel_name,channel_join_date
0,iJustine,2006-05-07
1,Jon Rettinger,2007-06-07
2,Austin Evans,2007-08-04
3,Marques Brownlee,2008-03-21
4,Linus Tech Tips,2008-11-24
5,Jonathan Morrison,2010-03-24
6,Unbox Therapy,2010-12-21
7,Android Authority,2011-04-03
8,Mrwhosetheboss,2011-04-20
9,UrAvgConsumer,2012-01-01


## Resources

- [Top 25 Selenium Functions That Will Make You Pro In Web Scraping](https://towardsdatascience.com/top-25-selenium-functions-that-will-make-you-pro-in-web-scraping-5c937e027244)

- [How to build a Web Scraper or Bot in Python using Selenium](https://medium.com/daily-programming-tips/how-to-build-a-web-scraper-or-bot-in-python-using-selenium-2815f20023f7)

- [Web Scraping: Introduction, Best Practices & Caveats](https://medium.com/velotio-perspectives/web-scraping-introduction-best-practices-caveats-9cbf4acc8d0f)

- [Web Scraping Job Postings from Indeed.com using Selenium](https://towardsdatascience.com/web-scraping-job-postings-from-indeed-com-using-selenium-5ae58d155daf)


- [How I Use Selenium to Automate the Web With Python. Pt1 - John Watson Rooney](https://www.youtube.com/watch?v=pUUhvJvs-R4)