<a href="https://colab.research.google.com/github/drusho/drusho.github.io/blob/master/_notebooks/2021-07-20-webscrapping-youtube.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Analyzing Youtube Tech Channels
> Using Data Collected from Selenium

- toc: false
- badges: false
- comments: true
- categories: [Selenium, Web Scraping, Pandas]
- image: "images/thumbnails/header_youtube_web.png"

<br>

> Note: __Notebook Created by David Rusho__
>
> [Github Blog](https://drusho.github.io) | [Github](https://github.com/drusho/webscrape_youtube) | [Tableau](https://public.tableau.com/app/profile/drusho/) | [Linkedin](https://linkedin.com/in/davidrusho)


<br> 

> Important: This notebook contains hidden cells when viewed as a blog post.
> Visit the links below for the full code:  
> * [Google Colab Notebook](https://colab.research.google.com/drive/1UxpBBsypGqUj7816zyvGNhJcPfaxBP_c?usp=sharing): **All code** related to data cleaning and data analysis.
> * [Github - Youtube Web Scraping](https://github.com/drusho/webscrape_youtube/tree/main/code):  **All code** related to web scraping and data collection.


## About the Data

Web scraping was performed on the _Top 10 Tech Channels_ on Youtube using _[Selenium](https://selenium-python.readthedocs.io/)_, a browser automation tool controlled from Python that is often used for web scraping and web testing.  These channels were selected from the __[Top 10 Tech Youtubers](https://blog.bit.ai/top-tech-youtubers/)__ list on blog.bit.ai.  

Data from 2,000 videos was scraped, which is about 200 of the most popular videos per channel.

## Introduction

### Web Scraping Youtube Channels
> Using Selenium

In [None]:
#collapse

### Web Scraping Youtube Videos
> Using Selenium

In [None]:
#collapse

## Data Cleaning

In [None]:
#hide
import pandas as pd

#hide
### Raw Dataframe Sample
Data from Youtube Channels' main pages (Video and About)

In [None]:
#hide
yt = pd.read_csv('yt_channel_scrap.csv',parse_dates=['channel_join_date'])
yt.head(2)

Unnamed: 0.1,Unnamed: 0,channel_name,subscribers,title,views,post_date,url,channel_join_date,channel_views,channel_description
0,0,iJustine,6.89M subscribers,Black Eyed Peas - I gotta Feeling (Parody),18M views,11 years ago,https://www.youtube.com/watch?v=iPgaTmsYTT8,NaT,,
1,1,iJustine,6.89M subscribers,Cake Decorating Challenge with Ro | Nerdy Numm...,12M views,5 years ago,https://www.youtube.com/watch?v=y7xZ-kJDgvM,NaT,,


In [None]:
#hide
# create df of Channel details
channel_details = yt[yt.channel_join_date.notna()]
channel_details = channel_details.drop(columns=['Unnamed: 0','subscribers','title','views','post_date']).reset_index(drop=True)
channel_details.head(2)

Unnamed: 0,channel_name,url,channel_join_date,channel_views,channel_description
0,iJustine,,2006-05-07,"1,288,987,476 views","Tech, video games, failed cooking attempts, vl..."
1,Android Authority,,2011-04-03,"767,860,795 views","Your source for the best phones, streaming, ap..."


In [None]:
#hide
#create df Video details
video_details = yt[yt.channel_join_date.isna()]
video_details = video_details.drop(columns=['Unnamed: 0','channel_join_date','channel_views','channel_description','post_date']).reset_index(drop=True)
video_details.head(2)

Unnamed: 0,channel_name,subscribers,title,views,url
0,iJustine,6.89M subscribers,Black Eyed Peas - I gotta Feeling (Parody),18M views,https://www.youtube.com/watch?v=iPgaTmsYTT8
1,iJustine,6.89M subscribers,Cake Decorating Challenge with Ro | Nerdy Numm...,12M views,https://www.youtube.com/watch?v=y7xZ-kJDgvM


In [None]:
#hide
# merge dfs 
merged = channel_details.merge(video_details, on='channel_name')
merged.head(2)

Unnamed: 0,channel_name,url_x,channel_join_date,channel_views,channel_description,subscribers,title,views,url_y
0,iJustine,,2006-05-07,"1,288,987,476 views","Tech, video games, failed cooking attempts, vl...",6.89M subscribers,Black Eyed Peas - I gotta Feeling (Parody),18M views,https://www.youtube.com/watch?v=iPgaTmsYTT8
1,iJustine,,2006-05-07,"1,288,987,476 views","Tech, video games, failed cooking attempts, vl...",6.89M subscribers,Cake Decorating Challenge with Ro | Nerdy Numm...,12M views,https://www.youtube.com/watch?v=y7xZ-kJDgvM
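
The split-and-merge steps above can be sketched on a toy frame. This is only an illustration under assumptions: the channel names, titles, and dates below are made up, and `validate="one_to_many"` is an optional safety check that the notebook itself does not use.

```python
import pandas as pd

# Toy stand-in for the scraped file: one channel-level row (join date set)
# followed by video-level rows (join date missing). Values are made up.
raw = pd.DataFrame({
    "channel_name": ["Chan A", "Chan A", "Chan A", "Chan B", "Chan B"],
    "title": [None, "Video 1", "Video 2", None, "Video 3"],
    "channel_join_date": pd.to_datetime(
        ["2006-05-07", None, None, "2011-04-03", None]
    ),
})

# Channel rows carry a join date; video rows do not.
is_channel_row = raw["channel_join_date"].notna()
channels = raw.loc[is_channel_row, ["channel_name", "channel_join_date"]]
videos = raw.loc[~is_channel_row, ["channel_name", "title"]]

# validate="one_to_many" raises an error if a channel name appears more
# than once on the left side, which would silently duplicate video rows.
toy_merged = channels.merge(videos, on="channel_name", validate="one_to_many")
print(toy_merged)
```

Every video row ends up with its channel's join date attached, which is the shape the `merged` dataframe has above.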


In [None]:
#hide
# drop the empty channel-page url column (url_x) and rename the remaining url col
merged.drop(columns='url_x',inplace=True)
merged.rename(columns={'url_y':'url'},inplace=True)
merged.head()

Unnamed: 0,channel_name,channel_join_date,channel_views,channel_description,subscribers,title,views,url
0,iJustine,2006-05-07,"1,288,987,476 views","Tech, video games, failed cooking attempts, vl...",6.89M subscribers,Black Eyed Peas - I gotta Feeling (Parody),18M views,https://www.youtube.com/watch?v=iPgaTmsYTT8
1,iJustine,2006-05-07,"1,288,987,476 views","Tech, video games, failed cooking attempts, vl...",6.89M subscribers,Cake Decorating Challenge with Ro | Nerdy Numm...,12M views,https://www.youtube.com/watch?v=y7xZ-kJDgvM
2,iJustine,2006-05-07,"1,288,987,476 views","Tech, video games, failed cooking attempts, vl...",6.89M subscribers,The Voice of Siri!,11M views,https://www.youtube.com/watch?v=W2bc72HClEE
3,iJustine,2006-05-07,"1,288,987,476 views","Tech, video games, failed cooking attempts, vl...",6.89M subscribers,Ugliest iPhone Cases Ever?,9.4M views,https://www.youtube.com/watch?v=x06yBIHu26o
4,iJustine,2006-05-07,"1,288,987,476 views","Tech, video games, failed cooking attempts, vl...",6.89M subscribers,Making a mini cake with Ro!,9.1M views,https://www.youtube.com/watch?v=MdmGtxyzwHA


In [None]:
#hide
# convert subscribers from text (e.g. '6.89M subscribers') to float
merged.subscribers = merged.subscribers.str.replace('M subscribers','').astype('float')*1000000
merged.head(2)

Unnamed: 0,channel_name,channel_join_date,channel_views,channel_description,subscribers,title,views,url
0,iJustine,2006-05-07,"1,288,987,476 views","Tech, video games, failed cooking attempts, vl...",6890000.0,Black Eyed Peas - I gotta Feeling (Parody),18M views,https://www.youtube.com/watch?v=iPgaTmsYTT8
1,iJustine,2006-05-07,"1,288,987,476 views","Tech, video games, failed cooking attempts, vl...",6890000.0,Cake Decorating Challenge with Ro | Nerdy Numm...,12M views,https://www.youtube.com/watch?v=y7xZ-kJDgvM


In [None]:
#hide
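# modify views col dtype to float
def fix_views(col):
  if 'M' in col:
    return float(col.replace('M views',''))*1000000
  elif 'K' in col:
    return float(col.replace('K views',''))*1000
  elif '1 year ago' in col:
    # some rows hold a post-date string here instead of a view count
    return 0
  # any other format is left as a missing value
  return None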

merged['views'] = merged['views'].apply(fix_views)

merged.head(2)

Unnamed: 0,channel_name,channel_join_date,channel_views,channel_description,subscribers,title,views,url
0,iJustine,2006-05-07,"1,288,987,476 views","Tech, video games, failed cooking attempts, vl...",6890000.0,Black Eyed Peas - I gotta Feeling (Parody),18000000.0,https://www.youtube.com/watch?v=iPgaTmsYTT8
1,iJustine,2006-05-07,"1,288,987,476 views","Tech, video games, failed cooking attempts, vl...",6890000.0,Cake Decorating Challenge with Ro | Nerdy Numm...,12000000.0,https://www.youtube.com/watch?v=y7xZ-kJDgvM
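
The same idea can be written as a single parser that covers the views, subscribers, likes, and comments strings. This is a sketch rather than the notebook's method: `parse_count` is a hypothetical helper, the `B` suffix is an assumption that does not appear in the samples shown, and strings it cannot parse (such as `'1 year ago'`) become `NaN` instead of `0` so that missing values stay visible.

```python
import math
import re

# Multipliers for the abbreviated suffixes; "B" is included as an assumption.
_SUFFIX = {"": 1, "K": 1_000, "M": 1_000_000, "B": 1_000_000_000}

# A number, an optional K/M/B suffix, and an optional unit word.
_COUNT_RE = re.compile(
    r"\s*([\d,]*\.?\d+)\s*([KMB]?)\s*(?:views?|subscribers?|comments?)?\s*",
    re.IGNORECASE,
)

def parse_count(text):
    """Convert strings like '18M views' or '3,469 Comments' to a float.

    Returns NaN for non-strings and for text that is not a count.
    """
    if not isinstance(text, str):
        return math.nan
    match = _COUNT_RE.fullmatch(text)
    if match is None:
        return math.nan
    number, suffix = match.groups()
    return float(number.replace(",", "")) * _SUFFIX[suffix.upper()]

for sample in ["18M views", "9.4K views", "3,469 Comments", "31K", "1 year ago"]:
    print(sample, "->", parse_count(sample))
```

It could be applied with `merged['views'].apply(parse_count)` in place of a column-specific function.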


In [None]:
#hide
# Correct channel view column to display num only
merged['channel_views'] = merged['channel_views'].str.replace(',','').str.replace(' views','').astype('int')

#hide
## Import Videos Data

Per-video details for the 2,000 Youtube videos

In [None]:
#hide
# import videos 
df_videos = pd.read_csv('yt_videos_scrap_big_data.csv',parse_dates=['Publish Date','Upload_date'])
df_videos.drop(columns=['Unnamed: 0','Duration','Channel Name','Title'],inplace=True)
df_videos.sample(2)

Unnamed: 0,url,Partial Description,Publish Date,Upload_date,Genre,Width,Height,Likes,Comments,Interaction Count
1186,https://www.youtube.com/watch?v=P0r9wR-Z2dc,For $200 how does a new vs used Windows 10 lap...,2018-03-11,2018-03-11,Science & Technology,1280.0,720.0,31K,"3,469 Comments",1634850
1209,https://www.youtube.com/watch?v=j6T1Mygucak,Use sharp scissors like these - http://amzn.to...,2012-12-08,2012-12-08,Science & Technology,1280.0,720.0,46K,"3,257 Comments",14169813


In [None]:
#hide
# comments dtype to int
df_videos['Comments'] = df_videos['Comments'].str.replace('Comments','').str.replace(',','').astype('int')
df_videos.sample(2)

Unnamed: 0,url,Partial Description,Publish Date,Upload_date,Genre,Width,Height,Likes,Comments,Interaction Count
1087,https://www.youtube.com/watch?v=TOyazdH2b-U,That's it. Ken has officially gone too far in ...,2019-04-01,2019-04-01,Science & Technology,1280.0,720.0,55K,2890,2348400
1744,https://www.youtube.com/watch?v=K43mTKyaed8,It’s that time again! We’ve got another massiv...,2017-10-14,2017-10-14,Science & Technology,1280.0,720.0,18K,1397,769452


In [None]:
#hide
# modify likes col dtype to float
def fix_likes(col):
  if 'M' in col:
    return float(col.replace('M',''))*1000000
  elif 'K' in col:
    return float(col.replace('K',''))*1000
  else:
    return float(col)

df_videos['Likes'] = df_videos['Likes'].apply(fix_likes)

df_videos.head(2)

Unnamed: 0,url,Partial Description,Publish Date,Upload_date,Genre,Width,Height,Likes,Comments,Interaction Count
0,https://www.youtube.com/watch?v=iPgaTmsYTT8,Thanks for watching! Don't forget to subscribe...,2009-07-30,2009-07-30,Comedy,1280.0,720.0,102000.0,23437,18198670
1,https://www.youtube.com/watch?v=y7xZ-kJDgvM,Thanks for watching! Don't forget to subscribe...,2016-02-18,2016-02-18,Howto & Style,1280.0,720.0,99000.0,8421,12395700


In [None]:
#hide
# Fix Width and Height, remove '.' and '0' from end of str
df_videos['Width'] = df_videos['Width'].astype('str').str.split(".", expand=True)[0]
df_videos['Height'] = df_videos['Height'].astype('str').str.split(".", expand=True)[0]
df_videos.head(2)

Unnamed: 0,url,Partial Description,Publish Date,Upload_date,Genre,Width,Height,Likes,Comments,Interaction Count
0,https://www.youtube.com/watch?v=iPgaTmsYTT8,Thanks for watching! Don't forget to subscribe...,2009-07-30,2009-07-30,Comedy,1280,720,102000.0,23437,18198670
1,https://www.youtube.com/watch?v=y7xZ-kJDgvM,Thanks for watching! Don't forget to subscribe...,2016-02-18,2016-02-18,Howto & Style,1280,720,99000.0,8421,12395700
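
An alternative to splitting the string on `.` is pandas' nullable integer dtype. The sketch below uses a toy column with one missing entry (an assumption; the scraped data may have none) to show how missing values are kept as `<NA>` rather than turning into the text `"nan"`.

```python
import pandas as pd

# Toy column mimicking the scraped Width values, including a missing entry.
width = pd.Series([1280.0, 1920.0, None], name="Width")

# "Int64" (capital I) is pandas' nullable integer dtype: 1280.0 becomes 1280
# and the missing entry becomes <NA> instead of the string "nan".
width_int = width.astype("Int64")

# Convert to text only if the value is needed as a label, e.g. "1280".
width_label = width_int.astype("string")
print(width_int.tolist(), width_label.tolist())
```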


### The Cleaned Dataframe

Sample of the fully cleaned and merged dataframe.

Data from the Youtube channel pages and all video pages, merged.

In [None]:
#collapse

vc_merged = merged.merge(df_videos, on='url') 

# rename columns to increase readability in analysis plots and tables
vc_merged.rename(columns={
    'channel_name':'Channel Name',
    'channel_join_date':'Channel Join Date',
    'channel_views':'Channel Views (M)',
    'subscribers':'Subscribers (M)',
    'Interaction Count':'Interactations (M)',
    'views':'Video Views (M)',
    'Partial Description':'Video Desc',
    'Publish Date':'Publish Date',
    'Upload_date':'Upload Date',
    'Genre':'Video Genre',
    'Width':'Width',
    'Height':'Height',
    'Comments':'Video Comments',
    'title':'Video Title',
    'url':'Video URL'
    },inplace=True)

vc_merged.head(2)

Unnamed: 0,Channel Name,Channel Join Date,Channel Views (M),channel_description,Subscribers (M),Video Title,Video Views (M),Video URL,Video Desc,Publish Date,Upload Date,Video Genre,Width,Height,Likes,Video Comments,Interactations (M)
0,iJustine,2006-05-07,1288987476,"Tech, video games, failed cooking attempts, vl...",6890000.0,Black Eyed Peas - I gotta Feeling (Parody),18000000.0,https://www.youtube.com/watch?v=iPgaTmsYTT8,Thanks for watching! Don't forget to subscribe...,2009-07-30,2009-07-30,Comedy,1280,720,102000.0,23437,18198670
1,iJustine,2006-05-07,1288987476,"Tech, video games, failed cooking attempts, vl...",6890000.0,Cake Decorating Challenge with Ro | Nerdy Numm...,12000000.0,https://www.youtube.com/watch?v=y7xZ-kJDgvM,Thanks for watching! Don't forget to subscribe...,2016-02-18,2016-02-18,Howto & Style,1280,720,99000.0,8421,12395700
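
Because `merge` defaults to an inner join, any URL present in only one of the two frames is dropped without warning. One way to check for this, sketched here on toy frames with placeholder URLs and invented numbers, is an outer merge with `indicator=True`.

```python
import pandas as pd

# Toy frames keyed on url; the URLs are placeholders, not real videos.
listing = pd.DataFrame({"url": ["u1", "u2", "u3"], "views": [18.0, 12.0, 11.0]})
details = pd.DataFrame({"url": ["u1", "u2", "u4"], "Likes": [102000.0, 99000.0, 5.0]})

# An outer merge with indicator=True adds a "_merge" column recording
# whether each url was found in the left frame, the right frame, or both.
check = listing.merge(details, on="url", how="outer", indicator=True)
unmatched = check.loc[check["_merge"] != "both", ["url", "_merge"]]
print(unmatched)
```

An empty `unmatched` frame would mean the inner merge above loses no videos.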


In [None]:
#hide
# express large counts in millions to shorten the displayed numbers

vc_merged['Channel Views (M)'] = round(vc_merged['Channel Views (M)']/1000000,2)
vc_merged['Video Views (M)'] = vc_merged['Video Views (M)']/1000000
vc_merged['Subscribers (M)'] = vc_merged['Subscribers (M)']/1000000
vc_merged['Interactations (M)'] = round(vc_merged['Interactations (M)']/1000000,2)

vc_merged.head(2)

Unnamed: 0,Channel Name,Channel Join Date,Channel Views (M),channel_description,Subscribers (M),Video Title,Video Views (M),Video URL,Video Desc,Publish Date,Upload Date,Video Genre,Width,Height,Likes,Video Comments,Interactations (M)
0,iJustine,2006-05-07,1288.99,"Tech, video games, failed cooking attempts, vl...",6.89,Black Eyed Peas - I gotta Feeling (Parody),18.0,https://www.youtube.com/watch?v=iPgaTmsYTT8,Thanks for watching! Don't forget to subscribe...,2009-07-30,2009-07-30,Comedy,1280,720,102000.0,23437,18.2
1,iJustine,2006-05-07,1288.99,"Tech, video games, failed cooking attempts, vl...",6.89,Cake Decorating Challenge with Ro | Nerdy Numm...,12.0,https://www.youtube.com/watch?v=y7xZ-kJDgvM,Thanks for watching! Don't forget to subscribe...,2016-02-18,2016-02-18,Howto & Style,1280,720,99000.0,8421,12.4


#hide
#### Column Descriptions

|Column Name  | Description |
|:--|:--|
|Channel Name|Name of Youtube Channel  |
|Channel Join Date|Date Channel was created|
|Channel Views (M)|Total views the channel has received (in millions)|
|Channel Description|Description of Youtube Channel|
|Subscribers (M)|Number of channel subscribers (in millions)|
|Video Title|Video title|
|Video Views (M)|Total views for video (in millions)|
|Video URL|Video url|
|Video Desc|Description of video|
|Publish Date|Date video was published|
|Upload Date|Date video was uploaded|
|Video Genre|Genre of video|
|Width|Width of video (pixels)|
|Height|Height of video (pixels)|
|Likes|Total likes for video|
|Video Comments|Total comments for video|
|Interactions (M)|Number of interactions video has received (in millions)|


#hide
### Data Analysis

#### Youtube Channels Ordered by Join Date

In [None]:
#collapse
# List of Video Channels
yt_chan = vc_merged.groupby(['Channel Join Date','Channel Name','Channel Views (M)'])['Subscribers (M)'].max().to_frame().reset_index()

# rename columns to increase readability
yt_chan.rename(columns={
    'Channel Name':'Channel',
    'Channel Join Date':'Join Date',
    'Subscribers (M)':'Subscribers',
    'Channel Views (M)':'Channel Views'
    },inplace=True)

# style dataframe to highlight highest values
yt_chan.style.format(formatter={'Subscribers': "{:,} M",
                                 'Channel Views': "{:,} M",
                                 'Join Date': "{:%Y-%m-%d}"}).background_gradient(subset=['Channel Views',
                                                                                          'Subscribers'], 
                                                                                  cmap='Wistia').hide_index()

Join Date,Channel,Channel Views,Subscribers
2006-05-07,iJustine,"1,288.99 M",6.89 M
2007-06-07,Jon Rettinger,574.95 M,1.59 M
2007-08-04,Austin Evans,"1,118.91 M",5.07 M
2008-03-21,Marques Brownlee,"2,597.03 M",14.3 M
2008-11-24,Linus Tech Tips,"4,934.74 M",13.7 M
2010-03-24,Jonathan Morrison,430.64 M,2.64 M
2010-12-21,Unbox Therapy,"4,091.68 M",18.0 M
2011-04-03,Android Authority,767.86 M,3.36 M
2011-04-20,Mrwhosetheboss,"1,208.15 M",7.71 M
2012-01-01,UrAvgConsumer,430.38 M,3.11 M


#### Top 10 Most Viewed Videos

In [None]:
#collapse
# Top 10 Videos by Views
top_chan = vc_merged.groupby(['Video Title',
                              'Channel Name',
                              'Publish Date'])['Video Views (M)'].max().sort_values(ascending=False).head(10).reset_index()

# rename columns to increase readability
top_chan.rename(columns={
    'Channel Name':'Channel',
    'Video Views (M)':'Video Views'
    },inplace=True)

top_chan.style.format(formatter={'Video Views': "{:,} M",
                                 'Publish Date': "{:%Y-%m-%d}"}).background_gradient(subset=['Video Views',
                                                                                                   'Publish Date'], cmap='Wistia').hide_index()


Video Title,Channel,Publish Date,Video Views
iPhone 6 Plus Bend Test,Unbox Therapy,2014-09-23,73.0 M
Retro Tech: Game Boy,Marques Brownlee,2019-04-19,28.0 M
BROKE vs PRO Gaming,Austin Evans,2019-08-03,22.0 M
Samsung Galaxy Fold Unboxing: Magnets!,Marques Brownlee,2019-04-16,22.0 M
Turn your Smartphone into a 3D Hologram | 4K,Mrwhosetheboss,2015-08-01,22.0 M
OnePlus 6 Review: Right On the Money!,Marques Brownlee,2018-05-25,21.0 M
This Smartphone Changes Everything...,Unbox Therapy,2018-06-19,21.0 M
The 4 Dollar Android Smartphone,Unbox Therapy,2016-03-11,20.0 M
This Cup Is Unspillable - What Magic Is This?,Unbox Therapy,2016-07-03,20.0 M
"Unboxing The $20,000 Smartphone",Unbox Therapy,2016-12-25,19.0 M

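When each video appears only once in the dataframe, `DataFrame.nlargest` is a shorter way to get a top-N table than grouping and sorting. The sketch below uses invented titles and view counts; unlike the `groupby` version above, it does not collapse duplicate rows for the same video.

```python
import pandas as pd

# Toy data; titles and view counts are invented for illustration.
videos = pd.DataFrame({
    "Video Title": ["A", "B", "C", "D"],
    "Channel Name": ["X", "X", "Y", "Y"],
    "Video Views (M)": [22.0, 73.0, 5.0, 28.0],
})

# nlargest sorts by the given column and keeps the top n rows in one step.
top2 = videos.nlargest(2, "Video Views (M)").reset_index(drop=True)
print(top2)
```
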

#### Channels Grouped by Total Video Views

Sum of views across all scraped videos for each channel.


In [None]:
#collapse
# Total Views by Channel

chan_views = vc_merged.groupby(['Channel Name','Subscribers (M)'])['Video Views (M)'].sum().sort_values(ascending=False).reset_index()

# rename columns to increase readability
chan_views.rename(columns={
    'Channel Name':'Channel',
    'Video Views (M)':'Video Views',
    'Subscribers (M)':'Subscribers'
    },inplace=True)

chan_views.style.format(formatter={'Video Views':'{0:,.0f} M',
                                 'Subscribers': "{:,} M"}).background_gradient(subset=['Video Views','Subscribers'], cmap='Wistia').hide_index()

Channel,Subscribers,Video Views
Unbox Therapy,18.0 M,"1,522 M"
Marques Brownlee,14.3 M,"1,286 M"
Linus Tech Tips,13.7 M,"1,158 M"
Mrwhosetheboss,7.71 M,816 M
Austin Evans,5.07 M,600 M
iJustine,6.89 M,597 M
Android Authority,3.36 M,288 M
Jonathan Morrison,2.64 M,249 M
UrAvgConsumer,3.11 M,249 M
Jon Rettinger,1.59 M,193 M


#### Correlations

In [None]:
#collapse
vc_merged.corr().style.background_gradient(subset=['Channel Views (M)',
                                                   'Subscribers (M)',
                                                   'Video Views (M)',
                                                   'Likes',
                                                   'Video Comments',
                                                   'Interactations (M)'],
                                           cmap='Wistia')

Unnamed: 0,Channel Views (M),Subscribers (M),Video Views (M),Likes,Video Comments,Interactations (M)
Channel Views (M),1.0,0.907635,0.586217,0.570409,0.138889,0.583878
Subscribers (M),0.907635,1.0,0.65992,0.652701,0.163038,0.659026
Video Views (M),0.586217,0.65992,1.0,0.708341,0.155869,0.996397
Likes,0.570409,0.652701,0.708341,1.0,0.23568,0.715335
Video Comments,0.138889,0.163038,0.155869,0.23568,1.0,0.156037
Interactations (M),0.583878,0.659026,0.996397,0.715335,0.156037,1.0


### Conclusion

* Video comment counts correlate only weakly with every other metric collected in this project (coefficients between roughly 0.14 and 0.24 in the table above).
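
For readers unfamiliar with how these coefficients are obtained, `DataFrame.corr()` uses the Pearson correlation by default. The sketch below computes it by hand on toy numbers (invented for illustration, not taken from the scraped data) and compares the result with pandas.

```python
import numpy as np
import pandas as pd

# Toy numbers, invented for illustration; not taken from the scraped data.
views = pd.Series([18.0, 12.0, 11.0, 9.4, 9.1])
comments = pd.Series([23437, 8421, 900, 15000, 300])

# Pearson r: the sum of products of deviations from the mean, divided by
# the square root of the product of the summed squared deviations.
dx = views - views.mean()
dy = comments - comments.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# pandas computes the same quantity; method="spearman" would use ranks instead.
r_pandas = views.corr(comments)
print(round(r_manual, 4), round(r_pandas, 4))
```

A value near 0 indicates little linear relationship, and a value near 1 indicates that the two columns rise together.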

## Resources

- [Top 25 Selenium Functions That Will Make You Pro In Web Scraping](https://towardsdatascience.com/top-25-selenium-functions-that-will-make-you-pro-in-web-scraping-5c937e027244)

- [How to build a Web Scraper or Bot in Python using Selenium](https://medium.com/daily-programming-tips/how-to-build-a-web-scraper-or-bot-in-python-using-selenium-2815f20023f7)

- [Web Scraping: Introduction, Best Practices & Caveats](https://medium.com/velotio-perspectives/web-scraping-introduction-best-practices-caveats-9cbf4acc8d0f)

- [Web Scraping Job Postings from Indeed.com using Selenium](https://towardsdatascience.com/web-scraping-job-postings-from-indeed-com-using-selenium-5ae58d155daf)


- [How I Use Selenium to Automate the Web With Python. Pt1 - John Watson Rooney](https://www.youtube.com/watch?v=pUUhvJvs-R4)