# Data Miner
## Get raw data from YouTube.com - scraping

**Import packages**
 - **youtube-dl** is an open-source download manager for video and audio from YouTube and over 1000 other video hosting websites. It is released under the Unlicense.

**Source**: <a href="https://yt-dl.org/">youtube_dl</a>

In [22]:
import numpy as np
import pandas as pd

import youtube_dl

## Add a query list
The list below holds the search terms.

youtube_dl runs one search per term. The metadata it returns for each video, including its URL, is JSON-like.

In [2]:
# search list
queries = ["machine+learning", "data+science", "kaggle"]

In [3]:
# defensive option: skip videos that are not available instead of stopping --> {"ignoreerrors": True}
ydl = youtube_dl.YoutubeDL({"ignoreerrors": True})      # downloader object, reused for every query

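Loop through the queries and extract each YouTube search page with the ydl.extract_info function. The parsed data comes back in a JSON-like structure, with the videos listed under **entries**.

The search term is then added to each item under **entries**, because the parsed data does not record which query produced it.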

In [24]:
result = []        # all parsed video entries, across every query

# get the search pages
for query in queries:

    # use extract_info from the youtube_dl package to parse the data
    # the search prefix sets how many videos to fetch --> ytsearchdate1000
    # (meaning: up to 1000 results, newest videos first)
    # r: result object variable

    # second line of defence, in case a query still fails
    # sample error: ERROR: This live event will begin in 40 hours.
    try:
        r = ydl.extract_info("ytsearchdate1000:{}".format(query), download=False)

        # tag each entry with its query; the data parsed by youtube_dl doesn't contain the search term
        for entry in r['entries']:
            if entry is not None:                 # unavailable videos show up as None
                entry['query'] = query            # add the query to the parsed data
        result += r['entries']                    # the parsed videos are stored as a list under 'entries'
    except Exception:                             # unlike a bare except, still allows a manual interrupt
        pass                                      # skip this query and move on to the next one

[download] Downloading playlist: machine+learning
[youtube:search:date] query "machine+learning": Downloading page 1
[youtube:search:date] query "machine+learning": Downloading page 2
[youtube:search:date] query "machine+learning": Downloading page 3
[youtube:search:date] query "machine+learning": Downloading page 4
[youtube:search:date] query "machine+learning": Downloading page 5
[youtube:search:date] query "machine+learning": Downloading page 6
[youtube:search:date] query "machine+learning": Downloading page 7
[youtube:search:date] query "machine+learning": Downloading page 8
[youtube:search:date] query "machine+learning": Downloading page 9
[youtube:search:date] query "machine+learning": Downloading page 10
[youtube:search:date] query "machine+learning": Downloading page 11
[youtube:search:date] query "machine+learning": Downloading page 12
[youtube:search:date] query "machine+learning": Downloading page 13
[youtube:search:date] query "machine+learning": Downloading page 14
[youtub

ERROR: This live event will begin in 4 days.


[download] Downloading video 41 of 552
[youtube] qYhTdcYhhk8: Downloading webpage
[youtube] qYhTdcYhhk8: Downloading MPD manifest
[download] Downloading video 42 of 552
[youtube] Sxvn9UxpZiw: Downloading webpage
[youtube] Sxvn9UxpZiw: Downloading MPD manifest
[download] Downloading video 43 of 552
[youtube] UVYZoqxfCcQ: Downloading webpage
[download] Downloading video 44 of 552
[youtube] MGSqIuhu6Dg: Downloading webpage
[download] Downloading video 45 of 552
[youtube] rIiGBasgisE: Downloading webpage
[youtube] rIiGBasgisE: Downloading MPD manifest
[download] Downloading video 46 of 552
[youtube] bk0PYjKrcX0: Downloading webpage
[youtube] bk0PYjKrcX0: Downloading MPD manifest
[download] Downloading video 47 of 552
[youtube] iw7kyLuz75k: Downloading webpage
[youtube] iw7kyLuz75k: Downloading MPD manifest
[download] Downloading video 48 of 552
[youtube] ZrUnjr0WR0U: Downloading webpage
[youtube] ZrUnjr0WR0U: Downloading MPD manifest
[download] Downloading video 49 of 552
[youtube] HDcctfM

ERROR: Premieres in 13 hours


[download] Downloading video 3 of 465
[youtube] 2Gf2pxB24Ic: Downloading webpage
[youtube] 2Gf2pxB24Ic: Downloading MPD manifest
[download] Downloading video 4 of 465
[youtube] Zhk-SqyMc8Y: Downloading webpage
[youtube] Zhk-SqyMc8Y: Downloading MPD manifest
[download] Downloading video 5 of 465
[youtube] pptMjkVU4Wc: Downloading webpage
[youtube] pptMjkVU4Wc: Downloading MPD manifest
[download] Downloading video 6 of 465
[youtube] s9EB7d0gQKI: Downloading webpage
[youtube] s9EB7d0gQKI: Downloading MPD manifest
[download] Downloading video 7 of 465
[youtube] -AQHiML410I: Downloading webpage
[youtube] -AQHiML410I: Downloading MPD manifest
[download] Downloading video 8 of 465
[youtube] qCCABwOz-Xk: Downloading webpage
[youtube] qCCABwOz-Xk: Downloading MPD manifest
[download] Downloading video 9 of 465
[youtube] LfrzboD-O7I: Downloading webpage
[download] Downloading video 10 of 465
[youtube] nrc-n98pF2w: Downloading webpage
[youtube] nrc-n98pF2w: Downloading MPD manifest
[download] Down
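
The loop above can be tried offline. What follows is a minimal sketch rather than the notebook's own code: `collect_entries` and `fake_extract` are hypothetical names, and `fake_extract` stands in for `ydl.extract_info(..., download=False)` so that nothing is fetched from YouTube. It assumes, as in the cell above, that unavailable videos appear as `None` inside `entries`.

```python
def collect_entries(queries, extract, max_results=10):
    """Run one date-ordered search per query and tag each entry with its query.

    `extract` is any callable that takes a search string and returns a dict
    with an 'entries' list (hypothetical stand-in for ydl.extract_info).
    """
    collected = []
    for query in queries:
        try:
            info = extract("ytsearchdate{}:{}".format(max_results, query))
        except Exception:
            continue                      # skip a query whose search fails outright
        if not info:
            continue                      # nothing came back for this query
        for entry in info.get("entries") or []:
            if entry is None:             # placeholder for an unavailable video
                continue
            entry["query"] = query        # remember which search found the video
            collected.append(entry)
    return collected


def fake_extract(search):
    """Hypothetical extractor used only for this demonstration."""
    query = search.split(":", 1)[1]
    if query == "broken":
        raise RuntimeError("simulated extractor failure")
    return {"entries": [{"id": query + "-1", "title": "first"}, None]}


rows = collect_entries(["machine+learning", "broken", "kaggle"], fake_extract)
print([(row["id"], row["query"]) for row in rows])
```

Passing the extractor in as an argument keeps the tagging logic testable without network access; in the notebook the same role is played by the `ydl` object.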

In this first pass over the raw data, duplicated videos are parsed, for example when the same video matches more than one query.

The duplicates will be cleaned up further down the data pipeline, so they are not a problem at this stage.
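
One possible way to handle the duplicates later is to drop repeated video ids with pandas. This is only a sketch under assumptions: the toy frame below is invented for illustration, and it assumes the `id` column identifies a video uniquely, which the notebook does not state.

```python
import pandas as pd

# Invented sample: video "a1" was returned by two different queries
raw = pd.DataFrame({
    "id": ["a1", "b2", "a1", "c3"],
    "title": ["Intro to ML", "Kaggle tips", "Intro to ML", "Data science 101"],
    "query": ["machine+learning", "kaggle", "data+science", "data+science"],
})

# Keep the first occurrence of each video id and rebuild a clean 0..n-1 index
deduped = raw.drop_duplicates(subset="id", keep="first").reset_index(drop=True)

print(len(raw), "rows before,", len(deduped), "rows after")
```

Note that `keep="first"` also keeps only the first query that found the video; whether that is acceptable depends on how the `query` column is used downstream.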


In [25]:
# defensive filter: drop the None placeholders left by videos that are not available
result = [e for e in result if e is not None]

print('Num of videos parsed: ', len(result))

Num of videos parsed:  1566


In [39]:
# convert the data parsed from YouTube into a pandas DataFrame
# youtube_dl returns well-organized data, so there is no need to scrape the YouTube pages manually
df = pd.DataFrame(result)            # result: list of dicts, one per parsed video

pd.set_option('display.max_columns', 70)

df.head(2)


Unnamed: 0,id,title,formats,thumbnails,description,upload_date,uploader,uploader_id,uploader_url,channel_id,channel_url,duration,view_count,average_rating,age_limit,webpage_url,categories,tags,is_live,like_count,dislike_count,channel,extractor,webpage_url_basename,extractor_key,n_entries,playlist,playlist_id,playlist_title,playlist_uploader,playlist_uploader_id,playlist_index,thumbnail,display_id,requested_subtitles,requested_formats,format,format_id,width,height,resolution,fps,vcodec,vbr,stretched_ratio,acodec,abr,ext,entries,automatic_captions,subtitles,location,chapters,track,artist,creator,alt_title,url,manifest_url,tbr,protocol,preference,http_headers,license,album,series,season_number,episode_number,season,episode
0,ZojgQYeLS6c,ML Lecture [Improving Machine Learning Models ...,"[{'format_id': '139', 'manifest_url': 'https:/...","[{'height': 94, 'url': 'https://i.ytimg.com/vi...",Raw lecture recording,20210505,P.S.R Patnaik,weekendtutor,http://www.youtube.com/user/weekendtutor,UCyg6fASREeRNCnxBKaxmrXg,https://www.youtube.com/channel/UCyg6fASREeRNC...,2303.0,0,0.0,0,https://www.youtube.com/watch?v=ZojgQYeLS6c,[Education],[],,0.0,0.0,P.S.R Patnaik,youtube,ZojgQYeLS6c,Youtube,552,machine+learning,machine+learning,,,,1,https://i.ytimg.com/vi_webp/ZojgQYeLS6c/maxres...,ZojgQYeLS6c,,"({'format_id': '247', 'manifest_url': 'https:/...",247 - 1280x720 (DASH video)+140 - audio only (...,247+140,1280,720,,24.0,vp9,,,mp4a.40.2,129.473,webm,machine+learning,,,,,,,,,,,,,,,,,,,,,
1,Xr-akaCxSCY,"#2 Learn spatial data science, remote sensing,...","[{'format_id': '139', 'manifest_url': 'https:/...","[{'height': 94, 'url': 'https://i.ytimg.com/vi...",Learn more: https://spatialelearning.com,20210505,Spatial eLearning,UCzWimsVHHZFG1uMUYgKizqg,http://www.youtube.com/channel/UCzWimsVHHZFG1u...,UCzWimsVHHZFG1uMUYgKizqg,https://www.youtube.com/channel/UCzWimsVHHZFG1...,31.0,1,0.0,0,https://www.youtube.com/watch?v=Xr-akaCxSCY,[Entertainment],[],,0.0,0.0,Spatial eLearning,youtube,Xr-akaCxSCY,Youtube,552,machine+learning,machine+learning,,,,2,https://i.ytimg.com/vi_webp/Xr-akaCxSCY/maxres...,Xr-akaCxSCY,,"({'format_id': '248', 'manifest_url': 'https:/...",248 - 1920x1032 (DASH video)+251 - audio only ...,248+251,1920,1032,,30.0,vp9,,,opus,136.235,webm,machine+learning,,,,,,,,,,,,,,,,,,,,,


In [31]:
# feature engineering - a little bit!

# convert the upload_date column (values such as 20210505) to a date format
df['upload_date'] = pd.to_datetime(df['upload_date'], format='%Y%m%d')

# to be done later!
# add new feature - delta time: days since video publication
#df['time_since_pub'] = (pd.to_datetime("2021-05-31") - df['upload_date']) / np.timedelta64(1, 'D')


pd.set_option('display.max_columns', 70)


df.head(2)

Unnamed: 0,id,title,formats,thumbnails,description,upload_date,uploader,uploader_id,uploader_url,channel_id,channel_url,duration,view_count,average_rating,age_limit,webpage_url,categories,tags,is_live,like_count,dislike_count,channel,extractor,webpage_url_basename,extractor_key,n_entries,playlist,playlist_id,playlist_title,playlist_uploader,playlist_uploader_id,playlist_index,thumbnail,display_id,requested_subtitles,requested_formats,format,format_id,width,height,resolution,fps,vcodec,vbr,stretched_ratio,acodec,abr,ext,entries,automatic_captions,subtitles,location,chapters,track,artist,creator,alt_title,url,manifest_url,tbr,protocol,preference,http_headers,license,album,series,season_number,episode_number,season,episode
0,ZojgQYeLS6c,ML Lecture [Improving Machine Learning Models ...,"[{'format_id': '139', 'manifest_url': 'https:/...","[{'height': 94, 'url': 'https://i.ytimg.com/vi...",Raw lecture recording,2021-05-05,P.S.R Patnaik,weekendtutor,http://www.youtube.com/user/weekendtutor,UCyg6fASREeRNCnxBKaxmrXg,https://www.youtube.com/channel/UCyg6fASREeRNC...,2303.0,0,0.0,0,https://www.youtube.com/watch?v=ZojgQYeLS6c,[Education],[],,0.0,0.0,P.S.R Patnaik,youtube,ZojgQYeLS6c,Youtube,552,machine+learning,machine+learning,,,,1,https://i.ytimg.com/vi_webp/ZojgQYeLS6c/maxres...,ZojgQYeLS6c,,"({'format_id': '247', 'manifest_url': 'https:/...",247 - 1280x720 (DASH video)+140 - audio only (...,247+140,1280,720,,24.0,vp9,,,mp4a.40.2,129.473,webm,machine+learning,,,,,,,,,,,,,,,,,,,,,
1,Xr-akaCxSCY,"#2 Learn spatial data science, remote sensing,...","[{'format_id': '139', 'manifest_url': 'https:/...","[{'height': 94, 'url': 'https://i.ytimg.com/vi...",Learn more: https://spatialelearning.com,2021-05-05,Spatial eLearning,UCzWimsVHHZFG1uMUYgKizqg,http://www.youtube.com/channel/UCzWimsVHHZFG1u...,UCzWimsVHHZFG1uMUYgKizqg,https://www.youtube.com/channel/UCzWimsVHHZFG1...,31.0,1,0.0,0,https://www.youtube.com/watch?v=Xr-akaCxSCY,[Entertainment],[],,0.0,0.0,Spatial eLearning,youtube,Xr-akaCxSCY,Youtube,552,machine+learning,machine+learning,,,,2,https://i.ytimg.com/vi_webp/Xr-akaCxSCY/maxres...,Xr-akaCxSCY,,"({'format_id': '248', 'manifest_url': 'https:/...",248 - 1920x1032 (DASH video)+251 - audio only ...,248+251,1920,1032,,30.0,vp9,,,opus,136.235,webm,machine+learning,,,,,,,,,,,,,,,,,,,,,
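
The postponed `time_since_pub` feature could be computed along the following lines. This is a sketch on invented dates, using the reference date 2021-05-31 from the commented-out line; `df_demo` and `reference_date` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Invented upload dates in the YYYYMMDD form shown in the table above
df_demo = pd.DataFrame({"upload_date": ["20210505", "20210530"]})

# Parse the strings with an explicit format so the conversion is unambiguous
df_demo["upload_date"] = pd.to_datetime(df_demo["upload_date"], format="%Y%m%d")

# Days elapsed between publication and a fixed reference date
reference_date = pd.Timestamp("2021-05-31")
df_demo["time_since_pub"] = (reference_date - df_demo["upload_date"]).dt.days

print(df_demo)
```

A fixed reference date keeps the feature reproducible between runs; using the scraping date instead would be an equally reasonable choice.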


## Save dataframe using feather
Feather is a fast, lightweight, and easy-to-use binary file format for storing data frames.


In [40]:
# keep only the desired features
#columns = ['title', 'upload_date', 'view_count', 'time_since_pub']
columns = ['title', 'upload_date', 'view_count']                      # features to train the model

# note: NaN values may cause the conversion to the Feather format to fail
df[columns].to_feather('./raw_data/raw_data.feather')


# also save as a CSV, to be used in the labelling step
df[columns].to_csv('./raw_data/raw_data_without_labels.csv')
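
Before exporting, it can help to check the selected columns for missing values and to make sure the index is the default one. The helper below is a hypothetical sketch (`prepare_for_export` is not part of the notebook), and it rests on the assumption that `to_feather` expects a default 0..n-1 index, which has been the documented requirement in pandas but may differ between versions.

```python
import pandas as pd


def prepare_for_export(frame, columns):
    """Return the selected columns with a default index, plus null counts."""
    subset = frame[columns].reset_index(drop=True)   # default 0..n-1 index
    missing = subset.isna().sum()                    # missing values per column
    return subset, missing


# Invented sample with a gap in the index and one missing title
sample = pd.DataFrame(
    {
        "title": ["Intro to ML", None, "Kaggle tips"],
        "upload_date": pd.to_datetime(["2021-05-05", "2021-05-04", "2019-01-22"]),
        "view_count": [0, 12, 3568],
        "extractor": ["youtube", "youtube", "youtube"],
    },
    index=[0, 2, 5],
)

subset, missing = prepare_for_export(sample, ["title", "upload_date", "view_count"])
print(missing)
```

The helper only reports missing values; whether to drop or fill them is left to the later pipeline steps.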


In [30]:
print('Data miner - parsed raw data completed!')

Data miner - parsed raw data completed!


## Quick comparison between Feather format and CSV

In [35]:
# %time df[columns].to_feather('./raw_data/test_feather_format.feather')

# %time df[columns].to_csv('./raw_data/test_csv_format.csv')

Wall time: 38 ms
Wall time: 126 ms
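
Outside a notebook, where `%time` is not available, a similar comparison could be sketched with `time.perf_counter`. The helper name `time_write` and the generated data are illustrative assumptions; the Feather write additionally assumes that `pyarrow` is installed and is skipped otherwise. Timings vary by machine and data size, so no particular numbers should be expected.

```python
import tempfile
import time
from pathlib import Path

import pandas as pd


def time_write(write, path):
    """Return the wall-clock seconds taken by write(path)."""
    start = time.perf_counter()
    write(path)
    return time.perf_counter() - start


# Invented data with the same three columns the notebook saves
n_rows = 1000
df_timing = pd.DataFrame({
    "title": ["video {}".format(i) for i in range(n_rows)],
    "upload_date": pd.to_datetime(["2021-05-05"] * n_rows),
    "view_count": list(range(n_rows)),
})

timings = {}
with tempfile.TemporaryDirectory() as tmp:
    timings["csv"] = time_write(df_timing.to_csv, Path(tmp) / "test.csv")
    try:
        timings["feather"] = time_write(df_timing.to_feather, Path(tmp) / "test.feather")
    except ImportError:
        pass                      # pyarrow is not installed; skip the Feather timing

for name, seconds in timings.items():
    print("{}: {:.1f} ms".format(name, seconds * 1000))
```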



In [36]:
# df_= pd.read_feather('./raw_data/raw_data.feather')

In [37]:
# just checking the raw data
# df_

Unnamed: 0,title,upload_date,view_count
0,ML Lecture [Improving Machine Learning Models ...,2021-05-05,0
1,"#2 Learn spatial data science, remote sensing,...",2021-05-05,1
2,"#6 Learn spatial data science, remote sensing,...",2021-05-05,1
3,"#5 Learn spatial data science, remote sensing,...",2021-05-05,0
4,Machine Learning Modeling in Breast Cancer Pro...,2021-05-05,0
...,...,...,...
1561,Kaggle Winner Gradient Boosting Explained in 2...,2019-01-27,5405
1562,How to use Kaggle Kernel for Kaggle Contests?,2019-01-22,3568
1563,Python Toolkit that Used for Two Kaggle Top 10...,2019-01-14,8164
1564,Kaggle Reading Group: Bidirectional Encoder Re...,2019-01-09,8945
