# Initial Data Scrape Workbook

***

### We ideally want to scrape a dataset consisting of
- video thumbnails
- titles
- views
- parent channel

### YouTube API
https://developers.google.com/youtube/v3/getting-started#quota

Get credentials from the Google developer console: https://console.developers.google.com/apis/dashboard?project=red-means-go
- Look for "YouTube Data API v3" in the Library tab and make sure it is enabled.
- Select Credentials and create an API key.

The API has a daily limit of 10,000 "units" worth of requests.
- Different operations cost different numbers of units, so we need to be careful about which data we request.
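
To make the budget concrete, here is a rough estimate of what one run of a search-then-fetch pipeline might cost. The unit costs (100 per `search.list` call, 1 per `videos.list` call) are what the quota documentation listed as far as I know and should be re-checked; the function name and the example counts are illustrative assumptions.

```python
# Assumed unit costs; verify against the current quota documentation.
SEARCH_LIST_COST = 100
VIDEOS_LIST_COST = 1
DAILY_QUOTA = 10_000


def estimate_quota(n_channels, videos_per_channel):
    """Estimate units for one channel search, one video search per channel,
    and one videos.list call per video."""
    search_calls = 1 + n_channels
    video_calls = n_channels * videos_per_channel
    return search_calls * SEARCH_LIST_COST + video_calls * VIDEOS_LIST_COST


cost = estimate_quota(n_channels=10, videos_per_channel=10)
print(cost, "units per run;", DAILY_QUOTA // cost, "full runs per day")
```

Under these assumptions the search calls dominate the bill, so reducing the number of searches matters far more than reducing the number of per-video lookups.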

We can fetch data more efficiently by requesting gzip-compressed responses, which the API supports.
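
As a minimal sketch of what requesting gzip involves: the API docs ask for an `Accept-Encoding: gzip` header plus a user agent that contains the string `gzip`. The helper name and the application name below are hypothetical, and the payload is made up purely to illustrate the size difference locally. The google-api-python-client library is built on httplib2, which to my understanding already negotiates gzip on its own, so this mostly matters for hand-rolled HTTP requests.

```python
import gzip
import json


def gzip_request_headers(app_name="thumbnail-scraper"):
    """Headers asking the server for a gzip-compressed response.

    app_name is a hypothetical application name.
    """
    return {
        "Accept-Encoding": "gzip",
        "User-Agent": f"{app_name} (gzip)",
    }


# Local illustration of the saving on a repetitive JSON payload (made-up data).
payload = json.dumps({"items": [{"kind": "youtube#video"}] * 200}).encode("utf-8")
compressed = gzip.compress(payload)
print(len(payload), "bytes raw ->", len(compressed), "bytes gzipped")
```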

In [1]:
# Run once
!pip install --upgrade google-api-python-client
!pip install --upgrade google-auth-oauthlib google-auth-httplib2
!pip install --upgrade google-api-core

Requirement already up-to-date: google-api-python-client in c:\users\kutor\appdata\local\programs\python\python36\lib\site-packages (1.8.0)
Requirement already up-to-date: google-auth-oauthlib in c:\users\kutor\appdata\local\programs\python\python36\lib\site-packages (0.4.1)
Requirement already up-to-date: google-auth-httplib2 in c:\users\kutor\appdata\local\programs\python\python36\lib\site-packages (0.0.3)
Requirement already up-to-date: google-api-core in c:\users\kutor\appdata\local\programs\python\python36\lib\site-packages (1.16.0)


### Desired scraping code
- config files that identify which categories of videos to scrape
- a lower bound on how popular a video must be to be included
    - which measurement works for this? Subscriptions to yearly average view count, in relation to the number of videos uploaded?
- a possible inversion option in the config to fetch the least popular videos instead(?)
- output to data/out/
    - /thumbs -- a folder full of thumbnails with identifying labels (possibly gzip compressed?)
    - videos.csv -- a .csv containing metadata on the videos that correspond to the thumbnails in the above folder.
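
One way the config and the popularity bound could fit together is sketched below. Everything here is hypothetical: the `ScrapeConfig` fields, the default threshold, and the choice of raw view count as the popularity measure, which is still an open question above.

```python
from dataclasses import dataclass, field


@dataclass
class ScrapeConfig:
    """Hypothetical config; in practice this would be loaded from a file."""
    categories: list = field(default_factory=lambda: ["gaming"])
    min_views: int = 100_000   # assumed lower bound on popularity
    invert: bool = False       # True keeps the least popular videos instead
    out_dir: str = "data/out"


def filter_by_popularity(videos, config):
    """Keep videos at or above min_views, or below it when invert is set."""
    if config.invert:
        return [v for v in videos if v["views"] < config.min_views]
    return [v for v in videos if v["views"] >= config.min_views]


videos = [{"title": "a", "views": 250_000}, {"title": "b", "views": 900}]
print(filter_by_popularity(videos, ScrapeConfig()))
print(filter_by_popularity(videos, ScrapeConfig(invert=True)))
```

Keeping the inversion as a single boolean means the same pipeline can build both the popular and the unpopular dataset without any other changes.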

### Possible search parameters
- Safesearch
    - none
    - moderate
    - strict
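
The three levels above map onto the `safeSearch` parameter of `search.list`. A small sketch of validating the value before building the request parameters follows; the helper name and its defaults are assumptions, not part of the scraper yet.

```python
SAFE_SEARCH_LEVELS = ("none", "moderate", "strict")


def build_search_params(query, safe_search="moderate", max_results=10):
    """Build keyword arguments for a video search, validating safeSearch."""
    if safe_search not in SAFE_SEARCH_LEVELS:
        raise ValueError(
            f"safe_search must be one of {SAFE_SEARCH_LEVELS}, got {safe_search!r}"
        )
    return {
        "part": "snippet",
        "q": query,
        "type": "video",
        "maxResults": max_results,
        "safeSearch": safe_search,
    }


params = build_search_params("gaming", safe_search="strict")
print(params)
```

The resulting dict would be passed along as `youtube.search().list(**params)`.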

***

# Code

In [1]:
import os
import json
import pandas as pd
import time

import google_auth_oauthlib.flow
import googleapiclient.discovery
import googleapiclient.errors

In [2]:
def youtube_request():
    """Search for channels matching the query and return the raw API response."""
    youtube = googleapiclient.discovery.build(
        api_service_name, api_version, developerKey=api_key)

    # Search parameters
    request = youtube.search().list(
        part="snippet",
        q="gaming",
        maxResults=10,
        type="channel"
    )
    response = request.execute()

    return response

def get_channel_videos(channel_id):
    """Return the most recent videos of a channel as a raw search response."""
    youtube = googleapiclient.discovery.build(
        api_service_name, api_version, developerKey=api_key)

    request = youtube.search().list(
        part="snippet",
        maxResults=10,
        type="video",
        channelId=channel_id,
        order="date"
    )
    response = request.execute()
    return response

def request_video_details(video_id):
    youtube = googleapiclient.discovery.build(
        api_service_name, api_version, developerKey=api_key)
    # note that this uses youtube.videos instead of youtube.search
    request = youtube.videos().list(
        part="snippet,contentDetails,statistics",
        id=video_id
    )
    response = request.execute()
    return response

def get_vid_stats(vid):
    """Flatten one videos.list item into a row matching the dataframe columns."""
    snippet = vid['snippet']
    statistics = vid['statistics']
    thumbnails = snippet['thumbnails']
    # not every video has a maxres thumbnail, so fall back to high
    thumbnail_link = (thumbnails.get('maxres') or thumbnails['high'])['url']
    # counts can be absent from the response (likes hidden, comments disabled,
    # dislikes no longer public), so use .get to avoid a KeyError
    stats = [
        snippet['channelId'],
        snippet['channelTitle'],
        thumbnail_link,
        snippet['title'],
        snippet['publishedAt'],
        statistics.get('viewCount'),
        statistics.get('likeCount'),
        statistics.get('dislikeCount'),
        statistics.get('commentCount'),
    ]
    return stats

# full pipeline for scraping and generating dataframe
def main_pipeline():
    metadata = []
    # get initial search results (usually going to be a list of channels)
    out = youtube_request()
    data = out['items']
    # get channel ids from search results
    channel_ids = []
    for channel in data:
        cur_channel_id = channel['snippet']['channelId']
        channel_ids.append(cur_channel_id)
        # get channel videos from the current channel id (we can also choose to handpick channels / videos)
        videos = get_channel_videos(cur_channel_id)
        video_ids = []
        for video in videos['items']:
            cur_id = video['id']['videoId']
            video_ids.append(cur_id)
            # use youtube videos api to get metadata about a single video, by video id
            details = request_video_details(cur_id)['items']
            if not details:
                continue  # nothing returned for this id (e.g. removed or private video)
            row = get_vid_stats(details[0])
            metadata.append(row)
        time.sleep(1)

    # create a dataframe from the gathered metadata
    columns = ['channelId','channelTitle','thumbnailLink',
               'videoTitle','Date','Views',
               'Likes','Dislikes','Comments']
    df = pd.DataFrame(metadata,columns=columns)
    return df
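
The pipeline above makes one `videos.list` call per video. Since the `id` parameter accepts a comma-separated list (up to 50 IDs per call, to my understanding), the lookups could be batched. A minimal sketch, assuming a `youtube` client object built as in the functions above; the helper names are hypothetical.

```python
def chunked(ids, size=50):
    """Yield consecutive slices of ids with at most `size` elements each."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def request_video_details_batched(youtube, video_ids, batch_size=50):
    """Fetch details for many videos using one videos.list call per batch."""
    items = []
    for batch in chunked(video_ids, batch_size):
        response = youtube.videos().list(
            part="snippet,contentDetails,statistics",
            id=",".join(batch),
        ).execute()
        # videos that were removed or are private are simply missing from items
        items.extend(response.get("items", []))
    return items
```

Each returned item has the same shape that `get_vid_stats` expects, so the rows can be built with `[get_vid_stats(item) for item in items]`.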

# Data Setup
We will probably move this to a config file at some point.

In [3]:
# scopes is only relevant to an OAuth flow; the API-key requests in this notebook do not use it
scopes = ["https://www.googleapis.com/auth/youtube.force-ssl"]
with open('../api_key.json') as json_file:
    cred = json.load(json_file)
api_key = cred['api_key']
# Disable OAuthlib's HTTPS verification when running locally.
# *DO NOT* leave this option enabled in production.
os.environ["OAUTHLIB_INSECURE_TRANSPORT"] = "1"
api_service_name = "youtube"
api_version = "v3"
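
A possible shape for that config file move is sketched below. The file name `config.json`, the `YOUTUBE_API_KEY` environment variable, and the function name are all hypothetical; the only keys assumed are the ones already used in this notebook.

```python
import json
import os


def load_settings(path="../config.json", env_var="YOUTUBE_API_KEY"):
    """Load settings from a JSON file, letting an environment variable
    override the API key so the key does not have to live on disk."""
    settings = {"api_service_name": "youtube", "api_version": "v3"}
    if os.path.exists(path):
        with open(path) as config_file:
            settings.update(json.load(config_file))
    if os.environ.get(env_var):
        settings["api_key"] = os.environ[env_var]
    if "api_key" not in settings:
        raise KeyError(f"no api_key found in {path} or ${env_var}")
    return settings
```

With this in place the setup cell would shrink to `settings = load_settings()` followed by reading `settings["api_key"]`.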

# Full run

In [4]:
# Run this to scrape data and generate a dataset. BE CAREFUL: THIS USES MANY API CALLS.
meta_df = main_pipeline()

# Individual Steps
This is the full run broken down into individual steps, for development and clarity.

In [None]:
# get initial search data (usually a list of channels)
out = youtube_request()
data = out['items']

In [25]:
# get channel ids from search results
channel_ids = []
for channel in data:
    cur_channel_id = channel['snippet']['channelId']
    channel_ids.append(cur_channel_id)

In [None]:
# get channel videos from a specific channel id
cur_channel_id = channel_ids[1] # arbitrary channel id for demo purposes
videos = get_channel_videos(cur_channel_id)
video_ids = []
for video in videos['items']:
    video_ids.append(video['id']['videoId'])
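
The step above only sees the first page of results. Search responses carry a `nextPageToken` when more results exist, which can be passed back as `pageToken`. A sketch of following it, assuming a `youtube` client built as in `get_channel_videos`; the function name and the `max_pages` cap are assumptions, and every extra page is another `search.list` call against the quota.

```python
def collect_video_ids(youtube, channel_id, max_pages=3):
    """Collect video ids for a channel, following nextPageToken up to max_pages."""
    video_ids = []
    page_token = None
    for _ in range(max_pages):
        params = {
            "part": "snippet",
            "maxResults": 50,
            "type": "video",
            "channelId": channel_id,
            "order": "date",
        }
        if page_token:
            params["pageToken"] = page_token
        response = youtube.search().list(**params).execute()
        video_ids.extend(item["id"]["videoId"] for item in response.get("items", []))
        page_token = response.get("nextPageToken")
        if not page_token:
            break  # no further pages
    return video_ids
```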

In [92]:
metadata = []
# use youtube videos api to get metadata about a single video, by video id
for cur_id in video_ids:
    cur_vid = request_video_details(cur_id)['items'][0]
    row = get_vid_stats(cur_vid)
    metadata.append(row)

In [95]:
# create a dataframe from the gathered metadata
columns = ['channelId','channelTitle','thumbnailLink','videoTitle','Date','Views','Likes','Dislikes','Comments']
df = pd.DataFrame(metadata,columns=columns)

In [113]:
df.head()

| | channelId | channelTitle | thumbnailLink | videoTitle | Date | Views | Likes | Dislikes | Comments |
|---|---|---|---|---|---|---|---|---|---|
| 0 | UCrkfdiZ4pF3f5waQaJtjXew | GamingWithKev | https://i.ytimg.com/vi/f19g-muiJ2M/maxresdefau... | ROBLOX WIPEOUT... | 2020-04-01T02:01:33.000Z | 203876 | 21828 | 285 | 2139 |
| 1 | UCrkfdiZ4pF3f5waQaJtjXew | GamingWithKev | https://i.ytimg.com/vi/K0hRF4u1M1M/maxresdefau... | Playing as MY CHARACTER in Roblox Bakon... | 2020-03-30T23:26:12.000Z | 858107 | 64494 | 916 | 3985 |
| 2 | UCrkfdiZ4pF3f5waQaJtjXew | GamingWithKev | https://i.ytimg.com/vi/ctqqGZOcXFA/maxresdefau... | ROBLOX PIGGY INFECTION... | 2020-03-30T00:03:40.000Z | 556425 | 32997 | 649 | 1958 |
| 3 | UCrkfdiZ4pF3f5waQaJtjXew | GamingWithKev | https://i.ytimg.com/vi/6-ePZzPsTws/maxresdefau... | My Character is in the GAME!! (Roblox Bakon Ch... | 2020-03-28T16:44:28.000Z | 1360181 | 52226 | 1129 | 2647 |
| 4 | UCrkfdiZ4pF3f5waQaJtjXew | GamingWithKev | https://i.ytimg.com/vi/OTT-8PrarMM/maxresdefau... | ROBLOX HOUSE PARTY... | 2020-03-27T22:02:31.000Z | 251918 | 9982 | 250 | 1868 |
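
With the metadata rows gathered, the planned `data/out/` layout (a `thumbs` folder plus `videos.csv`) could be written roughly as follows. This is a sketch under assumptions: it labels each thumbnail by the video id pulled out of the `/vi/<videoId>/` part of the thumbnail link, saves everything with a `.jpg` extension, and takes the download function as a parameter so it can be swapped out. The helper names are hypothetical.

```python
import csv
import os
import urllib.request


def video_id_from_thumbnail(url):
    """Extract the video id from a link shaped like https://i.ytimg.com/vi/<videoId>/<file>."""
    return url.split("/vi/")[1].split("/")[0]


def write_outputs(rows, columns, out_dir="data/out", fetch=urllib.request.urlretrieve):
    """Download each thumbnail into out_dir/thumbs and write out_dir/videos.csv.

    rows and columns are the metadata list and column names built above;
    fetch(url, destination) performs the download.
    """
    thumbs_dir = os.path.join(out_dir, "thumbs")
    os.makedirs(thumbs_dir, exist_ok=True)
    link_col = columns.index("thumbnailLink")
    csv_path = os.path.join(out_dir, "videos.csv")
    with open(csv_path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        # the extra videoId column ties each csv row to its thumbnail file
        writer.writerow(["videoId"] + list(columns))
        for row in rows:
            video_id = video_id_from_thumbnail(row[link_col])
            fetch(row[link_col], os.path.join(thumbs_dir, video_id + ".jpg"))
            writer.writerow([video_id] + list(row))
```

In this notebook it would be called as `write_outputs(metadata, columns)`.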
