# Initial Data Scrape Workbook

***

### We ideally want to scrape a dataset consisting of
- video thumbnails
- titles
- views
- parent channel

### YouTube API
https://developers.google.com/youtube/v3/getting-started#quota

Get credentials by going to https://console.developers.google.com/apis/dashboard?project=red-means-go
- Look for "YouTube Data API v3" in the library tab and make sure it's enabled.
- Select Credentials and create an API key.

The default daily quota is 10,000 "units" worth of requests.
- Different operations have different costs (a `search.list` call costs 100 units, for example), so we need to be careful about what data we request.
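As a rough budgeting sketch of the quota note above: the quota cost table lists `search.list` at 100 units per call and `videos.list` at 1 unit per call, and one search page returns at most 50 results. The helper names and constants below are illustrative assumptions, not part of the API; check the current cost table before relying on them.

```python
# Illustrative constants; verify against the current quota cost table.
DAILY_QUOTA_UNITS = 10_000
SEARCH_LIST_COST = 100      # units per search.list call
VIDEOS_LIST_COST = 1        # units per videos.list call
MAX_RESULTS_PER_PAGE = 50   # upper bound for maxResults on search.list


def max_search_calls(daily_quota=DAILY_QUOTA_UNITS, reserve_units=0):
    """Number of search.list calls that fit in the quota after a reserve."""
    usable = max(daily_quota - reserve_units, 0)
    return usable // SEARCH_LIST_COST


def max_videos_per_day(daily_quota=DAILY_QUOTA_UNITS):
    """Upper bound on videos per day if each search page is followed by
    one videos.list call to fetch details for that page."""
    cost_per_page = SEARCH_LIST_COST + VIDEOS_LIST_COST
    pages = daily_quota // cost_per_page
    return pages * MAX_RESULTS_PER_PAGE


print(max_search_calls())
print(max_videos_per_day())
```

Under these assumptions the search calls dominate the budget, so requesting full pages of results per search is the cheapest way to grow the dataset.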

We can reduce bandwidth by asking for gzip-compressed responses: set the `Accept-Encoding: gzip` header and include the string `gzip` in the user agent.
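A minimal sketch of what a gzip-enabled raw request could look like, using only the standard library. The function names and the user-agent string are made up for illustration, and the Python client library may already negotiate compression through its HTTP layer, so this only shows the mechanics.

```python
import gzip
import json
import urllib.parse
import urllib.request


def build_gzip_request(url, params):
    """Build a GET request that asks the server for a gzip-encoded body."""
    query = urllib.parse.urlencode(params)
    return urllib.request.Request(
        f"{url}?{query}",
        headers={
            "Accept-Encoding": "gzip",
            # Hypothetical user agent; it only needs to contain "gzip".
            "User-Agent": "thumbnail-scraper (gzip)",
        },
    )


def decode_body(raw, content_encoding):
    """Decompress the body if the server gzip-encoded it, then parse JSON."""
    if content_encoding == "gzip":
        raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))
```

With a real key added to `params`, the request would be sent with `urllib.request.urlopen(req)` and the body passed to `decode_body` together with the response's `Content-Encoding` header.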

In [1]:
# Run once
!pip install --upgrade google-api-python-client
!pip install --upgrade google-auth-oauthlib google-auth-httplib2
!pip install --upgrade google-api-core

Requirement already up-to-date: google-api-python-client in c:\users\kutor\appdata\local\programs\python\python36\lib\site-packages (1.8.0)
Requirement already up-to-date: google-auth-oauthlib in c:\users\kutor\appdata\local\programs\python\python36\lib\site-packages (0.4.1)
Requirement already up-to-date: google-auth-httplib2 in c:\users\kutor\appdata\local\programs\python\python36\lib\site-packages (0.0.3)
Requirement already up-to-date: google-api-core in c:\users\kutor\appdata\local\programs\python\python36\lib\site-packages (1.16.0)


### Desired scraping code
- config files to identify what categories of videos to scrape
- what level of popularity to lower bound our videos to
    - which measurement works for this? Subscriber count versus yearly average view count, relative to the number of videos uploaded?
- possible inversion config option to fetch the least popular videos instead (?)
- output to data/out/
    - /thumbs -- a folder full of thumbnails with identifying labels (possibly gzip compressed?)
    - videos.csv -- a .csv containing metadata on the videos that correspond to the thumbnails in the above folder.
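One possible shape for the config described above, as a sketch only. The field names (`categories`, `min_views`, `invert`, `out_dir`) are assumptions, and raw view count stands in for the popularity measure, which is still an open question.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ScrapeConfig:
    # All field names here are hypothetical placeholders.
    categories: list = field(default_factory=lambda: ["gaming"])
    min_views: int = 0        # lower bound on popularity (stand-in metric)
    invert: bool = False      # if True, keep the least popular videos instead
    out_dir: str = "data/out"

    @property
    def thumbs_dir(self):
        """Folder that holds the downloaded thumbnails."""
        return Path(self.out_dir) / "thumbs"

    @property
    def csv_path(self):
        """CSV with metadata for the videos behind those thumbnails."""
        return Path(self.out_dir) / "videos.csv"


def load_config(text):
    """Parse a JSON config string into a ScrapeConfig."""
    return ScrapeConfig(**json.loads(text))


def keep_video(view_count, config):
    """Apply the popularity bound, flipped when the inversion option is set."""
    popular = view_count >= config.min_views
    return not popular if config.invert else popular
```

Keeping the inversion as a single boolean means the same threshold can serve both the "most popular" and "least popular" runs.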

### Possible search parameters
- `safeSearch`
    - none
    - moderate
    - strict
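A small sketch of how the safe-search level could be validated before it reaches the API. `build_search_params` is a hypothetical helper; `safeSearch`, `type`, and `maxResults` are real `search.list` parameters, and `moderate` is the documented default level.

```python
SAFE_SEARCH_LEVELS = ("none", "moderate", "strict")


def build_search_params(query, safe_search="moderate", max_results=5):
    """Return keyword arguments for search().list with a checked safeSearch."""
    if safe_search not in SAFE_SEARCH_LEVELS:
        raise ValueError(
            f"safe_search must be one of {SAFE_SEARCH_LEVELS}, got {safe_search!r}"
        )
    return {
        "part": "snippet",
        "q": query,
        "type": "video",  # restrict results to videos
        "safeSearch": safe_search,
        "maxResults": max_results,
    }
```

Once a `youtube` client has been built, the dictionary can be unpacked into the call, for example `youtube.search().list(**build_search_params("gaming", "strict"))`.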

***

# Code

In [2]:
import os
import json
import pandas as pd

import google_auth_oauthlib.flow
import googleapiclient.discovery
import googleapiclient.errors

In [4]:
# Scopes only matter for OAuth flows; they are unused when authenticating with an API key.
scopes = ["https://www.googleapis.com/auth/youtube.force-ssl"]

# Load the API key from a local JSON file kept outside the repository.
with open('../api_key.json') as json_file:
    cred = json.load(json_file)
api_key = cred['api_key']

def youtube_request():
    # No OAuth settings (such as OAUTHLIB_INSECURE_TRANSPORT) are needed here:
    # an API key is enough for public, read-only search requests.
    api_service_name = "youtube"
    api_version = "v3"

    youtube = googleapiclient.discovery.build(
        api_service_name, api_version, developerKey=api_key)

    # Search parameters
    request = youtube.search().list(
        part="snippet",
        q="gaming",
        maxResults=5  # integer, not a string
    )
    response = request.execute()

    return response

out = youtube_request()

In [5]:
out["items"][2]

{'kind': 'youtube#searchResult',
 'etag': '"ksCrgYQhtFrXgbHAhi9Fo5t0C2I/UFDkLRMWWMx-GJUXeScVGDcdtMY"',
 'id': {'kind': 'youtube#video', 'videoId': 'eFUAGZUPJbY'},
 'snippet': {'publishedAt': '2020-04-01T19:55:50.000Z',
  'channelId': 'UCWVuy4NPohItH9-Gr7e8wqw',
  'title': 'Apple iPad Pro 2020 Unboxing - Best Tablet for Gaming? (Fortnite, PUBG, Call of Duty Mobile)',
  'description': 'Unboxing review of new Apple iPad Pro 2020 tablets (12.9 inch 4th generation & 10 inch), Pencil and keyboard. iPhone 11 phone style dual camera. 120 FPS ...',
  'thumbnails': {'default': {'url': 'https://i.ytimg.com/vi/eFUAGZUPJbY/default.jpg',
    'width': 120,
    'height': 90},
   'medium': {'url': 'https://i.ytimg.com/vi/eFUAGZUPJbY/mqdefault.jpg',
    'width': 320,
    'height': 180},
   'high': {'url': 'https://i.ytimg.com/vi/eFUAGZUPJbY/hqdefault.jpg',
    'width': 480,
    'height': 360}},
  'channelTitle': 'TheRelaxingEnd',
  'liveBroadcastContent': 'none'}}
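The search result above already carries most of the wanted fields (thumbnail URL, title, parent channel) but no view count; that would need a follow-up `videos.list` call with `part="statistics"`. A minimal sketch of flattening one result into a row for `videos.csv` follows. The column names and helper functions are assumptions, not a fixed schema.

```python
import csv
import io

# Hypothetical column layout for videos.csv.
CSV_FIELDS = ["video_id", "title", "channel_id", "channel_title",
              "published_at", "thumbnail_url"]


def search_item_to_row(item, thumb_size="high"):
    """Flatten one search result item into a flat dict of CSV columns."""
    snippet = item["snippet"]
    return {
        # Channel and playlist results have no videoId, so this may be None.
        "video_id": item["id"].get("videoId"),
        "title": snippet["title"],
        "channel_id": snippet["channelId"],
        "channel_title": snippet["channelTitle"],
        "published_at": snippet["publishedAt"],
        "thumbnail_url": snippet["thumbnails"][thumb_size]["url"],
    }


def rows_to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=CSV_FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

With the response from the cell above, `rows_to_csv([search_item_to_row(i) for i in out["items"]])` would give the text to write to `data/out/videos.csv`, and each `thumbnail_url` could then be downloaded into the thumbnails folder under a name based on `video_id`.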