# Initial Data Scrape Workbook

***

### We ideally want to scrape a dataset consisting of
- video thumbnails
- titles
- views
- parent channel
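
The four fields above could be pinned down as a small record type. This is only a sketch: `VideoRecord` and `record_from_item` are hypothetical names, the input is assumed to be one item from a `videos.list` response requested with `part="snippet,statistics"`, and the `medium` thumbnail size and the fallback of 0 views are arbitrary choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VideoRecord:
    """One row of the dataset (hypothetical layout)."""
    video_id: str
    title: str
    view_count: int
    channel_id: str
    channel_title: str
    thumbnail_url: str


def record_from_item(item: dict, thumb_size: str = "medium") -> VideoRecord:
    """Build a record from one videos.list item (part="snippet,statistics")."""
    snippet = item["snippet"]
    # The API returns counts as strings; a missing viewCount is treated as 0
    # here, which is an assumption rather than documented behaviour.
    views = int(item.get("statistics", {}).get("viewCount", 0))
    return VideoRecord(
        video_id=item["id"],
        title=snippet["title"],
        view_count=views,
        channel_id=snippet["channelId"],
        channel_title=snippet.get("channelTitle", ""),
        thumbnail_url=snippet["thumbnails"][thumb_size]["url"],
    )
```

Search results alone do not carry view counts, which is why the sketch reads from a `videos.list` item rather than a `search.list` item.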

### YouTube API
https://developers.google.com/youtube/v3/getting-started#quota

Get credentials from the Google API console: https://console.developers.google.com/apis/dashboard?project=red-means-go
- In the Library tab, find "YouTube Data API v3" and make sure it is enabled.
- Open Credentials and create an API key.

The API enforces a daily limit of 10,000 "units" worth of requests.
- Different operations cost different numbers of units, so we need to be careful about which data we request.
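
To get a feel for the budget, here is a rough back-of-the-envelope sketch. The per-call costs (100 units for `search.list`, 1 unit for `videos.list`) and the 50-result page cap are my reading of the quota documentation linked above and should be rechecked there. The function names are hypothetical.

```python
DAILY_QUOTA_UNITS = 10_000   # daily limit noted above
SEARCH_LIST_COST = 100       # assumed units per search.list call
VIDEOS_LIST_COST = 1         # assumed units per videos.list call
MAX_RESULTS_PER_PAGE = 50    # cap on results per search page / IDs per videos.list call


def quota_cost(search_pages: int, videos_calls: int) -> int:
    """Total units spent on a mix of search.list and videos.list calls."""
    return search_pages * SEARCH_LIST_COST + videos_calls * VIDEOS_LIST_COST


def max_search_pages(budget: int = DAILY_QUOTA_UNITS) -> int:
    """Largest number of search pages that fits the budget when every page
    is followed by one videos.list call for the IDs on that page."""
    return budget // (SEARCH_LIST_COST + VIDEOS_LIST_COST)


pages = max_search_pages()
print(pages, pages * MAX_RESULTS_PER_PAGE, quota_cost(pages, pages))
# prints: 99 4950 9999
```

Under these assumptions searching dominates the cost, so a day's quota yields at most a few thousand videos if every one is found through `search.list`.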

We can fetch data more efficiently by requesting gzip-compressed responses, which the API supports.
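
As I read the getting-started guide linked above, gzip is requested by sending an `Accept-Encoding: gzip` header together with a `User-Agent` that contains the string `gzip`. The sketch below only builds those headers and decodes a compressed body offline. `gzip_headers` and `decode_body` are hypothetical helpers, the payload is made up, and the client library's own HTTP transport may already handle compression, so this matters mainly for raw HTTP requests.

```python
import gzip
import json


def gzip_headers(app_name: str) -> dict:
    """Headers that ask the server for a gzip-compressed response."""
    return {
        "Accept-Encoding": "gzip",
        "User-Agent": f"{app_name} (gzip)",
    }


def decode_body(raw: bytes, content_encoding: str = "") -> dict:
    """Decompress the body if it is gzip-encoded, then parse it as JSON."""
    if content_encoding.lower() == "gzip":
        raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))


# Offline illustration with a made-up, repetitive payload.
payload = json.dumps({"items": [{"title": "example"}] * 200}).encode("utf-8")
compressed = gzip.compress(payload)
print(len(payload), len(compressed))
```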

In [2]:
# Run once
!pip install --upgrade google-api-python-client
!pip install --upgrade google-auth-oauthlib google-auth-httplib2

Collecting google-api-python-client
  Downloading https://files.pythonhosted.org/packages/9a/b4/a955f393b838bc47cbb6ae4643b9d0f90333d3b4db4dc1e819f36aad18cc/google_api_python_client-1.8.0-py3-none-any.whl (57kB)
Collecting google-api-core<2dev,>=1.13.0 (from google-api-python-client)
  Downloading https://files.pythonhosted.org/packages/63/7e/a523169b0cc9ce62d56e07571db927286a94b1a5f51ac220bd97db825c77/google_api_core-1.16.0-py2.py3-none-any.whl (70kB)
Collecting httplib2<1dev,>=0.9.2 (from google-api-python-client)
  Downloading https://files.pythonhosted.org/packages/8e/4b/025a7338bb2d4a2c61f0e530b79aafc29d112ed8e61333a6dd9ba48f3bab/httplib2-0.17.0-py3-none-any.whl (95kB)
Collecting google-auth-httplib2>=0.0.3 (from google-api-python-client)
  Downloa

In [4]:
!pip install --upgrade google-api-core

Requirement already up-to-date: google-api-core in /opt/anaconda3/lib/python3.7/site-packages (1.16.0)


### Desired scraping code
- Config files that identify which categories of videos to scrape.
- A lower bound on video popularity.
    - Which measurement works for this? Subscriber count relative to yearly average view count, in relation to the number of videos uploaded?
- Possibly a config option that inverts the filter and collects the least popular videos instead(?)
- Output to data/out/
    - /thumbs: a folder full of thumbnails with identifying labels (possibly gzip compressed?)
    - videos.csv: a .csv containing metadata on the videos that correspond to the thumbnails in the folder above.
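
The wishlist above could look roughly like this. Everything here is a sketch under assumptions: `ScrapeConfig`, `select_videos`, and `write_outputs` are hypothetical names, raw view count stands in for the still-undecided popularity measure, and thumbnails are assumed to be saved as `<video_id>.jpg`.

```python
import csv
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ScrapeConfig:
    categories: list            # search terms or category names to scrape
    min_views: int = 0          # popularity threshold (view count as a stand-in)
    invert: bool = False        # True keeps videos below the threshold instead
    out_dir: str = "data/out"


def select_videos(videos: list, config: ScrapeConfig) -> list:
    """Keep videos on the configured side of the popularity threshold."""
    if config.invert:
        return [v for v in videos if v["views"] < config.min_views]
    return [v for v in videos if v["views"] >= config.min_views]


def write_outputs(videos: list, config: ScrapeConfig) -> Path:
    """Create out_dir/thumbs and write out_dir/videos.csv; return the CSV path."""
    out = Path(config.out_dir)
    (out / "thumbs").mkdir(parents=True, exist_ok=True)
    csv_path = out / "videos.csv"
    fields = ["video_id", "title", "views", "channel", "thumb_file"]
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        # extrasaction="ignore" drops any keys that are not CSV columns.
        writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
        writer.writeheader()
        for v in videos:
            # Link each row to its (assumed) thumbnail file name.
            writer.writerow(dict(v, thumb_file=f"thumbs/{v['video_id']}.jpg"))
    return csv_path
```

Downloading the thumbnails themselves is left out; the CSV only records where each file is expected to live.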

### Possible search parameters
- `safeSearch`
    - none
    - moderate
    - strict
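
A small helper can keep these options honest before any quota is spent. This is a sketch, not the project's code: `build_search_params` is a hypothetical name, and restricting results with `type="video"` and defaulting to 50 results per page are assumptions about what the scraper will want.

```python
SAFE_SEARCH_LEVELS = ("none", "moderate", "strict")


def build_search_params(query: str, safe_search: str = "moderate",
                        max_results: int = 50) -> dict:
    """Return keyword arguments for youtube.search().list(**params)."""
    if safe_search not in SAFE_SEARCH_LEVELS:
        raise ValueError(f"safe_search must be one of {SAFE_SEARCH_LEVELS}")
    if not 0 <= max_results <= 50:
        raise ValueError("max_results must be between 0 and 50")
    return {
        "part": "snippet",
        "q": query,
        "type": "video",          # assumption: skip channel and playlist results
        "safeSearch": safe_search,
        "maxResults": max_results,
    }


params = build_search_params("gaming", safe_search="strict")
print(params)
```

With a real client this would be used as `youtube.search().list(**params)`.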

***

# Code

In [11]:
import os
import json


import google_auth_oauthlib.flow
import googleapiclient.discovery
import googleapiclient.errors

In [12]:
# Despite the .txt extension, the key file is expected to hold JSON: {"api_key": "..."}
with open('../api_key.txt') as json_file:
    cred = json.load(json_file)
api_key = cred['api_key']

def main():
    api_service_name = "youtube"
    api_version = "v3"

    # An API key is enough for public data such as search results,
    # so no OAuth scopes or OAuthlib transport override are needed here.
    youtube = googleapiclient.discovery.build(
        api_service_name, api_version, developerKey=api_key)

    # Edit the search parameters here.
    request = youtube.search().list(
        part="snippet",
        q="gaming"
    )
    response = request.execute()

    print(response)

if __name__ == "__main__":
    main()

{'kind': 'youtube#searchListResponse', 'etag': '"ksCrgYQhtFrXgbHAhi9Fo5t0C2I/GhGrG0yRnu6JD3Hd1QW_qBbM1wc"', 'nextPageToken': 'CAUQAA', 'regionCode': 'US', 'pageInfo': {'totalResults': 1000000, 'resultsPerPage': 5}, 'items': [{'kind': 'youtube#searchResult', 'etag': '"ksCrgYQhtFrXgbHAhi9Fo5t0C2I/e62kFdcHSPkM3JEvN3WoYe_KG2Q"', 'id': {'kind': 'youtube#channel', 'channelId': 'UCrkfdiZ4pF3f5waQaJtjXew'}, 'snippet': {'publishedAt': '2015-05-08T04:25:58.000Z', 'channelId': 'UCrkfdiZ4pF3f5waQaJtjXew', 'title': 'GamingWithKev', 'description': 'Comedy gaming commentary, pranks, vlogs ...I do it ALL baby!', 'thumbnails': {'default': {'url': 'https://yt3.ggpht.com/-oPzTfN-wwIQ/AAAAAAAAAAI/AAAAAAAAAAA/JaP9MeJeVrA/s88-c-k-no-mo-rj-c0xffffff/photo.jpg'}, 'medium': {'url': 'https://yt3.ggpht.com/-oPzTfN-wwIQ/AAAAAAAAAAI/AAAAAAAAAAA/JaP9MeJeVrA/s240-c-k-no-mo-rj-c0xffffff/photo.jpg'}, 'high': {'url': 'https://yt3.ggpht.com/-oPzTfN-wwIQ/AAAAAAAAAAI/AAAAAAAAAAA/JaP9MeJeVrA/s800-c-k-no-mo-rj-c0xffffff/pho
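
Note that the first result in the output above is a channel (`'kind': 'youtube#channel'`), not a video, so results need filtering before use. A minimal sketch of that step, with hypothetical helper names and a made-up video ID in the sample data:

```python
def video_ids_from_search(response: dict) -> list:
    """Collect video IDs from a search.list response, skipping other kinds."""
    ids = []
    for item in response.get("items", []):
        item_id = item.get("id", {})
        if item_id.get("kind") == "youtube#video":
            ids.append(item_id["videoId"])
    return ids


def batched(ids: list, size: int = 50) -> list:
    """Split IDs into chunks small enough for one videos.list call each."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


# Sample shaped like the response above; "VIDEO_ID_1" is a placeholder.
sample = {
    "items": [
        {"id": {"kind": "youtube#channel", "channelId": "UCrkfdiZ4pF3f5waQaJtjXew"}},
        {"id": {"kind": "youtube#video", "videoId": "VIDEO_ID_1"}},
    ]
}
print(video_ids_from_search(sample))
# prints: ['VIDEO_ID_1']
```

Each batch could then be passed to `youtube.videos().list(part="snippet,statistics", id=",".join(batch))` to fetch view counts, which the search response does not include.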

In [1]:
# import pandas as pd
# import numpy as np

# import os
# import google_auth_oauthlib.flow
# import googleapiclient.discovery
# import googleapiclient.errors

In [6]:
# # Input credentials:
# from getpass import getpass
# api_key = getpass()



It is not yet clear how the commented-out code below is supposed to work. The `search.list` reference should help work it out:

https://developers.google.com/youtube/v3/docs/search/list?apix_params=%7B%22part%22%3A%22gaming%22%2C%22maxResults%22%3A10%7D

In [None]:
# scopes = ["https://www.googleapis.com/auth/youtube.readonly"]

# def execute_request(in_request):
#     # Disable OAuthlib's HTTPS verification when running locally.
#     # *DO NOT* leave this option enabled in production.
#     os.environ["OAUTHLIB_INSECURE_TRANSPORT"] = "1"

#     api_service_name = "youtube"
#     api_version = "v3"
#     client_secrets_file = "YOUR_CLIENT_SECRET_FILE.json"

#     # Get credentials and create an API client
#     flow = google_auth_oauthlib.flow.InstalledAppFlow.from_client_secrets_file(
#         client_secrets_file, scopes)
#     credentials = flow.run_console()
#     youtube = googleapiclient.discovery.build(
#         api_service_name, api_version, credentials=credentials)

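#     # NOTE: in_request is a string of Python code that refers to the local
#     # name `youtube`; eval() will run whatever it contains, so only pass trusted input.
#     request = eval(in_request)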
#     response = request.execute()

#     return response