# BigQuery

### About Google BigQuery

<b><font color='red'>TODO, add links to live bigquery queries</font></b>

### Setting Up Google BigQuery

* Open https://bigquery.cloud.google.com
* Sign in with a Google/Gmail account.
* Create a project. (This step can appear to hang for a while.)
    * If it does, go back to the original URL, select "enable on existing project" and choose your project.
* "Go to project settings"
* "Service Accounts"
* "Create Service Account"
* "Create"
* "Role": "BigQuery User"
* "Create key", "JSON", then download the file.
* Store the .json file in your home folder or a similar location.
* Set the `JSON_KEY` environment variable (for example in the `.env` file that the code below loads) to the file's location, and make sure the file is readable by the notebook.
* BigQuery comes with 1 TB of free queries per month; after that you have to pay ;)
* <b><font color='red'>TODO: Video tutorial</font></b>
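
Before relying on the downloaded key, it can help to confirm that the notebook process can read it and that it looks like a service-account key. The following is a minimal sketch, not part of the original setup: `check_key_file` and `EXPECTED_FIELDS` are hypothetical names, and the field list assumes the usual layout of a Google service-account key file.

```python
import json
import os

# Assumption: fields normally present in a service-account key downloaded
# via "Create key" > "JSON" as described in the steps above.
EXPECTED_FIELDS = ("type", "project_id", "private_key", "client_email")


def check_key_file(path):
    """Return a list of problems found with the key file (empty list = OK)."""
    if not os.path.isfile(path):
        return ["file not found: %s" % path]
    if not os.access(path, os.R_OK):
        return ["file is not readable by this process: %s" % path]
    try:
        with open(path) as handle:
            key = json.load(handle)
    except ValueError as error:  # json.JSONDecodeError is a ValueError
        return ["not valid JSON: %s" % error]
    if not isinstance(key, dict):
        return ["top-level JSON value is not an object"]
    # Report every expected field that is absent.
    problems = ["missing field: %s" % name
                for name in EXPECTED_FIELDS if name not in key]
    # Only check the value of "type" when the field is present at all.
    if "type" in key and key["type"] != "service_account":
        problems.append("'type' is not 'service_account'")
    return problems
```

Calling `check_key_file(os.environ['JSON_KEY'])` in the notebook and getting an empty list back suggests the file is at least readable and well-formed; it does not prove that the key has the right role.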


### Sample Projects of Interest

* [reddit_posts](https://bigquery.cloud.google.com/dataset/fh-bigquery:reddit_posts)
* [reddit_comments](https://bigquery.cloud.google.com/dataset/fh-bigquery:reddit_comments)
* [stackoverflow](https://bigquery.cloud.google.com/dataset/fh-bigquery:stackoverflow_archive)
* [githubarchive](https://bigquery.cloud.google.com/table/githubarchive:month.201809?tab=preview)
* [gdelt](https://bigquery.cloud.google.com/table/gdelt-bq:full.events)

### Async Queries

The code below runs queries synchronously with a fixed timeout.

For larger datasets, or for queries that take longer to run, the more robust approach is to run the query asynchronously, store the results in BigQuery storage, and fetch the data from there once the query has completed.
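
As a rough illustration of the asynchronous idea, the sketch below submits a query without waiting, polls the job until it completes, and only then fetches the rows. It is a simplified variant: it polls the job but does not write the results to a destination table. `run_query_async`, `poll_interval` and `max_wait` are hypothetical names, and the client is assumed to behave like a BigQuery-Python client, with `query()` returning `(job_id, results)`, `check_job()` returning `(complete, row_count)` and `get_query_rows()` returning a list of rows.

```python
import time


def run_query_async(client, query, poll_interval=5.0, max_wait=3600.0):
    """Submit `query`, poll until the job completes, then return its rows.

    Assumption: `client` follows the BigQuery-Python client interface
    (query, check_job, get_query_rows).
    """
    # Without a timeout argument the call is assumed to return right away
    # with a job id, whether or not the query has finished.
    job_id, _rows = client.query(query)
    deadline = time.monotonic() + max_wait
    while True:
        complete, _row_count = client.check_job(job_id)
        if complete:
            return client.get_query_rows(job_id)
        if time.monotonic() >= deadline:
            raise TimeoutError(
                "job %s did not finish within %s seconds" % (job_id, max_wait))
        # Wait before asking again so the API is not hammered.
        time.sleep(poll_interval)
```

With the client created further down, a call could look like `rows = run_query_async(bqp_client, reddit_query)`, followed by `pd.DataFrame(rows)` to get the same shape of result as the synchronous helper.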

In [2]:
import os

from dotenv import load_dotenv
load_dotenv()

JSON_KEY = os.environ['JSON_KEY']

In [3]:
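from bigquery import get_client  # BigQuery-Python
from bigquery.errors import BigQueryTimeoutException
import pandas as pd

# Show up to 280 characters per column so comment bodies and URLs stay readable.
pd.options.display.max_colwidth = 280


class TimeoutException(Exception):
    """Raised when a query does not finish within the given timeout."""


# JSON_KEY is defined in the previous cell.
bqp_client = get_client(json_key_file=JSON_KEY, readonly=False)


def perform_query(query, timeout=500):
    """Run a query on Google BigQuery and return the result set as a DataFrame.

    `timeout` is the number of seconds to wait for the query to complete.
    """
    try:
        job_id, _results = bqp_client.query(query, timeout=timeout)
        return pd.DataFrame(bqp_client.get_query_rows(job_id))
    except BigQueryTimeoutException as error:
        raise TimeoutException(
            "query did not complete within %d seconds" % timeout) from error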

## Reddit Comments

<b><font color='red'>TODO</font></b>

In [4]:
reddit_query = """
SELECT body, link_id
FROM [fh-bigquery:reddit_comments.2018_07]
WHERE subreddit_id = 't5_2qh1e'
AND (body CONTAINS 'youtu.be' OR body CONTAINS 'youtube.com')
ORDER BY link_id;
"""
# There is no LIMIT here, and adding one would not reduce the amount of data
# BigQuery scans (and bills), so restrain yourself ;)

reddit_results = perform_query(reddit_query)
reddit_results.head(10)

| | body | link_id |
|---|---|---|
| 0 | And 7 years later in 1969, Neil Armstrong became the first man on the moon. This is 49 years ago a came across this fitting video tribute to this historical occasion at https://youtu.be/Vc-_xBC5sYk | t3_7wptw2 |
| 1 | I dunno... [Mr. Marvin Gaye...](https://www.youtube.com/watch?v=QRvVzaQ6i8A) killed it | t3_7ymeia |
| 2 | How does it compare to this one? https://www.youtube.com/watch?v=4cRlqE3D3RQ\nHe guesses kinda similar to the east, but doesn't focus on a particular language. | t3_8551k1 |
| 3 | Compared to this guy, he does.\nhttps://youtu.be/tIdIqbv7SPo?t=43s | t3_86ak1o |
| 4 | dangerous for democracy \n\n[Review](http://www.youtube.com/c/BestProductReviewCom) | t3_88ll08 |
| 5 | Hey hey! Just wanted to let you know Episode 3 is now posted! https://www.youtube.com/watch?v=7T8--yhYyjw&amp;t=2s | t3_89vpne |
| 6 | Hey, thanks so much! Weird to hear about the audio... seems good on our end but we will totally check it out. Also, we just released Ep 3 so feel free to watch it here! https://www.youtube.com/watch?v=7T8--yhYyjw&amp;t=2s | t3_89vpne |
| 7 | &gt; https://youtu.be/YF8I-PotoK8\n\nShit, I forgot about this video. Dirty, dirty, DIRTY. Yet the heads turning make it so good. :P\n | t3_8lnm0o |
| 8 | [Oh these killers](https://www.youtube.com/watch?v=VBkV-2YgF8I&amp;feature=youtu.be) | t3_8mox83 |
| 9 | Is this a good method? https://www.youtube.com/watch?v=LT_dFRnmdGs | t3_8o1n0l |
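
Since the query selects comments that mention YouTube, a natural follow-up is to pull the video ids out of each `body`. The sketch below is one possible way to do that with a regular expression. `VIDEO_ID` and `extract_video_ids` are hypothetical names, and the pattern assumes 11-character ids and only recognises the `youtu.be/<id>` and `youtube.com/watch?v=<id>` link forms, so channel links and other URL shapes are ignored.

```python
import re

# Assumption: video ids are 11 characters drawn from [A-Za-z0-9_-], and
# "v=" is the first query parameter in youtube.com/watch links.
VIDEO_ID = re.compile(
    r"(?:youtu\.be/|youtube\.com/watch\?v=)([A-Za-z0-9_-]{11})"
)


def extract_video_ids(body):
    """Return all YouTube video ids found in a comment body, in order."""
    return VIDEO_ID.findall(body)
```

Applied to the result set, this could be used as `reddit_results['body'].apply(extract_video_ids)`, which yields one list of ids per comment.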


## Github Archive

<b><font color='red'>TODO</font></b>

In [5]:
github_query = """
SELECT
  actor.login,
  actor.id,
  repo.name,
  repo.id
FROM [githubarchive:day.20160901]
WHERE
  type = "WatchEvent"
LIMIT 50;
"""

github_results = perform_query(github_query)
github_results.head(10)

| | actor_id | actor_login | repo_id | repo_name |
|---|---|---|---|---|
| 0 | 3979617 | AtrusHB | 9606400 | Vibex/qt |
| 1 | 21375706 | hxsir | 67090888 | hxsir/hy |
| 2 | 4192763 | powerLambda | 10379607 | buger/gor |
| 3 | 2041398 | jiangplus | 62916792 | di/divspl |
| 4 | 756988 | sunglim | 140656 | jonas/tig |
| 5 | 6362625 | srsudar | 10214538 | koush/ion |
| 6 | 128857 | marshallswain | 22266795 | ljharb/qs |
| 7 | 1136652 | jperl | 54331831 | tj/d3-dot |
| 8 | 1269815 | oblitum | 66104684 | Ogeon/flow |
| 9 | 19735849 | canering | 11766338 | bendoh/env |


## GDELT

<b><font color='red'>TODO</font></b>

In [6]:
gdelt_query = """
SELECT SQLDATE, Actor1Code, Actor1Name, Actor2Code, Actor2Name, AvgTone, SOURCEURL
FROM [gdelt-bq:gdeltv2.events]
WHERE MonthYear = 201809 AND Actor1Code = 'IRLGOV'
LIMIT 1000
"""

gdelt_results = perform_query(gdelt_query)
gdelt_results.head(20)

| | Actor1Code | Actor1Name | Actor2Code | Actor2Name | AvgTone | SOURCEURL | SQLDATE |
|---|---|---|---|---|---|---|---|
| 0 | IRLGOV | IRELAND | IRL | IRELAND | -1.724138 | https://www.independent.ie/irish-news/politics/i-will-support-my-government-colleague-catherine-byrne-backs-housing-minister-as-government-defeats-motion-of-no-confidence-37354456.html | 20180925 |
| 1 | IRLGOV | IRELAND | IRL | IRELAND | -1.724138 | https://www.independent.ie/irish-news/politics/i-will-support-my-government-colleague-catherine-byrne-backs-housing-minister-as-government-defeats-motion-of-no-confidence-37354456.html | 20180925 |
| 2 | IRLGOV | IRELAND | | | -1.576577 | http://www.itv.com/news/2018-09-25/labour-to-support-motion-of-no-confidence-in-irelands-housing-minister/ | 20180925 |
| 3 | IRLGOV | IRISH | GBR | UNITED KINGDOM | 0.234192 | https://www.irishtimes.com/news/ireland/irish-news/who-are-the-six-candidates-vying-for-the-presidency-1.3640934 | 20180925 |
| 4 | IRLGOV | IRELAND | GBR | UNITED KINGDOM | 0.234192 | https://www.irishtimes.com/news/ireland/irish-news/who-are-the-six-candidates-vying-for-the-presidency-1.3640934 | 20180925 |
| 5 | IRLGOV | IRELAND | IGOEUREEC | EUROPEAN UNION | -0.606061 | https://www.reuters.com/article/uk-britain-eu-ireland/brexit-deal-ratification-not-a-given-irish-pm-idUSKCN1M5224 | 20180925 |
| 6 | IRLGOV | IRELAND | IRL | IRELAND | -0.928722 | http://www.borehamwoodtimes.co.uk/news/national/16901639.labour-to-support-motion-of-no-confidence-in-irelands-housing-minister/ | 20180925 |
| 7 | IRLGOV | IRELAND | IGOEUREEC | EUROPEAN UNION | -0.606061 | https://www.reuters.com/article/uk-britain-eu-ireland/brexit-deal-ratification-not-a-given-irish-pm-idUSKCN1M5224 | 20180925 |
| 8 | IRLGOV | IRELAND | IRL | IRISH | 0.234192 | https://www.irishtimes.com/news/ireland/irish-news/who-are-the-six-candidates-vying-for-the-presidency-1.3640934 | 20180925 |
| 9 | IRLGOV | IRELAND | GBR | UNITED KINGDOM | 0.234192 | https://www.irishtimes.com/news/ireland/irish-news/who-are-the-six-candidates-vying-for-the-presidency-1.3640934 | 20180925 |
