# BigQuery Public Data - Hacker News Dataset
- Hacker News is a social news website focusing on computer science and entrepreneurship. The dataset contains all stories and comments from Hacker News since its launch in 2006. Each story record includes a story ID, the author who made the post, the time it was written, and the number of points the story received.

### Authenticating with a Google BigQuery service account key file

Before connecting to the Jupyter server, run the following in a terminal window:

1. export GOOGLE_APPLICATION_CREDENTIALS="/your/file/path/[FILE_NAME].json"
2. pip install --upgrade google-cloud-bigquery
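
Before starting Jupyter, it can help to confirm that the environment variable from step 1 is set and points at an existing file. The following is a minimal sketch using only the standard library; the helper name `check_credentials_path` is hypothetical and is not part of any Google library.

```python
import os
from pathlib import Path


def check_credentials_path(env=None):
    """Return the key file path if GOOGLE_APPLICATION_CREDENTIALS names an existing file.

    Raises RuntimeError otherwise. `env` defaults to os.environ; a plain dict
    can be passed instead, which makes the check easy to try out.
    """
    env = os.environ if env is None else env
    value = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not value:
        raise RuntimeError("GOOGLE_APPLICATION_CREDENTIALS is not set")
    path = Path(value).expanduser()
    if not path.is_file():
        raise RuntimeError(f"Key file not found: {path}")
    return path
```

Calling `check_credentials_path()` with no arguments inspects the current process environment, so it should be run in the same shell session (or notebook kernel) that will create the BigQuery client.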

### Set up BigQuery client & load Google Cloud BigQuery extension

In [6]:
from google.cloud import bigquery
client = bigquery.Client()
%load_ext google.cloud.bigquery

The google.cloud.bigquery extension is already loaded. To reload it, use:
  %reload_ext google.cloud.bigquery


### Preliminary Exploration

#### What tables does the hacker_news dataset have?

In [4]:
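# Build a reference to the public hacker_news dataset and fetch its metadata
data_ref = bigquery.DatasetReference("bigquery-public-data", "hacker_news")
data = client.get_dataset(data_ref)

# List the tables in the dataset and print each table ID
tables = list(client.list_tables(data))
for table in tables:
    print(table.table_id)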

comments
full
full_201510
stories


#### Sneak peek of the comments table

In [6]:
table_ref = data_ref.table("comments")
table = client.get_table(table_ref)
client.list_rows(table, max_results=5).to_dataframe()

| | id | by | author | time | time_ts | text | parent | deleted | dead | ranking |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2701393 | 5l | 5l | 1309184881 | 2011-06-27 14:28:01+00:00 | And the glazier who fixed all the broken windo... | 2701243 | | | 0 |
| 1 | 5811403 | 99 | 99 | 1370234048 | 2013-06-03 04:34:08+00:00 | Does canada have the equivalent of H1B/Green c... | 5804452 | | | 0 |
| 2 | 21623 | AF | AF | 1178992400 | 2007-05-12 17:53:20+00:00 | Speaking of Rails, there are other options in ... | 21611 | | | 0 |
| 3 | 10159727 | EA | EA | 1441206574 | 2015-09-02 15:09:34+00:00 | Humans and large livestock (and maybe even pet... | 10159396 | | | 0 |
| 4 | 2988424 | Iv | Iv | 1315853580 | 2011-09-12 18:53:00+00:00 | I must say I reacted in the same way when I re... | 2988179 | | | 0 |
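
The `time` column holds Unix epoch seconds, and `time_ts` holds the same instant as a UTC timestamp. As a quick check, the `time` value from the first row of the preview can be converted with the standard library. The helper name `unix_to_utc` is illustrative only.

```python
from datetime import datetime, timezone


def unix_to_utc(seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# `time` value taken from the first row of the comments preview
print(unix_to_utc(1309184881).isoformat(sep=" "))
# 2011-06-27 14:28:01+00:00
```

The printed value matches the `time_ts` entry on that row, so either column can be used for date filtering.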


#### Sneak peek of the stories table

In [8]:
table_ref = data_ref.table("stories")
table = client.get_table(table_ref)
client.list_rows(table, max_results=5).to_dataframe()

| | id | by | score | time | time_ts | title | url | text | deleted | dead | descendants | author |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6940813 | sarath237 | 0 | 1387536270 | 2013-12-20 10:44:30+00:00 | Sheryl Brindo Hot Pics | http://www.youtube.com/watch?v=ym1cyxneB0Y | Sheryl Brindo Hot Pics | | True | | sarath237 |
| 1 | 6991401 | 123123321321 | 0 | 1388508751 | 2013-12-31 16:52:31+00:00 | Are you people also put off by the culture of ... | | They&#x27;re pretty explicitly &#x27;startup f... | | True | | 123123321321 |
| 2 | 1531556 | ssn | 0 | 1279617234 | 2010-07-20 09:13:54+00:00 | New UI for Google Image Search | http://googlesystem.blogspot.com/2010/07/googl... | Again following on Bing's lead. | | | 0.0 | ssn |
| 3 | 5012398 | hoju | 0 | 1357387877 | 2013-01-05 12:11:17+00:00 | Historic website screenshots | http://webscraping.com/blog/Generate-website-s... | Python script to generate historic screenshots... | | | 0.0 | hoju |
| 4 | 7214182 | kogir | 0 | 1401561740 | 2014-05-31 18:42:20+00:00 | Placeholder | | Mind the gap. | | | 0.0 | kogir |


### BigQuery SQL

#### Review table schema

In [16]:
# comments table schema
client.get_table(data_ref.table("comments")).schema

[SchemaField('id', 'INTEGER', 'NULLABLE', 'Unique comment ID', ()),
 SchemaField('by', 'STRING', 'NULLABLE', 'Username of commenter', ()),
 SchemaField('author', 'STRING', 'NULLABLE', 'Username of author', ()),
 SchemaField('time', 'INTEGER', 'NULLABLE', 'Unix time', ()),
 SchemaField('time_ts', 'TIMESTAMP', 'NULLABLE', 'Human readable time in UTC (format: YYYY-MM-DD hh:mm:ss)', ()),
 SchemaField('text', 'STRING', 'NULLABLE', 'Comment text', ()),
 SchemaField('parent', 'INTEGER', 'NULLABLE', 'Parent comment ID', ()),
 SchemaField('deleted', 'BOOLEAN', 'NULLABLE', 'Is deleted?', ()),
 SchemaField('dead', 'BOOLEAN', 'NULLABLE', 'Is dead?', ()),
 SchemaField('ranking', 'INTEGER', 'NULLABLE', 'Comment ranking', ())]

In [17]:
# stories table schema
client.get_table(data_ref.table("stories")).schema

[SchemaField('id', 'INTEGER', 'NULLABLE', 'Unique story ID', ()),
 SchemaField('by', 'STRING', 'NULLABLE', 'Username of submitter', ()),
 SchemaField('score', 'INTEGER', 'NULLABLE', 'Story score', ()),
 SchemaField('time', 'INTEGER', 'NULLABLE', 'Unix time', ()),
 SchemaField('time_ts', 'TIMESTAMP', 'NULLABLE', 'Human readable time in UTC (format: YYYY-MM-DD hh:mm:ss)', ()),
 SchemaField('title', 'STRING', 'NULLABLE', 'Story title', ()),
 SchemaField('url', 'STRING', 'NULLABLE', 'Story url', ()),
 SchemaField('text', 'STRING', 'NULLABLE', 'Story text', ()),
 SchemaField('deleted', 'BOOLEAN', 'NULLABLE', 'Is deleted?', ()),
 SchemaField('dead', 'BOOLEAN', 'NULLABLE', 'Is dead?', ()),
 SchemaField('descendants', 'INTEGER', 'NULLABLE', 'Number of story descendants', ()),
 SchemaField('author', 'STRING', 'NULLABLE', 'Username of author', ())]
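
The raw `SchemaField` list is hard to scan. One possible way to print it as aligned columns is sketched below. It relies only on the `name`, `field_type`, and `description` attributes of each field. The helper `summarize_schema` and the `Field` stand-in are hypothetical, and the stand-in is there only so the sketch runs without BigQuery credentials.

```python
from collections import namedtuple


def summarize_schema(schema):
    """Return one aligned 'name  TYPE  description' line per schema field."""
    width = max((len(f.name) for f in schema), default=0)
    return [f"{f.name:<{width}}  {f.field_type:<9}  {f.description}" for f in schema]


# Stand-in for bigquery.SchemaField (assumption: same attribute names).
Field = namedtuple("Field", "name field_type mode description")
demo = [
    Field("id", "INTEGER", "NULLABLE", "Unique story ID"),
    Field("time_ts", "TIMESTAMP", "NULLABLE",
          "Human readable time in UTC (format: YYYY-MM-DD hh:mm:ss)"),
]

for line in summarize_schema(demo):
    print(line)
```

In the notebook, the same helper could be applied to a real schema, for example `summarize_schema(client.get_table(data_ref.table("stories")).schema)`.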

#### Pull information from the stories and comments tables to create a table showing all stories posted on January 1, 2012, with the corresponding number of comments

In [19]:
query = """
        -- Count comments per parent ID (direct replies only)
        WITH comments_count AS
        (
        SELECT parent, COUNT(*) AS num_comments
        FROM `bigquery-public-data.hacker_news.comments`
        GROUP BY parent
        )
        -- LEFT JOIN keeps stories with no comments (num_comments is NULL)
        SELECT s.id AS StoryID, s.title AS Title, c.num_comments
        FROM `bigquery-public-data.hacker_news.stories` s
        LEFT JOIN comments_count c
            ON s.id = c.parent
        WHERE EXTRACT(DATE FROM s.time_ts) = '2012-01-01'
        ORDER BY c.num_comments DESC
        """

result = client.query(query).result().to_dataframe()
result.head()

| | StoryID | Title | num_comments |
|---|---|---|---|
| 0 | 3412900 | Ask HN: Who is Hiring? (January 2012) | 154.0 |
| 1 | 3412901 | Ask HN: Freelancer? Seeking freelancer? (Janua... | 97.0 |
| 2 | 3412643 | Avoid Apress | 30.0 |
| 3 | 3412891 | There's no shame in code that is simply "good ... | 27.0 |
| 4 | 3414012 | Impress.js - a Prezi like implementation using... | 27.0 |
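
The counts appear as floats (`154.0`) most likely because the `LEFT JOIN` leaves `num_comments` as NULL for stories with no matching comments, and pandas stores a column with missing values as floating point. The same query shape can be tried on toy data with the standard-library `sqlite3` module. This is a sketch under stated assumptions: the rows are invented, SQLite's dialect differs from BigQuery's, and the date filter is omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stories (id INTEGER, title TEXT);
    CREATE TABLE comments (id INTEGER, parent INTEGER);
    INSERT INTO stories VALUES (1, 'Story with replies'), (2, 'Story without replies');
    -- Comments 10 and 11 reply to story 1; comment 12 replies to comment 10.
    INSERT INTO comments VALUES (10, 1), (11, 1), (12, 10);
""")

rows = conn.execute("""
    WITH comments_count AS (
        SELECT parent, COUNT(*) AS num_comments
        FROM comments
        GROUP BY parent
    )
    SELECT s.id, s.title, c.num_comments
    FROM stories s
    LEFT JOIN comments_count c ON s.id = c.parent
    ORDER BY s.id
""").fetchall()

for row in rows:
    print(row)
# (1, 'Story with replies', 2)
# (2, 'Story without replies', None)
```

Two things show up in the toy result. Story 2 gets `None` rather than 0, and comment 12 is not counted for story 1 because its parent is another comment, so the count covers direct replies only. If integer counts are preferred, wrapping the column as `IFNULL(c.num_comments, 0)` in the BigQuery query is one option.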


#### Obtain all usernames corresponding to users who wrote stories or comments on January 24, 2014

In [22]:
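query = """
        -- Usernames of commenters on the target date
        SELECT c.by
        FROM `bigquery-public-data.hacker_news.comments` c
        WHERE EXTRACT(DATE FROM c.time_ts) = '2014-01-24'
        -- UNION DISTINCT removes usernames that appear in both result sets
        UNION DISTINCT
        -- Usernames of story submitters on the same date
        SELECT s.by
        FROM `bigquery-public-data.hacker_news.stories` s
        WHERE EXTRACT(DATE FROM s.time_ts) = '2014-01-24'
        """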

result = client.query(query).result().to_dataframe()
result.head()

| | by |
|---|---|
| 0 | anko |
| 1 | lhnz |
| 2 | ljoshua |
| 3 | spikels |
| 4 | 31reasons |
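
To illustrate what `UNION DISTINCT` does to the combined result, here is a small sketch on invented rows using the standard-library `sqlite3` module. Note the assumptions: SQLite spells the deduplicating form as plain `UNION`, whereas BigQuery requires the explicit `DISTINCT` or `ALL` keyword, and the toy column is named `author` rather than `by`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE comments (author TEXT);
    CREATE TABLE stories (author TEXT);
    INSERT INTO comments VALUES ('alice'), ('bob'), ('alice');
    INSERT INTO stories VALUES ('bob'), ('carol');
""")

# Plain UNION in SQLite removes duplicates, like UNION DISTINCT in BigQuery.
distinct_users = conn.execute(
    "SELECT author FROM comments UNION SELECT author FROM stories ORDER BY author"
).fetchall()

# UNION ALL keeps every row, including repeated usernames.
all_rows = conn.execute(
    "SELECT author FROM comments UNION ALL SELECT author FROM stories"
).fetchall()

print([user for (user,) in distinct_users])  # ['alice', 'bob', 'carol']
print(len(all_rows))                         # 5
```

Each username appears once in the deduplicated result even when the user both commented and posted a story, which is the behavior the query above relies on.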


## End of Session