Data pipeline example showing data extraction from Google BigQuery using SQL

This demo shows:
* Accessing a database in the cloud
* Exploring the data: tables present, schema, and the first 5 rows
* Setting a cost limit for the query. Kaggle users can scan 5 TB every 30 days for free from Google BigQuery, so a 10 GB limit is configured for the query.
* Deriving meaningful information. In this example, we explore the Stack Overflow dataset to find how long each user took, after joining, to post a first question or a first answer. This information is extracted from 3 tables.
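
To put the cost limit in perspective, here is a small arithmetic sketch. It assumes decimal units (1 GB = 10**9 bytes, 1 TB = 10**12 bytes), which may not be exactly how the quota is metered; the constant and function names are made up for illustration.

```python
# Assumption: decimal units (1 GB = 10**9 bytes, 1 TB = 10**12 bytes).
FREE_QUOTA_BYTES = 5 * 10**12     # 5 TB free scan quota per 30 days
PER_QUERY_LIMIT_BYTES = 10**10    # the 10 GB cap configured for the query


def bytes_to_gb(n_bytes):
    """Convert a byte count to decimal gigabytes."""
    return n_bytes / 10**9


# How many queries that each hit the cap would fit in the free quota
max_capped_queries = FREE_QUOTA_BYTES // PER_QUERY_LIMIT_BYTES

# Fraction of the quota that a single capped query could consume
quota_share = PER_QUERY_LIMIT_BYTES / FREE_QUOTA_BYTES

print(f"Per-query cap: {bytes_to_gb(PER_QUERY_LIMIT_BYTES):.0f} GB")
print(f"Capped queries per 30 days: {max_capped_queries}")
print(f"Share of quota per capped query: {quota_share:.1%}")
```

Under these assumptions, a query that reaches the cap uses at most 0.2% of the free quota.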

In [1]:
from learntools.core import binder
binder.bind(globals())
from learntools.sql_advanced.ex1 import *
print("Setup Complete")

Using Kaggle's public dataset BigQuery integration.
Setup Complete


In [2]:
from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "stackoverflow" dataset
dataset_ref = client.dataset("stackoverflow", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

Using Kaggle's public dataset BigQuery integration.


In [3]:
# List all the tables in the dataset
tables = list(client.list_tables(dataset))

# Print names of all tables in the dataset
for table in tables:  
    print(table.table_id)

badges
comments
post_history
post_links
posts_answers
posts_moderator_nomination
posts_orphaned_tag_wiki
posts_privilege_wiki
posts_questions
posts_tag_wiki
posts_tag_wiki_excerpt
posts_wiki_placeholder
stackoverflow_posts
tags
users
votes


In [4]:
# Construct a reference to the "users" table
table_ref = dataset_ref.table("users")

# API request - fetch the table
table = client.get_table(table_ref)


In [5]:
print(table.schema)

[SchemaField('id', 'INTEGER', 'NULLABLE', None, ()), SchemaField('display_name', 'STRING', 'NULLABLE', None, ()), SchemaField('about_me', 'STRING', 'NULLABLE', None, ()), SchemaField('age', 'STRING', 'NULLABLE', None, ()), SchemaField('creation_date', 'TIMESTAMP', 'NULLABLE', None, ()), SchemaField('last_access_date', 'TIMESTAMP', 'NULLABLE', None, ()), SchemaField('location', 'STRING', 'NULLABLE', None, ()), SchemaField('reputation', 'INTEGER', 'NULLABLE', None, ()), SchemaField('up_votes', 'INTEGER', 'NULLABLE', None, ()), SchemaField('down_votes', 'INTEGER', 'NULLABLE', None, ()), SchemaField('views', 'INTEGER', 'NULLABLE', None, ()), SchemaField('profile_image_url', 'STRING', 'NULLABLE', None, ()), SchemaField('website_url', 'STRING', 'NULLABLE', None, ())]
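
The raw schema list is hard to scan. The sketch below prints one field per line using the `name`, `field_type` and `mode` attributes that `SchemaField` objects expose. So that it runs without a BigQuery connection, it uses a hypothetical `Field` namedtuple as a stand-in; in the notebook you would pass `table.schema` instead. The helper name `describe_schema` is also made up for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for bigquery.SchemaField, so the sketch runs offline.
Field = namedtuple("Field", ["name", "field_type", "mode"])


def describe_schema(schema):
    """Return one aligned line per field: name, type, mode."""
    lines = []
    for field in schema:
        lines.append(f"{field.name:<20} {field.field_type:<10} {field.mode}")
    return "\n".join(lines)


# A few fields copied from the schema output above
sample_schema = [
    Field("id", "INTEGER", "NULLABLE"),
    Field("display_name", "STRING", "NULLABLE"),
    Field("creation_date", "TIMESTAMP", "NULLABLE"),
]

print(describe_schema(sample_schema))
```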


In [6]:
# Preview the first five rows of the table
client.list_rows(table, max_results=5).to_dataframe()

,id,display_name,about_me,age,creation_date,last_access_date,location,reputation,up_votes,down_votes,views,profile_image_url,website_url
0,212,Mike Polen,"<p>Christian, dynamic systems engineer, develo...",,2008-08-03 14:59:56.407000+00:00,2019-08-23 16:10:12.707000+00:00,"Richmond, VA",3421,465,4,1181,,http://about.me/mwpolen
1,278,Lea Cohen,"<p>Web developer, both server-side and client-...",,2008-08-04 11:28:54.023000+00:00,2019-09-01 04:48:44.660000+00:00,Israel,4261,3470,24,997,,http://leketshibolim.ort.org.il
2,380,Vaibhav,,,2008-08-05 10:39:18.677000+00:00,2019-07-29 11:59:18.740000+00:00,"Gurgaon, India",8825,430,46,1218,,http://blog.gadodia.net
3,527,ggasp,<p>I'll upvote every answer to my questions!</...,,2008-08-06 14:44:09.103000+00:00,2019-08-21 08:38:01.943000+00:00,Chile,917,135,4,168,,http://gasparolo.com/gabriel
4,889,Eldila,"<p>I'm a programmer, scientist, and mathematic...",,2008-08-10 08:04:03.333000+00:00,2017-02-12 01:45:10.563000+00:00,"Vancouver, Canada",7208,130,5,461,,http://jkwiens.com


In [7]:
# Construct a reference to the "posts_questions" table
table_ref = dataset_ref.table("posts_questions")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five rows of the table
client.list_rows(table, max_results=5).to_dataframe()

,id,title,body,accepted_answer_id,answer_count,comment_count,community_owned_date,creation_date,favorite_count,last_activity_date,last_edit_date,last_editor_display_name,last_editor_user_id,owner_display_name,owner_user_id,parent_id,post_type_id,score,tags,view_count
0,15432504,Div wont go into wrapper?,"<p>I'm having a bit of trouble with a div, my ...",,4,0,,2013-03-15 12:29:18.540000+00:00,,2013-04-08 21:00:53.857000+00:00,NaT,,,,2060014,,1,0,csshtmlwrapper,256
1,15437295,I want each HTML line to alternate in color in...,<p>I have scrolling HTML text in Actionscript ...,,1,0,,2013-03-15 16:19:14.063000+00:00,,2013-03-15 16:21:50.683000+00:00,NaT,,,,2174752,,1,0,htmlcolorstextboxalternate,256
2,15459643,What kind of design pattern is this?,"<p>On my site, I have features, which I want t...",,2,4,,2013-03-17 10:17:44.483000+00:00,,2013-03-17 11:16:17.483000+00:00,NaT,,,,362378,,1,0,ruby-on-railsrubydesign-patterns,256
3,15475567,Sending post request no framework,<pre><code>public class ClientTestHttp {\n\n ...,,1,0,,2013-03-18 11:20:10.303000+00:00,,2013-03-19 08:50:05.527000+00:00,2013-03-18 11:24:52.963000+00:00,,852866.0,,2140215,,1,0,netty,256
4,15475774,custom b2b app also in regular store,<p>I was wondering if it would be possible to ...,18689167.0,1,0,,2013-03-18 11:31:40.753000+00:00,,2013-09-08 22:14:14.940000+00:00,NaT,,,,2118320,,1,0,iositunesappstore-approvalb2b,256


In [8]:
# Construct a reference to the "posts_answers" table
table_ref = dataset_ref.table("posts_answers")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five rows of the table
client.list_rows(table, max_results=5).to_dataframe()

,id,title,body,accepted_answer_id,answer_count,comment_count,community_owned_date,creation_date,favorite_count,last_activity_date,last_edit_date,last_editor_display_name,last_editor_user_id,owner_display_name,owner_user_id,parent_id,post_type_id,score,tags,view_count
0,50075045,,<p>To return a json try to replace</p>\n\n<pre...,,,0,,2018-04-28 09:36:26.320000+00:00,,2018-04-28 09:36:26.320000+00:00,NaT,,,,5855039,50058348,2,0,,
1,50075058,,<p>Very good practice is to import feature mod...,,,0,,2018-04-28 09:37:48.770000+00:00,,2018-04-28 09:37:48.770000+00:00,NaT,,,,6370870,42300636,2,0,,
2,50075091,,<p>You can adjust your current query to get th...,,,0,,2018-04-28 09:41:45.040000+00:00,,2018-04-28 09:41:45.040000+00:00,NaT,,,,8024897,50031556,2,0,,
3,50075112,,<p>Provided stacktrace related with the incomp...,,,0,,2018-04-28 09:44:39.143000+00:00,,2018-04-28 09:44:39.143000+00:00,NaT,,,,2674303,50004504,2,0,,
4,50075119,,<p>So it seems it just took quite a long time ...,,,0,,2018-04-28 09:45:09.930000+00:00,,2018-04-28 15:19:39.537000+00:00,2018-04-28 15:19:39.537000+00:00,,1843511.0,,1843511,50061108,2,0,,
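
The three tables previewed above are linked by the user key: `users.id` matches `owner_user_id` in both `posts_questions` and `posts_answers`. As a minimal sketch of that join pattern, the example below rebuilds tiny versions of the tables in an in-memory SQLite database and finds each user's earliest question and answer. The rows are invented for illustration, and SQLite is used only so the example runs offline; its SQL dialect differs from BigQuery's in places.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Toy tables with made-up rows; only the join keys and dates are modelled.
conn.executescript("""
    CREATE TABLE users (id INTEGER, creation_date TEXT);
    CREATE TABLE posts_questions (id INTEGER, owner_user_id INTEGER, creation_date TEXT);
    CREATE TABLE posts_answers (id INTEGER, owner_user_id INTEGER, creation_date TEXT);

    INSERT INTO users VALUES (1, '2019-01-01'), (2, '2019-01-02'), (3, '2019-01-03');
    INSERT INTO posts_questions VALUES (10, 1, '2019-01-05'), (11, 1, '2019-01-04');
    INSERT INTO posts_answers VALUES (20, 1, '2019-01-06'), (21, 2, '2019-01-07');
""")

# LEFT JOIN keeps every user, including those with no questions or no answers.
rows = conn.execute("""
    SELECT u.id,
           MIN(q.creation_date) AS q_creation_date,
           MIN(a.creation_date) AS a_creation_date
    FROM users AS u
        LEFT JOIN posts_questions AS q ON u.id = q.owner_user_id
        LEFT JOIN posts_answers AS a ON u.id = a.owner_user_id
    GROUP BY u.id
    ORDER BY u.id
""").fetchall()

for row in rows:
    print(row)
```

User 1 has both dates, user 2 has only an answer date, and user 3 has neither, so the missing values come back as `None`.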


In [9]:
# Earliest question and earliest answer for each user who joined between January and May 2019
query = """
                SELECT u.id AS user_id, u.creation_date AS user_joining_date,
                    MIN(q.creation_date) AS q_creation_date,
                    MIN(a.creation_date) AS a_creation_date
                FROM `bigquery-public-data.stackoverflow.users` AS u
                    LEFT JOIN `bigquery-public-data.stackoverflow.posts_questions` AS q
                        ON u.id = q.owner_user_id
                    -- Join answers on the user id so that users who answered without ever asking are kept
                    LEFT JOIN `bigquery-public-data.stackoverflow.posts_answers` AS a
                        ON u.id = a.owner_user_id
                WHERE u.creation_date >= '2019-01-01' AND u.creation_date < '2019-06-01' 
                GROUP BY user_id, user_joining_date
                """
# Set up the query; it fails instead of running if it would bill more than 10 GB (10**10 bytes)
config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(query, job_config=config)

# API request - run the query, and convert the results to a pandas DataFrame
results = query_job.to_dataframe()
# Show only users who have posted both a question and an answer
results.dropna().head()

,user_id,user_joining_date,q_creation_date,a_creation_date
271,10890552,2019-01-09 15:09:52.233000+00:00,2019-01-09 15:42:33.570000+00:00,2019-05-09 09:26:03.717000+00:00
273,10943898,2019-01-21 09:16:50.513000+00:00,2019-01-21 09:24:43.930000+00:00,2019-08-15 15:56:36.933000+00:00
276,11345506,2019-04-11 10:16:57.180000+00:00,2019-04-30 15:50:56.970000+00:00,2019-05-01 06:38:36.517000+00:00
282,10953925,2019-01-23 06:02:49.743000+00:00,2019-04-29 10:18:19.137000+00:00,2019-03-06 10:55:52.590000+00:00
288,11345763,2019-04-11 11:06:49.417000+00:00,2019-04-11 11:42:00.827000+00:00,2019-05-29 09:24:30.830000+00:00
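
The query output holds dates, not durations. A minimal pandas sketch of the final step, turning those dates into "time to first post", might look like the following. It uses three rows copied from the output above; the helper name `add_time_to_first_post` and the new column names are assumptions made for illustration.

```python
import pandas as pd

# Three rows copied from the query output shown above
sample = pd.DataFrame({
    "user_id": [10890552, 10943898, 10953925],
    "user_joining_date": [
        "2019-01-09 15:09:52.233000+00:00",
        "2019-01-21 09:16:50.513000+00:00",
        "2019-01-23 06:02:49.743000+00:00",
    ],
    "q_creation_date": [
        "2019-01-09 15:42:33.570000+00:00",
        "2019-01-21 09:24:43.930000+00:00",
        "2019-04-29 10:18:19.137000+00:00",
    ],
    "a_creation_date": [
        "2019-05-09 09:26:03.717000+00:00",
        "2019-08-15 15:56:36.933000+00:00",
        "2019-03-06 10:55:52.590000+00:00",
    ],
})


def add_time_to_first_post(df):
    """Return a copy of df with elapsed-time columns added."""
    out = df.copy()
    for col in ["user_joining_date", "q_creation_date", "a_creation_date"]:
        out[col] = pd.to_datetime(out[col], utc=True)
    out["time_to_first_question"] = out["q_creation_date"] - out["user_joining_date"]
    out["time_to_first_answer"] = out["a_creation_date"] - out["user_joining_date"]
    # Whichever came first, the question or the answer
    out["time_to_first_post"] = out[
        ["time_to_first_question", "time_to_first_answer"]
    ].min(axis=1)
    return out


timed = add_time_to_first_post(sample)
print(timed[["user_id", "time_to_first_question", "time_to_first_answer", "time_to_first_post"]])
```

In the notebook, the same function could be applied to the DataFrame returned by `query_job.to_dataframe()`, whose timestamp columns should already be parsed, so the conversion step would be a no-op there.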


This post is inspired by Rachael Tatman's posts on Kaggle.