In [0]:
from google.cloud import bigquery

In [0]:
# Create a "Client" object
client = bigquery.Client()

In [0]:
# Construct a reference to the "hacker_news" dataset.
# In BigQuery, each dataset is contained in a corresponding project.
dataset_ref = client.dataset("hacker_news", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

In [0]:
# List all the tables in the "hacker_news" dataset
tables = list(client.list_tables(dataset))

# Print names of all tables in the dataset (there are four!)
for table in tables:  
    print(table.table_id)

Similar to how we fetched a dataset, we can fetch a table. In the code cell below, we fetch the full table in the hacker_news dataset.

In [0]:
# Construct a reference to the "full" table
table_ref = dataset_ref.table("full")

# API request - fetch the table
table = client.get_table(table_ref)


![](https://drive.google.com/uc?id=1UY7c-6Nz2ss2V93rLwlPcod_JIL9zmv4)

# Table schema

The structure of a table is called its schema. We need to understand a table's schema to effectively pull out the data we want. In this example, we'll investigate the full table that we fetched above.


In [0]:
# Print information on all the columns in the "full" table in the "hacker_news" dataset
table.schema

Each SchemaField tells us about a specific column (which we also refer to as a field). In order, the information is:

    The name of the column
    The field type (or datatype) in the column
    The mode of the column ('NULLABLE' means that a column allows NULL values, and is the default)
    A description of the data in that column

The first field has the SchemaField:

SchemaField('by', 'string', 'NULLABLE', "The username of the item's author.",())

This tells us:

    the field (or column) is called by,
    the data in this field is strings,
    NULL values are allowed, and
    it contains the usernames corresponding to each item's author.

We can use the list_rows() method to check just the first five lines of the full table to make sure this is right. (Sometimes databases have outdated descriptions, so it's good to check.) This returns a BigQuery RowIterator object that can quickly be converted to a pandas DataFrame with the to_dataframe() method.


In [0]:
# Preview the first five lines of the "full" table
client.list_rows(table, max_results=5).to_dataframe()

In [0]:
# Preview the first five entries in the "by" column of the "full" table
client.list_rows(table, selected_fields=table.schema[:1], max_results=5).to_dataframe()

# SELECT ... FROM ... WHERE

To write a basic query:

    specify the column you want after the word SELECT,
    specify the table after the word FROM, and
    optionally, specify a condition after the word WHERE to return only the rows that satisfy it.

    If you want multiple columns, separate their names with commas (use * to select all the columns).


In [0]:
query = '''
SELECT Name
FROM `project.database.table`
WHERE Animal = 'cat'
'''
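To see the SELECT ... FROM ... WHERE pattern run end to end without a BigQuery account, here is a minimal sketch that swaps BigQuery for Python's built-in sqlite3 module. The pets rows (IDs and names) are made up for illustration, and SQLite's dialect differs slightly from BigQuery's (for example, table names are not wrapped in backticks).

```python
import sqlite3

# In-memory SQLite database standing in for BigQuery (an assumption of this sketch)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, Name TEXT, Animal TEXT)")

# Hypothetical rows: the names are invented for illustration
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [(1, "Moon", "rabbit"), (2, "Ripley", "cat"), (3, "Napoleon", "dog"), (4, "Tom", "cat")],
)

# Select one column, keeping only the rows where Animal is 'cat'
rows = conn.execute("SELECT Name FROM pets WHERE Animal = 'cat'").fetchall()

# Sort in Python so the printed order is deterministic
cat_names = sorted(row[0] for row in rows)
print(cat_names)
```

Only the two cat rows survive the WHERE condition; the rabbit and the dog are filtered out before anything is returned.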

# Submitting the query to the dataset

In [0]:
# Create a "Client" object
client = bigquery.Client()

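# Set up the query (the placeholder query above must first be replaced with a
# real one; the cells below assume it returns a "city" column)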
query_job = client.query(query)

# API request : run the query, and return a pandas DataFrame
us_cities = query_job.to_dataframe()

Now we've got a pandas DataFrame called us_cities, which we can use like any other DataFrame. 

In [0]:
# What five cities have the most measurements?
us_cities.city.value_counts().head()

# How much data will a query scan?

In [0]:
# Query to get the score and title columns from every row where the type column has value "job"
query = """
        SELECT score, title
        FROM `bigquery-public-data.hacker_news.full`
        WHERE type = "job" 
        """

# Create a QueryJobConfig object to estimate size of query without running it
dry_run_config = bigquery.QueryJobConfig(dry_run=True)

# API request - dry run query to estimate costs
dry_run_query_job = client.query(query, job_config=dry_run_config)

print("This query will process {} bytes.".format(dry_run_query_job.total_bytes_processed))
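A raw byte count is hard to read at a glance. The helper below is a hypothetical convenience function (it is not part of the BigQuery client library) that converts a byte count to decimal, 1000-based units, the same convention as the ONE_MB and ONE_GB constants used in this notebook.

```python
def format_bytes(num_bytes):
    """Return a human-readable size using decimal (1000-based) units."""
    size = float(num_bytes)
    for unit in ("bytes", "KB", "MB", "GB", "TB"):
        # Stop at the first unit where the value drops below 1000,
        # or at the largest unit we know about
        if size < 1000 or unit == "TB":
            return f"{size:.2f} {unit}"
        size /= 1000


# Made-up byte count for illustration; in the notebook you would pass
# dry_run_query_job.total_bytes_processed instead
print(format_bytes(1_500_000_000))
```

Comparing the formatted value with the limit you plan to pass as maximum_bytes_billed makes it easier to judge whether a query will be allowed to run.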

# Specify a parameter when running the query to limit how much data you are willing to scan

In [0]:
# Only run the query if it's less than 1 MB
ONE_MB = 1000*1000
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=ONE_MB)

# Set up the query (will only run if it's less than 1 MB)
safe_query_job = client.query(query, job_config=safe_config)

# API request - try to run the query, and return a pandas DataFrame
safe_query_job.to_dataframe()

In this case, the query was cancelled, because the limit of 1 MB was exceeded. However, we can increase the limit to run the query successfully!

In [0]:
# Only run the query if it's less than 1 GB
ONE_GB = 1000*1000*1000
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=ONE_GB)

# Set up the query (will only run if it's less than 1 GB)
safe_query_job = client.query(query, job_config=safe_config)

# API request - try to run the query, and return a pandas DataFrame
job_post_scores = safe_query_job.to_dataframe()

# Print average score for job posts
job_post_scores.score.mean()

# COUNT()

COUNT(), as you may have guessed from the name, returns a count of things. If you pass it the name of a column, it will return the number of entries in that column.

For instance, if we SELECT the COUNT() of the ID column in the pets table, it will return 4, because there are 4 IDs in the table.

COUNT() is an example of an aggregate function, which takes many values and returns one. (Other examples of aggregate functions include SUM(), AVG(), MIN(), and MAX().) Aggregate functions introduce strange column names in the results (like f0__). Later in this tutorial, you'll learn how to change the name to something more descriptive.

In [0]:
query = """
SELECT COUNT(id)
FROM project.database.table
"""

# Aggregate functions: COUNT(), SUM(), AVG(), MIN(), MAX()
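As a runnable illustration of aggregate functions, the sketch below uses Python's built-in sqlite3 module in place of BigQuery. The four pets rows are hypothetical (the names are invented); only the fact that the table has 4 IDs comes from the text above.

```python
import sqlite3

# In-memory SQLite database standing in for BigQuery (an assumption of this sketch)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, Name TEXT, Animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [(1, "Moon", "rabbit"), (2, "Ripley", "cat"), (3, "Napoleon", "dog"), (4, "Tom", "cat")],
)

# COUNT(ID) collapses the whole ID column into a single number
(id_count,) = conn.execute("SELECT COUNT(ID) FROM pets").fetchone()

# Other aggregate functions work the same way: many values in, one value out
min_id, max_id = conn.execute("SELECT MIN(ID), MAX(ID) FROM pets").fetchone()

print(id_count, min_id, max_id)
```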

# GROUP BY

GROUP BY takes the name of one or more columns, and treats all rows with the same value in that column as a single group when you apply aggregate functions like COUNT().

For example, say we want to know how many of each type of animal we have in the pets table. We can use GROUP BY to group together rows that have the same value in the Animal column, while using COUNT() to find out how many IDs we have in each group.

It returns a table with three rows (one for each distinct animal). We can see that the pets table contains 1 rabbit, 1 dog, and 2 cats.

In [0]:
query = """
SELECT animal, COUNT(id)
FROM project.database.table
GROUP BY animal
"""

# GROUP BY HAVING

HAVING is used in combination with GROUP BY to ignore groups that don't meet certain criteria.

So this query, for example, will only include groups that have more than one ID in them.

Since only one group meets the specified criterion, the query will return a table with only one row.

In [0]:
query = """
SELECT animal, COUNT(id)
FROM project.database.table
GROUP BY animal
HAVING COUNT(id) > 1
"""

# Aliasing and other improvements

A couple of hints to make your queries even better:

    The column resulting from COUNT(id) was called f0__. That's not a very descriptive name. You can change the name by adding AS NumPosts after you specify the aggregation. This is called aliasing, and it will be covered in more detail in an upcoming lesson.
    If you are ever unsure what to put inside the COUNT() function, you can do COUNT(1) to count the rows in each group. Most people find it especially readable, because we know it's not focusing on other columns. It also scans less data than if you supply column names (making it faster and using less of your data access quota).

Using these tricks, we can rewrite our query:

In [0]:
# Improved version of earlier query, now with aliasing & improved readability
query_improved = """
                 SELECT parent, COUNT(1) AS NumPosts
                 FROM `bigquery-public-data.hacker_news.comments`
                 GROUP BY parent
                 HAVING COUNT(1) > 10
                 """

safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(query_improved, job_config=safe_config)

# API request - run the query, and convert the results to a pandas DataFrame
improved_df = query_job.to_dataframe()

# Print the first five rows of the DataFrame
improved_df.head()
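The same combination of GROUP BY, HAVING, COUNT(1), and aliasing can be tried offline. This is a minimal sketch that assumes Python's built-in sqlite3 module as a stand-in for BigQuery, with a hypothetical pets table (1 rabbit, 1 dog, and 2 cats, with invented names) and a made-up alias, NumPets.

```python
import sqlite3

# In-memory SQLite database standing in for BigQuery (an assumption of this sketch)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, Name TEXT, Animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [(1, "Moon", "rabbit"), (2, "Ripley", "cat"), (3, "Napoleon", "dog"), (4, "Tom", "cat")],
)

# GROUP BY: one output row per distinct Animal, with an aliased count
cursor = conn.execute("SELECT Animal, COUNT(1) AS NumPets FROM pets GROUP BY Animal")
column_names = [col[0] for col in cursor.description]
per_animal = dict(cursor.fetchall())

# HAVING: keep only the groups that contain more than one row
multiple = dict(conn.execute(
    "SELECT Animal, COUNT(1) AS NumPets FROM pets GROUP BY Animal HAVING COUNT(1) > 1"
))

print(column_names)
print(per_animal)
print(multiple)
```

The alias shows up as the name of the second result column, and only the cat group passes the HAVING condition.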

# Note on using GROUP BY

Note that because GROUP BY tells SQL how to apply aggregate functions (like COUNT()), it doesn't make sense to use GROUP BY without an aggregate function. Similarly, if a query has a GROUP BY clause, then every variable must be passed to either a

    GROUP BY command, or
    an aggregation function.

Consider the query below:

In [0]:
query_good = """
             SELECT parent, COUNT(id)
             FROM `bigquery-public-data.hacker_news.comments`
             GROUP BY parent
             """

Note that there are two variables: parent and id.

    parent was passed to a GROUP BY command (in GROUP BY parent), and
    id was passed to an aggregate function (in COUNT(id)).

And this query won't work, because the author column isn't passed to an aggregate function or a GROUP BY clause:


In [0]:
query_bad = """
            SELECT author, parent, COUNT(id)
            FROM `bigquery-public-data.hacker_news.comments`
            GROUP BY parent
            """

If you make this error, you'll get the error message SELECT list expression references column (column's name) which is neither grouped nor aggregated at.

# ORDER BY

ORDER BY is usually the last clause in your query, and it sorts the results returned by the rest of your query.

Without ORDER BY, the rows of a result are not guaranteed to come back in any particular order (for example, sorted by the ID column). We can quickly remedy this with the query below.

In [0]:
query = '''
SELECT ID, Name, Animal
FROM `bigquery-public-data.pet_records.pets`
ORDER BY ID 
'''

You can reverse the order using the DESC argument (short for 'descending'). The next query sorts the table by the Animal column, where the values that are last in alphabetic order are returned first.

In [0]:
query = '''
SELECT ID, Name, Animal
FROM `bigquery-public-data.pet_records.pets`
ORDER BY Animal DESC
'''
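Both sort directions can be checked with a small offline sketch. It assumes Python's built-in sqlite3 module in place of BigQuery, and the pets rows are hypothetical, inserted out of ID order on purpose so the effect of ORDER BY is visible.

```python
import sqlite3

# In-memory SQLite database standing in for BigQuery (an assumption of this sketch)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, Name TEXT, Animal TEXT)")

# Hypothetical rows, deliberately inserted out of ID order
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [(3, "Napoleon", "dog"), (1, "Moon", "rabbit"), (4, "Tom", "cat"), (2, "Ripley", "cat")],
)

# Ascending order is the default
ids_ascending = [
    row[0] for row in conn.execute("SELECT ID, Name, Animal FROM pets ORDER BY ID")
]

# DESC reverses the order: values last in the alphabet come first
animals_descending = [
    row[2] for row in conn.execute("SELECT ID, Name, Animal FROM pets ORDER BY Animal DESC")
]

print(ids_ascending)
print(animals_descending)
```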

# DATE and DATETIME



There are two ways that dates can be stored in BigQuery: as a DATE or as a DATETIME.

The DATE format has the year first, then the month, and then the day. It looks like this:  YYYY-[M]M-[D]D

YYYY: Four-digit year
[M]M: One or two digit month
[D]D: One or two digit day

So 2019-01-10 is interpreted as January 10, 2019.

The DATETIME format is like the date format ... but with time added at the end.

# EXTRACT

To look at part of a date, like the year or the day, use EXTRACT.

In [0]:
query = '''
SELECT Name, EXTRACT(DAY from Date) AS Day
FROM `bigquery-public-data.pet_records.pets_with_date`'''

SQL is very smart about dates, and we can ask for information beyond just extracting part of the cell. For example, this query returns one column with just the week in the year (between 0 and 53) for each date in the Date column:

In [0]:
query = '''
SELECT Name, EXTRACT(WEEK from Date) AS Week
FROM `bigquery-public-data.pet_records.pets_with_date`'''
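To build intuition for what EXTRACT returns, the same parts can be computed in plain Python. This sketch assumes that BigQuery's WEEK part counts weeks starting on Sunday, with the days before the year's first Sunday falling in week 0; that is how Python's %U format directive behaves, so it is used here as an analogue rather than as BigQuery's own implementation.

```python
from datetime import date

# The example date from the DATE section: January 10, 2019
d = date.fromisoformat("2019-01-10")

# Analogue of EXTRACT(DAY FROM Date)
day = d.day

# Analogue of EXTRACT(WEEK FROM Date), assuming Sunday-based weeks
# where days before the first Sunday of the year are in week 0
week = int(d.strftime("%U"))

print(d.year, day, week)
```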

You can find all the functions you can use with dates in BigQuery in this documentation, under "Date and time functions":

https://cloud.google.com/bigquery/docs/reference/legacy-sql#datetimefunctions

## Example: How does the number of accidents vary with the day of the week?

In the accident_2015 table:

    the consecutive_number column contains a unique ID for each accident, and
    the timestamp_of_crash column contains the date of the accident in DATETIME format.

We can:

    EXTRACT the day of the week (as day_of_week in the query below) from the timestamp_of_crash column,
    GROUP BY the day of the week, before we COUNT the consecutive_number column to determine the number of accidents for each day of the week, and
    sort the table with an ORDER BY clause, so the days with the most accidents are returned first.


In [0]:
# Query to find out the number of accidents for each day of the week
query = """
        SELECT COUNT(consecutive_number) AS num_accidents, 
               EXTRACT(DAYOFWEEK FROM timestamp_of_crash) AS day_of_week
        FROM `bigquery-public-data.nhtsa_traffic_fatalities.accident_2015`
        GROUP BY day_of_week
        ORDER BY num_accidents DESC
        """

# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 1 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**9)
query_job = client.query(query, job_config=safe_config)

# API request - run the query, and convert the results to a pandas DataFrame
accidents_by_day = query_job.to_dataframe()

# Print the DataFrame
accidents_by_day

Notice that the data is sorted by the num_accidents column, where the days with more traffic accidents appear first.

To map the numbers returned for the day_of_week column to the actual day, you might consult the BigQuery documentation on the DAYOFWEEK function. It says that it returns "an integer between 1 (Sunday) and 7 (Saturday), inclusively". So, in 2015, most fatal motor accidents in the US occurred on Sunday and Saturday, while the fewest happened on Tuesday.
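The numbering quoted from the documentation can be turned into a small lookup so that results are easier to read. The names DAY_NAMES and day_name below are hypothetical helpers, and the example values are made up rather than taken from the accidents table.

```python
# DAYOFWEEK numbering as quoted above: 1 is Sunday, 7 is Saturday
DAY_NAMES = {
    1: "Sunday",
    2: "Monday",
    3: "Tuesday",
    4: "Wednesday",
    5: "Thursday",
    6: "Friday",
    7: "Saturday",
}


def day_name(day_of_week):
    """Translate a DAYOFWEEK integer (1 to 7) into the name of the day."""
    return DAY_NAMES[day_of_week]


# Made-up example values, not results from the accidents table
example_days = [7, 1, 3]
print([day_name(d) for d in example_days])
```

With the DataFrame from the cell above, one possible use would be accidents_by_day["day_of_week"].map(DAY_NAMES), which relies on the pandas Series.map method accepting a dictionary.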

# Difference between WHERE and HAVING

WHERE can be used in any SELECT query, while HAVING is only used in SELECT queries that contain an aggregate function or a GROUP BY clause.

Both specify a filtering condition. The condition in the WHERE clause is applied while rows are fetched from the table, so rows that don't pass it never reach the result set. The HAVING clause is applied later, to filter the summarized (grouped) data.

In short, if a SELECT query with an aggregate function or GROUP BY clause uses both WHERE and HAVING, the WHERE clause executes before the HAVING clause.
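The order of evaluation is easiest to see on a tiny table. The sketch below assumes Python's built-in sqlite3 module in place of BigQuery, and a hypothetical pets table whose names and Years_old values are invented for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for BigQuery (an assumption of this sketch)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, Name TEXT, Animal TEXT, Years_old INTEGER)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?, ?)",
    [(1, "Moon", "rabbit", 3), (2, "Ripley", "cat", 8),
     (3, "Napoleon", "dog", 7), (4, "Tom", "cat", 2)],
)

# WHERE filters individual rows before any grouping happens
where_only = dict(conn.execute(
    "SELECT Animal, COUNT(ID) FROM pets WHERE Years_old > 5 GROUP BY Animal"
))

# HAVING filters whole groups after the counts have been computed
having_only = dict(conn.execute(
    "SELECT Animal, COUNT(ID) FROM pets GROUP BY Animal HAVING COUNT(ID) > 1"
))

# With both, WHERE runs first: only one cat is left, so no group passes HAVING
both = dict(conn.execute(
    "SELECT Animal, COUNT(ID) FROM pets WHERE Years_old > 5 "
    "GROUP BY Animal HAVING COUNT(ID) > 1"
))

print(where_only)
print(having_only)
print(both)
```

The combined query returns nothing because the younger cat is removed by WHERE before the groups are counted, which leaves no group with more than one row.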

# WITH ... AS

With all that you've learned, your SQL queries are getting pretty long, which can make them hard to understand (and debug).

You are about to learn how to use AS and WITH to tidy up your queries and make them easier to read.

On its own, AS is a convenient way to clean up the data returned by your query. It's even more powerful when combined with WITH in what's called a "common table expression".

A common table expression (or CTE) is a temporary table that you return within your query. CTEs are helpful for splitting your queries into readable chunks, and you can write queries against them.

For instance, you might want to use the pets table to ask questions about older animals in particular. So you can start by creating a CTE which only contains information about animals more than five years old like this:

In [0]:
query = """
WITH Seniors AS 
(
SELECT ID, Name
FROM `bigquery-public-data.pet_records.pets`
WHERE Years_old > 5
)
SELECT ID
FROM Seniors
"""

Also, it's important to note that CTEs only exist inside the query where you create them, and you can't reference them in later queries. So, any query that uses a CTE is always broken into two parts: (1) first, we create the CTE, and then (2) we write a query that uses the CTE.
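Both points, querying a CTE and its limited lifetime, can be checked offline. This is a minimal sketch that assumes Python's built-in sqlite3 module as a stand-in for BigQuery; the pets rows and their Years_old values are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for BigQuery (an assumption of this sketch)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, Name TEXT, Animal TEXT, Years_old INTEGER)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?, ?)",
    [(1, "Moon", "rabbit", 3), (2, "Ripley", "cat", 8),
     (3, "Napoleon", "dog", 7), (4, "Tom", "cat", 2)],
)

# Part 1 creates the CTE; part 2 queries it, all within a single statement
senior_ids = [row[0] for row in conn.execute("""
    WITH Seniors AS (
        SELECT ID, Name
        FROM pets
        WHERE Years_old > 5
    )
    SELECT ID
    FROM Seniors
    ORDER BY ID
""")]

# A later query cannot see the CTE, because it only existed inside the query above
try:
    conn.execute("SELECT ID FROM Seniors")
    cte_persisted = True
except sqlite3.OperationalError:
    cte_persisted = False

print(senior_ids, cte_persisted)
```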

## Example: How many Bitcoin transactions are made per day?

We're going to use a CTE to find out how many Bitcoin transactions were made each day for the entire timespan of a Bitcoin transaction dataset.

We'll investigate the transactions table. 

In [0]:
from google.cloud import bigquery

# Create a "Client" object
client = bigquery.Client()

# Construct a reference to the "crypto_bitcoin" dataset
dataset_ref = client.dataset("crypto_bitcoin", project="bigquery-public-data")

# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

# Construct a reference to the "transactions" table
table_ref = dataset_ref.table("transactions")

# API request - fetch the table
table = client.get_table(table_ref)

# Preview the first five lines of the "transactions" table
client.list_rows(table, max_results=5).to_dataframe()

# Query to select the number of transactions per date, sorted by date.
# The block_timestamp column contains the date and time of each transaction;
# DATE() keeps only the date part.
query_with_CTE = """ 
                 WITH time AS 
                 (
                     SELECT DATE(block_timestamp) AS trans_date
                     FROM `bigquery-public-data.crypto_bitcoin.transactions`
                 )
                 SELECT COUNT(1) AS transactions,
                        trans_date
                 FROM time
                 GROUP BY trans_date
                 ORDER BY trans_date
                 """

# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(query_with_CTE, job_config=safe_config)

# API request - run the query, and convert the results to a pandas DataFrame
transactions_by_date = query_job.to_dataframe()

# Print the first five rows
transactions_by_date.head()

In [0]:
# Plot the number of Bitcoin transactions per day over the whole timespan of the dataset
transactions_by_date.set_index('trans_date').plot()

# JOIN

To get information that applies to a certain pet, we match the ID column in the pets table to the Pet_ID column in the owners table.

In [0]:
query = """
SELECT p.Name AS Pet_Name, o.Name AS Owner_Name
FROM `bigquery-public-data.pet_records.pets` AS p
INNER JOIN `bigquery-public-data.pet_records.owners` AS o
        ON p.ID = o.Pet_ID
"""

The type of JOIN we're using today is called an INNER JOIN. That means that a row will only be put in the final output table if the value in the columns you're using to combine them shows up in both the tables you're joining. 
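The "shows up in both tables" rule can be seen in a small offline sketch. It assumes Python's built-in sqlite3 module in place of BigQuery, and both tables are hypothetical: one pet has no owner, and one owner refers to a Pet_ID that does not exist in the pets table.

```python
import sqlite3

# In-memory SQLite database standing in for BigQuery (an assumption of this sketch)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, Name TEXT, Animal TEXT)")
conn.execute("CREATE TABLE owners (ID INTEGER, Name TEXT, Pet_ID INTEGER)")

# Hypothetical rows: pet 2 has no owner, and owner 4 points at a missing pet
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [(1, "Moon", "rabbit"), (2, "Ripley", "cat"), (3, "Napoleon", "dog"), (4, "Tom", "cat")],
)
conn.executemany(
    "INSERT INTO owners VALUES (?, ?, ?)",
    [(1, "Aubrey", 1), (2, "Chett", 3), (3, "Jules", 4), (4, "Magnus", 9)],
)

# INNER JOIN keeps a row only when the ID appears in both tables
pairs = conn.execute("""
    SELECT p.Name AS Pet_Name, o.Name AS Owner_Name
    FROM pets AS p
    INNER JOIN owners AS o
        ON p.ID = o.Pet_ID
    ORDER BY p.ID
""").fetchall()

print(pairs)
```

The pet without an owner and the owner without a matching pet are both missing from the output, which is the defining behavior of an INNER JOIN.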

## Example: How many files are covered by each type of software license?

We'll work with two tables in the database. The first table is the licenses table, which provides the name of each GitHub repo (in the repo_name column) and its corresponding license. 

The second table is the sample_files table, which provides, among other information, the GitHub repo that each file belongs to (in the repo_name column). 

The query below uses information in both tables to determine how many files are released under each license:

In [0]:
# Query to determine the number of files per license, sorted by number of files
query = """
        SELECT L.license, COUNT(1) AS number_of_files
        FROM `bigquery-public-data.github_repos.sample_files` AS sf
        INNER JOIN `bigquery-public-data.github_repos.licenses` AS L 
            ON sf.repo_name = L.repo_name
        GROUP BY L.license
        ORDER BY number_of_files DESC
        """

# Set up the query (cancel the query if it would use too much of 
# your quota, with the limit set to 10 GB)
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=10**10)
query_job = client.query(query, job_config=safe_config)

# API request - run the query, and convert the results to a pandas DataFrame
file_count_by_license = query_job.to_dataframe()

# Print the DataFrame
file_count_by_license