## Introduction
Now that you can select raw data, you're ready to learn how to group your data and count things within those groups. This can help you answer questions like:

- How many of each kind of fruit has our store sold?
- How many species of animal has the vet office treated?

To do this, you'll learn about three new techniques: GROUP BY, HAVING and COUNT(). Once again, we'll use this made-up table of information on pets.

In [1]:
import pandas as pd
from pandas.io import gbq

# My own way of importing the data from BigQuery; the query itself is unchanged
df_pet = gbq.read_gbq('select * from pet_records.pet_records', 
                       project_id='numeric-dialect-275303', index_col="ID")
df_pet

ID,Name,Animal
1,Dr. Harris Bonkers,Rabbit
2,Moon,Dog
3,Ripley,Cat
4,Tom,Cat


## COUNT()
COUNT(), as you may have guessed from the name, returns a count of things. If you pass it the name of a column, it will return the number of entries in that column.

For instance, if we SELECT the COUNT() of the ID column in the pets table, it will return 4, because there are 4 IDs in the table.

In [2]:
import pandas as pd
from pandas.io import gbq

# My own way of importing the data from BigQuery; the query itself is unchanged
df_pet = gbq.read_gbq('select COUNT(ID) from pet_records.pet_records', 
                       project_id='numeric-dialect-275303')
df_pet

Unnamed: 0,f0_
0,4


COUNT() is an example of an aggregate function, which takes many values and returns one. (Other examples of aggregate functions include SUM(), AVG(), MIN(), and MAX().) As you'll notice in the output above, aggregate functions introduce strange column names (like f0_). Later in this tutorial, you'll learn how to change the name to something more descriptive.
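To try aggregate functions without a BigQuery project, here is a minimal local sketch. It assumes Python's built-in sqlite3 module as a stand-in for BigQuery and rebuilds the four-row pets table by hand. The table name `pets` is an assumption made for illustration only.

```python
import sqlite3

# In-memory SQLite database standing in for BigQuery (an assumption for local testing).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, Name TEXT, Animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [
        (1, "Dr. Harris Bonkers", "Rabbit"),
        (2, "Moon", "Dog"),
        (3, "Ripley", "Cat"),
        (4, "Tom", "Cat"),
    ],
)

# Each aggregate function collapses the four ID values into a single value.
count_ids, min_id, max_id = conn.execute(
    "SELECT COUNT(ID), MIN(ID), MAX(ID) FROM pets"
).fetchone()
print(count_ids, min_id, max_id)  # 4 1 4
```

The query returns a single row even though the table has four, which is what "takes many values and returns one" means in practice.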

## GROUP BY
GROUP BY takes the name of one or more columns, and treats all rows with the same value in that column as a single group when you apply aggregate functions like COUNT().

For example, say we want to know how many of each type of animal we have in the pets table. We can use GROUP BY to group together rows that have the same value in the Animal column, while using COUNT() to find out how many IDs we have in each group.

In [3]:
# My own way of importing the data from BigQuery; the query itself is unchanged
df_pet = gbq.read_gbq('SELECT Animal, COUNT(ID) FROM pet_records.pet_records GROUP BY Animal', 
                       project_id='numeric-dialect-275303')
df_pet

Unnamed: 0,Animal,f0_
0,Rabbit,1
1,Dog,1
2,Cat,2


It returns a table with three rows (one for each distinct animal). We can see that the pets table contains 1 rabbit, 1 dog, and 2 cats.
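The same grouping can be reproduced locally. This is a sketch that assumes an in-memory SQLite table named `pets` (a hypothetical stand-in for the BigQuery table) and adds an ORDER BY clause, which is not in the query above, only so the row order is predictable.

```python
import sqlite3

# In-memory SQLite database standing in for BigQuery (an assumption for local testing).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, Name TEXT, Animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [
        (1, "Dr. Harris Bonkers", "Rabbit"),
        (2, "Moon", "Dog"),
        (3, "Ripley", "Cat"),
        (4, "Tom", "Cat"),
    ],
)

# Rows sharing the same Animal value form one group; COUNT(ID) runs once per group.
rows = conn.execute(
    "SELECT Animal, COUNT(ID) FROM pets GROUP BY Animal ORDER BY Animal"
).fetchall()
print(rows)  # [('Cat', 2), ('Dog', 1), ('Rabbit', 1)]
```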

## GROUP BY ... HAVING
HAVING is used in combination with GROUP BY to ignore groups that don't meet certain criteria.

So this query, for example, will only include groups that have exactly one ID in them.

In [4]:
# My own way of importing the data from BigQuery; the query itself is unchanged
df_pet = gbq.read_gbq('select Animal, count(ID) from pet_records.pet_records group by Animal having count(ID) = 1', 
                       project_id='numeric-dialect-275303')
df_pet

Unnamed: 0,Animal,f0_
0,Rabbit,1
1,Dog,1


Since two groups (Rabbit and Dog) meet the specified criterion, the query returns a table with two rows.
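To see how the HAVING condition changes which groups survive, the sketch below runs two variants side by side: one keeps groups with exactly one ID, the other keeps groups with more than one. It assumes an in-memory SQLite table named `pets` as a hypothetical stand-in for the BigQuery table, and the ORDER BY is added only for a stable row order.

```python
import sqlite3

# In-memory SQLite database standing in for BigQuery (an assumption for local testing).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, Name TEXT, Animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [
        (1, "Dr. Harris Bonkers", "Rabbit"),
        (2, "Moon", "Dog"),
        (3, "Ripley", "Cat"),
        (4, "Tom", "Cat"),
    ],
)

base = "SELECT Animal, COUNT(ID) FROM pets GROUP BY Animal HAVING COUNT(ID) {} ORDER BY Animal"

# HAVING filters whole groups after aggregation, not individual rows.
exactly_one = conn.execute(base.format("= 1")).fetchall()
more_than_one = conn.execute(base.format("> 1")).fetchall()

print(exactly_one)    # [('Dog', 1), ('Rabbit', 1)]
print(more_than_one)  # [('Cat', 2)]
```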

## Example: Which Hacker News comments generated the most discussion?
Ready to see an example on a real dataset? The Hacker News dataset contains information on stories and comments from the Hacker News social networking site.

The original tutorial works with the comments table. This notebook uses its own copy of the data, the hacker_news_dataset table of posts, and begins by printing its first 100 rows.

In [18]:
# Import the dataset from Google BigQuery with pandas
hacker_news = gbq.read_gbq('select * from hacker_news.hacker_news_dataset limit 100', 
                       project_id='numeric-dialect-275303')
hacker_news

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12121216,Valid.ly Never send another OOPS message,https://www.valid.ly,1,1,validly,2016-07-19 12:05:00+00:00
1,11610310,Ask HN: Aby recent changes to CSS that broke m...,,1,1,polskibus,2016-05-02 10:14:00+00:00
2,11590768,"Show HN: Shanhu.io, a programming playground p...",https://shanhu.io,1,1,h8liu,2016-04-28 18:05:00+00:00
3,10581844,"Analysis of 114 propaganda sources from ISIS, ...",http://37.252.122.95/sites/default/files/Insid...,1,1,crosre,2015-11-17 15:53:00+00:00
4,10402073,Predicting the Future and Exponential Growth,http://uday.io/2015/10/15/predicting-the-futur...,1,1,urs2102,2015-10-16 21:19:00+00:00
...,...,...,...,...,...,...,...
95,11301165,Can Dogs Detect Seizures? [2007?],http://www.epilepsy.com/information/profession...,1,1,YeGoblynQueenne,2016-03-16 22:28:00+00:00
96,11116381,"In Munich, a fightening preview of the rise of...",https://www.washingtonpost.com/opinions/in-mun...,1,1,citizensixteen,2016-02-17 09:03:00+00:00
97,12379186,Ask HN: Real time one way replication on Linux?,,1,1,Manozco,2016-08-29 00:07:00+00:00
98,10886516,Show HN: Cindr,http://cindr.com,1,1,mikeem,2016-01-12 10:19:00+00:00


In [6]:
hacker_news.head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12121216,Valid.ly Never send another OOPS message,https://www.valid.ly,1,1,validly,2016-07-19 12:05:00+00:00
1,11610310,Ask HN: Aby recent changes to CSS that broke m...,,1,1,polskibus,2016-05-02 10:14:00+00:00
2,11590768,"Show HN: Shanhu.io, a programming playground p...",https://shanhu.io,1,1,h8liu,2016-04-28 18:05:00+00:00
3,10581844,"Analysis of 114 propaganda sources from ISIS, ...",http://37.252.122.95/sites/default/files/Insid...,1,1,crosre,2015-11-17 15:53:00+00:00
4,10402073,Predicting the Future and Exponential Growth,http://uday.io/2015/10/15/predicting-the-futur...,1,1,urs2102,2015-10-16 21:19:00+00:00


Let's use the table to see which comments generated the most replies. Since:

- the parent column indicates the comment that was replied to, and
- the id column has the unique ID used to identify each comment,

we can GROUP BY the parent column and COUNT() the id column in order to figure out the number of comments that were made as responses to a specific comment. (This might not make sense immediately -- take your time here to ensure that everything is clear!)

Furthermore, since we're only interested in popular comments, we'll look at comments with more than ten replies. So, we'll only return groups HAVING more than ten IDs.
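If the parent/id grouping is hard to picture, a toy version may help. The sketch below assumes a hypothetical five-row `comments` table in an in-memory SQLite database (not the real Hacker News data) and lowers the threshold from ten to two so that the tiny table still produces a result.

```python
import sqlite3

# Hypothetical miniature comments table; the IDs are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER, parent INTEGER)")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?)",
    [
        (101, 1),    # three replies to item 1
        (102, 1),
        (103, 1),
        (104, 2),    # one reply to item 2
        (105, 101),  # one reply to comment 101
    ],
)

# Group the replies by what they reply to, then keep only parents with more than two replies.
popular = conn.execute(
    "SELECT parent, COUNT(id) FROM comments "
    "GROUP BY parent HAVING COUNT(id) > 2 ORDER BY parent"
).fetchall()
print(popular)  # [(1, 3)]
```

Each output row pairs a parent ID with the number of comments that name it as their parent, which is exactly the reply count.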

In [7]:
# Import the dataset from Google BigQuery with pandas (we have not downloaded the hacker_news.comments table),
# so we try the same pattern on the num_comments column instead
hacker_news = gbq.read_gbq('SELECT num_comments, COUNT(id) FROM hacker_news.hacker_news_dataset GROUP BY num_comments HAVING COUNT(id) > 10', 
                            project_id='numeric-dialect-275303')
hacker_news.head()

Unnamed: 0,num_comments,f0_
0,1,6884
1,2,2467
2,3,1265
3,4,772
4,5,554


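The query has been run and its results stored in a pandas DataFrame.

In the original tutorial's comments table, each row of the result corresponds to a comment that received more than ten replies. For instance, the comment with ID 801208 received 56 replies. In the stand-in query above, each row instead corresponds to a num_comments value shared by more than ten posts. For instance, 6884 posts have exactly one comment.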

## Aliasing and other improvements
A couple of hints to make your queries even better:

- The column resulting from COUNT(id) was called f0_. That's not a very descriptive name. You can change the name by adding AS NumPosts after you specify the aggregation. This is called aliasing, and it will be covered in more detail in an upcoming lesson.
- If you are ever unsure what to put inside the COUNT() function, you can do COUNT(1) to count the rows in each group. Most people find it especially readable, because we know it's not focusing on other columns. It also scans less data than it would if you supplied column names (making it faster and using less of your data access quota).

Using these tricks, we can rewrite our query:

In [8]:
# Import the dataset from Google BigQuery with pandas (we have not downloaded the hacker_news.comments table),
# so we try the same pattern on the num_comments column instead
hacker_news = gbq.read_gbq(
    'SELECT num_comments, COUNT(1) AS NumPosts FROM hacker_news.hacker_news_dataset GROUP BY num_comments HAVING COUNT(1) > 10', 
     project_id='numeric-dialect-275303')
hacker_news.head()

Unnamed: 0,num_comments,NumPosts
0,1,6884
1,2,2467
2,3,1265
3,4,772
4,5,554


Now you have the data you want, and it has descriptive names. That's good style.
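Aliasing and COUNT(1) can also be checked locally. This sketch assumes an in-memory SQLite table named `pets` and the alias `NumPets`, both hypothetical names chosen for illustration. It reads the result's column names from the cursor to confirm that the alias took effect.

```python
import sqlite3

# In-memory SQLite database standing in for BigQuery (an assumption for local testing).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, Name TEXT, Animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [
        (1, "Dr. Harris Bonkers", "Rabbit"),
        (2, "Moon", "Dog"),
        (3, "Ripley", "Cat"),
        (4, "Tom", "Cat"),
    ],
)

# COUNT(1) counts the rows in each group; AS gives the resulting column a readable name.
cursor = conn.execute(
    "SELECT Animal, COUNT(1) AS NumPets FROM pets "
    "GROUP BY Animal HAVING COUNT(1) > 1"
)
column_names = [column[0] for column in cursor.description]
rows = cursor.fetchall()

print(column_names)  # ['Animal', 'NumPets']
print(rows)          # [('Cat', 2)]
```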

## Note on using GROUP BY
Note that because GROUP BY tells SQL how to apply aggregate functions (like COUNT()), it doesn't make sense to use GROUP BY without an aggregate function. Similarly, if you have any GROUP BY clause, then every column in the SELECT list must be passed to either

1. a GROUP BY clause, or
2. an aggregate function.

Consider the query below:

In [9]:
# Import the dataset from Google BigQuery with pandas (we have not downloaded the hacker_news.comments table),
# so we try the same pattern on the num_comments column instead
hacker_news = gbq.read_gbq('SELECT num_comments, COUNT(id) FROM hacker_news.hacker_news_dataset GROUP BY num_comments', 
                            project_id='numeric-dialect-275303')
hacker_news.head()

Unnamed: 0,num_comments,f0_
0,1,6884
1,513,1
2,2,2467
3,514,2
4,3,1265


Note that there are two variables: num_comments and id.

- num_comments was passed to a GROUP BY clause (in GROUP BY num_comments), and
- id was passed to an aggregate function (in COUNT(id)).

And this query won't work, because the author column isn't passed to an aggregate function or a GROUP BY clause:

In [10]:
# Import the dataset from Google BigQuery with pandas (we have not downloaded the hacker_news.comments table),
# so we try the same pattern on the num_comments column instead
# hacker_news = gbq.read_gbq('SELECT author, num_comments, COUNT(id) FROM hacker_news.hacker_news_dataset GROUP BY num_comments', 
#                             project_id='numeric-dialect-275303')
# hacker_news.head()

# Yes, running this query does raise an error

If you make this mistake, you'll get the error message "SELECT list expression references column (column's name) which is neither grouped nor aggregated at".
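There are two usual ways to repair such a query: add the extra column to GROUP BY, or wrap it in an aggregate function. The sketch below shows both on a hypothetical three-row `posts` table in an in-memory SQLite database. Note that SQLite is more permissive than BigQuery and accepts an ungrouped column without raising this error, so only the two valid rewrites are shown here.

```python
import sqlite3

# Hypothetical miniature posts table; authors and values are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER, author TEXT, num_comments INTEGER)")
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [(1, "ann", 1), (2, "bob", 1), (3, "ann", 2)],
)

# Fix 1: every non-aggregated column in SELECT also appears in GROUP BY.
by_author_and_comments = conn.execute(
    "SELECT author, num_comments, COUNT(id) FROM posts "
    "GROUP BY author, num_comments ORDER BY author, num_comments"
).fetchall()

# Fix 2: the extra column is passed to an aggregate function instead.
authors_per_group = conn.execute(
    "SELECT num_comments, COUNT(DISTINCT author), COUNT(id) FROM posts "
    "GROUP BY num_comments ORDER BY num_comments"
).fetchall()

print(by_author_and_comments)  # [('ann', 1, 1), ('ann', 2, 1), ('bob', 1, 1)]
print(authors_per_group)       # [(1, 2, 2), (2, 1, 1)]
```

The two fixes answer different questions, so which one is right depends on whether the result should have one row per author and comment count, or one row per comment count.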

## Your turn
These aggregations let you write much more interesting queries. Try it yourself with these coding exercises.

# Exercise: Group By, Having & Count

## Introduction
Queries with GROUP BY can be powerful. There are many small things that can trip you up (like the order of the clauses), but it will start to feel natural once you've done it a few times. Here, you'll write queries using GROUP BY to answer questions from the Hacker News dataset.

Before you get started, run the following cell to set everything up:

In [11]:
hacker_news = gbq.read_gbq('select * from hacker_news.hacker_news_dataset limit 100',
                           project_id = 'numeric-dialect-275303')
hacker_news.head()

Unnamed: 0,id,title,url,num_points,num_comments,author,created_at
0,12121216,Valid.ly Never send another OOPS message,https://www.valid.ly,1,1,validly,2016-07-19 12:05:00+00:00
1,11610310,Ask HN: Aby recent changes to CSS that broke m...,,1,1,polskibus,2016-05-02 10:14:00+00:00
2,11590768,"Show HN: Shanhu.io, a programming playground p...",https://shanhu.io,1,1,h8liu,2016-04-28 18:05:00+00:00
3,10581844,"Analysis of 114 propaganda sources from ISIS, ...",http://37.252.122.95/sites/default/files/Insid...,1,1,crosre,2015-11-17 15:53:00+00:00
4,10402073,Predicting the Future and Exponential Growth,http://uday.io/2015/10/15/predicting-the-futur...,1,1,urs2102,2015-10-16 21:19:00+00:00


## Exercises

#### 1) Prolific commenters

Hacker News would like to send awards to everyone who has written more than 10,000 posts. Write a query that returns all authors with more than 10,000 posts as well as their post counts. Call the column with post counts `NumPosts`.

In case a sample query is helpful, here is a query from the original tutorial that answers a similar question:
    ```
    query = """
            SELECT parent, COUNT(1) AS NumPosts
            FROM `bigquery-public-data.hacker_news.comments`
            GROUP BY parent
            HAVING COUNT(1) > 10
            """
    ```
##### Answer on Kaggle

##### Trying it on the hacker_news dataset

In [12]:
# Note: this uses a threshold of 100, not the 10,000 posts named in the exercise
hacker_news = gbq.read_gbq('select author, count(1) as NumPosts from hacker_news.hacker_news_dataset group by author having count(1) > 100',
                           project_id = 'numeric-dialect-275303')
hacker_news.head()

Unnamed: 0,author,NumPosts
0,prostoalex,123
1,ingve,198


#### 2) Deleted comments
How many comments have been deleted? (If a comment was deleted, the deleted column in the comments table will have the value True.)

##### Answer on Kaggle

##### Trying it on the hacker_news dataset

In [13]:
hacker_news = gbq.read_gbq('SELECT COUNT(1) AS NumUrlNone from hacker_news.hacker_news_dataset WHERE url = "https://www.valid.ly"',
                           project_id = 'numeric-dialect-275303')
hacker_news.head()

Unnamed: 0,NumUrlNone
0,1
