## Introduction
Now that you know how to access and examine a dataset, you're ready to write your first SQL query! As you'll soon see, SQL queries help you sort through a massive dataset and retrieve only the information you need.

We'll begin by using the keywords SELECT, FROM, and WHERE to get data from specific columns based on conditions you specify.

For clarity, we'll work with a small imaginary dataset, pet_records, which contains just one table, called pets.

## SELECT ... FROM
The most basic SQL query selects a single column from a single table. To do this,

- specify the column you want after the word SELECT, and then
- specify the table after the word FROM.

For instance, to select the Name column (from the pets table in the pet_records database in the bigquery-public-data project), our query would appear as follows:

In [65]:
import pandas as pd
from pandas.io import gbq

In [66]:
df = gbq.read_gbq('SELECT * FROM pet_records.pet_records LIMIT 100', project_id='numeric-dialect-275303')

Downloading: 100%|█████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00,  9.76rows/s]


In [67]:
df

Unnamed: 0,ID,Name,Animal
0,1,Dr. Harris Bonkers,Rabbit
1,2,Moon,Dog
2,3,Ripley,Cat
3,4,Tom,Cat
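
The cell above uses `SELECT *`, which returns every column. To pull only the Name column, as the text describes, a minimal sketch follows. It rebuilds the four-row table in an in-memory SQLite database, which is only a local stand-in so the example runs without a BigQuery project (an assumption, not part of the original setup). In BigQuery the table would be referenced by its fully qualified name instead of the bare `pets`.

```python
import sqlite3

# Local stand-in for the pets table (rows copied from the output above).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, Name TEXT, Animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [
        (1, "Dr. Harris Bonkers", "Rabbit"),
        (2, "Moon", "Dog"),
        (3, "Ripley", "Cat"),
        (4, "Tom", "Cat"),
    ],
)

# The column goes after SELECT, the table goes after FROM.
query = "SELECT Name FROM pets"
names = [row[0] for row in conn.execute(query)]
print(names)
```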


## WHERE ...
BigQuery datasets are large, so you'll usually want to return only the rows meeting specific conditions. You can do this using the WHERE clause.

The query below returns the entries from the Name column that are in rows where the Animal column has the text 'Cat'.

##### How to import from Kaggle Micro-Courses


##### My own way

In [68]:
df = gbq.read_gbq('SELECT * FROM pet_records.pet_records WHERE Animal="Cat"', project_id='numeric-dialect-275303')

Downloading: 100%|█████████████████████████████████████████████████████████████████████| 2/2 [00:01<00:00,  1.17rows/s]


In [69]:
df

Unnamed: 0,ID,Name,Animal
0,3,Ripley,Cat
1,4,Tom,Cat
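
The cell above again selects all columns. A minimal sketch of the narrower query described in the text, which returns only the Name entries for rows where Animal is 'Cat', is shown below. As before, the in-memory SQLite table is a local stand-in for BigQuery and is an assumption made for illustration.

```python
import sqlite3

# Local stand-in for the pets table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, Name TEXT, Animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [
        (1, "Dr. Harris Bonkers", "Rabbit"),
        (2, "Moon", "Dog"),
        (3, "Ripley", "Cat"),
        (4, "Tom", "Cat"),
    ],
)

# WHERE keeps only the rows that satisfy the condition.
query = """
    SELECT Name
    FROM pets
    WHERE Animal = 'Cat'
"""
cat_names = [row[0] for row in conn.execute(query)]
print(cat_names)
```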


## Example: What are all the U.S. cities in the OpenAQ dataset?
Now that you've got the basics down, let's work through an example with a real dataset. We'll use an OpenAQ dataset about air quality.

First, we'll set up everything we need to run queries and take a quick peek at what tables are in our database.

The dataset contains only one table, called global_air_quality. We'll fetch the table and take a peek at the first few rows to see what sort of data it contains.

In [71]:
aq_indonesia = gbq.read_gbq('select * from openaq.openaq_indonesia limit 1000', project_id='numeric-dialect-275303')

Downloading: 100%|██████████████████████████████████████████████████████████████| 1000/1000 [00:03<00:00, 330.58rows/s]


In [72]:
aq_indonesia

Unnamed: 0,location,city,country,utc,local,parameter,value,unit,latitude,longitude,attribution
0,US Diplomatic Post: Jakarta South,Jakarta,ID,2020-04-23 20:00:00+00:00,2020-04-23 20:00:00+00:00,pm25,80,µg/m³,-6.236704,106.793240,"[{""name"":""EPA AirNow DOS"",""url"":""http://airnow..."
1,US Diplomatic Post: Jakarta South,Jakarta,ID,2020-04-23 19:00:00+00:00,2020-04-23 19:00:00+00:00,pm25,69,µg/m³,-6.236704,106.793240,"[{""name"":""EPA AirNow DOS"",""url"":""http://airnow..."
2,US Diplomatic Post: Jakarta South,Jakarta,ID,2020-04-23 18:00:00+00:00,2020-04-23 18:00:00+00:00,pm25,72,µg/m³,-6.236704,106.793240,"[{""name"":""EPA AirNow DOS"",""url"":""http://airnow..."
3,US Diplomatic Post: Jakarta South,Jakarta,ID,2020-04-23 17:00:00+00:00,2020-04-23 17:00:00+00:00,pm25,71,µg/m³,-6.236704,106.793240,"[{""name"":""EPA AirNow DOS"",""url"":""http://airnow..."
4,US Diplomatic Post: Jakarta South,Jakarta,ID,2020-04-23 01:00:00+00:00,2020-04-23 01:00:00+00:00,pm25,78,µg/m³,-6.236704,106.793240,"[{""name"":""EPA AirNow DOS"",""url"":""http://airnow..."
...,...,...,...,...,...,...,...,...,...,...,...
995,US Diplomatic Post: Jakarta Central,Jakarta,ID,2020-02-24 18:00:00+00:00,2020-02-24 18:00:00+00:00,pm25,8,µg/m³,-6.182536,106.834236,"[{""name"":""EPA AirNow DOS"",""url"":""http://airnow..."
996,US Diplomatic Post: Jakarta Central,Jakarta,ID,2020-02-23 17:00:00+00:00,2020-02-23 17:00:00+00:00,pm25,8,µg/m³,-6.182536,106.834236,"[{""name"":""EPA AirNow DOS"",""url"":""http://airnow..."
997,US Diplomatic Post: Jakarta Central,Jakarta,ID,2020-02-23 05:00:00+00:00,2020-02-23 05:00:00+00:00,pm25,8,µg/m³,-6.182536,106.834236,"[{""name"":""EPA AirNow DOS"",""url"":""http://airnow..."
998,US Diplomatic Post: Jakarta Central,Jakarta,ID,2020-02-19 18:00:00+00:00,2020-02-19 18:00:00+00:00,pm25,8,µg/m³,-6.182536,106.834236,"[{""name"":""EPA AirNow DOS"",""url"":""http://airnow..."


Everything looks good! So, let's put together a query. Say we want to select all the values from the city column that are in rows where the country column is 'US' (for "United States").

In [77]:
# Our dataset differs from the one used in the course
aq_indonesia = gbq.read_gbq('select * from openaq.openaq_indonesia where location = "US Diplomatic Post: Jakarta South" limit 50', project_id='numeric-dialect-275303')

Downloading: 100%|███████████████████████████████████████████████████████████████████| 50/50 [00:01<00:00, 26.95rows/s]


In [79]:
aq_indonesia.head()

Unnamed: 0,location,city,country,utc,local,parameter,value,unit,latitude,longitude,attribution
0,US Diplomatic Post: Jakarta South,Jakarta,ID,2020-04-23 20:00:00+00:00,2020-04-23 20:00:00+00:00,pm25,80,µg/m³,-6.236704,106.79324,"[{""name"":""EPA AirNow DOS"",""url"":""http://airnow..."
1,US Diplomatic Post: Jakarta South,Jakarta,ID,2020-04-23 19:00:00+00:00,2020-04-23 19:00:00+00:00,pm25,69,µg/m³,-6.236704,106.79324,"[{""name"":""EPA AirNow DOS"",""url"":""http://airnow..."
2,US Diplomatic Post: Jakarta South,Jakarta,ID,2020-04-23 18:00:00+00:00,2020-04-23 18:00:00+00:00,pm25,72,µg/m³,-6.236704,106.79324,"[{""name"":""EPA AirNow DOS"",""url"":""http://airnow..."
3,US Diplomatic Post: Jakarta South,Jakarta,ID,2020-04-23 17:00:00+00:00,2020-04-23 17:00:00+00:00,pm25,71,µg/m³,-6.236704,106.79324,"[{""name"":""EPA AirNow DOS"",""url"":""http://airnow..."
4,US Diplomatic Post: Jakarta South,Jakarta,ID,2020-04-23 01:00:00+00:00,2020-04-23 01:00:00+00:00,pm25,78,µg/m³,-6.236704,106.79324,"[{""name"":""EPA AirNow DOS"",""url"":""http://airnow..."


Take the time now to ensure that this query lines up with what you learned above.

## Submitting the query to the dataset
We're ready to use this query to get information from the OpenAQ dataset. As in the previous tutorial, the first step is to create a Client object.

We begin by setting up the query with the query() method. We run the method with the default parameters, but this method also allows us to specify more complicated settings that you can read about in the documentation. We'll revisit this later.

Next, we run the query and convert the results to a pandas DataFrame.

Now we've got a pandas DataFrame called us_cities, which we can use like any other DataFrame.
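
The steps described above can be sketched roughly as follows. This assumes the `google-cloud-bigquery` package and working credentials, neither of which appears in the cells of this notebook, and the helper name `run_query` is hypothetical. The client is passed in as an argument so the function does not depend on how you authenticate.

```python
def run_query(client, query):
    """Submit `query` through a BigQuery client and return a pandas DataFrame."""
    # Set up the query with the default job settings.
    query_job = client.query(query)
    # Run the query and convert the results to a pandas DataFrame.
    return query_job.to_dataframe()


us_cities_query = """
    SELECT city
    FROM `bigquery-public-data.openaq.global_air_quality`
    WHERE country = 'US'
"""

# Usage (needs google-cloud-bigquery installed and credentials configured):
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   us_cities = run_query(client, us_cities_query)
```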

## More queries
If you want multiple columns, you can select them with a comma between the names:

You can select all columns with a * like this:
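
Both forms can be tried locally. The sketch below uses the small pets table in an in-memory SQLite database as a stand-in for a BigQuery table (an assumption for illustration; the course's own example queries the OpenAQ table).

```python
import sqlite3

# Local stand-in for the pets table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, Name TEXT, Animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [
        (1, "Dr. Harris Bonkers", "Rabbit"),
        (2, "Moon", "Dog"),
        (3, "Ripley", "Cat"),
        (4, "Tom", "Cat"),
    ],
)

# Multiple columns: separate the names with commas.
name_and_animal = conn.execute("SELECT Name, Animal FROM pets").fetchall()

# All columns: use * in place of the column names.
all_columns = conn.execute("SELECT * FROM pets").fetchall()

print(name_and_animal)
print(all_columns)
```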

## Q&A: Notes on formatting
The formatting of the SQL query might feel unfamiliar. If you have any questions, you can ask in the comments section at the bottom of this page. Here are answers to two common questions:

##### Question: What's up with the triple quotation marks (""")?
Answer: These tell Python that everything inside them is a single string, even though we have line breaks in it. The line breaks aren't necessary, but they make it easier to read your query.
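
A short sketch of this point: the two strings below hold the same query, once on a single line and once spread over several lines inside triple quotes. Collapsing the whitespace shows that the text is otherwise identical.

```python
single_line = "SELECT city FROM `bigquery-public-data.openaq.global_air_quality` WHERE country = 'US'"

multi_line = """
    SELECT city
    FROM `bigquery-public-data.openaq.global_air_quality`
    WHERE country = 'US'
"""

# The triple-quoted version is still one string; it just contains line breaks.
print("\n" in multi_line)  # True

# Collapsing runs of whitespace gives back the single-line query text.
normalized = " ".join(multi_line.split())
print(normalized == single_line)  # True
```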

##### Question: Do you need to capitalize SELECT and FROM?
Answer: No, SQL keywords such as SELECT and FROM are not case-sensitive. However, it's customary to capitalize your SQL commands, and it makes your queries a bit easier to read.
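
As a small illustration, the sketch below runs the same query with upper-case and lower-case keywords against an in-memory SQLite table (a local stand-in for BigQuery, assumed here so the example runs anywhere). It also shows that the quoted text is data rather than a keyword: in this SQLite setup, 'cat' does not match 'Cat'.

```python
import sqlite3

# Local stand-in for the pets table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (ID INTEGER, Name TEXT, Animal TEXT)")
conn.executemany(
    "INSERT INTO pets VALUES (?, ?, ?)",
    [
        (1, "Dr. Harris Bonkers", "Rabbit"),
        (2, "Moon", "Dog"),
        (3, "Ripley", "Cat"),
        (4, "Tom", "Cat"),
    ],
)

# Keywords can be written in either case and give the same result.
upper = conn.execute("SELECT Name FROM pets WHERE Animal = 'Cat'").fetchall()
lower = conn.execute("select Name from pets where Animal = 'Cat'").fetchall()
print(upper == lower)

# The string being compared is compared as written, so 'cat' finds nothing here.
wrong_case = conn.execute("SELECT Name FROM pets WHERE Animal = 'cat'").fetchall()
print(wrong_case)
```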

## Working with big datasets
BigQuery datasets can be huge. We allow you to do a lot of computation for free, but everyone has some limit.

Each Kaggle user can scan 5TB every 30 days for free. Once you hit that limit, you'll have to wait for it to reset.

The biggest dataset currently on Kaggle is 3TB, so you can go through your 30-day limit in a couple queries if you aren't careful.
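
The arithmetic behind that warning, using only the two figures quoted above:

```python
QUOTA_TB = 5            # free scan quota per 30 days
LARGEST_DATASET_TB = 3  # size of the biggest dataset mentioned above

# Number of complete scans of that dataset that fit inside the quota.
full_scans = QUOTA_TB // LARGEST_DATASET_TB

# Quota left after those scans; a second full scan would not fit.
remaining_tb = QUOTA_TB - full_scans * LARGEST_DATASET_TB

print(full_scans, remaining_tb)
```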

Don't worry though: we'll teach you how to avoid scanning too much data at once, so that you don't run over your limit.

To begin, you can estimate the size of any query before running it. Here is an example using the (very large!) Hacker News dataset. To see how much data a query will scan, we create a QueryJobConfig object and set the dry_run parameter to True.
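
A rough sketch of such a dry run is shown below. It assumes the `google-cloud-bigquery` package and working credentials; the helper name `estimate_bytes_scanned` is hypothetical. Because the Hacker News query text is not shown in this notebook, the usage comment reuses the OpenAQ query instead.

```python
def estimate_bytes_scanned(client, query):
    """Return how many bytes `query` would process, without running it."""
    # Imported inside the function so the sketch can be defined
    # even where the package is not installed.
    from google.cloud import bigquery

    # dry_run=True asks BigQuery to estimate the query without executing it.
    dry_run_config = bigquery.QueryJobConfig(dry_run=True)
    dry_run_job = client.query(query, job_config=dry_run_config)
    return dry_run_job.total_bytes_processed


# Usage (needs google-cloud-bigquery installed and credentials configured):
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   query = """
#       SELECT city
#       FROM `bigquery-public-data.openaq.global_air_quality`
#       WHERE country = 'US'
#   """
#   print(estimate_bytes_scanned(client, query))
```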

You can also specify a parameter when running the query to limit how much data you are willing to scan. Here's an example with a low limit.

In this case, the query was cancelled, because the limit of 1 MB was exceeded. However, we can increase the limit to run the query successfully!
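
A sketch of how such a limit might be set, under the same assumptions as before (the `google-cloud-bigquery` package and working credentials; the helper name `run_with_byte_limit` is hypothetical). The limit is passed through the `maximum_bytes_billed` setting of `QueryJobConfig`.

```python
ONE_MB = 1000 * 1000
ONE_GB = 1000 * 1000 * 1000


def run_with_byte_limit(client, query, max_bytes):
    """Run `query`, failing instead of scanning more than `max_bytes`."""
    # Imported inside the function so the sketch can be defined
    # even where the package is not installed.
    from google.cloud import bigquery

    safe_config = bigquery.QueryJobConfig(maximum_bytes_billed=max_bytes)
    safe_query_job = client.query(query, job_config=safe_config)
    # If the query would exceed the limit, the job is rejected and
    # an exception is raised instead of a result being returned.
    return safe_query_job.to_dataframe()


# Usage (needs google-cloud-bigquery installed and credentials configured):
#   from google.cloud import bigquery
#   client = bigquery.Client()
#   run_with_byte_limit(client, query, ONE_MB)  # too low for a large table
#   run_with_byte_limit(client, query, ONE_GB)  # higher limit
```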

## Exercise: Select, From & Where

### 1) Units of measurement
Which countries have reported pollution levels in units of "ppm"? In the code cell below, set first_query to an SQL query that pulls the appropriate entries from the country column.

In case it's useful to see an example query, here's some code from the tutorial:

    query = """ SELECT city FROM `bigquery-public-data.openaq.global_air_quality` WHERE country = 'US' """

In [88]:
# Our dataset differs from the one used in the course
aq_indonesia = gbq.read_gbq('select * from openaq.openaq_indonesia where unit = "ppm" limit 50', project_id='numeric-dialect-275303')

Downloading: 0rows [00:01, ?rows/s]


In [89]:
# There are no rows with unit = "ppm" in the Indonesia dataset
aq_indonesia

Unnamed: 0,location,city,country,utc,local,parameter,value,unit,latitude,longitude,attribution


### 2) High air quality
Which pollution levels were reported to be exactly 0?

- Set zero_pollution_query to select all columns of the rows where the value column is 0.
- Set zero_pollution_results to a pandas DataFrame containing the query results.
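
As a rough illustration of the query shape, the sketch below runs it against an in-memory SQLite table holding a few rows and columns copied from the outputs in this notebook. SQLite and the reduced column list are assumptions made so the example runs locally, and the results come back as plain tuples rather than a pandas DataFrame.

```python
import sqlite3

# Local stand-in with a subset of the air-quality columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE global_air_quality "
    "(location TEXT, city TEXT, country TEXT, parameter TEXT, value REAL, unit TEXT)"
)
conn.executemany(
    "INSERT INTO global_air_quality VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("US Diplomatic Post: Jakarta South", "Jakarta", "ID", "pm25", 80, "µg/m³"),
        ("US Diplomatic Post: Jakarta South", "Jakarta", "ID", "pm25", 69, "µg/m³"),
        ("US Diplomatic Post: Jakarta Central", "Jakarta", "ID", "pm25", 0, "µg/m³"),
        ("US Diplomatic Post: Jakarta Central", "Jakarta", "ID", "pm25", 8, "µg/m³"),
    ],
)

# Select every column of the rows where value is exactly 0.
zero_pollution_query = """
    SELECT *
    FROM global_air_quality
    WHERE value = 0
"""
zero_rows = conn.execute(zero_pollution_query).fetchall()
print(zero_rows)
```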

In [90]:
# Our dataset differs from the one used in the course
aq_indonesia = gbq.read_gbq('select * from openaq.openaq_indonesia where value = 0 limit 50', project_id='numeric-dialect-275303')

Downloading: 100%|█████████████████████████████████████████████████████████████████████| 6/6 [00:00<00:00, 13.28rows/s]


In [91]:
aq_indonesia

Unnamed: 0,location,city,country,utc,local,parameter,value,unit,latitude,longitude,attribution
0,US Diplomatic Post: Jakarta Central,Jakarta,ID,2020-01-29 22:00:00+00:00,2020-01-29 22:00:00+00:00,pm25,0,µg/m³,-6.182536,106.834236,"[{""name"":""EPA AirNow DOS"",""url"":""http://airnow..."
1,US Diplomatic Post: Jakarta Central,Jakarta,ID,2020-01-27 09:00:00+00:00,2020-01-27 09:00:00+00:00,pm25,0,µg/m³,-6.182536,106.834236,"[{""name"":""EPA AirNow DOS"",""url"":""http://airnow..."
2,US Diplomatic Post: Jakarta Central,Jakarta,ID,2020-01-27 08:00:00+00:00,2020-01-27 08:00:00+00:00,pm25,0,µg/m³,-6.182536,106.834236,"[{""name"":""EPA AirNow DOS"",""url"":""http://airnow..."
3,Jakarta Central,,ID,2020-01-29 21:00:00+00:00,2020-01-29 21:00:00+00:00,pm25,0,µg/m³,-6.182536,106.834235,"[{""name"":""US EPA AirNow"",""url"":""http://www.air..."
4,Jakarta Central,,ID,2020-01-27 08:00:00+00:00,2020-01-27 08:00:00+00:00,pm25,0,µg/m³,-6.182536,106.834235,"[{""name"":""US EPA AirNow"",""url"":""http://www.air..."
5,Jakarta Central,,ID,2020-01-27 07:00:00+00:00,2020-01-27 07:00:00+00:00,pm25,0,µg/m³,-6.182536,106.834235,"[{""name"":""US EPA AirNow"",""url"":""http://www.air..."


That query wasn't too complicated, and it got the data you want. But these SELECT queries don't organize data in a way that answers the most interesting questions. For that, we'll need the GROUP BY command.

If you know how to use groupby() in pandas, this is similar. But BigQuery works quickly with far larger datasets.

Fortunately, that's next.