<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Objectives" data-toc-modified-id="Objectives-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Objectives</a></span></li><li><span><a href="#Grouping-in-SQL" data-toc-modified-id="Grouping-in-SQL-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Grouping in SQL</a></span><ul class="toc-item"><li><span><a href="#Grouping-statements" data-toc-modified-id="Grouping-statements-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Grouping statements</a></span><ul class="toc-item"><li><span><a href="#Task" data-toc-modified-id="Task-2.1.1"><span class="toc-item-num">2.1.1&nbsp;&nbsp;</span>Task</a></span><ul class="toc-item"><li><span><a href="#Possible-Solution" data-toc-modified-id="Possible-Solution-2.1.1.1"><span class="toc-item-num">2.1.1.1&nbsp;&nbsp;</span>Possible Solution</a></span></li></ul></li><li><span><a href="#Exercise" data-toc-modified-id="Exercise-2.1.2"><span class="toc-item-num">2.1.2&nbsp;&nbsp;</span>Exercise</a></span></li><li><span><a href="#Exercise" data-toc-modified-id="Exercise-2.1.3"><span class="toc-item-num">2.1.3&nbsp;&nbsp;</span>Exercise</a></span></li></ul></li></ul></li><li><span><a href="#Joins" data-toc-modified-id="Joins-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Joins</a></span></li><li><span><a href="#Level-Up:-Subqueries" data-toc-modified-id="Level-Up:-Subqueries-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Level Up: Subqueries</a></span></li><li><span><a href="#Level-Up:-Using-SQL-Within-pandas" data-toc-modified-id="Level-Up:-Using-SQL-Within-pandas-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Level Up: Using SQL Within <code>pandas</code></a></span><ul class="toc-item"><li><span><a href="#Index-Filtering" data-toc-modified-id="Index-Filtering-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Index Filtering</a></span></li><li><span><a href="#.query()" data-toc-modified-id=".query()-5.2"><span 
class="toc-item-num">5.2&nbsp;&nbsp;</span><code>.query()</code></a></span></li><li><span><a href="#pandasql" data-toc-modified-id="pandasql-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span><code>pandasql</code></a></span></li></ul></li><li><span><a href="#Level-Up:-Other-Dialects-of-SQL" data-toc-modified-id="Level-Up:-Other-Dialects-of-SQL-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Level Up: Other Dialects of SQL</a></span><ul class="toc-item"><li><span><a href="#More-Resources" data-toc-modified-id="More-Resources-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>More Resources</a></span></li></ul></li><li><span><a href="#Level-Up:-BigQuery" data-toc-modified-id="Level-Up:-BigQuery-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Level Up: BigQuery</a></span></li></ul></div>

![sql](sql-logo.jpg)

In [None]:
import pandas as pd
import sqlite3
import pandasql

conn = sqlite3.connect("flights.db")
cur = conn.cursor()

# Objectives

- Use SQL aggregation functions with GROUP BY
- Use HAVING for group filtering
- Use SQL JOIN to combine tables using keys

# Grouping in SQL

## Grouping statements

Combine `SELECT` with `GROUP BY` when you want *aggregates* computed separately for each distinct value of a column.

`SELECT COUNT(*)`, `MIN(x)`, `MAX(x)`, `SUM(x)`, etc.

`GROUP BY x`
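
As a minimal sketch of this pattern, the cell below builds a small in-memory SQLite table and groups it. The `orders` table, its columns, and the `demo` connection are hypothetical and have nothing to do with `flights.db`; they only keep the example self-contained. The second query adds `HAVING`, which filters whole groups after aggregation, whereas `WHERE` filters individual rows before grouping.

```python
import sqlite3

# Hypothetical toy table (not part of flights.db), used only for illustration
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE orders (customer TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        ('ana', 10), ('ana', 30), ('ben', 5), ('cy', 7), ('cy', 8), ('cy', 9);
""")

# One output row per customer, with aggregates computed inside each group
grouped = demo.execute("""
    SELECT customer, COUNT(*) AS n, MIN(amount), MAX(amount), SUM(amount)
    FROM orders
    GROUP BY customer
    ORDER BY customer
""").fetchall()

# HAVING keeps only the groups whose aggregate satisfies the condition
big_groups = demo.execute("""
    SELECT customer, COUNT(*) AS n
    FROM orders
    GROUP BY customer
    HAVING COUNT(*) > 1
    ORDER BY customer
""").fetchall()

print(grouped)
print(big_groups)
```

The same queries can be passed to `pd.read_sql` with a connection if you prefer a DataFrame over a list of tuples.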

### Task

- Which countries have the highest numbers of active airlines? Return the top 25.

#### Possible Solution

In [None]:
pd.read_sql('''
    SELECT 
        COUNT(*) AS num,
        country
    FROM 
        airlines
    WHERE 
        active='Y'
    GROUP BY 
        country
    ORDER BY 
        num DESC
    LIMIT 25
''', conn)

### Exercise

- Which countries have the highest numbers of inactive airlines? Return all the countries that have more than 10.

In [None]:
# Your code here


### Exercise

- Run a query that will return the number of airports by time zone. Each row should have a number of airports and a time zone.

In [None]:
# Your code here


# Joins
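
The objectives mention combining tables using keys, so here is a small sketch of what that can look like. The `carriers` and `trips` tables, their columns, and the `demo` connection are hypothetical toy data, not the schema of `flights.db`. An inner `JOIN` keeps only rows whose key appears in both tables, while a `LEFT JOIN` keeps every row of the left table even when there is no match.

```python
import sqlite3

# Hypothetical toy tables linked by the key carrier_id
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE carriers (carrier_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE trips (trip_id INTEGER PRIMARY KEY, carrier_id INTEGER, dest TEXT);
    INSERT INTO carriers VALUES (1, 'Alpha Air'), (2, 'Beta Jet'), (3, 'Gamma Wings');
    INSERT INTO trips VALUES (10, 1, 'LIS'), (11, 1, 'OSL'), (12, 2, 'NRT');
""")

# Inner join: only carriers that have at least one matching trip
inner_rows = demo.execute("""
    SELECT c.name, t.dest
    FROM carriers AS c
    JOIN trips AS t ON t.carrier_id = c.carrier_id
    ORDER BY t.trip_id
""").fetchall()

# Left join: every carrier appears; COUNT(t.trip_id) ignores the NULLs
# produced for carriers without trips, so they get a count of 0
left_rows = demo.execute("""
    SELECT c.name, COUNT(t.trip_id) AS num_trips
    FROM carriers AS c
    LEFT JOIN trips AS t ON t.carrier_id = c.carrier_id
    GROUP BY c.carrier_id, c.name
    ORDER BY c.carrier_id
""").fetchall()

print(inner_rows)
print(left_rows)
```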

# Level Up: Subqueries

For more complex queries it can be helpful to break them down into multiple parts. Subqueries are a natural way to do this.

Suppose I wanted to know, after collecting together the highest airport in each country, which one's name comes alphabetically first.

I might break this down by first collecting the highest airport in each country and then _wrapping_ that query in an outer query that selects the name and country I want _from_ the result of that first query:

In [None]:
pd.read_sql('''
    SELECT 
        MIN(name), 
        country, 
        altitude 
    FROM (
        SELECT 
            name, 
            code, 
            country, 
            MAX(CAST(altitude AS INT)) AS altitude
        FROM 
            airports
        GROUP BY 
            country
    )
''', conn)
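
To see the two steps separately, here is a sketch on a tiny hypothetical table (`peaks`, with made-up names and altitudes, on a throwaway `demo` connection). It relies on a SQLite-specific behavior: when a query uses `MIN()` or `MAX()`, the other "bare" columns are taken from the row that holds the minimum or maximum. Other SQL dialects may reject such a query or return different rows.

```python
import sqlite3

# Hypothetical toy data standing in for the airports table
demo = sqlite3.connect(":memory:")
demo.executescript("""
    CREATE TABLE peaks (name TEXT, country TEXT, altitude INTEGER);
    INSERT INTO peaks VALUES
        ('Zulu Field', 'A', 100), ('Echo Strip', 'A', 900),
        ('Mike Base',  'B', 500), ('Bravo Port', 'B', 300);
""")

# Step 1 (inner query): the highest entry in each country
inner_sql = """
    SELECT name, country, MAX(altitude) AS altitude
    FROM peaks
    GROUP BY country
"""
highest = demo.execute(inner_sql + " ORDER BY country").fetchall()

# Step 2 (outer query): among those rows, the alphabetically first name
first = demo.execute(
    f"SELECT MIN(name), country, altitude FROM ({inner_sql})"
).fetchone()

print(highest)
print(first)
```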

In [None]:
# What will this query return?

pd.read_sql('''
SELECT 
    name, 
    city,
    CASE 
        WHEN latitude > 0 
            THEN 'northern hemisphere'
            ELSE 'southern hemisphere'
    END AS hemisphere
FROM 
    airports
''', conn)

# Level Up: Using SQL Within `pandas`

You can use SQL syntax to query a DataFrame with the `pandasql` package. The examples below filter the same DataFrame three ways: with boolean indexing, with `.query()`, and with a SQL query run through `pandasql`.

In [1]:
iris_df = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris_df.shape


## Index Filtering

In [2]:
iris_setosa_df = iris_df[iris_df['species'] == 'setosa']
iris_setosa_df.shape


## `.query()`

[query documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html)

In [None]:
iris_setosa_df = iris_df.query("species == 'setosa'")
iris_setosa_df.shape

## `pandasql`

In [None]:
import pandasql

iris_setosa_df = pandasql.sqldf("""

SELECT * 
FROM iris_df 
WHERE species = 'setosa'

""", env=globals())

iris_setosa_df.shape
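
The three approaches should select the same rows. The sketch below checks this on a small hypothetical DataFrame (`toy_df`, made up for illustration) so that it runs without downloading anything. For the SQL route it does not use `pandasql`; instead it writes the frame to an in-memory SQLite database with `to_sql` and reads it back with `pd.read_sql`, which is only a stand-in for what `pandasql.sqldf` does for you.

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for iris_df
toy_df = pd.DataFrame({
    "species": ["setosa", "virginica", "setosa", "versicolor"],
    "petal_length": [1.4, 5.1, 1.3, 4.7],
})

# 1. Boolean-mask (index) filtering
by_mask = toy_df[toy_df["species"] == "setosa"]

# 2. The same filter expressed as a .query() string
by_query = toy_df.query("species == 'setosa'")

# 3. SQL filtering via an in-memory SQLite table (assumed substitute for pandasql)
mem = sqlite3.connect(":memory:")
toy_df.to_sql("toy", mem, index=False)
by_sql = pd.read_sql("SELECT * FROM toy WHERE species = 'setosa'", mem)

print(by_mask.shape, by_query.shape, by_sql.shape)
```

Note that the SQL result comes back with a fresh index, while the two `pandas` filters keep the original row labels.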

# Level Up: Other Dialects of SQL

There are many versions of SQL out there! Here are a few of the major players in the workplace:

- SQLite (we've already seen this!)
- T-SQL (Transact-SQL, used by Microsoft)
- PostgreSQL (free and open-source!)
- Oracle SQL
- MySQL (open-source, but owned by Oracle)

## More Resources

- ["What SQL Dialect to Learn" blog](https://learnsql.com/blog/what-sql-dialect-to-learn/)
- There's a whole [wikibook](https://en.wikibooks.org/wiki/SQL_Dialects_Reference) on this!

# Level Up: BigQuery

Some data warehouse tools are designed to store and query big data.

One cloud tool is Google Cloud's [BigQuery](https://cloud.google.com/bigquery).

![](https://storage.googleapis.com/gweb-uniblog-publish-prod/original_images/Google_Cloud_Covered.png)

> **NOTE:** 
> You will need a Google Cloud account for this, but you should be able to do everything here for free.

We're going to play with some [Google Cloud Public Datasets](https://cloud.google.com/public-datasets). Specifically, we'll explore the [USA Names](https://console.cloud.google.com/marketplace/product/social-security-administration/us-names) dataset of names from all Social Security card applications for births over the past century.

<!--

SELECT year, gender, SUM(number) , STRING_AGG(DISTINCT state)
FROM `bigquery-public-data.usa_names.usa_1910_current` as usa_names_current
where name = 'Corey' and gender = 'F'
group by gender, year
ORDER BY year desc
LIMIT 100

-->