# Subqueries

* Which rows are above the average for the ShareWomen column?

Using the SQL we've learned so far, there's no way to write a query that answers this question. Aggregate functions like `AVG()` can't be used directly in the `WHERE` clause. The following query:

```sql
SELECT * FROM recent_grads
WHERE ShareWomen > AVG(ShareWomen)
```

will return an error:
```sql
(sqlite3.OperationalError) misuse of aggregate function AVG() [SQL: 'SELECT * FROM recent_grads WHERE ShareWomen > AVG(ShareWomen)']
```
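The error can be reproduced outside the course environment. The sketch below assumes an in-memory SQLite database and a toy `recent_grads` table with made-up rows (not the real dataset); the exact wording of the message may differ between SQLite versions.

```python
import sqlite3

# Toy stand-in for recent_grads; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [("MAJOR A", 0.2), ("MAJOR B", 0.5), ("MAJOR C", 0.8)],
)

try:
    # AVG() placed directly in WHERE is rejected when the statement is prepared.
    conn.execute("SELECT * FROM recent_grads WHERE ShareWomen > AVG(ShareWomen)")
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)

print(error_message)
```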

We need to instead learn **how to break up a question we want to answer into a series of queries that can be combined**.

In [1]:
import pandas as pd
import pymysql

host_name = "choigww-freedb"
username = ""
password = ""
database_name = "dataquest"

db = pymysql.connect(
    host=host_name,  # DATABASE_HOST
    port=3306,
    user=username,  # DATABASE_USERNAME
    passwd=password,  # DATABASE_PASSWORD
    db=database_name,  # DATABASE_NAME
    charset='utf8'
)
cursor = db.cursor()
cursor.execute("set names utf8")
db.commit()

To determine which majors are above the average for the `ShareWomen` column, we need to:

* first determine the average value for the `ShareWomen` column
* then select and filter the rows that are greater than the average value

If we had to do this using Python and pandas, we would compute the average of `ShareWomen`, store it in a variable, and then use that variable in a table filter. While variables dominate how we express logic in object-oriented programming languages like Python and Java, the core SQL we're using here doesn't have variables. The designers of SQL, a [declarative programming language](https://en.wikipedia.org/wiki/Declarative_programming), wanted its users to focus on expressing computations rather than explicitly defining, setting, and juggling variables.
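For contrast, here is roughly what the variable-based approach looks like when driven from Python. This is a minimal sketch, assuming an in-memory SQLite database and a made-up three-row `recent_grads` table; the names `avg_share` and `above_avg` are hypothetical.

```python
import sqlite3

# Toy stand-in for recent_grads; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [("MAJOR A", 0.25), ("MAJOR B", 0.5), ("MAJOR C", 0.75)],
)

# Step 1: compute the average and keep it in a Python variable.
(avg_share,) = conn.execute("SELECT AVG(ShareWomen) FROM recent_grads").fetchone()

# Step 2: feed the stored value back in as a query parameter.
above_avg = conn.execute(
    "SELECT Major, ShareWomen FROM recent_grads WHERE ShareWomen > ?",
    (avg_share,),
).fetchall()

print(avg_share, above_avg)
```

This takes two round trips to the database and keeps the intermediate value on the Python side, which is exactly what a single SQL statement avoids.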

* What would the query look like if we already knew the average value for the `ShareWomen` column?

```sql
SELECT Major, ShareWomen FROM recent_grads
WHERE ShareWomen > 0.5225502029537575
```

Now, how do we make the computed average value, 0.5225502029537575, dynamic?

* Let's introduce the SQL way to solve this problem -- **subqueries**. A subquery is a query nested within another query. Here's a template for a SQL statement where the subquery resides in the `WHERE` clause:

```sql
SELECT Major, ShareWomen FROM recent_grads
WHERE ShareWomen > (subquery that returns the average value for ShareWomen)
```

The subquery is run first and returns the average value for the `ShareWomen` column (which happens to be `0.5225502029537575`). Based on the result of the subquery, SQL will replace the subquery with this value dynamically. Note that SQL will ignore the column name (`AVG(ShareWomen)`) and is smart enough to just use the actual row value. Here's a diagram that makes the flow clearer:

![](https://s3.amazonaws.com/dq-content/255/subquery_one.svg)
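The same flow can be checked by hand. The sketch below assumes an in-memory SQLite database and a made-up four-row table; it first runs the inner query on its own, then the full statement with the subquery in the `WHERE` clause.

```python
import sqlite3

# Toy stand-in for recent_grads; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [("MAJOR A", 0.25), ("MAJOR B", 0.5), ("MAJOR C", 0.75), ("MAJOR D", 1.0)],
)

# The inner query on its own: a table with one row and one column.
inner_result = conn.execute("SELECT AVG(ShareWomen) FROM recent_grads").fetchall()

# The full statement: the subquery supplies the value to compare against.
above_avg = conn.execute(
    """
    SELECT Major, ShareWomen FROM recent_grads
    WHERE ShareWomen > (SELECT AVG(ShareWomen) FROM recent_grads)
    ORDER BY Major
    """
).fetchall()

print(inner_result, above_avg)
```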

The query that replaces the placeholder `subquery` needs to be a **full query (containing `SELECT` and `FROM` clauses, etc.)** that works even if it's run separately. In addition, the inner query should return a table with only a single row and a single column because of where it fits in the outer query (`... WHERE ShareWomen > ?`). If you instead try to return a table with multiple columns, for example, the following error will be returned:

```sql
(sqlite3.OperationalError) only a single result allowed for a SELECT that is part of an expression [SQL: 'SELECT Major, ShareWomen FROM recent_grads WHERE ShareWomen > (SELECT Major, AVG(ShareWomen) FROM recent_grads)']
```

Lastly, **a subquery must always be contained within parentheses `()`**; without them, the statement fails with a syntax error.
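Both rules can be probed with a small experiment. This is a sketch under stated assumptions: an in-memory SQLite database, a made-up `recent_grads` table, and a hypothetical helper `try_query` that returns the error text instead of raising. The wording of the messages depends on the SQLite version, so the code only records whether a statement failed.

```python
import sqlite3

# Toy stand-in for recent_grads; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [("MAJOR A", 0.25), ("MAJOR B", 0.75)],
)

def try_query(sql):
    """Hypothetical helper: return None on success, else the SQLite error text."""
    try:
        conn.execute(sql).fetchall()
        return None
    except sqlite3.OperationalError as exc:
        return str(exc)

outer = "SELECT Major FROM recent_grads WHERE ShareWomen > "

# Well-formed: parenthesised subquery returning one column.
ok = try_query(outer + "(SELECT AVG(ShareWomen) FROM recent_grads)")
# Subquery returns two columns where one value is expected.
two_columns = try_query(outer + "(SELECT Major, AVG(ShareWomen) FROM recent_grads)")
# Subquery without the surrounding parentheses.
no_parens = try_query(outer + "SELECT AVG(ShareWomen) FROM recent_grads")

print(ok, two_columns, no_parens, sep="\n")
```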

Write a query that returns the majors that are below the average for Unemployment_rate. The results should:
* only contain the Major and Unemployment_rate columns
* be sorted in ascending order by Unemployment_rate

In [5]:
sql = '''
        select Major, Unemployment_rate
        from recent_grads
        where Unemployment_rate <
                    (select AVG(Unemployment_rate)
                    from recent_grads
                    )
        order by Unemployment_rate
        limit 10
'''
pd.read_sql(sql, db)

| | Major | Unemployment_rate |
|---|---|---|
| 0 | MILITARY TECHNOLOGIES | 0.0 |
| 1 | SOIL SCIENCE | 0.0 |
| 2 | BOTANY | 0.0 |
| 3 | EDUCATIONAL ADMINISTRATION AND SUPERVISION | 0.0 |
| 4 | MATHEMATICS AND COMPUTER SCIENCE | 0.0 |
| 5 | ENGINEERING MECHANICS PHYSICS AND SCIENCE | 0.006334 |
| 6 | COURT REPORTING | 0.01169 |
| 7 | MATHEMATICS TEACHER EDUCATION | 0.016203 |
| 8 | PETROLEUM ENGINEERING | 0.018381 |
| 9 | GENERAL AGRICULTURE | 0.019642 |


On the last screen, we wrote SQL statements that used a subquery to express dynamic filter criteria in the `WHERE` clause. Specifically, we were interested in rows that were above or below the average value in a specific column. What if we wanted to understand the proportion of majors that are above the average for a given column? We'd need to divide the number of rows that meet the filter criteria by the total number of rows in the table.

* Let's focus on the query from the last screen:

In [3]:
sql = '''
        SELECT Major, ShareWomen FROM recent_grads
        WHERE ShareWomen > (SELECT AVG(ShareWomen) FROM recent_grads)
        limit 10
'''
pd.read_sql(sql, db)

| | Major | ShareWomen |
|---|---|---|
| 0 | ASTRONOMY AND ASTROPHYSICS | 0.535714 |
| 1 | PUBLIC POLICY | 0.558548 |
| 2 | NURSING | 0.896019 |
| 3 | NUCLEAR, INDUSTRIAL RADIOLOGY, AND BIOLOGICAL ... | 0.750473 |
| 4 | ACCOUNTING | 0.524153 |
| 5 | MEDICAL TECHNOLOGIES TECHNICIANS | 0.753927 |
| 6 | STATISTICS AND DECISION SCIENCE | 0.526476 |
| 7 | PHARMACOLOGY | 0.707719 |
| 8 | OCEANOGRAPHY | 0.688999 |
| 9 | MEDICAL ASSISTING SERVICES | 0.927807 |


Using the `COUNT()` aggregate function, we can return the number of rows the result set contains:

In [4]:
sql = '''
        SELECT COUNT(*) FROM recent_grads
        WHERE ShareWomen > (SELECT AVG(ShareWomen)
        FROM recent_grads)
'''
pd.read_sql(sql, db)

| | `COUNT(*)` |
|---|---|
| 0 | 90 |


To return the proportion, we need to divide this value by the total number of rows in `recent_grads`. The challenge, however, is that we don't know the total number of rows, and we don't want to hard-code a count that could go out of date.

* To dynamically calculate the number of total rows in recent_grads and be able to use it in another SQL statement, we can use a subquery in the SELECT clause:

In [5]:
sql = '''
        SELECT COUNT(*), (SELECT COUNT(*) 
                          FROM recent_grads) 
        FROM recent_grads
        WHERE ShareWomen > (SELECT AVG(ShareWomen) 
                            FROM recent_grads)
'''
pd.read_sql(sql, db)

| | `COUNT(*)` | `(SELECT COUNT(*) FROM recent_grads)` |
|---|---|---|
| 0 | 90 | 172 |
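A subquery in the `SELECT` clause behaves the same way on a small scale. The sketch below assumes an in-memory SQLite database and a made-up four-row table, so the counts are small enough to verify by eye.

```python
import sqlite3

# Toy stand-in for recent_grads; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [("MAJOR A", 0.25), ("MAJOR B", 0.5), ("MAJOR C", 0.75), ("MAJOR D", 1.0)],
)

# First column: rows passing the filter. Second column: all rows in the table,
# computed by a subquery that is not affected by the outer WHERE clause.
above_count, total_count = conn.execute(
    """
    SELECT COUNT(*), (SELECT COUNT(*) FROM recent_grads)
    FROM recent_grads
    WHERE ShareWomen > (SELECT AVG(ShareWomen) FROM recent_grads)
    """
).fetchone()

print(above_count, total_count)
```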


#### instructions
* Write a SQL statement that computes the proportion (as a float value) of rows that contain above-average values for the `ShareWomen` column.
* The results should only return the proportion, aliased as `proportion_abv_avg`, like so (with a different value):

In [9]:
sql = '''
        SELECT COUNT(*) / (SELECT COUNT(*) 
                           FROM recent_grads) 
        FROM recent_grads
        WHERE ShareWomen > (SELECT AVG(ShareWomen) 
                            FROM recent_grads)
'''
pd.read_sql(sql, db)

| | `COUNT(*) / (SELECT COUNT(*) FROM recent_grads)` |
|---|---|
| 0 | 0.5233 |


In [None]:
sql = '''
        SELECT CAST(COUNT(*) as float) 
                / 
                cast((SELECT COUNT(*) 
                      FROM recent_grads) 
                      as float) 
                proportion_abv_avg
        
        FROM recent_grads
        
        WHERE ShareWomen > (SELECT AVG(ShareWomen) 
                            FROM recent_grads)

'''

pd.read_sql(sql, db)

In [11]:
# MySQL's `/` operator performs decimal division even on integer operands,
# so no explicit cast to float is needed here.

sql = '''
        SELECT COUNT(*) 
                / 
                (SELECT COUNT(*) 
                 FROM recent_grads) 
                proportion_abv_avg
        
        FROM recent_grads
        
        WHERE ShareWomen > (SELECT AVG(ShareWomen) 
                            FROM recent_grads)

'''

pd.read_sql(sql, db)

| | proportion_abv_avg |
|---|---|
| 0 | 0.5233 |
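The explicit cast matters on SQLite, the engine behind the error messages quoted earlier: there, dividing one integer by another truncates the result. A minimal sketch, assuming an in-memory SQLite database and a made-up four-row table:

```python
import sqlite3

# Toy stand-in for recent_grads; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [("MAJOR A", 0.25), ("MAJOR B", 0.5), ("MAJOR C", 0.75), ("MAJOR D", 1.0)],
)

where = "WHERE ShareWomen > (SELECT AVG(ShareWomen) FROM recent_grads)"

# Integer / integer: SQLite truncates 2 / 4 to 0.
(int_division,) = conn.execute(
    f"SELECT COUNT(*) / (SELECT COUNT(*) FROM recent_grads) FROM recent_grads {where}"
).fetchone()

# Casting both counts to float gives the actual proportion.
(proportion_abv_avg,) = conn.execute(
    f"""
    SELECT CAST(COUNT(*) AS float)
           / CAST((SELECT COUNT(*) FROM recent_grads) AS float) AS proportion_abv_avg
    FROM recent_grads {where}
    """
).fetchone()

print(int_division, proportion_abv_avg)
```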


So far, the subqueries we've used have computed an aggregate value of some kind and returned that value to the outer query to use for filtering. This is because we only worked with the `<` and `>` operators, which, by definition, expect a single value to compare against in a filter. As we learned earlier in this course [from the documentation](https://sqlite.org/lang_expr.html), SQLite contains all of the following operators:

![](https://s3.amazonaws.com/dq-content/255/sqlite_operators.png)

Using the `IN` operator, we can specify a list of values that we want to match against in the `WHERE` clause. All rows that match exactly will be returned. The following query returns the rows where `Major_category` equals either `Business` or `Engineering`:


In [12]:
sql = '''
        select Major, Major_category
        from recent_grads
        where Major_category in
            ('Business', 'Engineering')
        limit 7
'''
pd.read_sql(sql, db)

| | Major | Major_category |
|---|---|---|
| 0 | PETROLEUM ENGINEERING | Engineering |
| 1 | MINING AND MINERAL ENGINEERING | Engineering |
| 2 | METALLURGICAL ENGINEERING | Engineering |
| 3 | NAVAL ARCHITECTURE AND MARINE ENGINEERING | Engineering |
| 4 | CHEMICAL ENGINEERING | Engineering |
| 5 | NUCLEAR ENGINEERING | Engineering |
| 6 | ACTUARIAL SCIENCE | Business |


Opportunities like this, where we've hard-coded values, are usually good candidates for converting to a subquery. Instead of returning the rows where `Major_category` equals one of 2 specific values, we can write a subquery that returns the 5 `Major_category` values with the highest group-level sums for the `Total` column:

In [15]:
sql = '''
        select Major_category, sum(Total)
        from recent_grads
        group by Major_category
        order by sum(Total) desc
        limit 5
'''
pd.read_sql(sql, db)

| | Major_category | sum(Total) |
|---|---|---|
| 0 | Business | 1302376.0 |
| 1 | Humanities & Liberal Arts | 713468.0 |
| 2 | Education | 559129.0 |
| 3 | Engineering | 537583.0 |
| 4 | Social Science | 529966.0 |


#### instructions
Write a query that returns the `Major` and `Major_category` columns for the rows where:
* `Major_category` is one of the 5 categories with the highest group-level sums for the `Total` column

In [None]:
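sql = '''
       select Major, Major_category
       from recent_grads
       where Major_category
       in (select Major_category
           from (select Major_category
                 from recent_grads
                 group by Major_category
                 order by sum(Total) desc
                 limit 5) as top_categories)
'''
pd.read_sql(sql, db) 
# MySQL rejects LIMIT directly inside an IN subquery
# ("This version of MySQL doesn't yet support 
# 'LIMIT & IN/ALL/ANY/SOME subquery'"), so the LIMIT query
# is wrapped in a derived table (aliased top_categories here).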
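SQLite accepts `LIMIT` inside an `IN` subquery, so the query can be tried there directly. The sketch below assumes an in-memory SQLite database, a made-up six-row table, and the top 2 categories rather than the top 5 (the toy table only has four). It also runs a second form that wraps the `LIMIT` query in a derived table, a common workaround for the MySQL restriction quoted above, to check that both forms select the same rows.

```python
import sqlite3

# Toy stand-in for recent_grads; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads (Major TEXT, Major_category TEXT, Total INTEGER)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?)",
    [
        ("ACCOUNTING", "Business", 300),
        ("FINANCE", "Business", 200),
        ("NURSING", "Health", 250),
        ("HISTORY", "Humanities", 100),
        ("BOTANY", "Biology", 50),
        ("ZOOLOGY", "Biology", 40),
    ],
)

# Inner query: the 2 categories with the largest group-level sums of Total.
top_n = """
    SELECT Major_category FROM recent_grads
    GROUP BY Major_category
    ORDER BY SUM(Total) DESC
    LIMIT 2
"""

# Form 1: LIMIT subquery used directly inside IN.
direct = conn.execute(
    f"SELECT Major FROM recent_grads WHERE Major_category IN ({top_n}) ORDER BY Major"
).fetchall()

# Form 2: the same subquery wrapped in a derived table.
wrapped = conn.execute(
    f"""
    SELECT Major FROM recent_grads
    WHERE Major_category IN (SELECT Major_category FROM ({top_n}) AS top_categories)
    ORDER BY Major
    """
).fetchall()

print(direct)
print(wrapped)
```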

When you have a SQL statement you want to write that will end up using many subqueries, it can be overwhelming at first to know how to start. In general, you want to start with the inner queries first and work your way outwards. Let's say we're interested in understanding the ratio of the `Sample_size` column to the `Total` column. You can read the [dataset documentation](https://github.com/fivethirtyeight/data/tree/master/college-majors) if you need a reminder for what these columns represent.

Specifically, let's say we're interested in:

* computing this ratio for every major
* understanding which majors are above the average for this ratio
* understanding how many majors are above the average for this ratio

We'll start by writing a query that computes the ratio for every major and then the average of all of these ratios.

#### instructions
Write a query that returns the average ratio (`Sample_size/Total`) for all of the majors.
* You'll need to cast both columns to the float type.
* Use the alias `avg_ratio` for the average ratio.

In [37]:
sql = '''
        select Total
                    /
                    (select sum(Total) 
                    from recent_grads)
                    ratios
                    
        from recent_grads
        group by Major
        limit 10
'''
pd.read_sql(sql, db)

| | ratios |
|---|---|
| 0 | 0.029333 |
| 1 | 0.000558 |
| 2 | 0.007851 |
| 3 | 0.002224 |
| 4 | 0.00036 |
| 5 | 0.002103 |
| 6 | 0.003186 |
| 7 | 0.005736 |
| 8 | 0.000729 |
| 9 | 0.000417 |


In [44]:
sql = '''
        select AVG(
                Sample_size
                /
                Total)
                avg_ratio 
        from recent_grads
'''
pd.read_sql(sql, db)

| | avg_ratio |
|---|---|
| 0 | 0.009091 |
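The instruction to cast both columns matters on SQLite, where integer columns divide as integers. A minimal sketch, assuming an in-memory SQLite database and a made-up table whose `Sample_size` and `Total` columns hold integers:

```python
import sqlite3

# Toy stand-in for recent_grads; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads (Major TEXT, Sample_size INTEGER, Total INTEGER)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?)",
    [("MAJOR A", 25, 100), ("MAJOR B", 50, 100), ("MAJOR C", 75, 100)],
)

# Without casts, each Sample_size / Total truncates to 0 before averaging.
(uncast_avg,) = conn.execute(
    "SELECT AVG(Sample_size / Total) FROM recent_grads"
).fetchone()

# With casts, the ratios are 0.25, 0.5 and 0.75, so the average is 0.5.
(avg_ratio,) = conn.execute(
    """
    SELECT AVG(CAST(Sample_size AS float) / CAST(Total AS float)) AS avg_ratio
    FROM recent_grads
    """
).fetchone()

print(uncast_avg, avg_ratio)
```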


Write a query that:
* selects the Major, Major_category, and the computed ratio columns
* filters to just the rows where `ratio` is greater than `avg_ratio`:
  * recall that this value is the result of the subquery from the last screen: `select AVG(cast(Sample_size as float)/cast(Total as float)) avg_ratio from recent_grads`


In [None]:
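sql = '''
        select 
            Major, 
            Major_category, 
            Sample_size/Total as ratio
        from recent_grads
        where Sample_size/Total > (select avg(Sample_size/Total)
                                   from recent_grads)
'''
pd.read_sql(sql, db)
# FROM must come before WHERE. MySQL also does not allow a SELECT alias
# (ratio) in the WHERE clause, so the expression is repeated there;
# the alias form works on the Dataquest website.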

In [None]:
# Dataquest answer (it references the ratio alias in WHERE, which MySQL rejects)
sql = '''
        select Major, Major_category, cast(Sample_size as float)/cast(Total as float) ratio from recent_grads
where ratio > (select AVG(cast(Sample_size as float)/cast(Total as float)) avgratio from recent_grads)
'''
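Working from the inside out, the three goals listed earlier (the ratio for every major, which majors are above the average, and how many) can be assembled from one inner query. This is a sketch under stated assumptions: an in-memory SQLite database, a made-up four-row table, and hypothetical Python names (`ratio`, `avg_ratio_query`, `above_avg_query`, `count_query`). The ratio expression is repeated in `WHERE` instead of using the alias so the SQL does not depend on dialect-specific alias rules.

```python
import sqlite3

# Toy stand-in for recent_grads; the rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads "
    "(Major TEXT, Major_category TEXT, Sample_size INTEGER, Total INTEGER)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?, ?)",
    [
        ("MAJOR A", "Cat X", 25, 100),
        ("MAJOR B", "Cat X", 50, 100),
        ("MAJOR C", "Cat Y", 75, 100),
        ("MAJOR D", "Cat Y", 100, 100),
    ],
)

# Innermost piece: the per-row ratio, then its average over the table.
ratio = "CAST(Sample_size AS float) / CAST(Total AS float)"
avg_ratio_query = f"SELECT AVG({ratio}) FROM recent_grads"

# Next layer out: rows whose ratio exceeds the average.
above_avg_query = f"""
    SELECT Major, Major_category, {ratio} AS ratio
    FROM recent_grads
    WHERE {ratio} > ({avg_ratio_query})
    ORDER BY Major
"""

# Outermost question: how many such rows are there?
count_query = f"""
    SELECT COUNT(*) FROM recent_grads
    WHERE {ratio} > ({avg_ratio_query})
"""

above_avg = conn.execute(above_avg_query).fetchall()
(above_count,) = conn.execute(count_query).fetchone()

print(above_avg, above_count)
```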