## 1: Introduction

In the last mission, we computed summary statistics across columns with SQL. In many cases, though, we want to drill down even more and compute summary statistics per group. In this mission, we'll explore how to calculate more granular summary statistics. We'll switch back to writing SQL queries directly instead of using Python so we can focus on the SQL syntax.

We'll be working with a data set on jobs we stored in the recent_grads table of jobs.db. Each row represents a single college major, and contains information about post-graduation employment of students who studied the major. You can find out more about the data set in FiveThirtyEight's GitHub repository for the project. Here are some descriptions for just a few of the 21 total columns:

- Rank - The major's rank by median earnings
- Major_code - The major's code or ID
- Major - The name of the major
- Major_category - The broader category the major belongs to
- Total - The total number of people who studied the major
- Men - The number of male graduates
- Women - The number of female graduates
- ShareWomen - Women as a proportion of the total number of graduates (a number ranging from 0 to 1)
- Employed - The number of employed graduates

Here are the first few rows and columns in the data set:

| Rank | Major_code | Major                                     | Major_category | Total | Sample_size | Men   | Women | ShareWomen | Employed |
|------|------------|-------------------------------------------|----------------|-------|-------------|-------|-------|------------|----------|
| 1    | 2419       | PETROLEUM ENGINEERING                     | Engineering    | 2339  | 36          | 2057  | 282   | 0.120564   | 1976     |
| 2    | 2416       | MINING AND MINERAL ENGINEERING            | Engineering    | 756   | 7           | 679   | 77    | 0.101852   | 640      |
| 3    | 2415       | METALLURGICAL ENGINEERING                 | Engineering    | 856   | 3           | 725   | 131   | 0.153037   | 648      |
| 4    | 2417       | NAVAL ARCHITECTURE AND MARINE ENGINEERING | Engineering    | 1258  | 16          | 1123  | 135   | 0.107313   | 758      |
| 5    | 2405       | CHEMICAL ENGINEERING                      | Engineering    | 32260 | 289         | 21239 | 11021 | 0.341631   | 25694    |

As we progress through this mission, we'll drill down and compute summary statistics by group to answer questions like:

- What's the share of women in each major category?
- Which major categories have the greatest numbers of employed graduates?
- What percentage of people in each major category end up in low-wage jobs?

First, let's explore the data.

## 2: Introduction

#### Instructions:
Write a SQL query that displays all of the columns and the first five rows of the recent_grads table.

In [1]:
query = "select * from recent_grads limit 5;"

## 3: Calculating Group-Level Summary Statistics

The SQL GROUP BY statement allows us to compute summary statistics by "group," or unique value. When we use this statement, SQL creates a group for each unique value in a column or set of columns (the same values we get when we use the DISTINCT statement), and then runs the calculations once per group. To illustrate, we can find the total number of people employed in each major category with the following query:


``SELECT SUM(Employed) 
FROM recent_grads 
GROUP BY Major_category;``

This will give us the total number of employed graduates for each major category. Here's a truncated view of the output:

| SUM(Employed) |
|---------------|
| 66943         |
| 288114        |
| 302797        |

The output shows the sum of the Employed column for each Major_category. Unfortunately, it doesn't indicate which major category each row refers to. We can fix this by including the Major_category column in our query:


``SELECT Major_category, SUM(Employed) 
FROM recent_grads 
GROUP BY Major_category;``

This makes the output much easier to understand:

| Major_category                  | SUM(Employed) |
|---------------------------------|---------------|
| Agriculture & Natural Resources | 66943         |
| Arts                            | 288114        |
| Biology & Life Science          | 302797        |
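To try the same grouping outside of jobs.db, here is a minimal sketch that uses Python's built-in sqlite3 module. The in-memory table, the major names, and the four rows are made up for illustration (they are not the real recent_grads data), and the ORDER BY is added only to make the row order predictable.

```python
import sqlite3

# Hypothetical stand-in for jobs.db: a tiny in-memory table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads (Major TEXT, Major_category TEXT, Employed INTEGER)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?)",
    [
        ("MAJOR A", "Agriculture", 3149),
        ("MAJOR B", "Agriculture", 1290),
        ("MAJOR C", "Arts", 2910),
        ("MAJOR D", "Arts", 36165),
    ],
)

# One output row per unique Major_category, with Employed summed inside each group.
query = """
SELECT Major_category, SUM(Employed)
FROM recent_grads
GROUP BY Major_category
ORDER BY Major_category;
"""
totals = conn.execute(query).fetchall()
print(totals)  # [('Agriculture', 4439), ('Arts', 39075)]
```

Each category appears once in the result, no matter how many majors it contains.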

Here's how the query works. The GROUP BY statement splits the rows of the table into groups based on the Major_category column (with one group for each unique major category), then calculates the sum for each group. The following diagram, which uses a small sample from the recent_grads table, shows how GROUP BY splits the data:

<img src="./pics/p1.png">

For each group, the GROUP BY statement queries each column, and runs all of the aggregation functions we include in the query after the SELECT statement:

<img src="./pics/p2.png">

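If a plain column is selected, SQLite returns the value of that column from a single row in the group. In the diagram that is the **last value**, but SQLite doesn't guarantee which row it picks, and many other databases reject such a query outright. If an aggregation function is selected, the SQL engine will compute the value for that aggregation function across the group.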

The query in the diagram will give us the following result:

| Employed | Major_category | SUM(Employed) |
|----------|----------------|---------------|
| 1290     | Agriculture    | 4439          |
| 36165    | Arts           | 39075         |
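The sketch below runs a query of the same shape against a small in-memory SQLite table. The rows are hypothetical, chosen so the sums match the table above. Because the plain Employed column is not aggregated, the code only checks that its value belongs to the group and does not assume which row it comes from.

```python
import sqlite3

# Hypothetical sample rows (category, employed); not the real recent_grads data.
rows = [
    ("Agriculture", 3149),
    ("Agriculture", 1290),
    ("Arts", 2910),
    ("Arts", 36165),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major_category TEXT, Employed INTEGER)")
conn.executemany("INSERT INTO recent_grads VALUES (?, ?)", rows)

# Employed is a plain column here; SUM(Employed) is computed across each group.
query = """
SELECT Employed, Major_category, SUM(Employed)
FROM recent_grads
GROUP BY Major_category;
"""
result = conn.execute(query).fetchall()

for employed, category, total in result:
    members = [e for c, e in rows if c == category]
    # The plain value is taken from one row of the group; the sum covers all of them.
    print(category, "plain value in group:", employed in members, "sum:", total)
```

If a specific value per group is needed, an aggregate such as MAX(Employed) or MIN(Employed) states the intent explicitly.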

## 4: Calculating Group-Level Summary Statistics

#### Instructions:
- Use the SELECT statement to select the following columns and aggregates in a query:
    - Major_category
    - AVG(ShareWomen)
- Use the GROUP BY statement to group the query by the Major_category column.

In [6]:
import sqlite3

conn = sqlite3.connect("data/jobs.db")
query = "select Major_category, avg(ShareWomen) from recent_grads group by major_category;"
conn.execute(query).fetchall()

[('Agriculture & Natural Resources', 0.6179384232),
 ('Arts', 0.56185119575),
 ('Biology & Life Science', 0.584518475857143),
 ('Business', 0.4050631853076923),
 ('Communications & Journalism', 0.64383484025),
 ('Computers & Mathematics', 0.5127519954545455),
 ('Education', 0.6749855163125),
 ('Engineering', 0.2571578951034483),
 ('Health', 0.6168565694166667),
 ('Humanities & Liberal Arts', 0.6761934042),
 ('Industrial Arts & Consumer Services', 0.4493512688571429),
 ('Interdisciplinary', 0.495397153),
 ('Law & Public Policy', 0.3359896912),
 ('Physical Sciences', 0.5087494197),
 ('Psychology & Social Work', 0.7777631628888888),
 ('Social Science', 0.5390672957777778)]

## 5: Renaming Columns With The AS Statement

You may have noticed that on the last screen, specifying AVG(ShareWomen) caused the column to have that name in the results. This can lead to confusion, and make it difficult to work with the results of SQL queries. To avoid this issue, we can select and rename columns at the same time with the AS statement. Here's an example:


``SELECT AVG(ShareWomen) AS average_female_share 
FROM recent_grads;``

This query will result in the following output:

| average_female_share |
|----------------------|
| 0.5225502029537575   |
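From Python, one way to confirm the effect of AS is to inspect the column names the cursor reports. This is a small sketch with a hypothetical two-row table, not the real data; `cursor.description` is part of the standard sqlite3 module.

```python
import sqlite3

# Hypothetical two-row table, just enough to compute an average.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [("MAJOR A", 0.25), ("MAJOR B", 0.75)],
)

cursor = conn.execute(
    "SELECT AVG(ShareWomen) AS average_female_share FROM recent_grads;"
)
# The first item of each description entry is the column name in the result set.
column_names = [col[0] for col in cursor.description]
value = cursor.fetchone()[0]
print(column_names, value)  # ['average_female_share'] 0.5
```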

#### Instructions:
- Write a query that selects the following items, in order, and renames them with AS:
    - SUM(Men) as total_men.
    - SUM(Women) as total_women.

In [10]:
query = "select sum(men) as total_men , sum(women) as total_women from recent_grads;"

import pandas as pd

db = pd.read_sql(query, conn)
db

| total_men | total_women |
|-----------|-------------|
| 2878263   | 3897752     |


## 6: Practice: Using GROUP BY

Now that we have a better understanding of the GROUP BY statement, let's practice using it by computing summary statistics by group for the recent_grads table.

#### Instructions: 

- For each major category, find the share of graduates who are employed (a proportion between 0 and 1).
    - Use the SELECT statement to select the following columns and aggregates in your query:
        - Major_category
        - AVG(Employed) / AVG(Total) as share_employed
    - Use the GROUP BY statement to group the query by the Major_category column.

In [12]:
query = '''select Major_category, avg(Employed) / avg(Total) as share_employed from 
recent_grads group by Major_category'''

db = pd.read_sql(query, conn)
db

| Major_category                  | share_employed |
|---------------------------------|----------------|
| Agriculture & Natural Resources | 0.836986       |
| Arts                            | 0.806748       |
| Biology & Life Science          | 0.667157       |
| Business                        | 0.835966       |
| Communications & Journalism     | 0.842229       |
| Computers & Mathematics         | 0.795611       |
| Education                       | 0.85819        |
| Engineering                     | 0.781967       |
| Health                          | 0.803374       |
| Humanities & Liberal Arts       | 0.762638       |


## 7: Querying Virtual Columns With The HAVING Statement

Sometimes we want to select a subset of rows after performing a GROUP BY query. On the last screen, for instance, we may have wanted to select only those rows where share_employed is greater than .8. We can't use the WHERE clause to do this because share_employed isn't a column in recent_grads; it's actually a virtual column generated by the GROUP BY statement.

When we want to filter on a column generated by a query, we can use the HAVING statement. Here's an example:


``SELECT Major_category, AVG(Employed) / AVG(Total) AS share_employed 
FROM recent_grads 
GROUP BY Major_category 
HAVING share_employed > .8;``

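Note that we used the same column name in the HAVING statement that we originally specified with the AS statement. SQLite allows us to reuse a custom column name in HAVING; some other databases (PostgreSQL, for example) require repeating the full expression there. A name that stands for an aggregate still can't be used in WHERE, because WHERE filters rows before the groups are formed. Here's a truncated view of the output: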

| Major_category                  | share_employed     |
|---------------------------------|--------------------|
| Agriculture & Natural Resources | 0.8369862842425075 |
| Arts                            | 0.8067482429367457 |
| Business                        | 0.8359659576036412 |
| Communications & Journalism     | 0.8422291333949735 |

Note that the results only include categories where share_employed is greater than .8. That's because the HAVING statement filters out the other rows.
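As a rough illustration of the difference between the two filters, the sketch below runs the HAVING query on a hypothetical in-memory table (the categories and counts are invented) and then tries to put the same aggregate condition in WHERE, which SQLite rejects.

```python
import sqlite3

# Hypothetical data: Category A employs 85% of its graduates, Category B 60%.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads (Major_category TEXT, Employed INTEGER, Total INTEGER)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?)",
    [
        ("Category A", 90, 100),
        ("Category A", 80, 100),
        ("Category B", 50, 100),
        ("Category B", 70, 100),
    ],
)

# HAVING filters the grouped rows, so it can see the aggregated value.
having_query = """
SELECT Major_category, AVG(Employed) / AVG(Total) AS share_employed
FROM recent_grads
GROUP BY Major_category
HAVING share_employed > .8;
"""
kept = conn.execute(having_query).fetchall()
print(kept)  # only Category A remains

# WHERE filters individual rows before grouping, so an aggregate is not allowed there.
where_query = """
SELECT Major_category, AVG(Employed) / AVG(Total)
FROM recent_grads
WHERE AVG(Employed) / AVG(Total) > .8
GROUP BY Major_category;
"""
try:
    conn.execute(where_query).fetchall()
    where_failed = False
except sqlite3.OperationalError as exc:
    where_failed = True
    print("WHERE rejected the aggregate:", exc)
```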

#### Instructions: 
- Find all of the major categories where the share of graduates with low-wage jobs is greater than .1.
    - Use the SELECT statement to select the following columns and aggregates in a query:
        - Major_category
        - AVG(Low_wage_jobs) / AVG(Total) as share_low_wage
    - Use the GROUP BY statement to group the query by the Major_category column.
    - Use the HAVING statement to restrict the selection to rows where share_low_wage is greater than .1.

In [14]:
query = '''select Major_category, avg(Low_wage_jobs) / avg(Total) as share_low_wage
from recent_grads group by Major_category having share_low_wage > 0.1;'''

db = pd.read_sql(query, conn)
db

| Major_category                      | share_low_wage |
|-------------------------------------|----------------|
| Arts                                | 0.168331       |
| Communications & Journalism         | 0.126324       |
| Humanities & Liberal Arts           | 0.132087       |
| Industrial Arts & Consumer Services | 0.115713       |
| Law & Public Policy                 | 0.115685       |
| Psychology & Social Work            | 0.116934       |
| Social Science                      | 0.102233       |


## 8: Rounding Results With The ROUND Function

On the last screen, the percentages in our results were very long and hard to read (e.g., 0.16833085991095678). We can use the SQL ROUND function in our query to round them. Here's an example of what this looks like:


``SELECT Major_category, ROUND(ShareWomen, 2) AS rounded_share_women 
FROM recent_grads;``

The query will round the ShareWomen column to two decimal places. Here's a truncated view of the results:

| Major_category | rounded_share_women |
|----------------|---------------------|
| Engineering    | 0.12                |
| Engineering    | 0.1                 |

By passing a different second argument to the ROUND function, such as ROUND(ShareWomen, 3), we can round to a different number of decimal places.
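To see several precisions side by side, here is a small sketch that loads two ShareWomen values from the sample rows shown in the introduction into an in-memory table (the table itself is a stand-in for jobs.db) and rounds each to 2, 3, and 4 decimal places.

```python
import sqlite3

# In-memory stand-in for jobs.db, holding two ShareWomen values from the sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE recent_grads (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?)",
    [
        ("PETROLEUM ENGINEERING", 0.120564),
        ("MINING AND MINERAL ENGINEERING", 0.101852),
    ],
)

# The second argument of ROUND is the number of decimal places to keep.
query = """
SELECT ROUND(ShareWomen, 2), ROUND(ShareWomen, 3), ROUND(ShareWomen, 4)
FROM recent_grads
ORDER BY rowid;
"""
rounded = conn.execute(query).fetchall()
for row in rounded:
    print(row)
# (0.12, 0.121, 0.1206)
# (0.1, 0.102, 0.1019)
```

Trailing zeros are not kept, which is why 0.101852 rounded to two places displays as 0.1 rather than 0.10.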

#### Instructions:
- Write a SQL query that returns the following columns of recent_grads (in the same order):
    - ShareWomen rounded to 4 decimal places
    - Major_category
- Limit the results to 10 rows.

In [15]:
query = '''select round(ShareWomen,4), Major_category from recent_grads limit 10'''

db = pd.read_sql(query, conn)
db

| round(ShareWomen,4) | Major_category    |
|---------------------|-------------------|
| 0.1206              | Engineering       |
| 0.1019              | Engineering       |
| 0.153               | Engineering       |
| 0.1073              | Engineering       |
| 0.3416              | Engineering       |
| 0.145               | Engineering       |
| 0.5357              | Business          |
| 0.4414              | Physical Sciences |
| 0.1398              | Engineering       |
| 0.4378              | Engineering       |


## 9: Nesting Functions

On a previous screen, we ran the following query:


``SELECT Major_category, AVG(Employed) / AVG(Total) AS share_employed 
FROM recent_grads 
GROUP BY Major_category 
HAVING share_employed > .8;``

This query returned very long fractional values for share_employed. We can update our query with the ROUND function to round the results to three decimal places:


``SELECT Major_category, ROUND(AVG(Employed) / AVG(Total), 3) AS share_employed 
FROM recent_grads 
GROUP BY Major_category 
HAVING share_employed > .8;``

This will return the following result:

| Major_category                  | share_employed |
|---------------------------------|----------------|
| Agriculture & Natural Resources | 0.837          |
| Arts                            | 0.807          |
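The nested call can be tried on a toy table as well. In this sketch the table and its two categories are hypothetical: one has an employed share of 5/6 and the other 1/3, so only the first passes the HAVING filter, and its share comes back rounded to three places.

```python
import sqlite3

# Hypothetical data: Category A has a share of 5/6, Category B a share of 1/3.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads (Major_category TEXT, Employed INTEGER, Total INTEGER)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?)",
    [("Category A", 5, 6), ("Category B", 1, 3)],
)

# AVG(...) / AVG(...) is evaluated first; ROUND then trims the result to 3 places.
query = """
SELECT Major_category, ROUND(AVG(Employed) / AVG(Total), 3) AS share_employed
FROM recent_grads
GROUP BY Major_category
HAVING share_employed > .8;
"""
shares = conn.execute(query).fetchall()
print(shares)  # [('Category A', 0.833)]
```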

#### Instructions:
- Use the SELECT statement to select the following columns and aggregates in a query:
    - Major_category
    - AVG(College_jobs) / AVG(Total) as share_degree_jobs
        - Use the ROUND function to round share_degree_jobs to 3 decimal places.
- Group the query by the Major_category column.
- Only select rows where share_degree_jobs is less than .3.

In [16]:
query = '''select Major_category, round(avg(College_jobs) / Avg(total), 3) as 
share_degree_jobs from recent_grads group by Major_category having share_degree_jobs < 0.3;'''

db = pd.read_sql(query, conn)
db

| Major_category                      | share_degree_jobs |
|-------------------------------------|-------------------|
| Agriculture & Natural Resources     | 0.248             |
| Arts                                | 0.265             |
| Business                            | 0.114             |
| Communications & Journalism         | 0.22              |
| Humanities & Liberal Arts           | 0.27              |
| Industrial Arts & Consumer Services | 0.249             |
| Law & Public Policy                 | 0.163             |
| Social Science                      | 0.215             |


## 10: Next Steps

In this mission, we covered the GROUP BY and HAVING statements. We can use these statements to quickly calculate powerful summary statistics in SQL. In the next few missions, we'll learn more about working with SQL tables, including how to insert and modify data.