# Group Summary Statistics

The SQL `GROUP BY` clause allows us to compute summary statistics by "group," or unique value. When we use it, SQL creates a group for each unique value in a column or set of columns (the same values we get when we use `DISTINCT`), and then computes the aggregate separately within each group. To illustrate, we can find the total number of people employed in each major category with the following query:

```sql
SELECT SUM(Employed) 
FROM recent_grads 
GROUP BY Major_category;
```

The output shows the sum of the `Employed` column for each `Major_category`. Unfortunately, it doesn't indicate which major category each row refers to. We can fix this by including the `Major_category` column in our query:

```sql
SELECT Major_category, SUM(Employed) 
FROM recent_grads 
GROUP BY Major_category;
```
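To see the grouping behavior in isolation, here is a minimal sketch that runs the same kind of query against an in-memory SQLite database using Python's built-in `sqlite3` module. The rows are made-up toy data, not the real `recent_grads` table, and SQLite stands in for the MySQL server used in the rest of this notebook.

```python
import sqlite3

# Hypothetical toy data, NOT the real recent_grads table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads (Major TEXT, Major_category TEXT, Employed INTEGER)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?)",
    [
        ("Civil Engineering", "Engineering", 100),
        ("Mechanical Engineering", "Engineering", 200),
        ("Accounting", "Business", 300),
        ("Finance", "Business", 50),
        ("Music", "Arts", 40),
    ],
)

# One output row per distinct Major_category; SUM runs within each group.
grouped = conn.execute(
    "SELECT Major_category, SUM(Employed) "
    "FROM recent_grads "
    "GROUP BY Major_category "
    "ORDER BY Major_category"
).fetchall()

# The groups correspond to the DISTINCT values of the grouping column.
distinct = conn.execute(
    "SELECT DISTINCT Major_category FROM recent_grads ORDER BY Major_category"
).fetchall()

print(grouped)
print(distinct)
conn.close()
```

The five input rows collapse into three output rows, one per category, which matches the number of distinct `Major_category` values.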

In [2]:
import pandas as pd
import pymysql

In [5]:
host_name = ""
username = ""
password = ""
database_name = "dataquest"

In [6]:
db = pymysql.connect(
    host=host_name,  # DATABASE_HOST
    port=3306,
    user=username,  # DATABASE_USERNAME
    passwd=password,  # DATABASE_PASSWORD
    db=database_name,  # DATABASE_NAME
    charset='utf8'
)
cursor = db.cursor()
cursor.execute("set names utf8")
db.commit()

Use the SELECT statement to select the following columns and aggregates in a query:
* Major_category
* AVG(ShareWomen)

Use the GROUP BY statement to group the query by the Major_category column.

In [9]:
sql = '''
        select Major_category, AVG(ShareWomen)
        from recent_grads
        group by Major_category
'''
pd.read_sql(sql, db)

Unnamed: 0,Major_category,AVG(ShareWomen)
0,Agriculture & Natural Resources,0.405268
1,Arts,0.603658
2,Biology & Life Science,0.587193
3,Business,0.483198
4,Communications & Journalism,0.658384
5,Computers & Mathematics,0.311772
6,Education,0.748508
7,Engineering,0.238889
8,Health,0.795152
9,Humanities & Liberal Arts,0.63179


For each major category, find the percentage of graduates who are employed.
* Use the SELECT statement to select the following columns and aggregates in your query:
  * Major_category
  * AVG(Employed) / AVG(Total) as share_employed

Use the GROUP BY statement to group the query by the Major_category column.

In [10]:
sql = '''
        select Major_category, AVG(Employed)/AVG(Total) share_employed
        from recent_grads
        group by Major_category
'''
pd.read_sql(sql, db)

Unnamed: 0,Major_category,share_employed
0,Agriculture & Natural Resources,0.843613
1,Arts,0.806748
2,Biology & Life Science,0.667157
3,Business,0.835966
4,Communications & Journalism,0.842229
5,Computers & Mathematics,0.795611
6,Education,0.85819
7,Engineering,0.781967
8,Health,0.803374
9,Humanities & Liberal Arts,0.762638


Sometimes we want to select a subset of rows after performing a `GROUP BY` query. On the last screen, for instance, we may have wanted to select only those rows where `share_employed` is greater than `.8`. **We can't use the `WHERE` clause to do this because `share_employed` isn't a column in `recent_grads`;** it's a computed column built from aggregates. `WHERE` filters individual rows before `GROUP BY` forms the groups, so the aggregate values don't exist yet at that point.

When we want to filter on an aggregated column generated by a query, we can use the `HAVING` clause, which is applied after the groups have been formed. Here's an example:

```sql
SELECT Major_category, AVG(Employed) / AVG(Total) AS share_employed 
FROM recent_grads 
GROUP BY Major_category 
HAVING share_employed > .8;
```

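Note that we used the same column name in the `HAVING` clause that we originally specified with `AS`. MySQL (used in this notebook) and SQLite let us refer to a column alias in `HAVING`, and MySQL also accepts aliases in `GROUP BY` and `ORDER BY`. `WHERE` is different: MySQL rejects an alias there, because `WHERE` is evaluated before the `SELECT` expressions are computed.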
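The difference between filtering before and after grouping can be sketched with Python's built-in `sqlite3` module. The table below holds made-up toy rows (not the real `recent_grads` data), and SQLite is used only so the example is self-contained; the row-level `WHERE` condition is an illustrative assumption, not part of the original exercise.

```python
import sqlite3

# Hypothetical toy rows, NOT the real recent_grads data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads (Major_category TEXT, Employed INTEGER, Total INTEGER)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?)",
    [
        ("Engineering", 90, 100),
        ("Engineering", 60, 100),
        ("Business", 85, 100),
        ("Business", 95, 100),
        ("Arts", 81, 100),
    ],
)

# HAVING runs after grouping, so it can test the aggregated share.
having_rows = conn.execute(
    """
    SELECT Major_category, AVG(Employed) / AVG(Total) AS share_employed
    FROM recent_grads
    GROUP BY Major_category
    HAVING share_employed > .8
    ORDER BY Major_category
    """
).fetchall()

# WHERE runs before grouping, so it drops individual majors first and
# the averages are then computed over the surviving rows only.
where_rows = conn.execute(
    """
    SELECT Major_category, AVG(Employed) / AVG(Total) AS share_employed
    FROM recent_grads
    WHERE Employed * 1.0 / Total > .8
    GROUP BY Major_category
    ORDER BY Major_category
    """
).fetchall()

print([row[0] for row in having_rows])
print([row[0] for row in where_rows])
conn.close()
```

With `HAVING`, Engineering is excluded because its overall share is below `.8`. With `WHERE`, Engineering still appears, because only its low-share major was removed before the average was taken.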

#### instructions
Find all of the major categories where the share of graduates with low-wage jobs is greater than .1.
* Use the SELECT statement to select the following columns and aggregates in a query:
  * Major_category
  * AVG(Low_wage_jobs) / AVG(Total) as share_low_wage
* Use the GROUP BY statement to group the query by the Major_category column.
* Use the HAVING statement to restrict the selection to rows where share_low_wage is greater than .1.

In [11]:
sql = '''
        select Major_category, AVG(Low_wage_jobs)/AVG(Total) share_low_wage
        from recent_grads
        group by Major_category
        having share_low_wage > .1
'''
pd.read_sql(sql, db)

Unnamed: 0,Major_category,share_low_wage
0,Arts,0.168331
1,Communications & Journalism,0.126324
2,Humanities & Liberal Arts,0.132087
3,Industrial Arts & Consumer Services,0.115713
4,Law & Public Policy,0.115685
5,Psychology & Social Work,0.116934
6,Social Science,0.102233


On the last screen, the percentages in our results were very long and hard to read (e.g., `0.16833085991095678`). We can use the SQL `ROUND` function in our query to round them. Here's an example of what this looks like:

```sql
SELECT Major_category, ROUND(ShareWomen, 2) AS rounded_share_women 
FROM recent_grads;
```

By passing a different second argument to the `ROUND` function, such as `ROUND(ShareWomen, 3)`, we can round to a different number of decimal places.
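As a quick illustration of the second argument, the sketch below rounds the long value quoted above to 2, 3, and 4 decimal places. It uses Python's built-in `sqlite3` module with an in-memory database purely so that it runs without a server; it does not touch the `recent_grads` table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The long share value quoted in the text above.
value = 0.16833085991095678

# ROUND(x, n) keeps n digits after the decimal point.
rounded = {
    digits: conn.execute("SELECT ROUND(?, ?)", (value, digits)).fetchone()[0]
    for digits in (2, 3, 4)
}

for digits, result in rounded.items():
    print(digits, result)
conn.close()
```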

Write a SQL query that returns the following columns of recent_grads (in the same order):
* ShareWomen rounded to 4 decimal places
* Major_category

Limit the results to 10 rows.

In [13]:
sql = '''
        select ROUND(ShareWomen, 4), Major_category
        from recent_grads
        limit 10
'''
pd.read_sql(sql, db)

Unnamed: 0,"ROUND(ShareWomen, 4)",Major_category
0,0.1206,Engineering
1,0.1019,Engineering
2,0.153,Engineering
3,0.1073,Engineering
4,0.3416,Engineering
5,0.145,Engineering
6,0.4414,Business
7,0.5357,Physical Sciences
8,0.1196,Engineering
9,0.1964,Engineering


Recall the `HAVING` query we wrote earlier:

```sql
SELECT Major_category, AVG(Employed) / AVG(Total) AS share_employed 
FROM recent_grads 
GROUP BY Major_category 
HAVING share_employed > .8;
```

This query returned very long fractional values for `share_employed`. We can update our query with the `ROUND` function to round the results to three decimal places:

```sql
SELECT Major_category, ROUND(AVG(Employed) / AVG(Total), 3) AS share_employed 
FROM recent_grads 
GROUP BY Major_category 
HAVING share_employed > .8;
```
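One subtlety is worth a sketch: when `HAVING` refers to the alias, it compares the rounded value, so a group whose true share is only slightly above the threshold can be filtered out. The example below uses Python's built-in `sqlite3` module and made-up toy rows (not the real `recent_grads` data) to show this; the alternative of repeating the unrounded expression in `HAVING` is one possible workaround, not something the original exercise requires.

```python
import sqlite3

# Hypothetical toy rows, NOT the real recent_grads data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads (Major_category TEXT, Employed INTEGER, Total INTEGER)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?)",
    [
        ("Education", 8600, 10000),  # share 0.86
        ("Health", 8004, 10000),     # share 0.8004, rounds to 0.8
        ("Arts", 7000, 10000),       # share 0.7
    ],
)

# HAVING on the alias compares the ROUNDED share with .8.
rounded_filter = conn.execute(
    """
    SELECT Major_category, ROUND(AVG(Employed) / AVG(Total), 3) AS share_employed
    FROM recent_grads
    GROUP BY Major_category
    HAVING share_employed > .8
    ORDER BY Major_category
    """
).fetchall()

# Repeating the unrounded expression filters on the exact share instead.
exact_filter = conn.execute(
    """
    SELECT Major_category, ROUND(AVG(Employed) / AVG(Total), 3) AS share_employed
    FROM recent_grads
    GROUP BY Major_category
    HAVING AVG(Employed) / AVG(Total) > .8
    ORDER BY Major_category
    """
).fetchall()

print([row[0] for row in rounded_filter])
print([row[0] for row in exact_filter])
conn.close()
```

In this toy data, Health passes the exact filter but not the rounded one, because its share rounds down to exactly `.8`.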

#### instructions
* Use the SELECT statement to select the following columns and aggregates in a query:
  * Major_category
  * AVG(College_jobs) / AVG(Total) as share_degree_jobs
    * Use the ROUND function to round share_degree_jobs to 3 decimal places.
* Group the query by the Major_category column.
* Only select rows where share_degree_jobs is less than .3.

In [14]:
sql = '''
        select Major_category, ROUND(AVG(College_jobs)/AVG(Total),3) share_degree_jobs
        from recent_grads
        group by Major_category
        HAVING share_degree_jobs < .3
'''
pd.read_sql(sql, db)

Unnamed: 0,Major_category,share_degree_jobs
0,Agriculture & Natural Resources,0.247
1,Arts,0.265
2,Business,0.114
3,Communications & Journalism,0.22
4,Humanities & Liberal Arts,0.27
5,Industrial Arts & Consumer Services,0.249
6,Law & Public Policy,0.163
7,Social Science,0.215


In the last few screens, we used SQL arithmetic to divide float columns. This resulted in float values that we could round using the `ROUND()` function. If we want to check the type of each column, SQLite provides the following query:

```sql
PRAGMA TABLE_INFO(recent_grads)
```

In MySQL, which this notebook uses, the equivalent query returns:

In [34]:
# PRAGMA TABLE_INFO() is SQLite-specific, so MySQL rejects it.
# SHOW FIELDS is the MySQL equivalent.
pd.read_sql('show fields from recent_grads', db)

Unnamed: 0,Field,Type,Null,Key,Default,Extra
0,Rank,int(11),NO,PRI,,
1,Major_code,int(11),NO,,,
2,Major,varchar(70),YES,,,
3,Total,float,NO,,,
4,Men,float,NO,,,
5,Women,float,NO,,,
6,Major_category,varchar(70),YES,,,
7,ShareWomen,float,NO,,,
8,Sample_size,int(11),NO,,,
9,Employed,int(11),NO,,,
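
For comparison, here is a minimal sketch of what `PRAGMA TABLE_INFO` returns in SQLite itself, using Python's built-in `sqlite3` module. The table definition is a hypothetical, shortened stand-in for `recent_grads`; its column types are chosen for illustration and are not taken from the real dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical, shortened schema; the types are illustrative only.
conn.execute(
    """
    CREATE TABLE recent_grads (
        Major_code INTEGER PRIMARY KEY,
        Major TEXT,
        Total INTEGER,
        Women INTEGER,
        ShareWomen REAL
    )
    """
)

# Each row is (cid, name, type, notnull, dflt_value, pk).
columns = conn.execute("PRAGMA table_info(recent_grads)").fetchall()

# Map each column name to its declared type.
column_types = {name: col_type for _, name, col_type, _, _, _ in columns}

print(column_types)
conn.close()
```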


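In SQLite, if we try to divide 2 integer columns, SQL will round down and **return integer values**. In the MySQL table used here, `Women` and `Total` are stored as `float` (see the field list above), and MySQL's `/` operator returns a fractional result even for integer operands, so the following query returns fractional values:

```sql
SELECT Women / Total AS SW FROM recent_grads LIMIT 10;
```

In [35]:
# MySQL's `/` never performs integer division (DIV does that),
# so the result is fractional rather than rounded down.
pd.read_sql('select women/total SW from recent_grads limit 10',db)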

Unnamed: 0,SW
0,0.120564
1,0.101852
2,0.153037
3,0.107313
4,0.341631
5,0.144967
6,0.441356
7,0.535714
8,0.119559
9,0.19645


In a database that performs integer division, we instead need to use the `CAST()` function to convert the columns to the Float type before dividing:

```sql
SELECT CAST(Women AS Float) / CAST(Total AS Float) FROM recent_grads LIMIT 5;
```

In [None]:
# This is a MySQL server limitation rather than a pymysql one: MySQL versions
# before 8.0.17 do not accept FLOAT as a CAST target type.
pd.read_sql('SELECT CAST(Women as Float) / CAST(Total as Float) FROM recent_grads limit 5', db)
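
The integer-division behavior and the `CAST()` fix can be reproduced in SQLite with Python's built-in `sqlite3` module. The sketch below uses two made-up rows with integer `Women` and `Total` columns (not the real `recent_grads` data).

```python
import sqlite3

# Hypothetical toy rows with INTEGER columns, NOT the real recent_grads data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE recent_grads (Major_category TEXT, Women INTEGER, Total INTEGER)"
)
conn.executemany(
    "INSERT INTO recent_grads VALUES (?, ?, ?)",
    [("Engineering", 1, 4), ("Business", 3, 4)],
)

# Integer / integer rounds down in SQLite.
int_div = conn.execute(
    "SELECT Women / Total FROM recent_grads ORDER BY Major_category"
).fetchall()

# Casting to Float first gives the fractional result.
float_div = conn.execute(
    "SELECT CAST(Women AS Float) / CAST(Total AS Float) "
    "FROM recent_grads ORDER BY Major_category"
).fetchall()

print(int_div)    # [(0,), (0,)]
print(float_div)  # [(0.75,), (0.25,)]
conn.close()
```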

#### instructions
* Write a query that divides the sum of the Women column by the sum of the Total column, aliased as SW.
* Group the results by Major_category and order by SW.
* The results should only contain the Major_category and SW columns, in that order.

In [None]:
sql = '''
        SELECT Major_category, 
            CAST(SUM(Women) AS Float) / CAST(SUM(Total) AS Float) SW 
        FROM recent_grads 
        GROUP BY Major_category 
        ORDER BY SW
'''
pd.read_sql(sql, db)

In [36]:
db.close()