# Summary Statistics
Learn how to calculate summary statistics in SQL.

* A key idea in SQL is that everything is a table. One advantage of this simplification is that it's a common, visual representation that makes SQL approachable for a much wider audience. The disadvantage is that datasets and calculations that aren't well suited for this representation must be converted to be used in a SQL database environment.

In [1]:
import pandas as pd
import pymysql

In [None]:
host_name = ""
username = ""
password = ""
database_name = "dataquest"

In [2]:
db = pymysql.connect(
    host=host_name,  # DATABASE_HOST
    port=3306,
    user=username,  # DATABASE_USERNAME
    passwd=password,  # DATABASE_PASSWORD
    db=database_name,  # DATABASE_NAME
    charset='utf8'
)
cursor = db.cursor()
cursor.execute("set names utf8")
db.commit()

Write a query that returns the number of majors with mostly male students.
* Use all caps in the SELECT clause so our answer checking will match - COUNT(Major).

In [3]:
sql = '''
    select count(major) from recent_grads where sharewomen < 0.5 
'''
pd.read_sql(sql, db)

Unnamed: 0,count(major)
0,76


Functions like `COUNT()` are known as [`aggregate functions`](https://sqlite.org/lang_aggfunc.html). Aggregate functions are applied over columns of values and return a single value. [MIN()](https://sqlite.org/lang_corefunc.html#minoreunc) and [MAX()](https://sqlite.org/lang_corefunc.html#maxoreunc), for example, calculate and return the minimum and maximum values in a column.

```sql
SELECT MIN(ShareWomen) 
FROM recent_grads;
```

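It's interesting that there's a major with 0 women in the dataset. What if we wanted to know which major that was, or to access other columns for that row? In SQLite, we just need to add the additional columns we want returned in the `SELECT` clause. This relies on SQLite-specific behavior: other databases, such as MySQL, may reject the query or fill the extra columns from an arbitrary row.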

```sql
SELECT Major, MIN(ShareWomen) FROM recent_grads
```
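To see how the extra column behaves, here is a minimal sketch using Python's built-in `sqlite3` module. The `toy_grads` table and its three rows are invented for illustration and are not part of the dataset. It compares the SQLite-specific form with a more portable `ORDER BY ... LIMIT 1` query.

```python
import sqlite3

# Hypothetical toy table standing in for recent_grads; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_grads (Major TEXT, ShareWomen REAL)")
conn.executemany(
    "INSERT INTO toy_grads VALUES (?, ?)",
    [("A", 0.30), ("B", 0.05), ("C", 0.75)],
)

# SQLite-specific: the bare Major column is taken from the row holding the minimum.
bare_row = conn.execute(
    "SELECT Major, MIN(ShareWomen) FROM toy_grads"
).fetchone()

# Portable alternative: sort by the column and keep only the first row.
sorted_row = conn.execute(
    "SELECT Major, ShareWomen FROM toy_grads ORDER BY ShareWomen ASC LIMIT 1"
).fetchone()

print(bare_row, sorted_row)
conn.close()
```

Both queries return the same row in SQLite. The second form does not depend on how a particular database treats non-aggregated columns, so it is the safer choice when the query may run elsewhere.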

#### instructions
Write a query that returns the Engineering major with the lowest median salary.
* We only want the Major, Major_category, and MIN(Median) columns in the result.

In [5]:
sql = '''
    select major, major_category, min(median)
    from recent_grads
    where major_category = "Engineering"
'''
pd.read_sql(sql, db)

Unnamed: 0,major,major_category,min(median)
0,PETROLEUM ENGINEERING,Engineering,40000


The final aggregate functions we'll look at are `SUM()`, `AVG()`, and `TOTAL()`. `SUM()` adds all of the values in a column, while `AVG()` computes their average. `TOTAL()`, which is specific to SQLite, also returns the sum, but always as a floating point value (even if the column contains integers). **The `TOTAL()` function should be used when working with a column containing floating point values**. [You can read more here](https://sqlite.org/lang_aggfunc.html).

* This time around, we're going to skip showing sample code since these functions are used the same way as `COUNT()`, `MIN()`, and `MAX()`. This is good practice working with new functions, as SQL contains many functions that you'll end up using down the road that you haven't been taught explicitly.
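
One detail that is easy to miss when trying these functions is the difference in return type between `SUM()` and `TOTAL()`. The sketch below illustrates it with Python's built-in `sqlite3` module and a made-up `toy_totals` table (an assumption for illustration, not part of the dataset). `TOTAL()` is an SQLite function and may not exist in other databases.

```python
import sqlite3

# Hypothetical toy table with an integer column; the values are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_totals (Total INTEGER)")
conn.executemany("INSERT INTO toy_totals VALUES (?)", [(10,), (20,), (30,)])

# SUM() keeps the integer type, TOTAL() always returns a float, AVG() returns a float.
sum_value, total_value, avg_value = conn.execute(
    "SELECT SUM(Total), TOTAL(Total), AVG(Total) FROM toy_totals"
).fetchone()

# When no rows match, SUM() returns NULL (None in Python) while TOTAL() returns 0.0.
empty_sum, empty_total = conn.execute(
    "SELECT SUM(Total), TOTAL(Total) FROM toy_totals WHERE Total > 100"
).fetchone()

print(sum_value, total_value, avg_value, empty_sum, empty_total)
conn.close()
```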

Write a query that computes the sum of the Total column.
* Return only the total number of students as an integer value.

In [4]:
sql = '''
        select SUM(Total) from recent_grads
'''
pd.read_sql(sql, db)

Unnamed: 0,SUM(Total)
0,6771654.0


Instead of writing an individual query for each specific question we want to answer, we can actually write queries that answer multiple questions at once. Let's take the following questions:

* What's the lowest median salary?
* What's the highest median salary?
* What's the total number of students?

Recall that we can select multiple columns by including their names with commas, like so:

```sql
SELECT Major, Major_category FROM recent_grads
```

We can apply the same principle to combine multiple aggregate functions into a single query:

```sql
SELECT MIN(Median), MAX(Median), SUM(Total)
FROM recent_grads
```

#### instructions
Write a query that computes the average of the Total column, the minimum of the Men column, and the maximum of the Women column, in that specific order.
* Make sure that all of the aggregate functions are capitalized (SUM() not sum(), etc), so our results match yours.

In [5]:
sql = '''
        select AVG(Total), MIN(Men), MAX(Women)
        from recent_grads
'''
pd.read_sql(sql, db)

Unnamed: 0,AVG(Total),MIN(Men),MAX(Women)
0,39370.081395,119.0,307087.0


## AS operator

All of the queries we've written so far have had somewhat unpleasant column names in the results, like `AVG(Total)` and `MIN(Men)`. Many companies use SQL environments and tools that can run your query, turn the results into a plot of your choosing, and then create a PDF report containing multiple plots (and some additional explanation from the user). Since other people will need to read and interpret the results of your SQL queries, it's helpful to be able to specify custom names for the columns in our results.

We can do just that using the `AS` operator:
```sql
SELECT COUNT(*) as num_students FROM recent_grads
```

This custom name is known as an [`alias`](https://www.tutorialspoint.com/sqlite/sqlite_alias_syntax.htm), and it applies only to our results table (the table in the database won't be renamed). To use an arbitrary phrase containing spaces as the alias, we wrap it in quotation marks:

```sql
SELECT COUNT(*) as "Total Students" FROM recent_grads
```

Even better, we can drop `AS` entirely and just add the name next to the original column:
```sql
SELECT COUNT(*) "Total Students" FROM recent_grads
```
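Lastly, we can reference the aliases elsewhere in longer queries to make our code more compact. SQLite accepts aliases in the `WHERE` clause, as shown here, but some databases (MySQL, for example) only accept them in clauses such as `ORDER BY`:
```sql
SELECT Major m, Major_category mc, Unemployment_rate ur
FROM recent_grads
WHERE (mc = 'Engineering') AND (ur > 0.04 AND ur < 0.08)
ORDER BY ur DESC
```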
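
To check that an alias only renames the result column, here is a small sketch using Python's built-in `sqlite3` module. The `toy_grads` table, its values, and the alias names are invented for illustration. It reads the result column names from `cursor.description` and then lists the table's own columns.

```python
import sqlite3

# Hypothetical toy table standing in for recent_grads; the values are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_grads (Major TEXT, Median INTEGER)")
conn.executemany(
    "INSERT INTO toy_grads VALUES (?, ?)",
    [("A", 40000), ("B", 65000), ("C", 52000)],
)

# One alias written with AS, and one quoted alias written without AS.
cursor = conn.execute(
    'SELECT MIN(Median) AS lowest_median, MAX(Median) "Highest Median" '
    "FROM toy_grads"
)
column_names = [col[0] for col in cursor.description]
row = cursor.fetchone()
print(column_names, row)

# The table itself is unchanged: its columns keep their original names.
table_columns = [info[1] for info in conn.execute("PRAGMA table_info(toy_grads)")]
print(table_columns)
conn.close()
```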

Write a query that returns, in the following order:
* the number of rows as Number of Students
* the maximum value of Unemployment_rate as Highest Unemployment Rate

In [6]:
sql = '''
        select * from recent_grads limit 1
'''
pd.read_sql(sql, db)

Unnamed: 0,Rank,Major_code,Major,Total,Men,Women,Major_category,ShareWomen,Sample_size,Employed,...,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs
0,1,2419,PETROLEUM ENGINEERING,2339.0,2057.0,282.0,Engineering,0.120564,36,1976,...,270,1207,37,0.018381,110000,95000,125000,1534,364,193


In [10]:
sql = '''
        select COUNT(*) "Number of Students", 
                         MAX(Unemployment_rate) "Highest Unemployment Rate"
        from recent_grads
'''
pd.read_sql(sql, db)

Unnamed: 0,Number of Students,Highest Unemployment Rate
0,172,0.177226


We've been working with the Major_category column a decent amount in our queries, and it's a column with only a few unique values. What if we want to return just the unique values in this column? Or the number of unique values in this column?

* We can return all of the unique values in a column using the `DISTINCT` keyword.

```sql
SELECT DISTINCT Major_category FROM recent_grads
```

As with the other SQL clauses we've learned, we can use `DISTINCT` with multiple columns to return the unique pairings of those columns:

```sql
SELECT DISTINCT Major, Major_category FROM recent_grads LIMIT 5
```

In this case, the Major_category column has far fewer unique values (only 16 for Major_category compared to 173 for Major), so the same Major_category value is repeated across many different values of Major.

Lastly, we can count the number of unique values in a column by placing the `DISTINCT` keyword inside the `COUNT()` function. The parentheses around the column name are optional, because `DISTINCT` is a keyword rather than a function:

```sql
SELECT COUNT(DISTINCT(Major_category)) unique_major_categories FROM recent_grads
```
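
The difference between counting all values and counting unique values can be sketched with Python's built-in `sqlite3` module. The `toy_grads` table below is a made-up four-row stand-in (an assumption for illustration), not the real dataset.

```python
import sqlite3

# Hypothetical toy table standing in for recent_grads; the rows are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_grads (Major TEXT, Major_category TEXT)")
conn.executemany(
    "INSERT INTO toy_grads VALUES (?, ?)",
    [
        ("CIVIL ENGINEERING", "Engineering"),
        ("MECHANICAL ENGINEERING", "Engineering"),
        ("LIBRARY SCIENCE", "Education"),
        ("SPECIAL NEEDS EDUCATION", "Education"),
    ],
)

# DISTINCT collapses repeated category values into one row each.
categories = [
    r[0]
    for r in conn.execute(
        "SELECT DISTINCT Major_category FROM toy_grads ORDER BY Major_category"
    )
]

# COUNT(col) counts every non-NULL value; COUNT(DISTINCT col) counts unique ones.
# The extra parentheses in the third expression are optional and change nothing.
counts = conn.execute(
    "SELECT COUNT(Major_category), "
    "COUNT(DISTINCT Major_category), "
    "COUNT(DISTINCT(Major_category)) "
    "FROM toy_grads"
).fetchone()

print(categories, counts)
conn.close()
```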

Write a query that returns the number of unique values in the Major, Major_category, and Major_code columns. Use the following aliases in the following order:
* For the unique value count of the Major column, use the alias unique_majors.
* For the unique value count of the Major_category column, use the alias unique_major_categories.
* For the unique value count of the Major_code column, use the alias unique_major_codes.

In [15]:
sql = '''
        select COUNT(DISTINCT(Major)) unique_majors, 
               COUNT(DISTINCT(Major_category)) as unique_major_categories, 
               COUNT(DISTINCT(Major_code)) as unique_major_codes
        from recent_grads
'''
pd.read_sql(sql, db)

Unnamed: 0,unique_majors,unique_major_categories,unique_major_codes
0,172,16,172


Let's revisit one of the questions from the beginning of the mission:

* Which majors had the largest spread (difference) between the 25th and 75th percentile starting salaries?

To answer this question, we need to be able to perform arithmetic on the columns in a table to compute the difference. SQL supports the standard arithmetic operators: `*`, `+`, `-`, and `/`, and we can use them like any other operator:

```sql
SELECT P75th - P25th quartile_spread FROM recent_grads LIMIT 10
```
You can also add, subtract, multiply, or divide columns by individual values:

```sql
SELECT ShareWomen * 100 percent_female FROM recent_grads LIMIT 10
```
One thing to note is that multiplying or dividing a column by a floating point value (or by a column with floating point values) will result in floating point values:

* Two floats - Returns a float.
  * `SELECT 100.0 / 100.0` returns `1.0`.
* A float and an integer - Returns a float.
  * `SELECT 100 / 1.0` returns `100.0`.
* Two integers - Returns an integer.
  * `SELECT 100 / 10` returns `10`.
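
These rules can be checked directly in SQLite. The sketch below uses Python's built-in `sqlite3` module and adds two extra expressions, `7 / 2` and `7 / 2.0` (not from the text above), to show that integer division discards the remainder. Other databases may follow different division rules.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# No table is needed: SELECT can evaluate plain arithmetic expressions.
results = conn.execute(
    "SELECT 100.0 / 100.0, 100 / 1.0, 100 / 10, 7 / 2, 7 / 2.0"
).fetchone()

# Record the Python type of each value to see which results are floats.
result_types = [type(value).__name__ for value in results]

print(results)
print(result_types)
conn.close()
```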
  
Write a query that computes the difference between the 25th and 75th percentile of salaries for all majors.
* Return the `Major` column first, using the default column name.
* Return the `Major_category` column second, using the default column name.
* Return the computed difference between the 25th and 75th percentile third, using the alias `quartile_spread`.
* Order the results from lowest to highest and only return the first 20 results.

In [17]:
sql = '''
        select Major, Major_category, P75th-P25th quartile_spread
        from recent_grads
        order by quartile_spread
        limit 20
'''
pd.read_sql(sql, db)

Unnamed: 0,Major,Major_category,quartile_spread
0,MILITARY TECHNOLOGIES,Industrial Arts & Consumer Services,0
1,LIBRARY SCIENCE,Education,2000
2,SCHOOL STUDENT COUNSELING,Education,2000
3,COURT REPORTING,Law & Public Policy,4000
4,PHARMACOLOGY,Biology & Life Science,5000
5,EDUCATIONAL ADMINISTRATION AND SUPERVISION,Education,6000
6,COUNSELING PSYCHOLOGY,Psychology & Social Work,6800
7,SPECIAL NEEDS EDUCATION,Education,10000
8,MATHEMATICS TEACHER EDUCATION,Education,10000
9,EDUCATIONAL PSYCHOLOGY,Psychology & Social Work,10000


In [None]:
db.close()