# Introduction to SQL

In previous missions, we primarily worked with data represented in a **CSV** file. The workflow looked something like this:
![](https://s3.amazonaws.com/dq-content/252/pandas_workflow.svg)

The pandas workflow works well when:

* the data fits in memory (**a few gigabytes but not terabytes**)
* the data is **relatively static** (doesn't need to be loaded into memory every minute because the data has changed)
* **only a single person is accessing** the data (shared access to memory is difficult)
* **security isn't important** (security is critical for company scale production situations)

When the data changes frequently, requires shared access, or doesn't fit in memory, or when security is critical, **a database is a much better solution**.

A database is a data representation that lives on disk and can be queried, accessed, and updated without using much memory. We primarily interact with a database using a [database management system](https://en.wikipedia.org/wiki/Database), or DBMS for short.

In the pandas workflow, we spend most of our time thinking about what functions and methods to use, where to store intermediate results in variables, and juggling all of these. To work with data stored in a database, we instead use a language called SQL (or structured query language). In SQL, we express each unique request (whether it be fetching a subset of or editing values in the data) as a single query and then ask the DBMS to run the query and display any results.

For example, to fetch a specific subset of the data from a database, we would:

* write the SQL query: `SELECT * FROM salaries`
* ask the DBMS to run the query and display the results to us

Here's what the database workflow looks like:
![](https://s3.amazonaws.com/dq-content/252/database_workflow.svg)

Because the data lives on disk, **we can work with datasets that consume multiple terabytes of disk space**. Many data science teams in industry have servers and setups in cloud environments like Microsoft Azure or Amazon Web Services that let team members work with this scale of data. Robust and popular DBMS tools like [Postgres](https://www.postgresql.org/) and [MySQL](https://www.mysql.com/) include powerful features for managing user credentials, security, and high data throughput (quickly changing data). In this course and the next, we'll learn the fundamentals of SQL using a small, portable DBMS called [SQLite](https://www.sqlite.org/). SQLite is one of the most widely deployed databases in the world and is lightweight enough that it is included as a [module in Python](https://docs.python.org/3.6/library/sqlite3.html). In later courses, we'll dive into production systems like Postgres.

In this course, we'll explore data from the American Community Survey on job outcome statistics based on college majors. While the original CSV version can be found on [FiveThirtyEight's GitHub](https://github.com/fivethirtyeight/data/tree/master/college-majors), we'll be using a slightly modified version of the data that's stored as a database. We'll be working with a version of the data that contains the 2010-2012 data for recent college grads only. In this mission, we'll learn how to write SQL queries **to explore and start to understand the dataset.**

## Previewing A Table Using SELECT

We'll be working with the database file `jobs.db`, which contains a single table named recent_grads. In later courses, we'll learn how to work with a database containing multiple tables.

![](https://s3.amazonaws.com/dq-content/252/sql_table.svg)

To display the first 5 rows from the recent_grads table, we need to:

* write SQL code that expresses this request
* ask the SQLite RDBMS software to run the code and display the results.

```sql
SELECT * FROM recent_grads LIMIT 5
```

In this query, we specified:

* the columns we wanted using SELECT *
* the table we wanted to query using FROM recent_grads
* the number of rows we wanted using LIMIT 5

Here's a visual breakdown of the different components of the query:

![](https://s3.amazonaws.com/dq-content/252/select_breakdown_2.svg)
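The two steps above (write the query, then ask the DBMS to run it) can be sketched with Python's built-in `sqlite3` module. This is only a sketch under stated assumptions: instead of opening `jobs.db`, it builds a tiny in-memory stand-in for `recent_grads` with three columns and four rows copied from outputs that appear in this notebook, and the names `conn`, `sample_rows`, and `preview` are illustrative.

```python
import sqlite3

# Assumption: a tiny in-memory stand-in for recent_grads, not the real jobs.db.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE recent_grads (Major TEXT, Major_category TEXT, ShareWomen REAL)')
sample_rows = [
    ('PETROLEUM ENGINEERING', 'Engineering', 0.120564),
    ('METALLURGICAL ENGINEERING', 'Engineering', 0.153037),
    ('NAVAL ARCHITECTURE AND MARINE ENGINEERING', 'Engineering', 0.107313),
    ('ASTRONOMY AND ASTROPHYSICS', 'Physical Sciences', 0.535714),
]
conn.executemany('INSERT INTO recent_grads VALUES (?, ?, ?)', sample_rows)

# Hand the query to the DBMS, then fetch whatever rows it returns.
cursor = conn.execute('SELECT * FROM recent_grads LIMIT 2')
column_names = [description[0] for description in cursor.description]
preview = cursor.fetchall()

print(column_names)
for row in preview:
    print(row)
conn.close()
```

Because the query uses `SELECT *`, every column of the stand-in table comes back, and `LIMIT 2` caps the result at two rows.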

In [None]:
# Write a SQL query that returns the first 10 rows from recent_grads.

query = 'SELECT * FROM recent_grads LIMIT 10'

Head to the [dataset page](https://github.com/fivethirtyeight/data/tree/master/college-majors) and spend some time getting familiar with what each column represents.

Header|Description
---|---
Rank|Rank by median earnings
Major_code|Major code, FO1DP in ACS PUMS
Major|Major description
Major_category|Category of major from Carnevale et al
Total|Total number of people with major
Sample_size|Sample size (unweighted) of full-time, year-round ONLY (used for earnings)
Men|Male graduates
Women|Female graduates
ShareWomen|Women as share of total
Employed|Number employed (ESR == 1 or 2)
Full_time|Employed 35 hours or more
Part_time|Employed less than 35 hours
Full_time_year_round|Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35)
Unemployed|Number unemployed (ESR == 3)
Unemployment_rate|Unemployed / (Unemployed + Employed)
Median|Median earnings of full-time, year-round workers
P25th|25th percentile of earnings
P75th|75th percentile of earnings
College_jobs|Number with job requiring a college degree
Non_college_jobs|Number with job not requiring a college degree
Low_wage_jobs|Number in low-wage service jobs

Based on this dataset preview and an understanding of what each column represents, here are some questions we may have:

* Which majors had mostly female students? Which ones had mostly male students?
* Which majors had the largest spread (difference) between the 25th and 75th percentile starting salaries?
* Which engineering majors had the highest full time employment rates?

Let's start by focusing on the first question. The SQL workflow revolves around translating the question we want to answer to the subset of data we want from the database. To determine which majors had mostly female students, we want the following subset:

* only the `Major` column
* only the rows where `ShareWomen` is greater than or equal to `0.5` (corresponding to 50%)

To return only the `Major` column, we list that column name in the `SELECT` part of the query (instead of using `*` to return all columns):

```sql
SELECT Major FROM recent_grads
```

We can specify multiple columns this way as well and the results table will preserve the order of the columns:

```sql
SELECT Major, Major_category FROM recent_grads
```

To return only the values where ShareWomen is greater than or equal to 0.5, we need to add a `WHERE` clause:

```sql
SELECT Major FROM recent_grads
WHERE ShareWomen >= 0.5
```

Finally, we can limit the number of rows returned using `LIMIT`:

```sql
SELECT Major FROM recent_grads
WHERE ShareWomen >= 0.5
LIMIT 5
```

![](https://s3.amazonaws.com/dq-content/252/where_breakdown_1.svg)
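To see column selection, filtering, and `LIMIT` working together, here is a minimal sketch using Python's built-in `sqlite3` module. It assumes a small in-memory stand-in for `recent_grads` (two columns, four rows taken from outputs in this notebook) instead of the real database, and it passes the threshold through a `?` placeholder, which is one common way to supply values rather than something the mission requires.

```python
import sqlite3

# Assumption: an in-memory stand-in table with only the two columns we need.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE recent_grads (Major TEXT, ShareWomen REAL)')
conn.executemany(
    'INSERT INTO recent_grads VALUES (?, ?)',
    [
        ('PETROLEUM ENGINEERING', 0.120564),
        ('CHEMICAL ENGINEERING', 0.341631),
        ('ACTUARIAL SCIENCE', 0.441356),
        ('ASTRONOMY AND ASTROPHYSICS', 0.535714),
    ],
)

query = '''
SELECT Major FROM recent_grads
WHERE ShareWomen >= ?
LIMIT 5
'''
# The ? placeholder lets sqlite3 substitute the threshold value for us.
mostly_women = conn.execute(query, (0.5,)).fetchall()
print(mostly_women)
conn.close()
```

Only one of the four sample rows passes the filter, so `LIMIT 5` has no visible effect here; it only matters once more than five rows match.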

In [1]:
import pandas as pd
import pymysql

In [2]:
recent_grads = pd.read_csv('data/recent-grads.csv')

In [3]:
recent_grads.head(1)

Unnamed: 0,Rank,Major_code,Major,Total,Men,Women,Major_category,ShareWomen,Sample_size,Employed,...,Part_time,Full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th,College_jobs,Non_college_jobs,Low_wage_jobs
0,1,2419,PETROLEUM ENGINEERING,2339.0,2057.0,282.0,Engineering,0.120564,36,1976,...,270,1207,37,0.018381,110000,95000,125000,1534,364,193


In [4]:
for col in recent_grads.columns:
    if recent_grads[col].dtype != 'object':
        print(col, ':', recent_grads[col].dtype, max(recent_grads[col]), recent_grads[col].isnull().sum())
    else:
        print(col, ':', recent_grads[col].dtype, recent_grads[col].isnull().sum())

Rank : int64 173 0
Major_code : int64 6403 0
Major : object 0
Total : float64 393735.0 1
Men : float64 173809.0 1
Women : float64 307087.0 1
Major_category : object 0
ShareWomen : float64 0.968953683 1
Sample_size : int64 4212 0
Employed : int64 307933 0
Full_time : int64 251540 0
Part_time : int64 115172 0
Full_time_year_round : int64 199897 0
Unemployed : int64 28169 0
Unemployment_rate : float64 0.177226407 0
Median : int64 110000 0
P25th : int64 95000 0
P75th : int64 125000 0
College_jobs : int64 151643 0
Non_college_jobs : int64 148395 0
Low_wage_jobs : int64 48207 0


In [5]:
recent_grads = recent_grads.dropna(axis=0, how='any')

In [None]:
host_name = ""
username = ""
password = ""
database_name = "dataquest"

In [7]:
db = pymysql.connect(
    host=host_name,  # DATABASE_HOST
    port=3306,
    user=username,  # DATABASE_USERNAME
    password=password,  # DATABASE_PASSWORD
    database=database_name,  # DATABASE_NAME
    charset='utf8'
)
cursor = db.cursor()
cursor.execute("set names utf8")
db.commit()

In [13]:
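# PyMySQL 1.0+ no longer allows several statements in one execute() call by default,
# so the DROP and CREATE statements run separately.
cursor.execute('DROP TABLE IF EXISTS recent_grads')
sql = '''
        CREATE TABLE recent_grads (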
            Rank INT NOT NULL,
            Major_code INT NOT NULL,
            Major INT NOT NULL,
            Total FLOAT NOT NULL,
            Men FLOAT NOT NULL,
            Women FLOAT NOT NULL,
            Major_category INT NOT NULL,
            ShareWomen FLOAT NOT NULL,
            Sample_size INT NOT NULL,
            Employed INT NOT NULL,
            Full_time INT NOT NULL,
            Part_time INT NOT NULL,
            Full_time_year_round INT NOT NULL,
            Unemployed INT NOT NULL,
            Unemployment_rate FLOAT NOT NULL,
            Median INT NOT NULL,
            P25th INT NOT NULL,
            P75th INT NOT NULL,
            College_jobs INT NOT NULL,
            Non_college_jobs INT NOT NULL,
            Low_wage_jobs INT NOT NULL,
            PRIMARY KEY (Rank)
      ) DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
'''
cursor.execute(sql)
db.commit()

In [14]:
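# Major and Major_category hold text, so change them from INT to VARCHAR.
# Each ALTER runs in its own execute() call (one statement per call).
cursor.execute('ALTER TABLE recent_grads MODIFY Major VARCHAR(70)')
cursor.execute('ALTER TABLE recent_grads MODIFY Major_category VARCHAR(70)')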
db.commit()

In [15]:
columns = ['Rank', 'Major_code', 'Major', 'Total', 'Men', 'Women',
           'Major_category', 'ShareWomen', 'Sample_size', 'Employed', 'Full_time',
           'Part_time', 'Full_time_year_round', 'Unemployed', 'Unemployment_rate',
           'Median', 'P25th', 'P75th', 'College_jobs', 'Non_college_jobs',
           'Low_wage_jobs']

# Use %s placeholders so the driver quotes and escapes every value.
# Formatting values straight into the SQL string breaks on names containing quotes.
sql = 'INSERT INTO recent_grads ({}) VALUES ({})'.format(
    ', '.join(columns), ', '.join(['%s'] * len(columns)))

# itertuples yields one plain tuple of Python ints, floats, and strings per row.
rows = list(recent_grads[columns].itertuples(index=False, name=None))
cursor.executemany(sql, rows)
db.commit()

Write a SQL query that returns the majors where women were a minority.
Only return the Major and ShareWomen columns (in that order), and don't limit the number of rows returned.

In [16]:
sql = '''
     SELECT Major, Sharewomen FROM recent_grads WHERE Sharewomen < 0.5;
'''
pd.read_sql(sql, db).iloc[:10]

Unnamed: 0,Major,Sharewomen
0,PETROLEUM ENGINEERING,0.120564
1,MINING AND MINERAL ENGINEERING,0.101852
2,METALLURGICAL ENGINEERING,0.153037
3,NAVAL ARCHITECTURE AND MARINE ENGINEERING,0.107313
4,CHEMICAL ENGINEERING,0.341631
5,NUCLEAR ENGINEERING,0.144967
6,ACTUARIAL SCIENCE,0.441356
7,MECHANICAL ENGINEERING,0.119559
8,ELECTRICAL ENGINEERING,0.19645
9,COMPUTER ENGINEERING,0.199413


We can use the AND operator to combine multiple filter criteria. For example, to determine which engineering majors had a majority of female graduates, we'd need to specify 2 filtering criteria:

```sql
SELECT Major FROM recent_grads
WHERE Major_category = 'Engineering' AND ShareWomen > 0.5
```

If we wanted to "zoom" back out to look at all of the columns for both of these majors to see if they shared some other common attributes, we can modify the SELECT statement and use the symbol * to represent all columns:

```sql
SELECT * FROM recent_grads
WHERE Major_category = 'Engineering' AND ShareWomen > 0.5
```

#### instructions
Write a SQL query that returns all majors that:
* had a majority of female graduates, and
* had a median salary greater than 50000.

Only include the following columns in the results and in this order:
* Major
* Major_category
* Median
* ShareWomen

In [17]:
sql = '''
        SELECT Major, Major_category, Median, ShareWomen
        FROM recent_grads
        WHERE ShareWomen > 0.5 AND Median > 50000
'''
pd.read_sql(sql, db)

Unnamed: 0,Major,Major_category,Median,ShareWomen
0,ASTRONOMY AND ASTROPHYSICS,Physical Sciences,62000,0.535714


We used the `AND` operator to specify that our filter needs to pass two Boolean conditions. Both of the conditions had to evaluate to True for the record to appear in the result set. If we wanted to specify a filter that meets either of the conditions instead, we would use the `OR` operator.

```sql
SELECT [column1, column2,...] FROM [table1]
WHERE [condition1] OR [condition2]
```
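The difference between `AND` and `OR` is easiest to see by running the same two conditions both ways. The sketch below uses Python's built-in `sqlite3` module on an in-memory stand-in table (three rows taken from outputs in this notebook). The helper `majors_where` and the `Median > 100000` threshold are hypothetical, chosen only to make the two results differ.

```python
import sqlite3

# Assumption: an in-memory stand-in table with three sample rows.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE recent_grads (Major TEXT, Median INTEGER, ShareWomen REAL)')
conn.executemany(
    'INSERT INTO recent_grads VALUES (?, ?, ?)',
    [
        ('PETROLEUM ENGINEERING', 110000, 0.120564),
        ('CHEMICAL ENGINEERING', 65000, 0.341631),
        ('ASTRONOMY AND ASTROPHYSICS', 62000, 0.535714),
    ],
)

def majors_where(condition):
    """Return the Major values matching a hard-coded, trusted WHERE condition."""
    rows = conn.execute('SELECT Major FROM recent_grads WHERE ' + condition)
    return [major for (major,) in rows]

# AND keeps a row only if both conditions hold; OR keeps it if at least one holds.
both = majors_where('ShareWomen > 0.5 AND Median > 100000')
either = majors_where('ShareWomen > 0.5 OR Median > 100000')
print('AND:', both)
print('OR: ', either)
conn.close()
```

No sample row satisfies both conditions, so the `AND` query returns nothing, while the `OR` query returns the two rows that satisfy one condition each.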

#### instructions
Write a SQL query that returns the first 20 majors that either:
* have a Median salary greater than or equal to 10,000, or
* have less than or equal to 1,000 Unemployed people

Only include the following columns in the results and in this order:
* Major
* Median
* Unemployed

In [18]:
sql = '''
        SELECT Major, Median, Unemployed FROM recent_grads
        WHERE Median >= 10000 OR Unemployed <= 1000
        LIMIT 20
'''
pd.read_sql(sql, db)

Unnamed: 0,Major,Median,Unemployed
0,PETROLEUM ENGINEERING,110000,37
1,MINING AND MINERAL ENGINEERING,75000,85
2,METALLURGICAL ENGINEERING,73000,16
3,NAVAL ARCHITECTURE AND MARINE ENGINEERING,70000,40
4,CHEMICAL ENGINEERING,65000,1672
5,NUCLEAR ENGINEERING,65000,400
6,ACTUARIAL SCIENCE,62000,308
7,ASTRONOMY AND ASTROPHYSICS,62000,33
8,MECHANICAL ENGINEERING,60000,4650
9,ELECTRICAL ENGINEERING,60000,3895


There's a certain class of questions that we can't answer using only the techniques we've learned so far. For example, if we wanted to write a query that returned all `Engineering` majors that **either** had mostly female graduates **or** had an unemployment rate below 5.1%, we would need to use parentheses to express this more complex logic.

The three raw conditions we'll need are:

```sql
Major_category = 'Engineering'
ShareWomen > 0.5
Unemployment_rate < 0.051
```

Here's what the SQL query looks like using parentheses:

```sql
SELECT Major, Major_category, ShareWomen, Unemployment_rate
FROM recent_grads
WHERE (Major_category = 'Engineering') AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051);
```

* The first thing to note is that **SQL's built-in keywords are case-insensitive, which means we don't have to capitalize operators like AND or statements like SELECT.** The query above would return the same results if it were written entirely in lowercase.
  * **This also goes for the column names** (you can use either major_category or Major_category).
  
* The second thing you may notice is **how we enclosed the logic we wanted to be evaluated together in parentheses**. This is very similar to how we group mathematical calculations together in a particular order. The parentheses make it explicitly clear to the database that we want all of the rows where both of the parenthesized expressions evaluate to True:

```sql
(Major_category = 'Engineering') -> True or False
(ShareWomen > 0.5 OR Unemployment_rate < 0.051) -> True or False
```

If we had written the `WHERE` clause without any parentheses, the database would not infer our intent. `AND` is evaluated before `OR`, so it would actually execute the following query instead:

```sql
WHERE (Major_category = 'Engineering' AND ShareWomen > 0.5) OR (Unemployment_rate < 0.051)
```
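The effect of the parentheses can be checked directly. The following sketch uses Python's built-in `sqlite3` module on an in-memory stand-in table (three rows taken from outputs in this notebook, an assumption made so the example is self-contained) and compares the grouped and ungrouped versions of the `WHERE` clause.

```python
import sqlite3

# Assumption: an in-memory stand-in table with three sample rows.
conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE recent_grads '
    '(Major TEXT, Major_category TEXT, ShareWomen REAL, Unemployment_rate REAL)'
)
conn.executemany(
    'INSERT INTO recent_grads VALUES (?, ?, ?, ?)',
    [
        ('PETROLEUM ENGINEERING', 'Engineering', 0.120564, 0.018381),
        ('METALLURGICAL ENGINEERING', 'Engineering', 0.153037, 0.024096),
        ('ASTRONOMY AND ASTROPHYSICS', 'Physical Sciences', 0.535714, 0.021167),
    ],
)

with_parens = '''
SELECT Major FROM recent_grads
WHERE (Major_category = 'Engineering') AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051)
'''
without_parens = '''
SELECT Major FROM recent_grads
WHERE Major_category = 'Engineering' AND ShareWomen > 0.5 OR Unemployment_rate < 0.051
'''
grouped = {major for (major,) in conn.execute(with_parens)}
ungrouped = {major for (major,) in conn.execute(without_parens)}

# AND binds more tightly than OR, so the ungrouped query also admits
# non-Engineering majors that have a low unemployment rate.
print(sorted(ungrouped - grouped))
conn.close()
```

The only row that differs between the two results is the Physical Sciences major, which slips through the ungrouped query on its unemployment rate alone.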

#### instructions
Run the query we explored above, which returns all Engineering majors that:
* either had mostly women graduates
* or had an unemployment rate below 5.1%, which was the rate in August 2015

Only include the following columns in the results and in this order:
* Major
* Major_category
* ShareWomen
* Unemployment_rate

In [19]:
sql = '''
        select major, major_category, sharewomen, unemployment_rate
        from recent_grads
        where (major_category = 'Engineering') and
              (sharewomen > 0.5 or unemployment_rate < 0.051)
'''
pd.read_sql(sql, db)

Unnamed: 0,major,major_category,sharewomen,unemployment_rate
0,PETROLEUM ENGINEERING,Engineering,0.120564,0.018381
1,METALLURGICAL ENGINEERING,Engineering,0.153037,0.024096
2,NAVAL ARCHITECTURE AND MARINE ENGINEERING,Engineering,0.107313,0.050125
3,MATERIALS SCIENCE,Engineering,0.31082,0.023043
4,ENGINEERING MECHANICS PHYSICS AND SCIENCE,Engineering,0.183985,0.006334
5,INDUSTRIAL AND MANUFACTURING ENGINEERING,Engineering,0.343473,0.042876
6,MATERIALS ENGINEERING AND MATERIALS SCIENCE,Engineering,0.325092,0.027789
7,INDUSTRIAL PRODUCTION TECHNOLOGIES,Engineering,0.24919,0.028308
8,ENGINEERING AND INDUSTRIAL MANAGEMENT,Engineering,0.174123,0.033652


As the questions we want to answer get more complex, we want more control over how the results are ordered. We can specify the order using the [ORDER BY](https://sqlite.org/lang_select.html#orderby) clause. For example, we may want to understand which majors that met the criteria in the `WHERE` clause had the lowest unemployment rate:

```sql
SELECT Rank, Major, Major_category, ShareWomen, Unemployment_rate
FROM recent_grads
WHERE (Major_category = 'Engineering') AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051)
ORDER BY Unemployment_rate
```

If we instead want the results ordered by the same column but in descending order, we can add the DESC keyword:

```sql
SELECT Rank, Major, Major_category, ShareWomen, Unemployment_rate
FROM recent_grads
WHERE (Major_category = 'Engineering') AND (ShareWomen > 0.5 OR Unemployment_rate < 0.051)
ORDER BY Unemployment_rate DESC
```
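As a quick check of ascending versus descending order, here is a minimal sketch using Python's built-in `sqlite3` module. It assumes an in-memory stand-in table with three rows taken from outputs in this notebook and leaves out the `WHERE` clause to keep the focus on `ORDER BY`.

```python
import sqlite3

# Assumption: an in-memory stand-in table, inserted in no particular order.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE recent_grads (Major TEXT, Unemployment_rate REAL)')
conn.executemany(
    'INSERT INTO recent_grads VALUES (?, ?)',
    [
        ('PETROLEUM ENGINEERING', 0.018381),
        ('NAVAL ARCHITECTURE AND MARINE ENGINEERING', 0.050125),
        ('METALLURGICAL ENGINEERING', 0.024096),
    ],
)

# Without a keyword, ORDER BY sorts in ascending order; DESC reverses it.
ascending = conn.execute(
    'SELECT Major, Unemployment_rate FROM recent_grads ORDER BY Unemployment_rate'
).fetchall()
descending = conn.execute(
    'SELECT Major, Unemployment_rate FROM recent_grads ORDER BY Unemployment_rate DESC'
).fetchall()

print('lowest first: ', ascending[0])
print('highest first:', descending[0])
conn.close()
```

With distinct values in the sort column, as here, the descending result is simply the ascending result reversed.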

#### instructions
Write a query that returns all majors where:
* ShareWomen is greater than 0.3
* and Unemployment_rate is less than .1

Only include the following columns in the results and in this order:
* Major
* ShareWomen
* Unemployment_rate

Order the results in descending order by the ShareWomen column.

In [20]:
sql = '''
        select major, sharewomen, unemployment_rate
        from recent_grads
        where sharewomen > 0.3 and unemployment_rate < 0.1
        order by sharewomen desc limit 10
'''
pd.read_sql(sql, db)

Unnamed: 0,major,sharewomen,unemployment_rate
0,EARLY CHILDHOOD EDUCATION,0.968954,0.040105
1,COMMUNICATION DISORDERS SCIENCES AND SERVICES,0.967998,0.047584
2,MEDICAL ASSISTING SERVICES,0.927807,0.042507
3,ELEMENTARY EDUCATION,0.923745,0.046586
4,FAMILY AND CONSUMER SCIENCES,0.910933,0.067128
5,SPECIAL NEEDS EDUCATION,0.906677,0.041508
6,HUMAN SERVICES AND COMMUNITY ORGANIZATION,0.90559,0.037819
7,SOCIAL WORK,0.904075,0.068828
8,NURSING,0.896019,0.044863
9,MISCELLANEOUS HEALTH MEDICAL PROFESSIONS,0.881294,0.081411


In this step, you'll practice going from question to answer using the SQL workflow. You'll focus on one of the questions we posed early in this mission:

* Which engineering majors had the highest full time employment rates?


Write a query that returns the Engineering or Physical Sciences majors in ascending order of unemployment rate.
* The results should only contain the Major_category, Major, and Unemployment_rate columns.

In [21]:
sql = '''
        select major_category, major, unemployment_rate
        from recent_grads
        where major_category = 'Engineering' or major_category = 'Physical Sciences'
        order by unemployment_rate limit 10
'''
pd.read_sql(sql, db)

Unnamed: 0,major_category,major,unemployment_rate
0,Engineering,ENGINEERING MECHANICS PHYSICS AND SCIENCE,0.006334
1,Engineering,PETROLEUM ENGINEERING,0.018381
2,Physical Sciences,ASTRONOMY AND ASTROPHYSICS,0.021167
3,Physical Sciences,ATMOSPHERIC SCIENCES AND METEOROLOGY,0.022229
4,Engineering,MATERIALS SCIENCE,0.023043
5,Engineering,METALLURGICAL ENGINEERING,0.024096
6,Physical Sciences,GEOSCIENCES,0.024374
7,Engineering,MATERIALS ENGINEERING AND MATERIALS SCIENCE,0.027789
8,Engineering,INDUSTRIAL PRODUCTION TECHNOLOGIES,0.028308
9,Engineering,ENGINEERING AND INDUSTRIAL MANAGEMENT,0.033652


In this mission, we became familiar with a dataset stored in a SQLite table by learning how to craft basic SQL queries. Here are a few things to note:

* We rarely linked to SQLite documentation, because it can be challenging to read when you're just starting out. Sites like https://www.w3resource.com/sqlite/aggregate-functions-and-grouping-sum.php are friendlier for looking up SQL commands.
* We learned about clauses, statements, keywords, and operators in SQL. Here's a diagram describing the difference between each term:

![](https://s3.amazonaws.com/dq-content/252/sql_components.svg)

In the next mission, we'll learn how to compute summary statistics and perform reductions on the same data in SQL.

In [22]:
db.close()