# Subqueries Lab

### Loading our Data

In [3]:
ls school_prices

diversity_school.csv    salary_potential.csv    tuition_income.csv
historical_tuition.csv  tuition_cost.csv


In [72]:
import pandas as pd
salary_potential_df = pd.read_csv('./school_prices/salary_potential.csv', index_col = 0)
tuition_cost_df = pd.read_csv('./school_prices/tuition_cost.csv', index_col = 0)
diversity_df = pd.read_csv('./school_prices/diversity_school.csv', index_col = 0)
tuition_income_df = pd.read_csv('./school_prices/tuition_income.csv', index_col = 0)

In [17]:
tuition_income_df[:2]

Unnamed: 0,name,state,total_price,year,campus,net_cost,income_lvl
0,Piedmont International University,NC,20174,2016,On Campus,11475.0,"0 to 30,000"
1,Piedmont International University,NC,20174,2016,On Campus,11451.0,"30,001 to 48,000"


In [18]:
import sqlite3
conn = sqlite3.connect('schools.db')

In [19]:
tuition_cost_df.to_sql('tuitions', conn, if_exists = 'replace')
salary_potential_df.to_sql('salaries', conn, if_exists = 'replace')
diversity_df.to_sql('diversity_categories', conn, if_exists = 'replace')

In [73]:
tuition_income_df.to_sql('tuition_incomes', conn, if_exists = 'replace')

### Exploring our data

We have a couple of new tables, so let's start by exploring them.  Use SQL to select the first three rows from the `diversity_categories` table.

In [20]:
sql = """
SELECT * FROM diversity_categories LIMIT 3;
"""

pd.read_sql(sql, conn)

Unnamed: 0,name,total_enrollment,state,category,enrollment
0,University of Phoenix-Arizona,195059,Arizona,Women,134722
1,University of Phoenix-Arizona,195059,Arizona,American Indian / Alaska Native,876
2,University of Phoenix-Arizona,195059,Arizona,Asian,1959


So we can see that each school has multiple entries, one for each diversity category.  Let's write a query (not a subquery) that selects just the entries where the category is `Women`.

In [22]:
sql = """
SELECT * FROM diversity_categories WHERE category = 'Women';
"""

women_category_df = pd.read_sql(sql, conn)

In [23]:
women_category_df[:2]

Unnamed: 0,name,total_enrollment,state,category,enrollment
0,University of Phoenix-Arizona,195059,Arizona,Women,134722
1,Ivy Tech Community College-Central Indiana,91179,Indiana,Women,53476


Then write another select statement that only returns entries where the category is `Women`, but this time return three columns: the name of the university (aliased as `college`), the state, and a `percentage_women` column computed as `enrollment` divided by `total_enrollment`.

> To avoid getting zero from integer division, you may need to cast each column as a float, [see reference](https://stackoverflow.com/questions/1666407/sql-server-division-returns-zero).

In [39]:
sql = """
SELECT name as college, state, CAST(enrollment as float) / CAST(total_enrollment as float)
as percentage_women 
FROM diversity_categories WHERE category = 'Women';
"""

In [40]:
percentage_women_df = pd.read_sql(sql, conn)

In [41]:
percentage_women_df[:2]

Unnamed: 0,college,state,percentage_women
0,University of Phoenix-Arizona,Arizona,0.690673
1,Ivy Tech Community College-Central Indiana,Indiana,0.586495
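To see why the cast matters, here is a minimal sketch that runs the division both ways. It uses a separate in-memory database (`demo_conn` is a name made up for this illustration) and the two enrollment numbers from the University of Phoenix-Arizona row above.

```python
import sqlite3

# Throwaway in-memory database, separate from schools.db (hypothetical name).
demo_conn = sqlite3.connect(":memory:")

# Integer divided by integer: SQLite truncates the result to an integer.
int_result = demo_conn.execute("SELECT 134722 / 195059;").fetchone()[0]

# Casting an operand to float forces real-valued division.
float_result = demo_conn.execute(
    "SELECT CAST(134722 AS float) / CAST(195059 AS float);"
).fetchone()[0]

print(int_result)               # 0
print(round(float_result, 6))   # 0.690673
```

Casting only one of the two operands would also work, since SQLite promotes the other operand once either side is a real number.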


Ok, now turn the query above into a subquery, and simply select the `college` and `percentage_women` columns from the derived table (i.e., the subquery).  Alias the results of the subquery as `gender_splits`.

In [45]:
sql = """
SELECT college, percentage_women FROM 
(SELECT name as college, state, CAST(enrollment as float) / CAST(total_enrollment as float)
as percentage_women 
FROM diversity_categories WHERE category = 'Women') as gender_splits;
"""

In [46]:
percentage_women_subquery_df = pd.read_sql(sql, conn)

percentage_women_subquery_df[:2]

Unnamed: 0,college,percentage_women
0,University of Phoenix-Arizona,0.690673
1,Ivy Tech Community College-Central Indiana,0.586495
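Because the derived table behaves like any other table, the outer query can also filter on the computed column. The sketch below assumes a tiny made-up version of `diversity_categories` (the colleges and numbers are invented for illustration) in an in-memory database.

```python
import sqlite3

# Toy stand-in for schools.db; all rows below are invented.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE diversity_categories "
    "(name TEXT, total_enrollment INTEGER, category TEXT, enrollment INTEGER);"
)
demo_conn.executemany(
    "INSERT INTO diversity_categories VALUES (?, ?, ?, ?);",
    [
        ("College A", 1000, "Women", 600),
        ("College A", 1000, "Asian", 100),
        ("College B", 2000, "Women", 800),
    ],
)

# The outer query treats gender_splits as a table and filters on its column.
sql = """
SELECT college, percentage_women FROM
(SELECT name AS college, CAST(enrollment AS float) / CAST(total_enrollment AS float)
AS percentage_women
FROM diversity_categories WHERE category = 'Women') AS gender_splits
WHERE percentage_women > 0.5;
"""
rows = demo_conn.execute(sql).fetchall()
print(rows)  # [('College A', 0.6)]
```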


### Joining a Table

Now let's say that we want to join the results above with income information in the salaries table.

In [48]:
pd.read_sql("SELECT * FROM salaries LIMIT 2;", conn)

Unnamed: 0,rank,name,state_name,early_career_pay,mid_career_pay,make_world_better_percent,stem_percent
0,1,Auburn University,Alabama,54400,104500,51.0,31
1,2,University of Alabama in Huntsville,Alabama,57500,103900,59.0,45


Ok, so we can begin by placing our entire previous query into a subquery.

In [50]:
sql = """
SELECT * FROM 
(SELECT name as college,  CAST(enrollment as float) / CAST(total_enrollment as float)
as percentage_women 
FROM diversity_categories WHERE category = 'Women') as gender_splits;
"""

subquery_df = pd.read_sql(sql, conn)
subquery_df[:2]

Unnamed: 0,college,percentage_women
0,University of Phoenix-Arizona,0.690673
1,Ivy Tech Community College-Central Indiana,0.586495


And then because we can treat the subquery as a table `gender_splits` with columns of `college` and `percentage_women`, we can simply join the `salaries` table just like we would any other table.

In [54]:
sql = """
SELECT college, percentage_women, early_career_pay, mid_career_pay FROM 
(SELECT name as college,  CAST(enrollment as float) / CAST(total_enrollment as float)
as percentage_women 
FROM diversity_categories WHERE category = 'Women') as gender_splits 
INNER JOIN salaries ON salaries.name = gender_splits.college;
"""

joined_subquery_df = pd.read_sql(sql, conn)
joined_subquery_df[:2]

Unnamed: 0,college,percentage_women,early_career_pay,mid_career_pay
0,Auburn University,0.493902,54400,104500
1,Tuskegee University,0.597809,54500,93500
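Notice that the first rows changed after the join: an `INNER JOIN` only keeps colleges whose names appear in both tables. A small sketch with invented data shows the difference between `INNER JOIN` and `LEFT JOIN` on the same subquery; the table contents here are assumptions for illustration, not rows from `schools.db`.

```python
import sqlite3

# Toy tables: College B has diversity data but no salary row.
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
CREATE TABLE diversity_categories
    (name TEXT, total_enrollment INTEGER, category TEXT, enrollment INTEGER);
CREATE TABLE salaries (name TEXT, mid_career_pay INTEGER);
INSERT INTO diversity_categories VALUES
    ('College A', 1000, 'Women', 600),
    ('College B', 2000, 'Women', 800);
INSERT INTO salaries VALUES ('College A', 90000);
""")

subquery = """
(SELECT name AS college, CAST(enrollment AS float) / CAST(total_enrollment AS float)
AS percentage_women
FROM diversity_categories WHERE category = 'Women') AS gender_splits
"""

# INNER JOIN: only colleges present in both tables survive.
inner_rows = demo_conn.execute(
    f"SELECT college, mid_career_pay FROM {subquery} "
    "INNER JOIN salaries ON salaries.name = gender_splits.college ORDER BY college;"
).fetchall()

# LEFT JOIN: every row of the subquery is kept, with NULL where no salary matches.
left_rows = demo_conn.execute(
    f"SELECT college, mid_career_pay FROM {subquery} "
    "LEFT JOIN salaries ON salaries.name = gender_splits.college ORDER BY college;"
).fetchall()

print(inner_rows)  # [('College A', 90000)]
print(left_rows)   # [('College A', 90000), ('College B', None)]
```

Swapping in `LEFT JOIN` is a quick way to check how many colleges are being dropped because their names do not match across tables.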


And from here, let's find those schools where `mid_career_pay` is greater than 100,000 and sort them by `percentage_women`, from highest to lowest.

In [58]:
sql = """
SELECT college, percentage_women, early_career_pay, mid_career_pay FROM 
(SELECT name as college,  CAST(enrollment as float) / CAST(total_enrollment as float)
as percentage_women 
FROM diversity_categories WHERE category = 'Women') as gender_splits 
INNER JOIN salaries ON salaries.name = gender_splits.college 
WHERE mid_career_pay > 100000 
ORDER BY percentage_women DESC;
"""

joined_subquery_df = pd.read_sql(sql, conn)
joined_subquery_df[:5]

Unnamed: 0,college,percentage_women,early_career_pay,mid_career_pay
0,Barnard College,0.998057,59200,109800
1,Wellesley College,0.973741,58900,106200
2,Samuel Merritt University,0.741139,91200,154100
3,Rush University,0.73138,63500,107600
4,Texas Tech University Health Sciences Center,0.684648,61300,104900
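As the query grows, the inline subquery can become hard to read. One alternative, not used elsewhere in this lab, is to write the same derived table as a common table expression (a `WITH` clause). The sketch below assumes toy versions of the two tables with invented rows.

```python
import sqlite3

# Toy stand-ins for the lab's tables; all values are invented.
demo_conn = sqlite3.connect(":memory:")
demo_conn.executescript("""
CREATE TABLE diversity_categories
    (name TEXT, total_enrollment INTEGER, category TEXT, enrollment INTEGER);
CREATE TABLE salaries (name TEXT, mid_career_pay INTEGER);
INSERT INTO diversity_categories VALUES
    ('College A', 1000, 'Women', 600),
    ('College B', 2000, 'Women', 800),
    ('College C', 1000, 'Women', 900);
INSERT INTO salaries VALUES
    ('College A', 120000),
    ('College B', 110000),
    ('College C', 95000);
""")

# Same logic as the subquery version, with gender_splits defined up front.
sql = """
WITH gender_splits AS (
    SELECT name AS college,
           CAST(enrollment AS float) / CAST(total_enrollment AS float) AS percentage_women
    FROM diversity_categories
    WHERE category = 'Women'
)
SELECT college, percentage_women, mid_career_pay
FROM gender_splits
INNER JOIN salaries ON salaries.name = gender_splits.college
WHERE mid_career_pay > 100000
ORDER BY percentage_women DESC;
"""
cte_rows = demo_conn.execute(sql).fetchall()
print(cte_rows)  # [('College A', 0.6, 120000), ('College B', 0.4, 110000)]
```

College C has the highest share of women in this toy data but is filtered out by the `mid_career_pay > 100000` condition.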


### Your Turn

Ok, now it's your turn to use a subquery in a join.  We'll start you off with the subquery from before.

In [62]:
sql = """
SELECT * FROM 
(SELECT name as college,  CAST(enrollment as float) / CAST(total_enrollment as float)
as percentage_women 
FROM diversity_categories WHERE category = 'Women') as gender_splits ;
"""

subquery_df = pd.read_sql(sql, conn)
subquery_df[:2]

Unnamed: 0,college,percentage_women
0,University of Phoenix-Arizona,0.690673
1,Ivy Tech Community College-Central Indiana,0.586495


In [65]:
tuitions_df = pd.read_sql("SELECT * FROM tuitions LIMIT 5;", conn)
tuitions_df[:2]

Unnamed: 0,name,state,state_code,type,degree_length,room_and_board,in_state_tuition,in_state_total,out_of_state_tuition,out_of_state_total
0,Aaniiih Nakoda College,Montana,MT,Public,2 Year,,2380,2380,2380,2380
1,Abilene Christian University,Texas,TX,Private,4 Year,10350.0,34850,45200,34850,45200


This time use JOIN to return the `out_of_state_tuition` cost, aliased as `tuition`, along with the `college` and `percentage_women` columns, and sort by the name of the college.

In [71]:
sql = """
SELECT college, percentage_women, out_of_state_tuition as tuition FROM 
(SELECT name as college, CAST(enrollment as float) / CAST(total_enrollment as float)
as percentage_women 
FROM diversity_categories WHERE category = 'Women') as gender_splits 
JOIN tuitions ON tuitions.name = gender_splits.college ORDER BY college ASC;
"""

subquery_df = pd.read_sql(sql, conn)
subquery_df[:2]

Unnamed: 0,college,percentage_women,tuition
0,Aaniiih Nakoda College,0.611684,2380
1,Abilene Christian University,0.578721,34850


Finally, let's load up some data from the `tuition_incomes` table.

In [74]:
sql = """
SELECT * FROM tuition_incomes;
"""
tuition_incomes_df = pd.read_sql(sql, conn)
tuition_incomes_df[:2]

Unnamed: 0,name,state,total_price,year,campus,net_cost,income_lvl
0,Piedmont International University,NC,20174,2016,On Campus,11475.0,"0 to 30,000"
1,Piedmont International University,NC,20174,2016,On Campus,11451.0,"30,001 to 48,000"
