### How often does pay for lower degree levels exceed higher degree levels?

Generally speaking, higher levels of education lead to higher salaries. Workers with master's degrees earn more, on average, than workers with bachelor's degrees. Workers with doctoral degrees earn more, on average, than workers with master's degrees.

That said, it's fairly common knowledge that the institution and academic field play a substantial, often greater, role in determining compensation. A bachelor's degree in electrical engineering will likely result in a higher salary than a master's degree in journalism. A degree from an elite law school is likely to result in a much higher salary than one from a mid-tier program.

The US Department of Education recently released data [1] that allows a much finer-grained analysis of trends in degree level, program, and academic subject.

This data was used by The Center on Education and Workforce at Georgetown to analyze the financial value of various academic degrees and programs relative to average student indebtedness [2]. 

[1] https://collegescorecard.ed.gov/data/

[2] http://cew.georgetown.edu/wp-content/uploads/CEW-Buyer-Beware.pdf 
Note: The study appears to have used the 1516-1617 dataset. The notebook here will work for more recent cohorts as well, though the salary and cohort-size data will change. If you want to recreate the figures provided in the study, use the 1516-1617 set.

The article reports that in the general workforce, higher degree levels don't necessarily imply higher pay:

 27 percent of workers with an associate’s degree earn more than the median for workers with a bachelor’s degree

 35 percent of workers with a bachelor’s degree earn more than the median for workers with a master’s degree

 31 percent of workers with a master’s degree earn more than the median for workers with a doctoral degree

 22 percent of workers with a master’s degree earn more than the median for workers with a professional degree

Note that this data is based on Georgetown University Center on Education and the Workforce analysis of data from the Current Population Survey, 2019, which measures pay "among full-time, full-year workers,
25 to 64 years old", not from the College Scorecard dataset, which measures 1st and 2nd year pay after completion of a program. 
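
To make the statistic itself concrete: with individual-level pay data, as in the Current Population Survey, the share can be computed directly. The sketch below uses made-up salaries and a hypothetical helper name (`share_above_median`); it is only meant to illustrate what "percent earning more than the median for the next degree level" measures, not to reproduce the study's method.

```python
from statistics import median

def share_above_median(lower_pay, higher_pay):
    """Fraction of lower_pay values strictly above the median of higher_pay."""
    threshold = median(higher_pay)
    return sum(p > threshold for p in lower_pay) / len(lower_pay)

# Made-up individual salaries, for illustration only
bachelors_pay = [38000, 45000, 52000, 61000, 90000]
masters_pay = [48000, 55000, 60000, 72000, 110000]

# The master's median is 60000; two of the five bachelor's salaries exceed it
print(share_above_median(bachelors_pay, masters_pay))
```

The College Scorecard data does not list individuals, only cohorts, which is why the calculation later in this notebook has to work with cohort sizes and cohort medians instead.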

In this notebook, I tried recreating these numbers for the 1st year earnings, to investigate the possible difference between recent graduates and the workforce in general.

My conclusion:

Based on the 1516-1617 dataset (the one used for the study), the advantages of a higher degree are considerably less notable in first-year earnings than for the workforce at large.

For 1st year graduates:

* 18 percent of workers with an associate’s degree earn more than the median for workers with a bachelor’s degree

* 17 percent of workers with a bachelor’s degree earn more than the median for workers with a master’s degree

* 15 percent of workers with a master’s degree earn more than the median for workers with a doctoral degree

* 28 percent of workers with a master’s degree earn more than the median for workers with a professional degree

There are a number of possible interpretations. The value of a higher degree may be decreasing over time, the benefits of a higher degree may accrue later in a career, or immediate and long-term benefits may be much more pronounced in certain specialized fields of study (and not reflected in data aggregated across all fields).

The workbook below shows my queries and calculations for generating the stats above. 

<b>NOTE: The query and calculation pipeline here is intricate, and I haven't had anyone else look at this yet. So I'm hesitant to draw any conclusions at all until I'm more confident in the accuracy of this notebook. I would welcome a code review, or another independent effort to confirm or correct the numbers above.</b>

In [2]:
import pandas as pd
pd.set_option('display.max_colwidth', None)
from pandasql import sqldf
pysqldf = lambda q: sqldf(q, globals())

### Getting the data:

To download the dataset used here, go to:
    
https://collegescorecard.ed.gov/data/
    
And follow the link for: 

https://data.ed.gov/dataset/college-scorecard-all-data-files-through-6-2020/resources

Download "All College Scorecard Data Files"

Several datasets are available. I used the same dataset referenced in the study, FieldOfStudyData1516_1617_PP.csv. However, you can also run this notebook using a more recent dataset.

In [3]:
df = pd.read_csv('data/FieldOfStudyData1516_1617_PP.csv')

In [4]:
len(df)

218901

#### Remove missing data

Prior to analysis, I'm removing all rows without 1st year earnings data (PrivacySuppressed). 

For privacy reasons, and to avoid reporting personally identifiable data, programs with small cohorts do not report data. Unfortunately, there's a good chance this introduces bias into the dataset, as some degree programs have much larger graduating cohorts than others. Harvard Law School, with a large cohort, will report numbers, whereas the doctoral program in Operations Research at UC Berkeley, typically a small cohort, does not. As a result, conclusions may be skewed by the uneven availability of data across programs, disciplines, and degree types.
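
One rough way to gauge how uneven the suppression is would be to compare the share of suppressed rows across degree levels. The sketch below is an illustration on made-up rows, not part of the original analysis; the helper name `suppressed_share_by_level` is hypothetical, and it assumes suppressed values appear as the literal string `PrivacySuppressed`, as in the query that follows.

```python
import pandas as pd

def suppressed_share_by_level(raw, pay_col="EARN_MDN_HI_1YR", level_col="CREDDESC"):
    """Fraction of rows per degree level whose earnings are privacy-suppressed."""
    is_suppressed = raw[pay_col].astype(str).eq("PrivacySuppressed")
    return is_suppressed.groupby(raw[level_col]).mean()

# Made-up rows, for illustration only
toy_raw = pd.DataFrame({
    "CREDDESC": ["Doctoral Degree", "Doctoral Degree",
                 "Bachelors Degree", "Bachelors Degree",
                 "Bachelors Degree", "Bachelors Degree"],
    "EARN_MDN_HI_1YR": ["PrivacySuppressed", "71000",
                        "PrivacySuppressed", "34000", "41000", "52000"],
})
suppressed_share = suppressed_share_by_level(toy_raw)
print(suppressed_share)
```

On the real file this could be called as `suppressed_share_by_level(df)`; a large gap between degree levels would suggest the comparisons below rest on unevenly sampled programs.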



In [6]:
df_1yr = pysqldf("""
SELECT 
    INSTNM, 
    CIPCODE,
    CIPDESC, 
    CREDLEV, 
    CREDDESC, 
    EARN_COUNT_WNE_HI_1YR * 1 AS EARN_COUNT_WNE_HI_1YR, 
    EARN_MDN_HI_1YR * 1.0 AS EARN_MDN_HI_1YR
FROM 
    df
WHERE
    EARN_MDN_HI_1YR <> 'PrivacySuppressed'
ORDER BY EARN_MDN_HI_1YR
""")

In [7]:
3946/7623

0.5176439721894267
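
For comparison, the filtering query above can be sketched in plain pandas. This is a hypothetical equivalent on made-up rows, not the code that produced the results in this notebook, and the helper name `clean_first_year` is an assumption. One detail it tries to mirror: in SQLite, multiplying a non-numeric text value by 1 yields 0, so rows whose cohort count is suppressed but whose earnings are reported appear to be kept with a count of 0 (the Pierpont row further down shows this).

```python
import pandas as pd

def clean_first_year(raw: pd.DataFrame) -> pd.DataFrame:
    """Keep rows with reported 1st-year earnings and make both columns numeric."""
    out = raw.copy()
    # Non-numeric entries such as 'PrivacySuppressed' become NaN
    out["EARN_MDN_HI_1YR"] = pd.to_numeric(out["EARN_MDN_HI_1YR"], errors="coerce")
    counts = pd.to_numeric(out["EARN_COUNT_WNE_HI_1YR"], errors="coerce")
    # Suppressed counts become 0, matching the SQLite coercion described above
    out["EARN_COUNT_WNE_HI_1YR"] = counts.fillna(0).astype(int)
    out = out.dropna(subset=["EARN_MDN_HI_1YR"])
    return out.sort_values("EARN_MDN_HI_1YR", kind="stable").reset_index(drop=True)

# Made-up rows, for illustration only
toy_rows = pd.DataFrame({
    "CREDLEV": [2, 3, 5, 5],
    "EARN_COUNT_WNE_HI_1YR": ["24", "PrivacySuppressed", "30", "PrivacySuppressed"],
    "EARN_MDN_HI_1YR": ["60200", "PrivacySuppressed", "15500", "80100"],
})
toy_1yr = clean_first_year(toy_rows)
print(toy_1yr)
```

Cohorts with a count of 0 contribute nothing to the graduate-weighted calculations later on, so whether to keep or drop them is a judgment call worth revisiting.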

For simplicity, I'll use the CREDLEV code for each degree level. As a visual reference, here's a lookup table.

In [8]:
pysqldf("SELECT DISTINCT CREDLEV, CREDDESC FROM df_1yr ORDER BY CREDLEV")

Unnamed: 0,CREDLEV,CREDDESC
0,1,Undergraduate Certificate or Diploma
1,2,Associate's Degree
2,3,Bachelors Degree
3,4,Post-baccalaureate Certificate
4,5,Master's Degree
5,6,Doctoral Degree
6,7,First Professional Degree
7,8,Graduate/Professional Certificate


I couldn't see where the Georgetown article states which dataset it was using, so I spot checked a few numbers and concluded it was the 1516-1617 dataset.

Here are a few spot checks:

<i>Graduates who earn an associate’s degree from
Pierpont Community and Technical College in West Virginia to become an electrical and
power transmission installer can expect to make \\$6,700 per month (\\$80,400 per year) in
their first year after graduation.</i>

Ever so slightly off, I get \\$80,100, but close enough.

In [9]:
pysqldf("""
SELECT 
    * 
FROM 
    df_1yr 
WHERE 
    INSTNM = 'Pierpont Community and Technical College'
AND 
    CIPDESC = 'Electrical and Power Transmission Installers.'
""")

Unnamed: 0,INSTNM,CIPCODE,CIPDESC,CREDLEV,CREDDESC,EARN_COUNT_WNE_HI_1YR,EARN_MDN_HI_1YR
0,Pierpont Community and Technical College,4603,Electrical and Power Transmission Installers.,2,Associate's Degree,0,80100.0


One more spot check...

<i>the first-year earnings of a person who is awarded
an associate’s degree in nursing from City University of New York (CUNY) LaGuardia Community College are \\$5,017 per month, which is about \\$800 higher than the median monthly earnings of graduates from master’s degree programs at all institutions.</i>

A query against the 1516-1617 dataset gives the same answer, confirmed below.

In [11]:
pysqldf("""
SELECT 
    * 
FROM 
    df_1yr 
WHERE 
    INSTNM = 'CUNY LaGuardia Community College'
AND
    CIPDESC = 'Registered Nursing, Nursing Administration, Nursing Research and Clinical Nursing.'
AND
    CREDLEV = 2
""")

Unnamed: 0,INSTNM,CIPCODE,CIPDESC,CREDLEV,CREDDESC,EARN_COUNT_WNE_HI_1YR,EARN_MDN_HI_1YR
0,CUNY LaGuardia Community College,5138,"Registered Nursing, Nursing Administration, Nursing Research and Clinical Nursing.",2,Associate's Degree,24,60200.0


In [15]:
60200.0 / 12

5016.666666666667

Moving on to the calculation pipeline. 

My goal is to calculate the percent of each degree level that exceeds the median for the degree level above it.

For simplicity, I'll create a separate dataframe for each degree level. 

In [12]:
df_asoc = pysqldf("SELECT * FROM df_1yr WHERE CREDLEV = 2")
df_bach = pysqldf("SELECT * FROM df_1yr WHERE CREDLEV = 3")
df_ms = pysqldf("SELECT * FROM df_1yr WHERE CREDLEV = 5")
df_prof = pysqldf("SELECT * FROM df_1yr WHERE CREDLEV = 7")
df_doc = pysqldf("SELECT * FROM df_1yr WHERE CREDLEV = 6")

To get the median for all graduates, I need to consider the number of graduates in each cohort (I can't take the median for all programs as the median for all students, because there are varying numbers of graduates in each cohort).

To do this, I'll take the cumulative sum of the number of graduates in each program, ordered by earnings (1st year out).

In [13]:
df_asoc['num_below'] = df_asoc['EARN_COUNT_WNE_HI_1YR'].cumsum()
df_bach['num_below'] = df_bach['EARN_COUNT_WNE_HI_1YR'].cumsum()
df_ms['num_below'] = df_ms['EARN_COUNT_WNE_HI_1YR'].cumsum()
df_prof['num_below'] = df_prof['EARN_COUNT_WNE_HI_1YR'].cumsum()
df_doc['num_below'] = df_doc['EARN_COUNT_WNE_HI_1YR'].cumsum()
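
To illustrate why the cumulative count matters, here is a toy sketch with made-up cohorts. It contrasts the median over programs with the pay of the median graduate. The rule used here (take the first cohort whose cumulative count reaches half of all graduates) is one common convention and an assumption of this sketch; the query further down uses a nearest-to-halfway lookup instead, which gives the same answer on this toy data.

```python
import pandas as pd

# Made-up cohorts, already sorted by median pay (illustration only)
toy_ms = pd.DataFrame({
    "EARN_COUNT_WNE_HI_1YR": [5, 5, 10, 50, 30],
    "EARN_MDN_HI_1YR": [20000.0, 30000.0, 40000.0, 50000.0, 60000.0],
})
toy_ms["num_below"] = toy_ms["EARN_COUNT_WNE_HI_1YR"].cumsum()

# Median over programs ignores cohort size
median_program_pay = toy_ms["EARN_MDN_HI_1YR"].median()

# Median graduate: first cohort whose cumulative count reaches half the total
half_of_graduates = toy_ms["num_below"].iloc[-1] / 2
reaches_half = toy_ms["num_below"] >= half_of_graduates
median_graduate_pay = toy_ms.loc[reaches_half, "EARN_MDN_HI_1YR"].iloc[0]

print(median_program_pay, median_graduate_pay)
```

Because the two largest cohorts sit at the top of the pay range, the median graduate earns more than the median program suggests.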

... and do a quick sanity check through visual inspection to make sure ordering is correct

In [15]:
df_ms.head(10)

Unnamed: 0,INSTNM,CIPCODE,CIPDESC,CREDLEV,CREDDESC,EARN_COUNT_WNE_HI_1YR,EARN_MDN_HI_1YR,num_below
0,Stanbridge University,5123,Rehabilitation and Therapeutic Professions.,5,Master's Degree,0,5500.0,0
1,Touro College,3001,Biological and Physical Sciences.,5,Master's Degree,28,9500.0,28
2,Dongguk University-Los Angeles,5112,Medicine.,5,Master's Degree,0,11300.0,28
3,Acupuncture and Integrative Medicine College-Berkeley,5133,Alternative and Complementary Medicine and Medical Systems.,5,Master's Degree,27,15200.0,55
4,Academy of Art University,5007,Fine and Studio Arts.,5,Master's Degree,48,15300.0,103
5,University of Illinois at Urbana-Champaign,5009,Music.,5,Master's Degree,30,15500.0,133
6,Emperor's College of Traditional Oriental Medicine,5133,Alternative and Complementary Medicine and Medical Systems.,5,Master's Degree,32,15900.0,165
7,New York Academy of Art,5007,Fine and Studio Arts.,5,Master's Degree,41,16000.0,206
8,University of Puerto Rico-Rio Piedras,402,Architecture.,5,Master's Degree,34,16100.0,240
9,University of Southern California,5009,Music.,5,Master's Degree,46,16200.0,286


In [16]:
df_ms.tail(10)

Unnamed: 0,INSTNM,CIPCODE,CIPDESC,CREDLEV,CREDDESC,EARN_COUNT_WNE_HI_1YR,EARN_MDN_HI_1YR,num_below
7613,University of California-San Francisco,5114,Medical Clinical Sciences/Graduate Medical Studies.,5,Master's Degree,29,184500.0,1057633
7614,University of New England,5138,"Registered Nursing, Nursing Administration, Nursing Research and Clinical Nursing.",5,Master's Degree,21,184700.0,1057654
7615,Saint Mary's University of Minnesota,5138,"Registered Nursing, Nursing Administration, Nursing Research and Clinical Nursing.",5,Master's Degree,81,186000.0,1057735
7616,Central Connecticut State University,5138,"Registered Nursing, Nursing Administration, Nursing Research and Clinical Nursing.",5,Master's Degree,62,188500.0,1057797
7617,Duke University,5211,International Business.,5,Master's Degree,21,189400.0,1057818
7618,University of Washington-Seattle Campus,5105,Advanced/Graduate Dentistry and Oral Sciences.,5,Master's Degree,26,197000.0,1057844
7619,University of Pennsylvania,5199,"Health Professions and Related Clinical Sciences, Other.",5,Master's Degree,23,197100.0,1057867
7620,University of Michigan-Ann Arbor,5105,Advanced/Graduate Dentistry and Oral Sciences.,5,Master's Degree,29,204000.0,1057896
7621,Virginia Commonwealth University,5105,Advanced/Graduate Dentistry and Oral Sciences.,5,Master's Degree,25,208600.0,1057921
7622,Ohio State University-Main Campus,5104,Dentistry.,5,Master's Degree,0,231200.0,1057921


Data is prepped.

Next, I'll calculate the percentage of graduates at a lower degree level earning at or above the median for a higher degree level. 

First up, I'll calculate the percentage of Bachelors degree graduates who earn more than the median for Masters degree holders (for 1st year earnings). 

To do this, I'll need to calculate the median salary for all MS degree recipients, then calculate the percentage of bachelor's degree recipients who exceed it.

As mentioned above, this isn't as easy as finding the salary for the median *program*, since each program has a differing cohort size. I need to find the salary for the median *graduate*, not *program*. To do this, I first find the salary for the program and cohort closest to the median masters student (using the cumulative count of all students, called "num_below"). 

In [17]:
pysqldf("""
SELECT 
        EARN_MDN_HI_1YR 
    FROM 
        (SELECT 
            *, 
            ABS(num_below - (SELECT MAX(num_below)/2 FROM df_ms)) AS D 
        FROM 
            df_ms
        ORDER BY D 
        LIMIT 1)
""")

Unnamed: 0,EARN_MDN_HI_1YR
0,54400.0


Now that I have this number, I can count the graduates at a different degree level whose salary is at or above that number, and then divide by the total number of grads at that degree level to get the percentage.

In [52]:
pysqldf("""
SELECT 
    SUM(EARN_COUNT_WNE_HI_1YR) * 1.0 / (SELECT MAX(num_below) FROM df_bach) * 1.0 AS PCT_ABOVE
FROM 
    df_bach 
WHERE 
    EARN_MDN_HI_1YR >= 54400.0
""")

Unnamed: 0,PCT_ABOVE
0,0.166612


Now that I've been through the calc pipeline once, I'm going to modularize it so I can do the calculations for any two degree levels.

#### A little more explanation for this calculation query/pipeline.

I want to find the percent of graduates at one degree level who have a 1st year salary at or above the median for a different degree level. 

The data is not reported for each graduate. Instead, the data includes a row for each cohort for each degree program/institution. This row includes the number of graduates and the median 1yr pay for this group.

To find the medians, I ordered the data by 1yr pay in ascending order, then created a new column that holds the cumulative number of graduates at or below each row (called "num_below", as it provides the number of graduates at or below the median salary for a particular cohort within each degree level). I use this column to find the 1yr salary of the median graduate for each degree level (rather than the median program). 

Now that I have the median 1yr salary for a degree level, I can find the number of graduates at or above this median among graduates from a different degree level (using the same num_below column).
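
As a cross-check on the logic described above, the same steps can be written in plain pandas. This is a sketch on made-up data, not the code used for the reported numbers; the function name `pct_at_or_above_median_pd` is hypothetical. It tries to follow the SQL closely, including the nearest-to-halfway lookup and integer division for the halfway point, but tie-breaking between equally close cohorts may differ.

```python
import pandas as pd

def pct_at_or_above_median_pd(med_df, above_df,
                              count_col="EARN_COUNT_WNE_HI_1YR",
                              pay_col="EARN_MDN_HI_1YR"):
    """Share of graduates in above_df at or above the median graduate's pay in med_df."""
    med_sorted = med_df.sort_values(pay_col, kind="stable").reset_index(drop=True)
    num_below = med_sorted[count_col].cumsum()
    halfway = num_below.iloc[-1] // 2
    # Cohort whose cumulative count is closest to the halfway point
    nearest = (num_below - halfway).abs().idxmin()
    median_pay = med_sorted.loc[nearest, pay_col]
    at_or_above = above_df.loc[above_df[pay_col] >= median_pay, count_col].sum()
    return at_or_above / above_df[count_col].sum()

# Made-up cohorts, for illustration only
toy_higher = pd.DataFrame({
    "EARN_COUNT_WNE_HI_1YR": [20, 50, 30],
    "EARN_MDN_HI_1YR": [40000.0, 50000.0, 70000.0],
})
toy_lower = pd.DataFrame({
    "EARN_COUNT_WNE_HI_1YR": [60, 30, 10],
    "EARN_MDN_HI_1YR": [30000.0, 50000.0, 65000.0],
})
toy_pct = pct_at_or_above_median_pd(toy_higher, toy_lower)
print(toy_pct)
```

In the toy data the median graduate of the higher level earns 50000, and 40 of the 100 lower-level graduates sit in cohorts at or above that figure.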


The queries in the cell below carry out these calculations: first the median for a degree level ("med"), then the percentage at or above it from a different degree level ("above").

In [18]:
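def pct_at_or_above_median(med, above):
    # med, above: names (strings) of global dataframes, e.g. 'df_ms', 'df_bach'.
    # Returns the share of graduates in `above` whose cohort's median 1st-year pay
    # is at or above the pay of the median graduate in `med`.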
    return pysqldf("""
    WITH med AS 
    (
    SELECT 
        EARN_MDN_HI_1YR 
    FROM 
        (SELECT 
            *, 
            ABS(num_below - (SELECT MAX(num_below)/2 FROM {0})) AS D 
        FROM 
            {0} 
        ORDER BY D 
        LIMIT 1)
    )

    SELECT 
        SUM(EARN_COUNT_WNE_HI_1YR) * 1.0 / (SELECT MAX(num_below) FROM {1}) * 1.0 AS PCT_ABOVE
    FROM 
        {1} 
    WHERE 
        EARN_MDN_HI_1YR >= (SELECT EARN_MDN_HI_1YR FROM med)

    """.format(med, above))

For reference, here again are the numbers for the *general working population 25-64*

- 27 percent of workers with an associate’s degree earn more than the median for workers with a bachelor’s degree
- 35 percent of workers with a bachelor’s degree earn more than the median for workers with a master’s degree
- 31 percent of workers with a master’s degree earn more than the median for workers with a doctoral degree
- 22 percent of workers with a master’s degree earn more than the median for workers with a professional degree.

### Results from the 1516-1617 dataset

Here are the calculations (using the pipeline above).

#### Percent of associates above median for bachelors

In [26]:
pct_at_or_above_median('df_bach', 'df_asoc')

Unnamed: 0,PCT_ABOVE
0,0.177478


#### Percent of bachelors above median for masters

In [20]:
pct_at_or_above_median('df_ms', 'df_bach')

Unnamed: 0,PCT_ABOVE
0,0.166612


#### Percent of masters above median for doctoral

In [19]:
pct_at_or_above_median('df_doc', 'df_ms')

Unnamed: 0,PCT_ABOVE
0,0.154378


#### Percent of masters above median for professional

In [20]:
pct_at_or_above_median('df_prof', 'df_ms')

Unnamed: 0,PCT_ABOVE
0,0.283655


### Calculations based on median program, rather than median graduate

It might be useful to look at the data by program level rather than individual.

I think that the individual calculation provides a better comparison with the general population measures, so this is included as an addendum. 


In [21]:
# Median of the program-level medians (each program counts once, regardless of cohort size)
m_asoc = df_asoc['EARN_MDN_HI_1YR'].median()
m_bach = df_bach['EARN_MDN_HI_1YR'].median()
m_ms = df_ms['EARN_MDN_HI_1YR'].median()
m_doc = df_doc['EARN_MDN_HI_1YR'].median()
m_prof = df_prof['EARN_MDN_HI_1YR'].median()

In [22]:
print(m_asoc)
print(m_bach)
print(m_ms)
print(m_doc)
print(m_prof)

29400.0
34500.0
50600.0
69150.0
61700.0


In [23]:
def pct_at_or_above_median_program(median, degree):
    return pysqldf("""
    SELECT (COUNT(*) * 1.0) / ((SELECT COUNT(*) FROM {0})*1.0) as pct_above
    FROM {0} WHERE EARN_MDN_HI_1YR >= {1}""".format(degree, median))
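
The program-level version is simple enough to sketch in pandas as well. The helper name `pct_programs_at_or_above` and the toy frames are assumptions for illustration; here each program counts once regardless of cohort size.

```python
import pandas as pd

def pct_programs_at_or_above(lower_df, higher_df, pay_col="EARN_MDN_HI_1YR"):
    """Share of programs in lower_df at or above the median program pay in higher_df."""
    program_median = higher_df[pay_col].median()
    return (lower_df[pay_col] >= program_median).mean()

# Made-up program medians, for illustration only
toy_higher_programs = pd.DataFrame({"EARN_MDN_HI_1YR": [40000.0, 50000.0, 70000.0]})
toy_lower_programs = pd.DataFrame({"EARN_MDN_HI_1YR": [20000.0, 30000.0, 50000.0, 65000.0]})

toy_program_pct = pct_programs_at_or_above(toy_lower_programs, toy_higher_programs)
print(toy_program_pct)
```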

In [24]:
pct_at_or_above_median_program(m_bach, 'df_asoc')

Unnamed: 0,pct_above
0,0.35324


In [25]:
pct_at_or_above_median_program(m_ms, 'df_bach')

Unnamed: 0,pct_above
0,0.188873


In [26]:
pct_at_or_above_median_program(m_doc, 'df_ms')

Unnamed: 0,pct_above
0,0.195199


In [27]:
pct_at_or_above_median_program(m_prof, 'df_ms')

Unnamed: 0,pct_above
0,0.279418
