# College Major Salary Analysis

This is an analysis of potential earnings by college major, including an assessment of the most and least risky majors. Philosophy departments often claim that philosophy majors go on to find high-paying jobs, but our analysis suggests that this holds only at the 90th percentile of mid-career earners. Philosophy remains among the riskiest majors.

Data Source: 
<a href='https://www.payscale.com/college-salary-report/majors-that-pay-you-back/bachelors'>Highest Paying Bachelor's Degrees </a>

In [10]:
import pandas as pd
import plotly.express as px

# Load the PayScale salary data (the CSV is expected in the working directory)
df = pd.read_csv('salaries_by_college_major.csv')


# EDA

In [3]:
df.head()

Unnamed: 0,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
0,Accounting,46000.0,77100.0,42200.0,152000.0,Business
1,Aerospace Engineering,57700.0,101000.0,64300.0,161000.0,STEM
2,Agriculture,42600.0,71900.0,36300.0,150000.0,Business
3,Anthropology,36800.0,61500.0,33800.0,138000.0,HASS
4,Architecture,41600.0,76800.0,50600.0,136000.0,Business


<h4> Data set includes 50 majors (the 51st row is a source footer) </h4>

In [8]:
df.shape

(51, 6)

In [5]:
df.columns

Index(['Undergraduate Major', 'Starting Median Salary',
       'Mid-Career Median Salary', 'Mid-Career 10th Percentile Salary',
       'Mid-Career 90th Percentile Salary', 'Group'],
      dtype='object')

In [8]:
df.isnull()

Unnamed: 0,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False
3,False,False,False,False,False,False
4,False,False,False,False,False,False
5,False,False,False,False,False,False
6,False,False,False,False,False,False
7,False,False,False,False,False,False
8,False,False,False,False,False,False
9,False,False,False,False,False,False


In [6]:
df.tail(5)

Unnamed: 0,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
46,Psychology,35900.0,60400.0,31600.0,127000.0,HASS
47,Religion,34100.0,52000.0,29700.0,96400.0,HASS
48,Sociology,36500.0,58200.0,30700.0,118000.0,HASS
49,Spanish,34000.0,53100.0,31000.0,96400.0,HASS
50,Source: PayScale Inc.,,,,,


In [11]:
clean_df = df.dropna()
clean_df.tail()

Unnamed: 0,Undergraduate Major,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
45,Political Science,40800.0,78200.0,41200.0,168000.0,HASS
46,Psychology,35900.0,60400.0,31600.0,127000.0,HASS
47,Religion,34100.0,52000.0,29700.0,96400.0,HASS
48,Sociology,36500.0,58200.0,30700.0,118000.0,HASS
49,Spanish,34000.0,53100.0,31000.0,96400.0,HASS
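
As a complement to scanning the True/False grid from `df.isnull()`, the missing values can be counted per column and the offending rows listed directly. The sketch below is a minimal illustration on a small stand-in frame (`sample`, a hypothetical name) built from three rows shown above, not on the full CSV.

```python
import pandas as pd

# Stand-in for the tail of the real data: two majors plus the footer row.
# Assumption: the real frame is loaded from salaries_by_college_major.csv.
sample = pd.DataFrame(
    {
        "Undergraduate Major": ["Sociology", "Spanish", "Source: PayScale Inc."],
        "Starting Median Salary": [36500.0, 34000.0, None],
        "Mid-Career Median Salary": [58200.0, 53100.0, None],
        "Group": ["HASS", "HASS", None],
    }
)

# Count missing values per column instead of reading a boolean table.
missing_per_column = sample.isna().sum()

# Rows containing at least one missing value (here, only the footer).
rows_with_missing = sample[sample.isna().any(axis=1)]

# Drop those rows, as the notebook does with df.dropna().
clean_sample = sample.dropna()

print(missing_per_column)
print(rows_with_missing["Undergraduate Major"].tolist())
print(len(clean_sample))
```

Listing the rows before dropping them makes it easier to confirm that only the footer is removed and no real major is lost.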


**Major with the highest median starting salary: physician assistant**

In [10]:
clean_df['Starting Median Salary'].max()

74300.0

In [11]:
clean_df['Starting Median Salary'].idxmax()

43

In [12]:
clean_df['Undergraduate Major'][43]

'Physician Assistant'

In [13]:
clean_df.loc[43]

Undergraduate Major                  Physician Assistant
Starting Median Salary                           74300.0
Mid-Career Median Salary                         91700.0
Mid-Career 10th Percentile Salary                66400.0
Mid-Career 90th Percentile Salary               124000.0
Group                                               STEM
Name: 43, dtype: object
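
The lookups above (maximum value, index label, then row) can also be combined into a single step by passing the label from `idxmax()` straight to `.loc`. This is only a sketch on a three-row stand-in frame (`sample` is a hypothetical name) that reuses values and index labels from the outputs above.

```python
import pandas as pd

# Three rows copied from the notebook output, keeping their original labels.
sample = pd.DataFrame(
    {
        "Undergraduate Major": ["Accounting", "Physician Assistant", "Spanish"],
        "Starting Median Salary": [46000.0, 74300.0, 34000.0],
    },
    index=[0, 43, 49],
)

# idxmax() returns the index label of the largest value, not its position.
top_label = sample["Starting Median Salary"].idxmax()

# .loc looks the row up by that label, so no label needs to be hardcoded.
top_row = sample.loc[top_label]

print(top_label, top_row["Undergraduate Major"], top_row["Starting Median Salary"])
```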

**Major with the highest mid-career 90th percentile salary: Economics**

In [12]:
clean_df['Mid-Career 90th Percentile Salary'].idxmax()

17

In [84]:
clean_df.loc[17]

Undergraduate Major                  Economics
Spread                              159,400.00
Starting Median Salary               50,100.00
Mid-Career Median Salary             98,600.00
Mid-Career 10th Percentile Salary    50,600.00
Mid-Career 90th Percentile Salary   210,000.00
Group                                 Business
Name: 17, dtype: object

**Major with the lowest median starting salary: Spanish**

In [15]:
clean_df['Starting Median Salary'].min()

34000.0

In [16]:
clean_df.loc[clean_df["Starting Median Salary"].idxmin()]

Undergraduate Major                  Spanish
Starting Median Salary               34000.0
Mid-Career Median Salary             53100.0
Mid-Career 10th Percentile Salary    31000.0
Mid-Career 90th Percentile Salary    96400.0
Group                                   HASS
Name: 49, dtype: object

**Major with the lowest mid-career 90th percentile salary**

17    210000.0
22    195000.0
8     194000.0
37    183000.0
44    178000.0
36    175000.0
30    173000.0
14    171000.0
45    168000.0
42    168000.0
19    168000.0
38    163000.0
12    162000.0
1     161000.0
33    157000.0
25    156000.0
13    154000.0
16    153000.0
0     152000.0
2     150000.0
28    149000.0
10    148000.0
9     148000.0
7     147000.0
35    146000.0
34    145000.0
11    143000.0
3     138000.0
21    136000.0
4     136000.0
6     135000.0
39    134000.0
20    133000.0
24    132000.0
31    129000.0
46    127000.0
5     125000.0
29    124000.0
43    124000.0
48    118000.0
26    112000.0
23    111000.0
32    107000.0
15    107000.0
18    102000.0
27    101000.0
41     99200.0
40     98300.0
47     96400.0
49     96400.0
Name: Mid-Career 90th Percentile Salary, dtype: float64

In [14]:
clean_df.loc[clean_df["Mid-Career 90th Percentile Salary"].idxmin()]

Undergraduate Major                  Religion
Starting Median Salary                34100.0
Mid-Career Median Salary              52000.0
Mid-Career 10th Percentile Salary     29700.0
Mid-Career 90th Percentile Salary     96400.0
Group                                    HASS
Name: 47, dtype: object

**Note: Spanish and Religion have the same mid-career 90th percentile salary**

In [24]:
clean_df['Mid-Career 90th Percentile Salary'].sort_values(ascending=True)

49     96400.0
47     96400.0
40     98300.0
41     99200.0
27    101000.0
18    102000.0
32    107000.0
15    107000.0
23    111000.0
26    112000.0
48    118000.0
43    124000.0
29    124000.0
5     125000.0
46    127000.0
31    129000.0
24    132000.0
20    133000.0
39    134000.0
6     135000.0
21    136000.0
4     136000.0
3     138000.0
11    143000.0
34    145000.0
35    146000.0
7     147000.0
10    148000.0
9     148000.0
28    149000.0
2     150000.0
0     152000.0
16    153000.0
13    154000.0
25    156000.0
33    157000.0
1     161000.0
12    162000.0
38    163000.0
42    168000.0
19    168000.0
45    168000.0
14    171000.0
30    173000.0
36    175000.0
44    178000.0
37    183000.0
8     194000.0
22    195000.0
17    210000.0
Name: Mid-Career 90th Percentile Salary, dtype: float64
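
Because `idxmin()` returns only the first matching label, it reports Religion and hides the tie with Spanish. One way to surface ties is `nsmallest` with `keep="all"`. The sketch below uses a four-row stand-in frame (`sample` is a hypothetical name) with values taken from the outputs above.

```python
import pandas as pd

col = "Mid-Career 90th Percentile Salary"

# Four rows from the notebook output, with their original index labels.
sample = pd.DataFrame(
    {
        "Undergraduate Major": ["Nursing", "Nutrition", "Religion", "Spanish"],
        col: [98300.0, 99200.0, 96400.0, 96400.0],
    },
    index=[40, 41, 47, 49],
)

# idxmin() stops at the first occurrence of the minimum.
first_only = sample[col].idxmin()

# keep="all" returns every row tied for the smallest value.
lowest = sample.nsmallest(1, col, keep="all")

print(first_only)
print(lowest)
```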

**Philosophy numbers**

In [85]:
clean_df.loc[42]

Undergraduate Major                  Philosophy
Spread                               132,500.00
Starting Median Salary                39,900.00
Mid-Career Median Salary              81,200.00
Mid-Career 10th Percentile Salary     35,500.00
Mid-Career 90th Percentile Salary    168,000.00
Group                                      HASS
Name: 42, dtype: object

# Major Earning Risk

**Calculate spread of major earnings**

In [22]:
clean_df['Mid-Career 90th Percentile Salary'] - clean_df['Mid-Career 10th Percentile Salary']

0     109800.0
1      96700.0
2     113700.0
3     104200.0
4      85400.0
5      96200.0
6      98100.0
7     108200.0
8     122100.0
9     102700.0
10     84600.0
11    105500.0
12     95900.0
13     98000.0
14    114700.0
15     74800.0
16    116300.0
17    159400.0
18     72700.0
19     98700.0
20     99600.0
21    102100.0
22    147800.0
23     70000.0
24     92000.0
25    111000.0
26     76000.0
27     66400.0
28    112000.0
29     88500.0
30    115900.0
31     84500.0
32     71300.0
33    118800.0
34    106600.0
35    100700.0
36    132900.0
37    137800.0
38     99300.0
39    107300.0
40     50700.0
41     65300.0
42    132500.0
43     57600.0
44    122000.0
45    126800.0
46     95400.0
47     66700.0
48     87300.0
49     65400.0
dtype: float64

In [27]:
spread_col = clean_df['Mid-Career 90th Percentile Salary'] - clean_df['Mid-Career 10th Percentile Salary']
clean_df.insert(1, "Spread", spread_col)
clean_df.head()

Unnamed: 0,Undergraduate Major,Spread,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary,Group
0,Accounting,109800.0,46000.0,77100.0,42200.0,152000.0,Business
1,Aerospace Engineering,96700.0,57700.0,101000.0,64300.0,161000.0,STEM
2,Agriculture,113700.0,42600.0,71900.0,36300.0,150000.0,Business
3,Anthropology,104200.0,36800.0,61500.0,33800.0,138000.0,HASS
4,Architecture,85400.0,41600.0,76800.0,50600.0,136000.0,Business
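
One caveat with `insert` is that re-running the cell raises a `ValueError`, because the `Spread` column already exists. A possible alternative is `assign`, which returns a new frame and can be re-run safely, though it appends the column at the end rather than at position 1. The sketch below checks the spread arithmetic on four rows taken from the outputs in this notebook; `sample` and `with_spread` are hypothetical names.

```python
import pandas as pd

p10 = "Mid-Career 10th Percentile Salary"
p90 = "Mid-Career 90th Percentile Salary"

# Four rows from the notebook output, with their original index labels.
sample = pd.DataFrame(
    {
        "Undergraduate Major": ["Economics", "Philosophy", "Physician Assistant", "Spanish"],
        p10: [50600.0, 35500.0, 66400.0, 31000.0],
        p90: [210000.0, 168000.0, 124000.0, 96400.0],
    },
    index=[17, 42, 43, 49],
)

# Spread = 90th percentile minus 10th percentile of mid-career salary.
# assign() leaves `sample` untouched and returns a copy with the new column.
with_spread = sample.assign(Spread=sample[p90] - sample[p10])

print(with_spread[["Undergraduate Major", "Spread"]])
```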


**The lowest-risk majors are those with the smallest spread, meaning that the 90th and 10th percentile earnings are closer together than in other majors. Note: a low-risk major does not necessarily have high earnings, only a small discrepancy between high and low earners.**

In [34]:
low_risk = clean_df.sort_values("Spread")
low_risk[['Undergraduate Major', 'Spread']].head()

Unnamed: 0,Undergraduate Major,Spread
40,Nursing,50700.0
43,Physician Assistant,57600.0
41,Nutrition,65300.0
49,Spanish,65400.0
27,Health Care Administration,66400.0


**The majors with the most potential earnings are those in which the 90th percentile mid-career salary is highest. These majors may nevertheless be risky even if their potential earnings are very high, e.g. Economics.**

In [22]:
potential = clean_df.sort_values("Mid-Career 90th Percentile Salary", ascending=False)
potential[["Undergraduate Major","Mid-Career 90th Percentile Salary" ]].head()

Unnamed: 0,Undergraduate Major,Mid-Career 90th Percentile Salary
17,Economics,210000.0
22,Finance,195000.0
8,Chemical Engineering,194000.0
37,Math,183000.0
44,Physics,178000.0


**The most risky majors are those in which the spread is highest, meaning that there is a big discrepancy in earnings across the career spectrum.**

**Philosophy remains among the top 5 riskiest majors, though it is not the riskiest. Surprisingly, finance, marketing, and math are also among the riskiest majors, which is worth noting for undergraduate students deciding on a major.**

In [23]:
high_risk = clean_df.sort_values("Spread", ascending=False)
high_risk[["Undergraduate Major", "Spread"]].head()

Unnamed: 0,Undergraduate Major,Spread
17,Economics,159400.0
22,Finance,147800.0
37,Math,137800.0
36,Marketing,132900.0
42,Philosophy,132500.0


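**Majors that are relatively safe, though not risk free, are those whose spread sits near the middle of the range. The cell below ranks majors by mid-career median salary, which should be read alongside the spread: Economics, for example, ranks fifth here but has the highest spread.**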

In [24]:
median_risk = clean_df.sort_values("Mid-Career Median Salary", ascending=False)
median_risk[["Undergraduate Major", "Mid-Career Median Salary"]].head()

Unnamed: 0,Undergraduate Major,Mid-Career Median Salary
8,Chemical Engineering,107000.0
12,Computer Engineering,105000.0
19,Electrical Engineering,103000.0
1,Aerospace Engineering,101000.0
17,Economics,98600.0


# Majors Analysis Visualization

In [28]:
fig = px.scatter(clean_df, x='Undergraduate Major', y='Starting Median Salary')
fig.update_xaxes(title='Major')
fig.show()

fig = px.scatter(clean_df, x='Undergraduate Major', y='Mid-Career Median Salary')
fig.update_xaxes(title='Major')
fig.show()

fig = px.scatter(clean_df, x='Undergraduate Major', y='Spread')
fig.update_xaxes(title='Major')
fig.show()

# Majors By Category

In [63]:
clean_df.groupby("Group").count()

Unnamed: 0_level_0,Undergraduate Major,Spread,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary
Group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Business,12,12,12,12,12,12
HASS,22,22,22,22,22,22
STEM,16,16,16,16,16,16


In [80]:
pd.options.display.float_format = "{:,.2f}".format
clean_df.groupby("Group").mean(numeric_only=True)

Unnamed: 0_level_0,Spread,Starting Median Salary,Mid-Career Median Salary,Mid-Career 10th Percentile Salary,Mid-Career 90th Percentile Salary
Group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
Business,103958.33,44633.33,75083.33,43566.67,147525.0
HASS,95218.18,37186.36,62968.18,34145.45,129363.64
STEM,101600.0,53862.5,90812.5,56025.0,157625.0
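
The count and mean tables above can also be produced in one pass with named aggregation. As far as I know, pandas 2.0 and later raise a `TypeError` when `.mean()` is called on a grouped frame that still contains a text column, so the sketch passes `numeric_only=True` explicitly. It runs on eight rows taken from the outputs in this notebook, so the group averages differ from the full-data table; `sample`, `group_means`, and `summary` are hypothetical names.

```python
import pandas as pd

# Eight majors from the notebook output (not the full 50-row data set).
sample = pd.DataFrame(
    {
        "Undergraduate Major": [
            "Accounting", "Agriculture", "Architecture",
            "Aerospace Engineering", "Physician Assistant",
            "Anthropology", "Philosophy", "Spanish",
        ],
        "Group": ["Business", "Business", "Business", "STEM", "STEM", "HASS", "HASS", "HASS"],
        "Starting Median Salary": [46000.0, 42600.0, 41600.0, 57700.0, 74300.0, 36800.0, 39900.0, 34000.0],
        "Spread": [109800.0, 113700.0, 85400.0, 96700.0, 57600.0, 104200.0, 132500.0, 65400.0],
    }
)

# Mean of the numeric columns only; the text column is skipped explicitly.
group_means = sample.groupby("Group").mean(numeric_only=True)

# Named aggregation: one table with the group size and selected means.
summary = sample.groupby("Group").agg(
    majors=("Undergraduate Major", "count"),
    mean_start=("Starting Median Salary", "mean"),
    mean_spread=("Spread", "mean"),
)

print(group_means)
print(summary)
```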


# Final Analysis

When viewed as a group, STEM majors have the highest average starting and mid-career salaries, with a spread that isn't much greater than that of the lowest-spread category, Humanities and Social Sciences (HASS). From a purely salary-based perspective, STEM majors are the best choice.

Looking at individual majors, the starting salary, mid-career salary, and spread must all be taken into account. Economics, for instance, has a very high 90th percentile mid-career salary, but its 10th percentile mid-career salary is not much more than its median starting salary. Economics has the highest spread, and hence is a very risky major. By contrast, computer engineering has a relatively high median starting salary, a very high mid-career median salary, and an average spread. So, computer engineering is a safe major that will likely lead to high earnings.

For my purposes, the relevant case is philosophy, which has a relatively low median starting salary and a slightly above-average mid-career median salary. However, the spread for philosophy salaries is quite large. For instance, the mid-career 10th percentile salary is actually less than the median starting salary. Therefore, philosophy is a risky major, and if salary is an important factor in choosing a major, philosophy is not a good choice.
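
The comparison between the mid-career 10th percentile salary and the median starting salary can be checked for several majors at once with a boolean filter. This is a minimal sketch on five rows taken from the outputs in this notebook; `sample` and `below_start` are hypothetical names.

```python
import pandas as pd

start = "Starting Median Salary"
p10 = "Mid-Career 10th Percentile Salary"

# Five majors from the notebook output (not the full data set).
sample = pd.DataFrame(
    {
        "Undergraduate Major": [
            "Aerospace Engineering", "Architecture", "Economics", "Philosophy", "Spanish",
        ],
        start: [57700.0, 41600.0, 50100.0, 39900.0, 34000.0],
        p10: [64300.0, 50600.0, 50600.0, 35500.0, 31000.0],
    }
)

# Majors whose low-end mid-career salary falls below the median starting salary.
below_start = sample[sample[p10] < sample[start]]

print(below_start["Undergraduate Major"].tolist())
```

Among these five, Philosophy and Spanish fall below the starting median, while Economics sits only just above it.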

As a final note, this analysis covers undergraduate majors only. An analysis of postgraduate degrees will likely look quite different in terms of risk and earnings, so one can't draw any conclusions from this analysis about the earning potential of, say, a Ph.D. in philosophy.