# Data Exploration: College Major vs Salary

In [49]:
import pandas as pd

## Read data

In [50]:
df = pd.read_csv("data/salaries_by_college_major.csv")
df.head()

| | Undergraduate Major | Starting Median Salary | Mid-Career Median Salary | Mid-Career 10th Percentile Salary | Mid-Career 90th Percentile Salary | Group |
|---|---|---|---|---|---|---|
| 0 | Accounting | 46000.0 | 77100.0 | 42200.0 | 152000.0 | Business |
| 1 | Aerospace Engineering | 57700.0 | 101000.0 | 64300.0 | 161000.0 | STEM |
| 2 | Agriculture | 42600.0 | 71900.0 | 36300.0 | 150000.0 | Business |
| 3 | Anthropology | 36800.0 | 61500.0 | 33800.0 | 138000.0 | HASS |
| 4 | Architecture | 41600.0 | 76800.0 | 50600.0 | 136000.0 | Business |


## Preliminary Data Exploration and Data Cleaning

In [51]:
rows, columns = df.shape
print(f"Number of rows: {rows}")
print(f"Number of columns: {columns}")

Number of rows: 51
Number of columns: 6


Check whether the DataFrame contains any null values.

In [52]:
df.isnull().any().any()

True

Count the null values across all columns.

In [53]:
df.isnull().sum().sum()

5

Drop every row that contains a null value, then confirm that none remain.

In [54]:
df = df.dropna()
df.isnull().any().any()

False
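
Before dropping rows, it can help to see which rows and columns actually hold the missing values. The sketch below is a minimal illustration on a made-up toy DataFrame (the names `toy`, `rows_with_nulls`, `nulls_per_column`, and `clean` and all values are hypothetical, not taken from the salary file).

```python
import numpy as np
import pandas as pd

# Toy stand-in for the salary data; names and values are invented for illustration.
toy = pd.DataFrame(
    {
        "Undergraduate Major": ["Major A", "Major B", "Footer text"],
        "Starting Median Salary": [40000.0, 50000.0, np.nan],
        "Group": ["Business", "STEM", np.nan],
    }
)

# Rows that contain at least one missing value.
rows_with_nulls = toy[toy.isnull().any(axis=1)]

# Number of missing values in each column.
nulls_per_column = toy.isnull().sum()

# Drop the incomplete rows, mirroring the dropna() call above.
clean = toy.dropna()

print(rows_with_nulls)
print(nulls_per_column)
print(clean)
```

If all the missing values sit in a single row, as in this toy example, `dropna()` removes only that row and leaves the rest of the data intact.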

## Data Exploration

### Finding the major with the highest starting median salary

In [55]:
idx_max_starting_salary = df["Starting Median Salary"].idxmax()
idx_max_starting_salary

43

In [56]:
df[["Undergraduate Major", "Starting Median Salary"]].loc[idx_max_starting_salary]

Undergraduate Major       Physician Assistant
Starting Median Salary                74300.0
Name: 43, dtype: object

### Finding the major with the highest mid-career median salary

Mid-career is defined as having 10+ years of experience.

In [57]:
df[["Undergraduate Major", "Mid-Career Median Salary"]].loc[
    df["Mid-Career Median Salary"].idxmax()
]

Undergraduate Major         Chemical Engineering
Mid-Career Median Salary                107000.0
Name: 8, dtype: object

### Which college major has the lowest starting median salary?

In [58]:
df[["Undergraduate Major", "Starting Median Salary"]].loc[
    df["Starting Median Salary"].idxmin()
]

Undergraduate Major       Spanish
Starting Median Salary    34000.0
Name: 49, dtype: object

### Which college major has the lowest mid-career median salary?

In [65]:
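df[["Undergraduate Major", "Mid-Career Median Salary"]].loc[
    df["Mid-Career Median Salary"].idxmin()
]

Undergraduate Major         Spanish
Mid-Career Median Salary    53100.0
Name: 49, dtype: object

The output above was recorded with the row selected by `df["Starting Median Salary"].idxmin()`, so it shows the mid-career median salary of the major with the lowest starting salary (Spanish, 53,100). On its own it does not establish that Spanish also has the lowest mid-career median salary; the lookup on "Mid-Career Median Salary" in the cell above is what answers that question.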
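
The four lookups above follow one pattern: find the index label of the extreme value in a column, then select that row. A small helper makes the column explicit in a single place, which reduces the chance of querying one column while displaying another. This is a minimal sketch on invented data; `major_at_extreme` and `toy` are hypothetical names, not part of the notebook.

```python
import pandas as pd


def major_at_extreme(df, column, kind="max"):
    """Return the major and its value at the maximum or minimum of `column`."""
    if kind not in ("max", "min"):
        raise ValueError("kind must be 'max' or 'min'")
    # idxmax/idxmin return the index label of the extreme value.
    idx = df[column].idxmax() if kind == "max" else df[column].idxmin()
    return df.loc[idx, ["Undergraduate Major", column]]


# Toy data (made-up values) where the lowest starting salary and the
# lowest mid-career salary belong to different majors.
toy = pd.DataFrame(
    {
        "Undergraduate Major": ["Major A", "Major B", "Major C"],
        "Starting Median Salary": [30000.0, 35000.0, 50000.0],
        "Mid-Career Median Salary": [60000.0, 50000.0, 90000.0],
    }
)

lowest_start = major_at_extreme(toy, "Starting Median Salary", kind="min")
lowest_mid = major_at_extreme(toy, "Mid-Career Median Salary", kind="min")

print(lowest_start)
print(lowest_mid)
```

In this toy data the two answers differ, which is why each question needs its own column in the `idxmin()` call.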

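### Finding a low-risk major

A low-risk major is a degree with a small difference between the lowest and highest salaries. Here that difference (the spread) is measured as the mid-career 90th percentile salary minus the mid-career 10th percentile salary.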

In [60]:
spread_col = (
    df["Mid-Career 90th Percentile Salary"] - df["Mid-Career 10th Percentile Salary"]
)
df.insert(5, "Spread", spread_col)
df.head()

| | Undergraduate Major | Starting Median Salary | Mid-Career Median Salary | Mid-Career 10th Percentile Salary | Mid-Career 90th Percentile Salary | Spread | Group |
|---|---|---|---|---|---|---|---|
| 0 | Accounting | 46000.0 | 77100.0 | 42200.0 | 152000.0 | 109800.0 | Business |
| 1 | Aerospace Engineering | 57700.0 | 101000.0 | 64300.0 | 161000.0 | 96700.0 | STEM |
| 2 | Agriculture | 42600.0 | 71900.0 | 36300.0 | 150000.0 | 113700.0 | Business |
| 3 | Anthropology | 36800.0 | 61500.0 | 33800.0 | 138000.0 | 104200.0 | HASS |
| 4 | Architecture | 41600.0 | 76800.0 | 50600.0 | 136000.0 | 85400.0 | Business |


In [66]:
low_risk = df.sort_values("Spread")
low_risk[["Undergraduate Major", "Spread"]].head()

| | Undergraduate Major | Spread |
|---|---|---|
| 40 | Nursing | 50700.0 |
| 43 | Physician Assistant | 57600.0 |
| 41 | Nutrition | 65300.0 |
| 49 | Spanish | 65400.0 |
| 27 | Health Care Administration | 66400.0 |
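
As an alternative sketch of the same computation, `assign` returns a new DataFrame with the spread column instead of modifying the original, so the cell can be re-run safely (`df.insert` raises a `ValueError` if the column already exists). `nsmallest` then replaces the sort-and-head step. The toy data, `with_spread`, and `lowest_spread` are hypothetical and the values are invented.

```python
import pandas as pd

# Toy data with made-up percentile salaries.
toy = pd.DataFrame(
    {
        "Undergraduate Major": ["Major A", "Major B", "Major C"],
        "Mid-Career 10th Percentile Salary": [40000.0, 50000.0, 30000.0],
        "Mid-Career 90th Percentile Salary": [100000.0, 90000.0, 150000.0],
    }
)

# Spread = 90th percentile minus 10th percentile; assign() leaves `toy` unchanged.
with_spread = toy.assign(
    Spread=toy["Mid-Career 90th Percentile Salary"]
    - toy["Mid-Career 10th Percentile Salary"]
)

# The two majors with the smallest spread, smallest first.
lowest_spread = with_spread.nsmallest(2, "Spread")[["Undergraduate Major", "Spread"]]

print(lowest_spread)
```

`nlargest` works the same way for the opposite question of which majors have the widest spread.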


### Finding the degrees with the greatest spread in salaries

In [70]:
great_spread = df.sort_values("Spread", ascending=False)
great_spread[["Undergraduate Major", "Spread"]].head()

| | Undergraduate Major | Spread |
|---|---|---|
| 17 | Economics | 159400.0 |
| 22 | Finance | 147800.0 |
| 37 | Math | 137800.0 |
| 36 | Marketing | 132900.0 |
| 42 | Philosophy | 132500.0 |


### Finding degrees with the highest values in the 90th percentile

In [69]:
high_potential = df.sort_values("Mid-Career 90th Percentile Salary", ascending=False)
high_potential[["Undergraduate Major", "Mid-Career 90th Percentile Salary"]].head()

| | Undergraduate Major | Mid-Career 90th Percentile Salary |
|---|---|---|
| 17 | Economics | 210000.0 |
| 22 | Finance | 195000.0 |
| 8 | Chemical Engineering | 194000.0 |
| 37 | Math | 183000.0 |
| 44 | Physics | 178000.0 |


### Which category of degrees has the highest average salary?

In [74]:
pd.options.display.float_format = "{:,.2f}".format
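# numeric_only=True skips the text column "Undergraduate Major",
# which would otherwise raise a TypeError in pandas 2.x.
df.groupby("Group").mean(numeric_only=True)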

| Group | Starting Median Salary | Mid-Career Median Salary | Mid-Career 10th Percentile Salary | Mid-Career 90th Percentile Salary | Spread |
|---|---|---|---|---|---|
| Business | 44633.33 | 75083.33 | 43566.67 | 147525.0 | 103958.33 |
| HASS | 37186.36 | 62968.18 | 34145.45 | 129363.64 | 95218.18 |
| STEM | 53862.5 | 90812.5 | 56025.0 | 157625.0 | 101600.0 |

STEM degrees have the highest average starting and mid-career median salaries, while Business degrees show the largest average spread.
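
To read the answer off programmatically rather than by eye, the group means can be sorted by the column of interest. The following is a minimal sketch on invented data; `toy`, `group_means`, `ranked`, and `top_group` are hypothetical names, and ranking by mid-career median salary is only one possible choice.

```python
import pandas as pd

# Toy data with made-up salaries across three groups.
toy = pd.DataFrame(
    {
        "Undergraduate Major": ["Major A", "Major B", "Major C", "Major D"],
        "Starting Median Salary": [40000.0, 50000.0, 60000.0, 30000.0],
        "Mid-Career Median Salary": [70000.0, 80000.0, 100000.0, 60000.0],
        "Group": ["Business", "Business", "STEM", "HASS"],
    }
)

# Average each numeric column within a group; the text column is skipped.
group_means = toy.groupby("Group").mean(numeric_only=True)

# Rank the groups by average mid-career median salary, highest first.
ranked = group_means.sort_values("Mid-Career Median Salary", ascending=False)
top_group = ranked.index[0]

print(ranked)
print(top_group)
```

Sorting by a different column, such as "Starting Median Salary", answers the same question for early-career pay.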
