# Data Science Salaries

### Objective
To find the best location to work as a Data Scientist.

With this information, one can guide their career by selecting companies in these locations whose values align with theirs.

#### Bonus Objectives
- To understand how FAANG + Microsoft differ from the pack
- To understand disparities and the most one can make, given equal opportunity


### Table of Contents:

- Section 1: Data Science Salaries
- Section 2: Cost of Living Index
- Section 3: Combining Data Science Salaries with Cost of Living

In [1]:
# Import libraries
import numpy as np
import pandas as pd

# Section 1: Data Science Salaries

In [2]:
# Read STEM Salaries
df = pd.read_csv('/kaggle/input/data-science-and-stem-salaries/Levels_Fyi_Salary_Data.csv')

In [3]:
# Check the data
df.head(5)

Unnamed: 0,timestamp,company,level,title,totalyearlycompensation,location,yearsofexperience,yearsatcompany,tag,basesalary,...,Doctorate_Degree,Highschool,Some_College,Race_Asian,Race_White,Race_Two_Or_More,Race_Black,Race_Hispanic,Race,Education
0,6/7/2017 11:33:27,Oracle,L3,Product Manager,127000,"Redwood City, CA",1.5,1.5,,107000.0,...,0,0,0,0,0,0,0,0,,
1,6/10/2017 17:11:29,eBay,SE 2,Software Engineer,100000,"San Francisco, CA",5.0,3.0,,0.0,...,0,0,0,0,0,0,0,0,,
2,6/11/2017 14:53:57,Amazon,L7,Product Manager,310000,"Seattle, WA",8.0,0.0,,155000.0,...,0,0,0,0,0,0,0,0,,
3,6/17/2017 0:23:14,Apple,M1,Software Engineering Manager,372000,"Sunnyvale, CA",7.0,5.0,,157000.0,...,0,0,0,0,0,0,0,0,,
4,6/20/2017 10:58:51,Microsoft,60,Software Engineer,157000,"Mountain View, CA",5.0,3.0,,0.0,...,0,0,0,0,0,0,0,0,,


In [4]:
# Create a city ID so that we can look it up later
df[['city','state','country']] = df.location.str.split(", ",expand=True,n=2)
# US locations have no country part, so fill it in explicitly
df['country'] = df['country'].fillna('United States')
# Vectorized concatenation instead of a row-wise lambda
df['city_id'] = df['city'] + '_' + df['country']

df.head(5)

Unnamed: 0,timestamp,company,level,title,totalyearlycompensation,location,yearsofexperience,yearsatcompany,tag,basesalary,...,Race_White,Race_Two_Or_More,Race_Black,Race_Hispanic,Race,Education,city,state,country,city_id
0,6/7/2017 11:33:27,Oracle,L3,Product Manager,127000,"Redwood City, CA",1.5,1.5,,107000.0,...,0,0,0,0,,,Redwood City,CA,United States,Redwood City_United States
1,6/10/2017 17:11:29,eBay,SE 2,Software Engineer,100000,"San Francisco, CA",5.0,3.0,,0.0,...,0,0,0,0,,,San Francisco,CA,United States,San Francisco_United States
2,6/11/2017 14:53:57,Amazon,L7,Product Manager,310000,"Seattle, WA",8.0,0.0,,155000.0,...,0,0,0,0,,,Seattle,WA,United States,Seattle_United States
3,6/17/2017 0:23:14,Apple,M1,Software Engineering Manager,372000,"Sunnyvale, CA",7.0,5.0,,157000.0,...,0,0,0,0,,,Sunnyvale,CA,United States,Sunnyvale_United States
4,6/20/2017 10:58:51,Microsoft,60,Software Engineer,157000,"Mountain View, CA",5.0,3.0,,0.0,...,0,0,0,0,,,Mountain View,CA,United States,Mountain View_United States
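To make the split logic easier to follow, here is a minimal sketch on a toy frame. The two location strings are illustrative values, not rows taken from the dataset, and the name `toy` is hypothetical. It assumes that US locations have two comma-separated parts and international locations have three.

```python
import pandas as pd

# Toy locations (illustrative values, not rows from the dataset)
toy = pd.DataFrame({"location": ["Seattle, WA", "London, EN, United Kingdom"]})

# Split into at most three parts: city, state/region, country
parts = toy["location"].str.split(", ", n=2, expand=True)
parts.columns = ["city", "state", "country"]

# Two-part locations have no country, so it is assumed to be the United States
parts["country"] = parts["country"].fillna("United States")

# Combine city and country into a lookup key
parts["city_id"] = parts["city"] + "_" + parts["country"]
print(parts)
```

Keying on city and country rather than city alone keeps same-named cities in different countries apart, although same-named cities within one country would still share an ID.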


In [5]:
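# Take FAANG and Microsoft out of the equation, as they represent a different tier of jobs
faangless_df = df.copy()

# Join the company names into a single regex alternation
faang_list = ['Facebook','Apple','Amazon','Netflix','Google','Microsoft']
faang_list = '|'.join(faang_list)

# Case-insensitive substring match; rows with a missing company evaluate to NaN
faangless_s = faangless_df.company.str.contains(faang_list,case=False)

# Keep rows that explicitly did not match (rows with a missing company are dropped too)
fdf = faangless_df = faangless_df[faangless_s == False]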

print(faangless_df.shape)
print(df.shape)

(39455, 33)
(62642, 33)
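One thing worth checking with this kind of filter is that `str.contains` matches substrings, so a company whose name merely contains one of the listed names is excluded as well. The sketch below compares the loose pattern with a stricter word-boundary variant on a toy Series. The company names are made up for illustration, and the `strict` pattern is one possible alternative, not what the notebook uses.

```python
import pandas as pd

# Toy company names (illustrative; "Pineapple Labs" is made up)
companies = pd.Series(["Google", "google llc", "Pineapple Labs", "Oracle", None])
names = ["Facebook", "Apple", "Amazon", "Netflix", "Google", "Microsoft"]

# Loose pattern: matches any of the names anywhere in the string
loose = "|".join(names)
# Stricter pattern (an assumption, not the notebook's choice): whole words only
strict = r"\b(?:" + "|".join(names) + r")\b"

# na=False treats missing company names as non-matches
loose_mask = companies.str.contains(loose, case=False, na=False)
strict_mask = companies.str.contains(strict, case=False, na=False)

print(pd.DataFrame({"company": companies, "loose": loose_mask, "strict": strict_mask}))
```

With the loose pattern, "Pineapple Labs" is flagged because it contains "apple"; the word-boundary version leaves it alone. Which behavior is preferable depends on how company names are spelled in the data.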


In [6]:
# Filter for all Data-related job titles only

fds = fdf[fdf.title.str.contains('data',case=False)]

print(fds.shape)
fds.head()

(1711, 33)


Unnamed: 0,timestamp,company,level,title,totalyearlycompensation,location,yearsofexperience,yearsatcompany,tag,basesalary,...,Race_White,Race_Two_Or_More,Race_Black,Race_Hispanic,Race,Education,city,state,country,city_id
419,6/5/2018 14:06:30,LinkedIn,Senior,Data Scientist,233000,"San Francisco, CA",4.0,0.0,Data Analysis,162000.0,...,0,0,0,0,,,San Francisco,CA,United States,San Francisco_United States
444,6/8/2018 17:55:09,ebay,26,Data Scientist,180000,"San Jose, CA",10.0,5.0,,0.0,...,0,0,0,0,,,San Jose,CA,United States,San Jose_United States
454,6/10/2018 19:39:35,Twitter,Staff,Data Scientist,500000,"San Francisco, CA",4.0,4.0,ML / AI,200000.0,...,0,0,0,0,,,San Francisco,CA,United States,San Francisco_United States
523,6/25/2018 8:45:29,Tesla,Senior Engineer,Data Scientist,168000,"Palo Alto, CA",8.0,3.0,Mechanical Engineering,118000.0,...,0,0,0,0,,,Palo Alto,CA,United States,Palo Alto_United States
535,6/26/2018 21:37:46,GrubHub,II,Data Scientist,187000,"New York, NY",4.0,1.0,ML / AI,150000.0,...,0,0,0,0,,,New York,NY,United States,New York_United States
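With the data-related rows isolated, one possible next step toward the objective is to rank locations by typical compensation. The sketch below does this on toy numbers; the values, the `min_reports` threshold, and the choice of the median as the summary statistic are all assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the filtered frame (values are made up)
toy = pd.DataFrame({
    "city_id": ["A_United States"] * 3 + ["B_United States"] * 2 + ["C_United States"],
    "totalyearlycompensation": [200000, 220000, 240000, 150000, 170000, 900000],
})

# Median compensation and number of reports per city
summary = (
    toy.groupby("city_id")["totalyearlycompensation"]
       .agg(median_comp="median", n="count")
)

# Hypothetical threshold: ignore cities with too few reports to be meaningful
min_reports = 2
ranked = summary[summary["n"] >= min_reports].sort_values("median_comp", ascending=False)
print(ranked)
```

The median is less sensitive than the mean to a handful of very large packages, and the minimum-report filter keeps a single outlier (city C here) from topping the ranking.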


# Section 2: Cost of Living Index

# Section 3: Combining Data Science Salaries with Cost of Living