# ESE 415 Final Project - Analyzing Data Scientist Salaries

## By: Abigail Alpert and Kevin Yan

# Data exploration and preprocessing

### Dataset Columns
1. **work_year** : The year the salary was paid.

2. **experience_level** : The experience level in the job during the year

3. **employment_type** : The type of employment for the role

4. **job_title** : The role worked in during the year.

5. **salary** : The total gross salary amount paid.

6. **salary_currency** : The currency of the salary paid as an ISO 4217 currency code.

7. **salary_in_usd** : The salary in USD.

8. **employee_residence** : Employee's primary country of residence during the work year as an ISO 3166 country code.

9. **remote_ratio** : The overall amount of work done remotely

10. **company_location** : The country of the employer's main office or contracting branch

11. **company_size** : The median number of people that worked for the company during the year

In [5]:
import pandas as pd
import numpy as np

import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

In [6]:
df = pd.read_csv('ds_salaries.csv', index_col = 0)
df

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2020,MI,FT,Data Scientist,70000,EUR,79833,DE,0,DE,L
1,2020,SE,FT,Machine Learning Scientist,260000,USD,260000,JP,0,JP,S
2,2020,SE,FT,Big Data Engineer,85000,GBP,109024,GB,50,GB,M
3,2020,MI,FT,Product Data Analyst,20000,USD,20000,HN,0,HN,S
4,2020,SE,FT,Machine Learning Engineer,150000,USD,150000,US,50,US,L
...,...,...,...,...,...,...,...,...,...,...,...
602,2022,SE,FT,Data Engineer,154000,USD,154000,US,100,US,M
603,2022,SE,FT,Data Engineer,126000,USD,126000,US,100,US,M
604,2022,SE,FT,Data Analyst,129000,USD,129000,US,0,US,M
605,2022,SE,FT,Data Analyst,150000,USD,150000,US,100,US,M


First, some basic data cleaning: drop duplicate rows and any rows with a missing salary_in_usd.

In [7]:
df = df.drop_duplicates()
df = df.dropna(subset=['salary_in_usd']) 

## <span style="color:red"> Check if standardizing makes sense here. My concern is that it'll make our predictions hard to interpret </span>

Let's standardize the numerical features to help the model converge faster. The numerical features have very different magnitudes (for example, work_year is around 2020 while remote_ratio ranges from 0 to 100), which could cause issues during gradient descent.

In [8]:
num_cols = ['work_year', 'salary', 'salary_in_usd', 'remote_ratio']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_cols)
    ],
    remainder='passthrough'
)

X_standardized = preprocessor.fit_transform(df)

all_feature_names = list(num_cols) + [
    col for col in df.columns 
    if col not in num_cols
]

standardized_df = pd.DataFrame(X_standardized, columns=all_feature_names)

standardized_df

Unnamed: 0,work_year,salary,salary_in_usd,remote_ratio,experience_level,employment_type,job_title,salary_currency,employee_residence,company_location,company_size
0,-1.956361,-0.167734,-0.42618,-1.710815,MI,FT,Data Scientist,EUR,DE,DE,L
1,-1.956361,-0.048869,2.06863,-1.710815,SE,FT,Machine Learning Scientist,USD,JP,JP,S
2,-1.956361,-0.15835,-0.021966,-0.487257,SE,FT,Big Data Engineer,GBP,GB,GB,M
3,-1.956361,-0.199014,-1.254701,-1.710815,MI,FT,Product Data Analyst,USD,HN,HN,S
4,-1.956361,-0.117686,0.545437,-0.487257,SE,FT,Machine Learning Engineer,USD,US,US,L
...,...,...,...,...,...,...,...,...,...,...,...
560,0.910939,-0.115183,0.600826,0.7363,SE,FT,Data Engineer,USD,US,US,M
561,0.910939,-0.1327,0.213104,0.7363,SE,FT,Data Engineer,USD,US,US,M
562,0.910939,-0.130823,0.254645,-1.710815,SE,FT,Data Analyst,USD,US,US,M
563,0.910939,-0.117686,0.545437,0.7363,SE,FT,Data Analyst,USD,US,US,M
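
On the interpretability concern: a standardized prediction can be mapped back to dollars as long as the scaler that was fitted to the target is kept around. A minimal sketch, assuming a separate `StandardScaler` is fitted to the target alone (the name `y_scaler` and the salary values are made up for illustration and are not part of this notebook):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up salaries in USD, for illustration only
salaries = np.array([[50000.0], [80000.0], [120000.0], [150000.0]])

# Fit a dedicated scaler on the target so its mean and std are stored
y_scaler = StandardScaler()
y_std = y_scaler.fit_transform(salaries)

# Pretend the model predicts a standardized value of 0.5
pred_std = np.array([[0.5]])

# Map the prediction back to USD (pred * std + mean)
pred_usd = y_scaler.inverse_transform(pred_std)
print(pred_usd)
```

With this approach the model can still train on standardized values while results are reported in the original units.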


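Now let's create a couple of flags (binary variables) that could be useful for predictions. Note that these are added to df after standardized_df was built, so they do not carry over into the frames below unless the standardization cell is rerun.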

In [9]:
df['is_manager'] = df['job_title'].str.contains('Manager|Lead|Director|Head', case=False).astype(int)
df['is_remote'] = (df['remote_ratio'] == 100).astype(int)
df['is_hybrid'] = ((df['remote_ratio'] > 0) & (df['remote_ratio'] < 100)).astype(int)
df['same_country'] = (df['employee_residence'] == df['company_location']).astype(int)

In [10]:
print("Unique Job Titles:",len(df['job_title'].unique()))
print("Unique Employee Residences:",len(df['employee_residence'].unique()))
print("Unique Company Locations:",len(df['company_location'].unique()))

Unique Job Titles: 50
Unique Employee Residences: 57
Unique Company Locations: 50


Unfortunately, it seems like these categorical variables have many possibilities, so performing one-hot encoding on these columns may create too many columns to be feasible. We will have to find other ways to deal with these columns, as we believe that the job title will play a strong role in predicting salary.

With that being said, let's handle the lower-cardinality categorical features with one-hot encoding. Note that we do not encode salary_currency, as we do not think it will be predictive given that we have the salary_in_usd column.

In [11]:
low_card_cat_features = [
    'experience_level',  # EN, MI, SE, EX
    'employment_type',   # FT, PT, CT, FL
    'company_size',     # S, M, L
]
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore'), low_card_cat_features)
    ],
    remainder='passthrough'
)

X_processed = preprocessor.fit_transform(standardized_df)


cat_encoder = preprocessor.named_transformers_['cat']
feature_names = cat_encoder.get_feature_names_out(low_card_cat_features)

all_feature_names = list(feature_names) + [
    col for col in standardized_df.columns 
    if col not in low_card_cat_features
]

processed_df = pd.DataFrame(X_processed, columns=all_feature_names)
processed_df

Unnamed: 0,experience_level_EN,experience_level_EX,experience_level_MI,experience_level_SE,employment_type_CT,employment_type_FL,employment_type_FT,employment_type_PT,company_size_L,company_size_M,company_size_S,work_year,salary,salary_in_usd,remote_ratio,job_title,salary_currency,employee_residence,company_location
0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,-1.956361,-0.167734,-0.42618,-1.710815,Data Scientist,EUR,DE,DE
1,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,-1.956361,-0.048869,2.06863,-1.710815,Machine Learning Scientist,USD,JP,JP
2,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,-1.956361,-0.15835,-0.021966,-0.487257,Big Data Engineer,GBP,GB,GB
3,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,-1.956361,-0.199014,-1.254701,-1.710815,Product Data Analyst,USD,HN,HN
4,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,-1.956361,-0.117686,0.545437,-0.487257,Machine Learning Engineer,USD,US,US
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
560,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.910939,-0.115183,0.600826,0.7363,Data Engineer,USD,US,US
561,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.910939,-0.1327,0.213104,0.7363,Data Engineer,USD,US,US
562,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.910939,-0.130823,0.254645,-1.710815,Data Analyst,USD,US,US
563,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.910939,-0.117686,0.545437,0.7363,Data Analyst,USD,US,US


## <span style="color:red"> we need to figure out how we'll handle the job_title and employee_residence columns </span>

**job_title:** this feature seems like it will be important, as we expect samples with the same or similar job titles to have similar salaries.

**employee_residence:** this feature seems important because salary often relates to cost of living (e.g., the same position at the same company will pay different amounts depending on which US city it is located in).
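
One possible option for these high-cardinality columns is to keep only the most frequent categories and collapse the rest into a single "Other" bucket before one-hot encoding. A minimal sketch on made-up job titles; the helper name `lump_rare` and the threshold `min_count` are hypothetical and not part of this notebook:

```python
import pandas as pd

def lump_rare(series, min_count=2, other_label='Other'):
    """Replace categories seen fewer than min_count times with other_label."""
    counts = series.value_counts()
    keep = counts[counts >= min_count].index
    return series.where(series.isin(keep), other_label)

# Made-up job titles, for illustration only
titles = pd.Series([
    'Data Scientist', 'Data Scientist', 'Data Engineer',
    'Data Engineer', 'Head of Data', 'NLP Engineer',
])

# Titles that appear only once are collapsed into 'Other'
lumped = lump_rare(titles, min_count=2)

# One-hot encode the reduced set of categories
dummies = pd.get_dummies(lumped, prefix='job_title')
print(dummies.columns.tolist())
```

Whether a threshold like this throws away too much information about rare but highly paid titles is something to check against the actual data.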

## <span style="color:limegreen"> *for now, let's see what happens if we use label encoding* </span>

In [14]:
label_encoder = LabelEncoder()

label_df = processed_df.copy()
label_df['job_title']= label_encoder.fit_transform(label_df['job_title'])
# label_df['employee_residence']=label_encoder.fit_transform(label_df['employee_residence'])

The last thing to do before model building is to split the features from the target variable. As mentioned earlier, our target will be "salary_in_usd".

In [15]:
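target = 'salary_in_usd'
# Exclude the target and the 'salary' column, which is the same quantity as the target in local currency
features = [col for col in label_df.columns if col not in (target, 'salary')]

X = label_df[features]
y = label_df[target]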

# Model Building

We are going to fit our model using gradient descent, so let's start by implementing a multivariate linear GD algorithm with MSE as the loss function.
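
For reference, the updates used in this section follow from the MSE loss. With $n$ samples, feature matrix $X$ (rows $x_i^\top$), slope vector $m$, bias $b$, and learning rate $\alpha$, and writing $\mathbf{1}$ for a vector of ones:

$$
L(m, b) = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - (x_i^\top m + b) \right)^2
$$

$$
\nabla_m L = -\frac{2}{n} X^\top \left( y - (Xm + b\mathbf{1}) \right), \qquad
\frac{\partial L}{\partial b} = -\frac{2}{n} \sum_{i=1}^{n} \left( y_i - (x_i^\top m + b) \right)
$$

$$
m \leftarrow m - \alpha \, \nabla_m L, \qquad b \leftarrow b - \alpha \, \frac{\partial L}{\partial b}
$$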

### Vanilla GD

In [16]:
# starting with a helper function that implements a single gradient step
def gd_step_mv(X, y, m, b, alpha):
    ''' This function calculates one step of gradient descent for a multivariate linear function.

        Args:
            X: a numpy array of the input data
            y: a numpy array of the true output
            m: a numpy array of the current m values (slope)
            b: a float of the current b value
            alpha: a float, the learning rate for the GD model

        Output:
            m_new: an array of the updated m values (slope)
            b_new: a float of the updated bias term
    '''

    n, k = X.shape # n = num samples, k = num features

    y_pred = np.dot(X, m) + b

    error = y - y_pred
    
    # loss = MSE
    m_grad = -(2/n)*np.dot(X.T, error) #compute the gradient of m
    b_grad = -(2/n)*np.sum(error) #compute the gradient of b

    m_new = m - alpha*m_grad
    b_new = b - alpha*b_grad

    return m_new, b_new

In [17]:
# Now, moving on to an actual GD implementation
def grad_descent_mv(X, y, m, b, alpha, N, tol=1e-05):
    '''This function performs up to N iterations of gradient descent for a multivariate linear model.
    
    Args:
        X: a numpy array of the input data
        y: a numpy array of the true output
        m: a numpy array of the initial m values (slope)
        b: a float of the initial b value
        alpha: a float, the (constant) learning rate for the GD model
        N: the maximum number of iterations
        (*) tol: the convergence tolerance on the size of the parameter update

    Output:
        m: an array of the final m values (slope)
        b: a float of the final bias value
    '''

    for n in range(N):
        m_new, b_new = gd_step_mv(X, y, m, b, alpha) #compute the gradient step
        step = np.concatenate([m_new - m, [b_new - b]]) #the parameter update (-alpha times the gradient)

        m = m_new #update m
        b = b_new #update b

        if np.linalg.norm(step) < tol: # check if the function has converged (and return early if so)
            print(f"Converged!\nIteration: {n}")
            return m, b

    print(f"Did not converge...\n Final Result: {m, b}")
    return m, b

For now, we are using vanilla GD (i.e., a constant step size), although we might want to revisit this if convergence takes too long. Let's see how it works.
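
One quick way to sanity-check a GD loop of this kind is to compare it against the closed-form least-squares solution on synthetic data, where the answer is known. The sketch below is self-contained: the data, the true coefficients, and the helper `run_gd` are made up for illustration and are not this notebook's functions.

```python
import numpy as np

def run_gd(X, y, alpha=0.1, n_iter=2000):
    """Plain constant-step gradient descent on the MSE of y ~ X @ m + b."""
    n, k = X.shape
    m, b = np.zeros(k), 0.0
    for _ in range(n_iter):
        error = y - (X @ m + b)
        # Step against the MSE gradient: -(2/n) X^T error and -(2/n) sum(error)
        m = m + alpha * (2 / n) * (X.T @ error)
        b = b + alpha * (2 / n) * error.sum()
    return m, b

# Synthetic, standardized-looking features with known coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 4.0 + 0.01 * rng.normal(size=200)

m_gd, b_gd = run_gd(X, y)

# Closed-form least squares on [X, 1] for comparison
A = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Check whether GD and the closed form agree
print(np.allclose(np.append(m_gd, b_gd), coef, atol=1e-6))
```

If the two disagree on well-scaled synthetic data, the bug is in the loop itself rather than in the salary features.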

In [20]:
# Until we figure out a way to encode salary_currency, employee_residence, and company_location, we will drop those columns (job_title is kept with its label encoding)

cols_to_drop = ['salary_currency', 'employee_residence', 'company_location']

X_subset = X.drop(columns=cols_to_drop)

# Let's initialize with all ones
m0 = np.ones(X_subset.shape[1])
b0=1

m_star, b_star = grad_descent_mv(X_subset, y, m0, b0, alpha=0.01, N=100000)
print(f"optimal parameters:\n m: {m_star}\n b: {b_star}")

Did not converge...
 Final Result: (array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
       nan], dtype=object), -inf)
optimal parameters:
 m: [nan nan nan nan nan nan nan nan nan nan nan nan nan nan]
 b: -inf


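The nan and -inf values above mean that vanilla GD diverged (the parameters blew up) rather than converged slowly, which usually points to a step size that is too large for the scale of the features. Either way, we should update the algorithm.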
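
For a least-squares loss, constant-step GD is only stable when the step size is below $2/\lambda_{\max}$, where $\lambda_{\max}$ is the largest eigenvalue of the Hessian $\frac{2}{n}\tilde{X}^\top\tilde{X}$ and $\tilde{X}$ is $X$ with a column of ones appended for the bias. A single unscaled column can shrink that bound considerably. The sketch below uses synthetic data (an assumption, not the salary data) to show the effect of adding an integer-coded column with values 0 to 49, similar in range to a label-encoded job_title; `max_stable_alpha` is a hypothetical helper.

```python
import numpy as np

def max_stable_alpha(X):
    """Upper bound on the constant GD step size for MSE linear regression."""
    A = np.column_stack([X, np.ones(len(X))])  # append the bias column
    hessian = (2 / len(A)) * (A.T @ A)
    lam_max = np.linalg.eigvalsh(hessian).max()
    return 2 / lam_max

rng = np.random.default_rng(0)
X_scaled = rng.normal(size=(500, 3))         # standardized-looking features
codes = rng.integers(0, 50, size=(500, 1))   # integer codes 0..49
X_mixed = np.column_stack([X_scaled, codes])

print(max_stable_alpha(X_scaled))  # bound with scaled features only
print(max_stable_alpha(X_mixed))   # much smaller bound once the codes are added
```

If the bound for the actual feature matrix turns out to be below alpha=0.01, that alone would explain the nan and -inf values in the output above; scaling the encoded column or lowering alpha are two things to try.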

<span style="color:red"> perhaps this is another thing to talk to Ben about. Is there standard practice for picking which gradient method to use?</span>

### SGD (mini-batch)

In [None]:
def stochastic_gd(X, y, m, b, alpha, N, batch_size, tol=1e-5):
    '''This function performs up to N epochs of mini-batch stochastic gradient descent for a multivariate linear model.
    
    Args:
        X: a numpy array of the input data
        y: a numpy array of the true output
        m: a numpy array of the initial m values (slope)
        b: a float of the initial b value
        alpha: a float, the learning rate for the SGD model
        N: the maximum number of epochs (full passes over the data)
        batch_size: the number of data points to include in each batch
        (*) tol: the convergence tolerance on the change in MSE between epochs

    Output:
        m: an array of the final m values (slope)
        b: a float of the final bias value
    '''

    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    m = np.asarray(m, dtype=float).copy() # copy so the caller's array is not modified
    n = len(X) # the number of samples in the data
    prev_loss = np.inf

    for i in range(N):
        # Shuffle the samples at the start of each epoch
        indices = np.random.permutation(n)
        X_shuffled = X[indices]
        y_shuffled = y[indices]

        for j in range(0, n, batch_size):
            X_batch = X_shuffled[j:j + batch_size]
            y_batch = y_shuffled[j:j + batch_size]

            # Compute the MSE gradients on the current batch only
            error = y_batch - (np.dot(X_batch, m) + b)
            m_gradient = -(2/len(X_batch)) * np.dot(X_batch.T, error)
            b_gradient = -(2/len(X_batch)) * np.sum(error)

            # Update the model parameters
            m = m - alpha * m_gradient
            b = b - alpha * b_gradient

        # check if the full-data MSE has stopped changing (and return early if so)
        loss = np.mean((y - (np.dot(X, m) + b))**2)
        if abs(prev_loss - loss) < tol:
            print(f"Converged!\nIteration: {i}")
            return m, b
        prev_loss = loss

    print(f"Did not converge...\n Final Result: {m, b}")
    return m, b