## University Salary Prediction

Given data about university employees, let's try to predict each employee's **salary** (the `Base Pay` column) from the year, job title, department, and college.

We will compare several regression models using k-fold cross-validation.

Data source: https://www.kaggle.com/datasets/tysonpo/university-salaries

### Getting Started

In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import KFold

from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

import warnings
# Suppress library warnings (for example, convergence warnings) to keep the output readable
warnings.filterwarnings(action='ignore')

In [2]:
data = pd.read_csv('salaries_final.csv')
data

,Year,Name,Primary Job Title,Base Pay,Department,College
0,2010,"Abaied, Jamie L.",Assistant Professor,64000.0,Department of Psychological Science,CAS
1,2011,"Abaied, Jamie L.",Assistant Professor,64000.0,Department of Psychological Science,CAS
2,2012,"Abaied, Jamie L.",Assistant Professor,65229.0,Department of Psychological Science,CAS
3,2013,"Abaied, Jamie L.",Assistant Professor,66969.0,Department of Psychological Science,CAS
4,2014,"Abaied, Jamie L.",Assistant Professor,68658.0,Department of Psychological Science,CAS
...,...,...,...,...,...,...
14465,2016,"van der Vliet, Albert",Professor,163635.0,Department of Pathology&Laboratory Medicine,COM
14466,2017,"van der Vliet, Albert",Professor,175294.0,Department of Pathology&Laboratory Medicine,COM
14467,2018,"van der Vliet, Albert",Professor,191000.0,Department of Pathology&Laboratory Medicine,COM
14468,2019,"van der Vliet, Albert",Professor,196000.0,Department of Pathology&Laboratory Medicine,COM


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14470 entries, 0 to 14469
Data columns (total 6 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Year               14470 non-null  int64  
 1   Name               14470 non-null  object 
 2   Primary Job Title  14470 non-null  object 
 3   Base Pay           14470 non-null  float64
 4   Department         14470 non-null  object 
 5   College            14470 non-null  object 
dtypes: float64(1), int64(1), object(4)
memory usage: 678.4+ KB


### Preprocessing

In [4]:
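def preprocess_inputs(df):
    """Drop the Name column, shuffle the rows, and split into features X and target y."""
    df = df.copy()

    # Drop Name column (a per-person identifier)
    df = df.drop('Name', axis=1)

    # Shuffle the rows; no random_state is set, so the order differs between runs
    df = df.sample(frac=1.0).reset_index(drop=True)

    # Split df into X (features) and y (target)
    y = df['Base Pay']
    X = df.drop('Base Pay', axis=1)

    return X, y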

In [5]:
X, y = preprocess_inputs(data)

In [6]:
X

,Year,Primary Job Title,Department,College
0,2020,Assistant Professor,Department of Surg-Urology,COM
1,2020,Professor,Department of Orthopaedics & Rehabilitation,COM
2,2010,Assistant Professor,Department of Med-Pulmonary,COM
3,2019,Associate Professor,Department of Anesthesiology,COM
4,2009,Academic Srvcs Professonal Sr,Department of Education,CESS
...,...,...,...,...
14465,2020,Assistant Professor,Department of Surg-Emergency Med,COM
14466,2017,Associate Professor,Department of Elec & Biomed Engineering,CEMS
14467,2017,Post Doctoral Associate,Department of Civil & Env Engineering,CEMS
14468,2020,Associate Professor,Department of Surg-Urology,COM


In [7]:
y

0         35000.0
1         32000.0
2        110000.0
3         24000.0
4         59858.0
           ...   
14465     35000.0
14466    110968.0
14467     60000.0
14468     40000.0
14469     98092.0
Name: Base Pay, Length: 14470, dtype: float64
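
Because `preprocess_inputs` shuffles without a seed, the row order (and therefore the contents of each cross-validation fold) changes on every run. A minimal sketch of a reproducible shuffle follows, assuming a fixed `random_state` is acceptable; the helper name `shuffle_rows` and the seed value are hypothetical, and the toy frame reuses a few values from the rows displayed earlier.

```python
import pandas as pd

# Toy frame standing in for the salary data (a few rows copied from the display above)
toy = pd.DataFrame({
    "Year": [2010, 2011, 2012, 2013],
    "Base Pay": [64000.0, 64000.0, 65229.0, 66969.0],
})

def shuffle_rows(df, seed=1):
    """Return a shuffled copy of df; the same seed always gives the same order."""
    return df.sample(frac=1.0, random_state=seed).reset_index(drop=True)

first = shuffle_rows(toy)
second = shuffle_rows(toy)

# Identical seeds produce identical row orders
print(first.equals(second))
```

Fixing the seed makes the fold assignments repeatable, which helps when comparing models across notebook runs.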

### Building the Pipeline

In [8]:
# Preview what one-hot encoding looks like for the College column
pd.get_dummies(X['College'], dtype=int)

,Business,CALS,CAS,CEMS,CESS,CNHS,COM,Department of Ext,LCOMEO,Learning and Info Tech,Library,RSENR
0,0,0,0,0,0,0,1,0,0,0,0,0
1,0,0,0,0,0,0,1,0,0,0,0,0
2,0,0,0,0,0,0,1,0,0,0,0,0
3,0,0,0,0,0,0,1,0,0,0,0,0
4,0,0,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...
14465,0,0,0,0,0,0,1,0,0,0,0,0
14466,0,0,0,1,0,0,0,0,0,0,0,0
14467,0,0,0,1,0,0,0,0,0,0,0,0
14468,0,0,0,0,0,0,1,0,0,0,0,0


In [9]:
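def build_pipeline(regressor):
    """Wrap a regressor in a pipeline that one-hot encodes the categorical columns and scales all features."""
    # One-hot encode nominal features; categories unseen during fit are encoded as all zeros
    # (the sparse_output argument requires scikit-learn 1.2 or newer)
    nominal_transformer = Pipeline(steps=[
        ('onehot', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
    ])

    # Encode the three categorical columns; Year passes through unchanged
    preprocessor = ColumnTransformer(transformers=[
        ('nominal', nominal_transformer, ['Primary Job Title', 'Department', 'College'])
    ], remainder='passthrough')

    # Standardize every column (including the one-hot columns) before fitting the regressor
    model = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('scaler', StandardScaler()),
        ('regressor', regressor)
    ])
    return model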
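
The pipeline uses `OneHotEncoder` with `handle_unknown='ignore'` rather than `pd.get_dummies` because the encoder remembers the categories seen during fitting and can be applied to held-out folds. The sketch below illustrates that behaviour on two made-up frames; the split into `train` and `test` is an assumption for illustration only, and the encoder output is converted with `.toarray()` so the example does not depend on the `sparse_output` argument.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frames: RSENR appears only in the held-out data
train = pd.DataFrame({"College": ["CAS", "COM", "CEMS"]})
test = pd.DataFrame({"College": ["COM", "RSENR"]})

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Categories are learned from the training data and stored in sorted order
print(encoder.categories_[0])

# The unseen category RSENR becomes a row of zeros instead of raising an error
encoded = encoder.transform(test).toarray()
print(encoded)
```

Calling `pd.get_dummies` separately on each frame would instead produce different column sets for `train` and `test`.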

In [10]:
# Names are left-padded so the printed results line up
models = {
    "Linear Regression (Ridge)": build_pipeline(Ridge()),
    "            Decision Tree": build_pipeline(DecisionTreeRegressor()),
    "           Neural Network": build_pipeline(MLPRegressor()),
    "            Random Forest": build_pipeline(RandomForestRegressor()),
    "        Gradient Boosting": build_pipeline(GradientBoostingRegressor())
}

### Model Selection (k-Fold Cross-Validation)

In [11]:
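def evaluate_model(model, X, y):
    """Return the mean RMSE and mean R^2 of the model over 5-fold cross-validation."""
    # Folds are contiguous blocks of rows; the rows were already shuffled in preprocess_inputs
    kf = KFold(n_splits=5)
    rmses = []
    r2s = []

    for train_idx, test_idx in kf.split(X):
        # Fit the model on the training folds
        model.fit(X.iloc[train_idx, :], y.iloc[train_idx])

        # Make predictions on the held-out fold
        pred = model.predict(X.iloc[test_idx, :])

        # Calculate RMSE
        rmse = np.sqrt(np.mean((y.iloc[test_idx] - pred)**2))
        rmses.append(rmse)

        # Calculate R^2 (1 - residual sum of squares / total sum of squares)
        r2 = 1 - np.sum((y.iloc[test_idx] - pred)**2) / np.sum((y.iloc[test_idx] - y.iloc[test_idx].mean())**2)
        r2s.append(r2)

    # Return average RMSE and R^2 across the folds
    return np.mean(rmses), np.mean(r2s)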
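
As a sanity check on the hand-written formulas in `evaluate_model`, the sketch below compares them with the equivalent functions in `sklearn.metrics`. The target and prediction values are made up purely for the comparison.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, used only to check the formulas
y_true = np.array([64000.0, 65229.0, 66969.0, 68658.0])
y_pred = np.array([63000.0, 66000.0, 67500.0, 68000.0])

# Manual formulas, written the same way as in evaluate_model
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))
r2_manual = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# Library equivalents
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
r2_sklearn = r2_score(y_true, y_pred)

print(np.isclose(rmse_manual, rmse_sklearn), np.isclose(r2_manual, r2_sklearn))
```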

In [12]:
# Mean RMSE across the 5 folds for each model
for name, model in models.items():
    print(name + " RMSE: {:.2f}".format(evaluate_model(model, X, y)[0]))

Linear Regression (Ridge) RMSE: 28488.55
            Decision Tree RMSE: 30304.01
           Neural Network RMSE: 31118.30
            Random Forest RMSE: 28987.67
        Gradient Boosting RMSE: 31593.63


In [13]:
# Mean R^2 score across the 5 folds for each model (evaluate_model refits every model)
for name, model in models.items():
    print(name + " R^2 Score: {:.5f}".format(evaluate_model(model, X, y)[1]))

Linear Regression (Ridge) R^2 Score: 0.63637
            Decision Tree R^2 Score: 0.58798
           Neural Network R^2 Score: 0.56723
            Random Forest R^2 Score: 0.62319
        Gradient Boosting R^2 Score: 0.55278
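
The two loops above call `evaluate_model` once per metric, so every model is cross-validated twice, and for the unseeded models the RMSE and R^2 values come from different fits. One possible alternative is scikit-learn's `cross_validate`, which can score several metrics in a single pass. The sketch below is not the notebook's method: it uses synthetic data from `make_regression` and a plain `Ridge` model as stand-ins, and the names `X_demo` and `y_demo` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_validate

# Synthetic regression data standing in for the salary features
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Each fold is fit once and scored on both metrics
scores = cross_validate(
    Ridge(),
    X_demo,
    y_demo,
    cv=KFold(n_splits=5),
    scoring={"rmse": "neg_root_mean_squared_error", "r2": "r2"},
)

# scikit-learn reports error metrics as negative scores, so flip the sign for RMSE
mean_rmse = -np.mean(scores["test_rmse"])
mean_r2 = np.mean(scores["test_r2"])
print("RMSE: {:.2f}  R^2: {:.5f}".format(mean_rmse, mean_r2))
```

Passing one of the pipelines from `models` in place of `Ridge()` would give both metrics from the same set of fits.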
