## Model Training

#### 1.1 Import Data and Required Packages
Import NumPy and pandas, the scikit-learn estimators, model-selection utilities, metrics and preprocessing tools, and the XGBoost regressor.

In [49]:
import numpy as np
import pandas as pd

from datetime import datetime

from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.dummy import DummyRegressor

from sklearn.model_selection import train_test_split, GridSearchCV, KFold
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, root_mean_squared_error

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

from xgboost import XGBRegressor

seed = 31 # defined to ensure that the random processes are reproducible

Import the CSV Data as Pandas DataFrame

In [27]:
df = pd.read_csv('data/jobs_in_data_2024.csv')

Show Top 5 Records

In [28]:
df.head()

| | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | work_setting | company_location | company_size | job_category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024 | Entry-level | Freelance | Applied Data Scientist | 30000 | USD | 30000 | United Kingdom | Remote | United Kingdom | M | Data Science and Research |
| 1 | 2024 | Executive | Full-time | Business Intelligence | 230000 | USD | 230000 | United States | In-person | United States | M | BI and Visualization |
| 2 | 2024 | Executive | Full-time | Business Intelligence | 176900 | USD | 176900 | United States | In-person | United States | M | BI and Visualization |
| 3 | 2024 | Senior | Full-time | Data Architect | 171210 | USD | 171210 | Canada | In-person | Canada | M | Data Architecture and Modeling |
| 4 | 2024 | Senior | Full-time | Data Architect | 92190 | USD | 92190 | Canada | In-person | Canada | M | Data Architecture and Modeling |


In [29]:
def get_jobs_in_data_df():
    return pd.read_csv('data/jobs_in_data_2024.csv')

def get_full_time_jobs(input_df):
    df_with_full_time_jobs = input_df.copy()
    df_with_full_time_jobs = df_with_full_time_jobs[df_with_full_time_jobs['employment_type'] == 'Full-time']

    return df_with_full_time_jobs

def get_non_hybrid_jobs(input_df):
    df_with_non_hybrid_jobs = input_df.copy()
    df_with_non_hybrid_jobs = df_with_non_hybrid_jobs[df_with_non_hybrid_jobs['work_setting'] != 'Hybrid']

    return df_with_non_hybrid_jobs

def get_top_5_company_location(input_df):
    df_with_top_5_company_location = input_df.copy()
    
    top_5_countries = df_with_top_5_company_location['company_location'].value_counts().nlargest(5).index
    df_with_top_5_company_location['company_location'] = df_with_top_5_company_location['company_location'].apply(lambda x: x if x in top_5_countries else 'Other')

    return df_with_top_5_company_location    

def drop_columns(input_df):
    df_with_dropped_columns = input_df.copy()
    df_with_dropped_columns = df_with_dropped_columns.drop(columns=['salary', 'job_title', 'employee_residence', 'employment_type', 'salary_currency', 'work_year'])
    return df_with_dropped_columns


def get_formatted_df():
    df = get_jobs_in_data_df()
    df = get_full_time_jobs(df)
    df = get_non_hybrid_jobs(df)
    df = get_top_5_company_location(df)
    df = drop_columns(df)
    return df

df = get_formatted_df()
df.head()

| | experience_level | salary_in_usd | work_setting | company_location | company_size | job_category |
|---|---|---|---|---|---|---|
| 1 | Executive | 230000 | In-person | United States | M | BI and Visualization |
| 2 | Executive | 176900 | In-person | United States | M | BI and Visualization |
| 3 | Senior | 171210 | In-person | Canada | M | Data Architecture and Modeling |
| 4 | Senior | 92190 | In-person | Canada | M | Data Architecture and Modeling |
| 5 | Mid-level | 57753 | In-person | United Kingdom | M | Data Science and Research |
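
The grouping of less frequent company locations into `'Other'` can also be written without `apply` and a lambda. The following is a minimal sketch on made-up data; the helper name `bucket_top_n` and the toy values are hypothetical and not part of the notebook.

```python
import pandas as pd


def bucket_top_n(series: pd.Series, n: int, other_label: str = "Other") -> pd.Series:
    """Keep the n most frequent values and replace every other value with other_label."""
    top_values = series.value_counts().nlargest(n).index
    # Series.where keeps entries where the condition holds and replaces the rest
    return series.where(series.isin(top_values), other_label)


# Hypothetical toy data: two frequent countries and two rare ones
toy_locations = pd.Series(["US", "US", "US", "UK", "UK", "Canada", "Spain"])
bucketed = bucket_top_n(toy_locations, n=2)
print(bucketed.tolist())
# ['US', 'US', 'US', 'UK', 'UK', 'Other', 'Other']
```

Using `isin` with `where` keeps the operation vectorized, which may matter little at this data size but reads closer to the intent.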


Next, we create the feature matrix X and the target vector y.

In [30]:
X = df.drop(columns=['salary_in_usd'])
y = df['salary_in_usd']

In [31]:
X.head()

| | experience_level | work_setting | company_location | company_size | job_category |
|---|---|---|---|---|---|
| 1 | Executive | In-person | United States | M | BI and Visualization |
| 2 | Executive | In-person | United States | M | BI and Visualization |
| 3 | Senior | In-person | Canada | M | Data Architecture and Modeling |
| 4 | Senior | In-person | Canada | M | Data Architecture and Modeling |
| 5 | Mid-level | In-person | United Kingdom | M | Data Science and Research |


In [32]:
y

1        230000
2        176900
3        171210
4         92190
5         57753
          ...  
14191    119059
14194    165000
14195    412000
14196    151000
14197    105000
Name: salary_in_usd, Length: 13939, dtype: int64

Let's split the data into training and testing sets with a 20% test size.

In [33]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=seed)
X_train.shape, X_test.shape

((11151, 5), (2788, 5))

We will create a preprocessing pipeline with a numerical branch (mean imputation and scaling) and a categorical branch (most-frequent imputation and one-hot encoding). All five features in X are categorical, so the numerical branch receives no columns here.

In [34]:
# Split the columns by dtype; every feature in X is categorical, so num_features is empty
num_features = X.select_dtypes(exclude="object").columns
cat_features = X.select_dtypes(include="object").columns

num_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]
)
cat_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("one_hot_encoder", OneHotEncoder()),
    ]
)

preprocessor = ColumnTransformer(
    [
        ("num_pipeline", num_pipeline, num_features),
        ("cat_pipeline", cat_pipeline, cat_features),
    ]
)

In [35]:
X_train = preprocessor.fit_transform(X_train)
X_train.shape

(11151, 25)

Next, we define an evaluation function that returns the metrics for a set of predictions: the mean absolute error (MAE), the mean squared error (MSE), the root mean squared error (RMSE) and R².

In [36]:
def evaluate_model(true, predicted):
    """Return MAE, MSE, RMSE and R² for the given true and predicted values."""
    mae = mean_absolute_error(true, predicted)
    mse = mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)
    r2_square = r2_score(true, predicted)
    return mae, mse, rmse, r2_square
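
As a quick sanity check of what these metrics measure, the sketch below computes them by hand on four made-up values and compares the results with scikit-learn. The numbers are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical true values and predictions
y_true_toy = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_toy = np.array([110.0, 190.0, 320.0, 380.0])

errors = y_true_toy - y_pred_toy           # [-10, 10, -20, 20]
mae_manual = np.mean(np.abs(errors))       # (10 + 10 + 20 + 20) / 4 = 15
mse_manual = np.mean(errors ** 2)          # (100 + 100 + 400 + 400) / 4 = 250
rmse_manual = np.sqrt(mse_manual)          # sqrt(250), roughly 15.81

ss_res = np.sum(errors ** 2)                              # 1000
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)    # 50000
r2_manual = 1 - ss_res / ss_tot                           # 0.98

print(mae_manual, mse_manual, rmse_manual, r2_manual)
print(mean_absolute_error(y_true_toy, y_pred_toy),
      mean_squared_error(y_true_toy, y_pred_toy),
      r2_score(y_true_toy, y_pred_toy))
```

RMSE is expressed in the same unit as the target, which is why it is usually easier to interpret than MSE.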

We then define a set of models and parameters to evaluate and determine the best-performing one.

In [44]:
models = {
    "Dummy Regressor": DummyRegressor(),
    "Random Forest": RandomForestRegressor(random_state = seed),
    "Linear Regression": LinearRegression(),
    "K-Neighbors": KNeighborsRegressor(),
    "XGBoost": XGBRegressor(random_state = seed),
}

params = {
    "Dummy Regressor": {
        "strategy": ["mean", "median"]
    },
    "Random Forest": {
        'n_estimators': [50, 100],
        'max_depth': [None, 10]
    },
    "Linear Regression": {
        'fit_intercept': [True, False]
    },
    "K-Neighbors": {
        'n_neighbors': [3, 5],
        'weights': ['uniform', 'distance']
    },
    "XGBoost": {
        'n_estimators': [50, 100],
        'learning_rate': [0.01, 0.1, 1]
    }
}

In [45]:
def train_and_evaluate_models(models_dic, params_dic):
    best_model_name = None
    best_model_params = None
    highest_r2 = -float('inf')  # R² can be negative, hence initialized to a very low value

    for model_name, model in models_dic.items():
        start_time = datetime.now()

        param = params_dic.get(model_name, {})
        cv = KFold(n_splits=4, shuffle=True, random_state=seed)
        
        gs = GridSearchCV(
            model, param, cv=cv, 
            scoring={
                'r2': 'r2',
                'neg_mean_absolute_error': 'neg_mean_absolute_error',
                'neg_mean_squared_error': 'neg_mean_squared_error',
                'neg_root_mean_squared_error': 'neg_root_mean_squared_error'
            }, 
            refit='r2'
        )
        gs.fit(X_train, y_train)

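        # refit='r2' has already refit gs.best_estimator_ on the full training set,
        # so no additional fit is needed here

        # Note: each value below is averaged over every parameter combination
        # in the grid, not taken from the best combination alone
        r2 = gs.cv_results_['mean_test_r2'].mean()
        mae = -gs.cv_results_['mean_test_neg_mean_absolute_error'].mean()
        mse = -gs.cv_results_['mean_test_neg_mean_squared_error'].mean()
        rmse = -gs.cv_results_['mean_test_neg_root_mean_squared_error'].mean()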

        end_time = datetime.now()
        elapsed_time = (end_time - start_time).total_seconds()
        
        print('Model:', model_name)
        print("- Best parameters:", gs.best_params_)
        print("- Time Elapsed: {:.2f} seconds".format(elapsed_time))
        print("- R²: {:.4f}".format(r2))
        print("- MAE: {:.4f}".format(mae))
        print("- MSE: {:.4f}".format(mse))
        print("- RMSE: {:.4f}".format(rmse))
        print("\n\n")

        if r2 > highest_r2:
            highest_r2 = r2
            best_model_name = model_name
            best_model_params = gs.best_params_

    print("\n\n\nBest Model")
    print("Model:", best_model_name)
    print("Parameters:", best_model_params)
    print("Highest R²: {:.4f}".format(highest_r2))

train_and_evaluate_models(models, params)


Model: Dummy Regressor
- Best parameters: {'strategy': 'mean'}
- Time Elapsed: 0.06 seconds
- R²: -0.0073
- MAE: 50027.0955
- MSE: 4120915365.2485
- RMSE: 64183.6890



Model: Random Forest
- Best parameters: {'max_depth': 10, 'n_estimators': 100}
- Time Elapsed: 32.17 seconds
- R²: 0.3066
- MAE: 40602.4339
- MSE: 2837553447.0282
- RMSE: 53257.0941



Model: Linear Regression
- Best parameters: {'fit_intercept': True}
- Time Elapsed: 0.11 seconds
- R²: 0.3065
- MAE: 40849.9817
- MSE: 2837663265.3822
- RMSE: 53258.9909



Model: K-Neighbors
- Best parameters: {'n_neighbors': 5, 'weights': 'uniform'}
- Time Elapsed: 14.37 seconds
- R²: 0.0756
- MAE: 46866.1013
- MSE: 3774956908.2589
- RMSE: 61306.2380



Model: XGBoost
- Best parameters: {'learning_rate': 0.1, 'n_estimators': 50}
- Time Elapsed: 2.09 seconds
- R²: 0.2785
- MAE: 41700.5602
- MSE: 2952135126.3876
- RMSE: 54297.9543






Best Model
Model: Random Forest
Parameters: {'max_depth': 10, 'n_estimators': 100}
Highest R²: 0.3066
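
The scores printed above come from calling `.mean()` on the `cv_results_` arrays, so each one is an average over all parameter combinations of a model. If the score of the best combination alone is of interest, it can be read with `best_index_`. The sketch below shows this on synthetic data from `make_regression`; the data and the `_toy` names are hypothetical, and the resulting numbers say nothing about the salary dataset.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neighbors import KNeighborsRegressor

# Synthetic regression data, used only to have something to fit
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=31)

gs_toy = GridSearchCV(
    KNeighborsRegressor(),
    {"n_neighbors": [3, 5], "weights": ["uniform", "distance"]},
    cv=KFold(n_splits=4, shuffle=True, random_state=31),
    scoring={"r2": "r2", "neg_mean_absolute_error": "neg_mean_absolute_error"},
    refit="r2",
)
gs_toy.fit(X_toy, y_toy)

# Row of cv_results_ that belongs to the best parameter combination (by R²)
best = gs_toy.best_index_
best_r2 = gs_toy.cv_results_["mean_test_r2"][best]
best_mae = -gs_toy.cv_results_["mean_test_neg_mean_absolute_error"][best]

# Average over all four combinations, as computed in the function above
grid_average_r2 = gs_toy.cv_results_["mean_test_r2"].mean()

print("Best parameters:", gs_toy.best_params_)
print("R² of best combination: {:.4f}".format(best_r2))
print("R² averaged over grid:  {:.4f}".format(grid_average_r2))
```

The best combination's score can never be lower than the grid average, so a wide grid with a few poor settings (for example a very high learning rate) pulls the averaged score down.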


In [46]:
model = RandomForestRegressor(random_state=seed, max_depth=10, n_estimators=100, verbose=False)
model.fit(X_train, y_train)

In [51]:
X_test_prepared = preprocessor.transform(X_test)
y_test_pred = model.predict(X_test_prepared)

print("R²: ", r2_score(y_test, y_test_pred))
print("RMSE: ", root_mean_squared_error(y_test, y_test_pred))
print("MAE: ", mean_absolute_error(y_test, y_test_pred))

R²:  0.2989463170032439
RMSE:  52877.33148548779
MAE:  40650.494285349974
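
An alternative arrangement, shown here only as a sketch, is to put the preprocessing and the model into a single `Pipeline`. The encoder is then fit inside each cross-validation fold, and `predict` can be called on raw features without a separate `transform` step. The toy data, the step names `"preprocessor"` and `"model"`, and the small parameter grid are all assumptions made for this illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: 24 rows, two categorical features, salary in thousands
toy_X = pd.DataFrame({
    "experience_level": ["Senior", "Entry-level", "Mid-level", "Senior"] * 6,
    "work_setting": ["Remote", "In-person"] * 12,
})
toy_y = pd.Series([150, 60, 100, 160] * 6)

full_pipeline = Pipeline(steps=[
    ("preprocessor", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"),
         ["experience_level", "work_setting"]),
    ])),
    ("model", RandomForestRegressor(random_state=31)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
toy_grid = {"model__n_estimators": [10, 20], "model__max_depth": [None, 3]}

gs_pipeline = GridSearchCV(
    full_pipeline,
    toy_grid,
    cv=KFold(n_splits=4, shuffle=True, random_state=31),
    scoring="r2",
)
gs_pipeline.fit(toy_X, toy_y)

# Raw, unencoded features go straight into predict
toy_predictions = gs_pipeline.predict(toy_X.head(2))
print(gs_pipeline.best_params_)
print(toy_predictions)
```

With only one-hot encoding in the preprocessor, the difference from fitting it once up front is likely small, but the single pipeline object is easier to persist and reuse.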


### Results

The final model chosen was a **Random Forest Regressor** with the following parameters:
- `n_estimators=100`
- `max_depth=10`

On the test data, the model achieved:
- **R²-score**: 0.299, indicating it explained approximately 30% of the variance in the target variable.
- **Mean Absolute Error (MAE)**: 40,650.49 USD.
- **Root Mean Squared Error (RMSE)**: 52,877.33 USD.

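These results suggest that the model captures some underlying patterns: its test MAE of about 40,650 USD improves on the roughly 50,027 USD that the dummy baseline reached in cross-validation. Still, about 70% of the variance in salary remains unexplained, so there is considerable room for improvement.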