## Model Training

#### 1.1 Import Data and Required Packages
Importing the pandas, NumPy, Matplotlib, Seaborn and warnings libraries.

In [76]:
# Basic Import
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt 
import seaborn as sns
# Modelling
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.model_selection import RandomizedSearchCV
from sklearn.dummy import DummyRegressor
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
import warnings

Import the CSV Data as Pandas DataFrame

In [77]:
df = pd.read_csv('data/jobs_in_data_2024.csv')

Show Top 5 Records

In [51]:
df.head()

| | work_year | experience_level | employment_type | job_title | salary | salary_currency | salary_in_usd | employee_residence | work_setting | company_location | company_size | job_category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2024 | Entry-level | Freelance | Applied Data Scientist | 30000 | USD | 30000 | United Kingdom | Remote | United Kingdom | M | Data Science and Research |
| 1 | 2024 | Executive | Full-time | Business Intelligence | 230000 | USD | 230000 | United States | In-person | United States | M | BI and Visualization |
| 2 | 2024 | Executive | Full-time | Business Intelligence | 176900 | USD | 176900 | United States | In-person | United States | M | BI and Visualization |
| 3 | 2024 | Senior | Full-time | Data Architect | 171210 | USD | 171210 | Canada | In-person | Canada | M | Data Architecture and Modeling |
| 4 | 2024 | Senior | Full-time | Data Architect | 92190 | USD | 92190 | Canada | In-person | Canada | M | Data Architecture and Modeling |


#### Preparing X and Y variables

In [78]:
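# Keep only full-time positions and exclude the 'Hybrid' work setting
X = df[df['employment_type'] == 'Full-time']
X = X[X['work_setting'] != 'Hybrid']

# Target: salary converted to USD
y = X['salary_in_usd']

# Drop the target, the raw salary, and the columns not used as features
X = X.drop(columns=['salary', 'salary_in_usd', 'job_title', 'employee_residence', 'employment_type', 'salary_currency', 'work_year'])

# Keep the five most frequent company locations and group the rest under 'Other'
top_5_countries = X['company_location'].value_counts().nlargest(5).index
X['company_location'] = X['company_location'].apply(lambda x: x if x in top_5_countries else 'Other')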

# X['work_year'] = X['work_year'].astype('str')

In [65]:
X.head()

| | experience_level | work_setting | company_location | company_size | job_category |
|---|---|---|---|---|---|
| 1 | Executive | In-person | United States | M | BI and Visualization |
| 2 | Executive | In-person | United States | M | BI and Visualization |
| 3 | Senior | In-person | Canada | M | Data Architecture and Modeling |
| 4 | Senior | In-person | Canada | M | Data Architecture and Modeling |
| 5 | Mid-level | In-person | United Kingdom | M | Data Science and Research |


In [66]:
y

1        230000
2        176900
3        171210
4         92190
5         57753
          ...  
14191    119059
14194    165000
14195    412000
14196    151000
14197    105000
Name: salary_in_usd, Length: 13939, dtype: int64

In [79]:
# separate dataset into train and test
from sklearn.model_selection import train_test_split
seed = 31
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=seed)
X_train.shape, X_test.shape

((11151, 5), (2788, 5))

In [80]:
num_features = X.select_dtypes(exclude="object").columns
cat_features = X.select_dtypes(include="object").columns

from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Numeric columns: impute with the median, then scale
num_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler(with_mean=False)),
    ]
)
# Categorical columns: impute with the most frequent value, then one-hot encode
cat_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("one_hot_encoder", OneHotEncoder()),
    ]
)

preprocessor = ColumnTransformer(
    [
        ("num_pipeline", num_pipeline, num_features),
        ("cat_pipeline", cat_pipeline, cat_features),
    ]
)

In [81]:
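# Fit the preprocessor on the training split only, then reuse it for the test split
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)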

In [82]:
X_train.shape

(11151, 25)
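
Because the preprocessor is fitted on the training split only, the same fitted object has to be reused on the test split. One way to make that hard to get wrong is to chain preprocessing and estimator in a single `Pipeline`. The sketch below does this on a small synthetic frame; the column names mirror the notebook, but the rows, the salary values and the choice of `LinearRegression` are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the salary data (all values are made up)
rng = np.random.default_rng(31)
n = 200
toy = pd.DataFrame({
    "experience_level": rng.choice(["Entry-level", "Mid-level", "Senior"], size=n),
    "company_size": rng.choice(["S", "M", "L"], size=n),
})
base = toy["experience_level"].map({"Entry-level": 60_000, "Mid-level": 100_000, "Senior": 150_000})
toy_y = base + rng.normal(0, 5_000, size=n)

X_tr, X_te, y_tr, y_te = train_test_split(toy, toy_y, test_size=0.2, random_state=31)

# handle_unknown="ignore" keeps transform() from failing on categories unseen during fit
encode = ColumnTransformer([("cat", OneHotEncoder(handle_unknown="ignore"), list(toy.columns))])
model = Pipeline([("preprocess", encode), ("regressor", LinearRegression())])

model.fit(X_tr, y_tr)            # encoder and regressor are fitted on the training split only
test_pred = model.predict(X_te)  # the fitted encoder is reused on the held-out split
test_mae = mean_absolute_error(y_te, test_pred)
print(f"Held-out MAE: {test_mae:.0f}")
```

With this layout, cross-validation inside a grid search also refits the encoder per fold, so no information from validation rows leaks into preprocessing.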

#### Create an Evaluation Function to Report All Metrics After Model Training

In [72]:
def evaluate_model(true, predicted):
    # Argument order matters: R2 is normalised by the variance of `true`, so it is not symmetric
    mae = mean_absolute_error(true, predicted)
    mse = mean_squared_error(true, predicted)
    rmse = np.sqrt(mse)
    r2_square = r2_score(true, predicted)
    return mae, rmse, r2_square
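
The order of the two arguments matters for R2 but not for MAE or RMSE. The sketch below writes R2 out by hand, on made-up numbers, to show that swapping ground truth and predictions changes the normaliser (the spread of whatever is passed first) and can push the score well below zero. The helper `r2` and the toy values are illustrative only and are not part of the notebook.

```python
def r2(true, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot, with SS_tot taken over `true`."""
    mean_true = sum(true) / len(true)
    ss_res = sum((t - p) ** 2 for t, p in zip(true, predicted))
    ss_tot = sum((t - mean_true) ** 2 for t in true)
    return 1 - ss_res / ss_tot

# Toy data: the predictions vary much less than the ground truth,
# as is typical for a model that explains only part of the variance.
y_true = [50.0, 100.0, 150.0, 200.0]
y_pred = [110.0, 120.0, 130.0, 140.0]

correct_order = r2(y_true, y_pred)   # normalised by the spread of y_true
swapped_order = r2(y_pred, y_true)   # normalised by the smaller spread of y_pred

print(f"R2 (true, predicted): {correct_order:.3f}")   # 0.360
print(f"R2 (predicted, true): {swapped_order:.3f}")   # -15.000
```

The residual sum of squares is identical in both calls; only the denominator differs, which is why RMSE and MAE are unaffected by the argument order.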

In [73]:
from sklearn.model_selection import GridSearchCV

models = {
    "Random Forest": RandomForestRegressor(),
    "Decision Tree": DecisionTreeRegressor(),
    "Gradient Boosting": GradientBoostingRegressor(),
    "Linear Regression": LinearRegression(),
    "K-Neighbors": KNeighborsRegressor(),
    "XGBoost": XGBRegressor(),
    "CatBoost": CatBoostRegressor(verbose=False),
    "AdaBoost": AdaBoostRegressor()
}

params={
    "Decision Tree": {
        'criterion': ['squared_error', 'friedman_mse', 'absolute_error', 'poisson'],
        'splitter': ['best', 'random'],
        'max_features': ['sqrt', 'log2'],
    },
    "Random Forest": {
        'criterion': ['squared_error', 'friedman_mse', 'absolute_error', 'poisson'],
        'max_features': ['sqrt', 'log2'],
        'n_estimators': [8, 16, 32, 64, 128, 256]
    },
    "Gradient Boosting": {
        'loss': ['squared_error', 'huber', 'absolute_error', 'quantile'],
        'learning_rate': [.1, .01, .05, .001],
        'subsample': [0.6, 0.7, 0.75, 0.8, 0.85, 0.9],
        'criterion': ['squared_error', 'friedman_mse'],
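        # 'auto' is rejected by recent scikit-learn releases and makes every fit that uses it fail;
        # None (consider all features) is the equivalent setting
        'max_features': [None, 'sqrt', 'log2'],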
        'n_estimators': [8, 16, 32, 64, 128, 256]
    },
    "K-Neighbors": {},
    "Linear Regression": {},
    "XGBoost": {
        'learning_rate': [.1, .01, .05, .001],
        'n_estimators': [8, 16, 32, 64, 128, 256]
    },
    "CatBoost": {
        'depth': [6, 8, 10],
        'learning_rate': [0.01, 0.05, 0.1],
        'iterations': [30, 50, 100]
    },
    "AdaBoost": {
        'learning_rate': [.1, .01, 0.5, .001],
        'loss': ['linear', 'square', 'exponential'],
        'n_estimators': [8, 16, 32, 64, 128, 256]
    }   
}
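
An exhaustive grid search fits one model per parameter combination per fold, so the cost of a grid is the product of its list lengths times `cv`. The sketch below counts this for a grid with the same shape as the Gradient Boosting grid above, using the standard library only. `count_fits` is a hypothetical helper, not a scikit-learn function.

```python
from math import prod

def count_fits(grid, cv):
    """Number of fits an exhaustive grid search performs, excluding the final refit."""
    return prod(len(values) for values in grid.values()) * cv

# Same shape as the Gradient Boosting grid above
gradient_boosting_grid = {
    "loss": ["squared_error", "huber", "absolute_error", "quantile"],
    "learning_rate": [0.1, 0.01, 0.05, 0.001],
    "subsample": [0.6, 0.7, 0.75, 0.8, 0.85, 0.9],
    "criterion": ["squared_error", "friedman_mse"],
    "max_features": [None, "sqrt", "log2"],
    "n_estimators": [8, 16, 32, 64, 128, 256],
}

total_fits = count_fits(gradient_boosting_grid, cv=10)
# Each of the three max_features candidates accounts for one third of the fits
fits_per_max_features_value = total_fits // len(gradient_boosting_grid["max_features"])

print(total_fits)                   # 34560
print(fits_per_max_features_value)  # 11520
print(count_fits({}, cv=10))        # 10: an empty grid still fits the default model once per fold
```

A single invalid candidate value therefore invalidates every combination that contains it, which for a three-value parameter is a third of the whole search.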

In [74]:
best_model_name = None
best_model_params = None
lowest_rmse = float('inf')
best_model = None

for model_name, model in models.items():
    param = params.get(model_name, {})
    gs = GridSearchCV(model, param, cv=10)
    gs.fit(X_train, y_train)

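    # refit=True (the default) has already retrained the best configuration on all of X_train
    best_estimator = gs.best_estimator_

    y_train_pred = best_estimator.predict(X_train)
    # Ground truth goes first: R2 is not symmetric, and the reversed order used for
    # the run recorded below is what makes its R2 values negative
    mae, rmse, r2_square = evaluate_model(y_train, y_train_pred)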
    
    print('\n\nModel performance for Training set')
    print('Model: ', model_name)
    print("- Root Mean Squared Error: {:.4f}".format(rmse))
    print("- Mean Absolute Error: {:.4f}".format(mae))
    print("- R2 Score: {:.4f}".format(r2_square))

    if rmse < lowest_rmse:
        lowest_rmse = rmse
        best_model_name = model_name
        best_model_params = gs.best_params_
        best_model = best_estimator

print("\n\n\nBest Model")
print("Model: ", best_model_name)
print("Parameters: ", best_model_params)
print("Lowest RMSE: {:.4f}".format(lowest_rmse))



Model performance for Training set
Model:  Random Forest
- Root Mean Squared Error: 51881.8874
- Mean Absolute Error: 39427.0390
- R2 Score: -0.9480


Model performance for Training set
Model:  Decision Tree
- Root Mean Squared Error: 51851.3302
- Mean Absolute Error: 39302.5563
- R2 Score: -0.9157


11520 fits failed out of a total of 34560.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
11520 fits failed with the following error:
Traceback (most recent call last):
  File "c:\Users\Vanderval\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\model_selection\_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\Users\Vanderval\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\base.py", line 1466, in wrapper
    estimator._validate_params()
  File "c:\Users\Vanderval\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "c:\Users\Vanderval\AppData\Local\Progr



Model performance for Training set
Model:  Gradient Boosting
- Root Mean Squared Error: 52358.3036
- Mean Absolute Error: 39926.8428
- R2 Score: -1.0748


Model performance for Training set
Model:  Linear Regression
- Root Mean Squared Error: 53161.8911
- Mean Absolute Error: 40767.2226
- R2 Score: -1.2327


Model performance for Training set
Model:  K-Neighbors
- Root Mean Squared Error: 55596.1847
- Mean Absolute Error: 41894.4294
- R2 Score: -1.6343


Model performance for Training set
Model:  XGBoost
- Root Mean Squared Error: 52036.2864
- Mean Absolute Error: 39619.1739
- R2 Score: -1.0414


Model performance for Training set
Model:  CatBoost
- Root Mean Squared Error: 52296.7618
- Mean Absolute Error: 39910.2809
- R2 Score: -1.1241


Model performance for Training set
Model:  AdaBoost
- Root Mean Squared Error: 55577.3476
- Mean Absolute Error: 43597.4872
- R2 Score: -2.7186



Best Model
Model:  Decision Tree
Parameters:  {'criterion': 'squared_error', 'max_features': 'log2', 
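
Ranking candidates by their error on the data they were trained on favours models that memorise it. `GridSearchCV` already stores a cross-validated score for the best configuration in `best_score_`, which can serve as the selection criterion instead. The sketch below shows this on synthetic data; the two candidate models, their grids and the data are stand-ins rather than the notebook's actual setup.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem (made up): linear signal plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=1.0, size=300)

candidates = {
    "Ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
    "Decision Tree": (DecisionTreeRegressor(random_state=0), {"max_depth": [2, 4, None]}),
}

best_name, best_cv_rmse = None, float("inf")
for name, (estimator, grid) in candidates.items():
    # scikit-learn maximises scores, so RMSE is exposed in negated form
    gs = GridSearchCV(estimator, grid, cv=5, scoring="neg_root_mean_squared_error")
    gs.fit(X_demo, y_demo)
    cv_rmse = -gs.best_score_   # mean RMSE over the validation folds
    print(f"{name}: cross-validated RMSE = {cv_rmse:.3f}")
    if cv_rmse < best_cv_rmse:
        best_name, best_cv_rmse = name, cv_rmse

print("Selected:", best_name)
```

Passing `scoring` explicitly also ensures that the grid search optimises the same metric that is later used to compare models; by default it optimises R2 for regressors.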

In [89]:
dummyModel = DummyRegressor()
params = {"strategy": ["mean", "median"]}
gs = GridSearchCV(dummyModel, params, cv=10)
gs.fit(X_train, y_train)

# refit=True (the default) has already retrained the best strategy on all of X_train
best_estimator = gs.best_estimator_

y_train_pred = best_estimator.predict(X_train)
# Ground truth first, predictions second
mae, rmse, r2_square = evaluate_model(y_train, y_train_pred)
    
print('\n\nDummy Model performance')
print("- Root Mean Squared Error: {:.4f}".format(rmse))
print("- Mean Absolute Error: {:.4f}".format(mae))
print("- R2 Score: {:.4f}".format(r2_square))
print("- Parameters: ", gs.best_params_)



Dummy Model performance
- Root Mean Squared Error: 63968.7782
- Mean Absolute Error: 50191.1872
- R2 Score: 0.0000
- Parameters:  {'strategy': 'mean'}


### Results

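Judged by R2 alone, no model appears to perform better than the dummy model, which simply predicts the mean of the observations. That comparison is misleading: the negative R2 values in the output above come from a run in which predictions and ground truth were swapped in the call to `evaluate_model`, and R2, unlike RMSE and MAE, is not symmetric in its arguments.

On the two symmetric metrics every model improves on the dummy baseline: RMSE ranges from about 51,851 to 55,596 against 63,969 for the dummy, and MAE from about 39,303 to 43,597 against 50,191. The improvement is modest, which suggests that the five categorical features capture only part of the variation in salary. All of these figures are computed on the training set, so they do not yet show how well the models generalise; that requires scoring on the held-out `X_test`.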
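
When a model and a mean-predicting baseline are scored on the same rows, R2 can be recovered from the two RMSE values, since R2 = 1 - MSE_model / MSE_mean. The sketch below applies this identity to the training-set RMSE figures printed above. It assumes the dummy RMSE was obtained with the `mean` strategy on the same training rows, as the output indicates; the variable names are illustrative.

```python
# Training-set RMSE values copied from the recorded output
rmse_by_model = {
    "Random Forest": 51881.8874,
    "Decision Tree": 51851.3302,
    "Gradient Boosting": 52358.3036,
    "Linear Regression": 53161.8911,
    "K-Neighbors": 55596.1847,
    "XGBoost": 52036.2864,
    "CatBoost": 52296.7618,
    "AdaBoost": 55577.3476,
}
rmse_dummy_mean = 63968.7782  # DummyRegressor(strategy="mean") on the same rows

# R2 = 1 - MSE_model / MSE_mean = 1 - (RMSE_model / RMSE_mean) ** 2
implied_r2 = {
    name: 1 - (rmse / rmse_dummy_mean) ** 2
    for name, rmse in rmse_by_model.items()
}

for name, value in sorted(implied_r2.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<18} implied training R2 = {value:.3f}")
```

Because RMSE is unaffected by the argument order, these implied values are not distorted by the swap that produced the negative R2 figures. They remain training-set numbers and say nothing about held-out performance.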