# Preprocessing and model training


In this notebook, I will prepare my dataframes, train several classification models on them, and compare the results.

Building on two previous notebooks (https://github.com/gisthuband/Capstone_2_DS_Job_Locator/blob/main/data_wrangle.ipynb and https://github.com/gisthuband/Capstone_2_DS_Job_Locator/blob/main/exploratory_data_analysis.ipynb), I have constructed and analyzed a dataframe containing information on data science jobs in 2024.

I will use this data to build a classification model that takes my desired salary range and self-perceived competitiveness in the job market and uses them to find the best locations and companies to apply to.

The data comes from a Kaggle dataset containing 500 job postings for the data science field in 2024, and a BLS report generated from data science field statistics for 2023.

Kaggle: https://www.kaggle.com/datasets/ritiksharma07/data-science-job-listings-from-glassdoor   

BLS: https://data.bls.gov/oes/#/occInd/One%20occupation%20for%20multiple%20industries 

The features of each sample (an individual job posting) will be its upper salary, lower salary, company rating, total data scientists employed in the company's state, ratio of job posts to total data scientists in that state, annual mean wage of the state, annual median wage of the state, ratio of the posting's average salary to the state's annual mean wage, and ratio of the posting's average salary to the state's annual median wage.

Each job will receive its label based on geographic region: west, midwest, south, east, or remote.

The models will train on the numerical features as X and the regions as y.


In the end, I will input my own desired salary range and my perceived competitiveness in the job market. The salary range corresponds to the upper and lower salary features. The perceived competitiveness becomes the ratio of posting to state mean, which in turn is used to calculate the ratio of posting to state median. Together with the input salary range, these ratios determine the state mean and median wage features. The company rating, employment in state, and ratio of posts to employment are automatically set to their median values so as not to overcomplicate the model. From this input I will get a region label, and with that label I can go back to the original dataframe and list the top posting cities in that region, along with the companies and role titles attached to those posts.
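The mapping from user input to a feature row can be sketched as follows. This is a minimal illustration, not the notebook's actual implementation: the function name `build_query_features`, the `mean_to_median_factor` argument (how the median ratio is derived from the mean ratio is an assumption here), and the placeholder defaults are all hypothetical. The example values are taken from the first posting shown later in `df.head()`.

```python
def build_query_features(upper_salary, lower_salary, mean_ratio,
                         mean_to_median_factor, defaults):
    """Turn a desired salary range and a competitiveness ratio into one feature row."""
    range_avg = (upper_salary + lower_salary) / 2
    # Assumption: the median ratio is obtained by scaling the mean ratio.
    median_ratio = mean_ratio * mean_to_median_factor
    return {
        "upper_salary": upper_salary,
        "lower_salary": lower_salary,
        "Company Rating": defaults["Company Rating"],
        "tot_employment_in_state": defaults["tot_employment_in_state"],
        # ratio = range_avg / wage, so the implied wage is range_avg / ratio
        "Annual mean wage(2)": range_avg / mean_ratio,
        "Annual median wage(2)": range_avg / median_ratio,
        "total_new_post_rat": defaults["total_new_post_rat"],
        "range_avg_to_mean_ratio": mean_ratio,
        "range_avg_to_median_ratio": median_ratio,
    }


# Placeholder defaults for illustration only (NOT the real dataset medians).
placeholder_defaults = {
    "Company Rating": 3.5,
    "tot_employment_in_state": 3090.0,
    "total_new_post_rat": 515.0,
}

query = build_query_features(
    upper_salary=84000.0,
    lower_salary=57000.0,
    mean_ratio=0.669834,
    mean_to_median_factor=105250.0 / 101850.0,  # assumed conversion factor
    defaults=placeholder_defaults,
)
print(round(query["Annual mean wage(2)"]), round(query["Annual median wage(2)"]))
```

With these inputs the implied state mean and median wages come out close to 105250 and 101850, the values recorded for that posting's state.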



## These classification models will be tested and hyperparameter tuned

1.) Random Forest Classification (bagging)

2.) K Nearest Neighbors Classification

3.) Gradient Boosting Classification

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import matplotlib.pyplot as plt
from matplotlib import pyplot
from IPython.display import Image

import sklearn
from sklearn import metrics, preprocessing, tree
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, auc, classification_report,
                             confusion_matrix, f1_score, log_loss,
                             precision_recall_curve, roc_auc_score, roc_curve)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import minmax_scale

# 1.) Random Forest Classification

In [54]:
df = pd.read_csv('explored_data_v1.csv')

In [55]:
df = df.drop(columns='Unnamed: 0')

In [56]:
df.head()

Unnamed: 0,upper_salary,lower_salary,state,city,Job Title,Company Rating,company_name,tot_employment_in_state,Annual mean wage(2),Annual median wage(2),total_new_post_rat,range_avg_to_mean_ratio,range_avg_to_median_ratio,labels
0,84000.0,57000.0,WI,Onalaska,Associate Stop Loss Underwriter,2.7,The Insurance Center,3090.0,105250.0,101850.0,515.0,0.669834,0.692194,midwest
1,148165.491991,104355.331808,WI,Eau Claire,Marketing Advertising Analyst,3.0,"Net Health Shops, LLC",3090.0,105250.0,101850.0,515.0,1.199624,1.23967,midwest
2,160000.0,135000.0,WI,Madison,Manager - IT Infrastructure Engineering,3.9,UW Credit Union,3090.0,105250.0,101850.0,515.0,1.401425,1.448208,midwest
3,84000.0,59000.0,WI,Wausau,Associate Stop Loss Underwriter,2.7,The Insurance Center,3090.0,105250.0,101850.0,515.0,0.679335,0.702013,midwest
4,87000.0,58000.0,WI,New Berlin,Supply Chain Data Analyst (Day Shift) - New Be...,3.5,DB SCHENKER,3090.0,105250.0,101850.0,515.0,0.688836,0.711831,midwest


In [57]:
df = df.drop(columns=['state','city','Job Title','company_name'])

In [58]:
dum_df = pd.get_dummies(df['labels'])

In [59]:
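# Keep a handle on the frame that still has the single 'labels' column (used for gradient boosting below)
ready_df = df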

In [60]:
dummed_df = pd.concat([df, dum_df],axis=1)

In [61]:
dummed_df = dummed_df.drop(columns='labels')

In [62]:
dummed_df.head()

Unnamed: 0,upper_salary,lower_salary,Company Rating,tot_employment_in_state,Annual mean wage(2),Annual median wage(2),total_new_post_rat,range_avg_to_mean_ratio,range_avg_to_median_ratio,east,midwest,remote,south,west
0,84000.0,57000.0,2.7,3090.0,105250.0,101850.0,515.0,0.669834,0.692194,False,True,False,False,False
1,148165.491991,104355.331808,3.0,3090.0,105250.0,101850.0,515.0,1.199624,1.23967,False,True,False,False,False
2,160000.0,135000.0,3.9,3090.0,105250.0,101850.0,515.0,1.401425,1.448208,False,True,False,False,False
3,84000.0,59000.0,2.7,3090.0,105250.0,101850.0,515.0,0.679335,0.702013,False,True,False,False,False
4,87000.0,58000.0,3.5,3090.0,105250.0,101850.0,515.0,0.688836,0.711831,False,True,False,False,False


In [63]:
# Every column except the one-hot region columns is a feature
label_cols = ['west', 'east', 'south', 'midwest', 'remote']
features = [col for col in dummed_df.columns if col not in label_cols]

In [94]:
X = dummed_df[features]

y = dummed_df[['west','east','south','midwest','remote']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, stratify=y)

In [95]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((275, 9), (69, 9), (275, 5), (69, 5))

In [96]:
grid_params = {'n_estimators':[50, 100, 200, 300, 500], 'criterion':['gini','entropy','log_loss'] }

In [97]:
gscv_rfc = GridSearchCV(RandomForestClassifier(), param_grid=grid_params, cv=5, scoring='accuracy')

In [98]:
rfc = gscv_rfc.fit(X_train, y_train)

In [99]:
print (rfc.best_params_)
print (rfc.best_score_)

{'criterion': 'entropy', 'n_estimators': 500}
0.9709090909090909
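As a side note, `GridSearchCV` refits the best parameter combination on the full training set by default (`refit=True`), so the tuned model is also available as `best_estimator_` without re-typing the parameters. The sketch below shows this on synthetic data. The dataset, the small grid, and the variable names are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the job-posting features (9 numeric columns, 3 classes).
X_demo, y_demo = make_classification(
    n_samples=150, n_features=9, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 20], "criterion": ["gini", "entropy"]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_tr, y_tr)

# The best parameter combination, already refit on all of X_tr.
best_model = search.best_estimator_
test_accuracy = best_model.score(X_te, y_te)
print(search.best_params_, test_accuracy)
```

Note that `best_score_` is the mean cross-validated score on the training folds, while `test_accuracy` here is measured on the held-out split, so the two numbers are not directly interchangeable.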


In [100]:
best_rfc = RandomForestClassifier(n_estimators=500, criterion='entropy')
res = best_rfc.fit(X_train, y_train)

In [101]:
y_pred = res.predict(X_test)

f1 = f1_score(y_test, y_pred, average='weighted')
cm = confusion_matrix(np.array(y_test).argmax(axis=1), np.array(y_pred).argmax(axis=1))

print (f1)

print (cm)

0.8886205045625335
[[ 9  0  0  1  0]
 [ 1 13  0  0  0]
 [ 2  0 22  2  0]
 [ 2  0  0  7  0]
 [ 0  0  0  0 10]]
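One caveat when reading this confusion matrix: because `y` is a five-column one-hot frame, the random forest is trained as a multi-output model, and a predicted row can in principle contain no positive column at all. `argmax` maps such a row to index 0 (the first column, `west`). The toy predictions below are hypothetical and only illustrate that behavior.

```python
import numpy as np

# Hypothetical predictions over the columns
# ['west', 'east', 'south', 'midwest', 'remote'].
y_pred_demo = np.array([
    [0, 1, 0, 0, 0],  # exactly one region predicted: east
    [0, 0, 0, 0, 0],  # no region predicted at all
])

as_index = y_pred_demo.argmax(axis=1)          # the all-zero row becomes index 0 ('west')
no_prediction = y_pred_demo.sum(axis=1) == 0   # flags rows where no region was predicted

print(as_index, no_prediction)
```

Checking `no_prediction` before building the matrix, or training on the single `labels` column as the gradient boosting section does, would avoid counting such rows as `west`.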


# 2.) K Neighbors Classifier

1.) Using standard-scaled X data: (x - mean) / std

In [102]:
s_scaler = preprocessing.StandardScaler().fit(X_train)
X_train_s_scaled=s_scaler.transform(X_train)
X_test_s_scaled=s_scaler.transform(X_test)

X_train_s_scaled.shape, X_test_s_scaled.shape

((275, 9), (69, 9))

In [103]:
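# Note: minmax_scale rescales each array with that array's own min and max,
# so the test set here is not scaled with the training set's parameters.
X_train_mm_scaled=preprocessing.minmax_scale(X_train)
X_test_mm_scaled=preprocessing.minmax_scale(X_test)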

X_train_mm_scaled.shape, X_test_mm_scaled.shape

((275, 9), (69, 9))
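For comparison with the standard-scaling cell, which fits the scaler on `X_train` and reuses it for `X_test`, the same pattern for min-max scaling could look like the sketch below. It uses `MinMaxScaler` and toy one-column arrays (assumed values, not the notebook's data) to show how the result differs from calling `minmax_scale` on the test set by itself.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, minmax_scale

# Toy stand-ins for X_train / X_test (assumed values, one feature).
X_train_toy = np.array([[50_000.0], [100_000.0], [150_000.0]])
X_test_toy = np.array([[100_000.0], [120_000.0]])

# Learn min and range from the training data only, then reuse them.
mm_scaler = MinMaxScaler().fit(X_train_toy)
X_train_toy_mm = mm_scaler.transform(X_train_toy)
X_test_toy_mm = mm_scaler.transform(X_test_toy)

# Scaling the test set on its own uses the test set's min and max instead.
X_test_toy_independent = minmax_scale(X_test_toy)

print(X_test_toy_mm.ravel())           # → [0.5 0.7]
print(X_test_toy_independent.ravel())  # → [0. 1.]
```

The same salary of 100,000 maps to 0.5 in the training scale but to 0.0 when the test set is scaled independently, so a model fitted on the training scale sees shifted inputs at test time.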

In [104]:
grid_params_k = {'n_neighbors':[3, 4, 5, 6, 7, 8 ,9, 10],'leaf_size': [20,40,1], 'weights':['uniform','distance'], 'p':[1,2] }

In [105]:
gscv_knn = GridSearchCV(KNeighborsClassifier(), param_grid=grid_params_k, cv=5, scoring='accuracy')

In [106]:
knn_s = gscv_knn.fit(X_train_s_scaled, y_train)

In [107]:
print (knn_s.best_params_)
print (knn_s.best_score_)

{'leaf_size': 20, 'n_neighbors': 3, 'p': 1, 'weights': 'distance'}
0.7963636363636363


In [108]:
knn_s_best = KNeighborsClassifier(n_neighbors=3, leaf_size=20, p=1, weights='distance')
model = knn_s_best.fit(X_train_s_scaled, y_train)
y_pred = model.predict(X_test_s_scaled)

print('Best Test Accuracy Score:', metrics.accuracy_score(y_test, y_pred))  

Best Test Accuracy Score: 0.7681159420289855


In [109]:
f1 = f1_score(y_test, y_pred, average='weighted')
cm = confusion_matrix(np.array(y_test).argmax(axis=1), np.array(y_pred).argmax(axis=1))

print (f1)
print (cm)

0.7849801235990495
[[ 6  0  1  3  0]
 [ 0 10  3  1  0]
 [ 2  1 21  1  1]
 [ 2  1  0  6  0]
 [ 0  0  0  0 10]]


2.) Using min-max scaling now: (x - x_min) / (x_max - x_min)

In [110]:
grid_params_k = {'n_neighbors':[3, 4, 5, 6, 7, 8 ,9, 10],'leaf_size': [20,40,1], 'weights':['uniform','distance'], 'p':[1,2] }

In [111]:
gscv_knn = GridSearchCV(KNeighborsClassifier(), param_grid=grid_params_k, cv=5, scoring='accuracy')

In [112]:
knn_mm = gscv_knn.fit(X_train_mm_scaled, y_train)

In [113]:
print (knn_mm.best_params_)
print (knn_mm.best_score_)

{'leaf_size': 20, 'n_neighbors': 4, 'p': 1, 'weights': 'distance'}
0.8545454545454545


In [114]:
knn_mm_best = KNeighborsClassifier(n_neighbors=4, leaf_size=20, p=1, weights='distance')
model = knn_mm_best.fit(X_train_mm_scaled, y_train)
y_pred = model.predict(X_test_mm_scaled)

print('Best Test Accuracy Score:', metrics.accuracy_score(y_test, y_pred))  

Best Test Accuracy Score: 0.4492753623188406


In [115]:
f1 = f1_score(y_test, y_pred, average='weighted')
cm = confusion_matrix(np.array(y_test).argmax(axis=1), np.array(y_pred).argmax(axis=1))

print (f1)
print (cm)

0.4672369697504189
[[ 8  0  1  1  0]
 [ 0  5  7  1  1]
 [ 2  1 16  4  3]
 [ 1  3  2  2  1]
 [ 1  6  0  1  2]]
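The cross-validated score (0.85) and the test accuracy (0.45) are far apart here. One way to keep scaling consistent between cross-validation folds, the final fit, and the test set is to put the scaler and the classifier in a single `Pipeline` and tune that. The sketch below runs on synthetic data, so the dataset, the reduced grid, and the names are assumptions, not the notebook's setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the job-posting features.
X_syn, y_syn = make_classification(
    n_samples=150, n_features=9, n_informative=5, n_classes=3, random_state=0
)
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=0
)

# The scaler is refit on the training part of every CV fold,
# and on all of X_syn_train for the final refit.
knn_pipe = Pipeline([("scale", MinMaxScaler()), ("knn", KNeighborsClassifier())])

knn_search = GridSearchCV(
    knn_pipe,
    param_grid={"knn__n_neighbors": [3, 5], "knn__weights": ["uniform", "distance"]},
    cv=3,
    scoring="accuracy",
)
knn_search.fit(X_syn_train, y_syn_train)

# The test set is transformed with the scaler learned from the training data.
knn_test_accuracy = knn_search.score(X_syn_test, y_syn_test)
print(knn_search.best_params_, knn_test_accuracy)
```

Parameters of a pipeline step are addressed as `<step name>__<parameter>`, which is why the grid keys are prefixed with `knn__`.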


# 3.) Gradient Boosting Classifier

In [116]:
features = list(ready_df.columns[ready_df.columns != 'labels'])

X = ready_df[features]
y = ready_df['labels']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1, stratify=y)
X_train.shape, y_train.shape, X_test.shape, y_test.shape

((275, 9), (275,), (69, 9), (69,))

In [117]:
grid_params_gb = {'learning_rate':[0.05, 0.1, 0.25, 0.5, 0.75, 1],'n_estimators':[30, 40, 50, 75, 100],'max_depth':[2, 3, 4, 5, 6], 'max_features':[2, 3, 4, 5, 6]}

In [118]:
gcsv_gb = GridSearchCV(GradientBoostingClassifier(), param_grid=grid_params_gb, cv=5, scoring='accuracy')

In [119]:
gb = gcsv_gb.fit(X_train, y_train)

In [120]:
print (gb.best_params_)
print (gb.best_score_)

{'learning_rate': 0.75, 'max_depth': 3, 'max_features': 4, 'n_estimators': 50}
0.9854545454545456


In [121]:
gb_best = GradientBoostingClassifier(learning_rate=.75, max_depth=3, max_features=4, n_estimators=50)
model = gb_best.fit(X_train, y_train)
y_pred = model.predict(X_test)

print('Best Test Accuracy Score:', metrics.accuracy_score(y_test, y_pred)) 

Best Test Accuracy Score: 0.9565217391304348


In [122]:
f1 = f1_score(y_test, y_pred, average='weighted')
cm = confusion_matrix(y_test,y_pred)

print (f1)
print (cm)

0.9572595520421607
[[14  0  0  0  0]
 [ 0  8  0  0  1]
 [ 0  0 10  0  0]
 [ 0  1  0 24  1]
 [ 0  0  0  0 10]]
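As a worked check, both the test accuracy and the weighted F1 score printed above can be recomputed from this confusion matrix alone. The sketch below uses plain Python and assumes the usual scikit-learn layout (rows are true classes, columns are predicted classes).

```python
# Confusion matrix printed above (rows: true class, columns: predicted class).
cm = [
    [14, 0, 0, 0, 0],
    [0, 8, 0, 0, 1],
    [0, 0, 10, 0, 0],
    [0, 1, 0, 24, 1],
    [0, 0, 0, 0, 10],
]

n_classes = len(cm)
total = sum(sum(row) for row in cm)

# Accuracy: correctly classified samples (the diagonal) over all samples.
accuracy = sum(cm[i][i] for i in range(n_classes)) / total

weighted_f1 = 0.0
for i in range(n_classes):
    tp = cm[i][i]
    support = sum(cm[i])                                  # true samples of class i
    predicted = sum(cm[r][i] for r in range(n_classes))   # samples predicted as class i
    # Per-class F1 = 2*TP / (support + predicted), the harmonic mean of precision and recall.
    f1_i = 2 * tp / (support + predicted) if (support + predicted) else 0.0
    # 'weighted' averaging weights each class by its share of true samples.
    weighted_f1 += f1_i * support / total

print(round(accuracy, 4), round(weighted_f1, 4))  # → 0.9565 0.9573
```

Only 3 of the 69 test postings fall off the diagonal, which is why both numbers sit close to 0.96.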


# In summary:

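Here is the performance of the different models, each fitted with its best hyperparameters from the grid search. CV accuracy is the best mean 5-fold cross-validation score on the training set; test scores are measured on the held-out 20%.

RFC: (CV accuracy: .97, test f1: .89)

KNN, standard scaling: (CV accuracy: .80, test accuracy: .77, test f1: .78)

KNN, min-max scaling: (CV accuracy: .85, test accuracy: .45, test f1: .47)

GB: (CV accuracy: .99, test accuracy: .96, test f1: .96)

The gradient boosting classification model had the highest weighted f1 score on the test set, so it is the prime contender for the final ML classification model to be used in the actual prediction.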