# Text Mining - Job Ads


The data set contains the job descriptions and minimum qualifications of job ads in the state of New York. The task is to predict the salary of a state employee (i.e., the `Salary` column in the data set) from the job description and minimum qualifications. This matters because such a model can recommend a salary as soon as a job description is entered into the system.

## Goal

Use the **nyc-jobs.csv** data set and build a model to predict **Salary**. 

# Read and Prepare the Data

In [132]:
import pandas as pd
import numpy as np

np.random.seed(42)

jobs = pd.read_csv('nyc-jobs.csv')
jobs.head(5)

Unnamed: 0,Job ID,Salary,Job Description,Minimum Qualifications
0,395893,69484,Please read this posting carefully to make cer...,1. Admission to the New York State Bar; and ei...
1,400421,58666,"Under general direction, with wide latitude fo...",1. A baccalaureate degree from an accredited c...
2,389859,60125,Emergency and Intervention Services (EIS) prov...,1. A baccalaureate degree from an accredited c...
3,386138,81473,Only candidates who are permanent in the Civil...,"(1) Four (4) years of full-time, satisfactory ..."
4,393299,113920,** THIS POSTING HAS BEEN EXTENDED - CANDIDATES...,1. A master's degree in computer science from ...


In [133]:
#Check for missing values
jobs.isna().sum()

Job ID                     0
Salary                     0
Job Description            0
Minimum Qualifications    10
dtype: int64

In [134]:
#Impute the stop word "and" for missing Minimum Qualifications; CountVectorizer(stop_words='english') filters it out during vectorization
jobs = jobs.fillna("and")

In [135]:
#Concatenate Job Description and Minimum Qualifications
jobs['Description'] = jobs['Job Description'] + ' ' + jobs['Minimum Qualifications']
jobs.head(5)

Unnamed: 0,Job ID,Salary,Job Description,Minimum Qualifications,Description
0,395893,69484,Please read this posting carefully to make cer...,1. Admission to the New York State Bar; and ei...,Please read this posting carefully to make cer...
1,400421,58666,"Under general direction, with wide latitude fo...",1. A baccalaureate degree from an accredited c...,"Under general direction, with wide latitude fo..."
2,389859,60125,Emergency and Intervention Services (EIS) prov...,1. A baccalaureate degree from an accredited c...,Emergency and Intervention Services (EIS) prov...
3,386138,81473,Only candidates who are permanent in the Civil...,"(1) Four (4) years of full-time, satisfactory ...",Only candidates who are permanent in the Civil...
4,393299,113920,** THIS POSTING HAS BEEN EXTENDED - CANDIDATES...,1. A master's degree in computer science from ...,** THIS POSTING HAS BEEN EXTENDED - CANDIDATES...
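Imputation has to happen before concatenation because adding a string to a missing value yields a missing value, which the vectorizer cannot process. The following is a minimal sketch on a made-up two-row frame (the rows are illustrative, not taken from nyc-jobs.csv); it fills only the text columns with an empty string, which is an alternative to the stop-word placeholder used above rather than the notebook's own approach.

```python
import pandas as pd

# Hypothetical toy frame; the second ad has no Minimum Qualifications.
toy = pd.DataFrame({
    "Salary": [50000, 60000],
    "Job Description": ["Manages staff", "Reviews contracts"],
    "Minimum Qualifications": ["A baccalaureate degree", None],
})

# Without imputation, the concatenated text is missing for the second row.
unfilled = toy["Job Description"] + " " + toy["Minimum Qualifications"]
print(unfilled.isna().sum())

# Fill only the text columns, then concatenate and trim the trailing space.
text_cols = ["Job Description", "Minimum Qualifications"]
toy[text_cols] = toy[text_cols].fillna("")
toy["Description"] = (
    toy["Job Description"] + " " + toy["Minimum Qualifications"]
).str.strip()
print(toy["Description"].tolist())
```

Restricting `fillna` to the text columns avoids writing a string into a numeric column such as `Salary` if it ever contained missing values.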


### Select Input and Target Variables

In [136]:
inputText = jobs['Description']

target = jobs['Salary']

### Split data into Train/Test

In [137]:
from sklearn.model_selection import train_test_split

train_set, test_set, train_y, test_y = train_test_split(inputText, target, test_size=0.3, random_state=42)

In [138]:
train_set.shape, train_y.shape

((1741,), (1741,))

In [139]:
test_set.shape, test_y.shape

((747,), (747,))

# Text Preparation

### Count Vectorizer

In [140]:
#Countvectorizer includes pre-processing, tokenization, filtering stop words
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(stop_words='english')

#Fit_transform train
train_x_tr = count_vect.fit_transform(train_set)

#Transform test
test_x_tr = count_vect.transform(test_set)

train_x_tr, test_x_tr

(<1741x10247 sparse matrix of type '<class 'numpy.int64'>'
 	with 330736 stored elements in Compressed Sparse Row format>,
 <747x10247 sparse matrix of type '<class 'numpy.int64'>'
 	with 139836 stored elements in Compressed Sparse Row format>)
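Both matrices have 10247 columns because the vocabulary is learned from the training text only, and the test text is mapped onto that fixed vocabulary. A small sketch with made-up sentences (not from the data set) shows the effect: a word that appears only in the test text is silently dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example sentences.
train_docs = ["manages the budget and staff", "reviews legal contracts"]
test_docs = ["manages legal staff and interns"]

vect = CountVectorizer(stop_words="english")

# The vocabulary is built from the training documents only.
X_train = vect.fit_transform(train_docs)

# transform() reuses that vocabulary; "interns" was never seen, so it is ignored.
X_test = vect.transform(test_docs)

print(sorted(vect.vocabulary_))
print(X_train.shape, X_test.shape)
print(X_test.toarray())
```

Calling `fit_transform` on the test set instead would produce a different vocabulary and a matrix whose columns no longer line up with the training features.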

### TF-IDF Transformer

In [141]:
from sklearn.feature_extraction.text import TfidfTransformer

tf_transformer = TfidfTransformer()

#Fit_transform train
train_x_tfidf = tf_transformer.fit_transform(train_x_tr)

#Transform test
test_x_tfidf = tf_transformer.transform(test_x_tr)

train_x_tfidf.shape, test_x_tfidf.shape

((1741, 10247), (747, 10247))
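To make the weighting concrete, the sketch below recomputes TF-IDF by hand on a made-up 3x3 count matrix and compares it with `TfidfTransformer` under its default settings (smoothed IDF, L2 row normalization). The count values are illustrative assumptions.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Hypothetical term counts: 3 documents (rows) x 3 terms (columns).
counts = np.array([[2, 1, 0],
                   [0, 1, 1],
                   [1, 1, 0]], dtype=float)

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)          # documents containing each term

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight raw counts by IDF, then scale each row to unit Euclidean length.
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

sklearn_tfidf = TfidfTransformer().fit_transform(counts).toarray()
print(np.allclose(manual_tfidf, sklearn_tfidf))
```

The term in the middle column occurs in every document, so its IDF is exactly 1 and it gets the lowest weight relative to its count.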

### Singular Value Decomposition

In [142]:
from sklearn.decomposition import TruncatedSVD

#For Latent Semantic Analysis, scikit-learn recommends 100 components; 700 are used here instead
svd = TruncatedSVD(n_components=700, n_iter=10)

#Fit_transform train
train_x_lsa = svd.fit_transform(train_x_tfidf)

#Transform test
test_x_lsa = svd.transform(test_x_tfidf)

train_x_lsa.shape, test_x_lsa.shape

((1741, 700), (747, 700))

In [12]:
#Explained Variance
svd.explained_variance_.sum()

0.8545964852948239

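The 700 SVD components have a summed variance of about 0.855. Note that `explained_variance_` reports the absolute variance of each component, so this figure is not necessarily the share of variance explained; that share is given by `svd.explained_variance_ratio_.sum()`.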
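`TruncatedSVD` exposes two related attributes: `explained_variance_` (absolute variance per component) and `explained_variance_ratio_` (that variance divided by the total variance of the input). A minimal sketch on a random matrix, which stands in for the TF-IDF features purely as an assumption, shows how the two relate.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Hypothetical dense data standing in for the TF-IDF matrix.
rng = np.random.default_rng(0)
X = rng.random((50, 20))

svd_demo = TruncatedSVD(n_components=5, random_state=0).fit(X)

absolute = svd_demo.explained_variance_.sum()       # in units of variance
share = svd_demo.explained_variance_ratio_.sum()    # fraction between 0 and 1
total_variance = X.var(axis=0).sum()

print(absolute, share)
# The ratio is the absolute variance divided by the total feature variance.
print(np.isclose(absolute / total_variance, share))
```

The two sums coincide only when the total variance of the input happens to equal 1, so the ratio is the safer quantity to report as a percentage.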

# Determine Baseline Error

In [13]:
#Average value of the target
mean_value = np.mean(train_y)

mean_value

77869.8845491097

In [14]:
# Predict all values as the mean
baseline_pred = np.repeat(mean_value, len(test_y))


In [15]:
from sklearn.metrics import mean_squared_error

baseline_mse = mean_squared_error(test_y, baseline_pred)

baseline_rmse = np.sqrt(baseline_mse)

print('Baseline RMSE: {}' .format(baseline_rmse))

Baseline RMSE: 30011.38837561802


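The baseline RMSE is about $30k: predicting the mean training salary for every job ad misses by roughly $30,000 on the test set. Each model below should beat this figure.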
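The baseline computation can be traced by hand on a few made-up salaries (illustrative numbers, not from the data set): the mean is taken from the training targets only, and the error is measured on the test targets.

```python
import math

# Hypothetical salaries.
train_salaries = [50000, 70000, 90000]
test_salaries = [60000, 80000]

# The "model" is a single constant: the mean of the training targets.
mean_salary = sum(train_salaries) / len(train_salaries)

# RMSE = square root of the mean squared difference from that constant.
squared_errors = [(y - mean_salary) ** 2 for y in test_salaries]
toy_baseline_rmse = math.sqrt(sum(squared_errors) / len(squared_errors))

print(mean_salary, toy_baseline_rmse)
```

Here the mean is 70000 and both test salaries are 10000 away from it, so the RMSE is 10000.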

# Model 1 - Random Forest

In [19]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

In [65]:
#Randomized Search
param_grid = {'n_estimators': randint(100,1000),    
              #'max_leaf_nodes': randint(2,8),
              'max_features': ['auto','sqrt'],
              'max_depth': randint(4,10)
             }

rf = RandomizedSearchCV(estimator=RandomForestRegressor(),
                  param_distributions=param_grid,
                  n_iter=50,
                  cv=2,
                  return_train_score=True,
                  scoring='neg_mean_squared_error',
                  n_jobs=-1,
                  verbose=1
                )

#rf = RandomForestRegressor(n_estimators=1000, max_depth=5 , max_leaf_nodes=5, n_jobs=-1) 

rf.fit(train_x_lsa, train_y)

Fitting 2 folds for each of 50 candidates, totalling 100 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  5.8min
[Parallel(n_jobs=-1)]: Done 100 out of 100 | elapsed: 15.7min finished


RandomizedSearchCV(cv=2, error_score='raise-deprecating',
                   estimator=RandomForestRegressor(bootstrap=True,
                                                   criterion='mse',
                                                   max_depth=None,
                                                   max_features='auto',
                                                   max_leaf_nodes=None,
                                                   min_impurity_decrease=0.0,
                                                   min_impurity_split=None,
                                                   min_samples_leaf=1,
                                                   min_samples_split=2,
                                                   min_weight_fraction_leaf=0.0,
                                                   n_estimators='warn',
                                                   n_jobs=None, oob_score=False,
                                                   random_sta...


In [66]:
#Find the best parameter set
rf.best_params_

{'max_depth': 9, 'max_features': 'auto', 'n_estimators': 401}

In [67]:
rf.best_estimator_

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=9,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=401,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)

In [68]:
#Train RMSE
train_y_pred = rf.best_estimator_.predict(train_x_lsa)

train_mse = mean_squared_error(train_y, train_y_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 11169.971900223127


In [69]:
#Test RMSE
test_y_pred = rf.best_estimator_.predict(test_x_lsa)

test_mse = mean_squared_error(test_y, test_y_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 18032.222139214828
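The search above was run on an older scikit-learn release, as the printed defaults show. In scikit-learn 1.3 and later, `max_features='auto'` is no longer accepted for `RandomForestRegressor`; the value `1.0` (use all features) is the equivalent. The sketch below shows the same kind of randomized search on synthetic data with that substitution and with explicit `random_state` values for repeatability. The data, the small `n_estimators` range, and `n_iter=4` are assumptions chosen so that it runs quickly.

```python
import numpy as np
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the LSA features and salaries.
X, y = make_regression(n_samples=120, n_features=10, noise=5.0, random_state=0)

param_distributions = {
    "n_estimators": randint(10, 30),   # kept small for speed
    "max_features": [1.0, "sqrt"],     # 1.0 replaces the removed 'auto'
    "max_depth": randint(4, 10),       # integers 4 through 9
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,
    cv=2,
    scoring="neg_mean_squared_error",
    random_state=0,                    # fixes which candidates are sampled
)
search.fit(X, y)

# best_score_ is a negated MSE, so flip the sign before taking the root.
cv_rmse = np.sqrt(-search.best_score_)
print(search.best_params_, cv_rmse)
```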


# Model 2 - SGD Regressor

In [21]:
from sklearn.linear_model import SGDRegressor

#Randomized Search
param_grid = {'max_iter': randint(500,2000),    
              'learning_rate': ['constant','invscaling','adaptive','optimal']
             }

sgd = RandomizedSearchCV(estimator=SGDRegressor(),
                  param_distributions=param_grid,
                  n_iter=500,
                  cv=2,
                  return_train_score=True,
                  scoring='neg_mean_squared_error',
                  n_jobs=-1,
                  verbose=1
                )

#sgd = SGDRegressor(max_iter=300, eta0=0.2, learning_rate='adaptive', tol=1e-3)

sgd.fit(train_x_lsa, train_y)

Fitting 2 folds for each of 500 candidates, totalling 1000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   14.2s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:  1.1min
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:  2.5min
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:  4.5min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed:  5.7min finished


RandomizedSearchCV(cv=2, error_score='raise-deprecating',
                   estimator=SGDRegressor(alpha=0.0001, average=False,
                                          early_stopping=False, epsilon=0.1,
                                          eta0=0.01, fit_intercept=True,
                                          l1_ratio=0.15,
                                          learning_rate='invscaling',
                                          loss='squared_loss', max_iter=1000,
                                          n_iter_no_change=5, penalty='l2',
                                          power_t=0.25, random_state=None,
                                          shuffle=True, tol=0.001,
                                          validation_fraction=0.1, verbose=0,
                                          warm_start=False),
                   iid='warn', n_iter=500, n_jobs=-1,
                   param_distributions={'learning_rate': ['constant',
                                   

In [22]:
#Find the best parameter set
sgd.best_params_

{'learning_rate': 'invscaling', 'max_iter': 1963}

In [23]:
sgd.best_estimator_

SGDRegressor(alpha=0.0001, average=False, early_stopping=False, epsilon=0.1,
             eta0=0.01, fit_intercept=True, l1_ratio=0.15,
             learning_rate='invscaling', loss='squared_loss', max_iter=1963,
             n_iter_no_change=5, penalty='l2', power_t=0.25, random_state=None,
             shuffle=True, tol=0.001, validation_fraction=0.1, verbose=0,
             warm_start=False)

In [24]:
#Train RMSE
train_y_pred = sgd.best_estimator_.predict(train_x_lsa)

train_mse = mean_squared_error(train_y, train_y_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 14302.778451357584


In [25]:
#Test RMSE
test_y_pred = sgd.best_estimator_.predict(test_x_lsa)

test_mse = mean_squared_error(test_y, test_y_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 17464.745530405722
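The search above varies only `max_iter` and `learning_rate`, leaving the initial step size `eta0` and the regularization strength `alpha` at their defaults. A hedged sketch of a wider search follows; it runs on synthetic data, and the sampling ranges are illustrative assumptions rather than tuned recommendations for the job-ads features.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the LSA features and salaries.
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# Sample alpha and eta0 on a log scale, since they span orders of magnitude.
sgd_distributions = {
    "alpha": loguniform(1e-6, 1e-1),
    "eta0": loguniform(1e-4, 1e-2),
    "learning_rate": ["constant", "invscaling", "adaptive"],
}

sgd_search = RandomizedSearchCV(
    estimator=SGDRegressor(max_iter=1000, tol=1e-3, random_state=0),
    param_distributions=sgd_distributions,
    n_iter=5,
    cv=2,
    scoring="neg_mean_squared_error",
    random_state=0,
)
sgd_search.fit(X, y)

print(sgd_search.best_params_)
print(np.sqrt(-sgd_search.best_score_))
```

The `'optimal'` schedule is left out of this sketch because it does not use `eta0`.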


# Model 3 - Neural Network

In [170]:
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import GridSearchCV

#Grid Search
param_grid = {
'learning_rate': ["constant", "invscaling", "adaptive"],
'hidden_layer_sizes': [(50,50,50)],
#'alpha': [.005,.01,.015],
'activation': ["relu","tanh"],
'solver': ['lbfgs','adam']
}

nn = GridSearchCV(estimator=MLPRegressor(),
                      param_grid=param_grid,
                      n_jobs=-1,
                      cv=3, 
                      scoring='neg_mean_squared_error',
                      return_train_score=True,
                      verbose=True)

nn.fit(train_x_lsa, train_y)

Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  36 out of  36 | elapsed:   34.3s finished


GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=MLPRegressor(activation='relu', alpha=0.0001,
                                    batch_size='auto', beta_1=0.9, beta_2=0.999,
                                    early_stopping=False, epsilon=1e-08,
                                    hidden_layer_sizes=(100,),
                                    learning_rate='constant',
                                    learning_rate_init=0.001, max_iter=200,
                                    momentum=0.9, n_iter_no_change=10,
                                    nesterovs_momentum=True, power_t=0.5,
                                    random_stat...
                                    solver='adam', tol=0.0001,
                                    validation_fraction=0.1, verbose=False,
                                    warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'activation': ['relu', 'tanh'],
                         'hidden_layer_si

In [171]:
#Find the best parameter set
nn.best_params_

{'activation': 'relu',
 'hidden_layer_sizes': (50, 50, 50),
 'learning_rate': 'constant',
 'solver': 'lbfgs'}

In [172]:
nn.best_estimator_

MLPRegressor(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
             beta_2=0.999, early_stopping=False, epsilon=1e-08,
             hidden_layer_sizes=(50, 50, 50), learning_rate='constant',
             learning_rate_init=0.001, max_iter=200, momentum=0.9,
             n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
             random_state=None, shuffle=True, solver='lbfgs', tol=0.0001,
             validation_fraction=0.1, verbose=False, warm_start=False)

In [173]:
#Train RMSE
train_y_pred = nn.best_estimator_.predict(train_x_lsa)

train_mse = mean_squared_error(train_y, train_y_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 4838.917418579043


In [174]:
#Test RMSE
test_y_pred = nn.best_estimator_.predict(test_x_lsa)

test_mse = mean_squared_error(test_y, test_y_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 13640.333525396956
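With all three models evaluated, the test RMSE values reported in this notebook can be put side by side with the baseline. The snippet below only restates those printed figures (rounded to cents) and computes each model's relative reduction in RMSE.

```python
# Test RMSE values copied from the outputs above, rounded to two decimals.
baseline_rmse_reported = 30011.39
test_rmse_reported = {
    "Random Forest": 18032.22,
    "SGD Regressor": 17464.75,
    "Neural Network": 13640.33,
}

# Relative reduction in RMSE compared with always predicting the mean.
reduction = {
    name: 1 - rmse / baseline_rmse_reported
    for name, rmse in test_rmse_reported.items()
}
best_model = min(test_rmse_reported, key=test_rmse_reported.get)

for name, share in sorted(reduction.items(), key=lambda item: -item[1]):
    print(f"{name}: {share:.1%} below baseline")
print("Lowest test RMSE:", best_model)
```

The neural network has the lowest test RMSE, although its training RMSE (about 4839) is far below its test RMSE (about 13640), which suggests it overfits the training data.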


### Prepare Competition Dataset

In [180]:
comp = pd.read_csv('nyc-jobs-competition.csv')

In [144]:
#Concatenate Job Description and Minimum Qualifications
comp['Description'] = comp['Job Description'] + ' ' + comp['Minimum Qualifications']
comp.head(5)

Unnamed: 0,Job ID,Salary,Job Description,Minimum Qualifications,Description
0,398564,,"In December 2018, Mayor de Blasio and New York...",1. A baccalaureate degree from an accredited c...,"In December 2018, Mayor de Blasio and New York..."
1,308043,,"Under the direction of the Executive Director,...",1. A baccalaureate degree from an accredited c...,"Under the direction of the Executive Director,..."
2,400827,,The Family Independence Administration (FIA) i...,1. A baccalaureate degree from an accredited c...,The Family Independence Administration (FIA) i...
3,372029,,The NYC Mayor’s Office of Environmental Reme...,"Professional/Vendor Certification, Education a...",The NYC Mayor’s Office of Environmental Reme...
4,368207,,The NYC Department of Environmental Protection...,(1) A four-year high school diploma or its edu...,The NYC Department of Environmental Protection...


In [145]:
compText = comp['Description']

In [146]:
compText

0     In December 2018, Mayor de Blasio and New York...
1     Under the direction of the Executive Director,...
2     The Family Independence Administration (FIA) i...
3     The NYC Mayor’s Office of Environmental Reme...
4     The NYC Department of Environmental Protection...
                            ...                        
86    The duties and responsibilities of this positi...
87    The Commission on Human Rights (the Commission...
88    Reporting to the Deputy Director of Training a...
89    Directs and manages the Facilities Central sec...
90    The Bureau of Asset Management (BAM) is respon...
Name: Description, Length: 91, dtype: object

### Count Vectorizer

In [147]:
txt_tr = count_vect.transform(compText)

In [148]:
txt_tr

<91x10247 sparse matrix of type '<class 'numpy.int64'>'
	with 16140 stored elements in Compressed Sparse Row format>

### TF-IDF Transformer

In [149]:
txt_tfidf = tf_transformer.transform(txt_tr)

In [150]:
txt_tfidf

<91x10247 sparse matrix of type '<class 'numpy.float64'>'
	with 16140 stored elements in Compressed Sparse Row format>

### Singular Value Decomposition

In [151]:
txt_lsa = svd.transform(txt_tfidf)

txt_lsa.shape

(91, 700)

In [152]:
txt_lsa

array([[ 2.68535216e-01, -1.08917370e-01, -1.36105683e-01, ...,
        -1.55291293e-02,  1.36466342e-03, -1.15462475e-02],
       [ 3.00875198e-01, -1.68737983e-01, -5.24292502e-02, ...,
         2.05461558e-02,  1.05916289e-02,  6.69093568e-03],
       [ 2.11198097e-01, -1.05998770e-01, -4.58127598e-02, ...,
        -5.89131765e-03,  2.71908238e-02, -9.77491840e-03],
       ...,
       [ 1.84506258e-01, -1.27393041e-01, -5.63322898e-02, ...,
        -9.96287538e-03,  4.52857980e-02, -1.59912501e-02],
       [ 2.98570840e-01, -1.34086491e-01, -6.63395433e-02, ...,
        -7.58661450e-03,  6.03194351e-03, -1.05503595e-02],
       [ 1.83805532e-01, -7.84787016e-02, -5.09173931e-02, ...,
         2.16936085e-02, -1.09304449e-05, -3.97172159e-03]])
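The competition data must pass through the same fitted vectorizer, TF-IDF transformer, and SVD as the training data, in the same order. One way to guarantee this is to chain the steps in a scikit-learn `Pipeline`. The sketch below uses made-up documents and salaries, and swaps in `Ridge` as a simple stand-in regressor; none of these values come from the notebook.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline

# Hypothetical training text and salaries.
train_docs = [
    "senior attorney admitted to the bar",
    "entry level clerk with high school diploma",
    "software engineer with a master's degree in computer science",
    "administrative assistant with clerical experience",
]
train_salary = np.array([95000.0, 40000.0, 110000.0, 45000.0])

text_model = Pipeline([
    ("counts", CountVectorizer(stop_words="english")),
    ("tfidf", TfidfTransformer()),
    ("svd", TruncatedSVD(n_components=2, random_state=42)),
    ("regressor", Ridge(alpha=1.0)),   # stand-in for the tuned model
])

# fit() fits every step on the training text only.
text_model.fit(train_docs, train_salary)

# predict() applies the already-fitted steps to unseen text.
new_docs = ["junior clerk with a diploma"]
predicted = text_model.predict(new_docs)
print(predicted.shape)
```

A pipeline like this can also be passed directly to `GridSearchCV` or `RandomizedSearchCV`, so the text steps are refit inside each cross-validation fold.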

### Predict Salary

In [175]:
#Predict the competition data set with the best neural network, which had the lowest test RMSE
salary_pred = nn.best_estimator_.predict(txt_lsa)

In [183]:
#Replace Salary column with predictions
comp['Salary']=salary_pred

In [189]:
#Drop unnecessary columns
comp = comp.drop(['Job Description','Minimum Qualifications'], axis=1)

In [190]:
comp

Unnamed: 0,Job ID,Salary
0,398564,81633.565112
1,308043,84914.940623
2,400827,69082.909554
3,372029,105795.121459
4,368207,71224.181872
...,...,...
86,269889,80483.420393
87,395905,68597.559357
88,387857,59366.191122
89,390805,70715.283587


In [192]:
#Output to CSV
comp.to_csv("johnson_nyc_competition.csv",index=False)