## Project 4 : Job Market Analysis

## Notebook 03: Predictive Modelling - Regression Analysis

In [29]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression, LogisticRegressionCV, RidgeClassifierCV, LassoCV, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import label_binarize
from sklearn.neighbors import KNeighborsRegressor
from sklearn import metrics
from sklearn.svm import SVC
from ipywidgets import *
from IPython.display import display
from itertools import combinations
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
sns.set_style('whitegrid')
%matplotlib inline

This notebook focuses on identifying the key factors that affect salaries and job categories. It starts from the CSV that was saved after completing the EDA in Notebook 02:

In [2]:
final_jobs = pd.read_csv('final_jobs.csv', index_col = [0])

In [3]:
final_jobs.head();

In [4]:
final_jobs.isnull().sum();

In [5]:
final_jobs.info();

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1270 entries, 0 to 1207
Data columns (total 15 columns):
job_title           1270 non-null object
job_location        1270 non-null object
advertiser          1270 non-null object
posted_date         1270 non-null object
salary              1270 non-null object
type_of_work        1270 non-null object
job_category        1270 non-null object
job_subcategory     1270 non-null object
job_description     1270 non-null object
job_searched        1270 non-null object
url                 1270 non-null object
new_job_category    1270 non-null object
new_job_title       1270 non-null object
final_salary        1270 non-null float64
experience_level    1270 non-null object
dtypes: float64(1), object(14)
memory usage: 158.8+ KB


## Data preprocessing

Steps required to make the data ready for modelling:

1. Keep only the features that are relevant for predicting salary
2. Dummify the categorical columns

In [6]:
final_jobs_small = final_jobs[['new_job_category','new_job_title','experience_level','job_location','type_of_work']]

In [7]:
final_jobs_small.head()

index,new_job_category,new_job_title,experience_level,job_location,type_of_work
0,Information Technology,data analyst,Mid_level,All-Sydney-NSW,Contract/Temp
3,Information Technology,data analyst,Junior_level,All-Sydney-NSW,Contract/Temp
4,Information Technology,data analyst,Mid_level,All-Sydney-NSW,Full Time
5,Information Technology,data analyst,Mid_level,All-Sydney-NSW,Contract/Temp
6,Information Technology,data scientist,Mid_level,All-Sydney-NSW,Full Time


## Creating dummies for categorical columns

In [8]:
final_cols = pd.get_dummies(final_jobs_small, drop_first=True, prefix=None)

In [9]:
final_cols.columns

Index(['new_job_category_Gov_Services', 'new_job_category_Information Technology', 'new_job_category_Others', 'new_job_category_Sales & Marketing', 'new_job_title_data analyst', 'new_job_title_data manager', 'new_job_title_data scientist', 'experience_level_Mid_level', 'experience_level_Senior_level', 'job_location_All-Sydney-NSW', 'type_of_work_Contract/Temp', 'type_of_work_Full Time', 'type_of_work_Part Time'], dtype='object')
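
As a small illustration of what `drop_first=True` does, the sketch below dummifies a made-up frame. The rows are hypothetical and only the column names mirror this notebook. For each categorical column, the alphabetically first level is dropped and becomes the implicit reference category.

```python
import pandas as pd

# Hypothetical toy rows; only the column names come from the notebook.
toy = pd.DataFrame({
    'experience_level': ['Junior_level', 'Mid_level', 'Senior_level', 'Mid_level'],
    'type_of_work': ['Full Time', 'Contract/Temp', 'Full Time', 'Part Time'],
})

# drop_first=True removes the first (alphabetically sorted) level of each column,
# so 'Junior_level' and 'Contract/Temp' become the reference categories here.
toy_dummies = pd.get_dummies(toy, drop_first=True)
dummy_columns = list(toy_dummies.columns)
print(dummy_columns)
```

Each remaining coefficient in a linear model is then read relative to the dropped level of its column.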

## Baseline Prediction & Defining Features and Target

In [10]:
X = final_cols
y = final_jobs['final_salary']

In [11]:
baseline = np.mean(y)
print('Baseline of model performance:' , baseline)

Baseline of model performance: 155353.87992125985
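
The mean salary is the prediction a model would make if it ignored every feature. A minimal sketch of turning that into comparable metrics is shown below, using scikit-learn's `DummyRegressor` on hypothetical salaries (the numbers are invented for illustration, not taken from `final_jobs`). A mean-only model scores an R-squared of 0 on the data it was fit on, so any useful model should beat that.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical salaries, used only to illustrate the idea.
y_toy = np.array([90000.0, 120000.0, 150000.0, 180000.0, 210000.0])
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Always predicts the mean of the target it was fit on.
baseline_model = DummyRegressor(strategy='mean').fit(X_toy, y_toy)
baseline_pred = baseline_model.predict(X_toy)

baseline_rmse = np.sqrt(mean_squared_error(y_toy, baseline_pred))
baseline_r2 = r2_score(y_toy, baseline_pred)
print('Baseline RMSE:', baseline_rmse)
print('Baseline R2:', baseline_r2)
```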


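## Standardizing the DataFrame
Standardizing the features puts them on a common scale, which matters for regularized models such as Lasso. Note that the scaled array `xn` is only used later, when the Lasso is refit to inspect its coefficients; the train/test split below uses the unscaled `X`.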

In [12]:
ss = StandardScaler()
xn = ss.fit_transform(X)



## Splitting the data into train and test sets

Here I split the data into a training set and a test set. All models are trained on the training set and evaluated on the test set, which gives an estimate of how well each model generalizes to unseen data. I opted for a training set size of 70%.

In [13]:
X_train,X_test,y_train,y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

In [14]:
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(889, 13) (889,)
(381, 13) (381,)
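
In this notebook the scaler is fit on the full dataset and the split uses the unscaled `X`. One possible alternative, sketched here on synthetic data, is to put the scaler and the model in a single `Pipeline` so that the scaling statistics are learned from the training rows only. The data, the variable names, and `alpha=100.0` are assumptions made for the example, not values from this project.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for dummy-coded features and a salary-like target.
rng = np.random.RandomState(42)
X_demo = rng.randint(0, 2, size=(200, 5)).astype(float)
y_demo = 100000 + 20000 * X_demo[:, 0] + rng.normal(0, 5000, size=200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42)

# The scaler is fit inside the pipeline, on the training rows only,
# and the same transformation is reused when scoring the test rows.
scaled_lasso = make_pipeline(StandardScaler(), Lasso(alpha=100.0))
scaled_lasso.fit(Xd_train, yd_train)
demo_r2 = scaled_lasso.score(Xd_test, yd_test)
print('Test R2 on synthetic data:', demo_r2)
```

The same pipeline object can be passed to `cross_val_score`, in which case the scaler is refit within each fold.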


## Fitting the models

## Model 1 : Linear Regression:

In [15]:
lr=LinearRegression()
model = lr.fit(X_train,y_train)
print('The score of train:', lr.score(X_train,y_train))
print('The score of test:', lr.score(X_test, y_test))

The score of train: 0.27712555504639147
The score of test: 0.28203546510083954


In [16]:
y_pred = lr.predict(X_test)
y_pred;

## Cross-validate the Lasso $R^2$ with the optimal alpha

In [17]:
optimal_lasso = LassoCV(cv=4).fit(X_train,y_train)

lasso = Lasso(alpha=optimal_lasso.alpha_)

lasso_scores = cross_val_score(lasso,X_train,y_train, cv=4)
print('Optimal alpha:',optimal_lasso.alpha_)
print('Mean lasso CV R2:',np.mean(lasso_scores))

Optimal alpha: 213.18516811649337
Mean lasso CV R2: 0.26864489278532633
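
To make the alpha selection less of a black box, the sketch below runs `LassoCV` on synthetic data where only two of six features matter. The data is generated for the example and is unrelated to the job postings. `LassoCV` evaluates a path of candidate alphas across the folds and keeps the one with the lowest mean squared error.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data: only the first two features influence the target.
rng = np.random.RandomState(0)
X_syn = rng.normal(size=(300, 6))
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(scale=0.5, size=300)

# Fit along a path of alphas with 4-fold cross-validation.
lasso_cv = LassoCV(cv=4).fit(X_syn, y_syn)
chosen_alpha = lasso_cv.alpha_
n_alphas_tried = len(lasso_cv.alphas_)

print('Chosen alpha:', chosen_alpha)
print('Alphas tried:', n_alphas_tried)
print('Coefficients:', lasso_cv.coef_)
```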


In [18]:
lasso.fit(X_train,y_train)
y_pred=lasso.predict(X_test)
print('Lasso train score:',lasso.score(X_train,y_train))
print('Lasso test score:',lasso.score(X_test,y_test))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

Lasso train score: 0.27229734115359217
Lasso test score: 0.2752541049333094
RMSE: 48345.345426791835


## Observation:
The model above does not predict salary well, as the low train and test R-squared scores (both about 0.27) show. The test RMSE is $48,345, so predictions are typically off by tens of thousands of dollars. The main issue is that about 70% of the salaries are not observed values: they were filled in from the 30% that were given, with missing values imputed using the median. A linear relationship between the features and salary was assumed, but it does not hold in this case.
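
RMSE is not quite the same as the average size of the error, because squaring weights large misses more heavily. A small worked example with made-up numbers shows the difference between RMSE and the mean absolute error (MAE):

```python
import numpy as np

# Hypothetical true and predicted salaries, for illustration only.
y_true = np.array([100000.0, 150000.0, 200000.0, 120000.0])
y_hat = np.array([110000.0, 140000.0, 120000.0, 120000.0])

errors = y_hat - y_true
mae = np.mean(np.abs(errors))          # average absolute miss
rmse = np.sqrt(np.mean(errors ** 2))   # penalizes the single large miss more

print('MAE:', mae)
print('RMSE:', rmse)
```

Here one large miss of $80,000 pulls the RMSE well above the MAE, so reporting both can give a fuller picture of the error distribution.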

## Looking at the coefficients for variables in Lasso:

In [19]:
lasso.fit(xn, y)

Lasso(alpha=213.18516811649337, copy_X=True, fit_intercept=True,
   max_iter=1000, normalize=False, positive=False, precompute=False,
   random_state=None, selection='cyclic', tol=0.0001, warm_start=False)

In [20]:
lasso_coefs = pd.DataFrame({'Feature':X.columns,
                            'coef':lasso.coef_,
                            'abs_coef':np.abs(lasso.coef_)})

lasso_coefs.sort_values('abs_coef', inplace=True, ascending=False)

lasso_coefs.head(5)

index,Feature,coef,abs_coef
10,type_of_work_Contract/Temp,21086.328519,21086.328519
1,new_job_category_Information Technology,16720.540899,16720.540899
8,experience_level_Senior_level,8461.821027,8461.821027
0,new_job_category_Gov_Services,8118.937323,8118.937323
5,new_job_title_data manager,5065.940199,5065.940199


## Model 2 : Random Forest Regressor

In [21]:
#Instantiate model with 10 decision trees, each limited to a depth of 3
rf = RandomForestRegressor(n_estimators=10, max_depth = 3)
#Train the model on training data
rf.fit(X_train,y_train)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=3,
           max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=None,
           oob_score=False, random_state=None, verbose=0, warm_start=False)

In [22]:
rf_scores = cross_val_score(rf, X_train, y_train, cv=4)
print('Mean random forest CV R2:',np.mean(rf_scores))

Mean random forest CV R2: 0.22200090631577912


The random forest does not improve on the Lasso. Its mean CV R-squared (0.222) is slightly below the Lasso's (0.269), and the test RMSE computed in the next cell is almost the same.

In [23]:
y_pred = rf.predict(X_test)
print('Random forest train score:',rf.score(X_train,y_train))
print('Random forest test score:',rf.score(X_test,y_test))
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_test,y_pred)))

Random forest train score: 0.2895716805545827
Random forest test score: 0.2664832641138083
RMSE: 48637.002227311714
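
The forest above uses fixed hyperparameters (`n_estimators=10`, `max_depth=3`) and no `random_state`, so its scores can vary slightly between runs. One way to check whether tuning would help is a small grid search with `GridSearchCV`, which is already imported. The sketch below uses synthetic data and a hypothetical parameter grid; it is not a result from this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic binary features with one main effect and one interaction.
rng = np.random.RandomState(1)
X_rf = rng.randint(0, 2, size=(150, 4)).astype(float)
y_rf = (50.0 * X_rf[:, 0]
        + 20.0 * X_rf[:, 1] * X_rf[:, 2]
        + rng.normal(scale=5.0, size=150))

# Hypothetical grid; the values are chosen only to keep the example fast.
param_grid = {'n_estimators': [10, 50], 'max_depth': [2, 4]}
search = GridSearchCV(RandomForestRegressor(random_state=42),
                      param_grid, cv=4, scoring='r2')
search.fit(X_rf, y_rf)

best_params = search.best_params_
best_cv_r2 = search.best_score_
print('Best parameters:', best_params)
print('Best mean CV R2:', best_cv_r2)
```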


In [24]:
feature_importances = pd.DataFrame(rf.feature_importances_,
                                   index=X.columns,
                                   columns=['importance'])
feature_importances.sort_values(by='importance', ascending=False).head()

Feature,importance
type_of_work_Contract/Temp,0.60985
new_job_category_Information Technology,0.210813
new_job_category_Gov_Services,0.06227
new_job_title_data manager,0.057408
experience_level_Senior_level,0.037979
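
The values above are impurity-based importances, which always sum to 1 across features. As a cross-check, permutation importance measures how much the score drops when a feature's values are shuffled. The sketch below applies `sklearn.inspection.permutation_importance` to synthetic data; the data and variable names are assumptions for the example and it was not run on `final_jobs`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Synthetic data: feature 0 dominates, feature 1 is weak, feature 2 is noise.
rng = np.random.RandomState(7)
X_imp = rng.randint(0, 2, size=(200, 3)).astype(float)
y_imp = 40.0 * X_imp[:, 0] + 5.0 * X_imp[:, 1] + rng.normal(scale=2.0, size=200)

forest = RandomForestRegressor(n_estimators=50, max_depth=3,
                               random_state=0).fit(X_imp, y_imp)

# Shuffle each feature 5 times and record the average drop in R2.
perm = permutation_importance(forest, X_imp, y_imp, n_repeats=5, random_state=0)
perm_means = perm.importances_mean
impurity_importances = forest.feature_importances_

print('Permutation importances:', perm_means)
print('Impurity importances:', impurity_importances)
```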


## Observation:

1. Both models rank largely the same features as having the highest impact on salary. Type of work plays the most important role.
2. Industry has a moderate to high impact on salary, especially Information Technology, as already seen during the EDA stage.

The top drivers of salary are as follows:

- Work Type (Contract Employment)
- Category / Industry (Information Technology)
- Job Title (Data Manager)

I tried both regression and classification models for predicting salaries. This notebook covers the regression models; Notebook 04 covers the classification models.