# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Project 4: Web Scraping Job Postings



### QUESTION 1: Factors that impact salary

 1. Determine the industry factors that are most important in predicting the salary amounts for these data.

To predict salary you will be building either a classification or regression model, using features like the location, title, and summary of the job. If framing this as a regression problem, you will be estimating the listed salary amounts. You may instead choose to frame this as a classification problem, in which case you will create labels from these salaries (high vs. low salary, for example) according to thresholds (such as median salary).

-------------

**Data Overview** 

Searched keywords: data analyst, data scientist, business analyst, business intelligence, data engineer and analytics. 

Scraped from mycareersfuture.sg: **914 rows** and **7 columns** after cleaning; feature engineering added **6 columns**. 

Columns: 'company_name', 'employment_type', 'industry', 'job_description',
       'job_title', 'requirements', 'salary_max', 'salary_min', 'seniority',
       'salary_avg', 'job_title_clean', 'seniority_clean',
       'salary_class'

* Target feature is salary class; thresholds are determined by the quartiles of the salary distribution. 
       * High salary is above the 75th percentile, low salary is below the 25th percentile
       * High salary threshold: 9075, low salary threshold: 5500
       * Baseline is the 'mid' class between the low and high salary thresholds, at ~50% of the entire dataset. 
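
The class labels described above could be derived roughly as follows. This is a minimal sketch, assuming the thresholds are the 25th and 75th percentiles of `salary_avg`; the function name `label_salary_class` and the sample salaries are illustrative and are not taken from the cleaning notebook.

```python
import numpy as np
import pandas as pd


def label_salary_class(salary_avg: pd.Series) -> pd.Series:
    """Label each salary 'low', 'mid' or 'high' using the quartiles as cut-offs."""
    low_cut = salary_avg.quantile(0.25)
    high_cut = salary_avg.quantile(0.75)
    # Strict inequalities: values equal to a cut-off stay in the 'mid' class
    labels = np.where(salary_avg > high_cut, 'high',
                      np.where(salary_avg < low_cut, 'low', 'mid'))
    return pd.Series(labels, index=salary_avg.index, name='salary_class')


# Illustrative salaries only, not rows from the scraped data
example = pd.Series([3000, 5000, 6000, 7000, 8000, 9500, 12000, 4000])
classes = label_salary_class(example)
print(classes.value_counts())
```

With quartile cut-offs, about half of the postings fall into 'mid' by construction, which is why the baseline sits near 50%.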

Data scraping and cleaning are done in a separate notebook.

**Problem with job description**: EDA showed that many job descriptions are identical to each other. The same description can therefore appear with different salary levels, which can confuse the models and increase the risk of misclassification.
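
One way to quantify this problem is to count descriptions that occur with more than one salary class. The sketch below is illustrative: the helper name `conflicting_descriptions` and the toy rows are assumptions, and only the column names come from the dataset.

```python
import pandas as pd


def conflicting_descriptions(df, text_col='job_description', label_col='salary_class'):
    """Return descriptions that appear with more than one distinct label."""
    classes_per_text = df.groupby(text_col)[label_col].nunique()
    return classes_per_text[classes_per_text > 1]


# Toy postings (illustrative only): the first two share a description
toy = pd.DataFrame({
    'job_description': ['build data pipelines', 'build data pipelines',
                        'prepare weekly reports'],
    'salary_class': ['high', 'low', 'mid'],
})
conflicts = conflicting_descriptions(toy)
print(conflicts)
```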

**Hypothesis:**
- Prediction will perform better without the job description feature

**Test**
- Test on Job Title, Industry and Seniority (target: salary class)
- Test on JD
- Test on Requirements
- Test on all Features

**Models**
- Logistic Regression
- Stochastic Gradient Descent Classifier
- Random Forest Classifier

**Evaluation Metric**
- F1
- Confusion Matrix
- Accuracy Score
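
As a small illustration of these metrics, the sketch below computes them on made-up labels. The `labels` argument fixes the row and column order of the confusion matrix explicitly; without it, scikit-learn sorts the labels. The toy `y_true` and `y_pred` are assumptions for illustration.

```python
from sklearn import metrics

labels = ['high', 'low', 'mid']  # explicit order for rows and columns
y_true = ['high', 'low', 'mid', 'mid', 'low', 'high']
y_pred = ['high', 'mid', 'mid', 'mid', 'low', 'mid']

accuracy = metrics.accuracy_score(y_true, y_pred)
macro_f1 = metrics.f1_score(y_true, y_pred, labels=labels, average='macro')
# Rows are true classes, columns are predicted classes, both in `labels` order
cm = metrics.confusion_matrix(y_true, y_pred, labels=labels)

print('Accuracy: {:.3f}'.format(accuracy))
print('Macro F1: {:.3f}'.format(macro_f1))
print(cm)
```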

In [249]:
import re

import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

matplotlib.rcParams['font.family'] = 'mono'
matplotlib.rcParams['font.weight'] = 3
matplotlib.rcParams['font.size'] = 10

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [7]:
import sqlite3
sqlite_db = './careers.db'
conn = sqlite3.connect(sqlite_db) 
c = conn.cursor()

In [640]:
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.base import BaseEstimator, TransformerMixin

import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV, SGDClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import MultinomialNB

# Stemmer instance used as `ps` in the text-cleaning cells below
ps = PorterStemmer()


### Test 1: Classification without JD/Requirements

In [67]:
query = '''
SELECT industry, job_title_clean, seniority_clean, salary_class
FROM alljobs'''

all_jobs = pd.read_sql(query, con=conn)

In [913]:
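def get_features(model, classes=None):
    """Print the top-10 features per class for a fitted vectorizer + classifier pipeline.

    `classes` is kept for backward compatibility; labels are read from
    clf.classes_ so that they always match the row order of clf.coef_.
    """
    vectorizer = model.named_steps['vectorizer']
    clf = model.named_steps['clf']

    # get_feature_names() was removed in scikit-learn 1.2; prefer the newer name
    if hasattr(vectorizer, 'get_feature_names_out'):
        features_names = np.asarray(vectorizer.get_feature_names_out())
    else:
        features_names = np.asarray(vectorizer.get_feature_names())

    print('Number of features: {} \n'.format(len(features_names)))

    labels = clf.classes_  # sorted label order, e.g. ['high', 'low', 'mid']
    if hasattr(clf, 'coef_'):
        if len(labels) > 2:
            for i, label in enumerate(labels):
                top10 = np.argsort(clf.coef_[i])[-10:]
                print('%s: %s' % (label, ', '.join(features_names[top10])))
        else:
            top10 = np.argsort(clf.coef_[0])[-10:]
            print('Top 10 features found in %s:' % labels[1])
            print(', '.join(features_names[top10]))
    else:
        # Tree ensembles expose one global importance vector, not per-class weights
        top10 = np.argsort(clf.feature_importances_)[-10:]
        print('Top 10 features by importance (all classes):')
        print(', '.join(features_names[top10]))


def get_scores(classes_names):
    """Print the classification report, accuracy and a labelled confusion matrix.

    Relies on the notebook-level `pipe`, `X_test` and `y_test`.
    """
    y_pred = pipe.predict(X_test)

    # Pass the label order explicitly so row/column names match the matrix;
    # sorted order is also the order used by classification_report
    labels = sorted(classes_names)
    confusion = pd.DataFrame(metrics.confusion_matrix(y_test, y_pred, labels=labels),
                             index=['is_{}'.format(label) for label in labels],
                             columns=['pred_{}'.format(label) for label in labels])

    print('Classification Report: \n')
    print(metrics.classification_report(y_test, y_pred))
    print('='*60)
    print('Accuracy: {}'.format(metrics.accuracy_score(y_test, y_pred)))
    print('='*60)
    print('Confusion Matrix: \n')
    print(confusion)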

In [914]:
y = all_jobs.salary_class
X = all_jobs.industry+' '+all_jobs.job_title_clean+' '+all_jobs.seniority_clean

In [915]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((731,), (183,), (731,), (183,))

#### Normal Log Regression L2 Penalty

In [916]:
cvec = CountVectorizer(stop_words='english')
logreg = LogisticRegression(penalty= 'l2')

pipe = Pipeline(steps=[('vectorizer', cvec),
                       ('clf',logreg)])

model_t1m1 = pipe.fit(X_train,y_train)

get_scores(['high', 'mid', 'low'])

Classification Report: 

             precision    recall  f1-score   support

       high       0.47      0.35      0.40        46
        low       0.71      0.49      0.58        51
        mid       0.59      0.78      0.67        86

avg / total       0.59      0.59      0.58       183

Accuracy: 0.5901639344262295
Confusion Matrix: 

         pred_high  pred_low  pred_mid
is_high         16         3        27
is_low           6        25        20
is_mid          12         7        67


#### Normalized Logistic Regression L2 Penalty

In [788]:
tfidf = TfidfTransformer()

pipe = Pipeline(steps=[('vectorizer', cvec),
                       ('tfidf',tfidf),
                       ('clf',logreg)])

model_t1m2 = pipe.fit(X_train,y_train)

get_scores(['high', 'mid', 'low'])

Classification Report: 

             precision    recall  f1-score   support

       high       0.50      0.26      0.34        46
        low       0.71      0.47      0.56        51
        mid       0.56      0.81      0.66        86

avg / total       0.59      0.58      0.56       183

Accuracy: 0.5792349726775956
Confusion Matrix: 

         pred_high  pred_low  pred_mid
is_high         12         3        31
is_low           3        24        24
is_mid           9         7        70


#### Stochastic Model

In [576]:
tfidfvec = TfidfVectorizer(stop_words='english')
SDGC = SGDClassifier(random_state=42)

pipe = Pipeline(steps=[('vectorizer', tfidfvec),
                       ('clf',SDGC)])

model_t1m3 = pipe.fit(X_train,y_train)

get_scores(['high', 'mid', 'low'])

Classification Report: 

             precision    recall  f1-score   support

       high       0.42      0.46      0.44        46
        low       0.61      0.43      0.51        51
        mid       0.57      0.64      0.60        86

avg / total       0.54      0.54      0.53       183

Accuracy: 0.5355191256830601
Confusion Matrix: 

         pred_high  pred_low  pred_mid
is_high         21         6        19
is_low           6        22        23
is_mid          23         8        55




#### Random Forest Classifier

In [533]:
RF = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)

pipe = Pipeline(steps=[('vectorizer', cvec),
                       ('tfidf',tfidf),
                       ('clf',RF)])

model_t1m4 = pipe.fit(X_train,y_train)

get_scores(['high', 'mid', 'low'])

Classification Report: 

             precision    recall  f1-score   support

       high       1.00      0.02      0.04        46
        low       0.80      0.24      0.36        51
        mid       0.50      0.98      0.66        86

avg / total       0.71      0.53      0.42       183

Accuracy: 0.5300546448087432
Confusion Matrix: 

         pred_high  pred_low  pred_mid
is_high          1         1        44
is_low           0        12        39
is_mid           0         2        84


### Test Result:

The best-performing model is Logistic Regression with L2 penalty on raw counts, with average accuracy, precision and recall of 0.59.

Generally, the features in Test 1 are good predictors of mid and low salary, but not of high salary.
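
To put the Test 1 accuracy in context, it helps to compare it with a majority-class baseline on the same test set. The sketch below only uses the 'support' counts and the accuracy printed in the first report above; always predicting 'mid' is the assumed baseline strategy.

```python
# Test-set class counts, read from the 'support' column of the reports above
support = {'high': 46, 'low': 51, 'mid': 86}

# A classifier that always predicts the largest class gets this accuracy
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / sum(support.values())

model_accuracy = 0.5901639344262295  # logistic regression on raw counts (Test 1)

print('Baseline ({}): {:.3f}'.format(majority_class, baseline_accuracy))
print('Improvement over baseline: {:.3f}'.format(model_accuracy - baseline_accuracy))
```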

Let's see whether the results improve with Job Description and Requirements.

### Test 2: Classification with JD

In [597]:
query = '''
SELECT industry, job_title_clean, seniority_clean, salary_class, job_description, requirements
FROM alljobs'''

all_jobs_test2 = pd.read_sql(query, con=conn)

In [618]:
stop = list(set(stopwords.words('english'))) + ['requir', 'descri', 'role', 'senior', 'enterpris', 'minimum']

In [599]:
#cleaning job description
all_jobs_test2['desc_clean'] = all_jobs_test2['job_description'].apply(lambda x: re.sub("[^a-zA-Z]", " ", x).lower())
all_jobs_test2['req_clean'] = all_jobs_test2['requirements'].apply(lambda x: re.sub("[^a-zA-Z]", " ", x).lower())
all_jobs_test2 = all_jobs_test2.drop(['job_description', 'requirements'],axis=1)
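
The cleaning step above replaces every non-letter character with a space and lowercases the text. A minimal standalone sketch of the same idea follows; the helper name `clean_text` and the extra whitespace collapsing are assumptions added for illustration and are not part of the notebook's cleaning.

```python
import re


def clean_text(raw):
    """Keep letters only, lowercase, and collapse repeated whitespace."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', raw)
    # split() without arguments drops the runs of spaces left by the substitution
    return ' '.join(letters_only.lower().split())


cleaned = clean_text("5+ years' SQL/Python!")
print(cleaned)
```

Note that digits are removed as well, so phrases such as "5 years" lose the number, and words that are already fused in the raw text cannot be separated by this step.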

In [604]:
y = all_jobs_test2.salary_class
X = all_jobs_test2.desc_clean
X = X.map(lambda x: ' '.join([ps.stem(word)for word in x.split() if word not in stop]))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((731,), (183,), (731,), (183,))
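
The stemming step above filters unstemmed words against the `stop` list, so stem entries such as 'requir' only take effect if the vectorizer is also given `stop_words=stop`. The sketch below shows an alternative order, stemming first and then filtering on stems. The helper name `stem_and_filter` and the tiny stop set are assumptions for illustration.

```python
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()


def stem_and_filter(text, stop_stems):
    """Stem every word, then drop the stems listed in `stop_stems`."""
    stems = (stemmer.stem(word) for word in text.lower().split())
    return ' '.join(stem for stem in stems if stem not in stop_stems)


# 'requirements' stems to 'requir', which the stop set then removes
example = stem_and_filter('Requirements include machine learning', {'requir'})
print(example)
```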

#### Normal Log Regression L2 Penalty

In [605]:
#### Normal Log Regression L2 Penalty
cvec = CountVectorizer(stop_words='english', ngram_range=(1,2))
logreg = LogisticRegression(penalty= 'l2')

pipe = Pipeline(steps=[('vectorizer', cvec),
                       ('clf',logreg)])

model_t2m1 = pipe.fit(X_train,y_train)

get_scores(['high', 'mid', 'low'])

Classification Report: 

             precision    recall  f1-score   support

       high       0.39      0.28      0.33        46
        low       0.58      0.51      0.54        51
        mid       0.53      0.65      0.59        86

avg / total       0.51      0.52      0.51       183

Accuracy: 0.5191256830601093
Confusion Matrix: 

         pred_high  pred_low  pred_mid
is_high         13         3        30
is_low           6        26        19
is_mid          14        16        56


In [606]:
get_features(model_t2m1, ['high', 'mid', 'low'])

Number of features: 52957 

high: look technic, ntuc enterpris, backend, includ, respons ntuc, drive, backend join, deliveri, lead, regulatori
low: valid, inform, track, camri lead, nu camri, camri, campaign, support, develop data, junior
mid: role responsibilitiesntuc, develop join, chang, member, android develop, look android, camr lead, nu camr, camr, strong


#### Normalized Logistic Regression L2 Penalty

In [607]:
tfidf = TfidfTransformer()

pipe = Pipeline(steps=[('vectorizer', cvec),
                       ('tfidf',tfidf),
                       ('clf',logreg)])

model_t2m2 = pipe.fit(X_train,y_train)

get_scores(['high', 'mid', 'low'])

Classification Report: 

             precision    recall  f1-score   support

       high       0.45      0.22      0.29        46
        low       0.83      0.29      0.43        51
        mid       0.50      0.84      0.63        86

avg / total       0.58      0.53      0.49       183

Accuracy: 0.5300546448087432
Confusion Matrix: 

         pred_high  pred_low  pred_mid
is_high         10         1        35
is_low           0        15        36
is_mid          12         2        72


In [608]:
get_features(model_t2m2, ['high', 'mid', 'low'])

Number of features: 52957 

high: bank, machin, talent, region, team, learn, drive, machin learn, regulatori, googl
low: brand, sa, media, perform, programm, seo, revenu, support, campaign, report
mid: demand, work, etl, suppli, engin, technic, infrastructur, design, test, data


#### Stochastic Model

In [609]:
tfidfvec = TfidfVectorizer(stop_words='english', ngram_range=(1,2))
SDGC = SGDClassifier(random_state=42)

pipe = Pipeline(steps=[('vectorizer', tfidfvec),
                       ('clf',SDGC)])

model_t2m3 = pipe.fit(X_train,y_train)

get_scores(['high', 'mid', 'low'])

Classification Report: 

             precision    recall  f1-score   support

       high       0.37      0.41      0.39        46
        low       0.56      0.37      0.45        51
        mid       0.46      0.52      0.49        86

avg / total       0.47      0.45      0.45       183

Accuracy: 0.453551912568306
Confusion Matrix: 

         pred_high  pred_low  pred_mid
is_high         19         2        25
is_low           5        19        27
is_mid          28        13        45




In [610]:
get_features(model_t2m3, ['high', 'mid', 'low'])

Number of features: 52957 

high: team, look technic, learn, lead, regulatori, deliveri, backend join, tax, drive, googl
low: develop data, valid, promot, cognit, support, facil, seo, sa, report, campaign
mid: consist, youtub, strong, abus, deep learn, base, work, suppli, chang, test


#### Random Forest Classifier

In [611]:
RF = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)

pipe = Pipeline(steps=[('vectorizer', cvec),
                       ('tfidf',tfidf),
                       ('clf',RF)])

model_t2m4 = pipe.fit(X_train,y_train)

get_scores(['high', 'mid', 'low'])

Classification Report: 

             precision    recall  f1-score   support

       high       0.38      0.07      0.11        46
        low       1.00      0.02      0.04        51
        mid       0.47      0.94      0.62        86

avg / total       0.59      0.46      0.33       183

Accuracy: 0.4644808743169399
Confusion Matrix: 

         pred_high  pred_low  pred_mid
is_high          3         0        43
is_low           0         1        50
is_mid           5         0        81


In [612]:
get_features(model_t2m4, ['high', 'mid', 'low'])

Number of features: 52957 

Top 10 features by importance (all classes):
omni channel, everi, dynam seamless, lifecycl, live, platform deliv, channel, background, divers, lap


**Note** - Although Test 2 didn't perform as well as Test 1, it is more insightful. For example, all three linear models list 'regulatori' (regulatory) and 'drive' among the top predictors of high salary. 'learn' also appears in two of them and probably refers to machine learning ('machin learn'). Perhaps we can look into this in Test 4.

As in Test 1, Job Description predicts mid and low salary better than high salary.
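
The agreement between models can be checked mechanically. The sketch below copies the top-10 'high' stems from the three linear-model outputs above and counts how many models list each stem; the dictionary keys are illustrative names for those three models.

```python
from collections import Counter

# Top-10 'high' stems copied from the three linear-model outputs above
top_high = {
    'logreg_counts': ['look technic', 'ntuc enterpris', 'backend', 'includ',
                      'respons ntuc', 'drive', 'backend join', 'deliveri',
                      'lead', 'regulatori'],
    'logreg_tfidf': ['bank', 'machin', 'talent', 'region', 'team', 'learn',
                     'drive', 'machin learn', 'regulatori', 'googl'],
    'sgd_tfidf': ['team', 'look technic', 'learn', 'lead', 'regulatori',
                  'deliveri', 'backend join', 'tax', 'drive', 'googl'],
}

# Count in how many models each stem appears
votes = Counter(stem for stems in top_high.values() for stem in set(stems))
shared_by_all = sorted(stem for stem, n in votes.items() if n == len(top_high))
shared_by_two = sorted(stem for stem, n in votes.items() if n == 2)

print('In all three models:', shared_by_all)
print('In exactly two models:', shared_by_two)
```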

### Test 3: Classification with Requirements


In [613]:
y = all_jobs_test2.salary_class
X = all_jobs_test2.req_clean
X = X.map(lambda x: ' '.join([ps.stem(word)for word in x.split() if word not in stop]))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((731,), (183,), (731,), (183,))

In [614]:
#### Normal Log Regression L2 Penalty
cvec = CountVectorizer(stop_words=stop, ngram_range=(1,2))
logreg = LogisticRegression(penalty= 'l2')

pipe = Pipeline(steps=[('vectorizer', cvec),
                       ('clf',logreg)])

model_t3m1 = pipe.fit(X_train,y_train)

get_scores(['high', 'mid', 'low'])

Classification Report: 

             precision    recall  f1-score   support

       high       0.59      0.43      0.50        46
        low       0.59      0.53      0.56        51
        mid       0.56      0.67      0.61        86

avg / total       0.58      0.57      0.57       183

Accuracy: 0.5737704918032787
Confusion Matrix: 

         pred_high  pred_low  pred_mid
is_high         20         2        24
is_low           3        27        21
is_mid          11        17        58


In [615]:
get_features(model_t3m1, ['high', 'mid', 'low'])

Number of features: 33666 

high: larg, face, disciplin, appli, good understand, success, qualif, consult, min year, min
low: inform technolog, media, manufactur, offic, minimum, diploma, meticul, bachelor comput, follow, minimum bachelor
mid: excel command, microsoft, demonstr, year, relat, requirementsqualif year, english, complet, phd comput, phd


In [616]:
#### Normalized Logistic Regression L2 Penalty
tfidf = TfidfTransformer()

pipe = Pipeline(steps=[('vectorizer', cvec),
                       ('tfidf',tfidf),
                       ('clf',logreg)])

model_t3m2 = pipe.fit(X_train,y_train)

get_scores(['high', 'mid', 'low'])

Classification Report: 

             precision    recall  f1-score   support

       high       0.55      0.26      0.35        46
        low       0.84      0.31      0.46        51
        mid       0.51      0.85      0.64        86

avg / total       0.61      0.55      0.52       183

Accuracy: 0.5519125683060109
Confusion Matrix: 

         pred_high  pred_low  pred_mid
is_high         12         0        34
is_low           0        16        35
is_mid          10         3        73


In [617]:
get_features(model_t3m2, ['high', 'mid', 'low'])

Number of features: 33666 

high: market, solid, product, requirementsminimum qualif, experi, success, complex, requirementsminimum, consult, qualif
low: pressur, abl, minimum bachelor, inform technolog, employe, media, meticul, benefit, manufactur, diploma
mid: solut, busi analyst, relat, informatica, bank, complet, busi, phd, experi, data


In [550]:
#### Stochastic Model
tfidfvec = TfidfVectorizer(stop_words='english')
SDGC = SGDClassifier(random_state=42)

pipe = Pipeline(steps=[('vectorizer', tfidfvec),
                       ('clf',SDGC)])

model_t3m3 = pipe.fit(X_train,y_train)

get_scores(['high', 'mid', 'low'])

Classification Report: 

             precision    recall  f1-score   support

       high       0.51      0.63      0.56        46
        low       0.55      0.43      0.48        51
        mid       0.58      0.58      0.58        86

avg / total       0.55      0.55      0.55       183

Accuracy: 0.5519125683060109
Confusion Matrix: 

         pred_high  pred_low  pred_mid
is_high         29         5        12
is_low           5        22        24
is_mid          23        13        50




In [551]:
get_features(model_t3m3, ['high', 'mid', 'low'])

Number of features: 3662 

high: fast, fx, requirementsminimum, bw, aaa, focu, disciplin, larg, consult, min
low: manufactur, plu, media, bioinformat, sap, graduat, member, inform, follow, meticul
mid: analyst, deriv, varieti, output, microsoft, complet, demonstr, relat, sens, phd


In [552]:
#RF Classifier

RF = RandomForestClassifier(n_estimators=100, max_depth=3, random_state=42)

pipe = Pipeline(steps=[('vectorizer', cvec),
                       ('tfidf',tfidf),
                       ('clf',RF)])

model_t3m4 = pipe.fit(X_train,y_train)

get_scores(['high', 'mid', 'low'])

Classification Report: 

             precision    recall  f1-score   support

       high       0.00      0.00      0.00        46
        low       1.00      0.04      0.08        51
        mid       0.48      1.00      0.64        86

avg / total       0.50      0.48      0.32       183

Accuracy: 0.4808743169398907
Confusion Matrix: 

         pred_high  pred_low  pred_mid
is_high          0         0        46
is_low           0         2        49
is_mid           0         0        86




In [553]:
get_features(model_t3m4, ['high', 'mid', 'low'])

Number of features: 33955 

Top 10 features by importance (all classes):
salari, ci tool, requirementsrequir year, background, meticul, offic, diploma, knowledg git, api, requir phd


### Last Test - All Features

In [475]:
## Test on Logistic Regression with L2 penalty, the best-performing model so far

In [649]:
stop = stop + ['ntuc', 'profession', 'requirementsrequir', 'responsibilities', 'phd']

In [650]:
y = all_jobs_test2.salary_class
X = all_jobs_test2.desc_clean+ ' '+all_jobs_test2.industry + ' '+ all_jobs_test2.job_title_clean +' '+ all_jobs_test2.seniority_clean +' '+all_jobs_test2.req_clean 
X = X.map(lambda x: ' '.join([ps.stem(word)for word in x.split() if word not in stop]))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((731,), (183,), (731,), (183,))

In [654]:
#### Normal Log Regression L2 Penalty
cvec = CountVectorizer(stop_words=stop, ngram_range=(1,2))
logreg = LogisticRegression(penalty= 'l2')

pipe = Pipeline(steps=[('vectorizer', cvec),
                       ('clf',logreg)])

model_t4m1 = pipe.fit(X_train,y_train)

get_scores(['high', 'mid', 'low'])

Classification Report: 

             precision    recall  f1-score   support

       high       0.61      0.43      0.51        46
        low       0.67      0.57      0.62        51
        mid       0.62      0.77      0.68        86

avg / total       0.63      0.63      0.62       183

Accuracy: 0.6284153005464481
Confusion Matrix: 

         pred_high  pred_low  pred_mid
is_high         20         2        24
is_low           5        29        17
is_mid           8        12        66


In [652]:
get_features(model_t4m1, ['high', 'mid', 'low'])

Number of features: 78075 

high: backend join, appli, deliveri, taxat middl, manag good, relat, manag requirementsabout, relat manag, relat year, technolog relat
low: entri, follow, taxat execut, engin bachelor, execut year, bachelor comput, junior, bachelor, data engin, execut
mid: demonstr, execut requirementsqualif, relat requirementsabout, scientist comput, data scientist, scientist, responsibilitiesntuc part, responsibilitiesntuc, taxat good, technolog year


This is the best-performing model so far. Let's see if Gradient Boosting can improve on it.

In [642]:
# Gradient boosting on TF-IDF features (did not beat logistic regression)
pipe = Pipeline([
    ('vectorizer', TfidfVectorizer(stop_words=stop, ngram_range=(1,2))),
    ('clf', GradientBoostingClassifier(random_state=42))
])

model_t4m2 = pipe.fit(X_train, y_train)

get_scores(['high', 'mid', 'low'])

Classification Report: 

             precision    recall  f1-score   support

       high       0.60      0.33      0.42        46
        low       0.58      0.51      0.54        51
        mid       0.56      0.73      0.63        86

avg / total       0.57      0.57      0.55       183

Accuracy: 0.5683060109289617
Confusion Matrix: 

         pred_high  pred_low  pred_mid
is_high         15         4        27
is_low           2        26        23
is_mid           8        15        63


### Conclusion

**1. Determine the industry factors that are most important in predicting the salary amounts for these data.**
 
From our last model, it seems that high salary calls for soft skills such as relationship building and management (stems 'relat' and 'manag', although 'relat' may also be the stem of 'related'). 
Mid and low salaries lean more on qualifications such as degrees and achievements. The features/words are messy; given more time, I'd like to experiment more with stemming and NLP in general. 

Job descriptions are similar from one posting to another, so most of the variance (and the more salient points) sits in requirements, industry, seniority and job title. Models that use these other features are therefore less likely to misclassify, which probably explains why our last model performed best. 

If this assumption is correct, we can test it by taking 'job description' out of the model and using the rest of the features.
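
The comparison described here is a feature ablation: fit the same pipeline on different column combinations and compare held-out accuracy. The sketch below shows the pattern on toy data; the helper `compare_feature_sets`, the toy postings and `max_iter=1000` are assumptions for illustration, not the notebook's exact setup.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline


def compare_feature_sets(df, feature_sets, target='salary_class', seed=42):
    """Fit one count-vector + logistic-regression pipeline per column
    combination and return the held-out accuracy of each."""
    scores = {}
    for name, columns in feature_sets.items():
        # Join the chosen text columns into one document per posting
        X = df[columns].fillna('').apply(lambda row: ' '.join(row), axis=1)
        X_train, X_test, y_train, y_test = train_test_split(
            X, df[target], test_size=0.2, random_state=seed)
        pipe = Pipeline([('vectorizer', CountVectorizer()),
                         ('clf', LogisticRegression(max_iter=1000))])
        pipe.fit(X_train, y_train)
        scores[name] = pipe.score(X_test, y_test)
    return pd.Series(scores).sort_values(ascending=False)


# Toy postings: titles separate the classes, descriptions are all identical
toy = pd.DataFrame({
    'job_title_clean': ['senior scientist', 'junior analyst'] * 10,
    'desc_clean': ['join our growing team'] * 20,
    'salary_class': ['high', 'low'] * 10,
})
results = compare_feature_sets(toy, {
    'title only': ['job_title_clean'],
    'description only': ['desc_clean'],
})
print(results)
```

Using the same `random_state` for every split keeps the test postings identical across feature sets, so the accuracies are directly comparable.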

In [834]:
y = all_jobs_test2.salary_class
X = all_jobs_test2.industry + ' '+ all_jobs_test2.job_title_clean +' '+ all_jobs_test2.seniority_clean +' '+all_jobs_test2.req_clean 
X = X.map(lambda x: ' '.join([ps.stem(word)for word in x.split() if word not in stop]))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((731,), (183,), (731,), (183,))

In [835]:
#### Normal Log Regression L2 Penalty
cvec = CountVectorizer(stop_words=stop, ngram_range=(1,2))
logreg = LogisticRegression(penalty= 'l2')

pipe = Pipeline(steps=[('vectorizer', cvec),
                       ('clf',logreg)])

model_t4m3 = pipe.fit(X_train,y_train)

get_scores(['high', 'mid', 'low'])

Classification Report: 

             precision    recall  f1-score   support

       high       0.67      0.52      0.59        46
        low       0.67      0.63      0.65        51
        mid       0.64      0.73      0.68        86

avg / total       0.65      0.65      0.65       183

Accuracy: 0.6502732240437158
Confusion Matrix: 

         pred_high  pred_low  pred_mid
is_high         24         1        21
is_low           4        32        15
is_mid           8        15        63


In [836]:
#Test with Requirements only
y = all_jobs_test2.salary_class
X = all_jobs_test2.req_clean 
X = X.map(lambda x: ' '.join([ps.stem(word)for word in x.split() if word not in stop]))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((731,), (183,), (731,), (183,))

In [838]:
cvec = CountVectorizer(stop_words=stop, ngram_range=(1,2))
logreg = LogisticRegression(penalty= 'l2')

pipe = Pipeline(steps=[('vectorizer', cvec),
                       ('clf',logreg)])

model_t4m4 = pipe.fit(X_train,y_train)

get_scores(['high', 'mid', 'low'])

Classification Report: 

             precision    recall  f1-score   support

       high       0.50      0.39      0.44        46
        low       0.57      0.53      0.55        51
        mid       0.55      0.64      0.59        86

avg / total       0.54      0.55      0.54       183

Accuracy: 0.546448087431694
Confusion Matrix: 

         pred_high  pred_low  pred_mid
is_high         18         2        26
is_low           5        27        19
is_mid          13        18        55


#### The above supports our hypothesis: salary class prediction is better without the JD feature.

Dropping the job description raised accuracy from 0.63 to 0.65, while requirements alone reached only 0.55, so industry, job title and seniority still carry much of the signal.

Reflection: Cleaning the data took the longest, as is probably expected of any scraping project. Unfortunately, the cleaning was still not thorough enough: company names such as NTUC kept popping up as predictor keywords during modeling.

---------------------------

### QUESTION 2: Factors that distinguish job category

Using the job postings you scraped for part 1 (or potentially new job postings from a second round of scraping), identify features in the data related to job postings that can distinguish job titles from each other. There are a variety of interesting ways you can frame the target variable, for example:
- What components of a job posting distinguish data scientists from other data jobs?
- What features are important for distinguishing junior vs. senior positions?
-----------

To tackle these questions, I will analyze subsets of the data:
- Q1 - data scientist, data engineer and data analyst.
- Q2 - Junior and Senior only

In [839]:
all_jobs_test2.head()

| | industry | job_title_clean | seniority_clean | salary_class | desc_clean | req_clean |
|---|---|---|---|---|---|---|
| 0 | Information Technology | Data Engineer | Executive | mid | roles responsibilities design build launch... | requirements experience and passion for data e... |
| 1 | Banking and Finance | Business Analyst | Professional | mid | roles responsibilitieswe are looking for a p... | requirementsmandatory skill set degree in ban... |
| 2 | Engineering | Data Scientist | Senior Executive | mid | roles responsibilities integrate all owned m... | requirements to years hands on experience ... |
| 3 | Engineering, Manufacturing | Data Scientist | Junior Executive | low | roles responsibilitiesresponsibilities ... | requirementsrequirements bachelor master o... |
| 4 | Others | Business Analyst | Executive | low | roles responsibilitiesto assist with promoti... | requirements degree in business administration... |


In [840]:
all_jobs_test2.job_title_clean.value_counts()

Business Analyst       177
other                  174
Data Engineer          173
Data Analyst           121
Data Scientist          84
Developer               49
IT Related              35
Marketing Analytics     31
Consulting              26
HR Related              16
Other Data Related      15
Product                 13
Name: job_title_clean, dtype: int64

In [888]:
de_jobs = all_jobs_test2[all_jobs_test2.job_title_clean == 'Data Engineer']
ds_jobs = all_jobs_test2[all_jobs_test2.job_title_clean == 'Data Scientist']
da_jobs = all_jobs_test2[all_jobs_test2.job_title_clean == 'Data Analyst']
ds_da_jobs = pd.concat([ds_jobs, da_jobs, de_jobs], ignore_index=True)

In [889]:
ds_da_jobs = ds_da_jobs.drop_duplicates()

In [948]:
ds_da_jobs.job_title_clean.value_counts()

Data Engineer     138
Data Analyst      120
Data Scientist     80
Name: job_title_clean, dtype: int64

In [843]:
ds_da_jobs.head()

| | industry | job_title_clean | seniority_clean | salary_class | desc_clean | req_clean |
|---|---|---|---|---|---|---|
| 0 | Engineering | Data Scientist | Senior Executive | mid | roles responsibilities integrate all owned m... | requirements to years hands on experience ... |
| 1 | Engineering, Manufacturing | Data Scientist | Junior Executive | low | roles responsibilitiesresponsibilities ... | requirementsrequirements bachelor master o... |
| 2 | Sciences / Laboratory / R&D | Data Scientist | Professional | mid | roles responsibilitiesabout the institute fo... | requirements ph d in the field of computer sc... |
| 3 | Information Technology | Data Scientist | Middle Management | mid | roles responsibilities m s or ph d in r... | requirementsbasic qualifications m s or ph ... |
| 4 | Engineering | Data Scientist | Entry Level | low | roles responsibilitiesdo you have a passion ... | requirementsyou have at least a bachelor s d... |


In [890]:
#Just use JD and Requirements
y = ds_da_jobs.job_title_clean
X = ds_da_jobs.desc_clean +' '+ds_da_jobs.req_clean 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((270,), (68,), (270,), (68,))

In [891]:
#### Normal Log Regression L2 Penalty
cvec = CountVectorizer(stop_words=stop, ngram_range=(1,2))
logreg = LogisticRegression(penalty= 'l2')

pipe = Pipeline(steps=[('vectorizer', cvec),
                       ('clf',logreg)])

model_da_ds = pipe.fit(X_train,y_train)

In [892]:
get_scores(['Data Analyst', 'Data Scientist', 'Data Engineer'])

Classification Report: 

                precision    recall  f1-score   support

  Data Analyst       0.81      0.89      0.85        28
 Data Engineer       0.95      0.82      0.88        22
Data Scientist       0.78      0.78      0.78        18

   avg / total       0.84      0.84      0.84        68

Accuracy: 0.8382352941176471
Confusion Matrix: 

                   pred_Data Analyst  pred_Data Engineer  pred_Data Scientist
is_Data Analyst                   25                   1                    2
is_Data Engineer                   2                  18                    2
is_Data Scientist                  4                   0                   14


In [825]:
get_features(model_da_ds, ['Data Analyst', 'Data Engineer', 'Data Scientist'])

Number of features: 40334 

Data Analyst: analysis, data analytics, reporting, sales, financial, analyst, management, analytics, analytic, business
Data Engineer: design, master, technical, engineering, bachelor engineering, engineer, software, bachelor computer, bachelor, requirements bachelor
Data Scientist: model, data science, statistics, models, research, scientist, techniques, statistical, requirements engineering, requirements computer


### Insights
The results show that the skills required of a Data Analyst differ from those of a Data Engineer or Data Scientist:
* Analytics, reporting, and a financial or business background are useful for Data Analysts.
* Software engineering, technical chops, and at least a bachelor's degree in computer science are useful for Data Engineers.
* Modeling, statistical knowledge, and research are essential for Data Scientists.
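
As a rough illustration of where such per-class terms can come from, the sketch below reads the largest coefficients of a fitted `CountVectorizer` + `LogisticRegression` pipeline. The helper name `top_features_per_class` and the toy documents are hypothetical, and this is not necessarily how `get_features` is implemented.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


def top_features_per_class(pipe, n=10):
    """Return {class_label: [n terms with the largest coefficients]}."""
    vec = pipe.named_steps['vectorizer']
    clf = pipe.named_steps['clf']
    # get_feature_names_out exists in scikit-learn >= 1.0; older releases use get_feature_names
    if hasattr(vec, 'get_feature_names_out'):
        terms = np.asarray(vec.get_feature_names_out())
    else:
        terms = np.asarray(vec.get_feature_names())
    coefs = clf.coef_
    if coefs.shape[0] == 1:
        # Binary case: a single coefficient row that describes classes_[1]
        coefs = np.vstack([-coefs[0], coefs[0]])
    top = {}
    for label, row in zip(clf.classes_, coefs):
        idx = np.argsort(row)[::-1][:n]  # indices of the n largest coefficients
        top[label] = terms[idx].tolist()
    return top


# Toy documents (made up for illustration, not taken from the scraped data)
toy_docs = [
    'reporting dashboards sales analysis',
    'business reporting analysis excel',
    'pipelines software engineering spark',
    'software engineering etl pipelines',
    'statistics machine learning models',
    'statistical models research machine learning',
]
toy_labels = ['Data Analyst', 'Data Analyst',
              'Data Engineer', 'Data Engineer',
              'Data Scientist', 'Data Scientist']

toy_pipe = Pipeline(steps=[('vectorizer', CountVectorizer()),
                           ('clf', LogisticRegression())])
toy_pipe.fit(toy_docs, toy_labels)

toy_top = top_features_per_class(toy_pipe, n=3)
for label, terms in toy_top.items():
    print(label, terms)
```

Reading the labels from `clf.classes_` keeps each list of terms attached to the class the coefficients actually belong to.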

The scores for this question are noticeably higher than for our first question (accuracy of about 0.84 here). I can think of at least two reasons:
- The subset is smaller and limited to three job titles, which may lower the risk of misclassification.
- The requirements for each job are specific. 

I'd also like to test the performance of this subset against our initial hypothesis: performance is better without job description. 

In [893]:
#Test with job requirements only -- if our hypothesis is true, this should perform better than the above.
y = ds_da_jobs.job_title_clean
X = ds_da_jobs.req_clean

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((270,), (68,), (270,), (68,))

In [894]:
#### Normal Log Regression L2 Penalty
cvec = CountVectorizer(stop_words=stop, ngram_range=(1,2))
logreg = LogisticRegression(penalty= 'l2')

pipe = Pipeline(steps=[('vectorizer', cvec),
                       ('clf',logreg)])

model_da_ds = pipe.fit(X_train,y_train)

In [895]:
get_scores(['Data Analyst', 'Data Engineer', 'Data Scientist'])

Classification Report: 

                precision    recall  f1-score   support

  Data Analyst       0.81      0.89      0.85        28
 Data Engineer       0.95      0.86      0.90        22
Data Scientist       0.82      0.78      0.80        18

   avg / total       0.86      0.85      0.85        68

Accuracy: 0.8529411764705882
Confusion Matrix: 

                   pred_Data Analyst  pred_Data Engineer  pred_Data Scientist
is_Data Analyst                   25                   1                    2
is_Data Engineer                   2                  19                    1
is_Data Scientist                  4                   0                   14


Our hypothesis holds on this subset, though only narrowly: accuracy rises from 0.838 to 0.853, which is one more correct prediction out of 68.

This suggests that the points that best distinguish job functions are contained in Requirements. 

Unlike Q1, where performance was worse using Requirements alone, performance here is better, possibly because the requirements are distinct for each job function and the sample is smaller.
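
Because a single 80/20 split on 338 postings leaves only 68 test rows, one way to check whether a small gap between feature sets is stable is to compare them with cross-validation. The sketch below assumes the same vectorizer and classifier setup as above, minus the `stop` word list. The helper `compare_feature_sets` and the toy frame are hypothetical and only illustrate the mechanics.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline


def compare_feature_sets(feature_sets, y, n_splits=5, seed=42):
    """Cross-validate the same pipeline on each text feature set."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    rows = []
    for name, X in feature_sets.items():
        pipe = Pipeline(steps=[
            ('vectorizer', CountVectorizer(ngram_range=(1, 2))),
            ('clf', LogisticRegression(max_iter=1000)),
        ])
        scores = cross_val_score(pipe, X, y, cv=cv, scoring='accuracy')
        rows.append({'features': name,
                     'mean_accuracy': scores.mean(),
                     'std': scores.std()})
    return pd.DataFrame(rows)


# Toy frame (made up): identical descriptions, title-specific requirements
titles = ['Data Analyst', 'Data Engineer', 'Data Scientist'] * 4
req_by_title = {
    'Data Analyst': 'excel reporting dashboards',
    'Data Engineer': 'spark pipelines software',
    'Data Scientist': 'statistics modelling research',
}
toy = pd.DataFrame({
    'job_title_clean': titles,
    'desc_clean': ['join a growing team'] * len(titles),
    'req_clean': [req_by_title[t] for t in titles],
})

results = compare_feature_sets(
    {'desc + req': toy.desc_clean + ' ' + toy.req_clean,
     'req only': toy.req_clean},
    toy.job_title_clean,
    n_splits=2,
)
print(results)
```

On the real data, the same call with `ds_da_jobs.desc_clean`, `ds_da_jobs.req_clean` and `ds_da_jobs.job_title_clean` would give a mean and spread per feature set instead of a single accuracy.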

Next, I want to test whether requirements are indeed the most important feature for predicting the distinct job functions. If so, performance should be much worse when we test with all the other features.

In [896]:
# Target is salary_class; features combine industry, title, seniority and requirements (no job description)
y = ds_da_jobs.salary_class
X = ds_da_jobs.industry + ' ' + ds_da_jobs.job_title_clean + ' ' + ds_da_jobs.seniority_clean + ' ' + ds_da_jobs.req_clean
# Stem each word and drop stop words
X = X.map(lambda x: ' '.join([ps.stem(word) for word in x.split() if word not in stop]))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((270,), (68,), (270,), (68,))

In [899]:
#### Normal Log Regression L2 Penalty
cvec = CountVectorizer(stop_words=stop, ngram_range=(1,2))
logreg = LogisticRegression(penalty= 'l2')

pipe = Pipeline(steps=[('vectorizer', cvec),
                       ('clf',logreg)])

model_da_ds2 = pipe.fit(X_train,y_train)

In [900]:
get_scores(['high', 'low', 'mid'])

Classification Report: 

             precision    recall  f1-score   support

       high       0.88      0.37      0.52        19
        low       0.67      0.50      0.57        20
        mid       0.51      0.79      0.62        29

avg / total       0.66      0.59      0.58        68

Accuracy: 0.5882352941176471
Confusion Matrix: 

         pred_high  pred_low  pred_mid
is_high          7         0        12
is_low           0        10        10
is_mid           1         5        23


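Performance is much worse (accuracy 0.59). Note, however, that this cell predicts `salary_class` (high, low, mid) rather than job function, and its features still include requirements, so it does not directly confirm the assumption about job functions.
* High precision and low recall for 'high' means our model is quite picky about predicting a high salary. 
* High recall and low precision for 'mid' means our model classifies most postings as mid. 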
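
scikit-learn orders classes in sorted label order in both `classification_report` and `confusion_matrix`, so hand-written row and column names have to follow that order to line up with the counts. A minimal sketch of a labelling helper that takes the names from the data instead is shown below. The name `labelled_confusion` and the toy labels are hypothetical, and in the notebook the `classes` argument could be `pipe.classes_`.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix


def labelled_confusion(y_true, y_pred, classes):
    """Confusion matrix as a DataFrame: rows are actual, columns are predicted."""
    classes = list(classes)
    cm = confusion_matrix(y_true, y_pred, labels=classes)
    return pd.DataFrame(cm,
                        index=['is_' + str(c) for c in classes],
                        columns=['pred_' + str(c) for c in classes])


# Toy labels (made up for illustration)
toy_true = ['high', 'low', 'mid', 'mid', 'high', 'low']
toy_pred = ['mid', 'low', 'mid', 'mid', 'high', 'mid']

toy_cm = labelled_confusion(toy_true, toy_pred, sorted(set(toy_true)))
print(toy_cm)
```

Passing the same list to `labels=` and to the row and column names keeps the counts and their names in step, whatever order the list is in.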

Let's test if this is also true for Junior vs Senior positions

### Junior vs Senior Position

In [847]:
all_jobs_test2.seniority_clean.value_counts()

Professional         335
Executive            188
other                111
Senior Executive     105
Middle Management     71
Senior Management     52
Junior Executive      38
Entry Level           14
Name: seniority_clean, dtype: int64

In [877]:
def junior_or_senior(seniority):
    s = seniority.lower()
    # Check 'senior'/'manage' first so 'Senior Executive' is not caught by 'executive'
    if 'senior' in s or 'manage' in s:
        return 'Senior'
    if 'junior' in s or 'executive' in s or 'entry' in s:
        return 'Junior'
    return 'Other'
all_jobs_test2['jrvssr'] = all_jobs_test2.seniority_clean.map(junior_or_senior)
jr_jobs = all_jobs_test2[all_jobs_test2.jrvssr == 'Junior']
sr_jobs = all_jobs_test2[all_jobs_test2.jrvssr == 'Senior']
sr_jr_jobs = pd.concat([jr_jobs, sr_jobs], ignore_index=True)

In [878]:
##Nice balance between Senior and Junior but too many for others. 
sr_jr_jobs.jrvssr.value_counts()

Junior    240
Senior    228
Name: jrvssr, dtype: int64

In [936]:
#Just use JD 
y = sr_jr_jobs.jrvssr
X = sr_jr_jobs.desc_clean 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((374,), (94,), (374,), (94,))

In [937]:
#### Normal Log Regression L2 Penalty
cvec = CountVectorizer(stop_words=stop, ngram_range=(1,2))
logreg = LogisticRegression(penalty= 'l2')

pipe = Pipeline(steps=[('vectorizer', cvec),
                       ('clf',logreg)])

model_sr_jr = pipe.fit(X_train,y_train)

In [938]:
# Predict on the held-out set; rows are actual classes, columns are predicted classes
model_sr_jr_pred = pipe.predict(X_test)

confusion = pd.DataFrame(confusion_matrix(y_test, model_sr_jr_pred),
                         index=['is_junior', 'is_senior'],
                         columns=['pred_junior', 'pred_senior'])

print('Classification Report: \n')
print(metrics.classification_report(y_test, model_sr_jr_pred))
print('='*60)
print('Accuracy: {}'.format(metrics.accuracy_score(y_test,model_sr_jr_pred)))
print('='*60)
print('Confusion Matrix: \n')
print(confusion)

Classification Report: 

             precision    recall  f1-score   support

     Junior       0.69      0.70      0.69        47
     Senior       0.70      0.68      0.69        47

avg / total       0.69      0.69      0.69        94

Accuracy: 0.6914893617021277
Confusion Matrix: 

           pred_junior  pred_senior
is_junior           33           14
is_senior           15           32


In [939]:
get_features(model_sr_jr, classes=['Junior', 'Senior'])

Number of features: 42207 

Top 10 features found in Senior:
scotts, existing, road, learning, business, roles responsibilitieslocation, responsibilitieslocation, closely, architecture, business analyst


In [926]:
## JD alone yields reasonable scores. See if we can improve the performance using requirements alone
y = sr_jr_jobs.jrvssr
X = sr_jr_jobs.req_clean

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((374,), (94,), (374,), (94,))

In [928]:
#### Normal Log Regression L2 Penalty
cvec = CountVectorizer(stop_words=stop, ngram_range=(1,2))
logreg = LogisticRegression(penalty= 'l2')

pipe = Pipeline(steps=[('vectorizer', cvec),
                       ('clf',logreg)])

model_sr_jr2 = pipe.fit(X_train,y_train)

In [929]:
# Predict on the held-out set; rows are actual classes, columns are predicted classes
model_sr_jr_pred = pipe.predict(X_test)

confusion = pd.DataFrame(confusion_matrix(y_test, model_sr_jr_pred),
                         index=['is_junior', 'is_senior'],
                         columns=['pred_junior', 'pred_senior'])

print('Classification Report: \n')
print(metrics.classification_report(y_test, model_sr_jr_pred))
print('='*60)
print('Accuracy: {}'.format(metrics.accuracy_score(y_test,model_sr_jr_pred)))
print('='*60)
print('Confusion Matrix: \n')
print(confusion)

Classification Report: 

             precision    recall  f1-score   support

     Junior       0.62      0.72      0.67        47
     Senior       0.67      0.55      0.60        47

avg / total       0.64      0.64      0.64        94

Accuracy: 0.6382978723404256
Confusion Matrix: 

           pred_junior  pred_senior
is_junior           34           13
is_senior           21           26


In [931]:
get_features(model_sr_jr2, classes=['Junior', 'Senior'])

Number of features: 25644 

Top 10 features found in Senior:
completed bachelor, requirementsqualifications, able, understanding, business, product, solutions, management, requirementsqualifications years, completed


In [949]:
# Nope! For Senior vs Junior, the distinguishing factors may not lie in job requirements.
# To test whether our hypothesis still holds, let's see if we can improve the
# scores using all other features except job description.

y = sr_jr_jobs.jrvssr
X = sr_jr_jobs.industry + ' ' + sr_jr_jobs.job_title_clean + ' ' + sr_jr_jobs.req_clean
# Stem each word and drop stop words
X = X.map(lambda x: ' '.join([ps.stem(word) for word in x.split() if word not in stop]))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((374,), (94,), (374,), (94,))

In [950]:
#### Normal Log Regression L2 Penalty
cvec = CountVectorizer(stop_words=stop, ngram_range=(1,2))
logreg = LogisticRegression(penalty= 'l2')

pipe = Pipeline(steps=[('vectorizer', cvec),
                       ('clf',logreg)])

model_sr_jr3 = pipe.fit(X_train,y_train)

In [951]:
# Predict on the held-out set; rows are actual classes, columns are predicted classes
model_sr_jr_pred = pipe.predict(X_test)

confusion = pd.DataFrame(confusion_matrix(y_test, model_sr_jr_pred),
                         index=['is_junior', 'is_senior'],
                         columns=['pred_junior', 'pred_senior'])

print('Classification Report: \n')
print(metrics.classification_report(y_test, model_sr_jr_pred))
print('='*60)
print('Accuracy: {}'.format(metrics.accuracy_score(y_test,model_sr_jr_pred)))
print('='*60)
print('Confusion Matrix: \n')
print(confusion)

Classification Report: 

             precision    recall  f1-score   support

     Junior       0.65      0.70      0.67        47
     Senior       0.67      0.62      0.64        47

avg / total       0.66      0.66      0.66        94

Accuracy: 0.6595744680851063
Confusion Matrix: 

           pred_junior  pred_senior
is_junior           33           14
is_senior           18           29


In [952]:
get_features(model_sr_jr3, classes=['Junior', 'Senior'])

Number of features: 23017 

Top 10 features found in Senior:
requirementsqualif, tool, abl, complet, requirementsqualif year, technolog consult, understand, comput, solut, manag


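Nope! It performed worse than job description alone (accuracy 0.66 vs. 0.69). Job description therefore appears to contain the most useful signal for seniority, although the gap is only three test postings out of 94.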
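
To put the three Junior vs Senior accuracies in context, one option is to compare them with a majority-class baseline. The sketch below rebuilds the label counts from the outputs above (240 Junior and 228 Senior overall, 47 of each in the test split) and uses scikit-learn's `DummyClassifier`. The variable names are hypothetical and the feature column is only a placeholder, since the baseline ignores features.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Label counts from the outputs above: 240 Junior / 228 Senior overall,
# 47 of each in the test split, which leaves 193 / 181 for training.
y_train_counts = ['Junior'] * 193 + ['Senior'] * 181
y_test_counts = ['Junior'] * 47 + ['Senior'] * 47

# DummyClassifier ignores the features, so a placeholder column is enough.
X_train_dummy = [[0]] * len(y_train_counts)
X_test_dummy = [[0]] * len(y_test_counts)

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train_dummy, y_train_counts)
baseline_acc = accuracy_score(y_test_counts, baseline.predict(X_test_dummy))

# Accuracies reported in the cells above (rounded)
reported = {
    'job description only': 0.6915,
    'requirements only': 0.6383,
    'industry + title + requirements': 0.6596,
}
for name, acc in reported.items():
    print('{}: {:.3f} ({:+.3f} vs. baseline {:.3f})'.format(
        name, acc, acc - baseline_acc, baseline_acc))
```

With a balanced test split, the baseline gives a fixed reference point for judging how much each feature set adds.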