# Assignment - Text Mining - McCartney

In this assignment, we will focus on text mining. The data set for this exercise includes information on job descriptions and salaries. Use this data set to see if you can predict the salary of a job posting (i.e., the `Salary` column in the data set) based on the job description. This is important, because this model can make a salary recommendation as soon as a job description is entered into a system.

## Description of Variables

Use the **jobs.csv** file as the data set. 

There are only two columns:<br>
**Salary:** The salary of that specific job<br>
**Job Description:** The description of the job ad<br>

## Goal

Use the **jobs.csv** data set and build a model to predict **Salary**. <br>

**Build at least two models.**

### BE CAREFUL: THIS IS A REGRESSION TASK. USE REGRESSORS ONLY

## Submission:

Please save and submit this Jupyter notebook file. The correctness of the code matters for your grade. **Readability and organization of your code are also important.** You may lose points for submitting unreadable/undecipherable code. Therefore, use markdown cells to create sections, and use comments where necessary.


# Read and Prepare the Data

In [1]:
import pandas as pd
import numpy as np

In [2]:
jobs = pd.read_csv('jobs.csv')

In [3]:
jobs.head(5)

Unnamed: 0,Salary,Job Description
0,67206,Civil Service Title: Regional Director Mental ...
1,88313,The New York City Comptroller’s Office Burea...
2,81315,With minimal supervision from the Deputy Commi...
3,76426,OPEN TO CURRENT BUSINESS PROMOTION COORDINATOR...
4,55675,Only candidates who are permanent in the Princ...


In [6]:
target = jobs['Salary']

In [9]:
# Check for missing values

jobs.isna().sum()

Salary             0
Job Description    0
dtype: int64

In [10]:
input_data = jobs['Job Description']

In [11]:
# Split the Data

from sklearn.model_selection import train_test_split

train_set, test_set, train_y, test_y = train_test_split(input_data, target, test_size=0.3, random_state=42)

In [18]:
train_set.shape, train_y.shape

((1689,), (1689,))

In [19]:
test_set.shape, test_y.shape

((724,), (724,))

# Sklearn: Text preparation

In [20]:
#TfidfVectorizer includes pre-processing, tokenization, filtering stop words
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vect = TfidfVectorizer(stop_words='english')

train_x_tr = tfidf_vect.fit_transform(train_set)

In [21]:
# Perform the TfidfVectorizer transformation
# Be careful: We are using the train fit to transform the test data set. Otherwise, the test data 
# features will be very different and will not match the train set!!!

test_x_tr = tfidf_vect.transform(test_set)
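
The fit-on-train, transform-on-test rule can be checked on a toy corpus. The sketch below is only an illustration: the sentences and the `toy_` names are made up and are not taken from jobs.csv. It shows that the test matrix gets the same columns as the train matrix, and that a word seen only in the test text gets no column at all.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, not taken from jobs.csv
toy_train = ["senior engineer manages water projects",
             "junior analyst reviews budget data"]
toy_test = ["senior analyst manages blockchain budget"]

toy_vect = TfidfVectorizer(stop_words='english')

# fit_transform learns the vocabulary and IDF weights from the train texts only
toy_train_x = toy_vect.fit_transform(toy_train)

# transform reuses that vocabulary, so the test matrix has the same columns
toy_test_x = toy_vect.transform(toy_test)

print(toy_train_x.shape, toy_test_x.shape)

# A word that only appears in the test text has no column
print('blockchain' in toy_vect.vocabulary_)
```

Calling `fit_transform` on the test text instead would build a second, different vocabulary, and a model trained on the first set of columns could not use it.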

In [22]:
train_x_tr.shape

(1689, 9914)

In [23]:
test_x_tr.shape

(724, 9914)

In [24]:
# The transformed data is stored as a sparse matrix, so only a summary is displayed
train_x_tr

<1689x9914 sparse matrix of type '<class 'numpy.float64'>'
	with 250443 stored elements in Compressed Sparse Row format>

In [25]:
# Convert the sparse matrix to a dense array with toarray() to see the values
train_x_tr.toarray()

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.02913336, 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])
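
The dense array is almost all zeros: 250,443 stored elements out of 1689 x 9914 cells is about 1.5%. A sparse matrix can also be inspected without `toarray()`. The sketch below uses a small made-up matrix (`toy_sparse` is a hypothetical stand-in for `train_x_tr`) and reads the non-zero columns and values of one row directly.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3x5 matrix standing in for train_x_tr
dense = np.array([[0.0, 0.7, 0.0, 0.0, 0.3],
                  [0.0, 0.0, 0.0, 0.0, 0.0],
                  [0.5, 0.0, 0.0, 0.2, 0.0]])
toy_sparse = csr_matrix(dense)

# Fraction of cells that are actually stored (non-zero)
density = toy_sparse.nnz / (toy_sparse.shape[0] * toy_sparse.shape[1])

# Column indices and values of the non-zero entries in row 0
row0 = toy_sparse[0]
row0_cols = row0.indices.tolist()
row0_vals = row0.data.tolist()

print(density, row0_cols, row0_vals)
```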

In [26]:
tfidf_vect.vocabulary_ # the number is the term's column index in the matrix, not a count

{'candidates': 1507,
 'permanent': 6504,
 'computer': 2036,
 'systems': 8779,
 'manager': 5450,
 'title': 8994,
 'provide': 7045,
 'proof': 6999,
 'successful': 8652,
 'registration': 7408,
 'october': 6080,
 '2018': 112,
 'open': 6155,
 'competitive': 1977,
 'promotional': 6991,
 'exam': 3569,
 'apply': 816,
 'failure': 3757,
 'result': 7678,
 'disqualification': 2944,
 'department': 2691,
 'design': 2734,
 'construction': 2165,
 'division': 2984,
 'public': 7080,
 'buildings': 1420,
 'seeks': 8020,
 'director': 2858,
 'data': 2518,
 'analytics': 749,
 'team': 8859,
 'responsible': 7661,
 'providing': 7051,
 'descriptive': 2733,
 'diagnostic': 2814,
 'predictive': 6799,
 'insights': 4766,
 'based': 1143,
 'agency': 633,
 'external': 3711,
 'sources': 8321,
 'dataset': 2524,
 'includes': 4631,
 'basic': 1148,
 'project': 6977,
 'management': 5448,
 'schedule': 7923,
 'budget': 1409,
 'internal': 4862,
 'information': 4713,
 'sensors': 8051,
 'including': 4632,
 'uncensored': 9246,
 'le

# Latent Semantic Analysis (LSA) [Singular Value Decomposition (SVD)]

In [92]:
from sklearn.decomposition import TruncatedSVD

#If you are performing Latent Semantic Analysis, recommended number of components is 100
# Adjusted to 1000 for better results; increasing beyond 1000 didn't improve results

svd = TruncatedSVD(n_components=1000, n_iter=10)

In [93]:
train_x_lsa = svd.fit_transform(train_x_tr)

In [94]:
train_x_lsa.shape

(1689, 1000)

In [95]:
train_x_lsa

array([[ 2.47205981e-01, -2.03271254e-01,  1.14899112e-01, ...,
         8.77539948e-03, -6.20937095e-03, -4.93151395e-03],
       [ 1.72948207e-01, -1.30748862e-01,  4.77585353e-03, ...,
         1.18678512e-02, -6.86054018e-03, -1.08827150e-02],
       [ 5.87776099e-01,  3.65734004e-01,  1.34432287e-01, ...,
         3.36750999e-03,  9.23260261e-04,  1.65015880e-03],
       ...,
       [ 1.33857761e-01, -1.06052106e-01, -4.01844658e-02, ...,
        -1.66707219e-03, -1.64577436e-03, -8.58008575e-04],
       [ 1.53406212e-01, -1.23536443e-01, -4.28305610e-02, ...,
        -1.54601262e-03,  2.38994933e-03,  1.29440605e-03],
       [ 2.17227521e-01, -5.64243368e-02,  3.15270243e-01, ...,
         5.81602771e-04, -5.74854318e-03, -7.28616891e-03]])

In [96]:
# Transform Test Data

test_x_lsa = svd.transform(test_x_tr)

In [97]:
test_x_lsa.shape

(724, 1000)
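
What the SVD step does can be illustrated with plain NumPy on a small matrix. This is a sketch of the underlying linear algebra only, not scikit-learn's implementation (`TruncatedSVD` uses a randomized solver by default); the matrix values and the `toy_` names are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_x = rng.random((6, 4))   # hypothetical document-term matrix
k = 2                        # number of components to keep

# Full SVD: toy_x = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(toy_x, full_matrices=False)

components = Vt[:k]               # analogous to svd.components_
toy_lsa = toy_x @ components.T    # analogous to svd.transform(toy_x)

# Projecting onto the top-k right singular vectors equals U_k scaled by S_k
print(np.allclose(toy_lsa, U[:, :k] * S[:k]))

# New (test) rows are projected with the same components, without refitting
toy_new = rng.random((2, 4))
toy_new_lsa = toy_new @ components.T
print(toy_lsa.shape, toy_new_lsa.shape)
```

This is why the test data only needs `svd.transform`: the components are learned once from the train matrix and then reused.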

# Explore the SVDs

In [98]:
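# Total variance captured by the 1000 components.
# For the fraction of total variance, sum explained_variance_ratio_ instead.
svd.explained_variance_.sum()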

0.9313244137466226

In [99]:
#These are all the components:
svd.components_

array([[ 6.88496587e-04,  8.57475540e-02,  9.35138953e-05, ...,
         2.58331510e-04,  2.64894838e-04,  4.74548918e-04],
       [-4.24457233e-04,  9.21278086e-02, -8.70152800e-05, ...,
        -5.37307789e-04,  4.90695772e-04, -9.88739389e-04],
       [-5.92600177e-04, -6.23455368e-02, -1.68140617e-04, ...,
        -3.77315274e-04, -3.92896112e-04, -5.15104924e-04],
       ...,
       [ 1.16194256e-02, -1.38046777e-02,  2.76821981e-03, ...,
         3.46936044e-04, -5.17929150e-03,  4.05395168e-04],
       [ 4.43129704e-03,  5.28992517e-03, -1.42810634e-03, ...,
         5.38024769e-04, -4.25462948e-03,  3.95263596e-04],
       [ 1.10663449e-02, -5.53266813e-03, -5.67943632e-03, ...,
         2.32397619e-03,  1.07069294e-03,  4.53821115e-03]])

In [77]:
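# Note: the output below is stale; it comes from an earlier run with n_components=500.
# With n_components=1000 the shape is (n_components, n_features) = (1000, 9914).
svd.components_.shape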

(500, 9914)

In [100]:
#Let's select the first component:

first_component = svd.components_[0,:]

In [101]:
# Sort the weights in the first component, and get the indices

indeces = np.argsort(first_component).tolist()

In [102]:
#Get the feature names from the TF-IDF vectorizer
#(scikit-learn 1.2 and later removed get_feature_names(); use get_feature_names_out() there)
feat_names = tfidf_vect.get_feature_names()

In [103]:
#Print the last 10 terms (i.e., the 10 terms that have the highest weights)
#Be careful, argsort returns indices in ascending order (least important first)

for index in indeces[-10:]:
    print(feat_names[index], "\t\tweight =", first_component[index])

bureau 		weight = 0.10717931879805587
management 		weight = 0.10960765437645162
new 		weight = 0.1223038615911278
design 		weight = 0.12986195254896968
city 		weight = 0.13608521123029874
project 		weight = 0.13846106935768077
dep 		weight = 0.14965285958787955
construction 		weight = 0.15479839429867356
wastewater 		weight = 0.15841529884565192
water 		weight = 0.26425044968095257
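
To list the top terms with the largest weight first, the tail of the `argsort` result can be reversed. A minimal sketch with made-up weights and term names (the `toy_` arrays are hypothetical, not the real component):

```python
import numpy as np

# Hypothetical component weights and feature names
toy_weights = np.array([0.05, 0.26, -0.10, 0.15, 0.01])
toy_names = np.array(['city', 'water', 'exam', 'construction', 'apply'])

# argsort is ascending, so reverse the last n indices to list the largest first
n = 3
top_idx = np.argsort(toy_weights)[-n:][::-1]

top_terms = [(str(toy_names[i]), float(toy_weights[i])) for i in top_idx]
print(top_terms)
```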


# Calculate the baseline

In [139]:
from sklearn.metrics import mean_squared_error

#First find the average value of the target

mean_value = np.mean(train_y)

mean_value

78566.0307874482

In [137]:
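# The baseline predicts the train mean for every row of the test set
baseline_pred = np.full(len(test_y), mean_value)

baseline_mse = mean_squared_error(test_y, baseline_pred)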

baseline_rmse = np.sqrt(baseline_mse)

print('Baseline RMSE: {}' .format(baseline_rmse))

Baseline RMSE: 28261.69063660107
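
The baseline calculation can be written as a small helper and checked by hand. This is a sketch with a hypothetical function name and made-up salaries, using NumPy only:

```python
import numpy as np

def baseline_rmse_from_mean(train_targets, test_targets):
    """RMSE of predicting the train mean for every test row."""
    train_targets = np.asarray(train_targets, dtype=float)
    test_targets = np.asarray(test_targets, dtype=float)
    constant_pred = np.full(test_targets.shape, train_targets.mean())
    return float(np.sqrt(np.mean((test_targets - constant_pred) ** 2)))

# Hypothetical salaries
toy_train_y = [50000, 70000, 90000]   # mean is 70000
toy_test_y = [60000, 80000]           # each is 10000 away from that mean

toy_baseline = baseline_rmse_from_mean(toy_train_y, toy_test_y)
print(toy_baseline)
```

Both test salaries miss the train mean by 10,000, so the RMSE is 10,000. A model is only useful if its test RMSE is below this kind of constant prediction.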


# Model 1 - Decision Tree

In [144]:
from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor(min_samples_leaf = 10, max_depth=5) # remove parameter to have un-restricted learning

tree_reg.fit(train_x_lsa, train_y)

DecisionTreeRegressor(max_depth=5, min_samples_leaf=10)

In [145]:
from sklearn.metrics import mean_squared_error

#Train RMSE
train_pred = tree_reg.predict(train_x_lsa)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 21978.54348483159


In [146]:
#Test RMSE
test_pred = tree_reg.predict(test_x_lsa)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 24919.31128806112


# Model 2 - Decision Tree with Randomized Search

In [148]:
from sklearn.model_selection import RandomizedSearchCV

param_grid = [
    {'min_samples_leaf': np.arange(10, 30), 
     'max_depth': np.arange(10,30)}
  ]

tree_reg = DecisionTreeRegressor()

grid_search = RandomizedSearchCV(tree_reg, param_grid, cv=5, n_iter=10,
                           scoring='neg_mean_squared_error', verbose=1,
                           return_train_score=True)

grid_search.fit(train_x_lsa, train_y)

Fitting 5 folds for each of 10 candidates, totalling 50 fits


RandomizedSearchCV(cv=5, estimator=DecisionTreeRegressor(),
                   param_distributions=[{'max_depth': array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
       27, 28, 29]),
                                         'min_samples_leaf': array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
       27, 28, 29])}],
                   return_train_score=True, scoring='neg_mean_squared_error',
                   verbose=1)

In [149]:
cvres = grid_search.cv_results_

for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

28510.71525015047 {'min_samples_leaf': 10, 'max_depth': 21}
27585.61621297857 {'min_samples_leaf': 23, 'max_depth': 14}
27923.662740091222 {'min_samples_leaf': 12, 'max_depth': 25}
27689.447775489632 {'min_samples_leaf': 26, 'max_depth': 29}
27170.681177144634 {'min_samples_leaf': 22, 'max_depth': 25}
27695.075112332197 {'min_samples_leaf': 27, 'max_depth': 14}
27817.282241032608 {'min_samples_leaf': 21, 'max_depth': 28}
28630.041261079463 {'min_samples_leaf': 11, 'max_depth': 13}
27170.681177144634 {'min_samples_leaf': 22, 'max_depth': 14}
27425.040889138832 {'min_samples_leaf': 24, 'max_depth': 12}
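
The scores are negative MSE values, so the RMSE is `sqrt(-score)` and lower is better. The candidates are easier to compare when sorted. The sketch below uses a hand-built dictionary (`toy_cvres`, a hypothetical stand-in for `grid_search.cv_results_`) holding three of the candidates printed above.

```python
import numpy as np

# Hand-built stand-in for grid_search.cv_results_ with three candidates
# from the output above (scores are negative MSE, so RMSE = sqrt(-score))
toy_cvres = {
    "mean_test_score": -(np.array([28510.71525015047,
                                   27585.61621297857,
                                   27170.681177144634]) ** 2),
    "params": [{'min_samples_leaf': 10, 'max_depth': 21},
               {'min_samples_leaf': 23, 'max_depth': 14},
               {'min_samples_leaf': 22, 'max_depth': 25}],
}

toy_rmse = np.sqrt(-toy_cvres["mean_test_score"])

# Sort candidates from lowest (best) to highest cross-validated RMSE
order = np.argsort(toy_rmse)
ranked = [(round(float(toy_rmse[i]), 2), toy_cvres["params"][i]) for i in order]

for rmse_value, params in ranked:
    print(rmse_value, params)
```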


In [150]:
grid_search.best_params_

{'min_samples_leaf': 22, 'max_depth': 25}

In [151]:
grid_search.best_estimator_

DecisionTreeRegressor(max_depth=25, min_samples_leaf=22)

In [153]:
#Train RMSE
train_pred = grid_search.best_estimator_.predict(train_x_lsa)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 18873.672345515817


In [154]:
#Test RMSE
test_pred = grid_search.best_estimator_.predict(test_x_lsa)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 25462.945769219492


# Discussion

Briefly answer the following questions: (2 points) 
1) Which model performs the best (and why)?<br>
2) What is the baseline?<br>
3) Does the best model perform better than the baseline (and why)?<br>
4) Does the best model exhibit any overfitting; what did you do about it?

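1- Model 1, the Decision Tree with max_depth=5 and min_samples_leaf=10, has the lowest test RMSE (24,919); the tuned tree from the randomized search scores 25,463<br>
2- The baseline predicts the mean train salary (78,566) for every posting; its test RMSE is 28,262<br>
3- Yes. The Decision Tree beats the baseline, with a test RMSE of about 24.9k versus 28.3k, so the job description text carries some signal about salary<br>
4- Some overfitting: the test RMSE is higher than the train RMSE (24,919 vs 21,979 for Model 1; 25,463 vs 18,874 for the tuned tree). Limiting max_depth and min_samples_leaf keeps the gap smaller for Model 1<br>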
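
The comparison behind these answers can be worked out from the RMSE values printed earlier. A small sketch, with the numbers copied from the outputs above and rounded to the nearest dollar; the dictionary layout is just one way to organize them:

```python
# RMSE values copied from the outputs above (rounded to the nearest dollar)
baseline_test = 28262
results = {
    'tree (max_depth=5, min_samples_leaf=10)': {'train': 21979, 'test': 24919},
    'tuned tree (max_depth=25, min_samples_leaf=22)': {'train': 18874, 'test': 25463},
}

summary = {}
for name, r in results.items():
    summary[name] = {
        # Train/test gap: a larger gap suggests more overfitting
        'gap': r['test'] - r['train'],
        # Reduction in test RMSE relative to the baseline, in percent
        'gain_vs_baseline_pct': round(100 * (baseline_test - r['test']) / baseline_test, 1),
    }
    print(name, summary[name])
```

The tuned tree fits the train set more closely but has the larger train/test gap and the higher test RMSE of the two.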