# Text Mining

In this assignment, we focus on text mining. The data set for this exercise contains job descriptions and salaries. The task is to predict the salary of a job posting (i.e., the `Salary` column in the data set) from its job description. Such a model is useful because it can recommend a salary as soon as a job description is entered into a system.

## Description of Variables

Use the **jobs.csv** file as the data set. 

There are only two columns:<br>
**Salary:** The salary of that specific job<br>
**Job Description:** The description of the job ad<br>

## Goal

Use the **jobs.csv** data set and build a model to predict **Salary**. <br>



# Read and Prepare the Data

In [2]:
import pandas as pd
import numpy as np

np.random.seed(5004)

In [4]:
jobs = pd.read_csv('jobs.csv')

In [5]:
jobs.head(5)

Unnamed: 0,Salary,Job Description
0,67206,Civil Service Title: Regional Director Mental ...
1,88313,The New York City Comptroller’s Office Burea...
2,81315,With minimal supervision from the Deputy Commi...
3,76426,OPEN TO CURRENT BUSINESS PROMOTION COORDINATOR...
4,55675,Only candidates who are permanent in the Princ...


In [6]:
# Check for missing values

jobs[['Job Description']].isna().sum()

Job Description    0
dtype: int64

In [7]:
# Check for missing values

jobs[['Salary']].isna().sum()

Salary    0
dtype: int64

In [8]:
input_data = jobs['Job Description']
target = jobs['Salary']

## Split the data

In [9]:
from sklearn.model_selection import train_test_split

train_set, test_set, train_y, test_y = train_test_split(input_data, target, test_size=0.2, random_state=3200432)

In [10]:
train_set.shape, train_y.shape

((1930,), (1930,))

In [11]:
test_set.shape, test_y.shape

((483,), (483,))

## Install and use nltk to prepare the data

In [13]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [1]:
import nltk
from nltk.corpus import stopwords
import re

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\abhir\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\abhir\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\abhir\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [20]:
#Create a blank list

new_train = []


# For each row in train_set, we will read the text, tokenize it, remove stopwords, lemmatize it, 
# and save it to the new list

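# Build the stop-word set and the lemmatizer once, outside the loop
stop_words = set(stopwords.words('english'))
lemmatizer = nltk.stem.WordNetLemmatizer()

for text in train_set:
    # Replace punctuation with spaces and lowercase the text
    text = re.sub(r'[!"#$%&()*+,-./:;<=>?[\]^_`{|}~]', ' ', text).lower()

    words = nltk.tokenize.word_tokenize(text)
    # Keep alphabetic tokens longer than two characters that are not stop words
    words = [w for w in words if w.isalpha()]
    words = [w for w in words if len(w) > 2 and w not in stop_words]

    words = [lemmatizer.lemmatize(w) for w in words]
    new_train.append(' '.join(words))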

In [21]:
# This will be a list of 1930 items (one per training row). Each item holds the cleaned tokens joined into one string

new_train

['supervision latitude independent action initiative decision position perform responsible difficult clerical typing word processing data entry task perform administrative work moderate difficulty involving handling confidential information material refer member public ask assistance appropriate city agency utilize manual automated office system type record report form schedule review verify written information receive send separate distribute mail obtain transmit information public department personnel occasionally perform cashier messenger duty perform related work',
 'candidate permanent administrative project manager title provide proof successful registration june open competitive exam exam promotional exam exam may apply failure result disqualification department design construction division public building architecture engineering unit seek senior structural design review engineer selected candidate serve senior engineer structural discipline designing foundation structural syst

In [22]:
# Let's convert the original train_set to a dataframe

train_set_df = pd.DataFrame(train_set)

train_set_df['Job Description'] = new_train

train_set_df

Unnamed: 0,Job Description
1439,supervision latitude independent action initia...
303,candidate permanent administrative project man...
1431,division preservation finance seek deputy dire...
525,division tenant resource dtr office financial ...
1279,candidate permanent administrative project man...
...,...
128,tort division new york city law department see...
305,assistant project manager provide support proj...
1248,nyc department environmental protection dep en...
1070,nyc department environmental protection dep en...


In [24]:
# Let's do the same for test data 

new_test = []

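# Build the stop-word set and the lemmatizer once, outside the loop
stop_words = set(stopwords.words('english'))
lemmatizer = nltk.stem.WordNetLemmatizer()

for text in test_set:
    # Replace punctuation with spaces and lowercase the text
    text = re.sub(r'[!"#$%&()*+,-./:;<=>?[\]^_`{|}~]', ' ', text).lower()

    words = nltk.tokenize.word_tokenize(text)
    # Keep alphabetic tokens longer than two characters that are not stop words
    words = [w for w in words if w.isalpha()]
    words = [w for w in words if len(w) > 2 and w not in stop_words]

    words = [lemmatizer.lemmatize(w) for w in words]
    new_test.append(' '.join(words))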



test_set_df = pd.DataFrame(test_set)

test_set_df['Job Description'] = new_test

test_set_df

Unnamed: 0,Job Description
463,drive development vehicle assist debris pick p...
1500,office economic opportunity nyc opportunity pa...
853,doitt provides sustained efficient effective d...
1135,reporting training supervisor trainer devoted ...
1428,new york city department environmental protect...
...,...
486,crane derrick unit mission ensure crane hoisti...
1273,new york city taxi limousine commission tlc la...
101,applicant permanent civil service status procu...
1955,director data operation oversee team support i...


In [25]:
new_test

['drive development vehicle assist debris pick pick material supply assist emergency snow removal prepare apartment move out',
 'office economic opportunity nyc opportunity part mayor office operation work reduce poverty broaden opportunity advancing use data evidence program policy design service delivery budget decision office economic opportunity recruiting one infrastructure engineer function java developer part integration service nyc enterprise data solution team java developer primarily responsible maintaining expanding current digital product service help drive long term data strategy enterprise data integration java developer responsible maintaining expanding current digital product creating new digital service strong software development skill make tool human centered help make complex information easy understand use technology solution solve business problem contribute new technology project assigned provide operation maintenance support existing application develop custom c
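The two cleaning loops above apply the same steps to the training and test text, so they could be folded into one helper. The following is a minimal sketch, not the notebook's own code: `clean_text`, `PUNCT_RE`, `toy_stop_words`, and `toy_lemmatize` are hypothetical names introduced here so the example runs without NLTK data, and whitespace splitting stands in for `word_tokenize` by default.

```python
import re

# Same punctuation character class as the notebook's loops, compiled once.
PUNCT_RE = re.compile(r'[!"#$%&()*+,-./:;<=>?[\]^_`{|}~]')


def clean_text(text, stop_words, lemmatize, tokenize=str.split):
    """Lowercase, strip punctuation, drop short/stop/non-alphabetic tokens, lemmatize."""
    text = PUNCT_RE.sub(' ', text).lower()
    words = [w for w in tokenize(text) if w.isalpha()]
    words = [w for w in words if len(w) > 2 and w not in stop_words]
    return ' '.join(lemmatize(w) for w in words)


# Toy stand-ins (assumptions, NOT NLTK's stop-word list or WordNet lemmatizer).
toy_stop_words = {'the', 'and', 'for'}


def toy_lemmatize(word):
    # Crude plural stripping, only to make the example self-contained.
    return word[:-1] if word.endswith('s') else word


cleaned = clean_text(
    'The Engineers review plans, and reports for 2 projects.',
    toy_stop_words,
    toy_lemmatize,
)
print(cleaned)
```

In the notebook, one would presumably pass `set(stopwords.words('english'))`, `nltk.stem.WordNetLemmatizer().lemmatize`, and `nltk.tokenize.word_tokenize` in place of the toy stand-ins, then call the helper once for `train_set` and once for `test_set`.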

## Use Scikit-Learn to create the term-by-doc matrix

In [60]:
# CountVectorizer handles pre-processing and tokenization, and can optionally filter stop words
from sklearn.feature_extraction.text import CountVectorizer

# Let's see if we can limit the number of features and still achieve good performance
count_vect = CountVectorizer(max_features=1000)

train_x_tr = count_vect.fit_transform(train_set_df['Job Description'])

In [61]:
test_x_tr = count_vect.transform(test_set_df['Job Description'])

In [62]:
train_x_tr, test_x_tr

(<1930x1000 sparse matrix of type '<class 'numpy.int64'>'
 	with 209884 stored elements in Compressed Sparse Row format>,
 <483x1000 sparse matrix of type '<class 'numpy.int64'>'
 	with 52952 stored elements in Compressed Sparse Row format>)

In [63]:
from sklearn.feature_extraction.text import TfidfTransformer

tf_transformer = TfidfTransformer()

train_x_tfidf = tf_transformer.fit_transform(train_x_tr)

train_x_tfidf.shape

(1930, 1000)

In [64]:
# Now we need to perform the tf-idf transformation on the test data set

test_x_tfidf = tf_transformer.transform(test_x_tr)

test_x_tfidf.shape

(483, 1000)

In [65]:
# train_x_tfidf is a sparse matrix. We can't see its values unless we convert it using toarray()
train_x_tfidf[:,:].toarray()

array([[0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.0338336 , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.09980493, 0.        , 0.        , ..., 0.04171836, 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.04284268, 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])
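To make the values in the matrix above less opaque, here is a small hand computation of tf-idf. It is a sketch that follows the formula `TfidfTransformer` applies with its default settings: smoothed idf, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The three toy documents and all variable names are invented for illustration.

```python
import math

# Three already-tokenized toy documents (hypothetical data).
docs = [
    ['water', 'project', 'water'],
    ['project', 'design'],
    ['city', 'design'],
]

vocab = sorted({w for d in docs for w in d})
n_docs = len(docs)

# Document frequency: number of documents that contain each term.
df = {t: sum(t in d for d in docs) for t in vocab}

# Smoothed inverse document frequency.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}


def tfidf_row(doc):
    """Raw term counts times idf, then scaled to unit L2 norm."""
    raw = [doc.count(t) * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]


rows = [tfidf_row(d) for d in docs]
for term, weight in zip(vocab, rows[0]):
    print(term, round(weight, 4))
```

In the first toy document, `water` outweighs `project` both because it occurs twice and because it appears in fewer documents.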

## Latent Semantic Analysis (Singular Value Decomposition)

In [66]:
from sklearn.decomposition import TruncatedSVD

In [133]:
# For Latent Semantic Analysis, the scikit-learn documentation recommends n_components=100; 600 components are used here

svd = TruncatedSVD(n_components=600, n_iter=10)

In [134]:
train_x_lsa = svd.fit_transform(train_x_tfidf)

In [135]:
train_x_lsa.shape

(1930, 600)

In [136]:
train_x_lsa

array([[ 2.33257176e-01, -1.12785333e-01, -8.93784372e-02, ...,
         4.47151086e-03, -1.86523018e-02, -6.95926494e-03],
       [ 3.77942277e-01, -6.02714716e-02,  4.32302979e-01, ...,
        -4.84160354e-03,  5.69294149e-04,  9.94935386e-03],
       [ 3.37252405e-01, -2.24969862e-01, -1.88829438e-02, ...,
         1.18157859e-02, -7.01567402e-03, -1.15067042e-02],
       ...,
       [ 5.01514630e-01,  7.09537006e-02, -7.40445893e-02, ...,
        -5.63481460e-03,  6.20508645e-03,  7.38282226e-04],
       [ 4.01897622e-01,  1.40732216e-01, -5.62314468e-02, ...,
        -2.02839456e-02, -2.91142120e-04, -2.07159731e-02],
       [ 3.44847310e-01, -2.12629740e-01,  5.98674092e-02, ...,
         2.69282222e-03, -1.89185654e-03,  7.55250437e-04]])

### Explore the SVDs

In [137]:
svd.explained_variance_.sum()

0.874839413935975
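Note that `explained_variance_` holds absolute variances, so its sum is not by itself a share of the total variance. The sketch below, on random toy data rather than the jobs data, shows how `explained_variance_ratio_` relates to it; `X_toy` and `svd_toy` are hypothetical names.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# Random toy matrix standing in for a tf-idf matrix (assumption).
rng = np.random.RandomState(0)
X_toy = rng.rand(50, 20)

svd_toy = TruncatedSVD(n_components=5, random_state=0)
svd_toy.fit(X_toy)

abs_var = svd_toy.explained_variance_.sum()        # absolute variance kept
ratio = svd_toy.explained_variance_ratio_.sum()    # fraction of total variance kept
total_var = X_toy.var(axis=0).sum()                # total variance over all features

print(abs_var, total_var, ratio)
```

In the notebook, `svd.explained_variance_ratio_.sum()` would report the fraction of the tf-idf variance retained by the 600 components.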

In [138]:
# These are all the components:
svd.components_

array([[ 2.11587384e-02,  1.23261264e-02,  8.82193253e-03, ...,
         2.49584765e-02,  8.04874741e-03,  1.45571880e-02],
       [-1.62839473e-02,  6.39931277e-05, -1.72078550e-02, ...,
         1.24990113e-02, -1.63643643e-02, -1.82641325e-02],
       [-3.29790156e-03, -4.61621153e-03, -1.58147068e-02, ...,
        -1.18471673e-02, -1.21491796e-02, -7.22412940e-03],
       ...,
       [-1.81052505e-02,  2.17976569e-02, -3.25787459e-02, ...,
        -1.10781581e-02,  8.59477019e-03, -3.15859396e-02],
       [-2.94027420e-03,  1.33011768e-02,  1.07411669e-02, ...,
         5.01430033e-02, -3.77724081e-03,  6.22228399e-02],
       [-2.88358555e-02, -4.55591449e-02,  2.49325993e-02, ...,
        -2.49501223e-03, -1.60361759e-02, -9.92503268e-03]])

In [139]:
svd.components_.shape

(600, 1000)

In [140]:
#Let's select the first component:

first_component = svd.components_[0,:]

In [141]:
# Sort the weights in the first component, and get the indices

indeces = np.argsort(first_component).tolist()

In [142]:
# Be careful: argsort returns the indices in ascending order of weight (least important first)

print(indeces)

[575, 96, 312, 543, 969, 978, 915, 843, 567, 196, 924, 333, 944, 490, 368, 534, 130, 350, 782, 388, 942, 269, 883, 4, 803, 679, 463, 437, 708, 419, 48, 288, 989, 29, 268, 548, 157, 142, 261, 684, 447, 103, 156, 246, 727, 100, 639, 816, 427, 307, 582, 464, 482, 902, 236, 768, 930, 721, 232, 694, 292, 472, 39, 753, 988, 527, 540, 767, 622, 559, 857, 765, 393, 475, 126, 584, 238, 621, 775, 564, 485, 590, 470, 881, 476, 764, 322, 92, 516, 841, 655, 417, 422, 693, 671, 970, 448, 617, 189, 538, 234, 483, 998, 928, 173, 919, 403, 710, 772, 351, 225, 64, 252, 267, 650, 435, 844, 273, 580, 592, 263, 241, 396, 171, 372, 549, 141, 979, 879, 84, 941, 250, 794, 960, 733, 82, 180, 571, 598, 790, 205, 783, 301, 984, 446, 349, 525, 697, 994, 14, 337, 258, 796, 524, 423, 599, 279, 23, 57, 22, 382, 530, 217, 121, 2, 823, 740, 206, 957, 274, 506, 539, 179, 503, 420, 848, 817, 357, 751, 685, 270, 487, 184, 38, 290, 218, 174, 640, 240, 952, 58, 717, 488, 687, 840, 822, 805, 533, 982, 124, 509, 20, 104, 11,

In [143]:
# Let's get the feature names from the count vectorizer
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
feat_names = count_vect.get_feature_names_out()

In [144]:
# Print the last 10 terms (i.e., the 10 terms that have the highest weights)

for index in indeces[-10:]:
    print(feat_names[index], "\t\tweight =", first_component[index])

agency 		weight = 0.11599942342965541
new 		weight = 0.12092872143218744
design 		weight = 0.12302483210966209
program 		weight = 0.12690049937870435
dep 		weight = 0.12860457807600037
wastewater 		weight = 0.13167282695774693
construction 		weight = 0.13542595716234807
city 		weight = 0.13926005613974768
project 		weight = 0.18193455099019998
water 		weight = 0.22281208016823992
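Because `np.argsort` sorts in ascending order, the highest-weighted terms sit at the end of the index list. A small sketch with made-up weights shows the ordering and one way to list the top terms with the largest first; the terms and values here are hypothetical.

```python
import numpy as np

# Hypothetical component weights and their matching terms.
toy_weights = np.array([0.05, 0.22, -0.10, 0.18])
toy_terms = ['agency', 'water', 'typing', 'project']

# argsort returns positions from the smallest weight to the largest.
order = np.argsort(toy_weights)

# Reverse the order and keep the first two to get the top terms, largest first.
top2 = order[::-1][:2]
top_terms = [toy_terms[i] for i in top2]

print(order.tolist())
print(top_terms)
```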


### Let's transform the test data set

In [145]:
test_x_lsa = svd.transform(test_x_tfidf)

In [146]:
test_x_lsa.shape

(483, 600)
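The vectorizer, the tf-idf transformer, and the SVD are each fitted on the training text and then only applied to the test text. One way to keep that discipline in a single object is a scikit-learn pipeline. The sketch below uses invented toy documents and only two components, so its shapes differ from the notebook's.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

# Hypothetical, already-cleaned toy documents.
toy_train = [
    'water project design review',
    'city agency program budget',
    'wastewater construction project engineer',
    'clerical typing data entry',
]
toy_test = ['water construction design', 'budget data review']

lsa_pipeline = make_pipeline(
    CountVectorizer(max_features=1000),
    TfidfTransformer(),
    TruncatedSVD(n_components=2, random_state=0),
)

# fit_transform learns the vocabulary, idf weights, and SVD from the training text only.
toy_train_lsa = lsa_pipeline.fit_transform(toy_train)

# transform reuses what was learned; nothing is refitted on the test text.
toy_test_lsa = lsa_pipeline.transform(toy_test)

print(toy_train_lsa.shape, toy_test_lsa.shape)
```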

## Baseline

In [147]:
from sklearn.metrics import mean_squared_error

In [148]:
#First find the average value of the target

mean_value = np.mean(train_y)

mean_value

77652.36424870466

In [149]:
# Predict all values as the mean

baseline_pred = np.repeat(mean_value, len(test_y))

baseline_pred

array([77652.3642487, 77652.3642487, 77652.3642487, 77652.3642487,
       77652.3642487, 77652.3642487, 77652.3642487, 77652.3642487,
       77652.3642487, 77652.3642487, 77652.3642487, 77652.3642487,
       77652.3642487, 77652.3642487, 77652.3642487, 77652.3642487,
       77652.3642487, 77652.3642487, 77652.3642487, 77652.3642487,
       77652.3642487, 77652.3642487, 77652.3642487, 77652.3642487,
       77652.3642487, 77652.3642487, 77652.3642487, 77652.3642487,
       77652.3642487, 77652.3642487, 77652.3642487, 77652.3642487,
       77652.3642487, 77652.3642487, 77652.3642487, 77652.3642487,
       77652.3642487, 77652.3642487, 77652.3642487, 77652.3642487,
       77652.3642487, 77652.3642487, 77652.3642487, 77652.3642487,
       77652.3642487, 77652.3642487, 77652.3642487, 77652.3642487,
       77652.3642487, 77652.3642487, 77652.3642487, 77652.3642487,
       77652.3642487, 77652.3642487, 77652.3642487, 77652.3642487,
       77652.3642487, 77652.3642487, 77652.3642487, 77652.3642

In [150]:
baseline_mse = mean_squared_error(test_y, baseline_pred)

baseline_rmse = np.sqrt(baseline_mse)

print('Baseline RMSE: {}' .format(baseline_rmse))

Baseline RMSE: 33047.83186817777
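The same mean-prediction baseline can also be expressed with scikit-learn's `DummyRegressor`, which keeps the baseline on the same `fit`/`predict` interface as the real models. This is a sketch on made-up salaries, not the jobs data; the `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Made-up salaries for illustration.
toy_train_y = np.array([50000.0, 70000.0, 90000.0])
toy_test_y = np.array([60000.0, 100000.0])

# DummyRegressor ignores the features, so placeholder columns of zeros are enough.
dummy = DummyRegressor(strategy='mean')
dummy.fit(np.zeros((len(toy_train_y), 1)), toy_train_y)

toy_pred = dummy.predict(np.zeros((len(toy_test_y), 1)))
toy_rmse = np.sqrt(mean_squared_error(toy_test_y, toy_pred))

print(toy_pred, toy_rmse)
```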


# Model 1
### Use any model that we have covered so far

In [182]:
# subsample=0.75: each tree is fit on a random 75% of the training rows
from sklearn.ensemble import GradientBoostingRegressor
gb_reg = GradientBoostingRegressor(max_depth=9, n_estimators=100, 
                                   learning_rate=0.03, subsample=0.75) 

gb_reg.fit(train_x_lsa, train_y)

GradientBoostingRegressor(learning_rate=0.03, max_depth=9, subsample=0.75)

In [183]:
#Train RMSE
train_pred = gb_reg.predict(train_x_lsa)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 4058.653817629123


In [184]:
#Test RMSE
test_pred = gb_reg.predict(test_x_lsa)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 20167.134427512752


# Model 2
### Use any model that we have covered so far

In [176]:
from sklearn.ensemble import RandomForestRegressor 

rnd_reg = RandomForestRegressor(n_estimators=50,n_jobs=-1) 

rnd_reg.fit(train_x_lsa, train_y)

RandomForestRegressor(n_estimators=50, n_jobs=-1)

In [177]:
#Train RMSE
train_pred = rnd_reg.predict(train_x_lsa)

train_mse = mean_squared_error(train_y, train_pred)

train_rmse = np.sqrt(train_mse)

print('Train RMSE: {}' .format(train_rmse))

Train RMSE: 6565.151717119893


In [178]:
#Test RMSE
test_pred = rnd_reg.predict(test_x_lsa)

test_mse = mean_squared_error(test_y, test_pred)

test_rmse = np.sqrt(test_mse)

print('Test RMSE: {}' .format(test_rmse))

Test RMSE: 20967.59767578778


# Discussion

1) Which model performs the best (and why)?<br>
2) What is the baseline?<br>
3) Does the best model perform better than the baseline (and why)?<br>
4) Does the best model exhibit any overfitting; what did you do about it?

#### 1. The first model (Gradient Boosting Regressor) performs best because its test RMSE (20167.13) is lower than that of the second model, the Random Forest Regressor (20967.60).
#### 2. The baseline predicts the mean training salary (77652.36) for every posting; its test RMSE is 33047.83.
#### 3. Yes. The best model's test RMSE (20167.13) is about 39% lower than the baseline RMSE (33047.83), so the job description text carries information about salary that the mean alone does not capture.
#### 4. Yes, it exhibits overfitting: the train RMSE (4058.65) is far below the test RMSE (20167.13). I tried changing the parameter values (max_depth, n_estimators, learning_rate) to bring the train and test RMSE closer together, but this did not noticeably reduce the overfitting.
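For the overfitting noted in answer 4, one systematic alternative to changing parameters by hand is a cross-validated grid search on the training set. The following is only a sketch on synthetic data from `make_regression`; the grid values are illustrative assumptions, not settings tuned for the jobs data.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the LSA features (assumption).
X_toy, y_toy = make_regression(n_samples=120, n_features=10, noise=10.0, random_state=0)

# Small illustrative grid; shallower trees are one common way to limit overfitting.
param_grid = {'max_depth': [2, 4], 'n_estimators': [50, 100]}

search = GridSearchCV(
    GradientBoostingRegressor(learning_rate=0.03, subsample=0.75, random_state=0),
    param_grid,
    scoring='neg_root_mean_squared_error',
    cv=3,
)
search.fit(X_toy, y_toy)

# best_score_ is the negated RMSE, so flip the sign to report it.
cv_rmse = -search.best_score_
print(search.best_params_, cv_rmse)
```

Because the score is averaged over held-out folds, it estimates test performance more reliably than the training RMSE does, without touching the test set.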