# Deliverable 2: Identifying Promising Models & Model Tuning

## Andrew Grefer, Rebecca Jorgensen, Jonathan Murphy, Will Storment

To build some promising models, we imported the packages we need, in particular scikit-learn, which provides the accuracy, MSE, and R^2 scores that guide our tuning.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import tree
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

%matplotlib inline
sns.set()

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import metrics

In [2]:
df_fake = pd.read_csv('fake_job_postings.csv')

We imported our dataset and wrangled it again.

In [3]:
fraudulent_col = df_fake.fraudulent
fake_model = df_fake.drop(labels = ['job_id', 'department', 'function', 'fraudulent'], axis = 1)

In [4]:
# Fill missing text fields with explicit placeholders.
# Assigning the result back avoids chained in-place edits on a column copy.
fake_model = fake_model.fillna({
    "title": 'No title stated',
    "location": 'No location stated',
    "company_profile": 'No profile',
    "description": 'No description',
    "requirements": 'No requirements stated',
    "employment_type": 'Not stated',
    "required_experience": 'Not stated',
    "required_education": 'Not stated',
    "industry": 'No industry stated',
})

We convert 'salary_range' and 'benefits' to binary indicators: 1 if the posting states a value, 0 if it does not.

In [5]:
# 1 if the field is present, 0 if it is missing.
# Assigning whole columns avoids the SettingWithCopyWarning raised by chained .loc assignment.
fake_model['salary_range'] = fake_model['salary_range'].notnull().astype(int)
fake_model['benefits'] = fake_model['benefits'].notnull().astype(int)
fake_model.rename(columns={'salary_range': 'salary_stated', 'benefits' : 'benefits_stated'}, inplace=True)


To build training and testing sets from the 'fraudulent' and 'description' columns, we first cast 'description' to a string type.

We are trying to predict whether a posting is fraudulent from its description, so 'fraudulent' is our target variable y.

We then split the data into training and testing sets, with 'description' as the explanatory variable.

In [6]:
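df_fake.replace('?', np.nan, inplace=True)  # np.nan: the np.NaN alias was removed in NumPy 2.0
df_fake['description'] = df_fake['description'].astype(str)  # missing descriptions become the string 'nan'


y = df_fake.fraudulent  # target: 1 = fraudulent, 0 = legitimate
df_fake = df_fake.drop('fraudulent', axis=1)  # keep only the explanatory columns

# Hold out 35% of the postings for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(df_fake['description'], y, test_size=0.35, random_state=53)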
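
Fraudulent postings are a small minority of this dataset, so a plain random split can leave the training and testing sets with slightly different fraud rates. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The following is a minimal sketch on hypothetical toy labels (`X_toy` and `y_toy` are made up for illustration and are not our data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 10% positives, mimicking an imbalanced fraud flag
y_toy = np.array([1] * 100 + [0] * 900)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy asks the splitter to keep the class ratio the same in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.35, random_state=53, stratify=y_toy
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print("positive rate in train:", train_rate)
print("positive rate in test: ", test_rate)
```

Whether stratifying would change our results is an open question; the sketch only shows how the option is used.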

In [7]:
y_train.head(20)

1601     0
5960     0
14302    0
14579    0
5596     0
2835     0
14935    0
15103    0
5211     0
6387     0
11470    0
13762    0
15697    0
16655    0
9643     0
9868     0
5829     0
8169     0
17633    1
2886     0
Name: fraudulent, dtype: int64

In [8]:
X_train.head(20)

1601     Our client, a growing company in Danbury, CT, ...
5960     SE1, London Bridge - Laserlife, part of the Vi...
14302    United Cerebral Palsy of Oregon &amp; SW Washi...
14579    The Sales Development Representative is respon...
5596     We're on a hunt for a e-mail marketing special...
2835     The Radio Producer shall not fail to properly ...
14935    OgilvyOne Worldwide, Athens seeks to recruit a...
15103    Apcera is completely re-imagining application ...
5211     This position has two primary and overarching ...
6387     Sounds like what you are looking for? Then app...
11470    Shapeways is the leading 3D printing marketpla...
13762    Babbel is constantly improving its internal pr...
15697    Hopper is a travel startup based in Cambridge,...
16655    Trans4u Ltd We are an International Translatio...
9643     LCC is a great company to work, we have a very...
9868     Play with kids, get paid for it :-)Love travel...
5829     PeoplePerHour is the UK's leading online marke.

Next we create a bag-of-words representation of the job postings.
CountVectorizer ignores English stop words such as 'and', 'but', and 'the'.
We fit the vectorizer on the training data only and then apply the same vocabulary to the test data.

In [9]:
count_vectorizer = CountVectorizer(stop_words='english') 
count_train = count_vectorizer.fit_transform(X_train) 
count_test = count_vectorizer.transform(X_test) 
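
To make the bag-of-words step concrete, here is a small sketch on hypothetical sentences (`train_docs` and `test_docs` are invented for illustration). It shows that stop words are dropped from the vocabulary and that words seen only at test time are ignored, because `transform` reuses the vocabulary learned by `fit_transform`:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example sentences, not taken from the dataset
train_docs = [
    "Earn money fast and work from home",
    "The engineer will work with the data team",
]
test_docs = ["Work from home and earn bitcoin"]

vec = CountVectorizer(stop_words='english')
train_counts = vec.fit_transform(train_docs)  # learns the vocabulary from training text only
test_counts = vec.transform(test_docs)        # maps test text onto that same vocabulary

print(sorted(vec.vocabulary_))   # learned terms, stop words removed
print(train_counts.shape, test_counts.shape)
```

Both matrices have the same number of columns, which is what lets a classifier trained on `train_counts` score `test_counts`.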

Then we instantiate a classifier, fit it to the training data, predict labels for the test data, and compute the accuracy score and the confusion matrix.

In [10]:
nb_classifier = MultinomialNB() 
nb_classifier.fit(count_train, y_train)
pred = nb_classifier.predict(count_test) 
score = metrics.accuracy_score(y_test, pred)  
print(score)
cm = metrics.confusion_matrix(y_test, pred, labels=[0, 1])  
print(cm)

0.9637264301693832
[[5887   36]
 [ 191  144]]


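With a test_size of 0.35, the accuracy computed from our confusion matrix is 96.37%:

(5887 + 144)/(0.35*17880) = 6031/6258 = 0.96372643 accuracy

36 Type I errors (false positives: legitimate postings flagged as fraudulent)
191 Type II errors (false negatives: fraudulent postings that were missed)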

(36+191)/(0.35*17880) = 227/6258 = 0.03627357 errors
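
The arithmetic above can be checked directly from the confusion matrix. The sketch below also computes one figure that is not in our original analysis and is added here as context: the accuracy of a trivial model that labels every posting as legitimate (the name `baseline_accuracy` is ours):

```python
import numpy as np

# Confusion matrix printed above for the 'description' model:
# rows are true labels (0, 1), columns are predicted labels (0, 1)
cm_description = np.array([[5887, 36],
                           [191, 144]])

total = cm_description.sum()                   # number of test postings
accuracy = np.trace(cm_description) / total    # correct predictions lie on the diagonal
error_rate = 1 - accuracy

# Accuracy of always predicting 0 (legitimate): every true-0 row is correct
baseline_accuracy = cm_description[0].sum() / total

print("accuracy:         ", accuracy)
print("error rate:       ", error_rate)
print("baseline accuracy:", baseline_accuracy)
```

The baseline comes out at about 94.6%, so the 96.4% accuracy should be read against that figure rather than against 50%.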

In [11]:
df_fake = pd.read_csv('fake_job_postings.csv')

We wanted to compare the 'description' feature with 'requirements', so we built a second confusion matrix based on 'requirements'.

As before, we make sure 'requirements' is cast to a string type.

We again predict whether a posting is fraudulent, this time from its requirements, so 'fraudulent' is still our target variable y.

We define the training and testing sets as before, with 'requirements' as the explanatory variable.

In [12]:
df_fake.replace('?', np.nan, inplace=True)  # np.nan: the np.NaN alias was removed in NumPy 2.0
df_fake['requirements'] = df_fake['requirements'].astype(str)  # missing requirements become the string 'nan'

y = df_fake.fraudulent       
df_fake = df_fake.drop('fraudulent', axis=1)  

X_train, X_test, y_train, y_test = train_test_split(df_fake['requirements'], y, test_size=0.35, random_state=53) 

In [13]:
y_train.head(20)

1601     0
5960     0
14302    0
14579    0
5596     0
2835     0
14935    0
15103    0
5211     0
6387     0
11470    0
13762    0
15697    0
16655    0
9643     0
9868     0
5829     0
8169     0
17633    1
2886     0
Name: fraudulent, dtype: int64

In [14]:
X_train.head(20)

1601     Must be able to type at least 35 WPMMust have ...
5960     You'll need to have at least 12 months office ...
14302    A Bachelor’s Degree, or at least 1 year direct...
14579    • Sales skills to include demonstrated phone a...
5596     Exceptional copywriting and communication skil...
2835     The Radio Producer shall produce, coordinate, ...
14935    Minimum 1 year of art direction experienceExce...
15103    Diagnose and resolve latent and systemic relia...
5211     Master’s degree in Public Health, Health Educa...
6387                                                   nan
11470    2 + years experience in Product Management, Pr...
13762    Absolute necessity: Experience in concept, des...
15697    A qualified candidate hasA degree in Math, Sta...
16655    In order to register you in our database, plea...
9643     ·         Has experience on field as well as h...
9868     University degree required. TEFL / TESOL / CEL...
5829     Background / Experience We are less interested.

We create our bag-of-words again, but from 'requirements'

In [15]:
count_vectorizer = CountVectorizer(stop_words='english') 
count_train = count_vectorizer.fit_transform(X_train)  
count_test = count_vectorizer.transform(X_test)

Just as for the other confusion matrix, we instantiate a classifier, fit it to the training data, predict labels for the test data, and compute the accuracy score and the confusion matrix.

In [16]:
nb_classifier = MultinomialNB() 

nb_classifier.fit(count_train, y_train)

pred = nb_classifier.predict(count_test)  

score = metrics.accuracy_score(y_test, pred) 
print(score)

cm = metrics.confusion_matrix(y_test, pred, labels=[0, 1])  
print(cm)

0.9512623841482902
[[5910   13]
 [ 292   43]]


With a test_size of 0.35, the accuracy computed from our confusion matrix is 95.13%:

(5910 + 43)/(0.35*17880) = 5953/6258 = 0.951262384 accuracy

13 Type I errors (false positives: legitimate postings flagged as fraudulent)
292 Type II errors (false negatives: fraudulent postings that were missed)

(13+292)/(0.35*17880) = 305/6258 = 0.04873 errors
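
Since both models have similar accuracy, it may help to compare them on the fraudulent class alone. The sketch below derives precision and recall for class 1 from the two confusion matrices printed above; the helper name `fraud_precision_recall` is hypothetical and not part of scikit-learn:

```python
import numpy as np

# Confusion matrices printed above (rows: true 0/1, columns: predicted 0/1)
cm_description = np.array([[5887, 36], [191, 144]])
cm_requirements = np.array([[5910, 13], [292, 43]])

def fraud_precision_recall(cm):
    """Return (precision, recall) for the fraudulent class (label 1)."""
    (tn, fp), (fn, tp) = cm
    precision = tp / (tp + fp)   # of postings flagged as fraud, the share that really are
    recall = tp / (tp + fn)      # of truly fraudulent postings, the share that were caught
    return precision, recall

for name, cm in [("description", cm_description), ("requirements", cm_requirements)]:
    precision, recall = fraud_precision_recall(cm)
    print(f"{name:>12}: precision={precision:.3f}, recall={recall:.3f}")

desc_precision, desc_recall = fraud_precision_recall(cm_description)
req_precision, req_recall = fraud_precision_recall(cm_requirements)
```

On these numbers the 'description' model catches about 43% of fraudulent postings and the 'requirements' model about 13%, a much larger gap than the accuracy scores suggest.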

To begin the random forest, we first select predictors with numeric values; the only ones we have are binary indicators.

In [17]:
data = fake_model[['salary_stated', 'benefits_stated', 'telecommuting','has_company_logo', 'has_questions']]

In [18]:
data.head()

   salary_stated  benefits_stated  telecommuting  has_company_logo  has_questions
0              0                0              0                 1              0
1              0                1              0                 1              0
2              0                0              0                 1              0
3              0                1              0                 1              0
4              0                1              0                 1              1


In [19]:
data_train = data.iloc[:12000]
data_test = data.iloc[12000:]
fraud_train = fraudulent_col.iloc[:12000]
fraud_test = fraudulent_col.iloc[12000:]
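
The split above is positional: the first 12,000 rows train the model and the remaining rows test it. That is representative only if the row order is unrelated to fraud, which this notebook does not check. A quick way to check is to compare the fraud rate on each side of the cut. The sketch below uses a hypothetical, deliberately ordered label series (`labels` and `cut` are illustrative, not our data) to show what a problematic split looks like:

```python
import pandas as pd

# Hypothetical labels, deliberately ordered so that positives cluster late
labels = pd.Series([0] * 80 + [1] * 5 + [0] * 10 + [1] * 5)
cut = 60  # positional cut point, playing the role of 12000 above

train_rate = labels.iloc[:cut].mean()   # share of positives before the cut
test_rate = labels.iloc[cut:].mean()    # share of positives after the cut

print("train positive rate:", train_rate)
print("test positive rate: ", test_rate)
```

If the two rates differ a lot on the real data, a shuffled split such as `train_test_split` would be the safer choice.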

Next, we tune the hyperparameters to find the best number of estimators and the best maximum depth for our decision trees.

In [20]:
from sklearn.model_selection import GridSearchCV

In [21]:
param_grid = {
    'bootstrap': [True],
    'max_depth': [2, 3, 4, 5, 6, 7, 8, 9],
    'n_estimators': [10, 25, 50, 75, 100, 200]
}

In [22]:
model = RandomForestRegressor()

In [23]:
model_grid = GridSearchCV(model, param_grid=param_grid, cv=5)

In [24]:
model_grid.fit(data_train, fraud_train)

GridSearchCV(cv=5, estimator=RandomForestRegressor(),
             param_grid={'bootstrap': [True],
                         'max_depth': [2, 3, 4, 5, 6, 7, 8, 9],
                         'n_estimators': [10, 25, 50, 75, 100, 200]})

In [25]:
model_grid.best_params_

{'bootstrap': True, 'max_depth': 2, 'n_estimators': 50}
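
`best_params_` reports only the winning combination. When no `scoring` argument is given, `GridSearchCV` ranks candidates with the estimator's own `score` method, which is R-squared for a regressor. To see how close the other candidates were, the `cv_results_` attribute can be loaded into a DataFrame. The sketch below runs on hypothetical random binary data (`X_toy`, `y_toy`, and `small_grid` are made up so that it executes quickly); it illustrates the inspection step and does not reproduce our search:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical toy data: five binary predictors, target is the AND of the first two
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 2, size=(200, 5))
y_toy = (X_toy[:, 0] & X_toy[:, 1]).astype(float)

# A deliberately small grid so the example runs in a few seconds
small_grid = {'max_depth': [1, 2], 'n_estimators': [10, 25]}
search = GridSearchCV(RandomForestRegressor(random_state=0),
                      param_grid=small_grid, cv=5)
search.fit(X_toy, y_toy)

# One row per parameter combination, with mean and spread of the cross-validated score
results = pd.DataFrame(search.cv_results_)[
    ['param_max_depth', 'param_n_estimators',
     'mean_test_score', 'std_test_score', 'rank_test_score']
]
print(results.sort_values('rank_test_score'))
print(search.best_params_)
```

If several rows have nearly identical `mean_test_score`, the choice among them is not strongly supported by the data.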

Now that we know the best model has a maximum depth of 2 and 50 estimators, we can build the random forest.

In [26]:
model = RandomForestRegressor(n_estimators=50, max_depth=2, bootstrap=True)
model.fit(data_train, fraud_train)

RandomForestRegressor(max_depth=2, n_estimators=50)

In [27]:
y_pred = model.predict(data_test)

In [28]:
mse = mean_squared_error(fraud_test, y_pred)
print('Mean squared error:', mse)
r2 = r2_score(fraud_test, y_pred)
print('R-Squared:', r2)

Mean squared error: 0.05187344172126908
R-Squared: 0.09981427872795734


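As we can see, the MSE is small (about 0.052), but so is the R-squared (about 0.10). Because fraudulent postings are rare, a low MSE is expected even from a weak model: an R-squared of 0.10 means our MSE is only about 10% below that of always predicting the average fraud rate of the test set. With just five binary predictors, this R-squared may still be reasonable for the type of problem we are trying to solve. However, we need to try other solutions first.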
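
The link between the two scores follows from the definition R^2 = 1 - MSE / Var(y), where Var(y) is the MSE of a model that always predicts the mean of the test labels. A short sketch using the two values printed above (the names `baseline_mse` and `relative_improvement` are ours):

```python
# Values printed by the cell above
mse = 0.05187344172126908
r2 = 0.09981427872795734

# Rearranging R^2 = 1 - mse / baseline_mse gives the MSE of always predicting the mean
baseline_mse = mse / (1 - r2)

# Fraction by which the random forest reduces the baseline MSE (equals R^2 by construction)
relative_improvement = 1 - mse / baseline_mse

print("baseline MSE:        ", round(baseline_mse, 4))
print("relative improvement:", round(relative_improvement, 4))
```

The baseline MSE is about 0.0576, so the forest's 0.0519 is a modest improvement rather than evidence of a strong fit.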