# Part 3 - Modeling

In this notebook we focus on the full modeling process. I will do some preprocessing and hyperparameter tuning, then fit our data to a variety of models to determine which one performs best.


**Model Results**

|Model|AUC Score|
|---|---|
|Baseline - Logistic Regression using TFIDF data| 0.58|
|Logistic Regression using Count Vectorizer Data| 0.55|
|KNN| 0.69|
|SVC| 0.5|
|Random Forest| 0.52|
|**Neural Network - MLPClassifier**|**0.738**|

The best model was our neural network (MLPClassifier), with a holdout AUC of 0.738.

In [117]:
#importing the holy trinity of data science packages
import pandas as pd 
pd.options.display.max_rows = 4000
import numpy as np
import matplotlib.pyplot as plt

#Other visualization packages
import seaborn as sns

#Importing NLP plugins
from nltk.corpus import stopwords 
stop_words = stopwords.words('english')
from nltk.stem import WordNetLemmatizer 
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

#Importing our Sklearn Plugins
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

#importing our models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

#Model Evaluation
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

## Feature Engineering

We need to do some feature engineering. I would like to one hot encode my categorical data and fit a TFIDF Vectorizer to my text data column. I may also fit a Count Vectorizer to see whether it changes the model's performance. In addition, I may want to fit a PCA to reduce computation time.

**Next Steps:**

1. One Hot Encode Categorical Data
2. Fit in a TFIDF Vectorizer
3. Fit in a Count Vectorizer
4. Determine if using a PCA would help. 

In [2]:
#Import our dataset
df = pd.read_csv('data/cleaned_data.csv', index_col = 0)

In [3]:
#Make Copy
df_2 = df.copy()

# One Hot Encoding using Pandas get dummies function
columns_to_1_hot = ['employment_type','required_experience','required_education',
                   'industry', 'function']

for column in columns_to_1_hot:
    encoded = pd.get_dummies(df_2[column])
    df_2 = pd.concat([df_2, encoded], axis = 1)
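
One caveat with concatenating per-column dummies: if two columns share a category label (for example an `Other` level), the result contains duplicate column names. Below is a minimal sketch, using a made-up three-row frame rather than the real dataset, of passing `columns=` to `pd.get_dummies` so that each dummy is prefixed with its source column:

```python
import pandas as pd

# Toy frame (hypothetical values, not the real dataset) where both
# categorical columns contain the label "Other".
toy = pd.DataFrame({
    "employment_type": ["Full-time", "Other", "Part-time"],
    "function": ["Sales", "Other", "Other"],
    "fraudulent": [0, 1, 0],
})

# Encoding several columns in one call prefixes each dummy with its
# source column name, so "Other" from the two columns stays distinct.
encoded = pd.get_dummies(toy, columns=["employment_type", "function"])

print(encoded.columns.tolist())
```

With this form the encoded source columns are replaced in the result, so they do not need a separate drop afterwards.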

For simplicity's sake, I will also drop the *title* and *location* columns.

In [4]:
columns_to_1_hot += ['title', 'location']
    
#dropping the original columns that we just one hot encoded, plus title and location
df_2 = df_2.drop(columns_to_1_hot, axis = 1)
df_2.head(3)

Unnamed: 0,description,telecommuting,has_company_logo,has_questions,fraudulent,Contract,Full-time,Other,Part-time,Temporary,...,Public Relations,Purchasing,Quality Assurance,Research,Sales,Science,Strategy/Planning,Supply Chain,Training,Writing/Editing
0,"Food52, a fast-growing, James Beard Award-winn...",0,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Organised - Focused - Vibrant - Awesome!Do you...,0,1,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Our client, located in Houston, is actively se...",0,1,0,0,0,1,0,0,0,...,0,0,0,0,1,0,0,0,0,0


### Handling the description column 

First of all, we need a custom tokenizer to clean up our text data a little bit.

In [5]:
def tokenizer(text):
    
    #All characters in this string will be converted to lowercase
    text = text.lower()
    
    #Removing sentence punctuations
    for punctuation_mark in string.punctuation:
        text = text.replace(punctuation_mark,'')
    
    #Creating our list of tokens
    list_of_tokens = text.split(' ')
    #Creating our cleaned tokens list 
    cleaned_tokens = []
    #Instantiating our Lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    #Removing stop words from our list of tokens, along with any tokens that happen to be empty strings
    for token in list_of_tokens:
        if (token not in stop_words) and (token != ''):
            #lemmatizing our token
            token_lemmatized = lemmatizer.lemmatize(token)
            #appending our finalized cleaned token
            cleaned_tokens.append(token_lemmatized)
    
    return cleaned_tokens

In [6]:
#Instantiating our tfidf vectorizer
tfidf = TfidfVectorizer(tokenizer = tokenizer, min_df = 0.05, ngram_range=(1,3))
#Fit_transform our description 
tfidf_features = tfidf.fit_transform(df_2['description']) #this will create a sparse matrix

I want to append this sparse matrix to the original pandas DataFrame.

In [7]:
tfidf_vect_df = pd.DataFrame(tfidf_features.todense(), columns = tfidf.get_feature_names())

df_tfidf = pd.concat([df_2, tfidf_vect_df], axis = 1)
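
A note on this step: `pd.concat(..., axis=1)` aligns rows on index labels, not on position. If `df_2` kept a non-contiguous index from the cleaned CSV (an assumption, since the full index is not shown here), the new feature frame with its default 0..n-1 index would not line up, and the result would contain `NaN` rows, which may be why a later `dropna()` is needed. A small sketch with made-up data showing the effect, and one way to avoid it by reusing the original index:

```python
import numpy as np
import pandas as pd

# Toy frame whose index has gaps, as can happen after rows are dropped.
left = pd.DataFrame({"fraudulent": [0, 1, 0]}, index=[0, 2, 5])

# Stand-in for a dense vectorizer output: one row per row of `left`.
features = np.array([[0.1, 0.0], [0.0, 0.7], [0.3, 0.3]])

# The default RangeIndex (0, 1, 2) does not match (0, 2, 5): rows misalign.
misaligned = pd.concat(
    [left, pd.DataFrame(features, columns=["a", "b"])], axis=1
)

# Reusing the original index keeps every row paired with its own features.
aligned = pd.concat(
    [left, pd.DataFrame(features, columns=["a", "b"], index=left.index)],
    axis=1,
)

print(misaligned.isna().sum().sum(), "missing values when misaligned")
print(aligned.isna().sum().sum(), "missing values when aligned")
```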

In [8]:
df_tfidf.head(3)

Unnamed: 0,description,telecommuting,has_company_logo,has_questions,fraudulent,Contract,Full-time,Other,Part-time,Temporary,...,write,writing,written,written communication,written verbal,year,year experience,you’ll,Unnamed: 20,–
0,"Food52, a fast-growing, James Beard Award-winn...",0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Organised - Focused - Vibrant - Awesome!Do you...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Our client, located in Houston, is actively se...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We have now appended our TF-IDF results to our dataframe, so we can drop the original description column.

In [9]:
df_tfidf = df_tfidf.drop(['description'], axis = 1)

In [64]:
df_tfidf = df_tfidf.dropna()

Now let's do a similar procedure with a Count Vectorizer, so we can compare the two vectorizers in performance later on.

In [14]:
#Instantiating our CountVectorizer
count_vect = CountVectorizer(tokenizer = tokenizer, min_df = 0.05, ngram_range=(1,3))
#Fit_transform our description 
count_vect_features = count_vect.fit_transform(df_2['description']) #this will create a sparse matrix

In [15]:
count_vect_df = pd.DataFrame(count_vect_features.todense(), columns = count_vect.get_feature_names())

df_count_vect = pd.concat([df_2, count_vect_df], axis = 1)
df_count_vect = df_count_vect.drop(['description'], axis = 1)

In [97]:
df_count_vect = df_count_vect.dropna()

In [16]:
df_count_vect.head(3)

Unnamed: 0,telecommuting,has_company_logo,has_questions,fraudulent,Contract,Full-time,Other,Part-time,Temporary,Associate,...,write,writing,written,written communication,written verbal,year,year experience,you’ll,Unnamed: 20,–
0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Great, we now have two different dataframes, each with a different vectorizer preprocessing our description data. I will hold off on the PCA and only apply it if the modeling takes too long.

**I will conduct the following steps:**
1. Logistic Regression w/ Tfidf
2. Logistic Regression w/ Count Vectorizer
3. I will evaluate both models and, for simplicity's sake, pick the superior vectorizer for the other models I would like to run.
4. KNearestNeighbors

# Model 1 - Logistic Regression w/ Tfidf

In [85]:
target = df_tfidf.fraudulent
features = df_tfidf.drop(['fraudulent'], axis = 1)

#Spliting our Data into train and holdout sets to test our models
X_train, X_hold, y_train, y_hold = train_test_split(features, target, test_size = 0.1,
                                                    stratify = target, random_state = 42)
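
Because fraudulent postings are rare, the `stratify` argument matters here. The following is a small sketch on synthetic labels (made up for illustration, with roughly the same imbalance as the holdout report further down) showing that a stratified split keeps a similar positive rate in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, heavily imbalanced labels (3.5% positives); illustration only.
y = np.array([1] * 35 + [0] * 965)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_ho, y_tr, y_ho = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=42
)

# With stratify=y, both splits keep roughly the same positive rate.
print(round(y_tr.mean(), 3), round(y_ho.mean(), 3))
```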

In [86]:
log_reg = LogisticRegression()
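#I want to optimize the C value and the penalty type
c_values = [.00001, .0001, .001, .1, 1, 10, 100, 1000, 10000]
#Note: the default 'lbfgs' solver only supports 'l2', so the 'l1' candidates fail to fit and score NaN
penalty_options = ['l1','l2']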

param_grid = dict(C = c_values, penalty = penalty_options)

In [87]:
grid = GridSearchCV(log_reg, param_grid= param_grid, cv = 10, scoring = 'roc_auc', n_jobs = -1)
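
One thing to keep in mind with this grid: scikit-learn's default `lbfgs` solver does not support the `l1` penalty, so those candidates cannot be fitted with it. A minimal sketch on synthetic data, assuming the `liblinear` solver (which supports both `l1` and `l2`) is an acceptable substitute; the data and the shortened `C` grid are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the real features.
X, y = make_classification(
    n_samples=400, n_features=20, weights=[0.9, 0.1], random_state=0
)

# liblinear handles both 'l1' and 'l2', so every grid cell can be fitted.
log_reg_demo = LogisticRegression(solver="liblinear")
param_grid_demo = {"C": [0.01, 0.1, 1, 10], "penalty": ["l1", "l2"]}

grid_demo = GridSearchCV(
    log_reg_demo, param_grid=param_grid_demo, cv=5, scoring="roc_auc"
)
grid_demo.fit(X, y)

print(grid_demo.best_params_)
```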

In [88]:
grid.fit(X_train, y_train)

GridSearchCV(cv=10, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='lbfgs',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'C': [1e-05, 0.0001, 0.001, 0.1, 1, 10, 100, 1000,
                               10000],
                         'penalty': ['l1', 'l2']},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='roc_auc', verbose=0)

In [89]:
print(grid.best_score_)
print(grid.best_params_)

0.8520861054296776
{'C': 1, 'penalty': 'l2'}


In [90]:
log_reg_pred = grid.predict(X_hold)

In [94]:
print(roc_auc_score(y_hold, log_reg_pred))

0.5801411682352026


In [95]:
print(classification_report(y_hold, log_reg_pred))

              precision    recall  f1-score   support

         0.0       0.97      1.00      0.98      1196
         1.0       0.70      0.16      0.26        43

    accuracy                           0.97      1239
   macro avg       0.84      0.58      0.62      1239
weighted avg       0.96      0.97      0.96      1239



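Our Logistic Regression, optimized for the C value and the penalty, returned an AUC of only 0.58 on the holdout set, well below the cross-validated 0.85. This suggests the model is overfitting, although the two numbers are not directly comparable: the holdout AUC is computed from hard class predictions (`predict`) rather than predicted probabilities, so part of the gap may come from that. Either way, recall on the fraudulent class is only 0.16.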
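
To illustrate why AUC computed from hard labels can differ from AUC computed from scores, here is a small sketch with made-up labels and scores (not taken from the model above):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up data: both positives are ranked above every negative, but only
# one of them crosses the 0.5 threshold.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.10, 0.20, 0.20, 0.30, 0.40, 0.80])

# Hard labels, which is what `predict` returns for a 0.5 cut-off.
y_label = (y_score >= 0.5).astype(int)

auc_from_scores = roc_auc_score(y_true, y_score)
auc_from_labels = roc_auc_score(y_true, y_label)

print(auc_from_scores, auc_from_labels)  # 1.0 0.75
```

With a fitted estimator, the scores would come from `predict_proba(X_hold)[:, 1]` (or `decision_function`), which is also what the `roc_auc` scorer inside `GridSearchCV` uses.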

# Model 2 - Logistic Regression w/ Count Vectorizer

In [98]:
target_2 = df_count_vect.fraudulent
features_2 = df_count_vect.drop(['fraudulent'], axis = 1)

#Spliting our Data into train and holdout sets to test our models
X_train_2, X_hold_2, y_train_2, y_hold_2 = train_test_split(features_2, target_2, test_size = 0.1,
                                                    stratify = target_2, random_state = 42)

In [99]:
#Instantiating a new grid search with our previous logistic regression model, using the count vectorizer dataset
grid_count_vect = GridSearchCV(log_reg, param_grid= param_grid, cv = 10, scoring = 'roc_auc', n_jobs = -1)

In [100]:
grid_count_vect.fit(X_train_2, y_train_2)
print(grid_count_vect.best_score_)
print(grid_count_vect.best_params_)

0.7766003404164705
{'C': 0.1, 'penalty': 'l2'}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
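
The warning above suggests scaling the data or raising `max_iter`. Here is a sketch of doing both inside a pipeline, on synthetic count-like data; `MaxAbsScaler` and `max_iter=1000` are illustrative choices, not settings used in this notebook:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler

# Synthetic non-negative counts standing in for CountVectorizer output.
rng = np.random.default_rng(0)
X = rng.poisson(lam=3.0, size=(300, 15)).astype(float)
y = (X[:, 0] + X[:, 1] + rng.normal(size=300) > 7).astype(int)

# Scaling happens inside the pipeline, so it is re-fitted on each CV fold.
pipe = Pipeline([
    ("scale", MaxAbsScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Pipeline parameters are addressed as <step name>__<parameter>.
grid_scaled = GridSearchCV(
    pipe, {"clf__C": [0.1, 1, 10]}, cv=5, scoring="roc_auc"
)
grid_scaled.fit(X, y)

print(grid_scaled.best_params_, round(grid_scaled.best_score_, 3))
```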


In [103]:
log_reg_pred_2 = grid_count_vect.predict(X_hold_2)

In [104]:
print(roc_auc_score(y_hold_2, log_reg_pred_2))
print(classification_report(y_hold_2, log_reg_pred_2))

0.557721474683052
              precision    recall  f1-score   support

         0.0       0.97      1.00      0.98      1196
         1.0       0.83      0.12      0.20        43

    accuracy                           0.97      1239
   macro avg       0.90      0.56      0.59      1239
weighted avg       0.96      0.97      0.96      1239



The Count Vectorizer did not improve on the previous model (holdout AUC of 0.56 versus 0.58), so I will stick with the TF-IDF data.

# Model 3 - KNearestNeighbors

In [106]:
# Model - KNearestNeighbors
knn = KNeighborsClassifier(n_neighbors=5)

In [111]:
k_range = list(np.arange(2,32,2))
param_grid_knn = dict(n_neighbors=k_range)
print(param_grid_knn)

{'n_neighbors': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30]}


In [112]:
grid_knn = GridSearchCV(knn, param_grid_knn, cv=10, scoring='roc_auc',
                        n_jobs = -1)

In [114]:
grid_knn.fit(X_train, y_train)
print(grid_knn.best_score_)
print(grid_knn.best_params_)

0.9681148223918136
{'n_neighbors': 8}


Wow. Judging by the cross-validation score alone, our KNearestNeighbors model already looks much better. Now let's test it with our holdout data.

In [115]:
#predicting our holdout data
knn_pred = grid_knn.predict(X_hold)
#Printing out our evaluation metrics
print(roc_auc_score(y_hold, knn_pred))
print(classification_report(y_hold, knn_pred))

0.6964202380026444
              precision    recall  f1-score   support

         0.0       0.98      1.00      0.99      1196
         1.0       0.85      0.40      0.54        43

    accuracy                           0.98      1239
   macro avg       0.91      0.70      0.76      1239
weighted avg       0.97      0.98      0.97      1239



Impressive. On the holdout data, our KNN model improves on our best logistic regression model by about 12 percentage points (0.70 versus 0.58). The KNN is clearly superior here.

# Model 4 - Support Vector Classification

In [131]:
#Instantiating our SVM model
svc = SVC(kernel = 'rbf', gamma = 'auto' )

# I won't use a grid search because SVMs usually take a long time to fit. I will just use a simple SVC
# and see how it plays out
svc.fit(X_train, y_train)

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
    max_iter=-1, probability=False, random_state=None, shrinking=True,
    tol=0.001, verbose=False)

In [132]:
svc.score(X_train, y_train)

0.9656471432415463

In [133]:
#predicting our holdout data
svc_pred = svc.predict(X_hold)
#Printing out our evaluation metrics
print(roc_auc_score(y_hold, svc_pred))
print(classification_report(y_hold, svc_pred))

0.5
              precision    recall  f1-score   support

         0.0       0.97      1.00      0.98      1196
         1.0       0.00      0.00      0.00        43

    accuracy                           0.97      1239
   macro avg       0.48      0.50      0.49      1239
weighted avg       0.93      0.97      0.95      1239





# Model 5 - Random Forest

In [135]:
#Instantiating our random forest

rf = RandomForestClassifier()

In [136]:
#The parameters we want to tune with our random forest
n_estimators_range = [1, 2, 4, 8, 16, 32, 64, 100, 200]

param_grid_rf = dict(n_estimators=n_estimators_range)
print(param_grid_rf)

{'n_estimators': [1, 2, 4, 8, 16, 32, 64, 100, 200]}


In [137]:
grid_rf = GridSearchCV(rf, param_grid_rf, cv=10, scoring='roc_auc',
                        n_jobs = -1)

In [138]:
grid_rf.fit(X_train, y_train)
print(grid_rf.best_score_)
print(grid_rf.best_params_)

0.8067208847452003
{'n_estimators': 200}


In [146]:
rf_pred = grid_rf.predict(X_hold)
#Printing out our evaluation metrics
print(roc_auc_score(y_hold, rf_pred))
print(classification_report(y_hold, rf_pred))

0.5232558139534884
              precision    recall  f1-score   support

         0.0       0.97      1.00      0.98      1196
         1.0       1.00      0.05      0.09        43

    accuracy                           0.97      1239
   macro avg       0.98      0.52      0.54      1239
weighted avg       0.97      0.97      0.95      1239



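Quite disappointed with the last two models. Neither the SVC nor the random forest did well: the SVC predicted no fraudulent postings at all (AUC of 0.5), and the random forest caught only 5% of them.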
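
Models that predict the majority class almost exclusively are a common symptom of class imbalance. One option that was not tried in this notebook is `class_weight='balanced'`, which both `SVC` and `RandomForestClassifier` accept. The sketch below uses synthetic data and is offered as something to experiment with, not as a guaranteed fix:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic data with about 5% positives, standing in for the real features.
X, y = make_classification(
    n_samples=1000, n_features=20, weights=[0.95, 0.05], random_state=0
)
X_tr, X_ho, y_tr, y_ho = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Same kernel settings as above, with and without class re-weighting.
plain = SVC(kernel="rbf", gamma="auto").fit(X_tr, y_tr)
weighted = SVC(kernel="rbf", gamma="auto", class_weight="balanced").fit(X_tr, y_tr)

# Recall on the rare class is the number to watch here.
recall_plain = recall_score(y_ho, plain.predict(X_ho))
recall_weighted = recall_score(y_ho, weighted.predict(X_ho))
print(recall_plain, recall_weighted)
```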

# Model 6 - Neural Net MLPClassifier

In [177]:
#Instantiate our MLPClassifier
mlp = MLPClassifier(solver='lbfgs', 
                    activation = 'relu',
                   hidden_layer_sizes = (100,50,20), 
                    max_iter = 1000)


In [178]:
mlp.fit(X_train, y_train)

MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100, 50, 20), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=1000,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=None, shuffle=True, solver='lbfgs',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

In [179]:
mlp_pred = mlp.predict(X_hold)

#Printing out our evaluation metrics
print(roc_auc_score(y_hold, mlp_pred))
print(classification_report(y_hold, mlp_pred))

0.7383332037022633
              precision    recall  f1-score   support

         0.0       0.98      0.99      0.98      1196
         1.0       0.60      0.49      0.54        43

    accuracy                           0.97      1239
   macro avg       0.79      0.74      0.76      1239
weighted avg       0.97      0.97      0.97      1239



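Finally, our MLP classifier does the BEST JOB, with an AUC score of 0.738. That is about 4 percentage points better than KNN (0.696), our best model so far, and about 16 points better than the baseline logistic regression (0.580).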