# Part 3 - Modeling

In this notebook we will focus on the full modeling process. I will do some preprocessing and hyperparameter tuning, then fit our data to a variety of models to determine which one performs best.

In [51]:
#importing the holy trinity of data science packages
import pandas as pd 
pd.options.display.max_rows = 4000
import numpy as np
import matplotlib.pyplot as plt

#Other visualization packages
import seaborn as sns

#Importing NLP plugins
from nltk.corpus import stopwords 
stop_words = stopwords.words('english')
from nltk.stem import WordNetLemmatizer 
import string

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

#Importing our Sklearn Plugins
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn import preprocessing
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

#importing our models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

#Model Evaluation
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

## Feature Engineering

We need to do some feature engineering. I would like to one-hot encode my categorical data and fit a TF-IDF vectorizer to my text column. I may also try a Count Vectorizer to see whether it changes model performance. In addition, I may fit a PCA to reduce computation time.

**Next Steps:**

1. One Hot Encode Categorical Data
2. Fit in a TFIDF Vectorizer
3. Fit in a Count Vectorizer
4. Determine if using a PCA would help. 
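
As a rough illustration of step 4, here is a minimal sketch of dimensionality reduction on vectorizer output. It swaps in `TruncatedSVD` for PCA, since it accepts sparse matrices directly and does not center the data; the mini-corpus and the choice of two components are assumptions made for the example, not part of this project.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus standing in for the job descriptions
docs = [
    "earn money fast from home",
    "senior software engineer wanted",
    "work from home and earn cash",
    "software engineer for data team",
]

# The vectorizer returns a SciPy sparse matrix (4 documents x 15 terms here)
sparse_features = TfidfVectorizer().fit_transform(docs)

# n_components must be smaller than the number of features
svd = TruncatedSVD(n_components=2, random_state=42)
reduced = svd.fit_transform(sparse_features)  # dense array, one row per document

# Share of variance kept by the retained components
explained = svd.explained_variance_ratio_.sum()
print(reduced.shape, round(float(explained), 3))
```

On real data the number of components would normally be chosen by looking at how the explained variance grows as components are added.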

In [2]:
#Import our dataset
df = pd.read_csv('data/cleaned_data.csv', index_col = 0)

In [3]:
#Make Copy
df_2 = df.copy()

# One Hot Encoding using Pandas get dummies function
columns_to_1_hot = ['employment_type','required_experience','required_education',
                   'industry', 'function']

for column in columns_to_1_hot:
    encoded = pd.get_dummies(df_2[column])
    df_2 = pd.concat([df_2, encoded], axis = 1)
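
One thing to watch with unprefixed dummy columns is that two source columns can share a category label, which yields duplicate column names. The sketch below uses a hypothetical two-row frame to show the effect and how passing `columns=` to `pd.get_dummies` prefixes each dummy with its source column. Whether such collisions occur in the real dataset is an assumption I have not verified.

```python
import pandas as pd

# Hypothetical toy frame: 'Other' appears in both categorical columns
toy = pd.DataFrame({
    "employment_type": ["Other", "Full-time"],
    "function": ["Sales", "Other"],
    "fraudulent": [0, 1],
})
categorical = ["employment_type", "function"]

# Encoding column by column without prefixes can create duplicate names
unprefixed = pd.concat([pd.get_dummies(toy[col]) for col in categorical], axis=1)
print(unprefixed.columns.is_unique)  # False: 'Other' appears twice

# Encoding in one call prefixes each dummy and drops the source columns
prefixed = pd.get_dummies(toy, columns=categorical)
print(list(prefixed.columns))
```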

For simplicity's sake, I will also drop the *title* & *location* columns.

In [4]:
columns_to_1_hot += ['title', 'location']
    
#dropping the original columns that we just one hot encoded, along with title and location
df_2 = df_2.drop(columns_to_1_hot, axis = 1)
df_2.head(3)

Unnamed: 0,description,telecommuting,has_company_logo,has_questions,fraudulent,Contract,Full-time,Other,Part-time,Temporary,...,Public Relations,Purchasing,Quality Assurance,Research,Sales,Science,Strategy/Planning,Supply Chain,Training,Writing/Editing
0,"Food52, a fast-growing, James Beard Award-winn...",0,1,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Organised - Focused - Vibrant - Awesome!Do you...,0,1,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"Our client, located in Houston, is actively se...",0,1,0,0,0,1,0,0,0,...,0,0,0,0,1,0,0,0,0,0


### Handling the description column 

First of all, we need a custom tokenizer to clean up our text data a little bit.

In [5]:
def tokenizer(text):
    
    #All characters in this string will be converted to lowercase
    text = text.lower()
    
    #Removing sentence punctuations
    for punctuation_mark in string.punctuation:
        text = text.replace(punctuation_mark,'')
    
    #Creating our list of tokens
    list_of_tokens = text.split(' ')
    #Creating our cleaned tokens list 
    cleaned_tokens = []
    #Instantiating our Lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    #Removing stop words from our list of tokens, along with any tokens that happen to be empty strings
    for token in list_of_tokens:
        if (token not in stop_words) and (token != ''):
            #lemmatizing our token
            token_lemmatized = lemmatizer.lemmatize(token)
            #appending our finalized cleaned token
            cleaned_tokens.append(token_lemmatized)
    
    return cleaned_tokens
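
To see what the cleaning steps do to a sentence, here is a simplified, dependency-free sketch. It uses a tiny hypothetical stop-word set instead of NLTK's list, skips lemmatization, and splits on any whitespace rather than single spaces, so it only approximates the tokenizer above.

```python
import string

# Hypothetical, tiny stop-word set; the notebook uses NLTK's English list instead
SAMPLE_STOP_WORDS = {"a", "the", "is", "for", "and"}

def simple_tokenizer(text):
    """Lowercase, strip ASCII punctuation, split on whitespace, drop stop words."""
    # str.translate removes every character listed in string.punctuation
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    # split() with no argument never yields empty strings
    return [token for token in text.split() if token not in SAMPLE_STOP_WORDS]

tokens = simple_tokenizer("Food52 is a fast-growing, award-winning team!")
print(tokens)  # ['food52', 'fastgrowing', 'awardwinning', 'team']
```

Two side effects are worth knowing about. Removing hyphens fuses compound words such as "fast-growing" into a single token. Also, `string.punctuation` only covers ASCII characters, so typographic marks such as ’ and – pass through untouched and can end up as features.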

In [6]:
#Instantiating our tfidf vectorizer
tfidf = TfidfVectorizer(tokenizer = tokenizer, min_df = 0.05, ngram_range=(1,3))
#Fit_transform our description 
tfidf_features = tfidf.fit_transform(df_2['description']) #this will create a sparse matrix

I want to append this sparse matrix to the original pandas DataFrame.

In [7]:
tfidf_vect_df = pd.DataFrame(tfidf_features.todense(), columns = tfidf.get_feature_names())

df_tfidf = pd.concat([df_2, tfidf_vect_df], axis = 1)

In [8]:
df_tfidf.head(3)

Unnamed: 0,description,telecommuting,has_company_logo,has_questions,fraudulent,Contract,Full-time,Other,Part-time,Temporary,...,write,writing,written,written communication,written verbal,year,year experience,you’ll,Unnamed: 20,–
0,"Food52, a fast-growing, James Beard Award-winn...",0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Organised - Focused - Vibrant - Awesome!Do you...,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,"Our client, located in Houston, is actively se...",0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


We have now appended our TF-IDF results to our dataframe, so we need to drop the description column.

In [9]:
df_tfidf = df_tfidf.drop(['description'], axis = 1)

In [64]:
df_tfidf = df_tfidf.dropna()
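
One way NaN rows can appear after a column-wise `pd.concat` is index misalignment: `concat` pairs rows by index label, not by position. The sketch below uses two hypothetical frames to show the effect. Whether this is what produced the NaN values in this notebook is an assumption, although the integer flags turning into floats in the preview above would be consistent with it.

```python
import pandas as pd

# Hypothetical frame whose index has gaps, as happens after rows are dropped during cleaning
left = pd.DataFrame({"fraudulent": [0, 1, 0]}, index=[0, 2, 5])
# Hypothetical vectorizer output: built from an array, so it gets a fresh 0..n-1 index
right = pd.DataFrame({"tfidf_word": [0.1, 0.2, 0.3]})

# Labels 1 and 5 exist on only one side, so each produces a row with a NaN
misaligned = pd.concat([left, right], axis=1)

# Reusing the left index keeps rows paired by position
right_aligned = right.copy()
right_aligned.index = left.index
aligned = pd.concat([left, right_aligned], axis=1)

print(misaligned.shape, int(misaligned.isna().sum().sum()))  # (4, 2) 2
print(aligned.shape, int(aligned.isna().sum().sum()))        # (3, 2) 0
```

If that is the cause, passing `index=df_2.index` when building the vectorizer dataframe would avoid losing rows to `dropna`.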

Now let's do a similar procedure with a Count Vectorizer, so we can compare the two vectorizers in performance later on.

In [14]:
#Instantiating our CountVectorizer
count_vect = CountVectorizer(tokenizer = tokenizer, min_df = 0.05, ngram_range=(1,3))
#Fit_transform our description 
count_vect_features = count_vect.fit_transform(df_2['description']) #this will create a sparse matrix

In [15]:
count_vect_df = pd.DataFrame(count_vect_features.todense(), columns = count_vect.get_feature_names())

df_count_vect = pd.concat([df_2, count_vect_df], axis = 1)
df_count_vect = df_count_vect.drop(['description'], axis = 1)

In [16]:
df_count_vect.head(3)

Unnamed: 0,telecommuting,has_company_logo,has_questions,fraudulent,Contract,Full-time,Other,Part-time,Temporary,Associate,...,write,writing,written,written communication,written verbal,year,year experience,you’ll,Unnamed: 20,–
0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Great, we now have two different dataframes, each with a different vectorizer preprocessing our description data. I will hold off on the PCA and only apply it if the modeling takes too long.
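
To make the difference between the two representations concrete, here is a small sketch on a hypothetical three-document corpus, using default vectorizer settings rather than the custom tokenizer and `min_df` from above. The Count Vectorizer stores raw term counts, while the TF-IDF vectorizer reweights them and, by default, scales each row to unit L2 norm.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical mini-corpus
docs = ["earn money earn cash", "software engineer", "earn a software salary"]

count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(docs)
weights = TfidfVectorizer().fit_transform(docs)

# Both vectorizers learn the same vocabulary, so the shapes match
print(counts.shape == weights.shape)

# Raw count of 'earn' in the first document
earn_column = count_vectorizer.vocabulary_["earn"]
earn_count = int(counts[0, earn_column])

# TF-IDF rows are L2-normalized by default
row_norms = np.linalg.norm(weights.toarray(), axis=1)
print(earn_count, row_norms)
```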

**I will conduct the following steps:**
1. Logistic Regression w/ Tfidf
2. Logistic Regression w/ Count Vectorizer
3. I will evaluate both models, determine which is better, and for simplicity's sake pick the superior vectorizer for the other models I would like to run.
4. KNearestNeighbors

# Model 1 - Logistic Regression w/ Tfidf

In [65]:
target = df_tfidf.fraudulent
features = df_tfidf.drop(['fraudulent'], axis = 1)

#Splitting our Data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size = 0.2,
                                                    stratify = target, random_state = 42)

In [72]:
log_reg = LogisticRegression()
#I want to optimize the C-Value
c_values = [.00001, .0001, .001, .1, 1, 10, 100, 1000, 10000]

param_grid = dict(C = c_values)

In [73]:
grid = GridSearchCV(log_reg, param_grid= param_grid, cv = 10, scoring = 'roc_auc', n_jobs = -1)

In [74]:
#Fitting on the training split only, so the test set stays unseen for the final evaluation
grid.fit(X_train, y_train)

GridSearchCV(cv=10, error_score=nan,
             estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                          fit_intercept=True,
                                          intercept_scaling=1, l1_ratio=None,
                                          max_iter=100, multi_class='auto',
                                          n_jobs=None, penalty='l2',
                                          random_state=None, solver='lbfgs',
                                          tol=0.0001, verbose=0,
                                          warm_start=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'C': [1e-05, 0.0001, 0.001, 0.1, 1, 10, 100, 1000,
                               10000]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='roc_auc', verbose=0)
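
Once a grid search has been fitted, the chosen hyperparameter and its cross-validated score can be read from `best_params_` and `best_score_`, and because `refit=True` the search object itself predicts with the best estimator. The sketch below shows this on synthetic, imbalanced data from `make_classification`; the data, the shorter C grid, and `max_iter=1000` are assumptions for the example, not results from this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the real features: roughly 10% positive class
X, y = make_classification(n_samples=300, n_features=10, weights=[0.9, 0.1],
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y,
                                          random_state=42)

c_grid = [0.01, 0.1, 1, 10]
search = GridSearchCV(LogisticRegression(max_iter=1000), {"C": c_grid},
                      cv=5, scoring="roc_auc")
search.fit(X_tr, y_tr)  # tune on the training split only

best_c = search.best_params_["C"]   # C value with the highest mean CV score
cv_auc = search.best_score_         # mean cross-validated ROC AUC for that C

# Score the refitted best model on the held-out test split
test_auc = roc_auc_score(y_te, search.predict_proba(X_te)[:, 1])
print(best_c, round(cv_auc, 3), round(test_auc, 3))
```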

In [39]:
# Model - KNearestNeighbors
knn = KNeighborsClassifier(n_neighbors=5)

In [40]:
k_range = list(range(1, 31))
param_grid = dict(n_neighbors=k_range)
print(param_grid)

{'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]}


In [41]:
grid = GridSearchCV(knn, param_grid, cv=10, scoring='accuracy', return_train_score=False)

In [42]:
grid.fit(features, target)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
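
This error means at least one feature value is missing at fit time. A minimal sketch of how to locate and filter such rows before fitting follows; the small frame and its column names are hypothetical stand-ins for `features` and `target`.

```python
import numpy as np
import pandas as pd

# Hypothetical feature frame with one missing value
features_demo = pd.DataFrame({
    "has_company_logo": [1.0, 0.0, np.nan],
    "tfidf_work": [0.2, 0.0, 0.5],
})
target_demo = pd.Series([0, 1, 0])

# Count missing values per column to see where they come from
nan_per_column = features_demo.isna().sum()
print(nan_per_column)

# Keep only rows with no missing feature, and filter the target with the same mask
rows_ok = features_demo.notna().all(axis=1)
clean_features = features_demo[rows_ok]
clean_target = target_demo[rows_ok]
print(len(clean_features), len(clean_target))
```

Filtering features and target with the same mask keeps them aligned, which matters if the missing values are spread unevenly across rows.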