# <u>Train a baseline model</u>

In this section we train a model without any particular hyperparameter tuning. It will serve as our reference model.

In [45]:
import importlib.util
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
import plotly.express as px
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

## ***Import and apply the preprocessing pipeline***

In [3]:
# Load the preprocessing module under an alias: its file name starts with a digit, so a regular `import` statement cannot be used.
path = '2-preprocessing.py'
module_name = 'preprocessing'
spec = importlib.util.spec_from_file_location(module_name, path)
preprocessing_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(preprocessing_module)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_reduced['requirements'] = data_reduced['requirements'].astype(str).apply(preprocessor.preprocessing)


In [21]:
# Load the data
path_data = 'Data/Fake_Real_Job_Posting.csv'
data_full = pd.read_csv(path_data)

# Keep only the data where the requirements field is not missing (not "Not Mentioned")
data_reduced = data_full[data_full['requirements'] != "Not Mentioned"]

In [22]:
# instantiate the preprocessing class
preprocessor = preprocessing_module.PreprocessingClass()

# apply the preprocessing function to the "requirements" field
data_reduced['clean_requirements'] = data_reduced['requirements'].astype(str).apply(preprocessor.preprocessing)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_reduced['clean_requirements'] = data_reduced['requirements'].astype(str).apply(preprocessor.preprocessing)
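
The `SettingWithCopyWarning` shown above appears because `data_reduced` is a filtered slice of `data_full`. One common way to avoid it is to take an explicit copy of the filtered frame before adding columns. The sketch below illustrates this on a toy DataFrame; the `toy_*` names and values are hypothetical stand-ins, and `.str.lower()` merely replaces the real preprocessing function.

```python
import pandas as pd

# Toy stand-in for the job postings data (hypothetical values).
toy_full = pd.DataFrame({
    "requirements": ["Python, SQL", "Not Mentioned", "Forklift licence"],
    "fraudulent": ["Real", "Fake", "Real"],
})

# .copy() makes the filtered frame independent of toy_full,
# so adding a column to it is unambiguous for pandas.
toy_reduced = toy_full[toy_full["requirements"] != "Not Mentioned"].copy()

# Placeholder for the real preprocessing step.
toy_reduced["clean_requirements"] = toy_reduced["requirements"].str.lower()

print(toy_reduced)
```

The original frame is left untouched, and the new column exists only on the copy.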


In [23]:
data_reduced['clean_requirements']

0        year experience ux ui design portfolio contain...
1        food graduate similar disciplineadvanced haccp...
2        job duty responsibility analysisperform root c...
3        international broadcaster shall least five 5 y...
4        experience professional environmentsare net na...
                               ...                        
17875    essential relational database theory understan...
17876    need somebody really love adwords google servi...
17877    requires ability become forklift able effectiv...
17878    education experiencebachelor degree physic mat...
17879    fluency englishsimilar professional experience...
Name: clean_requirements, Length: 15185, dtype: object

## ***Feature Extraction***

In this part we vectorize the texts, i.e. turn each one into a numeric feature vector. We use TF-IDF, which weights each word by how frequent it is in a text and how rare it is across the corpus. It does not take the meaning of words into account, but this should be sufficient for our analysis of fraudulent job postings.

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()  # instantiate a TF-IDF vectorizer
X_tfidf = tfidf.fit_transform(data_reduced['clean_requirements'])  # learn the vocabulary and transform the texts
X_tfidf

<15185x43458 sparse matrix of type '<class 'numpy.float64'>'
	with 727677 stored elements in Compressed Sparse Row format>
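
To make the TF-IDF weighting concrete, here is a small sketch on a toy corpus (the three sentences are invented for illustration). It shows that, within a single document, a word found in every document gets a lower weight than a word found in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented toy corpus: "experience" is in every document,
# "python" in two of them, "sql" in only one.
toy_corpus = [
    "experience python sql",
    "experience sales",
    "experience python",
]

toy_tfidf = TfidfVectorizer()
toy_matrix = toy_tfidf.fit_transform(toy_corpus)

# Look at the weights of the first document only.
vocab = toy_tfidf.vocabulary_
first_doc = toy_matrix[0].toarray().ravel()
for term in ("experience", "python", "sql"):
    print(term, round(first_doc[vocab[term]], 3))
```

Each of the three words occurs once in the first document, so the differences in weight come only from how rare the word is in the rest of the corpus.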

## ***Label encoder and train/test split***

In [32]:
label_encoder = LabelEncoder() # instantiate a label encoder
data_reduced['fraudulent_enc'] = label_encoder.fit_transform(data_reduced['fraudulent']) # fit and transform the encoder on labels

# Split the data into training and testing sets
X = X_tfidf # preprocessed requirements
y = data_reduced['fraudulent_enc'] # labels 

test_size=0.2 
random_state=42

# use the train_test_split function with the above test size and random seed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  data_reduced['fraudulent_enc'] = label_encoder.fit_transform(data_reduced['fraudulent']) # fit and transform the encoder on labels


In [33]:
print(X.shape)
print(y.shape)

(15185, 43458)
(15185,)


In [34]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(12148, 43458)
(3037, 43458)
(12148,)
(3037,)
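
Because fraudulent postings are a small minority, a purely random split can shift the class ratio slightly between the training and test sets. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below uses synthetic data (the `toy_*` arrays are assumptions made for illustration) to show that the class ratio is preserved in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: 5% positives (hypothetical, for illustration).
rng = np.random.default_rng(42)
toy_X = rng.normal(size=(1000, 3))
toy_y = np.array([1] * 50 + [0] * 950)

# stratify=toy_y keeps the class proportions the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print("positive rate in train:", y_tr.mean())
print("positive rate in test: ", y_te.mean())
```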


## ***Model baseline***  

In this section we try three different models to predict fraudulent job postings and choose one to keep as a baseline. Since this is a binary classification problem (is a job posting fraudulent or not?), logistic regression is a good starting point: it is a simple classifier that can be effective at detecting fraudulent job postings. We also test Naive Bayes and a Support Vector Machine, two other classifiers that are reasonable choices for fraud detection.

### <u>Logistic Regression</u>

In [42]:
class_labels = ["Fake", "Real"]

# styling the confusion matrix
confusion_matrix_kwargs = dict(
    text_auto=True, 
    title="Confusion Matrix", width=1000, height=800,
    labels=dict(x="Predicted", y="True Label"),
    x=class_labels,
    y=class_labels,
    color_continuous_scale='Blues'
)

def report(y_true, y_pred, class_labels):

    # print a classification report of the predictions # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn-metrics-classification-report
    print(classification_report(y_true, y_pred, target_names=class_labels))
    # create a confusion matrix and pass it to imshow to visualize it # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix
    # (the confusion_matrix_kwargs are here for styling only)
    confusion_matrix_data = confusion_matrix(y_true, y_pred, labels=label_encoder.transform(class_labels)) # --> labels in int
    fig = px.imshow(
        confusion_matrix_data, 
        **confusion_matrix_kwargs
        )
    fig.show()

In [43]:
# Instantiate the logistic regression model
logistic_regression_model = LogisticRegression(max_iter=1000)

# Fit the model to the training data
logistic_regression_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred = logistic_regression_model.predict(X_test)

# Display the classification report and confusion matrix
report(y_test, y_pred, class_labels)


              precision    recall  f1-score   support

        Fake       0.97      0.22      0.36       157
        Real       0.96      1.00      0.98      2880

    accuracy                           0.96      3037
   macro avg       0.97      0.61      0.67      3037
weighted avg       0.96      0.96      0.95      3037



### <u>Naive Bayes</u>

In [49]:
# Instantiate a Gaussian Naive Bayes model
naive_bayes_model = GaussianNB()

# Fit the model to the training data
naive_bayes_model.fit(X_train.toarray(), y_train)

# Predictions on the test data
y_pred_NB = naive_bayes_model.predict(X_test.toarray())

# Display the classification report and confusion matrix
report(y_test, y_pred_NB, class_labels)

              precision    recall  f1-score   support

        Fake       0.42      0.68      0.52       157
        Real       0.98      0.95      0.96      2880

    accuracy                           0.93      3037
   macro avg       0.70      0.81      0.74      3037
weighted avg       0.95      0.93      0.94      3037
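
`GaussianNB` needs dense input, which is why the cell above calls `.toarray()` on a matrix with more than 43,000 columns. `MultinomialNB`, which is imported at the top of the notebook but not used, accepts sparse TF-IDF matrices directly. The sketch below shows that alternative on an invented toy corpus; it is not the model evaluated above, and its results on the real data are not reported here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Invented toy data, for illustration only.
toy_texts = [
    "earn money fast no experience needed",
    "wire transfer fee required quick cash",
    "bachelor degree python experience required",
    "five years experience project management",
]
toy_labels = ["Fake", "Fake", "Real", "Real"]

toy_vec = TfidfVectorizer()
toy_X = toy_vec.fit_transform(toy_texts)  # stays a sparse matrix

# MultinomialNB is fitted on the sparse matrix, no .toarray() needed.
toy_nb = MultinomialNB()
toy_nb.fit(toy_X, toy_labels)

toy_pred = toy_nb.predict(toy_vec.transform(["quick cash no experience"]))
print(toy_pred)
```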



### <u>Support Vector Machine</u>

In [50]:
# Instantiate an SVM model with a linear kernel
svm_model = SVC(kernel='linear', C=1.0, random_state=42)

# Fit the model to the training data
svm_model.fit(X_train, y_train)

# Make predictions on the test data
y_pred_SVM = svm_model.predict(X_test)

# Display the classification report and confusion matrix
report(y_test, y_pred_SVM, class_labels)

              precision    recall  f1-score   support

        Fake       0.99      0.47      0.64       157
        Real       0.97      1.00      0.99      2880

    accuracy                           0.97      3037
   macro avg       0.98      0.74      0.81      3037
weighted avg       0.97      0.97      0.97      3037



### <u>Conclusion</u>  

First of all, all three models have difficulty predicting "Fake" job postings. This can be seen in the confusion matrices and in the precision and recall for this class. Logistic regression and the SVM in particular have high precision but low recall on the Fake class (recall of 0.22 and 0.47 respectively). This means that these models are conservative: when they flag a posting as fake they are almost always right, but they miss many of the postings that really are fake. Naive Bayes detects more fraud (recall of 0.68 on the Fake class), but with low precision (0.42), and it is slightly weaker at predicting "Real" job postings. We therefore keep the SVM as our baseline model: it has good accuracy (0.97) and predicts fake job postings better than logistic regression (F1-score on the Fake class of 0.64 for the SVM vs. 0.36 for logistic regression).
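
As a sanity check on the F1-scores quoted above, the F1-score is the harmonic mean of precision and recall. The short sketch below recomputes it from the rounded precision and recall values in the three classification reports for the Fake class; the helper name `f1_from` is introduced here only for illustration.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) on the Fake class, copied from the reports above.
reported = {
    "Logistic Regression": (0.97, 0.22),
    "Naive Bayes": (0.42, 0.68),
    "SVM": (0.99, 0.47),
}

for name, (p, r) in reported.items():
    print(f"{name}: F1 = {f1_from(p, r):.2f}")
```

Because the harmonic mean is pulled towards the smaller of the two values, a very high precision cannot compensate for a low recall.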
It is not surprising that the models have difficulty predicting fraud, as the classes in the dataset are very imbalanced: the test set contains 157 Fake postings against 2880 Real ones. To improve the chosen model, it will be interesting to rebalance the classes by oversampling the minority class or undersampling the majority class.
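
As a rough illustration of rebalancing, the sketch below oversamples the minority class with `sklearn.utils.resample` on synthetic data (the `toy_*` arrays are hypothetical). This is only one possible approach, and it should be applied to the training set only, so that the test set keeps the real class distribution.

```python
import numpy as np
from sklearn.utils import resample

# Synthetic imbalanced training data: 5 positives, 95 negatives (hypothetical).
rng = np.random.default_rng(42)
toy_X = rng.normal(size=(100, 4))
toy_y = np.array([1] * 5 + [0] * 95)

# Separate the minority and majority classes.
minority = toy_y == 1
X_min, y_min = toy_X[minority], toy_y[minority]
X_maj, y_maj = toy_X[~minority], toy_y[~minority]

# Draw minority samples with replacement until both classes have the same size.
X_min_up, y_min_up = resample(
    X_min, y_min, replace=True, n_samples=len(y_maj), random_state=42
)

X_balanced = np.vstack([X_maj, X_min_up])
y_balanced = np.concatenate([y_maj, y_min_up])

print(np.bincount(y_balanced))
```

Undersampling works the same way in reverse: resample the majority class with `replace=False` and `n_samples=len(y_min)`.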