# <u>Train a baseline model</u>

In the previous notebook we selected a baseline model for predicting fraudulent job postings: the Support Vector Machine (SVM). In this section we try to improve it by balancing the classes.

In [1]:
import importlib.util
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix
import plotly.express as px
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.utils import resample
from imblearn.over_sampling import SMOTE

### ***Support Vector Machine model***

### preprocessing

In [2]:
# Load the preprocessing module under an alias: its file name starts with a digit, so it cannot be imported with a regular import statement.
path = '2-preprocessing.py'
module_name = 'preprocessing'
spec = importlib.util.spec_from_file_location(module_name, path)
preprocessing_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(preprocessing_module)

# Load the data
path_data = 'Data/Fake_Real_Job_Posting.csv'
data_full = pd.read_csv(path_data)

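# Keep only the rows where the requirements field is present (i.e. not "Not Mentioned").
# .copy() gives an independent DataFrame, so adding columns to it later does not raise SettingWithCopyWarning.
data_reduced = data_full[data_full['requirements'] != "Not Mentioned"].copy()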

# Instantiate the preprocessing class
preprocessor = preprocessing_module.PreprocessingClass()

# Apply the preprocessing function to the "requirements" field
data_reduced['clean_requirements'] = data_reduced['requirements'].astype(str).apply(preprocessor.preprocessing)



### feature extraction

In [3]:
tfidf = TfidfVectorizer() # TF-IDF vectorizer
X_tfidf = tfidf.fit_transform(data_reduced['clean_requirements']) # learn the vocabulary and transform the cleaned requirements into a sparse TF-IDF matrix
X_tfidf

<15185x43458 sparse matrix of type '<class 'numpy.float64'>'
	with 727677 stored elements in Compressed Sparse Row format>
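To make the shape of this matrix easier to read, here is a minimal sketch of the same vectorizer on a toy corpus. The three strings are hypothetical and are not taken from the dataset; the point is only that each row is one document, each column is one distinct term, and each row is scaled to unit length by default.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cleaned "requirements" strings, for illustration only
toy_requirements = [
    "python sql experience",
    "sql experience required",
    "earn money fast",
]

toy_tfidf = TfidfVectorizer()
X_toy = toy_tfidf.fit_transform(toy_requirements)

# One row per document, one column per distinct term in the corpus
print(X_toy.shape)
print(sorted(toy_tfidf.vocabulary_))

# With the default settings every row is L2-normalized
row_norms = np.asarray(np.sqrt(X_toy.multiply(X_toy).sum(axis=1))).ravel()
print(row_norms)
```

The real matrix above follows the same layout: 15185 postings by 43458 terms, stored sparsely because most entries are zero.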

### Label encoder, train/test split

In [4]:
label_encoder = LabelEncoder() # instantiate a label encoder
# fit and transform the encoder on the labels; assign() returns a new DataFrame, which avoids SettingWithCopyWarning
data_reduced = data_reduced.assign(fraudulent_enc=label_encoder.fit_transform(data_reduced['fraudulent']))

# Split the data into training and testing sets
X = X_tfidf # preprocessed requirements
y = data_reduced['fraudulent_enc'] # labels 

test_size=0.2 
random_state=42

# use the train_test_split function with the above test size and random seed
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
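The integer codes matter later in the notebook, where class 0 is treated as fraud and class 1 as not fraud. A small sketch of why, assuming the `fraudulent` column holds the strings "Fake" and "Real" (as the use of `label_encoder.transform(class_labels)` suggests): `LabelEncoder` sorts the class names and numbers them in that order.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels, for illustration only
toy_encoder = LabelEncoder()
toy_codes = toy_encoder.fit_transform(["Real", "Fake", "Real", "Real"])

# Classes are stored in sorted order, so "Fake" gets code 0 and "Real" gets code 1
print(list(toy_encoder.classes_))
print(toy_codes.tolist())
```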



### Model training and predictions

In [7]:
class_labels = ["Fake", "Real"]

# styling the confusion matrix
confusion_matrix_kwargs = dict(
    text_auto=True, 
    title="Confusion Matrix", width=1000, height=800,
    labels=dict(x="Predicted", y="True Label"),
    x=class_labels,
    y=class_labels,
    color_continuous_scale='Blues'
)

def report(y_true, y_pred, class_labels):

    # print a classification report of the predictions # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn-metrics-classification-report
    print(classification_report(y_true, y_pred, target_names=class_labels))
    # create a confusion matrix and pass it to imshow to visualize it # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix
    # (the confusion_matrix_kwargs are here for styling only)
    confusion_matrix_data = confusion_matrix(y_true, y_pred, labels=label_encoder.transform(class_labels)) # --> labels in int
    fig = px.imshow(
        confusion_matrix_data, 
        **confusion_matrix_kwargs
        )
    fig.show()

In [6]:
# SVM model
svm_model = SVC(kernel='linear', C=1.0, random_state=42)

# train (fit) the SVM model
svm_model.fit(X_train, y_train)

# Prediction on test sample
y_pred_SVM = svm_model.predict(X_test)

# Display the classification report and confusion matrix
report(y_test, y_pred_SVM, class_labels)

              precision    recall  f1-score   support

        Fake       0.99      0.47      0.64       157
        Real       0.97      1.00      0.99      2880

    accuracy                           0.97      3037
   macro avg       0.98      0.74      0.81      3037
weighted avg       0.97      0.97      0.97      3037
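The 97% accuracy in this report says little about fraud detection. As a quick check, the sketch below recomputes the "Fake" f1-score from the reported precision and recall, and compares the accuracy with what a model that always answers "Real" would reach on this test set. The helper name `f1_from` is ours, not part of scikit-learn.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# "Fake" row of the baseline report: precision 0.99, recall 0.47
baseline_fake_f1 = f1_from(0.99, 0.47)

# Accuracy of a trivial model that always predicts "Real":
# 2880 of the 3037 test postings are real
majority_only_accuracy = 2880 / 3037

print(f"Fake f1-score: {baseline_fake_f1:.2f}")
print(f"Always-'Real' accuracy: {majority_only_accuracy:.3f}")
```

The trivial model already reaches about 95% accuracy, so the recall of 0.47 on the "Fake" class is the figure to improve.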



### ***Balancing classes***

As we have an imbalanced dataset (far more "Real" job postings than fraudulent ones), we will try to rebalance the "Fake" and "Real" classes. To do so, we will compare SMOTE and undersampling to see which better improves the model's predictions. (We will not try random oversampling, since it consists in adding more copies of the minority class, and the SVM model would then take too long to train.)

### **Undersampling Majority Class**

Undersampling can be defined as removing some observations of the majority class. It can be a good choice since we have a lot of data initially (17880 rows). We will use the `resample` function from Scikit-Learn (`sklearn.utils`) to randomly remove samples from the majority class.  

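Resampling before splitting the data can leak information between the train and test sets: with oversampling in particular, copies (or synthetic neighbours) of the same observation can end up on both sides, which lets the model simply memorize specific data points and inflates the test scores. It also changes the class distribution of the test set, which then no longer reflects the real data. This is why we always split into train and test sets before applying any resampling technique.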
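The ordering can be sketched on toy data as follows. This is an illustration under assumptions, not the notebook's pipeline: the data is random, the variable names are ours, and `stratify=y` is added so that the tiny minority class appears in both splits (the notebook's own split does not stratify).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data: 10 minority samples (label 0), 90 majority samples (label 1)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([0] * 10 + [1] * 90)

# 1) Split first, so the test set keeps the original class distribution
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# 2) Undersample the majority class in the training set only
minority_idx = np.flatnonzero(y_tr == 0)
majority_idx = np.flatnonzero(y_tr == 1)
kept_majority_idx = resample(
    majority_idx, replace=False, n_samples=len(minority_idx), random_state=27
)
balanced_idx = np.concatenate([minority_idx, kept_majority_idx])
X_tr_bal, y_tr_bal = X_tr[balanced_idx], y_tr[balanced_idx]

print(np.bincount(y_tr_bal))  # class counts in the balanced training set
print(len(y_te))              # the test set is left untouched
```

A model trained on `X_tr_bal` would then be evaluated on `X_te`, which still has the original imbalance.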

In [7]:
# Convert the sparse X_train, X_test matrices to dense DataFrames (memory-heavy: every zero is stored explicitly)
X_train_df = pd.DataFrame(X_train.toarray())
X_test_df = pd.DataFrame(X_test.toarray())

# Convert y_train, y_test to Series
y_train_series = pd.Series(y_train)
y_test_series = pd.Series(y_test)


In [8]:
# Concatenate our training data back together
df_train_subsampling = pd.concat([X_train_df, y_train_series.reset_index(drop=True)], axis=1)

In [9]:
df_train_subsampling

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,43449,43450,43451,43452,43453,43454,43455,43456,43457,fraudulent_enc
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12143,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
12144,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
12145,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1
12146,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1


In [10]:
# separate minority and majority classes
not_fraud = df_train_subsampling[df_train_subsampling.fraudulent_enc == 1]
fraud = df_train_subsampling[df_train_subsampling.fraudulent_enc == 0]

# downsample majority
not_fraud_downsampled = resample(not_fraud,
                                replace = False, # sample without replacement
                                n_samples = len(fraud), # match minority n
                                random_state = 27) # reproducible results

# combine minority and downsampled majority
downsampled = pd.concat([not_fraud_downsampled, fraud])

# checking counts
downsampled.fraudulent_enc.value_counts()

fraudulent_enc
1    555
0    555
Name: count, dtype: int64

In [12]:
# Hold out part of the balanced (downsampled) training data for evaluation.
# Note: this split is made inside the downsampled training set, so the scores below are
# measured on a balanced subset and not on the original, imbalanced test set.
y_down = downsampled['fraudulent_enc'] # labels 
X_down = downsampled.drop(columns=['fraudulent_enc']) # TF-IDF features

test_size=0.2 
random_state=42

# The _down suffix keeps the original X_train, X_test, y_train, y_test intact for the SMOTE section
X_train_down, X_test_down, y_train_down, y_test_down = train_test_split(X_down, y_down, test_size=test_size, random_state=random_state)

In [13]:
# Retrain the SVM model on the downsampled data

# SVM model
svm_model = SVC(kernel='linear', C=1.0, random_state=42)

# train (fit) the SVM model
svm_model.fit(X_train_down, y_train_down)

# Prediction on the held-out part of the downsampled data
y_pred_SVM = svm_model.predict(X_test_down)

# Display the classification report and confusion matrix
report(y_test_down, y_pred_SVM, class_labels)

              precision    recall  f1-score   support

        Fake       0.84      0.74      0.79       100
        Real       0.81      0.89      0.84       122

    accuracy                           0.82       222
   macro avg       0.82      0.81      0.82       222
weighted avg       0.82      0.82      0.82       222



### SMOTE  

Now we will try the SMOTE method. SMOTE, or Synthetic Minority Oversampling Technique, is a popular algorithm that creates synthetic observations of the minority class.
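The core idea can be sketched in a few lines of NumPy: pick a minority sample, pick one of its nearest minority neighbours, and create a new point somewhere on the segment between them. This is a simplified illustration written for this notebook, not the `imblearn` implementation; the function name `smote_like` and its parameters are hypothetical.

```python
import numpy as np

def smote_like(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours.

    Simplified sketch: brute-force neighbour search, requires k < len(X_minority).
    """
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances between minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dist = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour

    # Indices of the k nearest minority neighbours of each sample
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # For each new point: a random base sample and one of its neighbours
    base = rng.integers(0, len(X_minority), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Interpolate at a random position between base and neighbour
    lam = rng.random((n_new, 1))
    return X_minority[base] + lam * (X_minority[picked] - X_minority[base])

# Toy minority class in 2D, for illustration only
toy_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
synthetic = smote_like(toy_minority, n_new=4)
print(synthetic)
```

Because every synthetic point lies between two existing minority samples, the new observations stay inside the region already occupied by the minority class.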

In [9]:
# SMOTE instance creation
smote = SMOTE(random_state=42)

# generate synthetic examples with SMOTE
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# SVM model
svm_model = SVC(kernel='linear', C=1.0, random_state=42)

# train (fit) the SVM model
svm_model.fit(X_train_resampled, y_train_resampled)

# Prediction on test sample
y_pred_SVM = svm_model.predict(X_test)

# Display the classification report and confusion matrix
report(y_test, y_pred_SVM, class_labels)

              precision    recall  f1-score   support

        Fake       0.72      0.63      0.67       157
        Real       0.98      0.99      0.98      2880

    accuracy                           0.97      3037
   macro avg       0.85      0.81      0.83      3037
weighted avg       0.97      0.97      0.97      3037



### **Conclusion on class balancing**  

The undersampling method shows good classification results: precision and recall are well balanced for the two classes, and the f1-scores are high (0.79 and 0.84). Note, however, that these scores were computed on a balanced hold-out of 222 postings taken from the downsampled training data, so they are not directly comparable with the scores measured on the 3037-posting test set. Undersampling the majority class may also discard potentially important information, which could affect the model's ability to generalize to new data. 
On the other hand, the SMOTE method looks promising. While the accuracy is the same as the baseline model's (0.97), recall and precision for the "Fake" class are less unbalanced (recall rises from 0.47 to 0.63 while precision falls from 0.99 to 0.72), and the f1-score for this class is slightly better (0.67 versus 0.64). This means that the model detects fraud better and misses fewer fraudulent postings. It does, however, make more errors on the "Real" class: 39 real job postings were predicted as fake, whereas the baseline model made only 1 such error. In our case study it seems reasonable to accept that a few more job postings are wrongly flagged as fraudulent.
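The error counts behind this comparison can be roughly recovered from the printed reports alone. The sketch below does so for the "Fake" class; since it starts from values rounded to two decimals, the results are approximate, and the helper name `counts_from_report` is ours. The figures of 39 and 1 quoted above presumably come from the confusion-matrix plots, which are not reproduced in this export.

```python
def counts_from_report(precision, recall, support):
    """Approximate TP, FP, FN for one class from its classification-report row."""
    tp = recall * support       # fraudulent postings correctly flagged
    fn = support - tp           # fraudulent postings missed
    fp = tp / precision - tp    # real postings wrongly flagged as fake
    return tp, fp, fn

# "Fake" rows: (precision, recall, support) from the reports above
tp_base, fp_base, fn_base = counts_from_report(0.99, 0.47, 157)
tp_smote, fp_smote, fn_smote = counts_from_report(0.72, 0.63, 157)

print(f"Baseline: ~{fp_base:.1f} false alarms, ~{fn_base:.1f} missed frauds")
print(f"SMOTE:    ~{fp_smote:.1f} false alarms, ~{fn_smote:.1f} missed frauds")
```

Within rounding error this agrees with the text: SMOTE trades a few dozen extra false alarms for roughly 25 fewer missed fraudulent postings.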

### **Hyperparameters variation**

In [6]:
class_labels = ["Fake", "Real"]

# SMOTE instance creation
smote = SMOTE(random_state=42)

# generate synthetic examples with SMOTE
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# List of kernels to test
kernels = ['linear', 'poly', 'rbf', 'sigmoid']

# List of C values to test
C_values = [0.1, 1, 10]

for kernel in kernels:
    for C in C_values:
        # Create the SVM model with the current kernel and C value
        svm_model = SVC(kernel=kernel, C=C, random_state=42)
        
        # Train the model
        svm_model.fit(X_train_resampled, y_train_resampled)
        
        # Predictions
        y_pred = svm_model.predict(X_test)
        
        # Classification report:
        print(f"Kernel: {kernel}, C: {C}")
        print(classification_report(y_test, y_pred, target_names=class_labels))
        confusion_matrix_kwargs = dict(
            text_auto=True, 
            title=f"Confusion Matrix (Kernel: {kernel}, C: {C})", width=1000, height=800,
            labels=dict(x="Predicted", y="True Label"),
            x=class_labels,
            y=class_labels,
            color_continuous_scale='Blues'
        )
        confusion_matrix_data = confusion_matrix(y_test, y_pred, labels=label_encoder.transform(class_labels)) # --> labels in int
        fig = px.imshow(
            confusion_matrix_data, 
            **confusion_matrix_kwargs
        )
        fig.show()


    

Kernel: linear, C: 0.1
              precision    recall  f1-score   support

        Fake       0.52      0.66      0.58       157
        Real       0.98      0.97      0.97      2880

    accuracy                           0.95      3037
   macro avg       0.75      0.81      0.78      3037
weighted avg       0.96      0.95      0.95      3037



Kernel: linear, C: 1
              precision    recall  f1-score   support

        Fake       0.72      0.63      0.67       157
        Real       0.98      0.99      0.98      2880

    accuracy                           0.97      3037
   macro avg       0.85      0.81      0.83      3037
weighted avg       0.97      0.97      0.97      3037



Kernel: linear, C: 10
              precision    recall  f1-score   support

        Fake       0.73      0.62      0.67       157
        Real       0.98      0.99      0.98      2880

    accuracy                           0.97      3037
   macro avg       0.86      0.80      0.83      3037
weighted avg       0.97      0.97      0.97      3037



Kernel: poly, C: 0.1
              precision    recall  f1-score   support

        Fake       0.90      0.56      0.69       157
        Real       0.98      1.00      0.99      2880

    accuracy                           0.97      3037
   macro avg       0.94      0.78      0.84      3037
weighted avg       0.97      0.97      0.97      3037



Kernel: poly, C: 1
              precision    recall  f1-score   support

        Fake       0.87      0.58      0.69       157
        Real       0.98      1.00      0.99      2880

    accuracy                           0.97      3037
   macro avg       0.92      0.79      0.84      3037
weighted avg       0.97      0.97      0.97      3037



Kernel: poly, C: 10
              precision    recall  f1-score   support

        Fake       0.86      0.58      0.69       157
        Real       0.98      0.99      0.99      2880

    accuracy                           0.97      3037
   macro avg       0.92      0.79      0.84      3037
weighted avg       0.97      0.97      0.97      3037



Kernel: rbf, C: 0.1
              precision    recall  f1-score   support

        Fake       0.95      0.45      0.61       157
        Real       0.97      1.00      0.98      2880

    accuracy                           0.97      3037
   macro avg       0.96      0.72      0.80      3037
weighted avg       0.97      0.97      0.96      3037



Kernel: rbf, C: 1
              precision    recall  f1-score   support

        Fake       0.84      0.59      0.69       157
        Real       0.98      0.99      0.99      2880

    accuracy                           0.97      3037
   macro avg       0.91      0.79      0.84      3037
weighted avg       0.97      0.97      0.97      3037



Kernel: rbf, C: 10
              precision    recall  f1-score   support

        Fake       0.87      0.59      0.70       157
        Real       0.98      1.00      0.99      2880

    accuracy                           0.97      3037
   macro avg       0.92      0.79      0.84      3037
weighted avg       0.97      0.97      0.97      3037



Kernel: sigmoid, C: 0.1
              precision    recall  f1-score   support

        Fake       0.42      0.68      0.52       157
        Real       0.98      0.95      0.97      2880

    accuracy                           0.94      3037
   macro avg       0.70      0.81      0.74      3037
weighted avg       0.95      0.94      0.94      3037



Kernel: sigmoid, C: 1
              precision    recall  f1-score   support

        Fake       0.33      0.69      0.44       157
        Real       0.98      0.92      0.95      2880

    accuracy                           0.91      3037
   macro avg       0.66      0.81      0.70      3037
weighted avg       0.95      0.91      0.93      3037



Kernel: sigmoid, C: 10
              precision    recall  f1-score   support

        Fake       0.30      0.66      0.41       157
        Real       0.98      0.91      0.95      2880

    accuracy                           0.90      3037
   macro avg       0.64      0.79      0.68      3037
weighted avg       0.94      0.90      0.92      3037



## ***Conclusion***

After using SMOTE to rebalance the classes, we varied the hyperparameters of the SVM model, namely the kernel and the regularization parameter. The kernel defines the type of decision boundary that the algorithm uses to separate the classes, while the regularization parameter (C) controls the trade-off between generalization and tolerance for misclassified training points. The sigmoid kernel is clearly not suited to our problem and our data: its f1-score on the "Fake" class (0.41 to 0.52) and its accuracy (0.90 to 0.94) are both below those of the baseline model (0.64 and 0.97). The kernels that improved the model, particularly for the prediction of the "Fake" class, were the "poly", "rbf" and "linear" kernels at C=1 and C=10. Thus, if we consider that it is less serious to flag a real job posting as fake than to let a fake one pass as real, the combination kernel=linear and C=1 is suited to our case study. It offers a good balance between recall (0.63) and precision (0.72) on the "Fake" class, its "Fake" f1-score is better than the baseline's (0.67 versus 0.64), and accuracy is still 97%.
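To compare the twelve runs at a glance, the "Fake" rows of the reports above can be collected and ranked. The sketch below copies those values by hand; the ranking criteria, and in particular the precision floor of 0.70, are our assumptions for illustration and not a rule stated in the notebook.

```python
# (kernel, C) -> (precision, recall, f1) for the "Fake" class, copied from the reports above
fake_scores = {
    ("linear", 0.1): (0.52, 0.66, 0.58),
    ("linear", 1): (0.72, 0.63, 0.67),
    ("linear", 10): (0.73, 0.62, 0.67),
    ("poly", 0.1): (0.90, 0.56, 0.69),
    ("poly", 1): (0.87, 0.58, 0.69),
    ("poly", 10): (0.86, 0.58, 0.69),
    ("rbf", 0.1): (0.95, 0.45, 0.61),
    ("rbf", 1): (0.84, 0.59, 0.69),
    ("rbf", 10): (0.87, 0.59, 0.70),
    ("sigmoid", 0.1): (0.42, 0.68, 0.52),
    ("sigmoid", 1): (0.33, 0.69, 0.44),
    ("sigmoid", 10): (0.30, 0.66, 0.41),
}

# Different criteria pick different winners
best_by_f1 = max(fake_scores, key=lambda cfg: fake_scores[cfg][2])
best_by_recall = max(fake_scores, key=lambda cfg: fake_scores[cfg][1])

# Assumed criterion: highest recall among runs whose precision stays at or above 0.70
min_precision = 0.70
eligible = {cfg: s for cfg, s in fake_scores.items() if s[0] >= min_precision}
best_recall_with_floor = max(eligible, key=lambda cfg: eligible[cfg][1])

print("Best f1:", best_by_f1)
print("Best recall:", best_by_recall)
print("Best recall with precision >= 0.70:", best_recall_with_floor)
```

Ranking by f1 alone favours the rbf kernel with C=10, and ranking by recall alone favours a sigmoid run with very low precision. The linear kernel with C=1 comes out on top only once recall is prioritized under a precision floor, which matches the reasoning given in the conclusion.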


