# <u>Train a sequence model</u>  

In this section we implement a sequential deep learning model. We chose to train an RNN composed of the following layers:
- 1 embedding layer
- 2 SimpleRNN layers with ReLU activation functions
- 1 dense output layer with a sigmoid activation function  

We use the Adam optimizer and the binary cross-entropy loss function.  
We first train this RNN on the pre-processed data for 20 epochs, then train it again on pre-processed data whose classes have been balanced. In the latter case we also use an early stopping callback.
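
To make the loss concrete, here is a minimal pure-Python sketch of binary cross-entropy for a sigmoid output. The function name `binary_cross_entropy`, the clipping constant `eps` and the example probabilities are illustrative assumptions, not taken from this notebook; Keras computes the same quantity internally with its own numerical safeguards.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Hypothetical labels and sigmoid outputs, for illustration only
labels = [1, 0, 1, 1]
confident = [0.9, 0.1, 0.8, 0.95]
hesitant = [0.6, 0.4, 0.5, 0.6]

# Confident, correct predictions give a lower loss than hesitant ones
print(binary_cross_entropy(labels, confident))
print(binary_cross_entropy(labels, hesitant))
```

The loss penalizes confident wrong predictions most heavily, which is why it pairs naturally with a single sigmoid output unit.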

In [26]:
import importlib.util
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix
import plotly.express as px
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, Dense, Bidirectional
from tensorflow.keras.optimizers import Adam
from imblearn.over_sampling import SMOTE

## ***Load and preprocess the data***

In [27]:
# Load the preprocessing module under an alias: its file name starts with a digit,
# so it cannot be loaded with a plain `import` statement.
path = '2-preprocessing.py'
module_name = 'preprocessing'
spec = importlib.util.spec_from_file_location(module_name, path)
preprocessing_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(preprocessing_module)

# Load the data
path_data = 'Data/Fake_Real_Job_Posting.csv'
data_full = pd.read_csv(path_data)

# Keep only the rows where the requirements field is present (not "Not Mentioned").
# .copy() makes data_reduced an independent DataFrame, so the column assignment
# below does not trigger pandas' SettingWithCopyWarning.
data_reduced = data_full[data_full['requirements'] != "Not Mentioned"].copy()

# Instantiate the preprocessing class
preprocessor = preprocessing_module.PreprocessingClass()

# Apply the preprocessing function to the "requirements" field
data_reduced['clean_requirements'] = data_reduced['requirements'].astype(str).apply(preprocessor.preprocessing)

## ***First (basic) RNN implementation*** 

Implementation on the preprocessed dataset

In [28]:
class_labels = ["Fake", "Real"]

# styling the confusion matrix
confusion_matrix_kwargs = dict(
    text_auto=True, 
    title="Confusion Matrix", width=1000, height=800,
    labels=dict(x="Predicted", y="True Label"),
    x=class_labels,
    y=class_labels,
    color_continuous_scale='Blues'
)

def report(y_true, y_pred, class_labels):
    """Print a classification report and plot the confusion matrix.

    Relies on the global `label_encoder` (fitted before this function is called)
    to map the class names to their integer codes.
    """
    # Per-class precision, recall and F1:
    # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html#sklearn-metrics-classification-report
    print(classification_report(y_true, y_pred, target_names=class_labels))
    # Confusion matrix with rows and columns ordered like class_labels:
    # https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html#sklearn.metrics.confusion_matrix
    confusion_matrix_data = confusion_matrix(y_true, y_pred, labels=label_encoder.transform(class_labels))
    # confusion_matrix_kwargs only control the styling of the plot
    fig = px.imshow(confusion_matrix_data, **confusion_matrix_kwargs)
    fig.show()
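
The next cell turns each cleaned text into a fixed-length sequence of integer token ids. As a rough illustration of what `pad_sequences(..., padding='post', truncating='post')` does to a single sequence, here is a pure-Python sketch; the helper name `pad_post` and the token ids are hypothetical, and the real Keras function works on a whole batch and returns a NumPy array.

```python
def pad_post(sequence, maxlen, value=0):
    """Truncate at the end, then pad at the end, to exactly maxlen items."""
    trimmed = sequence[:maxlen]
    return trimmed + [value] * (maxlen - len(trimmed))

# Hypothetical token-id sequences, for illustration only
print(pad_post([5, 12, 7], maxlen=5))           # [5, 12, 7, 0, 0]
print(pad_post([5, 12, 7, 3, 9, 4], maxlen=5))  # [5, 12, 7, 3, 9]
```

The value 0 is safe to use for padding because the Keras `Tokenizer` reserves index 0 and numbers real words from 1.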

In [22]:
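# Tokenize the text data. The vocabulary is capped at max_words so that every
# token index stays below the Embedding layer's input_dim.
max_words = 45000
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(data_reduced['clean_requirements'])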

# Convert text data to sequences
sequences = tokenizer.texts_to_sequences(data_reduced['clean_requirements'])

# Pad (with zeros) or truncate the sequences so that they all have the same length
max_sequence_length = 100
X = pad_sequences(sequences, maxlen=max_sequence_length, padding='post', truncating='post')

# Encode the labels
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(data_reduced['fraudulent'])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and compile the RNN model
max_words = 45000
model = Sequential()
model.add(Embedding(input_dim=max_words, output_dim=128, input_length=max_sequence_length))
model.add(SimpleRNN(64, activation='relu', return_sequences=True))
model.add(SimpleRNN(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=20, batch_size=64, validation_split=0.2)

# Evaluate the model
y_pred = (model.predict(X_test) > 0.5).astype(int)
class_labels = ["Fake", "Real"]
report(y_test, y_pred, class_labels)


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20
              precision    recall  f1-score   support

        Fake       1.00      0.03      0.06       157
        Real       0.95      1.00      0.97      2880

    accuracy                           0.95      3037
   macro avg       0.97      0.52      0.52      3037
weighted avg       0.95      0.95      0.93      3037
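
One way to read the 0.95 accuracy above is to compare it with a trivial baseline that always predicts the majority class. The sketch below uses only the support counts from the report (157 and 2880); the variable names are illustrative.

```python
# Support counts taken from the classification report above
support = {"Fake": 157, "Real": 2880}

total = sum(support.values())
majority_class = max(support, key=support.get)
baseline_accuracy = support[majority_class] / total

print(majority_class, round(baseline_accuracy, 2))  # Real 0.95
```

Since this baseline already reaches the same rounded accuracy, accuracy alone says little about how well the "Fake" class is detected.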



## ***RNN implementation on balanced classes***

As shown in other notebooks, the dataset classes are highly imbalanced, so one possible improvement is to balance them with the SMOTE method. The previous model reaches a good accuracy (0.95) but predicts the "Fake" class very poorly (recall of 0.03). In this section, we train the same model on training data with balanced classes and check whether this improves its predictions. In addition, we use an EarlyStopping callback so that training stops once the validation loss has not improved for 3 consecutive epochs, with the weights from the best epoch restored. 
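
As a rough intuition for what SMOTE does, here is a pure-Python sketch of its core step: a synthetic sample is placed at a random position on the segment between a minority-class sample and one of its minority-class neighbours. The helper name `smote_like_sample` and the feature vectors are hypothetical; imblearn's `SMOTE` additionally searches for the nearest neighbours (`k_neighbors=5` by default) and generates as many samples as needed to balance the classes.

```python
import random

def smote_like_sample(x, neighbor, rng):
    """Return a point on the segment between x and one of its minority-class neighbours."""
    lam = rng.random()  # interpolation factor in [0, 1)
    return [a + lam * (b - a) for a, b in zip(x, neighbor)]

rng = random.Random(42)
# Hypothetical minority-class feature vectors, for illustration only
x = [1.0, 4.0]
neighbor = [3.0, 8.0]
synthetic = smote_like_sample(x, neighbor, rng)
print(synthetic)
```

Note that here the features are padded token-id sequences, so an interpolated value does not necessarily correspond to a related word; this may limit how informative the synthetic sequences are.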

In [30]:
from tensorflow.keras.callbacks import EarlyStopping

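# Tokenize the text data. The vocabulary is capped at max_words so that every
# token index stays below the Embedding layer's input_dim.
max_words = 45000
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(data_reduced['clean_requirements'])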

# Convert text data to sequences
sequences = tokenizer.texts_to_sequences(data_reduced['clean_requirements'])

# Padding sequences for a consistent input shape
max_sequence_length = 100
X = pad_sequences(sequences, maxlen=max_sequence_length, padding='post', truncating='post')

# Encode the labels
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(data_reduced['fraudulent'])

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create the SMOTE instance
smote = SMOTE(random_state=42)

# Oversample the training set only, using synthetic examples generated by SMOTE
X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)

# EarlyStopping Callback
early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# Create and compile the RNN model
model = Sequential()
model.add(Embedding(input_dim=max_words, output_dim=128, input_length=max_sequence_length))
model.add(SimpleRNN(64, activation='relu', return_sequences=True))
model.add(SimpleRNN(32, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])

# Train the model on the resampled training set
model.fit(X_train_resampled, y_train_resampled, epochs=20, batch_size=64, validation_split=0.2, callbacks=[early_stopping])

# Evaluate the model
y_pred = (model.predict(X_test) > 0.5).astype(int)
class_labels = ["Fake", "Real"]
report(y_test, y_pred, class_labels)


Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
              precision    recall  f1-score   support

        Fake       0.06      0.32      0.10       157
        Real       0.95      0.70      0.81      2880

    accuracy                           0.68      3037
   macro avg       0.50      0.51      0.45      3037
weighted avg       0.90      0.68      0.77      3037
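
The run above stopped after 5 of the 20 epochs. The patience rule behind this can be sketched in pure Python as follows; the helper name `stopping_epoch` and the loss values are hypothetical (the actual validation losses are not shown in the log), and the Keras callback has further options such as `min_delta` that are left out here.

```python
def stopping_epoch(val_losses, patience=3):
    """Return (stop_epoch, best_epoch), 1-indexed, under a patience rule on val_loss."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses), best_epoch

# Hypothetical validation losses, not the values from the run above
losses = [0.60, 0.55, 0.58, 0.57, 0.59, 0.50]
print(stopping_epoch(losses))  # (5, 2)
```

Under this rule, stopping at epoch 5 with `patience=3` would be consistent with the best validation loss having been reached at epoch 2, whose weights `restore_best_weights=True` then brings back.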



## ***Conclusion on the RNN implementation***

Based on our results, the RNN model does not perform well at detecting fraudulent job postings, even with SMOTE and early stopping. The metrics for the "Fake" class, recall in particular, show that the model struggles to identify fraudulent postings.

In the initial model, without SMOTE or early stopping, the recall for the "Fake" class is very low (0.03): the model misses about 97% of the fraudulent postings in the test set. Its precision for the "Real" class is high (0.95), but this is not the main concern in fraud detection.

After applying SMOTE to balance the classes, the recall for the "Fake" class improves to 0.32, but its precision drops from 1.00 to 0.06, and the overall accuracy falls from 0.95 to 0.68. The model now detects more fraudulent postings, but at the cost of many more false positives (real postings incorrectly flagged as fake): the recall for the "Real" class falls from 1.00 to 0.70.
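
The trade-off between precision and recall for the "Fake" class can be summarized with the F1 score, the harmonic mean of the two. The sketch below recomputes it from the rounded precision and recall values in the two reports, so it only reproduces the reported F1 up to rounding; the function name `f1_score_from` is illustrative.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) pairs for the "Fake" class, copied from the two reports
without_smote = (1.00, 0.03)
with_smote = (0.06, 0.32)

print(round(f1_score_from(*without_smote), 2))  # 0.06
print(round(f1_score_from(*with_smote), 2))     # 0.1
```

Both values stay very low, which matches the conclusion that neither variant detects fraudulent postings reliably.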