<a href="https://colab.research.google.com/github/dqminhv/fraudulent-job-posting-detection-with-NLP/blob/main/notebook/modeling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer

from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn import tree, metrics

The dataset is imbalanced, and the task is to classify job postings as fraudulent or legitimate from their descriptions. Several classification algorithms are candidates, but none of them handles class imbalance automatically: each usually needs class weighting, resampling, or threshold tuning to do well on the minority class, and should be judged on per-class precision and recall rather than accuracy. Here are some algorithms to consider:

- **Random Forest**: An ensemble that trains many decision trees on bootstrap samples and combines their votes. It is not immune to imbalance by itself, but scikit-learn's implementation accepts `class_weight='balanced'` to upweight the minority class.

- **Gradient Boosting Machines (GBM)**: Libraries such as XGBoost, LightGBM, and CatBoost build weak learners sequentially, each one fitted to reduce the remaining loss. They are often strong performers, and each offers a way to weight the positive class.

- **Support Vector Machines (SVM)**: A strong choice for binary classification on high-dimensional text features. The regularization parameter (C), the kernel function, and per-class weights can be tuned to trade off errors on the minority class.

- **Logistic Regression**: A simple, fast linear model that often performs well on TF-IDF features, especially when combined with class weighting and regularization.

- **AdaBoost**: An ensemble that trains weak classifiers in sequence, reweighting the training examples so that later classifiers focus on those misclassified earlier. This focus on hard examples can help with minority-class cases, but it also makes the method sensitive to noisy labels.

- **Neural Networks**: Neural networks can also be effective, especially with large datasets. Architectures such as Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs) can capture word order and other complex patterns in text.

- **Naive Bayes**: Despite its simplicity and its assumption of feature independence, Naive Bayes often performs surprisingly well on text classification and makes a fast baseline for imbalanced data.
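As one illustration of the class-weighting idea mentioned above, here is a minimal sketch that trains a weighted logistic regression on TF-IDF features. The corpus, labels, and variable names (`toy_texts`, `toy_labels`, `weighted_clf`) are hypothetical and stand in for the real data; this is not necessarily the model this notebook ends up using.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus (not the project's data): 1 = fraudulent, 0 = legitimate.
toy_texts = [
    "software engineer building backend services with python",
    "registered nurse needed for hospital night shift",
    "accountant to prepare monthly financial reports",
    "marketing manager to lead product campaigns",
    "data analyst for sales dashboards and reporting",
    "teacher for high school mathematics classes",
    "earn money fast from home no experience wire transfer fee",
    "quick cash work from home send payment to start",
]
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1]

# class_weight='balanced' weights each class inversely to its frequency,
# so errors on the rare fraudulent class cost more during training.
weighted_clf = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("logreg", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
weighted_clf.fit(toy_texts, toy_labels)

toy_predictions = weighted_clf.predict(toy_texts)
```

The same `class_weight` argument is accepted by `SVC` and by scikit-learn's tree-based classifiers, so the pipeline step can be swapped without changing the rest of the sketch.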

# Load train/test data

In [None]:
X_train = pd.read_csv('https://raw.githubusercontent.com/dqminhv/fraudulent-job-posting-detection-with-NLP/main/Data/X_train.csv')
X_test = pd.read_csv('https://raw.githubusercontent.com/dqminhv/fraudulent-job-posting-detection-with-NLP/main/Data/X_test.csv')
y_train = pd.read_csv('https://raw.githubusercontent.com/dqminhv/fraudulent-job-posting-detection-with-NLP/main/Data/y_train.csv')
y_test = pd.read_csv('https://raw.githubusercontent.com/dqminhv/fraudulent-job-posting-detection-with-NLP/main/Data/y_test.csv')
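`pd.read_csv` returns a DataFrame even when the file has a single column, while scikit-learn estimators expect a one-dimensional target. The sketch below shows one way to flatten such a frame and inspect the class balance. It uses an in-memory CSV with a hypothetical column name (`fraudulent`), since the actual layout of the label files is an assumption here.

```python
import io
import pandas as pd

# Hypothetical stand-in for a one-column label file; the real column name may differ.
toy_csv = io.StringIO("fraudulent\n0\n0\n1\n0\n")
y_toy = pd.read_csv(toy_csv)             # DataFrame with shape (4, 1)

# Collapse the single column into a Series so it can be passed to .fit() as y.
y_toy_1d = y_toy.squeeze("columns")      # Series with shape (4,)

# Share of each class, useful for judging how imbalanced the labels are.
class_share = y_toy_1d.value_counts(normalize=True)
```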

# Vectorizing text data

In [None]:
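# Vectorizer parameters
stop_words = 'english'  # drop common English stop words
min_df = .2             # float: ignore terms found in fewer than 20% of documents
max_df = .7             # float: ignore terms found in more than 70% of documents
ngram_range = (1, 1)    # unigrams only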

In [None]:
# Build the TF-IDF vectorizer from the parameters above
tfidf_vect = TfidfVectorizer(stop_words=stop_words, max_df=max_df, min_df=min_df, ngram_range=ngram_range)
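To make the effect of these parameters concrete, here is a small sketch with a hypothetical ten-document corpus. With `min_df=.2` and `max_df=.7`, a term survives only if it appears in at least 2 and at most 7 of the 10 documents. The vectorizer is fitted on the training texts only and then reused to transform the held-out texts, so the test set does not influence the vocabulary or IDF weights. All names prefixed with `toy_` are illustrative and not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training corpus: "job" is in all 10 documents, "salary" in 5,
# "remote" in 3, and every other word in exactly 1.
toy_train_docs = [
    "job remote salary python",
    "job remote salary nurse",
    "job remote salary driver",
    "job salary chef",
    "job salary teacher",
    "job painter",
    "job welder",
    "job plumber",
    "job baker",
    "job pilot",
]
toy_test_docs = ["job remote welder", "salary unknown"]

toy_vect = TfidfVectorizer(stop_words="english", min_df=.2, max_df=.7, ngram_range=(1, 1))

# Learn the vocabulary and IDF weights from the training documents only.
X_toy_train = toy_vect.fit_transform(toy_train_docs)

# Reuse the fitted vocabulary for unseen documents; unknown words are ignored.
X_toy_test = toy_vect.transform(toy_test_docs)

# "job" is too frequent and the one-off words are too rare, so only two terms remain.
kept_terms = sorted(toy_vect.vocabulary_)
```

On a real corpus a `min_df` of 0.2 is a fairly strict cut-off, so it is worth checking the size of `vocabulary_` after fitting.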