This repository has been archived by the owner on May 11, 2023. It is now read-only.

SC1015 Project

Overview

This repository contains the project files for the SC1015 mini-project.
With the recent increase in fraud and scams worldwide, and job scams in particular, we use Natural Language Processing (NLP) techniques to classify job listings as fraudulent or legitimate. Listed here are the ipynb files used, which should be viewed in numerical order:

  1. EDA
  2. Data Cleaning, NLP
  3. Baseline Model - Random Forest
  4. Support Vector Machine (SVM) model
  5. Recurrent Neural Network (RNN) model

Dataset

The dataset used can be found here. It provides approximately 18,000 job descriptions, each labelled as either fraudulent or legitimate.

Methodology

The dataset contains a mix of numbers (stored as strings), boolean values, and free text. It was cleaned and transformed with NLP techniques into a form suitable for the ML models:

  • Invalid, NaN, and outlier data were handled
  • All the data in each row was concatenated into a single string
  • Boolean values were encoded in the form yes/no_{column name}
  • Stop words were removed with NLTK, and the text was lemmatized (converted to base form) with spaCy
  • Post-processing further cleaned the data: normalizing case, removing hyperlinks, symbols, etc.
  • The processed text was saved to a CSV; model-specific vectorization/encoding is done in the following sections
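The post-processing steps above can be sketched as follows. This is a minimal stand-in, not the project's actual code: the stopword set here is a tiny hand-rolled placeholder for NLTK's full English list, and the spaCy lemmatization step is only indicated with a comment.

```python
import re

# Tiny stand-in stopword list; the project uses NLTK's full English list.
STOP_WORDS = {"a", "an", "the", "and", "or", "in", "of", "to", "for", "is", "at"}

def clean_text(text: str) -> str:
    """Normalize case, strip hyperlinks and symbols, drop stop words."""
    text = text.lower()                                  # normalize case
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove hyperlinks
    text = re.sub(r"[^a-z0-9\s]", " ", text)             # remove symbols
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    # Lemmatization (done with spaCy in the project) would map each
    # remaining token to its base form here.
    return " ".join(tokens)

row = "Apply NOW at www.jobs-4u.biz!! Experience in Sales & Marketing required."
print(clean_text(row))  # → apply now experience sales marketing required
```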

The random forest model serves as a baseline to demonstrate that this particular task of fake-job classification requires a more complex model: random forest yielded poor results even with a max depth of 32. Two versions of the random forest model were created - one classifying on the numerical/boolean data, the other on the vectorized text.
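A minimal sketch of the baseline setup, using scikit-learn on synthetic imbalanced data (the toy data here is hypothetical; the project trains on the job-listing features):

```python
# Random-forest baseline sketch on hypothetical imbalanced toy data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# ~5% positive class, loosely mimicking the fraud-rate imbalance.
X, y = make_classification(n_samples=500, n_features=20,
                           weights=[0.95, 0.05], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

clf = RandomForestClassifier(max_depth=32, random_state=42).fit(X_tr, y_tr)
print(round(clf.score(X_te, y_te), 3))  # test-set accuracy
```

Note that plain accuracy is misleading under heavy class imbalance, which is why the Results table below also reports recall, specificity, and F1.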

The cleaned text was vectorized with TF-IDF, class-balanced with ADASYN, and used to train an SVC. For this particular dataset, the linear kernel was most performant, with a C parameter of 0.1. Attempts to reduce the dimensionality with PCA/SVD were unsuccessful due to the very high dimension of the data, and reducing the dimension at the vectorization stage yielded worse results. Instead, zero-variance features were removed.
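The TF-IDF plus linear-SVC stage can be sketched as below. The documents and labels are invented for illustration, and the ADASYN resampling step (from the imbalanced-learn package) is only indicated with a comment rather than run:

```python
# SVM pipeline sketch on hypothetical toy documents (not the real dataset).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["urgent work from home wire money today",
        "software engineer role with benefits",
        "send a fee to unlock this easy job",
        "data analyst position in a growing team"]
labels = [1, 0, 1, 0]  # 1 = fraudulent

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
# In the project, ADASYN (imbalanced-learn) resamples X/labels here to balance
# the classes, and zero-variance features are dropped before training.
clf = LinearSVC(C=0.1).fit(X, labels)
pred = clf.predict(vec.transform(["wire a fee today"]))
print(pred)
```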

The cleaned text was class-balanced with ADASYN, encoded with one-hot encoding, and fed into an RNN with GloVe used as a pre-trained embedding layer. Specifically, the RNN uses a bi-directional LSTM to retain context across the token sequence. The float [0,1] output is then rounded to a boolean. LeakyReLU activation functions in the fully connected layers performed better than ReLU. The results of the classification are below.
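A minimal Keras sketch of this architecture is below. The vocabulary size, sequence length, layer widths, and embedding matrix are all placeholders (the real model loads pre-trained GloVe vectors); this shows only the layer structure described above.

```python
# Bi-directional LSTM sketch; sizes and the embedding matrix are placeholders.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

VOCAB, EMB_DIM, SEQ_LEN = 1000, 50, 20
glove_matrix = np.random.rand(VOCAB, EMB_DIM).astype("float32")  # stand-in for GloVe

model = keras.Sequential([
    keras.Input(shape=(SEQ_LEN,)),
    layers.Embedding(VOCAB, EMB_DIM, trainable=False),
    layers.Bidirectional(layers.LSTM(64)),   # context in both directions
    layers.Dense(32),
    layers.LeakyReLU(),                      # outperformed plain ReLU here
    layers.Dense(1, activation="sigmoid"),   # float output in [0, 1]
])
model.layers[0].set_weights([glove_matrix])  # load (stand-in) GloVe vectors
model.compile(optimizer="adam", loss="binary_crossentropy")

probs = model.predict(np.random.randint(0, VOCAB, size=(2, SEQ_LEN)), verbose=0)
print((probs > 0.5).astype(int).ravel())     # rounded to boolean labels
```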

Results

The table below reports performance statistics for each model on the test set (for the positive class, i.e. fraudulent):

| Model                     | Accuracy | F1   | Recall (TPR) | Specificity (TNR) | Precision |
|---------------------------|----------|------|--------------|-------------------|-----------|
| LinearSVC                 | 0.988    | 0.88 | 0.827        | 0.997             | 0.93      |
| Bi-directional LSTM       | 0.985    | 0.82 | 0.73         | 0.997             | 0.93      |
| Random Forest (Baseline)  | 0.978    | 0.72 | 0.57         | 0.99              | 0.92      |
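For reference, the metrics in the table relate to confusion-matrix counts as shown below. The TP/FP/TN/FN counts here are hypothetical, chosen only to illustrate the formulas:

```python
# Metric definitions from hypothetical confusion-matrix counts.
tp, fp, tn, fn = 140, 10, 1600, 30

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)          # TPR, sensitivity
specificity = tn / (tn + fp)          # TNR
f1          = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3), round(recall, 3),
      round(specificity, 3), round(f1, 3))
# → 0.978 0.933 0.824 0.994 0.875
```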

From some limited misclassification analysis (in the SVM notebook), it was found that the word "experience" had an unusually high frequency in the misclassified positive samples, with most of the other frequent words being conjunctions/fillers. While it is hard to draw firm conclusions, this suggests the word may be a comparatively important feature for fraud classification.
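This kind of frequency check can be done with a simple word count over the misclassified texts; the sample strings below are hypothetical:

```python
# Word-frequency check over (hypothetical) misclassified samples.
from collections import Counter

misclassified = ["experience required no experience needed",
                 "great experience and benefits",
                 "experience with the team"]
counts = Counter(word for text in misclassified for word in text.split())
print(counts.most_common(1))  # → [('experience', 4)]
```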

Conclusion

In conclusion, we found that it is possible to classify fake job listings with relatively high recall (>80% for the SVM), despite the highly imbalanced classes. More importantly, all the models achieve near-100% specificity in identifying real jobs, which matters because a model that falsely flags legitimate job listings as fraudulent would not be in the public interest.

Based on the success of the ML models, there appears to be a strong correlation between whether a job listing is genuine and certain patterns of word use in its description.

Contributors

@edward62740 - SVM, RNN, NLP, data cleaning
@tian0662 - Random Forest, Decision Tree, EDA, Video
@iCiCici - RNN, EDA, Slides preparation, Video
