This repository has been archived by the owner on May 11, 2023. It is now read-only.

SC1015 Project

Overview

This repository contains the project files for the SC1015 mini-project.
With the recent increase in fraud and scams worldwide, and job scams in particular, we use Natural Language Processing (NLP) techniques to classify job listings as fraudulent or legitimate. Listed here are the ipynb files used, which should be viewed in numerical order:

  1. EDA
  2. Data Cleaning, NLP
  3. Baseline Model - Random Forest
  4. Support Vector Machine (SVM) model
  5. Recurrent Neural Network (RNN) model

Dataset

The dataset used can be found here. It provides approximately 18,000 job descriptions, each labelled as either fraudulent or legitimate.

Methodology

The dataset contains a mix of numbers (stored as strings), boolean values, and free text. It was cleaned and transformed with NLP techniques into a form suitable for the ML models:

  • Invalid, NaN, and outlier data were handled
  • All the data in each row was concatenated into a single string
  • Boolean values were encoded in the form yes/no_{column name}
  • Stop words were removed with NLTK, and the text was lemmatized (converted to base form) with spaCy
  • Post-processing further cleaned the data: normalizing case, removing hyperlinks, symbols, etc.
  • The processed text was saved to a CSV; model-specific vectorization/encoding is done in the following sections
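The post-processing steps above can be sketched as follows. This is a minimal stand-in, not the project's actual code: the stopword set here is a tiny hand-rolled placeholder for NLTK's full English list, and the spaCy lemmatization step is only indicated with a comment.

```python
import re

# Tiny stand-in stopword list; the project uses NLTK's full English list.
STOP_WORDS = {"a", "an", "the", "and", "or", "in", "of", "to", "for", "is", "at"}

def clean_text(text: str) -> str:
    """Normalize case, strip hyperlinks and symbols, drop stop words."""
    text = text.lower()                                  # normalize case
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove hyperlinks
    text = re.sub(r"[^a-z0-9\s]", " ", text)             # remove symbols
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    # Lemmatization (done with spaCy in the project) would map each
    # remaining token to its base form here.
    return " ".join(tokens)

row = "Apply NOW at www.jobs-4u.biz!! Experience in Sales & Marketing required."
print(clean_text(row))  # → apply now experience sales marketing required
```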

The random forest model serves as a baseline to demonstrate that this particular task of fake-job classification requires a more complex model: random forest yielded poor results even with a max depth of 32. Two versions of the random forest model were created - one classifying on the numerical/boolean data, the other on the vectorized text.
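A minimal sketch of the baseline setup, using scikit-learn on synthetic imbalanced data (the toy data here is hypothetical; the project trains on the job-listing features):

```python
# Random-forest baseline sketch on hypothetical imbalanced toy data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# ~5% positive class, loosely mimicking the fraud-rate imbalance.
X, y = make_classification(n_samples=500, n_features=20,
                           weights=[0.95, 0.05], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

clf = RandomForestClassifier(max_depth=32, random_state=42).fit(X_tr, y_tr)
print(round(clf.score(X_te, y_te), 3))  # test-set accuracy
```

Note that plain accuracy is misleading under heavy class imbalance, which is why the Results table below also reports recall, specificity, and F1.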

The cleaned text was vectorized with TF-IDF, class-balanced with ADASYN, and used to train an SVC. For this particular dataset, the linear kernel was most performant, with a C parameter of 0.1. Attempts to reduce the dimensionality with PCA/SVD were unsuccessful due to the very high dimension of the data, and reducing the dimension at the vectorization stage yielded worse results. Instead, zero-variance features were removed.
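The TF-IDF plus linear-SVC stage can be sketched as below. The documents and labels are invented for illustration, and the ADASYN resampling step (from the imbalanced-learn package) is only indicated with a comment rather than run:

```python
# SVM pipeline sketch on hypothetical toy documents (not the real dataset).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

docs = ["urgent work from home wire money today",
        "software engineer role with benefits",
        "send a fee to unlock this easy job",
        "data analyst position in a growing team"]
labels = [1, 0, 1, 0]  # 1 = fraudulent

vec = TfidfVectorizer()
X = vec.fit_transform(docs)
# In the project, ADASYN (imbalanced-learn) resamples X/labels here to balance
# the classes, and zero-variance features are dropped before training.
clf = LinearSVC(C=0.1).fit(X, labels)
pred = clf.predict(vec.transform(["wire a fee today"]))
print(pred)
```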

The cleaned text was class-balanced with ADASYN, encoded with one-hot encoding, and fed into an RNN with GloVe used as a pre-trained embedding layer. Specifically, the RNN uses a bi-directional LSTM to retain context across the token sequence. The float [0,1] output is then rounded to a boolean. LeakyReLU activation functions in the fully connected layers performed better than ReLU. The results of the classification are below.
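A minimal Keras sketch of this architecture is below. The vocabulary size, sequence length, layer widths, and embedding matrix are all placeholders (the real model loads pre-trained GloVe vectors); this shows only the layer structure described above.

```python
# Bi-directional LSTM sketch; sizes and the embedding matrix are placeholders.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

VOCAB, EMB_DIM, SEQ_LEN = 1000, 50, 20
glove_matrix = np.random.rand(VOCAB, EMB_DIM).astype("float32")  # stand-in for GloVe

model = keras.Sequential([
    keras.Input(shape=(SEQ_LEN,)),
    layers.Embedding(VOCAB, EMB_DIM, trainable=False),
    layers.Bidirectional(layers.LSTM(64)),   # context in both directions
    layers.Dense(32),
    layers.LeakyReLU(),                      # outperformed plain ReLU here
    layers.Dense(1, activation="sigmoid"),   # float output in [0, 1]
])
model.layers[0].set_weights([glove_matrix])  # load (stand-in) GloVe vectors
model.compile(optimizer="adam", loss="binary_crossentropy")

probs = model.predict(np.random.randint(0, VOCAB, size=(2, SEQ_LEN)), verbose=0)
print((probs > 0.5).astype(int).ravel())     # rounded to boolean labels
```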

Results

The table below reports performance statistics for each model on the test set (for the positive class, i.e. fraudulent):

| Model                     | Accuracy | F1   | Recall (TPR) | Specificity (TNR) | Precision |
|---------------------------|----------|------|--------------|-------------------|-----------|
| LinearSVC                 | 0.988    | 0.88 | 0.827        | 0.997             | 0.93      |
| Bi-directional LSTM       | 0.985    | 0.82 | 0.73         | 0.997             | 0.93      |
| Random Forest (Baseline)  | 0.978    | 0.72 | 0.57         | 0.99              | 0.92      |
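For reference, the metrics in the table relate to confusion-matrix counts as shown below. The TP/FP/TN/FN counts here are hypothetical, chosen only to illustrate the formulas:

```python
# Metric definitions from hypothetical confusion-matrix counts.
tp, fp, tn, fn = 140, 10, 1600, 30

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
recall      = tp / (tp + fn)          # TPR, sensitivity
specificity = tn / (tn + fp)          # TNR
f1          = 2 * precision * recall / (precision + recall)

print(round(accuracy, 3), round(precision, 3), round(recall, 3),
      round(specificity, 3), round(f1, 3))
# → 0.978 0.933 0.824 0.994 0.875
```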

From some limited misclassification analysis (in the SVM notebook), it was found that the word "experience" had an unusually high frequency in the misclassified positive samples, with most of the other frequent words being conjunctions/fillers. While it is hard to draw firm conclusions, this suggests the word may be a comparatively important feature for fraud classification.
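This kind of frequency check can be done with a simple word count over the misclassified texts; the sample strings below are hypothetical:

```python
# Word-frequency check over (hypothetical) misclassified samples.
from collections import Counter

misclassified = ["experience required no experience needed",
                 "great experience and benefits",
                 "experience with the team"]
counts = Counter(word for text in misclassified for word in text.split())
print(counts.most_common(1))  # → [('experience', 4)]
```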

Conclusion

In conclusion, we found that it is possible to classify fake job listings with relatively high recall (>80% for the SVM), despite the highly imbalanced classes. More importantly, all the models achieve near-100% specificity in identifying real jobs, which matters because a model that falsely flags legitimate job listings as fraudulent would not be in the public interest.

Based on the success of the ML models, there appears to be a strong correlation between whether a job listing is genuine and certain patterns of word use in its description.

Contributors

@edward62740 - SVM, RNN, NLP, data cleaning
@tian0662 - Random Forest, Decision Tree, EDA, Video
@iCiCici - RNN, EDA, Slides preparation, Video
