**Name**

* **Yu-Chih (Wisdom) Chen**
* **Devon Delgado**
* **Xiaobing Xu**
* **Peter Ye**

**Date**

**11/16/2024**

# Fake Job Description Prediction Dataset

## Overview
This dataset is designed for developing classification models to identify fraudulent job postings. It contains approximately 18,000 job descriptions, with around 800 labeled as fake.

## Dataset Details
- **Total Entries**: ~18,000 job descriptions
- **Fraudulent Entries**: ~800
- **Data Types**: Textual information and meta-information about jobs
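
The counts above imply a heavily imbalanced target. A quick back-of-the-envelope check, using the approximate figures quoted in this section rather than exact counts from the CSV, might look like this:

```python
# Approximate figures taken from the overview above; the exact counts
# should be read from the loaded DataFrame instead.
total_postings = 18_000
fraudulent_postings = 800

fraud_rate = fraudulent_postings / total_postings
legit_to_fraud_ratio = (total_postings - fraudulent_postings) / fraudulent_postings

print(f"Fraud rate: {fraud_rate:.1%}")
print(f"Legitimate postings per fraudulent posting: {legit_to_fraud_ratio:.1f}")
```

With roughly 4% positives, a model that always predicts "not fraudulent" would be right about 96% of the time, so plain accuracy is a poor yardstick for this dataset.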

## Source
The University of the Aegean | Laboratory of Information & Communication Systems Security
(http://emscad.samos.aegean.gr/)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from feast import FeatureStore
from nltk.tokenize import word_tokenize  
from pandarallel import pandarallel
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

from utli import *

## 1. Load Data

In [None]:
df = pd.read_csv('fake_job_postings.csv') # Change it to data source

In [None]:
df.head()

## 2. Exploratory Data Analysis (EDA)

In [None]:
# Get the dimensions of the Dataset
print("Dimensions of the Dataset (Rows, Columns):")
df.shape

In [None]:
# Remove leading and trailing whitespace from column names
df.columns = df.columns.str.strip()

In [None]:
# Check for duplicate rows in the dataset
df.duplicated().sum()

In [None]:
# Getting an overview of the features and their types in the dataset
print("Overview of the features and their types:")
df.info()

In [None]:
# Count the number of columns with dtype 'object'
object_cols = df.select_dtypes(include=['object']).columns
num_object_cols = len(object_cols)

# Count the number of columns with dtype 'int64'
int_cols = df.select_dtypes(include=['int64']).columns
num_int_cols = len(int_cols)

print(f"Number of columns with object dtype: {num_object_cols}")
print(f"Number of columns with int64 dtype: {num_int_cols}")

### a. Missing Values

In [None]:
print("Display Missing values in the dataset: ")
print("\n")

print(df.isnull().sum())

In [None]:
# View percentage of missing values per column
print('Percent of Null Values in Each Column:\n')
print(df.isnull().sum()/df.shape[0]*100)

In [None]:
# Count and display percentage of missing values
missing_percent = (df.isnull().sum() / len(df)) * 100
missing_percent = missing_percent[missing_percent > 0].sort_values(ascending=False)

plt.figure(figsize=(8, 6))
missing_percent.plot(kind='bar', color='skyblue')
plt.title('Percentage of Missing Values by Column')
plt.ylabel('% of Missing Values')
plt.xlabel('Columns')
plt.xticks(rotation=45)
plt.show()

### b. Visualization

In [None]:
# Separate numerical and categorical columns
df_num = df[['telecommuting','has_company_logo','has_questions','fraudulent','salary_range']]
df_cat = df[['title', 'location','company_profile', 'requirements','employment_type',
       'required_experience', 'required_education', 'industry', 'function']]

In [None]:
# Checking for Outliers in numerical data
plt.figure(figsize=[8,6])
sns.boxplot(data = df_num)
plt.title("Outliers in Numerical Data")
plt.show()

In [None]:
# Plot the distribution of selected categorical features and the target individually
plt.figure(figsize= (25,20))
plt.subplot(3,3,1)
# Convert 'employment_type' to string type before plotting
plt.hist(df.employment_type.astype(str), color='orange', edgecolor = 'black', alpha = 0.7)
plt.xlabel('\nEmployment type')
plt.xticks(rotation=45)

plt.subplot(3,3,2)
# Convert 'required_experience' to string type before plotting
plt.hist(df.required_experience.astype(str), color='lightblue', edgecolor = 'black', alpha = 0.7)
plt.xlabel('\nRequired Experience')
plt.xticks(rotation=45)

plt.subplot(3,3,3)
plt.hist(df.fraudulent, color='red', edgecolor = 'black', alpha = 0.7)
plt.xlabel('\nFraud')
plt.xticks(rotation=45)


plt.show()

In [None]:
# Number of postings per job function
plt.figure(figsize=(48, 20))
plt.xticks(rotation=45)
plt.title("Number of Postings by Job Function", fontsize=20)
sns.set_style("darkgrid")
sns.countplot(x='function', data=df, color='blue')  # Adjust color if needed

In [None]:
# Calculate the sum of fraudulent postings by function
fraudulent_summary = df.groupby('function')['fraudulent'].sum().reset_index()

plt.figure(figsize=(25, 8))
sns.lineplot(data=fraudulent_summary, x='function', y='fraudulent', marker='o')
plt.title('Fraudulent Postings by Function', fontsize = 20)
plt.xlabel('Function')
plt.ylabel('Sum of Fraudulent Postings')
plt.xticks(rotation=45)
plt.grid(True)
plt.show()

In [None]:
# Bar plot for fraudulent (target) feature
fraud_colors = ['blue', 'red']
plt.figure(figsize=(6, 4))
sns.countplot(x='fraudulent', data=df, hue='fraudulent', palette=fraud_colors, dodge=False)
plt.title('Distribution of Fraudulent Job Postings')
plt.xlabel('Fraudulent')
plt.ylabel('Count')
plt.show()

In [None]:
# Bar plot for employment_type
plt.figure(figsize=(8, 6))
sns.barplot(data=df, x='employment_type', y='fraudulent', estimator=sum, hue='employment_type', dodge=False, palette='Set2')
plt.title('Fraudulent Postings by Employment Type')
plt.xlabel('Employment Type')
plt.ylabel('Sum of Fraudulent Postings')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Bar plot for required_experience
plt.figure(figsize=(8, 6))
sns.barplot(data=df, x='required_experience', y='fraudulent', estimator=sum, hue='required_experience', dodge=False, palette='Set1')
plt.title('Fraudulent Postings by Required Experience')
plt.xlabel('Required Experience')
plt.ylabel('Sum of Fraudulent Postings')
plt.xticks(rotation=45)
plt.show()

In [None]:
# Bar plot for required_education
plt.figure(figsize=(15, 8))
sns.barplot(data=df, x='required_education', y='fraudulent', estimator=sum, hue='required_education', dodge=False, palette='Set3')
plt.title('Fraudulent Postings by Required Education')
plt.xlabel('Required Education')
plt.ylabel('Sum of Fraudulent Postings')
plt.xticks(rotation=45)
plt.show()

### c. Data Preprocessing

In [None]:
# Select features
df_selected = select_features(df)

# Prepare initial features
df_processed = prepare_initial_features(df_selected)

# Create feature stores
structured_features, label_encoders = create_structured_features(df_processed)
text_features = create_text_features(df_processed)

In [None]:
print_feature_summary(structured_features, text_features)

In [None]:
# Subset target variable
# Copy so later column insertions do not act on a view of df
target_features = df[['fraudulent']].copy()

In [None]:
# Replace any existing job_id with an incremental primary key.
# The loop variable is not named `df`, so the raw DataFrame is not overwritten.
for features_df in [structured_features, text_features, target_features]:
    if 'job_id' in features_df.columns:
        features_df.drop(columns=['job_id'], inplace=True)

    # Add incremental job_id as the first column
    features_df.insert(0, 'job_id', range(1, len(features_df) + 1))

In [None]:
# Add an event_timestamp column with today's timestamp
structured_features['event_timestamp'] = pd.Timestamp.now()
text_features['event_timestamp'] = pd.Timestamp.now()
target_features['event_timestamp'] = pd.Timestamp.now()

In [None]:
columns_to_convert = [
    "title_cleaned",
    "description_cleaned",
    "requirements_cleaned",
    "company_profile_cleaned",
    "benefits_cleaned"
]

text_features[columns_to_convert] = text_features[columns_to_convert].astype("string")

In [None]:
text_features.head()

In [None]:
structured_features.head()

In [None]:
target_features.head()

## 3. Feature Stores

### A. Structured Feature Store

The structured feature store contains processed numerical and categorical features. These features are ready for use in traditional machine learning models.

#### **(1) Label Encoded Categorical Features**
These categorical features have been converted to numerical values using label encoding:
- `location`: Job locations (e.g., "US, NY" → 1, "UK, London" → 2)
- `employment_type`: Job types (e.g., "Full-time" → 0, "Part-time" → 1)
- `required_experience`: Experience levels (e.g., "Entry Level" → 0, "Senior" → 1)
- `required_education`: Education requirements (e.g., "Bachelor's" → 0, "Master's" → 1)
- `industry`: Company industries (e.g., "Technology" → 0, "Healthcare" → 1)
- `function`: Job functions (e.g., "Engineering" → 0, "Sales" → 1)
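
The integer codes in the list above are illustrative. The notebook's `create_structured_features` helper is not shown, so the following is only a sketch of label encoding on toy data, using pandas categorical codes as a stand-in for whatever encoder the helper actually uses:

```python
import pandas as pd

# Toy column; the values are illustrative, not taken from the dataset.
employment_type = pd.Series(["Full-time", "Part-time", "Contract", "Full-time"])

# Categorical codes assign one integer per distinct value, in sorted order.
categories = employment_type.astype("category")
encoded = categories.cat.codes
mapping = dict(enumerate(categories.cat.categories))

print(encoded.tolist())
print(mapping)
```

The integers are arbitrary identifiers and carry no ordering, which is worth keeping in mind when feeding them to models that treat inputs as magnitudes.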

#### **(2) Binary Features**
Simple 0/1 indicators:
- `telecommuting`: Remote work indicator (1=yes, 0=no)
- `has_company_logo`: Company logo presence (1=yes, 0=no)
- `has_questions`: Screening questions presence (1=yes, 0=no)
- `no_logo_no_questions`: Combined feature (1=no logo & no questions, 0=otherwise)
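
The combined flag is a conjunction of the two "absent" conditions. A minimal sketch on toy data, assuming the flag is derived directly from the two binary columns (the helper in `utli` is not shown):

```python
import pandas as pd

# Toy rows covering all four logo/questions combinations.
flags = pd.DataFrame({
    "has_company_logo": [1, 0, 0, 1],
    "has_questions": [1, 0, 1, 0],
})

# 1 only when both the logo and the screening questions are absent.
flags["no_logo_no_questions"] = (
    (flags["has_company_logo"] == 0) & (flags["has_questions"] == 0)
).astype(int)

print(flags)
```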

#### **(3) Frequency Encoded Features**
Represents how common each category is in the dataset:
- `location_freq`: Location frequency (e.g., 0.25 = appears in 25% of postings)
- `employment_type_freq`: Employment type frequency
- `required_experience_freq`: Experience level frequency
- `required_education_freq`: Education requirement frequency
- `industry_freq`: Industry frequency
- `function_freq`: Job function frequency
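
Frequency encoding replaces each category with its share of rows. A small sketch under the assumption that the share is computed over the whole dataset, as the description above suggests (toy values, not real data):

```python
import pandas as pd

# Toy column; values are illustrative only.
industry = pd.Series(["Technology", "Healthcare", "Technology", "Finance"])

# Share of rows for each category, then mapped back onto every row.
category_share = industry.value_counts(normalize=True)
industry_freq = industry.map(category_share)

print(industry_freq.tolist())
```

Unlike label encoding, the resulting number is meaningful: rare categories get small values, which can itself be a useful signal.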

#### **(4) Text Length Features**
Character counts of cleaned text fields:
- `title_length`: Job title length
- `description_length`: Job description length
- `requirements_length`: Requirements text length
- `company_profile_length`: Company profile length
- `benefits_length`: Benefits text length

#### **(5) Missing Value Indicators**
Binary flags (0/1) indicating missing values in original data:
- Various columns ending with `_is_missing`
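
These flags only carry information if they are computed on the raw columns, before any imputation. A minimal sketch with hypothetical toy rows:

```python
import pandas as pd

# Toy raw data with gaps; column names mirror the dataset, values do not.
raw = pd.DataFrame({
    "benefits": ["Health insurance", None, "401(k)"],
    "industry": [None, None, "Technology"],
})

# One 0/1 flag per column, derived before filling missing values.
for column in ["benefits", "industry"]:
    raw[f"{column}_is_missing"] = raw[column].isna().astype(int)

print(raw[["benefits_is_missing", "industry_is_missing"]])
```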

### B. Text Feature Store

The text feature store contains cleaned and processed text data, ready for natural language processing tasks.

#### **(1) Cleaned Text Features**
Each text field has been processed to remove noise and standardize format:

- `title_cleaned`
  - Original: "Senior Software Engineer (Python/Django)"
  - Cleaned: "senior software engineer python django"

- `description_cleaned`
  - Original: "We are looking for a talented Software Engineer..."
  - Cleaned: "looking talented software engineer..."

- `requirements_cleaned`
  - Original: "5+ years of Python experience required"
  - Cleaned: "years python experience required"

- `company_profile_cleaned`
  - Original: "We're a fast-growing tech company..."
  - Cleaned: "fast growing tech company"

- `benefits_cleaned`
  - Original: "401(k), Health Insurance, Flexible Hours"
  - Cleaned: "health insurance flexible hours"
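
The before/after pairs above are consistent with lowercasing, stripping non-letter characters, and dropping stop words and one-letter tokens. The actual `create_text_features` helper is not shown, so the function below is a hypothetical sketch that reproduces those examples with a deliberately tiny stop-word list:

```python
import re

# Tiny illustrative stop-word list; the real pipeline likely uses a fuller
# one (for example NLTK's English list), which is an assumption here.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "re", "the", "to", "we"}


def clean_text(text: str) -> str:
    """Lowercase, keep letters only, drop stop words and one-letter tokens."""
    lowered = text.lower()
    letters_only = re.sub(r"[^a-z]+", " ", lowered)
    tokens = [t for t in letters_only.split() if len(t) > 1 and t not in STOP_WORDS]
    return " ".join(tokens)


print(clean_text("Senior Software Engineer (Python/Django)"))
print(clean_text("401(k), Health Insurance, Flexible Hours"))
```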

### C. Target Feature Store

The target feature store contains only the target variable, `fraudulent`.

### a. Initialize a Feast repo named `job_repo` in the current directory and save the data to it

In [None]:
!feast init -m job_repo

In [None]:
# Save features
save_features(structured_features, text_features, target_features)

### b. Define feature stores and schema for job_repo

In [None]:
# Update feature_store.yaml
feature_store_yaml_content = """
project: job_project
registry: data/registry.db
provider: local
online_store:
    type: sqlite
    path: data/online_store.db
offline_store:
    type: file
entity_key_serialization_version: 2
"""

# Write the content to feature_store.yaml
with open('job_repo/feature_repo/feature_store.yaml', 'w') as f:
    f.write(feature_store_yaml_content.strip())

print("Updated feature_store.yaml")

### c. Create features.py

In [None]:
# Create features.py file
features_py_content = '''
from datetime import timedelta
from feast import Entity, FeatureView, Field, FileSource, Project
from feast.types import Float32, String, Int64

# Define a project for the feature repo
project = Project(name="job_project", description="A project for job data")

# Define the entity
job = Entity(
    name="job_id",
    join_keys=["job_id"],
    description="Unique identifier for job",
)

# Define the Structured Feature data source
job_structured_data_source = FileSource(
    name='job_structured_data',
    path="data/structured_features.parquet",
    timestamp_field="event_timestamp")

# Define the Structured Feature View
job_structured_features = FeatureView(
    name="job_structured_features_view",
    entities=[job],
    ttl=timedelta(days=1),
    schema=[
        Field(name="location", dtype=Int64),
        Field(name="employment_type", dtype=Int64),
        Field(name="required_experience", dtype=Int64),
        Field(name="required_education", dtype=Int64),
        Field(name="industry", dtype=Int64),
        Field(name="function", dtype=Int64),
        Field(name="telecommuting", dtype=Int64),
        Field(name="has_company_logo", dtype=Int64),
        Field(name="has_questions", dtype=Int64),
        Field(name="no_logo_no_questions", dtype=Int64),
        Field(name="location_freq", dtype=Float32),
        Field(name="employment_type_freq", dtype=Float32),
        Field(name="required_experience_freq", dtype=Float32),
        Field(name="required_education_freq", dtype=Float32),
        Field(name="industry_freq", dtype=Float32),
        Field(name="function_freq", dtype=Float32),
        Field(name="description_is_missing", dtype=Int64),
        Field(name="requirements_is_missing", dtype=Int64),
        Field(name="company_profile_is_missing", dtype=Int64),
        Field(name="benefits_is_missing", dtype=Int64),
        Field(name="location_is_missing", dtype=Int64),
        Field(name="employment_type_is_missing", dtype=Int64),
        Field(name="required_experience_is_missing", dtype=Int64),
        Field(name="required_education_is_missing", dtype=Int64),
        Field(name="industry_is_missing", dtype=Int64),
        Field(name="function_is_missing", dtype=Int64),
        Field(name="title_length", dtype=Int64),
        Field(name="description_length", dtype=Int64),
        Field(name="requirements_length", dtype=Int64),
        Field(name="company_profile_length", dtype=Int64),
        Field(name="benefits_length", dtype=Int64),
    ],
    online=True,
    source=job_structured_data_source,
)

# Define the Text Features data source
job_text_data_source  = FileSource(
    name='job_text_data',
    path="data/text_features.parquet",
    timestamp_field="event_timestamp" )

# Define the Text Feature View
job_text_features = FeatureView(
    name="job_text_features_view",
    entities=[job], 
    ttl=timedelta(days=1),
    schema=[
        Field(name="title_cleaned", dtype=String),
        Field(name="description_cleaned", dtype=String),
        Field(name="requirements_cleaned", dtype=String),
        Field(name="company_profile_cleaned", dtype=String),
        Field(name="benefits_cleaned", dtype=String)
    ],
    online=True,
    source=job_text_data_source,
)

# Define the target data source
job_target_source = FileSource(
    name='job_target',
    path="data/target_features.parquet",
    timestamp_field="event_timestamp")

# Define the target Feature View with all columns
job_target_features = FeatureView(
    name="job_target_feature_view",
    entities=[job],
    ttl=timedelta(days=1),
    schema=[
        Field(name="fraudulent", dtype=Int64),
    ],
    online=True,
    source=job_target_source,
)
'''

# Write the content to features.py
with open('job_repo/feature_repo/features.py', 'w') as f:
    f.write(features_py_content.strip())

print("Created features.py")

### d. feast apply

In [None]:
!cd job_repo/feature_repo && feast apply

### e. Extract features from feature stores

In [None]:
# Set the store
store = FeatureStore('job_repo/feature_repo')

In [None]:
from datetime import datetime
# Simulate a scenario where we want the job_ids that were created recently (today)
today = pd.Timestamp(datetime.now().date()) 
entity_df = target_features[target_features['event_timestamp'] >= today][['job_id', 'event_timestamp']]

In [None]:
# Retrieve predictors and target from the different feature views
training_data = store.get_historical_features(
    entity_df=entity_df,
    features=[
        "job_structured_features_view:location",
        "job_structured_features_view:employment_type",
        "job_structured_features_view:required_experience",
        "job_structured_features_view:required_education",
        "job_structured_features_view:industry",
        "job_structured_features_view:function",
        "job_structured_features_view:telecommuting",
        "job_structured_features_view:has_company_logo",
        "job_structured_features_view:has_questions",
        "job_structured_features_view:no_logo_no_questions",
        "job_structured_features_view:location_freq",
        "job_structured_features_view:employment_type_freq",
        "job_structured_features_view:required_experience_freq",
        "job_structured_features_view:required_education_freq",
        "job_structured_features_view:industry_freq",
        "job_structured_features_view:function_freq",
        "job_structured_features_view:description_is_missing",
        "job_structured_features_view:requirements_is_missing",
        "job_structured_features_view:company_profile_is_missing",
        "job_structured_features_view:benefits_is_missing",
        "job_structured_features_view:location_is_missing",
        "job_structured_features_view:employment_type_is_missing",
        "job_structured_features_view:required_experience_is_missing",
        "job_structured_features_view:required_education_is_missing",
        "job_structured_features_view:industry_is_missing",
        "job_structured_features_view:function_is_missing",
        "job_structured_features_view:title_length",
        "job_structured_features_view:description_length",
        "job_structured_features_view:requirements_length",
        "job_structured_features_view:company_profile_length",
        "job_structured_features_view:benefits_length",
        "job_text_features_view:title_cleaned",
        "job_text_features_view:description_cleaned",
        "job_text_features_view:requirements_cleaned",
        "job_text_features_view:company_profile_cleaned",
        "job_text_features_view:benefits_cleaned",
        "job_target_feature_view:fraudulent"
    ]
)


# Convert to DataFrame for inspection
training_data_df = training_data.to_df()

In [None]:
# Dropping the columns 'job_id' and 'event_timestamp' from training_data_df
training_data_df = training_data_df.drop(columns=['job_id', 'event_timestamp'])

In [None]:
training_data_df.head()

## 4. Data Post-processing

### a. Combined text features

In [None]:
# Combine Text Features
text_columns = ['title_cleaned', 'company_profile_cleaned', 'description_cleaned', 'requirements_cleaned', 'benefits_cleaned']
training_data_df['cleaned_combined_text'] = training_data_df[text_columns].agg(' '.join, axis=1)

In [None]:
training_data_df= training_data_df.drop(columns=text_columns)

In [None]:
training_data_df.info()

### b. TfidfVectorizer

In [None]:
# Initialize the vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000)

X_tfidf = tfidf_vectorizer.fit_transform(training_data_df['cleaned_combined_text'])

# Feature names
feature_names = tfidf_vectorizer.get_feature_names_out()

# Convert to DataFrame (optional)
tfidf_df = pd.DataFrame(X_tfidf.toarray(), columns=feature_names)

In [None]:
tfidf_df.shape

In [None]:
# SVD to reduce dimensionality
svd = TruncatedSVD(n_components=500)  
X_tfidf_reduced = svd.fit_transform(X_tfidf)
tfidf_df_reduced = pd.DataFrame(X_tfidf_reduced, columns=[f'svd_{i}' for i in range(500)])

In [None]:
training_data_df = training_data_df.drop(columns=['cleaned_combined_text'])

In [None]:
training_data_with_tfidf = pd.concat([training_data_df.reset_index(drop=True), tfidf_df_reduced], axis=1)

In [None]:
training_data_with_tfidf.head()

In [None]:
# Save DataFrame to Parquet for temporary usage
training_data_with_tfidf.to_parquet("training_data_with_tfidf.parquet", index=False)

## 5. Model Building

The target is `fraudulent`, a binary variable indicating whether a listing is fraudulent (0 = legitimate, 1 = fraudulent). Models are ranked by the area under the precision-recall curve (AUC-PR) because it best reflects the business problem: we want to protect applicants from fake postings, where they risk having their information leaked or being scammed, while keeping trust in the system by not flagging legitimate jobs as spam. F1-score was not available as a ranking metric in H2O.
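
To make the metric concrete, here is a small pure-Python sketch of average precision, a common estimator of the area under the precision-recall curve. It is an illustration on toy scores only; H2O's own AUC-PR computation may differ in detail, and the function name is hypothetical.

```python
def average_precision(y_true, scores):
    """Average precision: mean of the precision at the rank of each positive.

    Assumes no tied scores; ties would need grouping by threshold.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    positives_seen = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            positives_seen += 1
            precision_sum += positives_seen / rank
    if positives_seen == 0:
        raise ValueError("y_true contains no positive examples")
    return precision_sum / positives_seen


# Toy labels and scores, not model output.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
ap = average_precision(y_true, scores)
print(round(ap, 3))
```

Because the score depends on how highly the rare positive class is ranked, it is less flattering than accuracy on imbalanced data, where a random ranking scores roughly the positive rate.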

Run `mlflow ui` in the terminal and go to http://127.0.0.1:5000 to look at the logged runs.

In [1]:
from mlflow_runner import run_mlflow_pipeline

run_mlflow_pipeline('training_data_with_tfidf.parquet')

Checking whether there is an H2O instance running at http://localhost:54321..... not found.
Attempting to start a local H2O server...
  Java Version: java version "1.8.0_371"; Java(TM) SE Runtime Environment (build 1.8.0_371-b11); Java HotSpot(TM) 64-Bit Server VM (build 25.371-b11, mixed mode)
  Starting server from /Users/devondelgado/miniconda3/envs/mlops/lib/python3.10/site-packages/h2o/backend/bin/h2o.jar
  Ice root: /var/folders/b4/nhbrf4ws253b52n__dgvd0gr0000gn/T/tmp0f7g8lle
  JVM stdout: /var/folders/b4/nhbrf4ws253b52n__dgvd0gr0000gn/T/tmp0f7g8lle/h2o_devondelgado_started_from_python.out
  JVM stderr: /var/folders/b4/nhbrf4ws253b52n__dgvd0gr0000gn/T/tmp0f7g8lle/h2o_devondelgado_started_from_python.err
  Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.


| H2O cluster property | Value |
|---|---|
| H2O_cluster_uptime | 02 secs |
| H2O_cluster_timezone | America/Chicago |
| H2O_data_parsing_timezone | UTC |
| H2O_cluster_version | 3.46.0.6 |
| H2O_cluster_version_age | 23 days |
| H2O_cluster_name | H2O_from_python_devondelgado_s1uc18 |
| H2O_cluster_total_nodes | 1 |
| H2O_cluster_free_memory | 3.541 Gb |
| H2O_cluster_total_cores | 10 |
| H2O_cluster_allowed_cores | 10 |


Parse progress: |████████████████████████████████████████████████████████████████| (done) 100%
AutoML progress: |
15:34:12.150: _train param, Dropping bad and constant columns: [description_is_missing]

██
15:36:26.830: _train param, Dropping bad and constant columns: [description_is_missing]

████████████████████
15:38:30.871: _train param, Dropping bad and constant columns: [description_is_missing]

██
15:39:09.629: _train param, Dropping bad and constant columns: [description_is_missing]

█████████████
15:41:18.21: _train param, Dropping bad and constant columns: [description_is_missing]

██
15:42:04.306: _train param, Dropping bad and constant columns: [description_is_missing]

████████████████████████| (done) 100%
xgboost prediction progress: |███████████████████████████████████████████████████| (done) 100%
🏃 View run Leader_Model_Run at: http://127.0.0.1:5000/#/experiments/0/runs/a5b53a765cd0473eaf77639274dd5a6d
🧪 View experiment at: http://127.0.0.1:5000/#/experiments/0
xgboost 