# Welcome to my Data Preparations👋

About the Dataset:
[Real or Fake] : Fake Job Description Prediction
This dataset contains 18K job descriptions, of which about 800 are fake. The data consists of both textual information and meta-information about the jobs. The dataset can be used to create classification models that learn which job descriptions are fraudulent.

Inspiration
The dataset is very valuable as it can be used to answer the following questions:
- Create a classification model that uses text features and meta-features to predict which job descriptions are fraudulent and which are real.
- Identify key traits/features (words, entities, phrases) of job descriptions that are fraudulent in nature.
- Run a contextual embedding model to identify the most similar job descriptions.
- Perform Exploratory Data Analysis on the dataset to identify interesting insights.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import preprocessing

## Data Explorations 👀📝

### 1️⃣ Loading and Viewing the Raw Data

First, I explore the dataset to see what's inside it, and then decide how I'm going to prepare it.

In [2]:
df = pd.read_csv('fake_job_postings.csv')

In [3]:
df

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17875,17876,Account Director - Distribution,"CA, ON, Toronto",Sales,,Vend is looking for some awesome new talent to...,Just in case this is the first time you’ve vis...,To ace this role you:Will eat comprehensive St...,What can you expect from us?We have an open cu...,0,1,1,Full-time,Mid-Senior level,,Computer Software,Sales,0
17876,17877,Payroll Accountant,"US, PA, Philadelphia",Accounting,,WebLinc is the e-commerce platform and service...,The Payroll Accountant will focus primarily on...,- B.A. or B.S. in Accounting- Desire to have f...,Health &amp; WellnessMedical planPrescription ...,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Internet,Accounting/Auditing,0
17877,17878,Project Cost Control Staff Engineer - Cost Con...,"US, TX, Houston",,,We Provide Full Time Permanent Positions for m...,Experienced Project Cost Control Staff Enginee...,At least 12 years professional experience.Abil...,,0,0,0,Full-time,,,,,0
17878,17879,Graphic Designer,"NG, LA, Lagos",,,,Nemsia Studios is looking for an experienced v...,1. Must be fluent in the latest versions of Co...,Competitive salary (compensation will be based...,0,0,1,Contract,Not Applicable,Professional,Graphic Design,Design,0


### 2️⃣ Scan Every Column

Understand the scale and basic structure of the dataset

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   job_id               17880 non-null  int64 
 1   title                17880 non-null  object
 2   location             17534 non-null  object
 3   department           6333 non-null   object
 4   salary_range         2868 non-null   object
 5   company_profile      14572 non-null  object
 6   description          17879 non-null  object
 7   requirements         15184 non-null  object
 8   benefits             10668 non-null  object
 9   telecommuting        17880 non-null  int64 
 10  has_company_logo     17880 non-null  int64 
 11  has_questions        17880 non-null  int64 
 12  employment_type      14409 non-null  object
 13  required_experience  10830 non-null  object
 14  required_education   9775 non-null   object
 15  industry             12977 non-null  object
 16  func

I think the data types are already in good shape: the int64 columns really are numbers, and the text-based columns are object, so I probably don't need any dtype conversion. I'll revisit this later if it turns out to be needed.

A lot of data is missing 😭. I'll fix that later.

### 3️⃣ Check Duplicates

In [5]:
duplicate_count = df.duplicated().sum()

print(f"Number of duplicate rows: {duplicate_count}")

Number of duplicate rows: 0
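One caveat about the check above: `df.duplicated()` compares entire rows, and `job_id` appears to be a unique identifier, so two postings with identical text would still count as different rows. A minimal sketch of a content-only check, using a small made-up DataFrame (the toy rows are hypothetical, not taken from the dataset):

```python
import pandas as pd

# Toy data (hypothetical): rows 0 and 1 differ only in job_id.
toy = pd.DataFrame({
    "job_id": [1, 2, 3],
    "title": ["Data Entry Clerk", "Data Entry Clerk", "Graphic Designer"],
    "description": ["Work from home.", "Work from home.", "Design logos."],
})

# Full-row comparison: the unique job_id hides the repeated posting.
full_row_dupes = int(toy.duplicated().sum())

# Content-only comparison: leave the identifier out of the comparison.
content_cols = [c for c in toy.columns if c != "job_id"]
content_dupes = int(toy.duplicated(subset=content_cols).sum())

print(f"Full-row duplicates: {full_row_dupes}")
print(f"Duplicates ignoring job_id: {content_dupes}")
```

Whether repeated postings should actually be dropped is a separate decision; this only shows how to count them.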


### 4️⃣ Check Inconsistent Formats

In [6]:
print("Value counts for 'employment_type':")
print(df['employment_type'].value_counts())
print("\n" + "="*30 + "\n") # A separator for clarity

print("Value counts for 'required_experience':")
print(df['required_experience'].value_counts())
print("\n" + "="*30 + "\n")

print("Value counts for 'required_education':")
print(df['required_education'].value_counts())

Value counts for 'employment_type':
employment_type
Full-time    11620
Contract      1524
Part-time      797
Temporary      241
Other          227
Name: count, dtype: int64


Value counts for 'required_experience':
required_experience
Mid-Senior level    3809
Entry level         2697
Associate           2297
Not Applicable      1116
Director             389
Internship           381
Executive            141
Name: count, dtype: int64


Value counts for 'required_education':
required_education
Bachelor's Degree                    5145
High School or equivalent            2080
Unspecified                          1397
Master's Degree                       416
Associate Degree                      274
Certification                         170
Some College Coursework Completed     102
Professional                           74
Vocational                             49
Some High School Coursework            27
Doctorate                              26
Vocational - HS Diploma                 9
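The labels above look consistent to the eye, but variants that differ only in case or trailing whitespace are easy to miss in a printed list. A small sketch of a programmatic check on a made-up column (the values are hypothetical): compare the raw labels with a trimmed, lower-cased copy and list the labels that have more than one spelling.

```python
import pandas as pd

# Toy column (hypothetical values) with a case/whitespace variant and a NaN.
s = pd.Series(["Full-time", "full-time ", "Part-time", "Full-time", None])

# Normalised copy: trim whitespace and lower-case. NaN stays NaN.
normalised = s.str.strip().str.lower()

raw_labels = s.nunique()             # distinct labels as written
clean_labels = normalised.nunique()  # distinct labels after normalising

# For each normalised label, count how many raw spellings map to it.
spellings = s.dropna().groupby(normalised.dropna()).nunique()
suspect = spellings[spellings > 1].index.tolist()

print(f"Raw labels: {raw_labels}, after normalising: {clean_labels}")
print(f"Labels with more than one spelling: {suspect}")
```

If the two counts match on the real columns, there are no case or whitespace variants to clean up.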


### 5️⃣ List the Missing Values

Count the total number of missing values in each column, calculate the percentage of missing values, and collect both in a new DataFrame that holds the report. Then filter the report to show only the columns that actually have missing data, sorted from most missing to least.

In [7]:
missing_values = df.isnull().sum()

missing_percent = (missing_values / len(df)) * 100

missing_report = pd.DataFrame({
    'Missing Values': missing_values,
    'Percentage (%)': missing_percent
})

missing_report = missing_report[missing_report['Missing Values'] > 0].sort_values(by='Percentage (%)', ascending=False)

print("Missing Value Report:")
print(missing_report)

Missing Value Report:
                     Missing Values  Percentage (%)
salary_range                  15012       83.959732
department                    11547       64.580537
required_education             8105       45.329978
benefits                       7212       40.335570
required_experience            7050       39.429530
function                       6455       36.101790
industry                       4903       27.421700
employment_type                3471       19.412752
company_profile                3308       18.501119
requirements                   2696       15.078300
location                        346        1.935123
description                       1        0.005593


## Data Cleaning 📚🧹

### Data Formatting

The salary_range column has over 80% missing values. While the standard approach might be to drop such a column, the absence of a salary range could itself be a strong indicator of a fraudulent job posting. To preserve this information, I'm going to encode salary_range as an int column of 0s and 1s and rename it to has_salary. The new column is 1 if a salary was provided and 0 otherwise.

In [8]:
df['salary_range'] = df['salary_range'].notnull().astype(int)

df.rename(columns={'salary_range': 'has_salary'}, inplace=True)

print("DataFrame after encoding 'salary_range' into 'has_salary':")
df.head()

DataFrame after encoding 'salary_range' into 'has_salary':


Unnamed: 0,job_id,title,location,department,has_salary,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,0,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,0,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,0,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,0,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",,0,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0
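The idea that a missing salary might signal fraud is still a hypothesis at this point. One way to check it is to compare the share of fraudulent postings in each has_salary group. The sketch below does this on made-up rows (the values are hypothetical and say nothing about the real dataset); it also writes the flag to a new column instead of overwriting salary_range, which is just a choice made for this example.

```python
import pandas as pd

# Toy data (hypothetical): four postings without a salary, two with one.
toy = pd.DataFrame({
    "salary_range": ["20000-28000", None, None, "50000-60000", None, None],
    "fraudulent":   [0, 1, 0, 0, 1, 0],
})

# 1 if a salary range was provided, 0 otherwise.
toy["has_salary"] = toy["salary_range"].notnull().astype(int)

# Mean of a 0/1 label per group = share of fraudulent postings in that group.
fraud_rate = toy.groupby("has_salary")["fraudulent"].mean()
print(fraud_rate)
```

On the real DataFrame the same `groupby` line can be run directly on `df`, since `has_salary` and `fraudulent` already exist there.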


In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   job_id               17880 non-null  int64 
 1   title                17880 non-null  object
 2   location             17534 non-null  object
 3   department           6333 non-null   object
 4   has_salary           17880 non-null  int64 
 5   company_profile      14572 non-null  object
 6   description          17879 non-null  object
 7   requirements         15184 non-null  object
 8   benefits             10668 non-null  object
 9   telecommuting        17880 non-null  int64 
 10  has_company_logo     17880 non-null  int64 
 11  has_questions        17880 non-null  int64 
 12  employment_type      14409 non-null  object
 13  required_experience  10830 non-null  object
 14  required_education   9775 non-null   object
 15  industry             12977 non-null  object
 16  func

The salary_range column is now renamed to has_salary and has an int dtype.

The other columns also have a lot of missing values, so I decided to handle them with the same logic as before: a missing value just means the poster did not provide that information, and that absence could itself be a strong indicator of a fraudulent job posting. That's why I'll keep these columns and only modify them slightly to handle the missing values.

The treatment is a little different from salary_range, though. For salary_range, where the content itself was messy and not very useful, the binary flag (1/0) approach was a good fit.
However, for columns like department, industry, or required_education, the actual values ("Sales," "IT," "Bachelor's Degree") are very valuable and likely have strong predictive power.

For these columns, I decided to fill NaN values with a placeholder category ("Unspecified"). This handles the missing values while keeping all the original, valuable data.

In [10]:

# Columns whose missing entries are replaced with a placeholder category
columns_to_fill = [
    'department',
    'required_education',
    'benefits',
    'required_experience',
    'function',
    'industry',
    'company_profile',
    'employment_type',
    'requirements',
    'location',
    'description',  # only one missing value, filled the same way
]

# Fill every listed column in one step
df[columns_to_fill] = df[columns_to_fill].fillna('Unspecified')

print("Missing Value Report after filling:")
print(df.isnull().sum())

Missing Value Report after filling:
job_id                 0
title                  0
location               0
department             0
has_salary             0
company_profile        0
description            0
requirements           0
benefits               0
telecommuting          0
has_company_logo       0
has_questions          0
employment_type        0
required_experience    0
required_education     0
industry               0
function               0
fraudulent             0
dtype: int64
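One thing to keep in mind: the value counts earlier show that required_education already contains an "Unspecified" label (1397 rows), so after filling, originally empty cells and originally "Unspecified" cells can no longer be told apart in that column. If that distinction matters, a possible variation is to record a missing flag before filling. A minimal sketch on made-up rows (the `required_education_missing` column name is hypothetical):

```python
import pandas as pd

# Toy data (hypothetical): one real value, one existing "Unspecified", two NaN.
toy = pd.DataFrame({
    "required_education": ["Bachelor's Degree", "Unspecified", None, None],
})

# Record which cells were truly empty BEFORE filling them.
toy["required_education_missing"] = toy["required_education"].isna().astype(int)

# Then fill with the placeholder category as in the cell above.
toy["required_education"] = toy["required_education"].fillna("Unspecified")

counts = toy["required_education"].value_counts()
truly_missing = int(toy["required_education_missing"].sum())

print(counts)
print(f"Originally missing: {truly_missing}")
```

This keeps the placeholder approach intact and only adds one extra 0/1 column per affected feature.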


In [11]:
df

Unnamed: 0,job_id,title,location,department,has_salary,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,0,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,Unspecified,0,1,0,Other,Internship,Unspecified,Unspecified,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,0,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,Unspecified,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",Unspecified,0,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,Unspecified,0,1,0,Unspecified,Unspecified,Unspecified,Unspecified,Unspecified,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,0,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",Unspecified,0,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17875,17876,Account Director - Distribution,"CA, ON, Toronto",Sales,0,Vend is looking for some awesome new talent to...,Just in case this is the first time you’ve vis...,To ace this role you:Will eat comprehensive St...,What can you expect from us?We have an open cu...,0,1,1,Full-time,Mid-Senior level,Unspecified,Computer Software,Sales,0
17876,17877,Payroll Accountant,"US, PA, Philadelphia",Accounting,0,WebLinc is the e-commerce platform and service...,The Payroll Accountant will focus primarily on...,- B.A. or B.S. in Accounting- Desire to have f...,Health &amp; WellnessMedical planPrescription ...,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Internet,Accounting/Auditing,0
17877,17878,Project Cost Control Staff Engineer - Cost Con...,"US, TX, Houston",Unspecified,0,We Provide Full Time Permanent Positions for m...,Experienced Project Cost Control Staff Enginee...,At least 12 years professional experience.Abil...,Unspecified,0,0,0,Full-time,Unspecified,Unspecified,Unspecified,Unspecified,0
17878,17879,Graphic Designer,"NG, LA, Lagos",Unspecified,0,Unspecified,Nemsia Studios is looking for an experienced v...,1. Must be fluent in the latest versions of Co...,Competitive salary (compensation will be based...,0,0,1,Contract,Not Applicable,Professional,Graphic Design,Design,0


In [12]:
missing_values = df.isnull().sum()

missing_percent = (missing_values / len(df)) * 100

missing_report = pd.DataFrame({
    'Missing Values': missing_values,
    'Percentage (%)': missing_percent
})

missing_report = missing_report[missing_report['Missing Values'] > 0].sort_values(by='Percentage (%)', ascending=False)

print("Missing Value Report:")
print(missing_report)

Missing Value Report:
Empty DataFrame
Index: []


There are no more missing values.


In [15]:
df.to_csv('fake_job_postings_cleaned.csv', index=False)
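As a last sanity check, it can be worth reading the saved file back, because `pd.read_csv` turns empty fields and a few special strings such as "NA" into NaN again by default. The sketch below does the round trip on a made-up frame through an in-memory buffer (the toy rows are hypothetical); for the real file the same check would read `fake_job_postings_cleaned.csv` instead.

```python
import io

import pandas as pd

# Toy cleaned data (hypothetical rows) using the same placeholder value.
toy = pd.DataFrame({
    "job_id": [1, 2],
    "department": ["Sales", "Unspecified"],
    "has_salary": [0, 1],
})

# Write to an in-memory buffer instead of a file, then read it back.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Count missing values that reappeared after the round trip.
missing_after_reload = int(reloaded.isnull().sum().sum())
print(f"Missing values after reload: {missing_after_reload}")
```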