
# Job Listing Integrity Investigation

**Group 4:** Dian Jin, Mingze Wu, Tanvi Sheth, Sneha Sunil Ekka, Jenil Shah

**Motivation**

As graduate students entering the job market, we spend much of our time exploring job opportunities online, so we need to be able to trust the platforms with which we share our data. Being able to tell real job postings from fake ones is therefore important to us.


**Problem Statement**

The main objective of this project is to use Natural Language Processing (NLP) techniques to process the text of job postings and draw out patterns that distinguish fraudulent postings from real ones. There is a critical need for an automated, reliable solution that can improve the detection of fraudulent postings, strengthen platform integrity, and ensure a safe, trustworthy environment for both recruiters and job seekers.


**Business Relevance**

From a business standpoint, implementing a fraud detection solution is crucial for platforms like LinkedIn and Handshake. By improving fraud detection accuracy, these platforms can create a safer environment for both recruiters and job seekers, building trust and satisfaction. Furthermore, these insights can guide strategic decisions and market trend analysis, facilitating business growth and enhancing goodwill among users.

**Dataset & Data Source**

The dataset, Real / Fake Job Posting Prediction, is from Kaggle and was retrieved from [this link](https://www.kaggle.com/datasets/shivamb/real-or-fake-fake-jobposting-prediction?resource=download). It contains 17,880 job postings and 18 features, including job descriptions, company profiles, benefits, requirements, and a binary indicator of whether a posting is real or fake. The data consists of both textual information and meta-information about the jobs.

**Data Dictionary**

| Column | Description |
|----|----|
| job_id | Unique Job ID |
| title | The title of the job ad entry |
| location | Geographical location of the job ad |
| department | Corporate department (e.g. sales) |
| salary_range | Indicative salary range |
| company_profile | A brief company description |
| description | The detailed description of the job ad |
| requirements | Listed requirements for the job opening |
| benefits | Benefits offered by the employer |
| telecommuting | True for telecommuting positions |
| has_company_logo | True if company logo is present |
| has_questions | True if screening questions are present |
| employment_type | Full-time, Part-time, Contract, etc. |
| required_experience | Executive, Entry level, Intern, etc. |
| required_education | Doctorate, Master's Degree, Bachelor, etc. |
| industry | Automotive, IT, Health care, Real estate, etc. |
| function | Consulting, Engineering, Research, Sales etc. |
| fraudulent | Target (classification attribute): 1 for fake, 0 for real |

## Importing Libraries

# Part 1
---

In [None]:
# Data Structures
import numpy  as np
import pandas as pd

# Corpus Processing
import re
# import string
import nltk
import nltk.corpus
from nltk.corpus import words
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize, RegexpTokenizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

# Visualization and Analysis
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from wordcloud import WordCloud
from sklearn.metrics import silhouette_samples, silhouette_score

# K-Means
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer

colors = px.colors.qualitative.Pastel
nltk.download('words')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

## Loading the Data

In [None]:
!pip install opendatasets



In [None]:
import opendatasets as od

od.download("https://www.kaggle.com/datasets/shivamb/real-or-fake-fake-jobposting-prediction?resource=download")

Skipping, found downloaded files in "./real-or-fake-fake-jobposting-prediction" (use force=True to force download)




In [None]:
data = pd.read_csv('/content/real-or-fake-fake-jobposting-prediction/fake_job_postings.csv')
data.head()

Unnamed: 0,job_id,title,location,department,salary_range,company_profile,description,requirements,benefits,telecommuting,has_company_logo,has_questions,employment_type,required_experience,required_education,industry,function,fraudulent
0,1,Marketing Intern,"US, NY, New York",Marketing,,"We're Food52, and we've created a groundbreaki...","Food52, a fast-growing, James Beard Award-winn...",Experience with content management systems a m...,,0,1,0,Other,Internship,,,Marketing,0
1,2,Customer Service - Cloud Video Production,"NZ, , Auckland",Success,,"90 Seconds, the worlds Cloud Video Production ...",Organised - Focused - Vibrant - Awesome!Do you...,What we expect from you:Your key responsibilit...,What you will get from usThrough being part of...,0,1,0,Full-time,Not Applicable,,Marketing and Advertising,Customer Service,0
2,3,Commissioning Machinery Assistant (CMA),"US, IA, Wever",,,Valor Services provides Workforce Solutions th...,"Our client, located in Houston, is actively se...",Implement pre-commissioning and commissioning ...,,0,1,0,,,,,,0
3,4,Account Executive - Washington DC,"US, DC, Washington",Sales,,Our passion for improving quality of life thro...,THE COMPANY: ESRI – Environmental Systems Rese...,"EDUCATION: Bachelor’s or Master’s in GIS, busi...",Our culture is anything but corporate—we have ...,0,1,0,Full-time,Mid-Senior level,Bachelor's Degree,Computer Software,Sales,0
4,5,Bill Review Manager,"US, FL, Fort Worth",,,SpotSource Solutions LLC is a Global Human Cap...,JOB TITLE: Itemization Review ManagerLOCATION:...,QUALIFICATIONS:RN license in the State of Texa...,Full Benefits Offered,0,1,1,Full-time,Mid-Senior level,Bachelor's Degree,Hospital & Health Care,Health Care Provider,0


In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17880 entries, 0 to 17879
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   job_id               17880 non-null  int64 
 1   title                17880 non-null  object
 2   location             17534 non-null  object
 3   department           6333 non-null   object
 4   salary_range         2868 non-null   object
 5   company_profile      14572 non-null  object
 6   description          17879 non-null  object
 7   requirements         15185 non-null  object
 8   benefits             10670 non-null  object
 9   telecommuting        17880 non-null  int64 
 10  has_company_logo     17880 non-null  int64 
 11  has_questions        17880 non-null  int64 
 12  employment_type      14409 non-null  object
 13  required_experience  10830 non-null  object
 14  required_education   9775 non-null   object
 15  industry             12977 non-null  object
 16  func

## Cleaning the Data

In [None]:
# Checking for NULLs in the data
nan_counts = data.isnull().sum()
nan_percent = data.isnull().sum()/data.shape[0]
nans_dict = {'count_of_nans':nan_counts, 'percent_of_nans':nan_percent}

pd.DataFrame(nans_dict).sort_values('percent_of_nans')

Unnamed: 0,count_of_nans,percent_of_nans
job_id,0,0.0
has_questions,0,0.0
has_company_logo,0,0.0
telecommuting,0,0.0
fraudulent,0,0.0
title,0,0.0
description,1,5.6e-05
location,346,0.019351
requirements,2695,0.150727
company_profile,3308,0.185011


In [None]:
# Dropping NAN value in the 'description' column
data.dropna(subset=['description'], inplace=True)
data.shape

(17879, 18)

In [None]:
# Verifying the categorical nature of some columns
columns = ['telecommuting','has_company_logo','has_questions','employment_type','required_experience','required_education','fraudulent']
cat_check = data[columns]
unique = cat_check.apply(lambda col: col.unique())
pd.DataFrame(unique, columns=['Unique Values'])

Unnamed: 0,Unique Values
telecommuting,"[0, 1]"
has_company_logo,"[1, 0]"
has_questions,"[0, 1]"
employment_type,"[Other, Full-time, nan, Part-time, Contract, T..."
required_experience,"[Internship, Not Applicable, nan, Mid-Senior l..."
required_education,"[nan, Bachelor's Degree, Master's Degree, High..."
fraudulent,"[0, 1]"


In [None]:
# Filling missing values in categorical columns with "Not Specified"
categorical_columns = ['employment_type', 'required_experience', 'required_education']
for column in categorical_columns:
    data[column] = data[column].fillna('Not Specified')

# Collapsing 'Unspecified' and 'Not Specified' in the 'required_education' column into a single category
data.loc[data['required_education'] == 'Unspecified', 'required_education'] = 'Not Specified'

In [None]:
# Changing the datatypes of categorical columns to 'category'
data = data.astype({'telecommuting':'category', 'has_company_logo':'category', 'has_questions':'category', 'fraudulent':'category',
                    'employment_type':'category', 'required_experience':'category', 'required_education':'category'})

In [None]:
# Filling missing textual columns with empty strings
textual_columns = ['company_profile', 'description', 'requirements', 'benefits']
for column in textual_columns:
    data[column] = data[column].fillna('')

data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 17879 entries, 0 to 17879
Data columns (total 18 columns):
 #   Column               Non-Null Count  Dtype   
---  ------               --------------  -----   
 0   job_id               17879 non-null  int64   
 1   title                17879 non-null  object  
 2   location             17533 non-null  object  
 3   department           6333 non-null   object  
 4   salary_range         2868 non-null   object  
 5   company_profile      17879 non-null  object  
 6   description          17879 non-null  object  
 7   requirements         17879 non-null  object  
 8   benefits             17879 non-null  object  
 9   telecommuting        17879 non-null  category
 10  has_company_logo     17879 non-null  category
 11  has_questions        17879 non-null  category
 12  employment_type      17879 non-null  category
 13  required_experience  17879 non-null  category
 14  required_education   17879 non-null  category
 15  industry           

In [None]:
# Checking for NULLs again
nan_counts = data.isnull().sum()
nan_percent = data.isnull().sum()/data.shape[0]
nans_dict = {'count_of_nans':nan_counts, 'percent_of_nans':nan_percent}

nans_df = pd.DataFrame(nans_dict).sort_values('percent_of_nans')
nans_df[nans_df.count_of_nans > 0]

Unnamed: 0,count_of_nans,percent_of_nans
location,346,0.019352
industry,4902,0.274176
function,6454,0.360982
department,11546,0.645786
salary_range,15011,0.839588


In [None]:
## Investigating the 'salary_range' column
data['salary_range'].unique()

array([nan, '20000-28000', '100000-120000', '120000-150000',
       '50000-65000', '40000-50000', '60-80', '65000-70000', '75-115',
       '75000-110000', '17000-20000', '16000-28000', '95000-115000',
       '15000-18000', '50000-70000', '45000-60000', '30000-40000',
       '70000-90000', '10000-14000', '50-110', '28000-45000', '0-34300',
       '35000-40000', '9-Dec', '44000-57000', '18500-28000',
       '55000-75000', '30000-35000', '0-0', '20000-40000',
       '360000-600000', '50000-80000', '80000-100000', '52000-78000',
       '15750-15750', '40000-65000', '45000-50000', '30000-37000',
       '45000-67000', '35000-100000', '180000-216000', '45000-65000',
       '28000-32000', '0-1000', '36000-40000', '80000-110000',
       '35000-73000', '19000-19000', '60000-120000', '120000-15000000',
       '42000-55000', '90000-120000', '100000-150000', '28000-38000',
       '1600-1700', '50000-60000', '30000-70000', '32000-40000', '50-100',
       '9000-17000', '23040-28800', '105-110', '1300

In [None]:
# Checking the number of rows that have bad values (letters) in 'salary_range'
salary_not_null = data[~ data['salary_range'].isna()]
salary_check_df = salary_not_null[salary_not_null['salary_range'].str.contains('[a-zA-Z]')]
salary_check_df.shape

(26, 18)

In [None]:
# Getting rid of the above rows that contain dates in 'salary_range'
indices_to_drop = salary_check_df.index

# Obtaining the cleaned data
data.drop(indices_to_drop, inplace=True)

# Resetting the index
data.reset_index(drop=True, inplace=True)
data.shape

(17853, 18)

**Note:**

For the `salary_range` column, approximately 84% of the values are missing, and the remaining values are not all in the same format: a few look like dates rather than minimum and maximum salary values. We also know from the `location` column that the job listings are spread out across the world, so it is fair to assume that the salary ranges are given in their respective local currencies.

One way to handle this data is to get rid of the date-format values and convert the local currencies to USD for a more standardized analysis. This, however, risks introducing a mismatch in the salary values for comparable job roles in different countries.

We also acknowledge that some companies prefer not to disclose the salary range before making an offer to a candidate, so it isn't essential for all values in the `salary_range` column to be non-null.
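
As a rough sketch of the first step described above, the helper below splits a `salary_range` string into numeric bounds and rejects anything that is not two plain integers (for example the date-like `9-Dec`). The name `parse_salary_range` is hypothetical and not part of this notebook, and no currency conversion is attempted.

```python
import re

# Hypothetical helper for illustration; not part of the original notebook.
_RANGE_RE = re.compile(r'^\s*(\d+)\s*-\s*(\d+)\s*$')

def parse_salary_range(value):
    """Return (low, high) as ints, or None if the value is missing or malformed."""
    if not isinstance(value, str):
        return None                      # NaN / missing values are not strings
    match = _RANGE_RE.match(value)
    if match is None:
        return None                      # e.g. '9-Dec' contains letters
    low, high = int(match.group(1)), int(match.group(2))
    if low > high:
        return None                      # bounds in the wrong order
    return low, high

examples = ['20000-28000', '9-Dec', '0-0', float('nan')]
parsed = [parse_salary_range(v) for v in examples]
```

If useful, the same function could be applied column-wise with `data['salary_range'].apply(parse_salary_range)`; values such as `60-80` would still need a decision about their unit (hourly, thousands, and so on), which this sketch does not make.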

In [None]:
# Investigating the 'department' column
list(data['department'].unique())

['Marketing',
 'Success',
 nan,
 'Sales',
 'ANDROIDPIT',
 'HR',
 ' R&D',
 'Engagement',
 'Businessfriend.com',
 'Medical',
 'Field',
 'All',
 'Design',
 'Production',
 'ICM',
 'General Services',
 'Engineering',
 'IT',
 'Business Development',
 'Human Resources',
 'Oil & Energy',
 'Marketplace',
 'Cloud Services',
 'FP',
 'Client Services',
 'Operations',
 'Materials',
 'tech',
 'Sales and Business Development',
 'R&D',
 'Development',
 'Incubation Services',
 'Field Operations',
 'MKT',
 'Technology',
 'Power Plant & Energy',
 'Approvals Department',
 'Playfair Capital',
 'Development ',
 'Tech',
 'Software development',
 'Media',
 'Line-Up',
 'Management',
 'Squiz ',
 'Finance',
 'Financial',
 'Retail',
 'Marketing and Communications',
 'Research',
 'Connectivity',
 'PMO',
 'Product',
 'Student Beans Mag',
 'Information Technology Group',
 'DTVMA',
 'G&A',
 'Implementations',
 'OPS',
 'Partnership Management',
 'Professional Services',
 'Customer Care',
 'Account Management',
 'EC',


In [None]:
# Dropping the 'department' column
data.drop('department', axis=1, inplace=True)

**Note:**

The `department` column is very inconsistent: it mixes department names with locations and numbers, and a few entries contain job roles or job titles in place of the department name or function.

Since there is no way to verify the correctness of these values, using them would compromise the integrity of the analysis. For this reason, we decided to drop the `department` column.

In [None]:
# Investigating the 'function' column
data['function'].unique()

array(['Marketing', 'Customer Service', nan, 'Sales',
       'Health Care Provider', 'Management', 'Information Technology',
       'Other', 'Engineering', 'Administrative', 'Design', 'Production',
       'Education', 'Supply Chain', 'Business Development',
       'Product Management', 'Financial Analyst', 'Consulting',
       'Human Resources', 'Project Management', 'Manufacturing',
       'Public Relations', 'Strategy/Planning', 'Advertising', 'Finance',
       'General Business', 'Research', 'Accounting/Auditing',
       'Art/Creative', 'Quality Assurance', 'Data Analyst',
       'Business Analyst', 'Writing/Editing', 'Distribution', 'Science',
       'Training', 'Purchasing', 'Legal'], dtype=object)

In [None]:
# Investigating the 'industry' column
data['industry'].unique()

array([nan, 'Marketing and Advertising', 'Computer Software',
       'Hospital & Health Care', 'Online Media',
       'Information Technology and Services', 'Financial Services',
       'Management Consulting', 'Events Services', 'Internet',
       'Facilities Services', 'Consumer Electronics',
       'Telecommunications', 'Consumer Services', 'Construction',
       'Oil & Energy', 'Education Management', 'Building Materials',
       'Banking', 'Food & Beverages', 'Food Production',
       'Health, Wellness and Fitness', 'Insurance', 'E-Learning',
       'Cosmetics', 'Staffing and Recruiting',
       'Venture Capital & Private Equity', 'Leisure, Travel & Tourism',
       'Human Resources', 'Pharmaceuticals', 'Farming', 'Legal Services',
       'Luxury Goods & Jewelry', 'Machinery', 'Real Estate',
       'Mechanical or Industrial Engineering',
       'Public Relations and Communications', 'Consumer Goods',
       'Medical Practice', 'Electrical/Electronic Manufacturing',
       'Hospita

In [None]:
# Investigating the 'location' column
data['location'].unique()

array(['US, NY, New York', 'NZ, , Auckland', 'US, IA, Wever', ...,
       'US, CA, los Angeles', 'CA, , Ottawa', 'GB, WSX, Chichester'],
      dtype=object)

In [None]:
# Extracting 'country' from 'location' column
data['country'] = data['location'].str[:2]

**Note:**

The `location` column is inconsistent: some rows are missing entirely, and the state or city is missing for quite a few others. We therefore create a new column, `country`, by extracting the first two characters (the country code) of `location` for further analysis.
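
As an illustration of the structure we are relying on, the sketch below splits a location string into its three parts. It assumes the `country, state, city` layout seen in the sample values above; `split_location` is a hypothetical helper and is not used elsewhere in this notebook.

```python
def split_location(location):
    """Split 'CC, state, city' into (country, state, city); blank parts become None."""
    if not isinstance(location, str):
        return (None, None, None)        # missing locations are NaN, not strings
    # maxsplit=2 keeps any further commas inside the city part
    parts = [part.strip() for part in location.split(',', 2)]
    parts += [''] * (3 - len(parts))     # pad values with fewer than three parts
    return tuple(part or None for part in parts)

full = split_location('US, NY, New York')
no_state = split_location('NZ, , Auckland')
country_only = split_location('US')
```

Slicing the first two characters, as done in the cell above, gives the same country code as the first element of this tuple whenever the value starts with a two-letter code.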

In [None]:
# Filling missing values in other columns with "Not Specified"
other_columns = ['location', 'country', 'industry', 'function', 'salary_range']
for column in other_columns:
    data[column] = data[column].fillna("Not Specified")

In [None]:
data.isna().sum()

job_id                 0
title                  0
location               0
salary_range           0
company_profile        0
description            0
requirements           0
benefits               0
telecommuting          0
has_company_logo       0
has_questions          0
employment_type        0
required_experience    0
required_education     0
industry               0
function               0
fraudulent             0
country                0
dtype: int64

## Preprocessing Text Data With NLP

### Functions

In [None]:
def keepWords(listOfTokens, listOfWords):
    '''retains a list of words from a tokenized list'''
    return [token for token in listOfTokens if token in listOfWords]

def removeWords(listOfTokens, listOfWords):
    '''removes a list of words (ie. stopwords) from a tokenized list'''
    return [token for token in listOfTokens if token not in listOfWords]

def extremeWords(listOfTokens):
    '''collects words composed of 2 or fewer, or 21 or more, characters'''
    extremeWords = []
    for token in listOfTokens:
        if len(token) <= 2 or len(token) >= 21:
            extremeWords.append(token)
    return extremeWords

### Tokenization

In [None]:
def tokenize_text(text):
    '''cleans, tokenizes, lemmatizes and stems a single text value, returning one string'''

    # Converting any non-string data to string
    text = str(text)

    # Converting all text to lowercase
    text = text.lower()

    # Removing e-mails and URLs first, while '@', ':' and '/' are still present
    text = text.replace('\ufffd', '8')           # replaces the Unicode replacement character '�' with '8'
    text = re.sub(r'\S*@\S*\s?', '', text)       # removes emails and mentions (words with @)
    text = re.sub(r'http\S+', '', text)          # removes URLs with http
    text = re.sub(r'www\S+', '', text)           # removes URLs with www

    # Removing every character that is not a letter, a digit or whitespace
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)

    # Tokenizing text
    tokens = word_tokenize(text)

    # Further cleaning the tokens (sets keep the membership checks fast)
    stop_words = set(stopwords.words('english'))
    extreme_words = set(extremeWords(tokens))
    tokens = removeWords(tokens, stop_words)
    tokens = removeWords(tokens, extreme_words)

    # Lemmatizing each token, then stemming the result
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()
    tokens = [stemmer.stem(lemmatizer.lemmatize(token)) for token in tokens]

    # Rejoining the tokens into a string
    text = " ".join(tokens)

    return text
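
The order of the cleaning steps matters: once punctuation has been stripped, an address such as `hr@example.com` no longer contains `@`, so a pattern that looks for it cannot match. The short sketch below shows the difference on a made-up sentence (the sentence and variable names are illustrative only).

```python
import re

sample = "Apply at http://example.com/jobs or email hr@example.com today"

# Order A: punctuation stripped first. The '@' is gone before the e-mail
# pattern runs, so the address survives as the junk token 'hrexamplecom'.
stripped_first = re.sub(r'[^a-zA-Z0-9\s]', '', sample)
stripped_first = re.sub(r'\S*@\S*\s?', '', stripped_first)
stripped_first = re.sub(r'http\S+', '', stripped_first)

# Order B: e-mails and URLs removed first, punctuation stripped afterwards.
removed_first = re.sub(r'\S*@\S*\s?', '', sample)
removed_first = re.sub(r'http\S+', '', removed_first)
removed_first = re.sub(r'[^a-zA-Z0-9\s]', '', removed_first)

tokens_a = stripped_first.split()
tokens_b = removed_first.split()
```

Here `tokens_a` still contains `hrexamplecom`, while `tokens_b` keeps only the ordinary words.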

### Vectorization

In [None]:
def vectorize_text():
    '''returns an unfitted TfidfVectorizer that does not normalize the row vectors (norm=None)'''
    tfidf_vectorizer = TfidfVectorizer(norm=None)
    return tfidf_vectorizer
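
To make the effect of `norm=None` concrete, the sketch below recomputes unnormalised TF-IDF weights by hand. It assumes scikit-learn's default smoothed formula, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, with raw term counts as TF. It splits on whitespace rather than using the vectorizer's own tokenizer, and the three toy documents are made up, so treat it as an approximation of what `TfidfVectorizer` does rather than a replacement for it.

```python
import math
from collections import Counter

def tfidf_unnormalised(docs):
    """Raw-count TF times smoothed IDF, with no row normalisation."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1 for term, df in doc_freq.items()}
    return [{term: tf * idf[term] for term, tf in Counter(tokens).items()}
            for tokens in tokenised]

# Toy documents for illustration only
docs = ["team sale team", "team client", "earn money fast"]
weights = tfidf_unnormalised(docs)
```

Because the rows are not normalised, a term that occurs twice in a document gets twice the weight, so longer texts contribute larger values when the columns are summed for the word clouds.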

## EDA & Visualisations

### Exploratory Plots

In [None]:
# Plotting Distribution of Employment Types

employment_counts = data['employment_type'].value_counts().reset_index()
employment_counts.columns = ['employment_type', 'count']

# Plot using Plotly
fig = px.bar(employment_counts,
             y='employment_type',
             x='count',
             color='employment_type',
             orientation='h',
             title='Plot 1: Distribution of Employment Types',
             labels={'count': 'Count', 'employment_type': 'Employment Type'},
             color_discrete_sequence=colors[2:]
            )

fig.show()

**Insights:**

Quite a few entries do not mention an employment type. However, the majority of the postings are for full-time roles.

In [None]:
# Plotting Distribution of Required Education for Job Postings

required_education = data['required_education'].value_counts().reset_index()
required_education.columns = ['required_education', 'count']

fig = px.bar(required_education,
             y='required_education',
             x='count',
             color='required_education',
             orientation='h',
             title='Plot 2: Distribution of Required Education for Job Postings',
             labels={'count': 'Count', 'required_education': 'Required Education'},
             color_discrete_sequence=colors[2:]
            )

fig.show()

**Insights:**

The majority of the job postings do not explicitly specify degree requirements, so we will analyse the `requirements` column further to see whether these specifications are mentioned in the text.

Other than that, a Bachelor's degree is the most commonly required qualification for these job listings.

In [None]:
# Plotting Top 10 Industries Represented in Job Postings

top_industries = data['industry'].value_counts().reset_index().head(10)
top_industries.columns = ['industry', 'count']

fig = px.bar(top_industries,
             y='industry',
             x='count',
             color='industry',
             orientation='h',
             title='Plot 3: Top 10 Industries Represented in Job Postings',
             labels={'count': 'Count', 'industry': 'Industry'},
             color_discrete_sequence=colors[2:]
            )
fig.show()

**Insights:**

Similar to the `required_education` column, we see that companies often do not tag their industry when listing a job on a portal. We will look at these details further while analysing the `company_profile` column.

Among the industries that are specified, technology companies providing IT, Computer Software and Internet-based services dominate, followed by Marketing and Advertising.

In [None]:
# Plotting Distribution of Job Postings With and Without Company Logo

# Creating a copy of data for plots
plot_df = data.copy()
plot_df['fraudulent'] = plot_df['fraudulent'].map({0: "Real", 1: "Fake"})

# The column holds 0/1 values, so the category order is given as [0, 1]
fig = px.histogram(plot_df, x='has_company_logo', color='fraudulent', barmode='group',
                   category_orders={'has_company_logo': [0, 1]}, color_discrete_sequence=colors)

fig.update_layout(
    title='Plot 4: Distribution of Job Postings With and Without Company Logo',
    xaxis_title='Has Company Logo',
    yaxis_title='Count',
    width=800,
    height=600,
    legend_title_text='Fraudulent'
)
fig.update_layout(xaxis=dict(tickvals=[0, 1], ticktext=['No', 'Yes']))

fig.show()

**Insights:**

Companies usually include their logo when posting a job on a platform, and this holds for both real and fraudulent listings. However, a much larger proportion of fraudulent listings than real ones lack a company logo. A missing logo therefore gives the platform a cue to verify the credibility of a job listing.
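
Because real postings far outnumber fraudulent ones, raw counts in a grouped histogram can make this kind of difference hard to read. A minimal sketch of computing the fraud share within each group is given below; `fraud_rate_by_group` is a hypothetical helper, and the toy values are made up rather than taken from the dataset.

```python
from collections import defaultdict

def fraud_rate_by_group(groups, labels):
    """Return {group: share of rows in that group whose label is 1}."""
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for group, label in zip(groups, labels):
        totals[group] += 1
        frauds[group] += int(label)
    return {group: frauds[group] / totals[group] for group in totals}

# Toy values for illustration only, not taken from the dataset
has_logo   = [1, 1, 1, 1, 0, 0, 1, 0]
fraudulent = [0, 0, 0, 1, 1, 0, 0, 1]
rates = fraud_rate_by_group(has_logo, fraudulent)
```

With the notebook's DataFrame, the two columns `data['has_company_logo']` and `data['fraudulent']` could be passed in directly; `pd.crosstab(..., normalize='index')` should give an equivalent table.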

In [None]:
# Plotting Distribution of Telecommuting Job Postings
# The column holds 0/1 values, so the category order is given as [0, 1]
fig = px.histogram(plot_df, x='telecommuting', color='fraudulent', barmode='group',
                   category_orders={'telecommuting': [0, 1]}, color_discrete_sequence=colors)

fig.update_layout(
    title='Plot 5: Distribution of Telecommuting Job Postings',
    xaxis_title='Telecommuting',
    yaxis_title='Count',
    width=800,
    height=600,
    legend_title_text='Fraudulent'
)
fig.update_layout(xaxis=dict(tickvals=[0, 1], ticktext=['No', 'Yes']))

fig.show()

**Insights:**

90% of the job listings do not offer a telecommuting or work-from-home option, which suggests that most of these jobs expect employees to be present in the office. Although the number of telecommuting postings is small, approximately 10% of the jobs that do offer this option are fraudulent.

In [None]:
import plotly.graph_objects as go
from plotly.subplots import make_subplots

df1 = data.copy()

# removing not specified from the country column
df1 = df1[df1['country'] != 'Not Specified']

country_counts = df1.groupby(['country', 'fraudulent']).size().unstack(fill_value=0)

country_counts['total'] = country_counts.sum(axis=1)

country_counts = country_counts.sort_values(by='total', ascending=False)

top_fraudulent_countries, top_non_fraudulent_countries = [country_counts.nlargest(10, 1), country_counts.nlargest(10, 0)]

fig = make_subplots(rows=1, cols=2, subplot_titles=("Top 10 Fraudulent Countries", "Top 10 Non-Fraudulent Countries"))


fig.add_trace(go.Bar(x=top_fraudulent_countries.index, y=top_fraudulent_countries[1], name='Fraudulent', marker = {'color' : colors[0]}),
              row=1, col=1)

fig.add_trace(go.Bar(x=top_non_fraudulent_countries.index, y=top_non_fraudulent_countries[0], name='Non-Fraudulent', marker = {'color' : colors[1]}),
              row=1, col=2)

fig.update_layout(title="Plot 6: Top 10 Fraudulent and Non-Fraudulent Countries", showlegend=False)
fig.show()

**Insights:**

We can see that the US has the most job listings in both categories, fraudulent and real (non-fraudulent). Australia (AU) shows a surprisingly high proportion of fraudulent listings compared to other countries, whereas Greece (GR) has a negligible number of fraudulent listings.

**References:**

[Documentation for Plotly](https://plotly.com/python/plotly-express/)

[Documentation for Plotly Graphs](https://plotly.com/python/graph-objects/)

### Word Clouds for Text Columns

In [None]:
text_columns = ['company_profile', 'description', 'requirements', 'benefits']

# Creating a list to hold TF-IDF vectors for each text column
tfidf_vector_list = []

for column in text_columns:

  # Applying preprocessing to the dataset
  data_processed = data[column].apply(tokenize_text)

  # Vectorization
  vectorizer = vectorize_text()
  df_tfidf_transformed = vectorizer.fit_transform(data_processed)
  tfidf_vectors = pd.DataFrame(df_tfidf_transformed.toarray(), columns=vectorizer.get_feature_names_out())
  tfidf_vector_list.append(tfidf_vectors)

  # Plotting the word cloud for each column
  print('Wordcloud for', column)

  wordcloud_data = tfidf_vectors.sum(axis=0)  # Sum the TF-IDF vectors for each word
  wordcloud = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(wordcloud_data)

  plt.figure(figsize=(10, 5))
  plt.imshow(wordcloud, interpolation='bilinear')
  plt.axis("off")
  plt.show()

**Insights:**

Across all four word clouds, **team** is among the most prominent terms, whether it appears as something the company expects from the applicant or something the company offers. Postings consistently emphasise a strong and supportive team regardless of the role.

`company_profile`:

The most prominent words in the `company_profile` column are services, business, solutions, work, clients and team. This gives us a glimpse of the company culture: companies present their clients as a priority and emphasise a healthy team-building culture.

`description`:

The `description` column, which gives us information about the job profile, focuses very strongly on teamwork, followed by client, business and sales. This reinforces the emphasis on teams noted above.

`requirements`:

This column gives us the basic qualifications required for a job listing. Experience, skills and ability are at the forefront. Combined with our earlier observation that many postings do not specify a required degree, this suggests that recruiters emphasise experience and skills over a degree or a certification.

`benefits`:

Here, we see that opportunity, competitive salary and insurance are the aspects that companies most often promise to their candidates.


**Reference:** [Documentation for WordCloud](https://www.analyticsvidhya.com/blog/2021/05/how-to-build-word-cloud-in-python/)

# Part 2
---

**Subsetting the dataset for PCA**

In [None]:
data_sampled = data.sample(frac=0.2, random_state=1)
data_sampled.shape

**Note:**

To identify the most influential word tokens in our textual columns, we first take a 20% sample of the original data to check the feasibility of the dimensionality reduction procedure.

We treat each text column independently so that the vocabulary of one column does not spill over into another.
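
A back-of-the-envelope calculation helps explain why sampling is useful here: converting a TF-IDF matrix with `.toarray()` produces a dense float64 array, which needs 8 bytes per cell. The sketch below estimates the size for the full data and for a 20% sample. The vocabulary size of 50,000 is an assumed, purely illustrative figure and was not measured on this dataset.

```python
def dense_matrix_gib(n_rows, n_cols, bytes_per_value=8):
    """Approximate size in GiB of a dense array (float64 uses 8 bytes per value)."""
    return n_rows * n_cols * bytes_per_value / 2**30

full_rows = 17853                      # rows left after cleaning
sample_rows = round(full_rows * 0.2)   # size of a 20% sample
vocab_size = 50_000                    # assumed vocabulary size, illustrative only

full_gib = dense_matrix_gib(full_rows, vocab_size)
sample_gib = dense_matrix_gib(sample_rows, vocab_size)
```

Under this assumption the full dense matrix would take several GiB per text column, while the sampled one is about a fifth of that.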

In [None]:
# Preprocessing and tokenization for 'company_profile' column
corpus_profile = data_sampled['company_profile'].apply(tokenize_text)

# Vectorization
tfidf_vectorizer = vectorize_text()
X = tfidf_vectorizer.fit_transform(corpus_profile)
tfidf_profile = pd.DataFrame(data = X.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

In [None]:
# Preprocessing and tokenization for 'description' column
corpus_description = data_sampled['description'].apply(tokenize_text)

# Vectorization
tfidf_vectorizer = vectorize_text()
X = tfidf_vectorizer.fit_transform(corpus_description)
tfidf_description = pd.DataFrame(data = X.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

In [None]:
# Preprocessing and tokenization for 'requirements' column
corpus_requirements = data_sampled['requirements'].apply(tokenize_text)

# Vectorization
tfidf_vectorizer = vectorize_text()
X = tfidf_vectorizer.fit_transform(corpus_requirements)
tfidf_requirements = pd.DataFrame(data = X.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

In [None]:
# Preprocessing and tokenization for 'benefits' column
corpus_benefits = data_sampled['benefits'].apply(tokenize_text)

# Vectorization
tfidf_vectorizer = vectorize_text()
X = tfidf_vectorizer.fit_transform(corpus_benefits)
tfidf_benefits = pd.DataFrame(data = X.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

## Dimensionality Reduction

## Clustering

# Part 3
---

## Classification