# Job Posting Data Analysis
In this notebook, we work with the [Job Posting in Singapore](https://www.kaggle.com/datasets/techsalerator/job-posting-data-in-singapore) dataset, which we process, analyze, and visualize.

This project is carried out by the group **DS NERDS**, under Section **S19**, which consists of the following members:
- Colobong, Franz Andrick
- Chu, Andre Benedict M. 
- Pineda, Mark Gabriel A.
- Rocha, Angelo H. 
  
This output partially fulfills the requirements of the course Statistical Modeling and Simulation (CSMODEL).


## Import Libraries

In [71]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Dataset Description and Collection Process

This dataset offers a comprehensive overview of job openings across various sectors in Singapore. It provides an essential resource for businesses, job seekers, and labor market analysts, and it can also be a valuable tool for people who would like to be informed about job openings and employment trends in Singapore.

The data was collected by **Techsalerator**, a global data provider, which consolidates and categorizes job-related information from diverse sources, including company websites, job boards, and recruitment agencies.

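Now, let us load the CSV file into our workspace with **'latin1'** encoding, because the file contains special characters (e.g., é, ñ, ’) that raised a `UnicodeDecodeError` under the default **'utf-8'** encoding. Note that **'latin1'** maps every possible byte to a character, so it never raises this error, but it can still decode some characters incorrectly.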

In [72]:
job_posting_df = pd.read_csv('Job Posting.csv', encoding='latin1')
job_posting_df.head()

Unnamed: 0,Website Domain,Ticker,Job Opening Title,Job Opening URL,First Seen At,Last Seen At,Location,Location Data,Category,Seniority,...,Description,Salary,Salary Data,Contract Types,Job Status,Job Language,Job Last Processed At,O*NET Code,O*NET Family,O*NET Occupation Name
0,bosch.com,,IN_RBAI_Assistant Manager_Dispensing Process E...,https://jobs.smartrecruiters.com/BoschGroup/74...,2024-05-29T19:59:45Z,2024-07-31T14:35:44Z,"Indiana, United States","[{""city"":null,""state"":""Indiana"",""zip_code"":nul...","engineering, management, support",manager,...,**IN\_RBAI\_Assistant Manager\_Dispensing Proc...,,"{""salary_low"":null,""salary_high"":null,""salary_...",full time,closed,en,2024-08-02T14:47:55Z,43-1011.00,Office and Administrative Support,First-Line Supervisors of Office and Administr...
1,bosch.com,,Professional Internship: Hardware Development ...,https://jobs.smartrecruiters.com/BoschGroup/74...,2024-05-04T01:00:12Z,2024-07-29T17:46:16Z,"Delaware, United States","[{""city"":null,""state"":""Delaware"",""zip_code"":nu...",internship,non_manager,...,**Professional Internship: Hardware Developmen...,,"{""salary_low"":null,""salary_high"":null,""salary_...","full time, internship, m/f",closed,en,2024-07-31T17:50:07Z,17-2061.00,Architecture and Engineering,Computer Hardware Engineers
2,zf.com,,Process Expert BMS Production,https://jobs.zf.com/job/Shenyang-Process-Exper...,2024-04-19T06:47:24Z,2024-05-16T02:25:08Z,China,"[{""city"":null,""state"":null,""zip_code"":null,""co...",engineering,non_manager,...,ZF is a global technology company supplying sy...,,"{""salary_low"":null,""salary_high"":null,""salary_...",,closed,en,2024-05-18T02:32:04Z,51-9141.00,Production,Semiconductor Processing Technicians
3,bosch.com,,DevOps Developer with Python for ADAS Computin...,https://jobs.smartrecruiters.com/BoschGroup/74...,2024-08-16T10:20:37Z,2024-08-22T11:14:49Z,Romania,"[{""city"":null,""state"":null,""zip_code"":null,""co...","information_technology, software_development",non_manager,...,**DevOps Developer with Python for ADAS Comput...,,"{""salary_low"":null,""salary_high"":null,""salary_...",full time,closed,en,2024-08-23T00:33:30Z,15-1252.00,Computer and Mathematical,Software Developers
4,bosch.com,,Senior Engineer Sales - Video Systems and Solu...,https://jobs.smartrecruiters.com/BoschGroup/74...,2024-07-01T17:31:20Z,2024-08-01T05:11:33Z,India,"[{""city"":null,""state"":null,""zip_code"":null,""co...","engineering, sales",non_manager,...,**Senior Engineer Sales - Video Systems and So...,,"{""salary_low"":null,""salary_high"":null,""salary_...",full time,closed,en,2024-08-02T19:03:16Z,41-9031.00,Sales and Related,Sales Engineers


In [73]:
job_posting_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9919 entries, 0 to 9918
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Website Domain         9919 non-null   object 
 1   Ticker                 0 non-null      float64
 2   Job Opening Title      9919 non-null   object 
 3   Job Opening URL        9919 non-null   object 
 4   First Seen At          9919 non-null   object 
 5   Last Seen At           9919 non-null   object 
 6   Location               9508 non-null   object 
 7   Location Data          9919 non-null   object 
 8   Category               8250 non-null   object 
 9   Seniority              9919 non-null   object 
 10  Keywords               7646 non-null   object 
 11  Description            9807 non-null   object 
 12  Salary                 576 non-null    object 
 13  Salary Data            9919 non-null   object 
 14  Contract Types         8004 non-null   object 
 15  Job 

In [74]:
# Remove Duplicates
job_posting_df = job_posting_df.drop_duplicates()


In [75]:
# Handle rows with a missing location or contract type.
# These rows are dropped (not imputed) because they cannot be used in the analysis.

# Missing values before dropping
print(job_posting_df[['Location', 'Contract Types']].isnull().sum())

job_posting_df = job_posting_df.dropna(subset=['Location', 'Contract Types'], how='any')

# Missing values after dropping
print(job_posting_df[['Location', 'Contract Types']].isnull().sum())


Location           411
Contract Types    1915
dtype: int64
Location          0
Contract Types    0
dtype: int64
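
To make the effect of `how='any'` concrete, here is a small sketch on a made-up frame. The rows are hypothetical and only the column names mirror the dataset. With `how='any'`, a row is dropped if either listed column is missing, whereas `how='all'` would drop it only when both are missing.

```python
import pandas as pd

# Hypothetical toy rows; only the column names mirror the dataset.
toy = pd.DataFrame({
    'Location': ['Singapore, Singapore', None, 'Germany', None],
    'Contract Types': ['full time', 'full time', None, None],
    'Seniority': ['manager', 'non_manager', 'director', 'head'],
})

# how='any': drop a row if Location OR Contract Types is missing.
kept_any = toy.dropna(subset=['Location', 'Contract Types'], how='any')

# how='all': drop a row only if Location AND Contract Types are both missing.
kept_all = toy.dropna(subset=['Location', 'Contract Types'], how='all')

print(len(toy), len(kept_any), len(kept_all))
```

The stricter `how='any'` setting used in the cell above keeps only rows where both fields are present, which is why both null counts drop to zero.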


In [76]:
# Cleaning Text Fields
# Will be used for categorization later
job_posting_df['O*NET Family'] = job_posting_df['O*NET Family'].str.strip().str.lower()
job_posting_df['Keywords'] = job_posting_df['Keywords'].str.strip().str.lower()
job_posting_df['O*NET Occupation Name'] = job_posting_df['O*NET Occupation Name'].str.strip().str.lower()


In [77]:
# Check the unique values of Seniority
unique_values_seniority = job_posting_df['Seniority'].unique()
print(unique_values_seniority)

['manager' 'non_manager' 'director' 'head' 'vice_president' 'c_level'
 'partner' 'president']


In [94]:
# Seniority Categorization
seniority_mapping = {
    'non_manager': 'Non-Managerial Position',
    'manager': 'Managerial Position',
    'director': 'Managerial Position',
    'head': 'Managerial Position',
    'vice_president': 'Executive Position',
    'c_level': 'Executive Position',
    'partner': 'Executive Position',
    'president': 'Executive Position',
}

# Map the values and count categories
seniority_categories = job_posting_df['Seniority'].map(seniority_mapping)
seniority_category_counts = seniority_categories.value_counts().sort_index()

print("Seniority Category Counts:")
print(seniority_category_counts)

Seniority Category Counts:
Seniority
Executive Position           17
Managerial Position        1433
Non-Managerial Position    6173
Name: count, dtype: int64


In [93]:
unique_job_fields = job_posting_df['O*NET Family'].unique()
print(unique_job_fields)

['office and administrative support' 'architecture and engineering'
 'computer and mathematical' 'sales and related'
 'installation, maintenance, and repair'
 'business and financial operations' 'production'
 'life, physical, and social science' 'management'
 'community and social service' 'transportation and material moving'
 'healthcare practitioners and technical' 'personal care and service'
 'educational instruction and library' 'construction and extraction'
 'arts, design, entertainment, sports, and media'
 'food preparation and serving related' 'protective service'
 'military specific' 'legal' 'healthcare support'
 'farming, fishing, and forestry'
 'building and grounds cleaning and maintenance' nan]


In [99]:
# Group the O*NET families into broader job fields
job_fields_mapping = {
    'office and administrative support': 'Business and Administration',
    'architecture and engineering': 'Engineering and Construction',
    'computer and mathematical': 'Technology',
    'sales and related': 'Business and Administration',
    'installation, maintenance, and repair': 'Facilities Management and Services',
    'business and financial operations': 'Business and Administration',
    'production': 'Manufacturing',
    'life, physical, and social science': 'Science and Research',
    'management': 'Business and Administration',
    'community and social service': 'Public Service',
    'transportation and material moving': 'Transportation and Logistics',
    'healthcare practitioners and technical': 'Healthcare',
    'personal care and service': 'Healthcare',
    'educational instruction and library': 'Education',
    'construction and extraction': 'Engineering and Construction',
    'arts, design, entertainment, sports, and media': 'Multimedia and Sports',
    'food preparation and serving related': 'Facilities Management and Services',
    'protective service': 'Government and Public Safety',
    'military specific': 'Government and Public Safety',
    'legal': 'Legal Services',
    'healthcare support': 'Healthcare',
    'farming, fishing, and forestry': 'Agriculture and Natural Resources',
    'building and grounds cleaning and maintenance': 'Facilities Management and Services',
    # No 'nan' key: a string key never matches a real missing value (NaN).
    # Rows with a missing O*NET Family stay NaN after .map() and are
    # left out of value_counts() below.
}

job_fields_categories = job_posting_df['O*NET Family'].map(job_fields_mapping)
job_fields_category_counts = job_fields_categories.value_counts().sort_index()
print(job_fields_category_counts)


O*NET Family
Agriculture and Natural Resources       12
Business and Administration           2861
Education                              247
Engineering and Construction          1202
Facilities Management and Services     285
Government and Public Safety            53
Healthcare                             238
Legal Services                          18
Manufacturing                          669
Multimedia and Sports                   84
Public Service                          33
Science and Research                   269
Technology                            1342
Transportation and Logistics           308
Name: count, dtype: int64
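
The counts above contain no `Others` row. `Series.map` with a dict looks up the actual cell values, so a `'nan'` string key in the mapping would not match a real missing value (`NaN`); such rows stay `NaN` and `value_counts()` skips them by default. Below is a minimal sketch of one way to route them to `Others`, using hypothetical toy values rather than the dataset itself.

```python
import pandas as pd

# Hypothetical toy values; the last entry is a real missing value.
families = pd.Series(['legal', 'production', 'legal', None], name='O*NET Family')

mapping = {
    'legal': 'Legal Services',
    'production': 'Manufacturing',
}

# .map() leaves missing values (and any value absent from the dict) as NaN,
# so fill those explicitly before counting.
categories = families.map(mapping).fillna('Others')

counts = categories.value_counts().sort_index()
print(counts)
```

Note that `fillna('Others')` also catches any family that is simply absent from the mapping, so it may be worth checking `families.map(mapping).isna().sum()` first to see how many rows are affected.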


In [100]:
unique_locations = job_posting_df['Location'].unique()
print(unique_locations)

['Indiana, United States' 'Delaware, United States' 'Romania' 'India'
 'Yokohama, Japan' 'Lincolnton, North Carolina, 28092, United States'
 'Hanau, Germany' 'Campinas, Brazil'
 'Charleston, South Carolina, 29418, United States'
 'Tennessee, United States'
 'Fort Lauderdale, Florida, 33309, United States' 'Singapore, Singapore'
 'United Kingdom' 'France' 'Albion, Indiana, 46701, United States'
 'Madrid, Spain' 'Bangalore, India' 'Brazil' 'Vietnam' 'Malaysia'
 'Anderson, South Carolina, 29621, United States' 'Debrecen, Hungary'
 'Pamplona, Spain' 'Bursa, Turkey' 'Salzburg, Austria' 'Slovenia'
 'Germany' 'Vernon Hills, Illinois, 60061, United States'
 'London, United Kingdom' 'Tokyo, Japan' 'Australia'
 'Massachusetts, United States' 'Northville, Michigan, United States'
 'Portugal' 'Osnabr\x9fck, Germany' 'Berlin, Germany' 'Budapest, Hungary'
 'Beijing, China' '_esk\x8e Bud_jovice, Czechia'
 'Chandler, Arizona, 85226, United States'
 'Auburn Hills, Michigan, 48326, United States'
 'Hart
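
Escapes such as `\x9f` and `\x8e` in the printed locations suggest that some bytes were not decoded as intended. The following is an inference, not something the dataset documents: in the Mac OS Roman code page those two bytes are `ü` and `é`, which would give the expected `Osnabrück`. A minimal sketch of testing that hypothesis on the two strings shown above (the helper name `redecode` is hypothetical):

```python
def redecode(text, wrong='latin1', guess='mac_roman'):
    """Undo a decode done with `wrong` and retry with `guess`.

    latin1 maps every byte to exactly one code point, so encoding back
    to latin1 recovers the original bytes without loss.
    """
    return text.encode(wrong).decode(guess)


# The two affected strings, copied from the printed output.
samples = ['Osnabr\x9fck, Germany', '_esk\x8e Bud_jovice, Czechia']
for s in samples:
    print(repr(s), '->', redecode(s))
```

If the guess holds for the whole file, passing `encoding='mac_roman'` to `pd.read_csv` would be a simpler fix than repairing strings afterwards. The underscores in the second name appear to be characters that were lost before the file was written, so they cannot be recovered this way.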

## Potential Implications of the Data

## Structure of the Data

## Key Data Fields 

This section briefly describes the key attributes of the dataset. Where the loaded file uses a different name, the closest matching column from the `info()` output is noted.


- **Job Posting Date**: Captures the date a job is listed. In the loaded file, the closest columns are `First Seen At` and `Last Seen At`. This is crucial for job seekers and HR professionals to stay updated on the latest opportunities and trends.

- **Job Title** (`Job Opening Title`): Specifies the position being advertised. This helps in categorizing and filtering job openings based on industry roles and career interests.

- **Company Name**: Lists the hiring company. The loaded file identifies the employer through the `Website Domain` column (e.g., bosch.com). This information assists job seekers in targeting their applications and helps businesses track competitors and market trends.

- **Job Location** (`Location`): Provides the job's geographic location. Although the dataset is titled for Singapore, the sample rows also include locations in other countries (e.g., the United States, Germany, and Japan). Job seekers use this to find opportunities in specific areas, while employers analyze regional talent and market conditions.

- **Job Description** (`Description`): Includes details about responsibilities, required qualifications, and other relevant aspects. This is vital for candidates to determine if they meet the requirements and for recruiters to communicate expectations clearly.

## General Research Question 

In [None]:
# Checking for Multiple Representations of the Same Categorical Value

In [None]:
# Checking for Incorrect Datatypes
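
The `info()` output earlier lists `First Seen At` and `Last Seen At` as `object`, although the sample rows hold ISO 8601 timestamps. Below is a minimal sketch of converting them, using the two timestamps from the first sample row. The derived column name `Days Listed` is an assumption made for illustration.

```python
import pandas as pd

# Timestamps copied from the first row of the head() output.
toy = pd.DataFrame({
    'First Seen At': ['2024-05-29T19:59:45Z'],
    'Last Seen At': ['2024-07-31T14:35:44Z'],
})

for col in ['First Seen At', 'Last Seen At']:
    # utc=True keeps the meaning of the trailing "Z" (UTC); errors='coerce'
    # turns unparseable strings into NaT instead of raising.
    toy[col] = pd.to_datetime(toy[col], utc=True, errors='coerce')

# Hypothetical derived column: whole days between first and last sighting.
toy['Days Listed'] = (toy['Last Seen At'] - toy['First Seen At']).dt.days

print(toy.dtypes)
print(toy['Days Listed'].iloc[0])
```

Counting the `NaT` values after conversion (`toy[col].isna().sum()`) gives a quick signal of how many timestamps failed to parse.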

In [None]:
# Checking for Missing/Null Values

In [None]:
# Checking for Duplicate Data

In [None]:
# Checking for Inconsistent Data

In [None]:
# Checking for Outliers
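
One possible way to approach this check is the interquartile-range (IQR) rule applied to salary figures. The sample rows show that `Salary Data` holds JSON text with `salary_low` and `salary_high` keys. The values below are made up for illustration, and the 1.5 × IQR cutoff is a common convention rather than anything the dataset prescribes.

```python
import json

import pandas as pd

# Hypothetical JSON strings shaped like the visible part of `Salary Data`.
raw = pd.Series([
    '{"salary_low": 3000, "salary_high": 4000}',
    '{"salary_low": 3500, "salary_high": 5000}',
    '{"salary_low": null, "salary_high": null}',
    '{"salary_low": 4000, "salary_high": 6000}',
    '{"salary_low": 4200, "salary_high": 6500}',
    '{"salary_low": 50000, "salary_high": 60000}',
])

# Parse each JSON string and pull out salary_low; JSON null becomes NaN.
salary_low = pd.Series(
    [json.loads(s).get('salary_low') for s in raw], dtype='float64'
)

# quantile() ignores NaN, so missing salaries do not distort the quartiles.
q1, q3 = salary_low.quantile(0.25), salary_low.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# NaN compares False on both sides, so missing salaries are never flagged.
outliers = salary_low[(salary_low < lower) | (salary_low > upper)]
print(outliers)
```

Flagged values are candidates for inspection, not automatic removal: a very high figure may reflect a different pay period or currency rather than an error.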

## Matplotlib Chart Visualizations

### EDA Question 1

Both the formulation and the answer go in the same cell.

### EDA Question 2

### EDA Question 3