# Job Posting Data Analysis
In this notebook, the group will be working with the [Job Posting in Singapore](https://www.kaggle.com/datasets/techsalerator/job-posting-data-in-singapore) dataset. This dataset will be used for processing, analyzing, and visualizing data.

This project is carried out by the group **DS NERDS**, under Section **S19**, which consists of the following members:
- Colobong, Franz Andrick
- Chu, Andre Benedict M. 
- Pineda, Mark Gabriel A.
- Rocha, Angelo H. 
  
The output fulfills a part of our requirements for the course Statistical Modeling and Simulation (CSMODEL). 


# Import Libraries

TO-DO:
Put a brief description for each module used and how it was used in the notebook.


In [36]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json

## Dataset Description and Collection Process

This dataset offers a comprehensive overview of job openings across various sectors in Singapore. It provides an essential resource for businesses, job seekers, and labor market analysts, and it can also be a valuable tool for people who would like to be informed about job openings and employment trends in Singapore.

The data was collected by a global data provider called **Techsalerator**, by consolidating and categorizing job-related information from diverse sources, including company websites, job boards, and recruitment agencies. 

Now, let us load the CSV file into our workspace with **'latin1'** encoding as it contains special characters (e.g., é, ñ, ’) that caused a UnicodeDecodeError with the default **'utf-8'** encoding.
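The encoding fallback described above can be sketched as follows. This is a minimal illustration rather than the notebook's actual loading code: the helper `read_csv_with_fallback` and the sample bytes are hypothetical, and the only assumption is that a failed decode surfaces as a `UnicodeDecodeError`.

```python
import io

import pandas as pd


def read_csv_with_fallback(source_bytes, encodings=("utf-8", "latin1")):
    """Try each encoding in turn and return (DataFrame, encoding_used)."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(io.BytesIO(source_bytes), encoding=enc), enc
        except UnicodeDecodeError as err:
            # Remember the failure and try the next encoding
            last_error = err
    raise last_error


# Hypothetical sample: "é" stored as the single latin1 byte 0xE9,
# which is not a valid UTF-8 sequence in this position
sample = "title,city\nCafé manager,Singapore\n".encode("latin1")

df_sample, used = read_csv_with_fallback(sample)
print(used)
print(df_sample)
```

Because `latin1` assigns a character to every byte value, it never raises a decode error, so it works as a last resort. The trade-off is that text actually written in another encoding can be silently mis-decoded.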

## Potential Implications of the Data

## Structure of the Data

## Key Data Fields 

This section provides a brief description of the key attributes present in the dataset:


- **Job Posting Date**: Captures the date a job is listed. This is crucial for job seekers and HR professionals to stay updated on the latest opportunities and trends.

- **Job Title**: Specifies the position being advertised. This helps in categorizing and filtering job openings based on industry roles and career interests.

- **Company Name**: Lists the hiring company. This information assists job seekers in targeting their applications and helps businesses track competitors and market trends.

- **Job Location**: Provides the job's geographic location within Singapore. Job seekers use this to find opportunities in specific areas, while employers analyze regional talent and market conditions.

- **Job Description**: Includes details about responsibilities, required qualifications, and other relevant aspects. This is vital for candidates to determine if they meet the requirements and for recruiters to communicate expectations clearly.

### Data Cleaning 
This section focuses on removing duplicates, dropping rows with null values that could affect the exploratory data analysis questions, and standardizing text fields by converting them to lowercase and stripping surrounding whitespace.

In [37]:
job_posting_df = pd.read_csv('Job Posting.csv', encoding='latin1')
job_posting_df.head(200)

Unnamed: 0,Website Domain,Ticker,Job Opening Title,Job Opening URL,First Seen At,Last Seen At,Location,Location Data,Category,Seniority,...,Description,Salary,Salary Data,Contract Types,Job Status,Job Language,Job Last Processed At,O*NET Code,O*NET Family,O*NET Occupation Name
0,bosch.com,,IN_RBAI_Assistant Manager_Dispensing Process E...,https://jobs.smartrecruiters.com/BoschGroup/74...,2024-05-29T19:59:45Z,2024-07-31T14:35:44Z,"Indiana, United States","[{""city"":null,""state"":""Indiana"",""zip_code"":nul...","engineering, management, support",manager,...,**IN\_RBAI\_Assistant Manager\_Dispensing Proc...,,"{""salary_low"":null,""salary_high"":null,""salary_...",full time,closed,en,2024-08-02T14:47:55Z,43-1011.00,Office and Administrative Support,First-Line Supervisors of Office and Administr...
1,bosch.com,,Professional Internship: Hardware Development ...,https://jobs.smartrecruiters.com/BoschGroup/74...,2024-05-04T01:00:12Z,2024-07-29T17:46:16Z,"Delaware, United States","[{""city"":null,""state"":""Delaware"",""zip_code"":nu...",internship,non_manager,...,**Professional Internship: Hardware Developmen...,,"{""salary_low"":null,""salary_high"":null,""salary_...","full time, internship, m/f",closed,en,2024-07-31T17:50:07Z,17-2061.00,Architecture and Engineering,Computer Hardware Engineers
2,zf.com,,Process Expert BMS Production,https://jobs.zf.com/job/Shenyang-Process-Exper...,2024-04-19T06:47:24Z,2024-05-16T02:25:08Z,China,"[{""city"":null,""state"":null,""zip_code"":null,""co...",engineering,non_manager,...,ZF is a global technology company supplying sy...,,"{""salary_low"":null,""salary_high"":null,""salary_...",,closed,en,2024-05-18T02:32:04Z,51-9141.00,Production,Semiconductor Processing Technicians
3,bosch.com,,DevOps Developer with Python for ADAS Computin...,https://jobs.smartrecruiters.com/BoschGroup/74...,2024-08-16T10:20:37Z,2024-08-22T11:14:49Z,Romania,"[{""city"":null,""state"":null,""zip_code"":null,""co...","information_technology, software_development",non_manager,...,**DevOps Developer with Python for ADAS Comput...,,"{""salary_low"":null,""salary_high"":null,""salary_...",full time,closed,en,2024-08-23T00:33:30Z,15-1252.00,Computer and Mathematical,Software Developers
4,bosch.com,,Senior Engineer Sales - Video Systems and Solu...,https://jobs.smartrecruiters.com/BoschGroup/74...,2024-07-01T17:31:20Z,2024-08-01T05:11:33Z,India,"[{""city"":null,""state"":null,""zip_code"":null,""co...","engineering, sales",non_manager,...,**Senior Engineer Sales - Video Systems and So...,,"{""salary_low"":null,""salary_high"":null,""salary_...",full time,closed,en,2024-08-02T19:03:16Z,41-9031.00,Sales and Related,Sales Engineers
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,heraeus.com,,Werkstudent / Praktikant / Abschlussarbeit Sup...,https://jobs.heraeus.com/job/Kleinostheim-Werk...,2024-04-30T17:12:07Z,2024-06-13T12:37:09Z,"Delaware, United States","[{""city"":null,""state"":""Delaware"",""zip_code"":nu...",internship,manager,...,**Werkstudent / Praktikant / Abschlussarbeit S...,,"{""salary_low"":null,""salary_high"":null,""salary_...",vollzeit,closed,de,2024-06-14T20:08:13Z,11-3071.04,Management,Supply Chain Managers
196,bosch.com,,Prctica Tcnico Prevencin de Riesgos,https://jobs.smartrecruiters.com/BoschGroup/74...,2024-05-06T20:15:31Z,2024-08-26T15:35:45Z,Chile,"[{""city"":null,""state"":null,""zip_code"":null,""co...","engineering, healthcare_services",non_manager,...,**Prctica Tcnico Prevencin de Riesgos**\n\n...,,"{""salary_low"":null,""salary_high"":null,""salary_...",full time,closed,en,2024-08-28T15:45:45Z,19-4042.00,"Life, Physical, and Social Science",Environmental Science and Protection Technicia...
197,bosch.com,,Bosch - Szchenyi Jobfair Gy_r (2024 Spring) T...,https://jobs.smartrecruiters.com/BoschGroup/74...,2024-04-08T21:26:27Z,2024-04-23T01:06:41Z,"Budapest, Hungary","[{""city"":""Budapest"",""state"":null,""zip_code"":nu...",,non_manager,...,**Bosch - Szchenyi Jobfair Gy_r (2024 Spring)...,,"{""salary_low"":null,""salary_high"":null,""salary_...","contract, full time, internship",closed,en,2024-04-25T01:12:23Z,49-3023.00,"Installation, Maintenance, and Repair",Automotive Service Technicians and Mechanics
198,bosch.com,,Foreign Trade Specialist,https://jobs.smartrecruiters.com/BoschGroup/74...,2024-06-12T13:18:01Z,2024-07-04T14:36:29Z,"Budapest, Hungary","[{""city"":""Budapest"",""state"":null,""zip_code"":nu...",,non_manager,...,**Foreign Trade Specialist**\n\n\n* Full-time\...,,"{""salary_low"":null,""salary_high"":null,""salary_...",full time,closed,en,2024-07-06T14:41:44Z,13-1199.00,Business and Financial Operations,"Business Operations Specialists, All Other"


In [38]:
print(job_posting_df.dtypes)

Website Domain            object
Ticker                   float64
Job Opening Title         object
Job Opening URL           object
First Seen At             object
Last Seen At              object
Location                  object
Location Data             object
Category                  object
Seniority                 object
Keywords                  object
Description               object
Salary                    object
Salary Data               object
Contract Types            object
Job Status                object
Job Language              object
Job Last Processed At     object
O*NET Code                object
O*NET Family              object
O*NET Occupation Name     object
dtype: object


In [39]:
# Remove Duplicates
job_posting_df = job_posting_df.drop_duplicates()


Rows without a contract type or a location must be handled before analysis; here they are dropped rather than imputed. Without a contract type, applicants have difficulty judging whether a posting suits them. As companies aim to source talent from different countries, the location plays an important role in reaching the talent a company is targeting. Removing these rows therefore helps in forming concrete EDA hypotheses.

In [40]:
# Handle rows without a contract type or a location
# Drop these rows, since they cannot be used in the analysis


print(job_posting_df[[ 'Location','Location Data', 'Contract Types']].isnull().sum())
print(f"Entries with both missing a Location and a Contract Type: {job_posting_df[['Location','Location Data', 'Contract Types']].isnull().all(axis=1).sum()}\n")

job_posting_df = job_posting_df.dropna(subset=['Location','Location Data', 'Contract Types'], how='any')

print(job_posting_df[[ 'Location','Location Data', 'Contract Types']].isnull().sum())
print(f"Entries with both missing a Location and a Contract Type: {job_posting_df[['Location', 'Location Data', 'Contract Types']].isnull().all(axis=1).sum()}")



Location           411
Location Data        0
Contract Types    1915
dtype: int64
Entries with both missing a Location and a Contract Type: 0

Location          0
Location Data     0
Contract Types    0
dtype: int64
Entries with both missing a Location and a Contract Type: 0
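The difference between `how='any'` (used above) and `how='all'` in `dropna` can be seen on a small example. The toy DataFrame below is hypothetical; only the column names mirror the notebook.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the notebook's
toy = pd.DataFrame({
    "Location": ["india", None, "japan", None],
    "Contract Types": ["full time", "intern", None, None],
})

# how="any": drop a row if at least one of the listed columns is missing
any_missing = toy.dropna(subset=["Location", "Contract Types"], how="any")

# how="all": drop a row only if every listed column is missing
all_missing = toy.dropna(subset=["Location", "Contract Types"], how="all")

print(len(any_missing))  # rows where both fields are present
print(len(all_missing))  # rows where at least one field is present
```

With `how='any'`, only the fully populated first row survives, which is the stricter behavior the cell above relies on.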


In [41]:
job_posting_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7623 entries, 0 to 9918
Data columns (total 21 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   Website Domain         7623 non-null   object 
 1   Ticker                 0 non-null      float64
 2   Job Opening Title      7623 non-null   object 
 3   Job Opening URL        7623 non-null   object 
 4   First Seen At          7623 non-null   object 
 5   Last Seen At           7623 non-null   object 
 6   Location               7623 non-null   object 
 7   Location Data          7623 non-null   object 
 8   Category               6396 non-null   object 
 9   Seniority              7623 non-null   object 
 10  Keywords               6030 non-null   object 
 11  Description            7571 non-null   object 
 12  Salary                 474 non-null    object 
 13  Salary Data            7623 non-null   object 
 14  Contract Types         7623 non-null   object 
 15  Job Statu

Strip leading and trailing whitespace and convert all text to lowercase so that unique values are easier to compare during categorization.

In [42]:
# Fixing text fields

# Strip surrounding whitespace and convert the text to lowercase
job_posting_df['O*NET Family'] = job_posting_df['O*NET Family'].str.strip().str.lower()
job_posting_df['Keywords'] = job_posting_df['Keywords'].str.strip().str.lower()
job_posting_df['Location'] = job_posting_df['Location'].str.strip().str.lower()
job_posting_df['Seniority'] = job_posting_df['Seniority'].str.strip().str.lower()
job_posting_df['Contract Types'] = job_posting_df['Contract Types'].str.strip().str.lower()
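As a small illustration of why this normalization helps, the sketch below uses a hypothetical Series with inconsistent spellings of the same value. Note that `str.strip()` removes only leading and trailing whitespace, and missing values stay missing.

```python
import pandas as pd

# Hypothetical values: the same contract type written three ways, plus a missing entry
raw = pd.Series(["  Full Time", "full time ", "FULL TIME", None])

cleaned = raw.str.strip().str.lower()

print(raw.nunique())         # distinct spellings before cleaning
print(cleaned.nunique())     # distinct spellings after cleaning
print(cleaned.isna().sum())  # missing values are left untouched
```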


In [43]:
# Checking for Incorrect Datatypes

### Categorizing Data by Seniority, Job Category, Location, and Skills 
Job fields were categorized using the `O*NET Family` column, which groups postings by the skills, education, or training that a job field requires. It fits this purpose better than the `Category` column, which gives only a broader view of the job fields. Skills were categorized from the `Keywords` column by finding the skills that appear most often in the dataset. Location was categorized from the `Location Data` column, focusing on the country. Seniority was categorized from the `Seniority` column. In each case, `unique()` was used first to list the distinct values so that redundant entries could be mapped into a single category.

In [44]:
# Check all unique values
unique_values_seniority = job_posting_df['Seniority'].unique()
print(unique_values_seniority)

['manager' 'non_manager' 'director' 'head' 'vice_president' 'c_level'
 'partner' 'president']


In [45]:
# Specific mapping with all the unique values gathered
seniority_mapping = {
    'non_manager': 'Non-Managerial Position',
    'manager': 'Managerial Position',
    'director': 'Managerial Position',
    'head': 'Managerial Position',
    'vice_president': 'Executive Position',
    'c_level': 'Executive Position',
    'partner': 'Executive Position',
    'president': 'Executive Position',
}

# Map the values and count categories
seniority_categories = job_posting_df['Seniority'].map(seniority_mapping)
seniority_category_counts = seniority_categories.value_counts().sort_index()

print("Seniority Category Counts:")
print(seniority_category_counts)

Seniority Category Counts:
Seniority
Executive Position           17
Managerial Position        1433
Non-Managerial Position    6173
Name: count, dtype: int64


Categorization for the `O*NET Family`: finding and mapping all unique job fields that exist in the dataset.

In [46]:
# Check all unique values
unique_job_fields = job_posting_df['O*NET Family'].unique()
print(unique_job_fields)

['office and administrative support' 'architecture and engineering'
 'computer and mathematical' 'sales and related'
 'installation, maintenance, and repair'
 'business and financial operations' 'production'
 'life, physical, and social science' 'management'
 'community and social service' 'transportation and material moving'
 'healthcare practitioners and technical' 'personal care and service'
 'educational instruction and library' 'construction and extraction'
 'arts, design, entertainment, sports, and media'
 'food preparation and serving related' 'protective service'
 'military specific' 'legal' 'healthcare support'
 'farming, fishing, and forestry'
 'building and grounds cleaning and maintenance' nan]


In [47]:
# Mapping of all unique values into specific categories
job_fields_mapping = {
    'office and administrative support' : 'Business and Administration',
    'architecture and engineering': 'Engineering and Construction',
    'computer and mathematical' : 'Technology',
    'sales and related': 'Business and Administration',    
    'installation, maintenance, and repair': 'Facilities Management and Services',
    'business and financial operations' : 'Business and Administration',
    'production': 'Manufacturing',
    'life, physical, and social science' : 'Science and Research',
    'management': 'Business and Administration',
    'community and social service' : 'Public Service',
    'transportation and material moving': 'Transportation and Logistics',
    'healthcare practitioners and technical' : 'Healthcare',
    'personal care and service': 'Healthcare',
    'educational instruction and library' : 'Education',
    'construction and extraction': 'Engineering and Construction',
    'arts, design, entertainment, sports, and media': 'Multimedia and Sports',
    'food preparation and serving related' : 'Facilities Management and Services',
    'protective service': 'Government and Public Safety',
    'military specific' : 'Government and Public Safety',
    'legal' : 'Legal Services',
    'healthcare support': 'Healthcare',
    'farming, fishing, and forestry': 'Agriculture and Natural Resources',
    'building and grounds cleaning and maintenance': 'Facilities Management and Services', 
    'nan' : 'Others'  # NOTE: never matches, since missing values are float NaN, not the string 'nan'
}

# Map the values and get the count of the categories
job_fields_categories = job_posting_df['O*NET Family'].map(job_fields_mapping)
job_fields_category_counts = job_fields_categories.value_counts().sort_index()

print(job_fields_category_counts)


O*NET Family
Agriculture and Natural Resources       12
Business and Administration           2861
Education                              247
Engineering and Construction          1202
Facilities Management and Services     285
Government and Public Safety            53
Healthcare                             238
Legal Services                          18
Manufacturing                          669
Multimedia and Sports                   84
Public Service                          33
Science and Research                   269
Technology                            1342
Transportation and Logistics           308
Name: count, dtype: int64
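The counts above contain no `Others` row because `Series.map` with a dictionary leaves missing values as `NaN`; a key spelled `'nan'` is an ordinary string and does not match them. The sketch below shows this on hypothetical data and one possible way to count missing families, using `fillna` after mapping. The notebook itself does not apply this step.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing O*NET family
families = pd.Series(["legal", "production", np.nan])
mapping = {"legal": "Legal Services", "production": "Manufacturing", "nan": "Others"}

# The string key 'nan' does not match a float NaN, so the entry stays missing
mapped = families.map(mapping)
print(mapped.isna().sum())

# One option: label unmapped or missing entries explicitly after mapping
with_others = families.map(mapping).fillna("Others")
print(with_others.value_counts().to_dict())
```

Keep in mind that `fillna` would also label any family missing from the dictionary as `Others`, so it is worth checking `mapped.isna()` against the original column first.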


Categorization for the `Contract Types`: finding and mapping all unique contract types that exist in the dataset.

In [48]:
split_contract_types = job_posting_df['Contract Types'].str.split(',').explode()

unique_ctypes = split_contract_types.str.strip().unique()

print(unique_ctypes)

['full time' 'internship' 'm/f' 'intern' 'tempo integral' 'onsite'
 'hybrid' 'remote' '3rd shift' 'long term' 'short term' 'part time'
 'vollzeit' 'm/w' 'permanent' 'temporary' 'contractor' 'fully remote'
 'contract' 'all levels' 'commission' 'summer' 'festanstellung'
 'work from home' 'vaste aanstelling' 'trabalho remoto' 'trainee'
 'practitioner' 'fuldtid' 'pe_ny etat' 'temps plein' 'day shift'
 'night shift' 'full or part time' 'teletrabajo' 'day time' 'm f' 'deltid'
 'nuit' 'temps partiel' 'freelance' 'tempo indeterminato']


In [49]:
from updated_ctypes import ctypes_mapping
from collections import Counter

# Lowercase the mapping keys for case-insensitive look-ups via substring matching
mapping_ctypes_lower = {k.lower(): v for k, v in ctypes_mapping.items()}

def map_ctypes_in_cell(str_keywords):
    # check for null categories
    if pd.isna(str_keywords):
        return []

    # list of contract types in lowercase
    ctypes_lower = str_keywords.lower()
    mapped_categories = []

    # loop for the contract types inside the category
    for ky, ct in mapping_ctypes_lower.items():
        # if contract types is in the list, append to list the category
        if ky in ctypes_lower:
            mapped_categories.append(ct)

    return mapped_categories

all_ct = []

# access every contract types in the dataset
for ctypes_cells in job_posting_df['Contract Types']:

    # get the category in the contract types cells then add/extend to the list
    ct = map_ctypes_in_cell(ctypes_cells)
    all_ct.extend(ct)

# Example: cell has: ['full time', 'intern', 'long term'] --> ['Full Time', 'Internship/Trainee', 'Long Term']

# counter for all categories
ct_counts = Counter(all_ct) 

# transfer to a dataframe for better mapping
df_counts = pd.DataFrame(list(ct_counts.items()), 
                        columns=['Contract Types', 'Count'])

df_counts = df_counts.sort_values('Count', ascending=False)

# Print the values
df_counts = df_counts.reset_index(drop=True)
print("=" * 50)
print(df_counts.to_string(index=False))


    Contract Types  Count
         Full Time   4810
Internship/Trainee   1666
            Hybrid    874
       Male/Female    688
   Remote/Flexible    445
Contract/Temporary    409
         Long Term    372
        All Levels    284
         Permanent    249
         Part Time    215
           On-site    206
        Commission    108
        Short Term     59
        Shift Work      6
 Full or Part Time      2
          Day Time      2
     Monday-Friday      1
             Night      1
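One property of substring matching is worth keeping in mind when reading these counts: a short key such as `intern` also matches inside `internship`, so a single cell can contribute to the same category more than once if both keys are present in the mapping. The sketch below illustrates this with a hypothetical three-entry mapping (the real one lives in `updated_ctypes.ctypes_mapping` and is not shown here) and compares it with whole-token matching as an alternative.

```python
from collections import Counter

# Hypothetical mini-mapping, for illustration only
toy_mapping = {
    "intern": "Internship/Trainee",
    "internship": "Internship/Trainee",
    "full time": "Full Time",
}


def map_by_substring(cell):
    """Same idea as the notebook's approach: a key matches anywhere in the cell."""
    return [cat for key, cat in toy_mapping.items() if key in cell]


def map_by_token(cell):
    """Alternative: split on commas, match whole tokens, count each category once per cell."""
    tokens = {t.strip() for t in cell.split(",")}
    return sorted({toy_mapping[t] for t in tokens if t in toy_mapping})


cell = "full time, internship"
print(Counter(map_by_substring(cell)))  # 'Internship/Trainee' is counted twice
print(Counter(map_by_token(cell)))      # each category is counted once
```

Whether this affects the notebook's totals depends on which keys the actual mapping contains, so the sketch should be read as a check to run, not as a finding.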


The contents of the `Keywords` column are often separated by commas, so the group split them first before finding all unique values to reduce redundancy.

In [50]:
# Split the contents of the keywords column
split_keywords = job_posting_df['Keywords'].str.split(',').explode()

# Then find the unique values; this greatly reduces redundancy
unique_keywords = split_keywords.str.strip().unique()

print(unique_keywords)

[nan 'scrum' 'github' 'jenkins' 'growth' 'c++' 'linux' 'python'
 'microsoft azure' 'docker' 'business development' 'internship'
 'ecommerce' 'sap successfactors' 'e-commerce' 'servicenow' 'microsoft'
 'sap' 'cognex' 'omron' 'call center' 'hris' 'salesforce' 'social media'
 'customer success' 'contentful' 'gainsight' 'facebook' 'linkedin'
 'agorapulse' 'teamtailor' '.net' 'c#' 'angular' 'android' 'java' 'gerrit'
 'kotlin' 'power bi' 'keyence' 'bmc remedy' 'databricks'
 'azure databricks' 'microsoft excel' 'microsoft teams' 'simulink' 'novi'
 'kanban' 'real estate' 'microsoft word' 'sap s/4hana' 'informatica'
 'atlassian' 'atlassian jira' 'splunk' 'matlab' 'selenium' 'gradle'
 'postman' 'javascript' 'successfactors' 'qualtrics' 'microsoft 365'
 'contractor' 'branding' 'outbound' 'glassdoor' 'websocket' 'sigfox'
 'json' 'django' 'ansible' 'kubernetes' 'marketing campaigns' 'front-end'
 'back-end' 'angularjs' 'node.js' 'php' 'ruby' 'gatsby' 'graphql' 'remix'
 'informa' 'hubspot' 'microsoft
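The split-then-explode pattern used above can be tried on a small hypothetical Series. The same exploded tokens can also be passed to `value_counts()` to see how often each keyword appears, which is an optional extra step not taken in the cell above.

```python
import pandas as pd

# Hypothetical keyword cells, including a missing one
keywords = pd.Series(["python, docker", "python,linux", None])

# One row per keyword: split on commas, explode the lists, trim whitespace
tokens = keywords.str.split(",").explode().str.strip()

print(tokens.dropna().unique().tolist())  # distinct keywords
print(tokens.value_counts().to_dict())    # frequency of each keyword
```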

The unique keywords produce more than 500 values. The `update_keyword_skills` Python file contains the `keywords_skills_mapping` dictionary, which maps every unique value to its category. The mapping was placed in a separate file for overall visual clarity. It was created with AI assistance, and a member of the group double-checked that the values were mapped correctly.

In [51]:
from update_keyword_skills import keywords_skills_mapping
from collections import Counter

# Lowercase the keys for case-insensitive look-ups via substring matching
mapping_skills_lower = {k.lower(): v for k, v in keywords_skills_mapping.items()}

def map_keywords_in_cell(str_keywords):
    # check for null categories
    if pd.isna(str_keywords):
        return []

    # list of keywords in lowercase
    keywords_lower = str_keywords.lower()
    mapped_categories = []

    # loop for the key inside the category
    for ky, ct in mapping_skills_lower.items():
        # if keyword is in the list, append to list the category
        if ky in keywords_lower:
            mapped_categories.append(ct)

    return mapped_categories

all_ct = []

# access every keyword in the dataset
for ky_cells in job_posting_df['Keywords']:

    # get the category in the keywords cells then add/extend to the list
    ct = map_keywords_in_cell(ky_cells)
    all_ct.extend(ct)

# Example: cell has: ['c++', 'mysql', 'linux'] --> ['Programming Language', 'Databases', 'Operating System']

# counter for all categories
ct_counts = Counter(all_ct) 

# transfer to a dataframe for better mapping
df_counts = pd.DataFrame(list(ct_counts.items()), 
                        columns=['Category', 'Count'])

df_counts = df_counts.sort_values('Count', ascending=False)

# Print the values
df_counts = df_counts.reset_index(drop=True)
print("=" * 50)
print(df_counts.to_string(index=False))


                      Category  Count
         Programming Languages   5262
       ERP & Business Software   4778
      Other and Broader Skills   1732
                 Methodologies   1460
Marketing & Social Media Tools    916
        Frameworks & Libraries    886
                DevOps & CI/CD    613
                Cloud Services    590
                Analytics & BI    576
             Tools & Platforms    504
           CMS & Web Platforms    368
             Operating Systems    262
                     Databases    255
                           CRM    215
            Project Management    214
                  Design Tools    196
         Networking & Security    127
                      Hardware    117
           Tools and Platforms     76


`locations_df` is a copy of the original dataset that focuses on the contents of the `Location Data` column. Working on a copy keeps this step limited to the field currently being categorized.

In [52]:
locations_df = job_posting_df.copy()
# Check the contents of the Location Data
locations_df['Location Data']

0       [{"city":null,"state":"Indiana","zip_code":nul...
1       [{"city":null,"state":"Delaware","zip_code":nu...
3       [{"city":null,"state":null,"zip_code":null,"co...
4       [{"city":null,"state":null,"zip_code":null,"co...
5       [{"city":"Yokohama","state":null,"zip_code":nu...
                              ...                        
9914    [{"city":"Charleston","state":"South Carolina"...
9915    [{"city":null,"state":"Indiana","zip_code":nul...
9916    [{"city":null,"state":null,"zip_code":null,"co...
9917    [{"city":"Aveiro","state":null,"zip_code":null...
9918    [{"city":"Jiaxing","state":null,"zip_code":nul...
Name: Location Data, Length: 7623, dtype: object

This part of categorizing the `Location Data` focuses on parsing the JSON strings into separate columns and rows, producing a DataFrame that holds the contents of the `Location Data` column.

In [53]:
# Parse data into a dictionary
def parse_location(str_location):
    try:
        # Convert the JSON string into a Python object
        data = json.loads(str_location)

        # If the result is a list, take its first element;
        # otherwise return the parsed object as-is
        return data[0] if isinstance(data, list) else data
    except (TypeError, ValueError, IndexError):
        # On malformed or empty input, return an empty dictionary
        return {}

# Parse the location data  
locations_df['Location Data'] = locations_df['Location Data'].apply(
    parse_location
)

# Normalize Location Data into new columns and rows
locations_df = pd.json_normalize(locations_df['Location Data'])
locations_df



Unnamed: 0,city,state,zip_code,country,region,continent,fuzzy_match
0,,Indiana,,United States,,,False
1,,Delaware,,United States,,,False
2,,,,Romania,,,False
3,,,,India,,,False
4,Yokohama,,,Japan,,,False
...,...,...,...,...,...,...,...
7618,Charleston,South Carolina,,United States,,,False
7619,,Indiana,,United States,,,False
7620,,,,Serbia,,,False
7621,Aveiro,,,Portugal,,,False
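As the output shows, `pd.json_normalize` returns a frame with a fresh `0..n-1` index, so its row labels no longer line up with `job_posting_df`, whose index has gaps after the earlier row drops. If the parsed columns need to be joined back to the postings later, one option is to restore the original index, as in the sketch below. The toy data and the helper name `parse_first` are hypothetical.

```python
import json

import pandas as pd

# Hypothetical rows; the index has gaps, as it does after dropping rows
toy = pd.DataFrame(
    {
        "Location Data": [
            '[{"city":"Yokohama","country":"Japan"}]',
            '[{"city":null,"country":"India"}]',
            "not json",
        ]
    },
    index=[0, 3, 5],
)


def parse_first(text):
    """Parse a JSON string and return the first list element (or the object itself)."""
    try:
        data = json.loads(text)
        return data[0] if isinstance(data, list) else data
    except (TypeError, ValueError, IndexError):
        return {}


parsed = toy["Location Data"].apply(parse_first)

# json_normalize builds a new RangeIndex, so reattach the original labels
flat = pd.json_normalize(parsed.tolist())
flat.index = toy.index

print(flat)
```

With the labels restored, `toy.join(flat)` would align each parsed location with its source row.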


In [54]:
# Display the contents by categorizing the total number of entries per country
locations_df['country'].value_counts()

country
United States           2386
India                   1028
Germany                  966
Brazil                   299
Mexico                   295
Portugal                 294
Hungary                  261
Poland                   256
Turkey                   157
Japan                    156
Romania                  147
China                    119
Spain                    108
Malaysia                 103
United Kingdom            96
Czechia                   87
Austria                   85
Serbia                    70
Belgium                   61
Denmark                   56
Slovenia                  52
France                    49
Vietnam                   49
Netherlands               44
Thailand                  41
Colombia                  38
Australia                 32
Ireland                   27
Slovakia                  25
Italy                     24
Morocco                   20
Switzerland               16
Egypt                     14
Canada                    11
Argent

### Salary Related Processes

Now, we will check for missing or null values in the dataset. Upon inspection, we can see that the **`Ticker`** column—referring to the stock ticker symbol of the company that posted the job—contains only null values.

Since this column provides no usable information for analysis or modeling, we can safely drop it from the dataset.


In [55]:
key_data_fields = job_posting_df[['First Seen At', 'Job Opening Title', 'Job Opening URL', 'Location', 'Description']]
key_data_fields.head()

Unnamed: 0,First Seen At,Job Opening Title,Job Opening URL,Location,Description
0,2024-05-29T19:59:45Z,IN_RBAI_Assistant Manager_Dispensing Process E...,https://jobs.smartrecruiters.com/BoschGroup/74...,"indiana, united states",**IN\_RBAI\_Assistant Manager\_Dispensing Proc...
1,2024-05-04T01:00:12Z,Professional Internship: Hardware Development ...,https://jobs.smartrecruiters.com/BoschGroup/74...,"delaware, united states",**Professional Internship: Hardware Developmen...
3,2024-08-16T10:20:37Z,DevOps Developer with Python for ADAS Computin...,https://jobs.smartrecruiters.com/BoschGroup/74...,romania,**DevOps Developer with Python for ADAS Comput...
4,2024-07-01T17:31:20Z,Senior Engineer Sales - Video Systems and Solu...,https://jobs.smartrecruiters.com/BoschGroup/74...,india,**Senior Engineer Sales - Video Systems and So...
5,2024-03-29T02:27:59Z,[EM] _______ _________ (________ EV/HEV Compon...,https://jobs.smartrecruiters.com/BoschGroup/74...,"yokohama, japan",**[EM] _______ _________ (________ EV/HEV Comp...


In [56]:
# Check the Ticker column
null_count = job_posting_df['Ticker'].isna().sum()
print("Unique Values:", job_posting_df['Ticker'].unique())
print(f"Number of null values: {null_count}")

# Drop the column
job_posting_df = job_posting_df.drop(columns=['Ticker'])

Unique Values: [nan]
Number of null values: 7623


Now, we will check the **`Salary Data`** column to understand the details included in the salary information for each job posting. To avoid modifying the original `job_posting_df`, we will create a copy and store it in `salary_df`. This allows us to safely transform and clean the salary-related data without affecting the source DataFrame.

In [57]:
salary_df = job_posting_df.copy()
salary_df['Salary Data']


0       {"salary_low":null,"salary_high":null,"salary_...
1       {"salary_low":null,"salary_high":null,"salary_...
3       {"salary_low":null,"salary_high":null,"salary_...
4       {"salary_low":null,"salary_high":null,"salary_...
5       {"salary_low":null,"salary_high":null,"salary_...
                              ...                        
9914    {"salary_low":null,"salary_high":null,"salary_...
9915    {"salary_low":null,"salary_high":null,"salary_...
9916    {"salary_low":null,"salary_high":null,"salary_...
9917    {"salary_low":null,"salary_high":null,"salary_...
9918    {"salary_low":null,"salary_high":null,"salary_...
Name: Salary Data, Length: 7623, dtype: object

Upon inspection, we notice that the salary details are **JSON objects** that are currently stored as **JSON strings**.

To make this data usable, we will:

1. **Parse** each string into a Python dictionary.
2. **Normalize** the dictionary so that each key becomes its own separate column in the DataFrame.

This will give us a clearer structure, allowing us to inspect and clean salary values more effectively.


In [58]:
salary_df = job_posting_df.copy()

# Parse json object into a dictionary
salary_df['Salary Data'] = salary_df['Salary Data'].apply(
    lambda x: json.loads(x) if isinstance(x, str) else x
)

# Normalize Salary Data into new columns (null rows are dropped in a later cell)
salary_df = pd.json_normalize(salary_df['Salary Data'])
salary_df

Unnamed: 0,salary_low,salary_high,salary_currency,salary_low_usd,salary_high_usd,salary_time_unit
0,,,,,,
1,,,,,,
2,,,,,,
3,,,,,,
4,,,,,,
...,...,...,...,...,...,...
7618,,,,,,
7619,,,,,,
7620,,,,,,
7621,,,,,,


By running `salary_df.info()`, we can observe that out of 7,623 job postings, only **408** entries contain salary-related information.

Since salary is a critical detail when analyzing job data, we want to ensure our next steps focus only on entries where salary is provided. To simplify our cleaning process, we will **temporarily drop rows with null values** for salary-related fields.


In [59]:
salary_df.info() 

# Drop rows with any null values
salary_df.dropna(inplace=True)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7623 entries, 0 to 7622
Data columns (total 6 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   salary_low        408 non-null    float64
 1   salary_high       408 non-null    float64
 2   salary_currency   408 non-null    object 
 3   salary_low_usd    408 non-null    float64
 4   salary_high_usd   408 non-null    float64
 5   salary_time_unit  408 non-null    object 
dtypes: float64(4), object(2)
memory usage: 357.5+ KB


Now that we've removed rows with null values, we can inspect the unique values present in each field. 

In particular, the **`salary_currency`** column contains two distinct values: **USD** and **EUR**.


In [60]:
salary_df['salary_currency'].value_counts()

salary_currency
USD    289
EUR    119
Name: count, dtype: int64

After checking the `salary_currency` field, we observe that most job salaries are already in **USD**. 

To ensure consistency in our analysis, we will normalize the data by converting all **EUR** salaries to **USD** using the exchange rate as of **June 17, 2025**:

- **1 EUR = 1.15 USD**

This conversion allows us to compare salaries more accurately and ensures uniformity across the dataset.


In [61]:
# Define conversion rate from EUR to USD
conversion_rate = 1.15

# Convert EUR salaries to USD in one vectorized step
is_eur = salary_df['salary_currency'] == 'EUR'
salary_df.loc[is_eur, ['salary_low', 'salary_high']] *= conversion_rate
salary_df.loc[is_eur, 'salary_currency'] = 'USD'

# Drop the redundant USD salary columns
salary_df.drop(columns=['salary_low_usd', 'salary_high_usd'], inplace=True, errors='ignore')

salary_df

Unnamed: 0,salary_low,salary_high,salary_currency,salary_time_unit
24,49.45,49.45,USD,hour
26,34437.90,34437.90,USD,year
124,171000.00,190000.00,USD,year
133,19.50,19.50,USD,hour
162,234062.00,245000.00,USD,year
...,...,...,...,...
7394,70000.00,86300.00,USD,year
7395,1087.90,1087.90,USD,month
7418,43.00,66.00,USD,hour
7503,16.50,16.50,USD,hour


Now that all the salaries are represented in **USD**, we can focus on the `salary_time_unit` column, which is categorized into three values: **hour**, **month**, and **year**. These indicate how each salary is paid.

In [62]:
salary_df['salary_time_unit'].value_counts()

salary_time_unit
year     234
hour     126
month     48
Name: count, dtype: int64

We notice that most salaries are already given on an **annual basis**. To maintain consistency and enable easier comparisons, we will convert all salaries to **annual salary**.

#### Conversion Formulas:
- **Monthly to Annual**:
  - `annual_salary = monthly_salary * 12`

- **Hourly to Annual** (assuming a standard 40-hour work week, 52 weeks a year):
  - `hours_per_week = 40`
  - `weeks_per_year = 52`
  - `hourly_to_annual = hours_per_week * weeks_per_year = 2080`
  - `annual_salary = hourly_salary * hourly_to_annual`
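
As a worked check of these formulas, the sketch below applies them to three values taken from the converted table above (one hourly, one monthly, one annual). The `ANNUAL_FACTORS` dictionary and the `annual_low` column are illustrative names, not part of the notebook. Looking up a factor per time unit is one possible alternative to branching on each unit.

```python
import pandas as pd

# Assumed factors: 12 months per year, 40 hours x 52 weeks = 2080 hours per year.
ANNUAL_FACTORS = {"year": 1, "month": 12, "hour": 40 * 52}

# Toy rows: salary_low values and time units from the table displayed earlier.
toy = pd.DataFrame(
    {
        "salary_low": [49.45, 1087.90, 171000.00],
        "salary_time_unit": ["hour", "month", "year"],
    }
)

# Look up each row's factor by its time unit, then scale the salary.
factor = toy["salary_time_unit"].map(ANNUAL_FACTORS)
toy["annual_low"] = (toy["salary_low"] * factor).round(2)

print(toy)
```

The hourly row gives 49.45 × 2080 = 102,856 and the monthly row gives 1,087.90 × 12 = 13,054.80, while the annual row is unchanged.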

In [63]:
# Conversion factors
monthly_to_annual = 12
hours_per_week = 40
weeks_per_year = 52
hourly_to_annual = hours_per_week * weeks_per_year  # 40 * 52 = 2080

# Scale hourly and monthly salaries to annual; annual salaries are left unchanged
salary_cols = ['salary_low', 'salary_high']
is_hourly = salary_df['salary_time_unit'] == 'hour'
is_monthly = salary_df['salary_time_unit'] == 'month'

salary_df.loc[is_hourly, salary_cols] *= hourly_to_annual
salary_df.loc[is_monthly, salary_cols] *= monthly_to_annual

# Mark the converted rows as annual
salary_df.loc[is_hourly | is_monthly, 'salary_time_unit'] = 'year'

salary_df


Unnamed: 0,salary_low,salary_high,salary_currency,salary_time_unit
24,102856.0,102856.0,USD,year
26,34437.9,34437.9,USD,year
124,171000.0,190000.0,USD,year
133,40560.0,40560.0,USD,year
162,234062.0,245000.0,USD,year
...,...,...,...,...
7394,70000.0,86300.0,USD,year
7395,13054.8,13054.8,USD,year
7418,89440.0,137280.0,USD,year
7503,34320.0,34320.0,USD,year


Now that all salaries are in the same currency (**USD**) and time unit (**annual**), we can focus on the `salary_low` and `salary_high` fields.

These two fields represent the **lower and upper bounds** of the offered salary range. To simplify the analysis and create a single representative salary value, we will take the **mean** of these two values.

This gives us a new column, `annual_salary`, which reflects the average offered salary for the job.

In [64]:
salary_df['annual_salary'] = (salary_df[['salary_low', 'salary_high']].mean(axis=1))
salary_df


Unnamed: 0,salary_low,salary_high,salary_currency,salary_time_unit,annual_salary
24,102856.0,102856.0,USD,year,102856.0
26,34437.9,34437.9,USD,year,34437.9
124,171000.0,190000.0,USD,year,180500.0
133,40560.0,40560.0,USD,year,40560.0
162,234062.0,245000.0,USD,year,239531.0
...,...,...,...,...,...
7394,70000.0,86300.0,USD,year,78150.0
7395,13054.8,13054.8,USD,year,13054.8
7418,89440.0,137280.0,USD,year,113360.0
7503,34320.0,34320.0,USD,year,34320.0


Now that we've created the `annual_salary` column, the original fields—`salary_low`, `salary_high`, `salary_currency`, and `salary_time_unit`—are no longer needed for further analysis.

To clean up the DataFrame and simplify its structure, we will drop these columns.


In [65]:
salary_df.drop(columns=['salary_low', 'salary_high', 'salary_currency', 'salary_time_unit'], inplace=True)
salary_df

Unnamed: 0,annual_salary
24,102856.0
26,34437.9
124,180500.0
133,40560.0
162,239531.0
...,...
7394,78150.0
7395,13054.8
7418,113360.0
7503,34320.0


Now that we've cleaned and normalized the salary information into a single `annual_salary` column, we can integrate it back into the original `job_posting_df`.

We will assign this as a new column called `Annual_Salary`, allowing us to analyze job postings alongside their corresponding annual salaries.

In [66]:
# Add the annual salary to the original job_posting_df
job_posting_df['Annual_Salary'] = salary_df['annual_salary']
job_posting_df[job_posting_df['Annual_Salary'].notnull()]

Unnamed: 0,Website Domain,Job Opening Title,Job Opening URL,First Seen At,Last Seen At,Location,Location Data,Category,Seniority,Keywords,...,Salary,Salary Data,Contract Types,Job Status,Job Language,Job Last Processed At,O*NET Code,O*NET Family,O*NET Occupation Name,Annual_Salary
24,bosch.com,CONFERENTE (27059),https://jobs.smartrecruiters.com/BoschGroup/74...,2024-07-02T17:38:17Z,2024-07-09T06:38:08Z,brazil,"[{""city"":null,""state"":null,""zip_code"":null,""co...",,non_manager,,...,,"{""salary_low"":null,""salary_high"":null,""salary_...",full time,closed,pt,2024-07-11T08:34:20Z,13-1121.00,business and financial operations,"Meeting, Convention, and Event Planners",102856.0
26,bosch.com,Controls Engineer,https://jobs.smartrecruiters.com/BoschGroup/74...,2024-03-07T23:23:28Z,2024-03-31T21:31:45Z,"lincolnton, north carolina, 28092, united states","[{""city"":""Lincolnton"",""state"":""North Carolina""...",engineering,non_manager,,...,,"{""salary_low"":null,""salary_high"":null,""salary_...",full time,closed,en,2024-04-02T22:34:31Z,17-2071.00,architecture and engineering,Electrical Engineers,34437.9
124,zf.com,Manager Pricing Tools & Technology (m/f/d),https://jobs.zf.com/job/Bruxelles-Manager-Pric...,2024-08-15T09:05:05Z,2024-09-02T11:38:18Z,belgium,"[{""city"":null,""state"":null,""zip_code"":null,""co...","engineering, management",manager,sap,...,,"{""salary_low"":null,""salary_high"":null,""salary_...",m/f,,en,2024-09-02T11:38:18Z,41-3031.00,sales and related,"Securities, Commodities, and Financial Service...",180500.0
184,zf.com,Cloud Platform Engineer-Internship,https://jobs.zf.com/job/Monterrey-Cloud-Platfo...,2024-07-03T04:09:38Z,2024-07-23T07:23:27Z,"monterrey, mexico","[{""city"":""Monterrey"",""state"":null,""zip_code"":n...","engineering, information_technology, internship",non_manager,"microsoft, growth, c++, java, python, internsh...",...,,"{""salary_low"":null,""salary_high"":null,""salary_...","internship, intern",closed,en,2024-07-25T07:29:19Z,15-1299.08,computer and mathematical,Computer Systems Engineers/Architects,54080.0
187,heraeus.com,Territory Specialist - Atlanta Georgia,https://jobs.heraeus.com/job/remote-NA-Territo...,2024-06-13T12:33:42Z,2024-09-03T20:18:09Z,united states,"[{""city"":null,""state"":null,""zip_code"":null,""co...",,non_manager,sap successfactors,...,,"{""salary_low"":null,""salary_high"":null,""salary_...",remote,,en,2024-09-03T20:18:09Z,43-4181.00,office and administrative support,Reservation and Transportation Ticket Agents a...,45760.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7391,bosch.com,Jump In Ð Summer Internships,https://jobs.smartrecruiters.com/BoschGroup/74...,2024-04-19T00:12:29Z,2024-05-19T02:20:44Z,"aveiro, portugal","[{""city"":""Aveiro"",""state"":null,""zip_code"":null...",,non_manager,internship,...,,"{""salary_low"":null,""salary_high"":null,""salary_...","intern, summer, internship",closed,en,2024-05-21T02:28:18Z,39-3091.00,personal care and service,Amusement and Recreation Attendants,45760.0
7394,bosch.com,Controls Engineer,https://jobs.smartrecruiters.com/BoschGroup/74...,2024-05-31T20:59:06Z,2024-06-23T11:38:47Z,"charleston, south carolina, 29418, united states","[{""city"":""Charleston"",""state"":""South Carolina""...",engineering,non_manager,"microsoft, oracle, c#, mysql, python, javascri...",...,,"{""salary_low"":null,""salary_high"":null,""salary_...",full time,closed,en,2024-06-25T11:42:32Z,17-2071.00,architecture and engineering,Electrical Engineers,78150.0
7395,bosch.com,Sr. Software Engineer - Internet of Things (IoT),https://jobs.smartrecruiters.com/BoschGroup/74...,2024-05-15T03:48:55Z,2024-09-03T00:07:40Z,"watertown, massachusetts, 02472, united states","[{""city"":""Watertown"",""state"":""Massachusetts"",""...","engineering, software_development",non_manager,"atlassian, java, node.js, python",...,,"{""salary_low"":null,""salary_high"":null,""salary_...",full time,,en,2024-09-03T00:07:40Z,15-1252.00,computer and mathematical,Software Developers,13054.8
7418,bosch.com,SAP MDG Consultant,https://jobs.smartrecruiters.com/BoschGroup/74...,2024-07-23T04:33:15Z,2024-09-03T20:38:40Z,"bangalore, india","[{""city"":""Bangalore"",""state"":null,""zip_code"":n...",consulting,non_manager,"abap, sap",...,,"{""salary_low"":null,""salary_high"":null,""salary_...",full time,,en,2024-09-03T20:38:40Z,13-1111.00,business and financial operations,Management Analysts,113360.0


In [67]:
# Checking for Duplicate Data

To ensure that all date-related fields (**`First Seen At`**, **`Last Seen At`**, and **`Job Last Processed At`**) are correctly formatted, we check for inconsistencies by attempting to parse them into datetime objects using `pd.to_datetime()` with `errors='coerce'`. Any value that fails to convert (e.g., due to an incorrect format or an invalid date) is set to `NaT`, allowing us to count how many entries are invalid per column.

In [68]:
# Checking for Inconsistent Date Formatting
date_fields = ['First Seen At', 'Last Seen At', 'Job Last Processed At']
date_df = job_posting_df[date_fields].copy()

for col in date_fields:
    date_df[col] = pd.to_datetime(date_df[col], errors='coerce')
    invalid_formats = date_df[col].isna().sum()
    print(f"{invalid_formats} invalid date(s) found in '{col}'")


0 invalid date(s) found in 'First Seen At'
0 invalid date(s) found in 'Last Seen At'
0 invalid date(s) found in 'Job Last Processed At'


Since all the fields involving date and time have been verified to follow **consistent formats** (i.e., no invalid entries), we can proceed to the next step of the pre-processing pipeline.
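
To illustrate how coercion behaves, here is a minimal sketch on made-up timestamps. Because `NaT` after coercion covers both values that were already missing and values that could not be parsed, the sketch subtracts the former to isolate the latter. The variable names are illustrative.

```python
import pandas as pd

# Toy timestamps: one valid ISO 8601 value, one malformed, one missing.
raw_dates = pd.Series(["2024-07-02T17:38:17Z", "not a date", None])

parsed = pd.to_datetime(raw_dates, errors="coerce", utc=True)

# NaT after coercion covers both missing and unparseable inputs,
# so subtract the values that were already missing.
n_nat = parsed.isna().sum()
n_missing = raw_dates.isna().sum()
n_invalid = n_nat - n_missing

print(f"{n_invalid} invalid, {n_missing} missing")
```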

In [69]:
# Checking for Outliers
# TODO: Check for outliers in the Salary Data Field
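
One possible way to approach this check is the 1.5 × IQR rule, sketched below on toy values. The rule and its 1.5 multiplier are an assumption made for illustration and not a method chosen in this notebook. The last salary is deliberately extreme.

```python
import pandas as pd

# Toy annual salaries; the last value is deliberately extreme.
salaries = pd.Series([34320.0, 40560.0, 78150.0, 102856.0, 113360.0, 900000.0])

# Quartiles and interquartile range.
q1, q3 = salaries.quantile([0.25, 0.75])
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = salaries[(salaries < lower) | (salaries > upper)]

print(outliers)
```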

## Matplotlib Charts Visualization

## General Research Question 

What are the underlying patterns and trends in the international job market?

### EDA Question 1 - Annual Salary and Job Field 
Here, job field refers to a posting's category under the `O*NET Family` column of the dataset. In this EDA question, the researchers aim to understand the following:
- What is the relationship between the annual salary and the job field in the dataset?
- What is the average salary for each job field?
- Which job fields show the lowest and highest salary variability?
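
A minimal sketch of how these questions might be approached is a grouped summary, where the mean answers the average-salary question and the standard deviation serves as one possible measure of variability. The toy rows below are illustrative only; the column names follow the dataset.

```python
import pandas as pd

# Toy postings; column names follow the dataset, rows are illustrative.
toy = pd.DataFrame(
    {
        "O*NET Family": [
            "computer and mathematical",
            "computer and mathematical",
            "architecture and engineering",
            "architecture and engineering",
            "sales and related",
        ],
        "Annual_Salary": [54080.0, 13054.8, 34437.9, 78150.0, 180500.0],
    }
)

# Count, mean, and standard deviation of salary per job field.
summary = (
    toy.groupby("O*NET Family")["Annual_Salary"]
    .agg(["count", "mean", "std"])
    .sort_values("mean", ascending=False)
)

print(summary)
```

A field with a single posting has an undefined standard deviation (`NaN`), so fields with very few salaried postings may need to be excluded before comparing variability.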



### EDA Question 2 - Seniority and Contract Types (and Salary Relevance)
For this EDA question, the researchers aim to identify patterns and trends within the `Seniority` and `Contract Types` variables. They will be guided by the following questions:
- What is the relationship between seniority and contract types in the dataset?
- What is the salary distribution for each combination of seniority level and contract type?
- Are certain contract types more prevalent at specific seniority levels?
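
Since `Contract Types` can hold several comma-separated values in one cell (e.g., `internship, intern`), one possible approach is to split the cell into one row per contract type before cross-tabulating against `Seniority`. The sketch below uses toy rows, and `contract_type` is an illustrative helper column.

```python
import pandas as pd

# Toy postings; column names follow the dataset, rows are illustrative.
toy = pd.DataFrame(
    {
        "Seniority": ["non_manager", "non_manager", "manager"],
        "Contract Types": ["full time", "internship, intern", "full time"],
    }
)

# One row per (posting, contract type) pair.
exploded = toy.assign(
    contract_type=toy["Contract Types"].str.split(",")
).explode("contract_type")
exploded["contract_type"] = exploded["contract_type"].str.strip()

# Count postings for each seniority / contract type combination.
counts = pd.crosstab(exploded["Seniority"], exploded["contract_type"])

print(counts)
```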


### EDA Question 3 - Locations and Skills
Here, skills refer to the entries listed under the `Keywords` column of the dataset. In this EDA question, the researchers aim to understand the following:
- What is the relationship between the skills companies require and the specific locations they outsource to?
- What are the prevalent skill categories that exist for each location?
- Which locations have the highest demand for specific skills?
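
As a rough sketch of how skill demand per location could be tallied, the comma-separated `Keywords` can be split into one row per skill and then counted per `Location`. The rows below are toy data, and `skill` and `postings` are illustrative names.

```python
import pandas as pd

# Toy postings; column names follow the dataset, rows are illustrative.
toy = pd.DataFrame(
    {
        "Location": ["bangalore, india", "monterrey, mexico", "bangalore, india"],
        "Keywords": ["abap, sap", "java, python", "sap, python"],
    }
)

# One row per (posting, skill) pair.
skills = toy.assign(skill=toy["Keywords"].str.split(",")).explode("skill")
skills["skill"] = skills["skill"].str.strip()

# Number of postings mentioning each skill, per location.
skill_counts = (
    skills.groupby(["Location", "skill"])
    .size()
    .rename("postings")
    .reset_index()
    .sort_values(["Location", "postings"], ascending=[True, False])
)

print(skill_counts)
```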