## Background

In recent years, technological advances and changing work dynamics have significantly transformed the job market. As graduate students preparing to enter a competitive job landscape, we need to understand these trends and align our skills accordingly.

## Project Goal

This project aims to assess personal job market readiness by analyzing industry trends, identifying skill gaps, and applying machine learning techniques to predict salary outcomes. Our ultimate objective is to propose personalized learning paths that enhance employability in the evolving market.

## Methods Overview

The project combines data cleaning, exploratory analysis, machine learning modeling, and skill assessment. Publicly available job posting data was used to explore patterns, build predictive models, and benchmark team capabilities against market expectations.


## Research Questions

1. Which industries are generating the highest number of job postings in 2024?
2. How does salary distribution vary across industries and job types?
3. What is the current skill gap between our team and market demand?
4. How can we use machine learning models to estimate job market value and identify important predictors?

## Contribution

By linking data-driven market insights with personalized upskilling recommendations, this project provides both a strategic career roadmap and a framework for future job market analytics.

# Data Preparation and Cleaning 

In [None]:
#| echo: false
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.offline as pyo
import plotly.io as pio
import os

In [None]:
#| echo: false
data = pd.read_csv("files/lightcast_job_postings_new.csv")

In [None]:
# Identifier, URL, and redundant classification columns that are not needed for analysis
columns_to_drop = [
    "ID", "URL", "ACTIVE_URLS", "DUPLICATES", "LAST_UPDATED_TIMESTAMP",
    "NAICS2", "NAICS3", "NAICS4", "NAICS5", "NAICS6",
    "SOC_2", "SOC_2_NAME", "SOC_3", "SOC_3_NAME", "SOC_4", "SOC_4_NAME", "SOC_5", "SOC_5_NAME",
    "SOC_2021_2", "SOC_2021_2_NAME", "SOC_2021_3", "SOC_2021_3_NAME", "SOC_2021_5", "SOC_2021_5_NAME",
    "NAICS_2022_2", "NAICS_2022_2_NAME", "NAICS_2022_3", "NAICS_2022_3_NAME",
    "NAICS_2022_4", "NAICS_2022_4_NAME", "NAICS_2022_5", "NAICS_2022_5_NAME",
]
data_drop = data.drop(columns=columns_to_drop)

In [None]:
#| echo: false
figures_folder = "figures"
# Create the output folder for figures if it does not exist yet
os.makedirs(figures_folder, exist_ok=True)
#pd.set_option('display.max_rows', None)
#pd.set_option('display.max_columns', None)
#data_drop.columns

In [None]:
# Impute missing salary values with each column's median
salary_median = data_drop['SALARY'].median()
salary_to_median = data_drop['SALARY_TO'].median()
salary_from_median = data_drop['SALARY_FROM'].median()
data_drop['SALARY'] = data_drop['SALARY'].fillna(salary_median)
data_drop['SALARY_TO'] = data_drop['SALARY_TO'].fillna(salary_to_median)
data_drop['SALARY_FROM'] = data_drop['SALARY_FROM'].fillna(salary_from_median)

In [None]:
# Fill missing experience with 0 and missing durations with the sentinel value -1
data_drop['MIN_YEARS_EXPERIENCE'] = data_drop['MIN_YEARS_EXPERIENCE'].fillna(0)
data_drop['DURATION'] = data_drop['DURATION'].fillna(-1)
data_drop['MODELED_DURATION'] = data_drop['MODELED_DURATION'].fillna(-1)
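
Because `-1` is used above as a sentinel for missing durations, any later summary of `DURATION` should mask it out first. The following is a minimal sketch on made-up values; the toy frame is hypothetical, and only the column name and the `-1` convention come from the cell above.

```python
import pandas as pd

# Hypothetical toy data; -1 marks a missing duration, as in the cell above.
toy = pd.DataFrame({"DURATION": [10.0, 30.0, -1.0, 20.0, -1.0]})

# A naive mean treats the sentinel as a real value and is biased downward.
naive_mean = toy["DURATION"].mean()

# Masking the sentinel restricts the summary to observed durations.
observed = toy.loc[toy["DURATION"] >= 0, "DURATION"]
observed_mean = observed.mean()

print(f"naive mean: {naive_mean:.1f}, observed-only mean: {observed_mean:.1f}")
```

The same masking idea applies to the far-future sentinel dates used for missing expiry dates.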

In [None]:
# Convert date columns to datetime, then fill missing expiry dates with a far-future sentinel
data_drop['POSTED'] = pd.to_datetime(data_drop['POSTED'], errors='coerce')
data_drop['EXPIRED'] = pd.to_datetime(data_drop['EXPIRED'], errors='coerce')
data_drop['LAST_UPDATED_DATE'] = pd.to_datetime(data_drop['LAST_UPDATED_DATE'], errors='coerce')
data_drop['MODELED_EXPIRED'] = pd.to_datetime(data_drop['MODELED_EXPIRED'], errors='coerce')

data_drop['EXPIRED'] = data_drop['EXPIRED'].fillna(pd.to_datetime('2100-12-31'))
data_drop['MODELED_EXPIRED'] = data_drop['MODELED_EXPIRED'].fillna(pd.to_datetime('2100-12-31'))

In [None]:
#Handle the remaining missing values
string_cols = data_drop.select_dtypes(include='object').columns
data_drop[string_cols] = data_drop[string_cols].fillna("Unknown")

numeric_cols = data_drop.select_dtypes(include=['float64', 'int64']).columns
data_drop[numeric_cols] = data_drop[numeric_cols].fillna(0)

In [None]:
#Remove Duplicates
data_cleaned = data_drop.drop_duplicates(subset=["TITLE", "COMPANY", "LOCATION", "POSTED"], keep="first")

In [None]:
# Drop any rows that still contain missing values instead of relying on a hard-coded index
rows_with_na = data_cleaned.isna().any(axis=1)
data_cleaned = data_cleaned[~rows_with_na]

In [None]:
data_cleaned.isna().sum()

In [None]:
#| include: false
folder = "files"
file_name = "cleaned_job_postings.csv"
file_path = os.path.join(folder, file_name)

# Save the DataFrame to CSV in the files folder
data_cleaned.to_csv(file_path, index=False)

# Data Visualization

In [None]:
#| echo: false
industry_counts = data_cleaned["NAICS_2022_6_NAME"].value_counts().reset_index()
industry_counts.columns = ['Industry', 'Count']
# Exclude the placeholder category before taking the top 10, so exactly 10 industries are plotted
industry_counts = industry_counts[industry_counts['Industry'] != 'Unclassified Industry'].head(10)

fig = px.bar(industry_counts, x='Industry', y='Count', title="Top 10 Job Postings by Industry")
fig.update_layout(xaxis_tickangle=45, height=800, margin=dict(b=200))
fig.write_html(os.path.join(figures_folder, "industry_plot.html"))

<iframe src="figures/industry_plot.html" width="100%" height="500"></iframe>

In [None]:
#| echo: false
print("The bar plot shows the 10 industries with the most job postings. \nComputer-related services stand out, and management services and employment placement agencies also have twice as many job postings as the other industries shown.")

In [None]:
#| echo: false
# Exclude the placeholder category first, then keep the 10 industries with the most postings
industry_totals = data_cleaned["NAICS_2022_6_NAME"].value_counts()
top_industries = industry_totals.drop("Unclassified Industry", errors="ignore").head(10).index.tolist()
filtered_data = data_cleaned[data_cleaned["NAICS_2022_6_NAME"].isin(top_industries)]

fig = px.box(
    filtered_data,
    x="NAICS_2022_6_NAME",
    y="SALARY",
    title="Salary Distribution by Industry",
    labels={"NAICS_2022_6_NAME": "Industry", "SALARY": "Salary"},
    points="outliers",
    category_orders={"NAICS_2022_6_NAME": top_industries}
)


fig.update_layout(
    xaxis_tickangle=45,
    xaxis_title="Top 10 Industries",
    yaxis_title="Salary ($)",
    height=700,
    margin=dict(b=150)
)

fig.write_html(os.path.join(figures_folder, "salary_distribution_by_industry.html"))

<iframe src="figures/salary_distribution_by_industry.html" width="100%" height="500"></iframe>

In [None]:
#| echo: false
print("The box plot presents the salary distribution across the top 10 industries with the highest number of job postings. \n"
      "By reducing the number of categories and adjusting the axis labels, we improve readability.")

In [None]:
#| echo: false
remote_counts = data_cleaned["REMOTE_TYPE_NAME"].value_counts().reset_index()
remote_counts.columns = ["REMOTE_TYPE", "Count"]
remote_counts = remote_counts[remote_counts['REMOTE_TYPE'] != '[None]']


fig = px.pie(
    remote_counts,
    names="REMOTE_TYPE",
    values="Count",
    title="Remote vs. On-Site Jobs",
    color="REMOTE_TYPE",
    color_discrete_map={"Remote": "blue", "On-Site": "green", "Hybrid": "purple"}
)


fig.update_traces(textinfo="percent+label")
fig.update_layout(height=400)

fig.write_html(os.path.join(figures_folder, "remote_vs_onsite_jobs.html"))

<iframe src="figures/remote_vs_onsite_jobs.html" width="100%" height="500"></iframe>

In [None]:
#| echo: false
print("The pie chart represents the distribution of remote, on-site, and hybrid job postings. \n"
      "It helps visualize the proportion of different work arrangements in the job market.")

# Top 10 Job Postings by Industry
<iframe src="figures/industry_plot.html" width="100%" height="500"></iframe>
> Most job postings come from Custom Computer Programming Services, Accounting Services, and Employment Placement Agencies. These industries are consistently hiring across roles, suggesting high demand for software developers, finance professionals, and recruiters, and strong hiring momentum in tech and support functions.

# Salary Distribution by Industry
<iframe src="figures/salary_distribution_by_industry.html" width="100%" height="500"></iframe>
> Salary distribution varies widely across industries. While most sectors show a median salary between $80K and $120K, certain fields like Commercial Banking and Offices of Certified Public Accountants show higher outliers, indicating potential for high-earning roles. The variation within each industry also reflects differing job levels and skill demands.
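
The medians and spreads read off the box plot can also be tabulated directly. The sketch below uses a made-up frame; the industry labels `A` and `B` and all salary values are hypothetical, and only the column names come from the cleaned dataset. Quartiles here use pandas' default linear interpolation, which may differ slightly from the values Plotly draws.

```python
import pandas as pd

# Hypothetical toy postings; column names mirror the cleaned dataset.
toy = pd.DataFrame({
    "NAICS_2022_6_NAME": ["A", "A", "A", "B", "B", "B"],
    "SALARY": [80_000, 100_000, 120_000, 90_000, 95_000, 250_000],
})

# Quartiles per industry: one row per industry, one column per quantile.
quartiles = (
    toy.groupby("NAICS_2022_6_NAME")["SALARY"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
    .rename(columns={0.25: "q1", 0.5: "median", 0.75: "q3"})
)

# The interquartile range measures within-industry spread.
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]
print(quartiles)
```

In this toy data, industry `B` has a lower median than `A` but a much wider interquartile range, which is the kind of pattern a single high-paying posting can produce.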

# Remote vs. On-Site Jobs
<iframe src="figures/remote_vs_onsite_jobs.html" width="100%" height="500"></iframe>
> Over 78% of job postings offer remote work options, either fully or in hybrid mode. This highlights the growing normalization of flexible work arrangements post-pandemic. Only 7% of jobs are strictly on-site, indicating a permanent shift in job design and workplace expectations.
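
Percentage shares like these depend on which postings are in the denominator. As a hedged illustration, the sketch below computes shares after dropping the `[None]` placeholder, mirroring the filter in the pie chart code. The category labels and counts are made up and may not match the labels in the real data.

```python
import pandas as pd

# Hypothetical toy column; labels and counts are illustrative only.
toy = pd.Series(
    ["Remote", "Remote", "Hybrid", "On-Site", "[None]", "[None]"],
    name="REMOTE_TYPE_NAME",
)

# Drop the placeholder category first, then normalize so the shares sum to 100%.
labelled = toy[toy != "[None]"]
shares = labelled.value_counts(normalize=True).mul(100).round(1)
print(shares)

# For comparison: shares computed over all postings, placeholder included.
shares_all = toy.value_counts(normalize=True).mul(100).round(1)
```

With the placeholder excluded, `Remote` accounts for half of the toy postings; with it included, the same two postings are only a third, so the denominator should be stated alongside any quoted share.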

# Team Skill Levels Heatmap
<iframe src="figures/team_skill_levels_heatmap.html" width="100%" height="500"></iframe>
> The team demonstrates strong skill levels in Communication, Problem-Solving, and Teamwork, all scoring 5 across members. However, there are visible gaps in Machine Learning and Cloud Computing, particularly for Arohit. These gaps highlight potential areas for upskilling to align with industry demands in data and engineering roles.
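
One simple way to quantify such gaps is to subtract each member's score from a target level per skill. The sketch below is only an illustration: the member names, scores, and target levels are all hypothetical and are not the team's actual ratings or the method used in the embedded analysis.

```python
import pandas as pd

# Hypothetical self-assessed scores on a 1-5 scale (illustrative values only).
team = pd.DataFrame(
    {"Communication": [5, 5], "Machine Learning": [3, 2], "Cloud Computing": [2, 1]},
    index=["Member A", "Member B"],
)

# Assumed target level per skill, standing in for market expectations.
target = pd.Series({"Communication": 4, "Machine Learning": 4, "Cloud Computing": 3})

# rsub computes target - team column by column; surpluses are clipped to zero.
gaps = team.rsub(target, axis="columns").clip(lower=0)

# Average shortfall per skill across the team.
mean_gap = gaps.mean()
print(gaps)
print(mean_gap)
```

Clipping at zero keeps a strength in one skill from masking a shortfall in another when gaps are averaged.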



{{< embed skill_gap_analysis.qmd >}}




{{< embed ml_methods.qmd >}}

## Multiple Linear Regression (MLR)

We used MLR to examine the relationship between job attributes and salary. After feature engineering and one-hot encoding, the model achieved an $R^2$ of 0.87, indicating strong predictive ability. Key variables influencing salary included state, experience level, and educational background.
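
To make the one-hot encoding and $R^2$ steps concrete, here is a minimal sketch on made-up data. It is not the pipeline from the embedded methods file: the rows are hypothetical, the toy salaries are constructed to be exactly linear, and the fit uses NumPy's least-squares solver as a stand-in for whichever regression library was actually used.

```python
import numpy as np
import pandas as pd

# Hypothetical toy postings; values are illustrative and exactly linear by construction.
toy = pd.DataFrame({
    "STATE": ["CA", "CA", "TX", "TX", "NY", "NY"],
    "MIN_YEARS_EXPERIENCE": [1, 5, 1, 5, 2, 6],
    "SALARY": [90_000, 130_000, 70_000, 110_000, 95_000, 135_000],
})

# One-hot encode the categorical column; drop_first avoids collinearity with the intercept.
X = pd.get_dummies(toy[["STATE", "MIN_YEARS_EXPERIENCE"]], columns=["STATE"], drop_first=True)
X.insert(0, "intercept", 1.0)
X_mat = X.to_numpy(dtype=float)
y = toy["SALARY"].to_numpy(dtype=float)

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X_mat, y, rcond=None)
pred = X_mat @ coef

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(dict(zip(X.columns, coef.round(2))), round(r2, 4))
```

Each state coefficient is read relative to the dropped baseline category (`CA` here), which is worth keeping in mind when interpreting state effects on salary.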

## Random Forest Regressor

To compare with a non-linear approach, we implemented a Random Forest Regressor. Despite its flexibility, the model achieved a lower $R^2$ of 0.25, which may reflect its limitations with high-dimensional sparse data. Feature importance showed that skill count and experience dominated predictions.

## Model Comparison

| Metric        | MLR         | Random Forest |
|---------------|-------------|----------------|
| RMSE          | 77,772      | 63,755         |
| R-squared     | 0.87        | 0.25           |

By $R^2$, MLR provided the more reliable basis for interpreting salary trends, although Random Forest reported the lower RMSE and offered a useful feature ranking.
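
For reference, both metrics in the table can be computed from the same pair of true and predicted values. The sketch below uses only the standard library; the salaries and predictions are hypothetical and unrelated to the reported results.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(y_true))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Hypothetical salaries and predictions, for illustration only.
y_true = [80_000, 100_000, 120_000, 140_000]
y_pred = [90_000, 100_000, 110_000, 140_000]

model_rmse = rmse(y_true, y_pred)
model_r2 = r_squared(y_true, y_pred)
print(f"RMSE: {model_rmse:,.0f}, R-squared: {model_r2:.2f}")
```

Because the squared RMSE equals SS_res / n and $R^2$ equals 1 - SS_res / SS_tot, the two metrics move together when they are computed on the same observations and the same target scale. Comparing models is therefore most meaningful when both are evaluated on an identical test set.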


{{< embed nlp_methods.qmd >}}

# 1. Summary

This project integrates market trend analysis, skill benchmarking, and machine learning modeling to assess personal job readiness. Through rigorous data exploration and predictive modeling, we uncovered several important insights:

- **Industry Trends**: High demand is concentrated in the technology, consulting, and support service industries.
- **Salary Drivers**: Salary disparities are largely influenced by years of experience, the number of skills possessed, and geographic location.
- **Team Skill Assessment**: Team strengths are notable in areas of communication and problem-solving. However, skill gaps were identified in **cloud computing** and **machine learning** competencies.

These findings provide a grounded view of current labor market expectations and highlight actionable areas for professional development.

# 2. Future Directions

Looking ahead, several strategic pathways emerge:

- **Skill Development**: By implementing the personalized learning plans and focusing on closing the identified skill gaps, each member can substantially enhance their employability.
- **Continuous Trend Monitoring**: Staying attuned to evolving industry demands will ensure alignment between skills and market needs over time.
- **Scalability of Methods**: The analytic framework and techniques developed in this project are adaptable. They can be scaled to assist broader populations of job seekers and can be applied to a variety of career planning and workforce development scenarios.

Overall, this project demonstrates a robust approach to data-driven career readiness evaluation, setting a strong foundation for future strategic personal development.

---
