title: Data Cleaning & Exploration
subtitle: Comprehensive Data Cleaning & Exploratory Analysis of Job Market Trends
author:
  - name: Furong Wang
    affiliations:
      - id: bu
        name: Boston University
        city: Boston
        state: MA
  - name: Marco Perez Garcia
    affiliations:
      - ref: bu
bibliography: references.bib
csl: csl/econometrica.csl
format:
  html:
    toc: true
    number-sections: true
    df-print: paged
jupyter: python3

title: "Salary & Compensation Trends in AI vs. Non-AI Careers"
author: "Furong Wang, Marco Perez Garcia"
date: today
format: 
  html:
    bibliography: references.bib
    csl: csl/econometrica.csl
    toc: true
---

Recent research has highlighted a growing divergence in salary trends between artificial intelligence
(AI)-focused careers and more traditional data science roles. @zhu2024unveiling found that professionals 
specializing in AI-related fields, such as machine learning engineers and AI researchers, consistently 
command higher salaries than their non-AI counterparts, including data analysts and general data scientists. 
This difference in compensation reflects the increasing demand for AI expertise as industries integrate automation, 
deep learning, and predictive analytics into their operations. While AI roles require specialized skills in areas 
such as neural networks and natural language processing, traditional data science positions often focus 
more on business intelligence, statistical analysis, and data visualization, which, though valuable,
do not command the same salary premiums.

Other studies reinforce this trend, showing how company size and industry specialization further impact 
salary structures. @chen2024unraveling analyzed U.S. salary trends from 2020 to 2023, reporting that 
salaries in AI-driven roles have shown steady increases, particularly within mid-to-large tech companies 
investing in AI innovation. In contrast, non-AI data science roles, such as data analysts, have experienced 
slower growth, and some projections indicate potential stagnation or slight salary declines in 2024. Similarly, 
@quan2023human found that professionals with expertise in AI, cloud computing, and big data technologies 
earn higher salaries than those with more generalist skills. Their findings suggest that as AI adoption expands 
across industries, the wage gap between AI and non-AI roles may continue to grow, emphasizing the importance of 
specialized technical expertise for long-term career advancement in data science.

## References



# Data Cleaning

---
title: Data Cleaning & Exploration
subtitle: Comprehensive Data Cleaning & Exploratory Analysis of Job Market Trends
author:
  - name: Furong Wang
    affiliations:
      - id: bu
        name: Boston University
        city: Boston
        state: MA
  - name: Marco Perez Garcia
    affiliations:
      - ref: bu
bibliography: references.bib
csl: csl/econometrica.csl
format:
  html:
    toc: true
    number-sections: true
    df-print: paged
jupyter: python3
---

# **Data Cleaning & Preprocessing**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import plotly.express as px

## Dropping Unnecessary Columns

To simplify our job market analysis, we dropped columns that are unnecessary, duplicated, or outdated. Specifically, we removed:

- **Older NAICS and SOC codes (e.g., `NAICS2`, `SOC_2`).**  
The North American Industry Classification System (NAICS) and Standard Occupational Classification (SOC) systems undergo periodic updates. Retaining only `NAICS_2022_6` and `SOC_2021_4` ensures we use the most recent classification standards. Older codes are redundant and may lead to inconsistencies in trend analysis.  
- **Tracking data and URLs (e.g., `ID`, `URL`, `DUPLICATES`).**  
These columns relate to data collection timestamps, unique identifiers, or internal system references, which do not contribute meaningful insights about the job market. URLs likewise add complexity to the dataset without providing additional value or context for our analysis.

### Why remove multiple versions of NAICS/SOC codes?
There are three reasons. First, keeping only the latest industry and occupation classifications ensures our analysis reflects the most recent standards and avoids confusion and inconsistencies in classification. Second, reducing unnecessary columns speeds up data processing and enhances readability, which is particularly important with large datasets because it lowers the risk of errors. Finally, it keeps the focus on the most relevant information, allowing for clearer insights and conclusions about job market trends.

### Impact on Analysis
By removing outdated and irrelevant columns, we achieve:

- More accurate job market trends, focusing on meaningful variables.  
- Easier interpretation without clutter from redundant or technical fields.  
- Faster analysis and visualization, improving overall efficiency.  

In [None]:
job_postings = pd.read_csv('lightcast_job_postings.csv')

In [None]:
# NAICS_2022_6 and SOC_2021_4 are deliberately left out of this list:
# they are the current classification codes that the text says we retain.
columns_to_drop = [
    "ID", "URL", "ACTIVE_URLS", "DUPLICATES", "LAST_UPDATED_TIMESTAMP", "TITLE", "COMPANY",
    "MSA", "STATE", "COUNTY", "CITY", "COUNTY_OUTGOING", "COUNTY_INCOMING", "MSA_OUTGOING", "MSA_INCOMING",
    "ONET", "ONET_2019", "CIP2", "CIP4", "CIP6", "MODELED_DURATION", "MODELED_EXPIRED",
    "CERTIFICATIONS", "COMMON_SKILLS", "SPECIALIZED_SKILLS", "SKILLS", "SOFTWARE_SKILLS",
    "LOT_V6_CAREER_AREA", "LOT_V6_OCCUPATION_GROUP", "LOT_V6_OCCUPATION", "LOT_V6_SPECIALIZED_OCCUPATION",
    "LOT_OCCUPATION_GROUP", "LOT_SPECIALIZED_OCCUPATION", "LOT_OCCUPATION", "LOT_CAREER_AREA",
    "NAICS2", "NAICS2_NAME", "NAICS3", "NAICS3_NAME", "NAICS4", "NAICS4_NAME", "NAICS5", "NAICS5_NAME", "NAICS6",
    "NAICS_2022_2", "NAICS_2022_2_NAME", "NAICS_2022_3", "NAICS_2022_3_NAME", "NAICS_2022_4", "NAICS_2022_4_NAME",
    "NAICS_2022_5", "NAICS_2022_5_NAME",
    "SOC_2", "SOC_2_NAME", "SOC_3", "SOC_3_NAME", "SOC_5", "SOC_5_NAME", "SOC_4",
    "SOC_2021_2", "SOC_2021_2_NAME", "SOC_2021_3", "SOC_2021_3_NAME", "SOC_2021_5", "SOC_2021_5_NAME"
]

job_postings.drop(columns=columns_to_drop, inplace=True)

## Handling Missing Values
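We used different strategies for missing values, depending on the type of column. We first visualized how missingness in one column correlates with missingness in the others: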

In [None]:
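# msno.heatmap creates its own figure, so pass figsize directly;
# calling plt.figure() first would only leave an empty extra figure.
msno.heatmap(job_postings, figsize=(8, 6))
plt.title("Missing Values Heatmap")
plt.show()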

#### Dealing with the Salary Column
The `SALARY` column has a significant number of missing values. We filled them in three steps: first with the median salary for the same job title, then, where no title-level median exists, with the median for the same industry, and finally with the overall median. We use the median rather than the mean because it is less sensitive to outliers and so gives a better picture of the typical salary for each group.

In [None]:
title_median_salary = job_postings.groupby('TITLE_NAME')['SALARY'].median()
industry_median_salary = job_postings.groupby('NAICS_2022_6_NAME')['SALARY'].median()

In [None]:
job_postings['SALARY'] = job_postings.apply(
    lambda row: title_median_salary[row['TITLE_NAME']]
    if pd.isna(row['SALARY']) and row['TITLE_NAME'] in title_median_salary else row['SALARY'], 
    axis=1
)

In [None]:
job_postings['SALARY'] = job_postings.apply(
    lambda row: industry_median_salary[row['NAICS_2022_6_NAME']]
    if pd.isna(row['SALARY']) and row['NAICS_2022_6_NAME'] in industry_median_salary else row['SALARY'], 
    axis=1
)

In [None]:
job_postings['SALARY'] = job_postings['SALARY'].fillna(job_postings["SALARY"].median())
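
The three-step fallback above (title median, then industry median, then overall median) can also be written without a row-wise `apply`. The following is a minimal sketch on made-up data, not the code used for the analysis; the helper name `fill_with_group_medians` and the toy values are assumptions for illustration. As in the cells above, all group medians are computed before any value is filled.

```python
import numpy as np
import pandas as pd


def fill_with_group_medians(df, target, group_cols):
    """Fill NaNs in `target` using group medians, trying each column in
    `group_cols` in order, then falling back to the overall median."""
    # Compute every per-row group median up front, from the unfilled data.
    per_row_medians = [
        df[col].map(df.groupby(col)[target].median()) for col in group_cols
    ]
    overall_median = df[target].median()

    filled = df[target]
    for medians in per_row_medians:
        filled = filled.fillna(medians)
    return filled.fillna(overall_median)


# Hypothetical toy data: titles, industries, and salaries are invented.
toy = pd.DataFrame({
    "TITLE_NAME": ["Data Analyst", "Data Analyst", "ML Engineer",
                   "ML Engineer", "Clerk", "Driver"],
    "NAICS_2022_6_NAME": ["Software", "Software", "Software",
                          "Banking", "Banking", "Retail"],
    "SALARY": [80000.0, np.nan, np.nan, 150000.0, np.nan, np.nan],
})

toy["SALARY_FILLED"] = fill_with_group_medians(
    toy, "SALARY", ["TITLE_NAME", "NAICS_2022_6_NAME"]
)
print(toy)
```

In this toy frame, the second "Data Analyst" and the first "ML Engineer" rows take their title medians, "Clerk" has no title-level salary and falls back to the "Banking" median, and "Driver" has neither and receives the overall median.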

#### Dealing with Columns with >50% missing values
Dealing with columns that have more than 50% missing values is crucial for maintaining the integrity of our dataset. Columns with excessive missing data can introduce bias and reduce the reliability of our analysis. Therefore, we removed any columns that exceed this threshold. This ensures that our dataset remains focused on relevant and reliable information, enhancing the quality of our insights.

In [None]:
job_postings.dropna(thresh = len(job_postings) * 0.5, axis = 1, inplace = True)
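
To make the 50% rule concrete, here is a small sketch on an invented three-column frame (the column names are hypothetical). It computes the share of missing values per column and shows that keeping columns with at most 50% missing is the same as calling `dropna` with `thresh` set to half the number of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical columns with 25%, 50%, and 75% missing values.
toy = pd.DataFrame({
    "mostly_present": [1.0, 2.0, np.nan, 4.0],
    "half_missing": [1.0, np.nan, np.nan, 4.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 4.0],
})

# Share of missing values in each column.
missing_share = toy.isna().mean()
print(missing_share)

# Keep only columns where at most half of the values are missing.
kept = toy.loc[:, missing_share <= 0.5]

# dropna expresses the same rule as a minimum count of non-null values.
min_non_null = int(np.ceil(len(toy) * 0.5))
same = toy.dropna(axis=1, thresh=min_non_null)

print(list(kept.columns))
```

Note that a column with exactly 50% missing values is kept under this rule; only columns with more than half of their values missing are dropped.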

#### Dealing with Categorical fields
Categorical fields, such as `TITLE_RAW`, were filled with "Unknown" for missing values. This approach allows us to retain the integrity of the dataset without introducing bias from arbitrary values. By labeling missing categorical data as "Unknown", we can still analyze trends without losing valuable information.

In [None]:
# Label missing values in text columns explicitly as "Unknown"
categorical_cols = [
    'TITLE_RAW', 'TITLE_CLEAN', 'COMPANY_RAW',
    'MSA_NAME', 'MSA_NAME_OUTGOING', 'MSA_NAME_INCOMING',
]
job_postings[categorical_cols] = job_postings[categorical_cols].fillna("Unknown")

#### Dealing with Datetime fields
For the `EXPIRED` variable, we filled missing values with the latest date in the column. We assumed that a missing value means the posting had not yet expired, so the most recent observed expiration date serves as a stand-in. This keeps the column complete, although it treats every still-open posting as expiring on that date.

In [None]:
job_postings['POSTED'] = pd.to_datetime(job_postings['POSTED'])
job_postings['EXPIRED'] = pd.to_datetime(job_postings['EXPIRED'])

In [None]:
max_expired_date = job_postings['EXPIRED'].max()
job_postings['EXPIRED'] = job_postings['EXPIRED'].fillna(max_expired_date)

#### Dealing with Numerical fields
For the `MIN_YEARS_EXPERIENCE` variable, we filled missing values with the median `MIN_YEARS_EXPERIENCE` for the same title, then for the same industry, and finally with the overall median, as we did for `SALARY`. Using medians minimizes the impact of outliers and gives a more accurate picture of the typical years of experience required for each job title.

In [None]:
title_median_exp = job_postings.groupby('TITLE_NAME')['MIN_YEARS_EXPERIENCE'].median()
industry_median_exp = job_postings.groupby('NAICS_2022_6_NAME')['MIN_YEARS_EXPERIENCE'].median()

In [None]:
job_postings['MIN_YEARS_EXPERIENCE'] = job_postings.apply(
    lambda row: title_median_exp[row['TITLE_NAME']]
    if pd.isna(row['MIN_YEARS_EXPERIENCE']) and row['TITLE_NAME'] in title_median_exp else row['MIN_YEARS_EXPERIENCE'], 
    axis=1
)

In [None]:
job_postings['MIN_YEARS_EXPERIENCE'] = job_postings.apply(
    lambda row: industry_median_exp[row['NAICS_2022_6_NAME']]
    if pd.isna(row['MIN_YEARS_EXPERIENCE']) and row['NAICS_2022_6_NAME'] in industry_median_exp else row['MIN_YEARS_EXPERIENCE'], 
    axis=1
)

In [None]:
job_postings['MIN_YEARS_EXPERIENCE'] = job_postings['MIN_YEARS_EXPERIENCE'].fillna(job_postings["MIN_YEARS_EXPERIENCE"].median())

The `DURATION` variable is also numerical, but we handle it differently: we fill missing values with the difference between `EXPIRED` and `POSTED`, which gives the actual time span based on the available dates.

In [None]:
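def impute_duration(cols):
    # cols holds POSTED, EXPIRED, DURATION (in that order) for one row
    posted = cols.iloc[0]
    expired = cols.iloc[1]
    duration = cols.iloc[2]

    if pd.isnull(duration):
        # Return whole days so DURATION stays numeric rather than mixing in
        # Timedelta objects (assumes DURATION is measured in days)
        return (expired - posted).days
    else: 
        return duration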

In [None]:
job_postings['DURATION'] = job_postings[['POSTED', 'EXPIRED', 'DURATION']].apply(impute_duration, axis = 1)
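
The same fill can be done on whole columns at once. This is a minimal sketch on invented dates, assuming `DURATION` is recorded in whole days; it is an equivalent formulation for illustration rather than the cell used above.

```python
import numpy as np
import pandas as pd

# Hypothetical postings: only the first row has a recorded duration.
toy = pd.DataFrame({
    "POSTED": pd.to_datetime(["2024-05-01", "2024-05-10", "2024-06-01"]),
    "EXPIRED": pd.to_datetime(["2024-05-31", "2024-05-20", "2024-06-15"]),
    "DURATION": [30.0, np.nan, np.nan],
})

# Subtracting two datetime columns yields Timedeltas; .dt.days converts
# them to whole days so the result matches the numeric DURATION column.
elapsed_days = (toy["EXPIRED"] - toy["POSTED"]).dt.days

# Only the missing durations are replaced; recorded values are untouched.
toy["DURATION"] = toy["DURATION"].fillna(elapsed_days)
print(toy)
```

Because the subtraction is converted to days before filling, the column keeps a numeric dtype and can be averaged or plotted directly.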

## Removing Duplicate Job Postings

To ensure each job is counted only once, we removed duplicates based on job title, company, location, and posting date.

In [None]:
job_postings = job_postings.drop_duplicates(subset=["TITLE_NAME", "COMPANY_NAME", "LOCATION", "POSTED"], keep = "first")

In [None]:
job_postings.to_csv('job_postings_cleaned.csv', index=False)

# **Exploratory Data Analysis (EDA)**

## Top 20 Companies by Job Postings

In [None]:
filtered_companies = job_postings[job_postings["COMPANY_NAME"] != "Unclassified"]

top_companies = filtered_companies["COMPANY_NAME"].value_counts().head(20)

fig = px.bar(
    x=top_companies.values,
    y=top_companies.index,
    orientation='h',
    title="Top 20 Companies by Job Postings (Excluding Unclassified)",
    labels={'x': 'Number of Job Postings', 'y': 'Company Name'},
    text=top_companies.values
)

fig.update_layout(
    xaxis_title="Number of Job Postings",
    yaxis_title="Company",
    yaxis={'categoryorder': 'total ascending'}, 
    height=600, 
    width=900
)

fig.show()

The visualization of the top 20 companies by job postings (excluding "Unclassified") highlights key trends in the job market, particularly in the increasing demand for AI-related roles. Many of the companies with the most postings—Deloitte, Accenture, PricewaterhouseCoopers (PwC), Oracle, Infosys, Meta, and CDW—are major players in technology, consulting, and digital transformation, sectors that have been heavily investing in AI, machine learning, and data-driven innovation.

The dominance of these companies in job postings suggests that careers in AI and technology-related fields are in high demand. Consulting giants like Deloitte, Accenture, PwC, and KPMG are actively expanding their AI divisions, helping businesses integrate AI into their operations. For instance, Deloitte has launched several AI tools, including chatbots like "DARTbot" for audit professionals and "NavigAite" for document review, to enhance efficiency and client services (Stokes, 2025). Additionally, companies like Meta are pioneers in AI research, focusing on areas such as generative AI, automation, and data science. Even in non-tech sectors, financial and healthcare firms such as Citigroup, Cardinal Health, and Blue Cross Blue Shield are leveraging AI for fraud detection, risk assessment, and personalized healthcare.

These trends indicate that pursuing a career in AI-related fields, such as data science, machine learning engineering, and AI research, could provide greater job opportunities and higher earning potential. The strong presence of technology and consulting firms in job postings reflects how AI is becoming a fundamental part of business strategies across industries. While traditional, non-AI careers will continue to exist, the rapid push toward automation and intelligent systems suggests that AI-related skills will be increasingly valuable in both technical and non-technical roles. As industries continue adopting AI, professionals who develop expertise in this area may have a competitive advantage in the evolving job market.

## Salary Distribution by Industry

In [None]:
fig = px.box(job_postings, x="NAICS_2022_6_NAME", y="SALARY", title="Salary Distribution by Industry")
fig.update_layout(width=1200, height=1000)
fig.show()

The box plot shows salary distributions across industries, highlighting variations in median salaries and outliers. Most industries exhibit salary concentrations below \$200K, with some sectors showing outliers in the \$300K-\$500K range, suggesting high-paying roles in specialized fields.

AI-related jobs, typically found in industries such as technology, finance, and advanced manufacturing, often contribute to these high-salary outliers. Roles in machine learning, data science, and artificial intelligence engineering command premium salaries due to their specialized skill requirements, talent scarcity, and high demand across multiple industries. The broader salary spread in AI-intensive fields may also reflect differences in job seniority, from entry-level analysts to highly compensated AI researchers and executives.

Additionally, AI-driven industries tend to offer competitive compensation to attract top talent, given the rapid pace of technological advancement and the strategic importance of AI in business growth. The dense clustering of lower salaries in non-AI industries indicates a more constrained range, potentially due to standardized pay structures or lower technical barriers to entry. 

## Top 5 Occupations by Average Salary

In [None]:
avg_salary_per_occupation = job_postings.groupby("LOT_V6_OCCUPATION_NAME")["SALARY"].mean().reset_index()

top_occupations = avg_salary_per_occupation.sort_values(by="SALARY", ascending=False).head(5)

fig = px.bar(
        top_occupations,
        x="SALARY",
        y="LOT_V6_OCCUPATION_NAME",
        orientation='h',
        title="Top 5 Occupations by Average Salary",
        labels={"SALARY": "Average Salary ($)", "LOT_V6_OCCUPATION_NAME": "Occupation"},
        text=top_occupations["SALARY"]
    )

fig.update_layout(
        xaxis_title="Average Salary ($)",
        yaxis_title="Occupation",
        yaxis={"categoryorder": "total ascending"}, 
        height=700,
        width=900
    )

fig.show()

---
title: Exploratory Data Analysis
subtitle: Enhance EDA with Improved Visualizations and Deeper Insights
author:
  - name: Furong Wang
    affiliations:
      - id: bu
        name: Boston University
        city: Boston
        state: MA
  - name: Marco Perez Garcia
    affiliations:
      - ref: bu
bibliography: references.bib
csl: csl/econometrica.csl
format:
  html:
    toc: true
    number-sections: true
    df-print: paged
jupyter: python3
---

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import plotly.express as px

In [None]:
job_postings = pd.read_csv('job_postings_cleaned.csv')

# **Exploratory Data Analysis & Visualization**

In [None]:
plotly_layout = dict(
    font=dict(family="Arial", size=14),
    title_font=dict(size=20, family="Arial", color="black"),
    paper_bgcolor="white",
    plot_bgcolor="white",
    margin=dict(t=60, l=60, r=30, b=60),
    legend=dict(bordercolor="lightgray", borderwidth=1),
    xaxis=dict(
        title_font=dict(size=16, color="black"),  
        tickfont=dict(size=12, color="black"),    
        showgrid=True, 
        gridcolor="lightgray",
        showline=True,
        linecolor="black",
        linewidth=1,
        mirror=True,
        zeroline=True,
        zerolinecolor="gray",
        zerolinewidth=1
    ),
    yaxis=dict(
        title_font=dict(size=16, color="black"),  
        tickfont=dict(size=12, color="black"),    
        showgrid=True, 
        gridcolor="lightgray",
        showline=True,
        linecolor="black",
        linewidth=1,
        mirror=True,
        zeroline=True,
        zerolinecolor="gray",
        zerolinewidth=1
    ),
)

## Top 20 Companies by Job Postings

In [None]:
filtered_companies = job_postings[job_postings["COMPANY_NAME"] != "Unclassified"]

top_companies = filtered_companies["COMPANY_NAME"].value_counts().head(20)

fig = px.bar(
    x=top_companies.values,
    y=top_companies.index,
    orientation='h',
    title="Top 20 Companies by Job Postings (Excluding Unclassified)",
    labels={'x': 'Number of Job Postings', 'y': 'Company Name'},
    text=top_companies.values
)

fig.update_layout(
    xaxis_title="Number of Job Postings",
    yaxis_title="Company",
    yaxis={'categoryorder': 'total ascending'}, 
    height=600, 
    width=900
)

fig.update_layout(**plotly_layout) 
fig.show()

## Salary Distribution by Industry

In [None]:
fig = px.box(job_postings, x="NAICS_2022_6_NAME", y="SALARY", title="Salary Distribution by Industry")
fig.update_layout(width=1200, height=1000)
fig.update_layout(**plotly_layout)
fig.show()

## Top 5 Occupations by Average Salary

In [None]:
avg_salary_per_occupation = job_postings.groupby("LOT_V6_OCCUPATION_NAME")["SALARY"].mean().reset_index()

top_occupations = avg_salary_per_occupation.sort_values(by="SALARY", ascending=False).head(5)

fig = px.bar(
        top_occupations,
        x="SALARY",
        y="LOT_V6_OCCUPATION_NAME",
        orientation='h',
        title="Top 5 Occupations by Average Salary",
        labels={"SALARY": "Average Salary ($)", "LOT_V6_OCCUPATION_NAME": "Occupation"},
        text=top_occupations["SALARY"]
    )

fig.update_layout(
        xaxis_title="Average Salary ($)",
        yaxis_title="Occupation",
        yaxis={"categoryorder": "total ascending"}, 
        height=700,
        width=900
    )

fig.update_layout(**plotly_layout)
fig.show()

The salary distribution in the graph clearly shows that the highest-paying occupations are directly tied to artificial intelligence, data analytics, and business intelligence. The top-paying role, "Computer Systems Engineer / Architect," averages over \$156,000, followed by "Business Intelligence Analyst" at \$125,000 and other AI-driven roles like "Data Mining Analyst" and "Market Research Analyst," all exceeding \$100,000. These occupations rely heavily on AI, machine learning, and data-driven decision-making, making it clear that mastering AI-related skills is directly linked to higher salaries. The strong earnings for these roles indicate that industries are willing to pay a premium for professionals who can build, interpret, and optimize AI-driven systems.

In contrast, traditional non-AI careers, which are not as data or automation-focused, tend to fall outside these top salary brackets. The job market is shifting towards AI dependency, where knowing how to work with artificial intelligence, big data, and automation tools is no longer just an advantage but a necessity for higher-paying opportunities. As industries integrate AI at an increasing pace, professionals who fail to develop AI-related expertise risk stagnating in lower-paying roles, while those who embrace AI technologies position themselves for significantly better financial rewards.

## **Enhanced Visualizations**

## Job Postings Trend Over Time (Top Companies)

In [None]:
job_postings['POSTED'] = pd.to_datetime(job_postings['POSTED'])
top_companies = (
    job_postings[job_postings["COMPANY_NAME"] != "Unclassified"]["COMPANY_NAME"]
    .value_counts()
    .head(10)
    .index
)

filtered = job_postings[job_postings['COMPANY_NAME'].isin(top_companies)]

trend = (
    filtered.groupby([filtered['POSTED'].dt.to_period('M'), 'COMPANY_NAME'])
    .size()
    .reset_index(name='Postings')
)
trend['POSTED'] = trend['POSTED'].dt.to_timestamp()

fig = px.line(trend, x='POSTED', y='Postings', color='COMPANY_NAME',
              title='Monthly Job Postings for Top 10 Companies')
fig.update_layout(**plotly_layout)
fig.show()

The line chart above reveals dynamic shifts in job posting activity among the top 10 hiring companies over recent months. Several key patterns emerge:

- Infosys shows a strong upward trend, indicating a possible expansion phase or increased demand for tech-related talent. This could reflect growing project loads or client demand in IT services and consulting.

- Accenture and Deloitte maintain relatively stable posting volumes, suggesting consistent hiring pipelines. This stability aligns with their roles as global consulting giants with ongoing needs for specialized talent in digital transformation, data analytics, and strategy.

- Humana and Insight Global exhibit moderate declines followed by slight recoveries, potentially pointing to seasonal or project-based hiring fluctuations in healthcare and staffing services.

- Companies like KPMG, Oracle, and PricewaterhouseCoopers (PwC) show lower and flatter posting trends, possibly indicating a more conservative hiring approach or specific recruitment periods during the year.

- Merit America, a nonprofit focused on career advancement, remains on the lower end of the spectrum. However, its presence in the top 10 indicates consistent demand in educational or workforce development roles.

Overall, the chart highlights Infosys as a standout, with its consistent rise suggesting aggressive recruitment. In contrast, other firms maintain steady or slightly fluctuating volumes, reflecting industry-specific hiring cycles. This trend-based view can be valuable for job seekers, workforce planners, or analysts studying labor market activity in the consulting, healthcare, tech, and staffing sectors.

## Salary Distribution by Industry (Filtered Outliers)

In [None]:
Q1 = job_postings['SALARY'].quantile(0.25)
Q3 = job_postings['SALARY'].quantile(0.75)
IQR = Q3 - Q1

filtered_salaries = job_postings[
    (job_postings['SALARY'] >= Q1 - 1.5*IQR) & 
    (job_postings['SALARY'] <= Q3 + 1.5*IQR)
]

fig = px.box(filtered_salaries, x="NAICS_2022_6_NAME", y="SALARY", 
             title="Filtered Salary Distribution by Industry")
fig.update_layout(width=1200, height=800, xaxis_tickangle=45)
fig.update_layout(**plotly_layout)
fig.show()

The box plot above provides a cleaned and focused view of salary distributions across different industries, with extreme outliers removed to highlight more meaningful central trends.

- High variation across industries: Some industries display a narrow salary band, suggesting standardized roles (e.g., Retail or Administrative sectors), while others—especially in tech, consulting, and finance—show wider spreads, indicating diverse job levels and pay scales.

- Technology and data-driven sectors (e.g., Computer Systems Design, Custom Software Development) tend to cluster toward the higher end of the salary spectrum, reflecting the premium placed on digital skills, AI, and advanced analytics.

- Healthcare and scientific industries also show strong mid-to-upper ranges, hinting at specialized roles that demand advanced education or certifications.

- In contrast, industries like Warehousing, Food Services, and Retail generally reflect lower median salaries, consistent with roles requiring less formal education or technical expertise.

This visualization emphasizes how industry selection can significantly impact earning potential, even before considering role or experience level. For job seekers or workforce planners, it provides a valuable benchmark when evaluating career paths or advising on industry transitions.
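
The outlier filter behind this plot keeps salaries within 1.5 times the interquartile range (IQR) of the first and third quartiles. The sketch below wraps that rule in a small helper and applies it to invented salaries; the function name `iqr_fences` and the sample values are assumptions for illustration.

```python
import pandas as pd


def iqr_fences(values, k=1.5):
    """Return the (lower, upper) bounds Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical salaries with one extreme value.
salaries = pd.Series(
    [60000, 70000, 80000, 90000, 100000, 500000], dtype="float64"
)

lower, upper = iqr_fences(salaries)

# between() is inclusive on both ends, matching the >= and <= filter.
inliers = salaries[salaries.between(lower, upper)]

print(lower, upper)
print(inliers.tolist())
```

Here the quartiles are 72,500 and 97,500, so the bounds are 35,000 and 135,000 and only the 500,000 value is excluded. The bounds are computed across all postings, so an industry whose typical pay is far from the overall median can lose legitimate values; computing the bounds per industry is one alternative.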

## Fastest-Growing Industries Over Time

In [None]:
monthly_industry = (
    job_postings.groupby([job_postings['POSTED'].dt.to_period("M"), "NAICS_2022_6_NAME"])
    .size()
    .reset_index(name='Postings')
)
monthly_industry["POSTED"] = monthly_industry["POSTED"].dt.to_timestamp()

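# Exclude unclassified postings first, then take the five largest industries,
# so the chart always shows exactly five named industries.
industry_totals = monthly_industry.groupby("NAICS_2022_6_NAME")["Postings"].sum()
top_industries = industry_totals.drop("Unclassified Industry", errors="ignore").nlargest(5).index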

filtered_growth = monthly_industry[monthly_industry["NAICS_2022_6_NAME"].isin(top_industries)]

fig = px.line(filtered_growth, x="POSTED", y="Postings", color="NAICS_2022_6_NAME",
              title="Top 5 Industries by Job Postings Over Time (Excluding Unclassified)")
fig.update_layout(**plotly_layout)
fig.show()

This line plot presents job posting trends across the top five industries (excluding unclassified roles), offering a clearer picture of sector-specific hiring momentum over the past several months.

- Employment Placement Agencies show the most significant increase in job postings, suggesting a surge in demand for staffing services. This could reflect broader labor market activity, such as rising contract work, workforce mobility, or seasonal hiring cycles.

- Administrative Management and Consulting Services maintain consistently high levels of postings, highlighting the ongoing demand for business strategy, operations, and project management talent. The slight upward trend may align with businesses seeking advisory support during periods of uncertainty or transformation.

- Computer Systems Design Services and Custom Computer Programming Services demonstrate steady hiring activity, reinforcing the continued need for tech infrastructure, custom software development, and IT support roles across industries.

- Commercial Banking, while slightly more volatile, remains a key hiring industry. This might reflect fluctuations in financial service needs, regulatory adjustments, or regional economic conditions.

Overall, the chart illustrates that technology, consulting, staffing, and finance remain dominant hiring sectors — with tech-related industries showing stable demand and staffing services accelerating most rapidly. These insights are valuable for job seekers targeting high-opportunity industries, and for workforce planners aiming to align talent strategies with real-time market shifts.
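
Ranking by total postings, as in the chart above, identifies the largest industries, which are not necessarily the fastest-growing. One way to measure growth directly is the percentage change in monthly postings between the first and last month. The sketch below uses invented counts and placeholder industry names; it illustrates that alternative metric and is not part of the original analysis.

```python
import pandas as pd

# Hypothetical monthly posting counts for two placeholder industries.
monthly = pd.DataFrame({
    "POSTED": pd.to_datetime(["2024-05-01", "2024-06-01", "2024-07-01"] * 2),
    "NAICS_2022_6_NAME": ["Industry A"] * 3 + ["Industry B"] * 3,
    "Postings": [100, 120, 150, 400, 410, 420],
})

# One row per month, one column per industry.
wide = monthly.pivot(
    index="POSTED", columns="NAICS_2022_6_NAME", values="Postings"
).sort_index()

# Percentage change from the first month to the last month.
first_month, last_month = wide.iloc[0], wide.iloc[-1]
growth_pct = ((last_month - first_month) / first_month * 100).sort_values(
    ascending=False
)
print(growth_pct)
```

In this example "Industry B" posts far more jobs overall, but "Industry A" grows by 50% against 5%, so the two rankings disagree.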

## Salary Trends Over Time for Top 5 Occupations

In [None]:
job_postings['POSTED'] = pd.to_datetime(job_postings['POSTED'])
top_occ = job_postings['LOT_V6_OCCUPATION_NAME'].value_counts().head(5).index

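# .copy() makes an independent frame so adding the Month column below
# does not trigger a SettingWithCopyWarning
filtered_jobs = job_postings[job_postings['LOT_V6_OCCUPATION_NAME'].isin(top_occ)].copy()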
filtered_jobs['Month'] = filtered_jobs['POSTED'].dt.to_period("M").dt.to_timestamp()

salary_trend = (
    filtered_jobs.groupby(['Month', 'LOT_V6_OCCUPATION_NAME'])['SALARY']
    .mean().reset_index()
)

fig = px.line(salary_trend, 
              x="Month", 
              y="SALARY", 
              color="LOT_V6_OCCUPATION_NAME",
              title="Average Salary Trends Over Time for Top 5 Occupations")
fig.update_layout(**plotly_layout)
fig.show()

The line chart illustrates average salary trends over time for the top five most frequently posted occupations. A few meaningful patterns emerge:

- Computer Systems Engineer / Architect consistently ranks as the highest-paid occupation, maintaining an average salary around or above $150,000. This reflects the strong demand for highly skilled professionals in systems architecture, a field that supports infrastructure in both legacy enterprises and cloud-native environments.

- Data / Data Mining Analysts and Business Intelligence Analysts both show stable and competitive salaries in the range of ~$120,000–$130,000. These roles are closely tied to data-driven decision-making, reflecting how AI and analytics continue to shape business strategy and operations.

- Clinical Analysts / Clinical Documentation Specialists demonstrate slightly lower salary levels but remain relatively consistent, indicating steady demand in the healthcare and life sciences sectors—often associated with electronic health records, compliance, and process optimization.

- Business / Management Analysts show moderate but stable pay, aligning with generalist consulting and strategic support functions. While their salaries are slightly below the technical roles, they still remain above the $100,000 mark.

Overall, this plot reinforces the idea that technical and analytical occupations—especially those connected to data, engineering, and system-level design—continue to command premium salaries in the job market. Notably, salary stability across all five roles suggests that these are high-value, high-demand positions, resilient to short-term economic shifts.
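
One way to put a number on "salary stability" is the coefficient of variation (standard deviation divided by mean) of each occupation's monthly average. The sketch below assumes a table shaped like `salary_trend`; the occupation labels and values are hypothetical.

```python
import pandas as pd

# Hypothetical monthly averages shaped like `salary_trend`; values are made up.
salary_trend_demo = pd.DataFrame({
    "LOT_V6_OCCUPATION_NAME": ["A", "A", "A", "B", "B", "B"],
    "SALARY": [100.0, 100.0, 100.0, 80.0, 100.0, 120.0],
})

# Mean and sample standard deviation of the monthly averages per occupation.
stats = salary_trend_demo.groupby("LOT_V6_OCCUPATION_NAME")["SALARY"].agg(["mean", "std"])

# Coefficient of variation: lower values mean a flatter salary line.
stats["cv"] = stats["std"] / stats["mean"]
print(stats)
```

Because the coefficient of variation is unitless, it allows a fair comparison between occupations with very different pay levels.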

## Average Salary by Employment Type

In [None]:
avg_salary_by_type = (
    job_postings.groupby("EMPLOYMENT_TYPE_NAME")["SALARY"]
    .mean()
    .sort_values(ascending=False)
    .reset_index()
)

import plotly.express as px

fig = px.bar(avg_salary_by_type, 
             x="EMPLOYMENT_TYPE_NAME", 
             y="SALARY",
             title="Average Salary by Employment Type",
             labels={"SALARY": "Average Salary ($)", "EMPLOYMENT_TYPE_NAME": "Employment Type"},
             text="SALARY")

fig.update_layout(yaxis_tickprefix="$", height=500)
fig.update_layout(**plotly_layout)
fig.show()

This bar chart compares the average salaries across different employment types, revealing key patterns in compensation based on job structure:

- Full-time roles (>32 hours) lead with the highest average salary at approximately $117,324, which aligns with expectations — these positions often come with more responsibilities, benefits, and long-term career opportunities.

- Part-time / full-time hybrid roles earn slightly less on average (~$104,379), potentially due to inconsistent hours or project-based employment models that offer flexibility but not always the highest compensation.

- Part-time roles (≤32 hours) average just below $102,000, a surprisingly competitive figure. This could reflect specialized part-time positions (e.g., consultants or contract professionals) that still command high hourly rates despite reduced hours.

Notably, the relatively narrow gap between employment types suggests that skills and job function may have a stronger influence on salary than hours alone. High-paying part-time and hybrid roles could indicate a shift toward flexible, high-skill labor markets, where experienced professionals negotiate premium pay for reduced workloads.
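
Averages can be pulled upward by a few very high salaries, and small groups can look deceptively strong. A hedged way to check the comparison above is to report the median and the number of postings next to the mean. The data below is hypothetical and only mirrors the column names used in our chart.

```python
import pandas as pd

# Hypothetical postings; one full-time outlier inflates the mean.
demo = pd.DataFrame({
    "EMPLOYMENT_TYPE_NAME": ["Full-time"] * 4 + ["Part-time"] * 2,
    "SALARY": [90_000, 100_000, 110_000, 300_000, 95_000, 105_000],
})

# Mean, median, and group size side by side.
summary = (
    demo.groupby("EMPLOYMENT_TYPE_NAME")["SALARY"]
    .agg(mean="mean", median="median", postings="count")
    .sort_values("median", ascending=False)
)
print(summary)
```

In this toy example the full-time mean (150,000) sits well above its median (105,000), which shows how a single outlier can widen an apparent gap.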




# Skill Gap Analysis

---
title: Skill Gap Analysis
subtitle: Compare the skills required in IT job postings against the actual skills of your group members to identify knowledge gaps and areas for improvement.
author:
  - name: Furong Wang
    affiliations:
      - id: bu
        name: Boston University
        city: Boston
        state: MA
  - name: Marco Perez Garcia
    affiliations:
      - ref: bu
bibliography: references.bib
csl: csl/econometrica.csl
format:
  html:
    toc: true
    number-sections: true
    df-print: paged
jupyter: python3
---

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
from collections import Counter
import json
import ast

# **Team-based Skill Dataframe**

With our chosen IT career path as Business Analysts, we identified our current skills relevant to the role and assessed proficiency levels using a numerical scale from 1 to 5:  
1 = Beginner   
2 = Basic Knowledge   
3 = Intermediate   
4 = Advanced   
5 = Expert  

The following heatmap visualizes our team's strengths and gaps.

In [None]:
job_postings = pd.read_csv("lightcast_job_postings.csv", low_memory = False)

In [None]:
team_skills_data = {
    "Name": ["Furong", "Marco"],
    "R": [3, 3],
    "Python": [3, 4],
    "SQL": [2, 3],
    "Microsoft Excel": [5, 5],
    "Data Visualization": [4, 4],
    "Amazon Web Services": [2, 2],
    "Risk Analytics": [3, 3],
    "Data Mining": [3, 3]
}

df_team_skills = pd.DataFrame(team_skills_data)
df_team_skills.set_index("Name", inplace = True)

In [None]:
plt.figure(figsize = (8, 6))
heatmap = sns.heatmap(df_team_skills, annot = True, cmap = "YlGnBu", 
                    linewidths = 0.5, fmt = ".1f", vmin = 1, vmax = 5)

cbar = heatmap.collections[0].colorbar
cbar.set_ticks([1, 2, 3, 4, 5])
cbar.set_ticklabels(['Beginner', 'Basic', 'Intermediate', 'Advanced', 'Expert'])

plt.title("Team Skill Levels Heatmap", fontsize = 16)
plt.ylabel("Team Member", fontsize = 12)
plt.xlabel("Skills", fontsize = 12)
plt.xticks(rotation = 40, ha = 'right')
plt.yticks(rotation = 0, ha = 'right')
plt.tight_layout()
plt.show()

# **Team Skills vs. Industry Requirements**

To compare our team’s skills to industry requirements, we identified the most in-demand skills from IT job postings. We focused on the industry group `Computing Infrastructure Providers, Data Processing, Web Hosting, and Related Services`, as it closely aligns with our chosen career path.

The bar plot below illustrates the top 10 skills most in demand within IT job postings, providing insights into industry expectations.

In [None]:
it_jobs = job_postings[job_postings['NAICS_2022_6'] == 518210]

In [None]:
all_skills = []

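def parse_skills(skills_str):
    """Parse one SKILLS_NAME cell into a list of skill names."""
    if pd.isna(skills_str):
        return []
    try:
        # Try JSON first
        return json.loads(skills_str)
    except (TypeError, ValueError):
        pass
    try:
        # Fall back to Python-literal syntax (e.g. single-quoted strings)
        return ast.literal_eval(skills_str)
    except (TypeError, ValueError, SyntaxError):
        print(f"Warning: Could not parse skills: {skills_str}")
        return []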

for skills_str in it_jobs['SKILLS_NAME'].dropna():
    skills_list = parse_skills(skills_str)
    all_skills.extend(skills_list)

skill_counter = Counter(all_skills)
top_10_skills = skill_counter.most_common(10)

top_skills_df = pd.DataFrame(top_10_skills, columns = ['Skill', 'Count'])
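
The parse-then-count step can be illustrated on two hand-written strings, one in JSON syntax and one in Python-literal syntax. This is a minimal sketch: `parse_skill_list` is a hypothetical helper for the example, and the sample strings are made up rather than taken from the dataset.

```python
import ast
import json
from collections import Counter


def parse_skill_list(raw):
    """Parse a list stored as text, trying JSON first and a Python literal second."""
    try:
        return json.loads(raw)
    except (TypeError, ValueError):
        return ast.literal_eval(raw)


# Made-up cells: the first is valid JSON, the second only parses as a Python literal.
samples = ['["SQL", "Python"]', "['SQL', 'Tableau']"]

# Flatten all parsed lists and count how often each skill appears.
counts = Counter(skill for raw in samples for skill in parse_skill_list(raw))
print(counts.most_common(1))
```

Single-quoted strings are not valid JSON, which is why the fallback to `ast.literal_eval` is needed for the second sample.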

In [None]:
total_postings = len(it_jobs)
top_skills_df['Percentage'] = (top_skills_df['Count'] / total_postings * 100).round(1)
top_skills_df = top_skills_df.sort_values('Count', ascending = False)

plt.figure(figsize=(10, 6))
sns.barplot(
    x = 'Count', 
    y = 'Skill', 
    data = top_skills_df,
    hue = 'Skill',
    palette = "Blues_r"
)
plt.title("Top 10 Skills Required in IT Job Postings", fontsize = 16)
plt.xlabel("Number of Job Postings", fontsize = 12)
plt.ylabel("Skills", fontsize = 12)
plt.xlim(0, 500) 

for i, row in enumerate(top_skills_df.itertuples()):
    plt.text(
        row.Count + 5, 
        i, 
        f"{row.Count} ({row.Percentage}%)",
        va='center'
    )

plt.tight_layout()
plt.show()
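
With both the team scores and the in-demand skills available, the gap itself can be expressed as the distance between a target proficiency and the team average. The sketch below reuses three of our team scores. `TARGET_LEVEL` is a hypothetical benchmark chosen for illustration and is not derived from the job postings.

```python
import pandas as pd

# Team scores for three skills (1 = Beginner, 5 = Expert).
team = pd.DataFrame(
    {"Python": [3, 4], "SQL": [2, 3], "Microsoft Excel": [5, 5]},
    index=["Furong", "Marco"],
)

# Hypothetical target level for in-demand skills; an assumption for this sketch.
TARGET_LEVEL = 4

# Gap = target minus team average, floored at zero when the team exceeds the target.
gap = (TARGET_LEVEL - team.mean()).clip(lower=0).sort_values(ascending=False)
print(gap)
```

Sorting by gap size gives a simple learning priority list, here SQL first, then Python, with no gap for Microsoft Excel.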

---
title: Machine Learning Methods
subtitle: Clustering and Machine Learning Techniques for Job Market Trends Analysis
author:
  - name: Furong Wang
    affiliations:
      - ref: bu
  - name: Marco Perez Garcia
    affiliations:
      - ref: bu
affiliations:
  - id: bu
    name: Boston University
    city: Boston
    state: MA
bibliography: references.bib
csl: csl/econometrica.csl
format:
  html:
    toc: true
    number-sections: true
    df-print: paged
jupyter: python3
---

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from pyspark.sql import SparkSession
import plotly.io as pio
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler, StandardScaler
from pyspark.ml import Pipeline
from tabulate import tabulate
from IPython.display import HTML
from pyspark.sql import Window
from pyspark.sql import functions as F
from pyspark.sql.functions import col, pow, sqrt, abs, mean, avg, sum as spark_sum, round as spark_round, row_number
from pyspark.sql.types import DoubleType
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator
import plotly.graph_objects as go

In [None]:
plotly_layout = dict(
    font=dict(family="Arial", size=14),
    title_font=dict(size=20, family="Arial", color="black"),
    paper_bgcolor="white",
    plot_bgcolor="white",
    margin=dict(t=60, l=60, r=30, b=60),
    legend=dict(bordercolor="lightgray", borderwidth=1),
    xaxis=dict(
        title_font=dict(size=16, color="black"),  
        tickfont=dict(size=12, color="black"),    
        showgrid=True, 
        gridcolor="lightgray",
        showline=True,
        linecolor="black",
        linewidth=1,
        mirror=True,
        zeroline=True,
        zerolinecolor="gray",
        zerolinewidth=1
    ),
    yaxis=dict(
        title_font=dict(size=16, color="black"),  
        tickfont=dict(size=12, color="black"),    
        showgrid=True, 
        gridcolor="lightgray",
        showline=True,
        linecolor="black",
        linewidth=1,
        mirror=True,
        zeroline=True,
        zerolinecolor="gray",
        zerolinewidth=1
    ),
)

# **Unsupervised Learning: KMeans Clustering**

In this section, we used KMeans clustering to group job postings based on `Minimum Years of Experience` and `Salary`. Our goal was to find natural patterns in how job roles are distributed across different experience and salary ranges. After clustering, we used `NAICS6_NAME` industry classifications to interpret the types of industries represented in each cluster.

By doing so, we can better understand how different industries vary in their experience requirements and compensation levels, providing insights into salary structures across the job market.

In [None]:
spark = SparkSession.builder.appName("LightcastData").getOrCreate()

# Load Data
df = spark.read.option("header", "true").option("inferSchema", "true").option("multiLine","true").option("escape", "\"").csv("lightcast_job_postings.csv")

# Show Schema and Sample Data
# df.printSchema() 
# df.show(5)

# Register the DataFrame as a temporary SQL table
df.createOrReplaceTempView("job_postings")

In [None]:
df = df.dropna(subset=['MIN_YEARS_EXPERIENCE', 'SALARY', 'NAICS6_NAME'])
df = df.filter(df.NAICS6_NAME != 'Unclassified Industry')

df_casted = df.select(
    col("MIN_YEARS_EXPERIENCE").cast(DoubleType()),
    col("SALARY").cast(DoubleType()),
    col("NAICS6_NAME")
)

assembler = VectorAssembler(
    inputCols=["MIN_YEARS_EXPERIENCE", "SALARY"], 
    outputCol='features_unscaled'
)
df_features = assembler.transform(df_casted)

In [None]:
scaler = StandardScaler(
    inputCol='features_unscaled', 
    outputCol='features', 
    withMean=True, 
    withStd=True
)
scaler_model = scaler.fit(df_features)
df_scaled = scaler_model.transform(df_features)
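
KMeans relies on Euclidean distance, so without scaling, salary (tens of thousands of dollars) would swamp years of experience (single digits). The NumPy sketch below shows the z-score transformation on made-up rows. It assumes the scaler uses the sample standard deviation (`ddof=1`), which is our understanding of Spark's `StandardScaler`.

```python
import numpy as np

# Hypothetical (experience, salary) rows; values are made up.
X = np.array([[1.0, 60_000.0], [3.0, 90_000.0], [5.0, 120_000.0]])

# Center each column on its mean and divide by its sample standard deviation.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(X_scaled)
```

After scaling, both columns span the same range, so a two-year difference in experience and a 30,000 difference in salary contribute equally to the distance in this toy example.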

In [None]:
scores = []
ks = list(range(2, 9))

for k in ks:
    kmeans = KMeans().setK(k).setSeed(42).setFeaturesCol("features")
    model = kmeans.fit(df_scaled)
    transformed = model.transform(df_scaled)

    evaluator = ClusteringEvaluator(
        featuresCol="features", predictionCol="prediction", metricName="silhouette")
    score = evaluator.evaluate(transformed)
    scores.append(score)

fig = go.Figure()
fig.add_trace(go.Scatter(x=ks, y=scores, mode='lines+markers', name='Silhouette Score'))
fig.update_layout(title="Silhouette Score vs. k", 
                  xaxis_title="k",
                  yaxis_title="Score",
                  template="simple_white",
                  width=900)
fig.update_layout(**plotly_layout) 
fig.show()
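
Besides reading the plot, the candidate values of k can be ranked by score directly. The scores below are hypothetical placeholders, not the values from our run. In practice the highest score is only one input, alongside how interpretable the resulting clusters are.

```python
# Hypothetical silhouette scores for k = 2..8; not the values from our run.
ks_demo = list(range(2, 9))
scores_demo = [0.55, 0.52, 0.58, 0.49, 0.47, 0.45, 0.44]

# Rank candidate k values from highest to lowest silhouette score.
ranked = sorted(zip(ks_demo, scores_demo), key=lambda pair: pair[1], reverse=True)
best_k, best_score = ranked[0]
print(f"best k = {best_k} (silhouette = {best_score:.2f})")
```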

In [None]:
kmeans_final = KMeans(featuresCol='features', k=4, seed=688)
model_final = kmeans_final.fit(df_scaled)

predictions = model_final.transform(df_scaled)

In [None]:
industry_counts = predictions.groupBy('prediction', 'NAICS6_NAME').count()

window_spec = Window.partitionBy('prediction')
industry_counts = industry_counts.withColumn('total', spark_sum('count').over(window_spec))

industry_counts = industry_counts.withColumn('percentage', spark_round(col('count') / col('total') * 100, 2))

window_top5 = Window.partitionBy('prediction').orderBy(col('percentage').desc())
industry_top5 = industry_counts.withColumn('row_num', row_number().over(window_top5)).filter(col('row_num') <= 5)

industry_top5.orderBy('prediction', 'row_num').show(100, truncate=False)
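
The window logic above (count per cluster and industry, share within the cluster, then top N) can be cross-checked in pandas on a small sample. The sketch below uses made-up rows and placeholder industry labels, and keeps the top 2 instead of the top 5 to stay short.

```python
import pandas as pd

# Made-up cluster assignments with placeholder industry labels.
demo = pd.DataFrame({
    "prediction": [0, 0, 0, 0, 1, 1],
    "NAICS6_NAME": ["A", "A", "B", "C", "A", "B"],
})

# Postings per (cluster, industry) pair.
counts = demo.groupby(["prediction", "NAICS6_NAME"]).size().reset_index(name="count")

# Share of each industry within its cluster, in percent.
counts["percentage"] = (
    counts["count"] / counts.groupby("prediction")["count"].transform("sum") * 100
).round(2)

# Keep the two largest industries per cluster.
top = (
    counts.sort_values(["prediction", "percentage"], ascending=[True, False])
    .groupby("prediction")
    .head(2)
)
print(top)
```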

In [None]:
pandas_df = predictions.select('MIN_YEARS_EXPERIENCE', 'SALARY', 'prediction').toPandas()

pandas_df['Cluster_Name'] = pandas_df['prediction'].map({0: 'Cluster 1', 
                                                         1: 'Cluster 2', 
                                                         2: 'Cluster 3',
                                                         3: 'Cluster 4'})

fig = px.scatter(
    pandas_df, 
    x="MIN_YEARS_EXPERIENCE", 
    y="SALARY", 
    color="Cluster_Name", 
    title="K-Means Clustering on Job Postings Data", 
    labels={
        "MIN_YEARS_EXPERIENCE": "Minimum Years of Experience", 
        "SALARY": "Salary",
        "Cluster_Name": "Cluster"
    },
    category_orders={"Cluster_Name": ["Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4"]}
)

fig.update_layout(**plotly_layout,
                  width=800,
                  height=500) 
fig.show()

Based on the silhouette scores across candidate values of k and on practical analysis requirements, we chose k = 4, which yields four distinct clusters that capture major compensation patterns in the labor market.

**Here are key findings based on salary and experience trends:**

- **Cluster 1:**  
    - **Experience/Salary Pattern:** Requires higher minimum years of experience. Offers only moderate salary levels despite higher experience.  
    - **Top Industries:** Computer Systems Design Services, Administrative Management and General Management Consulting Services, and Custom Computer Programming Services.  
    - **Insight:** Jobs demanding significant prior experience but offering relatively moderate compensation. Indicates competitive markets in tech and consulting sectors.  
- **Cluster 2:**  
    - **Experience/Salary Pattern:** Consistently the highest salaries across a wide range of experience levels.  
    - **Top Industries:** Administrative Management and General Management Consulting Services, Web Search Portals and Other Information Services, and Commercial Banking.  
    - **Insight:** Reflects premium-paying roles in consulting, web services, and finance. This suggests opportunities for substantial earnings even with moderate experience.  
- **Cluster 3:**   
    - **Experience/Salary Pattern:** Requires lower years of experience. Salary levels are generally the lowest.  
    - **Top Industries:** Administrative Management and General Management Consulting Services, Employment Placement Agencies, and Direct Health and Medical Insurance Carriers.  
    - **Insight:** Entry-level or early-career roles in sectors with limited immediate salary growth.  
- **Cluster 4:**   
    - **Experience/Salary Pattern:** Moderate years of experience required. Salary levels are moderately high.  
    - **Top Industries:** Administrative Management and General Management Consulting Services, Employment Placement Agencies, and Commercial Banking.  
    - **Insight:** Steady career tracks offering good compensation for mid-experience professionals.  
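
Descriptions such as "higher experience, moderate salary" can be backed by a per-cluster summary in the original units. The sketch below assumes a table with the same columns as `pandas_df`; the rows are hypothetical.

```python
import pandas as pd

# Hypothetical rows with the same columns as `pandas_df`; values are made up.
demo = pd.DataFrame({
    "prediction": [0, 0, 1, 1],
    "MIN_YEARS_EXPERIENCE": [8.0, 10.0, 2.0, 4.0],
    "SALARY": [110_000.0, 130_000.0, 70_000.0, 90_000.0],
})

# Median experience, median salary, and size of each cluster.
profile = demo.groupby("prediction").agg(
    median_experience=("MIN_YEARS_EXPERIENCE", "median"),
    median_salary=("SALARY", "median"),
    postings=("SALARY", "size"),
)
print(profile)
```

Medians are used here because salary distributions tend to have a long right tail. Cluster sizes help judge how much weight each profile deserves.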

**Implications for Salary and Compensation Trends:**  
- Salary growth is not always linear with experience; certain clusters show salary plateaus despite increasing experience.  
- Industry effects are significant: sectors like Professional Services and Finance consistently appear across clusters, but compensation levels vary depending on experience requirements.  
- High-paying opportunities exist both at low and high experience levels, depending on industry and role specialization.  

**Implications for Job Seekers:**  
- **High Salary Aspirations:** Target roles in Cluster 2 industries like consulting, finance, and web services where premium salaries are achievable even with moderate experience.  

- **Career Launch:** Cluster 3 industries may provide easier entry points for new graduates but with lower starting salaries. In contrast, positions in Cluster 4 offer a good balance between experience investment and salary rewards.  

- **Beware of High-Experience/Moderate-Pay Sectors:** Cluster 1 jobs may require significant experience without corresponding salary premiums, requiring careful career planning.  

# **Supervised Learning: Random Forest Regression**

To deepen our analysis on Salary and Compensation Trends, we constructed a Random Forest Regression model using salary as the target variable. The goal of this model is to predict salary outcomes based on key job posting attributes, and to identify the relative importance of different factors influencing compensation in the labor market.

The predictor variables selected for the model include: `DURATION`, `MIN_YEARS_EXPERIENCE`, `LOT_V6_OCCUPATION_NAME`, `STATE_NAME`, `EMPLOYMENT_TYPE_NAME`

In [None]:
df_rf = df.dropna(subset=['DURATION', 'MIN_YEARS_EXPERIENCE',  
                          'LOT_V6_OCCUPATION_NAME', 'STATE_NAME', 'EMPLOYMENT_TYPE_NAME',
                          'SALARY'])

categorical_cols = ['LOT_V6_OCCUPATION_NAME', 'STATE_NAME', 'EMPLOYMENT_TYPE_NAME'] 
continuous_cols = ['DURATION', 'MIN_YEARS_EXPERIENCE'] 

# Index and One-Hot Encode
indexers = [StringIndexer(inputCol=col, outputCol=f"{col}_idx", handleInvalid='skip') for col in categorical_cols]
encoders = [OneHotEncoder(inputCol=f"{col}_idx", outputCol=f"{col}_vec") for col in categorical_cols]

# Assemble base features 
assembler = VectorAssembler(
    inputCols=continuous_cols 
    + [f"{col}_vec" for col in categorical_cols], 
    outputCol="features"
)

# Build pipeline and transform
pipeline = Pipeline(stages=indexers + encoders + [assembler]) 
data = pipeline.fit(df_rf).transform(df_rf)

# Show final structure
data.select("SALARY", "features").show(5, truncate=False)

In [None]:
train_data, test_data = data.randomSplit([0.8, 0.2], seed=688)

rf = RandomForestRegressor(featuresCol="features",
                           labelCol="SALARY", 
                           numTrees=150,
                           maxDepth=9, 
                           seed=688 
                           )

# Train model
rf_model = rf.fit(train_data.select("SALARY", "features"))

# Sanity-check predictions on the training set; test-set predictions are generated in the evaluation cell
rf_preds = rf_model.transform(train_data.select("SALARY", "features"))

In [None]:
# Extract feature importances
def get_actual_feature_names(df_rf, assembler, encoded_cols):
    full_feature_names = []

    for col_name in assembler.getInputCols():
        if col_name in encoded_cols:
            try:
                attr_meta = df_rf.schema[col_name].metadata['ml_attr']['attrs']
                for attr_group in attr_meta.values():
                    for attr in attr_group:
                        full_feature_names.append(attr['name'])
            except (KeyError, TypeError):
                # No ML attribute metadata available: fall back to the column name
                full_feature_names.append(col_name)
        else:
            full_feature_names.append(col_name)

    return full_feature_names

encoded_cols = [f"{col}_vec" for col in categorical_cols] 
feature_names = get_actual_feature_names(data, assembler, encoded_cols)
importances = rf_model.featureImportances.toArray() 

In [None]:
def clean_feature_names(feature_list):
    clean_names = []
    for name in feature_list:
        if isinstance(name, list):
            clean_names.append(", ".join(str(n) for n in name))
        elif isinstance(name, str) and name.startswith("["):
            clean_names.append(name.replace("[", "").replace("]", "").replace("'", "").replace('"', '').strip())
        else:
            clean_names.append(str(name))
    return clean_names

# Build dataframe
importance_df = pd.DataFrame({
    "Feature": feature_names,
    "Importance": importances
}).sort_values(by="Importance", ascending=False)

importance_df["Feature"] = clean_feature_names(importance_df["Feature"])
top_importance_df = importance_df.head(15)

# Plot
plt.figure(figsize=(9, 6))
sns.barplot(
    x="Importance",
    y="Feature",
    data=top_importance_df,
    hue="Feature",
    palette="viridis"
)

import textwrap
labels = plt.gca().get_yticklabels()
new_labels = [textwrap.fill(label.get_text(), width=30) for label in labels]
plt.yticks(range(len(new_labels)), new_labels, fontsize=9)
plt.xticks(fontsize=10)

plt.title("Top 15 Feature Importances from Random Forest Model", fontsize=14, fontweight='bold')
plt.xlabel("Importance", fontsize=12, fontweight='bold')
plt.ylabel("Feature", fontsize=12, fontweight='bold')
plt.tight_layout()
plt.show()

In [None]:
evaluator_r2 = RegressionEvaluator(labelCol="SALARY", predictionCol="prediction", metricName="r2")

rf_preds = rf_model.transform(test_data)
rf_residuals = rf_preds.select(
    col("SALARY"),
    col("prediction"),
    (col("SALARY") - col("prediction")).alias("residual")
)

rf_r2   = evaluator_r2.evaluate(rf_preds) 
rf_rmse = np.sqrt(rf_residuals.select(avg(pow(col("residual"), 2))).first()[0]) 
rf_aic  = None
rf_bic  = None

rf_pdf = rf_residuals.select("SALARY", col("prediction").alias("RandomForest")).toPandas()
rf_df = pd.DataFrame({"SALARY": rf_pdf["SALARY"], "RandomForest": rf_pdf["RandomForest"]})

In [None]:
sns.set(style="whitegrid")

models = {"RandomForest": (rf_rmse, rf_r2, "NA", "NA")}

model_dfs = {"RandomForest": rf_df}

# One panel per model, so the figure height scales with the number of models
plt.figure(figsize=(7, 6 * len(models)))

for idx, (model_name, (rmse, r2, aic, bic)) in enumerate(models.items(), 1):
    plt.subplot(len(models), 1, idx)
    
    model_data = model_dfs[model_name]
    
    sns.scatterplot(x="SALARY", y=model_name, data=model_data, alpha=0.5, label=model_name)
    
    x_min = model_data["SALARY"].min()
    x_max = model_data["SALARY"].max()
    plt.plot([x_min, x_max], [x_min, x_max], 'r-', label="Ideal Fit")
    
    if aic != "NA" and bic != "NA":
        plt.title(f"{model_name} Prediction\nRMSE={rmse:.1f} | R²={r2:.3f} | AIC={aic:.1f} | BIC={bic:.1f}", fontweight="bold")
    else:
        plt.title(f"{model_name} Prediction\nRMSE={rmse:.1f} | R²={r2:.3f} | AIC=NA | BIC=NA", fontweight="bold")
    
    plt.xlabel("Actual Salary", fontweight="bold")
    plt.ylabel("Predicted Salary", fontweight="bold")
    plt.legend()

plt.tight_layout()
plt.show()

**Model Evaluation:**  
- The R-squared of 0.454 indicates moderate predictive power: the model explains roughly 45% of the variance in salary on the held-out test set, leaving room for improvement.  
- The scatterplot shows that most predictions are reasonably aligned with the actual salaries but tend to underpredict higher salary values (especially above $200,000), which is common due to the small number of very high salaries ("long tail" effect).  
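
For reference, the two metrics reported above follow directly from the residuals: RMSE is the square root of the mean squared residual, and R-squared is one minus the ratio of residual to total sum of squares. The sketch below uses made-up salaries (in thousands) rather than our model output.

```python
import math

# Made-up actual and predicted salaries, in thousands of dollars.
actual = [100.0, 150.0, 200.0]
predicted = [110.0, 150.0, 190.0]

# Residual sum of squares.
residuals = [a - p for a, p in zip(actual, predicted)]
ss_res = sum(r ** 2 for r in residuals)

# Total sum of squares around the mean of the actual values.
mean_actual = sum(actual) / len(actual)
ss_tot = sum((a - mean_actual) ** 2 for a in actual)

rmse = math.sqrt(ss_res / len(actual))
r2 = 1 - ss_res / ss_tot
print(f"RMSE = {rmse:.2f}, R-squared = {r2:.2f}")
```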

**Feature Importance Analysis:**  
- The top 15 feature importances from the Random Forest model show how different factors contribute to salary predictions. In our model, minimum years of experience is by far the strongest predictor of salary. Certain occupation titles also strongly affect salary expectations. Interestingly, job posting duration is also an important factor in salary prediction, which may be related to stable roles.  
- On the other hand, geographic location and job type have less impact on salary predictions, with states like Oregon and California showing specific salary patterns.  
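
Because one-hot encoding spreads each categorical variable over many columns, comparing variables as a whole is easier after summing importances by source variable. The sketch below is illustrative: the importances are hypothetical, and the `Source` labels are assigned by hand rather than extracted from the pipeline.

```python
import pandas as pd

# Hypothetical importances; the Source column is assigned by hand for this sketch.
demo = pd.DataFrame({
    "Feature": ["MIN_YEARS_EXPERIENCE", "DURATION", "California", "Oregon", "Data Analyst"],
    "Source": ["MIN_YEARS_EXPERIENCE", "DURATION", "STATE_NAME", "STATE_NAME",
               "LOT_V6_OCCUPATION_NAME"],
    "Importance": [0.5, 0.1, 0.05, 0.05, 0.3],
})

# Total importance per original variable, largest first.
by_source = demo.groupby("Source")["Importance"].sum().sort_values(ascending=False)
print(by_source)
```

A variable with many low-importance dummy columns can still matter in aggregate, so this view complements the top-15 bar chart.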

**Implications for Job Seekers:**  
- **Experience Pays Off:** The model shows that minimum years of experience is the dominant factor influencing salary. For job seekers, gaining and accurately showcasing professional experience is crucial to achieving higher salary outcomes. Moreover, investing in data-related skills can be a smart career move.  
- **Occupation Choice Matters:** Specific technical roles (especially Computer Systems Engineers, Data Analysts, and Business Intelligence Analysts) are associated with higher salaries. Choosing high-demand, specialized roles can significantly improve salary prospects.  
- **Location Strategy:** While experience and occupation dominate, geography still plays a role: states such as Oregon, California, and New York affect salary expectations. Job seekers willing to relocate or negotiate for remote work with companies based in higher-paying states may gain salary advantages.
