# Overview

In this section, we examine job postings across Business Analytics, Data Science, and Machine Learning roles to identify key patterns before modeling. We explore salary, experience, remote work trends, and skill demand to guide the regression and classification stages.

We focus on:
- Job distribution across industries and roles
- Salary variation by experience and remote type
- Skill demand frequency and specialization
- Emerging data science and analytics trends

---

# 1. Salary, Experience, and Skill Patterns


In [None]:
import pandas as pd
import plotly.express as px

# Load the cleaned Lightcast job postings used throughout this section
df = pd.read_csv("data/lightcast_cleaned.csv")

fig = px.histogram(df, x="Average_Salary", nbins=40,
                   color_discrete_sequence=["#187145"])
fig.update_layout(
    title="Distribution of Average Salaries",
    xaxis_title="Average Salary ($)",
    yaxis_title="Number of Job Postings",
    template="plotly_white"
)
fig.show()

### Salary Distribution and Outliers
Salaries are right-skewed: most roles pay under \$150K, while a small number of senior and ML-focused positions exceed \$200K. These postings form a long upper tail and show up as outliers in the histogram.
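The skew and the outliers described above can also be checked numerically instead of by eye. The following is a minimal sketch, not part of the original analysis: it computes the moment-based skewness and flags values above the upper Tukey fence (Q3 + 1.5 × IQR). The helper name `skew_and_upper_outliers` and the toy salaries are hypothetical, and the code works on a plain list so that it runs without the dataset.

```python
import statistics


def skew_and_upper_outliers(values):
    """Return (moment skewness, upper Tukey fence, values above the fence)."""
    mean = statistics.fmean(values)
    # Population central moments of order 2 and 3
    m2 = sum((v - mean) ** 2 for v in values) / len(values)
    m3 = sum((v - mean) ** 3 for v in values) / len(values)
    skewness = m3 / m2 ** 1.5  # > 0 indicates a longer right tail

    # Quartiles with linear interpolation between data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    upper_fence = q3 + 1.5 * (q3 - q1)
    outliers = [v for v in values if v > upper_fence]
    return skewness, upper_fence, outliers


# Made-up salaries in dollars, for illustration only (not from the dataset)
toy_salaries = [60_000, 70_000, 80_000, 90_000, 100_000, 110_000, 120_000, 250_000]
skewness, upper_fence, outliers = skew_and_upper_outliers(toy_salaries)
print(f"skewness={skewness:.2f}, upper fence={upper_fence:,.0f}, outliers={outliers}")
```

With the notebook's data, the input would be something like `df["Average_Salary"].dropna().tolist()`.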


In [None]:
fig = px.scatter(
    df, 
    x="MIN_YEARS_EXPERIENCE", 
    y="Average_Salary",
    color="REMOTE_GROUP",
    trendline="ols",  # requires statsmodels; fits one OLS line per REMOTE_GROUP
    title="Salary vs. Minimum Experience by Remote Type"
)
fig.update_layout(template="plotly_white",
                  xaxis_title="Minimum Years of Experience",
                  yaxis_title="Average Salary ($)")
fig.show()

### Salary vs Experience
Salary is positively correlated with minimum required experience. The relationship is most pronounced for hybrid and remote roles, which suggests that senior or flexible positions tend to be better compensated.
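One way to compare how steeply salary rises with experience in each remote category is to compute a least-squares slope per group, which is what the per-color trendlines in the scatter plot represent. This is a hedged sketch on plain tuples. The helpers `ols_slope` and `slope_by_group` and the toy rows are hypothetical and are not part of the notebook.

```python
from collections import defaultdict


def ols_slope(xs, ys):
    """Least-squares slope of y on x, cov(x, y) / var(x); x must not be constant."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    return sxy / sxx


def slope_by_group(rows):
    """rows: iterable of (group, years_experience, salary) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for group, years, salary in rows:
        grouped[group][0].append(years)
        grouped[group][1].append(salary)
    return {group: ols_slope(xs, ys) for group, (xs, ys) in grouped.items()}


# Made-up rows for illustration only (not from the dataset)
toy_rows = [
    ("Remote", 0, 50_000), ("Remote", 2, 70_000), ("Remote", 4, 90_000),
    ("Onsite", 0, 50_000), ("Onsite", 2, 60_000), ("Onsite", 4, 70_000),
]
slopes = slope_by_group(toy_rows)
print(slopes)  # dollars of salary per additional year of required experience
```

A larger slope for one group means each additional year of required experience is associated with a bigger salary increase in that group.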


In [None]:
fig = px.box(df, 
             x="ROLE_GROUP", 
             y="Average_Salary", 
             color="ROLE_GROUP",
             color_discrete_sequence=["#187145", "#45A274", "#79C99E"],
             title="Salary Comparison Across Role Categories")
fig.update_layout(template="plotly_white", xaxis_title="Role Group",
                  yaxis_title="Average Salary ($)")
fig.show()

### Salary by Role Category (BA / DS / ML)
Machine Learning roles have the highest median salaries, followed by Data Science and then Business Analytics. This ordering supports using `ROLE_GROUP` as a key feature in regression modeling.
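To enter a regression, a categorical column such as `ROLE_GROUP` has to be turned into numeric indicators. The sketch below shows one common approach, one-hot encoding with a reference level left out so that the indicators are not perfectly collinear with the intercept. The labels `BA`, `DS`, and `ML` are assumed from the heading above and may not match the values actually stored in the column. The `one_hot` helper is hypothetical.

```python
def one_hot(values, baseline, prefix="ROLE_GROUP"):
    """Encode categories as 0/1 columns, omitting `baseline` as the reference level."""
    levels = sorted(set(values) - {baseline})
    return [
        {f"{prefix}_{level}": int(value == level) for level in levels}
        for value in values
    ]


# Assumed role labels; the rows themselves are made up for illustration
toy_roles = ["BA", "DS", "ML", "DS"]
encoded = one_hot(toy_roles, baseline="BA")
for role, row in zip(toy_roles, encoded):
    print(role, row)
```

In pandas, `pd.get_dummies(df["ROLE_GROUP"], drop_first=True)` produces a comparable encoding, except that it drops the first level rather than a named baseline.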


In [None]:
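# Aggregate to one row per state so each state is colored by its mean salary
state_salary = df.groupby("STATE", as_index=False)["Average_Salary"].mean()

fig = px.choropleth(
    state_salary,
    locations="STATE",  # "USA-states" expects two-letter abbreviations such as "CA"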
    locationmode="USA-states",
    color="Average_Salary",
    scope="usa",
    color_continuous_scale="Greens",
    title="Average Salary by U.S. State"
)
fig.show()

### Geographic Trends
Coastal states (CA, MA, NY) show higher average salaries, which is consistent with the concentration of the tech industry there. These differences motivate including a geographic feature, such as state or region, in the models.


In [None]:
fig = px.box(df, x="REMOTE_GROUP", y="Average_Salary",
             color="REMOTE_GROUP",
             color_discrete_sequence=["#45A274", "#79C99E", "#A7D9C9"],
             title="Salary Distribution by Remote Work Type")
fig.update_layout(template="plotly_white", xaxis_title="Remote Type",
                  yaxis_title="Average Salary ($)")
fig.show()

### Remote Work vs Salary Trends
Remote roles tend to offer higher median salaries than onsite jobs, consistent with the broader shift toward remote and hybrid work in data-focused fields.
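A difference between remote and onsite medians can partly reflect the mix of roles in each category rather than remote status itself. A simple check is to compare medians within each role group. The sketch below assumes rows of `(role_group, remote_group, salary)`. The helper `median_salary_by` and the toy rows are hypothetical.

```python
from collections import defaultdict
from statistics import median


def median_salary_by(rows):
    """rows: iterable of (role_group, remote_group, salary) tuples.

    Returns a dict mapping (role_group, remote_group) to the median salary.
    """
    buckets = defaultdict(list)
    for role, remote, salary in rows:
        buckets[(role, remote)].append(salary)
    return {key: median(salaries) for key, salaries in buckets.items()}


# Made-up rows for illustration only (not from the dataset)
toy_rows = [
    ("DS", "Remote", 120_000), ("DS", "Remote", 130_000), ("DS", "Onsite", 110_000),
    ("BA", "Remote", 90_000), ("BA", "Onsite", 80_000), ("BA", "Onsite", 84_000),
]
medians = median_salary_by(toy_rows)
for (role, remote), value in sorted(medians.items()):
    print(f"{role:<3} {remote:<7} {value:>10,.0f}")
```

If the remote median is higher within every role group, the gap is less likely to be explained by role mix alone.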


In [None]:
from collections import Counter

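# Split each comma-separated TOP_SKILLS entry into trimmed, non-empty skill names
skills_flat = [
    s.strip()
    for sublist in df['TOP_SKILLS'].dropna().str.split(',')
    for s in sublist
    if s.strip()
]
# Keep the 15 most frequent skills
skill_counts = pd.DataFrame(Counter(skills_flat).most_common(15), columns=['Skill', 'Count'])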

fig = px.bar(skill_counts, x='Skill', y='Count', 
             title="Most Frequent Skills in Job Postings",
             color_discrete_sequence=["#187145"])
fig.update_layout(template="plotly_white", xaxis_title="Skill", yaxis_title="Count")
fig.show()

### Top Skills Frequency
Python, SQL, and Machine Learning are the most frequently listed skills, which marks them as core requirements across roles and as natural candidates for skill-based features in the modeling phase.
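If skills are to feed into a model, one option is to turn the most frequent ones into 0/1 indicator columns per posting. The following is a minimal sketch of that idea under the same comma-separated format as `TOP_SKILLS`. The helpers `split_skills` and `skill_indicators`, the `top_k` parameter, and the toy postings are hypothetical.

```python
from collections import Counter


def split_skills(cell):
    """Split one comma-separated skills string into trimmed, non-empty names."""
    return [s.strip() for s in cell.split(",") if s.strip()]


def skill_indicators(cells, top_k):
    """Return (top_skills, rows), where each row holds a 0/1 flag per top skill."""
    parsed = [split_skills(cell) for cell in cells]
    counts = Counter(skill for skills in parsed for skill in skills)
    top_skills = [skill for skill, _ in counts.most_common(top_k)]
    rows = [{skill: int(skill in skills) for skill in top_skills} for skills in parsed]
    return top_skills, rows


# Made-up postings for illustration only (not from the dataset)
toy_cells = ["Python, SQL", "SQL, Tableau", "Python, SQL, Machine Learning"]
top_skills, rows = skill_indicators(toy_cells, top_k=2)
print(top_skills)
print(rows)
```

Restricting the indicators to the top `k` skills keeps the feature count manageable. The right value of `k` is a modeling choice that this sketch does not settle.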


In [None]:
import plotly.figure_factory as ff

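# Pairwise Pearson correlations between salary and the experience bounds
corr = df[["Average_Salary","MIN_YEARS_EXPERIENCE","MAX_YEARS_EXPERIENCE"]].corr()
fig = ff.create_annotated_heatmap(
    z=corr.values, 
    x=corr.columns.tolist(), 
    y=corr.columns.tolist(), 
    annotation_text=corr.round(2).values,  # label cells with two decimals, not full floats
    colorscale='Greens', showscale=True)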
fig.update_layout(title="Feature Correlation Matrix", template="plotly_white")
fig.show()

### Correlation Heatmap
Salary correlates moderately with both minimum and maximum experience, which supports including them in regression modeling.
We will also rely on encoded categorical features (e.g., `REMOTE_GROUP`, `ROLE_GROUP`) to capture variation that experience alone does not explain.
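
For reference, `DataFrame.corr()` computes the Pearson coefficient by default. Writing $s_i$ for the salary and $e_i$ for an experience value of posting $i$ (notation introduced here for illustration, not taken from the notebook), the coefficient over $n$ postings is:

```latex
r_{se} = \frac{\sum_{i=1}^{n} (s_i - \bar{s})(e_i - \bar{e})}
              {\sqrt{\sum_{i=1}^{n} (s_i - \bar{s})^2}\,\sqrt{\sum_{i=1}^{n} (e_i - \bar{e})^2}}
```

Because $r_{se}$ only measures linear association, a moderate value does not rule out a stronger nonlinear relationship between salary and experience.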