## Which data roles are the most accessible, and when is the best time to apply?

This section explores the demand for various data roles and includes the following analyses:
- The most in-demand Data Roles worldwide and the number of skills required for each of them
- Top-3 Data Roles in Europe, the United States, and other countries
- Job posting trends for the Top-3 Data Roles

### Import Libraries

In [1]:
from pathlib import Path

import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

### Load Cleaned Dataset

In [2]:
df = pd.read_csv(Path.cwd().parents[1] / 'Raw_Data' / 'df_Final.csv')

### The most in-demand Data Roles worldwide and the number of skills required for each of them
To understand which data roles are easiest to break into, we first identify the roles most in demand globally. A higher number of job postings means more available openings, which we treat here as a rough proxy for a lower barrier to entry.

At the same time, the average number of skills listed in job descriptions gives a sense of role complexity and, in turn, of how easy a role is to enter.
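
How `job_skills` is stored is not shown in this section. As a minimal sketch, assuming each cell is either a stringified Python list or a plain comma-separated string, a helper like the hypothetical `count_skills` below counts skills while also handling empty lists and missing values. The toy rows are made up for illustration and are not taken from the dataset.

```python
import ast

import pandas as pd


def count_skills(raw):
    """Count the skills in one job_skills cell (hypothetical helper)."""
    if pd.isna(raw):
        return 0
    try:
        # Works when the cell looks like "['sql', 'excel']"
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        parsed = None
    if isinstance(parsed, (list, tuple)):
        items = parsed
    else:
        # Fallback: treat the cell as a plain comma-separated string
        items = raw.strip("[]").split(",")
    # Ignore empty entries such as those produced by "" or "[]"
    return sum(1 for item in items if str(item).strip(" '\""))


# Made-up toy rows, for illustration only
toy = pd.DataFrame({
    "job_title_short": ["Data Analyst", "Data Analyst", "Data Engineer"],
    "job_skills": [
        "['sql', 'excel']",
        "sql, excel, tableau, python",
        "['python', 'sql', 'aws', 'spark']",
    ],
})
toy["num_skills"] = toy["job_skills"].apply(count_skills)
avg_skills_toy = toy.groupby("job_title_short")["num_skills"].mean()
print(avg_skills_toy)
```

Keeping the mean unrounded at this stage makes small differences between roles visible before they are rounded for display.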

In [3]:
# Top 10 most common job titles
top_titles = (
    df['job_title_short']
       .value_counts()
       .head(10)
       .rename_axis('job_title')
       .reset_index(name='count')
       .sort_values(by='count', ascending=True)
)

# Prepare avg skills data aligned with top_titles
df_skills = df.dropna(subset=['job_skills']).copy()
df_skills['num_skills'] = df_skills['job_skills'].apply(lambda x: len(x.split(',')))

avg_skills = (
    df_skills[df_skills['job_title_short'].isin(top_titles['job_title'])]
    .groupby('job_title_short')['num_skills']
    .mean()
    .round(0)
    .astype(int)
    .reset_index(name='avg_skills')
)

# Merge avg skills with top_titles so order matches
top_titles = top_titles.merge(avg_skills, left_on='job_title', right_on='job_title_short')

fig = px.bar(
    top_titles,
    x='count',
    y='job_title',
    orientation='h',
    title='The Most In-Demand Data Roles Worldwide',
    labels={'count': 'Number of Job Postings', 'job_title': ''}
)

fig.update_traces(text=top_titles['count'], textposition='outside', marker_color='steelblue')
fig.update_layout(margin=dict(t=60, l=150, r=150)) 

# Add title annotation for avg skills
fig.add_annotation(
    xref='paper',
    yref='paper',
    x=1.17,
    y=1.17,
    text='Average number<br>of skills required',
    showarrow=False,
    font=dict(color='black', size=14),
    align='center'
)

# Add avg skills as annotations aligned horizontally to right of plot
for idx, row in top_titles.iterrows():
    fig.add_annotation(
        xref='paper',  
        y=row['job_title'],
        x=1.1,  
        text=str(row['avg_skills']),
        showarrow=False,
        font=dict(color='white', size=14),
        bgcolor='steelblue',
        bordercolor='darkgrey',
        borderwidth=1,
        borderpad=4,
        align='center',
        yanchor='middle'
    )

fig.show()

The most popular data roles, **Data Scientist**, **Data Analyst**, and **Data Engineer**, stand out with significantly more job postings compared to others. At the same time, the lower number of skills required for **Data Analysts** suggests it is a more accessible **entry point** into the data field, making it a practical **base role for further analysis**.

In [4]:
job_colors = {
    'Data Analyst': '#4c78a8',
    'Data Scientist': '#6baed6',
    'Data Engineer': '#9ecae1'
}

### Top-3 Data Roles in Europe, the United States, and other countries

Now, let’s break down job postings by location and see which of the top three data roles is the most popular in **Europe**. Comparing it with the **US**, which often leads the global data job market, can hint at **emerging demand patterns** and at which roles may grow in popularity across Europe in the near future.

In [5]:
regions = ['EU', 'US', 'Other']
top3_global = df['job_title_short'].value_counts().nlargest(3).index.tolist()

# Subplots setup
fig = make_subplots(
    rows=1,
    cols=3,
    specs=[[{'type': 'domain'}, {'type': 'domain'}, {'type': 'domain'}]],
    column_widths=[0.3, 0.3, 0.3],
    horizontal_spacing=0.1
)

# Adjusted positions for perfect visual centering
title_positions = [0.12, 0.5, 0.9]

for i, (region, x_pos) in enumerate(zip(regions, title_positions)):
    region_data = df[df['region_group'] == region]['job_title_short']
    
    # Count for top 3 roles only
    counts = {job: region_data[region_data == job].count() for job in top3_global}
    labels = [job for job in top3_global if counts[job] > 0]
    values = [counts[job] for job in labels]
    colors = [job_colors[job] for job in labels]

    fig.add_trace(
        go.Pie(
            labels=labels,
            values=values,
            hole=0.5,
            marker=dict(colors=colors, line=dict(color='white', width=2)),
            textinfo='percent',
            hoverinfo='none',
            showlegend=(i == 0)
        ),
        row=1, col=i+1
    )

    fig.add_annotation(
        text=region,
        x=x_pos,
        y=1.03,
        showarrow=False,
        font_size=14
    )

# Layout
fig.update_layout(
    title_text='Top-3 Data Roles by Region',
    height=340,
    width=950,
    margin=dict(t=50, b=30, r=40, l=40),
)

fig.show()

The most popular data role in Europe is **Data Analyst**, which suggests a regional focus on business intelligence and operational insights. In contrast, **Data Scientist** ranks second in the EU but is the most in-demand role in the US, which may reflect a stronger emphasis on advanced analytics and machine learning. Meanwhile, **Data Engineer** leads in other countries, possibly because of the need to build data infrastructure as a foundation for future analytics capabilities.
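
To read the regional ranking as numbers rather than from the pie charts, the shares can be tabulated directly. The sketch below uses `pd.crosstab` with `normalize="index"` on made-up toy rows; the column names `region_group` and `job_title_short` follow the notebook, while the values are purely illustrative and not taken from the dataset.

```python
import pandas as pd

# Made-up toy rows, for illustration only
toy = pd.DataFrame({
    "region_group": ["EU", "EU", "EU",
                     "US", "US", "US", "US",
                     "Other", "Other", "Other"],
    "job_title_short": ["Data Analyst", "Data Analyst", "Data Scientist",
                        "Data Scientist", "Data Scientist", "Data Analyst", "Data Engineer",
                        "Data Engineer", "Data Engineer", "Data Analyst"],
})

top3 = ["Data Analyst", "Data Scientist", "Data Engineer"]
subset = toy[toy["job_title_short"].isin(top3)]

# Share of each role within a region, in percent (each row sums to about 100)
shares = (
    pd.crosstab(subset["region_group"], subset["job_title_short"], normalize="index")
    .mul(100)
    .round(1)
)

# Role with the largest share in each region
leading_role = shares.idxmax(axis=1)

print(shares)
print(leading_role)
```

Note that `idxmax` returns the first column in case of a tie, so near-equal shares are worth checking in the table itself.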

Based on these insights, the next steps of the analysis focus on the three most popular data roles (Data Analyst, Data Scientist, and Data Engineer), as they represent the most attainable and strategic paths for job seekers.

### Job posting trends for the Top-3 Data Roles
Next, we explore the **monthly trends** for the top three data roles. This helps identify **when hiring peaks** during the year, which is useful timing knowledge for job seekers aiming to land a role. It also shows how **consistent the demand** for each role is over time.

In [6]:
# Set top 3 roles and extract month
top3_roles = ['Data Analyst', 'Data Scientist', 'Data Engineer']
df_top3 = df[df['job_title_short'].isin(top3_roles)].copy()
df_top3['job_posted_date'] = pd.to_datetime(df_top3['job_posted_date'])
df_top3['month'] = df_top3['job_posted_date'].dt.to_period('M').astype(str)

# Group by month and role
monthly_counts = (
    df_top3.groupby(['month', 'job_title_short'])
    .size()
    .reset_index(name='count')
)

# Pivot for plotting
df_pivot = monthly_counts.pivot(index='month', columns='job_title_short', values='count').fillna(0)
df_pivot = df_pivot.sort_index()

# Plot
fig = go.Figure()

for role in top3_roles:
    fig.add_trace(go.Scatter(
        x=df_pivot.index,
        y=df_pivot[role],
        mode='lines+markers',
        name=role,
        line=dict(color=job_colors[role], width=4),
        marker=dict(size=9)
    ))

# Layout and styling
fig.update_layout(
    title='Job Postings Trend for Top-3 Data Roles',
    xaxis_title='',
    yaxis_title='Number of Job Postings',
    height=400,
    width=950,
    template='plotly_white',
    margin=dict(t=60, b=40, l=60, r=40),
    legend_title='Job Title'
)

fig.show()

The number of job postings for **Data Analyst** and **Data Scientist** roles peaks in **January**, declines over the following months, then rises again between **June and August**. This pattern suggests a **stable hiring trend** that may be aligned with the start of the financial year and mid-year budget reviews.

For job seekers, this implies that **January and July are strategic times to apply**.
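
The peak months can also be read off programmatically instead of from the chart. The following is a minimal sketch on made-up toy dates (not the dataset's values): it builds the same month-by-role table as the notebook and uses `idxmax` to find each role's busiest month.

```python
import pandas as pd

# Made-up toy postings, for illustration only
toy = pd.DataFrame({
    "job_title_short": ["Data Analyst"] * 4 + ["Data Scientist"] * 3,
    "job_posted_date": [
        "2023-01-05", "2023-01-20", "2023-03-11", "2023-07-02",
        "2023-01-09", "2023-08-15", "2023-08-30",
    ],
})
toy["job_posted_date"] = pd.to_datetime(toy["job_posted_date"])
toy["month"] = toy["job_posted_date"].dt.to_period("M").astype(str)

# Month x role table of posting counts
monthly = (
    toy.groupby(["month", "job_title_short"])
    .size()
    .unstack(fill_value=0)
    .sort_index()
)

# Busiest month per role
peak_month = monthly.idxmax()

# Each month's count relative to that role's peak (1.0 = peak month)
share_of_peak = monthly.div(monthly.max()).round(2)

print(peak_month)
print(share_of_peak)
```

Looking at `share_of_peak` alongside `peak_month` helps judge whether a secondary rise, such as a mid-year one, is close to the main peak or only a small bump.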

### Summary

**Data Analyst**, **Data Scientist**, and **Data Engineer** are the three most in-demand data roles in the global job market, and **Data Analyst** is the most popular role in Europe. Meanwhile, the higher demand for **Data Scientists** in the US suggests that this field may continue to expand globally, offering future opportunities.

The relatively lower skill count for **Data Analysts** suggests it is a more **accessible entry point** into the data field, making it a practical base role for further analysis.

Hiring activity for these top roles peaks in **January** and **mid-summer** (June–August), which may align with annual and mid-year planning cycles and is a valuable timing insight for job seekers.