# How can you maximize your salary as a Data Analyst? 
- Which skills are most strongly correlated with higher pay?
- Which type of job is better to search for: remote or in-office?
- Do you need a specific degree to earn a higher salary?
- When is the best time to search for a job to get a higher salary?

### Note: correlations are currently calculated globally, across all countries and job titles; this needs review

In [39]:
import pandas as pd
#import plotly.express as px

#import matplotlib.pyplot as plt
#import seaborn as sns

import numpy as np

import plotly.graph_objects as go

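df = pd.read_csv('/Users/kolesnikevgenia/Documents/Python_Projects/Job_Skills/Raw_Data/df_Final.csv')

# Parse posting dates so the .dt accessor used in later cells works
# (read_csv loads date columns as plain strings by default)
df['job_posted_date'] = pd.to_datetime(df['job_posted_date'])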

### Remove Salary Outliers

In [40]:
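# Keep Data Analyst postings only; the EU region filter is currently disabled,
# so despite its name df_eu_ds covers all countries
df_eu_ds = df[
    #(df['region_group'] == 'EU') &
    (df['job_title_short'] == 'Data Analyst')
]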

#Calculate global IQR
Q1_global = df_eu_ds['salary_month_avg_eur'].quantile(0.25)
Q3_global = df_eu_ds['salary_month_avg_eur'].quantile(0.75)
IQR_global = Q3_global - Q1_global
lower_bound_global = Q1_global - 1.5 * IQR_global
upper_bound_global = Q3_global + 1.5 * IQR_global

#Build list to collect filtered results
filtered_groups = []

# Filter each country separately with its own IQR bounds;
# groups with fewer than 5 postings fall back to the global bounds
for (country, title), group in df_eu_ds.groupby(['job_country', 'job_title_short']):
    if len(group) >= 5:
        Q1 = group['salary_month_avg_eur'].quantile(0.25)
        Q3 = group['salary_month_avg_eur'].quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - 1.5 * IQR
        upper = Q3 + 1.5 * IQR
    else:
        lower = lower_bound_global
        upper = upper_bound_global

    filtered = group[
        (group['salary_month_avg_eur'] >= lower) &
        (group['salary_month_avg_eur'] <= upper)
    ]
    filtered_groups.append(filtered)

# Combine all groups back
df_filtered = pd.concat(filtered_groups, ignore_index=True)

print(f"Original dataset size: {len(df_eu_ds)}")
print(f"Filtered dataset size: {len(df_filtered)}")

Original dataset size: 5451
Filtered dataset size: 5324
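
The same per-group IQR rule can also be written without an explicit loop. The cell below is a minimal sketch, not the code that produced the numbers above: the helper names `iqr_bounds` and `filter_outliers_by_group` and the toy data are made up for illustration, and it assumes there are no missing values in the grouping column.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) IQR fences for a numeric Series."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def filter_outliers_by_group(df, value_col, group_col, min_group_size=5, k=1.5):
    """Keep rows inside their group's IQR fences.

    Groups smaller than min_group_size use the fences of the whole frame.
    Hypothetical helper; assumes group_col has no missing values.
    """
    global_lower, global_upper = iqr_bounds(df[value_col], k)

    # Broadcast each group's quartiles back onto its rows
    grouped = df.groupby(group_col)[value_col]
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1

    # Number of rows in each row's group
    group_size = df[group_col].map(df[group_col].value_counts())
    big_enough = group_size >= min_group_size

    # Per-group fences where the group is large enough, global fences otherwise
    lower = (q1 - k * iqr).where(big_enough, global_lower)
    upper = (q3 + k * iqr).where(big_enough, global_upper)

    keep = df[value_col].between(lower, upper)
    return df[keep].reset_index(drop=True)


# Toy data: country A has 6 postings (own fences), country B has 2 (global fences)
toy = pd.DataFrame({
    'job_country': ['A'] * 6 + ['B'] * 2,
    'salary_month_avg_eur': [1000, 1100, 1200, 1300, 1400, 10000, 1200, 50000],
})
toy_filtered = filter_outliers_by_group(toy, 'salary_month_avg_eur', 'job_country')
print(toy_filtered)
```

On the toy data the 10000 row is removed by country A's own fences and the 50000 row by the global fences, leaving 6 rows.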


### Correlation between month and number of job postings

### Number of Job Postings and Median Salary of Data Analysts Over Time

In [42]:
df_filtered['month_period'] = df_filtered['job_posted_date'].dt.to_period('M').astype(str)

# 3. Job postings count per month
job_postings_count = (
    df_filtered.groupby('month_period')
    .size()
    .reset_index(name='job_postings_count')
    .sort_values('month_period')
    .reset_index(drop=True)
)

# 4. Median salary per month
median_salary = (
    df_filtered.groupby('month_period')['salary_month_avg_eur']
    .median()
    .reset_index(name='median_salary')
    .sort_values('month_period')
    .reset_index(drop=True)
)

# 5. Merge both into one DataFrame
df_merged = pd.merge(
    job_postings_count,
    median_salary,
    on='month_period'
)

# 6. Plot
fig = go.Figure()

# Job postings line
fig.add_trace(go.Scatter(
    x=df_merged['month_period'],
    y=df_merged['job_postings_count'],
    mode='lines+markers',
    name='Job Postings Count',
    line=dict(color='#1f77b4', width=3),
    yaxis='y1'
))

# Median salary line
fig.add_trace(go.Scatter(
    x=df_merged['month_period'],
    y=df_merged['median_salary'],
    mode='lines+markers',
    name='Median Salary (EUR)',
    line=dict(color='#ff7f0e', width=3),
    yaxis='y2'
))

# Layout with dual axes
fig.update_layout(
    title="Data Analyst: Job Postings and Median Salary Over Time",
    xaxis=dict(title='Month'),
    yaxis=dict(title='Job Postings Count', side='left'),
    yaxis2=dict(title='Median Salary (EUR)', overlaying='y', side='right'),
    legend=dict(x=0.1, y=1.1, orientation='h'),
    height=450,
    width=900,
    template='plotly_white',
    margin=dict(t=70, b=40)
)

fig.show()

### Correlation between number of postings and job salaries

In [43]:
df_filtered['month_period'] = df_filtered['job_posted_date'].dt.to_period('M').astype(str)

# 1. Job postings count per month
job_postings_count = (
    df_filtered.groupby('month_period')
    .size()
    .reset_index(name='job_postings_count')
    .sort_values('month_period')
    .reset_index(drop=True)
)

# 2. Median salary per month
median_salary = (
    df_filtered.groupby('month_period')['salary_month_avg_eur']
    .median()
    .reset_index(name='median_salary')
    .sort_values('month_period')
    .reset_index(drop=True)
)

# 3. Next month salary: shift median salary up by 1
median_salary['median_salary_next_month'] = median_salary['median_salary'].shift(-1)

# 4. Merge counts with salaries
df_corr = pd.merge(
    job_postings_count,
    median_salary[['month_period', 'median_salary', 'median_salary_next_month']],
    on='month_period'
)

# 5. Drop last row (no next month salary)
df_corr = df_corr.dropna()

# 6. Calculate correlations
corr_current = df_corr['job_postings_count'].corr(df_corr['median_salary'])
corr_next = df_corr['job_postings_count'].corr(df_corr['median_salary_next_month'])

# 7. Print results
print(f"Correlation between job postings and current month median salary: {corr_current:.3f}")
print(f"Correlation between job postings and next month median salary: {corr_next:.3f}")

Correlation between job postings and current month median salary: 0.709
Correlation between job postings and next month median salary: 0.878
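
To see whether a one-month lag really stands out, the same calculation can be repeated for several lags. This is a rough sketch under assumptions: `lagged_correlations` is a hypothetical helper, not part of the analysis above, and lag 1 corresponds to the `shift(-1)` used for `median_salary_next_month`. Unlike the cell above, lag 0 here uses all months rather than dropping the last one.

```python
import pandas as pd


def lagged_correlations(x: pd.Series, y: pd.Series, max_lag: int = 3) -> pd.DataFrame:
    """Pearson correlation between x[t] and y[t + lag] for lag = 0..max_lag.

    Hypothetical helper; x and y are assumed to share the same ordered index.
    """
    rows = []
    for lag in range(max_lag + 1):
        # shift(-lag) aligns the value of y from `lag` periods later with x
        shifted = y.shift(-lag)
        n_pairs = pd.concat([x, shifted], axis=1).dropna().shape[0]
        # Series.corr ignores rows where either value is missing
        rows.append({'lag': lag, 'n_pairs': n_pairs, 'pearson_r': x.corr(shifted)})
    return pd.DataFrame(rows)


# Toy series in which y follows x with a one-period delay (y[t + 1] = 2 * x[t])
x_toy = pd.Series([3, 1, 4, 1, 5, 9])
y_toy = pd.Series([0, 6, 2, 8, 2, 10])
print(lagged_correlations(x_toy, y_toy, max_lag=2))
```

With the notebook's data, the call would look like `lagged_correlations(df_merged['job_postings_count'], df_merged['median_salary'])`. Each extra lag removes one month from the comparison, so the `n_pairs` column is worth checking before reading much into the larger lags.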


### Interpretation

Both correlations are strong and positive: 0.709 with the same month's median salary and 0.878 with the following month's median salary.

The higher next-month value suggests that when more jobs are posted in a given month, the median salary tends to be higher in the following month. This could indicate a lagged effect, where increased hiring demand signals, or even drives, higher salary offers shortly afterwards. It might reflect market dynamics such as salary negotiations catching up after spikes in demand, or employers responding to talent shortages by raising salaries in the near term.

What could cause this pattern?
- Hiring cycles: companies may ramp up postings first, then adjust salary offers in response.
- Budget timing: salary budgets or offers might be updated shortly after observing market demand.
- Market signals: a surge in postings could reflect competitive pressures that lead to salary increases soon after.
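
Because the correlation is computed on monthly aggregates, it rests on only as many data points as there are months, so it is worth checking how easily a value this large could appear by chance. The cell below is a minimal sketch of a permutation test using NumPy only. `permutation_pvalue` is a hypothetical helper, and the test treats the months as exchangeable, which is an assumption that trending time series may violate, so the result is a rough sanity check rather than a formal significance test.

```python
import numpy as np


def permutation_pvalue(x, y, n_permutations=10_000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation of x and y.

    Hypothetical helper; assumes observations are exchangeable.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    observed = np.corrcoef(x, y)[0, 1]

    # Count shuffles whose correlation is at least as extreme as the observed one
    extreme = 0
    for _ in range(n_permutations):
        r = np.corrcoef(x, rng.permutation(y))[0, 1]
        if abs(r) >= abs(observed):
            extreme += 1

    # Add-one correction keeps the estimate strictly above zero
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Toy example: 12 "months" with a perfectly linear relationship
x_demo = np.arange(12)
y_demo = 2 * x_demo + 1
r_demo, p_demo = permutation_pvalue(x_demo, y_demo, n_permutations=2_000)
print(f"r = {r_demo:.3f}, permutation p-value = {p_demo:.4f}")
```

With the notebook's data, the call would be `permutation_pvalue(df_corr['job_postings_count'], df_corr['median_salary_next_month'])`.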