# Title: Comparing global salary trends within Data science vs Cybersecurity careers

Younes Melouah
od23ym@leeds.ac.uk

# PROJECT PLAN

### Data description:
The data for this project is sourced from two Kaggle datasets titled "Global Data science Salary trends" and "Cybersecurity salaries". Both datasets provide comprehensive information on the salaries of data-related and cybersecurity-related roles from around the world over the period 2020-2022.


### Overview of aims and objectives:

Aim: The primary goal of this project is to explore, analyse and compare global salary trends within data science and cybersecurity, answering the main question of which field pays more and which is likely to pay more in the future based on the data we have now. By delving into these datasets, the project aims to:

1. Conduct an exploratory analysis to identify the hierarchy of data roles, ranking them from most to least popular across the globe.
2. Create a visual representation ranking data roles from highest to lowest salary.
3. Visualise the variation in salary within these data roles over time to trace and observe any important trends and shifts from 2020 to 2022.



# THE DATA:
Both datasets provide comprehensive information on the salaries of data-related and cyber-related roles from around the world.

The datasets share the same columns but hold different records. The columns are:
- work_year: The year in which the employee was working
- experience_level: EN (Entry-level / Junior), MI (Mid-level / Intermediate), SE (Senior-level / Expert), EX (Executive-level / Director)
- employment_type: PT (Part-time), FT (Full-time), CT (Contract), FL (Freelance)
- job_title: The specific job role
- salary: The gross salary.
- salary_currency: The currency of the salary paid.
- salary_in_usd: The salary in USD (FX rate divided by avg. USD rate for the respective year via fxdata.foorilla.com).
- employee_residence: The employee's country of residence during the work year
- remote_ratio: The overall amount of work done remotely. Possible values are 0 (no remote work, less than 20%), 50 (partially remote) and 100 (fully remote, more than 80%)
- company_location: The country of the employer's office
- company_size: S (small), M (medium), L (large, more than 250 employees)
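
Since the analysis relies on both files sharing the columns listed above, a quick schema check before combining them can catch surprises such as an extra index column. The following is a minimal sketch: `compare_schemas` is a hypothetical helper, not part of pandas, and the two tiny frames are made-up stand-ins for the real CSV files.

```python
import pandas as pd


def compare_schemas(df_a: pd.DataFrame, df_b: pd.DataFrame) -> dict:
    """Report columns missing from either frame and shared columns whose dtypes differ."""
    cols_a, cols_b = set(df_a.columns), set(df_b.columns)
    shared = sorted(cols_a & cols_b)
    return {
        "only_in_a": sorted(cols_a - cols_b),
        "only_in_b": sorted(cols_b - cols_a),
        "dtype_mismatch": [c for c in shared if df_a[c].dtype != df_b[c].dtype],
    }


# Made-up frames standing in for the two Kaggle files
ds = pd.DataFrame({"Unnamed: 0": [0], "work_year": [2020], "salary_in_usd": [90000]})
cyber = pd.DataFrame({"work_year": [2021], "salary_in_usd": [110000]})

report = compare_schemas(ds, cyber)
print(report)
```

An empty report for all three keys would support treating the two datasets as directly stackable.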

### Data quality and Considerations

There are some uncertainties and considerations associated with our dataset:
- The criteria for small, medium or large company according to the dataset is not clearly defined.

- The dataset does not specify whether "company_location" means the company's main office/headquarters or the employee's work location.

- There is some ambiguity surrounding the nature of the employees' experiences, especially regarding how long they have been at their registered company.

- The dataset does not account for the employees' diverse skill sets, which could have been further interpreted as a salary determinant.

- Having the same job title does not mean identical job responsibilities or tasks; thus, differences in salaries can exist within the same job title.

- The dataset's reliability can be questioned too, especially if it was self-reported. This could lead to biases and tendencies to inflate/deflate certain figures.

Despite these uncertainties, the dataset still provides enough information to take an introductory step towards the analysis of global data science and cybersecurity job market trends. The architecture section below describes the steps taken to refine our dataset and make it ready for analysis.



# Objective:

In today's digital age, information about tech jobs can be found everywhere, from LinkedIn to social media to blogs and more. Whilst an abundance of data is available, it is often explored only broadly, and many important factors such as experience, regional job market nuances, geographical considerations and employment types tend to be overlooked.

Thus, the objectives:

Objective 1: Correlate and visualize average salaries by job titles in the fields of Data Science and Cybersecurity, highlighting variations based on company size, experience levels, and employment types.
Objective 2: Provide a global perspective by charting the trend of salaries based on employee residence, emphasizing top locations with the highest average salaries.
Objective 3: Analyze and showcase the impact of remote work on salaries within the Data Science and Cybersecurity fields.

# ARCHITECTURE:
#### 1. Data Collection:

Raw data sources: The initial point of data gathering. The data for this project comes from the two previously mentioned Kaggle datasets, "Global Data science Salary trends" and "Cybersecurity salaries".


#### 2. Data cleaning and Pre-processing

Data Quality Check: Removed duplicates, removed missing values and ensured consistent data formatting.
Transformation: Converted all salaries into USD.
Normalization: Standardized fields such as job titles and country names to ensure uniformity.
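
The cleaning steps above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the exact procedure used in this project: `clean_salaries` is a hypothetical helper, the sample rows are made up, and title-casing is only one possible way to standardize job titles (it would also turn an acronym such as "ML" into "Ml").

```python
import pandas as pd


def clean_salaries(df: pd.DataFrame) -> pd.DataFrame:
    """Drop exact duplicate rows and rows with missing key fields, then tidy job titles."""
    out = df.drop_duplicates().dropna(subset=["job_title", "salary_in_usd"]).copy()
    # Strip stray whitespace and use consistent capitalisation
    out["job_title"] = out["job_title"].str.strip().str.title()
    return out.reset_index(drop=True)


# Made-up rows: one exact duplicate and one missing job title
raw = pd.DataFrame({
    "job_title": ["data scientist ", "data scientist ", "Security Engineer", None],
    "salary_in_usd": [100000, 100000, 120000, 90000],
})

cleaned = clean_salaries(raw)
print(cleaned)
```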


#### 3. Exploratory Data Analysis (EDA)

Descriptive Statistics: Essential statistics, including the mean, median and standard deviation, offer a snapshot of the data distribution.
Visualization: Graphical representations help in visualising these data distributions, correlations and evolving trends.
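
As a rough illustration of the descriptive statistics step, the mean, median and standard deviation can be computed per field in one call. The `field` column and the salary values below are assumptions made for this sketch; they are not columns or figures from the Kaggle files.

```python
import pandas as pd

# Made-up salaries with a hypothetical 'field' label
salaries = pd.DataFrame({
    "field": ["Data Science"] * 3 + ["Cyber Security"] * 3,
    "salary_in_usd": [80000, 100000, 150000, 90000, 110000, 130000],
})

# One row per field, one column per statistic
summary = (
    salaries.groupby("field")["salary_in_usd"]
    .agg(["mean", "median", "std"])
    .round(0)
)
print(summary)
```

Comparing the mean with the median in such a table is a quick way to spot skew: in the made-up Data Science rows the mean sits above the median because of one high salary.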

#### 4. Analysis:

Model Building (if predictive analytics is desired): Create statistical models.
Insights Extraction: Derive actionable insights based on patterns or trends.

#### 5. Reporting and Visualization:

Dashboards: Use tools like Python libraries to create visual dashboards.


#### 6. Feedback Loop:

Based on the insights or outcomes from the analysis, there might be a need to revisit earlier stages for more data or different preprocessing steps.


# DATA PREPROCESSING


In [None]:
import pandas as pd
import datetime as dt
import plotly.express as px
import ipywidgets as widgets
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import numpy as np
import warnings 
warnings.filterwarnings('ignore')

In [None]:
df1 = pd.read_csv('ds_salaries.csv') # read the data science roles from csv
df2 = pd.read_csv('salaries_cyber.csv') # read the Cyber security roles from csv

In [None]:
df1

In [None]:
df2

In [None]:
df1.drop(columns=['Unnamed: 0'], inplace=True) # remove unwanted column 
df1.head()

In [None]:
# both dataframes have the same columns with the same data types,
# so combine them into a single dataframe
combined_df = pd.concat([df1, df2], ignore_index=True)
combined_df.to_csv('combined_file.csv', index=False)
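
Once the two frames are stacked, a row no longer records which file it came from. One possible way to keep that information, shown here as a sketch, is to tag each frame before concatenating. The `field` column name and the sample rows are assumptions for illustration only.

```python
import pandas as pd

# Made-up stand-ins for the two source frames
ds = pd.DataFrame({"job_title": ["Data Scientist"], "salary_in_usd": [100000]})
cyber = pd.DataFrame({"job_title": ["Security Engineer"], "salary_in_usd": [120000]})

# assign() returns a copy, so the original frames are left untouched
labelled = pd.concat(
    [ds.assign(field="Data Science"), cyber.assign(field="Cyber Security")],
    ignore_index=True,
)
print(labelled)
```

With such a label in place, later group-by operations on the combined frame could compare the two fields directly.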

In [None]:
# take all the unique locations list within company_location column
unique_locations_list = list(combined_df['company_location'].unique())
print(unique_locations_list)

In [None]:
# convert the country code name to its full name
country_mapping = {
    'DE': 'Germany',
    'JP': 'Japan',
    'GB': 'United Kingdom',
    'HN': 'Honduras',
    'US': 'United States',
    'HU': 'Hungary',
    'NZ': 'New Zealand',
    'FR': 'France',
    'IN': 'India',
    'PK': 'Pakistan',
    'CN': 'China',
    'GR': 'Greece',
    'AE': 'United Arab Emirates',
    'NL': 'Netherlands',
    'MX': 'Mexico',
    'CA': 'Canada',
    'AT': 'Austria',
    'NG': 'Nigeria',
    'ES': 'Spain',
    'PT': 'Portugal',
    'DK': 'Denmark',
    'IT': 'Italy',
    'HR': 'Croatia',
    'LU': 'Luxembourg',
    'PL': 'Poland',
    'SG': 'Singapore',
    'RO': 'Romania',
    'IQ': 'Iraq',
    'BR': 'Brazil',
    'BE': 'Belgium',
    'UA': 'Ukraine',
    'IL': 'Israel',
    'RU': 'Russia',
    'MT': 'Malta',
    'CL': 'Chile',
    'IR': 'Iran',
    'CO': 'Colombia',
    'MD': 'Moldova',
    'KE': 'Kenya',
    'SI': 'Slovenia',
    'CH': 'Switzerland',
    'VN': 'Vietnam',
    'AS': 'American Samoa',
    'TR': 'Turkey',
    'CZ': 'Czech Republic',
    'DZ': 'Algeria',
    'EE': 'Estonia',
    'MY': 'Malaysia',
    'AU': 'Australia',
    'IE': 'Ireland',
    'BW': 'Botswana',
    'AZ': 'Azerbaijan',
    'AQ': 'Antarctica',
    'AX': 'Åland Islands',
    'SE': 'Sweden',
    'ET': 'Ethiopia',
    'NO': 'Norway',
    'ID': 'Indonesia',
    'RS': 'Serbia',
    'AR': 'Argentina',
    'ZA': 'South Africa',
    'UM': 'United States Minor Outlying Islands',
    'EG': 'Egypt',
    'TW': 'Taiwan',
    'SA': 'Saudi Arabia',
    'AF': 'Afghanistan'
}

# full country names, in the same order as the codes in country_mapping
country_names = list(country_mapping.values())

print(country_names)

In [None]:
# replace the country codes with their full country names
combined_df['employee_residence'] = combined_df['employee_residence'].replace(country_mapping)
combined_df['company_location'] = combined_df['company_location'].replace(country_mapping)
df1['employee_residence'] = df1['employee_residence'].replace(country_mapping)
df1['company_location'] = df1['company_location'].replace(country_mapping)
df2['employee_residence'] = df2['employee_residence'].replace(country_mapping)
df2['company_location'] = df2['company_location'].replace(country_mapping)

In [None]:
# before proceeding with the analysis, convert the abbreviations used in the dataset to their full meaning
# so that the results are easier to read

experience_map = {
    'EN': 'Entry Level',
    'MI': 'Mid Level',
    'SE': 'Senior Level',
    'EX': 'Executive'
}

employment_map = {
    'PT': 'Part-time',
    'FT': 'Full-time',
    'CT': 'Contractor',
    'FL': 'Freelancer'
}

company_size_map = {
    'S': 'Small',
    'M': 'Medium',
    'L': 'Big Enterprise'
}

for df in [df1, df2,combined_df]:
    df['experience_level'] = df['experience_level'].map(experience_map)
    df['employment_type'] = df['employment_type'].map(employment_map)
    df['company_size'] = df['company_size'].map(company_size_map)
    df['employee_residence'] = df['employee_residence'].replace(country_mapping)
    df['company_location'] = df['company_location'].replace(country_mapping)
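
One detail worth noting about the cell above is that `Series.map` and `Series.replace` treat unmapped values differently: `map` returns `NaN` for any value that is not a key of the dictionary, while `replace` leaves it unchanged. Re-running a `map` cell on already converted labels would therefore blank the column. The sketch below illustrates this with made-up values.

```python
import pandas as pd

experience_map = {"EN": "Entry Level", "SE": "Senior Level"}

# The third value is already a full label, as it would be on a second run
codes = pd.Series(["EN", "SE", "Senior Level"])

mapped = codes.map(experience_map)        # values missing from the dict become NaN
replaced = codes.replace(experience_map)  # values missing from the dict are kept

print(mapped.tolist())
print(replaced.tolist())
```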

In [None]:
combined_df

In [None]:
combined_df.info()

In [None]:
combined_df.describe().T

In [None]:
# check whether any values in the combined DataFrame are missing (NaN or None)
# by counting the null values in each column
combined_df.isnull().sum()

# Exploring Global Salary trends in Data science and Cyber Security roles across time from 2020-2022

In [None]:

# Combining both Data science and Cyber Security salary trends over time into one graphical representation
plt.figure(figsize=(12, 7))

# Data Science salaries trend
salary_trend_data = df1[['salary_in_usd', 'work_year']].sort_values(by='work_year')
sns.lineplot(data=salary_trend_data, x='work_year', y='salary_in_usd', marker='o', linestyle='--', color='blue', label='Data Science')

# Cyber Security salaries trend
salary_trend_cyber = df2[['salary_in_usd', 'work_year']].sort_values(by='work_year')
sns.lineplot(data=salary_trend_cyber, x='work_year', y='salary_in_usd', marker='o', linestyle='--', color='red', label='Cyber Security')

plt.title('Cyber security and Data science salary Trends Over Time', fontweight='bold')
plt.legend()
plt.show()



The graph above compares salary trends for Data Science and Cyber Security roles from 2020 to the end of 2022. Both fields show a general upward trend, meaning that their average salaries increased over the time frame as a whole.

For Data Science (in blue), the average salary shows a steady and relatively linear increase from 2020 to 2021, rising from approximately $97k to $100k. The Cyber Security average salary (in red), by contrast, starts above Data Science at approximately $110k and steadily decreases to approximately $105k.

Both trends level off briefly around 2021 and then resume a fast upward trajectory, particularly for Cyber Security. This temporary plateau is intriguing. Without more context it is difficult to pinpoint the exact cause; however, the rapid increase in salaries after 2021 could hypothetically be attributed to the mid/post-pandemic period, in which businesses were digitising and demand for tech roles increased overall. Increased digitisation also means cyber threats and attacks may have surged, which is possibly why Cyber Security salaries grew faster than Data Science salaries.

Overall, the graph indicates a general increase in demand for both Cyber Security and Data Science roles. We will investigate the salary trends further within other specific variables, including location, experience and company size.
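
The approximate figures quoted above are read off the chart. Because `sns.lineplot` aggregates repeated x values with the mean by default, the same numbers can be obtained exactly with a group-by. The sketch below uses a hypothetical helper, `yearly_mean`, and made-up salaries, so its output does not reproduce the project's figures.

```python
import pandas as pd


def yearly_mean(df: pd.DataFrame) -> pd.DataFrame:
    """Mean salary per work year plus the year-on-year change in percent."""
    out = df.groupby("work_year")["salary_in_usd"].mean().to_frame("mean_salary")
    out["pct_change"] = out["mean_salary"].pct_change().mul(100).round(1)
    return out


# Made-up salaries for illustration
sample = pd.DataFrame({
    "work_year": [2020, 2020, 2021, 2021, 2022, 2022],
    "salary_in_usd": [90000, 110000, 100000, 110000, 120000, 132000],
})

trend = yearly_mean(sample)
print(trend)
```

Applied separately to `df1` and `df2`, a table like this would let the plateau and the later acceleration be stated as percentages rather than estimated visually.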

# Exploring Salary trends by Company size within Data science and Cybersecurity (2020-2022)

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# making subplots 
fig, ax = plt.subplots(1, 2, figsize=(20, 6))

orders = ['Small', 'Medium', 'Big Enterprise']

# plot data science on the left 
sns.lineplot(data=df1, x='work_year', y='salary_in_usd', hue='company_size', hue_order=orders, marker='o', ax=ax[0])
ax[0].set_title('Salary Trend Over Time by Company Size in Data Science', fontweight='bold')
ax[0].set_xlabel('Year', fontweight='bold')
ax[0].set_ylabel('Salary (USD)', fontweight='bold')
ax[0].legend(title='Company Size', title_fontsize=10, fontsize=10, loc='upper left')
ax[0].set_facecolor("#f4f4f4")
ax[0].grid(False)

# plot cyber security on the right 
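# use the same hue order as the left plot so each company size keeps the same colour in both panels
sns.lineplot(data=df2, x='work_year', y='salary_in_usd', hue='company_size', hue_order=orders, marker='o', ax=ax[1])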
ax[1].set_title('Salary Trend Over Time by Company Size in Cyber Security', fontweight='bold')
ax[1].set_xlabel('Year', fontweight='bold')
ax[1].set_ylabel('Salary (USD)', fontweight='bold')
ax[1].legend(title='Company Size', title_fontsize=10, fontsize=10, loc='upper left')
ax[1].set_facecolor("#f4f4f4")
ax[1].grid(False)


# Adjust the space between the plots
plt.tight_layout()

# Display the plots
plt.show()


The line graphs above compare salary trends over time by company size within Data Science (DS) and Cyber Security (CS).

In both the DS and CS sectors, big enterprises (green line) show a steady increase in salaries throughout the observed period. In DS the upward trend is slightly more prominent, showing consistent demand for these roles. In CS there is a similar rise, but the gap with medium companies narrows intriguingly towards 2022, suggesting possible shifts in market dynamics or growing demand within the medium-sized sector.

We also see a notably volatile trend in both DS and CS roles for medium-sized companies (orange line). Interestingly, salaries in both fields dip between 2020 and 2021. This could be influenced by the pandemic, as small and medium-sized companies find it harder to sustain themselves during global crises with their limited resources and are more likely to resort to lay-offs as a means of survival. We then see a very sharp recovery in both sectors, with fast salary growth from 2021 to 2022 and CS salaries at medium companies even edging out those at big enterprises. These graphs, whilst optimistic, require closer examination for outliers or high-demand roles, as these could disproportionately influence the average.
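
One way to carry out the closer examination mentioned above is to flag salaries outside the usual 1.5 × IQR fences and compare the mean with the median. The following is a sketch only: `iqr_outliers` is a hypothetical helper and the salaries are made up to show how a single extreme value can pull the average upwards.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up salaries for one company size, with one extreme value
sample = pd.DataFrame({
    "company_size": ["Medium"] * 5,
    "salary_in_usd": [90000, 95000, 100000, 105000, 400000],
})
sample["is_outlier"] = iqr_outliers(sample["salary_in_usd"])

print(sample)
print(sample["salary_in_usd"].agg(["mean", "median"]))
```

In this made-up case the mean is far above the median, which is the pattern to look for before attributing a jump in average salary to market conditions.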
 

# Visualising Average salaries by Job title in Data science and Cyber security fields

In [None]:
# keep only the job titles that appear at least 10 times in the combined dataframe
filtered_jobs = combined_df['job_title'].value_counts()
job_titles_above_10 = filtered_jobs[filtered_jobs >= 10].index
# .copy() avoids SettingWithCopyWarning when columns of the filtered frame are reassigned later
combined_df_filtered = combined_df[combined_df['job_title'].isin(job_titles_above_10)].copy()

The code above filters the combined dataset (data science and cyber security) into a new dataset, 'combined_df_filtered', by removing rows whose job title appears fewer than 10 times. This makes in-depth analysis more reliable, since every remaining role is backed by a reasonable number of records. Working from the combined dataset also allows us to focus on comparing specific job roles rather than just the general sectors.
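
The same frequency filter can also be written by attaching each row's title count directly to the row. The sketch below is an alternative formulation rather than the code used in this notebook; `keep_frequent_titles` is a hypothetical helper and the sample rows are made up, with a threshold of 2 so the effect is visible on a tiny frame.

```python
import pandas as pd


def keep_frequent_titles(df: pd.DataFrame, min_count: int = 10) -> pd.DataFrame:
    """Keep rows whose job title occurs at least min_count times."""
    # Look up each row's title in the value_counts table to get a per-row count
    counts = df["job_title"].map(df["job_title"].value_counts())
    return df[counts >= min_count].copy()


# Made-up rows: one title appears three times, the other once
sample = pd.DataFrame({
    "job_title": ["Data Scientist"] * 3 + ["Security Architect"],
    "salary_in_usd": [100000, 110000, 120000, 150000],
})

frequent = keep_frequent_titles(sample, min_count=2)
print(frequent)
```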

In [None]:
# compute the average salary for each job title, sorted in ascending order
average_job_salary = combined_df_filtered.groupby('job_title')['salary_in_usd'].mean().sort_values(ascending=True).round()

# create a horizontal bar chart
fig_cyber = px.bar(average_job_salary, x=average_job_salary, y=average_job_salary.index,
                   color=average_job_salary, labels={'job_title': 'Job Title', 'x': 'Average Salary (USD)'},
                   text=average_job_salary, orientation='h', template='seaborn',
                   title='<b>Average Salary by Role in Cyber Security & Data Science</b>',
                   width=800, height=1200)  # adjust as needed
fig_cyber.show()
# this outputs a bar chart of the average salary for each role with a count of 10 or above in our dataset

A general observation is that the top-paying roles are security roles, although some data science roles also rank highly. But we have to ask ourselves:
- What experience level do the employees in these roles have?
- Where are the employees in these roles based (lower boundary)?
- In what currency are they paid?
- What company do they work for?
- What are their remote ratio and employment type?
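
A first pass at the experience-level question can be made with a pivot table that splits each job title's average salary by experience level. This is a sketch with made-up rows; the column names match the dataset, but the values do not come from it.

```python
import pandas as pd

# Made-up rows for illustration
sample = pd.DataFrame({
    "job_title": ["Data Scientist", "Data Scientist", "Security Engineer", "Security Engineer"],
    "experience_level": ["Entry Level", "Senior Level", "Entry Level", "Senior Level"],
    "salary_in_usd": [70000, 140000, 80000, 150000],
})

# Rows: job titles, columns: experience levels, cells: mean salary
breakdown = sample.pivot_table(
    index="job_title",
    columns="experience_level",
    values="salary_in_usd",
    aggfunc="mean",
)
print(breakdown)
```

The same pattern could be repeated with `employment_type`, `remote_ratio` or `company_size` as the column variable to address the other questions.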

# Salary Distribution across Experience Levels within Data science and Cyber Security sectors

In [None]:
#  display box chart with experience (Ascending order)
order = ['Entry Level', 'Mid Level', 'Senior Level', 'Executive']
combined_df_filtered['experience_level'] = pd.Categorical(combined_df_filtered['experience_level'], categories=order, ordered=True)
combined_df_filtered = combined_df_filtered.sort_values('experience_level') # sort value of experience column as shown in the order variable (Array)

# display box plot
salary_distribution_by_experience_data = px.box(combined_df_filtered, x='experience_level', y='salary_in_usd', color='experience_level',
             labels={'experience_level': 'Experience Level', 'salary_in_usd': 'Salary (USD)'},
             title='Salary Distribution by Experience Level')

# Update y-axis to show a tick every 100k
max_salary = combined_df_filtered['salary_in_usd'].max()  # Get the maximum salary for setting the y-axis limit
y_axis_ticks = list(range(0, int(max_salary) + 1, 100000))  # Generate tick values in steps of 100k

salary_distribution_by_experience_data.update_layout(
    yaxis=dict(
        tickvals=y_axis_ticks,
        ticktext=[f"${val:,}" for val in y_axis_ticks]
    )
)

salary_distribution_by_experience_data.show()

# Job Frequency Distribution by Experience Level (ordered by highest salary job titles to lowest salary)

In [None]:

# order the job titles by total salary (summed over all records) rather than by count, highest first
ordered_job_titles = combined_df_filtered.groupby('job_title')['salary_in_usd'].sum().sort_values(ascending=False).index.tolist()
fig = px.histogram(combined_df_filtered, x='job_title', color='experience_level', height=800, width=900)
# update layout to stack the bars and order them by total salary
fig.update_layout(barmode='stack', xaxis={'categoryorder':'array', 'categoryarray': ordered_job_titles})
fig.show()

Here we can observe that the highest-paying jobs are mostly held at Senior and Mid Level and, most importantly, that these roles are also the most frequently recorded in our dataset.
This makes the finding that these roles are the highest paid more reliable.

# Salary distribution across employment types (FullTime, Contractor, PartTime, Freelancer)

In [None]:
fig = px.box(combined_df_filtered, x='salary_in_usd', y='employment_type', color='employment_type', height=500, width=800)
fig.update_layout(barmode='stack', yaxis={'categoryorder':'total ascending'})

This box plot is another representation of the data shown in our bar graphs. It gives us more information on the minimum and maximum values and on outliers, as well as the median and the interquartile range.
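
The quantities a box plot summarises can also be computed as a table. The sketch below uses made-up salaries and pandas' default linear interpolation for quantiles; a plotting library may use a slightly different quartile method, so the numbers need not match the boxes exactly.

```python
import pandas as pd

# Made-up salaries for two employment types
sample = pd.DataFrame({
    "employment_type": ["Full-time"] * 5 + ["Contractor"] * 3,
    "salary_in_usd": [80000, 90000, 100000, 110000, 120000, 50000, 60000, 130000],
})

# Quartiles per employment type, reshaped so each quantile is a column
box_stats = (
    sample.groupby("employment_type")["salary_in_usd"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
box_stats.columns = ["q1", "median", "q3"]
box_stats["iqr"] = box_stats["q3"] - box_stats["q1"]
print(box_stats)
```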

In [None]:
# order the job titles by total salary to see which employment types each job has
employment_job_titles = combined_df_filtered.groupby('job_title')['salary_in_usd'].sum().sort_values(ascending=False).index.tolist()
fig = px.histogram(combined_df_filtered, x='job_title', color='employment_type', height=800, width=900)
# update layout to stack the bars and order them by total salary
fig.update_layout(barmode='stack', xaxis={'categoryorder':'array', 'categoryarray': employment_job_titles})
fig.show()

In [None]:
employment_type_counts = combined_df_filtered['employment_type'].value_counts()
fig = px.pie(employment_type_counts, 
             values=employment_type_counts.values, 
             names=employment_type_counts.index,
             title='Distribution of Employment Types')
fig.show()

Most jobs are full-time, and this is fully reflected in the graph for all the top-paying roles in tech: full-time accounts for 98.30%, with the rest being 0.72% Contractor, 0.72% Part-time and 0.26% Freelancer.
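
Percentages like those quoted above can be computed directly instead of being read from the pie chart hover labels. The sketch below uses a made-up series, so its shares differ from the real ones.

```python
import pandas as pd

# Made-up employment types: 8 full-time, 1 contractor, 1 part-time
employment = pd.Series(
    ["Full-time"] * 8 + ["Contractor", "Part-time"], name="employment_type"
)

# normalize=True returns proportions instead of counts
shares = employment.value_counts(normalize=True).mul(100).round(2)
print(shares)
```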

# Job Titles Distribution Across Small, Medium, and Large Companies

In [None]:
# show how many of the roles recorded in our dataset are in small, medium and large companies
# ('Big Enterprise' is the label given to large companies in company_size_map)
fig = px.histogram(combined_df_filtered, x='job_title', color='company_size', height=800, width=1000, category_orders={"company_size": ['Small', 'Medium', 'Big Enterprise']})
fig.update_layout(barmode='stack', xaxis={'categoryorder':'total descending'})

The first thing that stands out from the graph is the dominance of Big Enterprise across almost all job titles. This suggests that larger organisations tend to have a broader variety of specialised roles, especially within the sectors we are exploring (data and cyber security). This makes sense, as they generally deal with larger volumes of data, have more complex operations and have greater security needs. The two most common job titles are "Security Engineer" and "Data Scientist", with a significant number of these professionals working in Medium companies.

Roles such as Machine Learning Engineer and Data Scientist are notably prevalent in smaller companies. These specialised positions could be in high demand as smaller firms strive to catch up with larger enterprises, which means they are likely to prioritise investing in cutting-edge technologies and roles to remain competitive. Conversely, smaller firms may not have the same need for roles like DevOps or other maintenance-oriented job titles such as Data Architect, which are more common in large corporate infrastructures.

# Evolution of Global salaries by Employee residence over time (2020-2022)

In [None]:
# average salary by employee residence and work year
# then find the global minimum and maximum of those averages
avg_salary_by_residence_year = combined_df_filtered.groupby(['employee_residence', 'work_year'])['salary_in_usd'].mean().reset_index()
global_min_salary = avg_salary_by_residence_year['salary_in_usd'].min()
global_max_salary = avg_salary_by_residence_year['salary_in_usd'].max()

# create a choropleth with animation for each work year, keeping color scale stable
fig_choropleth = px.choropleth(avg_salary_by_residence_year, 
                               locations='employee_residence', 
                               locationmode='country names',
                               color='salary_in_usd',
                               animation_frame='work_year',
                               title='Regional Salary Disparity Over Time',
                               color_continuous_scale='RdYlBu',
                               labels={'salary_in_usd': 'Average Salary (USD)'},
                               range_color=(global_min_salary, global_max_salary)  # set stable color scale (if this changes it can be confusing to analyse the map over time)
                              )

fig_choropleth.show()


# Global Salary Trends by Experience Level and Employee residence

In [None]:
avg_salary_by_residence_year_experience = combined_df_filtered.groupby(['employee_residence', 'work_year', 'experience_level'])['salary_in_usd'].mean().reset_index()

# Create a choropleth with animation for each work year and facet columns for each experience level
fig_choropleth = px.choropleth(avg_salary_by_residence_year_experience, 
                               locations='employee_residence', 
                               locationmode='country names',
                               color='salary_in_usd',
                               animation_frame='work_year',
                               facet_col='experience_level', # Separate map for each experience level
                               title='Regional Salary Disparity Over Time by Experience Level',
                               color_continuous_scale='RdYlBu', 
                               labels={'salary_in_usd': 'Average Salary (USD)'},
                               facet_col_wrap=2, # Number of columns before wrapping, adjust as needed
                               range_color=(global_min_salary, global_max_salary) # keep the color scale stable across frames
                              )

fig_choropleth.show()

I've used this map to see the variation in salaries within Data Science and Cybersecurity based on experience level, tracked over time as a heat map. By employing a visually intuitive choropleth map, we can assess the geographical distribution of average salaries across different countries, segmented by experience levels such as 'Entry Level', 'Mid Level', 'Senior Level', and 'Executive'. The animation feature allows us to observe how these disparities progress across the years, offering insights into evolving economic and industrial landscapes.
In summary, this analysis, with its temporal and spatial dimensions, provides a dynamic picture of the global salary landscape, segmented by expertise, enabling better-informed decisions for both employers and professionals.

The results derived from such an analysis are multifaceted. First, it provides a holistic view of where top talent might be most economically rewarded and, conversely, where professionals might be undercompensated relative to global standards. This could be instrumental for organizations in determining competitive pay scales based on regional standards and experience prerequisites. Additionally, for professionals within the industry, it offers a comparative perspective on potential global opportunities and avenues for career advancement. If a certain region consistently offers higher remuneration for a specific experience level over the years, it could indicate a deficit of professionals with that expertise in the area, or particularly high demand for the role due to local industry booms.



Entry Level (2021):
Developing countries, notably Brazil and India, typically offer lower salaries for Entry Level positions.
Developed nations, especially the US and UK, provide the most competitive compensation for these roles, with salaries often ranging between 80k and 90k USD.

Entry Level (2022):
The highest Entry Level salaries are in North America, followed by the UK and then India and Pakistan.

Mid Level (2021):
For Mid Level Data Science roles, North American countries, with the US in the lead, offer the highest remuneration.
Europe follows, with salaries typically falling between 50k and 90k USD.

Mid Level (2022):
The US pays the most, followed by the rest of North America, with Europe ranging from 50k to 90k USD.

Senior Level (2021):
North American countries, particularly the US, lead in offering the highest salaries for Senior Level positions. Europe trails closely behind as the second-highest paying region.

Senior Level (2022):

Executive (2021):
Executive roles see the highest compensation in countries like the US, Germany, India, and Russia, with the US standing out as the premier destination for top-tier salaries.

Executive (2022):
The highest-paid roles are based in the US and the rest of North America, at 150k-200k USD.
No roles are recorded in other regions or parts of the world.

# Yearly Top 10 Locations with Highest Average Salaries (Mean)

In [None]:
# we can explore the maps above further by taking the top-paying countries and seeing how they change over time

grouped_data = combined_df_filtered.groupby(['work_year', 'company_location'])

#calculate mean salary for each group
average_salaries = grouped_data['salary_in_usd'].mean().reset_index()

average_salaries = average_salaries.rename(columns={'salary_in_usd': 'Mean Salary'})

# for each year we will get top 10 locations with highest mean salaries
sorted_salaries = average_salaries.sort_values(by=["work_year", "Mean Salary"], ascending=[True, False])
top_roles_each_year = sorted_salaries.groupby('work_year').head(10).reset_index(drop=True).round(2)
top_roles_each_year

# Animated Salary Trends for Cyber Security and Data science Roles by Country

In [None]:
fig_bar = px.bar(top_roles_each_year, x="company_location", y="Mean Salary", color = top_roles_each_year.index,
                 animation_frame="work_year", 
                 range_y=[0,300000])

fig_bar.update_yaxes(showgrid=False)
fig_bar.update_xaxes(categoryorder='total descending')
fig_bar.update_traces(hovertemplate=None, texttemplate="%{y}", textfont_color='white')
fig_bar.update_layout(margin=dict(t=100, b=0, l=70, r=40),
                        coloraxis_showscale=False,
                        title_text='Average of Salary Per Country (USD)',
                        xaxis_title=' ', yaxis_title=" ",
                        title_font=dict(size=25),
                        font=dict(color='black'),
                          )

fig_bar.show()

Across all three years, countries such as the United States and Switzerland consistently appeared in the top rankings, indicating strong salaries in these nations for both sectors.
The top three countries with the highest average salaries in the fields of Data Science and Cybersecurity were the United States ($130,900.73), Switzerland ($118,008.00), and Afghanistan ($100,000.00). Other countries include the United Kingdom ($89,431.80) and Germany ($83,276.90).

In 2021:
Australia led with an average salary of $154,364.89, followed by Switzerland ($138,715.50) and Ireland ($118,205.00). The United States had an average of $115,694.29. Additionally, several countries like China, Kenya, and Russia reported an average salary of $100,000.00.

In 2022:
Italy topped the list with an astonishing average salary of $270,647.00. The United Arab Emirates and Israel followed. The United States continued to stay in the top tier with $142,766.06.
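
Some of the averages above, such as those that are exactly round numbers, may rest on very few records, so it can help to report the number of records next to each mean and optionally require a minimum sample size. The following is a sketch under that assumption: `top_locations` and its `min_records` parameter are hypothetical, and the sample rows are made up rather than taken from the dataset.

```python
import pandas as pd


def top_locations(df: pd.DataFrame, n: int = 10, min_records: int = 1) -> pd.DataFrame:
    """Top-n locations per year by mean salary, with the record count behind each mean."""
    stats = (
        df.groupby(["work_year", "company_location"])["salary_in_usd"]
        .agg(mean_salary="mean", records="count")
        .reset_index()
    )
    stats = stats[stats["records"] >= min_records]
    stats = stats.sort_values(["work_year", "mean_salary"], ascending=[True, False])
    return stats.groupby("work_year").head(n).reset_index(drop=True)


# Made-up rows: one location with a single high salary, one with three records
sample = pd.DataFrame({
    "work_year": [2022, 2022, 2022, 2022],
    "company_location": ["Italy", "United States", "United States", "United States"],
    "salary_in_usd": [270000, 130000, 140000, 150000],
})

all_ranked = top_locations(sample)               # single-record location ranks first
robust = top_locations(sample, min_records=3)    # single-record location is excluded
print(all_ranked)
print(robust)
```

If a top-ranked country turns out to have only one or two records, its position says more about an individual salary than about the national market.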

# Global Trends in Remote Work and Salaries for Data science and Cyber security roles over Time

In [None]:
# remote work vs salary
fig_remote_work_data = px.choropleth(combined_df_filtered, 
                                      locations='employee_residence', 
                                      locationmode='country names',
                                      color='remote_ratio',  
                                      hover_data=['salary_in_usd'],
                                      animation_frame='work_year',  # Adding the animation frame
                                      title='Remote Work and Salary by Employee Residence Over Time',
                                      color_continuous_scale='RdYlBu', 
                                      labels={'salary_in_usd': 'Average Salary (USD)'},
                                   )

fig_remote_work_data.show()
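
The map shows where remote work is common, but the impact of remote work on salary is easier to judge from a direct comparison of salaries across the three `remote_ratio` values. The sketch below uses made-up rows and therefore says nothing about the real relationship.

```python
import pandas as pd

# Made-up rows for illustration
sample = pd.DataFrame({
    "remote_ratio": [0, 0, 50, 100, 100, 100],
    "salary_in_usd": [90000, 110000, 80000, 100000, 120000, 140000],
})

# Record count, mean and median salary for each remote ratio
remote_summary = (
    sample.groupby("remote_ratio")["salary_in_usd"]
    .agg(["count", "mean", "median"])
)
print(remote_summary)
```

Keeping the count column alongside the averages helps to show whether any difference between the groups is backed by enough records to be meaningful.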

# CONCLUSION

### Achievements:
Throughout this project I successfully managed to understand various relationships on salary for tech roles and shown it:

1) General trend analysis: Average salaries in both Data Science and Cybersecurity increased overall from 2020 to 2022. The fall from 2020 to 2021 could have been due to COVID-19, as summarized in the analysis. Salaries then rose again in 2022, which could reflect several factors, including the digital transformation many companies underwent during the pandemic, the growth of AI, and an increase in cybersecurity threats.

2) Larger companies consistently offered higher salaries in both Data Science and Cybersecurity roles overall. However, it was interesting to find that in 2022 the gap between salaries at large enterprises and medium-sized companies was very small. This narrow gap could be due to a market shift or to increased demand from medium-sized companies undergoing strong digital transformation, which could again be linked to the pandemic.

3) Filtering out job titles that occur less frequently helps compare roles more reliably, since more data is available for each remaining role. Visualizing average salaries by job title provided strong insight into which roles are paid the most on average, while the distribution across experience levels helps show which experience levels are most common for high-paying roles.
The distribution across employment types was clear: most of the high-paying tech roles (98%) were full-time, so no further comparison with other employment types was made.

4) Visualizing salary trends based on employees' location offers insights into regional disparities and how they change over time. The choropleth maps give a detailed view of the global distribution of salaries based on experience level, and the bar charts focus on the top-paying countries, allowing for a comparison of how these rankings change year by year.
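
The job-title filtering described in point 3 can be sketched as follows. This is an illustration on invented data rather than the notebook's own code; `toy_df` and the `MIN_RECORDS` threshold are hypothetical, and a suitable threshold would depend on the size of the real datasets.

```python
import pandas as pd

# Hypothetical toy data; values are invented for illustration only.
toy_df = pd.DataFrame({
    "job_title": ["Data Scientist"] * 3
                 + ["Security Engineer"] * 2
                 + ["Head of Data"],
    "salary_in_usd": [100_000, 110_000, 120_000,
                      90_000, 110_000, 250_000],
})

MIN_RECORDS = 2  # hypothetical threshold for keeping a job title

# Count how often each title occurs and keep only the frequent ones.
title_counts = toy_df["job_title"].value_counts()
common_titles = title_counts[title_counts >= MIN_RECORDS].index
filtered = toy_df[toy_df["job_title"].isin(common_titles)]

# Average salary per remaining title, highest first.
avg_by_title = (
    filtered.groupby("job_title")["salary_in_usd"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_by_title)
```

In this toy example the single "Head of Data" record would otherwise top the ranking on the strength of one salary, which is the kind of distortion the filter is meant to avoid.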

### Limitations:

Data scope: The analysis is limited to 2020 to 2022. Trends before or after this period are not covered, so it cannot give a clear picture of tech roles over a longer duration or reveal other possible relationships.

Geographical coverage: While the datasets cover tech jobs globally, they do not break salaries down by specific regions within a country, and some countries around the world are not represented.

The sources of data collection might have inherent biases. For instance, self-reported salaries can sometimes be inflated or under-reported, potentially skewing the results.

External factors: The analysis does not include an industry comparison. Analysing job titles across different industries such as Finance, Business, Healthcare, Defence and Technology could provide strong insight into salary differences.

Variability in Job Titles: Different companies might have different titles for similar roles, which might lead to discrepancies when categorizing and analyzing these roles.

Skills: Analysing the skill sets of employees could also provide strong insight into the market value of different job expertise. Salaries depend heavily on the role specification and the tasks involved; for instance, the more technical skills a role requires, the more likely the salary is to be high.



### Future Work:
Despite these limitations, this project acts as one step towards observing nuanced insights into job market dynamics. Exploring the relationships between various factors and salary determinants, and visualising them for job seekers, can help equip them with a clearer understanding of wage trends in their desired industry, reducing the need for rigorous manual research. Platforms like LinkedIn or Indeed could potentially utilise these insights to enhance their user experience, offering more targeted and relevant job suggestions based on these observed trends.

This data analysis could be further improved by:
- Increasing Timeframe: Use data from a more extended period to identify long-term trends.

- Focusing on specific regions or countries to provide a more specific view of salary trends.

- Incorporating Additional Data Points: Introduce more variables like education levels, specific skill sets, or tools proficiency to provide a richer analysis on salary determinants.

- Machine Learning Predictions: Utilize machine learning algorithms to forecast future salary trends based on historical and current data.
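
As a starting point for the forecasting idea in the last recommendation, the simplest baseline is an ordinary least-squares linear trend fitted to yearly average salaries. The sketch below uses invented averages, not results from this project, and the names `years`, `avg_salary`, and `predict` are hypothetical. With only three yearly points any extrapolation is very uncertain, which is one reason a longer timeframe would be needed before applying more capable models.

```python
import numpy as np

# Invented yearly averages for illustration; not results from this project.
years = np.array([2020, 2021, 2022])
avg_salary = np.array([100_000.0, 95_000.0, 120_000.0])

# Centre the years so the fit is numerically well conditioned.
year_offset = years.mean()
slope, intercept = np.polyfit(years - year_offset, avg_salary, deg=1)


def predict(year):
    """Extrapolate the fitted linear trend to a given year."""
    return slope * (year - year_offset) + intercept


forecast_2023 = predict(2023)
print(f"Trend: {slope:,.0f} USD per year")
print(f"Naive 2023 forecast: {forecast_2023:,.0f} USD")
```

A baseline like this gives a reference point that any more elaborate model should be able to beat on held-out years.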