# What's the Best Data Job?

A company hired me to draw valuable insights from a data set because they think a blog entry with a title similar to "TOP DATA JOBS YOU SHOULD BE CONSIDERING" would be a great opportunity for their website. I accepted the job because, as a student pursuing a Data Science Bachelor's Degree, I have come across a lot of data-related jobs, and now I have the opportunity to analyze them and be helpful both to the company and to former classmates with their career paths.

## Who is my Audience?

This notebook is dedicated to my former classmates in the Data Science Bachelor's Programme who are overwhelmed by the vast number of data jobs out there.

## What do I want to Communicate?

I want to narrow down and break down the top jobs for Data Science students. To do this, I will first put the demand for and the salaries of data jobs over the years into context.
I want to tell people how much salary to expect along their career path, from an Entry Level position all the way to an Executive position.

Since all of us went through a pandemic and can now see the changes in the job market, I want my audience to know what to expect in terms of remote work and what remote work opportunities exist.

Another valuable insight I want to provide is how much money someone can expect in different countries.




# Libraries

In [195]:
# Loading Dataset and Dataframe Manipulation
import pandas as pd
import numpy as np

# Changing Country Codes
import pycountry

#Visualization
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

#Building Plotly Dashboard
import dash
import dash_core_components as dcc
import dash_html_components as html

#Assessment Brief
from IPython.display import Image



# Loading Data Set and Explanation

First, we take a look at the first five rows of our Data Set.

In [172]:
df = pd.read_csv('ds_salaries.csv')
df.head()

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary,salary_currency,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2023,SE,FT,Principal Data Scientist,80000,EUR,85847,ES,100,ES,L
1,2023,MI,CT,ML Engineer,30000,USD,30000,US,100,US,S
2,2023,MI,CT,ML Engineer,25500,USD,25500,US,100,US,S
3,2023,SE,FT,Data Scientist,175000,USD,175000,CA,100,CA,M
4,2023,SE,FT,Data Scientist,120000,USD,120000,CA,100,CA,M


# Exploratory Data Analysis (EDA) and Preprocessing

Here I perform a quick Exploratory Analysis to understand the data better, check for null values, identify the features, and look for possible changes that make the Data Set better suited to our pipeline.

In [173]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3755 entries, 0 to 3754
Data columns (total 11 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   work_year           3755 non-null   int64 
 1   experience_level    3755 non-null   object
 2   employment_type     3755 non-null   object
 3   job_title           3755 non-null   object
 4   salary              3755 non-null   int64 
 5   salary_currency     3755 non-null   object
 6   salary_in_usd       3755 non-null   int64 
 7   employee_residence  3755 non-null   object
 8   remote_ratio        3755 non-null   int64 
 9   company_location    3755 non-null   object
 10  company_size        3755 non-null   object
dtypes: int64(4), object(7)
memory usage: 322.8+ KB


In [174]:
df.isnull().sum()

work_year             0
experience_level      0
employment_type       0
job_title             0
salary                0
salary_currency       0
salary_in_usd         0
employee_residence    0
remote_ratio          0
company_location      0
company_size          0
dtype: int64

There is not a single null value in our DataFrame.

In [175]:
numerical_columns = df.select_dtypes(include=['int64', 'float64'])
categorical_columns = df.select_dtypes(include=['object'])

display(numerical_columns)
display(categorical_columns)

Unnamed: 0,work_year,salary,salary_in_usd,remote_ratio
0,2023,80000,85847,100
1,2023,30000,30000,100
2,2023,25500,25500,100
3,2023,175000,175000,100
4,2023,120000,120000,100
...,...,...,...,...
3750,2020,412000,412000,100
3751,2021,151000,151000,100
3752,2020,105000,105000,100
3753,2020,100000,100000,100


Unnamed: 0,experience_level,employment_type,job_title,salary_currency,employee_residence,company_location,company_size
0,SE,FT,Principal Data Scientist,EUR,ES,ES,L
1,MI,CT,ML Engineer,USD,US,US,S
2,MI,CT,ML Engineer,USD,US,US,S
3,SE,FT,Data Scientist,USD,CA,CA,M
4,SE,FT,Data Scientist,USD,CA,CA,M
...,...,...,...,...,...,...,...
3750,SE,FT,Data Scientist,USD,US,US,L
3751,MI,FT,Principal Data Scientist,USD,US,US,L
3752,EN,FT,Data Scientist,USD,US,US,S
3753,EN,CT,Business Data Analyst,USD,US,US,L


## Changes to be Made
- We can convert the work year to a categorical column (stored as strings)
- Change the labels of experience level and employment type to their full descriptions
- We want salaries in the same currency so we can compare them easily, so we keep salary_in_usd and remove the salary and salary_currency columns
- Change company_location to ISO 3 (alpha-3) country codes so it works better with the Plotly library

In [176]:
df['work_year'] = df['work_year'].apply(str)
#df['remote_ratio'] = df['remote_ratio'].apply(str)

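# Assign the result back instead of calling replace(inplace=True) on a column selection,
# which newer pandas versions warn about and may silently ignore
df['experience_level'] = df['experience_level'].replace({
    'SE':'Senior Level',
    'MI':'Intermediate Level',
    'EN':'Entry Level',
    'EX':'Executive Level',
})

df['employment_type'] = df['employment_type'].replace({
    'FT':'Full-Time',
    'CT':'Contract',
    'PT':'Part-Time',
    'FL':'Freelance',
})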

df.drop(columns=['salary', 'salary_currency'], inplace=True)

In [181]:
#Asked ChatGPT to help me to change the values using pycountry library

# Create a dictionary mapping ISO-2 to ISO-3 country codes
iso2_to_iso3 = {country.alpha_2: country.alpha_3 for country in pycountry.countries}

# Map ISO-2 country codes to ISO-3 country codes in the DataFrame
df['company_location'] = df['company_location'].map(iso2_to_iso3)

df


Unnamed: 0,work_year,experience_level,employment_type,job_title,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
0,2023,Senior Level,Full-Time,Principal Data Scientist,85847,ES,100,ESP,L
1,2023,Intermediate Level,Contract,ML Engineer,30000,US,100,USA,S
2,2023,Intermediate Level,Contract,ML Engineer,25500,US,100,USA,S
3,2023,Senior Level,Full-Time,Data Scientist,175000,CA,100,CAN,M
4,2023,Senior Level,Full-Time,Data Scientist,120000,CA,100,CAN,M
...,...,...,...,...,...,...,...,...,...
3750,2020,Senior Level,Full-Time,Data Scientist,412000,US,100,USA,L
3751,2021,Intermediate Level,Full-Time,Principal Data Scientist,151000,US,100,USA,L
3752,2020,Entry Level,Full-Time,Data Scientist,105000,US,100,USA,S
3753,2020,Entry Level,Contract,Business Data Analyst,100000,US,100,USA,L
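
One detail worth keeping in mind with this approach: `Series.map` with a dictionary turns every code that is missing from the dictionary into `NaN` instead of raising an error. The sketch below illustrates this on toy data. The three-entry dictionary `iso2_to_iso3_demo` is a hand-written stand-in for the pycountry-based one, and `"XX"` is a made-up code, so both are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the pycountry-based dictionary built above
iso2_to_iso3_demo = {"ES": "ESP", "US": "USA", "CA": "CAN"}

# Toy column; "XX" is a made-up code that is not in the mapping
codes = pd.Series(["ES", "US", "XX", "CA"], name="company_location")

mapped = codes.map(iso2_to_iso3_demo)

# Codes that could not be translated show up as NaN after .map()
unmapped = codes[mapped.isna()].unique().tolist()

print(mapped.tolist())
print("Unmapped codes:", unmapped)
```

Running a check like this on the real column before plotting would show whether any country silently dropped out of the map.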


## Explanatory Data Analysis

First, I want to check how many different Job Titles there are in the Data.

In [179]:
df['job_title'].describe()

count              3755
unique               93
top       Data Engineer
freq               1040
Name: job_title, dtype: object

We have 93 different Data Job Positions available.

### Many Job Titles in the Data Set have just a few occurrences, so I'll work only with the Job Titles that have more than 30 occurrences

Here I create a new Dataset from the original one, keeping only the positions stored in filtered_job_titles, which are the ones that appear more than 30 times.

In [182]:
job_titles = df['job_title'].value_counts()
# Keep only the titles that appear more than 30 times (the titles are the index of the counts)
filtered_job_titles = job_titles[job_titles > 30]
# Filtering on the index avoids relying on the column names that reset_index() produces,
# which differ between pandas versions
top_jobs_df = df[df['job_title'].isin(filtered_job_titles.index)]
top_jobs_df

Unnamed: 0,work_year,experience_level,employment_type,job_title,salary_in_usd,employee_residence,remote_ratio,company_location,company_size
1,2023,Intermediate Level,Contract,ML Engineer,30000,US,100,USA,S
2,2023,Intermediate Level,Contract,ML Engineer,25500,US,100,USA,S
3,2023,Senior Level,Full-Time,Data Scientist,175000,CA,100,CAN,M
4,2023,Senior Level,Full-Time,Data Scientist,120000,CA,100,CAN,M
5,2023,Senior Level,Full-Time,Applied Scientist,222200,US,0,USA,L
...,...,...,...,...,...,...,...,...,...
3746,2021,Intermediate Level,Full-Time,Data Scientist,119059,SG,100,ISR,M
3748,2021,Intermediate Level,Full-Time,Data Engineer,28369,MT,50,MLT,L
3750,2020,Senior Level,Full-Time,Data Scientist,412000,US,100,USA,L
3752,2020,Entry Level,Full-Time,Data Scientist,105000,US,100,USA,S
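
To make the filtering step easier to follow, here is a minimal sketch of the same idea on a toy table. The data and the threshold of 2 are made up for illustration (the notebook uses 30), and the names `toy`, `min_count`, `frequent_titles` and `toy_top` are hypothetical.

```python
import pandas as pd

# Toy data: one frequent title, one medium title, one rare title
toy = pd.DataFrame({
    "job_title": ["Data Engineer"] * 4 + ["Data Analyst"] * 3 + ["Data Lead"],
    "salary_in_usd": [100, 110, 120, 130, 80, 85, 90, 150],
})

min_count = 2  # illustrative threshold; the notebook uses 30

# Count how often each title appears, then keep titles above the threshold
counts = toy["job_title"].value_counts()
frequent_titles = counts[counts > min_count].index

# Keep only the rows whose title is one of the frequent ones
toy_top = toy[toy["job_title"].isin(frequent_titles)]

print(sorted(frequent_titles))
print(len(toy_top))
```

The rare title is dropped together with its rows, which is why the filtered table is shorter than the original one.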


# Growth of Data Jobs Over the Years

This graph visualizes the number of job listings per year from 2020 to 2023, which are the years available in the Dataset.

### Decision Made on the Graph

The title gives the message that I want to communicate.

I chose to make a line graph because the data is a variable that evolves over time, and a line makes the ups and downs easy to appreciate.

**Thinking Like a Designer**

Even though I know the bars are redundant, because what I want to explain can be told by the line alone, I think they work as a visual aid that makes a simple thing more obvious. I added the bars in the background with a gradient of the same color, going from 'Linen' for low values to 'Plum' for higher values.

In [183]:
jobcounts_years = df['work_year'].value_counts().sort_index()

#Colorscale for the bars 
colorscale = [
    [0, 'linen'], 
    [1, 'plum']    
]

fig1 = go.Figure(go.Scatter(
    x=jobcounts_years.index,
    y=jobcounts_years.values,
    mode='lines+markers',
    line=dict(color='blueviolet'),
    marker=dict(symbol='circle',
                size=15,
                color='blueviolet'),
))
# Asked ChatGPT to trace the Bars and make use of the colorscale to change color depending on the value
fig1.add_trace(go.Bar(
    name='bargraph',
    x=jobcounts_years.index,
    y=jobcounts_years.values,
    marker=dict(
        color=jobcounts_years.values,  # Use the counts as the color data
        colorscale=colorscale,         # Apply the linen-to-plum color scale
        showscale=False                # Hide the color bar
    ),
))

# Asked ChatGPT how to change the background color
fig1.update_layout(
    title={
        'text': 'The Demand for Jobs in the Data Family has increased over the last 4 Years',
        'x': 0.5,  # Set the x position to center
        'y': 0.95,  # Adjust the y position if needed
        'xanchor': 'center',
        'yanchor': 'top',
        'font': {'size': 24}  # Increase the title font size
    },
    xaxis_title='Year',
    yaxis_title='Job Listings',
    plot_bgcolor='whitesmoke',  
    paper_bgcolor='whitesmoke',
    xaxis=dict(
        tickfont=dict(size=20),
        title=dict(text='Year', font=dict(size=24))
        ),
    yaxis=dict(
        tickfont=dict(size=20),
        title=dict(text='Job Listings', font=dict(size=24))
        ),
    showlegend=False
)

fig1.show()

# Salaries Over Time

In [184]:
salaries_over_years = df.groupby('work_year')['salary_in_usd'].mean().reset_index()

fig2 = go.Figure(go.Scatter(
    x=salaries_over_years['work_year'],
    y=salaries_over_years['salary_in_usd'],
    mode="lines+markers+text",
    marker=dict(symbol='arrow-right',
                size=15,
                color='blueviolet'),
    text=salaries_over_years['salary_in_usd'].map(lambda v: f"${v:,.0f}"),  # e.g. $123,456 instead of a long float
    textfont=dict(size=13),
    textposition='top center',
))

# Asked ChatGPT how to change the background color and the title
fig2.update_layout(
    title={
        'text': 'The Salaries for Jobs in the Data Family have increased over the last 4 Years',
        'x': 0.5,
        'y': 0.95,
        'xanchor': 'center',
        'yanchor': 'top',
        'font': {'size': 24} 
    },
    xaxis_title='Year',
    yaxis_title='Average Salary (USD)',
    plot_bgcolor='whitesmoke',  
    paper_bgcolor='whitesmoke',
    xaxis=dict(
        tickfont=dict(size=20),
        title=dict(text='Year', font=dict(size=24)),
        showgrid=False
        ),
    yaxis=dict(
        tickfont=dict(size=20),
        title=dict(text='Average Salary (USD)', font=dict(size=24)),
        showgrid=False
        )
)
# Adding vertical blocks (shaded background bands between year positions)
fig2.add_vrect(
    x0=0, x1=1,
    fillcolor="tomato", opacity=0.4,
    layer="below", line_width=0,
)

fig2.add_vrect(
    x0=1, x1=3,
    fillcolor="lime", opacity=0.4,
    layer="below", line_width=0,
)


fig2.show()
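
The shaded bands split the timeline into two periods. One way to put numbers behind how the average salary moved from one year to the next is the year-over-year percentage change. The sketch below is only an illustration on made-up salaries, not a result from the Data Set, and the names `toy`, `avg`, `yoy_pct` and `labels` are hypothetical.

```python
import pandas as pd

# Made-up salaries, two rows per year, purely for illustration
toy = pd.DataFrame({
    "work_year": ["2020", "2020", "2021", "2021", "2022", "2022"],
    "salary_in_usd": [90000, 110000, 80000, 100000, 120000, 140000],
})

# Average salary per year, as in the chart above
avg = toy.groupby("work_year")["salary_in_usd"].mean()

# Percentage change relative to the previous year (the first year has no predecessor)
yoy_pct = avg.pct_change() * 100

# Readable labels for the markers, e.g. "$100,000"
labels = avg.map(lambda v: f"${int(round(v)):,}")

print(labels.tolist())
print(yoy_pct.round(1).tolist())
```

A negative value marks a year in which the average fell, a positive value a year in which it rose.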

# Remote Opportunities by Experience Level

In [185]:
remote_opportunities_by_year = df.groupby(['work_year', 'experience_level'])['remote_ratio'].mean().reset_index()

# Create a figure
fig3 = go.Figure()

# Get unique experience levels
experience_levels = remote_opportunities_by_year['experience_level'].unique()
color_palette = ['slateblue', 'red', 'green', 'purple']
# Add one trace per experience level, using the same color as its annotation below
for experience_level, color in zip(experience_levels, color_palette):
    level_data = remote_opportunities_by_year[remote_opportunities_by_year['experience_level'] == experience_level]
    fig3.add_trace(go.Scatter(
        x=level_data['work_year'],
        y=level_data['remote_ratio'],
        name=experience_level,
        mode='lines',
        line=dict(color=color, shape='linear'),
    ))

fig3.add_vrect(
    x0=2, x1=3,
    fillcolor="red", opacity=0.4,
    layer="below", line_width=0,
)

text_values = ['Entry Level', 'Executive Level', 'Intermediate Level', 'Senior Level']
for i in range(len(text_values)):
    fig3.add_annotation(
        text=text_values[i],
        x=0,
        y=remote_opportunities_by_year[remote_opportunities_by_year['experience_level'] == text_values[i]]['remote_ratio'].iloc[0],
        xref="x",
        yref="y",
        showarrow=True,
        ax=-70,
        ay=0,
        font=dict(color=color_palette[i], size=14),
        arrowcolor=color_palette[i],
    )
fig3.update_layout(
    title={
        'text': 'The Remote Work Opportunities are declining in 2023 regardless of the Experience Level',
        'x': 0.5,
        'y': 0.95,
        'xanchor': 'center',
        'yanchor': 'top',
        'font': {'size': 24} 
    },
    xaxis_title=dict(text='Year', font=dict(size=25)),
    yaxis_title=dict(text='Remote Ratio', font=dict(size=25)),
    showlegend=False,
    plot_bgcolor='whitesmoke',  
    paper_bgcolor='whitesmoke',
)

fig3.show()
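
The chart above uses the mean of `remote_ratio`, which blends on-site, hybrid and fully remote jobs into one number. As a complementary view, the share of fully remote jobs per year can be computed as well. This is a minimal sketch on made-up rows, assuming that a `remote_ratio` of 100 means fully remote. The names `toy`, `avg_ratio` and `share_fully_remote` are hypothetical.

```python
import pandas as pd

# Made-up rows for illustration; remote_ratio takes the values 0, 50 and 100
toy = pd.DataFrame({
    "work_year": ["2022", "2022", "2023", "2023", "2023", "2023"],
    "remote_ratio": [100, 100, 100, 0, 0, 50],
})

# Mean remote ratio per year (what the line chart plots)
avg_ratio = toy.groupby("work_year")["remote_ratio"].mean()

# Share of rows per year that are fully remote, in percent
share_fully_remote = (toy["remote_ratio"] == 100).groupby(toy["work_year"]).mean() * 100

print(avg_ratio.to_dict())
print(share_fully_remote.to_dict())
```

The two numbers can differ noticeably when many jobs are hybrid, so looking at both gives a fuller picture of the remote trend.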


# Pay by Experience Level

In [186]:
pay_by_exp = df.groupby('experience_level')['salary_in_usd'].mean().reset_index()

category_order = ['Entry Level', 'Intermediate Level', 'Senior Level', 'Executive Level']

pay_by_exp['experience_level'] = pd.Categorical(pay_by_exp['experience_level'], categories=category_order, ordered=True)
pay_by_exp = pay_by_exp.sort_values('experience_level')

fig4 = go.Figure()

fig4.add_trace(go.Scatter(
    x=pay_by_exp['experience_level'],
    y=pay_by_exp['salary_in_usd'],
    mode='lines+markers+text',
    fill='tozeroy',
    #Asked ChatGPT How to make the line Smoother
    line=dict(
        color='darkmagenta',
        width=2,
        shape='spline',
        smoothing=1.3 
    )
))
fig4.update_layout(
    title={
        'text': "Salary per Experience Level",
        'y':0.9,
        'x':0.5,
        'font': {'size': 30},
        'xanchor': 'center',
        'yanchor': 'top'},

    xaxis_title='Experience Level',
    yaxis_title='Salary (USD)',
    margin=dict(l=40, r=100, t=80, b=100),
    xaxis=dict(range=[-0.2, 3.01], tickfont=dict(size=17), title=dict(text='Experience Level', font=dict(size=24))),
    yaxis=dict(range=[78000, 220000], tickfont=dict(size=17), title=dict(text='Salary in USD', font=dict(size=24))),
    plot_bgcolor='snow',  
    paper_bgcolor='snow',
)

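# Iterate over the rows in their sorted order; positional [i] lookups would use the
# old index labels, which no longer match the row order after sort_values()
for row in pay_by_exp.itertuples():
    fig4.add_annotation(
        x=row.experience_level,
        y=row.salary_in_usd,
        xref="x",
        yref="y",
        text=f"${row.salary_in_usd:,.0f}",  # rounded, with thousands separators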
        showarrow=True,
        font=dict(
            family="Courier New, monospace",
            size=16,
            color="white",
        ),
        align="center",
        arrowhead=2,
        arrowsize=1,
        arrowwidth=2,
        arrowcolor="indigo",
        ax=0,
        ay=-38,
        bordercolor="indigo",
        borderwidth=2,
        borderpad=4,
        bgcolor="indigo",
        opacity=0.6
    )

fig4.show()

In [187]:
avg_salary_by_location = df.groupby('company_location')['salary_in_usd'].mean().reset_index()

# Create a choropleth map
fig5 = go.Figure(data=go.Choropleth(
    locations=avg_salary_by_location['company_location'],
    z=avg_salary_by_location['salary_in_usd'],
    locationmode='ISO-3',
    colorscale='BuPu',
    colorbar_title='Average Salary',
))

# Customize the map layout
fig5.update_layout(
    title={
        'text': "Average Salaries By location",
        'y':0.9,
        'x':0.5,
        'font': {'size': 30},
        'xanchor': 'center',
        'yanchor': 'top'},
    geo=dict(showframe=False, showcoastlines=False),
    height=800,  # Adjust the height of the map
    width=1500,  # Adjust the width of the map
)

# Show the map
fig5.show()


# What's the most Wanted Job Around the World?

In [188]:
top_jobs = top_jobs_df['job_title'].value_counts().sort_values(ascending=True)
top_jobs = top_jobs[5:]

fig6 = go.Figure(go.Bar(
    y=top_jobs.index,
    x=top_jobs.values,
    orientation='h',
    marker=dict(color=['lightgray'] * (len(top_jobs) - 3) + ['pink', 'plum', 'mediumpurple']),  # Asked ChatGPT how to color individual bars; the top three are highlighted
))

# Asked ChatGPT to place a hexagram marker on each of the top three bars
for i in range(3):
    fig6.add_trace(go.Scatter(
        x=[top_jobs.values[-i - 1]],  # X position taken from the value of the last 3 bars
        y=[top_jobs.index[-i - 1]],  # Y position taken from the index of the last 3 bars
        mode='markers',
        marker=dict(symbol='hexagram', color='gold', size=30),
        showlegend=False
    ))


text_values = ['3', '2', '1']
for i in range(len(text_values)):
    fig6.add_annotation(
        x=top_jobs.values[-3 + i],  # X position taken from the values of the last 3 bars
        y=top_jobs.index[-3 + i],  # Y position taken from the index of the last 3 bars
        xref="x",
        yref="y",
        text=text_values[i],  # Rank label: '3', '2', '1' (the last bar is rank 1)
        showarrow=False,
        font=dict(
            family="Courier New, monospace",
            size=25,
            color="black",
        ),
        align="center",
        arrowhead=2,
        arrowsize=1,
        arrowwidth=2,
        arrowcolor="indigo",
        ax=0,
        ay=-38,
    )


fig6.update_layout(
    title={
        'text': "From 2020 to 2023 the most wanted job is Data Engineer",
        'y':0.9,
        'x':0.5,
        'font': {'size': 30},
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title=dict(text='Amount of Job Listings', font=dict(size=25)),
    showlegend=False,
    plot_bgcolor='snow',  
    paper_bgcolor='snow',
)



fig6.show()

# Salary Ranges

In [189]:
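# Keep only the two columns needed, for the job titles shown in the previous chart
topjobs_salaries = df.loc[df['job_title'].isin(top_jobs.index), ['job_title', 'salary_in_usd']]

# Asked ChatGPT how to compute the median and the quartiles per job title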
median_values = topjobs_salaries.groupby('job_title')['salary_in_usd'].median()
lower_q = topjobs_salaries.groupby('job_title')['salary_in_usd'].quantile(0.25)
upper_q = topjobs_salaries.groupby('job_title')['salary_in_usd'].quantile(0.75)
job_titles = median_values.index

fig8 = go.Figure()

# Add a horizontal bar per job title that spans the interquartile range (from lower_q to upper_q)
fig8.add_trace(go.Bar(
    y=job_titles,
    x=upper_q - lower_q,
    name='IQR',
    orientation='h',
    marker_color='mediumpurple',
    base=lower_q,
    hoverinfo='x',
    showlegend=False
))

# Mark the median with two overlapping markers (arrow-down and arrow-up) at the same point
fig8.add_trace(go.Scatter(
    y=job_titles,
    x=median_values,
    name='Median',
    mode='markers',
    marker=dict(symbol='arrow-down', color='whitesmoke', size=16),
))
fig8.add_trace(go.Scatter(
    y=job_titles,
    x=median_values,
    name='Median',
    mode='markers',
    marker=dict(symbol='arrow-up', color='whitesmoke', size=16),
))

fig8.update_layout(
    title={
        'text': 'Median and IQR of Salary by Job Title',
        'y':0.9,
        'x':0.5,
        'font': {'size': 30},
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title=dict(text='Salary', font=dict(size=25)),
    yaxis_title=dict(text='Job Title', font=dict(size=25)),
    plot_bgcolor='lavenderblush',
    paper_bgcolor='lavenderblush',
    showlegend=False
)

fig8.show()
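
The three separate `groupby` calls above can also be written as a single quantile computation. The following is a minimal sketch on made-up salaries, with hypothetical names (`toy`, `quartiles`), to show how the median and the interquartile range (IQR) that the bars represent are derived.

```python
import pandas as pd

# Made-up salaries for two job titles, purely for illustration
toy = pd.DataFrame({
    "job_title": ["Data Engineer"] * 5 + ["Data Analyst"] * 5,
    "salary_in_usd": [100, 120, 140, 160, 180, 60, 70, 80, 90, 100],
})

# One pass: 25th, 50th and 75th percentile per job title, one column per quantile
quartiles = (
    toy.groupby("job_title")["salary_in_usd"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["lower_q", "median", "upper_q"]

# The IQR is the width of each bar; lower_q is where the bar starts (its base)
quartiles["iqr"] = quartiles["upper_q"] - quartiles["lower_q"]

print(quartiles)
```

The IQR covers the middle 50% of the salaries of each job title, so a wider bar means a wider spread of typical pay.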



# Best Place to Work for the Top Job Titles

In [190]:
experience_df = top_jobs_df.groupby(['experience_level','company_size']).agg({'salary_in_usd':'mean'}).reset_index()

fig7 = go.Figure()

category_order = ['Entry Level', 'Intermediate Level', 'Senior Level', 'Executive Level']
colors = {'S': 'lightgray', 'M': 'mediumpurple', 'L': 'lightgray'}

for company_size, color in colors.items():
    filtered_data = experience_df[experience_df['company_size'] == company_size]
    fig7.add_trace(go.Bar(
        x=filtered_data['experience_level'],
        y=filtered_data['salary_in_usd'],
        name=company_size,
        marker=dict(color=color),
    ))

# The layout only needs to be set once, so this call sits outside the loop
fig7.update_layout(
    title={
        'text': 'Among All Experience Levels, Salaries offered by Medium Companies are on Average higher than Small and Large Companies',
        'y':0.9,
        'x':0.1,
        'font': {'size': 25},
        'xanchor': 'left',
        'yanchor': 'top'},
    xaxis_title=dict(text='Experience Level', font=dict(size=25)),
    yaxis_title=dict(text='Salary (USD)', font=dict(size=25)),
    showlegend=False,
    plot_bgcolor='snow',
    paper_bgcolor='snow',
    xaxis=dict(categoryorder='array', categoryarray=category_order),
    yaxis=dict(showgrid=False),
)

fig7.add_annotation(
    text='Medium Size Company',
    x='Entry Level',
    y=130000,
    showarrow=False,
    font=dict(color='purple')
)
fig7.add_annotation(
    text='Medium Size Company',
    x='Intermediate Level',
    y=144000,
    showarrow=False,
    font=dict(color='purple')
)
fig7.add_annotation(
    text='Medium Size Company',
    x='Senior Level',
    y=185000,
    showarrow=False,
    font=dict(color='purple')
)
fig7.add_annotation(
    text='Medium Size Company',
    x='Executive Level',
    y=230000,
    showarrow=False,
    font=dict(color='purple')
)
fig7.show()
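
A claim like the one in the chart title can be double-checked in a table before it goes into a headline. The sketch below uses made-up numbers and hypothetical names (`toy`, `pivot`, `best_size`), so it shows the method only, not a result from the Data Set.

```python
import pandas as pd

# Made-up average salaries per experience level and company size
toy = pd.DataFrame({
    "experience_level": ["Entry Level"] * 3 + ["Senior Level"] * 3,
    "company_size": ["S", "M", "L"] * 2,
    "salary_in_usd": [60, 90, 80, 120, 170, 150],
})

# One row per experience level, one column per company size
pivot = toy.pivot_table(
    index="experience_level",
    columns="company_size",
    values="salary_in_usd",
    aggfunc="mean",
)

# For each experience level, the company size with the highest average salary
best_size = pivot.idxmax(axis=1)

print(pivot)
print(best_size.to_dict())
```

If `best_size` shows the same company size for every experience level, the title holds for all of them. If not, the title should name the exceptions.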


# Building a Basic Dashboard with Plotly Dash

### Link to Youtube Video to See How it Works


In [191]:

app = dash.Dash()

app.layout = html.Div([

    dcc.Graph(figure=fig1), 
    dcc.Graph(figure=fig2),
    dcc.Graph(figure=fig3),
    dcc.Graph(figure=fig4),
    dcc.Graph(figure=fig5),
    dcc.Graph(figure=fig6), 
    dcc.Graph(figure=fig8), 
    dcc.Graph(figure=fig7), 
])


In [192]:
app.run_server(debug=True, use_reloader=False)

Dash is running on http://127.0.0.1:8050/

 * Serving Flask app "__main__" (lazy loading)
 * Environment: production
   Use a production WSGI server instead.
 * Debug mode: on


# Assessment Submission Form

In [None]:
Image(filename='/Users/diegoportillaamarillas/Documents/GISMA/Assessment Submission Form Data Visualization.jpg')
