# Evaluation and Overview

> Restating the work done and the evaluation process.

## Overview

For this project I wanted to present information about data jobs in a way that makes it easier to judge a potential salary based on your location, the company's location, whether the job is remote, and your seniority. To do this I made a series of visualizations, which were then evaluated by collecting feedback from people working in data-related jobs. Based on that feedback I made some changes to the visualizations, and I present the results below.

### Data

The data for this project comes from a dataset of salaries for data-related jobs. The raw data comes from ai-jobs.net, and the dataset can be found on [Kaggle](https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries). It consists of 607 positions over 3 years (2020-2022). In addition to the salary and the year, each position also has basic information about the company and the type of job.


### Goals and Tasks

Based on the information in the dataset and some preliminary exploration, I wanted to make visualizations that both show the findings I found interesting and let people reason about the data without working with it directly. This led me to select 5 areas that had an important effect on salary and that I wanted to showcase:

- Company Location
- Working for a Foreign Company
- Company Size and Experience
- Remote Work
- Yearly Changes

### Design

The 5 areas listed above were the key considerations in choosing which data relationships to visualize and how to do it. While these areas are still quite broad, they already provide enough of a base to start building visualizations that give some insight into them. However, because the questions are open-ended, it's still important to show a bit more information so that people can find their own insights.

Based on that, I tried to balance the simplicity and expressivity of the visualizations. One of the main ways of achieving that was showing distributions of averages. While these visualizations are a bit harder to interpret, I thought the trade-off was justified by the technical literacy of the target audience and by the expressivity gained. I believe the feedback received during the evaluation confirms this.
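
To illustrate why a distribution can say more than a single average, here is a minimal sketch using only the Python standard library. The salary figures are hypothetical and are not taken from the dataset; they only show how one very high salary pulls the mean away from what most people in the group earn.

```python
import statistics

# Hypothetical salaries (USD) for one group; NOT taken from the dataset.
salaries = [60_000, 70_000, 75_000, 80_000, 85_000, 90_000, 400_000]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)
# Three cut points (Q1, Q2, Q3) that split the sorted data into quarters.
quartiles = statistics.quantiles(salaries, n=4)

print(f"mean:      {mean_salary:,.0f}")
print(f"median:    {median_salary:,.0f}")
print(f"quartiles: {quartiles}")
```

In this toy group the mean ends up above the third quartile, so a chart that shows only the average would overstate the typical salary, while a chart that shows the spread makes the outlier visible.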

### Evaluation

To evaluate the visualizations, I decided to show them to three people who have worked in data-related jobs and listen to their feedback. The evaluation was structured in the following way:

- First, I told them the general topic of the visualizations
- Second, they looked through each visualization while sharing their thoughts on it
- After that, I explained what each visualization was meant to show and we had a final discussion about them

#### Feedback

I've condensed the feedback I received into a brief review of each visualization:

- Company Location: generally good and clear, but would like more data points for some countries
- Working for a Foreign Company:
  - salary distributions: not clear what it's trying to show/compare, the tail is difficult to see
  - average salary: not readable at all
- Company Size and Experience: the visualization is ok, but it's not clear that it's clickable/selectable
- Remote Work: everything's good
- Yearly Changes: everything's good

From this list, it's fairly straightforward to improve the visualizations by making them clearer.
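
The Company Location feedback about wanting more data points can be checked by counting how many positions each country has. The sketch below assumes pandas and uses a small hypothetical table rather than the real dataset; the `MIN_POINTS` threshold is an arbitrary choice for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the prepared dataset; only the column name
# mirrors the one used in the charts.
toy = pd.DataFrame(
    {"Company Location": ["US", "US", "US", "US", "DE", "DE", "VN"]}
)

MIN_POINTS = 3  # assumed threshold below which a country counts as sparse

# Number of positions per country.
counts = toy["Company Location"].value_counts()

# Countries with too few positions to read much into their salaries.
sparse = sorted(counts[counts < MIN_POINTS].index)
print(sparse)
```

Countries flagged this way could be annotated or grouped in the chart so that readers do not over-interpret a handful of points.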


## Final Visualizations

Here are the visualizations, with slight modifications based on the feedback received.

In [None]:
#| hide
import warnings

In [None]:
#| hide
warnings.filterwarnings("ignore")

In [None]:
#| hide

import altair as alt
import pandas as pd
import numpy as np

from dataviz_course.explore_data import prepare_dataset

In [None]:
#| hide

data = prepare_dataset()

In [None]:
#| hide
WIDTH = 400
HEIGHT = 300

In [None]:
#| hide

sort_max_salary = alt.EncodingSortField(field="Salary (usd)", op="max")
salary_scale = alt.Scale(domain=[data["Salary (usd)"].min(), data["Salary (usd)"].max()])

def get_selection_opacity(selection):
    return alt.condition(selection, alt.value(1), alt.value(.2))

### Company Location

In [None]:
#| echo: false

alt.Chart(data).mark_circle().encode(
    y=alt.Y(field="Salary (usd)", type="quantitative"),
    x=alt.X(field="Company Location", type="nominal", sort=sort_max_salary),
    tooltip=["Salary (usd)", "Job Title", "Employee Residence", "Number of Employees"]
).properties(height=HEIGHT*1.2, width=WIDTH*1.2, title="Company Location vs Salary")

### Working for a Foreign Company

In [None]:
#| echo: false

alt.Chart(data).transform_density(
    density="Salary (usd)",
    groupby=["Working for a Foreign Company"],
    counts=True
).mark_line().encode(
    x=alt.X(field="value", type="quantitative", title="Salary (usd)"),
    y=alt.Y("density", type="quantitative", title="Normalized Count"),
    color="Working for a Foreign Company"
).properties(width=WIDTH, height=HEIGHT, title="Salary for Domestic or Foreign Employment")

In [None]:
#| hide

averages = data.groupby(['Employee Residence', 'Working for a Foreign Company'])['Salary (usd)'].mean().reset_index()
averages.head()

|   | Employee Residence | Working for a Foreign Company | Salary (usd) |
|---|---|---|---|
| 0 | AE | False | 100000.0 |
| 1 | AR | True | 60000.0 |
| 2 | AT | False | 76738.666667 |
| 3 | AU | False | 108042.666667 |
| 4 | BE | False | 85699.0 |


In [None]:
#| hide

pivoted = averages.pivot(index='Employee Residence', columns='Working for a Foreign Company', values='Salary (usd)')
pivoted = pivoted[(~pivoted.isna().any(axis=1))]
pivoted = pivoted / pivoted[False].values[:, None]

pivoted.head()

| Employee Residence | Working for a Foreign Company = False | Working for a Foreign Company = True |
|---|---|---|
| BR | 1.0 | 4.873853 |
| CA | 1.0 | 1.031128 |
| DE | 1.0 | 0.912368 |
| ES | 1.0 | 1.842517 |
| FR | 1.0 | 1.171921 |


In [None]:
#| hide

increase = pd.DataFrame(
    {
        "Domestic": pivoted[False],
        "Foreign": pivoted[True],
    }
).reset_index()
increase.head()

|   | Employee Residence | Domestic | Foreign |
|---|---|---|---|
| 0 | BR | 1.0 | 4.873853 |
| 1 | CA | 1.0 | 1.031128 |
| 2 | DE | 1.0 | 0.912368 |
| 3 | ES | 1.0 | 1.842517 |
| 4 | FR | 1.0 | 1.171921 |


In [None]:
#| hide

melted_increase = increase.melt('Employee Residence', var_name='Working for a Foreign Company', value_name='Residence Normalized Salary')
melted_increase.tail()

|   | Employee Residence | Working for a Foreign Company | Residence Normalized Salary |
|---|---|---|---|
| 33 | PT | Foreign | 0.690467 |
| 34 | RU | Foreign | 0.342857 |
| 35 | SG | Foreign | 1.333337 |
| 36 | US | Foreign | 1.265475 |
| 37 | VN | Foreign | 11.05 |


In [None]:
#| echo: false

chart = alt.Chart(melted_increase).mark_line(point=True).encode(
    x=alt.X('Working for a Foreign Company', type="nominal"),
    y=alt.Y('Residence Normalized Salary', type="quantitative"),
    color='Employee Residence',
    order='Employee Residence',
    tooltip=["Residence Normalized Salary", "Employee Residence"]
).properties(
    title='Residence-Normalized Difference in Average Salary',
    width=WIDTH,
    height=HEIGHT
)
chart
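
To make the "Residence Normalized Salary" axis easier to read, here is a minimal sketch of the normalization on a tiny hypothetical table (the numbers are made up, not from the dataset). It assumes pandas. For each residence, the average foreign salary is divided by the average domestic salary, so a value of 3.0 means foreign employers pay three times the domestic average on average. Residences that lack one of the two groups are dropped.

```python
import pandas as pd

# Hypothetical rows; column names mirror the dataset, values are invented.
toy = pd.DataFrame(
    {
        "Employee Residence": ["BR", "BR", "BR", "DE", "DE", "PT"],
        "Working for a Foreign Company": [False, True, True, False, True, False],
        "Salary (usd)": [20_000, 50_000, 70_000, 80_000, 72_000, 40_000],
    }
)

# Average salary per residence, split into domestic and foreign employment.
table = toy.pivot_table(
    index="Employee Residence",
    columns="Working for a Foreign Company",
    values="Salary (usd)",
    aggfunc="mean",
).rename(columns={False: "Domestic", True: "Foreign"})

# Keep only residences that have both domestic and foreign positions.
table = table.dropna()

# Foreign average expressed as a multiple of the domestic average.
ratio = table["Foreign"] / table["Domestic"]
print(ratio)
```

Here "PT" is dropped because it has no foreign positions, which is the same reason some residences are missing from the chart above.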


### Company Size and Experience

In [None]:
#| echo: false

options = ["<50", "50-250", ">250"]
dropdown = alt.binding_select(
    options=[None, *options],
    labels=["All", *options],
    name="Number of Employees: "
)
selection = alt.selection_single(fields=["Number of Employees"], bind=dropdown)

scatter = alt.Chart(data).mark_circle().encode(
    x=alt.X(field="Number of Employees", type="nominal", sort=alt.EncodingSortField(field="Salary (usd)", op="max")),
    y=alt.Y(field="Salary (usd)", type="quantitative"),
    color="Experience Level",
    tooltip=["Job Title", "Employee Residence", "Salary (usd)"],
    opacity=get_selection_opacity(selection)
).add_selection(selection).properties(width=WIDTH*0.4, height=HEIGHT, title="Salary based on Company Size")

histogram = alt.Chart(data).mark_bar().encode(
    x=alt.X(field="Salary (usd)", type="quantitative", bin=alt.Bin(step=50000), scale=salary_scale),
    y="count()",
    color="Experience Level",
    tooltip=["count()"]
).transform_filter(selection).properties(width=WIDTH*0.5, height=HEIGHT, title="Salary based on Experience Level").interactive()

scatter | histogram

### Remote Work

In [None]:
#| echo: false

salaries_chart = alt.Chart(data).mark_circle().encode(
    x=alt.X(field="On-site/Remote", type="nominal"),
    y="Salary (usd)",
    tooltip=["Salary (usd)", "Job Title", "Employee Residence", "Company Location"]
).properties(width=WIDTH, height=HEIGHT)

average_salaries_chart = alt.Chart(data).mark_line(color="black").encode(
    x=alt.X(field="On-site/Remote", type="nominal"),
    y=alt.Y("Salary (usd)", type="quantitative", aggregate="mean", axis=alt.Axis(title="Salary (usd)")),
).properties(width=WIDTH, height=HEIGHT, title="Salaries for Different Work Types")

salaries_chart + average_salaries_chart

### Yearly Changes

In [None]:
#| echo: false

salaries_chart = alt.Chart(data).mark_circle().encode(
    x=alt.X(field="Work Year", type="nominal"),  # match the nominal x-axis of the average line layered on top
    y="Salary (usd)",
    tooltip=["Job Title", "Employee Residence", "Company Location"]
).properties(width=WIDTH, height=HEIGHT)

average_salaries_chart = alt.Chart(data).mark_line(color="black").encode(
    x=alt.X(field="Work Year", type="nominal"),
    y=alt.Y("Salary (usd)", type="quantitative", aggregate="mean", axis=alt.Axis(title="Salary (usd)")),
).properties(width=WIDTH, height=HEIGHT, title="Salaries in each Year")

salaries_chart + average_salaries_chart

## Conclusions

The first round of evaluations proved quite successful, both confirming the validity of the design choices and providing some insight into how to improve them. More complex visualizations work, but they require careful consideration so that they don't become incomprehensible. When that's done right, the visualizations work particularly well for a more technical audience.

The visualizations themselves still have room for improvement, especially on the aesthetic front. All in all, I believe that these visualizations already work quite effectively, and that the audience is able to analyse the data and draw interesting conclusions without directly interacting with it.