# **Exploratory Data Analysis**

## Description

The purpose of this analysis is to identify general patterns in the data we collected, using the visualizations we plan to include in our dashboard.

## Imports

In [1]:
import altair as alt
import pandas as pd

alt.data_transformers.disable_max_rows();

## Extracting the data

In [2]:
data: pd.DataFrame = pd.read_csv(
    "../data/raw/global_graduate_employability_index.csv"
)

data.head()

Unnamed: 0,Country,Region,University_Name,Degree_Level,Field_of_Study,Graduation_Year,Employment_Rate_6_Months (%),Employment_Rate_12_Months (%),Average_Starting_Salary_USD,Top_Industry,Job_Role,Skill_1,Skill_2,Skill_3,Skill_Demand_Score (1–100),Remote_Work_Availability (%),Employer_Reputation_Score (1–100),Year
0,USA,North America,Harvard,Bachelor,Engineering,2017,79.3,85.6,66700,Manufacturing/Construction,Robotics Engineer,AutoCAD,Lean Six Sigma,MATLAB,69,8.8,66,2017
1,USA,North America,MIT,Bachelor,Engineering,2023,83.8,87.9,84500,Manufacturing/Construction,Civil Engineer,Lean Six Sigma,AutoCAD,MATLAB,71,65.4,63,2023
2,Israel,Middle East & Africa,Technion,Master,Healthcare & Medicine,2019,81.7,83.2,88300,Healthcare,Public Health Specialist,Clinical Research,Diagnostics,Patient Care,52,5.0,74,2019
3,India,Asia-Pacific,IIT Bombay,Master,Computer Science,2016,84.2,92.1,21000,Technology,AI Researcher,DevOps,Python,Cloud Computing,69,10.3,48,2016
4,South Africa,Middle East & Africa,University of Cape Town,PhD,Business & Finance,2023,83.6,86.3,48600,Finance/Consulting,Management Consultant,Financial Modeling,Data Analysis,Market Research,69,64.0,65,2023


## Summary statistics

Of the 18 columns, 10 are strings (mostly categorical features) and the remaining 8 are numeric features recorded for each combination of university, degree level, field of study, and graduation year.

In [3]:
data.info()

<class 'pandas.DataFrame'>
RangeIndex: 3500 entries, 0 to 3499
Data columns (total 18 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Country                            3500 non-null   str    
 1   Region                             3500 non-null   str    
 2   University_Name                    3500 non-null   str    
 3   Degree_Level                       3500 non-null   str    
 4   Field_of_Study                     3500 non-null   str    
 5   Graduation_Year                    3500 non-null   int64  
 6   Employment_Rate_6_Months (%)       3500 non-null   float64
 7   Employment_Rate_12_Months (%)      3500 non-null   float64
 8   Average_Starting_Salary_USD        3500 non-null   int64  
 9   Top_Industry                       3500 non-null   str    
 10  Job_Role                           3500 non-null   str    
 11  Skill_1                            3500 non-null   str    
 12  Ski

The data has no null values. We can now look at the summary statistics of the numeric columns:

In [4]:
data.describe().round(2)

Unnamed: 0,Graduation_Year,Employment_Rate_6_Months (%),Employment_Rate_12_Months (%),Average_Starting_Salary_USD,Skill_Demand_Score (1–100),Remote_Work_Availability (%),Employer_Reputation_Score (1–100),Year
count,3500.0,3500.0,3500.0,3500.0,3500.0,3500.0,3500.0,3500.0
mean,2020.02,85.74,90.41,64668.0,65.19,40.06,69.74,2020.02
std,3.2,6.99,6.74,30727.51,14.71,30.32,14.24,3.2
min,2015.0,65.5,68.0,-1200.0,30.0,5.0,40.0,2015.0
25%,2017.0,80.7,85.5,39400.0,55.0,10.6,60.0,2017.0
50%,2020.0,85.6,90.6,63900.0,64.0,37.3,70.0,2020.0
75%,2023.0,90.9,96.0,85400.0,75.0,68.3,79.25,2023.0
max,2025.0,99.0,100.0,189400.0,100.0,90.0,100.0,2025.0


In [5]:
string_cols: list[str] = list(
    data.select_dtypes(include=["object", "string"]).columns
)

print("Number of categories per string column:\n")

# Count distinct values per string column (a missing value would count as its own category)
summary: pd.DataFrame = pd.DataFrame({
    "n_categories": data[string_cols].nunique(dropna=False)
}).sort_values("n_categories", ascending=False)

summary

Number of categories per string column:



Unnamed: 0,n_categories
University_Name,46
Skill_3,34
Skill_2,34
Skill_1,34
Job_Role,29
Country,21
Top_Industry,7
Field_of_Study,7
Region,5
Degree_Level,3


## Cleaning the data

The summary statistics show a minimum starting salary of -1,200 USD, which is suspicious. We count how many rows have a zero or negative salary:

In [6]:
n_total: int = len(data)
n_neg_or_0_salary: int = (
    data["Average_Starting_Salary_USD"] <= 0
).sum()

quality_tbl = pd.DataFrame({
    "metric": ["total_rows", "negative_or_0__salary"],
    "count":  [n_total, n_neg_or_0_salary],
})
quality_tbl["pct_of_rows"] = quality_tbl["count"] / n_total
quality_tbl

Unnamed: 0,metric,count,pct_of_rows
0,total_rows,3500,1.0
1,negative_or_0__salary,1,0.000286


Only 1 row (about 0.03% of the data) has a zero or negative salary:

In [7]:
data.loc[data["Average_Starting_Salary_USD"] <= 0]

Unnamed: 0,Country,Region,University_Name,Degree_Level,Field_of_Study,Graduation_Year,Employment_Rate_6_Months (%),Employment_Rate_12_Months (%),Average_Starting_Salary_USD,Top_Industry,Job_Role,Skill_1,Skill_2,Skill_3,Skill_Demand_Score (1–100),Remote_Work_Availability (%),Employer_Reputation_Score (1–100),Year
2527,Argentina,Latin America,University of Buenos Aires,Master,Social Sciences,2020,91.7,97.8,-1200,Government/NGO,Human Resources Specialist,Communication,Qualitative Research,Public Policy,52,28.8,62,2020


A negative average starting salary is not a plausible value, so this is most likely a data error. Since it affects a single row (a 2020 record, from the pandemic period, when atypical cases are more likely), we drop it:

In [8]:
data = data.loc[data["Average_Starting_Salary_USD"] > 0]

We confirm that no zero or negative salaries remain:

In [9]:
data.loc[data["Average_Starting_Salary_USD"] <= 0]

Unnamed: 0,Country,Region,University_Name,Degree_Level,Field_of_Study,Graduation_Year,Employment_Rate_6_Months (%),Employment_Rate_12_Months (%),Average_Starting_Salary_USD,Top_Industry,Job_Role,Skill_1,Skill_2,Skill_3,Skill_Demand_Score (1–100),Remote_Work_Availability (%),Employer_Reputation_Score (1–100),Year
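
The same check can also be written as an assertion, so that a rerun fails loudly if bad rows reappear. The following is a minimal sketch on a small made-up frame; `drop_non_positive` is a hypothetical helper that is not part of this notebook.

```python
import pandas as pd


def drop_non_positive(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a copy of df without the rows where `column` is <= 0."""
    cleaned = df.loc[df[column] > 0].copy()
    # Fail loudly if any non-positive value survived the filter.
    assert (cleaned[column] > 0).all()
    return cleaned


# Toy data (made up for illustration) with one negative salary.
toy = pd.DataFrame({"Average_Starting_Salary_USD": [66700, -1200, 21000]})
toy_clean = drop_non_positive(toy, "Average_Starting_Salary_USD")

print(len(toy), "->", len(toy_clean))
```

Returning a copy keeps later column assignments on the cleaned frame from affecting the original.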


We can also check the salary distribution for additional odd cases:

In [10]:
salary_dist: alt.Chart = (
    alt.Chart(data)
    .mark_bar()
    .encode(
        x=alt.X(
            "Average_Starting_Salary_USD:Q",
            bin=alt.Bin(maxbins=30),
            title="Average Starting salary (USD)"
        ),
        y=alt.Y("count():Q", title="Count"),
        tooltip=[alt.Tooltip("count():Q", title="Count")]
    )
    .properties(
        width=500, height=300,
        title="Distribution of cleaned average starting salary (USD)"
    )
)

salary_dist.save("../img/eda_plots/clean_overall_salary_dist.png") 
salary_dist

![Overall Clean, Average Starting Salary Distribution](../img/eda_plots/clean_overall_salary_dist.png)

Apart from some low values, which may reflect factors such as differences in wage levels between countries, there are no other major anomalies.

We now save the cleaned data to the processed folder for future use, and then analyse some of the relevant numeric variables in our data set.

In [11]:
# index=False avoids writing the row index as an extra unnamed column
data.to_csv("../data/processed/processed_data.csv", index=False)
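
Whether the row index is written to the CSV changes the columns seen when the file is read back. The sketch below illustrates this with a made-up two-row frame and an in-memory buffer instead of a file; the variable names are only for illustration.

```python
import io

import pandas as pd

# Toy frame (made up); the notebook's real data is not needed here.
toy = pd.DataFrame({"Country": ["USA", "India"], "Year": [2017, 2016]})

# Default behaviour: the index is written as an extra, unnamed first column.
buffer_with_index = io.StringIO()
toy.to_csv(buffer_with_index)
buffer_with_index.seek(0)
cols_with_index = pd.read_csv(buffer_with_index).columns.tolist()

# With index=False only the data columns are written.
buffer_without_index = io.StringIO()
toy.to_csv(buffer_without_index, index=False)
buffer_without_index.seek(0)
cols_without_index = pd.read_csv(buffer_without_index).columns.tolist()

print(cols_with_index)     # ['Unnamed: 0', 'Country', 'Year']
print(cols_without_index)  # ['Country', 'Year']
```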

## Employment rate

### *Employment (6 months)*

In [12]:
print(
    "Global employment percentage after 6 months:\n"
    f'Mean: {data["Employment_Rate_6_Months (%)"].mean():.2f}\n'
    f'Minimum: {data["Employment_Rate_6_Months (%)"].min():.2f}\n'
    f'Maximum: {data["Employment_Rate_6_Months (%)"].max():.2f}'
)

Global employment percentage after 6 months:
Mean: 85.73
Minimum: 65.50
Maximum: 99.00


### *Employment (1 year)*

In [13]:
print(
    "Global employment percentage after 1 year:\n"
    f'Mean: {data["Employment_Rate_12_Months (%)"].mean():.2f}\n'
    f'Minimum: {data["Employment_Rate_12_Months (%)"].min():.2f}\n'
    f'Maximum: {data["Employment_Rate_12_Months (%)"].max():.2f}'
)

Global employment percentage after 1 year:
Mean: 90.41
Minimum: 68.00
Maximum: 100.00
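
The two summaries above could also be computed in one step with `DataFrame.agg`. This is a minimal sketch that uses only the first three rows shown in `data.head()` as toy input, so its numbers differ from the global figures.

```python
import pandas as pd

# First three rows of data.head(), used here as a small stand-in.
toy = pd.DataFrame({
    "Employment_Rate_6_Months (%)": [79.3, 83.8, 81.7],
    "Employment_Rate_12_Months (%)": [85.6, 87.9, 83.2],
})

# One row per statistic, one column per employment horizon.
emp_summary = toy.agg(["mean", "min", "max"]).round(2)

print(emp_summary)
```

The result has one row per statistic, which makes the 6-month and 12-month figures easy to compare side by side.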


### *Top universities by employment rate*

In [14]:

uni_emp_summary: pd.DataFrame = (
    data.groupby(["University_Name", "Region", "Country"], as_index=False)
        .agg(
            mean_6=("Employment_Rate_6_Months (%)", "mean"),
            mean_12=("Employment_Rate_12_Months (%)", "mean"),
        )
)

uni_emp_summary["mean_overall"] = uni_emp_summary[["mean_6", "mean_12"]].mean(axis=1)

uni_emp_summary["rank"] = uni_emp_summary["mean_overall"].rank(
    method="dense", ascending=False
).astype(int)


top50: pd.DataFrame = (
    uni_emp_summary.sort_values(
        ["mean_overall", "University_Name"], ascending=[False, True]
    ).head(50).copy()
)


ordered_unis: list[str] = top50["University_Name"].tolist()

top50 = (
    top50.set_index("University_Name")
         .reindex(ordered_unis)
         .reset_index()
)


employment: alt.Chart = (
    alt.Chart(top50)
    .mark_bar()
    .encode(
        y=alt.Y(
            "University_Name:N",
            sort=None,
            title="University"
        ),
        x=alt.X("mean_overall:Q", title="Average Employment Rate (%)"),
        color=alt.Color(
            "Region:N",
            title="Region",
            legend=alt.Legend(orient="top", direction="horizontal")
        ),
        tooltip=[
            alt.Tooltip("rank:Q", title="Ranking"),
            alt.Tooltip("University_Name:N", title="University"),
            alt.Tooltip("Region:N", title="Region"),
            alt.Tooltip("Country:N", title="Country"),
            alt.Tooltip("mean_6:Q", title="Empl. Rate (6 months) %", format=".2f"),
            alt.Tooltip("mean_12:Q", title="Empl. Rate (12 months) %", format=".2f"),
            alt.Tooltip("mean_overall:Q", title="Avg Empl. Rate %", format=".2f"),
        ],
    )
    .properties(
        width=500, height=600,
        title="Top 50 Universities by Average Employment Rate (2015 to 2025)"
    )
)

employment.save("../img/eda_plots/employment.png")

employment


![Top universities by employment rate](../img/eda_plots/employment.png)

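The top-ranked universities by employment rate come mostly from North America, Europe, and Asia-Pacific. Note that the data contains only 46 distinct universities, so the top-50 cut-off does not exclude any of them as long as each university maps to a single region and country.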
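
The ranking in the chart uses `rank(method="dense")`, which gives tied values the same rank and leaves no gaps afterwards. The sketch below contrasts it with `method="min"` on made-up scores; the labels `A` to `D` are placeholders.

```python
import pandas as pd

# Made-up scores with a tie between A and C.
scores = pd.Series([90.0, 88.5, 90.0, 85.0], index=["A", "B", "C", "D"])

# Dense ranking: ties share a rank and the next rank follows immediately.
dense_rank = scores.rank(method="dense", ascending=False).astype(int)

# Min ranking: ties share a rank and the following ranks are skipped.
min_rank = scores.rank(method="min", ascending=False).astype(int)

print(pd.DataFrame({"score": scores, "dense": dense_rank, "min": min_rank}))
```

Here `A` and `C` share rank 1 under both methods. `B` and `D` get ranks 2 and 3 under dense ranking, but 3 and 4 under min ranking.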

## Starting Salary

In [15]:
print(
    "Global starting salary:\n"
    f'Mean: ${data["Average_Starting_Salary_USD"].mean():,.2f}\n'
    f'Minimum: ${data["Average_Starting_Salary_USD"].min():,.2f}\n'
    f'Maximum: ${data["Average_Starting_Salary_USD"].max():,.2f}'
)

Global starting salary:
Mean: $64,686.82
Minimum: $4,200.00
Maximum: $189,400.00


### *Top industries by starting salary*

In [16]:

TOP_N = 10

industry_salary: pd.DataFrame = (
    data.groupby("Top_Industry", as_index=False)
        .agg(avg_salary=("Average_Starting_Salary_USD", "mean"))
)

industry_salary["rank"] = (
    industry_salary["avg_salary"].rank(method="dense", ascending=False).astype(int)
)

top_industries: pd.DataFrame = (
    industry_salary.sort_values("avg_salary", ascending=False)
                   .head(TOP_N)
                   .copy()
)

ordered_inds: list[str] = top_industries["Top_Industry"].tolist()

top_industries = (
    top_industries.set_index("Top_Industry")
                  .reindex(ordered_inds)
                  .reset_index()
)


bar_top_industries: alt.Chart = (
    alt.Chart(top_industries)
    .mark_bar()
    .encode(
        x=alt.X("Top_Industry:N", sort=None, title="Top Industry"),
        y=alt.Y(
            "avg_salary:Q",
            title="Average Starting Salary (USD)",
            axis=alt.Axis(format="$,.0f"),
        ),
        tooltip=[
            alt.Tooltip("rank:Q", title="Rank"),
            alt.Tooltip("Top_Industry:N", title="Industry"),
            alt.Tooltip("avg_salary:Q", title="Avg Salary (USD)", format="$,.2f"),
        ],
    )
    .properties(
        width=500, height=300,
        title="Top Industries by Average Starting Salary"
    )
)


bar_top_industries.save("../img/eda_plots/bar_top_industries.png")
bar_top_industries


![Top industries by salary](../img/eda_plots/bar_top_industries.png)

Here we can observe that **technology**, **finance/consulting**, and **healthcare** rank among the top-paying industries by the starting salary of new graduates.

### *Salary yearly trends*

In [17]:
salary_over_time: pd.DataFrame = (
    data.groupby(["Year", "Field_of_Study"], as_index=False)
        .agg(avg_salary=("Average_Starting_Salary_USD", "mean"))
)


highlight: alt.Parameter = alt.selection_point(fields=["Field_of_Study"], bind="legend")

line_salary_over_time: alt.Chart = (
    alt.Chart(salary_over_time)
    .mark_line(point=True)
    .encode(
        x=alt.X("Year:O", title="Year", sort="ascending"),
        y=alt.Y(
            "avg_salary:Q",
            title="Average Starting Salary (USD)",
            axis=alt.Axis(format="$,.0f"),
            scale=alt.Scale(domain=[4e4, 1e5])
        ),
        color=alt.Color("Field_of_Study:N", title="Field of Study"),
        opacity=alt.condition(highlight, alt.value(1), alt.value(0.12)),
        tooltip=[
            alt.Tooltip("Year:O", title="Year"),
            alt.Tooltip("Field_of_Study:N", title="Field of Study"),
            alt.Tooltip("avg_salary:Q", title="Avg Salary (USD)", format="$,.2f"),
        ],
    )
    .add_params(highlight)
    .properties(
        width=500, height=300,
        title="Average Starting Salary Over Time by Field of Study"
    )
)


line_salary_over_time.save("../img/eda_plots/line_salary_over_time.png")
line_salary_over_time


![Salary over time by field of study](../img/eda_plots/line_salary_over_time.png)

Here we can observe that data science and AI, computer science, and healthcare remain the fields of study with the highest mean starting salaries after graduation.
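
A reading like this can be cross-checked numerically by picking the field with the highest mean salary in each year. The following is a minimal sketch on made-up salaries; only the column names come from the data set.

```python
import pandas as pd

# Made-up records for two years and two fields.
toy = pd.DataFrame({
    "Year": [2016, 2016, 2016, 2017, 2017, 2017],
    "Field_of_Study": [
        "Engineering", "Computer Science", "Engineering",
        "Engineering", "Computer Science", "Computer Science",
    ],
    "Average_Starting_Salary_USD": [50000, 70000, 60000, 65000, 60000, 62000],
})

# Mean salary per (year, field), as in the chart's input table.
avg = (
    toy.groupby(["Year", "Field_of_Study"], as_index=False)
       .agg(avg_salary=("Average_Starting_Salary_USD", "mean"))
)

# For each year, keep the row with the highest mean salary.
top_per_year = avg.loc[avg.groupby("Year")["avg_salary"].idxmax()]

print(top_per_year)
```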

## Distribution of degree programs

In [18]:

deg: pd.DataFrame = (
    data.groupby("Degree_Level", as_index=False)
        .size()
        .rename(columns={"size": "count"})
)

deg["share"] = deg["count"] / deg["count"].sum()

ring_degree_level: alt.Chart = (
    alt.Chart(deg)
    .mark_arc(innerRadius=120)
    .encode(
        theta=alt.Theta("count:Q", title="Count"),
        color=alt.Color("Degree_Level:N", title="Degree Level"),
        tooltip=[
            alt.Tooltip("Degree_Level:N", title="Degree Level"),
            alt.Tooltip("count:Q", title="Count", format=",.0f"),
            alt.Tooltip("share:Q", title="Share", format=".1%")
        ],
    )
    .properties(
        width=500, height=500,
        title="Degree Level Proportion (All Records)"
    )
)

ring_degree_level.save("../img/eda_plots/ring_degree_level.png")
ring_degree_level


![Degree distribution](../img/eda_plots/ring_degree_level.png)

Here we can observe that the three degree levels are roughly evenly represented in our data.
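
How even the mixture is can be quantified with `value_counts(normalize=True)`. The sketch below uses seven made-up records; the gap between the largest and smallest share is one simple, assumed measure of imbalance.

```python
import pandas as pd

# Made-up degree levels for seven records.
levels = pd.Series(
    ["Bachelor", "Master", "PhD", "Bachelor", "Master", "PhD", "Bachelor"],
    name="Degree_Level",
)

# Share of records per degree level (the shares sum to 1).
shares = levels.value_counts(normalize=True)

# Gap between the most and least common level; 0 means perfectly even.
spread = shares.max() - shares.min()

print(shares.round(3))
print(f"Spread: {spread:.3f}")
```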

## Degree comparison: employment rates at 6 vs 12 months

In [19]:
deg_emp: pd.DataFrame = (
    data.groupby("Degree_Level")[[
        "Employment_Rate_6_Months (%)", "Employment_Rate_12_Months (%)"
    ]].mean().reset_index().round(2)
)

deg_emp

Unnamed: 0,Degree_Level,Employment_Rate_6_Months (%),Employment_Rate_12_Months (%)
0,Bachelor,84.59,89.36
1,Master,85.42,90.09
2,PhD,87.18,91.76


In [20]:
deg_emp_long: pd.DataFrame = deg_emp.melt(
    id_vars=["Degree_Level"],
    value_vars=["Employment_Rate_6_Months (%)", "Employment_Rate_12_Months (%)"],
    var_name="Horizon",
    value_name="Employment_Rate"
)

deg_emp_long["Horizon"] = deg_emp_long["Horizon"].replace({
    "Employment_Rate_6_Months (%)": "6 months",
    "Employment_Rate_12_Months (%)": "12 months"
})

deg_emp_chart: alt.Chart = (
    alt.Chart(deg_emp_long)
    .mark_bar()
    .encode(
        x=alt.X("Degree_Level:N", title="Degree level"),
        xOffset=alt.XOffset("Horizon:N"),
        y=alt.Y("Employment_Rate:Q", title="Average employment rate (%)"),
        color=alt.Color("Horizon:N", title="Horizon"),
        tooltip=[
            "Degree_Level:N",
            "Horizon:N",
            alt.Tooltip("Employment_Rate:Q", format=".2f", title="Avg employment rate")
        ],
    )
    .properties(
        width=300, height=300,
        title="Average employment rate by degree level (6 vs 12 months)"
    )
)

deg_emp_chart.save("../img/eda_plots/employment_by_degree.png")
deg_emp_chart

![Employment by Degree Level and time after graduation](../img/eda_plots/employment_by_degree.png)

Here we can observe that employability increases with degree level: the 6-month employment rate rises from 84.59% (Bachelor) to 87.18% (PhD), and the 12-month rate from 89.36% to 91.76%.
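
The gain between the two horizons can be worked out directly from the rounded table above. In this sketch the short column names `rate_6`, `rate_12`, and `gain` are introduced only for illustration.

```python
import pandas as pd

# Rounded means copied from the deg_emp table.
rates = pd.DataFrame({
    "Degree_Level": ["Bachelor", "Master", "PhD"],
    "rate_6": [84.59, 85.42, 87.18],
    "rate_12": [89.36, 90.09, 91.76],
})

# Percentage-point gain between 6 and 12 months after graduation.
rates["gain"] = (rates["rate_12"] - rates["rate_6"]).round(2)

print(rates)
```

The gains are 4.77, 4.67, and 4.58 percentage points for Bachelor, Master, and PhD respectively, so every degree level improves by a similar amount between 6 and 12 months.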

## Salary by Field of Study × Degree

In [21]:
salary_heat: alt.Chart = (
    alt.Chart(data)
    .transform_aggregate(
        mean_salary="mean(Average_Starting_Salary_USD)",
        groupby=["Field_of_Study", "Degree_Level"]
    )
    .mark_rect()
    .encode(
        x=alt.X("Degree_Level:N", title="Degree level"),
        y=alt.Y("Field_of_Study:N", title="Field of study"),
        color=alt.Color("mean_salary:Q", title="Mean salary (USD)"),
        tooltip=[
            "Field_of_Study:N",
            "Degree_Level:N",
            alt.Tooltip("mean_salary:Q", format=",.0f", title="Mean salary (USD)")
        ],
    )
    .properties(
        width=300, height=500,
        title="Mean starting salary: Field of study × Degree"
    )
)

salary_labels: alt.Chart = (
    alt.Chart(data)
    .transform_aggregate(
        mean_salary="mean(Average_Starting_Salary_USD)",
        groupby=["Field_of_Study", "Degree_Level"]
    )
    .mark_text()
    .encode(
        x=alt.X("Degree_Level:N"),
        y=alt.Y("Field_of_Study:N"),
        text=alt.Text("mean_salary:Q", format=",.0f")
    )
)

salary_heat_with_labels: alt.Chart = salary_heat + salary_labels

salary_heat_with_labels.save("../img/eda_plots/salary_heat_map.png")
salary_heat_with_labels

![Mean Starting Salary Heatmap by Degree Level and Field of Study](../img/eda_plots/salary_heat_map.png)

We can see that salaries tend to be higher as the level of education increases, with healthcare and data science among the highest-paying fields in this regard.
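
The heatmap values can also be tabulated with `pivot_table`, which makes it possible to test the trend per field. This is a minimal sketch on made-up salaries for two fields and two degree levels.

```python
import pandas as pd

# Made-up salaries; only the column names come from the data set.
toy = pd.DataFrame({
    "Field_of_Study": [
        "Engineering", "Engineering",
        "Healthcare & Medicine", "Healthcare & Medicine",
    ],
    "Degree_Level": ["Bachelor", "Master", "Bachelor", "Master"],
    "Average_Starting_Salary_USD": [60000, 70000, 65000, 88000],
})

# Rows are fields, columns are degree levels, cells are mean salaries.
salary_table = toy.pivot_table(
    index="Field_of_Study",
    columns="Degree_Level",
    values="Average_Starting_Salary_USD",
    aggfunc="mean",
)

# True only if the mean salary rises from Bachelor to Master in every field.
rises_everywhere = bool((salary_table["Master"] > salary_table["Bachelor"]).all())

print(salary_table)
print("Master > Bachelor in every field:", rises_everywhere)
```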