# **Exploratory Data Analysis**

## Description

The purpose of this analysis is to identify general patterns in the data we collected, focusing on the visualizations we plan to use in our dashboard.

## Imports

In [1]:
import altair as alt
import pandas as pd

## Extracting the data

In [2]:
data: pd.DataFrame = pd.read_csv(
    "../data/raw/global_graduate_employability_index.csv"
)

data.head()

Unnamed: 0,Country,Region,University_Name,Degree_Level,Field_of_Study,Graduation_Year,Employment_Rate_6_Months (%),Employment_Rate_12_Months (%),Average_Starting_Salary_USD,Top_Industry,Job_Role,Skill_1,Skill_2,Skill_3,Skill_Demand_Score (1–100),Remote_Work_Availability (%),Employer_Reputation_Score (1–100),Year
0,USA,North America,Harvard,Bachelor,Engineering,2017,79.3,85.6,66700,Manufacturing/Construction,Robotics Engineer,AutoCAD,Lean Six Sigma,MATLAB,69,8.8,66,2017
1,USA,North America,MIT,Bachelor,Engineering,2023,83.8,87.9,84500,Manufacturing/Construction,Civil Engineer,Lean Six Sigma,AutoCAD,MATLAB,71,65.4,63,2023
2,Israel,Middle East & Africa,Technion,Master,Healthcare & Medicine,2019,81.7,83.2,88300,Healthcare,Public Health Specialist,Clinical Research,Diagnostics,Patient Care,52,5.0,74,2019
3,India,Asia-Pacific,IIT Bombay,Master,Computer Science,2016,84.2,92.1,21000,Technology,AI Researcher,DevOps,Python,Cloud Computing,69,10.3,48,2016
4,South Africa,Middle East & Africa,University of Cape Town,PhD,Business & Finance,2023,83.6,86.3,48600,Finance/Consulting,Management Consultant,Financial Modeling,Data Analysis,Market Research,69,64.0,65,2023


## Summary statistics

Of the 18 columns, 10 are string columns holding mostly categorical features, and the remaining 8 are numerical. Each row describes one combination of university, degree level, field of study, year of collection, and year of graduation.

In [3]:
data.info()

<class 'pandas.DataFrame'>
RangeIndex: 3500 entries, 0 to 3499
Data columns (total 18 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Country                            3500 non-null   str    
 1   Region                             3500 non-null   str    
 2   University_Name                    3500 non-null   str    
 3   Degree_Level                       3500 non-null   str    
 4   Field_of_Study                     3500 non-null   str    
 5   Graduation_Year                    3500 non-null   int64  
 6   Employment_Rate_6_Months (%)       3500 non-null   float64
 7   Employment_Rate_12_Months (%)      3500 non-null   float64
 8   Average_Starting_Salary_USD        3500 non-null   int64  
 9   Top_Industry                       3500 non-null   str    
 10  Job_Role                           3500 non-null   str    
 11  Skill_1                            3500 non-null   str    
 12  Ski

In [4]:
data.describe().round(2)

Unnamed: 0,Graduation_Year,Employment_Rate_6_Months (%),Employment_Rate_12_Months (%),Average_Starting_Salary_USD,Skill_Demand_Score (1–100),Remote_Work_Availability (%),Employer_Reputation_Score (1–100),Year
count,3500.0,3500.0,3500.0,3500.0,3500.0,3500.0,3500.0,3500.0
mean,2020.02,85.74,90.41,64668.0,65.19,40.06,69.74,2020.02
std,3.2,6.99,6.74,30727.51,14.71,30.32,14.24,3.2
min,2015.0,65.5,68.0,-1200.0,30.0,5.0,40.0,2015.0
25%,2017.0,80.7,85.5,39400.0,55.0,10.6,60.0,2017.0
50%,2020.0,85.6,90.6,63900.0,64.0,37.3,70.0,2020.0
75%,2023.0,90.9,96.0,85400.0,75.0,68.3,79.25,2023.0
max,2025.0,99.0,100.0,189400.0,100.0,90.0,100.0,2025.0
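
The summary above reports identical statistics for `Graduation_Year` and `Year`, which suggests the two columns may hold the same values. A minimal sketch of how this could be checked, using the five preview rows as a stand-in for `data` (the `sample` frame and the `years_identical` name are assumptions for illustration):

```python
import pandas as pd

# Stand-in for `data`: the two year columns from the five preview rows.
sample = pd.DataFrame(
    {
        "Graduation_Year": [2017, 2023, 2019, 2016, 2023],
        "Year": [2017, 2023, 2019, 2016, 2023],
    }
)

# True only if every row has the same value in both columns.
years_identical = bool((sample["Graduation_Year"] == sample["Year"]).all())
print(years_identical)
```

If the same check holds on the full data set, one of the two columns is redundant for the dashboard.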


In [5]:
# Columns holding text (categorical) values.
string_cols = data.select_dtypes(include=["object", "string"]).columns

print("Number of categories per string column:\n")

summary = pd.DataFrame({
    "n_categories": data[string_cols].nunique(dropna=False)
}).sort_values("n_categories", ascending=False)

summary

Number of categories per string column:



Unnamed: 0,n_categories
University_Name,46
Skill_3,34
Skill_2,34
Skill_1,34
Job_Role,29
Country,21
Top_Industry,7
Field_of_Study,7
Region,5
Degree_Level,3
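
The split between string and numerical columns described in this section can also be counted directly rather than read off the `info()` output. A small sketch, assuming a three-column stand-in for `data` (the `sample` frame is hypothetical and only mirrors the real column names):

```python
import pandas as pd

# Stand-in for `data` with one text column and two numeric columns.
sample = pd.DataFrame(
    {
        "Country": ["USA", "Israel", "India"],
        "Graduation_Year": [2017, 2019, 2016],
        "Employment_Rate_6_Months (%)": [79.3, 81.7, 84.2],
    }
)

# Count numeric columns and everything else separately.
n_numeric = sample.select_dtypes(include="number").shape[1]
n_other = sample.select_dtypes(exclude="number").shape[1]

print(n_numeric, n_other)
```

Selecting on `"number"` avoids depending on how a given pandas version labels string columns.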


Next, we analyse some of the relevant numeric variables in our data set.

## Employment rate

### *Employment (6 months)*

In [6]:
print(
    "Global employment percentage after 6 months:\n"
    f'Mean: {data["Employment_Rate_6_Months (%)"].mean():.2f}\n'
    f'Minimum: {data["Employment_Rate_6_Months (%)"].min():.2f}\n'
    f'Maximum: {data["Employment_Rate_6_Months (%)"].max():.2f}'
)

Global employment percentage after 6 months:
Mean: 85.74
Minimum: 65.50
Maximum: 99.00


### *Employment (1 year)*

In [7]:
print(
    "Global employment percentage after 1 year:\n"
    f'Mean: {data["Employment_Rate_12_Months (%)"].mean():.2f}\n'
    f'Minimum: {data["Employment_Rate_12_Months (%)"].min():.2f}\n'
    f'Maximum: {data["Employment_Rate_12_Months (%)"].max():.2f}'
)

Global employment percentage after 1 year:
Mean: 90.41
Minimum: 68.00
Maximum: 100.00
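
The two means reported above differ by roughly 4.7 percentage points (90.41 versus 85.74). A row-level view of that change can be sketched as follows, using the five preview rows as a stand-in for `data` (the `gain` and `n_decreasing` names are assumptions for illustration):

```python
import pandas as pd

# Stand-in for `data`: employment rates from the five preview rows.
sample = pd.DataFrame(
    {
        "Employment_Rate_6_Months (%)": [79.3, 83.8, 81.7, 84.2, 83.6],
        "Employment_Rate_12_Months (%)": [85.6, 87.9, 83.2, 92.1, 86.3],
    }
)

# Percentage-point change between the 6-month and 12-month rates.
gain = (
    sample["Employment_Rate_12_Months (%)"]
    - sample["Employment_Rate_6_Months (%)"]
)

# Rows where the 12-month rate is lower than the 6-month rate.
n_decreasing = int((gain < 0).sum())

print(gain.round(1).tolist())
print(n_decreasing)
```

A non-zero `n_decreasing` on the full data would flag records worth inspecting, since employment after one year would usually be expected to be at least as high as after six months.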


### *Top universities by employment rate*

In [8]:

uni_emp_summary: pd.DataFrame = (
    data.groupby(["University_Name", "Region", "Country"], as_index=False)
        .agg(
            mean_6=("Employment_Rate_6_Months (%)", "mean"),
            mean_12=("Employment_Rate_12_Months (%)", "mean"),
        )
)

uni_emp_summary["mean_overall"] = uni_emp_summary[["mean_6", "mean_12"]].mean(axis=1)

uni_emp_summary["rank"] = uni_emp_summary["mean_overall"].rank(method="dense", ascending=False).astype(int)


top50: pd.DataFrame = (
    uni_emp_summary.sort_values(["mean_overall", "University_Name"], ascending=[False, True])
       .head(50)
       .copy()
)


top50["University_Name"] = pd.Categorical(
    top50["University_Name"],
    categories=top50["University_Name"].tolist(),
    ordered=True
)


employment: alt.Chart = (
    alt.Chart(top50)
    .mark_bar()
    .encode(
        y=alt.Y(
            "University_Name:N",
            sort=None,
            title="University"
        ),
        x=alt.X("mean_overall:Q", title="Average Employment Rate (%)"),
        color=alt.Color(
            "Region:N",
            title="Region",
            legend=alt.Legend(orient="top", direction="horizontal")
        ),
        tooltip=[
            alt.Tooltip("rank:Q", title="Ranking"),
            alt.Tooltip("University_Name:N", title="University"),
            alt.Tooltip("Region:N", title="Region"),
            alt.Tooltip("Country:N", title="Country"),
            alt.Tooltip("mean_6:Q", title="Empl. Rate (6 months) %", format=".2f"),
            alt.Tooltip("mean_12:Q", title="Empl. Rate (12 months) %", format=".2f"),
            alt.Tooltip("mean_overall:Q", title="Avg Empl. Rate %", format=".2f"),
        ],
    )
    .properties(width=800, height=900, title="Top 50 Universities by Average Employment Rate (2015 to 2025)")
)

employment.save("../img/eda_plots/employment.png")

employment


![Top universities by employment rate](../img/eda_plots/employment.png)

Most of the top-ranked universities by employment rate are in North America, Europe, and Asia-Pacific.
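
The regional pattern read off the chart can be quantified by counting regions among the highest-ranked rows. A minimal sketch, assuming a small frame shaped like `uni_emp_summary` (the `mean_overall` values and the `top_n` cutoff are illustrative, not results from the real data):

```python
import pandas as pd

# Stand-in shaped like `uni_emp_summary`; mean_overall values are illustrative.
uni = pd.DataFrame(
    {
        "University_Name": [
            "Harvard", "MIT", "Technion", "IIT Bombay", "University of Cape Town",
        ],
        "Region": [
            "North America", "North America", "Middle East & Africa",
            "Asia-Pacific", "Middle East & Africa",
        ],
        "mean_overall": [82.45, 85.85, 82.45, 88.15, 84.95],
    }
)

top_n = 3  # hypothetical cutoff for this toy example

# Keep the highest-ranked rows, then count how many fall in each region.
top = uni.nlargest(top_n, "mean_overall")
region_counts = top["Region"].value_counts()

print(region_counts)
```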

## Starting Salary

In [9]:
print(
    "Global starting salary:\n"
    f'Mean: ${data["Average_Starting_Salary_USD"].mean():,.2f}\n'
    f'Minimum: ${data["Average_Starting_Salary_USD"].min():,.2f}\n'
    f'Maximum: ${data["Average_Starting_Salary_USD"].max():,.2f}'
)

Global starting salary:
Mean: $64,668.00
Minimum: $-1,200.00
Maximum: $189,400.00
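
The minimum of $-1,200.00 is not a plausible starting salary, so it is worth counting how many records have a non-positive value before plotting. A hedged sketch on a stand-in frame (the `sample` values mix preview rows with the reported minimum; dropping such rows is one option, not a decision made in this analysis):

```python
import pandas as pd

# Stand-in for `data`: three preview salaries plus the reported minimum.
sample = pd.DataFrame(
    {"Average_Starting_Salary_USD": [66700, 84500, -1200, 21000]}
)

# Flag salaries that cannot be valid (zero or negative).
invalid_salary = sample["Average_Starting_Salary_USD"] <= 0
n_invalid = int(invalid_salary.sum())

# One possible treatment: exclude the flagged rows from later aggregates.
cleaned = sample.loc[~invalid_salary]

print(n_invalid)
print(cleaned["Average_Starting_Salary_USD"].min())
```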


### *Top industries by starting salary*

In [10]:

TOP_N = 10

industry_salary: pd.DataFrame = (
    data.groupby("Top_Industry", as_index=False)
        .agg(avg_salary=("Average_Starting_Salary_USD", "mean"))
)

industry_salary["rank"] = (
    industry_salary["avg_salary"].rank(method="dense", ascending=False).astype(int)
)

top_industries: pd.DataFrame = (
    industry_salary.sort_values("avg_salary", ascending=False)
                   .head(TOP_N)
                   .copy()
)


top_industries["Top_Industry"] = pd.Categorical(
    top_industries["Top_Industry"],
    categories=top_industries["Top_Industry"].tolist(),
    ordered=True
)

bar_top_industries: alt.Chart = (
    alt.Chart(top_industries)
    .mark_bar()
    .encode(
        x=alt.X("Top_Industry:N", sort=None, title="Top Industry"),
        y=alt.Y(
            "avg_salary:Q",
            title="Average Starting Salary (USD)",
            axis=alt.Axis(format="$,.0f"),
        ),
        tooltip=[
            alt.Tooltip("rank:Q", title="Rank"),
            alt.Tooltip("Top_Industry:N", title="Industry"),
            alt.Tooltip("avg_salary:Q", title="Avg Salary (USD)", format="$,.2f"),
        ],
    )
    .properties(width=800, height=400, title="Top Industries by Average Starting Salary")
)


bar_top_industries.save("../img/eda_plots/bar_top_industries.png")
bar_top_industries


![Top industries by salary](../img/eda_plots/bar_top_industries.png)

Here we can observe that **technology**, **consulting**, **healthcare**, and **finance** rank among the top-paying industries by starting salary for new graduates.

### *Salary yearly trends*

In [11]:
salary_over_time: pd.DataFrame = (
    data.groupby(["Year", "Field_of_Study"], as_index=False)
        .agg(avg_salary=("Average_Starting_Salary_USD", "mean"))
)


highlight: alt.Parameter = alt.selection_point(fields=["Field_of_Study"], bind="legend")

line_salary_over_time: alt.Chart = (
    alt.Chart(salary_over_time)
    .mark_line(point=True)
    .encode(
        x=alt.X("Year:O", title="Year", sort="ascending"),
        y=alt.Y(
            "avg_salary:Q",
            title="Average Starting Salary (USD)",
            axis=alt.Axis(format="$,.0f"),
            scale=alt.Scale(domain=[4e4, 1e5])
        ),
        color=alt.Color("Field_of_Study:N", title="Field of Study"),
        opacity=alt.condition(highlight, alt.value(1), alt.value(0.12)),
        tooltip=[
            alt.Tooltip("Year:O", title="Year"),
            alt.Tooltip("Field_of_Study:N", title="Field of Study"),
            alt.Tooltip("avg_salary:Q", title="Avg Salary (USD)", format="$,.2f"),
        ],
    )
    .add_params(highlight)
    .properties(width=900, height=450, title="Average Starting Salary Over Time by Field of Study")
)


line_salary_over_time.save("../img/eda_plots/line_salary_over_time.png")
line_salary_over_time


![Salary over time by field of study](../img/eda_plots/line_salary_over_time.png)

Here we can observe that data science and AI, along with computer science and healthcare, remain the fields of study with the highest mean starting salaries after graduation.
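
The line chart fixes the y-axis domain to the range 40,000 to 100,000 USD. A fixed domain does not adapt to the data, so any group mean outside that range would not fall within the axis limits. A minimal sketch of a check for this, assuming a stand-in frame shaped like `salary_over_time` (the rows are single preview records, not real group means):

```python
import pandas as pd

# Stand-in shaped like `salary_over_time`; values come from single preview rows.
salary_over_time = pd.DataFrame(
    {
        "Year": [2016, 2017, 2019, 2023],
        "Field_of_Study": [
            "Computer Science", "Engineering",
            "Healthcare & Medicine", "Engineering",
        ],
        "avg_salary": [21000.0, 66700.0, 88300.0, 84500.0],
    }
)

# Same bounds as the y-axis domain used in the chart.
Y_DOMAIN = (4e4, 1e5)

# Rows whose mean salary lies outside the plotted range.
outside = salary_over_time[~salary_over_time["avg_salary"].between(*Y_DOMAIN)]

print(outside)
```

An empty `outside` frame on the real aggregates would confirm that the fixed domain hides nothing.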

## Distribution of degree programs

In [12]:

deg: pd.DataFrame = (
    data.groupby("Degree_Level", as_index=False)
        .size()
        .rename(columns={"size": "count"})
)

deg["share"] = deg["count"] / deg["count"].sum()

ring_degree_level: alt.Chart = (
    alt.Chart(deg)
    .mark_arc(innerRadius=120)
    .encode(
        theta=alt.Theta("count:Q", title="Count"),
        color=alt.Color("Degree_Level:N", title="Degree Level"),
        tooltip=[
            alt.Tooltip("Degree_Level:N", title="Degree Level"),
            alt.Tooltip("count:Q", title="Count", format=",.0f"),
            alt.Tooltip("share:Q", title="Share", format=".1%")
        ],
    )
    .properties(width=450, height=450, title="Degree Level Proportion (All Records)")
)

ring_degree_level.save("../img/eda_plots/ring_degree_level.png")
ring_degree_level


![Degree distribution](../img/eda_plots/ring_degree_level.png)

Here we can observe a roughly even mix of degree levels across the university records in our data.
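
How even the mix is can be expressed as a number instead of judged from the ring chart. A small sketch using the degree levels of the five preview rows as a stand-in for `data` (the `spread` measure is an assumption chosen for illustration):

```python
import pandas as pd

# Stand-in for `data`: degree levels from the five preview rows.
sample = pd.DataFrame(
    {"Degree_Level": ["Bachelor", "Bachelor", "Master", "Master", "PhD"]}
)

# Share of records per degree level.
shares = sample["Degree_Level"].value_counts(normalize=True)

# Gap between the most and least common level; 0 means a perfectly even mix.
spread = float(shares.max() - shares.min())

print(shares.round(3))
print(round(spread, 3))
```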