# Work Impact on Mental Health

## Information about the dataset

**This dataset is synthetic**.

**Self-reporting survey** conducted in 2025 that includes the **workers' personal info** (age, gender, country), the **work-related info** (the field of work, if it's remote or hybrid or on-site and how many work hours, salary) **and a report of mental and health conditions** (stress/anxiety/depression, level of burnouts, physical issues, social isolation).

## Questions I will try to answer in this project:

**1)** Do the different **work arrangements** correlate to the **severity level and number of reports of burnout**?

**2)** What is the **distribution of social isolation** across the different **work arrangements**? Is it associated with different **health statuses** such as *depression* or *anxiety*?

**3)** Do *on-site workers* associate with higher **counts of physical issues** than *hybrid or remote workers* do? Does the number of physical issues reported correlate to any **reported mental health condition**?

### Stakeholder's perspective

This project is conducted from an **HR perspective**.

The analysis will examine **how different work arrangements (on-site, hybrid, and remote) relate to employees’ mental health**, focusing on levels of social isolation, burnout severity, and reported mental health conditions.

## Columns in this dataset

### The columns that I will use:
- Work_Arrangement: the type of work - on-site, hybrid or remote
- Mental_Health_Status: conditions that were self-reported such as anxiety, stress, depression, burnout...
- Burnout_Level: self-report of the severity of their burnout - low, medium, high
- Physical_Health_Issues: health complaints like back pain, eye strain...
- Social_Isolation_Score: a 1-5 scale of how much they feel isolated, 1=none, 5=severe

### Columns that I won't be using:
- Survey_Date
- Age
- Gender
- Region
- Industry (work industry like tech, retail...)
- Job_Role (the individual job title)
- Salary_Range (I won't consider the salary as a health factor)

***Side Note:***

There are a few columns that I planned to use but ended up leaving out of the analysis. I decided to keep them in the dataset for the sake of showing my progress.

Those columns are:

- Hours_Per_Week: how many work hours in 1 week (a number)
- Work_Life_Balance_Score: a 1-5 scale of how well the worker believes they balance work and leisure, 1=poor, 5=excellent

# Data cleaning

## Examining the raw data

Loading necessary libraries:

In [51]:
import pandas as pd
import plotly.express as px
data = pd.read_csv("/kaggle/input/remote-work-health-impact-survey-2025/post_pandemic_remote_work_health_impact_2025.csv")

Let’s start by looking at the **first 5 rows of the dataset**.

In [52]:
data.head()

| | Survey_Date | Age | Gender | Region | Industry | Job_Role | Work_Arrangement | Hours_Per_Week | Mental_Health_Status | Burnout_Level | Work_Life_Balance_Score | Physical_Health_Issues | Social_Isolation_Score | Salary_Range |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-06-01 | 27 | Female | Asia | Professional Services | Data Analyst | Onsite | 64 | Stress Disorder | High | 3 | Shoulder Pain; Neck Pain | 2 | $40K-60K |
| 1 | 2025-06-01 | 37 | Female | Asia | Professional Services | Data Analyst | Onsite | 37 | Stress Disorder | High | 4 | Back Pain | 2 | $80K-100K |
| 2 | 2025-06-01 | 32 | Female | Africa | Education | Business Analyst | Onsite | 36 | ADHD | High | 3 | Shoulder Pain; Eye Strain | 2 | $80K-100K |
| 3 | 2025-06-01 | 40 | Female | Europe | Education | Data Analyst | Onsite | 63 | ADHD | Medium | 1 | Shoulder Pain; Eye Strain | 2 | $60K-80K |
| 4 | 2025-06-01 | 30 | Male | South America | Manufacturing | DevOps Engineer | Hybrid | 65 | NaN | Medium | 5 | NaN | 4 | $60K-80K |


Next, we’ll check the size of the dataset.

In [53]:
print("number of rows: ", data.shape[0])
print("number of columns: ", data.shape[1])

number of rows:  3157
number of columns:  14


We see that we have **3,157 rows** (employees who filled out the survey) and **14 columns**.

Let’s check the **column names**.

In [54]:
data.columns

Index(['Survey_Date', 'Age', 'Gender', 'Region', 'Industry', 'Job_Role',
       'Work_Arrangement', 'Hours_Per_Week', 'Mental_Health_Status',
       'Burnout_Level', 'Work_Life_Balance_Score', 'Physical_Health_Issues',
       'Social_Isolation_Score', 'Salary_Range'],
      dtype='object')



Next, we’ll check for missing values.

In [55]:
data.isna().any()

Survey_Date                False
Age                        False
Gender                     False
Region                     False
Industry                   False
Job_Role                   False
Work_Arrangement           False
Hours_Per_Week             False
Mental_Health_Status        True
Burnout_Level              False
Work_Life_Balance_Score    False
Physical_Health_Issues      True
Social_Isolation_Score     False
Salary_Range               False
dtype: bool



It seems that we have **missing values in "Mental_Health_Status" and "Physical_Health_Issues"**.

Let’s check **how many missing values** each of these columns has:

In [56]:
print("missing data in Mental_Health_Status column: ", data["Mental_Health_Status"].isna().sum())
print("-" * 50)
print("missing data in Physical_Health_Issues column: ", data["Physical_Health_Issues"].isna().sum())

missing data in Mental_Health_Status column:  799
--------------------------------------------------
missing data in Physical_Health_Issues column:  280
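
The same check can be written as one summary table. The sketch below is only an illustration: `toy` is a small made-up frame, not the survey data, and the names `missing`, `missing_count` and `missing_percent` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the survey data (made-up values, not the real dataset)
toy = pd.DataFrame({
    "Mental_Health_Status": ["Anxiety", None, "Burnout", None],
    "Physical_Health_Issues": ["Back Pain", "Eye Strain", None, "Neck Pain"],
    "Burnout_Level": ["Low", "High", "Medium", "High"],
})

# Count and share of missing values for every column
missing = pd.DataFrame({
    "missing_count": toy.isna().sum(),
    "missing_percent": toy.isna().mean().mul(100).round(1),
})

# Keep only the columns that actually have gaps
missing = missing[missing["missing_count"] > 0]
print(missing)
```

Seeing the percentage next to the count makes it easier to judge how much of the dataset a later fill or drop would affect.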


**Now that we have all this information, we can start cleaning the data.**

## Data Cleaning

### Subsetting the data

As we saw before, **there are columns that we don’t need for this analysis**.

We want to **focus on work arrangements and mental health**, so we don't need personal employee information or industry-specific details.

We will start by **creating a new dataset that includes only the columns relevant to our analysis**:

In [57]:
# making a list of the columns I need
cols = [
    "Work_Arrangement",
    "Hours_Per_Week",
    "Mental_Health_Status",
    "Burnout_Level",
    "Work_Life_Balance_Score",
    "Social_Isolation_Score",
    "Physical_Health_Issues"
]

# making a subset, "copy" is to prevent making changes to the original dataset
df = data[cols].copy()

print("the columns in the filtered dataset: ")
df.columns

the columns in the filtered dataset: 


Index(['Work_Arrangement', 'Hours_Per_Week', 'Mental_Health_Status',
       'Burnout_Level', 'Work_Life_Balance_Score', 'Social_Isolation_Score',
       'Physical_Health_Issues'],
      dtype='object')

### Data Types

Now let's look at the data again and **check the data types** to make sure that **numerical data is stored as integers and text data is stored as objects**.

In [58]:
df.head()

| | Work_Arrangement | Hours_Per_Week | Mental_Health_Status | Burnout_Level | Work_Life_Balance_Score | Social_Isolation_Score | Physical_Health_Issues |
|---|---|---|---|---|---|---|---|
| 0 | Onsite | 64 | Stress Disorder | High | 3 | 2 | Shoulder Pain; Neck Pain |
| 1 | Onsite | 37 | Stress Disorder | High | 4 | 2 | Back Pain |
| 2 | Onsite | 36 | ADHD | High | 3 | 2 | Shoulder Pain; Eye Strain |
| 3 | Onsite | 63 | ADHD | Medium | 1 | 2 | Shoulder Pain; Eye Strain |
| 4 | Hybrid | 65 | NaN | Medium | 5 | 4 | NaN |


In [59]:
df.dtypes

Work_Arrangement           object
Hours_Per_Week              int64
Mental_Health_Status       object
Burnout_Level              object
Work_Life_Balance_Score     int64
Social_Isolation_Score      int64
Physical_Health_Issues     object
dtype: object

It appears that all data types are correct.

For future reference, I will **convert the "Work_Arrangement" column to the "category" data type** to preserve the order **Onsite -> Hybrid -> Remote**

In [60]:
df["Work_Arrangement"] = df["Work_Arrangement"].astype(pd.CategoricalDtype(
    ["Onsite", "Hybrid", "Remote"], ordered=True))

print("the Work_Arrangement column type is now: ", df["Work_Arrangement"].dtype.name)

the Work_Arrangement column type is now:  category
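
To illustrate what the ordered category gives us, here is a minimal sketch on a made-up Series (not the survey data). The names `order`, `toy`, `sorted_values` and `counts_in_order` are hypothetical.

```python
import pandas as pd

# Same category order as in the notebook: Onsite -> Hybrid -> Remote
order = pd.CategoricalDtype(["Onsite", "Hybrid", "Remote"], ordered=True)

# Made-up values in arbitrary order
toy = pd.Series(["Remote", "Onsite", "Hybrid", "Onsite"]).astype(order)

# Sorting follows the category order, not the alphabet
sorted_values = toy.sort_values().tolist()

# Counts can be put back into category order with sort_index()
counts_in_order = toy.value_counts().sort_index()

print(sorted_values)
print(counts_in_order)
```

Without the ordered dtype, sorting would be alphabetical (Hybrid, Onsite, Remote), which is not the order we want in tables and plots.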


### Missing Values

When we printed the first rows of this dataset, we noticed that **some rows contain "NaN"** values, which represent missing data.

Because the missing values appear in **self-reported mental and physical health columns**, I assume that a missing value **indicates there is no condition to report**.

Therefore, we will **replace "NaN" values with "None"** and print a single row again to confirm the change.

In [61]:
# Print the "before"
print("Before: ")
print(df.iloc[4])
print("-" * 50)

# Make the changes
df["Mental_Health_Status"] = df["Mental_Health_Status"].fillna("None")
df["Physical_Health_Issues"] = df["Physical_Health_Issues"].fillna("None")

# Print the "after"
print("After: ")
print(df.iloc[4])

Before: 
Work_Arrangement           Hybrid
Hours_Per_Week                 65
Mental_Health_Status          NaN
Burnout_Level              Medium
Work_Life_Balance_Score         5
Social_Isolation_Score          4
Physical_Health_Issues        NaN
Name: 4, dtype: object
--------------------------------------------------
After: 
Work_Arrangement           Hybrid
Hours_Per_Week                 65
Mental_Health_Status         None
Burnout_Level              Medium
Work_Life_Balance_Score         5
Social_Isolation_Score          4
Physical_Health_Issues       None
Name: 4, dtype: object




**Are there any remaining missing values?**

In [62]:
df.isna().any()

Work_Arrangement           False
Hours_Per_Week             False
Mental_Health_Status       False
Burnout_Level              False
Work_Life_Balance_Score    False
Social_Isolation_Score     False
Physical_Health_Issues     False
dtype: bool



Great! There are **no missing values** left in the dataset!

### Exploring Unique Values

In [63]:
# a for loop that goes over the list of columns and prints the unique values of each one
for col in cols:
    print(col, ": ")
    print(df[col].unique())
    print("-" * 50)

Work_Arrangement : 
['Onsite', 'Hybrid', 'Remote']
Categories (3, object): ['Onsite' < 'Hybrid' < 'Remote']
--------------------------------------------------
Hours_Per_Week : 
[64 37 36 63 65 61 62 55 47 38 35 57 59 54 51 43 41 58 53 45 48 49 44 50
 60 42 46 56 39 52 40]
--------------------------------------------------
Mental_Health_Status : 
['Stress Disorder' 'ADHD' 'None' 'Burnout' 'Anxiety' 'PTSD' 'Depression']
--------------------------------------------------
Burnout_Level : 
['High' 'Medium' 'Low']
--------------------------------------------------
Work_Life_Balance_Score : 
[3 4 1 5 2]
--------------------------------------------------
Social_Isolation_Score : 
[2 4 3 1 5]
--------------------------------------------------
Physical_Health_Issues : 
['Shoulder Pain; Neck Pain' 'Back Pain' 'Shoulder Pain; Eye Strain' 'None'
 'Back Pain; Shoulder Pain' 'Back Pain; Shoulder Pain; Wrist Pain'
 'Neck Pain' 'Shoulder Pain' 'Eye Strain; Wrist Pain'
 'Back Pain; Eye Strain' 'Back Pai

It seems that there is **a large number of unique values in the "Physical_Health_Issues" column** because **each combination of issues is treated as a separate value**.

To simplify this, we’ll **add a new column that counts the number of physical health issues for each employee**.

In [64]:
df["Physical_Health_Issues_n"] = df["Physical_Health_Issues"].apply(
    lambda x: 0 if x == "None" else len(x.split("; ")))

df[["Physical_Health_Issues","Physical_Health_Issues_n"]].head()

| | Physical_Health_Issues | Physical_Health_Issues_n |
|---|---|---|
| 0 | Shoulder Pain; Neck Pain | 2 |
| 1 | Back Pain | 1 |
| 2 | Shoulder Pain; Eye Strain | 2 |
| 3 | Shoulder Pain; Eye Strain | 2 |
| 4 | None | 0 |




**Much better!**
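
As a side sketch, the same count can be produced with pandas string methods, and the split lists can also be used to count how often each individual issue appears. This uses made-up values, and the names `issue_lists`, `issue_counts` and `issue_freq` are hypothetical.

```python
import pandas as pd

# Made-up entries in the same "issue; issue" format as the survey column
toy = pd.Series(
    ["Shoulder Pain; Neck Pain", "Back Pain", "None",
     "Back Pain; Eye Strain; Wrist Pain"],
    name="Physical_Health_Issues",
)

# Split each entry into a list of issues
issue_lists = toy.str.split("; ")

# Count the list lengths; rows equal to "None" are set to 0
issue_counts = issue_lists.str.len().where(toy != "None", 0)

# One row per individual issue, then count how often each one appears
issue_freq = issue_lists[toy != "None"].explode().value_counts()

print(issue_counts.tolist())
print(issue_freq)
```

The `explode()` step is optional for this project, but it shows which single complaint is the most common, something the combined strings hide.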


Finally, **for statistical analysis**, we will **convert burnout levels from categorical** values (Low–High) **to numerical** values (1–3).

In [65]:
df["Burnout_Level_nscale"] = df["Burnout_Level"].map({"Low": 1, "Medium": 2, "High": 3})

df[["Burnout_Level","Burnout_Level_nscale"]].head()

| | Burnout_Level | Burnout_Level_nscale |
|---|---|---|
| 0 | High | 3 |
| 1 | High | 3 |
| 2 | High | 3 |
| 3 | Medium | 2 |
| 4 | Medium | 2 |
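
One thing to keep in mind with `.map()` is that any label missing from the dictionary silently becomes NaN. The sketch below, on made-up values, shows a quick guard for that, plus an alternative that derives the 1-3 scale from an ordered category. The names `levels`, `toy`, `nscale`, `mapped` and `unmapped` are hypothetical.

```python
import pandas as pd

# Ordered category: Low < Medium < High
levels = pd.CategoricalDtype(["Low", "Medium", "High"], ordered=True)
toy = pd.Series(["High", "Medium", "Low", "High"]).astype(levels)

# Category codes start at 0, so adding 1 gives the 1-3 scale
nscale = toy.cat.codes + 1

# Guard for the dictionary approach: a misspelled label ("medium") becomes NaN
mapped = pd.Series(["High", "medium"]).map({"Low": 1, "Medium": 2, "High": 3})
unmapped = int(mapped.isna().sum())

print(nscale.tolist())
print("labels that were not mapped:", unmapped)
```

In this notebook the unique values were already checked, so the dictionary is safe, but the guard is cheap to add when the labels come from a new file.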


# Exploratory Data Analysis (EDA)

**Before analyzing, let's check the overall statistics.**

To do that, we need to **separate the numerical columns** (with numbers as values) from the **categorical columns** (with words as values):

**Numerical:**
- Hours_Per_Week
- Work_Life_Balance_Score
- Social_Isolation_Score
- Physical_Health_Issues_n
- Burnout_Level_nscale

**Categorical:**
- Work_Arrangement (onsite, hybrid, remote)
- Mental_Health_Status (the different types of mental ailments)

## Numerical Statistics

We'll start by looking at the numeric columns to see the **overall statistics, such as the average ("mean"), the median (the middle point) and the standard deviation (std)**.

In [66]:
df.describe().transpose()

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Hours_Per_Week | 3157.0 | 49.904973 | 8.897699 | 35.0 | 42.0 | 50.0 | 57.0 | 65.0 |
| Work_Life_Balance_Score | 3157.0 | 2.996516 | 1.163307 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| Social_Isolation_Score | 3157.0 | 2.704783 | 1.188887 | 1.0 | 2.0 | 3.0 | 4.0 | 5.0 |
| Physical_Health_Issues_n | 3157.0 | 1.881216 | 1.054662 | 0.0 | 1.0 | 2.0 | 3.0 | 5.0 |
| Burnout_Level_nscale | 3157.0 | 2.095344 | 0.74726 | 1.0 | 2.0 | 2.0 | 3.0 | 3.0 |


It seems that the **weekly work hours** recorded here range **from 35 to 65 hours**, with an average of about 50 hours.

We also see that **the averages of the work-life balance, social isolation and burnout severity columns are all close to their medians**: work-life balance averages 3.00 and social isolation 2.70, both with a median of 3.0, while burnout severity averages 2.10 with a median of 2.0.

Lastly, we see that the **average number of physical health issues is about 1.9 per person** (median of 2), with a maximum of 5 issues for one person.

**A quick reminder: this data is synthetic.** Many of these scales appear unusually *"perfect"*, with the values falling *neatly* onto the minimum -> 25th percentile -> median -> 75th percentile -> maximum range. This may suggest that the data was generated to give each option a similar share, making it **less genuine than real-world data**.

**For the purpose of this project I will set this concern aside and use the data as it is, despite the possible lack of realism.**

### Overall relationships between numerical columns

Let's take a quick look at the relationships between the numerical columns:

In [67]:
# making a table of correlations
corrs = df[[
    "Hours_Per_Week",
    "Work_Life_Balance_Score",
    "Social_Isolation_Score",
    "Physical_Health_Issues_n",
    "Burnout_Level_nscale"
]].corr()

# turning the table to a heat map
px.imshow(
    corrs,
    text_auto='.3f',
    color_continuous_scale="Inferno",
    zmin=-1,
    zmax=1,
    title="Correlations Between the Numerical Columns"
)

**This shows that there are no strong relationships between the numerical columns.**

**0.04 is the strongest positive correlation** (between the social isolation and burnout level columns).

**-0.02 is the strongest negative correlation** (between the physical issues and work hours columns).
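
By default `.corr()` computes Pearson correlations, which treat the 1-5 and 1-3 scales as evenly spaced numbers. Since these are ordinal scales, a rank-based (Spearman) correlation is a reasonable cross-check. The sketch below uses made-up values only; `toy`, `pearson` and `spearman` are hypothetical names, and the numbers it prints say nothing about the survey.

```python
import pandas as pd

# Made-up values on the same kinds of scales as the survey columns
toy = pd.DataFrame({
    "Social_Isolation_Score": [1, 2, 3, 4, 5, 2, 4, 3],
    "Burnout_Level_nscale":   [1, 1, 2, 3, 3, 2, 2, 3],
    "Hours_Per_Week":         [40, 55, 38, 61, 47, 50, 36, 65],
})

# Pearson (the default) measures linear association
pearson = toy.corr(method="pearson")

# Spearman works on ranks, which suits ordinal scales
spearman = toy.corr(method="spearman")

print(pearson.round(3))
print(spearman.round(3))
```

If both methods give values close to zero on the real data, the conclusion of "no strong relationships" does not depend on the choice of method.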

## Categorical Columns statistics

### Mental Health Statistics

Let's explore the **distribution of the different Mental health statuses**:

In [68]:
# making a separate data frame for the count and percentage of mental statuses
m_health_df = df["Mental_Health_Status"].value_counts().reset_index()

m_health_df.columns = ["Mental_Health_Status","Count"]

m_health_df["Percentage"] = (
    (m_health_df["Count"] / (m_health_df["Count"].sum()) * 100) # getting the percentage
    .apply(lambda x: f'{x:.0f}%') # rounding the number and adding a % sign
)

# creating a bar plot with this new data
fig = px.bar(
    m_health_df,
    x = "Mental_Health_Status",
    y = "Count",
    color = "Percentage",
    color_discrete_map = {"25%": "#5D3FD3", "13%": "#2A78B0", "12%": "#A6D3F1"},
    hover_data = ["Mental_Health_Status","Count","Percentage"],
    hover_name = "Percentage",
    text = "Percentage",
    title = "Mental Health Distribution"
)

fig.update_xaxes(title_text="Mental Issues Reported")

fig.update_traces(
    texttemplate='%{text}', 
    textposition='inside', 
    textfont_size=13
)

fig.show()

This graph shows that **25% of the workers** who filled out this survey **didn't report any mental issue**.

**13% reported PTSD, and each of the other ailments was reported by about 12%**.

### Distribution of the Work Arrangements

Let's look at how many people work at each type of work **(on-site, hybrid and remote)**

In [69]:
# making a separate data frame for the count and percentage of the work arrangements
work_df = df["Work_Arrangement"].value_counts().reset_index()

work_df.columns = ["Work_Arrangement","Count"]

work_df["Percentage"] = (
    (work_df["Count"]/(work_df["Count"].sum()) * 100)  # getting the percentage
    .apply(lambda x: f'{x:.0f}%')   # rounding the number and adding a % sign
)

# creating a bar plot with this new data
fig = px.bar(
    work_df,
    x = "Work_Arrangement",
    y = "Count",
    color = "Work_Arrangement",
    color_discrete_map={"Onsite": "#4C72B0", "Hybrid": "#55A868", "Remote": "#64B5CD"},
    hover_data = ["Work_Arrangement","Count","Percentage"],
    text = "Percentage",
    title = "Distribution of Work Types")

fig.update_xaxes(title_text="Work Arrangement")

fig.update_traces(
    texttemplate='%{text}', 
    textposition='outside', 
    textfont_size=14
)

fig.show()

As we can see, **almost half of the workers (more than 1,500) work on-site**, followed by **hybrid with 32% of the workers** (about 1,000 workers), while the **smallest group** (about 600 workers, 19%) **works remotely**.

# Advanced Analysis

After looking at the overall statistics, it's time to start **answering the questions we set at the beginning**

## Burnouts in different types of work

**The question we asked is:**

Are the different **work arrangements** correlated to the *number* of reports or *severity* level of **burnouts**?

So, let's start with **the number of reported burnouts**, by looking at the statistics of only **workers who reported "burnout" as their mental status**

In [70]:
# subsetting to only include rows with "Burnout" as the mental status
burn = df.loc[df["Mental_Health_Status"] == "Burnout"]

# counting the different work arrangements in this subset
burn_df = burn["Work_Arrangement"].value_counts().reset_index()

burn_df.columns = ["Work Arrangement","Count"]

# comparing the counts in this subset to the overall counts, and taking a percentage
# (the totals are looked up by work arrangement, so the two tables cannot be misaligned)
totals = work_df.set_index("Work_Arrangement")["Count"].to_dict()
burn_df["Percentage"] = (
    (burn_df["Count"] / burn_df["Work Arrangement"].astype(str).map(totals) * 100)  # getting the percentage
    .apply(lambda x: f'{x:.0f}%')  # rounding the number and adding a % sign
)

# creating a bar plot with this new data
fig = px.bar(
    burn_df,
    x = "Count",
    y = "Work Arrangement",
    color = "Work Arrangement",
    color_discrete_map={"Onsite": "#4682B4", "Hybrid": "#40E0D0", "Remote": "#7B68EE"},
    hover_data = ["Work Arrangement","Count","Percentage"],
    text = "Percentage",
    title = "Work Distribution of Reported Burnouts"
    )

fig.update_traces(
    texttemplate='%{text}', 
    textposition='inside', 
    textfont_size=15
)

fig.show()

We can see that (in this dataset) **the share of workers who reported "Burnout" as their mental health status is similar in each work type**, ranging from 11% to 13% across the three.

This suggests that **the likelihood of reporting Burnout does not differ by the type of work**.
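
A compact way to get the same kind of share is to compute it inside each group, so that every work type is divided by its own total. The sketch below uses made-up rows, not the survey data; `toy`, `reported_burnout` and `burnout_share` are hypothetical names.

```python
import pandas as pd

# Made-up rows: 4 onsite, 4 hybrid and 2 remote workers
toy = pd.DataFrame({
    "Work_Arrangement": ["Onsite"] * 4 + ["Hybrid"] * 4 + ["Remote"] * 2,
    "Mental_Health_Status": [
        "Burnout", "None", "Anxiety", "None",
        "Burnout", "Burnout", "None", "PTSD",
        "None", "Burnout",
    ],
})

# True for workers whose reported status is "Burnout"
toy["reported_burnout"] = toy["Mental_Health_Status"].eq("Burnout")

# The mean of a True/False column is the share of True values in each group
burnout_share = (
    toy.groupby("Work_Arrangement")["reported_burnout"]
    .mean()
    .mul(100)
    .round(1)
)
print(burnout_share)
```

Because the share is calculated within each group, the result does not depend on the row order of any helper table.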

Now, let's check if there is a **difference in the severity of the burnouts between the different work types**:

In [71]:
# Making a new data frame by grouping the work types and counting the burnout levels
burn_lv = (
    df.groupby("Work_Arrangement", observed=True)["Burnout_Level_nscale"]
    .value_counts(normalize=True)
    .rename("percent")
    .reset_index()
)

burn_lv["percent"] *= 100
burn_lv["percent"] = burn_lv["percent"].round(2)

# turning this data into a pivot table to be able to show in a heat map
burn_data = burn_lv.pivot(index="Work_Arrangement", columns="Burnout_Level_nscale", values="percent")

# creating a heat map with this data
fig = px.imshow(
        burn_data,
        text_auto=".2f",
        labels=dict(x="Burnout Level (1-3)", y="Work Arrangement", color="Percent"),
        color_continuous_scale='burgyl',
        title = "Percentage Distribution of Burnout Levels by Work Type"
         )

fig.update_traces(
    texttemplate="%{z}%", 
    textfont_size=14
)

fig.show()

**What this heatmap tells us:**

In both **Onsite** and **Hybrid** work types, the **largest share** of workers (about 44%) **reported a medium** (level 2) **burnout severity**, making it the *most common level in both groups*.

***Onsite workers*** show a **relatively balanced distribution**, with less than 20 percentage points between the most common and the least common severity level: **medium is the most common** (43%), **followed by low** (30%), and **high is the least common** (26.5%).

***Hybrid workers*** have less than 20% of workers reporting a **low severity** burnout level, making it **the least common**. There are **less than 10 percentage points between medium severity (most common) and high severity**.

***Remote workers*** **mostly report medium and high severity burnout**, with **high severity as the most common** (46.5%). Less than 15% of workers reported a **low severity** burnout level, making it the **least common**.

***In conclusion***, **Onsite** workers have a **balanced distribution** of burnout severity, while **Hybrid and Remote** workers lean toward **medium-to-high severity** levels, with **Remote** workers having **the largest share of high severity reports**.
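
The percentage table behind a heatmap like this one can also be built in a single step with `pd.crosstab`. This is only a sketch on made-up rows; `toy`, `level_order` and `burnout_pct` are hypothetical names, and the numbers are not the survey results.

```python
import pandas as pd

# Made-up rows for two work types
toy = pd.DataFrame({
    "Work_Arrangement": ["Onsite"] * 4 + ["Remote"] * 4,
    "Burnout_Level": ["Low", "Medium", "Medium", "High",
                      "Medium", "High", "High", "High"],
})

level_order = ["Low", "Medium", "High"]

# normalize="index" turns each row (work type) into shares that sum to 1
burnout_pct = (
    pd.crosstab(toy["Work_Arrangement"], toy["Burnout_Level"], normalize="index")
    .reindex(columns=level_order)  # put the levels in Low -> High order
    .mul(100)
)
print(burnout_pct)
```

Each row sums to 100, so the rows can be compared directly even when the groups have very different sizes.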

## Remote work effect on Isolation and Mental status

**The question we asked:**

What is the **distribution of social isolation** across the different **work arrangements**? Is it associated with different **health statuses** such as *depression* or *anxiety*?

We'll start by comparing the **distribution of the social isolation levels across the different work arrangements**:

In [72]:
# Making a new data frame by grouping the work types and counting the isolation levels
isol = (
    df.groupby("Work_Arrangement",observed=True)["Social_Isolation_Score"]
    .value_counts(normalize=True)
    .mul(100)
    .round(2)
    .rename("percent")
    .reset_index()
    .sort_values("Social_Isolation_Score")  # to show it in a 1-5 order
)

# creating a bar plot with this data, and showing a 100% stack for each level
fig = px.bar(
    isol,
    x="Work_Arrangement",
    y="percent",
    color="Social_Isolation_Score",
    color_continuous_scale="ice_r",
    barmode="stack",
    title = "Social Isolation Level by each Work Type",
    text = "percent"
    )

fig.update_xaxes(title_text="Work Arrangement")

fig.update_traces(
    texttemplate='%{text:.0f}%', 
    textposition='inside', 
    textfont_size=14
)

fig.show()

What this graph shows us:

***Onsite:*** The **majority** (about 60% of workers) **reported low** (1-2) **levels of social isolation**. **Only 6% of workers reported the highest (5) level** of isolation.

***Hybrid:*** With 35% of workers reporting a neutral (3) level of isolation, it appears that **most workers feel a medium (2-4) level of isolation**, while a **minority** of workers (about 20%) report the **lowest (1) or highest (5) level**.

***Remote:*** Less than 20% of workers reported low (1-2) levels of isolation. The **largest share reported a mid-high (4) level of isolation, followed by the medium level** (3) with 28% of workers and then the highest (5) level with 21% of workers.

***In Conclusion:*** This data shows that **Onsite workers tend to feel less isolated** (3 and lower), **Hybrid workers tend to feel neutral/medium** levels (2-4) and **Remote workers tend to feel more isolated** (3 and higher) than the other workers.

Now to the 2nd part: Does this correlate with different **health statuses** such as *depression* or *anxiety*?

We will **focus on remote workers** and see the different **mental conditions reported at each level of social isolation**:

**We will also remove *PTSD* and *ADHD* since we can't know if the type of *work* or the level of *social isolation* relates to those statuses.**

In [73]:
# subsetting to only remote workers, and dropping rows with ADHD and PTSD
remote_df = df.loc[(df["Work_Arrangement"] == "Remote") & \
            (~df["Mental_Health_Status"].isin(["ADHD", "PTSD"]))].copy()

# making a new column that separates "None" and any other mental issue
remote_df["reported_mental_status"] = remote_df["Mental_Health_Status"].map(
    lambda x: "Didn't Report a Mental Issue" if x == "None" else "Reported a Mental Issue")

# making a data frame of the percentage of mental status by the isolation level
remote_isol_mental = (
    remote_df.groupby("Social_Isolation_Score",observed=True)
    ["reported_mental_status"]
    .value_counts(normalize=True)
    .mul(100)
    .round(2)
    .rename("Percent")
    .reset_index()
    .sort_values("Social_Isolation_Score")
)

# creating a bar plot with this data, and showing multiple bars in each level
fig = px.bar(
    remote_isol_mental,
    x="Social_Isolation_Score",
    y="Percent",
    color="reported_mental_status",
    color_discrete_map={"Reported a Mental Issue": "#1F4E79",
                        "Didn't Report a Mental Issue": "#4FB6B2"},
    barmode='group',
    title = "Mental Statuses Reported at each Social Isolation Level",
    text = "Percent",
    labels={"reported_mental_status": "Reported or Not Reported a Mental Issue"}
    )

fig.update_xaxes(title_text="Social Isolation Score")

fig.update_traces(
    texttemplate='%{text:.0f}%', 
    textposition='outside', 
    textfont_size=14
)

fig.show()

In this graph we can see that **within remote workers**, as the level of **social isolation rises**, the share of **reported mental issues rises** as well.

At the **lowest level of isolation, we see almost equal** shares (55% reported vs 45% not reported). **In the middle levels (2-4) the numbers start to shift**, as **reported rises** to 62% while **not reported drops** to 39%, and **at the highest level we see the biggest difference**, with 70% who reported a mental issue vs 30% who didn't.

This shows that (within remote workers) **the higher the level of isolation, the more reports of mental health issues**.

To see **what contributes most to this shift**, let's check the share of **reported Depression** at each isolation level:

In [74]:
# grouping by the levels and counting each mental issue
remote_isol_mental_issues = (
    remote_df.groupby("Social_Isolation_Score",observed=True)
    ["Mental_Health_Status"]
    .value_counts(normalize=True)
    .mul(100)
    .round()
    .rename("Percent")
    .reset_index()
    .sort_values("Social_Isolation_Score")
)

# subsetting to only include rows with "depression"
dep_only = remote_isol_mental_issues[remote_isol_mental_issues["Mental_Health_Status"] == "Depression"]

# creating a line plot to show the rise
fig = px.line(
    dep_only,
    x = "Social_Isolation_Score",
    y = "Percent",
    markers = True,
    text = "Percent",
    title = "The Change of Reported Depression As Social Isolation Rises",
    line_shape="linear",
    render_mode="svg"
    )

# adding dots with percentages to the line
fig.update_traces(
    line_color='#1F4E79', 
    line_width=4, 
    marker_size=10,
    texttemplate='%{text}%', 
    textposition='top center'
    )

fig.update_yaxes(tickformat=".0f", ticksuffix="%")

fig.update_xaxes(title_text="Social Isolation Score")

fig.show()

We can clearly see that (within remote workers) **as the level of isolation rises, the share of reported Depression also rises**, **starting from 3%** at the lowest level of isolation and **rising to 22%** at the highest level.
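
Percentages like these are easier to trust when the group size is shown next to them, because a share based on a handful of workers can swing a lot. Below is a minimal sketch on made-up rows; `toy`, `dep_by_level`, `n_workers` and `depression_percent` are hypothetical names.

```python
import pandas as pd

# Made-up rows (not the survey data); level 4 is missing on purpose
toy = pd.DataFrame({
    "Social_Isolation_Score": [1, 1, 1, 2, 2, 3, 3, 3, 3, 5],
    "Mental_Health_Status": [
        "None", "Anxiety", "None",
        "Depression", "None",
        "Depression", "Depression", "None", "Burnout",
        "Depression",
    ],
})

# For each isolation level: how many workers, and what share reported Depression
dep_by_level = (
    toy.assign(is_depression=toy["Mental_Health_Status"].eq("Depression"))
    .groupby("Social_Isolation_Score")["is_depression"]
    .agg(
        n_workers="count",
        depression_percent=lambda s: round(s.mean() * 100, 1),
    )
)
print(dep_by_level)
```

In this toy table the 100% at level 5 comes from a single worker, which is exactly the kind of case the `n_workers` column helps to spot.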

## Physical Issues: distribution by work and impact on mental health

**The 3rd (and last) question** is:

Do *on-site workers* associate with higher **counts of physical issues** than *hybrid or remote workers* do? Does the number of physical issues reported correlate to any **reported mental health condition**?

We'll start by checking the number of **physical issues** by the different **work arrangements**:

In [75]:
# grouping by the work type and counting the percentage of each amount of physical issues
phy_work = (
    df.groupby("Work_Arrangement",observed=True)["Physical_Health_Issues_n"]
    .value_counts(normalize=True)
    .mul(100)
    .round(1)
    .rename("Percentage")
    .reset_index()
)

# creating a bar plot with this data, and showing multiple bars in each level
fig = px.bar(
    phy_work,
    x="Physical_Health_Issues_n",
    y="Percentage",
    color="Work_Arrangement",
    color_discrete_map={"Onsite": "#A04000", "Hybrid": "#E59866", "Remote": "#FAD7A0"},
    barmode="group",
    title = "Physical Health Distribution by the different Work Arrangements",
    text = "Percentage"
    )

fig.update_xaxes(title_text="Number of Physical Health Issues")

fig.update_traces(
    texttemplate='%{text}%', 
    textposition='outside', 
    textfont_size=13
)

fig.show()

This graph shows us that there is **little to no difference in the distribution of the number of physical health issues between the types of work.**

So, to answer the 1st part of the question:

Within this data, **On-site workers do not report more physical health issues than Hybrid or Remote** workers.

However, this graph also shows that **in each work type most workers have 1-3 different physical issues**. Less than 10% didn't report any physical issues, and less than 8% of workers reported 4-5 different physical issues.

Now let's check if **more *physical issues* correlate with *mental health***:

Like before, I will **exclude ADHD and PTSD** as we can't prove that those mental issues are related to any physical issues.

**Because of the low numbers of workers reporting 5 physical issues, I will group 4 and 5 issues into "4+"**

In [76]:
# subsetting to not include rows with ADHD and PTSD
mental_filtered = df.loc[~df["Mental_Health_Status"].isin(["ADHD", "PTSD"])].copy()

# adding a column that separates "None" from any other mental issue
mental_filtered["reported_mental_status"] = mental_filtered["Mental_Health_Status"].map(
    lambda x: "Didn't Report a Mental Issue" if x == "None" else "Reported a Mental Issue"
    )

# changing from integers to strings and combining 4 and 5 into "4+"
mental_filtered["Physical_Health_Issues_n"] = (
    mental_filtered["Physical_Health_Issues_n"]
    .astype("string")
    .replace(["4","5"], "4+")
)

mental_physical = (
    mental_filtered.groupby("Physical_Health_Issues_n",observed=True)\
    ["Mental_Health_Status"].value_counts(normalize=True).mul(100)\
    .rename("Percent").reset_index().sort_values("Physical_Health_Issues_n"))

# creating a bar plot with this data, and showing a 100% stack for each level
fig = px.bar(
    mental_physical,
    x="Physical_Health_Issues_n",
    y="Percent",
    color="Mental_Health_Status",
    color_discrete_map={
        "None": "#D1D5DB",
        "Anxiety": "#7EB0D5",
        "Burnout": "#8BD3C7",
        "Depression": "#C9A0DC",
        "Stress Disorder": "#F49898"
    },
    barmode='stack',
    title = "Mental Statuses Reported by Number of Physical Health Issues",
    text = "Percent",
    labels={"Mental_Health_Status": "Mental Health Status"}
    )

fig.update_xaxes(title_text="Amount of Physical Health Issues")

fig.update_traces(
    texttemplate='%{text:.0f}%', 
    textposition='inside', 
    textfont_size=14
)

fig.show()

This shows us that there is **no strong correlation between the number of physical conditions and any specific mental status**.

While **most mental statuses stay within a narrow percentage range** (3-5%), workers who **reported stress disorder show a slight jump** from *no reported physical issues* (at 12%) to *1 reported issue* (at 19%), **followed by a stable percentage** (14-16%) at *2 or more issues*.
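
The "4+" grouping used in this section can also be done with `pd.cut`, which keeps the groups as an ordered category so they sort in the right order on a plot axis. This is a sketch on made-up counts; `counts`, `labels` and `binned` are hypothetical names.

```python
import pandas as pd

# Made-up counts of physical issues per worker
counts = pd.Series([0, 1, 2, 3, 4, 5, 2], name="Physical_Health_Issues_n")

labels = ["0", "1", "2", "3", "4+"]

# Bin edges are right-inclusive: (-1, 0], (0, 1], (1, 2], (2, 3], (3, 5]
binned = pd.cut(counts, bins=[-1, 0, 1, 2, 3, 5], labels=labels)

print(binned.tolist())
print(binned.cat.categories.tolist())
```

Compared with converting to strings and replacing values, this keeps the order 0 < 1 < 2 < 3 < 4+ built into the column.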

# Conclusion and Recommendations

## Conclusion

After analyzing this synthetic dataset of self-reported employee information, we found several associations between work arrangements, social isolation, burnout and some mental health ailments:

Among the three work arrangements, **remote workers reported higher burnout severity and higher levels of social isolation compared to hybrid and on-site workers**. Additionally, **higher levels of social isolation are associated with more reports of mental health ailments**, particularly depression.

In contrast, **no meaningful relationship was found between the number of physical health ailments and the three work arrangements**. Furthermore, **the distribution of mental health ailments remained stable across the different counts of physical health issues**, indicating **no strong association between physical health issue count and specific mental health conditions** within this dataset.

Overall, these findings suggest that **psychosocial factors—such as isolation and burnout—are more strongly associated with mental health outcomes than physical health complaints** in this dataset, especially in remote work settings.

## Recommendations

Based on the observed patterns in this analysis, the following recommendations may help mitigate mental health risks, particularly among remote workers:


1. **Encourage regular and structured break periods for remote and hybrid workers**


More frequent and clearly defined rest periods may help reduce burnout severity, especially for employees with limited separation between work and personal time.



2. **Periodic in-person or socially interactive activities for remote workers**

Scheduled physical meetings, team gatherings, or collaborative sessions may help reduce feelings of social isolation and support employee well-being.



3. **Proactive mental health monitoring and support systems**

Given the higher burnout severity and social isolation levels reported by remote workers, it may be helpful to introduce regular mental health check-ins, offer counseling support, and identify early signs of burnout.