<a href="https://colab.research.google.com/github/Ulnika/Sleep-Health-and-Lifestyle/blob/main/Sleep_Health_and_Lifestyle_part1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Capstone 1. Sleep Health and Lifestyle**

Goal: examine and compare sleep parameters and lifestyle factors, such as sleep duration, sleep quality, physical activity level, stress level, heart rate, and daily steps, across various occupations.

Data: survey data on sleep health and lifestyle for 374 people.

https://www.kaggle.com/datasets/uom190346a/sleep-health-and-lifestyle-dataset

## Understanding the dataset

### Importing and Displaying Data

In [19]:
import pandas as pd
data = pd.read_csv('Sleep_health_and_lifestyle_dataset.csv')

Check information about the DataFrame including the index dtype and columns, non-null values and memory usage.



In [20]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374 entries, 0 to 373
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Person ID                374 non-null    int64  
 1   Gender                   374 non-null    object 
 2   Age                      374 non-null    int64  
 3   Occupation               374 non-null    object 
 4   Sleep Duration           374 non-null    float64
 5   Quality of Sleep         374 non-null    int64  
 6   Physical Activity Level  374 non-null    int64  
 7   Stress Level             374 non-null    int64  
 8   BMI Category             374 non-null    object 
 9   Blood Pressure           374 non-null    object 
 10  Heart Rate               374 non-null    int64  
 11  Daily Steps              374 non-null    int64  
 12  Sleep Disorder           155 non-null    object 
dtypes: float64(1), int64(7), object(5)
memory usage: 38.1+ KB


The DataFrame has 13 columns and 374 rows. Columns 0-11 contain no null values, but the Sleep Disorder column has only 155 non-null values, so I replace the nulls with the string "NA":

In [21]:
data = data.fillna('NA')
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374 entries, 0 to 373
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Person ID                374 non-null    int64  
 1   Gender                   374 non-null    object 
 2   Age                      374 non-null    int64  
 3   Occupation               374 non-null    object 
 4   Sleep Duration           374 non-null    float64
 5   Quality of Sleep         374 non-null    int64  
 6   Physical Activity Level  374 non-null    int64  
 7   Stress Level             374 non-null    int64  
 8   BMI Category             374 non-null    object 
 9   Blood Pressure           374 non-null    object 
 10  Heart Rate               374 non-null    int64  
 11  Daily Steps              374 non-null    int64  
 12  Sleep Disorder           374 non-null    object 
dtypes: float64(1), int64(7), object(5)
memory usage: 38.1+ KB
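As a side note, the fill above is applied to the whole DataFrame. A minimal sketch of a column-targeted alternative follows, assuming a small made-up table (`toy` and its values are hypothetical and not taken from the dataset). It counts the missing values first and then fills only the column that is expected to contain gaps.

```python
import pandas as pd

# Hypothetical miniature of the survey table (values are made up for illustration)
toy = pd.DataFrame({
    "Occupation": ["Doctor", "Nurse", "Teacher", "Engineer"],
    "Sleep Disorder": [None, "Sleep Apnea", "Insomnia", None],
})

# Count missing values per column before filling
missing_before = toy.isna().sum()

# Fill only the column that is expected to contain gaps,
# so unexpected nulls in other columns would stay visible
toy["Sleep Disorder"] = toy["Sleep Disorder"].fillna("NA")

# Count again to confirm that the gaps are gone
missing_after = toy.isna().sum()
print(missing_before["Sleep Disorder"], missing_after["Sleep Disorder"])
```

Filling a single column keeps the placeholder string out of numeric columns if they ever contain nulls.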


In [22]:
data.head()

| | Person ID | Gender | Age | Occupation | Sleep Duration | Quality of Sleep | Physical Activity Level | Stress Level | BMI Category | Blood Pressure | Heart Rate | Daily Steps | Sleep Disorder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 27 | Software Engineer | 6.1 | 6 | 42 | 6 | Overweight | 126/83 | 77 | 4200 | NA |
| 1 | 2 | Male | 28 | Doctor | 6.2 | 6 | 60 | 8 | Normal | 125/80 | 75 | 10000 | NA |
| 2 | 3 | Male | 28 | Doctor | 6.2 | 6 | 60 | 8 | Normal | 125/80 | 75 | 10000 | NA |
| 3 | 4 | Male | 28 | Sales Representative | 5.9 | 4 | 30 | 8 | Obese | 140/90 | 85 | 3000 | Sleep Apnea |
| 4 | 5 | Male | 28 | Sales Representative | 5.9 | 4 | 30 | 8 | Obese | 140/90 | 85 | 3000 | Sleep Apnea |


### Understanding the gender, age and occupation data


To gain a deeper understanding of the dataset, I will analyze the distribution of genders, age groups, and occupations among the participants. This step will help identify the demographic coverage of the data and highlight any significant patterns or gaps.

By examining these factors, we can assess whether the dataset represents a diverse population or if certain groups are overrepresented or underrepresented, ensuring a more accurate interpretation of sleep trends across various segments.

In [23]:
genders = data['Gender'].value_counts()
print(genders)

Gender
Male      189
Female    185
Name: count, dtype: int64


The dataset represents a nearly balanced gender distribution, with 189 males and 185 females.

In [24]:
data['Age'].describe()

| | Age |
|---|---|
| count | 374.0 |
| mean | 42.184492 |
| std | 8.673133 |
| min | 27.0 |
| 25% | 35.25 |
| 50% | 43.0 |
| 75% | 50.0 |
| max | 59.0 |


In [25]:
pd.cut(data['Age'], bins=[25, 30, 35, 40, 45, 50, 55, 60]).value_counts().sort_index()

| Age | count |
|---|---|
| (25, 30] | 32 |
| (30, 35] | 62 |
| (35, 40] | 71 |
| (40, 45] | 99 |
| (45, 50] | 34 |
| (50, 55] | 43 |
| (55, 60] | 33 |


The age distribution peaks in the 40-45 group (99 people), followed by the 35-40 group (71 people). The smallest group is 25-30 (32 people). Each interval excludes its lower bound and includes its upper bound, so a 30-year-old falls into the 25-30 group.
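To make the interval boundaries explicit, here is a minimal sketch of the same binning with readable labels. The ages and the label strings are hypothetical; the bin edges are the ones used above. By default `pd.cut` builds right-closed intervals, so an age equal to a bin edge goes into the lower group.

```python
import pandas as pd

# Hypothetical ages, chosen so that some fall exactly on a bin edge
ages = pd.Series([27, 30, 31, 35, 43, 45, 59], name="Age")

# Same bin edges as in the analysis above
bins = [25, 30, 35, 40, 45, 50, 55, 60]

# Hypothetical labels that spell out the integer ages each bin covers
labels = ["26-30", "31-35", "36-40", "41-45", "46-50", "51-55", "56-60"]

# right=True is the default: each bin is (lower, upper]
age_groups = pd.cut(ages, bins=bins, labels=labels)

# Empty bins are kept with a count of zero because the result is categorical
counts = age_groups.value_counts().sort_index()
print(counts)
```

Explicit labels avoid the ambiguity of names such as "40-45", where it is not obvious which group a 45-year-old belongs to.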

In [26]:
data.groupby(['Occupation']).size().sort_values(ascending=False)

| Occupation | Count |
|---|---|
| Nurse | 73 |
| Doctor | 71 |
| Engineer | 63 |
| Lawyer | 47 |
| Teacher | 40 |
| Accountant | 37 |
| Salesperson | 32 |
| Scientist | 4 |
| Software Engineer | 4 |
| Sales Representative | 2 |


The largest groups are nurses (73), doctors (71), engineers (63), lawyers (47), and teachers (40), followed by accountants (37) and salespersons (32). Scientists (4), software engineers (4), sales representatives (2), and managers (1) are minimally represented.

In [27]:
data.groupby(['Occupation','Gender']).size()

| Occupation | Gender | Count |
|---|---|---|
| Accountant | Female | 36 |
| Accountant | Male | 1 |
| Doctor | Female | 2 |
| Doctor | Male | 69 |
| Engineer | Female | 32 |
| Engineer | Male | 31 |
| Lawyer | Female | 2 |
| Lawyer | Male | 45 |
| Manager | Female | 1 |
| Nurse | Female | 73 |


Gender distribution within occupations reveals notable trends:

*   Nurses and scientists are exclusively female, while salespersons, software engineers, and sales representatives are entirely male.
*   Doctors (69 males, 2 females) and lawyers (45 males, 2 females) are predominantly male.
*   Engineers are close to gender parity (31 males, 32 females).
*   Teachers are mostly female (35 females, 5 males), and accountants are overwhelmingly female (36 females, 1 male).
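The occupation-by-gender counts can also be laid out as a two-way table, which makes the all-female or all-male groups easier to spot. The sketch below uses `pd.crosstab` on a small made-up table (`toy` and its rows are hypothetical); the column names match those of the dataset.

```python
import pandas as pd

# Hypothetical rows, only to illustrate the shape of the result
toy = pd.DataFrame({
    "Occupation": ["Nurse", "Nurse", "Doctor", "Doctor", "Doctor", "Engineer", "Engineer"],
    "Gender": ["Female", "Female", "Male", "Male", "Female", "Female", "Male"],
})

# Absolute counts: one row per occupation, one column per gender,
# with 0 where a combination does not occur
counts = pd.crosstab(toy["Occupation"], toy["Gender"])

# Row-wise shares: each occupation's row sums to 1
shares = pd.crosstab(toy["Occupation"], toy["Gender"], normalize="index").round(2)

print(counts)
print(shares)
```

Unlike `groupby(...).size()`, the cross-tabulation shows missing combinations as explicit zeros.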




### Conclusion

* The dataset shows a nearly balanced gender distribution, and most participants are in the 30-50 age range.
* Nurses, teachers, and accountants are predominantly female, while doctors, lawyers, and salespersons are largely male.
* Engineers are close to gender balance, while scientists, software engineers, sales representatives, and managers are underrepresented.

## Assumption to exclude underrepresented groups

Previously, I demonstrated that some occupations are underrepresented in the provided dataset, such as:

- scientists,
- software engineers,
- sales representatives,
- managers.

For further analysis, I'll create a dataset based on the original one but excluding the data of listed occupations.

In [28]:
data = data[~data['Occupation'].isin(['Scientist', 'Software Engineer', 'Sales Representative', 'Manager'])]

print(data.info())
print(data['Occupation'].value_counts())


<class 'pandas.core.frame.DataFrame'>
Index: 363 entries, 1 to 373
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Person ID                363 non-null    int64  
 1   Gender                   363 non-null    object 
 2   Age                      363 non-null    int64  
 3   Occupation               363 non-null    object 
 4   Sleep Duration           363 non-null    float64
 5   Quality of Sleep         363 non-null    int64  
 6   Physical Activity Level  363 non-null    int64  
 7   Stress Level             363 non-null    int64  
 8   BMI Category             363 non-null    object 
 9   Blood Pressure           363 non-null    object 
 10  Heart Rate               363 non-null    int64  
 11  Daily Steps              363 non-null    int64  
 12  Sleep Disorder           363 non-null    object 
dtypes: float64(1), int64(7), object(5)
memory usage: 39.7+ KB
None
Occupation
Nurse      
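The filter above lists the excluded occupations by name. A minimal sketch of a threshold-based variant follows, assuming a made-up table and a hypothetical cutoff `MIN_GROUP_SIZE`; neither the cutoff nor the toy rows come from the original analysis.

```python
import pandas as pd

# Hypothetical occupation column with two large and two tiny groups
toy = pd.DataFrame({
    "Occupation": ["Nurse"] * 5 + ["Doctor"] * 4 + ["Scientist"] * 2 + ["Manager"]
})

# Hypothetical cutoff: groups smaller than this are treated as underrepresented
MIN_GROUP_SIZE = 3

# Count rows per occupation and collect the names of the small groups
counts = toy["Occupation"].value_counts()
rare = counts[counts < MIN_GROUP_SIZE].index

# Keep only rows whose occupation is not in the rare set;
# .copy() gives an independent DataFrame rather than a view of the original
filtered = toy[~toy["Occupation"].isin(rare)].copy()

print(sorted(rare))
print(filtered["Occupation"].value_counts())
```

A threshold makes the exclusion rule explicit and keeps working if the list of occupations in the data changes.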

## Analysis of the sleep parameters and lifestyle for each occupation

Here I calculate the average sleep duration, quality of sleep, physical activity level, stress level, heart rate, and daily steps for each occupation remaining in the dataset.

This analysis will help identify trends and potential correlations between professional roles, sleep habits and lifestyle.


In [29]:
data.head()

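| | Person ID | Gender | Age | Occupation | Sleep Duration | Quality of Sleep | Physical Activity Level | Stress Level | BMI Category | Blood Pressure | Heart Rate | Daily Steps | Sleep Disorder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | Male | 28 | Doctor | 6.2 | 6 | 60 | 8 | Normal | 125/80 | 75 | 10000 | NA |
| 2 | 3 | Male | 28 | Doctor | 6.2 | 6 | 60 | 8 | Normal | 125/80 | 75 | 10000 | NA |
| 6 | 7 | Male | 29 | Teacher | 6.3 | 6 | 40 | 7 | Obese | 140/90 | 82 | 3500 | Insomnia |
| 7 | 8 | Male | 29 | Doctor | 7.8 | 7 | 75 | 6 | Normal | 120/80 | 70 | 8000 | NA |
| 8 | 9 | Male | 29 | Doctor | 7.8 | 7 | 75 | 6 | Normal | 120/80 | 70 | 8000 | NA |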


In [30]:
# Group the filtered data by occupation
groups_occupation = data.groupby('Occupation')

# Sleep metrics: mean per occupation, rounded to two decimals
avg_sleep_duration = groups_occupation['Sleep Duration'].mean().round(2)
avg_sleep_quality = groups_occupation['Quality of Sleep'].mean().round(2)

# Activity and stress: mean per occupation, rounded to whole numbers
avg_activity = groups_occupation['Physical Activity Level'].mean().round(0)
avg_stress = groups_occupation['Stress Level'].mean().round(0)

# Heart rate and daily steps: mean per occupation, rounded to whole numbers
avg_HR = groups_occupation['Heart Rate'].mean().round(0)
avg_steps = groups_occupation['Daily Steps'].mean().round(0)

# Combine the per-occupation averages into one summary table
df_avg = pd.DataFrame({
    'Average Sleep Duration': avg_sleep_duration,
    'Average Sleep Quality': avg_sleep_quality,
    'Average Physical Activity Level': avg_activity,
    'Average Stress Level': avg_stress,
    'Average Heart Rate': avg_HR,
    'Average Daily Steps': avg_steps
})

# Order occupations from best to worst average sleep quality
df_avg = df_avg.sort_values(by = 'Average Sleep Quality', ascending=False)
df_avg


| Occupation | Average Sleep Duration | Average Sleep Quality | Average Physical Activity Level | Average Stress Level | Average Heart Rate | Average Daily Steps |
|---|---|---|---|---|---|---|
| Engineer | 7.99 | 8.41 | 52.0 | 4.0 | 67.0 | 5981.0 |
| Accountant | 7.11 | 7.89 | 58.0 | 5.0 | 69.0 | 6881.0 |
| Lawyer | 7.41 | 7.89 | 70.0 | 5.0 | 70.0 | 7662.0 |
| Nurse | 7.06 | 7.37 | 79.0 | 6.0 | 72.0 | 8058.0 |
| Teacher | 6.69 | 6.98 | 46.0 | 5.0 | 67.0 | 5958.0 |
| Doctor | 6.97 | 6.65 | 55.0 | 7.0 | 72.0 | 6808.0 |
| Salesperson | 6.4 | 6.0 | 45.0 | 7.0 | 72.0 | 6000.0 |


The small DataFrame *df_avg* represents average parameters related to various occupations, providing insight into how different professional groups compare in metrics such as sleep duration, sleep quality, physical activity level, stress levels, heart rate, and daily steps. This allows for a clearer understanding of lifestyle patterns and potential occupational influences on health and well-being.
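A summary table of this kind can also be produced in a single `groupby(...).agg(...)` call with named aggregations. The sketch below is one possible alternative, not the approach used in this notebook; the toy rows and the output column names (`avg_sleep_duration` and so on) are hypothetical.

```python
import pandas as pd

# Hypothetical rows with the same column names as the dataset
toy = pd.DataFrame({
    "Occupation": ["Doctor", "Doctor", "Nurse", "Nurse"],
    "Sleep Duration": [6.0, 7.0, 7.5, 6.5],
    "Quality of Sleep": [6, 7, 8, 7],
    "Stress Level": [8, 6, 5, 6],
})

# Named aggregation: output_column=(input_column, function)
summary = (
    toy.groupby("Occupation")
    .agg(
        avg_sleep_duration=("Sleep Duration", "mean"),
        avg_sleep_quality=("Quality of Sleep", "mean"),
        avg_stress=("Stress Level", "mean"),
    )
    .round(2)
    # Best average sleep quality first
    .sort_values("avg_sleep_quality", ascending=False)
)

print(summary)
```

This keeps the list of metrics in one place, at the cost of applying the same rounding to every column.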

In [31]:
import math
from bokeh.plotting import figure, show
from bokeh.io import output_notebook

output_notebook()

In [32]:
occupation = list(df_avg.index)
source = {'Occupations': occupation, 'Average sleep quality': df_avg['Average Sleep Quality']}

Figure1 = figure(title = "Average sleep quality for different occupations", x_range = occupation, y_range = (0, 8.5),
                x_axis_label = "Occupations", y_axis_label = "Average sleep quality",
                height = 300, width = 800)

Figure1.vbar(x = 'Occupations', top = 'Average sleep quality', source = source, width = 0.7)
Figure1.xaxis.major_label_orientation = math.pi/4

show(Figure1)


In [33]:
from bokeh.models import ColumnDataSource
from bokeh.models import LinearAxis, Range1d

source = ColumnDataSource(df_avg)


Figure2 = figure(title = "Sleep parameters and lifestyle for different occupations", x_range = occupation, y_range = (0, 9),
                x_axis_label = "Occupations", y_axis_label = "Average sleep quality, sleep duration and stress level",
                height = 400, width = 800)

# Define the secondary y-range and its axis before any glyph refers to it
Figure2.extra_y_ranges = {"y2": Range1d(start=0, end=90)}
Figure2.add_layout(LinearAxis(y_range_name='y2', axis_label='Average physical activity level'), 'right')

# Bars and lines on the primary (left) axis
Figure2.vbar(x = 'Occupation', top = 'Average Sleep Quality', source = source, color="lightblue", width =0.7,  legend_label="Average Sleep Quality" )
Figure2.line(x = 'Occupation', y = 'Average Sleep Duration', source = source, color="blue", line_width =2, legend_label="Average Sleep Duration")
Figure2.line(x = 'Occupation', y = 'Average Stress Level', source = source, color="red", line_width =2, legend_label="Average Stress Level")

# Physical activity uses the secondary (right) axis because of its larger scale
Figure2.line(x = 'Occupation', y = 'Average Physical Activity Level', source = source, color="magenta", line_width =2,
             legend_label="Average Physical Activity Level", y_range_name="y2")

Figure2.xaxis.major_label_orientation = math.pi/4
Figure2.legend.location = 'bottom_left'
show(Figure2)

## Conclusion

Sleep Duration: Engineers report the longest average sleep duration (7.99 hours), while salespeople have the shortest (6.4 hours). The other occupations fall in between, ranging from 6.69 hours (teachers) to 7.41 hours (lawyers).

Sleep Quality: Engineers have the highest average sleep quality (8.41), while salespeople report the lowest (6.0).

Physical Activity Level: Nurses show the highest average physical activity level (79). Teachers (46) and salespeople (45) report the lowest.

Stress Level: Doctors and salespeople report the highest average stress level (about 7), while engineers report the lowest (about 4).

In summary, the most physically active occupation (nurses) has moderate sleep quality, while the occupations with the highest stress levels (doctors and salespeople) have the lowest average sleep quality.
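The relationships described in this conclusion can be checked roughly by correlating the occupation-level averages. The sketch below re-enters the values from the *df_avg* table by hand and computes Pearson correlations with `DataFrame.corr`. With only seven group averages the coefficients are indicative at best, and the choice of Pearson correlation is an assumption made for this illustration.

```python
import pandas as pd

# Occupation-level averages copied from the df_avg table above
avg = pd.DataFrame(
    {
        "Average Sleep Quality": [8.41, 7.89, 7.89, 7.37, 6.98, 6.65, 6.0],
        "Average Physical Activity Level": [52, 58, 70, 79, 46, 55, 45],
        "Average Stress Level": [4, 5, 5, 6, 5, 7, 7],
    },
    index=["Engineer", "Accountant", "Lawyer", "Nurse", "Teacher", "Doctor", "Salesperson"],
)

# Pairwise Pearson correlation between the columns (pandas default method)
corr = avg.corr()

# Correlation of each lifestyle metric with average sleep quality
stress_vs_quality = corr.loc["Average Stress Level", "Average Sleep Quality"]
activity_vs_quality = corr.loc["Average Physical Activity Level", "Average Sleep Quality"]

print(round(stress_vs_quality, 2), round(activity_vs_quality, 2))
```

A negative coefficient for stress versus sleep quality would be in line with the summary above. A correlation computed on the individual survey rows rather than on group averages could give a different picture.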