## Introduction
This notebook is a case study for the Google Data Analytics Professional Course on Coursera. Its purpose is to present the analytical skills I have gained through the course. As outlined in the course's Case Study Roadmap, I will follow the key steps of the data analysis process: Ask, Prepare, Process, Analyze, Share, and Act.

## Ask Phase

### Business Task: 
Identify key factors contributing to student depression and how they vary across demographic groups, to guide targeted interventions and improve mental health outcomes.

### The stakeholders:
1. University Administrators: department heads, student affairs directors.
   - Relevance: Responsible for student well-being and campus programs.
   - Understanding depression factors helps inform policies, resource allocation, and support structures.
2. Mental Health Professionals: on-campus counselors, psychologists.
   - Relevance: Directly involved in student mental health.
   - Insights help tailor interventions and enhance support programs.

### Questions

1. What are the key factors most strongly associated with depression among students?

2. Which age group should be prioritized for mental health interventions?

3. Which programs could be implemented, based on the results, to reduce depression among high-risk student groups?


### Metrics

- Chi-squared Test Results: Analyze the relationships between potential factors and depression to identify significant associations.
- Distribution of depression for each age group
   

## Prepare Phase:
- The *"Depression Survey/Dataset for Analysis"* was collected through an anonymous survey conducted between January and June 2023. The survey covered various cities and targeted individuals from diverse backgrounds and professions. Its aim was to provide insights into how different factors might contribute to depression risk among adults.

- This dataset is hosted on Kaggle and contains **2,556 rows and 19 columns**. Each row corresponds to an individual, while the columns capture factors that might contribute to depression:

- *Note: All scales used in this dataset are on a 1–5 range, where 1 represents minimal and 5 represents maximum, except for CGPA, which follows a 1–10 scale.*

    1. Numerical Columns (8 factors):
       - Age: Age of the individual in years.
       - Academic Pressure: Self-reported academic pressure, on a 1-5 scale
       - Job Pressure: Self-reported job pressure, on a 1-5 scale 
       - Study Satisfaction: Satisfaction with academic studies, on a 1-5 scale
       - Job Satisfaction: Satisfaction with job, on a 1-5 scale 
       - Work/Study Hours: Average daily study or work hours.
       - Financial Stress: Self-reported financial stress, on a 1-5 scale
       - CGPA: Cumulative Grade Point Average (CGPA), on a 1-10 scale 

    2. Categorical Columns (11 factors):
       - Name: Name of the participant
       - Gender: Gender of the individual (Male, Female).
       - City: City of residence.
       - Working Professional or Student
       - Profession: Occupation
       - Degree: Academic degrees that the individuals may hold.
       - Sleep Duration: Average sleep duration (Less than 5 hours, 5-6 hours, 6-7 hours, 7-8 hours and More than 8 hours).
       - Dietary Habits: Self-reported dietary habits (Healthy, Moderate, Unhealthy).
       - Suicidal Thoughts: History of suicidal thoughts (Yes, No).
       - Family History: Family history of mental illness (Yes, No).
       - Depression: Current state of depression (Yes, No).
       

- Participants self-reported their responses without undergoing professional mental health assessments or diagnostic tests.

- This dataset has been released under the Creative Commons Zero (CC0) license by Suman Sharma. This license means no copyright restrictions, free use, and no attribution required.



## Process Phase  

For the dataset analysis I will use R because it offers a powerful and flexible environment for statistical analysis, handling large datasets and creating insightful visualizations.

- First, clean the data by renaming column headers and ensuring that the variables have the correct data types.
- Next, filter the dataset to include only student-related data and remove unnecessary columns that are not relevant to the study.
- Then, check for missing values and duplicates to ensure data quality.
- Next, convert gender and the other categorical variables to factors so that R treats them as categorical.
- Finally, add a new column that indicates whether both the depression and suicidal_thoughts columns are marked as 'Yes'. Students who report both depression and suicidal thoughts are at higher risk of self-harm or suicide, so this combination represents a critical mental health warning sign that requires immediate attention.

#### Load packages

In [None]:
library('tidyverse')
library('janitor')
library('dplyr')
library('ggplot2')

#### Read data and explore

In [None]:
depression <- read.csv('/kaggle/input/depression-surveydataset-for-analysis/final_depression_dataset_1.csv')
head(depression)
glimpse(depression)

#### Applying clean_names, Renaming Columns, Adjusting Variable Types and Adding a New Column

In [None]:
depression_clean <- depression %>%
  clean_names() %>%  # Clean column names
  rename(            # Renaming columns
    suicidal_thoughts = have_you_ever_had_suicidal_thoughts,
    fam_hist_mental_illness = family_history_of_mental_illness,
    study_hours = work_study_hours,
    student = working_professional_or_student
  ) %>%
  mutate(  # Convert chr to factor
    gender = factor(gender), 
    dietary_habits = factor(dietary_habits), 
    fam_hist_mental_illness = factor(fam_hist_mental_illness),  
    study_satisfaction = factor(study_satisfaction),
    depression = factor(depression), 
    academic_pressure = factor(academic_pressure),
    financial_stress = factor(financial_stress)
  )%>%
  mutate( # Add a new column that indicates whether both the depression and suicidal_thoughts columns are marked as 'Yes'
    depression_suicidal = ifelse(depression == "Yes" & suicidal_thoughts == "Yes", "Yes", "No")  
  )

glimpse(depression_clean)
colnames(depression_clean)


#### Filter dataset to include only students and drop unnecessary columns

In [None]:
depression_clean <- depression_clean %>% 
  filter(student == "Student") %>%  # Keep only student respondents
  select(                           # Drop columns that are not relevant to students
      -profession,
      -work_pressure,
      -job_satisfaction,
      -city,
      -degree
      )

glimpse(depression_clean)


#### Check for missing values and duplicates 

In [None]:
#missing values
depression_clean %>% 
  is.na()%>%  #Check for NA (missing) values, creating a logical matrix
  colSums()   #Sum the TRUE values (which represent missing values) for each column

#duplicates
sum(duplicated(depression_clean)) #Count the number of duplicate rows

#summary statistics
summary(depression_clean)

## Analyze and Share Phase
Organize and format the data to answer the questions. Perform calculations and identify trends and relationships within the data.

#### Gender Distribution

In [None]:
# Count occurrences of each gender
gender_count <- depression_clean %>%
  group_by(gender) %>%
  summarise(Count = n())%>%
  mutate(
      Percentage = (Count / sum(Count)) * 100
  )

# Display the count
print(gender_count)

In [None]:
gender_count %>% 
  ggplot(aes(x = "", y = Percentage, fill = gender)) +   # Set x = "" to make it a single stacked bar
  geom_col(width = 1, color = "white")  +             # Creates the pie chart using a single column per gender.
  coord_polar("y", start = 0) +          # Convert bar chart to pie chart
  labs(
      title = "Gender Distribution"
  ) +               
  theme_void() +                         # Remove background and axes
  theme(
      plot.title = element_text(hjust = 0.5),  # Center the title
      legend.title = element_blank()        #Remove the legend title 
  ) +  
  geom_text(
      aes(label = paste0(round(Percentage, 1), "%")),    # Add percentages as labels
      position = position_stack(vjust = 0.5),    # Place the labels in the middle of each slice
      color = "white"
  )   

#### Age Distribution 

In [None]:
depression_clean%>%
ggplot(aes(x = age)) +
  geom_histogram(binwidth = 2, color = "black", fill = "skyblue") +  # Customize bin width and colors
  labs(
      title = "Age Distribution",
      x = "Age", 
      y = "Frequency"
  ) +  
  theme_minimal()+
  theme(
      plot.title = element_text(hjust = 0.5)
  ) 
 

#### Depression Distribution

In [None]:
depression_clean %>% 
  ggplot(aes(x = depression, fill = depression)) +  # Color bars by depression status
  geom_bar(position = "dodge") +  # Use dodge position to separate the bars for each depression status
  labs(
      title = "Depression Distribution",
      x = "", 
      y = "Count of People"
  ) +
theme_minimal()+
  theme(
      plot.title = element_text(hjust = 0.5)
  ) 

#### Depression Across Different Age Groups

In [None]:
depression_clean%>%
  filter(depression == "Yes") %>%
  ggplot(aes(x = age)) +
  geom_histogram(binwidth = 2, color = "black", fill = "skyblue") +  # Customize bin width and colors
  labs(
      title = "Depression Across Different Age Groups",
      x = "Age",
      y = "Count of People"
  ) +  
  theme_minimal() + 
  theme(
      plot.title = element_text(hjust = 0.5)
  ) 
 

#### Age Groups

To better visualize the data, I will group age into four ranges based on the depression distribution, which makes patterns easier to spot and plot:


In [None]:
# Define the age ranges and group them
depression_clean <- depression_clean %>%
  mutate(
    age_group = case_when(
      age >= 18 & age <= 22 ~ "18-22",
      age >= 23 & age <= 26 ~ "23-26",
      age >= 27 & age <= 30 ~ "27-30",
      age >= 31 & age <= 34 ~ "31-34"
    )
  )

glimpse(depression_clean)

#### Depressed Students with Suicidal Thoughts, Faceted by Age Group

In [None]:
depression_clean %>%  
  filter(depression_suicidal == "Yes") %>%
  ggplot(aes(x = depression_suicidal, fill = gender)) +  # Color bars by gender
  geom_bar(position = "dodge") +  # Separate bars for each gender
  facet_grid(gender ~ age_group) +  # Grid facets by gender and age group
  labs(
      title = "Distribution of Depressed Students with Suicidal Thoughts",
      subtitle = "Grouped by Gender and Age Group",
      x = "Depressed Students with Suicidal Thoughts",  
      y = "Count of Students"
  ) +
  theme_minimal() +
  theme(
      plot.title = element_text(hjust = 0.5),  # Center title
      plot.subtitle = element_text(hjust = 0.5),  # Center subtitle
      axis.text.x = element_blank(),  # Remove x-axis tick labels
      axis.ticks.x = element_blank(),  # Remove x-axis ticks
      strip.background = element_rect(fill = "grey90", color = NA) # Grey background for facet names
  )


These plots highlight that the 18–22 age group should be prioritized for mental health interventions, as it shows the highest counts of depression and suicidal thoughts.
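
Because the age groups differ in size, raw counts can overstate how concentrated depression is in the largest group. A minimal sketch of comparing rates instead of counts, written in base R with a small made-up data frame (`toy` and its values are illustrative assumptions, not survey data):

```r
# Toy data: hypothetical values for illustration only, not taken from the survey
toy <- data.frame(
  age_group  = c("18-22", "18-22", "18-22", "18-22", "23-26", "23-26"),
  depression = c("Yes", "Yes", "Yes", "No", "Yes", "No")
)

# Cross-tabulate age group (rows) against depression status (columns)
counts <- table(toy$age_group, toy$depression)
print(counts)

# Row proportions: the share of each age group in each depression category
rates <- prop.table(counts, margin = 1)

# Depression rate per age group as a named vector
depression_rate <- rates[, "Yes"]
print(round(depression_rate, 2))
```

Applied to the notebook's data, the same idea would be `prop.table(table(depression_clean$age_group, depression_clean$depression), margin = 1)`.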

In [None]:
depression_clean %>% 
  filter(depression=="Yes") %>% 
  ggplot(aes(x = depression, fill=fam_hist_mental_illness)) +  # Color bars by family history of mental illness
  geom_bar(position = "dodge") +  # Use dodge position to separate the bars for each family history status
  facet_grid(gender~age_group) +
  labs(
      title = "Depressed Students with Family History of Mental Illness",
      subtitle = "Faceted by Gender and Age Group",
      x = "Depression",
      y = "Count of People") +
theme_minimal()+
theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5), 
    axis.text.x = element_blank(),  # Remove x-axis tick labels
    axis.ticks.x = element_blank(),
    strip.background = element_rect(fill = "grey90", color = NA) # Grey background for facet names
)  

The plots do not reveal a clear relationship between depression and a family history of mental illness among females in any age group. Among males, however, most cases of depression in the 18-22 and 27-30 age groups also report a family history of mental illness. This suggests that a family history of mental illness is associated with depression among males, although additional research is needed because other factors may be involved.

#### Sleep Duration Distribution in Depressed Students

In [None]:
# Count occurrences of sleep duration
sleep_count <- depression_clean %>%
  filter(depression =="Yes")%>%
  group_by(sleep_duration) %>%
  summarise(count_s = n())%>%
  mutate(percentage_s = (count_s / sum(count_s)) * 100)

# Display the count
print(sleep_count)

In [None]:
sleep_count %>% 
  ggplot(aes(x = "", y = percentage_s, fill = sleep_duration)) +   
  geom_col(width = 1, color = "white") +           
  coord_polar("y", start = 0) + 
  labs(
    title = "Sleep Duration Distribution in Depressed Students"
  ) +                
  theme_void() +                         
  theme(
    plot.title = element_text(hjust = 0.5)  
  ) +  
  geom_text(
    aes(label = paste0(round(percentage_s, 1), "%")),   
    position = position_stack(vjust = 0.5),   
    color = "white"
  )

In [None]:
# Bar plot to visualize sleep duration by depression status, faceted by age group
depression_clean %>%  
  ggplot(aes(x = depression, fill = sleep_duration)) +  # Color bars by sleep duration
  geom_bar(position = "dodge") +  # Use dodge position to separate bars for each sleep duration
  facet_wrap(~age_group) +  # Create separate plots for each age group
  labs(
      title = "Sleep Duration Distribution by Depression Status",
      subtitle = "Faceted by Age Group",
      x = "Depression",
      y = "Count of People"
  ) +
  theme_minimal() + 
  theme(
      plot.title = element_text(hjust = 0.5),  # Center the title
      plot.subtitle = element_text(hjust = 0.5),  # Center the subtitle
      strip.background = element_rect(fill = "grey90", color = NA) # Grey background for facet names
  )

The relatively even distribution suggests that depression affects young individuals in varied ways, contributing to both insufficient and excessive sleep among students. However, in the 31-34 age group, non-depressed students are noticeably concentrated in the "7-8 hours" and "More than 8 hours" categories, indicating a more consistent sleep pattern. Additional data and further analysis are necessary to better understand these trends.
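
Sleep duration is an ordered category, but R sorts character values alphabetically by default, which can scramble the order of bars and legend entries. One possible approach, sketched here with made-up values, is to set the factor levels explicitly. The level labels below follow the categories listed in the Prepare phase and are assumed to match the raw data exactly; any value not listed in `sleep_levels` would become `NA`.

```r
# Assumed category labels, in their natural order (shortest to longest sleep)
sleep_levels <- c(
  "Less than 5 hours", "5-6 hours", "6-7 hours",
  "7-8 hours", "More than 8 hours"
)

# Toy responses: hypothetical values for illustration only
toy_sleep <- c("7-8 hours", "Less than 5 hours", "More than 8 hours", "5-6 hours")

# Build an ordered factor so plots and tables follow the natural order
sleep_factor <- factor(toy_sleep, levels = sleep_levels, ordered = TRUE)

print(levels(sleep_factor))  # Levels appear in the order given above
print(table(sleep_factor))   # Unused levels are kept with a count of zero
print(sort(sleep_factor))    # Sorting follows the level order, not the alphabet
```

In the notebook this could be applied inside `mutate()`, for example `sleep_duration = factor(sleep_duration, levels = sleep_levels)`, before plotting.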

#### Impact of Academic Pressure on Depression

In [None]:
depression_clean %>%    
    ggplot(aes(x = academic_pressure, fill = depression)) +
    geom_bar(position = "dodge") +
    labs(
        title = "Impact of Academic Pressure on Depression",
        x = "Academic Pressure",
        y = "Count of People"
    ) +
    theme_minimal() +
    theme(
        plot.title = element_text(hjust = 0.5)
    )

In [None]:
depression_clean %>%
  ggplot(aes(x = depression, y = as.numeric(as.character(academic_pressure)), fill = depression)) +  # Convert factor labels (not level codes) to numbers
  geom_boxplot() +
  facet_wrap(~ age_group) +
  labs(
    title = "Impact of Academic Pressure on Depression",
    subtitle = "Faceted by Age Groups",
    x = "Depression",
    y = "Academic Pressure"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    strip.background = element_rect(fill = "grey90", color = NA) # Grey background for facet names
  )

Academic pressure appears to be associated with depression, as higher levels of pressure are consistently observed among students reporting depression. This relationship seems stronger in the 18–22 and 23–26 age groups. In these groups, students reporting depression tend to experience higher levels of academic pressure (median ~4), while those without depression report much lower pressure (median ~2).
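
The medians read off the boxplots can also be computed directly. A minimal sketch with made-up values follows. Note that a factor must be converted through `as.character()` first, because `as.numeric()` applied to a factor returns the internal level codes rather than the labels.

```r
# Toy data: hypothetical values for illustration only, not taken from the survey
toy <- data.frame(
  academic_pressure = factor(c(4, 5, 4, 2, 1, 2)),
  depression        = c("Yes", "Yes", "Yes", "No", "No", "No")
)

# Pitfall: level codes differ from labels here because level "3" is absent
print(as.numeric(toy$academic_pressure))  # Codes, not the original values

# Safe conversion: factor -> character -> numeric
toy$pressure_num <- as.numeric(as.character(toy$academic_pressure))

# Median academic pressure per depression status
medians <- tapply(toy$pressure_num, toy$depression, median)
print(medians)
```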

#### Impact of Academic Pressure on Depression, Faceted by Study Satisfaction

In [None]:
depression_clean %>%    
    ggplot(aes(x = academic_pressure,fill = depression)) +
    geom_bar(position = "dodge") +
    facet_wrap(~study_satisfaction) + 
    labs(
        title = "Impact of Academic Pressure on Depression",
        subtitle = "Faceted by Study Satisfaction",
        x = "Academic Pressure",
        y = "Count of People") +
    theme_minimal() +
    theme(
        plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5),
        strip.background = element_rect(fill = "grey90", color = NA) # Grey background for facet names
    )

The previous plots reveal a noticeable increase in the number of students with depression at lower study satisfaction levels (1 and 2) and at higher levels of academic pressure.

The number of non-depressed students increases with higher study satisfaction and lower academic pressure.


##### Relationship between Academic Pressure and Depression

###### Using Pearson's Chi-squared test 

In [None]:
table_data <- table(depression_clean$academic_pressure, depression_clean$depression)
print(table_data)
chi_sq_result <- chisq.test(table_data)
print(chi_sq_result)

# X-squared: Chi-squared statistic. 
# df: Degrees of freedom 
# p-value: Less than 0.05 suggests a significant association

The chi-squared test is a common way to test for an association between two categorical variables.

A large chi-squared value (with a small p-value) suggests that there is a significant association between academic pressure and depression.
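
A significant p-value indicates that an association exists but not how strong it is, and with a large sample even weak associations can reach significance. One common complement is Cramér's V, an effect size derived from the chi-squared statistic that ranges from 0 (no association) to 1. A minimal sketch on a made-up 2x2 table follows; the counts and the helper name `cramers_v` are illustrative assumptions and not part of the original analysis.

```r
# Toy contingency table: hypothetical counts for illustration only
toy_table <- matrix(
  c(30, 10, 10, 30),
  nrow = 2,
  dimnames = list(pressure = c("High", "Low"), depression = c("Yes", "No"))
)

# Cramér's V: sqrt(chi-squared / (n * (min(rows, cols) - 1)))
cramers_v <- function(tab) {
  # The uncorrected statistic is used here (no Yates continuity correction)
  chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  sqrt(chi2 / (n * k))
}

# Expected counts; a common rule of thumb asks for at least 5 per cell
print(chisq.test(toy_table)$expected)

v <- cramers_v(toy_table)
print(v)
```

For the notebook's data, the call would be `cramers_v(table_data)`.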

#### Relationship between CGPA and Depression

In [None]:
ggplot(depression_clean, aes(x = depression, y = cgpa, color = depression)) +
  geom_jitter(width = 0.2, size = 2, alpha = 0.7) + # Jittered points
  labs(
    title = "CGPA by Depression Status",
    x = "Depression Status",
    y = "CGPA"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5)
  )

The scatter plot indicates that CGPA does not appear to be a strong distinguishing factor for depression. Points are distributed across the entire range of CGPA values, from minimum to maximum, regardless of whether students are depressed or not.
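
One way to check this visual impression numerically is a two-sample comparison such as Welch's t-test, available in base R as `t.test()`. This is a suggested complement rather than part of the original analysis, and the values below are made up so that both groups share the same mean.

```r
# Toy data: hypothetical values for illustration only, not taken from the survey
toy <- data.frame(
  cgpa       = c(7, 8, 9, 6, 8, 10),
  depression = c("Yes", "Yes", "Yes", "No", "No", "No")
)

# Welch two-sample t-test comparing mean CGPA between the two groups
cgpa_test <- t.test(cgpa ~ depression, data = toy)
print(cgpa_test)

# A large p-value gives no evidence that mean CGPA differs between groups
print(cgpa_test$p.value)
```

On the real data, `t.test(cgpa ~ depression, data = depression_clean)` would run the same comparison.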

#### Impact of Study Hours on Depression

In [None]:
depression_clean %>% 
  ggplot(aes(x = study_hours,colour = depression)) +
  geom_line(stat = "count", aes(y = after_stat(count)), linewidth = 1)+
  labs(
      title = "Impact of Study hours on Depression",
      x = "Study Hours",
      y = "Count of People")+
  theme_minimal()+
  theme(
      plot.title = element_text(hjust = 0.5)
  ) 

This plot suggests that the number of depressed students increases with higher study hours. However, this cannot be considered a definitive indicator of depression, as there is also a significant number of students without depression who study very few or many hours. This indicates that other factors likely contribute to depression, and further analysis is needed to understand the full context.
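
One possible direction for that further analysis is a logistic regression, which estimates how the odds of depression change per additional study hour. The sketch below uses base R's `glm()` on made-up values; the column name `depressed` (coded 0/1) is an assumption introduced for illustration.

```r
# Toy data: hypothetical values for illustration only, not taken from the survey
toy <- data.frame(
  study_hours = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  depressed   = c(0, 0, 1, 0, 0, 1, 1, 0, 1, 1)  # 1 = depressed, 0 = not
)

# Logistic regression: log-odds of depression as a linear function of study hours
fit <- glm(depressed ~ study_hours, data = toy, family = binomial)

# The slope is on the log-odds scale; exponentiate to get an odds ratio
slope <- unname(coef(fit)["study_hours"])
odds_ratio <- exp(slope)
print(odds_ratio)  # Above 1 means higher odds of depression per extra hour
```

Other factors could be added to the formula with `+` (for example academic pressure or financial stress) to see whether study hours still matter once those are accounted for.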

#### Relationship between Dietary Habits and Depression

In [None]:
depression_clean %>%    
    ggplot(aes(x = dietary_habits,fill = depression)) +
    geom_bar(position = "dodge") +
    labs(
        title = "Relationship between Dietary Habits and Depression",
        x = "Dietary Habits",
        y = "Count of People") +
    theme_minimal()+ 
    theme(
        plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5)
    ) 

In [None]:
depression_clean %>%
  filter(depression =="Yes")%>%
  ggplot(aes(x = dietary_habits, fill = dietary_habits)) +
  geom_bar(position = "dodge") +  
  facet_wrap(~age_group) + 
  labs(
    title = "Dietary Habits in Depressed Students ",
    subtitle = "Faceted by Age Group",
    x = "",
    y = "Count of People"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    strip.background = element_rect(fill = "grey90", color = NA) # Grey background for facet names
  )

Plots reveal that students who report feeling depressed tend to have more unhealthy dietary habits, with the highest levels observed in the 18-22 age group. 

##### Relationship between Dietary Habits and Depression

###### Using Pearson's Chi-squared test


In [None]:
table_dietary <- table(depression_clean$dietary_habits, depression_clean$depression)
print(table_dietary)
chi_sq_result_d <- chisq.test(table_dietary)
print(chi_sq_result_d)

# X-squared: Chi-squared statistic. 
# df: Degrees of freedom 
# p-value: Less than 0.05 suggests a significant association.

The Pearson's chi-squared test results indicate a significant association between the two categorical variables. This means that dietary habits (healthy, moderate, or unhealthy) are associated with the likelihood of experiencing depression, although the test alone does not show that diet causes depression.
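
The overall test does not say which diet categories drive the association. The standardized residuals returned by `chisq.test()` can help: cells with large positive values have more observations than expected under independence, and large negative values mean fewer. A minimal sketch on a made-up table (the counts are illustrative assumptions):

```r
# Toy contingency table: hypothetical counts for illustration only
toy_table <- matrix(
  c(10, 20, 30, 30, 20, 10),
  nrow = 3,
  dimnames = list(
    diet       = c("Healthy", "Moderate", "Unhealthy"),
    depression = c("Yes", "No")
  )
)

diet_test <- chisq.test(toy_table)

# Standardized residuals: positive = more cases than expected, negative = fewer
residuals_std <- diet_test$stdres
print(round(residuals_std, 2))
```

For the notebook's data, the equivalent would be `chi_sq_result_d$stdres`.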

#### Impact of Financial Stress on Dietary Habits in Depressed Students

In [None]:
depression_clean %>%
  filter(depression =="Yes")%>%
  ggplot(aes(x = dietary_habits, fill = dietary_habits)) +
  geom_bar(position = "dodge") +  
  facet_wrap(~financial_stress) + 
  labs(
    title = "Dietary Habits in Depressed Students ",
    subtitle = "Faceted by Financial Stress",
    x = "",
    y = "Count of People"
  ) +
  theme_minimal()+
  theme(
    plot.title = element_text(hjust = 0.5),
    plot.subtitle = element_text(hjust = 0.5),
    strip.background = element_rect(fill = "grey90", color = NA) # Grey background for facet names
  )

This plot shows that depressed students experiencing higher financial stress are more likely to adopt unhealthy dietary habits. This suggests a potential link between financial challenges and compromised nutritional choices, highlighting the need for targeted support to address both financial and mental health concerns.

In [None]:
depression_clean %>%
filter(depression == "Yes") %>%
  ggplot(aes(x = suicidal_thoughts, fill = financial_stress)) +
  geom_bar(position = "dodge") +
 labs(
    title = "Financial Stress in Depressed Students with Suicidal Thoughts",
    x = "Suicidal Thoughts",
    y = "Count of People"
  ) +
  theme_minimal()+
  theme(
    plot.title = element_text(hjust = 0.5)
  )

In [None]:
# Cross-tabulate financial stress against suicidal thoughts
table_financial_suicidal <- table(depression_clean$financial_stress, depression_clean$suicidal_thoughts)
print(table_financial_suicidal)
chi_sq_result_fs <- chisq.test(table_financial_suicidal)
print(chi_sq_result_fs)

# X-squared: Chi-squared statistic. 
# df: Degrees of freedom 
# p-value: Less than 0.05 suggests a significant association.

The results provide strong evidence that financial stress is significantly associated with suicidal thoughts. This implies that interventions addressing financial stress could potentially have a positive effect on reducing the risk of suicidal thoughts among students.

## Act Phase

**The analysis reveals several key factors strongly associated with depression among students:**
- High Academic Pressure, Low Study Satisfaction, High Financial Stress and Unhealthy Dietary Habits.

**Age group that should be prioritized for mental health interventions**
- The 18–22 age group should be prioritized for mental health interventions. This group exhibits the highest levels of depression, suicidal thoughts, and sensitivity to factors such as academic pressure and financial stress.

**Based on the findings, the following programs could be implemented:**
- Academic Pressure Management:
  Introduce time management and study skills workshops to help students cope with academic demands, and work with faculty to develop more balanced curricula and exam schedules to reduce academic pressure.

- Financial Support Programs:
  Provide financial aid, scholarships, or subsidized counseling to alleviate financial stress, and offer workshops on financial literacy to help students manage their resources effectively.

- Mental Health Awareness Campaigns:
  Target the 18–22 age group with awareness campaigns to destigmatize mental health and encourage students to seek help.

- Counseling Services:
  Offer accessible counseling services.

- Nutritional Support:
  Provide healthy, affordable meal options on campus.
