<a href="https://colab.research.google.com/github/alicevangomez/EDA/blob/main/EDA_Mental_Health_Data.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Exploring Mental Health Data by Kaggle**

---

##### Playground Series - Season 4, Episode 11

##### **Goal**: Use data from a mental health survey to explore factors that may cause individuals to experience depression.

[Kaggle Competition](https://www.kaggle.com/competitions/playground-series-s4e11/data)




In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    accuracy_score,
    roc_auc_score,
    roc_curve
)
from sklearn.model_selection import train_test_split

from google.colab import files

print('Setup complete')

Setup complete


#### **Loading a Kaggle file on Google Colab**

In [None]:
# Upload the kaggle.json file to Colab
files.upload()

Saving kaggle.json to kaggle.json


{'kaggle.json': b'{"username":"alicevangomez","key":"<redacted>"}'}

In [None]:
# Configure Kaggle in Colab
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 ~/.kaggle/kaggle.json
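
`files.upload()` returns the uploaded file contents, and the notebook displays that return value as cell output, so the API key can end up saved in the notebook itself. A minimal sketch of an alternative is shown below, assuming the username and key are supplied some other way (for example typed in with `getpass`). The helper name `write_kaggle_config` is hypothetical and not part of the Kaggle client; it only writes the same `kaggle.json` file with owner-only permissions.

```python
import json
import os
from pathlib import Path


def write_kaggle_config(username, key, config_dir):
    """Write kaggle.json into config_dir with 0o600 permissions and return its path.

    Hypothetical helper: it mirrors the mkdir/mv/chmod shell steps above
    without echoing the key into the notebook output.
    """
    config_dir = Path(config_dir)
    config_dir.mkdir(parents=True, exist_ok=True)
    path = config_dir / "kaggle.json"

    # Create the file as owner read/write only, then write the credentials.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        json.dump({"username": username, "key": key}, handle)

    # Enforce the mode even if the file already existed with looser permissions.
    os.chmod(path, 0o600)
    return path
```

In Colab this could be called as `write_kaggle_config(input("Kaggle username: "), getpass.getpass("Kaggle key: "), Path.home() / ".kaggle")` after `import getpass`. Ending a cell with a semicolon or assigning the result of `files.upload()` to a variable also keeps the returned contents out of the displayed output.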

In [None]:
# Download the competition files
!kaggle competitions download -c playground-series-s4e11

In [None]:
# Unzip the downloaded files
!unzip playground-series-s4e11.zip

Archive:  playground-series-s4e11.zip
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: train.csv               


In [None]:
# Load the datasets
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
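
Before exploring further, it can help to confirm that the two files share the same columns apart from the target. The following is a small sketch of such a check; `compare_columns` is a hypothetical helper name, not something defined in the notebook.

```python
import pandas as pd


def compare_columns(train_df, test_df):
    """Return (columns only in train_df, columns only in test_df), in original order."""
    train_only = [c for c in train_df.columns if c not in test_df.columns]
    test_only = [c for c in test_df.columns if c not in train_df.columns]
    return train_only, test_only
```

Calling `compare_columns(train, test)` on this competition's files would be expected to report `Depression` as the only train-only column, since the test set omits the target.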

#### **Preliminary data analysis**

In [None]:
train.head()

Unnamed: 0,id,Name,Gender,Age,City,Working Professional or Student,Profession,Academic Pressure,Work Pressure,CGPA,Study Satisfaction,Job Satisfaction,Sleep Duration,Dietary Habits,Degree,Have you ever had suicidal thoughts ?,Work/Study Hours,Financial Stress,Family History of Mental Illness,Depression
0,0,Aaradhya,Female,49.0,Ludhiana,Working Professional,Chef,,5.0,,,2.0,More than 8 hours,Healthy,BHM,No,1.0,2.0,No,0
1,1,Vivan,Male,26.0,Varanasi,Working Professional,Teacher,,4.0,,,3.0,Less than 5 hours,Unhealthy,LLB,Yes,7.0,3.0,No,1
2,2,Yuvraj,Male,33.0,Visakhapatnam,Student,,5.0,,8.97,2.0,,5-6 hours,Healthy,B.Pharm,Yes,3.0,1.0,No,1
3,3,Yuvraj,Male,22.0,Mumbai,Working Professional,Teacher,,5.0,,,1.0,Less than 5 hours,Moderate,BBA,Yes,10.0,1.0,Yes,1
4,4,Rhea,Female,30.0,Kanpur,Working Professional,Business Analyst,,1.0,,,1.0,5-6 hours,Unhealthy,BBA,Yes,9.0,4.0,Yes,0


In [None]:
test.head()

Unnamed: 0,id,Name,Gender,Age,City,Working Professional or Student,Profession,Academic Pressure,Work Pressure,CGPA,Study Satisfaction,Job Satisfaction,Sleep Duration,Dietary Habits,Degree,Have you ever had suicidal thoughts ?,Work/Study Hours,Financial Stress,Family History of Mental Illness
0,140700,Shivam,Male,53.0,Visakhapatnam,Working Professional,Judge,,2.0,,,5.0,Less than 5 hours,Moderate,LLB,No,9.0,3.0,Yes
1,140701,Sanya,Female,58.0,Kolkata,Working Professional,Educational Consultant,,2.0,,,4.0,Less than 5 hours,Moderate,B.Ed,No,6.0,4.0,No
2,140702,Yash,Male,53.0,Jaipur,Working Professional,Teacher,,4.0,,,1.0,7-8 hours,Moderate,B.Arch,Yes,12.0,4.0,No
3,140703,Nalini,Female,23.0,Rajkot,Student,,5.0,,6.84,1.0,,More than 8 hours,Moderate,BSc,Yes,10.0,4.0,No
4,140704,Shaurya,Male,47.0,Kalyan,Working Professional,Teacher,,5.0,,,5.0,7-8 hours,Moderate,BCA,Yes,3.0,4.0,No


In [None]:
train.dtypes

Column,dtype
Gender,object
Age,float64
City,object
Working Professional or Student,object
Profession,object
Academic Pressure,float64
Work Pressure,float64
CGPA,float64
Study Satisfaction,float64
Job Satisfaction,float64


In [None]:
test.dtypes

Column,dtype
Gender,object
Age,float64
City,object
Working Professional or Student,object
Profession,object
Academic Pressure,float64
Work Pressure,float64
CGPA,float64
Study Satisfaction,float64
Job Satisfaction,float64
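
The `dtypes` output separates `object` columns from `float64` columns, which is the same split used later for categorical and numeric features. As a sketch, the two lists can also be derived from the dtypes instead of being typed by hand. `split_columns_by_dtype` is a hypothetical helper, and the `exclude` argument is an assumption added so that identifier and target columns can be left out.

```python
import pandas as pd


def split_columns_by_dtype(df, exclude=()):
    """Return (numeric column names, non-numeric column names), skipping `exclude`."""
    features = df.drop(columns=list(exclude), errors="ignore")
    numeric = features.select_dtypes(include="number").columns.tolist()
    categorical = features.select_dtypes(exclude="number").columns.tolist()
    return numeric, categorical
```

For example, `split_columns_by_dtype(train, exclude=["id", "Name", "Depression"])` would return the candidate numeric and categorical lists for review. Integer-coded columns such as the target count as numeric, which is why they need to be excluded explicitly.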


#### **Exploratory Data Analysis (EDA)**

In [None]:
# Drop the columns that will not be used, in both sets
drop_cols = ['id', 'Name']
train = train.drop(columns=drop_cols)
test = test.drop(columns=drop_cols)



In [None]:
print(train.shape)
print(test.shape)

(140700, 18)
(93800, 17)


In [None]:
# Define the target and feature columns
target = 'Depression'

num_cols = [
    'Age', 'Academic Pressure', 'Work Pressure', 'CGPA',
    'Study Satisfaction', 'Job Satisfaction', 'Work/Study Hours', 'Financial Stress'
]

cat_cols = [
    'Gender', 'City', 'Working Professional or Student', 'Profession',
    'Sleep Duration', 'Dietary Habits', 'Degree',
    'Have you ever had suicidal thoughts ?', 'Family History of Mental Illness'
]

X = train[num_cols + cat_cols]
y = train[target]
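
The imports at the top of the notebook (`Pipeline`, `ColumnTransformer`, `SimpleImputer`, `StandardScaler`, `OneHotEncoder`, `LogisticRegression`) suggest that these column lists feed a scikit-learn preprocessing pipeline. That pipeline is not shown in this section, so the following is only a sketch of one plausible arrangement. The imputation strategies (`median`, `most_frequent`), `handle_unknown="ignore"`, `max_iter=1000`, and the function name `build_pipeline` are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def build_pipeline(num_cols, cat_cols):
    """Impute and scale numeric columns, impute and one-hot encode categorical ones,
    then fit a logistic regression on the combined features."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),   # assumed strategy
        ("scale", StandardScaler()),
    ])
    categorical = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),  # assumed strategy
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric, num_cols),
        ("cat", categorical, cat_cols),
    ])
    return Pipeline([
        ("preprocess", preprocess),
        ("model", LogisticRegression(max_iter=1000)),
    ])
```

With the variables defined above, usage would look like `model = build_pipeline(num_cols, cat_cols).fit(X, y)`. Keeping imputation inside the pipeline means the fill values are learned only from the data passed to `fit`, which avoids leaking information from a validation split.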

##### **Target variable**

In [None]:
# Is the target binary?
train[target].value_counts()

Depression,count
0,115133
1,25567


In [None]:
# Are the classes reasonably balanced?
train[target].value_counts(normalize=True)

Depression,proportion
0,0.818287
1,0.181713


**In summary:**

*   The target variable has two categories (0/1).
*   The dataset should therefore be treated as a binary classification problem.
*   The classes are imbalanced: about 81.8% of rows are class 0 and about 18.2% are class 1.
In [None]:
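# Build a DataFrame with the count and percentage of nulls per column.
# Note: sort_values() reorders only the first Series. pandas then aligns both
# Series on their index labels, so the descending order does not carry over
# (the output below comes out in alphabetical column order).
nulos_df = pd.DataFrame({
    'Valores Nulos': train.isnull().sum().sort_values(ascending=False),
    'Porcentaje (%)': round(train.isnull().mean() * 100, 2)
})

# Show only the columns that contain null values
nulos_df = nulos_df[nulos_df['Valores Nulos'] > 0]
nulos_df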

Unnamed: 0,Valores Nulos,Porcentaje (%)
Academic Pressure,112803,80.17
CGPA,112802,80.17
Degree,2,0.0
Dietary Habits,4,0.0
Financial Stress,4,0.0
Job Satisfaction,27910,19.84
Profession,36630,26.03
Study Satisfaction,112803,80.17
Work Pressure,27918,19.84
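
The missing rates fall into two clusters: roughly 80% for `Academic Pressure`, `CGPA`, and `Study Satisfaction`, and roughly 20% for `Work Pressure` and `Job Satisfaction`. In the `head()` rows shown earlier, students have the academic fields filled and the work fields empty, and working professionals the reverse, so the nulls may be structural rather than random. That is an inference from a handful of rows, not something the notebook confirms. A sketch of how to check it follows; `null_share_by_group` is a hypothetical helper.

```python
import numpy as np
import pandas as pd


def null_share_by_group(df, group_col, value_cols):
    """Return the fraction of missing values in each of value_cols, per group_col level."""
    # isnull() gives a boolean frame; the mean of booleans per group is the null share.
    return df[value_cols].isnull().groupby(df[group_col]).mean()
```

For this dataset the call would be `null_share_by_group(train, 'Working Professional or Student', ['Academic Pressure', 'CGPA', 'Study Satisfaction', 'Work Pressure', 'Job Satisfaction'])`. Shares near 0 or 1 within each group would support treating these columns as group-specific rather than imputing them with a global statistic.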
