# Project University Mental Health

<div style="background-color: #78E8A3; padding: 20px">
<h3>Project Scenario</h3>
<p>Mental health is a severely neglected area, and neglecting it can have serious consequences for students, such as self-harm and depression.</p> 
<p>Working in a university's health and wellness center, we have been tasked with using data to identify students at risk so that we can help them as early as possible.</p>
<p>In this project, we will explore a dataset from the research of Nguyen et al. (2019), in which the authors collected 268 questionnaire results on depression, acculturative stress, social connectedness, and help-seeking behaviour from a cohort of local and international students. We will train various models on this data for two prediction tasks:</p>
(1) A regression problem: Predicting the depression severity (depression score) of a student<br>
(2) A classification problem: Predicting whether a student has thoughts of suicide<br>
    
Models for task 1 will be evaluated and selected based on their RMSE, and models for task 2 will be evaluated and selected based on their accuracy and F1 scores.
    
Research details <a href = 'https://www.mdpi.com/2306-5729/4/3/124/htm'>here</a>.
</div>
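
The three evaluation metrics named in the scenario can be sketched in plain Python. This is only an illustration on made-up toy values, not the evaluation code used later in the project; the helper names `rmse`, `accuracy`, and `f1` are hypothetical.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error, the metric for the regression task."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up toy values, NOT taken from the dataset
toy_scores_true = [3.0, 10.0, 7.0, 0.0]
toy_scores_pred = [4.0, 8.0, 7.0, 1.0]
toy_labels_true = [1, 0, 0, 1, 0]
toy_labels_pred = [1, 0, 1, 0, 0]

print(rmse(toy_scores_true, toy_scores_pred))
print(accuracy(toy_labels_true, toy_labels_pred))
print(f1(toy_labels_true, toy_labels_pred))
```

On these toy labels the accuracy is higher than the F1 score, which is one reason to look at both when the positive class is the one that matters.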

Because the categorical values in the data are engineered from the numerical ones, our approach for this dataset is slightly different. We will prepare three DataFrames:
1. A DataFrame that contains only the numerical columns
2. A DataFrame that contains only the one-hot encoded categorical variables
3. A DataFrame that combines the numerical and the one-hot encoded variables

We will export each of them as a CSV so that we can work with them in the next part, which is machine learning modelling.

In [5]:
#Import libraries
import pandas as pd
import numpy as np

In [2]:
#Read the cleaned CSV
cleaned_results = pd.read_csv('./datasets/filled_data.csv')

In [3]:
cleaned_results.shape

(268, 50)

### Get DataFrame containing numerical columns only

In [6]:
#Get a DataFrame containing only numeric columns (ints as well as floats)
float_only = cleaned_results.select_dtypes(np.number)

In [8]:
#Export the numerical DataFrame as CSV
float_only.to_csv('./datasets/numerical_data.csv', index=False)
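
The `index=False` argument matters when the CSV is read back later. A minimal sketch on a made-up one-column frame (the name `toy` and its values are assumptions for illustration) shows the difference:

```python
import io
import pandas as pd

# Made-up toy frame, NOT the project data
toy = pd.DataFrame({"age": [24.0, 28.0]})

# to_csv() without a path returns the CSV text as a string
csv_with_index = toy.to_csv()
csv_without_index = toy.to_csv(index=False)

reloaded_with_index = pd.read_csv(io.StringIO(csv_with_index))
reloaded_without_index = pd.read_csv(io.StringIO(csv_without_index))

# The saved index comes back as an extra, unnamed column
print(list(reloaded_with_index.columns))
print(list(reloaded_without_index.columns))
```

Without `index=False`, the row index is written as an extra column and reloaded under the name `Unnamed: 0`, which would then be treated as a feature by `select_dtypes`.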

### Get DataFrame containing categorical columns only

In [10]:
#Get a DataFrame that contains only strings/objects
category_only = cleaned_results.select_dtypes(object)
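
As a sanity check, the numerical and non-numerical selections should split the columns without overlap or loss. The sketch below uses a made-up toy frame and, as an assumption that differs from the cell above, selects the non-numerical side with `exclude=np.number` so that it does not depend on how strings are stored.

```python
import numpy as np
import pandas as pd

# Made-up toy frame, NOT the project data
toy = pd.DataFrame({
    "age": [24.0, 28.0, 25.0],
    "todep": [0.0, 2.0, 2.0],
    "gender": ["Male", "Male", "Female"],
    "region": ["SEA", "SEA", "JAP"],
})

toy_numeric = toy.select_dtypes(include=np.number)
# Assumption: everything that is not numeric is treated as categorical
toy_other = toy.select_dtypes(exclude=np.number)

# Every column lands in exactly one of the two frames
assert toy_numeric.shape[1] + toy_other.shape[1] == toy.shape[1]
print(list(toy_numeric.columns), list(toy_other.columns))
```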

In [14]:
#Use pd.get_dummies to one-hot encode categorical variables
#dtype=int keeps the 0/1 values shown below (pandas 2.0+ returns booleans by default)
onehot_category = pd.get_dummies(category_only, drop_first=True, dtype=int)
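
To see what `drop_first=True` does, here is a minimal sketch on a made-up two-column frame (the values are assumptions for illustration). The first category of each variable, in sorted order, is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Made-up toy frame, NOT the project data
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "region": ["SEA", "JAP", "EA"],
})

full = pd.get_dummies(toy, dtype=int)
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)

print(list(full.columns))     # one column per category
print(list(reduced.columns))  # first category of each variable removed
```

A row with zeros in every remaining column of a variable belongs to the dropped category, so no information is lost while the redundant (perfectly collinear) column is avoided.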

In [17]:
#Convert column names to lowercase
onehot_category.columns = onehot_category.columns.str.lower()

In [18]:
onehot_category.head()

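| | inter_dom_inter | region_jap | region_others | region_sa | region_sea | gender_male | academic_under | stay_cate_medium | stay_cate_short | japanese_cate_high | ... | friends_bi_yes | parents_bi_yes | relative_bi_yes | professional_bi_yes | phone_bi_yes | doctor_bi_yes | religion_bi_yes | alone_bi_yes | others_bi_yes | internet_bi_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |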


In [19]:
onehot_category.shape

(268, 34)

In [20]:
#Export the one-hot encoded categorical data
onehot_category.to_csv('./datasets/categorical_data.csv', index=False)

### Get a DataFrame that is both numerical and one-hot encoded

In [32]:
#Get the full data (numerical + categorical)
final_dataset = pd.concat([float_only, onehot_category], axis=1)
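
`pd.concat(..., axis=1)` joins rows by index label, not by position. That is safe here because both frames come from the same source DataFrame, but the sketch below (made-up toy frames, names are assumptions) shows what happens when the indexes differ:

```python
import pandas as pd

# Made-up toy frames, NOT the project data
numeric = pd.DataFrame({"age": [24.0, 28.0, 25.0]})
onehot = pd.DataFrame({"gender_male": [1, 1, 0]})

# Same default index (0, 1, 2): columns are placed side by side
aligned = pd.concat([numeric, onehot], axis=1)
print(aligned.shape)

# Different index labels: rows no longer line up and NaNs appear
onehot_shifted = pd.DataFrame({"gender_male": [1, 1, 0]}, index=[1, 2, 3])
misaligned = pd.concat([numeric, onehot_shifted], axis=1)
print(misaligned.shape, int(misaligned.isna().sum().sum()))
```

Checking that the combined frame has the same number of rows as its inputs, and that its column count is the sum of theirs, is a cheap way to catch this kind of misalignment.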

In [33]:
final_dataset.head()

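| | age | age_cate | stay | japanese | english | todep | tosc | apd | ahome | aph | ... | friends_bi_yes | parents_bi_yes | relative_bi_yes | professional_bi_yes | phone_bi_yes | doctor_bi_yes | religion_bi_yes | alone_bi_yes | others_bi_yes | internet_bi_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24.0 | 4.0 | 5.0 | 3.0 | 5.0 | 0.0 | 34.0 | 23.0 | 9.0 | 11.0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 28.0 | 5.0 | 1.0 | 4.0 | 4.0 | 2.0 | 48.0 | 8.0 | 7.0 | 5.0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 25.0 | 4.0 | 6.0 | 4.0 | 4.0 | 2.0 | 41.0 | 13.0 | 4.0 | 7.0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 29.0 | 5.0 | 1.0 | 2.0 | 3.0 | 3.0 | 37.0 | 16.0 | 10.0 | 10.0 | ... | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 28.0 | 5.0 | 1.0 | 1.0 | 3.0 | 3.0 | 37.0 | 15.0 | 12.0 | 5.0 | ... | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |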


In [34]:
final_dataset.shape

(268, 60)

In [35]:
#Export the DataFrame as a CSV
final_dataset.to_csv('./datasets/final_data.csv', index=False)