# Project University Mental Health

<div style="background-color: #78E8A3; padding: 20px">
<h3>Project Scenario</h3>
<p>Mental health is a severely neglected area, and neglecting it can have serious consequences for students, such as self-harm and depression.</p> 
<p>Working in a university's health and wellness center, we have been tasked with using data to identify students at risk so that we can help them as early as possible.</p>
<p>In this project, we will explore a dataset from the research of Nguyen et al. (2019), in which the authors collected 268 questionnaire responses on depression, acculturative stress, social connectedness, and help-seeking behaviour from a cohort of local and international students. We will train various models on this data for two prediction tasks:</p>
(1) A regression problem: predicting a student's depression severity (depression score)<br>
(2) A classification problem: predicting whether a student has thoughts of suicide<br>
    
Task 1's models will be evaluated and selected by RMSE, and task 2's models by accuracy and F1 score.
    
Research details <a href = 'https://www.mdpi.com/2306-5729/4/3/124/htm'>here</a>.
</div>
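
As a rough illustration of the three metrics named in the scenario, the sketch below computes RMSE, accuracy, and F1 with scikit-learn. The arrays are hypothetical toy values chosen for illustration, not values from the dataset or from any model in this project.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, mean_squared_error

# Hypothetical toy values, for illustration only (not from the dataset)
y_true_reg = np.array([3.0, 10.0, 7.0, 15.0])   # "true" depression scores
y_pred_reg = np.array([4.0, 8.0, 7.0, 12.0])    # predicted depression scores

# RMSE: square root of the mean squared error (task 1)
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))

# Hypothetical binary labels, 1 = positive class (task 2)
y_true_clf = np.array([1, 0, 0, 1, 0, 1])
y_pred_clf = np.array([1, 0, 1, 0, 0, 1])

accuracy = accuracy_score(y_true_clf, y_pred_clf)
f1 = f1_score(y_true_clf, y_pred_clf)

print(f"RMSE: {rmse:.3f}, accuracy: {accuracy:.3f}, F1: {f1:.3f}")
```

F1 is reported alongside accuracy because accuracy alone can look good on an imbalanced target even when few positive cases are caught.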

### What we'll be doing:
In this project, we will do the following:

1. Acquire data on the mental health of international and domestic students in Japan (Part I)
2. Perform exploratory data analysis and test a few hypotheses (Part II)
3. Transform the data for machine learning (Part III)
4. Train a machine learning model based on several hypotheses (Part IV)

In [2]:
#Import libraries
import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

In [3]:
#Read CSV as DataFrame
raw_results = pd.read_csv('./datasets/data.csv')

## Data Cleaning

### Dealing with null values

In [4]:
raw_results.head()

Unnamed: 0,inter_dom,Region,Gender,Academic,Age,Age_cate,Stay,Stay_Cate,Japanese,Japanese_cate,...,Friends_bi,Parents_bi,Relative_bi,Professional_bi,Phone_bi,Doctor_bi,religion_bi,Alone_bi,Others_bi,Internet_bi
0,Inter,SEA,Male,Grad,24.0,4.0,5.0,Long,3.0,Average,...,Yes,Yes,No,No,No,No,No,No,No,No
1,Inter,SEA,Male,Grad,28.0,5.0,1.0,Short,4.0,High,...,Yes,Yes,No,No,No,No,No,No,No,No
2,Inter,SEA,Male,Grad,25.0,4.0,6.0,Long,4.0,High,...,No,No,No,No,No,No,No,No,No,No
3,Inter,EA,Female,Grad,29.0,5.0,1.0,Short,2.0,Low,...,Yes,Yes,Yes,Yes,No,No,No,No,No,No
4,Inter,EA,Female,Grad,28.0,5.0,1.0,Short,1.0,Low,...,Yes,Yes,No,Yes,No,Yes,Yes,No,No,No


In [5]:
#Use .tail to inspect the last 20 rows (many null values)
raw_results.tail(20)

Unnamed: 0,inter_dom,Region,Gender,Academic,Age,Age_cate,Stay,Stay_Cate,Japanese,Japanese_cate,...,Friends_bi,Parents_bi,Relative_bi,Professional_bi,Phone_bi,Doctor_bi,religion_bi,Alone_bi,Others_bi,Internet_bi
266,Dom,JAP,Male,Under,19.0,2.0,1.0,Short,5.0,High,...,Yes,Yes,Yes,Yes,Yes,Yes,No,No,No,No
267,Dom,JAP,Male,Under,20.0,2.0,2.0,Medium,5.0,High,...,Yes,No,No,No,No,No,No,Yes,No,No
268,,,,,,,,,,,...,,,,,,,,,,
269,,,,,,,,,,,...,128,137,66,61,30,46,19,65,21,45
270,,,,,,,,,,,...,140,131,202,207,238,222,249,203,247,223
271,,,,,,,,,,,...,,,,,,,,,,
272,,,,,,,,,,,...,128,137,66,61,30,46,19,65,21,45
273,,,,,,,,,,,...,140,131,202,207,238,222,249,203,247,223
274,,,,,,,,,,,...,,,,,,,,,,
275,,,,,,,,,,,...,123,,,,,,,,,


In [7]:
#Remove the last 18 rows, which have a null 'inter_dom' and are not survey responses
raw_results.dropna(axis=0, subset=['inter_dom'], inplace=True)

In [8]:
raw_results.shape

(268, 50)
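
A quick way to sanity-check a row-dropping step like the one above is to count rows before and after. The sketch below uses a hypothetical four-row frame (two responses followed by two summary-style rows), not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame: two responses followed by two summary rows
toy = pd.DataFrame({
    "inter_dom": ["Inter", "Dom", np.nan, np.nan],
    "Friends_bi": ["Yes", "No", "128", "140"],
})

n_before = len(toy)

# Keep only rows where the key column is present
cleaned = toy.dropna(axis=0, subset=["inter_dom"])

n_dropped = n_before - len(cleaned)
print(f"Dropped {n_dropped} of {n_before} rows; new shape: {cleaned.shape}")
```

Returning a new frame instead of using `inplace=True` keeps the original available for comparison, which is a design preference here rather than a requirement.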

In [9]:
#Display the number of missing values per column
#'Intimate' (8) and 'Internet' (26) have missing values
raw_results.isnull().sum()

inter_dom           0
Region              0
Gender              0
Academic            0
Age                 0
Age_cate            0
Stay                0
Stay_Cate           0
Japanese            0
Japanese_cate       0
English             0
English_cate        0
Intimate            8
Religion            0
Suicide             0
Dep                 0
DepType             0
ToDep               0
DepSev              0
ToSC                0
APD                 0
AHome               0
APH                 0
Afear               0
ACS                 0
AGuilt              0
AMiscell            0
ToAS                0
Partner             0
Friends             0
Parents             0
Relative            0
Profess             0
 Phone              0
Doctor              0
Reli                0
Alone               0
Others              0
Internet           26
Partner_bi          0
Friends_bi          0
Parents_bi          0
Relative_bi         0
Professional_bi     0
Phone_bi            0
Doctor_bi 

In [11]:
#Replace missing values in 'Internet' using IterativeImputer
#Note: only one column is passed in, so there are no other features to
#model from and the imputer falls back to its initial strategy (the mean)
iterative_imputer = IterativeImputer()

#fit_transform returns a 2-D NumPy array of shape (n_rows, 1)
raw_results_imputed = iterative_imputer.fit_transform(raw_results[['Internet']])

In [12]:
#Copy the imputed 'Internet' values, rounded to the nearest integer, back to raw_results
raw_results['Internet'] = raw_results_imputed.round().ravel()
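
When `IterativeImputer` receives a single column, it has no other features to regress on, so the result is effectively mean imputation. A minimal sketch of a multi-column alternative follows, assuming that related numeric help-seeking items are reasonable predictors of 'Internet'. The frame and its values are hypothetical, and the column choice is an assumption rather than the method used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical toy scores; column names mirror the help-seeking items above
toy = pd.DataFrame({
    "Friends":  [5.0, 3.0, 6.0, 2.0, 7.0, 4.0],
    "Parents":  [6.0, 2.0, 7.0, 1.0, 6.0, 5.0],
    "Internet": [4.0, np.nan, 5.0, 1.0, np.nan, 3.0],
})

# Each column with missing values is modelled from the other columns
imputer = IterativeImputer(random_state=0)
imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)

# Round to the nearest integer to match the scale of the observed answers
toy["Internet"] = imputed["Internet"].round()
print(toy)
```

Observed values are left unchanged; only the missing entries are filled.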

In [13]:
#Check for missing values again: 'Internet' is now complete, 'Intimate' still has 8 missing values
raw_results.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 268 entries, 0 to 267
Data columns (total 50 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   inter_dom        268 non-null    object 
 1   Region           268 non-null    object 
 2   Gender           268 non-null    object 
 3   Academic         268 non-null    object 
 4   Age              268 non-null    float64
 5   Age_cate         268 non-null    float64
 6   Stay             268 non-null    float64
 7   Stay_Cate        268 non-null    object 
 8   Japanese         268 non-null    float64
 9   Japanese_cate    268 non-null    object 
 10  English          268 non-null    float64
 11  English_cate     268 non-null    object 
 12  Intimate         260 non-null    object 
 13  Religion         268 non-null    object 
 14  Suicide          268 non-null    object 
 15  Dep              268 non-null    object 
 16  DepType          268 non-null    object 
 17  ToDep           
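
The output above still lists 'Intimate' with 260 non-null entries. One possible way to fill a categorical column like this is with its most frequent value, sketched below. The 'Yes'/'No' labels are an assumption about how the column is coded, and mode filling is only one option (keeping the rows as a separate "unknown" category is another).

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; the 'Yes'/'No' labels are assumed, not confirmed
toy = pd.DataFrame({"Intimate": ["Yes", "No", np.nan, "Yes", np.nan]})

# Most frequent observed value (mode ignores missing entries by default)
most_common = toy["Intimate"].mode().iloc[0]

# Fill the missing entries with that value
toy["Intimate"] = toy["Intimate"].fillna(most_common)
print(toy["Intimate"].tolist())
```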

### More Data Cleaning

In [14]:
#Convert all column names to lowercase
raw_results.columns = raw_results.columns.str.lower()

In [15]:
#Strip leading whitespace from column names (affects ' phone')
raw_results.columns = raw_results.columns.str.lstrip()

In [16]:
#check all columns
raw_results.columns

Index(['inter_dom', 'region', 'gender', 'academic', 'age', 'age_cate', 'stay',
       'stay_cate', 'japanese', 'japanese_cate', 'english', 'english_cate',
       'intimate', 'religion', 'suicide', 'dep', 'deptype', 'todep', 'depsev',
       'tosc', 'apd', 'ahome', 'aph', 'afear', 'acs', 'aguilt', 'amiscell',
       'toas', 'partner', 'friends', 'parents', 'relative', 'profess', 'phone',
       'doctor', 'reli', 'alone', 'others', 'internet', 'partner_bi',
       'friends_bi', 'parents_bi', 'relative_bi', 'professional_bi',
       'phone_bi', 'doctor_bi', 'religion_bi', 'alone_bi', 'others_bi',
       'internet_bi'],
      dtype='object')
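
The two column-name steps above can also be chained, and using `str.strip()` would additionally catch trailing whitespace. A minimal sketch on a hypothetical empty frame with three made-up header variants:

```python
import pandas as pd

# Hypothetical headers with leading and trailing whitespace
toy = pd.DataFrame(columns=["Region", " Phone", "Doctor_bi "])

# Trim whitespace on both sides, then lowercase, in one pass
toy.columns = toy.columns.str.strip().str.lower()
print(list(toy.columns))
```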

### Export cleaned DataFrame

In [17]:
#Export DataFrame as CSV
raw_results.to_csv('./datasets/filled_data.csv', index=False)