# Data Exploration
* This notebook focuses on the following tasks:
    * Create train/test dataset.
    * EDA.
    * Feature Engineering.
    * Planning next steps to create a prediction model and data analysis.

## Import Libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from pathlib import Path

## Initialize Helper Functions & Constants

In [5]:
## root directory for all data files
data_dir = Path("..", "data")
raw_data_file = Path(data_dir, "student_depression_dataset.csv")

## Read Data

In [9]:
## read data from csv file
data = pd.read_csv(raw_data_file)

In [10]:
## verify the shape of the data (rows, columns)
data.shape

(27901, 18)

## Data Exploration

* Let's explore the data to:
    * Check for missing data.
    * Identify the column names and types.
    * Identify the target variable.
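
The three checks listed above can be sketched with a few pandas calls. The small `sample` frame below is a hypothetical stand-in for `data` so that the snippet runs on its own; its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`; values are illustrative only.
sample = pd.DataFrame(
    {
        "Age": [33.0, 24.0, np.nan],
        "City": ["Visakhapatnam", "Bangalore", "Srinagar"],
        "Depression": [1, 0, 0],
    }
)

# Number of missing values in each column.
missing_counts = sample.isna().sum()

# Column names and their inferred types.
column_types = sample.dtypes

# Class balance of a candidate target column.
target_counts = sample["Depression"].value_counts()

print(missing_counts)
print(column_types)
print(target_counts)
```

On the real dataset the same calls would be applied to `data` directly.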

### Exploring Data Columns

In [12]:
data.dtypes

id                                         int64
Gender                                    object
Age                                      float64
City                                      object
Profession                                object
Academic Pressure                        float64
Work Pressure                            float64
CGPA                                     float64
Study Satisfaction                       float64
Job Satisfaction                         float64
Sleep Duration                            object
Dietary Habits                            object
Degree                                    object
Have you ever had suicidal thoughts ?     object
Work/Study Hours                         float64
Financial Stress                          object
Family History of Mental Illness          object
Depression                                 int64
dtype: object

* The `Depression` column looks like the target variable.
* Before confirming that, let's change the column names to lower case and remove spaces to make them easier to work with.

#### Changing Column Names

In [None]:
## map each column name to a lower-case version with spaces replaced by underscores
column_mapping = {col: "_".join(col.split(" ")).lower() for col in data.columns}

column_mapping

{'id': 'id',
 'Gender': 'gender',
 'Age': 'age',
 'City': 'city',
 'Profession': 'profession',
 'Academic Pressure': 'academic_pressure',
 'Work Pressure': 'work_pressure',
 'CGPA': 'cgpa',
 'Study Satisfaction': 'study_satisfaction',
 'Job Satisfaction': 'job_satisfaction',
 'Sleep Duration': 'sleep_duration',
 'Dietary Habits': 'dietary_habits',
 'Degree': 'degree',
 'Have you ever had suicidal thoughts ?': 'have_you_ever_had_suicidal_thoughts_?',
 'Work/Study Hours': 'work/study_hours',
 'Financial Stress': 'financial_stress',
 'Family History of Mental Illness': 'family_history_of_mental_illness',
 'Depression': 'depression'}

In [19]:
data.rename(columns={'id': 'id',
                     'Gender': 'gender',
                     'Age': 'age',
                     'City': 'city',
                     'Profession': 'profession',
                     'Academic Pressure': 'academic_pressure',
                     'Work Pressure': 'work_pressure',
                     'CGPA': 'cgpa',
                     'Study Satisfaction': 'study_satisfaction',
                     'Job Satisfaction': 'job_satisfaction',
                     'Sleep Duration': 'sleep_duration',
                     'Dietary Habits': 'dietary_habits',
                     'Degree': 'degree',
                     'Have you ever had suicidal thoughts ?': 'suicidal_thoughts',
                     'Work/Study Hours': 'work_study_hours',
                     'Financial Stress': 'financial_stress',
                     'Family History of Mental Illness': 'family_history',
                     'Depression': 'depression'}, inplace=True)
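
The generated mapping keeps characters such as `?` and `/`, which is why the rename above was written by hand. As a possible alternative, the sketch below collapses every run of non-alphanumeric characters into a single underscore. `clean_column_name` is a hypothetical helper, not part of pandas, and it does not reproduce the shortened names chosen by hand (`suicidal_thoughts`, `family_history`).

```python
import re


def clean_column_name(name: str) -> str:
    """Lower-case a column name and collapse non-alphanumeric runs into '_'."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    # Drop underscores left over at either end (e.g. from a trailing " ?").
    return cleaned.strip("_").lower()


# A few of the original column names from the dataset.
original_columns = [
    "Academic Pressure",
    "Have you ever had suicidal thoughts ?",
    "Work/Study Hours",
]

column_mapping = {col: clean_column_name(col) for col in original_columns}
print(column_mapping)
```

A mapping built this way could be passed to `data.rename(columns=...)`, with any names that are still too long overridden manually.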

In [20]:
data.head()

,id,gender,age,city,profession,academic_pressure,work_pressure,cgpa,study_satisfaction,job_satisfaction,sleep_duration,dietary_habits,degree,suicidal_thoughts,work_study_hours,financial_stress,family_history,depression
0,2,Male,33.0,Visakhapatnam,Student,5.0,0.0,8.97,2.0,0.0,'5-6 hours',Healthy,B.Pharm,Yes,3.0,1.0,No,1
1,8,Female,24.0,Bangalore,Student,2.0,0.0,5.9,5.0,0.0,'5-6 hours',Moderate,BSc,No,3.0,2.0,Yes,0
2,26,Male,31.0,Srinagar,Student,3.0,0.0,7.03,5.0,0.0,'Less than 5 hours',Healthy,BA,No,9.0,1.0,Yes,0
3,30,Female,28.0,Varanasi,Student,3.0,0.0,5.59,2.0,0.0,'7-8 hours',Moderate,BCA,Yes,4.0,5.0,Yes,1
4,32,Female,25.0,Jaipur,Student,4.0,0.0,8.13,3.0,0.0,'5-6 hours',Moderate,M.Tech,Yes,1.0,1.0,No,0


#### Deleting Columns

In [22]:
## Let's delete the `id` column since it won't be useful for analysis or prediction.
data.drop(columns=["id"], inplace=True)

In [23]:
data.head()

,gender,age,city,profession,academic_pressure,work_pressure,cgpa,study_satisfaction,job_satisfaction,sleep_duration,dietary_habits,degree,suicidal_thoughts,work_study_hours,financial_stress,family_history,depression
0,Male,33.0,Visakhapatnam,Student,5.0,0.0,8.97,2.0,0.0,'5-6 hours',Healthy,B.Pharm,Yes,3.0,1.0,No,1
1,Female,24.0,Bangalore,Student,2.0,0.0,5.9,5.0,0.0,'5-6 hours',Moderate,BSc,No,3.0,2.0,Yes,0
2,Male,31.0,Srinagar,Student,3.0,0.0,7.03,5.0,0.0,'Less than 5 hours',Healthy,BA,No,9.0,1.0,Yes,0
3,Female,28.0,Varanasi,Student,3.0,0.0,5.59,2.0,0.0,'7-8 hours',Moderate,BCA,Yes,4.0,5.0,Yes,1
4,Female,25.0,Jaipur,Student,4.0,0.0,8.13,3.0,0.0,'5-6 hours',Moderate,M.Tech,Yes,1.0,1.0,No,0
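
One detail worth checking: the preview shows numeric-looking values for financial stress (1.0, 2.0, 5.0), yet `data.dtypes` reported that column as `object`. This may mean some entries are not numeric. The sketch below shows one way to investigate, using a hypothetical `sample` frame; the `"?"` entry is an assumed example of a non-numeric value, not something confirmed from the dataset.

```python
import pandas as pd

# Hypothetical sample; the "?" entry is an assumed non-numeric value.
sample = pd.DataFrame({"financial_stress": ["1.0", "2.0", "?", "5.0"]})

# Entries that cannot be parsed as numbers become NaN.
as_numeric = pd.to_numeric(sample["financial_stress"], errors="coerce")

# Raw values that failed to parse, for manual inspection.
unparsed = sample.loc[as_numeric.isna(), "financial_stress"].unique()

print(as_numeric.dtype)
print(unparsed)
```

If the real column does contain unparseable entries, they would need to be handled (dropped or imputed) before the column can be used as a numeric feature.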


#### Saving the Data


In [25]:
## Let's save the dataset for future use.
data.to_csv(Path(data_dir, "processed_column_names.csv"), index=False)
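
As a quick sanity check, the saved file can be read back and compared with the frame that was written. The sketch below does this with a hypothetical `sample` frame and a temporary directory, so it does not touch the real `data_dir`. Writing with `index=False` matters here: without it, the index is saved as an extra column that `pd.read_csv` would load back as `Unnamed: 0`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the processed `data` frame.
sample = pd.DataFrame(
    {"gender": ["Male", "Female"], "age": [33.0, 24.0], "depression": [1, 0]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    out_file = Path(tmp_dir, "processed_column_names.csv")
    # index=False keeps the row index out of the file.
    sample.to_csv(out_file, index=False)
    reloaded = pd.read_csv(out_file)

# The reloaded frame should have the same shape and column names.
print(reloaded.shape == sample.shape)
print(list(reloaded.columns) == list(sample.columns))
```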