<h1> Depression Status Prediction </h1>

<p> <q>This notebook was made for the project at <a href="https://app.patika.dev/moduller/dspg-projeleri/depresyon_durum_tahmini">app.patika.dev</a>.</q>

<b>Problem definition and purpose:</b> 
<p>The dataset contains two folders: one with the data of the patients in the control group and one with the data of the patients in the condition group. Each folder contains a separate CSV file per patient with actigraph data collected over time. The aim is to predict depression status automatically from this patient data. In an application to be built, a diagnosis of depression could be made based on the answers people give to the questions they are asked. <p>(From <a href='https://app.patika.dev/moduller/dspg-projeleri/depresyon_durum_tahmini'>app.patika.dev</a>.)

<p>My goal is to develop a model that predicts depression status based on actigraph data.     

<p><b>Dataset Link:</b>
<p><a href ="https://www.kaggle.com/arashnic/the-depression-dataset">The Depression Dataset </a>

<h5> Importing Libraries </h5>

In [11]:
import numpy as np
import pandas as pd
import os

<h2> Getting to Know the Data </h2>

The dataset contains two folders: one with the data for the control group and one with the data for the condition group. For each patient, a CSV file is provided with the actigraph data collected over time. Its columns are: timestamp (one-minute intervals), date (date of measurement), and activity (activity measurement from the actigraph watch).
<p>In addition, the MADRS scores are provided in the file "scores.csv". It contains the following columns: number (patient identifier), days (number of days of measurements), gender (1 or 2 for female or male), age (age in age groups), afftype (1: bipolar II, 2: unipolar depressive, 3: bipolar I), melanch (1: melancholia, 2: no melancholia), inpatient (1: inpatient, 2: outpatient), edu (education grouped in years), marriage (1: married or cohabiting, 2: single), work (1: working or studying, 2: unemployed/sick leave/pension), madrs1 (MADRS score when measurement started), madrs2 (MADRS score when measurement stopped).
<p>(Source: <a href='https://www.kaggle.com/arashnic/the-depression-dataset'>Kaggle</a>.)
<p>(You can find more information <a href='https://datasets.simula.no/depresjon/#dataset-details'>here</a>.)
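<p>The coded columns described above can be made easier to read by mapping the numeric codes to labels. The following is a minimal sketch that uses only the codings listed in the description; the <code>AFFTYPE_LABELS</code> and <code>MELANCH_LABELS</code> dictionaries and the small <code>demo</code> frame are hypothetical names introduced for illustration and are not part of the original notebook.

```python
import pandas as pd

# Code-to-label mappings taken from the column description above.
# Keys are floats because these columns appear as floats (2.0, 1.0, NaN) in scores.csv.
AFFTYPE_LABELS = {1.0: "bipolar II", 2.0: "unipolar depressive", 3.0: "bipolar I"}
MELANCH_LABELS = {1.0: "melancholia", 2.0: "no melancholia"}

# Small illustrative frame mirroring a few values from scores.csv
demo = pd.DataFrame({"afftype": [2.0, 1.0, 2.0], "melanch": [2.0, 2.0, None]})

# Series.map looks up each value in the dictionary; missing values stay missing
demo["afftype_label"] = demo["afftype"].map(AFFTYPE_LABELS)
demo["melanch_label"] = demo["melanch"].map(MELANCH_LABELS)
```

<p>Keeping the original numeric columns next to the labels leaves the data usable for modelling while making tables easier to inspect.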
    
    

In [4]:
scores = pd.read_csv('./data/scores.csv')
scores

Unnamed: 0,number,days,gender,age,afftype,melanch,inpatient,edu,marriage,work,madrs1,madrs2
0,condition_1,11,2,35-39,2.0,2.0,2.0,6-10,1.0,2.0,19.0,19.0
1,condition_2,18,2,40-44,1.0,2.0,2.0,6-10,2.0,2.0,24.0,11.0
2,condition_3,13,1,45-49,2.0,2.0,2.0,6-10,2.0,2.0,24.0,25.0
3,condition_4,13,2,25-29,2.0,2.0,2.0,11-15,1.0,1.0,20.0,16.0
4,condition_5,13,2,50-54,2.0,2.0,2.0,11-15,2.0,2.0,26.0,26.0
5,condition_6,7,1,35-39,2.0,2.0,2.0,6-10,1.0,2.0,18.0,15.0
6,condition_7,11,1,20-24,1.0,,2.0,11-15,2.0,1.0,24.0,25.0
7,condition_8,5,2,25-29,2.0,,2.0,11-15,1.0,2.0,20.0,16.0
8,condition_9,13,2,45-49,1.0,,2.0,6-10,1.0,2.0,26.0,26.0
9,condition_10,9,2,45-49,2.0,2.0,2.0,6-10,1.0,2.0,28.0,21.0


The scores file contains 23 condition records and 32 control records. I split them into two separate data frames.

In [10]:
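# cdtn = condition, ctrl = control
# Select rows by the identifier prefix instead of by hard-coded positions
is_condition = scores["number"].str.startswith("condition")
cdtn_score = scores[is_condition].copy()
ctrl_score = scores[~is_condition].copy()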
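<p>The group sizes quoted above can be checked from the patient identifiers themselves. The sketch below counts rows per group by taking the part of the identifier before the underscore. It runs on a tiny made-up frame, and the <code>control_</code> prefix is an assumption, since only <code>condition_</code> identifiers are visible in the output shown here.

```python
import pandas as pd

# Made-up identifiers for illustration; the "control_" prefix is an assumption
demo_scores = pd.DataFrame({"number": ["condition_1", "condition_2", "control_1"]})

# Take the text before the underscore as the group name, then count rows per group
group = demo_scores["number"].str.split("_").str[0]
group_counts = group.value_counts()
```

<p>Running the same two lines on the real <code>scores</code> frame would show whether the 23/32 split holds before slicing.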

First, let's look at the relationship between the patients' mean daily activity and their MADRS scores.

To get there, let's compute the total activity of each patient in the condition group from the measurement files. Dividing that total by the number of measurement days then gives the daily average.

In [12]:
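# Sum the activity counts in each condition-group file, keyed by patient identifier
condition_dir = "./data/condition"

sum_dict = {}
for file in os.listdir(condition_dir):
    df = pd.read_csv(os.path.join(condition_dir, file))
    sum_dict[file.split(".")[0]] = df.activity.sum()

sum_dict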

{'condition_1': 3415660,
 'condition_10': 6243346,
 'condition_11': 2974516,
 'condition_12': 3354049,
 'condition_13': 5735146,
 'condition_14': 1624530,
 'condition_15': 2391019,
 'condition_16': 6128175,
 'condition_17': 1848268,
 'condition_18': 1517859,
 'condition_19': 3338367,
 'condition_2': 5981554,
 'condition_20': 1413779,
 'condition_21': 1628308,
 'condition_22': 3521753,
 'condition_23': 6379462,
 'condition_3': 5743208,
 'condition_4': 5925033,
 'condition_5': 3594618,
 'condition_6': 4209793,
 'condition_7': 5805537,
 'condition_8': 3569050,
 'condition_9': 3632316}
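
<p>A daily average can also be computed directly from a raw recording by summing the activity per calendar date and averaging those daily sums. The following is a minimal sketch, assuming each patient file has the <code>date</code> and <code>activity</code> columns described earlier; <code>daily_mean_activity</code> is a hypothetical helper and the sample values are made up.

```python
import pandas as pd

def daily_mean_activity(df: pd.DataFrame) -> float:
    """Average of the per-day activity totals in one patient's recording."""
    daily_totals = df.groupby("date")["activity"].sum()
    return float(daily_totals.mean())

# Tiny synthetic recording (made-up values, not taken from the dataset)
sample = pd.DataFrame({
    "timestamp": ["2003-05-07 12:00:00", "2003-05-07 12:01:00", "2003-05-08 12:00:00"],
    "date": ["2003-05-07", "2003-05-07", "2003-05-08"],
    "activity": [10, 30, 20],
})

# Daily totals are 40 and 20, so the daily mean is 30.0
sample_mean = daily_mean_activity(sample)
```

<p>This can differ from dividing the total by the <code>days</code> column of scores.csv if a recording spans a different number of calendar dates than that column states.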

In [13]:
cdtn_sums = pd.DataFrame(pd.Series(sum_dict))
cdtn_sums.columns = ["Sum"]
cdtn_sums

Unnamed: 0,Sum
condition_1,3415660
condition_10,6243346
condition_11,2974516
condition_12,3354049
condition_13,5735146
condition_14,1624530
condition_15,2391019
condition_16,6128175
condition_17,1848268
condition_18,1517859


In [15]:
cdtn_score = cdtn_score.set_index("number").join(cdtn_sums).reset_index()

In [16]:
cdtn_score

Unnamed: 0,number,days,gender,age,afftype,melanch,inpatient,edu,marriage,work,madrs1,madrs2,Sum
0,condition_1,11,2,35-39,2.0,2.0,2.0,6-10,1.0,2.0,19.0,19.0,3415660
1,condition_2,18,2,40-44,1.0,2.0,2.0,6-10,2.0,2.0,24.0,11.0,5981554
2,condition_3,13,1,45-49,2.0,2.0,2.0,6-10,2.0,2.0,24.0,25.0,5743208
3,condition_4,13,2,25-29,2.0,2.0,2.0,11-15,1.0,1.0,20.0,16.0,5925033
4,condition_5,13,2,50-54,2.0,2.0,2.0,11-15,2.0,2.0,26.0,26.0,3594618
5,condition_6,7,1,35-39,2.0,2.0,2.0,6-10,1.0,2.0,18.0,15.0,4209793
6,condition_7,11,1,20-24,1.0,,2.0,11-15,2.0,1.0,24.0,25.0,5805537
7,condition_8,5,2,25-29,2.0,,2.0,11-15,1.0,2.0,20.0,16.0,3569050
8,condition_9,13,2,45-49,1.0,,2.0,6-10,1.0,2.0,26.0,26.0,3632316
9,condition_10,9,2,45-49,2.0,2.0,2.0,6-10,1.0,2.0,28.0,21.0,6243346


In [20]:
# Daily average activity: total activity divided by the number of measurement days
cdtn_score["stress"] = cdtn_score.Sum / cdtn_score.days
cdtn_score

Unnamed: 0,number,days,gender,age,afftype,melanch,inpatient,edu,marriage,work,madrs1,madrs2,Sum,stress
0,condition_1,11,2,35-39,2.0,2.0,2.0,6-10,1.0,2.0,19.0,19.0,3415660,310514.545455
1,condition_2,18,2,40-44,1.0,2.0,2.0,6-10,2.0,2.0,24.0,11.0,5981554,332308.555556
2,condition_3,13,1,45-49,2.0,2.0,2.0,6-10,2.0,2.0,24.0,25.0,5743208,441785.230769
3,condition_4,13,2,25-29,2.0,2.0,2.0,11-15,1.0,1.0,20.0,16.0,5925033,455771.769231
4,condition_5,13,2,50-54,2.0,2.0,2.0,11-15,2.0,2.0,26.0,26.0,3594618,276509.076923
5,condition_6,7,1,35-39,2.0,2.0,2.0,6-10,1.0,2.0,18.0,15.0,4209793,601399.0
6,condition_7,11,1,20-24,1.0,,2.0,11-15,2.0,1.0,24.0,25.0,5805537,527776.090909
7,condition_8,5,2,25-29,2.0,,2.0,11-15,1.0,2.0,20.0,16.0,3569050,713810.0
8,condition_9,13,2,45-49,1.0,,2.0,6-10,1.0,2.0,26.0,26.0,3632316,279408.923077
9,condition_10,9,2,45-49,2.0,2.0,2.0,6-10,1.0,2.0,28.0,21.0,6243346,693705.111111
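
<p>One way to inspect the relationship between daily average activity and the MADRS scores is a correlation matrix. The sketch below uses the first five rows of the table above purely as an illustration; with so few rows the coefficients are not meaningful on their own, and the <code>demo</code> frame is a hypothetical stand-in for <code>cdtn_score</code>.

```python
import pandas as pd

# First five rows of the table above (madrs1, madrs2 and stress columns only)
demo = pd.DataFrame({
    "number": ["condition_1", "condition_2", "condition_3", "condition_4", "condition_5"],
    "madrs1": [19.0, 24.0, 24.0, 20.0, 26.0],
    "madrs2": [19.0, 11.0, 25.0, 16.0, 26.0],
    "stress": [310514.545455, 332308.555556, 441785.230769, 455771.769231, 276509.076923],
})

# Pairwise Pearson correlation between daily average activity and both MADRS scores
corr = demo[["stress", "madrs1", "madrs2"]].corr(method="pearson")
```

<p>On the full data frame the same call would be <code>cdtn_score[["stress", "madrs1", "madrs2"]].corr()</code>.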


In [41]:
# Features: drop the target (afftype) and the days, inpatient and Sum columns
x = cdtn_score.drop(columns=["afftype", "days", "inpatient", "Sum"])
# Target: affective type, kept alongside the patient identifier
y = cdtn_score[["number", "afftype"]]
x

Unnamed: 0,number,gender,age,melanch,edu,marriage,work,madrs1,madrs2,stress
0,condition_1,2,35-39,2.0,6-10,1.0,2.0,19.0,19.0,310514.545455
1,condition_2,2,40-44,2.0,6-10,2.0,2.0,24.0,11.0,332308.555556
2,condition_3,1,45-49,2.0,6-10,2.0,2.0,24.0,25.0,441785.230769
3,condition_4,2,25-29,2.0,11-15,1.0,1.0,20.0,16.0,455771.769231
4,condition_5,2,50-54,2.0,11-15,2.0,2.0,26.0,26.0,276509.076923
5,condition_6,1,35-39,2.0,6-10,1.0,2.0,18.0,15.0,601399.0
6,condition_7,1,20-24,,11-15,2.0,1.0,24.0,25.0,527776.090909
7,condition_8,2,25-29,,11-15,1.0,2.0,20.0,16.0,713810.0
8,condition_9,2,45-49,,6-10,1.0,2.0,26.0,26.0,279408.923077
9,condition_10,2,45-49,2.0,6-10,1.0,2.0,28.0,21.0,693705.111111


<h5>Fix missing values</h5>