Inspired by the article https://journals.plos.org/plosone/article/file?id=10.1371/journal.pone.0034928&type=printable

Depressive disorder is a serious mood disorder that affects how you handle daily activities and how you feel, think and, consequently, speak. Therefore, text is an important object of psychometric analysis. Its careful investigation can help to determine whether the author of a text is suffering from a mental illness. Most research aiming to extract features of depressive text focuses on semantic analysis of the text and topic retrieval (Ma et al. 2017). We instead focus on the structure of the text, connecting this feature to one of the symptoms of depression: reduced brain activity (Zhang et al. 2018).

In [3]:
import pandas as pd

test_df = pd.read_csv('PsyHack_RUDN_test.csv', sep='\t')
train_df = pd.read_csv('PsyHack_RUDN_train.csv', sep='\t')

In [342]:
len(train_df)

221

In [341]:
train_df.label.value_counts()

Non-depression    158
Depression         63
Name: label, dtype: int64
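
The classes are imbalanced, so accuracy alone can look better than it is. A minimal sketch of the majority-class baseline, computed from the counts printed above (the variable names are illustrative and not part of the original notebook):

```python
# Class counts copied from the value_counts() output above.
class_counts = {"Non-depression": 158, "Depression": 63}

total = sum(class_counts.values())
majority_label = max(class_counts, key=class_counts.get)

# Accuracy of a classifier that always predicts the majority class.
baseline_accuracy = class_counts[majority_label] / total

print(majority_label, round(baseline_accuracy, 4))
```

Always predicting the majority class already reaches roughly 0.715 accuracy on the training set, which is why the experiments below also report precision, recall, F1 and AUC.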

In [4]:
train_df.fillna(value=0, inplace=True)
test_df.fillna(value=0,inplace=True)

In [3]:
train_df.columns[4:]

Index(['Nodes', 'Edges', 'ATD', 'LCC', 'LSC', 'L1', 'L2', 'L3', 'Density',
       'Diameter', 'ASP'],
      dtype='object')

### Classifiers experimented with in the article:
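- NB (naive Bayes)
- SVM (support vector machine)
- DT (decision tree)
- MLP (multilayer perceptron)
- RBF (radial basis function)

NB gave the best results in the article: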

"...we trained a naive Bayes (NB) classifier with different subsets of graph measures as inputs. The data were normalized by the number of words in each report, in order to discount the effects of normal inter-individual verbosity differences. Furthermore, the inputs were restricted to data that could be obtained without having to resort to an interpretation of the meaning of the reports, i.e. waking nodes and edges were not employed. Sensitivity, specificity, the area under the receiver operating characteristic curve (AUC) [14] and the kappa statistic [15] were used as metrics of classification quality. Our approach objectively and accurately distinguished schizophrenic from manic reports (Fig. 8), and was comparable to the inter-rater reliability of SCID for the distinction between schizophrenics and controls, but not for the distinction between manics and controls [16], [17]. In contrast, when the scores from the psychometric scales BPRS and PANSS were used as inputs to the classifier, it was possible to distinguish controls from psychotic patients, but not schizophrenics from manics."

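The quoted passage normalizes every graph measure by the number of words in the report. A minimal sketch of that step is shown below. The word counts are hypothetical (no word-count column is shown for this dataset), and we do not know whether the features in the CSV files were already normalized.

```python
import numpy as np

# Two graph measures (columns) for three texts (rows); the word counts are
# hypothetical and only serve to illustrate the normalization.
raw_measures = np.array([[229.0, 317.0],
                         [164.0, 247.0],
                         [158.0, 282.0]])
word_counts = np.array([400.0, 300.0, 350.0])

# Divide every row by that text's word count; [:, None] turns the counts
# into a column vector so that broadcasting applies them row-wise.
normalized = raw_measures / word_counts[:, None]
print(normalized)
```
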
### Our Approach

Building the dataset

In [5]:
import numpy as np

In [13]:
train_features = np.array(train_df.values[:, 4:], dtype=np.float64)
test_features = np.array(test_df.values[:, 4:], dtype=np.float64)
print(train_features)

[[  2.29000000e+02   3.17000000e+02   1.38427948e+00 ...,   6.07140121e-03
    1.30000000e+01   2.93695420e+00]
 [  1.64000000e+02   2.47000000e+02   1.50609756e+00 ...,   9.23986234e-03
    1.30000000e+01   4.25347898e+00]
 [  1.58000000e+02   2.82000000e+02   1.78481013e+00 ...,   1.13682174e-02
    8.00000000e+00   2.29557324e+00]
 ..., 
 [  1.13000000e+02   9.40000000e+01   8.31858407e-01 ...,   7.42730721e-03
    1.40000000e+01   1.64077527e+00]
 [  1.31000000e+02   1.31000000e+02   1.00000000e+00 ...,   7.69230769e-03
    1.30000000e+01   1.88873835e+00]
 [  1.51000000e+02   1.62000000e+02   1.07284768e+00 ...,   7.15231788e-03
    1.60000000e+01   1.52928653e+00]]


In [6]:
# Import the preprocessing module, which provides LabelEncoder
from sklearn import preprocessing
# Create the label encoder
le = preprocessing.LabelEncoder()

In [7]:
train_label = le.fit_transform(train_df.label)
# Reuse the mapping learned on the training labels instead of refitting
test_label = le.transform(test_df.label)
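
`LabelEncoder` sorts the class names before numbering them, which determines which class the metrics below treat as positive. A small self-contained check, using only the two label strings from the training data:

```python
from sklearn.preprocessing import LabelEncoder

# The two label strings that appear in the training data above.
labels = ["Non-depression", "Depression", "Non-depression"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ is sorted, so index 0 is 'Depression' and index 1 is 'Non-depression'.
mapping = {name: int(code) for code, name in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())
```

Because scikit-learn's `precision_score`, `recall_score` and `f1_score` use `pos_label=1` by default, the precision, recall and F1 values reported in this notebook describe the Non-depression class. Passing `pos_label=0` would report them for the Depression class instead.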

## Experiment with the same set of classifiers

### Naive Bayes 

In [18]:
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
from sklearn.metrics import make_scorer
from sklearn.metrics import roc_auc_score

In [54]:
model = GaussianNB()

# Train the model using the training sets
model.fit(train_features,train_label)

#Predict Output
predicted= model.predict_proba(test_features)

In [10]:
predicted.argmax(axis=1)

array([1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0,
       1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0,
       1, 1, 1, 0, 1, 0, 0])

In [55]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(test_label,predicted.argmax(axis=1)))
print("Precision:",metrics.precision_score(test_label,predicted.argmax(axis=1)))
print("Recall:",metrics.recall_score(test_label,predicted.argmax(axis=1)))
print("F1-score:",metrics.f1_score(test_label,predicted.argmax(axis=1)))
print("AUC-ROC:",metrics.roc_auc_score(test_label,predicted.argmax(axis=1)))

Accuracy: 0.7263157894736842
Precision: 0.76
Recall: 0.8769230769230769
F1-score: 0.8142857142857143
AUC-ROC: 0.6384615384615384
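
The AUC above is computed from hard class predictions (`argmax`) rather than from the predicted probabilities. In that case the ROC curve has a single interior point and the AUC reduces to the mean of sensitivity and specificity. A toy illustration with made-up values (not data from this notebook):

```python
from sklearn.metrics import roc_auc_score

# Toy labels and class-1 probabilities (illustrative values only).
y_true = [0, 0, 1, 1]
proba_class1 = [0.2, 0.4, 0.45, 0.9]

# Hard predictions at the 0.5 threshold, as argmax over two columns would give.
hard = [int(p > 0.5) for p in proba_class1]

auc_from_scores = roc_auc_score(y_true, proba_class1)
auc_from_hard = roc_auc_score(y_true, hard)

# With hard labels, AUC equals the mean of sensitivity and specificity.
sensitivity = sum(h == 1 and t == 1 for h, t in zip(hard, y_true)) / y_true.count(1)
specificity = sum(h == 0 and t == 0 for h, t in zip(hard, y_true)) / y_true.count(0)

print(auc_from_scores, auc_from_hard, (sensitivity + specificity) / 2)
```

In this notebook the probability-based variant would correspond to passing `predicted[:, 1]` instead of `predicted.argmax(axis=1)`. All AUC values reported here were computed with the latter.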


## Support Vector Machines

In [56]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

In [57]:
RANDOM_STATE = 1337

C_array = np.logspace(-3, 3, num=7)
gamma_array = np.logspace(-5, 2, num=8)

svm_parameters = {
    'C': C_array,
    'gamma': gamma_array,  # has no effect with the linear kernel
    'kernel': ['linear']
}

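# This dict is not passed to GridSearchCV; the search below sets its
# arguments explicitly (with cv=5).
grid_parameters = {
    'scoring': make_scorer(roc_auc_score),
    'n_jobs': -1,
    'cv': 20,
    'iid': True,
    'return_train_score': True
}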

In [58]:
svm = SVC(random_state=RANDOM_STATE, probability=True)


grid_svm = GridSearchCV(
    estimator=svm,
    param_grid=svm_parameters,
    scoring=make_scorer(roc_auc_score),
    n_jobs=-1,
    cv=5,
    iid=True,
    return_train_score=True
)

In [59]:
grid_svm.fit(train_features,train_label)

print('SVM ROC AUC score: ', grid_svm.best_score_)

SVM ROC AUC score:  0.645919999513453


In [16]:
grid_svm.best_params_

{'C': 1000.0, 'gamma': 1e-05, 'kernel': 'linear'}
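
`gamma` only affects the `rbf`, `poly` and `sigmoid` kernels, so with `kernel='linear'` every value of `C` is refitted once per `gamma` value without changing the model. A small sketch that counts the candidates with and without `gamma`, reusing the grid values from above:

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

C_values = list(np.logspace(-3, 3, num=7))
gamma_values = list(np.logspace(-5, 2, num=8))

with_gamma = ParameterGrid({'C': C_values, 'gamma': gamma_values, 'kernel': ['linear']})
without_gamma = ParameterGrid({'C': C_values, 'kernel': ['linear']})

# Number of candidate settings the grid search has to fit per fold.
print(len(with_gamma), len(without_gamma))
```

Dropping `gamma` from the linear grid reduces the search from 56 to 7 candidates.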

In [17]:
svm = SVC(random_state=RANDOM_STATE, probability=True, C=1000.0, gamma=1e-05, kernel='linear')

In [18]:
svm.fit(train_features,train_label)

SVC(C=1000.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=1e-05, kernel='linear',
    max_iter=-1, probability=True, random_state=1337, shrinking=True, tol=0.001,
    verbose=False)

In [19]:
#Predict Output
predicted= svm.predict_proba(test_features)

In [20]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(test_label,predicted.argmax(axis=1)))
print("Precision:",metrics.precision_score(test_label,predicted.argmax(axis=1)))
print("Recall:",metrics.recall_score(test_label,predicted.argmax(axis=1)))
print("F1-score:",metrics.f1_score(test_label,predicted.argmax(axis=1)))
print("AUC-ROC:",metrics.roc_auc_score(test_label,predicted.argmax(axis=1)))

Accuracy: 0.6947368421052632
Precision: 0.6914893617021277
Recall: 1.0
F1-score: 0.8176100628930818
AUC-ROC: 0.5166666666666666
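
The linear SVM labels almost every test text as class 1 (recall 1.0, AUC close to 0.5). One thing that may be worth checking is feature scaling, since the features range from hundreds (Nodes, Edges) down to about 1e-2 (Density). The sketch below uses synthetic stand-in data and only shows how a scaler could be combined with the classifier; it is not a claim that scaling improves the results here.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.RandomState(1337)

# Synthetic stand-in for the graph features: one column in the hundreds and
# one around 1e-2, mimicking the scale gap between Nodes and Density.
X = np.column_stack([rng.normal(150, 40, size=60),
                     rng.normal(0.008, 0.002, size=60)])
y = (X[:, 1] > 0.008).astype(int)

# The scaler is part of the pipeline, so inside cross-validation it would be
# fitted on the training folds only.
scaled_svm = make_pipeline(StandardScaler(), SVC(kernel='linear', C=1.0))
scaled_svm.fit(X, y)

scaled_X = scaled_svm.named_steps['standardscaler'].transform(X)
print(scaled_X.mean(axis=0).round(6), scaled_X.std(axis=0).round(6))
```

Such a pipeline can be passed to `GridSearchCV` as the estimator; the parameter names then take the step prefix, for example `svc__C`.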


## RBF

In [21]:
RANDOM_STATE = 1337

C_array = np.logspace(-3, 3, num=7)
gamma_array = np.logspace(-5, 2, num=8)

svm_parameters = {
    'C': C_array,
    'gamma': gamma_array,
    'kernel': ['rbf']
}

grid_parameters = {
    'scoring': make_scorer(roc_auc_score),
    'n_jobs': -1,
    'cv': 3,
    'iid': True,
    'return_train_score': True
}

In [22]:
svm = SVC(random_state=RANDOM_STATE, probability=True)


grid_svm = GridSearchCV(
    estimator=svm,
    param_grid=svm_parameters,
    scoring=make_scorer(roc_auc_score),
    n_jobs=-1,
    cv=5,
    iid=True,
    return_train_score=True
)


In [23]:
grid_svm.fit(train_features,train_label)

print('SVM ROC AUC score: ', grid_svm.best_score_)

SVM ROC AUC score:  0.7534317077411835


In [24]:
grid_svm.best_params_

{'C': 1000.0, 'gamma': 1e-05, 'kernel': 'rbf'}

In [25]:
# C=10.0 is used here, although best_params_ above reports C=1000.0
svm = SVC(random_state=RANDOM_STATE, probability=True, C=10.0, gamma=1e-05, kernel='rbf')

In [26]:
svm.fit(train_features,train_label)

SVC(C=10.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=1e-05, kernel='rbf',
    max_iter=-1, probability=True, random_state=1337, shrinking=True, tol=0.001,
    verbose=False)

In [27]:
#Predict Output
predicted= svm.predict_proba(test_features)

In [28]:
p = svm.predict_proba(train_features)
print("AUC-ROC:",metrics.roc_auc_score(train_label,p.argmax(axis=1)))

AUC-ROC: 0.7759289505920783


In [29]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(test_label,predicted.argmax(axis=1)))
print("Precision:",metrics.precision_score(test_label,predicted.argmax(axis=1)))
print("Recall:",metrics.recall_score(test_label,predicted.argmax(axis=1)))
print("F1-score:",metrics.f1_score(test_label,predicted.argmax(axis=1)))
# print("AUC-ROC:",metrics.roc_auc_score(test_label,[predicted[i][1] if e==1 else predicted[i][0] for i, e in enumerate(predicted.argmax(axis=1))]))
print("AUC-ROC:",metrics.roc_auc_score(test_label,predicted.argmax(axis=1)))

Accuracy: 0.7473684210526316
Precision: 0.7662337662337663
Recall: 0.9076923076923077
F1-score: 0.8309859154929577
AUC-ROC: 0.6538461538461539


## Decision Trees

In [9]:
import xgboost
from xgboost import XGBClassifier
import numpy as np

In [333]:
# A parameter grid for XGBoost
params = {
    'min_child_weight': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20],
    'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.5, 2, 5, 10, 15, 20],
    'subsample': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    'max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 15, 20, 25, 30, 35]
}

In [334]:
xgb = XGBClassifier(learning_rate=0.02, n_estimators=600, objective='binary:logistic',
                    silent=True, nthread=1)

In [335]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import RandomizedSearchCV

folds = 3
param_comb = 5

skf = StratifiedKFold(n_splits=folds, shuffle = True, random_state = 1001)

random_search = RandomizedSearchCV(
    xgb,
    param_distributions=params,
    n_iter=param_comb,
    scoring='roc_auc',
    n_jobs=4,
    cv=skf.split(train_features, train_label),
    verbose=3,
    random_state=1001
)

In [336]:
random_search.fit(train_features,train_label)

Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=4)]: Done  15 out of  15 | elapsed:    0.5s finished


RandomizedSearchCV(cv=<generator object _BaseKFold.split at 0x7f0061e97258>,
                   error_score=nan,
                   estimator=XGBClassifier(base_score=0.5, booster='gbtree',
                                           colsample_bylevel=1,
                                           colsample_bynode=1,
                                           colsample_bytree=1, gamma=0,
                                           learning_rate=0.02, max_delta_step=0,
                                           max_depth=3, min_child_weight=1,
                                           missing=None, n_estimators=600,
                                           n_jobs=1, nthread=1,
                                           objective='binary:logist...
                                                             0.5, 0.6, 0.7, 0.8,
                                                             0.9, 1.0],
                                        'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6,
           

In [337]:
random_search.best_score_

0.801829198055613

In [231]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(test_label,predicted.argmax(axis=1)))
print("Precision:",metrics.precision_score(test_label,predicted.argmax(axis=1)))
print("Recall:",metrics.recall_score(test_label,predicted.argmax(axis=1)))
print("F1-score:",metrics.f1_score(test_label,predicted.argmax(axis=1)))
print("AUC-ROC:",metrics.roc_auc_score(test_label,predicted.argmax(axis=1)))

Accuracy: 0.8
Precision: 0.8108108108108109
Recall: 0.9230769230769231
F1-score: 0.8633093525179857
AUC-ROC: 0.7282051282051282
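
The cell that produced `predicted` for these metrics is not shown in the notebook. A hedged sketch of that step is given below. It uses synthetic stand-in data and swaps in scikit-learn's `DecisionTreeClassifier` for `XGBClassifier` so that it runs without xgboost; the parameter grid is illustrative only.

```python
import numpy as np
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(1001)

# Synthetic stand-in data (the real features come from the CSV files).
X_train = rng.normal(size=(90, 4))
y_train = (X_train[:, 0] + 0.5 * rng.normal(size=90) > 0).astype(int)
X_test = rng.normal(size=(30, 4))

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=1001),
    param_distributions={'max_depth': [1, 2, 3, 4, 5],
                         'min_samples_leaf': [1, 2, 5, 10]},
    n_iter=5,
    scoring='roc_auc',
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=1001),
    random_state=1001,
)
search.fit(X_train, y_train)

# With the default refit=True, the search object predicts with the best
# estimator, refitted on the whole training set.
test_proba = search.predict_proba(X_test)
print(search.best_params_, test_proba.shape)
```

In the notebook, the analogous call would be `random_search.predict_proba(test_features)`.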


## Multilayer Perceptron

In [343]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(max_iter=100)

In [344]:
parameter_space = {
    'hidden_layer_sizes': [(50,50,50), (50,100,50), (100,)],
    'activation': ['tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant','adaptive'],
}

In [351]:
from sklearn.model_selection import GridSearchCV

clf = GridSearchCV(mlp, parameter_space, n_jobs=-1, cv=3, scoring='roc_auc')
clf.fit(train_features, train_label)



GridSearchCV(cv=3, error_score=nan,
             estimator=MLPClassifier(activation='relu', alpha=0.0001,
                                     batch_size='auto', beta_1=0.9,
                                     beta_2=0.999, early_stopping=False,
                                     epsilon=1e-08, hidden_layer_sizes=(100,),
                                     learning_rate='constant',
                                     learning_rate_init=0.001, max_fun=15000,
                                     max_iter=100, momentum=0.9,
                                     n_iter_no_change=10,
                                     nesterovs_momentum=True, power_t=0.5,
                                     random_state...
                                     validation_fraction=0.1, verbose=False,
                                     warm_start=False),
             iid='deprecated', n_jobs=-1,
             param_grid={'activation': ['tanh', 'relu'],
                         'alpha': [0.0001, 0.05],


In [354]:
clf.best_score_

0.6194162231898082

In [352]:
# Best parameter set
print('Best parameters found:\n', clf.best_params_)

# All results
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))

Best parameters found:
 {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 100, 50), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.546 (+/-0.017) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 50, 50), 'learning_rate': 'constant', 'solver': 'sgd'}
0.606 (+/-0.133) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 50, 50), 'learning_rate': 'constant', 'solver': 'adam'}
0.564 (+/-0.112) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 50, 50), 'learning_rate': 'adaptive', 'solver': 'sgd'}
0.570 (+/-0.182) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 50, 50), 'learning_rate': 'adaptive', 'solver': 'adam'}
0.587 (+/-0.036) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 100, 50), 'learning_rate': 'constant', 'solver': 'sgd'}
0.547 (+/-0.014) for {'activation': 'tanh', 'alpha': 0.0001, 'hidden_layer_sizes': (50, 100, 50), 'learning_rate': 'constant', 'solver

In [353]:
predicted = clf.predict(test_features)

In [356]:
# Model Accuracy, how often is the classifier correct?
print("Accuracy:",metrics.accuracy_score(test_label,predicted))
print("Precision:",metrics.precision_score(test_label,predicted))
print("Recall:",metrics.recall_score(test_label,predicted))
print("F1-score:",metrics.f1_score(test_label,predicted))
print("AUC-ROC:",metrics.roc_auc_score(test_label,predicted))

Accuracy: 0.7052631578947368
Precision: 0.7078651685393258
Recall: 0.9692307692307692
F1-score: 0.8181818181818182
AUC-ROC: 0.5512820512820513
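
To compare the five experiments side by side, the test-set metrics reported above can be collected and ranked. The sketch below copies the printed values, rounded to four decimals. The boosted-tree metrics come from a cell whose prediction step is not shown, so that row should be read with this caveat.

```python
# Test-set metrics copied from the outputs reported above (rounded).
results = {
    "Naive Bayes":  {"accuracy": 0.7263, "f1": 0.8143, "auc": 0.6385},
    "SVM (linear)": {"accuracy": 0.6947, "f1": 0.8176, "auc": 0.5167},
    "SVM (RBF)":    {"accuracy": 0.7474, "f1": 0.8310, "auc": 0.6538},
    "XGBoost":      {"accuracy": 0.8000, "f1": 0.8633, "auc": 0.7282},
    "MLP":          {"accuracy": 0.7053, "f1": 0.8182, "auc": 0.5513},
}

# Rank the models by the hard-label AUC used throughout the notebook.
ranked = sorted(results.items(), key=lambda item: item[1]["auc"], reverse=True)
for name, scores in ranked:
    print(f"{name:<13} acc={scores['accuracy']:.4f} "
          f"f1={scores['f1']:.4f} auc={scores['auc']:.4f}")
```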
