## Questionnaire Descriptions
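1. Quick Inventory of Depressive Symptomatology (QIDS): 16-item measure of depression. The total is not a plain item sum, so scoring is more involved than for the other scales.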
2. Spielberger State Anxiety Inventory (SSAI): Measure of state anxiety. 20 items, 1-4 scale.
3. Mood and Anxiety Symptom Questionnaire (MASQ): Measure based on a model of depression, with subscales for **anhedonia**, **general distress**, and **anxious arousal**. 30 items, 1-5 Likert scale.
4. Anger Attack Questionnaire (AAQ): Measure of irritability and anger. Dichotomous score.
5. Snaith-Hamilton Pleasure Scale (SHAPS): Measure of anhedonia. 14 items, 1-4 scale.
6. Altman Self-Rating Mania Scale (ASRM): Measure of mania. 5 items, 0-4 scale. A total >= 6 is scored as manic, otherwise non-manic.

Total of 6 questionnaires and 8 subscales.
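
To make the QIDS scoring concrete, here is a minimal sketch of the 16-item total as it is commonly described. There are nine symptom domains. Three of them (sleep, appetite/weight, psychomotor) take the highest of several items, giving a 0-27 total. The item-to-domain mapping and the function name `qids_total` are assumptions for illustration and are not taken from this dataset, so check them against the official scoring sheet before relying on them.

```python
# Sketch of QIDS-16 total scoring (assumed item order; verify against the official form).
# Items are scored 0-3. Three domains take the max of several items; the rest are single items.
QIDS_DOMAINS = [
    (0, 1, 2, 3),   # sleep: items 1-4
    (4,),           # sad mood: item 5
    (5, 6, 7, 8),   # appetite / weight: items 6-9
    (9,),           # concentration: item 10
    (10,),          # self-view: item 11
    (11,),          # suicidal ideation: item 12
    (12,),          # interest: item 13
    (13,),          # energy: item 14
    (14, 15),       # psychomotor: items 15-16
]

def qids_total(items):
    """Return the 0-27 total from 16 item responses (each 0-3)."""
    if len(items) != 16:
        raise ValueError('expected 16 item responses')
    if any(i not in (0, 1, 2, 3) for i in items):
        raise ValueError('item responses must be integers 0-3')
    # One score per domain (the highest item in that domain), then sum the domains.
    return sum(max(items[i] for i in domain) for domain in QIDS_DOMAINS)

# Hypothetical responses for one participant.
responses = [2, 0, 1, 0,  1,  0, 2, 0, 0,  1, 0, 0, 2, 1,  0, 1]
total = qids_total(responses)
plain_sum = sum(responses)
```

With these made-up responses the domain-based total is lower than the plain item sum, which is why the precomputed `qids_eval_total1` column is used below rather than summing items by hand.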

### A priori thoughts on structuring data
1. SHAPS and MASQ-Anhedonia can probably be collapsed.
2. ASRM and AAQ are likely related.
3. SSAI and MASQ-Anxiety *may* be related.

We might be able to collapse into a few categories: anhedonia (MASQ, SHAPS), internalized arousal (SSAI, MASQ), externalized arousal (ASRM, AAQ). Let's do some digging!
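
One way to check ideas like these before collapsing scales is to look at the correlation between a candidate pair and, if it is high enough, average their z-scores into a composite. The sketch below uses synthetic data, not the EMBARC data. The arrays `shaps` and `masq_ad`, the seed, and the effect sizes are made up for illustration.

```python
import numpy as np

def zscore(arr):
    """Standardize to mean 0, SD 1 (NumPy's default population SD, ddof=0)."""
    arr = np.asarray(arr, dtype=float)
    return (arr - arr.mean()) / arr.std()

# Synthetic stand-ins for two anhedonia scales that share a latent factor.
rng = np.random.RandomState(0)
latent = rng.normal(size=200)
shaps = 30 + 5 * (0.7 * latent + 0.7 * rng.normal(size=200))
masq_ad = 25 + 8 * (0.7 * latent + 0.7 * rng.normal(size=200))

# Pearson correlation between the two scales.
r = np.corrcoef(shaps, masq_ad)[0, 1]

# Composite: mean of the two z-scored scales, so neither scale's units dominate.
composite = (zscore(shaps) + zscore(masq_ad)) / 2.0

# The composite has mean ~0 but SD below 1 unless r == 1: variance = (1 + r) / 2.
composite_sd = composite.std()
```

Because the composite's variance shrinks as the correlation drops, it may be worth re-standardizing it if it is later compared against single z-scored scales.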

## Prepare Data

In [2]:
import os
import numpy as np
import pylab as plt
from pandas import read_csv
from mpl_toolkits.mplot3d import Axes3D

## Load data for all participants (restriction to psychiatric patients happens later).
csv = read_csv('/space/will/3/users/EMBARC/behavior/embarc_baseline_totals.csv')
csv = csv.set_index('ProjectSpecificID')

## Restrict to columns of interest.
columns = ['sample','qids_eval_total1', 'masq2_score_gd', 'masq2_score_ad', 'masq2_score_aa',
           'stai_pre_final_score', 'shaps_total_continuous', 'shaps_total_dichotomous', 'asrm_score2']
csv = csv[columns]
csv.columns = ['Diagnosis','QIDS', 'MASQ_GD', 'MASQ_AD', 'MASQ_AA', 'SSAI', 'SHAPS_Cont', 'SHAPS_Dicho', 'ASRM']

## Remove missing data (for now).
csv = csv.dropna()

## Based on recommendations, score ASRM as manic or non-manic.
csv['ASRM'] = np.where(csv['ASRM']>=6, 1, 0)

## Print out demographics and questionnaire means. 

In [74]:
dem = read_csv('/space/will/3/users/EMBARC/behavior/assessments_raw/Demographics.csv')
dem = dem.set_index('ProjectSpecificID')

columns = ['sample', 'age_evaluation']
dem = dem[columns]
dem.columns = ['Diagnosis', 'Age']

## dropna() returns a new frame, so assign the result back.
dem = dem.dropna()

print 'Healthy Cohort Age Mean: ', dem.loc[dem.Diagnosis==2, 'Age'].mean()
print 'Healthy Cohort Age STD: ', dem.loc[dem.Diagnosis==2, 'Age'].std()

print 'Psychiatric Cohort Age Mean: ', dem.loc[dem.Diagnosis==1, 'Age'].mean()
print 'Psychiatric Cohort Age STD: ', dem.loc[dem.Diagnosis==1, 'Age'].std(), '\n'

print 'MASQ_AA Healthy Mean: ' , csv.loc[csv.Diagnosis==2,'MASQ_AA'].mean()
print 'MASQ_AA Healthy STD: ' , csv.loc[csv.Diagnosis==2,'MASQ_AA'].std()

print 'MASQ_AA Psychiatric Mean: ' , csv.loc[csv.Diagnosis==1,'MASQ_AA'].mean()
print 'MASQ_AA Psychiatric STD: ' , csv.loc[csv.Diagnosis==1,'MASQ_AA'].std(), '\n'

print 'SHAPS_Cont Healthy Mean: ' , csv.loc[csv.Diagnosis==2,'SHAPS_Cont'].mean()
print 'SHAPS_Cont Healthy STD: ' , csv.loc[csv.Diagnosis==2,'SHAPS_Cont'].std()

print 'SHAPS_Cont Psychiatric Mean: ' , csv.loc[csv.Diagnosis==1,'SHAPS_Cont'].mean()
print 'SHAPS_Cont Psychiatric STD: ' , csv.loc[csv.Diagnosis==1,'SHAPS_Cont'].std()

Healthy Cohort Age Mean:  37.0769230769
Healthy Cohort Age STD:  14.7885039531
Psychiatric Cohort Age Mean:  37.362745098
Psychiatric Cohort Age STD:  13.2932542111 

MASQ_AA Healthy Mean:  10.78
MASQ_AA Healthy STD:  1.1119022127
MASQ_AA Psychiatric Mean:  17.3917525773
MASQ_AA Psychiatric STD:  5.51216216611 

SHAPS_Cont Healthy Mean:  20.84
SHAPS_Cont Healthy STD:  5.76180243908
SHAPS_Cont Psychiatric Mean:  33.439862543
SHAPS_Cont Psychiatric STD:  5.73846597142


## Plot Distributions
The healthy controls and the psychiatric patients look as if they are drawn from two separate populations. It therefore seems difficult, and probably inappropriate, to use the HC distribution to quantify severity in the psychiatric group, so we will try a different approach.
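
One rough way to put a number on that separation is a standardized mean difference computed from the summary statistics printed above. This is a sketch under an assumption: the group sizes are not shown in that output, so the two variances are averaged with equal weight rather than pooled by sample size. The helper name `cohens_d` is illustrative.

```python
import math

def cohens_d(mean_a, sd_a, mean_b, sd_b):
    """Standardized mean difference using the unweighted average of the two variances."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2.0)
    return (mean_a - mean_b) / pooled_sd

# Means and SDs copied from the printed output above (psychiatric vs. healthy).
d_masq_aa = cohens_d(17.3917525773, 5.51216216611, 10.78, 1.1119022127)
d_shaps = cohens_d(33.439862543, 5.73846597142, 20.84, 5.76180243908)

print('MASQ_AA d = %.2f' % d_masq_aa)
print('SHAPS_Cont d = %.2f' % d_shaps)
```

Both differences come out well above 1.5 pooled SDs (roughly 1.7 for MASQ_AA and 2.2 for SHAPS_Cont), consistent with the impression that the two groups barely overlap on these scales.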

In [33]:
fig, axes = plt.subplots(2,4,figsize=(12,6))

for n, col in enumerate(csv.columns[1:]):
    
    r,c = n // 4, n % 4  # integer division so the row index stays an int
    
    for m, color, bins in zip([1,2],['#ca0020','#0571b0'],[10,5]):
        
        x = csv.loc[csv.Diagnosis==m,col].values
        axes[r,c].hist(x, bins=bins, color=color, alpha=0.75, normed=1)
    
    axes[r,c].set_title(col, fontsize=24)
    
plt.tight_layout()
plt.show()
plt.close('all')

# Preprocess Group Data.
## Plot distributions of psychiatric data only.

In [3]:
img_dir = '/autofs/space/will_003/users/EMBARC/notebooks/plots_embarc_questionnaires'

## Reduce to psychiatric patients only.
df = csv[csv.Diagnosis==1]
df = df.drop(['Diagnosis','ASRM'], axis=1) # dropping ASRM due to no variability.


## Plot.
fig, axes = plt.subplots(2,4,figsize=(12,6), sharey=True)

for n, col in enumerate(df.columns):
    
    r,c = n // 4, n % 4  # integer division so the row index stays an int
        
    x = df[col].values
    axes[r,c].hist(x, bins=10, color='#ca0020')
    
    axes[r,c].set_title(col, fontsize=24)
    
plt.tight_layout()
plt.show()
# plt.savefig(os.path.join(img_dir,'psychiatric_questionnaire_hist.png'))
plt.close()

## Preprocess Data

In [76]:
from sklearn.decomposition import PCA
def zscore(arr): return (arr - arr.mean()) / arr.std()

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
## Apply z-score.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
preproc = df.apply(zscore, axis=0)
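## Restore AAQ to its raw dichotomous coding rather than a z-score.
## Assumes df has an 'AAQ' column; the loading cell above would need to select it.
preproc['AAQ'] = df.AAQ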

## Print correlation structure.
print np.round( preproc.corr(), 3 )

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
## Add mean of MASQ_AD / SHAPS.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
preproc['Anhedonia'] = preproc[['MASQ_AD','SHAPS']].mean(axis=1)

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
## Add three PCA columns of symptoms.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
pca = PCA(n_components=3)
ortho = pca.fit_transform(preproc[['QIDS','MASQ_GD','SSAI']])
print '#-----------------------------------------------------------#'
print pca.explained_variance_ratio_

## Add back to dataframe.
for n in range(3): preproc['PCA%s' %(n+1)] = ortho[:,n]

## Print correlation structure.
print '#-----------------------------------------------------------#'
print np.round( preproc[['PCA1','PCA2','PCA3','Anhedonia','MASQ_AA','AAQ']].corr(), 3 )

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
## Regress the three PCA components out of the three measures.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
indep = ['PCA1','PCA2','PCA3']
depen = ['Anhedonia','MASQ_AA','AAQ']
coef, _, _, _ = np.linalg.lstsq(preproc[indep], preproc[depen])

cleaned = preproc[depen] - np.dot(preproc[indep], coef)
cleaned.columns = ['Anhedonia','Anxiety', 'Irritability']

## Print correlation structure.
print '#-----------------------------------------------------------#'
print np.round( cleaned.corr(), 3 )

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
## Plot cleaned.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
fig, axes = plt.subplots(1,3,figsize=(12,6))
for ax, col in zip(axes,cleaned.columns):
    ax.hist(cleaned[col], color='#ca0020')
    ax.set_title(col, fontsize=24)
plt.tight_layout()
plt.savefig(os.path.join(img_dir,'psychiatric_ortho_hist.png'))
plt.close()

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
## Save datasets.
#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
def dichotomize(arr): return np.where( arr > np.mean(arr), 1, 0 )
out_dir = '/autofs/space/will_003/users/EMBARC/behavior'

## Add dichotomized regressors.
for col in cleaned.columns: cleaned['%s.Dich' %col] = dichotomize(cleaned[col])

## Add PCA to regressors.
cleaned = cleaned.merge(preproc[indep], left_index=True, right_index=True)

## Save data.
f = 'embarc_psychiatric_clustering.csv'
cleaned.to_csv(f)
cleaned.to_csv(os.path.join(out_dir,f))

cleaned.groupby(['Anhedonia.Dich','Anxiety.Dich','Irritability.Dich']).count()

          QIDS  MASQ_GD  MASQ_AD  MASQ_AA   SSAI    AAQ  SHAPS
QIDS     1.000    0.437    0.223    0.334  0.276  0.052  0.301
MASQ_GD  0.437    1.000    0.384    0.380  0.415  0.145  0.284
MASQ_AD  0.223    0.384    1.000    0.036  0.324 -0.005  0.423
MASQ_AA  0.334    0.380    0.036    1.000  0.332  0.213  0.167
SSAI     0.276    0.415    0.324    0.332  1.000  0.120  0.233
AAQ      0.052    0.145   -0.005    0.213  0.120  1.000  0.086
SHAPS    0.301    0.284    0.423    0.167  0.233  0.086  1.000
#-----------------------------------------------------------#
[ 0.58530798  0.24165314  0.17303889]
#-----------------------------------------------------------#
            PCA1   PCA2   PCA3  Anhedonia  MASQ_AA    AAQ
PCA1       1.000 -0.000  0.000     -0.454   -0.457 -0.140
PCA2      -0.000  1.000 -0.000     -0.020   -0.005 -0.056
PCA3       0.000 -0.000  1.000     -0.039   -0.006 -0.056
Anhedonia -0.454 -0.020 -0.039      1.000    0.120  0.048
MASQ_AA   -0.457 -0.005 -0.006      0.120   

| Anhedonia.Dich | Anxiety.Dich | Irritability.Dich | Anhedonia | Anxiety | Irritability | PCA1 | PCA2 | PCA3 |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 45 | 45 | 45 | 45 | 45 | 45 |
| 0 | 0 | 1 | 24 | 24 | 24 | 24 | 24 | 24 |
| 0 | 1 | 0 | 39 | 39 | 39 | 39 | 39 | 39 |
| 0 | 1 | 1 | 29 | 29 | 29 | 29 | 29 | 29 |
| 1 | 0 | 0 | 70 | 70 | 70 | 70 | 70 | 70 |
| 1 | 0 | 1 | 27 | 27 | 27 | 27 | 27 | 27 |
| 1 | 1 | 0 | 29 | 29 | 29 | 29 | 29 | 29 |
| 1 | 1 | 1 | 28 | 28 | 28 | 28 | 28 | 28 |
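
A note on the step that regresses the PCA components out of the three measures. Because all three components of a three-variable PCA are kept, the components span the same space as the centered QIDS, MASQ_GD, and SSAI columns. The residuals should therefore match what a direct regression on those three scales would give. The sketch below checks this on synthetic data using NumPy only, with PCA done via SVD rather than scikit-learn. The variable names and data are made up for illustration.

```python
import numpy as np

rng = np.random.RandomState(1)
n = 300

# Synthetic "general severity" scales (stand-ins for QIDS, MASQ_GD, SSAI), z-scored.
severity = rng.normal(size=(n, 1))
X = severity + rng.normal(size=(n, 3))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Synthetic target symptom (stand-in for e.g. Anhedonia), partly driven by severity.
y = 0.6 * severity[:, 0] + rng.normal(size=n)
y = y - y.mean()

# Full-rank PCA via SVD: scores = U * S, one column per component.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
scores = U * S

def residualize(target, regressors):
    """Remove the least-squares fit of `regressors` from `target`."""
    coef = np.linalg.lstsq(regressors, target, rcond=None)[0]
    return target - regressors.dot(coef)

resid_pca = residualize(y, scores)   # regress out all 3 components
resid_raw = residualize(y, X)        # regress out the 3 original scales

# Same column space, so the residuals agree up to floating-point error.
max_diff = np.abs(resid_pca - resid_raw).max()

# Residuals are uncorrelated with every component by construction.
resid_corr = [np.corrcoef(resid_pca, scores[:, k])[0, 1] for k in range(3)]
```

Keeping fewer than three components would remove only part of the shared variance, which is the point at which the PCA step would start to differ from a plain regression on the three scales.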
