# Introduction

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

The content of the cell below keeps its original formatting only if the cell is not rendered as a Markdown cell.

This data was collected with an on-line version of the Depression Anxiety Stress Scales (DASS), see http://www2.psy.unsw.edu.au/dass/

The survey was open to anyone and people were motivated to take it to get personalized results. At the end of the test they were also given the option to complete a short research survey. This dataset comes from those who agreed to complete the research survey and answered yes to the question "Have you given accurate answers and may they be used for research?" at the end.

This data was collected between 2017 and 2019.

The following items were included in the survey:

Q1	I found myself getting upset by quite trivial things.
Q2	I was aware of dryness of my mouth.
Q3	I couldn't seem to experience any positive feeling at all.
Q4	I experienced breathing difficulty (eg, excessively rapid breathing, breathlessness in the absence of physical exertion).
Q5	I just couldn't seem to get going.
Q6	I tended to over-react to situations.
Q7	I had a feeling of shakiness (eg, legs going to give way).
Q8	I found it difficult to relax.
Q9	I found myself in situations that made me so anxious I was most relieved when they ended.
Q10	I felt that I had nothing to look forward to.
Q11	I found myself getting upset rather easily.
Q12	I felt that I was using a lot of nervous energy.
Q13	I felt sad and depressed.
Q14	I found myself getting impatient when I was delayed in any way (eg, elevators, traffic lights, being kept waiting).
Q15	I had a feeling of faintness.
Q16	I felt that I had lost interest in just about everything.
Q17	I felt I wasn't worth much as a person.
Q18	I felt that I was rather touchy.
Q19	I perspired noticeably (eg, hands sweaty) in the absence of high temperatures or physical exertion.
Q20	I felt scared without any good reason.
Q21	I felt that life wasn't worthwhile.
Q22	I found it hard to wind down.
Q23	I had difficulty in swallowing.
Q24	I couldn't seem to get any enjoyment out of the things I did.
Q25	I was aware of the action of my heart in the absence of physical exertion (eg, sense of heart rate increase, heart missing a beat).
Q26	I felt down-hearted and blue.
Q27	I found that I was very irritable.
Q28	I felt I was close to panic.
Q29	I found it hard to calm down after something upset me.
Q30	I feared that I would be "thrown" by some trivial but unfamiliar task.
Q31	I was unable to become enthusiastic about anything.
Q32	I found it difficult to tolerate interruptions to what I was doing.
Q33	I was in a state of nervous tension.
Q34	I felt I was pretty worthless.
Q35	I was intolerant of anything that kept me from getting on with what I was doing.
Q36	I felt terrified.
Q37	I could see nothing in the future to be hopeful about.
Q38	I felt that life was meaningless.
Q39	I found myself getting agitated.
Q40	I was worried about situations in which I might panic and make a fool of myself.
Q41	I experienced trembling (eg, in the hands).
Q42	I found it difficult to work up the initiative to do things.

Each item was presented one at a time in a random order for each new participant along with a 4 point rating scale asking the user to indicate how often that had been true of them in the past week where

1 = Did not apply to me at all 
2 = Applied to me to some degree, or some of the time
3 = Applied to me to a considerable degree, or a good part of the time
4 = Applied to me very much, or most of the time

(see the file demo1.png for how this looked)

This response is stored in variable A (e.g. Q1A). Also recorded was the time taken in milliseconds to answer that question (E) and that question's position in the survey (I).
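The naming scheme described here means the columns for one kind of value can be selected by name instead of by position. The following is a minimal sketch, assuming the columns are named exactly `Q1A` to `Q42A`, `Q1E` to `Q42E` and `Q1I` to `Q42I`; the helper `item_columns` and the tiny `demo` frame are made up for illustration and are not part of the dataset.

```python
import pandas as pd

N_ITEMS = 42  # number of DASS items listed above

def item_columns(suffix, n_items=N_ITEMS):
    """Return column names such as Q1A..Q42A for a suffix 'A', 'E' or 'I'."""
    return [f"Q{i}{suffix}" for i in range(1, n_items + 1)]

answer_cols = item_columns("A")    # responses on the 1-4 scale
elapsed_cols = item_columns("E")   # time taken in milliseconds
position_cols = item_columns("I")  # position of the item in the survey

# Made-up two-item frame to show the selection; not real survey data.
demo = pd.DataFrame({"Q1A": [1, 4], "Q1I": [2, 1], "Q1E": [3000, 1500],
                     "Q2A": [3, 2], "Q2I": [1, 2], "Q2E": [2500, 4000]})
demo_answers = demo[item_columns("A", n_items=2)]
print(demo_answers.mean(axis=1))  # per-respondent mean response
```

Selecting by name keeps working even if the column order in the file changes.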

These other durations were also recorded (measured on the server's side):

introelapse		The time spent on the introduction/landing page (in seconds)
testelapse		The time spent on all the DASS questions (should be equivalent to the time elapsed on all the individual questions combined)
surveyelapse	The time spent answering the rest of the demographic and survey questions
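
The note on `testelapse` suggests a simple consistency check: add up the per-question times and compare the total with the server-side value. This is a rough sketch on made-up data. It assumes the `E` columns are in milliseconds (as stated above) and that `testelapse` is in seconds, which the codebook states only for `introelapse`; the helper name `elapsed_gap_seconds` is hypothetical.

```python
import pandas as pd

def elapsed_gap_seconds(df, n_items):
    """Difference between testelapse and the summed per-item times, in seconds."""
    elapsed_cols = [f"Q{i}E" for i in range(1, n_items + 1)]
    summed_seconds = df[elapsed_cols].sum(axis=1) / 1000.0  # ms -> s
    return df["testelapse"] - summed_seconds  # assumes testelapse is in seconds

# Made-up example with two items per respondent; not real survey data.
demo = pd.DataFrame({"Q1E": [3000, 1500], "Q2E": [2000, 500],
                     "testelapse": [5, 10]})
gap = elapsed_gap_seconds(demo, n_items=2)
print(gap)  # rows with a large gap may deserve a closer look
```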

On the next page was a generic demographics survey with many different questions.

The Ten Item Personality Inventory was administered (see Gosling, S. D., Rentfrow, P. J., & Swann, W. B., Jr. (2003). A Very Brief Measure of the Big Five Personality Domains. Journal of Research in Personality, 37, 504-528.):

TIPI1	Extraverted, enthusiastic.
TIPI2	Critical, quarrelsome.
TIPI3	Dependable, self-disciplined.
TIPI4	Anxious, easily upset.
TIPI5	Open to new experiences, complex.
TIPI6	Reserved, quiet.
TIPI7	Sympathetic, warm.
TIPI8	Disorganized, careless.
TIPI9	Calm, emotionally stable.
TIPI10	Conventional, uncreative.

The TIPI items were rated "I see myself as:" _____ such that

1 = Disagree strongly
2 = Disagree moderately
3 = Disagree a little
4 = Neither agree nor disagree
5 = Agree a little
6 = Agree moderately
7 = Agree strongly
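
The ten TIPI items are usually combined into five trait scores, each averaging one regularly keyed item with one reverse-keyed item. The sketch below follows the scoring key as commonly published for the TIPI (for example, Extraversion from TIPI1 and reversed TIPI6); treat the key as an assumption to verify against the cited paper. The names `TIPI_KEY` and `score_tipi` are hypothetical and the `demo` row is made up.

```python
import pandas as pd

# Trait -> (regular item, reverse-scored item). Assumed scoring key; check it
# against Gosling et al. (2003) before relying on it.
TIPI_KEY = {
    "extraversion": ("TIPI1", "TIPI6"),
    "agreeableness": ("TIPI7", "TIPI2"),
    "conscientiousness": ("TIPI3", "TIPI8"),
    "emotional_stability": ("TIPI9", "TIPI4"),
    "openness": ("TIPI5", "TIPI10"),
}

def score_tipi(df):
    """Average each regular item with its reversed partner (stays on the 1-7 scale)."""
    scores = {}
    for trait, (regular, reverse) in TIPI_KEY.items():
        reversed_item = 8 - df[reverse]  # reverse a 1-7 rating: 1 <-> 7, 2 <-> 6, ...
        scores[trait] = (df[regular] + reversed_item) / 2
    return pd.DataFrame(scores, index=df.index)

# One made-up respondent; not real survey data.
demo = pd.DataFrame({"TIPI1": [7], "TIPI2": [1], "TIPI3": [5], "TIPI4": [2],
                     "TIPI5": [6], "TIPI6": [1], "TIPI7": [7], "TIPI8": [3],
                     "TIPI9": [6], "TIPI10": [2]})
scores = score_tipi(demo)
print(scores)
```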



The following items were presented as a check-list and subjects were instructed "In the grid below, check all the words whose definitions you are sure you know":

VCL1	boat
VCL2	incoherent
VCL3	pallid
VCL4	robot
VCL5	audible
VCL6	cuivocal
VCL7	paucity
VCL8	epistemology
VCL9	florted
VCL10	decide
VCL11	pastiche
VCL12	verdid
VCL13	abysmal
VCL14	lucid
VCL15	betray
VCL16	funny

A value of 1 is checked, 0 means unchecked. The words at VCL6, VCL9, and VCL12 are not real words and can be used as a validity check.
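
The validity check mentioned here can be sketched as a filter that flags anyone who claimed to know at least one of the three non-words. This is only an illustration on made-up data; the helper name `flag_invalid_vocab` is hypothetical, and whether flagged rows should be dropped is a separate decision.

```python
import pandas as pd

FAKE_WORDS = ["VCL6", "VCL9", "VCL12"]  # the non-words named above

def flag_invalid_vocab(df):
    """Return a boolean Series: True where a respondent checked any non-word."""
    return df[FAKE_WORDS].eq(1).any(axis=1)

# Made-up responses for three people; not real survey data.
demo = pd.DataFrame({"VCL6": [0, 1, 0], "VCL9": [0, 0, 0], "VCL12": [0, 0, 1]})
invalid = flag_invalid_vocab(demo)
print(invalid.tolist())  # [False, True, True]
```

On the real frame, `dass[~flag_invalid_vocab(dass)]` would keep only respondents who checked none of the non-words.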

A bunch more questions were then asked:


education			"How much education have you completed?", 1=Less than high school, 2=High school, 3=University degree, 4=Graduate degree
urban				"What type of area did you live when you were a child?", 1=Rural (country side), 2=Suburban, 3=Urban (town, city)
gender				"What is your gender?", 1=Male, 2=Female, 3=Other
engnat				"Is English your native language?", 1=Yes, 2=No
age					"How many years old are you?"
hand				"What hand do you use to write with?", 1=Right, 2=Left, 3=Both
religion			"What is your religion?", 1=Agnostic, 2=Atheist, 3=Buddhist, 4=Christian (Catholic), 5=Christian (Mormon), 6=Christian (Protestant), 7=Christian (Other), 8=Hindu, 9=Jewish, 10=Muslim, 11=Sikh, 12=Other
orientation			"What is your sexual orientation?", 1=Heterosexual, 2=Bisexual, 3=Homosexual, 4=Asexual, 5=Other
race				"What is your race?", 10=Asian, 20=Arab, 30=Black, 40=Indigenous Australian, 50=Native American, 60=White, 70=Other
voted				"Have you voted in a national election in the past year?", 1=Yes, 2=No
married				"What is your marital status?", 1=Never married, 2=Currently married, 3=Previously married
familysize			"Including you, how many children did your mother have?"		
major				"If you attended a university, what was your major (e.g. "psychology", "English", "civil engineering")?"

The following values were derived from technical information:

country				ISO country code of where the user connected from
screensize			1=device with small screen (phone, etc), 2=device with big screen (laptop, desktop, etc)
uniquenetworklocation		1=only one survey from user's specific network in dataset, 2=multiple surveys submitted from the network of this user  (2 does not necessarily imply duplicate records for an individual, as it could be different students at a single school or different members of the same household; and even if 1 there still could be duplicate records from a single individual e.g. if they took it once on their wifi and once on their phone)
source			how the user found the test, 1=from the front page of the site hosting the survey, 2=from google, 0=other or unknown



In [2]:
dass = pd.read_csv('../input/depression-anxiety-stress-scales/DASS_data_21.02.19/data.csv', sep=r'\t', engine='python')

In [3]:
dass

In [4]:
dass.describe()

In [5]:
dass.info()

# Data Analysis

## Average Response, Position and Time taken for each question

In [6]:
# This cell assumes the first 42*3 columns come in triples per question,
# in the order response (A), position (I), time (E), so every third column
# starting at offset 0, 1 or 2 picks out one kind of value.
response_c = list(dass.columns[0:42*3:3])  # column list of response
Response = [dass[c].mean() for c in dass.columns[0:42*3:3]]
Position = [dass[c].mean() for c in dass.columns[1:42*3:3]]
Time = [dass[c].mean() for c in dass.columns[2:42*3:3]]


In [7]:
Questions = pd.DataFrame({'Response':Response, 'Position':Position, 'Time':Time}, index=range(1,43))
Questions.index.name = 'Q.No.'
Questions

In [8]:
Questions.mean()

In [9]:
Questions.std()

In [10]:
plt.figure(figsize=(20,6))
sns.lineplot(data=Questions, x=Questions.index, y='Response')

In [11]:
# displot is figure-level and creates its own figure, so set the size via height/aspect
sns.displot(data=Questions, x='Response', kde=True, height=6, aspect=20/6)

Questions with a mean response below 2:

In [12]:
Questions[Questions['Response'] < 2]

In [13]:
plt.figure(figsize=(20,6))
sns.lineplot(data=Questions, x=Questions.index, y='Position')

The average position is almost the same for every question, which is expected because the questions were presented in a random order to each participant.
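
As a sanity check on this observation: if each participant sees the 42 items in a uniformly random order, every item's expected position is (1 + 42) / 2 = 21.5. The short simulation below illustrates that under this assumption (and assuming positions are numbered 1 to 42); it uses simulated permutations, not the survey data, and the sample size is arbitrary.

```python
import numpy as np

N_ITEMS = 42
N_RESPONDENTS = 10_000  # arbitrary size for the simulation

expected_position = (1 + N_ITEMS) / 2  # mean of the integers 1..42

rng = np.random.default_rng(0)
# Each row is one simulated respondent: entry j is the position of item j+1.
positions = np.array([rng.permutation(N_ITEMS) + 1 for _ in range(N_RESPONDENTS)])
mean_positions = positions.mean(axis=0)  # average position of each item

print(expected_position)
print(mean_positions.min(), mean_positions.max())  # both should sit near 21.5
```

A flat line near 21.5 in the plot above is therefore what random ordering predicts; a clear deviation for some item would suggest the order was not fully random.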

In [14]:
plt.figure(figsize=(20,6))
sns.lineplot(data=Questions, x=Questions.index, y='Time')

In [15]:
# displot is figure-level and creates its own figure, so set the size via height/aspect
sns.displot(data=Questions, x='Time', kde=True, height=6, aspect=20/6)

Questions with a mean response time above 10,000 ms (10 seconds):

In [16]:
Questions[Questions['Time'] > 10000]

In [17]:
sns.heatmap(Questions.corr(), cmap='YlGnBu', annot=True)

1. There isn't much correlation between response value and time taken
2. There is a small correlation between time taken and average position

In [18]:
plt.figure(figsize=(16,8))
sns.regplot(data=Questions, x='Position', y='Time')

## TIPI and VCL

In [19]:
dass.iloc[:, 131:157]

In [20]:
dass.iloc[:, 131:157].mean(axis=0)

In [21]:
qr = dass[response_c].mean(axis=1)
qr

In [22]:
tipi = dass.iloc[:, 131:141].copy()  # copy so the later insert does not act on a slice of dass
tipi

In [23]:
tipi.insert(0, 'response', qr.values)

In [24]:
tipi

In [25]:
plt.figure(figsize=(10,6))
sns.heatmap(tipi.corr(), cmap='Spectral', annot=True, fmt='.2f', center=0)

The mean response (averaged over all 42 DASS items) has a correlation of +0.53 with TIPI4 (Anxious, easily upset). <br>
It has a correlation of -0.51 with TIPI9 (Calm, emotionally stable). <br>
Neither result is much of a surprise.

In [26]:
vcl = dass.iloc[:, 141:157].copy()  # copy so the later insert does not act on a slice of dass
vcl

In [27]:
vcl.insert(0, 'response', qr.values)

In [28]:
vcl

In [29]:
plt.figure(figsize=(15,6))
sns.heatmap(vcl.corr(), cmap='Spectral', annot=True, fmt='.2f', center=0)

The mean response shows no notable correlation with any of the vocabulary items.

## Demographic Data

Distribution of each participant's mean response value:

In [30]:
sns.displot(dass[response_c].mean(axis=1), kde=True, height=5, aspect=3)

In [31]:
demography = dass.iloc[:, 157:].copy()  # copy so the later insert does not act on a slice of dass
demography

In [32]:
demography.insert(0, 'response', qr.values)

In [33]:
demography

In [34]:
plt.figure(figsize=(15,6))
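# numeric_only=True skips the free-text 'major' column (parameter available in pandas 1.5+)
sns.heatmap(demography.corr(numeric_only=True), cmap='Spectral', annot=True, fmt='.2f', center=0)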

Education level has a slightly negative correlation (-0.17) with the mean response.

In [35]:
demography.columns

Categorical plots of response values and demographics variables

In [36]:
fig, axes = plt.subplots(3, 4, figsize=(20,12))
l = ['education', 'urban', 'gender', 'engnat', 'screensize', 'uniquenetworklocation', 'hand', 'religion',
       'orientation', 'race', 'voted', 'married', 'familysize']
# The 3x4 grid has 12 axes, so the 13th variable ('familysize') is not plotted.
for ax, col in zip(axes.ravel(), l):
    sns.boxenplot(data=demography, y='response', x=col, ax=ax)
plt.tight_layout()

In [37]:
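def correlation_ratio(categories, measurements):
    """Correlation ratio (eta) between a categorical and a numeric variable, from 0 to 1.

    measurements is indexed by position, so pass an array or a Series with a default index.
    """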
    fcat, _ = pd.factorize(categories)
    cat_num = np.max(fcat)+1
    y_avg_array = np.zeros(cat_num)
    n_array = np.zeros(cat_num)
    for i in range(0,cat_num):
        cat_measures = measurements[np.argwhere(fcat == i).flatten()]
        n_array[i] = len(cat_measures)
        y_avg_array[i] = np.average(cat_measures)
    y_total_avg = np.sum(np.multiply(y_avg_array,n_array))/np.sum(n_array)
    numerator = np.sum(np.multiply(n_array,np.power(np.subtract(y_avg_array,y_total_avg),2)))
    denominator = np.sum(np.power(np.subtract(measurements,y_total_avg),2))
    if numerator == 0:
        eta = 0.0
    else:
        eta = np.sqrt(numerator/denominator)
    return eta

Correlation between categorical variables and response values (0 to 1)
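
For reference, the quantity computed by `correlation_ratio` is $\eta = \sqrt{\sum_k n_k (\bar{y}_k - \bar{y})^2 / \sum_i (y_i - \bar{y})^2}$, where $n_k$ and $\bar{y}_k$ are the size and mean of category $k$ and $\bar{y}$ is the overall mean. The sketch below recomputes the same ratio with a pandas `groupby` on a tiny made-up example so the arithmetic can be followed by hand; `eta_groupby` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

def eta_groupby(categories, measurements):
    """Correlation ratio via groupby: sqrt(between-group SS / total SS)."""
    df = pd.DataFrame({"cat": np.asarray(categories),
                       "y": np.asarray(measurements, dtype=float)})
    grand_mean = df["y"].mean()
    # Replace every value by the mean of its own category.
    group_means = df.groupby("cat")["y"].transform("mean")
    ss_between = ((group_means - grand_mean) ** 2).sum()
    ss_total = ((df["y"] - grand_mean) ** 2).sum()
    return 0.0 if ss_total == 0 else float(np.sqrt(ss_between / ss_total))

# Group means are 1.5 and 3.5, the overall mean is 2.5:
# between-group SS = 2*1 + 2*1 = 4, total SS = 2.25 + 0.25 + 0.25 + 2.25 = 5.
eta = eta_groupby(["a", "a", "b", "b"], [1, 2, 3, 4])
print(round(eta, 4))  # 0.8944, i.e. sqrt(4/5)
```

A value near 0 means the category means barely differ; a value near 1 means the category almost fully determines the measurement.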

In [38]:
cat_list = ['education', 'urban', 'gender', 'engnat', 'screensize', 'uniquenetworklocation', 'hand', 'orientation', 'voted', 'married', 'familysize']

In [39]:
for i in cat_list:
    print(i)
    print(f"{correlation_ratio(demography[i], demography['response'])}")


### Education

In [40]:
demography.groupby('education').count()

In [41]:
demography.groupby('education').mean(numeric_only=True)  # skip the free-text 'major' column

Higher education is associated with lower mean response values.

### Gender

In [42]:
demography.groupby('gender').count()

In [43]:
demography.groupby('gender').mean(numeric_only=True)  # skip the free-text 'major' column

On average, the 'Other' gender group has higher response values than the Female group, which in turn has higher values than the Male group. <br>
The 'Other' group also has less education, whereas the Female group has higher response values than the Male group despite having more education.

### Marriage

In [44]:
demography.groupby('married').count()

In [45]:
demography.groupby('married').mean(numeric_only=True)  # skip the free-text 'major' column

Currently married people have lower response values than previously married and never married people.

### Screen size

In [46]:
demography.groupby('screensize').mean(numeric_only=True)  # skip the free-text 'major' column

While a bigger screen is associated with a slightly lower response value (2.31 vs 2.41), the difference is small and may be driven by other factors.

In [47]:
demography[(demography['education']==3) & (demography['gender']==2) & (demography['married']==1)].groupby('screensize').mean(numeric_only=True)

The difference remains small when education, gender and marital status are held fixed (here: university degree, female, never married).
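
The cell above looks at a single combination of education, gender and marital status. A possible way to repeat the comparison for every combination at once is to group by all of those variables together with `screensize`. The sketch below runs on made-up data; the helper name `screensize_gap_by_stratum` is hypothetical, and it assumes the codebook's coding of 1 = small screen and 2 = big screen.

```python
import pandas as pd

def screensize_gap_by_stratum(df, strata=("education", "gender", "married")):
    """Mean response per screen size within each stratum, plus the gap (small minus big)."""
    table = (df.groupby(list(strata) + ["screensize"])["response"]
               .mean()
               .unstack("screensize"))
    table["gap"] = table[1] - table[2]  # 1 = small screen, 2 = big screen
    return table

# Made-up rows covering two strata; not real survey data.
demo = pd.DataFrame({
    "education":  [3, 3, 3, 3, 2, 2, 2, 2],
    "gender":     [2, 2, 2, 2, 1, 1, 1, 1],
    "married":    [1, 1, 1, 1, 1, 1, 1, 1],
    "screensize": [1, 1, 2, 2, 1, 1, 2, 2],
    "response":   [2.5, 2.75, 2.5, 2.25, 2.0, 2.5, 2.25, 2.25],
})
result = screensize_gap_by_stratum(demo)
print(result)
```

Strata with only a handful of respondents give noisy means, so it may help to look at group sizes alongside the gaps.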