# Surveys

Survey data consists of the following columns:
* `id` for the question identifier
* `answer` for the answer to the question
* `q` for the text of the question presented to the user (optional)

As usual, the DataFrame index is the timestamp of the answer.  By convention, all responses in a single survey instance share the same timestamp, and this shared timestamp is what links the answers of one survey together.

The raw on-disk format is "long" (one row per answer), which is also known as "tidy data".  This is the most flexible format, but you often need to transform it further before analysis.


In [1]:
# Artificial example PHQ9 data
import niimpy
df = niimpy.read_csv(niimpy.sampledata.SURVEY_PHQ9, tz='Europe/Helsinki')
df.head()

Unnamed: 0,time,id,answer,datetime
2021-07-19 20:51:32+03:00,1626717092,PHQ9_1,0,2021-07-19 20:51:32+03:00
2021-07-19 20:51:32+03:00,1626717092,PHQ9_2,1,2021-07-19 20:51:32+03:00
2021-07-19 20:51:32+03:00,1626717092,PHQ9_3,0,2021-07-19 20:51:32+03:00
2021-07-19 20:51:32+03:00,1626717092,PHQ9_4,0,2021-07-19 20:51:32+03:00
2021-07-19 20:51:32+03:00,1626717092,PHQ9_5,2,2021-07-19 20:51:32+03:00


## "wide" format data: converting to one-row-per-survey

The data can be converted to a wide format.  This may be convenient for viewing, but most analysis is probably easier in the long format, grouping by the index.





In [2]:
df.columns

Index(['time', 'id', 'answer', 'datetime'], dtype='object')

In [3]:
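# One column per question id; the existing timestamp index gives one row per survey.
# Keyword arguments are used because pandas 2.0+ no longer accepts positional ones here.
wide = df.pivot(columns='id', values='answer')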
wide

id,PHQ9_1,PHQ9_2,PHQ9_3,PHQ9_4,PHQ9_5,PHQ9_6,PHQ9_7,PHQ9_8,PHQ9_9
2021-07-19 20:51:32+03:00,0,1,0,0,2,0,1,0,0
2021-07-20 20:50:31+03:00,0,1,0,0,1,0,0,0,0
2021-07-21 20:49:07+03:00,1,2,1,0,3,1,3,0,0
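
The long-format alternative mentioned above can be sketched with plain pandas. The small `long_df` below is made-up data that follows the same conventions (a timestamp index plus `id` and `answer` columns); it is an illustration, not part of the niimpy sample data.

```python
import pandas as pd

# Made-up long-format data: two survey instances, three questions each.
times = pd.to_datetime(
    ["2021-07-19 20:51:32"] * 3 + ["2021-07-20 20:50:31"] * 3
).tz_localize("Europe/Helsinki")
long_df = pd.DataFrame(
    {
        "id": ["PHQ9_1", "PHQ9_2", "PHQ9_3"] * 2,
        "answer": [0, 1, 0, 0, 1, 2],
    },
    index=times,
)

# Group by the index: one group per survey instance (shared timestamp).
per_survey = long_df.groupby(level=0)["answer"].agg(["count", "max"])

# Group by question id: one group per question across all instances.
per_question = long_df.groupby("id")["answer"].mean()

print(per_survey)
print(per_question)
```

Both summaries come straight from the long format, without pivoting first.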


## Sum of survey scores

Often, you want the sum of all answers within each survey instance.  This is straightforward, assuming that:
* All the survey questions have the same timestamp (one of our basic assumptions from above)
* All the question `id`s have the same prefix


`niimpy.survey.sum_survey_scores` takes a data frame with a `DatetimeIndex`, finds all answers whose question `id` matches the given `survey_prefix` (e.g. `"PHQ9"`), and sums those values after grouping by time.  Thus, if you give `PHQ9` as the prefix, all question IDs matching `PHQ9_*` taken at the same time are assumed to be part of the same survey.

If the input data has a `user` column, the scores are also grouped by user.

In [4]:
import niimpy.survey
niimpy.survey.sum_survey_scores(df, 'PHQ9')

Unnamed: 0,score
2021-07-19 20:51:32+03:00,4
2021-07-20 20:50:31+03:00,2
2021-07-21 20:49:07+03:00,11
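
To make the grouping logic concrete, here is a minimal pandas sketch of the steps described above: filter by prefix, group by timestamp (and by `user` when present), then sum. The helper `sum_scores_sketch` and the toy data are hypothetical and are not niimpy's actual implementation, which may differ in details.

```python
import pandas as pd

def sum_scores_sketch(df, prefix):
    """Hypothetical helper: sum answers per survey instance for one prefix."""
    # Keep only questions such as PHQ9_1, PHQ9_2, ... for prefix "PHQ9".
    selected = df[df["id"].str.startswith(prefix + "_")]
    if "user" in selected.columns:
        # Group by user first, then by the timestamp index.
        grouped = selected.groupby(["user", pd.Grouper(level=0)])
    else:
        grouped = selected.groupby(level=0)
    return grouped["answer"].sum().to_frame("score")

# Made-up data: one unrelated question (GAD7_1) is mixed in on purpose.
t1 = pd.Timestamp("2021-07-19 20:51:32", tz="Europe/Helsinki")
t2 = pd.Timestamp("2021-07-20 20:50:31", tz="Europe/Helsinki")
toy = pd.DataFrame(
    {
        "id": ["PHQ9_1", "PHQ9_2", "PHQ9_3", "GAD7_1",
               "PHQ9_1", "PHQ9_2", "PHQ9_3"],
        "answer": [0, 1, 2, 3, 1, 1, 0],
    },
    index=pd.DatetimeIndex([t1, t1, t1, t1, t2, t2, t2]),
)

scores = sum_scores_sketch(toy, "PHQ9")
print(scores)
```

The `GAD7_1` row is excluded by the prefix filter, so only the `PHQ9_*` answers contribute to each score.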
