# Example: Survey module functionalities

This notebook provides an example of how to utilise the survey module.

In [64]:
import sys
sys.path.append('../')

import numpy as np
import pandas as pd
import niimpy
from niimpy.survey import *

## Load data
We will load a mock survey data file.

In [65]:
# Load a mock dataframe
df = niimpy.read_csv('mock-survey.csv')
df.head()

Unnamed: 0,user,age,gender,Little interest or pleasure in doing things.,Feeling down; depressed or hopeless.,Feeling nervous; anxious or on edge.,Not being able to stop or control worrying.,In the last month; how often have you felt that you were unable to control the important things in your life?,In the last month; how often have you felt confident about your ability to handle your personal problems?,In the last month; how often have you felt that things were going your way?,In the last month; how often have you been able to control irritations in your life?,In the last month; how often have you felt that you were on top of things?,In the last month; how often have you been angered because of things that were outside of your control?,In the last month; how often have you felt difficulties were piling up so high that you could not overcome them?
0,1,20,Male,several-days,more-than-half-the-days,not-at-all,nearly-every-day,almost-never,sometimes,fairly-often,never,sometimes,very-often,fairly-often
1,2,32,Male,more-than-half-the-days,more-than-half-the-days,not-at-all,several-days,never,never,very-often,sometimes,never,fairly-often,never
2,3,15,Male,more-than-half-the-days,not-at-all,several-days,not-at-all,never,very-often,very-often,fairly-often,never,never,almost-never
3,4,35,Female,not-at-all,nearly-every-day,not-at-all,several-days,very-often,fairly-often,very-often,never,sometimes,never,fairly-often
4,5,23,Male,more-than-half-the-days,not-at-all,more-than-half-the-days,several-days,almost-never,very-often,almost-never,sometimes,sometimes,very-often,never


## Preprocessing 
The dataframe's columns are the raw survey questions. Some questions belong to a specific questionnaire, so we annotate them with ids. Each id is built from a prefix (the questionnaire category: GAD, PHQ, PSQI, etc.) followed by the question number (1, 2, 3), for example `PHQ2_1`. We will also convert the answers to meaningful numerical values.

**Note:** The dataframe must follow the schema shown below before it is passed to niimpy.

In [66]:
# Convert column name to id, based on provided mappers from niimpy
col_id = {**PHQ2_MAP, **PSQI_MAP, **PSS10_MAP, **PANAS_MAP, **GAD2_MAP}
selected_cols = [col for col in df.columns if col in col_id.keys()]

# Convert from wide to long format
m_df = pd.melt(df, id_vars=['user', 'age', 'gender'], value_vars=selected_cols, var_name='question', value_name='raw_answer')

# Assign questions to codes 
m_df['id'] = m_df['question'].replace(col_id)
m_df.head()

Unnamed: 0,user,age,gender,question,raw_answer,id
0,1,20,Male,Little interest or pleasure in doing things.,several-days,PHQ2_1
1,2,32,Male,Little interest or pleasure in doing things.,more-than-half-the-days,PHQ2_1
2,3,15,Male,Little interest or pleasure in doing things.,more-than-half-the-days,PHQ2_1
3,4,35,Female,Little interest or pleasure in doing things.,not-at-all,PHQ2_1
4,5,23,Male,Little interest or pleasure in doing things.,more-than-half-the-days,PHQ2_1
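
The wide-to-long step and the id assignment can be reproduced on a toy frame without niimpy. This is a minimal sketch: `toy_map` is a hypothetical stand-in for the mappers used above (such as `PHQ2_MAP`), whose actual contents may differ.

```python
import pandas as pd

# Hypothetical question-to-id mapper; niimpy's real maps may differ.
toy_map = {
    "Little interest or pleasure in doing things.": "PHQ2_1",
    "Feeling down; depressed or hopeless.": "PHQ2_2",
}

wide = pd.DataFrame({
    "user": [1, 2],
    "Little interest or pleasure in doing things.": ["several-days", "not-at-all"],
    "Feeling down; depressed or hopeless.": ["not-at-all", "nearly-every-day"],
})

# Wide to long: one row per (user, question) pair
long_df = wide.melt(id_vars=["user"], value_vars=list(toy_map),
                    var_name="question", value_name="raw_answer")

# Series.map performs an exact dictionary lookup on each question text
long_df["id"] = long_df["question"].map(toy_map)
print(long_df)
```

Using `Series.map` here rather than `replace` means any question missing from the mapper becomes `NaN` instead of silently keeping its original text, which makes unmapped columns easier to spot.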


In [67]:
# Transform raw answers to numerical values
m_df['answer'] = niimpy.survey.convert_to_numerical_answer(m_df, answer_col = 'raw_answer', encoded_column='', 
                                question_id = 'id', id_map=ID_MAP_PREFIX, use_prefix=True)
m_df.head()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_single_block(indexer, value, name)


Unnamed: 0,user,age,gender,question,raw_answer,id,answer
0,1,20,Male,Little interest or pleasure in doing things.,1,PHQ2_1,1
1,2,32,Male,Little interest or pleasure in doing things.,2,PHQ2_1,2
2,3,15,Male,Little interest or pleasure in doing things.,2,PHQ2_1,2
3,4,35,Female,Little interest or pleasure in doing things.,0,PHQ2_1,0
4,5,23,Male,Little interest or pleasure in doing things.,2,PHQ2_1,2
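
To illustrate the idea behind a prefix-based conversion, the sketch below looks up each raw answer in a scale chosen by the id prefix. The `scales` dictionary and the `to_numeric` helper are hypothetical and written only for this example; they are not niimpy's `ID_MAP_PREFIX` or its implementation of `convert_to_numerical_answer`.

```python
import pandas as pd

# Hypothetical answer scales keyed by questionnaire prefix (illustrative only).
scales = {
    "PHQ2": {"not-at-all": 0, "several-days": 1,
             "more-than-half-the-days": 2, "nearly-every-day": 3},
    "PSS10": {"never": 0, "almost-never": 1, "sometimes": 2,
              "fairly-often": 3, "very-often": 4},
}

toy = pd.DataFrame({
    "id": ["PHQ2_1", "PHQ2_2", "PSS10_1"],
    "raw_answer": ["several-days", "nearly-every-day", "fairly-often"],
})

def to_numeric(row):
    # Everything before the first '_' selects the scale for this question
    prefix = row["id"].split("_")[0]
    return scales[prefix][row["raw_answer"]]

# Write to a new column so the raw answers stay untouched
toy["answer"] = toy.apply(to_numeric, axis=1)
print(toy)
```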


We can also summarise the scores of each questionnaire.

In [125]:
def print_statistic(df, question_id = 'id', answer_col = 'answer', prefix=None, group=None):
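    '''
    Return survey statistics: the min, max, average and standard deviation of the answers.

    :param df: DataFrame.
        DataFrame containing the survey scores.
    :param question_id: string.
        Column containing the question ids.
    :param answer_col: string.
        Column containing the answers as numerical values.
    :param prefix: string or list.
        Survey prefix or list of prefixes. If None is given, search question_id for all possible categories.
    :param group: string.
        Optional column to group by (e.g. 'gender'). If None, statistics are computed over all rows.

    Return: dict
        A dictionary containing a summary of each questionnaire category.
        Example: {'PHQ9': {'min': 3, 'max': 8, 'avg': 4.5, 'std': 2}}
        If group is given, each value is a list of such dictionaries, one per group, with an extra 'group' key.
    '''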
    
    def calculate_statistic(df, prefix, answer_col, group=None):
        
        d = {}
        if group:
            assert isinstance(group, str),"group is not given in string format"
            agg_df = df.groupby(group).agg({answer_col: ['mean', 'min', 'max','std']}).reset_index()
            agg_df.columns = agg_df.columns.get_level_values(1)
            agg_df = agg_df.rename(columns={'': group}) # reassign group column 
            lst = []
            for index, row in agg_df.iterrows():
                temp = {'min': row['min'], 'max': row['max'], 
                        'avg': row['mean'], 'std': row['std'],
                       'group': row[group]}
                lst.append(temp)
            d[prefix] = lst
        else:
            d[prefix] = {'min': df[answer_col].min(), 'max': df[answer_col].max(), 
                         'avg': df[answer_col].mean(), 'std': df[answer_col].std()}
        return d
    
    res = {}
    
    # Collect questions with the given prefix. Otherwise, collect all prefixes, assuming that
    # the question id follows this format: {prefix}_id.
    if prefix:
        if isinstance(prefix, str):
            temp = df[df[question_id].str.startswith(prefix)]
            return calculate_statistic(temp, prefix, answer_col, group)
        elif isinstance(prefix, list):
            for pr in prefix:
                temp = df[df[question_id].str.startswith(pr)]
                # Pass the current prefix (pr), not the whole list, so it can be used as a dict key
                d = calculate_statistic(temp, pr, answer_col, group)
                res.update(d)
        else:
            raise ValueError('prefix should be either list or string')

    else:
        # Search for all possible prefixes (extract everything before the '_' delimiter)
        prefix_lst = list(set(df[question_id].str.split('_').str[0]))
        for pr in prefix_lst:
            temp = df[df[question_id].str.startswith(pr)]
            d = calculate_statistic(temp, pr, answer_col, group)
            res.update(d)
    return res

print_statistic(m_df, group='gender')

{'PSS10': [{'min': 0,
   'max': 4,
   'avg': 2.0084375909223158,
   'std': 1.4163476071211085,
   'group': 'Female'},
  {'min': 0,
   'max': 4,
   'avg': 1.9935447656469267,
   'std': 1.4148934611296793,
   'group': 'Male'}],
 'GAD2': [{'min': 0,
   'max': 3,
   'avg': 1.5437881873727088,
   'std': 1.1113431955111543,
   'group': 'Female'},
  {'min': 0,
   'max': 3,
   'avg': 1.49901768172888,
   'std': 1.0981789779146465,
   'group': 'Male'}],
 'PHQ2': [{'min': 0,
   'max': 3,
   'avg': 1.5336048879837068,
   'std': 1.1415563524893566,
   'group': 'Female'},
  {'min': 0,
   'max': 3,
   'avg': 1.5186640471512771,
   'std': 1.1015968489446764,
   'group': 'Male'}]}
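
The same kind of per-questionnaire, per-group summary can also be sketched with a single pandas `groupby`. This is an alternative formulation on toy data, assuming ids follow the `{prefix}_id` format; it is not part of niimpy.

```python
import pandas as pd

toy = pd.DataFrame({
    "gender": ["Female", "Female", "Male", "Male"],
    "id": ["PHQ2_1", "PHQ2_2", "PHQ2_1", "GAD2_1"],
    "answer": [1, 3, 2, 0],
})

# Everything before the first '_' is treated as the questionnaire prefix
toy["prefix"] = toy["id"].str.split("_").str[0]

# One row per (prefix, gender) pair, one column per statistic
summary = (toy.groupby(["prefix", "gender"])["answer"]
              .agg(["min", "max", "mean", "std"]))
print(summary)
```

The result is a DataFrame indexed by prefix and group rather than a nested dictionary, which is often easier to filter or plot. Groups with a single answer get `NaN` for `std`.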

## Visualization

We can now make some plots for the preprocessed data frame.