# DEMO: Exploratory Data Analysis (EDA) module functionalities

This notebook demonstrates how to use the EDA module.
Topics covered:
- categorical data
- countplot
- lineplot
- missingness
- punchcard

In [2]:
import sys
sys.path.append('../')

import numpy as np
import pandas as pd
import niimpy

from niimpy.survey import *
from niimpy.EDA import EDA_categorical, EDA_countplot, EDA_lineplot, EDA_missingness, EDA_punchcard


ModuleNotFoundError: No module named 'niimpy.survey'

## 1) Categorical plot

## Load data
We will load a mock survey data file.

In [None]:
# Load a mock dataframe
df = niimpy.read_csv('../../niimpy/sampledata/mock-survey.csv',tz='Europe/Helsinki')
df.head()


In [None]:
df.describe()

## Preprocessing 
The dataframe's columns are raw questions from a survey. Some questions belong to a specific category, so we will annotate them with ids. Each id is constructed from a prefix (the questionnaire category: GAD, PHQ, PSQI, etc.) followed by the question number (1, 2, 3). Similarly, we will also convert the answers to meaningful numerical values.

**Note:** It's important that the dataframe follows the schema below before it is passed to niimpy.

In [None]:
# Convert column name to id, based on provided mappers from niimpy
col_id = {**PHQ2_MAP, **PSQI_MAP, **PSS10_MAP, **PANAS_MAP, **GAD2_MAP}
selected_cols = [col for col in df.columns if col in col_id.keys()]

# Convert data frame to long format
m_df = pd.melt(df, id_vars=['user', 'age', 'gender'], value_vars=selected_cols, var_name='question', value_name='answer')

# Assign questions to codes 
m_df['id'] = m_df['question'].replace(col_id)
m_df.head()
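
To make the target schema concrete, here is a minimal sketch of the same wide-to-long reshaping on a toy dataframe. The question texts and the `toy_map` mapper are made up for illustration and are not niimpy's actual mappers; only `pd.melt` and `Series.map` are real pandas calls.

```python
import pandas as pd

# Toy wide-format survey: one row per user, one column per raw question.
# The question texts and the mapper below are made up for illustration.
wide = pd.DataFrame({
    'user': ['u1', 'u2'],
    'gender': ['F', 'M'],
    'Little interest or pleasure in doing things': ['not at all', 'several days'],
    'Feeling down, depressed, or hopeless': ['several days', 'nearly every day'],
})

# Hypothetical question-text -> id mapper (prefix + question number)
toy_map = {
    'Little interest or pleasure in doing things': 'PHQ2_1',
    'Feeling down, depressed, or hopeless': 'PHQ2_2',
}

# Melt to long format: one row per (user, question) pair
long_df = pd.melt(wide, id_vars=['user', 'gender'],
                  value_vars=list(toy_map),
                  var_name='question', value_name='answer')

# Attach the short id to each question
long_df['id'] = long_df['question'].map(toy_map)
print(long_df[['user', 'id', 'answer']])
```

The resulting frame has one row per user and question, with `question`, `answer` and `id` columns next to the user attributes.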

We can use a helper method to convert the answers into numerical values. The pre-defined mapper inside survey.py is useful for this step.

In [None]:
# Transform raw answers to numerical values
m_df['answer'] = niimpy.survey.convert_to_numerical_answer(m_df, answer_col = 'answer',
                                question_id = 'id', id_map=ID_MAP_PREFIX, use_prefix=True)
m_df.head()
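
The idea behind a prefix-based conversion can be sketched with plain pandas. The sketch below assumes a hypothetical `toy_answer_map` keyed by questionnaire prefix; the scales are illustrative, are not niimpy's `ID_MAP_PREFIX`, and this is not necessarily how `convert_to_numerical_answer` is implemented.

```python
import pandas as pd

# Hypothetical answer mappers, keyed by questionnaire prefix.
# These scales are illustrative and are not niimpy's actual ID_MAP_PREFIX.
toy_answer_map = {
    'PHQ2': {'not at all': 0, 'several days': 1,
             'more than half the days': 2, 'nearly every day': 3},
    'PSS10': {'never': 0, 'almost never': 1, 'sometimes': 2,
              'fairly often': 3, 'very often': 4},
}

toy = pd.DataFrame({
    'id': ['PHQ2_1', 'PHQ2_2', 'PSS10_9'],
    'answer': ['several days', 'not at all', 'fairly often'],
})

def to_number(row):
    # The prefix is everything before the last underscore, e.g. 'PHQ2_1' -> 'PHQ2'
    prefix = row['id'].rsplit('_', 1)[0]
    return toy_answer_map[prefix][row['answer']]

# Look up each raw answer in the mapper that matches its question prefix
toy['score'] = toy.apply(to_number, axis=1)
print(toy)
```

Keying the mapper by prefix means every question in a questionnaire shares one answer scale, so new questions only need an id with the right prefix.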

We can also produce a summary of each questionnaire's score. This function can describe the aggregated score over the whole population or over specific subgroups.

In [None]:
d = niimpy.survey.print_statistic(m_df)
pd.DataFrame(d)

In [None]:
d = niimpy.survey.print_statistic(m_df, group='gender')
pd.DataFrame(d)

In [None]:
d = niimpy.survey.print_statistic(m_df, group='gender', prefix='PHQ')
pd.DataFrame(d)
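
For intuition, a grouped score summary of this kind can be approximated with a plain pandas `groupby`. This is a sketch on made-up data and assumes the summary is built from per-user total scores; the statistics that `print_statistic` actually reports may differ.

```python
import pandas as pd

# Made-up long-format answers, already converted to numbers
toy = pd.DataFrame({
    'user':   ['u1', 'u1', 'u2', 'u2', 'u3', 'u3'],
    'gender': ['F',  'F',  'M',  'M',  'F',  'F'],
    'id':     ['PHQ2_1', 'PHQ2_2'] * 3,
    'answer': [1, 2, 0, 1, 3, 3],
})

# Total score per user for questionnaires whose id starts with the prefix
prefix = 'PHQ'
totals = (toy[toy['id'].str.startswith(prefix)]
          .groupby(['user', 'gender'], as_index=False)['answer']
          .sum())

# Describe the totals for the whole population and per subgroup
overall = totals['answer'].agg(['min', 'max', 'mean'])
by_gender = totals.groupby('gender')['answer'].agg(['min', 'max', 'mean'])
print(overall)
print(by_gender)
```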

## TODO: Add a method to categorize score into levels
## TODO: Extend to demographics info

## Visualization

We can now make some plots for the preprocessed data frame. First, we can display the summary for a specific question.

In [None]:
fig = niimpy.EDA.EDA_categorical.questionnaire_summary(m_df, question = 'PHQ2_1', column = 'answer', 
                                                       title='PHQ2_1', xlabel='value', ylabel='count', 
                                                       width=600, height=400)
fig.show()

We can also display the summary for each subgroup.

In [None]:
fig = niimpy.EDA.EDA_categorical.questionnaire_grouped_summary(m_df, question='PSS10_9', group='gender', 
                                                               title='PSS10_9',
                                                               xlabel='score', ylabel='count',
                                                               width=800, height=400)
fig.show()

With some quick preprocessing, we can display the score distribution of each questionnaire.

In [None]:
pss_sum_df = m_df[m_df['id'].str.startswith('PSS')] \
                            .groupby(['user', 'gender']) \
                            .agg({'answer': 'sum'}) \
                            .reset_index()
pss_sum_df['id'] = 'PSS'
fig = niimpy.EDA.EDA_categorical.questionnaire_grouped_summary(pss_sum_df, question='PSS', group='gender', 
                                                               title='PSS10',
                                                               xlabel='score', ylabel='count',
                                                               width=800, height=400)
fig.show()

## 2) Countplot

In [None]:
# Load the mock screen events (this CSV holds AwareScreen data, not battery data)
df_screen = niimpy.read_csv('../../niimpy/sampledata/multiuser_AwareScreen.csv', tz='Europe/Helsinki')
df_screen.head()

In [None]:
import sqlite3

# List the tables available in the sample SQLite database
con = sqlite3.connect('../../niimpy/sampledata/multiuser.sqlite3')
cursor = con.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table';")
print(cursor.fetchall())
con.close()
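
The same table listing can be tried without the sample file. The sketch below uses an in-memory database with a single made-up table; the table name mirrors the one read in the next cell, but its columns are assumptions chosen for illustration.

```python
import sqlite3
from contextlib import closing

# In-memory database with one made-up table, standing in for the sample file
with closing(sqlite3.connect(':memory:')) as con:
    con.execute('CREATE TABLE AwareBattery (time REAL, user TEXT, battery_level INTEGER)')
    con.execute("INSERT INTO AwareBattery VALUES (1.0, 'u1', 80)")
    # sqlite_master is SQLite's built-in catalogue of tables, indexes and views
    rows = con.execute("SELECT name FROM sqlite_master WHERE type='table';").fetchall()
    tables = [name for (name,) in rows]

print(tables)
```

Wrapping the connection in `closing` ensures it is released even if a query fails.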

In [None]:
df_battery = niimpy.read_sqlite('../../niimpy/sampledata/multiuser.sqlite3','AwareBattery',tz='Europe/Helsinki')
df_battery.head()

In [None]:
EDA_countplot.EDA_countplot(df_battery, 
              fig_title = 'Battery event counts by user', 
              plot_type = 'count', 
              points = 'all',
              aggregation = 'user', 
              user = None, 
              column='battery_level')